kbt-devops faaadc1297 Add transmission datasets and update mining data
Add two new static datasets for cross-region arbitrage calculations:
- transmission_capacity: region-to-region capacity limits (20 rows)
- transmission_cost: transmission costs per path (20 rows)
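The two datasets together support a simple per-path arbitrage check; a minimal sketch, assuming the obvious semantics of the capacity and cost columns (function and argument names are illustrative, not the project's actual API):

```python
# Illustrative sketch of a cross-region arbitrage calculation using
# transmission_capacity and transmission_cost. Names are hypothetical.
def arbitrage(price_src, price_dst, cost_per_mwh, capacity_mw):
    """Margin per MWh after transmission cost, and the flow that captures it."""
    margin = price_dst - price_src - cost_per_mwh
    flow_mw = capacity_mw if margin > 0 else 0.0
    return margin, flow_mw
```

For example, buying at 40 EUR/MWh in one region and selling at 55 EUR/MWh in another over a path costing 5 EUR/MWh leaves a 10 EUR/MWh margin on the full path capacity.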

Update mining dataset with EUR pricing and power metrics:
- Change btc_price_usd to btc_price_eur
- Add power_efficiency_th_per_mw, power_demand_mw
- Add revenue_eur_per_mwh, profit_eur_per_mwh
- Remove mining_profitability column

Changes include:
- scripts/02_fetch_historical.py: rewrite fetch_bitcoin_mining_data()
- scripts/01_generate_synthetic.py: add transmission data generators
- config/data_config.yaml: add transmission config, update bitcoin config
- config/schema.yaml: add 2 new schemas, update bitcoin_mining schema
- scripts/03_process_merge.py: add 2 new datasets
- scripts/04_validate.py: add 2 new datasets
- test/test_data.py: update for new datasets and bitcoin price reference

Total datasets: 9 (734,491 rows, 17.89 MB)

Energy Test Data

Preparation of test data for an energy trading strategy demo.

Overview

This project generates and processes realistic test data for energy trading strategies, including:

  • Electricity Prices: Day-ahead and real-time market prices for European regions (FR, BE, DE, NL, UK)
  • Battery Capacity: Storage system states with charge/discharge cycles
  • Renewable Generation: Solar, wind, and hydro generation with forecast errors
  • Conventional Generation: Gas, coal, and nuclear plant outputs
  • Load Profiles: Regional electricity demand with weather correlations
  • Data Centers: Power demand profiles including mining client
  • Mining: Hashrate, EUR price, and power/revenue metrics (from mempool.space)
  • Transmission: Region-to-region capacity limits and per-path costs for cross-region arbitrage

Project Structure

energy-test-data/
├── data/
│   ├── processed/              # Final Parquet files (<200MB total)
│   ├── raw/                    # Unprocessed source data
│   └── metadata/               # Data documentation and reports
├── scripts/
│   ├── 01_generate_synthetic.py    # Generate synthetic data
│   ├── 02_fetch_historical.py      # Fetch historical data
│   ├── 03_process_merge.py         # Process and compress
│   └── 04_validate.py              # Validate and report
├── config/
│   ├── data_config.yaml            # Configuration parameters
│   └── schema.yaml                 # Data schema definitions
├── requirements.txt
└── README.md

Installation

pip install -r requirements.txt

Usage

Generate all test data

Run scripts in sequence:

python scripts/01_generate_synthetic.py
python scripts/02_fetch_historical.py
python scripts/03_process_merge.py
python scripts/04_validate.py

Or run all at once:

python scripts/01_generate_synthetic.py && \
python scripts/02_fetch_historical.py && \
python scripts/03_process_merge.py && \
python scripts/04_validate.py

Individual scripts

01_generate_synthetic.py: Creates synthetic data for battery systems, renewable generation, conventional generation, and data centers.

02_fetch_historical.py: Fetches electricity prices, mining data, and load profiles from public APIs (or generates realistic synthetic data when APIs are unavailable).

03_process_merge.py: Merges datasets, optimizes memory usage, and saves to compressed Parquet format.
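The memory optimization typically means downcasting numeric columns before writing Parquet; a minimal sketch of that idea (the function name is illustrative, not the script's actual API):

```python
import pandas as pd

# Hedged sketch of dtype downcasting, the usual way to shrink a DataFrame
# before a compressed Parquet write. Not the script's actual implementation.
def downcast(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.select_dtypes(include="float").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out
```

float64 columns become float32 and small integer columns become int8/int16, which compounds with Snappy compression in the final files.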

04_validate.py: Validates data quality, checks for missing values and outliers, and generates validation reports.

Configuration

Edit config/data_config.yaml to customize:

  • Time range: Start/end dates and granularity
  • Regions: Market regions to include
  • Data sources: Synthetic vs historical for each dataset
  • Data quality parameters: Noise levels, outlier rates, missing value rates
  • Battery parameters: Capacity ranges, efficiency, degradation
  • Generation parameters: Plant capacities, marginal costs
  • Mining parameters: Hashrate ranges, mining efficiency
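For orientation, a sketch of what data_config.yaml might look like; every key name below is an illustrative assumption, not the file's actual schema:

```yaml
# Illustrative sketch only; see config/data_config.yaml for the real keys.
time:
  start: "2026-02-01"
  end: "2026-02-11"
  granularity: "1min"
regions: [FR, BE, DE, NL, UK]
sources:
  electricity_prices: hybrid
  mining: hybrid
battery:
  round_trip_efficiency: 0.90
```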

Data Specifications

Dataset                   Time Range   Rows (10d × 1min)   Est. Size
electricity_prices        10 days      72,000              ~40MB
battery_capacity          10 days      144,000             ~20MB
renewable_generation      10 days      216,000             ~35MB
conventional_generation   10 days      144,000             ~25MB
load_profiles             10 days      72,000              ~30MB
data_centers              10 days      72,000              ~15MB
mining                    10 days      14,400              ~20MB
Total                                  734,400             ~185MB
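The row counts follow directly from the one-minute granularity; a quick sanity check (the number of parallel series per dataset is inferred from the table, e.g. 5 regions for prices, not read from the scripts):

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 one-minute intervals per day

def expected_rows(days: int, series: int) -> int:
    """Rows for `series` parallel minute-resolution time series over `days` days."""
    return days * MINUTES_PER_DAY * series

# 5 price regions over 10 days -> 72,000 rows, matching the table;
# 1 mining series over 10 days -> 14,400 rows.
```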

Output Format

All processed datasets are saved as Parquet files with Snappy compression in data/processed/.

To read a dataset:

import pandas as pd

df = pd.read_parquet('data/processed/electricity_prices.parquet')
print(df.head())

Data Sources

  • Electricity Prices: Hybrid (synthetic patterns based on EPEX Spot market characteristics)
  • Mining: Hybrid (mempool.space API + synthetic patterns)
  • Load Profiles: Hybrid (ENTSO-E transparency platform patterns + synthetic)

Validation Reports

After processing, validation reports are generated in data/metadata/:

  • validation_report.json: Data quality checks, missing values, range violations
  • final_metadata.json: Dataset sizes, row counts, processing details
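A report like this can be scanned programmatically; a minimal sketch, assuming a structure of named checks with a pass flag (the "checks"/"passed" field names are assumptions, not the actual validation_report.json schema):

```python
# Hedged sketch: the report layout assumed here ("checks" -> {"passed": bool})
# is illustrative, not the real schema of validation_report.json.
def failed_checks(report):
    return [name for name, result in report.get("checks", {}).items()
            if not result.get("passed", False)]
```

In use, pair it with json.load() on data/metadata/validation_report.json and fail the pipeline when the returned list is non-empty.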