# Energy Test Data
Preparation of test data for energy trading strategy demo.
## Overview
This project generates and processes realistic test data for energy trading strategies, including:
- Electricity Prices: Day-ahead and real-time market prices for European regions (FR, BE, DE, NL, UK)
- Battery Capacity: Storage system states with charge/discharge cycles
- Renewable Generation: Solar, wind, and hydro generation with forecast errors
- Conventional Generation: Gas, coal, and nuclear plant outputs
- Load Profiles: Regional electricity demand with weather correlations
- Data Centers: Power demand profiles including Bitcoin mining client
- Bitcoin Mining: Hashrate, price, and profitability data (from mempool.space)
## Project Structure

```
energy-test-data/
├── data/
│   ├── processed/                 # Final Parquet files (<200MB total)
│   ├── raw/                       # Unprocessed source data
│   └── metadata/                  # Data documentation and reports
├── scripts/
│   ├── 01_generate_synthetic.py   # Generate synthetic data
│   ├── 02_fetch_historical.py     # Fetch historical data
│   ├── 03_process_merge.py        # Process and compress
│   └── 04_validate.py             # Validate and report
├── config/
│   ├── data_config.yaml           # Configuration parameters
│   └── schema.yaml                # Data schema definitions
├── requirements.txt
└── README.md
```
## Installation

```bash
pip install -r requirements.txt
```
## Usage

### Generate all test data

Run the scripts in sequence:

```bash
python scripts/01_generate_synthetic.py
python scripts/02_fetch_historical.py
python scripts/03_process_merge.py
python scripts/04_validate.py
```

Or chain them so each step only runs if the previous one succeeds:

```bash
python scripts/01_generate_synthetic.py && \
python scripts/02_fetch_historical.py && \
python scripts/03_process_merge.py && \
python scripts/04_validate.py
```
### Individual scripts

- `01_generate_synthetic.py`: Creates synthetic data for battery systems, renewable generation, conventional generation, and data centers.
- `02_fetch_historical.py`: Fetches electricity prices, Bitcoin mining data, and load profiles from public APIs (or generates realistic synthetic data when APIs are unavailable).
- `03_process_merge.py`: Merges datasets, optimizes memory usage, and saves to compressed Parquet format.
- `04_validate.py`: Validates data quality, checks for missing values and outliers, and generates validation reports.
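To illustrate what `01_generate_synthetic.py` might do for the price series, here is a minimal sketch of one synthetic day-ahead price signal: a baseline plus a daily sinusoid plus Gaussian noise. The function name, baseline of 60 EUR/MWh, swing amplitude, and noise level are illustrative assumptions, not the project's actual parameters.

```python
import numpy as np
import pandas as pd

def generate_day_ahead_prices(start="2024-01-01", days=10, freq_min=1, seed=42):
    """Sketch: a day-ahead price series at 1-minute granularity.

    Baseline, daily swing, and noise scale are illustrative values only.
    """
    rng = np.random.default_rng(seed)
    idx = pd.date_range(start, periods=days * 24 * 60 // freq_min,
                        freq=f"{freq_min}min")
    hours = idx.hour + idx.minute / 60.0
    base = 60.0                                           # EUR/MWh baseline
    daily = 25.0 * np.sin((hours - 6) / 24 * 2 * np.pi)   # intraday swing
    noise = rng.normal(0.0, 5.0, len(idx))                # random jitter
    return pd.DataFrame({"price_eur_mwh": base + daily + noise}, index=idx)

df = generate_day_ahead_prices()
print(df.head())
```

With the defaults above this yields 14,400 rows (10 days at 1-minute resolution) for a single region; the project's `electricity_prices` dataset multiplies this by its five regions.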
## Configuration

Edit `config/data_config.yaml` to customize:

- Time range: Start/end dates and granularity
- Regions: Market regions to include
- Data sources: Synthetic vs. historical for each dataset
- Noise parameters: Noise levels, outlier rates, missing-value rates
- Battery parameters: Capacity ranges, efficiency, degradation
- Generation parameters: Plant capacities, marginal costs
- Bitcoin parameters: Hashrate ranges, mining efficiency
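A hypothetical excerpt of `config/data_config.yaml` is shown below to give a feel for the shape of these settings; the key names and values are illustrative assumptions, so check the shipped file for the actual schema.

```yaml
# Illustrative excerpt -- key names are assumptions, not the real schema.
time:
  start: "2024-01-01"
  end: "2024-01-10"
  granularity_minutes: 1
regions: [FR, BE, DE, NL, UK]
sources:
  electricity_prices: hybrid
  battery_capacity: synthetic
noise:
  level: 0.05
  outlier_rate: 0.001
  missing_rate: 0.002
battery:
  capacity_mwh: [10, 100]
  round_trip_efficiency: 0.90
```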
## Data Specifications

| Dataset | Time Range | Rows (10d × 1min) | Est. Size |
|---|---|---|---|
| electricity_prices | 10 days | 72,000 | ~40MB |
| battery_capacity | 10 days | 144,000 | ~20MB |
| renewable_generation | 10 days | 216,000 | ~35MB |
| conventional_generation | 10 days | 144,000 | ~25MB |
| load_profiles | 10 days | 72,000 | ~30MB |
| data_centers | 10 days | 72,000 | ~15MB |
| bitcoin_mining | 10 days | 14,400 | ~20MB |
| **Total** | 10 days | 734,400 | ~185MB |
## Output Format

All processed datasets are saved as Parquet files with Snappy compression in `data/processed/`.

To read a dataset:

```python
import pandas as pd

df = pd.read_parquet("data/processed/electricity_prices.parquet")
print(df.head())
```
## Data Sources
- Electricity Prices: Hybrid (synthetic patterns based on EPEX Spot market characteristics)
- Bitcoin Mining: Hybrid (mempool.space API + synthetic patterns)
- Load Profiles: Hybrid (ENTSO-E transparency platform patterns + synthetic)
## Validation Reports

After processing, validation reports are generated in `data/metadata/`:

- `validation_report.json`: Data quality checks, missing values, range violations
- `final_metadata.json`: Dataset sizes, row counts, processing details
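The checks behind `validation_report.json` could be sketched as follows: count missing values per column and count values outside an expected range. The function name, report structure, and the price range used here are illustrative assumptions, not the actual report format.

```python
import numpy as np
import pandas as pd

def validate(df, ranges):
    """Sketch of per-column quality checks.

    `ranges` maps column name -> (low, high) expected bounds.
    The report layout is illustrative, not the project's real format.
    """
    report = {"rows": len(df), "missing": {}, "range_violations": {}}
    for col, (lo, hi) in ranges.items():
        s = df[col]
        report["missing"][col] = int(s.isna().sum())
        # Count non-missing values that fall outside the expected bounds.
        report["range_violations"][col] = int((~s.between(lo, hi) & s.notna()).sum())
    return report

demo = pd.DataFrame({"price_eur_mwh": [50.0, np.nan, 900.0, 61.0]})
report = validate(demo, {"price_eur_mwh": (-500.0, 500.0)})
print(report)
```

Here the NaN is reported as missing rather than as a range violation, so the two counts stay independent, which keeps the report easy to aggregate across datasets.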