# Energy Test Data

Preparation of test data for an energy trading strategy demo.
## Overview
This project generates and processes realistic test data for energy trading strategies, including:
- Electricity Prices: Day-ahead and real-time market prices for European regions (FR, BE, DE, NL, UK)
- Battery Capacity: Storage system states with charge/discharge cycles
- Renewable Generation: Solar, wind, and hydro generation with forecast errors
- Conventional Generation: Gas, coal, and nuclear plant outputs
- Load Profiles: Regional electricity demand with weather correlations
- Data Centers: Power demand profiles including Bitcoin mining client
- Bitcoin Mining: Hashrate, price, and profitability data (from mempool.space)
## Project Structure
```
energy-test-data/
├── data/
│   ├── processed/   # Final Parquet files (<200MB total)
│   ├── raw/         # Unprocessed source data
│   └── metadata/    # Data documentation and reports
├── scripts/
│   ├── 01_generate_synthetic.py   # Generate synthetic data
│   ├── 02_fetch_historical.py     # Fetch historical data
│   ├── 03_process_merge.py        # Process and compress
│   └── 04_validate.py             # Validate and report
├── config/
│   ├── data_config.yaml   # Configuration parameters
│   └── schema.yaml        # Data schema definitions
├── requirements.txt
└── README.md
```
## Installation

```bash
pip install -r requirements.txt
```
## Usage
### Generate all test data

Run the scripts in sequence:
```bash
python scripts/01_generate_synthetic.py
python scripts/02_fetch_historical.py
python scripts/03_process_merge.py
python scripts/04_validate.py
```
Or run all at once:
```bash
python scripts/01_generate_synthetic.py && \
python scripts/02_fetch_historical.py && \
python scripts/03_process_merge.py && \
python scripts/04_validate.py
```
### Individual scripts
- `01_generate_synthetic.py`: Creates synthetic data for battery systems, renewable generation, conventional generation, and data centers.
- `02_fetch_historical.py`: Fetches electricity prices, Bitcoin mining data, and load profiles from public APIs (or generates realistic synthetic data when APIs are unavailable).
- `03_process_merge.py`: Merges datasets, optimizes memory usage, and saves to compressed Parquet format.
- `04_validate.py`: Validates data quality, checks for missing values and outliers, and generates validation reports.
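As an illustration of the kind of series `01_generate_synthetic.py` produces, here is a minimal sketch of a noisy day-ahead price curve. It is not the script's actual code: the sinusoidal daily shape, the parameter values, and the column names are all assumptions.

```python
import numpy as np
import pandas as pd

def synthetic_day_ahead_prices(days=10, base=80.0, amplitude=25.0,
                               noise_std=5.0, seed=42):
    """Toy day-ahead price series: a daily sinusoidal shape plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    idx = pd.date_range("2024-01-01", periods=days * 24 * 60, freq="1min")
    hour = idx.hour + idx.minute / 60.0
    # Peaks around midday; real price curves are shaped by load and generation
    daily_shape = amplitude * np.sin((hour - 6) / 24 * 2 * np.pi)
    prices = base + daily_shape + rng.normal(0.0, noise_std, len(idx))
    return pd.DataFrame({"timestamp": idx, "price_eur_mwh": prices})

df = synthetic_day_ahead_prices()
print(len(df))  # 14400 rows: 10 days at 1-minute resolution
```

The real scripts additionally inject outliers and missing values at the rates set in `config/data_config.yaml`.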
## Configuration
Edit `config/data_config.yaml` to customize:
- Time range: Start/end dates and granularity
- Regions: Market regions to include
- Data sources: Synthetic vs. historical for each dataset
- Synthetic-data parameters: Noise levels, outlier rates, missing value rates
- Battery parameters: Capacity ranges, efficiency, degradation
- Conventional generation parameters: Plant capacities, marginal costs
- Bitcoin parameters: Hashrate ranges, mining efficiency
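A configuration covering the options above might look like the following. The field names and values are illustrative assumptions, not the file's actual schema:

```yaml
time:
  start: "2024-01-01"
  end: "2024-01-11"
  granularity: "1min"
regions: [FR, BE, DE, NL, UK]
sources:
  electricity_prices: hybrid
  battery_capacity: synthetic
synthetic:
  noise_level: 0.05
  outlier_rate: 0.001
  missing_rate: 0.002
battery:
  capacity_mwh: [10, 100]
  round_trip_efficiency: 0.88
```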
## Data Specifications
| Dataset | Time Range | Rows (10d × 1min) | Est. Size |
|---|---|---|---|
| electricity_prices | 10 days | 72,000 | ~40MB |
| battery_capacity | 10 days | 144,000 | ~20MB |
| renewable_generation | 10 days | 216,000 | ~35MB |
| conventional_generation | 10 days | 144,000 | ~25MB |
| load_profiles | 10 days | 72,000 | ~30MB |
| data_centers | 10 days | 72,000 | ~15MB |
| bitcoin_mining | 10 days | 14,400 | ~20MB |
| Total | 10 days | 734,400 | ~185MB |
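The row counts are consistent with one sample per minute: every count in the table is a multiple of the 14,400 one-minute steps in 10 days, and they sum to the total. A quick arithmetic check:

```python
# Row counts from the table above
rows = {
    "electricity_prices": 72_000,
    "battery_capacity": 144_000,
    "renewable_generation": 216_000,
    "conventional_generation": 144_000,
    "load_profiles": 72_000,
    "data_centers": 72_000,
    "bitcoin_mining": 14_400,
}
steps = 10 * 24 * 60  # 14,400 one-minute steps in 10 days
assert all(n % steps == 0 for n in rows.values())
total = sum(rows.values())
print(total)  # 734400
```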
## Output Format
All processed datasets are saved as Parquet files with Snappy compression in `data/processed/`.
To read a dataset:

```python
import pandas as pd

df = pd.read_parquet('data/processed/electricity_prices.parquet')
print(df.head())
```
## Data Sources
- Electricity Prices: Hybrid (synthetic patterns based on EPEX Spot market characteristics)
- Bitcoin Mining: Hybrid (mempool.space API + synthetic patterns)
- Load Profiles: Hybrid (ENTSO-E transparency platform patterns + synthetic)
## Validation Reports
After processing, validation reports are generated in `data/metadata/`:
- `validation_report.json`: Data quality checks, missing values, range violations
- `final_metadata.json`: Dataset sizes, row counts, processing details
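The reports are plain JSON, so they can be inspected programmatically. A small sketch (the guard is needed because the file only exists after `04_validate.py` has run; the report's internal key names are not assumed here):

```python
import json
from pathlib import Path

def load_report(path):
    """Return the parsed JSON report, or None if it has not been generated yet."""
    path = Path(path)
    if not path.exists():
        return None
    return json.loads(path.read_text())

report = load_report("data/metadata/validation_report.json")
if report is None:
    print("Run scripts/04_validate.py first to generate the report.")
```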