Add complete test data preparation system for energy trading strategy demo.

Includes configuration, data generation scripts, and validation tools for 7 datasets covering electricity prices, battery capacity, renewable/conventional generation, load profiles, data centers, and mining data.

Excluded from git:
- Actual Parquet data files (data/raw/, data/processed/), which can be regenerated using the provided scripts

Datasets:
- electricity_prices: Day-ahead and real-time prices (5 regions)
- battery_capacity: Storage system charge/discharge cycles
- renewable_generation: Solar, wind, hydro with forecast errors
- conventional_generation: Gas, coal, nuclear plant outputs
- load_profiles: Regional demand with weather correlations
- data_centers: Power demand profiles including mining operations
- mining_data: Hashrate, price, profitability (mempool.space API)

# Energy Test Data

Preparation of test data for an energy trading strategy demo.

## Overview

This project generates and processes realistic test data for energy trading strategies, including:

- **Electricity Prices**: Day-ahead and real-time market prices for European regions (FR, BE, DE, NL, UK)
- **Battery Capacity**: Storage system states with charge/discharge cycles
- **Renewable Generation**: Solar, wind, and hydro generation with forecast errors
- **Conventional Generation**: Gas, coal, and nuclear plant outputs
- **Load Profiles**: Regional electricity demand with weather correlations
- **Data Centers**: Power demand profiles, including a Bitcoin mining client
- **Bitcoin Mining**: Hashrate, price, and profitability data (from mempool.space)

## Project Structure

```
energy-test-data/
├── data/
│   ├── processed/                 # Final Parquet files (<200MB total)
│   ├── raw/                       # Unprocessed source data
│   └── metadata/                  # Data documentation and reports
├── scripts/
│   ├── 01_generate_synthetic.py   # Generate synthetic data
│   ├── 02_fetch_historical.py     # Fetch historical data
│   ├── 03_process_merge.py        # Process and compress
│   └── 04_validate.py             # Validate and report
├── config/
│   ├── data_config.yaml           # Configuration parameters
│   └── schema.yaml                # Data schema definitions
├── requirements.txt
└── README.md
```

## Installation

```bash
pip install -r requirements.txt
```

## Usage

### Generate all test data

Run the scripts in sequence:

```bash
python scripts/01_generate_synthetic.py
python scripts/02_fetch_historical.py
python scripts/03_process_merge.py
python scripts/04_validate.py
```

Or chain them in a single command, so that each script runs only if the previous one succeeded:

```bash
python scripts/01_generate_synthetic.py && \
python scripts/02_fetch_historical.py && \
python scripts/03_process_merge.py && \
python scripts/04_validate.py
```

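The same sequence can also be driven from Python, which is convenient on shells without `&&`. The following is a minimal sketch and not part of the repository: `SCRIPTS` and `run_pipeline` are hypothetical names, and it assumes the four scripts take no command-line arguments, as in the commands above.

```python
import subprocess
import sys

# Script order taken from the usage section above.
SCRIPTS = [
    "scripts/01_generate_synthetic.py",
    "scripts/02_fetch_historical.py",
    "scripts/03_process_merge.py",
    "scripts/04_validate.py",
]


def run_pipeline(scripts=SCRIPTS, cwd=None) -> list[str]:
    """Run each script with the current interpreter, stopping at the first failure.

    Returns the list of scripts that completed. A non-zero exit code raises
    subprocess.CalledProcessError, so later scripts are never started.
    """
    completed = []
    for script in scripts:
        print(f"Running {script} ...")
        subprocess.run([sys.executable, script], check=True, cwd=cwd)
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the project root mirrors the `&&` chain: `check=True` makes a failing step abort the run instead of feeding incomplete files to the next script.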
### Individual scripts

**01_generate_synthetic.py**: Creates synthetic data for battery systems, renewable generation, conventional generation, and data centers.

**02_fetch_historical.py**: Fetches electricity prices, Bitcoin mining data, and load profiles from public APIs (or generates realistic synthetic data when APIs are unavailable).

**03_process_merge.py**: Merges datasets, optimizes memory usage, and saves to compressed Parquet format.

**04_validate.py**: Validates data quality, checks for missing values and outliers, and generates validation reports.

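The API-or-synthetic behaviour described for `02_fetch_historical.py` can be pictured as a fetch with a fallback. This sketch illustrates the general pattern only and is not the script's actual code: `fetch_json`, `load_with_fallback`, and the `"historical"` / `"synthetic"` labels are hypothetical, and no real endpoint is shown.

```python
import json
import urllib.request
from typing import Callable


def fetch_json(url: str, timeout: float = 10.0):
    """Download and decode a JSON document (the URL is supplied by the caller)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def load_with_fallback(
    fetch: Callable[[], list],
    generate: Callable[[], list],
) -> tuple[list, str]:
    """Try the historical source first; fall back to synthetic data on failure.

    Returns the records together with a label saying where they came from.
    """
    try:
        records = fetch()
        if records:  # an empty response is treated like an unavailable API
            return records, "historical"
    except (OSError, ValueError) as exc:
        # URLError and timeouts are OSError subclasses; bad JSON is a ValueError.
        print(f"API unavailable ({exc}); generating synthetic data instead")
    return generate(), "synthetic"
```

Keeping the source label next to the data makes it possible to record, per dataset, whether a given run used historical or synthetic values.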
## Configuration

Edit `config/data_config.yaml` to customize:

- **Time range**: Start/end dates and granularity
- **Regions**: Market regions to include
- **Data sources**: Synthetic vs historical for each dataset
- **Generation parameters (synthetic data)**: Noise levels, outlier rates, missing value rates
- **Battery parameters**: Capacity ranges, efficiency, degradation
- **Generation parameters (power plants)**: Plant capacities, marginal costs
- **Bitcoin parameters**: Hashrate ranges, mining efficiency

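To make the roles of noise level, outlier rate, and missing value rate concrete, here is a small illustrative generator. It is a sketch under stated assumptions, not the logic of `01_generate_synthetic.py`: the function name, the parameter names, the default values, and the sinusoidal daily shape are all hypothetical and do not reflect the actual keys in `data_config.yaml`.

```python
import math
import random


def synthetic_prices(
    n_minutes: int,
    base: float = 60.0,          # hypothetical mean price level
    amplitude: float = 20.0,     # hypothetical size of the daily swing
    noise_std: float = 2.0,      # noise level: std. dev. of Gaussian noise
    outlier_rate: float = 0.01,  # probability that a point becomes a spike or dip
    missing_rate: float = 0.005, # probability that a point is dropped (None)
    seed: int = 42,
) -> list:
    """Return a 1-minute series with a daily cycle, noise, outliers, and gaps."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same data
    series = []
    for minute in range(n_minutes):
        phase = 2 * math.pi * (minute % 1440) / 1440  # position within the day
        value = base + amplitude * math.sin(phase) + rng.gauss(0, noise_std)
        if rng.random() < outlier_rate:
            value *= rng.choice([0.2, 3.0])  # crude dip or spike
        if rng.random() < missing_rate:
            series.append(None)  # missing value
        else:
            series.append(round(value, 2))
    return series
```

Setting `outlier_rate` and `missing_rate` to zero yields a clean series, which is useful as a baseline when checking how a strategy reacts to data quality problems.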
## Data Specifications

| Dataset | Time Range | Rows (10d × 1min) | Est. Size |
|---------|-----------|-------------------|-----------|
| electricity_prices | 10 days | 72,000 | ~40MB |
| battery_capacity | 10 days | 144,000 | ~20MB |
| renewable_generation | 10 days | 216,000 | ~35MB |
| conventional_generation | 10 days | 144,000 | ~25MB |
| load_profiles | 10 days | 72,000 | ~30MB |
| data_centers | 10 days | 72,000 | ~15MB |
| bitcoin_mining | 10 days | 14,400 | ~20MB |
| **Total** | | | **~185MB** |

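The row counts in the table follow from the time range and granularity: 10 days at 1-minute resolution gives 14,400 timestamps, and every row count is a whole multiple of that. The sketch below works through the arithmetic. Reading the multiple as the number of parallel series per dataset (for example regions or assets) is an inference from the numbers, not something the configuration is known to state; the constant and function names are illustrative.

```python
from datetime import timedelta

TIME_RANGE = timedelta(days=10)
GRANULARITY = timedelta(minutes=1)

# Row counts copied from the table above.
ROWS = {
    "electricity_prices": 72_000,
    "battery_capacity": 144_000,
    "renewable_generation": 216_000,
    "conventional_generation": 144_000,
    "load_profiles": 72_000,
    "data_centers": 72_000,
    "bitcoin_mining": 14_400,
}

# Number of timestamps in one series: 10 days * 24 h * 60 min.
timestamps = TIME_RANGE // GRANULARITY


def series_count(rows: int, per_series: int) -> int:
    """Return how many full series of `per_series` rows fit into `rows`."""
    count, remainder = divmod(rows, per_series)
    if remainder:
        raise ValueError(f"{rows} is not a multiple of {per_series}")
    return count


series = {name: series_count(rows, timestamps) for name, rows in ROWS.items()}

for name, count in series.items():
    print(f"{name}: {count} series x {timestamps} timestamps = {ROWS[name]} rows")
```

For `electricity_prices` the multiple is 5, which matches the five regions listed in the overview.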
## Output Format

All processed datasets are saved as Parquet files with Snappy compression in `data/processed/`.

To read a dataset:

```python
import pandas as pd

df = pd.read_parquet('data/processed/electricity_prices.parquet')
print(df.head())
```

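For the writing side, the following sketch shows one common way to shrink a DataFrame before saving it as Snappy-compressed Parquet. It is an assumption about what "optimizes memory usage" could look like, not the code of `03_process_merge.py`; `optimize_memory` and `save_compressed` are hypothetical helpers, and writing Parquet from pandas requires an engine such as `pyarrow`.

```python
import pandas as pd


def optimize_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes where pandas can downcast them."""
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if pd.api.types.is_float_dtype(dtype):
            # float64 -> float32 when the values survive the conversion
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_integer_dtype(dtype):
            # int64 -> smallest signed integer type that holds all values
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_string_dtype(dtype) and out[col].nunique() < len(out) / 2:
            # repeated labels (e.g. region codes) are stored once as categories
            out[col] = out[col].astype("category")
    return out


def save_compressed(df: pd.DataFrame, path: str) -> None:
    """Write the optimized frame as a Snappy-compressed Parquet file."""
    optimize_memory(df).to_parquet(path, compression="snappy", index=False)
```

Downcasting to `float32` trades precision for size, so it is worth confirming that the reduced precision is acceptable for price data before applying it.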
## Data Sources

- **Electricity Prices**: Hybrid (synthetic patterns based on EPEX Spot market characteristics)
- **Bitcoin Mining**: Hybrid (mempool.space API + synthetic patterns)
- **Load Profiles**: Hybrid (ENTSO-E transparency platform patterns + synthetic)

## Validation Reports

After processing, validation reports are generated in `data/metadata/`:

- `validation_report.json`: Data quality checks, missing values, range violations
- `final_metadata.json`: Dataset sizes, row counts, processing details
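
As an illustration of the kinds of checks listed above, the sketch below counts missing values, range violations, and outliers for a single column and prints the result as JSON. It is a hedged example only: `validate_series`, the report keys, the 1.5 × IQR outlier rule, and the sample values are assumptions, and the real layout of `validation_report.json` may differ.

```python
import json
import statistics


def validate_series(values: list, lower: float, upper: float) -> dict:
    """Summarize basic quality checks for one column.

    Only None counts as missing here; outliers are points outside the
    1.5 * IQR fences, which is one common rule among several.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartile cut points
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "range_violations": sum(1 for v in present if not lower <= v <= upper),
        "outliers": sum(1 for v in present if v < low_fence or v > high_fence),
    }


# Made-up sample values, with one gap, one dip, and one spike.
sample = [50.0, 52.0, None, 48.0, 51.0, 49.0, 500.0, -20.0]
report = validate_series(sample, lower=-50.0, upper=300.0)
print(json.dumps(report, indent=2))
```

Range violations and outliers are kept separate on purpose: a value can be statistically unusual yet still inside the allowed range, as the dip in the sample shows.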