Initial commit: Energy test data generation pipeline

Add complete test data preparation system for energy trading strategy demo. Includes configuration, data generation scripts, and validation tools for 7 datasets covering electricity prices, battery capacity, renewable/conventional generation, load profiles, data centers, and mining data.

Excluded from git: Actual parquet data files (data/raw/, data/processed/) can be regenerated using the provided scripts.

Datasets:
- electricity_prices: Day-ahead and real-time prices (5 regions)
- battery_capacity: Storage system charge/discharge cycles
- renewable_generation: Solar, wind, hydro with forecast errors
- conventional_generation: Gas, coal, nuclear plant outputs
- load_profiles: Regional demand with weather correlations
- data_centers: Power demand profiles including mining operations
- mining_data: Hashrate, price, profitability (mempool.space API)
2026-02-10 23:28:23 +07:00
commit a643767359
12 changed files with 1869 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,124 @@
+# Energy Test Data
+
+Preparation of test data for an energy trading strategy demo.
+
+## Overview
+
+This project generates and processes realistic test data for energy trading strategies, including:
+
+- **Electricity Prices**: Day-ahead and real-time market prices for European regions (FR, BE, DE, NL, UK)
+- **Battery Capacity**: Storage system states with charge/discharge cycles
+- **Renewable Generation**: Solar, wind, and hydro generation with forecast errors
+- **Conventional Generation**: Gas, coal, and nuclear plant outputs
+- **Load Profiles**: Regional electricity demand with weather correlations
+- **Data Centers**: Power demand profiles, including a Bitcoin mining client
+- **Bitcoin Mining**: Hashrate, price, and profitability data (from mempool.space)
+
+## Project Structure
+
+```
+energy-test-data/
+├── data/
+│   ├── processed/              # Final Parquet files (<200MB total)
+│   ├── raw/                    # Unprocessed source data
+│   └── metadata/               # Data documentation and reports
+├── scripts/
+│   ├── 01_generate_synthetic.py    # Generate synthetic data
+│   ├── 02_fetch_historical.py      # Fetch historical data
+│   ├── 03_process_merge.py         # Process and compress
+│   └── 04_validate.py              # Validate and report
+├── config/
+│   ├── data_config.yaml            # Configuration parameters
+│   └── schema.yaml                 # Data schema definitions
+├── requirements.txt
+└── README.md
+```
+
+## Installation
+
+```bash
+pip install -r requirements.txt
+```
+
+## Usage
+
+### Generate all test data
+
+Run scripts in sequence:
+
+```bash
+python scripts/01_generate_synthetic.py
+python scripts/02_fetch_historical.py
+python scripts/03_process_merge.py
+python scripts/04_validate.py
+```
+
+Or run all at once:
+
+```bash
+python scripts/01_generate_synthetic.py && \
+python scripts/02_fetch_historical.py && \
+python scripts/03_process_merge.py && \
+python scripts/04_validate.py
+```
+
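As a portable alternative to the shell chain above, the same sequence could be driven from Python. This is a minimal sketch, not part of the repository: `PIPELINE` and `run_pipeline` are hypothetical names, and the only assumption taken from the README is the script order. Like `&&`, it stops at the first script that exits with a non-zero code.

```python
import subprocess
import sys

# Script order as listed in this README.
PIPELINE = [
    "scripts/01_generate_synthetic.py",
    "scripts/02_fetch_historical.py",
    "scripts/03_process_merge.py",
    "scripts/04_validate.py",
]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script in order and stop at the first non-zero exit code.

    Returns the list of scripts that completed successfully.
    `runner` is injectable so the control flow can be exercised without
    launching real processes.
    """
    completed = []
    for script in scripts:
        # Use the current interpreter so the active virtualenv is respected.
        result = runner([sys.executable, script])
        if result.returncode != 0:
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the repository root would launch the four scripts; comparing the returned list against `PIPELINE` shows whether every step finished.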
+### Individual scripts
+
+**01_generate_synthetic.py**: Creates synthetic data for battery systems, renewable generation, conventional generation, and data centers.
+
+**02_fetch_historical.py**: Fetches electricity prices, Bitcoin mining data, and load profiles from public APIs (or generates realistic synthetic data when APIs are unavailable).
+
+**03_process_merge.py**: Merges datasets, optimizes memory usage, and saves to compressed Parquet format.
+
+**04_validate.py**: Validates data quality, checks for missing values and outliers, and generates validation reports.
+
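The API-or-synthetic behaviour described for `02_fetch_historical.py` can be pictured as a small fallback wrapper. The sketch below is an assumption about the pattern, not the script's actual code: `load_with_fallback` and `synthetic_prices` are hypothetical helpers, and the generated numbers are placeholders rather than realistic prices.

```python
import random


def synthetic_prices(n, seed=0):
    """Hypothetical stand-in generator: n placeholder values around 50."""
    rng = random.Random(seed)
    return [round(50 + rng.gauss(0, 10), 2) for _ in range(n)]


def load_with_fallback(fetch, generate):
    """Try the API first; fall back to synthetic data if the call fails.

    Returns (data, source) so later steps can record where the data came from.
    OSError covers network failures (urllib.error.URLError subclasses it);
    ValueError covers malformed responses such as invalid JSON.
    """
    try:
        return fetch(), "api"
    except (OSError, ValueError):
        return generate(), "synthetic"
```

Returning the source label alongside the data is one way to keep the origin of each dataset visible in later metadata.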
+## Configuration
+
+Edit `config/data_config.yaml` to customize:
+
+- **Time range**: Start/end dates and granularity
+- **Regions**: Market regions to include
+- **Data sources**: Synthetic vs historical for each dataset
+- **Synthetic data parameters**: Noise levels, outlier rates, missing value rates
+- **Battery parameters**: Capacity ranges, efficiency, degradation
+- **Power plant parameters**: Plant capacities, marginal costs
+- **Bitcoin parameters**: Hashrate ranges, mining efficiency
+
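To illustrate how noise levels, outlier rates, and missing value rates might shape a generated series, here is a minimal sketch. The function and parameter names are hypothetical and are not the keys used in `config/data_config.yaml`; the daily sine cycle and the spike model are simplifying assumptions.

```python
import math
import random

MINUTES_PER_DAY = 24 * 60


def generate_series(days, base, amplitude, noise_std,
                    outlier_rate, missing_rate, seed=42):
    """Minute-level series: daily sine cycle plus Gaussian noise,
    with occasional outliers and missing values (None)."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    values = []
    for minute in range(days * MINUTES_PER_DAY):
        phase = 2 * math.pi * (minute % MINUTES_PER_DAY) / MINUTES_PER_DAY
        value = base + amplitude * math.sin(phase) + rng.gauss(0, noise_std)
        if rng.random() < outlier_rate:
            value *= 3  # crude spike; the real scripts may model outliers differently
        if rng.random() < missing_rate:
            value = None  # missing observation
        values.append(value)
    return values
```

With `days=10` this yields 14,400 values, one per minute, which matches the per-series granularity in the table below.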
+## Data Specifications
+
+| Dataset | Time Range | Rows (10d × 1min) | Est. Size |
+|---------|-----------|-------------------|-----------|
+| electricity_prices | 10 days | 72,000 | ~40MB |
+| battery_capacity | 10 days | 144,000 | ~20MB |
+| renewable_generation | 10 days | 216,000 | ~35MB |
+| conventional_generation | 10 days | 144,000 | ~25MB |
+| load_profiles | 10 days | 72,000 | ~30MB |
+| data_centers | 10 days | 72,000 | ~15MB |
+| bitcoin_mining | 10 days | 14,400 | ~20MB |
+| **Total** | | | **~185MB** |
+
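The row counts in the table can be sanity-checked with a little arithmetic: 10 days at 1-minute granularity is 10 × 1,440 = 14,400 timestamps, and every row count is an exact multiple of that. The sketch below copies the figures from the table; the variable names are illustrative. The factor of 5 for `electricity_prices` matches the five listed regions, while what the other factors correspond to (for example, the number of battery systems or plants) is not stated here.

```python
MINUTES_PER_DAY = 24 * 60
DAYS = 10
TIMESTAMPS = DAYS * MINUTES_PER_DAY  # 14,400 one-minute steps

# Row counts and size estimates copied from the table above.
ROWS = {
    "electricity_prices": 72_000,
    "battery_capacity": 144_000,
    "renewable_generation": 216_000,
    "conventional_generation": 144_000,
    "load_profiles": 72_000,
    "data_centers": 72_000,
    "bitcoin_mining": 14_400,
}
SIZE_MB = {
    "electricity_prices": 40,
    "battery_capacity": 20,
    "renewable_generation": 35,
    "conventional_generation": 25,
    "load_profiles": 30,
    "data_centers": 15,
    "bitcoin_mining": 20,
}


def series_per_timestamp(rows, timestamps=TIMESTAMPS):
    """Number of rows per timestamp; must divide exactly for a full grid."""
    count, remainder = divmod(rows, timestamps)
    if remainder:
        raise ValueError(f"{rows} rows is not a multiple of {timestamps}")
    return count


implied = {name: series_per_timestamp(n) for name, n in ROWS.items()}
total_rows = sum(ROWS.values())
total_mb = sum(SIZE_MB.values())
```

Here `implied` maps each dataset to its rows-per-timestamp factor, and `total_mb` reproduces the ~185MB total in the table.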
+## Output Format
+
+All processed datasets are saved as Parquet files with Snappy compression in `data/processed/`.
+
+To read a dataset:
+
+```python
+import pandas as pd
+
+df = pd.read_parquet('data/processed/electricity_prices.parquet')
+print(df.head())
+```
+
+## Data Sources
+
+- **Electricity Prices**: Hybrid (synthetic patterns based on EPEX Spot market characteristics)
+- **Bitcoin Mining**: Hybrid (mempool.space API + synthetic patterns)
+- **Load Profiles**: Hybrid (ENTSO-E transparency platform patterns + synthetic)
+
+## Validation Reports
+
+After processing, validation reports are generated in `data/metadata/`:
+
+- `validation_report.json`: Data quality checks, missing values, range violations
+- `final_metadata.json`: Dataset sizes, row counts, processing details
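
The kind of checks listed for `validation_report.json` (missing values, range violations) can be sketched in a few lines. The structure below is an assumption for illustration only: the actual report layout produced by `04_validate.py` is not shown in this README, and the limits in the example are hypothetical bounds, not market rules.

```python
import json


def validate_column(values, lower, upper):
    """Count missing values (None) and values outside [lower, upper]."""
    missing = sum(v is None for v in values)
    out_of_range = sum(
        v is not None and not (lower <= v <= upper) for v in values
    )
    return {
        "rows": len(values),
        "missing": missing,
        "missing_rate": missing / len(values) if values else 0.0,
        "range_violations": out_of_range,
    }


def build_report(datasets, limits):
    """datasets: {name: values}; limits: {name: (lower, upper)}."""
    return {
        name: validate_column(values, *limits[name])
        for name, values in datasets.items()
    }


# Tiny example with made-up values and hypothetical limits.
report = build_report(
    {"electricity_prices": [42.0, None, -900.0, 55.5]},
    {"electricity_prices": (-500.0, 3000.0)},
)
report_json = json.dumps(report, indent=2)
```

Serializing with `json.dumps` keeps the report easy to diff between runs.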