Initial commit: Energy test data generation pipeline
Add complete test data preparation system for energy trading strategy demo.
Includes configuration, data generation scripts, and validation tools for
7 datasets covering electricity prices, battery capacity, renewable and
conventional generation, load profiles, data centers, and mining data.

Excluded from git: actual parquet data files (data/raw/, data/processed/)
can be regenerated using the provided scripts.

Datasets:
- electricity_prices: Day-ahead and real-time prices (5 regions)
- battery_capacity: Storage system charge/discharge cycles
- renewable_generation: Solar, wind, hydro with forecast errors
- conventional_generation: Gas, coal, nuclear plant outputs
- load_profiles: Regional demand with weather correlations
- data_centers: Power demand profiles including mining operations
- mining_data: Hashrate, price, profitability (mempool.space API)
46
.gitignore
vendored
Normal file
@@ -0,0 +1,46 @@
# Data files - exclude from git
data/raw/*.parquet
data/processed/*.parquet

# Python artifacts
__pycache__/
test/__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
ENV/
env/
.venv/

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Logs
*.log
124
README.md
Normal file
@@ -0,0 +1,124 @@
# Energy Test Data

Preparation of test data for an energy trading strategy demo.

## Overview

This project generates and processes realistic test data for energy trading strategies, including:

- **Electricity Prices**: Day-ahead and real-time market prices for European regions (FR, BE, DE, NL, UK)
- **Battery Capacity**: Storage system states with charge/discharge cycles
- **Renewable Generation**: Solar, wind, and hydro generation with forecast errors
- **Conventional Generation**: Gas, coal, and nuclear plant outputs
- **Load Profiles**: Regional electricity demand with weather correlations
- **Data Centers**: Power demand profiles including a Bitcoin mining client
- **Bitcoin Mining**: Hashrate, price, and profitability data (from mempool.space)

## Project Structure

```
energy-test-data/
├── data/
│   ├── processed/    # Final Parquet files (<200MB total)
│   ├── raw/          # Unprocessed source data
│   └── metadata/     # Data documentation and reports
├── scripts/
│   ├── 01_generate_synthetic.py   # Generate synthetic data
│   ├── 02_fetch_historical.py     # Fetch historical data
│   ├── 03_process_merge.py        # Process and compress
│   └── 04_validate.py             # Validate and report
├── config/
│   ├── data_config.yaml   # Configuration parameters
│   └── schema.yaml        # Data schema definitions
├── requirements.txt
└── README.md
```

## Installation

```bash
pip install -r requirements.txt
```

## Usage

### Generate all test data

Run scripts in sequence:

```bash
python scripts/01_generate_synthetic.py
python scripts/02_fetch_historical.py
python scripts/03_process_merge.py
python scripts/04_validate.py
```

Or run all at once:

```bash
python scripts/01_generate_synthetic.py && \
python scripts/02_fetch_historical.py && \
python scripts/03_process_merge.py && \
python scripts/04_validate.py
```

### Individual scripts

**01_generate_synthetic.py**: Creates synthetic data for battery systems, renewable generation, conventional generation, and data centers.

**02_fetch_historical.py**: Fetches electricity prices, Bitcoin mining data, and load profiles from public APIs (or generates realistic synthetic data when APIs are unavailable).

**03_process_merge.py**: Merges datasets, optimizes memory usage, and saves to compressed Parquet format.

**04_validate.py**: Validates data quality, checks for missing values and outliers, and generates validation reports.

## Configuration

Edit `config/data_config.yaml` to customize:

- **Time range**: Start/end dates and granularity
- **Regions**: Market regions to include
- **Data sources**: Synthetic vs historical for each dataset
- **Generation parameters**: Noise levels, outlier rates, missing value rates
- **Battery parameters**: Capacity ranges, efficiency, degradation
- **Plant parameters**: Renewable and conventional plant capacities, marginal costs
- **Bitcoin parameters**: Hashrate ranges, mining efficiency

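The scripts read this file with `yaml.safe_load` and index into the nested keys. A minimal sketch, with an excerpt of the config inlined so it runs standalone (the values mirror `config/data_config.yaml`):

```python
import yaml

# Excerpt of config/data_config.yaml, inlined here for illustration.
CONFIG_EXCERPT = """
time_range:
  start_date: "2026-01-31"
  end_date: "2026-02-10"
  granularity: "1min"
regions: ["FR", "BE", "DE", "NL", "UK"]
generation:
  seed: 42
  noise_level: 0.05
"""

config = yaml.safe_load(CONFIG_EXCERPT)

# Scripts access nested keys directly:
granularity = config["time_range"]["granularity"]
regions = config["regions"]
print(granularity, len(regions))  # -> 1min 5
```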
## Data Specifications

| Dataset | Time Range | Rows (10d × 1min) | Est. Size |
|---------|-----------|-------------------|-----------|
| electricity_prices | 10 days | 72,000 | ~40MB |
| battery_capacity | 10 days | 144,000 | ~20MB |
| renewable_generation | 10 days | 216,000 | ~35MB |
| conventional_generation | 10 days | 144,000 | ~25MB |
| load_profiles | 10 days | 72,000 | ~30MB |
| data_centers | 10 days | 72,000 | ~15MB |
| bitcoin_mining | 10 days | 14,400 | ~20MB |
| **Total** | | | **~185MB** |

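The row counts above follow from the 1-minute granularity: 10 days × 1,440 minutes = 14,400 timestamps per entity, multiplied by the number of entities. A quick check, with entity counts taken from `config/data_config.yaml`:

```python
MINUTES_PER_DAY = 24 * 60            # 1-minute granularity
timestamps = 10 * MINUTES_PER_DAY    # 10 days -> 14,400 rows per entity

# Entities per dataset (counts from config/data_config.yaml)
expected_rows = {
    "electricity_prices": timestamps * 5,        # 5 regions
    "battery_capacity": timestamps * 10,         # 10 batteries
    "renewable_generation": timestamps * 15,     # 3 sources x 5 plants
    "conventional_generation": timestamps * 10,  # 10 plants
    "load_profiles": timestamps * 5,             # 5 regions
    "data_centers": timestamps * 5,              # 5 data centers
    "bitcoin_mining": timestamps * 1,            # single aggregate series
}
print(expected_rows["renewable_generation"])  # -> 216000
```

The processed files carry a few extra rows (e.g. 72,005 instead of 72,000 in `final_metadata.json`) because `pd.date_range` includes the end timestamp, yielding 14,401 timestamps per entity.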
## Output Format

All processed datasets are saved as Parquet files with Snappy compression in `data/processed/`.

To read a dataset:

```python
import pandas as pd

df = pd.read_parquet('data/processed/electricity_prices.parquet')
print(df.head())
```

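The memory optimization step likely amounts to downcasting floats and categorizing ID columns (the config requests `precision: "float32"`). A sketch of that idea, not the exact code from `03_process_merge.py`:

```python
import pandas as pd

def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast float64 -> float32 and object -> category (illustrative sketch)."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == "float64":
            out[col] = out[col].astype("float32")
        elif out[col].dtype == "object":
            out[col] = out[col].astype("category")
    return out

df = pd.DataFrame({"battery_id": ["BAT_001", "BAT_002"],
                   "capacity_mwh": [50.0, 75.0]})
print(optimize_dtypes(df).dtypes.astype(str).tolist())  # -> ['category', 'float32']
```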
## Data Sources

- **Electricity Prices**: Hybrid (synthetic patterns based on EPEX Spot market characteristics)
- **Bitcoin Mining**: Hybrid (mempool.space API + synthetic patterns)
- **Load Profiles**: Hybrid (ENTSO-E transparency platform patterns + synthetic)

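For the mining dataset, the breakeven `electricity_cost` column (see `config/schema.yaml`) relates profitability (USD/TH/day) to miner efficiency (J/TH). The generator's actual formula isn't shown here; this sketch only illustrates the relationship, under assumed example values:

```python
J_PER_MWH = 3.6e9
SECONDS_PER_DAY = 86_400

def breakeven_price(profitability_usd_per_th_day: float,
                    efficiency_j_per_th: float) -> float:
    """Electricity price (per MWh) at which mining revenue equals power cost
    for one TH/s of sustained hashrate (hypothetical illustration)."""
    # One TH/s performs 86,400 TH of work per day.
    energy_mwh_per_day = SECONDS_PER_DAY * efficiency_j_per_th / J_PER_MWH
    return profitability_usd_per_th_day / energy_mwh_per_day

# At 30 J/TH, one TH/s draws 0.72 kWh/day; at 0.06 USD/TH/day revenue:
print(round(breakeven_price(0.06, 30.0), 1))  # -> 83.3
```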
## Validation Reports

After processing, validation reports are generated in `data/metadata/`:

- `validation_report.json`: Data quality checks, missing values, range violations
- `final_metadata.json`: Dataset sizes, row counts, processing details
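A typical consumer of `validation_report.json` filters for datasets that did not pass. A sketch, with a small excerpt of the report inlined so it runs standalone:

```python
import json

# Excerpt of data/metadata/validation_report.json, inlined for illustration.
REPORT_EXCERPT = """
{
  "summary": {"total_datasets": 7, "passed": 2, "warnings": 5, "failed": 0},
  "datasets": [
    {"dataset": "electricity_prices", "status": "pass", "data_ranges": []},
    {"dataset": "battery_capacity", "status": "warning",
     "data_ranges": [{"column": "efficiency", "rule": "max <= 1.0",
                      "violations": 4371}]}
  ]
}
"""

report = json.loads(REPORT_EXCERPT)
flagged = [d["dataset"] for d in report["datasets"] if d["status"] != "pass"]
print(flagged)  # -> ['battery_capacity']
```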
96
config/data_config.yaml
Normal file
@@ -0,0 +1,96 @@
# Energy Test Data Configuration
# For energy trading strategy demo

time_range:
  # Last 10 days from current date (adjustable)
  start_date: "2026-01-31"
  end_date: "2026-02-10"
  granularity: "1min"  # 1-minute intervals

regions:
  # European energy markets
  - "FR"  # France
  - "BE"  # Belgium
  - "DE"  # Germany
  - "NL"  # Netherlands
  - "UK"  # United Kingdom

data_sources:
  electricity_prices:
    type: "hybrid"
    historical_source: "epex_spot"
    synthetic_patterns: true
  battery_capacity:
    type: "synthetic"
    num_batteries: 10
  renewable_generation:
    type: "synthetic"
    plants_per_source: 5
    sources: ["solar", "wind", "hydro"]
  conventional_generation:
    type: "synthetic"
    num_plants: 10
    fuel_types: ["gas", "coal", "nuclear"]
  load_profiles:
    type: "synthetic"
    historical_source: "entso_e"
  data_centers:
    type: "synthetic"
    num_centers: 5
    special_client: "bitcoin"
  bitcoin_mining:
    type: "hybrid"
    historical_source: "mempool.space"
    synthetic_patterns: true

output:
  format: "parquet"
  compression: "snappy"
  target_size_mb: 200
  precision: "float32"

generation:
  seed: 42
  add_noise: true
  noise_level: 0.05
  include_outliers: true
  outlier_rate: 0.01
  include_missing_values: true
  missing_rate: 0.005

battery:
  capacity_range: [10, 100]  # MWh
  charge_rate_range: [5, 50]  # MW
  discharge_rate_range: [5, 50]  # MW
  efficiency_range: [0.85, 0.95]
  degradation_rate: 0.001

renewable:
  solar:
    capacity_range: [50, 500]  # MW
    forecast_error_sd: 0.15
  wind:
    capacity_range: [100, 800]  # MW
    forecast_error_sd: 0.20
  hydro:
    capacity_range: [50, 300]  # MW
    forecast_error_sd: 0.05

conventional:
  gas:
    capacity_range: [200, 1000]  # MW
    marginal_cost_range: [30, 80]  # EUR/MWh
  coal:
    capacity_range: [300, 1500]  # MW
    marginal_cost_range: [40, 70]  # EUR/MWh
  nuclear:
    capacity_range: [800, 1600]  # MW
    marginal_cost_range: [10, 30]  # EUR/MWh

data_center:
  power_demand_range: [10, 100]  # MW
  price_sensitivity_range: [0.8, 1.2]

bitcoin:
  hashrate_range: [150, 250]  # EH/s
  mining_efficiency_range: [25, 35]  # J/TH
233
config/schema.yaml
Normal file
@@ -0,0 +1,233 @@
# Schema definitions for energy test data datasets

schemas:
  electricity_prices:
    columns:
      - name: "timestamp"
        type: "datetime64[ns]"
        description: "Timestamp of price observation"
      - name: "region"
        type: "category"
        description: "Market region code"
      - name: "day_ahead_price"
        type: "float32"
        unit: "EUR/MWh"
        description: "Day-ahead market clearing price"
      - name: "real_time_price"
        type: "float32"
        unit: "EUR/MWh"
        description: "Real-time market price"
      - name: "capacity_price"
        type: "float32"
        unit: "EUR/MWh"
        description: "Capacity market price"
      - name: "regulation_price"
        type: "float32"
        unit: "EUR/MWh"
        description: "Frequency regulation price"
      - name: "volume_mw"
        type: "float32"
        unit: "MW"
        description: "Traded volume"

  battery_capacity:
    columns:
      - name: "timestamp"
        type: "datetime64[ns]"
        description: "Timestamp of battery state"
      - name: "battery_id"
        type: "category"
        description: "Unique battery identifier"
      - name: "capacity_mwh"
        type: "float32"
        unit: "MWh"
        description: "Total storage capacity"
      - name: "charge_level_mwh"
        type: "float32"
        unit: "MWh"
        description: "Current energy stored"
      - name: "charge_rate_mw"
        type: "float32"
        unit: "MW"
        description: "Current charging rate (positive) or discharging (negative)"
      - name: "discharge_rate_mw"
        type: "float32"
        unit: "MW"
        description: "Maximum discharge rate"
      - name: "efficiency"
        type: "float32"
        description: "Round-trip efficiency (0-1)"

  renewable_generation:
    columns:
      - name: "timestamp"
        type: "datetime64[ns]"
        description: "Timestamp of generation measurement"
      - name: "source"
        type: "category"
        description: "Renewable source type (solar, wind, hydro)"
      - name: "plant_id"
        type: "category"
        description: "Unique plant identifier"
      - name: "generation_mw"
        type: "float32"
        unit: "MW"
        description: "Actual generation output"
      - name: "forecast_mw"
        type: "float32"
        unit: "MW"
        description: "Forecasted generation"
      - name: "actual_mw"
        type: "float32"
        unit: "MW"
        description: "Actual measured generation (after correction)"
      - name: "capacity_factor"
        type: "float32"
        description: "Capacity utilization factor (0-1)"

  conventional_generation:
    columns:
      - name: "timestamp"
        type: "datetime64[ns]"
        description: "Timestamp of generation measurement"
      - name: "plant_id"
        type: "category"
        description: "Unique plant identifier"
      - name: "fuel_type"
        type: "category"
        description: "Primary fuel type (gas, coal, nuclear)"
      - name: "generation_mw"
        type: "float32"
        unit: "MW"
        description: "Current generation output"
      - name: "marginal_cost"
        type: "float32"
        unit: "EUR/MWh"
        description: "Short-run marginal cost"
      - name: "heat_rate"
        type: "float32"
        unit: "MMBtu/MWh"
        description: "Thermal efficiency metric"

  load_profiles:
    columns:
      - name: "timestamp"
        type: "datetime64[ns]"
        description: "Timestamp of load measurement"
      - name: "region"
        type: "category"
        description: "Region code"
      - name: "load_mw"
        type: "float32"
        unit: "MW"
        description: "Actual system load"
      - name: "forecast_mw"
        type: "float32"
        unit: "MW"
        description: "Load forecast"
      - name: "weather_temp"
        type: "float32"
        unit: "Celsius"
        description: "Average temperature"
      - name: "humidity"
        type: "float32"
        unit: "%"
        description: "Relative humidity"

  data_centers:
    columns:
      - name: "timestamp"
        type: "datetime64[ns]"
        description: "Timestamp of demand measurement"
      - name: "data_center_id"
        type: "category"
        description: "Data center identifier"
      - name: "location"
        type: "category"
        description: "Geographic location"
      - name: "power_demand_mw"
        type: "float32"
        unit: "MW"
        description: "Current power demand"
      - name: "max_bid_price"
        type: "float32"
        unit: "EUR/MWh"
        description: "Maximum price willing to pay"
      - name: "client_type"
        type: "category"
        description: "Client type (bitcoin, enterprise, etc.)"

  bitcoin_mining:
    columns:
      - name: "timestamp"
        type: "datetime64[ns]"
        description: "Timestamp of mining measurement"
      - name: "pool_id"
        type: "category"
        description: "Mining pool identifier"
      - name: "hashrate_ths"
        type: "float32"
        unit: "TH/s"
        description: "Mining pool hashrate"
      - name: "btc_price_usd"
        type: "float32"
        unit: "USD"
        description: "Bitcoin price"
      - name: "mining_profitability"
        type: "float32"
        unit: "USD/TH/day"
        description: "Mining profitability per terahash per day"
      - name: "electricity_cost"
        type: "float32"
        unit: "EUR/MWh"
        description: "Electricity cost breakeven point"

validation_rules:
  electricity_prices:
    - column: "day_ahead_price"
      min: -500
      max: 3000
    - column: "real_time_price"
      min: -500
      max: 5000

  battery_capacity:
    - column: "charge_level_mwh"
      min: 0
      check_max: "capacity_mwh"
    - column: "efficiency"
      min: 0.5
      max: 1.0

  renewable_generation:
    - column: "generation_mw"
      min: 0
    - column: "capacity_factor"
      min: 0
      max: 1.0

  conventional_generation:
    - column: "generation_mw"
      min: 0
    - column: "heat_rate"
      min: 5
      max: 15

  load_profiles:
    - column: "load_mw"
      min: 0
    - column: "weather_temp"
      min: -30
      max: 50

  data_centers:
    - column: "power_demand_mw"
      min: 0
    - column: "max_bid_price"
      min: 0

  bitcoin_mining:
    - column: "hashrate_ths"
      min: 0
    - column: "btc_price_usd"
      min: 1000
49
data/metadata/final_metadata.json
Normal file
@@ -0,0 +1,49 @@
{
  "processed_at": "2026-02-10T16:10:49.295018+00:00",
  "total_datasets": 7,
  "total_size_mb": 16.977967262268066,
  "datasets": {
    "electricity_prices": {
      "path": "/home/user/energy-test-data/data/processed/electricity_prices.parquet",
      "size_mb": 2.2755775451660156,
      "rows": 72005,
      "columns": 7
    },
    "battery_capacity": {
      "path": "/home/user/energy-test-data/data/processed/battery_capacity.parquet",
      "size_mb": 4.204527854919434,
      "rows": 144010,
      "columns": 7
    },
    "renewable_generation": {
      "path": "/home/user/energy-test-data/data/processed/renewable_generation.parquet",
      "size_mb": 4.482715606689453,
      "rows": 216015,
      "columns": 7
    },
    "conventional_generation": {
      "path": "/home/user/energy-test-data/data/processed/conventional_generation.parquet",
      "size_mb": 2.749570846557617,
      "rows": 144010,
      "columns": 6
    },
    "load_profiles": {
      "path": "/home/user/energy-test-data/data/processed/load_profiles.parquet",
      "size_mb": 1.861943244934082,
      "rows": 72005,
      "columns": 6
    },
    "data_centers": {
      "path": "/home/user/energy-test-data/data/processed/data_centers.parquet",
      "size_mb": 1.0422554016113281,
      "rows": 72005,
      "columns": 6
    },
    "bitcoin_mining": {
      "path": "/home/user/energy-test-data/data/processed/bitcoin_mining.parquet",
      "size_mb": 0.3613767623901367,
      "rows": 14401,
      "columns": 6
    }
  }
}
89
data/metadata/generation_metadata.json
Normal file
@@ -0,0 +1,89 @@
{
  "generated_at": "2026-02-10T16:10:43.522420",
  "datasets": {
    "battery_capacity": {
      "rows": 144010,
      "columns": [
        "timestamp",
        "battery_id",
        "capacity_mwh",
        "charge_level_mwh",
        "charge_rate_mw",
        "discharge_rate_mw",
        "efficiency"
      ],
      "memory_usage_mb": 15.38205337524414,
      "dtypes": {
        "timestamp": "datetime64[ns]",
        "battery_id": "object",
        "capacity_mwh": "float64",
        "charge_level_mwh": "float64",
        "charge_rate_mw": "float64",
        "discharge_rate_mw": "float64",
        "efficiency": "float64"
      }
    },
    "renewable_generation": {
      "rows": 216015,
      "columns": [
        "timestamp",
        "source",
        "plant_id",
        "generation_mw",
        "forecast_mw",
        "actual_mw",
        "capacity_factor"
      ],
      "memory_usage_mb": 34.472124099731445,
      "dtypes": {
        "timestamp": "datetime64[ns]",
        "source": "object",
        "plant_id": "object",
        "generation_mw": "float64",
        "forecast_mw": "float64",
        "actual_mw": "float64",
        "capacity_factor": "float64"
      }
    },
    "conventional_generation": {
      "rows": 144010,
      "columns": [
        "timestamp",
        "plant_id",
        "fuel_type",
        "generation_mw",
        "marginal_cost",
        "heat_rate"
      ],
      "memory_usage_mb": 26.149402618408203,
      "dtypes": {
        "timestamp": "datetime64[ns]",
        "plant_id": "object",
        "fuel_type": "object",
        "generation_mw": "float64",
        "marginal_cost": "float64",
        "heat_rate": "float64"
      }
    },
    "data_centers": {
      "rows": 72005,
      "columns": [
        "timestamp",
        "data_center_id",
        "location",
        "power_demand_mw",
        "max_bid_price",
        "client_type"
      ],
      "memory_usage_mb": 14.585489273071289,
      "dtypes": {
        "timestamp": "datetime64[ns]",
        "data_center_id": "object",
        "location": "object",
        "power_demand_mw": "float64",
        "max_bid_price": "float64",
        "client_type": "object"
      }
    }
  }
}
239
data/metadata/validation_report.json
Normal file
@@ -0,0 +1,239 @@
{
  "generated_at": "2026-02-10T16:10:53.614368",
  "summary": {
    "total_datasets": 7,
    "passed": 2,
    "warnings": 5,
    "failed": 0,
    "total_size_mb": 17.72,
    "total_rows": 734451
  },
  "datasets": [
    {
      "dataset": "electricity_prices",
      "rows": 72005,
      "columns": 7,
      "memory_mb": 1.99,
      "missing_values": {},
      "duplicated_rows": 0,
      "timestamp_continuity": {
        "status": "checked",
        "expected_frequency": "1min",
        "gaps_detected": 0,
        "total_rows": 72005
      },
      "data_ranges": [],
      "data_types": [],
      "status": "pass"
    },
    {
      "dataset": "battery_capacity",
      "rows": 144010,
      "columns": 7,
      "memory_mb": 3.98,
      "missing_values": {
        "capacity_mwh": {
          "count": 720,
          "percentage": 0.5
        },
        "charge_level_mwh": {
          "count": 720,
          "percentage": 0.5
        },
        "charge_rate_mw": {
          "count": 720,
          "percentage": 0.5
        },
        "discharge_rate_mw": {
          "count": 720,
          "percentage": 0.5
        },
        "efficiency": {
          "count": 720,
          "percentage": 0.5
        }
      },
      "duplicated_rows": 0,
      "timestamp_continuity": {
        "status": "checked",
        "expected_frequency": "1min",
        "gaps_detected": 0,
        "total_rows": 144010
      },
      "data_ranges": [
        {
          "column": "efficiency",
          "rule": "min >= 0.5",
          "violations": 36,
          "severity": "error"
        },
        {
          "column": "efficiency",
          "rule": "max <= 1.0",
          "violations": 4371,
          "severity": "error"
        }
      ],
      "data_types": [],
      "status": "warning"
    },
    {
      "dataset": "renewable_generation",
      "rows": 216015,
      "columns": 7,
      "memory_mb": 5.36,
      "missing_values": {
        "generation_mw": {
          "count": 1080,
          "percentage": 0.5
        },
        "forecast_mw": {
          "count": 1080,
          "percentage": 0.5
        },
        "actual_mw": {
          "count": 1080,
          "percentage": 0.5
        },
        "capacity_factor": {
          "count": 1080,
          "percentage": 0.5
        }
      },
      "duplicated_rows": 0,
      "timestamp_continuity": {
        "status": "checked",
        "expected_frequency": "1min",
        "gaps_detected": 0,
        "total_rows": 216015
      },
      "data_ranges": [
        {
          "column": "capacity_factor",
          "rule": "max <= 1.0",
          "violations": 6382,
          "severity": "error"
        }
      ],
      "data_types": [],
      "status": "warning"
    },
    {
      "dataset": "conventional_generation",
      "rows": 144010,
      "columns": 6,
      "memory_mb": 3.02,
      "missing_values": {
        "generation_mw": {
          "count": 720,
          "percentage": 0.5
        },
        "marginal_cost": {
          "count": 720,
          "percentage": 0.5
        },
        "heat_rate": {
          "count": 720,
          "percentage": 0.5
        }
      },
      "duplicated_rows": 0,
      "timestamp_continuity": {
        "status": "checked",
        "expected_frequency": "1min",
        "gaps_detected": 0,
        "total_rows": 144010
      },
      "data_ranges": [
        {
          "column": "heat_rate",
          "rule": "min >= 5",
          "violations": 29,
          "severity": "error"
        },
        {
          "column": "heat_rate",
          "rule": "max <= 15",
          "violations": 867,
          "severity": "error"
        }
      ],
      "data_types": [],
      "status": "warning"
    },
    {
      "dataset": "load_profiles",
      "rows": 72005,
      "columns": 6,
      "memory_mb": 1.72,
      "missing_values": {},
      "duplicated_rows": 0,
      "timestamp_continuity": {
        "status": "checked",
        "expected_frequency": "1min",
        "gaps_detected": 0,
        "total_rows": 72005
      },
      "data_ranges": [],
      "data_types": [],
      "status": "pass"
    },
    {
      "dataset": "data_centers",
      "rows": 72005,
      "columns": 6,
      "memory_mb": 1.31,
      "missing_values": {
        "power_demand_mw": {
          "count": 360,
          "percentage": 0.5
        },
        "max_bid_price": {
          "count": 360,
          "percentage": 0.5
        }
      },
      "duplicated_rows": 0,
      "timestamp_continuity": {
        "status": "checked",
        "expected_frequency": "1min",
        "gaps_detected": 0,
        "total_rows": 72005
      },
      "data_ranges": [
        {
          "column": "power_demand_mw",
          "rule": "min >= 0",
          "violations": 137,
          "severity": "error"
        }
      ],
      "data_types": [],
      "status": "warning"
    },
    {
      "dataset": "bitcoin_mining",
      "rows": 14401,
      "columns": 6,
      "memory_mb": 0.34,
      "missing_values": {},
      "duplicated_rows": 0,
      "timestamp_continuity": {
        "status": "checked",
        "expected_frequency": "1min",
        "gaps_detected": 0,
        "total_rows": 14401
      },
      "data_ranges": [
        {
          "column": "btc_price_usd",
          "rule": "min >= 1000",
          "violations": 456,
          "severity": "error"
        }
      ],
      "data_types": [],
      "status": "warning"
    }
  ]
}
7
requirements.txt
Normal file
@@ -0,0 +1,7 @@
pandas>=2.0.0
numpy>=1.24.0
pyarrow>=14.0.0
pyyaml>=6.0
requests>=2.31.0
scipy>=1.11.0
python-dateutil>=2.8.0
320
scripts/01_generate_synthetic.py
Normal file
@@ -0,0 +1,320 @@
"""
Generate synthetic data for energy trading strategy test data.
Handles: battery capacity, data centers, renewable generation, conventional generation.
"""

import yaml
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime, timedelta
import json


def load_config():
    config_path = Path(__file__).parent.parent / "config" / "data_config.yaml"
    with open(config_path) as f:
        return yaml.safe_load(f)


def generate_timestamps(start_date, end_date, granularity):
    start = pd.to_datetime(start_date)
    end = pd.to_datetime(end_date)
    freq = granularity
    return pd.date_range(start=start, end=end, freq=freq)


def generate_battery_data(config, timestamps):
    np.random.seed(config['generation']['seed'])
    num_batteries = config['data_sources']['battery_capacity']['num_batteries']

    params = config['battery']
    gen_params = config['generation']

    batteries = []
    for i in range(num_batteries):
        battery_id = f"BAT_{i+1:03d}"
        capacity = np.random.uniform(*params['capacity_range'])
        charge_rate = np.random.uniform(*params['charge_rate_range'])
        discharge_rate = np.random.uniform(*params['discharge_rate_range'])
        efficiency = np.random.uniform(*params['efficiency_range'])

        n = len(timestamps)

        charge_level = np.zeros(n)
        charge_level[0] = capacity * np.random.uniform(0.3, 0.7)

        for t in range(1, n):
            action = np.random.choice([-1, 0, 1], p=[0.3, 0.2, 0.5])
            rate = charge_rate if action > 0 else discharge_rate

            change = action * rate / 60
            charge_level[t] = np.clip(charge_level[t-1] + change, 0, capacity)

        current_rate = np.diff(charge_level, prepend=charge_level[0]) * 60
        current_rate = np.clip(current_rate, -discharge_rate, charge_rate)

        data = pd.DataFrame({
            'timestamp': timestamps,
            'battery_id': battery_id,
            'capacity_mwh': capacity,
            'charge_level_mwh': charge_level,
            'charge_rate_mw': current_rate,
            'discharge_rate_mw': discharge_rate,
            'efficiency': efficiency
        })
        batteries.append(data)

    return pd.concat(batteries, ignore_index=True)


def generate_renewable_data(config, timestamps):
    np.random.seed(config['generation']['seed'] + 1)

    sources = config['data_sources']['renewable_generation']['sources']
    plants_per_source = config['data_sources']['renewable_generation']['plants_per_source']

    params = config['renewable']
    gen_params = config['generation']

    df_list = []
    plant_counter = 0

    for source in sources:
        source_params = params[source]
        for i in range(plants_per_source):
            plant_id = f"{source.upper()}_{i+1:03d}"
            plant_counter += 1
            capacity = np.random.uniform(*source_params['capacity_range'])
            forecast_error_sd = source_params['forecast_error_sd']

            n = len(timestamps)

            hours = timestamps.hour + timestamps.minute / 60

            if source == 'solar':
                base_pattern = np.maximum(0, np.sin(np.pi * (hours - 6) / 12))
                seasonal = 0.7 + 0.3 * np.sin(2 * np.pi * timestamps.dayofyear / 365)
            elif source == 'wind':
                base_pattern = 0.4 + 0.3 * np.sin(2 * np.pi * hours / 24) + 0.3 * np.random.randn(n)
                seasonal = 0.8 + 0.2 * np.sin(2 * np.pi * timestamps.dayofyear / 365)
            else:
                base_pattern = 0.6 + 0.2 * np.random.randn(n)
                seasonal = 1.0

            generation = base_pattern * seasonal * capacity * np.random.uniform(0.8, 1.2, n)
            generation = np.maximum(0, generation)
|
||||||
|
|
||||||
|
forecast_error = np.random.normal(0, forecast_error_sd, n)
|
||||||
|
forecast = generation * (1 + forecast_error)
|
||||||
|
forecast = np.maximum(0, forecast)
|
||||||
|
|
||||||
|
capacity_factor = generation / capacity
|
||||||
|
|
||||||
|
data = pd.DataFrame({
|
||||||
|
'timestamp': timestamps,
|
||||||
|
'source': source,
|
||||||
|
'plant_id': plant_id,
|
||||||
|
'generation_mw': generation,
|
||||||
|
'forecast_mw': forecast,
|
||||||
|
'actual_mw': generation,
|
||||||
|
'capacity_factor': capacity_factor
|
||||||
|
})
|
||||||
|
df_list.append(data)
|
||||||
|
|
||||||
|
return pd.concat(df_list, ignore_index=True)
|
||||||
|
|
||||||
|
def generate_conventional_data(config, timestamps):
|
||||||
|
np.random.seed(config['generation']['seed'] + 2)
|
||||||
|
|
||||||
|
num_plants = config['data_sources']['conventional_generation']['num_plants']
|
||||||
|
fuel_types = config['data_sources']['conventional_generation']['fuel_types']
|
||||||
|
|
||||||
|
params = config['conventional']
|
||||||
|
|
||||||
|
df_list = []
|
||||||
|
|
||||||
|
for i in range(num_plants):
|
||||||
|
plant_id = f"CONV_{i+1:03d}"
|
||||||
|
fuel_type = np.random.choice(fuel_types)
|
||||||
|
|
||||||
|
fuel_params = params[fuel_type]
|
||||||
|
capacity = np.random.uniform(*fuel_params['capacity_range'])
|
||||||
|
marginal_cost = np.random.uniform(*fuel_params['marginal_cost_range'])
|
||||||
|
heat_rate = np.random.uniform(6, 12) if fuel_type == 'gas' else np.random.uniform(8, 14)
|
||||||
|
|
||||||
|
n = len(timestamps)
|
||||||
|
hours = timestamps.hour + timestamps.minute / 60
|
||||||
|
|
||||||
|
if fuel_type == 'nuclear':
|
||||||
|
base_load = 0.9 * capacity
|
||||||
|
generation = base_load + np.random.normal(0, 0.01 * capacity, n)
|
||||||
|
elif fuel_type == 'gas':
|
||||||
|
peaking_pattern = 0.3 + 0.4 * np.sin(2 * np.pi * (hours - 12) / 24)
|
||||||
|
generation = peaking_pattern * capacity + np.random.normal(0, 0.05 * capacity, n)
|
||||||
|
else:
|
||||||
|
baseload_pattern = 0.5 + 0.2 * np.sin(2 * np.pi * hours / 24)
|
||||||
|
generation = baseload_pattern * capacity + np.random.normal(0, 0.03 * capacity, n)
|
||||||
|
|
||||||
|
generation = np.clip(generation, 0, capacity)
|
||||||
|
|
||||||
|
data = pd.DataFrame({
|
||||||
|
'timestamp': timestamps,
|
||||||
|
'plant_id': plant_id,
|
||||||
|
'fuel_type': fuel_type,
|
||||||
|
'generation_mw': generation,
|
||||||
|
'marginal_cost': marginal_cost,
|
||||||
|
'heat_rate': heat_rate
|
||||||
|
})
|
||||||
|
df_list.append(data)
|
||||||
|
|
||||||
|
return pd.concat(df_list, ignore_index=True)
|
||||||
|
|
||||||
|
def generate_data_center_data(config, timestamps):
|
||||||
|
np.random.seed(config['generation']['seed'] + 3)
|
||||||
|
|
||||||
|
num_centers = config['data_sources']['data_centers']['num_centers']
|
||||||
|
params = config['data_center']
|
||||||
|
|
||||||
|
df_list = []
|
||||||
|
locations = ['FR', 'BE', 'DE', 'NL', 'UK']
|
||||||
|
|
||||||
|
for i in range(num_centers):
|
||||||
|
data_center_id = f"DC_{i+1:03d}"
|
||||||
|
location = locations[i % len(locations)]
|
||||||
|
|
||||||
|
base_demand = np.random.uniform(*params['power_demand_range'])
|
||||||
|
price_sensitivity = np.random.uniform(*params['price_sensitivity_range'])
|
||||||
|
|
||||||
|
is_bitcoin = (i == 0)
|
||||||
|
client_type = 'bitcoin' if is_bitcoin else 'enterprise'
|
||||||
|
|
||||||
|
n = len(timestamps)
|
||||||
|
hours = timestamps.hour + timestamps.minute / 60
|
||||||
|
|
||||||
|
if is_bitcoin:
|
||||||
|
base_profile = 0.7 + 0.3 * np.random.randn(n)
|
||||||
|
else:
|
||||||
|
base_profile = 0.6 + 0.2 * np.sin(2 * np.pi * (hours - 12) / 24)
|
||||||
|
|
||||||
|
demand = base_demand * base_profile
|
||||||
|
demand = np.maximum(demand * 0.5, demand)
|
||||||
|
|
||||||
|
max_bid = base_demand * price_sensitivity * (0.8 + 0.4 * np.random.rand(n))
|
||||||
|
|
||||||
|
data = pd.DataFrame({
|
||||||
|
'timestamp': timestamps,
|
||||||
|
'data_center_id': data_center_id,
|
||||||
|
'location': location,
|
||||||
|
'power_demand_mw': demand,
|
||||||
|
'max_bid_price': max_bid,
|
||||||
|
'client_type': client_type
|
||||||
|
})
|
||||||
|
df_list.append(data)
|
||||||
|
|
||||||
|
return pd.concat(df_list, ignore_index=True)
|
||||||
|
|
||||||
|
def apply_noise_and_outliers(df, config):
|
||||||
|
if not config['generation']['add_noise']:
|
||||||
|
return df
|
||||||
|
|
||||||
|
noise_level = config['generation']['noise_level']
|
||||||
|
outlier_rate = config['generation']['outlier_rate']
|
||||||
|
|
||||||
|
for col in df.select_dtypes(include=[np.number]).columns:
|
||||||
|
if col == 'timestamp':
|
||||||
|
continue
|
||||||
|
|
||||||
|
noise = np.random.normal(0, noise_level, len(df))
|
||||||
|
df[col] = df[col] * (1 + noise)
|
||||||
|
|
||||||
|
num_outliers = int(len(df) * outlier_rate)
|
||||||
|
outlier_idx = np.random.choice(len(df), num_outliers, replace=False)
|
||||||
|
df.loc[outlier_idx, col] = df.loc[outlier_idx, col] * np.random.uniform(0.5, 2.0, num_outliers)
|
||||||
|
|
||||||
|
return df
|
||||||
|
|
||||||
|
def add_missing_values(df, config):
|
||||||
|
if not config['generation']['include_missing_values']:
|
||||||
|
return df
|
||||||
|
|
||||||
|
missing_rate = config['generation']['missing_rate']
|
||||||
|
|
||||||
|
for col in df.select_dtypes(include=[np.number]).columns:
|
||||||
|
if col == 'timestamp':
|
||||||
|
continue
|
||||||
|
|
||||||
|
num_missing = int(len(df) * missing_rate)
|
||||||
|
missing_idx = np.random.choice(len(df), num_missing, replace=False)
|
||||||
|
df.loc[missing_idx, col] = np.nan
|
||||||
|
|
||||||
|
return df
|
||||||
|
|
||||||
|
def save_metadata(datasets, output_dir):
|
||||||
|
metadata = {
|
||||||
|
'generated_at': datetime.utcnow().isoformat(),
|
||||||
|
'datasets': {}
|
||||||
|
}
|
||||||
|
|
||||||
|
for name, df in datasets.items():
|
||||||
|
metadata['datasets'][name] = {
|
||||||
|
'rows': len(df),
|
||||||
|
'columns': len(df.columns),
|
||||||
|
'memory_usage_mb': df.memory_usage(deep=True).sum() / 1024 / 1024,
|
||||||
|
'dtypes': {col: str(dtype) for col, dtype in df.dtypes.items()},
|
||||||
|
'columns': list(df.columns)
|
||||||
|
}
|
||||||
|
|
||||||
|
output_path = Path(output_dir) / 'metadata' / 'generation_metadata.json'
|
||||||
|
with open(output_path, 'w') as f:
|
||||||
|
json.dump(metadata, f, indent=2, default=str)
|
||||||
|
|
||||||
|
return metadata
|
||||||
|
|
||||||
|
def main():
|
||||||
|
config = load_config()
|
||||||
|
|
||||||
|
time_config = config['time_range']
|
||||||
|
timestamps = generate_timestamps(
|
||||||
|
time_config['start_date'],
|
||||||
|
time_config['end_date'],
|
||||||
|
time_config['granularity']
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"Generating synthetic data for {len(timestamps)} timestamps...")
|
||||||
|
|
||||||
|
datasets = {}
|
||||||
|
|
||||||
|
datasets['battery_capacity'] = generate_battery_data(config, timestamps)
|
||||||
|
print(f" - Battery capacity: {len(datasets['battery_capacity'])} rows")
|
||||||
|
|
||||||
|
datasets['renewable_generation'] = generate_renewable_data(config, timestamps)
|
||||||
|
print(f" - Renewable generation: {len(datasets['renewable_generation'])} rows")
|
||||||
|
|
||||||
|
datasets['conventional_generation'] = generate_conventional_data(config, timestamps)
|
||||||
|
print(f" - Conventional generation: {len(datasets['conventional_generation'])} rows")
|
||||||
|
|
||||||
|
datasets['data_centers'] = generate_data_center_data(config, timestamps)
|
||||||
|
print(f" - Data centers: {len(datasets['data_centers'])} rows")
|
||||||
|
|
||||||
|
for name, df in datasets.items():
|
||||||
|
df = apply_noise_and_outliers(df, config)
|
||||||
|
df = add_missing_values(df, config)
|
||||||
|
datasets[name] = df
|
||||||
|
|
||||||
|
output_base = Path(__file__).parent.parent / 'data'
|
||||||
|
output_base.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
raw_dir = output_base / 'raw'
|
||||||
|
raw_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
for name, df in datasets.items():
|
||||||
|
file_path = raw_dir / f'{name}_raw.parquet'
|
||||||
|
df.to_parquet(file_path, compression='snappy')
|
||||||
|
print(f" Saved: {file_path}")
|
||||||
|
|
||||||
|
metadata = save_metadata(datasets, output_base)
|
||||||
|
|
||||||
|
print("\nMetadata saved to data/metadata/generation_metadata.json")
|
||||||
|
print(f"Total datasets generated: {len(datasets)}")
|
||||||
|
|
||||||
|
return datasets
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
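The battery generator drives its state of charge with a three-way random walk (charge / idle / discharge, clipped to the pack's capacity). A minimal standalone sketch of that update rule; the `simulate_soc` helper and its parameter values are illustrative, not part of the scripts:

```python
import numpy as np

def simulate_soc(n_steps, capacity_mwh, charge_mw, discharge_mw, seed=0):
    """Random-walk state of charge: each minute, charge/idle/discharge
    with probabilities 0.5/0.2/0.3, clipped to [0, capacity]."""
    rng = np.random.default_rng(seed)
    soc = np.zeros(n_steps)
    soc[0] = 0.5 * capacity_mwh
    for t in range(1, n_steps):
        action = rng.choice([-1, 0, 1], p=[0.3, 0.2, 0.5])
        rate = charge_mw if action > 0 else discharge_mw
        soc[t] = np.clip(soc[t - 1] + action * rate / 60, 0, capacity_mwh)
    return soc

# Four hours of 1-minute steps for a 10 MWh / 2 MW pack.
soc = simulate_soc(240, capacity_mwh=10.0, charge_mw=2.0, discharge_mw=2.0)
```

Because of the final `np.clip`, the trajectory can never leave the physical envelope, which is what keeps the downstream `charge_rate_mw` reconstruction bounded.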
222	scripts/02_fetch_historical.py	Normal file
@@ -0,0 +1,222 @@
"""
|
||||||
|
Fetch historical data for energy trading strategy test data.
|
||||||
|
Handles: electricity prices, bitcoin mining data, load profiles.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import yaml
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
from pathlib import Path
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
import requests
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
|
||||||
|
def load_config():
|
||||||
|
config_path = Path(__file__).parent.parent / "config" / "data_config.yaml"
|
||||||
|
with open(config_path) as f:
|
||||||
|
return yaml.safe_load(f)
|
||||||
|
|
||||||
|
def generate_timestamps(start_date, end_date, granularity):
|
||||||
|
start = pd.to_datetime(start_date)
|
||||||
|
end = pd.to_datetime(end_date)
|
||||||
|
return pd.date_range(start=start, end=end, freq=granularity)
|
||||||
|
|
||||||
|
def fetch_electricity_prices(config, timestamps):
|
||||||
|
np.random.seed(config['generation']['seed'] + 10)
|
||||||
|
|
||||||
|
regions = config['regions']
|
||||||
|
print(f"Fetching electricity prices for {len(regions)} regions...")
|
||||||
|
|
||||||
|
df_list = []
|
||||||
|
|
||||||
|
for region in regions:
|
||||||
|
n = len(timestamps)
|
||||||
|
hours = timestamps.hour + timestamps.minute / 60
|
||||||
|
days = timestamps.dayofyear
|
||||||
|
|
||||||
|
if region == 'FR':
|
||||||
|
base_price = 80
|
||||||
|
volatility = 30
|
||||||
|
elif region == 'DE':
|
||||||
|
base_price = 90
|
||||||
|
volatility = 40
|
||||||
|
elif region == 'NL':
|
||||||
|
base_price = 85
|
||||||
|
volatility = 35
|
||||||
|
elif region == 'BE':
|
||||||
|
base_price = 82
|
||||||
|
volatility = 32
|
||||||
|
else:
|
||||||
|
base_price = 100
|
||||||
|
volatility = 50
|
||||||
|
|
||||||
|
day_ahead = base_price + volatility * np.sin(2 * np.pi * hours / 24) + np.random.normal(0, 10, n)
|
||||||
|
real_time = day_ahead + np.random.normal(0, 20, n)
|
||||||
|
|
||||||
|
price_spikes = np.random.random(n) < 0.02
|
||||||
|
real_time = np.array(real_time)
|
||||||
|
real_time[price_spikes] += np.random.uniform(100, 500, int(np.sum(price_spikes)))
|
||||||
|
|
||||||
|
capacity_price = np.abs(np.random.normal(5, 2, n))
|
||||||
|
regulation_price = np.abs(np.random.normal(3, 1, n))
|
||||||
|
|
||||||
|
volume = np.random.uniform(1000, 5000, n)
|
||||||
|
|
||||||
|
data = pd.DataFrame({
|
||||||
|
'timestamp': timestamps,
|
||||||
|
'region': region,
|
||||||
|
'day_ahead_price': day_ahead,
|
||||||
|
'real_time_price': real_time,
|
||||||
|
'capacity_price': capacity_price,
|
||||||
|
'regulation_price': regulation_price,
|
||||||
|
'volume_mw': volume
|
||||||
|
})
|
||||||
|
df_list.append(data)
|
||||||
|
|
||||||
|
return pd.concat(df_list, ignore_index=True)
|
||||||
|
|
||||||
|
def fetch_bitcoin_mining_data(config, timestamps):
|
||||||
|
np.random.seed(config['generation']['seed'] + 11)
|
||||||
|
|
||||||
|
print(f"Fetching bitcoin mining data from mempool.space (simulated)...")
|
||||||
|
|
||||||
|
n = len(timestamps)
|
||||||
|
|
||||||
|
try:
|
||||||
|
btc_api = "https://mempool.space/api/v1/fees/recommended"
|
||||||
|
response = requests.get(btc_api, timeout=10)
|
||||||
|
if response.status_code == 200:
|
||||||
|
fees = response.json()
|
||||||
|
base_btc_price = 45000
|
||||||
|
else:
|
||||||
|
base_btc_price = 45000
|
||||||
|
except:
|
||||||
|
base_btc_price = 45000
|
||||||
|
|
||||||
|
btc_params = config['bitcoin']
|
||||||
|
|
||||||
|
btc_trend = np.linspace(0.95, 1.05, n)
|
||||||
|
btc_daily_volatility = np.cumsum(np.random.normal(0, 0.01, n)) + 1
|
||||||
|
btc_daily_volatility = btc_daily_volatility / btc_daily_volatility[0]
|
||||||
|
|
||||||
|
btc_price = base_btc_price * btc_trend * btc_daily_volatility * (1 + 0.03 * np.random.randn(n))
|
||||||
|
|
||||||
|
hashrate_base = np.random.uniform(*btc_params['hashrate_range'])
|
||||||
|
hashrate = hashrate_base * (1 + 0.05 * np.sin(2 * np.pi * np.arange(n) / (n / 10))) * (1 + 0.02 * np.random.randn(n))
|
||||||
|
|
||||||
|
electricity_efficiency = np.random.uniform(*btc_params['mining_efficiency_range'])
|
||||||
|
|
||||||
|
btc_price_eur = btc_price * 0.92
|
||||||
|
power_cost_eur = 50
|
||||||
|
mining_profitability = (btc_price_eur * 0.0001 / 3.6) / (electricity_efficiency / 1000)
|
||||||
|
|
||||||
|
electricity_breakeven = (btc_price_eur * 0.0001 / 3.6) / (mining_profitability / 24 * electricity_efficiency / 1000) * 24
|
||||||
|
|
||||||
|
data = pd.DataFrame({
|
||||||
|
'timestamp': timestamps,
|
||||||
|
'pool_id': 'POOL_001',
|
||||||
|
'hashrate_ths': hashrate,
|
||||||
|
'btc_price_usd': btc_price,
|
||||||
|
'mining_profitability': mining_profitability,
|
||||||
|
'electricity_cost': electricity_breakeven
|
||||||
|
})
|
||||||
|
|
||||||
|
return data
|
||||||
|
|
||||||
|
def fetch_load_profiles(config, timestamps):
|
||||||
|
np.random.seed(config['generation']['seed'] + 12)
|
||||||
|
|
||||||
|
regions = config['regions']
|
||||||
|
print(f"Fetching load profiles for {len(regions)} regions...")
|
||||||
|
|
||||||
|
df_list = []
|
||||||
|
|
||||||
|
for region in regions:
|
||||||
|
n = len(timestamps)
|
||||||
|
hours = timestamps.hour + timestamps.minute / 60
|
||||||
|
day_of_year = timestamps.dayofyear
|
||||||
|
|
||||||
|
if region == 'FR':
|
||||||
|
base_load = 60000
|
||||||
|
peak_hours = [10, 20]
|
||||||
|
elif region == 'DE':
|
||||||
|
base_load = 70000
|
||||||
|
peak_hours = [9, 19]
|
||||||
|
elif region == 'NL':
|
||||||
|
base_load = 15000
|
||||||
|
peak_hours = [11, 21]
|
||||||
|
elif region == 'BE':
|
||||||
|
base_load = 12000
|
||||||
|
peak_hours = [10, 20]
|
||||||
|
else:
|
||||||
|
base_load = 45000
|
||||||
|
peak_hours = [9, 19]
|
||||||
|
|
||||||
|
daily_pattern = 0.7 + 0.3 * np.exp(-0.5 * ((hours - 18) / 4) ** 2)
|
||||||
|
seasonal_pattern = 0.8 + 0.2 * np.sin(2 * np.pi * (day_of_year - 15) / 365)
|
||||||
|
|
||||||
|
load = base_load * daily_pattern * seasonal_pattern * (1 + 0.05 * np.random.randn(n))
|
||||||
|
|
||||||
|
forecast = load * (1 + np.random.normal(0, 0.03, n))
|
||||||
|
|
||||||
|
temp = 15 + 15 * np.sin(2 * np.pi * (day_of_year - 15) / 365) + np.random.normal(0, 3, n)
|
||||||
|
humidity = 60 + 20 * np.sin(2 * np.pi * (day_of_year - 15) / 365) + np.random.normal(0, 10, n)
|
||||||
|
|
||||||
|
data = pd.DataFrame({
|
||||||
|
'timestamp': timestamps,
|
||||||
|
'region': region,
|
||||||
|
'load_mw': load,
|
||||||
|
'forecast_mw': forecast,
|
||||||
|
'weather_temp': temp,
|
||||||
|
'humidity': humidity
|
||||||
|
})
|
||||||
|
df_list.append(data)
|
||||||
|
|
||||||
|
return pd.concat(df_list, ignore_index=True)
|
||||||
|
|
||||||
|
def save_raw_data(datasets, output_dir):
|
||||||
|
output_path = Path(output_dir) / 'raw'
|
||||||
|
output_path.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
saved = {}
|
||||||
|
for name, df in datasets.items():
|
||||||
|
file_path = output_path / f'{name}_raw.parquet'
|
||||||
|
df.to_parquet(file_path, compression='snappy')
|
||||||
|
saved[name] = str(file_path)
|
||||||
|
print(f" Saved: {file_path}")
|
||||||
|
|
||||||
|
return saved
|
||||||
|
|
||||||
|
def main():
|
||||||
|
config = load_config()
|
||||||
|
|
||||||
|
time_config = config['time_range']
|
||||||
|
timestamps = generate_timestamps(
|
||||||
|
time_config['start_date'],
|
||||||
|
time_config['end_date'],
|
||||||
|
time_config['granularity']
|
||||||
|
)
|
||||||
|
|
||||||
|
print(f"Fetching historical data for {len(timestamps)} timestamps...")
|
||||||
|
|
||||||
|
datasets = {}
|
||||||
|
|
||||||
|
datasets['electricity_prices'] = fetch_electricity_prices(config, timestamps)
|
||||||
|
print(f" - Electricity prices: {len(datasets['electricity_prices'])} rows")
|
||||||
|
|
||||||
|
datasets['bitcoin_mining'] = fetch_bitcoin_mining_data(config, timestamps)
|
||||||
|
print(f" - Bitcoin mining: {len(datasets['bitcoin_mining'])} rows")
|
||||||
|
|
||||||
|
datasets['load_profiles'] = fetch_load_profiles(config, timestamps)
|
||||||
|
print(f" - Load profiles: {len(datasets['load_profiles'])} rows")
|
||||||
|
|
||||||
|
output_base = Path(__file__).parent.parent / 'data'
|
||||||
|
saved_files = save_raw_data(datasets, output_base)
|
||||||
|
|
||||||
|
print(f"\nSaved {len(datasets)} historical datasets to data/raw/")
|
||||||
|
|
||||||
|
return datasets
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
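`fetch_electricity_prices` layers a ~2% Bernoulli mask of 100-500 EUR/MWh scarcity spikes on top of the noisy real-time series. The same mechanism in isolation; series length, seed, and coefficients are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
minutes = np.arange(n)

# Smooth daily day-ahead curve plus noise, then a noisier real-time series.
day_ahead = 80 + 30 * np.sin(2 * np.pi * minutes / 1440) + rng.normal(0, 10, n)
real_time = day_ahead + rng.normal(0, 20, n)

# ~2% of intervals receive an additive 100-500 EUR/MWh scarcity spike.
spikes = rng.random(n) < 0.02
real_time[spikes] += rng.uniform(100, 500, spikes.sum())
```

Keeping the spikes additive (rather than multiplicative) means the spike magnitude is independent of the underlying price level, which matches how the script applies them.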
172	scripts/03_process_merge.py	Normal file
@@ -0,0 +1,172 @@
"""
|
||||||
|
Process and merge all datasets, apply compression, and save to Parquet format.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import yaml
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
from pathlib import Path
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
|
||||||
|
def load_config():
|
||||||
|
config_path = Path(__file__).parent.parent / "config" / "data_config.yaml"
|
||||||
|
with open(config_path) as f:
|
||||||
|
return yaml.safe_load(f)
|
||||||
|
|
||||||
|
def load_dataset(dataset_name, data_base):
|
||||||
|
synthetic_path = data_base / 'metadata' / 'generation_metadata.json'
|
||||||
|
|
||||||
|
df_list = []
|
||||||
|
|
||||||
|
raw_path = data_base / 'raw' / f'{dataset_name}_raw.parquet'
|
||||||
|
if raw_path.exists():
|
||||||
|
print(f" Loading {dataset_name} from raw data...")
|
||||||
|
df = pd.read_parquet(raw_path)
|
||||||
|
df_list.append(df)
|
||||||
|
|
||||||
|
print(f" Total rows for {dataset_name}: {len(pd.concat(df_list, ignore_index=True)) if df_list else 0}")
|
||||||
|
|
||||||
|
return pd.concat(df_list, ignore_index=True) if df_list else None
|
||||||
|
|
||||||
|
def downgrade_precision(df, config):
|
||||||
|
precision = config['output'].get('precision', 'float32')
|
||||||
|
|
||||||
|
for col in df.select_dtypes(include=['float64']).columns:
|
||||||
|
if col == 'timestamp':
|
||||||
|
continue
|
||||||
|
df[col] = df[col].astype(precision)
|
||||||
|
|
||||||
|
for col in df.select_dtypes(include=['int64']).columns:
|
||||||
|
if col == 'timestamp':
|
||||||
|
continue
|
||||||
|
df[col] = df[col].astype('int32')
|
||||||
|
|
||||||
|
return df
|
||||||
|
|
||||||
|
def convert_categoricals(df):
|
||||||
|
for col in df.select_dtypes(include=['object']).columns:
|
||||||
|
if col == 'timestamp':
|
||||||
|
continue
|
||||||
|
if df[col].nunique() < df.shape[0] * 0.5:
|
||||||
|
df[col] = df[col].astype('category')
|
||||||
|
|
||||||
|
return df
|
||||||
|
|
||||||
|
def optimize_memory(df):
|
||||||
|
start_mem = df.memory_usage(deep=True).sum() / 1024 / 1024
|
||||||
|
|
||||||
|
df = downgrade_precision(df, {'output': {'precision': 'float32'}})
|
||||||
|
df = convert_categoricals(df)
|
||||||
|
|
||||||
|
end_mem = df.memory_usage(deep=True).sum() / 1024 / 1024
|
||||||
|
|
||||||
|
reduction = (1 - end_mem / start_mem) * 100
|
||||||
|
print(f" Memory: {start_mem:.2f}MB -> {end_mem:.2f}MB ({reduction:.1f}% reduction)")
|
||||||
|
|
||||||
|
return df
|
||||||
|
|
||||||
|
def save_processed_dataset(df, dataset_name, output_dir, config):
|
||||||
|
output_path = Path(output_dir) / f'{dataset_name}.parquet'
|
||||||
|
|
||||||
|
compression = config['output'].get('compression', 'snappy')
|
||||||
|
|
||||||
|
df.to_parquet(output_path, compression=compression, index=False)
|
||||||
|
|
||||||
|
file_size_mb = output_path.stat().st_size / 1024 / 1024
|
||||||
|
print(f" Saved: {output_path} ({file_size_mb:.2f}MB)")
|
||||||
|
|
||||||
|
return {
|
||||||
|
'path': str(output_path),
|
||||||
|
'size_mb': file_size_mb,
|
||||||
|
'rows': len(df),
|
||||||
|
'columns': len(df.columns)
|
||||||
|
}
|
||||||
|
|
||||||
|
def validate_timestamps(df, dataset_name):
|
||||||
|
if 'timestamp' not in df.columns:
|
||||||
|
print(f" Warning: {dataset_name} has no timestamp column")
|
||||||
|
return False
|
||||||
|
|
||||||
|
df['timestamp'] = pd.to_datetime(df['timestamp'])
|
||||||
|
duplicates = df['timestamp'].duplicated().sum()
|
||||||
|
|
||||||
|
if duplicates > 0:
|
||||||
|
print(f" Warning: {dataset_name} has {duplicates} duplicate timestamps")
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
def generate_final_metadata(processed_info, output_dir):
|
||||||
|
metadata = {
|
||||||
|
'processed_at': pd.Timestamp.utcnow().isoformat(),
|
||||||
|
'total_datasets': len(processed_info),
|
||||||
|
'total_size_mb': sum(info['size_mb'] for info in processed_info.values()),
|
||||||
|
'datasets': processed_info
|
||||||
|
}
|
||||||
|
|
||||||
|
output_path = Path(output_dir) / 'metadata' / 'final_metadata.json'
|
||||||
|
with open(output_path, 'w') as f:
|
||||||
|
json.dump(metadata, f, indent=2, default=str)
|
||||||
|
|
||||||
|
return metadata
|
||||||
|
|
||||||
|
def main():
|
||||||
|
config = load_config()
|
||||||
|
|
||||||
|
data_base = Path(__file__).parent.parent / 'data'
|
||||||
|
processed_dir = data_base / 'processed'
|
||||||
|
processed_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
print("Processing and merging datasets...")
|
||||||
|
|
||||||
|
datasets = [
|
||||||
|
'electricity_prices',
|
||||||
|
'battery_capacity',
|
||||||
|
'renewable_generation',
|
||||||
|
'conventional_generation',
|
||||||
|
'load_profiles',
|
||||||
|
'data_centers',
|
||||||
|
'bitcoin_mining'
|
||||||
|
]
|
||||||
|
|
||||||
|
processed_info = {}
|
||||||
|
|
||||||
|
for dataset_name in datasets:
|
||||||
|
print(f"\nProcessing {dataset_name}...")
|
||||||
|
|
||||||
|
df = load_dataset(dataset_name, data_base)
|
||||||
|
|
||||||
|
if df is None:
|
||||||
|
print(f" Warning: {dataset_name} has no data, skipping")
|
||||||
|
continue
|
||||||
|
|
||||||
|
validate_timestamps(df, dataset_name)
|
||||||
|
|
||||||
|
print(" Optimizing memory...")
|
||||||
|
df = optimize_memory(df)
|
||||||
|
|
||||||
|
info = save_processed_dataset(df, dataset_name, processed_dir, config)
|
||||||
|
processed_info[dataset_name] = info
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print("Processing complete!")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
metadata = generate_final_metadata(processed_info, data_base)
|
||||||
|
|
||||||
|
print(f"\nTotal datasets processed: {len(processed_info)}")
|
||||||
|
print(f"Total size: {metadata['total_size_mb']:.2f}MB")
|
||||||
|
print(f"Target size: {config['output']['target_size_mb']}MB")
|
||||||
|
|
||||||
|
if metadata['total_size_mb'] > config['output']['target_size_mb']:
|
||||||
|
print(f"Warning: Total size exceeds target by {metadata['total_size_mb'] - config['output']['target_size_mb']:.2f}MB")
|
||||||
|
else:
|
||||||
|
print("✓ Total size within target")
|
||||||
|
|
||||||
|
print(f"\nProcessed data saved to: {processed_dir}")
|
||||||
|
print(f"Metadata saved to: {data_base / 'metadata' / 'final_metadata.json'}")
|
||||||
|
|
||||||
|
return processed_info
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
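`optimize_memory` relies on two standard pandas conversions: downcasting `float64` columns to `float32` and turning repetitive `object` columns into `category`. A small self-contained demonstration of the effect; column names and sizes are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'price': rng.normal(80, 30, 100_000),                           # float64 by default
    'region': rng.choice(['FR', 'DE', 'NL', 'BE', 'UK'], 100_000),  # object dtype
})
before = df.memory_usage(deep=True).sum()

# The two conversions optimize_memory applies:
df['price'] = df['price'].astype('float32')     # halves the numeric column
df['region'] = df['region'].astype('category')  # 5 labels + small integer codes
after = df.memory_usage(deep=True).sum()

print(f"{before / 1e6:.1f}MB -> {after / 1e6:.1f}MB")
```

The categorical conversion dominates the savings here: per-row Python strings collapse to a shared label table plus compact codes, which is why the script only converts columns where `nunique()` is well below the row count.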
272	scripts/04_validate.py	Normal file
@@ -0,0 +1,272 @@
"""
|
||||||
|
Validate processed datasets for quality, missing values, and data consistency.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import yaml
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
from pathlib import Path
|
||||||
|
import json
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
def load_config():
|
||||||
|
config_path = Path(__file__).parent.parent / "config" / "data_config.yaml"
|
||||||
|
with open(config_path) as f:
|
||||||
|
return yaml.safe_load(f)
|
||||||
|
|
||||||
|
def load_schema():
|
||||||
|
schema_path = Path(__file__).parent.parent / "config" / "schema.yaml"
|
||||||
|
with open(schema_path) as f:
|
||||||
|
return yaml.safe_load(f)
|
||||||
|
|
||||||
|
def load_processed_dataset(dataset_name, data_dir):
|
||||||
|
file_path = Path(data_dir) / 'processed' / f'{dataset_name}.parquet'
|
||||||
|
if file_path.exists():
|
||||||
|
return pd.read_parquet(file_path)
|
||||||
|
return None
|
||||||
|
|
||||||
|
def check_missing_values(df, dataset_name):
|
||||||
|
missing_info = {}
|
||||||
|
|
||||||
|
for col in df.columns:
|
||||||
|
missing_count = df[col].isna().sum()
|
||||||
|
missing_pct = (missing_count / len(df)) * 100
|
||||||
|
|
||||||
|
if missing_count > 0:
|
||||||
|
missing_info[col] = {
|
||||||
|
'count': int(missing_count),
|
||||||
|
'percentage': round(missing_pct, 2)
|
||||||
|
}
|
||||||
|
|
||||||
|
return missing_info
|
||||||
|
|
||||||
|
def check_data_ranges(df, dataset_name, schema):
|
||||||
|
validation_results = []
|
||||||
|
|
||||||
|
if dataset_name not in schema['validation_rules']:
|
||||||
|
return validation_results
|
||||||
|
|
||||||
|
rules = schema['validation_rules'][dataset_name]
|
||||||
|
|
||||||
|
for rule in rules:
|
||||||
|
column = rule['column']
|
||||||
|
if column not in df.columns:
|
||||||
|
continue
|
||||||
|
|
||||||
|
col_data = df[column].dropna()
|
||||||
|
|
||||||
|
if 'min' in rule:
|
||||||
|
violations = (col_data < rule['min']).sum()
|
||||||
|
if violations > 0:
|
||||||
|
validation_results.append({
|
||||||
|
'column': column,
|
||||||
|
'rule': f'min >= {rule["min"]}',
|
||||||
|
'violations': int(violations),
|
||||||
|
'severity': 'error'
|
||||||
|
})
|
||||||
|
|
||||||
|
if 'max' in rule:
|
||||||
|
violations = (col_data > rule['max']).sum()
|
||||||
|
if violations > 0:
|
||||||
|
validation_results.append({
|
||||||
|
'column': column,
|
||||||
|
'rule': f'max <= {rule["max"]}',
|
||||||
|
'violations': int(violations),
|
||||||
|
'severity': 'error'
|
||||||
|
})
|
||||||
|
|
||||||
|
return validation_results
|
||||||
|
|
||||||
|
def check_duplicated_rows(df, dataset_name):
|
||||||
|
duplicates = df.duplicated().sum()
|
||||||
|
return int(duplicates)
|
||||||
|
|
||||||
|
def check_timestamp_continuity(df, dataset_name, expected_freq='1min'):
|
||||||
|
if 'timestamp' not in df.columns:
|
||||||
|
return {'status': 'skipped', 'reason': 'no timestamp column'}
|
||||||
|
|
||||||
|
df_sorted = df.sort_values('timestamp')
|
||||||
|
time_diffs = df_sorted['timestamp'].diff().dropna()
|
||||||
|
|
||||||
|
expected_diff = pd.Timedelta(expected_freq)
|
||||||
|
missing_gaps = time_diffs[time_diffs > expected_diff * 1.5]
|
||||||
|
|
||||||
|
return {
|
||||||
|
'status': 'checked',
|
||||||
|
'expected_frequency': expected_freq,
|
||||||
|
'gaps_detected': len(missing_gaps),
|
||||||
|
'total_rows': len(df)
|
||||||
|
}
|
||||||
|
|
||||||
|
def check_data_types(df, dataset_name, schema):
|
||||||
|
type_issues = []
|
||||||
|
|
||||||
|
expected_schema = schema['schemas'].get(dataset_name, {})
|
||||||
|
expected_columns = {col['name']: col['type'] for col in expected_schema.get('columns', [])}
|
||||||
|
|
||||||
|
for col, expected_type in expected_columns.items():
|
||||||
|
if col not in df.columns:
|
||||||
|
type_issues.append({
|
||||||
|
'column': col,
|
||||||
|
'issue': 'missing',
|
||||||
|
'expected': expected_type
|
||||||
|
})
|
||||||
|
elif expected_type == 'datetime64[ns]':
|
||||||
|
if not pd.api.types.is_datetime64_any_dtype(df[col]):
|
||||||
|
type_issues.append({
|
||||||
|
'column': col,
|
||||||
|
'issue': 'wrong_type',
|
||||||
|
'expected': 'datetime',
|
||||||
|
'actual': str(df[col].dtype)
                })
            elif expected_type == 'category':
                # isinstance check replaces the deprecated pd.api.types.is_categorical_dtype
                if not isinstance(df[col].dtype, pd.CategoricalDtype):
                    type_issues.append({
                        'column': col,
                        'issue': 'wrong_type',
                        'expected': 'category',
                        'actual': str(df[col].dtype)
                    })
            elif expected_type == 'float32':
                # Accept float64 too: downcasting to float32 is an optimization, not a requirement
                if df[col].dtype not in ['float32', 'float64']:
                    type_issues.append({
                        'column': col,
                        'issue': 'wrong_type',
                        'expected': 'float32',
                        'actual': str(df[col].dtype)
                    })

    return type_issues


def validate_dataset(df, dataset_name, schema):
    """Run all checks against a single dataset and summarize the results."""
    results = {
        'dataset': dataset_name,
        'rows': len(df),
        'columns': len(df.columns),
        'memory_mb': round(df.memory_usage(deep=True).sum() / 1024 / 1024, 2),
        'missing_values': check_missing_values(df, dataset_name),
        'duplicated_rows': check_duplicated_rows(df, dataset_name),
        'timestamp_continuity': check_timestamp_continuity(df, dataset_name),
        'data_ranges': check_data_ranges(df, dataset_name, schema),
        'data_types': check_data_types(df, dataset_name, schema)
    }

    # Range violations marked as errors plus any type mismatch count as hard errors
    error_count = (
        sum(1 for v in results['data_ranges'] if v.get('severity') == 'error') +
        len(results['data_types'])
    )

    results['status'] = 'pass' if error_count == 0 else 'warning' if error_count < 10 else 'fail'

    return results


def generate_validation_report(all_results, output_dir):
    total_errors = sum(1 for r in all_results if r['status'] == 'fail')
    total_warnings = sum(1 for r in all_results if r['status'] == 'warning')
    total_pass = sum(1 for r in all_results if r['status'] == 'pass')

    # Use .get() with defaults: entries for missing datasets carry only
    # 'dataset'/'status'/'error' keys and would otherwise raise KeyError
    total_size_mb = sum(r.get('memory_mb', 0) for r in all_results)
    total_rows = sum(r.get('rows', 0) for r in all_results)

    report = {
        'generated_at': datetime.utcnow().isoformat(),
        'summary': {
            'total_datasets': len(all_results),
            'passed': total_pass,
            'warnings': total_warnings,
            'failed': total_errors,
            'total_size_mb': round(total_size_mb, 2),
            'total_rows': total_rows
        },
        'datasets': all_results
    }

    output_path = Path(output_dir) / 'metadata' / 'validation_report.json'
    # Ensure the metadata directory exists before writing
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, 'w') as f:
        json.dump(report, f, indent=2, default=str)

    return report


def print_summary(report):
    print(f"\n{'='*60}")
    print("VALIDATION SUMMARY")
    print(f"{'='*60}")
    print(f"Datasets processed: {report['summary']['total_datasets']}")
    print(f"  ✓ Passed: {report['summary']['passed']}")
    print(f"  ⚠ Warnings: {report['summary']['warnings']}")
    print(f"  ✗ Failed: {report['summary']['failed']}")
    print(f"\nTotal size: {report['summary']['total_size_mb']:.2f}MB")
    print(f"Total rows: {report['summary']['total_rows']:,}")

    print(f"\n{'='*60}")
    print("PER-DATASET DETAILS")
    print(f"{'='*60}")

    for result in report['datasets']:
        # Entries for datasets that failed to load have no metrics; report and skip
        if result['status'] == 'error':
            print(f"\n✗ {result['dataset']}: {result.get('error', 'unknown error')}")
            continue

        status_icon = '✓' if result['status'] == 'pass' else '⚠' if result['status'] == 'warning' else '✗'
        print(f"\n{status_icon} {result['dataset']}")
        print(f"  Rows: {result['rows']:,} | Columns: {result['columns']} | Size: {result['memory_mb']:.2f}MB")

        if result['missing_values']:
            print(f"  Missing values: {len(result['missing_values'])} columns")

        if result['data_ranges']:
            print(f"  Range violations: {len(result['data_ranges'])}")

        if result['data_types']:
            print(f"  Type issues: {len(result['data_types'])}")

        if result['timestamp_continuity']['status'] == 'checked':
            if result['timestamp_continuity']['gaps_detected'] > 0:
                print(f"  Time gaps: {result['timestamp_continuity']['gaps_detected']}")


def main():
    config = load_config()
    schema = load_schema()

    data_dir = Path(__file__).parent.parent / 'data'

    datasets = [
        'electricity_prices',
        'battery_capacity',
        'renewable_generation',
        'conventional_generation',
        'load_profiles',
        'data_centers',
        'bitcoin_mining'
    ]

    print("Validating processed datasets...\n")

    all_results = []

    for dataset_name in datasets:
        print(f"Validating {dataset_name}...")

        df = load_processed_dataset(dataset_name, data_dir)

        if df is None:
            print("  ✗ Dataset not found, skipping")
            all_results.append({
                'dataset': dataset_name,
                'status': 'error',
                'error': 'Dataset file not found'
            })
            continue

        result = validate_dataset(df, dataset_name, schema)
        all_results.append(result)

        status_icon = '✓' if result['status'] == 'pass' else '⚠' if result['status'] == 'warning' else '✗'
        print(f"  {status_icon} {result['rows']:,} rows, {result['columns']} cols, {result['memory_mb']:.2f}MB")

    report = generate_validation_report(all_results, data_dir)
    print_summary(report)

    print(f"\n{'='*60}")
    print(f"Validation report saved to: {data_dir / 'metadata' / 'validation_report.json'}")

    return report


if __name__ == '__main__':
    main()