Initial commit

Author: git
Date: 2026-01-03 22:05:49 +07:00
Commit: 2f8859dbe8
63 changed files with 6708 additions and 0 deletions

97
.gitignore vendored Normal file

@@ -0,0 +1,97 @@
# Security: Sensitive Files and Credentials
# These patterns prevent accidental commits of sensitive data
# Environment variables
.env
.env.local
.env.*.local
# Configuration files with credentials
config.*.yaml
!config.example.yaml
!config.quickstart.yaml
!config.test.yaml
# Logs (may contain sensitive information)
logs/
*.log
# Reports and analysis output
reports/
investigation_reports/
analysis/
# IDE and editor files
.vscode/
.idea/
*.swp
*.swo
*~
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Virtual environments
venv/
ENV/
env/
# Testing
.pytest_cache/
.coverage
htmlcov/
# OS
.DS_Store
Thumbs.db
# Temporary files
*.tmp
*.bak
*.backup
*~
# Database files
*.db
*.sqlite
*.sqlite3
# Docker
.dockerignore
docker-compose.override.yml
# Credentials and secrets (CRITICAL)
**/secrets/
**/credentials/
**/.aws/
**/.azure/
**/.gcp/
**/private_key*
**/secret_key*
**/api_key*
**/token*
**/password*
# Configuration with real values
config.prod.yaml
config.production.yaml
config.live.yaml

21
LICENSE Executable file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2024 QA Engineering Team
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

741
README.md Executable file

@@ -0,0 +1,741 @@
# Data Regression Testing Framework
A comprehensive framework for validating data integrity during code migration and system updates by comparing data outputs between Baseline (Production) and Target (Test) SQL Server databases.
## ✨ Features
- **Automated Discovery** - Scan databases and auto-generate configuration files
- **Multiple Comparison Types** - Row counts, schema validation, aggregate sums
- **Investigation Queries** - Execute diagnostic SQL queries from regression analysis
- **Flexible Configuration** - YAML-based setup with extensive customization
- **Rich Reporting** - HTML, CSV, and PDF reports with detailed results
- **Windows Authentication** - Secure, credential-free database access
- **Read-Only Operations** - All queries are SELECT-only for safety
- **Comprehensive Logging** - Detailed execution logs with timestamps
## 🚀 Quick Start
### Prerequisites
- Python 3.9+
- Microsoft ODBC Driver 17+ for SQL Server
- Windows environment with domain authentication (or Linux with Kerberos)
- Read access to SQL Server databases
### Installation
```bash
# Clone the repository
git clone <repository-url>
cd data_regression_testing
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the framework
pip install -e .
# Verify installation
drt --version
```
### Basic Usage
```bash
# 1. Discover tables from baseline database
drt discover --server <YOUR_SERVER> --database <YOUR_BASELINE_DB> --output config.yaml
# 2. Edit config.yaml to add target database connection
# 3. Validate configuration
drt validate --config config.yaml
# 4. Run comparison
drt compare --config config.yaml
# 5. (Optional) Investigate regression issues
drt investigate --analysis-dir analysis/output_<TIMESTAMP>/ --config config.yaml
```
## 📦 Platform-Specific Installation
### Windows
1. Install Python 3.9+ from https://www.python.org/downloads/
2. ODBC Driver is usually pre-installed on Windows
3. Install Framework:
```cmd
python -m venv venv
venv\Scripts\activate
pip install -e .
```
### Linux (Debian/Ubuntu)
```bash
# Install ODBC Driver
curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | sudo gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg
curl https://packages.microsoft.com/config/debian/12/prod.list | sudo tee /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18 unixodbc-dev
# Install Kerberos for Windows Authentication
sudo apt-get install -y krb5-user
# Configure /etc/krb5.conf with your domain settings
# Then obtain ticket: kinit username@YOUR_DOMAIN.COM
# Install framework
python3 -m venv venv
source venv/bin/activate
pip install -e .
```
## 📋 Commands
### Discovery
Automatically scan databases and generate configuration files.
```bash
drt discover --server <YOUR_SERVER> --database <YOUR_DATABASE> [OPTIONS]
```
**Options:**
- `--server TEXT` - SQL Server hostname (required)
- `--database TEXT` - Database name (required)
- `--output, -o TEXT` - Output file (default: config_discovered.yaml)
- `--schemas TEXT` - Specific schemas to include
- `--verbose, -v` - Enable verbose output
### Validate
Validate configuration file syntax and database connectivity.
```bash
drt validate --config <CONFIG_FILE> [OPTIONS]
```
**Options:**
- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
### Compare
Execute data comparison between baseline and target databases.
```bash
drt compare --config <CONFIG_FILE> [OPTIONS]
```
**Options:**
- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be compared without executing
### Investigate
Execute diagnostic queries from regression analysis.
```bash
drt investigate --analysis-dir <ANALYSIS_DIR> --config <CONFIG_FILE> [OPTIONS]
```
**Options:**
- `--analysis-dir, -a PATH` - Analysis output directory containing `*_investigate.sql` files (required)
- `--config, -c PATH` - Configuration file (required)
- `--output-dir, -o PATH` - Output directory for reports (default: ./investigation_reports)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be executed without running
**Example:**
```bash
drt investigate -a analysis/output_20251209_184032/ -c config.yaml
drt investigate -a analysis/output_20251209_184032/ -c config.yaml -o ./my_reports
```
**What it does** (see the sketch after this list):
- Discovers all `*_investigate.sql` files in the analysis directory
- Parses SQL files (handles markdown, multiple queries per file)
- Executes queries on both baseline and target databases
- Handles errors gracefully (continues on failures)
- Generates HTML and CSV reports with side-by-side comparisons
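The framework's own parser lives in `src/drt/services/sql_parser.py` and is not reproduced here; the sketch below only illustrates the discover-and-split behaviour described above, with hypothetical helper names:
```python
from pathlib import Path

def discover_investigation_files(analysis_dir: str) -> list:
    # Recursively collect every *_investigate.sql file under the analysis output.
    return sorted(Path(analysis_dir).rglob("*_investigate.sql"))

def split_queries(sql_text: str) -> list:
    # Skip markdown code-fence lines that sometimes wrap generated SQL, then
    # split statements on GO batch separators or trailing semicolons.
    queries, current = [], []
    for line in sql_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("`"):
            continue
        if stripped.upper() == "GO":
            if current:
                queries.append("\n".join(current).strip())
                current = []
            continue
        current.append(line)
        if stripped.endswith(";"):
            queries.append("\n".join(current).strip())
            current = []
    if any(l.strip() for l in current):
        queries.append("\n".join(current).strip())
    return [q for q in queries if q]

for sql_file in discover_investigation_files("analysis/output_20251209_184032"):
    print(sql_file, "->", len(split_queries(sql_file.read_text(encoding="utf-8"))), "queries")
```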
## ⚙️ Configuration
### Database Connections
```yaml
database_pairs:
  - name: "DWH_Comparison"
    enabled: true
    baseline:
      server: "<YOUR_SERVER>"
      database: "<YOUR_BASELINE_DB>"
      timeout:
        connection: 30
        query: 300
    target:
      server: "<YOUR_SERVER>"
      database: "<YOUR_TARGET_DB>"
```
### Comparison Settings
```yaml
comparison:
  mode: "health_check"  # or "full"
  row_count:
    enabled: true
    tolerance_percent: 0.0
  schema:
    enabled: true
    checks:
      column_names: true
      data_types: true
  aggregates:
    enabled: true
    tolerance_percent: 0.01
```
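How `tolerance_percent` is most naturally interpreted is a relative difference against the baseline value; the framework's exact formula lives in its checkers, so treat this as an illustrative assumption:
```python
def within_tolerance(baseline: float, target: float, tolerance_percent: float):
    # Difference is measured relative to the baseline; a zero baseline only
    # passes when the target is also zero.
    if baseline == 0:
        return target == 0, 0.0 if target == 0 else 100.0
    diff_percent = abs(target - baseline) / abs(baseline) * 100.0
    return diff_percent <= tolerance_percent, diff_percent

print(within_tolerance(1000, 1000, 0.0))           # row counts, exact match -> (True, 0.0)
print(within_tolerance(1000, 999, 0.0))            # one missing row -> (False, 0.1)
print(within_tolerance(12345.67, 12345.68, 0.01))  # aggregate rounding noise -> passes
```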
### Table Configuration
```yaml
tables:
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Amount"
      - "Quantity"
```
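The `aggregate_columns` list drives the aggregate-sum check; one plausible way to turn it into a single SELECT is shown below (the framework's actual query builder may differ):
```python
def build_aggregate_query(schema: str, table: str, aggregate_columns: list) -> str:
    # One SUM per configured numeric column, computed in a single table scan.
    select_list = ",\n       ".join(f"SUM([{c}]) AS [sum_{c}]" for c in aggregate_columns)
    return f"SELECT {select_list}\nFROM [{schema}].[{table}];"

print(build_aggregate_query("dbo", "FactTable1", ["Amount", "Quantity"]))
```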
### Output Directories
```yaml
reporting:
  output_dir: "./reports"
  investigation_dir: "./investigation_reports"
logging:
  output_dir: "./logs"
discovery:
  analysis_directory: "./analysis"
```
**Benefits:**
- Centralized storage of all output files
- Easy cleanup and management of generated files
- Configuration flexibility via YAML
- Backward compatibility with CLI overrides (see the sketch below)
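A minimal sketch of the precedence implied above (CLI flag over YAML value over built-in default); the names are illustrative, not the framework's internals:
```python
from pathlib import Path

DEFAULT_REPORT_DIR = "./reports"

def resolve_output_dir(cli_value, config_value) -> Path:
    # CLI override wins, then the config.yaml setting, then the default.
    chosen = Path(cli_value or config_value or DEFAULT_REPORT_DIR)
    chosen.mkdir(parents=True, exist_ok=True)
    return chosen

print(resolve_output_dir(None, "./reports"))            # value from config.yaml
print(resolve_output_dir("./my_reports", "./reports"))  # CLI flag overrides config
```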
## 📊 Reports
### Comparison Reports
The framework generates comprehensive reports in multiple formats:
- **HTML Report** - Visual summary with color-coded results and detailed breakdowns
- **CSV Report** - Machine-readable format for Excel or databases
- **PDF Report** - Professional formatted output (requires weasyprint)
Reports are saved to `./reports/` with timestamps.
### Investigation Reports
- **HTML Report** - Interactive report with collapsible query results, side-by-side baseline vs target comparison
- **CSV Report** - Flattened structure with one row per query execution
Investigation reports are saved to `./investigation_reports/` with timestamps.
## 🔄 Exit Codes
| Code | Meaning |
|------|---------|
| 0 | Success - all comparisons passed |
| 1 | Failures detected - one or more FAIL results |
| 2 | Execution error - configuration or connection issues |
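A CI job or wrapper script can branch on these codes; a minimal sketch, assuming `drt` is installed on the PATH:
```python
import subprocess
import sys

def run_regression(config_path: str) -> int:
    # drt's exit code carries the overall result: 0 pass, 1 failures, 2 execution error.
    result = subprocess.run(["drt", "compare", "--config", config_path])
    if result.returncode == 0:
        print("All comparisons passed")
    elif result.returncode == 1:
        print("Failures detected - review the HTML report under ./reports/")
    else:
        print("Execution error - check configuration and connectivity")
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_regression("config.yaml"))
```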
## 🧪 Testing
### Docker Test Environment
```bash
# Start test SQL Server containers
bash test_data/setup_test_environment.sh
# Test discovery
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
# Test comparison
drt compare --config config.test.yaml
# Cleanup
docker-compose -f docker-compose.test.yml down -v
```
### Manual Testing
```bash
# Connect to test databases (use SA_PASSWORD environment variable)
docker exec -it drt-sqlserver-baseline /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD"
# Run queries to verify data
SELECT COUNT(*) FROM dbo.FactTable1;
```
## 🚢 Deployment
### Scheduled Execution
**Windows Task Scheduler:**
```batch
@echo off
cd C:\path\to\framework
call venv\Scripts\activate.bat
drt compare --config config.yaml
if %ERRORLEVEL% NEQ 0 (
echo Test failed with exit code %ERRORLEVEL%
exit /b %ERRORLEVEL%
)
```
**Linux Cron:**
```bash
# Run daily at 2 AM
0 2 * * * /path/to/venv/bin/drt compare --config /path/to/config.yaml >> /path/to/logs/cron.log 2>&1
```
### Monitoring
```bash
# Watch logs
tail -f logs/drt_*.log
# Search for failures
grep -i "FAIL\|ERROR" logs/drt_*.log
```
## 🏗️ Architecture
```
src/drt/
├── cli/                     # Command-line interface
│   └── commands/            # CLI commands (compare, discover, validate, investigate)
├── config/                  # Configuration management
├── database/                # Database connectivity (READ ONLY)
├── models/                  # Data models
├── reporting/               # Report generators
├── services/                # Business logic
│   ├── checkers/            # Comparison checkers
│   ├── investigation.py     # Investigation service
│   └── sql_parser.py        # SQL file parser
└── utils/                   # Utilities
```
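The `models/` package defines the result objects that checkers produce and reporters render. The real definitions are not shown in this README; the stand-ins below only mirror the attributes used by the testing examples later in this document:
```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    WARNING = "WARNING"
    ERROR = "ERROR"

@dataclass
class CheckResult:
    status: Status
    baseline_value: Any
    target_value: Any
    message: str = ""

result = CheckResult(Status.FAIL, baseline_value=1000, target_value=998,
                     message="Row count mismatch on dbo.FactTable1")
print(result.status.value, result.baseline_value, result.target_value)
```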
## 🔒 Security
- **Windows Authentication Only** - No stored credentials
- **Read-Only Operations** - All queries are SELECT-only (see the sketch below)
- **Minimal Permissions** - Only requires db_datareader role
- **No Data Logging** - Sensitive data never logged
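A simple way to enforce a SELECT-only policy before a query reaches the database looks roughly like this; it is an illustrative guard, not the framework's actual enforcement code:
```python
import re

READ_ONLY = re.compile(r"^\s*(SELECT|WITH)\b", re.IGNORECASE)
FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|MERGE|DROP|ALTER|TRUNCATE|EXEC)\b", re.IGNORECASE)

def assert_read_only(sql: str) -> None:
    # Reject anything that is not a plain SELECT / CTE statement.
    if not READ_ONLY.match(sql) or FORBIDDEN.search(sql):
        raise ValueError(f"Refusing to run non-SELECT statement: {sql[:60]!r}")

assert_read_only("SELECT COUNT(*) FROM dbo.FactTable1")  # passes silently
try:
    assert_read_only("DELETE FROM dbo.FactTable1")
except ValueError as exc:
    print(exc)
```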
## 🔧 Troubleshooting
### Connection Failed
```bash
# Test connectivity
drt discover --server <YOUR_SERVER> --database master
# Verify ODBC driver
odbcinst -q -d
# Check permissions
# User needs db_datareader role on target databases
```
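If the CLI cannot connect, a quick standalone probe with `pyodbc` can help isolate whether the problem is the driver, authentication, or permissions. Adjust the driver name to whatever `odbcinst -q -d` reports; this snippet is a diagnostic sketch, not part of the framework:
```python
import pyodbc

def probe(server: str, database: str) -> bool:
    # Windows Authentication via Trusted_Connection; no credentials stored.
    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER={server};DATABASE={database};"
        "Trusted_Connection=yes;TrustServerCertificate=yes;"
    )
    try:
        conn = pyodbc.connect(conn_str, timeout=30)
        try:
            conn.cursor().execute("SELECT 1").fetchone()
            return True
        finally:
            conn.close()
    except pyodbc.Error as exc:
        print(f"Connection failed: {exc}")
        return False

print(probe("<YOUR_SERVER>", "master"))
```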
### Query Timeout
Increase timeout in configuration:
```yaml
baseline:
  timeout:
    query: 600  # 10 minutes
```
### Linux Kerberos Issues
```bash
# Check ticket
klist
# Renew if expired
kinit username@YOUR_DOMAIN.COM
# Verify ticket is valid
klist
```
## ⚡ Performance
### Diagnostic Logging
Enable verbose mode to see detailed timing:
```bash
drt compare --config config.yaml --verbose
```
This shows:
- Per-check timing (existence, row count, schema, aggregates)
- Query execution times
- Parallelization opportunities
### Optimization Tips
- Disable aggregate checks for surrogate keys
- Increase query timeouts for large tables
- Use table filtering to focus on critical tables
- Consider parallel execution for multiple database pairs (sketched below)
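Parallel execution across database pairs is flagged as a future feature in the example configuration; a rough sketch of the shape it could take (the `compare_pair` body is a placeholder):
```python
from concurrent.futures import ThreadPoolExecutor

def compare_pair(pair_name: str) -> str:
    # Placeholder for the real per-pair work: connect, run checks, summarise.
    return f"{pair_name}: completed"

pairs = ["DWH_Comparison", "OPS_Comparison"]
with ThreadPoolExecutor(max_workers=4) as pool:
    for outcome in pool.map(compare_pair, pairs):
        print(outcome)
```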
## 👨‍💻 Development
### Getting Started
1. Fork the repository on GitHub
2. Clone your fork locally:
```bash
git clone https://github.com/your-username/data_regression_testing.git
cd data_regression_testing
```
3. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
4. Install dependencies:
```bash
pip install -r requirements.txt
pip install -e .
```
5. Install development dependencies:
```bash
pip install pytest pytest-cov black flake8 mypy
```
### Development Workflow
#### 1. Create a Branch
```bash
git checkout -b feature/your-feature-name
# or
git checkout -b bugfix/issue-description
```
#### 2. Make Your Changes
- Write clean, readable code
- Follow the existing code style
- Add docstrings to all functions and classes
- Update documentation as needed
#### 3. Run Tests
```bash
# All tests
pytest
# With coverage
pytest --cov=src/drt --cov-report=html
# Specific test file
pytest tests/test_models.py
```
#### 4. Code Quality Checks
```bash
# Format code with black
black src/ tests/
# Check code style with flake8
flake8 src/ tests/
# Type checking with mypy
mypy src/
```
#### 5. Commit Your Changes
Write clear, descriptive commit messages:
```bash
git add .
git commit -m "Add feature: description of your changes"
```
**Commit message guidelines:**
- Use present tense ("Add feature" not "Added feature")
- Use imperative mood ("Move cursor to..." not "Moves cursor to...")
- Limit first line to 72 characters
- Reference issues and pull requests when relevant
#### 6. Push and Create Pull Request
```bash
git push origin feature/your-feature-name
```
Create a pull request on GitHub with:
- Clear title and description
- Reference to related issues
- Screenshots (if applicable)
- Test results
### Code Style Guidelines
#### Python Style
- Follow PEP 8 style guide
- Use type hints for function parameters and return values
- Maximum line length: 100 characters
- Use meaningful variable and function names
**Example:**
```python
def calculate_row_count_difference(
    baseline_count: int,
    target_count: int,
    tolerance_percent: float
) -> tuple[bool, float]:
    """
    Calculate if row count difference is within tolerance.

    Args:
        baseline_count: Row count from baseline database
        target_count: Row count from target database
        tolerance_percent: Acceptable difference percentage

    Returns:
        Tuple of (is_within_tolerance, actual_difference_percent)
    """
    # Implementation here
    pass
```
#### Documentation
- Add docstrings to all public functions, classes, and modules
- Use Google-style docstrings
- Include examples in docstrings when helpful
- Update README.md for user-facing changes
#### Testing
- Write unit tests for all new functionality
- Aim for >80% code coverage
- Use descriptive test names
- Follow AAA pattern (Arrange, Act, Assert)
**Example:**
```python
def test_row_count_checker_exact_match():
    """Test row count checker with exact match"""
    # Arrange
    checker = RowCountChecker(tolerance_percent=0.0)

    # Act
    result = checker.check(baseline_count=1000, target_count=1000)

    # Assert
    assert result.status == Status.PASS
    assert result.baseline_value == 1000
    assert result.target_value == 1000
```
### Adding New Features
#### New Checker Type
To add a new comparison checker (a sketch follows this list):
1. Create new checker in `src/drt/services/checkers/`
2. Inherit from `BaseChecker`
3. Implement `check()` method
4. Add new `CheckType` enum value
5. Register in `ComparisonService`
6. Add tests in `tests/test_checkers.py`
7. Update documentation
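A hypothetical checker, sketched without the framework's real `BaseChecker` or `CheckResult` (stand-in classes are included so the example runs on its own):
```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"

@dataclass
class CheckResult:
    status: Status
    baseline_value: int
    target_value: int

class BaseChecker:  # stand-in for the real base class in services/checkers/
    def check(self, **kwargs) -> CheckResult:
        raise NotImplementedError

class NullCountChecker(BaseChecker):
    """Hypothetical checker: compare NULL counts for a column in both databases."""

    def check(self, baseline_nulls: int, target_nulls: int) -> CheckResult:
        status = Status.PASS if baseline_nulls == target_nulls else Status.FAIL
        return CheckResult(status, baseline_nulls, target_nulls)

print(NullCountChecker().check(baseline_nulls=5, target_nulls=7).status)
```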
#### New Report Format
To add a new report format (a sketch follows this list):
1. Create new reporter in `src/drt/reporting/`
2. Implement `generate()` method
3. Add format option to configuration
4. Update `ReportGenerator` to use new format
5. Add tests
6. Update documentation
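A hypothetical reporter with the `generate()` entry point described above, writing a JSON summary; the real reporters and their exact interface live in `src/drt/reporting/`:
```python
import json
from pathlib import Path

class JsonReporter:
    """Hypothetical reporter: dump the comparison summary as a JSON document."""

    def generate(self, summary: dict, output_path: Path) -> Path:
        output_path.parent.mkdir(parents=True, exist_ok=True)
        output_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
        return output_path

report = JsonReporter().generate(
    {"pair": "DWH_Comparison", "passed": 12, "failed": 1},
    Path("reports/regression_test.json"),
)
print(f"Wrote {report}")
```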
### Testing
#### Unit Tests
Run the test suite:
```bash
# All tests
pytest
# With coverage report
pytest --cov=src/drt --cov-report=html
# Specific test file
pytest tests/test_models.py -v
# Specific test function
pytest tests/test_models.py::test_status_enum -v
```
#### Integration Tests
Use the Docker test environment:
```bash
# Start test databases
bash test_data/setup_test_environment.sh
# Run integration tests
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
drt compare --config config.test.yaml
# Cleanup
docker-compose -f docker-compose.test.yml down -v
```
#### Manual Testing
```bash
# Test against real databases (requires access)
drt discover --server <YOUR_SERVER> --database <YOUR_DB> --output manual_test.yaml
drt validate --config manual_test.yaml
drt compare --config manual_test.yaml --dry-run
```
### Reporting Issues
When reporting issues, please include:
- Clear description of the problem
- Steps to reproduce
- Expected vs actual behavior
- Environment details (OS, Python version, ODBC driver version)
- Relevant logs or error messages
- Configuration file (sanitized - remove server names/credentials)
**Example:**
````markdown
**Description:** Row count comparison fails with timeout error

**Steps to Reproduce:**
1. Configure comparison for large table (>1M rows)
2. Run `drt compare --config config.yaml`
3. Observe timeout error

**Expected:** Comparison completes successfully
**Actual:** Query timeout after 300 seconds

**Environment:**
- OS: Windows 10
- Python: 3.9.7
- ODBC Driver: 17 for SQL Server

**Logs:**
```
ERROR: Query timeout on table dbo.FactTable1
```
````
### Feature Requests
For feature requests, please:
- Check if feature already exists or is planned
- Describe the use case clearly
- Explain why it would be valuable
- Provide examples if possible
### Code Review Process
All contributions go through code review:
1. Automated checks must pass (tests, linting)
2. At least one maintainer approval required
3. Address review feedback promptly
4. Keep pull requests focused and reasonably sized
### Release Process
Releases follow semantic versioning (MAJOR.MINOR.PATCH):
- **MAJOR** - Breaking changes
- **MINOR** - New features (backward compatible)
- **PATCH** - Bug fixes (backward compatible)
### Development Tips
#### Debugging
```bash
# Enable verbose logging
drt compare --config config.yaml --verbose
# Use dry-run to test without execution
drt compare --config config.yaml --dry-run
# Check configuration validity
drt validate --config config.yaml
```
#### Performance Profiling
```bash
# Enable diagnostic logging
drt compare --config config.yaml --verbose
# Look for timing information in logs
grep "execution time" logs/drt_*.log
```
#### Docker Development
```bash
# Build and test in Docker
docker build -t drt:dev .
docker run -v $(pwd)/config.yaml:/app/config.yaml drt:dev compare --config /app/config.yaml
```
## 📝 License
MIT License - see LICENSE file for details
## 📞 Support
For issues and questions:
- GitHub Issues: <repository-url>/issues
- Check logs in `./logs/`
- Review configuration with `drt validate`
- Test connectivity with `drt discover`
## 👥 Authors
QA Engineering Team
## 📌 Version
Current version: 1.0.0

286
config.example.yaml Executable file

@@ -0,0 +1,286 @@
# Data Regression Testing Framework - Example Configuration
# This file demonstrates all available configuration options
# ============================================================================
# DATABASE PAIRS
# Define baseline (production) and target (test) database connections
# ============================================================================
database_pairs:
  # Example 1: Data Warehouse Comparison
  - name: "DWH_Comparison"
    enabled: true
    description: "Compare production and test data warehouse"
    baseline:
      server: "<YOUR_SERVER_NAME>"
      database: "<YOUR_BASELINE_DB>"
      timeout:
        connection: 30   # seconds
        query: 300       # seconds (5 minutes)
    target:
      server: "<YOUR_SERVER_NAME>"
      database: "<YOUR_TARGET_DB>"
      timeout:
        connection: 30
        query: 300

  # Example 2: Operational Database Comparison (disabled)
  - name: "OPS_Comparison"
    enabled: false
    description: "Compare operational databases (currently disabled)"
    baseline:
      server: "<YOUR_SERVER_NAME>"
      database: "<YOUR_BASELINE_DB_2>"
    target:
      server: "<YOUR_SERVER_NAME>"
      database: "<YOUR_TARGET_DB_2>"

# ============================================================================
# COMPARISON SETTINGS
# Configure what types of comparisons to perform
# ============================================================================
comparison:
  # Comparison mode: "health_check" or "full"
  # - health_check: Quick validation (row counts, schema)
  # - full: Comprehensive validation (includes aggregates)
  mode: "health_check"

  # Row Count Comparison
  row_count:
    enabled: true
    tolerance_percent: 0.0   # 0% = exact match required
    # Examples:
    #   0.0 = exact match
    #   0.1 = allow 0.1% difference
    #   1.0 = allow 1% difference

  # Schema Comparison
  schema:
    enabled: true
    checks:
      column_names: true   # Verify column names match
      data_types: true     # Verify data types match
      nullable: true       # Verify nullable constraints match
      primary_keys: true   # Verify primary keys match

  # Aggregate Comparison (sums of numeric columns)
  aggregates:
    enabled: true
    tolerance_percent: 0.01   # 0.01% tolerance for rounding differences
    # Note: Only applies when mode is "full"

# ============================================================================
# TABLES TO COMPARE
# List all tables to include in comparison
# ============================================================================
tables:
  # Example 1: Fact table with aggregates
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Amount1"
      - "Amount2"
      - "Amount3"
      - "Quantity"
    notes: "Example fact table with numeric aggregates"

  # Example 2: Dimension table without aggregates
  - schema: "dbo"
    name: "DimTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns: []
    notes: "Example dimension table - no numeric aggregates"

  # Example 3: Table expected to be missing in target
  - schema: "dbo"
    name: "TempTable1"
    enabled: true
    expected_in_target: false
    aggregate_columns: []
    notes: "Example temporary table - should not exist in target"

  # Example 4: Disabled table (skipped during comparison)
  - schema: "dbo"
    name: "Table4"
    enabled: false
    expected_in_target: true
    aggregate_columns: []
    notes: "Example disabled table - excluded from comparison"

  # Example 5: Table with multiple schemas
  - schema: "staging"
    name: "StagingTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Amount"
    notes: "Example staging table"

  # Example 6: Large fact table
  - schema: "dbo"
    name: "FactTable2"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Amount"
      - "Fee"
      - "NetAmount"
    notes: "Example high-volume fact table"

  # Example 7: Reference data table
  - schema: "ref"
    name: "RefTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns: []
    notes: "Example reference data table"

# ============================================================================
# REPORTING SETTINGS
# Configure report generation and output
# ============================================================================
reporting:
  # Output directory for reports (use relative path or set via environment variable)
  output_dir: "./reports"
  # Output directory for investigation reports (use relative path or set via environment variable)
  investigation_dir: "./investigation_reports"

  # Report formats to generate
  formats:
    html: true    # Rich HTML report with styling
    csv: true     # CSV report for Excel/analysis
    pdf: false    # PDF report (requires weasyprint)

  # Report naming
  filename_prefix: "regression_test"
  include_timestamp: true   # Append YYYYMMDD_HHMMSS to filename

  # Report content options
  include_passed: true      # Include passed checks in report
  include_warnings: true    # Include warnings in report
  summary_only: false       # Only show summary (no details)

# ============================================================================
# LOGGING SETTINGS
# Configure logging behavior
# ============================================================================
logging:
  # Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL
  level: "INFO"
  # Log output directory (use relative path or set via environment variable)
  output_dir: "./logs"

  # Log file naming
  filename_prefix: "drt"
  include_timestamp: true

  # Console output
  console:
    enabled: true
    level: "INFO"
    colored: true   # Use colored output (if terminal supports it)

  # File output
  file:
    enabled: true
    level: "DEBUG"
    max_size_mb: 10   # Rotate after 10MB
    backup_count: 5   # Keep 5 backup files

# ============================================================================
# EXECUTION SETTINGS
# Configure execution behavior
# ============================================================================
execution:
  # Parallel execution (future feature)
  parallel:
    enabled: false
    max_workers: 4

  # Retry settings for transient failures
  retry:
    enabled: true
    max_attempts: 3
    delay_seconds: 5

  # Performance settings
  performance:
    batch_size: 1000          # Rows per batch for large queries
    use_nolock: true          # Use NOLOCK hints (read uncommitted)
    connection_pooling: true

# ============================================================================
# FILTERS
# Global filters applied to all tables
# ============================================================================
filters:
  # Schema filters (include/exclude patterns)
  schemas:
    include:
      - "dbo"
      - "staging"
      - "ref"
    exclude:
      - "sys"
      - "temp"

  # Table name filters (wildcard patterns)
  tables:
    include:
      - "*"           # Include all tables
    exclude:
      - "tmp_*"       # Exclude temporary tables
      - "backup_*"    # Exclude backup tables
      - "archive_*"   # Exclude archive tables

  # Column filters for aggregate comparisons
  columns:
    exclude_patterns:
      - "*_id"        # Exclude ID columns
      - "*_key"       # Exclude key columns
      - "created_*"   # Exclude audit columns
      - "modified_*"  # Exclude audit columns

# ============================================================================
# NOTIFICATIONS (future feature)
# Configure notifications for test results
# ============================================================================
notifications:
  enabled: false

  # Email notifications
  email:
    enabled: false
    smtp_server: "smtp.company.com"
    smtp_port: 587
    from_address: "drt@company.com"
    to_addresses:
      - "qa-team@company.com"
    on_failure_only: true

  # Slack notifications
  slack:
    enabled: false
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    channel: "#qa-alerts"
    on_failure_only: true

# ============================================================================
# METADATA
# Optional metadata about this configuration
# ============================================================================
metadata:
  version: "1.0"
  created_by: "QA Team"
  created_date: "2024-01-15"
  description: "Standard regression test configuration for DWH migration"
  project: "DWH Migration Phase 2"
  environment: "UAT"
  tags:
    - "migration"
    - "data-quality"
    - "regression"

46
config.quickstart.yaml Executable file

@@ -0,0 +1,46 @@
# Quick Start Configuration
# Minimal configuration to get started quickly
database_pairs:
  - name: "Quick_Test"
    enabled: true
    baseline:
      server: "YOUR_SERVER_NAME"
      database: "YOUR_BASELINE_DB"
    target:
      server: "YOUR_SERVER_NAME"
      database: "YOUR_TARGET_DB"

comparison:
  mode: "health_check"
  row_count:
    enabled: true
    tolerance_percent: 0.0
  schema:
    enabled: true
    checks:
      column_names: true
      data_types: true
  aggregates:
    enabled: false

tables:
  # Add your tables here after running discovery
  # Example:
  # - schema: "dbo"
  #   name: "YourTable"
  #   enabled: true
  #   expected_in_target: true
  #   aggregate_columns: []

reporting:
  output_dir: "./reports"
  investigation_dir: "./investigation_reports"
  formats:
    html: true
    csv: true
    pdf: false

logging:
  level: "INFO"
  output_dir: "./logs"

83
config.test.yaml Executable file

@@ -0,0 +1,83 @@
# Test Configuration for Docker SQL Server Environment
# Use this configuration with the Docker test environment
database_pairs:
  - name: "Docker_Test_Comparison"
    enabled: true
    description: "Compare Docker test databases"
    baseline:
      server: "localhost,1433"
      database: "TestDB_Baseline"
      # Use environment variables for credentials: DRT_DB_USERNAME, DRT_DB_PASSWORD
      # username: "${DRT_DB_USERNAME}"
      # password: "${DRT_DB_PASSWORD}"
      timeout:
        connection: 30
        query: 300
    target:
      server: "localhost,1434"
      database: "TestDB_Target"
      # Use environment variables for credentials: DRT_DB_USERNAME, DRT_DB_PASSWORD
      # username: "${DRT_DB_USERNAME}"
      # password: "${DRT_DB_PASSWORD}"
      timeout:
        connection: 30
        query: 300

comparison:
  mode: "health_check"
  row_count:
    enabled: true
    tolerance_percent: 0.0
  schema:
    enabled: true
    checks:
      column_names: true
      data_types: true
  aggregates:
    enabled: true
    tolerance_percent: 0.01

tables:
  - schema: "dbo"
    name: "DimTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns: []
    notes: "Example dimension table"
  - schema: "dbo"
    name: "DimTable2"
    enabled: true
    expected_in_target: true
    aggregate_columns: []
    notes: "Example dimension table with schema differences"
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Quantity"
      - "Amount"
      - "Tax"
    notes: "Example fact table with numeric aggregates"
  - schema: "dbo"
    name: "TempTable1"
    enabled: true
    expected_in_target: false
    aggregate_columns: []
    notes: "Example temporary table - only exists in target"

reporting:
  output_directory: "/home/user/reports"
  investigation_directory: "/home/user/investigation_reports"
  formats: ["html", "csv"]
  filename_template: "test_regression_{timestamp}"

logging:
  level: "INFO"
  directory: "/home/user/logs"
  filename_template: "drt_test_{timestamp}.log"
  console: true

0
config/.gitkeep Executable file

52
docker-compose.test.yml Executable file

@@ -0,0 +1,52 @@
version: '3.8'

services:
  # SQL Server 2022 - Baseline (Production)
  sqlserver-baseline:
    image: mcr.microsoft.com/mssql/server:2022-latest
    container_name: drt-sqlserver-baseline
    environment:
      - ACCEPT_EULA=Y
      - SA_PASSWORD=${SA_PASSWORD:-YourStrong!Passw0rd}
      - MSSQL_PID=Developer
    ports:
      - "1433:1433"
    volumes:
      - ./test_data/init_baseline.sql:/docker-entrypoint-initdb.d/init.sql
      - sqlserver_baseline_data:/var/opt/mssql
    healthcheck:
      test: /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P ${SA_PASSWORD:-YourStrong!Passw0rd} -Q "SELECT 1"
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - drt-network

  # SQL Server 2022 - Target (Test)
  sqlserver-target:
    image: mcr.microsoft.com/mssql/server:2022-latest
    container_name: drt-sqlserver-target
    environment:
      - ACCEPT_EULA=Y
      - SA_PASSWORD=${SA_PASSWORD:-YourStrong!Passw0rd}
      - MSSQL_PID=Developer
    ports:
      - "1434:1433"
    volumes:
      - ./test_data/init_target.sql:/docker-entrypoint-initdb.d/init.sql
      - sqlserver_target_data:/var/opt/mssql
    healthcheck:
      test: /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P ${SA_PASSWORD:-YourStrong!Passw0rd} -Q "SELECT 1"
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - drt-network

volumes:
  sqlserver_baseline_data:
  sqlserver_target_data:

networks:
  drt-network:
    driver: bridge

121
install_docker_debian.sh Executable file

@@ -0,0 +1,121 @@
#!/bin/bash
# Docker Installation Script for Debian 12
set -e
echo "=========================================="
echo "Docker Installation for Debian 12"
echo "=========================================="
echo ""
# Check if running as root
if [ "$EUID" -ne 0 ]; then
echo "Please run with sudo: sudo bash install_docker_debian.sh"
exit 1
fi
# Detect OS
if [ -f /etc/os-release ]; then
. /etc/os-release
OS=$ID
VER=$VERSION_ID
echo "Detected OS: $PRETTY_NAME"
else
echo "Cannot detect OS version"
exit 1
fi
# Remove old versions
echo ""
echo "Step 1: Removing old Docker versions (if any)..."
apt-get remove -y docker docker-engine docker.io containerd runc 2>/dev/null || true
# Install prerequisites
echo ""
echo "Step 2: Installing prerequisites..."
apt-get update
apt-get install -y \
ca-certificates \
curl \
gnupg \
lsb-release
# Add Docker's official GPG key
echo ""
echo "Step 3: Adding Docker GPG key..."
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg
# Set up Docker repository
echo ""
echo "Step 4: Adding Docker repository..."
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine
echo ""
echo "Step 5: Installing Docker Engine..."
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Start Docker service
echo ""
echo "Step 6: Starting Docker service..."
systemctl start docker
systemctl enable docker
# Add current user to docker group (if not root)
if [ -n "$SUDO_USER" ]; then
echo ""
echo "Step 7: Adding user $SUDO_USER to docker group..."
usermod -aG docker $SUDO_USER
echo "Note: You'll need to log out and back in for group changes to take effect"
fi
# Verify installation
echo ""
echo "Step 8: Verifying Docker installation..."
if docker --version; then
echo "✓ Docker installed successfully"
else
echo "✗ Docker installation failed"
exit 1
fi
if docker compose version; then
echo "✓ Docker Compose installed successfully"
else
echo "✗ Docker Compose installation failed"
exit 1
fi
# Test Docker
echo ""
echo "Step 9: Testing Docker..."
if docker run --rm hello-world > /dev/null 2>&1; then
echo "✓ Docker is working correctly"
else
echo "⚠ Docker test failed - you may need to log out and back in"
fi
echo ""
echo "=========================================="
echo "Installation completed successfully!"
echo "=========================================="
echo ""
echo "Docker version:"
docker --version
echo ""
echo "Docker Compose version:"
docker compose version
echo ""
echo "IMPORTANT: If you're not root, log out and back in for group changes to take effect"
echo ""
echo "Next steps:"
echo "1. Log out and back in (or run: newgrp docker)"
echo "2. Test Docker: docker run hello-world"
echo "3. Set up test environment: bash test_data/setup_test_environment.sh"
echo ""

112
install_odbc_debian.sh Executable file

@@ -0,0 +1,112 @@
#!/bin/bash
# ODBC Driver Installation Script for Debian 12
# This script installs Microsoft ODBC Driver 18 for SQL Server
set -e
echo "=========================================="
echo "ODBC Driver Installation for Debian 12"
echo "=========================================="
echo ""
# Check if running as root
if [ "$EUID" -ne 0 ]; then
echo "Please run with sudo: sudo bash install_odbc_debian.sh"
exit 1
fi
# Detect OS
if [ -f /etc/os-release ]; then
. /etc/os-release
OS=$ID
VER=$VERSION_ID
echo "Detected OS: $PRETTY_NAME"
else
echo "Cannot detect OS version"
exit 1
fi
# Clean up any corrupted repository files
echo ""
echo "Step 1: Cleaning up any previous installation attempts..."
if [ -f /etc/apt/sources.list.d/mssql-release.list ]; then
echo "Removing corrupted mssql-release.list..."
rm -f /etc/apt/sources.list.d/mssql-release.list
fi
# Install prerequisites
echo ""
echo "Step 2: Installing prerequisites..."
apt-get update
apt-get install -y curl gnupg2 apt-transport-https ca-certificates
# Add Microsoft GPG key
echo ""
echo "Step 3: Adding Microsoft GPG key..."
curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg
# Add Microsoft repository based on OS
echo ""
echo "Step 4: Adding Microsoft repository..."
if [ "$OS" = "debian" ]; then
if [ "$VER" = "12" ]; then
curl https://packages.microsoft.com/config/debian/12/prod.list | tee /etc/apt/sources.list.d/mssql-release.list
elif [ "$VER" = "11" ]; then
curl https://packages.microsoft.com/config/debian/11/prod.list | tee /etc/apt/sources.list.d/mssql-release.list
else
echo "Unsupported Debian version: $VER"
exit 1
fi
elif [ "$OS" = "ubuntu" ]; then
curl https://packages.microsoft.com/config/ubuntu/$VER/prod.list | tee /etc/apt/sources.list.d/mssql-release.list
else
echo "Unsupported OS: $OS"
exit 1
fi
# Update package list
echo ""
echo "Step 5: Updating package list..."
apt-get update
# Install ODBC Driver
echo ""
echo "Step 6: Installing ODBC Driver 18 for SQL Server..."
ACCEPT_EULA=Y apt-get install -y msodbcsql18
# Install unixODBC development headers
echo ""
echo "Step 7: Installing unixODBC development headers..."
apt-get install -y unixodbc-dev
# Verify installation
echo ""
echo "Step 8: Verifying installation..."
if odbcinst -q -d -n "ODBC Driver 18 for SQL Server" > /dev/null 2>&1; then
echo "✓ ODBC Driver 18 for SQL Server installed successfully"
odbcinst -q -d -n "ODBC Driver 18 for SQL Server"
else
echo "✗ ODBC Driver installation failed"
exit 1
fi
# Check for ODBC Driver 17 as fallback
if odbcinst -q -d -n "ODBC Driver 17 for SQL Server" > /dev/null 2>&1; then
echo "✓ ODBC Driver 17 for SQL Server also available"
fi
echo ""
echo "=========================================="
echo "Installation completed successfully!"
echo "=========================================="
echo ""
echo "Next steps:"
echo "1. Install Python dependencies: pip install -r requirements.txt"
echo "2. Install the framework: pip install -e ."
echo "3. Test the installation: drt --version"
echo ""
echo "For Windows Authentication, you'll also need to:"
echo "1. Install Kerberos: apt-get install -y krb5-user"
echo "2. Configure /etc/krb5.conf with your domain settings"
echo "3. Get a Kerberos ticket: kinit username@YOUR_DOMAIN.COM"
echo ""

73
pyproject.toml Executable file

@@ -0,0 +1,73 @@
[project]
name = "data-regression-tester"
version = "1.0.0"
description = "Data Regression Testing Framework for SQL Server"
readme = "README.md"
requires-python = ">=3.9"
license = {text = "MIT"}
authors = [
{name = "QA Engineering Team"}
]
keywords = ["data", "regression", "testing", "sql-server", "comparison"]
classifiers = [
"Development Status :: 4 - Beta",
"Environment :: Console",
"Intended Audience :: Developers",
"Operating System :: Microsoft :: Windows",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Topic :: Database",
"Topic :: Software Development :: Testing",
]
dependencies = [
"pandas>=2.0",
"sqlalchemy>=2.0",
"pyodbc>=4.0",
"pyyaml>=6.0",
"pydantic>=2.0",
"click>=8.0",
"rich>=13.0",
"jinja2>=3.0",
"weasyprint>=60.0",
]
[project.optional-dependencies]
dev = [
"pytest>=7.0",
"pytest-cov>=4.0",
"black>=23.0",
"ruff>=0.1.0",
"mypy>=1.0",
"pre-commit>=3.0",
]
[project.scripts]
drt = "drt.cli.main:cli"
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"
[tool.setuptools.packages.find]
where = ["src"]
[tool.black]
line-length = 100
target-version = ["py39", "py310", "py311", "py312"]
[tool.ruff]
line-length = 100
select = ["E", "F", "W", "I", "N", "UP", "B", "C4"]
[tool.mypy]
python_version = "3.9"
warn_return_any = true
warn_unused_configs = true
ignore_missing_imports = true
[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-v --cov=drt --cov-report=term-missing"

14
pytest.ini Executable file

@@ -0,0 +1,14 @@
[pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts =
    -v
    --strict-markers
    --tb=short
    --disable-warnings
markers =
    unit: Unit tests
    integration: Integration tests
    slow: Slow running tests

9
requirements.txt Executable file

@@ -0,0 +1,9 @@
pandas>=2.0
sqlalchemy>=2.0
pyodbc>=4.0
pyyaml>=6.0
pydantic>=2.0
click>=8.0
rich>=13.0
jinja2>=3.0
weasyprint>=60.0

14
src/drt/__init__.py Executable file

@@ -0,0 +1,14 @@
"""
Data Regression Testing Framework
A comprehensive framework for validating data integrity during code migration
and system updates by comparing data outputs between Baseline (Production)
and Target (Test) SQL Server databases.
"""
__version__ = "1.0.0"
__author__ = "QA Engineering Team"
from drt.models.enums import Status, CheckType
__all__ = ["__version__", "__author__", "Status", "CheckType"]

11
src/drt/__main__.py Executable file

@@ -0,0 +1,11 @@
"""
Entry point for running the framework as a module.
Usage:
python -m drt <command> [options]
"""
from drt.cli.main import cli
if __name__ == "__main__":
    cli()

5
src/drt/cli/__init__.py Executable file

@@ -0,0 +1,5 @@
"""Command-line interface for the framework."""
from drt.cli.main import cli
__all__ = ["cli"]

5
src/drt/cli/commands/__init__.py Executable file

@@ -0,0 +1,5 @@
"""CLI commands."""
from drt.cli.commands import discover, compare, validate, investigate
__all__ = ["discover", "compare", "validate", "investigate"]

137
src/drt/cli/commands/compare.py Executable file

@@ -0,0 +1,137 @@
"""Compare command implementation."""
import click
import sys
from pathlib import Path
from drt.config.loader import load_config
from drt.services.comparison import ComparisonService
from drt.reporting.generator import ReportGenerator
from drt.utils.logging import setup_logging, get_logger
from drt.utils.timestamps import format_duration
logger = get_logger(__name__)
@click.command()
@click.option('--config', '-c', required=True, type=click.Path(exists=True), help='Configuration file path')
@click.option('--verbose', '-v', is_flag=True, help='Enable verbose output')
@click.option('--dry-run', is_flag=True, help='Show what would be compared without executing')
def compare(config, verbose, dry_run):
"""
Execute comparison between Baseline and Target databases.
Compares configured tables between baseline and target databases,
checking for data regression issues.
Example:
drt compare --config ./config.yaml
"""
# Load config first to get log directory
from drt.config.loader import load_config
cfg = load_config(config)
# Setup logging using config
log_level = "DEBUG" if verbose else "INFO"
log_dir = cfg.logging.directory
setup_logging(log_level=log_level, log_dir=log_dir, log_to_file=not dry_run)
click.echo("=" * 60)
click.echo("Data Regression Testing Framework")
click.echo("=" * 60)
click.echo()
try:
# Load configuration
click.echo(f"Loading configuration: {config}")
cfg = load_config(config)
click.echo(f"✓ Configuration loaded")
click.echo(f" Database pairs: {len(cfg.database_pairs)}")
click.echo(f" Tables configured: {len(cfg.tables)}")
click.echo()
if dry_run:
click.echo("=" * 60)
click.echo("DRY RUN - Preview Only")
click.echo("=" * 60)
for pair in cfg.database_pairs:
if not pair.enabled:
continue
click.echo(f"\nDatabase Pair: {pair.name}")
click.echo(f" Baseline: {pair.baseline.server}.{pair.baseline.database}")
click.echo(f" Target: {pair.target.server}.{pair.target.database}")
# Count enabled tables
enabled_tables = [t for t in cfg.tables if t.enabled]
click.echo(f" Tables to compare: {len(enabled_tables)}")
click.echo("\n" + "=" * 60)
click.echo("Use without --dry-run to execute comparison")
click.echo("=" * 60)
sys.exit(0)
# Execute comparison for each database pair
all_summaries = []
for pair in cfg.database_pairs:
if not pair.enabled:
click.echo(f"Skipping disabled pair: {pair.name}")
continue
click.echo(f"Comparing: {pair.name}")
click.echo(f" Baseline: {pair.baseline.server}.{pair.baseline.database}")
click.echo(f" Target: {pair.target.server}.{pair.target.database}")
click.echo()
# Run comparison
comparison_service = ComparisonService(cfg)
summary = comparison_service.run_comparison(pair)
all_summaries.append(summary)
click.echo()
# Generate reports for all summaries
if all_summaries:
click.echo("=" * 60)
click.echo("Generating Reports")
click.echo("=" * 60)
report_gen = ReportGenerator(cfg)
for summary in all_summaries:
report_files = report_gen.generate_reports(summary)
for filepath in report_files:
click.echo(f"{filepath}")
click.echo()
# Display final summary
click.echo("=" * 60)
click.echo("EXECUTION COMPLETE")
click.echo("=" * 60)
total_passed = sum(s.passed for s in all_summaries)
total_failed = sum(s.failed for s in all_summaries)
total_warnings = sum(s.warnings for s in all_summaries)
total_errors = sum(s.errors for s in all_summaries)
click.echo(f" PASS: {total_passed:3d}")
click.echo(f" FAIL: {total_failed:3d}")
click.echo(f" WARNING: {total_warnings:3d}")
click.echo(f" ERROR: {total_errors:3d}")
click.echo("=" * 60)
# Exit with appropriate code
if total_errors > 0 or total_failed > 0:
click.echo("Status: FAILED ❌")
sys.exit(1)
else:
click.echo("Status: PASSED ✓")
sys.exit(0)
except Exception as e:
logger.error(f"Comparison failed: {e}", exc_info=verbose)
click.echo(f"✗ Error: {e}", err=True)
sys.exit(2)

118
src/drt/cli/commands/discover.py Executable file

@@ -0,0 +1,118 @@
"""Discovery command implementation."""
import click
import sys
from drt.services.discovery import DiscoveryService
from drt.config.models import ConnectionConfig, Config
from drt.config.loader import save_config
from drt.utils.logging import setup_logging, get_logger
logger = get_logger(__name__)
@click.command()
@click.option('--server', required=True, help='SQL Server hostname or instance')
@click.option('--database', required=True, help='Database name to discover')
@click.option('--output', '-o', default='./config_discovered.yaml', help='Output configuration file')
@click.option('--schemas', multiple=True, help='Specific schemas to include (can specify multiple)')
@click.option('--verbose', '-v', is_flag=True, help='Enable verbose output')
def discover(server, database, output, schemas, verbose):
"""
Discover tables and generate configuration file.
Scans the specified database and automatically generates a configuration
file with all discovered tables, columns, and metadata.
Example:
drt discover --server SQLSERVER01 --database ORBIS_DWH_PROD
"""
# Setup logging
log_level = "DEBUG" if verbose else "INFO"
setup_logging(log_level=log_level)
click.echo("=" * 60)
click.echo("Data Regression Testing Framework - Discovery Mode")
click.echo("=" * 60)
click.echo()
try:
# Create connection config
conn_config = ConnectionConfig(
server=server,
database=database
)
# Create base config with schema filters if provided
config = Config()
if schemas:
config.discovery.include_schemas = list(schemas)
# Initialize discovery service
click.echo(f"Connecting to {server}.{database}...")
discovery_service = DiscoveryService(conn_config, config)
# Test connection
if not discovery_service.conn_mgr.test_connection():
click.echo("✗ Connection failed", err=True)
sys.exit(2)
click.echo("✓ Connected (Windows Authentication)")
click.echo()
# Discover tables
click.echo("Scanning tables...")
tables = discovery_service.discover_tables()
if not tables:
click.echo("⚠ No tables found", err=True)
sys.exit(0)
click.echo(f"✓ Found {len(tables)} tables")
click.echo()
# Generate configuration
click.echo("Generating configuration...")
generated_config = discovery_service.generate_config(tables)
# Save configuration
save_config(generated_config, output)
click.echo(f"✓ Configuration saved to: {output}")
click.echo()
# Display summary
click.echo("=" * 60)
click.echo("Discovery Summary")
click.echo("=" * 60)
click.echo(f" Tables discovered: {len(tables)}")
# Count columns
total_cols = sum(len(t.columns) for t in tables)
click.echo(f" Total columns: {total_cols}")
# Count numeric columns
numeric_cols = sum(len(t.aggregate_columns) for t in tables)
click.echo(f" Numeric columns: {numeric_cols}")
# Show largest tables
if tables:
largest = sorted(tables, key=lambda t: t.estimated_row_count, reverse=True)[:3]
click.echo()
click.echo(" Largest tables:")
for table in largest:
click.echo(f"{table.full_name:40s} {table.estimated_row_count:>12,} rows")
click.echo()
click.echo("=" * 60)
click.echo("Next Steps:")
click.echo(f" 1. Review {output}")
click.echo(" 2. Configure target database connection")
click.echo(" 3. Set 'expected_in_target: false' for tables being removed")
click.echo(f" 4. Run: drt compare --config {output}")
click.echo("=" * 60)
sys.exit(0)
except Exception as e:
logger.error(f"Discovery failed: {e}", exc_info=verbose)
click.echo(f"✗ Error: {e}", err=True)
sys.exit(2)

177
src/drt/cli/commands/investigate.py Executable file

@@ -0,0 +1,177 @@
"""Investigate command implementation."""
import click
import sys
from pathlib import Path
from drt.config.loader import load_config
from drt.services.investigation import InvestigationService
from drt.reporting.investigation_report import (
InvestigationHTMLReportGenerator,
InvestigationCSVReportGenerator
)
from drt.utils.logging import setup_logging, get_logger
from drt.utils.timestamps import get_timestamp
logger = get_logger(__name__)
@click.command()
@click.option('--analysis-dir', '-a', required=True, type=click.Path(exists=True),
help='Analysis output directory containing *_investigate.sql files')
@click.option('--config', '-c', required=True, type=click.Path(exists=True),
help='Configuration file path')
@click.option('--output-dir', '-o', default=None,
help='Output directory for reports (overrides config setting)')
@click.option('--verbose', '-v', is_flag=True, help='Enable verbose output')
@click.option('--dry-run', is_flag=True, help='Show what would be executed without running')
def investigate(analysis_dir, config, output_dir, verbose, dry_run):
"""
Execute investigation queries from regression analysis.
Processes all *_investigate.sql files in the analysis directory,
executes queries on both baseline and target databases, and
generates comprehensive reports.
Example:
drt investigate -a /home/user/analysis/output_20251209_184032/ -c config.yaml
"""
# Load config first to get log directory
from drt.config.loader import load_config
cfg = load_config(config)
# Setup logging using config
log_level = "DEBUG" if verbose else "INFO"
log_dir = cfg.logging.directory
setup_logging(log_level=log_level, log_dir=log_dir, log_to_file=not dry_run)
click.echo("=" * 60)
click.echo("Data Regression Testing Framework - Investigation")
click.echo("=" * 60)
click.echo()
try:
# Use output_dir from CLI if provided, otherwise use config
if output_dir is None:
output_dir = cfg.reporting.investigation_directory
click.echo(f"✓ Configuration loaded")
click.echo(f" Database pairs: {len(cfg.database_pairs)}")
click.echo()
# Convert paths
analysis_path = Path(analysis_dir)
output_path = Path(output_dir)
# Create output directory
output_path.mkdir(parents=True, exist_ok=True)
if dry_run:
click.echo("=" * 60)
click.echo("DRY RUN - Preview Only")
click.echo("=" * 60)
# Discover SQL files
from drt.services.sql_parser import discover_sql_files
sql_files = discover_sql_files(analysis_path)
click.echo(f"\nAnalysis Directory: {analysis_path}")
click.echo(f"Found {len(sql_files)} investigation SQL files")
if sql_files:
click.echo("\nTables with investigation queries:")
for schema, table, sql_path in sql_files[:10]: # Show first 10
click.echo(f"{schema}.{table}")
if len(sql_files) > 10:
click.echo(f" ... and {len(sql_files) - 10} more")
for pair in cfg.database_pairs:
if not pair.enabled:
continue
click.echo(f"\nDatabase Pair: {pair.name}")
click.echo(f" Baseline: {pair.baseline.server}.{pair.baseline.database}")
click.echo(f" Target: {pair.target.server}.{pair.target.database}")
click.echo(f"\nReports would be saved to: {output_path}")
click.echo("\n" + "=" * 60)
click.echo("Use without --dry-run to execute investigation")
click.echo("=" * 60)
sys.exit(0)
# Execute investigation for each database pair
all_summaries = []
for pair in cfg.database_pairs:
if not pair.enabled:
click.echo(f"Skipping disabled pair: {pair.name}")
continue
click.echo(f"Investigating: {pair.name}")
click.echo(f" Baseline: {pair.baseline.server}.{pair.baseline.database}")
click.echo(f" Target: {pair.target.server}.{pair.target.database}")
click.echo()
# Run investigation
investigation_service = InvestigationService(cfg)
summary = investigation_service.run_investigation(analysis_path, pair)
all_summaries.append(summary)
click.echo()
# Generate reports for all summaries
if all_summaries:
click.echo("=" * 60)
click.echo("Generating Reports")
click.echo("=" * 60)
for summary in all_summaries:
timestamp = get_timestamp()
# Generate HTML report
html_gen = InvestigationHTMLReportGenerator(cfg)
html_path = output_path / f"investigation_report_{timestamp}.html"
html_gen.generate(summary, html_path)
click.echo(f" ✓ HTML: {html_path}")
# Generate CSV report
csv_gen = InvestigationCSVReportGenerator(cfg)
csv_path = output_path / f"investigation_report_{timestamp}.csv"
csv_gen.generate(summary, csv_path)
click.echo(f" ✓ CSV: {csv_path}")
click.echo()
# Display final summary
click.echo("=" * 60)
click.echo("INVESTIGATION COMPLETE")
click.echo("=" * 60)
total_processed = sum(s.tables_processed for s in all_summaries)
total_successful = sum(s.tables_successful for s in all_summaries)
total_partial = sum(s.tables_partial for s in all_summaries)
total_failed = sum(s.tables_failed for s in all_summaries)
total_queries = sum(s.total_queries_executed for s in all_summaries)
click.echo(f" Tables Processed: {total_processed:3d}")
click.echo(f" Successful: {total_successful:3d}")
click.echo(f" Partial: {total_partial:3d}")
click.echo(f" Failed: {total_failed:3d}")
click.echo(f" Total Queries: {total_queries:3d}")
click.echo("=" * 60)
# Exit with appropriate code
if total_failed > 0:
click.echo("Status: COMPLETED WITH FAILURES ⚠️")
sys.exit(1)
elif total_partial > 0:
click.echo("Status: COMPLETED WITH PARTIAL RESULTS ◐")
sys.exit(0)
else:
click.echo("Status: SUCCESS ✓")
sys.exit(0)
except Exception as e:
logger.error(f"Investigation failed: {e}", exc_info=verbose)
click.echo(f"✗ Error: {e}", err=True)
sys.exit(2)

92
src/drt/cli/commands/validate.py Executable file

@@ -0,0 +1,92 @@
"""Validate command implementation."""
import click
import sys
from drt.config.loader import load_config
from drt.config.validator import validate_config
from drt.utils.logging import setup_logging, get_logger
logger = get_logger(__name__)
@click.command()
@click.option('--config', '-c', required=True, type=click.Path(exists=True), help='Configuration file path')
@click.option('--verbose', '-v', is_flag=True, help='Enable verbose output')
def validate(config, verbose):
"""
Validate configuration file without running comparison.
Checks configuration for completeness and correctness, reporting
any errors or warnings.
Example:
drt validate --config ./config.yaml
"""
# Setup logging
log_level = "DEBUG" if verbose else "INFO"
setup_logging(log_level=log_level, log_to_console=True, log_to_file=False)
click.echo("=" * 60)
click.echo("Configuration Validation")
click.echo("=" * 60)
click.echo()
try:
# Load configuration
click.echo(f"Loading: {config}")
cfg = load_config(config)
click.echo("✓ YAML syntax valid")
click.echo("✓ Configuration structure valid")
click.echo()
# Validate configuration
click.echo("Validating configuration...")
is_valid, errors = validate_config(cfg)
if errors:
click.echo()
click.echo("Validation Errors:")
for error in errors:
click.echo(f"{error}", err=True)
click.echo()
# Display configuration summary
click.echo("=" * 60)
click.echo("Configuration Summary")
click.echo("=" * 60)
click.echo(f" Database pairs: {len(cfg.database_pairs)}")
click.echo(f" Tables configured: {len(cfg.tables)}")
click.echo(f" Enabled tables: {sum(1 for t in cfg.tables if t.enabled)}")
click.echo(f" Disabled tables: {sum(1 for t in cfg.tables if not t.enabled)}")
click.echo()
# Check for tables not expected in target
not_expected = sum(1 for t in cfg.tables if not t.expected_in_target)
if not_expected > 0:
click.echo(f"{not_expected} table(s) marked as expected_in_target: false")
# Display database pairs
click.echo()
click.echo("Database Pairs:")
for pair in cfg.database_pairs:
status = "" if pair.enabled else ""
click.echo(f" {status} {pair.name}")
click.echo(f" Baseline: {pair.baseline.server}.{pair.baseline.database}")
click.echo(f" Target: {pair.target.server}.{pair.target.database}")
click.echo()
click.echo("=" * 60)
if is_valid:
click.echo("Configuration is VALID ✓")
click.echo("=" * 60)
sys.exit(0)
else:
click.echo("Configuration is INVALID ✗")
click.echo("=" * 60)
sys.exit(1)
except Exception as e:
logger.error(f"Validation failed: {e}", exc_info=verbose)
click.echo(f"✗ Error: {e}", err=True)
sys.exit(2)

52
src/drt/cli/main.py Executable file

@@ -0,0 +1,52 @@
"""Main CLI entry point."""
import click
import sys
from drt import __version__
from drt.cli.commands import discover, compare, validate, investigate
from drt.utils.logging import setup_logging
@click.group()
@click.version_option(version=__version__, prog_name="drt")
@click.option('--verbose', '-v', is_flag=True, help='Enable verbose output')
@click.pass_context
def cli(ctx, verbose):
"""
Data Regression Testing Framework
A comprehensive framework for validating data integrity during code migration
and system updates by comparing data outputs between Baseline (Production)
and Target (Test) SQL Server databases.
"""
ctx.ensure_object(dict)
ctx.obj['verbose'] = verbose
# Setup logging
log_level = "DEBUG" if verbose else "INFO"
setup_logging(log_level=log_level, log_to_console=True, log_to_file=False)
@cli.command()
def version():
"""Display version information."""
import platform
click.echo("=" * 60)
click.echo("Data Regression Testing Framework")
click.echo("=" * 60)
click.echo(f"Version: {__version__}")
click.echo(f"Python: {platform.python_version()}")
click.echo(f"Platform: {platform.platform()}")
click.echo("=" * 60)
# Register commands
cli.add_command(discover.discover)
cli.add_command(compare.compare)
cli.add_command(validate.validate)
cli.add_command(investigate.investigate)
if __name__ == '__main__':
cli()
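For unit tests, the command group can also be driven in-process with click's CliRunner rather than a shell; a minimal sketch (not part of main.py):

from click.testing import CliRunner
from drt.cli.main import cli

runner = CliRunner()
result = runner.invoke(cli, ["version"])   # exercises the 'version' subcommand registered above
assert result.exit_code == 0
print(result.output)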

src/drt/config/__init__.py Executable file

@@ -0,0 +1,7 @@
"""Configuration management for the framework."""
from drt.config.loader import load_config
from drt.config.validator import validate_config
from drt.config.models import Config
__all__ = ["load_config", "validate_config", "Config"]

src/drt/config/loader.py Executable file

@@ -0,0 +1,84 @@
"""Configuration file loader."""
import yaml
from pathlib import Path
from typing import Union
from drt.config.models import Config
from drt.utils.logging import get_logger
logger = get_logger(__name__)
def load_config(config_path: Union[str, Path]) -> Config:
"""
Load configuration from YAML file.
Args:
config_path: Path to configuration file
Returns:
Parsed configuration object
Raises:
FileNotFoundError: If config file doesn't exist
yaml.YAMLError: If YAML is invalid
ValueError: If configuration is invalid
"""
config_path = Path(config_path)
if not config_path.exists():
raise FileNotFoundError(f"Configuration file not found: {config_path}")
logger.info(f"Loading configuration from: {config_path}")
try:
with open(config_path, "r", encoding="utf-8") as f:
config_data = yaml.safe_load(f)
if not config_data:
raise ValueError("Configuration file is empty")
# Parse with Pydantic
config = Config(**config_data)
logger.info(f"Configuration loaded successfully")
logger.info(f" Database pairs: {len(config.database_pairs)}")
logger.info(f" Tables configured: {len(config.tables)}")
return config
except yaml.YAMLError as e:
logger.error(f"YAML parsing error: {e}")
raise
except Exception as e:
logger.error(f"Configuration loading error: {e}")
raise
def save_config(config: Config, output_path: Union[str, Path]) -> None:
"""
Save configuration to YAML file.
Args:
config: Configuration object to save
output_path: Path where to save the configuration
"""
output_path = Path(output_path)
output_path.parent.mkdir(parents=True, exist_ok=True)
logger.info(f"Saving configuration to: {output_path}")
# Convert to dict and save as YAML
config_dict = config.model_dump(exclude_none=True)
with open(output_path, "w", encoding="utf-8") as f:
yaml.dump(
config_dict,
f,
default_flow_style=False,
sort_keys=False,
allow_unicode=True,
width=100,
)
logger.info(f"Configuration saved successfully")

src/drt/config/models.py Executable file

@@ -0,0 +1,199 @@
"""Pydantic models for configuration."""
from typing import List, Optional, Dict, Any
from pydantic import BaseModel, Field, field_validator
class ConnectionConfig(BaseModel):
"""Database connection configuration."""
server: str
database: str
username: Optional[str] = None
password: Optional[str] = None
timeout: Dict[str, int] = Field(default_factory=lambda: {"connection": 30, "query": 300})
class DatabasePairConfig(BaseModel):
"""Configuration for a database pair to compare."""
name: str
enabled: bool = True
baseline: ConnectionConfig
target: ConnectionConfig
class RowCountConfig(BaseModel):
"""Row count comparison configuration."""
enabled: bool = True
tolerance_percent: float = 0.0
class SchemaConfig(BaseModel):
"""Schema comparison configuration."""
enabled: bool = True
checks: Dict[str, bool] = Field(default_factory=lambda: {
"column_names": True,
"data_types": True,
"nullability": False,
"column_order": False
})
severity: Dict[str, str] = Field(default_factory=lambda: {
"missing_column_in_target": "FAIL",
"extra_column_in_target": "WARNING",
"data_type_mismatch": "WARNING"
})
class AggregatesConfig(BaseModel):
"""Aggregate comparison configuration."""
enabled: bool = True
tolerance_percent: float = 0.01
large_table_threshold: int = 10000000
sample_size: int = 100000
class TableExistenceConfig(BaseModel):
"""Table existence check configuration."""
missing_table_default: str = "FAIL"
extra_table_action: str = "INFO"
class ComparisonConfig(BaseModel):
"""Comparison settings."""
mode: str = "health_check"
row_count: RowCountConfig = Field(default_factory=RowCountConfig)
schema_config: SchemaConfig = Field(default_factory=SchemaConfig, alias="schema")
aggregates: AggregatesConfig = Field(default_factory=AggregatesConfig)
table_existence: TableExistenceConfig = Field(default_factory=TableExistenceConfig)
@property
def schema(self) -> SchemaConfig:
"""Return schema config for backward compatibility."""
return self.schema_config
class Config:
populate_by_name = True
class ExecutionConfig(BaseModel):
"""Execution settings."""
continue_on_error: bool = True
retry: Dict[str, int] = Field(default_factory=lambda: {"attempts": 3, "delay_seconds": 5})
class TableFilterConfig(BaseModel):
"""Table filtering configuration."""
mode: str = "all"
include_list: List[Dict[str, str]] = Field(default_factory=list)
exclude_patterns: List[str] = Field(default_factory=lambda: [
"*_TEMP", "*_TMP", "*_BAK", "*_BACKUP", "*_OLD", "tmp*", "temp*", "#*"
])
exclude_schemas: List[str] = Field(default_factory=lambda: [
"sys", "INFORMATION_SCHEMA", "guest"
])
class TableConfig(BaseModel):
"""Individual table configuration."""
schema_name: str = Field(..., alias="schema")
name: str
enabled: bool = True
expected_in_target: bool = True
estimated_row_count: int = 0
primary_key_columns: List[str] = Field(default_factory=list)
aggregate_columns: List[str] = Field(default_factory=list)
notes: str = ""
@property
def schema(self) -> str:
"""Return schema name for backward compatibility."""
return self.schema_name
class Config:
populate_by_name = True
class ReportingConfig(BaseModel):
"""Reporting configuration."""
output_directory: str = "./reports"
investigation_directory: str = "./investigation_reports"
formats: List[str] = Field(default_factory=lambda: ["html", "csv"])
filename_template: str = "regression_report_{timestamp}"
html: Dict[str, Any] = Field(default_factory=lambda: {
"embed_styles": True,
"include_charts": True,
"colors": {
"pass": "#28a745",
"fail": "#dc3545",
"warning": "#ffc107",
"error": "#6f42c1",
"info": "#17a2b8",
"skip": "#6c757d"
}
})
csv: Dict[str, Any] = Field(default_factory=lambda: {
"delimiter": ",",
"include_header": True,
"encoding": "utf-8-sig"
})
pdf: Dict[str, str] = Field(default_factory=lambda: {
"page_size": "A4",
"orientation": "landscape"
})
class LoggingConfig(BaseModel):
"""Logging configuration."""
level: str = "INFO"
directory: str = "./logs"
filename_template: str = "drt_{timestamp}.log"
console: bool = True
format: str = "%(asctime)s | %(levelname)-8s | %(name)-20s | %(message)s"
date_format: str = "%Y%m%d_%H%M%S"
class DiscoveryConfig(BaseModel):
"""Discovery settings."""
output_file: str = "./config_discovered.yaml"
analysis_directory: str = "./analysis"
include_schemas: List[str] = Field(default_factory=list)
exclude_schemas: List[str] = Field(default_factory=lambda: [
"sys", "INFORMATION_SCHEMA", "guest"
])
exclude_patterns: List[str] = Field(default_factory=lambda: [
"*_TEMP", "*_TMP", "*_BAK", "#*"
])
include_row_counts: bool = True
include_column_details: bool = True
detect_numeric_columns: bool = True
detect_primary_keys: bool = True
default_expected_in_target: bool = True
class MetadataConfig(BaseModel):
"""Configuration metadata."""
config_version: str = "1.0"
generated_date: Optional[str] = None
generated_by: Optional[str] = None
framework_version: str = "1.0.0"
class Config(BaseModel):
"""Main configuration model."""
metadata: MetadataConfig = Field(default_factory=MetadataConfig)
connections: Dict[str, ConnectionConfig] = Field(default_factory=dict)
database_pairs: List[DatabasePairConfig] = Field(default_factory=list)
comparison: ComparisonConfig = Field(default_factory=ComparisonConfig)
execution: ExecutionConfig = Field(default_factory=ExecutionConfig)
table_filters: TableFilterConfig = Field(default_factory=TableFilterConfig)
tables: List[TableConfig] = Field(default_factory=list)
reporting: ReportingConfig = Field(default_factory=ReportingConfig)
logging: LoggingConfig = Field(default_factory=LoggingConfig)
discovery: DiscoveryConfig = Field(default_factory=DiscoveryConfig)
@field_validator('database_pairs')
@classmethod
def validate_database_pairs(cls, v):
"""Ensure at least one database pair is configured."""
if not v:
raise ValueError("At least one database pair must be configured")
return v
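To illustrate how these models fit together, the sketch below validates a minimal configuration dict; the server, database, table, and column names are placeholders:

from drt.config.models import Config

raw = {
    "database_pairs": [{
        "name": "example_pair",
        "baseline": {"server": "BASELINE-SQL01", "database": "SalesDB"},
        "target": {"server": "TARGET-SQL01", "database": "SalesDB"},
    }],
    "tables": [{
        "schema": "dbo",                      # parsed into schema_name via the field alias
        "name": "Orders",
        "aggregate_columns": ["TotalAmount"],
    }],
}

cfg = Config(**raw)                           # raises a ValidationError if the structure is wrong
print(cfg.tables[0].schema, cfg.comparison.row_count.tolerance_percent)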

src/drt/config/validator.py Executable file

@@ -0,0 +1,79 @@
"""Configuration validator."""
from typing import List, Tuple
from drt.config.models import Config
from drt.utils.logging import get_logger
logger = get_logger(__name__)
def validate_config(config: Config) -> Tuple[bool, List[str]]:
"""
Validate configuration for completeness and correctness.
Args:
config: Configuration to validate
Returns:
Tuple of (is_valid, list_of_errors)
"""
errors = []
warnings = []
# Check database pairs
if not config.database_pairs:
errors.append("No database pairs configured")
for pair in config.database_pairs:
if not pair.baseline.server or not pair.baseline.database:
errors.append(f"Database pair '{pair.name}': Baseline connection incomplete")
if not pair.target.server or not pair.target.database:
errors.append(f"Database pair '{pair.name}': Target connection incomplete")
# Check comparison mode
valid_modes = ["health_check", "detailed"]
if config.comparison.mode not in valid_modes:
errors.append(f"Invalid comparison mode: {config.comparison.mode}. Must be one of {valid_modes}")
# Check table configuration
if config.table_filters.mode == "include_list" and not config.table_filters.include_list:
warnings.append("Table filter mode is 'include_list' but include_list is empty")
# Check for tables marked as not expected in target
not_expected_count = sum(1 for t in config.tables if not t.expected_in_target)
if not_expected_count > 0:
warnings.append(f"{not_expected_count} table(s) marked as expected_in_target: false")
# Check for disabled tables
disabled_count = sum(1 for t in config.tables if not t.enabled)
if disabled_count > 0:
warnings.append(f"{disabled_count} table(s) disabled (enabled: false)")
# Check reporting formats
valid_formats = ["html", "csv", "pdf"]
for fmt in config.reporting.formats:
if fmt not in valid_formats:
errors.append(f"Invalid report format: {fmt}. Must be one of {valid_formats}")
# Check logging level
valid_levels = ["DEBUG", "INFO", "WARNING", "ERROR"]
if config.logging.level.upper() not in valid_levels:
errors.append(f"Invalid logging level: {config.logging.level}. Must be one of {valid_levels}")
# Log results
if errors:
logger.error(f"Configuration validation failed with {len(errors)} error(s)")
for error in errors:
logger.error(f"{error}")
if warnings:
logger.warning(f"Configuration has {len(warnings)} warning(s)")
for warning in warnings:
logger.warning(f" ⚠️ {warning}")
if not errors and not warnings:
logger.info("✓ Configuration is valid")
elif not errors:
logger.info("✓ Configuration is valid (with warnings)")
return len(errors) == 0, errors
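Exercised on its own, the validator pairs naturally with the loader; a short sketch with a hypothetical config path:

from drt.config.loader import load_config
from drt.config.validator import validate_config

cfg = load_config("./config.yaml")
is_valid, errors = validate_config(cfg)
if not is_valid:
    for err in errors:
        print(f"config error: {err}")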

src/drt/database/__init__.py Executable file

@@ -0,0 +1,7 @@
"""Database access layer."""
from drt.database.connection import ConnectionManager
from drt.database.executor import QueryExecutor
from drt.database.queries import SQLQueries
__all__ = ["ConnectionManager", "QueryExecutor", "SQLQueries"]

src/drt/database/connection.py Executable file

@@ -0,0 +1,176 @@
"""Database connection management."""
import pyodbc
import platform
from typing import Optional
from contextlib import contextmanager
from drt.config.models import ConnectionConfig
from drt.utils.logging import get_logger
logger = get_logger(__name__)
def get_odbc_driver() -> str:
"""
Detect available ODBC driver for SQL Server.
Returns:
ODBC driver name
"""
# Get list of available drivers
drivers = [driver for driver in pyodbc.drivers() if 'SQL Server' in driver]
# Prefer newer drivers
preferred_order = [
'ODBC Driver 18 for SQL Server',
'ODBC Driver 17 for SQL Server',
'ODBC Driver 13 for SQL Server',
'SQL Server Native Client 11.0',
'SQL Server'
]
for preferred in preferred_order:
if preferred in drivers:
logger.debug(f"Using ODBC driver: {preferred}")
return preferred
# Fallback to first available
if drivers:
logger.warning(f"Using fallback driver: {drivers[0]}")
return drivers[0]
# Default fallback
logger.warning("No SQL Server ODBC driver found, using default")
return 'ODBC Driver 17 for SQL Server'
class ConnectionManager:
"""Manages database connections using Windows Authentication."""
def __init__(self, config: ConnectionConfig):
"""
Initialize connection manager.
Args:
config: Connection configuration
"""
self.config = config
self._connection: Optional[pyodbc.Connection] = None
def connect(self) -> pyodbc.Connection:
"""
Establish database connection using Windows or SQL Authentication.
Returns:
Database connection
Raises:
pyodbc.Error: If connection fails
"""
# disconnect() resets _connection to None; pyodbc connections do not expose a .closed attribute
if self._connection is not None:
return self._connection
try:
# Detect available ODBC driver
driver = get_odbc_driver()
# Build connection string
conn_str_parts = [
f"DRIVER={{{driver}}}",
f"SERVER={self.config.server}",
f"DATABASE={self.config.database}",
f"Connection Timeout={self.config.timeout.get('connection', 30)}"
]
# Check if username/password are provided for SQL Authentication
if hasattr(self.config, 'username') and self.config.username:
conn_str_parts.append(f"UID={self.config.username}")
conn_str_parts.append(f"PWD={self.config.password}")
auth_type = "SQL Authentication"
else:
# Use Windows Authentication
conn_str_parts.append("Trusted_Connection=yes")
auth_type = "Windows Authentication"
# Add TrustServerCertificate on Linux for self-signed certs
if platform.system() != 'Windows':
conn_str_parts.append("TrustServerCertificate=yes")
conn_str = ";".join(conn_str_parts) + ";"
logger.info(f"Connecting to {self.config.server}.{self.config.database}")
logger.debug(f"Connection string: {conn_str.replace(self.config.server, 'SERVER').replace(self.config.password if hasattr(self.config, 'password') and self.config.password else '', '***')}")
self._connection = pyodbc.connect(conn_str)
# Set query timeout
query_timeout = self.config.timeout.get('query', 300)
self._connection.timeout = query_timeout
logger.info(f"✓ Connected ({auth_type})")
return self._connection
except pyodbc.Error as e:
logger.error(f"Connection failed: {e}")
raise
def disconnect(self) -> None:
"""Close database connection."""
if self._connection is not None:
self._connection.close()
logger.info("Connection closed")
self._connection = None
@contextmanager
def get_connection(self):
"""
Context manager for database connections.
Yields:
Database connection
Example:
with conn_mgr.get_connection() as conn:
cursor = conn.cursor()
cursor.execute("SELECT 1")
"""
conn = self.connect()
try:
yield conn
finally:
# Don't close connection here - reuse it
pass
def test_connection(self) -> bool:
"""
Test database connectivity.
Returns:
True if connection successful, False otherwise
"""
try:
with self.get_connection() as conn:
cursor = conn.cursor()
cursor.execute("SELECT 1")
cursor.fetchone()
return True
except Exception as e:
logger.error(f"Connection test failed: {e}")
return False
@property
def is_connected(self) -> bool:
"""Check if connection is active."""
return self._connection is not None
def __enter__(self):
"""Context manager entry."""
self.connect()
return self
def __exit__(self, exc_type, exc_val, exc_tb):
"""Context manager exit."""
self.disconnect()
def __del__(self):
"""Cleanup on deletion."""
self.disconnect()
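A usage sketch for ConnectionManager; the server and database names are hypothetical and it relies only on the methods defined above:

from drt.config.models import ConnectionConfig
from drt.database.connection import ConnectionManager

conn_cfg = ConnectionConfig(server="BASELINE-SQL01", database="SalesDB")   # placeholder values

with ConnectionManager(conn_cfg) as conn_mgr:      # __enter__ connects, __exit__ disconnects
    if conn_mgr.test_connection():
        with conn_mgr.get_connection() as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT @@VERSION")
            print(cursor.fetchone()[0])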

src/drt/database/executor.py Executable file

@@ -0,0 +1,267 @@
"""Query executor for READ ONLY database operations."""
import pandas as pd
import time
from typing import Any, Dict, List, Optional, Tuple
from drt.database.connection import ConnectionManager
from drt.database.queries import SQLQueries
from drt.models.enums import Status
from drt.utils.logging import get_logger
logger = get_logger(__name__)
class QueryExecutor:
"""Executes READ ONLY queries against the database."""
def __init__(self, connection_manager: ConnectionManager):
"""
Initialize query executor.
Args:
connection_manager: Connection manager instance
"""
self.conn_mgr = connection_manager
def execute_query(self, query: str, params: tuple = None) -> pd.DataFrame:
"""
Execute a SELECT query and return results as DataFrame.
Args:
query: SQL query string (SELECT only)
params: Query parameters
Returns:
Query results as pandas DataFrame
Raises:
ValueError: If query is not a SELECT statement
Exception: If query execution fails
"""
# Safety check - only allow SELECT queries
query_upper = query.strip().upper()
if not query_upper.startswith('SELECT'):
raise ValueError("Only SELECT queries are allowed (READ ONLY)")
try:
with self.conn_mgr.get_connection() as conn:
if params:
df = pd.read_sql(query, conn, params=params)
else:
df = pd.read_sql(query, conn)
return df
except Exception as e:
logger.error(f"Query execution failed: {e}")
logger.debug(f"Query: {query}")
raise
def execute_scalar(self, query: str, params: tuple = None) -> Any:
"""
Execute query and return single scalar value.
Args:
query: SQL query string
params: Query parameters
Returns:
Single scalar value
"""
df = self.execute_query(query, params)
if df.empty:
return None
return df.iloc[0, 0]
def get_row_count(self, schema: str, table: str) -> int:
"""
Get row count for a table.
Args:
schema: Schema name
table: Table name
Returns:
Row count
"""
query = SQLQueries.build_row_count_query(schema, table)
count = self.execute_scalar(query)
return int(count) if count is not None else 0
def table_exists(self, schema: str, table: str) -> bool:
"""
Check if table exists.
Args:
schema: Schema name
table: Table name
Returns:
True if table exists, False otherwise
"""
count = self.execute_scalar(SQLQueries.CHECK_TABLE_EXISTS, (schema, table))
return int(count) > 0 if count is not None else False
def get_all_tables(self) -> List[Dict[str, Any]]:
"""
Get list of all user tables in the database.
Returns:
List of table information dictionaries
"""
df = self.execute_query(SQLQueries.GET_ALL_TABLES)
return df.to_dict('records')
def get_columns(self, schema: str, table: str) -> List[Dict[str, Any]]:
"""
Get column information for a table.
Args:
schema: Schema name
table: Table name
Returns:
List of column information dictionaries
"""
df = self.execute_query(SQLQueries.GET_COLUMNS, (schema, table))
return df.to_dict('records')
def get_primary_keys(self, schema: str, table: str) -> List[str]:
"""
Get primary key columns for a table.
Args:
schema: Schema name
table: Table name
Returns:
List of primary key column names
"""
# Diagnostic: Check what columns are available in CONSTRAINT_COLUMN_USAGE
try:
logger.debug("Checking CONSTRAINT_COLUMN_USAGE schema...")
constraint_cols_df = self.execute_query(SQLQueries.GET_CONSTRAINT_COLUMNS_SCHEMA)
logger.debug(f"CONSTRAINT_COLUMN_USAGE columns: {constraint_cols_df['COLUMN_NAME'].tolist()}")
except Exception as e:
logger.debug(f"Could not query CONSTRAINT_COLUMN_USAGE schema: {e}")
# Diagnostic: Check what columns are available in KEY_COLUMN_USAGE
try:
logger.debug("Checking KEY_COLUMN_USAGE schema...")
key_cols_df = self.execute_query(SQLQueries.GET_KEY_COLUMNS_SCHEMA)
logger.debug(f"KEY_COLUMN_USAGE columns: {key_cols_df['COLUMN_NAME'].tolist()}")
except Exception as e:
logger.debug(f"Could not query KEY_COLUMN_USAGE schema: {e}")
df = self.execute_query(SQLQueries.GET_PRIMARY_KEYS, (schema, table))
return df['COLUMN_NAME'].tolist() if not df.empty else []
def get_aggregate_sums(self, schema: str, table: str, columns: List[str]) -> Dict[str, float]:
"""
Get aggregate sums for numeric columns.
Args:
schema: Schema name
table: Table name
columns: List of column names to aggregate
Returns:
Dictionary mapping column names to their sums
"""
if not columns:
return {}
query = SQLQueries.build_aggregate_query(schema, table, columns)
if not query:
return {}
df = self.execute_query(query)
if df.empty:
return {col: 0.0 for col in columns}
# Extract results
results = {}
for col in columns:
sum_col = f"{col}_sum"
if sum_col in df.columns:
value = df.iloc[0][sum_col]
results[col] = float(value) if pd.notna(value) else 0.0
else:
results[col] = 0.0
return results
def execute_investigation_query(
self,
query: str,
timeout: Optional[int] = None
) -> Tuple[Status, Optional[pd.DataFrame], Optional[str], int]:
"""
Execute investigation query with comprehensive error handling.
This method is specifically for investigation queries and does NOT
enforce the SELECT-only restriction. It handles errors gracefully
and returns detailed status information.
Args:
query: SQL query to execute
timeout: Query timeout in seconds (optional)
Returns:
Tuple of (status, result_df, error_message, execution_time_ms)
"""
start_time = time.time()
try:
# Execute query
with self.conn_mgr.get_connection() as conn:
if timeout:
# Apply the timeout via pyodbc's per-connection attribute (the same mechanism used in
# ConnectionManager.connect); T-SQL has no SET QUERY_TIMEOUT statement
try:
conn.timeout = timeout
except Exception:
# Timeout setting not supported by the driver, continue anyway
pass
df = pd.read_sql(query, conn)
execution_time = int((time.time() - start_time) * 1000)
return (Status.PASS, df, None, execution_time)
except Exception as e:
execution_time = int((time.time() - start_time) * 1000)
error_msg = str(e)
error_type = type(e).__name__
# Categorize error
if any(phrase in error_msg.lower() for phrase in [
'does not exist',
'invalid object name',
'could not find',
'not found'
]):
status = Status.SKIP
message = f"Object not found: {error_msg}"
elif 'timeout' in error_msg.lower():
status = Status.FAIL
message = f"Query timeout: {error_msg}"
elif any(phrase in error_msg.lower() for phrase in [
'syntax error',
'incorrect syntax'
]):
status = Status.FAIL
message = f"Syntax error: {error_msg}"
elif 'permission' in error_msg.lower():
status = Status.FAIL
message = f"Permission denied: {error_msg}"
else:
status = Status.FAIL
message = f"{error_type}: {error_msg}"
logger.debug(f"Query execution failed: {message}")
return (status, None, message, execution_time)
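Wiring the executor to a connection manager might look like the sketch below; the schema, table, and column names are placeholders:

from drt.config.models import ConnectionConfig
from drt.database.connection import ConnectionManager
from drt.database.executor import QueryExecutor

conn_mgr = ConnectionManager(ConnectionConfig(server="BASELINE-SQL01", database="SalesDB"))
executor = QueryExecutor(conn_mgr)

if executor.table_exists("dbo", "Orders"):
    rows = executor.get_row_count("dbo", "Orders")
    sums = executor.get_aggregate_sums("dbo", "Orders", ["TotalAmount"])
    print(f"dbo.Orders: {rows} rows, SUM(TotalAmount) = {sums.get('TotalAmount')}")

conn_mgr.disconnect()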

src/drt/database/queries.py Executable file

@@ -0,0 +1,128 @@
"""SQL query templates for database operations."""
class SQLQueries:
"""Collection of SQL query templates (READ ONLY)."""
# Table discovery queries
GET_ALL_TABLES = """
SELECT
s.name AS schema_name,
t.name AS table_name,
SUM(p.rows) AS estimated_rows
FROM sys.tables t WITH (NOLOCK)
INNER JOIN sys.schemas s WITH (NOLOCK) ON t.schema_id = s.schema_id
INNER JOIN sys.partitions p WITH (NOLOCK) ON t.object_id = p.object_id
WHERE t.type = 'U'
AND p.index_id IN (0, 1)
GROUP BY s.name, t.name
ORDER BY s.name, t.name
"""
GET_COLUMNS = """
SELECT
COLUMN_NAME,
DATA_TYPE,
CHARACTER_MAXIMUM_LENGTH,
NUMERIC_PRECISION,
NUMERIC_SCALE,
IS_NULLABLE,
ORDINAL_POSITION
FROM INFORMATION_SCHEMA.COLUMNS WITH (NOLOCK)
WHERE TABLE_SCHEMA = ?
AND TABLE_NAME = ?
ORDER BY ORDINAL_POSITION
"""
# Diagnostic query to check available columns in CONSTRAINT_COLUMN_USAGE
GET_CONSTRAINT_COLUMNS_SCHEMA = """
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS WITH (NOLOCK)
WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'
AND TABLE_NAME = 'CONSTRAINT_COLUMN_USAGE'
ORDER BY ORDINAL_POSITION
"""
# Diagnostic query to check available columns in KEY_COLUMN_USAGE
GET_KEY_COLUMNS_SCHEMA = """
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS WITH (NOLOCK)
WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'
AND TABLE_NAME = 'KEY_COLUMN_USAGE'
ORDER BY ORDINAL_POSITION
"""
GET_PRIMARY_KEYS = """
SELECT
c.COLUMN_NAME
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc WITH (NOLOCK)
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE c WITH (NOLOCK)
ON tc.CONSTRAINT_NAME = c.CONSTRAINT_NAME
WHERE tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
AND tc.TABLE_SCHEMA = ?
AND tc.TABLE_NAME = ?
"""
# Comparison queries
GET_ROW_COUNT = """
SELECT COUNT(*) AS row_count
FROM [{schema}].[{table}] WITH (NOLOCK)
"""
CHECK_TABLE_EXISTS = """
SELECT COUNT(*) AS table_exists
FROM INFORMATION_SCHEMA.TABLES WITH (NOLOCK)
WHERE TABLE_SCHEMA = ?
AND TABLE_NAME = ?
"""
GET_AGGREGATE_SUMS = """
SELECT {column_expressions}
FROM [{schema}].[{table}] WITH (NOLOCK)
"""
@staticmethod
def build_row_count_query(schema: str, table: str) -> str:
"""Build row count query for a specific table."""
return SQLQueries.GET_ROW_COUNT.format(schema=schema, table=table)
@staticmethod
def build_aggregate_query(schema: str, table: str, columns: list[str]) -> str:
"""
Build aggregate query for numeric columns.
Args:
schema: Schema name
table: Table name
columns: List of column names to aggregate
Returns:
SQL query string
"""
if not columns:
return None
# Build column expressions
column_expressions = []
for col in columns:
# Cast to FLOAT to handle different numeric types
expr = f"SUM(CAST([{col}] AS FLOAT)) AS [{col}_sum]"
column_expressions.append(expr)
column_expr_str = ",\n ".join(column_expressions)
return SQLQueries.GET_AGGREGATE_SUMS.format(
schema=schema,
table=table,
column_expressions=column_expr_str
)
@staticmethod
def is_numeric_type(data_type: str) -> bool:
"""Check if a data type is numeric."""
numeric_types = {
'int', 'bigint', 'smallint', 'tinyint',
'decimal', 'numeric', 'float', 'real',
'money', 'smallmoney'
}
return data_type.lower() in numeric_types
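For example, build_aggregate_query for two hypothetical columns produces a single SUM statement roughly of this shape:

from drt.database.queries import SQLQueries

sql = SQLQueries.build_aggregate_query("dbo", "Orders", ["TotalAmount", "Quantity"])
print(sql)
# SELECT SUM(CAST([TotalAmount] AS FLOAT)) AS [TotalAmount_sum],
#        SUM(CAST([Quantity] AS FLOAT)) AS [Quantity_sum]
# FROM [dbo].[Orders] WITH (NOLOCK)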

src/drt/models/__init__.py Executable file

@@ -0,0 +1,16 @@
"""Data models for the regression testing framework."""
from drt.models.enums import Status, CheckType
from drt.models.table import TableInfo, ColumnInfo
from drt.models.results import ComparisonResult, CheckResult
from drt.models.summary import ExecutionSummary
__all__ = [
"Status",
"CheckType",
"TableInfo",
"ColumnInfo",
"ComparisonResult",
"CheckResult",
"ExecutionSummary",
]

src/drt/models/enums.py Executable file

@@ -0,0 +1,49 @@
"""Enumerations for status and check types."""
from enum import Enum
class Status(str, Enum):
"""Result status enumeration."""
PASS = "PASS"
FAIL = "FAIL"
WARNING = "WARNING"
ERROR = "ERROR"
INFO = "INFO"
SKIP = "SKIP"
def __str__(self) -> str:
return self.value
@property
def severity(self) -> int:
"""Return severity level for comparison (higher = more severe)."""
severity_map = {
Status.ERROR: 6,
Status.FAIL: 5,
Status.WARNING: 4,
Status.INFO: 3,
Status.PASS: 2,
Status.SKIP: 1,
}
return severity_map[self]
@classmethod
def most_severe(cls, statuses: list["Status"]) -> "Status":
"""Return the most severe status from a list."""
if not statuses:
return cls.SKIP
return max(statuses, key=lambda s: s.severity)
class CheckType(str, Enum):
"""Type of comparison check."""
EXISTENCE = "TABLE_EXISTENCE"
ROW_COUNT = "ROW_COUNT"
SCHEMA = "SCHEMA"
AGGREGATE = "AGGREGATE"
def __str__(self) -> str:
return self.value
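The severity ordering is what lets per-check statuses roll up into a single table-level status; for instance:

from drt.models.enums import Status

checks = [Status.PASS, Status.WARNING, Status.PASS]
print(Status.most_severe(checks))                       # WARNING - the most severe status wins
print(Status.FAIL.severity > Status.WARNING.severity)   # True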


@@ -0,0 +1,70 @@
"""Data models for investigation feature."""
from dataclasses import dataclass, field
from typing import List, Optional
import pandas as pd
from drt.models.enums import Status
@dataclass
class QueryExecutionResult:
"""Result of executing a single query."""
query_number: int
query_text: str
status: Status
execution_time_ms: int
result_data: Optional[pd.DataFrame] = None
error_message: Optional[str] = None
row_count: int = 0
@dataclass
class TableInvestigationResult:
"""Results for all queries in a table's investigation."""
schema: str
table: str
sql_file_path: str
baseline_results: List[QueryExecutionResult]
target_results: List[QueryExecutionResult]
overall_status: Status
timestamp: str
@property
def full_name(self) -> str:
"""Get full table name."""
return f"{self.schema}.{self.table}"
@property
def total_queries(self) -> int:
"""Get total number of queries."""
return len(self.baseline_results)
@property
def successful_queries(self) -> int:
"""Get number of successful queries."""
all_results = self.baseline_results + self.target_results
return sum(1 for r in all_results if r.status == Status.PASS)
@dataclass
class InvestigationSummary:
"""Overall investigation execution summary."""
start_time: str
end_time: str
duration_seconds: int
analysis_directory: str
baseline_info: str
target_info: str
tables_processed: int
tables_successful: int
tables_partial: int
tables_failed: int
total_queries_executed: int
results: List[TableInvestigationResult] = field(default_factory=list)
@property
def success_rate(self) -> float:
"""Calculate success rate percentage."""
if self.tables_processed == 0:
return 0.0
return (self.tables_successful / self.tables_processed) * 100

src/drt/models/results.py Executable file

@@ -0,0 +1,49 @@
"""Result models for comparison operations."""
from typing import Any, Dict, Optional
from pydantic import BaseModel, Field
from drt.models.enums import Status, CheckType
from drt.models.table import TableInfo
class CheckResult(BaseModel):
"""Result of a single check operation."""
check_type: CheckType
status: Status
baseline_value: Any = None
target_value: Any = None
difference: Any = None
message: str = ""
details: Dict[str, Any] = Field(default_factory=dict)
class Config:
arbitrary_types_allowed = True
class ComparisonResult(BaseModel):
"""Result of comparing a single table."""
table: TableInfo
overall_status: Status
check_results: list[CheckResult] = Field(default_factory=list)
execution_time_ms: int = 0
error_message: str = ""
timestamp: str = ""
def add_check(self, check_result: CheckResult) -> None:
"""Add a check result and update overall status."""
self.check_results.append(check_result)
# Update overall status to most severe
all_statuses = [cr.status for cr in self.check_results]
self.overall_status = Status.most_severe(all_statuses)
def get_check(self, check_type: CheckType) -> Optional[CheckResult]:
"""Get check result by type."""
for check in self.check_results:
if check.check_type == check_type:
return check
return None
class Config:
arbitrary_types_allowed = True
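A small sketch of how a ComparisonResult accumulates checks and recomputes its overall status; the table and values are placeholders:

from drt.models.enums import Status, CheckType
from drt.models.results import CheckResult, ComparisonResult
from drt.models.table import TableInfo

result = ComparisonResult(table=TableInfo(schema="dbo", name="Orders"), overall_status=Status.SKIP)
result.add_check(CheckResult(check_type=CheckType.ROW_COUNT, status=Status.PASS,
                             baseline_value=1000, target_value=1000, difference=0))
result.add_check(CheckResult(check_type=CheckType.SCHEMA, status=Status.WARNING,
                             message="Extra column in target"))
print(result.overall_status)    # WARNING - most severe of the recorded checks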

src/drt/models/summary.py Executable file

@@ -0,0 +1,65 @@
"""Execution summary model."""
from typing import List
from pydantic import BaseModel, Field
from drt.models.results import ComparisonResult
from drt.models.enums import Status
class ExecutionSummary(BaseModel):
"""Summary of an entire test execution."""
start_time: str
end_time: str
duration_seconds: int
total_tables: int = 0
passed: int = 0
failed: int = 0
warnings: int = 0
errors: int = 0
skipped: int = 0
info: int = 0
results: List[ComparisonResult] = Field(default_factory=list)
config_file: str = ""
baseline_info: str = ""
target_info: str = ""
def add_result(self, result: ComparisonResult) -> None:
"""Add a comparison result and update counters."""
self.results.append(result)
self.total_tables += 1
# Update status counters
status = result.overall_status
if status == Status.PASS:
self.passed += 1
elif status == Status.FAIL:
self.failed += 1
elif status == Status.WARNING:
self.warnings += 1
elif status == Status.ERROR:
self.errors += 1
elif status == Status.INFO:
self.info += 1
elif status == Status.SKIP:
self.skipped += 1
@property
def has_failures(self) -> bool:
"""Check if there are any failures."""
return self.failed > 0
@property
def has_errors(self) -> bool:
"""Check if there are any errors."""
return self.errors > 0
@property
def success_rate(self) -> float:
"""Calculate success rate percentage."""
if self.total_tables == 0:
return 0.0
return (self.passed / self.total_tables) * 100
class Config:
arbitrary_types_allowed = True
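An ExecutionSummary then aggregates those per-table results and keeps its counters in sync; a minimal sketch with placeholder values:

from drt.models.enums import Status
from drt.models.results import ComparisonResult
from drt.models.summary import ExecutionSummary
from drt.models.table import TableInfo

summary = ExecutionSummary(start_time="20240101_080000", end_time="20240101_081000",
                           duration_seconds=600)
summary.add_result(ComparisonResult(table=TableInfo(schema="dbo", name="Orders"),
                                    overall_status=Status.PASS))
print(summary.total_tables, summary.passed, f"{summary.success_rate:.1f}%")   # 1 1 100.0%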

src/drt/models/table.py Executable file

@@ -0,0 +1,53 @@
"""Table and column information models."""
from typing import List, Optional
from pydantic import BaseModel, Field
class ColumnInfo(BaseModel):
"""Information about a database column."""
name: str
data_type: str
max_length: Optional[int] = None
precision: Optional[int] = None
scale: Optional[int] = None
is_nullable: bool = True
is_numeric: bool = False
ordinal_position: int
class Config:
frozen = True
class TableInfo(BaseModel):
"""Information about a database table."""
schema_name: str = Field(..., alias="schema")
name: str
estimated_row_count: int = 0
columns: List[ColumnInfo] = Field(default_factory=list)
primary_key_columns: List[str] = Field(default_factory=list)
enabled: bool = True
expected_in_target: bool = True
aggregate_columns: List[str] = Field(default_factory=list)
notes: str = ""
@property
def schema(self) -> str:
"""Return schema name for backward compatibility."""
return self.schema_name
@property
def full_name(self) -> str:
"""Return fully qualified table name."""
return f"{self.schema_name}.{self.name}"
@property
def numeric_columns(self) -> List[ColumnInfo]:
"""Return list of numeric columns."""
return [col for col in self.columns if col.is_numeric]
class Config:
frozen = False
populate_by_name = True # Allow both 'schema' and 'schema_name'

src/drt/reporting/__init__.py Executable file

@@ -0,0 +1,7 @@
"""Reporting module for generating test reports."""
from drt.reporting.generator import ReportGenerator
from drt.reporting.html import HTMLReportGenerator
from drt.reporting.csv import CSVReportGenerator
__all__ = ["ReportGenerator", "HTMLReportGenerator", "CSVReportGenerator"]

src/drt/reporting/csv.py Executable file

@@ -0,0 +1,97 @@
"""CSV report generator."""
import csv
from pathlib import Path
from drt.models.summary import ExecutionSummary
from drt.models.enums import CheckType
from drt.config.models import Config
from drt.utils.logging import get_logger
logger = get_logger(__name__)
class CSVReportGenerator:
"""Generates CSV format reports."""
def __init__(self, config: Config):
"""
Initialize CSV generator.
Args:
config: Configuration object
"""
self.config = config
def generate(self, summary: ExecutionSummary, filepath: Path) -> None:
"""
Generate CSV report.
Args:
summary: Execution summary
filepath: Output file path
"""
csv_config = self.config.reporting.csv
delimiter = csv_config.get("delimiter", ",")
encoding = csv_config.get("encoding", "utf-8-sig")
with open(filepath, "w", newline="", encoding=encoding) as f:
writer = csv.writer(f, delimiter=delimiter)
# Write header
writer.writerow([
"Timestamp",
"Schema",
"Table",
"Overall_Status",
"Existence_Status",
"RowCount_Status",
"Baseline_Rows",
"Target_Rows",
"Row_Difference",
"Row_Diff_Pct",
"Schema_Status",
"Schema_Details",
"Aggregate_Status",
"Aggregate_Details",
"Expected_In_Target",
"Notes",
"Execution_Time_Ms"
])
# Write data rows
for result in summary.results:
# Get check results
existence = result.get_check(CheckType.EXISTENCE)
row_count = result.get_check(CheckType.ROW_COUNT)
schema = result.get_check(CheckType.SCHEMA)
aggregate = result.get_check(CheckType.AGGREGATE)
# Extract values
baseline_rows = row_count.baseline_value if row_count else "N/A"
target_rows = row_count.target_value if row_count else "N/A"
row_diff = row_count.difference if row_count else "N/A"
row_diff_pct = ""
if row_count and row_count.baseline_value and row_count.baseline_value > 0:
row_diff_pct = f"{(row_count.difference / row_count.baseline_value * 100):.2f}%"
writer.writerow([
result.timestamp,
result.table.schema,
result.table.name,
result.overall_status.value,
existence.status.value if existence else "N/A",
row_count.status.value if row_count else "N/A",
baseline_rows,
target_rows,
row_diff,
row_diff_pct,
schema.status.value if schema else "N/A",
schema.message if schema else "",
aggregate.status.value if aggregate else "N/A",
aggregate.message if aggregate else "",
result.table.expected_in_target,
result.table.notes,
result.execution_time_ms
])
logger.debug(f"CSV report written to {filepath}")

src/drt/reporting/generator.py Executable file

@@ -0,0 +1,84 @@
"""Report generator orchestrator."""
from pathlib import Path
from typing import List
from drt.models.summary import ExecutionSummary
from drt.config.models import Config
from drt.reporting.html import HTMLReportGenerator
from drt.reporting.csv import CSVReportGenerator
from drt.utils.logging import get_logger
from drt.utils.timestamps import get_timestamp
logger = get_logger(__name__)
class ReportGenerator:
"""Orchestrates report generation in multiple formats."""
def __init__(self, config: Config):
"""
Initialize report generator.
Args:
config: Configuration object
"""
self.config = config
# Use absolute path from config
self.output_dir = Path(config.reporting.output_directory).expanduser().resolve()
self.output_dir.mkdir(parents=True, exist_ok=True)
def generate_reports(self, summary: ExecutionSummary) -> List[str]:
"""
Generate reports in all configured formats.
Args:
summary: Execution summary
Returns:
List of generated report file paths
"""
logger.info("Generating reports...")
generated_files = []
timestamp = summary.start_time
# Generate filename
filename_base = self.config.reporting.filename_template.format(
timestamp=timestamp,
config_name="regression"
)
for fmt in self.config.reporting.formats:
try:
if fmt == "html":
filepath = self._generate_html(summary, filename_base)
generated_files.append(filepath)
elif fmt == "csv":
filepath = self._generate_csv(summary, filename_base)
generated_files.append(filepath)
elif fmt == "pdf":
logger.warning("PDF generation not yet implemented")
else:
logger.warning(f"Unknown report format: {fmt}")
except Exception as e:
logger.error(f"Failed to generate {fmt} report: {e}")
logger.info(f"Generated {len(generated_files)} report(s)")
return generated_files
def _generate_html(self, summary: ExecutionSummary, filename_base: str) -> str:
"""Generate HTML report."""
generator = HTMLReportGenerator(self.config)
filepath = self.output_dir / f"{filename_base}.html"
generator.generate(summary, filepath)
logger.info(f"✓ HTML: {filepath}")
return str(filepath)
def _generate_csv(self, summary: ExecutionSummary, filename_base: str) -> str:
"""Generate CSV report."""
generator = CSVReportGenerator(self.config)
filepath = self.output_dir / f"{filename_base}.csv"
generator.generate(summary, filepath)
logger.info(f"✓ CSV: {filepath}")
return str(filepath)
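End to end, report generation takes a loaded Config and a populated ExecutionSummary; a sketch with placeholder inputs:

from drt.config.loader import load_config
from drt.models.summary import ExecutionSummary
from drt.reporting.generator import ReportGenerator

cfg = load_config("./config.yaml")                          # hypothetical path
summary = ExecutionSummary(start_time="20240101_080000",    # normally produced by the comparison run
                           end_time="20240101_081000", duration_seconds=600)
for path in ReportGenerator(cfg).generate_reports(summary):
    print(f"Report written: {path}")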

src/drt/reporting/html.py Executable file

@@ -0,0 +1,239 @@
"""HTML report generator."""
from pathlib import Path
from drt.models.summary import ExecutionSummary
from drt.models.enums import Status, CheckType
from drt.config.models import Config
from drt.utils.logging import get_logger
from drt.utils.timestamps import format_duration
logger = get_logger(__name__)
class HTMLReportGenerator:
"""Generates HTML format reports."""
def __init__(self, config: Config):
"""
Initialize HTML generator.
Args:
config: Configuration object
"""
self.config = config
self.colors = config.reporting.html.get("colors", {})
def generate(self, summary: ExecutionSummary, filepath: Path) -> None:
"""
Generate HTML report.
Args:
summary: Execution summary
filepath: Output file path
"""
html_content = self._build_html(summary)
with open(filepath, "w", encoding="utf-8") as f:
f.write(html_content)
logger.debug(f"HTML report written to {filepath}")
def _build_html(self, summary: ExecutionSummary) -> str:
"""Build complete HTML document."""
return f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Data Regression Test Report - {summary.start_time}</title>
{self._get_styles()}
</head>
<body>
<div class="container">
{self._build_header(summary)}
{self._build_summary(summary)}
{self._build_failures(summary)}
{self._build_warnings(summary)}
{self._build_detailed_results(summary)}
{self._build_footer(summary)}
</div>
</body>
</html>"""
def _get_styles(self) -> str:
"""Get embedded CSS styles."""
return """<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body { font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; background: #f5f5f5; padding: 20px; }
.container { max-width: 1400px; margin: 0 auto; background: white; padding: 30px; border-radius: 8px; box-shadow: 0 2px 10px rgba(0,0,0,0.1); }
h1 { color: #333; border-bottom: 3px solid #007bff; padding-bottom: 10px; margin-bottom: 20px; }
h2 { color: #555; margin-top: 30px; margin-bottom: 15px; border-left: 4px solid #007bff; padding-left: 10px; }
.header { background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 8px; margin-bottom: 30px; }
.header h1 { color: white; border: none; }
.info-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(250px, 1fr)); gap: 15px; margin: 20px 0; }
.info-box { background: #f8f9fa; padding: 15px; border-radius: 5px; border-left: 4px solid #007bff; }
.info-label { font-weight: bold; color: #666; font-size: 0.9em; }
.info-value { color: #333; font-size: 1.1em; margin-top: 5px; }
.summary-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(150px, 1fr)); gap: 15px; margin: 20px 0; }
.summary-box { padding: 20px; border-radius: 8px; text-align: center; color: white; }
.summary-box.pass { background: #28a745; }
.summary-box.fail { background: #dc3545; }
.summary-box.warning { background: #ffc107; color: #333; }
.summary-box.error { background: #6f42c1; }
.summary-box.info { background: #17a2b8; }
.summary-box.skip { background: #6c757d; }
.summary-number { font-size: 2.5em; font-weight: bold; }
.summary-label { font-size: 0.9em; margin-top: 5px; }
.summary-percent { font-size: 0.8em; opacity: 0.9; }
table { width: 100%; border-collapse: collapse; margin: 20px 0; }
th { background: #007bff; color: white; padding: 12px; text-align: left; font-weight: 600; }
td { padding: 10px 12px; border-bottom: 1px solid #dee2e6; }
tr:hover { background: #f8f9fa; }
.status-badge { display: inline-block; padding: 4px 12px; border-radius: 12px; font-size: 0.85em; font-weight: 600; }
.status-PASS { background: #d4edda; color: #155724; }
.status-FAIL { background: #f8d7da; color: #721c24; }
.status-WARNING { background: #fff3cd; color: #856404; }
.status-ERROR { background: #e7d6f5; color: #4a148c; }
.status-INFO { background: #d1ecf1; color: #0c5460; }
.status-SKIP { background: #e2e3e5; color: #383d41; }
.failure-box { background: #fff5f5; border: 1px solid #feb2b2; border-radius: 5px; padding: 15px; margin: 10px 0; }
.failure-title { font-weight: bold; color: #c53030; margin-bottom: 8px; }
.failure-detail { color: #666; margin: 5px 0; font-size: 0.95em; }
.footer { margin-top: 40px; padding-top: 20px; border-top: 1px solid #dee2e6; text-align: center; color: #666; font-size: 0.9em; }
</style>"""
def _build_header(self, summary: ExecutionSummary) -> str:
"""Build report header."""
return f"""<div class="header">
<h1>📊 Data Regression Test Report</h1>
<p>Generated: {summary.start_time}</p>
</div>
<div class="info-grid">
<div class="info-box">
<div class="info-label">Start Time</div>
<div class="info-value">{summary.start_time}</div>
</div>
<div class="info-box">
<div class="info-label">End Time</div>
<div class="info-value">{summary.end_time}</div>
</div>
<div class="info-box">
<div class="info-label">Duration</div>
<div class="info-value">{format_duration(summary.duration_seconds)}</div>
</div>
<div class="info-box">
<div class="info-label">Baseline</div>
<div class="info-value">{summary.baseline_info}</div>
</div>
<div class="info-box">
<div class="info-label">Target</div>
<div class="info-value">{summary.target_info}</div>
</div>
<div class="info-box">
<div class="info-label">Total Tables</div>
<div class="info-value">{summary.total_tables}</div>
</div>
</div>"""
def _build_summary(self, summary: ExecutionSummary) -> str:
"""Build summary section."""
return f"""<h2>Summary</h2>
<div class="summary-grid">
<div class="summary-box pass">
<div class="summary-number">{summary.passed}</div>
<div class="summary-label">PASS</div>
<div class="summary-percent">{(summary.passed/summary.total_tables*100) if summary.total_tables > 0 else 0:.1f}%</div>
</div>
<div class="summary-box fail">
<div class="summary-number">{summary.failed}</div>
<div class="summary-label">FAIL</div>
<div class="summary-percent">{(summary.failed/summary.total_tables*100) if summary.total_tables > 0 else 0:.1f}%</div>
</div>
<div class="summary-box warning">
<div class="summary-number">{summary.warnings}</div>
<div class="summary-label">WARNING</div>
<div class="summary-percent">{(summary.warnings/summary.total_tables*100) if summary.total_tables > 0 else 0:.1f}%</div>
</div>
<div class="summary-box error">
<div class="summary-number">{summary.errors}</div>
<div class="summary-label">ERROR</div>
<div class="summary-percent">{(summary.errors/summary.total_tables*100) if summary.total_tables > 0 else 0:.1f}%</div>
</div>
<div class="summary-box info">
<div class="summary-number">{summary.info}</div>
<div class="summary-label">INFO</div>
<div class="summary-percent">{(summary.info/summary.total_tables*100) if summary.total_tables > 0 else 0:.1f}%</div>
</div>
<div class="summary-box skip">
<div class="summary-number">{summary.skipped}</div>
<div class="summary-label">SKIP</div>
<div class="summary-percent">{(summary.skipped/summary.total_tables*100) if summary.total_tables > 0 else 0:.1f}%</div>
</div>
</div>"""
def _build_failures(self, summary: ExecutionSummary) -> str:
"""Build failures section."""
failures = [r for r in summary.results if r.overall_status == Status.FAIL]
if not failures:
return ""
html = '<h2>❌ Failures (Immediate Action Required)</h2>'
for result in failures:
html += f"""<div class="failure-box">
<div class="failure-title">{result.table.full_name}</div>"""
for check in result.check_results:
if check.status == Status.FAIL:
html += f'<div class="failure-detail">• {check.check_type.value}: {check.message}</div>'
html += '</div>'
return html
def _build_warnings(self, summary: ExecutionSummary) -> str:
"""Build warnings section."""
warnings = [r for r in summary.results if r.overall_status == Status.WARNING]
if not warnings:
return ""
html = '<h2>⚠️ Warnings</h2><ul>'
for result in warnings:
for check in result.check_results:
if check.status == Status.WARNING:
html += f'<li><strong>{result.table.full_name}</strong>: {check.message}</li>'
html += '</ul>'
return html
def _build_detailed_results(self, summary: ExecutionSummary) -> str:
"""Build detailed results table."""
html = '<h2>Detailed Results</h2><table><thead><tr>'
html += '<th>Table</th><th>Status</th><th>Row Count</th><th>Schema</th><th>Aggregates</th><th>Time (ms)</th>'
html += '</tr></thead><tbody>'
for result in summary.results:
row_count = result.get_check(CheckType.ROW_COUNT)
schema = result.get_check(CheckType.SCHEMA)
aggregate = result.get_check(CheckType.AGGREGATE)
html += f'<tr><td>{result.table.full_name}</td>'
html += f'<td><span class="status-badge status-{result.overall_status.value}">{result.overall_status.value}</span></td>'
html += f'<td><span class="status-badge status-{row_count.status.value if row_count else "SKIP"}">{row_count.status.value if row_count else "SKIP"}</span></td>'
html += f'<td><span class="status-badge status-{schema.status.value if schema else "SKIP"}">{schema.status.value if schema else "SKIP"}</span></td>'
html += f'<td><span class="status-badge status-{aggregate.status.value if aggregate else "SKIP"}">{aggregate.status.value if aggregate else "SKIP"}</span></td>'
html += f'<td>{result.execution_time_ms}</td></tr>'
html += '</tbody></table>'
return html
def _build_footer(self, summary: ExecutionSummary) -> str:
"""Build report footer."""
return f"""<div class="footer">
<p>Generated by Data Regression Testing Framework v1.0.0</p>
<p>Success Rate: {summary.success_rate:.1f}%</p>
</div>"""


@@ -0,0 +1,357 @@
"""Investigation report generators for HTML and CSV formats."""
import csv
from pathlib import Path
from typing import Optional
from drt.models.investigation import InvestigationSummary, QueryExecutionResult
from drt.models.enums import Status
from drt.config.models import Config
from drt.utils.logging import get_logger
from drt.utils.timestamps import format_duration
logger = get_logger(__name__)
class InvestigationHTMLReportGenerator:
"""Generates HTML format investigation reports."""
def __init__(self, config: Config):
"""
Initialize HTML generator.
Args:
config: Configuration object
"""
self.config = config
self.max_rows = 100 # Limit rows displayed in HTML
def generate(self, summary: InvestigationSummary, filepath: Path) -> None:
"""
Generate HTML investigation report.
Args:
summary: Investigation summary
filepath: Output file path
"""
html_content = self._build_html(summary)
with open(filepath, "w", encoding="utf-8") as f:
f.write(html_content)
logger.debug(f"Investigation HTML report written to {filepath}")
def _build_html(self, summary: InvestigationSummary) -> str:
"""Build complete HTML document."""
return f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Investigation Report - {summary.start_time}</title>
{self._get_styles()}
{self._get_scripts()}
</head>
<body>
<div class="container">
{self._build_header(summary)}
{self._build_summary(summary)}
{self._build_table_results(summary)}
{self._build_footer(summary)}
</div>
</body>
</html>"""
def _get_styles(self) -> str:
"""Get embedded CSS styles."""
return """<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body { font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; background: #f5f5f5; padding: 20px; }
.container { max-width: 1600px; margin: 0 auto; background: white; padding: 30px; border-radius: 8px; box-shadow: 0 2px 10px rgba(0,0,0,0.1); }
h1 { color: #333; border-bottom: 3px solid #007bff; padding-bottom: 10px; margin-bottom: 20px; }
h2 { color: #555; margin-top: 30px; margin-bottom: 15px; border-left: 4px solid #007bff; padding-left: 10px; }
h3 { color: #666; margin-top: 20px; margin-bottom: 10px; }
.header { background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 8px; margin-bottom: 30px; }
.header h1 { color: white; border: none; }
.info-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(250px, 1fr)); gap: 15px; margin: 20px 0; }
.info-box { background: #f8f9fa; padding: 15px; border-radius: 5px; border-left: 4px solid #007bff; }
.info-label { font-weight: bold; color: #666; font-size: 0.9em; }
.info-value { color: #333; font-size: 1.1em; margin-top: 5px; }
.summary-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(150px, 1fr)); gap: 15px; margin: 20px 0; }
.summary-box { padding: 20px; border-radius: 8px; text-align: center; color: white; }
.summary-box.success { background: #28a745; }
.summary-box.partial { background: #ffc107; color: #333; }
.summary-box.failed { background: #dc3545; }
.summary-number { font-size: 2.5em; font-weight: bold; }
.summary-label { font-size: 0.9em; margin-top: 5px; }
.table-card { background: #fff; border: 1px solid #dee2e6; border-radius: 8px; margin: 20px 0; overflow: hidden; }
.table-header { background: #f8f9fa; padding: 15px; border-bottom: 2px solid #dee2e6; cursor: pointer; }
.table-header:hover { background: #e9ecef; }
.table-name { font-size: 1.2em; font-weight: bold; color: #333; }
.table-status { display: inline-block; padding: 4px 12px; border-radius: 12px; font-size: 0.85em; font-weight: 600; margin-left: 10px; }
.status-SUCCESS { background: #d4edda; color: #155724; }
.status-PASS { background: #d4edda; color: #155724; }
.status-FAIL { background: #f8d7da; color: #721c24; }
.status-WARNING { background: #fff3cd; color: #856404; }
.status-SKIP { background: #e2e3e5; color: #383d41; }
.table-content { padding: 20px; display: none; }
.table-content.active { display: block; }
.query-section { margin: 20px 0; padding: 15px; background: #f8f9fa; border-radius: 5px; }
.query-header { font-weight: bold; margin-bottom: 10px; color: #555; }
.comparison-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 20px; margin: 15px 0; }
.env-section { background: white; padding: 15px; border-radius: 5px; border: 1px solid #dee2e6; }
.env-title { font-weight: bold; color: #007bff; margin-bottom: 10px; }
.query-code { background: #2d2d2d; color: #f8f8f2; padding: 15px; border-radius: 5px; overflow-x: auto; font-family: 'Courier New', monospace; font-size: 0.9em; margin: 10px 0; }
.result-table { width: 100%; border-collapse: collapse; margin: 10px 0; font-size: 0.9em; }
.result-table th { background: #007bff; color: white; padding: 8px; text-align: left; }
.result-table td { padding: 8px; border-bottom: 1px solid #dee2e6; }
.result-table tr:hover { background: #f8f9fa; }
.error-box { background: #fff5f5; border: 1px solid #feb2b2; border-radius: 5px; padding: 15px; margin: 10px 0; color: #c53030; }
.result-meta { display: flex; gap: 20px; margin: 10px 0; font-size: 0.9em; color: #666; }
.footer { margin-top: 40px; padding-top: 20px; border-top: 1px solid #dee2e6; text-align: center; color: #666; font-size: 0.9em; }
.toggle-icon { float: right; transition: transform 0.3s; }
.toggle-icon.active { transform: rotate(180deg); }
</style>"""
def _get_scripts(self) -> str:
"""Get embedded JavaScript."""
return """<script>
function toggleTable(id) {
const content = document.getElementById('content-' + id);
const icon = document.getElementById('icon-' + id);
content.classList.toggle('active');
icon.classList.toggle('active');
}
</script>"""
def _build_header(self, summary: InvestigationSummary) -> str:
"""Build report header."""
return f"""<div class="header">
<h1>🔍 Investigation Report</h1>
<p>Analysis Directory: {summary.analysis_directory}</p>
</div>
<div class="info-grid">
<div class="info-box">
<div class="info-label">Start Time</div>
<div class="info-value">{summary.start_time}</div>
</div>
<div class="info-box">
<div class="info-label">End Time</div>
<div class="info-value">{summary.end_time}</div>
</div>
<div class="info-box">
<div class="info-label">Duration</div>
<div class="info-value">{format_duration(summary.duration_seconds)}</div>
</div>
<div class="info-box">
<div class="info-label">Baseline</div>
<div class="info-value">{summary.baseline_info}</div>
</div>
<div class="info-box">
<div class="info-label">Target</div>
<div class="info-value">{summary.target_info}</div>
</div>
<div class="info-box">
<div class="info-label">Total Queries</div>
<div class="info-value">{summary.total_queries_executed}</div>
</div>
</div>"""
def _build_summary(self, summary: InvestigationSummary) -> str:
"""Build summary section."""
return f"""<h2>Summary</h2>
<div class="summary-grid">
<div class="summary-box success">
<div class="summary-number">{summary.tables_successful}</div>
<div class="summary-label">Successful</div>
</div>
<div class="summary-box partial">
<div class="summary-number">{summary.tables_partial}</div>
<div class="summary-label">Partial</div>
</div>
<div class="summary-box failed">
<div class="summary-number">{summary.tables_failed}</div>
<div class="summary-label">Failed</div>
</div>
</div>"""
def _build_table_results(self, summary: InvestigationSummary) -> str:
"""Build table-by-table results."""
html = '<h2>Investigation Results</h2>'
for idx, table_result in enumerate(summary.results):
html += f"""<div class="table-card">
<div class="table-header" onclick="toggleTable({idx})">
<span class="table-name">{table_result.full_name}</span>
<span class="table-status status-{table_result.overall_status.value}">{table_result.overall_status.value}</span>
<span class="toggle-icon" id="icon-{idx}">▼</span>
</div>
<div class="table-content" id="content-{idx}">
<p><strong>SQL File:</strong> {table_result.sql_file_path}</p>
<p><strong>Total Queries:</strong> {table_result.total_queries}</p>
<p><strong>Successful Queries:</strong> {table_result.successful_queries}</p>
{self._build_queries(table_result)}
</div>
</div>"""
return html
def _build_queries(self, table_result) -> str:
"""Build query results for a table."""
html = ""
for i, (baseline_result, target_result) in enumerate(zip(
table_result.baseline_results,
table_result.target_results
), 1):
html += f"""<div class="query-section">
<div class="query-header">Query {baseline_result.query_number}</div>
<details>
<summary>View SQL</summary>
<div class="query-code">{self._escape_html(baseline_result.query_text)}</div>
</details>
<div class="comparison-grid">
{self._build_query_result(baseline_result, "Baseline")}
{self._build_query_result(target_result, "Target")}
</div>
</div>"""
return html
def _build_query_result(self, result: QueryExecutionResult, env: str) -> str:
"""Build single query result."""
html = f"""<div class="env-section">
<div class="env-title">{env}</div>
<span class="table-status status-{result.status.value}">{result.status.value}</span>
<div class="result-meta">
<span>⏱️ {result.execution_time_ms}ms</span>
<span>📊 {result.row_count} rows</span>
</div>"""
if result.error_message:
html += f'<div class="error-box">❌ {self._escape_html(result.error_message)}</div>'
elif result.result_data is not None and not result.result_data.empty:
html += self._build_result_table(result)
html += '</div>'
return html
def _build_result_table(self, result: QueryExecutionResult) -> str:
"""Build HTML table from DataFrame."""
df = result.result_data
if df is None or df.empty:
return '<p>No data returned</p>'
# Limit rows
display_df = df.head(self.max_rows)
html = '<table class="result-table"><thead><tr>'
for col in display_df.columns:
html += f'<th>{self._escape_html(str(col))}</th>'
html += '</tr></thead><tbody>'
for _, row in display_df.iterrows():
html += '<tr>'
for val in row:
html += f'<td>{self._escape_html(str(val))}</td>'
html += '</tr>'
html += '</tbody></table>'
if len(df) > self.max_rows:
html += f'<p><em>Showing first {self.max_rows} of {len(df)} rows</em></p>'
return html
def _escape_html(self, text: str) -> str:
"""Escape HTML special characters."""
return (text
.replace('&', '&amp;')
.replace('<', '&lt;')
.replace('>', '&gt;')
.replace('"', '&quot;')
.replace("'", '&#39;'))
def _build_footer(self, summary: InvestigationSummary) -> str:
"""Build report footer."""
return f"""<div class="footer">
<p>Generated by Data Regression Testing Framework - Investigation Module</p>
<p>Success Rate: {summary.success_rate:.1f}%</p>
</div>"""
class InvestigationCSVReportGenerator:
"""Generates CSV format investigation reports."""
def __init__(self, config: Config):
"""
Initialize CSV generator.
Args:
config: Configuration object
"""
self.config = config
def generate(self, summary: InvestigationSummary, filepath: Path) -> None:
"""
Generate CSV investigation report.
Args:
summary: Investigation summary
filepath: Output file path
"""
csv_config = self.config.reporting.csv
delimiter = csv_config.get("delimiter", ",")
encoding = csv_config.get("encoding", "utf-8-sig")
with open(filepath, "w", newline="", encoding=encoding) as f:
writer = csv.writer(f, delimiter=delimiter)
# Write header
writer.writerow([
"Timestamp",
"Schema",
"Table",
"Query_Number",
"Environment",
"Status",
"Row_Count",
"Execution_Time_Ms",
"Error_Message",
"SQL_File_Path"
])
# Write data rows
for table_result in summary.results:
# Baseline results
for query_result in table_result.baseline_results:
writer.writerow([
table_result.timestamp,
table_result.schema,
table_result.table,
query_result.query_number,
"baseline",
query_result.status.value,
query_result.row_count,
query_result.execution_time_ms,
query_result.error_message or "",
table_result.sql_file_path
])
# Target results
for query_result in table_result.target_results:
writer.writerow([
table_result.timestamp,
table_result.schema,
table_result.table,
query_result.query_number,
"target",
query_result.status.value,
query_result.row_count,
query_result.execution_time_ms,
query_result.error_message or "",
table_result.sql_file_path
])
logger.debug(f"Investigation CSV report written to {filepath}")

6
src/drt/services/__init__.py Executable file
View File

@@ -0,0 +1,6 @@
"""Business logic services."""
from drt.services.discovery import DiscoveryService
from drt.services.comparison import ComparisonService
__all__ = ["DiscoveryService", "ComparisonService"]

15
src/drt/services/checkers/__init__.py Executable file
View File

@@ -0,0 +1,15 @@
"""Comparison checkers."""
from drt.services.checkers.base import BaseChecker
from drt.services.checkers.existence import ExistenceChecker
from drt.services.checkers.row_count import RowCountChecker
from drt.services.checkers.schema import SchemaChecker
from drt.services.checkers.aggregate import AggregateChecker
__all__ = [
"BaseChecker",
"ExistenceChecker",
"RowCountChecker",
"SchemaChecker",
"AggregateChecker",
]

111
src/drt/services/checkers/aggregate.py Executable file
View File

@@ -0,0 +1,111 @@
"""Aggregate checker."""
import time
from drt.services.checkers.base import BaseChecker
from drt.models.results import CheckResult
from drt.models.table import TableInfo
from drt.models.enums import Status, CheckType
from drt.utils.logging import get_logger
logger = get_logger(__name__)
class AggregateChecker(BaseChecker):
"""Checks aggregate sums for numeric columns."""
def check(self, table: TableInfo) -> CheckResult:
"""
Check aggregate sums.
Args:
table: Table information
Returns:
Check result
"""
if not self.config.comparison.aggregates.enabled:
return CheckResult(
check_type=CheckType.AGGREGATE,
status=Status.SKIP,
message="Aggregate check disabled"
)
if not table.aggregate_columns:
return CheckResult(
check_type=CheckType.AGGREGATE,
status=Status.SKIP,
message="No aggregate columns configured"
)
try:
# Time baseline query
baseline_start = time.time()
baseline_sums = self.baseline_executor.get_aggregate_sums(
table.schema, table.name, table.aggregate_columns
)
baseline_time = (time.time() - baseline_start) * 1000
logger.debug(f" └─ Baseline aggregate query: {baseline_time:.0f}ms")
# Time target query
target_start = time.time()
target_sums = self.target_executor.get_aggregate_sums(
table.schema, table.name, table.aggregate_columns
)
target_time = (time.time() - target_start) * 1000
logger.debug(f" └─ Target aggregate query: {target_time:.0f}ms")
logger.debug(f" └─ Total aggregate time: {baseline_time + target_time:.0f}ms (could be parallelized)")
tolerance_pct = self.config.comparison.aggregates.tolerance_percent
issues = []
statuses = []
for col in table.aggregate_columns:
baseline_val = baseline_sums.get(col, 0.0)
target_val = target_sums.get(col, 0.0)
if baseline_val == target_val:
continue
# Calculate percentage difference
if baseline_val != 0:
pct_diff = abs((target_val - baseline_val) / baseline_val * 100)
else:
pct_diff = 100.0 if target_val != 0 else 0.0
if pct_diff > tolerance_pct:
statuses.append(Status.FAIL)
issues.append(
f"Column '{col}': SUM differs by {pct_diff:.2f}% "
f"(Baseline: {baseline_val:,.2f}, Target: {target_val:,.2f})"
)
# Determine overall status
if not statuses:
status = Status.PASS
message = f"All {len(table.aggregate_columns)} aggregate(s) match"
else:
status = Status.most_severe(statuses)
message = "; ".join(issues)
return CheckResult(
check_type=CheckType.AGGREGATE,
status=status,
baseline_value=baseline_sums,
target_value=target_sums,
message=message,
details={
"baseline_sums": baseline_sums,
"target_sums": target_sums,
"tolerance_percent": tolerance_pct,
"columns_checked": table.aggregate_columns,
"issues": issues
}
)
except Exception as e:
logger.error(f"Aggregate check failed for {table.full_name}: {e}")
return CheckResult(
check_type=CheckType.AGGREGATE,
status=Status.ERROR,
message=f"Aggregate check error: {str(e)}"
)
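To make the tolerance rule above concrete, here is a small self-contained restatement of the same percentage arithmetic with two illustrative values (not taken from any real run):

```python
def exceeds_tolerance(baseline_val: float, target_val: float, tolerance_pct: float) -> bool:
    """Mirror of the per-column tolerance rule used by AggregateChecker."""
    if baseline_val == target_val:
        return False
    if baseline_val != 0:
        pct_diff = abs((target_val - baseline_val) / baseline_val * 100)
    else:
        pct_diff = 100.0 if target_val != 0 else 0.0
    return pct_diff > tolerance_pct

# With a 0.01% tolerance: a 0.005% drift passes, a 0.2% drift fails.
assert exceeds_tolerance(1_000_000.00, 1_000_050.00, tolerance_pct=0.01) is False
assert exceeds_tolerance(1_000_000.00, 998_000.00, tolerance_pct=0.01) is True
```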

42
src/drt/services/checkers/base.py Executable file
View File

@@ -0,0 +1,42 @@
"""Base checker class."""
from abc import ABC, abstractmethod
from drt.models.results import CheckResult
from drt.models.table import TableInfo
from drt.database.executor import QueryExecutor
from drt.config.models import Config
class BaseChecker(ABC):
"""Abstract base class for all checkers."""
def __init__(
self,
baseline_executor: QueryExecutor,
target_executor: QueryExecutor,
config: Config
):
"""
Initialize checker.
Args:
baseline_executor: Query executor for baseline database
target_executor: Query executor for target database
config: Configuration object
"""
self.baseline_executor = baseline_executor
self.target_executor = target_executor
self.config = config
@abstractmethod
def check(self, table: TableInfo) -> CheckResult:
"""
Perform the check.
Args:
table: Table information
Returns:
Check result
"""
pass

78
src/drt/services/checkers/existence.py Executable file
View File

@@ -0,0 +1,78 @@
"""Table existence checker."""
import time
from drt.services.checkers.base import BaseChecker
from drt.models.results import CheckResult
from drt.models.table import TableInfo
from drt.models.enums import Status, CheckType
from drt.utils.logging import get_logger
logger = get_logger(__name__)
class ExistenceChecker(BaseChecker):
"""Checks if table exists in both baseline and target."""
def check(self, table: TableInfo) -> CheckResult:
"""
Check table existence.
Args:
table: Table information
Returns:
Check result
"""
try:
# Time baseline query
baseline_start = time.time()
baseline_exists = self.baseline_executor.table_exists(table.schema, table.name)
baseline_time = (time.time() - baseline_start) * 1000
logger.debug(f" └─ Baseline existence query: {baseline_time:.0f}ms")
# Time target query
target_start = time.time()
target_exists = self.target_executor.table_exists(table.schema, table.name)
target_time = (time.time() - target_start) * 1000
logger.debug(f" └─ Target existence query: {target_time:.0f}ms")
logger.debug(f" └─ Total existence time: {baseline_time + target_time:.0f}ms (could be parallelized)")
# Determine status
if baseline_exists and target_exists:
status = Status.PASS
message = "Table exists in both databases"
elif baseline_exists and not target_exists:
# Table missing in target
if table.expected_in_target:
status = Status.FAIL
message = "Table exists in Baseline but missing in Target (REGRESSION)"
else:
status = Status.INFO
message = "Table removed from Target (expected per configuration)"
elif not baseline_exists and target_exists:
status = Status.INFO
message = "Table exists only in Target (new table)"
else:
status = Status.ERROR
message = "Table does not exist in either database"
return CheckResult(
check_type=CheckType.EXISTENCE,
status=status,
baseline_value=baseline_exists,
target_value=target_exists,
message=message,
details={
"baseline_exists": baseline_exists,
"target_exists": target_exists,
"expected_in_target": table.expected_in_target
}
)
except Exception as e:
logger.error(f"Existence check failed for {table.full_name}: {e}")
return CheckResult(
check_type=CheckType.EXISTENCE,
status=Status.ERROR,
message=f"Existence check error: {str(e)}"
)

90
src/drt/services/checkers/row_count.py Executable file
View File

@@ -0,0 +1,90 @@
"""Row count checker."""
import time
from drt.services.checkers.base import BaseChecker
from drt.models.results import CheckResult
from drt.models.table import TableInfo
from drt.models.enums import Status, CheckType
from drt.utils.logging import get_logger
logger = get_logger(__name__)
class RowCountChecker(BaseChecker):
"""Checks row count differences between baseline and target."""
def check(self, table: TableInfo) -> CheckResult:
"""
Check row counts.
Args:
table: Table information
Returns:
Check result
"""
if not self.config.comparison.row_count.enabled:
return CheckResult(
check_type=CheckType.ROW_COUNT,
status=Status.SKIP,
message="Row count check disabled"
)
try:
# Time baseline query
baseline_start = time.time()
baseline_count = self.baseline_executor.get_row_count(table.schema, table.name)
baseline_time = (time.time() - baseline_start) * 1000
logger.debug(f" └─ Baseline row count query: {baseline_time:.0f}ms")
# Time target query
target_start = time.time()
target_count = self.target_executor.get_row_count(table.schema, table.name)
target_time = (time.time() - target_start) * 1000
logger.debug(f" └─ Target row count query: {target_time:.0f}ms")
logger.debug(f" └─ Total row count time: {baseline_time + target_time:.0f}ms (could be parallelized)")
difference = target_count - baseline_count
tolerance_pct = self.config.comparison.row_count.tolerance_percent
# Determine status
if baseline_count == target_count:
status = Status.PASS
message = f"Row counts match: {baseline_count:,}"
elif target_count > baseline_count:
pct_diff = (difference / baseline_count * 100) if baseline_count > 0 else 0
status = Status.WARNING
message = f"Target has {difference:,} more rows (+{pct_diff:.2f}%)"
else: # target_count < baseline_count
pct_diff = abs(difference / baseline_count * 100) if baseline_count > 0 else 0
if pct_diff <= tolerance_pct:
status = Status.WARNING
message = f"Target has {abs(difference):,} fewer rows (-{pct_diff:.2f}%) - within tolerance"
else:
status = Status.FAIL
message = f"Target missing {abs(difference):,} rows (-{pct_diff:.2f}%) - REGRESSION"
return CheckResult(
check_type=CheckType.ROW_COUNT,
status=status,
baseline_value=baseline_count,
target_value=target_count,
difference=difference,
message=message,
details={
"baseline_count": baseline_count,
"target_count": target_count,
"difference": difference,
"percent_difference": (difference / baseline_count * 100) if baseline_count > 0 else 0,
"tolerance_percent": tolerance_pct
}
)
except Exception as e:
logger.error(f"Row count check failed for {table.full_name}: {e}")
return CheckResult(
check_type=CheckType.ROW_COUNT,
status=Status.ERROR,
message=f"Row count check error: {str(e)}"
)
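The asymmetric status rule above (extra rows only warn, missing rows fail once they exceed the tolerance) can be restated as a small self-contained function with illustrative counts:

```python
def row_count_status(baseline: int, target: int, tolerance_pct: float) -> str:
    """Mirror of the status decision used by RowCountChecker."""
    if baseline == target:
        return "PASS"
    if target > baseline:
        return "WARNING"                      # extra rows are flagged, not failed
    pct_diff = abs((target - baseline) / baseline * 100) if baseline > 0 else 0
    return "WARNING" if pct_diff <= tolerance_pct else "FAIL"

assert row_count_status(100, 100, tolerance_pct=0.0) == "PASS"
assert row_count_status(100, 105, tolerance_pct=0.0) == "WARNING"
assert row_count_status(1_000_000, 999_950, tolerance_pct=0.01) == "WARNING"  # -0.005%
assert row_count_status(100, 95, tolerance_pct=0.0) == "FAIL"                 # -5% regression
```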

132
src/drt/services/checkers/schema.py Executable file
View File

@@ -0,0 +1,132 @@
"""Schema checker."""
import time
from typing import Set
from drt.services.checkers.base import BaseChecker
from drt.models.results import CheckResult
from drt.models.table import TableInfo
from drt.models.enums import Status, CheckType
from drt.utils.logging import get_logger
logger = get_logger(__name__)
class SchemaChecker(BaseChecker):
"""Checks schema differences between baseline and target."""
def check(self, table: TableInfo) -> CheckResult:
"""
Check schema compatibility.
Args:
table: Table information
Returns:
Check result
"""
if not self.config.comparison.schema.enabled:
return CheckResult(
check_type=CheckType.SCHEMA,
status=Status.SKIP,
message="Schema check disabled"
)
try:
# Time baseline query
baseline_start = time.time()
baseline_cols = self.baseline_executor.get_columns(table.schema, table.name)
baseline_time = (time.time() - baseline_start) * 1000
logger.debug(f" └─ Baseline schema query: {baseline_time:.0f}ms")
# Time target query
target_start = time.time()
target_cols = self.target_executor.get_columns(table.schema, table.name)
target_time = (time.time() - target_start) * 1000
logger.debug(f" └─ Target schema query: {target_time:.0f}ms")
logger.debug(f" └─ Total schema time: {baseline_time + target_time:.0f}ms (could be parallelized)")
baseline_col_names = {col['COLUMN_NAME'] for col in baseline_cols}
target_col_names = {col['COLUMN_NAME'] for col in target_cols}
missing_in_target = baseline_col_names - target_col_names
extra_in_target = target_col_names - baseline_col_names
issues = []
statuses = []
# Check for missing columns
if missing_in_target:
severity = self.config.comparison.schema.severity.get(
"missing_column_in_target", "FAIL"
)
statuses.append(Status[severity])
issues.append(f"Missing columns in Target: {', '.join(sorted(missing_in_target))}")
# Check for extra columns
if extra_in_target:
severity = self.config.comparison.schema.severity.get(
"extra_column_in_target", "WARNING"
)
statuses.append(Status[severity])
issues.append(f"Extra columns in Target: {', '.join(sorted(extra_in_target))}")
# Check data types for matching columns
if self.config.comparison.schema.checks.get("data_types", True):
type_mismatches = self._check_data_types(baseline_cols, target_cols)
if type_mismatches:
severity = self.config.comparison.schema.severity.get(
"data_type_mismatch", "WARNING"
)
statuses.append(Status[severity])
issues.extend(type_mismatches)
# Determine overall status
if not statuses:
status = Status.PASS
message = f"Schema matches: {len(baseline_col_names)} columns"
else:
status = Status.most_severe(statuses)
message = "; ".join(issues)
return CheckResult(
check_type=CheckType.SCHEMA,
status=status,
baseline_value=len(baseline_col_names),
target_value=len(target_col_names),
message=message,
details={
"baseline_columns": sorted(baseline_col_names),
"target_columns": sorted(target_col_names),
"missing_in_target": sorted(missing_in_target),
"extra_in_target": sorted(extra_in_target),
"issues": issues
}
)
except Exception as e:
logger.error(f"Schema check failed for {table.full_name}: {e}")
return CheckResult(
check_type=CheckType.SCHEMA,
status=Status.ERROR,
message=f"Schema check error: {str(e)}"
)
def _check_data_types(self, baseline_cols: list, target_cols: list) -> list:
"""Check for data type mismatches."""
mismatches = []
# Create lookup dictionaries
baseline_types = {col['COLUMN_NAME']: col['DATA_TYPE'] for col in baseline_cols}
target_types = {col['COLUMN_NAME']: col['DATA_TYPE'] for col in target_cols}
# Check common columns
common_cols = set(baseline_types.keys()) & set(target_types.keys())
for col in sorted(common_cols):
if baseline_types[col] != target_types[col]:
mismatches.append(
f"Column '{col}': type mismatch "
f"(Baseline: {baseline_types[col]}, Target: {target_types[col]})"
)
return mismatches
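The column comparison above boils down to two set differences plus a type check over the intersection; a standalone sketch with made-up column names:

```python
# Illustrative column sets only; the real names come from the executor's column metadata.
baseline_cols = {"CustomerID", "CustomerName", "Email"}
target_cols = {"CustomerID", "CustomerName", "LastModified"}

missing_in_target = baseline_cols - target_cols   # {'Email'}        -> FAIL by default severity
extra_in_target = target_cols - baseline_cols     # {'LastModified'} -> WARNING by default severity
common_cols = baseline_cols & target_cols         # only these are compared for data-type mismatches
```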

250
src/drt/services/comparison.py Executable file
View File

@@ -0,0 +1,250 @@
"""Comparison service for executing database comparisons."""
import time
from typing import List
from drt.database.connection import ConnectionManager
from drt.database.executor import QueryExecutor
from drt.config.models import Config, DatabasePairConfig
from drt.models.table import TableInfo
from drt.models.results import ComparisonResult
from drt.models.summary import ExecutionSummary
from drt.models.enums import Status
from drt.services.checkers import (
ExistenceChecker,
RowCountChecker,
SchemaChecker,
AggregateChecker
)
from drt.utils.logging import get_logger
from drt.utils.timestamps import get_timestamp
from drt.utils.patterns import matches_pattern
logger = get_logger(__name__)
class ComparisonService:
"""Service for comparing baseline and target databases."""
def __init__(self, config: Config):
"""
Initialize comparison service.
Args:
config: Configuration object
"""
self.config = config
def run_comparison(self, db_pair: DatabasePairConfig) -> ExecutionSummary:
"""
Run comparison for a database pair.
Args:
db_pair: Database pair configuration
Returns:
Execution summary with results
"""
start_time = get_timestamp()
start_ts = time.time()
logger.info("=" * 60)
logger.info(f"Starting comparison: {db_pair.name}")
logger.info("=" * 60)
# Initialize connections
baseline_mgr = ConnectionManager(db_pair.baseline)
target_mgr = ConnectionManager(db_pair.target)
try:
# Connect to databases
baseline_mgr.connect()
target_mgr.connect()
# Create executors
baseline_executor = QueryExecutor(baseline_mgr)
target_executor = QueryExecutor(target_mgr)
# Initialize checkers
existence_checker = ExistenceChecker(baseline_executor, target_executor, self.config)
row_count_checker = RowCountChecker(baseline_executor, target_executor, self.config)
schema_checker = SchemaChecker(baseline_executor, target_executor, self.config)
aggregate_checker = AggregateChecker(baseline_executor, target_executor, self.config)
# Get tables to compare
tables = self._get_tables_to_compare()
logger.info(f"Tables to compare: {len(tables)}")
# Create summary
summary = ExecutionSummary(
start_time=start_time,
end_time="",
duration_seconds=0,
config_file=self.config.metadata.generated_date or "",
baseline_info=f"{db_pair.baseline.server}.{db_pair.baseline.database}",
target_info=f"{db_pair.target.server}.{db_pair.target.database}"
)
# Compare each table
for idx, table in enumerate(tables, 1):
if not table.enabled:
logger.info(f"[{idx:3d}/{len(tables)}] {table.full_name:40s} SKIP (disabled)")
result = self._create_skipped_result(table)
summary.add_result(result)
continue
logger.info(f"[{idx:3d}/{len(tables)}] {table.full_name:40s} ...", extra={'end': ''})
result = self._compare_table(
table,
existence_checker,
row_count_checker,
schema_checker,
aggregate_checker
)
summary.add_result(result)
# Log result
status_symbol = self._get_status_symbol(result.overall_status)
logger.info(f" {status_symbol} {result.overall_status.value}")
if not self.config.execution.continue_on_error and result.overall_status == Status.ERROR:
logger.error("Stopping due to error (continue_on_error=False)")
break
# Finalize summary
end_time = get_timestamp()
duration = int(time.time() - start_ts)
summary.end_time = end_time
summary.duration_seconds = duration
# Log summary
self._log_summary(summary)
return summary
finally:
baseline_mgr.disconnect()
target_mgr.disconnect()
def _compare_table(
self,
table: TableInfo,
existence_checker: ExistenceChecker,
row_count_checker: RowCountChecker,
schema_checker: SchemaChecker,
aggregate_checker: AggregateChecker
) -> ComparisonResult:
"""Compare a single table."""
start_ms = time.time() * 1000
result = ComparisonResult(
table=table,
overall_status=Status.PASS,
timestamp=get_timestamp()
)
try:
# Check existence first
check_start = time.time()
existence_result = existence_checker.check(table)
existence_time = (time.time() - check_start) * 1000
logger.debug(f" └─ Existence check: {existence_time:.0f}ms")
result.add_check(existence_result)
# Only proceed with other checks if table exists in both
if existence_result.status == Status.PASS:
# Row count check
check_start = time.time()
row_count_result = row_count_checker.check(table)
row_count_time = (time.time() - check_start) * 1000
logger.debug(f" └─ Row count check: {row_count_time:.0f}ms")
result.add_check(row_count_result)
# Schema check
check_start = time.time()
schema_result = schema_checker.check(table)
schema_time = (time.time() - check_start) * 1000
logger.debug(f" └─ Schema check: {schema_time:.0f}ms")
result.add_check(schema_result)
# Aggregate check
check_start = time.time()
aggregate_result = aggregate_checker.check(table)
aggregate_time = (time.time() - check_start) * 1000
logger.debug(f" └─ Aggregate check: {aggregate_time:.0f}ms")
result.add_check(aggregate_result)
except Exception as e:
logger.error(f"Comparison failed for {table.full_name}: {e}")
result.overall_status = Status.ERROR
result.error_message = str(e)
result.execution_time_ms = int(time.time() * 1000 - start_ms)
logger.debug(f" └─ Total table time: {result.execution_time_ms}ms")
return result
def _get_tables_to_compare(self) -> List[TableInfo]:
"""Get list of tables to compare based on configuration."""
tables = []
for table_config in self.config.tables:
table = TableInfo(
schema=table_config.schema,
name=table_config.name,
enabled=table_config.enabled,
expected_in_target=table_config.expected_in_target,
estimated_row_count=table_config.estimated_row_count,
primary_key_columns=table_config.primary_key_columns,
aggregate_columns=table_config.aggregate_columns,
notes=table_config.notes
)
tables.append(table)
# Apply filters
if self.config.table_filters.mode == "include_list":
if self.config.table_filters.include_list:
include_names = {f"{t['schema']}.{t['name']}" for t in self.config.table_filters.include_list}
tables = [t for t in tables if t.full_name in include_names]
# Apply exclusions
tables = [
t for t in tables
if not matches_pattern(t.name, self.config.table_filters.exclude_patterns)
and t.schema not in self.config.table_filters.exclude_schemas
]
return tables
def _create_skipped_result(self, table: TableInfo) -> ComparisonResult:
"""Create a skipped result for disabled tables."""
return ComparisonResult(
table=table,
overall_status=Status.SKIP,
timestamp=get_timestamp()
)
def _get_status_symbol(self, status: Status) -> str:
"""Get symbol for status."""
symbols = {
Status.PASS: "",
Status.FAIL: "",
Status.WARNING: "",
Status.ERROR: "🔴",
Status.INFO: "",
Status.SKIP: ""
}
return symbols.get(status, "?")
def _log_summary(self, summary: ExecutionSummary) -> None:
"""Log execution summary."""
logger.info("=" * 60)
logger.info("COMPARISON SUMMARY")
logger.info("=" * 60)
logger.info(f" PASS: {summary.passed:3d} | FAIL: {summary.failed:3d}")
logger.info(f" WARNING: {summary.warnings:3d} | ERROR: {summary.errors:3d}")
logger.info(f" INFO: {summary.info:3d} | SKIP: {summary.skipped:3d}")
logger.info("=" * 60)
logger.info(f"Duration: {summary.duration_seconds} seconds")
logger.info(f"Success Rate: {summary.success_rate:.1f}%")
logger.info("=" * 60)

192
src/drt/services/discovery.py Executable file
View File

@@ -0,0 +1,192 @@
"""Discovery service for auto-generating configuration."""
from typing import List, Optional
from drt.database.connection import ConnectionManager
from drt.database.executor import QueryExecutor
from drt.database.queries import SQLQueries
from drt.models.table import TableInfo, ColumnInfo
from drt.config.models import Config, TableConfig, MetadataConfig, ConnectionConfig
from drt.utils.logging import get_logger
from drt.utils.timestamps import get_timestamp
from drt.utils.patterns import matches_pattern
logger = get_logger(__name__)
class DiscoveryService:
"""Service for discovering database tables and generating configuration."""
def __init__(self, connection_config: ConnectionConfig, config: Optional[Config] = None):
"""
Initialize discovery service.
Args:
connection_config: Connection configuration for baseline database
config: Optional existing configuration for discovery settings
"""
self.conn_config = connection_config
self.config = config or Config()
self.conn_mgr = ConnectionManager(connection_config)
self.executor = QueryExecutor(self.conn_mgr)
def discover_tables(self) -> List[TableInfo]:
"""
Discover all tables in the database.
Returns:
List of discovered tables
"""
logger.info("Starting table discovery...")
try:
# Get all tables
tables_data = self.executor.get_all_tables()
logger.info(f"Found {len(tables_data)} tables")
discovered_tables = []
for table_data in tables_data:
schema = table_data['schema_name']
name = table_data['table_name']
estimated_rows = table_data.get('estimated_rows', 0)
# Apply filters
if self._should_exclude_table(schema, name):
logger.debug(f"Excluding table: {schema}.{name}")
continue
# Get column information
columns = self._discover_columns(schema, name)
# Get primary keys
pk_columns = self.executor.get_primary_keys(schema, name)
# Identify numeric columns for aggregation
aggregate_cols = [
col.name for col in columns
if col.is_numeric and self.config.discovery.detect_numeric_columns
]
table_info = TableInfo(
schema=schema,
name=name,
estimated_row_count=estimated_rows,
columns=columns,
primary_key_columns=pk_columns,
enabled=True,
expected_in_target=self.config.discovery.default_expected_in_target,
aggregate_columns=aggregate_cols,
notes=""
)
discovered_tables.append(table_info)
logger.debug(f"Discovered: {table_info.full_name} ({estimated_rows:,} rows)")
logger.info(f"Discovery complete: {len(discovered_tables)} tables discovered")
return discovered_tables
except Exception as e:
logger.error(f"Discovery failed: {e}")
raise
def _discover_columns(self, schema: str, table: str) -> List[ColumnInfo]:
"""Discover columns for a table."""
import math
columns_data = self.executor.get_columns(schema, table)
columns = []
for idx, col_data in enumerate(columns_data, 1):
is_numeric = SQLQueries.is_numeric_type(col_data['DATA_TYPE'])
# Convert nan to None for Pydantic validation
# Pandas converts SQL NULL to nan, but Pydantic v2 rejects nan for Optional[int]
max_length = col_data.get('CHARACTER_MAXIMUM_LENGTH')
if isinstance(max_length, float) and math.isnan(max_length):
max_length = None
precision = col_data.get('NUMERIC_PRECISION')
if isinstance(precision, float) and math.isnan(precision):
precision = None
scale = col_data.get('NUMERIC_SCALE')
if isinstance(scale, float) and math.isnan(scale):
scale = None
# DEBUG: Log converted values to verify fix
logger.debug(f"Column {col_data['COLUMN_NAME']}: max_length={max_length} (converted from {col_data.get('CHARACTER_MAXIMUM_LENGTH')}), "
f"precision={precision}, scale={scale}, data_type={col_data['DATA_TYPE']}")
column = ColumnInfo(
name=col_data['COLUMN_NAME'],
data_type=col_data['DATA_TYPE'],
max_length=max_length,
precision=precision,
scale=scale,
is_nullable=col_data['IS_NULLABLE'] == 'YES',
is_numeric=is_numeric,
ordinal_position=col_data.get('ORDINAL_POSITION', idx)
)
columns.append(column)
return columns
def _should_exclude_table(self, schema: str, table: str) -> bool:
"""Check if table should be excluded based on filters."""
# Check schema exclusions
if schema in self.config.discovery.exclude_schemas:
return True
# Check table name patterns
if matches_pattern(table, self.config.discovery.exclude_patterns):
return True
# Check schema inclusions (if specified)
if self.config.discovery.include_schemas:
if schema not in self.config.discovery.include_schemas:
return True
return False
def generate_config(self, tables: List[TableInfo]) -> Config:
"""
Generate configuration from discovered tables.
Args:
tables: List of discovered tables
Returns:
Generated configuration
"""
logger.info("Generating configuration...")
# Create table configs
table_configs = [
TableConfig(
schema=table.schema,
name=table.name,
enabled=table.enabled,
expected_in_target=table.expected_in_target,
estimated_row_count=table.estimated_row_count,
primary_key_columns=table.primary_key_columns,
aggregate_columns=table.aggregate_columns,
notes=table.notes
)
for table in tables
]
# Update metadata
metadata = MetadataConfig(
config_version="1.0",
generated_date=get_timestamp(),
generated_by="discovery",
framework_version="1.0.0"
)
# Create new config with discovered tables
config = Config(
metadata=metadata,
tables=table_configs
)
logger.info(f"Configuration generated with {len(table_configs)} tables")
return config
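A hypothetical end-to-end discovery sketch: discover the baseline tables and turn them into a fresh configuration. Persisting the resulting `Config` to YAML is handled elsewhere in the framework and is not shown:

```python
from drt.services.discovery import DiscoveryService

def build_config_from_baseline(connection_config):
    """connection_config: a ConnectionConfig pointing at the baseline database."""
    service = DiscoveryService(connection_config)
    tables = service.discover_tables()
    return service.generate_config(tables)
```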

297
src/drt/services/investigation.py Executable file
View File

@@ -0,0 +1,297 @@
"""Investigation service for executing investigation queries."""
import time
from pathlib import Path
from typing import List, Tuple
from drt.database.connection import ConnectionManager
from drt.database.executor import QueryExecutor
from drt.config.models import Config, DatabasePairConfig
from drt.models.investigation import (
QueryExecutionResult,
TableInvestigationResult,
InvestigationSummary
)
from drt.models.enums import Status
from drt.services.sql_parser import SQLParser, discover_sql_files
from drt.utils.logging import get_logger
from drt.utils.timestamps import get_timestamp
logger = get_logger(__name__)
class InvestigationService:
"""Service for executing investigation queries."""
def __init__(self, config: Config):
"""
Initialize investigation service.
Args:
config: Configuration object
"""
self.config = config
self.parser = SQLParser()
def run_investigation(
self,
analysis_dir: Path,
db_pair: DatabasePairConfig
) -> InvestigationSummary:
"""
Run investigation for all SQL files in analysis directory.
Args:
analysis_dir: Path to analysis output directory
db_pair: Database pair configuration
Returns:
Investigation summary with all results
"""
start_time = get_timestamp()
start_ts = time.time()
logger.info("=" * 60)
logger.info(f"Starting investigation: {analysis_dir.name}")
logger.info("=" * 60)
# Initialize connections
baseline_mgr = ConnectionManager(db_pair.baseline)
target_mgr = ConnectionManager(db_pair.target)
try:
# Connect to databases
baseline_mgr.connect()
target_mgr.connect()
# Create executors
baseline_executor = QueryExecutor(baseline_mgr)
target_executor = QueryExecutor(target_mgr)
# Discover SQL files
sql_files = discover_sql_files(analysis_dir)
logger.info(f"Found {len(sql_files)} investigation files")
# Create summary
summary = InvestigationSummary(
start_time=start_time,
end_time="",
duration_seconds=0,
analysis_directory=str(analysis_dir),
baseline_info=f"{db_pair.baseline.server}.{db_pair.baseline.database}",
target_info=f"{db_pair.target.server}.{db_pair.target.database}",
tables_processed=0,
tables_successful=0,
tables_partial=0,
tables_failed=0,
total_queries_executed=0,
results=[]
)
# Process each SQL file
for idx, (schema, table, sql_path) in enumerate(sql_files, 1):
logger.info(f"[{idx:3d}/{len(sql_files)}] {schema}.{table:40s} ...")
result = self._investigate_table(
schema,
table,
sql_path,
baseline_executor,
target_executor
)
summary.results.append(result)
summary.tables_processed += 1
# Update counters
if result.overall_status == Status.PASS:
summary.tables_successful += 1
elif result.overall_status == Status.SKIP:
# Don't count skipped tables in partial/failed
pass
elif result.overall_status in [Status.WARNING, Status.INFO]:
# Treat WARNING/INFO as partial success
summary.tables_partial += 1
elif self._is_partial_status(result):
summary.tables_partial += 1
else:
summary.tables_failed += 1
# Count queries
summary.total_queries_executed += len(result.baseline_results)
summary.total_queries_executed += len(result.target_results)
logger.info(f" {self._get_status_symbol(result.overall_status)} "
f"{result.overall_status.value}")
# Finalize summary
end_time = get_timestamp()
duration = int(time.time() - start_ts)
summary.end_time = end_time
summary.duration_seconds = duration
self._log_summary(summary)
return summary
finally:
baseline_mgr.disconnect()
target_mgr.disconnect()
def _investigate_table(
self,
schema: str,
table: str,
sql_path: Path,
baseline_executor: QueryExecutor,
target_executor: QueryExecutor
) -> TableInvestigationResult:
"""Execute investigation queries for a single table."""
# Parse SQL file
queries = self.parser.parse_sql_file(sql_path)
if not queries:
logger.warning(f"No valid queries found in {sql_path.name}")
return TableInvestigationResult(
schema=schema,
table=table,
sql_file_path=str(sql_path),
baseline_results=[],
target_results=[],
overall_status=Status.SKIP,
timestamp=get_timestamp()
)
logger.debug(f" └─ Executing {len(queries)} queries")
# Execute on baseline
baseline_results = self._execute_queries(
queries,
baseline_executor,
"baseline"
)
# Execute on target
target_results = self._execute_queries(
queries,
target_executor,
"target"
)
# Determine overall status
overall_status = self._determine_overall_status(
baseline_results,
target_results
)
return TableInvestigationResult(
schema=schema,
table=table,
sql_file_path=str(sql_path),
baseline_results=baseline_results,
target_results=target_results,
overall_status=overall_status,
timestamp=get_timestamp()
)
def _execute_queries(
self,
queries: List[Tuple[int, str]],
executor: QueryExecutor,
environment: str
) -> List[QueryExecutionResult]:
"""Execute list of queries on one environment."""
results = []
for query_num, query_text in queries:
logger.debug(f" └─ Query {query_num} on {environment}")
status, result_df, error_msg, exec_time = \
executor.execute_investigation_query(query_text)
result = QueryExecutionResult(
query_number=query_num,
query_text=query_text,
status=status,
execution_time_ms=exec_time,
result_data=result_df,
error_message=error_msg,
row_count=len(result_df) if result_df is not None else 0
)
results.append(result)
logger.debug(f" └─ {status.value} ({exec_time}ms, "
f"{result.row_count} rows)")
return results
def _determine_overall_status(
self,
baseline_results: List[QueryExecutionResult],
target_results: List[QueryExecutionResult]
) -> Status:
"""Determine overall status for table investigation."""
all_results = baseline_results + target_results
if not all_results:
return Status.SKIP
success_count = sum(1 for r in all_results if r.status == Status.PASS)
failed_count = sum(1 for r in all_results if r.status == Status.FAIL)
skipped_count = sum(1 for r in all_results if r.status == Status.SKIP)
# All successful
if success_count == len(all_results):
return Status.PASS
# All failed
if failed_count == len(all_results):
return Status.FAIL
# All skipped
if skipped_count == len(all_results):
return Status.SKIP
# Mixed results - use WARNING to indicate partial success
if success_count > 0:
return Status.WARNING
else:
return Status.FAIL
def _is_partial_status(self, result: TableInvestigationResult) -> bool:
"""Check if result represents partial success."""
all_results = result.baseline_results + result.target_results
if not all_results:
return False
success_count = sum(1 for r in all_results if r.status == Status.PASS)
return 0 < success_count < len(all_results)
def _get_status_symbol(self, status: Status) -> str:
"""Get symbol for status."""
symbols = {
Status.PASS: "",
Status.FAIL: "",
Status.WARNING: "",
Status.SKIP: "",
Status.ERROR: "🔴",
Status.INFO: ""
}
return symbols.get(status, "?")
def _log_summary(self, summary: InvestigationSummary) -> None:
"""Log investigation summary."""
logger.info("=" * 60)
logger.info("INVESTIGATION SUMMARY")
logger.info("=" * 60)
logger.info(f" Tables Processed: {summary.tables_processed}")
logger.info(f" Successful: {summary.tables_successful}")
logger.info(f" Partial: {summary.tables_partial}")
logger.info(f" Failed: {summary.tables_failed}")
logger.info(f" Total Queries: {summary.total_queries_executed}")
logger.info("=" * 60)
logger.info(f"Duration: {summary.duration_seconds} seconds")
logger.info(f"Success Rate: {summary.success_rate:.1f}%")
logger.info("=" * 60)

173
src/drt/services/sql_parser.py Executable file
View File

@@ -0,0 +1,173 @@
"""SQL file parser for investigation queries."""
import re
from pathlib import Path
from typing import List, Tuple
from drt.utils.logging import get_logger
logger = get_logger(__name__)
class SQLParser:
"""Parser for investigation SQL files."""
@staticmethod
def parse_sql_file(file_path: Path) -> List[Tuple[int, str]]:
"""
Parse SQL file into individual queries with their numbers.
Args:
file_path: Path to SQL file
Returns:
List of tuples (query_number, query_text)
Example:
>>> queries = SQLParser.parse_sql_file(Path("investigate.sql"))
>>> for num, query in queries:
... print(f"Query {num}: {query[:50]}...")
"""
try:
content = file_path.read_text(encoding='utf-8')
# Step 1: Remove markdown code blocks
content = SQLParser._remove_markdown(content)
# Step 2: Split into queries
queries = SQLParser._split_queries(content)
# Step 3: Clean and validate
cleaned_queries = []
for num, query in queries:
cleaned = SQLParser._clean_query(query)
if cleaned and SQLParser._is_valid_query(cleaned):
cleaned_queries.append((num, cleaned))
else:
logger.debug(f"Skipped invalid query {num} in {file_path.name}")
logger.info(f"Parsed {len(cleaned_queries)} queries from {file_path.name}")
return cleaned_queries
except Exception as e:
logger.error(f"Failed to parse {file_path}: {e}")
return []
@staticmethod
def _remove_markdown(content: str) -> str:
"""Remove markdown code blocks from content."""
# Remove opening ```sql
content = re.sub(r'```sql\s*\n?', '', content, flags=re.IGNORECASE)
# Remove closing ```
content = re.sub(r'```\s*\n?', '', content)
return content
@staticmethod
def _split_queries(content: str) -> List[Tuple[int, str]]:
"""
Split content into individual queries.
Looks for patterns like:
-- Query 1: Description
-- Query 2: Description
"""
queries = []
current_query = []
current_number = 0
for line in content.split('\n'):
# Check if line is a query separator
match = re.match(r'^\s*--\s*Query\s+(\d+):', line, re.IGNORECASE)
if match:
# Save previous query if exists
if current_query and current_number > 0:
query_text = '\n'.join(current_query).strip()
if query_text:
queries.append((current_number, query_text))
# Start new query
current_number = int(match.group(1))
current_query = []
else:
# Add line to current query
current_query.append(line)
# Don't forget the last query
if current_query and current_number > 0:
query_text = '\n'.join(current_query).strip()
if query_text:
queries.append((current_number, query_text))
return queries
@staticmethod
def _clean_query(query: str) -> str:
"""Clean query text."""
# Remove leading/trailing whitespace
query = query.strip()
# Remove comment-only lines at start
lines = query.split('\n')
while lines and lines[0].strip().startswith('--'):
lines.pop(0)
# Remove empty lines at start and end
while lines and not lines[0].strip():
lines.pop(0)
while lines and not lines[-1].strip():
lines.pop()
return '\n'.join(lines)
@staticmethod
def _is_valid_query(query: str) -> bool:
"""Check if query is valid (not empty, not just comments)."""
if not query:
return False
# Remove all comments and whitespace
cleaned = re.sub(r'--.*$', '', query, flags=re.MULTILINE)
cleaned = cleaned.strip()
# Must have some SQL content
return len(cleaned) > 0
def discover_sql_files(analysis_dir: Path) -> List[Tuple[str, str, Path]]:
"""
Discover all *_investigate.sql files in analysis directory.
Args:
analysis_dir: Root analysis directory
Returns:
List of tuples (schema, table, file_path)
Example:
>>> files = discover_sql_files(Path("analysis/output_20251209_184032"))
>>> for schema, table, path in files:
... print(f"{schema}.{table}: {path}")
"""
sql_files = []
# Pattern: dbo.TableName/dbo.TableName_investigate.sql
pattern = "**/*_investigate.sql"
for sql_file in analysis_dir.glob(pattern):
# Extract schema and table from filename
# Example: dbo.A_COREC_NACES2008_investigate.sql
filename = sql_file.stem # Remove .sql
if filename.endswith('_investigate'):
# Remove _investigate suffix
full_name = filename[:-12] # len('_investigate') = 12
# Split schema.table
if '.' in full_name:
schema, table = full_name.split('.', 1)
sql_files.append((schema, table, sql_file))
else:
logger.warning(f"Could not parse schema.table from {filename}")
logger.info(f"Discovered {len(sql_files)} investigation SQL files")
return sql_files
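A self-contained illustration of the file format the parser expects — numbered `-- Query N:` separators, optionally wrapped in markdown fences (omitted here) — written to a temporary file and parsed back:

```python
import tempfile
from pathlib import Path

from drt.services.sql_parser import SQLParser

sample = (
    "-- Query 1: Row counts per customer\n"
    "SELECT CustomerID, COUNT(*) AS cnt FROM dbo.FactSales GROUP BY CustomerID;\n"
    "\n"
    "-- Query 2: Total sales amount\n"
    "SELECT SUM(TotalAmount) AS total FROM dbo.FactSales;\n"
)

with tempfile.NamedTemporaryFile("w", suffix="_investigate.sql", delete=False) as f:
    f.write(sample)
    tmp_path = Path(f.name)

queries = SQLParser.parse_sql_file(tmp_path)
assert [num for num, _ in queries] == [1, 2]
tmp_path.unlink()
```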

7
src/drt/utils/__init__.py Executable file
View File

@@ -0,0 +1,7 @@
"""Utility functions and helpers."""
from drt.utils.timestamps import get_timestamp, format_duration
from drt.utils.patterns import matches_pattern
from drt.utils.logging import setup_logging
__all__ = ["get_timestamp", "format_duration", "matches_pattern", "setup_logging"]

75
src/drt/utils/logging.py Executable file
View File

@@ -0,0 +1,75 @@
"""Logging configuration and setup."""
import logging
import sys
from pathlib import Path
from typing import Optional
from drt.utils.timestamps import get_timestamp
def setup_logging(
log_level: str = "INFO",
log_dir: str = "./logs",
log_to_console: bool = True,
log_to_file: bool = True,
) -> logging.Logger:
"""
Configure logging for the framework.
Args:
log_level: Logging level (DEBUG, INFO, WARNING, ERROR)
log_dir: Directory for log files
log_to_console: Whether to log to console
log_to_file: Whether to log to file
Returns:
Configured logger instance
"""
# Create logger
logger = logging.getLogger("drt")
logger.setLevel(getattr(logging, log_level.upper()))
# Remove existing handlers
logger.handlers.clear()
# Create formatter
log_format = "%(asctime)s | %(levelname)-8s | %(name)-20s | %(message)s"
date_format = "%Y%m%d_%H%M%S"
formatter = logging.Formatter(log_format, datefmt=date_format)
# Console handler
if log_to_console:
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setLevel(getattr(logging, log_level.upper()))
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)
# File handler
if log_to_file:
log_path = Path(log_dir)
log_path.mkdir(parents=True, exist_ok=True)
timestamp = get_timestamp()
log_file = log_path / f"drt_{timestamp}.log"
file_handler = logging.FileHandler(log_file, encoding="utf-8")
file_handler.setLevel(logging.DEBUG) # Always log everything to file
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)
logger.info(f"Logging to file: {log_file}")
return logger
def get_logger(name: str) -> logging.Logger:
"""
Get a logger instance for a specific module.
Args:
name: Logger name (typically __name__)
Returns:
Logger instance
"""
return logging.getLogger(f"drt.{name}")

58
src/drt/utils/patterns.py Executable file
View File

@@ -0,0 +1,58 @@
"""Pattern matching utilities for wildcard support."""
import fnmatch
from typing import List, Optional
def matches_pattern(text: str, patterns: List[str]) -> bool:
"""
Check if text matches any of the given wildcard patterns.
Args:
text: Text to match
patterns: List of wildcard patterns (e.g., "*_TEMP", "tmp*")
Returns:
True if text matches any pattern, False otherwise
Examples:
>>> matches_pattern("Orders_TEMP", ["*_TEMP", "*_TMP"])
True
>>> matches_pattern("Orders", ["*_TEMP", "*_TMP"])
False
"""
if not patterns:
return False
for pattern in patterns:
if fnmatch.fnmatch(text.upper(), pattern.upper()):
return True
return False
def filter_by_patterns(
items: List[str], include_patterns: Optional[List[str]] = None, exclude_patterns: Optional[List[str]] = None
) -> List[str]:
"""
Filter items by include and exclude patterns.
Args:
items: List of items to filter
include_patterns: Patterns to include (if None, include all)
exclude_patterns: Patterns to exclude
Returns:
Filtered list of items
"""
result = items.copy()
# Apply include patterns if specified
if include_patterns:
result = [item for item in result if matches_pattern(item, include_patterns)]
# Apply exclude patterns
if exclude_patterns:
result = [item for item in result if not matches_pattern(item, exclude_patterns)]
return result

59
src/drt/utils/timestamps.py Executable file
View File

@@ -0,0 +1,59 @@
"""Timestamp utilities using YYYYMMDD_HHMMSS format."""
from datetime import datetime
def get_timestamp() -> str:
"""
Get current timestamp in YYYYMMDD_HHMMSS format.
Returns:
Formatted timestamp string
"""
return datetime.now().strftime("%Y%m%d_%H%M%S")
def format_duration(seconds: int) -> str:
"""
Format duration in seconds to human-readable string.
Args:
seconds: Duration in seconds
Returns:
Formatted duration string (e.g., "4 minutes 38 seconds")
"""
if seconds < 60:
return f"{seconds} second{'s' if seconds != 1 else ''}"
minutes = seconds // 60
remaining_seconds = seconds % 60
if minutes < 60:
if remaining_seconds == 0:
return f"{minutes} minute{'s' if minutes != 1 else ''}"
return f"{minutes} minute{'s' if minutes != 1 else ''} {remaining_seconds} second{'s' if remaining_seconds != 1 else ''}"
hours = minutes // 60
remaining_minutes = minutes % 60
parts = [f"{hours} hour{'s' if hours != 1 else ''}"]
if remaining_minutes > 0:
parts.append(f"{remaining_minutes} minute{'s' if remaining_minutes != 1 else ''}")
if remaining_seconds > 0:
parts.append(f"{remaining_seconds} second{'s' if remaining_seconds != 1 else ''}")
return " ".join(parts)
def parse_timestamp(timestamp_str: str) -> datetime:
"""
Parse timestamp string in YYYYMMDD_HHMMSS format.
Args:
timestamp_str: Timestamp string to parse
Returns:
datetime object
"""
return datetime.strptime(timestamp_str, "%Y%m%d_%H%M%S")

117
test_data/init_baseline.sql Executable file
View File

@@ -0,0 +1,117 @@
-- Baseline Database Initialization Script
-- This creates a sample database structure for testing
USE master;
GO
-- Create test database
IF NOT EXISTS (SELECT name FROM sys.databases WHERE name = 'TestDB_Baseline')
BEGIN
CREATE DATABASE TestDB_Baseline;
END
GO
USE TestDB_Baseline;
GO
-- Create sample tables
-- Dimension: Customers
CREATE TABLE dbo.DimCustomer (
CustomerID INT PRIMARY KEY IDENTITY(1,1),
CustomerName NVARCHAR(100) NOT NULL,
Email NVARCHAR(100),
City NVARCHAR(50),
Country NVARCHAR(50),
CreatedDate DATETIME DEFAULT GETDATE()
);
-- Dimension: Products
CREATE TABLE dbo.DimProduct (
ProductID INT PRIMARY KEY IDENTITY(1,1),
ProductName NVARCHAR(100) NOT NULL,
Category NVARCHAR(50),
UnitPrice DECIMAL(10,2),
IsActive BIT DEFAULT 1
);
-- Fact: Sales
CREATE TABLE dbo.FactSales (
SaleID INT PRIMARY KEY IDENTITY(1,1),
CustomerID INT,
ProductID INT,
SaleDate DATE,
Quantity INT,
UnitPrice DECIMAL(10,2),
TotalAmount DECIMAL(10,2),
TaxAmount DECIMAL(10,2),
FOREIGN KEY (CustomerID) REFERENCES dbo.DimCustomer(CustomerID),
FOREIGN KEY (ProductID) REFERENCES dbo.DimProduct(ProductID)
);
-- Insert sample data (TEST DATA ONLY - NOT REAL CUSTOMERS)
-- Customers
INSERT INTO dbo.DimCustomer (CustomerName, Email, City, Country) VALUES
('TestCustomer1', 'test1@test.local', 'City1', 'Country1'),
('TestCustomer2', 'test2@test.local', 'City2', 'Country2'),
('TestCustomer3', 'test3@test.local', 'City3', 'Country3'),
('TestCustomer4', 'test4@test.local', 'City4', 'Country4'),
('TestCustomer5', 'test5@test.local', 'City5', 'Country5');
-- Products
INSERT INTO dbo.DimProduct (ProductName, Category, UnitPrice, IsActive) VALUES
('Laptop', 'Electronics', 999.99, 1),
('Mouse', 'Electronics', 29.99, 1),
('Keyboard', 'Electronics', 79.99, 1),
('Monitor', 'Electronics', 299.99, 1),
('Desk Chair', 'Furniture', 199.99, 1),
('Desk', 'Furniture', 399.99, 1),
('Notebook', 'Stationery', 4.99, 1),
('Pen Set', 'Stationery', 12.99, 1);
-- Sales (100 records)
DECLARE @i INT = 1;
WHILE @i <= 100
BEGIN
INSERT INTO dbo.FactSales (CustomerID, ProductID, SaleDate, Quantity, UnitPrice, TotalAmount, TaxAmount)
VALUES (
(ABS(CHECKSUM(NEWID())) % 5) + 1, -- Random CustomerID 1-5
(ABS(CHECKSUM(NEWID())) % 8) + 1, -- Random ProductID 1-8
DATEADD(DAY, -ABS(CHECKSUM(NEWID())) % 365, GETDATE()), -- Random date in last year
(ABS(CHECKSUM(NEWID())) % 10) + 1, -- Random Quantity 1-10
(ABS(CHECKSUM(NEWID())) % 900) + 100.00, -- Random price 100-999
0, -- Will be calculated
0 -- Will be calculated
);
-- Calculate amounts
UPDATE dbo.FactSales
SET TotalAmount = Quantity * UnitPrice,
TaxAmount = Quantity * UnitPrice * 0.1
WHERE SaleID = @i;
SET @i = @i + 1;
END
GO
-- Create some views for testing
CREATE VIEW dbo.vw_SalesSummary AS
SELECT
c.CustomerName,
p.ProductName,
s.SaleDate,
s.Quantity,
s.TotalAmount
FROM dbo.FactSales s
JOIN dbo.DimCustomer c ON s.CustomerID = c.CustomerID
JOIN dbo.DimProduct p ON s.ProductID = p.ProductID;
GO
-- Create statistics
CREATE STATISTICS stat_sales_date ON dbo.FactSales(SaleDate);
CREATE STATISTICS stat_customer_country ON dbo.DimCustomer(Country);
GO
PRINT 'Baseline database initialized successfully';
GO

131
test_data/init_target.sql Executable file
View File

@@ -0,0 +1,131 @@
-- Target Database Initialization Script
-- This creates a similar structure with some intentional differences for testing
USE master;
GO
-- Create test database
IF NOT EXISTS (SELECT name FROM sys.databases WHERE name = 'TestDB_Target')
BEGIN
CREATE DATABASE TestDB_Target;
END
GO
USE TestDB_Target;
GO
-- Create sample tables (similar to baseline with some differences)
-- Dimension: Customers (same structure)
CREATE TABLE dbo.DimCustomer (
CustomerID INT PRIMARY KEY IDENTITY(1,1),
CustomerName NVARCHAR(100) NOT NULL,
Email NVARCHAR(100),
City NVARCHAR(50),
Country NVARCHAR(50),
CreatedDate DATETIME DEFAULT GETDATE()
);
-- Dimension: Products (slightly different - added column)
CREATE TABLE dbo.DimProduct (
ProductID INT PRIMARY KEY IDENTITY(1,1),
ProductName NVARCHAR(100) NOT NULL,
Category NVARCHAR(50),
UnitPrice DECIMAL(10,2),
IsActive BIT DEFAULT 1,
LastModified DATETIME DEFAULT GETDATE() -- Extra column for testing
);
-- Fact: Sales (same structure)
CREATE TABLE dbo.FactSales (
SaleID INT PRIMARY KEY IDENTITY(1,1),
CustomerID INT,
ProductID INT,
SaleDate DATE,
Quantity INT,
UnitPrice DECIMAL(10,2),
TotalAmount DECIMAL(10,2),
TaxAmount DECIMAL(10,2),
FOREIGN KEY (CustomerID) REFERENCES dbo.DimCustomer(CustomerID),
FOREIGN KEY (ProductID) REFERENCES dbo.DimProduct(ProductID)
);
-- Insert sample data (TEST DATA ONLY - NOT REAL CUSTOMERS)
-- Customers
INSERT INTO dbo.DimCustomer (CustomerName, Email, City, Country) VALUES
('TestCustomer1', 'test1@test.local', 'City1', 'Country1'),
('TestCustomer2', 'test2@test.local', 'City2', 'Country2'),
('TestCustomer3', 'test3@test.local', 'City3', 'Country3'),
('TestCustomer4', 'test4@test.local', 'City4', 'Country4'),
('TestCustomer5', 'test5@test.local', 'City5', 'Country5');
-- Products (with LastModified)
INSERT INTO dbo.DimProduct (ProductName, Category, UnitPrice, IsActive, LastModified) VALUES
('Laptop', 'Electronics', 999.99, 1, GETDATE()),
('Mouse', 'Electronics', 29.99, 1, GETDATE()),
('Keyboard', 'Electronics', 79.99, 1, GETDATE()),
('Monitor', 'Electronics', 299.99, 1, GETDATE()),
('Desk Chair', 'Furniture', 199.99, 1, GETDATE()),
('Desk', 'Furniture', 399.99, 1, GETDATE()),
('Notebook', 'Stationery', 4.99, 1, GETDATE()),
('Pen Set', 'Stationery', 12.99, 1, GETDATE());
-- Sales (95 records - 5 fewer than baseline for testing)
DECLARE @i INT = 1;
WHILE @i <= 95
BEGIN
INSERT INTO dbo.FactSales (CustomerID, ProductID, SaleDate, Quantity, UnitPrice, TotalAmount, TaxAmount)
VALUES (
(ABS(CHECKSUM(NEWID())) % 5) + 1,
(ABS(CHECKSUM(NEWID())) % 8) + 1,
DATEADD(DAY, -ABS(CHECKSUM(NEWID())) % 365, GETDATE()),
(ABS(CHECKSUM(NEWID())) % 10) + 1,
(ABS(CHECKSUM(NEWID())) % 900) + 100.00,
0,
0
);
-- Calculate amounts
UPDATE dbo.FactSales
SET TotalAmount = Quantity * UnitPrice,
TaxAmount = Quantity * UnitPrice * 0.1
WHERE SaleID = @i;
SET @i = @i + 1;
END
GO
-- Create the same view
CREATE VIEW dbo.vw_SalesSummary AS
SELECT
c.CustomerName,
p.ProductName,
s.SaleDate,
s.Quantity,
s.TotalAmount
FROM dbo.FactSales s
JOIN dbo.DimCustomer c ON s.CustomerID = c.CustomerID
JOIN dbo.DimProduct p ON s.ProductID = p.ProductID;
GO
-- Create an extra table that doesn't exist in baseline
CREATE TABLE dbo.TempProcessing (
ProcessID INT PRIMARY KEY IDENTITY(1,1),
ProcessName NVARCHAR(100),
Status NVARCHAR(20),
CreatedDate DATETIME DEFAULT GETDATE()
);
INSERT INTO dbo.TempProcessing (ProcessName, Status) VALUES
('DataLoad', 'Completed'),
('Validation', 'In Progress');
GO
-- Create statistics
CREATE STATISTICS stat_sales_date ON dbo.FactSales(SaleDate);
CREATE STATISTICS stat_customer_country ON dbo.DimCustomer(Country);
GO
PRINT 'Target database initialized successfully';
GO

View File

@@ -0,0 +1,97 @@
#!/bin/bash
# Setup script for test SQL Server environment
set -e
echo "=========================================="
echo "SQL Server Test Environment Setup"
echo "=========================================="
echo ""
# Check if Docker is installed
if ! command -v docker &> /dev/null; then
echo "Error: Docker is not installed"
echo "Please install Docker first: https://docs.docker.com/get-docker/"
exit 1
fi
# Check if Docker Compose is available (either standalone or plugin)
if ! command -v docker-compose &> /dev/null && ! docker compose version &> /dev/null; then
echo "Error: Docker Compose is not installed"
echo "Please install Docker Compose first"
exit 1
fi
# Determine which compose command to use
if docker compose version &> /dev/null; then
COMPOSE_CMD="docker compose"
else
COMPOSE_CMD="docker-compose"
fi
echo "Step 1: Starting SQL Server containers..."
$COMPOSE_CMD -f docker-compose.test.yml up -d
echo ""
echo "Step 2: Waiting for SQL Server to be ready..."
echo "This may take 30-60 seconds..."
# Set default password if not provided
SA_PASSWORD=${SA_PASSWORD:-YourStrong!Passw0rd}
# Wait for baseline server
echo -n "Waiting for baseline server"
for i in {1..30}; do
if docker exec drt-sqlserver-baseline /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C -Q "SELECT 1" &> /dev/null; then
echo " ✓"
break
fi
echo -n "."
sleep 2
done
# Wait for target server
echo -n "Waiting for target server"
for i in {1..30}; do
if docker exec drt-sqlserver-target /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C -Q "SELECT 1" &> /dev/null; then
echo " ✓"
break
fi
echo -n "."
sleep 2
done
echo ""
echo "Step 3: Initializing baseline database..."
docker exec -i drt-sqlserver-baseline /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C < test_data/init_baseline.sql
echo ""
echo "Step 4: Initializing target database..."
docker exec -i drt-sqlserver-target /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD" -C < test_data/init_target.sql
echo ""
echo "=========================================="
echo "Setup completed successfully!"
echo "=========================================="
echo ""
echo "SQL Server instances are running:"
echo " Baseline: localhost:1433"
echo " Target: localhost:1434"
echo ""
echo "Credentials:"
echo " Username: sa"
echo " Password: (set via SA_PASSWORD environment variable)"
echo ""
echo "Test databases:"
echo " Baseline: TestDB_Baseline"
echo " Target: TestDB_Target"
echo ""
echo "To test the connection:"
echo " drt discover --server localhost --database TestDB_Baseline --output config_test.yaml"
echo ""
echo "To stop the servers:"
echo " $COMPOSE_CMD -f docker-compose.test.yml down"
echo ""
echo "To stop and remove all data:"
echo " $COMPOSE_CMD -f docker-compose.test.yml down -v"
echo ""

3
tests/__init__.py Executable file
View File

@@ -0,0 +1,3 @@
"""
Test suite for Data Regression Testing Framework
"""

207
tests/test_config.py Executable file
View File

@@ -0,0 +1,207 @@
"""
Unit tests for configuration management
"""
import pytest
from pathlib import Path
from drt.config.models import (
DatabaseConnection,
DatabasePair,
ComparisonSettings,
RowCountSettings,
SchemaSettings,
AggregateSettings,
ReportingSettings,
LoggingSettings,
Config
)
class TestDatabaseConnection:
"""Test DatabaseConnection model"""
def test_database_connection_minimal(self):
"""Test creating a minimal database connection"""
conn = DatabaseConnection(
server="SQLSERVER01",
database="TestDB"
)
assert conn.server == "SQLSERVER01"
assert conn.database == "TestDB"
assert conn.timeout.connection == 30
assert conn.timeout.query == 300
def test_database_connection_with_timeout(self):
"""Test database connection with custom timeout"""
conn = DatabaseConnection(
server="SQLSERVER01",
database="TestDB",
timeout={"connection": 60, "query": 600}
)
assert conn.timeout.connection == 60
assert conn.timeout.query == 600


class TestDatabasePair:
    """Test DatabasePair model"""

    def test_database_pair_creation(self):
        """Test creating a database pair"""
        pair = DatabasePair(
            name="Test_Pair",
            enabled=True,
            baseline=DatabaseConnection(
                server="SQLSERVER01",
                database="PROD_DB"
            ),
            target=DatabaseConnection(
                server="SQLSERVER01",
                database="TEST_DB"
            )
        )
        assert pair.name == "Test_Pair"
        assert pair.enabled is True
        assert pair.baseline.database == "PROD_DB"
        assert pair.target.database == "TEST_DB"


class TestComparisonSettings:
    """Test ComparisonSettings model"""

    def test_comparison_settings_health_check(self):
        """Test health check mode settings"""
        settings = ComparisonSettings(
            mode="health_check",
            row_count=RowCountSettings(enabled=True, tolerance_percent=0.0),
            schema=SchemaSettings(
                enabled=True,
                checks={
                    "column_names": True,
                    "data_types": True
                }
            ),
            aggregates=AggregateSettings(enabled=False)
        )
        assert settings.mode == "health_check"
        assert settings.row_count.enabled is True
        assert settings.aggregates.enabled is False

    def test_comparison_settings_full_mode(self):
        """Test full mode settings"""
        settings = ComparisonSettings(
            mode="full",
            row_count=RowCountSettings(enabled=True, tolerance_percent=0.0),
            schema=SchemaSettings(enabled=True),
            aggregates=AggregateSettings(enabled=True, tolerance_percent=0.01)
        )
        assert settings.mode == "full"
        assert settings.aggregates.enabled is True
        assert settings.aggregates.tolerance_percent == 0.01


class TestReportingSettings:
    """Test ReportingSettings model"""

    def test_reporting_settings_defaults(self):
        """Test default reporting settings"""
        settings = ReportingSettings()
        assert settings.output_dir == "./reports"
        assert settings.formats.html is True
        assert settings.formats.csv is True
        assert settings.formats.pdf is False
        assert settings.include_timestamp is True

    def test_reporting_settings_custom(self):
        """Test custom reporting settings"""
        settings = ReportingSettings(
            output_dir="./custom_reports",
            filename_prefix="custom_test",
            formats={"html": True, "csv": False, "pdf": True}
        )
        assert settings.output_dir == "./custom_reports"
        assert settings.filename_prefix == "custom_test"
        assert settings.formats.pdf is True


class TestLoggingSettings:
    """Test LoggingSettings model"""

    def test_logging_settings_defaults(self):
        """Test default logging settings"""
        settings = LoggingSettings()
        assert settings.level == "INFO"
        assert settings.output_dir == "./logs"
        assert settings.console.enabled is True
        assert settings.file.enabled is True

    def test_logging_settings_custom(self):
        """Test custom logging settings"""
        settings = LoggingSettings(
            level="DEBUG",
            console={"enabled": True, "level": "WARNING"}
        )
        assert settings.level == "DEBUG"
        assert settings.console.level == "WARNING"


class TestConfig:
    """Test Config model"""

    def test_config_minimal(self):
        """Test creating a minimal config"""
        config = Config(
            database_pairs=[
                DatabasePair(
                    name="Test",
                    enabled=True,
                    baseline=DatabaseConnection(
                        server="SERVER01",
                        database="PROD"
                    ),
                    target=DatabaseConnection(
                        server="SERVER01",
                        database="TEST"
                    )
                )
            ],
            comparison=ComparisonSettings(
                mode="health_check",
                row_count=RowCountSettings(enabled=True),
                schema=SchemaSettings(enabled=True),
                aggregates=AggregateSettings(enabled=False)
            ),
            tables=[]
        )
        assert len(config.database_pairs) == 1
        assert config.comparison.mode == "health_check"
        assert len(config.tables) == 0

    def test_config_with_tables(self):
        """Test config with table definitions"""
        from drt.models.table import TableInfo

        config = Config(
            database_pairs=[
                DatabasePair(
                    name="Test",
                    enabled=True,
                    baseline=DatabaseConnection(server="S1", database="D1"),
                    target=DatabaseConnection(server="S1", database="D2")
                )
            ],
            comparison=ComparisonSettings(
                mode="health_check",
                row_count=RowCountSettings(enabled=True),
                schema=SchemaSettings(enabled=True),
                aggregates=AggregateSettings(enabled=False)
            ),
            tables=[
                TableInfo(
                    schema="dbo",
                    name="TestTable",
                    enabled=True,
                    expected_in_target=True
                )
            ]
        )
        assert len(config.tables) == 1
        assert config.tables[0].name == "TestTable"
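
# For reference, the models exercised above map naturally onto a YAML layout along these
# lines (illustrative sketch; field names mirror the Pydantic models, but the shipped
# config.example.yaml may arrange things differently):
#
#   database_pairs:
#     - name: Test_Pair
#       enabled: true
#       baseline:
#         server: SQLSERVER01
#         database: PROD_DB
#       target:
#         server: SQLSERVER01
#         database: TEST_DB
#   comparison:
#     mode: health_check
#     row_count:
#       enabled: true
#       tolerance_percent: 0.0
#     aggregates:
#       enabled: false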

186
tests/test_models.py Executable file
View File

@@ -0,0 +1,186 @@
"""
Unit tests for data models
"""
import pytest
from drt.models.enums import Status, CheckType
from drt.models.table import TableInfo, ColumnInfo
from drt.models.results import CheckResult, ComparisonResult
class TestStatus:
"""Test Status enum"""
def test_status_values(self):
"""Test status enum values"""
assert Status.PASS.value == "PASS"
assert Status.FAIL.value == "FAIL"
assert Status.WARNING.value == "WARNING"
assert Status.ERROR.value == "ERROR"
assert Status.INFO.value == "INFO"
assert Status.SKIP.value == "SKIP"
def test_status_severity(self):
"""Test status severity comparison"""
assert Status.FAIL.severity > Status.WARNING.severity
assert Status.WARNING.severity > Status.PASS.severity
assert Status.ERROR.severity > Status.FAIL.severity
class TestCheckType:
"""Test CheckType enum"""
def test_check_type_values(self):
"""Test check type enum values"""
assert CheckType.TABLE_EXISTENCE.value == "TABLE_EXISTENCE"
assert CheckType.ROW_COUNT.value == "ROW_COUNT"
assert CheckType.SCHEMA.value == "SCHEMA"
assert CheckType.AGGREGATE.value == "AGGREGATE"
class TestTableInfo:
"""Test TableInfo model"""
def test_table_info_creation(self):
"""Test creating a TableInfo instance"""
table = TableInfo(
schema="dbo",
name="TestTable",
enabled=True,
expected_in_target=True
)
assert table.schema == "dbo"
assert table.name == "TestTable"
assert table.enabled is True
assert table.expected_in_target is True
assert table.aggregate_columns == []
def test_table_info_with_aggregates(self):
"""Test TableInfo with aggregate columns"""
table = TableInfo(
schema="dbo",
name="FactSales",
enabled=True,
expected_in_target=True,
aggregate_columns=["Amount", "Quantity"]
)
assert len(table.aggregate_columns) == 2
assert "Amount" in table.aggregate_columns
class TestColumnInfo:
"""Test ColumnInfo model"""
def test_column_info_creation(self):
"""Test creating a ColumnInfo instance"""
column = ColumnInfo(
name="CustomerID",
data_type="int",
is_nullable=False,
is_primary_key=True
)
assert column.name == "CustomerID"
assert column.data_type == "int"
assert column.is_nullable is False
assert column.is_primary_key is True
class TestCheckResult:
"""Test CheckResult model"""
def test_check_result_pass(self):
"""Test creating a passing check result"""
result = CheckResult(
check_type=CheckType.ROW_COUNT,
status=Status.PASS,
message="Row counts match",
baseline_value=1000,
target_value=1000
)
assert result.status == Status.PASS
assert result.baseline_value == 1000
assert result.target_value == 1000
def test_check_result_fail(self):
"""Test creating a failing check result"""
result = CheckResult(
check_type=CheckType.ROW_COUNT,
status=Status.FAIL,
message="Row count mismatch",
baseline_value=1000,
target_value=950
)
assert result.status == Status.FAIL
assert result.baseline_value != result.target_value
class TestComparisonResult:
"""Test ComparisonResult model"""
def test_comparison_result_creation(self):
"""Test creating a ComparisonResult instance"""
result = ComparisonResult(
schema="dbo",
table="TestTable"
)
assert result.schema == "dbo"
assert result.table == "TestTable"
assert len(result.checks) == 0
def test_add_check_result(self):
"""Test adding check results"""
comparison = ComparisonResult(
schema="dbo",
table="TestTable"
)
check = CheckResult(
check_type=CheckType.ROW_COUNT,
status=Status.PASS,
message="Row counts match"
)
comparison.checks.append(check)
assert len(comparison.checks) == 1
assert comparison.checks[0].status == Status.PASS
def test_overall_status_all_pass(self):
"""Test overall status when all checks pass"""
comparison = ComparisonResult(
schema="dbo",
table="TestTable"
)
comparison.checks.append(CheckResult(
check_type=CheckType.TABLE_EXISTENCE,
status=Status.PASS,
message="Table exists"
))
comparison.checks.append(CheckResult(
check_type=CheckType.ROW_COUNT,
status=Status.PASS,
message="Row counts match"
))
assert comparison.overall_status == Status.PASS
def test_overall_status_with_failure(self):
"""Test overall status when one check fails"""
comparison = ComparisonResult(
schema="dbo",
table="TestTable"
)
comparison.checks.append(CheckResult(
check_type=CheckType.TABLE_EXISTENCE,
status=Status.PASS,
message="Table exists"
))
comparison.checks.append(CheckResult(
check_type=CheckType.ROW_COUNT,
status=Status.FAIL,
message="Row count mismatch"
))
assert comparison.overall_status == Status.FAIL
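
# For reference: the two overall_status tests above are consistent with a
# "highest-severity check wins" aggregation. A sketch of one implementation that would
# satisfy them (not necessarily the framework's own):
#
#     @property
#     def overall_status(self) -> Status:
#         if not self.checks:
#             return Status.PASS  # empty case is an assumption; not exercised above
#         return max((c.status for c in self.checks), key=lambda s: s.severity)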

83
tests/test_utils.py Executable file
View File

@@ -0,0 +1,83 @@
"""
Unit tests for utility functions
"""
import pytest
from datetime import datetime
from drt.utils.timestamps import format_timestamp, format_duration
from drt.utils.patterns import matches_pattern
class TestTimestamps:
"""Test timestamp utilities"""
def test_format_timestamp(self):
"""Test timestamp formatting"""
dt = datetime(2024, 1, 15, 14, 30, 45)
formatted = format_timestamp(dt)
assert formatted == "20240115_143045"
def test_format_timestamp_current(self):
"""Test formatting current timestamp"""
formatted = format_timestamp()
# Should be in YYYYMMDD_HHMMSS format
assert len(formatted) == 15
assert formatted[8] == "_"
def test_format_duration_seconds(self):
"""Test duration formatting for seconds"""
duration = format_duration(45.5)
assert duration == "45.50s"
def test_format_duration_minutes(self):
"""Test duration formatting for minutes"""
duration = format_duration(125.0)
assert duration == "2m 5.00s"
def test_format_duration_hours(self):
"""Test duration formatting for hours"""
duration = format_duration(3725.0)
assert duration == "1h 2m 5.00s"


class TestPatterns:
    """Test pattern matching utilities"""

    def test_exact_match(self):
        """Test exact pattern matching"""
        assert matches_pattern("TestTable", "TestTable") is True
        assert matches_pattern("TestTable", "OtherTable") is False

    def test_wildcard_star(self):
        """Test wildcard * pattern"""
        assert matches_pattern("TestTable", "Test*") is True
        assert matches_pattern("TestTable", "*Table") is True
        assert matches_pattern("TestTable", "*est*") is True
        assert matches_pattern("TestTable", "Other*") is False

    def test_wildcard_question(self):
        """Test wildcard ? pattern"""
        assert matches_pattern("Test1", "Test?") is True
        assert matches_pattern("TestA", "Test?") is True
        assert matches_pattern("Test12", "Test?") is False
        assert matches_pattern("Test", "Test?") is False

    def test_combined_wildcards(self):
        """Test combined wildcard patterns"""
        assert matches_pattern("Test_Table_01", "Test_*_??") is True
        assert matches_pattern("Test_Table_1", "Test_*_??") is False

    def test_case_sensitivity(self):
        """Test case-sensitive matching"""
        assert matches_pattern("TestTable", "testtable") is False
        assert matches_pattern("TestTable", "TestTable") is True

    def test_empty_pattern(self):
        """Test empty pattern"""
        assert matches_pattern("TestTable", "") is False
        assert matches_pattern("", "") is True

    def test_special_characters(self):
        """Test patterns with special characters"""
        assert matches_pattern("Test.Table", "Test.Table") is True
        assert matches_pattern("Test_Table", "Test_*") is True
        assert matches_pattern("Test-Table", "Test-*") is True