Data Regression Testing Framework
A comprehensive framework for validating data integrity during code migration and system updates by comparing data outputs between Baseline (Production) and Target (Test) SQL Server databases.
✨ Features
- Automated Discovery - Scan databases and auto-generate configuration files
- Multiple Comparison Types - Row counts, schema validation, aggregate sums
- Investigation Queries - Execute diagnostic SQL queries from regression analysis
- Flexible Configuration - YAML-based setup with extensive customization
- Rich Reporting - HTML, CSV, and PDF reports with detailed results
- Windows Authentication - Secure, credential-free database access
- Read-Only Operations - All queries are SELECT-only for safety
- Comprehensive Logging - Detailed execution logs with timestamps
🚀 Quick Start
Prerequisites
- Python 3.9+
- Microsoft ODBC Driver 17+ for SQL Server
- Windows environment with domain authentication (or Linux with Kerberos)
- Read access to SQL Server databases
Installation
# Clone the repository
git clone <repository-url>
cd data_regression_testing
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the framework
pip install -e .
# Verify installation
drt --version
Basic Usage
# 1. Discover tables from baseline database
drt discover --server <YOUR_SERVER> --database <YOUR_BASELINE_DB> --output config.yaml
# 2. Edit config.yaml to add target database connection
# 3. Validate configuration
drt validate --config config.yaml
# 4. Run comparison
drt compare --config config.yaml
# 5. (Optional) Investigate regression issues
drt investigate --analysis-dir analysis/output_<TIMESTAMP>/ --config config.yaml
📦 Platform-Specific Installation
Windows
- Install Python 3.9+ from https://www.python.org/downloads/
- Microsoft ODBC Driver 17+ for SQL Server is often already present on Windows; install it from Microsoft if it is missing
- Install Framework:
python -m venv venv
venv\Scripts\activate
pip install -e .
Linux (Debian/Ubuntu)
# Install ODBC Driver
curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | sudo gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg
curl https://packages.microsoft.com/config/debian/12/prod.list | sudo tee /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18 unixodbc-dev
# Install Kerberos for Windows Authentication
sudo apt-get install -y krb5-user
# Configure /etc/krb5.conf with your domain settings
# Then obtain ticket: kinit username@YOUR_DOMAIN.COM
# Install framework
python3 -m venv venv
source venv/bin/activate
pip install -e .
📋 Commands
Discovery
Automatically scan databases and generate configuration files.
drt discover --server <YOUR_SERVER> --database <YOUR_DATABASE> [OPTIONS]
Options:
- `--server TEXT` - SQL Server hostname (required)
- `--database TEXT` - Database name (required)
- `--output, -o TEXT` - Output file (default: config_discovered.yaml)
- `--schemas TEXT` - Specific schemas to include
- `--verbose, -v` - Enable verbose output
Validate
Validate configuration file syntax and database connectivity.
drt validate --config <CONFIG_FILE> [OPTIONS]
Options:
- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
Compare
Execute data comparison between baseline and target databases.
drt compare --config <CONFIG_FILE> [OPTIONS]
Options:
- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be compared without executing
Investigate
Execute diagnostic queries from regression analysis.
drt investigate --analysis-dir <ANALYSIS_DIR> --config <CONFIG_FILE> [OPTIONS]
Options:
- `--analysis-dir, -a PATH` - Analysis output directory containing `*_investigate.sql` files (required)
- `--config, -c PATH` - Configuration file (required)
- `--output-dir, -o PATH` - Output directory for reports (default: ./investigation_reports)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be executed without running
Example:
drt investigate -a analysis/output_20251209_184032/ -c config.yaml
drt investigate -a analysis/output_20251209_184032/ -c config.yaml -o ./my_reports
What it does:
- Discovers all `*_investigate.sql` files in the analysis directory
- Parses SQL files (handles markdown and multiple queries per file); a sketch of this parsing step follows the list
- Executes queries on both baseline and target databases
- Handles errors gracefully (continues on failures)
- Generates HTML and CSV reports with side-by-side comparisons
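The parsing step is conceptually simple. As a rough, self-contained sketch (not the framework's actual `sql_parser.py`, which may handle more edge cases), splitting an investigation file into individual queries could look like this:

```python
import re
from pathlib import Path

def split_investigation_queries(sql_file: Path) -> list[str]:
    """Illustrative only: strip markdown code fences and split a *_investigate.sql
    file into individual statements. The framework's real parser may differ."""
    text = sql_file.read_text(encoding="utf-8")
    # Drop markdown fence lines such as ```sql and ```
    text = re.sub(r"^```.*$", "", text, flags=re.MULTILINE)
    # Split on statement terminators or GO batch separators
    parts = re.split(r";\s*\n|^\s*GO\s*$", text, flags=re.MULTILINE | re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

# Collect queries from every *_investigate.sql file in an analysis directory
analysis_dir = Path("analysis/output_20251209_184032")
queries = {f.name: split_investigation_queries(f) for f in analysis_dir.glob("*_investigate.sql")}
```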
⚙️ Configuration
Database Connections
database_pairs:
  - name: "DWH_Comparison"
    enabled: true
    baseline:
      server: "<YOUR_SERVER>"
      database: "<YOUR_BASELINE_DB>"
      timeout:
        connection: 30
        query: 300
    target:
      server: "<YOUR_SERVER>"
      database: "<YOUR_TARGET_DB>"
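With Windows Authentication, no credentials appear in the configuration; the connection is built from the server and database names alone. As an illustration only (this mirrors the config above but is not the framework's internal connection code, and the driver name may differ in your environment):

```python
import pyodbc

# Illustrative trusted (Windows Authentication) connection built from the same
# fields the YAML config describes. Adjust the driver name to what is installed,
# e.g. "ODBC Driver 17 for SQL Server".
conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<YOUR_SERVER>;"
    "DATABASE=<YOUR_BASELINE_DB>;"
    "Trusted_Connection=yes;"
    "TrustServerCertificate=yes;"
)
with pyodbc.connect(conn_str, timeout=30) as conn:  # timeout mirrors the connection setting
    print(conn.cursor().execute("SELECT @@VERSION;").fetchone()[0])
```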
Comparison Settings
comparison:
  mode: "health_check"  # or "full"
  row_count:
    enabled: true
    tolerance_percent: 0.0
  schema:
    enabled: true
    checks:
      column_names: true
      data_types: true
  aggregates:
    enabled: true
    tolerance_percent: 0.01
Table Configuration
tables:
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Amount"
      - "Quantity"
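For each enabled table, the aggregate check in effect sums the configured columns on both sides and compares the totals within the tolerance. A rough illustration of the kind of query this implies (not the framework's actual query builder):

```python
def build_aggregate_query(schema: str, table: str, columns: list[str]) -> str:
    """Illustrative sketch of an aggregate comparison query for one table."""
    sums = ", ".join(f"SUM([{col}]) AS [{col}_sum]" for col in columns)
    return f"SELECT {sums} FROM [{schema}].[{table}];"

# For the configuration above this yields:
# SELECT SUM([Amount]) AS [Amount_sum], SUM([Quantity]) AS [Quantity_sum] FROM [dbo].[FactTable1];
print(build_aggregate_query("dbo", "FactTable1", ["Amount", "Quantity"]))
```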
Output Directories
reporting:
  output_dir: "./reports"
  investigation_dir: "./investigation_reports"
logging:
  output_dir: "./logs"
discovery:
  analysis_directory: "./analysis"
Benefits:
- Centralized storage of all output files
- Easy cleanup and management of generated files
- Configuration flexibility via YAML
- Backward compatibility with CLI overrides
📊 Reports
Comparison Reports
The framework generates comprehensive reports in multiple formats:
- HTML Report - Visual summary with color-coded results and detailed breakdowns
- CSV Report - Machine-readable format for Excel or databases
- PDF Report - Professional formatted output (requires weasyprint)
Reports are saved to ./reports/ with timestamps.
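Because the CSV report is machine-readable, failed checks can be pulled out programmatically. A small, hedged example (the exact column layout depends on the generated report, so this version simply matches rows containing FAIL):

```python
import csv
from pathlib import Path

# Pick the most recent comparison report and print every row mentioning FAIL.
# Illustrative only; adapt the column handling to the actual report layout.
latest = max(Path("./reports").glob("*.csv"), key=lambda p: p.stat().st_mtime)
with latest.open(newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if any("FAIL" in field for field in row):
            print(row)
```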
Investigation Reports
- HTML Report - Interactive report with collapsible query results, side-by-side baseline vs target comparison
- CSV Report - Flattened structure with one row per query execution
Investigation reports are saved to ./investigation_reports/ with timestamps.
🔄 Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success - all comparisons passed |
| 1 | Failures detected - one or more FAIL results |
| 2 | Execution error - configuration or connection issues |
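These exit codes make it easy to wrap the tool in CI or orchestration scripts. For example, a hypothetical Python wrapper that distinguishes data regressions from execution errors:

```python
import subprocess
import sys

# Hypothetical wrapper: run the comparison and react to the documented exit codes.
result = subprocess.run(["drt", "compare", "--config", "config.yaml"])
if result.returncode == 0:
    print("All comparisons passed")
elif result.returncode == 1:
    print("Regressions detected - review the reports in ./reports/")
    sys.exit(1)
else:
    print("Execution error - check configuration and connectivity")
    sys.exit(result.returncode)
```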
🧪 Testing
Docker Test Environment
# Start test SQL Server containers
bash test_data/setup_test_environment.sh
# Test discovery
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
# Test comparison
drt compare --config config.test.yaml
# Cleanup
docker-compose -f docker-compose.test.yml down -v
Manual Testing
# Connect to test databases (use SA_PASSWORD environment variable)
docker exec -it drt-sqlserver-baseline /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD"
# Run queries to verify data
SELECT COUNT(*) FROM dbo.FactTable1;
🚢 Deployment
Scheduled Execution
Windows Task Scheduler:
@echo off
cd C:\path\to\framework
call venv\Scripts\activate.bat
drt compare --config config.yaml
if %ERRORLEVEL% NEQ 0 (
echo Test failed with exit code %ERRORLEVEL%
exit /b %ERRORLEVEL%
)
Linux Cron:
# Run daily at 2 AM
0 2 * * * /path/to/venv/bin/drt compare --config /path/to/config.yaml >> /path/to/logs/cron.log 2>&1
Monitoring
# Watch logs
tail -f logs/drt_*.log
# Search for failures
grep -i "FAIL\|ERROR" logs/drt_*.log
🏗️ Architecture
src/drt/
├── cli/ # Command-line interface
│ └── commands/ # CLI commands (compare, discover, validate, investigate)
├── config/ # Configuration management
├── database/ # Database connectivity (READ ONLY)
├── models/ # Data models
├── reporting/ # Report generators
├── services/ # Business logic
│ ├── checkers/ # Comparison checkers
│ ├── investigation.py # Investigation service
│ └── sql_parser.py # SQL file parser
└── utils/ # Utilities
🔒 Security
- Windows Authentication Only - No stored credentials
- Read-Only Operations - All queries are SELECT-only
- Minimal Permissions - Only requires db_datareader role
- No Data Logging - Sensitive data never logged
🔧 Troubleshooting
Connection Failed
# Test connectivity
drt discover --server <YOUR_SERVER> --database master
# Verify ODBC driver
odbcinst -q -d
# Check permissions
# User needs db_datareader role on target databases
Query Timeout
Increase timeout in configuration:
baseline:
  timeout:
    query: 600  # 10 minutes
Linux Kerberos Issues
# Check ticket
klist
# Renew if expired
kinit username@YOUR_DOMAIN.COM
# Verify ticket is valid
klist
⚡ Performance
Diagnostic Logging
Enable verbose mode to see detailed timing:
drt compare --config config.yaml --verbose
This shows:
- Per-check timing (existence, row count, schema, aggregates)
- Query execution times
- Parallelization opportunities
Optimization Tips
- Disable aggregate checks for surrogate keys
- Increase query timeouts for large tables
- Use table filtering to focus on critical tables
- Consider parallel execution for multiple database pairs (see the sketch below)
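If multiple database pairs are compared, parallelism can also be added at the orchestration level rather than inside the framework. A hedged sketch that runs one `drt compare` process per configuration file concurrently (the config file names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

# Placeholder config files, one per database pair
configs = ["config_pair1.yaml", "config_pair2.yaml", "config_pair3.yaml"]

def run_compare(config: str) -> int:
    """Run one comparison and return its exit code."""
    return subprocess.run(["drt", "compare", "--config", config]).returncode

with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    exit_codes = list(pool.map(run_compare, configs))

# Worst exit code wins: 2 (error) > 1 (failures) > 0 (success)
print("Overall result:", max(exit_codes))
```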
👨‍💻 Development
Getting Started
- Fork the repository on GitHub
- Clone your fork locally:
git clone https://git.informatica-polet.com/drt/drt.git
cd data_regression_testing
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
pip install -e .
- Install development dependencies:
pip install pytest pytest-cov black flake8 mypy
Development Workflow
1. Create a Branch
git checkout -b feature/your-feature-name
# or
git checkout -b bugfix/issue-description
2. Make Your Changes
- Write clean, readable code
- Follow the existing code style
- Add docstrings to all functions and classes
- Update documentation as needed
3. Run Tests
# All tests
pytest
# With coverage
pytest --cov=src/drt --cov-report=html
# Specific test file
pytest tests/test_models.py
4. Code Quality Checks
# Format code with black
black src/ tests/
# Check code style with flake8
flake8 src/ tests/
# Type checking with mypy
mypy src/
5. Commit Your Changes
Write clear, descriptive commit messages:
git add .
git commit -m "Add feature: description of your changes"
Commit message guidelines:
- Use present tense ("Add feature" not "Added feature")
- Use imperative mood ("Move cursor to..." not "Moves cursor to...")
- Limit first line to 72 characters
- Reference issues and pull requests when relevant
6. Push and Create Pull Request
git push origin feature/your-feature-name
Create a pull request on GitHub with:
- Clear title and description
- Reference to related issues
- Screenshots (if applicable)
- Test results
Code Style Guidelines
Python Style
- Follow PEP 8 style guide
- Use type hints for function parameters and return values
- Maximum line length: 100 characters
- Use meaningful variable and function names
Example:
def calculate_row_count_difference(
    baseline_count: int,
    target_count: int,
    tolerance_percent: float
) -> tuple[bool, float]:
    """
    Calculate if row count difference is within tolerance.

    Args:
        baseline_count: Row count from baseline database
        target_count: Row count from target database
        tolerance_percent: Acceptable difference percentage

    Returns:
        Tuple of (is_within_tolerance, actual_difference_percent)
    """
    # Implementation here
    pass
Documentation
- Add docstrings to all public functions, classes, and modules
- Use Google-style docstrings
- Include examples in docstrings when helpful
- Update README.md for user-facing changes
Testing
- Write unit tests for all new functionality
- Aim for >80% code coverage
- Use descriptive test names
- Follow AAA pattern (Arrange, Act, Assert)
Example:
def test_row_count_checker_exact_match():
    """Test row count checker with exact match"""
    # Arrange
    checker = RowCountChecker(tolerance_percent=0.0)

    # Act
    result = checker.check(baseline_count=1000, target_count=1000)

    # Assert
    assert result.status == Status.PASS
    assert result.baseline_value == 1000
    assert result.target_value == 1000
Adding New Features
New Checker Type
To add a new comparison checker:
- Create new checker in `src/drt/services/checkers/`
- Inherit from `BaseChecker`
- Implement the `check()` method
- Add a new `CheckType` enum value
- Register it in `ComparisonService`
- Add tests in `tests/test_checkers.py`
- Update documentation
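The list above names the framework's internal types (`BaseChecker`, `CheckType`, `ComparisonService`); their exact signatures live in the source. The following sketch uses stand-in types so it is self-contained, and should not be read as the real interfaces:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Stand-in for the framework's result model."""
    status: str
    baseline_value: float
    target_value: float


class BaseChecker(ABC):
    """Stand-in for the framework's BaseChecker in src/drt/services/checkers/."""

    @abstractmethod
    def check(self, baseline_value: float, target_value: float) -> CheckResult:
        ...


class NullRatioChecker(BaseChecker):
    """Hypothetical new checker comparing NULL ratios between baseline and target."""

    def __init__(self, tolerance_percent: float = 0.0):
        self.tolerance_percent = tolerance_percent

    def check(self, baseline_value: float, target_value: float) -> CheckResult:
        diff = abs(baseline_value - target_value)
        status = "PASS" if diff <= self.tolerance_percent else "FAIL"
        return CheckResult(status, baseline_value, target_value)
```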
New Report Format
To add a new report format:
- Create new reporter in `src/drt/reporting/`
- Implement the `generate()` method
- Add the format option to configuration
- Update `ReportGenerator` to use the new format
- Add tests
- Update documentation
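As with checkers, the real reporter interface is defined in `src/drt/reporting/`; the sketch below only suggests the shape a new format might take (the `generate()` signature here is an assumption):

```python
import json
from pathlib import Path


class JsonReporter:
    """Hypothetical JSON reporter; the framework's actual generate() signature may differ."""

    def __init__(self, output_dir: str = "./reports"):
        self.output_dir = Path(output_dir)

    def generate(self, results: list[dict]) -> Path:
        """Write comparison results to a JSON file and return its path."""
        self.output_dir.mkdir(parents=True, exist_ok=True)
        report_path = self.output_dir / "comparison_report.json"
        report_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
        return report_path
```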
Testing
Unit Tests
Run the test suite:
# All tests
pytest
# With coverage report
pytest --cov=src/drt --cov-report=html
# Specific test file
pytest tests/test_models.py -v
# Specific test function
pytest tests/test_models.py::test_status_enum -v
Integration Tests
Use the Docker test environment:
# Start test databases
bash test_data/setup_test_environment.sh
# Run integration tests
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
drt compare --config config.test.yaml
# Cleanup
docker-compose -f docker-compose.test.yml down -v
Manual Testing
# Test against real databases (requires access)
drt discover --server <YOUR_SERVER> --database <YOUR_DB> --output manual_test.yaml
drt validate --config manual_test.yaml
drt compare --config manual_test.yaml --dry-run
Reporting Issues
When reporting issues, please include:
- Clear description of the problem
- Steps to reproduce
- Expected vs actual behavior
- Environment details (OS, Python version, ODBC driver version)
- Relevant logs or error messages
- Configuration file (sanitized - remove server names/credentials)
Example:
**Description:** Row count comparison fails with timeout error
**Steps to Reproduce:**
1. Configure comparison for large table (>1M rows)
2. Run `drt compare --config config.yaml`
3. Observe timeout error
**Expected:** Comparison completes successfully
**Actual:** Query timeout after 300 seconds
**Environment:**
- OS: Windows 10
- Python: 3.9.7
- ODBC Driver: 17 for SQL Server
**Logs:**
ERROR: Query timeout on table dbo.FactTable1
Feature Requests
For feature requests, please:
- Check if feature already exists or is planned
- Describe the use case clearly
- Explain why it would be valuable
- Provide examples if possible
Code Review Process
All contributions go through code review:
- Automated checks must pass (tests, linting)
- At least one maintainer approval required
- Address review feedback promptly
- Keep pull requests focused and reasonably sized
Release Process
Releases follow semantic versioning (MAJOR.MINOR.PATCH):
- MAJOR - Breaking changes
- MINOR - New features (backward compatible)
- PATCH - Bug fixes (backward compatible)
Development Tips
Debugging
# Enable verbose logging
drt compare --config config.yaml --verbose
# Use dry-run to test without execution
drt compare --config config.yaml --dry-run
# Check configuration validity
drt validate --config config.yaml
Performance Profiling
# Enable diagnostic logging
drt compare --config config.yaml --verbose
# Look for timing information in logs
grep "execution time" logs/drt_*.log
Docker Development
# Build and test in Docker
docker build -t drt:dev .
docker run -v $(pwd)/config.yaml:/app/config.yaml drt:dev compare --config /app/config.yaml
📝 License
MIT License - see LICENSE file for details
📞 Support
For issues and questions:
- GitHub Issues: https://git.informatica-polet.com/drt/drt/issues
- Check logs in `./logs/`
- Review configuration with `drt validate`
- Test connectivity with `drt discover`