# Data Regression Testing Framework

A comprehensive framework for validating data integrity during code migration and system updates by comparing data outputs between Baseline (Production) and Target (Test) SQL Server databases.

## ✨ Features

- **Automated Discovery** - Scan databases and auto-generate configuration files
- **Multiple Comparison Types** - Row counts, schema validation, aggregate sums
- **Investigation Queries** - Execute diagnostic SQL queries from regression analysis
- **Flexible Configuration** - YAML-based setup with extensive customization
- **Rich Reporting** - HTML, CSV, and PDF reports with detailed results
- **Windows Authentication** - Secure, credential-free database access
- **Read-Only Operations** - All queries are SELECT-only for safety
- **Comprehensive Logging** - Detailed execution logs with timestamps

## 🚀 Quick Start

### Prerequisites

- Python 3.9+
- Microsoft ODBC Driver 17+ for SQL Server
- Windows environment with domain authentication (or Linux with Kerberos)
- Read access to SQL Server databases

### Installation

```bash
# Clone the repository
git clone <repository-url>
cd data_regression_testing

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the framework
pip install -e .

# Verify the installation
drt --version
```

### Basic Usage

```bash
# 1. Discover tables from the baseline database
drt discover --server <YOUR_SERVER> --database <YOUR_BASELINE_DB> --output config.yaml

# 2. Edit config.yaml to add the target database connection

# 3. Validate the configuration
drt validate --config config.yaml

# 4. Run the comparison
drt compare --config config.yaml

# 5. (Optional) Investigate regression issues
drt investigate --analysis-dir analysis/output_<TIMESTAMP>/ --config config.yaml
```

## 📦 Platform-Specific Installation

### Windows

1. Install Python 3.9+ from https://www.python.org/downloads/
2. The ODBC driver is usually pre-installed on Windows
3. Install the framework:

```cmd
python -m venv venv
venv\Scripts\activate
pip install -e .
```

### Linux (Debian/Ubuntu)

```bash
# Install the ODBC driver
curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | sudo gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg
curl https://packages.microsoft.com/config/debian/12/prod.list | sudo tee /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18 unixodbc-dev

# Install Kerberos for Windows Authentication
sudo apt-get install -y krb5-user

# Configure /etc/krb5.conf with your domain settings,
# then obtain a ticket: kinit username@YOUR_DOMAIN.COM

# Install the framework
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

## 📋 Commands

### Discovery

Automatically scan databases and generate configuration files.

```bash
drt discover --server <YOUR_SERVER> --database <YOUR_DATABASE> [OPTIONS]
```

**Options:**

- `--server TEXT` - SQL Server hostname (required)
- `--database TEXT` - Database name (required)
- `--output, -o TEXT` - Output file (default: config_discovered.yaml)
- `--schemas TEXT` - Specific schemas to include
- `--verbose, -v` - Enable verbose output

### Validate

Validate configuration file syntax and database connectivity.

```bash
drt validate --config <CONFIG_FILE> [OPTIONS]
```

**Options:**

- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output

### Compare

Execute the data comparison between baseline and target databases.

```bash
drt compare --config <CONFIG_FILE> [OPTIONS]
```

**Options:**

- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be compared without executing

### Investigate

Execute diagnostic queries from regression analysis.

```bash
drt investigate --analysis-dir <ANALYSIS_DIR> --config <CONFIG_FILE> [OPTIONS]
```

**Options:**

- `--analysis-dir, -a PATH` - Analysis output directory containing `*_investigate.sql` files (required)
- `--config, -c PATH` - Configuration file (required)
- `--output-dir, -o PATH` - Output directory for reports (default: ./investigation_reports)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be executed without running

**Example:**

```bash
drt investigate -a analysis/output_20251209_184032/ -c config.yaml
drt investigate -a analysis/output_20251209_184032/ -c config.yaml -o ./my_reports
```

**What it does:**

- Discovers all `*_investigate.sql` files in the analysis directory
- Parses the SQL files (handles markdown fences and multiple queries per file)
- Executes each query on both the baseline and target databases
- Handles errors gracefully (continues on failures)
- Generates HTML and CSV reports with side-by-side comparisons
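
The real parser lives in `src/drt/services/sql_parser.py`; the following is only a hypothetical sketch of the splitting step it describes (the fence-stripping regex and semicolon-based splitting are assumptions, not the framework's actual logic):

```python
import re

def split_investigation_sql(text: str) -> list[str]:
    """Split an *_investigate.sql file into individual queries.

    Strips markdown code fences (analysis output may embed SQL in
    ```sql blocks) and treats each semicolon-terminated statement
    as a separate query.
    """
    # Drop markdown fence lines such as ```sql or ```
    text = re.sub(r"^```\w*\s*$", "", text, flags=re.MULTILINE)
    # Split on semicolons, keeping only non-empty statements
    queries = [q.strip() for q in text.split(";")]
    return [q for q in queries if q]
```

A file containing `` ```sql `` fences around two statements would yield two clean query strings ready to run against both databases.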

## ⚙️ Configuration

### Database Connections

```yaml
database_pairs:
  - name: "DWH_Comparison"
    enabled: true
    baseline:
      server: "<YOUR_SERVER>"
      database: "<YOUR_BASELINE_DB>"
      timeout:
        connection: 30
        query: 300
    target:
      server: "<YOUR_SERVER>"
      database: "<YOUR_TARGET_DB>"
```
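
The framework talks to SQL Server over ODBC with Windows Authentication, so no credentials appear in the config. As an illustration of what the settings above translate to, here is a hedged sketch of a trusted-connection ODBC string (the use of `pyodbc` and the exact keyword set are assumptions; the document only guarantees ODBC Driver 17+):

```python
def build_connection_string(server: str, database: str, timeout: int = 30) -> str:
    """ODBC connection string using Windows Authentication.

    Mirrors the baseline/target settings in config.yaml; note the
    absence of any stored username or password.
    """
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server};"
        f"DATABASE={database};"
        "Trusted_Connection=yes;"
        f"Connection Timeout={timeout};"
    )
```

With `pyodbc` this string would be passed to `pyodbc.connect(...)`; the `query` timeout from the config would map to the connection's query timeout attribute.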

### Comparison Settings

```yaml
comparison:
  mode: "health_check"  # or "full"
  row_count:
    enabled: true
    tolerance_percent: 0.0
  schema:
    enabled: true
    checks:
      column_names: true
      data_types: true
  aggregates:
    enabled: true
    tolerance_percent: 0.01
```

### Table Configuration

```yaml
tables:
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Amount"
      - "Quantity"
```
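
For each enabled table, the aggregate check effectively issues one `SUM` per configured column on both databases and compares the results within the tolerance. A minimal sketch of how such a read-only query could be assembled from the entry above (the framework's actual SQL generation may differ):

```python
def build_aggregate_query(schema: str, table: str, columns: list[str]) -> str:
    """Build the read-only SUM query implied by aggregate_columns.

    Identifiers are bracket-quoted; this sketch assumes they come
    from a trusted config file, not from user input.
    """
    sums = ", ".join(f"SUM([{c}]) AS [{c}_sum]" for c in columns)
    return f"SELECT {sums} FROM [{schema}].[{table}]"
```

For the `dbo.FactTable1` entry this yields a single SELECT summing `Amount` and `Quantity`, run once per database pair.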

### Output Directories

```yaml
reporting:
  output_dir: "./reports"
  investigation_dir: "./investigation_reports"

logging:
  output_dir: "./logs"

discovery:
  analysis_directory: "./analysis"
```

**Benefits:**

- Centralized storage of all output files
- Easy cleanup and management of generated files
- Configuration flexibility via YAML
- Backward compatibility with CLI overrides
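
"CLI overrides" means a flag such as `--output-dir` takes precedence over the YAML value, which in turn takes precedence over a built-in default. A hypothetical resolution helper illustrating that precedence (names are illustrative, not the framework's API):

```python
from pathlib import Path
from typing import Optional

def resolve_output_dir(
    cli_value: Optional[str],
    config_value: Optional[str],
    default: str = "./reports",
) -> Path:
    """Precedence: CLI flag > config.yaml > built-in default."""
    return Path(cli_value or config_value or default)
```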

## 📊 Reports

### Comparison Reports

The framework generates comprehensive reports in multiple formats:

- **HTML Report** - Visual summary with color-coded results and detailed breakdowns
- **CSV Report** - Machine-readable format for Excel or databases
- **PDF Report** - Professionally formatted output (requires weasyprint)

Reports are saved to `./reports/` with timestamps.

### Investigation Reports

- **HTML Report** - Interactive report with collapsible query results and side-by-side baseline vs. target comparison
- **CSV Report** - Flattened structure with one row per query execution

Investigation reports are saved to `./investigation_reports/` with timestamps.

## 🔄 Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success - all comparisons passed |
| 1 | Failures detected - one or more FAIL results |
| 2 | Execution error - configuration or connection issues |
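
These codes make the tool straightforward to wire into CI. A small sketch mirroring the table (the meanings come from the table itself; invoking `drt` via `subprocess` is shown only in the comment):

```python
EXIT_MEANINGS = {
    0: "Success - all comparisons passed",
    1: "Failures detected - one or more FAIL results",
    2: "Execution error - configuration or connection issues",
}

def describe_exit_code(code: int) -> str:
    """Translate a drt exit code into the meaning documented above."""
    return EXIT_MEANINGS.get(code, f"Unknown exit code {code}")

# In a CI step you might run, e.g.:
#   rc = subprocess.run(["drt", "compare", "--config", "config.yaml"]).returncode
#   print(describe_exit_code(rc)); sys.exit(rc)
```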

## 🧪 Testing

### Docker Test Environment

```bash
# Start the test SQL Server containers
bash test_data/setup_test_environment.sh

# Test discovery
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml

# Test comparison
drt compare --config config.test.yaml

# Cleanup
docker-compose -f docker-compose.test.yml down -v
```

### Manual Testing

```bash
# Connect to the test databases (use the SA_PASSWORD environment variable)
docker exec -it drt-sqlserver-baseline /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD"

# Run queries to verify data
SELECT COUNT(*) FROM dbo.FactTable1;
```

## 🚢 Deployment

### Scheduled Execution

**Windows Task Scheduler:**

```batch
@echo off
cd C:\path\to\framework
call venv\Scripts\activate.bat
drt compare --config config.yaml
if %ERRORLEVEL% NEQ 0 (
    echo Test failed with exit code %ERRORLEVEL%
    exit /b %ERRORLEVEL%
)
```

**Linux Cron:**

```bash
# Run daily at 2 AM
0 2 * * * /path/to/venv/bin/drt compare --config /path/to/config.yaml >> /path/to/logs/cron.log 2>&1
```

### Monitoring

```bash
# Watch logs
tail -f logs/drt_*.log

# Search for failures
grep -i "FAIL\|ERROR" logs/drt_*.log
```

## 🏗️ Architecture

```
src/drt/
├── cli/                  # Command-line interface
│   └── commands/         # CLI commands (compare, discover, validate, investigate)
├── config/               # Configuration management
├── database/             # Database connectivity (READ ONLY)
├── models/               # Data models
├── reporting/            # Report generators
├── services/             # Business logic
│   ├── checkers/         # Comparison checkers
│   ├── investigation.py  # Investigation service
│   └── sql_parser.py     # SQL file parser
└── utils/                # Utilities
```

## 🔒 Security

- **Windows Authentication Only** - No stored credentials
- **Read-Only Operations** - All queries are SELECT-only
- **Minimal Permissions** - Only requires the db_datareader role
- **No Data Logging** - Sensitive data is never logged

## 🔧 Troubleshooting

### Connection Failed

```bash
# Test connectivity
drt discover --server <YOUR_SERVER> --database master

# Verify the ODBC driver
odbcinst -q -d

# Check permissions:
# the user needs the db_datareader role on the target databases
```

### Query Timeout

Increase the timeout in the configuration:

```yaml
baseline:
  timeout:
    query: 600  # 10 minutes
```

### Linux Kerberos Issues

```bash
# Check for a ticket
klist

# Renew if expired
kinit username@YOUR_DOMAIN.COM

# Verify the new ticket is valid
klist
```

## ⚡ Performance

### Diagnostic Logging

Enable verbose mode to see detailed timing:

```bash
drt compare --config config.yaml --verbose
```

This shows:

- Per-check timing (existence, row count, schema, aggregates)
- Query execution times
- Parallelization opportunities

### Optimization Tips

- Disable aggregate checks for surrogate keys
- Increase query timeouts for large tables
- Use table filtering to focus on critical tables
- Consider parallel execution for multiple database pairs
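
The last tip can be sketched with a thread pool: comparisons are I/O-bound (waiting on SQL queries), so one worker per database pair is a natural fit. The function below is illustrative only; the framework does not expose this API:

```python
from concurrent.futures import ThreadPoolExecutor

def compare_pairs_in_parallel(pair_names, compare_fn, max_workers=4):
    """Run one comparison per database pair concurrently.

    compare_fn takes a pair name and returns its result; since each
    comparison mostly waits on SQL queries, threads parallelize well.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(pair_names, pool.map(compare_fn, pair_names)))
```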

## 👨‍💻 Development

### Getting Started

1. Fork the repository
2. Clone your fork locally:

   ```bash
   git clone https://git.informatica-polet.com/drt/drt.git
   cd data_regression_testing
   ```

3. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

4. Install dependencies:

   ```bash
   pip install -r requirements.txt
   pip install -e .
   ```

5. Install development dependencies:

   ```bash
   pip install pytest pytest-cov black flake8 mypy
   ```

### Development Workflow

#### 1. Create a Branch

```bash
git checkout -b feature/your-feature-name
# or
git checkout -b bugfix/issue-description
```

#### 2. Make Your Changes

- Write clean, readable code
- Follow the existing code style
- Add docstrings to all functions and classes
- Update documentation as needed

#### 3. Run Tests

```bash
# All tests
pytest

# With coverage
pytest --cov=src/drt --cov-report=html

# Specific test file
pytest tests/test_models.py
```

#### 4. Code Quality Checks

```bash
# Format code with black
black src/ tests/

# Check code style with flake8
flake8 src/ tests/

# Type checking with mypy
mypy src/
```

#### 5. Commit Your Changes

Write clear, descriptive commit messages:

```bash
git add .
git commit -m "Add feature: description of your changes"
```

**Commit message guidelines:**

- Use the present tense ("Add feature", not "Added feature")
- Use the imperative mood ("Move cursor to...", not "Moves cursor to...")
- Limit the first line to 72 characters
- Reference issues and pull requests when relevant

#### 6. Push and Create a Pull Request

```bash
git push origin feature/your-feature-name
```

Create a pull request with:

- A clear title and description
- References to related issues
- Screenshots (if applicable)
- Test results

### Code Style Guidelines

#### Python Style

- Follow the PEP 8 style guide
- Use type hints for function parameters and return values
- Maximum line length: 100 characters
- Use meaningful variable and function names

**Example:**

```python
def calculate_row_count_difference(
    baseline_count: int,
    target_count: int,
    tolerance_percent: float,
) -> tuple[bool, float]:
    """
    Calculate whether the row count difference is within tolerance.

    Args:
        baseline_count: Row count from the baseline database
        target_count: Row count from the target database
        tolerance_percent: Acceptable difference percentage

    Returns:
        Tuple of (is_within_tolerance, actual_difference_percent)
    """
    if baseline_count == 0:
        difference_percent = 0.0 if target_count == 0 else 100.0
    else:
        difference_percent = abs(target_count - baseline_count) / baseline_count * 100
    return difference_percent <= tolerance_percent, difference_percent
```

#### Documentation

- Add docstrings to all public functions, classes, and modules
- Use Google-style docstrings
- Include examples in docstrings when helpful
- Update README.md for user-facing changes

#### Testing

- Write unit tests for all new functionality
- Aim for >80% code coverage
- Use descriptive test names
- Follow the AAA pattern (Arrange, Act, Assert)

**Example:**

```python
def test_row_count_checker_exact_match():
    """Test the row count checker with an exact match."""
    # Arrange
    checker = RowCountChecker(tolerance_percent=0.0)

    # Act
    result = checker.check(baseline_count=1000, target_count=1000)

    # Assert
    assert result.status == Status.PASS
    assert result.baseline_value == 1000
    assert result.target_value == 1000
```

### Adding New Features

#### New Checker Type

To add a new comparison checker:

1. Create the new checker in `src/drt/services/checkers/`
2. Inherit from `BaseChecker`
3. Implement the `check()` method
4. Add a new `CheckType` enum value
5. Register it in `ComparisonService`
6. Add tests in `tests/test_checkers.py`
7. Update documentation
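
Steps 1-3 might look like the sketch below. The `BaseChecker` and `CheckResult` stand-ins are simplified placeholders with assumed signatures, and the `NullCountChecker` is hypothetical, not part of the framework:

```python
from dataclasses import dataclass

# Placeholder stand-ins: the real BaseChecker and result model live in
# src/drt/services/checkers/ and src/drt/models/ (signatures assumed).
@dataclass
class CheckResult:
    status: str          # "PASS" or "FAIL"
    baseline_value: object
    target_value: object

class BaseChecker:
    def check(self, baseline_value, target_value) -> CheckResult:
        raise NotImplementedError

class NullCountChecker(BaseChecker):
    """Hypothetical checker: compare NULL counts for a column."""

    def check(self, baseline_value: int, target_value: int) -> CheckResult:
        status = "PASS" if baseline_value == target_value else "FAIL"
        return CheckResult(status, baseline_value, target_value)
```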

#### New Report Format

To add a new report format:

1. Create the new reporter in `src/drt/reporting/`
2. Implement the `generate()` method
3. Add a format option to the configuration
4. Update `ReportGenerator` to use the new format
5. Add tests
6. Update documentation

### Testing

#### Unit Tests

Run the test suite:

```bash
# All tests
pytest

# With coverage report
pytest --cov=src/drt --cov-report=html

# Specific test file
pytest tests/test_models.py -v

# Specific test function
pytest tests/test_models.py::test_status_enum -v
```

#### Integration Tests

Use the Docker test environment:

```bash
# Start the test databases
bash test_data/setup_test_environment.sh

# Run integration tests
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
drt compare --config config.test.yaml

# Cleanup
docker-compose -f docker-compose.test.yml down -v
```

#### Manual Testing

```bash
# Test against real databases (requires access)
drt discover --server <YOUR_SERVER> --database <YOUR_DB> --output manual_test.yaml
drt validate --config manual_test.yaml
drt compare --config manual_test.yaml --dry-run
```

### Reporting Issues

When reporting issues, please include:

- A clear description of the problem
- Steps to reproduce
- Expected vs. actual behavior
- Environment details (OS, Python version, ODBC driver version)
- Relevant logs or error messages
- The configuration file (sanitized - remove server names/credentials)

**Example:**

````markdown
**Description:** Row count comparison fails with timeout error

**Steps to Reproduce:**
1. Configure a comparison for a large table (>1M rows)
2. Run `drt compare --config config.yaml`
3. Observe the timeout error

**Expected:** Comparison completes successfully
**Actual:** Query timeout after 300 seconds

**Environment:**
- OS: Windows 10
- Python: 3.9.7
- ODBC Driver: 17 for SQL Server

**Logs:**
```
ERROR: Query timeout on table dbo.FactTable1
```
````

### Feature Requests

For feature requests, please:

- Check whether the feature already exists or is planned
- Describe the use case clearly
- Explain why it would be valuable
- Provide examples if possible

### Code Review Process

All contributions go through code review:

1. Automated checks must pass (tests, linting)
2. At least one maintainer approval is required
3. Address review feedback promptly
4. Keep pull requests focused and reasonably sized

### Release Process

Releases follow semantic versioning (MAJOR.MINOR.PATCH):

- **MAJOR** - Breaking changes
- **MINOR** - New features (backward compatible)
- **PATCH** - Bug fixes (backward compatible)

### Development Tips

#### Debugging

```bash
# Enable verbose logging
drt compare --config config.yaml --verbose

# Use dry-run to test without execution
drt compare --config config.yaml --dry-run

# Check configuration validity
drt validate --config config.yaml
```

#### Performance Profiling

```bash
# Enable diagnostic logging
drt compare --config config.yaml --verbose

# Look for timing information in the logs
grep "execution time" logs/drt_*.log
```

#### Docker Development

```bash
# Build and test in Docker
docker build -t drt:dev .
docker run -v $(pwd)/config.yaml:/app/config.yaml drt:dev compare --config /app/config.yaml
```

## 📝 License

MIT License - see the LICENSE file for details

## 📞 Support

For issues and questions:

- Issue tracker: https://git.informatica-polet.com/drt/drt/issues
- Check the logs in `./logs/`
- Review the configuration with `drt validate`
- Test connectivity with `drt discover`