# Data Regression Testing Framework
A comprehensive framework for validating data integrity during code migration and system updates by comparing data outputs between Baseline (Production) and Target (Test) SQL Server databases.
## ✨ Features
- **Automated Discovery** - Scan databases and auto-generate configuration files
- **Multiple Comparison Types** - Row counts, schema validation, aggregate sums
- **Investigation Queries** - Execute diagnostic SQL queries from regression analysis
- **Flexible Configuration** - YAML-based setup with extensive customization
- **Rich Reporting** - HTML, CSV, and PDF reports with detailed results
- **Windows Authentication** - Secure, credential-free database access
- **Read-Only Operations** - All queries are SELECT-only for safety
- **Comprehensive Logging** - Detailed execution logs with timestamps
## 🚀 Quick Start
### Prerequisites
- Python 3.9+
- Microsoft ODBC Driver 17+ for SQL Server
- Windows environment with domain authentication (or Linux with Kerberos)
- Read access to SQL Server databases
### Installation
```bash
# Clone the repository
git clone <repository-url>
cd data_regression_testing
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the framework
pip install -e .
# Verify installation
drt --version
```
### Basic Usage
```bash
# 1. Discover tables from baseline database
drt discover --server <YOUR_SERVER> --database <YOUR_BASELINE_DB> --output config.yaml
# 2. Edit config.yaml to add target database connection
# 3. Validate configuration
drt validate --config config.yaml
# 4. Run comparison
drt compare --config config.yaml
# 5. (Optional) Investigate regression issues
drt investigate --analysis-dir analysis/output_<TIMESTAMP>/ --config config.yaml
```
## 📦 Platform-Specific Installation
### Windows
1. Install Python 3.9+ from https://www.python.org/downloads/
2. The Microsoft ODBC Driver for SQL Server is usually already installed on Windows; if not, download it from Microsoft
3. Install the framework:
```cmd
python -m venv venv
venv\Scripts\activate
pip install -e .
```
### Linux (Debian/Ubuntu)
```bash
# Install ODBC Driver
curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | sudo gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg
curl https://packages.microsoft.com/config/debian/12/prod.list | sudo tee /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18 unixodbc-dev
# Install Kerberos for Windows Authentication
sudo apt-get install -y krb5-user
# Configure /etc/krb5.conf with your domain settings (a minimal example follows this block)
# Then obtain a ticket: kinit username@YOUR_DOMAIN.COM
# Install framework
python3 -m venv venv
source venv/bin/activate
pip install -e .
```
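A minimal `/etc/krb5.conf` sketch for the Kerberos step above. The realm and KDC host names are placeholders, not values this framework requires; replace them with your own domain settings:
```ini
# Illustrative only - substitute your actual realm and domain controller.
[libdefaults]
    default_realm = YOUR_DOMAIN.COM
    dns_lookup_kdc = true

[realms]
    YOUR_DOMAIN.COM = {
        kdc = dc01.your_domain.com
        admin_server = dc01.your_domain.com
    }
```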
## 📋 Commands
### Discovery
Automatically scan databases and generate configuration files.
```bash
drt discover --server <YOUR_SERVER> --database <YOUR_DATABASE> [OPTIONS]
```
**Options:**
- `--server TEXT` - SQL Server hostname (required)
- `--database TEXT` - Database name (required)
- `--output, -o TEXT` - Output file (default: config_discovered.yaml)
- `--schemas TEXT` - Specific schemas to include
- `--verbose, -v` - Enable verbose output
### Validate
Validate configuration file syntax and database connectivity.
```bash
drt validate --config <CONFIG_FILE> [OPTIONS]
```
**Options:**
- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
### Compare
Execute data comparison between baseline and target databases.
```bash
drt compare --config <CONFIG_FILE> [OPTIONS]
```
**Options:**
- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be compared without executing
### Investigate
Execute diagnostic queries from regression analysis.
```bash
drt investigate --analysis-dir <ANALYSIS_DIR> --config <CONFIG_FILE> [OPTIONS]
```
**Options:**
- `--analysis-dir, -a PATH` - Analysis output directory containing `*_investigate.sql` files (required)
- `--config, -c PATH` - Configuration file (required)
- `--output-dir, -o PATH` - Output directory for reports (default: ./investigation_reports)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be executed without running
**Example:**
```bash
drt investigate -a analysis/output_20251209_184032/ -c config.yaml
drt investigate -a analysis/output_20251209_184032/ -c config.yaml -o ./my_reports
```
**What it does** (a sample investigation file is sketched after this list):
- Discovers all `*_investigate.sql` files in the analysis directory
- Parses SQL files (handles markdown, multiple queries per file)
- Executes queries on both baseline and target databases
- Handles errors gracefully (continues on failures)
- Generates HTML and CSV reports with side-by-side comparisons
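For illustration, a hypothetical `dbo_FactTable1_investigate.sql` file of the kind this command consumes; the file name, the `LoadDate` column, and the queries are made-up examples rather than guaranteed framework output:
```sql
-- Query 1: total row count for the flagged table
SELECT COUNT(*) AS row_count
FROM dbo.FactTable1;

-- Query 2: rows loaded in the last 7 days (hypothetical LoadDate column)
SELECT COUNT(*) AS recent_rows
FROM dbo.FactTable1
WHERE LoadDate >= DATEADD(DAY, -7, GETDATE());
```
Each query is executed against both the baseline and target databases, and the results appear side by side in the generated report.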
## ⚙️ Configuration
### Database Connections
```yaml
database_pairs:
  - name: "DWH_Comparison"
    enabled: true
    baseline:
      server: "<YOUR_SERVER>"
      database: "<YOUR_BASELINE_DB>"
      timeout:
        connection: 30
        query: 300
    target:
      server: "<YOUR_SERVER>"
      database: "<YOUR_TARGET_DB>"
```
### Comparison Settings
```yaml
comparison:
  mode: "health_check"  # or "full"
  row_count:
    enabled: true
    tolerance_percent: 0.0
  schema:
    enabled: true
    checks:
      column_names: true
      data_types: true
  aggregates:
    enabled: true
    tolerance_percent: 0.01
```
### Table Configuration
```yaml
tables:
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Amount"
      - "Quantity"
```
### Output Directories
```yaml
reporting:
  output_dir: "./reports"
  investigation_dir: "./investigation_reports"
logging:
  output_dir: "./logs"
discovery:
  analysis_directory: "./analysis"
```
**Benefits:**
- Centralized storage of all output files
- Easy cleanup and management of generated files
- Configuration flexibility via YAML
- Backward compatibility with CLI overrides
## 📊 Reports
### Comparison Reports
The framework generates comprehensive reports in multiple formats:
- **HTML Report** - Visual summary with color-coded results and detailed breakdowns
- **CSV Report** - Machine-readable format for Excel or databases
- **PDF Report** - Professional formatted output (requires weasyprint)
Reports are saved to `./reports/` with timestamps.
### Investigation Reports
- **HTML Report** - Interactive report with collapsible query results, side-by-side baseline vs target comparison
- **CSV Report** - Flattened structure with one row per query execution
Investigation reports are saved to `./investigation_reports/` with timestamps.
## 🔄 Exit Codes
| Code | Meaning |
|------|---------|
| 0 | Success - all comparisons passed |
| 1 | Failures detected - one or more FAIL results |
| 2 | Execution error - configuration or connection issues |
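For example, a small wrapper script (a sketch assuming a POSIX shell and the default `./reports/` directory) can branch on the exit status in CI or scheduled jobs:
```bash
#!/usr/bin/env bash
# Illustrative wrapper: map drt exit codes to human-readable outcomes.
drt compare --config config.yaml
status=$?

case "$status" in
  0) echo "All comparisons passed" ;;
  1) echo "Regressions detected - review ./reports/ for details" ;;
  2) echo "Execution error - check configuration and connectivity" ;;
  *) echo "Unexpected exit code: $status" ;;
esac

exit "$status"
```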
## 🧪 Testing
### Docker Test Environment
```bash
# Start test SQL Server containers
bash test_data/setup_test_environment.sh
# Test discovery
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
# Test comparison
drt compare --config config.test.yaml
# Cleanup
docker-compose -f docker-compose.test.yml down -v
```
### Manual Testing
```bash
# Connect to test databases (use SA_PASSWORD environment variable)
docker exec -it drt-sqlserver-baseline /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD"
# Run queries to verify data
SELECT COUNT(*) FROM dbo.FactTable1;
```
## 🚢 Deployment
### Scheduled Execution
**Windows Task Scheduler:**
```batch
@echo off
cd C:\path\to\framework
call venv\Scripts\activate.bat
drt compare --config config.yaml
if %ERRORLEVEL% NEQ 0 (
    echo Test failed with exit code %ERRORLEVEL%
    exit /b %ERRORLEVEL%
)
```
**Linux Cron:**
```bash
# Run daily at 2 AM
0 2 * * * /path/to/venv/bin/drt compare --config /path/to/config.yaml >> /path/to/logs/cron.log 2>&1
```
### Monitoring
```bash
# Watch logs
tail -f logs/drt_*.log
# Search for failures
grep -i "FAIL\|ERROR" logs/drt_*.log
```
## 🏗️ Architecture
```
src/drt/
├── cli/                  # Command-line interface
│   └── commands/         # CLI commands (compare, discover, validate, investigate)
├── config/               # Configuration management
├── database/             # Database connectivity (READ ONLY)
├── models/               # Data models
├── reporting/            # Report generators
├── services/             # Business logic
│   ├── checkers/         # Comparison checkers
│   ├── investigation.py  # Investigation service
│   └── sql_parser.py     # SQL file parser
└── utils/                # Utilities
```
## 🔒 Security
- **Windows Authentication Only** - No stored credentials
- **Read-Only Operations** - All queries are SELECT-only
- **Minimal Permissions** - Only requires db_datareader role
- **No Data Logging** - Sensitive data never logged
## 🔧 Troubleshooting
### Connection Failed
```bash
# Test connectivity
drt discover --server <YOUR_SERVER> --database master
# Verify ODBC driver
odbcinst -q -d
# Check permissions
# User needs db_datareader role on target databases
```
### Query Timeout
Increase timeout in configuration:
```yaml
baseline:
  timeout:
    query: 600  # 10 minutes
```
### Linux Kerberos Issues
```bash
# Check ticket
klist
# Renew if expired
kinit username@YOUR_DOMAIN.COM
# Verify ticket is valid
klist
```
## ⚡ Performance
### Diagnostic Logging
Enable verbose mode to see detailed timing:
```bash
drt compare --config config.yaml --verbose
```
This shows:
- Per-check timing (existence, row count, schema, aggregates)
- Query execution times
- Parallelization opportunities
### Optimization Tips
- Disable aggregate checks for surrogate keys
- Increase query timeouts for large tables
- Use table filtering to focus on critical tables (see the config sketch after this list)
- Consider parallel execution for multiple database pairs
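A hedged config fragment illustrating the filtering and aggregate tips, reusing the table options from the Configuration section; `StagingTable1` and the omitted key column are placeholder names:
```yaml
tables:
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    aggregate_columns:
      - "Amount"        # business measure worth summing
      # surrogate key column deliberately left out of aggregates
  - schema: "dbo"
    name: "StagingTable1"
    enabled: false      # skip non-critical tables entirely
```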
## 👨‍💻 Development
### Getting Started
1. Fork the repository on the Git server
2. Clone your fork locally:
```bash
git clone https://git.informatica-polet.com/drt/drt.git
cd data_regression_testing
```
3. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
4. Install dependencies:
```bash
pip install -r requirements.txt
pip install -e .
```
5. Install development dependencies:
```bash
pip install pytest pytest-cov black flake8 mypy
```
### Development Workflow
#### 1. Create a Branch
```bash
git checkout -b feature/your-feature-name
# or
git checkout -b bugfix/issue-description
```
#### 2. Make Your Changes
- Write clean, readable code
- Follow the existing code style
- Add docstrings to all functions and classes
- Update documentation as needed
#### 3. Run Tests
```bash
# All tests
pytest
# With coverage
pytest --cov=src/drt --cov-report=html
# Specific test file
pytest tests/test_models.py
```
#### 4. Code Quality Checks
```bash
# Format code with black
black src/ tests/
# Check code style with flake8
flake8 src/ tests/
# Type checking with mypy
mypy src/
```
#### 5. Commit Your Changes
Write clear, descriptive commit messages:
```bash
git add .
git commit -m "Add feature: description of your changes"
```
**Commit message guidelines:**
- Use present tense ("Add feature" not "Added feature")
- Use imperative mood ("Move cursor to..." not "Moves cursor to...")
- Limit first line to 72 characters
- Reference issues and pull requests when relevant
#### 6. Push and Create Pull Request
```bash
git push origin feature/your-feature-name
```
Create a pull request with:
- Clear title and description
- Reference to related issues
- Screenshots (if applicable)
- Test results
### Code Style Guidelines
#### Python Style
- Follow PEP 8 style guide
- Use type hints for function parameters and return values
- Maximum line length: 100 characters
- Use meaningful variable and function names
**Example:**
```python
def calculate_row_count_difference(
    baseline_count: int,
    target_count: int,
    tolerance_percent: float,
) -> tuple[bool, float]:
    """
    Calculate if row count difference is within tolerance.

    Args:
        baseline_count: Row count from baseline database
        target_count: Row count from target database
        tolerance_percent: Acceptable difference percentage

    Returns:
        Tuple of (is_within_tolerance, actual_difference_percent)
    """
    # Implementation here
    pass
```
#### Documentation
- Add docstrings to all public functions, classes, and modules
- Use Google-style docstrings
- Include examples in docstrings when helpful
- Update README.md for user-facing changes
#### Testing
- Write unit tests for all new functionality
- Aim for >80% code coverage
- Use descriptive test names
- Follow AAA pattern (Arrange, Act, Assert)
**Example:**
```python
def test_row_count_checker_exact_match():
    """Test row count checker with exact match"""
    # Arrange
    checker = RowCountChecker(tolerance_percent=0.0)

    # Act
    result = checker.check(baseline_count=1000, target_count=1000)

    # Assert
    assert result.status == Status.PASS
    assert result.baseline_value == 1000
    assert result.target_value == 1000
```
### Adding New Features
#### New Checker Type
To add a new comparison checker (a hedged sketch follows these steps):
1. Create new checker in `src/drt/services/checkers/`
2. Inherit from `BaseChecker`
3. Implement `check()` method
4. Add new `CheckType` enum value
5. Register in `ComparisonService`
6. Add tests in `tests/test_checkers.py`
7. Update documentation
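A minimal sketch of such a checker, assuming `BaseChecker` exposes a `check()` hook; the import path, method signature, and result shape here are illustrative guesses and must be adapted to the actual base class in `src/drt/services/checkers/`:
```python
# Hypothetical sketch - not the framework's actual interface.
from drt.services.checkers.base import BaseChecker  # assumed module path


class NullCountChecker(BaseChecker):
    """Compares NULL counts for a given column between baseline and target."""

    def check(self, baseline_value: int, target_value: int) -> tuple[bool, str]:
        # A real implementation would build the framework's result object;
        # a plain (passed, details) tuple keeps this sketch self-contained.
        passed = baseline_value == target_value
        details = f"baseline={baseline_value}, target={target_value}"
        return passed, details
```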
#### New Report Format
To add a new report format (a hedged sketch follows these steps):
1. Create new reporter in `src/drt/reporting/`
2. Implement `generate()` method
3. Add format option to configuration
4. Update `ReportGenerator` to use new format
5. Add tests
6. Update documentation
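A minimal sketch of a reporter with a `generate()` method, assuming reporters receive a list of result dictionaries and an output path; the actual argument types used by the existing reporters in `src/drt/reporting/` may differ:
```python
# Hypothetical sketch - adapt to the existing reporter interface.
import json
from pathlib import Path


class JsonReporter:
    """Writes comparison results as a JSON document."""

    def generate(self, results: list[dict], output_path: Path) -> Path:
        # Serialize whatever result records the comparison service produced.
        output_path.write_text(json.dumps(results, indent=2, default=str))
        return output_path
```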
### Testing
#### Unit Tests
Run the test suite:
```bash
# All tests
pytest
# With coverage report
pytest --cov=src/drt --cov-report=html
# Specific test file
pytest tests/test_models.py -v
# Specific test function
pytest tests/test_models.py::test_status_enum -v
```
#### Integration Tests
Use the Docker test environment:
```bash
# Start test databases
bash test_data/setup_test_environment.sh
# Run integration tests
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
drt compare --config config.test.yaml
# Cleanup
docker-compose -f docker-compose.test.yml down -v
```
#### Manual Testing
```bash
# Test against real databases (requires access)
drt discover --server <YOUR_SERVER> --database <YOUR_DB> --output manual_test.yaml
drt validate --config manual_test.yaml
drt compare --config manual_test.yaml --dry-run
```
### Reporting Issues
When reporting issues, please include:
- Clear description of the problem
- Steps to reproduce
- Expected vs actual behavior
- Environment details (OS, Python version, ODBC driver version)
- Relevant logs or error messages
- Configuration file (sanitized - remove server names/credentials)
**Example:**
````markdown
**Description:** Row count comparison fails with timeout error

**Steps to Reproduce:**
1. Configure comparison for a large table (>1M rows)
2. Run `drt compare --config config.yaml`
3. Observe timeout error

**Expected:** Comparison completes successfully
**Actual:** Query timeout after 300 seconds

**Environment:**
- OS: Windows 10
- Python: 3.9.7
- ODBC Driver: 17 for SQL Server

**Logs:**
```
ERROR: Query timeout on table dbo.FactTable1
```
````
### Feature Requests
For feature requests, please:
- Check if feature already exists or is planned
- Describe the use case clearly
- Explain why it would be valuable
- Provide examples if possible
### Code Review Process
All contributions go through code review:
1. Automated checks must pass (tests, linting)
2. At least one maintainer approval required
3. Address review feedback promptly
4. Keep pull requests focused and reasonably sized
### Release Process
Releases follow semantic versioning (MAJOR.MINOR.PATCH):
- **MAJOR** - Breaking changes
- **MINOR** - New features (backward compatible)
- **PATCH** - Bug fixes (backward compatible)
### Development Tips
#### Debugging
```bash
# Enable verbose logging
drt compare --config config.yaml --verbose
# Use dry-run to test without execution
drt compare --config config.yaml --dry-run
# Check configuration validity
drt validate --config config.yaml
```
#### Performance Profiling
```bash
# Enable diagnostic logging
drt compare --config config.yaml --verbose
# Look for timing information in logs
grep "execution time" logs/drt_*.log
```
#### Docker Development
```bash
# Build and test in Docker
docker build -t drt:dev .
docker run -v $(pwd)/config.yaml:/app/config.yaml drt:dev compare --config /app/config.yaml
```
## 📝 License
MIT License - see LICENSE file for details
## 📞 Support
For issues and questions:
- Issue tracker: https://git.informatica-polet.com/drt/drt/issues
- Check logs in `./logs/`
- Review configuration with `drt validate`
- Test connectivity with `drt discover`