# Data Regression Testing Framework

A comprehensive framework for validating data integrity during code migrations and system updates by comparing data outputs between Baseline (Production) and Target (Test) SQL Server databases.

## ✨ Features

- **Automated Discovery** - Scan databases and auto-generate configuration files
- **Multiple Comparison Types** - Row counts, schema validation, aggregate sums
- **Investigation Queries** - Execute diagnostic SQL queries from regression analysis
- **Flexible Configuration** - YAML-based setup with extensive customization
- **Rich Reporting** - HTML, CSV, and PDF reports with detailed results
- **Windows Authentication** - Secure, credential-free database access
- **Read-Only Operations** - All queries are SELECT-only for safety
- **Comprehensive Logging** - Detailed execution logs with timestamps

## 🚀 Quick Start

### Prerequisites

- Python 3.9+
- Microsoft ODBC Driver 17+ for SQL Server
- Windows environment with domain authentication (or Linux with Kerberos)
- Read access to the SQL Server databases

### Installation

```bash
# Clone the repository
git clone <repository-url>
cd data_regression_testing

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the framework
pip install -e .

# Verify installation
drt --version
```

### Basic Usage

```bash
# 1. Discover tables from the baseline database
drt discover --server <server> --database <database> --output config.yaml

# 2. Edit config.yaml to add the target database connection

# 3. Validate the configuration
drt validate --config config.yaml

# 4. Run the comparison
drt compare --config config.yaml

# 5. (Optional) Investigate regression issues
drt investigate --analysis-dir analysis/output_<timestamp>/ --config config.yaml
```

## 📦 Platform-Specific Installation

### Windows

1. Install Python 3.9+ from https://www.python.org/downloads/
2. The ODBC driver is usually pre-installed on Windows
3. Install the framework:

```cmd
python -m venv venv
venv\Scripts\activate
pip install -e .
```

### Linux (Debian/Ubuntu)

```bash
# Install the ODBC driver
curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | sudo gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg
curl https://packages.microsoft.com/config/debian/12/prod.list | sudo tee /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18 unixodbc-dev

# Install Kerberos for Windows Authentication
sudo apt-get install -y krb5-user
# Configure /etc/krb5.conf with your domain settings
# Then obtain a ticket:
kinit username@YOUR_DOMAIN.COM

# Install the framework
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

## 📋 Commands

### Discovery

Automatically scan databases and generate configuration files.

```bash
drt discover --server <server> --database <database> [OPTIONS]
```

**Options:**

- `--server TEXT` - SQL Server hostname (required)
- `--database TEXT` - Database name (required)
- `--output, -o TEXT` - Output file (default: config_discovered.yaml)
- `--schemas TEXT` - Specific schemas to include
- `--verbose, -v` - Enable verbose output

### Validate

Validate configuration file syntax and database connectivity.

```bash
drt validate --config <config_file> [OPTIONS]
```

**Options:**

- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
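The `drt validate` command covers this end to end, but in automation it can be useful to run a similar pre-flight check from your own scripts. The following is a minimal sketch of that idea, not the framework's implementation: it assumes PyYAML and pyodbc are installed, hard-codes ODBC Driver 17, and reads the `database_pairs` layout shown under Configuration below.

```python
"""Pre-flight check for a drt config file (illustrative sketch, not drt internals)."""
import sys

import pyodbc  # same ODBC stack the framework relies on
import yaml    # PyYAML


def check_config(path: str) -> int:
    """Load the YAML config and attempt a trivial read-only query per database."""
    with open(path, encoding="utf-8") as fh:
        cfg = yaml.safe_load(fh)

    failures = 0
    for pair in cfg.get("database_pairs", []):
        if not pair.get("enabled", True):
            continue
        for role in ("baseline", "target"):
            db = pair[role]
            conn_str = (
                "DRIVER={ODBC Driver 17 for SQL Server};"
                f"SERVER={db['server']};DATABASE={db['database']};"
                "Trusted_Connection=yes;"  # Windows/Kerberos auth, no stored credentials
            )
            try:
                conn = pyodbc.connect(conn_str, timeout=10)
                conn.cursor().execute("SELECT 1").fetchone()
                conn.close()
                print(f"OK   {pair['name']}/{role}: {db['server']}/{db['database']}")
            except pyodbc.Error as exc:
                failures += 1
                print(f"FAIL {pair['name']}/{role}: {exc}")
    return 2 if failures else 0  # 2 = configuration/connection issue


if __name__ == "__main__":
    sys.exit(check_config(sys.argv[1]))
```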
### Compare

Execute the data comparison between baseline and target databases.

```bash
drt compare --config <config_file> [OPTIONS]
```

**Options:**

- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be compared without executing

### Investigate

Execute diagnostic queries from regression analysis.

```bash
drt investigate --analysis-dir <analysis_dir> --config <config_file> [OPTIONS]
```

**Options:**

- `--analysis-dir, -a PATH` - Analysis output directory containing `*_investigate.sql` files (required)
- `--config, -c PATH` - Configuration file (required)
- `--output-dir, -o PATH` - Output directory for reports (default: ./investigation_reports)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be executed without running

**Example:**

```bash
drt investigate -a analysis/output_20251209_184032/ -c config.yaml
drt investigate -a analysis/output_20251209_184032/ -c config.yaml -o ./my_reports
```

**What it does:**

- Discovers all `*_investigate.sql` files in the analysis directory
- Parses the SQL files (handles markdown, multiple queries per file)
- Executes the queries on both baseline and target databases
- Handles errors gracefully (continues on failures)
- Generates HTML and CSV reports with side-by-side comparisons

## ⚙️ Configuration

### Database Connections

```yaml
database_pairs:
  - name: "DWH_Comparison"
    enabled: true
    baseline:
      server: "<baseline_server>"
      database: "<baseline_database>"
      timeout:
        connection: 30
        query: 300
    target:
      server: "<target_server>"
      database: "<target_database>"
```

### Comparison Settings

```yaml
comparison:
  mode: "health_check"  # or "full"
  row_count:
    enabled: true
    tolerance_percent: 0.0
  schema:
    enabled: true
    checks:
      column_names: true
      data_types: true
  aggregates:
    enabled: true
    tolerance_percent: 0.01
```

### Table Configuration

```yaml
tables:
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Amount"
      - "Quantity"
```

### Output Directories

```yaml
reporting:
  output_dir: "./reports"
  investigation_dir: "./investigation_reports"
logging:
  output_dir: "./logs"
discovery:
  analysis_directory: "./analysis"
```

**Benefits:**

- Centralized storage of all output files
- Easy cleanup and management of generated files
- Configuration flexibility via YAML
- Backward compatibility with CLI overrides

## 📊 Reports

### Comparison Reports

The framework generates comprehensive reports in multiple formats:

- **HTML Report** - Visual summary with color-coded results and detailed breakdowns
- **CSV Report** - Machine-readable format for Excel or databases
- **PDF Report** - Professionally formatted output (requires weasyprint)

Reports are saved to `./reports/` with timestamps.

### Investigation Reports

- **HTML Report** - Interactive report with collapsible query results and side-by-side baseline vs. target comparison
- **CSV Report** - Flattened structure with one row per query execution

Investigation reports are saved to `./investigation_reports/` with timestamps.
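Since the CSV report is flat, it is straightforward to post-process in CI. The snippet below is a sketch rather than a supported interface: the report's column names (`table`, `check_type`, `status`) are assumptions, so adjust them to the actual header row; the exit-code convention it mirrors is described in the next section.

```python
"""Fail a CI job when the latest comparison CSV contains FAIL rows (sketch)."""
import csv
import glob
import sys

reports = glob.glob("reports/*.csv")
if not reports:
    sys.exit(2)  # no report found; mirror drt's "execution error" code

# Timestamped filenames sort chronologically, so max() picks the newest report.
latest = max(reports)

with open(latest, newline="", encoding="utf-8") as fh:
    # The column names ("table", "check_type", "status") are assumptions --
    # check the header row of your own report.
    failures = [row for row in csv.DictReader(fh) if row.get("status") == "FAIL"]

for row in failures:
    print(f"FAIL: {row.get('table')} ({row.get('check_type')})")

sys.exit(1 if failures else 0)  # 1 = failures detected, as in the Exit Codes table
```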
## 🔄 Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success - all comparisons passed |
| 1 | Failures detected - one or more FAIL results |
| 2 | Execution error - configuration or connection issues |

## 🧪 Testing

### Docker Test Environment

```bash
# Start test SQL Server containers
bash test_data/setup_test_environment.sh

# Test discovery
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml

# Test comparison
drt compare --config config.test.yaml

# Cleanup
docker-compose -f docker-compose.test.yml down -v
```

### Manual Testing

```bash
# Connect to the test databases (use the SA_PASSWORD environment variable)
docker exec -it drt-sqlserver-baseline /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD"

# Run queries to verify data
SELECT COUNT(*) FROM dbo.FactTable1;
```

## 🚢 Deployment

### Scheduled Execution

**Windows Task Scheduler:**

```batch
@echo off
cd C:\path\to\framework
call venv\Scripts\activate.bat
drt compare --config config.yaml
if %ERRORLEVEL% NEQ 0 (
    echo Test failed with exit code %ERRORLEVEL%
    exit /b %ERRORLEVEL%
)
```

**Linux Cron:**

```bash
# Run daily at 2 AM
0 2 * * * /path/to/venv/bin/drt compare --config /path/to/config.yaml >> /path/to/logs/cron.log 2>&1
```

### Monitoring

```bash
# Watch logs
tail -f logs/drt_*.log

# Search for failures
grep -i "FAIL\|ERROR" logs/drt_*.log
```

## 🏗️ Architecture

```
src/drt/
├── cli/                 # Command-line interface
│   └── commands/        # CLI commands (compare, discover, validate, investigate)
├── config/              # Configuration management
├── database/            # Database connectivity (READ ONLY)
├── models/              # Data models
├── reporting/           # Report generators
├── services/            # Business logic
│   ├── checkers/        # Comparison checkers
│   ├── investigation.py # Investigation service
│   └── sql_parser.py    # SQL file parser
└── utils/               # Utilities
```

## 🔒 Security

- **Windows Authentication Only** - No stored credentials
- **Read-Only Operations** - All queries are SELECT-only
- **Minimal Permissions** - Only requires the db_datareader role
- **No Data Logging** - Sensitive data is never logged

## 🔧 Troubleshooting

### Connection Failed

```bash
# Test connectivity
drt discover --server <server> --database master

# Verify the ODBC driver
odbcinst -q -d

# Check permissions
# The user needs the db_datareader role on the target databases
```

### Query Timeout

Increase the timeout in the configuration:

```yaml
baseline:
  timeout:
    query: 600  # 10 minutes
```

### Linux Kerberos Issues

```bash
# Check the ticket
klist

# Renew if expired
kinit username@YOUR_DOMAIN.COM

# Verify the ticket is valid
klist
```

## ⚡ Performance

### Diagnostic Logging

Enable verbose mode to see detailed timing:

```bash
drt compare --config config.yaml --verbose
```

This shows:

- Per-check timing (existence, row count, schema, aggregates)
- Query execution times
- Parallelization opportunities

### Optimization Tips

- Disable aggregate checks for surrogate keys
- Increase query timeouts for large tables
- Use table filtering to focus on critical tables
- Consider parallel execution for multiple database pairs (see the sketch below)
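The framework itself schedules its checks, but if you drive `drt` from your own automation, independent database pairs can be compared concurrently by running one process per pair. A sketch using `concurrent.futures` follows; the per-pair config files (`config_pair1.yaml`, ...) are hypothetical.

```python
"""Run independent comparisons concurrently, one drt process per pair (sketch)."""
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-pair config files, e.g. one large config split by database pair.
CONFIGS = ["config_pair1.yaml", "config_pair2.yaml", "config_pair3.yaml"]


def run_compare(config: str) -> int:
    """Run one comparison and return its exit code (0/1/2, see Exit Codes above)."""
    return subprocess.run(["drt", "compare", "--config", config]).returncode


# Threads suffice here: each worker only waits on its child process.
with ThreadPoolExecutor(max_workers=len(CONFIGS)) as pool:
    codes = list(pool.map(run_compare, CONFIGS))

# Propagate the worst exit code so schedulers and CI still see failures.
sys.exit(max(codes))
```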
## 👨‍💻 Development

### Getting Started

1. Fork the repository on GitHub
2. Clone your fork locally:

```bash
git clone https://git.informatica-polet.com/drt/drt.git
cd data_regression_testing
```

3. Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

4. Install dependencies:

```bash
pip install -r requirements.txt
pip install -e .
```

5. Install development dependencies:

```bash
pip install pytest pytest-cov black flake8 mypy
```

### Development Workflow

#### 1. Create a Branch

```bash
git checkout -b feature/your-feature-name
# or
git checkout -b bugfix/issue-description
```

#### 2. Make Your Changes

- Write clean, readable code
- Follow the existing code style
- Add docstrings to all functions and classes
- Update documentation as needed

#### 3. Run Tests

```bash
# All tests
pytest

# With coverage
pytest --cov=src/drt --cov-report=html

# Specific test file
pytest tests/test_models.py
```

#### 4. Code Quality Checks

```bash
# Format code with black
black src/ tests/

# Check code style with flake8
flake8 src/ tests/

# Type checking with mypy
mypy src/
```

#### 5. Commit Your Changes

Write clear, descriptive commit messages:

```bash
git add .
git commit -m "Add feature: description of your changes"
```

**Commit message guidelines:**

- Use the present tense ("Add feature", not "Added feature")
- Use the imperative mood ("Move cursor to...", not "Moves cursor to...")
- Limit the first line to 72 characters
- Reference issues and pull requests when relevant

#### 6. Push and Create a Pull Request

```bash
git push origin feature/your-feature-name
```

Create a pull request on GitHub with:

- A clear title and description
- References to related issues
- Screenshots (if applicable)
- Test results

### Code Style Guidelines

#### Python Style

- Follow the PEP 8 style guide
- Use type hints for function parameters and return values
- Maximum line length: 100 characters
- Use meaningful variable and function names

**Example:**

```python
def calculate_row_count_difference(
    baseline_count: int,
    target_count: int,
    tolerance_percent: float
) -> tuple[bool, float]:
    """
    Calculate if the row count difference is within tolerance.

    Args:
        baseline_count: Row count from the baseline database
        target_count: Row count from the target database
        tolerance_percent: Acceptable difference percentage

    Returns:
        Tuple of (is_within_tolerance, actual_difference_percent)
    """
    # Implementation here
    pass
```

#### Documentation

- Add docstrings to all public functions, classes, and modules
- Use Google-style docstrings
- Include examples in docstrings when helpful
- Update README.md for user-facing changes

#### Testing

- Write unit tests for all new functionality
- Aim for >80% code coverage
- Use descriptive test names
- Follow the AAA pattern (Arrange, Act, Assert)

**Example:**

```python
def test_row_count_checker_exact_match():
    """Test the row count checker with an exact match"""
    # Arrange
    checker = RowCountChecker(tolerance_percent=0.0)

    # Act
    result = checker.check(baseline_count=1000, target_count=1000)

    # Assert
    assert result.status == Status.PASS
    assert result.baseline_value == 1000
    assert result.target_value == 1000
```

### Adding New Features

#### New Checker Type

To add a new comparison checker (see the sketch after this list):

1. Create the new checker in `src/drt/services/checkers/`
2. Inherit from `BaseChecker`
3. Implement the `check()` method
4. Add a new `CheckType` enum value
5. Register it in `ComparisonService`
6. Add tests in `tests/test_checkers.py`
7. Update the documentation
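To illustrate steps 1-3, a skeleton for a hypothetical null-ratio checker is shown below. The `BaseChecker`, `CheckResult`, and `Status` signatures and import paths are assumptions inferred from the names above, not the framework's actual interfaces.

```python
"""Skeleton for a new checker (illustrative; real base-class signatures may differ)."""
from drt.models import CheckResult, Status           # assumed import path
from drt.services.checkers.base import BaseChecker   # assumed import path


class NullRatioChecker(BaseChecker):
    """Hypothetical checker: compare a column's NULL ratio across databases."""

    def __init__(self, tolerance_percent: float = 0.0):
        self.tolerance_percent = tolerance_percent

    def check(self, baseline_value: float, target_value: float) -> CheckResult:
        # Relative difference versus baseline; a zero baseline only passes
        # when the target is also zero.
        if baseline_value == 0:
            diff = 0.0 if target_value == 0 else float("inf")
        else:
            diff = abs(target_value - baseline_value) / baseline_value * 100
        status = Status.PASS if diff <= self.tolerance_percent else Status.FAIL
        return CheckResult(
            status=status,
            baseline_value=baseline_value,
            target_value=target_value,
        )
```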
#### New Report Format

To add a new report format (a skeleton sketch appears in the appendix at the end of this README):

1. Create the new reporter in `src/drt/reporting/`
2. Implement the `generate()` method
3. Add the format option to the configuration
4. Update `ReportGenerator` to use the new format
5. Add tests
6. Update the documentation

### Testing

#### Unit Tests

Run the test suite:

```bash
# All tests
pytest

# With a coverage report
pytest --cov=src/drt --cov-report=html

# Specific test file
pytest tests/test_models.py -v

# Specific test function
pytest tests/test_models.py::test_status_enum -v
```

#### Integration Tests

Use the Docker test environment:

```bash
# Start the test databases
bash test_data/setup_test_environment.sh

# Run integration tests
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
drt compare --config config.test.yaml

# Cleanup
docker-compose -f docker-compose.test.yml down -v
```

#### Manual Testing

```bash
# Test against real databases (requires access)
drt discover --server <server> --database <database> --output manual_test.yaml
drt validate --config manual_test.yaml
drt compare --config manual_test.yaml --dry-run
```

### Reporting Issues

When reporting issues, please include:

- A clear description of the problem
- Steps to reproduce
- Expected vs. actual behavior
- Environment details (OS, Python version, ODBC driver version)
- Relevant logs or error messages
- The configuration file (sanitized - remove server names/credentials)

**Example:**

````markdown
**Description:** Row count comparison fails with a timeout error

**Steps to Reproduce:**
1. Configure a comparison for a large table (>1M rows)
2. Run `drt compare --config config.yaml`
3. Observe the timeout error

**Expected:** Comparison completes successfully
**Actual:** Query timeout after 300 seconds

**Environment:**
- OS: Windows 10
- Python: 3.9.7
- ODBC Driver: 17 for SQL Server

**Logs:**
```
ERROR: Query timeout on table dbo.FactTable1
```
````

### Feature Requests

For feature requests, please:

- Check whether the feature already exists or is planned
- Describe the use case clearly
- Explain why it would be valuable
- Provide examples if possible

### Code Review Process

All contributions go through code review:

1. Automated checks must pass (tests, linting)
2. At least one maintainer approval is required
3. Address review feedback promptly
4. Keep pull requests focused and reasonably sized

### Release Process

Releases follow semantic versioning (MAJOR.MINOR.PATCH):

- **MAJOR** - Breaking changes
- **MINOR** - New features (backward compatible)
- **PATCH** - Bug fixes (backward compatible)

### Development Tips

#### Debugging

```bash
# Enable verbose logging
drt compare --config config.yaml --verbose

# Use dry-run to test without execution
drt compare --config config.yaml --dry-run

# Check configuration validity
drt validate --config config.yaml
```

#### Performance Profiling

```bash
# Enable diagnostic logging
drt compare --config config.yaml --verbose

# Look for timing information in the logs
grep "execution time" logs/drt_*.log
```

#### Docker Development

```bash
# Build and test in Docker
docker build -t drt:dev .
docker run -v $(pwd)/config.yaml:/app/config.yaml drt:dev compare --config /app/config.yaml
```

## 📝 License

MIT License - see the LICENSE file for details.

## 📞 Support

For issues and questions:

- GitHub Issues: https://git.informatica-polet.com/drt/drt/issues
- Check the logs in `./logs/`
- Review the configuration with `drt validate`
- Test connectivity with `drt discover`
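## 🧩 Appendix: Example Reporter Skeleton

As a companion to the New Report Format steps under Development, here is a minimal skeleton of what steps 1-2 could look like for a hypothetical JSON reporter. The `generate()` signature is an assumption inferred from step 2, and the registration with `ReportGenerator` (step 4) is not shown.

```python
"""Skeleton for a new report format (illustrative; real interfaces may differ)."""
import json
from datetime import datetime
from pathlib import Path


class JsonReporter:
    """Hypothetical reporter that writes comparison results as JSON."""

    def __init__(self, output_dir: str = "./reports"):
        self.output_dir = Path(output_dir)

    def generate(self, results: list[dict]) -> Path:
        """Write results to a timestamped JSON file and return its path."""
        self.output_dir.mkdir(parents=True, exist_ok=True)
        path = self.output_dir / f"comparison_{datetime.now():%Y%m%d_%H%M%S}.json"
        path.write_text(json.dumps(results, indent=2, default=str), encoding="utf-8")
        return path
```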