Data Regression Testing Framework
A comprehensive framework for validating data integrity during code migration and system updates by comparing data outputs between Baseline (Production) and Target (Test) SQL Server databases.
✨ Features
- Automated Discovery - Scan databases and auto-generate configuration files
- Multiple Comparison Types - Row counts, schema validation, aggregate sums
- Investigation Queries - Execute diagnostic SQL queries from regression analysis
- Flexible Configuration - YAML-based setup with extensive customization
- Rich Reporting - HTML, CSV, and PDF reports with detailed results
- Windows Authentication - Secure, credential-free database access
- Read-Only Operations - All queries are SELECT-only for safety
- Comprehensive Logging - Detailed execution logs with timestamps
🚀 Quick Start
Prerequisites
- Python 3.9+
- Microsoft ODBC Driver 17+ for SQL Server
- Windows environment with domain authentication (or Linux with Kerberos)
- Read access to SQL Server databases
Installation
# Clone the repository
git clone <repository-url>
cd data_regression_testing
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the framework
pip install -e .
# Verify installation
drt --version
Basic Usage
# 1. Discover tables from baseline database
drt discover --server <YOUR_SERVER> --database <YOUR_BASELINE_DB> --output config.yaml
# 2. Edit config.yaml to add target database connection
# 3. Validate configuration
drt validate --config config.yaml
# 4. Run comparison
drt compare --config config.yaml
# 5. (Optional) Investigate regression issues
drt investigate --analysis-dir analysis/output_<TIMESTAMP>/ --config config.yaml
📦 Platform-Specific Installation
Windows
- Install Python 3.9+ from https://www.python.org/downloads/
- Microsoft ODBC Driver 17+ for SQL Server is often already present on Windows; install it from Microsoft if it is missing
- Install Framework:
python -m venv venv
venv\Scripts\activate
pip install -e .
Linux (Debian/Ubuntu)
# Install ODBC Driver
curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | sudo gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg
curl https://packages.microsoft.com/config/debian/12/prod.list | sudo tee /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18 unixodbc-dev
# Install Kerberos for Windows Authentication
sudo apt-get install -y krb5-user
# Configure /etc/krb5.conf with your domain settings
# Then obtain ticket: kinit username@YOUR_DOMAIN.COM
# Install framework
python3 -m venv venv
source venv/bin/activate
pip install -e .
📋 Commands
Discovery
Automatically scan databases and generate configuration files.
drt discover --server <YOUR_SERVER> --database <YOUR_DATABASE> [OPTIONS]
Options:
- `--server TEXT` - SQL Server hostname (required)
- `--database TEXT` - Database name (required)
- `--output, -o TEXT` - Output file (default: config_discovered.yaml)
- `--schemas TEXT` - Specific schemas to include
- `--verbose, -v` - Enable verbose output
Validate
Validate configuration file syntax and database connectivity.
drt validate --config <CONFIG_FILE> [OPTIONS]
Options:
- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
Compare
Execute data comparison between baseline and target databases.
drt compare --config <CONFIG_FILE> [OPTIONS]
Options:
- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be compared without executing
Investigate
Execute diagnostic queries from regression analysis.
drt investigate --analysis-dir <ANALYSIS_DIR> --config <CONFIG_FILE> [OPTIONS]
Options:
- `--analysis-dir, -a PATH` - Analysis output directory containing `*_investigate.sql` files (required)
- `--config, -c PATH` - Configuration file (required)
- `--output-dir, -o PATH` - Output directory for reports (default: ./investigation_reports)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be executed without running
Example:
drt investigate -a analysis/output_20251209_184032/ -c config.yaml
drt investigate -a analysis/output_20251209_184032/ -c config.yaml -o ./my_reports
What it does:
- Discovers all `*_investigate.sql` files in the analysis directory
- Parses SQL files (handles markdown and multiple queries per file); a sketch of this parsing step follows the list
- Executes queries on both baseline and target databases
- Handles errors gracefully (continues on failures)
- Generates HTML and CSV reports with side-by-side comparisons
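The parsing step is conceptually simple. As a rough, self-contained sketch (not the framework's actual `sql_parser.py`, which may handle more edge cases), splitting an investigation file into individual queries could look like this:

```python
import re
from pathlib import Path

def split_investigation_queries(sql_file: Path) -> list[str]:
    """Illustrative only: strip markdown code fences and split a *_investigate.sql
    file into individual statements. The framework's real parser may differ."""
    text = sql_file.read_text(encoding="utf-8")
    # Drop markdown fence lines such as ```sql and ```
    text = re.sub(r"^```.*$", "", text, flags=re.MULTILINE)
    # Split on statement terminators or GO batch separators
    parts = re.split(r";\s*\n|^\s*GO\s*$", text, flags=re.MULTILINE | re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

# Collect queries from every *_investigate.sql file in an analysis directory
analysis_dir = Path("analysis/output_20251209_184032")
queries = {f.name: split_investigation_queries(f) for f in analysis_dir.glob("*_investigate.sql")}
```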
⚙️ Configuration
Database Connections
database_pairs:
  - name: "DWH_Comparison"
    enabled: true
    baseline:
      server: "<YOUR_SERVER>"
      database: "<YOUR_BASELINE_DB>"
      timeout:
        connection: 30
        query: 300
    target:
      server: "<YOUR_SERVER>"
      database: "<YOUR_TARGET_DB>"
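With Windows Authentication, no credentials appear in the configuration; the connection is built from the server and database names alone. As an illustration only (this mirrors the config above but is not the framework's internal connection code, and the driver name may differ in your environment):

```python
import pyodbc

# Illustrative trusted (Windows Authentication) connection built from the same
# fields the YAML config describes. Adjust the driver name to what is installed,
# e.g. "ODBC Driver 17 for SQL Server".
conn_str = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<YOUR_SERVER>;"
    "DATABASE=<YOUR_BASELINE_DB>;"
    "Trusted_Connection=yes;"
    "TrustServerCertificate=yes;"
)
with pyodbc.connect(conn_str, timeout=30) as conn:  # timeout mirrors the connection setting
    print(conn.cursor().execute("SELECT @@VERSION;").fetchone()[0])
```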
Comparison Settings
comparison:
  mode: "health_check"  # or "full"
  row_count:
    enabled: true
    tolerance_percent: 0.0
  schema:
    enabled: true
    checks:
      column_names: true
      data_types: true
  aggregates:
    enabled: true
    tolerance_percent: 0.01
Table Configuration
tables:
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Amount"
      - "Quantity"
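For each enabled table, the aggregate check in effect sums the configured columns on both sides and compares the totals within the tolerance. A rough illustration of the kind of query this implies (not the framework's actual query builder):

```python
def build_aggregate_query(schema: str, table: str, columns: list[str]) -> str:
    """Illustrative sketch of an aggregate comparison query for one table."""
    sums = ", ".join(f"SUM([{col}]) AS [{col}_sum]" for col in columns)
    return f"SELECT {sums} FROM [{schema}].[{table}];"

# For the configuration above this yields:
# SELECT SUM([Amount]) AS [Amount_sum], SUM([Quantity]) AS [Quantity_sum] FROM [dbo].[FactTable1];
print(build_aggregate_query("dbo", "FactTable1", ["Amount", "Quantity"]))
```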
Output Directories
reporting:
  output_dir: "./reports"
  investigation_dir: "./investigation_reports"
logging:
  output_dir: "./logs"
discovery:
  analysis_directory: "./analysis"
Benefits:
- Centralized storage of all output files
- Easy cleanup and management of generated files
- Configuration flexibility via YAML
- Backward compatibility with CLI overrides
📊 Reports
Comparison Reports
The framework generates comprehensive reports in multiple formats:
- HTML Report - Visual summary with color-coded results and detailed breakdowns
- CSV Report - Machine-readable format for Excel or databases
- PDF Report - Professional formatted output (requires weasyprint)
Reports are saved to ./reports/ with timestamps.
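Because the CSV report is machine-readable, failed checks can be pulled out programmatically. A small, hedged example (the exact column layout depends on the generated report, so this version simply matches rows containing FAIL):

```python
import csv
from pathlib import Path

# Pick the most recent comparison report and print every row mentioning FAIL.
# Illustrative only; adapt the column handling to the actual report layout.
latest = max(Path("./reports").glob("*.csv"), key=lambda p: p.stat().st_mtime)
with latest.open(newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if any("FAIL" in field for field in row):
            print(row)
```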
Investigation Reports
- HTML Report - Interactive report with collapsible query results, side-by-side baseline vs target comparison
- CSV Report - Flattened structure with one row per query execution
Investigation reports are saved to ./investigation_reports/ with timestamps.
🔄 Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success - all comparisons passed |
| 1 | Failures detected - one or more FAIL results |
| 2 | Execution error - configuration or connection issues |
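These exit codes make it easy to wrap the tool in CI or orchestration scripts. For example, a hypothetical Python wrapper that distinguishes data regressions from execution errors:

```python
import subprocess
import sys

# Hypothetical wrapper: run the comparison and react to the documented exit codes.
result = subprocess.run(["drt", "compare", "--config", "config.yaml"])
if result.returncode == 0:
    print("All comparisons passed")
elif result.returncode == 1:
    print("Regressions detected - review the reports in ./reports/")
    sys.exit(1)
else:
    print("Execution error - check configuration and connectivity")
    sys.exit(result.returncode)
```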
🧪 Testing
Docker Test Environment
# Start test SQL Server containers
bash test_data/setup_test_environment.sh
# Test discovery
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
# Test comparison
drt compare --config config.test.yaml
# Cleanup
docker-compose -f docker-compose.test.yml down -v
Manual Testing
# Connect to test databases (use SA_PASSWORD environment variable)
docker exec -it drt-sqlserver-baseline /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD"
# Run queries to verify data
SELECT COUNT(*) FROM dbo.FactTable1;
🚢 Deployment
Scheduled Execution
Windows Task Scheduler:
@echo off
cd C:\path\to\framework
call venv\Scripts\activate.bat
drt compare --config config.yaml
if %ERRORLEVEL% NEQ 0 (
echo Test failed with exit code %ERRORLEVEL%
exit /b %ERRORLEVEL%
)
Linux Cron:
# Run daily at 2 AM
0 2 * * * /path/to/venv/bin/drt compare --config /path/to/config.yaml >> /path/to/logs/cron.log 2>&1
Monitoring
# Watch logs
tail -f logs/drt_*.log
# Search for failures
grep -i "FAIL\|ERROR" logs/drt_*.log
🏗️ Architecture
src/drt/
├── cli/ # Command-line interface
│ └── commands/ # CLI commands (compare, discover, validate, investigate)
├── config/ # Configuration management
├── database/ # Database connectivity (READ ONLY)
├── models/ # Data models
├── reporting/ # Report generators
├── services/ # Business logic
│ ├── checkers/ # Comparison checkers
│ ├── investigation.py # Investigation service
│ └── sql_parser.py # SQL file parser
└── utils/ # Utilities
🔒 Security
- Windows Authentication Only - No stored credentials
- Read-Only Operations - All queries are SELECT-only
- Minimal Permissions - Only requires db_datareader role
- No Data Logging - Sensitive data never logged
🔧 Troubleshooting
Connection Failed
# Test connectivity
drt discover --server <YOUR_SERVER> --database master
# Verify ODBC driver
odbcinst -q -d
# Check permissions
# User needs db_datareader role on target databases
Query Timeout
Increase timeout in configuration:
baseline:
  timeout:
    query: 600  # 10 minutes
Linux Kerberos Issues
# Check ticket
klist
# Renew if expired
kinit username@YOUR_DOMAIN.COM
# Verify ticket is valid
klist
⚡ Performance
Diagnostic Logging
Enable verbose mode to see detailed timing:
drt compare --config config.yaml --verbose
This shows:
- Per-check timing (existence, row count, schema, aggregates)
- Query execution times
- Parallelization opportunities
Optimization Tips
- Disable aggregate checks for surrogate keys
- Increase query timeouts for large tables
- Use table filtering to focus on critical tables
- Consider parallel execution for multiple database pairs (see the sketch below)
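If multiple database pairs are compared, parallelism can also be added at the orchestration level rather than inside the framework. A hedged sketch that runs one `drt compare` process per configuration file concurrently (the config file names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

# Placeholder config files, one per database pair
configs = ["config_pair1.yaml", "config_pair2.yaml", "config_pair3.yaml"]

def run_compare(config: str) -> int:
    """Run one comparison and return its exit code."""
    return subprocess.run(["drt", "compare", "--config", config]).returncode

with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    exit_codes = list(pool.map(run_compare, configs))

# Worst exit code wins: 2 (error) > 1 (failures) > 0 (success)
print("Overall result:", max(exit_codes))
```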
👨‍💻 Development
Getting Started
- Fork the repository on GitHub
- Clone your fork locally:
git clone https://git.informatica-polet.com/drt/drt.git
cd data_regression_testing
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
pip install -e .
- Install development dependencies:
pip install pytest pytest-cov black flake8 mypy
Development Workflow
1. Create a Branch
git checkout -b feature/your-feature-name
# or
git checkout -b bugfix/issue-description
2. Make Your Changes
- Write clean, readable code
- Follow the existing code style
- Add docstrings to all functions and classes
- Update documentation as needed
3. Run Tests
# All tests
pytest
# With coverage
pytest --cov=src/drt --cov-report=html
# Specific test file
pytest tests/test_models.py
4. Code Quality Checks
# Format code with black
black src/ tests/
# Check code style with flake8
flake8 src/ tests/
# Type checking with mypy
mypy src/
5. Commit Your Changes
Write clear, descriptive commit messages:
git add .
git commit -m "Add feature: description of your changes"
Commit message guidelines:
- Use present tense ("Add feature" not "Added feature")
- Use imperative mood ("Move cursor to..." not "Moves cursor to...")
- Limit first line to 72 characters
- Reference issues and pull requests when relevant
6. Push and Create Pull Request
git push origin feature/your-feature-name
Create a pull request on GitHub with:
- Clear title and description
- Reference to related issues
- Screenshots (if applicable)
- Test results
Code Style Guidelines
Python Style
- Follow PEP 8 style guide
- Use type hints for function parameters and return values
- Maximum line length: 100 characters
- Use meaningful variable and function names
Example:
def calculate_row_count_difference(
    baseline_count: int,
    target_count: int,
    tolerance_percent: float
) -> tuple[bool, float]:
    """
    Calculate if row count difference is within tolerance.

    Args:
        baseline_count: Row count from baseline database
        target_count: Row count from target database
        tolerance_percent: Acceptable difference percentage

    Returns:
        Tuple of (is_within_tolerance, actual_difference_percent)
    """
    # Implementation here
    pass
Documentation
- Add docstrings to all public functions, classes, and modules
- Use Google-style docstrings
- Include examples in docstrings when helpful
- Update README.md for user-facing changes
Testing
- Write unit tests for all new functionality
- Aim for >80% code coverage
- Use descriptive test names
- Follow AAA pattern (Arrange, Act, Assert)
Example:
def test_row_count_checker_exact_match():
    """Test row count checker with exact match"""
    # Arrange
    checker = RowCountChecker(tolerance_percent=0.0)

    # Act
    result = checker.check(baseline_count=1000, target_count=1000)

    # Assert
    assert result.status == Status.PASS
    assert result.baseline_value == 1000
    assert result.target_value == 1000
Adding New Features
New Checker Type
To add a new comparison checker:
- Create new checker in `src/drt/services/checkers/`
- Inherit from `BaseChecker`
- Implement the `check()` method
- Add a new `CheckType` enum value
- Register it in `ComparisonService`
- Add tests in `tests/test_checkers.py`
- Update documentation
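The list above names the framework's internal types (`BaseChecker`, `CheckType`, `ComparisonService`); their exact signatures live in the source. The following sketch uses stand-in types so it is self-contained, and should not be read as the real interfaces:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Stand-in for the framework's result model."""
    status: str
    baseline_value: float
    target_value: float


class BaseChecker(ABC):
    """Stand-in for the framework's BaseChecker in src/drt/services/checkers/."""

    @abstractmethod
    def check(self, baseline_value: float, target_value: float) -> CheckResult:
        ...


class NullRatioChecker(BaseChecker):
    """Hypothetical new checker comparing NULL ratios between baseline and target."""

    def __init__(self, tolerance_percent: float = 0.0):
        self.tolerance_percent = tolerance_percent

    def check(self, baseline_value: float, target_value: float) -> CheckResult:
        diff = abs(baseline_value - target_value)
        status = "PASS" if diff <= self.tolerance_percent else "FAIL"
        return CheckResult(status, baseline_value, target_value)
```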
New Report Format
To add a new report format:
- Create new reporter in `src/drt/reporting/`
- Implement the `generate()` method
- Add the format option to configuration
- Update `ReportGenerator` to use the new format
- Add tests
- Update documentation
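As with checkers, the real reporter interface is defined in `src/drt/reporting/`; the sketch below only suggests the shape a new format might take (the `generate()` signature here is an assumption):

```python
import json
from pathlib import Path


class JsonReporter:
    """Hypothetical JSON reporter; the framework's actual generate() signature may differ."""

    def __init__(self, output_dir: str = "./reports"):
        self.output_dir = Path(output_dir)

    def generate(self, results: list[dict]) -> Path:
        """Write comparison results to a JSON file and return its path."""
        self.output_dir.mkdir(parents=True, exist_ok=True)
        report_path = self.output_dir / "comparison_report.json"
        report_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
        return report_path
```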
Testing
Unit Tests
Run the test suite:
# All tests
pytest
# With coverage report
pytest --cov=src/drt --cov-report=html
# Specific test file
pytest tests/test_models.py -v
# Specific test function
pytest tests/test_models.py::test_status_enum -v
Integration Tests
Use the Docker test environment:
# Start test databases
bash test_data/setup_test_environment.sh
# Run integration tests
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
drt compare --config config.test.yaml
# Cleanup
docker-compose -f docker-compose.test.yml down -v
Manual Testing
# Test against real databases (requires access)
drt discover --server <YOUR_SERVER> --database <YOUR_DB> --output manual_test.yaml
drt validate --config manual_test.yaml
drt compare --config manual_test.yaml --dry-run
Reporting Issues
When reporting issues, please include:
- Clear description of the problem
- Steps to reproduce
- Expected vs actual behavior
- Environment details (OS, Python version, ODBC driver version)
- Relevant logs or error messages
- Configuration file (sanitized - remove server names/credentials)
Example:
**Description:** Row count comparison fails with timeout error
**Steps to Reproduce:**
1. Configure comparison for large table (>1M rows)
2. Run `drt compare --config config.yaml`
3. Observe timeout error
**Expected:** Comparison completes successfully
**Actual:** Query timeout after 300 seconds
**Environment:**
- OS: Windows 10
- Python: 3.9.7
- ODBC Driver: 17 for SQL Server
**Logs:**
ERROR: Query timeout on table dbo.FactTable1
Feature Requests
For feature requests, please:
- Check if feature already exists or is planned
- Describe the use case clearly
- Explain why it would be valuable
- Provide examples if possible
Code Review Process
All contributions go through code review:
- Automated checks must pass (tests, linting)
- At least one maintainer approval required
- Address review feedback promptly
- Keep pull requests focused and reasonably sized
Release Process
Releases follow semantic versioning (MAJOR.MINOR.PATCH):
- MAJOR - Breaking changes
- MINOR - New features (backward compatible)
- PATCH - Bug fixes (backward compatible)
Development Tips
Debugging
# Enable verbose logging
drt compare --config config.yaml --verbose
# Use dry-run to test without execution
drt compare --config config.yaml --dry-run
# Check configuration validity
drt validate --config config.yaml
Performance Profiling
# Enable diagnostic logging
drt compare --config config.yaml --verbose
# Look for timing information in logs
grep "execution time" logs/drt_*.log
Docker Development
# Build and test in Docker
docker build -t drt:dev .
docker run -v $(pwd)/config.yaml:/app/config.yaml drt:dev compare --config /app/config.yaml
📝 License
MIT License - see LICENSE file for details
📞 Support
For issues and questions:
- GitHub Issues: https://git.informatica-polet.com/drt/drt/issues
- Check logs in `./logs/`
- Review configuration with `drt validate`
- Test connectivity with `drt discover`