# Data Regression Testing Framework
A comprehensive framework for validating data integrity during code migration and system updates by comparing data outputs between Baseline (Production) and Target (Test) SQL Server databases.
## ✨ Features
- **Automated Discovery** - Scan databases and auto-generate configuration files
- **Multiple Comparison Types** - Row counts, schema validation, aggregate sums
- **Investigation Queries** - Execute diagnostic SQL queries from regression analysis
- **Flexible Configuration** - YAML-based setup with extensive customization
- **Rich Reporting** - HTML, CSV, and PDF reports with detailed results
- **Windows Authentication** - Secure, credential-free database access
- **Read-Only Operations** - All queries are SELECT-only for safety
- **Comprehensive Logging** - Detailed execution logs with timestamps
## 🚀 Quick Start
### Prerequisites
- Python 3.9+
- Microsoft ODBC Driver 17+ for SQL Server
- Windows environment with domain authentication (or Linux with Kerberos)
- Read access to SQL Server databases
### Installation
```bash
# Clone the repository
git clone <repository-url>
cd data_regression_testing
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the framework
pip install -e .
# Verify installation
drt --version
```
### Basic Usage
```bash
# 1. Discover tables from baseline database
drt discover --server <YOUR_SERVER> --database <YOUR_BASELINE_DB> --output config.yaml
# 2. Edit config.yaml to add target database connection
# 3. Validate configuration
drt validate --config config.yaml
# 4. Run comparison
drt compare --config config.yaml
# 5. (Optional) Investigate regression issues
drt investigate --analysis-dir analysis/output_<TIMESTAMP>/ --config config.yaml
```
## 📦 Platform-Specific Installation
### Windows
1. Install Python 3.9+ from https://www.python.org/downloads/
2. The Microsoft ODBC Driver for SQL Server is usually already installed on Windows; if not, download it from Microsoft
3. Install the framework:
```cmd
python -m venv venv
venv\Scripts\activate
pip install -e .
```
### Linux (Debian/Ubuntu)
```bash
# Install ODBC Driver
curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | sudo gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg
curl https://packages.microsoft.com/config/debian/12/prod.list | sudo tee /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18 unixodbc-dev
# Install Kerberos for Windows Authentication
sudo apt-get install -y krb5-user
# Configure /etc/krb5.conf with your domain settings (a minimal example follows this block)
# Then obtain a ticket: kinit username@YOUR_DOMAIN.COM
# Install framework
python3 -m venv venv
source venv/bin/activate
pip install -e .
```
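A minimal `/etc/krb5.conf` sketch for the Kerberos step above. The realm and KDC host names are placeholders, not values this framework requires; replace them with your own domain settings:
```ini
# Illustrative only - substitute your actual realm and domain controller.
[libdefaults]
    default_realm = YOUR_DOMAIN.COM
    dns_lookup_kdc = true

[realms]
    YOUR_DOMAIN.COM = {
        kdc = dc01.your_domain.com
        admin_server = dc01.your_domain.com
    }
```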
## 📋 Commands
### Discovery
Automatically scan databases and generate configuration files.
```bash
drt discover --server <YOUR_SERVER> --database <YOUR_DATABASE> [OPTIONS]
```
**Options:**
- `--server TEXT` - SQL Server hostname (required)
- `--database TEXT` - Database name (required)
- `--output, -o TEXT` - Output file (default: config_discovered.yaml)
- `--schemas TEXT` - Specific schemas to include
- `--verbose, -v` - Enable verbose output
### Validate
Validate configuration file syntax and database connectivity.
```bash
drt validate --config <CONFIG_FILE> [OPTIONS]
```
**Options:**
- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
### Compare
Execute data comparison between baseline and target databases.
```bash
drt compare --config <CONFIG_FILE> [OPTIONS]
```
**Options:**
- `--config, -c PATH` - Configuration file (required)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be compared without executing
### Investigate
Execute diagnostic queries from regression analysis.
```bash
drt investigate --analysis-dir <ANALYSIS_DIR> --config <CONFIG_FILE> [OPTIONS]
```
**Options:**
- `--analysis-dir, -a PATH` - Analysis output directory containing `*_investigate.sql` files (required)
- `--config, -c PATH` - Configuration file (required)
- `--output-dir, -o PATH` - Output directory for reports (default: ./investigation_reports)
- `--verbose, -v` - Enable verbose output
- `--dry-run` - Show what would be executed without running
**Example:**
```bash
drt investigate -a analysis/output_20251209_184032/ -c config.yaml
drt investigate -a analysis/output_20251209_184032/ -c config.yaml -o ./my_reports
```
**What it does** (a sample investigation file is sketched after this list):
- Discovers all `*_investigate.sql` files in the analysis directory
- Parses SQL files (handles markdown, multiple queries per file)
- Executes queries on both baseline and target databases
- Handles errors gracefully (continues on failures)
- Generates HTML and CSV reports with side-by-side comparisons
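For illustration, a hypothetical `dbo_FactTable1_investigate.sql` file of the kind this command consumes; the file name, the `LoadDate` column, and the queries are made-up examples rather than guaranteed framework output:
```sql
-- Query 1: total row count for the flagged table
SELECT COUNT(*) AS row_count
FROM dbo.FactTable1;

-- Query 2: rows loaded in the last 7 days (hypothetical LoadDate column)
SELECT COUNT(*) AS recent_rows
FROM dbo.FactTable1
WHERE LoadDate >= DATEADD(DAY, -7, GETDATE());
```
Each query is executed against both the baseline and target databases, and the results appear side by side in the generated report.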
## ⚙️ Configuration
### Database Connections
```yaml
database_pairs:
  - name: "DWH_Comparison"
    enabled: true
    baseline:
      server: "<YOUR_SERVER>"
      database: "<YOUR_BASELINE_DB>"
      timeout:
        connection: 30
        query: 300
    target:
      server: "<YOUR_SERVER>"
      database: "<YOUR_TARGET_DB>"
```
### Comparison Settings
```yaml
comparison:
  mode: "health_check"  # or "full"
  row_count:
    enabled: true
    tolerance_percent: 0.0
  schema:
    enabled: true
    checks:
      column_names: true
      data_types: true
  aggregates:
    enabled: true
    tolerance_percent: 0.01
```
### Table Configuration
```yaml
tables:
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    expected_in_target: true
    aggregate_columns:
      - "Amount"
      - "Quantity"
```
### Output Directories
```yaml
reporting:
  output_dir: "./reports"
  investigation_dir: "./investigation_reports"
logging:
  output_dir: "./logs"
discovery:
  analysis_directory: "./analysis"
```
**Benefits:**
- Centralized storage of all output files
- Easy cleanup and management of generated files
- Configuration flexibility via YAML
- Backward compatibility with CLI overrides
## 📊 Reports
### Comparison Reports
The framework generates comprehensive reports in multiple formats:
- **HTML Report** - Visual summary with color-coded results and detailed breakdowns
- **CSV Report** - Machine-readable format for Excel or databases
- **PDF Report** - Professional formatted output (requires weasyprint)
Reports are saved to `./reports/` with timestamps.
### Investigation Reports
- **HTML Report** - Interactive report with collapsible query results, side-by-side baseline vs target comparison
- **CSV Report** - Flattened structure with one row per query execution
Investigation reports are saved to `./investigation_reports/` with timestamps.
## 🔄 Exit Codes
| Code | Meaning |
|------|---------|
| 0 | Success - all comparisons passed |
| 1 | Failures detected - one or more FAIL results |
| 2 | Execution error - configuration or connection issues |
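For example, a small wrapper script (a sketch assuming a POSIX shell and the default `./reports/` directory) can branch on the exit status in CI or scheduled jobs:
```bash
#!/usr/bin/env bash
# Illustrative wrapper: map drt exit codes to human-readable outcomes.
drt compare --config config.yaml
status=$?

case "$status" in
  0) echo "All comparisons passed" ;;
  1) echo "Regressions detected - review ./reports/ for details" ;;
  2) echo "Execution error - check configuration and connectivity" ;;
  *) echo "Unexpected exit code: $status" ;;
esac

exit "$status"
```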
## 🧪 Testing
### Docker Test Environment
```bash
# Start test SQL Server containers
bash test_data/setup_test_environment.sh
# Test discovery
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
# Test comparison
drt compare --config config.test.yaml
# Cleanup
docker-compose -f docker-compose.test.yml down -v
```
### Manual Testing
```bash
# Connect to test databases (use SA_PASSWORD environment variable)
docker exec -it drt-sqlserver-baseline /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P "$SA_PASSWORD"
# Run queries to verify data
SELECT COUNT(*) FROM dbo.FactTable1;
```
## 🚢 Deployment
### Scheduled Execution
**Windows Task Scheduler:**
```batch
@echo off
cd C:\path\to\framework
call venv\Scripts\activate.bat
drt compare --config config.yaml
if %ERRORLEVEL% NEQ 0 (
    echo Test failed with exit code %ERRORLEVEL%
    exit /b %ERRORLEVEL%
)
```
**Linux Cron:**
```bash
# Run daily at 2 AM
0 2 * * * /path/to/venv/bin/drt compare --config /path/to/config.yaml >> /path/to/logs/cron.log 2>&1
```
### Monitoring
```bash
# Watch logs
tail -f logs/drt_*.log
# Search for failures
grep -i "FAIL\|ERROR" logs/drt_*.log
```
## 🏗️ Architecture
```
src/drt/
├── cli/                  # Command-line interface
│   └── commands/         # CLI commands (compare, discover, validate, investigate)
├── config/               # Configuration management
├── database/             # Database connectivity (READ ONLY)
├── models/               # Data models
├── reporting/            # Report generators
├── services/             # Business logic
│   ├── checkers/         # Comparison checkers
│   ├── investigation.py  # Investigation service
│   └── sql_parser.py     # SQL file parser
└── utils/                # Utilities
```
## 🔒 Security
- **Windows Authentication Only** - No stored credentials
- **Read-Only Operations** - All queries are SELECT-only
- **Minimal Permissions** - Only requires db_datareader role
- **No Data Logging** - Sensitive data never logged
## 🔧 Troubleshooting
### Connection Failed
```bash
# Test connectivity
drt discover --server <YOUR_SERVER> --database master
# Verify ODBC driver
odbcinst -q -d
# Check permissions
# User needs db_datareader role on target databases
```
### Query Timeout
Increase timeout in configuration:
```yaml
baseline:
  timeout:
    query: 600  # 10 minutes
```
### Linux Kerberos Issues
```bash
# Check ticket
klist
# Renew if expired
kinit username@YOUR_DOMAIN.COM
# Verify ticket is valid
klist
```
## ⚡ Performance
### Diagnostic Logging
Enable verbose mode to see detailed timing:
```bash
drt compare --config config.yaml --verbose
```
This shows:
- Per-check timing (existence, row count, schema, aggregates)
- Query execution times
- Parallelization opportunities
### Optimization Tips
- Disable aggregate checks for surrogate keys
- Increase query timeouts for large tables
- Use table filtering to focus on critical tables (see the config sketch after this list)
- Consider parallel execution for multiple database pairs
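A hedged config fragment illustrating the filtering and aggregate tips, reusing the table options from the Configuration section; `StagingTable1` and the omitted key column are placeholder names:
```yaml
tables:
  - schema: "dbo"
    name: "FactTable1"
    enabled: true
    aggregate_columns:
      - "Amount"        # business measure worth summing
      # surrogate key column deliberately left out of aggregates
  - schema: "dbo"
    name: "StagingTable1"
    enabled: false      # skip non-critical tables entirely
```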
## 👨‍💻 Development
### Getting Started
1. Fork the repository on the Git server
2. Clone your fork locally:
```bash
git clone https://git.informatica-polet.com/drt/drt.git
cd data_regression_testing
```
3. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
4. Install dependencies:
```bash
pip install -r requirements.txt
pip install -e .
```
5. Install development dependencies:
```bash
pip install pytest pytest-cov black flake8 mypy
```
### Development Workflow
#### 1. Create a Branch
```bash
git checkout -b feature/your-feature-name
# or
git checkout -b bugfix/issue-description
```
#### 2. Make Your Changes
- Write clean, readable code
- Follow the existing code style
- Add docstrings to all functions and classes
- Update documentation as needed
#### 3. Run Tests
```bash
# All tests
pytest
# With coverage
pytest --cov=src/drt --cov-report=html
# Specific test file
pytest tests/test_models.py
```
#### 4. Code Quality Checks
```bash
# Format code with black
black src/ tests/
# Check code style with flake8
flake8 src/ tests/
# Type checking with mypy
mypy src/
```
#### 5. Commit Your Changes
Write clear, descriptive commit messages:
```bash
git add .
git commit -m "Add feature: description of your changes"
```
**Commit message guidelines:**
- Use present tense ("Add feature" not "Added feature")
- Use imperative mood ("Move cursor to..." not "Moves cursor to...")
- Limit first line to 72 characters
- Reference issues and pull requests when relevant
#### 6. Push and Create Pull Request
```bash
git push origin feature/your-feature-name
```
Create a pull request with:
- Clear title and description
- Reference to related issues
- Screenshots (if applicable)
- Test results
### Code Style Guidelines
#### Python Style
- Follow PEP 8 style guide
- Use type hints for function parameters and return values
- Maximum line length: 100 characters
- Use meaningful variable and function names
**Example:**
```python
def calculate_row_count_difference(
    baseline_count: int,
    target_count: int,
    tolerance_percent: float,
) -> tuple[bool, float]:
    """
    Calculate if row count difference is within tolerance.

    Args:
        baseline_count: Row count from baseline database
        target_count: Row count from target database
        tolerance_percent: Acceptable difference percentage

    Returns:
        Tuple of (is_within_tolerance, actual_difference_percent)
    """
    # Implementation here
    pass
```
#### Documentation
- Add docstrings to all public functions, classes, and modules
- Use Google-style docstrings
- Include examples in docstrings when helpful
- Update README.md for user-facing changes
#### Testing
- Write unit tests for all new functionality
- Aim for >80% code coverage
- Use descriptive test names
- Follow AAA pattern (Arrange, Act, Assert)
**Example:**
```python
def test_row_count_checker_exact_match():
    """Test row count checker with exact match"""
    # Arrange
    checker = RowCountChecker(tolerance_percent=0.0)

    # Act
    result = checker.check(baseline_count=1000, target_count=1000)

    # Assert
    assert result.status == Status.PASS
    assert result.baseline_value == 1000
    assert result.target_value == 1000
```
### Adding New Features
#### New Checker Type
To add a new comparison checker (a hedged sketch follows these steps):
1. Create new checker in `src/drt/services/checkers/`
2. Inherit from `BaseChecker`
3. Implement `check()` method
4. Add new `CheckType` enum value
5. Register in `ComparisonService`
6. Add tests in `tests/test_checkers.py`
7. Update documentation
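A minimal sketch of such a checker, assuming `BaseChecker` exposes a `check()` hook; the import path, method signature, and result shape here are illustrative guesses and must be adapted to the actual base class in `src/drt/services/checkers/`:
```python
# Hypothetical sketch - not the framework's actual interface.
from drt.services.checkers.base import BaseChecker  # assumed module path


class NullCountChecker(BaseChecker):
    """Compares NULL counts for a given column between baseline and target."""

    def check(self, baseline_value: int, target_value: int) -> tuple[bool, str]:
        # A real implementation would build the framework's result object;
        # a plain (passed, details) tuple keeps this sketch self-contained.
        passed = baseline_value == target_value
        details = f"baseline={baseline_value}, target={target_value}"
        return passed, details
```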
#### New Report Format
To add a new report format (a hedged sketch follows these steps):
1. Create new reporter in `src/drt/reporting/`
2. Implement `generate()` method
3. Add format option to configuration
4. Update `ReportGenerator` to use new format
5. Add tests
6. Update documentation
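A minimal sketch of a reporter with a `generate()` method, assuming reporters receive a list of result dictionaries and an output path; the actual argument types used by the existing reporters in `src/drt/reporting/` may differ:
```python
# Hypothetical sketch - adapt to the existing reporter interface.
import json
from pathlib import Path


class JsonReporter:
    """Writes comparison results as a JSON document."""

    def generate(self, results: list[dict], output_path: Path) -> Path:
        # Serialize whatever result records the comparison service produced.
        output_path.write_text(json.dumps(results, indent=2, default=str))
        return output_path
```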
### Testing
#### Unit Tests
Run the test suite:
```bash
# All tests
pytest
# With coverage report
pytest --cov=src/drt --cov-report=html
# Specific test file
pytest tests/test_models.py -v
# Specific test function
pytest tests/test_models.py::test_status_enum -v
```
#### Integration Tests
Use the Docker test environment:
```bash
# Start test databases
bash test_data/setup_test_environment.sh
# Run integration tests
drt discover --server localhost,1433 --database TestDB_Baseline --output test.yaml
drt compare --config config.test.yaml
# Cleanup
docker-compose -f docker-compose.test.yml down -v
```
#### Manual Testing
```bash
# Test against real databases (requires access)
drt discover --server <YOUR_SERVER> --database <YOUR_DB> --output manual_test.yaml
drt validate --config manual_test.yaml
drt compare --config manual_test.yaml --dry-run
```
### Reporting Issues
When reporting issues, please include:
- Clear description of the problem
- Steps to reproduce
- Expected vs actual behavior
- Environment details (OS, Python version, ODBC driver version)
- Relevant logs or error messages
- Configuration file (sanitized - remove server names/credentials)
**Example:**
````markdown
**Description:** Row count comparison fails with timeout error

**Steps to Reproduce:**
1. Configure comparison for a large table (>1M rows)
2. Run `drt compare --config config.yaml`
3. Observe timeout error

**Expected:** Comparison completes successfully
**Actual:** Query timeout after 300 seconds

**Environment:**
- OS: Windows 10
- Python: 3.9.7
- ODBC Driver: 17 for SQL Server

**Logs:**
```
ERROR: Query timeout on table dbo.FactTable1
```
````
### Feature Requests
For feature requests, please:
- Check if feature already exists or is planned
- Describe the use case clearly
- Explain why it would be valuable
- Provide examples if possible
### Code Review Process
All contributions go through code review:
1. Automated checks must pass (tests, linting)
2. At least one maintainer approval required
3. Address review feedback promptly
4. Keep pull requests focused and reasonably sized
### Release Process
Releases follow semantic versioning (MAJOR.MINOR.PATCH):
- **MAJOR** - Breaking changes
- **MINOR** - New features (backward compatible)
- **PATCH** - Bug fixes (backward compatible)
### Development Tips
#### Debugging
```bash
# Enable verbose logging
drt compare --config config.yaml --verbose
# Use dry-run to test without execution
drt compare --config config.yaml --dry-run
# Check configuration validity
drt validate --config config.yaml
```
#### Performance Profiling
```bash
# Enable diagnostic logging
drt compare --config config.yaml --verbose
# Look for timing information in logs
grep "execution time" logs/drt_*.log
```
#### Docker Development
```bash
# Build and test in Docker
docker build -t drt:dev .
docker run -v $(pwd)/config.yaml:/app/config.yaml drt:dev compare --config /app/config.yaml
```
## 📝 License
MIT License - see LICENSE file for details
## 📞 Support
For issues and questions:
- Issue tracker: https://git.informatica-polet.com/drt/drt/issues
- Check logs in `./logs/`
- Review configuration with `drt validate`
- Test connectivity with `drt discover`