Metadata-Version: 2.4
Name: data-validation-gini
Version: 0.3.14
Summary: Data Validation Gini (DVG) CLI for row count, row/column comparison, and schema validation with HTML reports
Author: ShanKonduru
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: openpyxl
Requires-Dist: sec-report-kit

# Data Validation Gini (DVG)

Data Validation Gini is a lightweight Python CLI for validating source and target datasets and generating a rich HTML reconciliation report.

The repository also includes a CSV data mutation utility (`data_corruptor.py`) to create controlled mismatches for validation testing.

## Latest Updates (v0.3.14)

- **NEW: `--version` flag** - Added a CLI flag to check the installed version
  - Use `python dvg.py --version` or `dvg --version`
  - Version accessible via `data_validation_gini.__version__`
- **SCHEMA_VALIDATION** - Full implementation of schema validation:
  - Validates column count, column names, and inferred data types
  - Detects INTEGER, FLOAT, BOOLEAN, DATE, and STRING types from sample data
  - Can be combined with ROWCOUNT_VALIDATION and ROW_COL_VALIDATION
  - See `scripts/data/007_run_schema_validation.bat` for examples
- Migrated to a `src/` package layout (`data_validation_gini`) while preserving root-level compatibility wrappers.
- Enhanced CLI contract with explicit source/target kind flags (`--src-kind`, `--tgt-kind`) and compatibility shims.
- Added canonical validation-type normalization (`ROWCOUNT` alias -> `ROWCOUNT_VALIDATION`).
- Added mismatch capping with `--max-mismatches`.
- Added reusable file I/O classes:
  - `IniConfigStore` for INI read/write operations
  - `JsonFileStore` for JSON read/write operations
- Refactored test and coverage scripts for reliable local execution on Windows and Linux/macOS.
- Expanded automated tests and achieved **100% package coverage** for `data_validation_gini`.
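
The type detection described under SCHEMA_VALIDATION above can be pictured with a small sketch. This is not the package's actual logic: the helper names `infer_value_type` and `infer_column_type`, the accepted boolean spellings, the date formats, and the rule that mixed columns fall back to STRING are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical rules; DVG's real inference may differ.
BOOLEAN_LITERALS = {"true", "false"}               # assumed boolean spellings
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S")   # assumed date formats


def infer_value_type(value: str) -> str:
    """Return one of INTEGER, FLOAT, BOOLEAN, DATE, STRING for a single cell."""
    text = value.strip()
    if text.lower() in BOOLEAN_LITERALS:
        return "BOOLEAN"
    try:
        int(text)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(text)
        return "FLOAT"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "DATE"
        except ValueError:
            continue
    return "STRING"


def infer_column_type(samples: list[str]) -> str:
    """Infer a column type from sample values, ignoring blank cells."""
    types = {infer_value_type(v) for v in samples if v.strip()}
    if not types:
        return "STRING"
    if len(types) == 1:
        return types.pop()
    # Assumed widening rule: a mix of integers and floats is treated as FLOAT.
    if types == {"INTEGER", "FLOAT"}:
        return "FLOAT"
    return "STRING"
```

Comparing `infer_column_type` results for the same column in source and target is one way a `SCHEMA_DATA_TYPE` mismatch could be detected.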

## What This Project Does

- Compares source vs target files using row-level and cell-level checks.
- Supports CSV and Excel (`.xlsx`, `.xlsm`, `.xltx`) inputs.
- Supports single-sheet and multi-sheet validation (via sheet mapping).
- Produces a styled, filterable HTML report with KPI summary cards.
- Includes repeatable batch scripts for common mutation and validation scenarios.

## Current Validation Modes

- `ROWCOUNT_VALIDATION`: checks source/target data row counts.
- `ROWCOUNT`: compatibility alias of `ROWCOUNT_VALIDATION`.
- `ROW_COL_VALIDATION`: checks headers and row/column values.
- `SCHEMA_VALIDATION`: checks column count, column names (order-sensitive), and inferred data types.
- Combined mode: pass multiple as comma-separated values:
  - `ROWCOUNT_VALIDATION,ROW_COL_VALIDATION`
  - `SCHEMA_VALIDATION,ROW_COL_VALIDATION`
  - `ROWCOUNT_VALIDATION,SCHEMA_VALIDATION,ROW_COL_VALIDATION`
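
To illustrate how a comma-separated value and the `ROWCOUNT` alias might be resolved to canonical names, here is a minimal sketch. The function name `normalize_validation_types`, the upper-casing of input, and the de-duplication behavior are assumptions; only the alias and the three mode names come from this README.

```python
# Hypothetical normalization helper; not DVG's actual implementation.
ALIASES = {"ROWCOUNT": "ROWCOUNT_VALIDATION"}
SUPPORTED = {"ROWCOUNT_VALIDATION", "ROW_COL_VALIDATION", "SCHEMA_VALIDATION"}


def normalize_validation_types(raw: str) -> list[str]:
    """Split a comma-separated string into canonical validation type names."""
    result: list[str] = []
    for token in raw.split(","):
        name = token.strip().upper()  # assumed: input is case-insensitive
        if not name:
            continue
        name = ALIASES.get(name, name)
        if name not in SUPPORTED:
            raise ValueError(f"Unsupported validation type: {name}")
        if name not in result:  # assumed: duplicates are collapsed
            result.append(name)
    return result
```

With these assumptions, `ROWCOUNT,ROW_COL_VALIDATION` resolves to the same two modes as `ROWCOUNT_VALIDATION,ROW_COL_VALIDATION`.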

## Key Features in Current Implementation

- Header mismatch detection:
  - header length mismatches
  - header name mismatches
- Row alignment using preferred key columns:
  - `employee_id`, `id`, `emp_id`, `record_id`, `pk`
  - falls back to first column if no preferred key exists
- Mismatch classification:
  - `CELL` - cell value mismatch
  - `SRC_ONLY` - value in source only
  - `TGT_ONLY` - value in target only
  - `HEADER_LENGTH` - header column count mismatch
  - `HEADER_NAME` - header name mismatch
  - `ROWCOUNT` - row count mismatch
  - `SCHEMA_COLUMN_COUNT` - schema column count mismatch
  - `SCHEMA_COLUMN_NAME` - schema column name mismatch
  - `SCHEMA_DATA_TYPE` - schema data type mismatch (INTEGER, FLOAT, BOOLEAN, DATE, STRING)
- HTML report KPIs:
  - SRC Count
  - TGT Count
  - PASSED
  - FAILED
  - Pass Rate
  - Failed Rate
  - SRC Only
  - TGT Only
- Per-column filter inputs in mismatch table for quick triage.
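
The key-based row alignment and mismatch classification listed above can be sketched roughly as follows. The function names, the exact (case-sensitive) matching of key names in the listed priority order, the dict-per-row representation, and the treatment of `SRC_ONLY`/`TGT_ONLY` as whole-row conditions are assumptions for illustration, not a description of DVG's internals.

```python
# Preferred key columns as listed in this README, in assumed priority order.
PREFERRED_KEYS = ("employee_id", "id", "emp_id", "record_id", "pk")


def pick_key_column(headers: list[str]) -> str:
    """Pick the first preferred key present, else fall back to the first column."""
    for key in PREFERRED_KEYS:
        if key in headers:
            return key
    return headers[0]


def classify_rows(src_rows: list[dict], tgt_rows: list[dict], key: str) -> list[tuple]:
    """Return (mismatch_type, key_value, column) tuples for two row sets."""
    src_by_key = {row[key]: row for row in src_rows}
    tgt_by_key = {row[key]: row for row in tgt_rows}
    mismatches = []
    for key_value, src in src_by_key.items():
        tgt = tgt_by_key.get(key_value)
        if tgt is None:
            mismatches.append(("SRC_ONLY", key_value, None))
            continue
        for column, value in src.items():
            if tgt.get(column) != value:
                mismatches.append(("CELL", key_value, column))
    for key_value in tgt_by_key:
        if key_value not in src_by_key:
            mismatches.append(("TGT_ONLY", key_value, None))
    return mismatches
```

Aligning on a key rather than on row position means that a single inserted or deleted row shows up as one `SRC_ONLY` or `TGT_ONLY` entry instead of shifting every later row into a `CELL` mismatch.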

## Requirements

- Python 3.9+
- Packages:
  - `openpyxl`
  - `pytest` (for tests)
  - `python-dotenv`

Install dependencies:

```bash
pip install -r requirements.txt
```

## Quick Start (Windows Batch Flow)

From project root:

```bat
scripts\001_env.bat
scripts\002_activate.bat
scripts\003_setup.bat
```

Run all mutation scenarios:

```bat
scripts\004_run.bat
```

Run a DVG validation and generate HTML:

```bat
scripts\dvg.bat
```

Run sheet mapping validation (Excel to Excel):

```bat
scripts\006_run_sheet_mapping.bat
```

Deactivate venv:

```bat
scripts\008_deactivate.bat
```

## CLI Usage

### Version Information

Check the installed version:

```bash
python dvg.py --version
# or if installed as package:
dvg --version
```
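
The version can also be looked up from Python using the standard library. This is a minimal sketch, assuming the package is installed under the distribution name `data-validation-gini` given in the package metadata; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "data-validation-gini") -> Optional[str]:
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None
```

When the package is importable, `data_validation_gini.__version__` provides the same information directly.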

### DVG Validator

```bash
python dvg.py \
  --src-kind csv \
  --tgt-kind csv \
  --src-path inputs/employees.csv \
  --tgt-path outputs/employees.csv \
  --validation-type ROWCOUNT_VALIDATION,ROW_COL_VALIDATION \
  --html-output "output/report_<datetime>.html"
```

Legacy compatibility mode is still available:

```bash
python dvg.py \
  --file-type CSV \
  --src-path inputs/employees.csv \
  --tgt-path outputs/employees.csv \
  --validation-type ROWCOUNT,ROW_COL_VALIDATION
```

Optional arguments:

- `--src-sheet <sheet_name>`
- `--tgt-sheet <sheet_name>`
- `--sheet-mapping "SRC1:TGT1,SRC2:TGT2"`
- `--chunk-size <positive_int>` (default: `1000`)
- `--src-db-alias <alias>`, `--tgt-db-alias <alias>`
- `--src-env <env>`, `--tgt-env <env>`, `--allow-cross-env`
- `--max-mismatches <int>`
- `--key-mode <AUTO|PRIMARY_KEY|COLUMNS|GROUP_CANONICAL|HASH>`
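
The `--sheet-mapping` value is a list of `SOURCE:TARGET` pairs. As a rough sketch of how such a string could be parsed into a lookup table (the function name `parse_sheet_mapping`, the whitespace trimming, and the error handling are assumptions, not DVG's actual parser):

```python
def parse_sheet_mapping(raw: str) -> dict[str, str]:
    """Parse 'SRC1:TGT1,SRC2:TGT2' into {'SRC1': 'TGT1', 'SRC2': 'TGT2'}."""
    mapping: dict[str, str] = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # assumed: empty segments are ignored
        src, sep, tgt = pair.partition(":")
        src, tgt = src.strip(), tgt.strip()
        if not sep or not src or not tgt:
            raise ValueError(f"Invalid sheet mapping entry: {pair!r}")
        mapping[src] = tgt
    return mapping
```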

Notes:

- `--sheet-mapping` is supported only for Excel file pairs.
- Provide either `--file-type` or both `--src-kind` and `--tgt-kind`.
- `--file-type` remains supported for backward compatibility.
- DB kind declarations include `sqlserver` and `oracle`, but the current implementation supports DB execution only for `sqlite`, `postgresql`, and `mysql`.
- Mixed file<->DB validation in a single run is not implemented yet.
- The `<datetime>` token in `--html-output` is replaced at runtime with a `YYYYMMDD_HHMMSS` timestamp. Quote the value in POSIX shells so that `<` and `>` are not treated as redirections.
- `--chunk-size` controls the number of data rows read per batch for CSV/XLSX loading.
- `--max-mismatches` caps the number of mismatch details included in the console preview and the HTML report.
- Console output shows chunk progress while loading source and target data: total chunks, current chunk, and a completion summary.
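
For reference, the `<datetime>` substitution described in the notes above corresponds to the `strftime` pattern `%Y%m%d_%H%M%S`. A minimal sketch, with the hypothetical helper name `resolve_output_path` and an optional `now` parameter added only to make the example testable:

```python
from datetime import datetime
from typing import Optional


def resolve_output_path(template: str, now: Optional[datetime] = None) -> str:
    """Replace the <datetime> token with a YYYYMMDD_HHMMSS timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return template.replace("<datetime>", stamp)
```

Because the timestamp sorts lexicographically in chronological order, reports generated this way list in run order in a directory view.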

Large-file tuning tip:

- Start with `--chunk-size 1000` (default), then increase to `2000` or `5000` for faster reads if memory allows.
- In `dvg.bat`, set `CHUNK_SIZE` in the config block to tune batch size without changing CLI commands.
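
To make the memory/speed trade-off concrete, the sketch below reads a CSV file in batches of `chunk_size` data rows using only the standard library. It is an illustration of chunked reading in general; the generator name `read_csv_in_chunks` and the UTF-8 encoding are assumptions, and DVG's own loader may work differently.

```python
import csv
from itertools import islice
from typing import Iterator


def read_csv_in_chunks(path: str, chunk_size: int = 1000) -> Iterator[tuple]:
    """Yield (header, rows) pairs with at most chunk_size data rows each."""
    # Because this is a generator, the check runs when iteration starts.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return  # empty file: nothing to yield
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield header, chunk
```

Only one chunk is held in memory at a time, so a larger `chunk_size` reduces the number of batches at the cost of a higher peak memory footprint.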

### Installed CLI Entry Point

If installed as a package, you can run:

```bash
dvg --src-kind csv --tgt-kind csv --src-path ... --tgt-path ... --validation-type ROWCOUNT_VALIDATION
```

## Data Mutation Utility (`data_corruptor.py`)

Use this utility to generate controlled data drift before validation.

Example:

```bash
python data_corruptor.py \
  --input inputs/employees.csv \
  --output outputs/employees_typos.csv \
  --column email \
  --percentage 1.0 \
  --type typo
```

## Batch Scripts for Mutation Scenarios

Located in the `scripts/` folder:

- `run_case_swap.bat` - Swap character cases
- `run_date_shift.bat` - Shift dates by random days
- `run_nullify.bat` - Replace values with NULL/empty
- `run_numeric_shift.bat` - Shift numeric values
- `run_typo.bat` - Introduce character typos

Example:

```bat
scripts\run_case_swap.bat
```

Supported mutation types:

- `nullify`
  - Replaces selected values with blank strings.
  - Purpose: validate missing-value detection.
- `case_swap`
  - Swaps letter casing in selected values.
  - Purpose: validate case sensitivity behavior.
- `numeric_shift`
  - Adds/subtracts a numeric offset (`--value`).
  - Purpose: validate precision and tolerance checks.
- `date_shift`
  - Shifts date/datetime values by day count (`--value`).
  - Supported formats: `YYYY-MM-DD`, `YYYY-MM-DD HH:MM:SS`.
  - Purpose: validate temporal drift handling.
- `typo`
  - Randomly replaces one character in selected strings.
  - Purpose: validate strict text/hash mismatch detection.
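
Three of the mutation types above can be approximated with a few lines of standard-library Python. This sketch is independent of `data_corruptor.py`: the function names, the choice of lowercase ASCII letters as typo replacements, and the pass-through of unparseable dates are assumptions for illustration.

```python
import random
import string
from datetime import datetime, timedelta

# Date formats listed for date_shift in this README.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def case_swap(value: str) -> str:
    """Swap the case of every letter."""
    return value.swapcase()


def typo(value: str, rng: random.Random) -> str:
    """Replace one randomly chosen character with a different lowercase letter."""
    if not value:
        return value
    index = rng.randrange(len(value))
    candidates = [c for c in string.ascii_lowercase if c != value[index]]
    return value[:index] + rng.choice(candidates) + value[index + 1:]


def date_shift(value: str, days: int) -> str:
    """Shift a date or datetime string by a number of days, keeping its format."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        return (parsed + timedelta(days=days)).strftime(fmt)
    return value  # assumed: values in other formats are left unchanged
```

Passing a seeded `random.Random` instance to `typo` keeps mutated test data reproducible between runs.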

## Sample Scenario Scripts

Each of the `run_*.bat` scenario scripts listed above mutates `inputs/employees.csv` into a corresponding file under `outputs/`.

## Reports

Generated reports are written under `output/` and include:

- high-level pass/fail status
- validation metadata (source, target, validation type, timestamp)
- KPI cards
- detailed mismatch table with filters

## Tests

Run tests with:

```bash
pytest
```

### Local Test Scripts

Windows:

```bat
scripts\005_run_unit_tests.bat
scripts\005_run_code_cov.bat
```

Linux/macOS:

```bash
bash scripts/005_run_unit_tests.sh
bash scripts/005_run_code_cov.sh
```

Coverage command used by the scripts:

```bash
python -m pytest --cov=data_validation_gini --cov-report=term-missing --cov-report=html
```

Current target and baseline: **100% coverage** for package modules under `src/data_validation_gini`.

## Security Audits

The project includes comprehensive security scanning with automated HTML report generation. See [docs/security/SECURITY_AUDITS.md](docs/security/SECURITY_AUDITS.md) for detailed documentation.

### Quick Start

Run all security audits:

```bat
scripts\013_run_all_security_audits.bat
```

Or on Linux/macOS:

```bash
bash scripts/013_run_all_security_audits.sh
```

Individual audit scripts:

- `scripts/010_run_pip_audit.bat` - Scan Python dependencies for known vulnerabilities
- `scripts/011_run_trivy_audit.bat` - Scan filesystem for misconfigurations and secrets
- `scripts/012_run_gitleaks_audit.bat` - Detect accidentally committed secrets

**Reports Generated:**
- `audits/pip_audit_report.html` - Dependency vulnerability report
- `audits/trivy_fs_report.html` - Filesystem audit report
- `audits/gitleaks_report.html` - Secret detection report

**Install Security Tools:**

```bash
# Windows (Chocolatey)
choco install trivy gitleaks
pip install pip-audit

# macOS (Homebrew)
brew install trivy gitleaks
pip install pip-audit
```

See [docs/security/SECURITY_AUDITS.md](docs/security/SECURITY_AUDITS.md) for:
- Detailed tool documentation
- CI/CD integration examples
- Troubleshooting guides
- Report interpretation tips

## Project Structure (High Level)

### Core Files

- `src/data_validation_gini/dvg.py` - validation CLI implementation
- `src/data_validation_gini/dvg_report.py` - HTML report generation
- `src/data_validation_gini/data_corruptor.py` - mutation utility implementation
- `src/data_validation_gini/dvg_db.py` - database connectivity and table loading
- `src/data_validation_gini/file_stores.py` - INI/JSON file reader-writer classes
- `dvg.py`, `dvg_db.py`, `dvg_report.py`, `data_corruptor.py` - root compatibility wrappers
- `README.md` - Main documentation
- `docs/CONTRIBUTING.md` - contributor workflow and repository boundaries
- `docs/security/SECURITY_AUDITS.md` - Security audit scripts documentation

### Scripts Folder (`scripts/`)

**Setup & Environment:**

- `001_env.bat/sh` - Python environment setup
- `002_activate.bat/sh` - Activate virtual environment
- `003_setup.bat/sh` - Install dependencies
- `008_deactivate.bat/sh` - Deactivate virtual environment

**Domain Implementations:**

- `scripts/data/` - operational data workflows (mutations, sheet mapping, DB startup/seed/compare)
- `scripts/testing/` - local test and coverage workflows
- `scripts/security/` - security audit workflows and consolidated run

**Compatibility Wrappers (root scripts):**

- Existing root scripts remain valid (for example `004_run.bat`, `005_run_unit_tests.bat`, `010_run_pip_audit.bat`).
- Each wrapper forwards to the new domain script path so existing entrypoints and automation remain unchanged.

**Validation & CLI:**

- `dvg.bat/sh` - Run DVG validation

### Directories

- `inputs/` - baseline sample datasets
- `outputs/` - mutated sample datasets
- `output/` - generated validation report files
- `audits/` - generated security audit reports (JSON & HTML)
- `tests/` - unit tests
- `data_validation_gini.egg-info/` - package metadata

## License

MIT
