Metadata-Version: 2.4
Name: data-validation-gini
Version: 0.1.6
Summary: Data Validation Gini (DVG) CLI for row count and row/column comparison with HTML reports
Author: ShanKonduru
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: openpyxl

# Data Validation Gini (DVG)

Data Validation Gini is a lightweight Python CLI for validating source and target datasets and generating a rich HTML reconciliation report.

The repository also includes a CSV data mutation utility (`data_corruptor.py`) to create controlled mismatches for validation testing.

## What This Project Does

- Compares source vs target files using row-level and cell-level checks.
- Supports CSV and Excel (`.xlsx`, `.xlsm`, `.xltx`) inputs.
- Supports single-sheet and multi-sheet validation (via sheet mapping).
- Produces a styled, filterable HTML report with KPI summary cards.
- Includes repeatable batch scripts for common mutation and validation scenarios.

## Current Validation Modes

- `ROWCOUNT`: checks source/target data row counts.
- `ROW_COL_VALIDATION`: checks headers and row/column values.
- Combined mode: pass both as comma-separated values:
  - `ROWCOUNT,ROW_COL_VALIDATION`

## Key Features in Current Implementation

- Header mismatch detection:
  - header length mismatches
  - header name mismatches
- Row alignment using preferred key columns:
  - `employee_id`, `id`, `emp_id`, `record_id`, `pk`
  - falls back to first column if no preferred key exists
- Mismatch classification:
  - `CELL`
  - `SRC_ONLY`
  - `TGT_ONLY`
  - `HEADER_LENGTH`
  - `HEADER_NAME`
  - `ROWCOUNT`
- HTML report KPIs:
  - SRC Count
  - TGT Count
  - PASSED
  - FAILED
  - Pass Rate
  - Failed Rate
  - SRC Only
  - TGT Only
- Per-column filter inputs in mismatch table for quick triage.

## Requirements

- Python 3.9+
- Packages:
  - `openpyxl`
  - `pytest` (for tests)
  - `python-dotenv`

Install dependencies:

```bash
pip install -r requirements.txt
```

## Quick Start (Windows Batch Flow)

From project root:

```bat
001_env.bat
002_activate.bat
003_setup.bat
```

Run all mutation scenarios:

```bat
004_run.bat
```

Run a DVG validation and generate HTML:

```bat
dvg.bat
```

Run sheet mapping validation (Excel to Excel):

```bat
006_run_sheet_mapping.bat
```

Deactivate venv:

```bat
008_deactivate.bat
```

## CLI Usage

### DVG Validator

```bash
python dvg.py \
  --file-type EXCEL \
  --src-path inputs/employees.csv \
  --tgt-path outputs/employees.csv \
  --validation-type ROWCOUNT,ROW_COL_VALIDATION \
  --html-output output/report_<datetime>.html
```

Optional arguments:

- `--src-sheet <sheet_name>`
- `--tgt-sheet <sheet_name>`
- `--sheet-mapping "SRC1:TGT1,SRC2:TGT2"`
- `--chunk-size <positive_int>` (default: `1000`)

Notes:

- `--sheet-mapping` is supported only for Excel file pairs.
- `--file-type` currently accepts `EXCEL` (for both CSV and Excel processing paths).
- `<datetime>` token in `--html-output` is replaced at runtime with `YYYYMMDD_HHMMSS`.
- `--chunk-size` controls the number of data rows read per batch for CSV/XLSX loading.
- Console output now shows chunk progress for source/target loading: total chunks, current chunk, and completion summary.

Large-file tuning tip:

- Start with `--chunk-size 1000` (default), then increase to `2000` or `5000` for faster reads if memory allows.
- In `dvg.bat`, set `CHUNK_SIZE` in the config block to tune batch size without changing CLI commands.

### Installed CLI Entry Point

If installed as a package, you can run:

```bash
dvg --file-type EXCEL --src-path ... --tgt-path ... --validation-type ROWCOUNT
```

## Data Mutation Utility (`data_corruptor.py`)

Use this utility to generate controlled data drift before validation.

Example:

```bash
python data_corruptor.py \
  --input inputs/employees.csv \
  --output outputs/employees_typos.csv \
  --column email \
  --percentage 1.0 \
  --type typo
```

Supported mutation types:

- `nullify`
  - Replaces selected values with blank strings.
  - Purpose: validate missing-value detection.
- `case_swap`
  - Swaps letter casing in selected values.
  - Purpose: validate case sensitivity behavior.
- `numeric_shift`
  - Adds/subtracts a numeric offset (`--value`).
  - Purpose: validate precision and tolerance checks.
- `date_shift`
  - Shifts date/datetime values by day count (`--value`).
  - Supported formats: `YYYY-MM-DD`, `YYYY-MM-DD HH:MM:SS`.
  - Purpose: validate temporal drift handling.
- `typo`
  - Randomly replaces one character in selected strings.
  - Purpose: validate strict text/hash mismatch detection.

## Sample Scenario Scripts

- `run_case_swap.bat`
- `run_date_shift.bat`
- `run_nullify.bat`
- `run_numeric_shift.bat`
- `run_typo.bat`

Each script mutates `inputs/employees.csv` into a corresponding file under `outputs/`.

## Reports

Generated reports are written under `output/` and include:

- high-level pass/fail status
- validation metadata (source, target, validation type, timestamp)
- KPI cards
- detailed mismatch table with filters

## Tests

Run tests with:

```bash
pytest
```

## Project Structure (High Level)

- `dvg.py` - validation CLI
- `dvg_report.py` - HTML report generation
- `data_corruptor.py` - mutation utility
- `inputs/` - baseline sample datasets
- `outputs/` - mutated sample datasets
- `output/` - generated report files
- `tests/` - unit tests

## License

MIT
