Metadata-Version: 2.4
Name: data-validation-gini
Version: 0.1.7
Summary: Data Validation Gini (DVG) CLI for row count and row/column comparison with HTML reports
Author: ShanKonduru
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: openpyxl
Requires-Dist: mcp

# Data Validation Gini (DVG)

Data Validation Gini is a lightweight Python CLI for validating source and target datasets and generating a rich HTML reconciliation report.

The repository also includes a CSV data mutation utility (`data_corruptor.py`) to create controlled mismatches for validation testing.

## What This Project Does

- Compares source vs target files using row-level and cell-level checks.
- Supports CSV and Excel (`.xlsx`, `.xlsm`, `.xltx`) inputs.
- Supports single-sheet and multi-sheet validation (via sheet mapping).
- Produces a styled, filterable HTML report with KPI summary cards.
- Includes repeatable batch scripts for common mutation and validation scenarios.

## Current Validation Modes

- `ROWCOUNT`: checks source/target data row counts.
- `ROW_COL_VALIDATION`: checks headers and row/column values.
- Combined mode: pass both as comma-separated values:
  - `ROWCOUNT,ROW_COL_VALIDATION`

## Key Features in Current Implementation

- Header mismatch detection:
  - header length mismatches
  - header name mismatches
- Row alignment using preferred key columns:
  - `employee_id`, `id`, `emp_id`, `record_id`, `pk`
  - falls back to first column if no preferred key exists
- Mismatch classification:
  - `CELL`
  - `SRC_ONLY`
  - `TGT_ONLY`
  - `HEADER_LENGTH`
  - `HEADER_NAME`
  - `ROWCOUNT`
- HTML report KPIs:
  - SRC Count
  - TGT Count
  - PASSED
  - FAILED
  - Pass Rate
  - Failed Rate
  - SRC Only
  - TGT Only
- Per-column filter inputs in mismatch table for quick triage.

## Requirements

- Python 3.9+
- Packages:
  - `openpyxl`
  - `pytest` (for tests)
  - `python-dotenv`

Install dependencies:

```bash
pip install -r requirements.txt
```

## Quick Start (Windows Batch Flow)

From project root:

```bat
001_env.bat
002_activate.bat
003_setup.bat
```

Run all mutation scenarios:

```bat
004_run.bat
```

Run a DVG validation and generate HTML:

```bat
dvg.bat
```

Run sheet mapping validation (Excel to Excel):

```bat
006_run_sheet_mapping.bat
```

Deactivate venv:

```bat
008_deactivate.bat
```

## CLI Usage

### DVG Validator

```bash
python dvg.py \
  --file-type EXCEL \
  --src-path inputs/employees.csv \
  --tgt-path outputs/employees.csv \
  --validation-type ROWCOUNT,ROW_COL_VALIDATION \
  --html-output output/report_<datetime>.html
```

Optional arguments:

- `--src-sheet <sheet_name>`
- `--tgt-sheet <sheet_name>`
- `--sheet-mapping "SRC1:TGT1,SRC2:TGT2"`
- `--chunk-size <positive_int>` (default: `1000`)

Notes:

- `--sheet-mapping` is supported only for Excel file pairs.
- `--file-type` currently accepts `EXCEL` (for both CSV and Excel processing paths).
- `<datetime>` token in `--html-output` is replaced at runtime with `YYYYMMDD_HHMMSS`.
- `--chunk-size` controls the number of data rows read per batch for CSV/XLSX loading.
- Console output now shows chunk progress for source/target loading: total chunks, current chunk, and completion summary.

Large-file tuning tip:

- Start with `--chunk-size 1000` (default), then increase to `2000` or `5000` for faster reads if memory allows.
- In `dvg.bat`, set `CHUNK_SIZE` in the config block to tune batch size without changing CLI commands.

### Installed CLI Entry Point

If installed as a package, you can run:

```bash
dvg --file-type EXCEL --src-path ... --tgt-path ... --validation-type ROWCOUNT
```

## Data Mutation Utility (`data_corruptor.py`)

Use this utility to generate controlled data drift before validation.

Example:

```bash
python data_corruptor.py \
  --input inputs/employees.csv \
  --output outputs/employees_typos.csv \
  --column email \
  --percentage 1.0 \
  --type typo
```

Supported mutation types:

- `nullify`
  - Replaces selected values with blank strings.
  - Purpose: validate missing-value detection.
- `case_swap`
  - Swaps letter casing in selected values.
  - Purpose: validate case sensitivity behavior.
- `numeric_shift`
  - Adds/subtracts a numeric offset (`--value`).
  - Purpose: validate precision and tolerance checks.
- `date_shift`
  - Shifts date/datetime values by day count (`--value`).
  - Supported formats: `YYYY-MM-DD`, `YYYY-MM-DD HH:MM:SS`.
  - Purpose: validate temporal drift handling.
- `typo`
  - Randomly replaces one character in selected strings.
  - Purpose: validate strict text/hash mismatch detection.

## Sample Scenario Scripts

- `run_case_swap.bat`
- `run_date_shift.bat`
- `run_nullify.bat`
- `run_numeric_shift.bat`
- `run_typo.bat`

Each script mutates `inputs/employees.csv` into a corresponding file under `outputs/`.

## Reports

Generated reports are written under `output/` and include:

- high-level pass/fail status
- validation metadata (source, target, validation type, timestamp)
- KPI cards
- detailed mismatch table with filters

## Tests

Run tests with:

```bash
pytest
```

## MCP Server

This project now ships a small MCP server for the CLI. Start it with:

```bash
dvg-mcp
```

The server exposes four tools:

- `run_validation` - run the existing file comparison workflow and return a structured result.
- `preview_input` - inspect a CSV or Excel file without loading the full dataset.
- `mutate_data` - create a controlled CSV mutation using the same corruption rules as the CLI helper.
- `get_last_report` - read the latest HTML report and return the KPI summary.

### IDE Setup

#### VS Code

**Option 1: Using `.vscode/settings.json`**

Create or edit `.vscode/settings.json` in your workspace:

```json
{
  "github.copilot.codeium.enabled": true,
  "mcp.servers": [
    {
      "name": "data-validation-gini",
      "command": "dvg-mcp",
      "cwd": "c:\\MyProjects\\data-validation-gini",
      "transport": "stdio",
      "disabled": false
    }
  ]
}
```

**Option 2: Using VS Code MCP Extension Settings**

1. Open Command Palette (`Ctrl+Shift+P`)
2. Search for "MCP: Add Server"
3. Configure with:
   - **Name:** `data-validation-gini`
   - **Command:** `dvg-mcp`
   - **Working Directory:** `c:\MyProjects\data-validation-gini`
   - **Transport:** `stdio`

**Option 3: Using Copilot Chat Extension Settings**

Edit `settings.json` with Copilot-specific MCP configuration:

```json
{
  "chat.mcp.servers": [
    {
      "name": "data-validation-gini",
      "command": "dvg-mcp",
      "cwd": "c:\\MyProjects\\data-validation-gini",
      "args": [],
      "env": {
        "PYTHONPATH": "c:\\MyProjects\\data-validation-gini"
      }
    }
  ]
}
```

#### Cursor

**Using `cursor_settings.json`**

Edit your Cursor settings file (usually in `%APPDATA%\Cursor\User\settings.json` on Windows):

```json
{
  "mcp.servers": [
    {
      "name": "data-validation-gini",
      "command": "dvg-mcp",
      "cwd": "c:\\MyProjects\\data-validation-gini",
      "transport": "stdio",
      "timeout": 30000
    }
  ]
}
```

Alternatively, use Cursor's GUI:
1. Open Cursor Settings
2. Navigate to "MCP Servers"
3. Click "Add Server"
4. Enter the configuration above

#### Claude Desktop

**Using `claude_desktop_config.json`**

Edit `%APPDATA%\Claude\claude_desktop_config.json` on Windows:

```json
{
  "mcpServers": {
    "data-validation-gini": {
      "command": "dvg-mcp",
      "args": [],
      "cwd": "c:\\MyProjects\\data-validation-gini",
      "env": {
        "PYTHONPATH": "c:\\MyProjects\\data-validation-gini"
      }
    }
  }
}
```

#### JetBrains IDEs (PyCharm, IntelliJ IDEA)

**Using IDE Settings (MCP Plugin)**

If using a JetBrains MCP integration plugin:

1. Open **Settings** → **Tools** → **MCP Servers** (or similar)
2. Click **Add** and configure:

```json
{
  "type": "custom",
  "name": "data-validation-gini",
  "command": "dvg-mcp",
  "workingDirectory": "c:\\MyProjects\\data-validation-gini",
  "stdio": true,
  "disabled": false,
  "environment": {
    "PYTHONPATH": "c:\\MyProjects\\data-validation-gini"
  }
}
```

#### Neovim (with MCP Client Plugin)

**Using `neovim/init.lua` or MCP plugin config**

Example for a Neovim MCP plugin:

```lua
require('mcp').register_server({
  name = "data-validation-gini",
  command = "dvg-mcp",
  cwd = "c:\\MyProjects\\data-validation-gini",
  transport = "stdio"
})
```

Or in YAML if using a config file:

```yaml
servers:
  - name: data-validation-gini
    command: dvg-mcp
    cwd: c:\MyProjects\data-validation-gini
    transport: stdio
```

#### Generic MCP Clients (Python, Node.js, etc.)

**For Python clients:**

```python
import subprocess

mcp_server = {
    "name": "data-validation-gini",
    "command": "dvg-mcp",
    "args": [],
    "cwd": "c:\\MyProjects\\data-validation-gini",
    "transport": "stdio"
}

# Start server
process = subprocess.Popen(
    [mcp_server["command"]] + mcp_server.get("args", []),
    cwd=mcp_server["cwd"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True
)
```

**For Node.js/JavaScript clients:**

```javascript
const { spawn } = require('child_process');

const mcpServer = {
  name: 'data-validation-gini',
  command: 'dvg-mcp',
  cwd: 'c:\\MyProjects\\data-validation-gini',
  transport: 'stdio'
};

const process = spawn(mcpServer.command, [], {
  cwd: mcpServer.cwd,
  stdio: ['pipe', 'pipe', 'pipe']
});
```

#### Other IDEs and MCP clients

1. Use any IDE or assistant that supports MCP over `stdio`.
2. Register the server command as `dvg-mcp`.
3. Set the working directory to the repository root so relative paths like `inputs/` and `output/` resolve correctly.
4. Make sure the project dependencies are installed before launching the server.

**Key Configuration Properties:**

| Property | Value | Required | Notes |
|----------|-------|----------|-------|
| `command` | `dvg-mcp` | Yes | The entry point for the MCP server |
| `cwd` / `workingDirectory` | `c:\MyProjects\data-validation-gini` | Yes | Path to project root (enables relative file paths) |
| `transport` | `stdio` | Yes | Communication protocol (HTTP and other protocols not supported) |
| `timeout` | `30000` | No | Timeout in milliseconds (default: 30s) |
| `disabled` | `false` | No | Set to `true` to temporarily disable the server |
| `env.PYTHONPATH` | Project root path | No | Helps Python resolve imports correctly |

### Natural Language Usage

You can talk to the server in plain English and let the client translate that into tool calls.

Example requests:

- "Compare these two CSV files with chunk size 5000 and save a report."
- "Preview the first 5 rows of this XLSX sheet before I validate it."
- "Mutate the email column in this CSV using the typo mode at 1%."
- "Show me the latest report summary and pass/fail counts."
- "Validate this Excel workbook with the departments sheet mapped to departments."
- "Run a row-count check only and use the default chunk size."

The server defaults to chunk size `1000` when you do not specify one.

## Project Structure (High Level)

- `dvg.py` - validation CLI
- `dvg_report.py` - HTML report generation
- `data_corruptor.py` - mutation utility
- `dvg_mcp.py` - MCP server for the CLI
- `inputs/` - baseline sample datasets
- `outputs/` - mutated sample datasets
- `output/` - generated report files
- `tests/` - unit tests

## License

MIT
