Metadata-Version: 2.4
Name: mip-data-validator
Version: 0.0.1
Summary: Validate MIP data-model folders using DuckDB.
Author: MIP Team
Requires-Python: >=3.10,<3.11
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Requires-Dist: click (>=8.1,<8.2)
Requires-Dist: duckdb (>=1.1,<1.2)
Description-Content-Type: text/markdown

# data-validator

`data-validator` validates MIP data-model folders using DuckDB.

## Standalone package setup

From the repository root:

```bash
cd data-validator/mip-data-validator
poetry install
```

## Command

```bash
poetry run data-validator validate-data-model /path/to/data_model_folder
```

Run as a Python module:

```bash
poetry run python -m data_validator validate-data-model /path/to/data_model_folder
```

Optionally set the number of threads:

```bash
poetry run data-validator validate-data-model /path/to/data_model_folder --threads 8
```

Collect all errors and emit NDJSON:

```bash
poetry run data-validator validate-data-model /path/to/data_model_folder --report-all --format ndjson
```

Write HTML report to a file:

```bash
poetry run data-validator validate-data-model /path/to/data_model_folder --report-all --format html --output report.html
```

If `--format html` is used without `--output`, the report is automatically written under `/tmp` and the path is printed.

## Folder Layout

```text
/path/to/data_model_folder/
  CDEsMetadata.json
  dataset1.csv
  dataset2.csv
```

## Validation Notes

- CSV validation queries files directly with DuckDB and uses fused aggregate checks to reduce scan overhead.
- Folder-level dataset uniqueness is enforced across all CSV files via SQL using normalized codes (`trim + lower`).
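The two checks above can be sketched as SQL patterns. This is a minimal, hypothetical illustration, not the package's actual queries: table and column names (`dataset1`, `subject_code`, `age`) and the range bounds are invented for the example, and stdlib `sqlite3` stands in for DuckDB so the sketch is self-contained (the same SQL runs in DuckDB against CSV files).

```python
import sqlite3

# In-memory stand-in for two CSV files in a data-model folder.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dataset1 (subject_code TEXT, age REAL);
    INSERT INTO dataset1 VALUES (' S1 ', 34.0), ('s2', NULL), ('S3', 250.0);
    CREATE TABLE dataset2 (subject_code TEXT, age REAL);
    INSERT INTO dataset2 VALUES ('S1', 41.0);
""")

# Fused aggregate check: several per-column validations computed in a
# single scan instead of one query (and one scan) per rule.
nulls, below_min, above_max = con.execute("""
    SELECT
        SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END),
        SUM(CASE WHEN age < 0   THEN 1 ELSE 0 END),
        SUM(CASE WHEN age > 120 THEN 1 ELSE 0 END)
    FROM dataset1
""").fetchone()
print(nulls, below_min, above_max)  # 1 0 1

# Folder-level uniqueness: normalize codes with trim + lower, then flag
# codes appearing more than once across all files.
dupes = con.execute("""
    SELECT code, COUNT(*) AS n FROM (
        SELECT lower(trim(subject_code)) AS code FROM dataset1
        UNION ALL
        SELECT lower(trim(subject_code)) AS code FROM dataset2
    ) AS codes
    GROUP BY code HAVING n > 1
""").fetchall()
print(dupes)  # [('s1', 2)]
```

Fusing the aggregates matters because each extra query re-scans the CSV; combining the rules for one column into a single `SELECT` keeps the number of scans constant regardless of how many rules apply.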

