Metadata-Version: 2.4
Name: polaguard
Version: 0.2.0
Summary: Lightning-fast local data contract validation via Polars & Pydantic V2.
Author-email: Osemwingie Osadolor <osadolor102@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/osadose01/polaguard
Project-URL: Repository, https://github.com/osadose01/polaguard
Project-URL: Issues, https://github.com/osadose01/polaguard/issues
Project-URL: Changelog, https://github.com/osadose01/polaguard/blob/main/CHANGELOG.md
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.9.0
Requires-Dist: polars>=1.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# polaguard

**Lightning-fast local data contract validation for CSV, Parquet, and JSON files.**  
Built to protect local pipelines, development loops, and CI/CD runners before data reaches the cloud.

[![Python](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/osadose01/polaguard/blob/main/LICENSE)
[![Powered by Polars](https://img.shields.io/badge/powered%20by-Polars-orange.svg)](https://pola.rs)

## ⚡ Why Polaguard?
Most data contract engines are database-centric, slow to connect, and heavy. **Polaguard** shifts data quality left:
*   **Zero Infrastructure:** No cloud dependencies, database logins, or heavy configuration.
*   **Blazing Fast:** Vectorized execution handling millions of rows in milliseconds using [Polars](https://pola.rs).
*   **Pipeline Native:** Designed to block git commits and GitHub Actions pipelines through its exit codes.

## 🚀 Quick Start

### 1. Install
```bash
pip install polaguard
```

### 2. Auto-Generate a Contract Schema
Point Polaguard at a known-good file. It infers the file's structure, formats, uniqueness, and null distributions and writes them to a contract.
```bash
polaguard init --file data/baseline.parquet --output contract.yaml
```

### 3. Check Incoming Batches
Validate incoming files against the contract:
```bash
polaguard check --file data/new_batch.csv --contract contract.yaml
```
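
If you need to call the same check from a Python script (for example, a custom pipeline step), one option is to shell out to the CLI. The sketch below is illustrative only: the helper names `build_check_command` and `batch_passes` are hypothetical, and it assumes that `polaguard check` exits with code `0` when the file satisfies the contract and non-zero otherwise.

```python
import subprocess


def build_check_command(data_file, contract):
    """Build the argument list for the `polaguard check` call shown above."""
    return [
        "polaguard", "check",
        "--file", str(data_file),
        "--contract", str(contract),
    ]


def batch_passes(data_file, contract):
    """Run the check and report success.

    Assumption: exit code 0 means the batch satisfied the contract.
    """
    completed = subprocess.run(build_check_command(data_file, contract))
    return completed.returncode == 0
```

Calling `batch_passes("data/new_batch.csv", "contract.yaml")` requires `polaguard` to be installed and on your `PATH`.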

### 4. Use the Python API
```python
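from pathlib import Path

from polaguard import validate_file

# Validate one batch file against the contract.
result = validate_file(Path("data/new_batch.csv"), Path("contract.yaml"))
if not result.is_valid:
    # Report the contract violations found in the batch.
    print(result.errors)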
```
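
To check several files against one contract, the same call can be wrapped in a loop. This is a minimal sketch, not part of the package: `collect_failures` is a hypothetical helper, and it relies only on the `is_valid` and `errors` attributes used above. The validator is passed in as an argument so the helper can be exercised without real data files.

```python
from pathlib import Path


def collect_failures(files, contract, validate):
    """Return a mapping of file name -> errors for every file that fails.

    `validate` is expected to behave like `polaguard.validate_file`:
    it takes (data_path, contract_path) and returns an object with
    `is_valid` and `errors` attributes.
    """
    failures = {}
    for name in files:
        result = validate(Path(name), Path(contract))
        if not result.is_valid:
            failures[str(name)] = result.errors
    return failures
```

In a real pipeline you would call it as `collect_failures(["data/a.csv", "data/b.csv"], "contract.yaml", validate_file)` after `from polaguard import validate_file`.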

### 5. Check CLI Version
```bash
polaguard --version
```

## 🛠️ Automated Integrations

### Pre-Commit Hooks
Catch structure-breaking data changes before they are committed. Add this to your `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/osadose01/polaguard
    rev: v0.2.0
    hooks:
      - id: polaguard
        args: ["check", "--file", "data/raw_inputs.csv", "--contract", "contract.yaml"]
```

## 📄 Schema Configuration & Constraints

Polaguard YAML contracts support dataset-level constraints, column validations, and custom SQL expression assertions.

### Dataset-level Constraints
* `min_columns` (integer): Minimum number of columns required.
* `min_rows` (integer): Minimum number of rows required.
* `allow_extra_columns` (boolean): Whether the dataset may contain columns that are not defined in the contract.

### Column-level Validations
Under `columns.<column_name>`:
* `type`: One of `int`, `float`, `str`, `bool`, `date`, `datetime`.
* `required` (boolean): Whether the check fails if the column is absent.
* `unique` (boolean): Whether all values in the column must be distinct.
* `null_threshold` (float between `0.0` and `1.0`): Permissible ratio of null values (e.g. `0.2` allows up to 20% nulls).
* `regex` (string): Regular expression that values in `str` columns must match.
* `allowed_values` (list): Defines an enum of permitted values.
* `min_value` / `max_value` (any): Lower and upper bounds (for numeric, date, and datetime columns).
* `min_length` / `max_length` (integer): Minimum and maximum length for string values.
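
As a worked illustration of `null_threshold`, the plain-Python sketch below computes the null ratio of a column and compares it with the threshold. It is not Polaguard's implementation (which runs on Polars); the function names are hypothetical, and treating an empty column as having a ratio of `0.0` is an assumption made here for simplicity.

```python
def null_ratio(values):
    """Fraction of entries that are None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(value is None for value in values) / len(values)


def passes_null_threshold(values, threshold):
    """True when the null ratio does not exceed the threshold."""
    return null_ratio(values) <= threshold


# One null out of five values is exactly 20%, which a threshold of 0.2 allows.
print(passes_null_threshold([1, None, 3, 4, 5], 0.2))
# Two nulls out of five values is 40%, which a threshold of 0.2 rejects.
print(passes_null_threshold([1, None, 3, None, 5], 0.2))
```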

### Custom SQL Expressions
Under `expressions`, define a list of arbitrary SQL expressions that Polars evaluates against the dataset:
```yaml
expressions:
  - "age >= 18"
  - "start_date < end_date"
  - "revenue - cost > 0"
```
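
To make the idea of a row-level SQL assertion concrete, the sketch below counts the rows that violate an expression. It deliberately uses Python's built-in `sqlite3` module as a stand-in for Polars' SQL engine, so it shows the concept rather than Polaguard's actual evaluation. The helper `count_violations`, the sample rows, and the rule that an expression passes only when no row violates it are all assumptions.

```python
import sqlite3


def count_violations(rows, columns, expression):
    """Count rows for which the SQL expression evaluates to false."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"CREATE TABLE data ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO data VALUES ({placeholders})", rows)
        (violations,) = conn.execute(
            f"SELECT COUNT(*) FROM data WHERE NOT ({expression})"
        ).fetchone()
        return violations
    finally:
        conn.close()


# Hypothetical sample data: (age, revenue, cost)
columns = ["age", "revenue", "cost"]
rows = [(25, 100.0, 40.0), (17, 80.0, 90.0), (40, 50.0, 10.0)]

for expression in ["age >= 18", "revenue - cost > 0"]:
    print(expression, "->", count_violations(rows, columns, expression), "violating row(s)")
```

Only pass trusted expressions to a helper like this, since the expression is interpolated directly into the query.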

## 📜 License

This project is licensed under the MIT License — see the [LICENSE](LICENSE) file for details.
