Metadata-Version: 2.4
Name: opendqv
Version: 2.3.22
Summary: OpenDQV Core — open-source, contract-driven data quality validation engine for data pipelines and API boundaries
License: MIT
License-File: LICENSE
Keywords: data-quality,validation,duckdb,api,data-contracts
Author: Sunny Sharma
Requires-Python: >=3.11,<4.0
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: FastAPI
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Typing :: Typed
Provides-Extra: mcp
Provides-Extra: postgres
Provides-Extra: server
Provides-Extra: ui
Requires-Dist: PyJWT (>=2.12,<3.0)
Requires-Dist: duckdb (>=1.0,<2.0)
Requires-Dist: fastapi (>=0.115,<0.136)
Requires-Dist: gunicorn (>=25.0,<26.0) ; extra == "server"
Requires-Dist: httpx (>=0.27,<0.28)
Requires-Dist: mcp (>=1.0.0,<2.0.0) ; extra == "mcp"
Requires-Dist: pandas (>=2.2,<4.0)
Requires-Dist: passlib[bcrypt] (>=1.7,<2.0)
Requires-Dist: prometheus-client (>=0.21,<0.22)
Requires-Dist: psycopg2-binary (>=2.9,<3.0) ; extra == "postgres"
Requires-Dist: pydantic (>=2.10,<3.0)
Requires-Dist: python-multipart (>=0.0.26)
Requires-Dist: pyyaml (>=6.0,<7.0)
Requires-Dist: regex (>=2024.0.0)
Requires-Dist: requests (>=2.33,<3.0)
Requires-Dist: slowapi (>=0.1,<0.2)
Requires-Dist: strawberry-graphql[fastapi] (>=0.311,<0.313)
Requires-Dist: streamlit (>=1.40,<2.0) ; extra == "ui"
Requires-Dist: uvicorn[standard] (>=0.34,<0.45)
Project-URL: Changelog, https://github.com/OpenDQV/OpenDQV/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/OpenDQV/OpenDQV/tree/main/docs
Project-URL: Homepage, https://github.com/OpenDQV/OpenDQV
Project-URL: Repository, https://github.com/OpenDQV/OpenDQV
Description-Content-Type: text/markdown

<p align="center">
  <img src="docs/assets/OpenDQV-Logo-Hires.png" alt="OpenDQV — Open Data Quality Validation" width="480">
</p>

[![CI](https://github.com/OpenDQV/OpenDQV/actions/workflows/ci.yml/badge.svg)](https://github.com/OpenDQV/OpenDQV/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/OpenDQV/OpenDQV/blob/main/LICENSE)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue)](https://pypi.org/project/opendqv/)
[![PyPI](https://img.shields.io/pypi/v/opendqv?style=flat)](https://pypi.org/project/opendqv/)
[![Docker](https://img.shields.io/badge/docker-ghcr.io%2Fopendqv-blue?logo=docker)](https://github.com/orgs/OpenDQV/packages/container/package/opendqv%2Fopendqv)
[![Platforms](https://img.shields.io/badge/platforms-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20ARM64-lightgrey)](#)
[![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/OpenDQV/OpenDQV/badge)](https://securityscorecards.dev/#/projects/github.com/OpenDQV/OpenDQV)
[![Coverage](https://codecov.io/gh/OpenDQV/OpenDQV/branch/main/graph/badge.svg)](https://codecov.io/gh/OpenDQV/OpenDQV)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/12229/badge)](https://www.bestpractices.dev/projects/12229)

| [Quickstart](docs/quickstart.md) | [Rules](docs/rules/) | [Contracts](docs/compliance-contracts.md) | [MCP](docs/mcp.md) | [API](docs/index.md) | [Security](SECURITY.md) | [FAQ](docs/faq.md) |
|---|---|---|---|---|---|---|

> **"Trust is easier to build than to repair."**
> That is why OpenDQV exists. A `422` at the point of write is cheaper than a data incident three weeks later.

> **Beta (v2.x).** Public API surface (REST, contract YAML, MCP tools, Python SDK) is stable. Breaking changes follow a one-release deprecation cycle. Security fixes backported to the latest 2.x line. See [API Stability](#api-stability) for commitments.

**OpenDQV is a write-time data validation service.** Source systems call it before writing data. Bad records return a `422` with per-field errors. Good records pass through. No payload is stored.

![OpenDQV demo — define a contract, send a bad record (get a 422), fix it (get a 200)](docs/demo_wizard.gif)

```mermaid
flowchart LR
    subgraph Callers
        direction TB
        SF[Salesforce]
        SAP[SAP]
        DYN[Dynamics]
        ORA[Oracle]
        WEB[Web forms]
        ETL1[ETL pipelines]

        DJ[Django clean]
        PY[Python scripts]
        PD[Pandas / ETL]

        CD[Claude Desktop]
        CUR[Cursor]
        LLM[LLM agents]
    end

    subgraph OpenDQV
        direction TB
        API[Validation API\nREST / batch]
        SDK[LocalValidator\nin-process SDK]
        MCP[MCP Server\nAI-native]
        API & SDK & MCP --> CON[Contracts · YAML\nGovernance · RBAC\nAudit trail]
        API & SDK & MCP --> GEN[Code Generator\nApex · JS · SQL]
    end

    subgraph Results
        direction TB
        R1[valid: true / false]
        R2[per-field errors]
        R3[severity levels]
        R4[webhooks on events]
    end

    SF & SAP & DYN & ORA & WEB & ETL1 --> API
    DJ & PY & PD --> SDK
    CD & CUR & LLM --> MCP

    API & SDK & MCP --> R1

    subgraph Importers
        IMP[dbt schema · GX suites\nSoda checks · ODCS · CSV]
    end
    IMP --> CON

    style API fill:#0d3b5e,stroke:#092a44,color:#fff
    style SDK fill:#0d3b5e,stroke:#092a44,color:#fff
    style MCP fill:#0d3b5e,stroke:#092a44,color:#fff
    style CON fill:#1a8aad,stroke:#14708d,color:#fff
    style GEN fill:#1a8aad,stroke:#14708d,color:#fff
    style R1 fill:#2ec4e6,stroke:#1a8aad,color:#0d3b5e
    style R2 fill:#2ec4e6,stroke:#1a8aad,color:#0d3b5e
    style R3 fill:#2ec4e6,stroke:#1a8aad,color:#0d3b5e
    style R4 fill:#2ec4e6,stroke:#1a8aad,color:#0d3b5e
    style IMP fill:#1a8aad,stroke:#14708d,color:#fff
```

A `422` at the point of write closes the feedback loop — producers see failures immediately and fix them at source. Rejection rates drop over time because the tool changes the incentive, not just the outcome.

For post-landing monitoring use [Great Expectations](https://greatexpectations.io), [Soda](https://www.soda.io), or [dbt tests](https://docs.getdbt.com/docs/build/tests) — they're complementary, not competing. OpenDQV owns layer one (write-time enforcement); those tools own layer three (post-ingestion observability).

---

## AI Agents — first-class via MCP

OpenDQV ships a built-in [Model Context Protocol](https://modelcontextprotocol.io) server, so [Claude Desktop](https://claude.ai/download), [Cursor](https://www.cursor.com), and any other MCP-compatible agent can discover contracts, validate records, and explain failures through tool calls the agent **explicitly declares** — no hallucinated compliance, no invented rules.

[![Watch the 4-minute MCP demo](docs/demo_mcp_poster.png)](https://github.com/user-attachments/assets/4d414ff1-b08c-4ff1-91e4-e421f0d5391d)

*4-minute demo: Claude Desktop uses two MCP servers — OpenDQV for validation, Marmot for catalog lineage — to check a menu item against `ppds_menu_item` for Natasha's Law allergen compliance, stating which tool calls it makes and why. ([Backup: download the MP4 from the repo](https://github.com/OpenDQV/OpenDQV/raw/main/docs/demo_mcp.mp4))*

For tool reference, write guardrails, remote/enterprise mode, and the Marmot composition pattern, see **[docs/mcp.md](docs/mcp.md)**.

**Reserved `agent_id` prefix.** The prefix `OpenDQV_SA_` is reserved for OpenDQV-owned system traffic — smoke probes, demos, MCP self-tests, perf harnesses. The pattern is `OpenDQV_SA_[Category]_[Scope]` (e.g. `OpenDQV_SA_smoke_v240`, `OpenDQV_SA_probe_persona_b`). Customer-facing metrics endpoints (`/api/v1/stats`, `/api/v1/agents`, MCP `get_quality_metrics`, MCP `list_agents`) suppress these by default so tenant views stay clean of dev/test traffic. Pass `include_system=true` to surface them for diagnostics — each row carries an `is_system_agent` flag.

---

## Install

| I have... | Command |
|-----------|---------|
| Python 3.11+ | `git clone https://github.com/OpenDQV/OpenDQV.git && cd OpenDQV && bash install.sh` |
| Docker | `git clone https://github.com/OpenDQV/OpenDQV.git && cd OpenDQV && cp .env.example .env && docker compose up -d` |
| Just the SDK/CLI | `pip install opendqv` then `opendqv init` to bootstrap contracts |
| None of the above | [Beginner setup guide →](docs/beginner-quickstart.md) |

`install.sh` creates a virtual environment, installs dependencies, and launches the onboarding wizard. Docker pulls `ghcr.io/opendqv/opendqv:latest` — no build step required.

> ⚠️ `AUTH_MODE=open` (the default) has **no authentication**. Set `AUTH_MODE=token` and a strong `SECRET_KEY` in `.env` before any non-local deployment. See [SECURITY.md](SECURITY.md).

---

## Your First Validation

**1. Write a contract** — drop a YAML file in your contracts directory (run `opendqv init --all` to copy the 43 bundled contracts, or `opendqv init` for a single starter):

```yaml
contract:
  name: order
  version: "1.0"
  owner: "Data Governance"
  status: active
  rules:
    - name: valid_email
      type: regex
      field: email
      pattern: "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"
      severity: error
      error_message: "Invalid email format"
    - name: amount_positive
      type: min
      field: amount
      min: 0.01
      severity: error
      error_message: "Order amount must be positive"
    - name: status_valid
      type: allowed_values
      field: status
      allowed_values: [pending, confirmed, shipped, cancelled]
      severity: error
      error_message: "Invalid order status"
```

**2. Reload contracts:**

```bash
curl -X POST http://localhost:8000/api/v1/contracts/reload
```

**3. Send a bad record — OpenDQV rejects it:**

```bash
curl -s -X POST http://localhost:8000/api/v1/validate \
  -H "Content-Type: application/json" \
  -d '{"contract": "order", "record": {"email": "not-an-email", "amount": -5, "status": "unknown"}}'
```

```json
{
  "valid": false,
  "errors": [
    {"field": "email",  "rule": "valid_email",    "message": "Invalid email format",        "severity": "error"},
    {"field": "amount", "rule": "amount_positive", "message": "Order amount must be positive", "severity": "error"},
    {"field": "status", "rule": "status_valid",    "message": "Invalid order status",        "severity": "error"}
  ],
  "contract": "order",
  "version": "1.0"
}
```

**4. Fix the record — it passes:**

```bash
curl -s -X POST http://localhost:8000/api/v1/validate \
  -H "Content-Type: application/json" \
  -d '{"contract": "order", "record": {"email": "alice@example.com", "amount": 49.99, "status": "pending"}}'
```

```json
{"valid": true, "errors": [], "warnings": [], "contract": "order", "version": "1.0"}
```

The `customer` contract ships pre-seeded if you want to skip step 1. The [quickstart guide](docs/quickstart.md) walks through authoring, lifecycle, and batch validation.

---

## Rules

| Type | What it checks |
|------|----------------|
| `not_empty` | Field is present and non-empty |
| `regex` | Field matches (or does not match) a pattern. Built-ins: `builtin:email`, `builtin:uuid`, `builtin:ipv4`, `builtin:url` |
| `min` / `max` / `range` | Numeric bounds |
| `min_length` / `max_length` | String length |
| `date_format` | Parseable date/datetime. Falls back through common formats if no explicit format is set |
| `allowed_values` | Value must be in a fixed list |
| `lookup` | Value must appear in a local file or HTTP endpoint (with TTL cache) |
| `compare` | Cross-field: `field` op `compare_to` — supports `gt`, `lt`, `gte`, `lte`, `eq`, `neq`, and `today`/`now` sentinels |
| `required_if` / `forbidden_if` | Conditional: required or forbidden when another field equals a value |
| `checksum` | Check-digit integrity: IBAN, GTIN/GS1, NHS, ISIN, LEI, VIN, CPF, ISRC |
| `unique` | No duplicates within a batch (batch mode only) |
| `cross_field_range` | Value must be between two other fields in the same record |
| `field_sum` | Sum of named fields must equal a target (within optional tolerance) |
| `geospatial_bounds` | Lat/lon pair within a bounding box |
| `date_diff` | Difference between two date fields within a range |
| `age_match` | Declared age consistent with date-of-birth field |

Rules have `severity: error` (blocks the record) or `severity: warning` (flags but allows).
Any rule can include a `condition` block to apply it only when another field equals a given value.

Full reference: [docs/rules/](docs/rules/)

---

## How it compares

A mature data governance programme operates across three layers, each with a distinct job:

| Layer | Purpose | Tools |
|---|---|---|
| **1. Write-time enforcement** | Prevent bad data from entering any system | **OpenDQV** |
| **2. Catalog / governance / stewardship** | Ownership, glossary, lineage, policy, stewardship workflows | Alation, Atlan, Collibra, Purview, DataHub, Marmot |
| **3. Pipeline testing / observability** | Detect drift, freshness issues, residual quality after ingestion | Great Expectations, Soda Core, dbt tests, Monte Carlo |

OpenDQV Core owns layer one. Your catalog handles layer two, your pipeline tools handle layer three.

| | Great Expectations / Soda / dbt | OpenDQV |
|---|---|---|
| **When** | After data lands (in warehouse/lake) | Before data is written (at the door) |
| **Where** | Data pipelines, batch jobs | Source system integration points |
| **Model** | Scan data at rest | Validate data in flight |
| **Latency** | Minutes to hours (batch) | Milliseconds (API call) |
| **Who calls it** | Data engineers | Data engineers, developers, CRM admins |

**They're complementary.** Use Great Expectations to monitor your warehouse. Use OpenDQV to stop bad data from getting there in the first place.

---

## Contracts

43 production-ready contracts ship inside the `opendqv` package covering GDPR, HIPAA, SOX, MiFID II,
UK Building Safety Act, Martyn's Law, Natasha's Law, Ofcom Online Safety Act, EU DORA,
and 20+ other regulatory frameworks across UK, EU, and US. `pip install opendqv` gives you all of them
— `opendqv list` works with zero configuration.

See [docs/compliance-contracts.md](docs/compliance-contracts.md) for the full list with
regulatory context, or browse [opendqv/contracts/](opendqv/contracts/) directly.
17 minimal starter templates are in [examples/starter_contracts/](examples/starter_contracts/).

---

## Performance

EC2 c6i.large, 2 workers, 12-rule contract, mixed 50/50 workload:
**~482 req/s, p99 ~182 ms.** Sizing rule: `WEB_CONCURRENCY = number of vCPUs`.

See [docs/benchmark_throughput.md](docs/benchmark_throughput.md) for full platform comparison,
methodology, and monthly volume extrapolation.

---

## Documentation

| | |
|---|---|
| [Quickstart](docs/quickstart.md) | Build your first contract in 15 minutes |
| [Rules Reference](docs/rules/) | All rule types with parameters and examples |
| [Compliance Contracts](docs/compliance-contracts.md) | 44 contracts with regulatory context |
| [API Reference](docs/index.md) | REST endpoints, SDK, GraphQL, webhooks |
| [Security](SECURITY.md) | Deployment checklist, threat model, RBAC |
| [Production Deployment](docs/production_deployment.md) | Token auth, TLS, Docker Compose, hardening |
| [Integrations](docs/index.md) | Salesforce, Kafka, Snowflake, dbt, Databricks, MCP, and more |
| [All docs →](docs/) | 76 documentation files |

---

## API Stability

OpenDQV is in Beta as of 2.0.0. The following stability commitments apply to the v2.x series:

- **REST API endpoints** — paths, request bodies, and response shapes are stable within `v2.x`. Backwards-incompatible changes require a major version bump and follow a deprecation cycle (one minor release of warnings before removal).
- **YAML contract format** — the contract schema (rules, fields, types) is stable within `v2.x`. New rule types may be added; existing rules will not change semantics without a deprecation cycle.
- **Python SDK** — `OpenDQVClient`, `AsyncOpenDQVClient`, and `LocalValidator` public method signatures are stable within `v2.x`. Internal helpers (prefixed `_`) are not covered.
- **MCP tools** — tool names and parameters are stable within `v2.x`.
- **Security fixes** — backported to the latest 2.x line on a best-effort basis.

### Known limitations in v2.2.x

- **Rule null handling is inconsistent.** Most format rules fail when the target
  field is missing; a few (`max_length`, `allowed_values`) pass silently;
  `field_sum` and `ratio_check` coerce missing operands to `0`. Single-record
  and batch paths disagree in a few cases. See
  [`docs/rules/core_rules.md`](docs/rules/core_rules.md#null-handling-current-v22x-behaviour)
  for the full matrix and the safe pattern to use today. v2.3.0 will make this
  consistent (loud-by-default with an `optional: true` opt-out).
- **Unknown rule types pass silently at runtime.** A typo in `type:` (e.g.
  `min_lenght`) is caught by `opendqv lint` but not by the engine — a typo'd
  rule is a disabled rule. Always lint before deploy. v2.3.0 will reject
  unknown types at contract load.

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for setup instructions, coding guidelines, and how to submit changes.

## License

MIT — see [LICENSE](LICENSE).

## Acknowledgements

**Led by [Sunny Sharma](https://uk.linkedin.com/in/sunny-sharma-3927632), [BGMS Consultants Ltd](https://www.bgmsconsultants.com).** The vision, the architecture, every contract, and every design decision in this repository are directed by a human who believes data quality is a write-time responsibility.

OpenDQV is built with a hybrid team. Sunny leads — carbon and silicon. Three AI collaborators execute: Claude Sonnet 4.6 (primary developer), Claude Opus 4.6 (strategic auditor), and Grok (market intelligence). All answer to the same ethos: *trust is easier to build than to repair.*


