Metadata-Version: 2.4
Name: data-contract-validator
Version: 1.1.0
Summary: Validate data contracts between dbt models and FastAPI/Pydantic APIs with accurate, low-false-positive schema checks
Author-email: Ogunniran Siji <ogunniransiji@gmail.com>
Maintainer-email: Ogunniran Siji <ogunniransiji@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/OGsiji/data-contract-validator
Project-URL: Documentation, https://github.com/OGsiji/data-contract-validator/blob/main/README.md
Project-URL: Repository, https://github.com/OGsiji/data-contract-validator
Project-URL: Bug Reports, https://github.com/OGsiji/data-contract-validator/issues
Project-URL: Changelog, https://github.com/OGsiji/data-contract-validator/blob/main/CHANGELOG.md
Keywords: dbt,fastapi,contract-testing,api-validation,data-engineering,schema-validation,ci-cd,devops
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Database
Classifier: Topic :: Internet :: WWW/HTTP :: HTTP Servers
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0.0
Requires-Dist: PyYAML>=6.0
Requires-Dist: requests>=2.25.0
Requires-Dist: click>=8.0.0
Requires-Dist: sqlglot>=20.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: mypy>=0.991; extra == "dev"
Requires-Dist: pre-commit>=2.20.0; extra == "dev"
Requires-Dist: build>=0.8.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-mock>=3.8.0; extra == "test"
Dynamic: license-file

# 🛡️ Data Contract Validator

> **Catch breaking changes between your dbt models and your FastAPI/Pydantic APIs — before they hit production.**

[![PyPI version](https://badge.fury.io/py/data-contract-validator.svg)](https://badge.fury.io/py/data-contract-validator)
[![Tests](https://github.com/OGsiji/data-contract-validator/workflows/Tests/badge.svg)](https://github.com/OGsiji/data-contract-validator/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## 🎯 What it solves

Your analytics team changes a dbt model. Your API team's FastAPI service still
expects the old shape. Nobody notices until production 500s at 2 AM.

This tool sits on that boundary. It extracts the schema your **dbt models
produce** and the schema your **Pydantic models expect**, compares them, and
fails CI when the data side can no longer satisfy the API side.

```
   dbt models                 Data Contract Validator                FastAPI / Pydantic
(what the pipeline   ──▶   extract → normalize → compare   ◀──   (what the API expects)
    produces)                     ↓
                          critical issues block the build
```

### Built for trust

A check that gates a deploy is only useful if it doesn't cry wolf. v1.1
re-architected extraction around that principle:

- **Canonical types** — dbt `varchar` and Pydantic `str` are understood to be
  the same thing, so you don't get drowned in fake "type mismatch" warnings.
- **A real SQL parser** (`sqlglot`) instead of regex — CTEs, `||`
  concatenation, window functions and quoted identifiers are parsed correctly.
- **Confidence-aware** — if the tool can't fully resolve a model's columns
  (e.g. `SELECT *`), it will **warn** rather than falsely **block** your build.
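
To make the canonical-type idea concrete, here is a minimal sketch; it is not the package's actual implementation. `CanonicalKind`, the two lookup tables, and `types_compatible` are hypothetical names invented for illustration, and the real mappings may cover many more types.

```python
from enum import Enum


class CanonicalKind(Enum):
    """Hypothetical stand-in for the package's canonical type system."""
    STRING = "string"
    INTEGER = "integer"
    FLOAT = "float"
    BOOLEAN = "boolean"
    TIMESTAMP = "timestamp"
    UNKNOWN = "unknown"


# Illustrative lookup tables only (assumption: the real tables are larger).
_SQL_TYPES = {
    "varchar": CanonicalKind.STRING,
    "text": CanonicalKind.STRING,
    "integer": CanonicalKind.INTEGER,
    "bigint": CanonicalKind.INTEGER,
    "double": CanonicalKind.FLOAT,
    "boolean": CanonicalKind.BOOLEAN,
    "timestamp": CanonicalKind.TIMESTAMP,
}
_PYTHON_TYPES = {
    "str": CanonicalKind.STRING,
    "int": CanonicalKind.INTEGER,
    "float": CanonicalKind.FLOAT,
    "bool": CanonicalKind.BOOLEAN,
    "datetime": CanonicalKind.TIMESTAMP,
}


def canonical_from_sql(sql_type: str) -> CanonicalKind:
    # Drop length/precision suffixes such as "(255)" before the lookup.
    base = sql_type.strip().lower().split("(", 1)[0].strip()
    return _SQL_TYPES.get(base, CanonicalKind.UNKNOWN)


def canonical_from_python(annotation: str) -> CanonicalKind:
    return _PYTHON_TYPES.get(annotation.strip(), CanonicalKind.UNKNOWN)


def types_compatible(sql_type: str, python_type: str) -> bool:
    source = canonical_from_sql(sql_type)
    target = canonical_from_python(python_type)
    # An unresolved type is not evidence of a mismatch, so stay quiet.
    if CanonicalKind.UNKNOWN in (source, target):
        return True
    return source is target
```

With this sketch, `types_compatible("VARCHAR(255)", "str")` is true and `types_compatible("bigint", "str")` is false, while a type the tables do not know about is treated as compatible rather than reported.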

## ⚡ Quick start

```bash
pip install data-contract-validator
```

```bash
# Initialize config + CI workflow in your dbt project
contract-validator init --interactive

# Sanity-check the setup
contract-validator test

# Validate
contract-validator validate
```

### One-off validation (no config file)

```bash
# Local dbt project against a local Pydantic models file or directory
contract-validator validate \
  --dbt-project ./my-dbt-project \
  --fastapi-local ./my-api/app/models.py

# dbt project against models in another GitHub repo (microservices)
contract-validator validate \
  --dbt-project . \
  --fastapi-repo "my-org/my-api" \
  --fastapi-path "app/models.py"
```

## 🔍 How extraction works (and why it's accurate)

### dbt side — tiered, best-source-wins

| Tier | Source | Types | Confidence | Notes |
|---|---|---|---|---|
| 1 | `target/catalog.json` | **Real warehouse types** | high | Produced by `dbt docs generate`. Most accurate. |
| 2 | `sqlglot` SQL parse | Inferred (often unknown) | medium | Trusted column **names**; enriched with documented types from `manifest.json`. Detects `SELECT *`. |
| 3 | regex parse | Guessed | low | Last resort. Never used to hard-fail a build. |

The tool auto-detects what's available and degrades gracefully — so it works
offline in pre-commit **and** with full type fidelity in a warehouse-connected
CI job.

> 💡 **Tip:** run `dbt docs generate` in CI before validating to unlock Tier 1
> (real types). Without it, you still get accurate column-presence checks from
> Tier 2.
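
As a rough illustration of the Tier 1 step in the table above, the sketch below reads column types from `target/catalog.json` and returns `None` when the file is absent, so a caller could fall back to SQL parsing. It assumes the standard dbt catalog layout (`nodes`, keyed by unique id, each with a `columns` map carrying a `type`). The function name `load_catalog_columns` is hypothetical and is not part of this package's API.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def load_catalog_columns(project_path: str) -> Optional[Dict[str, Dict[str, str]]]:
    """Return {model_name: {column_name: warehouse_type}} or None if no catalog."""
    catalog = Path(project_path) / "target" / "catalog.json"
    if not catalog.is_file():
        # No catalog available: the caller would drop to a lower tier.
        return None

    data = json.loads(catalog.read_text(encoding="utf-8"))
    models: Dict[str, Dict[str, str]] = {}
    for unique_id, node in data.get("nodes", {}).items():
        # Unique ids look like "model.<project>.<name>"; skip seeds, snapshots, etc.
        if not unique_id.startswith("model."):
            continue
        model_name = unique_id.rsplit(".", 1)[-1].lower()
        models[model_name] = {
            column_name.lower(): column.get("type", "")
            for column_name, column in node.get("columns", {}).items()
        }
    return models
```

Lower-casing names here is a design choice for the sketch: warehouses such as Snowflake report upper-case identifiers, and normalizing early keeps later comparisons simple.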

### FastAPI side

Pydantic / SQLModel classes are parsed from source with Python's `ast` (no
imports executed). `Optional[...]` controls whether a field is required;
`table=True` SQLModel classes (DB tables, not API contracts) are skipped.

## 🚦 What gets flagged

| Severity | Meaning | Example |
|---|---|---|
| 🚨 **Critical** | Blocks the build | API requires a column the dbt model no longer produces |
| ⚠️ **Warning** | Worth a look, non-blocking | A real type mismatch, or a missing column on a model we couldn't fully resolve |

```bash
$ contract-validator validate

🛡️ Data Contract Validation Results:
Status: ❌ FAILED
Critical: 1 | Warnings: 0

🚨 Critical Issues (Must Fix):
  💥 user_analytics
     Column: total_orders
     Problem: Target REQUIRES column 'total_orders' but source doesn't provide it
     🔧 Fix: Add column 'total_orders' to source model for table 'user_analytics'
```
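
The severity rules above can be sketched as follows. This is a hedged illustration of the behavior the README describes, not the validator's real code: `Issue`, `check_missing_columns`, and the `fully_resolved` flag are hypothetical names standing in for whatever the package uses internally to track extraction confidence.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Issue:
    severity: str  # "critical" or "warning"
    table: str
    column: str
    message: str


def check_missing_columns(
    table: str,
    expected: Dict[str, bool],  # target column -> is it required?
    provided: Set[str],         # columns the source model produces
    fully_resolved: bool,       # False for e.g. SELECT * or a regex-only parse
) -> List[Issue]:
    issues: List[Issue] = []
    for column, required in expected.items():
        if column in provided:
            continue
        # Only block the build when the column is required AND we trust
        # that the source column list is complete.
        blocking = required and fully_resolved
        need = "REQUIRES" if required else "optionally uses"
        issues.append(
            Issue(
                severity="critical" if blocking else "warning",
                table=table,
                column=column,
                message=f"Target {need} column '{column}' but source doesn't provide it",
            )
        )
    return issues
```

Under these assumptions, a missing required column on a fully resolved model yields a critical issue, while the same gap on a model whose columns could not be fully resolved is downgraded to a warning.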

## 🔧 Configuration (`.retl-validator.yml`)

```yaml
version: "1.0"
name: "my-project-contracts"

source:
  dbt:
    project_path: "."
    auto_compile: true
    # Set to true to force Tier 2/3 SQL parsing even if catalog/manifest exist:
    disable_manifest: false

target:
  fastapi:
    # GitHub repo:
    type: "github"
    repo: "my-org/my-api"
    path: "app/models.py"
    # ...or local:
    # type: "local"
    # path: "../my-api/app/models.py"

# Optional: explicit mapping for when names don't line up by convention.
mapping:
  tables:
    # target (Pydantic) table : source (dbt) model
    user_analytics: user_analytics_summary
  columns:
    user_analytics:
      # target column : source column
      userId: user_id

validation:
  fail_on: ["missing_tables", "missing_required_columns"]
  warn_on: ["type_mismatches", "missing_optional_columns"]
```

### When do I need `mapping`?

By default, names are matched across naming conventions (`snake_case`,
`camelCase`, `PascalCase`), so `UserAnalytics` → `user_analytics` and
`userId` → `user_id` resolve automatically. Reach for `mapping` only when a
model or column is named so differently that the convention can't bridge it
(e.g. Pydantic `user_id` ↔ dbt `customer_identifier`).
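
One plausible way to implement this kind of convention matching is to normalize both sides to `snake_case` and consult the explicit mapping first. The sketch below is an assumption about how such matching could work, not the package's code; `to_snake_case` and `resolve_source_column` are hypothetical helpers.

```python
import re
from typing import Dict, Iterable, Optional


def to_snake_case(name: str) -> str:
    # userId -> user_Id, UserAnalytics -> User_Analytics
    step = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # HTTPResponse -> HTTP_Response (split an acronym from the next word)
    step = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", step)
    return step.lower()


def resolve_source_column(
    target_column: str,
    source_columns: Iterable[str],
    column_mapping: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Find the source column for a target column, or None if nothing matches."""
    columns = list(source_columns)

    # 1. An explicit mapping entry always wins over convention.
    explicit = (column_mapping or {}).get(target_column)
    if explicit is not None:
        return explicit if explicit in columns else None

    # 2. Otherwise compare names after normalizing both to snake_case.
    wanted = to_snake_case(target_column)
    for column in columns:
        if to_snake_case(column) == wanted:
            return column
    return None
```

Here `resolve_source_column("userId", ["user_id"])` finds `user_id` by convention, whereas `user_id` against `customer_identifier` resolves only when the mapping `{"user_id": "customer_identifier"}` is supplied.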

## 🐍 Python API

```python
from data_contract_validator import ContractValidator, DBTExtractor, FastAPIExtractor

dbt = DBTExtractor(project_path="./dbt-project")
fastapi = FastAPIExtractor.from_github_repo("my-org/my-api", "app/models.py")

validator = ContractValidator(
    source_extractor=dbt,
    target_extractor=fastapi,
    mapping={"tables": {"user_analytics": "user_analytics_summary"}},  # optional
)
result = validator.validate()

if not result.success:
    for issue in result.critical_issues:
        print(f"💥 {issue.table}.{issue.column}: {issue.message}")
```

## 🪝 CI / pre-commit integration

### GitHub Actions

`contract-validator init` generates a workflow for you. Minimal version:

```yaml
name: 🛡️ Data Contract Validation
on:
  pull_request:
    paths: ["models/**/*.sql", "dbt_project.yml", "**/*models*.py"]
jobs:
  validate-contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: { python-version: "3.11" }
      - run: pip install data-contract-validator
      # Optional: `dbt docs generate` here for real warehouse types (Tier 1)
      - run: contract-validator validate --output github
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

### Pre-commit

```bash
contract-validator setup-precommit --install-hooks
```

```yaml
repos:
  - repo: https://github.com/OGsiji/data-contract-validator
    rev: v1.1.0
    hooks:
      - id: contract-validation
```

## 🧪 Output formats

```bash
contract-validator validate --output terminal   # human-friendly (default)
contract-validator validate --output json       # machine-readable for CI
contract-validator validate --output github     # GitHub Actions annotations
```

## 🚀 Supported frameworks

**Source:** dbt (all adapters — Snowflake, BigQuery, Redshift, Postgres, …).
**Target:** FastAPI (Pydantic v2 + SQLModel).

The extractor architecture is intentionally pluggable (`BaseExtractor` →
`Dict[str, Schema]` with canonical types), so additional sources/targets can be
added without touching the validator. [Open an issue](https://github.com/OGsiji/data-contract-validator/issues)
to request one.

## 🛠️ Development & testing

```bash
git clone https://github.com/OGsiji/data-contract-validator
cd data-contract-validator

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"     # or: pip install -e ".[test]"

# Run the suite
pytest

# Lint / format
black data_contract_validator tests
```

The test suite covers the canonical type system (`tests/test_core/test_types.py`),
the tiered dbt extractor including sqlglot CTE handling and `catalog.json`
(`tests/test_extractors/test_dbt.py`), and the confidence/mapping behavior of
the validator (`tests/test_core/test_validator.py`).

### Adding an extractor

```python
from data_contract_validator.extractors.base import BaseExtractor
from data_contract_validator.core.types import CanonicalType

class MyExtractor(BaseExtractor):
    def extract_schemas(self):
        # return Dict[str, Schema]; use self._make_column(...) so each column
        # carries a canonical_type the validator can compare.
        ...
```

## 🗺️ Roadmap

- Real compatibility semantics (nullability, additive vs. breaking changes)
- Reporter/logging abstraction (quiet/embeddable core)
- A canonical, language-neutral contract artifact + baseline/snapshot diffing
- More targets (Django, SQLAlchemy, GraphQL, OpenAPI)

## 📄 License

MIT — see [LICENSE](LICENSE).

## 🆘 Support

- 🐛 Issues: https://github.com/OGsiji/data-contract-validator/issues
- 📧 Email: ogunniransiji@gmail.com

If this saves you a production incident, please ⭐ the repo.
