Metadata-Version: 2.4
Name: schema-drift-guard
Version: 0.2.0
Summary: Lightweight schema drift detection and data contract enforcement tool
Author: Mohsin Shaikh
License: MIT
Project-URL: Homepage, https://github.com/mohsinshaikh/schema-guard
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: pyyaml
Requires-Dist: click
Requires-Dist: rich
Dynamic: license-file

# Schema Drift Guard

**Schema Drift Guard** is a lightweight CLI tool to detect schema drift and enforce data contracts in data pipelines.

It helps data teams detect unexpected schema changes before they break pipelines, dashboards, or downstream systems.

The tool can automatically:

* Detect schema drift
* Detect column type changes
* Profile dataset columns
* Suggest schema tests
* Update YAML schema files
* Maintain schema version history
* Enforce checks in CI/CD pipelines

---

## Installation

Install from PyPI:

```bash
pip install schema-drift-guard
```

Or install locally for development:

```bash
git clone https://github.com/smohsin46/schema-drift-guard.git
cd schema-drift-guard
pip install -e .
```

---

## Quick Start

Run a schema check against a dataset and its schema definition:

```bash
schema-guard check \
  --source-type csv \
  --source data/orders.csv \
  --schema schemas/orders.yml
```

Example output:

```
⚠️ Schema drift detected

New columns detected:
  + discount

Updating schema YAML...

➕ Adding column to schema: discount
```
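
The comparison behind this output can be sketched as a set difference between the columns declared in the schema and the columns found in the data. This is a minimal illustration using only the standard library, not the tool's actual implementation; `detect_column_drift`, its return shape, and the inline sample data are hypothetical.

```python
import csv
import io


def detect_column_drift(schema_columns, data_columns):
    """Return columns added to and missing from the data, relative to the schema."""
    expected = set(schema_columns)
    actual = set(data_columns)
    # Keep the original ordering so the report is stable between runs.
    added = [c for c in data_columns if c not in expected]
    removed = [c for c in schema_columns if c not in actual]
    return {"added": added, "removed": removed}


# Hypothetical sample standing in for data/orders.csv.
sample_csv = "order_id,user_id,price,created_at,discount\n1,10,2.5,2024-01-01,0.1\n"
header = next(csv.reader(io.StringIO(sample_csv)))

drift = detect_column_drift(["order_id", "user_id", "price", "created_at"], header)
print(drift)  # {'added': ['discount'], 'removed': []}
```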

---

## Supported Data Sources

Current connectors:

| Source Type | Description                |
| ----------- | -------------------------- |
| csv         | Local CSV files            |
| snowflake   | Snowflake warehouse tables |

Example Snowflake command:

```bash
schema-guard check \
  --source-type snowflake \
  --source orders \
  --schema schemas/orders.yml \
  --account <account> \
  --user <user> \
  --password <password> \
  --warehouse <warehouse> \
  --database <database> \
  --schema-name <schema>
```

---

## Schema YAML Format

Example schema definition:

```yaml
columns:
  - name: order_id
    type: integer

  - name: user_id
    type: integer

  - name: price
    type: float

  - name: created_at
    type: string
```

When the check detects new columns, the tool can append them to the schema YAML file automatically, as shown in the Quick Start output.
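
Column type drift can be illustrated by inferring a type from the raw values and comparing it with the `type` declared in the schema. The sketch below assumes the schema has already been loaded into a dictionary shaped like the YAML above; `infer_type` and `detect_type_drift` are hypothetical names, and the inference rules are a simplification rather than the tool's real logic.

```python
def infer_type(values):
    """Infer 'integer', 'float', or 'string' from raw string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    # Try the strictest type first and fall back to looser ones.
    for type_name, caster in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"


def detect_type_drift(schema, rows):
    """Return {column: (declared_type, inferred_type)} for mismatched columns."""
    drift = {}
    for column in schema["columns"]:
        name = column["name"]
        inferred = infer_type([row.get(name, "") for row in rows])
        if inferred != column["type"]:
            drift[name] = (column["type"], inferred)
    return drift


schema = {
    "columns": [
        {"name": "order_id", "type": "integer"},
        {"name": "price", "type": "float"},
        {"name": "created_at", "type": "string"},
    ]
}
# Hypothetical rows, as csv.DictReader would produce them.
rows = [
    {"order_id": "1", "price": "2.5", "created_at": "2024-01-01"},
    {"order_id": "A-2", "price": "500.0", "created_at": "2024-01-02"},
]

type_drift = detect_type_drift(schema, rows)
print(type_drift)  # {'order_id': ('integer', 'string')}
```

Note that this naive inference would also flag a `float` column whose sample happens to contain only whole numbers, so a real check would likely need more lenient rules.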

---

## Column Profiling

The tool profiles dataset columns and reports statistics such as:

* null percentage
* distinct count
* minimum values
* maximum values

Example:

```
Column: price
  null_percent: 0.0
  distinct_count: 152
  min: 2.5
  max: 500.0
```
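
As a rough sketch of how these statistics can be computed for a single column, assuming missing values are represented as `None` (the function name `profile_column` is hypothetical and not part of the tool's documented API):

```python
def profile_column(values):
    """Compute null percentage, distinct count, min, and max for one column."""
    total = len(values)
    present = [v for v in values if v is not None]
    if total == 0:
        null_percent = 0.0
    else:
        null_percent = round(100.0 * (total - len(present)) / total, 2)
    return {
        "null_percent": null_percent,
        "distinct_count": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Hypothetical sample values for the price column.
prices = [2.5, 10.0, None, 500.0, 10.0]
profile = profile_column(prices)
print(profile)
# {'null_percent': 20.0, 'distinct_count': 3, 'min': 2.5, 'max': 500.0}
```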

---

## Automatic Test Generation

Schema Drift Guard can automatically generate column-level tests such as `not_null`, `unique`, and `accepted_range`.

Examples:

| Column Name | Generated Tests          |
| ----------- | ------------------------ |
| id          | not_null, unique         |
| email       | not_null                 |
| price       | not_null, accepted_range |

Example generated YAML:

```yaml
- name: price
  tests:
    - not_null
    - accepted_range:
        min: 0
        max: 500
```
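
One way such tests could be derived from a column profile is sketched below. The rules here are assumptions chosen for illustration, not the tool's documented heuristics: `not_null` when no nulls were seen, `unique` when every value is distinct, and otherwise `accepted_range` for numeric columns using the observed bounds directly. `suggest_tests` is a hypothetical name.

```python
def suggest_tests(profile, row_count):
    """Suggest tests for one column from its profile (illustrative rules only)."""
    tests = []
    if profile["null_percent"] == 0:
        tests.append("not_null")
    if row_count > 0 and profile["distinct_count"] == row_count:
        # Every value is distinct, so the column looks like a key.
        tests.append("unique")
    elif isinstance(profile["min"], (int, float)):
        # Numeric, non-key column: constrain it to the observed range.
        tests.append(
            {"accepted_range": {"min": profile["min"], "max": profile["max"]}}
        )
    return tests


# Profiles shaped like the Column Profiling output; row counts are made up.
price_profile = {"null_percent": 0.0, "distinct_count": 152, "min": 2.5, "max": 500.0}
id_profile = {"null_percent": 0.0, "distinct_count": 1000, "min": 1, "max": 1000}

price_tests = suggest_tests(price_profile, row_count=1000)
id_tests = suggest_tests(id_profile, row_count=1000)
print(price_tests)  # ['not_null', {'accepted_range': {'min': 2.5, 'max': 500.0}}]
print(id_tests)     # ['not_null', 'unique']
```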

---

## CI/CD Pipeline Enforcement

With the `--fail-on-drift` flag, Schema Drift Guard fails the pipeline when drift is detected:

```bash
schema-guard check \
  --source-type csv \
  --source data/orders.csv \
  --schema schemas/orders.yml \
  --fail-on-drift
```

If drift is detected:

```
❌ Schema drift detected. Failing pipeline.
```

This allows teams to enforce **data contracts** in automated workflows.
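
CI systems such as GitHub Actions mark a step as failed when its command exits with a non-zero status, so enforcement comes down to choosing the exit code. The sketch below shows that decision in isolation; `exit_code_for` and the shape of the drift report are hypothetical, and the specific exit code the tool uses is an assumption.

```python
def exit_code_for(drift, fail_on_drift):
    """Return a process exit code: 1 if drift should fail the run, else 0."""
    has_drift = any(drift.values())
    return 1 if has_drift and fail_on_drift else 0


# Hypothetical drift report with one new column.
drift_report = {"added": ["discount"], "removed": []}

strict_code = exit_code_for(drift_report, fail_on_drift=True)
lenient_code = exit_code_for(drift_report, fail_on_drift=False)
print(strict_code, lenient_code)  # 1 0
```

A CLI entry point would typically pass the returned value to `sys.exit()`.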

---

## Example GitHub Actions Workflow

```yaml
name: Schema Check

on: [pull_request]

jobs:
  schema_guard:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Install tool
        run: pip install schema-drift-guard

      - name: Run schema check
        run: |
          schema-guard check \
            --source-type csv \
            --source data/orders.csv \
            --schema schemas/orders.yml \
            --fail-on-drift
```

---

## Features

* ✔ Schema drift detection
* ✔ Column type drift detection
* ✔ Automatic schema updates
* ✔ Column profiling
* ✔ Automatic range test generation
* ✔ Schema version history
* ✔ Pluggable connectors
* ✔ Installable CLI tool
* ✔ CI/CD pipeline enforcement
* ✔ Snowflake warehouse support

---

## Project Structure

```
schema-drift-guard/
├── cli/
├── connectors/
├── core/
├── detectors/
├── generators/
├── schemas/
├── tests/
├── README.md
└── pyproject.toml
```

---

## Roadmap

Future improvements:

* BigQuery connector
* Postgres connector
* dbt project integration
* Metadata-only warehouse scanning
* AI-assisted schema suggestions

---

## Contributing

Contributions are welcome.

To contribute:

1. Fork the repository
2. Create a feature branch
3. Submit a pull request

---

## License

MIT License

---

## Author

Mohsin Shaikh
