Metadata-Version: 2.4
Name: scherlok
Version: 0.5.0
Summary: A detective for your data. Zero-config data quality monitoring.
Project-URL: Homepage, https://github.com/rbmuller/scherlok
Project-URL: Repository, https://github.com/rbmuller/scherlok
Project-URL: Issues, https://github.com/rbmuller/scherlok/issues
Author-email: Robson Bayer Müller <rbmuller91@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: anomaly-detection,data-engineering,data-quality,monitoring
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.10
Requires-Dist: jinja2>=3.0
Requires-Dist: psycopg2-binary>=2.9.0
Requires-Dist: requests>=2.28.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typer>=0.9.0
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3.0.0; extra == 'bigquery'
Provides-Extra: dbt
Requires-Dist: pyyaml>=6.0; extra == 'dbt'
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: pyyaml>=6.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: snowflake
Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'snowflake'
Description-Content-Type: text/markdown

<div align="center">

<img src="https://img.shields.io/badge/python-3.10+-blue?logo=python&logoColor=white" alt="Python 3.10+">
<img src="https://img.shields.io/pypi/v/scherlok?color=green" alt="PyPI">
<img src="https://img.shields.io/github/license/rbmuller/scherlok" alt="MIT License">
<a href="https://github.com/rbmuller/scherlok/actions/workflows/ci.yml"><img src="https://github.com/rbmuller/scherlok/actions/workflows/ci.yml/badge.svg" alt="CI"></a>

<br><br>

<img src="assets/scherlok-logo.png" alt="Scherlok" width="120">

<h1>Scherlok</h1>

<p><strong>Your data broke in production. Again.</strong><br>
Scherlok makes sure it doesn't happen next time.</p>

</div>

<div align="center">

<img src="examples/demo.svg" alt="Scherlok Demo" width="700">

**Zero config. Zero YAML. Zero rules to write.**<br>
Scherlok learns what "normal" looks like, then tells you when something changes.

</div>

---

## The Problem

Every data team has the same nightmare:

> A source API silently changes from **dollars to cents**. Revenue dashboards show wrong numbers for **3 weeks** before anyone notices.
>
> A column starts returning **NULLs**. A table stops updating. Row counts drop **40% on a Tuesday**. Nobody knows until the CEO asks why the report looks weird.

Current tools (Great Expectations, Soda, dbt tests) require you to **define what "correct" looks like** before you can detect what's wrong. Hundreds of rules. Dozens of YAML files. And you still miss things — because you can't write rules for problems you haven't imagined yet.

## The Solution

Scherlok takes the opposite approach: **learn first, then detect.**

```bash
scherlok connect postgres://user:pass@host/db   # connect once
scherlok investigate                              # learn your data
scherlok watch                                    # detect anomalies
```

Three commands. Five minutes. Done.

## What It Catches

| Anomaly | What Happened | Severity |
|---------|---------------|----------|
| **Volume drop** | Row count dropped 40% overnight | CRITICAL |
| **Volume spike** | 3x more rows than normal | WARNING |
| **Freshness alert** | Table hasn't updated in 12h (normally every 2h) | CRITICAL |
| **Schema drift** | Column removed or type changed | CRITICAL |
| **NULL surge** | NULL rate jumped from 2% to 45% | WARNING |
| **Distribution shift** | Column mean shifted 5+ standard deviations | WARNING |
| **Cardinality explosion** | Status column went from 5 values to 500 | CRITICAL |

Every anomaly is auto-scored: **INFO**, **WARNING**, or **CRITICAL**. No thresholds to configure.
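
As a rough illustration of the distribution-shift row above, the sketch below measures how far a column's mean has moved, in units of the baseline standard deviation. The function name `mean_shift_sigma` and the sample prices are hypothetical, and this is not necessarily how Scherlok computes or scores a shift.

```python
import statistics


def mean_shift_sigma(baseline: list[float], current: list[float]) -> float:
    """Return how many baseline standard deviations the current mean has moved."""
    base_mean = statistics.fmean(baseline)
    base_std = statistics.stdev(baseline)  # sample standard deviation
    shift = abs(statistics.fmean(current) - base_mean)
    if base_std == 0:
        # Constant baseline: any movement at all is an unbounded shift.
        return 0.0 if shift == 0 else float("inf")
    return shift / base_std


# Hypothetical "dollars to cents" incident: the same prices, multiplied by 100.
baseline_prices = [10.0, 12.0, 11.0, 9.0, 13.0]
current_prices = [p * 100 for p in baseline_prices]
print(f"mean shifted {mean_shift_sigma(baseline_prices, current_prices):.1f} sigma")
```

A unit change like this moves the mean by hundreds of standard deviations, which is why it can be flagged without a hand-written rule.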

## Works with dbt

Already running dbt? Scherlok complements `dbt test` with **automatic** anomaly detection — no rules to write.

```bash
pip install "scherlok[dbt]"

# After `dbt run`, point Scherlok at your project
scherlok dbt --project-dir ./my_dbt_project
```

Scherlok reads `target/manifest.json`, discovers every materialized model (`table`, `incremental`, `view`), auto-resolves the connection from your `profiles.yml`, and profiles each model:

```
Investigating 4 dbt models in ./my_dbt_project (postgres)
  ✓ stg_customers                  (12,345 rows)
  ✓ stg_orders                     (98,765 rows)
  ✗ fct_orders                     CRITICAL: Row count dropped 42% (98,765 → 57,283)
  ✓ dim_customers_inc              (12,300 rows)

Summary: 4 profiled, 1 anomalies (1 critical, 0 warning)
```
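
The model discovery step described above could look roughly like the following sketch. It relies on the standard dbt manifest layout (a top-level `nodes` mapping whose entries carry `resource_type`, `name`, and `config.materialized`). The helper names `discover_models` and `load_manifest` are hypothetical and not part of Scherlok's API.

```python
import json
from pathlib import Path

# Materializations the README says are picked up.
MATERIALIZATIONS = {"table", "incremental", "view"}


def discover_models(manifest: dict) -> list[str]:
    """Return the sorted names of models materialized as table, incremental, or view."""
    models = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and so on
        if node.get("config", {}).get("materialized") in MATERIALIZATIONS:
            models.append(node["name"])
    return sorted(models)


def load_manifest(project_dir: str) -> dict:
    """Read target/manifest.json from a dbt project directory."""
    return json.loads(Path(project_dir, "target", "manifest.json").read_text())
```

Ephemeral models are skipped in this sketch because they never exist as a relation in the warehouse, so there is nothing to profile.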

Use it as a CI gate after `dbt run`:

```yaml
- run: dbt run --target prod
- run: scherlok dbt --project-dir . --target prod --fail-on critical
```

**Supported adapters:** `postgres`, `bigquery`, `snowflake`. For others, pass `--connection-string` explicitly.

📖 Full docs: [dbt integration guide →](src/scherlok/dbt/README.md)

## HTML dashboard

![scherlok dashboard](assets/dashboard-screenshot.png)

```bash
scherlok dashboard --out report.html
```

One self-contained HTML file (~28 KB): KPIs, per-table incidents grouped with first-seen timestamps, `+`/`−`/`~` schema-drift diff, sparklines, and full anomaly history. Auto dark/light theme via `prefers-color-scheme`.

📖 Full docs: [dashboard guide →](src/scherlok/dashboard/README.md)

## How It Works

### 1. `investigate` — Learn the patterns

```bash
$ scherlok investigate

  Profiling 12 tables...
  ✓ users         — 45,231 rows, 8 columns
  ✓ orders        — 1,203,847 rows, 15 columns
  ✓ products      — 892 rows, 12 columns
  ...
  Done. Profiles saved.
```

Scherlok profiles every table: row counts, column types, NULL rates, value distributions, freshness cadence, cardinality. Stores everything locally in SQLite.
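
To make the idea concrete, here is a minimal sketch of profiling one table and saving the result to SQLite. It uses an in-memory SQLite database as a stand-in for the source, and the `profiles` table layout and function names are assumptions for illustration, not Scherlok's actual storage schema.

```python
import json
import sqlite3
from datetime import datetime, timezone


def profile_table(source: sqlite3.Connection, table: str) -> dict:
    """Collect row count, column names, and per-column NULL rates for one table."""
    # Identifiers are interpolated directly, so only pass trusted table names.
    columns = [row[1] for row in source.execute(f'PRAGMA table_info("{table}")')]
    row_count = source.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    null_rates = {}
    for col in columns:
        nulls = source.execute(
            f'SELECT COUNT(*) FROM "{table}" WHERE "{col}" IS NULL'
        ).fetchone()[0]
        null_rates[col] = nulls / row_count if row_count else 0.0
    return {
        "table": table,
        "row_count": row_count,
        "columns": columns,
        "null_rates": null_rates,
    }


def save_profile(store: sqlite3.Connection, profile: dict) -> None:
    """Append the profile as a timestamped JSON document (hypothetical schema)."""
    store.execute(
        "CREATE TABLE IF NOT EXISTS profiles "
        "(table_name TEXT, captured_at TEXT, profile_json TEXT)"
    )
    store.execute(
        "INSERT INTO profiles VALUES (?, ?, ?)",
        (profile["table"], datetime.now(timezone.utc).isoformat(), json.dumps(profile)),
    )
    store.commit()
```

Keeping each run as a separate timestamped row means later runs can be compared against earlier ones.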

### 2. `watch` — Detect anomalies

```bash
$ scherlok watch

  Checking 12 tables against learned profiles...

  🔴 CRITICAL  orders    volume_drop     Row count dropped 52% (1,203,847 → 578,412)
  🟡 WARNING   users     null_increase   Column "email": NULL rate 2.1% → 18.7%
  🔵 INFO      products  distribution    Column "price": mean shifted 3.2σ

  3 anomalies detected. Exit code: 1
```
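
The volume check in the output above comes down to comparing the current row count with the learned one. The sketch below reproduces the arithmetic behind the `orders` line. The function names are hypothetical, and the message format simply mirrors the sample output.

```python
def volume_change(baseline_rows: int, current_rows: int) -> float:
    """Relative change in row count; negative values mean the table shrank."""
    if baseline_rows == 0:
        return 0.0  # no baseline to compare against
    return (current_rows - baseline_rows) / baseline_rows


def describe_volume_drop(baseline_rows: int, current_rows: int) -> str:
    """Format a drop the way the sample output does."""
    change = volume_change(baseline_rows, current_rows)
    return f"Row count dropped {abs(change):.0%} ({baseline_rows:,} → {current_rows:,})"


print(describe_volume_drop(1_203_847, 578_412))
# Row count dropped 52% (1,203,847 → 578,412)
```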

### 3. Alert — Slack, CI/CD, or both

```bash
# Slack
scherlok watch --webhook https://hooks.slack.com/services/...

# Discord
scherlok watch --webhook https://discord.com/api/webhooks/...

# Microsoft Teams
scherlok watch --webhook https://outlook.office.com/webhook/...

# Any endpoint (generic JSON payload)
scherlok watch --webhook https://my-api.com/alerts

# CI/CD gate (fails pipeline on CRITICAL)
scherlok watch --exit-code --fail-on critical
```

Scherlok auto-detects Slack, Discord, and Teams webhooks from the URL and formats the payload for each. Any other URL receives a generic JSON payload.
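
URL-based detection of this kind can be sketched with the standard library alone. The hostname rules and the generic payload shape below are assumptions for illustration. The Slack and Teams bodies use the plain `text` field and the Discord body uses `content`, which are the minimal fields those incoming webhooks accept.

```python
from urllib.parse import urlparse


def detect_platform(webhook_url: str) -> str:
    """Guess the chat platform from the webhook hostname."""
    host = (urlparse(webhook_url).hostname or "").lower()
    if host == "hooks.slack.com":
        return "slack"
    if host in ("discord.com", "discordapp.com"):
        return "discord"
    if host == "office.com" or host.endswith(".office.com"):
        return "teams"
    return "generic"


def build_payload(webhook_url: str, message: str) -> dict:
    """Wrap the alert text in the minimal JSON body each platform accepts."""
    platform = detect_platform(webhook_url)
    if platform == "discord":
        return {"content": message}
    if platform in ("slack", "teams"):
        return {"text": message}
    return {"message": message}  # hypothetical generic shape
```

Matching on the parsed hostname rather than on a substring of the URL avoids misclassifying an endpoint such as `https://my-api.com/alerts?via=hooks.slack.com`.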

## CI/CD Integration

Use Scherlok as a data quality gate. The `ci` command does it in one line:

```yaml
# GitHub Actions
- name: Data quality check
  run: |
    pip install scherlok
    scherlok config --store s3://my-bucket/scherlok/profiles.db
    scherlok ci "${{ secrets.DATABASE_URL }}" \
      --webhook "${{ secrets.SLACK_WEBHOOK }}" \
      --fail-on critical
```

If Scherlok detects a critical anomaly, the pipeline fails. Bad data never reaches production.
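
The gating logic can be pictured as a comparison against an ordered severity scale. The sketch below assumes that `--fail-on warning` would also fail on CRITICAL findings. That ordering, and the name `gate_exit_code`, are assumptions rather than documented behavior.

```python
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def gate_exit_code(severities: list[str], fail_on: str = "critical") -> int:
    """Return 1 when any anomaly is at or above the fail_on level, else 0."""
    threshold = SEVERITY_RANK[fail_on.lower()]
    return int(any(SEVERITY_RANK[s.lower()] >= threshold for s in severities))


# A run with one WARNING passes a gate set to "critical" but fails one set to "warning".
print(gate_exit_code(["WARNING"], fail_on="critical"))
print(gate_exit_code(["WARNING"], fail_on="warning"))
```

A CI runner treats any non-zero exit code as a failed step, so passing this value to `sys.exit` is enough to stop the pipeline.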

## Email alerts

```bash
export SCHERLOK_SMTP_HOST=smtp.gmail.com
export SCHERLOK_SMTP_USER=alerts@company.com
export SCHERLOK_SMTP_PASSWORD=app-specific-password

scherlok watch --email team@company.com --email cto@company.com
```
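
For readers wiring up their own SMTP relay, the sketch below shows how the three environment variables above map onto a standard-library mail send. The subject line, the STARTTLS-on-port-587 choice, and the function names are assumptions, and Scherlok's own email code may differ.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert_email(sender: str, recipients: list[str], summary: str) -> EmailMessage:
    """Assemble a plain-text alert addressed to every recipient."""
    msg = EmailMessage()
    msg["Subject"] = "Scherlok alert"  # hypothetical subject line
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(summary)
    return msg


def send_alert(recipients: list[str], summary: str) -> None:
    """Send the alert using the SCHERLOK_SMTP_* environment variables."""
    host = os.environ["SCHERLOK_SMTP_HOST"]
    user = os.environ["SCHERLOK_SMTP_USER"]
    password = os.environ["SCHERLOK_SMTP_PASSWORD"]
    msg = build_alert_email(user, recipients, summary)
    # Assumption: the server accepts STARTTLS on port 587, as smtp.gmail.com does.
    with smtplib.SMTP(host, 587) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(msg)
```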

## Connectors

```bash
# PostgreSQL
scherlok connect postgres://user:pass@host:5432/db

# BigQuery
pip install "scherlok[bigquery]"
scherlok connect bigquery://project-id/dataset-name

# Snowflake
pip install "scherlok[snowflake]"
export SNOWFLAKE_USER=...
export SNOWFLAKE_PASSWORD=...
export SNOWFLAKE_WAREHOUSE=...
scherlok connect snowflake://account/database/schema
```

| Database | Status |
|----------|--------|
| PostgreSQL | Available |
| BigQuery | Available |
| Snowflake | Available |
| MySQL | Coming soon |
| DuckDB | Planned |

## Remote Storage

Share profiles across CI runs and team members:

```bash
# AWS S3
scherlok config --store s3://my-bucket/scherlok/profiles.db

# Google Cloud Storage
scherlok config --store gs://my-bucket/scherlok/profiles.db

# Azure Blob Storage
scherlok config --store az://my-container/scherlok/profiles.db
```
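
Each store URL encodes the backend in its scheme, the bucket or container in its host part, and the object path in the remainder. A small sketch of that split follows. The function name and the returned tuple are hypothetical.

```python
from urllib.parse import urlparse

BACKENDS = {"s3": "AWS S3", "gs": "Google Cloud Storage", "az": "Azure Blob Storage"}


def parse_store_url(url: str) -> tuple[str, str, str]:
    """Split a store URL into (backend, bucket or container, object path)."""
    parsed = urlparse(url)
    if parsed.scheme not in BACKENDS:
        raise ValueError(f"unsupported store scheme: {parsed.scheme!r}")
    return BACKENDS[parsed.scheme], parsed.netloc, parsed.path.lstrip("/")


print(parse_store_url("s3://my-bucket/scherlok/profiles.db"))
# ('AWS S3', 'my-bucket', 'scherlok/profiles.db')
```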

## Why Not [Other Tool]?

| | Great Expectations | Soda | Monte Carlo | **Scherlok** |
|---|---|---|---|---|
| Setup time | Hours | 30 min | Weeks | **5 minutes** |
| Config required | Hundreds of rules | YAML checks | Dashboard setup | **None** |
| Anomaly detection | Manual thresholds | Paid feature | Yes | **Yes, free** |
| Self-hosted | Yes | Limited | No (SaaS) | **Yes** |
| CI/CD gate | Yes | Yes | No | **Yes** |
| Price | Free | Freemium | $50-200K/yr | **Free, forever** |

## CLI Reference

```
scherlok connect <url>                   Connect to a database
scherlok investigate                     Profile all tables (learn patterns)
scherlok watch [-w <url>] [-e <email>]   Detect anomalies and alert
scherlok ci <url> [opts]                 All-in-one CI/CD command (connect + watch + exit code)
scherlok dbt --project-dir <dir> [opts]  Profile dbt models and report anomalies
scherlok dashboard --out <file>          Generate a self-contained HTML dashboard
scherlok status                          Quick health dashboard
scherlok report                          Detailed profile summary
scherlok history [--days N]              Timeline of past anomalies
scherlok config --store <url>            Set remote storage
scherlok version                         Show version
```

## Install

```bash
pip install scherlok

# Optional extras: BigQuery, Snowflake, dbt
pip install "scherlok[bigquery]"
pip install "scherlok[snowflake]"
pip install "scherlok[dbt]"
```

Requires Python 3.10+.

## Contributing

Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md).

We're especially looking for:
- New database connectors (MySQL, DuckDB)
- Anomaly detection improvements
- Documentation and examples

## License

[MIT](LICENSE) — Developed by [Robson Bayer Müller](https://github.com/rbmuller)
