Metadata-Version: 2.4
Name: pdrift
Version: 0.1.0
Summary: Snapshot testing for LLM prompts - catch meaningful output drift, ignore harmless rephrasing. Local embeddings, no API key.
Project-URL: Homepage, https://github.com/MIthunvasanth/pdrift
Project-URL: Repository, https://github.com/MIthunvasanth/pdrift
Project-URL: Issues, https://github.com/MIthunvasanth/pdrift/issues
Project-URL: Changelog, https://github.com/MIthunvasanth/pdrift/blob/main/CHANGELOG.md
Author: Mithun
License: MIT
License-File: LICENSE
Keywords: drift,embeddings,llm,prompts,regression,snapshot,testing
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Requires-Dist: fastembed>=0.4
Requires-Dist: numpy>=1.24
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: tomli>=2.0; python_version < '3.11'
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: build>=1.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Description-Content-Type: text/markdown

# pdrift

[![PyPI](https://img.shields.io/pypi/v/pdrift)](https://pypi.org/project/pdrift/)
[![CI](https://github.com/MIthunvasanth/pdrift/actions/workflows/ci.yml/badge.svg)](https://github.com/MIthunvasanth/pdrift/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

**You changed a prompt (or a model version) and now you don't know which of your LLM outputs silently changed meaning.** pdrift is snapshot testing for prompts — like jest snapshots, but judged semantically instead of byte-by-byte.

```bash
pip install pdrift
```

```python
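# pdrift_answers.py
from pdrift import case

def my_llm_call(prompt: str) -> str:
    """Placeholder: swap in your real model call."""
    raise NotImplementedError

@case(inputs="cases.jsonl")          # one JSON object per line: {"id": "...", "input": "..."}
def answers(input: str) -> str:
    return my_llm_call(input)        # any function that returns a string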
```

```bash
pdrift snapshot   # run the suite, record baseline outputs
pdrift check      # re-run and flag *meaningful* drift — exit 1 in CI
```
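
If you don't have a `cases.jsonl` yet, a few lines of Python can generate one. This is a minimal sketch of the one-object-per-line format described in the comment above; the ids and inputs are made-up examples, and `to_jsonl` is a hypothetical helper, not part of pdrift.

```python
import json
from pathlib import Path

# Hypothetical example cases; replace with your own ids and inputs.
cases = [
    {"id": "greeting", "input": "Say hello to a new user."},
    {"id": "coffee-study", "input": "Summarize the coffee study in one sentence."},
]

def to_jsonl(rows: list) -> str:
    """Serialize each row as one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)

# Write the file next to pdrift_answers.py.
Path("cases.jsonl").write_text(to_jsonl(cases), encoding="utf-8")
```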

## See it catch a flipped meaning

A one-word prompt tweak turns *"found no link"* into *"confirmed a strong link"*. Exact-match snapshots scream at every harmless rephrase; humans skim and miss the real one. pdrift buckets each case by embedding similarity — actual output:

```
$ pdrift check
                         pdrift check (threshold 0.9)
+----------------------------------------------------------------------------+
| suite | identical | trivial | meaningful | new case | missing case | error |
|-------+-----------+---------+------------+----------+--------------+-------|
| llm   |         0 |       3 |          2 |        0 |            0 |     0 |
+----------------------------------------------------------------------------+
                           meaningful changes - llm
+-----------------------------------------------------------------------------+
| case          |    sim | baseline vs current                                |
|---------------+--------+----------------------------------------------------|
| coffee-study  | 0.7449 | - The study found no link between coffee           |
|               |        | consumption and heart disease.                     |
|               |        | + The study confirmed a strong link between coffee |
|               |        | consumption and heart disease.                     |
| deploy-status | 0.5679 | - The deployment completed successfully and no     |
|               |        | downtime was reported.                             |
|               |        | + The deployment failed and caused significant     |
|               |        | downtime across all regions.                       |
+-----------------------------------------------------------------------------+
Drift detected.
# exit code 1
```

The three rephrasings (for example, "Hello! How can I help you today?" → "Hi there! How can I assist you today?", sim 0.9684) landed in **trivial**, so the exit code stays 0 for those. The two flipped meanings got flagged. That's the whole tool.

No server, no dashboard, no API key, no YAML pipeline — a dev tool, not a platform. Baselines are JSON files in your repo; the check is a CLI command with an exit code.
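
To make "buckets each case by embedding similarity" concrete, here is a rough sketch of the comparison, assuming the similarity is cosine similarity between embedding vectors (the README does not say which measure pdrift uses). The three-dimensional vectors are toy stand-ins for real embeddings, and `cosine_similarity` is an illustrative helper, not pdrift's API.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (illustrative only).
baseline_vec = [0.9, 0.1, 0.2]
rephrase_vec = [0.88, 0.12, 0.22]   # points almost the same way as the baseline
flipped_vec = [0.1, 0.9, 0.3]       # points somewhere else entirely

THRESHOLD = 0.9  # the threshold shown in the report header above

for name, vec in [("rephrase", rephrase_vec), ("flipped", flipped_vec)]:
    sim = cosine_similarity(baseline_vec, vec)
    label = "trivial" if sim >= THRESHOLD else "meaningful"
    print(f"{name}: sim={sim:.4f} -> {label}")
```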

## Why this doesn't drown you in false positives

Two things make pdrift trustworthy where naive semantic diffing isn't:

**1. Local embeddings — free, offline, no API key.** Similarity is computed with [fastembed](https://github.com/qdrant/fastembed) (ONNX, `BAAI/bge-small-en-v1.5`) on your machine. Checking costs zero dollars and zero network calls, so you can run it on every commit. Embeddings are cached per suite (keyed by output hash) — repeated checks don't even re-embed.

**2. The noise floor — the tool learns each case's natural variance.** LLMs at temperature > 0 rephrase themselves constantly. Take multiple baseline samples and pdrift measures how much *the baselines differ from each other* — the noise floor. A new output is flagged only if it's **more different from the baselines than they are from each other**:

```bash
pdrift snapshot --samples 3
```

```
                          meaningful changes - noisy
+-----------------------------------------------------------------------------+
| case       |    sim | noise floor | diff                                    |
|------------+--------+-------------+-----------------------------------------|
| water-boil | 0.6407 |      0.9633 | - The boiling point of water at sea     |
|            |        |             | level is 100 degrees Celsius.           |
|            |        |             | + Water never boils no matter how hot   |
|            |        |             | it gets; boiling is impossible.         |
+-----------------------------------------------------------------------------+
```

This case's baseline samples agree with each other at 0.9633; the new output only manages 0.6407 against the closest one, so it is flagged, with the numbers shown so you can see *why*. Meanwhile an honest paraphrase scoring 0.95 sails through: the effective bar is `min(noise floor, threshold)` (see the verdict table under How it works), which is 0.90 here with the default threshold. No hand-tuned per-case thresholds.
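
The noise-floor idea can be sketched in a few lines. Two loud assumptions: the floor is taken here as the *lowest* pairwise similarity among baseline samples (pdrift may aggregate differently), and `difflib` string similarity stands in for embedding similarity so the sketch runs without a model. The second and third baseline sentences are invented for illustration; all function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Optional

def toy_similarity(a: str, b: str) -> float:
    # Stand-in for embedding similarity; NOT what pdrift uses.
    return SequenceMatcher(None, a, b).ratio()

def noise_floor(samples: list, sim: Callable[[str, str], float]) -> Optional[float]:
    """How much the baseline samples agree with each other."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return None  # a single sample has no measurable variance
    # Assumption: use the lowest pairwise similarity as the floor.
    return min(sim(a, b) for a, b in pairs)

def closest_similarity(current: str, samples: list, sim: Callable[[str, str], float]) -> float:
    """Similarity of the new output to its closest baseline sample."""
    return max(sim(current, s) for s in samples)

baselines = [
    "The boiling point of water at sea level is 100 degrees Celsius.",
    "Water boils at 100 degrees Celsius at sea level.",
    "At sea level, water boils at 100 degrees Celsius.",
]
current = "Water never boils no matter how hot it gets; boiling is impossible."

floor = noise_floor(baselines, toy_similarity)
closest = closest_similarity(current, baselines, toy_similarity)
print(f"noise floor={floor:.4f} closest={closest:.4f}")
```

The exact numbers depend on the similarity function, so none are shown here; the point is that the floor comes from the baselines themselves rather than from a hand-picked constant.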

## How it works

Each case lands in exactly one bucket:

| verdict | meaning | exit code |
|---|---|---|
| `identical` | exact string match to a baseline sample (embeddings skipped entirely) | 0 |
| `trivial` | differs, but similarity ≥ `min(noise floor, threshold)` | 0 |
| `meaningful` | similarity below `min(noise floor, threshold)`: more different from the baselines than they are from each other | **1** |
| `new case` | in the JSONL but not in the baseline | 0 |
| `missing case` | in the baseline but gone from the JSONL | **1** |
| `error` | the target function raised (recorded, never crashes the run) | **1** |

Baselines are pretty-printed, key-sorted JSON in `.pdrift/<suite>/baseline.json` — commit them, and prompt-output changes show up as reviewable diffs in PRs. `pdrift accept` promotes the latest check run to the new baseline after you've reviewed a change.
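
The verdict table above can be read as a small decision function. This is a sketch of the documented rules, not pdrift's source: `classify` and `exit_code` are hypothetical names, and the order in which the special cases are checked is an assumption.

```python
from typing import Optional

# Verdicts the table marks with exit code 1.
FAILING = {"meaningful", "missing case", "error"}

def classify(
    current: Optional[str],          # None: the case is gone from the JSONL
    baselines: Optional[list],       # None: the case has no baseline yet
    similarity: Optional[float] = None,  # required when outputs differ
    threshold: float = 0.90,
    noise_floor: Optional[float] = None,
    raised: bool = False,
) -> str:
    if raised:
        return "error"
    if baselines is None:
        return "new case"
    if current is None:
        return "missing case"
    if current in baselines:
        return "identical"  # exact match, no similarity needed
    # The bar from the table: min(noise floor, threshold).
    bar = threshold if noise_floor is None else min(noise_floor, threshold)
    return "trivial" if similarity >= bar else "meaningful"

def exit_code(verdicts: list) -> int:
    return 1 if any(v in FAILING for v in verdicts) else 0

# Numbers taken from the reports earlier in this README.
print(classify("new", ["old"], similarity=0.9684))                      # greeting rephrase
print(classify("new", ["old"], similarity=0.7449))                      # coffee-study
print(classify("new", ["old"], similarity=0.6407, noise_floor=0.9633))  # water-boil
```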

## JSON outputs get a structural diff

If both baseline and current outputs parse as JSON, pdrift skips embeddings and diffs the structure — keys added/removed, values changed, with dotted paths:

```
                       meaningful changes - jsonapi
+-------------------------------------------------------------------------+
| case    | sim | noise floor | diff                                      |
|---------+-----+-------------+-------------------------------------------|
| profile |   - |           - | ~ user.address.city: "Berlin" -> "Munich" |
|         |     |             | - removed user.email                      |
+-------------------------------------------------------------------------+
```

String values longer than 40 chars (summaries, bios) fall back to embedding similarity, so a rephrased description doesn't fail your schema check.
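
A structural diff with dotted paths might look roughly like the sketch below. It is an illustration of the idea, not pdrift's implementation: `diff_json` is a hypothetical name, the `+ added` and `?` line formats are invented (only `~` and `- removed` appear in the report above), lists are treated as plain values, and the long-string rule is applied when either side exceeds 40 characters.

```python
import json
from typing import Any

LONG_STRING = 40  # from the text: longer strings fall back to embedding similarity

def diff_json(old: Any, new: Any, path: str = "") -> list:
    """Return human-readable change lines, using dotted key paths."""
    changes = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}" if path else key
            if key not in new:
                changes.append(f"- removed {child}")
            elif key not in old:
                changes.append(f"+ added {child}")
            else:
                changes.extend(diff_json(old[key], new[key], child))
    elif old != new:
        both_str = isinstance(old, str) and isinstance(new, str)
        if both_str and max(len(old), len(new)) > LONG_STRING:
            # Defer to a semantic comparison instead of failing on a rephrase.
            changes.append(f"? {path}: long string, compare semantically")
        else:
            changes.append(f"~ {path}: {json.dumps(old)} -> {json.dumps(new)}")
    return changes

baseline = {"user": {"address": {"city": "Berlin"}, "email": "a@example.com"}}
current = {"user": {"address": {"city": "Munich"}}}
for line in diff_json(baseline, current):
    print(line)
```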

## Configuration (optional — zero config needed)

CLI flags override `pdrift.toml`, which overrides defaults. The report header shows the effective value and where it came from.

```toml
# pdrift.toml — everything optional
# threshold = 0.90                    # similarity at/above which a change is trivial
# samples = 1                         # baseline runs per case (3+ enables the noise floor)
# model = "BAAI/bge-small-en-v1.5"    # any fastembed-supported model

# [suite.summarizer]                  # per-suite overrides
# threshold = 0.85
# samples = 5
```
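
The precedence rule (CLI flags, then `pdrift.toml`, then defaults) plus the "where it came from" label can be sketched as a lookup that returns both value and source. Assumptions: the defaults are read off the commented example above, per-suite settings win over top-level file settings, and CLI flags win over both. `resolve` and the source labels are hypothetical, and `file_cfg` is a plain dict standing in for a parsed `pdrift.toml`.

```python
from typing import Any, Optional

# Assumed defaults, taken from the commented pdrift.toml example.
DEFAULTS = {"threshold": 0.90, "samples": 1, "model": "BAAI/bge-small-en-v1.5"}

def resolve(key: str, cli: dict, file_cfg: dict, suite: Optional[str] = None) -> tuple:
    """Return (effective value, where it came from)."""
    if cli.get(key) is not None:
        return cli[key], "cli"
    suite_cfg = file_cfg.get("suite", {}).get(suite, {}) if suite else {}
    if key in suite_cfg:
        return suite_cfg[key], f"pdrift.toml [suite.{suite}]"
    if key in file_cfg:
        return file_cfg[key], "pdrift.toml"
    return DEFAULTS[key], "default"

# What a parsed pdrift.toml could look like (illustrative values).
file_cfg = {"threshold": 0.88, "suite": {"summarizer": {"threshold": 0.85, "samples": 5}}}

print(resolve("threshold", {}, file_cfg, suite="summarizer"))
print(resolve("threshold", {"threshold": 0.95}, file_cfg, suite="summarizer"))
print(resolve("samples", {}, {}))
```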

## pytest plugin

Installed automatically. Each case becomes a pytest test:

```bash
pytest --pdrift                    # MEANINGFUL = fail, TRIVIAL/IDENTICAL = pass
pytest --pdrift --pdrift-path prompts/
```

Missing baseline → the case is skipped with "run `pdrift snapshot` first". Failure messages include the similarity, noise floor, and diff. Without `--pdrift` the plugin does nothing.

## CI: fail PRs on meaningful drift

```yaml
# .github/workflows/prompt-drift.yml
name: prompt-drift
on: pull_request
jobs:
  pdrift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pdrift
      - run: pdrift check prompts/   # exits 1 on meaningful changes
```

Because baselines live in git, the *reviewable* prompt-output diff is right there in the PR alongside the code change that caused it.

## FAQ

**Why local embeddings instead of an LLM judge or an embeddings API?**
Cost and trust. A check that costs money per run doesn't get run. Local ONNX embeddings are free, deterministic, offline, and fast enough to run on every commit. The first check downloads the model (~130 MB) once; after that, no network at all.

**My outputs are non-deterministic. Won't every check fail?**
That's the noise floor's job. Snapshot with `--samples 3` (or more): pdrift measures how much your own baselines disagree and only flags outputs that fall *below* that self-similarity. Temperature noise passes; meaning flips don't.

**What does a check cost?**
Zero. No API keys anywhere in the tool. Identical outputs skip embedding entirely, and everything embedded once is cached in `.pdrift/<suite>/embeddings.npy`.
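
For intuition, a cache keyed by output hash can be as small as the sketch below. It is in-memory only and assumes SHA-256 as the hash; pdrift's actual hashing and its on-disk `.npy` layout are not described here. `EmbeddingCache` and `fake_embed` are hypothetical names.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """Embed each distinct output text once; reuse the vector afterwards."""

    def __init__(self, embed: Callable[[str], list]):
        self._embed = embed
        self._vectors = {}   # hash of the text -> embedding vector
        self.misses = 0      # how many times the embedder actually ran

    @staticmethod
    def key(text: str) -> str:
        # Assumption: SHA-256 of the UTF-8 text is the cache key.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list:
        k = self.key(text)
        if k not in self._vectors:
            self.misses += 1
            self._vectors[k] = self._embed(text)
        return self._vectors[k]

def fake_embed(text: str) -> list:
    # Stand-in embedder so the sketch runs without a model.
    return [float(len(text))]

cache = EmbeddingCache(fake_embed)
cache.get("The deployment completed successfully.")
cache.get("The deployment completed successfully.")  # served from the cache
print(cache.misses)
```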

**Embeddings are weak at negation — can a meaning flip sneak through?**
Sometimes similarity models score negations higher than humans would. In practice, the flips we tested scored 0.57–0.74 against the 0.90 default, so they were comfortably flagged, but embedding-based comparison is a tradeoff, not magic. Multi-sample baselines tighten the bar further; tune `threshold` per suite for sensitive cases.

**Windows?**
First-class. Developed on Windows; CI runs the matrix on Ubuntu and Windows with Python 3.10–3.12.

## License

MIT
