Metadata-Version: 2.4
Name: warden-interp
Version: 0.1.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Dist: numpy>=1.24
Requires-Dist: torch>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: click>=8.1
Requires-Dist: safetensors>=0.4
Requires-Dist: huggingface-hub>=0.20
Requires-Dist: pytest>=7.4 ; extra == 'dev'
Requires-Dist: transformers>=4.40 ; extra == 'dev'
Requires-Dist: opentelemetry-sdk>=1.20 ; extra == 'dev'
Requires-Dist: opentelemetry-exporter-prometheus>=0.41b0 ; extra == 'dev'
Requires-Dist: prometheus-client>=0.19 ; extra == 'dev'
Requires-Dist: pyarrow>=14 ; extra == 'dev'
Requires-Dist: transformers>=4.40 ; extra == 'hf'
Requires-Dist: opentelemetry-sdk>=1.20 ; extra == 'production'
Requires-Dist: opentelemetry-exporter-prometheus>=0.41b0 ; extra == 'production'
Requires-Dist: prometheus-client>=0.19 ; extra == 'production'
Requires-Dist: pyarrow>=14 ; extra == 'sae-training'
Provides-Extra: dev
Provides-Extra: hf
Provides-Extra: production
Provides-Extra: sae-training
License-File: LICENSE
Summary: Circuit-level regression testing for AI systems
Keywords: interpretability,mechanistic-interpretability,sparse-autoencoder,llm,testing,regression-testing
Author-email: Ghassen Naouar <ghassennaouar7@gmail.com>
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Documentation, https://github.com/ghassenov/warden/blob/main/docs/README.md
Project-URL: Homepage, https://github.com/ghassenov/warden
Project-URL: Issues, https://github.com/ghassenov/warden/issues
Project-URL: Repository, https://github.com/ghassenov/warden

<div align="center">

<pre align="center">
  ██      ██    ██████    ████████    ████████    ██████████  ██      ██
  ██      ██  ██      ██  ██      ██  ██      ██  ██          ████    ██
  ██      ██  ██      ██  ██      ██  ██      ██  ██          ██  ██  ██
  ██  ██  ██  ██████████  ████████    ██      ██  ████████    ██  ██  ██
  ██  ██  ██  ██      ██  ██  ██      ██      ██  ██          ██    ████
  ████  ████  ██      ██  ██    ██    ██      ██  ██          ██      ██
  ██      ██  ██      ██  ██      ██  ████████    ██████████  ██      ██
</pre>

</div>

<br>

## The problem

Behavioral evals (accuracy, BLEU, LLM-as-judge) only see input→output. A
model can pass every one of them while the *mechanism* underneath silently
shifts to something brittle — a shortcut, a spurious feature, a circuit that
happens to produce the right answer for the wrong reason (feature
absorption / shortcut learning). Nothing in a standard eval suite tests
whether the internal computation itself is still doing what you think it's
doing, so this kind of drift ships unnoticed until it fails somewhere an
eval didn't cover.

## How Warden solves it

Warden lets you write declarative assertions about a model's internal
mechanism, not just its output, and run them like tests: *"is this circuit
necessary and sufficient for the behavior, and has the mechanism silently
drifted?"*

It uses sparse autoencoders to identify the features involved in a
behavior, and causal patching (ablation / activation-patching) to measure
whether that circuit is actually driving the behavior, rather than merely
correlated with it. The heavy numerics run in a Rust core (PyO3); a Python
layer handles orchestration, the DSL, and the CLI — so contracts read like
tests and slot into a normal `pytest` run or CI pipeline.
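
As a rough intuition for the two causal scores, the sketch below ablates features in a toy nonlinear model. Everything in it is an assumption made for illustration: the `behavior` function, the feature values, and the score definitions (fraction of the behavior lost when the circuit is zero-ablated, and fraction retained when only the circuit is kept). It is not Warden's implementation; the definitions Warden actually uses are in `docs/contracts.md`.

```python
# Toy illustration only: a hypothetical model over SAE-like features.
# Not Warden's implementation; the score definitions here are assumptions.

def behavior(features, readout, threshold=1.0):
    """Scalar behavior metric: a thresholded weighted sum of features."""
    total = sum(f * w for f, w in zip(features, readout))
    return max(0.0, total - threshold)

def ablate(features, keep):
    """Zero every feature whose index is not in `keep`."""
    return [f if i in keep else 0.0 for i, f in enumerate(features)]

def necessity(features, readout, circuit):
    """Fraction of the clean behavior lost when the circuit is zero-ablated."""
    clean = behavior(features, readout)
    others = set(range(len(features))) - set(circuit)
    return (clean - behavior(ablate(features, others), readout)) / clean

def sufficiency(features, readout, circuit):
    """Fraction of the clean behavior retained when only the circuit is kept."""
    clean = behavior(features, readout)
    return behavior(ablate(features, set(circuit)), readout) / clean

features = [2.0, 1.0, 0.5, 4.0]
readout = [1.0, 0.5, 1.0, 0.25]
circuit = [0, 3]  # hypothetical feature indices

nec = necessity(features, readout, circuit)
suf = sufficiency(features, readout, circuit)
print(f"necessity={nec:.3f} sufficiency={suf:.3f}")
```

Because the toy model is nonlinear, the two scores differ: removing the circuit destroys the behavior entirely, while the circuit alone recovers only part of it.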

**Full docs, one per component, with all the "why" and the real numbers
behind each claim: [`docs/`](docs/README.md).** This README is the
quickstart; `docs/` is where the depth lives.

---

## What's here

Everything below is implemented and verified against real GPT-2 small, with
no mocks and no synthetic stand-ins on the paths that matter:

| capability | what | docs |
|---|---|---|
| Circuit testing | circuit discovery, necessity/sufficiency, contract DSL, pytest plugin, CLI | [contracts.md](docs/contracts.md) |
| Drift detection | `drift` assertions, HTML reports, GitHub Action | [drift.md](docs/drift.md) |
| Production monitoring | plugin SDK, `warden sample`, OpenTelemetry/Prometheus | [production.md](docs/production.md) |
| Self-serve SAEs | `warden train-sae`, hand-derived backprop in Rust | [sae-training.md](docs/sae-training.md) |

Two deliberate simplifications (each doc above explains why):

- **Circuits** are flat top-k SAE feature lists, not full attribution graphs.
  Drift is a Jaccard-distance proxy, not graph-edit-distance.
- **`warden sample`** re-checks fixed contracts on an interval rather than
  mining circuits from unlabeled live traffic (which `discover_circuit`'s
  methodology doesn't support).
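
To make the Jaccard-distance proxy concrete, here is a minimal sketch over two flat feature lists. The function name, the feature ids, and the handling of two empty circuits are assumptions for illustration; Warden's own drift computation may differ in detail (see `docs/drift.md`).

```python
def jaccard_distance(baseline, candidate):
    """1 - |A & B| / |A | B| over two collections of feature indices."""
    a, b = set(baseline), set(candidate)
    if not a and not b:
        return 0.0  # assumption: two empty circuits count as identical
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical top-k feature ids for a baseline and a fine-tuned checkpoint.
baseline_circuit = [101, 2048, 377, 9001]
finetuned_circuit = [377, 9001, 42, 5150]

drift = jaccard_distance(baseline_circuit, finetuned_circuit)
print(f"drift={drift:.3f}")  # prints drift=0.667
```

A distance of 0 means the two checkpoints share exactly the same top-k features, and 1 means they share none.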

---

## Install

```bash
uv venv .venv && source .venv/bin/activate
uv pip install maturin
maturin develop --release
uv pip install -e ".[dev]"
```

> **Note:** `--release` is important — debug builds are ~10-20x slower for training.

---

## Quickstart

```bash
python examples/demo.py
# or:
warden run examples/ioi_circuit.warden.yaml --json-out report.json --html-out report.html
warden report report.json
```

A contract (`*.warden.yaml`) declares a model, a layer, an SAE, an eval set,
and assertions:

```yaml
name: ioi_name_mover_circuit
model: gpt2
layer: 9
sae:
  repo_id: jbloom/GPT2-Small-SAEs-Reformatted
  subfolder: blocks.9.hook_resid_pre
eval_set: ioi_eval.jsonl
assertions:
  - type: necessity
    min_score: 0.3
  - type: sufficiency
    min_score: 0.08
```
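
One way to read the `min_score` thresholds above is as plain comparisons against measured scores. The helper below is a hypothetical sketch, not Warden's API: the function name, the `>=` comparison, and the dict layout are assumptions, and the scores are copied from the sample run shown under Results.

```python
# Hypothetical helper, not Warden's API: compare measured scores against
# the `min_score` thresholds from a contract's `assertions` list.

def check_assertions(assertions, scores):
    """Return (passed, lines) for a list of {'type', 'min_score'} dicts."""
    lines = []
    passed = True
    for a in assertions:
        score = scores[a["type"]]
        ok = score >= a["min_score"]  # assumption: the threshold is inclusive
        passed = passed and ok
        status = "PASS" if ok else "FAIL"
        lines.append(f"  [{status}] {a['type']}={score:.3f} (min {a['min_score']})")
    return passed, lines

assertions = [
    {"type": "necessity", "min_score": 0.3},
    {"type": "sufficiency", "min_score": 0.08},
]
scores = {"necessity": 0.390, "sufficiency": 0.134}

passed, lines = check_assertions(assertions, scores)
print(f"contract 'ioi_name_mover_circuit': {'PASS' if passed else 'FAIL'}")
print("\n".join(lines))
```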

Contracts run real model forward passes, so both the `*.warden.yaml` items
collected by the pytest plugin and `@warden.contract`-decorated functions are
**skipped by default**; pass `pytest --warden` to opt in.

Full DSL reference, the Python decorator form, and what
necessity/sufficiency actually compute:
**[docs/contracts.md](docs/contracts.md)**.

---

## Features

- **Necessity & sufficiency** — ablate or activation-patch a discovered
  circuit and measure the causal effect on a real behavior, not just
  correlation. → [docs/contracts.md](docs/contracts.md)

- **Drift detection** — compare a checkpoint's circuit against a baseline's;
  demonstrated by catching a real regression that a fine-tune introduced.
  → [docs/drift.md](docs/drift.md)

- **Plugin SDK** — swap in your own model adapter or SAE loader, via a
  registry or a real Python entry point, no fork required.
  → [docs/production.md](docs/production.md)

- **Production monitoring** — `warden sample` re-checks contracts against a
  live checkpoint path on an interval, exporting real Prometheus/OTel
  metrics. → [docs/production.md](docs/production.md)

- **Self-serve SAE training** — `warden train-sae` for layers with no
  public dictionary; hand-derived forward/backward pass in Rust (no
  autodiff there), verified by numerical gradient checking.
  → [docs/sae-training.md](docs/sae-training.md)

- **HTML reports & GitHub Action** — a self-contained report for human
  review, and a reusable composite action to block merges on regressions.
  → [docs/drift.md](docs/drift.md), [docs/production.md](docs/production.md)
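
The numerical gradient checking mentioned under self-serve SAE training can be sketched in a few lines. This is a pure-Python toy, assuming a one-latent SAE with a ReLU encoder, squared-error reconstruction loss, and an L1 penalty; the real trainer is written in Rust and its loss and parameterization may differ (see `docs/sae-training.md`).

```python
# Toy gradient check in pure Python. The one-latent SAE and its loss are
# assumptions for illustration, not the Rust trainer's actual code.

def loss(params, x, lam=0.1):
    w0, w1, b, d0, d1 = params
    pre = w0 * x[0] + w1 * x[1] + b
    h = max(pre, 0.0)                      # ReLU encoder activation
    r0, r1 = h * d0 - x[0], h * d1 - x[1]  # reconstruction residual
    return r0 * r0 + r1 * r1 + lam * h     # squared error + L1 term (h >= 0)

def analytic_grad(params, x, lam=0.1):
    """Hand-derived gradient of `loss` with respect to each parameter."""
    w0, w1, b, d0, d1 = params
    pre = w0 * x[0] + w1 * x[1] + b
    h = max(pre, 0.0)
    r0, r1 = h * d0 - x[0], h * d1 - x[1]
    dh = 2 * r0 * d0 + 2 * r1 * d1 + lam
    dpre = dh if pre > 0 else 0.0          # ReLU gate
    return [dpre * x[0], dpre * x[1], dpre, 2 * r0 * h, 2 * r1 * h]

def numeric_grad(params, x, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up, x) - loss(down, x)) / (2 * eps))
    return grad

params = [0.5, -0.3, 0.4, 0.8, -0.2]  # w0, w1, b, d0, d1
x = [1.0, 2.0]
max_err = max(
    abs(a - n) for a, n in zip(analytic_grad(params, x), numeric_grad(params, x))
)
print(f"max abs difference: {max_err:.2e}")
```

The check is only meaningful away from the ReLU kink, so the example picks parameters whose pre-activation is clearly positive.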

---

## Results

Real numbers from real runs — not illustrative. Full detail, including two
real bugs found (and fixed) by actually running things twice, and an honest
report of where the self-serve SAE trainer currently falls short:
**[docs/results.md](docs/results.md)**.

```text
contract 'ioi_name_mover_circuit': PASS
  [PASS] necessity=0.390 (min 0.3)
  [PASS] sufficiency=0.134 (min 0.08)
```

A real fine-tune-induced regression, caught:

```text
contract 'ioi_circuit_drift_check': FAIL
  [FAIL] drift=0.824 (max 0.5)
```

A real Prometheus scrape:

```text
warden_necessity{contract="demo", ...} 0.39
```

---

## Development

```bash
cargo test                       # Rust unit tests (pure ndarray math + gradient checking, no Python needed)
maturin develop --release         # rebuild the extension after Rust changes
pytest -m "not integration"       # fast, fully offline Python tests
pytest                            # + integration tests: real GPT-2 + SAE download/forward passes
pytest --warden                   # also runs *.warden.yaml / @warden.contract items directly
```

`-m "not integration"` controls which of *this repo's own tests* touch the
network or a real model. `--warden` controls whether *contracts a Warden user
writes* run when their project's `pytest` executes. The two gates are
independent and serve different purposes. CI (`.github/workflows/ci.yml`)
runs only `cargo test` and `pytest -m "not integration"`, so it needs no
network access.

See **[docs/README.md](docs/README.md)** for the full documentation index.

