Metadata-Version: 2.4
Name: falsify
Version: 0.2.0
Summary: Pre-registration and CI for AI-agent claims — deterministic PASS or FAIL.
Author: Cüneyt Öztürk
License: MIT
Project-URL: Homepage, https://falsify.dev
Project-URL: Repository, https://github.com/studio-11-co/falsify
Project-URL: Issues, https://github.com/studio-11-co/falsify/issues
Project-URL: Changelog, https://github.com/studio-11-co/falsify/blob/main/CHANGELOG.md
Keywords: falsifiability,pre-registration,AI,CI,claude,hackathon
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: pyyaml>=6.0
Provides-Extra: mcp
Requires-Dist: mcp>=1.0.0; extra == "mcp"
Dynamic: license-file

<img src="brand/lockup.svg" alt="falsify" width="320">

**ML evaluation claims should be locked before the experiment runs, not reported after.**

`falsify` commits a claim — metric, threshold, dataset hash, seed — as a SHA-256 manifest. Run the eval. The hash either matches or it doesn't.

```bash
$ falsify lock claim.yaml
locked: sha256:a3f9...c821

$ falsify verdict claim.yaml
PASS  accuracy 0.934 >= 0.90  (hash verified)

# tampered:
$ falsify verdict claim.yaml
TAMPERED  sha256 mismatch — spec modified after locking  (exit 3)
```

4 reference implementations — Python, JavaScript, Go, Rust — byte-equivalent on the 12 v0.1 conformance vectors (8 v0.2 candidates ship alongside, full 20-vector parity targeted for v0.2 freeze 2026-05-22). Designed for ML eval rigor. Maps to EU AI Act Article 12 evidence as a side effect.

> **Pre-registration + CI for AI-agent claims.** Lock the claim and threshold with SHA-256 *before* running the experiment — or the result doesn't count.

![CI](https://github.com/studio-11-co/falsify/actions/workflows/falsify.yml/badge.svg)
![Multi-lang Conformance](https://github.com/studio-11-co/falsify/actions/workflows/multi-lang-conformance.yml/badge.svg)
![PyPI](https://img.shields.io/pypi/v/falsify?color=brightgreen&label=pypi)
![coverage](https://img.shields.io/badge/tests-564%20passing-brightgreen)
![impls](https://img.shields.io/badge/reference%20impls-4%20(py%20%C2%B7%20js%20%C2%B7%20go%20%C2%B7%20rs)-brightgreen)
![honesty](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/studio-11-co/falsify/main/.falsify/badge.json)
![python](https://img.shields.io/badge/python-3.11%2B-blue)
![license](https://img.shields.io/badge/license-MIT-blue.svg)
[![SchemaStore](https://img.shields.io/badge/schema-in%20SchemaStore-blue.svg)](https://github.com/SchemaStore/schemastore/pull/5673)
[![DOI](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.20177839-blue.svg)](https://doi.org/10.5281/zenodo.20177839)

> Code: MIT. "FALSIFY" name and chevron logo: ™ reserved. See [NOTICE](NOTICE) · [docs/COMMERCIAL.md](docs/COMMERCIAL.md).

---

**Open `*.prml.yaml` in your IDE:** as of May 2026 the PRML JSON Schema is in the [SchemaStore](https://github.com/SchemaStore/schemastore/pull/5673) catalog. VS Code, JetBrains IDEs, Helix, Zed, and anything using `yaml-language-server` autocomplete and validate manifest files out of the box. No config.

**Try it without installing:** [`registry.falsify.dev`](https://registry.falsify.dev) — paste a PRML manifest, get a SHA-256 permalink and a README badge. No account, no server-side state beyond the hash.

**Add it to your CI in five lines:** [`studio-11-co/prml-verify-action@v1`](https://github.com/studio-11-co/prml-verify-action) — composite GitHub Action wrapping the falsify CLI ([listed on the GitHub Marketplace](https://github.com/marketplace/actions/prml-verify)). Block merges on tampered or regressed eval claims. Optional public registry anchor.

**Already on MLflow?** [`pip install mlflow-falsify`](https://pypi.org/project/mlflow-falsify/) — discoverable plugin that tags every MLflow run with the PRML manifest hash, version, metric, comparator, threshold, and dataset id. Zero code changes to your existing MLflow workflow. Source: [`studio-11-co/mlflow-falsify`](https://github.com/studio-11-co/mlflow-falsify).

**Need it locked for one of your published claims?** [`falsify.dev/sprint`](https://falsify.dev/sprint) — Diagnostic Sprint, fixed-scope engagement for regulated AI teams. PRML manifest authored, verifier deployed in CI, audit report shipped. Pricing scoped per client; single-claim review available as a sub-procurement option.

**What is PRML?** [`falsify.dev/what-is-prml`](https://falsify.dev/what-is-prml) — plain-English answer page.

---

> **Latest — 2026-05-14** · v0.1 published on Zenodo: citable DOI [10.5281/zenodo.20177839](https://doi.org/10.5281/zenodo.20177839). PRML JSON Schema [merged into SchemaStore](https://github.com/SchemaStore/schemastore/pull/5673) (2026-05-11) by Mads Kristensen (Microsoft) — `.prml.yaml` files now autocomplete in VS Code, JetBrains, Helix, Zed, and Cursor. [OECD.AI Catalogue submission](https://oecd.ai/en/catalogue/tools) filed, vetting in progress. NIST AI 800-2 late comment archived. JTC 21 routed via Dr. Sebastian Hallensleben. `registry.falsify.dev` live with README badges at `registry.falsify.dev/badge/<hash>.svg`. **v0.1.4 released** ([release notes](https://github.com/studio-11-co/falsify/releases/tag/v0.1.4) · `pip install falsify==0.1.4`). PRML v0.1 specification published with **four reference implementations** (Python · [JavaScript](impl/js/) · [Go](impl/go/) · [Rust](impl/rust/)) all reproducing the [12 v0.1 vectors](spec/test-vectors/v0.1/) and [8 v0.2 candidate vectors](spec/test-vectors/v0.2/) byte-for-byte (20 vectors total). [14-page arXiv preprint](spec/paper/) and [v0.2 RFC](https://spec.falsify.dev/v0.2-rfc) (freeze 2026-05-22) open for public review.

---

## The problem

Your team claims the model hits **94% accuracy**. You ship it. Three weeks later a customer proves the real number is **71%**.

The claim was never *falsifiable*. Nobody wrote down — cryptographically, before the experiment ran — what "94%" meant, which dataset, which metric, which threshold. So when the number changed, nobody could say whether the claim was wrong, the data drifted, or the metric got silently relaxed.

**This isn't what MLflow, Docker, or Model Cards do.** MLflow tracks what happened during a run. Docker and DVC let you re-run. Model Cards and Datasheets describe a model after it ships. All three are *post-hoc* — written after the result is known. PRML is *pre-hoc*: the claim is committed before the result is observable. After the fact every other tool still lets you quietly adjust the story; PRML changes the hash and breaks the audit trail the moment you try.

PRML does not prove an ML result is true. It proves that a specific evaluation claim was committed before it could be silently rewritten. That is a smaller guarantee than reproducibility — and a different one.

**Falsify fixes this with a single idea from science:** you must pre-register the claim *before* you run the experiment. If you change the spec after seeing the data, the hash changes, the audit trail breaks, and CI fails with exit code 3.

    $ falsify lock accuracy_claim        # SHA-256 the spec
    $ falsify run  accuracy_claim        # reproducible experiment
    $ falsify verdict accuracy_claim     # exit 0 = PASS, 10 = FAIL, 3 = tampered

Deterministic exit codes are the API. CI gates on them. Humans read the audit trail. The claim either survives contact with the data or it doesn't.

---

## 60-second demo

[![60-second walkthrough — paste YAML, lock, get SHA-256, drop badge](https://spec.falsify.dev/demo/demo.gif)](https://spec.falsify.dev/demo/)

*Click for the live looping version, or watch the [MP4](https://spec.falsify.dev/demo/demo.mp4). Full storyboard in [`docs/DEMO_SCRIPT.md`](docs/DEMO_SCRIPT.md).*

[**▶ Watch the longer 90-second walkthrough on YouTube**](https://youtu.be/vVZTNeak5PA) (lock, run, tamper, CI block).

---

## Why this matters

Every week another paper, blog post, or product launch claims an AI metric that quietly evaporates under scrutiny. It's not usually malice — it's that the claim was never structured to be falsifiable. Falsify is the smallest possible tool that forces that structure.

- **ML teams** — gate deploys on pre-registered accuracy / NDCG / recall
- **DevOps** — treat p95 latency claims the same way you treat tests
- **LLM pipelines** — pin prompt + eval + threshold so "it works" means something
- **Research** — replicate a paper by running its spec.lock.json

See [docs/CASE_STUDIES.md](docs/CASE_STUDIES.md) for three concrete adoption stories.

---

**Current version:** 0.1.0 — run `python3 falsify.py --version`.
**Working with Claude Code?** See [CLAUDE.md](CLAUDE.md).

---

## Specification artifacts

Falsify is the reference implementation of **PRML v0.1** — Pre-Registered ML Manifest Specification. The spec, conformance suite, and adjacent documents live under `spec/`:

- **[`spec/PRML-v0.1.md`](spec/PRML-v0.1.md)** — the spec (RFC-style, CC BY 4.0)
- **[`spec/test-vectors/v0.1/`](spec/test-vectors/v0.1/)** — 12 conformance vectors with locked SHA-256 digests
- **[`spec/analysis/positioning-v0.1.md`](spec/analysis/positioning-v0.1.md)** — PRML vs in-toto / SLSA / Model Cards / HELM / ClinicalTrials.gov
- **[`spec/analysis/canonicalization-portability-v0.1.md`](spec/analysis/canonicalization-portability-v0.1.md)** — three cross-language findings from the JS second implementation
- **[`spec/compliance/AI-Act-mapping-v0.1.md`](spec/compliance/AI-Act-mapping-v0.1.md)** — EU AI Act Article 12/17/18/50/72/73 mapping
- **[`spec/compliance/landing.md`](spec/compliance/landing.md)** — compliance-audience landing copy
- **[`spec/paper/`](spec/paper/)** — 14-page arXiv preprint (LaTeX, CC BY 4.0)
- **[`spec/v0.2/ROADMAP.md`](spec/v0.2/ROADMAP.md)** — v0.2 RFC roadmap (freeze 2026-05-22)

**Audit & compliance crosswalks** (subcategory-by-subcategory maps from major AI governance frameworks to PRML fields, FULL/PARTIAL/NONE tagged):

- **[EU AI Act Article 12](https://spec.falsify.dev/eu-ai-act/article-12/)** — code-level pattern for the 2 August 2026 high-risk applicability deadline
- **[Article 12 readiness diagnostic](https://spec.falsify.dev/article-12-readiness/)** — 10-question browser-only self-assessment
- **[NIST AI RMF 1.0 crosswalk](https://spec.falsify.dev/nist-ai-rmf/)** — GOVERN / MAP / MEASURE / MANAGE subcategory map (incl. AI 600-1 GenAI Profile)
- **[ISO/IEC 42001:2023 crosswalk](https://spec.falsify.dev/iso-42001/)** — AIMS clause-by-clause evidence map (Clauses 7-9 + Annex A controls)

**Reference implementations** (four languages, 12 v0.1 + 8 v0.2 candidate vectors = 20 total; multi-lang CI runs all 20 byte-for-byte per push and daily at 04:00 UTC):

- **Python:** [`falsify.py`](falsify.py) — original reference, uses PyYAML
- **Node.js:** [`impl/js/`](impl/js/) — second reference, ~400 LOC, hand-rolled, zero deps
- **Go:** [`impl/go/`](impl/go/) — third reference, ~450 LOC, hand-rolled, stdlib only
- **Rust:** [`impl/rust/`](impl/rust/) — fourth reference, ~600 LOC, hand-rolled, two deps (`serde_json`, `sha2`)

Hosted spec at [spec.falsify.dev/v0.1](https://spec.falsify.dev/v0.1). Public review thread at [GitHub Discussion #6](https://github.com/studio-11-co/falsify/discussions/6). Comments via `hello@studio-11.co`.

**Companion projects** (separate repos under `studio-11-co`, each MIT or CC0 licensed):

- **[`falsify-cookbook`](https://github.com/studio-11-co/falsify-cookbook)** — field manual for the spec: 11 patterns + 4 anti-patterns, every one a single page with a runnable example, including [Pattern 11: PRML + Sigstore for execution integrity](https://github.com/studio-11-co/falsify-cookbook/blob/main/patterns/11-sigstore-execution.md) closing the §8.1 gap. CC0.
- **[`falsify-integrity-index`](https://github.com/studio-11-co/falsify-integrity-index)** — public scorecard of how 25+ well-known ML eval claims meet the 9 PRML falsifiability criteria. Live at [falsify.dev/integrity](https://falsify.dev/integrity). CC0 data, MIT tooling.
- **[`falsify-inspect`](https://github.com/studio-11-co/falsify-inspect)** — Inspect AI adapter: anchor an Inspect AI eval claim's threshold to a SHA-256 hash before the run, verify the post-run log against it. MIT.
- **[`prml-verify-action`](https://github.com/studio-11-co/prml-verify-action)** — composite GitHub Action ([listed on Marketplace](https://github.com/marketplace/actions/prml-verify)) for CI integration. MIT.
- **[`mlflow-falsify`](https://github.com/studio-11-co/mlflow-falsify)** — MLflow plugin (`pip install mlflow-falsify`) auto-tags every run with the PRML manifest hash. MIT.
- **[`falsify-js`](https://github.com/studio-11-co/falsify-js)** — JS reference implementation, [`npm install falsify-js`](https://www.npmjs.com/package/falsify-js). MIT.

---

## Why

AI agents make empirical claims all day — *"accuracy is up"*, *"the
new retriever is faster"*, *"this filter catches every edge case"*.
We rarely pin down the threshold, the metric, or the stopping rule
before the data arrives.

Without pre-registration, every verdict is post-hoc rationalization:
the goalposts move a little, the sample is chosen a little, the
winning explanation is kept.

Falsification Engine forces scientific discipline onto that loop.
You declare the test, lock the spec with a cryptographic hash, run
the experiment, and read the exit code. PASS or FAIL is mechanical,
not rhetorical — and CI enforces it on every push.

## What you get

- A single-file CLI (`falsify`) with **18 subcommands**: `init`,
  `lock`, `run`, `verdict`, `guard`, `list`, `stats`, `diff`, `hook`,
  `doctor`, `version`, `export`, `verify`, `replay`, `why`, `trend`,
  `score`, `bench`.
- A `commit-msg` git hook that blocks commits whose messages
  contradict a locked verdict.
- A GitHub Actions workflow that re-verdicts every push and PR
  across Python 3.11 and 3.12.
- **Five Claude Code skills** and **two forked-context subagents**
  that draft specs, audit arbitrary text against the verdict log,
  review PR diffs for honesty violations, and keep the log itself
  fresh.

## Install

```bash
pip install falsify
```

That's it. The `falsify` command is on your `PATH`, the docs site
is at <https://falsify.dev>, and the project page is at
<https://pypi.org/project/falsify>.

Requires Python **3.11+**.

### Development install (from the repo)

```bash
git clone https://github.com/studio-11-co/falsify
cd falsify
pip install -e .
```

The `-e` editable form is for hacking on falsify itself — your
edits to `falsify.py` take effect immediately without reinstalling.

### Docker

```bash
docker build -t falsify-demo . && docker run --rm -it falsify-demo
```

Runs the auto-demo in a clean container. See
[docs/DOCKER.md](docs/DOCKER.md) for interactive and repo-mount
modes.

### pre-commit integration

Consume falsify's hooks from your own repo:

```yaml
repos:
  - repo: https://github.com/studio-11-co/falsify
    rev: v0.1.4
    hooks:
      - id: falsify-guard
      - id: falsify-doctor
```

Then `pre-commit install && pre-commit install --hook-type commit-msg`.
See [docs/PRE_COMMIT.md](docs/PRE_COMMIT.md) for the full list of
exported hooks and how this repo eats its own dog food.

## Quickstart

```bash
./demo.sh   # auto-narrated: PASS → tamper → FAIL → guard block

# Either form works — `falsify` is the installed entry point,
# `python3 falsify.py` is the uninstalled fallback.
falsify init my_claim
# edit .falsify/my_claim/spec.yaml to fill in the template
falsify lock my_claim
falsify run my_claim
falsify verdict my_claim
falsify hook install      # enable the commit-msg guard
```

Exit code `0` on PASS, `10` on FAIL. Everything else is documented
below.

New to pre-registration? Walk through [TUTORIAL.md](TUTORIAL.md) — 15 minutes, zero to first locked claim.

### Start from a template

```bash
falsify init --template accuracy
falsify lock accuracy
falsify run accuracy
falsify verdict accuracy
```

Five templates ship with a runnable spec + metric + dataset:

- `accuracy` — classifier holdout accuracy ≥ 0.80
- `latency` — p95 request latency ≤ 200 ms
- `brier` — probabilistic calibration Brier ≤ 0.25
- `llm-judge` — LLM-judge agreement rate ≥ 0.75
- `ab` — A/B test absolute lift ≥ 0.05

Each scaffolds into `claims/<name>/` (sources) and mirrors
`spec.yaml` into `.falsify/<name>/` so the CLI runtime works
without further setup. Override the default name with `--name`
or the directory with `--dir`.

### Developer commands

```bash
make install   # pip install pyyaml
make test      # run unittest suite
make smoke     # run tests/smoke_test.sh
make demo      # JUJU end-to-end (lock → run → verdict)
```

See [Makefile](Makefile) for all targets (`make help`).

Questions and objections? See [docs/FAQ.md](docs/FAQ.md) — 15
direct answers to "why not just X?" questions.

Feature matrix vs adjacent tools: [docs/COMPARISON.md](docs/COMPARISON.md).

### Explain any claim

`falsify why <name>` is the human-friendly companion to `verdict`
— it always exits `0` and tells you exactly what the next honest
move is:

```
claim: juju
state: STALE
reasoning: the spec has been edited (sha256:1038219d75a8) but no run
  exists against this hash. Last run was against sha256:164f619d4860.
locked: yes (sha256:164f619d4860, 2h ago)
last run: 2026-04-22T02:10:17+00:00 (2h ago)
next action: `falsify run <name>` to produce a fresh verdict against
  the current spec.
```

Add `--json` for a scripted pipeline, `--verbose` for full hashes
and the last five runs.

### Spot drift with a sparkline

`falsify trend <name>` draws an ASCII sparkline of the metric
across its recorded runs, marks the threshold line, and classifies
the trajectory as **improving**, **degrading**, **flat**, or
**mixed**.

```
claim: juju
threshold: 0.25 (direction: below)
runs: 20 shown (of 20)

▁▂▂▃▃▄▄▅▅▆▆▆▇▇████
                    TT
threshold=0.25 (shown)

first: 0.12 @ ... (PASS)
last:  0.23 @ ... (PASS)
min:   0.09
max:   0.23
mean:  0.17
latest verdict: PASS
trend: degrading
```

`--ascii` swaps in `_.oO#`; `--width` resizes the sparkline;
`--last` caps history (default 20, max 200).

### Measure the CLI itself

`falsify bench` spawns each subcommand under a fresh temporary
directory and records per-command latency (min / median / p95 /
max / mean / stddev). Useful as a sanity check before a release
or when investigating a suspected startup-time regression.

```bash
falsify bench --runs 5 --commands "--help,list,stats,score"
falsify bench --runs 5 --json     # machine-readable output
```

`--runs <N>` sets the timed-iteration count (default 5, capped at
100); `--warmup <N>` discards the first N spawns so JIT / import
caches stabilize before timing (default 1).

## Exit codes

| Code | Meaning                                       |
|------|-----------------------------------------------|
| 0    | PASS                                          |
| 10   | FAIL                                          |
| 2    | Bad spec / INCONCLUSIVE                       |
| 3    | Hash mismatch (spec tampered)                 |
| 11   | Guard violation (commit blocked)              |

## The Opus 4.7 layers

**Skills** (`.claude/skills/`) — in-session helpers that fire on
trigger phrases.
- `hypothesis-author` walks the user through a 5-question dialogue
  and writes a falsifiable `spec.yaml`.
- `falsify` is the orchestrator: routes any empirical claim to the
  right place in the init → lock → run → verdict pipeline.
- `claim-audit` runs a fast keyword+regex audit over pasted text
  and escalates to the `claim-auditor` subagent when paraphrases or
  >2 claims show up.
- `claim-review` reads a PR diff and flags unlocked specs, silent
  threshold edits, and `metric_fn` references to missing modules —
  runs in PR CI, exits `1` on any CRITICAL finding. See
  [`docs/PR_REVIEW.md`](docs/PR_REVIEW.md).
- `falsify-ci-doctor` ingests `make release-check` output and
  maps each FAIL gate to a likely cause and an exact fix command
  — one-shot triage when CI is red.

**Subagents** (`.claude/agents/`) — forked-context agents invoked
via the `Task` tool for heavier work.
- `claim-auditor` does the semantic cross-reference that the
  keyword-pass `claim-audit` skill deliberately skips; used on PR
  bodies, release notes, and README edits.
- `verdict-refresher` scans `.falsify/*/` for STALE, INCONCLUSIVE,
  or UNRUN verdicts and re-runs them through the CLI — keeping
  `guard` decisions trustworthy.

**Slash commands** (`.claude/commands/`) — in-IDE shortcuts that
compose the skills and CLI.
- `/new-claim <template> [name]` — guided scaffold → lock → run →
  verdict for one of the five templates.
- `/audit-claims` — repo-wide semantic audit; merges
  `list`/`stats`/`score` with findings from the `claim-audit`
  skill into a single markdown report.
- `/ship-verdict <name>` — four-gate release check (verdict,
  freshness, replay, audit-chain). Exits non-zero on any gate
  failure. Does not ship; only verifies.

**CI** (`.github/workflows/falsify.yml`) — on every push and PR,
the workflow runs the unittest suite, `tests/smoke_test.sh`, the
JUJU end-to-end (`lock` → `run` → `verdict`), a guard self-check,
and a skill-lint pass over every SKILL.md and agent file.

## Demo

- Walk through the pipeline in 5 runnable steps: [DEMO.md](DEMO.md).
- Second-by-second shooting script for the 3-minute video:
  [docs/DEMO_SHOT_LIST.md](docs/DEMO_SHOT_LIST.md).
- Four more claim types (accuracy regression, latency gate,
  prediction calibration, LLM agreement, AB test):
  [docs/EXAMPLES.md](docs/EXAMPLES.md).

## MCP integration

Expose the verdict store to Claude Desktop / Claude Code via
Model Context Protocol with four read-only tools (`list_verdicts`,
`get_verdict`, `get_stats`, `check_claim`) and three resource URIs.

```bash
pip install -e '.[mcp]'
python -m mcp_server   # speaks MCP over stdio
```

Then merge the snippet in
[`mcp_server/claude_desktop_config.example.json`](mcp_server/claude_desktop_config.example.json)
into your Claude Desktop config, pointing `cwd` at your local
clone. Every Claude session in your org can now query live
verdicts — no more *"I think the latency claim still passes"*;
Claude just asks the MCP server. Falsify itself runs without the
SDK; if `mcp` isn't installed, `python -m mcp_server` exits 2 with
a clear install hint. Full surface in
[`mcp_server/README.md`](mcp_server/README.md).

### Managed Agents (optional)

Deploy the two subagents (`verdict-refresher`, `claim-auditor`)
to Anthropic Console for scheduled and on-demand execution.
See [docs/MANAGED_AGENTS.md](docs/MANAGED_AGENTS.md) for the
setup recipe and manifests under
[`managed_agents/`](managed_agents/).

## Install the git hook

```bash
cp hooks/commit-msg .git/hooks/commit-msg
chmod +x .git/hooks/commit-msg
```

Or, as a symlink so hook updates propagate automatically:

```bash
ln -sf "$(pwd)/hooks/commit-msg" .git/hooks/commit-msg
```

## Repository layout

- `falsify.py` — single-file Python CLI, stdlib + pyyaml only.
- `impl/js/falsify.js` — Node.js second reference implementation (12/12 v0.1 + 8/8 v0.2 = 20/20 vectors). Also published to npm as [`falsify-js`](https://www.npmjs.com/package/falsify-js).
- `impl/go/falsify.go` — Go third reference implementation (20/20 vectors).
- `impl/rust/` — Rust fourth reference implementation (20/20 vectors).
- `spec/PRML-v0.1.md` + `spec/test-vectors/v0.1/` (12) + `spec/test-vectors/v0.2/` (8) — spec + conformance suite.
- `spec/analysis/` — positioning + canonicalization portability findings.
- `spec/compliance/` — EU AI Act mapping + compliance landing copy.
- `spec/paper/` — 14-page arXiv preprint (LaTeX).
- `spec/v0.2/ROADMAP.md` — v0.2 RFC roadmap.
- `hypothesis.schema.yaml` — spec schema (claim, falsification,
  experiment, environment, artifacts).
- `examples/hello_claim/` — tiny smoke-test fixture.
- `examples/juju_sample/` — anonymized 20-row prediction ledger
  for the Brier score demo.
- `hooks/commit-msg` — the guard hook.
- `tests/` — `unittest` suite plus `smoke_test.sh` end-to-end driver.
- `.claude/skills/` — the five in-session skills.
- `.claude/agents/` — the two forked-context subagents.
- `.claude/commands/` — the three slash commands.
- `.github/workflows/` — CI + PRML manifest verification.

## Self-dogfooding

Falsify uses itself. Three real claims about this codebase live
under `claims/self/`:

- `cli_startup` — CLI startup stays under 500ms median
- `test_coverage_count` — test suite has more than 400 test methods
- `claude_surface` — Claude integration ships more than 8 artifacts

Run `make dogfood` to re-verify. CI runs these on every PR.

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for release history.

## Roadmap

Two roadmaps run alongside each other:

- **CLI tool roadmap:** [ROADMAP.md](ROADMAP.md) — `falsify` features, integrations, dependencies. CLI v0.2 targeted 2026-06-15.
- **Specification roadmap:** [spec/v0.2/ROADMAP.md](spec/v0.2/ROADMAP.md) — PRML format evolution, canonicalization grammar, conformance. Spec v0.2 freeze 2026-05-22.

The CLI is downstream of the spec: when spec v0.2 freezes, CLI v0.2 follows about three weeks later. CLI v0.3 is loosely scoped for Q4 2026.

## Trust model

Falsify is a discipline tool, not a zero-trust system. For a full
enumeration of attacks defended and NOT defended, with the exact
exit code or command that catches each, see
[docs/ADVERSARIAL.md](docs/ADVERSARIAL.md). For private disclosure
of invariant breaks, see [.github/SECURITY.md](.github/SECURITY.md).

## License

MIT. See [LICENSE](LICENSE).

See [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) for community standards.
See [.github/CODEOWNERS](.github/CODEOWNERS) for module-level
reviewers and [.github/dependabot.yml](.github/dependabot.yml) for
automated dependency updates.
See [docs/GLOSSARY.md](docs/GLOSSARY.md) for definitions of every
term used across the docs.
See [docs/CASE_STUDIES.md](docs/CASE_STUDIES.md) for three concrete
adoption scenarios: ML team, DevOps team, research group.

## Built with

Claude Opus 4.7 (1M context), in three days, for the Anthropic
Built with Opus 4.7 hackathon.

**Cite the spec:** Öztürk, C. (2026). *PRML v0.1*. Zenodo. [https://doi.org/10.5281/zenodo.20177839](https://doi.org/10.5281/zenodo.20177839)
