vault-bench — the silent-corruption leaderboard

Reproducible scoring of conflict detection for agentic memory, on a six-way taxonomy.

The metric that matters is the silent-corruption rate: how often a real CONTRADICTION or TEMPORAL_SUPERSESSION is misjudged as INDEPENDENT or DUPLICATE. Each such miss is a false fact that survives undetected and quietly poisons everything an agent recalls later. Lower is better.
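As a rough illustration of how this rate could be computed from (gold, predicted) label pairs, here is a minimal sketch. The function name and the choice of denominator (all scored cases) are assumptions made for illustration and may differ from what eval/bench.py actually does.

```python
from typing import Iterable, Tuple

# Labels named in the text above. The other two classes of the
# six-way taxonomy never count as silent corruption, so they are not listed.
REAL_CONFLICTS = {"CONTRADICTION", "TEMPORAL_SUPERSESSION"}
SILENT_VERDICTS = {"INDEPENDENT", "DUPLICATE"}


def silent_corruption_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of cases where a real conflict was judged harmless.

    `pairs` holds (gold_label, predicted_label) tuples.
    Assumption: the denominator is the total number of scored cases.
    """
    scored = list(pairs)
    if not scored:
        return 0.0
    silent = sum(
        1
        for gold, predicted in scored
        if gold in REAL_CONFLICTS and predicted in SILENT_VERDICTS
    )
    return silent / len(scored)
```

Note that confusing CONTRADICTION with TEMPORAL_SUPERSESSION lowers accuracy but does not count here, because the conflict was still detected.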

Leaderboard

System | Gold set | Cases | Accuracy | Silent-corruption ↓ | False alarms

Per-class breakdown — best system

Relation | Precision | Recall | Support

The gold set

Versioned, tagged JSONL. Every case carries domain and adversarial tags and a held-out split assignment. Labels are correct by construction: a deterministic generator encodes the taxonomy's semantics (functional predicate + conflicting value + overlapping validity ⇒ contradiction, and so on), so the set scales without hand-labelling error.
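To make the one rule spelled out above concrete, here is a minimal sketch of how a generator might decide that a pair is a contradiction. The Claim fields, the half-open validity intervals, and the same-subject check are all assumptions made for illustration; the real generator and its case schema may look different.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Claim:
    # Hypothetical claim shape, not the benchmark's actual schema.
    subject: str
    predicate: str
    value: str
    valid_from: int  # inclusive start of validity (e.g. a year)
    valid_to: int    # exclusive end of validity


def validity_overlaps(a: Claim, b: Claim) -> bool:
    """True when the two half-open validity intervals share any time."""
    return a.valid_from < b.valid_to and b.valid_from < a.valid_to


def is_contradiction(new: Claim, old: Claim, functional: Set[str]) -> bool:
    """Functional predicate + conflicting value + overlapping validity."""
    return (
        new.subject == old.subject          # assumed: both claims share a subject
        and new.predicate == old.predicate
        and new.predicate in functional     # predicate admits one value at a time
        and new.value != old.value          # conflicting value
        and validity_overlaps(new, old)     # both claimed true at the same time
    )
```

Because the label follows mechanically from the fields that were generated, a case built this way cannot carry a label that disagrees with its own content.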

Run it yourself

# no API key — the heuristic baseline
uv run python eval/bench.py --subject heuristic --gold v1 --split all

# the Vault (Claude judge)
ANTHROPIC_API_KEY=… uv run python eval/bench.py --subject vault --gold v1 --limit 120

# CI gate: fail the build on a silent-corruption / accuracy regression
uv run python eval/bench.py --subject vault --max-silent-corruption 0.005 --min-accuracy 0.85

Submit your system

A subject implements one method — classify(new_claim, candidate) → relation. Memory systems are compared with a behavioural probe: seed the candidate fact, add the new one, and map the system's reaction (added / updated / invalidated / deduped) onto the taxonomy. Adapters for Mem0 and Zep ship in eval/competitors.py. To add yours, subclass ProbeAdapter, register it, and open a pull request with the reproducible run. We score every submission on the same gold set and the same machine.

The mapping is deliberately generous to competitors: when a system collapses contradiction and supersession into one "update", we count it as the non-silent class — we never inflate anyone's silent-corruption rate by our own coarseness.
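The following sketch illustrates the subject interface and the generous scoring rule described above. The Subject protocol mirrors the classify(new_claim, candidate) signature from the text, but the argument types, the reaction-to-relation table, and the helper name are assumptions made for illustration. This is not the actual ProbeAdapter API in eval/competitors.py.

```python
from typing import Protocol


class Subject(Protocol):
    # Signature taken from the text; the str types are an assumption.
    def classify(self, new_claim: str, candidate: str) -> str: ...


# Hypothetical mapping from a memory system's observed reaction to a relation.
REACTION_TO_RELATION = {
    "added": "INDEPENDENT",
    "deduped": "DUPLICATE",
    "updated": "TEMPORAL_SUPERSESSION",
    "invalidated": "CONTRADICTION",
}

# Classes where the system noticed the conflict, whichever of the two it meant.
NON_SILENT = {"CONTRADICTION", "TEMPORAL_SUPERSESSION"}


def is_silent_corruption(gold: str, reaction: str) -> bool:
    """A real conflict counts as silent only if the reaction maps outside NON_SILENT.

    A coarse "updated" is therefore never penalised as silent, even when
    the gold label is CONTRADICTION rather than TEMPORAL_SUPERSESSION.
    """
    predicted = REACTION_TO_RELATION[reaction]
    return gold in NON_SILENT and predicted not in NON_SILENT
```

Under this rule a system that only ever reports "updated" for conflicts can lose accuracy on the contradiction-versus-supersession distinction, but its silent-corruption rate is not affected.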

Taxonomy