Reproducible scoring of conflict detection for agentic memory, on a six-way taxonomy.
Silent corruption is a CONTRADICTION or TEMPORAL_SUPERSESSION misjudged as
INDEPENDENT or DUPLICATE: a false fact that survives undetected and
quietly poisons everything an agent recalls later. Lower is better.
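
As a minimal sketch, the metric could be computed from (gold, predicted) label pairs as below. The function name and the choice of denominator (all scored cases) are assumptions; the benchmark may instead normalise by the number of conflicting cases.

```python
from typing import Iterable, Tuple

# Relation names as used in the text. The taxonomy's other labels are not
# named in this section, so they are left out of the sketch.
CONFLICTING = frozenset({"CONTRADICTION", "TEMPORAL_SUPERSESSION"})
BENIGN = frozenset({"INDEPENDENT", "DUPLICATE"})


def silent_corruption_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of cases where a real conflict was judged benign.

    ASSUMPTION: the denominator is the total number of scored cases.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    silent = sum(
        1
        for gold, predicted in pairs
        if gold in CONFLICTING and predicted in BENIGN
    )
    return silent / len(pairs)
```

Under these assumptions, two missed conflicts in a run of four cases give a rate of 0.5.
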
| System | Gold set | Cases | Accuracy | Silent-corruption ↓ | False alarms |
|---|---|---|---|---|---|
| Relation | Precision | Recall | Support |
|---|---|---|---|
The gold set is versioned, tagged JSONL. Every case carries a domain, an
adversarial tag, and a split assignment, including a held-out split. Labels are
correct by construction: a deterministic generator encodes the taxonomy's
semantics (functional predicate + conflicting value + overlapping
validity ⇒ contradiction, and so on), so the set scales without hand-labelling error.
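
To make "correct by construction" concrete, here is a hedged sketch of how one rule (functional predicate + conflicting value + overlapping validity ⇒ contradiction) could be encoded in a generator. It is not the project's generator: the `Claim` fields, the predicate list, the JSON keys, and the helper names are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional

# Hypothetical functional predicates: a subject has at most one value at a time.
FUNCTIONAL_PREDICATES = frozenset({"lives_in", "employer"})


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    value: str
    valid_from: int           # inclusive start, e.g. a year
    valid_to: Optional[int]   # exclusive end; None means still valid


def overlaps(a: Claim, b: Claim) -> bool:
    """True if the two half-open validity intervals intersect."""
    a_end = float("inf") if a.valid_to is None else a.valid_to
    b_end = float("inf") if b.valid_to is None else b.valid_to
    return a.valid_from < b_end and b.valid_from < a_end


def is_contradiction(new: Claim, old: Claim) -> bool:
    """Functional predicate + conflicting value + overlapping validity."""
    return (
        new.subject == old.subject
        and new.predicate == old.predicate
        and new.predicate in FUNCTIONAL_PREDICATES
        and new.value != old.value
        and overlaps(new, old)
    )


def make_contradiction_case(case_id, candidate, conflicting_value, domain, split):
    """Build one labelled case; the label follows from the rule, not a human."""
    new_claim = replace(candidate, value=conflicting_value)
    if not is_contradiction(new_claim, candidate):
        raise ValueError("inputs do not satisfy the contradiction rule")
    return {
        "id": case_id,
        "domain": domain,
        "adversarial": False,
        "split": split,
        "candidate": asdict(candidate),
        "new_claim": asdict(new_claim),
        "label": "CONTRADICTION",
    }


candidate = Claim("alice", "lives_in", "Paris", 2020, None)
case = make_contradiction_case("c-0001", candidate, "Berlin", "biography", "held_out")
print(json.dumps(case))  # one line of JSONL
```

Because the label is derived from the same predicate that builds the pair, a case that violates the rule is rejected instead of being mislabelled.
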
# no API key — the heuristic baseline
uv run python eval/bench.py --subject heuristic --gold v1 --split all
# the Vault (Claude judge)
ANTHROPIC_API_KEY=… uv run python eval/bench.py --subject vault --gold v1 --limit 120
# CI gate: fail the build on a silent-corruption / accuracy regression
uv run python eval/bench.py --subject vault --max-silent-corruption 0.005 --min-accuracy 0.85
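
The gate in the last command can be pictured as a threshold check that becomes a process exit code. The sketch below is not the code in eval/bench.py, which this section does not show; the function name, the metric keys, and the strictness of the comparisons are assumptions, and the input numbers are made up for illustration.

```python
from typing import Dict, List, Optional


def gate_failures(
    metrics: Dict[str, float],
    max_silent_corruption: Optional[float] = None,
    min_accuracy: Optional[float] = None,
) -> List[str]:
    """Return one message per violated threshold; an empty list means pass."""
    failures = []
    if (
        max_silent_corruption is not None
        and metrics["silent_corruption"] > max_silent_corruption
    ):
        failures.append(
            f"silent corruption {metrics['silent_corruption']:.3f} "
            f"exceeds {max_silent_corruption:.3f}"
        )
    if min_accuracy is not None and metrics["accuracy"] < min_accuracy:
        failures.append(
            f"accuracy {metrics['accuracy']:.3f} is below {min_accuracy:.3f}"
        )
    return failures


# Made-up metrics, not benchmark results.
failures = gate_failures(
    {"accuracy": 0.91, "silent_corruption": 0.012},
    max_silent_corruption=0.005,
    min_accuracy=0.85,
)
for message in failures:
    print(message)
exit_code = 1 if failures else 0  # a CI runner fails the build on non-zero
```
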
A subject implements one method — classify(new_claim, candidate) → relation. Memory
systems are compared with a behavioural probe: seed the candidate fact, add the
new one, and map the system's reaction (added / updated / invalidated / deduped) onto the taxonomy.
Adapters for Mem0 and Zep ship in eval/competitors.py.
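
The probe described above can be sketched as follows. This is an illustration only, not the adapters in eval/competitors.py: `ToyMemory`, `ProbeSubject`, and the reaction-to-relation table are hypothetical, and the table in particular is an assumed mapping that the real benchmark may define differently.

```python
from typing import Callable, Dict, List, Protocol


class Subject(Protocol):
    """The single-method interface from the text."""

    def classify(self, new_claim: str, candidate: str) -> str: ...


class ToyMemory:
    """Hypothetical stand-in for a memory system; it only dedupes exact repeats."""

    def __init__(self) -> None:
        self.facts: List[str] = []

    def add(self, fact: str) -> str:
        if fact in self.facts:
            return "deduped"
        self.facts.append(fact)
        return "added"


# ASSUMED mapping from observed reactions onto relation labels.
ASSUMED_MAPPING: Dict[str, str] = {
    "added": "INDEPENDENT",
    "deduped": "DUPLICATE",
    "updated": "TEMPORAL_SUPERSESSION",
    "invalidated": "CONTRADICTION",
}


class ProbeSubject:
    """Wraps a memory system so it can be scored through classify()."""

    def __init__(
        self,
        make_memory: Callable[[], ToyMemory],
        reaction_to_relation: Dict[str, str],
    ) -> None:
        self._make_memory = make_memory
        self._reaction_to_relation = reaction_to_relation

    def classify(self, new_claim: str, candidate: str) -> str:
        memory = self._make_memory()       # fresh store for every case
        memory.add(candidate)              # 1. seed the candidate fact
        reaction = memory.add(new_claim)   # 2. add the new one
        return self._reaction_to_relation[reaction]  # 3. map the reaction


probe: Subject = ProbeSubject(ToyMemory, ASSUMED_MAPPING)
print(probe.classify("alice lives in Paris", "alice lives in Paris"))
print(probe.classify("alice lives in Berlin", "alice lives in Paris"))
```

The toy store never detects conflicts, so the second call treats a conflicting claim as unrelated, which is exactly the failure the benchmark is built to surface.
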
To add yours, subclass ProbeAdapter, register it, and open a pull request with the
reproducible run. We score every submission on the same gold set and the same machine.
The mapping is deliberately generous to competitors: when a system collapses contradiction and supersession into a single "update", we count it as the non-silent class, so the coarseness of our mapping never inflates anyone's silent-corruption rate.
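
One way to read that rule is sketched below, assuming the reaction names from the probe description and assuming that both "updated" and "invalidated" count as noticing a conflict. The function and constant names are hypothetical.

```python
CONFLICTING = frozenset({"CONTRADICTION", "TEMPORAL_SUPERSESSION"})

# ASSUMPTION: reactions that show the system noticed a conflict, even if it
# cannot say which kind. A collapsed "updated" lands here.
NON_SILENT_REACTIONS = frozenset({"updated", "invalidated"})


def is_silent_corruption(gold: str, reaction: str) -> bool:
    """A conflict is silent only if the system reacted as if nothing clashed."""
    return gold in CONFLICTING and reaction not in NON_SILENT_REACTIONS
```

With this reading, a system that answers "updated" to a gold CONTRADICTION may still lose accuracy on the fine-grained label, but it is never charged with silent corruption.
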