Benchmarks

Zaxy publishes benchmark evidence in two categories: same-harness Zaxy runs and external disclosures from other memory products. Keep those categories separate. Same-harness results are generated by Zaxy's benchmark CLI over committed or operator-supplied workloads. External disclosures are numbers quoted from public project pages or public benchmark analysis pages; they are useful market context, but they are not same-harness results until those systems run inside the same measurement protocol.

Current Headline

The current public Zaxy result is the archived 100-question LongMemEval-compatible run at reports/benchmarks/live-benchmark.md. It uses the cleaned LongMemEval workload, deterministic local hash embeddings, BM25 as the same-harness lexical baseline, and graph-backed Zaxy retrieval over 1,559 Eventloom events, 100 queries, 100 subjects, and 265 sessions.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | Approx tokens |
|---|---|---|---|---|---|---|---|---|
| BM25 | 0.540 | 0.500 | 1.000 | 0.710 | 0.840 | 0.870 | 85.77 | 5493 |
| Zaxy | 0.970 | 0.950 | 1.000 | 1.000 | 1.000 | 1.000 | 816.71 | 11038 |

This is the strongest public claim because it tests conversational long-memory retrieval across multi-session and temporal-reasoning questions. The report is not a full 500-question LongMemEval publication yet, and it should not be described as one. The tradeoff is explicit in the same report: BM25 is much faster and returns fewer tokens, while Zaxy substantially improves answer and multi-hop recall.

Full 500-Question LongMemEval Run

Legacy limit=10 full-set floor

The full 500-question LongMemEval-compatible hash run is archived at reports/benchmarks/longmemeval-500-hash/live-benchmark.md. It uses the cleaned LongMemEval workload, deterministic local hash embeddings, limit=10, BM25 as the same-harness lexical baseline, and Zaxy checkout retrieval over 5,372 Eventloom events, 500 queries, 500 subjects, and 948 sessions. Its workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
|---|---|---|---|---|---|---|---|---|
| BM25 | 0.560 | 0.516 | 1.000 | 0.592 | 0.770 | 0.902 | 356.67 | 433.55 |
| Zaxy checkout | 0.724 | 0.628 | 1.000 | 0.960 | 0.972 | 0.972 | 1472.11 | 2652.55 |

This legacy full-set result remains the limit=10 no-regression floor for checkout-wide changes that still run that archived harness. It is not a replacement for the stronger 100-question headline. The miss taxonomy shows the quality target clearly: Zaxy checkout now has 14 retrieval misses and 172 synthesis misses after the wedding-list answer-surface improvement. Future retrieval, checkout, Skill Memory, or backend changes should not reduce the published floor of mean score 0.626, Answer@5 0.608, citation coverage 1.000, and R@5 0.956 while they work down the synthesis-miss count.
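
The miss counts can be cross-checked against the rates in the table above. The sketch below assumes one simple reading that is not taken from the Zaxy scorer: a retrieval miss is a question with no answer-bearing session in the top 5 (1 - Recall@5), and a synthesis miss is a question that missed Answer@5 even though retrieval succeeded. The function and field names are illustrative.

```python
# Assumed decomposition (not the scorer's definition): every question that
# misses Answer@5 is either a retrieval miss or a synthesis miss.
QUESTIONS = 500
RECALL_AT_5 = 0.972  # Zaxy checkout, legacy limit=10 table
ANSWER_AT_5 = 0.628  # Zaxy checkout, legacy limit=10 table


def miss_taxonomy(questions: int, recall_at_5: float, answer_at_5: float) -> dict[str, int]:
    """Split Answer@5 misses into retrieval misses and synthesis misses."""
    retrieval_misses = round(questions * (1 - recall_at_5))
    answer_misses = round(questions * (1 - answer_at_5))
    return {
        "retrieval_misses": retrieval_misses,
        "synthesis_misses": answer_misses - retrieval_misses,
    }


taxonomy = miss_taxonomy(QUESTIONS, RECALL_AT_5, ANSWER_AT_5)
print(taxonomy)  # {'retrieval_misses': 14, 'synthesis_misses': 172}
```

Under that reading the table rates reproduce the 14 and 172 counts quoted above, which is why the remaining quality work is concentrated on synthesis rather than retrieval.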

Current same-harness backend-evaluation floor

Backend-evaluation work now uses the current limit=5 full-set control at reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.md. Its workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854, matching the pgGraph full-set comparison below.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
|---|---|---|---|---|---|---|---|---|
| BM25 | 0.516 | 0.516 | 1.000 | 0.592 | 0.770 | 0.770 | 347.47 | 406.53 |
| Zaxy checkout | 0.714 | 0.626 | 1.000 | 0.946 | 0.958 | 0.958 | 1089.53 | 2456.86 |

Use this current backend-evaluation floor when comparing projection backends or other limit=5 full-set reports. Do not compare a limit=5 backend run directly against the legacy limit=10 mean-score floor without also running a same-command Neo4j control.
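
The rule against mixing retrieval limits can be made mechanical. The following is a minimal sketch, not the behavior of zaxy benchmark-compare: the Floor class, the dictionary keys, and the decision to refuse a mismatched comparison outright are all assumptions made for illustration. The floor and run values are copied from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Floor:
    """A published no-regression floor; field names are illustrative."""
    limit: int
    mean_score: float
    answer_at_5: float
    recall_at_5: float
    citation_coverage: float


LEGACY_LIMIT_10 = Floor(limit=10, mean_score=0.626, answer_at_5=0.608,
                        recall_at_5=0.956, citation_coverage=1.0)
CURRENT_LIMIT_5 = Floor(limit=5, mean_score=0.714, answer_at_5=0.626,
                        recall_at_5=0.958, citation_coverage=1.0)

METRICS = ("mean_score", "answer_at_5", "recall_at_5", "citation_coverage")


def check_against_floor(run: dict, floor: Floor) -> list[str]:
    """Return a list of failures; an empty list means the run holds the floor."""
    if run["limit"] != floor.limit:
        # Refuse the comparison instead of silently mixing retrieval limits.
        return [f"limit mismatch: run={run['limit']} floor={floor.limit}"]
    return [
        f"{metric} {run[metric]} < {getattr(floor, metric)}"
        for metric in METRICS
        if run[metric] < getattr(floor, metric)
    ]


# pgGraph checkout row from the full 500-question comparison on this page.
pggraph_checkout = {"limit": 5, "mean_score": 0.714, "answer_at_5": 0.632,
                    "recall_at_5": 0.958, "citation_coverage": 1.0}

print(check_against_floor(pggraph_checkout, CURRENT_LIMIT_5))  # []
print(check_against_floor(pggraph_checkout, LEGACY_LIMIT_10))  # ['limit mismatch: run=5 floor=10']
```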

Skill Memory changes must pass the full 500-question guardrail before release, because the checkout skill lane shares ranking, evidence selection, prompt formatting, and MCP tool surfaces with factual memory. The Skill Memory lane may add cited procedural guidance, but it must not lower Zaxy checkout mean score, Answer@5, Recall@5, citation coverage, or the archived latency envelope unless a new public benchmark report explicitly replaces these floors. Skill Memory outcome analytics are read-only checkout diagnostics: promotion candidates, rollback candidates, and contradiction analytics can guide an agent, but they do not revise, delete, or promote a skill without an explicit skill.* event.

Projection backend changes must pass the full 500-question guardrail before release, because backend swaps can alter exact, keyword, vector, traversal, temporal, and citation behavior even when Eventloom remains the source of truth. Embedded Kuzu is the default backend after matching the answer-ready quality and citation gates; Neo4j remains the sidecar control backend for same-harness comparisons.

The experimental pgGraph adapter now has an initial same-harness backend comparison, but it remains experimental. It supports projection, exact search, keyword search, pgvector-backed vector search, invalidation, and traversal. It remains behind PROJECTION_BACKEND=pggraph, and vector search uses pgvector only when the PostgreSQL endpoint has the extension installed. pgGraph is still not eligible as a default backend until it passes the full guardrail on the same harness and has repeatable operations coverage.

pgGraph Backend Comparison

The 100-question backend comparison is archived at reports/benchmarks/longmemeval-100-pggraph-comparison/live-benchmark.md and reports/benchmarks/longmemeval-100-neo4j-comparison/live-benchmark.md. Both runs use the same cleaned LongMemEval-compatible slice, deterministic hash embeddings, limit=5, BM25 as the lexical baseline, and --zaxy-backend both.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | p95 ms | Approx tokens |
|---|---|---|---|---|---|---|---|
| BM25 | 0.500 | 0.500 | 1.000 | 0.710 | 0.840 | 98.24 | 2514 |
| pgGraph Zaxy | 0.960 | 0.960 | 1.000 | 0.980 | 0.980 | 355.37 | 5789 |
| pgGraph checkout | 0.910 | 0.910 | 1.000 | 0.950 | 0.980 | 312.62 | 5033 |
| Neo4j Zaxy | 0.960 | 0.960 | 1.000 | 1.000 | 1.000 | 667.78 | 3937 |
| Neo4j checkout | 0.930 | 0.930 | 1.000 | 0.960 | 1.000 | 625.98 | 7419 |

The full 500-question pgGraph comparison is archived at reports/benchmarks/longmemeval-500-pggraph-comparison/live-benchmark.md. It uses the full cleaned LongMemEval-compatible workload, deterministic hash embeddings, limit=5, BM25 as the lexical baseline, and --zaxy-backend both. A same-harness Neo4j checkout control run is archived at reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.md. The pgGraph run uses --reset-graph to truncate and rebuild the PostgreSQL projection tables before ingestion so repeated benchmark runs do not accumulate stale benchmark projections.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | p95 ms | Approx tokens |
|---|---|---|---|---|---|---|---|
| BM25 | 0.512 | 0.512 | 1.000 | 0.592 | 0.770 | 343.80 | 2661 |
| pgGraph Zaxy | 0.698 | 0.698 | 1.000 | 0.958 | 0.958 | 1077.11 | 4193 |
| pgGraph checkout | 0.714 | 0.632 | 1.000 | 0.948 | 0.958 | 1020.22 | 13016 |
| Neo4j checkout control | 0.714 | 0.626 | 1.000 | 0.946 | 0.958 | 1089.53 | 13431 |

The clean pgGraph run restored the full-set Recall@5 floor and passed Answer@5, citation coverage, and latency. The same-harness Neo4j checkout control on the current workload hash reached a mean score of 0.714, and pgGraph checkout also reached 0.714, so the current adapter comparison no longer shows a pgGraph-specific quality regression. Checkout token volume is higher than the previous archive because the benchmark now includes supporting facts and evidence from the model-facing Memory Checkout object even when compact contexts are present. pgGraph remains an evaluation backend only until the full 500-question floor is re-baselined on a frozen same-harness workload and operational coverage includes container bootstrap, schema reset, graph rebuild, and failure recovery.

BM25 Comparison

The current same-harness BM25 comparison is archived at reports/benchmarks/longmemeval-100-comparison/live-benchmark.md. It reruns the same 100-question LongMemEval-compatible slice with BM25 and Zaxy checkout at limit=5.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|---|
| BM25 | 0.500 | 0.500 | 1.000 | 0.710 | 0.840 | 0.840 |
| Zaxy checkout | 0.900 | 0.880 | 1.000 | 0.950 | 0.990 | 0.990 |

The practical reading is that BM25 can find many answer-bearing sessions, but it loses much more often during temporal and multi-session synthesis. Zaxy's advantage comes from checkout-level recall planning, source-first evidence selection, temporal/entity bridging, and cited context assembly. The tradeoff is latency: BM25 is much faster in this run, while Zaxy returns richer cited context.

Representative Suite

Zaxy also has a synthetic suite-v1 benchmark review in benchmark-review.md. That review covers 650 paired queries across current memory, historical memory, graph traversal, indexed documents, sanitized transcripts, and mixed cross-lane context. On that representative agent-context workload, Zaxy scored 1.000 with OpenAI text-embedding-3-small, compared with 0.520 for vector and markdown+vector baselines and 0.005 for direct markdown scanning.

Use suite-v1 to evaluate Zaxy's architectural thesis: temporal, relational, replayable agent context should beat flat chunk retrieval on tasks that require current-vs-historical truth, graph relationships, citations, and mixed context. Use LongMemEval-compatible runs to compare with public memory-product claims.

CoordinationBench

CoordinationBench is the benchmark lane for Zaxy Coordinate. It measures whether a memory system can turn multiple isolated worker sessions into one governed parent mission history. The scorer reports accepted-finding precision and recall, conflict precision and recall, stale-claim rejection, duplicate consolidation, evidence coverage, parent-checkout answerability, citation coverage, Eventloom replayability, token estimates, and brief/promotion latency.

The current official CoordinationBench adapter result is a first-party, same-harness run through the external CoordinationBench scorer. The adapter was frozen before holdout evaluation and is recorded in CoordinationBench at submissions/participants/zaxy-coordinate.adapter.json with these source hashes:

| Source | SHA-256 |
|---|---|
| src/zaxy/coordinationbench_adapter.py | d2ff5d6124e7a1f0849cac8a6afbba328bd4fa8ef0fd806203575801dc5c6e7c |
| examples/adapters/coordinationbench_zaxy_adapter.py | 3923b754fe31c81c572bc0c8bbfeb595e5fd69f6eb833746112b01c85982da00 |

The public v1 and v1-scale lanes scored perfectly, but those lanes should be described as first-party public-label reproducibility runs, not a representative leaderboard claim:

| Lane | Cases | Overall | Accepted precision | Accepted recall | Conflict recall | Stale rejection | Answerability | Evidence grounding |
|---|---|---|---|---|---|---|---|---|
| v1-audited | 10 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| v1-scale | 72 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

After freezing the adapter, the same executable was run unchanged against existing public-derived holdout workload packs. That is the more honest generalization signal:

| Holdout pack | Overall | Accepted precision | Accepted recall | Conflict recall | Stale rejection | Answerability | Evidence grounding |
|---|---|---|---|---|---|---|---|
| public-derived-mini | 0.593 | 0.667 | 0.667 | 1.000 | 0.000 | 0.000 | 0.333 |
| public-derived-wave1 | 0.644 | 0.792 | 0.875 | 0.750 | 0.000 | 0.000 | 0.375 |
| public-derived-wave2 | 0.593 | 0.833 | 0.938 | 0.125 | 0.000 | 0.000 | 0.438 |
| public-derived-wave3 | 0.598 | 0.962 | 0.962 | 0.000 | 0.000 | 0.000 | 0.462 |
| public-derived-wave4 | 0.604 | 0.977 | 0.977 | 0.000 | 0.000 | 0.000 | 0.477 |

The public-derived holdout mean is 0.606. That result is the right product signal: the replay-backed coordination layer gets strong accepted-state precision and duplicate consolidation, but the current frozen adapter still needs better source-aware final answering, stale-source interpretation, and conflict detection across public-derived cases. Until independent review and unseen workload promotion are complete, Zaxy should not market a perfect CoordinationBench score as representative performance.
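
The 0.606 figure can be reproduced from the holdout table. This sketch assumes an unweighted mean across the five packs, which is an inference from the numbers rather than a documented scorer rule.

```python
from statistics import mean

# Overall scores copied from the holdout table above.
holdout_overall = {
    "public-derived-mini": 0.593,
    "public-derived-wave1": 0.644,
    "public-derived-wave2": 0.593,
    "public-derived-wave3": 0.598,
    "public-derived-wave4": 0.604,
}

# Unweighted mean across packs (assumption), rounded to three places.
holdout_mean = round(mean(holdout_overall.values()), 3)
print(holdout_mean)  # 0.606
```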

The internal coordination-real-v1 report is archived at reports/benchmarks/coordination-real-v1/coordination-benchmark.md. It remains useful as a Zaxy development smoke test over real project history. It should not be used as the headline benchmark claim because it was produced inside the Zaxy repo and is easier to tune against than an external holdout pack. The report includes local baselines, disclosure-only adapter templates for Mem0, Agent Memory, and ActiveGraph, limitations, and reproduction commands.

The smaller coordination-v1 workload remains as the contract seed. It includes three workers, overlapping auth-failure findings, duplicate evidence, stale claims, conflicting claims, and a missing-evidence finding. Use it for adapter authors and fast protocol checks, not as the representative headline.

Run the MVP harness:

zaxy coordinate benchmark --output-dir reports/benchmarks/coordination-v1 --json

The command writes coordination-benchmark.json, coordination-benchmark.md, and the frozen workload JSON. The included flat-eventlog baseline intentionally accepts all worker findings, so it exposes the contamination problem that governed promotion is meant to solve.
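
To see why accepting every worker finding hurts precision, consider the standard set-based definitions of precision and recall. The sketch below uses toy finding IDs invented for illustration. It is not the coordination-v1 workload and not the CoordinationBench scorer, whose exact definitions may differ.

```python
def precision_recall(accepted: set[str], valid: set[str]) -> tuple[float, float]:
    """Accepted-finding precision and recall against a labeled set of valid findings."""
    true_positives = len(accepted & valid)
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / len(valid) if valid else 0.0
    return precision, recall


# Toy labels (hypothetical): one sound finding plus four that should be rejected.
worker_findings = {"f1", "f2-duplicate", "f3-stale", "f4-conflict", "f5-no-evidence"}
valid_findings = {"f1"}

# A flat event log accepts everything the workers reported.
flat_precision, flat_recall = precision_recall(worker_findings, valid_findings)

# A governed promotion step accepts only findings that survive review.
governed_precision, governed_recall = precision_recall({"f1"}, valid_findings)

print(flat_precision, flat_recall)          # 0.2 1.0
print(governed_precision, governed_recall)  # 1.0 1.0
```

Recall alone would make the flat log look fine, because it never drops a valid finding. Precision is what exposes the contamination.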

The current coordination-v1 report is published at reports/benchmarks/coordination-v1/coordination-benchmark.md. It uses workload fingerprint 4b6f01f5a0e9275bd6cd0238d439ee326d471483d5da3cc1dcc9a258d21bfafc and reports:

| System | Accepted precision | Conflict recall | Stale rejection | Parent answerability | Citation coverage |
|---|---|---|---|---|---|
| Zaxy Coordinate | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Markdown notes | 0.400 | 0.000 | 0.000 | 0.000 | 0.000 |
| BM25 worker logs | 0.333 | 0.000 | 0.000 | 0.000 | 0.000 |
| Flat transcript | 0.200 | 0.000 | 0.000 | 0.000 | 0.000 |

The same report lists Mem0, Agent Memory, and ActiveGraph as not_run with disclosure_only claim status until a pinned runner manifest or strict result file is available. That is deliberate: CoordinationBench should make the adapter gap visible without turning placeholder templates into public claims.

External Disclosures

These rows summarize public claims from other projects. They are external disclosures, not same-harness results, because Zaxy did not execute those systems inside its benchmark harness.

| System | Public claim | Source | Interpretation |
|---|---|---|---|
| MemPalace | 96.6% raw LongMemEval R@5; 98.4% held-out hybrid R@5; tuned full-set runs reported separately | MemPalace BENCHMARKS.md, independent benchmark analysis | Strong public target for LongMemEval-style retrieval; their docs and public analysis distinguish raw, held-out, and tuned reranked results. |
| Agent Memory | 95.2% R@5 on LongMemEval-S, with BM25 + vector + graph retrieval and on-device reranking | agent-memory.dev | Direct product-positioning target for coding-agent memory with aggressive hook and viewer UX. |
| Mem0 | +26% Accuracy over OpenAI Memory on LOCOMO; 91% faster responses and 90% lower token usage than full-context approaches | mem0 LLM.md, memory-benchmarks | Different benchmark family and metric; useful as production-memory context, but not directly comparable to LongMemEval R@5. |

When writing public copy, do not collapse these into a single leaderboard. Metric families differ: R@5 retrieval, Answer@5 expected-term recall, LOCOMO judge accuracy, and token/latency reductions answer different questions.

Same-Harness Adapter Feasibility

As of May 18, 2026, competitor adapters have different readiness levels:

| System | Status | Evidence | Same-harness blocker |
|---|---|---|---|
| MemPalace | adapter candidate | The public repo documents benchmarks/longmemeval_bench.py, committed per-question results, and a no-API-key raw LongMemEval path. | Build a wrapper that exports per-query top-k contexts into Zaxy's BenchmarkRun schema without changing MemPalace ranking settings. |
| Mem0 | benchmark harness candidate | mem0ai/memory-benchmarks includes LongMemEval scripts, but the OSS path requires Docker, Qdrant, model configuration, and LLM answer/judge settings. | Separate retrieval-only evidence from answer/judge accuracy, pin backend config, and preserve token/latency accounting. |
| Agent Memory | external disclosure only | The product page reports LongMemEval-S R@5 and the retrieval stack, but it does not document a stable same-harness CLI/API contract for Zaxy to call. | Keep the claim in external disclosures until a reproducible benchmark command, dataset contract, and result export are available. |

No same-harness adapter should be published without a pinned install command, dataset mapping, retrieval limit, score mapping, latency/tokens capture, and a clear statement about whether the competitor result is retrieval recall, answer/judge accuracy, or another metric family.
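
The publication checklist above could be enforced with a small pre-publish check. This is a hypothetical sketch: the manifest keys and the example-system values are invented here and do not correspond to an existing Zaxy manifest format.

```python
# Hypothetical manifest keys, one per checklist item in the paragraph above.
REQUIRED_FIELDS = (
    "install_command",        # pinned install command
    "dataset_mapping",
    "retrieval_limit",
    "score_mapping",
    "latency_token_capture",
    "metric_family",          # retrieval recall, answer/judge accuracy, or other
)


def missing_fields(manifest: dict) -> list[str]:
    """List required checklist fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not manifest.get(name)]


# Placeholder draft for an imaginary system; every value is illustrative.
draft_manifest = {
    "install_command": "pip install example-system==0.0.0",
    "dataset_mapping": "session -> document",
    "retrieval_limit": 5,
    "latency_token_capture": True,
}

print(missing_fields(draft_manifest))  # ['score_mapping', 'metric_family']
```

A draft that returns a non-empty list stays in the external-disclosure rows until the gaps are filled.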

Backend Shootout

Embedded graph work needs a backend shootout before any default-backend change. The shootout contract compares embedded, LatticeDB, Neo4j, pgGraph, and BM25 on the same Eventloom history and query file.

It must report cold bootstrap time, first useful init time, first checkout time, append-to-projection p95, projection events per second, checkout p95, checkout p99, traversal p95, dashboard graph-load timing, returned tokens, injected tokens, citation coverage, quality against expected query terms when provided, Answer@5/Recall@5 fields for LongMemEval-compatible workloads, resident memory delta, on-disk footprint, and rebuild recovery time.

Every generated JSON report also carries a report schema version, UTC generation timestamp, source fingerprints for the Eventloom and query files, and workload fingerprints for the filtered events and normalized query specs.

Release evidence should use --require-report-metadata --require-markdown-report --require-query-results --require-git-tracked-inputs --verify-report-fingerprints. Together these flags make stale reports fail when their input Eventloom or query file changes, require the human-readable Markdown sidecar to carry matching provenance, back aggregate metrics with per-query diagnostics, and prevent release evidence from depending on local-only benchmark inputs. The check also verifies event/query counts so tampered count metadata cannot pass as release evidence.
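
The idea behind fingerprint verification can be sketched in a few lines. This is an illustration under assumptions, not the logic of scripts/check-backend-shootout.py: the source_fingerprints key and the path-to-hash layout are hypothetical, and the real report schema may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large Eventloom files do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_inputs(report: dict) -> list[str]:
    """Return input paths whose current hash no longer matches the report."""
    return [
        path
        for path, recorded in report["source_fingerprints"].items()
        if sha256_file(Path(path)) != recorded
    ]


with tempfile.TemporaryDirectory() as tmp:
    queries = Path(tmp) / "queries.json"
    queries.write_text(json.dumps([{"query": "example"}]))
    # Hypothetical report layout: input path -> recorded SHA-256.
    report = {"source_fingerprints": {str(queries): sha256_file(queries)}}
    fresh = stale_inputs(report)

    # Editing the query file after the report was written makes the report stale.
    queries.write_text(json.dumps([{"query": "changed"}]))
    stale = stale_inputs(report)

print(fresh)       # []
print(len(stale))  # 1
```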

The local harness is:

python scripts/backend-shootout.py \
  --eventloom-path .eventloom \
  --session-id default \
  --queries-file reports/backend-shootout/queries.json \
  --output reports/backend-shootout/backend-shootout.json

Validate a labeled active-backend report before treating it as release evidence:

python scripts/check-backend-shootout.py \
  reports/backend-shootout/backend-shootout.json \
  --require-report-metadata \
  --require-markdown-report \
  --require-query-results \
  --require-git-tracked-inputs \
  --verify-report-fingerprints \
  --require-backends embedded,bm25 \
  --forbid-backends neo4j,pggraph,latticedb \
  --require-labeled-metrics \
  --require-dashboard-source embedded=embedded \
  --min-answer-at-5 0.5 \
  --min-recall-at-5 0.5 \
  --min-citation-coverage 1.0 \
  --min-quality-per-1k-injected-tokens embedded=1.0 \
  --min-answer-at-5-per-1k-injected-tokens embedded=1.0 \
  --max-cold-bootstrap-ms embedded=250 \
  --max-first-checkout-ms embedded=25 \
  --max-append-to-projection-p95-ms embedded=50 \
  --max-resident-memory-delta-bytes embedded=256000000 \
  --max-on-disk-footprint-bytes embedded=256000000 \
  --max-dashboard-graph-load-ms embedded=250 \
  --max-checkout-p99-ms embedded=25 \
  --max-exact-p99-ms embedded=10 \
  --max-keyword-p99-ms embedded=5 \
  --max-vector-p99-ms embedded=5 \
  --max-traversal-p99-ms embedded=5

Validate the medium-scale embedded performance report before treating embedded projection throughput as protected release evidence:

python scripts/check-backend-shootout.py \
  reports/backend-shootout/longmemeval-40-backend-shootout.json \
  --require-report-metadata \
  --require-markdown-report \
  --require-query-results \
  --require-git-tracked-inputs \
  --verify-report-fingerprints \
  --require-backends embedded,bm25 \
  --forbid-backends neo4j,pggraph,latticedb \
  --require-labeled-metrics \
  --require-dashboard-source embedded=embedded \
  --min-citation-coverage 1.0 \
  --min-projection-events-per-second embedded=40 \
  --max-cold-bootstrap-ms embedded=250 \
  --max-first-useful-init-ms embedded=15000 \
  --max-first-checkout-ms embedded=50 \
  --max-append-to-projection-p95-ms embedded=35 \
  --max-resident-memory-delta-bytes embedded=768000000 \
  --max-on-disk-footprint-bytes embedded=256000000 \
  --max-dashboard-graph-load-ms embedded=500 \
  --max-rebuild-recovery-ms embedded=15000 \
  --max-checkout-p95-ms embedded=100 \
  --max-checkout-p99-ms embedded=85 \
  --min-quality-per-1k-returned-tokens embedded=0.10 \
  --min-answer-at-5-per-1k-returned-tokens embedded=0.10 \
  --min-quality-per-1k-injected-tokens embedded=0.10 \
  --min-answer-at-5-per-1k-injected-tokens embedded=0.10 \
  --max-exact-p95-ms embedded=15 \
  --max-exact-p99-ms embedded=10 \
  --max-keyword-p95-ms embedded=75 \
  --max-keyword-p99-ms embedded=40 \
  --max-vector-p95-ms embedded=25 \
  --max-vector-p99-ms embedded=35 \
  --max-traversal-p95-ms embedded=10 \
  --max-traversal-p99-ms embedded=10

Validate the 100-query embedded scale report before treating broader embedded runtime behavior as protected release evidence:

python scripts/check-backend-shootout.py \
  reports/backend-shootout/longmemeval-100-backend-shootout.json \
  --require-report-metadata \
  --require-markdown-report \
  --require-query-results \
  --require-git-tracked-inputs \
  --verify-report-fingerprints \
  --require-backends embedded,bm25 \
  --forbid-backends neo4j,pggraph,latticedb \
  --require-labeled-metrics \
  --require-dashboard-source embedded=embedded \
  --min-recall-at-5 0.90 \
  --min-citation-coverage 1.0 \
  --min-projection-events-per-second embedded=35 \
  --max-cold-bootstrap-ms embedded=600 \
  --max-first-useful-init-ms embedded=45000 \
  --max-first-checkout-ms embedded=150 \
  --max-append-to-projection-p95-ms embedded=40 \
  --max-resident-memory-delta-bytes embedded=1700000000 \
  --max-on-disk-footprint-bytes embedded=512000000 \
  --max-dashboard-graph-load-ms embedded=500 \
  --max-rebuild-recovery-ms embedded=45000 \
  --max-checkout-p95-ms embedded=200 \
  --max-checkout-p99-ms embedded=250 \
  --min-quality-per-1k-returned-tokens embedded=0.15 \
  --min-answer-at-5-per-1k-returned-tokens embedded=0.15 \
  --min-quality-per-1k-injected-tokens embedded=0.15 \
  --min-answer-at-5-per-1k-injected-tokens embedded=0.15 \
  --max-exact-p95-ms embedded=10 \
  --max-exact-p99-ms embedded=12 \
  --max-keyword-p95-ms embedded=20 \
  --max-keyword-p99-ms embedded=15 \
  --max-vector-p95-ms embedded=15 \
  --max-vector-p99-ms embedded=20 \
  --max-traversal-p95-ms embedded=10 \
  --max-traversal-p99-ms embedded=10

This performance guardrail intentionally checks total checkout latency, projection/resource costs, and the retrieval lanes that produce context. resident_memory_delta_bytes and on_disk_footprint_bytes keep the embedded runtime honest about local machine cost instead of only optimizing query latency. quality_per_1k_returned_tokens, answer_at_5_per_1k_returned_tokens, quality_per_1k_injected_tokens, and answer_at_5_per_1k_injected_tokens protect the token-efficiency goal directly, while the exact, keyword, vector, and traversal p95 ceilings make it harder for one degraded lane to hide behind an acceptable aggregate checkout p95.

The default active backend set is embedded and bm25. This keeps routine shootouts sidecar-free while still comparing Zaxy's embedded projection against a lexical baseline. Neo4j, pgGraph, and LatticeDB remain supported through an explicit backend set such as --backends embedded,neo4j,bm25 when you are running controlled sidecar comparisons. LatticeDB is a parked candidate after the first graph-traversal smoke failed both quality and latency gates. Use it only for targeted follow-up, not routine active-backend shootouts. Release evidence should pass --forbid-backends neo4j,pggraph,latticedb so routine active-backend evidence stays sidecar-free until each optional backend is explicitly selected for a controlled comparison.

The --require-git-tracked-inputs flag is mandatory for active release evidence. It rejects reports whose eventloom_path or queries_file points at a local-only file, which prevents passing fingerprints that cannot be reproduced from a clean checkout. If you regenerate LongMemEval target-query files, track the replacement query inputs with the report update instead of leaving release evidence dependent on local scratch files.
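
A tracked-input check of this kind can be approximated with git ls-files --error-unmatch, which exits non-zero when a path is not tracked. The sketch below is an assumption about how such a check might work, not the implementation behind --require-git-tracked-inputs. The helper names and the injectable run parameter are hypothetical; only the eventloom_path and queries_file field names come from this page.

```python
import subprocess


def is_git_tracked(path: str, repo_root: str = ".", run=subprocess.run) -> bool:
    """True when git reports the path as tracked in the repository at repo_root."""
    result = run(
        ["git", "ls-files", "--error-unmatch", "--", path],
        cwd=repo_root,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def untracked_inputs(report: dict, repo_root: str = ".", run=subprocess.run) -> list[str]:
    """Return the report's input files that git does not track."""
    inputs = [report["eventloom_path"], report["queries_file"]]
    return [path for path in inputs if not is_git_tracked(path, repo_root, run)]
```

Calling untracked_inputs on a loaded report from inside the repository returns an empty list when both inputs are committed. The run parameter exists only so the check can be exercised with a stub where git is unavailable.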

The current focused embedded graph-traversal evidence is archived at reports/benchmarks/backend-shootout-graph-traversal-embedded-after-carry-forward. That run used mempalace-graph-traversal-v1, 10 subjects, hash embeddings, limit=5, BM25 as the baseline, and PROJECTION_BACKEND=embedded. It reports Zaxy embedded mean score 1.000, Answer@5=1.000, Recall@5=1.000, citation coverage 1.000, p50 checkout latency 18.31ms, and p95 checkout latency 31.87ms. The predecessor failed because the embedded adapter did not match Neo4j's undirected traversal semantics and did not carry active relationships forward when an entity was reasserted into a new temporal version.

A checked smoke workload is available at reports/backend-shootout/sample.eventloom with query specs in reports/backend-shootout/queries.json. It is intentionally small, so it is a reproducibility check rather than default-backend evidence. The checked report location is reports/backend-shootout/backend-shootout.json. The current checked smoke report covers embedded and bm25 with citation coverage at 1.0, report metadata, source fingerprints, and workload fingerprints. Run additional explicit backends locally when Neo4j, pgGraph, or LatticeDB infrastructure is available.

The medium-scale backend evidence is archived at reports/backend-shootout/longmemeval-40-backend-shootout.json, using reports/backend-shootout/longmemeval-40.eventloom.jsonl and reports/backend-shootout/longmemeval-40-queries-with-targets.json. This 40-question LongMemEval-compatible run is not a default-backend gate. It is a scale and surface check for the embedded runtime path.

In that report, embedded/Kuzu completed with two contract rows. The raw retrieve row scored Answer@5=0.575, Recall@5=1.0, citation coverage 1.0, checkout p95 10.55ms, lane p95s of exact 0.007ms, keyword 3.285ms, vector 2.614ms, and traversal 0.005ms. The answer_ready row scored Answer@5=1.0 and Recall@5=1.0 with checkout p95 56.541ms, mean returned tokens 3372.5, mean injected tokens 3507.55, and Answer@5 per 1k injected tokens 0.2851.

The shared projection path had cold bootstrap 225.93ms, first useful init 9347.717ms, append-to-projection p95 24.674ms, rebuild recovery 10433.54ms, projection throughput 57.007 events/sec, resident memory delta 652308480 bytes, on-disk footprint 28762112 bytes, and dashboard graph source embedded with 100 nodes and 100 edges. The BM25 control completed with Answer@5=0.55, Recall@5=1.0, citation coverage 1.0, checkout p95 169.674ms, mean returned/injected tokens 3944.9, and quality plus Answer@5 per 1k returned/injected tokens 0.1394.

The embedded rows show a real local graph can be built and served to the dashboard without a sidecar, and that answer-ready synthesis now closes the answer-surface gap while the raw retrieve row remains the operational latency contract. Use the focused graph-traversal archive above for relationship behavior, and use this 40-query report as operational evidence that the embedded path can run a larger Eventloom file through projection, checkout, token-efficiency accounting, and dashboard summary.

Explicit Kuzu bulk projection transactions and prewarmed keyword/vector caches reduced the earlier roughly 120-second projection/rebuild run to roughly 10 seconds while keeping vector retrieval enabled. This is still not the same as passing the full default-backend gate, but it moves embedded projection throughput from a product blocker into the next optimization target.

The current 100-query scale evidence is archived at reports/backend-shootout/longmemeval-100-backend-shootout.json, using reports/backend-shootout/longmemeval-100.eventloom.jsonl and reports/backend-shootout/longmemeval-100-queries-with-targets.json. This run covers 100 queries and 1,559 Eventloom events.

Embedded/Kuzu again emits separate contract rows. The raw retrieve row scored Answer@5=0.52, Recall@5=0.99, citation coverage 1.0, checkout p95 19.904ms, checkout p99 21.75ms, lane p95s of exact 0.006ms, keyword 6.08ms, vector 7.689ms, and traversal 0.006ms, with mean injected tokens 1492.24 and Answer@5 per 1k injected tokens 0.3485. The answer_ready row scored Answer@5=0.99 and Recall@5=1.0, first checkout 37.615ms, checkout p95 90.478ms, with mean injected tokens 3426.8 and Answer@5 per 1k injected tokens 0.2889.

The shared projection path had cold bootstrap 421.649ms, first useful init 29620.186ms, append-to-projection p95 26.931ms, rebuild recovery 27678.562ms, projection throughput 53.393 events/sec, resident memory delta 1604280320 bytes, on-disk footprint 57298944 bytes, dashboard source embedded, 100 dashboard nodes, and 100 dashboard edges. BM25 scored Answer@5=0.52, Recall@5=0.9, citation coverage 1.0, checkout p95 439.913ms, mean returned/injected tokens 4179.5, and Answer@5 per 1k injected tokens 0.1244.

This vector-enabled 100-query report is now strong answer-ready evidence for the embedded default, and the raw retrieve path now clears a stricter Recall@5=0.90 release floor with the embedded scale guardrail passing. It also makes the next performance target explicit: resident memory and answer-ready tail latency still deserve focused optimization. The higher cold bootstrap is intentional: startup now prewarms the Eventloom verbatim source index so the first answer-ready checkout does not pay that source-lane build cost.
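
The token-efficiency figures in the 40-query and 100-query reports above are consistent with a simple ratio: Answer@5 divided by mean tokens in thousands. The sketch below assumes that formula and four-place rounding, since it reproduces the published values. The harness's actual computation is not shown on this page, and the row labels are illustrative.

```python
def per_1k_tokens(score: float, mean_tokens: float) -> float:
    """Score earned per 1,000 tokens, rounded to four places (assumed formula)."""
    return round(score / (mean_tokens / 1000), 4)


# (Answer@5, mean injected tokens) copied from the two reports above.
report_rows = {
    "40q embedded answer_ready": (1.0, 3507.55),
    "40q bm25": (0.55, 3944.9),
    "100q embedded retrieve": (0.52, 1492.24),
    "100q embedded answer_ready": (0.99, 3426.8),
    "100q bm25": (0.52, 4179.5),
}

efficiency = {
    label: per_1k_tokens(answer_at_5, tokens)
    for label, (answer_at_5, tokens) in report_rows.items()
}

for label, value in efficiency.items():
    print(f"{label}: {value}")
```

Read this way, the raw retrieve row is the most token-efficient at 0.3485 even though its Answer@5 is much lower than the answer_ready row, which is why the guardrail tracks quality floors and per-token efficiency separately.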

BM25 is the zero-infrastructure lexical control. LatticeDB is tracked as a candidate backend behind the ProjectionStore factory, but it is not in the default active set. Its first adapter slice projects graph state and supports exact, keyword, traversal, temporal invalidation, source-retirement behavior, and Eventloom citation metadata on returned entities. It delegates vector and full-text search to LatticeDB and reports inferred-edge audit diagnostics. The current parked candidate evidence is Answer@5=0.0, mean score 0.0, and roughly 6.4s p50 checkout latency on the graph-traversal smoke. Graph backends run through Zaxy's MemoryFabric projection contract so the shootout measures the same retrieval surface that agents use. Backends with missing infrastructure should emit an error row rather than hiding the failure or aborting the whole report.
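
The error-row requirement amounts to isolating each backend's failure. A minimal sketch, assuming each backend is exposed as a callable that returns a metrics dictionary; the function names, row keys, and stub backends are hypothetical and are not the shootout script's real interface.

```python
from typing import Callable


def run_shootout(backends: dict[str, Callable[[], dict]]) -> list[dict]:
    """Run every backend and record failures as rows instead of raising."""
    rows = []
    for name, run_backend in backends.items():
        try:
            rows.append({"backend": name, "status": "ok", **run_backend()})
        except Exception as exc:  # infrastructure errors vary by backend
            rows.append({"backend": name, "status": "error", "error": str(exc)})
    return rows


def bm25_stub() -> dict:
    """Stand-in for a backend that needs no infrastructure."""
    return {"citation_coverage": 1.0}


def neo4j_stub() -> dict:
    """Stand-in for a sidecar backend whose service is not running."""
    raise ConnectionError("sidecar not reachable")


shootout_rows = run_shootout({"bm25": bm25_stub, "neo4j": neo4j_stub})
for row in shootout_rows:
    print(row)
```

The report still contains a row for every requested backend, so a missing sidecar shows up as an explicit error instead of a silently shorter table.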

Reproduction

Run the current LongMemEval-compatible release evidence with BM25 included as a local baseline. Plain zaxy benchmark commands use the embedded projection backend by default; pass --projection-backend neo4j or another backend only when running an explicit sidecar comparison.

zaxy benchmark \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 100 \
  --runs 1 \
  --limit 10 \
  --zaxy-backend checkout \
  --baseline-backends bm25 \
  --embedding-cache .cache/zaxy/longmemeval-embeddings.json \
  --progress

Run the current BM25 comparison:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-100-comparison \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 100 \
  --runs 1 \
  --limit 5 \
  --baseline-backends bm25 \
  --zaxy-backend checkout \
  --reuse-projection \
  --embedding-cache .cache/zaxy/longmemeval-embeddings.json \
  --progress

Run the full 500-question archive:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-hash \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 10 \
  --zaxy-backend checkout \
  --baseline-backends bm25 \
  --embedding-cache .cache/zaxy/longmemeval-embeddings.json \
  --reset-graph \
  --progress

Guard the full 500-question archive with floors pinned to the current observed legacy limit=10 result:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-hash/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.626 \
  --min-answer-recall-at-5 0.608 \
  --min-recall-at-5 0.956 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 15000 \
  --max-p99-ms 23000

Guard current same-harness backend-evaluation reports with the limit=5 Neo4j checkout control:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.714 \
  --min-answer-recall-at-5 0.626 \
  --min-recall-at-5 0.958 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1200 \
  --max-p99-ms 2500

For release gates, compare reports with zaxy benchmark-compare and keep the thresholds explicit. For market comparisons, keep external products in disclosure rows until they can run against the same dataset, query order, retrieval limit, scoring code, and citation requirements.

From a clean checkout with the cached LongMemEval dataset and embedding cache present, run all archived public LongMemEval guardrails with:

scripts/benchmark-guardrails.sh

zaxy doctor --beta-readiness also exposes the release benchmark posture as a named benchmark_no_regression check. It requires the release script to keep checkout quality floors, citation coverage at 1.0, and p95/p99 checkout latency budgets across smoke, performance, and scale backend reports.

Related references: testing.md, retrieval.md, competitive-positioning.md, benchmark-review.md, and README.md. The public explanation is site/index.html.