Benchmarks

Zaxy publishes benchmark evidence in two categories: same-harness Zaxy runs and external disclosures from other memory products. Keep those categories separate. Same-harness results are generated by Zaxy's benchmark CLI over committed or operator-supplied workloads. External disclosures are numbers quoted from public project pages or public benchmark analysis pages; they are useful market context, but they are not same-harness results until those systems run inside the same measurement protocol.

Current Headline

The current public Zaxy headline is the archived full 500-question LongMemEval-compatible checkout run at reports/benchmarks/longmemeval-500-current74-zaxyonly-gated-relative-temporal-anchor-embedded-reuse-20260604/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=10, checkout answer assembly, and the embedded Kuzu projection over 5,372 Eventloom events, 500 queries, 500 subjects, and 948 sessions.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | Approx tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.940 | 0.906 | 1.000 | 0.906 | 1.000 | 1.000 | 687.67 | 14034 |

This is the strongest public Zaxy LongMemEval-compatible claim to date: full-set retrieval reaches Recall@5 = 1.000 and Recall@10 = 1.000 with complete citation coverage. That supports the core Zaxy thesis that event-sourced, cited memory combined with graph, lexical, vector, and source checkout planning can saturate answer-bearing retrieval on this benchmark at limit=10. The remaining misses are entirely synthesis-side: 47 synthesis_miss cases remain, so future gains should come from better answer composition and answer placement, without lowering retrieval or citation coverage.
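The miss count can be cross-checked against the Answer@5 column. The sketch below assumes Answer@5 is the fraction of the 500 questions with an answer surfaced in the top five and that each remaining question counts as exactly one miss; the function name is illustrative and not part of Zaxy.

```python
def expected_misses(questions: int, answer_at_5: float) -> int:
    """Number of questions without an answer in the top five results."""
    # round() absorbs floating-point error in 1 - answer_at_5.
    return round(questions * (1.0 - answer_at_5))


# Headline run: 500 questions at Answer@5 = 0.906.
print(expected_misses(500, 0.906))  # 47
```

The same arithmetic reproduces the miss totals reported for the earlier runs on this page, for example 74 misses at Answer@5 0.852.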

The 100-question BM25 comparison remains useful for same-command tradeoffs. It shows BM25 as the faster lexical baseline and Zaxy as the higher-recall cited checkout path on a smaller slice, but it is no longer the headline result.
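The recall columns used throughout this page can be read with the following minimal sketch. It assumes the common definition of Recall@k (a query counts as a hit when at least one gold item appears in the top k results); the exact scoring rule used by the benchmark CLI is not stated here, and all names are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[Set[str]], k: int) -> float:
    """Fraction of queries with at least one gold item in the top k results."""
    hits = sum(
        1
        for results, answers in zip(ranked, gold)
        if any(item in answers for item in results[:k])
    )
    return hits / len(ranked)


# Two hypothetical queries: the gold session is ranked second in both.
ranked = [["s1", "s2"], ["s9", "s3"]]
gold = [{"s2"}, {"s3"}]
print(recall_at_k(ranked, gold, 1))  # 0.0
print(recall_at_k(ranked, gold, 2))  # 1.0
```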

Full 500-Question LongMemEval Run

2026-06-04 current74 gated relative temporal anchor embedded full-set run

The latest embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current74-zaxyonly-gated-relative-temporal-anchor-embedded-reuse-20260604/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=10, --zaxy-backend checkout, no same-command baselines, and the fresh embedded Kuzu projection built by current65 and reused by the current benchmark projection marker. It should be read as a zaxy-only quality validation, not as a BM25 or Neo4j latency comparison. The workload SHA-256 is 90fb2307195d7e16b963a2b8a30f03b375bd42a45d41aeaa55423029dd84e3fc.
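Before comparing results across runs, it helps to confirm that the local workload matches the published digest. The sketch below assumes the SHA-256 is computed over the raw bytes of the workload file; which file the digest covers is not stated here, so the path passed in is the operator's choice and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path

# Digest quoted in the run description above.
EXPECTED_SHA256 = "90fb2307195d7e16b963a2b8a30f03b375bd42a45d41aeaa55423029dd84e3fc"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large workloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def workload_matches(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's digest equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

A mismatch means the run is not comparable with the archived report, regardless of the scores it produces.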

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.940 | 0.906 | 1.000 | 0.906 | 1.000 | 1.000 | 687.67 | 969.10 |

This run improves on current71 with mean score 0.940 and Answer@5 0.906, while preserving Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The miss taxonomy remains synthesis-only and drops to 47 synthesis misses. The validated production change is gated relative temporal-anchor synthesis: checkout source synthesis now receives the question-time anchor only for queries that can use it, then derives answer candidates from cited session dates. It covers fully retrieved cases such as "How many months/weeks/days ago did I..." and concrete "what did I buy N days ago" questions without crowding unrelated arithmetic and state queries that merely mention relative durations.
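The gating idea can be illustrated with a small sketch. It is not the production operator: the regular expression, function name, and rounding rules (whole weeks by floor division, whole calendar months) are assumptions chosen for illustration.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical gate: only "how many days/weeks/months ago" questions use the anchor.
RELATIVE_AGO = re.compile(r"\bhow many (days|weeks|months) ago\b", re.IGNORECASE)


def relative_ago_answer(question: str, anchor: date, cited: date) -> Optional[int]:
    """Whole units between a cited session date and the question-time anchor."""
    match = RELATIVE_AGO.search(question)
    if match is None or cited > anchor:
        return None  # gate closed: the anchor is not used for this query
    unit = match.group(1).lower()
    if unit == "days":
        return (anchor - cited).days
    if unit == "weeks":
        return (anchor - cited).days // 7
    # Whole calendar months, minus one if the day of month has not been reached.
    months = (anchor.year - cited.year) * 12 + (anchor.month - cited.month)
    return months - 1 if anchor.day < cited.day else months


anchor = date(2023, 5, 30)
print(relative_ago_answer("How many days ago did I buy the lamp?", anchor, date(2023, 5, 20)))  # 10
print(relative_ago_answer("How many days did the trip last?", anchor, date(2023, 5, 20)))  # None
```

Returning None for the second question mirrors the stated goal: queries that merely mention a duration do not receive an anchor-derived candidate.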

Reproduce this report against the reusable embedded projection:

EMBEDDED_GRAPH_PATH=.eventloom/projections/longmemeval-current65-percentage-boolean-comparison.kuzu \
zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current74-zaxyonly-gated-relative-temporal-anchor-embedded-reuse-20260604 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 10 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend embedded \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reuse-projection \
  --progress

Guard this embedded report against the current71 floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current74-zaxyonly-gated-relative-temporal-anchor-embedded-reuse-20260604/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current71-zaxyonly-semantic-scalar-totals-embedded-reuse-20260604/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.940 \
  --min-answer-recall-at-5 0.906 \
  --min-recall-at-5 1.000 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 800 \
  --max-p99-ms 1100
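What these floors and budgets express can be sketched in a few lines. This is an illustration of the thresholds only, not the benchmark-compare implementation; the metric key names are hypothetical and do not reflect the live-benchmark.json schema.

```python
from typing import Dict, List

# Floors and latency budgets copied from the flags in the command above.
FLOORS = {"mean_score": 0.940, "answer_recall_at_5": 0.906,
          "recall_at_5": 1.000, "citation_coverage": 1.0}
BUDGETS_MS = {"p95_ms": 800.0, "p99_ms": 1100.0}


def guard_violations(metrics: Dict[str, float]) -> List[str]:
    """List every metric that falls below its floor or exceeds its budget."""
    violations = []
    for name, floor in FLOORS.items():
        if metrics[name] < floor:
            violations.append(f"{name}={metrics[name]} is below floor {floor}")
    for name, budget in BUDGETS_MS.items():
        if metrics[name] > budget:
            violations.append(f"{name}={metrics[name]} exceeds budget {budget}")
    return violations


# Values from the current74 results table.
current74 = {"mean_score": 0.940, "answer_recall_at_5": 0.906, "recall_at_5": 1.000,
             "citation_coverage": 1.000, "p95_ms": 687.67, "p99_ms": 969.10}
print(guard_violations(current74))  # []
```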

2026-06-04 current71 semantic scalar totals embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current71-zaxyonly-semantic-scalar-totals-embedded-reuse-20260604/live-benchmark.md. It improved on current70 with mean score 0.938 and Answer@5 0.898, while preserving Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The miss taxonomy remained synthesis-only and dropped to 51 synthesis misses. The validated production change was semantic scalar-total synthesis: total queries over query-named numeric quantities sum values whose units are carried by nearby domain words rather than simple suffixes.

2026-06-04 current70 aggregate answer priority embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current70-zaxyonly-aggregate-answer-priority-embedded-reuse-20260604/live-benchmark.md. It improved on current69 with mean score 0.934 and Answer@5 0.892, while preserving Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The miss taxonomy remained synthesis-only and dropped to 54 synthesis misses. The validated production change was aggregate answer priority: total and combined queries no longer let single latest-state observations or auxiliary relative-interval diagnostics outrank answer-ready aggregate totals. This covered multi-source duration totals such as combined book-reading time and total gameplay hours while keeping direct state answers available when no aggregate projection exists.

2026-06-04 current69 query-bound difference embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current69-zaxyonly-query-bound-difference-embedded-reuse-20260604/live-benchmark.md. It improved on current68 with mean score 0.932 and Answer@5 0.886, while preserving Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The miss taxonomy remained synthesis-only and dropped to 57 synthesis misses. The validated production change was query-bound difference synthesis: explicit comparison questions bind cited operands to query-named targets before emitting the answer surface. The operator covered same-query currency differences such as taxi-vs-train fare and actual-vs-target compound duration differences such as marathon overrun minutes, while local value scoring prevented estimated amounts from outranking later actual values in the same citation.
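A minimal sketch of operand binding for a currency difference follows. It assumes dollar amounts and one amount per cited sentence, and it resolves conflicts by letting a later cited sentence overwrite an earlier one, which is a much simpler rule than the local value scoring described above. All names and the sample sentences are hypothetical.

```python
import re
from typing import Dict, List, Optional

AMOUNT = re.compile(r"\$(\d+(?:\.\d+)?)")


def bind_amounts(targets: List[str], cited_sentences: List[str]) -> Dict[str, float]:
    """Bind each query-named target to the amount in a cited sentence that names it."""
    bound: Dict[str, float] = {}
    for sentence in cited_sentences:
        match = AMOUNT.search(sentence)
        if match is None:
            continue
        for target in targets:
            if target.lower() in sentence.lower():
                # A later cited sentence overwrites an earlier one for the same target.
                bound[target] = float(match.group(1))
    return bound


def difference_answer(target_a: str, target_b: str, cited_sentences: List[str]) -> Optional[float]:
    """Absolute difference between the two bound operands, or None if one is unbound."""
    bound = bind_amounts([target_a, target_b], cited_sentences)
    if target_a not in bound or target_b not in bound:
        return None  # an operand is unbound, so no difference is emitted
    return abs(bound[target_a] - bound[target_b])


# Hypothetical cited sentences, for illustration only.
cited = ["The taxi from the airport cost $60.", "The train would have been $15."]
print(difference_answer("taxi", "train", cited))  # 45.0
```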

2026-06-04 current68 query-bound direct answer embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current68-zaxyonly-query-bound-direct-answer-embedded-reuse-20260604/live-benchmark.md. It improved on current67 with mean score 0.932 and Answer@5 0.878, while preserving Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The miss taxonomy remained synthesis-only and dropped to 61 synthesis misses. The validated production change was a query-bound direct-answer synthesis class for explicit cited personal-memory answer sentences. It handled direct current-state records, stated quantities such as weight loss, cited meet-up counts, latest limit-change direction, single-source duration answers, and weekly class-day counts without letting generic numeric totals or absence fallbacks outrank the direct answer surface.

2026-06-04 current67 direct boolean evidence embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current67-zaxyonly-direct-boolean-evidence-embedded-reuse-20260604/live-benchmark.md. It improved on current66 with mean score 0.932 and Answer@5 0.874, while preserving Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The miss taxonomy remained synthesis-only and dropped to 63 synthesis misses. The validated production change was a direct boolean evidence synthesis class for explicit yes/no evidence, including cited current possession, same-method equivalence, and bounded temporal frequency comparisons. The operator requires direct cited support such as I actually have ..., using the same ... as me, or explicit old/new weekly cadence evidence before it emits a boolean_evidence_answer.

2026-06-04 current66 structured scalar answer scoring embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current66-zaxyonly-structured-scalar-scoring-embedded-reuse-20260604/live-benchmark.md. It improved on current64 with mean score 0.924 and Answer@5 0.868, while preserving Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The miss taxonomy remained synthesis-only and dropped to 66 synthesis misses. The validated production changes were typed boolean percentage-comparison candidates plus structured scalar answer scoring. Cited percentage operands are bound to named targets before yes/no comparison answers are emitted, and the benchmark scorer recognizes compact structured answer fields such as answer=Yes and <type>_answer=12 as answer surfaces.
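Recognition of compact structured answer fields can be sketched with a single regular expression. The pattern and function name below are assumptions for illustration; the scorer's actual parsing rules are not documented on this page.

```python
import re
from typing import Dict

# Matches answer=Yes as well as typed fields such as numeric_state_answer=12.
ANSWER_FIELD = re.compile(r"\b((?:[a-z0-9]+_)*answer)=([^\s,;]+)")


def answer_fields(text: str) -> Dict[str, str]:
    """Map each structured answer field found in the text to its raw value."""
    return {name: value for name, value in ANSWER_FIELD.findall(text)}


print(answer_fields("answer=Yes; numeric_state_answer=12"))
# {'answer': 'Yes', 'numeric_state_answer': '12'}
```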

2026-06-04 current65 boolean percentage-comparison embedded full-set run

The current65 validation is archived at reports/benchmarks/longmemeval-500-current65-zaxyonly-percentage-boolean-comparison-embedded-reset-20260604/live-benchmark.md. It used a fresh embedded Kuzu projection rebuilt with --reset-graph, preserved mean score 0.922, raised Answer@5 to 0.864, kept Recall@5 1.000 and complete citation coverage, and reduced synthesis misses to 68. It is retained as the projection-build provenance for current66.

2026-06-04 current64 numeric-state delta and role-covered date operands embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current64-zaxyonly-numeric-state-delta-date-operands-embedded-reuse-20260604/live-benchmark.md. It improves on current57 with Answer@5 0.862, while preserving mean score 0.922, Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The miss taxonomy remains synthesis-only and drops to 69 synthesis misses. The validated production changes are numeric-state delta synthesis and role-covered temporal operands. Count-state questions that ask for an increase or decrease now bind cited earlier and later totals, such as Instagram follower counts, and emit an answer-ready difference. Inverted before-event questions now preserve explicit event-date operands over generic session metadata, so cited ordered and birthday party dates can produce the intended interval before duration or session-date distractors fill the answer surface.
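The numeric-state delta can be sketched as a difference between the earliest and latest cited totals. The data class, field names, and sample values below are hypothetical and only illustrate the binding of earlier and later operands.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class StateObservation:
    observed_on: date  # date of the cited session
    value: int         # total stated in that session
    citation: str      # hypothetical citation identifier


def state_delta(observations: List[StateObservation]) -> Optional[int]:
    """Latest cited total minus earliest cited total (positive means an increase)."""
    if len(observations) < 2:
        return None  # one operand is missing, so no answer-ready difference
    ordered = sorted(observations, key=lambda obs: obs.observed_on)
    return ordered[-1].value - ordered[0].value


# Hypothetical follower counts from two cited sessions.
followers = [
    StateObservation(date(2023, 3, 1), 250, "session-a"),
    StateObservation(date(2023, 4, 1), 350, "session-b"),
]
print(state_delta(followers))  # 100
```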

2026-06-04 current57 latest-state promotion embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current57-zaxyonly-latest-state-promotion-embedded-reuse-20260604/live-benchmark.md. It improves on current56 with Answer@5 0.856, while preserving mean score 0.922, Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The validated production change is latest-state promotion: cited answer-bearing state spans such as RAM upgrade targets, current page progress, and updated duration ranges are surfaced as compact answer candidates before generic count or duration synthesis can fill the top answer surface.

2026-06-04 current56 query-bound arithmetic embedded full-set run

An earlier embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current56-zaxyonly-query-bound-arithmetic-extended-embedded-reuse-20260604/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=10, --zaxy-backend checkout, no same-command baselines, and the reusable embedded Kuzu projection from the current query-bound arithmetic control. It should be read as a zaxy-only quality validation, not as a BM25 or Neo4j latency comparison. The workload SHA-256 is 90fb2307195d7e16b963a2b8a30f03b375bd42a45d41aeaa55423029dd84e3fc.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.922 | 0.852 | 1.000 | 0.904 | 1.000 | 1.000 | 904.45 | 1207.69 |

This run improves on current54 with mean score 0.922, Answer@5 0.852, Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The miss taxonomy remains synthesis-only and drops to 74 synthesis misses. The validated production change is query-bound arithmetic synthesis: cited numeric operands are bound to the requested unit, object, and operation before generic duration, absence, or raw numeric fallbacks can become the answer surface. The fixed class covers cited distance totals, title-scoped pages-remaining subtraction, and percentage calculations whose operands are split across long cited sessions or surrounded by unrelated numeric distractors.

Reproduce this report against the reusable embedded projection:

EMBEDDED_GRAPH_PATH=.eventloom/projections/longmemeval-current55-query-bound-arithmetic.kuzu \
zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current56-zaxyonly-query-bound-arithmetic-extended-embedded-reuse-20260604 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 10 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend embedded \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reuse-projection \
  --progress

Guard this embedded report against the current54 floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current56-zaxyonly-query-bound-arithmetic-extended-embedded-reuse-20260604/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current54-zaxyonly-absence-neighborhood-embedded-reuse-20260604/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.922 \
  --min-answer-recall-at-5 0.852 \
  --min-recall-at-5 1.000 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1000 \
  --max-p99-ms 1500

2026-06-04 current54 absence-neighborhood embedded full-set run

An earlier embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current54-zaxyonly-absence-neighborhood-embedded-reuse-20260604/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=10, --zaxy-backend checkout, no same-command baselines, and the reusable embedded Kuzu projection from the current absence-routing control. It should be read as a zaxy-only quality validation, not as a BM25 or Neo4j latency comparison. The workload SHA-256 is 90fb2307195d7e16b963a2b8a30f03b375bd42a45d41aeaa55423029dd84e3fc.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.914 | 0.846 | 1.000 | 0.904 | 1.000 | 1.000 | 800.96 | 1120.97 |

This run improves on current53 with mean score 0.914, Answer@5 0.846, Recall@5 1.000, Recall@10 1.000, and complete citation coverage. The miss taxonomy is synthesis-only: 77 synthesis misses and no retrieval misses. The validated production change is scoped absence-neighborhood retrieval: temporal absence checks now recover and preserve the cited support groups that prove the inspected memory neighborhood, even when the original source query or temporal anchor would otherwise starve the source lane. The fixed class is not a benchmark-specific answer patch; absence answers now require nearby positive domain evidence and name the missing proposition, such as a missing current employer.

Reproduce this report against the reusable embedded projection:

EMBEDDED_GRAPH_PATH=.eventloom/projections/longmemeval-current28-absence-routing.kuzu \
zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current54-zaxyonly-absence-neighborhood-embedded-reuse-20260604 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 10 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend embedded \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reuse-projection \
  --progress

Guard this embedded report against the current53 floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current54-zaxyonly-absence-neighborhood-embedded-reuse-20260604/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current53-zaxyonly-query-bound-scalar-embedded-reuse-20260604/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.910 \
  --min-answer-recall-at-5 0.844 \
  --min-recall-at-5 1.000 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1500 \
  --max-p99-ms 2000

2026-06-04 current53 query-bound scalar embedded full-set run

An earlier embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current53-zaxyonly-query-bound-scalar-embedded-reuse-20260604/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=10, --zaxy-backend checkout, no same-command baselines, and the reusable embedded Kuzu projection from the current absence-routing control. It should be read as a zaxy-only quality validation, not as a BM25 or Neo4j latency comparison. The workload SHA-256 is 90fb2307195d7e16b963a2b8a30f03b375bd42a45d41aeaa55423029dd84e3fc.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.910 | 0.844 | 1.000 | 0.902 | 0.998 | 0.998 | 786.25 | 1050.77 |

This run improves on current52 with mean score 0.910, Answer@5 0.844, Recall@5 0.998, and complete citation coverage. The miss taxonomy is 1 retrieval miss and 77 synthesis misses. The validated production change is a query-bound scalar answer projection: direct scalar questions now bind candidate answers to the query object and predicate before generic numeric, duration, or assistant-recall fallbacks can become the top answer surface. In the full-set run this moved cited answer spans such as queried brands, quoted songs, and owned-object counts into Answer@5 without expanding retrieval.

Reproduce this report against the reusable embedded projection:

EMBEDDED_GRAPH_PATH=.eventloom/projections/longmemeval-current28-absence-routing.kuzu \
zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current53-zaxyonly-query-bound-scalar-embedded-reuse-20260604 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 10 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend embedded \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reuse-projection \
  --progress

Guard this embedded report against the current52 floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current53-zaxyonly-query-bound-scalar-embedded-reuse-20260604/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current52-zaxyonly-temporal-list-reuse-control-embedded-20260604/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.908 \
  --min-answer-recall-at-5 0.834 \
  --min-recall-at-5 0.998 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1500 \
  --max-p99-ms 2000

2026-06-03 current28 absence-routing embedded full-set run

An earlier embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current28-zaxyonly-absence-routing-embedded-isolated-20260603/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=5, --zaxy-backend checkout, and an isolated embedded Kuzu projection path. It did not run BM25 in the same command and it should not be treated as a Neo4j latency comparison. The workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.876 | 0.810 | 1.000 | 0.896 | 0.990 | 0.990 | 568.56 | 698.06 |

This run clears the current27 embedded quality floor with mean score 0.876, Answer@5 0.810, Recall@5 0.990, and complete citation coverage. The miss taxonomy is 5 retrieval misses and 90 synthesis misses. The validated production change is high-precision absence-first routing for conjunctive aggregation and temporal-order questions: when cited sources prove one side of the query but not the other, generic arithmetic, count, date, and duration synthesis defers to cited absence guidance instead of fabricating a complete comparison. This is a synthesis-contract improvement, not a retrieval expansion.

Reproduce this report with an isolated embedded projection path:

EMBEDDED_GRAPH_PATH=.eventloom/projections/longmemeval-current28-absence-routing.kuzu \
zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current28-zaxyonly-absence-routing-embedded-isolated-20260603 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 5 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend embedded \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reset-graph \
  --progress

Guard this embedded report against the current27 floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current28-zaxyonly-absence-routing-embedded-isolated-20260603/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current27-zaxyonly-state-qualifier-embedded-isolated-20260603/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.874 \
  --min-answer-recall-at-5 0.802 \
  --min-recall-at-5 0.990 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1500 \
  --max-p99-ms 2000

2026-06-03 current27 state-qualifier embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current27-zaxyonly-state-qualifier-embedded-isolated-20260603/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=5, --zaxy-backend checkout, and an isolated embedded Kuzu projection path. It did not run BM25 in the same command and it should not be treated as a Neo4j latency comparison. The workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.874 | 0.802 | 1.000 | 0.894 | 0.990 | 0.990 | 563.76 | 702.90 |

This run cleared the current25 embedded quality floor with mean score 0.874, Answer@5 0.802, Recall@5 0.990, and complete citation coverage. The miss taxonomy was 5 retrieval misses and 94 synthesis misses. The validated production change was qualifier-slot completeness for numeric state synthesis: current-state answer candidates are promoted only when explicit role qualifiers in the query are supported by the cited evidence span. Incomplete state programs remain diagnostic evidence instead of displacing stronger answer surfaces.

2026-06-03 current25 answer-promotion embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current25-zaxyonly-answer-promotion-embedded-isolated-20260603/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=5, --zaxy-backend checkout, and an isolated embedded Kuzu projection path. It did not run BM25 in the same command and it should not be treated as a Neo4j latency comparison. The workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.874 | 0.800 | 1.000 | 0.894 | 0.990 | 0.990 | 725.43 | 830.35 |

This run clears the current24 embedded quality floor with mean score 0.874, Answer@5 0.800, Recall@5 0.990, and complete citation coverage. The miss taxonomy is 5 retrieval misses and 95 synthesis misses. The validated production change is answer-ready candidate promotion inside aggregate synthesis: rendered candidate blocks and result metadata now share the same operation-priority ranking, so specific deterministic answers such as date_interval_answer surface ahead of generic duration/count fallbacks when both are cited. This is a placement and synthesis-contract improvement, not a retrieval expansion.
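One way to keep rendered candidate blocks and result metadata in agreement is to route both through a single ranking function. The sketch below assumes a simple priority table; only date_interval_answer is named in the text, and the fallback labels and priority values are placeholders.

```python
from typing import Dict, List

# Hypothetical priorities: lower numbers surface first.
OPERATION_PRIORITY = {"date_interval_answer": 0, "duration_fallback": 1, "count_fallback": 2}


def rank_candidates(candidates: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Order candidates by operation priority; sorted() is stable, so ties keep input order."""
    unknown = len(OPERATION_PRIORITY)  # unrecognized operations sort last
    return sorted(candidates, key=lambda c: OPERATION_PRIORITY.get(c["operation"], unknown))


candidates = [
    {"operation": "count_fallback", "text": "3"},
    {"operation": "date_interval_answer", "text": "21 days"},
    {"operation": "duration_fallback", "text": "2 hours"},
]
print([c["operation"] for c in rank_candidates(candidates)])
# ['date_interval_answer', 'duration_fallback', 'count_fallback']
```

Calling the same function from the renderer and from the metadata builder means the two surfaces cannot disagree about which candidate is on top.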

2026-06-03 current24 numeric-state embedded full-set run

The previous embedded zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current24-zaxyonly-numeric-state-embedded-isolated-20260603/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=5, --zaxy-backend checkout, and an isolated embedded Kuzu projection path. It did not run BM25 in the same command and it should not be treated as a Neo4j latency comparison. The workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.870 | 0.798 | 1.000 | 0.894 | 0.990 | 0.990 | 614.68 | 737.06 |

This run clears the current22 quality floor with mean score 0.870, Answer@5 0.798, Recall@5 0.990, and complete citation coverage. The miss taxonomy is 5 retrieval misses and 96 synthesis misses. The validated production change is a typed current-state numeric synthesis operator: cited latest totals and later incremental updates can produce answer-ready numeric_state_answer candidates without treating state questions as generic event counts.

Reproduce this report with an isolated embedded projection path so it does not contend with a live zaxy serve process:

EMBEDDED_GRAPH_PATH=.eventloom/projections/longmemeval-current24-numeric-state.kuzu \
zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current24-zaxyonly-numeric-state-embedded-isolated-20260603 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 5 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend embedded \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reset-graph \
  --progress

Guard this embedded report against the current22 floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current24-zaxyonly-numeric-state-embedded-isolated-20260603/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current22-zaxyonly-page-future-age-neo4j-20260603/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.868 \
  --min-answer-recall-at-5 0.796 \
  --min-recall-at-5 0.990 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1500 \
  --max-p99-ms 2000

2026-06-03 current22 page/future-age zaxy-only full-set run

The latest Neo4j zaxy-only quality validation is archived at reports/benchmarks/longmemeval-500-current22-zaxyonly-page-future-age-neo4j-20260603/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=5, --zaxy-backend checkout, and Neo4j projection rebuild with --reset-graph. It did not run BM25 in the same command; use current5 below for the most recent same-command BM25 row. The workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.868 | 0.796 | 1.000 | 0.896 | 0.992 | 0.992 | 1234.20 | 1430.16 |

This run preserves the current18 full-set mean score floor at 0.868 and raises Answer@5 from 0.778 to 0.796, while preserving the above-99% Recall@5 floor and complete citation coverage. The miss taxonomy improves from 4 retrieval misses and 107 synthesis misses to 4 retrieval misses and 98 synthesis misses. The validated production changes are typed cited synthesis operators for target-aware page-count totals and future age-at-event arithmetic. Page-count questions now sum only user-completed reading observations that match the query's temporal target, excluding recommendation-list page counts and older explicit month distractors. Future age-at-event questions now combine a cited current age with a cited future offset such as next year and emit a typed future_age_at_event_answer.
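The future age-at-event arithmetic reduces to adding a cited offset to a cited current age. The sketch below assumes whole-year offsets and recognizes only "next year" and "in N years"; the function name and accepted phrases are illustrative and do not describe the production operator.

```python
import re
from typing import Optional

IN_N_YEARS = re.compile(r"in (\d+) years?")


def future_age_at_event(current_age: int, offset_phrase: str) -> Optional[int]:
    """Age at a future event, given a cited current age and a cited offset phrase."""
    phrase = offset_phrase.strip().lower()
    if phrase == "next year":
        return current_age + 1
    match = IN_N_YEARS.fullmatch(phrase)
    if match:
        return current_age + int(match.group(1))
    return None  # offset not recognized, so no typed answer is emitted


print(future_age_at_event(32, "next year"))   # 33
print(future_age_at_event(32, "in 3 years"))  # 35
```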

Latency note: this report was generated during an airplane run and passed the absolute p95/p99 budgets, but failed the relative latency regression comparison against current18. Treat current18 as the stable latency baseline until current22 is rerun on stable local conditions.
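For readers comparing the p95 and p99 columns, the sketch below computes tail latency with the nearest-rank method. This is an assumption: the benchmark CLI may use an interpolating percentile, in which case its values would differ slightly from this definition.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples_ms: Sequence[float], pct: float) -> float:
    """Smallest sample such that at least pct percent of samples are <= it."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Multiply before dividing to avoid float error such as 0.95 * n.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies of 1..100 ms, one per query.
latencies = [float(ms) for ms in range(1, 101)]
print(percentile_nearest_rank(latencies, 95))  # 95.0
print(percentile_nearest_rank(latencies, 99))  # 99.0
```

Because p99 over 500 queries is determined by the slowest handful of samples, it is more sensitive to unstable run conditions than the quality metrics are.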

The remaining current22 misses are class-level synthesis gaps, not citation coverage failures: Recall@5 is already 0.992 and citation coverage is 1.000. The next architecture target is the Evidence Program layer: every deterministic synthesis operation should expose required evidence slots, bound cited source groups, missing slots, and operation completeness before emitting an answer-ready candidate. This prevents benchmark-shaped fixes by making temporal ordering, explicit insufficiency, state transitions, elapsed-time arithmetic, and aggregate ledgers share the same auditable slot-coverage contract.
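The slot-coverage contract can be pictured as a small data structure. The following is a sketch under stated assumptions, not the planned Evidence Program design: the class name, fields, and the sample operation and slot names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EvidenceProgram:
    operation: str
    required_slots: Tuple[str, ...]
    bound: Dict[str, List[str]] = field(default_factory=dict)  # slot -> citation ids

    def bind(self, slot: str, citation: str) -> None:
        """Attach a cited source to a required slot."""
        if slot not in self.required_slots:
            raise KeyError(f"{slot!r} is not a slot of {self.operation}")
        self.bound.setdefault(slot, []).append(citation)

    @property
    def missing_slots(self) -> List[str]:
        return [slot for slot in self.required_slots if not self.bound.get(slot)]

    @property
    def complete(self) -> bool:
        # Only a complete program may emit an answer-ready candidate.
        return not self.missing_slots


program = EvidenceProgram("elapsed_time", ("start_date", "end_date"))
program.bind("start_date", "session-12")
print(program.complete, program.missing_slots)  # False ['end_date']
```

Because every operation reports its missing slots the same way, an incomplete program can be surfaced as diagnostic evidence rather than as an answer.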

Reproduce this report from a clean Neo4j projection with:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current22-zaxyonly-page-future-age-neo4j-20260603 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 5 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend neo4j \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reset-graph \
  --progress

Guard the quality floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current22-zaxyonly-page-future-age-neo4j-20260603/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current18-zaxyonly-direct-values-neo4j-20260603/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.868 \
  --min-answer-recall-at-5 0.796 \
  --min-recall-at-5 0.990 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1500 \
  --max-p99-ms 2500

2026-06-03 current18 direct-value zaxy-only full-set run

The stable-latency zaxy-only full-set validation is archived at reports/benchmarks/longmemeval-500-current18-zaxyonly-direct-values-neo4j-20260603/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=5, --zaxy-backend checkout, and Neo4j projection rebuild with --reset-graph. It did not run BM25 in the same command; use current5 below for the most recent same-command BM25 row. The workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.868 | 0.778 | 1.000 | 0.894 | 0.992 | 0.992 | 810.59 | 923.27 |

This run improves the current17 full-set checkout floor from mean 0.862 to 0.868 and Answer@5 from 0.772 to 0.778, while preserving the above-99% Recall@5 floor and complete citation coverage. The miss taxonomy improves from 4 retrieval misses and 110 synthesis misses to 4 retrieval misses and 107 synthesis misses. The validated production change is an answer-ready direct numeric value lane for cited source synthesis: personal-best times, current or "so far" counts, latest earned amounts, current daily durations, and direct currency differences can be projected as typed direct_numeric_answer evidence when the cited source text matches the query. The lane is deliberately scoped so domain-specific ledgers still own aggregate route totals, lodging comparisons, absence checks, and other higher-precision synthesis packets.
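The miss counts and the headline rates reported here are arithmetically consistent if each of the 500 questions counts as exactly one hit or one miss. The sketch below checks that reading against the current17 and current18 figures. The formulas are an inference from the published numbers, not a description of how the harness computes its metrics, and `rates_from_misses` is a name invented for this example.

```python
QUESTIONS = 500  # full-set size used throughout this section


def rates_from_misses(retrieval_misses: int, synthesis_misses: int) -> tuple[float, float]:
    """Return (Recall@5, Answer@5) implied by a miss taxonomy.

    Assumption: a retrieval miss also fails the answer, so Answer@5 loses
    both kinds of miss while Recall@5 loses only retrieval misses.
    """
    recall_at_5 = 1 - retrieval_misses / QUESTIONS
    answer_at_5 = 1 - (retrieval_misses + synthesis_misses) / QUESTIONS
    return round(recall_at_5, 3), round(answer_at_5, 3)


current17 = rates_from_misses(4, 110)
current18 = rates_from_misses(4, 107)
print(current17)  # (0.992, 0.772)
print(current18)  # (0.992, 0.778)
```

Under that assumption, removing three synthesis misses accounts for the whole Answer@5 gain from 0.772 to 0.778, while the four retrieval misses hold Recall@5 at 0.992.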

Reproduce this report from a clean Neo4j projection with:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current18-zaxyonly-direct-values-neo4j-20260603 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 5 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend neo4j \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reset-graph \
  --progress

Guard this zaxy-only report against the current17 floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current18-zaxyonly-direct-values-neo4j-20260603/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current17-zaxyonly-temporal-anchors-neo4j-20260603/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.868 \
  --min-answer-recall-at-5 0.778 \
  --min-recall-at-5 0.990 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1500 \
  --max-p99-ms 2500
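As a rough illustration of what the floor flags above express, the sketch below applies the same thresholds to the current18 row from the table. It is not the benchmark-compare implementation: the dictionary keys and the `check_floors` helper are hypothetical names chosen for this example, and the real report JSON layout may differ.

```python
# Thresholds copied from the benchmark-compare command above.
MIN_FLOORS = {
    "mean_score": 0.868,
    "answer_recall_at_5": 0.778,
    "recall_at_5": 0.990,
    "citation_coverage": 1.0,
}
MAX_CEILINGS = {"p95_ms": 1500, "p99_ms": 2500}

# current18 Zaxy checkout row (key names are hypothetical).
current18 = {
    "mean_score": 0.868,
    "answer_recall_at_5": 0.778,
    "recall_at_5": 0.992,
    "citation_coverage": 1.000,
    "p95_ms": 810.59,
    "p99_ms": 923.27,
}


def check_floors(metrics: dict[str, float]) -> list[str]:
    """Return one message for every breached floor or ceiling."""
    violations = []
    for name, floor in MIN_FLOORS.items():
        if metrics[name] < floor:
            violations.append(f"{name}={metrics[name]} is below {floor}")
    for name, ceiling in MAX_CEILINGS.items():
        if metrics[name] > ceiling:
            violations.append(f"{name}={metrics[name]} is above {ceiling}")
    return violations


print(check_floors(current18))  # [] because every floor holds

# Swapping in current17's Answer@5 shows how a regression would surface.
regressed = {**current18, "answer_recall_at_5": 0.772}
print(check_floors(regressed))
```

Setting each minimum to the value the run just achieved turns the report into a ratchet: a later run can match or exceed it, but cannot fall below it without failing the check.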

2026-06-03 current15 assistant-recall zaxy-only full-set run

The current15 zaxy-only full-set validation is archived at reports/benchmarks/longmemeval-500-current15-zaxyonly-assistant-recall-neo4j-20260603/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=5, --zaxy-backend checkout, and a Neo4j projection rebuilt with --reset-graph. It did not run BM25 in the same command; use current5 below for the most recent same-command BM25 row. The workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.838 | 0.744 | 1.000 | 0.890 | 0.992 | 0.992 | 868.66 | 994.30 |

This run improves the current14 full-set checkout floor from mean 0.830 to 0.838 and Answer@5 from 0.728 to 0.744, while preserving the above-99% Recall@5 floor and complete citation coverage. The miss taxonomy improves from 4 retrieval misses and 132 synthesis misses to 4 retrieval misses and 124 synthesis misses. The validated production change is deterministic assistant-recall slot extraction for schedule tables, campaign budget lines, recommended item lists, recommended video title and URL, dilution ratios, direct quote recall, named websites, and paired company examples. The single-session-assistant category now has zero misses in this full-set run.

Reproduce this report from a clean Neo4j projection with:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current15-zaxyonly-assistant-recall-neo4j-20260603 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 5 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend neo4j \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reset-graph \
  --progress

Guard this zaxy-only report against the current14 floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current15-zaxyonly-assistant-recall-neo4j-20260603/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current14-zaxyonly-absence-deferral-neo4j-20260603/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.838 \
  --min-answer-recall-at-5 0.744 \
  --min-recall-at-5 0.990 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1500 \
  --max-p99-ms 2500

2026-06-03 current14 absence-deferral zaxy-only full-set run

The current14 zaxy-only full-set validation is archived at reports/benchmarks/longmemeval-500-current14-zaxyonly-absence-deferral-neo4j-20260603/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=5, --zaxy-backend checkout, and a Neo4j projection rebuilt with --reset-graph. It did not run BM25 in the same command; use current5 below for the most recent same-command BM25 row. The workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.830 | 0.728 | 1.000 | 0.888 | 0.992 | 0.992 | 875.71 | 1009.44 |

This run improves the current12 full-set checkout floor from mean 0.808 to 0.830 and Answer@5 from 0.716 to 0.728, while preserving the above-99% Recall@5 floor and complete citation coverage. The miss taxonomy improves from 4 retrieval misses and 138 synthesis misses to 4 retrieval misses and 132 synthesis misses. The validated production change is evidence-aware absence deferral: generic synthesis only defers when the cited source lane exposes a high-precision missing slot such as a month-scoped count, title-specific reading progress, duration-location absence, or explicit alternative target. A broader absence-first routing attempt reached mean 0.834 but regressed Answer@5 to 0.712, so it is not the release floor.

Reproduce this report from a clean Neo4j projection with:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current14-zaxyonly-absence-deferral-neo4j-20260603 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 5 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend neo4j \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reset-graph \
  --progress

Guard this zaxy-only report against the current12 floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current14-zaxyonly-absence-deferral-neo4j-20260603/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current12-zaxyonly-interval-priority-neo4j-20260603/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.830 \
  --min-answer-recall-at-5 0.728 \
  --min-recall-at-5 0.990 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1500 \
  --max-p99-ms 2500

2026-06-03 current12 interval-priority zaxy-only full-set run

The current12 zaxy-only full-set validation is archived at reports/benchmarks/longmemeval-500-current12-zaxyonly-interval-priority-neo4j-20260603/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=5, --zaxy-backend checkout, and a Neo4j projection rebuilt with --reset-graph. It did not run BM25 in the same command; use current5 below for the most recent same-command BM25 row. The workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy checkout | 0.808 | 0.716 | 1.000 | 0.890 | 0.992 | 0.992 | 868.83 | 1003.72 |

This run raises the full-set checkout mean score from the current5 value of 0.802 to 0.808 while preserving the current5 Answer@5 floor of 0.716, the above-99% Recall@5 floor, and complete citation coverage. The miss taxonomy remains dominated by synthesis misses: 4 retrieval misses and 138 synthesis misses. The validated production change is deterministic answer-key priority for direct interval answers, so week_interval_answer and similar answer-ready interval keys outrank generic elapsed-time answers.

Guard this zaxy-only report against the current5 floor with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current12-zaxyonly-interval-priority-neo4j-20260603/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current5-synthesis-ledger-neo4j-20260603/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.808 \
  --min-answer-recall-at-5 0.716 \
  --min-recall-at-5 0.990 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1500 \
  --max-p99-ms 2500

2026-06-03 current5 synthesis-ledger full-set run

The current synthesis-ledger full-set report is archived at reports/benchmarks/longmemeval-500-current5-synthesis-ledger-neo4j-20260603/live-benchmark.md. It uses the cleaned LongMemEval-compatible workload, deterministic local hash embeddings, limit=5, BM25 as the same-harness lexical baseline, --zaxy-backend checkout, and Neo4j projection rebuild with --reset-graph. The workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

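| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.520 | 0.520 | 1.000 | 0.592 | 0.770 | 0.770 | 327.24 | 336.56 |
| Zaxy checkout | 0.802 | 0.716 | 1.000 | 0.892 | 0.992 | 0.992 | 869.71 | 1015.19 |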
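Because the full-set reports in this section quote the same workload SHA-256, a reader can confirm that a local workload matches before comparing numbers. This is a minimal sketch, assuming the fingerprint is the SHA-256 of a single workload file's raw bytes. The helper names and the example path are illustrative, and the harness may hash a normalized form instead.

```python
import hashlib
from pathlib import Path

# Workload fingerprint quoted by the full-set reports in this section.
EXPECTED = "0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large workloads are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_workload(path: Path, expected: str = EXPECTED) -> bool:
    """True when the file digest equals the published workload fingerprint."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    workload = Path("workload.json")  # hypothetical path; point it at your own workload file
    if workload.exists():
        print(matches_workload(workload))
```

A mismatch does not say which side changed, only that scores from the two workloads should not be compared as same-harness results.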

This run establishes a new full-set retrieval floor for the current limit=5 checkout path: Recall@5 is above 99% and citation coverage remains complete. It is not an industry-leading Answer@5 or mean-score result. The miss taxonomy still shows the hard problem: Zaxy checkout has 4 retrieval misses and 138 synthesis misses. The previous current4 checkout report was mean 0.794, Answer@5 0.706, Recall@5 0.992, and citation coverage 1.000; current5 improves answer synthesis while preserving the 99% retrieval floor. The production direction remains deterministic answer synthesis and answer-candidate ordering, not broad retrieval expansion.

Reproduce this report from a clean Neo4j projection with:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current5-synthesis-ledger-neo4j-20260603 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 5 \
  --baseline-backends bm25 \
  --zaxy-backend checkout \
  --projection-backend neo4j \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reset-graph \
  --progress

Guard the current report with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current5-synthesis-ledger-neo4j-20260603/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.802 \
  --min-answer-recall-at-5 0.716 \
  --min-recall-at-5 0.990 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1500 \
  --max-p99-ms 2500

Legacy limit=10 full-set floor

The full 500-question LongMemEval-compatible hash run is archived at reports/benchmarks/longmemeval-500-hash/live-benchmark.md. It uses the cleaned LongMemEval workload, deterministic local hash embeddings, limit=10, BM25 as the same-harness lexical baseline, and Zaxy checkout retrieval over 5,372 Eventloom events, 500 queries, 500 subjects, and 948 sessions. Its workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

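| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.560 | 0.516 | 1.000 | 0.592 | 0.770 | 0.902 | 356.67 | 433.55 |
| Zaxy checkout | 0.724 | 0.628 | 1.000 | 0.960 | 0.972 | 0.972 | 1472.11 | 2652.55 |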

This legacy full-set result remains the limit=10 no-regression floor for checkout-wide changes that still run that archived harness. It is not the current full-set headline; current74 above supersedes it for full-set quality claims. Keep it only as historical BM25-included evidence and as a floor for older commands that still target reports/benchmarks/longmemeval-500-hash.

Current same-harness backend-evaluation floor

Backend-evaluation work now uses the current limit=5 full-set control at reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.md. Its workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854, matching the pgGraph full-set comparison below.

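| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.516 | 0.516 | 1.000 | 0.592 | 0.770 | 0.770 | 347.47 | 406.53 |
| Zaxy checkout | 0.714 | 0.626 | 1.000 | 0.946 | 0.958 | 0.958 | 1089.53 | 2456.86 |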

Use this current backend-evaluation floor when comparing projection backends or other limit=5 full-set reports. Do not compare a limit=5 backend run directly against the legacy limit=10 mean-score floor without also running a same-command Neo4j control.

Skill Memory changes must pass the full 500-question guardrail before release, because the checkout skill lane shares ranking, evidence selection, prompt formatting, and MCP tool surfaces with factual memory. The Skill Memory lane may add cited procedural guidance, but it must not lower Zaxy checkout mean score, Answer@5, Recall@5, citation coverage, or the archived latency envelope unless a new public benchmark report explicitly replaces these floors. Skill Memory outcome analytics are read-only checkout diagnostics: promotion candidates, rollback candidates, and contradiction analytics can guide an agent, but they do not revise, delete, or promote a skill without an explicit skill.* event.

Projection backend changes must pass the full 500-question guardrail before release, because backend swaps can alter exact, keyword, vector, traversal, temporal, and citation behavior even when Eventloom remains the source of truth. Embedded Kuzu is the default backend after matching the answer-ready quality and citation gates; Neo4j remains the sidecar control backend for same-harness comparisons.

The experimental pgGraph adapter now has an initial same-harness backend comparison. It supports projection, exact search, keyword search, pgvector-backed vector search, invalidation, and traversal. It stays behind PROJECTION_BACKEND=pggraph, and vector search uses pgvector only when the PostgreSQL endpoint has the extension installed. pgGraph is not eligible as a default backend until it passes the full guardrail on the same harness and has repeatable operations coverage.

pgGraph Backend Comparison

The 100-question backend comparison is archived at reports/benchmarks/longmemeval-100-pggraph-comparison/live-benchmark.md and reports/benchmarks/longmemeval-100-neo4j-comparison/live-benchmark.md. Both runs use the same cleaned LongMemEval-compatible slice, deterministic hash embeddings, limit=5, BM25 as the lexical baseline, and --zaxy-backend both.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | p95 ms | Approx tokens |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.500 | 0.500 | 1.000 | 0.710 | 0.840 | 98.24 | 2514 |
| pgGraph Zaxy | 0.960 | 0.960 | 1.000 | 0.980 | 0.980 | 355.37 | 5789 |
| pgGraph checkout | 0.910 | 0.910 | 1.000 | 0.950 | 0.980 | 312.62 | 5033 |
| Neo4j Zaxy | 0.960 | 0.960 | 1.000 | 1.000 | 1.000 | 667.78 | 3937 |
| Neo4j checkout | 0.930 | 0.930 | 1.000 | 0.960 | 1.000 | 625.98 | 7419 |

The full 500-question pgGraph comparison is archived at reports/benchmarks/longmemeval-500-pggraph-comparison/live-benchmark.md. A same-harness Neo4j checkout control run is archived at reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.md. The pgGraph comparison uses the full cleaned LongMemEval-compatible workload, deterministic hash embeddings, limit=5, BM25 as the lexical baseline, and --zaxy-backend both. The pgGraph run uses --reset-graph to truncate and rebuild the PostgreSQL projection tables before ingestion so repeated benchmark runs do not accumulate stale benchmark projections.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | p95 ms | Approx tokens |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.512 | 0.512 | 1.000 | 0.592 | 0.770 | 343.80 | 2661 |
| pgGraph Zaxy | 0.698 | 0.698 | 1.000 | 0.958 | 0.958 | 1077.11 | 4193 |
| pgGraph checkout | 0.714 | 0.632 | 1.000 | 0.948 | 0.958 | 1020.22 | 13016 |
| Neo4j checkout control | 0.714 | 0.626 | 1.000 | 0.946 | 0.958 | 1089.53 | 13431 |

The clean pgGraph run restored the full-set Recall@5 floor and passed Answer@5, citation coverage, and latency. The same-harness Neo4j checkout control on the current workload hash scored a mean of 0.714, and pgGraph checkout also scored 0.714, so the current adapter comparison no longer shows a pgGraph-specific quality regression. Checkout token volume is higher than the previous archive because the benchmark now includes supporting facts and evidence from the model-facing Memory Checkout object even when compact contexts are present. pgGraph remains an evaluation backend only until the full 500-question floor is re-baselined on a frozen same-harness workload and operations coverage includes container bootstrap, schema reset, graph rebuild, and failure recovery.

BM25 Comparison

The current same-harness BM25 comparison is archived at reports/benchmarks/longmemeval-100-comparison/live-benchmark.md. It reruns the same 100-question LongMemEval-compatible slice with BM25 and Zaxy checkout at limit=5.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 |
| --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.500 | 0.500 | 1.000 | 0.710 | 0.840 | 0.840 |
| Zaxy checkout | 0.900 | 0.880 | 1.000 | 0.950 | 0.990 | 0.990 |

The practical reading is that BM25 can find many answer-bearing sessions, but it loses much more often during temporal and multi-session synthesis. Zaxy's advantage comes from checkout-level recall planning, source-first evidence selection, temporal/entity bridging, and cited context assembly. The tradeoff is latency: BM25 is much faster in this run, while Zaxy returns richer cited context.

Representative Suite

Zaxy also has a synthetic suite-v1 benchmark review in benchmark-review.md. That review covers 650 paired queries across current memory, historical memory, graph traversal, indexed documents, sanitized transcripts, and mixed cross-lane context. On that representative agent-context workload, Zaxy scored 1.000 with OpenAI text-embedding-3-small, compared with 0.520 for vector and markdown+vector baselines and 0.005 for direct markdown scanning.

Use suite-v1 to evaluate Zaxy's architectural thesis: temporal, relational, replayable agent context should beat flat chunk retrieval on tasks that require current-vs-historical truth, graph relationships, citations, and mixed context. Use LongMemEval-compatible runs to compare with public memory-product claims.

CoordinationBench

CoordinationBench is the benchmark lane for Zaxy Coordinate. It measures whether a memory system can turn multiple isolated worker sessions into one governed parent mission history. The scorer reports accepted-finding precision and recall, conflict precision and recall, stale-claim rejection, duplicate consolidation, evidence coverage, parent-checkout answerability, citation coverage, accepted-state synthesis quality, non-authoritative leakage, purpose feedback coverage, Eventloom replayability, token estimates, and brief/promotion latency.

The current official CoordinationBench adapter result is a first-party, same-harness run through the external CoordinationBench scorer. The adapter was frozen before holdout evaluation and is recorded in CoordinationBench at submissions/participants/zaxy-coordinate.adapter.json with these source hashes:

| Source | SHA-256 |
| --- | --- |
| src/zaxy/coordinationbench_adapter.py | d2ff5d6124e7a1f0849cac8a6afbba328bd4fa8ef0fd806203575801dc5c6e7c |
| examples/adapters/coordinationbench_zaxy_adapter.py | 3923b754fe31c81c572bc0c8bbfeb595e5fd69f6eb833746112b01c85982da00 |

The public v1-audited and v1-scale lanes scored perfectly, but those lanes should be described as first-party public-label reproducibility runs, not a representative leaderboard claim:

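| Lane | Cases | Overall | Accepted precision | Accepted recall | Conflict recall | Stale rejection | Answerability | Evidence grounding |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| v1-audited | 10 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| v1-scale | 72 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |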

After freezing the adapter, the same executable was run unchanged against existing public-derived holdout workload packs. That is the more honest generalization signal:

| Holdout pack | Overall | Accepted precision | Accepted recall | Conflict recall | Stale rejection | Answerability | Evidence grounding |
| --- | --- | --- | --- | --- | --- | --- | --- |
| public-derived-mini | 0.593 | 0.667 | 0.667 | 1.000 | 0.000 | 0.000 | 0.333 |
| public-derived-wave1 | 0.644 | 0.792 | 0.875 | 0.750 | 0.000 | 0.000 | 0.375 |
| public-derived-wave2 | 0.593 | 0.833 | 0.938 | 0.125 | 0.000 | 0.000 | 0.438 |
| public-derived-wave3 | 0.598 | 0.962 | 0.962 | 0.000 | 0.000 | 0.000 | 0.462 |
| public-derived-wave4 | 0.604 | 0.977 | 0.977 | 0.000 | 0.000 | 0.000 | 0.477 |

The public-derived holdout mean is 0.606. That mean, not the perfect public-lane score, is the representative product signal. The replay-backed coordination layer gets strong accepted-state precision and duplicate consolidation, with accepted precision rising from 0.667 on the mini pack to 0.977 on wave4. The current frozen adapter still needs better source-aware final answering, stale-source interpretation, and conflict detection across public-derived cases: stale rejection and answerability are 0.000 on every holdout pack, and conflict recall falls to 0.000 on wave3 and wave4. Until independent review and unseen workload promotion are complete, Zaxy should not market a perfect CoordinationBench score as representative performance.
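The 0.606 figure can be reproduced from the holdout table as an unweighted mean of the five Overall scores. The sketch assumes each pack counts equally regardless of how many cases it contains, which is the reading that matches the published number.

```python
from statistics import mean

# Overall scores copied from the public-derived holdout table.
holdout_overall = {
    "public-derived-mini": 0.593,
    "public-derived-wave1": 0.644,
    "public-derived-wave2": 0.593,
    "public-derived-wave3": 0.598,
    "public-derived-wave4": 0.604,
}

holdout_mean = round(mean(holdout_overall.values()), 3)
print(holdout_mean)  # 0.606
```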

The in-repo Zaxy-owned CoordinationBench adapter now emits a source-aware accepted-state answer packet for each case: returned_text, answer_candidate, synthesis_artifact, support_source_ids, excluded_source_ids, and non_authoritative_rows_injected. The scorer treats accepted_state_synthesis_quality as proof-backed: plain returned text is not enough to receive synthesis-quality credit. Public-derived holdout numbers must be rerun and archived before this adapter hardening changes any published holdout claim.

The report also includes a machine-readable coordinate_purpose_synthesis_gate that passes only when accepted-state synthesis quality, non-authoritative leakage, Coordinate-purpose feedback coverage, citation coverage, parent-checkout answerability, and Eventloom replayability all meet their required floors. This is the gate for Zaxy Coordinate product claims; it is separate from the competitor claim gate. Coordinate proof packets are also projected into graph memory as mission-scoped proof nodes with accepted-finding, conflict, handoff, and non-authoritative-row edges, so benchmark evidence remains replayable and queryable instead of living only as text output.

purpose_feedback_coverage is an internal Coordinate audit metric: Zaxy gets credit when accepted parent-state findings are feedback-ready with source citations, while external same-harness adapters get credit only when their strict result files include explicit Coordinate-purpose feedback events for the accepted findings. Disclosure-only competitor rows do not receive feedback coverage credit.

The internal coordination-real-v1 report is archived at reports/benchmarks/coordination-real-v1/coordination-benchmark.md. It remains useful as a Zaxy development smoke test over real project history. It should not be used as the headline benchmark claim because it was produced inside the Zaxy repo and is easier to tune against than an external holdout pack. The report includes local baselines, disclosure-only adapter templates for Mem0, Agent Memory, ActiveGraph, Quarq, and Semantic Reach/Hybi, limitations, and reproduction commands. Its competitor_claim_gate is currently blocked for Quarq and Hybi, which means the report may disclose adapter status but must not be used as a same-harness public claim for either system. Its coordinate_purpose_synthesis_gate is passed, which means the internal real-history report is suitable as a development proof of the Coordinate purpose/synthesis contract while still not being a public competitor claim.

Purpose-Conditioned Memory Gate

The purpose-v1 benchmark is Zaxy's deterministic internal gate for the "memory is purpose" claim. It does not compare against Semantic Reach, Quarq, or other products. Comparative SOTA claims remain blocked until same-harness adapters are pinned and scored.

Run:

python -m zaxy purpose-benchmark --output-dir reports/benchmarks/purpose-v1 --include-holdouts

Archived report: reports/benchmarks/purpose-v1/purpose-benchmark.json

| Lane | Status | Score | What it proves |
| --- | --- | --- | --- |
| Purpose Recall | passed | 1.000 | Purpose profiles apply recall floors and ontology evidence terms. |
| Ontology Shift | passed | 0.750 | The same query resolves to distinct purpose-specific retrieval lenses and graph path roles. |
| Consequence Retention | passed | 1.000 | Profiles retain failures, accepted decisions, risks, and proof outcomes. |
| Governed Forgetting | passed | 1.000 | Decay mode protects obligations and risk memory while downweighting noise. |
| Action Outcome Loop | passed | 1.000 | Purpose outcome history changes future rank and warning candidates. |
| Evidence Policy Discipline | passed | 1.000 | Purpose fixtures enforce missing and supported evidence policies. |
| Broader Profile Fixtures | passed | 1.000 | Support, product, sales, legal, and executive profiles have checkout, compaction, and benchmark fixtures. |
| Neutral Substrate Projection | passed | 1.000 | One neutral customer artifact can rebuild distinct cited purpose projections. |
| Cross-Role Citation | passed | 1.000 | The same citation can support different role-specific memories. |
| Accepted-State Discipline | passed | 1.000 | Coordinate compaction keeps accepted parent state and suppresses pending worker rows. |

The archived report also includes a diagnostic public-derived holdout pack at reports/benchmarks/purpose-v1/holdouts/public-derived-purpose-v1/. Its fingerprint is 0d8217bb4e905164305970050ef34c987d7e9b287ce648a1730685f3dd0e61f6. Holdouts are reported as gate_status=diagnostic, not as release-pass lanes, and cases are frozen with source disclosures and forbidden overclaims.

The smaller coordination-v1 workload remains as the contract seed. It includes three workers, overlapping auth-failure findings, duplicate evidence, stale claims, conflicting claims, and a missing-evidence finding. Use it for adapter authors and fast protocol checks, not as the representative headline.

Run the MVP harness:

zaxy coordinate benchmark --output-dir reports/benchmarks/coordination-v1 --json

The command writes coordination-benchmark.json, coordination-benchmark.md, and the frozen workload JSON. The included flat-eventlog baseline intentionally accepts all worker findings, so it exposes the contamination problem that governed promotion is meant to solve.
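To see why accepting every worker finding is a contamination problem, consider precision and recall over accepted findings. The sketch below uses made-up finding IDs and textbook set-based definitions. It is not the CoordinationBench scorer and does not reproduce any published baseline number.

```python
def precision_recall(accepted: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of accepted findings against gold labels."""
    true_positives = len(accepted & gold)
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical worker output: two valid findings plus duplicate, stale,
# conflicting, and unsupported claims that a governed parent should reject.
worker_findings = {
    "auth-timeout", "token-expiry",
    "auth-timeout-dup", "stale-claim-1", "stale-claim-2",
    "conflict-a", "conflict-b", "no-evidence",
}
gold_accepted = {"auth-timeout", "token-expiry"}

flat_accept_all = precision_recall(worker_findings, gold_accepted)
governed = precision_recall({"auth-timeout", "token-expiry"}, gold_accepted)

print(flat_accept_all)  # (0.25, 1.0)
print(governed)         # (1.0, 1.0)
```

Accept-all keeps recall perfect by construction, so the damage shows up only in precision: every rejected-worthy row that enters parent state dilutes it.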

The current coordination-v1 report is published at reports/benchmarks/coordination-v1/coordination-benchmark.md. It uses workload fingerprint 4b6f01f5a0e9275bd6cd0238d439ee326d471483d5da3cc1dcc9a258d21bfafc and reports:

| System | Accepted precision | Conflict recall | Stale rejection | Parent answerability | Synthesis quality | Leakage guard | Citation coverage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Zaxy Coordinate | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Markdown notes | 0.400 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| BM25 worker logs | 0.333 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Flat transcript | 0.200 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |

The same report lists Mem0, Agent Memory, ActiveGraph, Quarq, and Semantic Reach/Hybi as not_run with disclosure_only claim status until a pinned runner result is locally scored. Quarq and Hybi now ship pinned disclosure manifests with source refs, install commands, workload/result contracts, and explicit unsupported runner commands; those manifests still are not performance results. The report also writes a machine-readable competitor_claim_gate; public same-harness claims for Quarq or Hybi are blocked until the gate sees completed, locally scored, fingerprinted result audits. That is deliberate: CoordinationBench should make the adapter gap visible without turning placeholder templates into public claims.

External Disclosures

These rows summarize public claims from other projects. They are external disclosures, not same-harness results, because Zaxy did not execute those systems inside its benchmark harness.

| System | Public claim | Source | Interpretation |
| --- | --- | --- | --- |
| MemPalace | 96.6% raw LongMemEval R@5; 98.4% held-out hybrid R@5; LLM-reranked full-set runs reported at 99%+ R@5 | MemPalace README, MemPalace BENCHMARKS.md | Strongest public retrieval target. Compare Zaxy's R@5=1.000 to no-LLM retrieval disclosures separately from MemPalace's optional LLM-rerank line. |
| Agent Memory | 95.2% R@5 on LongMemEval-S, with BM25 + vector retrieval and broader graph-memory positioning | Agent Memory LONGMEMEVAL.md | Direct product-positioning target for coding-agent memory with aggressive hook and viewer UX. |
| Mem0 | Research pages report LongMemEval accuracy in the low-to-mid 90s plus lower token usage; managed-platform category rows report up to 97.0% on temporal reasoning | Mem0 research, memory-benchmarks | Different metric family and hosted/managed setup; useful as production-memory context, but not directly comparable to Zaxy's retrieval R@5 and cited checkout metrics. |
| Quarq | Reports memory-first agent behavior and high LongMemEval-S accuracy claims. | quarq.io/agent, quarqlabs/agent-oss | Strong retrieval-protocol target, but Zaxy does not treat the claim as same-harness until a pinned runner or strict result file is locally scored. |
| Semantic Reach / HyperBinder / Hybi | Claims a unified HDC-backed substrate for semantic, graph, relational, and exact retrieval. | semantic-reach.io, HyperBinder SDK, hybi on PyPI | Architecture target for slot-aware retrieval, but public evidence is not a Zaxy same-harness result without a pinned HyperBinder/Hybi runtime adapter. |

When writing public copy, do not collapse these into a single leaderboard. Metric families differ: R@5 retrieval, Answer@5 expected-term recall, LOCOMO judge accuracy, and token/latency reductions answer different questions.

Same-Harness Adapter Feasibility

As of May 18, 2026, competitor adapters have different readiness levels:

| System | Status | Evidence | Same-harness blocker |
| --- | --- | --- | --- |
| MemPalace | adapter candidate | The public repo documents benchmarks/longmemeval_bench.py, committed per-question results, and a no-API-key raw LongMemEval path. | Build a wrapper that exports per-query top-k contexts into Zaxy's BenchmarkRun schema without changing MemPalace ranking settings. |
| Mem0 | benchmark harness candidate | mem0ai/memory-benchmarks includes LongMemEval scripts, but the OSS path requires Docker, Qdrant, model configuration, and LLM answer/judge settings. | Separate retrieval-only evidence from answer/judge accuracy, pin backend config, and preserve token/latency accounting. |
| Agent Memory | external disclosure only | The product page reports LongMemEval-S R@5 and the retrieval stack, but it does not document a stable same-harness CLI/API contract for Zaxy to call. | Keep the claim in external disclosures until a reproducible benchmark command, dataset contract, and result export are available. |
| Quarq | pinned unsupported runner manifest | The OSS repo exposes a local memory-first agent architecture. Zaxy pins quarqlabs/agent-oss at b68386048795765d46c87bef5bd88ecfb1301337, but no CoordinationBench runner adapter is committed. | Replace the packaged unsupported runner with a real workload replay adapter and score the generated result locally before publishing metrics. |
| Semantic Reach / HyperBinder / Hybi | pinned unsupported runner manifest | The public hybi SDK is pinned to 0.1.1 by PyPI wheel hash, but it is an HTTP client for a HyperBinder runtime and Zaxy has no pinned server/runtime adapter yet. | Pin a HyperBinder server/runtime, replace the unsupported runner, and export strict result files before publishing metrics. |

No same-harness adapter should be published without a pinned install command, dataset mapping, retrieval limit, score mapping, latency/tokens capture, and a clear statement about whether the competitor result is retrieval recall, answer/judge accuracy, or another metric family. For Quarq and Semantic Reach/Hybi specifically, public CoordinationBench copy must also pass zaxy coordinate benchmark --require-competitor-claim quarq --require-competitor-claim hybi ...; otherwise the report remains disclosure-only for those systems.
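The publication requirements above read naturally as a checklist. The sketch below models them as a record with one field per requirement and reports whatever is still unset. `AdapterDisclosure` and `missing_requirements` are hypothetical names for illustration only. They are not part of Zaxy's adapter schema or the competitor claim gate.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AdapterDisclosure:
    """Hypothetical record of what an adapter must pin before publication."""

    install_command: Optional[str] = None
    dataset_mapping: Optional[str] = None
    retrieval_limit: Optional[int] = None
    score_mapping: Optional[str] = None
    latency_token_capture: Optional[str] = None
    metric_family: Optional[str] = None  # e.g. "retrieval recall" or "answer/judge accuracy"


def missing_requirements(disclosure: AdapterDisclosure) -> list[str]:
    """List every requirement that is still unset, in declaration order."""
    return [f.name for f in fields(disclosure) if getattr(disclosure, f.name) is None]


draft = AdapterDisclosure(retrieval_limit=5, metric_family="retrieval recall")
print(missing_requirements(draft))
# ['install_command', 'dataset_mapping', 'score_mapping', 'latency_token_capture']
```

An adapter would be publishable under this sketch only when the returned list is empty; anything else stays in the external-disclosure category.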

Backend Shootout

Embedded graph work needs a backend shootout before any default-backend change. The shootout contract compares embedded, LatticeDB, Neo4j, pgGraph, and BM25 on the same Eventloom history and query file. It must report cold bootstrap time, first useful init time, first checkout time, append-to-projection p95, projection events per second, checkout p95, checkout p99, traversal p95, dashboard graph-load timing, returned tokens, injected tokens, citation coverage, quality against expected query terms when provided, Answer@5/Recall@5 fields for LongMemEval-compatible workloads, resident memory delta, on-disk footprint, and rebuild recovery time.

Every generated JSON report also carries a report schema version, UTC generation timestamp, source fingerprints for the Eventloom and query files, and workload fingerprints for the filtered events and normalized query specs. Release evidence should use --require-report-metadata --require-markdown-report --require-query-results --require-git-tracked-inputs --verify-report-fingerprints together. With those flags, stale reports fail when their input Eventloom or query file changes, the human-readable Markdown sidecar must carry matching provenance, aggregate metrics must be backed by per-query diagnostics, and release evidence cannot depend on local-only benchmark inputs. The same check also verifies event/query counts, so tampered count metadata cannot pass as release evidence.

The local harness is:

python scripts/backend-shootout.py \
  --eventloom-path .eventloom \
  --session-id default \
  --queries-file reports/backend-shootout/queries.json \
  --output reports/backend-shootout/backend-shootout.json

Validate a labeled active-backend report before treating it as release evidence:

python scripts/check-backend-shootout.py \
  reports/backend-shootout/backend-shootout.json \
  --require-report-metadata \
  --require-markdown-report \
  --require-query-results \
  --require-git-tracked-inputs \
  --verify-report-fingerprints \
  --require-backends embedded,bm25 \
  --forbid-backends neo4j,pggraph,latticedb \
  --require-labeled-metrics \
  --require-dashboard-source embedded=embedded \
  --min-answer-at-5 0.5 \
  --min-recall-at-5 0.5 \
  --min-citation-coverage 1.0 \
  --min-quality-per-1k-injected-tokens embedded=1.0 \
  --min-answer-at-5-per-1k-injected-tokens embedded=1.0 \
  --max-cold-bootstrap-ms embedded=250 \
  --max-first-checkout-ms embedded=25 \
  --max-append-to-projection-p95-ms embedded=50 \
  --max-resident-memory-delta-bytes embedded=256000000 \
  --max-on-disk-footprint-bytes embedded=256000000 \
  --max-dashboard-graph-load-ms embedded=250 \
  --max-checkout-p99-ms embedded=25 \
  --max-exact-p99-ms embedded=10 \
  --max-keyword-p99-ms embedded=5 \
  --max-vector-p99-ms embedded=5 \
  --max-traversal-p99-ms embedded=5

Validate the medium-scale embedded performance report before treating embedded projection throughput as protected release evidence:

python scripts/check-backend-shootout.py \
  reports/backend-shootout/longmemeval-40-backend-shootout.json \
  --require-report-metadata \
  --require-markdown-report \
  --require-query-results \
  --require-git-tracked-inputs \
  --verify-report-fingerprints \
  --require-backends embedded,bm25 \
  --forbid-backends neo4j,pggraph,latticedb \
  --require-labeled-metrics \
  --require-dashboard-source embedded=embedded \
  --min-citation-coverage 1.0 \
  --min-projection-events-per-second embedded=40 \
  --max-cold-bootstrap-ms embedded=250 \
  --max-first-useful-init-ms embedded=15000 \
  --max-first-checkout-ms embedded=50 \
  --max-append-to-projection-p95-ms embedded=35 \
  --max-resident-memory-delta-bytes embedded=768000000 \
  --max-on-disk-footprint-bytes embedded=256000000 \
  --max-dashboard-graph-load-ms embedded=500 \
  --max-rebuild-recovery-ms embedded=15000 \
  --max-checkout-p95-ms embedded=100 \
  --max-checkout-p99-ms embedded=85 \
  --min-quality-per-1k-returned-tokens embedded=0.10 \
  --min-answer-at-5-per-1k-returned-tokens embedded=0.10 \
  --min-quality-per-1k-injected-tokens embedded=0.10 \
  --min-answer-at-5-per-1k-injected-tokens embedded=0.10 \
  --max-exact-p95-ms embedded=15 \
  --max-exact-p99-ms embedded=10 \
  --max-keyword-p95-ms embedded=75 \
  --max-keyword-p99-ms embedded=40 \
  --max-vector-p95-ms embedded=25 \
  --max-vector-p99-ms embedded=35 \
  --max-traversal-p95-ms embedded=10 \
  --max-traversal-p99-ms embedded=10

Validate the 100-query embedded scale report before treating broader embedded runtime behavior as protected release evidence:

python scripts/check-backend-shootout.py \
  reports/backend-shootout/longmemeval-100-backend-shootout.json \
  --require-report-metadata \
  --require-markdown-report \
  --require-query-results \
  --require-git-tracked-inputs \
  --verify-report-fingerprints \
  --require-backends embedded,bm25 \
  --forbid-backends neo4j,pggraph,latticedb \
  --require-labeled-metrics \
  --require-dashboard-source embedded=embedded \
  --min-recall-at-5 0.90 \
  --min-citation-coverage 1.0 \
  --min-projection-events-per-second embedded=35 \
  --max-cold-bootstrap-ms embedded=600 \
  --max-first-useful-init-ms embedded=45000 \
  --max-first-checkout-ms embedded=150 \
  --max-append-to-projection-p95-ms embedded=40 \
  --max-resident-memory-delta-bytes embedded=1700000000 \
  --max-on-disk-footprint-bytes embedded=512000000 \
  --max-dashboard-graph-load-ms embedded=500 \
  --max-rebuild-recovery-ms embedded=45000 \
  --max-checkout-p95-ms embedded=200 \
  --max-checkout-p99-ms embedded=250 \
  --min-quality-per-1k-returned-tokens embedded=0.15 \
  --min-answer-at-5-per-1k-returned-tokens embedded=0.15 \
  --min-quality-per-1k-injected-tokens embedded=0.15 \
  --min-answer-at-5-per-1k-injected-tokens embedded=0.15 \
  --max-exact-p95-ms embedded=10 \
  --max-exact-p99-ms embedded=12 \
  --max-keyword-p95-ms embedded=20 \
  --max-keyword-p99-ms embedded=15 \
  --max-vector-p95-ms embedded=15 \
  --max-vector-p99-ms embedded=20 \
  --max-traversal-p95-ms embedded=10 \
  --max-traversal-p99-ms embedded=10

This performance guardrail intentionally checks total checkout latency, projection/resource costs, and the retrieval lanes that produce context. resident_memory_delta_bytes and on_disk_footprint_bytes keep the embedded runtime honest about local machine cost instead of only optimizing query latency. quality_per_1k_returned_tokens, answer_at_5_per_1k_returned_tokens, quality_per_1k_injected_tokens, and answer_at_5_per_1k_injected_tokens protect the token-efficiency goal directly, while the exact, keyword, vector, and traversal p95 ceilings make it harder for one degraded lane to hide behind an acceptable aggregate checkout p95.
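
The sketch below illustrates why per-lane ceilings matter. It uses a nearest-rank percentile, which is an assumption (the harness's percentile method is not stated here), and made-up latency samples. The 100 ms checkout and 25 ms vector ceilings are taken from the 40-query command above; the helper names are hypothetical.

```python
def nearest_rank_percentile(samples: list[float], percentile: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(percentile/100 * n) in sorted order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (percentile * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def parse_ceiling(spec: str) -> tuple[str, float]:
    """Split a 'backend=value' threshold such as 'embedded=25'."""
    backend, sep, value = spec.partition("=")
    if not sep or not backend:
        raise ValueError(f"expected backend=value, got {spec!r}")
    return backend, float(value)


def lane_violations(
    lane_samples_ms: dict[str, list[float]],
    ceilings_ms: dict[str, float],
    percentile: int = 95,
) -> dict[str, float]:
    """Return each lane whose observed percentile exceeds its ceiling."""
    violations = {}
    for lane, ceiling in ceilings_ms.items():
        observed = nearest_rank_percentile(lane_samples_ms[lane], percentile)
        if observed > ceiling:
            violations[lane] = observed
    return violations


if __name__ == "__main__":
    # Illustrative samples only: total checkout stays under its ceiling
    # while the vector lane has degraded past its own.
    samples = {
        "checkout": [40.0] * 18 + [70.0, 70.0],
        "vector": [5.0] * 18 + [30.0, 30.0],
    }
    ceilings = {"checkout": 100.0, "vector": 25.0}
    print(lane_violations(samples, ceilings))
```

With only the aggregate checkout ceiling, this run would pass; the vector ceiling is what surfaces the regression.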

The default active backend set is embedded and bm25. This keeps routine shootouts sidecar-free while still comparing Zaxy's embedded projection against a lexical baseline. Neo4j, pgGraph, and LatticeDB remain supported through an explicit backend set such as --backends embedded,neo4j,bm25 when you are running controlled sidecar comparisons. LatticeDB is a parked candidate after the first graph-traversal smoke failed both quality and latency gates. Use it only for targeted follow-up, not routine active-backend shootouts. Release evidence should pass --forbid-backends neo4j,pggraph,latticedb so routine active-backend evidence stays sidecar-free until each optional backend is explicitly selected for a controlled comparison.

The --require-git-tracked-inputs flag is mandatory for active release evidence. It rejects reports whose eventloom_path or queries_file points at a local-only file, which prevents passing fingerprints that cannot be reproduced from a clean checkout. If you regenerate LongMemEval target-query files, track the replacement query inputs with the report update instead of leaving release evidence dependent on local scratch files.
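
One way to test whether an input is git-tracked is `git ls-files --error-unmatch`, which exits non-zero when a path is not in the index. The sketch below shows that approach. It is not necessarily how --require-git-tracked-inputs is implemented, and `is_git_tracked` is a hypothetical helper name.

```python
import subprocess


def is_git_tracked(path: str, run=subprocess.run) -> bool:
    """Return True when git reports the path as tracked in the current repository.

    `run` is injectable so the check can be exercised without a real repository.
    Raises FileNotFoundError if the git executable is not installed.
    """
    result = run(
        ["git", "ls-files", "--error-unmatch", "--", path],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is the "not tracked" signal, not an exception
    )
    return result.returncode == 0


if __name__ == "__main__":
    for candidate in (
        "reports/backend-shootout/sample.eventloom",
        "reports/backend-shootout/queries.json",
    ):
        print(candidate, is_git_tracked(candidate))
```

Run it from inside the repository; outside a git work tree the command also exits non-zero, so the helper reports the path as untracked.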

The current focused embedded graph-traversal evidence is archived at reports/benchmarks/backend-shootout-graph-traversal-embedded-after-carry-forward. That run used mempalace-graph-traversal-v1, 10 subjects, hash embeddings, limit=5, BM25 as the baseline, and PROJECTION_BACKEND=embedded. It reports Zaxy embedded mean score 1.000, Answer@5=1.000, Recall@5=1.000, citation coverage 1.000, p50 checkout latency 18.31ms, and p95 checkout latency 31.87ms. The predecessor failed because the embedded adapter did not match Neo4j's undirected traversal semantics and did not carry active relationships forward when an entity was reasserted into a new temporal version.
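
The traversal-semantics mismatch can be illustrated with a toy graph. This is a generic breadth-first sketch, not the embedded adapter's code; `reachable` and the sample edges are hypothetical.

```python
from collections import deque


def reachable(
    edges: list[tuple[str, str]],
    start: str,
    max_hops: int,
    undirected: bool = True,
) -> set[str]:
    """Return nodes within max_hops of start, optionally ignoring edge direction."""
    adjacency: dict[str, set[str]] = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        if undirected:
            # Undirected semantics: an edge can be followed from either end.
            adjacency.setdefault(target, set()).add(source)
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    seen.discard(start)
    return seen


if __name__ == "__main__":
    edges = [("alice", "project"), ("bob", "project")]
    # Directed expansion from "project" finds nothing; undirected finds both people.
    print(reachable(edges, "project", 1, undirected=False))
    print(sorted(reachable(edges, "project", 1, undirected=True)))
```

A query that starts at the target end of a relationship only finds its neighbors under undirected semantics, which is the kind of gap a directed-only adapter would show on a graph-traversal workload.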

A checked smoke workload is available at reports/backend-shootout/sample.eventloom with query specs in reports/backend-shootout/queries.json. It is intentionally small, so it is a reproducibility check rather than default-backend evidence. The checked report location is reports/backend-shootout/backend-shootout.json. The current checked smoke report covers embedded and bm25 with citation coverage at 1.0, report metadata, source fingerprints, and workload fingerprints. Run additional explicit backends locally when Neo4j, pgGraph, or LatticeDB infrastructure is available.

The medium-scale backend evidence is archived at reports/backend-shootout/longmemeval-40-backend-shootout.json, using reports/backend-shootout/longmemeval-40.eventloom.jsonl and reports/backend-shootout/longmemeval-40-queries-with-targets.json. This 40-question LongMemEval-compatible run is not a default-backend gate. It is a scale and surface check for the embedded runtime path.

In that report, embedded/Kuzu completed with two contract rows. The raw retrieve row scored Answer@5=0.575, Recall@5=1.0, citation coverage 1.0, checkout p95 10.55ms, lane p95s of exact 0.007ms, keyword 3.285ms, vector 2.614ms, and traversal 0.005ms. The answer_ready row scored Answer@5=1.0 and Recall@5=1.0 with checkout p95 56.541ms, mean returned tokens 3372.5, mean injected tokens 3507.55, and Answer@5 per 1k injected tokens 0.2851. The shared projection path had cold bootstrap 225.93ms, first useful init 9347.717ms, append-to-projection p95 24.674ms, rebuild recovery 10433.54ms, projection throughput 57.007 events/sec, resident memory delta 652308480 bytes, on-disk footprint 28762112 bytes, and dashboard graph source embedded with 100 nodes and 100 edges. The BM25 control completed with Answer@5=0.55, Recall@5=1.0, citation coverage 1.0, checkout p95 169.674ms, mean returned/injected tokens 3944.9, and quality plus Answer@5 per 1k returned/injected tokens 0.1394.

The embedded rows show a real local graph can be built and served to the dashboard without a sidecar, and that answer-ready synthesis now closes the answer-surface gap while the raw retrieve row remains the operational latency contract. Use the focused graph-traversal archive above for relationship behavior, and use this 40-query report as operational evidence that the embedded path can run a larger Eventloom file through projection, checkout, token-efficiency accounting, and dashboard summary.

Explicit Kuzu bulk projection transactions and prewarmed keyword/vector caches reduced the earlier roughly 120-second projection/rebuild run to roughly 10 seconds while keeping vector retrieval enabled. This is still not the same as passing the full default-backend gate, but it moves embedded projection throughput from a product blocker into the next optimization target.

The current 100-query scale evidence is archived at reports/backend-shootout/longmemeval-100-backend-shootout.json, using reports/backend-shootout/longmemeval-100.eventloom.jsonl and reports/backend-shootout/longmemeval-100-queries-with-targets.json. This run covers 100 queries and 1,559 Eventloom events.

Embedded/Kuzu again emits separate contract rows. The raw retrieve row scored Answer@5=0.52, Recall@5=0.99, citation coverage 1.0, checkout p95 19.904ms, checkout p99 21.75ms, lane p95s of exact 0.006ms, keyword 6.08ms, vector 7.689ms, and traversal 0.006ms, with mean injected tokens 1492.24 and Answer@5 per 1k injected tokens 0.3485. The answer_ready row scored Answer@5=0.99 and Recall@5=1.0, first checkout 37.615ms, checkout p95 90.478ms, with mean injected tokens 3426.8 and Answer@5 per 1k injected tokens 0.2889. The shared projection path had cold bootstrap 421.649ms, first useful init 29620.186ms, append-to-projection p95 26.931ms, rebuild recovery 27678.562ms, projection throughput 53.393 events/sec, resident memory delta 1604280320 bytes, on-disk footprint 57298944 bytes, dashboard source embedded, 100 dashboard nodes, and 100 dashboard edges. BM25 scored Answer@5=0.52, Recall@5=0.9, citation coverage 1.0, checkout p95 439.913ms, mean returned/injected tokens 4179.5, and Answer@5 per 1k injected tokens 0.1244.

This vector-enabled 100-query report is now strong answer-ready evidence for the embedded default, and the raw retrieve path now clears a stricter Recall@5=0.90 release floor with the embedded scale guardrail passing. It also makes the next performance target explicit: resident memory and answer-ready tail latency still deserve focused optimization. The higher cold bootstrap is intentional: startup now prewarms the Eventloom verbatim source index so the first answer-ready checkout does not pay that source-lane build cost.
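
The per-1k-token figures quoted for the 40-query and 100-query reports are consistent with dividing Answer@5 by mean injected tokens and scaling by 1,000. The sketch below assumes that formula (the harness's exact definition is not stated here) and recomputes the reported values from the numbers in the text; `per_1k_tokens` is a hypothetical helper name.

```python
def per_1k_tokens(score: float, mean_tokens: float) -> float:
    """Score earned per 1,000 tokens, rounded to four decimals (assumed formula)."""
    if mean_tokens <= 0:
        raise ValueError("mean_tokens must be positive")
    return round(score / mean_tokens * 1000, 4)


# (label, Answer@5, mean injected tokens, reported Answer@5 per 1k injected tokens)
ROWS = [
    ("40q embedded answer_ready", 1.0, 3507.55, 0.2851),
    ("40q bm25", 0.55, 3944.9, 0.1394),
    ("100q embedded raw retrieve", 0.52, 1492.24, 0.3485),
    ("100q embedded answer_ready", 0.99, 3426.8, 0.2889),
    ("100q bm25", 0.52, 4179.5, 0.1244),
]

if __name__ == "__main__":
    for label, answer_at_5, tokens, reported in ROWS:
        computed = per_1k_tokens(answer_at_5, tokens)
        print(f"{label}: computed={computed} reported={reported}")
```

Read this way, the 100-query raw retrieve row has the highest Answer@5 per injected token because it injects far fewer tokens, while the answer_ready row trades tokens for a much higher Answer@5.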

BM25 is the zero-infrastructure lexical control. LatticeDB is tracked as a candidate backend behind the ProjectionStore factory, but it is not in the default active set. Its first adapter slice projects graph state and supports exact, keyword, traversal, temporal invalidation, source-retirement behavior, and Eventloom citation metadata on returned entities. It delegates vector and full-text search to LatticeDB and reports inferred-edge audit diagnostics. The current parked candidate evidence is Answer@5=0.0, mean score 0.0, and roughly 6.4s p50 checkout latency on the graph-traversal smoke. Graph backends run through Zaxy's MemoryFabric projection contract so the shootout measures the same retrieval surface that agents use. Backends with missing infrastructure should emit an error row rather than hiding the failure or aborting the whole report.
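
The error-row requirement can be sketched as follows. This is a minimal illustration, not the shootout script's code: the row keys (`backend`, `status`, `error`) and the `run_shootout` helper are assumptions for this example.

```python
from typing import Callable


def run_shootout(backends: dict[str, Callable[[], dict]]) -> list[dict]:
    """Run every backend and record failures as rows instead of aborting the report."""
    rows = []
    for name, run_backend in backends.items():
        try:
            metrics = run_backend()
        except Exception as exc:  # broad on purpose: any failure becomes a visible row
            rows.append(
                {"backend": name, "status": "error", "error": f"{type(exc).__name__}: {exc}"}
            )
            continue
        rows.append({"backend": name, "status": "ok", **metrics})
    return rows


if __name__ == "__main__":
    def embedded() -> dict:
        return {"citation_coverage": 1.0}

    def neo4j() -> dict:
        # Stands in for a sidecar that is not running locally.
        raise ConnectionError("neo4j not reachable")

    for row in run_shootout({"embedded": embedded, "neo4j": neo4j}):
        print(row)
```

The failed backend stays in the report as an explicit error row, so a reader can tell "not run" apart from "ran and scored zero".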

Reproduction

Run the current full 500-question Zaxy checkout headline with the cached oracle dataset, deterministic hash embeddings, and the reusable embedded projection:

EMBEDDED_GRAPH_PATH=.eventloom/projections/longmemeval-current65-percentage-boolean-comparison.kuzu \
zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-current74-zaxyonly-gated-relative-temporal-anchor-embedded-reuse-20260604 \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 10 \
  --baseline-backends none \
  --zaxy-backend checkout \
  --projection-backend embedded \
  --embedding-cache .cache/zaxy/longmemeval-500-synthesis-ledger-20260603-embeddings.json \
  --reuse-projection \
  --progress

Guard the current full 500-question headline with:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-current74-zaxyonly-gated-relative-temporal-anchor-embedded-reuse-20260604/live-benchmark.json \
  --baseline reports/benchmarks/longmemeval-500-current71-zaxyonly-semantic-scalar-totals-embedded-reuse-20260604/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.940 \
  --min-answer-recall-at-5 0.906 \
  --min-recall-at-5 1.000 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 800 \
  --max-p99-ms 1100

Run the 100-question LongMemEval-compatible release evidence with BM25 included as a local baseline when you need same-command lexical tradeoffs. Plain zaxy benchmark commands use the embedded projection backend by default; pass --projection-backend neo4j or another backend only when running an explicit sidecar comparison.

zaxy benchmark \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 100 \
  --runs 1 \
  --limit 10 \
  --zaxy-backend checkout \
  --baseline-backends bm25 \
  --embedding-cache .cache/zaxy/longmemeval-embeddings.json \
  --progress

Run the current 100-question BM25 comparison, which uses limit=5 and reuses the projection:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-100-comparison \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 100 \
  --runs 1 \
  --limit 5 \
  --baseline-backends bm25 \
  --zaxy-backend checkout \
  --reuse-projection \
  --embedding-cache .cache/zaxy/longmemeval-embeddings.json \
  --progress

Run the legacy BM25-included full 500-question archive:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-hash \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 10 \
  --zaxy-backend checkout \
  --baseline-backends bm25 \
  --embedding-cache .cache/zaxy/longmemeval-embeddings.json \
  --reset-graph \
  --progress

Guard the legacy full 500-question archive with floors pinned to its observed limit=10 result:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-hash/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.626 \
  --min-answer-recall-at-5 0.608 \
  --min-recall-at-5 0.956 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 15000 \
  --max-p99-ms 23000

Guard current same-harness backend-evaluation reports with the limit=5 Neo4j checkout control:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.714 \
  --min-answer-recall-at-5 0.626 \
  --min-recall-at-5 0.958 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1200 \
  --max-p99-ms 2500

For release gates, compare reports with zaxy benchmark-compare and keep the thresholds explicit. For market comparisons, keep external products in disclosure rows until they can run against the same dataset, query order, retrieval limit, scoring code, and citation requirements.

From a clean checkout with the cached LongMemEval dataset and embedding cache present, run all archived public LongMemEval guardrails with:

scripts/benchmark-guardrails.sh

zaxy doctor --beta-readiness also exposes release benchmark posture through named checks. benchmark_no_regression requires the release script to keep checkout quality floors, citation coverage at 1.0, and p95/p99 checkout latency budgets across smoke, performance, and scale backend reports. coordination_competitor_claims verifies that the archived CoordinationBench report, docs, and Quarq/Hybi manifest templates preserve the public-claim gate: disclosure-only rows must remain blocked, and any same-harness claim must carry locally scored metrics plus result-audit provenance.

Related references: testing.md, retrieval.md, competitive-positioning.md, benchmark-review.md, and README.md. The public explanation is site/index.html.