Benchmarks
Zaxy publishes benchmark evidence in two categories: same-harness Zaxy runs and external disclosures from other memory products. Keep those categories separate. Same-harness results are generated by Zaxy's benchmark CLI over committed or operator-supplied workloads. External disclosures are numbers quoted from public project pages or public benchmark analysis pages; they are useful market context, but they are not same-harness results until those systems run inside the same measurement protocol.
Current Headline
The current public Zaxy result is the archived 100-question LongMemEval-compatible run at reports/benchmarks/live-benchmark.md. It uses the cleaned LongMemEval workload, deterministic local hash embeddings, BM25 as the same-harness lexical baseline, and graph-backed Zaxy retrieval over 1,559 Eventloom events, 100 queries, 100 subjects, and 265 sessions.
| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | Approx tokens |
|---|---|---|---|---|---|---|---|---|
| BM25 | 0.540 | 0.500 | 1.000 | 0.710 | 0.840 | 0.870 | 85.77 | 5493 |
| Zaxy | 0.970 | 0.950 | 1.000 | 1.000 | 1.000 | 1.000 | 816.71 | 11038 |
This is the strongest public claim because it tests conversational long-memory retrieval across multi-session and temporal-reasoning questions. The report is not a full 500-question LongMemEval publication yet, and it should not be described as one. The tradeoff is explicit in the same report: BM25 is much faster and returns fewer tokens, while Zaxy substantially improves answer and multi-hop recall.
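The size of that tradeoff can be read straight off the headline table. The sketch below is only arithmetic on the published figures; the dictionary keys are illustrative names, not fields from the benchmark report.

```python
# Figures copied from the 100-question headline table.
# The key names are illustrative, not the report's field names.
bm25 = {"answer_at_5": 0.500, "p95_ms": 85.77, "approx_tokens": 5493}
zaxy = {"answer_at_5": 0.950, "p95_ms": 816.71, "approx_tokens": 11038}


def ratio(metric: str) -> float:
    """Return how many times larger the Zaxy figure is than the BM25 figure."""
    return zaxy[metric] / bm25[metric]


latency_ratio = ratio("p95_ms")
token_ratio = ratio("approx_tokens")
answer_gain = zaxy["answer_at_5"] - bm25["answer_at_5"]

print(f"p95 latency: {latency_ratio:.1f}x BM25")
print(f"approx tokens: {token_ratio:.1f}x BM25")
print(f"Answer@5 gain: {answer_gain:+.3f}")
```

On these numbers Zaxy costs roughly 9.5 times the p95 latency and about twice the tokens of BM25 in exchange for a 0.45 gain in Answer@5.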
Full 500-Question LongMemEval Run
Legacy limit=10 full-set floor
The full 500-question LongMemEval-compatible hash run is archived at reports/benchmarks/longmemeval-500-hash/live-benchmark.md. It uses the cleaned LongMemEval workload, deterministic local hash embeddings, limit=10, BM25 as the same-harness lexical baseline, and Zaxy checkout retrieval over 5,372 Eventloom events, 500 queries, 500 subjects, and 948 sessions. Its workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.
| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
|---|---|---|---|---|---|---|---|---|
| BM25 | 0.560 | 0.516 | 1.000 | 0.592 | 0.770 | 0.802 | 356.67 | 433.55 |
| Zaxy checkout | 0.724 | 0.628 | 1.000 | 0.960 | 0.972 | 0.972 | 1472.11 | 2652.55 |
This legacy full-set result remains the limit=10 no-regression floor for checkout-wide changes that still run that archived harness. It is not a replacement for the stronger 100-question headline. The miss taxonomy shows the quality target clearly: Zaxy checkout now has 14 retrieval misses and 172 synthesis misses after the wedding-list answer-surface improvement. Future retrieval, checkout, Skill Memory, or backend changes should not reduce the published floor of mean score 0.626, Answer@5 0.608, citation coverage 1.000, and R@5 0.956 while they work down the synthesis-miss count.
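The miss taxonomy can be cross-checked against the table with simple arithmetic. This is a sketch under two assumptions that the report itself does not spell out here: a retrieval miss means no answer-bearing session was returned within the limit, and both retrieval and synthesis misses count against Answer@5.

```python
TOTAL_QUESTIONS = 500
RETRIEVAL_MISSES = 14    # from the miss taxonomy
SYNTHESIS_MISSES = 172   # from the miss taxonomy


def rate(hits: int, total: int) -> float:
    """Hit rate rounded to three decimals, matching the table precision."""
    return round(hits / total, 3)


# Assumption: every question that is not a retrieval miss counts as retrieved.
retrieved_rate = rate(TOTAL_QUESTIONS - RETRIEVAL_MISSES, TOTAL_QUESTIONS)

# Assumption: a question is answered only if it is neither kind of miss.
answered_rate = rate(
    TOTAL_QUESTIONS - RETRIEVAL_MISSES - SYNTHESIS_MISSES, TOTAL_QUESTIONS
)

print(f"retrieved: {retrieved_rate:.3f}")
print(f"answered:  {answered_rate:.3f}")
```

Under those assumptions the counts line up with the Zaxy checkout row: 486 of 500 questions retrieved matches Recall@10 of 0.972, and 314 of 500 answered matches Answer@5 of 0.628. Most of the remaining headroom is therefore in synthesis, not retrieval.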
Current same-harness backend-evaluation floor
Backend-evaluation work now uses the current limit=5 full-set control at reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.md. Its workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854, matching the pgGraph full-set comparison below.
| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
|---|---|---|---|---|---|---|---|---|
| BM25 | 0.516 | 0.516 | 1.000 | 0.592 | 0.770 | 0.770 | 347.47 | 406.53 |
| Zaxy checkout | 0.714 | 0.626 | 1.000 | 0.946 | 0.958 | 0.958 | 1089.53 | 2456.86 |
Use this current backend-evaluation floor when comparing projection backends or other limit=5 full-set reports. Do not compare a limit=5 backend run directly against the legacy limit=10 mean-score floor without also running a same-command Neo4j control.
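One way to enforce that rule is to check that two reports are comparable before comparing scores at all. The sketch below is hypothetical: `ReportKey` and `comparability_problems` are names invented for illustration and are not part of the Zaxy CLI or its report schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportKey:
    """Hypothetical summary of what must match before scores are compared."""
    workload_sha256: str
    retrieval_limit: int
    question_count: int


def comparability_problems(candidate: ReportKey, control: ReportKey) -> list[str]:
    """List the reasons two reports should not be compared directly."""
    problems = []
    if candidate.workload_sha256 != control.workload_sha256:
        problems.append("workload hash differs")
    if candidate.retrieval_limit != control.retrieval_limit:
        problems.append(
            f"retrieval limit differs: {candidate.retrieval_limit} "
            f"vs {control.retrieval_limit}"
        )
    if candidate.question_count != control.question_count:
        problems.append("question count differs")
    return problems


SHA = "0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854"
legacy_floor = ReportKey(SHA, retrieval_limit=10, question_count=500)
neo4j_control = ReportKey(SHA, retrieval_limit=5, question_count=500)
backend_run = ReportKey(SHA, retrieval_limit=5, question_count=500)

print(comparability_problems(backend_run, neo4j_control))
print(comparability_problems(backend_run, legacy_floor))
```

A limit=5 backend run passes against the limit=5 Neo4j control and is flagged against the legacy limit=10 floor, even though the workload hash is the same.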
Skill Memory changes must pass the full 500-question guardrail before release, because the checkout skill lane shares ranking, evidence selection, prompt formatting, and MCP tool surfaces with factual memory. The Skill Memory lane may add cited procedural guidance, but it must not lower Zaxy checkout mean score, Answer@5, Recall@5, citation coverage, or the archived latency envelope unless a new public benchmark report explicitly replaces these floors. Skill Memory outcome analytics are read-only checkout diagnostics: promotion candidates, rollback candidates, and contradiction analytics can guide an agent, but they do not revise, delete, or promote a skill without an explicit skill.* event.
Projection backend changes must pass the full 500-question guardrail before release, because backend swaps can alter exact, keyword, vector, traversal, temporal, and citation behavior even when Eventloom remains the source of truth. Neo4j is the default backend until an experimental backend matches or beats the archived quality, citation, temporal, latency, and operations gates on the same harness.
The experimental pgGraph adapter now has an initial same-harness backend comparison, but it remains experimental. It supports projection, exact search, keyword search, pgvector-backed vector search, invalidation, and traversal. It remains behind PROJECTION_BACKEND=pggraph, and vector search uses pgvector only when the PostgreSQL endpoint has the extension installed. pgGraph is still not eligible as the default backend until it passes the full guardrail on the same harness and has repeatable operations coverage.
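The opt-in behavior described above can be pictured as a small resolver. This is a minimal sketch, not Zaxy's configuration code: only the `PROJECTION_BACKEND=pggraph` setting comes from this document, while the function name, the `"neo4j"` value spelling, and the error handling are assumptions.

```python
import os
from typing import Mapping, Optional

# Assumption: value spellings. Only "pggraph" is documented on this page.
SUPPORTED_BACKENDS = {"neo4j", "pggraph"}
DEFAULT_BACKEND = "neo4j"  # Neo4j stays the default until pgGraph passes the gates


def resolve_projection_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the projection backend name, falling back to the default."""
    env = os.environ if env is None else env
    value = env.get("PROJECTION_BACKEND", "").strip().lower()
    if not value:
        return DEFAULT_BACKEND
    if value not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported PROJECTION_BACKEND: {value!r}")
    return value


print(resolve_projection_backend({}))
print(resolve_projection_backend({"PROJECTION_BACKEND": "pggraph"}))
```

Keeping the experimental backend behind an explicit setting means an unset environment never selects pgGraph.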
pgGraph Backend Comparison
The 100-question backend comparison is archived at reports/benchmarks/longmemeval-100-pggraph-comparison/live-benchmark.md and reports/benchmarks/longmemeval-100-neo4j-comparison/live-benchmark.md. Both runs use the same cleaned LongMemEval-compatible slice, deterministic hash embeddings, limit=5, BM25 as the lexical baseline, and --zaxy-backend both.
| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | p95 ms | Approx tokens |
|---|---|---|---|---|---|---|---|
| BM25 | 0.500 | 0.500 | 1.000 | 0.710 | 0.840 | 98.24 | 2514 |
| pgGraph Zaxy | 0.960 | 0.960 | 1.000 | 0.980 | 0.980 | 355.37 | 5789 |
| pgGraph checkout | 0.910 | 0.910 | 1.000 | 0.950 | 0.980 | 312.62 | 5033 |
| Neo4j Zaxy | 0.960 | 0.960 | 1.000 | 1.000 | 1.000 | 667.78 | 3937 |
| Neo4j checkout | 0.930 | 0.930 | 1.000 | 0.960 | 1.000 | 625.98 | 7419 |
The full 500-question pgGraph comparison is archived at reports/benchmarks/longmemeval-500-pggraph-comparison/live-benchmark.md. A same-harness Neo4j checkout control run is archived at reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.md. The pgGraph comparison uses the full cleaned LongMemEval-compatible workload, deterministic hash embeddings, limit=5, BM25 as the lexical baseline, and --zaxy-backend both. It also uses --reset-graph to truncate and rebuild the PostgreSQL projection tables before ingestion so repeated benchmark runs do not accumulate stale benchmark projections.
| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | p95 ms | Approx tokens |
|---|---|---|---|---|---|---|---|
| BM25 | 0.512 | 0.512 | 1.000 | 0.592 | 0.770 | 343.80 | 2661 |
| pgGraph Zaxy | 0.698 | 0.698 | 1.000 | 0.958 | 0.958 | 1077.11 | 4193 |
| pgGraph checkout | 0.714 | 0.632 | 1.000 | 0.948 | 0.958 | 1020.22 | 13016 |
| Neo4j checkout control | 0.714 | 0.626 | 1.000 | 0.946 | 0.958 | 1089.53 | 13431 |
The clean pgGraph run restored the full-set Recall@5 floor and passed Answer@5, citation coverage, and latency. The same-harness Neo4j checkout control on the current workload hash scored 0.714, and pgGraph checkout scored 0.714, so the current adapter comparison no longer shows a pgGraph-specific quality regression. Checkout token volume is higher than the previous archive because the benchmark now includes supporting facts and evidence from the model-facing Memory Checkout object even when compact contexts are present. pgGraph remains an evaluation backend only until the full 500-question floor is re-baselined on a frozen same-harness workload and operations coverage includes container bootstrap, schema reset, graph rebuild, and failure recovery.
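The "no pgGraph-specific regression" reading can be reproduced from the two checkout rows. The sketch below is an illustration only: `regressions` is a hypothetical helper, the metric key names are invented, and the real release gate is the `zaxy benchmark-compare` command.

```python
# Metrics where a lower candidate value is a regression.
HIGHER_IS_BETTER = ("mean_score", "answer_at_5", "citation_coverage", "recall_at_5")
# Metrics where a higher candidate value is a regression.
LOWER_IS_BETTER = ("p95_ms",)

# Rows copied from the full 500-question comparison table.
neo4j_control = {
    "mean_score": 0.714, "answer_at_5": 0.626, "citation_coverage": 1.000,
    "recall_at_5": 0.958, "p95_ms": 1089.53,
}
pggraph_checkout = {
    "mean_score": 0.714, "answer_at_5": 0.632, "citation_coverage": 1.000,
    "recall_at_5": 0.958, "p95_ms": 1020.22,
}


def regressions(candidate: dict, control: dict) -> list[str]:
    """Return the metrics on which the candidate is worse than the control."""
    failed = [m for m in HIGHER_IS_BETTER if candidate[m] < control[m]]
    failed += [m for m in LOWER_IS_BETTER if candidate[m] > control[m]]
    return failed


print(regressions(pggraph_checkout, neo4j_control))
```

pgGraph checkout ties or beats the control on every listed metric, so the list is empty. Passing this check does not by itself make pgGraph eligible as the default backend, because the operations gates are not metrics in the report.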
BM25 Comparison
The current same-harness BM25 comparison is archived at reports/benchmarks/longmemeval-100-comparison/live-benchmark.md. It reruns the same 100-question LongMemEval-compatible slice with BM25 and Zaxy checkout at limit=5.
| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|---|
| BM25 | 0.500 | 0.500 | 1.000 | 0.710 | 0.840 | 0.840 |
| Zaxy checkout | 0.900 | 0.880 | 1.000 | 0.950 | 0.990 | 0.990 |
The practical reading is that BM25 can find many answer-bearing sessions, but it loses much more often during temporal and multi-session synthesis. Zaxy's advantage comes from checkout-level recall planning, source-first evidence selection, temporal/entity bridging, and cited context assembly. The tradeoff is latency: BM25 is much faster in this run, while Zaxy returns richer cited context.
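The gap between finding a session and answering from it is easier to see with the two metric shapes side by side. This is a sketch, not the harness's scoring code: it assumes Recall@k is a per-question hit on any answer-bearing session and Answer@k is the fraction of expected answer terms present in the top-k context text. The question data is invented for illustration.

```python
def recall_at_k(ranked_session_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """1.0 if any answer-bearing session appears in the top k, else 0.0."""
    return 1.0 if any(s in relevant_ids for s in ranked_session_ids[:k]) else 0.0


def answer_at_k(ranked_contexts: list[str], expected_terms: list[str], k: int) -> float:
    """Fraction of expected terms found (case-insensitively) in the top-k text."""
    if not expected_terms:
        return 0.0
    text = " ".join(ranked_contexts[:k]).lower()
    found = sum(1 for term in expected_terms if term.lower() in text)
    return found / len(expected_terms)


# Invented toy question: "How many days passed between booking and invitations?"
ranked_ids = ["s3", "s1", "s7"]
relevant = {"s1", "s7"}
contexts = [
    "We talked about catering options.",
    "Booked the venue on March 3.",
    "Sent the invitations on March 15.",
]
expected = ["12 days"]

print(recall_at_k(ranked_ids, relevant, k=5))
print(answer_at_k(contexts, expected, k=5))
```

In this toy case both answer-bearing sessions are retrieved, so recall is a hit, but the expected term never appears because nothing has combined the two dates. That is the temporal and multi-session synthesis failure described above.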
Representative Suite
Zaxy also has a synthetic suite-v1 benchmark review in benchmark-review.md. That review covers 650 paired queries across current memory, historical memory, graph traversal, indexed documents, sanitized transcripts, and mixed cross-lane context. On that representative agent-context workload, Zaxy scored 1.000 with OpenAI text-embedding-3-small, compared with 0.520 for vector and markdown+vector baselines and 0.005 for direct markdown scanning.
Use suite-v1 to evaluate Zaxy's architectural thesis: temporal, relational, replayable agent context should beat flat chunk retrieval on tasks that require current-vs-historical truth, graph relationships, citations, and mixed context. Use LongMemEval-compatible runs to compare with public memory-product claims.
External Disclosures
These rows summarize public claims from other projects. They are external disclosures, not same-harness results, because Zaxy did not execute those systems inside its benchmark harness.
| System | Public claim | Source | Interpretation |
|---|---|---|---|
| MemPalace | 96.6% raw LongMemEval R@5; 98.4% held-out hybrid R@5; tuned full-set runs reported separately | MemPalace BENCHMARKS.md, independent benchmark analysis | Strong public target for LongMemEval-style retrieval; their docs and public analysis distinguish raw, held-out, and tuned reranked results. |
| Agent Memory | 95.2% R@5 on LongMemEval-S, with BM25 + vector + graph retrieval and on-device reranking | agent-memory.dev | Direct product-positioning target for coding-agent memory with aggressive hook and viewer UX. |
| Mem0 | +26% Accuracy over OpenAI Memory on LOCOMO; 91% faster responses and 90% lower token usage than full-context approaches | mem0 LLM.md, memory-benchmarks | Different benchmark family and metric; useful as production-memory context, but not directly comparable to LongMemEval R@5. |
When writing public copy, do not collapse these into a single leaderboard. Metric families differ: R@5 retrieval, Answer@5 expected-term recall, LOCOMO judge accuracy, and token/latency reductions answer different questions.
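A simple way to keep the families apart in tooling or copy generation is to tag each claim with its metric family and only ever list claims within a family. The sketch below uses the claims from the table above; the `Claim` type and the family labels are illustrative choices, not an existing Zaxy schema.

```python
from collections import defaultdict
from typing import NamedTuple


class Claim(NamedTuple):
    system: str
    metric_family: str
    claim: str


# Claims copied from the external disclosures table.
claims = [
    Claim("MemPalace", "R@5 retrieval", "96.6% raw LongMemEval R@5"),
    Claim("Agent Memory", "R@5 retrieval", "95.2% R@5 on LongMemEval-S"),
    Claim("Mem0", "LOCOMO judge accuracy", "+26% accuracy over OpenAI Memory"),
    Claim("Mem0", "token/latency reduction", "91% faster, 90% lower token usage"),
]


def group_by_family(items: list[Claim]) -> dict[str, list[Claim]]:
    """Bucket claims so each family is reported on its own, never merged."""
    groups: dict[str, list[Claim]] = defaultdict(list)
    for item in items:
        groups[item.metric_family].append(item)
    return dict(groups)


for family, members in group_by_family(claims).items():
    print(family, "->", [m.system for m in members])
```

Even within the R@5 family the two rows use different dataset variants (LongMemEval and LongMemEval-S), so grouping narrows the comparison but does not make it same-harness.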
Same-Harness Adapter Feasibility
As of May 18, 2026, competitor adapters have different readiness levels:
| System | Status | Evidence | Same-harness blocker |
|---|---|---|---|
| MemPalace | adapter candidate | The public repo documents benchmarks/longmemeval_bench.py, committed per-question results, and a no-API-key raw LongMemEval path. | Build a wrapper that exports per-query top-k contexts into Zaxy's BenchmarkRun schema without changing MemPalace ranking settings. |
| Mem0 | benchmark harness candidate | mem0ai/memory-benchmarks includes LongMemEval scripts, but the OSS path requires Docker, Qdrant, model configuration, and LLM answer/judge settings. | Separate retrieval-only evidence from answer/judge accuracy, pin backend config, and preserve token/latency accounting. |
| Agent Memory | external disclosure only | The product page reports LongMemEval-S R@5 and the retrieval stack, but it does not document a stable same-harness CLI/API contract for Zaxy to call. | Keep the claim in external disclosures until a reproducible benchmark command, dataset contract, and result export are available. |
No same-harness adapter should be published without a pinned install command, dataset mapping, retrieval limit, score mapping, latency/tokens capture, and a clear statement about whether the competitor result is retrieval recall, answer/judge accuracy, or another metric family.
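That publication rule reads naturally as a checklist. The sketch below is hypothetical: `AdapterManifest` and its field names are invented to mirror the requirements in the sentence above and do not correspond to an existing Zaxy type. The example values are placeholders.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AdapterManifest:
    """Hypothetical record of what an adapter must pin before publication."""
    install_command: Optional[str] = None
    dataset_mapping: Optional[str] = None
    retrieval_limit: Optional[int] = None
    score_mapping: Optional[str] = None
    latency_token_capture: Optional[str] = None
    metric_family: Optional[str] = None  # retrieval recall, answer/judge accuracy, ...


def missing_requirements(manifest: AdapterManifest) -> list[str]:
    """Return the names of required fields that are still unset or empty."""
    return [
        f.name for f in fields(manifest)
        if getattr(manifest, f.name) in (None, "")
    ]


# Placeholder draft: only two of the six requirements are filled in.
draft = AdapterManifest(retrieval_limit=5, metric_family="retrieval recall")
print(missing_requirements(draft))
```

An adapter result would be publishable only when the returned list is empty.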
Reproduction
Run the current LongMemEval-compatible release evidence with BM25 included as a local baseline:
zaxy benchmark \
--embedding-provider hash \
--workload longmemeval \
--dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
--questions 100 \
--runs 1 \
--limit 10 \
--zaxy-backend checkout \
--baseline-backends bm25 \
--embedding-cache .cache/zaxy/longmemeval-embeddings.json \
--progress
Run the current BM25 comparison:
zaxy benchmark \
--output-dir reports/benchmarks/longmemeval-100-comparison \
--embedding-provider hash \
--workload longmemeval \
--dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
--questions 100 \
--runs 1 \
--limit 5 \
--baseline-backends bm25 \
--zaxy-backend checkout \
--reuse-projection \
--embedding-cache .cache/zaxy/longmemeval-embeddings.json \
--progress
Run the full 500-question archive:
zaxy benchmark \
--output-dir reports/benchmarks/longmemeval-500-hash \
--embedding-provider hash \
--workload longmemeval \
--dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
--questions 500 \
--runs 1 \
--limit 10 \
--zaxy-backend checkout \
--baseline-backends bm25 \
--embedding-cache .cache/zaxy/longmemeval-embeddings.json \
--reset-graph \
--progress
Guard the full 500-question archive with floors pinned to the current observed legacy limit=10 result:
zaxy benchmark-compare reports/benchmarks/longmemeval-500-hash/live-benchmark.json \
--backend zaxy-checkout \
--min-mean-score 0.626 \
--min-answer-recall-at-5 0.608 \
--min-recall-at-5 0.956 \
--min-citation-coverage 1.0 \
--max-p95-ms 15000 \
--max-p99-ms 23000
Guard current same-harness backend-evaluation reports with the limit=5 Neo4j checkout control:
zaxy benchmark-compare reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.json \
--backend zaxy-checkout \
--min-mean-score 0.714 \
--min-answer-recall-at-5 0.626 \
--min-recall-at-5 0.958 \
--min-citation-coverage 1.0 \
--max-p95-ms 1200 \
--max-p99-ms 2500
For release gates, compare reports with zaxy benchmark-compare and keep the thresholds explicit. For market comparisons, keep external products in disclosure rows until they can run against the same dataset, query order, retrieval limit, scoring code, and citation requirements.
From a clean checkout with the cached LongMemEval dataset and embedding cache present, run all archived public LongMemEval guardrails with:
scripts/benchmark-guardrails.sh
Related references: testing.md, retrieval.md, competitive-positioning.md, benchmark-review.md, and README.md. The public explanation is site/index.html.