Benchmarks

Zaxy publishes benchmark evidence in two categories: same-harness Zaxy runs and external disclosures from other memory products. Keep those categories separate. Same-harness results are generated by Zaxy's benchmark CLI over committed or operator-supplied workloads. External disclosures are numbers quoted from public project pages or public benchmark analysis pages; they are useful market context, but they are not same-harness results until those systems run inside the same measurement protocol.

Current Headline

The current public Zaxy result is the archived 100-question LongMemEval-compatible run at reports/benchmarks/live-benchmark.md. It uses the cleaned LongMemEval workload, deterministic local hash embeddings, BM25 as the same-harness lexical baseline, and graph-backed Zaxy retrieval over 1,559 Eventloom events, 100 queries, 100 subjects, and 265 sessions.

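| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | Approx tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.540 | 0.500 | 1.000 | 0.710 | 0.840 | 0.870 | 85.77 | 5493 |
| Zaxy | 0.970 | 0.950 | 1.000 | 1.000 | 1.000 | 1.000 | 816.71 | 11038 |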

This is the strongest public claim because it tests conversational long-memory retrieval across multi-session and temporal-reasoning questions. The report is not a full 500-question LongMemEval publication yet, and it should not be described as one. The tradeoff is explicit in the same report: BM25 is much faster and returns fewer tokens, while Zaxy substantially improves answer and multi-hop recall.
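To make the tradeoff concrete, the following minimal sketch computes quality gains and cost ratios from the headline table. The dictionary layout and the `tradeoff` helper are illustrative assumptions, not part of the Zaxy CLI or its report format; only the numbers come from the table above.

```python
# Values copied from the headline table (100-question run).
# The dict layout and key names are illustrative, not Zaxy's report schema.
HEADLINE = {
    "BM25": {"mean_score": 0.540, "answer_at_5": 0.500, "p95_ms": 85.77, "approx_tokens": 5493},
    "Zaxy": {"mean_score": 0.970, "answer_at_5": 0.950, "p95_ms": 816.71, "approx_tokens": 11038},
}


def tradeoff(baseline: dict, candidate: dict) -> dict:
    """Report quality gains as absolute differences and costs as ratios."""
    return {
        "mean_score_gain": round(candidate["mean_score"] - baseline["mean_score"], 3),
        "answer_at_5_gain": round(candidate["answer_at_5"] - baseline["answer_at_5"], 3),
        "p95_latency_ratio": round(candidate["p95_ms"] / baseline["p95_ms"], 2),
        "token_ratio": round(candidate["approx_tokens"] / baseline["approx_tokens"], 2),
    }


summary = tradeoff(HEADLINE["BM25"], HEADLINE["Zaxy"])
for name, value in summary.items():
    print(f"{name}: {value}")
```

Reporting gains and costs side by side keeps the latency and token price visible whenever the quality delta is quoted.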

Full 500-Question LongMemEval Run

Legacy limit=10 full-set floor

The full 500-question LongMemEval-compatible hash run is archived at reports/benchmarks/longmemeval-500-hash/live-benchmark.md. It uses the cleaned LongMemEval workload, deterministic local hash embeddings, limit=10, BM25 as the same-harness lexical baseline, and Zaxy checkout retrieval over 5,372 Eventloom events, 500 queries, 500 subjects, and 948 sessions. Its workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854.

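| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.560 | 0.516 | 1.000 | 0.592 | 0.770 | 0.802 | 356.67 | 433.55 |
| Zaxy checkout | 0.724 | 0.628 | 1.000 | 0.960 | 0.972 | 0.972 | 1472.11 | 2652.55 |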

This legacy full-set result remains the limit=10 no-regression floor for checkout-wide changes that still run that archived harness. It is not a replacement for the stronger 100-question headline. The miss taxonomy shows the quality target clearly: Zaxy checkout now has 14 retrieval misses and 172 synthesis misses after the wedding-list answer-surface improvement. Future retrieval, checkout, Skill Memory, or backend changes should not reduce the published floor of mean score 0.626, Answer@5 0.608, citation coverage 1.000, and R@5 0.956 while they work down the synthesis-miss count.
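The miss counts can be cross-checked against the table with simple arithmetic. The sketch below assumes that a retrieval miss is a query with no answer-bearing session in the top 5, and that every remaining Answer@5 miss counts as a synthesis miss. That definition is an assumption made for illustration; the archived report may classify misses per question.

```python
def miss_counts(total_queries: int, recall_at_5: float, answer_at_5: float) -> dict:
    """Derive miss counts from aggregate rates (assumed taxonomy, see lead-in)."""
    # Retrieval miss: no answer-bearing session appears in the top 5.
    retrieval = round(total_queries * (1 - recall_at_5))
    # Answer miss: the expected answer terms are not recalled at 5.
    answer = round(total_queries * (1 - answer_at_5))
    # Synthesis miss: the evidence was retrieved but the answer was still missed.
    return {"retrieval": retrieval, "answer": answer, "synthesis": answer - retrieval}


# Zaxy checkout row of the legacy limit=10 table: Recall@5 0.972, Answer@5 0.628.
counts = miss_counts(500, recall_at_5=0.972, answer_at_5=0.628)
print(counts)
```

Under that assumption the aggregate rates reproduce the 14 retrieval misses and 172 synthesis misses quoted above.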

Current same-harness backend-evaluation floor

Backend-evaluation work now uses the current limit=5 full-set control at reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.md. Its workload SHA-256 is 0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854, matching the pgGraph full-set comparison below.

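| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 | p95 ms | p99 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.516 | 0.516 | 1.000 | 0.592 | 0.770 | 0.770 | 347.47 | 406.53 |
| Zaxy checkout | 0.714 | 0.626 | 1.000 | 0.946 | 0.958 | 0.958 | 1089.53 | 2456.86 |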

Use this current backend-evaluation floor when comparing projection backends or other limit=5 full-set reports. Do not compare a limit=5 backend run directly against the legacy limit=10 mean-score floor without also running a same-command Neo4j control.
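One way to enforce this rule is to check report settings before comparing scores. The following is a minimal sketch; `ReportMeta` and `comparability_problems` are hypothetical names, and the fields are assumptions about what a report would need to expose, not the actual report schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportMeta:
    """Hypothetical summary of the settings a report was generated with."""
    name: str
    questions: int
    limit: int
    workload_sha256: str


def comparability_problems(control: ReportMeta, candidate: ReportMeta) -> list[str]:
    """Return the reasons two reports should not be compared directly."""
    problems = []
    if control.questions != candidate.questions:
        problems.append(f"question count differs: {control.questions} vs {candidate.questions}")
    if control.limit != candidate.limit:
        problems.append(f"retrieval limit differs: {control.limit} vs {candidate.limit}")
    if control.workload_sha256 != candidate.workload_sha256:
        problems.append("workload SHA-256 differs")
    return problems


WORKLOAD = "0dc36a139bb9a4fdc7c6cd34400737a58a1eb7410517341f015e9fbfc76ed854"
legacy = ReportMeta("longmemeval-500-hash", 500, 10, WORKLOAD)
control = ReportMeta("longmemeval-500-neo4j-current-checkout", 500, 5, WORKLOAD)
pggraph = ReportMeta("longmemeval-500-pggraph-comparison", 500, 5, WORKLOAD)

print(comparability_problems(control, pggraph))  # same limit and workload
print(comparability_problems(legacy, pggraph))   # the limit mismatch is reported
```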

Skill Memory changes must pass the full 500-question guardrail before release, because the checkout skill lane shares ranking, evidence selection, prompt formatting, and MCP tool surfaces with factual memory. The Skill Memory lane may add cited procedural guidance, but it must not lower Zaxy checkout mean score, Answer@5, Recall@5, citation coverage, or the archived latency envelope unless a new public benchmark report explicitly replaces these floors. Skill Memory outcome analytics are read-only checkout diagnostics: promotion candidates, rollback candidates, and contradiction analytics can guide an agent, but they do not revise, delete, or promote a skill without an explicit skill.* event.

Projection backend changes must pass the full 500-question guardrail before release, because backend swaps can alter exact, keyword, vector, traversal, temporal, and citation behavior even when Eventloom remains the source of truth. Neo4j is the default backend until an experimental backend matches or beats the archived quality, citation, temporal, latency, and operations gates on the same harness.

The pgGraph adapter now has an initial same-harness backend comparison, but it remains experimental. It supports projection, exact search, keyword search, pgvector-backed vector search, invalidation, and traversal. It stays behind PROJECTION_BACKEND=pggraph, and vector search uses pgvector only when the PostgreSQL endpoint has the extension installed. pgGraph is not eligible as the default backend until it passes the full guardrail on the same harness and has repeatable operations coverage.
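As an illustration of the opt-in behavior, backend selection could resolve the environment variable roughly as follows. This is a sketch only: `resolve_projection_backend` is a hypothetical helper, and the `neo4j` value string is an assumption (the document names only `pggraph` and states that Neo4j is the default).

```python
from typing import Mapping, Optional

# "pggraph" is named in this document; the "neo4j" spelling is assumed.
SUPPORTED_BACKENDS = {"neo4j", "pggraph"}
DEFAULT_BACKEND = "neo4j"


def resolve_projection_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the projection backend, falling back to the default when unset."""
    env = {} if env is None else env
    value = env.get("PROJECTION_BACKEND", "").strip().lower()
    if not value:
        return DEFAULT_BACKEND
    if value not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported PROJECTION_BACKEND: {value!r}")
    return value


print(resolve_projection_backend({}))
print(resolve_projection_backend({"PROJECTION_BACKEND": "pggraph"}))
```

In a real process the mapping would be `os.environ`; explicit dictionaries are used here so the example does not depend on the caller's environment.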

pgGraph Backend Comparison

The 100-question backend comparison is archived at reports/benchmarks/longmemeval-100-pggraph-comparison/live-benchmark.md and reports/benchmarks/longmemeval-100-neo4j-comparison/live-benchmark.md. Both runs use the same cleaned LongMemEval-compatible slice, deterministic hash embeddings, limit=5, BM25 as the lexical baseline, and --zaxy-backend both.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | p95 ms | Approx tokens |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.500 | 0.500 | 1.000 | 0.710 | 0.840 | 98.24 | 2514 |
| pgGraph Zaxy | 0.960 | 0.960 | 1.000 | 0.980 | 0.980 | 355.37 | 5789 |
| pgGraph checkout | 0.910 | 0.910 | 1.000 | 0.950 | 0.980 | 312.62 | 5033 |
| Neo4j Zaxy | 0.960 | 0.960 | 1.000 | 1.000 | 1.000 | 667.78 | 3937 |
| Neo4j checkout | 0.930 | 0.930 | 1.000 | 0.960 | 1.000 | 625.98 | 7419 |

The full 500-question pgGraph comparison is archived at reports/benchmarks/longmemeval-500-pggraph-comparison/live-benchmark.md. A same-harness Neo4j checkout control run is archived at reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.md. The pgGraph comparison uses the full cleaned LongMemEval-compatible workload, deterministic hash embeddings, limit=5, BM25 as the lexical baseline, and --zaxy-backend both. It also uses --reset-graph to truncate and rebuild the PostgreSQL projection tables before ingestion, so repeated benchmark runs do not accumulate stale benchmark projections.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | p95 ms | Approx tokens |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.512 | 0.512 | 1.000 | 0.592 | 0.770 | 343.80 | 2661 |
| pgGraph Zaxy | 0.698 | 0.698 | 1.000 | 0.958 | 0.958 | 1077.11 | 4193 |
| pgGraph checkout | 0.714 | 0.632 | 1.000 | 0.948 | 0.958 | 1020.22 | 13016 |
| Neo4j checkout control | 0.714 | 0.626 | 1.000 | 0.946 | 0.958 | 1089.53 | 13431 |

The clean pgGraph run restored the full-set Recall@5 floor and passed the Answer@5, citation coverage, and latency gates. The same-harness Neo4j checkout control on the current workload hash had a mean score of 0.714 and pgGraph checkout also scored 0.714, so the current adapter comparison no longer shows a pgGraph-specific quality regression. Checkout token volume is higher than in the previous archive because the benchmark now includes supporting facts and evidence from the model-facing Memory Checkout object even when compact contexts are present. pgGraph remains an evaluation-only backend until the full 500-question floor is re-baselined on a frozen same-harness workload and operations coverage includes container bootstrap, schema reset, graph rebuild, and failure recovery.
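The parity reading can be reproduced with a per-metric comparison of the candidate row against the control row. The sketch below is illustrative: the metric key names and the `regressions` helper are assumptions, and only p95 latency is treated as lower-is-better. Token volume is left out because this document does not define a token gate.

```python
HIGHER_IS_BETTER = ("mean_score", "answer_at_5", "citation_coverage", "recall_at_1", "recall_at_5")
LOWER_IS_BETTER = ("p95_ms",)

# Rows copied from the full 500-question comparison table.
NEO4J_CONTROL = {"mean_score": 0.714, "answer_at_5": 0.626, "citation_coverage": 1.000,
                 "recall_at_1": 0.946, "recall_at_5": 0.958, "p95_ms": 1089.53}
PGGRAPH_CHECKOUT = {"mean_score": 0.714, "answer_at_5": 0.632, "citation_coverage": 1.000,
                    "recall_at_1": 0.948, "recall_at_5": 0.958, "p95_ms": 1020.22}


def regressions(control: dict, candidate: dict, tolerance: float = 1e-9) -> list[str]:
    """List the metrics on which the candidate is worse than the control."""
    worse = [m for m in HIGHER_IS_BETTER if candidate[m] < control[m] - tolerance]
    worse += [m for m in LOWER_IS_BETTER if candidate[m] > control[m] + tolerance]
    return worse


print(regressions(NEO4J_CONTROL, PGGRAPH_CHECKOUT))
```

An empty list here means only that pgGraph checkout does not trail the control on these columns; it says nothing about the operations coverage that still blocks promotion.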

BM25 Comparison

The current same-harness BM25 comparison is archived at reports/benchmarks/longmemeval-100-comparison/live-benchmark.md. It reruns the same 100-question LongMemEval-compatible slice with BM25 and Zaxy checkout at limit=5.

| Backend | Mean score | Answer@5 | Citation coverage | Recall@1 | Recall@5 | Recall@10 |
| --- | --- | --- | --- | --- | --- | --- |
| BM25 | 0.500 | 0.500 | 1.000 | 0.710 | 0.840 | 0.840 |
| Zaxy checkout | 0.900 | 0.880 | 1.000 | 0.950 | 0.990 | 0.990 |

The practical reading is that BM25 can find many answer-bearing sessions, but it fails much more often on temporal and multi-session synthesis. Zaxy's advantage comes from checkout-level recall planning, source-first evidence selection, temporal/entity bridging, and cited context assembly. The tradeoff is latency: BM25 is much faster in this run, while Zaxy returns richer cited context.

Representative Suite

Zaxy also has a synthetic suite-v1 benchmark review in benchmark-review.md. That review covers 650 paired queries across current memory, historical memory, graph traversal, indexed documents, sanitized transcripts, and mixed cross-lane context. On that representative agent-context workload, Zaxy scored 1.000 with OpenAI text-embedding-3-small, compared with 0.520 for vector and markdown+vector baselines and 0.005 for direct markdown scanning.

Use suite-v1 to evaluate Zaxy's architectural thesis: temporal, relational, replayable agent context should beat flat chunk retrieval on tasks that require current-vs-historical truth, graph relationships, citations, and mixed context. Use LongMemEval-compatible runs to compare with public memory-product claims.

External Disclosures

These rows summarize public claims from other projects. They are external disclosures, not same-harness results, because Zaxy did not execute those systems inside its benchmark harness.

| System | Public claim | Source | Interpretation |
| --- | --- | --- | --- |
| MemPalace | 96.6% raw LongMemEval R@5; 98.4% held-out hybrid R@5; tuned full-set runs reported separately | MemPalace BENCHMARKS.md, independent benchmark analysis | Strong public target for LongMemEval-style retrieval; their docs and public analysis distinguish raw, held-out, and tuned reranked results. |
| Agent Memory | 95.2% R@5 on LongMemEval-S, with BM25 + vector + graph retrieval and on-device reranking | agent-memory.dev | Direct product-positioning target for coding-agent memory with aggressive hook and viewer UX. |
| Mem0 | +26% Accuracy over OpenAI Memory on LOCOMO; 91% faster responses and 90% lower token usage than full-context approaches | mem0 LLM.md, memory-benchmarks | Different benchmark family and metric; useful as production-memory context, but not directly comparable to LongMemEval R@5. |

When writing public copy, do not collapse these into a single leaderboard. Metric families differ: R@5 retrieval, Answer@5 expected-term recall, LOCOMO judge accuracy, and token/latency reductions answer different questions.
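A simple way to keep the families apart in generated copy is to group disclosures by metric family before rendering them. The sketch below is illustrative: the `Disclosure` type and the family labels are assumptions chosen for this example, and the claim strings are abbreviated from the table above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Disclosure:
    system: str
    metric_family: str  # label chosen for this sketch, not an official taxonomy
    claim: str


DISCLOSURES = [
    Disclosure("MemPalace", "LongMemEval R@5", "96.6% raw; 98.4% held-out hybrid"),
    Disclosure("Agent Memory", "LongMemEval-S R@5", "95.2%"),
    Disclosure("Mem0", "LOCOMO judge accuracy", "+26% over OpenAI Memory"),
]


def group_by_family(rows: list[Disclosure]) -> dict[str, list[Disclosure]]:
    """Bucket claims so each family is presented on its own, never ranked together."""
    groups: dict[str, list[Disclosure]] = defaultdict(list)
    for row in rows:
        groups[row.metric_family].append(row)
    return dict(groups)


for family, rows in group_by_family(DISCLOSURES).items():
    print(family)
    for row in rows:
        print(f"  {row.system}: {row.claim}")
```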

Same-Harness Adapter Feasibility

As of May 18, 2026, competitor adapters have different readiness levels:

| System | Status | Evidence | Same-harness blocker |
| --- | --- | --- | --- |
| MemPalace | adapter candidate | The public repo documents benchmarks/longmemeval_bench.py, committed per-question results, and a no-API-key raw LongMemEval path. | Build a wrapper that exports per-query top-k contexts into Zaxy's BenchmarkRun schema without changing MemPalace ranking settings. |
| Mem0 | benchmark harness candidate | mem0ai/memory-benchmarks includes LongMemEval scripts, but the OSS path requires Docker, Qdrant, model configuration, and LLM answer/judge settings. | Separate retrieval-only evidence from answer/judge accuracy, pin backend config, and preserve token/latency accounting. |
| Agent Memory | external disclosure only | The product page reports LongMemEval-S R@5 and the retrieval stack, but it does not document a stable same-harness CLI/API contract for Zaxy to call. | Keep the claim in external disclosures until a reproducible benchmark command, dataset contract, and result export are available. |

No same-harness adapter should be published without a pinned install command, dataset mapping, retrieval limit, score mapping, latency/tokens capture, and a clear statement about whether the competitor result is retrieval recall, answer/judge accuracy, or another metric family.
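The publication checklist above can be expressed as a manifest that refuses to pass while any item is missing. This is a minimal sketch; `AdapterManifest` and its field names are hypothetical and simply mirror the checklist wording.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AdapterManifest:
    """Hypothetical record of what an adapter must declare before publication."""
    system: str
    pinned_install_command: Optional[str] = None
    dataset_mapping: Optional[str] = None
    retrieval_limit: Optional[int] = None
    score_mapping: Optional[str] = None
    latency_token_capture: Optional[str] = None
    metric_family: Optional[str] = None  # retrieval recall, answer/judge accuracy, or other


def missing_items(manifest: AdapterManifest) -> list[str]:
    """Return checklist fields that are still unset or empty."""
    return [f.name for f in fields(manifest) if getattr(manifest, f.name) in (None, "")]


draft = AdapterManifest(system="MemPalace", retrieval_limit=5, metric_family="retrieval recall")
print(missing_items(draft))
```

A publishing step would proceed only when `missing_items` returns an empty list.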

Reproduction

Run the current LongMemEval-compatible release evidence with BM25 included as a local baseline:

zaxy benchmark \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 100 \
  --runs 1 \
  --limit 10 \
  --zaxy-backend checkout \
  --baseline-backends bm25 \
  --embedding-cache .cache/zaxy/longmemeval-embeddings.json \
  --progress

Run the current BM25 comparison:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-100-comparison \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 100 \
  --runs 1 \
  --limit 5 \
  --baseline-backends bm25 \
  --zaxy-backend checkout \
  --reuse-projection \
  --embedding-cache .cache/zaxy/longmemeval-embeddings.json \
  --progress

Run the full 500-question archive:

zaxy benchmark \
  --output-dir reports/benchmarks/longmemeval-500-hash \
  --embedding-provider hash \
  --workload longmemeval \
  --dataset .cache/zaxy/benchmarks/longmemeval_oracle.json \
  --questions 500 \
  --runs 1 \
  --limit 10 \
  --zaxy-backend checkout \
  --baseline-backends bm25 \
  --embedding-cache .cache/zaxy/longmemeval-embeddings.json \
  --reset-graph \
  --progress

Guard the full 500-question archive with the published legacy limit=10 floors:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-hash/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.626 \
  --min-answer-recall-at-5 0.608 \
  --min-recall-at-5 0.956 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 15000 \
  --max-p99-ms 23000

Guard current same-harness backend-evaluation reports with the limit=5 Neo4j checkout control:

zaxy benchmark-compare reports/benchmarks/longmemeval-500-neo4j-current-checkout/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.714 \
  --min-answer-recall-at-5 0.626 \
  --min-recall-at-5 0.958 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 1200 \
  --max-p99-ms 2500

For release gates, compare reports with zaxy benchmark-compare and keep the thresholds explicit. For market comparisons, keep external products in disclosure rows until they can run against the same dataset, query order, retrieval limit, scoring code, and citation requirements.
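The gate logic behind explicit thresholds is straightforward: each quality metric has a floor and each latency metric has a ceiling. The sketch below illustrates that logic with the limit=5 control thresholds from the command above. It is not the implementation of zaxy benchmark-compare, and the metric key names are assumptions; the report JSON schema is not documented here, so the observed values are entered by hand from the control table.

```python
# Thresholds mirror the limit=5 Neo4j checkout control flags shown above.
FLOORS = {"mean_score": 0.714, "answer_recall_at_5": 0.626,
          "recall_at_5": 0.958, "citation_coverage": 1.0}
CEILINGS = {"p95_ms": 1200.0, "p99_ms": 2500.0}

# Observed values from the Zaxy checkout row of the limit=5 control table.
OBSERVED = {"mean_score": 0.714, "answer_recall_at_5": 0.626, "recall_at_5": 0.958,
            "citation_coverage": 1.0, "p95_ms": 1089.53, "p99_ms": 2456.86}


def gate_failures(metrics: dict, floors: dict, ceilings: dict) -> list[str]:
    """Return one message per violated threshold; an empty list means the gate passes."""
    failures = []
    for name, floor in floors.items():
        if metrics[name] < floor:
            failures.append(f"{name}={metrics[name]} is below floor {floor}")
    for name, ceiling in ceilings.items():
        if metrics[name] > ceiling:
            failures.append(f"{name}={metrics[name]} exceeds ceiling {ceiling}")
    return failures


print(gate_failures(OBSERVED, FLOORS, CEILINGS))

# A hypothetical regressed run: mean score drops and p99 latency grows.
regressed = dict(OBSERVED, mean_score=0.700, p99_ms=2600.0)
print(gate_failures(regressed, FLOORS, CEILINGS))
```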

From a clean checkout with the cached LongMemEval dataset and embedding cache present, run all archived public LongMemEval guardrails with:

scripts/benchmark-guardrails.sh

Related references: testing.md, retrieval.md, competitive-positioning.md, benchmark-review.md, and README.md. The public explanation is site/index.html.