Benchmark results

Accuracy by tier, latency, and cost across all 7 retrieval strategies.

Sample data: run `kb-arena benchmark` to generate real results.

| Strategy | Tier 1 (Lookup) | Tier 2 (How-To) | Tier 3 (Comparison) | Tier 4 (Integration) | Tier 5 (Architecture) | Avg | Latency | Cost/Q |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QnA Pairs | 79% | 85% | 83% | 84% | 66% | 79.4% | 9043 ms | $0.48 |
| Knowledge Graph | 72% | 69% | 61% | 77% | 79% | 71.6% | 20322 ms | $1.37 |
| Hybrid | 39% | 81% | 61% | 80% | 62% | 64.6% | 41549 ms | $3.02 |
| RAPTOR | 30% | 16% | 15% | 36% | 30% | 25.4% | 7240 ms | $0.69 |
| Naive Vector | 27% | 15% | 14% | 26% | 22% | 20.8% | 6421 ms | $0.33 |
| Contextual Vector | 25% | 11% | 9% | 26% | 11% | 16.4% | 5114 ms | $0.29 |
| PageIndex | 19% | 12% | 7% | 21% | 12% | 14.2% | 10933 ms | $0.29 |

Strategy notes:

- QnA Pairs: an LLM pre-generates question-answer pairs from each doc page at index time. Direct question-to-answer matching, but it misses novel cross-topic questions.
- Knowledge Graph: extracts entities and relationships into Neo4j and queries them via Cypher templates. Excels at multi-hop dependency chains.
- Hybrid: routes each query by intent. The vector path handles lookups, the graph path handles integration queries, and how-to questions fuse both paths via RRF (see the sketch below).
- RAPTOR: builds a recursive tree of LLM cluster summaries (L0 chunks → L1 summaries → L2) and queries all levels simultaneously, targeting Tier 4/5 multi-hop questions.
- Naive Vector: chunks documents, embeds with text-embedding-3-large, and retrieves top-k by cosine similarity. Fast and simple, but no cross-topic understanding.
- Contextual Vector: same as Naive Vector, but prepends parent-topic context to each chunk before embedding. Better at disambiguating domain-specific terms.
- PageIndex: vectorless, reasoning-based retrieval. Builds a hierarchical tree index from document structure, then uses LLM reasoning to traverse it; no embeddings, no chunking. Excels on well-structured docs.
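
For reference, here is a minimal sketch of Reciprocal Rank Fusion, the fusion step named in the Hybrid strategy above. The formula (score = sum of 1/(k + rank) over each ranked list) is the standard one and k = 60 is the conventional default; the function name and document-ID inputs are illustrative, not kb-arena's actual implementation.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fusing a hypothetical vector-path ranking with a graph-path ranking:
# rrf([["doc_a", "doc_b", "doc_c"], ["doc_b", "doc_d", "doc_a"]])
# -> ["doc_b", "doc_a", "doc_d", "doc_c"]
```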

Run `kb-arena benchmark` to regenerate these results with different corpora or strategies.

[Chart: accuracy by tier for each strategy]

Methodology

Each question is sent to all 7 strategies. Answers are evaluated through a 4-pass pipeline: structural checks (must_mention / must_not_claim), entity coverage against source documentation, source attribution, and LLM-as-judge scoring for accuracy, completeness, and faithfulness.
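
Of the four passes, the structural check is the most mechanical. Below is a minimal sketch assuming case-insensitive substring matching; the must_mention / must_not_claim field names come from the pipeline description above, while the function shape and matching rule are assumptions.

```python
def structural_check(answer: str,
                     must_mention: list[str],
                     must_not_claim: list[str]) -> dict:
    """Pass 1 of 4: flag missing required phrases and forbidden claims."""
    text = answer.lower()
    missing = [m for m in must_mention if m.lower() not in text]
    violations = [c for c in must_not_claim if c.lower() in text]
    return {
        "passed": not missing and not violations,
        "missing": missing,        # required facts the answer failed to mention
        "violations": violations,  # claims the answer was not allowed to make
    }

# structural_check("The default timeout is 30 seconds.",
#                  must_mention=["30 seconds"],
#                  must_not_claim=["60 seconds"])
# -> {"passed": True, "missing": [], "violations": []}
```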

Composite ranking: 0.5 * accuracy + 0.3 * reliability + 0.2 * latency_score, where latency_score inverts p95 latency so that faster strategies score higher.
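
The same formula as code, with sample numbers from the table plugged in. The 0.5/0.3/0.2 weights come from the formula above; the min-max inversion used for latency_score is an assumption (the text only says p95 is inverted), and the reliability values in the example are made up.

```python
def latency_score(p95_ms: float, fastest_ms: float, slowest_ms: float) -> float:
    """Invert p95 latency onto [0, 1]: the fastest strategy scores 1.0."""
    if slowest_ms == fastest_ms:
        return 1.0
    return (slowest_ms - p95_ms) / (slowest_ms - fastest_ms)

def composite(accuracy: float, reliability: float, p95_ms: float,
              fastest_ms: float, slowest_ms: float) -> float:
    """Composite ranking: 0.5 * accuracy + 0.3 * reliability + 0.2 * latency_score."""
    return (0.5 * accuracy
            + 0.3 * reliability
            + 0.2 * latency_score(p95_ms, fastest_ms, slowest_ms))

# With the sample latencies above (5114 ms fastest, 41549 ms slowest) and an
# assumed reliability of 0.9 for both strategies:
# composite(0.794, 0.9, 9043, 5114, 41549)   # QnA Pairs -> ~0.845
# composite(0.646, 0.9, 41549, 5114, 41549)  # Hybrid    -> ~0.593
```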

Tiers:

- Tier 1, lookup: single-fact retrieval from one document. Example: "What is the default timeout?"
- Tier 2, how-to: step-by-step procedure within one topic. Example: "How do I enable server-side encryption?"
- Tier 3, comparison: choosing between two options or configurations. Example: "Compare hot storage vs cold archive for compliance."
- Tier 4, integration: cross-topic dependencies requiring 3–4 connected components. Example: "What permissions does service A need for B and C?"
- Tier 5, architecture: full system design spanning 3–5+ topics. Example: "How does a request flow from ingress through processing to storage?"