Benchmark results
Accuracy by tier, latency, and cost across all 7 retrieval strategies.
| Strategy | Tier 1: Lookup | Tier 2: How-To | Tier 3: Comparison | Tier 4: Integration | Tier 5: Architecture | Avg % ↓ | Latency (ms) | Cost/Q (USD) |
|---|---|---|---|---|---|---|---|---|
| QnA Pairs | 79% | 85% | 83% | 84% | 66% | 79.4% | 9,043 | $0.48 |
| Knowledge Graph | 72% | 69% | 61% | 77% | 79% | 71.6% | 20,322 | $1.37 |
| Hybrid | 39% | 81% | 61% | 80% | 62% | 64.6% | 41,549 | $3.02 |
| RAPTOR | 30% | 16% | 15% | 36% | 30% | 25.4% | 7,240 | $0.69 |
| Naive Vector | 27% | 15% | 14% | 26% | 22% | 20.8% | 6,421 | $0.33 |
| Contextual Vector | 25% | 11% | 9% | 26% | 11% | 16.4% | 5,114 | $0.29 |
| PageIndex | 19% | 12% | 7% | 21% | 12% | 14.2% | 10,933 | $0.29 |

Results from your benchmark runs. Run `kb-arena benchmark` to regenerate with different corpora or strategies.

Strategies:

- QnA Pairs: an LLM pre-generates question-answer pairs from each doc page at index time. Direct question-to-answer matching, but misses novel cross-topic questions.
- Knowledge Graph: extracts entities and relationships into Neo4j and queries via Cypher templates. Excels at multi-hop dependency chains.
- Hybrid: routes by intent. Vector path for lookups, graph path for integration queries, both paths fused via RRF (reciprocal rank fusion, sketched below) for how-to questions.
- RAPTOR: builds a recursive tree of LLM cluster summaries (L0 chunks → L1 summaries → L2) and queries all levels simultaneously, targeting Tier 4/5 multi-hop questions.
- Naive Vector: chunks documents, embeds with text-embedding-3-large, retrieves top-k by cosine similarity (sketched below). Fast and simple, but no cross-topic understanding.
- Contextual Vector: same as Naive Vector, but prepends parent topic context to each chunk before embedding. Better at disambiguating domain-specific terms.
- PageIndex: vectorless, reasoning-based retrieval. Builds a hierarchical tree index from document structure, then uses LLM reasoning to traverse the tree, with no embeddings and no chunking. Excels on well-structured docs.
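For reference, the Naive Vector baseline is little more than embed-and-rank. A minimal sketch, assuming the OpenAI embeddings client and an already-chunked corpus; the actual kb-arena chunker and top-k settings may differ:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts with text-embedding-3-large (the model named in the table)."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

def naive_vector_search(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Retrieve the top-k chunks by cosine similarity to the query."""
    chunk_vecs = embed(chunks)        # (n, dim); done once at index time in practice
    q = embed([query])[0]             # (dim,)
    # Cosine similarity is the dot product of L2-normalized vectors.
    chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    q /= np.linalg.norm(q)
    sims = chunk_vecs @ q
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]
```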
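The Hybrid strategy's RRF step fuses the vector and graph result lists by rank alone, so the two retrievers' raw scores never need to be comparable. A minimal sketch of standard reciprocal rank fusion; the constant `k = 60` is the conventional default, not a confirmed kb-arena setting:

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a vector hit list with a graph hit list.
fused = rrf_fuse([["doc_a", "doc_b", "doc_c"], ["doc_b", "doc_d"]])
```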
Methodology
Each question is sent to all 7 strategies. Answers are evaluated through a 4-pass pipeline: structural checks (must_mention / must_not_claim), entity coverage against source documentation, source attribution, and LLM-as-judge scoring for accuracy, completeness, and faithfulness.
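To illustrate the first pass, a structural check can be as simple as a substring test. A minimal sketch; the parameter names mirror the must_mention / must_not_claim fields above, but the pipeline's actual matching rules (normalization, fuzziness) are not shown here:

```python
def structural_check(answer: str,
                     must_mention: list[str],
                     must_not_claim: list[str]) -> bool:
    """Pass 1: the answer must contain every required phrase and no forbidden claim."""
    text = answer.lower()
    has_required = all(phrase.lower() in text for phrase in must_mention)
    has_forbidden = any(claim.lower() in text for claim in must_not_claim)
    return has_required and not has_forbidden
```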
Composite ranking: `0.5 * accuracy + 0.3 * reliability + 0.2 * latency_score`. The latency score inverts p95 latency, so faster strategies score higher.
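Concretely, the ranking can be computed as below. A minimal sketch, assuming `latency_score` is p95 latency min-max normalized and inverted across strategies; kb-arena's exact normalization isn't specified here:

```python
def composite_scores(stats: dict[str, dict[str, float]]) -> dict[str, float]:
    """Composite = 0.5 * accuracy + 0.3 * reliability + 0.2 * latency_score."""
    p95s = [s["p95_ms"] for s in stats.values()]
    lo, hi = min(p95s), max(p95s)
    out = {}
    for name, s in stats.items():
        # Invert p95 so the fastest strategy gets latency_score = 1.0.
        latency_score = 1.0 if hi == lo else (hi - s["p95_ms"]) / (hi - lo)
        out[name] = 0.5 * s["accuracy"] + 0.3 * s["reliability"] + 0.2 * latency_score
    return out
```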
Tiers:

- Tier 1, Lookup: single-fact retrieval from one document. Example: 'What is the default timeout?'
- Tier 2, How-To: step-by-step procedure within one topic. Example: 'How do I enable server-side encryption?'
- Tier 3, Comparison: choosing between two options or configurations. Example: 'Compare hot storage vs cold archive for compliance.'
- Tier 4, Integration: cross-topic dependencies requiring 3–4 connected components. Example: 'What permissions does service A need for B and C?'
- Tier 5, Architecture: full system design spanning 3–5+ topics. Example: 'How does a request flow from ingress through processing to storage?'