CAIRN · implementation canvas · 0.1.0a4

Stop dumping documents into AI.
Give it a map instead.

How RAPTOR, BookRAG, and A-RAG each tried to fix retrieval — and how Cairn synthesizes their ideas into a repo-aware, MCP-native documentation graph for software repositories and large structured documents.

STATUS  Alpha · 0.1.0a4
LICENSE  Apache 2.0
RUNTIME  Python 3.11+ · Local-first
INTEROP  MCP stdio · repo or document mode

Problem

Large structured documents break every existing AI workflow.

A 500-page handbook, a multi-thousand-page regulation, an enterprise wiki — these have structure the AI ought to use. Today's tools throw that structure away.

// Dump it all in

Burns tokens. Dilutes attention. Accuracy collapses past ~50k tokens, regardless of advertised context windows. Wallet hurts; answer wrong.

// Naive vector RAG

Splits structured documents into context-free shards. Returns the right paragraph under the wrong premise. Cross-references are lost. Hierarchy is lost.

// GraphRAG-style pipelines

Expensive to build, opaque to debug. Over-engineered for documents that already have explicit structure. The entity graph is born of the same shredding that lost the structure in the first place.

State of the art

Three research strands. Each got part of it right.

Cairn isn't novel science. It's an opinionated synthesis of three lines of retrieval research, packaged as a real, agent-ready product.

Baseline · 2022→present

Naive vector RAG

Split the document into fixed-size chunks, embed each, store in a vector DB. At query time, embed the query and return the top-k nearest chunks.

  • Fast to build, easy to ship — the default in every framework
  • Chunks have no awareness of headings, sections, or document tree
  • Cross-references and entities are pulverized at chunking time
  • "Why was this chunk returned?" is a hard question
Verdict. Works on Q&A over flat knowledge. Falls apart on structured documents where context comes from where a paragraph sits, not just what it says.
[Figure: document.md is cut into uniform chunks c1 to c10, each embedded as a vector and matched against the query. Structure → uniform chunks → vector search → context-free hits]

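The baseline pipeline described above can be sketched in a few lines. This is a toy illustration, not any framework's implementation: the bag-of-words `embed` function stands in for a real embedding model, and the chunk size of 8 words is an arbitrary assumption chosen so the effect is visible on a tiny document.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into fixed-size word windows, ignoring headings and sections."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(w.strip(".,:;?!#").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Embed the query and return the k most similar chunks with their scores."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


doc = (
    "# Retry middleware\n"
    "Timeouts default to 30 seconds and can be raised per route.\n"
    "# Cache middleware\n"
    "Timeouts do not apply here because entries expire by TTL instead."
)
chunks = chunk(doc)
hits = top_k("retry timeouts", chunks)
for similarity, text in hits:
    print(f"{similarity:.2f}  {text}")
```

Note how the second chunk ends up holding the tail of the retry section together with the "# Cache" heading: the chunker has no notion of where one section stops and the next begins.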
ICLR 2024 · arXiv:2401.18059

RAPTOR — recursive summarization tree

Take naive chunks, then recursively cluster + summarize bottom-up. The result is a tree where higher nodes are abstractions of their children. Retrieval can hit any level of the tree.

  • First widely-adopted hierarchical retrieval — +20% on QuALITY
  • Multi-granularity retrieval: thematic queries → high nodes
  • But the tree is built from vector clusters, not the document's own structure
  • Summaries can hallucinate (~4% in the paper's audit)
Verdict. Proved hierarchy works. But the hierarchy it builds is statistical, not authorial — it ignores the TOC the author already wrote.
[Figure: RAPTOR tree, built upward by cluster + summarize. L0 raw chunks → L1 cluster summaries → L2 super-summaries → L3 root summary (global gist). Tree built bottom-up by clustering chunks, not from the document's TOC]

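The bottom-up construction can be sketched as follows. This is a simplified illustration under loud assumptions, not the RAPTOR authors' code: grouping adjacent nodes with a fixed `fan_out` replaces the vector clustering step, and the truncating `summarize` stub replaces the LLM summarizer. All names here are hypothetical.

```python
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int
    children: list["Node"] = field(default_factory=list)


def summarize(texts: list[str], max_words: int = 4) -> str:
    """Stub summarizer: keeps the first few words of each child text."""
    return " / ".join(" ".join(t.split()[:max_words]) for t in texts)


def build_tree(chunks: list[str], fan_out: int = 2) -> Node:
    """Build a summary tree bottom-up; each pass adds one level of abstraction."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2 or the tree never converges")
    layer = [Node(text=c, level=0) for c in chunks]
    while len(layer) > 1:
        parents = []
        for i in range(0, len(layer), fan_out):
            group = layer[i:i + fan_out]  # stand-in for one vector cluster
            summary = summarize([n.text for n in group])
            parents.append(Node(text=summary, level=group[0].level + 1, children=group))
        layer = parents
    return layer[0]


def all_nodes(root: Node) -> Iterator[Node]:
    """Yield every node, so retrieval can match at any level of the tree."""
    yield root
    for child in root.children:
        yield from all_nodes(child)


root = build_tree([
    "Retries use exponential backoff.",
    "Timeouts default to 30 seconds.",
    "The cache evicts entries by TTL.",
    "Cache keys include the route.",
    "Logs are shipped as JSON lines.",
])
nodes = list(all_nodes(root))
print(root.level, len(nodes))
```

Because every node, leaf or summary, is a retrieval candidate, a thematic query can match a high-level summary while a detail query matches a leaf.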
Dec 2025 · arXiv:2512.03413

BookRAG — structure-aware index + entity graph

For books, manuals, handbooks: the author already gave you a TOC. Use it. BookIndex = (T, G, M): a hierarchical Tree of the actual TOC, an entity Graph, and a Mapping from every entity back to where it lives in the tree.

  • Tree T mirrors the document's own headings — pseudo-TOC reconstruction
  • Entity graph G traces concepts across sections
  • Mapping M tells you exactly where each concept lives
  • Agent loop inspired by Information Foraging Theory: scent → patch → forage
  • SOTA on three benchmarks · lower token consumption than baselines
  • No public reference implementation
Verdict. The right vision for structured documents. The paper is 6 months old; the code never landed.
[Figure: BookIndex = Tree (T) + Entity Graph (G) + Mapping (M). Tree T follows the TOC (Book → Ch.1 to Ch.3 → §1.1 to §3.2); graph G links entities e₁ to e₆; mapping M ties entities to tree nodes]

Feb 2026 · arXiv:2602.03442 · MIT licensed

A-RAG — agent with multi-granularity tools

The agent isn't a passive recipient of retrieved chunks. Expose three retrieval tools at different granularities — keyword, semantic, chunk-read — and let a ReAct-style agent decide what to call and when.

  • Truly agentic: autonomous strategy + iterative execution + interleaved tool use
  • 94.5% HotpotQA, 89.7% 2WikiMultiHop with GPT-5-mini
  • Test-time scaling: better with more compute
  • But "hierarchy" here is granularity, not document structure
  • Designed for multi-document corpora, not single large docs
Verdict. The right interface for agents. The wrong index for structured documents.
[Figure: A ReAct agent over a multi-document corpus iterates (reason, act) across three tools: keyword_search (exact match), semantic_search (dense vector), chunk_read (full content). Agent calls retrieval tools iteratively, deciding strategy on the fly]

Synthesis

Cairn — repo graph + document index + MCP tools.

BookRAG's vision of a document-native index, A-RAG's discipline of agentic multi-granularity tools, RAPTOR's idea of multi-level summaries — shipped as a local repository workflow. Cairn discovers docs, builds one structured index per source, then exposes repo search and per-document drilldown over MCP.

01 · REPO INDEX

Documents stay separate

docsgraph sync discovers README files, docs, specs, ADRs, PDFs, and optional MarkItDown sources. Each source gets its own Tree, Summaries, Entities, Cross-references, and Vectors.

02 · TOOLS

Repo first, then drill in

Repo mode adds list_documents and search_documents, and each of the eight document tools accepts an optional doc argument that routes the call to one indexed source. The agent composes; Cairn never speaks for it.

03 · DISCLOSURE

Progressive by default

Outlines return gists. Searches return synopses, short heads, and evidence snippets. Full text is opt-in through get_section, expand, or read_range.
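One way to picture progressive disclosure is a section record that carries several renderings, with the cheapest one returned unless the caller asks for more. The sketch below is hypothetical: the `Section` fields, the `disclose` helper, and the level names are assumptions for illustration and are not Cairn's actual schema.

```python
from dataclasses import dataclass

LEVELS = ("gist", "synopsis", "full")  # cheapest to most expensive


@dataclass
class Section:
    id: str
    gist: str       # one line
    synopsis: str   # short paragraph
    full: str       # complete section text


def disclose(section: Section, to: str = "gist") -> dict[str, str]:
    """Return one rendering of a section; full text must be requested explicitly."""
    if to not in LEVELS:
        raise ValueError(f"unknown level {to!r}; expected one of {LEVELS}")
    return {"id": section.id, "level": to, "text": getattr(section, to)}


def outline(sections: list[Section]) -> list[dict[str, str]]:
    """An outline exposes only gists, so it stays small however long the document is."""
    return [disclose(s) for s in sections]


retry = Section(
    id="middleware/retry",
    gist="Retry policy and timeouts.",
    synopsis="Requests are retried with backoff; each attempt has its own timeout.",
    full="Requests are retried up to three times with exponential backoff. "
         "Each attempt has a 30 second timeout that can be raised per route.",
)
print(outline([retry]))
print(disclose(retry, to="full")["text"])
```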

04 · INTEROP

MCP-native

Works with Claude Code, Cursor, Codex, Goose, and other MCP clients. Fake providers support offline smoke tests; production embedders and summarizers stay pluggable.
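For readers new to MCP: a tool invocation is a JSON-RPC 2.0 request with method `tools/call`, and over the stdio transport each message travels as one line of JSON. The sketch below builds such a request by hand. The tool name and arguments mirror the get_section call in this document's handbook example; that Cairn's tool schema accepts exactly these arguments is an assumption, and in practice an MCP client library would handle this framing for you.

```python
import json
from itertools import count

_request_ids = count(1)


def tool_call(name: str, /, **arguments) -> str:
    """Serialize one MCP tools/call request as a single newline-terminated JSON line."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # json.dumps escapes embedded newlines, so the message stays on one line.
    return json.dumps(request) + "\n"


line = tool_call("get_section", doc="handbook", id="middleware/retry")
print(line, end="")
```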

[Architecture diagram, top to bottom]

EXTERNAL AGENTS  Claude Code · Cursor · Cline · Goose · your agent
  ↓ Model Context Protocol
L4 · MCP Server  stdio · structured envelopes · repo mode routes optional doc
L3 · Repo-Scoped MCP Surface
  • REPO DISCOVERY  list_documents · search_documents · repo_context · repo_graph · repo_impact
  • DOC TOOLS (+ doc)  outline · get_section · expand · read_range · search_semantic · search_keyword · find_mentions · get_related
  • Document tools accept optional doc in repo mode
L2 · DocumentIndex, one per source document
  • T · Tree  hierarchical sections
  • S · Summaries  gist · synopsis · digest
  • E · Entities  terms · code · proper
  • X · Cross-refs  links · textual · entity
  • V · Vectors  dense overlay
L1 · Ingestion  native Markdown/PDF · optional MarkItDown (Office/data/web) → Document AST

Repo search is hybrid

search_documents searches every indexed repo document and returns globally ranked section hits with doc, source, section id, stable anchor, evidence, score breakdown, and explanation. repo_context composes those hits into an agent-ready context pack.

The ranker blends dense vector similarity, field-supported lexical evidence, BM25-style sparse score, document/path identity, and graph propagation from tree, xref, and entity-neighborhood edges. repo_graph and repo_impact expose that docs graph without reimplementing CodeGraph's source-code symbol graph.
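A blended ranker of this kind is often a weighted sum with the per-signal contributions kept for the score breakdown. The sketch below shows that shape only. It is not Cairn's ranker: the weights, the signal names, and the assumption that each signal arrives normalized to [0, 1] are all hypothetical, and graph propagation is reduced to a single precomputed number.

```python
from dataclasses import dataclass

# Illustrative weights only; the actual blend is not specified in this document.
WEIGHTS = {
    "dense": 0.40,     # vector similarity
    "lexical": 0.20,   # field-supported lexical evidence
    "sparse": 0.20,    # BM25-style score
    "identity": 0.10,  # document/path identity match
    "graph": 0.10,     # propagated from tree, xref, and entity neighbours
}


@dataclass
class Candidate:
    doc: str
    section_id: str
    signals: dict[str, float]  # assumed pre-normalized to [0, 1]


def score(candidate: Candidate) -> tuple[float, dict[str, float]]:
    """Return the blended score plus each signal's weighted contribution."""
    breakdown = {
        name: weight * candidate.signals.get(name, 0.0)
        for name, weight in WEIGHTS.items()
    }
    return sum(breakdown.values()), breakdown


def rank(candidates: list[Candidate]) -> list[dict]:
    """Score every candidate and sort globally, keeping the breakdown for explanation."""
    results = []
    for c in candidates:
        total, breakdown = score(c)
        results.append({
            "doc": c.doc,
            "section": c.section_id,
            "score": round(total, 4),
            "breakdown": breakdown,
        })
    return sorted(results, key=lambda r: r["score"], reverse=True)


ranked = rank([
    Candidate("changelog", "v2/notes", {"dense": 0.95}),
    Candidate("handbook", "middleware/retry",
              {"dense": 0.9, "lexical": 1.0, "sparse": 0.5, "identity": 1.0}),
])
for hit in ranked:
    print(hit["doc"], hit["section"], hit["score"])
```

In this toy data the changelog section has the higher dense similarity but still ranks second, because the handbook section is supported by several independent signals.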

Document tools stay explicit

search_semantic is dense vector search inside one document. search_keyword is an exact, case-insensitive lexical scan, not BM25. get_related returns tree and xref neighbors.

That separation is intentional: repo discovery gets the hybrid ranker; follow-up tools stay predictable so an agent can inspect, expand, and cite exactly what it selected.
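To make "exact, case-insensitive lexical scan" concrete, here is a minimal sketch of what such a tool could do: find every literal occurrence, with no ranking model involved. The function signature, the result fields, and the snippet width are assumptions for illustration, not Cairn's implementation.

```python
import re


def search_keyword(sections: dict[str, str], term: str, width: int = 20) -> list[dict]:
    """Scan every section for literal, case-insensitive matches of term."""
    if not term:
        return []
    pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
    hits = []
    for section_id, text in sections.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            hits.append({
                "section": section_id,
                "offset": match.start(),
                "snippet": text[start:end],
            })
    return hits


sections = {
    "middleware/retry": "Timeout handling: each retry has a timeout.",
    "cache": "No TIMEOUT applies to cached entries.",
}
hits = search_keyword(sections, "timeout")
for hit in hits:
    print(hit["section"], hit["offset"], repr(hit["snippet"]))
```

Results come back in document order rather than by score, so the same query always yields the same list, which is the predictability the follow-up tools are meant to offer.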

Side by side: same question, two strategies

Asking a 200-page handbook: "How do I handle timeouts in the retry middleware?"

// Naive vector RAG

01  embed(query) + top-10 chunk search  [k=10]
02  return 10 chunks of ~512 tokens each, most unrelated  [5,120 tok]
03  agent reads them all to figure out which is relevant  [5,120 tok]
04  agent realizes context is missing and needs surrounding sections  [+ retry]
05  second pass: another 10 chunks  [5,120 tok]
06  still no cross-references; answer is guessed  [low trace]

Total tokens consumed ≈ 15,360
Traceability: low

// Cairn (repo-scoped)

01  list_documents(): see indexed docs and freshness  [metadata]
02  search_documents("retry middleware timeouts")  [620 tok]
03  get_section(doc="handbook", id="middleware/retry")  [280 tok]
04  get_related(doc="handbook", kinds=["xref","parent"])  [240 tok]
05  expand(doc="handbook", id="middleware/retry", to="full")  [1,200 tok]
06  cites cairn://handbook/middleware/retry#timeout-handling  [stable anchor]

Total tokens consumed ≈ 2,340
Traceability: high

Flow numbers are illustrative. The benchmark harness ships now; README has starter-suite numbers and reproduction commands.
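The two totals follow directly from the per-step figures in the walkthrough. The short check below just adds them up; the step labels are paraphrased, and the numbers are the same illustrative ones quoted above, not measurements.

```python
# Per-step token costs copied from the two walkthroughs (illustrative, not measured).
naive_steps = {
    "first pass returns 10 chunks": 5_120,
    "agent reads them all": 5_120,
    "second pass returns 10 more": 5_120,
}
cairn_steps = {
    "search_documents": 620,
    "get_section": 280,
    "get_related": 240,
    "expand to full": 1_200,
}

naive_total = sum(naive_steps.values())
cairn_total = sum(cairn_steps.values())
ratio = naive_total / cairn_total

print(f"naive={naive_total:,} cairn={cairn_total:,} ratio={ratio:.1f}x")
# prints: naive=15,360 cairn=2,340 ratio=6.6x
```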

Comparison at a glance

Criterion                      | Naive RAG     | RAPTOR         | BookRAG         | A-RAG         | Cairn
Primary index                  | Vector chunks | Clustered tree | Doc TOC + graph | Vector chunks | Repo catalog + doc TOC + 4 overlays
Respects authorial structure   | no            | partial        | yes             | no            | yes
Multi-granularity              | no            | yes            | yes             | yes           | yes
Agentic tool interface         | no            | no             | yes             | yes           | yes
Progressive disclosure default | no            | partial        | yes             | partial       | implemented
MCP-native                     | no            | no             | no              | no            | yes
Local-first by default         | depends       | no             | no              | depends       | yes
Citations / stable anchors     | no            | weak           | yes             | weak          | mandatory
Public implementation          | many          | yes            | none yet        | yes (MIT)     | this repo