Metadata-Version: 2.4
Name: docsgraph
Version: 0.1.0a3
Summary: Local-first documentation graph for AI agents. CodeGraph for docs, exposed through MCP.
Project-URL: Homepage, https://github.com/jokeuncle/cairn
Project-URL: Documentation, https://github.com/jokeuncle/cairn/tree/main/docs
Project-URL: Repository, https://github.com/jokeuncle/cairn
Project-URL: Issues, https://github.com/jokeuncle/cairn/issues
Project-URL: Changelog, https://github.com/jokeuncle/cairn/blob/main/CHANGELOG.md
Author: The Cairn Authors
License:                                  Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for describing the origin of the Work and
              reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Support. While redistributing the Work or
              Derivative Works thereof, You may choose to offer, and charge a
              fee for, acceptance of support, warranty, indemnity, or other
              liability obligations and/or rights consistent with this License.
              However, in accepting such obligations, You may act only on Your
              own behalf and on Your sole responsibility, not on behalf of any
              other Contributor, and only if You agree to indemnify, defend, and
              hold each Contributor harmless for any liability incurred by, or
              claims asserted against, such Contributor by reason of your
              accepting any such warranty or support.
        
           END OF TERMS AND CONDITIONS
        
           APPENDIX: How to apply the Apache License to your work.
        
              To apply the Apache License to your work, attach the following
              boilerplate notice, with the fields enclosed by brackets "[]"
              replaced with your own identifying information. (Don't include
              the brackets!)  The text should be enclosed in the appropriate
              comment syntax for the file format. We also recommend that a
              file or class name and description of purpose be included on the
              same "printed page" as the copyright notice for easier
              identification within third-party archives.
        
           Copyright 2026 The Cairn Authors
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
License-File: LICENSE
Keywords: agents,ai-agents,documentation,documents,hierarchical,lancedb,markdown,mcp,model-context-protocol,pdf,rag,repository,retrieval,vector-search
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Indexing
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: httpx[socks]>=0.27
Requires-Dist: lancedb>=0.13
Requires-Dist: markdown-it-py>=3.0
Requires-Dist: mcp>=1.2
Requires-Dist: mdit-py-plugins>=0.4
Requires-Dist: numpy>=1.26
Requires-Dist: pyarrow>=15.0
Requires-Dist: pydantic>=2.7
Requires-Dist: pymupdf>=1.24
Requires-Dist: python-slugify>=8.0
Requires-Dist: structlog>=24.1
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Requires-Dist: types-python-slugify; extra == 'dev'
Provides-Extra: markitdown
Requires-Dist: markitdown>=0.1.0; extra == 'markitdown'
Description-Content-Type: text/markdown

# Cairn

> **The DocsGraph for AI agents. CodeGraph helps agents navigate code; Cairn
> helps them navigate docs. Install it as `docsgraph`; `cairn` remains the
> product name and a compatibility alias.**

[![CI](https://github.com/jokeuncle/cairn/actions/workflows/ci.yml/badge.svg)](https://github.com/jokeuncle/cairn/actions/workflows/ci.yml)
[![License](https://img.shields.io/badge/license-Apache_2.0-blue.svg)](LICENSE)
[![Version](https://img.shields.io/badge/version-0.1.0a3-blue.svg)](CHANGELOG.md)
[![Python](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/)
[![MCP](https://img.shields.io/badge/MCP-native-7c3aed.svg)](https://modelcontextprotocol.io/)

![Cairn demo: repository documentation graph and MCP tools](docs/assets/cairn-demo.svg)

Cairn is a **local-first, MCP-native DocsGraph** for software
repositories and large structured documents. It turns README files, specs,
ADRs, docs folders, PDFs, and optional MarkItDown-converted Office/data/web
files into a navigable map: document catalog, hierarchical sections,
multi-granularity summaries, entity mentions, cross-reference edges, and a
semantic vector overlay.

Instead of dumping whole docs into context or relying on anonymous chunks, an
agent can ask Cairn to `list_documents`, `search_documents`, inspect an
`outline`, and drill into exact sections with stable `cairn://` anchors. The
same engine also works for standalone handbooks, papers, and PDFs.

The result: better retrieval accuracy, lower token spend, and a practical MCP
tool layer between your project documentation and every AI coding agent you
use. Local-first. Vendor-neutral. Designed for open-source repos.

> 🚀 **Alpha — `0.1.0a3`.** Markdown + PDF ingest, all eight MCP tools,
> the full structure-aware index (tree + summaries + entities + xrefs +
> vectors), repo-level `init/sync/status`, repo-scoped MCP with
> `list_documents`, `search_documents`, `repo_context`, `repo_graph`, and
> `repo_impact`, failure-isolated sync, static graph inspector, Doubao
> multimodal embeddings, and a benchmark harness with headline numbers. See
> [`CHANGELOG.md`](CHANGELOG.md) for what's in this
> release and [`ROADMAP.md`](ROADMAP.md) for what's next.

---

## Why Cairn?

| Today | With Cairn |
|---|---|
| AI coding agents guess from README snippets or grep. | Agent gets a repo-level documentation map with stable section anchors. |
| Dump the whole document into context. Burns tokens, dilutes attention. | Agent fetches only what it needs, at the granularity it needs. |
| Naive RAG splits structure into context-free chunks. | The document's own structure is the index. |
| Cross-references and entities are lost in chunking. | They are first-class objects. |
| Locked into one vendor's embeddings / vector DB. | Pluggable everything. Local-first defaults. |
| Different tool stacks for Claude / Cursor / Cline / Goose. | One MCP server. Any compliant agent works. |

For the in-depth motivation, see [`PRODUCT.md`](PRODUCT.md).
For the technical design, see [`ARCHITECTURE.md`](ARCHITECTURE.md).
For the public documentation quality contract Cairn optimizes for, see
[`docs/golden-docs-standard.md`](docs/golden-docs-standard.md).

---

## How It Works (90 seconds)

1. **Discover.** `docsgraph init -y` writes `.cairn/config.toml`; `docsgraph sync`
   discovers README, Markdown docs, ADRs, specs, and PDFs from conservative
   repo globs.
2. **Index.** Each document becomes a normal Cairn index: structural tree (T),
   multi-level summaries (S), entity index (E), cross-reference graph (X), and
   vector overlay (V). A bad source file is isolated instead of breaking the
   whole repo sync.
3. **Serve.** `docsgraph serve` exposes repo-scoped MCP tools:
   `list_documents` and `search_documents`, plus `outline`, `get_section`,
   `expand`, `search_semantic`, `search_keyword`, `find_mentions`,
   `get_related`, and `read_range`, each routed to a document by an optional
   `doc` argument.
4. **Navigate.** Your agent searches across the repo, picks a document, drills
   into promising sections, and only fetches full text when justified. Every
   result carries stable anchors for verification.
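
The navigate step above can be pictured with a toy section tree. This is a minimal sketch, not Cairn's implementation: the `Section` class, its field names, and the plain-string anchors are all assumptions made for illustration (real anchors use the `cairn://` scheme, whose format is not shown here).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Section:
    """One node of a structural tree; every field name here is illustrative."""
    anchor: str   # stable identifier for the section
    title: str
    gist: str     # short summary shown in outlines
    text: str     # full body, fetched only on demand
    children: list["Section"] = field(default_factory=list)


def outline(node: Section, depth: int, level: int = 0) -> list[tuple[int, str, str, str]]:
    """Return (level, anchor, title, gist) rows down to `depth`; never full text."""
    rows = [(level, node.anchor, node.title, node.gist)]
    if level < depth:
        for child in node.children:
            rows.extend(outline(child, depth, level + 1))
    return rows


def get_section(node: Section, anchor: str) -> Optional[Section]:
    """Depth-first lookup of a single section by its anchor."""
    if node.anchor == anchor:
        return node
    for child in node.children:
        found = get_section(child, anchor)
        if found is not None:
            return found
    return None


doc = Section("root", "Handbook", "Team handbook overview.", "", [
    Section("install", "Install", "How to install the tool.", "Run the installer, then ...", [
        Section("install/linux", "Linux", "Linux-specific steps.", "On Linux, use ..."),
    ]),
    Section("usage", "Usage", "Day-to-day commands.", "The main commands are ..."),
])

# Step 1: a cheap map of the document (gists only).
for level, anchor, title, gist in outline(doc, depth=1):
    print("  " * level + f"{title} [{anchor}]: {gist}")

# Step 2: fetch full text only for the section that looks promising.
print(get_section(doc, "usage").text)
```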

A visual explainer comparing Cairn's approach to RAPTOR, BookRAG, and A-RAG
lives at [`docs/canvas.html`](docs/canvas.html). Open it in any browser.

---

## Quickstart

The fastest way to see Cairn work is to index this repo's own documentation.
**Zero API keys, zero model downloads** — the `--fake` flag uses deterministic
in-process plugins so the whole thing runs offline.

The PyPI distribution and the primary CLI command are both named `docsgraph`;
the older `cairn` command is installed as a compatibility alias:

```bash
pip install docsgraph
```

Or run it without installing:

```bash
uvx docsgraph --help
```

AI agents that can run shell commands can install and wire Cairn into their own
MCP config. Start with a dry run, then write the config once the target path
looks right:

```bash
uvx docsgraph init -y
uvx docsgraph sync --fake
uvx docsgraph install --client codex --dry-run --fake
uvx docsgraph install --client codex --yes --fake
```

Use `--client claude`, `--client cursor`, or `--client goose` for other MCP
clients. `docsgraph install` writes the same server config that
`docsgraph mcp config` prints, with `command = "docsgraph"` and
`args = ["serve", "--repo", "..."]`.
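
As a rough illustration of that server entry, the sketch below builds the `command`/`args` pair for a given repository path. The helper name `server_entry` and the example path are hypothetical, and the surrounding file layout that each client expects is not shown here; prefer the output of `docsgraph mcp config` for the real snippet.

```python
import json


def server_entry(repo_path: str) -> dict:
    """Build the command/args pair described above for a given repo path."""
    return {"command": "docsgraph", "args": ["serve", "--repo", repo_path]}


# The path is a placeholder; use the absolute path of your own repository.
entry = server_entry("/path/to/your/repo")
print(json.dumps(entry, indent=2))
```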

### Repository Workflow

Inside any repository:

```bash
docsgraph init -y
docsgraph sync --fake
docsgraph status
docsgraph query repo "where are docs indexed?" --fake
docsgraph doctor
docsgraph mcp config --client claude --fake
docsgraph serve --fake
```

`docsgraph doctor` checks repo config, index freshness, primary-doc routing,
and model settings. `docsgraph mcp config` prints copy-pasteable stdio snippets for
Claude, Cursor, Codex, and Goose:

```bash
docsgraph mcp config --client claude
docsgraph mcp config --client cursor
docsgraph mcp config --client codex
docsgraph mcp config --client goose
```

For local development from source:

```bash
git clone https://github.com/jokeuncle/cairn.git
cd cairn

python3.11 -m venv .venv
.venv/bin/pip install -e ".[dev]"

# 1. Create .cairn/config.toml with conservative documentation globs.
.venv/bin/docsgraph init -y

# 2. Index README, Markdown docs, and PDFs.
.venv/bin/docsgraph sync --fake

# 3. Inspect freshness and indexed document ids.
.venv/bin/docsgraph status

# 4. Search across all indexed repository docs.
.venv/bin/docsgraph query repo "where are docs indexed?" --fake

# 5. Start the repo-scoped MCP stdio server for Claude Code / Cursor / Cline / Goose.
.venv/bin/docsgraph serve --fake
```

Repo mode writes a shareable config plus ignored runtime data:

```text
.cairn/
  config.toml       # commit this if you want a stable repo docs policy
  manifest.json     # generated
  documents/        # generated per-document Cairn indexes
    readme/
    architecture/
    docs-specs-mcp-tools/
```
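The directory names under `documents/` look like slugs of repo-relative source paths. The sketch below shows one way such ids could be derived. The source paths and the slug rule are assumptions inferred from the listing above, not Cairn's documented behavior.

```python
import re
from pathlib import PurePosixPath


def doc_id(source_path: str) -> str:
    """Slugify a repo-relative path: drop the extension, lowercase, join with '-'."""
    stem = PurePosixPath(source_path).with_suffix("").as_posix()
    return re.sub(r"[^a-z0-9]+", "-", stem.lower()).strip("-")


# Hypothetical source paths chosen to match the ids in the listing above.
for path in ["README.md", "ARCHITECTURE.md", "docs/specs/mcp-tools.md"]:
    print(f"{path} -> .cairn/documents/{doc_id(path)}/")
```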

Repo-scoped MCP adds:

| Tool | Use it for |
|---|---|
| `list_documents` | See every indexed doc, its source path, freshness, and section count. |
| `search_documents` | Search across all indexed docs and get globally ranked, explainable section hits with `doc` ids, skipped docs, and stale-doc warnings. |
| `repo_context` | Get a ready-to-read context pack: ranked hits, selected section text, hit explanations, and a relationship map. |
| `repo_graph` | Inspect the repo documentation graph: document, section, entity, contains, xref, and mention edges. Cross-document links are exposed through shared entity nodes. |
| `repo_impact` | Estimate documentation surfaces affected by a document or section change. |
| normal Cairn tools + `doc` | Drill into a chosen document with `outline`, `get_section`, `search_semantic`, `get_related`, etc. |
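
To make "cross-document links are exposed through shared entity nodes" concrete, here is a small sketch over a hand-written edge list. The `type:name` node ids, the edge tuples, and the helper name are invented for this example; the actual `repo_graph` payload shape is not shown here.

```python
from collections import defaultdict

# Edges as (source, kind, target). The "type:name" ids are a convention
# invented for this sketch.
edges = [
    ("doc:readme", "contains", "sec:readme#install"),
    ("doc:architecture", "contains", "sec:architecture#storage"),
    ("doc:roadmap", "contains", "sec:roadmap#phases"),
    ("sec:readme#install", "mention", "entity:LanceDB"),
    ("sec:architecture#storage", "mention", "entity:LanceDB"),
    ("sec:roadmap#phases", "mention", "entity:VSCode"),
    ("sec:readme#install", "xref", "sec:architecture#storage"),
]


def docs_sharing_entities(edges, doc):
    """Return {other_doc: shared entities} reachable via shared entity nodes."""
    section_doc = {dst: src for src, kind, dst in edges if kind == "contains"}
    entity_docs = defaultdict(set)
    for src, kind, dst in edges:
        if kind == "mention":
            entity_docs[dst].add(section_doc[src])
    linked = defaultdict(set)
    for entity, docs in entity_docs.items():
        if doc in docs:
            for other in docs - {doc}:
                linked[other].add(entity)
    return dict(linked)


print(docs_sharing_entities(edges, "doc:readme"))
```

A change to `doc:readme` would therefore point at `doc:architecture` as a candidate for review, which is the kind of question `repo_impact` is described as answering.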

Repo behavior is intentionally configurable in `.cairn/config.toml`:

| Setting | Default | Impact |
|---|---|---|
| `include` | README, top-level Markdown/PDF, `docs/**` Markdown/PDF, one-level nested README | Expands or narrows what Cairn treats as repository documentation. Broader globs improve coverage but can index noisy generated files. |
| `exclude` | `.git`, `.cairn`, `.codegraph`, caches, virtualenvs, build output, `node_modules` | Keeps generated or tool-owned docs out of search. Simple `name/**` directory excludes match at any depth, so `frontend/node_modules/...` and `apps/web/dist/...` are skipped. Add project-specific generated doc folders here. |
| `enable_markitdown` | `false` | Enables non-Markdown/PDF conversion when the `markitdown` extra is installed. Useful for DOCX/PPTX/XLSX/HTML-heavy repos, but slower and less deterministic than native Markdown/PDF parsing. |
| `primary_doc` | `readme` | Chooses the default document for normal tools when `doc` is omitted in repo mode. |
| `search_sections_per_doc` | `1` | Default diversity for `search_documents`. `1` helps agents find the right doc first; raise it when a repo has a few long docs and you want deeper hits from each doc by default. |
| `preferred_locales` | `[]` | Optional locale preference for repo search, for example `["en"]` or `["zh"]`. When omitted, English queries prefer English or locale-neutral docs without hiding other languages. |
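
The "match at any depth" rule for simple `name/**` excludes can be sketched as follows. This is an illustration of the rule as described in the table, not Cairn's matcher: the function name is hypothetical, and patterns that are not of the simple `name/**` form are ignored here.

```python
from pathlib import PurePosixPath


def is_excluded(path: str, patterns: list[str]) -> bool:
    """True when a simple `name/**` pattern names any directory on the path."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name
    for pattern in patterns:
        # Only handle the simple form: a single directory name followed by /**.
        if pattern.endswith("/**") and "/" not in pattern[:-3]:
            if pattern[:-3] in directories:
                return True
    return False


patterns = ["node_modules/**", "dist/**", ".git/**"]
for path in [
    "frontend/node_modules/pkg/README.md",
    "apps/web/dist/index.md",
    "docs/guide.md",
]:
    print(path, "->", "skipped" if is_excluded(path, patterns) else "indexed")
```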

MarkItDown integration is local-file only and optional. Cairn uses it as a
conversion layer, then feeds the generated Markdown into the same canonical
Markdown parser. This expands coverage to formats such as DOCX, PPTX, XLSX,
HTML, CSV, JSON, XML, and EPUB without making the base install heavy:

```bash
pip install "docsgraph[markitdown]"
docsgraph init -y --force --markitdown
docsgraph sync --fake
```

Generate a standalone graph inspector for the primary repo doc:

```bash
docsgraph inspect --out /tmp/cairn-repo-inspector.html
```

### Single Document Workflow

Cairn still works as a focused index for one large document:

```bash
# Index Cairn's own architecture document.
.venv/bin/docsgraph index ARCHITECTURE.md --out /tmp/cairn-arch --fake

# Get the map — gists only, never full text.
.venv/bin/docsgraph outline /tmp/cairn-arch --depth 2

# Keyword search: every section that mentions "LanceDB".
.venv/bin/docsgraph query keyword /tmp/cairn-arch LanceDB

# Multi-term keyword search with mode=all.
.venv/bin/docsgraph query keyword /tmp/cairn-arch progressive disclosure --mode all

# Generate a standalone graph inspector for the built index.
.venv/bin/docsgraph inspect /tmp/cairn-arch --out /tmp/cairn-arch/inspector.html

# Start a single-document MCP stdio server.
.venv/bin/docsgraph serve /tmp/cairn-arch --fake
```
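The difference that `--mode all` makes for multi-term keyword queries can be sketched like this. The function, the sample sections, and the assumption that the alternative mode is "any" are illustrative only; Cairn's own keyword search may tokenize and rank differently.

```python
import re


def keyword_search(sections: dict[str, str], terms: list[str], mode: str = "any") -> list[str]:
    """Return ids of sections whose text contains any/all terms (case-insensitive)."""
    match = all if mode == "all" else any
    hits = []
    for section_id, text in sections.items():
        words = set(re.findall(r"\w+", text.lower()))
        if match(term.lower() in words for term in terms):
            hits.append(section_id)
    return hits


# Made-up sections for the example.
sections = {
    "overview": "Cairn favors progressive disclosure over full dumps.",
    "storage": "Vectors live in LanceDB.",
    "tools": "Each tool supports disclosure of one section at a time.",
}

print(keyword_search(sections, ["progressive", "disclosure"], mode="all"))
print(keyword_search(sections, ["progressive", "disclosure"], mode="any"))
```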

A walkthrough with full output and an MCP-client config snippet is in
[`examples/hero-demo.md`](examples/hero-demo.md).

### Benchmarks

Cairn ships with `cairn-bench`, a small framework that compares Cairn against
a naive 512-word-chunk vector-RAG baseline (both backed by LanceDB and the
same embedder, so the comparison is apples-to-apples).

Running the starter suite (10 hand-curated questions over Cairn's own
`ARCHITECTURE.md`) with deterministic in-process plugins:

```bash
docsgraph bench benchmarks/architecture.toml --fake
```

| metric | naive vector RAG | Cairn |
|---|---:|---:|
| mean recall@8 | 25% | 25% |
| mean tokens returned | 3,670 | **1,388 (37.8% of naive)** |

Caveat — these numbers come from the deterministic `FakeEmbedder` (a
bag-of-words hash with no semantic understanding). Recall ties because
neither system has semantics; **the 2.6× token efficiency win is independent
of the embedder**: it comes from progressive disclosure and section-aware
retrieval, not from vector quality. Cairn now returns a short `evidence`
snippet with every semantic hit by default, which raises the token count but
makes ranking errors easier to inspect. Reproduce these numbers in under a
second on any machine — and re-run with Ollama (`nomic-embed-text`) or
Doubao for the real-semantics version. See
[`benchmarks/README.md`](benchmarks/README.md) for caveats and how to author
your own suites.
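
The two headline figures follow directly from the table. The sketch below reproduces the arithmetic and adds a conventional recall@k helper. That helper uses the standard definition and is an assumption about how `cairn-bench` scores a question, not a copy of its code.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 8) -> float:
    """Fraction of the relevant sections that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


naive_tokens, cairn_tokens = 3670, 1388  # mean tokens returned, from the table

share = cairn_tokens / naive_tokens   # Cairn's share of the naive token count
factor = naive_tokens / cairn_tokens  # how many times fewer tokens Cairn returns

print(f"{share:.1%} of naive, {factor:.1f}x fewer tokens")
# prints: 37.8% of naive, 2.6x fewer tokens
```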

Repo-level smoke tests are also public and reproducible:

```bash
python scripts/eval_repos.py --repo all --refresh --strict
python scripts/smoke_many_repos.py --limit 37 --strict
```

The labeled eval set covers `astral-sh/uv`, `pydantic/pydantic-ai`,
`modelcontextprotocol/python-sdk`, and `fastapi/full-stack-fastapi-template`.
The broad smoke matrix currently spans 37 public repositories across Python,
JavaScript/TypeScript, Rust, and Go ecosystems. It is not an accuracy
leaderboard; it verifies clone/discovery/sync/search/drilldown robustness and
latency across different documentation shapes.

Latest fake-plugin results:

| suite | result |
|---|---|
| `pydantic-ai` labeled eval | 178/178 docs indexed, 8/8 top1, 8/8 top5, 8/8 drilldown |
| `uv` labeled eval | 89/89 docs indexed, 15/16 top1, 16/16 top3/top5, 16/16 drilldown |
| `mcp-python-sdk` labeled eval | 17/17 docs indexed, 4/4 top1, 4/4 drilldown |
| `fastapi-template` labeled eval | 7/7 docs indexed, 4/4 top1, 4/4 drilldown |
| 37-repo smoke matrix | 2931 docs indexed, 0 sync failures, 185/185 searches with hits, 185/185 drilldowns |

`search_documents` uses a general hybrid ranker: dense vector similarity,
BM25-style sparse evidence, structure-aware field support, weighted query-term
coverage, path/title identity prior, and local graph-neighborhood propagation.

Repo search builds a process-local cache and scores dense vectors in batches so
large documentation sets stay warm-query friendly. On large section sets it
uses a two-stage path: dense seeds, cheap lexical/path seeds, and graph
neighbors form a wide shortlist, then the full BM25/graph/explanation ranker
scores only that candidate set. Cold cache construction loads per-document
indexes concurrently while preserving per-document failure isolation.

Search responses expose `ranker.mode`, `total_sections`, and `scored_sections`
so the performance path is visible to clients and benchmarks. Each hit includes
a score breakdown and short explanation so agents and humans can see whether
dense, lexical, sparse, or graph evidence dominated the result.

Changelog, release-note, and migration-history documents are intent-gated: they
stay first-class results for release/version/change queries, but broad topic
queries prefer guides, API docs, and README-style docs when comparable evidence
exists.

Search candidates are freshness-aware: repo status records a file-level
fingerprint, and query responses expose `stale_documents` when source files have
changed since the last sync.
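
A file-level freshness check of this kind could look roughly like the sketch below. The use of SHA-256, the function names, and the shape of the recorded mapping are assumptions for illustration; only the `stale_documents` name comes from the text above.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Content hash of a source file (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stale_documents(recorded: dict[str, str], root: Path) -> list[str]:
    """Return source paths whose current fingerprint differs from the recorded one."""
    stale = []
    for rel_path, old in recorded.items():
        source = root / rel_path
        if not source.exists() or fingerprint(source) != old:
            stale.append(rel_path)
    return sorted(stale)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "README.md").write_text("# Hello\n")
    (root / "GUIDE.md").write_text("# Guide\n")

    # "Sync": record one fingerprint per source file.
    recorded = {name: fingerprint(root / name) for name in ["README.md", "GUIDE.md"]}

    # Edit one file after the sync.
    (root / "GUIDE.md").write_text("# Guide\n\nNew section.\n")

    changed = stale_documents(recorded, root)
    print(changed)  # ['GUIDE.md']
```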

`repo_context` composes search, section content, and local relationships into
one agent-ready payload; `repo_graph` and `repo_impact` expose the documentation
graph without reimplementing source-code analysis. Pair Cairn with CodeGraph
when you need AST symbols, callers/callees, or code impact.

The ranker does not special-case repository names, document ids, or benchmark
answers.
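
To show how a hybrid score with a visible breakdown and a per-document cap (the `search_sections_per_doc` setting) might fit together, here is a deliberately simplified sketch. The linear weighted sum, the three signals, the weights, and all names are assumptions made for this example; the real ranker combines more signals and its formula is not given here.

```python
from collections import defaultdict

# Weights are invented for this sketch; they are not Cairn's tuned values.
WEIGHTS = {"dense": 0.5, "sparse": 0.3, "graph": 0.2}


def score_hit(signals: dict[str, float]) -> dict:
    """Combine per-signal evidence into one score and keep the breakdown."""
    breakdown = {name: WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS}
    dominant = max(breakdown, key=breakdown.get)
    return {
        "score": round(sum(breakdown.values()), 4),
        "breakdown": breakdown,
        "explanation": f"{dominant} evidence dominated",
    }


def rank(candidates: list[dict], sections_per_doc: int = 1) -> list[dict]:
    """Rank globally, then keep at most `sections_per_doc` hits per document."""
    scored = [{**c, **score_hit(c["signals"])} for c in candidates]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    kept, per_doc = [], defaultdict(int)
    for hit in scored:
        if per_doc[hit["doc"]] < sections_per_doc:
            per_doc[hit["doc"]] += 1
            kept.append(hit)
    return kept


# Made-up candidates with made-up signal values in [0, 1].
candidates = [
    {"doc": "readme", "section": "install", "signals": {"dense": 0.9, "sparse": 0.2}},
    {"doc": "readme", "section": "usage", "signals": {"dense": 0.6, "sparse": 0.6}},
    {"doc": "architecture", "section": "storage", "signals": {"sparse": 0.9, "graph": 0.5}},
]
for hit in rank(candidates):
    print(hit["doc"], hit["section"], hit["score"], hit["explanation"])
```

With the default cap of one section per document, the second `readme` hit is dropped even though it outscores the `architecture` hit, which is the diversity trade-off the setting describes.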

### Real LLM + real embeddings

The `--fake` plugins are great for offline reproducibility, but they have no
semantic understanding. For production indexing, point Cairn at any
OpenAI-compatible endpoint. The defaults target a **local Ollama** instance, so
you keep the local-first promise without paying for API tokens:

```bash
ollama serve
ollama pull llama3.2:3b
ollama pull nomic-embed-text

.venv/bin/docsgraph index ARCHITECTURE.md --out /tmp/cairn-arch   # no --fake
```

OpenAI, vLLM, Together, Anyscale, and other OpenAI-compatible endpoints all work
the same way; override the `CAIRN_LLM_*` and `CAIRN_EMBED_*` environment variables.

For Doubao's vision embedding model, use the dedicated provider because the
model is served through Volcengine's `/embeddings/multimodal` endpoint:

```bash
export CAIRN_LLM_BASE_URL=https://ark.cn-beijing.volces.com/api/v3
export CAIRN_LLM_MODEL=doubao-seed-2-0-code-preview-260215
export CAIRN_LLM_API_KEY=...

export CAIRN_EMBED_PROVIDER=doubao-vision
export CAIRN_EMBED_MODEL=doubao-embedding-vision-251215
export CAIRN_EMBED_API_KEY=...

docsgraph index ARCHITECTURE.md --out /tmp/cairn-arch
```

To run the public-repo eval with the real provider configured by your
environment instead of the deterministic fake plugins:

```bash
python scripts/eval_repos.py --repo pydantic-ai \
  --provider env \
  --workdir /tmp/cairn-repo-eval-real \
  --refresh
```

The eval report includes provider mode, model names, and vector dimension, but
never prints API keys. Cairn also invalidates old indexes when the summarizer,
embedder, vector dimension, entity extractor, or xref extractor changes, so
switching from `--fake` to Doubao rebuilds the affected documents instead of
quietly reusing incompatible vectors.
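
One common way to implement this kind of invalidation is to store a signature of the build configuration next to each index and rebuild when it changes. The sketch below assumes that approach. The function names, the component identifiers, and the use of a SHA-256 hash are all hypothetical and may not match what Cairn actually records.

```python
import hashlib
import json


def index_signature(summarizer: str, embedder: str, vector_dim: int,
                    entity_extractor: str, xref_extractor: str) -> str:
    """Hash the components whose change should invalidate a built index."""
    payload = json.dumps(
        {
            "summarizer": summarizer,
            "embedder": embedder,
            "vector_dim": vector_dim,
            "entity_extractor": entity_extractor,
            "xref_extractor": xref_extractor,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_rebuild(stored_signature: str, current_signature: str) -> bool:
    """An index is reusable only when its stored signature still matches."""
    return stored_signature != current_signature


# Component names and dimensions are placeholders, not Cairn's identifiers.
fake = index_signature("fake-summarizer", "fake-embedder", 64, "rules-v1", "rules-v1")
real = index_signature("llm-summarizer", "hosted-embedder", 1024, "rules-v1", "rules-v1")

print(needs_rebuild(fake, fake))  # False
print(needs_rebuild(fake, real))  # True
```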

Useful operational knobs when running against hosted APIs:

| variable | default | purpose |
|---|---:|---|
| `CAIRN_LLM_TIMEOUT` | `60` | per-request summary timeout in seconds |
| `CAIRN_LLM_MAX_RETRIES` | `2` | retries for 429/5xx and transport errors |
| `CAIRN_EMBED_TIMEOUT` | `60` | per-request embedding timeout in seconds |
| `CAIRN_EMBED_MAX_RETRIES` | `2` | retries for embedding 429/5xx and transport errors |
| `CAIRN_SUMMARY_CONCURRENCY` | `4` | concurrent summary calls during indexing and benchmarks |
| `CAIRN_EMBED_BATCH_SIZE` | `32` | sections/chunks per embedding batch |
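
For orientation, the sketch below reads these variables with the defaults from the table and applies two of them: the retry rule for 429/5xx responses and the embedding batch size. The `Knobs` class and the helper functions are hypothetical; how Cairn parses and uses the variables internally is not shown here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Knobs:
    llm_timeout: float
    llm_max_retries: int
    embed_timeout: float
    embed_max_retries: int
    summary_concurrency: int
    embed_batch_size: int


def load_knobs(env=os.environ) -> Knobs:
    """Read the variables from the table, falling back to the documented defaults."""
    return Knobs(
        llm_timeout=float(env.get("CAIRN_LLM_TIMEOUT", "60")),
        llm_max_retries=int(env.get("CAIRN_LLM_MAX_RETRIES", "2")),
        embed_timeout=float(env.get("CAIRN_EMBED_TIMEOUT", "60")),
        embed_max_retries=int(env.get("CAIRN_EMBED_MAX_RETRIES", "2")),
        summary_concurrency=int(env.get("CAIRN_SUMMARY_CONCURRENCY", "4")),
        embed_batch_size=int(env.get("CAIRN_EMBED_BATCH_SIZE", "32")),
    )


def should_retry(status_code: int) -> bool:
    """Retry on rate limiting (429) and server errors (5xx)."""
    return status_code == 429 or 500 <= status_code <= 599


def batches(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


knobs = load_knobs({})  # empty mapping, so every value is the documented default
sections = [f"section-{i}" for i in range(70)]
print([len(b) for b in batches(sections, knobs.embed_batch_size)])  # [32, 32, 6]
```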

---

## Inspiration and Lineage

Cairn synthesizes three strands of recent research and ships them as a real,
agent-ready tool:

- **[BookRAG](https://arxiv.org/abs/2512.03413)** (Dec 2025): structure-aware
  index combining a hierarchical tree with an entity graph, queried via an
  Information-Foraging-Theory-inspired agent. Cairn implements this vision in
  production-grade form.
- **[A-RAG](https://arxiv.org/abs/2602.03442)** (Feb 2026): clean agent loop
  with hierarchical retrieval tools (keyword/semantic/chunk). Cairn borrows the
  agent-tool philosophy and replaces A-RAG's chunk-based index with a
  structure-first one.
- **[RAPTOR](https://arxiv.org/abs/2401.18059)** (ICLR 2024): the seminal
  recursive-summarization tree. Cairn's summary layer takes inspiration from it
  while anchoring summaries to the document's own structure instead of
  clustered chunks.

We are deeply grateful to these authors; see the ADRs in
[`docs/decisions/`](docs/decisions/) for the specific design choices we adopted,
modified, or declined.

---

## Status & Roadmap

| Phase | Status | What |
|---|---|---|
| 0 — Foundation | ☑ | Authoritative docs in place (PRODUCT, ARCHITECTURE, CLAUDE, ROADMAP, ADR-0001) |
| 1 — v0.1 walking skeleton | ☑ | Markdown ingest, Tree + Summaries + Vectors indexes, 5 MCP tools, stdio server, CLI, hero demo |
| 2 — v0.2 structure-aware retrieval | ☑ | Entities, cross-references, PDF ingest, digest summaries, incremental rebuild, static inspector, `cairn-bench` |
| 3 — v0.3 repo docs graph | ◐ | Repo `init/sync/status`, repo-scoped MCP, `list_documents`, `search_documents`, `repo_context`, `repo_graph`, `repo_impact`, shareable `.cairn/config.toml`; hosted inspector and telemetry still next |
| 4 — v0.4 polish for production | ☐ | DOCX/RTF/EPUB, VSCode extension, security review |
| v1.0 GA | ☐ | All `PRODUCT.md` §7 success criteria met |

Full plan: [`ROADMAP.md`](ROADMAP.md). Current test suite: **440 passing**,
mypy strict clean, ruff clean.

Maintainer release gate: [`docs/release-checklist.md`](docs/release-checklist.md).

---

## Contributing

Cairn is opinionated. Before opening a PR, please read:

1. [`PRODUCT.md`](PRODUCT.md) — especially the non-goals.
2. [`ARCHITECTURE.md`](ARCHITECTURE.md) — the end-state design we're building toward.
3. [`CONTRIBUTING.md`](CONTRIBUTING.md) — workflow and PR expectations.
4. [`docs/decisions/`](docs/decisions/) — existing ADRs.

If you're an AI agent helping a contributor, you'll find your session anchor in
[`CLAUDE.md`](CLAUDE.md).

---

## License

Apache 2.0. See [`LICENSE`](LICENSE).

---

*A cairn is a small stack of stones marking a trail through difficult terrain.
This project is one for AI agents lost in large documents.*
