Metadata-Version: 2.4
Name: doculayer
Version: 0.1.1
Summary: Live documentation access layer for AI agents — no hallucination, no stale docs
Project-URL: Homepage, https://github.com/inamdarmihir/doculayer
Project-URL: Documentation, https://inamdarmihir.github.io/doculayer/
Project-URL: Repository, https://github.com/inamdarmihir/doculayer.git
License: Apache-2.0
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: click>=8.0
Requires-Dist: httpx>=0.27
Requires-Dist: lxml>=5.0
Requires-Dist: markdownify>=0.13
Requires-Dist: mcp>=1.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: rank-bm25>=0.2.2
Provides-Extra: dev
Requires-Dist: mypy>=1.9; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-httpx>=0.30; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Description-Content-Type: text/markdown

<div align="center">

```
  ██████╗  ██████╗  ██████╗██╗   ██╗██╗      █████╗ ██╗   ██╗███████╗██████╗
  ██╔══██╗██╔═══██╗██╔════╝██║   ██║██║     ██╔══██╗╚██╗ ██╔╝██╔════╝██╔══██╗
  ██║  ██║██║   ██║██║     ██║   ██║██║     ███████║ ╚████╔╝ █████╗  ██████╔╝
  ██║  ██║██║   ██║██║     ██║   ██║██║     ██╔══██║  ╚██╔╝  ██╔══╝  ██╔══██╗
  ██████╔╝╚██████╔╝╚██████╗╚██████╔╝███████╗██║  ██║   ██║   ███████╗██║  ██║
  ╚═════╝  ╚═════╝  ╚═════╝ ╚═════╝ ╚══════╝╚═╝  ╚═╝   ╚═╝   ╚══════╝╚═╝  ╚═╝
```

**The live documentation layer for AI agents**

[![PyPI version](https://img.shields.io/pypi/v/doculayer?color=0a7bca&label=PyPI&logo=pypi&logoColor=white)](https://pypi.org/project/doculayer/)
[![Python](https://img.shields.io/pypi/pyversions/doculayer?logo=python&logoColor=white&label=Python)](https://pypi.org/project/doculayer/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue?logo=apache&logoColor=white)](LICENSE)
[![MCP Compatible](https://img.shields.io/badge/MCP-Compatible-6f42c1?logo=claude&logoColor=white)](https://modelcontextprotocol.io)
[![Downloads](https://img.shields.io/pypi/dm/doculayer?color=2ea44f&label=Downloads&logo=pypi&logoColor=white)](https://pypi.org/project/doculayer/)

[![Claude Code](https://img.shields.io/badge/Claude_Code-Supported-D97706?logo=anthropic&logoColor=white)](#-quick-install-30-seconds)
[![Cursor](https://img.shields.io/badge/Cursor-Supported-black?logo=cursor&logoColor=white)](#-quick-install-30-seconds)
[![VS Code](https://img.shields.io/badge/VS_Code-Supported-007ACC?logo=visualstudiocode&logoColor=white)](#-quick-install-30-seconds)
[![Windsurf](https://img.shields.io/badge/Windsurf-Supported-00B4AB?logo=codeium&logoColor=white)](#-quick-install-30-seconds)
[![Zed](https://img.shields.io/badge/Zed-Supported-084CCF?logo=zed&logoColor=white)](#-quick-install-30-seconds)

</div>

---

AI agents hallucinate APIs. Not because they're broken — because their training data is stale.  
A function signature that changed six months ago, a parameter that was renamed, a new method that didn't exist at training time: in each case the model confidently describes the API as it used to be.

**DocuLayer fixes this.** It sits between your agent and the real documentation, fetching live content on demand, returning verbatim text with full source attribution, and never generating a single word.

```
 Your agent  (Claude, Cursor, Codex, any MCP client…)
     │   "what parameters does httpx.AsyncClient.get() take?"
     ▼
 ┌──────────────────────────────────────────────────────────────┐
 │  DocuLayer                                                   │
 │  ──────────────────────────────────────────────────────────  │
 │  resolve_identifier  →  shortcut table / PyPI / npm         │
 │                                                              │
 │  discover_llms_txt   →  targeted page index (when present)  │
 │       keyword score entries  →  fetch only relevant pages   │
 │                                                              │
 │  DocParser  (HTML → Markdown, heading split)                │
 │  DocSearcher  (BM25 — no ML, no embeddings, no network)     │
 │                                                              │
 │  TTLCache  (in-memory only — zero disk writes)              │
 └──────────────────────────────────────────────────────────────┘
     │   verbatim section + source URL + "fetched 3s ago"
     ▼
 LLM  —  reads real docs, answers correctly
```

---

## ⚡ Quick Install — 30 seconds

### Step 1 — Install the package

```bash
pip install doculayer==0.1.1
```

### Step 2 — Wire up your IDE (auto-detected, zero prompts)

```bash
doculayer setup
```

That's it. The wizard detects your IDE and writes the MCP config automatically — no editor, no JSON, no manual steps.

> **Restart your IDE after setup** and the four DocuLayer tools appear immediately.

---

## 🖥️ IDE-Specific Setup

The `doculayer setup` command handles all of these automatically.  
Manual one-liners are listed below for reference.

### Claude Code

```bash
pip install doculayer==0.1.1 && claude mcp add doculayer -- doculayer mcp
```

Restart Claude Code. Tools are live.

### Cursor

```bash
pip install doculayer==0.1.1 && doculayer setup --ide cursor
```

Or add manually to `~/.cursor/mcp.json`:

```json
{
  "mcpServers": {
    "doculayer": {
      "command": "doculayer",
      "args": ["mcp"]
    }
  }
}
```

### Windsurf

```bash
pip install doculayer==0.1.1 && doculayer setup --ide windsurf
```

Or add manually to `~/.codeium/windsurf/mcp_config.json`, using the same JSON block shown for Cursor above.

### VS Code (with MCP support)

```bash
pip install doculayer==0.1.1 && doculayer setup --ide vscode
```

Or add to your VS Code `settings.json`:

```json
{
  "mcpServers": {
    "doculayer": {
      "command": "doculayer",
      "args": ["mcp"]
    }
  }
}
```

### Zed

```bash
pip install doculayer==0.1.1 && doculayer setup --ide zed
```

Or add to `~/.config/zed/settings.json` (macOS/Linux) or `%APPDATA%\Zed\settings.json` (Windows):

```json
{
  "context_servers": {
    "doculayer": {
      "command": { "path": "doculayer", "args": ["mcp"] }
    }
  }
}
```

### All IDEs at once

```bash
doculayer setup --all-ides
```

---

## 🔧 What it does

- **MCP server** — `doculayer_search`, `doculayer_fetch`, `doculayer_symbol`, `doculayer_sources` for any MCP client
- **Python library** — `await search("query", "fastapi")` / `await fetch("httpx", section="AsyncClient")` inline in any app
- **CLI** — `doculayer search "streaming responses" --source nextjs`
- **llms.txt-first** — uses the [llms.txt](https://llmstxt.org) index when present to fetch only the 1–3 pages most relevant to the query
- **Zero hallucination** — every byte returned is verbatim from the fetched URL; attribution header on every response; never generates text
- **Zero disk storage** — content lives in a process-local TTL cache; no database, no files, no persistence across restarts

---

## 🛠️ MCP Tools

| Tool | What it does |
|---|---|
| `doculayer_search(query, source, max_results=5)` | BM25 search across live doc sections. Returns verbatim content ranked by relevance. |
| `doculayer_fetch(source, section=None)` | Fetch a whole page or a named heading. Use `section=` to target large pages. |
| `doculayer_symbol(symbol, source=None)` | Look up a function, class, or method. When `source` is omitted, it is inferred from the symbol's dotted prefix. |
| `doculayer_sources()` | List known sources, identifier formats, and live cache stats. |
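
The source inference described for `doculayer_symbol` can be pictured with a minimal sketch. The helper name `infer_source` and its return convention are hypothetical and are not taken from DocuLayer's code; the sketch only assumes that the first dotted component names the package.

```python
from typing import Optional


def infer_source(symbol: str, source: Optional[str] = None) -> Optional[str]:
    """Pick a source for a symbol lookup (hypothetical helper, not DocuLayer's API)."""
    if source:
        # An explicit source always wins.
        return source
    prefix, dot, _rest = symbol.partition(".")
    # Only a dotted symbol carries a usable prefix; a bare name gives no hint.
    return prefix if dot and prefix else None


print(infer_source("httpx.AsyncClient.get"))
print(infer_source("BaseModel", source="pydantic"))
```

A bare name such as `AsyncClient` yields `None` in this sketch, which is why passing `source=` explicitly is the safer choice for undotted symbols.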

Every response includes a citation block:

```
> **Source**: https://docs.pydantic.dev/latest/concepts/validators/
> **Fetched**: 4s ago
```

No generated text ever appears in a response.
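
To make the shape of the citation block concrete, here is a minimal sketch that prepends the two header lines to a piece of content. The function `with_citation` and its parameters are hypothetical; only the `Source` and `Fetched` line format is taken from the example above.

```python
import time
from typing import Optional


def with_citation(
    content: str,
    source_url: str,
    fetched_at: float,
    now: Optional[float] = None,
) -> str:
    """Prepend a Source/Fetched header to verbatim content (hypothetical helper)."""
    now = time.time() if now is None else now
    # Age is reported in whole seconds and never goes negative.
    age_seconds = max(0, int(now - fetched_at))
    header = f"> **Source**: {source_url}\n> **Fetched**: {age_seconds}s ago"
    return f"{header}\n\n{content}"


cited = with_citation(
    "## Validators\n...verbatim text...",
    "https://docs.pydantic.dev/latest/concepts/validators/",
    fetched_at=100.0,
    now=104.0,
)
print(cited)
```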

---

## 📦 Python Library

```python
import asyncio
from doculayer import search, fetch

# Search live docs — returns verbatim sections with source attribution
results = asyncio.run(search("dependency injection", "fastapi"))
for r in results:
    print(r.score, r.section.title)
    print(r.section.source_url)     # "https://fastapi.tiangolo.com/tutorial/..."
    print(r.section.content)        # verbatim Markdown from the real docs
    print(r.section.cited_content)  # content with attribution header prepended

# Fetch a specific section
content = asyncio.run(fetch("httpx", section="AsyncClient"))
print(content)
# > **Source**: https://www.python-httpx.org/...
# > **Fetched**: 2s ago
#
# ## AsyncClient
# ...verbatim text...
```

---

## 🔍 Identifier Formats

| Format | Example | Resolves via |
|---|---|---|
| bare name | `fastapi` | shortcut table → PyPI → npm |
| `pypi:` | `pypi:httpx` | PyPI JSON API |
| `npm:` | `npm:react` | npm registry |
| `gh:` | `gh:owner/repo` | GitHub URL |
| direct URL | `https://docs.example.com` | passthrough |
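
The table above amounts to a prefix dispatch. The sketch below shows one way to classify an identifier before resolving it; `classify_identifier` is a hypothetical name, and accepting plain `http://` URLs as well as `https://` is an assumption.

```python
def classify_identifier(identifier: str) -> tuple[str, str]:
    """Return (kind, value) for an identifier (hypothetical helper, not DocuLayer's API)."""
    # Direct URLs are passed through untouched.
    if identifier.startswith(("http://", "https://")):
        return ("url", identifier)
    # Explicit registry prefixes: strip the prefix and keep the remainder.
    for prefix in ("pypi", "npm", "gh"):
        if identifier.startswith(prefix + ":"):
            return (prefix, identifier[len(prefix) + 1:])
    # Anything else is a bare name, resolved via shortcut table, then PyPI, then npm.
    return ("bare", identifier)


for example in ("fastapi", "pypi:httpx", "npm:react", "gh:owner/repo", "https://docs.example.com"):
    print(example, "->", classify_identifier(example))
```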

---

## 📚 Packages with llms.txt (fastest access)

These packages publish an [llms.txt](https://llmstxt.org) index that DocuLayer uses to fetch only the pages relevant to your query — instead of the whole docs site.

`anthropic` · `astro` · `fastapi` · `httpx` · `langchain` · `nextjs` · `openai` · `pydantic` · `react` · `shadcn` · `supabase` · `svelte` · `tailwindcss` · `vite` · `vue`

Any package without llms.txt falls back to HTML parsing of the root docs page. Passing a direct URL also works — DocuLayer probes for llms.txt automatically.
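
As a rough illustration of how an llms.txt index narrows the fetch, the sketch below parses the Markdown link entries of an index and keeps the entries that share the most words with the query. It is not DocuLayer's implementation: `parse_llms_txt`, `top_entries`, the word-overlap score, and the sample index with its `docs.example.com` URLs are all assumptions.

```python
import re

# Matches llms.txt list entries of the form "- [Title](url): optional description".
LINK_RE = re.compile(
    r"^\s*[-*]\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def parse_llms_txt(text: str) -> list[dict[str, str]]:
    entries = []
    for line in text.splitlines():
        match = LINK_RE.match(line)
        if match:
            entries.append(
                {"title": match["title"], "url": match["url"], "desc": match["desc"] or ""}
            )
    return entries


def top_entries(entries: list[dict[str, str]], query: str, limit: int = 3) -> list[dict[str, str]]:
    terms = words(query)
    scored = []
    for entry in entries:
        overlap = len(terms & words(f"{entry['title']} {entry['desc']} {entry['url']}"))
        if overlap > 0:  # entries sharing no word with the query are dropped
            scored.append((overlap, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep index order
    return [entry for _, entry in scored[:limit]]


SAMPLE = """# Example Docs

## Concepts
- [Validators](https://docs.example.com/concepts/validators/): Field and model validators
- [Models](https://docs.example.com/concepts/models/): Defining a model and its field types
- [Settings](https://docs.example.com/concepts/settings/): Loading configuration
"""

for entry in top_entries(parse_llms_txt(SAMPLE), "field validators"):
    print(entry["url"])
```

Only the pages that survive this filter would need to be fetched, which is where the saving over crawling a whole docs site comes from.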

---

## ⚙️ How it works

```
1. resolve_identifier("pydantic")
        │
        ▼  shortcut table hit → https://docs.pydantic.dev

2. discover_llms_txt("https://docs.pydantic.dev")
        │
        ▼  fetches /llms.txt → 88 indexed entries

3. _candidate_urls("field validators")
        │
        ▼  keyword-score each entry → top 3 pages
           ["concepts/validators/", "api/validators/", "concepts/models/"]

4. DocFetcher.fetch(url) × 3   (parallel, TTLCache checked first)
        │
        ▼  raw HTML → markdownify → DocParser → list[DocSection]

5. DocSearcher(all_sections).search("field validators", top_k=5)
        │
        ▼  BM25Okapi scores → ranked SearchResult list

6. Return verbatim section content + source URL + fetch timestamp
```

**No embeddings. No vector store. No ML inference. No generated text.**
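
Step 5 ranks sections with BM25. The sketch below reimplements the scoring in plain Python so it runs without dependencies; it is not DocuLayer's code, which per the diagram uses `BM25Okapi`. The function name `bm25_rank`, the `k1` and `b` defaults, the tokenizer, and the smoothed IDF variant (which keeps scores non-negative) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(
    sections: dict[str, str],
    query: str,
    top_k: int = 5,
    k1: float = 1.5,
    b: float = 0.75,
) -> list[tuple[str, float]]:
    """Rank section titles by the BM25 score of their content against the query."""
    if not sections:
        return []
    titles = list(sections)
    docs = [tokenize(sections[title]) for title in titles]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many sections each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    ranked = []
    for title, doc in zip(titles, docs):
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalisation: long sections need more hits for the same score.
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        ranked.append((title, score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


SECTIONS = {
    "Field validators": "Use field validators to check a single field value before the model is built.",
    "Model config": "Model config controls how the model handles extra attributes.",
    "Serialization": "Serialization turns a model into a dict or JSON string.",
}

for title, score in bm25_rank(SECTIONS, "field validators"):
    print(f"{score:.3f}  {title}")
```

Because the score depends only on term counts, ranking needs no model download and no network call, which matches the "no ML, no embeddings" claim.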

---

## 🔒 Storage guarantee

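DocuLayer **never writes fetched documentation to disk**. The only file it touches is the IDE's MCP config, and only when you run `doculayer setup`.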

- All fetched pages go into a `TTLCache[FetchResult]` in process memory
- Cache entries expire after `DOCULAYER_CACHE_TTL` seconds (default: 1 hour)
- On process restart, the cache is empty — nothing persisted
- Safe for privacy-sensitive environments; docs can never go stale past the TTL
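
The behaviour in this list can be sketched as a small in-memory cache with expiry and a size bound. This is an illustration only: `SimpleTTLCache` is a hypothetical class, not DocuLayer's `TTLCache`, and the injectable `clock` exists purely so the expiry logic can be exercised without waiting.

```python
import time
from collections import OrderedDict
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class SimpleTTLCache(Generic[T]):
    """Process-local cache: entries expire after ttl_seconds, oldest evicted on overflow."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        max_entries: int = 256,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._items: "OrderedDict[str, tuple[float, T]]" = OrderedDict()

    def get(self, key: str) -> Optional[T]:
        item = self._items.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self._clock() - stored_at >= self._ttl:
            # Expired: drop the entry so the caller refetches.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: T) -> None:
        # Re-inserting moves the key to the newest position.
        self._items.pop(key, None)
        self._items[key] = (self._clock(), value)
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # evict the oldest entry

    def __len__(self) -> int:
        return len(self._items)


cache: SimpleTTLCache[str] = SimpleTTLCache(ttl_seconds=3600, max_entries=256)
cache.set("https://docs.example.com/page", "<html>...</html>")
print(cache.get("https://docs.example.com/page") is not None)
```

Nothing here touches the filesystem, so a restarted process starts with an empty cache.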

---

## 🔧 Configuration

All settings are environment variables — no config files, no disk reads:

| Variable | Default | Description |
|---|---|---|
| `DOCULAYER_CACHE_TTL` | `3600` | Cache entry lifetime in seconds |
| `DOCULAYER_MAX_CACHE` | `256` | Max cached pages (oldest evicted on overflow) |
| `DOCULAYER_FETCH_TIMEOUT` | `12.0` | HTTP timeout per request, in seconds |
| `DOCULAYER_MAX_BYTES` | `524288` | Max page size in bytes (512 KB) |
| `DOCULAYER_MAX_WORDS` | `400` | Max words per returned section |
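
For readers who want to mirror these settings in their own code, the sketch below reads the same variables with the defaults from the table. The `Settings` dataclass and `load_settings` function are hypothetical, and the choice of `int` versus `float` for each field is an assumption based on how the defaults are written.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    cache_ttl: int = 3600        # seconds
    max_cache: int = 256         # pages
    fetch_timeout: float = 12.0  # seconds
    max_bytes: int = 524288      # 512 KB
    max_words: int = 400


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to the documented defaults."""
    defaults = Settings()
    return Settings(
        cache_ttl=int(env.get("DOCULAYER_CACHE_TTL", defaults.cache_ttl)),
        max_cache=int(env.get("DOCULAYER_MAX_CACHE", defaults.max_cache)),
        fetch_timeout=float(env.get("DOCULAYER_FETCH_TIMEOUT", defaults.fetch_timeout)),
        max_bytes=int(env.get("DOCULAYER_MAX_BYTES", defaults.max_bytes)),
        max_words=int(env.get("DOCULAYER_MAX_WORDS", defaults.max_words)),
    )


# Passing a plain dict instead of os.environ makes the override behaviour easy to see.
print(load_settings({"DOCULAYER_CACHE_TTL": "60"}))
```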

---

## 📊 Compared to RAG

| | RAG | DocuLayer |
|---|---|---|
| Storage | Vector DB required | None — in-memory TTL only |
| Freshness | Depends on indexing schedule | Fetched live; cached copies are at most one TTL old |
| Accuracy | Semantic similarity | Verbatim text from source |
| Setup | Embedding model + DB + ingestion pipeline | `pip install doculayer==0.1.1` |
| Hallucination risk | Embedding drift, chunking artifacts | Zero — no generated text |

---

## ❓ Why not just give the agent a URL?

You could. But:

- The agent still has to know _which_ URL. For a library with 200 pages, it guesses.
- The agent will fetch the whole page and summarize it — that's generation, which means drift.
- DocuLayer uses llms.txt to fetch only the 1–3 pages most likely to answer the query, then returns the relevant section verbatim. The agent reads real documentation, not a paraphrase of it.

---

## 🤝 Contributing

```bash
git clone https://github.com/inamdarmihir/doculayer
cd doculayer
pip install -e ".[dev]"
pytest
```

---

## 📄 License

Apache 2.0 — see [LICENSE](LICENSE).
