Metadata-Version: 2.4
Name: edex
Version: 0.2.0
Summary: Local MCP server that turns PDFs into a filesystem-native knowledge base
Keywords: mcp,pdf,rag,docling,claude,agent,knowledge-base
Author: Leo Gao
Author-email: Leo Gao <leogao2006@gmail.com>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Markup
Classifier: Operating System :: OS Independent
Requires-Dist: docling-core>=2.77.0
Requires-Dist: fastmcp>=3.3.1
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2.13.4
Requires-Dist: python-frontmatter>=1.3.0
Requires-Dist: tantivy>=0.26.0
Requires-Dist: docling>=2.95.0 ; extra == 'local'
Requires-Python: >=3.13
Project-URL: Homepage, https://edex.dev
Project-URL: Repository, https://github.com/leog25/edex
Project-URL: Issues, https://github.com/leog25/edex/issues
Provides-Extra: local
Description-Content-Type: text/markdown

# edex

Local MCP server that turns PDFs into a filesystem-native knowledge base.
Instead of chunking + embedding + vector search, an agent navigates the
ingested PDFs the way it would a codebase, using `list_documents`,
`get_outline`, `read_section`, `grep`, and BM25 `search`.

## Why

2026 benchmarks (Anthropic's internal switch off RAG, LlamaIndex `fs-explorer`,
Aktagon "keyword search is all you need") converge on the same finding: when
the agent can iterate, lexical filesystem retrieval reaches ≥90% of vector-RAG
quality with far simpler infrastructure and verifiable citations.

## What you get

For each PDF you ingest:

```
workspace/
  INDEX.md                       # one-line summary per doc (BM25 entry point)
  SCHEMA.md                      # conventions doc for the agent
  .index/                        # tantivy BM25 store
  <doc-slug>/
    README.md                    # title, abstract, full TOC, element registry
    manifest.json                # outline tree + element registry + page->section map
    sections/                    # flat, numeric-prefixed: 02.1-data-collection.md
    tables/                      # tNNN-*.html (preserves merged cells) + .md sidecar
    figures/                     # fNNN-*.png + .md sidecar
    raw/source.pdf
```
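
The layout above can be walked with the standard library alone. The sketch below is a hypothetical helper, not edex's own `list_documents`; it assumes only what the tree shows, namely that a document is a subdirectory holding a `manifest.json` and a `sections/` folder. The function names are made up for illustration.

```python
from pathlib import Path


def list_doc_slugs(workspace: Path) -> list[str]:
    """Return the names of subdirectories that look like ingested documents.

    Assumption: a directory counts as a document when it contains a
    manifest.json. Hidden directories such as .index are skipped.
    """
    slugs = []
    for child in sorted(workspace.iterdir()):
        if not child.is_dir() or child.name.startswith("."):
            continue
        if (child / "manifest.json").is_file():
            slugs.append(child.name)
    return slugs


def section_files(workspace: Path, slug: str) -> list[Path]:
    """Return the section Markdown files of one document, sorted by file name."""
    return sorted((workspace / slug / "sections").glob("*.md"))
```

Calling `list_doc_slugs(Path("/path/to/workspace"))` would then give one slug per ingested PDF.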

Each `sections/*.md` carries YAML frontmatter (`doc`, `section_id`, `title`,
`parent`, `pages`, `tokens`, `tables`, `figures`, `description`). The
`description` field is the summary that BM25 ranking leans on most, and the one
the agent reads to decide whether to open the body.
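
As an illustration of reading that frontmatter before opening a body, here is a minimal standard-library sketch. It is not edex's parser: it handles only flat `key: value` lines and keeps every value as a raw string, whereas real code would more likely use a YAML-aware library such as `python-frontmatter`. The sample values are invented.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown file into (frontmatter fields, body).

    Only flat `key: value` lines are understood; lists and nested YAML
    are out of scope for this sketch.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing '---' delimiter of the frontmatter block.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, text
    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields, "\n".join(lines[end + 1:])


# Invented sample; field names follow the list above.
SAMPLE = """---
doc: paper
title: Data collection
description: How the survey data was gathered and cleaned.
---
Body text of the section.
"""

fields, body = split_frontmatter(SAMPLE)
print(fields["description"])
```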

## MCP tools

| Tool | What it does |
|---|---|
| `ingest_pdf(path, doc_slug?, backend?)` | Run Docling, write workspace, reindex |
| `list_documents()` | Slug + title + counts for every ingested doc |
| `get_outline(doc)` | Hierarchical TOC |
| `read_section(doc, section)` | By `section_id` ("2.1") or filename stem ("02.1-data-collection") |
| `read_table(doc, table_id)` | HTML + caption + page |
| `get_figure(doc, figure_id)` | PNG image content block |
| `grep(pattern, doc?)` | Regex over section markdown |
| `search(query, doc?, k=10)` | BM25 with title^3 / description^2 / body^1 boosts |
| `find_section(query, doc)` | Single best `section_id`; title-substring rerank |

## Install

```bash
pip install edex                  # MCP server + hosted-ingest client
pip install 'edex[local]'         # add full Docling stack for in-process ingest
```

The base install is small: it pulls in only `docling-core` (types), not
`docling` itself. The `[local]` extra adds the heavy ML stack (~1GB+ of
PyTorch + TableFormer models).

## Run

```bash
# stdio MCP server (the usual mode)
EDEX_WORKSPACE=/path/to/workspace edex serve

# one-shot CLI ingest
edex ingest path/to/paper.pdf --slug paper
```

## Hosted ingest

By default `ingest_pdf` calls a hosted GPU endpoint so you don't need a local
GPU (or even `docling` installed). Request a token by emailing
`leogao2006@gmail.com` and set:

```bash
export EDEX_DOCLING_BACKEND=modal
export EDEX_INGEST_TOKEN=<your-token>
```

To ingest in-process instead, set `EDEX_DOCLING_BACKEND=local` (with the
`[local]` extra installed), or pass `backend="local"` per call via the MCP
tool.
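
The selection rule described in this section can be sketched as a small function. The precedence used here (per-call argument, then `EDEX_DOCLING_BACKEND`, then the hosted default) and the check that hosted ingest has a token are assumptions made for illustration; `choose_backend` is a hypothetical name, not part of edex.

```python
import os
from typing import Mapping, Optional

VALID_BACKENDS = ("modal", "local")


def choose_backend(
    per_call: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the ingest backend.

    Assumed precedence: explicit per-call value, then the
    EDEX_DOCLING_BACKEND variable, then the hosted ("modal") default.
    """
    backend = per_call or env.get("EDEX_DOCLING_BACKEND") or "modal"
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {VALID_BACKENDS}")
    # Assumption: the hosted endpoint cannot be used without a token.
    if backend == "modal" and not env.get("EDEX_INGEST_TOKEN"):
        raise RuntimeError("hosted ingest needs EDEX_INGEST_TOKEN to be set")
    return backend
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.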

## Use with Claude Desktop

```json
{
  "mcpServers": {
    "edex": {
      "command": "edex",
      "args": ["serve"],
      "env": {
        "EDEX_WORKSPACE": "/path/to/workspace",
        "EDEX_DOCLING_BACKEND": "modal",
        "EDEX_INGEST_TOKEN": "<your-token>"
      }
    }
  }
}
```

## Self-host the ingest endpoint

If you want to run your own GPU ingest service instead of using the public
endpoint:

```bash
uv sync                                  # base deps
uv run modal setup                       # authenticate with Modal
uv run modal secret create edex-ingest-tokens \
    EDEX_INGEST_TOKENS_JSON='{"<your-token>": {"name": "me", "rate_per_hour": 100}}'
uv run modal deploy infra/modal_app.py
```

Then point your client at it via `EDEX_INGEST_URL=https://<workspace>--edex-ingest-prod-convert.modal.run`.

Knobs at the top of `infra/modal_app.py`: `GPU_TYPE`, `MAX_CONTAINERS`,
`SCALEDOWN_WINDOW_SEC`, `CONVERT_TIMEOUT_SEC`, `MAX_PDF_BYTES`. Models are
baked into the image so cold starts skip the ~3GB download.
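
The `EDEX_INGEST_TOKENS_JSON` value used in the secret above is a JSON object keyed by token. If you issue tokens to several clients, a small helper like the following could generate that string. This is a sketch: `build_tokens_json` is a hypothetical name, and only the two fields shown in the command above (`name`, `rate_per_hour`) are assumed to exist.

```python
import json
import secrets


def build_tokens_json(clients: dict[str, int]) -> tuple[str, dict[str, str]]:
    """Create one random token per client.

    `clients` maps a client name to its hourly rate limit. Returns the JSON
    string for the secret and a name -> token mapping to hand out to clients.
    """
    issued = {name: secrets.token_urlsafe(32) for name in clients}
    payload = {
        issued[name]: {"name": name, "rate_per_hour": rate}
        for name, rate in clients.items()
    }
    return json.dumps(payload), issued


secret_value, issued = build_tokens_json({"me": 100})
print(secret_value)
```

The printed string is what would go after `EDEX_INGEST_TOKENS_JSON=` in the `modal secret create` command; each client then sets its own token as `EDEX_INGEST_TOKEN`.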

## Development

```bash
uv run pytest                   # fast unit + MCP server tests
uv run pytest -m slow           # gated; needs `docling` installed
RUN_DOCLING=1 uv run pytest     # actually run the slow Docling integration tests
```

The codebase is split so the heavy ML deps (`docling`) only load when actually
ingesting:

- `parser.py` is pure: it consumes an already-parsed `DoclingDocument` and
  emits the internal model. Unit-tested against committed
  `tests/fixtures/docling_cache/*.json`.
- `docling_runner.py` is the only place that imports `docling`, lazily.
- `writer.py`, `index.py`, `workspace.py` are pure FS / pure tantivy.
- `server.py` is FastMCP glue (no business logic).
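
The lazy-import rule for `docling_runner.py` can be illustrated with a generic pattern: import the optional dependency inside a function, and turn a missing package into an actionable error. This is a sketch of the general technique, not the module's actual code; `require_optional` is a hypothetical helper.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency only at the moment it is needed.

    Nothing is imported when the enclosing module loads, so the base
    install keeps working without the heavy package.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name!r} is not installed; try: pip install 'edex[{extra}]'"
        ) from exc


def convert_locally(pdf_path: str):
    """Hypothetical entry point: docling is loaded here, not at import time."""
    docling = require_optional("docling", "local")
    return docling, pdf_path
```

With this shape, importing the module costs nothing extra, and only a local ingest call pays for loading `docling`.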

136 tests run in ~12 seconds; the 5 Docling-gated tests are skipped by default.
