Metadata-Version: 2.4
Name: pdf-context
Version: 0.1.1
Summary: Structure-aware local PDF ingestion and retrieval for AI clients
Author: PDF Context contributors
License: MIT
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mcp>=1.0.0
Requires-Dist: chromadb>=0.5.0
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: scikit-learn>=1.4.0
Requires-Dist: sentence-transformers>=3.0.0
Requires-Dist: watchdog>=4.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: httpx>=0.27.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Dynamic: license-file

# PDF Context Server

**PDF Context** (`pdf-context` on PyPI, import `pdf_context`) is a local-first library and MCP server that transforms PDF documents into structured, retrievable context for AI applications.

Drop PDFs into a watch folder, and the server ingests them automatically — extracting structure, classifying document type, chunking with awareness of chapters/sections, embedding locally, and exposing retrieval tools that AI clients use to teach, answer questions, or navigate documents sequentially.

**Drop in PDFs. Build context once. Query from anywhere.**

---

## Install

**From PyPI (when published):**

```bash
pip install pdf-context
```

**From source (development):**

```bash
git clone https://github.com/yourusername/pdf-context-server.git
cd pdf-context-server
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
```

Console entry points:

- `pdf-context` — developer CLI (ingest, search, smoke tests)
- `pdf-context-mcp` — MCP stdio server for AI clients

---

## Programmatic API

```python
from pdf_context import PdfContext, PdfContextConfig

config = PdfContextConfig(
    pdf_data_dir="/path/to/pdfs",
    storage_dir="/path/to/storage",
)
ctx = PdfContext(config, watch=False)
ctx.ingest("my-book.pdf")
results = ctx.search("virtual memory", document="my-book.pdf", top_k=5)
print(results["chunks"])
```

Each `PdfContext` instance is isolated: each distinct `(pdf_data_dir, storage_dir)` pair gets its own SQLite and Chroma indexes.

---

## Features (v1)

- PDF ingestion with folder watch and background job queue
- Structure extraction from PDF outlines, heading heuristics, or page-level fallback
- Auto document classification: `textbook`, `technical_reference`, `paper`, `notes`
- Per-type retrieval profiles (chunk size, sequential vs semantic-first)
- Local embeddings (sentence-transformers default, Ollama optional)
- ChromaDB vector storage + SQLite metadata
- Semantic search with structure filters (chapter, section, page range)
- Sequential navigation for chapter-by-chapter learning
- MCP stdio server for any compatible AI client
- Production-grade local reliability: dedup, retries, checkpoints, resume

---

## Architecture

```text
PDF Documents (data/pdfs/)
      │
      ▼
 Folder Watcher ──► Job Queue (SQLite)
      │
      ▼
 Structure Extract + Classify + Parse + Chunk + Embed
      │
      ├──► ChromaDB (vectors + metadata)
      └──► SQLite (documents, structure, chunks, jobs)
      │
      ▼
 MCP Server (stdio)
      ├── NavigationalEngine (sequential / section content)
      └── SemanticEngine (scoped semantic search)
      │
      ▼
 AI Client (Cursor, Claude Desktop, etc.)
```

---

## Project Structure

```text
pdf-context-server/
├── pdf_context/                # installable package
│   ├── client.py               # PdfContext public API
│   ├── config.py               # PdfContextConfig
│   ├── context.py              # AppContext runtime
│   ├── cli.py                  # pdf-context CLI
│   ├── mcp/                    # MCP factory + stdio entry
│   ├── classification/
│   ├── structure/
│   ├── parsers/
│   ├── chunking.py
│   ├── embeddings.py
│   ├── vector_store.py
│   ├── db/
│   ├── ingest/
│   ├── retrieval/
│   └── skills/                 # bundled agent skills (CLI install)
├── app/                        # deprecated shim (python -m app.main)
├── .cursor/
│   ├── mcp.json                # project MCP config (example)
│   └── skills/pdf-context/
├── data/pdfs/
├── storage/
├── tests/
├── pyproject.toml
├── requirements.txt            # dev convenience (see pyproject.toml)
├── .env.example
└── README.md
```

---

## Installation (legacy dev clone)

See **Install** above. `requirements.txt` mirrors runtime deps; prefer `pip install -e ".[dev]"`.

---

## Quick test (no MCP required)

Drop a PDF in `data/pdfs/`, then run **one command**:

```bash
pdf-context smoke
```

This ingests all PDFs in the folder, runs a sample search, and prints `PASS` or `FAIL` with details.

Other useful commands:

```bash
pdf-context status
pdf-context list
pdf-context ingest
pdf-context ingest "my-book.pdf"
pdf-context search "virtual memory" -d "my-book.pdf"
pdf-context --pdf-dir /path/pdfs --storage-dir /path/storage status
pdf-context skill list
pdf-context skill install
pytest
```

Or with Make: `make smoke`, `make status`, `make test`.

**MCP is for daily use in Cursor.** The CLI is for verifying everything works without configuring or reloading MCP.

---

## Adding Documents

Place PDFs in your configured PDF folder (default `data/pdfs/`):

```text
data/pdfs/
├── operating-systems.pdf
├── api-reference.pdf
└── lecture-notes.pdf
```

**Keep PDF and storage folders separate.** `pdf_data_dir` and `storage_dir` must not be the same path, and neither may live inside the other. Mixing them causes the folder watcher to pick up Chroma/SQLite files, or ingest metadata into your PDF tree. Use sibling directories (defaults `data/pdfs/` + `storage/` are fine).

The folder watcher auto-enqueues new or changed PDFs for ingestion.

Optional type override sidecar:

```text
data/pdfs/operating-systems.pdf.meta.json
```

```json
{ "doc_type": "textbook" }
```

Valid types: `textbook`, `technical_reference`, `paper`, `notes`
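
When many files need an override, the sidecar can be written by a short script. This is a minimal sketch that assumes only the naming convention shown above (`<pdf name>.meta.json` next to the PDF) and the single `doc_type` key. The helper name `write_type_override` is hypothetical and not part of the package.

```python
import json
from pathlib import Path

# The four document types listed in this README.
VALID_DOC_TYPES = {"textbook", "technical_reference", "paper", "notes"}


def write_type_override(pdf_path: str, doc_type: str) -> Path:
    """Write `<pdf>.meta.json` next to the PDF (hypothetical helper)."""
    if doc_type not in VALID_DOC_TYPES:
        raise ValueError(f"unknown doc_type: {doc_type!r}")
    pdf = Path(pdf_path)
    # operating-systems.pdf -> operating-systems.pdf.meta.json
    sidecar = pdf.with_name(pdf.name + ".meta.json")
    sidecar.write_text(json.dumps({"doc_type": doc_type}) + "\n", encoding="utf-8")
    return sidecar
```

Rejecting unknown types before writing keeps a typo from reaching the watch folder.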

---

## MCP Setup

Enable **pdf-context only in projects** where PDFs are your source of truth. Avoid enabling it globally in Cursor user settings if most of your chats are about code or general work: in projects where the server is not connected, the model cannot call PDF tools at all, so it cannot reach for them by mistake.

Add to **project** `.cursor/mcp.json` (Cursor) or your client's MCP config:

```json
{
  "mcpServers": {
    "pdf-context": {
      "command": "pdf-context-mcp",
      "args": [
        "--pdf-dir", "/absolute/path/to/pdfs",
        "--storage-dir", "/absolute/path/to/storage"
      ]
    }
  }
}
```

No repo clone required after `pip install pdf-context`. For local dev, point `command` at `.venv/bin/pdf-context-mcp`.

Legacy (deprecated): `"command": "python", "args": ["-m", "app.main"]`

Use a **descriptive server name** (`pdf-context`, `pdf-ml-book`, `pdf-papers`) so rules and skills can refer to the right corpus.

Restart or reload MCP after changing config.

### Multiple corpora (textbooks vs papers)

Run one MCP process per `(pdf folder, storage)` pair. Example:

```json
{
  "mcpServers": {
    "pdf-textbooks": {
      "command": "pdf-context-mcp",
      "args": [
        "--pdf-dir", "/Users/me/books",
        "--storage-dir", "/Users/me/.pdf-context/books",
        "--instance-id", "textbooks"
      ]
    },
    "pdf-papers": {
      "command": "pdf-context-mcp",
      "args": [
        "--pdf-dir", "/Users/me/papers",
        "--storage-dir", "/Users/me/.pdf-context/papers",
        "--instance-id", "papers"
      ]
    }
  }
}
```

Or set the directories through an `env` block inside a server entry (`PDF_CONTEXT_PDF_DATA_DIR`, `PDF_CONTEXT_STORAGE_DIR`; legacy `PDF_DATA_DIR` / `STORAGE_DIR` still work):

```json
"env": {
  "PDF_CONTEXT_PDF_DATA_DIR": "/Users/me/books",
  "PDF_CONTEXT_STORAGE_DIR": "/Users/me/.pdf-context/books"
}
```
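
With more than a couple of corpora, the config can be generated instead of hand-edited. The sketch below builds the same `mcpServers` shape as the multi-corpus example, using only the `--pdf-dir`, `--storage-dir`, and `--instance-id` arguments already shown. The function name `build_mcp_config` and the `pdf-` server-name prefix are illustrative assumptions, not part of the package.

```python
import json


def build_mcp_config(corpora: dict[str, tuple[str, str]]) -> dict:
    """Map instance id -> (pdf_dir, storage_dir) into an mcpServers config."""
    servers = {}
    for instance_id, (pdf_dir, storage_dir) in corpora.items():
        if pdf_dir == storage_dir:
            raise ValueError(f"{instance_id}: pdf and storage dirs must differ")
        # One server entry (and so one MCP process) per corpus.
        servers[f"pdf-{instance_id}"] = {
            "command": "pdf-context-mcp",
            "args": [
                "--pdf-dir", pdf_dir,
                "--storage-dir", storage_dir,
                "--instance-id", instance_id,
            ],
        }
    return {"mcpServers": servers}


config = build_mcp_config({
    "textbooks": ("/Users/me/books", "/Users/me/.pdf-context/books"),
    "papers": ("/Users/me/papers", "/Users/me/.pdf-context/papers"),
})
print(json.dumps(config, indent=2))
```

The printed JSON can be saved as the project `.cursor/mcp.json` or merged into your client's MCP config.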

---

## When the AI client should call MCP tools

The model **chooses tools** based on your message, the tool descriptions, and project skills; tool calls are not triggered automatically. This project steers that behavior in three layers:

1. **Tool docstrings** in `pdf_context/mcp/server.py` — each tool states *use when* / *do not use when*.
2. **Project skill** — `.cursor/skills/pdf-context/SKILL.md` tells Cursor when to use pdf-context vs codebase tools.
3. **Project-scoped MCP** — enable the server only where PDFs matter.

| User intent | Expected tools |
|-------------|----------------|
| Fix code / git / tests | None (pdf-context idle) |
| "What does the book say about X?" | `search_pdf_context` (+ maybe `list_documents`) |
| Chapter walkthrough with cites | `list_chapters`, `get_section_content`, `get_next_chunks` |
| "Is my PDF indexed?" | `get_ingest_status`, `list_documents` |
| Casual chat | None |

Phrases that help: *"From the indexed PDFs…"*, *"Search [filename] for…"*, *"Don't guess—use pdf-context."*

Phrases that skip PDF tools: *"In general (no PDF)"*, *"Fix this Python file."*

After pulling this repo, **reload MCP** so clients pick up new tool descriptions.

### Install agent skill for any AI client

Bundled skills live in `pdf_context/skills/`. Install into Cursor, Claude Code, VS Code Copilot, Codex/AGENTS.md, Windsurf, Gemini, or a custom path:

```bash
pdf-context skill install
pdf-context skill list
pdf-context skill install -s pdf-context -c claude-code -p .
```

| `--client` | Writes to |
|------------|-----------|
| `cursor-project` | `.cursor/skills/pdf-context/SKILL.md` |
| `cursor-global` | `~/.cursor/skills/pdf-context/SKILL.md` |
| `claude-code` | `CLAUDE.md` |
| `vscode-copilot` | `.github/copilot-instructions.md` |
| `codex-agents` | `AGENTS.md` |
| `windsurf` | `.windsurfrules` |
| `gemini` | `GEMINI.md` |
| `custom` | path from `--output` |

Markdown targets include marked blocks (`<!-- pdf-context-skill:start/end -->`) so re-running install can update the section without wiping your file.
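
The marked-block idea can be sketched in a few lines. This is an illustration of the technique, not the installer's actual code. It assumes the shorthand above expands to the two marker strings `<!-- pdf-context-skill:start -->` and `<!-- pdf-context-skill:end -->`, and `upsert_marked_block` is a hypothetical name.

```python
# Assumed marker strings, expanded from the start/end shorthand above.
START = "<!-- pdf-context-skill:start -->"
END = "<!-- pdf-context-skill:end -->"


def upsert_marked_block(existing: str, skill_text: str) -> str:
    """Replace the marked section if present, otherwise append it."""
    block = f"{START}\n{skill_text.strip()}\n{END}"
    start = existing.find(START)
    end = existing.find(END)
    if start != -1 and end > start:
        # Keep everything outside the markers untouched.
        return existing[:start] + block + existing[end + len(END):]
    if not existing.strip():
        return block + "\n"
    return existing.rstrip("\n") + "\n\n" + block + "\n"
```

Running the function twice with the same skill text returns the same file content, so a re-install under this scheme is idempotent.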

---

## MCP Tools

| Tool | Purpose |
|------|---------|
| `list_documents` | **Corpus check**: what's indexed; call when unsure of scope |
| `get_ingest_status` | Queue health; status of newly added PDFs; debugging empty search results |
| `get_document_profile` | Doc type, retrieval profile, per-document guidance |
| `list_structure` | Full TOC tree |
| `list_chapters` | Flat chapter list (textbooks) |
| `get_section_content` | Ordered chunks for a chapter/section |
| `get_next_chunks` | Sequential read-ahead from cursor |
| `search_pdf_context` | Semantic search with optional structure filters |
| `set_document_type` | Override auto-classification (when user asks) |
| `reingest_document` | Force re-index (when user asks) |

Each tool's MCP description includes when to call it and when to skip it.

---

## Document Types

| Type | Treatment |
|------|-----------|
| **textbook** | Sequential chapter navigation; larger chunks; chapter-scoped search |
| **technical_reference** | Semantic-first; section-scoped search; no forced sequential reading |
| **paper** | Section-scoped semantic search (abstract, methods, etc.) |
| **notes** | Weak structure; semantic-only; page-level fallback navigation |

Classification is automatic at ingest. Override via `.meta.json` or `set_document_type`.

---

## Chapter-by-Chapter Learning Workflow

The AI client holds progress via the `cursor` returned by navigational tools.

```text
1. get_document_profile("operating-systems.pdf")
2. list_chapters("operating-systems.pdf")
3. get_section_content("operating-systems.pdf", node_id=<chapter_id>, limit=5)
4. [Client teaches / summarizes from returned chunks]
5. search_pdf_context("page faults", document="operating-systems.pdf", chapter_id=<id>)
6. get_next_chunks("operating-systems.pdf", cursor=<last_cursor>, limit=5)
```
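
Steps 3 and 6 amount to a cursor-following loop on the client side. The sketch below runs against in-memory stand-ins named after the real tools. The response shape (`{"chunks": [...], "cursor": ...}`, with `cursor` set to `None` at the end) and the integer cursor are assumptions made for illustration; check the real tool output before relying on them.

```python
from typing import Optional

# Stand-in for the real MCP tools: a tiny in-memory "document".
# The response shape and the integer cursor are assumptions.
FAKE_CHUNKS = [f"chunk {i}" for i in range(1, 8)]


def get_section_content(document: str, node_id: str, limit: int) -> dict:
    chunks = FAKE_CHUNKS[:limit]
    cursor = limit if limit < len(FAKE_CHUNKS) else None
    return {"chunks": chunks, "cursor": cursor}


def get_next_chunks(document: str, cursor: int, limit: int) -> dict:
    chunks = FAKE_CHUNKS[cursor:cursor + limit]
    nxt = cursor + limit
    return {"chunks": chunks, "cursor": nxt if nxt < len(FAKE_CHUNKS) else None}


def read_chapter(document: str, node_id: str, limit: int = 5) -> list[str]:
    """Fetch the first batch, then follow the cursor until it runs out."""
    page = get_section_content(document, node_id, limit)
    seen = list(page["chunks"])
    cursor: Optional[int] = page["cursor"]
    while cursor is not None:
        # The client, not the server, remembers where it stopped.
        page = get_next_chunks(document, cursor, limit)
        seen.extend(page["chunks"])
        cursor = page["cursor"]
    return seen
```

In practice the client would pause between batches to teach or summarize (step 4) instead of draining the chapter in one loop.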

For unstructured notes, use `list_structure` and semantic search without sequential navigation.

---

## Configuration

See [`.env.example`](.env.example). Key settings:

| Variable | Default | Description |
|----------|---------|-------------|
| `PDF_CONTEXT_PDF_DATA_DIR` | `data/pdfs` | PDF watch folder (must not overlap storage) |
| `PDF_CONTEXT_STORAGE_DIR` | `storage` | SQLite + Chroma (must not overlap PDF folder) |
| `PDF_CONTEXT_EMBEDDING_PROVIDER` | `sentence_transformers` | or `ollama` |
| `PDF_CONTEXT_EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Local embedding model |
| `PDF_CONTEXT_WATCH_ENABLED` | `true` | Auto-ingest on folder changes |
| `PDF_CONTEXT_CHECKPOINT_PAGE_INTERVAL` | `50` | Resume checkpoint during large ingests |

Legacy `PDF_DATA_DIR` / `STORAGE_DIR` (no prefix) are accepted for one release.

**Path layout rule:** After resolving to absolute paths, `pdf_data_dir` and `storage_dir` must differ and must not be nested (parent/child). Configuration is validated at startup when directories are created; invalid layouts raise a clear error.
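
To check a layout before starting the server, the rule can be approximated with `pathlib`. This is a sketch of the rule as described, not the package's own validator, and `validate_layout` is a hypothetical name.

```python
from pathlib import Path


def validate_layout(pdf_data_dir: str, storage_dir: str) -> tuple[Path, Path]:
    """Reject identical or nested directories (illustrative check)."""
    # resolve() makes both paths absolute, so relative inputs compare correctly.
    pdf = Path(pdf_data_dir).resolve()
    storage = Path(storage_dir).resolve()
    if pdf == storage:
        raise ValueError("pdf_data_dir and storage_dir must differ")
    if pdf in storage.parents:
        raise ValueError("storage_dir must not live inside pdf_data_dir")
    if storage in pdf.parents:
        raise ValueError("pdf_data_dir must not live inside storage_dir")
    return pdf, storage
```

The default pair `data/pdfs` and `storage` passes, while `data` with `data/storage` is rejected as nested.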

The first ingest of a large library (20+ textbooks, ~20k pages) on CPU may take hours. Checkpoints make ingestion resumable if it is interrupted.
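
The checkpoint-and-resume idea can be illustrated with a small loop. This sketch is not the package's implementation: `ingest_pages` and the `next_page` key are hypothetical, the checkpoint is a plain dict where a real system would persist it, and `interval` mirrors the default of `PDF_CONTEXT_CHECKPOINT_PAGE_INTERVAL`.

```python
from typing import Callable


def ingest_pages(
    total_pages: int,
    process_page: Callable[[int], None],
    checkpoint: dict,
    interval: int = 50,
) -> None:
    """Process pages in order, recording progress every `interval` pages."""
    # Resume from the last recorded position, or from page 0 on a fresh run.
    start = checkpoint.get("next_page", 0)
    for page in range(start, total_pages):
        process_page(page)
        done = page + 1
        if done % interval == 0 or done == total_pages:
            # A real system would persist this (e.g. in SQLite), not a dict.
            checkpoint["next_page"] = done
```

If a run stops at page 120, the next call resumes from page 100, the last checkpoint. In this sketch pages 100 to 119 are processed a second time, so the per-page work needs to be safe to repeat.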

---

## Technology Stack

- Python 3.11+
- PyMuPDF — PDF parsing and outline extraction
- sentence-transformers — local embeddings
- ChromaDB — vector storage
- SQLite — metadata, structure, job queue
- MCP — AI client integration
- watchdog — folder watching

---

## Development

```bash
pip install -e ".[dev]"
pytest
pdf-context --help
pdf-context-mcp --help
```

---

## Vision

PDF Context Server converts static PDFs into structured, searchable knowledge that AI applications consume on demand — without re-uploading documents every session.

**Retrieval, not synthesis:** the server returns ranked chunks and structure metadata; your AI client generates answers, lessons, and summaries.
