Metadata-Version: 2.4
Name: vectorise-mcp
Version: 0.8.2
Summary: Local stdio MCP server that turns folders of dense documents into a vector embedding DB Claude can search.
Project-URL: Homepage, https://github.com/jameslovespancakes/Vectorised-Embedding-MCP
Project-URL: Issues, https://github.com/jameslovespancakes/Vectorised-Embedding-MCP/issues
Project-URL: Source, https://github.com/jameslovespancakes/Vectorised-Embedding-MCP
Author: vectorise-mcp
License: MIT
License-File: LICENSE
Keywords: claude,embeddings,mcp,rag,sqlite-vec,vector-search
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: mcp>=1.2.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: pypdf>=5.0.0
Requires-Dist: python-docx>=1.1.0
Requires-Dist: python-pptx>=0.6.23
Requires-Dist: sentence-transformers>=3.0.0
Requires-Dist: sqlite-vec>=0.1.6
Requires-Dist: xlrd>=2.0.1
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Provides-Extra: notify
Requires-Dist: plyer>=2.1.0; extra == 'notify'
Provides-Extra: ocr
Requires-Dist: pillow>=10.0.0; extra == 'ocr'
Requires-Dist: pypdfium2>=4.20.0; extra == 'ocr'
Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == 'ocr'
Description-Content-Type: text/markdown

# vectorise-mcp

Local stdio MCP server that turns folders of dense documents (PDFs, Word, text, markdown) into a **hybrid retrieval database** that **Claude Desktop** can semantically search mid-conversation. Built so Claude can effectively work with corpora far larger than its context window — point it at 100M+ tokens of reports and ask questions; it pulls only the *relevant* chunks.

Fully offline after first model download. No API keys. Free.

## Why this is built for *quality*, not just for ticking a box

Cheap RAG implementations ("just embed with MiniLM and dot-product search") fail badly on dense documents. They miss rare terms, conflate similar sentences, and rank irrelevant chunks at the top. This server uses a stack designed for real retrieval quality:

| Stage | What it does | Why |
|---|---|---|
| **BGE-small-en-v1.5** embeddings | Dense semantic vectors (384-dim, normalized) | Top-tier MTEB scores at small size; far better than MiniLM on technical English. |
| **SQLite FTS5 BM25** | Keyword retrieval in parallel | Catches rare terms, names, IDs, acronyms that pure semantic search misses. |
| **Reciprocal Rank Fusion** | Merges vector + keyword candidates | Robust hybrid signal — neither side dominates. |
| **bge-reranker-base** cross-encoder | Re-scores top-50 jointly with the query | Large precision gain: the cross-encoder reads each query-chunk pair together instead of comparing independently computed vectors, so it ranks more accurately than a bi-encoder alone. |
| **Sentence-aware chunking** | 384-tok chunks, 96-tok overlap, sentence-bounded | Preserves coherence; overlap stops boundary loss. |
| **SHA1 incremental reindex** | Only re-embeds changed files | Cheap to keep up to date as the folder evolves. |

This is the same retrieval pattern used in production search systems (e.g., Anthropic's contextual retrieval, Vespa hybrid recipes).
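
The sentence-aware chunking row in the table above can be sketched roughly as follows. This is an illustration, not the package's actual code: `chunk_sentences` is a hypothetical name, whitespace-separated words stand in for model tokens, and the regex sentence splitter is a simplification.

```python
import re


def chunk_sentences(text, max_tokens=384, overlap_tokens=96):
    """Pack whole sentences into chunks of roughly max_tokens, carrying up to
    overlap_tokens of trailing sentences into the next chunk."""
    # Assumption: words approximate tokens; a real tokenizer would differ.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Re-use the last few sentences as overlap for the next chunk.
            kept, kept_len = [], 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if kept_len + prev_len > overlap_tokens:
                    break
                kept.insert(0, prev)
                kept_len += prev_len
            current, current_len = kept, kept_len
        # A sentence longer than the budget is kept whole, never cut.
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only ever break between sentences, a fact that straddles a boundary still appears intact in at least one chunk via the overlap.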

## Stack

| Component | Library |
|---|---|
| MCP server | [`mcp` SDK](https://github.com/modelcontextprotocol/python-sdk) (FastMCP) |
| Embeddings | [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5) (384-dim, ~130MB, CPU) |
| Reranker | [`BAAI/bge-reranker-base`](https://huggingface.co/BAAI/bge-reranker-base) (~110MB, CPU) |
| Vector DB | [`sqlite-vec`](https://github.com/asg017/sqlite-vec) (single-file SQLite extension) |
| Keyword DB | SQLite FTS5 (BM25) |
| PDF | `pypdf` |
| DOCX | `python-docx` |

## Install

```bash
# core (text-based docs only)
pip install vectorise-mcp

# with OCR for scanned PDFs + images (.png .jpg .tiff .bmp .webp)
pip install "vectorise-mcp[ocr]"

vectorise-mcp setup       # downloads ~250MB models (+30MB OCR if installed)
```

Python ≥ 3.10. `pip` cannot reliably run post-install hooks (PEP 517), so models download on `setup` (or first `serve` boot if you skip setup). After that, fully offline.

**OCR**: uses `rapidocr-onnxruntime` (pure Python, ONNX, no system Tesseract install) and `pypdfium2` (no Poppler). When installed, scanned PDF pages auto-fall-back to OCR; image files become first-class indexable docs.

## Wire into Claude Desktop

Edit `claude_desktop_config.json`:
- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
- **Linux**: `~/.config/Claude/claude_desktop_config.json`

```json
{
  "mcpServers": {
    "vectorise": {
      "command": "vectorise-mcp",
      "args": ["serve"]
    }
  }
}
```

Ready-to-paste configs are also in the repo:

| File | Use |
|---|---|
| `claude_desktop_config.example.json` | Minimal config — just this server. |
| `examples/claude_desktop_config.windows.json` | Windows-pathed variant with comment. |
| `examples/claude_desktop_config.macos.json` | macOS-pathed variant with comment. |
| `examples/claude_desktop_config.linux.json` | Linux-pathed variant with comment. |
| `examples/claude_desktop_config.advanced.json` | Pinned Python interpreter + env tuning for GPU. |

If you already have other MCP servers in your config, **merge** the `"vectorise"` key into your existing `mcpServers` object — don't overwrite the whole file.
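
If you prefer to script the merge, a minimal sketch could look like the following. `merge_vectorise_entry` is a hypothetical helper, not part of this package; it assumes the config file is plain JSON and that overwriting an existing `"vectorise"` entry is acceptable.

```python
import json
from pathlib import Path


def merge_vectorise_entry(config_path):
    """Add the "vectorise" server to an existing config without touching
    other entries. Creates the file if it does not exist yet."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["vectorise"] = {"command": "vectorise-mcp", "args": ["serve"]}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Back up the original file first; the sketch rewrites it in place with two-space indentation.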

Restart Claude Desktop. The server loads both models eagerly at startup, so the first tool call does not pause to load them.

## Use it

In a Claude Desktop conversation:

> *"Index `C:\Users\me\Documents\Q1Reports` and tell me what the revenue projections were."*

Claude will:
1. Call `index_folder` — streams progress notifications: *"Indexing 47 files, ~2 min remaining…"*
2. Call `search` — hybrid retrieval + cross-encoder rerank → top 5 chunks.
3. Synthesize answer, citing source file + page.
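
The hybrid retrieval in step 2 merges the vector and keyword rankings with Reciprocal Rank Fusion. A minimal sketch of that fusion step is below; the function name and the chunk IDs are illustrative, and `k=60` is the constant commonly used for RRF, assumed here rather than taken from this server's code.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs. Each list contributes
    1 / (k + rank) to an item's score, with rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["c3", "c1", "c7"]   # hypothetical chunk IDs from vector search
keyword_hits = ["c1", "c9", "c3"]  # hypothetical chunk IDs from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Only ranks are used, never raw scores, so cosine similarities and BM25 scores do not need to be normalised against each other.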

## MCP tools exposed

| Tool | What it does |
|---|---|
| `index_folder(folder_path, collection?)` | Walk folder, embed + index every supported doc. Safe to re-run: files whose SHA1 hash is unchanged are skipped. |
| `reindex_folder(collection)` | Re-scan source folder. Re-embed only changed files; drop deleted. |
| `list_collections()` | All indexed collections with size, doc count, indexed-at timestamp. |
| `search(collection, query, k=5, candidate_pool=75, file_glob?, subdirectory?, page_min?, page_max?, min_similarity?)` | Hybrid + reranked. Filters: filename glob, path substring, PDF page range, similarity floor. Claude is encouraged to raise `k` (10-20) and `candidate_pool` (150-300) for hard queries. |
| `delete_collection(collection)` | Drops the .db file. Returns freed MB. |

All tools have docstrings — Claude reads them automatically.
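
The change detection behind `reindex_folder` can be pictured with the sketch below. It is an assumption about the general approach, not the server's implementation: `plan_reindex` and the `stored_hashes` mapping (relative path to SHA1 hex digest) are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, block_size=1 << 20):
    """Hash a file in blocks so large documents are not read into memory at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def plan_reindex(folder, stored_hashes):
    """Return (changed_or_new, deleted) relative paths, given the hashes
    recorded at the previous indexing run."""
    root = Path(folder)
    current = {
        p.relative_to(root).as_posix(): sha1_of_file(p)
        for p in root.rglob("*")
        if p.is_file()
    }
    changed = sorted(p for p, h in current.items() if stored_hashes.get(p) != h)
    deleted = sorted(p for p in stored_hashes if p not in current)
    return changed, deleted
```

Only the files in `changed` would need re-embedding; entries in `deleted` would be dropped from the index.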

## Performance

| Metric | Value |
|---|---|
| Indexing throughput (CPU) | ~80 chunks/sec (bge-small) |
| 100-page PDF index time | ~3 sec |
| 50 × 100-page PDFs | ~3 min |
| 100M-token corpus | ~40 min one-time |
| Search latency (K=5, ≤500K chunks) | ~150ms (vector + FTS + rerank) |
| Disk per chunk | ~2 KB |

GPU auto-detected by `sentence-transformers` if PyTorch sees CUDA/MPS — 5-10× faster.
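
To see which device PyTorch would offer on your machine, a quick check along these lines may help. `pick_device` is a hypothetical helper; it only reports what PyTorch detects and falls back to `"cpu"` when PyTorch is not importable.

```python
def pick_device():
    """Report the accelerator PyTorch can see: "cuda", "mps", or "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend only exists in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```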

## Supported document types

- `.pdf` — page-aware extraction via `pypdf`. Pages with empty/sparse text auto-fall-back to OCR (if `[ocr]` extra installed).
- `.docx` — paragraphs + tables via `python-docx`
- `.txt`, `.md`, `.markdown` — UTF-8 text
- `.png`, `.jpg`, `.jpeg`, `.tiff`, `.tif`, `.bmp`, `.webp` — OCR via RapidOCR (requires `[ocr]` extra)

Unsupported files are skipped silently. Images without OCR installed are skipped with a warning.
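
The rules above amount to a small decision on the file extension. The sketch below restates them; `classify` and its return labels are illustrative, and case-insensitive extension matching is an assumption.

```python
from pathlib import Path

# Extensions taken from the list above.
TEXT_EXTS = {".pdf", ".docx", ".txt", ".md", ".markdown"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp", ".webp"}


def classify(path, ocr_installed):
    """Decide what happens to a file during indexing."""
    ext = Path(path).suffix.lower()
    if ext in TEXT_EXTS:
        return "index"
    if ext in IMAGE_EXTS:
        # Images need the [ocr] extra; without it they are skipped with a warning.
        return "index" if ocr_installed else "skip-with-warning"
    return "skip"
```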

## Storage

All collections live under `~/.vectorise-mcp/`. One `.db` file per collection, fully self-contained (vector + keyword + chunks + metadata). Portable — copy the file to share, back up the folder for safekeeping.
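
Since each collection is a single file, you can inspect the storage folder with a few lines of standard-library Python. In this sketch `list_collection_files` is a hypothetical helper, and treating the file stem as the collection name is an assumption.

```python
from pathlib import Path


def list_collection_files(root=None):
    """Return (name, size_in_MB) for every .db file in the storage folder."""
    root = Path(root) if root is not None else Path.home() / ".vectorise-mcp"
    if not root.is_dir():
        return []
    return sorted(
        (db.stem, round(db.stat().st_size / 1_000_000, 2))
        for db in root.glob("*.db")
    )


for name, size_mb in list_collection_files():
    print(f"{name}: {size_mb} MB")
```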

## Configuration via env vars

| Var | Default | Purpose |
|---|---|---|
| `VECTORISE_MCP_EMBED_MODEL` | `BAAI/bge-small-en-v1.5` | Override embedding model. Must produce 384-dim vectors. |
| `VECTORISE_MCP_RERANKER_MODEL` | `BAAI/bge-reranker-base` | Override cross-encoder reranker. |
| `VECTORISE_MCP_EMBED_BATCH` | `32` | Embedding batch size (lower if OOM). |
| `VECTORISE_MCP_RERANKER_BATCH` | `16` | Rerank batch size. |

## Troubleshooting

**`sqlite-vec` extension fails to load**
The PyPI package ships prebuilt binaries; ensure your Python's sqlite has `enable_load_extension` (true on standard CPython).
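
A quick way to test whether your interpreter's `sqlite3` module was built with extension loading is sketched below. `can_load_extensions` is a hypothetical helper that uses only the standard library; it does not try to load `sqlite-vec` itself.

```python
import sqlite3


def can_load_extensions():
    """True if this Python's sqlite3 build allows loading extensions."""
    conn = sqlite3.connect(":memory:")
    try:
        # The method is absent when sqlite3 was compiled without extension support.
        if not hasattr(conn, "enable_load_extension"):
            return False
        conn.enable_load_extension(True)
        conn.enable_load_extension(False)
        return True
    except (AttributeError, sqlite3.Error):
        return False
    finally:
        conn.close()


print(can_load_extensions())
```

If this prints `False`, switching to a Python build with extension loading enabled is the usual remedy.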

**Indexing very slow**
- First file is slow — model loading (~10s).
- Install GPU PyTorch for big speedup. Check `nvidia-smi` / Activity Monitor.

**Claude doesn't see the server**
- Quit Claude Desktop fully; restart.
- Validate JSON (no trailing commas).
- Run `vectorise-mcp serve` in a terminal. It should sit idle, waiting for input on stdio. Any error printed there points to an install problem.
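
To validate the config file, Python's `json` module is enough, since it rejects trailing commas. The sketch below is a hypothetical checker, not a tool shipped with this package.

```python
import json


def check_config(text):
    """Return "ok" or a short description of what is wrong with the config."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict) or "vectorise" not in servers:
        return 'missing "vectorise" entry under "mcpServers"'
    return "ok"
```

Pass it the contents of `claude_desktop_config.json`, for example `check_config(open(path, encoding="utf-8").read())`.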

**Search returns weak results**
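- Try a larger `candidate_pool` (the default is 75; 150-300 for hard queries). A larger pool improves recall at the cost of latency.
- Check the chunk text in `search` results. If chunks are tiny or empty, your source may be image-only PDFs: `pypdf` cannot OCR, so install the `[ocr]` extra and reindex so that such pages fall back to OCR.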

## License

MIT.
