Metadata-Version: 2.4
Name: paperloom
Version: 0.1.0
Summary: Local-first document toolkit. Streaming OCR (GLM-OCR via Ollama), PII anonymizer, and 19 chainable PDF/Markdown/HTML tools. MCP server included.
Project-URL: Homepage, https://github.com/luciopalmieri/paperloom
Project-URL: Repository, https://github.com/luciopalmieri/paperloom
Project-URL: Issues, https://github.com/luciopalmieri/paperloom/issues
Project-URL: Documentation, https://github.com/luciopalmieri/paperloom/tree/main/doc
Author: Lucio Palmieri
License: MIT
Keywords: agent,anonymization,llm,local-first,markdown,mcp,ocr,ollama,pdf,privacy
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Graphics :: Capture :: Scanners
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Markup
Requires-Python: <3.14,>=3.11
Requires-Dist: fastapi>=0.115.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: markdown-it-py>=3.0.0
Requires-Dist: mcp>=1.12.0
Requires-Dist: pillow>=11.0.0
Requires-Dist: pydantic-settings>=2.6.0
Requires-Dist: pydantic>=2.9.0
Requires-Dist: pypdf>=5.1.0
Requires-Dist: pypdfium2>=4.30.0
Requires-Dist: python-multipart>=0.0.18
Requires-Dist: reportlab>=4.2.0
Requires-Dist: uvicorn[standard]>=0.32.0
Provides-Extra: all
Requires-Dist: weasyprint>=68.1; extra == 'all'
Provides-Extra: pdf
Requires-Dist: weasyprint>=68.1; extra == 'pdf'
Description-Content-Type: text/markdown

# paperloom

> **Local-first document toolkit. Streaming OCR + PII anonymizer + 19 chainable tools. MCP-native.**

`paperloom` is the Python library and MCP server behind [paperloom](https://github.com/luciopalmieri/paperloom) — a local-first web app for OCR, PDF/Markdown/HTML transforms, and PII redaction. Every tool runs on your machine. No cloud round-trips, no telemetry.

## Why paperloom

paperloom rides a **state-of-the-art OCR model** — GLM-OCR scores **94.62 on OmniDocBench V1.5 (rank #1)** and is SOTA on formula / table recognition and information extraction. paperloom commits to **tracking the current SOTA**: when a stronger open model ships, the Ollama pin gets updated.

paperloom's value-add is **agent orchestration around the model**:

- **19 chainable tools** — `pdf-to-images → ocr → anonymize → markdown-to-pdf` in one call.
- **MCP server with security model** — `register_file` + path allowlist + `file_id` tokens. Drop-in for Claude Desktop, Claude Code, Cursor, Cline, Agno.
- **Built-in PII redaction** — OPF model, 8 entity categories, verbatim.
- **Streaming SSE** — Markdown emits page-by-page as the OCR model writes it.
- **One Ollama dep** — reuses any GLM-OCR model you already pulled. No multi-GB model zoo download.
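
The allowlist-plus-token idea behind the MCP security model can be illustrated with a small sketch. This is not paperloom's implementation: the `FileRegistry` class, its `lookup` method, and the demo paths are hypothetical, and only the `register_file` name and the allowlist/`file_id` concepts come from the list above.

```python
import secrets
from pathlib import Path


class FileRegistry:
    """Toy registry (hypothetical): opaque IDs for files inside allowed dirs."""

    def __init__(self, allowed_dirs):
        # Resolve once so symlinks and ".." segments cannot bypass the check.
        self._allowed = [Path(d).resolve() for d in allowed_dirs]
        self._files = {}

    def register_file(self, path):
        resolved = Path(path).resolve()
        if not any(resolved.is_relative_to(root) for root in self._allowed):
            raise PermissionError(f"{resolved} is outside the allowlist")
        # The client gets an unguessable token instead of a raw path.
        file_id = secrets.token_urlsafe(16)
        self._files[file_id] = resolved
        return file_id

    def lookup(self, file_id):
        # Raises KeyError for tokens that were never issued.
        return self._files[file_id]
```

The point of the design is that later tool calls carry only a `file_id`, so a client cannot point a tool at an arbitrary path after registration.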

For raw model quality on dense scientific PDFs, [`marker`](https://github.com/datalab-to/marker), [`docling`](https://github.com/docling-project/docling), and [`MinerU`](https://github.com/opendatalab/mineru) are excellent companion projects — paperloom doesn't try to out-research them, it focuses on the orchestration and privacy layer around the model.

## Install

```bash
# Library + CLI + MCP server (no PDF rendering, no anonymizer):
uvx paperloom doctor

# Full toolkit:
uvx --with 'paperloom[all]' paperloom doctor

# Or pip:
pip install paperloom            # core
pip install 'paperloom[pdf]'     # + WeasyPrint (markdown→pdf, html→pdf)
pip install 'paperloom[all]'     # everything published on PyPI
```

The `pdf` extra needs native libraries for WeasyPrint (for example, `brew install pango` on macOS).

The OPF anonymizer is **not** a PyPI extra — it's distributed as a git repo. The `anonymize` tool auto-installs it on first call (~250 MB Python deps + ~4 GB checkpoint). To opt out and install manually:

```bash
export PAPERLOOM_AUTO_INSTALL_OPF=0  # disables the auto-installer
uv pip install 'opf @ git+https://github.com/openai/privacy-filter@main'
```

## Use as a library

```python
from paperloom import ocr_to_markdown, anonymize, Chain

# One-shot OCR
md = ocr_to_markdown("scan.pdf")

# Redact PII
clean = anonymize(md, preset="balanced")

# Compose tools
result = Chain([
    ("pdf-to-images", {"dpi": 200}),
    ("ocr-to-markdown", {}),
    ("anonymize", {"preset": "recall"}),
]).run(["doc.pdf"])
```
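
Conceptually, a chain feeds each step's outputs into the next step. The following toy sketch shows only that idea and is not how paperloom's `Chain` is implemented; the `TOOLS` registry, `run_chain`, and both stand-in tools are hypothetical.

```python
from typing import Callable


# Hypothetical stand-in tools: each takes a list of inputs plus options
# and returns a list of outputs for the next step.
def split_pages(inputs: list[str], pages: int = 2) -> list[str]:
    return [f"{name}#page{n}" for name in inputs for n in range(1, pages + 1)]


def to_upper(inputs: list[str]) -> list[str]:
    return [text.upper() for text in inputs]


TOOLS: dict[str, Callable[..., list[str]]] = {
    "split-pages": split_pages,
    "to-upper": to_upper,
}


def run_chain(steps: list[tuple[str, dict]], inputs: list[str]) -> list[str]:
    """Run each (tool name, options) step on the previous step's outputs."""
    current = inputs
    for name, options in steps:
        current = TOOLS[name](current, **options)
    return current


result = run_chain([("split-pages", {"pages": 2}), ("to-upper", {})], ["doc.pdf"])
print(result)  # ['DOC.PDF#PAGE1', 'DOC.PDF#PAGE2']
```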

## Use as an MCP server

```bash
uvx --from paperloom paperloom-mcp
```

Wire into Claude Desktop:

```json
{
  "mcpServers": {
    "paperloom": {
      "command": "uvx",
      "args": ["--from", "paperloom", "paperloom-mcp"],
      "env": {
        "PAPERLOOM_MCP_ALLOWED_DIRS": "/Users/you/Documents,/Users/you/Downloads"
      }
    }
  }
}
```
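
If the client config already lists other MCP servers, the entry can be merged in programmatically. The sketch below builds the same JSON structure as above from a list of directories; the `add_paperloom_server` helper is hypothetical, and locating and writing the actual config file is left to the reader.

```python
import json


def add_paperloom_server(config: dict, allowed_dirs: list[str]) -> dict:
    """Return a copy of an MCP client config with the paperloom entry added."""
    updated = json.loads(json.dumps(config))  # deep copy of plain JSON data
    servers = updated.setdefault("mcpServers", {})
    servers["paperloom"] = {
        "command": "uvx",
        "args": ["--from", "paperloom", "paperloom-mcp"],
        # The allowlist is passed as one comma-separated string.
        "env": {"PAPERLOOM_MCP_ALLOWED_DIRS": ",".join(allowed_dirs)},
    }
    return updated


existing = {"mcpServers": {"other": {"command": "other-server"}}}
merged = add_paperloom_server(
    existing, ["/Users/you/Documents", "/Users/you/Downloads"]
)
print(json.dumps(merged, indent=2))
```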

## Use from the CLI

```bash
paperloom ocr scan.pdf -o out.md
paperloom anonymize out.md --preset recall
paperloom chain --steps pdf-to-images,ocr-to-markdown,anonymize doc.pdf
paperloom doctor      # check Ollama, glm-ocr, OPF, allowlist
```

## Requirements

- **Ollama** with `glm-ocr:latest` pulled (`ollama pull glm-ocr:latest`).
- Python 3.11 to 3.13 (the package metadata requires `>=3.11,<3.14`).
- **Hardware:** GLM-OCR is RAM-bound. Recommended: Apple Silicon M-series Pro with ≥ 24 GB unified RAM, or x86 with 24 GB+ GPU VRAM. Minimum 16 GB. **Do not run it with less than 16 GB: the OS can freeze hard enough to require a reboot.**
- (Optional) `pango` for the `pdf` extra; ~4 GB checkpoint for the OPF anonymizer (auto-downloaded on first call).
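
`paperloom doctor` is the supported way to verify a setup. For scripts that want a quick preflight of their own, the two hard requirements can be approximated as below. This is a sketch under assumptions: it uses Ollama's default local address (`http://localhost:11434`) and its `/api/tags` model listing, and all three helper names are hypothetical.

```python
import json
import sys
import urllib.request


def python_supported(version=sys.version_info) -> bool:
    # Mirrors the package metadata: >=3.11,<3.14.
    return (3, 11) <= tuple(version[:2]) < (3, 14)


def has_model(tags: dict, name: str = "glm-ocr:latest") -> bool:
    # `tags` is the parsed JSON body returned by Ollama's /api/tags endpoint.
    return any(m.get("name") == name for m in tags.get("models", []))


def ollama_has_model(
    base_url: str = "http://localhost:11434", name: str = "glm-ocr:latest"
) -> bool:
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            return has_model(json.load(resp), name)
    except (OSError, ValueError):
        # Ollama is not running, unreachable, or returned something unexpected.
        return False
```

Calling `python_supported()` and `ollama_has_model()` before starting a long job gives an early, cheap failure. Neither check covers the RAM recommendation above.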

## License

MIT. Depends on [OpenAI Privacy Filter](https://github.com/openai/privacy-filter) (Apache 2.0) when the OPF anonymizer is installed.
