Metadata-Version: 2.4
Name: pdfsmith
Version: 0.2.0
Summary: PDF to Markdown conversion with multiple backend support
Project-URL: Homepage, https://github.com/applied-artificial-intelligence/pdfsmith
Project-URL: Documentation, https://github.com/applied-artificial-intelligence/pdfsmith#readme
Project-URL: Repository, https://github.com/applied-artificial-intelligence/pdfsmith
Project-URL: Issues, https://github.com/applied-artificial-intelligence/pdfsmith/issues
Author-email: Applied AI <info@applied-ai.com>
License: MIT
Keywords: document processing,markdown,ocr,pdf,text extraction
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Provides-Extra: all
Requires-Dist: docling>=2.0; extra == 'all'
Requires-Dist: extractous>=0.1; extra == 'all'
Requires-Dist: kreuzberg>=3.0; extra == 'all'
Requires-Dist: pdfminer-six>=20221105; extra == 'all'
Requires-Dist: pdfplumber>=0.10; extra == 'all'
Requires-Dist: pymupdf4llm>=0.0.10; extra == 'all'
Requires-Dist: pymupdf>=1.23; extra == 'all'
Requires-Dist: pypdf>=4.0; extra == 'all'
Requires-Dist: pypdfium2>=4.0; extra == 'all'
Requires-Dist: unstructured>=0.10; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: aws
Requires-Dist: boto3>=1.34; extra == 'aws'
Requires-Dist: pymupdf>=1.23; extra == 'aws'
Provides-Extra: azure
Requires-Dist: azure-ai-documentintelligence>=1.0; extra == 'azure'
Provides-Extra: commercial
Requires-Dist: azure-ai-documentintelligence>=1.0; extra == 'commercial'
Requires-Dist: boto3>=1.34; extra == 'commercial'
Requires-Dist: databricks-sdk>=0.20; extra == 'commercial'
Requires-Dist: google-cloud-documentai>=2.0; extra == 'commercial'
Requires-Dist: google-cloud-storage>=2.0; extra == 'commercial'
Requires-Dist: llama-parse>=0.6; extra == 'commercial'
Requires-Dist: pymupdf>=1.23; extra == 'commercial'
Provides-Extra: databricks
Requires-Dist: databricks-sdk>=0.20; extra == 'databricks'
Provides-Extra: dev
Requires-Dist: mypy>=1.11; extra == 'dev'
Requires-Dist: pre-commit>=3.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: reportlab>=4.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: docling
Requires-Dist: docling>=2.0; extra == 'docling'
Provides-Extra: extractous
Requires-Dist: extractous>=0.1; extra == 'extractous'
Provides-Extra: frontier
Requires-Dist: anthropic>=0.40; extra == 'frontier'
Requires-Dist: google-genai>=1.0; extra == 'frontier'
Requires-Dist: openai>=1.50; extra == 'frontier'
Provides-Extra: gemini
Requires-Dist: google-genai>=1.0; extra == 'gemini'
Provides-Extra: google
Requires-Dist: google-cloud-documentai>=2.0; extra == 'google'
Requires-Dist: google-cloud-storage>=2.0; extra == 'google'
Provides-Extra: kreuzberg
Requires-Dist: kreuzberg>=3.0; extra == 'kreuzberg'
Provides-Extra: light
Requires-Dist: pdfplumber>=0.10; extra == 'light'
Requires-Dist: pymupdf>=1.23; extra == 'light'
Requires-Dist: pypdf>=4.0; extra == 'light'
Provides-Extra: llamaparse
Requires-Dist: llama-parse>=0.6; extra == 'llamaparse'
Provides-Extra: marker
Requires-Dist: marker-pdf>=1.0; extra == 'marker'
Provides-Extra: openai
Requires-Dist: openai>=1.50; extra == 'openai'
Provides-Extra: pdfminer
Requires-Dist: pdfminer-six>=20221105; extra == 'pdfminer'
Provides-Extra: pdfplumber
Requires-Dist: pdfplumber>=0.10; extra == 'pdfplumber'
Provides-Extra: pymupdf
Requires-Dist: pymupdf>=1.23; extra == 'pymupdf'
Provides-Extra: pymupdf4llm
Requires-Dist: pymupdf4llm>=0.0.10; extra == 'pymupdf4llm'
Provides-Extra: pypdf
Requires-Dist: pypdf>=4.0; extra == 'pypdf'
Provides-Extra: pypdfium2
Requires-Dist: pypdfium2>=4.0; extra == 'pypdfium2'
Provides-Extra: recommended
Requires-Dist: kreuzberg>=3.0; extra == 'recommended'
Requires-Dist: pdfplumber>=0.10; extra == 'recommended'
Requires-Dist: pymupdf4llm>=0.0.10; extra == 'recommended'
Requires-Dist: pypdf>=4.0; extra == 'recommended'
Provides-Extra: unstructured
Requires-Dist: pikepdf>=9.0; extra == 'unstructured'
Requires-Dist: unstructured>=0.10; extra == 'unstructured'
Description-Content-Type: text/markdown

# pdfsmith

> PDF to Markdown conversion with multiple backend support

[![PyPI version](https://badge.fury.io/py/pdfsmith.svg)](https://badge.fury.io/py/pdfsmith)
[![CI](https://github.com/applied-artificial-intelligence/pdfsmith/actions/workflows/ci.yaml/badge.svg)](https://github.com/applied-artificial-intelligence/pdfsmith/actions/workflows/ci.yaml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A unified interface to 19+ PDF parsing libraries including frontier LLMs. Pick the right tool for the job, or let pdfsmith choose for you.

## Why pdfsmith?

- **One API, many backends** - Switch between parsers without changing your code
- **Auto-selection** - Automatically uses the best available parser
- **Lightweight core** - Install only the backends you need
- **Battle-tested** - Wrappers refined through extensive benchmarking

## Installation

```bash
# Core package (no backends)
pip install pdfsmith

# With lightweight backends
# (quoting the extra keeps shells such as zsh from expanding the brackets)
pip install "pdfsmith[light]"

# Recommended stack (good balance of quality and speed)
pip install "pdfsmith[recommended]"

# All open-source backends
pip install "pdfsmith[all]"

# Frontier LLMs (GPT, Claude, Gemini)
pip install "pdfsmith[frontier]"

# Commercial cloud APIs
pip install "pdfsmith[commercial]"

# Specific backend
pip install "pdfsmith[docling]"
```

## Quick Start

```python
from pdfsmith import parse

# Auto-select best available backend
markdown = parse("document.pdf")

# Use a specific backend
markdown = parse("document.pdf", backend="docling")

# Check available backends
from pdfsmith import available_backends
for backend in available_backends():
    print(f"{backend.name}: {backend.description}")
```
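
When a preferred backend is not installed or fails on a given file, one option is to try several backends in order. The helper below is a minimal sketch, not part of pdfsmith: `parse_with_fallback` and its `parse_fn` parameter are hypothetical names, and it assumes that `parse` signals failure by raising an exception.

```python
from typing import Callable, Iterable, Optional

try:
    from pdfsmith import parse  # documented entry point
except ImportError:  # pdfsmith not installed; a parse_fn must be passed explicitly
    parse = None


def parse_with_fallback(
    path: str,
    backends: Iterable[str],
    parse_fn: Optional[Callable[..., str]] = None,
) -> str:
    """Try each backend in order and return the first successful result."""
    parse_fn = parse_fn or parse
    if parse_fn is None:
        raise RuntimeError("pdfsmith is not installed and no parse_fn was given")
    errors = []
    for name in backends:
        try:
            return parse_fn(path, backend=name)
        except Exception as exc:  # assumption: failures surface as exceptions
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

For example, `parse_with_fallback("document.pdf", ["docling", "pymupdf4llm", "pypdf"])` would try the heaviest backend first and fall back to lighter ones.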

## CLI Usage

```bash
# Parse PDF to stdout
pdfsmith parse document.pdf

# Parse to file
pdfsmith parse document.pdf -o output.md

# Use specific backend
pdfsmith parse document.pdf -b docling

# List available backends
pdfsmith backends
```
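
To convert a whole directory, the CLI can be driven from a short script. This is a sketch that uses only the `parse`, `-o`, and `-b` options shown above; `build_command` and `convert_directory` are hypothetical helper names, and the use of `check=True` assumes the CLI exits with a non-zero status on failure.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(pdf: Path, out_dir: Path, backend: Optional[str] = None) -> List[str]:
    """Build a `pdfsmith parse` invocation that writes <out_dir>/<name>.md."""
    cmd = ["pdfsmith", "parse", str(pdf), "-o", str(out_dir / (pdf.stem + ".md"))]
    if backend:
        cmd += ["-b", backend]
    return cmd


def convert_directory(src: Path, out_dir: Path, backend: Optional[str] = None) -> None:
    """Run the CLI once per PDF in `src`, stopping at the first failure."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(src.glob("*.pdf")):
        # Assumption: the CLI returns a non-zero exit code when parsing fails.
        subprocess.run(build_command(pdf, out_dir, backend), check=True)
```

Calling `convert_directory(Path("pdfs"), Path("markdown"), backend="docling")` would then write one `.md` file per input PDF.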

## Available Backends

### Open Source

| Backend | Weight | Best For |
|---------|--------|----------|
| `docling` | heavy | Highest quality, complex documents |
| `marker` | heavy | Academic papers, LaTeX content |
| `pymupdf4llm` | medium | Good balance of speed and quality |
| `kreuzberg` | medium | Fast extraction with OCR |
| `unstructured` | medium | Versatile document processing |
| `pdfplumber` | light | Tables and structured data |
| `pymupdf` | light | Fast general-purpose extraction |
| `pypdf` | light | Lightweight, pure Python |
| `pdfminer` | light | Mature, handles encodings well |
| `pypdfium2` | light | Chrome's PDF engine |
| `extractous` | medium | Rust-based extraction |

### Commercial Cloud APIs

| Backend | Provider | Cost | Best For |
|---------|----------|------|----------|
| `aws_textract` | AWS | $1.50/1k pages | High-accuracy OCR |
| `azure_document_intelligence` | Azure | $1.50/1k pages | Enterprise documents |
| `google_document_ai` | Google Cloud | $1.50/1k pages | Multi-language support |
| `databricks` | Databricks | ~$3/1k pages | SQL-based workflows |
| `llamaparse` | LlamaIndex | $3/1k pages | Cost-effective API |

### Frontier LLMs

| Backend | Model | Cost | Best For |
|---------|-------|------|----------|
| `anthropic` | Claude Sonnet 4.5 | ~$0.04/page | High accuracy |
| `openai` | GPT-4o | ~$0.02/page | General purpose |
| `gemini` | Gemini 2.0 Flash | ~$0.001/page | Budget LLM option |

**Note**: Frontier LLM backends require API keys set via environment variables (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`).
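
A missing key is easier to diagnose before a parse call than after it. The check below is a small sketch: the backend-to-variable mapping is an assumption inferred from the order of the variables in the note above, and `missing_api_key` is a hypothetical helper, not a pdfsmith function.

```python
import os
from typing import Mapping, Optional

# Assumed mapping, inferred from the order of the variables in the note.
REQUIRED_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def missing_api_key(backend: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the unset variable name for `backend`, or None if nothing is missing."""
    env = os.environ if env is None else env
    var = REQUIRED_KEYS.get(backend)
    if var is not None and not env.get(var):
        return var
    return None
```

Backends that are not in the mapping, such as the open-source ones, are treated as needing no key.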

### Choosing a Backend

- **Best quality**: `anthropic` or `openai` - Frontier LLM accuracy (highest cost)
- **Best value**: `llamaparse` - Near-LLM quality at roughly one-tenth the per-page cost
- **Structure preservation**: `docling` - Deep learning, GPU recommended
- **Academic papers**: `marker` - Optimized for LaTeX/equations
- **Tables**: `pdfplumber` - Excellent table detection
- **Speed**: `pymupdf` or `kreuzberg` - Fast extraction
- **Minimal dependencies**: `pypdf` - Pure Python, no binaries
- **Budget LLM**: `gemini` with `gemini-2.0-flash` - Lowest-cost LLM option
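
The per-page prices in the tables above make it straightforward to estimate the cost of a job before picking a backend. The sketch below copies those approximate figures into a dictionary; `COST_PER_PAGE` and `estimate_cost` are hypothetical names, actual provider pricing may differ, and backends without a listed price are treated as having no API fee.

```python
# Approximate per-page prices in USD, copied from the tables above.
COST_PER_PAGE = {
    "aws_textract": 1.50 / 1000,
    "azure_document_intelligence": 1.50 / 1000,
    "google_document_ai": 1.50 / 1000,
    "databricks": 3.00 / 1000,
    "llamaparse": 0.003,
    "anthropic": 0.04,
    "openai": 0.02,
    "gemini": 0.001,
}


def estimate_cost(backend: str, pages: int) -> float:
    """Estimated USD cost for `pages` pages; unlisted backends count as 0.0."""
    return COST_PER_PAGE.get(backend, 0.0) * pages


def cheapest(backends, pages: int) -> str:
    """Return the backend with the lowest estimated cost for this job."""
    return min(backends, key=lambda name: estimate_cost(name, pages))
```

For a 10,000-page archive, `estimate_cost("anthropic", 10_000)` comes to about $400, while `estimate_cost("llamaparse", 10_000)` comes to about $30.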

### System Dependencies

Some backends require system packages for OCR functionality:

**Tesseract OCR** (for `kreuzberg` and `unstructured` with OCR):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
# Download from https://github.com/UB-Mannheim/tesseract/wiki
```

Without Tesseract, these backends still work for text-based PDFs but cannot extract text from scanned or image-only PDFs.
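
Whether the `tesseract` executable is on `PATH` can be checked from Python before sending scanned documents to one of these backends. This is a sketch using the standard library's `shutil.which`; `OCR_BACKENDS` and `needs_tesseract_warning` are hypothetical names, and the check says nothing about which language packs are installed.

```python
import shutil

# Backends listed above as relying on Tesseract for OCR.
OCR_BACKENDS = {"kreuzberg", "unstructured"}


def needs_tesseract_warning(backend: str, which=shutil.which) -> bool:
    """True if `backend` uses Tesseract for OCR and no executable is on PATH."""
    return backend in OCR_BACKENDS and which("tesseract") is None
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use, `needs_tesseract_warning("kreuzberg")` is enough.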

## Async Support

```python
import asyncio
from pdfsmith import parse_async

async def main() -> str:
    # Async parsing (uses backend's native async if available)
    return await parse_async("document.pdf")

markdown = asyncio.run(main())
```
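
Async parsing is most useful when several documents are processed at once. The sketch below runs a parser coroutine over many paths with a cap on concurrency; `parse_many` and `max_concurrent` are hypothetical names, and the parser is passed in as `parse_fn` so that `parse_async` (or any coroutine function with the same shape) can be used.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def parse_many(
    paths: Iterable[str],
    parse_fn: Callable[[str], Awaitable[str]],
    max_concurrent: int = 4,
) -> Dict[str, str]:
    """Parse several PDFs concurrently, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(path: str) -> str:
        async with semaphore:  # limits how many parses run at once
            return await parse_fn(path)

    paths = list(paths)
    results = await asyncio.gather(*(worker(p) for p in paths))
    return dict(zip(paths, results))
```

With pdfsmith installed, this could be called as `asyncio.run(parse_many(["a.pdf", "b.pdf"], parse_async))`. A cap on concurrency is worth keeping for the cloud and LLM backends, where each call is billed and may be rate-limited.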

## Benchmarks

pdfsmith's backend wrappers were developed and refined through the [pdf-bench](https://github.com/applied-artificial-intelligence/pdf-bench) benchmarking project, which evaluates parser performance across diverse document types.

## License

MIT

## Contributing

Contributions welcome! Please read our contributing guidelines before submitting PRs.
