Metadata-Version: 2.4
Name: llama-index-readers-pdfmux
Version: 0.1.0
Summary: LlamaIndex reader for pdfmux -- self-healing PDF extraction for RAG pipelines
Project-URL: Homepage, https://pdfmux.com
Project-URL: Repository, https://github.com/NameetP/llama-index-readers-pdfmux
Project-URL: Issues, https://github.com/NameetP/llama-index-readers-pdfmux/issues
Author: Nameet Potnis
License-Expression: MIT
License-File: LICENSE
Keywords: document-loader,llama-index,llamaindex,pdf,pdfmux,rag,reader
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.11
Requires-Dist: llama-index-core>=0.10.0
Requires-Dist: pdfmux>=1.2.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# llama-index-readers-pdfmux

[![PyPI version](https://img.shields.io/pypi/v/llama-index-readers-pdfmux.svg)](https://pypi.org/project/llama-index-readers-pdfmux/)
[![Python versions](https://img.shields.io/pypi/pyversions/llama-index-readers-pdfmux.svg)](https://pypi.org/project/llama-index-readers-pdfmux/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

LlamaIndex reader for [pdfmux](https://pdfmux.com) -- self-healing PDF extraction for RAG pipelines.

## Why pdfmux?

Most PDF loaders use a single extraction method and silently fail on complex layouts. pdfmux routes each page through the best extraction pipeline automatically:

- **Smart routing** -- selects the optimal parser per page (text-heavy, scanned, tables, mixed)
- **Confidence scoring** -- every chunk includes a confidence score so your RAG pipeline can filter or re-rank
- **Self-healing** -- retries with alternative extractors when the primary one returns low-quality output
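The confidence score is the piece your pipeline code touches most often. A minimal sketch of confidence-based gating, with plain dicts standing in for LlamaIndex `Document` objects and an arbitrary 0.8 threshold:

```python
# Hypothetical chunks shaped like PDFMuxReader output metadata.
docs = [
    {"text": "Revenue grew 12%.", "metadata": {"page_start": 1, "confidence": 0.94}},
    {"text": "[garbled table]", "metadata": {"page_start": 2, "confidence": 0.41}},
]

def filter_by_confidence(docs, threshold=0.8):
    """Keep only chunks whose extraction confidence meets the threshold."""
    return [d for d in docs if d["metadata"].get("confidence", 0.0) >= threshold]

kept = filter_by_confidence(docs)  # drops the garbled page-2 chunk
```

Low-confidence chunks can also be re-ranked or routed to a human review queue instead of dropped; the right policy depends on how much recall your application can trade for precision.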

## Install

```bash
pip install llama-index-readers-pdfmux
```

## Usage

```python
from llama_index_readers_pdfmux import PDFMuxReader

reader = PDFMuxReader()
docs = reader.load_data("report.pdf")
```

Each `Document` includes metadata with extraction quality signals:

```python
reader = PDFMuxReader(quality="high")
for doc in reader.load_data("report.pdf"):
    print(doc.metadata)
    # {
    #   "source": "report.pdf",
    #   "title": "Q4 Results",
    #   "page_start": 1,
    #   "page_end": 3,
    #   "tokens": 820,
    #   "confidence": 0.94
    # }
```

### Options

```python
# Quality presets: "fast", "standard" (default), "high"
reader = PDFMuxReader(quality="high")

# Load all PDFs in a directory
docs = reader.load_data("./papers/")

# Custom glob pattern
reader = PDFMuxReader(glob="**/*.pdf")
docs = reader.load_data("./papers/")

# Attach extra metadata
docs = reader.load_data("report.pdf", extra_info={"project": "Q4 analysis"})
```

### With LlamaIndex pipelines

```python
from llama_index.core import VectorStoreIndex
from llama_index_readers_pdfmux import PDFMuxReader

reader = PDFMuxReader(quality="high")
docs = reader.load_data("./papers/")

# Filter low-confidence chunks (default to 0.0 if a chunk lacks a score)
docs = [d for d in docs if d.metadata.get("confidence", 0.0) > 0.8]

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What were the key findings?")
```

## License

MIT
