Metadata-Version: 2.4
Name: langchain-pdfmux
Version: 0.2.0
Summary: LangChain document loader for pdfmux — self-healing PDF extraction
Project-URL: Homepage, https://pdfmux.com
Project-URL: Repository, https://github.com/NameetP/langchain-pdfmux
Project-URL: Issues, https://github.com/NameetP/langchain-pdfmux/issues
Author: Nameet Potnis
License-Expression: MIT
License-File: LICENSE
Keywords: document-loader,langchain,llm,pdf,pdfmux,rag
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.11
Requires-Dist: langchain-core>=0.2.0
Requires-Dist: pdfmux>=1.2.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# langchain-pdfmux

[![PyPI version](https://img.shields.io/pypi/v/langchain-pdfmux.svg)](https://pypi.org/project/langchain-pdfmux/)
[![Python versions](https://img.shields.io/pypi/pyversions/langchain-pdfmux.svg)](https://pypi.org/project/langchain-pdfmux/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

LangChain document loader for [pdfmux](https://pdfmux.com) -- self-healing PDF extraction for RAG pipelines.

## Why pdfmux?

Most PDF loaders rely on a single extraction method, which can fail silently on complex layouts such as scanned pages or tables. pdfmux instead routes each page to the extraction pipeline best suited to it:

- **Smart routing** -- selects the optimal parser per page (text-heavy, scanned, tables, mixed)
- **Confidence scoring** -- every chunk includes a confidence score so your RAG pipeline can filter or re-rank
- **Self-healing** -- retries with alternative extractors when the primary one returns low-quality output

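The self-healing idea above can be pictured as a fallback loop. This is a conceptual sketch only, not pdfmux's actual internals: the `Extractor` signature, the `extract_with_fallback` helper, the stand-in extractors, and the 0.8 threshold are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical extractor signature: takes raw page bytes, returns (text, confidence).
Extractor = Callable[[bytes], tuple[str, float]]


def extract_with_fallback(
    page: bytes,
    extractors: list[Extractor],
    min_confidence: float = 0.8,
) -> tuple[str, float]:
    """Try extractors in order and return the first result that clears the threshold.

    If no extractor clears it, return the highest-confidence result seen.
    """
    if not extractors:
        raise ValueError("at least one extractor is required")
    best_text, best_conf = "", -1.0
    for extract in extractors:
        text, conf = extract(page)
        if conf >= min_confidence:
            return text, conf
        if conf > best_conf:
            best_text, best_conf = text, conf
    return best_text, best_conf


# Stand-in extractors for illustration only; real ones would parse the page.
def fast_text_layer(page: bytes) -> tuple[str, float]:
    return "", 0.1  # e.g. a scanned page with no text layer


def ocr(page: bytes) -> tuple[str, float]:
    return "Quarterly revenue rose.", 0.92


text, confidence = extract_with_fallback(b"%PDF-", [fast_text_layer, ocr])
```

Here the first extractor scores below the threshold, so the loop falls through to the second one and returns its output.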
## Install

```bash
pip install langchain-pdfmux
```

## Usage

```python
from langchain_pdfmux import PDFMuxLoader

docs = PDFMuxLoader("report.pdf").load()
```

Each `Document` carries metadata that includes extraction quality signals such as `confidence`:

```python
loader = PDFMuxLoader("report.pdf", quality="high")
for doc in loader.lazy_load():
    print(doc.metadata)
    # {
    #   "source": "report.pdf",
    #   "title": "Q4 Results",
    #   "page_start": 1,
    #   "page_end": 3,
    #   "tokens": 820,
    #   "confidence": 0.94
    # }
```
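One way to use the `confidence` value is to drop low-scoring chunks before indexing. The following is a minimal sketch: `filter_by_confidence` is a hypothetical helper, not part of this package, the 0.8 cutoff is an arbitrary assumption, and the sample objects are stand-ins for LangChain `Document` instances so the snippet runs without a PDF.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator


def filter_by_confidence(docs: Iterable[Any], min_confidence: float = 0.8) -> Iterator[Any]:
    """Yield only documents whose metadata["confidence"] meets the threshold.

    Documents without a confidence value are treated as 0.0 and dropped.
    """
    for doc in docs:
        if doc.metadata.get("confidence", 0.0) >= min_confidence:
            yield doc


# Stand-ins for Document objects: anything with a .metadata dict works here.
sample_docs = [
    SimpleNamespace(page_content="clean text", metadata={"confidence": 0.94}),
    SimpleNamespace(page_content="noisy scan", metadata={"confidence": 0.41}),
    SimpleNamespace(page_content="no score", metadata={}),
]

kept = list(filter_by_confidence(sample_docs, min_confidence=0.8))
```

With the real loader, the same helper should accept the stream directly, for example `filter_by_confidence(PDFMuxLoader("report.pdf").lazy_load())`.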

### Options

```python
# Quality presets: "fast", "standard" (default), "high"
loader = PDFMuxLoader("report.pdf", quality="high")

# Load all PDFs in a directory
loader = PDFMuxLoader("./papers/")

# Custom glob pattern
loader = PDFMuxLoader("./papers/", glob="**/*.pdf")

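# Stream documents one at a time with lazy_load instead of loading them all at once
for doc in PDFMuxLoader("large.pdf").lazy_load():
    process(doc)  # process() is a placeholder for your own handling code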
```
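When streaming with `lazy_load`, downstream steps such as embedding often work on groups of documents. Below is a minimal sketch of batching a lazy stream using only the standard library. The `in_batches` helper is hypothetical and not part of this package, and the batch size is an arbitrary choice.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Group a lazy stream into lists of at most `size` items.

    Only one batch is held in memory at a time; the last batch may be shorter.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Demonstration with plain integers standing in for documents.
batches = list(in_batches(range(5), 2))
```

Applied to the loader, this would look like `for batch in in_batches(PDFMuxLoader("large.pdf").lazy_load(), 32): ...`, with each `batch` handed to your own embedding or indexing code.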

## License

MIT
