Metadata-Version: 2.4
Name: natural-pdf
Version: 0.6.2
Summary: A more intuitive interface for working with PDFs
Author-email: Jonathan Soma <jonathan.soma@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/jsoma/natural-pdf
Project-URL: Repository, https://github.com/jsoma/natural-pdf
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: scikit-learn
Requires-Dist: markdown
Requires-Dist: pandas
Requires-Dist: pdfplumber>=0.11.7
Requires-Dist: colormath2
Requires-Dist: pillow
Requires-Dist: colour
Requires-Dist: numpy
Requires-Dist: urllib3
Requires-Dist: tqdm
Requires-Dist: rich
Requires-Dist: pydantic
Requires-Dist: jenkspy
Requires-Dist: scipy
Requires-Dist: scikit-image
Requires-Dist: openai
Requires-Dist: ipywidgets>=7.0.0
Requires-Dist: python-bidi
Requires-Dist: matplotlib
Requires-Dist: onnxruntime
Requires-Dist: huggingface_hub
Requires-Dist: platformdirs
Provides-Extra: ai
Requires-Dist: rapidocr; extra == "ai"
Requires-Dist: torch; extra == "ai"
Requires-Dist: torchvision; extra == "ai"
Requires-Dist: transformers[sentencepiece]; extra == "ai"
Requires-Dist: sentence-transformers; extra == "ai"
Requires-Dist: timm; extra == "ai"
Requires-Dist: doclayout_yolo; extra == "ai"
Provides-Extra: paddle
Requires-Dist: paddlepaddle>=3.0.0; extra == "paddle"
Requires-Dist: paddleocr>=3.0.1; extra == "paddle"
Requires-Dist: paddlex[ocr]>=3.0.2; extra == "paddle"
Requires-Dist: numpy<2.0; extra == "paddle"
Provides-Extra: export
Requires-Dist: pikepdf; extra == "export"
Requires-Dist: img2pdf; extra == "export"
Requires-Dist: jupytext; extra == "export"
Requires-Dist: nbformat; extra == "export"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-xdist; extra == "test"
Requires-Dist: setuptools; extra == "test"
Requires-Dist: tomli; extra == "test"
Requires-Dist: mktestdocs; extra == "test"
Provides-Extra: dev
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Requires-Dist: nox; extra == "dev"
Requires-Dist: nox-uv; extra == "dev"
Requires-Dist: scipy-stubs; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: uv; extra == "dev"
Requires-Dist: pipdeptree; extra == "dev"
Requires-Dist: nbformat; extra == "dev"
Requires-Dist: jupytext; extra == "dev"
Requires-Dist: nbclient==0.10.2; extra == "dev"
Requires-Dist: jupyter_core==5.7.2; extra == "dev"
Requires-Dist: ipykernel; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: setuptools; extra == "dev"
Requires-Dist: mktestdocs; extra == "dev"
Requires-Dist: mkdocs-redirects; extra == "dev"
Provides-Extra: quality
Requires-Dist: pyspellchecker; extra == "quality"
Requires-Dist: langdetect; extra == "quality"
Provides-Extra: all
Requires-Dist: natural-pdf[ai]; extra == "all"
Requires-Dist: natural-pdf[export]; extra == "all"
Dynamic: license-file

# Natural PDF

[![CI](https://github.com/jsoma/natural-pdf/actions/workflows/ci.yml/badge.svg)](https://github.com/jsoma/natural-pdf/actions/workflows/ci.yml)

A friendly library for working with PDFs, built on top of [pdfplumber](https://github.com/jsvine/pdfplumber).

Natural PDF lets you find and extract content from PDFs using simple code that makes sense.

- [Complete documentation here](https://jsoma.github.io/natural-pdf)
- [Live demos here](https://colab.research.google.com/github/jsoma/natural-pdf/)

<div style="max-width: 400px; margin: auto"><a href="sample-screen.png"><img src="sample-screen.png"></a></div>

## Installation

```bash
pip install natural-pdf
```

Need OCR, semantic search, export, or AI-powered extraction? Install what you need:

```bash
pip install "natural-pdf[all]"      # Recommended feature-complete install
pip install "natural-pdf[export]"   # Export helpers only
pip install rapidocr                # Default OCR backend
pip install easyocr                 # Extra OCR backend
pip install "natural-pdf[paddle]"   # PaddleOCR stack
pip install "surya-ocr<0.15"        # Surya OCR engine
pip install python-doctr            # Doctr OCR engine
```

More details in the [installation guide](https://jsoma.github.io/natural-pdf/installation/).

`natural-pdf[all]` is the recommended runtime bundle. It covers the core features: the default RapidOCR engine, sentence-transformers-based semantic search, QA/extraction dependencies, YOLO layout detection, and export support. It does not install every optional backend: extra engines such as EasyOCR, PaddleOCR, Surya, and Doctr stay opt-in, and Natural PDF tells you what to install when you try to use one that is missing.

Check your local setup with:

```bash
npdf doctor
```
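
If the `npdf` command is not on your path (inside a notebook, for example), a rough check can be sketched with the standard library alone. This is not how `npdf doctor` works internally; it only reports whether each optional package can be found, and the import names in `OCR_BACKENDS` are assumptions based on the packages listed above.

```python
from importlib.util import find_spec

# Assumed top-level import names for the optional OCR packages listed above.
OCR_BACKENDS = {
    "rapidocr": "rapidocr",
    "easyocr": "easyocr",
    "paddleocr": "paddleocr",
    "surya-ocr": "surya",
    "python-doctr": "doctr",
}


def installed_backends(backends=OCR_BACKENDS):
    """Map each pip package name to True if its module can be located."""
    return {
        pip_name: find_spec(module) is not None
        for pip_name, module in backends.items()
    }


if __name__ == "__main__":
    for pip_name, ok in installed_backends().items():
        print(f"{pip_name:15} {'installed' if ok else 'missing'}")
```

`find_spec` locates a module without importing it, so the check stays fast even when heavy packages such as PaddleOCR are installed.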

## Quick Start

```python
from natural_pdf import PDF

# Open a PDF
pdf = PDF('https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf')
page = pdf.pages[0]

# Extract all of the text on the page
page.extract_text()

# Find elements using CSS-like selectors
heading = page.find('text:contains("Summary"):bold')

# Extract content below the heading
content = heading.below().extract_text()

# Examine all the bold text on the page
page.find_all('text:bold').show()

# Exclude parts of the page from selectors/extractors
header = page.find('text:contains("CONFIDENTIAL")').above()
footer = page.find_all('line')[-1].below()
page.add_exclusion(header)
page.add_exclusion(footer)

# Extract clean text from the page ignoring exclusions
clean_text = page.extract_text()
```

As a bonus, `page.viewer()` gives you an interactive way to explore the PDF.

## Key Features

Natural PDF offers a range of features for working with PDFs:

*   **CSS-like Selectors:** Find elements using intuitive query strings (`page.find('text:bold')`).
*   **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
*   **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
*   **OCR Integration:** Extract text from scanned documents with RapidOCR by default, plus opt-in engines like EasyOCR, PaddleOCR, or Surya.
*   **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
*   **Document QA:** Ask natural language questions about your document's content.
*   **Semantic Search:** Rank pages within a PDF by semantic similarity using sentence-transformer embeddings.
*   **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.

## Learn More

Dive deeper into the features and explore advanced usage in the [**Complete Documentation**](https://jsoma.github.io/natural-pdf).

## Extending Natural PDF

Natural PDF now exposes its pluggable engines through small helper functions so you rarely have to touch the core registry directly. Two handy entry points:

```python
from natural_pdf.tables import register_table_function

def table_delim(region, *, context=None, **kwargs):
    # return a TableResult or list-of-lists
    ...

register_table_function("table_delim", table_delim)
```
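
As a sketch of what the body of `table_delim` might look like, the version below splits a region's text on a delimiter and returns a list-of-lists. It assumes the region exposes `extract_text()` as in the Quick Start; the `delimiter` keyword and the `split_delimited` helper are hypothetical names introduced for illustration, and whether extra keywords reach the function depends on how it is called.

```python
def split_delimited(text, delimiter="|"):
    """Turn delimiter-separated lines into rows of stripped cells."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between rows
        rows.append([cell.strip() for cell in line.split(delimiter)])
    return rows


def table_delim(region, *, context=None, delimiter="|", **kwargs):
    # `delimiter` is a hypothetical keyword for this sketch, not a documented option.
    text = region.extract_text() or ""
    return split_delimited(text, delimiter)
```

Keeping the parsing in a plain function like `split_delimited` means it can be tested on strings without opening a PDF. The function is registered the same way as above, with `register_table_function("table_delim", table_delim)`.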

```python
from natural_pdf.selectors import register_selector_engine

class DebugSelectorEngine:
    def query(self, *, context, selector, options):
        ...

register_selector_engine("debug", lambda **_: DebugSelectorEngine())
```
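
One possible shape for a debugging engine is a thin wrapper that records each selector and then delegates to another engine. The sketch below assumes only the `query` signature shown above; the `inner` and `log` parameters are hypothetical, and how you obtain an engine to wrap is not covered here.

```python
class DebugSelectorEngine:
    """Record every selector query, then hand it to another engine."""

    def __init__(self, inner, log=print):
        self.inner = inner  # hypothetical: any object with a matching query()
        self.log = log      # callable that receives one message per query
        self.calls = []     # selectors seen so far, in order

    def query(self, *, context, selector, options):
        self.calls.append(selector)
        self.log(f"selector query: {selector!r}")
        # Return whatever the wrapped engine returns, unchanged.
        return self.inner.query(context=context, selector=selector, options=options)
```

Because the wrapper passes the result through untouched, it makes no assumption about what a selector engine is expected to return.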


## Best Friends

Natural PDF sits on top of a *lot* of fantastic tools and models, some of which are:

- [pdfplumber](https://github.com/jsvine/pdfplumber)
- [EasyOCR](https://www.jaided.ai/easyocr/)
- [PaddleOCR](https://paddlepaddle.github.io/PaddleOCR/latest/en/index.html)
- [Surya](https://github.com/VikParuchuri/surya)
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [doctr](https://github.com/mindee/doctr)
- [docling](https://github.com/docling-project/docling)
- [Hugging Face](https://huggingface.co/models)
