Metadata-Version: 2.4
Name: paradox-pdf
Version: 0.3.1
Summary: Structured text extraction framework for digital and scanned PDFs with inline formatting preservation
Author-email: CreAI <feliperodriguez@creai.mx>
License: Proprietary
Project-URL: Homepage, https://github.com/CreAI-mx/DocumentExtractor
Project-URL: Repository, https://github.com/CreAI-mx/DocumentExtractor
Keywords: pdf,extraction,ocr,nlp,document,parsing,table,structured
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Markup
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: PyMuPDF>=1.23.0
Requires-Dist: opencv-python>=4.8.0
Requires-Dist: Pillow>=10.0.0
Requires-Dist: torch>=2.0
Requires-Dist: torchvision>=0.15
Requires-Dist: timm>=0.9
Requires-Dist: doclayout-yolo
Requires-Dist: huggingface_hub
Requires-Dist: rapidocr-onnxruntime>=1.3
Requires-Dist: numpy<2.0,>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy
Requires-Dist: transformers>=4.30
Provides-Extra: gpu
Requires-Dist: paddleocr[doc-parser]>=3.0; extra == "gpu"
Requires-Dist: paddlex>=3.4; extra == "gpu"

# paradox-pdf  ·  `[cpu]` or `[gpu]`

> Structured text extraction for digital and scanned PDFs — usable as a Python library or a CLI.
>
> **Two install profiles, one wheel.** Pick CPU for portability, GPU for higher accuracy on photographed/distorted documents.

[![PyPI](https://img.shields.io/pypi/v/paradox-pdf.svg)](https://pypi.org/project/paradox-pdf/)
[![Python](https://img.shields.io/pypi/pyversions/paradox-pdf.svg)](https://pypi.org/project/paradox-pdf/)
[![Status](https://img.shields.io/pypi/status/paradox-pdf.svg)](https://pypi.org/project/paradox-pdf/)

```bash
pip install paradox-pdf            # CPU profile (default — works everywhere)
pip install 'paradox-pdf[gpu]'     # GPU profile (PaddleOCR-VL 0.9B, needs CUDA)
```

Paradox parses any PDF — digital, scanned, or photographed — into a single hierarchical JSON tree of typed elements (titles, paragraphs, tables, lists, headers, signatures, …) with inline marks (bold, italic, underline, strikethrough, …) preserved. It auto-routes each page to the right pipeline (PyMuPDF font analysis for digital, YOLO + OCR + Table Transformer for scanned) and merges the results into a unified document.

---

## Install

Two profiles. Pick one:

### CPU (default — works on any machine)

```bash
pip install paradox-pdf
```

Vision pipeline: YOLO + Table Transformer + RapidOCR (ONNX) + TexTAR. Runs on CPU; uses GPU when PyTorch detects CUDA. Wheel ~60 MB; fully self-contained — bundled TexTAR weights, no extra setup.

### GPU (PaddleOCR-VL 0.9B for higher accuracy)

```bash
pip install 'paradox-pdf[gpu]'

# paddlepaddle-gpu>=3.0 is not on PyPI (CUDA-version-specific wheels);
# install it from Paddle's index, picking the URL that matches your CUDA:
pip install paddlepaddle-gpu==3.2.1 \
  -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```

Adds PaddleOCR-VL — a 0.9B VLM that does layout + OCR + table structure in one pass. Higher accuracy on photographed/distorted documents and complex tables; ~12 GB VRAM at inference.

### Which one when

| Scenario | Profile |
|---|---|
| Digital PDFs (text already in PDF) | either; CPU is enough |
| Cleanly-scanned documents | CPU |
| Photographed / off-center / curved pages | GPU |
| Complex multi-level-header tables | GPU |
| Production with no GPU available | CPU |

You can also mix: in the GPU install, pass `backend="cpu"` to force the classic pipeline for a particular call.

Requires Python 3.9+. The first run downloads a few HuggingFace models (~500 MB for the CPU profile, ~2 GB for the GPU profile) into the local cache; subsequent runs use the cache.

---

## 60-second quick start

```python
import paradox_pdf as pdx

doc = pdx.extract("contract.pdf")

print(doc["total_pages"], "pages")
print(doc["type_summary"])     # {'TITLE': 1, 'PARAGRAPH': 14, 'TABLE': 3, ...}

for el in doc["elements"]:
    print(el["type"], "-", (el.get("text") or "")[:60])
```

That's it. `doc` is a plain `dict` — no custom classes, no streaming generators. JSON-serializable as-is.

---

## Public API

The package exposes 5 functions and 1 dataclass:

| Symbol | Purpose |
|---|---|
| `extract(pdf, **opts) -> dict` | Run the full pipeline, return JSON in memory. |
| `extract_to_file(pdf, output, images_dir, **opts) -> dict` | Same as `extract` but also writes JSON + images to disk. |
| `extract_pages(pdf, pages, **opts) -> dict` | Subset by page number. |
| `extract_text(pdf, **opts) -> str` | Plain-text concatenation only. |
| `extract_tables(pdf, **opts) -> list[dict]` | Flat list of every TABLE element. |
| `PipelineConfig` | Dataclass to override 30+ thresholds. |

All functions accept the same keyword options:

| Argument | Type | Default | Description |
|---|---|---|---|
| `pages` | `Sequence[int] \| None` | `None` | 1-based page numbers; `None` = all pages. |
| `no_images` | `bool` | `False` | Skip image extraction (faster, no PNGs). |
| `force_mode` | `"heuristic" \| "vision" \| None` | `None` | Force a pipeline; `None` auto-routes per page. |
| `backend` | `"auto" \| "cpu" \| "gpu"` | `"auto"` | Vision backend: `cpu` = classic, `gpu` = PaddleOCR-VL ([gpu] extra). `auto` picks GPU when available. |
| `output` | `str \| Path \| None` | `None` | If set, also writes JSON here. |
| `images_dir` | `str \| Path \| None` | tempdir | Where extracted images go. |
| `config` | `PipelineConfig \| None` | `None` | Override pipeline thresholds. |

---

## Examples

### 1. Get the document tree

```python
import paradox_pdf as pdx

doc = pdx.extract("annual_report.pdf")

# Top-level structure
print(doc.keys())
# dict_keys(['source', 'total_pages', 'total_elements', 'total_images',
#            'type_summary', 'elements'])

# Walk the heading tree
def walk(nodes, depth=0):
    for n in nodes:
        text = (n.get("text") or "").strip()[:80]
        print(f"{'  '*depth}{n['type']:14s} {text}")
        walk(n.get("children", []), depth + 1)

walk(doc["elements"])
```

### 2. Process only certain pages

```python
# Explicit 1-based page numbers
doc = pdx.extract("contract.pdf", pages=[1, 2, 5])
# or a contiguous run: range(10, 20) covers pages 10 through 19
doc = pdx.extract_pages("contract.pdf", pages=range(10, 20))
```
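
If page selections arrive as text (from a config file or a web form, say), a small parser can turn them into the 1-based list that `pages` expects. This is a sketch only: `parse_page_spec` is a hypothetical helper that is not part of the package, and the comma-separated `"1-3,7"` syntax is an assumption made for illustration.

```python
def parse_page_spec(spec):
    """Turn a string such as "1-3,7" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. "1-3" covers pages 1, 2 and 3.
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)


print(parse_page_spec("1-3,7"))
```

The resulting list can be passed directly as `pages=parse_page_spec("1-3,7")`.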

### 3. Plain text in one call

```python
text = pdx.extract_text("contract.pdf")
```

### 4. Extract every table

```python
tables = pdx.extract_tables("contract.pdf")

for t in tables:
    rows, cols = t["shape"]
    cells = t["cells"]
    print(f"Table {rows}×{cols}, {len(cells)} cells")

    for c in cells:
        p = c["p"]
        if len(p) == 2:                          # simple cell
            r, col = p
            print(f"  ({r},{col}): {c['t']!r}")
        else:                                    # merged cell
            r, col, rowspan, colspan = p
            print(f"  ({r},{col}) span {rowspan}×{colspan}: {c['t']!r}")
```

Cell schema:

```python
{"p": [row, col], "t": "Some cell text"}                       # simple
{"p": [row, col, rowspan, colspan], "t": "Header cell"}        # merged
```
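
For CSV export or DataFrame construction it can help to expand the sparse cell list into a dense grid. The sketch below relies only on the cell schema shown above. `cells_to_grid` is a hypothetical helper, not part of the package; it assumes `p` indices are 0-based (as in the output schema example) and copies a merged cell's text into every position it covers.

```python
def cells_to_grid(shape, cells, fill=""):
    """Expand a sparse cell list into a dense rows x cols list of strings."""
    rows, cols = shape
    grid = [[fill] * cols for _ in range(rows)]
    for cell in cells:
        p = cell["p"]
        r, c = p[0], p[1]
        # Simple cells carry no span entries; treat them as 1x1.
        rowspan, colspan = (p[2], p[3]) if len(p) == 4 else (1, 1)
        for rr in range(r, min(r + rowspan, rows)):
            for cc in range(c, min(c + colspan, cols)):
                grid[rr][cc] = cell.get("t", "")
    return grid


# Hand-written sample in the documented schema (not real extractor output).
sample = {
    "shape": [2, 4],
    "cells": [
        {"p": [0, 0], "t": "Category"},
        {"p": [0, 1, 1, 3], "t": "Studio Minimum Rates"},
        {"p": [1, 0], "t": "Writer"},
    ],
}
grid = cells_to_grid(sample["shape"], sample["cells"])
for row in grid:
    print(row)
```

Positions with no cell keep the `fill` value, so every row in the grid has the same length.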

### 5. Persist to disk

```python
doc = pdx.extract_to_file(
    "contract.pdf",
    output="out/contract.json",
    images_dir="out/images/",
)
```

The function still returns the dict.

### 6. Convert a folder

```python
from pathlib import Path
import paradox_pdf as pdx

for pdf in Path("inbox/").glob("*.pdf"):
    doc = pdx.extract_to_file(pdf, output=f"out/{pdf.stem}.json", no_images=True)
    print(f"{pdf.name:40s}  {doc['total_pages']}p  {doc['total_elements']} elements")
```

### 7. Custom configuration

```python
from paradox_pdf import extract, PipelineConfig

cfg = PipelineConfig(
    render_dpi=300,                    # higher DPI for vision pipeline
    scan_text_threshold=80,            # treat pages with <80 chars as scanned
    cv_border_missing_threshold=0.40,  # be stricter about declaring borders absent
    yolo_confidence=0.30,              # stricter YOLO detections
)

doc = extract("noisy_scan.pdf", config=cfg)
```

The full reference for the 30+ tunables is in `docs/configuration.md`.

You can also override any parameter with environment variables prefixed `PDF_`:

```bash
PDF_RENDER_DPI=300 PDF_YOLO_CONFIDENCE=0.3 python my_script.py
```
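
The naming convention in that command (`PDF_` plus the upper-cased field name) can be illustrated with a stand-in dataclass. This is a sketch of the convention only, not the package's actual loading code: `DemoConfig`, its default values, and `apply_env_overrides` are all hypothetical, and only the two field names come from this README.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class DemoConfig:
    # Stand-in for PipelineConfig; the defaults here are placeholders.
    render_dpi: int = 200
    yolo_confidence: float = 0.25


def apply_env_overrides(cfg, environ, prefix="PDF_"):
    """Return a copy of cfg with fields overridden by PREFIX + FIELD_NAME variables."""
    updates = {}
    for f in fields(cfg):
        raw = environ.get(prefix + f.name.upper())
        if raw is None:
            continue
        current = getattr(cfg, f.name)
        if isinstance(current, bool):
            # bool("0") would be True, so parse booleans explicitly.
            updates[f.name] = raw.strip().lower() in {"1", "true", "yes"}
        else:
            # Cast the string to the type of the current value (int, float, str).
            updates[f.name] = type(current)(raw)
    return replace(cfg, **updates)


env = {"PDF_RENDER_DPI": "300", "PDF_YOLO_CONFIDENCE": "0.3"}
cfg = apply_env_overrides(DemoConfig(), env)
print(cfg)
```

In a real script you would pass `os.environ` instead of the literal `env` dictionary.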

### 8. Force a specific pipeline

```python
# Force the digital pipeline even if a page looks scanned (faster, no OCR)
doc = pdx.extract("digital_only.pdf", force_mode="heuristic")

# Force the vision pipeline (OCR every page, even digital ones)
doc = pdx.extract("scanned.pdf", force_mode="vision")
```

### 9. Just count things

```python
doc = pdx.extract("contract.pdf", no_images=True)
print(doc["type_summary"])
# {'TITLE': 1, 'H1': 4, 'H2': 11, 'PARAGRAPH': 67, 'TABLE': 3, 'SIGNATURE': 2}
```
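
When you need counts for a subtree or a filtered set of elements rather than the whole document, the tree can be walked directly. The sketch below uses only the documented `type` and `children` keys. `iter_elements` and `count_types` are hypothetical helpers, and whether their totals match `type_summary` exactly depends on how the package counts nested elements, which this README does not specify.

```python
from collections import Counter


def iter_elements(nodes):
    """Yield every element in the tree, parents before children."""
    for node in nodes:
        yield node
        yield from iter_elements(node.get("children", []))


def count_types(nodes):
    """Count elements by their 'type' across the whole subtree."""
    return Counter(el["type"] for el in iter_elements(nodes))


# Hand-written sample in the documented schema (not real extractor output).
sample_elements = [
    {"type": "TITLE", "text": "Report", "children": [
        {"type": "PARAGRAPH", "text": "Intro"},
        {"type": "H1", "text": "Summary", "children": [
            {"type": "TABLE", "shape": [2, 2], "cells": []},
            {"type": "PARAGRAPH", "text": "Notes"},
        ]},
    ]},
]
counts = count_types(sample_elements)
print(dict(counts))
```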

### 10. Build markdown from the tree

```python
import paradox_pdf as pdx

LEVEL = {"TITLE": 1, "SUBTITLE": 2, "H1": 3, "H2": 4, "H3": 5, "H4": 6}

def to_markdown(nodes, out=None):
    out = out if out is not None else []
    for n in nodes:
        t = n.get("type")
        text = (n.get("text") or "").strip()
        line = None
        if t in LEVEL and text:
            line = "#" * LEVEL[t] + " " + text
        elif t == "PARAGRAPH" and text:
            line = text
        elif t == "TABLE":
            rows, cols = n["shape"]
            line = f"_<table {rows}x{cols}>_"      # placeholder, not a rendered table
        if line is not None:
            out.extend([line, ""])                 # blank line only after emitted blocks
        to_markdown(n.get("children", []), out)    # recurse; children share the same list
    return "\n".join(out)

doc = pdx.extract("contract.pdf", no_images=True)
print(to_markdown(doc["elements"]))
```

---

## Output schema

```jsonc
{
  "source": "contract.pdf",
  "total_pages": 12,
  "total_elements": 145,
  "total_images": 4,
  "type_summary": {"TITLE": 1, "PARAGRAPH": 67, "TABLE": 3, "...": "..."},
  "elements": [
    {
      "type": "TITLE",
      "marks": ["BOLD"],
      "text": "**Annual Report — Q4 2025**",
      "ref": "(p1,l1):(p12,l8)",
      "children": [
        {"type": "PARAGRAPH", "text": "...", "ref": "(p1,l2):(p1,l2)"},
        {"type": "H1",
         "text": "**1. Financial Summary**",
         "ref": "(p1,l3):(p2,l4)",
         "children": [
           {"type": "TABLE",
            "shape": [5, 4],
            "cells": [
              {"p": [0, 0], "t": "Category"},
              {"p": [0, 1, 1, 3], "t": "Studio Minimum Rates"}
            ],
            "ref": "(p1,l4):(p1,l4)"}
         ]}
      ]
    }
  ]
}
```

### `ref` field

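Every element gets a `ref` of the form `"(pX,lY):(pX,lY)"`, where:
- `pX` = page number (1-based),
- `lY` = element index within that page (1-based).

The first tuple is where the element starts; the second is the end of the element's last descendant. The two tuples can therefore differ in both page and index, and in the example above the leaf elements have identical start and end tuples.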
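
A `ref` string is easy to split into integers with a regular expression. The sketch below assumes refs always match the exact shape shown in the schema example. `Ref` and `parse_ref` are hypothetical names, not part of the package.

```python
import re
from typing import NamedTuple

_REF = re.compile(r"^\(p(\d+),l(\d+)\):\(p(\d+),l(\d+)\)$")


class Ref(NamedTuple):
    start_page: int
    start_index: int
    end_page: int
    end_index: int


def parse_ref(ref):
    """Parse "(p1,l3):(p2,l4)" into a Ref of four 1-based integers."""
    m = _REF.match(ref)
    if m is None:
        raise ValueError(f"unrecognized ref: {ref!r}")
    return Ref(*map(int, m.groups()))


r = parse_ref("(p1,l3):(p2,l4)")
print(r, "crosses a page boundary:", r.start_page != r.end_page)
```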

### Element types (excerpt)

`TITLE`, `SUBTITLE`, `H1`–`H4`, `PARAGRAPH`, `TABLE`, `LIST` (with `items[]`), `TOC` (with `entries[]`), `IMAGE`, `SIGNATURE`, `AMENDMENT_DEL`, `EXHIBIT`, `APPENDIX`, `FOOTER`, `HEADER`, `PAGE_NUMBER`, plus 50+ more. Full list: `pdf_tagger/catalog.py`.

### Inline marks

Marks are preserved both in `marks: [...]` (per-element) and inline in the text:

| Mark | Inline syntax |
|---|---|
| BOLD | `**bold text**` |
| ITALIC | `*italic*` |
| UNDERLINE | `++underlined++` |
| STRIKETHROUGH | `~~deleted~~` |
| SUPERSCRIPT | `^superscript^` |
| MONOSPACE | `` `code` `` |
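
If a downstream consumer wants clean text (for search indexing, for instance), the inline syntax can be removed with a few regular expressions. This is a rough sketch, not a full parser: `strip_marks` is a hypothetical helper, and it assumes the delimiters in the table above never appear as literal characters in the text, which this README does not guarantee.

```python
import re

# Bold must be handled before italic so that "**x**" is not
# misread as italic markers around "*x*".
_MARK_PATTERNS = [
    re.compile(r"\*\*(.+?)\*\*"),   # BOLD
    re.compile(r"\*(.+?)\*"),       # ITALIC
    re.compile(r"\+\+(.+?)\+\+"),   # UNDERLINE
    re.compile(r"~~(.+?)~~"),       # STRIKETHROUGH
    re.compile(r"\^(.+?)\^"),       # SUPERSCRIPT
    re.compile(r"`(.+?)`"),         # MONOSPACE
]


def strip_marks(text):
    """Remove inline mark delimiters, keeping the marked text itself."""
    for pattern in _MARK_PATTERNS:
        text = pattern.sub(r"\1", text)
    return text


print(strip_marks("**1. Financial Summary**"))
print(strip_marks("See ++clause 4++ and ~~old~~ *new* x^2^"))
```

Note that this keeps struck-through text; drop the captured group for the STRIKETHROUGH pattern if deleted text should disappear instead.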

---

## CLI

The same package installs a `paradox-pdf` command:

```bash
paradox-pdf contract.pdf                       # → output/contract.json
paradox-pdf contract.pdf -o result.json
paradox-pdf docs/ -o extracted/ -w 8           # parallel folder
paradox-pdf --pages 1-5 contract.pdf
paradox-pdf --no-images contract.pdf
```

Run `paradox-pdf --help` for the full set of flags.

---

## How it works

```
                 ┌─────────────────┐
PDF ─────────────► scan_detector   │  per page (<50 chars → vision)
                 └────────┬────────┘
            ┌─────────────┴─────────────┐
            ▼                           ▼
   ┌──────────────────┐        ┌──────────────────────┐
   │ Heuristic        │        │ Vision               │
   │ (PyMuPDF fonts)  │        │ YOLO + RapidOCR      │
   │                  │        │ + Table Transformer  │
   │                  │        │ + HDBSCAN borderless │
   │                  │        │ + TexTAR (marks)     │
   └────────┬─────────┘        └──────────┬───────────┘
            └─────────────┬───────────────┘
                          ▼
              ┌────────────────────────┐
              │ Section tree builder   │
              │ Post-processing passes │
              └───────────┬────────────┘
                          ▼
                       JSON dict
```

For tables, three detectors run in parallel: vector lines (PyMuPDF), Table Transformer, and OpenCV border morphology. Overlapping candidates are resolved by non-maximum suppression at IoU 0.5, with each candidate scored as `fill_rate + source_bonus − merge_penalty`, so the highest-scoring result wins. Merged cells are detected by missing inner borders (≥35% pixel coverage threshold) for bordered tables, and by cell-width ratio (>1.6× column pitch) for borderless ones.
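
The selection step can be sketched as a standard greedy non-maximum suppression. Only the score formula and the 0.5 IoU threshold come from the description above; the `Candidate` class, the source labels, the sample numbers, and the exact tie and threshold handling are assumptions for illustration, not the package's implementation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str          # illustrative labels: "vector", "tatr", "opencv"
    bbox: tuple          # (x0, y0, x1, y1) in page coordinates
    fill_rate: float
    source_bonus: float
    merge_penalty: float

    @property
    def score(self):
        return self.fill_rate + self.source_bonus - self.merge_penalty


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_tables(candidates, iou_threshold=0.5):
    """Keep the best-scoring candidate in each group of overlapping boxes."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if all(iou(cand.bbox, k.bbox) < iou_threshold for k in kept):
            kept.append(cand)
    return kept


candidates = [
    Candidate("vector", (50, 100, 550, 400), 0.90, 0.10, 0.00),
    Candidate("tatr", (55, 105, 545, 395), 0.80, 0.05, 0.10),
    Candidate("opencv", (50, 500, 550, 700), 0.70, 0.00, 0.00),
]
for c in select_tables(candidates):
    print(c.source, round(c.score, 2))
```

Here the first two boxes overlap almost entirely, so only the higher-scoring one survives, while the third box covers a different region and is kept.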

---

## Performance notes

- **Digital page**: ~0.05 s on CPU.
- **Scanned page**: ~10 s on CPU, much faster on GPU (PyTorch detects and uses CUDA automatically).
- **First run**: HuggingFace models are downloaded once (~500 MB for the CPU profile, ~2 GB for the GPU profile).

If startup takes several minutes per document with the vision pipeline, set `HF_HUB_OFFLINE=1` after the first download. On slow networks the bottleneck is HuggingFace's online metadata revalidation, not the inference itself:

```bash
HF_HUB_OFFLINE=1 python my_script.py
```

Or in code:

```python
import os
os.environ["HF_HUB_OFFLINE"] = "1"   # set this before importing paradox_pdf
import paradox_pdf as pdx
```

---

## Repository layout

```
paradox_pdf/         Public Python API (extract, extract_text, …)
pdf_tagger/          Core extraction (font classifier, vision layout, marks)
pdf_grid/            Vector-line table detection
scripts/             CLI implementation
docs/                Configuration reference, API reference, research notes
examples/            Sample PDFs + expected outputs
_dev/                Test suites, fixtures, benchmarks (not shipped in wheel)
```

---

## License

Proprietary — © CreAI. Contact <feliperodriguez@creai.mx> for commercial use.

---

## Links

- **PyPI**: https://pypi.org/project/paradox-pdf/
- **Repository**: https://github.com/CreAI-mx/DocumentExtractor
- **Issues**: https://github.com/CreAI-mx/DocumentExtractor/issues
