Metadata-Version: 2.4
Name: ytcc-pipeline
Version: 0.2.0
Summary: PDF processing pipeline for academic theses.
Project-URL: homepage, https://github.com/ozefe/ytcc-pipeline
Project-URL: source, https://github.com/ozefe/ytcc-pipeline
Project-URL: issues, https://github.com/ozefe/ytcc-pipeline/issues
Author-email: Efe Özyay <hi@efe.cv>
License-Expression: MIT
License-File: LICENSE
Keywords: document-understanding,formula-recognition,grobid,layout-analysis,ocr,pdf,table-extraction
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Text Processing :: Markup
Classifier: Typing :: Typed
Requires-Python: >=3.14
Requires-Dist: ftfy<7,>=6.3
Requires-Dist: huggingface-hub<2,>=1
Requires-Dist: numpy<3,>=2
Requires-Dist: onnxruntime-gpu==1.26.0
Requires-Dist: opencv-python-headless<5,>=4.10
Requires-Dist: pdf-oxide==0.3.47
Requires-Dist: rapid-table==3.0.2
Requires-Dist: rapidocr==3.8.1
Requires-Dist: safetensors<1,>=0.4
Requires-Dist: torch<3,>=2.5
Requires-Dist: torchvision<1,>=0.20
Requires-Dist: transformers<6,>=5
Description-Content-Type: text/markdown

# ytcc-pipeline

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/ytcc-pipeline)
![PyPI - License](https://img.shields.io/pypi/l/ytcc-pipeline)
![PyPI - Status](https://img.shields.io/pypi/status/ytcc-pipeline)
![PyPI - Downloads](https://img.shields.io/pypi/dm/ytcc-pipeline)

<img alt="ytcc-pipeline mascot generated by Google's Nano Banana 2" align="right" src=".github/mascot.png" width="200" />

`ytcc-pipeline` is a synchronous Python library that converts an academic-thesis PDF into a structured JSON document plus a tar bundle of cropped figures, tables, and formulas. It ships an optional FastAPI wrapper for service deployments and Docker images for four deployment profiles.

## What it does

Given a PDF, the pipeline runs eight stages in fixed order:

```text
render -> metadata -> layout -> blocks -> table -> formula -> reference -> bundle
```

1. **render** -- decode every page to an image (`pdf-oxide`, `cfg.render_workers` processes).
2. **metadata** -- sha256, byte size, XMP fields. Cheap; runs before model load so I/O failures surface early.
3. **layout** -- PP-DocLayoutV3 emits one `LayoutDetection` per detected block with `label`, `bbox`, `confidence`, `reading_order`.
4. **blocks** -- per-page dispatcher routes each detection by `Route`. Text comes from the PDF text layer (digital-born) or RapidOCR over the rendered crop (scanned). Figures, tables, and formulas are cropped and saved.
5. **table** (opt-in) -- RapidTable SLANet+ recovers cell grids for `TABLE` blocks.
6. **formula** (default on) -- PP-FormulaNet-L recovers LaTeX from every `FORMULA` block's crop.
7. **reference** (opt-in) -- batched POST to an externally-managed GROBID server enriches `REFERENCE` blocks with parsed `Reference` records.
8. **bundle** -- pack `document.json` and every saved crop into a single uncompressed tar.

Disabled stages never load their model. Each stage emits one INFO log line on completion; skips short-circuit with `skipped reason=...`.
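
The skip behaviour can be pictured with a small sketch. It is not the library's orchestrator: `run_stage`, the `load_model` callable, and the exact log wording (other than the `skipped reason=...` fragment) are hypothetical names chosen for illustration.

```python
import logging
from typing import Callable

log = logging.getLogger("pipeline_sketch")


def run_stage(
    name: str,
    enabled: bool,
    load_model: Callable[[], Callable],
    pages: list,
) -> list:
    """Run one optional stage; a disabled stage returns before load_model is called."""
    if not enabled:
        # Short-circuit: no model load, pages flow through unchanged.
        log.info("%s skipped reason=disabled", name)
        return pages
    model = load_model()  # only reached when the stage is enabled
    result = [model(page) for page in pages]
    log.info("%s done pages=%d", name, len(result))
    return result
```

The point of the shape is that the expensive `load_model()` call sits behind the `enabled` check, so a disabled stage costs one log line and nothing else.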

## Features

- **One public entry point.** `process_pdf(pdf_path, language=..., ...)` -- everything else is implementation detail.
- **Auto-detected digital-born vs scanned.** The orchestrator probes the PDF's text layer once and picks the path. Override with `digital_born=True/False`.
- **Two text-extraction paths, asymmetric workers.** `digital_born_workers` (cheap `pdf_oxide` processes) and `ocr_workers` (RapidOCR + CUDA, ~2 GiB VRAM each) are tuned independently.
- **Bucketed formula batching.** Sorts crops by bbox area into small/medium/large buckets with per-bucket `max_new_tokens` caps. Measured 1.63x speedup over flat batching.
- **fp16 + cv2 fast preproc on layout.** ~1.7x and ~2.3x isolated layout-stage speedups on the SafeTensors backend.
- **Auto-DPI on the digital-born path.** Renders at 150 DPI for digital-born PDFs (layout downsamples to 800x800 anyway, and the `pdf_oxide` text layer is resolution-independent) and at 300 DPI for scanned ones. This halves render wall time on the digital-born path.
- **Injectable resident models.** Pass pre-loaded `LayoutAnalyzer`, `FormulaRecognizer`, and `TableEngine` to skip the ~5s + ~3s + ~1s reload between calls. The FastAPI service does this in `lifespan`.
- **Frozen-dataclass schema.** `Document` / `Page` / `Block` / `Cell` / `Reference` are immutable; `dataclasses.replace` is the only rewrite path. JSON serialisation is stable across runs (modulo `uuid4` crop filenames).
- **Streaming-first tar bundle.** `document.json` is the first archive member; consumers parse the index before the image bytes arrive. The FastAPI service ships it directly via `FileResponse`.
- **Three-layer config.** `config.toml` (service-wide), `YTCC_*` env vars (per-field overrides), `PipelineConfig(...)` keyword args (per-call). Unknown TOML keys raise `ValueError`.
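
To make the bucketed formula batching above concrete, here is a minimal sketch of the grouping step only. The `Crop` type, the area thresholds, and the token caps are all hypothetical; the library's real bucket boundaries are not stated in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    crop_id: str
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


# Hypothetical limits: (bucket name, max bbox area in px^2, max_new_tokens cap).
BUCKETS = [
    ("small", 20_000, 64),
    ("medium", 120_000, 256),
    ("large", float("inf"), 1024),
]


def bucket_crops(crops: list[Crop]) -> dict[str, list[Crop]]:
    """Group crops into area buckets; crops inside a bucket stay area-sorted."""
    out: dict[str, list[Crop]] = {name: [] for name, _, _ in BUCKETS}
    for crop in sorted(crops, key=lambda c: c.area):
        for name, max_area, _ in BUCKETS:
            if crop.area <= max_area:
                out[name].append(crop)
                break
    return out
```

Each bucket can then be decoded with its own `max_new_tokens` cap, so a batch of small inline formulas is not held to the generation budget of a full-page derivation.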

## Requirements

- Python 3.14+.
- A CUDA GPU for any non-trivial throughput. CPU works but is neither tested nor recommended.
- An externally-managed GROBID server when `references_enabled=true` (not bundled).

The project depends on CUDA-enabled builds of `torch` and `onnxruntime-gpu`. Replace them with CPU wheels if you target CPU-only environments; nothing else in the codebase assumes CUDA at import time.

## Installation

```bash
pip install ytcc-pipeline          # library only
pip install "ytcc-pipeline[api]"   # + FastAPI service (quoted so zsh does not glob the brackets)
pip install "ytcc-pipeline[dev]"   # + tests + lint + typing + benchmarks (includes [api])
```

## Library quickstart

```python
from ytcc_pipeline import process_pdf

# paper.tar: contains document.json + images/*
bundle_path = process_pdf("paper.pdf", language="en")
```

`process_pdf` is **synchronous and blocking** -- internally it uses `multiprocessing` pools with the `spawn` start method, not asyncio. Call it from a thread (`asyncio.to_thread(process_pdf, ...)`) if you need to integrate with an event loop.
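
A minimal sketch of the event-loop integration follows. `blocking_convert` is a stand-in for `process_pdf` so the snippet runs without the library or a GPU; its return value is illustrative only.

```python
import asyncio
import time


def blocking_convert(pdf_path: str, language: str = "en") -> str:
    # Stand-in for process_pdf: blocks the calling thread, then returns a path.
    time.sleep(0.05)
    return pdf_path.removesuffix(".pdf") + ".tar"


async def convert_without_blocking(pdf_path: str) -> str:
    # to_thread runs the blocking call in a worker thread, so the loop stays free.
    return await asyncio.to_thread(blocking_convert, pdf_path, language="en")


async def demo() -> tuple[str, int]:
    ticks = 0
    task = asyncio.create_task(convert_without_blocking("paper.pdf"))
    while not task.done():
        ticks += 1  # the loop keeps servicing other work in the meantime
        await asyncio.sleep(0.01)
    return task.result(), ticks


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Swapping `blocking_convert` for the real `process_pdf` should keep the same structure, since `asyncio.to_thread` forwards positional and keyword arguments unchanged.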

## Service quickstart

```bash
pip install "ytcc-pipeline[api]"
uvicorn ytcc_pipeline.api.app:app --host 0.0.0.0 --port 8000
```

The service reads `config.toml` at startup -- override the path with `YTCC_CONFIG=/path/to/config.toml`. One process, one GPU, one PDF at a time; concurrency is serialised on an `asyncio.Lock`.

```bash
curl -X POST http://localhost:8000/process \
  -F "pdf=@paper.pdf" \
  -F "language=en" \
  -o paper.tar
```

`GET /health` reports liveness + readiness (`{"status":"ok","model_loaded":true}`). `POST /process` accepts `pdf` (file), `language` (ISO 639-1), and optional `digital_born` (`true`/`false`). The response body is the tar bundle; `X-Processing-Time` carries the server-side wall in seconds.

> [!CAUTION]
> Run **one** uvicorn worker per GPU. `--workers N` with N > 1 loads every resident model once per worker, and the copies contend for VRAM.

## Docker

Pre-built images for four deployment profiles are published to GitHub Container Registry, each available as a slim variant (~6 GB, models fetched on first request) or a baked variant (~11 GB, models pre-downloaded). Pin to a versioned tag in production.

| Tag | Profile | Use case |
|---|---|---|
| `:scanned[-baked]` | Mixed (digital-born + scanned) | Default; loads OCR engines |
| `:digital-born[-baked]` | `scanned_enabled=false` | Rejects scanned PDFs with HTTP 415; saves ~12 GiB VRAM |
| `:digital-born-a100[-baked]` | A100-tuned | Larger batches, `formula_torch_compile=true` |
| `:text-extract[-baked]` | Tuned for 48 GB VRAM, formula + table OFF | Text + image + references only; ~10x faster on math-heavy theses |

Each profile ships a compose file with a GROBID sidecar:

```bash
docker compose -f docker/compose.scanned.yml up -d
curl -X POST http://localhost:8000/process \
    -F "pdf=@paper.pdf" \
    -F "language=en" \
    -o paper.tar
```

Override config via env vars (every `PipelineConfig` field is reachable via `YTCC_<UPPERCASE_FIELD>`) or by mounting your own TOML over `/app/config.toml`. See `docker/README.md` for the full image matrix, build instructions, and troubleshooting.

## Output format

The output is one uncompressed tar:

```text
paper.tar
├── document.json              # the schema document
└── images/                    # cropped block images
    ├── 0001-image-{uuid}.png
    ├── 0014-formula-{uuid}.png
    ├── 0014-table-{uuid}.png
    └── 0027-formula-MISS-{uuid}.png
```

`document.json` is the first archive member -- consumers can stream-parse it before image bytes arrive. Image filenames sort by page (1-based, zero-padded), then by layout label, then by random UUID. The `-MISS-` marker identifies fallback crops written when primary extraction failed.
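
The naming scheme can be parsed with a small helper. This is a sketch based only on the example filenames in the tree above; it assumes lower-case labels and a hyphenated hexadecimal UUID, and `CropName` / `parse_crop_name` are hypothetical names.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class CropName(NamedTuple):
    page: int    # 1-based page number
    label: str   # layout label, e.g. "formula"
    miss: bool   # True when the -MISS- marker is present
    uuid: str


_CROP_RE = re.compile(
    r"^(?P<page>\d+)-(?P<label>[a-z_]+)-(?P<miss>MISS-)?(?P<uuid>[0-9a-f-]+)\.png$"
)


def parse_crop_name(member: str) -> Optional[CropName]:
    """Split an archive member name into its parts; None if it is not a crop."""
    match = _CROP_RE.match(PurePosixPath(member).name)
    if match is None:
        return None
    return CropName(
        page=int(match["page"]),
        label=match["label"],
        miss=match["miss"] is not None,
        uuid=match["uuid"],
    )
```

For anything beyond quick triage, `image_path` and `miss` in `document.json` remain the authoritative source; the filename is a convenience.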

```python
import json, tarfile

with tarfile.open("paper.tar") as tf:
    doc = json.loads(tf.extractfile("document.json").read())

for page in doc["pages"]:
    for block in page["blocks"]:
        print(block["reading_order"], block["type"], (block["text"] or "")[:60])
```
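
Because `document.json` comes first, the index can also be read from a non-seekable stream (a pipe or an HTTP response body) without buffering the images. A minimal sketch using the standard library's stream mode; `read_index` is a hypothetical helper name:

```python
import json
import tarfile
from typing import BinaryIO


def read_index(stream: BinaryIO) -> dict:
    """Parse document.json from a tar stream without seeking."""
    # Mode "r|" treats the input as a forward-only stream of tar blocks.
    with tarfile.open(fileobj=stream, mode="r|") as tf:
        first = tf.next()
        if first is None or first.name != "document.json":
            raise ValueError("expected document.json as the first archive member")
        return json.load(tf.extractfile(first))
```

In stream mode members must be consumed in archive order, which is exactly what the first-member guarantee makes possible.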

The schema mirrors the `Document` / `Page` / `Block` / `Cell` / `Reference` dataclasses. `bbox` floats are rounded to two decimals; pixel coordinates are in the **effective render DPI** (150 for digital-born, 300 for scanned), origin top-left.
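
Since the pixel grid depends on the render DPI, comparing boxes across documents usually means converting to PDF points (1/72 inch). A minimal sketch, assuming `bbox` is a flat sequence of pixel coordinates such as `(x0, y0, x1, y1)`; the exact field layout is an assumption here, see `docs/output-format.md`.

```python
from typing import Sequence


def px_to_points(bbox: Sequence[float], dpi: int) -> tuple[float, ...]:
    """Scale pixel coordinates at the given render DPI to PDF points."""
    # One inch is `dpi` pixels and 72 points, so points = pixels * 72 / dpi.
    return tuple(round(v * 72.0 / dpi, 2) for v in bbox)


# The same physical box rendered at 150 DPI and at 300 DPI.
print(px_to_points((150, 300, 450, 600), dpi=150))
print(px_to_points((300, 600, 900, 1200), dpi=300))
```

Only the scale is converted. The origin stays top-left, whereas native PDF user space conventionally puts the origin at the bottom-left, so a vertical flip against the page height is still needed before drawing back onto the PDF.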

Per-block invariants:

| Block kind | `text` | `image_path` | `miss` |
|---|---|---|---|
| TEXT, success | extracted text | `null` | `false` |
| TEXT, MISS | `null` | crop (if bundled) or `null` | `true` |
| REFERENCE, success | extracted text | `null` | `false` |
| REFERENCE, MISS | `null` | crop (if bundled) or `null` | `true` |
| IMAGE | `null` | crop path | `false` |
| FORMULA, success | LaTeX | `null` (crop deleted) | `false` |
| FORMULA, MISS | `null` | `-MISS-` marked crop | `true` |
| TABLE, structured | `null` | crop path + `cells` / `n_rows` / `n_cols` set | `false` |
| TABLE, image-only fallback | `null` | crop path, `cells=null` | `false` |

`miss=True` always means the primary representation is unavailable. TABLE blocks never carry `miss=True` -- a structure failure degrades silently to image-only.
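
The invariants in the table can be turned into a consumer-side sanity check. This sketch assumes the block kind is stored under `"type"` as the upper-case names used in the table, which this README does not confirm; `check_block` is a hypothetical helper.

```python
def check_block(block: dict) -> list[str]:
    """Return a list of invariant violations for one block (empty means OK)."""
    kind = block.get("type")
    text = block.get("text")
    image = block.get("image_path")
    miss = bool(block.get("miss"))
    problems: list[str] = []

    if kind in ("IMAGE", "TABLE"):
        # Image-like blocks: always a crop, never text, never miss.
        if miss:
            problems.append(f"{kind} must not set miss")
        if text is not None:
            problems.append(f"{kind} must have text=null")
        if image is None:
            problems.append(f"{kind} needs a crop path")
    elif kind in ("TEXT", "REFERENCE", "FORMULA"):
        if miss and text is not None:
            problems.append(f"{kind} MISS must have text=null")
        if not miss and text is None:
            problems.append(f"{kind} success needs text")
        if not miss and image is not None:
            problems.append(f"{kind} success must have image_path=null")
        if kind == "FORMULA" and miss and image is None:
            problems.append("FORMULA MISS needs its -MISS- crop")
    else:
        problems.append(f"unknown block kind: {kind!r}")
    return problems
```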

Full schema reference: `docs/output-format.md`.

## Configuration

Three layers, broadest to narrowest:

1. **`config.toml`** at the project root -- single source of truth for the service and benchmarks. Loaded by `load_service_config()`.
2. **`YTCC_*` environment variables** -- per-field overrides via `PipelineConfig.from_env()`.
3. **`PipelineConfig(...)` keyword arguments** -- explicit, per-call.

> [!IMPORTANT]
> The three layers don't compose automatically. The TOML and env vars are read only by `load_service_config()` and `PipelineConfig.from_env()`. Use `dataclasses.replace(loaded.pipeline, ...)` to layer overrides on top of a TOML-loaded config.

TOML resolution walks: `path` arg -> `YTCC_CONFIG` env -> `./config.toml` -> installed-package `config.toml` -> dataclass defaults. Unknown TOML keys raise `ValueError` -- typos don't pass silently.

Every `PipelineConfig` field has a matching `YTCC_<UPPERCASE_NAME>` env var. Bool parser accepts `1`/`true`/`yes`/`on` (case-insensitive). Comma-list fields strip whitespace and drop empties.
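
The parsing rules above are small enough to sketch. These helpers are illustrative, not the library's code; in particular, treating every other value as `False` and trimming surrounding whitespace on booleans are assumptions.

```python
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_name(field: str) -> str:
    """Map a config field name to its YTCC_* environment variable."""
    return f"YTCC_{field.upper()}"


def parse_bool(raw: str) -> bool:
    # Case-insensitive match; anything outside TRUE_VALUES is taken as False (assumption).
    return raw.strip().lower() in TRUE_VALUES


def parse_list(raw: str) -> list[str]:
    # Split on commas, strip whitespace, drop empty items.
    return [item.strip() for item in raw.split(",") if item.strip()]
```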

Full knob reference and tuning guidance: `docs/configuration.md` and `docs/performance.md`.

## Digital-born vs scanned

The two paths share rendering, layout, formula recognition, table extraction, and bundling. They diverge only inside the block stage:

| | Digital-born | Scanned |
|---|---|---|
| Text source | `pdf_oxide` text layer | `RapidOCR` over rendered crops |
| Render DPI | 150 (default) | 300 (default) |
| Per-worker cost | ~10 MiB RSS (one `PdfDocument` handle) | ~2 GiB VRAM (one RapidOCR engine + CUDA context) |
| Typical wall on RTX 3090 | ~15s for 150 pages | ~5-10x that |

Auto-detect samples 5 pages and checks per-page non-whitespace character counts; override with `digital_born=True/False`. Set `scanned_enabled=false` to reject scanned PDFs entirely (saves ~12 GiB VRAM; FastAPI returns HTTP 415).
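
A rough sketch of such a heuristic is shown here. Only the "sample 5 pages, count non-whitespace characters" idea comes from this README; the evenly spaced sampling, the `min_chars` threshold, and the majority rule are assumptions, and the real logic is described in `docs/digital-born-vs-scanned.md`.

```python
def sample_indices(n_pages: int, k: int = 5) -> list[int]:
    """Pick up to k page indices spread evenly across the document."""
    if n_pages <= k:
        return list(range(n_pages))
    step = (n_pages - 1) / (k - 1)
    return sorted({round(i * step) for i in range(k)})


def looks_digital_born(page_texts: list[str], min_chars: int = 50, k: int = 5) -> bool:
    """True when most sampled pages carry a usable text layer."""
    indices = sample_indices(len(page_texts), k)
    if not indices:
        return False
    hits = sum(
        1
        for i in indices
        if sum(not ch.isspace() for ch in page_texts[i]) >= min_chars
    )
    return hits * 2 > len(indices)  # strict majority of sampled pages
```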

## References (GROBID)

The reference stage requires an externally-managed GROBID server -- the pipeline never spawns the JVM. Start it separately:

```bash
docker run --rm -p 8070:8070 grobid/grobid:0.9.0
```

Or use the bundled helper, which generates a citation-only config (drops startup from ~10s to ~3s, saves ~1 GiB RSS):

```bash
scripts/grobid_start.sh
GROBID_PORT=9090 scripts/grobid_start.sh
scripts/grobid_stop.sh
```

When enabled, `run_reference_stage` does one batched POST to `/api/processCitationList` per PDF. Failures (server unreachable, timeout, HTTP error, malformed XML) are logged at WARNING and the page list flows through unchanged -- references are an enrichment, not a hard requirement. The raw reference string always survives on `Block.text`.
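
The "enrichment, not a hard requirement" behaviour can be sketched as a wrapper around a single batched call. This is not `run_reference_stage`: the `post` callable stands in for the HTTP round trip and XML parsing, and the record shape is hypothetical.

```python
import logging
from typing import Callable, Optional, Sequence

log = logging.getLogger("reference_sketch")


def enrich_references(
    raw_refs: Sequence[str],
    post: Callable[[Sequence[str]], list[dict]],
) -> list[dict]:
    """Attach parsed records when possible; always keep the raw string."""
    fallback: list[dict] = [{"text": r, "parsed": None} for r in raw_refs]
    if not raw_refs:
        return fallback
    try:
        parsed: Optional[list[dict]] = post(raw_refs)  # one batched call per PDF
    except Exception as exc:  # broad on purpose: unreachable, timeout, HTTP, bad XML
        log.warning("reference stage degraded: %s", exc)
        return fallback
    if parsed is None or len(parsed) != len(raw_refs):
        log.warning("reference stage degraded: unexpected response size")
        return fallback
    return [{"text": r, "parsed": p} for r, p in zip(raw_refs, parsed)]
```

Whatever happens inside `post`, the caller gets one record per input string with the raw text intact.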

## Design principles

- **Synchronous by default.** `process_pdf` is blocking. Async is layered on top in the FastAPI service via `asyncio.to_thread`. No async leakage into the core pipeline.
- **`multiprocessing.spawn`, never `fork`.** The parent process may hold a CUDA context; `fork` corrupts it. Workers re-import their module from scratch, which is why `pdf_oxide` is imported inside the digital-born worker entry rather than at module top.
- **One process, one GPU, one PDF at a time.** Concurrency is serialised at the `asyncio.Lock` in the service layer; the pipeline itself is sequential.
- **Resident models are injectable, not global.** `LayoutAnalyzer`, `FormulaRecognizer`, and `TableEngine` are constructor-injected with idempotent `close()`. The FastAPI lifespan loads them once; library callers either inject manually or let the orchestrator own the per-call lifecycle.
- **Opt-in opt-in opt-in.** Heavy stages (`table_enabled`, `references_enabled`) and slow tradeoffs (`formula_torch_compile`, `layout_fp16`) are off by default. Defaults are library-safe; production callers flip them on explicitly.
- **Streaming-first output.** Tar over zip because tar writes sequentially without seeking back for a central directory -- the bundle can be a pipe, socket, or HTTP response body. `document.json` is written first so consumers parse the index before image bytes arrive.
- **No silent failures.** Unknown TOML keys raise. MISS extractions are flagged on the block (`miss=true`) and preserve reading order + bbox. GROBID failures degrade the reference stage but never fail the pipeline.

## Limitations

- **Single-GPU, single-PDF concurrency.** The service serialises on a lock. Throughput scales with replicas, not workers.
- **Python 3.14+ only.** The project uses PEP 649/749 deferred-annotation semantics and modern stdlib features. No backport path.
- **CPU is unsupported.** The pipeline runs on CPU but that path is untested and unoptimised. Production deployments need CUDA.
- **`fork` not supported.** Combining this library with the `fork` start method of `multiprocessing` corrupts CUDA contexts.
- **Tuned for academic theses.** The PP-DocLayoutV3 label set and routing rules target academic documents (abstracts, references, formulas, tables). General-purpose PDFs may produce surprising layouts.
- **GROBID is external.** The reference stage requires a separately-managed GROBID server. The pipeline never bundles or starts the JVM.
- **Metadata output may contain unexpected XMP keys.** `pdf_info` is copied directly from the PDF's XMP block after stripping UTF-16 BOMs and null bytes. Anything beyond that (encrypted PDFs, IPTC, RDF) is out of scope.
- **Bundle filenames don't sort by reading order.** `document.json` is the authoritative reading order; image filenames sort by page + label + UUID.

## Benchmarks

Knob-sweep benchmarks live in the `benchmarks/` package. Each sweep varies one `PipelineConfig` field across a range of values and records per-stage wall, process-tree CPU/RSS, device VRAM, and quality metrics (block counts, MISS counts, formulas recovered, references parsed). Standalone scripts cover cold-start, sustained load, API concurrency, GROBID payload scaling, and `torch.compile` amortisation.

```bash
python benchmarks/run_all.py                # every sweep (cached CSVs skipped)
python benchmarks/run_all.py --only formula # just the sweeps matching "formula"
python -m benchmarks.plot                   # generate plots from existing CSVs
```

Committed reference results (`benchmarks/results/summary.md`, `sweeps/*.{csv,md}`, `plots/*.png`) live in git. Full catalogue: `benchmarks/README.md`.

## Documentation

| File | Topic |
|---|---|
| `docs/quickstart.md` | Install, first run, library + service modes |
| `docs/architecture.md` | Stage-by-stage pipeline, module layout, resource lifecycle |
| `docs/output-format.md` | Tar layout, `document.json` schema, MISS semantics |
| `docs/configuration.md` | `PipelineConfig` knobs, TOML, env-var overrides |
| `docs/stages.md` | Per-stage behaviour, knobs, skip / no-op semantics |
| `docs/performance.md` | Recommended config, per-knob impact, VRAM budget, tuning checklist |
| `docs/digital-born-vs-scanned.md` | Auto-detect heuristic, when to override, scanned-only deployments |
| `docs/api-service.md` | FastAPI contract, lifespan, concurrency model |
| `docs/references.md` | GROBID setup, parsed `Reference` shape, failure modes |
| `docs/gotchas.md` | Common pitfalls, MISS handling, OOM recovery, log conventions |

## Samples

Six PDFs under `samples/` cover English / Turkish / Arabic, digital-born and scanned, good and bad quality. Use `904599.pdf` (English, digital-born, good) as the first sanity check -- it exercises every stage except OCR.

## License

MIT. See [`LICENSE`](LICENSE).
