Metadata-Version: 2.4
Name: lightningdoc
Version: 2.0.0
Summary: High-performance 2-stage PDF to Markdown extraction engine with layout detection, multi-strategy OCR, math/table support, and header/footer stripping.
Author-email: Ayush Khamrui <khamruiasok@gmail.com>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/ayush-khamrui/lightningdoc
Project-URL: Repository, https://github.com/ayush-khamrui/lightningdoc
Project-URL: Issues, https://github.com/ayush-khamrui/lightningdoc/issues
Keywords: pdf,ocr,markdown,document-extraction,text-extraction,layout-detection,tesseract,pipeline
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: Pillow>=10.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: PyMuPDF>=1.23
Requires-Dist: opencv-python>=4.8
Requires-Dist: flask>=3.0
Provides-Extra: llm
Requires-Dist: torch>=2.0; extra == "llm"
Requires-Dist: transformers>=4.36; extra == "llm"
Requires-Dist: huggingface-hub>=0.20; extra == "llm"
Provides-Extra: easyocr
Requires-Dist: easyocr>=1.7; extra == "easyocr"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: mypy>=1.5; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Provides-Extra: all
Requires-Dist: lightningdoc[dev,easyocr,llm]; extra == "all"
Dynamic: license-file

# LightningDoc ⚡

**High-performance 2-stage PDF → Markdown extraction engine.**

LightningDoc extracts clean, structured Markdown from PDFs: native text, scanned documents, handwritten forms, or a mix of these. It combines layout-aware parsing with multi-strategy OCR in a single Python package.

[![PyPI version](https://img.shields.io/pypi/v/lightningdoc.svg)](https://pypi.org/project/lightningdoc/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

---

## ✨ Features

- **2-Stage Pipeline** — Layout Detection → Text Extraction
- **~25 ms/page** for native-text PDFs (200 pages in < 5 seconds)
- **Font-based math detection** — inline `$...$` and display `$$...$$` LaTeX equations
- **Table extraction** — PyMuPDF line detection + heuristic borderless table fallback
- **Header/footer stripping** — pattern + font-size aware margin detection
- **Multi-strategy OCR** — Tesseract (CLAHE+OTSU, contrast+sharpen), TrOCR handwriting, EasyOCR fusion
- **Column-aware reading order** — correct left→right ordering for multi-column layouts
- **Zero API keys** — everything runs 100% offline
- **Built-in web UI** — interactive viewer with bounding-box overlay, extraction dashboard

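Column-aware reading order is commonly obtained by grouping blocks by their horizontal position and then reading each column top to bottom. The sketch below shows that general idea only; it is not LightningDoc's actual algorithm. `Block`, `reading_order`, and the `column_gap` threshold are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: top-left corner (x0, y0) plus its text."""
    x0: float
    y0: float
    text: str


def reading_order(blocks: list[Block], column_gap: float = 50.0) -> list[Block]:
    """Order blocks column by column (left to right), top to bottom within each column."""
    columns: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.x0):
        # Stay in the current column while the left edge moves by at most column_gap.
        if columns and block.x0 - columns[-1][-1].x0 <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    return [b for col in columns for b in sorted(col, key=lambda b: b.y0)]


page = [
    Block(310, 100, "right top"),
    Block(50, 300, "left bottom"),
    Block(50, 100, "left top"),
    Block(312, 300, "right bottom"),
]
print([b.text for b in reading_order(page)])
# ['left top', 'left bottom', 'right top', 'right bottom']
```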
---

## 🚀 Quick Start

### Installation

```bash
pip install lightningdoc
```

With TrOCR handwriting support:

```bash
# Quotes keep shells such as zsh from expanding the brackets
pip install "lightningdoc[llm]"
```

> **System requirement:** [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) must be installed separately:
> - macOS: `brew install tesseract`
> - Ubuntu: `sudo apt install tesseract-ocr`
> - Windows: [installer](https://github.com/UB-Mannheim/tesseract/wiki)

### Python API

```python
from pathlib import Path
from lightningdoc import extract_text_from_pdf, extract_with_timing

# Simple text extraction
text = extract_text_from_pdf(Path("report.pdf"))
print(text)

# With full timing breakdown
result = extract_with_timing(Path("report.pdf"))
print(f"{result.word_count} words in {result.total_ms:.0f}ms")
print(f"Stage 1 (layout):  {result.stage1_ms:.0f}ms")
print(f"Stage 2 (extract): {result.stage2_ms:.0f}ms")
```
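
For converting a whole folder from Python, a small wrapper along these lines may be enough. It is a sketch under two assumptions: that `extract_text_from_pdf` returns the Markdown as a string (as the example above suggests), and that one `.md` file per PDF is the layout you want. `convert_folder` is a hypothetical helper, and the extractor is passed in as an argument so the function itself has no hard dependency.

```python
from collections.abc import Callable
from pathlib import Path


def convert_folder(src: Path, dst: Path, extract: Callable[[Path], str]) -> list[Path]:
    """Run `extract` on every PDF in `src` and write one .md file per PDF into `dst`."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / f"{pdf.stem}.md"
        out.write_text(extract(pdf), encoding="utf-8")
        written.append(out)
    return written
```

With LightningDoc installed, this would be called as `convert_folder(Path("pdfs"), Path("output"), extract_text_from_pdf)`.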

### CLI

```bash
# Extract a PDF to Markdown
lightningdoc report.pdf

# Batch extract
lightningdoc *.pdf -o ./output

# With TrOCR handwriting recognition
lightningdoc form.pdf --trocr

# With EasyOCR fusion
lightningdoc scanned.pdf --easyocr
```

### Web Viewer

```bash
lightningdoc --serve
# Open http://127.0.0.1:5050
```

Upload PDFs, view page images with bounding-box overlays, extract with one click, and see per-stage timing breakdowns.

---

## 🏗 Architecture

```
PDF ──→ Stage 1: Layout Detection     (PyMuPDF, ~2ms/page)
         ├─ Page structure & bboxes
         ├─ Font metadata & columns
         └─ Image positions & reading order

     ──→ Stage 2: Text Extraction      (parallel, ~10ms/page)
         ├─ Native text → Markdown (math, tables, headings)
         ├─ Ligature & encoding repair
         ├─ Header/footer stripping
         ├─ Multi-strategy Tesseract OCR (scanned pages)
         ├─ TrOCR handwriting (optional)
         ├─ EasyOCR fusion (optional)
         └─ Embedded image OCR (concurrent)

     ──→ Clean Markdown output
```
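
A common approach to the header/footer stripping step in Stage 2 is to look for lines that recur at the top or bottom of most pages. The sketch below illustrates that idea on plain lists of text lines. It is not LightningDoc's implementation, which the feature list describes as pattern and font-size aware; `strip_repeated_margins` and `min_share` are hypothetical.

```python
import re
from collections import Counter


def _normalize(line: str) -> str:
    """Collapse digits so 'Page 3' and 'Page 12' count as the same line."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_margins(pages: list[list[str]], min_share: float = 0.6) -> list[list[str]]:
    """Drop first/last lines that recur on at least `min_share` of the pages."""
    counts: Counter[str] = Counter()
    for lines in pages:
        # Count each margin line once per page, even on one-line pages.
        for key in {_normalize(line) for line in lines[:1] + lines[-1:]}:
            counts[key] += 1
    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and _normalize(body[0]) in repeated:
            body = body[1:]
        if body and _normalize(body[-1]) in repeated:
            body = body[:-1]
        cleaned.append(body)
    return cleaned


pages = [
    ["ACME Annual Report", "Revenue grew.", "Page 1"],
    ["ACME Annual Report", "Costs fell.", "Page 2"],
    ["ACME Annual Report", "Outlook is stable.", "Page 3"],
]
print(strip_repeated_margins(pages))
# [['Revenue grew.'], ['Costs fell.'], ['Outlook is stable.']]
```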

---

## 📦 Package Structure

```
lightningdoc/
├── types.py              # TextSpan, TextBlock, PageLayout, ExtractionResult
├── orchestrator.py       # Pipeline coordinator
├── cli.py                # CLI entry point
├── server.py             # Flask web viewer
├── pipeline/
│   ├── stage1_layout.py  # Layout detection (pure PyMuPDF)
│   ├── stage2_extract.py # Extraction orchestrator
│   ├── math.py           # Math font detection & LaTeX conversion
│   ├── tables.py         # Table extraction (PyMuPDF + heuristic)
│   ├── headers.py        # Header/footer detection & stripping
│   ├── ocr.py            # Multi-strategy OCR
│   └── markdown.py       # Block-to-Markdown conversion
├── preprocessing/
│   ├── ligatures.py      # Unicode ligature repair
│   └── ocr_cleanup.py    # OCR text cleanup & fusion
└── models/
    └── trocr.py           # TrOCR handwriting model (lazy-loaded)
```
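
To give a sense of what a ligature repair step such as `preprocessing/ligatures.py` has to do, here is a minimal sketch that expands the Latin ligature code points from the Unicode Alphabetic Presentation Forms block. The mapping and the function name `repair_ligatures` are illustrative assumptions, not the package's code.

```python
# Latin ligatures (U+FB00 to U+FB06) and their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t, folded to plain "st"
    "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)


def repair_ligatures(text: str) -> str:
    """Replace single-code-point ligatures with their component letters."""
    return text.translate(_TABLE)


print(repair_ligatures("e\ufb03cient work\ufb02ow"))
# efficient workflow
```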

---

## ⚡ Performance

| Document type | Pages/sec | Method |
|---|---|---|
| Native-text PDF | **80+** | Layout parsing |
| Scanned PDF | **~1** | Tesseract OCR (parallel workers) |
| Handwritten form | **~0.15** | TrOCR + Tesseract hybrid |

- CPU-first — no CUDA required
- Apple Silicon MPS acceleration for TrOCR
- Parallel OCR workers for scanned pages

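The throughput figures above translate into rough wall-clock estimates by dividing the page count by pages per second. A small worked example for a 200-page document, using the approximate numbers from the table; `estimate_seconds` is a hypothetical helper, and real timings will vary with hardware and content. It works out to about 2.5 s for native text, 200 s for scanned pages, and roughly 22 minutes for handwritten forms.

```python
# Approximate throughput figures from the performance table (pages per second).
PAGES_PER_SEC = {
    "native": 80.0,
    "scanned": 1.0,
    "handwritten": 0.15,
}


def estimate_seconds(pages: int, doc_type: str) -> float:
    """Rough wall-clock estimate: pages divided by pages per second."""
    return pages / PAGES_PER_SEC[doc_type]


for doc_type in PAGES_PER_SEC:
    print(f"{doc_type:>11}: {estimate_seconds(200, doc_type):8.1f} s for 200 pages")
```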
---

## 🔧 Optional Dependencies

| Extra | What it adds |
|---|---|
| `lightningdoc[llm]` | TrOCR handwriting recognition |
| `lightningdoc[easyocr]` | EasyOCR fusion for scanned pages |
| `lightningdoc[all]` | Everything |

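To see which extras are usable in the current environment, one option is to check whether their modules can be found without importing them. A sketch using the standard library; the module lists follow the dependencies declared for each extra, and `available_extras` is a hypothetical helper, not a LightningDoc function.

```python
from importlib.util import find_spec

# Top-level modules pulled in by each extra, per the package's declared dependencies.
EXTRA_MODULES = {
    "llm": ["torch", "transformers", "huggingface_hub"],
    "easyocr": ["easyocr"],
}


def available_extras() -> dict[str, bool]:
    """Map each extra to True if all of its modules can be located."""
    return {
        extra: all(find_spec(mod) is not None for mod in mods)
        for extra, mods in EXTRA_MODULES.items()
    }


print(available_extras())
```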
---

## License

Apache 2.0 — see [LICENSE](LICENSE) for details.
