Metadata-Version: 2.4
Name: lightningdoc
Version: 1.0.0
Summary: High-performance 3-stage PDF to Markdown extraction engine with layout detection, multi-strategy OCR, and NLP post-processing.
Author-email: Ayush Khamrui <khamruiasok@gmail.com>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/ayush-khamrui/lightningdoc
Project-URL: Repository, https://github.com/ayush-khamrui/lightningdoc
Project-URL: Issues, https://github.com/ayush-khamrui/lightningdoc/issues
Keywords: pdf,ocr,markdown,document-extraction,text-extraction,layout-detection,tesseract,pipeline
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: Pillow>=10.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: PyMuPDF>=1.23
Requires-Dist: opencv-python>=4.8
Requires-Dist: flask>=3.0
Provides-Extra: llm
Requires-Dist: torch>=2.0; extra == "llm"
Requires-Dist: transformers>=4.36; extra == "llm"
Requires-Dist: huggingface-hub>=0.20; extra == "llm"
Provides-Extra: easyocr
Requires-Dist: easyocr>=1.7; extra == "easyocr"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: mypy>=1.5; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Provides-Extra: all
Requires-Dist: lightningdoc[dev,easyocr,llm]; extra == "all"
Dynamic: license-file

# LightningDoc ⚡

**High-performance 3-stage PDF → Markdown extraction engine.**

LightningDoc extracts clean, structured Markdown from PDFs of every kind: native text, scanned documents, handwritten forms, or a mix of these. It combines layout-aware parsing, multi-strategy OCR, and optional AI post-processing in a single pip-installable package.

[![PyPI version](https://img.shields.io/pypi/v/lightningdoc.svg)](https://pypi.org/project/lightningdoc/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

---

## ✨ Features

- **3-Stage Pipeline** — Layout Detection → Text Extraction → NLP Post-Processing
- **~25 ms/page** for native-text PDFs (200 pages in about 5 seconds)
- **Multi-strategy OCR** — Tesseract (CLAHE+OTSU, contrast+sharpen), TrOCR handwriting, EasyOCR fusion
- **GLM-OCR Vision Judge** — 0.9B param multimodal model re-reads scanned pages for higher accuracy
- **SmolLM2 NLP** — 360M param local LLM for OCR correction and document classification
- **Zero API keys** — everything runs 100% offline after first model download
- **Apple Silicon optimised** — MPS acceleration for all neural models
- **Built-in web UI** — interactive viewer with bounding-box overlay, upload, extraction dashboard

---

## 🚀 Quick Start

### Installation

```bash
pip install lightningdoc
```

With AI models (OCR correction, document classification, GLM-OCR):

```bash
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "lightningdoc[llm]"
```

> **System requirement:** [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) must be installed separately:
> - macOS: `brew install tesseract`
> - Ubuntu: `sudo apt install tesseract-ocr`
> - Windows: [installer](https://github.com/UB-Mannheim/tesseract/wiki)

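Because Tesseract is installed outside of pip, it can help to confirm that the executable is reachable before running OCR. The snippet below is a minimal sketch using only the standard library; `find_tesseract` and `tesseract_version` are hypothetical helper names and are not part of LightningDoc.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the path of the `tesseract` executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable: str) -> str:
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    # Some builds write the version banner to stderr rather than stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else "unknown"


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; install it before running OCR.")
    else:
        print(f"Found {path}: {tesseract_version(path)}")
```

If the check reports that Tesseract is missing, install it with one of the commands listed above and open a new terminal so the updated PATH is picked up.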
### Python API

```python
from pathlib import Path
from lightningdoc import extract_text_from_pdf, extract_with_timing

# Simple text extraction
text = extract_text_from_pdf(Path("report.pdf"))
print(text)

# With full timing breakdown
result = extract_with_timing(Path("report.pdf"))
print(f"{result.word_count} words in {result.total_ms:.0f}ms")
print(f"Stage 1 (layout):  {result.stage1_ms:.0f}ms")
print(f"Stage 2 (extract): {result.stage2_ms:.0f}ms")
print(f"Stage 3 (NLP):     {result.stage3_ms:.0f}ms")
print(f"Document type:     {result.doc_type}")
```
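To convert a whole folder from Python, the same function can be called in a loop. The following is a sketch rather than a built-in feature: `convert_directory` is a hypothetical helper, and it assumes `extract_text_from_pdf` returns the Markdown as a `str`, as the example above suggests. The extractor is passed in as a parameter so the helper can be exercised without LightningDoc installed.

```python
from pathlib import Path
from typing import Callable, List, Optional


def convert_directory(
    src_dir: Path,
    out_dir: Path,
    extract: Optional[Callable[[Path], str]] = None,
) -> List[Path]:
    """Convert every *.pdf in src_dir to a .md file in out_dir (hypothetical helper)."""
    if extract is None:
        # Imported lazily so the helper can be tested without LightningDoc installed.
        from lightningdoc import extract_text_from_pdf as extract

    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        markdown = extract(pdf_path)
        target = out_dir / f"{pdf_path.stem}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_directory(Path("pdfs"), Path("output"))` would write one `.md` file per PDF and return the list of files it created.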

### CLI

```bash
# Extract a PDF to Markdown
lightningdoc report.pdf

# Batch extract
lightningdoc *.pdf -o ./output

# With GLM-OCR vision judge (for scanned docs)
lightningdoc scanned.pdf --glm-ocr

# With TrOCR handwriting recognition
lightningdoc form.pdf --trocr

# Skip AI (rules-only, fastest)
lightningdoc report.pdf --no-llm
```

### Web Viewer

```bash
lightningdoc --serve
# Open http://127.0.0.1:5050
```

Upload PDFs, view page images with bounding-box overlays, extract with one click, and see per-stage timing breakdowns.

---

## 🏗 Architecture

```
PDF ──→ Stage 1: Layout Detection     (PyMuPDF, ~2ms/page)
         ├─ Page structure & bboxes
         ├─ Font metadata & columns
         └─ Image positions & reading order

     ──→ Stage 2: Text Extraction      (parallel, ~10ms/page)
         ├─ Native text → Markdown
         ├─ Ligature & encoding repair
         ├─ Multi-strategy Tesseract OCR
         ├─ TrOCR handwriting (optional)
         ├─ EasyOCR fusion (optional)
         └─ Embedded image OCR (concurrent)

     ──→ Stage 3: NLP Post-Processing  (rules + AI)
         ├─ Rule-based OCR corrections
         ├─ GLM-OCR vision judge (optional)
         ├─ SmolLM2 field extraction (fallback)
         └─ Document classification

     ──→ Clean Markdown output
```
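The staged flow in the diagram can be sketched as three callables chained together, with each one timed separately. This is an illustration of the pattern only, not the code in `orchestrator.py`: `TimedResult`, `run_pipeline`, and the stand-in stages are hypothetical names, and the field names simply mirror those printed in the Python API example.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class TimedResult:
    """Hypothetical result record with one timing field per stage."""
    text: str
    stage1_ms: float
    stage2_ms: float
    stage3_ms: float

    @property
    def total_ms(self) -> float:
        return self.stage1_ms + self.stage2_ms + self.stage3_ms


def _timed(func: Callable[[Any], Any], arg: Any) -> Tuple[Any, float]:
    """Run func(arg) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(arg)
    return result, (time.perf_counter() - start) * 1000.0


def run_pipeline(pdf_path, detect_layout, extract_text, postprocess) -> TimedResult:
    """Chain three stage callables, each consuming the previous stage's output."""
    layout, t1 = _timed(detect_layout, pdf_path)
    raw_text, t2 = _timed(extract_text, layout)
    clean_text, t3 = _timed(postprocess, raw_text)
    return TimedResult(clean_text, t1, t2, t3)


# Stand-in stages for illustration only; they do not read a real PDF.
result = run_pipeline(
    "report.pdf",
    detect_layout=lambda path: {"source": path, "blocks": ["Tbe quick brown fox"]},
    extract_text=lambda layout: "\n".join(layout["blocks"]),
    postprocess=lambda text: text.replace("Tbe", "The"),
)
print(result.text)
print(f"total: {result.total_ms:.3f} ms")
```

Keeping each stage behind a plain callable makes it easy to swap an optional step in or out, for example skipping the AI part of Stage 3 as `--no-llm` does.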

---

## 📦 Package Structure

```
lightningdoc/
├── types.py              # TextSpan, TextBlock, PageLayout, ExtractionResult
├── orchestrator.py       # Pipeline coordinator
├── cli.py                # CLI entry point
├── server.py             # Flask web viewer
├── pipeline/
│   ├── stage1_layout.py  # Layout detection
│   ├── stage2_extract.py # Text extraction + OCR
│   └── stage3_nlp.py     # NLP post-processing
├── preprocessing/
│   ├── ligatures.py      # Unicode ligature repair
│   └── ocr_cleanup.py    # Numeric fix, medical forms
├── models/
│   ├── trocr.py          # TrOCR handwriting model
│   └── glm_ocr.py        # GLM-OCR vision model
└── llm/
    ├── engine.py          # SmolLM2-360M inference
    ├── correction.py      # OCR post-correction
    └── classifier.py      # Document classification
```

---

## ⚡ Performance

| Document type | Throughput (pages/sec) | Method |
|---|---|---|
| Native-text PDF | **80+** | Layout parsing |
| Scanned PDF | **~1** | Tesseract OCR (parallel workers) |
| Handwritten form | **~0.15** | TrOCR + Tesseract hybrid |

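As a rough planning aid, the throughput figures in the table can be turned into a wall-clock estimate by dividing the page count by pages per second. The sketch below does exactly that; `estimate_seconds` is a hypothetical helper, and the numbers are the approximate values from the table, so real timings will vary with hardware and document content.

```python
# Approximate throughput figures copied from the table above (pages per second).
PAGES_PER_SEC = {
    "native": 80.0,
    "scanned": 1.0,
    "handwritten": 0.15,
}


def estimate_seconds(pages: int, doc_type: str) -> float:
    """Estimate wall-clock seconds for a document of the given type."""
    return pages / PAGES_PER_SEC[doc_type]


for doc_type in PAGES_PER_SEC:
    print(f"{doc_type:12s} 200 pages ~ {estimate_seconds(200, doc_type):8.1f} s")
```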
- CPU-first — no CUDA required
- Apple Silicon MPS acceleration for neural models
- Parallel OCR workers for scanned pages
- Background model preloading (overlaps Stage 1+2)

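Background preloading means starting the slow model load in a worker thread while the earlier stages run, so Stage 3 waits only for whatever load time remains. The following is a minimal sketch of that idea using `concurrent.futures`; the function names are hypothetical and `time.sleep` stands in for the real work, so it shows the pattern rather than LightningDoc's actual implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def load_model() -> str:
    """Stand-in for a slow model load (sleeps instead of loading weights)."""
    time.sleep(0.2)
    return "model-ready"


def run_stages_1_and_2(pdf_path: str) -> str:
    """Stand-in for layout detection plus text extraction."""
    time.sleep(0.2)
    return f"raw text from {pdf_path}"


def extract_with_preload(pdf_path: str) -> Tuple[str, str]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start loading the model before the earlier stages begin.
        model_future = pool.submit(load_model)
        raw_text = run_stages_1_and_2(pdf_path)
        # Blocks only if the model is still loading when Stage 3 needs it.
        model = model_future.result()
    return raw_text, model


start = time.perf_counter()
raw_text, model = extract_with_preload("report.pdf")
elapsed = time.perf_counter() - start
print(f"{model}; finished in {elapsed:.2f} s (running the two steps in sequence would take about 0.40 s)")
```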
---

## 🔧 Optional Dependencies

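| Extra | What it adds |
|---|---|
| `lightningdoc[llm]` | SmolLM2 OCR correction + document classification + GLM-OCR vision judge |
| `lightningdoc[easyocr]` | EasyOCR fusion for scanned pages |
| `lightningdoc[dev]` | Development tools: pytest, pytest-cov, ruff, mypy, build, twine |
| `lightningdoc[all]` | All of the above (`dev`, `easyocr`, `llm`) |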

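To see which optional extras are usable in the current environment, one option is to check whether their modules can be imported. The sketch below uses only the standard library; `available_extras` is a hypothetical helper, and the module lists are taken from the dependencies each extra declares.

```python
from importlib.util import find_spec
from typing import Dict

# Import names assumed for each extra, based on the package's declared dependencies.
EXTRA_MODULES = {
    "llm": ["torch", "transformers", "huggingface_hub"],
    "easyocr": ["easyocr"],
}


def available_extras() -> Dict[str, bool]:
    """Report which optional extras have all of their modules importable."""
    return {
        extra: all(find_spec(name) is not None for name in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


print(available_extras())
```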
---

## License

Apache 2.0 — see [LICENSE](LICENSE) for details.
