Metadata-Version: 2.4
Name: docuvision
Version: 2.0.0
Summary: Adaptive, modular OCR and document-AI pipeline orchestrator
Author: Tanup Vats
Maintainer: Tanup Vats
License: Apache-2.0
Project-URL: Homepage, https://github.com/Tanupvats/docuvision
Project-URL: Documentation, https://github.com/Tanupvats/docuvision#readme
Project-URL: Repository, https://github.com/Tanupvats/docuvision
Project-URL: Issues, https://github.com/Tanupvats/docuvision/issues
Project-URL: Changelog, https://github.com/Tanupvats/docuvision/blob/main/CHANGELOG.md
Keywords: ocr,document-ai,computer-vision,tesseract,paddleocr,easyocr,doctr,trocr,donut,craft,dbnet,east,text-detection,document-classification,pii-masking,embossed-text,chassis-number
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Typing :: Typed
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21
Requires-Dist: opencv-python-headless>=4.6
Requires-Dist: Pillow>=9.0
Requires-Dist: psutil>=5.9
Requires-Dist: scikit-learn>=1.0
Provides-Extra: tesseract
Requires-Dist: pytesseract>=0.3.10; extra == "tesseract"
Provides-Extra: easyocr
Requires-Dist: easyocr>=1.7.0; extra == "easyocr"
Provides-Extra: paddle-cpu
Requires-Dist: paddleocr>=2.7.0; extra == "paddle-cpu"
Requires-Dist: paddlepaddle>=2.5.0; extra == "paddle-cpu"
Provides-Extra: paddle-gpu
Requires-Dist: paddleocr>=2.7.0; extra == "paddle-gpu"
Requires-Dist: paddlepaddle-gpu>=2.5.0; extra == "paddle-gpu"
Provides-Extra: doctr
Requires-Dist: python-doctr[torch]>=0.7.0; extra == "doctr"
Provides-Extra: trocr
Requires-Dist: torch>=2.0; extra == "trocr"
Requires-Dist: torchvision>=0.15; extra == "trocr"
Requires-Dist: transformers>=4.35; extra == "trocr"
Requires-Dist: sentencepiece>=0.1.99; extra == "trocr"
Requires-Dist: timm>=0.9; extra == "trocr"
Provides-Extra: donut
Requires-Dist: torch>=2.0; extra == "donut"
Requires-Dist: torchvision>=0.15; extra == "donut"
Requires-Dist: transformers>=4.35; extra == "donut"
Requires-Dist: sentencepiece>=0.1.99; extra == "donut"
Requires-Dist: timm>=0.9; extra == "donut"
Provides-Extra: yolo
Requires-Dist: ultralytics>=8.0; extra == "yolo"
Provides-Extra: onnx-cpu
Requires-Dist: onnxruntime>=1.16; extra == "onnx-cpu"
Provides-Extra: onnx-gpu
Requires-Dist: onnxruntime-gpu>=1.16; extra == "onnx-gpu"
Provides-Extra: hdbscan
Requires-Dist: hdbscan>=0.8.33; extra == "hdbscan"
Provides-Extra: cpu
Requires-Dist: pytesseract>=0.3.10; extra == "cpu"
Requires-Dist: easyocr>=1.7.0; extra == "cpu"
Requires-Dist: paddleocr>=2.7.0; extra == "cpu"
Requires-Dist: paddlepaddle>=2.5.0; extra == "cpu"
Requires-Dist: python-doctr[torch]>=0.7.0; extra == "cpu"
Requires-Dist: onnxruntime>=1.16; extra == "cpu"
Requires-Dist: hdbscan>=0.8.33; extra == "cpu"
Provides-Extra: gpu
Requires-Dist: pytesseract>=0.3.10; extra == "gpu"
Requires-Dist: easyocr>=1.7.0; extra == "gpu"
Requires-Dist: paddleocr>=2.7.0; extra == "gpu"
Requires-Dist: paddlepaddle-gpu>=2.5.0; extra == "gpu"
Requires-Dist: python-doctr[torch]>=0.7.0; extra == "gpu"
Requires-Dist: torch>=2.0; extra == "gpu"
Requires-Dist: torchvision>=0.15; extra == "gpu"
Requires-Dist: transformers>=4.35; extra == "gpu"
Requires-Dist: sentencepiece>=0.1.99; extra == "gpu"
Requires-Dist: timm>=0.9; extra == "gpu"
Requires-Dist: ultralytics>=8.0; extra == "gpu"
Requires-Dist: onnxruntime-gpu>=1.16; extra == "gpu"
Requires-Dist: hdbscan>=0.8.33; extra == "gpu"
Provides-Extra: all
Requires-Dist: pytesseract>=0.3.10; extra == "all"
Requires-Dist: easyocr>=1.7.0; extra == "all"
Requires-Dist: paddleocr>=2.7.0; extra == "all"
Requires-Dist: python-doctr[torch]>=0.7.0; extra == "all"
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: torchvision>=0.15; extra == "all"
Requires-Dist: transformers>=4.35; extra == "all"
Requires-Dist: sentencepiece>=0.1.99; extra == "all"
Requires-Dist: timm>=0.9; extra == "all"
Requires-Dist: ultralytics>=8.0; extra == "all"
Requires-Dist: onnxruntime>=1.16; extra == "all"
Requires-Dist: hdbscan>=0.8.33; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: mypy>=1.5; extra == "dev"
Dynamic: license-file

# DocuVision

**Adaptive, modular document OCR & AI pipeline orchestrator.** DocuVision detects
your system's capabilities (CPU, GPU, VRAM, installed ML libraries) and
dynamically picks the best feasible OCR and document-AI stack — from Tesseract
on a 4 GB CPU box all the way up to TrOCR / Donut on a multi-GPU workstation.

## Features

- **System profiler** — OS, CPU, GPU, CUDA, VRAM, RAM, TensorRT, installed libs
- **Tiered engine selection** — Tier 0 (CPU / Tesseract) → Tier 4 (GPU / TrOCR, Donut)
- **Unified OCR registry** — Tesseract, EasyOCR, PaddleOCR (CPU/GPU), docTR, TrOCR, Donut
- **Text detection** — CRAFT, DBNet, EAST, PaddleOCR-det + pure-OpenCV contour fallback
- **Clustering** — DBSCAN / HDBSCAN bounding-box merging with anisotropic distance
- **Document classification** — pluggable YOLO / ONNX / keyword backends, custom labels
- **PII masking** — YOLO or regex backends, gaussian blur / black box / pixelation
- **Embossed / engraved OCR** — CLAHE + Sobel + shape-from-shading + morphology, ideal
  for chassis numbers, industrial plates, metal serials
- **Graceful fallbacks** — no hard deps on heavy libs, missing engines are skipped

## Installation

```bash
# Core only — works with the pure-OpenCV detector but needs an OCR engine
# installed separately
pip install docuvision

# Common bundles
pip install "docuvision[tesseract]"   # + pytesseract (requires tesseract binary)
pip install "docuvision[easyocr]"     # + easyocr
pip install "docuvision[paddle-cpu]"  # + paddleocr + paddlepaddle (CPU)
pip install "docuvision[paddle-gpu]"  # + paddlepaddle-gpu
pip install "docuvision[doctr]"       # + python-doctr
pip install "docuvision[trocr]"       # + torch + transformers + TrOCR deps
pip install "docuvision[donut]"       # + torch + transformers + Donut deps

# Aggregate bundles
pip install "docuvision[cpu]"         # every CPU-capable engine
pip install "docuvision[gpu]"         # every engine, GPU-preferred
pip install "docuvision[all]"         # everything, library versions only

# Development
pip install "docuvision[dev]"
```

Heavy engines (torch, paddle, transformers…) are **optional**. DocuVision lazily
imports them, so a missing library simply disables that one engine; the rest of
the pipeline keeps working.

## Quick start

```python
from docuvision import DocumentPipeline, profile_system

# See what DocuVision picks for your system
print(profile_system().summary())

# Zero-config pipeline — uses the best available engine
pipe = DocumentPipeline()
result = pipe.run("invoice.png")
print(result.text)

# Full-featured pipeline
pipe = DocumentPipeline(
    detect_text=True,
    classify_doc=True,
    mask=True,
    embossed_mode=False,
    language=["en"],
)
result = pipe.run("aadhaar.jpg")
print(result.doc_class.label)  # → 'aadhaar'
print(result.text)
print(result.masks)            # detected PII regions
# result.masked_image is a numpy.ndarray ready for cv2.imwrite(...)
```

## Command-line

```bash
docuvision profile                            # show the capability report
docuvision ocr invoice.png                    # default pipeline
docuvision ocr plate.jpg --embossed           # embossed / chassis OCR
docuvision ocr id.jpg --classify --mask --mask-method blackbox
docuvision engines                            # list registered OCR engines
docuvision detectors                          # list registered detectors
```

## Design principles

1. **No hard deps on heavy libs.** Every engine is imported lazily. `pip install
   docuvision` drops in fine even without torch, paddle, or transformers.
2. **Explicit tiering.** The profiler scores your hardware and picks a tier; you
   can override it per call.
3. **Composable, not monolithic.** Each stage (detect, classify, OCR, mask) is
   an independent, swappable component with a common ABC and a registry.
4. **Sane defaults, full control.** `DocumentPipeline().run(img)` just works;
   every knob is exposed via `PipelineConfig`.
5. **Fails gracefully.** A per-stage failure is captured in
   `PipelineResult.metadata["failures"]` rather than aborting the pipeline.

## Testing

The full test suite runs against the core install only — heavy engines are not
required:

```bash
pytest tests/
# or, without pytest installed:
python run_tests.py
```

## License

Apache License 2.0 — see [LICENSE](LICENSE).
