Metadata-Version: 2.4
Name: trivision-ocr
Version: 1.0.3
Summary: Multi-engine OCR pipeline — beats Google Vision API
Author-email: Parthasarathy G <parthasarathyg693@gmail.com>
License-Expression: MIT
Keywords: ocr,tesseract,paddleocr,easyocr,vision-api
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: opencv-contrib-python-headless
Requires-Dist: pytesseract
Requires-Dist: Pillow
Requires-Dist: paddleocr
Requires-Dist: easyocr
Requires-Dist: symspellpy
Requires-Dist: numpy
Requires-Dist: python-dotenv
Requires-Dist: requests
Provides-Extra: cpu
Requires-Dist: paddlepaddle; extra == "cpu"
Provides-Extra: gpu
Requires-Dist: paddlepaddle-gpu; extra == "gpu"
Provides-Extra: api
Requires-Dist: fastapi; extra == "api"
Requires-Dist: uvicorn[standard]; extra == "api"
Requires-Dist: python-multipart; extra == "api"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"

# ocr-pipeline

Multi-engine OCR pipeline combining **Tesseract**, **PaddleOCR**, and **EasyOCR** with confidence-weighted voting and domain-aware spell correction. Output is compatible with the Google Vision API JSON schema.

## Architecture

```
Image → [Preprocessor] → Tesseract ┐
                       → PaddleOCR ├─→ [Voter] → [Corrector] → Vision API JSON
                       → EasyOCR   ┘
```

| Stage | What it does | Your edge vs Vision API |
|---|---|---|
| 0 — Preprocessor | Hough deskew, CLAHE, binarize, optional 2× EDSR | Domain-specific prep; Vision API gets raw images |
| 1 — Three engines | Tesseract (LSTM), PaddleOCR (DBNet+SVTR), EasyOCR (CRAFT+CRNN) | Different failure modes → ensemble voting mitigates each |
| 2 — Voter | IoU spatial grouping + weighted confidence + agreement bonus | Cross-engine consensus exposed; Vision API hides this |
| 3 — Corrector | SymSpell + domain vocabulary protection | Tuned to your label vocabulary; Vision API uses general LM |
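The voting stage can be sketched roughly as below. This is an illustrative model of IoU grouping plus confidence-weighted scoring with an agreement bonus, not the actual logic in `voter.py`; the threshold and bonus values are made up.

```python
# Illustrative sketch of Stage 2 voting -- not the actual voter.py code.
# Boxes are (x1, y1, x2, y2); each engine yields (box, text, confidence).

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def vote(detections, iou_thresh=0.5, agreement_bonus=0.1):
    """Group detections spatially, then keep the best text per group.

    Score = engine confidence + a bonus for every other engine that
    read the same text in the same region (hypothetical weighting).
    """
    groups = []
    for det in detections:
        for group in groups:
            if iou(det[0], group[0][0]) >= iou_thresh:
                group.append(det)
                break
        else:
            groups.append([det])

    results = []
    for group in groups:
        scored = []
        for box, text, conf in group:
            agree = sum(1 for _, t, _ in group if t == text) - 1
            scored.append((conf + agreement_bonus * agree, text, box))
        best = max(scored)
        results.append((best[2], best[1]))
    return results
```

With two engines agreeing on `"CAT"` at moderate confidence and one reading `"C4T"` at slightly higher confidence, the agreement bonus lets the consensus reading win — the cross-engine behavior the table describes.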

## Prerequisites

**Tesseract binary (Windows — required):**
```powershell
winget install UB-Mannheim.TesseractOCR
```

**Python 3.10+** must be installed and on PATH.

## Quick Start

```powershell
# 1. Clone / copy the project
cd ocr-pipeline

# 2. Create virtual environment and install (CPU default)
setup_venv.bat

# For GPU (CUDA + paddlepaddle-gpu):
setup_venv.bat gpu

# 3. Activate
.venv\Scripts\activate

# 4. Run on an image
ocr-pipeline path\to\image.jpg --pretty

# Save output to file
ocr-pipeline path\to\image.jpg --output result.json

# Text only
ocr-pipeline path\to\image.jpg --text-only

# Skip super-resolution (faster, no EDSR model needed)
ocr-pipeline path\to\image.jpg --no-super-resolve
```

## Python API

```python
from ocr_pipeline import beat_vision_api

result = beat_vision_api("path/to/image.jpg")
print(result["responses"][0]["fullTextAnnotation"]["text"])
```
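The output follows Google's public Vision API response shape, which can be sketched as below. Field names (`responses`, `textAnnotations`, `fullTextAnnotation`, `boundingPoly`) come from Google's published schema; the exact subset emitted by `pipeline.py` may differ, and `to_vision_json` here is a hypothetical helper, not part of the package API.

```python
# Sketch of the Vision-API-style JSON structure -- field names follow
# Google's public schema; to_vision_json is illustrative only.

def to_vision_json(words):
    """words: list of (text, [(x, y), ...] bounding quad) pairs."""
    annotations = [
        {
            "description": text,
            "boundingPoly": {"vertices": [{"x": x, "y": y} for x, y in quad]},
        }
        for text, quad in words
    ]
    full_text = " ".join(text for text, _ in words)
    return {
        "responses": [
            {
                "textAnnotations": annotations,
                "fullTextAnnotation": {"text": full_text},
            }
        ]
    }
```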

## Configuration

Copy `.env.example` to `.env` and edit:

```env
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe
USE_GPU=false
USE_SUPER_RESOLVE=false
```

Add project-specific label codes to `DOMAIN_TERMS` in `ocr_pipeline/config.py`.
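Domain-term protection in Stage 3 works along these lines: tokens found in `DOMAIN_TERMS` bypass spell correction, everything else goes through the corrector. The sketch below is illustrative (the example codes and `correct_text` helper are made up, not the actual `corrector.py`):

```python
# Illustrative sketch of domain-vocabulary protection -- not corrector.py.

DOMAIN_TERMS = {"MCB-32A", "RCCB", "ELCB"}  # example label codes only

def correct_text(tokens, spell_fix):
    """spell_fix: callable applying SymSpell-style correction to one token."""
    protected = {t.upper() for t in DOMAIN_TERMS}
    return [t if t.upper() in protected else spell_fix(t) for t in tokens]
```

This is why adding your label codes matters: without protection, a SymSpell-style corrector would "fix" valid codes into dictionary words.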

## Running Tests

```powershell
pytest tests/test_voter.py -v
```

All 7 tests run without GPU or model downloads.

## Optional: Super-Resolution

EDSR 2× upscaling (~143 MB model) significantly improves recognition of small text on diagram scans. To enable it:

1. Run `python download_models.py` and choose `y` when prompted for EDSR
2. Set `USE_SUPER_RESOLVE=true` in `.env`

## Project Structure

```
ocr-pipeline/
├── ocr_pipeline/       ← Python package
│   ├── config.py       ← All tunable parameters
│   ├── preprocessor.py ← Stage 0
│   ├── engines.py      ← Stage 1
│   ├── voter.py        ← Stage 2
│   ├── corrector.py    ← Stage 3
│   └── pipeline.py     ← Orchestration
├── tests/
│   └── test_voter.py   ← Unit tests (no GPU)
├── models/             ← Downloaded model files
├── setup_venv.bat      ← Windows setup
└── download_models.py  ← Model downloader
```
