Metadata-Version: 2.4
Name: trollfab-ocr
Version: 1.0.0
Summary: Multi-engine OCR toolkit — Tesseract, OpenAI, Mistral, Anthropic, Google Vision, ensemble voting, image preprocessing, and Swedish text post-processing
Author-email: Trollfabriken AITrix AB <dev@trollfabriken.se>
License: MIT
Keywords: ocr,tesseract,openai,mistral,anthropic,google-vision,swedish,nlp,pdf
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pillow>=10.0
Provides-Extra: tesseract
Requires-Dist: pytesseract>=0.3; extra == "tesseract"
Provides-Extra: openai
Requires-Dist: openai>=1.40; extra == "openai"
Provides-Extra: mistral
Requires-Dist: mistralai>=1.0; extra == "mistral"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.34; extra == "anthropic"
Provides-Extra: google
Requires-Dist: google-cloud-vision>=3.7; extra == "google"
Provides-Extra: pdf
Requires-Dist: pdfplumber>=0.11; extra == "pdf"
Provides-Extra: preprocessing
Requires-Dist: numpy>=1.24; extra == "preprocessing"
Requires-Dist: opencv-python-headless>=4.8; extra == "preprocessing"
Requires-Dist: scipy>=1.11; extra == "preprocessing"
Provides-Extra: all
Requires-Dist: pytesseract>=0.3; extra == "all"
Requires-Dist: openai>=1.40; extra == "all"
Requires-Dist: mistralai>=1.0; extra == "all"
Requires-Dist: anthropic>=0.34; extra == "all"
Requires-Dist: google-cloud-vision>=3.7; extra == "all"
Requires-Dist: pdfplumber>=0.11; extra == "all"
Requires-Dist: numpy>=1.24; extra == "all"
Requires-Dist: opencv-python-headless>=4.8; extra == "all"
Requires-Dist: scipy>=1.11; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: black>=24.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.9; extra == "dev"
Dynamic: license-file

# trollfab-ocr

Multi-engine OCR toolkit with quality scoring, image preprocessing, ensemble voting,
and Swedish text post-processing — built for
[Trollfabriken AITrix AB](https://trollfabriken.se) document processing pipelines.

---

## Engines

| Engine | Class | Backend |
|---|---|---|
| Tesseract | `TesseractOCR` | Local, offline (requires Tesseract binary) |
| OpenAI | `VisionOCR` | GPT-4o vision API |
| Mistral | `MistralOCR` / `MistralOCRDedicated` | Pixtral + dedicated OCR API |
| Anthropic | `AnthropicOCR` | Claude vision API |
| Google Vision | `GoogleVisionOCR` | Cloud Vision `document_text_detection` |
| Multi-engine | `MultiEngineOCR` | Auto-fallback orchestrator |
| Ensemble | `OCREnsemble` | Voting-based merger across engines |

---

## Installation

```bash
# Core only (Pillow is the only required dependency)
pip install trollfab-ocr

# With specific engine support
pip install "trollfab-ocr[tesseract]"    # pytesseract
pip install "trollfab-ocr[openai]"       # OpenAI GPT-4o
pip install "trollfab-ocr[mistral]"      # Mistral OCR API
pip install "trollfab-ocr[anthropic]"    # Anthropic Claude
pip install "trollfab-ocr[google]"       # Google Cloud Vision
pip install "trollfab-ocr[pdf]"          # pdfplumber (pre-OCR routing)
pip install "trollfab-ocr[preprocessing]"  # numpy + OpenCV + scipy
pip install "trollfab-ocr[all]"          # everything
```
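
Because every engine lives behind an optional extra, it can help to check which extras are importable before constructing engines. The snippet below is a minimal sketch using only the standard library; `EXTRA_MODULES`, `module_available`, and `installed_extras` are hypothetical helpers, not part of `multi_ocr`. The import names are those of the packages listed in the extras above.

```python
from importlib.util import find_spec

# Extra name -> import names of the packages that extra installs.
EXTRA_MODULES = {
    "tesseract": ["pytesseract"],
    "openai": ["openai"],
    "mistral": ["mistralai"],
    "anthropic": ["anthropic"],
    "google": ["google.cloud.vision"],
    "pdf": ["pdfplumber"],
    "preprocessing": ["numpy", "cv2", "scipy"],
}


def module_available(name: str) -> bool:
    """True if `name` can be located without fully importing the leaf module."""
    try:
        return find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package (e.g. "google") is missing.
        return False


def installed_extras() -> list[str]:
    """Names of extras whose modules are all importable."""
    return [
        extra
        for extra, modules in EXTRA_MODULES.items()
        if all(module_available(m) for m in modules)
    ]
```

Calling `installed_extras()` returns a list such as `["tesseract", "pdf"]`, depending on the environment.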

---

## Quick start

### Single engine

```python
from multi_ocr import TesseractOCR, VisionOCR, MistralOCRDedicated

# Tesseract (local)
ocr = TesseractOCR(languages=["swe", "eng"])
result = ocr.extract_text("scan.png")
print(result.text, result.quality_score)

# OpenAI GPT-4o
ocr = VisionOCR()  # uses OPENAI_API_KEY from env
result = ocr.extract_text("document.jpg")

# Mistral dedicated OCR (best for PDFs)
ocr = MistralOCRDedicated()  # uses MISTRAL_API_KEY from env
result = ocr.extract_from_pdf("report.pdf")
```

### Auto-fallback orchestrator

```python
from multi_ocr import MultiEngineOCR

ocr = MultiEngineOCR()  # uses all available engines
result = ocr.extract("document.png")
print(result.text, result.quality_score, result.engine_used)
```
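
This README does not describe how the fallback works internally. One plausible shape is to try engines in priority order and stop at the first result whose quality score clears a threshold. The sketch below assumes that behaviour; `Result`, `extract_with_fallback`, and the `0.7` threshold are hypothetical and not part of the `multi_ocr` API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Result:
    text: str
    quality_score: float
    engine_used: str


# An engine is modelled as (name, callable); the callable maps a path to (text, score).
Engine = tuple[str, Callable[[str], tuple[str, float]]]


def extract_with_fallback(
    path: str, engines: Sequence[Engine], min_quality: float = 0.7
) -> Optional[Result]:
    """Try engines in order; return the first good result, else the best seen."""
    best: Optional[Result] = None
    for name, run in engines:
        try:
            text, score = run(path)
        except Exception:
            continue  # engine unavailable or failed: move on to the next one
        candidate = Result(text, score, name)
        if score >= min_quality:
            return candidate  # good enough, skip the remaining engines
        if best is None or score > best.quality_score:
            best = candidate
    return best  # None if every engine failed
```

Ordering the list from cheapest to most expensive engine keeps API calls to a minimum when a local engine already produces acceptable text.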

### Ensemble voting

```python
from multi_ocr import OCREnsemble, TesseractOCR, VisionOCR, MistralOCR

ensemble = OCREnsemble(engines=[TesseractOCR(), VisionOCR(), MistralOCR()])
result = ensemble.extract_with_voting("scan.png")
print(result.text, result.quality_score, result.agreement_score)
print(result.engines_used, result.voting_method)
```
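
To illustrate what an agreement score and a line-level vote could look like, here is a minimal standard-library sketch. It is an assumption about the general technique, not the actual `OCREnsemble` implementation: `agreement_score` and `line_majority_vote` are hypothetical names, and the vote assumes all engines emit the same number of lines in the same order.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations, zip_longest


def agreement_score(texts: list[str]) -> float:
    """Mean pairwise similarity (0 to 1) between engine outputs."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 1.0  # a single output trivially agrees with itself
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def line_majority_vote(texts: list[str]) -> str:
    """For each line position, keep the variant that most engines produced."""
    merged = []
    for variants in zip_longest(*(t.splitlines() for t in texts), fillvalue=""):
        # On a tie, most_common keeps the variant seen first (earliest engine).
        winner, _count = Counter(variants).most_common(1)[0]
        merged.append(winner)
    return "\n".join(merged)
```

Real OCR outputs often differ in line breaks, so a production merger would need an alignment step before voting.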

### Image preprocessing

```python
from multi_ocr import ImagePreprocessor, TesseractPreprocessor, tesseract_preprocess

# Full preprocessing pipeline
prep = ImagePreprocessor()
enhanced = prep.prepare_for_ocr("noisy_scan.png")

# Tesseract-optimised 5-step pipeline (includes grayscale, binarize, denoise, enhance)
img = tesseract_preprocess("scan.jpg")

# Image quality analysis + targeted enhancement
from multi_ocr import ImageEnhancer
enhancer = ImageEnhancer()
analysis = enhancer.analyze_quality("document.jpg")
result = enhancer.enhance("document.jpg", preset="scan")
```
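
For readers unfamiliar with these steps, the following is a generic sketch of a grayscale, contrast, denoise, and threshold chain using only Pillow (the package's one required dependency). It is not the `multi_ocr` pipeline: `simple_preprocess`, the step order, and the fixed threshold of 128 are assumptions chosen for illustration.

```python
from PIL import Image, ImageFilter, ImageOps


def simple_preprocess(img: Image.Image, threshold: int = 128) -> Image.Image:
    """Return a black-and-white ("L" mode, values 0 or 255) copy of `img`."""
    gray = ImageOps.grayscale(img)                                 # 1. drop colour
    stretched = ImageOps.autocontrast(gray)                        # 2. stretch contrast
    denoised = stretched.filter(ImageFilter.MedianFilter(size=3))  # 3. remove speckle noise
    # 4. binarize with a fixed global threshold
    return denoised.point(lambda p: 255 if p > threshold else 0)
```

A fixed global threshold works for clean scans; unevenly lit photos usually need an adaptive threshold, which is where the OpenCV-based `preprocessing` extra comes in.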

### Swedish text post-processing

```python
from multi_ocr import SwedishTextPostProcessor

pp = SwedishTextPostProcessor()
clean = pp.process(raw_ocr_text)  # raw_ocr_text: a str from any engine, e.g. result.text
# Fixes ligatures, OCR artefacts, diacritical characters, municipality names
# (all 290 Swedish municipalities) and Swedish abbreviations, and normalises whitespace
```

### Pre-OCR routing (avoid unnecessary API calls)

```python
from multi_ocr import has_text_layer, is_simple, extract_text_layer

path = "document.pdf"
if is_simple(path):
    # Fast path: extract native text, no OCR needed
    text = extract_text_layer(path)
elif has_text_layer(path):
    # Has text but complex layout — use Mistral or Docling
    ...
else:
    # Scanned image — run full OCR pipeline
    ...
```
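
Text-layer detection of this kind can be approximated by checking how much text each page yields. The sketch below is an assumption about the approach, not the code in `routing.py`: `classify_pages`, `classify_pdf`, and the 50-character cutoff are hypothetical. It uses `pdfplumber` (from the `pdf` extra) only inside `classify_pdf`.

```python
def classify_pages(page_texts: list[str], min_chars: int = 50) -> str:
    """Return "text", "scanned", or "mixed" based on per-page text length."""
    has_text = [len(t.strip()) >= min_chars for t in page_texts]
    if has_text and all(has_text):
        return "text"      # every page has a usable text layer
    if not any(has_text):
        return "scanned"   # no page has one (also the result for zero pages)
    return "mixed"


def classify_pdf(path: str) -> str:
    import pdfplumber  # optional dependency, installed via the [pdf] extra

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None or "" for image-only pages
        return classify_pages([page.extract_text() or "" for page in pdf.pages])
```

A "mixed" result suggests running OCR only on the pages that lack text, which saves API calls on long documents.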

### Unicode cleaning

```python
from multi_ocr import clean_unicode, repair_ligatures

text = clean_unicode(raw_text)         # NFKD → ligature repair → NFC → remove zero-width
text = repair_ligatures("ﬁle ﬀ")      # → "file ff"
```
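
The normalisation order in the comment above can be reproduced with the standard library alone. This is a minimal sketch, not the code in `unicode_clean.py`; `clean_text`, `LIGATURES`, and `ZERO_WIDTH` are hypothetical names, and the set of zero-width characters is an assumption.

```python
import unicodedata

# Latin ligature code points U+FB00 to U+FB04 and their plain-letter forms.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
# Zero-width space, non-joiner, joiner, word joiner, and BOM, mapped to None (delete).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def clean_text(text: str) -> str:
    text = unicodedata.normalize("NFKD", text)  # compatibility decomposition
    # NFKD already expands these code points; the explicit table mirrors the
    # documented order and keeps the step usable on its own.
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    text = unicodedata.normalize("NFC", text)   # recompose, e.g. a + ring -> å
    return text.translate(ZERO_WIDTH)           # drop zero-width characters
```

Note that NFKD also folds other compatibility characters (for example, superscript digits become plain digits), which may or may not be desirable for a given document type.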

### LLM-based OCR enhancement

```python
from multi_ocr import OCREnhancer

enhancer = OCREnhancer()  # uses OPENAI_API_KEY from env
corrected = enhancer.correct_errors(raw_ocr_text)
doc_type = enhancer.identify_document_type(raw_ocr_text)
structure = enhancer.extract_structure(raw_ocr_text)
```

---

## Quality scoring

`QualityScorer` rates text on several axes (length, Swedish characters,
municipal patterns, document structure) and returns a score between 0 and 1.
All engines and the ensemble use it internally to select the best result.
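
As an illustration of multi-axis scoring, the sketch below combines the four axes named above into a weighted sum. It is not the `QualityScorer` implementation: the function name, weights, keyword list, and cutoffs are all assumptions.

```python
import re

SWEDISH_CHARS = set("åäöÅÄÖ")
# Hypothetical keywords typical of municipal documents.
MUNICIPAL_RE = re.compile(r"\b(kommun|nämnd|protokoll)\w*", re.IGNORECASE)
# Lines that start with a section sign or a numbered item.
STRUCTURE_RE = re.compile(r"^\s*(§\s*\d+|\d+\.)", re.MULTILINE)


def quality_score(text: str) -> float:
    """Heuristic score between 0 and 1; higher means more plausible Swedish output."""
    if not text.strip():
        return 0.0
    letters = [c for c in text if c.isalpha()]
    length = min(len(text) / 500, 1.0)  # saturates at 500 characters
    swedish_ratio = sum(c in SWEDISH_CHARS for c in letters) / max(len(letters), 1)
    swedish = min(swedish_ratio / 0.05, 1.0)  # saturates at 5 % å/ä/ö
    municipal = 1.0 if MUNICIPAL_RE.search(text) else 0.0
    structure = 1.0 if STRUCTURE_RE.search(text) else 0.0
    return round(0.3 * length + 0.3 * swedish + 0.2 * municipal + 0.2 * structure, 3)
```

A score like this is only useful for ranking candidate outputs of the same page against each other; the absolute value carries little meaning.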

---

## Environment variables

| Variable | Used by |
|---|---|
| `OPENAI_API_KEY` | `VisionOCR`, `OCREnhancer` |
| `MISTRAL_API_KEY` | `MistralOCR`, `MistralOCRDedicated` |
| `ANTHROPIC_API_KEY` | `AnthropicOCR` |
| `GOOGLE_APPLICATION_CREDENTIALS` | `GoogleVisionOCR` |
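
A quick way to see which API-backed engines are configured is to check these variables before constructing anything. The helper below is a hypothetical sketch (`ENGINE_ENV_VARS` and `configured_engines` are not part of `multi_ocr`); the mapping is taken from the table above.

```python
import os
from typing import Mapping

ENGINE_ENV_VARS = {
    "VisionOCR": "OPENAI_API_KEY",
    "MistralOCR": "MISTRAL_API_KEY",
    "AnthropicOCR": "ANTHROPIC_API_KEY",
    "GoogleVisionOCR": "GOOGLE_APPLICATION_CREDENTIALS",
}


def configured_engines(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Engine class names whose environment variable is set and non-empty."""
    return [name for name, var in ENGINE_ENV_VARS.items() if environ.get(var)]
```

Passing a plain dict instead of `os.environ` makes the check easy to unit-test.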

---

## Package structure

```
multi_ocr/
├── __init__.py              ← Public API
├── py.typed                 ← PEP 561 typed marker
├── ocr_engine.py            ← Core engines + QualityScorer + MultiEngineOCR
├── ensemble.py              ← OCREnsemble (voting-based merger)
├── voting.py                ← VotingStrategy (best_of_n / line_merge / quorum)
├── image_preprocessor.py    ← Deskew, denoise, binarize, region detection
├── image_enhancer.py        ← Quality analysis + targeted enhancement
├── tesseract_preprocessing.py ← 5-step Tesseract-optimised pipeline
├── ocr_enhancer.py          ← LLM-based correction + structure extraction
├── swedish_postprocessor.py ← 7-step Swedish OCR post-processing
├── mistral_ocr.py           ← Mistral dedicated OCR API (full PDF support)
├── google_vision_ocr.py     ← Google Cloud Vision integration
├── svg_table.py             ← SVG table generation from OCR output
├── routing.py               ← Pre-OCR routing (text-layer detection)
└── unicode_clean.py         ← Ligature repair + zero-width removal
```

---

© 2025 Trollfabriken AITrix AB — MIT License
