Metadata-Version: 2.4
Name: pdf-ocr-confidence
Version: 0.1.0
Summary: Estimate OCR reading confidence for PDF files before running expensive OCR
Author-email: Carlos Robles <roblesoftceo@gmail.com>
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: PyMuPDF>=1.23.0
Requires-Dist: Pillow>=10.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: opencv-python>=4.8.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"

# PDF OCR Confidence

A Python library to estimate OCR reading confidence for PDF files **before** running expensive OCR models.

## What It Does

Analyzes a PDF and returns a confidence score between 0 and 1 that estimates how well OCR is likely to perform on it. Use the score to:
- Route low-quality PDFs to expensive/robust OCR models
- Route high-quality PDFs to cheap/fast OCR models
- Skip OCR entirely for native text PDFs

## Features

- ✅ Detects a native text layer (confidence is 1.0 immediately, so OCR can be skipped)
- ✅ Image quality analysis (resolution, blur, contrast, noise)
- ✅ Text region detection and edge analysis
- ✅ Per-page and document-level confidence scores
- ✅ Configurable thresholds for routing decisions

## Installation

```bash
# Editable install; run from the root of a source checkout
pip install -e .
```

## Quick Start

```python
from pdf_ocr_confidence import analyze_pdf

# Analyze a PDF
result = analyze_pdf("document.pdf")

print(f"Confidence: {result.confidence:.2f}")
print(f"Recommendation: {result.recommendation}")  # "native", "cheap_ocr", "expensive_ocr"

# Access per-page details
for page in result.pages:
    print(f"Page {page.number}: {page.confidence:.2f} - {page.quality_metrics}")
```
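
For a batch of files, the same `recommendation` field can drive routing. The following is a minimal sketch, not part of the library: `group_by_recommendation` is a hypothetical helper, and it takes the analyzer as an argument so that it relies only on the `recommendation` attribute shown above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_recommendation(
    paths: Iterable[str],
    analyze: Callable[[str], object],
) -> Dict[str, List[str]]:
    """Group PDF paths by the recommendation the analyzer returns.

    Hypothetical helper: `analyze` is expected to behave like
    `analyze_pdf`, returning an object with a `recommendation` attribute.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        result = analyze(path)
        groups[result.recommendation].append(path)
    return dict(groups)
```

Calling `group_by_recommendation(["a.pdf", "b.pdf"], analyze_pdf)` would then return a dictionary keyed by `"native"`, `"cheap_ocr"`, or `"expensive_ocr"`, with each key mapping to the files that should go to that pipeline.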

## How It Works

1. **Native Text Detection**: Checks whether the PDF has an extractable text layer
2. **Image Quality Analysis**:
   - DPI/Resolution check
   - Blur detection (Laplacian variance)
   - Contrast analysis (histogram)
   - Edge density (text clarity)
3. **Confidence Scoring**: Weighted combination of metrics
4. **Routing Recommendation**: Based on configurable thresholds
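
To make the blur and scoring steps concrete, here is a dependency-free sketch of Laplacian variance and a weighted combination of metrics. It is an illustration only, not the library's implementation: the function names, metric names, and weights are all assumptions, and the metrics are assumed to be normalized to the 0-1 range already. With OpenCV, the blur measure is commonly computed as `cv2.Laplacian(gray, cv2.CV_64F).var()`.

```python
from typing import Dict, List, Sequence


def laplacian_variance(gray: Sequence[Sequence[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a 2D grid of grayscale intensities. Low variance means few
    sharp intensity changes, which usually indicates a blurry image.
    """
    height, width = len(gray), len(gray[0])
    responses: List[float] = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(lap)
    if not responses:  # image too small to have interior pixels
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def combine(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of metrics, clamped to [0, 1].

    Assumes every metric is already normalized to [0, 1]; the weights
    are hypothetical and do not need to sum to 1.
    """
    total = sum(weights.values())
    score = sum(weights[name] * metrics[name] for name in weights) / total
    return min(1.0, max(0.0, score))


# A flat image has no edges; a checkerboard has many sharp transitions.
flat = [[128] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]

# Hypothetical equal weighting of four normalized metrics.
example_score = combine(
    {"resolution": 1.0, "blur": 0.5, "contrast": 0.5, "edges": 0.0},
    {"resolution": 0.25, "blur": 0.25, "contrast": 0.25, "edges": 0.25},
)
```

The raw Laplacian variance is unbounded, so a real pipeline would need to map it into the 0-1 range (for example by dividing by a reference value and clamping) before passing it to a function like `combine`.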

## Configuration

```python
from pdf_ocr_confidence import analyze_pdf, ConfidenceConfig

config = ConfidenceConfig(
    expensive_ocr_threshold=0.4,  # Below this = expensive OCR
    cheap_ocr_threshold=0.7,      # Above this = cheap OCR
    sample_pages=3,               # Pages to analyze (None = all)
    min_dpi=150,                  # Minimum acceptable DPI
)

result = analyze_pdf("document.pdf", config=config)
```
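
The way the two thresholds could map a score to a recommendation is sketched below. This is an assumption about the behavior, not the library's code: `recommend` and its `middle_band` parameter are hypothetical, and the text above does not say what happens for scores between the two thresholds, so the sketch makes that choice explicit and defaults it to the more robust option.

```python
def recommend(
    confidence: float,
    has_native_text: bool,
    expensive_ocr_threshold: float = 0.4,
    cheap_ocr_threshold: float = 0.7,
    middle_band: str = "expensive_ocr",
) -> str:
    """Map a confidence score to a routing label.

    Hypothetical sketch. `middle_band` is the label used for scores
    between the two thresholds, which the README leaves unspecified.
    """
    if has_native_text:
        return "native"  # text layer present, OCR can be skipped
    if confidence > cheap_ocr_threshold:
        return "cheap_ocr"
    if confidence < expensive_ocr_threshold:
        return "expensive_ocr"
    # Assumption: treat in-between scores conservatively by default.
    return middle_band
```

Under these assumptions, a scanned page scoring 0.9 routes to `"cheap_ocr"`, one scoring 0.2 routes to `"expensive_ocr"`, and one scoring 0.5 falls into the middle band.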

## Dependencies

- `PyMuPDF` (fitz) - PDF text extraction and rendering
- `Pillow` - Image processing
- `numpy` - Numerical operations
- `opencv-python` - Advanced image analysis

## License

MIT
