Metadata-Version: 2.4
Name: pdf-ocr-confidence
Version: 0.2.0
Summary: Docling pipeline optimizer - Pre-filter PDFs to skip unnecessary OCR and route to optimal backends
Author-email: Carlos Robles <roblesoftceo@gmail.com>
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: PyMuPDF>=1.23.0
Requires-Dist: Pillow>=10.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: opencv-python>=4.8.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"

# PDF OCR Confidence

**Optimize your Docling pipeline at scale** — Pre-filter PDFs to skip unnecessary OCR processing and route documents to the right backend.

## What It Does

Analyzes PDF documents **before** expensive OCR processing to:
- ✅ **Skip Docling entirely** for native text PDFs (10x faster)
- ✅ **Route to optimal OCR backend** (Tesseract vs EasyOCR)
- ✅ **Batch triage** 10K+ PDFs into priority queues
- ✅ **Estimate processing time/cost** before committing resources

## Why Use This?

**Problem:** Docling initialization takes 2-3 seconds per PDF. For native text documents, that's pure overhead.

**Solution:** Pre-analyze PDFs in milliseconds, extract native text directly, and only use Docling when necessary.

### Performance Gains

Processing 1000 mixed-quality PDFs:

| Method | Time | Relative cost |
|--------|------|---------------|
| **Without pre-filtering** | 50 min | 100% |
| **With pre-filtering** | 23 min | 46% (54% faster) ✅ |

*Assumes 60% native text, 30% high-quality scans, 10% low-quality scans*
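
The table roughly follows from per-PDF arithmetic. A back-of-envelope sketch, assuming (hypothetically) ~3 s of Docling time per PDF and ~0.15 s of pre-analysis per PDF:

```python
# Back-of-envelope check of the table above. The per-PDF timings are
# hypothetical round numbers, not measured values.
def batch_minutes(n_pdfs, native_frac, docling_s=3.0, analyze_s=0.15):
    """Estimated wall-clock minutes for a pre-filtered batch."""
    analyzed = n_pdfs * analyze_s                  # every PDF is pre-analyzed
    ocr = n_pdfs * (1 - native_frac) * docling_s   # only non-native PDFs hit Docling
    return (analyzed + ocr) / 60

baseline = 1000 * 3.0 / 60           # all 1000 PDFs through Docling: 50 min
filtered = batch_minutes(1000, 0.6)  # 60% native text: ~22.5 min, i.e. the ~23 min above
```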

## Installation

```bash
pip install pdf-ocr-confidence
```

## Quick Start: Docling Integration

### 1. Fast-path for Native Text PDFs

```python
from pdf_ocr_confidence import should_use_docling, extract_native_text

if not should_use_docling("report.pdf"):
    # Skip Docling - extract text directly (10x faster)
    text = extract_native_text("report.pdf")
else:
    # Use Docling for scanned/low-quality PDFs
    from docling.document_converter import DocumentConverter
    doc = DocumentConverter().convert("report.pdf")
    text = doc.export_to_markdown()
```
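
Under the hood, `should_use_docling` comes down to detecting a usable text layer. A minimal, hypothetical sketch of that kind of check (the real implementation uses PyMuPDF; the function name and thresholds here are illustrative only):

```python
def looks_native(page_texts, min_chars=50, min_fraction=0.8):
    """Heuristic: treat a PDF as native text when most pages carry a
    substantial extractable text layer. Thresholds are illustrative."""
    if not page_texts:
        return False
    texty = sum(1 for text in page_texts if len(text.strip()) >= min_chars)
    return texty / len(page_texts) >= min_fraction

# A 3-page doc where every page has real text -> take the fast path
looks_native(["Lorem ipsum " * 10, "Report body " * 10, "Appendix " * 10])  # True
```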

### 2. Batch Triage & Optimal Routing

```python
from pdf_ocr_confidence import get_docling_strategy, extract_native_text
from docling.document_converter import DocumentConverter
from docling.datamodel.pipeline_options import PipelineOptions

# Analyze the PDF and get a routing recommendation
strategy = get_docling_strategy("invoice.pdf")

if not strategy["use_docling"]:
    # Native text - skip Docling
    text = extract_native_text("invoice.pdf")
elif strategy["ocr_backend"] == "tesseract":
    # High quality - use fast OCR
    doc = DocumentConverter(
        pipeline_options=PipelineOptions(
            do_ocr=True,
            ocr_backend="tesseract"
        )
    ).convert("invoice.pdf")
else:
    # Low quality - use robust OCR
    doc = DocumentConverter(
        pipeline_options=PipelineOptions(
            do_ocr=True,
            ocr_backend="easyocr"
        )
    ).convert("invoice.pdf")
```

### 3. Cost Estimation Before Processing

```python
from pdf_ocr_confidence import estimate_processing_time

estimate = estimate_processing_time("large_doc.pdf")
print(f"Expected time: {estimate['time_seconds']}s")
print(f"Recommended backend: {estimate['recommended_backend']}")
print(f"Page count: {estimate['page_count']}")
```
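
Per-PDF estimates can then be summed to budget a whole batch before committing it. A sketch of the aggregation, using hardcoded stand-in dicts in the shape shown above (real values would come from `estimate_processing_time`):

```python
# Hypothetical stand-ins for estimate_processing_time() results, so the
# aggregation logic is visible without a real PDF batch.
estimates = [
    {"time_seconds": 1.2, "recommended_backend": "native", "page_count": 4},
    {"time_seconds": 18.0, "recommended_backend": "tesseract", "page_count": 12},
    {"time_seconds": 45.0, "recommended_backend": "easyocr", "page_count": 9},
]

total_seconds = sum(e["time_seconds"] for e in estimates)

# Count how many PDFs each backend would receive
backend_counts = {}
for e in estimates:
    backend = e["recommended_backend"]
    backend_counts[backend] = backend_counts.get(backend, 0) + 1

print(f"Batch estimate: {total_seconds:.1f}s across {len(estimates)} PDFs")
print(f"Backend mix: {backend_counts}")
```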

## Advanced Usage

### Standalone Analysis (Without Docling)

```python
from pdf_ocr_confidence import analyze_pdf

result = analyze_pdf("document.pdf")

print(f"Confidence: {result.confidence:.2f}")
print(f"Recommendation: {result.recommendation}")

# Per-page details
for page in result.pages:
    print(f"Page {page.number}: {page.confidence:.2f}")
```

### Custom Thresholds

```python
from pdf_ocr_confidence import analyze_pdf, ConfidenceConfig

config = ConfidenceConfig(
    expensive_ocr_threshold=0.4,  # Below this = expensive OCR
    cheap_ocr_threshold=0.7,      # Above this = cheap OCR
    sample_pages=5,               # Pages to analyze (None = all)
    min_dpi=150,                  # Minimum acceptable DPI
)

result = analyze_pdf("document.pdf", config=config)
```

## How It Works

1. **Native Text Detection**: Checks if PDF has extractable text layer
2. **Image Quality Analysis**:
   - DPI/Resolution check
   - Blur detection (Laplacian variance)
   - Contrast analysis (histogram)
   - Edge density (text clarity)
3. **Confidence Scoring**: Weighted combination of metrics
4. **Routing Recommendation**: Native text / Tesseract / EasyOCR
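
Steps 3 and 4 can be sketched as a weighted score plus threshold routing. The metric names, weights, and cutoffs below are hypothetical, not the library's actual values:

```python
# Illustrative only: each quality metric is assumed normalized to [0, 1].
WEIGHTS = {"dpi": 0.25, "sharpness": 0.35, "contrast": 0.20, "edge_density": 0.20}

def confidence_score(metrics):
    """Weighted combination of per-page quality metrics."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

def route(has_native_text, score):
    """Map the score to one of the three routing recommendations."""
    if has_native_text:
        return "native"  # extractable text layer: skip OCR entirely
    return "tesseract" if score >= 0.7 else "easyocr"

page = {"dpi": 0.9, "sharpness": 0.8, "contrast": 0.7, "edge_density": 0.6}
route(False, confidence_score(page))  # "tesseract" (score = 0.765)
```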

## Real-World Example

```python
from pdf_ocr_confidence import get_docling_strategy, extract_native_text
from docling.document_converter import DocumentConverter

def process_pdf_batch(pdf_paths):
    native_count = 0
    docling_count = 0
    
    for pdf_path in pdf_paths:
        strategy = get_docling_strategy(pdf_path)
        
        if not strategy["use_docling"]:
            # Fast path: skip Docling
            text = extract_native_text(pdf_path)
            native_count += 1
        else:
            # Scanned/low quality: use Docling (pick the OCR backend
            # from strategy["ocr_backend"], as in the triage example)
            doc = DocumentConverter().convert(pdf_path)
            text = doc.export_to_markdown()
            docling_count += 1
    
    print(f"Processed {native_count} PDFs without Docling (fast)")
    print(f"Processed {docling_count} PDFs with Docling (OCR)")
```

**Result:** If 60% of your PDFs have native text, you save 60% of Docling initialization overhead.

## Examples

See `examples/docling_integration.py` for:
- Complete pipeline integration
- Batch processing with queues
- Cost estimation
- Priority-based routing

Run it:
```bash
python examples/docling_integration.py your_document.pdf
```

## Use Cases

✅ **Large-scale document processing** (thousands of PDFs)  
✅ **Mixed-quality document pipelines** (invoices, reports, scans)  
✅ **Cost optimization** for cloud OCR services  
✅ **Pre-filtering before Docling/Tesseract/EasyOCR**  
✅ **Queue-based batch processing**

❌ **Small batches** (< 100 PDFs) — overhead not worth it  
❌ **All scanned documents** — if everything needs OCR, skip pre-analysis  
❌ **Single OCR backend** — if you only use Tesseract, limited benefit

## Performance

| Operation | Time (avg) |
|-----------|------------|
| Native text detection | 0.05s per page |
| Quality analysis | 0.1s per page |
| Docling initialization | 2-3s per PDF |

**Breakeven point:** If >30% of your PDFs have native text, pre-filtering is worth it.
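
That breakeven figure is consistent with the per-operation times above, assuming (hypothetically) five sampled pages analyzed per PDF:

```python
# Sanity-check of the ~30% breakeven claim using the timing table above.
ANALYZE_S_PER_PAGE = 0.05 + 0.1  # native-text detection + quality analysis
DOCLING_INIT_S = 2.5             # midpoint of the 2-3 s initialization range

def breakeven_native_fraction(sample_pages=5):
    """Fraction of native-text PDFs at which pre-filtering pays for itself:
    the per-PDF analysis cost divided by the savings per skipped Docling run."""
    analysis_cost = sample_pages * ANALYZE_S_PER_PAGE  # paid for every PDF
    return analysis_cost / DOCLING_INIT_S              # saved per native PDF

breakeven_native_fraction()  # 0.75 / 2.5 = 0.3, i.e. the ~30% quoted above
```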

## Dependencies

- `PyMuPDF` (fitz) - PDF text extraction and rendering
- `Pillow` - Image processing
- `numpy` - Numerical operations
- `opencv-python` - Advanced image analysis

## Roadmap

- [ ] Batch API for parallel analysis
- [ ] Rotation/skew detection
- [ ] Language detection for OCR model selection
- [ ] Table/form detection heuristics
- [ ] Integration examples for AWS Textract, Google Vision

## Contributing

PRs welcome! Focus areas:
- Better quality heuristics
- Docling integration patterns
- Performance benchmarks
- Real-world use cases

## License

MIT

## Links

- PyPI: https://pypi.org/project/pdf-ocr-confidence/
- GitHub: *(coming soon)*
- Docling: https://github.com/DS4SD/docling
