Metadata-Version: 2.3
Name: ocr-detection
Version: 0.4.1
Summary: A Python library to detect whether PDF pages contain extractable text or are scanned images requiring OCR
Keywords: pdf,ocr,text-extraction,document-processing,pdf-analysis
Author: satish
Author-email: satish <satish860@gmail.com>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Multimedia :: Graphics :: Graphics Conversion
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: pdfplumber>=0.10.0
Maintainer: satish
Maintainer-email: satish <satish860@gmail.com>
Requires-Python: >=3.13
Project-URL: Documentation, https://github.com/satish860/ocr-detection#readme
Project-URL: Homepage, https://github.com/satish860/ocr-detection
Project-URL: Issues, https://github.com/satish860/ocr-detection/issues
Project-URL: Repository, https://github.com/satish860/ocr-detection
Description-Content-Type: text/markdown

# OCR Detection Library

A Python library to analyze PDF pages and determine whether they contain extractable text or are scanned images requiring OCR processing.

**New in v0.3.0**: Smart Image Extraction makes scanned-PDF processing **5x faster** with **33% less memory usage**. The default fast mode is **40x faster** than accuracy mode, and parallel processing is optimized for large documents.

## Features

- **Page Type Detection**: Automatically classifies PDF pages as text, scanned, mixed, or empty
- **Smart Image Extraction**: 5x faster image processing for scanned PDFs using embedded images
- **Base64 Image Output**: Get page images as base64-encoded strings for visualization
- **Dual Processing Modes**: Fast mode (40x faster) for speed, accuracy mode for precision
- **Parallel Processing**: Fast analysis of large PDFs using multi-threading (up to 8x speedup)
- **Confidence Scoring**: Reliability indicators for classifications
- **Memory Efficient**: 33% reduction in memory usage with optimized image handling
- **Simple API**: Easy-to-use interface with minimal complexity

## Installation

```bash
# Requires Python 3.13 or later

# From a clone or download of the project, install with uv (recommended)
cd ocr-detection
uv sync

# Or install the packaged release with pip
pip install ocr-detection
```

## Usage

### Quick Start

```python
from ocr_detection import detect_ocr

# Pick one of the calls below; they are alternatives, not sequential steps.
# RECOMMENDED: Serverless mode with images - optimal for most use cases
# (12-17s for 1000+ pages, includes optimized images for OCR processing)
result = detect_ocr("document.pdf", serverless_mode=True, include_images=True)

# RECOMMENDED: Serverless mode for classification only - ultra-fast
# (sub-2 seconds for 1000+ pages, no images)
result = detect_ocr("document.pdf", serverless_mode=True)

# Traditional fast mode - 40x faster than accuracy mode
result = detect_ocr("document.pdf")

# Accuracy mode - slowest but most precise
result = detect_ocr("document.pdf", accuracy_mode=True)

print(result)
# Example output: {"status": "partial", "pages": [1, 3, 7, 12]}

# Check the status (a string, not a boolean)
if result['status'] == "true":
    print("All pages need OCR")
elif result['status'] == "false":
    print("No pages need OCR")
else:  # partial
    print(f"Pages needing OCR: {result['pages']}")
```

### Recommended Usage (Serverless Optimized)

**For Google Cloud Functions/Run and other serverless environments:**

```python
from ocr_detection import detect_ocr, OCRDetection

# Option 1: Quick function call with images (RECOMMENDED)
# Perfect balance of speed and functionality
result = detect_ocr("document.pdf", serverless_mode=True, include_images=True)
# Performance: 12-17s for 1000+ pages with optimized images

# Option 2: Classification only (ultra-fast)
# When you only need to know which pages need OCR
result = detect_ocr("document.pdf", serverless_mode=True)
# Performance: sub-2 seconds for 1000+ pages

# Option 3: Class-based approach
detector = OCRDetection(serverless_mode=True, include_images=True)
result = detector.detect("document.pdf")
```

### Using the OCRDetection Class

```python
from ocr_detection import OCRDetection

# RECOMMENDED: Serverless mode - optimal for most use cases
# Automatically enables metadata_only=True, optimized images, and conservative parallelization
serverless_detector = OCRDetection(serverless_mode=True)

# RECOMMENDED: Serverless mode with images for OCR processing
# (12-17s for 1000+ pages with optimized ultra-fast image generation)
serverless_with_images = OCRDetection(serverless_mode=True, include_images=True)

# Traditional fast mode - 40x faster than accuracy mode
detector = OCRDetection(
    accuracy_mode=False,       # Fast mode (default)
    confidence_threshold=0.5,  # Minimum confidence for OCR detection
    parallel=True,             # Enable parallel processing
    include_images=False,      # No images by default
    image_format="png",        # Image format: "png" or "jpeg"
    image_dpi=150,             # Image resolution (DPI)
)

# Accuracy mode - slowest but most precise
accurate_detector = OCRDetection(accuracy_mode=True)

# Analyze a document
result = detector.detect("document.pdf")

# With custom parallel settings for large documents
result = detector.detect("large_document.pdf", parallel=True, max_workers=8)
```

### Understanding Results

The library returns a dictionary with the following fields:

- **status**: String indicating the OCR requirement (the values are strings, not booleans)
  - `"true"` - All pages need OCR processing
  - `"false"` - No pages need OCR processing
  - `"partial"` - Some pages need OCR processing

- **pages**: List of page numbers (1-indexed) that need OCR processing
  - Empty list when status is `"false"`
  - Contains all page numbers when status is `"true"`
  - Contains specific page numbers when status is `"partial"`

- **page_images**: Dictionary mapping page numbers to base64-encoded images (when `include_images=True`)
  - Only included for pages that need OCR processing
  - Page numbers are 1-indexed to match PDF page numbering
  - Images are base64-encoded PNG or JPEG strings
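
As a minimal sketch of how these fields can be consumed, the helper below turns a result dictionary into a short message. `describe_result` is a hypothetical name, not part of the library; it assumes only the `status` and `pages` fields described above, with `status` given as a string.

```python
def describe_result(result: dict) -> str:
    """Summarize a result dict shaped like the one described above."""
    status = result["status"]  # a string: "true", "false", or "partial"
    pages = result.get("pages", [])
    if status == "true":
        return f"All {len(pages)} pages need OCR"
    if status == "false":
        return "No pages need OCR"
    if status == "partial":
        return "Pages needing OCR: " + ", ".join(str(p) for p in pages)
    # Anything else is outside the three documented values
    raise ValueError(f"Unexpected status: {status!r}")


# Hand-written sample in the documented shape (not real library output)
sample = {"status": "partial", "pages": [2, 5, 8]}
print(describe_result(sample))  # Pages needing OCR: 2, 5, 8
```

Raising on an unknown status keeps a typo such as comparing against the boolean `True` from silently falling through to the "partial" branch.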

### Examples

```python
from ocr_detection import detect_ocr

# Example 1: Fully text-based PDF
result = detect_ocr("text_document.pdf")
# {"status": "false", "pages": []}

# Example 2: Scanned PDF
result = detect_ocr("scanned_document.pdf")
# {"status": "true", "pages": [1, 2, 3, 4, 5]}

# Example 3: Mixed content PDF
result = detect_ocr("mixed_document.pdf")
# {"status": "partial", "pages": [2, 5, 8]}

# Example 4: With base64 images
result = detect_ocr("document.pdf", include_images=True)
# {
#   "status": "partial", 
#   "pages": [2, 5], 
#   "page_images": {
#     2: "iVBORw0KGgoAAAANSUhEUgAA...",  # base64 PNG data
#     5: "iVBORw0KGgoAAAANSUhEUgAA..."   # base64 PNG data
#   }
# }

# Example 5: Custom image settings
result = detect_ocr(
    "document.pdf",
    include_images=True,
    image_format="jpeg",  # Use JPEG instead of PNG
    image_dpi=200,        # Higher resolution
)

# Example 6: With parallel processing for large PDFs
result = detect_ocr("large_document.pdf", parallel=True, max_workers=8)

# Example 7: Accuracy vs Speed modes
fast_result = detect_ocr("document.pdf")  # Fast mode (default)
accurate_result = detect_ocr("document.pdf", accuracy_mode=True)  # Accuracy mode

# Example 8: Serverless optimization (RECOMMENDED)
serverless_result = detect_ocr("document.pdf", serverless_mode=True, include_images=True)  # Optimal balance

# Example 9: Ultra-fast classification only
classify_result = detect_ocr("document.pdf", serverless_mode=True)  # Sub-2 seconds for 1000+ pages
```

## Image Output Options

The library can generate base64-encoded images of pages that need OCR processing.

### Parameters
- **include_images**: `bool` - Enable base64 image output (default: `False`)
- **image_format**: `str` - Output format: `"png"` or `"jpeg"` (default: `"png"`)
- **image_dpi**: `int` - Resolution in DPI (default: `150`)

### Usage Notes
- Images are only generated for pages that need OCR processing
- **Smart extraction**: Scanned pages use embedded images for 5x faster processing
- Higher DPI values produce larger but clearer images; `image_dpi` only affects pages that are rendered, not embedded images that are extracted directly
- PNG format preserves quality but has larger file sizes
- JPEG format is more compact but may have compression artifacts
- Page numbers in `page_images` match those in the `pages` list (1-indexed)
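
The base64 strings can be turned back into image files with the standard library alone. The sketch below assumes a result dictionary shaped as documented above; `save_page_images` and the `page_0002.png` naming scheme are illustrative choices, not part of the library.

```python
import base64
import tempfile
from pathlib import Path


def save_page_images(result: dict, out_dir: str, image_format: str = "png") -> list:
    """Decode base64 page images from a result dict and write them to out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    # page_images is absent when include_images=False, so default to empty
    for page_number, encoded in result.get("page_images", {}).items():
        # int() also accepts string keys, e.g. after a JSON round trip
        path = target / f"page_{int(page_number):04d}.{image_format}"
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written


# Placeholder payload standing in for real image data (not library output)
sample = {
    "status": "partial",
    "pages": [2],
    "page_images": {2: base64.b64encode(b"not a real image").decode("ascii")},
}
with tempfile.TemporaryDirectory() as tmp:
    paths = save_page_images(sample, tmp)
    print([p.name for p in paths])  # ['page_0002.png']
```

Pass the same `image_format` that was given to the detection call so the file extension matches the encoded data.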

## Performance

### Version 0.3.0 Optimization

Version 0.3.0 introduced **Smart Image Extraction**, with the following improvements:

- **5x faster** processing for scanned PDFs (2.5s → 0.54s)
- **33% memory reduction** (116MB → 79MB)
- **8x smaller** image data (15.9MB → 2.0MB)
- **20x faster** per-image processing (1.2s → 0.06s per image)
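
The headline multipliers are rounded from the before/after measurements in parentheses. As a small worked example, the sketch below recomputes the ratios from those quoted numbers; the helper names are illustrative only.

```python
# (before, after) pairs copied from the measurements quoted above
measurements = {
    "scanned PDF processing time (s)": (2.5, 0.54),
    "memory usage (MB)": (116, 79),
    "image data size (MB)": (15.9, 2.0),
    "per-image processing time (s)": (1.2, 0.06),
}


def improvement_factor(before: float, after: float) -> float:
    """How many times smaller the 'after' value is than the 'before' value."""
    return before / after


def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from 'before' to 'after', as a percentage."""
    return (before - after) / before * 100


for name, (before, after) in measurements.items():
    factor = improvement_factor(before, after)
    percent = reduction_percent(before, after)
    print(f"{name}: {factor:.1f}x improvement, {percent:.0f}% reduction")
```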

### How It Works

- **Scanned PDFs**: Extracts original embedded JPEG images directly (no re-rendering)
- **Text PDFs**: Uses traditional rendering for vector content
- **Quality Preservation**: Maintains original image compression and quality
- **Thread Safety**: Works seamlessly with parallel processing

### Processing Modes

**Fast Mode (Default)**:
- **40x faster** than accuracy mode
- Uses optimized text extraction (PyMuPDF only)
- Fast page classification heuristics
- Recommended for most use cases

**Accuracy Mode**:
- Maximum precision using dual text extraction
- Comprehensive text quality analysis
- Better for documents requiring high confidence
- Use when precision is more important than speed

### Automatic Optimization

The library automatically optimizes performance based on document size and content:
- Documents with ≤10 pages use sequential processing
- Larger documents automatically use parallel processing
- **Current parallel limit**: 8 workers (configurable via `max_workers`)
- **Parallel speedup**: 3-8x for large documents
- **Worker count**: `min(cpu_count, total_pages, max_workers)`
- Smart image extraction eliminates unnecessary rendering overhead
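
The rules above can be sketched as a small function. This is an illustration of the stated behaviour, not the library's internal code: `choose_workers` and `SEQUENTIAL_PAGE_LIMIT` are hypothetical names, and the default of 8 for `max_workers` is assumed from the parallel limit listed above.

```python
import os
from typing import Optional

# Assumption taken from the list above: documents with <= 10 pages run sequentially
SEQUENTIAL_PAGE_LIMIT = 10


def choose_workers(total_pages: int, max_workers: int = 8,
                   cpu_count: Optional[int] = None) -> int:
    """Return the worker count for a document; 1 means sequential processing."""
    if total_pages <= SEQUENTIAL_PAGE_LIMIT:
        return 1
    # os.cpu_count() can return None, so fall back to a single CPU
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cpus, total_pages, max_workers))


print(choose_workers(8, cpu_count=16))                     # 1
print(choose_workers(1045, cpu_count=16))                  # 8
print(choose_workers(1045, max_workers=16, cpu_count=12))  # 12
```

The third call shows why raising `max_workers` beyond the number of CPU cores has no effect under this formula: the smallest of the three values wins.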

### Performance Tuning

```python
# For maximum speed on large documents
result = detect_ocr(
    "large_document.pdf",
    accuracy_mode=False,    # Fast mode
    parallel=True,          # Enable parallel processing
    max_workers=8          # Use up to 8 workers
)

# For maximum accuracy
result = detect_ocr(
    "document.pdf",
    accuracy_mode=True     # Accuracy mode (slower)
)

# Custom worker count for high-core systems
result = detect_ocr(
    "huge_document.pdf",
    parallel=True,
    max_workers=16         # Increase for powerful hardware
)
```

### Benchmark Results

**Large Document Test** (1045 pages, 3.9MB):
- **Fast mode**: ~8.0s
- **Fast mode + images**: ~33.7s
- **Parallel processing**: 3-8x faster than sequential
- **Memory usage**: Optimized with 33% reduction

**Performance Guidelines**:
- Use **fast mode** for general document analysis
- Use **accuracy mode** when precision is critical
- **Parallel processing** is enabled automatically for documents with more than 10 pages
- Increase `max_workers` on high-core systems for better performance

## License

MIT License - see LICENSE file for details