Metadata-Version: 2.4
Name: pepsico-document-preprocessor
Version: 0.1.0
Summary: Document preprocessing library for PDF ingestion, rendering, enhancement, and classification
Author-email: PepsiCo <tech@pepsico.com>
Maintainer-email: PepsiCo <tech@pepsico.com>
License: MIT
Project-URL: Homepage, https://github.com/pepsico/document-preprocessor
Project-URL: Documentation, https://github.com/pepsico/document-preprocessor#readme
Project-URL: Repository, https://github.com/pepsico/document-preprocessor.git
Project-URL: Issues, https://github.com/pepsico/document-preprocessor/issues
Keywords: document-processing,pdf,preprocessing,ocr,computer-vision,classification,enhancement
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: document-core>=0.1.0
Requires-Dist: PyMuPDF>=1.24.0
Requires-Dist: pdf2image>=1.17.0
Requires-Dist: opencv-python-headless>=4.9.0
Requires-Dist: Pillow>=10.0.0
Requires-Dist: pydantic>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: upscale
Requires-Dist: realesrgan>=0.3.0; extra == "upscale"

# document-preprocessor

A production-ready document preprocessing library for PDF ingestion, rendering, enhancement, and classification. This library provides the complete preprocessing lifecycle for PDF documents before OCR, vision analysis, extraction, and AI processing.

## Overview

`document-preprocessor` is a core component within a larger Document Intelligence platform. It handles:

- PDF ingestion and page splitting
- High-resolution page rendering
- Image enhancement (deskewing, contrast, noise reduction, upscaling)
- Page classification for routing
- Document complexity analysis
- Content deduplication
- Complete preprocessing orchestration

## Architecture

The library follows Clean Architecture principles with SOLID design, Domain-Driven Design, dependency injection, and async-first APIs.

### Core Components

- **`PdfSplitter`** - Splits PDFs into document-core Page objects
- **`PageRenderer`** - Renders PDF pages to high-resolution images
- **`ImageEnhancer`** - Enhances images with deskewing, contrast, noise reduction, and optional upscaling
- **`PageClassifier`** - Classifies pages for routing (planogram, table, cover, appendix, unknown)
- **`ComplexityAnalyzer`** - Analyzes document complexity to recommend processing mode
- **`ContentDeduplicator`** - Detects and removes duplicate pages
- **`PreprocessorPipeline`** - Orchestrates the complete preprocessing workflow

## Processing Flow

```
PDF Input
    ↓
Phase 1: Split PDF → Pages
    ↓
Phase 2: Deduplicate Pages
    ↓
Phase 3: Render Pages → Images
    ↓
Phase 4: Enhance Images
    ↓
Phase 5: Classify Pages
    ↓
Phase 6: Compute Complexity
    ↓
Phase 7: Cache (optional)
    ↓
PreprocessResult
```
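The flow above can be pictured as an ordered chain of async phases. The following is a minimal sketch and not the library's `PreprocessorPipeline`: the `make_phase` and `run_pipeline` helpers are hypothetical, each phase is a stub that only records its name, and the idea that `enable_deduplication` and `cache_enabled` skip phases 2 and 7 is an assumption.

```python
import asyncio
from collections.abc import Awaitable, Callable

# A phase takes the current state and returns the next state.
Phase = Callable[[dict], Awaitable[dict]]


def make_phase(name: str) -> Phase:
    """Build a stub phase that only records that it ran (hypothetical helper)."""
    async def phase(state: dict) -> dict:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return {**state, "completed": [*state["completed"], name]}
    return phase


async def run_pipeline(
    pdf_path: str,
    enable_deduplication: bool = True,
    cache_enabled: bool = True,
) -> dict:
    """Run the phases strictly in order; optional phases are assumed skippable."""
    names = ["split"]
    if enable_deduplication:
        names.append("deduplicate")
    names += ["render", "enhance", "classify", "complexity"]
    if cache_enabled:
        names.append("cache")

    state: dict = {"pdf_path": pdf_path, "completed": []}
    for name in names:
        state = await make_phase(name)(state)
    return state


result = asyncio.run(run_pipeline("document.pdf", cache_enabled=False))
print(result["completed"])
```

Passing state from one phase to the next keeps each phase independent of the others, which matches the dependency-injected component layout described under Architecture.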

## Installation

### Requirements

- Python >= 3.11
- document-core >= 0.1.0
- PyMuPDF >= 1.24
- pdf2image >= 1.17
- opencv-python-headless >= 4.9
- Pillow >= 10.0
- pydantic >= 2.0

### Optional Dependencies

- `realesrgan` - For AI-based image upscaling (install with `pip install "pepsico-document-preprocessor[upscale]"`)

### Install from Source

```bash
cd document-preprocessor
pip install -e .
```

### Install with Optional Dependencies

```bash
pip install -e ".[upscale,dev]"
```

## Configuration

```python
from document_preprocessor import PreprocessorConfig

config = PreprocessorConfig(
    render_dpi=300,
    image_format="png",
    temp_directory="/tmp/document-preprocessor",
    enable_parallel_rendering=True,
    enable_parallel_enhancement=True,
    enable_deduplication=True,
    cache_enabled=True,
    classification_confidence_threshold=0.80,
    complexity_simple_threshold=25,
    complexity_standard_threshold=60,
    max_workers=8,
)

# Or load from environment variables
config = PreprocessorConfig.from_env()
```

### Environment Variables

- `PREPROCESSOR_RENDER_DPI` - Rendering DPI (default: 300)
- `PREPROCESSOR_IMAGE_FORMAT` - Output image format (default: png)
- `PREPROCESSOR_TEMP_DIR` - Temporary directory (default: /tmp/document-preprocessor)
- `PREPROCESSOR_PARALLEL_RENDER` - Enable parallel rendering (default: true)
- `PREPROCESSOR_PARALLEL_ENHANCE` - Enable parallel enhancement (default: true)
- `PREPROCESSOR_MAX_WORKERS` - Maximum worker threads (default: 8)
- `PREPROCESSOR_ENABLE_DEDUP` - Enable deduplication (default: true)
- `PREPROCESSOR_CACHE_ENABLED` - Enable caching (default: true)
- `PREPROCESSOR_CLASSIFICATION_THRESHOLD` - Classification confidence threshold (default: 0.80)
- `PREPROCESSOR_COMPLEXITY_SIMPLE` - Simple complexity threshold (default: 25)
- `PREPROCESSOR_COMPLEXITY_STANDARD` - Standard complexity threshold (default: 60)

## Usage

### Basic Pipeline Usage

```python
import asyncio
from document_preprocessor import (
    PreprocessorPipeline,
    PdfSplitter,
    PageRenderer,
    ImageEnhancer,
    PageClassifier,
    ComplexityAnalyzer,
    PreprocessorConfig,
)

async def process_pdf(pdf_path: str):
    # Initialize components
    config = PreprocessorConfig()
    splitter = PdfSplitter()
    renderer = PageRenderer(dpi=config.render_dpi, image_format=config.image_format)
    enhancer = ImageEnhancer(temp_directory=config.temp_directory)
    classifier = PageClassifier(confidence_threshold=config.classification_confidence_threshold)
    analyzer = ComplexityAnalyzer(
        simple_threshold=config.complexity_simple_threshold,
        standard_threshold=config.complexity_standard_threshold,
    )
    
    # Create pipeline
    pipeline = PreprocessorPipeline(
        splitter=splitter,
        renderer=renderer,
        enhancer=enhancer,
        classifier=classifier,
        analyzer=analyzer,
        config=config,
    )
    
    # Process PDF
    result = await pipeline.process(pdf_path)
    
    # Access results
    print(f"Processed {len(result.document.pages)} pages")
    print(f"Complexity: {result.complexity.overall_score:.1f}")
    print(f"Recommended mode: {result.complexity.recommended_mode}")
    print(f"Reasoning: {result.complexity.reasoning}")
    
    return result

# Run pipeline
result = asyncio.run(process_pdf("document.pdf"))
```

### Individual Component Usage

#### PDF Splitting

```python
from document_preprocessor import PdfSplitter

splitter = PdfSplitter()
pages = splitter.split("document.pdf")

# Process in batches
batches = splitter.split_to_batches("document.pdf", batch_size=10)
```

#### Page Rendering

```python
from document_preprocessor import PageRenderer

renderer = PageRenderer(dpi=300, image_format="png")
image_path = renderer.render(page)

# Batch rendering
image_paths = renderer.render_batch(pages, parallel=True)
```

#### Image Enhancement

```python
from document_preprocessor import ImageEnhancer, EnhancerConfig

config = EnhancerConfig(
    enable_deskew=True,
    enable_contrast=True,
    enable_upscale=True,
    enable_binarization=True,
)

enhancer = ImageEnhancer(config=config)
enhanced_path = enhancer.enhance(image_path, current_dpi=150)
```

#### Page Classification

```python
from document_preprocessor import PageClassifier

classifier = PageClassifier(confidence_threshold=0.80)
page_type = classifier.classify(page)

# Batch classification
classifications = classifier.classify_batch(pages)
```

#### Complexity Analysis

```python
from document_preprocessor import ComplexityAnalyzer

analyzer = ComplexityAnalyzer(simple_threshold=25, standard_threshold=60)
complexity = analyzer.score_document(pages)

print(f"Overall score: {complexity.overall_score}")
print(f"Recommended mode: {complexity.recommended_mode}")
print(f"Reasoning: {complexity.reasoning}")
```

#### Content Deduplication

```python
from document_preprocessor import ContentDeduplicator

deduplicator = ContentDeduplicator()

# Find duplicates
duplicates = deduplicator.find_duplicates(pages)

# Remove duplicates
deduplicated_pages = deduplicator.remove_duplicates(pages)
```
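
This README does not state how duplicates are detected. One common approach is to hash normalized page text and keep the first page for each hash. The sketch below assumes that approach; `SketchPage`, `content_key`, and the two functions are hypothetical stand-ins and not the `ContentDeduplicator` implementation.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchPage:
    """Hypothetical stand-in for a document-core Page."""
    page_number: int
    text: str


def content_key(page: SketchPage) -> str:
    """Hash lowercased, whitespace-normalized text so trivial differences collapse."""
    normalized = " ".join(page.text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: list[SketchPage]) -> dict[int, int]:
    """Map each duplicate page number to the first page with the same content."""
    first_seen: dict[str, int] = {}
    duplicates: dict[int, int] = {}
    for page in pages:
        key = content_key(page)
        if key in first_seen:
            duplicates[page.page_number] = first_seen[key]
        else:
            first_seen[key] = page.page_number
    return duplicates


def remove_duplicates(pages: list[SketchPage]) -> list[SketchPage]:
    """Keep the first occurrence of each distinct page, preserving order."""
    duplicates = find_duplicates(pages)
    return [page for page in pages if page.page_number not in duplicates]


pages = [
    SketchPage(1, "Shelf layout A"),
    SketchPage(2, "Terms and conditions"),
    SketchPage(3, "shelf   layout a"),
]
print(find_duplicates(pages))  # {3: 1}
print([page.page_number for page in remove_duplicates(pages)])  # [1, 2]
```

Exact hashing only catches pages whose normalized text is identical; near-duplicate scans would need a perceptual or similarity-based comparison instead.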

## Page Classification Rules

The classifier uses heuristic rules to categorize pages:

- **PLANOGRAM**: `image_area_ratio > 0.60`
- **TABLE**: `detected_table_regions > 2` and `image_area_ratio < 0.30`
- **COVER**: `page_number == 1` and `raw_char_count < 500`
- **APPENDIX**: Detected via keyword analysis (appendix, glossary, references, notes)
- **UNKNOWN**: Fallback for unclassified pages
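
The rules above can be sketched as a single function. This is an illustration only: `PageFeatures` and `SketchPageType` are hypothetical types, the rules are assumed to be evaluated in the order listed, and the appendix check is assumed to be a case-insensitive substring match on the page text.

```python
from dataclasses import dataclass
from enum import Enum


class SketchPageType(Enum):
    PLANOGRAM = "planogram"
    TABLE = "table"
    COVER = "cover"
    APPENDIX = "appendix"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class PageFeatures:
    """Hypothetical container for the metrics the rules refer to."""
    page_number: int
    image_area_ratio: float
    detected_table_regions: int
    raw_char_count: int
    text: str = ""


APPENDIX_KEYWORDS = ("appendix", "glossary", "references", "notes")


def classify(features: PageFeatures) -> SketchPageType:
    """Apply the heuristic rules in listed order; the first match wins."""
    if features.image_area_ratio > 0.60:
        return SketchPageType.PLANOGRAM
    if features.detected_table_regions > 2 and features.image_area_ratio < 0.30:
        return SketchPageType.TABLE
    if features.page_number == 1 and features.raw_char_count < 500:
        return SketchPageType.COVER
    lowered = features.text.lower()
    if any(keyword in lowered for keyword in APPENDIX_KEYWORDS):
        return SketchPageType.APPENDIX
    return SketchPageType.UNKNOWN


print(classify(PageFeatures(3, 0.75, 0, 100)))               # SketchPageType.PLANOGRAM
print(classify(PageFeatures(1, 0.20, 0, 120)))               # SketchPageType.COVER
print(classify(PageFeatures(40, 0.0, 0, 900, "Appendix B")))  # SketchPageType.APPENDIX
```

Under this ordering, an image-heavy first page is classified as a planogram before the cover rule is reached; a different rule order would change that outcome.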

## Complexity Scoring

Complexity is scored on a scale of 0-100 across three dimensions. Within each dimension, the component weights listed below sum to 100 points:

### Layout Score
- Image density (40 points)
- Shelf regions (30 points)
- Region count (20 points)
- Mixed layout penalty (10 points)

### OCR Score
- Small text ratio (40 points)
- Rotation (30 points)
- Dense annotations (30 points)

### Structure Score
- Table regions (50 points)
- Nested layouts (30 points)
- Page position (20 points)

### Mode Selection

- **FAST**: Overall score < 25
- **BALANCED**: 25 <= overall score < 60
- **HIGH_ACCURACY**: Overall score >= 60
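
As a sketch, the mode boundaries can be written as a small function. `SketchMode` and `select_mode` are hypothetical names, and the default thresholds mirror the `complexity_simple_threshold` and `complexity_standard_threshold` values shown in the configuration section.

```python
from enum import Enum


class SketchMode(Enum):
    FAST = "fast"
    BALANCED = "balanced"
    HIGH_ACCURACY = "high_accuracy"


def select_mode(
    overall_score: float,
    simple_threshold: float = 25,
    standard_threshold: float = 60,
) -> SketchMode:
    """Map a 0-100 complexity score to a processing mode."""
    if not 0 <= overall_score <= 100:
        raise ValueError("overall_score must be between 0 and 100")
    if overall_score < simple_threshold:
        return SketchMode.FAST
    if overall_score < standard_threshold:
        return SketchMode.BALANCED
    return SketchMode.HIGH_ACCURACY


for score in (24.9, 25, 59.9, 60):
    print(score, select_mode(score).name)
```

Both thresholds are lower bounds of the next mode: a score of exactly 25 is BALANCED and a score of exactly 60 is HIGH_ACCURACY.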

## Performance Tuning

### Parallel Processing

Enable parallel rendering and enhancement for large documents:

```python
from document_preprocessor import PreprocessorConfig

config = PreprocessorConfig(
    enable_parallel_rendering=True,
    enable_parallel_enhancement=True,
    max_workers=16,  # Adjust based on CPU cores
)
```
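
For intuition about what `max_workers` controls, the following sketch fans page work out over a `ThreadPoolExecutor`, which the Design Principles section names as the parallelism mechanism. `render_page` and `render_all` are hypothetical placeholders, not library functions, and capping the worker count at the CPU count is only one possible choice.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def render_page(page_number: int) -> str:
    """Placeholder for real rendering work; returns a hypothetical output path."""
    return f"page_{page_number:04d}.png"


def render_all(page_numbers: list[int], max_workers: int) -> list[str]:
    """Render pages concurrently; executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(render_page, page_numbers))


# Cap the worker count at the number of CPU cores reported by the OS.
workers = min(16, os.cpu_count() or 1)
paths = render_all(list(range(1, 6)), max_workers=workers)
print(paths)
```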

### Memory Management

For very large PDFs (1000+ pages):

- Process in batches using `split_to_batches()`
- Ensure the temp directory has enough free disk space for the rendered and enhanced images
- Monitor memory usage
- Use cache to avoid reprocessing
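
The batching idea can be sketched with a lazy generator, so that only one batch of pages is held in memory at a time. `iter_batches` is a hypothetical helper operating on page numbers; it illustrates the pattern and is not the implementation of `split_to_batches()`.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def iter_batches(items: Iterable[int], batch_size: int) -> Iterator[list[int]]:
    """Yield consecutive batches of at most `batch_size` items, lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


batches = list(iter_batches(range(1, 26), batch_size=10))
print([len(batch) for batch in batches])  # [10, 10, 5]
```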

### Upscaling

Enable AI-based upscaling for low-DPI documents:

```python
from document_preprocessor import EnhancerConfig

# Requires the optional dependency (run in a shell): pip install realesrgan
config = EnhancerConfig(
    enable_upscale=True,
    upscale_threshold_dpi=150,
    upscale_factor=2,
)
```

## Extending Classification

### Custom CNN Classifier

```python
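# Assumption: Page and PageType are provided by document-core; adjust this
# import path to match your installation.
from document_core import Page, PageType
from document_preprocessor import PageClassifier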

def custom_cnn_classifier(page: Page) -> PageType:
    # Your CNN logic here
    return PageType.PLANOGRAM

classifier = PageClassifier(
    confidence_threshold=0.80,
    classifier_model=custom_cnn_classifier,
)
```

### Custom Classification Rules

Extend the `PageClassifier` class to add custom rules:

```python
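# Assumption: Page and PageType are provided by document-core; adjust this
# import path to match your installation.
from document_core import Page, PageType
from document_preprocessor.classifier import PageClassifier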

class CustomClassifier(PageClassifier):
    def _classify_heuristic(self, page: Page) -> PageType:
        # Add your custom logic
        if page.metadata.image_area_ratio > 0.80:
            return PageType.PLANOGRAM
        
        return super()._classify_heuristic(page)
```

## Troubleshooting

### PDF Splitting Errors

**Error**: `PdfSplitError: Failed to open PDF`

**Solution**: Ensure the PDF file exists and is not corrupted. Verify file permissions.

### Rendering Errors

**Error**: `RenderingError: Failed to render page`

**Solution**: Check that PyMuPDF is installed correctly. Verify the PDF is not password-protected.

### Enhancement Errors

**Error**: `EnhancementError: Image enhancement failed`

**Solution**: Ensure OpenCV is installed. Check that the image file exists and is readable.

### Memory Issues

**Error**: High memory usage with large PDFs

**Solution**: 
- Process in batches
- Reduce DPI
- Enable deduplication to reduce page count
- Increase system memory or use a machine with more RAM

### Real-ESRGAN Issues

**Error**: Real-ESRGAN not available

**Solution**: Install the optional dependency:
```bash
pip install realesrgan
```

If Real-ESRGAN still cannot be loaded, the library falls back to interpolation-based upscaling.
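
A fallback of this kind is typically decided by checking whether the optional module can be imported. The sketch below shows that check with the standard library only; `upscaler_backend` and its return labels are hypothetical, and the library's actual detection logic may differ.

```python
import importlib.util


def upscaler_backend(module_name: str = "realesrgan") -> str:
    """Return "ai" if the optional module is importable, else "interpolation"."""
    # find_spec returns None for a top-level module that is not installed.
    if importlib.util.find_spec(module_name) is not None:
        return "ai"
    return "interpolation"


print(upscaler_backend())
```

Checking availability up front lets the caller log which backend is in use instead of discovering the missing dependency in the middle of a batch.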

## Development Guide

### Running Tests

```bash
# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=document_preprocessor --cov-report=html

# Run specific test file
pytest tests/test_splitter.py
```

### Code Style

```bash
# Format code with black
black document_preprocessor/

# Lint with ruff
ruff check document_preprocessor/

# Type check with mypy
mypy document_preprocessor/
```

### Project Structure

```
document-preprocessor/
├── pyproject.toml
├── README.md
├── document_preprocessor/
│   ├── __init__.py
│   ├── config.py
│   ├── models.py
│   ├── exceptions.py
│   ├── splitter.py
│   ├── renderer.py
│   ├── enhancer.py
│   ├── classifier.py
│   ├── complexity.py
│   ├── dedup.py
│   └── pipeline.py
├── tests/
│   ├── test_splitter.py
│   ├── test_renderer.py
│   ├── test_enhancer.py
│   ├── test_classifier.py
│   ├── test_complexity.py
│   ├── test_dedup.py
│   └── test_pipeline.py
└── docs/
```

## Design Principles

- **SOLID** - Single responsibility, open/closed, Liskov substitution, interface segregation, dependency inversion
- **Clean Architecture** - Separation of concerns, dependency injection
- **Domain-Driven Design** - Rich domain models, ubiquitous language
- **Async-First** - Non-blocking operations for high throughput
- **High Performance** - ThreadPoolExecutor for parallel processing, lazy image loading
- **Memory Efficient** - Streaming operations, temporary file cleanup
- **Type Safety** - Complete type hints, mypy validation
- **Pydantic Validation** - Strict data validation with ConfigDict
- **Extensible** - Plugin architecture for custom classifiers and enhancers
- **Testable** - Dependency injection, unit tests for all components
- **Production Observability** - Structured logging, error tracking

## Dependencies

### Internal

- `document-core` - Shared models, enums, interfaces, and utilities

### External

- `PyMuPDF` - PDF processing and rendering
- `pdf2image` - PDF to image conversion
- `opencv-python-headless` - Image enhancement
- `Pillow` - Image I/O
- `pydantic` - Data validation

### Optional

- `realesrgan` - AI-based image upscaling

## License

MIT License - PepsiCo

## Support

For issues, questions, or contributions, please contact the PepsiCo AI Team.
