Metadata-Version: 2.4
Name: pepsico-document-parser
Version: 0.1.0
Summary: Unified parsing abstraction layer for document extraction providers
Author-email: PepsiCo <tech@pepsico.com>
Maintainer-email: PepsiCo <tech@pepsico.com>
License: MIT
Project-URL: Homepage, https://github.com/pepsico/document-parser
Project-URL: Documentation, https://github.com/pepsico/document-parser#readme
Project-URL: Repository, https://github.com/pepsico/document-parser.git
Project-URL: Issues, https://github.com/pepsico/document-parser/issues
Keywords: document-parsing,llamaparse,azure-document-intelligence,ocr,table-extraction,async
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: document-core>=0.1.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: tenacity>=8.0.0
Requires-Dist: pydantic>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: azure
Requires-Dist: azure-ai-documentintelligence>=1.0.0; extra == "azure"

# document-parser

A unified parsing abstraction layer for document extraction providers. This library provides a consistent interface for multiple document parsing engines including LlamaParse Premium and Azure Document Intelligence.

## Overview

`document-parser` wraps multiple parsing engines behind the shared `IDocumentParser` interface from `document-core`, enabling:
- Provider abstraction and easy switching
- Response normalization across providers
- Retry handling with exponential backoff
- Batch parsing with concurrency control
- Cache integration
- Provider failover support
- Enterprise-scale document workloads (10,000+ pages)
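
As a rough illustration of provider failover, the sketch below tries a list of parsers in order and returns the first successful result. It is a minimal sketch, not the library's implementation: `parse_with_failover`, `ParserUnavailable`, `FlakyParser`, and `EchoParser` are hypothetical stand-ins so the example runs without any provider credentials.

```python
import asyncio


class ParserUnavailable(Exception):
    """Stand-in for a provider failure (hypothetical; not the library's exception)."""


class FlakyParser:
    """Stand-in primary parser that always fails."""

    async def parse(self, page):
        raise ParserUnavailable("primary provider is down")


class EchoParser:
    """Stand-in secondary parser that always succeeds."""

    async def parse(self, page):
        return f"parsed:{page}"


async def parse_with_failover(page, parsers):
    """Try each parser in order and return the first successful result."""
    last_error = None
    for parser in parsers:
        try:
            return await parser.parse(page)
        except ParserUnavailable as exc:
            last_error = exc  # remember the failure and try the next provider
    raise RuntimeError("all providers failed") from last_error


result = asyncio.run(parse_with_failover("page-1", [FlakyParser(), EchoParser()]))
print(result)
```

In real use the list would hold parser instances such as a LlamaParse client followed by an Azure Document Intelligence client, and the `except` clause would name the library's own exception types.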

## Architecture

The library follows SOLID principles and combines dependency injection with the adapter and factory patterns:
- **Adapter Pattern** - Normalizes provider responses to `ParseResult`
- **Factory Pattern** - Creates parser instances by name
- **Dependency Injection** - Parser instances injected into `BatchParser`
- **Async-first** - Non-blocking operations for high throughput
- **Connection Pooling** - `httpx.AsyncClient` for efficient HTTP communication

## Supported Parsers

### LlamaParse Premium
- Primary provider for high-quality parsing
- Supports tables, images, charts
- Premium mode for best results
- Async job-based processing with polling
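
Job-based processing with polling can be pictured as the loop below. This is a sketch under stated assumptions: the parameter names mirror the `poll_interval_seconds` and `timeout_seconds` config fields, but the status strings, `poll_until_done`, and `make_fake_job` are hypothetical and do not reflect the actual LlamaParse API.

```python
import asyncio


async def poll_until_done(fetch_status, poll_interval_seconds=2.0, timeout_seconds=300.0):
    """Call fetch_status() until it reports a terminal state or the timeout elapses."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_seconds
    while True:
        status = await fetch_status()
        if status in ("SUCCESS", "ERROR"):  # hypothetical terminal states
            return status
        if loop.time() + poll_interval_seconds > deadline:
            raise TimeoutError("job did not finish before the timeout")
        await asyncio.sleep(poll_interval_seconds)


def make_fake_job(states):
    """Return a fetch_status coroutine function that replays the given states."""
    remaining = iter(states)

    async def fetch_status():
        return next(remaining)

    return fetch_status


final = asyncio.run(
    poll_until_done(make_fake_job(["PENDING", "PENDING", "SUCCESS"]), poll_interval_seconds=0.01)
)
print(final)
```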

### Azure Document Intelligence
- Fallback (recovery) parser and alternative provider
- OCR-based recovery when primary parsing fails
- Table extraction with `prebuilt-layout`
- Text extraction with `prebuilt-read`

## Installation

### Requirements

- Python >= 3.11
- document-core >= 0.1.0
- httpx >= 0.27
- tenacity >= 8.0
- pydantic >= 2.0

### Optional Dependencies

- `azure-ai-documentintelligence` - For Azure Document Intelligence support

### Install from Source

```bash
cd document-parser
pip install -e .
```

### Install with Azure Support

```bash
pip install -e ".[azure]"
```

## Configuration

### LlamaParse Configuration

```python
from document_parser import LlamaParseConfig

config = LlamaParseConfig(
    api_key="your-api-key",
    base_url="https://api.cloud.llamaindex.ai",
    premium_mode=True,
    extract_tables=True,
    extract_images=True,
    extract_charts=True,
    preserve_layout=True,
    language="en",
    max_concurrent_jobs=10,
    poll_interval_seconds=2.0,
    timeout_seconds=300.0,
    max_retries=3,
)
```

### Azure Document Intelligence Configuration

```python
from document_parser import AzureDIConfig

config = AzureDIConfig(
    endpoint="https://your-endpoint.cognitiveservices.azure.com",
    api_key="your-api-key",
    api_version="2024-11-30",
    prebuilt_model="prebuilt-layout",
    timeout_seconds=120.0,
    max_retries=3,
)
```

## Usage Examples

### Using the Factory

```python
from document_parser import ParserFactory, LlamaParseConfig, AzureDIConfig

# Create LlamaParse parser
llama_config = LlamaParseConfig(api_key="your-key")
llama_parser = ParserFactory.create("llamaparse", llama_config)

# Create Azure DI parser
azure_config = AzureDIConfig(endpoint="https://...", api_key="your-key")
azure_parser = ParserFactory.create("azure_di", azure_config)
```

### Parsing a Single Page

```python
import asyncio
from document_parser import LlamaParseClient, LlamaParseConfig

async def parse_page(page):
    config = LlamaParseConfig(api_key="your-key")
    parser = LlamaParseClient(config)

    try:
        result = await parser.parse(page)
        print(f"Markdown: {result.markdown}")
        print(f"Tables: {len(result.tables)}")
        print(f"Images: {len(result.images)}")
        print(f"Duration: {result.parse_duration_ms}ms")
    finally:
        # Release the HTTP client even if parsing fails
        await parser.close()

# Run (`page` is a page object you have already loaded)
asyncio.run(parse_page(page))
```

### Batch Parsing

```python
import asyncio
from document_parser import BatchParser, LlamaParseClient, LlamaParseConfig

async def parse_document(document):
    config = LlamaParseConfig(api_key="your-key")
    parser = LlamaParseClient(config)

    batch_parser = BatchParser(
        parser=parser,
        cache=None,  # Optional cache implementation
        max_concurrency=10,
    )

    try:
        results = await batch_parser.parse_document(document)
        print(f"Parsed {len(results)} pages")
    finally:
        # Release the HTTP client even if parsing fails
        await parser.close()

# Run (`document` is a document object you have already loaded)
asyncio.run(parse_document(document))
```
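
The effect of `max_concurrency` can be approximated with an `asyncio.Semaphore`, as in the sketch below. This is an assumption about how such throttling is typically built, not a copy of `BatchParser`; `parse_all` and `fake_parse` are hypothetical names.

```python
import asyncio


async def parse_all(pages, parse_one, max_concurrency=10):
    """Parse pages concurrently, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page):
        async with semaphore:
            return await parse_one(page)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(page) for page in pages))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_parse(page):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # simulate network latency
        in_flight -= 1
        return page * 2

    results = await parse_all(range(8), fake_parse, max_concurrency=3)
    return results, peak


results, peak = asyncio.run(demo())
print(results, peak)
```

The `peak` counter records the largest number of simultaneous calls, which stays at or below the configured limit.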

### Batch Parsing with Cache

```python
from document_parser import BatchParser
from document_core.interfaces import IResultCache

class MyCache(IResultCache):
    async def get(self, key):
        # Retrieve from cache
        pass
    
    async def put(self, key, value, ttl_seconds):
        # Store in cache
        pass

cache = MyCache()
# `parser` is a parser instance created as in the examples above
batch_parser = BatchParser(parser=parser, cache=cache)
```
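
For local testing, a cache with the same `get`/`put` shape could look like the in-memory sketch below. It is illustrative only: it does not subclass `IResultCache` so that it runs standalone, and returning `None` on a miss is an assumption about the interface contract. `InMemoryCache` and its `clock` parameter are hypothetical.

```python
import asyncio
import time


class InMemoryCache:
    """Dict-backed cache with per-entry expiry (illustrative; not shared across processes)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    async def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value

    async def put(self, key, value, ttl_seconds):
        self._entries[key] = (self._clock() + ttl_seconds, value)


async def demo():
    now = [0.0]
    cache = InMemoryCache(clock=lambda: now[0])  # fake clock for a deterministic demo
    await cache.put("page-1", "# Title", ttl_seconds=60)
    fresh = await cache.get("page-1")
    now[0] = 61.0  # advance past the TTL
    expired = await cache.get("page-1")
    return fresh, expired


fresh, expired = asyncio.run(demo())
print(fresh, expired)
```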

## Retry Strategy

The library uses `tenacity` for retry logic with exponential backoff:

### Retry Conditions
- HTTP status codes: 429, 500, 502, 503, 504
- Connection errors
- Timeout errors
- Transport errors

### Backoff Schedule
- 1s, 2s, 4s, 8s, 16s (exponential)
- Configurable max retries (default: 3)
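
The schedule above can be reproduced with a few lines of arithmetic. The sketch below is not the library's `tenacity` configuration; `backoff_delays` and `is_retryable` are hypothetical helpers, and the 16-second cap is an assumption taken from the last value in the listed schedule.

```python
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}


def backoff_delays(max_retries, base_seconds=1.0, cap_seconds=16.0):
    """Return the wait before each retry: base * 2**n, capped at cap_seconds."""
    return [min(base_seconds * 2**attempt, cap_seconds) for attempt in range(max_retries)]


def is_retryable(status_code):
    """True for the status codes listed under Retry Conditions."""
    return status_code in RETRYABLE_STATUS_CODES


print(backoff_delays(3))
print(backoff_delays(5))
print(is_retryable(429), is_retryable(404))
```

Under these assumptions, the default of 3 retries would use only the first three delays, and `max_retries=5` would walk the full listed schedule.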

### Custom Retry Configuration

```python
from document_parser.llamaparse import LlamaParseConfig

config = LlamaParseConfig(
    api_key="your-key",
    max_retries=5,  # Increase retries
)
```

## Error Handling

### Exception Hierarchy

```
DocumentParserError (base)
├── LlamaParseError
├── AzureDIError
├── ParserTimeoutError
├── ParserAuthenticationError
├── ParserRateLimitError
├── AdapterError
├── BatchParserError
└── UnsupportedParserError
```

### Error Handling Example

```python
from document_parser import (
    LlamaParseError,
    ParserAuthenticationError,
    ParserTimeoutError,
)

async def parse_safely(parser, page):
    try:
        return await parser.parse(page)
    except ParserTimeoutError as e:
        print(f"Parsing timed out: {e.message}")
    except ParserAuthenticationError as e:
        print(f"Authentication failed: {e.message}")
    except LlamaParseError as e:
        print(f"Parsing failed: {e.message}")
        print(f"Job ID: {e.details.get('job_id')}")
    return None
```

## Performance Tuning

### Concurrency Control

```python
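from document_parser import BatchParser, LlamaParseClient, LlamaParseConfig

# LlamaParse: cap on concurrent parse jobs
config = LlamaParseConfig(
    api_key="your-key",
    max_concurrent_jobs=20,  # Raise for higher throughput, within provider rate limits
)
parser = LlamaParseClient(config)

# BatchParser: cap on pages parsed concurrently
batch_parser = BatchParser(
    parser=parser,
    max_concurrency=20,
)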
```

### Memory Efficiency

- Use streaming uploads for large files
- Process in batches for very large documents
- Enable caching to avoid reprocessing
- Configure appropriate timeout values
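
Processing a very large document in batches can be as simple as slicing the page sequence into fixed-size chunks and parsing one chunk at a time. The helper below is a generic sketch; `chunked` is a hypothetical name, and it is written by hand because `itertools.batched` is only available from Python 3.12.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


page_numbers = range(1, 11)
batches = list(chunked(page_numbers, 4))
print(batches)
```

Each chunk could then be handed to the batch parser in turn, so only one chunk's results need to be held in memory before they are written out.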

### Connection Pooling

The library uses `httpx.AsyncClient` with connection pooling:
- Default: 10 connections
- Configurable via `max_concurrent_jobs`
- Automatic keep-alive for reuse

## Adding New Parsers

### 1. Create Configuration

```python
from pydantic import BaseModel

class MyParserConfig(BaseModel):
    api_key: str
    base_url: str
    timeout_seconds: float = 120.0
```

### 2. Create Adapter

```python
from document_parser.models import ParseResult

class MyParserAdapter:
    def adapt(self, raw_response, page_number, parse_duration_ms) -> ParseResult:
        # Convert provider response to ParseResult
        pass
```
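
A filled-in adapter might look like the sketch below. `ParseResultSketch` is a stand-in dataclass so the example runs on its own; the real `ParseResult` fields may differ, and the raw payload keys (`md`, `tables`) are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ParseResultSketch:
    """Stand-in for ParseResult; the real model's fields may differ."""

    page_number: int
    markdown: str
    tables: list = field(default_factory=list)
    parse_duration_ms: float = 0.0


class ExampleAdapter:
    """Maps a hypothetical provider payload onto the normalized result."""

    def adapt(self, raw_response, page_number, parse_duration_ms):
        return ParseResultSketch(
            page_number=page_number,
            markdown=raw_response.get("md", ""),
            tables=list(raw_response.get("tables", [])),
            parse_duration_ms=parse_duration_ms,
        )


raw = {"md": "# Invoice", "tables": [{"rows": 3}]}
adapted = ExampleAdapter().adapt(raw, page_number=1, parse_duration_ms=42.0)
print(adapted.markdown, len(adapted.tables))
```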

### 3. Implement IDocumentParser

```python
from document_core.interfaces import IDocumentParser
from document_parser.models import ParseResult

class MyParserClient(IDocumentParser):
    async def parse(self, page, config) -> ParseResult:
        # Parse page
        pass
    
    async def parse_batch(self, pages, config) -> list[ParseResult]:
        # Parse batch
        pass
```

### 4. Register in Factory

```python
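# Edit document_parser/factory.py and add a branch for the new parser.
# MyParserClient is the class from step 3; import it from wherever you defined it.

class ParserFactory: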
    @staticmethod
    def create(parser_name, config):
        if parser_name == "my_parser":
            return MyParserClient(config)
        # ... existing parsers
```
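
If editing the factory for every new parser becomes a burden, one alternative is a registry that parsers add themselves to. The sketch below is a hypothetical design, not the library's `ParserFactory`: `RegistryFactory`, `register`, and `UnknownParserError` are invented names.

```python
class UnknownParserError(Exception):
    """Raised for names with no registered parser (stand-in for UnsupportedParserError)."""


class RegistryFactory:
    """Hypothetical registry-based factory; not the library's ParserFactory."""

    _registry = {}

    @classmethod
    def register(cls, name):
        def decorator(parser_cls):
            cls._registry[name] = parser_cls
            return parser_cls

        return decorator

    @classmethod
    def create(cls, name, config):
        try:
            parser_cls = cls._registry[name]
        except KeyError:
            raise UnknownParserError(f"no parser registered as {name!r}") from None
        return parser_cls(config)


@RegistryFactory.register("my_parser")
class MyParserClient:
    def __init__(self, config):
        self.config = config


client = RegistryFactory.create("my_parser", {"api_key": "your-key"})
print(type(client).__name__)
```

With this shape, adding a parser means decorating a new class and leaving the factory code untouched.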

## Development Guide

### Running Tests

```bash
# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=document_parser --cov-report=html

# Run specific test file
pytest tests/test_factory.py
```

### Code Style

```bash
# Format code with black
black document_parser/

# Lint with ruff
ruff check document_parser/

# Type check with mypy
mypy document_parser/
```

### Project Structure

```
document-parser/
├── pyproject.toml
├── README.md
├── document_parser/
│   ├── __init__.py
│   ├── factory.py
│   ├── batch.py
│   ├── models.py
│   ├── exceptions.py
│   ├── llamaparse/
│   │   ├── __init__.py
│   │   ├── client.py
│   │   ├── config.py
│   │   ├── adapter.py
│   │   └── retry.py
│   └── azure_di/
│       ├── __init__.py
│       ├── client.py
│       ├── config.py
│       └── adapter.py
├── tests/
│   ├── test_factory.py
│   ├── test_batch.py
│   ├── test_llamaparse.py
│   ├── test_azure_di.py
│   ├── test_adapters.py
│   └── test_retry.py
└── docs/
```

## Design Principles

- **SOLID** - Single responsibility, open/closed, Liskov substitution, interface segregation, dependency inversion
- **Dependency Injection** - Parser instances injected into batch parser
- **Adapter Pattern** - Normalize provider responses
- **Factory Pattern** - Create parsers by name
- **Async-first** - Non-blocking operations
- **High Concurrency** - Semaphore throttling, connection pooling
- **Provider Abstraction** - Switch providers by changing the parser name and config passed to `ParserFactory`
- **Open/Closed Principle** - Add new parsers without modifying existing code
- **Structured Logging** - Log all major operations
- **Testability** - Mock external APIs, unit tests for all components
- **Type Safety** - Complete type hints, mypy validation
- **Pydantic Validation** - Strict data validation

## Dependencies

### Internal

- `document-core` - Shared models, enums, interfaces, and utilities

### External

- `httpx` - Async HTTP client with connection pooling
- `tenacity` - Retry logic with exponential backoff
- `pydantic` - Data validation

### Optional

- `azure-ai-documentintelligence` - Azure Document Intelligence SDK

## License

MIT License - PepsiCo

## Support

For issues, questions, or contributions, please contact the PepsiCo AI Team.
