Metadata-Version: 2.4
Name: pepsico-document-recovery
Version: 0.1.0
Summary: Recovery orchestration framework for targeted extraction failure recovery
Project-URL: Homepage, https://github.com/pepsico-ai/document-recovery
Project-URL: Repository, https://github.com/pepsico-ai/document-recovery
Project-URL: Issues, https://github.com/pepsico-ai/document-recovery/issues
Author-email: PepsiCo AI Team <ai@pepsico.com>
License: MIT
Keywords: document,extraction,ocr,planogram,recovery,spatial
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Requires-Dist: pydantic>=2.0
Requires-Dist: typing-extensions>=4.0
Provides-Extra: dev
Requires-Dist: black>=23.0; extra == 'dev'
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# document-recovery

A framework for orchestrating targeted recovery from document extraction failures. It routes each deficiency to a specialized recovery strategy, merges the recovered data into the primary result, and tracks the cost of every recovery call.

## Overview

`document-recovery` receives deficiencies from `document-confidence`, routes them to specialized recovery strategies (OCR, Table, Spatial, Cross-reference), merges the recovered data, resolves conflicts between primary and recovered values, and tracks costs.

## Architecture

The library follows a modular architecture with:

- **Strategy Pattern** - Specialized recovery strategies for different failure types
- **Router** - Deficiency-to-strategy mapping with budget enforcement
- **Merger** - Recursive merging of primary and recovery output
- **Reconciler** - Deterministic conflict resolution
- **Telemetry** - Cost tracking and performance metrics
- **Protocol-based Interfaces** - Type-safe contracts for extensibility

## Installation

```bash
pip install pepsico-document-recovery
```

### Optional Dependencies

```bash
# For development
pip install "pepsico-document-recovery[dev]"
```

## Quick Start

```python
import asyncio

from document_recovery import (
    RecoveryConfig,
    RecoveryPipeline,
    RecoveryRouter,
)
from document_recovery.strategies import OcrRecoveryStrategy, TableRecoveryStrategy


async def recover(azure_client, extraction, deficiencies, pages):
    # azure_client, extraction, deficiencies, and pages are supplied by the
    # caller: your Azure client setup and the upstream extraction step.

    # Configure recovery
    config = RecoveryConfig(
        max_recovery_attempts=2,
        budget_limit_usd=5.0,
        enable_ocr_recovery=True,
    )

    # Initialize router
    router = RecoveryRouter(config)

    # Initialize strategies
    strategies = {
        "ocr_recovery": OcrRecoveryStrategy(config, azure_client),
        "table_recovery": TableRecoveryStrategy(config, azure_client),
    }

    # Initialize pipeline
    pipeline = RecoveryPipeline(config, router, strategies)

    # Execute recovery (must be awaited inside a coroutine)
    return await pipeline.execute(
        primary_result=extraction,
        deficiencies=deficiencies,
        pages=pages,
    )


# result = asyncio.run(recover(azure_client, extraction, deficiencies, pages))
```

## Configuration

### Recovery Configuration

```python
from document_recovery import RecoveryConfig

config = RecoveryConfig(
    max_recovery_attempts=2,
    max_vision_calls_per_document=10,
    budget_limit_usd=5.0,
    enable_ocr_recovery=True,
    enable_table_recovery=True,
    enable_spatial_recovery=True,
    enable_crossref_recovery=True,
    vision_model="gpt-4.1",
    vision_timeout_seconds=60.0,
    parallel_recovery_limit=5,
)
```

## Recovery Strategies

### OCR Recovery Strategy

Recovers text using Azure Document Intelligence.

```python
from document_recovery.strategies import OcrRecoveryStrategy

strategy = OcrRecoveryStrategy(config, azure_client)
result = await strategy.recover(pages, deficiency)
```

### Table Recovery Strategy

Recovers tables using Azure Document Intelligence.

```python
from document_recovery.strategies import TableRecoveryStrategy

strategy = TableRecoveryStrategy(config, azure_client)
result = await strategy.recover(pages, deficiency)
```

### Spatial Recovery Strategy

Recovers spatial/layout information using Vision LLMs.

```python
from document_recovery.strategies import SpatialRecoveryStrategy

with open("prompts/spatial_recovery.txt", encoding="utf-8") as f:
    prompt = f.read()

strategy = SpatialRecoveryStrategy(config, vision_model, prompt)
result = await strategy.recover(pages, deficiency)
```

### Cross-Reference Recovery Strategy

Recovers cross-references using `document-agents`.

```python
from document_recovery.strategies import CrossrefRecoveryStrategy

strategy = CrossrefRecoveryStrategy(config, crossref_agent)
result = await strategy.recover(pages, deficiency)
```

## Deficiency Routing

The router maps deficiencies to strategies:

```python
STRATEGY_MAP = {
    "ocr_gap": "ocr_recovery",
    "table_missing": "table_recovery",
    "spatial_failure": "spatial_recovery",
    "crossref_broken": "crossref_recovery",
}
```
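
To make "deficiency-to-strategy mapping with budget enforcement" concrete, here is a minimal standalone sketch. It is not the library's `RecoveryRouter`: the `Deficiency` dataclass, its `estimated_cost_usd` field, and the `route` function are hypothetical names invented for this illustration, and the skip-when-over-budget policy is an assumption.

```python
from dataclasses import dataclass

STRATEGY_MAP = {
    "ocr_gap": "ocr_recovery",
    "table_missing": "table_recovery",
    "spatial_failure": "spatial_recovery",
    "crossref_broken": "crossref_recovery",
}


@dataclass
class Deficiency:
    """Hypothetical stand-in for the library's deficiency model."""
    kind: str
    estimated_cost_usd: float


def route(deficiencies, budget_limit_usd):
    """Split deficiencies into (routed, skipped) without exceeding the budget.

    A deficiency is skipped when its kind has no mapped strategy or when
    adding its estimated cost would push the running total over the limit.
    """
    routed, skipped = [], []
    spent = 0.0
    for deficiency in deficiencies:
        strategy = STRATEGY_MAP.get(deficiency.kind)
        over_budget = spent + deficiency.estimated_cost_usd > budget_limit_usd
        if strategy is None or over_budget:
            skipped.append(deficiency)
            continue
        spent += deficiency.estimated_cost_usd
        routed.append((strategy, deficiency))
    return routed, skipped
```

Skipped deficiencies are returned rather than dropped so that a caller can report what was left unrecovered.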

## Result Merging

The merger combines primary extraction with recovery results:

```python
from document_recovery import ResultMerger

merger = ResultMerger()
result = merger.merge(primary_result, recovery_results)
```
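
The idea of recursive merging can be sketched on plain dictionaries. This is an illustration only, not the `ResultMerger` implementation: `merge_recursive` is a hypothetical helper, and the rule "recovery fills only missing or empty primary values" is an assumption made for the example.

```python
from typing import Any


def merge_recursive(primary: Any, recovery: Any) -> Any:
    """Merge recovery output into primary output, descending into dicts.

    Assumed rule: recovery values fill keys that are missing or empty in
    the primary result; non-empty primary leaves are kept unchanged.
    """
    if isinstance(primary, dict) and isinstance(recovery, dict):
        merged = dict(primary)
        for key, value in recovery.items():
            if key in primary:
                merged[key] = merge_recursive(primary[key], value)
            else:
                merged[key] = value
        return merged
    # Leaf values: fall back to recovery only when primary is empty.
    if primary in (None, "", [], {}):
        return recovery
    return primary


primary = {"title": "Shelf A", "items": {"upc": "", "name": "Cola"}}
recovery = {"items": {"upc": "012345", "name": "Kola"}, "notes": "recovered"}
merged = merge_recursive(primary, recovery)
```

In this sketch a real disagreement (two non-empty values) keeps the primary value; deciding such cases is the job of a separate conflict-resolution step.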

## Conflict Resolution

The reconciler resolves conflicts using predefined rules:

```python
RESOLUTION_RULES = {
    "text": "prefer_longer",
    "upc": "prefer_primary",
    "name": "prefer_primary",
    "default": "prefer_recovery",
}
```
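
One plausible reading of these rules is sketched below. The `resolve` function is hypothetical and not part of the documented API; in particular, breaking `prefer_longer` ties in favor of the primary value is an assumption.

```python
RESOLUTION_RULES = {
    "text": "prefer_longer",
    "upc": "prefer_primary",
    "name": "prefer_primary",
    "default": "prefer_recovery",
}


def resolve(field, primary_value, recovery_value):
    """Pick one value for a conflicting field using RESOLUTION_RULES."""
    rule = RESOLUTION_RULES.get(field, RESOLUTION_RULES["default"])
    if rule == "prefer_primary":
        return primary_value
    if rule == "prefer_longer":
        # Assumption: on equal length, keep the primary value.
        if len(recovery_value) > len(primary_value):
            return recovery_value
        return primary_value
    # "prefer_recovery" and any unrecognized rule name
    return recovery_value
```

Because the outcome depends only on the field name and the two values, the same inputs always yield the same result, which is what makes the resolution deterministic.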

## Telemetry

Track recovery metrics and costs:

```python
from document_recovery import RecoveryTelemetry

telemetry = RecoveryTelemetry()
telemetry.record_recovery(
    strategy_name="ocr_recovery",
    success=True,
    cost_usd=0.05,
    latency_ms=100,
    pages_count=2,
)

summary = telemetry.get_summary()
```

## Custom Recovery Strategies

Create custom strategies by extending `BaseRecoveryStrategy` and implementing `_execute`:

```python
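# Assumption: RecoveryResult is exported from the package root;
# adjust the import to wherever your installed version defines it.
from document_recovery import RecoveryResult
from document_recovery.strategies import BaseRecoveryStrategy


class CustomRecoveryStrategy(BaseRecoveryStrategy):
    async def _execute(self, pages, deficiency):
        # Custom recovery logic
        return RecoveryResult(...)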
```
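
Callers invoke `recover` while subclasses implement `_execute`, which suggests a template-method design. The standalone sketch below shows that pattern under assumptions: `SketchBaseStrategy` is a hypothetical stand-in, not the real `BaseRecoveryStrategy`, and the dictionary it returns is invented for illustration.

```python
import asyncio
from abc import ABC, abstractmethod


class SketchBaseStrategy(ABC):
    """Hypothetical stand-in showing a recover/_execute split."""

    async def recover(self, pages, deficiency):
        # Shared wrapper: run the subclass hook and normalize failures.
        try:
            data = await self._execute(pages, deficiency)
            return {"success": True, "data": data}
        except Exception as exc:
            return {"success": False, "error": str(exc)}

    @abstractmethod
    async def _execute(self, pages, deficiency):
        """Strategy-specific recovery logic, supplied by subclasses."""


class UppercaseStrategy(SketchBaseStrategy):
    async def _execute(self, pages, deficiency):
        # Toy "recovery": upper-case each page's text.
        return [page.upper() for page in pages]


outcome = asyncio.run(UppercaseStrategy().recover(["shelf a"], None))
```

Keeping error handling in the base class means each custom strategy only has to express its own recovery logic.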

## Development

### Running Tests

```bash
# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=document_recovery
```

### Code Style

```bash
# Format code
black document_recovery

# Lint code
ruff check document_recovery

# Type check
mypy document_recovery
```

## Design Principles

1. **Async-first** - All operations are asynchronous
2. **Provider Agnostic** - Interfaces for external dependencies
3. **Extensible** - Plugin architecture for custom strategies
4. **Type-safe** - Full type hints with Pydantic validation
5. **Production-oriented** - Budget limits, bounded parallelism, and cost telemetry

## Dependencies

- `pydantic>=2.0` - Data validation
- `typing-extensions>=4.0` - Type extensions

## External Dependencies

The library depends on external interfaces that must be implemented:

- `IAzureDocumentIntelligenceClient` - Azure Document Intelligence
- `IVisionModel` - Vision LLM interface
- `IExtractionAgent` - document-agents interface
- `IDocumentParser` - Document parser interface
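
Protocol-based interfaces let any object with matching methods satisfy the contract, with no inheritance required. The sketch below illustrates the mechanism with `typing.Protocol`. The method name `describe` and its signature are assumptions; the real `IVisionModel` may define different members, so check the package source before implementing it.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class VisionModelLike(Protocol):
    """Hypothetical shape of a vision-model interface."""

    async def describe(self, image: bytes, prompt: str) -> str: ...


class EchoVisionModel:
    """Test double: satisfies the protocol structurally, no base class."""

    async def describe(self, image: bytes, prompt: str) -> str:
        return f"{prompt}: {len(image)} bytes"


model = EchoVisionModel()
reply = asyncio.run(model.describe(b"abc", "layout"))
```

A structural test double like this is enough to exercise a strategy in unit tests without calling an external service.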

## Performance

The library is designed for:

- Parallel recovery execution (see `parallel_recovery_limit`)
- Budget enforcement (see `budget_limit_usd` and `max_vision_calls_per_document`)
- Cost tracking
- Conflict resolution
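
Bounded parallel execution is commonly built on `asyncio.Semaphore`. The sketch below shows that general technique. It is not taken from the library, and `run_limited` is a hypothetical helper name.

```python
import asyncio


async def run_limited(factories, limit):
    """Run coroutine factories concurrently, at most `limit` at a time.

    Results are returned in the same order as `factories`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in factories))


async def demo():
    async def fake_recovery(i):
        await asyncio.sleep(0.01)  # stands in for a recovery call
        return i * 2

    # Each lambda binds its own i so the coroutine is created lazily.
    jobs = [lambda i=i: fake_recovery(i) for i in range(6)]
    return await run_limited(jobs, limit=2)


results = asyncio.run(demo())
```

Passing factories instead of already-created coroutines keeps each call from starting until a semaphore slot is free.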

## License

MIT

## Support

For issues, questions, or contributions, please visit the project repository.
