Metadata-Version: 2.4
Name: document-core
Version: 0.1.0
Summary: Foundational shared library for document-processing and planogram-extraction platform
Author-email: PepsiCo <tech@pepsico.com>
Maintainer-email: PepsiCo <tech@pepsico.com>
License: MIT
Project-URL: Homepage, https://github.com/pepsico/document-core
Project-URL: Documentation, https://github.com/pepsico/document-core#readme
Project-URL: Repository, https://github.com/pepsico/document-core.git
Project-URL: Issues, https://github.com/pepsico/document-core/issues
Keywords: document-processing,planogram-extraction,ocr,parsing,hashing,caching
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Typing :: Typed
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Dynamic: license-file

# document-core

A foundational shared library for the document-processing and planogram-extraction platform. This package provides domain models, interfaces, enums, schemas, exceptions, hashing utilities, and configuration classes for building document processing applications.

## Purpose

`document-core` serves as the contract and domain layer for a larger document processing ecosystem. It defines the shared data structures, protocols, and utilities used across OCR engines, vision models, storage backends, and orchestration services.

**Key Design Principles:**
- Domain Driven Design (DDD)
- SOLID principles
- Strong typing with complete type hints
- Async-first interfaces
- Immutable value objects where appropriate
- Pydantic v2 models for validation
- Production-grade validation
- JSON serialization support
- Auditability and forward compatibility

## Architecture

The library is organized into several key modules:

### Core Modules

- **`enums.py`** - Enumeration types for page types, processing modes, field sources, job status, review decisions, and deficiency types
- **`errors.py`** - Exception hierarchy with error codes, messages, and details
- **`hashing.py`** - SHA256 hashing utilities for files, bytes, and text
- **`config.py`** - Configuration management with environment variable support

### Domain Models (`models/`)

- **`document.py`** - Document, Page, and PageMetadata models
- **`planogram.py`** - Product, Shelf, Section, ExtractionMetadata, and PlanogramResult models
- **`extraction.py`** - FieldConflict and ExtractionResult models
- **`confidence.py`** - ConfidenceScore and ConfidenceReport models
- **`job.py`** - JobConfig and JobInfo models
- **`review.py`** - ReviewTask and ReviewResult models

### Interfaces (`interfaces/`)

Protocol definitions for implementing:
- **`parser.py`** - IDocumentParser for document parsing
- **`ocr.py`** - IOcrEngine for OCR text and table extraction
- **`vision.py`** - IVisionModel for image analysis
- **`agent.py`** - IExtractionAgent for AI-based extraction
- **`storage.py`** - IFileStorage for file operations
- **`cache.py`** - IResultCache for caching
- **`queue.py`** - IJobQueue for job management

### Schemas (`schemas/`)

- **`api_schemas.py`** - API request/response models
- **`output_schema.json`** - JSON Schema for PlanogramResult (Draft 2020-12)

## Installation

### Requirements

- Python >= 3.11
- pydantic >= 2.0

### Install from Source

```bash
cd document-core
pip install -e .
```

### Install Dependencies

```bash
pip install "pydantic>=2.0"
```

## Usage

### Basic Model Usage

```python
from document_core import PageMetadata, PageType

# Create page metadata
metadata = PageMetadata(
    page_number=1,
    page_type=PageType.PLANOGRAM,
    width_px=1920,
    height_px=1080,
    image_area_ratio=0.95,
    small_text_ratio=0.1,
    detected_table_regions=2,
    detected_shelf_regions=5,
    raw_char_count=1000,
    has_rotated_text=False,
    content_hash="a" * 64,
)

# Access computed properties
print(f"Aspect ratio: {metadata.aspect_ratio}")
```

### Planogram Models

```python
from document_core import Product, Shelf, Section, PlanogramResult, ExtractionMetadata
from document_core.enums import FieldSource
from datetime import datetime

# Create a product
product = Product(
    name="Coca-Cola 12oz",
    upc="04963406",
    facings=3,
    source=FieldSource.PRIMARY,
)

# Create a shelf with products
shelf = Shelf(
    shelf_number=1,
    products=[product],
)

# Create a section with shelves
section = Section(
    section_name="Beverages",
    shelves=[shelf],
)

# Create extraction metadata
metadata = ExtractionMetadata(
    processing_time_ms=1500.0,
    model_name="planogram-extractor-v1",
    ocr_engine="tesseract",
    confidence_score=0.92,
    created_at=datetime.now(),
)

# Create complete planogram result
planogram = PlanogramResult(
    store_name="Store #123",
    category="Beverages",
    sections=[section],
    metadata=metadata,
)

# Access computed properties
print(f"Total products: {planogram.total_products}")
print(f"Total shelves: {planogram.total_shelves}")
```

### Confidence Reports

```python
from document_core import ConfidenceScore, ConfidenceReport
from datetime import datetime

# Create field scores
field_scores = [
    ConfidenceScore(
        field_name="product_name",
        score=0.95,
        source="ocr",
        reason="Clear text",
    ),
    ConfidenceScore(
        field_name="upc",
        score=0.88,
        source="ocr",
        reason="Slightly blurry",
    ),
]

# Create confidence report
report = ConfidenceReport(
    overall_score=0.91,
    field_scores=field_scores,
    deficiencies=[],
    generated_at=datetime.now(),
)

# Check if review is required
if report.is_review_required():
    print("Manual review required")
else:
    print("Confidence is acceptable")
```

### Hashing Utilities

```python
from document_core import compute_sha256_file, compute_sha256_text

# Hash text
text_hash = compute_sha256_text("Hello, World!")
print(f"Text hash: {text_hash}")

# Hash file
file_hash = compute_sha256_file("/path/to/document.pdf")
print(f"File hash: {file_hash}")
```
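
For readers who want to see what such helpers typically do internally, here is a minimal sketch built only on the standard library's `hashlib`. The names `sha256_file_sketch` and `sha256_text_sketch`, the chunk size, and the UTF-8 default are assumptions made for illustration. They are not the actual implementation in `hashing.py`.

```python
import hashlib


def sha256_file_sketch(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_text_sketch(text: str, encoding: str = "utf-8") -> str:
    """Hash text by encoding it to bytes first (UTF-8 is an assumption here)."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()
```

Both functions return a 64-character lowercase hex digest, the same shape as the `content_hash` value used in the `PageMetadata` example.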

### Configuration

```python
from document_core import BaseConfig

# Load from environment variables
config = BaseConfig.from_env()

# Or create directly
config = BaseConfig(
    environment="production",
    log_level="INFO",
    cache_ttl_seconds=3600,
)
```

### Error Handling

```python
from document_core import ValidationError, DocumentParseError

try:
    # Your validation logic
    pass
except ValidationError as e:
    print(f"Validation failed: {e.message}")
    print(f"Field: {e.details.get('field')}")
except DocumentParseError as e:
    print(f"Parse failed: {e.message}")
    print(f"Document ID: {e.details.get('document_id')}")
```
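
The handlers above rely on each exception exposing a `message` and a `details` mapping. One way such a hierarchy could be structured is sketched below. The class names `SketchCoreError` and `SketchParseError`, the `code` values, and the `to_dict` helper are hypothetical and are not taken from `errors.py`.

```python
from typing import Any, Optional


class SketchCoreError(Exception):
    """Hypothetical base error carrying a code, a message, and structured details."""

    def __init__(
        self,
        message: str,
        code: str = "CORE_ERROR",
        details: Optional[dict[str, Any]] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.code = code
        self.details = details or {}

    def to_dict(self) -> dict[str, Any]:
        """Return a JSON-friendly view, useful for logs or API error bodies."""
        return {"code": self.code, "message": self.message, "details": self.details}


class SketchParseError(SketchCoreError):
    """Hypothetical subclass that records which document failed to parse."""

    def __init__(self, message: str, document_id: str) -> None:
        super().__init__(message, code="PARSE_ERROR", details={"document_id": document_id})


def describe(error: SketchCoreError) -> str:
    """Format any error in the hierarchy the same way."""
    return f"[{error.code}] {error.message}"
```

Catching the base class handles every subclass, while `details.get(...)` stays safe even when a particular key is absent.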

## Extending Interfaces

The library provides Protocol-based interfaces that you can implement to create custom components:

### Implementing a Custom OCR Engine

```python
from document_core.interfaces import IOcrEngine, OcrResult, TableResult
from document_core.errors import OcrError

class CustomOcrEngine(IOcrEngine):
    async def extract_text(self, image_path: str) -> OcrResult:
        try:
            # Your OCR implementation
            text = "Extracted text..."
            confidence = 0.95
            
            return OcrResult(
                text=text,
                confidence=confidence,
                processing_time_ms=500.0,
                success=True,
            )
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise OcrError(
                message=f"OCR failed: {e}",
                ocr_engine="custom",
            ) from e
    
    async def extract_tables(self, image_path: str) -> TableResult:
        # Your table extraction implementation
        raise NotImplementedError("Table extraction is not implemented in this example")
```

### Implementing a Custom Storage Backend

```python
from document_core.interfaces import IFileStorage
from document_core.errors import StorageError

class S3Storage(IFileStorage):
    async def upload(self, file_path: str, storage_key: str) -> str:
        # Your S3 upload implementation
        return f"s3://bucket/{storage_key}"
    
    async def download(self, storage_key: str, local_path: str) -> None:
        # Your S3 download implementation
        pass
    
    async def exists(self, storage_key: str) -> bool:
        # Your existence check implementation
        pass
    
    async def delete(self, storage_key: str) -> None:
        # Your delete implementation
        pass
```

## JSON Schema

A JSON Schema for the `PlanogramResult` model is provided in `schemas/output_schema.json`. This schema follows JSON Schema Draft 2020-12 and can be used for validation in systems that don't use Python/Pydantic.

```bash
# Validate a JSON file against the schema (ajv-cli; --spec selects Draft 2020-12)
ajv validate --spec=draft2020 -s document_core/schemas/output_schema.json -d data.json
```

## Testing

Run the test suite:

```bash
pytest tests/
```

Run tests with coverage (requires the `pytest-cov` plugin, which is not listed in the `dev` extra):

```bash
pytest tests/ --cov=document_core --cov-report=html
```

## Versioning Strategy

This project follows [Semantic Versioning](https://semver.org/):

- **MAJOR**: Incompatible API changes
- **MINOR**: New functionality in a backwards compatible manner
- **PATCH**: Backwards compatible bug fixes

Because the library is in early development (v0.1.0), minor releases may include breaking changes until v1.0 is released.

## Project Structure

```
document-core/
├── pyproject.toml
├── README.md
├── document_core/
│   ├── __init__.py
│   ├── enums.py
│   ├── errors.py
│   ├── hashing.py
│   ├── config.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── document.py
│   │   ├── planogram.py
│   │   ├── extraction.py
│   │   ├── confidence.py
│   │   ├── job.py
│   │   └── review.py
│   ├── interfaces/
│   │   ├── __init__.py
│   │   ├── parser.py
│   │   ├── ocr.py
│   │   ├── vision.py
│   │   ├── agent.py
│   │   ├── storage.py
│   │   ├── cache.py
│   │   └── queue.py
│   └── schemas/
│       ├── __init__.py
│       ├── api_schemas.py
│       └── output_schema.json
└── tests/
    ├── test_hashing.py
    ├── test_enums.py
    ├── test_models.py
    ├── test_document_validation.py
    └── test_confidence.py
```

## Design Decisions

### Pure Contract Library

This package contains **no implementations** of:
- OCR engines
- AI models
- Storage backends
- Business logic
- Orchestration code

It is intentionally a pure shared contract/domain package. Implementations should be provided by separate service packages that depend on `document-core`.

### Pydantic v2

All models use Pydantic v2 with:
- `extra="forbid"` - Prevents unexpected fields
- `validate_assignment=True` - Validates on assignment
- Comprehensive validators for data integrity

### Async-First Interfaces

All protocol interfaces are async to support high-throughput, non-blocking operations in production environments.
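
As a rough illustration of how an async, Protocol-based contract works, the sketch below defines a hypothetical `ResultCacheSketch` protocol and an in-memory class that satisfies it structurally. The method names `get` and `set` and their signatures are assumptions. They do not reflect the real `IResultCache` definition.

```python
import asyncio
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class ResultCacheSketch(Protocol):
    """Hypothetical async cache contract (not the real IResultCache signature)."""

    async def get(self, key: str) -> Optional[str]: ...

    async def set(self, key: str, value: str) -> None: ...


class InMemoryCache:
    """Satisfies the protocol structurally, without inheriting from it."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}

    async def get(self, key: str) -> Optional[str]:
        return self._items.get(key)

    async def set(self, key: str, value: str) -> None:
        self._items[key] = value


async def cached_upper(cache: ResultCacheSketch, text: str) -> str:
    """Return the cached result if present, otherwise compute and store it."""
    hit = await cache.get(text)
    if hit is not None:
        return hit
    result = text.upper()
    await cache.set(text, result)
    return result
```

Callers depend only on the protocol, so a Redis-backed or file-backed cache could be swapped in without touching `cached_upper`. A quick check is `asyncio.run(cached_upper(InMemoryCache(), "shelf"))`.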

### Enum Serialization

All enums inherit from both `str` and `Enum` for seamless JSON serialization/deserialization.
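
The effect of the `str` mixin can be seen with the standard library alone. The enum below, `JobStatusSketch`, and its members are hypothetical and are not copied from `enums.py`.

```python
import json
from enum import Enum


class JobStatusSketch(str, Enum):
    """Hypothetical enum; member names are not taken from document_core.enums."""

    PENDING = "pending"
    COMPLETED = "completed"


# Members are real str instances, so json.dumps needs no custom encoder
payload = json.dumps({"status": JobStatusSketch.PENDING})

# Looking the raw string up by value restores the original member
restored = JobStatusSketch(json.loads(payload)["status"])
```

Because members compare equal to their plain string values, code that receives `"pending"` from JSON can compare it directly against `JobStatusSketch.PENDING`.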

## License

MIT. See the `LICENSE` file.

## Support

For issues, questions, or contributions, please contact the PepsiCo AI Team.
