Metadata-Version: 2.4
Name: open-receipt-extractor
Version: 0.1.0
Summary: Modular Python pipeline that converts raw receipt documents (images or PDFs) into structured, analytics-ready JSON.
Project-URL: Homepage, https://github.com/malekatwiz/open-receipt-extractor
Project-URL: Documentation, https://github.com/malekatwiz/open-receipt-extractor/blob/main/docs/receipt-processing-design.md
Project-URL: Repository, https://github.com/malekatwiz/open-receipt-extractor
Project-URL: Bug Tracker, https://github.com/malekatwiz/open-receipt-extractor/issues
Project-URL: Changelog, https://github.com/malekatwiz/open-receipt-extractor/blob/main/CHANGELOG.md
Author: Open Receipt Extractor Contributors
License: MIT
License-File: LICENSE
Keywords: extraction,nlp,ocr,pipeline,receipt,structured-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: numpy>=1.24
Requires-Dist: opencv-python-headless>=4.8
Requires-Dist: pillow>=10.0
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dateutil>=2.8
Provides-Extra: all
Requires-Dist: easyocr>=1.7; extra == 'all'
Requires-Dist: mypy>=1.5; extra == 'all'
Requires-Dist: paddleocr>=2.7; extra == 'all'
Requires-Dist: pymupdf>=1.24.0; extra == 'all'
Requires-Dist: pytesseract>=0.3; extra == 'all'
Requires-Dist: pytest-cov>=4.0; extra == 'all'
Requires-Dist: pytest>=7.0; extra == 'all'
Requires-Dist: pyyaml>=6.0; extra == 'all'
Requires-Dist: ruff>=0.1; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.5; extra == 'dev'
Requires-Dist: pymupdf>=1.24.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: pyyaml>=6.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Provides-Extra: easyocr
Requires-Dist: easyocr>=1.7; extra == 'easyocr'
Provides-Extra: paddleocr
Requires-Dist: paddleocr>=2.7; extra == 'paddleocr'
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.24.0; extra == 'pdf'
Provides-Extra: tesseract
Requires-Dist: pytesseract>=0.3; extra == 'tesseract'
Description-Content-Type: text/markdown

# Open Receipt Extractor

[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Build](https://img.shields.io/github/actions/workflow/status/malekatwiz/open-receipt-extractor/ci.yml?branch=main&label=build)](https://github.com/malekatwiz/open-receipt-extractor/actions)
[![Coverage](https://img.shields.io/badge/coverage-80%25%2B-brightgreen.svg)](https://github.com/malekatwiz/open-receipt-extractor/actions)

**Open Receipt Extractor** is an open-source, modular **Python library** that converts raw receipt documents — images or PDFs, including poor-quality photos (wrinkled, skewed, shadowed, faded) — into **structured, analytics-ready JSON**. It supports bilingual extraction in **English and French**, covering common Canadian tax regimes (GST/HST/TVQ/QST) and international variations (VAT, sales tax).

The library is designed to be **consumed by other developers**. You bring the bytes; Open Receipt Extractor handles the extraction. How those bytes arrive — uploaded via a FastAPI endpoint, received as an email attachment, pulled from a cloud bucket, or read from a folder — is entirely up to your application. The library starts at raw bytes and returns a validated [`Receipt`](#what-it-produces) object; what you do with the result is your call.

The pipeline is also designed to be pluggable: OCR engines, preprocessing strategies, and output backends are all interchangeable without modifying core parsing logic.
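
A minimal sketch of what "pluggable" can look like in practice, using only the standard library. The names `OcrToken`, `OcrEngine`, `FakeEngine`, and `extract_text` are hypothetical and are not the library's actual interfaces; see ARCHITECTURE.md for the real contracts.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class OcrToken:
    """One recognized word (hypothetical shape, not the library's type)."""
    text: str
    bbox: tuple[int, int, int, int]  # x0, y0, x1, y1 in pixels
    confidence: float


class OcrEngine(Protocol):
    """Hypothetical contract that any interchangeable engine would satisfy."""

    def recognize(self, image: bytes) -> list[OcrToken]: ...


class FakeEngine:
    """Stand-in engine that returns canned tokens, handy in unit tests."""

    def recognize(self, image: bytes) -> list[OcrToken]:
        return [
            OcrToken("TOTAL", (10, 200, 80, 220), 0.98),
            OcrToken("47.83", (300, 200, 360, 220), 0.95),
        ]


def extract_text(engine: OcrEngine, image: bytes) -> str:
    """Depends only on the protocol, never on a concrete engine."""
    return " ".join(token.text for token in engine.recognize(image))


text = extract_text(FakeEngine(), b"")
```

Because `extract_text` only relies on the `recognize` method, swapping `FakeEngine` for another engine requires no change to the calling code.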

---

## What It Produces

For every receipt processed, the pipeline emits:

| Output | Format | Description |
|---|---|---|
| **Receipt JSON** | `Receipt` Pydantic model → JSON | Merchant, transaction, amounts, taxes, line items, confidence score |
| **Tabular export** | `receipts` + `receipt_items` rows | Flat rows ready for data warehouse ingestion |
| **Debug artifacts** | Images, OCR JSON, parse trace | Stored to a configurable backend for audit and continuous improvement |
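
To illustrate the tabular export, the sketch below splits one nested receipt into a header row and per-item rows. The dictionary layout, the column names, and the `to_rows` helper are assumptions made for this example; the library's actual row schema may differ.

```python
from decimal import Decimal

# Hypothetical receipt payload; field names are illustrative only.
receipt = {
    "receipt_id": "r-001",
    "merchant": {"name": "GROCERY WORLD"},
    "amounts": {"total": Decimal("47.83")},
    "items": [
        {"description": "MILK 2L", "total_price": Decimal("5.49")},
        {"description": "BREAD", "total_price": Decimal("3.29")},
    ],
}


def to_rows(rec: dict) -> tuple[dict, list[dict]]:
    """Split one nested receipt into a `receipts` row and `receipt_items` rows."""
    header = {
        "receipt_id": rec["receipt_id"],
        "merchant_name": rec["merchant"]["name"],
        "total": rec["amounts"]["total"],
    }
    # Each item row carries the parent key so the two tables can be joined.
    items = [
        {"receipt_id": rec["receipt_id"], "line_no": i, **item}
        for i, item in enumerate(rec["items"], start=1)
    ]
    return header, items


receipt_row, item_rows = to_rows(receipt)
```

Repeating `receipt_id` on every item row is what allows a warehouse to join `receipt_items` back to `receipts`.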

---

## Architecture Overview

The pipeline runs through seven sequential stages, each encapsulated in its own module:

```
Input bytes (image or PDF)
        │
        ▼
 1. Document Normalization    ── Detect format; decode bytes into PageImage[]
        │
        ▼
 2. Image Preprocessing       ── Generate up to 6 enhanced variants per page
        │
        ▼
 3. OCR (Pluggable)           ── Extract text + bounding boxes + confidence
        │
        ▼
 4. Layout Reconstruction     ── Group tokens into lines/blocks; detect right-aligned amounts
        │
        ▼
 5. Receipt Parsing           ── Extract merchant, date, totals, taxes, line items, payment
        │
        ▼
 6. Validation & Scoring      ── Cross-check math; compute parse_confidence; flag for review
        │
        ▼
 7. Structured Output         ── Emit validated JSON; persist artifacts
        │
        ▼
     Receipt JSON
```

For full architectural detail, see [ARCHITECTURE.md](ARCHITECTURE.md) and the [design document](docs/receipt-processing-design.md).
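
As a rough illustration of stage 6, the sketch below cross-checks that subtotal plus taxes matches the total and derives a review flag. The `score_totals` function, its tolerance, and its confidence weights are hypothetical; the real scoring logic is described in the design document and is likely more involved.

```python
from decimal import Decimal


def score_totals(
    subtotal: Decimal,
    taxes: list[Decimal],
    total: Decimal,
    tolerance: Decimal = Decimal("0.02"),
    review_threshold: float = 0.50,
) -> tuple[float, bool]:
    """Return (confidence, needs_review) from a simple arithmetic check."""
    expected = subtotal + sum(taxes, Decimal("0"))
    consistent = abs(expected - total) <= tolerance
    # Illustrative weights only: a consistent total earns a high score.
    confidence = 0.9 if consistent else 0.3
    return confidence, confidence < review_threshold


# 41.60 + 2.08 + 4.15 equals the printed total, so no review is needed.
ok = score_totals(Decimal("41.60"), [Decimal("2.08"), Decimal("4.15")], Decimal("47.83"))

# A total that is off by ten dollars fails the check and is flagged.
bad = score_totals(Decimal("41.60"), [Decimal("2.08"), Decimal("4.15")], Decimal("57.83"))
```

Using `Decimal` rather than `float` avoids rounding noise when comparing monetary amounts.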

---

## Quick Start

### Installation

Install the core package:

```bash
pip install open-receipt-extractor
```

Install with PDF support (recommended):

```bash
pip install "open-receipt-extractor[pdf]"
```

Install with PDF support and the EasyOCR adapter (primary, recommended):

```bash
pip install "open-receipt-extractor[pdf,easyocr]"
```

> **Note:** EasyOCR downloads model weights on first use (~200 MB). No OS-level binaries are required.

### Basic Usage

```python
from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter

# Build processor with default settings and EasyOCR
settings = Settings()
ocr = EasyOcrAdapter(settings)
processor = ReceiptProcessor(config=settings, ocr_adapter=ocr)

# Process from bytes
with open("receipt.jpg", "rb") as f:
    receipt = processor.process_bytes(f.read(), filename="receipt.jpg")

# Access structured fields
print(receipt.merchant.name)               # "GROCERY WORLD"
print(receipt.transaction.datetime)        # 2024-03-15T14:32:00
print(receipt.amounts.total)               # Decimal('47.83')
print(receipt.quality.parse_confidence)    # 0.92
print(receipt.quality.needs_review)        # False

# Serialize to JSON
from receipt_processor.output.json_serializer import serialize_receipt
json_output = serialize_receipt(receipt)
```

### Process from a `DocumentHandle`

```python
from receipt_processor.core.types import DocumentHandle

class FileHandle:
    def __init__(self, path: str) -> None:
        self._path = path

    def get_bytes(self) -> bytes:
        with open(self._path, "rb") as f:
            return f.read()

    def get_metadata(self) -> dict:
        return {"filename": self._path}

# Reuses the `processor` built in Basic Usage above
receipt = processor.process(FileHandle("receipt.pdf"))
```
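
The same two-method shape also works for bytes that are already in memory. The sketch below is self-contained and assumes that a handle only needs `get_bytes` and `get_metadata`, as in the example above; `HandleLike` is a local stand-in protocol and `BytesHandle` is a hypothetical helper, neither is part of the library.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HandleLike(Protocol):
    """Local stand-in for the two methods the FileHandle example implements."""

    def get_bytes(self) -> bytes: ...

    def get_metadata(self) -> dict: ...


class BytesHandle:
    """Wraps bytes already in memory, e.g. from an upload or a queue message."""

    def __init__(self, data: bytes, filename: str = "receipt") -> None:
        self._data = data
        self._filename = filename

    def get_bytes(self) -> bytes:
        return self._data

    def get_metadata(self) -> dict:
        return {"filename": self._filename}


handle = BytesHandle(b"%PDF-1.7", filename="scan.pdf")
```

If the assumption holds, an instance like `handle` could be passed to `processor.process(...)` in the same way as `FileHandle`.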

### Integration Examples

Because Open Receipt Extractor is a library, ingestion is always your concern. Here are two minimal patterns showing how to connect the library to different delivery channels.

**FastAPI file upload**

```python
from fastapi import FastAPI, UploadFile
from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter

app = FastAPI()
settings = Settings()
processor = ReceiptProcessor(config=settings, ocr_adapter=EasyOcrAdapter(settings))

@app.post("/receipts/extract")
async def extract_receipt(file: UploadFile):
    data = await file.read()
    # aprocess_bytes is non-blocking — safe to use inside async endpoint handlers
    receipt = await processor.aprocess_bytes(data, filename=file.filename)
    return receipt.model_dump()
```

**Email attachment (imaplib)**

```python
import email
import imaplib
from receipt_processor.pipeline.runner import ReceiptProcessor
from receipt_processor.config import Settings
from receipt_processor.ocr.adapters.easyocr_adapter import EasyOcrAdapter

settings = Settings()
processor = ReceiptProcessor(config=settings, ocr_adapter=EasyOcrAdapter(settings))

with imaplib.IMAP4_SSL("imap.example.com") as imap:
    # Placeholder credentials; load real ones from a secret store or the environment
    imap.login("user@example.com", "password")
    imap.select("INBOX")
    _, ids = imap.search(None, "UNSEEN")
    for num in ids[0].split():
        _, data = imap.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(data[0][1])
        for part in msg.walk():
            if part.get_content_maintype() == "image":
                receipt = processor.process_bytes(
                    part.get_payload(decode=True),
                    filename=part.get_filename() or "receipt",
                )
                print(receipt.amounts.total)
```

---

## Configuration

### YAML file

Create `receipt_processor_config.yaml` in your working directory or pass an explicit path:

```yaml
preprocessing:
  enabled_variants: ["V0", "V1", "V2", "V3", "V4", "V5"]
  fast_mode_variants: ["V0", "V1"]
  pdf_render_dpi: 200
  max_image_size: [4096, 4096]

ocr:
  engine: "easyocr"
  languages: ["en", "fr"]
  detect_orientation: true

parsing:
  region_hint: "CA"        # Canadian tax regime defaults
  currency_hint: "CAD"

validation:
  needs_review_threshold: 0.50

pipeline:
  mode: "balanced"          # fast | balanced | accurate

artifacts:
  store_original: true
  store_ocr_json: true
  store_parse_trace: false
  storage_backend: "local"
  local_path: "./artifacts"
```

Load the file with an explicit path:

```python
settings = Settings.from_file("receipt_processor_config.yaml")
```

### Environment variable overrides

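Every configuration key can be overridden with an environment variable named `RECEIPT_PROCESSOR_<SECTION>__<KEY>`: the fixed prefix, then the section and key joined by a double underscore. This is useful for container deployments: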

```bash
export RECEIPT_PROCESSOR_OCR__ENGINE=easyocr
export RECEIPT_PROCESSOR_PARSING__REGION_HINT=CA
export RECEIPT_PROCESSOR_PIPELINE__MODE=accurate
export RECEIPT_PROCESSOR_ARTIFACTS__STORAGE_BACKEND=none
```

```python
settings = Settings.from_env()
```
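
To make the naming scheme concrete, here is a standard-library sketch of how prefixed variables with a double-underscore delimiter can map onto nested configuration sections. The `nest_env` helper is hypothetical. pydantic-settings offers `env_prefix` and `env_nested_delimiter` options for this kind of mapping, but whether `Settings` is configured that way is an assumption.

```python
def nest_env(environ: dict[str, str], prefix: str = "RECEIPT_PROCESSOR_") -> dict:
    """Turn PREFIX_SECTION__KEY=value pairs into {"section": {"key": value}}."""
    nested: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variables such as PATH are ignored
        *parents, leaf = name[len(prefix):].lower().split("__")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


overrides = nest_env(
    {
        "RECEIPT_PROCESSOR_OCR__ENGINE": "easyocr",
        "RECEIPT_PROCESSOR_PIPELINE__MODE": "accurate",
        "PATH": "/usr/bin",
    }
)
```

The resulting `overrides` dictionary mirrors the section layout of the YAML file, so `ocr.engine` and `pipeline.mode` line up with their environment counterparts.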

---

## Documentation

| Document | Description |
|---|---|
| [docs/receipt-processing-design.md](docs/receipt-processing-design.md) | Full pipeline design: data model, algorithm detail, extensibility points |
| [docs/project-deliverables.md](docs/project-deliverables.md) | Master deliverable list with IDs, owners, priorities, and acceptance criteria |
| [ARCHITECTURE.md](ARCHITECTURE.md) | Module boundaries, interface contracts, data-flow diagram, design decisions |
| [CONTRIBUTING.md](CONTRIBUTING.md) | Development setup, code style, testing guide, how to add an OCR adapter |

---

## Contributing

Contributions are welcome. Please read [CONTRIBUTING.md](CONTRIBUTING.md) for the development setup guide, code style requirements, and how to add new features such as OCR adapters or preprocessing variants.

---

## License

[MIT License](LICENSE) — Copyright © Open Receipt Extractor Contributors
