Metadata-Version: 2.4
Name: onflow-awb-ocr
Version: 0.1.0
Summary: Python SDK for extracting receiver information from AWB/shipping labels.
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: PyMuPDF>=1.20
Requires-Dist: requests>=2.25
Provides-Extra: ocr
Requires-Dist: opencv-python>=4.5; extra == "ocr"
Requires-Dist: pytesseract>=0.3; extra == "ocr"
Requires-Dist: numpy>=1.21; extra == "ocr"

# Onflow AWB OCR

Python SDK for extracting receiver information from AWB and shipping label files.

The package first tries to read the text layer of a PDF. For scanned PDFs and
image files, it falls back to OCR when the OCR dependencies are installed.

## Requirements

- Python 3.8+
- PyMuPDF for PDF text-layer extraction
- Optional OCR stack for scanned files and images:
  - Tesseract OCR
  - Vietnamese Tesseract language data
  - Poppler `pdftoppm`

## Installation

Install from PyPI:

```bash
pip install onflow-awb-ocr
```

Install with OCR dependencies:

```bash
pip install "onflow-awb-ocr[ocr]"
```

On Ubuntu, install the native OCR tools:

```bash
sudo apt install -y tesseract-ocr tesseract-ocr-vie poppler-utils
```
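
Before relying on the OCR fallback, it can help to confirm that the native tools are discoverable. The following is a minimal sketch using only the standard library; the helper name `find_ocr_tools` is hypothetical and not part of this package. It only checks `PATH` and does not verify that the Vietnamese language data is installed.

```python
import shutil
from typing import Dict, Optional


def find_ocr_tools() -> Dict[str, Optional[str]]:
    """Return the resolved path of each native OCR tool, or None if missing."""
    # "tesseract" and "pdftoppm" are the executables shipped by the
    # tesseract-ocr and poppler-utils packages respectively.
    return {tool: shutil.which(tool) for tool in ("tesseract", "pdftoppm")}


if __name__ == "__main__":
    for tool, path in find_ocr_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```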

For local development:

```bash
pip install -e ".[ocr]"
```

## Usage

```python
from onflow_awb_ocr import OnflowAwbOcr

ocr = OnflowAwbOcr(lang="vie+eng")
result = ocr.extract("label.pdf")

print(result)
```

Example result:

```python
{
    "name": "Nguyen Van A",
    "address": "123 Nguyen Trai\nQuan 1, TP. Ho Chi Minh",
    "strategy": "shopee",
}
```

If no receiver can be detected, `extract()` returns `None`.
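
Because `extract()` can return `None`, callers should check the result before reading its keys. The following is a minimal sketch, assuming the result is a mapping shaped like the example above; `format_receiver` is a hypothetical helper, not part of the SDK.

```python
from typing import Mapping, Optional


def format_receiver(result: Optional[Mapping[str, str]]) -> str:
    """Render an extract() result as one line, handling the None case."""
    if result is None:
        return "No receiver detected"
    # Addresses may span several lines; join them for single-line display.
    address = ", ".join(part.strip() for part in result["address"].splitlines())
    return f"{result['name']} | {address} | via {result['strategy']}"


sample = {
    "name": "Nguyen Van A",
    "address": "123 Nguyen Trai\nQuan 1, TP. Ho Chi Minh",
    "strategy": "shopee",
}
print(format_receiver(sample))
print(format_receiver(None))
```

In real use, `sample` would be replaced by the value returned from `ocr.extract(...)`.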

## Supported Inputs

`extract()` accepts:

- Local file path as `str`
- Local file path as `pathlib.Path`
- HTTP/HTTPS URL
- `bytes`
- `bytearray`
- Binary file-like object

Examples:

```python
from pathlib import Path

from onflow_awb_ocr import OnflowAwbOcr

ocr = OnflowAwbOcr()

from_path = ocr.extract(Path("label.pdf"))
from_url = ocr.extract("https://example.com/label.pdf")

with open("label.pdf", "rb") as file:
    from_file = ocr.extract(file)

with open("label.png", "rb") as file:
    from_bytes = ocr.extract(file.read())
```

## Compatibility

The old `ReceiverExtractor` class name is still available as an alias:

```python
from onflow_awb_ocr import ReceiverExtractor

ocr = ReceiverExtractor()
result = ocr.extract("label.pdf")
```

## Package Structure

- `extractor.py`: public `OnflowAwbOcr` class
- `input.py`: input preparation for paths, URLs, bytes, and binary streams
- `text_layer.py`: PDF text-layer extraction strategies
- `ocr.py`: OCR fallback for scanned PDFs and images
- `postprocess.py`: address cleanup
- `types.py`, `constants.py`, `utils.py`: shared types, constants, and helpers

## Publishing

GitHub Actions builds and publishes the package to PyPI on every push to `main`.

The repository must define this GitHub secret:

```text
PYPI_API_TOKEN
```

PyPI does not allow replacing an existing version. If a commit on `main` does not
bump `project.version` in `pyproject.toml`, the publish step skips the upload
because that version already exists on PyPI.
