Metadata-Version: 2.4
Name: awb-extractor
Version: 0.1.5
Summary: Extract recipient address from AWB/shipping label PDF using Claude AI
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: anthropic>=0.40.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: pymupdf>=1.24.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-mock>=3.14; extra == "dev"

# AWB Extractor

Python SDK for extracting recipient, sender, shipment, carrier, and e-commerce
platform information from Vietnamese AWB (air waybill) and shipping label PDF
files using Claude AI.

## Features

- Extract from PDF bytes, local PDF files, or PDF URLs
- Batch extraction from multiple URLs
- Optional PDF-to-JPEG optimization before sending to Claude
- Cost estimation helper for PDF and optimized image modes
- Optional default HTTP headers for protected AWB URLs
- Typed `AWBResult` dataclass output
- Custom exceptions for API key, PDF download, and JSON parsing failures

## Requirements

- Python 3.9+
- Anthropic API key
- `pymupdf` (installed automatically as a dependency), used for the PDF-to-JPEG conversion in optimized mode

## Installation

Install from PyPI:

```bash
pip install awb-extractor
```

For local development:

```bash
pip install -e ".[dev]"
```

## Usage

```python
from awb_extractor import AWBExtractor

extractor = AWBExtractor(api_key="sk-ant-...")
result = extractor.from_file("label.pdf")

print(result.recipient_name)
print(result.carrier)
print(result.platform)
print(result.to_dict())
```

Example result:

```python
{
    "tracking_number": "NHSVC972103440",
    "recipient_name": "Nguyen Van A",
    "recipient_phone": "(+84)03******37",
    "recipient_address": "237 Nguyen Trai",
    "recipient_ward": "Phuong Ben Thanh",
    "recipient_district": "Quan 1",
    "recipient_province": "TP. Ho Chi Minh",
    "sender_name": "Onflow",
    "sender_address": "TP. Ho Chi Minh",
    "cod": "0",
    "weight": "0.700 KG",
    "order_id": "584425059595159079",
    "carrier": "GHN",
    "platform": "Shopee",
}
```

By default, `AWBExtractor` converts the first PDF page to JPEG before sending it
to Claude:

```python
extractor = AWBExtractor(
    api_key="sk-ant-...",
    optimize=True,
    dpi=200,
)
```
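
To get a feel for what `dpi` controls, the sketch below computes the approximate pixel size of a rasterized page. It relies only on the fact that PDF page dimensions are measured in points (1/72 inch). The helper name `rasterized_size` and the 4 x 6 inch label size are illustrative assumptions, and the exact dimensions produced by `pymupdf` may differ by a pixel because of rounding.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF page sizes are expressed in points (1/72 inch)


def rasterized_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# Hypothetical 4 x 6 inch shipping label (288 x 432 points).
print(rasterized_size(288, 432, dpi=200))  # (800, 1200)
print(rasterized_size(288, 432, dpi=100))  # (400, 600)
```

Halving the DPI cuts the pixel count to a quarter, which usually lowers image token usage at the expense of legibility for small print.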

If you want to send the original PDF document directly, set `optimize=False`:

```python
extractor = AWBExtractor(api_key="sk-ant-...", optimize=False)
```

## Supported Inputs

### PDF bytes

```python
from awb_extractor import AWBExtractor

extractor = AWBExtractor(api_key="sk-ant-...")

with open("label.pdf", "rb") as file:
    result = extractor.from_bytes(file.read())
```

### Local PDF file

```python
from awb_extractor import AWBExtractor

extractor = AWBExtractor(api_key="sk-ant-...")
result = extractor.from_file("label.pdf")
```

### PDF URL

```python
from awb_extractor import AWBExtractor

extractor = AWBExtractor(
    api_key="sk-ant-...",
    http_headers={"Authorization": "Bearer token"},
)

result = extractor.from_url("https://example.com/awb.pdf")
```

You can pass request-specific headers with `extra_headers`:

```python
result = extractor.from_url(
    "https://example.com/awb.pdf",
    extra_headers={"X-Request-ID": "request-123"},
)
```
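
This README does not say how the default `http_headers` and the per-request `extra_headers` are combined. A plausible reading, sketched below with a hypothetical `merge_headers` helper, is that per-request values are layered on top of the defaults and win on conflict. Treat this as an assumption to verify against the library's actual behavior.

```python
from typing import Dict, Mapping, Optional


def merge_headers(
    default_headers: Optional[Mapping[str, str]],
    extra_headers: Optional[Mapping[str, str]],
) -> Dict[str, str]:
    """Combine default and per-request headers; per-request values win (assumed)."""
    merged = dict(default_headers or {})
    merged.update(extra_headers or {})
    return merged


defaults = {"Authorization": "Bearer token"}
per_request = {"X-Request-ID": "request-123"}
print(merge_headers(defaults, per_request))
# {'Authorization': 'Bearer token', 'X-Request-ID': 'request-123'}
```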

### Multiple URLs

`from_urls()` returns a list of dictionaries, each with the keys `url`, `data`,
and `error`. A failed URL is reported in its own entry and does not abort the
rest of the batch.

```python
from awb_extractor import AWBExtractor

extractor = AWBExtractor(api_key="sk-ant-...")
results = extractor.from_urls([
    "https://example.com/good.pdf",
    "https://example.com/bad.pdf",
])
```
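
Because failures are reported per URL, callers typically split the batch into successes and failures. The sketch below assumes that a successful entry has an empty `error` (for example `None`) and that a failed entry carries an error value; the exact types of `data` and `error` are not specified here. `split_batch` and the sample entries are hypothetical.

```python
from typing import Any, Dict, List, Tuple

BatchEntry = Dict[str, Any]


def split_batch(results: List[BatchEntry]) -> Tuple[List[BatchEntry], List[BatchEntry]]:
    """Partition batch entries into (succeeded, failed) based on the `error` key."""
    succeeded = [entry for entry in results if not entry.get("error")]
    failed = [entry for entry in results if entry.get("error")]
    return succeeded, failed


# Hand-written sample shaped like the described output, not real API output.
sample = [
    {"url": "https://example.com/good.pdf", "data": {"carrier": "GHN"}, "error": None},
    {"url": "https://example.com/bad.pdf", "data": None, "error": "download failed"},
]

succeeded, failed = split_batch(sample)
for entry in failed:
    print(f"{entry['url']}: {entry['error']}")
# https://example.com/bad.pdf: download failed
```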

## Cost Estimation

`estimate_cost()` estimates token usage and cost before calling the API.

```python
from pathlib import Path
from awb_extractor import estimate_cost

pdf_bytes = Path("label.pdf").read_bytes()
cost = estimate_cost(pdf_bytes, optimize=True, dpi=200)

print(cost)
```

Example output:

```python
{
    "mode": "image/jpeg",
    "input_tokens": 800,
    "output_tokens": 150,
    "cost_usd": 0.00155,
    "awb_per_10_usd": 6451,
}
```
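
The figures in this example are related by simple arithmetic, sketched below. The per-million-token rates are illustrative assumptions chosen because they reproduce the `cost_usd` shown above; they are not taken from the library, and real pricing depends on the Claude model in use. `request_cost` and `labels_per_budget` are hypothetical helpers.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost in USD of one request, given per-million-token rates."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return round(total / 1_000_000, 6)


def labels_per_budget(cost_usd: float, budget_usd: float = 10.0) -> int:
    """Number of whole labels that fit into the budget."""
    return int(budget_usd / cost_usd)


# Assumed rates: 1 USD per million input tokens, 5 USD per million output tokens.
cost = request_cost(800, 150, usd_per_m_input=1.0, usd_per_m_output=5.0)
print(cost)                     # 0.00155
print(labels_per_budget(cost))  # 6451
```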

## Result Fields

`AWBResult` includes:

- `tracking_number`
- `recipient_name`
- `recipient_phone`
- `recipient_address`
- `recipient_ward`
- `recipient_district`
- `recipient_province`
- `sender_name`
- `sender_address`
- `cod`
- `weight`
- `order_id`
- `carrier`
- `platform`

Use `to_dict()` or `to_json()` to serialize the result.

Empty strings returned by Claude are normalized to `None`.
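
As a rough illustration of this behavior, the sketch below uses a hypothetical `LabelFields` dataclass with only three of the fields. It is not the library's `AWBResult` implementation; it only shows one way empty strings can be mapped to `None` and a dataclass serialized with the standard library.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class LabelFields:
    """Hypothetical stand-in for a result object with a few fields."""

    tracking_number: Optional[str] = None
    recipient_name: Optional[str] = None
    carrier: Optional[str] = None

    @classmethod
    def from_raw(cls, raw: Dict[str, Any]) -> "LabelFields":
        # Keep only known fields and turn empty strings into None.
        known = {f.name for f in fields(cls)}
        cleaned = {k: (None if v == "" else v) for k, v in raw.items() if k in known}
        return cls(**cleaned)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    def to_json(self) -> str:
        # ensure_ascii=False keeps Vietnamese diacritics readable in the output.
        return json.dumps(self.to_dict(), ensure_ascii=False)


result = LabelFields.from_raw(
    {"tracking_number": "NHSVC972103440", "recipient_name": "", "carrier": "GHN"}
)
print(result.to_dict())
# {'tracking_number': 'NHSVC972103440', 'recipient_name': None, 'carrier': 'GHN'}
```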

## Exceptions

- `APIKeyError`: missing API key
- `PDFDownloadError`: PDF URL download failed
- `ExtractionError`: Claude response could not be parsed as JSON
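
A common way to use these exceptions is to retry only the failures that may be transient, such as a download error, and let the others propagate. The sketch below is generic: `call_with_retry` is a hypothetical helper, and `FakeDownloadError` stands in for `PDFDownloadError` so that the example runs without the package or network access.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Call func, retrying only when one of the given exception types is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")


class FakeDownloadError(Exception):
    """Stand-in for PDFDownloadError in this self-contained example."""


calls = {"count": 0}


def flaky_download() -> str:
    # Fails twice, then succeeds, to simulate a transient download problem.
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeDownloadError("temporary failure")
    return "ok"


print(call_with_retry(flaky_download, retry_on=(FakeDownloadError,)))
# ok
```

With the real package, the same pattern would wrap a call such as `extractor.from_url(url)` and pass `PDFDownloadError` in `retry_on`. Whether `ExtractionError` is worth retrying is a judgment call, while a missing API key will not fix itself on a retry.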

## Package Structure

- `awb_extractor/extractor.py`: public `AWBExtractor` class
- `awb_extractor/models.py`: `AWBResult` dataclass
- `awb_extractor/exceptions.py`: package exceptions

## Development

Install dependencies and run tests:

```bash
python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/python -m pytest -q
```

## Publishing

GitHub Actions builds and publishes the package to PyPI on every push to `main`.

The repository must define this GitHub secret:

```text
PYPI_API_TOKEN
```

PyPI does not allow replacing an existing version. If a commit on `main` does not
bump `project.version` in `pyproject.toml`, the publish step skips the upload
because that distribution already exists, so no new release is published.
