Metadata-Version: 2.4
Name: po-extractor
Version: 0.2.0
Summary: Template-aware extractor for scanned Purchase Order / Sales Order PDFs and images (Tally / GST style). OCR + layout + saved YAML templates + optional Claude/Ollama.
Author-email: Lancer International <it@lancers.in>
License: MIT
Project-URL: Homepage, https://github.com/lancer-international/po-extractor
Project-URL: Documentation, https://github.com/lancer-international/po-extractor#readme
Project-URL: Issues, https://github.com/lancer-international/po-extractor/issues
Project-URL: Source, https://github.com/lancer-international/po-extractor
Keywords: ocr,purchase-order,sales-order,extraction,pdf,tally,gst,gstin,india,rapidocr,paddleocr,claude,ollama,document-ai
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business
Classifier: Topic :: Office/Business :: Financial :: Accounting
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Text Processing :: General
Classifier: Typing :: Typed
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic<3,>=2.6
Requires-Dist: pydantic-settings>=2.2
Requires-Dist: pypdfium2>=4.30
Requires-Dist: Pillow>=10.2
Requires-Dist: numpy<3,>=1.26
Requires-Dist: opencv-python-headless>=4.9
Requires-Dist: pyyaml>=6.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: dateparser>=1.2
Requires-Dist: rapidfuzz>=3.6
Requires-Dist: pdfplumber>=0.11
Requires-Dist: typer>=0.12
Requires-Dist: rich>=13.7
Provides-Extra: rapid
Requires-Dist: rapidocr-onnxruntime>=1.3; extra == "rapid"
Provides-Extra: paddle
Requires-Dist: paddleocr>=2.7; extra == "paddle"
Requires-Dist: paddlepaddle>=2.6; extra == "paddle"
Provides-Extra: llm
Requires-Dist: anthropic>=0.39; extra == "llm"
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: hypothesis; extra == "dev"
Requires-Dist: rapidocr-onnxruntime>=1.3; extra == "dev"
Dynamic: license-file

# po_extractor

Template-aware extractor for scanned Purchase Order PDFs and images.

`po_extractor` turns scanned POs (Tally-style forms, vendor sales orders) into clean, schema-aligned JSON. It does **not** train a model from scratch. It pairs production OCR with geometry-driven layout analysis, persistent YAML templates, deterministic field/table extractors, validators, and an optional Claude Opus 4.7 stage for narrowly scoped normalization.

Every emitted value carries evidence: the page, bounding box, raw OCR text, and confidence. New formats are remembered as YAML templates that grow via a correction loop.

## Highlights

- **OCR-pluggable** — `BaseOCREngine` abstraction. Default: `rapidocr-onnxruntime` (Windows-friendly, pure ONNX). Optional: PaddleOCR via `pip install po_extractor[paddle]`. Mock engine for tests.
- **Template registry** — YAML-defined anchors, label aliases, table-header aliases, field rules, validation rules. Match score is deterministic and inspectable.
- **Table reconstruction** — column inference from header bboxes, row clustering by y-gap, multi-line description coalescing, tax sub-row folding, multi-page table stitching.
- **Validation** — GSTIN regex+checksum, mobile, HSN length, dates (DMY default), Indian numerics (`1,23,456.78` and `₹` handled). Cross-row math: `qty * rate ≈ amount`, `sum(items) ≈ totals`.
- **LLM (optional)** — Claude Opus 4.7 (`claude-opus-4-7`) used for label mapping and template drafting. Every LLM-emitted value is grounded against OCR before it lands in the result.
- **Correction loop** — `apply-correction` writes confirmed aliases back into the matched template, so the next document of the same layout extracts cleanly without re-asking.
- **Pydantic v2 throughout** — every output is a typed model with stable JSON.
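
The validation rules above are deterministic and reimplementable outside the package. As a sketch (not the package's internal code), the standard GSTIN check-digit algorithm and Indian-format amount parsing look like this; `gstin_is_valid` and `parse_indian_amount` are illustrative names, not po_extractor APIs:

```python
import re

_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # base-36 code points used by GSTIN

def gstin_is_valid(gstin: str) -> bool:
    """Validate a 15-character GSTIN: structure regex plus the official check digit.

    Check digit: multiply each of the first 14 code points by alternating
    weights 1 and 2, add quotient and remainder of each product divided by 36,
    then (36 - total % 36) % 36 must match the final character.
    """
    if not re.fullmatch(r"\d{2}[A-Z]{5}\d{4}[A-Z][1-9A-Z]Z[0-9A-Z]", gstin):
        return False
    total = 0
    for i, ch in enumerate(gstin[:14]):
        product = _ALPHABET.index(ch) * (1 if i % 2 == 0 else 2)
        total += product // 36 + product % 36
    return _ALPHABET[(36 - total % 36) % 36] == gstin[14]

def parse_indian_amount(text: str) -> float:
    """Parse amounts like '₹ 1,23,456.78' (lakh/crore comma grouping)."""
    return float(text.replace("₹", "").replace(",", "").strip())

print(gstin_is_valid("27AAPFU0939F1ZV"))     # True (widely used sample GSTIN)
print(parse_indian_amount("₹ 1,23,456.78"))  # 123456.78
```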

## Install

Once published to PyPI:

```powershell
pip install po-extractor[rapid]              # default OCR backend
pip install po-extractor[rapid,llm]          # + Anthropic SDK for Claude Opus 4.7
pip install po-extractor[paddle]             # PaddleOCR (heavier; may need toolchain on Windows)
```

From source (development install):

```powershell
git clone <repo-url>
cd po-extractor
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .[dev,llm]
```

Build a wheel locally:

```powershell
pip install build
python -m build .                            # produces dist/po_extractor-0.2.0-py3-none-any.whl
```

## Quick start

### Python API (one-liner)

```python
from po_extractor import extract

result = extract("invoice.pdf")              # any path, bytes, or file-like
print(result["po_number"])                   # subscript access — returns the value
print(result["buyer_gstin"])
print(result.values_dict())                  # {"po_number": "...", "items": [...], ...}
print(result.to_json(indent=2))              # full JSON with values + rich form
```

Provenance (bbox, confidence, raw OCR text, label seen on page) when you need it:

```python
ev = result.evidence("buyer_gstin")
print(ev.value, ev.bbox, ev.confidence, ev.label_seen, ev.source)
```

Other input forms — bytes, file-like, str, Path:

```python
data = open("invoice.pdf", "rb").read()
result = extract(data)                       # bytes
result = extract(open("invoice.pdf", "rb"))  # file-like
import io; result = extract(io.BytesIO(data))
```

Need fine-grained control? Use the class:

```python
from po_extractor import POExtractor

extractor = POExtractor(
    ocr_engine_name="rapid",                 # "rapid" | "paddle" | "mock"
    use_llm=False,                           # disable LLM stages
    llm_only=False,                          # set True to skip templates entirely
)
result = extractor.extract("invoice.pdf")
```

### CLI

Every command takes a verb. Both `po-extract` (console script) and `python -m po_extractor` work:

```powershell
po-extract extract invoice.pdf --out result.json
po-extract extract invoice.pdf --llm-only            # skip templates, use Claude/Ollama
po-extract list-templates
po-extract match invoice.pdf                         # show match score breakdown
po-extract apply-correction --result result.json --correction corrections.json
po-extract learn-template invoice.pdf --format-name "Acme Sales Order"
po-extract validate result.json                      # re-run validators on existing JSON

python -m po_extractor extract invoice.pdf           # equivalent to po-extract extract
po-extract --help                                    # full verb list
```

## Configuration

Settings come from environment variables (see `.env.example`) or programmatic `Settings` overrides:

| Var | Default | Meaning |
|---|---|---|
| `PO_EXTRACTOR_OCR_ENGINE` | `rapid` | `rapid`, `paddle`, or `mock` |
| `PO_EXTRACTOR_DPI` | `300` | PDF rasterization DPI |
| `PO_EXTRACTOR_LLM_PROVIDER` | `auto` | `auto` / `claude` / `ollama` / `none` |
| `ANTHROPIC_API_KEY` | (unset) | Used when provider is `claude` (or `auto` if Ollama isn't reachable) |
| `PO_EXTRACTOR_OLLAMA_HOST` | `http://localhost:11434` | Ollama server URL |
| `PO_EXTRACTOR_OLLAMA_MODEL` | `qwen2.5:7b-instruct` | Local model name |
| `PO_EXTRACTOR_REQUIRE_LLM` | `false` | Hard-fail if LLM unavailable |
| `PO_EXTRACTOR_ALLOW_DRAFTS` | `false` | Load draft templates from `store/drafts/` |
| `PO_EXTRACTOR_LOG_LEVEL` | `INFO` | Logger level |

### Using a local LLM via Ollama

Both `learn-template` and the optional in-pipeline label-mapping stage will pick up Ollama automatically when no Anthropic key is set.

```powershell
# 1) Install Ollama: https://ollama.com/download
# 2) Pull a model good enough for label mapping / template drafting:
ollama pull qwen2.5:7b-instruct          # recommended default (~4.7 GB)
# alternatives: llama3.1:8b, mistral:7b-instruct, qwen2.5:14b-instruct
# 3) Make sure the server is running:
ollama serve                              # usually runs as a service already

# 4) Use po_extractor exactly as you would with Claude:
po-extract learn-template "samples/SO-0005-2026.pdf" --format-name "vendor_xyz_so"
```

To force one provider regardless of detection:
```powershell
$env:PO_EXTRACTOR_LLM_PROVIDER = "ollama"      # or "claude" / "none"
```

## Adding a template

Two routes:

1. **By hand** — copy `po_extractor/templates/store/_default.yaml`, edit, drop in `store/`.
2. **Auto-draft from a sample** — `po-extract learn-template data/new_vendor.pdf --format-name "new_vendor_po"`. Requires an LLM provider: either `ANTHROPIC_API_KEY` or a reachable Ollama server (see the Ollama section above). Writes a draft into `store/drafts/`.

Templates carry: `format_id`, `format_name`, `anchors[]`, `label_aliases{}`, `table_headers[]`, `field_rules[]`, `validation_rules[]`. See [tstanes_po_v1.yaml](po_extractor/templates/store/tstanes_po_v1.yaml) for a fully-worked example.
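
Purely as an illustrative sketch of that shape (the top-level keys come from the list above; the per-key structure shown here is a guess, and the authoritative schema is whatever ships in `store/`):

```yaml
format_id: acme_po_v1
format_name: "Acme Purchase Order"
anchors:
  - "PURCHASE ORDER"        # text expected somewhere on the page
label_aliases:
  po_number: ["PO No.", "Order No."]
table_headers:
  - ["Sl", "Description of Goods", "HSN/SAC", "Qty", "Rate", "Amount"]
field_rules: []
validation_rules: []
```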

## Calibrating a template against a real document

The starter templates ship with `draft: true` because they were derived from anchor lists, not real samples. To calibrate:

```powershell
# 1) Extract — likely produces some warnings or missing fields
po-extract extract data\real_tstanes.pdf --out result.json --allow-drafts

# 2) Hand-write a corrections.json with the right values + the labels you saw on the document
# (see docs/corrections-format.md for the schema)

# 3) Apply the correction — adds new aliases / region hints to the matched template
po-extract apply-correction --result result.json --correction corrections.json

# 4) Re-extract — the template now knows the new aliases
po-extract extract data\real_tstanes.pdf --out result2.json --allow-drafts
```
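
The authoritative corrections schema lives in `docs/corrections-format.md`; purely to give the flavor (every key below is illustrative, not the actual format), a correction typically pairs the right value with the label actually printed on the page:

```json
{
  "po_number": { "value": "PO/2026/0042", "label_seen": "Order Ref." }
}
```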

Once a draft has at least three confirmed aliases per required field via corrections, it is automatically promoted (`draft: false`) so it participates in normal matching.

## Output schema

Every extraction produces an `ExtractionResult` (Pydantic model). Top-level shape:

```json
{
  "document_type": "purchase_order",
  "source_file": "...",
  "page_count": 1,
  "detected_format_id": "tstanes_po_v1",
  "extraction_status": "success | needs_review | needs_template_review",
  "confidence": 0.0,
  "header": { "po_number": { "value": "...", "raw_value": "...", "label_seen": "...", "page": 1, "bbox": [...], "confidence": 0.0 }, ... },
  "parties": { ... },
  "items": [ { "row_index": 0, "cells": { ... }, "taxes": { ... } }, ... ],
  "terms": { ... },
  "totals": { ... },
  "handwritten_notes": [],
  "unmapped_text": [],
  "validation": { "status": "passed | warning | failed", "issues": [] },
  "raw_ocr": { "pages": [ ... ] },
  "diagnostics": { ... }
}
```
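
Downstream code only needs the shape above to triage results. A minimal stdlib-only sketch (the payloads and the `needs_human_review` helper are illustrative, not part of the package):

```python
import json

def needs_human_review(result: dict) -> bool:
    """Route anything that is not a clean pass to a reviewer queue."""
    if result.get("extraction_status") != "success":
        return True
    return result.get("validation", {}).get("status") == "failed"

# Illustrative payloads mirroring the shape above, not real extractor output.
flagged = {"extraction_status": "needs_review",
           "validation": {"status": "warning", "issues": []}}
clean = json.loads('{"extraction_status": "success",'
                   ' "validation": {"status": "passed", "issues": []}}')

print(needs_human_review(flagged))  # True
print(needs_human_review(clean))    # False
```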

## Testing

```powershell
pytest -q
```

Tests run with the `MockOCREngine` reading canned JSON fixtures under `data/fixtures/`. No real OCR install or sample PDFs required for CI.

## License

MIT.
