Metadata-Version: 2.4
Name: ocrcontext
Version: 0.1.0
Summary: Decoupled, LLM-agnostic document OCR + structured extraction. Vision and LLM parsing in 3 lines of code.
Project-URL: Homepage, https://github.com/bahadirkarsli/ocrcontext
Project-URL: Repository, https://github.com/bahadirkarsli/ocrcontext
Project-URL: Issues, https://github.com/bahadirkarsli/ocrcontext/issues
Project-URL: Changelog, https://github.com/bahadirkarsli/ocrcontext/blob/main/CHANGELOG.md
Author-email: Bahadır Karslı <bahadrkrsl@outlook.com>
Maintainer-email: Bahadır Karslı <bahadrkrsl@outlook.com>
License: MIT
License-File: LICENSE
Keywords: document-ai,langchain,ocr,paddleocr,pdf,structured-extraction
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: langchain-core>=0.3
Requires-Dist: numpy>=1.24
Requires-Dist: pillow>=9.0
Requires-Dist: pydantic>=2.5
Requires-Dist: pymupdf>=1.23
Provides-Extra: all
Requires-Dist: accelerate>=0.27; extra == 'all'
Requires-Dist: google-cloud-vision>=3.8.1; extra == 'all'
Requires-Dist: opencv-python-headless>=4.8; extra == 'all'
Requires-Dist: paddleocr>=2.7.0.3; extra == 'all'
Requires-Dist: paddlepaddle>=2.6; extra == 'all'
Requires-Dist: sentencepiece>=0.1.99; extra == 'all'
Requires-Dist: torch>=2.1; extra == 'all'
Requires-Dist: torchvision>=0.16; extra == 'all'
Requires-Dist: transformers>=4.40; extra == 'all'
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Requires-Dist: twine>=5.0; extra == 'dev'
Provides-Extra: paddle
Requires-Dist: opencv-python-headless>=4.8; extra == 'paddle'
Requires-Dist: paddleocr>=2.7.0.3; extra == 'paddle'
Requires-Dist: paddlepaddle>=2.6; extra == 'paddle'
Provides-Extra: trocr
Requires-Dist: accelerate>=0.27; extra == 'trocr'
Requires-Dist: opencv-python-headless>=4.8; extra == 'trocr'
Requires-Dist: sentencepiece>=0.1.99; extra == 'trocr'
Requires-Dist: torch>=2.1; extra == 'trocr'
Requires-Dist: torchvision>=0.16; extra == 'trocr'
Requires-Dist: transformers>=4.40; extra == 'trocr'
Provides-Extra: vision
Requires-Dist: google-cloud-vision>=3.8.1; extra == 'vision'
Requires-Dist: opencv-python-headless>=4.8; extra == 'vision'
Description-Content-Type: text/markdown

# ocrcontext

**Decoupled, LLM-agnostic document OCR + structured extraction.** Turn a PDF or
image into clean text — or a typed Pydantic model — in three lines.

`ocrcontext` is the extraction core of a document-analysis platform, lifted out
of its web stack into a pure, pip-installable library. No FastAPI, no servers,
no hardcoded model providers.

```python
from ocrcontext import Analyzer

result = Analyzer().analyze("invoice.pdf")
print(result.text)
```

## Why

- **3-line DX** — instantiate, pass a file, get a result.
- **LLM-agnostic** — inject any LangChain chat model (OpenAI, Anthropic, Ollama,
  local). Only `langchain-core` is required; you bring the provider.
- **Resource-efficient** — heavy OCR models (PaddleOCR, TrOCR) load lazily and
  are cached as process-wide singletons, so they never reload per call.
- **Lightweight base install** — engines are opt-in extras.

## Install

```bash
pip install ocrcontext              # core only (PDF text layer + the API surface)
pip install 'ocrcontext[paddle]'    # printed text + scanned PDFs (PaddleOCR)
pip install 'ocrcontext[trocr]'     # handwriting fallback (Microsoft TrOCR)
pip install 'ocrcontext[vision]'    # handwriting primary (Google Cloud Vision)
pip install 'ocrcontext[all]'       # everything
```

Pick an LLM provider for refinement / extraction:

```bash
pip install langchain-openai        # or langchain-anthropic, langchain-ollama, ...
```

## Usage

### Raw OCR (no LLM, no API key)

```python
from ocrcontext import Analyzer

result = Analyzer().analyze("scan.png")
print(result.text, result.confidence, result.pages, result.text_source)
```

### LLM-refined OCR

Refinement fixes OCR errors **without** paraphrasing, translating, or inventing
text. Emails/URLs/IBANs are frozen so the model can't "correct" them, and output
that drifts too far from the source is rejected in favour of the raw text.

```python
from langchain_openai import ChatOpenAI
from ocrcontext import Analyzer

analyzer = Analyzer(llm=ChatOpenAI(model="gpt-4o"), lang="tr")
result = analyzer.analyze("handwritten_note.jpg", handwriting=True)
print(result.text)          # refined
print(result.raw_text)      # original OCR, kept alongside
```
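
To make the "frozen literals" and "drift guard" ideas concrete, here is a rough sketch under stated assumptions. The regex patterns, the `[[LITn]]` placeholder format, the `difflib` similarity measure, and the `0.8` threshold are all illustrative choices, not the rules `ocrcontext` actually uses; `refine` stands for any callable that sends text to a model and returns the corrected text.

```python
import re
from difflib import SequenceMatcher

# Assumed patterns, for illustration only.
LITERAL_RE = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"        # email address
    r"|https?://\S+"                       # URL
    r"|\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"   # IBAN written without spaces
)


def freeze(text):
    """Replace each literal with a placeholder and return both."""
    literals = []

    def _swap(match):
        literals.append(match.group(0))
        return f"[[LIT{len(literals) - 1}]]"

    return LITERAL_RE.sub(_swap, text), literals


def thaw(text, literals):
    """Put the original literals back in place of their placeholders."""
    for i, literal in enumerate(literals):
        text = text.replace(f"[[LIT{i}]]", literal)
    return text


def guarded_refine(raw, refine, min_similarity=0.8):
    """Refine `raw`, falling back to it if a literal is lost or the text drifts."""
    masked, literals = freeze(raw)
    candidate = thaw(refine(masked), literals)
    if any(literal not in candidate for literal in literals):
        return raw  # the model dropped or altered a frozen literal
    if SequenceMatcher(None, raw, candidate).ratio() < min_similarity:
        return raw  # output drifted too far from the OCR text
    return candidate
```

Because the model only ever sees placeholders, it has no opportunity to "fix" an email address or IBAN, and the similarity check is a cheap way to catch paraphrasing.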

### Structured extraction

```python
from langchain_openai import ChatOpenAI
from ocrcontext import Analyzer
from ocrcontext.schemas import Invoice

analyzer = Analyzer(llm=ChatOpenAI(model="gpt-4o-mini", temperature=0))
invoice = analyzer.extract("invoice.pdf", schema=Invoice)   # -> Invoice instance
print(invoice.total_amount, invoice.currency)
```

Define your own schema with plain Pydantic:

```python
from pydantic import BaseModel, Field

class Receipt(BaseModel):
    merchant: str | None = Field(None, description="Store name")
    total: float | None = Field(None, description="Grand total")

# Reuses the `analyzer` created in the previous example.
receipt = analyzer.extract("receipt.jpg", schema=Receipt)
```

### Same code, local model (no API key)

```python
from langchain_ollama import ChatOllama
from ocrcontext import Analyzer

analyzer = Analyzer(llm=ChatOllama(model="llama3.1"))
print(analyzer.analyze("scan.png").text)
```

## How it routes a document

1. **Digital PDF** → embedded text-layer extraction (exact text; LLM refine is
   skipped so identifiers aren't altered).
2. **Image / scanned PDF** → PaddleOCR with preprocessing (deskew, denoise,
   CLAHE), multi-language *coverage-first* selection, and a line-band recovery
   fallback.
3. **Handwriting** (`handwriting=True`, or auto when printed OCR yields too
   little text) → Google Vision primary, TrOCR fallback.
4. **Optional LLM refine** → fidelity-first, literal-preserved, drift-guarded.
5. **Optional `extract(schema=...)`** → typed Pydantic model.
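
The routing order above could be sketched as a small decision function. This is an illustration only: `Route`, `choose_route`, the character-count inputs, and the `min_chars` threshold are hypothetical, and the precedence between a usable text layer and an explicit `handwriting=True` is an assumption that simply follows the order of the list.

```python
from enum import Enum
from typing import Optional


class Route(str, Enum):
    PDF_TEXT_LAYER = "pdf_text_layer"
    PRINTED_OCR = "printed_ocr"
    HANDWRITING_OCR = "handwriting_ocr"


def choose_route(
    *,
    is_pdf: bool,
    text_layer_chars: int = 0,
    handwriting: bool = False,
    printed_ocr_chars: Optional[int] = None,
    min_chars: int = 20,  # assumed threshold for "too little text"
) -> Route:
    """Pick an extraction route from a few facts about the document."""
    # 1. Digital PDF: a usable embedded text layer wins.
    if is_pdf and text_layer_chars >= min_chars:
        return Route.PDF_TEXT_LAYER
    # 3a. The caller explicitly asked for handwriting OCR.
    if handwriting:
        return Route.HANDWRITING_OCR
    # 3b. Printed OCR already ran and found too little text: fall back.
    if printed_ocr_chars is not None and printed_ocr_chars < min_chars:
        return Route.HANDWRITING_OCR
    # 2. Default for images and scanned PDFs.
    return Route.PRINTED_OCR
```

For example, `choose_route(is_pdf=True, text_layer_chars=500)` selects the text-layer route, while a scanned PDF with an empty text layer falls through to printed OCR.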

## Refinement modes

`RefinementMode`: `conservative` (scans), `layout` (digital PDFs),
`handwriting_prose`, `handwriting_layout`. For handwriting, the choice between
the two handwriting modes is made automatically, based on whether the text looks
like a DIKW/pyramid diagram. Modes and prompts are ported verbatim from the
production pipeline.
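
As a rough illustration of how a mode might be chosen, the sketch below mirrors the four mode names in a local enum. Everything here is hypothetical: the enum is a stand-in rather than the library's own `RefinementMode`, `pick_mode` and its `source` labels are invented for the example, and mapping diagram-like handwriting to `handwriting_layout` is an assumption. The diagram detection itself is not shown and is passed in as a boolean.

```python
from enum import Enum


class RefinementMode(str, Enum):
    """Local stand-in that mirrors the mode names listed above."""

    CONSERVATIVE = "conservative"
    LAYOUT = "layout"
    HANDWRITING_PROSE = "handwriting_prose"
    HANDWRITING_LAYOUT = "handwriting_layout"


def pick_mode(*, source: str, looks_like_diagram: bool = False) -> RefinementMode:
    """Map a text source ('scan', 'digital_pdf', 'handwriting') to a mode."""
    if source == "scan":
        return RefinementMode.CONSERVATIVE
    if source == "digital_pdf":
        return RefinementMode.LAYOUT
    if source == "handwriting":
        # Assumption: diagram-like handwriting keeps its layout, prose does not.
        if looks_like_diagram:
            return RefinementMode.HANDWRITING_LAYOUT
        return RefinementMode.HANDWRITING_PROSE
    raise ValueError(f"unknown source: {source!r}")
```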

## Configuration

```python
from ocrcontext import Analyzer, AnalyzerConfig

cfg = AnalyzerConfig(
    lang="tr",
    prefer_pdf_text_layer=True,
    auto_handwriting_fallback=True,
)
analyzer = Analyzer(llm=..., config=cfg)  # replace ... with any LangChain chat model
```

## Development

```bash
pip install -e '.[dev]'
pytest            # runs without GPU/network — engines and LLM are faked
```

## License

MIT
