v0.6.0 | 11 providers, typed errors

Structured Extraction

Typed output from any media.

Extract structured data from documents, images, audio, and video using LLMs and Pydantic schemas.

extract.py
from openextract import extract

# Invoice is a Pydantic BaseModel subclass you define elsewhere.
result = extract(
    schema=Invoice,
    model="openai:gpt-5",
    input_file="https://example.com/doc.pdf",
    instructions="Extract invoice data",
)

Zero Config

One function call. Bring a schema, a URL, and an LLM model string.

Multi-Media

Documents, images, audio, and video with smart routing.

Any LLM

11 providers wired in: OpenAI, Anthropic, Google, AWS Bedrock, xAI, Cohere, Hugging Face, Groq, Mistral, OpenRouter, and Ollama.

Type Safe

Every result is validated against your Pydantic schema; output that does not fit raises SchemaValidationError.

What’s new

v0.6.0

v0.6.0 · Providers

11 model providers, one model string

v0.6.0 wires in every provider that pydantic-ai ships. Each provider's native error type is wrapped in ModelError automatically, so no provider-specific except blocks are needed.

openai anthropic google bedrock xai cohere huggingface groq cerebras mistral openrouter outlines ollama

Carried forward from v0.5.0

Concurrency

Async & batch

extract_async, extract_many, and extract_many_async support concurrent runs; max_concurrency sets the explicit limit.

results = extract_many(
    schema=Invoice,
    model="openai:gpt-5",
    input_files=paths,
    max_concurrency=5,
)
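
One common way to enforce an explicit concurrency limit is a semaphore around each call. The sketch below illustrates that idea only; it makes no claim about how openextract implements max_concurrency, and `fake_extract` and `run_bounded` are hypothetical stand-ins, not library functions.

```python
import asyncio


async def fake_extract(path: str) -> str:
    # Hypothetical stand-in for one extraction call (not the library's API).
    await asyncio.sleep(0.01)
    return f"result:{path}"


async def run_bounded(paths: list[str], max_concurrency: int) -> list[str]:
    # The semaphore caps how many calls are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(path: str) -> str:
        async with semaphore:
            return await fake_extract(path)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in paths))


results = asyncio.run(run_bounded(["a.pdf", "b.pdf", "c.pdf"], max_concurrency=2))
```

With a limit of 2, the third call starts only after one of the first two finishes, yet the results still come back in input order.
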
Inputs

Bytes & file-like

Pass raw bytes or any BinaryIO. No temp file dance.

extract(
    schema=Invoice,
    model="openai:gpt-5",
    input_file=pdf_bytes,
    media_type="application/pdf",
)
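
Accepting bytes and file-like objects usually comes down to normalizing every input kind to raw bytes before sending it on. A minimal sketch, assuming a hypothetical helper named `to_bytes`; openextract's internals may differ.

```python
import io
from pathlib import Path
from typing import BinaryIO, Union


def to_bytes(source: Union[bytes, BinaryIO, Path]) -> bytes:
    # Hypothetical helper: collapse the accepted input kinds into raw bytes.
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, Path):
        return source.read_bytes()
    if hasattr(source, "read"):
        # Any binary file-like object: open files, BytesIO, upload streams.
        return source.read()
    raise TypeError(f"unsupported input type: {type(source).__name__}")


payload = to_bytes(io.BytesIO(b"%PDF-1.7"))
```

Raw bytes carry no file name or extension, which is presumably why the call above passes media_type explicitly.
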
Reliability

Retry with backoff

Opt-in retries on ModelError with jittered exponential backoff.

extract(
    schema=Invoice,
    model="openai:gpt-5",
    input_file=path,
    max_retries=3,
)
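
For readers unfamiliar with the term, jittered exponential backoff can be sketched as follows. This is an illustration under assumptions, not openextract's implementation: the `ModelError` class is a local stand-in, and `retry_with_backoff`, `base_delay`, and `max_delay` are hypothetical names. The jitter variant shown is "full jitter", a random wait between zero and the exponential ceiling.

```python
import random
import time


class ModelError(Exception):
    """Local stand-in for the library's ModelError (assumption)."""


def retry_with_backoff(call, max_retries=3, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    # One initial attempt, then up to max_retries further attempts.
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ModelError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Exponential ceiling: base_delay * 2**attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random duration in [0, ceiling].
            sleep(random.uniform(0, ceiling))
```

Randomizing the wait spreads out retries from many clients so they do not all hit the provider again at the same instant.
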
Observability

Token usage

extract_with_usage returns input, output, and total token counts.

output, usage = extract_with_usage(
    schema=Invoice,
    model="openai:gpt-5",
    input_file=path,
)
print(usage.total_tokens)
CLI

Shell-friendly

Drive extractions from the shell with structured exit codes.

$ openextract doc.pdf \
    --schema mypkg:Invoice \
    --model openai:gpt-5
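
The `--schema mypkg:Invoice` argument looks like a `module:attribute` reference. A minimal sketch of how such a string can be resolved with the standard library, assuming that convention; `load_object` is a hypothetical helper, not part of the openextract CLI.

```python
import importlib


def load_object(spec: str):
    # Split "package.module:Attribute" into its two halves.
    module_name, sep, attr_name = spec.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    # Import the module, then look the attribute up on it.
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Demonstrated with a standard-library class so the sketch runs anywhere.
cls = load_object("collections:OrderedDict")
```
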
Errors

Typed classifier

Provider exceptions (openai.APIError, Google, pydantic-ai) are mapped to ModelError by exception type, not by substring matching on error messages.

try:
    extract(...)
except ModelError:
    ...  # provider issue
except SchemaValidationError:
    ...  # bad shape
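
The difference between classifying by type and by substring can be sketched like this. All class names here are local, hypothetical stand-ins (including `ModelError` and the fake provider exceptions); the real mapping inside openextract is not shown in this page.

```python
class ModelError(Exception):
    """Local stand-in for the library's ModelError (assumption)."""


class FakeProviderAPIError(Exception):
    """Hypothetical provider SDK error, standing in for e.g. openai.APIError."""


class FakeRateLimitError(FakeProviderAPIError):
    """Hypothetical subclass of the provider's base error."""


# Exception types treated as provider failures (illustrative only).
PROVIDER_ERROR_TYPES = (FakeProviderAPIError,)


def classify(exc: Exception) -> Exception:
    # isinstance matches the listed classes and their subclasses, so a
    # reworded error message cannot change the classification.
    if isinstance(exc, PROVIDER_ERROR_TYPES):
        wrapped = ModelError(str(exc))
        wrapped.__cause__ = exc  # keep the original error for debugging
        return wrapped
    return exc
```

A substring check would misfire on an unrelated error whose message merely contains "APIError"; the type check does not.
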

How it works

Schema in, typed data out

Define a BaseModel, call extract(), get validated output.

1. Define a schema

Describe the shape you want with a Pydantic model.

2. Point at any media

Documents, images, audio, or video, passed as a URL, a local path, raw bytes, or a file-like object.

3. Get typed output

Validated against your schema. No parsing, no regex.

extract.py
from pydantic import BaseModel
from openextract import extract

class Report(BaseModel):
    title: str
    findings: list[str]
    severity: int

result = extract(
    schema=Report,
    model="openai:gpt-5",
    input_file="https://example.com/report.pdf",
    instructions="Extract findings",
)

Works with any media

Documents Images Audio Video

PDF, DOCX, PNG, JPG, MP3, MP4, and 20+ formats
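
Routing by media kind could, for example, start from the file extension. The sketch below uses the standard library's mimetypes module to sort a file name into one of the four kinds listed above. It is an assumption about one possible approach; `route` is a hypothetical function and openextract's actual routing logic is not described on this page.

```python
import mimetypes


def route(filename: str) -> str:
    # Guess the MIME type from the file extension only (no content sniffing).
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        raise ValueError(f"cannot determine media type for {filename!r}")
    top_level = mime.split("/", 1)[0]
    if top_level in ("image", "audio", "video"):
        return top_level
    # Everything else (application/pdf, text/*, ...) is treated as a document.
    return "document"


kinds = [route(name) for name in ("invoice.pdf", "scan.png", "call.mp3", "clip.mp4")]
```
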