Structured Extraction
Zero Config
One function call. Bring a schema, a URL, and a model string.
Multi-Media
Documents, images, audio, and video with smart routing.
Any LLM
11 providers wired in: OpenAI, Anthropic, Google, AWS Bedrock, xAI, Cohere, Hugging Face, Groq, Mistral, OpenRouter, and Ollama.
Type Safe
Pydantic schemas ensure validated, typed output every time.
What’s new
v0.6.0
11 model providers, one model string
v0.6.0 wires in every provider that pydantic-ai ships. Each provider’s native error type is wrapped into ModelError automatically, so no provider-specific except blocks are needed.
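The model strings on this page, such as `openai:gpt-5`, follow a `provider:model` shape. As a rough illustration of how such a string breaks down, here is a minimal sketch; `split_model_string` is a hypothetical helper written for this example, not a function from openextract or pydantic-ai.

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split 'provider:model' into its two parts (hypothetical helper)."""
    # partition splits on the first colon only, so a model name that itself
    # contains a colon stays intact in the second part.
    provider, sep, name = model.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider:model', got {model!r}")
    return provider, name


print(split_model_string("openai:gpt-5"))  # ('openai', 'gpt-5')
```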
Carried forward from v0.5.0
Async & batch
extract_async, extract_many, and extract_many_async for concurrent runs with an explicit concurrency limit.
results = extract_many(
    schema=Invoice,
    model="openai:gpt-5",
    input_files=paths,
    max_concurrency=5,
)
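The idea behind an explicit concurrency limit can be sketched with a semaphore. This is not openextract's implementation: `fake_extract` is a stand-in coroutine and `run_bounded` is a hypothetical helper, shown only to illustrate what a cap like `max_concurrency=5` means in practice.

```python
import asyncio


async def fake_extract(path: str) -> str:
    """Stand-in for one extraction call (hypothetical, does no real work)."""
    await asyncio.sleep(0.01)
    return f"parsed:{path}"


async def run_bounded(paths: list[str], max_concurrency: int) -> list[str]:
    """Run fake_extract over paths with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> str:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await fake_extract(path)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(p) for p in paths))


results = asyncio.run(run_bounded(["a.pdf", "b.pdf", "c.pdf"], max_concurrency=2))
print(results)
```

With a limit of 2, the third task starts only after one of the first two finishes, yet the results still come back in input order.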
Bytes & file-like
Pass raw bytes or any BinaryIO. No temp file dance.
extract(
    schema=Invoice,
    model="openai:gpt-5",
    input_file=pdf_bytes,
    media_type="application/pdf",
)
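One common way to accept either raw bytes or a binary file object is to normalize both to bytes up front. The sketch below illustrates that general pattern under stated assumptions; `to_bytes` is a hypothetical helper and says nothing about how openextract handles input internally.

```python
import io
from typing import BinaryIO, Union


def to_bytes(source: Union[bytes, bytearray, BinaryIO]) -> bytes:
    """Return the payload as bytes, whether given bytes or a binary stream."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    # Anything else is treated as a file-like object opened in binary mode.
    data = source.read()
    if not isinstance(data, bytes):
        raise TypeError("file-like input must be opened in binary mode")
    return data


pdf_bytes = b"%PDF-1.7 example"
# Both input styles end up as the same payload, with no temp file involved.
print(to_bytes(pdf_bytes) == to_bytes(io.BytesIO(pdf_bytes)))  # True
```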
Retry with backoff
Opt-in retries on ModelError with jittered exponential backoff.
extract(
    schema=Invoice,
    model="openai:gpt-5",
    input_file=path,
    max_retries=3,
)
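Jittered exponential backoff can be sketched as follows. Everything here is illustrative: `ModelError` is redefined locally as a stand-in, and `retry_with_backoff`, `base_delay`, and the full-jitter formula are assumptions rather than openextract's actual behavior.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ModelError(Exception):
    """Local stand-in for the library's ModelError."""


def retry_with_backoff(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying on ModelError up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ModelError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            # Full jitter: sleep a random amount up to base_delay * 2**attempt.
            sleep(random.uniform(0, base_delay * 2**attempt))
    raise AssertionError("unreachable")


attempts = 0


def flaky() -> str:
    """Fail twice with ModelError, then succeed."""
    global attempts
    attempts += 1
    if attempts < 3:
        raise ModelError("transient provider failure")
    return "ok"


# The sleep function is injected so the demo does not actually wait.
print(retry_with_backoff(flaky, sleep=lambda s: None))  # prints "ok"
```

The random delay spreads retries out, so many clients that fail at the same moment do not all retry at the same moment.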
Token usage
extract_with_usage returns input, output, and total token counts.
output, usage = extract_with_usage(
    schema=Invoice,
    model="openai:gpt-5",
    input_file=path,
)
print(usage.total_tokens)
Shell-friendly
Drive extractions from the shell with structured exit codes.
$ openextract doc.pdf \
    --schema mypkg:Invoice \
    --model openai:gpt-5
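The `--schema mypkg:Invoice` argument reads like a `module:attribute` reference. A resolver for that shape could look like the sketch below. `resolve_spec` is a hypothetical helper, and how openextract actually loads the schema is not confirmed here. The demo resolves a standard-library class so it runs without `mypkg`.

```python
import importlib


def resolve_spec(spec: str) -> object:
    """Import 'module:attribute' and return the attribute (hypothetical)."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in for something like 'mypkg:Invoice'.
cls = resolve_spec("collections:OrderedDict")
print(cls.__name__)  # OrderedDict
```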
Typed classifier
Provider errors (openai.APIError, Google, and pydantic-ai exceptions) are mapped to ModelError by exception type, not by matching substrings in the error message.
try:
    extract(...)
except ModelError:
    ...  # provider issue
except SchemaValidationError:
    ...  # bad shape
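Classifying by type rather than by substring means checking `isinstance` instead of inspecting the message text. In this sketch, `FakeProviderError`, `classify`, `PROVIDER_ERRORS`, and the local `ModelError` are all stand-ins invented for illustration, not the library's classes.

```python
class ModelError(Exception):
    """Local stand-in for the library's ModelError."""


class FakeProviderError(Exception):
    """Stand-in for a provider SDK exception such as openai.APIError."""


# Exception types treated as provider failures in this sketch.
PROVIDER_ERRORS = (FakeProviderError,)


def classify(exc: Exception) -> Exception:
    """Wrap known provider exceptions in ModelError; pass others through."""
    if isinstance(exc, PROVIDER_ERRORS):
        wrapped = ModelError(str(exc))
        wrapped.__cause__ = exc  # keep the original for debugging
        return wrapped
    return exc


# The type decides the outcome, not the wording of the message.
print(type(classify(FakeProviderError("rate limit"))).__name__)  # ModelError
print(type(classify(ValueError("rate limit"))).__name__)  # ValueError
```

A substring check would treat both errors above the same way, since their messages are identical. The type check does not.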
How it works
Schema in, typed data out
Define a BaseModel, call extract(), get validated output.
Define a schema
Describe the shape you want with a Pydantic model.
Point at any media
Documents, images, audio, or video via URL.
Get typed output
Validated against your schema. No parsing, no regex.
from pydantic import BaseModel
from openextract import extract

class Report(BaseModel):
    title: str
    findings: list[str]
    severity: int

result = extract(
    schema=Report,
    model="openai:gpt-5",
    input_file="https://example.com/report.pdf",
    instructions="Extract findings",
)
Works with any media
PDF, DOCX, PNG, JPG, MP3, MP4, and 20+ formats