OpenXtract
v0.2.0 Now Available

Turn documents into structured data.

An open-source framework to extract clean, typed data from documents, images, audio, and video with minimal setup.

$ uv add open-xtract
main.py
from pydantic import BaseModel
from open_xtract import extract

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total: float

result = extract(
    schema=Invoice,
    model="anthropic:claude-sonnet-4-5",
    url="https://example.com/invoice.pdf",
    instructions="Extract invoice details",
)
# Returns typed Invoice instance

Multi-Media Support

Extract from documents, images, audio, and video. Smart detection routes to the right handler automatically.
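
To give a feel for what routing by media type can look like, here is a minimal sketch that picks a handler from a URL's file extension using only the standard library. It is an illustration under assumptions, not OpenXtract's actual detection logic: the handler names and the `detect_handler` function are hypothetical.

```python
import mimetypes
from urllib.parse import urlparse

# Hypothetical handler names, keyed by the major MIME type.
# These are illustrative and not part of the OpenXtract API.
HANDLERS = {
    "application": "document",
    "text": "document",
    "image": "image",
    "audio": "audio",
    "video": "video",
}

def detect_handler(url: str) -> str:
    """Pick a handler name based on the file extension in the URL path."""
    path = urlparse(url).path
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        # Unknown or missing extension: fall back to the document handler.
        return "document"
    major = mime.split("/", 1)[0]
    return HANDLERS.get(major, "document")

print(detect_handler("https://example.com/invoice.pdf"))
print(detect_handler("https://example.com/receipt.jpg"))
```

A real implementation would likely also inspect the response's Content-Type header, since many URLs carry no useful extension.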

Any LLM Provider

Built on pydantic-ai. Use OpenAI, Anthropic, Google, or any compatible provider with a single model string.
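
The model strings used on this page follow a `provider:model` shape, for example `anthropic:claude-sonnet-4-5`. The sketch below shows how such a string splits into its two parts. The `split_model_string` helper is hypothetical and only illustrates the format; it is not how pydantic-ai or OpenXtract resolve providers internally.

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split 'provider:model' into (provider, model_name).

    Hypothetical helper for illustration only.
    """
    provider, sep, name = model.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider:model', got {model!r}")
    return provider, name

provider, name = split_model_string("anthropic:claude-sonnet-4-5")
print(provider)
print(name)
```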

Type Safe

Leverage Pydantic schemas to ensure extracted data is validated, typed, and clean every time.
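
As a rough illustration of what schema validation buys you, the sketch below runs raw dictionaries through the `Invoice` schema from the example above using plain Pydantic, with no LLM involved. The `try_parse` helper and the sample values are made up for this example.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total: float

def try_parse(data: dict) -> Optional[Invoice]:
    """Return a validated Invoice, or None if the data does not fit the schema."""
    try:
        return Invoice(**data)
    except ValidationError:
        return None

# A numeric string is coerced to float by Pydantic's default (lax) validation.
good = try_parse({"invoice_number": "INV-001", "date": "2025-01-15", "total": "129.50"})

# A value that cannot be read as a number is rejected.
bad = try_parse({"invoice_number": "INV-002", "date": "2025-01-16", "total": "n/a"})

print(good)
print(bad)
```

Whatever reaches your code is either a fully typed `Invoice` or an explicit failure, never a half-parsed dictionary.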

Built for developers.

Stop writing regex. Just define your schema and let the LLM do the heavy lifting.

  • Works with documents, images, audio, and video
  • Structured error handling with typed exceptions
  • Optional logfire instrumentation for tracing
receipt_parser.py
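from pydantic import BaseModel
from open_xtract import extract, UrlFetchError, ModelError

# LineItem is not defined in the original snippet;
# these fields are illustrative.
class LineItem(BaseModel):
    description: str
    price: float

class Receipt(BaseModel):
    vendor: str
    items: list[LineItem]
    total: float

try:
    result = extract(
        schema=Receipt,
        model="anthropic:claude-sonnet-4-5",
        url="https://example.com/receipt.jpg",
        instructions="Extract receipt details",
    )
    print(f"Vendor: {result.vendor}, Total: ${result.total}")
except UrlFetchError as e:
    # The file at `url` could not be fetched
    print(f"Failed to fetch: {e}")
except ModelError as e:
    # Assumed here to signal a failure in the LLM call
    print(f"Model error: {e}")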
durable_extract.py
from pydantic import BaseModel
from open_xtract import extract, stop_temporal

# Article is not defined in the original snippet;
# these fields are illustrative.
class Article(BaseModel):
    title: str
    author: str

# Just add durable=True for automatic
# Temporal workflow execution
result = extract(
    schema=Article,
    model="openai:gpt-5.2",
    url="https://example.com/report.pdf",
    instructions="Extract article info",
    durable=True,
    temporal_ui=True,  # Optional, default True
)
# Temporal UI: http://localhost:8080

# Optional: clean up when done
stop_temporal()
Powered by Temporal

Durable Execution.

Long-running extractions that survive failures. Just add durable=True and let Temporal handle the rest.

  • Auto-starts Temporal via Docker
  • PostgreSQL for persistent state
  • Optional Temporal UI for monitoring
  • Automatic retries and recovery
$ uv add "open-xtract[temporal]"
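
For a sense of what "automatic retries" saves you from writing, here is a minimal in-process retry loop with exponential backoff. It is only a sketch: the `retry` helper is hypothetical, and it is not how Temporal works. Temporal persists workflow state and applies its own retry policies, so a retry can survive a process crash, which this loop cannot.

```python
import time

def retry(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn() until it succeeds or `attempts` is exhausted.

    Hypothetical helper for illustration. Waits base_delay * 2**n
    seconds between attempts and re-raises the last error on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error

# Simulated flaky call: fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

outcome = retry(flaky, attempts=3, sleep=lambda seconds: None)
print(outcome, calls["count"])
```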