Parse documents, extract structured data with LLMs, and ingest the results into vector databases. Composable pipelines with a plugin architecture.
Use each stage on its own or chain them together. Five ready-made flows out of the box.
Convert any document to clean text or markdown.
import docpipe
doc = docpipe.parse("report.pdf")
print(doc.markdown)
Extract structured entities from any text with LLMs.
text = docpipe.parse("report.pdf").markdown  # any text or markdown string works here
schema = docpipe.ExtractionSchema(
    description="Extract people and ages",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(text, schema)
Full pipeline: document to structured data in one call.
result = docpipe.run("invoice.pdf", schema)
print(result.extractions)
Parse a document and ingest vectors into your DB.
config = docpipe.IngestionConfig(
    connection_string="postgresql://...",
    table_name="docs",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("report.pdf", config=config)
Parse, extract, and ingest - all in one call.
result = docpipe.run(
    "contract.pdf", schema,
    ingestion_config=config,
)
Everything you need to go from documents to structured data at scale.
Add custom parsers and extractors via Python entry points. Third-party packages auto-discovered on install.
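Entry-point discovery of this kind can be sketched with the standard library's `importlib.metadata`. The group name `docpipe.parsers` used below is a hypothetical placeholder, not a name confirmed by docpipe; the sketch only illustrates the general mechanism by which installed packages advertise plugins.

```python
from importlib.metadata import entry_points


def discover_plugins(group: str) -> dict:
    """Map entry-point names to loaded objects for one entry-point group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    # ep.load() imports the advertised module and resolves the object.
    return {ep.name: ep.load() for ep in selected}


if __name__ == "__main__":
    # "docpipe.parsers" is an assumed group name used only for illustration.
    print(sorted(discover_plugins("docpipe.parsers")))
```

A third-party package would advertise its plugin by declaring an entry point in that group in its packaging metadata, so installing the package is enough to make it discoverable.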
Full CLI for scripting, FastAPI server for microservices, Docker image for deployment.
No magic defaults. Explicit LLM provider, embedding model, and DB connection. YAML + env vars.
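The "no magic defaults" idea can be illustrated with a small loader that refuses to start unless every setting is given explicitly. This is a sketch under assumptions: the environment variable names and the `Settings` class are hypothetical and are not docpipe's actual configuration API.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    llm_model: str
    embedding_provider: str
    embedding_model: str
    db_url: str


# Hypothetical variable names chosen for this sketch.
REQUIRED = {
    "llm_model": "DOCPIPE_LLM_MODEL",
    "embedding_provider": "DOCPIPE_EMBEDDING_PROVIDER",
    "embedding_model": "DOCPIPE_EMBEDDING_MODEL",
    "db_url": "DOCPIPE_DB_URL",
}


def load_settings(env=None) -> Settings:
    """Build Settings from environment variables, with no fallback values."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED.values() if not env.get(var)]
    if missing:
        # Fail loudly instead of silently picking a provider or model.
        raise ValueError("missing required settings: " + ", ".join(missing))
    return Settings(**{field: env[var] for field, var in REQUIRED.items()})
```

Failing at load time keeps the choice of provider, model, and database visible in the deployment configuration rather than hidden in library code.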
Built on LangChain for embeddings, text splitting, and vector stores. Supports OpenAI, Gemini, Ollama, HuggingFace.
PDF, DOCX, XLSX, PPTX, HTML, images, audio, video - powered by IBM Docling's advanced parsing.
Character-level source spans via LangExtract. Every extraction traces back to exact text positions.
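To make "character-level source spans" concrete, here is a minimal sketch of what such a span could look like and how it can be checked against the source text. The `Span` class and helper functions are hypothetical illustrations, not the LangExtract or docpipe data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    label: str
    text: str
    start: int  # inclusive character offset into the source
    end: int    # exclusive character offset into the source


def locate(source: str, label: str, value: str) -> Span:
    """Build a span for the first occurrence of value in source."""
    start = source.find(value)
    if start == -1:
        raise ValueError(f"{value!r} does not occur in the source text")
    return Span(label=label, text=value, start=start, end=start + len(value))


def is_grounded(source: str, span: Span) -> bool:
    """True if the span's offsets point at exactly the extracted text."""
    return source[span.start:span.end] == span.text


if __name__ == "__main__":
    source = "Alice is 30 and Bob is 25."
    span = locate(source, "person", "Bob")
    print(span, is_grounded(source, span))
```

Storing offsets alongside each extraction lets a reviewer or a UI highlight the exact passage a value came from, and lets a pipeline reject extractions that do not appear in the source.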
CLI for quick tasks, Python API for integration, Docker for deployment.
# Parse a document
$ docpipe parse invoice.pdf --format markdown
# Extract structured data
$ docpipe extract data.txt \
    --schema schema.yaml \
    --model gemini-2.5-flash
# Ingest into your vector DB
$ docpipe ingest report.pdf \
    --db "postgresql://..." \
    --table documents \
    --embedding-provider openai \
    --embedding-model text-embedding-3-small
# Start API server
$ docpipe serve --port 8000
# Run API server
$ docker run -p 8000:8000 \
    --env-file .env \
    docpipe
# Parse in container
$ docker run -v "$(pwd)/data:/data" \
    docpipe parse /data/invoice.pdf
# Ingest from container
$ docker run --env-file .env \
    -v "$(pwd)/data:/data" \
    docpipe ingest /data/report.pdf \
    --db "postgresql://..." \
    --table docs \
    --embedding-provider openai \
    --embedding-model text-embedding-3-small
pip install docpipe                  # Core only
pip install "docpipe[docling]"       # + Document parsing
pip install "docpipe[langextract]"   # + Google LangExtract
pip install "docpipe[openai]"        # + OpenAI embeddings & LLM
pip install "docpipe[google]"        # + Google Gemini
pip install "docpipe[pgvector]"      # + PostgreSQL vector store
pip install "docpipe[server]"        # + FastAPI server
pip install "docpipe[all]"           # Everything