Powered by Docling + LangExtract + LangChain

Unstructured docs to structured data in one pipeline

Parse documents, extract structured data with LLMs, and ingest into vector databases. Composable pipelines with a plugin architecture.

$ pip install docpipe[all]

Three Pipelines, Fully Composable

Use each independently or chain them together. Five ready-made flows out of the box.

📄 Documents (PDF, DOCX, images...) → 🔍 Parse (Docling) → Extract (LangExtract / LangChain) → 🗃 Ingest (pgvector + your DB)
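
The idea of chaining independent stages can be sketched in a few lines of plain Python. This is only an illustration of the composition pattern: the `compose` helper and the stand-in `parse` and `extract` functions below are hypothetical and are not docpipe's API or internals.

```python
from typing import Any, Callable, Optional

Stage = Callable[[Any], Any]


def compose(*stages: Optional[Stage]) -> Stage:
    """Chain the stages that are not None, left to right."""
    active = [stage for stage in stages if stage is not None]

    def pipeline(value: Any) -> Any:
        # Feed the output of each stage into the next one.
        for stage in active:
            value = stage(value)
        return value

    return pipeline


# Hypothetical stand-ins for real stages, for illustration only.
def parse(path: str) -> str:
    return f"text of {path}"


def extract(text: str) -> dict:
    return {"source_text": text, "entities": text.split()}


parse_only = compose(parse)
parse_and_extract = compose(parse, extract)
```

Because each stage is just a callable, a stage can be used alone, skipped by passing `None`, or swapped for another implementation without touching the others.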

1. Parse Only (Docling)

Convert any document to clean text or markdown.

import docpipe

doc = docpipe.parse("report.pdf")
print(doc.markdown)

2. Extract Only (LangExtract)

Extract structured entities from any text with LLMs.

import docpipe

text = "Alice is 34 and Bob is 29."  # any plain text works here

schema = docpipe.ExtractionSchema(
    description="Extract people and ages",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(text, schema)

3. Parse + Extract

Full pipeline: document to structured data in one call.

result = docpipe.run(
    "invoice.pdf", schema
)
print(result.extractions)

4. Parse + Ingest

Parse a document and ingest vectors into your DB.

config = docpipe.IngestionConfig(
    connection_string="postgresql://...",
    table_name="docs",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("report.pdf", config=config)

5. Full Pipeline

Parse, extract, and ingest in one call.

result = docpipe.run(
    "contract.pdf", schema,
    ingestion_config=config,
)

Built for Production

Everything you need to go from documents to structured data at scale.

🔌

Plugin Architecture

Add custom parsers and extractors via Python entry points. Third-party packages auto-discovered on install.

🚀

CLI + API Server

Full CLI for scripting, FastAPI server for microservices, Docker image for deployment.

Fully Configurable

No magic defaults. Explicit LLM provider, embedding model, and DB connection. YAML + env vars.
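
One way to picture "no magic defaults" is a loader that refuses to start unless every setting is given explicitly. The sketch below uses only the standard library; the `Settings` class, the `load_settings` function, and the `DOCPIPE_*` variable names are hypothetical and may not match the names docpipe reads.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    llm_model: str
    embedding_provider: str
    embedding_model: str
    connection_string: str


# Hypothetical environment variable names, for illustration only.
ENV_VARS = {
    "llm_model": "DOCPIPE_LLM_MODEL",
    "embedding_provider": "DOCPIPE_EMBEDDING_PROVIDER",
    "embedding_model": "DOCPIPE_EMBEDDING_MODEL",
    "connection_string": "DOCPIPE_CONNECTION_STRING",
}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from `env`, failing loudly instead of guessing defaults."""
    missing = [var for var in ENV_VARS.values() if not env.get(var)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    return Settings(**{field: env[var] for field, var in ENV_VARS.items()})
```

Failing at load time keeps a misconfigured deployment from silently falling back to a provider or database nobody chose.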

🔗

LangChain Backbone

Built on LangChain for embeddings, text splitting, and vector stores. Supports OpenAI, Gemini, Ollama, HuggingFace.
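
Text splitting is the step that turns a long document into overlapping chunks before embedding. The function below is a plain-Python sketch of fixed-size character chunking with overlap; it is not LangChain's splitter and `split_text` is a hypothetical name, but it shows the basic idea.

```python
def split_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    chunk boundary still appears intact in one of the two neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Production splitters usually prefer to break on paragraph or sentence boundaries rather than at raw character offsets, so treat this as the simplest possible baseline.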

📄

20+ Document Formats

PDF, DOCX, XLSX, PPTX, HTML, images, audio, video - powered by IBM Docling's advanced parsing.

🔎

Source Grounding

Character-level source spans via LangExtract. Every extraction traces back to exact text positions.
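
To make "character-level source spans" concrete, here is a small sketch that locates an extracted string in its source and records the offsets. The `Span` class and `ground` function are hypothetical and do not reflect LangExtract's data model; they only show what it means for an extraction to trace back to exact positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    text: str


def ground(source: str, extracted: str) -> list[Span]:
    """Return every exact occurrence of `extracted` in `source` as a Span."""
    spans = []
    if not extracted:
        return spans
    pos = source.find(extracted)
    while pos != -1:
        spans.append(Span(pos, pos + len(extracted), extracted))
        pos = source.find(extracted, pos + 1)
    return spans
```

With spans in hand, a reviewer can check any extraction by slicing the source: `source[span.start:span.end]` must equal `span.text`.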

Use It Your Way

CLI for quick tasks, Python API for integration, Docker for deployment.

▶ CLI
# Parse a document
$ docpipe parse invoice.pdf --format markdown

# Extract structured data
$ docpipe extract data.txt \
    --schema schema.yaml \
    --model gemini-2.5-flash

# Ingest into your vector DB
$ docpipe ingest report.pdf \
    --db "postgresql://..." \
    --table documents \
    --embedding-provider openai \
    --embedding-model text-embedding-3-small

# Start API server
$ docpipe serve --port 8000

🐋 Docker
# Run API server
$ docker run -p 8000:8000 \
    --env-file .env \
    docpipe

# Parse in container (bind-mount ./data so the container can read the file)
$ docker run -v "$(pwd)/data:/data" \
    docpipe parse /data/invoice.pdf

# Ingest from container (same mount, plus credentials from .env)
$ docker run --env-file .env \
    -v "$(pwd)/data:/data" \
    docpipe ingest /data/report.pdf \
    --db "postgresql://..." \
    --table docs \
    --embedding-provider openai \
    --embedding-model text-embedding-3-small

📦 Install Options
pip install docpipe                    # Core only
pip install docpipe[docling]           # + Document parsing
pip install docpipe[langextract]       # + Google LangExtract
pip install docpipe[openai]            # + OpenAI embeddings & LLM
pip install docpipe[google]            # + Google Gemini
pip install docpipe[pgvector]          # + PostgreSQL vector store
pip install docpipe[server]            # + FastAPI server
pip install docpipe[all]               # Everything