Metadata-Version: 2.4
Name: docpipe-sdk
Version: 0.1.0
Summary: Unified document parsing, structured extraction, and vector ingestion pipeline
Project-URL: Homepage, https://docpipe.vercel.app
Project-URL: Repository, https://github.com/thesunnysinha/docpipe
Project-URL: Bug Tracker, https://github.com/thesunnysinha/docpipe/issues
Project-URL: Changelog, https://github.com/thesunnysinha/docpipe/blob/main/CHANGELOG.md
Author-email: Sunny Sinha <thesunnysinha@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: docling,document,extraction,ingestion,langchain,langextract,llm,parsing,pipeline,rag,vector
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: click>=8.0
Requires-Dist: langchain-core>=0.3
Requires-Dist: langchain-text-splitters>=0.3
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: all
Requires-Dist: docling>=2.0; extra == 'all'
Requires-Dist: fastapi>=0.100; extra == 'all'
Requires-Dist: langchain-google-genai>=2.0; extra == 'all'
Requires-Dist: langchain-ollama>=0.3; extra == 'all'
Requires-Dist: langchain-openai>=0.3; extra == 'all'
Requires-Dist: langchain-postgres>=0.0.12; extra == 'all'
Requires-Dist: langextract>=0.1; extra == 'all'
Requires-Dist: python-multipart>=0.0.6; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.20; extra == 'all'
Provides-Extra: dev
Requires-Dist: httpx; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: docling
Requires-Dist: docling>=2.0; extra == 'docling'
Provides-Extra: google
Requires-Dist: langchain-google-genai>=2.0; extra == 'google'
Provides-Extra: huggingface
Requires-Dist: langchain-huggingface>=0.1; extra == 'huggingface'
Provides-Extra: langextract
Requires-Dist: langextract>=0.1; extra == 'langextract'
Provides-Extra: ollama
Requires-Dist: langchain-ollama>=0.3; extra == 'ollama'
Provides-Extra: openai
Requires-Dist: langchain-openai>=0.3; extra == 'openai'
Provides-Extra: pgvector
Requires-Dist: langchain-postgres>=0.0.12; extra == 'pgvector'
Provides-Extra: server
Requires-Dist: fastapi>=0.100; extra == 'server'
Requires-Dist: python-multipart>=0.0.6; extra == 'server'
Requires-Dist: uvicorn[standard]>=0.20; extra == 'server'
Description-Content-Type: text/markdown

# docpipe

Unified document parsing, structured extraction, and vector ingestion pipeline.

## Overview

docpipe connects document parsing (Docling), LLM-based structured extraction (LangExtract + LangChain), and vector ingestion (pgvector via LangChain) into a single composable pipeline.

**Three stages that work independently or chained together:**

1. **Parse**: Unstructured docs → parsed text/markdown (Docling)
2. **Extract**: Text → structured entities via LLM (LangExtract or LangChain)
3. **Ingest**: Parsed chunks → embeddings → your vector DB (LangChain + pgvector)
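
As a rough illustration of what "independent but composable" means, the sketch below uses plain-Python stand-ins. None of these names (`ParsedDoc`, `parse_stage`, `extract_stage`, `chunk_stage`) are part of docpipe; they are hypothetical placeholders for Docling parsing, LLM extraction, and the chunking that precedes embedding.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for illustration only; NOT docpipe's real API.
@dataclass
class ParsedDoc:
    text: str


def parse_stage(raw: str) -> ParsedDoc:
    """Stand-in parser: collapses whitespace instead of calling Docling."""
    return ParsedDoc(text=" ".join(raw.split()))


def extract_stage(doc: ParsedDoc) -> list[str]:
    """Stand-in extractor: keeps capitalised words instead of calling an LLM."""
    return [w.strip(",.") for w in doc.text.split() if w[:1].isupper()]


def chunk_stage(doc: ParsedDoc, size: int = 20) -> list[str]:
    """Stand-in chunker: fixed-size character windows, ready for embedding."""
    return [doc.text[i:i + size] for i in range(0, len(doc.text), size)]


# Each stage consumes only the parsed document, so extraction and
# chunking can run separately or one after the other.
doc = parse_stage("John  Doe,\n age 30")
entities = extract_stage(doc)
chunks = chunk_stage(doc)
print(entities)  # ['John', 'Doe']
print(chunks)    # ['John Doe, age 30']
```

Because the extract and ingest stand-ins share nothing but the parsed document, either one can be skipped without affecting the other.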

## Install

```bash
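# Core only (the distribution is named docpipe-sdk; the import name is docpipe)
pip install docpipe-sdk

# With all backends
pip install "docpipe-sdk[all]"

# Pick what you need
pip install "docpipe-sdk[docling]"       # Document parsing
pip install "docpipe-sdk[langextract]"   # Google LangExtract
pip install "docpipe-sdk[openai]"        # OpenAI embeddings + LLM
pip install "docpipe-sdk[google]"        # Google Generative AI via LangChain
pip install "docpipe-sdk[ollama]"        # Ollama via LangChain
pip install "docpipe-sdk[huggingface]"   # Hugging Face via LangChain
pip install "docpipe-sdk[pgvector]"      # PostgreSQL vector store
pip install "docpipe-sdk[server]"        # FastAPI server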
```

## Quick Start

### Python API

```python
import docpipe

# Parse a document
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)

# Extract structured data
schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)

# Full pipeline
result = docpipe.run("invoice.pdf", schema)

# Ingest into your vector DB
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)
```

### CLI

```bash
docpipe parse invoice.pdf --format markdown
docpipe extract "John Doe, age 30" --schema schema.yaml --model gemini-2.5-flash
docpipe run invoice.pdf --schema schema.yaml --model gemini-2.5-flash
docpipe ingest invoice.pdf --db "postgresql://..." --table invoices \
    --embedding-provider openai --embedding-model text-embedding-3-small
docpipe search "total amount" --db "postgresql://..." --table invoices \
    --embedding-provider openai --embedding-model text-embedding-3-small
docpipe serve
docpipe plugins list
```

### Docker

```bash
# API server
docker run -p 8000:8000 --env-file .env docpipe

# CLI
docker run -v "$(pwd)/data:/data" docpipe parse /data/invoice.pdf
```

## Plugin System

Third-party packages can register their own parsers and extractors through Python entry points:

```toml
# In your package's pyproject.toml
[project.entry-points."docpipe.parsers"]
my_parser = "my_package:MyParser"

[project.entry-points."docpipe.extractors"]
my_extractor = "my_package:MyExtractor"
```

## License

MIT
