Metadata-Version: 2.4
Name: docpipe-sdk
Version: 0.5.3
Summary: Unified document parsing, structured extraction, vector ingestion, and RAG pipeline SDK
Project-URL: Homepage, https://docpipe.sunnysinha.online
Project-URL: Documentation, https://docpipe.sunnysinha.online
Project-URL: Repository, https://github.com/thesunnysinha/docpipe
Project-URL: Bug Tracker, https://github.com/thesunnysinha/docpipe/issues
Project-URL: Changelog, https://github.com/thesunnysinha/docpipe/blob/main/CHANGELOG.md
Author-email: Sunny Sinha <thesunnysinha@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: docling,document,embeddings,extraction,ingestion,langchain,langextract,llm,observability,opentelemetry,parsing,pgvector,pipeline,rag,retrieval,turbovec,vector
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: click>=8.0
Requires-Dist: langchain-core>=0.3
Requires-Dist: langchain-text-splitters>=0.3
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: all
Requires-Dist: docling>=2.0; extra == 'all'
Requires-Dist: fastapi>=0.100; extra == 'all'
Requires-Dist: flashrank>=0.2; extra == 'all'
Requires-Dist: glmocr>=0.1; extra == 'all'
Requires-Dist: httpx>=0.27; extra == 'all'
Requires-Dist: langchain-anthropic>=0.3; extra == 'all'
Requires-Dist: langchain-classic>=0.1; extra == 'all'
Requires-Dist: langchain-community>=0.3; extra == 'all'
Requires-Dist: langchain-google-genai>=2.0; extra == 'all'
Requires-Dist: langchain-ollama>=0.3; extra == 'all'
Requires-Dist: langchain-openai>=0.3; extra == 'all'
Requires-Dist: langchain-postgres>=0.0.12; extra == 'all'
Requires-Dist: langextract>=0.1; extra == 'all'
Requires-Dist: opentelemetry-api>=1.27; extra == 'all'
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.27; extra == 'all'
Requires-Dist: opentelemetry-instrumentation-fastapi>=0.48b0; extra == 'all'
Requires-Dist: opentelemetry-sdk>=1.27; extra == 'all'
Requires-Dist: prometheus-client>=0.20; extra == 'all'
Requires-Dist: prometheus-fastapi-instrumentator>=7.1.0; extra == 'all'
Requires-Dist: psycopg2-binary>=2.9; extra == 'all'
Requires-Dist: python-multipart>=0.0.6; extra == 'all'
Requires-Dist: rank-bm25>=0.2; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.20; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: langchain-anthropic>=0.3; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: fastapi>=0.100; extra == 'dev'
Requires-Dist: httpx; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: prometheus-client>=0.20; extra == 'dev'
Requires-Dist: prometheus-fastapi-instrumentator>=7.1.0; extra == 'dev'
Requires-Dist: psycopg2-binary>=2.9; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: docling
Requires-Dist: docling>=2.0; extra == 'docling'
Provides-Extra: glm-ocr
Requires-Dist: glmocr>=0.1; extra == 'glm-ocr'
Provides-Extra: google
Requires-Dist: langchain-google-genai>=2.0; extra == 'google'
Provides-Extra: http
Requires-Dist: httpx>=0.27; extra == 'http'
Provides-Extra: huggingface
Requires-Dist: langchain-huggingface>=0.1; extra == 'huggingface'
Provides-Extra: langextract
Requires-Dist: langextract>=0.1; extra == 'langextract'
Provides-Extra: observability
Requires-Dist: opentelemetry-api>=1.27; extra == 'observability'
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.27; extra == 'observability'
Requires-Dist: opentelemetry-instrumentation-fastapi>=0.48b0; extra == 'observability'
Requires-Dist: opentelemetry-sdk>=1.27; extra == 'observability'
Provides-Extra: ollama
Requires-Dist: langchain-ollama>=0.3; extra == 'ollama'
Provides-Extra: openai
Requires-Dist: langchain-openai>=0.3; extra == 'openai'
Provides-Extra: pgvector
Requires-Dist: langchain-postgres>=0.0.12; extra == 'pgvector'
Requires-Dist: psycopg2-binary>=2.9; extra == 'pgvector'
Provides-Extra: rag
Requires-Dist: langchain-classic>=0.1; extra == 'rag'
Requires-Dist: langchain-community>=0.3; extra == 'rag'
Requires-Dist: rank-bm25>=0.2; extra == 'rag'
Provides-Extra: rerank
Requires-Dist: flashrank>=0.2; extra == 'rerank'
Provides-Extra: server
Requires-Dist: fastapi>=0.100; extra == 'server'
Requires-Dist: prometheus-client>=0.20; extra == 'server'
Requires-Dist: prometheus-fastapi-instrumentator>=7.1.0; extra == 'server'
Requires-Dist: python-multipart>=0.0.6; extra == 'server'
Requires-Dist: uvicorn[standard]>=0.20; extra == 'server'
Provides-Extra: turbovec
Requires-Dist: turbovec[langchain]>=0.1; extra == 'turbovec'
Description-Content-Type: text/markdown

# docpipe

Unified document parsing, structured extraction, vector ingestion, and RAG pipeline SDK.

[![PyPI](https://img.shields.io/pypi/v/docpipe-sdk)](https://pypi.org/project/docpipe-sdk/)
[![Python](https://img.shields.io/pypi/pyversions/docpipe-sdk)](https://pypi.org/project/docpipe-sdk/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Docker](https://img.shields.io/badge/ghcr.io-docpipe-6366f1?logo=docker&logoColor=white)](https://ghcr.io/thesunnysinha/docpipe)
[![Website](https://img.shields.io/badge/docs-docpipe-6366f1)](https://docpipe.sunnysinha.online/docs)

## Overview

docpipe connects document parsing (Docling / GLM-OCR), LLM-based structured extraction (LangExtract + LangChain), vector ingestion (pgvector or optional turbovec), and RAG querying into a single composable pipeline.

**Four pipelines, composable together:**

1. **Parse** — Unstructured docs → parsed text/markdown
2. **Extract** — Text → structured entities via LLM
3. **Ingest** — Chunks → embeddings → your vector store
4. **RAG** — Questions → grounded answers with citations (six retrieval strategies)

> docpipe never stores your data. It connects to your infrastructure and gets out of the way.

**Full documentation** (install extras, Docker, API reference, RAG strategies, observability, turbovec, plugins): **[docpipe docs](https://docpipe.sunnysinha.online/docs)** · [Marketing site](https://docpipe.sunnysinha.online)

---

## Install

```bash
pip install docpipe-sdk
# API server + OpenTelemetry (optional)
pip install "docpipe-sdk[server,observability]"
```

Optional extras (`docling`, `openai`, `google`, `pgvector`, `turbovec`, `rag`, `rerank`, `http`, `all`, …) can be combined in one comma-separated list, as in the second command above. The full list is in the **[Install guide](https://docpipe.sunnysinha.online/docs)**.

For unreleased commits: `pip install git+https://github.com/thesunnysinha/docpipe.git`
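
Because most features live behind extras, it can help to check what is importable before running a pipeline. The snippet below is a minimal sketch and not part of docpipe: `EXTRA_MODULES` and `missing_extras` are illustrative names, and the mapping covers only a few extras, pairing each with the import names of the packages that extra installs.

```python
from importlib.util import find_spec

# Hypothetical mapping (not a docpipe API): extra name -> top-level modules
# provided by the packages that extra installs.
EXTRA_MODULES = {
    "docling": ["docling"],
    "openai": ["langchain_openai"],
    "pgvector": ["langchain_postgres", "psycopg2"],
    "rerank": ["flashrank"],
    "server": ["fastapi", "uvicorn"],
}


def missing_extras(extra_modules=EXTRA_MODULES):
    """Return {extra: [modules that cannot be imported]} for incomplete extras."""
    missing = {}
    for extra, modules in extra_modules.items():
        # find_spec returns None when a top-level module is not installed.
        absent = [name for name in modules if find_spec(name) is None]
        if absent:
            missing[extra] = absent
    return missing


if __name__ == "__main__":
    for extra, modules in missing_extras().items():
        print(f'{extra}: missing {", ".join(modules)}; try: pip install "docpipe-sdk[{extra}]"')
```

An empty result means every module in the mapping is importable. It does not confirm that the installed versions satisfy the minimums declared for each extra.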

---

## Quick start

```python
import docpipe

# Parse: convert the file to text/markdown (typically needs a parser extra such as `docling`)
doc = docpipe.parse("invoice.pdf")
print(doc.markdown)

# Extract: pull structured entities out of the parsed text with an LLM
schema = docpipe.ExtractionSchema(
    description="Extract invoice line items with amounts",
    model_id="gemini-2.5-flash",
)
results = docpipe.extract(doc.text, schema)

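# Ingest + RAG: point both configs at your own database and providers.
# Use the same embedding provider and model for ingestion and querying,
# so query vectors are comparable with the stored ones.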
config = docpipe.IngestionConfig(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    table_name="invoices",
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
)
docpipe.ingest("invoice.pdf", config=config)

rag_config = docpipe.RAGConfig(
    connection_string=config.connection_string,
    table_name=config.table_name,
    embedding_provider="openai",
    embedding_model="text-embedding-3-small",
    llm_provider="openai",
    llm_model="gpt-4o",
    strategy="hyde",
)
result = docpipe.query("What is the total on the invoice?", config=rag_config)
print(result.answer)
```
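
The quick start embeds database credentials directly in the connection string. One way to avoid that is to assemble the URL from environment variables, as in the sketch below. The helper `connection_string_from_env` is illustrative, not a docpipe function, and the variable names follow libpq's conventions (`PGUSER`, `PGPASSWORD`, …) purely as this sketch's choice. The variables docpipe itself reads are documented in `.env.example`.

```python
import os
from urllib.parse import quote


def connection_string_from_env(env=os.environ):
    """Build a PostgreSQL URL from environment variables.

    The PG* names are this sketch's assumption, not settings read by docpipe.
    """
    # Percent-encode values so characters such as '@' or '/' cannot break the URL.
    user = quote(env["PGUSER"], safe="")
    password = quote(env["PGPASSWORD"], safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = quote(env["PGDATABASE"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

The returned string can be passed as `connection_string` to the `IngestionConfig` and `RAGConfig` objects shown in the quick start.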

**CLI:** `docpipe parse`, `docpipe ingest`, `docpipe rag query`, `docpipe serve` — see **[CLI & API server](https://docpipe.sunnysinha.online/docs)**.

**Docker:** `docker pull ghcr.io/thesunnysinha/docpipe:latest` — compose examples and env vars are in the **[Docker guide](https://docpipe.sunnysinha.online/docs)** and [`.env.example`](.env.example).

---

## Learn more

| Topic | Where |
|--------|--------|
| Install extras & providers | [docs](https://docpipe.sunnysinha.online/docs) |
| REST API (`/ingest`, `/rag/query`, `/rag/stream`, …) | [docs](https://docpipe.sunnysinha.online/docs) |
| RAG strategies (`naive`, `hyde`, `hybrid`, `auto`, …) | [docs](https://docpipe.sunnysinha.online/docs) |
| Observability (OTEL, Prometheus, JSON logs) | [docs](https://docpipe.sunnysinha.online/docs) · `.env.example` |
| turbovec (local file indices) | [docs](https://docpipe.sunnysinha.online/docs) |
| Custom parsers / extractors | [CONTRIBUTING.md](CONTRIBUTING.md) |
| Environment variables | [`.env.example`](.env.example) · [config reference](https://docpipe.sunnysinha.online/docs) |
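
When choosing among the RAG strategies named in the table, it can be useful to run one question under several of them and compare the answers. The helper below is a minimal sketch, not a docpipe API: `compare_strategies` and the injected `run_query` callable are hypothetical names, and the docpipe wiring shown in the trailing comment is an untested assumption based on the quick-start fields.

```python
from typing import Callable, Dict, Iterable


def compare_strategies(
    question: str,
    strategies: Iterable[str],
    run_query: Callable[[str, str], str],
) -> Dict[str, str]:
    """Run one question under each strategy and collect the answers by name."""
    answers = {}
    for name in strategies:
        try:
            answers[name] = run_query(question, name)
        except Exception as exc:
            # Keep going so one failing strategy does not hide the others.
            answers[name] = f"<failed: {exc}>"
    return answers


# Possible wiring with docpipe (assumption, untested):
#
#   def run_query(question, strategy):
#       cfg = docpipe.RAGConfig(strategy=strategy, **shared_fields)  # shared_fields: the
#       # connection, table, embedding, and LLM settings from the quick start
#       return docpipe.query(question, config=cfg).answer
#
#   compare_strategies("What is the total on the invoice?", ["naive", "hyde"], run_query)
```

Injecting `run_query` keeps the helper independent of any particular backend, so it can be exercised with a stub before a database is available.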

---

## License

MIT — see [LICENSE](LICENSE).
