Metadata-Version: 2.4
Name: pgrag
Version: 1.2.0
Summary: Self-hosted document ingestion and hybrid (vector + keyword) retrieval server backed by Postgres/pgvector and Google embeddings.
Author: Sagar Hedaoo
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/sagarhedaoo/vectorrag
Project-URL: Repository, https://github.com/sagarhedaoo/vectorrag
Project-URL: Issues, https://github.com/sagarhedaoo/vectorrag/issues
Keywords: rag,embeddings,pgvector,retrieval,semantic-search,fastapi
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3.12
Classifier: Framework :: FastAPI
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Operating System :: OS Independent
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.115
Requires-Dist: uvicorn[standard]>=0.30
Requires-Dist: pydantic>=2.7
Requires-Dist: pydantic-settings>=2.3
Requires-Dist: psycopg[binary,pool]>=3.2
Requires-Dist: pgvector>=0.3.6
Requires-Dist: arq>=0.26
Requires-Dist: redis>=5.0
Requires-Dist: google-genai>=1.0
Requires-Dist: alembic>=1.13
Requires-Dist: prometheus-client>=0.20
Provides-Extra: dev
Requires-Dist: pytest>=8.2; extra == "dev"
Requires-Dist: pytest-asyncio>=1.0; extra == "dev"
Requires-Dist: testcontainers[postgres]>=4.5; extra == "dev"
Requires-Dist: httpx>=0.27; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Requires-Dist: pyright>=1.1.380; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: opentelemetry-api>=1.27; extra == "dev"
Requires-Dist: opentelemetry-sdk>=1.27; extra == "dev"
Provides-Extra: otel
Requires-Dist: opentelemetry-api>=1.27; extra == "otel"
Requires-Dist: opentelemetry-sdk>=1.27; extra == "otel"
Requires-Dist: opentelemetry-exporter-otlp-proto-grpc>=1.27; extra == "otel"
Requires-Dist: opentelemetry-instrumentation-fastapi>=0.48; extra == "otel"
Dynamic: license-file

# VectorRAG (pgrag)

> A self-hosted document ingestion and **hybrid retrieval** server for Retrieval-Augmented Generation pipelines. Bring your own embedding key and database — no SaaS dependency.

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://github.com/sagarhedaoo/VectorRAG/blob/main/LICENSE)
[![Python: 3.12+](https://img.shields.io/badge/Python-3.12%2B-3776AB.svg?logo=python&logoColor=white)](https://www.python.org/downloads/)
[![CI](https://github.com/sagarhedaoo/VectorRAG/actions/workflows/ci.yml/badge.svg)](https://github.com/sagarhedaoo/VectorRAG/actions/workflows/ci.yml)
![Coverage](https://img.shields.io/badge/coverage-99%25-brightgreen.svg)
[![Type checked: pyright (strict)](https://img.shields.io/badge/types-pyright%20strict-2D6FBC.svg)](https://github.com/microsoft/pyright)

VectorRAG ingests text documents, embeds them with **Google `gemini-embedding-001`**, stores the vectors in **Postgres + pgvector**, and serves **hybrid (vector + full-text)** retrieval over HTTP — with every retrieval knob (top-k, top-p, similarity threshold, MMR, hybrid fusion, metadata filters) adjustable per request. Async ingestion via an Arq worker, API-key auth, per-key rate limiting, Prometheus metrics, structured logs, and health probes are wired in by default.

> The PyPI distribution name is **`pgrag`** (the bare `vectorrag` name was claimed by an unrelated project). The Python package still imports as `vectorrag` and the CLI is `vectorrag`. Sources, issue tracker, architecture diagram, and contributor guide live on **[GitHub](https://github.com/sagarhedaoo/VectorRAG)**.

---

## Why VectorRAG

There are excellent hosted vector databases. There are also good ingestion frameworks. VectorRAG sits in the gap between them:

- **It's a service, not a library.** A real FastAPI server with auth, rate limiting, observability, and health probes — not a notebook helper.
- **You self-host it.** Your data stays in your Postgres. Embeddings cost exactly what Google charges you, with no markup. No outbound calls beyond the embedding API.
- **The retrieval layer is configurable per request.** Want pure semantic for one query and hybrid + MMR for another? Just change the request body — no redeploy, no separate index.
- **It's built to be operated.** API-key auth, per-key rate limits, Prometheus metrics with request id correlation, `/healthz` + `/readyz` probes, graceful worker drain, content-hash dedup, idempotent migrations.

---

## Features

- **Hybrid retrieval** — HNSW vector ANN (pgvector `halfvec` + `halfvec_cosine_ops`) fused with Postgres full-text search (`ts_rank_cd` over `tsvector`). RRF or weighted-α fusion.
- **Every knob adjustable per request** — `top_k`, `candidate_k`, `ef_search`, `similarity_threshold`, `top_p` (softmax nucleus), `mmr` + `mmr_lambda`, `hybrid` toggle, `fusion`/`alpha`/`rrf_k`, JSONB metadata `filter`, `include` field selection, and a **reranker seam** ready for a cross-encoder drop-in.
- **Embeddings via Google Gemini** — `gemini-embedding-001` at 1536 dims through `google-genai`, with **task-aware** embeddings (`RETRIEVAL_DOCUMENT` for chunks, `RETRIEVAL_QUERY` for queries) and a **content-hash cache** that never re-embeds the same chunk twice.
- **Resilient composition** — `CachingEmbedder(RetryingEmbedder(GeminiEmbedder(...)))` with exponential backoff on transient failures.
- **Async ingestion** — Submit a document over HTTP, get a `job_id` back, watch the Arq worker chunk → embed → bulk-insert with status visible via the API. Failed jobs mark the document `failed`, increment attempts, and re-raise for Arq retries.
- **Production hardening built in** — API-key auth (SHA-256 hashed at rest, Bearer or `X-API-Key`), per-key fixed-window rate limiting, JSON logs with `X-Request-ID` correlation, Prometheus metrics (`http_requests_total`, `http_request_duration_seconds`), `/healthz` and `/readyz`.
- **Operator tooling** — `pg_dump` / `pg_restore` scripts + runbook, recall@k evaluation harness, `BatchingEmbedder` for cheap bulk loads.
- **Strict quality bar** — 126 tests, **99% coverage**, ruff clean, **pyright strict 0 errors**, CI runs on every push.
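
The two fusion modes listed under hybrid retrieval can be illustrated in plain Python. This is a minimal sketch of the standard techniques (Reciprocal Rank Fusion and a min-max-normalized weighted blend), not the server's actual code; the function names `rrf_fuse` and `weighted_fuse` and the toy inputs are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], rrf_k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: each list contributes 1 / (rrf_k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (rrf_k + rank)
    return dict(scores)


def _min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def weighted_fuse(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    alpha: float = 0.5,
) -> dict[str, float]:
    """Blend normalized scores; alpha is the weight of the vector side."""
    vec = _min_max(vector_scores)
    kw = _min_max(keyword_scores)
    return {
        chunk_id: alpha * vec.get(chunk_id, 0.0) + (1.0 - alpha) * kw.get(chunk_id, 0.0)
        for chunk_id in set(vec) | set(kw)
    }


# Toy candidate lists (hypothetical chunk ids), best hit first.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d"]
fused = rrf_fuse([vector_hits, keyword_hits])
ranked = sorted(fused, key=fused.get, reverse=True)
```

In this toy case `b` ranks first because it appears in both lists. RRF uses only ranks, so it needs no score normalization; the weighted blend uses the raw scores and therefore normalizes each side first.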

---

## Requirements

- Python **3.12+**
- **Postgres 16+** with the `pgvector` extension (easiest: `pgvector/pgvector:pg17` Docker image)
- **Redis 7+** for the Arq queue
- A **Google Gemini API key** for embeddings

---

## Installation

```bash
pip install pgrag
```

This installs the `vectorrag` console script (`vectorrag migrate / serve / worker / create-key`) and the importable package (`import vectorrag`).

Set the required environment variables, then bring up the service:

```bash
export GCP_API_KEY=...
export DATABASE_URL=postgresql://user:pass@localhost:5432/rag
export REDIS_URL=redis://localhost:6379/0

vectorrag --version                  # print the installed version
vectorrag migrate                    # apply schema
vectorrag create-key my-app          # mint a server API key (prints the raw key once — capture it)
vectorrag serve &                    # HTTP API on :8000
vectorrag worker &                   # ingestion worker
```
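
For readers who want to see how these variables might be consumed, here is a minimal standard-library sketch. The variable names and defaults mirror the Configuration table in this README; the `Settings` dataclass and `load_settings` helper are hypothetical and are not the package's own settings code (which, judging by its dependencies, likely builds on `pydantic-settings`).

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass

REQUIRED = ("GCP_API_KEY", "DATABASE_URL", "REDIS_URL")


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; field defaults follow the Configuration table.
    gcp_api_key: str
    database_url: str
    redis_url: str
    embedding_model: str = "gemini-embedding-001"
    embedding_dim: int = 1536
    embed_batch_size: int = 100
    hnsw_ef_search_default: int = 80
    api_keys_bootstrap: str = ""


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Fail fast on missing required variables, then apply defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    return Settings(
        gcp_api_key=env["GCP_API_KEY"],
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        embedding_model=env.get("EMBEDDING_MODEL", "gemini-embedding-001"),
        embedding_dim=int(env.get("EMBEDDING_DIM", "1536")),
        embed_batch_size=int(env.get("EMBED_BATCH_SIZE", "100")),
        hnsw_ef_search_default=int(env.get("HNSW_EF_SEARCH_DEFAULT", "80")),
        api_keys_bootstrap=env.get("API_KEYS_BOOTSTRAP", ""),
    )


# Example with an explicit mapping instead of the real process environment.
example = load_settings(
    {
        "GCP_API_KEY": "dummy",
        "DATABASE_URL": "postgresql://user:pass@localhost:5432/rag",
        "REDIS_URL": "redis://localhost:6379/0",
        "EMBEDDING_DIM": "768",
    }
)
```

Passing the environment as a mapping keeps the loader easy to test without mutating `os.environ`.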

For a complete docker-compose stack (Postgres + Redis + API + worker, all wired together), clone the repo:

```bash
git clone https://github.com/sagarhedaoo/VectorRAG.git
cd VectorRAG && cp .env.example .env  # fill in GCP_API_KEY
docker compose up -d --build
docker compose exec api vectorrag migrate
docker compose exec api vectorrag create-key my-app
```

---

## Usage

All examples assume `BASE=http://localhost:8000` and `KEY=<your-api-key>`.

### 1. Create a collection

```bash
curl -s -X POST $BASE/collections \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{"name":"my-docs"}'
```

Returns:

```json
{
  "id": "8c2a...",
  "name": "my-docs",
  "embedding_model": "gemini-embedding-001",
  "embedding_dim": 1536,
  "distance": "cosine"
}
```

### 2. Submit a document for ingestion

```bash
curl -s -X POST $BASE/collections/<collection_id>/documents \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{
    "text": "VectorRAG stores embeddings in Postgres and serves hybrid retrieval...",
    "title": "intro",
    "metadata": {"source": "docs", "lang": "en"},
    "chunk_size": 512,
    "chunk_overlap": 64
  }'
```

Returns `202 Accepted` with `{ "document_id": ..., "job_id": ..., "deduplicated": false }`. The worker picks the job up, chunks the text, embeds each chunk (using cached results when available), and bulk-inserts. Poll `GET /documents/{id}` until `"status": "done"`.
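
The "cached results when available" step, together with the `CachingEmbedder(RetryingEmbedder(...))` composition listed under Features, can be pictured as nested wrappers around a base embedder. The sketch below is synchronous and uses a fake embedder; the class internals, the `embed` method name, and the retry parameters are assumptions for illustration and do not reproduce the package's implementation.

```python
import hashlib
import time
from typing import Callable, Protocol


class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class FlakyEmbedder:
    """Hypothetical stand-in for the Gemini-backed embedder: fails once, then works."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, texts: list[str]) -> list[list[float]]:
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("transient failure")
        return [[float(len(text))] for text in texts]


class RetryingEmbedder:
    """Retry transient failures with exponential backoff."""

    def __init__(self, inner: Embedder, attempts: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if attempts < 1:
            raise ValueError("attempts must be >= 1")
        self.inner, self.attempts = inner, attempts
        self.base_delay, self.sleep = base_delay, sleep

    def embed(self, texts: list[str]) -> list[list[float]]:
        for attempt in range(self.attempts):
            try:
                return self.inner.embed(texts)
            except ConnectionError:
                if attempt == self.attempts - 1:
                    raise
                self.sleep(self.base_delay * 2 ** attempt)
        raise AssertionError("unreachable")


class CachingEmbedder:
    """Skip the inner embedder for texts whose content hash is already cached."""

    def __init__(self, inner: Embedder) -> None:
        self.inner = inner
        self.cache: dict[str, list[float]] = {}

    def embed(self, texts: list[str]) -> list[list[float]]:
        keys = [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]
        # Unique cache misses, in first-seen order.
        misses = {key: text for key, text in zip(keys, texts) if key not in self.cache}
        if misses:
            vectors = self.inner.embed(list(misses.values()))
            self.cache.update(zip(misses.keys(), vectors))
        return [self.cache[key] for key in keys]


delays: list[float] = []  # record backoff delays instead of actually sleeping
inner = FlakyEmbedder()
embedder = CachingEmbedder(RetryingEmbedder(inner, sleep=delays.append))
first = embedder.embed(["alpha", "beta", "alpha"])
second = embedder.embed(["alpha"])  # served from the cache, no inner call
```

Putting the cache outermost means a cache hit never touches the retry layer or the network, and only texts that are actually sent to the model can trigger a retry.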

Deduplication is per collection, keyed on `sha256(text)`. Submitting the same text twice to the same collection returns `deduplicated: true` and enqueues no new job.
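
A rough sketch of that rule, using an in-memory dictionary in place of the database: the `DedupIndex` class is hypothetical, and returning the existing document's id on a duplicate is an assumption, since the README only states that `deduplicated` is `true` and no job is created.

```python
import hashlib


def content_hash(text: str) -> str:
    """sha256 over the UTF-8 bytes of the submitted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class DedupIndex:
    """In-memory stand-in for a unique (collection_id, content_hash) constraint."""

    def __init__(self) -> None:
        self._seen: dict[tuple[str, str], str] = {}

    def submit(self, collection_id: str, document_id: str, text: str) -> tuple[str, bool]:
        """Return (document_id, deduplicated)."""
        key = (collection_id, content_hash(text))
        existing = self._seen.get(key)
        if existing is not None:
            return existing, True  # same text, same collection: no new job
        self._seen[key] = document_id
        return document_id, False


index = DedupIndex()
first = index.submit("col-1", "doc-1", "hello world")
repeat = index.submit("col-1", "doc-2", "hello world")
other_collection = index.submit("col-2", "doc-3", "hello world")
```

Because the key includes the collection id, the same text can still be ingested into a different collection.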

### 3. Query the collection

```bash
curl -s -X POST $BASE/collections/<collection_id>/search \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{
    "query": "hybrid retrieval",
    "top_k": 5,
    "hybrid": true,
    "fusion": "rrf",
    "filter": {"lang": "en"},
    "include": ["content", "metadata", "score"]
  }'
```

Returns:

```json
{
  "results": [
    {
      "chunk_id": "...",
      "score": 0.84,
      "content": "VectorRAG stores embeddings in Postgres...",
      "metadata": { "source": "docs", "lang": "en" }
    }
  ]
}
```
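
The same search call can be made from Python with only the standard library. This is a minimal sketch that assumes the endpoint, headers, and response shape shown above; the helper names `build_search_request` and `search` are illustrative and not part of the `vectorrag` package.

```python
import json
import urllib.request


def build_search_request(base: str, key: str, collection_id: str, body: dict) -> urllib.request.Request:
    """Build the POST request for /collections/{id}/search without sending it."""
    return urllib.request.Request(
        url=f"{base}/collections/{collection_id}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {key}", "Content-Type": "application/json"},
        method="POST",
    )


def search(base: str, key: str, collection_id: str, body: dict, timeout: float = 10.0) -> list[dict]:
    """Send the request and return the `results` array from the JSON response."""
    request = build_search_request(base, key, collection_id, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]


# Placeholder values; a real call needs a running server and a valid key.
request = build_search_request(
    "http://localhost:8000",
    "my-api-key",
    "my-collection-id",
    {"query": "hybrid retrieval", "top_k": 5, "hybrid": True, "fusion": "rrf"},
)
# hits = search("http://localhost:8000", "my-api-key", "my-collection-id", {"query": "hybrid retrieval"})
```

Separating request construction from sending makes the payload easy to inspect or log before any network traffic happens.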

---

## Configuration

Set via environment variables (or a local `.env`):

| Var                      | Required | Default                | Purpose                                                                                              |
| ------------------------ | -------- | ---------------------- | ---------------------------------------------------------------------------------------------------- |
| `GCP_API_KEY`            | ✓        | —                      | Google Gemini embedding API key                                                                      |
| `DATABASE_URL`           | ✓        | —                      | Postgres connection string (must have pgvector)                                                      |
| `REDIS_URL`              | ✓        | —                      | Redis URL for the Arq queue                                                                          |
| `EMBEDDING_MODEL`        |          | `gemini-embedding-001` | Embedding model name                                                                                 |
| `EMBEDDING_DIM`          |          | `1536`                 | Embedding dimensionality (Matryoshka-truncated)                                                      |
| `EMBED_BATCH_SIZE`       |          | `100`                  | Documents per embedding call                                                                         |
| `HNSW_EF_SEARCH_DEFAULT` |          | `80`                   | Default HNSW recall/speed knob                                                                       |
| `API_KEYS_BOOTSTRAP`     |          | `""`                   | If set, this raw key is inserted as an API key on first boot (idempotent). Convenient for local dev. |

---

## Search parameters

Every request to `POST /collections/{id}/search` accepts the following:

| Parameter              | Type                  | Default                          | Notes                                                                                 |
| ---------------------- | --------------------- | -------------------------------- | ------------------------------------------------------------------------------------- |
| `query`                | string\|null          | —                                | Required unless `query_vector` is provided.                                           |
| `query_vector`         | float[]\|null         | —                                | Bypass embedding by sending a vector. Length should match `embedding_dim`.            |
| `top_k`                | int (1..1000)         | `10`                             | Final results returned. Must be ≤ `candidate_k`.                                      |
| `candidate_k`          | int (1..10000)        | `100`                            | Candidates pulled from each index before fusion.                                      |
| `ef_search`            | int\|null (1..1000)   | `HNSW_EF_SEARCH_DEFAULT`         | HNSW recall/speed dial. Higher values improve recall at the cost of latency.          |
| `similarity_threshold` | float\|null (-1..1)   | —                                | Drop candidates with `vector_similarity` below this. Keyword-only hits are preserved. |
| `top_p`                | float\|null (0..1]    | —                                | Nucleus cutoff over softmaxed scores.                                                 |
| `mmr`                  | bool                  | `false`                          | Apply Maximal Marginal Relevance after threshold/top_p.                               |
| `mmr_lambda`           | float (0..1)          | `0.5`                            | 0 = max diversity, 1 = max relevance.                                                 |
| `hybrid`               | bool                  | `true`                           | Combine vector ANN with full-text search.                                             |
| `fusion`               | `"rrf"`\|`"weighted"` | `"rrf"`                          | Reciprocal Rank Fusion or min-max-normalized weighted blend.                          |
| `alpha`                | float (0..1)          | `0.5`                            | Weighted fusion only — vector contribution weight.                                    |
| `rrf_k`                | int ≥1                | `60`                             | RRF constant.                                                                         |
| `filter`               | object                | `{}`                             | JSONB containment filter on `chunks.metadata` (e.g. `{"lang": "en"}`).                |
| `distance`             | `"cosine"`            | `"cosine"`                       | Only cosine supported in v1; others return 400.                                       |
| `rerank`               | bool                  | `false`                          | Enable the reranker seam (no-op until a cross-encoder is wired).                      |
| `include`              | string[]              | `["content","metadata","score"]` | Subset of `content`, `metadata`, `score`, `embedding`.                                |

Two common recipes:

- **Pure semantic:** `{"query": "...", "top_k": 5, "hybrid": false}`
- **Hybrid + diverse + filtered:** `{"query": "...", "top_k": 5, "fusion": "rrf", "mmr": true, "filter": {"lang": "en"}}`
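
To make the `top_p` and `mmr` parameters more concrete, here is a sketch of the two post-processing steps in the order the table describes (nucleus cutoff first, then MMR). It follows the textbook definitions; the function names are hypothetical, and the score scale or any softmax temperature the server applies is not specified in this README, so treat the numbers as illustrative only.

```python
import math


def nucleus_cutoff(scored: list[tuple[str, float]], top_p: float) -> list[tuple[str, float]]:
    """Keep the smallest best-first prefix whose softmax mass reaches top_p."""
    if not scored:
        return []
    ordered = sorted(scored, key=lambda item: item[1], reverse=True)
    peak = ordered[0][1]
    weights = [math.exp(score - peak) for _, score in ordered]  # stable softmax
    total = sum(weights)
    kept, mass = [], 0.0
    for item, weight in zip(ordered, weights):
        kept.append(item)
        mass += weight / total
        if mass >= top_p:
            break
    return kept


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(relevance: dict[str, float], vectors: dict[str, list[float]],
               top_k: int, mmr_lambda: float = 0.5) -> list[str]:
    """Greedy MMR: trade relevance against similarity to already-selected items."""
    remaining = list(relevance)
    selected: list[str] = []
    while remaining and len(selected) < top_k:
        def gain(chunk_id: str) -> float:
            redundancy = max(
                (cosine(vectors[chunk_id], vectors[s]) for s in selected),
                default=0.0,
            )
            return mmr_lambda * relevance[chunk_id] - (1.0 - mmr_lambda) * redundancy
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a" and "b" are near-duplicates, "c" is different but less relevant.
kept = nucleus_cutoff([("a", 2.0), ("b", 1.0), ("c", 0.0)], top_p=0.9)
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
diverse = mmr_select(relevance, vectors, top_k=2, mmr_lambda=0.5)
relevant_only = mmr_select(relevance, vectors, top_k=2, mmr_lambda=1.0)
```

With `mmr_lambda=0.5` the duplicate `b` is skipped in favor of `c`; with `mmr_lambda=1.0` the selection falls back to pure relevance order, which matches the "0 = max diversity, 1 = max relevance" note in the table.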

---

## API reference

All data endpoints require `Authorization: Bearer <key>` (or `X-API-Key: <key>`). The health probes (`/healthz`, `/readyz`) and `/metrics` are unauthenticated.
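
As an illustration of what key checking with SHA-256 hashes at rest could look like, consider the sketch below. It assumes keys are stored as hex digests and compared in constant time; the helper names and the precedence of `Authorization` over `X-API-Key` are assumptions, not the server's documented behavior.

```python
import hashlib
import hmac
from collections.abc import Mapping
from typing import Optional


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the raw key from `Authorization: Bearer <key>` or `X-API-Key`."""
    normalized = {name.lower(): value for name, value in headers.items()}
    scheme, _, token = normalized.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return normalized.get("x-api-key") or None


def hash_key(raw_key: str) -> str:
    """Only this digest is stored; the raw key is shown once at creation time."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def is_authorized(headers: Mapping[str, str], stored_hashes: set[str]) -> bool:
    raw_key = extract_api_key(headers)
    if raw_key is None:
        return False
    candidate = hash_key(raw_key)
    # compare_digest avoids leaking how many leading characters matched.
    return any(hmac.compare_digest(candidate, stored) for stored in stored_hashes)


stored = {hash_key("example-secret")}
bearer_ok = is_authorized({"Authorization": "Bearer example-secret"}, stored)
header_ok = is_authorized({"X-API-Key": "example-secret"}, stored)
rejected = is_authorized({"Authorization": "Bearer wrong"}, stored)
```

Storing only the digest means a database leak does not expose usable keys, which is why the raw key printed by `vectorrag create-key` has to be captured when it is first shown.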

| Method   | Path                          | Notes                                                                                       |
| -------- | ----------------------------- | ------------------------------------------------------------------------------------------- |
| `POST`   | `/collections`                | Create a collection. **409** on duplicate name.                                             |
| `GET`    | `/collections`                | List (`limit`/`offset` query params).                                                       |
| `GET`    | `/collections/{id}`           | Get one. **404** if missing.                                                                |
| `DELETE` | `/collections/{id}`           | Cascades to documents and chunks. **204** on success.                                       |
| `POST`   | `/collections/{id}/documents` | Submit a document. **202** on accept; **404** on missing collection; **422** on empty text. |
| `GET`    | `/documents/{id}`             | Document status + metadata.                                                                 |
| `GET`    | `/jobs/{id}`                  | Job status, progress, attempts, error.                                                      |
| `POST`   | `/collections/{id}/search`    | Hybrid search. **400** on non-cosine `distance`; **404** on missing collection.             |
| `POST`   | `/embeddings`                 | Debug: returns `{dim, embedding}` for arbitrary text.                                       |
| `GET`    | `/healthz`                    | Liveness — always `200 {"status":"ok"}`.                                                    |
| `GET`    | `/readyz`                     | Readiness — **200** if DB reachable, **503** otherwise.                                     |
| `GET`    | `/metrics`                    | Prometheus text format.                                                                     |

---

## Observability

VectorRAG ships structured JSON logs (with `X-Request-ID` correlation), Prometheus metrics on `GET /metrics`, and **optional** OpenTelemetry tracing for the retrieval pipeline and ingestion worker.

### Tracing (optional)

Install the extra and point at any OTel collector:

```bash
pip install "pgrag[otel]"

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
vectorrag serve
```

All standard OTel environment variables are honored — `OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_PROTOCOL` (`grpc` or `http/protobuf`), `OTEL_SDK_DISABLED`, `OTEL_TRACES_SAMPLER`, etc. Use any compatible backend: Jaeger, Grafana Tempo, Honeycomb, Datadog APM, Lightstep.

Without `OTEL_EXPORTER_OTLP_ENDPOINT` set, tracing is a complete no-op even with the `[otel]` extra installed: there is nothing to send spans to, so the SDK isn't initialized. Without the `[otel]` extra installed, all `span(...)` calls are pass-through context managers with negligible runtime overhead.
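
One way such a pass-through `span(...)` helper could be written is shown below. This is a sketch of the general optional-dependency pattern, not the package's actual helper: the tracer name `"vectorrag"` and the keyword-argument attribute style are assumptions. It relies only on `opentelemetry.trace.get_tracer` and `Tracer.start_as_current_span` from the OpenTelemetry API.

```python
from contextlib import contextmanager
from typing import Any, Iterator

try:
    # Available only when the optional OpenTelemetry packages are installed.
    from opentelemetry import trace
except ImportError:
    trace = None


@contextmanager
def span(name: str, **attributes: Any) -> Iterator[None]:
    """Open a span when OpenTelemetry is importable; otherwise just run the body."""
    if trace is None:
        yield
        return
    tracer = trace.get_tracer("vectorrag")
    with tracer.start_as_current_span(name, attributes=attributes):
        yield


# The call site is identical whether or not tracing is installed.
with span("vectorrag.search", top_k=5, hybrid=True):
    executed = True
```

When the API package is present but no SDK has been configured, `get_tracer` returns a no-op tracer, so the body still runs without exporting anything.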

**Spans emitted:**

| Span | Where | Key attributes |
| --- | --- | --- |
| `vectorrag.search` | per search request | `collection_id`, `top_k`, `hybrid`, `fusion`, `mmr`, `rerank` |
| `vectorrag.embed_query` | inside search, when `query` is provided | `model`, `embedding_dim` |
| `vectorrag.vector_search` | inside search | `candidate_k`, `ef_search` |
| `vectorrag.keyword_search` | inside search, when `hybrid=true` | `candidate_k` |
| `vectorrag.fuse` | inside search, when `hybrid=true` | `fusion_strategy`, `alpha` (weighted only) |
| `vectorrag.postprocess` | inside search | `similarity_threshold`, `top_p` |
| `vectorrag.rerank` | inside search, when `rerank=true` | `reranker` |
| `vectorrag.mmr` | inside search, when `mmr=true` | `mmr_lambda` |
| `vectorrag.ingest_document` | per ingestion job (worker) | `document_id`, `collection_id` |
| `vectorrag.chunk_text` | inside ingestion | `chunk_size`, `chunk_overlap`, `strategy`, `chunk_count` |
| `vectorrag.embed_chunks` | inside ingestion | `model`, `embedding_dim`, `chunk_count` |
| `vectorrag.insert_chunks` | inside ingestion | `row_count` |

The API trace tree is rooted at the FastAPI auto-instrumented `POST /collections/{id}/search` span. The worker trace tree is rooted at `vectorrag.ingest_document` — there's no cross-process trace propagation in v1.1 (API → Redis → worker is a v1.2+ item).

---

## Roadmap

- **v1.1 — Production hardening for scale.** Optional binary-quantization + rerank path for ≥10M-chunk deployments, partitioned `chunks` table, OpenTelemetry tracing.
- **v1.2 — Beyond the basics.** Cross-encoder reranker implementation behind the existing seam, batch embedding API path (50% off via Gemini batch endpoint), HyDE / query expansion experiments.

For the full architecture diagram, project structure, contributor guide, and operator runbooks, see the **[GitHub repository](https://github.com/sagarhedaoo/VectorRAG)**.

---

## License

Released under the [Apache License 2.0](https://github.com/sagarhedaoo/VectorRAG/blob/main/LICENSE).

---

## Acknowledgements

VectorRAG stands on the shoulders of:

- **[pgvector](https://github.com/pgvector/pgvector)** — Postgres vector search.
- **[FastAPI](https://fastapi.tiangolo.com/)** + **[Pydantic](https://docs.pydantic.dev/)** — the HTTP and validation layer.
- **[psycopg3](https://www.psycopg.org/)** — async Postgres driver.
- **[Arq](https://arq-docs.helpmanual.io/)** — Redis-backed async task queue.
- **[Google AI Studio](https://ai.google.dev/) / `google-genai`** — embeddings.
- **[Alembic](https://alembic.sqlalchemy.org/)** — schema migrations.
- **[testcontainers-python](https://testcontainers-python.readthedocs.io/)** — test infrastructure.
