Metadata-Version: 2.4
Name: omnidoc-rag
Version: 0.1.1
Summary: RAG pipeline for omnidoc-sdk — intent-aware chunking, evaluation, streaming, graph linking, and vector DB integrations
Author-email: Ganesh Kinkar Giri <k.ganeshgiri@example.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/your-org/omnidoc-rag
Project-URL: Documentation, https://github.com/your-org/omnidoc-rag#readme
Project-URL: Issues, https://github.com/your-org/omnidoc-rag/issues
Keywords: rag,semantic-chunking,document-intelligence,vector-database,agentic-ai,llm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: omnidoc-sdk>=0.4.0
Provides-Extra: chroma
Requires-Dist: chromadb>=0.4.0; extra == "chroma"
Provides-Extra: pinecone
Requires-Dist: pinecone-client>=3.0.0; extra == "pinecone"
Provides-Extra: weaviate
Requires-Dist: weaviate-client>=4.0.0; extra == "weaviate"
Provides-Extra: pgvector
Requires-Dist: psycopg2-binary>=2.9.0; extra == "pgvector"
Requires-Dist: pgvector>=0.2.0; extra == "pgvector"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Provides-Extra: all
Requires-Dist: chromadb>=0.4.0; extra == "all"
Requires-Dist: pinecone-client>=3.0.0; extra == "all"
Requires-Dist: weaviate-client>=4.0.0; extra == "all"
Requires-Dist: psycopg2-binary>=2.9.0; extra == "all"
Requires-Dist: pgvector>=0.2.0; extra == "all"
Dynamic: license-file

<div align="center">

<h1>omnidoc-rag</h1>

<p><strong>Intent-aware RAG pipeline for the OmniDoc document intelligence ecosystem</strong></p>

<p>
  <img src="https://img.shields.io/badge/python-3.9%2B-blue?style=flat-square" alt="Python 3.9+"/>
  <img src="https://img.shields.io/badge/version-0.1.1-green?style=flat-square" alt="v0.1.1"/>
  <img src="https://img.shields.io/badge/license-Apache--2.0-orange?style=flat-square" alt="Apache 2.0"/>
  <img src="https://img.shields.io/badge/intents-6-purple?style=flat-square" alt="6 intent types"/>
  <img src="https://img.shields.io/badge/vector%20DBs-4-teal?style=flat-square" alt="4 vector DB adapters"/>
</p>

</div>

---

## What is omnidoc-rag?

`omnidoc-rag` is the companion RAG SDK for [`omnidoc-sdk`](https://github.com/your-org/omnidoc-python-sdk). It takes the clean `Document` objects produced by the extraction layer and turns them into vector-DB-ready semantic chunks with:

- **Intent classification** — 6 canonical intent types (metric, table, process, value_proposition, heading, narrative)
- **Adaptive chunking** — token budget varies by intent; overlap preserves context across boundaries
- **Deterministic confidence scoring** — per-chunk quality signal for retrieval ranking
- **Streaming** — true lazy generator that emits one chunk at a time
- **Retrieval evaluation** — query-term coverage, source diversity, verdict
- **Graph linking** — NEXT / SAME\_INTENT / METRIC\_OF edges between chunks
- **Cross-document stitching** — merge equivalent sections from multiple documents
- **Vector DB adapters** — ChromaDB, Pinecone, Weaviate, PostgreSQL/pgvector

---

## Table of Contents

1. [Installation](#installation)
2. [Quick Start](#quick-start)
3. [Intent Types](#intent-types)
4. [Chunking](#chunking)
5. [Streaming](#streaming)
6. [Confidence Scoring](#confidence-scoring)
7. [Retrieval Evaluation](#retrieval-evaluation)
8. [Graph Linking](#graph-linking)
9. [Cross-Document Stitching](#cross-document-stitching)
10. [Vector DB Adapters](#vector-db-adapters)
    - [ChromaDB](#chromadb)
    - [Pinecone](#pinecone)
    - [Weaviate](#weaviate)
    - [PostgreSQL / pgvector](#postgresql--pgvector)
11. [Schema Reference](#schema-reference)
12. [Optional Extras](#optional-extras)
13. [Contributing & Development](#contributing--development)
    - [Setup](#setup)
    - [Project Structure](#project-structure)
    - [Running Tests](#running-tests)
    - [Building & Publishing](#building--publishing)
    - [Versioning](#versioning)
    - [Release Checklist](#release-checklist)
    - [Troubleshooting](#troubleshooting)
14. [Changelog](#changelog)

---

## Installation

### Core (no vector DB)

```bash
pip install omnidoc-rag
```

### With ChromaDB

```bash
pip install "omnidoc-rag[chroma]"
```

### With Pinecone

```bash
pip install "omnidoc-rag[pinecone]"
```

### With Weaviate

```bash
pip install "omnidoc-rag[weaviate]"
```

### With PostgreSQL / pgvector

```bash
pip install "omnidoc-rag[pgvector]"
```

### Everything

```bash
pip install "omnidoc-rag[all]"
```

---

## Quick Start

```python
from omnidoc.loader.load import load_document      # omnidoc-sdk
from omnidocrag.chunker import chunk_document
from omnidocrag.evaluation import evaluate_rag_result

# 1. Extract
doc = load_document("investor_deck.pdf")

# 2. Chunk
chunks = chunk_document(doc)

for c in chunks:
    print(f"[{c.intent:<18}] conf={c.confidence:.2f}  p{c.page}  {c.text[:80]}")

# 3. Evaluate a retrieval result
result = evaluate_rag_result(
    query="What was the revenue growth rate?",
    answer="Revenue grew 24% year-over-year to $4.2B.",
    chunks=chunks,
)
print(result["overall"], result["verdict"])   # 0.87  excellent
```

---

## Intent Types

Every chunk is labelled with one of six canonical intents. The intent drives chunk sizing and confidence scoring.

| Intent | Token budget | Typical content |
|--------|-------------|-----------------|
| `heading` | 60 | Section/slide titles |
| `metric` | 150 | KPIs, financial figures, percentages |
| `table` | 200 | One row from an extracted table |
| `value_proposition` | 250 | Benefits, ROI claims, competitive statements |
| `narrative` | 350 | Prose, analysis, background paragraphs |
| `process` | 400 | Numbered steps, workflows, procedures |

Classification uses a deterministic regex classifier — no LLM call required:

```python
from omnidocrag.intent import classify_intent

classify_intent("Revenue grew 24% YoY to $4.2B")      # "metric"
classify_intent("Step 1: Configure the API key")       # "process"
classify_intent("This solution reduces costs by 30%")  # "value_proposition"
classify_intent("EXECUTIVE SUMMARY")                   # "heading"
classify_intent("The company was founded in 2012.")    # "narrative"
```
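
The per-intent token budgets from the table above can be mirrored as a plain lookup. This is a sketch, not the library's code; in particular, falling back to the narrative budget for an unrecognised label is an assumption.

```python
TOKEN_BUDGETS = {
    "heading": 60,
    "metric": 150,
    "table": 200,
    "value_proposition": 250,
    "narrative": 350,
    "process": 400,
}

def tokens_for_intent(intent: str) -> int:
    # Fallback to the narrative budget for unknown labels is an
    # assumption; the real function's fallback is not documented here.
    return TOKEN_BUDGETS.get(intent, 350)
```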

---

## Chunking

`chunk_document` converts a `Document` object into a list of `SemanticChunk` dataclasses.

```python
from omnidocrag.chunker import chunk_document

chunks = chunk_document(
    doc,
    overlap_chars=100,   # characters carried from previous chunk (default: 100)
    min_chars=20,        # discard chunks shorter than this (default: 20)
)
```

### What it does

- Iterates over `doc.sections` line by line
- Detects headings — flushes the current buffer and starts a new heading chunk
- Uses `classify_intent()` on the first line of each new buffer to set the intent
- Token budget comes from `tokens_for_intent(intent)` (see intent table above)
- When the buffer exceeds the budget it is flushed as a chunk; last `overlap_chars` characters carry over
- Each row of `doc.tables` becomes a separate `metric` chunk with a header prefix
- Chunk IDs are SHA1 hashes of `source + page + text` — deterministic and reproducible
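
The deterministic-ID scheme in the last bullet can be sketched with the standard library. Only the SHA1-of-`source + page + text` part is stated above; the exact concatenation and separator used here are assumptions.

```python
import hashlib

def chunk_id(source: str, page: int, text: str) -> str:
    # Same inputs always hash to the same 40-char hex digest, so
    # re-chunking a document and re-upserting never duplicates entries.
    payload = f"{source}|{page}|{text}".encode("utf-8")  # "|" separator is an assumption
    return hashlib.sha1(payload).hexdigest()

same_id = chunk_id("deck.pdf", 3, "Revenue grew 24%")
```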

### Chunk fields

```python
chunk.id           # str — SHA1 deterministic ID
chunk.text         # str — chunk content
chunk.intent       # str — one of the 6 intent types
chunk.confidence   # float — 0.1 … 1.0
chunk.page         # int — source page number
chunk.heading      # str | None — nearest heading above this chunk
chunk.keywords     # List[str] — BM25-weighted non-stopword terms
chunk.metadata     # dict — source, chunk_index, char_length, embedding_hint

chunk.to_dict()    # → dict, ready for JSON / vector DB
```

---

## Streaming

`stream_chunks` is a true Python generator — chunks are computed and emitted one at a time without building a full list first. Use this for large documents or memory-constrained environments.

```python
from omnidocrag.stream import stream_chunks

for chunk in stream_chunks(doc, overlap_chars=100):
    # process or upsert each chunk immediately
    print(chunk.text[:80])
```

The generator produces the same chunks as `chunk_document` (same algorithm, same overlap logic) but yields each one lazily.
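
Because nothing is materialised up front, the stream pairs naturally with batched upserts. The `batched` helper below is a stdlib-only sketch (it is not part of omnidoc-rag; on Python 3.12+ `itertools.batched` does the same job):

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched(iterable: Iterable, size: int) -> Iterator[List]:
    # Pull `size` items at a time from any iterator without
    # building the whole stream in memory.
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# e.g. one vector DB round trip per 100 chunks:
# for batch in batched(stream_chunks(doc), 100):
#     adapter.upsert(batch)
```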

---

## Confidence Scoring

`score_chunk` returns a float in `[0.1, 1.0]` — a deterministic quality signal based on text density, length, and structure.

```python
from omnidocrag.confidence import score_chunk

score_chunk("Revenue grew 24% YoY to $4.2B in fiscal 2024.")   # ≥ 0.8
score_chunk("ROI: 38%", intent="metric")                        # ≥ 0.7 (not penalised for short length)
score_chunk("See below")                                        # < 0.7
score_chunk("")                                                  # 0.1 (floor)
```

**Scoring factors:**
- Length bonus (scaled up to `≥ 100 chars`)
- Multi-line bonus (`≥ 4 lines`)
- Dense-fact pattern bonus (financial terms, percentages, currency)
- Short-text penalty — **not applied** to `metric`, `table`, or `value_proposition` intents

---

## Retrieval Evaluation

Score a RAG result without an LLM. All logic is deterministic and runs locally.

```python
from omnidocrag.evaluation import evaluate_rag_result

result = evaluate_rag_result(
    query="What was the EBITDA margin in fiscal 2024?",
    answer="EBITDA margin reached 28% driven by cost efficiencies.",
    chunks=chunks,                   # List[SemanticChunk] retrieved
)
```

### Return value

```python
{
    "overall":          0.84,        # composite score 0.0 – 1.0
    "coverage":         0.75,        # fraction of query terms found in chunks
    "confidence":       0.91,        # average chunk confidence
    "source_diversity": 3,           # unique pages used
    "chunks_used":      4,           # number of chunks evaluated
    "verdict":          "good",      # "excellent" | "good" | "weak" | "unsafe"
    "missing_terms":    ["ebitda"],  # query terms absent from chunks
}
```
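
The `coverage` and `missing_terms` fields can be approximated like this. The stopword list and the crude substring matching are assumptions; the real implementation presumably tokenises the chunks properly.

```python
import re

STOPWORDS = {"what", "was", "the", "in", "a", "of"}   # illustrative list

def term_coverage(query, chunk_texts):
    # Fraction of non-stopword query terms found anywhere in the
    # retrieved chunks, plus the terms that are missing.
    terms = [t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in STOPWORDS]
    corpus = " ".join(chunk_texts).lower()
    missing = [t for t in terms if t not in corpus]    # crude substring match
    covered = 1 - len(missing) / len(terms) if terms else 0.0
    return covered, missing
```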

### Verdict thresholds

| Verdict | Condition |
|---------|-----------|
| `unsafe` | no chunks provided |
| `weak` | overall < 0.5 |
| `good` | 0.5 ≤ overall < 0.75 |
| `excellent` | overall ≥ 0.75 |
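
The threshold table maps directly to a cascade of checks, sketched here as a standalone re-statement (not the library's own function):

```python
def verdict(overall: float, chunks_used: int) -> str:
    # Direct re-statement of the threshold table above.
    if chunks_used == 0:
        return "unsafe"
    if overall < 0.5:
        return "weak"
    if overall < 0.75:
        return "good"
    return "excellent"
```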

---

## Graph Linking

Build a lightweight in-memory knowledge graph from a list of chunks.

```python
from omnidocrag.graph import link_chunks

graph = link_chunks(chunks, source="investor_deck.pdf")

graph["nodes"]   # [{"id": ..., "text": ..., "intent": ..., ...}, ...]
graph["edges"]   # [{"from": ..., "to": ..., "relation": ...}, ...]
```

### Edge types

| Relation | Description |
|----------|-------------|
| `NEXT` | Sequential order — every adjacent pair of chunks |
| `SAME_INTENT` | Consecutive chunks sharing the same intent |
| `METRIC_OF` | Metric chunk → nearest preceding heading chunk |
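
The node/edge dict shapes shown above make the graph easy to traverse without any graph library. For example, grouping metric texts under their headings via `METRIC_OF` edges (a sketch using the `id`/`text`/`from`/`to`/`relation` keys from the example output; the demo graph is invented):

```python
def metrics_by_heading(graph: dict) -> dict:
    # Walk METRIC_OF edges and group metric-chunk texts under
    # the heading chunk they point to.
    nodes = {n["id"]: n for n in graph["nodes"]}
    grouped = {}
    for edge in graph["edges"]:
        if edge["relation"] == "METRIC_OF":
            heading = nodes[edge["to"]]["text"]
            grouped.setdefault(heading, []).append(nodes[edge["from"]]["text"])
    return grouped

demo = {
    "nodes": [
        {"id": "h1", "text": "FY2024 Results", "intent": "heading"},
        {"id": "m1", "text": "Revenue grew 24%", "intent": "metric"},
    ],
    "edges": [
        {"from": "h1", "to": "m1", "relation": "NEXT"},
        {"from": "m1", "to": "h1", "relation": "METRIC_OF"},
    ],
}
```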

---

## Cross-Document Stitching

Merge semantically equivalent sections from multiple documents into a single unified chunk set.

```python
from omnidocrag.stitcher import stitch_documents

# Each item: {"metadata": {...}, "chunks": [chunk.to_dict(), ...]}
docs = [
    {"metadata": {"file": "q3_report.pdf"},     "chunks": [c.to_dict() for c in chunks_q3]},
    {"metadata": {"file": "q4_report.pdf"},     "chunks": [c.to_dict() for c in chunks_q4]},
    {"metadata": {"file": "annual_report.pdf"}, "chunks": [c.to_dict() for c in chunks_annual]},
]

merged = stitch_documents(docs, similarity_threshold=0.80)

# Merged chunks have a "sources" list
for chunk in merged:
    if len(chunk["sources"]) > 1:
        print(f"Merged from: {chunk['sources']}  — {chunk['text'][:80]}")
```

Two chunks are merged when they share the same intent **and** their headings compare with `SequenceMatcher` similarity ≥ `similarity_threshold`. The merged chunk's text is the highest-confidence version; `sources` lists all contributing files.
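
To get a feel for what a 0.80 threshold accepts, you can compute the same ratio yourself with stdlib `difflib` (lowercasing before comparison is an assumption):

```python
from difflib import SequenceMatcher

def heading_similarity(a: str, b: str) -> float:
    # SequenceMatcher.ratio() = 2 * matched_chars / total_chars, in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

For instance, `"Revenue Overview"` vs `"Revenue Overview (cont.)"` scores exactly 0.8 (32 matched characters over a combined length of 40), so it sits right at the default threshold.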

---

## Vector DB Adapters

All four adapters share the same `upsert(chunks)` / `query(query_text)` interface and accept either `SemanticChunk` objects or plain dicts.

---

### ChromaDB

Local or persistent ChromaDB. No embedding function required — ChromaDB provides a built-in one.

```python
from omnidocrag.vectordb.chroma import ChromaAdapter

# In-memory (default)
adapter = ChromaAdapter(collection_name="omnidoc")

# Persistent on disk
import chromadb
client = chromadb.PersistentClient(path="/data/chroma")
adapter = ChromaAdapter(collection_name="omnidoc", client=client)

# Custom embedding function
adapter = ChromaAdapter(
    collection_name="omnidoc",
    embedding_fn=lambda texts: my_model.encode(texts).tolist(),
)

# Upsert
count = adapter.upsert(chunks)          # accepts SemanticChunk or dict

# Query
results = adapter.query(
    query_text="What was the Q3 revenue growth?",
    n_results=5,
    where={"intent": "metric"},         # optional metadata filter
)
for r in results:
    print(r["score"], r["text"][:80])
```

Requires `pip install "omnidoc-rag[chroma]"`.

---

### Pinecone

Pinecone serverless or pod-based index. Embedding function is required.

```python
from omnidocrag.vectordb.pinecone import PineconeAdapter

adapter = PineconeAdapter(
    index_name="omnidoc-prod",          # must already exist in Pinecone
    embedding_fn=lambda text: model.encode(text).tolist(),
    api_key="pc-xxxxxxxxxxxxx",
    namespace="reports",                # optional
)

# Upsert
count = adapter.upsert(chunks)

# Query
results = adapter.query(query_text="EBITDA margin", top_k=5)
for r in results:
    print(r["score"], r["text"][:80])
```

Requires `pip install "omnidoc-rag[pinecone]"`.

---

### Weaviate

Weaviate v4 — local instance or Weaviate Cloud (WCD). Collection is created automatically.

```python
from omnidocrag.vectordb.weaviate import WeaviateAdapter

# Local Weaviate (localhost:8080)
adapter = WeaviateAdapter(
    collection_name="OmnidocChunks",
    embedding_fn=lambda text: model.encode(text).tolist(),
)

# Weaviate Cloud (WCD)
adapter = WeaviateAdapter(
    collection_name="OmnidocChunks",
    embedding_fn=lambda text: model.encode(text).tolist(),
    wcd_url="https://my-cluster.weaviate.network",
    wcd_api_key="wcd-api-key",
)

# Existing connected client
import weaviate
client = weaviate.connect_to_local()
adapter = WeaviateAdapter(collection_name="OmnidocChunks", client=client)

# Upsert — idempotent (deterministic UUID per chunk ID)
count = adapter.upsert(chunks)

# Query — vector search when embedding_fn provided, BM25 fallback otherwise
from weaviate.classes.query import Filter
results = adapter.query(
    query_text="revenue growth",
    limit=5,
    filters=Filter.by_property("intent").equal("metric"),  # optional
    certainty=0.7,                                         # min cosine similarity
)
for r in results:
    print(r["score"], r["text"][:80])

# Always close the connection when done
adapter.close()
```

Requires `pip install "omnidoc-rag[weaviate]"`.

---

### PostgreSQL / pgvector

Stores embeddings in a PostgreSQL table using the `vector` column type. Table and IVFFlat index are created automatically.

**Prerequisites:**
```sql
-- Run once on your PostgreSQL instance
CREATE EXTENSION IF NOT EXISTS vector;
```

```python
from omnidocrag.vectordb.pgvector import PgVectorAdapter
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

adapter = PgVectorAdapter(
    embedding_fn=lambda text: model.encode(text).tolist(),
    dsn="postgresql://user:password@localhost:5432/ragdb",
    table="omnidoc_chunks",            # created automatically
    dimensions=384,                    # must match embedding_fn output size
    create_index=True,                 # IVFFlat index for fast ANN search
)

# Upsert — ON CONFLICT DO UPDATE, safe to call repeatedly
count = adapter.upsert(chunks)

# Query by cosine similarity (<=> operator)
results = adapter.query(query_text="revenue growth", top_k=5)

# Query with SQL filter
results = adapter.query(
    query_text="revenue growth",
    top_k=5,
    where="intent = %s AND page > %s",
    where_params=("metric", 2),
)
for r in results:
    print(r["score"], r["text"][:80])

# Delete specific chunks
adapter.delete(["chunk-id-1", "chunk-id-2"])

# Context manager — closes connection automatically
with PgVectorAdapter(embedding_fn=..., dsn=..., dimensions=384) as adapter:
    adapter.upsert(chunks)
    results = adapter.query("revenue", top_k=5)
```

Requires `pip install "omnidoc-rag[pgvector]"`.

---

### Adapter comparison

| | ChromaDB | Pinecone | Weaviate | pgvector |
|--|----------|----------|----------|----------|
| Embedding fn required | No (built-in) | Yes | No (BM25 fallback) | Yes |
| Self-hosted | Yes | No | Yes / WCD | Yes |
| Persistent by default | No (in-memory) | Yes | Yes | Yes |
| Filter on query | Yes (`where` dict) | No | Yes (Filter API) | Yes (raw SQL) |
| Similarity metric | Cosine (distance) | Cosine | Cosine / certainty | Cosine (`<=>`) |
| Install extra | `chroma` | `pinecone` | `weaviate` | `pgvector` |

---

## Schema Reference

```python
from omnidocrag.schema import SemanticChunk
from dataclasses import fields

for f in fields(SemanticChunk):
    print(f.name, f.type)

# id           str
# text         str
# intent       str    — metric|table|process|value_proposition|heading|narrative
# confidence   float  — 0.1 … 1.0
# page         int    — default 1
# heading      Optional[str]
# keywords     List[str]
# metadata     Dict[str, Any]
```

`SemanticChunk.to_dict()` returns a plain `dict` safe for JSON serialisation and vector DB metadata fields.

---

## Optional Extras

| Extra | Install | Unlocks |
|-------|---------|---------|
| `chroma` | `pip install "omnidoc-rag[chroma]"` | ChromaDB adapter |
| `pinecone` | `pip install "omnidoc-rag[pinecone]"` | Pinecone adapter |
| `weaviate` | `pip install "omnidoc-rag[weaviate]"` | Weaviate v4 adapter |
| `pgvector` | `pip install "omnidoc-rag[pgvector]"` | PostgreSQL + pgvector adapter |
| `all` | `pip install "omnidoc-rag[all]"` | All four adapters |

---

## Contributing & Development

### Setup

```bash
git clone https://github.com/your-org/omnidoc-rag.git
cd omnidoc-rag
python -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate

# Editable install with all extras
pip install -e ".[all]"

# Dev tools
pip install build twine pytest pytest-cov ruff black mypy
```

Verify:

```bash
python -c "import omnidocrag; print('OK')"
```

`omnidoc-sdk` is a required dependency — pip resolves it automatically, but you can also install it explicitly or from a local checkout:

```bash
pip install omnidoc-sdk
# or from local source:
pip install -e ../omnidoc-sdk
```

---

### Project Structure

```
omnidoc-rag/
├── omnidocrag/
│   ├── __init__.py          # Public API — lazy wrappers for all modules
│   ├── schema.py            # SemanticChunk dataclass
│   ├── intent.py            # classify_intent() — deterministic regex classifier
│   ├── adaptive.py          # tokens_for_intent() — per-intent token budgets
│   ├── confidence.py        # score_chunk() — deterministic quality scoring
│   ├── hybrid_metadata.py   # hybrid_metadata() — BM25 keywords + SHA1 hint
│   ├── chunker.py           # chunk_document() — full list output
│   ├── stream.py            # stream_chunks() — lazy generator
│   ├── evaluation.py        # evaluate_rag_result() — retrieval scoring
│   ├── graph.py             # link_chunks() — NEXT / SAME_INTENT / METRIC_OF
│   ├── stitcher.py          # stitch_documents() — cross-doc merging
│   └── vectordb/
│       ├── __init__.py      # Adapter index + docstring
│       ├── chroma.py        # ChromaAdapter
│       ├── pinecone.py      # PineconeAdapter
│       ├── weaviate.py      # WeaviateAdapter (v4 Collections API)
│       └── pgvector.py      # PgVectorAdapter (psycopg2 + pgvector)
├── tests/
│   ├── __init__.py
│   ├── test_rag.py          # 62 tests — intent, confidence, chunker, eval, graph, stitcher, stream
│   └── test_vectordb.py     # 49 tests — all 4 vector DB adapters (mock-based, no live DB)
├── pyproject.toml
└── README.md
```

---

### Running Tests

The test suite uses fake `Document` objects and mocked DB clients — no `omnidoc-sdk` or live database required to run tests.

```bash
# All 111 tests
pytest tests/ -v

# Single class
pytest tests/test_rag.py::TestChunkDocument -v
pytest tests/test_vectordb.py::TestChromaAdapter -v

# With coverage
pytest tests/ --cov=omnidocrag --cov-report=term-missing

# Skip if vector DB packages not installed
pytest tests/ -v -m "not integration"
```

**Coverage report (111 tests — 100% coverage):**

| Module | Stmts | Cover | Test file |
|--------|-------|-------|-----------|
| `__init__.py` | 16 | **100%** | `test_rag.py::TestTopLevelAPI` |
| `schema.py` | 10 | **100%** | `test_rag.py` |
| `adaptive.py` | 4 | **100%** | `test_rag.py` |
| `intent.py` | 30 | **100%** | `test_rag.py::TestClassifyIntent` + `TestIntentEdgeCases` |
| `confidence.py` | 23 | **100%** | `test_rag.py::TestScoreChunk` + `TestConfidenceEdgeCases` |
| `hybrid_metadata.py` | 22 | **100%** | `test_rag.py::TestHybridMetadataEdgeCases` |
| `chunker.py` | 89 | **100%** | `test_rag.py::TestChunkDocument` + `TestChunkerEdgeCases` |
| `stream.py` | 77 | **100%** | `test_rag.py::TestStreamChunks` + `TestStreamEdgeCases` |
| `evaluation.py` | 44 | **100%** | `test_rag.py::TestEvaluateRagResult` + `TestEvaluationEdgeCases` |
| `graph.py` | 30 | **100%** | `test_rag.py::TestLinkChunks` + `TestGraphEdgeCases` |
| `stitcher.py` | 29 | **100%** | `test_rag.py::TestStitchDocuments` |
| `vectordb/chroma.py` | 32 | **100%** | `test_vectordb.py::TestChromaAdapter` |
| `vectordb/pinecone.py` | 27 | **100%** | `test_vectordb.py::TestPineconeAdapter` |
| `vectordb/weaviate.py` | 69 | **100%** | `test_vectordb.py::TestWeaviateAdapter` |
| `vectordb/pgvector.py` | 80 | **100%** | `test_vectordb.py::TestPgVectorAdapter` |
| **Total** | **582** | **100%** | — |

> Vector DB adapter tests use `unittest.mock` / `sys.modules` injection — no live ChromaDB, Pinecone, Weaviate, or PostgreSQL connection required.

**Lint and type-check:**

```bash
ruff check omnidocrag/
black --check omnidocrag/
mypy omnidocrag/ --ignore-missing-imports

# Auto-fix
black omnidocrag/
ruff check omnidocrag/ --fix
```

---

### Building & Publishing

**Build:**

```bash
rm -rf dist/ build/ omnidocrag.egg-info/
python -m build
twine check dist/*
```

**Test on TestPyPI first:**

```bash
twine upload --repository testpypi dist/*

# Verify install
pip install \
  --index-url https://test.pypi.org/simple/ \
  --extra-index-url https://pypi.org/simple/ \
  omnidoc-rag
```

**Publish to PyPI:**

```bash
twine upload dist/*
pip install omnidoc-rag==0.1.1
```

**Credentials — `~/.pypirc`:**

```ini
[distutils]
index-servers = pypi testpypi

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = pypi-YOUR_TEST_TOKEN

[pypi]
repository = https://upload.pypi.org/legacy/
username = __token__
password = pypi-YOUR_PROD_TOKEN
```

```bash
chmod 600 ~/.pypirc
```

Store `PYPI_API_TOKEN` and `TEST_PYPI_API_TOKEN` as GitHub repository secrets for CI/CD publishing on version tags.

---

### Versioning

Version is defined once in `pyproject.toml`. Follow [Semantic Versioning](https://semver.org/):

| Change | Example | Bump |
|--------|---------|------|
| Bug fix | Fix coverage calculation | `0.1.0 → 0.1.1` |
| New feature | Add new vector DB adapter | `0.1.0 → 0.2.0` |
| Breaking change | Rename `SemanticChunk` fields | `0.1.0 → 1.0.0` |

> PyPI does not allow re-uploading the same version. Always bump before rebuilding.

---

### Release Checklist

```
[ ] ruff check omnidocrag/        — zero errors
[ ] black --check omnidocrag/     — no formatting changes
[ ] pytest tests/ -v              — all 111 tests pass
[ ] Version bumped in pyproject.toml
[ ] Changelog updated below
[ ] rm -rf dist/ && python -m build
[ ] twine check dist/*            — both artifacts PASSED
[ ] TestPyPI round-trip verified
[ ] twine upload dist/*           — production upload
[ ] git tag vX.Y.Z && git push origin vX.Y.Z
```

---

### Troubleshooting

**`ImportError: ChromaDB support requires the chromadb package`**

```bash
pip install "omnidoc-rag[chroma]"
```

**`ImportError: Pinecone support requires the pinecone package`**

```bash
pip install "omnidoc-rag[pinecone]"
```

**`ImportError: Weaviate support requires the weaviate-client package`**

```bash
pip install "omnidoc-rag[weaviate]"
```

**`ImportError: pgvector support requires psycopg2-binary and pgvector`**

```bash
pip install "omnidoc-rag[pgvector]"
```

**Pinecone / pgvector: embedding function is required** — both adapters need you to supply an `embedding_fn`:

```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding_fn = lambda text: model.encode(text).tolist()
```

**pgvector: `could not open extension "vector"`** — the pgvector extension is not installed on the PostgreSQL server. Run as a superuser:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

**Weaviate: `WeaviateConnectionError`** — no Weaviate instance running locally. Start one with Docker:

```bash
docker run -d -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:latest
```

Or pass `wcd_url` and `wcd_api_key` to connect to Weaviate Cloud instead.

**`evaluate_rag_result` returns `overall=0.0` / verdict `unsafe`** — chunks list is empty. Ensure `chunk_document(doc)` ran successfully and the document has non-empty sections.

**`twine check` fails with "description failed to render"**

```bash
pip install readme-renderer[md]
python -m readme_renderer README.md -o /tmp/preview.html
```

---

## Changelog

### [0.1.0] — 2026-04-10

Initial release of `omnidoc-rag` as a standalone SDK, split from `omnidoc-sdk`.

#### Added
- `schema.py` — `SemanticChunk` dataclass with `to_dict()` serialisation
- `intent.py` — `classify_intent()` deterministic regex classifier; 6 canonical labels: `metric`, `table`, `process`, `value_proposition`, `heading`, `narrative`
- `adaptive.py` — `tokens_for_intent()` per-intent token budgets (heading=60 → process=400)
- `confidence.py` — `score_chunk(text, intent=)` with intent-aware short-text penalty exemption; floor=0.1
- `hybrid_metadata.py` — `hybrid_metadata()` BM25 keyword extraction with stopword removal; SHA1 embedding hint
- `chunker.py` — `chunk_document(doc, overlap_chars, min_chars)` with heading detection, overlap carry-over, table row expansion, deterministic SHA1 chunk IDs
- `stream.py` — `stream_chunks()` true lazy generator; yields one chunk at a time
- `evaluation.py` — `evaluate_rag_result(query, answer, chunks)` returning `overall`, `coverage`, `confidence`, `source_diversity`, `verdict`, `missing_terms`
- `graph.py` — `link_chunks()` producing NEXT, SAME\_INTENT, and METRIC\_OF edges
- `stitcher.py` — `stitch_documents()` cross-document merging by heading+intent similarity via `SequenceMatcher`
- `vectordb/chroma.py` — `ChromaAdapter`: in-memory or persistent; optional custom embedding function; `where` filter on query
- `vectordb/pinecone.py` — `PineconeAdapter`: serverless and pod indexes; namespace support; `top_k` query
- `vectordb/weaviate.py` — `WeaviateAdapter`: Weaviate v4 Collections API; auto-creates collection and schema; `near_vector` or BM25 fallback; local and WCD support; deterministic UUID upserts
- `vectordb/pgvector.py` — `PgVectorAdapter`: PostgreSQL + pgvector `<=>` cosine operator; auto-creates table and IVFFlat index; `ON CONFLICT DO UPDATE` upserts; raw SQL `WHERE` filter; `delete(ids)`, `drop_table()`, context manager
- `tests/test_rag.py` — 62 tests covering all core modules; fake `Document` objects require no `omnidoc-sdk` install
- `tests/test_vectordb.py` — 49 tests covering all 4 vector DB adapters using mocks (no live DB required)
- `pyproject.toml` — `omnidoc-rag v0.1.0`; extras: `chroma`, `pinecone`, `weaviate`, `pgvector`, `all`

#### Fixed
- Import path corrected from `omnidoc_rag` to `omnidocrag` across all vectordb adapters
- `pyproject.toml` package discovery path corrected to `omnidocrag`

---

<div align="center">

**omnidoc-rag** &nbsp;·&nbsp; v0.1.1 &nbsp;·&nbsp; Apache 2.0 &nbsp;·&nbsp; Extraction layer → [omnidoc-sdk](https://github.com/your-org/omnidoc-python-sdk)

</div>
