Metadata-Version: 2.4
Name: chroma-hybrid-rrf
Version: 0.1.0
Summary: A production-grade custom LangChain retriever combining Chroma vector search and local BM25 search with Reciprocal Rank Fusion (RRF).
Author-email: Raja Rajeswaran A <rajarajeswaran2001@gmail.com>
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: langchain-core>=0.1.0
Requires-Dist: langchain-community>=0.0.10
Requires-Dist: chromadb>=0.4.0
Requires-Dist: rank-bm25>=0.2.2
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.20.0; extra == "dev"
Dynamic: license-file

# 🚀 Chroma-Hybrid-RRF

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![LangChain](https://img.shields.io/badge/LangChain-%E2%9A%A1-orange)](https://github.com/langchain-ai/langchain)
[![Chroma](https://img.shields.io/badge/ChromaDB-VectorSearch-blue)](https://github.com/chroma-core/chroma)

`chroma-hybrid-rrf` is a production-grade custom **LangChain-compatible retriever** that merges dense semantic vector search (via **ChromaDB**) with sparse keyword search (via **BM25**) using **Reciprocal Rank Fusion (RRF)**.

Keyword matching catches exact terms (identifiers, rare names) that embeddings can blur, while vector search catches synonyms and paraphrases that keyword matching misses. Fusing the two result lists makes retrieval more precise and more robust than either method alone.

---

## 📐 How It Works: Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion scores each document solely by its rank position in each retriever's result list. It never compares raw similarity scores or distances, which are not on a common scale between vector search and BM25.

The RRF score for a document $d$ across retrieval models $M$ is calculated as:

$$RRF\_Score(d \in D) = \sum_{m \in M} \frac{1}{k + r_m(d)}$$

Where:
* $M$: The set of retrievers (Dense vector search + Sparse BM25 keyword search).
* $r_m(d)$: The 1-based rank position of document $d$ in the result list returned by retriever $m$.
* $k$: A constant smoothing parameter (default `60`) that flattens the gap between adjacent rank positions, so a top rank from a single retriever cannot dominate the fused score.
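
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's internal code: the helper name `rrf_fuse` and the document IDs are hypothetical, and the two input lists stand in for the ranked results that Chroma and BM25 would return.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=4):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    `rankings` is a list of ranked ID lists, one per retriever, best hit first.
    Returns up to `top_n` (doc_id, score) pairs, highest fused score first.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        # r_m(d) is 1-based, so the best hit contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical result orders from the two retrievers
dense_ranking = ["doc_a", "doc_b", "doc_c"]   # stand-in for Chroma results
sparse_ranking = ["doc_b", "doc_d", "doc_a"]  # stand-in for BM25 results

fused = rrf_fuse([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.6f}")
```

Here `doc_b` (ranks 2 and 1) scores $\frac{1}{62} + \frac{1}{61}$ and edges out `doc_a` (ranks 1 and 3) at $\frac{1}{61} + \frac{1}{63}$. Documents found by only one retriever receive a single term and fall below both.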

---

## 🛠️ Key Features

* **Dual-retrieval pipelines**: Performs dense search via ChromaDB and sparse keyword search via BM25.
* **Auto-Sync indexing**: Dynamically pulls and indexes documents from ChromaDB to construct the BM25 search corpus automatically.
* **Metadata preservation**: Retains all original source metadata and appends the calculated `rrf_score` for debugging and evaluation.
* **LangChain BaseRetriever compliance**: Drops into LCEL (LangChain Expression Language) chains composed with `|`.
* **Async-ready**: Supports standard async calling conventions (`ainvoke`).
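
To make the sparse half of the pipeline concrete, here is a small standard-library sketch of Okapi BM25 ranking over a list of document strings, the kind of corpus the retriever would pull from Chroma. It is an illustration under assumptions, not the package's implementation (which depends on `rank-bm25`): the function name `bm25_rank`, the whitespace tokenizer, the smoothed IDF variant, and the `k1`/`b` values are all choices made for this example.

```python
import math
from collections import Counter


def bm25_rank(query, corpus, k1=1.5, b=0.75):
    """Return corpus indices ordered by Okapi BM25 score, best first."""
    # Assumption: naive lowercase whitespace tokenization
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: how many documents contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF variant that stays non-negative
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Hypothetical corpus standing in for documents synced from Chroma
corpus = [
    "LangGraph orchestrates multi-agent workflows",
    "Chroma stores dense vector embeddings",
    "BM25 ranks documents by keyword overlap",
]
ranking = bm25_rank("keyword ranks", corpus)
print(ranking[0])  # index of the best keyword match
```

The ordered indices produced this way play the role of $r_m(d)$ for the BM25 retriever when the ranks are fused.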

---

## 📦 Installation

To install `chroma-hybrid-rrf` locally in editable mode for development:

```bash
git clone https://github.com/Raj2001A/chroma-hybrid-rrf.git
cd chroma-hybrid-rrf
python -m venv venv
source venv/bin/activate  # On Windows: .\venv\Scripts\activate
pip install -e ".[dev]"  # quotes keep shells such as zsh from expanding the brackets
```

---

## ⚡ Quick Start

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings  # separate install: pip install langchain-openai
from chroma_hybrid_rrf import ChromaHybridRRFRetriever

# 1. Initialize dense Chroma Vector Store (OpenAIEmbeddings reads OPENAI_API_KEY)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(
    collection_name="my_docs", 
    embedding_function=embeddings, 
    persist_directory="./chroma_db"
)

# 2. Create the Custom Hybrid RRF Retriever
retriever = ChromaHybridRRFRetriever(
    chroma_vectorstore=vectorstore,
    rrf_k=60,       # RRF constant k
    top_n=4         # Return top 4 fused documents
)

# 3. Retrieve fused documents
query = "Explain LangGraph multi-agent orchestration"
fused_docs = retriever.invoke(query)

for rank, doc in enumerate(fused_docs, start=1):
    print(f"Rank {rank} | Score: {doc.metadata['rrf_score']:.6f}")
    print(f"Content: {doc.page_content}\n")
```

---

## 🧪 Evaluation via RAGAS

Evaluating retrieval precision is critical for building production-grade RAG systems. The **RAGAS** framework (installed separately, along with `datasets`) can measure the effectiveness of this retriever on two retrieval metrics:

* **Context Precision**: Measures how well the retriever ranks relevant documents at the top.
* **Context Recall**: Verifies if all relevant ground-truth facts are successfully retrieved.

### Set Up RAGAS Evaluation

```python
from ragas import evaluate
from ragas.metrics import context_precision, context_recall
from datasets import Dataset

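# Construct your evaluation dataset.
# Column names follow this snippet's convention and may differ between RAGAS versions.
# `retriever` is the ChromaHybridRRFRetriever built in the Quick Start.
question = "How do you orchestrate agents?"
eval_data = {
    "question": [question],
    # Retrieve contexts for the same question that is being evaluated
    "contexts": [[doc.page_content for doc in retriever.invoke(question)]],
    "ground_truth": ["LangGraph is used for building stateful, multi-actor applications with LLMs."]
}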

dataset = Dataset.from_dict(eval_data)
results = evaluate(dataset, metrics=[context_precision, context_recall])
print(results)
```

---

## 🧪 Testing

To run the test suite and verify calculation correctness:

```bash
pytest tests/
```

---

## 🤝 Contributing

Contributions are highly welcome! To contribute:
1. Fork the repository.
2. Create a new feature branch: `git checkout -b feat/your-feature`.
3. Write your changes and add tests.
4. Run `pytest` to make sure all tests pass.
5. Push to your branch and open a Pull Request.

---

## 📜 License

Distributed under the MIT License. See `LICENSE` for details.
