Metadata-Version: 2.4
Name: irkit-py
Version: 0.2.0
Summary: Pluggable distributed hybrid information retrieval toolkit
Author-email: Venu Arvind Arangarajan <venuarvinda066@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/VenuArvind/IRKit
Project-URL: Documentation, https://github.com/VenuArvind/IRKit#readme
Project-URL: Repository, https://github.com/VenuArvind/IRKit
Keywords: information retrieval,search,BM25,semantic search,distributed,NLP
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: faiss-cpu>=1.7.4
Requires-Dist: sentence-transformers>=2.7.0
Requires-Dist: openai>=1.0.0
Requires-Dist: datasets>=2.0.0
Requires-Dist: Wikipedia-API>=0.6.0
Requires-Dist: fastapi>=0.110.0
Requires-Dist: uvicorn[standard]>=0.29.0
Requires-Dist: redis>=5.0.0
Requires-Dist: psycopg2-binary>=2.9.9
Requires-Dist: typer>=0.9.0
Requires-Dist: rich>=13.0.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: httpx>=0.27.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-asyncio; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# 🚀 IRKit: Distributed Hybrid Information Retrieval Toolkit

[![Python Version](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.100+-009688.svg)](https://fastapi.tiangolo.com/)

**IRKit** is a modular, high-performance Information Retrieval (IR) library built for the modern era of Semantic Search. It combines **BM25 Keyword Search** with **Vector Embeddings** and **Cross-Encoder Reranking** to provide search precision.

---

## ✨ Key Features

- **🔍 Hybrid Search**: Combines BM25 and Semantic search using **Reciprocal Rank Fusion (RRF)**.
- **🌐 Distributed Architecture**: Implemented **Consistent Hashing** over a sharded Redis cluster, ensuring O(1) metadata lookups and seamless horizontal scaling.
- **⚡ Performance First**: Guaranteed sub-5ms search latency via a custom **Semantic Vector Cache**; integrated P50/P95/P99 latency tracking for real-time observability.
- **🚀 Production-Ready**: Fully containerized with **Docker & Docker Compose**; designed for high-availability deployment on **GCP (Google Cloud Platform)** or Hugging Face.

---

## 🎨 Frontend Demo

Experience the engine visually through a React-powered search interface.

1. **Start the API**: `irkit serve --source arxiv --max-docs 500`
2. **Start the UI**: `cd demo && npm run dev`
3. **Open**: `http://localhost:5173`

---

## 📊 Benchmarks & Quality

*Verified actual numbers (M5 MacBook Pro) - 20,000 Documents.*

### Search Performance (20,000 Docs @ 384-dim)
| Mode | Search Latency | Indexing Time* | Peak RSS Memory |
|------|----------------|---------------|-----------------|
| **Baseline (float32)** | 1.41 ms | 0.149 s | 610 MB |
| **SQ8 (uint8)** | 7.65 ms | 0.122 s | 677 MB |
| **PQ (8-subspaces)** | **1.15 ms** | 1.028 s | 677 MB |

\*Indexing includes K-Means training for PQ.

### Precision (ArXiv - 200 docs)
| Ranking Mode | Mean MRR | Mean nDCG@10 |
|--------------|----------|--------------|
| BM25 Only | 0.8750 | 1.0000 |
| Semantic Only | 1.0000 | 1.0000 |
| **Hybrid + Reranking** | **1.0000** | **1.0000** |

---

## 🏗️ Technical Architecture

IRKit is built on four core pillars:

1.  **Sources**: Native ingestors for ArXiv, Wikipedia, News RSS, and Local Files.
2.  **Embedders**: Support for local HuggingFace models or OpenAI API.
3.  **Rankers**: Advanced relevance algorithms including BM25, Semantic FAISS, Hybrid RRF, and Cross-Encoder Reranking.
4.  **Core Services**: Latency tracking, Semantic Caching, and Scientific Evaluation modules.

---

## 🧪 Testing

```bash
# Run full test suite with coverage
pytest tests/ -v --cov=irkit

# Run performance & quality benchmarks
python3 scripts/evaluate_quality.py
python3 scripts/benchmark_cache.py
```
