Metadata-Version: 2.4
Name: dv-hyperrag
Version: 0.1.0
Summary: Python SDK for RAG serving optimization — RAGO Pareto scheduler + RAGCache KV cache manager
Author-email: Deep Variance Dev Team <founders@deepvariance.com>
License-Expression: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.11
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: faiss-cpu>=1.7.4
Requires-Dist: torch>=2.1
Requires-Dist: tqdm>=4.65
Provides-Extra: gpu
Requires-Dist: faiss-gpu>=1.7.4; extra == "gpu"
Requires-Dist: vllm>=0.4.0; extra == "gpu"
Requires-Dist: triton>=2.1; extra == "gpu"
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: mypy>=1.7; extra == "dev"
Provides-Extra: eval
Requires-Dist: datasets>=2.14; extra == "eval"
Requires-Dist: transformers>=4.35; extra == "eval"
Requires-Dist: sentence-transformers>=2.2; extra == "eval"
Requires-Dist: matplotlib>=3.8; extra == "eval"
Provides-Extra: all
Requires-Dist: dv-hyperrag[dev,eval,gpu]; extra == "all"

# HyperRAG

KV cache + Pareto scheduling middleware for RAG pipelines. Plugs in between your vector search and your LLM. Built on two systems papers: **RAGO** (ISCA'25) and **RAGCache** (TOCS'25).

---

## The problem

Every RAG request re-processes the same documents from scratch. At 70B params, that's ~650ms of wasted prefill before you see a single output token — and you just paid for the same compute last request.

The fix: cache the transformer's KV state per document. When a document appears again, load its cache instead of recomputing. TTFT drops in proportion to the hit rate.

This SDK manages that cache and finds the optimal GPU/batch/cache configuration for your workload.
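
The mechanism in a toy sketch (illustration only; `run_prefill` and `KVState` are stand-ins, not SDK internals):

```python
from typing import Dict, List

KVState = List[float]               # stand-in for per-layer key/value tensors
kv_cache: Dict[str, KVState] = {}   # document ID → cached KV state

def run_prefill(tokens: List[int]) -> KVState:
    # Placeholder for the expensive transformer forward pass.
    return [float(t) for t in tokens]

def prefill_document(doc_id: str, tokens: List[int]) -> KVState:
    if doc_id in kv_cache:
        return kv_cache[doc_id]     # hit: skip the forward pass entirely
    state = run_prefill(tokens)     # miss: pay the prefill cost once
    kv_cache[doc_id] = state        # admit so later requests can hit
    return state
```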

---

## Install

```bash
pip install dv-hyperrag           # core (schedule optimizer + cache planner)
pip install "dv-hyperrag[gpu]"    # + vLLM for real inference
pip install "dv-hyperrag[all]"    # + dev + eval tooling
```

---

## Quickstart

→ **[quickstart/README.md](quickstart/README.md)**

Install, configure, optimize, serve. Four steps.

---

## Pipeline Integration

→ **[docs/pipeline-integration.md](docs/pipeline-integration.md)**

Drop this into a pipeline you already have (LangChain, LlamaIndex, custom).

---

## How it works

1. `optimize()` — Pareto search over GPU counts, batch sizes, and cache hit rates. Returns the config that minimises TTFT (or maximises QPS) for your workload.
2. `recommend_cache()` — Sweeps GPU/host DRAM split. Tells you how to allocate cache budget.
3. `build_controller()` — Returns a live serving controller backed by vLLM. Every request goes through cache lookup → speculative pipelining → KV cache admission.

```python
from hyperrag import RAGOptimize, RAGOptimizeConfig, LLMModel, Query

rago = RAGOptimize(RAGOptimizeConfig(
    paradigm="long_context",
    model=LLMModel.LLAMA_3_1_70B,
    gpu_budget_gb=8.0,
    host_budget_gb=32.0,
))

# Pre-flight: find the best schedule before you commit hardware
result = rago.optimize()
print(result.summary())
# TTFT=3.4ms  QPS=12.5  hit_rate=0.82  gpus=4  batch=8

# Production: real vLLM inference with KV caching active
ctrl = rago.build_controller()   # requires NVIDIA GPU + pip install "dv-hyperrag[gpu]"
resp = ctrl.process(Query("q1", "What is transformer attention?", ["d1", "d2"], [512, 256]))
print(f"TTFT={resp.ttft_s*1000:.1f}ms  cache_hit={resp.cache_hit}")
```
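
Step 2, `recommend_cache()`, isn't exercised above. A minimal usage sketch, assuming the `CacheRecommendation` fields listed under Key classes are plain attributes:

```python
# Sweep the GPU/host DRAM split for the workload configured above.
rec = rago.recommend_cache()
print(f"GPU {rec.gpu_gb:.1f} GB + host {rec.host_gb:.1f} GB "
      f"→ estimated hit rate {rec.estimated_hit_rate:.2f}")
print(rec.reasoning)   # human-readable justification for the split
```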

---

## Model presets

```python
from hyperrag import LLMModel

# LLMs
LLMModel.LLAMA_3_1_8B,   LLMModel.LLAMA_3_1_70B,    LLMModel.LLAMA_3_1_405B
LLMModel.MISTRAL_7B,     LLMModel.MISTRAL_NEMO_12B
LLMModel.GEMMA_2_9B,     LLMModel.GEMMA_2_27B
LLMModel.QWEN_2_5_72B,   LLMModel.DEEPSEEK_R1_70B

# SLMs
LLMModel.LLAMA_3_2_1B,   LLMModel.LLAMA_3_2_3B
LLMModel.PHI_3_5_MINI,   LLMModel.GEMMA_2_2B
LLMModel.QWEN_2_5_7B,    LLMModel.DEEPSEEK_R1_7B

# Custom
LLMModel.custom("MyModel-7B", "myorg/mymodel-7b", 7.0,
                num_layers=32, q_heads=32, kv_heads=8, head_dim=128)
```

All presets: `from hyperrag import ALL_MODELS`.

---

## RAG paradigms

| `paradigm=` | Default model | Bottleneck | Use when |
|-------------|--------------|------------|----------|
| `"hyperscale"` | 8B | FAISS scan | Standard single-hop RAG |
| `"long_context"` | 70B | LLM prefill | 1M+ token context, no retrieval |
| `"iterative"` | 70B | FAISS × 4 | Multi-hop / agentic retrieval |
| `"rewriter_reranker"` | 70B | Encoder + rewriter | Query rewrite + cross-encoder rerank |

---

## Config reference

```python
RAGOptimizeConfig(
    paradigm="hyperscale",     # see table above
    model=LLMModel.LLAMA_3_1_8B,
    gpu_budget_gb=4.0,         # GPU HBM for KV cache (GB)
    host_budget_gb=16.0,       # host DRAM for KV cache (GB)
    hardware_profile=None,     # path to JSON from scripts/profile_hardware.py
    num_gpus=None,             # override GPU count (or RAGO_NUM_GPUS env)
    max_ttft_s=None,           # filter: reject schedules with TTFT > this
    min_qps=None,              # filter: reject schedules with QPS < this
)
```

---

## Key classes

| Class | What it does |
|-------|-------------|
| `RAGOptimize` | Facade: `optimize()`, `recommend_cache()`, `build_controller()` |
| `RAGServeController` | Serving: `process()`, `process_batch()`, `warmup()`, `metrics()`, `reset()` |
| `LLMModel` | Model spec with 17 built-in presets + `custom()` |
| `Query` | `(query_id, text, doc_ids, doc_tokens)` |
| `QueryResult` | `(ttft_s, latency_s, cache_hit, cached_tokens, speculative)` |
| `OptimizeResult` | `(ttft_s, qps_per_chip, cache_hit_rate, pareto_size)` |
| `CacheRecommendation` | `(gpu_gb, host_gb, estimated_hit_rate, reasoning)` |
| `ServeMetrics` | `(hit_rate, avg_ttft_s, gpu_used_mib, host_used_mib, ...)` |

Exceptions: `RAGOptimizeError` → `ConfigError`, `ScheduleError`, `HardwareError`, `ServeError`.
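
Continuing from the quickstart's `ctrl`, a serving-loop sketch over the `RAGServeController` surface (what `warmup()` and `reset()` actually do is assumed; the metric fields are from `ServeMetrics` above):

```python
from hyperrag import Query

ctrl.warmup()    # assumed: primes the cache before taking live traffic
results = ctrl.process_batch([
    Query("q1", "What is attention?", ["d1"], [512]),
    Query("q2", "What is a KV cache?", ["d1", "d3"], [512, 384]),
])
print(f"batch hits: {sum(r.cache_hit for r in results)}/{len(results)}")

m = ctrl.metrics()
print(f"hit_rate={m.hit_rate:.2f}  avg_ttft={m.avg_ttft_s * 1000:.1f} ms  "
      f"gpu={m.gpu_used_mib} MiB  host={m.host_used_mib} MiB")
ctrl.reset()     # assumed: clears counters between benchmark runs
```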

---

## Hardware calibration

For tighter schedule predictions, profile once and pass the result:

```bash
python scripts/profile_hardware.py --output profiles/my_gpu.json
```

```python
RAGOptimizeConfig(model=LLMModel.LLAMA_3_1_8B, hardware_profile="profiles/my_gpu.json")
```

---

## Benchmarks

4× A100-SXM4-40GB, 1000 queries, Zipfian workload, calibrated cost model.

| Paradigm | Baseline | +RAGCache | +RAGO | Speedup |
|----------|----------|-----------|-------|---------|
| Hyperscale 8B | 264.8 ms | 251.6 ms | **243.6 ms** | 1.09× |
| Long-context 70B | 30.9 ms | 13.7 ms | **3.4 ms** | **9.02×** |
| Iterative 70B | 264.8 ms | 251.6 ms | **243.6 ms** | 1.09× |
| Rewriter-Reranker 70B | 649.2 ms | 635.9 ms | **339.7 ms** | 1.91× |

---

## Tests

```bash
pytest tests/ -m "not gpu" -v    # 100 tests, no GPU needed
pytest tests/ -m gpu -v          # serving tests (requires NVIDIA GPU + vllm)
```

MS-MARCO v2.1 fixture (50 queries) at `tests/fixtures/ms_marco_sample.json`.
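
To replay the fixture through a controller, something like this should work; the key names below are guesses at the fixture's JSON schema, not documented:

```python
import json
from hyperrag import Query

with open("tests/fixtures/ms_marco_sample.json") as f:
    records = json.load(f)

# Key names are assumptions about the fixture schema.
queries = [
    Query(r["query_id"], r["text"], r["doc_ids"], r["doc_tokens"])
    for r in records
]
results = ctrl.process_batch(queries)
avg_ttft = sum(r.ttft_s for r in results) / len(results)
print(f"avg TTFT over {len(results)} queries: {avg_ttft * 1000:.1f} ms")
```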

---

## Source layout

```
src/
├── hyperrag/        SDK (public API)
│   ├── client.py            RAGOptimize facade
│   ├── config.py            RAGOptimizeConfig
│   ├── serve.py             RAGServeController
│   ├── models.py            LLMModel + result types
│   └── exceptions.py
└── engine/                  Engine (RAGO + RAGCache algorithms)
    ├── schema/              RAGSchema workload model
    ├── cost_model/          Roofline / calibrated / adaptive
    ├── knowledge_tree/      Prefix trie for KV cache
    ├── cache/               Multi-tier cache + PGDSF
    ├── request_scheduler/   Cache-aware reorder + speculative pipeline
    ├── scheduler/           Pareto scheduler
    ├── inference/           vLLM backend
    └── serving/             RAGController
```

---

## References

1. Wenqi Jiang et al. "RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving." ISCA 2025. [arXiv:2503.14649](https://arxiv.org/abs/2503.14649)
2. Chao Jin et al. "RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation." ACM TOCS 2025. [arXiv:2404.12457](https://arxiv.org/abs/2404.12457)

MIT License
