Metadata-Version: 2.4
Name: rag-llm-infra
Version: 0.1.0
Summary: Vendor-neutral RAG + LLM serving infrastructure: swappable LLM protocol and vector store (FAISS/NumPy/Qdrant), cached embedding index, and observability.
Project-URL: Homepage, https://github.com/MarwaBS/rag-llm-infra
Project-URL: Repository, https://github.com/MarwaBS/rag-llm-infra
Author: Marwa Ben Salem
License-Expression: MIT
License-File: LICENSE
Keywords: embeddings,faiss,llm,openai,qdrant,rag,retrieval,vector-store
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: numpy>=1.26
Provides-Extra: dev
Requires-Dist: build>=1.0; extra == 'dev'
Requires-Dist: faiss-cpu>=1.9; extra == 'dev'
Requires-Dist: fastapi>=0.115; extra == 'dev'
Requires-Dist: httpx>=0.27; extra == 'dev'
Requires-Dist: mypy>=1.11; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: qdrant-client>=1.12; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=3.0; extra == 'embeddings'
Provides-Extra: faiss
Requires-Dist: faiss-cpu>=1.9; extra == 'faiss'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: otel
Requires-Dist: opentelemetry-api>=1.29; extra == 'otel'
Requires-Dist: opentelemetry-exporter-otlp-proto-grpc>=1.29; extra == 'otel'
Requires-Dist: opentelemetry-sdk>=1.29; extra == 'otel'
Provides-Extra: qdrant
Requires-Dist: qdrant-client>=1.12; extra == 'qdrant'
Provides-Extra: serve
Requires-Dist: fastapi>=0.115; extra == 'serve'
Requires-Dist: uvicorn>=0.30; extra == 'serve'
Description-Content-Type: text/markdown

# RAG + LLM Serving Infrastructure

[![CI](https://github.com/MarwaBS/rag-llm-infra/actions/workflows/ci.yml/badge.svg)](https://github.com/MarwaBS/rag-llm-infra/actions/workflows/ci.yml)

An installable, vendor-neutral foundation for retrieval-augmented LLM applications:
a swappable vector store, a cached embedding index, a provider-agnostic LLM
protocol, the observability around them, a FastAPI serving layer, and a
retrieval-quality eval gate.

> Distilled infrastructure layer — typed, tested, packaged, and runnable on its own.

## Install

```bash
# from Git (works today):
pip install "git+https://github.com/MarwaBS/rag-llm-infra"
# same, with optional extras:
pip install "rag-llm-infra[faiss,qdrant,openai,serve] @ git+https://github.com/MarwaBS/rag-llm-infra"
# from a local clone, for development:
pip install -e ".[dev]"
```
> A tagged PyPI release is planned; until then, install from Git as above.

## Quickstart — end-to-end RAG (no API key, no network)

```bash
git clone https://github.com/MarwaBS/rag-llm-infra && cd rag-llm-infra
pip install -e .
python example.py
```

```
embed documents → index in a VectorStore → retrieve top-k for a query
                → build a grounded prompt → answer with an LLMProtocol backend
```

The example runs on the NumPy vector store and the deterministic mock LLM, so it
needs no API key and no network access. In production, swap the demo embedder for
`EmbeddingEngine` and `get_llm("mock")` for `get_llm("openai")`.
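
The pipeline above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the contents of `example.py`: `toy_embed`, `retrieve`, `build_prompt`, and `mock_llm` are hypothetical names, and the hashed bag-of-words embedder is an assumption chosen only so the sketch runs without a model download.

```python
import numpy as np


def toy_embed(texts: list[str], dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in embedder: hashed bag-of-words, L2-normalised."""
    vectors = np.zeros((len(texts), dim), dtype=np.float32)
    for row, text in enumerate(texts):
        for token in text.lower().split():
            # Stable hash so runs are reproducible (built-in hash() is salted).
            bucket = sum(ord(ch) * (i + 1) for i, ch in enumerate(token)) % dim
            vectors[row, bucket] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


def retrieve(query: str, docs: list[str], doc_vectors: np.ndarray, k: int = 1) -> list[str]:
    """Inner-product top-k over normalised vectors (cosine similarity)."""
    scores = doc_vectors @ toy_embed([query])[0]
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]


def build_prompt(query: str, contexts: list[str]) -> str:
    """Ground the question in the retrieved passages."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def mock_llm(prompt: str) -> str:
    """Deterministic placeholder backend: echoes the first context line."""
    return "[mock] " + prompt.splitlines()[1]


docs = ["FAISS is in-process vector search", "Qdrant is a vector database"]
contexts = retrieve("vector search", docs, toy_embed(docs), k=1)
answer = mock_llm(build_prompt("vector search", contexts))
```

Because every stage is a plain function over arrays and strings, each one can be replaced independently, which is the property the real protocols are meant to provide.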

## Serve it

```bash
pip install "rag-llm-infra[serve] @ git+https://github.com/MarwaBS/rag-llm-infra"
uvicorn rag_llm_infra.serve:app          # or: docker build -t rag-llm-infra . && docker run -p 8000:8000 rag-llm-infra
```

```bash
curl -X POST localhost:8000/index -H 'content-type: application/json' \
  -d '{"documents":["FAISS is in-process vector search","Qdrant is a vector database"]}'
curl -X POST localhost:8000/query -H 'content-type: application/json' \
  -d '{"query":"vector search","k":1}'
```
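
The same two calls can be made from Python with only the standard library. A minimal sketch: the request bodies mirror the curl examples, `BASE_URL` assumes the server is listening on `localhost:8000`, and the helper names `build_post` and `post_json` are hypothetical. The response schema is not documented here, so the reply is returned as decoded JSON without further assumptions.

```python
import json
import urllib.request
from typing import Any

BASE_URL = "http://localhost:8000"  # assumption: same host and port as the curl calls


def build_post(path: str, payload: dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request for the given endpoint path."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict[str, Any]) -> Any:
    """Send the request and decode the JSON reply (requires a running server)."""
    with urllib.request.urlopen(build_post(path, payload), timeout=10) as resp:
        return json.load(resp)
```

With the service running, `post_json("/index", {"documents": [...]})` followed by `post_json("/query", {"query": "vector search", "k": 1})` reproduces the curl session.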

## What's inside

| Module | Responsibility |
| --- | --- |
| `rag_llm_infra.llm_protocol` | `LLMProtocol` — `runtime_checkable` Protocol over OpenAI / Anthropic-stub / Mock; factory `get_llm()` |
| `rag_llm_infra.vector_store` | `VectorStoreProtocol` — in-process FAISS `IndexFlatIP`, pure-NumPy fallback, real **Qdrant** (batched search) |
| `rag_llm_infra.evidence_index` | `EmbeddingEngine` — SentenceTransformers embeddings + adaptive, memory-pressure-aware LRU cache; reader/writer lock |
| `rag_llm_infra.tracing` | OpenTelemetry spans with console-exporter + no-op fallbacks |
| `rag_llm_infra.log_config` | structured JSON logging + an `llm_call` latency/token timer |
| `rag_llm_infra.serve` | FastAPI service (`/index`, `/query`, `/health`) wiring the parts together |
| `rag_llm_infra.faithfulness` | `groundedness(answer, contexts)` — lexical faithfulness metric for RAG output |
| `rag_llm_infra.fallback` | `FallbackLLM` — budget-aware multi-provider routing; drop-in `LLMProtocol` |

## Quality gates

```bash
python -m eval.retrieval_eval      # recall@1 / MRR on a labelled paraphrase corpus
python -m eval.generation_eval     # groundedness (faithfulness) of generated answers
```

Both run in CI. The build fails, and the change cannot merge, on either a
**retrieval** regression (`recall@1` below 0.80 or `MRR` below 0.85) or a
**faithfulness** regression (a grounded answer scoring below threshold, or the
metric failing to flag a hallucinated control).
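
For readers unfamiliar with the gate metrics, the sketch below shows one common way to compute them. It is not the code in `eval/`: the function names are hypothetical, and `token_overlap` is only an assumed, simplified form of a lexical faithfulness score. The 0.80 and 0.85 thresholds are the ones quoted above.

```python
def recall_at_1(ranked: list[list[str]], gold: list[str]) -> float:
    """Fraction of queries whose top-ranked document is the labelled one."""
    hits = sum(1 for docs, g in zip(ranked, gold) if docs and docs[0] == g)
    return hits / len(gold)


def mean_reciprocal_rank(ranked: list[list[str]], gold: list[str]) -> float:
    """Average of 1/rank of the labelled document (0 when it is not retrieved)."""
    total = 0.0
    for docs, g in zip(ranked, gold):
        if g in docs:
            total += 1.0 / (docs.index(g) + 1)
    return total / len(gold)


def token_overlap(answer: str, contexts: list[str]) -> float:
    """Share of answer tokens that also occur in the contexts (assumed definition)."""
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    context_tokens = set(" ".join(contexts).lower().split())
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)


def passes_retrieval_gate(recall: float, mrr: float) -> bool:
    """Apply the thresholds quoted in this README."""
    return recall >= 0.80 and mrr >= 0.85


# Three queries, all labelled with document "a"; only the first ranks it on top.
ranked = [["a", "b"], ["b", "a"], ["c", "a"]]
gold = ["a", "a", "a"]
recall = recall_at_1(ranked, gold)        # 1/3
mrr = mean_reciprocal_rank(ranked, gold)  # (1 + 1/2 + 1/2) / 3 = 2/3
```

With these toy rankings both metrics fall below their thresholds, so `passes_retrieval_gate(recall, mrr)` is `False` and a CI job built on it would exit non-zero.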

## Engineering principles demonstrated

- **Swap by interface** — `LLMProtocol` / `VectorStoreProtocol` make the model and the index runtime-swappable.
- **Degrade, don't crash** — FAISS / Qdrant / OpenTelemetry / SentenceTransformers are lazily imported with working fallbacks; missing infra never hard-fails import.
- **Measured, not asserted** — a retrieval eval gate, not just unit tests; packaged and CI-built end to end.
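
The "degrade, don't crash" principle is typically implemented with a guarded import. A minimal sketch under stated assumptions: `load_optional` and `top_k_inner_product` are hypothetical helpers, not the package's API, and the pure-Python branch stands in for whatever fallback the real module uses. The FAISS branch uses `IndexFlatIP`, the index type named in the module table.

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Return the module if it is installed, else None, instead of raising."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def top_k_inner_product(
    vectors: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> list[int]:
    """Indices of the k vectors with the largest inner product with `query`."""
    faiss = load_optional("faiss")
    if faiss is not None:
        import numpy as np  # FAISS depends on NumPy, so it is present on this path

        data = np.asarray(vectors, dtype=np.float32)
        index = faiss.IndexFlatIP(data.shape[1])
        index.add(data)
        _, ids = index.search(np.asarray([query], dtype=np.float32), k)
        return [int(i) for i in ids[0]]
    # Fallback: same scoring in pure Python when FAISS is not installed.
    scores = [sum(v * q for v, q in zip(row, query)) for row in vectors]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:k]
```

Callers get the same result either way; only the speed differs, which is what lets tests for the native path skip cleanly when the library is absent.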

## Develop / test

```bash
pip install -e ".[dev]"     # also installs FAISS, Qdrant, and FastAPI for the tests
ruff check . && pytest && python -m eval.retrieval_eval
```

CI installs the native backends, so the FAISS and Qdrant tests run there (they
skip only when those libraries are absent).

## License

MIT — see [LICENSE](LICENSE).
