Metadata-Version: 2.4
Name: rag-agent-eval-ci
Version: 0.1.0
Summary: CI evaluation framework for RAG and AI agents: groundedness, retrieval quality, hallucination, citations, latency, cost, and regression gating before production.
Author: rag-agent-eval-ci contributors
License: Apache-2.0
Project-URL: Homepage, https://github.com/martian7777/rag-agent-eval-ci
Project-URL: Documentation, https://github.com/martian7777/rag-agent-eval-ci/tree/main/docs
Project-URL: Issues, https://github.com/martian7777/rag-agent-eval-ci/issues
Keywords: rag,llm,evaluation,ci,hallucination,groundedness,agents,mlops,llmops
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.12
Requires-Dist: rich>=13.7
Requires-Dist: pydantic>=2.6
Requires-Dist: pydantic-settings>=2.2
Requires-Dist: pyyaml>=6.0
Requires-Dist: httpx>=0.27
Requires-Dist: sqlalchemy>=2.0
Requires-Dist: jinja2>=3.1
Requires-Dist: python-dotenv>=1.0
Provides-Extra: chroma
Requires-Dist: chromadb>=0.5; extra == "chroma"
Provides-Extra: qdrant
Requires-Dist: qdrant-client>=1.9; extra == "qdrant"
Requires-Dist: fastembed>=0.3; extra == "qdrant"
Provides-Extra: pdf
Requires-Dist: pypdf>=4.2; extra == "pdf"
Provides-Extra: api
Requires-Dist: fastapi>=0.136; extra == "api"
Requires-Dist: starlette>=1.3; extra == "api"
Requires-Dist: uvicorn[standard]>=0.29; extra == "api"
Provides-Extra: dashboard
Requires-Dist: streamlit>=1.58; extra == "dashboard"
Requires-Dist: pandas>=2.2; extra == "dashboard"
Requires-Dist: altair>=5.3; extra == "dashboard"
Provides-Extra: postgres
Requires-Dist: psycopg[binary]>=3.1; extra == "postgres"
Provides-Extra: openai
Requires-Dist: openai>=1.30; extra == "openai"
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.6; extra == "gemini"
Provides-Extra: all
Requires-Dist: chromadb>=0.5; extra == "all"
Requires-Dist: qdrant-client>=1.9; extra == "all"
Requires-Dist: fastembed>=0.3; extra == "all"
Requires-Dist: pypdf>=4.2; extra == "all"
Requires-Dist: fastapi>=0.136; extra == "all"
Requires-Dist: starlette>=1.3; extra == "all"
Requires-Dist: uvicorn[standard]>=0.29; extra == "all"
Requires-Dist: streamlit>=1.58; extra == "all"
Requires-Dist: pandas>=2.2; extra == "all"
Requires-Dist: altair>=5.3; extra == "all"
Requires-Dist: openai>=1.30; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.1; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: pypdf>=4.2; extra == "dev"
Dynamic: license-file

<div align="center">

# 🛠️ rag-agent-eval-ci

### Stop shipping RAG systems you can't test.

**A CI evaluation framework for RAG and AI agents — gate every deploy on groundedness, retrieval quality, hallucination, citations, latency, cost, and regression.**

[![CI Status](https://github.com/martian7777/rag-agent-eval-ci/actions/workflows/ci.yml/badge.svg)](https://github.com/martian7777/rag-agent-eval-ci/actions)
[![Python](https://img.shields.io/badge/python-3.10%2B-3776AB?style=flat-square&logo=python&logoColor=white)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-Apache--2.0-22314F?style=flat-square)](LICENSE)

</div>

---

## ⚠️ The Problem

Teams are shipping RAG assistants and AI agents into production, but most have **no automated way to answer the one question that matters before a deploy:**

> [!IMPORTANT]
> **"Is this version safe to ship, or did we just make it worse?"**
>
> A prompt tweak silently drops retrieval recall. A model swap doubles cost. A new chunking strategy starts hallucinating. Today these regressions are caught by users, not by CI.

`rag-agent-eval-ci` turns RAG quality into a **pull-request gate**. Developers write test questions in YAML; the tool measures groundedness, retrieval accuracy, hallucination, citations, latency, and cost, compares against a baseline, and **fails the build** when quality drops.

---

## 📺 Demo

![Demo Screenshot](https://raw.githubusercontent.com/martian7777/rag-agent-eval-ci/main/image.png)

---

## ⚡ Quick Start (under 5 minutes, no API keys)

> [!TIP]
> The default config uses a built-in **mock provider** + in-memory retriever, so the first run works with **zero keys and zero network dependency**.

### Installation
```bash
pip install rag-agent-eval-ci          # or: git clone https://github.com/martian7777/rag-agent-eval-ci && pip install -e .
```

### Run the bundled visa-enrollment example end-to-end:
```bash
rag-eval gate \
  --tests examples/visa/tests.yaml \
  -c examples/visa/rag_eval.yaml \
  -d examples/visa/docs
```

The exit code is `0` if the gate passes, and `1` if it fails — making it extremely straightforward to plug into your CI pipelines.

> [!NOTE]
> Prefer Docker? `docker compose up` brings the API, dashboard, Postgres, and a local Ollama up together.

---

## 📄 Example Input

A developer just writes questions and expectations in `tests.yaml`:

```yaml
suite: visa-docs
questions:
  - question: "What documents are required for visa enrollment?"
    expected_sources: ["visa_checklist.pdf"]   # retrieval must surface this
    must_include: ["passport", "admission letter", "insurance"]  # answer completeness
    must_not_include: ["bank statement is required"]             # hallucination trap
```

---

## 📊 Example Output

Every run produces clean console output **and** CI-friendly artifacts in `.rag_eval/reports/`:

| File | Use case / Description |
| :--- | :--- |
| `report.json` | Machine-readable full evaluation results. |
| `junit.xml` | Standard format that renders as native test results in GitHub/GitLab CI. |
| `summary.md` | Clean markdown summary intended to be auto-posted to PR comments or job summaries. |
| `report.html` | Self-contained, premium visual report (perfect as a CI build artifact). |

---

## 🎯 What it Measures

| Metric | Question it answers | Core Evaluation Method |
| :--- | :--- | :--- |
| **Retrieval accuracy** | Did we retrieve the documents we expected? | Substring checking on returned sources vs expected. |
| **Groundedness** | Is the answer actually supported by the retrieved context? | LLM-judge verification or fallback token overlap. |
| **Hallucination** | Did the model invent unsupported or forbidden claims? | LLM-judge analysis plus `must_not_include` penalty checks. |
| **Citation** | Are the cited sources real and the right ones? | Compares cited sources vs retrieved context. |
| **Answer completeness** | Does the answer contain the key facts? | Validates presence of `must_include` phrases. |
| **Latency** | How fast is the system? | Tracks wall-clock time (gates on suite-level **p95**). |
| **Cost** | What is the financial footprint? | Live calculations (gates on suite-level **total USD**). |
| **Regression** | Did quality drop since the last release? | Auto-compares metrics against a tagged baseline run. |

---

## 🏗️ Architecture

```mermaid
flowchart TD
    classDef main fill:#3b82f6,stroke:#1d4ed8,stroke-width:2px,color:#fff,font-weight:bold;
    classDef input fill:#f8fafc,stroke:#64748b,stroke-width:1.5px;
    classDef step fill:#fff,stroke:#cbd5e1,stroke-width:1px;
    classDef eval fill:#fef2f2,stroke:#f87171,stroke-width:1.5px;
    classDef storage fill:#ecfdf5,stroke:#34d399,stroke-width:1.5px;
    classDef report fill:#fff7ed,stroke:#fb923c,stroke-width:1.5px;
    classDef gate fill:#fef08a,stroke:#eab308,stroke-width:1.5px;

    tests["tests.yaml<br><i>(Questions & Expectations)</i>"]:::input --> runner["Runner Engine"]:::main

    runner --> target["Target System"]:::step
    runner --> evals["Evaluator Suite"]:::step
    runner --> reporting["Reporting System"]:::step

    target --> target_desc["• Local RAG Pipeline<br>• HTTP Endpoint"]:::step
    target_desc --> providers["Providers"]:::step
    providers --> providers_list["• Mock (Offline)<br>• Ollama / OpenAI<br>• Gemini / OpenRouter"]:::step
    providers_list --> vector["Vector Stores"]:::step
    vector --> vector_list["• Memory / Chroma / Qdrant"]:::step

    evals --> eval_list["• Retrieval Accuracy<br>• Groundedness <i>(LLM)</i><br>• Hallucination Penalty<br>• Citation & Source Precision<br>• Answer Completeness<br>• Latency & Cost"]:::eval
    eval_list --> storage["Storage DB<br><i>(SQLite / Postgres)</i>"]:::storage
    storage --> dashboard["Streamlit Dashboard<br>& FastAPI Backend"]:::storage

    reporting --> report_formats["• Rich Console Output<br>• report.json<br>• junit.xml<br>• summary.md<br>• report.html"]:::report
    report_formats --> gate_check["Gate Check"]:::gate
    gate_check --> threshold_desc["Thresholds & Regression Checks"]:::gate
    threshold_desc --> exit_code["CI Exit Code<br><i>(0 = Pass, 1 = Fail)</i>"]:::gate

    storage -.-> gate_check
```

- **Providers**: `mock`, `ollama`, `openai`, `gemini`, `openrouter` (any model via OpenRouter with live cost tracking). See [docs/providers.md](docs/providers.md).
- **Vector stores**: `memory` (zero dependencies), `chroma`, `qdrant`.
- **Targets**: Evaluate the built-in pipeline **or point it at your own RAG HTTP endpoint** — making it a plug-and-play evaluation utility for existing architectures.

---

## 💼 Use Cases

- **Pull-Request Gate** — Automatically block merges that regress retrieval/groundedness metrics or exceed cost limits.
- **Model & Prompt Optimization** — Run side-by-side comparison matrices (e.g. `gpt-4o-mini` vs local LLM) on quality and cost.
- **Nightly Regression Gates** — Schedule automated cron runs against a tagged production baseline to capture silent drift.
- **Provider Migration Assurance** — Benchmark a new model/provider to prove compatibility before going live.

---

## 🚀 How to Use It in Your Company

1. **Seed a Test Suite**: Write 10–20 high-value user questions in a `tests.yaml`.
2. **Configure Your Target**: Point the runner to your pipeline. If using your own server, set `target.type: http`:
   ```yaml
   target:
     type: http
     url: https://your-rag.internal/answer
     question_field: question
     answer_field: answer
     sources_field: sources
   ```
3. **Select Your Judge**: Configure a judging provider. Use `ollama` for a free local judge, or keys for `openrouter`/`openai`/`gemini`.
4. **Define Quality Thresholds**: Set performance/cost bounds in `rag_eval.yaml`.
5. **Drop into CI**: Copy the prebuilt [.github/workflows/rag-eval.yml](.github/workflows/rag-eval.yml) to auto-run on every PR.
6. **Freeze a Baseline**: Run `rag-eval baseline save <run_id>` on your main branch to establish a comparison reference.

> Run history is persisted to Postgres/SQLite and can be monitored visually using the Streamlit dashboard (`docker compose up`).

---

## 🐍 Python SDK

```python
from rag_eval import Evaluator

# Run evaluations directly from your custom pipeline or test script
report = Evaluator.from_config("rag_eval.yaml").run(
    "tests.yaml", ingest_dir="examples/visa/docs"
)

print(report.summary.quality)     # e.g., {'groundedness': 0.98, ...}
assert report.passed              # raise exceptions in standard test runners
```

---

## 🗺️ Roadmap

- [ ] **Additional Evaluators**: Context precision/recall, answer relevancy, and toxicity filters.
- [ ] **Native Integrations**: Direct connectors for LangChain / LlamaIndex pipelines.
- [ ] **Agentic Evaluation**: Support multi-turn agent conversations and tool usage tracking.
- [ ] **Notification Exporters**: Built-in Slack, Discord, and email alerts on gate failures.
- [ ] **Visual Diffing**: Comprehensive run-to-run comparisons on the dashboard.
- [ ] **PyPI & Docker Images**: Hosted pre-builds for zero setup.

---

## 🤝 Contributing

Contributions are very welcome! Please check out [CONTRIBUTING.md](CONTRIBUTING.md) to get started.

> [!TIP]
> Looking for entry points? Check out our [**Good First Issues**](.github/GOOD_FIRST_ISSUES.md) list.

---

## 📄 License

This project is licensed under the terms of the [Apache-2.0 License](LICENSE).
