Metadata-Version: 2.4
Name: rag-benchmarking
Version: 1.0.0
Summary: Framework-agnostic evaluation harness for RAG and agentic AI systems
Author: Ajay Pundhir
License: Apache-2.0
Project-URL: Homepage, https://aiexponent.com
Project-URL: Repository, https://github.com/aiexponenthq/rag-benchmarking
Project-URL: Documentation, https://github.com/aiexponenthq/rag-benchmarking#readme
Project-URL: Bug Tracker, https://github.com/aiexponenthq/rag-benchmarking/issues
Keywords: rag,evaluation,benchmarking,llm,ai-governance,eu-ai-act,agentic-ai
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: fastapi>=0.110
Requires-Dist: uvicorn>=0.30
Requires-Dist: pydantic>=2.6
Requires-Dist: pydantic-settings>=2.2
Requires-Dist: python-json-logger>=2.0.7
Requires-Dist: qdrant-client>=1.9
Requires-Dist: sentence-transformers>=3.0
Requires-Dist: numpy>=1.26
Requires-Dist: tqdm>=4.66
Requires-Dist: requests>=2.31
Requires-Dist: ragas>=0.1.9
Requires-Dist: langchain-google-genai>=2.0.9
Requires-Dist: FlagEmbedding>=1.2.11
Provides-Extra: test
Requires-Dist: pytest>=8.2; extra == "test"
Requires-Dist: pytest-asyncio>=0.23; extra == "test"
Requires-Dist: pytest-cov>=5.0; extra == "test"
Requires-Dist: httpx>=0.27; extra == "test"
Provides-Extra: lint
Requires-Dist: ruff==0.15.12; extra == "lint"
Requires-Dist: mypy>=1.10; extra == "lint"
Dynamic: license-file

<p align="center">
  <a href="https://aiexponent.com"><img src=".github/brand/logo-full-light.png" alt="AiExponent — Building AI that deserves to be trusted" width="560"></a>
</p>

<h1 align="center">RAG Benchmarking</h1>
<p align="center"><em>Prove your RAG system works — before you ship.</em></p>

<p align="center">
  <a href="https://pypi.org/project/rag-benchmarking/"><img src="https://img.shields.io/pypi/v/rag-benchmarking.svg" alt="PyPI"></a>
  <a href="https://github.com/aiexponenthq/rag-benchmarking/actions"><img src="https://github.com/aiexponenthq/rag-benchmarking/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-0D5463.svg" alt="License: Apache 2.0"></a>
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.11%2B-0D5463.svg" alt="Python 3.11+"></a>
  <a href="https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689"><img src="https://img.shields.io/badge/EU%20AI%20Act-Article%2015-0D5463.svg" alt="EU AI Act Article 15"></a>
  <a href="#privacy"><img src="https://img.shields.io/badge/telemetry-zero-0B7A4B.svg" alt="Zero telemetry"></a>
</p>

---

**A framework-agnostic evaluation harness for RAG and agentic AI systems.**

Bring your own RAG pipeline — LangChain, LlamaIndex, or custom — and benchmark it against classic and agentic-era metrics. Built for teams who need to prove their AI systems work before they ship.

Built by [AI Exponent LLC](https://aiexponent.com). Provides **partial Art. 15(1) accuracy input** for high-risk AI systems — not Art. 15 robustness, not cybersecurity, not conformity evidence (see [scope panel](#eu-ai-act-article-15--partial-input-not-conformity-evidence)).

---

## Quick Start

```bash
pip install rag-benchmarking
```

```python
from app.sdk.client import RagEval

client = RagEval(api_url="http://localhost:5001", api_key="your-key")

# Works with LangChain (my_chain is your own, already-built chain)
result = my_chain.invoke({"query": "What is RAG?"})
sample = RagEval.from_langchain(result)

# Or any dict with question / contexts / answer
sample = {
    "question": "What is RAG?",
    "contexts": ["RAG stands for Retrieval-Augmented Generation."],
    "answer": "RAG combines retrieval with LLM generation.",
}

report = client.evaluate([sample], metrics=["faithfulness", "answer_relevancy"])
print(report["metrics"])
# {"faithfulness": 0.958, "answer_relevancy": 0.810}
```

```bash
# Start the evaluation server (the client above expects it at http://localhost:5001)
docker compose up
# API docs: http://localhost:5001/docs
```

---

## Architecture

```mermaid
graph TD
    RAG["Your RAG System\nLangChain · LlamaIndex · Custom"]
    SDK["SDK Adapters\nRagEval.from_langchain()\nRagEval.from_llamaindex()"]
    SCHEMA["EvalSample / AgentTrace\nharness/schemas.py"]
    RUNNER["EvaluationRunner\nharness/runner.py"]

    CLASSIC["Classic Metrics\nfaithfulness · answer_relevancy\ncontext_precision · context_recall"]
    RETRIEVAL["Retrieval Metrics\nPrecision@K · Recall@K\nMRR · NDCG"]
    AGENTIC["Agentic Metrics\nagent_faithfulness · tool_call_accuracy\nretrieval_necessity · source_attribution"]

    REPORT["BenchmarkReport"]
    STORE["SQLite ResultStore\nRun history + comparison"]
    API["REST API\n/v1/evaluate · /v1/evaluate/agent\n/v1/runs · /v1/runs/compare"]

    RAG --> SDK --> SCHEMA --> RUNNER
    RUNNER --> CLASSIC
    RUNNER --> RETRIEVAL
    RUNNER --> AGENTIC
    CLASSIC --> REPORT
    RETRIEVAL --> REPORT
    AGENTIC --> REPORT
    REPORT --> STORE --> API

    style RAG fill:#2d5a2d,color:#fff
    style SDK fill:#1e3a5f,color:#fff
    style SCHEMA fill:#1e3a5f,color:#fff
    style RUNNER fill:#1e3a5f,color:#fff
    style CLASSIC fill:#c9a84c,color:#000
    style RETRIEVAL fill:#c9a84c,color:#000
    style AGENTIC fill:#c9a84c,color:#000
    style REPORT fill:#1e3a5f,color:#fff
    style STORE fill:#1e3a5f,color:#fff
    style API fill:#2d5a2d,color:#fff
```

---

## Metrics

### Classic RAG Metrics

```mermaid
graph LR
    Q["question\ncontexts\nanswer"]

    FAITH["faithfulness\nAre all claims in the\nanswer supported by context?"]
    RELEV["answer_relevancy\nDoes the answer\naddress the question?"]
    CPREC["context_precision\nAre retrieved chunks\nrelevant to the query?"]
    CREC["context_recall\nDoes context contain\nenough to answer?"]

    Q --> FAITH
    Q --> RELEV
    Q --> CPREC
    Q --> CREC

    style Q fill:#1e3a5f,color:#fff
    style FAITH fill:#c9a84c,color:#000
    style RELEV fill:#c9a84c,color:#000
    style CPREC fill:#c9a84c,color:#000
    style CREC fill:#c9a84c,color:#000
```

| Metric | What it measures | LLM judge |
|---|---|---|
| `faithfulness` | Are all claims in the answer supported by context? | Yes |
| `answer_relevancy` | Does the answer address the question? | Yes |
| `context_precision` | Are retrieved chunks relevant to the query? | Yes |
| `context_recall` | Does context contain enough to answer correctly? | Yes |
| `precision_at_k` | Fraction of top-K retrieved docs that are relevant | No |
| `recall_at_k` | Fraction of relevant docs found in top-K | No |
| `mrr` | Reciprocal rank of first relevant doc | No |
| `ndcg_at_k` | Rank-weighted retrieval quality | No |
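
The four deterministic retrieval metrics in the table follow standard textbook definitions. The sketch below shows those definitions with binary relevance; the function names and the doc-ID inputs are illustrative assumptions, not the harness's own implementation.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved doc IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant doc IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]  # retrieval order, best first (made-up IDs)
relevant = {"d1", "d2"}            # ground-truth relevant IDs (made-up)

print(precision_at_k(ranked, relevant, k=3))
print(recall_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, relevant, k=3))
```

Averaging `reciprocal_rank` over all queries in a dataset gives MRR.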

### Agentic-Era Metrics

For multi-step agents, tool-using systems, and autonomous RAG pipelines:

| Metric | What it measures | LLM judge |
|---|---|---|
| `source_attribution_accuracy` | Did the agent cite sources it actually retrieved? | No — deterministic |
| `agent_faithfulness` | Is every reasoning step faithful to retrieved sources? | Yes |
| `tool_call_accuracy` | Did the agent choose the right tool at the right time? | Yes |
| `retrieval_necessity` | Was retrieval actually needed for this query? | Yes |
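
To make the "deterministic" label concrete, here is one way a citation check can be computed without an LLM: compare the set of sources the agent cited against the set it retrieved. This is a sketch under assumptions. The function name, the use of source IDs, and the handling of answers with no citations are all hypothetical and may differ from how `source_attribution_accuracy` is actually computed.

```python
from typing import Iterable


def source_attribution_sketch(cited: Iterable[str], retrieved: Iterable[str]) -> float:
    """Share of cited source IDs that the agent actually retrieved."""
    cited_ids = set(cited)
    if not cited_ids:
        # Convention chosen for this sketch only: no citations, nothing misattributed.
        return 1.0
    return len(cited_ids & set(retrieved)) / len(cited_ids)


retrieved_ids = ["art-53", "art-99"]  # hypothetical IDs returned by tool calls
cited_ids = ["art-53", "art-15"]      # "art-15" was cited but never retrieved

score = source_attribution_sketch(cited_ids, retrieved_ids)
print(score)
```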

### Metric Groups

```python
# Use pre-defined groups
report = client.evaluate(samples, metric_group="classic")
report = client.evaluate(samples, metric_group="retrieval")
report = client.evaluate(samples, metric_group="agentic_v1")
report = client.evaluate(samples, metric_group="full")  # all metrics
```

---

## Benchmarks

Measured on the built-in 50-sample golden dataset (10 domains):

| Metric | Score | Label |
|---|---|---|
| faithfulness | **0.958** | Excellent |
| answer_relevancy | **0.810** | Good |

---

## LLM Backend

Several metrics use an LLM as a judge. Supported providers:

```bash
# .env
LLM_PROVIDER=gemini       # recommended
GEMINI_API_KEY=your-key

# Or OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=your-key
```

**Cost guidance:** A full classic-metrics pass on 50 samples costs ~$0.05–$0.15 with Gemini Flash or GPT-4o-mini. Source attribution accuracy is deterministic and costs nothing.

**Determinism:** Judge calls run at `temperature=0.0`, which reduces run-to-run variance but does not guarantee identical scores. For CI/CD, flag changes beyond ±0.05 rather than asserting exact scores.
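
A CI gate along these lines can be a few lines of plain Python. The sketch below compares two metric dictionaries shaped like `report["metrics"]` from the Quick Start. The helper name `flag_changes` and the example scores in `current` are assumptions for illustration.

```python
from typing import Dict


def flag_changes(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,
) -> Dict[str, float]:
    """Return {metric: delta} for metrics that moved by more than `tolerance`."""
    flagged = {}
    # Only compare metrics present in both runs.
    for metric in baseline.keys() & current.keys():
        delta = current[metric] - baseline[metric]
        if abs(delta) > tolerance:
            flagged[metric] = delta
    return flagged


baseline = {"faithfulness": 0.958, "answer_relevancy": 0.810}
current = {"faithfulness": 0.900, "answer_relevancy": 0.830}  # made-up new run

flagged = flag_changes(baseline, current)
print(flagged)  # only faithfulness moved by more than 0.05
```

In a pipeline, a non-empty `flagged` dict can fail the job or open a review, depending on whether drops and gains should be treated alike.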

---

## API Reference

```bash
# Evaluate a RAG sample
curl -X POST http://localhost:5001/v1/evaluate \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "samples": [{"question": "What is RAG?",
      "contexts": ["RAG is Retrieval-Augmented Generation."],
      "answer": "RAG combines retrieval with generation."}],
    "metrics": ["faithfulness", "answer_relevancy"]
  }'

# Evaluate an agentic trace
curl -X POST http://localhost:5001/v1/evaluate/agent \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "trace": {
      "question": "What is the GPAI deadline?",
      "final_answer": "GPAI obligations apply from August 2025.",
      "tool_calls": [{"tool_name": "retrieve",
        "tool_input": {"query": "GPAI deadline"},
        "tool_output": "Article 53 obligations apply from August 2025.",
        "step_index": 0}]
    },
    "metrics": ["source_attribution_accuracy", "tool_call_accuracy"]
  }'

# Compare runs
curl -X POST http://localhost:5001/v1/runs/compare \
  -H "X-API-Key: your-key" \
  -d '["run-id-a", "run-id-b"]'
```
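
The first `curl` call above can also be issued from Python without the SDK. The sketch below builds the same `/v1/evaluate` request with the standard library only; the helper name `build_evaluate_request` is hypothetical, and the URL and key are the placeholders used throughout this README.

```python
import json
import urllib.request

API_URL = "http://localhost:5001"  # local server from the Quick Start
API_KEY = "your-key"               # placeholder, as in the curl examples


def build_evaluate_request(samples, metrics, api_url=API_URL, api_key=API_KEY):
    """Build (but do not send) a POST request for /v1/evaluate."""
    payload = json.dumps({"samples": samples, "metrics": metrics}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_url}/v1/evaluate",
        data=payload,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_evaluate_request(
    samples=[{
        "question": "What is RAG?",
        "contexts": ["RAG is Retrieval-Augmented Generation."],
        "answer": "RAG combines retrieval with generation.",
    }],
    metrics=["faithfulness", "answer_relevancy"],
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     report = json.load(response)
```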

---

## EU AI Act Article 15 — partial input, not conformity evidence

`rag-benchmarking` is an **evaluation harness** that measures **accuracy and faithfulness** for RAG and agentic systems. Those two metrics are *one* input among many that an Article 15 conformity assessment will draw on. They are **not** the conformity assessment itself, and the harness does **not** discharge an Article 15 obligation.

```mermaid
graph LR
    RAG["rag-benchmarking\nevaluation harness"]
    FAITH2["Faithfulness measurement\n(LLM-judge)"]
    ANS["Answer-relevancy (LLM-judge)\n+ retrieval metrics (deterministic)"]
    AGENT2["Agentic-trace metrics\n(tool_call_accuracy, source_attribution)"]
    REPORT2["BenchmarkReport\n→ measurement input for\nArticle 15(1) accuracy claims"]

    RAG --> FAITH2 --> REPORT2
    RAG --> ANS --> REPORT2
    RAG --> AGENT2 --> REPORT2

    style RAG fill:#c9a84c,color:#000
    style REPORT2 fill:#2d5a2d,color:#fff
```

### What this tool covers, honestly

- **Article 15(1) accuracy, as declared in the instructions for use (Art. 15(3)).** The harness produces faithfulness, answer-relevancy and retrieval-quality metrics that a provider can cite as the empirical basis for the accuracy figures they declare in the instructions for use. The tool does not declare for you, and it does not certify the figures.

### What this tool does NOT cover

- **Article 15 robustness in the regulatory sense.** Robustness under Art. 15 means resilience to errors, faults and inconsistencies — including adversarial-input resilience. This harness has no perturbation generator, no out-of-distribution detector, no adversarial-passage suite. **If a tool tells you it does Art. 15 robustness with a faithfulness scorer, it is overclaiming.**
- **Article 15 cybersecurity.** Adversarial prompt injection, jailbreak resistance, model-integrity controls. Out of scope. Pair with a runtime AI security control (e.g. AgentShield) for that leg.
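- **Conformity assessment.** High-risk systems go through a conformity assessment under Art. 43, which covers the Art. 15 requirements among others and, depending on the system, is either an internal-control procedure or involves a notified body. A benchmark report is not a substitute for that process.
- **Real-world testing under Art. 60.** The Art. 60 regime for testing in real-world conditions outside regulatory sandboxes is a separate procedure with its own supervisory notifications. Out of scope.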

> **Penalty band, contextually.** Art. 15 obligations route through the Art. 16 provider-obligation chain to **Art. 99(4)** — up to **€15M or 3% of total worldwide annual turnover, whichever is higher** ([EUR-Lex](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689)). This number is here for context, not as a sales hook. The compliance pathway is broader than this tool.

---

## AiExponent Toolchain

rag-benchmarking feeds accuracy evidence into RiskForge for Article 9 risk management:

```mermaid
graph LR
    LCC["LCC\n(Art. 53 licenses)"]
    RAG["rag-benchmarking\n(Art. 15 accuracy)"]
    RF["RiskForge\n(Art. 9 risk management)"]
    TD["TransparencyDeck\n(Art. 13 docs)"]

    LCC -->|"license evidence"| RF
    RAG -->|"benchmark_report.json\naccuracy evidence"| RF
    RF -->|"rmf.json"| TD

    style RAG fill:#c9a84c,color:#000
    style LCC fill:#1e3a5f,color:#fff
    style RF fill:#1e3a5f,color:#fff
    style TD fill:#1e3a5f,color:#fff
```

---

## Configuration

```bash
# .env
LLM_PROVIDER=gemini
GEMINI_API_KEY=...
OPENAI_API_KEY=...

# Vector store (built-in RAG pipeline only)
QDRANT_URL=https://your-cluster.qdrant.io
QDRANT_API_KEY=...

# API authentication
API_KEY=your-secret-key
ENFORCE_API_KEY=true
```
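
Before starting a run, it can help to check that the judge provider and its key are set consistently. The sketch below is a standalone pre-flight check over the variables listed above. The function name is hypothetical and this is not the package's own settings loader.

```python
import os
from typing import Mapping, Optional, Tuple

# Which key each provider needs, taken from the .env example above.
KEY_VARS = {"gemini": "GEMINI_API_KEY", "openai": "OPENAI_API_KEY"}


def resolve_judge_credentials(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (provider, api_key) or raise ValueError if the config is incomplete."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if provider not in KEY_VARS:
        raise ValueError(f"LLM_PROVIDER must be one of {sorted(KEY_VARS)}")
    key_var = KEY_VARS[provider]
    api_key = env.get(key_var, "").strip()
    if not api_key:
        raise ValueError(f"{key_var} must be set when LLM_PROVIDER={provider}")
    return provider, api_key


# Example with an explicit mapping instead of the real environment.
provider, api_key = resolve_judge_credentials(
    {"LLM_PROVIDER": "gemini", "GEMINI_API_KEY": "your-key"}
)
print(provider)
```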

---

## Project Structure

```
src/
  harness/            # Framework-agnostic evaluation harness
    schemas.py        # EvalSample, AgentTrace, BenchmarkReport
    protocol.py       # RAGEvaluable Protocol — the plug-in contract
    runner.py         # EvaluationRunner — orchestrates metrics
    result_store.py   # SQLite persistence
  app/
    api/              # FastAPI endpoints
    eval/             # Metric implementations
    sdk/              # Python SDK (RagEval client)
data/
  golden/qa.jsonl     # 50-sample golden dataset (10 domains)
```

---

## Known Limitations

- English-only benchmark datasets; no multilingual evaluation.
- Custom dataset integration requires manual formatting to the JSONL schema.
- Accuracy metrics only — latency and throughput are not measured.
- LLM-as-judge quality depends on the configured judge model.
- Rate limiting is in-memory and resets on server restart.
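
For the custom-dataset limitation above, the manual formatting step can be scripted. The sketch below serialises samples to JSON Lines, assuming each record uses the `question` / `contexts` / `answer` fields from the Quick Start. The exact schema of `data/golden/qa.jsonl` may include other fields, so treat the field list and the helper name `to_jsonl` as assumptions.

```python
import json
from typing import Iterable, Mapping

# Assumed field set, taken from the Quick Start sample dict.
REQUIRED_FIELDS = ("question", "contexts", "answer")


def to_jsonl(samples: Iterable[Mapping]) -> str:
    """Serialise samples to JSON Lines, one object per line."""
    lines = []
    for index, sample in enumerate(samples):
        missing = [field for field in REQUIRED_FIELDS if field not in sample]
        if missing:
            raise ValueError(f"sample {index} is missing fields: {missing}")
        record = {field: sample[field] for field in REQUIRED_FIELDS}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


jsonl_text = to_jsonl([
    {
        "question": "What is RAG?",
        "contexts": ["RAG stands for Retrieval-Augmented Generation."],
        "answer": "RAG combines retrieval with LLM generation.",
    },
])
print(jsonl_text, end="")
```

Writing `jsonl_text` to a file with `encoding="utf-8"` yields a dataset file with one sample per line.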

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Issues and PRs welcome.

```bash
git clone https://github.com/aiexponenthq/rag-benchmarking
cd rag-benchmarking
pip install -e ".[test]"
pytest
```

---

## License

[Apache 2.0](LICENSE) — free to use, modify, and distribute.

Built by [AI Exponent LLC](https://aiexponent.com) — `hello@aiexponent.com`

---

*Part of the AiExponent open-source AI governance toolchain:
[license-compliance-checker](https://github.com/aiexponenthq/license-compliance-checker) ·
**rag-benchmarking** ·
[RiskForge](https://github.com/aiexponenthq/riskforge)*
