Metadata-Version: 2.4
Name: rag-eval-gate
Version: 0.1.0
Summary: CI/CD-integrated RAG evaluation pipeline — quality gate for AI chatbots using Ragas + Groq LLM judge
Author: Manik Bodamwad
License: MIT
Project-URL: Homepage, https://github.com/manikbodamwad/rag-eval
Project-URL: Repository, https://github.com/manikbodamwad/rag-eval
Project-URL: Issues, https://github.com/manikbodamwad/rag-eval/issues
Keywords: rag,evaluation,llm,ragas,ci-cd,quality-gate,groq
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: ragas>=0.2.0
Requires-Dist: langchain>=0.3.0
Requires-Dist: langchain-community>=0.3.0
Requires-Dist: langchain-huggingface>=0.1.0
Requires-Dist: faiss-cpu>=1.8.0
Requires-Dist: litellm>=1.40.0
Requires-Dist: datasets>=2.20.0
Requires-Dist: click>=8.1.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: requests>=2.32.0
Requires-Dist: rich>=13.0.0
Requires-Dist: sentence-transformers>=3.0.0
Requires-Dist: groq>=0.9.0
Provides-Extra: dev
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"

# rag-eval

A CI/CD-integrated evaluation pipeline for RAG systems. 

[![PyPI version](https://badge.fury.io/py/rag-eval-gate.svg)](https://badge.fury.io/py/rag-eval-gate)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![RAG Eval CI](https://github.com/manikbodamwad/rag-eval/actions/workflows/rag_eval.yml/badge.svg)](https://github.com/manikbodamwad/rag-eval/actions/workflows/rag_eval.yml)

`rag-eval` acts as a quality gate for your RAG applications. It evaluates each pull request and can block the merge when output quality drops below the thresholds you define.

## How it works

When a pull request is opened, the GitHub Action:
1. Installs the `rag-eval` package.
2. Loads a golden evaluation dataset (from Hugging Face or a local file).
3. Runs the dataset through your Mock RAG pipeline.
4. Evaluates the outputs using Ragas metrics.
5. Checks scores against your defined thresholds in `eval_config.yaml`.
6. Pushes metrics to Grafana for trend tracking.
7. Posts a summary comment on the Pull Request.
8. Fails the CI job if any metric drops below the threshold.
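
The threshold check and CI failure in steps 5 and 8 can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: the function name `failing_metrics` and the choice to treat a missing score as a failure are assumptions, while the `<metric>_min` key layout follows the `thresholds` section of `eval_config.yaml`.

```python
from typing import Dict, List


def failing_metrics(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose score is below its minimum.

    `failing_metrics` is a hypothetical helper; the `<metric>_min` keys
    mirror the `thresholds` section of `eval_config.yaml`.
    """
    failures = []
    for key, minimum in thresholds.items():
        metric = key.removesuffix("_min")  # "faithfulness_min" -> "faithfulness"
        score = scores.get(metric)
        # Assumption: a missing score counts as a failure, so the gate fails closed.
        if score is None or score < minimum:
            failures.append(metric)
    return failures


thresholds = {"faithfulness_min": 0.75, "answer_correctness_min": 0.65}
scores = {"faithfulness": 0.81, "answer_correctness": 0.60}

failed = failing_metrics(scores, thresholds)
exit_code = 1 if failed else 0  # a non-zero exit code is what fails a CI job
print(failed, exit_code)
```

A score exactly equal to its minimum passes, matching the `≥` thresholds used throughout this README.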

## Evaluation Metrics

| Metric | What It Measures | Default Threshold |
|--------|------------------|-------------------|
| **Faithfulness** | Answers are grounded in retrieved context | ≥ 0.75 |
| **Context Relevance** | Retrieved context is relevant to the question | ≥ 0.70 |
| **Answer Correctness** | Accuracy vs ground truth | ≥ 0.65 |
| **Token Efficiency** | `correctness / log(1 + tokens)` | ≥ 0.50 |

The default LLM Judge is `groq/llama-3.3-70b-versatile` via LiteLLM.
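
To make the Token Efficiency formula concrete, here is a small sketch of `correctness / log(1 + tokens)`. The function name is hypothetical, and the use of the natural logarithm is an assumption: the table does not state the base of the logarithm or how tokens are counted.

```python
import math


def token_efficiency(correctness: float, tokens: int) -> float:
    """Compute correctness / log(1 + tokens).

    Assumption: natural logarithm; the README does not state the base.
    """
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return correctness / math.log(1 + tokens)


# Same correctness, different answer lengths: the longer answer scores lower.
concise = token_efficiency(0.8, 20)
verbose = token_efficiency(0.8, 200)
print(round(concise, 3), round(verbose, 3))
```

Because the token count enters through a logarithm, doubling the answer length lowers the score far less than halving the correctness would.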

## Quick Start

```bash
# Install
pip install rag-eval-gate

# Set API key
export GROQ_API_KEY="your_api_key"

# Run evaluation
rag-eval run

# View report
rag-eval report
```

### Try the Hallucination Demo 🚨
Want to see `rag-eval` catch a hallucinating AI in real time? We built a terminal demo that intentionally forces the mock RAG pipeline to hallucinate an answer about "RLHF", so you can watch the quality gate flag it:

```bash
# Make sure GROQ_API_KEY is exported, then run:
python examples/demo.py
```

## GitHub Actions Setup

Add this workflow to `.github/workflows/rag_eval.yml`:

```yaml
name: RAG Evaluation
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install rag-eval-gate
      - run: rag-eval run --config eval_config.yaml
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
```

Ensure you set `GROQ_API_KEY` in your GitHub repository secrets.

## Configuration

You can customize the pass thresholds and the dataset source in `eval_config.yaml`:

```yaml
thresholds:
  faithfulness_min: 0.75
  context_relevance_min: 0.70
  answer_correctness_min: 0.65
  token_efficiency_min: 0.50

dataset:
  hf_repo: "manikbodamwad/rag-eval-golden"
```

## Local Development

```bash
git clone https://github.com/manikbodamwad/rag-eval
cd rag-eval
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

cp .env.example .env

# Run local evaluation
rag-eval run

# View formatted report
rag-eval report

# Run unit tests
python -m pytest tests/
```

## Golden Dataset

The default test set is hosted at `manikbodamwad/rag-eval-golden` on Hugging Face. To use your own dataset, create a JSONL file with one JSON object per line, each following this schema:

```jsonl
{"question": "What is X?", "ground_truth": "X is ...", "reference_context": "The passage that answers this..."}
```

Then specify the local path or your own HF repo in `eval_config.yaml`.
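
Before pointing the config at a custom file, it can help to check that every record carries the three fields shown in the schema. The sketch below uses only the standard library; `parse_golden_jsonl` and `REQUIRED_FIELDS` are hypothetical names, not part of the `rag-eval` API.

```python
import json
from typing import Dict, List

# Field names taken from the JSONL schema shown in this README.
REQUIRED_FIELDS = ("question", "ground_truth", "reference_context")


def parse_golden_jsonl(text: str) -> List[Dict[str, str]]:
    """Parse JSONL text and check that each record has the required fields."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        records.append(record)
    return records


sample = (
    '{"question": "What is X?", "ground_truth": "X is ...", '
    '"reference_context": "The passage that answers this..."}\n'
)
records = parse_golden_jsonl(sample)
print(len(records))
```

Reporting the line number makes a malformed record easy to locate in a large dataset.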

## License

MIT License.
