Metadata-Version: 2.4
Name: rageval-ai
Version: 0.1.2
Summary: Standalone RAG evaluation library — run LLM-as-judge evaluation locally with your own API key
Project-URL: Homepage, https://github.com/SKYMOD-Team/llm-evaluation
Project-URL: Documentation, https://github.com/SKYMOD-Team/llm-evaluation/tree/main/sdk#readme
Project-URL: Repository, https://github.com/SKYMOD-Team/llm-evaluation
Project-URL: Issues, https://github.com/SKYMOD-Team/llm-evaluation/issues
Author-email: syorgun891@gmail.com
License: MIT License
        
        Copyright (c) 2026
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: ai,evaluation,langchain,llm,nlp,rag,retrieval-augmented-generation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: httpx[http2]>=0.24.0
Provides-Extra: dev
Requires-Dist: pytest-httpx>=0.21.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.1.0; extra == 'langchain'
Description-Content-Type: text/markdown

# rageval-ai

**Standalone RAG evaluation library — run LLM-as-judge evaluation locally with your own API key.**

Evaluate your RAG pipelines with 9+ metrics, including hallucination detection, answer relevancy, and faithfulness. No server needed — everything runs locally.

[![PyPI version](https://img.shields.io/pypi/v/rageval-ai)](https://pypi.org/project/rageval-ai/)
[![Python](https://img.shields.io/pypi/pyversions/rageval-ai)](https://pypi.org/project/rageval-ai/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Installation

```bash
pip install rageval-ai
```

## Quick Start

```python
import os
from rageval_sdk import evaluate

result = evaluate(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
    contexts=["Paris is the capital and largest city of France."],
    ground_truth="Paris",
    api_key=os.environ["OPENAI_API_KEY"],
)

print(f"Overall Score: {result['overall_score']}")
print(f"Hallucination: {result['hallucination_score']}")
print(f"Faithfulness:  {result['faithfulness']}")
print(f"Relevancy:     {result['answer_relevancy']}")
print(f"Cost:          ${result['cost_usd']}")
```

### Environment Variables

```bash
export OPENAI_API_KEY="sk-your-openai-key"
```

### Custom Configuration

```python
import os
from rageval_sdk import evaluate, EvalConfig

config = EvalConfig(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://api.openai.com/v1",     # or OpenRouter, Azure, etc.
    stage_1_model="gpt-4o",                   # reasoning model
    stage_2_model="gpt-4o-mini",              # JSON conversion model
    rag_metrics_model="gpt-4o-mini",          # RAG metrics model
)

result = evaluate(
    question="What is RAG?",
    answer="RAG is Retrieval-Augmented Generation.",
    config=config,
)
```

### Async Usage

```python
import asyncio
from rageval_sdk import evaluate_trace, EvalConfig

async def main():
    config = EvalConfig(api_key="sk-...")
    result = await evaluate_trace(
        question="What is RAG?",
        answer="RAG is Retrieval-Augmented Generation.",
        contexts=["RAG combines retrieval with generation."],
        config=config,
    )
    print(result["overall_score"])

asyncio.run(main())
```

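Since `evaluate_trace` is a coroutine, multiple traces can also be evaluated concurrently with standard `asyncio` tooling. A minimal sketch, assuming concurrent calls against a single `EvalConfig` are safe:

```python
import asyncio
import os
from rageval_sdk import evaluate_trace, EvalConfig

async def main():
    config = EvalConfig(api_key=os.environ["OPENAI_API_KEY"])
    traces = [
        ("What is RAG?", "RAG is Retrieval-Augmented Generation."),
        ("What is Python?", "Python is a programming language."),
    ]
    # Fan out all evaluations at once; results come back in input order.
    results = await asyncio.gather(*(
        evaluate_trace(question=q, answer=a, config=config)
        for q, a in traces
    ))
    for r in results:
        print(r["overall_score"])

asyncio.run(main())
```
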
### Background Evaluation (Non-blocking)

The `RagEvaluator` runs evaluations in background threads so your RAG pipeline is never blocked:

```python
import os
from rageval_sdk import RagEvaluator

evaluator = RagEvaluator(api_key=os.environ["OPENAI_API_KEY"], max_workers=4)

# Your RAG pipeline runs normally — evaluation happens in background
for query in user_queries:
    answer, contexts = my_rag_pipeline(query)  # your existing code

    # Non-blocking: submits and returns immediately
    evaluator.submit(
        question=query,
        answer=answer,
        contexts=contexts,
    )

# Check how many are done
print(f"Completed: {evaluator.completed_count}, Pending: {evaluator.pending_count}")

# When ready, collect all results
results = evaluator.wait()
for r in results:
    print(f"Score: {r['overall_score']}, Hallucination: {r['hallucination_score']}")

evaluator.shutdown()
```

### Batch Evaluation

Evaluate multiple traces at once:

```python
import os
from rageval_sdk import RagEvaluator

with RagEvaluator(api_key=os.environ["OPENAI_API_KEY"]) as evaluator:
    results = evaluator.evaluate_batch([
        {
            "question": "What is RAG?",
            "answer": "RAG is Retrieval-Augmented Generation.",
            "contexts": ["RAG combines retrieval with generation."],
        },
        {
            "question": "What is Python?",
            "answer": "Python is a programming language.",
            "contexts": ["Python was created by Guido van Rossum."],
        },
    ])

    for r in results:
        print(f"Score: {r['overall_score']}")
```

## Features

- **Standalone** — No server needed, runs entirely locally
- **Background Evaluation** — Non-blocking evaluation with `RagEvaluator`
- **Batch Support** — Evaluate multiple traces concurrently
- **9+ Metrics** — Hallucination, relevancy, faithfulness, completeness, and more
- **Parallel Pipeline** — Stage 1 + Stage 2 + RAG metrics run concurrently
- **OpenAI Compatible** — Works with OpenAI, OpenRouter, Azure, or any compatible API (see the sketch after this list)
- **Retry & Circuit Breaker** — Production-grade reliability
- **Typed** — Full type hints with `py.typed` marker
- **Lightweight** — Only `httpx` as required dependency

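As an example of the OpenAI-compatible point above, here is a sketch that targets OpenRouter. The base URL and the `openai/...` model slugs follow OpenRouter's naming conventions, but treat them as illustrative values rather than tested defaults:

```python
import os
from rageval_sdk import evaluate, EvalConfig

# Any OpenAI-compatible endpoint works: only base_url, api_key,
# and the model names change.
openrouter_config = EvalConfig(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
    stage_1_model="openai/gpt-4o",
    stage_2_model="openai/gpt-4o-mini",
    rag_metrics_model="openai/gpt-4o-mini",
)

result = evaluate(
    question="What is RAG?",
    answer="RAG is Retrieval-Augmented Generation.",
    config=openrouter_config,
)
```
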
## Evaluation Metrics

| Metric | Description |
|--------|-------------|
| `overall_score` | Weighted combination of all metrics |
| `hallucination_score` | Detects fabricated information (claim-level) |
| `faithfulness` | Ensures answer is grounded in context |
| `answer_relevancy` | Measures answer relevance to the question |
| `completeness` | Key-point coverage verification |
| `context_precision` | Evaluates quality of retrieved contexts |
| `context_recall` | Checks if all needed facts are retrieved |
| `citation_check` | Validates source citations against contexts |
| `clarity` | Answer clarity and readability |
| `coherence` | Logical flow and consistency |
| `helpfulness` | How actionable/useful the answer is |
| `is_off_topic` | Off-topic detection |
| `is_deflection` | Deflection detection ("I don't know") |

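All of these are returned in the result dictionary under the keys shown above (the earlier examples read `overall_score`, `hallucination_score`, `faithfulness`, and `answer_relevancy` this way). A quick way to dump a full evaluation, using `.get()` since some keys may be absent depending on the inputs you provide:

```python
import os
from rageval_sdk import evaluate

result = evaluate(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
    contexts=["Paris is the capital and largest city of France."],
    api_key=os.environ["OPENAI_API_KEY"],
)

# Metric names taken from the table above.
for metric in (
    "overall_score", "hallucination_score", "faithfulness",
    "answer_relevancy", "completeness", "context_precision",
    "context_recall", "citation_check", "clarity", "coherence",
    "helpfulness", "is_off_topic", "is_deflection",
):
    print(f"{metric}: {result.get(metric)}")
```
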
## API Reference

### `evaluate()`

```python
result = evaluate(
    question,                    # str — the user question
    answer,                      # str — the LLM answer
    contexts=None,               # list[str] — retrieved context passages
    ground_truth=None,           # str — expected correct answer
    api_key=None,                # str — your OpenAI API key
    config=None,                 # EvalConfig — full configuration
    **config_overrides,          # additional EvalConfig fields
)
```

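Per the comment above, `**config_overrides` accepts `EvalConfig` fields directly, so simple tweaks don't require building a config object. A sketch:

```python
import os
from rageval_sdk import evaluate

result = evaluate(
    question="What is RAG?",
    answer="RAG is Retrieval-Augmented Generation.",
    api_key=os.environ["OPENAI_API_KEY"],
    stage_1_model="gpt-4o",    # EvalConfig field passed inline
    timeout_seconds=60.0,      # likewise
)
```
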
### `EvalConfig`

```python
config = EvalConfig(
    api_key="sk-...",                           # Required
    base_url="https://api.openai.com/v1",       # LLM endpoint
    stage_1_model="gpt-4o",                     # Reasoning model
    stage_2_model="gpt-4o-mini",                # JSON model
    rag_metrics_model="gpt-4o-mini",            # RAG metrics model
    timeout_seconds=120.0,                      # Request timeout
)
```

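One `EvalConfig` can be built once and reused across calls. A sketch, where `my_dataset` stands in for your own iterable of (question, answer, contexts) tuples and the all-mini model choice is just an illustrative cost-saving setup:

```python
import os
from rageval_sdk import evaluate, EvalConfig

# One config, many evaluations; mini models keep bulk runs cheap.
bulk_config = EvalConfig(
    api_key=os.environ["OPENAI_API_KEY"],
    stage_1_model="gpt-4o-mini",
    stage_2_model="gpt-4o-mini",
    rag_metrics_model="gpt-4o-mini",
    timeout_seconds=60.0,
)

for question, answer, contexts in my_dataset:  # your own data
    result = evaluate(
        question=question,
        answer=answer,
        contexts=contexts,
        config=bulk_config,
    )
    print(result["overall_score"])
```
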
## License

MIT — see [LICENSE](LICENSE) for details.
