Metadata-Version: 2.3
Name: openevalkit
Version: 0.1.6
Summary: Open evaluation kit for LLM systems
Keywords: llm,agent,evaluation,judge,nlp,ml,evals,generative
Author: Yonah Byarugaba
Author-email: Yonah Byarugaba <yonahgraphics@gmail.com>
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: litellm>=1.81.9
Requires-Dist: numpy>=2.0.2
Requires-Dist: pandas>=2.3.3
Requires-Python: >=3.9
Project-URL: Homepage, https://github.com/yonahgraphics/openevalkit
Project-URL: Repository, https://github.com/yonahgraphics/openevalkit
Project-URL: Bug Tracker, https://github.com/yonahgraphics/openevalkit/issues
Project-URL: Changelog, https://github.com/yonahgraphics/openevalkit/releases
Project-URL: Documentation, https://github.com/yonahgraphics/openevalkit#readme
Description-Content-Type: text/markdown

# OpenEvalKit

[![PyPI version](https://img.shields.io/pypi/v/openevalkit.svg)](https://pypi.org/project/openevalkit/)
[![Python](https://img.shields.io/pypi/pyversions/openevalkit.svg)](https://pypi.org/project/openevalkit/)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

A production-grade Python framework for evaluating LLM systems with traditional scorers, LLM judges (OpenAI, Anthropic, Ollama, and 100+ models via LiteLLM), ensemble aggregation, and smart caching for cost-effective testing.

## Table of Contents

- [Why OpenEvalKit?](#why-openevalkit)
- [Quick Start](#quick-start)
- [Installation](#installation)
  - [From PyPI](#from-pypi)
  - [From Source](#from-source)
- [Features](#features)
- [Examples](#examples)
  - [Traditional Scorers](#traditional-scorers)
  - [LLM Judges](#llm-judges)
  - [Ensemble Evaluation](#ensemble-evaluation)
  - [Using Local Models (Ollama)](#using-local-models-ollama)
  - [Loading Data](#loading-data)
- [Configuration](#configuration)
- [Built-in Scorers](#built-in-scorers)
- [Supported Models](#supported-models)
- [Custom Scorers](#custom-scorers)
- [Contributing](#contributing)
- [License](#license)

## Why OpenEvalKit?

- **Production Ready**: Smart caching with LRU eviction, parallel execution, comprehensive error handling
- **Cost Effective**: Intelligent caching avoids redundant LLM API calls, saving you money
- **Flexible Model Support**: Works with 100+ models via [LiteLLM](https://github.com/BerriAI/litellm): OpenAI, Anthropic, Google, and local models via Ollama
- **Reliable Evaluation**: Ensemble judges with configurable aggregation methods (average, median, majority vote, unanimous)
- **Developer Friendly**: Clean API, extensive documentation, comprehensive type hints
- **Battle Tested**: Comprehensive test suite, proven in production environments
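
The ensemble aggregation methods listed above can be sketched conceptually. This is a standalone illustration of how per-judge scores might be combined, not the library's internal implementation:
```python
from statistics import mean, median

def aggregate(scores, method="average", threshold=0.5):
    """Combine per-judge scores (floats in [0, 1]) into one ensemble score."""
    if method == "average":
        return mean(scores)
    if method == "median":
        return median(scores)
    if method == "majority_vote":
        # Pass if most judges score at or above the threshold.
        passing = sum(s >= threshold for s in scores)
        return 1.0 if passing > len(scores) / 2 else 0.0
    if method == "unanimous":
        # Pass only if every judge scores at or above the threshold.
        return 1.0 if all(s >= threshold for s in scores) else 0.0
    raise ValueError(f"unknown method: {method}")

scores = [0.9, 0.8, 0.4]
print(aggregate(scores, "majority_vote"))  # 1.0 (two of three judges pass)
print(aggregate(scores, "unanimous"))      # 0.0 (one judge fails)
```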

## Quick Start

Evaluate your LLM outputs in just a few lines:
```python
from openevalkit import Dataset, Run, evaluate
from openevalkit.scorers import ExactMatch, BLEU, TokenF1

# Create your dataset
dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4"),
    Run(id="2", input="Capital of France?", output="Paris", reference="Paris"),
    Run(id="3", input="Translate hello", output="hola amigo", reference="hola"),
])

# Evaluate with multiple scorers
results = evaluate(
    dataset,
    scorers=[ExactMatch(), BLEU(max_n=2), TokenF1()]
)
print(results.aggregates)
# {'exact_match': 0.6667, 'bleu_mean': 0.8165, 'token_f1_mean': 0.8889, ...}
```

### With LLM Judges
```python
from openevalkit import Dataset, Run, evaluate
from openevalkit.judges import LLMJudge, LLMConfig, Rubric

dataset = Dataset([
    Run(id="1", input="Explain Python", output="Python is a high-level programming language..."),
    Run(id="2", input="What is AI?", output="AI stands for Artificial Intelligence..."),
])

rubric = Rubric(
    criteria=["helpfulness", "accuracy", "clarity"],
    scale="0-1",
    weights={"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
)

judge = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o"),
    rubric=rubric
)

results = evaluate(dataset, judges=[judge])
print(results.aggregates)
# {'llm_judge_gpt-4o_overall_mean': 0.85, ...}
```

## Installation

### From PyPI
```bash
pip install openevalkit
```

**Recommended**: Use a virtual environment to avoid dependency conflicts:
```bash
python -m venv openevalkit_env
source openevalkit_env/bin/activate  # On Windows: openevalkit_env\Scripts\activate
pip install openevalkit
```

### From Source

OpenEvalKit uses [uv](https://github.com/astral-sh/uv) for fast dependency management:
```bash
# Clone the repository
git clone https://github.com/yonahgraphics/openevalkit.git
cd openevalkit

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies (includes dev dependencies)
uv sync --dev
```

**Traditional pip installation:**
```bash
git clone https://github.com/yonahgraphics/openevalkit.git
cd openevalkit
pip install -e .
```

## Features

- **17 Built-in Scorers** - Text similarity, token-level, structural, semantic, and performance metrics
- **LLM Judges** - Evaluate quality with any LLM (100+ models supported)
- **Ensemble Judges** - Combine multiple judges for reliable evaluation
- **Smart Caching** - Automatic result caching with LRU eviction
- **Parallel Execution** - Fast evaluation with configurable concurrency
- **Flexible Data Loading** - JSONL, CSV, or in-memory datasets
- **Comprehensive Configuration** - Timeouts, retries, seed control, progress bars
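
The cache's internals aren't shown here, but LRU eviction as named in the feature list works like this conceptual sketch (a minimal standalone class, not OpenEvalKit's actual cache):
```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(max_size=2)
cache.put("run1", 0.9)
cache.put("run2", 0.8)
cache.get("run1")         # touch run1, so run2 is now least recently used
cache.put("run3", 0.7)    # evicts run2
print(cache.get("run2"))  # None
```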

## Examples

### Traditional Scorers
```python
from openevalkit import Dataset, evaluate
from openevalkit.scorers import ExactMatch, RegexMatch, ContainsKeywords, BLEU, TokenF1

# Create dataset
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected"
)

# Evaluate with multiple scorers
results = evaluate(
    dataset,
    scorers=[
        ExactMatch(),
        RegexMatch(pattern=r'\d+'),  # Contains numbers
        ContainsKeywords(keywords=["python", "code"], ignore_case=True),
        BLEU(),                      # N-gram precision
        TokenF1(),                   # Token-level F1
    ]
)

print(results.aggregates)
# {'exact_match': 0.85, 'bleu_mean': 0.72, 'token_f1_mean': 0.91, ...}
```

### LLM Judges
```python
from openevalkit.judges import LLMJudge, LLMConfig, Rubric

# Define what to evaluate
rubric = Rubric(
    criteria=["correctness", "clarity", "completeness"],
    scale="0-1",
    criteria_descriptions={
        "correctness": "Factually accurate with no errors",
        "clarity": "Easy to understand and well-structured",
        "completeness": "Addresses all aspects of the question"
    },
    weights={"correctness": 3.0, "clarity": 1.5, "completeness": 1.5}
)

# Use OpenAI
judge_gpt = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o", temperature=0.0),
    rubric=rubric
)

# Use Anthropic
judge_claude = LLMJudge(
    llm_config=LLMConfig(model="claude-3-5-sonnet-20241022"),
    rubric=rubric
)

results = evaluate(dataset, judges=[judge_gpt, judge_claude])
```

### Ensemble Evaluation

Combine multiple judges for more reliable scores:
```python
from openevalkit.judges import EnsembleJudge

ensemble = EnsembleJudge(
    judges=[
        LLMJudge(LLMConfig(model="gpt-4o"), rubric),
        LLMJudge(LLMConfig(model="claude-3-5-sonnet-20241022"), rubric),
        LLMJudge(LLMConfig(model="gpt-4o-mini"), rubric),
    ],
    method="average",  # Options: "average", "median", "majority_vote", "unanimous"
    min_agreement=0.7,  # Warn if judges disagree
    n_jobs=3  # Parallel evaluation
)

results = evaluate(dataset, judges=[ensemble])
```

### Using Local Models (Ollama)

Run evaluations completely offline with local models:
```python
# First, install and start Ollama:
# curl -fsSL https://ollama.com/install.sh | sh
# ollama pull llama3
# ollama serve

judge = LLMJudge(
    llm_config=LLMConfig(
        model="ollama/llama3",
        api_base="http://localhost:11434"
    ),
    rubric=rubric
)

results = evaluate(dataset, judges=[judge])
```

### Loading Data

**From JSONL:**
```python
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected",
    metadata_fields=["user_id"],
    metrics_fields=["latency"]
)
```

**From CSV:**
```python
dataset = Dataset.from_csv(
    "data.csv",
    input_col="question",
    output_col="answer",
    reference_col="expected"
)
```

**From code:**
```python
from openevalkit import Run

dataset = Dataset([
    Run(id="1", input="Q1", output="A1", reference="A1"),
    Run(id="2", input="Q2", output="A2", reference="A2"),
])
```

## Configuration
```python
from openevalkit import EvalConfig

config = EvalConfig(
    # Execution
    concurrency=10,           # Parallel runs
    timeout=30.0,             # Timeout per run (seconds)
    
    # Reproducibility
    seed=42,                  # For deterministic results
    
    # Caching
    cache_enabled=True,       # Enable smart caching
    cache_max_size_mb=500,    # Cache size limit
    cache_max_age_days=30,    # Auto-cleanup old entries
    
    # Error handling
    fail_fast=False,          # Continue on errors
    
    # Output
    verbose=True,             # Show detailed progress
    progress_bar=True,        # Show progress bars
)

results = evaluate(dataset, judges=[judge], config=config)
```

## Built-in Scorers

### Reference-Based
- **ExactMatch** - Exact string comparison with reference

### Text Similarity
- **LevenshteinDistance** - Normalized edit distance (0-1)
- **FuzzyMatch** - Fuzzy string similarity via `difflib`
- **BLEU** - N-gram precision with brevity penalty
- **ROUGE** - ROUGE-L (longest common subsequence F1)
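
As a rough illustration of what the similarity scorers above compute (not necessarily the library's exact formulas), here is a `difflib` ratio and a normalized edit distance in plain Python:
```python
import difflib

def fuzzy_ratio(a, b):
    # difflib-based similarity in [0, 1], the idea behind FuzzyMatch.
    return difflib.SequenceMatcher(None, a, b).ratio()

def normalized_levenshtein(a, b):
    # Edit distance scaled to [0, 1], where 1.0 means identical strings.
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

print(normalized_levenshtein("kitten", "sitting"))  # edit distance 3 -> 1 - 3/7
```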

### Token-Level
- **TokenF1** - Token overlap F1 score (precision/recall)
- **LengthRatio** - Output-to-reference length ratio
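
Token-level F1 is precision/recall over overlapping tokens. A standalone sketch (simple whitespace tokenization; the library's tokenizer may differ):
```python
from collections import Counter

def token_f1(output, reference):
    # Token-overlap F1 between whitespace-tokenized strings.
    out_tokens = output.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("hola amigo", "hola"))  # precision 0.5, recall 1.0 -> F1 = 2/3
```

This matches the Quick Start example, where `"hola amigo"` against reference `"hola"` contributes a partial score rather than an exact-match failure.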

### Rule-Based
- **RegexMatch** - Pattern matching with regex
- **ContainsKeywords** - Check for required keywords
- **JSONValid** - Validate JSON output
- **StartsWith** - Check output prefix
- **EndsWith** - Check output suffix
- **LengthCheck** - Validate output length bounds
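
The rule-based checks boil down to simple predicates. This standalone sketch (plain stdlib, not the library's code) illustrates what each one tests:
```python
import json
import re

output = '{"answer": 42}'

# JSONValid: does the output parse as JSON?
json_valid = True
try:
    json.loads(output)
except ValueError:
    json_valid = False

has_number = bool(re.search(r"\d+", output))                  # RegexMatch
has_keywords = all(k in output.lower() for k in ["answer"])   # ContainsKeywords
starts_ok = output.startswith("{")                            # StartsWith
length_ok = 5 <= len(output) <= 100                           # LengthCheck

print(json_valid, has_number, has_keywords, starts_ok, length_ok)
```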

### Performance Metrics
- **Latency** - Response time from `run.metrics`
- **Cost** - API cost from `run.metrics`
- **TokenCount** - Token usage (exact or estimated)

### Semantic
- **CosineSimilarity** - Embedding-based semantic similarity (via litellm)
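
Once embeddings are obtained (via litellm in the scorer above), cosine similarity itself is a short computation. A minimal sketch on plain vectors:
```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors: 1.0 = same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical directions)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```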

## Supported Models

Via [LiteLLM](https://github.com/BerriAI/litellm), OpenEvalKit supports 100+ models:

- **OpenAI**: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
- **Anthropic**: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
- **Google**: gemini-pro, gemini-1.5-pro, gemini-1.5-flash
- **Local (Ollama)**: llama3, mistral, phi, qwen
- **Cohere, Replicate, HuggingFace, and more**

## Custom Scorers

Create your own scorers:
```python
from openevalkit.scorers.base import Scorer
from openevalkit import Score

class SentimentScorer(Scorer):
    name = "sentiment"
    requires_reference = False
    cacheable = True  # Cache expensive computations
    
    def score(self, run):
        # Your scoring logic here
        sentiment = analyze_sentiment(run.output)  # Your function
        return Score(
            value=sentiment,
            reason=f"Detected sentiment: {sentiment}",
            metadata={"analyzer": "custom"}
        )

# Use it
results = evaluate(dataset, scorers=[SentimentScorer()])
```

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

**Quick start for contributors:**
```bash
# Clone and setup
git clone https://github.com/yonahgraphics/openevalkit.git
cd openevalkit
uv sync --dev

# Run tests
uv run pytest tests/

# Lint code
ruff check .
```

## License

MIT License - see [LICENSE](LICENSE) for details.

---

**Made with love for the LLM and Agent evaluation community**

**Star us on GitHub** if you find OpenEvalKit useful!
