Metadata-Version: 2.3
Name: openevalkit
Version: 0.1.0
Summary: Open evaluation kit for LLM systems
Keywords: llm,agent,evaluation,judge,nlp,ml,evals,generative
License: MIT
Requires-Dist: litellm>=1.81.9
Requires-Dist: numpy>=2.0.2
Requires-Dist: pandas>=2.3.3
Requires-Python: >=3.9
Description-Content-Type: text/markdown

# OpenEvalKit

**Universal evaluation framework for LLM systems**

[![PyPI](https://img.shields.io/pypi/v/openevalkit.svg)](https://pypi.org/project/openevalkit/)
[![Python](https://img.shields.io/pypi/pyversions/openevalkit.svg)](https://pypi.org/project/openevalkit/)

OpenEvalKit is a production-grade framework for evaluating LLM systems with traditional metrics, LLM-as-a-judge, and ensemble evaluation.

## Features

- 📊 **Traditional Scorers** - ExactMatch, Latency, Cost, TokenCount, RegexMatch, JSONValid, ContainsKeywords
- 🤖 **LLM Judges** - Use any LLM (OpenAI, Anthropic, Ollama, 100+ models) to evaluate quality
- 🎯 **Ensemble Judges** - Combine multiple judges for more reliable evaluation
- 💾 **Smart Caching** - Automatic caching with LRU eviction (saves API costs)
- ⚡ **Parallel Execution** - Fast evaluation with configurable concurrency
- 🔧 **Flexible** - Custom scorers, judges, and rubrics

## Installation
```bash
pip install openevalkit
```

## Quick Start

### Loading Datasets
```python
from openevalkit import Dataset

# From JSONL
dataset = Dataset.from_jsonl(
    "data.jsonl",
    input_field="question",
    output_field="answer",
    reference_field="expected"
)

# From CSV
dataset = Dataset.from_csv(
    "data.csv",
    input_col="question",
    output_col="answer",
    reference_col="expected",
    metadata_cols=["user_id"],
    metrics_cols=["latency"]
)

# From list
from openevalkit import Run
dataset = Dataset([
    Run(id="1", input="What is 2+2?", output="4", reference="4"),
    Run(id="2", input="What is 3+3?", output="6", reference="6"),
])
```

### Evaluate with Traditional Scorers
```python
from openevalkit import evaluate
from openevalkit.scorers import ExactMatch, RegexMatch, JSONValid, ContainsKeywords

# Exact match
results = evaluate(dataset, scorers=[ExactMatch()])
print(results.aggregates)
# {'exact_match': 1.0}

# Regex pattern matching
scorer = RegexMatch(pattern=r'\d+')  # Check if output contains numbers
results = evaluate(dataset, scorers=[scorer])

# JSON validation
json_scorer = JSONValid()
results = evaluate(dataset, scorers=[json_scorer])

# Keyword detection
keyword_scorer = ContainsKeywords(keywords=["python", "code"], ignore_case=True)
results = evaluate(dataset, scorers=[keyword_scorer])
```

### Evaluate with LLM Judge
```python
from openevalkit.judges import LLMJudge, LLMConfig, Rubric

# Create dataset
dataset = Dataset([
    Run(id="1", input="Explain Python", output="Python is a programming language..."),
])

# Create rubric
rubric = Rubric(
    criteria=["helpfulness", "accuracy", "clarity"],
    scale="0-1",
    weights={"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
)

# Create judge
judge = LLMJudge(
    llm_config=LLMConfig(model="gpt-4o"),
    rubric=rubric
)

# Evaluate
results = evaluate(dataset, judges=[judge])
print(results.aggregates)
# {'llm_judge_gpt-4o_score': 0.85, 'llm_judge_gpt-4o_helpfulness': 0.9, ...}
```
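Assuming the overall score is a weight-normalized average of the per-criterion scores (an assumption based on the `weights` field above; check the `Rubric` docstrings for the exact aggregation), the combination can be sketched as:

```python
# Sketch of a weighted rubric aggregate (assumption: weight-normalized
# average; the actual LLMJudge aggregation may differ).
def weighted_score(criterion_scores: dict, weights: dict) -> float:
    total_weight = sum(weights.get(c, 1.0) for c in criterion_scores)
    weighted_sum = sum(s * weights.get(c, 1.0) for c, s in criterion_scores.items())
    return weighted_sum / total_weight

# Using the weights from the rubric above:
scores = {"helpfulness": 0.9, "accuracy": 0.8, "clarity": 1.0}
weights = {"helpfulness": 2.0, "accuracy": 3.0, "clarity": 1.0}
print(round(weighted_score(scores, weights), 3))  # 0.867
```

With these weights, accuracy contributes half of the total weight, so a weak accuracy score pulls the aggregate down more than a weak clarity score.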

### Ensemble Evaluation (Multiple Judges)
```python
from openevalkit.judges import EnsembleJudge

# Combine multiple judges for more reliable evaluation
ensemble = EnsembleJudge(
    judges=[
        LLMJudge(LLMConfig(model="gpt-4o"), rubric),
        LLMJudge(LLMConfig(model="claude-3-5-sonnet-20241022"), rubric),
        LLMJudge(LLMConfig(model="gpt-4o-mini"), rubric),
    ],
    method="average",  # or "median", "majority_vote", "unanimous"
    n_jobs=3  # Parallel execution
)

results = evaluate(dataset, judges=[ensemble])
```
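The four aggregation methods can be sketched as follows. This is a simplified stand-in, assuming each judge returns a float in [0, 1] and that `majority_vote` and `unanimous` treat scores >= 0.5 as a pass; the library's actual semantics may differ.

```python
import statistics

# Simplified sketch of ensemble aggregation (assumptions: scores are floats
# in [0, 1]; "majority_vote" and "unanimous" treat >= 0.5 as a pass).
def aggregate(scores: list, method: str = "average") -> float:
    if method == "average":
        return sum(scores) / len(scores)
    if method == "median":
        return statistics.median(scores)
    if method == "majority_vote":
        passes = sum(s >= 0.5 for s in scores)
        return 1.0 if passes > len(scores) / 2 else 0.0
    if method == "unanimous":
        return 1.0 if all(s >= 0.5 for s in scores) else 0.0
    raise ValueError(f"unknown method: {method}")

print(aggregate([0.5, 1.0, 0.0], "average"))        # 0.5
print(aggregate([0.5, 1.0, 0.0], "majority_vote"))  # 1.0
```

`median` is more robust to a single outlier judge than `average`, while `unanimous` is the strictest setting: one dissenting judge fails the run.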

## Configuration
```python
from openevalkit import EvalConfig

config = EvalConfig(
    concurrency=10,           # Parallel runs
    cache_enabled=True,       # Cache results (saves API costs)
    cache_max_size_mb=500,    # Cache size limit
    timeout=30.0,             # Timeout per run
    seed=42,                  # Reproducible results
    verbose=True,             # Show progress
)

results = evaluate(dataset, judges=[judge], config=config)
```

## Built-in Scorers

### String Matching
- **ExactMatch** - Exact string comparison with reference
- **RegexMatch** - Pattern matching with regex
- **ContainsKeywords** - Check for required keywords

### Structure Validation
- **JSONValid** - Validate JSON output

### Performance Metrics
- **Latency** - Response time from run.metrics
- **Cost** - API cost from run.metrics
- **TokenCount** - Token usage (exact or estimated)
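The performance scorers read values attached to each run's `metrics` dict. A minimal stand-in illustrates the idea; `SimpleRun` and `latency_score` are hypothetical names for this sketch, not part of the library, whose real `Run` and scorer classes may behave differently:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins to illustrate metrics-based scoring; the real
# library provides Run and the Latency/Cost/TokenCount scorers.
@dataclass
class SimpleRun:
    output: str
    metrics: dict = field(default_factory=dict)

def latency_score(run: SimpleRun) -> float:
    # Return the recorded response time in seconds, or 0.0 if absent.
    return float(run.metrics.get("latency", 0.0))

run = SimpleRun(output="4", metrics={"latency": 0.42, "cost": 0.0003})
print(latency_score(run))  # 0.42
```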

## Supported Models

Via [LiteLLM](https://github.com/BerriAI/litellm), supports 100+ models:

- **OpenAI**: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
- **Anthropic**: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
- **Google**: gemini-pro, gemini-1.5-pro
- **Ollama**: llama3, mistral, phi (local models)
- **Cohere, Replicate, HuggingFace, and more**

## Custom Scorers
```python
from openevalkit.scorers.base import Scorer
from openevalkit import Score

class ContainsWord(Scorer):
    name = "contains_word"
    requires_reference = False
    
    def __init__(self, word: str):
        self.word = word
    
    def score(self, run):
        has_word = self.word.lower() in run.output.lower()
        return Score(
            value=1.0 if has_word else 0.0,
            reason=f"Word '{self.word}' {'found' if has_word else 'not found'}"
        )

results = evaluate(dataset, scorers=[ContainsWord("Python")])
```

## Why OpenEvalKit?

- **Production Ready**: Smart caching, parallel execution, error handling
- **Cost Effective**: Cache LLM judgments to avoid redundant API calls
- **Flexible**: Works with any LLM provider via LiteLLM
- **Reliable**: Ensemble judges with configurable aggregation
- **Simple**: Clean API, comprehensive documentation

## Documentation

Coming soon! For now, see the examples above and the docstrings.

## License

MIT

## Contributing

Contributions welcome! Please open an issue or PR.