Metadata-Version: 2.4
Name: pytest-eval
Version: 0.1.0
Summary: LLM testing for humans.
Project-URL: Homepage, https://github.com/doganarif/pytest-eval
Project-URL: Documentation, https://github.com/doganarif/pytest-eval
Project-URL: Repository, https://github.com/doganarif/pytest-eval
Project-URL: Issues, https://github.com/doganarif/pytest-eval/issues
Author-email: Arif Dogan <arif@dogan.dev>
License-Expression: MIT
License-File: LICENSE
Keywords: ai,evaluation,llm,pytest,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Requires-Dist: httpx>=0.24
Requires-Dist: pydantic>=2.0
Requires-Dist: pytest>=7.0
Requires-Dist: sentence-transformers>=2.0
Provides-Extra: all
Requires-Dist: anthropic>=0.20; extra == 'all'
Requires-Dist: detoxify>=0.5; extra == 'all'
Requires-Dist: litellm>=1.0; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.20; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: anthropic>=0.20; extra == 'dev'
Requires-Dist: detoxify>=0.5; extra == 'dev'
Requires-Dist: litellm>=1.0; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: openai>=1.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: litellm
Requires-Dist: litellm>=1.0; extra == 'litellm'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: safety
Requires-Dist: detoxify>=0.5; extra == 'safety'
Description-Content-Type: text/markdown

<div align="center">

# pytest-eval

**LLM testing for humans.**

[![PyPI version](https://img.shields.io/pypi/v/pytest-eval.svg)](https://pypi.org/project/pytest-eval/)
[![Python](https://img.shields.io/pypi/pyversions/pytest-eval.svg)](https://pypi.org/project/pytest-eval/)
[![Tests](https://github.com/doganarif/pytest-eval/actions/workflows/ci.yml/badge.svg)](https://github.com/doganarif/pytest-eval/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

Bring LLM evaluation into your existing pytest workflow.<br>
No custom test runners. No new concepts. Just `pytest`.

</div>

---

## Install

```bash
pip install pytest-eval
```

## Quick Start

```python
# No pytest-eval imports needed: the `ai` fixture is the API.
# `my_chatbot` stands in for your own application code.

def test_chatbot(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")
```

```bash
pytest -v
```

```
tests/test_chatbot.py::test_chatbot PASSED
    ✓ similar       ███████████████  0.94  ≥0.80

  ──────────────────────────────────────────────────────
  pytest-eval                                     v0.1.0
  ──────────────────────────────────────────────────────
    Test           Result Score                    Cost
  ──────────────────────────────────────────────────────
    test_chatbot     ✓   ██████████████░  0.94      $0
  ──────────────────────────────────────────────────────
    1 tests  │  1 passed  │  $0.0000 total
  ──────────────────────────────────────────────────────
```

That's it. No `LLMTestCase` objects, no custom runner, no cloud dashboard.

## Why pytest-eval?

|  | DeepEval | pytest-eval |
|---|---|---|
| Basic test | ~15 lines, 4 imports | ~3 lines, 0 imports |
| Test runner | `deepeval test run` | `pytest` |
| Metrics | 50+ to learn | ~10 methods on one fixture |
| Dependencies | 30+ (OpenTelemetry, gRPC, Sentry...) | 4 core |
| Telemetry | Cloud dashboard by default | None. Fully local. |

## Methods

| Method | What it does | Cost |
|--------|--------------|------|
| `ai.similar(a, b, threshold=0.8)` | Semantic similarity check | Free (local) |
| `ai.similarity_score(a, b)` | Returns similarity float 0–1 | Free (local) |
| `ai.judge(text, criteria)` | LLM evaluates against criteria | $ |
| `ai.grounded(response, context)` | RAG faithfulness check | $ |
| `ai.relevant(response, query)` | Answer relevancy | $ |
| `ai.hallucinated(response, context)` | Detect unsupported claims | $ |
| `ai.toxic(text)` | Toxicity detection | Free |
| `ai.biased(text)` | Bias detection | Free |
| `ai.valid_json(text, schema=None)` | JSON validation + Pydantic parsing | Free |
| `ai.assert_snapshot(value, name)` | Regression testing vs saved baseline | Free (local) |
| `ai.metric(name, text, **kw)` | Run a custom registered metric | Varies |
| `ai.cost` | Cumulative $ for this test | — |
| `ai.latency` | Cumulative seconds for this test | — |

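**Free** methods run locally and need no API key. Similarity and snapshot checks use sentence-transformers; `ai.toxic()` and `ai.biased()` use detoxify, installed with the `safety` extra.<br>
**$** methods call an LLM API (OpenAI by default, which requires `OPENAI_API_KEY`).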

## Examples

### Semantic Similarity (free, local)

```python
def test_capital(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")
```
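
The README does not say how the similarity score is computed. A common approach for embedding-based checks is cosine similarity compared against a threshold, and the sketch below illustrates that idea under that assumption. The function names and the toy 3-dimensional vectors are hypothetical; real sentence-transformers embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def is_similar(a: list[float], b: list[float], threshold: float = 0.8) -> bool:
    """Pass when the score reaches the threshold (0.8 mirrors the documented default)."""
    return cosine_similarity(a, b) >= threshold


# Made-up vectors standing in for the embeddings of three texts.
response_vec = [0.9, 0.1, 0.3]
expected_vec = [0.8, 0.2, 0.3]
unrelated_vec = [-0.1, 0.9, -0.4]

close_match = is_similar(response_vec, expected_vec)
far_match = is_similar(response_vec, unrelated_vec)
```

Here `close_match` is true because the two vectors point in nearly the same direction, while `far_match` is false. Raising `threshold` makes the check stricter.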

### LLM-as-Judge

```python
def test_tone(ai):
    response = my_chatbot("I want to cancel my subscription")
    assert ai.judge(response, "Response is polite and offers help")
```

### Structured Output

```python
from pydantic import BaseModel

class City(BaseModel):
    name: str
    country: str

def test_structured(ai):
    response = my_llm("Give me Paris info as JSON")
    city = ai.valid_json(response, City)
    assert city.country == "France"
```

### RAG Pipeline

```python
def test_rag(ai):
    query = "What is our refund policy?"
    docs = retriever.get_relevant_docs(query)
    response = generator.generate(query, docs)

    assert ai.grounded(response, docs)
    assert ai.relevant(response, query)
    assert not ai.hallucinated(response, docs)
```

### Snapshot Regression

```python
def test_regression(ai):
    response = my_chatbot("What are your business hours?")
    ai.assert_snapshot(response, name="business_hours", threshold=0.85)
```

```bash
# The first run saves a baseline; later runs compare against it.
# After an intentional change, update the baselines:
pytest --snapshot-update
```
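
The save-then-compare flow described above can be sketched in plain Python. This is an illustration only, not pytest-eval's implementation: `check_snapshot`, the JSON file layout, and the `update` flag are hypothetical, and `difflib.SequenceMatcher` is a character-level stand-in for the semantic similarity the plugin would use.

```python
import json
import tempfile
from difflib import SequenceMatcher
from pathlib import Path


def check_snapshot(
    value: str,
    name: str,
    snapshot_dir: Path,
    threshold: float = 0.85,
    update: bool = False,
) -> bool:
    """Save a baseline on first use (or when update=True); otherwise compare."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    path = snapshot_dir / f"{name}.json"
    if update or not path.exists():
        path.write_text(json.dumps({"value": value}), encoding="utf-8")
        return True
    baseline = json.loads(path.read_text(encoding="utf-8"))["value"]
    # Stand-in score: character-level ratio, NOT semantic similarity.
    score = SequenceMatcher(None, baseline, value).ratio()
    return score >= threshold


with tempfile.TemporaryDirectory() as tmp:
    snapshots = Path(tmp) / "snapshots"
    hours = "We are open 9am to 5pm."
    first_run = check_snapshot(hours, "business_hours", snapshots)  # saves baseline
    unchanged = check_snapshot(hours, "business_hours", snapshots)  # compares
    drifted = check_snapshot("Please contact sales.", "business_hours", snapshots)
    # Accept the new answer as the baseline, as --snapshot-update would.
    check_snapshot("Please contact sales.", "business_hours", snapshots, update=True)
    after_update = check_snapshot("Please contact sales.", "business_hours", snapshots)
```

The drifted response fails against the original baseline and passes once the baseline has been updated, which is why updates should follow only intentional changes.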

### Multi-Model Comparison

```python
import pytest

@pytest.mark.parametrize("model", ["gpt-4o", "claude-sonnet-4-20250514", "llama-3.1-8b"])
def test_accuracy(ai, model):
    response = call_llm(model=model, prompt="What is 2+2?")
    assert ai.similar(response, "4")
```

### Custom Metrics

```python
from pytest_eval import Metric, MetricResult

@Metric.register("brand_voice")
def brand_voice(text: str, **kwargs) -> MetricResult:
    formal = sum(1 for w in ["please", "thank you"] if w in text.lower())
    score = min(formal / 2, 1.0)
    return MetricResult(score=score, passed=score >= kwargs.get("threshold", 0.5))

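def test_brand(ai):
    # Produce the text to score; `my_chatbot` is your own application code.
    response = my_chatbot("I want to cancel my subscription")
    assert ai.metric("brand_voice", response, threshold=0.7)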
```
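
For readers curious how a decorator-based registry like the one above can work, here is a minimal standalone sketch. It does not reproduce pytest-eval's internals: `ScoreResult`, `register`, `run_metric`, and `_REGISTRY` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScoreResult:
    """Hypothetical result type: a score plus a pass/fail verdict."""
    score: float
    passed: bool


# Maps a metric name to the function that computes it.
_REGISTRY: dict[str, Callable[..., ScoreResult]] = {}


def register(name: str):
    """Decorator that stores the function under `name` and returns it unchanged."""
    def decorator(func: Callable[..., ScoreResult]) -> Callable[..., ScoreResult]:
        if name in _REGISTRY:
            raise ValueError(f"metric {name!r} is already registered")
        _REGISTRY[name] = func
        return func
    return decorator


def run_metric(name: str, text: str, **kwargs) -> ScoreResult:
    """Look up a metric by name and call it with the text and any options."""
    try:
        func = _REGISTRY[name]
    except KeyError:
        raise KeyError(f"no metric registered under {name!r}") from None
    return func(text, **kwargs)


@register("politeness")
def politeness(text: str, **kwargs) -> ScoreResult:
    markers = ["please", "thank you"]
    hits = sum(1 for marker in markers if marker in text.lower())
    score = min(hits / len(markers), 1.0)
    return ScoreResult(score=score, passed=score >= kwargs.get("threshold", 0.5))


polite = run_metric("politeness", "Thank you, please hold.", threshold=0.7)
curt = run_metric("politeness", "Please hold.", threshold=0.7)
```

With two markers the score can only be 0.0, 0.5, or 1.0, so a threshold of 0.7 effectively requires both phrases to appear.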

## Configuration

### pyproject.toml

```toml
[tool.pytest.ini_options]
ai_provider = "openai"
ai_model = "gpt-4o-mini"
ai_embedding_model = "local"
ai_threshold = 0.8
ai_budget = 5.00
ai_snapshot_dir = ".pytest_eval_snapshots"
```

### Environment Variables

```bash
OPENAI_API_KEY=sk-...
PYTEST_EVAL_PROVIDER=openai
PYTEST_EVAL_MODEL=gpt-4o-mini
PYTEST_EVAL_BUDGET=5.00
```

### CLI Options

```bash
pytest --ai-provider=openai    # Provider
pytest --ai-model=gpt-4o       # Model
pytest --ai-threshold=0.9      # Similarity threshold
pytest --ai-budget=2.00        # Cap spending per run
pytest --ai-report=report.json # JSON report output
pytest --ai-verbose            # Show scores for passing tests
pytest --snapshot-update       # Update snapshot baselines
pytest -m ai                   # Run only @pytest.mark.ai tests
pytest -m "not cost_high"      # Skip expensive tests
```

Precedence: CLI > env vars > pyproject.toml > defaults
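
As an illustration of that precedence order, the sketch below resolves one setting by checking each source from highest to lowest priority. `resolve_setting` is a hypothetical helper, not part of pytest-eval, and the `PYTEST_EVAL_<NAME>` and `ai_<name>` key patterns are inferred from the examples above rather than confirmed for every option.

```python
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    cli: Mapping[str, str],
    env: Mapping[str, str],
    file_config: Mapping[str, str],
    defaults: Mapping[str, str],
) -> Optional[str]:
    """Return the first value found: CLI, then env vars, then file, then defaults."""
    candidates = (
        cli.get(name),
        env.get(f"PYTEST_EVAL_{name.upper()}"),  # assumed naming pattern
        file_config.get(f"ai_{name}"),           # assumed naming pattern
        defaults.get(name),
    )
    for value in candidates:
        if value is not None:
            return value
    return None


# Example sources; in practice `env` would be os.environ.
cli = {"model": "gpt-4o"}
env = {"PYTEST_EVAL_MODEL": "gpt-4o-mini", "PYTEST_EVAL_BUDGET": "5.00"}
file_config = {"ai_budget": "2.00", "ai_threshold": "0.9"}
defaults = {"provider": "openai", "threshold": "0.8"}

model = resolve_setting("model", cli, env, file_config, defaults)
budget = resolve_setting("budget", cli, env, file_config, defaults)
threshold = resolve_setting("threshold", cli, env, file_config, defaults)
provider = resolve_setting("provider", cli, env, file_config, defaults)
```

In this example the CLI wins for `model`, the environment variable wins over the file for `budget`, the file wins over the default for `threshold`, and `provider` falls through to its default.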

## Providers

pytest-eval supports multiple LLM providers, each installed as an optional extra. The `safety` extra adds toxicity and bias detection:

```bash
pip install 'pytest-eval[openai]'     # OpenAI (default)
pip install 'pytest-eval[anthropic]'  # Anthropic
pip install 'pytest-eval[litellm]'    # 100+ providers via LiteLLM
pip install 'pytest-eval[safety]'     # Toxicity/bias detection (detoxify)
pip install 'pytest-eval[all]'        # Everything
```

Local embeddings (sentence-transformers) are always included, so `similar()`, `similarity_score()`, and `assert_snapshot()` need no API key.

## Rich Failure Messages

Every assertion failure explains what happened:

```
AssertionError: Semantic similarity below threshold
  actual:     "The capital of France is Lyon"
  expected:   "The capital of France is Paris"
  similarity: 0.72
  threshold:  0.85
  reason:     Texts differ on the key fact (Lyon vs Paris)
```

## TUI Output

pytest-eval renders score bars and a summary table directly in your terminal:

- Per-test metric detail lines (with `-v` or `--ai-verbose`)
- Session summary table with visual score bars
- Cost tracking per test and per session

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

## License

MIT
