Metadata-Version: 2.4
Name: cane-eval
Version: 0.4.0
Summary: Agent Reliability Layer. LLM-as-Judge eval, schema validation, latency tracking, and reliability scoring for AI agents.
Project-URL: Homepage, https://github.com/colingfly/cane-eval
Project-URL: Documentation, https://github.com/colingfly/cane-eval#readme
Project-URL: Repository, https://github.com/colingfly/cane-eval
Project-URL: Issues, https://github.com/colingfly/cane-eval/issues
Author: Cane
License-Expression: MIT
License-File: LICENSE
Keywords: ai-agents,dpo,eval,evaluation,judge,latency,llm,reliability,schema-validation,training-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: anthropic>=0.39.0
Requires-Dist: jsonschema>=4.0.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: all-providers
Requires-Dist: google-genai>=1.0.0; extra == 'all-providers'
Requires-Dist: openai>=1.0.0; extra == 'all-providers'
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: fastapi
Requires-Dist: fastapi>=0.100.0; extra == 'fastapi'
Requires-Dist: httpx>=0.24.0; extra == 'fastapi'
Provides-Extra: gemini
Requires-Dist: google-genai>=1.0.0; extra == 'gemini'
Provides-Extra: integrations
Requires-Dist: fastapi>=0.100.0; extra == 'integrations'
Requires-Dist: httpx>=0.24.0; extra == 'integrations'
Requires-Dist: langchain-core>=0.1.0; extra == 'integrations'
Requires-Dist: llama-index-core>=0.10.0; extra == 'integrations'
Requires-Dist: openai>=1.0.0; extra == 'integrations'
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.1.0; extra == 'langchain'
Provides-Extra: llamaindex
Requires-Dist: llama-index-core>=0.10.0; extra == 'llamaindex'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Description-Content-Type: text/markdown

# cane-eval

The agent reliability layer. Catch what breaks in production before it ships.

[![PyPI](https://img.shields.io/pypi/v/cane-eval)](https://pypi.org/project/cane-eval/)

```bash
pip install cane-eval
```

## What it does

LLM-as-Judge eval + schema validation + latency tracking + reliability scoring. One tool, one score, one answer: **would this break in production?**

```
  Support Agent                              28.4s

  Overall: [=========----------] 47

  1 passed  1 warned  3 failed  (5 total)
  Pass rate: 20%

  Latency:  p50: 1.2s  p95: 8.4s  max: 12.1s
  Schema:   3/5 valid (60%)

  Reliability: [=======-----------] 52 (D)
```

## 30-Second Demo

```bash
export ANTHROPIC_API_KEY=sk-ant-...
cane-eval demo
```

## Quick Start

**1. Define tests** (`tests.yaml`):

```yaml
name: Support Agent

criteria:
  - key: accuracy
    weight: 40
  - key: completeness
    weight: 30
  - key: hallucination
    weight: 30

# Optional: validate response structure
schema:
  type: object
  required: [answer, sources]
  properties:
    answer: { type: string }
    sources: { type: array }

# Optional: latency target for reliability scoring
latency_target_ms: 5000

tests:
  - question: What is the return policy?
    expected_answer: 30-day return policy for unused items with receipt
  - question: How do I reset my password?
    expected_answer: Go to Settings > Security > Reset Password
```

**2. Run**:

```bash
cane-eval run tests.yaml
```

**3. Production checks**:

```bash
# Validate responses against JSON schema
cane-eval run tests.yaml --schema schema.json --fail-on-schema

# Fail if p95 latency exceeds 10 seconds
cane-eval run tests.yaml --latency-p95 10000

# Both + mine failures into training data
cane-eval run tests.yaml --schema schema.json --latency-p95 10000 --mine --export dpo
```

## Reliability Score

Every eval run produces an Agent Reliability Score (0-100) across three pillars:

| Pillar | What it measures | How |
|--------|-----------------|-----|
| **Correctness** | Is the answer actually correct? | LLM judge (accuracy, completeness, hallucination) |
| **Structural** | Does the response match expected format? | JSON schema validation |
| **Performance** | Is it fast enough for production? | p95 latency vs target |

Grades: **A** (90+) production-ready, **B** (75+) mostly reliable, **C** (60+) needs work, **D** (40+) significant gaps, **F** (<40) not ready.
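
For reference, those cutoffs map onto the score as in the sketch below. This is illustrative only, not the library's internal code; in practice read `summary.reliability_grade` from the Python API.

```python
def grade(score: float) -> str:
    """Map a 0-100 reliability score to the letter grades described above."""
    if score >= 90:
        return "A"  # production-ready
    if score >= 75:
        return "B"  # mostly reliable
    if score >= 60:
        return "C"  # needs work
    if score >= 40:
        return "D"  # significant gaps
    return "F"      # not ready


assert grade(52) == "D"  # matches the sample run above
```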

## Multi-Model Judging

Use any LLM as the judge. The provider is auto-detected from the model name, or set explicitly with `--provider`.

```bash
cane-eval run tests.yaml                                                       # Claude (default)
cane-eval run tests.yaml --provider openai --model gpt-4o                      # OpenAI
cane-eval run tests.yaml --provider gemini --model gemini-2.0-flash            # Gemini
cane-eval run tests.yaml --provider ollama --model llama3 --base-url http://localhost:11434/v1  # Local
```
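
Auto-detection plausibly keys off the model-name prefix. The sketch below illustrates the idea; the prefix table is an assumption, not cane-eval's actual detection logic, and when in doubt you can always pass `--provider` explicitly as in the examples above.

```python
def guess_provider(model: str) -> str:
    """Hypothetical prefix-based provider detection (illustration only)."""
    prefixes = {
        "claude": "anthropic",
        "gpt": "openai",
        "gemini": "gemini",
    }
    for prefix, provider in prefixes.items():
        if model.lower().startswith(prefix):
            return provider
    # Otherwise assume an OpenAI-compatible endpoint (e.g. Ollama via --base-url).
    return "openai"
```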

```bash
pip install cane-eval[openai]          # OpenAI
pip install cane-eval[gemini]          # Google Gemini
pip install cane-eval[all-providers]   # everything
```

## CLI

```bash
cane-eval run tests.yaml                          # run eval
cane-eval run tests.yaml --schema schema.json     # + schema validation
cane-eval run tests.yaml --latency-p95 10000      # + latency threshold
cane-eval run tests.yaml --mine --export dpo      # + failure mining
cane-eval rca tests.yaml --targeted               # root cause analysis
cane-eval diff old.json new.json                  # regression diff
cane-eval demo                                    # try it in 30 seconds
```

## Python API

```python
from cane_eval import TestSuite, EvalRunner

suite = TestSuite.from_yaml("tests.yaml")
runner = EvalRunner(
    schema={"type": "object", "required": ["answer"]},
    latency_p95=10000,
)
summary = runner.run(suite, agent=lambda q: my_agent.ask(q))

print(f"Score: {summary.overall_score}")
print(f"Reliability: {summary.reliability_score} ({summary.reliability_grade})")
print(f"Latency p95: {summary.latency.p95_ms}ms")
print(f"Schema: {summary.schema_pass}/{summary.schema_pass + summary.schema_fail} valid")
```
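
Continuing the example above, the same summary can gate a deployment script; for instance, exit non-zero unless the run clears a grade bar:

```python
import sys

# Fail the pipeline unless the run earned at least a B.
sys.exit(0 if summary.reliability_grade in ("A", "B") else 1)
```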

## Framework Integrations

```python
from cane_eval import evaluate_langchain, evaluate_llamaindex, evaluate_openai, evaluate_fastapi

results = evaluate_langchain(chain, suite="qa.yaml")
results = evaluate_llamaindex(query_engine, suite="qa.yaml")
results = evaluate_openai("http://localhost:11434/v1/chat/completions", suite="qa.yaml")
results = evaluate_fastapi("http://localhost:8000/ask", suite="qa.yaml")
```

## Eval Targets

```yaml
# HTTP endpoint
target:
  type: http
  url: https://my-agent.com/api/ask
  payload_template: '{"query": "{{question}}"}'
  response_path: data.answer

# CLI tool
target:
  type: command
  command: python my_agent.py --query "{{question}}"
```
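
Conceptually, the HTTP target renders `payload_template` with each test question and walks `response_path` through the JSON reply. The helper below is an illustrative stand-in for that behavior, reusing the URL, template, and path from the YAML above; it is not cane-eval's implementation.

```python
import json
from urllib import request

def ask_http_target(question: str) -> str:
    # Naive template substitution; a real implementation would JSON-escape the question.
    payload = '{"query": "{{question}}"}'.replace("{{question}}", question)
    req = request.Request(
        "https://my-agent.com/api/ask",
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    body = json.loads(request.urlopen(req).read())
    # response_path "data.answer" -> walk nested keys.
    for key in "data.answer".split("."):
        body = body[key]
    return body
```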

## CI

```yaml
# .github/workflows/eval.yml
- run: pip install cane-eval
- run: cane-eval run tests.yaml --schema schema.json --latency-p95 10000 --quiet
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Exit code 1 on failures. Add `--fail-on-warn` or `--fail-on-schema` for stricter checks.

## How It Works

```
YAML Suite --> Agent --> LLM Judge -----> Reliability Score (A-F)
                  |          |                    |
                  |          v                    |
                  |   Schema Check                |
                  |   Latency Stats               |
                  |          |                    v
                  v          v              Training Data
            Root Cause    Failure           (DPO/SFT/OpenAI)
            Analysis      Mining
```

## License

MIT
