Metadata-Version: 2.4
Name: skillevaluation
Version: 0.1.1
Summary: An open spec for A/B benchmarking skills via declarative test suites.
Project-URL: Homepage, https://github.com/decimal-labs/skillevaluation
Project-URL: Documentation, https://github.com/decimal-labs/skillevaluation/tree/main/spec
Project-URL: Repository, https://github.com/decimal-labs/skillevaluation
Project-URL: Issues, https://github.com/decimal-labs/skillevaluation/issues
Author: Decimal AI
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: ab-test,agent,ai,benchmark,claude,eval,llm,skill
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: jsonschema>=4.0; extra == 'dev'
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0; extra == 'dev'
Description-Content-Type: text/markdown

# skillevaluation

**Does your skill actually make the agent better? Prove it — with measured before/after numbers.**

[![CI](https://github.com/decimal-labs/skillevaluation/actions/workflows/ci.yml/badge.svg)](https://github.com/decimal-labs/skillevaluation/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/skillevaluation)](https://pypi.org/project/skillevaluation/)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://pypi.org/project/skillevaluation/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://github.com/decimal-labs/skillevaluation/blob/main/LICENSE)

A skill is just a folder — a `SKILL.md` plus some attachments. It's easy to write one and *assume* it helps. `skillevaluation` lets you **measure** the help: write a small `eval.yaml` next to your skill, and a runner executes each test case twice — once with the skill loaded, once without — then hands you a clear A/B delta on pass rate, speed, tokens, turns, and tool calls.

No more "I think this skill is good." Now you can say *"this skill lifts pass rate 40 points and cuts tokens 43%"* — and back it with reproducible cases.

---

## The payoff

Here's the bundled [`gdpr-pii-classifier`](https://github.com/decimal-labs/skillevaluation/tree/main/examples/gdpr-pii-classifier) example — the same five cases, run with the skill and without it:

| Dimension | Without skill | With skill | Delta |
|---|---:|---:|:---:|
| **Pass rate** | 40% | 80% | **+40 pts** |
| Avg tokens | 3,210 | 1,840 | **−43%** |
| Avg turns | 8.2 | 4.6 | **−44%** |
| Avg duration | 22.8s | 14.2s | **−38%** |
| Avg tool calls | 5.4 | 3.0 | **−44%** |

The skill more than doubled the pass rate *and* made the agent faster and cheaper. That's exactly the kind of claim `skillevaluation` is built to produce.

> Numbers above are illustrative of the example's shape — your real deltas depend on your agent runtime and model.

---

## How it works

1. **Write `eval.yaml`** next to your `SKILL.md` — a handful of declarative cases (a prompt, plain-English expectations, and optional shell validators).
2. **A runner executes each case twice** — once with the skill loaded (the *with* arm), once without (the *without* arm).
3. **You get measured deltas** — each case is classified (`flip_to_pass`, `pass_kept`, …) and aggregated into per-dimension lift.

```
                    ┌─ with skill ────▶ pass? + metrics ─┐
   each case ──────▶┤                                    ├──▶ outcome ──▶ aggregate deltas
                    └─ without skill ─▶ pass? + metrics ─┘
```

---

## Quickstart

```bash
pip install skillevaluation
```

**1. Describe what "better" means.** Drop an `eval.yaml` beside your `SKILL.md`:

```yaml
# eval.yaml
cases:
  - name: tracks_with_id
    prompt: "Classify these schema fields and write JSON to /workspace/output.json: email, ip_address, name, age."
    expectations:
      - "The response classifies email as PII"
      - "The response identifies ip_address as pseudonymous (not PII)"
    validators:
      - cmd: "jq -e '.email.category == \"PII\"' /workspace/output.json"
        label: "email categorized as PII"
```

See the full five-case suite in [`examples/gdpr-pii-classifier/eval.yaml`](https://github.com/decimal-labs/skillevaluation/blob/main/examples/gdpr-pii-classifier/eval.yaml).

**2. Score your A/B results.** Once you've run each case with and without the skill, feed the per-arm results to the library and get the deltas back:

```python
from skillevaluation.outcomes import classify_outcome
from skillevaluation.aggregation import CaseResult, CaseMetrics, compute_run_aggregates

results = [
    CaseResult(
        case_name="tracks_with_id",
        outcome=classify_outcome(with_passed=True, without_passed=False),
        with_skill=CaseMetrics(passed=True,  duration_ms=14200, turns=4, total_tokens=1840, tool_call_count=3),
        without_skill=CaseMetrics(passed=False, duration_ms=22800, turns=8, total_tokens=3210, tool_call_count=5),
    ),
    # ... one CaseResult per case
]

agg = compute_run_aggregates(results)
print(agg.pass_rate)   # {'with_skill': 1.0, 'without_skill': 0.0, 'delta_pts': 100.0}
print(agg.to_dict())   # full per-dimension JSON, matching the wire schema
```

**What actually runs the agent?** That part is yours to bring. This repo defines the *format*, the *scoring*, and the *spec* — it does **not** ship the harness that drives the agent through each case. Wire your own agent loop to the [runner contract](https://github.com/decimal-labs/skillevaluation/blob/main/spec/runner-contract.md), or use a conforming runner like [DecimalAI](https://github.com/decimal-labs) that does the A/B execution for you.

> **Status:** v0.1.1, pre-1.0. The format is stable enough to build on, but APIs may shift before v1 — changes are logged in [`CHANGELOG.md`](https://github.com/decimal-labs/skillevaluation/blob/main/CHANGELOG.md).

---

## What's in the box

A typed, dependency-light Python reference implementation (only needs PyYAML):

| Module | What it does |
|---|---|
| `skillevaluation.parser` | Parse + strictly validate `eval.yaml` |
| `skillevaluation.outcomes` | Classify each case: `flip_to_pass` / `pass_kept` / `fail_kept` / `flip_to_fail` / `error` |
| `skillevaluation.aggregation` | Per-dimension delta math, with an honest apples-to-oranges skip rule |
| `skillevaluation.baseline` | Baseline-cache key derivation (skip re-running an unchanged *without* arm) |
| `skillevaluation.trajectory.format_v1` | Canonical agent-session rendering, so different runners' LLM judges agree |

---

## Use it as a spec, not just a library

`skillevaluation` is an **open spec**, so any tool — in any language — can produce interoperable results. If you're building your own runner, start here:

- [`spec/eval-yaml.md`](https://github.com/decimal-labs/skillevaluation/blob/main/spec/eval-yaml.md) — the file format
- [`spec/runner-contract.md`](https://github.com/decimal-labs/skillevaluation/blob/main/spec/runner-contract.md) — how to execute cases A/B and aggregate
- [`spec/llm-judge.md`](https://github.com/decimal-labs/skillevaluation/blob/main/spec/llm-judge.md) — the judge input/output contract
- [`spec/trajectory-format.md`](https://github.com/decimal-labs/skillevaluation/blob/main/spec/trajectory-format.md) — canonical session rendering
- [`schemas/`](https://github.com/decimal-labs/skillevaluation/tree/main/schemas) — JSON Schemas for every input and output
- [`CONFORMANCE.md`](https://github.com/decimal-labs/skillevaluation/blob/main/CONFORMANCE.md) + [`compatibility-tests/`](https://github.com/decimal-labs/skillevaluation/tree/main/compatibility-tests) — golden in/out pairs your implementation must reproduce

Deliberately **out of scope:** live traffic-split experiments, external eval-score webhooks (DeepEval/LangSmith), catalog ranking or publish-gate policy, and the exact LLM-judge prompt wording (the contract is specified; the prompt is your choice).

---

## Contributing

Contributions are genuinely welcome — especially new conformance cases that catch an edge the golden suite misses. See [`CONTRIBUTING.md`](https://github.com/decimal-labs/skillevaluation/blob/main/CONTRIBUTING.md). Dev setup is the usual:

```bash
git clone https://github.com/decimal-labs/skillevaluation
cd skillevaluation
pip install -e ".[dev]"
pytest
```

## License

[Apache 2.0](https://github.com/decimal-labs/skillevaluation/blob/main/LICENSE).
