Metadata-Version: 2.4
Name: agi-eval
Version: 0.1.1
Summary: Plug any model into any major AGI eval and actually run it.
Project-URL: Homepage, https://agi-eval.studio
Project-URL: Documentation, https://agi-eval.studio/evals
Project-URL: Leaderboard, https://agi-eval.studio/leaderboard
Author: iso-ai
License: Apache-2.0
License-File: LICENSE
Keywords: agent,agi,benchmark,evals,evaluation,llm
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Requires-Dist: httpx>=0.27
Requires-Dist: posthog>=3.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: alfworld
Requires-Dist: alfworld>=0.4; extra == 'alfworld'
Provides-Extra: all
Requires-Dist: anthropic>=0.40; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: hf
Requires-Dist: torch>=2.0; extra == 'hf'
Requires-Dist: transformers>=4.40; extra == 'hf'
Provides-Extra: mlx
Requires-Dist: mlx-lm>=0.10; extra == 'mlx'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: scienceworld
Requires-Dist: scienceworld>=1.2; extra == 'scienceworld'
Description-Content-Type: text/markdown

# agi-evals

**Plug any model into any major AGI eval and actually run it.**

Open source, six categories, 45 evals catalogued. 19 are runnable today, and
the rest are being implemented in depth over time. This is not a directory
pretending to be a platform. The runner code is Apache-2.0, and each eval's
dataset keeps its original upstream license, documented per entry.

Everything runs on your machine. No account or API key is required: bring your
own model credentials, or none at all for local models served through Ollama,
vLLM, or MLX. A free account at [agi-eval.studio](https://agi-eval.studio)
adds the hosted layer: score-over-time dashboards, variant-vs-base comparison
cards, a public leaderboard, and challenges.

Catalog, per-eval docs, and the scoreboard live at
**[agi-eval.studio](https://agi-eval.studio)**.

---

## The idea

Two protocols decouple what we run from what we run it on:

- **`PatientAdapter`** is a model endpoint. It takes a prompt plus any
  eval-specific scenario and returns a response. Adapters ship for OpenAI,
  Anthropic, Grok, Ollama, vLLM, Hugging Face Transformers, MLX (Apple
  Silicon), and a custom-callable shim.
- **`EvalRunner`** is an eval. It takes a patient and a case and returns a
  scored result with a typed failure tag.

Get these two right and adding eval #2 through #50 is incremental. Every eval
and every model hangs off them. An eval never imports an adapter, and an
adapter never imports an eval.

```
catalog/evals.yaml ──┬──► website (agi-eval.studio)
                     └──► registry ──► EvalRunner ──┐
                                                    ├──► harness ──► EvalReport ──► push to scoreboard
                              PatientAdapter ───────┘
```
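
As a rough illustration of that decoupling, the two protocols could be sketched with `typing.Protocol` as below. Every name here (`Request`, `Result`, `PatientSketch`, `RunnerSketch`, `complete`, `run_case`, `EchoPatient`, `ExactMatchRunner`) is hypothetical and chosen for the example. The shipped signatures live in the installed package and may differ.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Request:
    """Hypothetical request shape: just a prompt."""
    prompt: str


@dataclass
class Result:
    """Hypothetical scored result with at most one failure tag."""
    score: float
    failure_tag: Optional[str] = None


class PatientSketch(Protocol):
    """Stand-in for a model endpoint: prompt in, text out."""
    def complete(self, request: Request) -> str: ...


class RunnerSketch(Protocol):
    """Stand-in for an eval: patient + case in, scored result out."""
    def run_case(self, patient: PatientSketch, case: dict) -> Result: ...


class EchoPatient:
    """Toy patient that returns the prompt unchanged."""
    def complete(self, request: Request) -> str:
        return request.prompt


class ExactMatchRunner:
    """Toy eval that only sees the protocol, never a concrete adapter."""
    def run_case(self, patient: PatientSketch, case: dict) -> Result:
        answer = patient.complete(Request(prompt=case["prompt"])).strip()
        if not answer:
            return Result(0.0, "NO_ANSWER")
        if answer == case["expected"]:
            return Result(1.0)
        return Result(0.0, "WRONG_ANSWER")


result = ExactMatchRunner().run_case(EchoPatient(), {"prompt": "B", "expected": "B"})
print(result)
```

Because both sides depend only on the protocol, swapping the patient or the runner needs no change on the other side.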

## Install

```bash
pip install agi-eval                 # core + custom/ollama/vllm/grok/openai-compat
pip install 'agi-eval[openai]'       # + OpenAI SDK
pip install 'agi-eval[anthropic]'    # + Anthropic SDK
pip install 'agi-eval[hf]'           # + Transformers/torch
pip install 'agi-eval[mlx]'          # + MLX (Apple Silicon)
```

## Quickstart: CLI

```bash
agi-evals list --status live                 # browse the catalog
agi-evals info gpqa-diamond                  # inspect one eval
agi-evals run gpqa-diamond --model echo      # offline smoke test, no keys
agi-evals download --all                     # fetch + cache full datasets
agi-evals run gpqa-diamond --model openai:gpt-4o-mini --limit 50
agi-evals run humaneval-plus --model ollama:llama3.1:8b --concurrency 4
agi-evals run math --model anthropic:claude-opus-4-8 --push   # submit to scoreboard
```

Every shipped eval bundles a small real-schema sample, so it runs offline out
of the box. `agi-evals download <eval>` fetches the full upstream dataset
(from the HF datasets-server or GitHub, with no heavy dependencies) into
`~/.cache/agi-evals/`, and runs pick it up automatically. GPQA is gated
upstream: set `HF_TOKEN` after accepting its terms, or the runner falls back
to the GPQA repo's published-password zip. An explicit `data_path=` always
wins.
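
The lookup order described above (explicit path, then cache, then bundled sample) can be sketched as follows. This is an assumption-laden illustration, not the package's resolver: `resolve_dataset`, its parameters, and the one-directory-per-eval cache layout are all hypothetical.

```python
from pathlib import Path
from typing import Optional

# Cache root named in this README; the per-eval subdirectory is an assumption.
CACHE_DIR = Path.home() / ".cache" / "agi-evals"


def resolve_dataset(
    eval_id: str,
    bundled_sample: Path,
    data_path: Optional[str] = None,
    cache_dir: Path = CACHE_DIR,
) -> Path:
    """Pick the dataset location for one eval (hypothetical helper)."""
    # 1. An explicit data_path always wins.
    if data_path is not None:
        return Path(data_path)
    # 2. Otherwise prefer a previously downloaded full dataset.
    cached = cache_dir / eval_id
    if cached.exists():
        return cached
    # 3. Fall back to the small sample bundled with the eval.
    return bundled_sample
```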

## Quickstart: SDK

```python
from agi_evals import load_runner, run_eval
from agi_evals.adapters import OpenAIAdapter, CustomAdapter

# Any of the built-in adapters...
patient = OpenAIAdapter("gpt-4o-mini")

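# ...or wrap your own endpoint as a callable. `my_model` stands in for your own
# prompt -> text function; this assignment replaces the adapter above.
patient = CustomAdapter(lambda req: my_model(req.prompt), name="my-model")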

report = run_eval(load_runner("gpqa-diamond"), patient, limit=100, concurrency=8)
print(report.score, report.pass_rate, report.failure_counts)

# Save it to your scoreboard at agi-eval.studio
from agi_evals.client import push_report
push_report(report, model="my-model")        # needs AGI_EVALS_API_KEY
```

## Track your scores at agi-eval.studio

Local runs print a report and exit. Nothing leaves your machine. To keep a
history, add `--push`:

1. Sign in at [agi-eval.studio](https://agi-eval.studio) (GitHub OAuth).
2. Mint a key under **Settings → API keys** (shown once, stored hashed).
3. `export AGI_EVALS_API_KEY=ae_...`
4. Add `--push` to any `run` or `compare`.

Your [dashboard](https://agi-eval.studio/dashboard) charts every eval over
time and groups variant-vs-base comparisons into vs-cards. From there you can
submit a run to a [challenge](https://agi-eval.studio/challenges) or the
public [leaderboard](https://agi-eval.studio/leaderboard), and attach your
GitHub repo or an endpoint so others can see what the score belongs to.

## Live evals (runnable today)

| Eval | Category | Grading | Full dataset |
|------|----------|---------|--------------|
| GPQA Diamond | reasoning | single-letter MCQ | 198 |
| MMLU-Pro | reasoning | 10-choice MCQ | ~12k |
| MATH | reasoning | `\boxed{}` answer, math-aware match | 500 (MATH-500) |
| AIME 2024 | reasoning | integer exact-match | 30 |
| HumanEval+ | code | sandboxed test execution | 164 |
| BIG-Bench Hard | reasoning | normalized exact-match, 27 tasks | ~6.5k |
| MuSR | reasoning | narrative MCQ | 756 |
| BFCL (simple) | agent | function-call AST match | 400 |
| ZebraLogic | reasoning | full-grid JSON, puzzle-level | gated (HF_TOKEN) |
| JailbreakBench | safety | refusal rate, LLM-judged | 100 |
| LiveCodeBench | code | contest tests, pass@k, contamination-free | recent releases (~340) |
| HarmBench | safety | behavior classifier, score = 1 − ASR | 300 |
| τ-bench | agent | episode reward: DB-state × outputs | 165 (retail+airline) |
| ALFWorld | embodied | task success in the real TextWorld engine* | 134 unseen games |
| ScienceWorld | embodied | engine score 0–100, partial credit* | 30 tasks, test variations |
| AILuminate | safety | judged safe-response rate (practice set) | 1,200 |
| GAIA | agent | official exact-match scorer, FINAL ANSWER template | 165 (validation, gated) |
| WebShop | agent | engine's attribute/option/price reward, partial credit* | 500 test goals |
| LIBERO | robotics | success rate via PolicyAdapter (MuJoCo)* | 4 suites × 10 tasks |

τ-bench is a faithful port of the Sierra Research benchmark: the original
tools, databases, policy wikis, tasks, and reward function, vendored 1:1
(MIT). The simulated user is any `PatientAdapter`
(`TauBenchRunner(user=OpenAIAdapter("gpt-4o"))`). The port is verified by a
gold-replay oracle scoring 165/165 on the real test sets.

\* Engine-backed evals drive the original benchmark environments and need
their engines: ALFWorld and ScienceWorld install as extras
(`pip install 'agi-eval[alfworld]'` / `'agi-eval[scienceworld]'`, Java for
the latter), while WebShop and LIBERO install from their upstream repos
(each eval's docs page has the recipe). LIBERO evaluates robot policies, not
text models: serve one over HTTP and pass `--model policy:http://host:port`.
Every other live eval runs with zero optional dependencies.

The other 26 catalogued evals across agent/tool-use, code, robotics, and
safety carry status `building` or `roadmap`. Browse them all, with per-eval
docs covering how each works, how it scores, and how to troubleshoot it, at
[agi-eval.studio/evals](https://agi-eval.studio/evals).

## pass@k for code evals

```python
from agi_evals.evals import HumanEvalPlusRunner, LiveCodeBenchRunner

runner = LiveCodeBenchRunner(n_samples=10, k=5)   # 10 samples, report pass@5
```

pass@k is computed with the unbiased estimator from Chen et al. (2021). The
default `n_samples=1, k=1` is plain greedy pass@1.
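
For reference, the Chen et al. (2021) estimator for one problem with `n` samples, of which `c` pass, is `1 - C(n - c, k) / C(n, k)`. A minimal standalone sketch follows. The function name `pass_at_k` is illustrative, and the package's internal implementation may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    # Fewer than k failures: every size-k subset contains a passing sample.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct, report pass@5.
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

The benchmark-level figure is the mean of this value over all problems.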

## Compare a variant against its base

```bash
agi-evals compare gpqa-diamond --model openai:my-finetune \
    --baseline openai:gpt-4o-mini --push
```

This runs a paired, per-case comparison on identical cases and reports
improvements (cases the variant newly solves), regressions (cases it newly
fails, listed by id), the score delta, and McNemar's exact test on the
discordant pairs. Infra errors on either side are excluded from pairing, so
endpoint flakes never read as regressions. `--push` lands both runs on your
dashboard as a vs-card.
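
To make the statistics concrete, here is a small sketch of the pairing and of McNemar's exact (binomial) two-sided test. The helper names and the convention that `None` marks an infra error are assumptions for the example, not the package's data model.

```python
from math import comb
from typing import Dict, Optional, Tuple


def paired_counts(
    variant: Dict[str, Optional[bool]],
    baseline: Dict[str, Optional[bool]],
) -> Tuple[int, int]:
    """Count discordant pairs; None marks an infra error and drops the pair."""
    improvements = regressions = 0
    for case_id in variant.keys() & baseline.keys():
        v, b = variant[case_id], baseline[case_id]
        if v is None or b is None:
            continue  # endpoint flake on either side: not paired
        if v and not b:
            improvements += 1
        elif b and not v:
            regressions += 1
    return improvements, regressions


def mcnemar_exact(improvements: int, regressions: int) -> float:
    """Two-sided exact p-value under H0: discordant pairs split 50/50."""
    n = improvements + regressions
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(improvements, regressions)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


print(mcnemar_exact(8, 2))  # 0.109375
```

Only the discordant pairs enter the test. Cases that both models pass or both fail carry no information about which one is better.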

## Typed failure taxonomy

Every result carries at most one `FailureTag`: `WRONG_ANSWER`, `NO_ANSWER`,
`REFUSED`, `MALFORMED_OUTPUT`, `TOOL_ERROR`, `TIMEOUT`, `CONTEXT_OVERFLOW`,
`ADAPTER_ERROR`, `HARNESS_ERROR`. Infrastructure errors (adapter or harness)
are excluded from the aggregate score, so a flaky endpoint never silently
penalizes a model. They stay visible in `failure_counts`.
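
A sketch of how such an aggregate could treat infrastructure tags is shown below. The tag names come from the list above. The enum values, the `aggregate` helper, and the choice to return `None` when nothing is scorable are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum
from typing import List, Optional, Tuple


class FailureTag(Enum):
    WRONG_ANSWER = "wrong_answer"
    NO_ANSWER = "no_answer"
    REFUSED = "refused"
    MALFORMED_OUTPUT = "malformed_output"
    TOOL_ERROR = "tool_error"
    TIMEOUT = "timeout"
    CONTEXT_OVERFLOW = "context_overflow"
    ADAPTER_ERROR = "adapter_error"
    HARNESS_ERROR = "harness_error"


# Infrastructure failures: counted, but never scored against the model.
INFRA_TAGS = {FailureTag.ADAPTER_ERROR, FailureTag.HARNESS_ERROR}


def aggregate(
    results: List[Tuple[float, Optional[FailureTag]]],
) -> Tuple[Optional[float], Counter]:
    """Mean score over non-infra cases, plus a count of every failure tag."""
    scored = [score for score, tag in results if tag not in INFRA_TAGS]
    failure_counts = Counter(tag.name for _, tag in results if tag is not None)
    mean = sum(scored) / len(scored) if scored else None
    return mean, failure_counts


score, counts = aggregate([
    (1.0, None),
    (0.0, FailureTag.WRONG_ANSWER),
    (0.0, FailureTag.ADAPTER_ERROR),  # excluded from the mean, still counted
])
print(score, dict(counts))
```

Here the mean is taken over two cases rather than three, so the adapter error neither lowers the score nor disappears from the report.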

## Safety note

`HumanEval+` executes model-generated code locally in a subprocess with a
timeout. Run only models and datasets you trust, or wrap it in an OS-level
sandbox.
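
The general shape of "subprocess with a timeout" looks roughly like the sketch below. It is not the package's runner: `run_untrusted` and its return strings are hypothetical. It also shows why the warning matters, since a timeout limits run time but does not restrict file or network access.

```python
import os
import subprocess
import sys
import tempfile


def run_untrusted(code: str, timeout_s: float = 5.0) -> str:
    """Run a Python snippet in a child process and classify the outcome.

    NOT a sandbox: the child can still read files and open sockets.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(code)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", path],  # -I: isolated mode, ignores env and user site
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return "TIMEOUT"
    return "PASS" if proc.returncode == 0 else "FAIL"


print(run_untrusted("assert 1 + 1 == 2"))
```

For untrusted models, running the whole process inside a container or VM with no network and a read-only filesystem is the safer arrangement.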

## Contributing

Bug reports, eval requests, and questions:
[agi-eval.studio](https://agi-eval.studio). Adding an eval is deliberately
small: implement an `EvalRunner`, bundle a sample, and add a catalog entry.
The installed package is the reference: every live eval ships its source in
`agi_evals/evals/`.

## License

Runner code: **Apache-2.0** (see [LICENSE](LICENSE)). Eval datasets retain
their upstream licenses, documented per entry in the catalog.
