Metadata-Version: 2.4
Name: hypnex-bench
Version: 0.1.0
Summary: Public eval + leaderboard for the Morpheus AI inference network. Drives MRC 76 (agent benchmarking) with reproducible probe sets.
Project-URL: Homepage, https://hypnex.xyz
Project-URL: Documentation, https://docs.hypnex.xyz/bench
Project-URL: Repository, https://github.com/hypnex-labs/hypnex
Project-URL: Issues, https://github.com/hypnex-labs/hypnex/issues
Author: Hypnex Labs
License: MIT
Keywords: benchmark,eval,hypnex,leaderboard,llm,mor,morpheus,mrc-76
Requires-Python: >=3.9
Requires-Dist: httpx>=0.27.0
Requires-Dist: openai<2.0.0,>=1.50.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# hypnex-bench

Public eval + leaderboard for the Morpheus AI inference network. The off-chain implementation of **MRC 76 (Agent Performance Benchmarking)**.

```bash
pip install hypnex-bench
```

## What it does

Runs a small, reproducible probe set against every LLM on the Morpheus network, collects per-model pass-rates, latencies (p50/p95), and token counts, and renders a markdown leaderboard. Designed to run nightly so the data flywheel compounds.

Default suites (~19 probes total — a full run across all live LLMs is typically **<$0.20 of MOR**):

| Suite | Probes | What it tests |
|---|---|---|
| `coding` | 6 | HumanEval-style — model writes a Python function, we exec it + assert |
| `math` | 8 | GSM8K-style word problems with deterministic numeric answers |
| `json` | 5 | Strict JSON adherence — does the model produce parseable, schema-matching JSON? |
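
Each probe pairs a prompt with a deterministic check, so scoring never depends on an LLM grader. A minimal sketch of what a math probe could look like; the `Probe` shape, its field names, and the `numeric_check` helper are illustrative, not the package's actual API:

```python
# Illustrative only: the real probe format lives in this package's suite definitions.
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    suite: str                    # "coding" | "math" | "json"
    prompt: str                   # sent verbatim to the model
    check: Callable[[str], bool]  # deterministic pass/fail over the raw completion

def numeric_check(expected: str) -> Callable[[str], bool]:
    """Pass when `expected` is the last number the model mentions."""
    def check(completion: str) -> bool:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        return bool(numbers) and numbers[-1] == expected
    return check

example = Probe(
    suite="math",
    prompt="A crate holds 12 bottles. How many bottles are in 7 crates? Answer with a number only.",
    check=numeric_check("84"),
)
```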

## Quickstart

```bash
# 1. List available LLMs (no key needed, public registry)
hypnex-bench models

# 2. Run all suites against the default LLM set (key required, costs MOR)
HYPNEX_API_KEY=mor_xxx hypnex-bench run

# 3. Render the leaderboard from data/latest.json
hypnex-bench leaderboard
```

## Programmatic

```python
from hypnex_bench import BenchRunner, all_suites, to_markdown

runner = BenchRunner(api_key="mor_...")
results = runner.run(["mistral-31-24b", "glm-5"], all_suites())
print(to_markdown(results))
```
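
If you want the rendered table on disk next to the JSON artifacts, plain `pathlib` is enough; the `data/leaderboard.md` path below is an arbitrary choice, not something the package writes for you:

```python
from pathlib import Path

Path("data").mkdir(exist_ok=True)
Path("data/leaderboard.md").write_text(to_markdown(results))
```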

## CLI reference

```
hypnex-bench models                              # list active LLMs

hypnex-bench run [options]
    --models a,b,c           # comma-separated list (default: all live LLMs)
    --limit N                # only first N models (when --models omitted)
    --suite SUITE            # all | coding | math | json | a,b
    --output DIR             # output dir (default: ./data)
    --api-key KEY            # override HYPNEX_API_KEY
    --base-url URL           # override https://api.mor.org/api/v1

hypnex-bench leaderboard [options]
    --input DIR              # dir containing latest.json (default: ./data)
    --output FILE            # write to file (default: stdout)
```

## Output

```
data/
  run-20260507T031502Z.jsonl      # one full run, append-only
  run-20260508T031455Z.jsonl
  ...
  latest.json                     # snapshot of the most recent run
```

`latest.json` is what the leaderboard renderer (and any future static-site generator) consumes.
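
Reading the snapshot needs only the standard library. A short sketch, with the caveat that the field names below are assumptions about the schema rather than a documented contract:

```python
import json
from pathlib import Path

snapshot = json.loads(Path("data/latest.json").read_text())

# Hypothetical shape: per-model records carrying a pass rate and latency stats.
for record in snapshot.get("models", []):
    print(record.get("model"), record.get("pass_rate"), record.get("latency_p95_ms"))
```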

## Why not just use HumanEval / GSM8K / MMLU directly?

Those benchmarks have leaked into model training data. The probes here are small-set, slightly rephrased variations chosen to be cheap (a full run costs cents, not dollars), language-canonical (Python only for coding; plain ASCII text and numeric answers for math), and verifiable without an LLM grader (deterministic evaluators that exec or regex). For canonical leaderboard claims, swap these probe sets for the official suites; the runner architecture stays the same.
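
As a concrete instance of "exec or regex": a coding probe can be scored by exec-ing the completion and asserting on the function it asked for. A rough sketch, where the fence-stripping regex and the hypothetical `add(a, b)` probe are illustrative rather than this package's actual evaluator:

```python
import re

def check_add_probe(completion: str) -> bool:
    """Hypothetical evaluator for a probe that asks the model to write `def add(a, b)`."""
    # If the model wrapped its answer in a fenced code block, keep only the block body.
    match = re.search(r"`{3}(?:python)?\n(.*?)`{3}", completion, re.DOTALL)
    code = match.group(1) if match else completion
    namespace: dict = {}
    try:
        exec(code, namespace)
        return namespace["add"](2, 3) == 5 and namespace["add"](-1, 1) == 0
    except Exception:
        return False
```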

## Tests

```bash
pip install -e ".[dev]"
pytest                  # 17 pure-Python evaluator tests, no API key needed
```

## Status & affiliation

Hypnex Labs' draft of MRC 76. Not affiliated with the Morpheus AI Foundation. Suite definitions are MIT-licensed; submit PRs to add probes.

## License

MIT
