Metadata-Version: 2.4
Name: murthy-bench
Version: 0.2.0
Summary: Longevity LLM benchmark CLI — Estimathon-style evaluation for aging-biology tasks
License: MIT
Project-URL: Repository, https://github.com/OhhMoo/Murphy-Health
Keywords: longevity,benchmark,llm,aging,estimathon
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: openai>=1.0.0
Requires-Dist: anthropic>=0.25.0
Requires-Dist: typer>=0.12.0
Requires-Dist: rich>=13.0.0
Requires-Dist: prompt_toolkit>=3.0.0
Requires-Dist: requests>=2.31.0
Provides-Extra: longebench
Requires-Dist: datasets>=2.0.0; extra == "longebench"
Provides-Extra: all
Requires-Dist: datasets>=2.0.0; extra == "all"

# Murphy — Longevity Benchmark CLI

Evaluate any LLM on aging-biology tasks using an **Estimathon-style** benchmark.
Models submit intervals `[min, max]` for numerical questions, receive only **binary feedback**
(GOOD / BAD), and manage a shared submission budget across all problems.
Non-numerical tasks (binary, multiclass, ternary, generation) are scored with standard accuracy / F1.
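
As a rough illustration of the feedback loop described above, the sketch below grades one submitted interval against a hidden true value. The function name `grade_interval` and the inclusive bounds are assumptions made for illustration; they are not taken from the benchmark's source.

```python
def grade_interval(lo: float, hi: float, truth: float) -> str:
    """Return binary feedback for one submitted interval [lo, hi]."""
    if lo > hi:
        raise ValueError("interval must satisfy lo <= hi")
    # Assumption: bounds are inclusive; the real grader may differ.
    return "GOOD" if lo <= truth <= hi else "BAD"


# The model only ever sees GOOD or BAD, never the direction of a miss.
print(grade_interval(60, 90, 78.4))   # truth inside the interval
print(grade_interval(60, 70, 78.4))   # truth outside the interval
```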

---

## Install

```bash
cd longivity_hack
pip install -r requirements.txt
```

---

## First-time setup

Run the setup wizard inside the chat to configure keys and verify dataset access:

```bash
python cli.py
```

Then type:

```
/setup
```

The wizard walks through:
1. **Anthropic API key** — required for the chat interface and `--provider anthropic` runs
2. **HuggingFace token** — required for LongeBench dataset; wizard verifies live access
3. **OpenAI API key** — optional

Keys are saved to `~/.longevity/config.json` and masked on input.

> **LongeBench is a gated dataset.** Before your token will work, visit
> `huggingface.co/datasets/insilicomedicine/longebench` and click **Request access**.
> Approval is usually instant. Then re-run `/setup` to verify.

---

## Interactive chat (recommended)

```bash
python cli.py          # opens chat directly
python cli.py chat     # same thing
```

Type naturally — Claude calls the right tools. Type `/` to see all commands with Tab autocomplete.

| Command | Args | Description |
|---|---|---|
| `/setup` | | Configure API keys + verify HuggingFace access |
| `/help` | | Show all commands |
| `/benchmark` | `[model] [provider] [tasks]` | Quick-run with current defaults |
| `/question_set` | `[source] [limit]` | Preview tasks |
| `/status` | `[model] [provider]` | Check model connectivity |
| `/model` | `[id]` | Show or set benchmark model |
| `/provider` | `[name]` | Show or set provider |
| `/tasks` | `[source]` | Show or set default task source |
| `/think` | | Toggle chain-of-thought traces |
| `/config` | `[key] [value]` | View or set a config value |
| `/clear` | | Clear conversation history |
| `/exit` | | Exit |

---

## Running benchmarks

### Full LongeBench — mixed mode (recommended)

Runs the Estimathon protocol on numerical tasks and one-shot scoring on categorical and generation tasks:

```bash
python cli.py run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks longebench \
  --mode mixed \
  --limit 50
```

### Estimathon only (numerical tasks)

```bash
python cli.py run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks sample \
  --mode estimathon \
  --think
```

### One-shot baseline

```bash
python cli.py run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks longebench \
  --mode one-shot \
  --limit 100
```

### Against the L-LLM endpoint

```bash
python cli.py run \
  --model longevity-llm \
  --provider endpoint \
  --endpoint https://saujlffcxf20v74m.us-east-2.aws.endpoints.huggingface.cloud \
  --api-key <hf-token> \
  --tasks longebench \
  --mode mixed \
  --limit 50
```

---

## Estimathon rules

```
score = (10 + Σ floor(max/min) over GOOD final answers) × 2^(N − number of GOOD final answers)
where N = total number of problems
```

- Only the **last** submission per problem counts
- Refining a GOOD interval is a voluntary bet — if the new interval misses, you lose that problem
- Feedback is **binary only**: GOOD or BAD — no "too high / too low"
- Default budget: `floor(1.38 × N)` slips (submissions) shared across all N problems
- **Lower score is better**
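
The scoring formula and default budget can be worked through in a few lines of Python. This is a minimal sketch, not the benchmark's own implementation: the function names are hypothetical, and the requirement that every lower bound be positive is an assumption (the `max/min` ratio is undefined otherwise).

```python
import math


def estimathon_score(good_intervals, n_problems):
    """Score a session from the final intervals that were graded GOOD.

    good_intervals: list of (lo, hi) pairs, one per problem whose last
    submission was GOOD. Every other problem counts as missed.
    """
    if len(good_intervals) > n_problems:
        raise ValueError("more GOOD answers than problems")
    ratio_sum = 0
    for lo, hi in good_intervals:
        if lo <= 0 or hi < lo:
            raise ValueError("intervals must satisfy 0 < lo <= hi")
        ratio_sum += math.floor(hi / lo)
    n_missed = n_problems - len(good_intervals)
    # Each missed problem doubles the score, so misses outweigh wide intervals.
    return (10 + ratio_sum) * 2 ** n_missed


def default_budget(n_problems):
    """Default number of slips for a session of n_problems."""
    return math.floor(1.38 * n_problems)


# Worked example: 7 problems, 3 GOOD final answers.
good = [(10, 20), (3, 9), (100, 150)]   # floor ratios: 2, 3, 1
print(estimathon_score(good, 7))        # (10 + 6) * 2**4 = 256
print(default_budget(7))                # floor(9.66) = 9
```

Because the penalty for a miss is multiplicative, a session with no GOOD answers on 3 problems already scores `10 × 2^3 = 80`.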

**Refinement accuracy** is the key signal: of all voluntary bets on GOOD intervals, what fraction
paid off? Random guessing wins ~50%, so a rate well above 50% suggests genuine biological reasoning.
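
One way to compute this metric from a submission log is sketched below. It assumes the log is a chronological list of dicts with the hypothetical keys `problem` and `result`, that a bet is any submission on a problem whose previous submission was GOOD, and that a bet pays off when the new submission is also GOOD. The actual `slip_log` schema may differ.

```python
def refinement_accuracy(slip_log):
    """Fraction of refinement bets that stayed GOOD, or None if no bets."""
    last_result = {}      # problem id -> result of its latest slip
    bets = wins = 0
    for slip in slip_log:
        pid, result = slip["problem"], slip["result"]
        if last_result.get(pid) == "GOOD":
            # Resubmitting on a GOOD problem is a voluntary bet.
            bets += 1
            wins += result == "GOOD"
        last_result[pid] = result
    return wins / bets if bets else None


log = [
    {"problem": "q1", "result": "BAD"},
    {"problem": "q1", "result": "GOOD"},
    {"problem": "q1", "result": "GOOD"},   # bet, paid off
    {"problem": "q2", "result": "GOOD"},
    {"problem": "q2", "result": "BAD"},    # bet, lost
    {"problem": "q2", "result": "GOOD"},   # not a bet: previous slip was BAD
]
print(refinement_accuracy(log))  # 1 win out of 2 bets = 0.5
```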

### Two-track scoring in mixed mode

| Track | Task formats | Scoring |
|---|---|---|
| Estimathon | regression, pairwise | Interval score + refinement accuracy |
| One-shot | binary, multiclass, ternary | Exact-match accuracy |
| One-shot | generation (gene lists) | Correct if token F1 ≥ 0.5 |
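
For the generation row, a token-level F1 can be computed as in the sketch below. The tokenisation (split on commas and whitespace, upper-case, count duplicates) is an assumption; the benchmark may normalise gene symbols differently. The helper names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a gene list on commas/whitespace and normalise case."""
    return [t.upper() for t in re.split(r"[,\s]+", text.strip()) if t]


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (multiset overlap)."""
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def is_correct(prediction, reference, threshold=0.5):
    """Apply the F1 >= 0.5 rule from the table above."""
    return token_f1(prediction, reference) >= threshold


f1 = token_f1("TP53, FOXO3, SIRT1", "FOXO3, SIRT1, IGF1, MTOR")
print(round(f1, 3))   # precision 2/3, recall 2/4 -> F1 = 4/7, about 0.571
```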

---

## Providers

| `--provider` | Connects to | Credential |
|---|---|---|
| `anthropic` | Anthropic API | `anthropic.api_key` / `ANTHROPIC_API_KEY` |
| `endpoint` | Any OpenAI-compatible URL | `--api-key` + `--endpoint` |
| `hf` | HuggingFace Inference API | `hf.token` / `HF_TOKEN` |
| `openai` | OpenAI API | `openai.api_key` / `OPENAI_API_KEY` |
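
The credential column pairs a config key with an environment variable. The sketch below shows one plausible lookup, under two loud assumptions: that dotted keys such as `hf.token` map to nested objects in `~/.longevity/config.json`, and that the environment variable wins over the file. Neither is confirmed by this README, and the function names are hypothetical. The `endpoint` provider is left out because it takes its key from `--api-key`.

```python
import json
import os
from pathlib import Path

CONFIG_PATH = Path.home() / ".longevity" / "config.json"

# Config key and environment variable per provider, from the table above.
CREDENTIALS = {
    "anthropic": ("anthropic.api_key", "ANTHROPIC_API_KEY"),
    "hf": ("hf.token", "HF_TOKEN"),
    "openai": ("openai.api_key", "OPENAI_API_KEY"),
}


def lookup(config, dotted_key):
    """Follow a dotted key such as 'hf.token' through nested dicts."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def resolve_credential(provider, config=None, env=os.environ):
    """Return the credential for a provider, or None if not configured."""
    if config is None:
        config = json.loads(CONFIG_PATH.read_text()) if CONFIG_PATH.exists() else {}
    config_key, env_var = CREDENTIALS[provider]
    # Assumption: the environment variable takes precedence over the file.
    return env.get(env_var) or lookup(config, config_key)


cfg = {"hf": {"token": "hf_example"}}
print(resolve_credential("hf", config=cfg, env={}))
```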

## Task sources

| `--tasks` | Loads |
|---|---|
| `sample` | 7 built-in tasks — no network required |
| `longebench` | Full LongeBench benchmark (HuggingFace, gated) |
| `longebench:extra` | LongeBench extra split |
| `path/to/file.jsonl` | Local JSONL file |
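
The four forms of `--tasks` can be told apart with a small dispatcher. This is an illustrative sketch only; `parse_task_source` is a hypothetical name and does not describe how `loader.py` actually parses the value.

```python
from pathlib import Path


def parse_task_source(spec):
    """Classify a --tasks value into a (kind, detail) pair."""
    if spec == "sample":
        return ("sample", None)
    if spec == "longebench":
        return ("longebench", None)                    # default split
    if spec.startswith("longebench:"):
        return ("longebench", spec.split(":", 1)[1])   # named split
    if Path(spec).suffix == ".jsonl":
        return ("jsonl", Path(spec))                   # local file
    raise ValueError(f"unrecognised task source: {spec!r}")


for spec in ["sample", "longebench", "longebench:extra", "data/tasks.jsonl"]:
    print(spec, "->", parse_task_source(spec)[0])
```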

---

## Output

Results are written to `results.jsonl`. Fields include:

**Estimathon track**
- `final_score` — Estimathon score (lower is better)
- `n_good_final` / `n_problems` — problems solved
- `slips_used` / `total_budget`
- `refinement_accuracy` — fraction of refinement bets that succeeded
- `slip_log` — every submission with GOOD/BAD, width factor, score delta
- `think` — per-slip chain-of-thought trace (with `--think`)

**One-shot track**
- `correct` — boolean per task
- `f1` — for generation tasks
- `by_format` — accuracy breakdown per format
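
For post-hoc analysis, the file can be read back with the standard library. The sketch below assumes one JSON object per line and that one-shot records carry a `correct` flag; the `format` key used for grouping is a hypothetical field name, so adjust it to whatever the records actually contain.

```python
import json
from collections import defaultdict


def read_jsonl(path):
    """Yield one parsed object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def accuracy_by_format(records):
    """Accuracy per task format over records that carry a 'correct' flag."""
    totals = defaultdict(lambda: [0, 0])    # format -> [n_correct, n_seen]
    for rec in records:
        if "correct" not in rec:
            continue                        # e.g. an Estimathon-track record
        fmt = rec.get("format", "unknown")  # 'format' is a hypothetical key
        totals[fmt][0] += bool(rec["correct"])
        totals[fmt][1] += 1
    return {fmt: hit / seen for fmt, (hit, seen) in totals.items()}


records = [
    {"format": "binary", "correct": True},
    {"format": "binary", "correct": False},
    {"format": "generation", "correct": True, "f1": 0.67},
    {"final_score": 256},
]
print(accuracy_by_format(records))  # {'binary': 0.5, 'generation': 1.0}
```

To run the same summary on a real file, pass `read_jsonl("results.jsonl")` in place of `records`.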

---

## Project structure

```
longivity_hack/
├── cli.py                  Typer entry point
├── requirements.txt
├── idea.md                 Benchmark design document
├── devlog.md               Development log
├── CLAUDE.md               Developer guide for teammates
└── benchmark/
    ├── chat.py             Interactive chat UI (Claude tool-use, /setup wizard, slash autocomplete)
    ├── runner.py           Estimathon session, one-shot eval, run_mixed()
    ├── loader.py           Task loading — sample / LongeBench / local JSONL
    ├── client.py           Unified model client (all providers)
    ├── config.py           ~/.longevity/config.json
    └── results.py          JSONL writer / reader
```
