Metadata-Version: 2.4
Name: hypara
Version: 0.1.0
Summary: Benchmark harness for black-box optimizers that speak an ask/tell JSON Lines protocol
Author-email: jun76 <jun76.main@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/jun76/hypara
Project-URL: Repository, https://github.com/jun76/hypara
Project-URL: Issues, https://github.com/jun76/hypara/issues
Keywords: benchmark,optimization,black-box,hyperparameter-tuning,ask-tell,llm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: build>=1; extra == "dev"
Requires-Dist: twine>=5; extra == "dev"
Dynamic: license-file

# hypara

A benchmark harness for measuring how well an optimizer searches an **unknown
black-box evaluation function**.

hypara is deliberately not about solving famous problems (TSP, knapsack, bin
packing) where a strong off-the-shelf solver wins. Each problem ships a
natural-language description, a mixed search space, and a *hidden* evaluator
whose shape changes with the instance seed. To score well an optimizer has to
read the description, reason about the space, and adapt its strategy from the
evaluation history within a limited budget.

Optimizers are **language-agnostic external processes**: they talk to the
runner over a stdin/stdout JSON Lines protocol, so an optimizer can be written
in Python, Rust, Go, TypeScript, or any executable.

## Install

```bash
pip install hypara
```

For development (tests + build tooling):

```bash
pip install -e .[dev]
python -m pytest
```

## Quickstart

List the built-in problems:

```bash
hypara list
```

Write a minimal optimizer. Create `my_opt/manifest.json`:

```json
{"name": "my_opt", "command": ["python", "main.py"]}
```

and `my_opt/main.py`:

```python
import json, random, sys

space = []
rng = random.Random()

def send(msg):
    sys.stdout.write(json.dumps(msg) + "\n")
    sys.stdout.flush()

for line in sys.stdin:
    msg = json.loads(line)
    t = msg.get("type")
    if t == "init":
        space = msg["problem"]["space"]
        rng = random.Random(msg.get("optimizer_seed"))
        send({"type": "ready"})
    elif t == "ask":
        # propose a candidate; here, a trivial random pick over numeric params
        cand = {}
        for p in space:
            if p.get("condition") is not None:
                continue
            if p["type"] == "categorical":
                cand[p["name"]] = rng.choice(p["choices"])
            elif p["type"] == "bool":
                cand[p["name"]] = rng.random() < 0.5
            else:
                lo, hi = p["low"], p["high"]
                v = rng.uniform(lo, hi)
                cand[p["name"]] = int(round(v)) if p["type"] == "int" else v
        send({"type": "propose", "candidate": cand})
    elif t == "tell":
        pass  # inspect msg["score"], msg["valid"], msg["remaining"] to adapt
    elif t == "finish":
        break
```

Run it against one problem, then aggregate:

```bash
hypara run --problem smooth_hill --optimizer ./my_opt --seed 1
```

The source repository also includes two reference optimizers
(`optimizers/random_search`, `optimizers/hill_climb`) and ready-made suite
configs (`configs/smoke.json`, `configs/full.json`):

```bash
hypara suite --config configs/smoke.json
hypara report --dir results/smoke-YYYYmmdd-HHMMSS
```

## Built-in problems

All problems are single-objective, maximize, with an achievable maximum near
1.0. The hidden landscape is reseeded per run, so memorizing an instance does
not help.

| Problem | What it tests |
|---|---|
| `smooth_hill` | Smooth unimodal surface; local search should win. |
| `rugged_trap` | Multimodal with a decoy hill; needs restarts / exploration. |
| `conditional_knobs` | A categorical choice switches which knobs exist. |
| `noisy_lab` | Additive gaussian noise; beware chasing lucky readings. |
| `multi_fidelity` | Cheap biased low-fidelity vs. expensive true high-fidelity. |
| `sparse_needle` | One hidden combination scores high; weak partial-match signal. |
| `cost_aware` | The candidate's own `samples` knob drives its evaluation cost. |
| `rag_pipeline` | Surrogate RAG tuning (chunking, top_k, reranker interactions). |
| `image_pipeline` | Surrogate diffusion tuning; steps drive quality and cost. |
| `dispatch_policy` | Surrogate delivery policy; balance, batching, mild noise. |

## Protocol

The runner launches the optimizer as a child process (working directory = the
optimizer's directory; if `command[0]` is `"python"` it is replaced with the
runner's own interpreter). Messages are one JSON object per line: runner →
optimizer on stdin, optimizer → runner on stdout. **Optimizer stdout is
protocol-only; write debug output to stderr** (the runner saves it to
`optimizer.stderr.log`). Receivers ignore unknown keys. `NaN`/`Infinity` must
not be sent. Current `protocol_version` is `1`.

### Messages and turn-taking

| Direction | `type` | Reply |
|---|---|---|
| runner → optimizer | `init` | `ready` (once) |
| runner → optimizer | `ask` | `propose` (once) |
| runner → optimizer | `tell` | none |
| runner → optimizer | `finish` | none; exit promptly |

Only one `ask` is outstanding at a time. The `init` reply may take up to 30s,
each `ask` reply up to 60s by default; overruns end the run as
`optimizer_timeout`. A crash, an unparseable line, or an out-of-order message
ends the run as `failed`. The best-so-far is recorded in every case.

**init** (runner → optimizer):

```json
{"type": "init", "protocol_version": 1, "run_id": "smooth_hill--my_opt--s1",
 "problem": {
   "description": "natural-language prompt",
   "space": [ ...param specs (below)... ],
   "objective": "maximize",
   "budget": {"evaluations": 100, "cost_limit": null, "time_limit_sec": 300.0},
   "fidelities": null
 },
 "optimizer_seed": 12345}
```

`budget` always has at least one of `evaluations` or `cost_limit` non-null.
`fidelities`, when non-null, is ordered low→high (last entry = top fidelity).

**ready / propose** (optimizer → runner):

```json
{"type": "ready"}
{"type": "propose", "candidate": {"x0": 0.5, "algo": "alpha"}, "fidelity": "low"}
```

`fidelity` is optional; omitted/null means top fidelity. Sending a non-null
`fidelity` to a problem with no fidelities is invalid.

**tell** (runner → optimizer):

```json
{"type": "tell", "candidate_id": "c-0007", "candidate": {"x0": 0.5},
 "valid": true, "score": 0.73, "cost": 1.0, "fidelity": null, "error": null,
 "remaining": {"evaluations": 92, "cost": null, "time_sec": 291.3}}
```

When invalid: `valid: false`, `score: null`, and `error` gives the reason.

**finish** (runner → optimizer): `{"type": "finish", "reason": "budget_exhausted"}`
(`reason` is `budget_exhausted` or `time_limit`).

### Search space

```json
[
  {"name": "lr", "type": "float", "low": 1e-4, "high": 1.0, "log": true},
  {"name": "layers", "type": "int", "low": 1, "high": 12},
  {"name": "opt", "type": "categorical", "choices": ["sgd", "adam"]},
  {"name": "warmup", "type": "bool"},
  {"name": "warmup_steps", "type": "int", "low": 10, "high": 1000,
   "condition": {"param": "warmup", "equals": [true]}}
]
```

- Types: `float`, `int`, `categorical`, `bool`. Bounds `low`/`high` are
  inclusive; `log: true` hints a log scale.
- A param with `condition` is **active** only when
  `candidate[condition.param]` is in `equals`. Conditioning is one level deep
  (the parent must be unconditional).

A candidate is validated by the runner: it must be a JSON object containing
**exactly** the active params (no unknown keys, no inactive params, none
missing), each of the right type and within range.

### Budget rules

- A valid evaluation consumes the evaluator's `cost` (may depend on the
  candidate/fidelity); the `evaluations` axis always consumes 1.
- **An invalid proposal still consumes budget** (1 evaluation, cost 1.0), so
  spamming invalid candidates cannot mine the space for free.
- The stop check runs before each `ask`, so the final evaluation may slightly
  overshoot `cost_limit`.
- For problems with `fidelities`, **only top-fidelity evaluations count toward
  `best_score`**; lower fidelities are available as history but not scored.

## Metrics

`hypara report` recomputes everything from the saved logs. Per run: best
score, best candidate, best-so-far curve (over evaluations or cumulative
cost), valid rate, status, wall time. Aggregated per (problem, optimizer):
mean best, a baseline-relative normalized best and normalized anytime AUC
(0 = baseline median, 1 = best observed for that problem), and an overall
mean across problems.

## Adding a problem

Implement `Problem` under `src/hypara/problems/` and register it in
`src/hypara/registry.py`. Keep the description and the evaluator's actual
behavior in sync — the point of the benchmark is that reading the description
helps. The shared invariants in `tests/test_problems.py` (finite scores,
determinism given a seed, instance-seed sensitivity) apply automatically.

## License

MIT. See [LICENSE](LICENSE).
