Metadata-Version: 2.4
Name: grasp-skills
Version: 0.1.0
Summary: GRASP — self-improvement via a regression-gated skill library learned from an agent's own failure traces
Author: Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem
License: MIT
Project-URL: Homepage, https://github.com/jomoll/GRASP
Project-URL: Repository, https://github.com/jomoll/GRASP
Keywords: llm-agents,self-improvement,skill-learning,benchmark
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Dynamic: license-file

# GRASP

**GRASP** is a self-improvement method that learns a small, **regression-gated
skill library** from an agent's own failure traces. A proposed skill is kept
only when it improves performance on a held-out probe set — so the library grows
by keeping what demonstrably helps and discarding what doesn't.
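
The gating rule above can be sketched in a few lines. This is a minimal illustration, not the code in `grasp/`: the names `gate_skill`, `score`, and `min_gain` are hypothetical, a skill is modelled as a plain string, and the scorer is a toy stand-in for running the agent on the held-out probe set.

```python
from typing import Callable, List, Sequence

# Hypothetical types for illustration: a skill is a piece of guidance text,
# and a scorer maps a skill library to accuracy on the held-out probe set.
Skill = str
ScoreFn = Callable[[Sequence[Skill]], float]


def gate_skill(library: List[Skill], candidate: Skill, score: ScoreFn,
               min_gain: float = 0.0) -> List[Skill]:
    """Return the library with `candidate` added only if probe accuracy improves."""
    baseline = score(library)
    trial = library + [candidate]
    if score(trial) > baseline + min_gain:
        return trial   # accept: the candidate measurably helps
    return library     # reject: no improvement on the probes


# Toy scorer (assumption): only "check units" helps, and "guess" hurts.
def toy_score(lib: Sequence[Skill]) -> float:
    return 0.5 + (0.2 if "check units" in lib else 0.0) - 0.1 * ("guess" in lib)


lib: List[Skill] = []
lib = gate_skill(lib, "guess", toy_score)        # rejected (0.4 is not > 0.5)
lib = gate_skill(lib, "check units", toy_score)  # accepted (0.7 > 0.5)
print(lib)  # ['check units']
```

See [`docs/method.md`](docs/method.md) for how the actual loop proposes and gates skills.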

This repository is **two things**:

1. **A reusable method + framework** (`grasp/`) — apply GRASP to *your own* agent
   and tasks, and benchmark *your own* self-improvement method against GRASP and
   five baselines through a small plug-in interface.
2. **The full paper artifact** — four benchmark families (`benchmarks/`) and all
   released results behind the paper (`results/`).

## Install

```bash
pip install -e .          # core depends only on PyYAML
```

## Quickstart (no Docker, no server)

Watch GRASP learn skills on a laptop in minutes, on a self-contained slice of
MedAgentBench's read-only FHIR lookup tasks served by an in-process mock:

```bash
# point the 'local' backend at any OpenAI-compatible endpoint
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"
export GRASP_MODEL="your-model-name"

python -m examples.quickstart.run --agent local
```

The run writes a validation-accuracy learning curve and the learned skill
library under `examples/quickstart/runs/`. See [`examples/quickstart/`](examples/quickstart).

## Use GRASP on your own agent

Implement a `Task` (how to sample, run, and score your environment) and run GRASP
on it:

```python
from grasp import run_grasp
run_grasp(MyTask(), "config.yaml", agent="local")
```

- **`Task`** — `samples()`, `rollout(sample, agent)`, `evaluate(sample, output)`,
  plus optional `failure_tags` / `protocol_hook` / `updater_*` hooks.
- **`Method`** — GRASP is the reference `Method`; subclass it to benchmark your
  own self-improvement method on the same tasks.
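
As a rough sketch of what the three required `Task` methods might look like, the toy task below implements `samples()`, `rollout(sample, agent)`, and `evaluate(sample, output)` as a plain class. Everything beyond those three method names is an assumption made for illustration: the `Sample` dataclass, the return types, and an agent that is simply a callable from prompt to text. The real base class and signatures are described in [`docs/add_a_task.md`](docs/add_a_task.md).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    # Hypothetical sample shape; the framework may use a different one.
    question: str
    answer: str


class ArithmeticTask:
    """Toy task exposing the three documented methods (shapes are assumed)."""

    def samples(self) -> Iterable[Sample]:
        return [Sample("2+2", "4"), Sample("3*3", "9")]

    def rollout(self, sample: Sample, agent: Callable[[str], str]) -> str:
        # Assumption: the agent is callable with a prompt and returns text.
        return agent(sample.question)

    def evaluate(self, sample: Sample, output: str) -> bool:
        # Exact-match scoring; a real task would apply its own metric.
        return output.strip() == sample.answer


# Stand-in agent so the sketch runs without a model endpoint.
def stub_agent(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "unknown")


task = ArithmeticTask()
results: List[bool] = [
    task.evaluate(s, task.rollout(s, stub_agent)) for s in task.samples()
]
accuracy = sum(results) / len(results)
print(accuracy)  # 0.5
```

Failed samples such as `3*3` here are the kind of trace that GRASP learns skills from.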

| Read this | For |
|---|---|
| [`docs/method.md`](docs/method.md) | how GRASP works — the loop and the regression gate |
| [`docs/add_a_task.md`](docs/add_a_task.md) | plug in your own environment |
| [`docs/add_a_method.md`](docs/add_a_method.md) | benchmark your own method vs. GRASP + 5 baselines |

---

## Benchmarks (the paper artifact)

Each benchmark is self-contained under `benchmarks/`, with its own README for
environment setup (conda, Docker, data) and a `run_all.sh <backend> [run_name]`
helper.

| Directory | Benchmark | Role in paper | Setup |
|---|---|---|---|
| `benchmarks/MedAgentBench/` | FHIR reads/writes against a live FHIR server | primary (clinical) | Docker |
| `benchmarks/MedAgentBench-v2/` | Harder FHIR tasks: multi-step decisions, coordinated writes | primary (clinical) | Docker |
| `benchmarks/FHIR-AgentBench/` | Structured clinical QA / tool use on an independent FHIR store | supporting (clinical) | GCP Healthcare API |
| `benchmarks/AgentBench/` | Four non-clinical environments: OS, DBBench, WebShop, ALFWorld | supporting (generality) | Docker |

The paper compares GRASP against a no-skills baseline and five other
self-improvement methods. GRASP and the five methods are implemented in each
benchmark directory: `grasp` (**GRASP**, ours), `memory_cycle` (Sequential
memory), `batch_memory_cycle` (Batch memory), `expel_cycle` (ExpeL),
`evo_memory_cycle` (Evo-MedAgent), and `skillx_cycle` (SkillX).

The executing agent and skill-writer use the same model. Five preset backends
are selectable at run time (`gptoss`, `deepseek`, `gemini`, `gpt5`, `gpt4`),
plus a generic `local` OpenAI-compatible endpoint. No secrets are stored in the
repository; presets read endpoints and keys from environment variables. See
each benchmark's `configs/agents/README.md`.
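
To illustrate the environment-variable pattern, here is a minimal sketch of resolving endpoint settings at run time and failing early when one is unset. It reuses the variable names from the quickstart (`OPENAI_BASE_URL`, `OPENAI_API_KEY`, `GRASP_MODEL`). The `Endpoint` dataclass and `load_local_endpoint` helper are hypothetical and not part of the repository, and the benchmark presets may read different variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Endpoint:
    base_url: str
    api_key: str
    model: str


REQUIRED = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "GRASP_MODEL")


def load_local_endpoint(env: Mapping[str, str] = os.environ) -> Endpoint:
    """Build endpoint settings from environment variables; fail loudly if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Endpoint(env["OPENAI_BASE_URL"], env["OPENAI_API_KEY"], env["GRASP_MODEL"])


# Demo with an explicit mapping so the sketch runs without touching the real environment.
demo_env = {
    "OPENAI_BASE_URL": "http://localhost:8000/v1",
    "OPENAI_API_KEY": "EMPTY",
    "GRASP_MODEL": "your-model-name",
}
endpoint = load_local_endpoint(demo_env)
print(endpoint.model)  # your-model-name
```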

## Released results

All numbers behind the paper live under **[`results/`](results/)** — per-seed
validation, test, and OOD accuracies for every cell of Tables 1–5, the learned
skill libraries, the frozen transfer libraries, and the run configurations.
Reproduce the headline tables directly:

```bash
python results/reproduce_tables.py                 # Table 1 (all models) + Table 5
python results/reproduce_tables.py gpt-oss-120b     # one model
```

See [`results/README.md`](results/README.md) for the full directory↔cell map.

---

## License

MIT (see [`LICENSE`](LICENSE)) for the GRASP core, examples, and docs. Vendored
benchmark code under `benchmarks/AgentBench/` and `benchmarks/FHIR-AgentBench/`
retains its own upstream license.

## Citation

See [`CITATION.cff`](CITATION.cff).
