Metadata-Version: 2.4
Name: benchrail
Version: 0.2.2
Summary: CLI for benchmarking agent setups
Author: tripcher
License-Expression: MIT
Keywords: agents,benchmark,claude,cli,codex,evaluation
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: docker>=7.0
Requires-Dist: pydantic<3,>=1.10
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.12
Description-Content-Type: text/markdown

# benchrail

CLI for benchmarking agent setups.

`benchrail` is a simple CLI for running the same tasks across different agent
setups and comparing the results.

An agent setup can include:

- a different agent (Codex, Claude code)
- a different model
- a different skill
- a different tool
- a different `AGENTS.md`
- a different prompt or context-engineering strategy
- a different execution environment

Use `benchrail` to measure whether a change actually makes the agent better on
the tasks you care about.

## Quick Start

### Install

Install from PyPI with `uv`:

```bash
uv tool install benchrail
```

Install from PyPI with `pip`:

```bash
pip install benchrail
```

### Run
Run in Docker mode:

```bash
benchrail run \
  --dataset examples/multi-swe-bench-universal-smoke \
  --mode docker
```

Run the same dataset locally:

```bash
benchrail run \
  --dataset examples/multi-swe-bench-universal-smoke \
  --mode local 
```

If you use Docker mode and want to reuse your local AI agent login instead of passing
an API key into the container, add:

```bash
--auth-session
```

## First-Look Mental Model

The workflow is intentionally simple:

1. Create or choose a dataset directory
2. Validate it with `benchrail validate`
3. Run it against one or more agents with `benchrail run`
4. Inspect per-task JSON results, logs, and the aggregated run summary

At runtime, the tool builds the cartesian product of:

- dataset instances
- manifest agents

That becomes the task queue for the run.

## Core Concepts

### Dataset

A dataset is a directory containing:

- a `manifest.json` file describing which agents to run
- an optional `config.json` and `environment/` directory containing instance defaults
- one subdirectory per benchmark instance

### Instance

Each instance contains a `config.json` plus optional environment scripts and patches.

### Agent

An agent entry in `manifest.json` maps an agent id to an adapter and optional CLI
arguments such as model selection.

## Dataset Layout

Expected dataset shape:

```text
<dataset>/
  manifest.json
  config.json
  environment/
    Dockerfile
    setup.sh
  <instance_id>/
    config.json
    environment/
      Dockerfile
      setup.sh
      run-gold-tests.sh
      any_check.sh
    patches/
      test.patch
```

Dataset `config.json` fields are inherited by each instance. Nested Docker environment
values, hooks, and named check commands are merged, while explicit instance values override
dataset defaults.

Dataset `environment/` files are copied first, then instance `environment/` files are copied
on top. For an inherited Dockerfile such as `"dockerfile": "environment/Dockerfile"`, the
instance path is used when it exists; otherwise Benchrail falls back to the dataset path.
An explicit instance `docker.image` overrides an inherited Dockerfile, and an explicit
instance `docker.dockerfile` overrides an inherited image.

Example included in this repository:

- `examples/multi-swe-bench-universal-smoke`

Validate it before your first run:

```bash
benchrail validate \
  --dataset examples/multi-swe-bench-universal-smoke
```

## Example `manifest.json`

```json
{
  "agents": [
    {
      "id": "codex-gpt-5.4-mini-medium",
      "agent": "codex",
      "version": "latest",
      "command": "--model gpt-5.4-mini --config model_reasoning_effort=\"medium\" --disable fast_mode"
    }
  ]
}
```

Current built-in agent types:

- `codex`
- `claude-code`

## Execution Modes

### `local`

Use local mode when the host machine already has the right toolchains and agent CLI
access.

Pros:

- Faster iteration
- No container setup
- Easier local debugging

Tradeoffs:

- Depends on host environment consistency
- Harder to make fully reproducible across machines

### `docker`

Use Docker mode when you want a more reproducible execution environment or need the
provided universal image flow.

Pros:

- Better environment isolation
- Better fit for multi-language benchmark runs
- Easier to standardize across machines and CI

Tradeoffs:

- Requires Docker
- Adds image and container overhead

## Output Artifacts

By default, result artifacts are written under the run workspace. If `--output` is
provided, result JSON and CSV summaries are written there instead.

Aggregated run artifacts:

```text
<output-or-workspace>/<run_id>/
  result.json
  result.csv
```

Per-task artifacts:

```text
<output-or-workspace>/<run_id>/<agent_id>/<instance_id>/
  result.json
  agent.patch
```

Per-task logs:

```text
<logs-root>/<run_id>/<agent_id>/<instance_id>/
  runner.log
  logs/
    agent.stdout
    agent.stderr
    check_<name>.stdout
    check_<name>.stderr
    ...
```

The aggregated run summary includes:

- passed / failed / total tasks
- total duration
- token counts, when available
- cost in USD and credits, when available
- per-check pass/fail counts

## Development

Run unit tests:

```bash
make unit
```

Manual release prep:

```bash
make bump BUMP=patch
git commit -am "Release $(make print-release-tag)"
git push origin main
make tag-release
git push origin "$(make print-release-tag)"
```

After the tag is pushed, run the GitHub release workflow with that tag.

Run lint and type checks:

```bash
make lint
```

Format the codebase:

```bash
make format
```

Equivalent direct commands:

```bash
uv run pytest tests/unit/ -v
uv run ruff check benchrail/ tests/
uv run mypy benchrail tests
uv run ruff format benchrail/ tests/
uv run ruff check --fix benchrail/ tests/
```

## License

The source code in this repository is licensed under the MIT License. See
[LICENSES/LICENSE](./LICENSES/LICENSE).

This repository also contains third-party derived materials:

- `docker/universal/` was adapted in part from
  `https://github.com/openai/codex-universal` (MIT)
- `examples/multi-swe-bench-universal-smoke/` is derived from `SWE-bench_Lite`
  and `SWE-bench_Multilingual`

See [LICENSES/THIRD_PARTY.md](./LICENSES/THIRD_PARTY.md) for attribution and
redistribution caveats for dataset-derived content.
