Metadata-Version: 2.4
Name: wargames
Version: 0.1.0
Summary: War Games measures whether frontier models can sustain long-horizon planning and adaptation in complex, non-stationary real-time games, with human performance as the benchmark.
Project-URL: Homepage, https://github.com/layerbrain/wargames
Project-URL: Repository, https://github.com/layerbrain/wargames
Project-URL: Issues, https://github.com/layerbrain/wargames/issues
Author: Layerbrain
License-Expression: MIT
License-File: LICENSE
Keywords: benchmark,cua,games,openra,red-alert,reinforcement-learning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Games/Entertainment
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: msgpack>=1.0.0
Requires-Dist: openai>=2.32.0
Requires-Dist: python-xlib>=0.33
Requires-Dist: pyyaml>=6.0.0
Provides-Extra: dev
Requires-Dist: mypy>=1.13.0; extra == 'dev'
Requires-Dist: ruff>=0.8.0; extra == 'dev'
Provides-Extra: server
Requires-Dist: fastapi>=0.115.0; extra == 'server'
Requires-Dist: httpx>=0.28.0; extra == 'server'
Requires-Dist: uvicorn>=0.30.0; extra == 'server'
Requires-Dist: websockets>=14.0; extra == 'server'
Description-Content-Type: text/markdown

# WarGames

WarGames turns OpenRA Red Alert into a computer-use environment for agentic AI.
An agent receives pixels and a small CUA tool set, then sends mouse/keyboard/wait
actions back to the simulator.

The runtime never calls an LLM and never trains a model. It does three things:
capture frames, apply tool calls, and compute rewards from private simulator
state. Your agent or external harness owns model calls. Prime/prime-rl owns
gradient updates.
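
For orientation, the harness-side loop amounts to something like the sketch below. The `env`/`agent` method names and the action dict shape are illustrative assumptions, not the actual WarGames API.

```python
# Illustrative harness-side loop; method names and the action dict
# shape are hypothetical, not the actual WarGames API.
async def rollout(env, agent, max_steps: int = 100) -> float:
    obs = await env.reset()                   # first screenshot + metadata
    total = 0.0
    for _ in range(max_steps):
        action = await agent.decide(obs)      # e.g. {"type": "click", "x": 320, "y": 240}
        obs, reward, done = await env.step(action)  # simulator applies the tool call
        total += reward                       # reward computed from private simulator state
        if done:
            break
    return total
```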

## Example Output

This is a short Kimi K2.5 smoke run. The agent receives screenshots, chooses
CUA actions, and WarGames applies them to the live OpenRA window.

![Kimi K2.5 Red Alert smoke run](docs/assets/kimi-k2-redalert-smoke.gif)

## Install

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Red Alert needs a working OpenRA checkout:

```bash
export LAYERBRAIN_WARGAMES_REDALERT_OPENRA_ROOT=/path/to/openra-source
export LAYERBRAIN_WARGAMES_REDALERT_OPENRA_BINARY=/path/to/openra-source/launch-game.sh
```

## Local Secrets

Create `local.env` from the template. `local.env` is gitignored.

```bash
cp local.env.example local.env
```

Use provider-standard names for model keys:

```bash
OPENAI_API_KEY=
OPENAI_BASE_URL=
OPENAI_MODEL=
ANTHROPIC_API_KEY=
ANTHROPIC_MODEL=
GOOGLE_API_KEY=
GOOGLE_MODEL=
```

`LAYERBRAIN_PRIME` is a publish/admin key only. WarGames does not use it for
model inference.
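
If your own harness needs those values in-process, a minimal loader is enough. This sketch assumes `local.env` holds plain `KEY=VALUE` lines; it is not part of the WarGames API.

```python
import os

def load_local_env(path: str = "local.env") -> None:
    """Export KEY=VALUE lines into os.environ, skipping blanks and comments."""
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```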

## Tasks

Each task combines a mission, a seed, a split, and a reward profile.

```bash
wargames tasks --game redalert --split debug
```

Splits:

- `debug`: tiny smoke tasks
- `train`: tasks agents may learn from
- `validation`: tasks for tuning prompts, profile weights, and max steps
- `test`: held-out tasks reserved for reported benchmark results
- `curriculum`: ordered train tasks

The catalog rejects the same `(mission_id, seed)` appearing in multiple splits.
It also rejects `train_only` reward profiles on `test`.
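
For intuition, the first check amounts to something like the sketch below; the entry field names are assumptions, not the actual catalog schema.

```python
from collections import defaultdict

def check_split_overlap(entries) -> None:
    """Reject any (mission_id, seed) pair that appears in more than one split."""
    splits_by_task = defaultdict(set)
    for entry in entries:  # entries with mission_id/seed/split fields (hypothetical shape)
        splits_by_task[(entry.mission_id, entry.seed)].add(entry.split)
    for key, splits in splits_by_task.items():
        if len(splits) > 1:
            raise ValueError(f"{key} appears in multiple splits: {sorted(splits)}")
```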

## Agents

Agents are named YAML configs under `agents/` or your own `--agent-dir`.

```bash
wargames agents list
wargames agents validate agents/scripted-wait.yaml
```

Example:

```yaml
id: my-agent
driver: python
factory: my_project.agent:create_agent
provider: openai
model: ${OPENAI_MODEL}
api_key_env: OPENAI_API_KEY
base_url: ${OPENAI_BASE_URL}
config:
  temperature: 0.2
  top_p: 0.9
  max_tokens: 256
  timeout_seconds: 20
  disable_reasoning: false
  reject_reasoning_models: false
  reasoning_effort: medium
  extra_body:
    enable_thinking: true
    chat_template_kwargs:
      enable_thinking: true
```

The Python factory receives the `AgentSpec` and returns an object implementing:

```python
class Agent:
    async def start(self, task): ...   # called once with the task before stepping
    async def decide(self, obs): ...   # map an observation to the next tool call
    async def close(self): ...         # clean up when the run ends
```

For OpenAI-compatible providers, `config` is passed through to the local
agent wrapper. Use it to choose model behavior per run. For fast non-thinking
smoke runs, set `disable_reasoning: true` and keep `max_tokens` small. For
models that need internal thinking, set `disable_reasoning: false` and pass the
provider-specific `extra_body` they require. WarGames does not own those keys or
settings; the agent config does.
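
As a concrete starting point, a factory matching the YAML above might look like the sketch below; the `wait` action shape is a hypothetical example, not the actual WarGames action schema.

```python
# my_project/agent.py -- matches factory: my_project.agent:create_agent above.
# The action dict shape is a hypothetical example, not the real schema.
class WaitAgent:
    async def start(self, task):
        self.task = task

    async def decide(self, obs):
        return {"type": "wait", "seconds": 1.0}  # always wait; useful for smoke runs

    async def close(self):
        pass

def create_agent(spec):  # receives the AgentSpec built from the YAML config
    return WaitAgent()
```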

## Run Locally

```bash
wargames run \
  --task redalert.debug.smoke.seed-000000 \
  --agent scripted-wait \
  --watch none \
  --record summary_only
```

For demo/debug runs, record frames and export video later:

```bash
wargames run \
  --task redalert.debug.smoke.seed-000000 \
  --agent scripted-wait \
  --watch window \
  --record full \
  --video frames

wargames export <run_id> --out exports --video mp4
```

MP4 is export-only. Runs write frames; export turns frames into a shareable
video.

## Reward Profiles

List profiles:

```bash
wargames profile list --game redalert
```

Built-ins:

- `terminal`: win/loss only
- `standard`: terminal + mild dense shaping
- `dense`: training-only dense profile
- `protective`: defense-aligned profile that rewards friendly-force preservation
- `aggressive_stress_test`: training-only contrast profile, blocked from test

Validate a profile YAML:

```bash
wargames profile validate scenarios/redalert/profiles/protective.yaml
```

Profiles are the behavior dial. The same model can be evaluated under different
profiles to measure whether reward design changes behavior.
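
Conceptually, a profile weights a set of reward primitives. A minimal sketch of that idea, with entirely hypothetical term names; see the docs link below for the real schema:

```python
def profile_reward(terms: dict[str, float], weights: dict[str, float]) -> float:
    # terms: per-step reward primitives from private simulator state;
    # names like "win" or "units_preserved" are hypothetical here.
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())
```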

The full profile schema, every Red Alert reward field, built-in primitives, and
Prime RL examples are documented in [`docs/reward_profiles.md`](docs/reward_profiles.md).

## Watching

Local:

```bash
wargames run --task ... --agent ... --watch window
```

Replay public events from disk:

```bash
wargames watch <run_id>
```

Public event files never include hidden state. Private traces are only written
when explicitly requested.

## Prime Intellect

The Prime implementation lives in `wargames.environments.prime`. The public
Prime environment is `layerbrain/wargames`; `environments/prime` is only a
thin publish wrapper around it.

```bash
uv pip install -e ./environments/prime
prime eval run wargames --config environments/prime/configs/eval-debug.toml -n 1 -r 1
```

Prime RL uses the shipped TOML configs. WarGames supplies the environment and
reward signal; Prime/prime-rl owns rollouts, batching, GPUs, and gradient
updates.

RL training changes behavior by changing `reward_profile` in the Prime config:

```toml
split = "train"
reward_profile = "protective"
recorder_mode = "none"
max_steps = 500
rollouts_per_example = 8
```

Use `dense` or `protective` on `train`/`curriculum`, then report against
`terminal` or `standard` on `test`.

## Tests

```bash
source venv/bin/activate
python -m unittest tests.evaluation tests.harness
python -m unittest discover -s environments/prime/tests/conformance
```
