
IMPALA V1 Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a narrow but honest IMPALA baseline for vector-observation, discrete-action control using synchronous rollout collection and V-trace actor-critic updates.

Architecture: Keep the release deliberately narrow and aligned with the current on-policy package shape. Reuse the existing discrete actor-critic model, vector-environment rollout loop, checkpointing, evaluation path, managed API, and package exports, but add a dedicated IMPALA algorithm that computes V-trace policy/value targets from behavior-policy log-probabilities gathered during rollout collection. Explicitly do not implement distributed actors, learner queues, recurrent state support, image observations, multi-GPU learners, or asynchronous RPC-style IMPALA infrastructure in this batch.

Tech Stack: Python 3.10, PyTorch, Gymnasium, pytest, existing rl_training actor-critic and experiment infrastructure.


Task 1: Add failing IMPALA coverage

Files:
- Create: tests/test_impala_update.py
- Create: tests/test_impala_trainer_smoke.py
- Create: tests/test_impala_reference_script.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_checkpoint_workflows.py
- Modify: tests/test_experiment_manager.py
- Modify: tests/test_cli.py
- Modify: tests/test_package_smoke.py

Step 1: Write the failing tests
- impala_loss(...) returns named V-trace actor-critic metrics.
- IMPALA.update(...) accepts sequence rollout batches with behavior log-probabilities and bootstrap values.
- train_impala() writes a checkpoint and evaluation metrics on CartPole-v1.
- The root, api, and algorithms package exports include IMPALA.
- Checkpoint workflows can evaluate and resume an IMPALA checkpoint.
- The packaged config resolves outside the repo root, and the reference script runs as a smoke command.

Step 2: Run the tests to verify they fail
Run: pytest -q tests/test_impala_update.py tests/test_impala_trainer_smoke.py tests/test_impala_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py tests/test_cli.py tests/test_package_smoke.py
Expected: FAIL with missing impala modules / exports.

Task 2: Implement the IMPALA learner

Files:
- Create: src/rl_training/algorithms/impala.py
- Modify: src/rl_training/algorithms/__init__.py

Step 1: Write minimal implementation
- Implement V-trace return / advantage computation from rewards, dones, bootstrap value, behavior log-probabilities, and current-policy log-probabilities.
- Implement IMPALA as a discrete actor-critic learner over flattened rollout sequences.
- Keep scope explicit: vector observations only, discrete actions only, synchronous rollout batches only.
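The V-trace computation in the first bullet can be sketched without any framework. The version below is a minimal reference under stated assumptions, not the plan's actual implementation: the function name vtrace_targets, its argument names, and the clipping defaults rho_bar = c_bar = 1.0 are hypothetical, and it handles a single time-major sequence in plain Python rather than batched PyTorch tensors. A reference like this could be useful for checking a tensor implementation in tests/test_impala_update.py.

```python
import math


def vtrace_targets(rewards, dones, values, bootstrap_value,
                   behavior_logp, target_logp,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Return (vs, advantages) for one time-major sequence of length T.

    All names and defaults here are illustrative assumptions.
    values[t] is V(x_t); bootstrap_value is V(x_T) for the state after the
    last step; dones[t] marks that the episode ended at step t.
    """
    T = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]

    # Truncated importance weights: rho_t clips the TD error, c_t clips the trace.
    ratios = [math.exp(tl - bl) for tl, bl in zip(target_logp, behavior_logp)]
    rhos = [min(rho_bar, r) for r in ratios]
    cs = [min(c_bar, r) for r in ratios]

    vs = [0.0] * T
    carry = 0.0  # holds vs_{t+1} - V(x_{t+1}) while scanning backwards
    for t in reversed(range(T)):
        nonterminal = 0.0 if dones[t] else 1.0
        delta = rhos[t] * (rewards[t] + gamma * nonterminal * next_values[t] - values[t])
        carry = delta + gamma * nonterminal * cs[t] * carry
        vs[t] = values[t] + carry

    # Policy-gradient advantage bootstraps from the next V-trace target.
    next_vs = vs[1:] + [bootstrap_value]
    advantages = [
        rhos[t] * (rewards[t] + gamma * (0.0 if dones[t] else 1.0) * next_vs[t] - values[t])
        for t in range(T)
    ]
    return vs, advantages


# On-policy case (identical log-probs): V-trace reduces to discounted returns.
vs, adv = vtrace_targets(
    rewards=[1.0, 1.0, 1.0], dones=[False, False, False],
    values=[0.0, 0.0, 0.0], bootstrap_value=0.0,
    behavior_logp=[-0.5, -0.5, -0.5], target_logp=[-0.5, -0.5, -0.5],
    gamma=0.5,
)
print(vs)  # [1.75, 1.5, 1.0]
```

With synchronous rollouts the behavior and target policies differ only by whatever updates happen between collection and the loss computation, so the importance ratios stay close to 1; the clipping still matters once a batch is reused for more than one gradient step.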

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_impala_update.py
Expected: PASS.

Task 3: Implement trainer and workflow integration

Files:
- Create: src/rl_training/runtime/impala_trainer.py
- Modify: src/rl_training/experiment/registry.py
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py

Step 1: Write minimal implementation
- Reuse the vector-env rollout collection path from the current actor-critic trainers.
- Store behavior-policy log-probabilities and bootstrap values during collection.
- Update the policy with V-trace targets, then support train / eval / resume / predict through registry wiring.
- Keep evaluation and prediction on the existing discrete actor-critic deterministic action path.
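The collection requirement above comes down to one detail: the log-probability must be recorded at the moment the action is sampled, under the policy that produced it, and the value of the final observation must be kept as the bootstrap value. The sketch below illustrates that with a single toy environment in plain Python. Everything in it is an assumption for illustration: the Rollout container, the collect_rollout function, and the callback signatures are hypothetical and are not the existing rl_training rollout loop, which is vectorized.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """Hypothetical time-major container, one entry per environment step."""
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    rewards: list = field(default_factory=list)
    dones: list = field(default_factory=list)
    behavior_log_probs: list = field(default_factory=list)
    values: list = field(default_factory=list)
    bootstrap_value: float = 0.0


def collect_rollout(policy_probs, value_fn, env_step, obs, num_steps, rng):
    """Collect num_steps transitions starting from obs.

    policy_probs(obs) -> list of action probabilities (assumed callback)
    value_fn(obs)     -> float value estimate (assumed callback)
    env_step(obs, a)  -> (next_obs, reward, done) (assumed callback)
    """
    rollout = Rollout()
    for _ in range(num_steps):
        probs = policy_probs(obs)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        next_obs, reward, done = env_step(obs, action)

        rollout.observations.append(obs)
        rollout.actions.append(action)
        rollout.rewards.append(reward)
        rollout.dones.append(done)
        # Log-prob under the behavior policy, captured at sampling time.
        rollout.behavior_log_probs.append(math.log(probs[action]))
        rollout.values.append(value_fn(obs))
        obs = next_obs

    # Value of the observation after the last step, used to bootstrap V-trace.
    rollout.bootstrap_value = value_fn(obs)
    return rollout, obs


# Toy usage: a counter "environment" that never terminates.
rollout, last_obs = collect_rollout(
    policy_probs=lambda o: [0.25, 0.75],
    value_fn=lambda o: float(o),
    env_step=lambda o, a: (o + 1, 1.0, False),
    obs=0, num_steps=4, rng=random.Random(0),
)
print(len(rollout.behavior_log_probs), rollout.bootstrap_value)  # 4 4.0
```

Recomputing the log-probabilities from the current network at update time would silently turn every importance ratio into 1, which is why the stored values matter even in a synchronous setup.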

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_impala_trainer_smoke.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py
Expected: PASS.

Task 4: Add config, example, and docs

Files:
- Create: configs/impala/cartpole.yaml
- Create: src/rl_training/assets/configs/impala/cartpole.yaml
- Create: examples/impala_cartpole_reference.py
- Modify: README.md
- Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md

Step 1: Write minimal implementation
- Add a runnable IMPALA CartPole preset.
- Add a reference script for a tiny synchronous run.
- Update README and yearly sourcebook to mark IMPALA as implemented in a narrow synchronous v1 form.

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_impala_reference_script.py tests/test_cli.py tests/test_package_smoke.py
Expected: PASS.

Task 5: Regression verification

Files:
- Modify only if verification reveals regressions.

Step 1: Run focused regression coverage
Run: pytest -q tests/test_impala_update.py tests/test_impala_trainer_smoke.py tests/test_impala_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py tests/test_cli.py tests/test_package_smoke.py tests/test_a2c_trainer.py tests/test_ppo_trainer.py tests/test_trpo_update.py
Expected: PASS.

Step 2: Run the full suite
Run: pytest -q
Expected: PASS.