APPO V1 Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a narrow but honest APPO baseline for vector-observation, discrete-action control by combining PPO-style clipped policy updates with V-trace off-policy correction over synchronous rollout batches.

Architecture: Keep the release deliberately narrow and aligned with the current discrete actor-critic package shape. Reuse the existing actor-critic model, vector-environment rollout loop, checkpointing, evaluation path, managed API, and the new IMPALA V-trace utilities, but add a dedicated APPO algorithm that applies a PPO-style clipped objective against behavior-policy log-probabilities while training values against V-trace targets. Explicitly do not implement distributed actors, asynchronous learner queues, recurrent state handling, image observations, or multi-GPU / RPC infrastructure in this batch.
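The V-trace targets that the architecture relies on can be illustrated with a dependency-free reference. This is a minimal sketch, not the existing IMPALA utility: the function name `vtrace_targets`, its argument names, and the single-environment, time-major list layout are all assumptions made for illustration, and a real implementation would operate on PyTorch tensors. It follows the standard V-trace recursion with truncated importance ratios.

```python
import math
from typing import List, Sequence, Tuple


def vtrace_targets(
    behavior_logp: Sequence[float],
    target_logp: Sequence[float],
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    bootstrap_value: float,
    gamma: float = 0.99,
    rho_bar: float = 1.0,
    c_bar: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Hypothetical reference: V-trace value targets and policy advantages.

    All sequences are time-major for one environment. `values[t]` is V(x_t)
    and `bootstrap_value` is V(x_T) for the state after the last step.
    """
    T = len(rewards)
    # Importance ratios pi(a|x) / mu(a|x), truncated at rho_bar and c_bar.
    ratios = [math.exp(t - b) for t, b in zip(target_logp, behavior_logp)]
    rhos = [min(rho_bar, r) for r in ratios]
    cs = [min(c_bar, r) for r in ratios]
    # A terminal step cuts the bootstrap from the following state.
    discounts = [0.0 if d else gamma for d in dones]
    next_values = list(values[1:]) + [bootstrap_value]

    vs = [0.0] * T
    acc = 0.0  # running value of vs_{t+1} - V(x_{t+1})
    for t in reversed(range(T)):
        delta = rhos[t] * (rewards[t] + discounts[t] * next_values[t] - values[t])
        acc = delta + discounts[t] * cs[t] * acc
        vs[t] = values[t] + acc

    # Policy-gradient advantages bootstrap from the next V-trace target.
    next_vs = vs[1:] + [bootstrap_value]
    advantages = [
        rhos[t] * (rewards[t] + discounts[t] * next_vs[t] - values[t])
        for t in range(T)
    ]
    return vs, advantages
```

When the target and behavior log-probabilities are equal, every ratio is 1 and the targets reduce to ordinary n-step returns, which makes a convenient sanity check for a unit test.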

Tech Stack: Python 3.10, PyTorch, Gymnasium, pytest, existing rl_training actor-critic and experiment infrastructure.


Task 1: Add failing APPO coverage

Files:
- Create: tests/test_appo_update.py
- Create: tests/test_appo_trainer_smoke.py
- Create: tests/test_appo_reference_script.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_checkpoint_workflows.py
- Modify: tests/test_experiment_manager.py
- Modify: tests/test_cli.py
- Modify: tests/test_package_smoke.py

Step 1: Write the failing tests
- appo_loss(...) returns named clipped-V-trace actor-critic metrics.
- APPO.update(...) accepts rollout batches with behavior log-probabilities and a bootstrap value.
- train_appo() writes a checkpoint and evaluation metrics on CartPole-v1.
- The root, api, and algorithms package exports include APPO.
- Checkpoint workflows can evaluate and resume an appo checkpoint.
- The packaged config resolves outside the repo root, and the reference script runs as a smoke command.

Step 2: Run the tests to verify they fail
Run: pytest -q tests/test_appo_update.py tests/test_appo_trainer_smoke.py tests/test_appo_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py tests/test_cli.py tests/test_package_smoke.py
Expected: FAIL with missing appo modules / exports.

Task 2: Implement the APPO learner

Files:
- Create: src/rl_training/algorithms/appo.py
- Modify: src/rl_training/algorithms/__init__.py

Step 1: Write minimal implementation
- Reuse the current V-trace computation from IMPALA.
- Implement PPO-style clipped policy loss against behavior-policy log-probabilities.
- Train values against V-trace targets and keep entropy regularization.
- Keep scope explicit: vector observations only, discrete actions only, synchronous rollout batches only.

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_appo_update.py
Expected: PASS.
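The loss described in Step 1 of this task can be sketched in plain Python to make the arithmetic explicit. This is an illustration under stated assumptions, not the planned `appo_loss(...)`: the name `appo_loss_sketch`, the argument names, the metric keys, and the default coefficients are hypothetical, and the real learner would compute the same quantities on PyTorch tensors so that gradients flow through the target-policy log-probabilities. The advantages and value targets are assumed to come from a V-trace computation and to be treated as constants.

```python
import math
from typing import Dict, Sequence


def appo_loss_sketch(
    target_logp: Sequence[float],
    behavior_logp: Sequence[float],
    advantages: Sequence[float],
    values: Sequence[float],
    value_targets: Sequence[float],
    entropies: Sequence[float],
    clip_ratio: float = 0.2,
    value_coef: float = 0.5,
    entropy_coef: float = 0.01,
) -> Dict[str, float]:
    """Hypothetical scalar version of a clipped V-trace actor-critic loss."""
    n = len(advantages)
    surrogate_sum = 0.0
    clipped_count = 0
    for lp, blp, adv in zip(target_logp, behavior_logp, advantages):
        # Ratio is taken against the behavior policy that generated the data.
        ratio = math.exp(lp - blp)
        clipped = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
        # PPO-style pessimistic bound: take the smaller of the two surrogates.
        surrogate_sum += min(ratio * adv, clipped * adv)
        if clipped != ratio:
            clipped_count += 1

    policy_loss = -surrogate_sum / n
    # Values regress toward the V-trace targets.
    value_loss = 0.5 * sum((v - tgt) ** 2 for v, tgt in zip(values, value_targets)) / n
    entropy = sum(entropies) / n
    total = policy_loss + value_coef * value_loss - entropy_coef * entropy
    return {
        "loss": total,
        "policy_loss": policy_loss,
        "value_loss": value_loss,
        "entropy": entropy,
        "clip_fraction": clipped_count / n,
    }
```

Returning a dictionary of named scalars mirrors the "named metrics" requirement from Task 1, although the actual key names are up to the implementation.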

Task 3: Implement trainer and workflow integration

Files:
- Create: src/rl_training/runtime/appo_trainer.py
- Modify: src/rl_training/experiment/registry.py
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py

Step 1: Write minimal implementation
- Reuse the synchronous vector rollout loop from the current on-policy actor-critic trainers.
- Collect behavior log-probabilities and bootstrap value for each rollout batch.
- Update the policy with clipped V-trace objectives, then support train / eval / resume / predict through registry wiring.
- Keep evaluation and prediction on the existing discrete actor-critic deterministic action path.

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_appo_trainer_smoke.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py
Expected: PASS.
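To show what "collect behavior log-probabilities and bootstrap value for each rollout batch" might look like, here is a small container sketch. The class name `RolloutBuffer`, its fields, and the dictionary keys are hypothetical and are not taken from the existing trainers. The point is only that the log-probability recorded at action-selection time is stored alongside each transition, and that the value of the final observation is attached once when the batch is closed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RolloutBuffer:
    """Hypothetical time-major buffer for one synchronous rollout batch."""

    observations: List[Any] = field(default_factory=list)
    actions: List[int] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)
    dones: List[bool] = field(default_factory=list)
    behavior_logp: List[float] = field(default_factory=list)
    values: List[float] = field(default_factory=list)

    def add(self, obs: Any, action: int, reward: float, done: bool,
            logp: float, value: float) -> None:
        # logp must be the log-probability under the policy that actually
        # chose the action, captured before any parameter update.
        self.observations.append(obs)
        self.actions.append(action)
        self.rewards.append(reward)
        self.dones.append(done)
        self.behavior_logp.append(logp)
        self.values.append(value)

    def finish(self, bootstrap_value: float) -> Dict[str, Any]:
        """Close the batch with V(x_T) of the last observation and reset."""
        batch = {
            "observations": list(self.observations),
            "actions": list(self.actions),
            "rewards": list(self.rewards),
            "dones": list(self.dones),
            "behavior_logp": list(self.behavior_logp),
            "values": list(self.values),
            "bootstrap_value": bootstrap_value,
        }
        for name in ("observations", "actions", "rewards",
                     "dones", "behavior_logp", "values"):
            getattr(self, name).clear()
        return batch
```

In a vectorized trainer each field would typically be a tensor of shape (T, num_envs) rather than a Python list; the list form here only keeps the sketch free of dependencies.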

Task 4: Add config, example, and docs

Files:
- Create: configs/appo/cartpole.yaml
- Create: src/rl_training/assets/configs/appo/cartpole.yaml
- Create: examples/appo_cartpole_reference.py
- Modify: README.md
- Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md

Step 1: Write minimal implementation
- Add a runnable appo CartPole preset.
- Add a reference script for a tiny synchronous run.
- Update README and yearly sourcebook to mark APPO as implemented in a narrow synchronous v1 form.

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_appo_reference_script.py tests/test_cli.py tests/test_package_smoke.py
Expected: PASS.

Task 5: Regression verification

Files:
- Modify only if verification reveals regressions.

Step 1: Run focused regression coverage
Run: pytest -q tests/test_appo_update.py tests/test_appo_trainer_smoke.py tests/test_appo_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py tests/test_cli.py tests/test_package_smoke.py tests/test_impala_update.py tests/test_impala_trainer_smoke.py tests/test_ppo_update.py tests/test_a2c_update.py tests/test_trpo_update.py
Expected: PASS.

Step 2: Run full suite
Run: pytest -q
Expected: PASS.