
R2D2 V1 Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a narrow but honest R2D2 baseline for discrete-action, vector-observation environments by extending the new recurrent value-learning lane with prioritized recurrent replay and n-step returns.

Architecture: Reuse the existing LSTMQNetwork and DRQN package shape instead of pretending to ship the full distributed R2D2 paper stack. Add chunk-level prioritized recurrent replay, compute sequence priorities from masked TD errors, and train with n-step bootstrapped targets through a dedicated r2d2 trainer. Explicitly do not implement distributed actors, burn-in, or value rescaling in this batch.
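The n-step bootstrapped target named above can be sketched as plain arithmetic. The function name `n_step_target` and its arguments are hypothetical. The plan does not show the interface of NStepAccumulator, so this only illustrates the return computation, assuming the window is cut at the first terminal step and no bootstrap is added past it.

```python
def n_step_target(rewards, dones, bootstrap_value, gamma):
    """Compute sum_k gamma^k * r_k + gamma^n * bootstrap_value.

    rewards, dones: the n transitions in the window, oldest first.
    bootstrap_value: value estimate for the state after the window,
        for example max_a Q_target(s_{t+n}, a) (an assumption here).
    """
    target = 0.0
    discount = 1.0
    for reward, done in zip(rewards, dones):
        target += discount * reward
        discount *= gamma
        if done:
            # Episode ended inside the window: drop the bootstrap term.
            return target
    return target + discount * bootstrap_value


# Three steps of reward 1.0 with gamma = 0.5 and a bootstrap value of 10.0.
full_window = n_step_target([1.0, 1.0, 1.0], [False, False, False], 10.0, 0.5)
# Same rewards, but the episode terminates on the second step.
cut_window = n_step_target([1.0, 1.0, 1.0], [False, True, False], 10.0, 0.5)
```

With these inputs `full_window` is 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0, and `cut_window` is 1 + 0.5 = 1.5 because the terminal step removes both the later reward and the bootstrap.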

Tech Stack: Python 3.10, PyTorch, Gymnasium, pytest, existing rl_training trainer/registry/checkpoint stack.


Task 1: Add failing R2D2 coverage

Files:
- Create: tests/test_prioritized_recurrent_replay_buffer.py
- Create: tests/test_r2d2_update.py
- Create: tests/test_r2d2_trainer_smoke.py
- Create: tests/test_r2d2_reference_script.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_checkpoint_workflows.py

Step 1: Write the failing tests
- PrioritizedRecurrentReplayBuffer stores chunk priorities, samples with beta, and updates priorities.
- R2D2.update() returns named metrics and exposes sequence priorities.
- train_r2d2() writes a checkpoint and evaluation metrics on CartPole-v1.
- The package exports R2D2 across the root/api/algorithms/data surfaces.
- The reference script runs as a smoke command.
- The checkpoint workflows can evaluate a saved r2d2 checkpoint.

Step 2: Run the tests to verify they fail
Run: pytest -q tests/test_prioritized_recurrent_replay_buffer.py tests/test_r2d2_update.py tests/test_r2d2_trainer_smoke.py tests/test_r2d2_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py
Expected: FAIL with missing r2d2 modules / exports.

Task 2: Implement prioritized recurrent replay

Files:
- Create: src/rl_training/data/prioritized_recurrent_replay_buffer.py
- Modify: src/rl_training/data/__init__.py

Step 1: Write minimal implementation
- store recurrent chunks exactly like RecurrentReplayBuffer
- add chunk-level priorities, sample(batch_size, beta=...), and update_priorities(indices, priorities)
- include weights and indices in sampled batches
- persist priorities and active chunks in state_dict()
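The bullets above describe standard proportional prioritized replay applied to whole chunks. A minimal stdlib-only sketch of that idea follows. The class `TinyPrioritizedChunkBuffer`, the `alpha` exponent, the `eps` floor, and the max-priority rule for new chunks are all assumptions for illustration. The plan only fixes `sample(batch_size, beta=...)`, `update_priorities(indices, priorities)`, and `state_dict()`, and the real buffer stores recurrent chunks the way RecurrentReplayBuffer does.

```python
import random


class TinyPrioritizedChunkBuffer:
    """Illustrative stand-in, not the planned PrioritizedRecurrentReplayBuffer."""

    def __init__(self, capacity, alpha=0.6, seed=0):
        self.capacity = capacity
        self.alpha = alpha  # assumption: priority exponent, not named in the plan
        self.chunks = []
        self.priorities = []
        self.next_slot = 0
        self.rng = random.Random(seed)

    def add(self, chunk):
        # New chunks take the current max priority so they are likely sampled soon.
        priority = max(self.priorities, default=1.0)
        if len(self.chunks) < self.capacity:
            self.chunks.append(chunk)
            self.priorities.append(priority)
        else:
            # Ring-buffer overwrite of the oldest slot.
            self.chunks[self.next_slot] = chunk
            self.priorities[self.next_slot] = priority
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        indices = self.rng.choices(range(len(self.chunks)), weights=probs, k=batch_size)
        # Importance-sampling weights (N * P(i)) ** -beta, normalised by the batch max.
        n = len(self.chunks)
        raw = [(n * probs[i]) ** (-beta) for i in indices]
        top = max(raw)
        return {
            "chunks": [self.chunks[i] for i in indices],
            "indices": indices,
            "weights": [w / top for w in raw],
        }

    def update_priorities(self, indices, priorities, eps=1e-6):
        # eps keeps every chunk sampleable even when its TD error is zero.
        for i, p in zip(indices, priorities):
            self.priorities[i] = abs(p) + eps

    def state_dict(self):
        return {
            "chunks": list(self.chunks),
            "priorities": list(self.priorities),
            "next_slot": self.next_slot,
        }


buf = TinyPrioritizedChunkBuffer(capacity=3, seed=0)
for name in ["a", "b", "c", "d"]:
    buf.add(name)  # "d" overwrites the oldest chunk "a"
buf.update_priorities([0, 1, 2], [10.0, 0.0, 0.0])
batch = buf.sample(batch_size=8, beta=1.0)
```

Normalising the weights by the batch maximum keeps them in (0, 1], which only ever scales the loss down. Other normalisations exist, and the plan does not say which one the real buffer uses.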

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_prioritized_recurrent_replay_buffer.py
Expected: PASS.

Task 3: Implement R2D2 algorithm and trainer

Files:
- Create: src/rl_training/algorithms/r2d2.py
- Create: src/rl_training/runtime/r2d2_trainer.py

Step 1: Write minimal implementation
- R2D2 reuses LSTMQNetwork and masked recurrent TD loss
- support optional importance-sampling weights
- expose last_sequence_priorities for replay updates
- train_r2d2() uses NStepAccumulator plus prioritized recurrent replay
- support only vector observations and discrete actions
- keep evaluation / checkpoint flow parallel to train_drqn()
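How the masked loss, the importance-sampling weights, and the sequence priorities fit together can be sketched without PyTorch. The function `masked_sequence_loss` is hypothetical, and two choices in it are assumptions the plan does not state: a squared-error loss, and a priority that mixes the max and mean absolute TD error with weight `eta`, as in the R2D2 paper. The real implementation operates on tensors produced by LSTMQNetwork.

```python
def masked_sequence_loss(td_errors, mask, weights=None, eta=0.9):
    """Return (scalar loss, per-sequence priorities).

    td_errors, mask: nested lists shaped [batch][time]; mask is 1.0 for
        real steps and 0.0 for padding.
    weights: optional per-sequence importance-sampling weights.
    """
    if weights is None:
        weights = [1.0] * len(td_errors)
    total_loss = 0.0
    total_steps = 0
    priorities = []
    for errors, valid, weight in zip(td_errors, mask, weights):
        # Padding steps are dropped before they can touch loss or priority.
        abs_errors = [abs(e) for e, m in zip(errors, valid) if m > 0]
        total_loss += weight * sum(e * e for e in abs_errors)
        total_steps += len(abs_errors)
        if abs_errors:
            # Assumption: eta * max + (1 - eta) * mean of |TD error|.
            mean_error = sum(abs_errors) / len(abs_errors)
            priorities.append(eta * max(abs_errors) + (1.0 - eta) * mean_error)
        else:
            priorities.append(0.0)
    # Average over valid steps only, so padding does not dilute the loss.
    loss = total_loss / max(total_steps, 1)
    return loss, priorities


loss, seq_priorities = masked_sequence_loss(
    td_errors=[[1.0, -3.0, 100.0], [2.0, 2.0, 2.0]],
    mask=[[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]],
    weights=[1.0, 0.5],
    eta=0.5,
)
```

In this example the padded error of 100.0 is ignored. The loss is (1 * (1 + 9) + 0.5 * 12) / 5 = 3.2 and the priorities are [2.5, 2.0]. Values like `seq_priorities` are what a trainer would pass back to `update_priorities(indices, priorities)` after each update.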

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_r2d2_update.py tests/test_r2d2_trainer_smoke.py
Expected: PASS.

Task 4: Wire package surfaces and docs

Files:
- Create: examples/r2d2_cartpole_reference.py
- Create: configs/r2d2/cartpole.yaml
- Create: src/rl_training/assets/configs/r2d2/cartpole.yaml
- Modify: src/rl_training/algorithms/__init__.py
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py
- Modify: src/rl_training/experiment/registry.py
- Modify: README.md
- Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md

Step 1: Write minimal implementation
- add r2d2 load/eval/predict functions and a registry spec
- add the managed API class R2D2
- add the config and the packaged asset config
- add the reference script
- update the docs to mark R2D2 as implemented in a narrow, non-distributed v1

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_r2d2_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py
Expected: PASS.

Task 5: Regression verification

Files:
- Modify only if verification reveals regressions.

Step 1: Run focused regression coverage
Run: pytest -q tests/test_prioritized_recurrent_replay_buffer.py tests/test_r2d2_update.py tests/test_r2d2_trainer_smoke.py tests/test_r2d2_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_drqn_update.py tests/test_drqn_trainer_smoke.py tests/test_dqn_update.py tests/test_dqn_trainer_smoke.py
Expected: PASS.

Step 2: Run full suite
Run: pytest -q
Expected: PASS.