RLPD Phase 13 Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Add a narrow but package-usable RLPD path for continuous-control offline-to-online training by reusing the current SAC actor-critic stack and adding trainer-level prior-data pretraining plus mixed offline/online updates.
Architecture: Keep the algorithm v1 intentionally small. Reuse MLPSACModel and the existing SAC-style update as the learner core, then make RLPD distinct at the trainer level through offline dataset loading, offline pretrain updates, online replay collection, and configurable mixed batches drawn from prior data plus current online replay. Reuse the current checkpoint, eval, predict, schedule, and early-stopping surfaces instead of introducing a new runtime family.
Tech Stack: Python, PyTorch, Gymnasium, existing rl_training offline dataset, replay buffer, callback, and experiment infrastructure
Task 1: Freeze The Narrow RLPD Scope
Files:
- Create: docs/plans/2026-03-12-rlpd-phase13.md
- Modify: README.md
- Modify: docs/plans/2026-03-12-mainstream-rl-package-design.md
- Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
- Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md
Step 1: Freeze v1 boundaries
Document the first packaged RLPD release as:
- continuous Box actions only
- flat vector observations only
- single-process online trainer only
- prior data loaded through the existing offline dataset path
- online experience stored in the existing replay buffer path
- trainer-level offline pretraining plus mixed offline/online update batches
- fixed entropy coefficient alpha in this phase
- no recurrent path
- no image observations
- no distributed actor-learner runtime in this phase
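The space-related boundaries above could be enforced with a small guard before training starts. The following is a minimal sketch, not the package's actual check: the function name rlpd_v1_scope_problems is hypothetical, and spaces are duck-typed by class name and shape so the snippet runs without Gymnasium installed. A real implementation would more likely use isinstance against gymnasium.spaces.Box.

```python
from typing import Any, List


def rlpd_v1_scope_problems(observation_space: Any, action_space: Any) -> List[str]:
    """Return the reasons an environment falls outside the v1 RLPD scope.

    Hypothetical helper. A space counts as a Box if its class is named "Box"
    and it exposes a `shape` tuple, which matches gymnasium.spaces.Box.
    """
    problems: List[str] = []

    # v1 supports continuous Box actions only.
    if type(action_space).__name__ != "Box":
        problems.append("action space must be a continuous Box")

    # v1 supports flat vector observations only (no images, no dict spaces).
    if type(observation_space).__name__ != "Box":
        problems.append("observation space must be a Box")
    elif len(observation_space.shape) != 1:
        problems.append("observations must be flat vectors (1-D shape)")

    return problems
```

An empty list means the environment fits the frozen scope; a trainer could raise a ValueError that joins the returned messages otherwise.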
Step 2: Record the package rationale
Explain that RLPD is the next low-friction addition from the 2022 offline-to-online wave because:
- it is a recognizable offline-to-online baseline that lets the package use prior data for online improvement
- it reuses the current SAC mental model instead of forcing a new world-model or sequence-model stack
- it productizes offline dataset loading and replay collection together on one trainer path
Step 3: Keep test execution deferred
Record that tests are added but intentionally not executed until the user explicitly requests it.
Task 2: Add The RLPD Learner
Files:
- Create: src/rl_training/algorithms/rlpd.py
- Modify: src/rl_training/algorithms/__init__.py
Step 1: Reuse the current SAC model family
Build RLPD on MLPSACModel and keep the learner update deliberately narrow by reusing the current SAC loss structure in v1.
Step 2: Expose an explicit package learner
Add a readable RLPD algorithm class plus an rlpd_loss(...) helper so the package exports RLPD / RLPDAlgorithm as a first-class algorithm even though trainer behavior is where most of the package distinction lives.
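For orientation, the SAC loss structure that the learner reuses centers on a soft Bellman target with clipped double-Q and an entropy term. The sketch below shows that standard target on plain Python scalars, with alpha held fixed as in the v1 scope. It is illustrative only: soft_td_target and critic_mse are hypothetical names, and the real rlpd_loss(...) helper would operate on batched PyTorch tensors from MLPSACModel.

```python
from typing import Sequence


def soft_td_target(
    reward: float,
    done: float,
    next_q1: float,
    next_q2: float,
    next_log_prob: float,
    gamma: float = 0.99,
    alpha: float = 0.2,
) -> float:
    """Standard SAC critic target with a fixed entropy coefficient alpha.

    y = r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))
    """
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value


def critic_mse(q_values: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between critic predictions and soft TD targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)
```

Because the target does not depend on where a transition came from, the same loss applies to offline prior-data samples and online replay samples, which is why the trainer can carry most of the RLPD-specific behavior.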
Step 3: Keep runtime assumptions explicit
Document in code and plan comments that v1 defers architecture-specific paper details such as broader normalization or distributed data collection unless they are needed for the package path.
Task 3: Add The Offline-To-Online RLPD Trainer
Files:
- Create: src/rl_training/runtime/rlpd_trainer.py
- Modify: src/rl_training/experiment/registry.py
Step 1: Reuse offline data and online replay
Build the trainer on _infer_env_spaces(...) and _build_offline_dataset(...) from the current offline path plus the standard ReplayBuffer from the online path.
Step 2: Add narrow prior-data training controls
Support the following algo_kwargs in v1:
- offline_pretrain_updates
- offline_batch_ratio
- buffer_capacity
- learning_starts
- train_frequency
- gradient_updates_per_step
Allow mixed updates that combine offline prior-data batches with online replay batches after collection starts.
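One way to read offline_batch_ratio is as the fraction of each update batch drawn from prior data, with the remainder drawn from online replay. The sketch below assumes that interpretation. The helper names are hypothetical, transitions are plain Python objects rather than tensors, sampling is with replacement, and the prior dataset is assumed to be non-empty.

```python
import random
from typing import Any, List, Sequence, Tuple


def split_batch(batch_size: int, offline_batch_ratio: float) -> Tuple[int, int]:
    """Split a batch into (offline, online) sample counts."""
    if not 0.0 <= offline_batch_ratio <= 1.0:
        raise ValueError("offline_batch_ratio must be in [0, 1]")
    n_offline = int(round(batch_size * offline_batch_ratio))
    return n_offline, batch_size - n_offline


def sample_mixed_batch(
    offline_data: Sequence[Any],
    online_data: Sequence[Any],
    batch_size: int,
    offline_batch_ratio: float,
    rng: random.Random,
) -> List[Any]:
    """Draw one update batch from prior data plus online replay."""
    n_offline, n_online = split_batch(batch_size, offline_batch_ratio)
    # Before any online experience exists, fall back to prior data only.
    if not online_data:
        n_offline, n_online = batch_size, 0
    batch = [rng.choice(offline_data) for _ in range(n_offline)]
    batch += [rng.choice(online_data) for _ in range(n_online)]
    return batch
```

With batch_size=256 and offline_batch_ratio=0.5, each update would see 128 prior-data transitions and 128 online transitions. Logging the two counts per update gives the mixed-batch metrics a simple, checkable definition.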
Step 3: Preserve shared controls
Keep support for:
- eval_interval
- early stopping callbacks
- learning-rate schedules
- checkpoint save / resume
- update-budget tracking
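Update-budget tracking has to cover both the offline pretrain phase and the online phase. The sketch below shows one way the counts could add up, assuming the timing knobs behave as in common SAC-style trainers: no updates before learning_starts environment steps, then gradient_updates_per_step updates every train_frequency steps. These semantics and the function names are assumptions, not the package's confirmed behavior.

```python
def planned_updates(
    env_step: int,
    learning_starts: int,
    train_frequency: int,
    gradient_updates_per_step: int,
) -> int:
    """Gradient updates to run after the given 1-based environment step."""
    if env_step < learning_starts:
        return 0  # still collecting initial online experience
    if env_step % train_frequency != 0:
        return 0  # not a training step
    return gradient_updates_per_step


def total_update_budget(
    offline_pretrain_updates: int,
    total_env_steps: int,
    learning_starts: int,
    train_frequency: int,
    gradient_updates_per_step: int,
) -> int:
    """Offline pretrain updates plus all online-phase updates."""
    online_updates = sum(
        planned_updates(step, learning_starts, train_frequency, gradient_updates_per_step)
        for step in range(1, total_env_steps + 1)
    )
    return offline_pretrain_updates + online_updates
```

Computing the total up front lets learning-rate schedules and checkpoint resume logic key off a single update counter that spans both phases.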
Task 4: Wire RLPD Into The Package Surface
Files:
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py
- Create: configs/rlpd/pendulum.yaml
- Create: src/rl_training/assets/configs/rlpd/pendulum.yaml
- Modify: README.md
Step 1: Add the managed API entrypoint
Expose RLPD through the root package and API namespaces.
Step 2: Ship a starter config
Add a packaged Pendulum-v1 config for narrow offline-to-online RLPD training.
Step 3: Update package docs
Document RLPD as the first packaged prior-data offline-to-online trainer on top of the SAC lane.
Task 5: Add Unexecuted Coverage
Files:
- Create: tests/test_rlpd_update.py
- Create: tests/test_rlpd_trainer_smoke.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_experiment_manager.py
- Modify: tests/test_checkpoint_workflows.py
- Modify: tests/test_package_smoke.py
- Modify: tests/test_cli.py
Step 1: Add learner-level coverage
Add a small unit test for rlpd_loss(...) and one RLPD.update(...) call.
Step 2: Add trainer smoke coverage
Add a smoke test that checks offline pretraining, online replay collection, mixed-batch metrics, and checkpoint creation.
Step 3: Extend package-surface expectations
Update public exports, managed API, checkpoint workflow, and packaged-config tests so RLPD is treated as a shipped algorithm.
Step 4: Keep execution deferred
Do not run the tests until the user explicitly asks for test execution.