
RLPD Phase 13 Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a narrow but package-usable RLPD path for continuous-control offline-to-online training by reusing the current SAC actor-critic stack and adding trainer-level prior-data pretraining plus mixed offline/online updates.

Architecture: Keep v1 of the algorithm intentionally small. Reuse MLPSACModel and the existing SAC-style update as the learner core, then make RLPD distinct at the trainer level through offline dataset loading, offline pretrain updates, online replay collection, and configurable mixed batches drawn from prior data plus current online replay. Reuse the current checkpoint, eval, predict, schedule, and early-stopping surfaces instead of introducing a new runtime family.

Tech Stack: Python, PyTorch, Gymnasium, existing rl_training offline dataset, replay buffer, callback, and experiment infrastructure


Task 1: Freeze The Narrow RLPD Scope

Files:

  • Create: docs/plans/2026-03-12-rlpd-phase13.md
  • Modify: README.md
  • Modify: docs/plans/2026-03-12-mainstream-rl-package-design.md
  • Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
  • Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md

Step 1: Freeze v1 boundaries

Document the first packaged RLPD release as:

  • continuous Box actions only
  • flat vector observations only
  • single-process online trainer only
  • prior data loaded through the existing offline dataset path
  • online experience stored in the existing replay buffer path
  • trainer-level offline pretraining plus mixed offline/online update batches
  • fixed entropy coefficient alpha in this phase
  • no recurrent path
  • no image observations
  • no distributed actor-learner runtime in this phase
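The boundaries above could be enforced with a small guard at trainer construction time. The following is a minimal sketch under stated assumptions: `EnvSpec` and `check_rlpd_v1_scope` are hypothetical names introduced for illustration and are not part of the plan or the package.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EnvSpec:
    """Hypothetical summary of an environment's spaces (illustration only)."""

    action_kind: str            # e.g. "box" or "discrete"
    obs_shape: Tuple[int, ...]  # e.g. (3,) for Pendulum-v1
    num_workers: int = 1


def check_rlpd_v1_scope(spec: EnvSpec) -> None:
    """Raise ValueError if the spec falls outside the v1 boundaries listed above."""
    if spec.action_kind != "box":
        raise ValueError("RLPD v1 supports continuous Box actions only")
    if len(spec.obs_shape) != 1:
        raise ValueError("RLPD v1 supports flat vector observations only")
    if spec.num_workers != 1:
        raise ValueError("RLPD v1 supports a single-process online trainer only")
```

Failing fast here keeps unsupported setups (image observations, discrete actions, multi-worker collection) from reaching the learner with a confusing shape error.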

Step 2: Record the package rationale

Explain that RLPD is the next low-friction step in the 2022 offline-to-online wave because:

  • it is a recognizable offline-to-online baseline that lets the package use prior data for online improvement
  • it reuses the current SAC mental model instead of forcing a new world-model or sequence-model stack
  • it productizes offline dataset loading and replay collection together on one trainer path

Step 3: Keep test execution deferred

Record that tests are added but intentionally not executed until the user explicitly requests it.

Task 2: Add The RLPD Learner

Files:

  • Create: src/rl_training/algorithms/rlpd.py
  • Modify: src/rl_training/algorithms/__init__.py

Step 1: Reuse the current SAC model family

Build RLPD on MLPSACModel and keep the learner update deliberately narrow by reusing the current SAC loss structure in v1.
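For reference, the standard SAC loss structure with a fixed entropy coefficient alpha can be written out on scalars. This is a sketch of the textbook SAC terms for a single transition, not the package's actual loss code; the function names and the default values for `alpha` and `gamma` are illustrative assumptions.

```python
def sac_td_target(reward, done, next_q1, next_q2, next_log_prob, alpha=0.2, gamma=0.99):
    """Soft Bellman target: r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - float(done)) * soft_value


def critic_loss(q1, q2, target):
    """Sum of squared TD errors for the two critics against a shared target."""
    return (q1 - target) ** 2 + (q2 - target) ** 2


def actor_loss(log_prob, q1_pi, q2_pi, alpha=0.2):
    """SAC actor objective for one sample: alpha * log pi(a|s) - min(Q1, Q2)."""
    return alpha * log_prob - min(q1_pi, q2_pi)
```

Because alpha is fixed in this phase, there is no temperature loss term; only the critic and actor terms are needed.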

Step 2: Expose an explicit package learner

Add a readable RLPD algorithm class plus an rlpd_loss(...) helper so the package exports RLPD / RLPDAlgorithm as a first-class algorithm even though trainer behavior is where most of the package distinction lives.

Step 3: Keep runtime assumptions explicit

Document in code and plan comments that v1 defers architecture-specific paper details such as broader normalization or distributed data collection unless they are needed for the package path.

Task 3: Add The Offline-To-Online RLPD Trainer

Files:

  • Create: src/rl_training/runtime/rlpd_trainer.py
  • Modify: src/rl_training/experiment/registry.py

Step 1: Reuse offline data and online replay

Build the trainer on _infer_env_spaces(...) and _build_offline_dataset(...) from the current offline path plus the standard ReplayBuffer from the online path.

Step 2: Add narrow prior-data training controls

Support the following algo_kwargs in v1:

  • offline_pretrain_updates
  • offline_batch_ratio
  • buffer_capacity
  • learning_starts
  • train_frequency
  • gradient_updates_per_step
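One way to keep these controls narrow is to validate them in a single place. The sketch below is an assumption-heavy illustration: the `RLPDKwargs` class, its default values, and the `updates_due` semantics are hypothetical and are not specified by this plan. It assumes `offline_batch_ratio` is the fraction of each update batch drawn from prior data, and that updates run every `train_frequency` environment steps once `learning_starts` steps have been collected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLPDKwargs:
    """Hypothetical container for the v1 algo_kwargs; defaults are illustrative only."""

    offline_pretrain_updates: int = 0
    offline_batch_ratio: float = 0.5
    buffer_capacity: int = 100_000
    learning_starts: int = 1_000
    train_frequency: int = 1
    gradient_updates_per_step: int = 1

    def __post_init__(self) -> None:
        if not 0.0 <= self.offline_batch_ratio <= 1.0:
            raise ValueError("offline_batch_ratio must lie in [0, 1]")
        if self.offline_pretrain_updates < 0 or self.learning_starts < 0:
            raise ValueError("offline_pretrain_updates and learning_starts must be >= 0")
        if min(self.buffer_capacity, self.train_frequency, self.gradient_updates_per_step) < 1:
            raise ValueError(
                "buffer_capacity, train_frequency, and gradient_updates_per_step must be >= 1"
            )

    def updates_due(self, env_step: int) -> int:
        """Gradient updates to run after `env_step` online steps (assumed semantics)."""
        if env_step < self.learning_starts or env_step % self.train_frequency != 0:
            return 0
        return self.gradient_updates_per_step
```

A trainer could build this from the config's `algo_kwargs` mapping with `RLPDKwargs(**algo_kwargs)`, which also rejects unknown keys.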

Allow mixed updates that combine offline prior-data batches with online replay batches after collection starts.
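The mixed update can be sketched as a batch split followed by two independent draws. This is a minimal illustration, not the trainer's implementation: `split_batch` and `sample_mixed_batch` are hypothetical helpers, plain Python sequences stand in for the offline dataset and replay buffer, and `offline_batch_ratio` is assumed to be the fraction of each batch taken from prior data.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_batch(batch_size: int, offline_batch_ratio: float) -> Tuple[int, int]:
    """Return (n_offline, n_online) with n_offline + n_online == batch_size."""
    n_offline = int(round(batch_size * offline_batch_ratio))
    return n_offline, batch_size - n_offline


def sample_mixed_batch(
    offline: Sequence[T],
    online: Sequence[T],
    batch_size: int,
    offline_batch_ratio: float,
    rng: random.Random,
) -> List[T]:
    """Draw a batch (with replacement) from prior data and online replay.

    `offline` must be non-empty. If `online` is still empty, the whole
    batch comes from prior data (an assumed fallback).
    """
    n_offline, n_online = split_batch(batch_size, offline_batch_ratio)
    if not online:
        n_offline, n_online = batch_size, 0
    batch = [rng.choice(offline) for _ in range(n_offline)]
    batch += [rng.choice(online) for _ in range(n_online)]
    return batch
```

For example, a batch size of 256 with `offline_batch_ratio=0.5` yields 128 prior-data transitions and 128 online transitions per update. Logging the two counts per update is one simple way to produce mixed-batch metrics.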

Step 3: Preserve shared controls

Keep support for:

  • eval_interval
  • early stopping callbacks
  • learning-rate schedules
  • checkpoint save / resume
  • update-budget tracking

Task 4: Wire RLPD Into The Package Surface

Files:

  • Modify: src/rl_training/api/algorithms.py
  • Modify: src/rl_training/api/__init__.py
  • Modify: src/rl_training/__init__.py
  • Create: configs/rlpd/pendulum.yaml
  • Create: src/rl_training/assets/configs/rlpd/pendulum.yaml
  • Modify: README.md

Step 1: Add the managed API entrypoint

Expose RLPD through the root package and API namespaces.

Step 2: Ship a starter config

Add a packaged Pendulum-v1 config for narrow offline-to-online RLPD training.

Step 3: Update package docs

Document RLPD as the first packaged prior-data offline-to-online trainer on top of the SAC lane.

Task 5: Add Unexecuted Coverage

Files:

  • Create: tests/test_rlpd_update.py
  • Create: tests/test_rlpd_trainer_smoke.py
  • Modify: tests/test_package_api_exports.py
  • Modify: tests/test_public_api.py
  • Modify: tests/test_experiment_manager.py
  • Modify: tests/test_checkpoint_workflows.py
  • Modify: tests/test_package_smoke.py
  • Modify: tests/test_cli.py

Step 1: Add learner-level coverage

Add a small unit test for rlpd_loss(...) and one RLPD.update(...) call.

Step 2: Add trainer smoke coverage

Add a smoke test that checks offline pretraining, online replay collection, mixed-batch metrics, and checkpoint creation.

Step 3: Extend package-surface expectations

Update public exports, managed API, checkpoint workflow, and packaged-config tests so RLPD is treated as a shipped algorithm.

Step 4: Keep execution deferred

Do not run the tests until the user explicitly asks for test execution.