CRR Phase 8 Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Add a narrow but package-usable CRR implementation for offline continuous-control training on top of the existing dataset, checkpoint, and managed API surfaces.
Architecture: Reuse the current offline training shape already used by AWAC, CQL, and IQL instead of introducing another runtime family. Implement CRR with the shared MLPSACModel, target critics, offline dataset sampling, and standard package evaluation / prediction wiring.
Tech Stack: Python, PyTorch, Gymnasium, existing rl_training offline dataset and experiment infrastructure
Task 1: Lock The Narrow CRR Scope
Files:
- Create: docs/plans/2026-03-12-crr-phase8.md
- Modify: README.md
- Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
- Modify: docs/plans/2026-03-12-mainstream-rl-package-design.md
Step 1: Freeze the first-release boundaries
Document the v1 constraints:
- continuous Box actions only
- flat vector observations only
- offline dataset training only
- no sequence model path
- no distributed runtime
- no image observations in v1
Step 2: Explain why CRR is the next low-friction wave
Record that CRR is a practical follow-on: it is still shipped by current offline RL libraries, and it reuses the existing AWAC/CQL/IQL infrastructure instead of requiring a new runtime.
Step 3: Keep verification deferred
Document that test execution remains intentionally deferred until the user explicitly requests it.
Task 2: Add The CRR Learner
Files:
- Create: src/rl_training/algorithms/crr.py
- Modify: src/rl_training/algorithms/__init__.py
Step 1: Implement the critic update
Add a twin-critic update using the existing MLPSACModel and target network path for offline continuous control.
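A minimal sketch of what a twin-critic update with target networks could look like in PyTorch. It does not use the package's MLPSACModel: QNet, twin_critic_loss, and polyak_update are hypothetical stand-ins, and taking the minimum of the two target critics (clipped double-Q) is an assumption, since the plan only says "twin-critic".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNet(nn.Module):
    """Hypothetical stand-in critic: maps (obs, action) to a scalar Q-value."""

    def __init__(self, obs_dim, act_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))


def twin_critic_loss(q1, q2, target_q1, target_q2, policy_sample, batch, gamma=0.99):
    """Sum of both critics' MSE against one shared TD target.

    Assumption: the target uses the minimum of the two target critics.
    """
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        # Next actions come from the current policy, not from the dataset.
        next_act = policy_sample(next_obs)
        next_q = torch.min(
            target_q1(next_obs, next_act), target_q2(next_obs, next_act)
        )
        target = rew + gamma * (1.0 - done) * next_q
    return F.mse_loss(q1(obs, act), target) + F.mse_loss(q2(obs, act), target)


@torch.no_grad()
def polyak_update(online, target, tau=0.005):
    """Move target parameters a fraction tau toward the online parameters."""
    for p, tp in zip(online.parameters(), target.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```

In a learner, each critic step would compute this loss, step the critic optimizer, and then call polyak_update once per critic/target pair.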
Step 2: Implement the conservative actor regression update
Add policy weighting based on critic-computed advantages over sampled policy actions. Support the minimal package-relevant knobs:
- advantage_type: mean or max
- weight_type: binary or exp
- beta
- n_action_samples
- max_weight
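To make the knobs above concrete, here is a hedged sketch of how they could combine into per-transition regression weights. The function names crr_weights and crr_actor_loss are illustrative, not the planned crr_loss(...) signature, and the default values are assumptions. The number of columns in q_samples plays the role of n_action_samples.

```python
import torch


def crr_weights(q_data, q_samples, advantage_type="mean", weight_type="exp",
                beta=1.0, max_weight=20.0):
    """Per-transition weights for the actor regression.

    q_data:    (B,)   critic value of the dataset action
    q_samples: (B, N) critic values of N actions sampled from the current policy
    """
    # The baseline estimates the state value from the sampled policy actions.
    if advantage_type == "mean":
        baseline = q_samples.mean(dim=1)
    elif advantage_type == "max":
        baseline = q_samples.max(dim=1).values
    else:
        raise ValueError(f"unknown advantage_type: {advantage_type}")
    advantage = q_data - baseline

    if weight_type == "binary":
        # Keep only dataset actions the critic rates above the baseline.
        weights = (advantage > 0).float()
    elif weight_type == "exp":
        # Exponential weights, clipped so a single transition cannot dominate.
        weights = torch.exp(advantage / beta).clamp(max=max_weight)
    else:
        raise ValueError(f"unknown weight_type: {weight_type}")
    # Weights are treated as constants: no gradient flows into the critic.
    return weights.detach()


def crr_actor_loss(log_prob_data, weights):
    """Weighted negative log-likelihood of the dataset actions."""
    return -(weights * log_prob_data).mean()
```

With weight_type set to binary the update reduces to behavior cloning on the transitions whose advantage is positive; with exp, beta sets how sharply the weights favor high-advantage actions.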
Step 3: Expose public loss helpers
Add a readable crr_loss(...) function and export CRR / CRRAlgorithm through the shared algorithms package.
Task 3: Add The Offline CRR Trainer
Files:
- Create: src/rl_training/runtime/crr_trainer.py
- Modify: src/rl_training/experiment/registry.py
Step 1: Reuse the existing offline dataset path
Build the trainer on _infer_env_spaces(...) and _build_offline_dataset(...) from the current offline stack.
Step 2: Preserve the shared control surface
Keep support for:
- eval_interval
- early stopping callbacks
- offline epoch / update budgets
- learning-rate schedules
- checkpoint save / resume
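As a rough illustration of how these controls interact, the sketch below drives a fixed update budget, evaluates every eval_interval updates, consults an early-stopping callback, and returns a step counter that a resumed run can pass back in. OfflineLoopConfig, run_offline_loop, and all field names are hypothetical and do not describe the existing trainer. Learning-rate schedules and checkpoint I/O are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OfflineLoopConfig:
    # Hypothetical field names standing in for the trainer's real options.
    total_updates: int = 1000
    eval_interval: int = 250


def run_offline_loop(
    config: OfflineLoopConfig,
    update_fn: Callable[[int], object],
    eval_fn: Callable[[int], float],
    should_stop: Optional[Callable[[int, float], bool]] = None,
    start_update: int = 0,
) -> int:
    """Run gradient updates until the budget is spent or a callback stops early.

    Returns the number of completed updates, which a resumed run can pass
    back as start_update.
    """
    step = start_update
    while step < config.total_updates:
        update_fn(step)
        step += 1
        # Evaluate on a fixed cadence measured in completed updates.
        if step % config.eval_interval == 0:
            score = eval_fn(step)
            if should_stop is not None and should_stop(step, score):
                break
    return step
```

Counting in completed updates rather than epochs keeps the evaluation cadence identical before and after a resume, provided the checkpoint stores the returned step.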
Step 3: Reuse standard evaluation / prediction
Evaluate with the current continuous-action stochastic-policy evaluation helper and expose prediction through the checkpoint workflow.
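For orientation, an evaluation helper of this kind usually reduces to a rollout loop over the Gymnasium reset/step interface. The sketch below is not the package's helper: evaluate_policy and CountdownEnv are hypothetical, and the stub environment exists only so the example runs without Gymnasium installed.

```python
from statistics import mean


def evaluate_policy(env, act_fn, episodes=5, seed=0):
    """Average undiscounted return of act_fn over a few episodes.

    env is expected to follow the Gymnasium API:
    reset(seed=...) -> (obs, info)
    step(action)    -> (obs, reward, terminated, truncated, info)
    """
    returns = []
    for episode in range(episodes):
        obs, _ = env.reset(seed=seed + episode)
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(act_fn(obs))
            total += float(reward)
            # An episode ends on either a terminal state or a time limit.
            done = terminated or truncated
        returns.append(total)
    return mean(returns)


class CountdownEnv:
    """Stub env for the example: reward 1.0 per step, truncated after 3 steps."""

    def reset(self, seed=None):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, False, self.t >= 3, {}


if __name__ == "__main__":
    print(evaluate_policy(CountdownEnv(), lambda obs: 0.0, episodes=2))
```

For a stochastic policy, act_fn would sample from the action distribution (or take its mean for a deterministic readout) and clip to the Box bounds before stepping.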
Task 4: Wire CRR Into The Package Surface
Files:
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py
- Create: configs/crr/pendulum.yaml
- Create: src/rl_training/assets/configs/crr/pendulum.yaml
Step 1: Add the managed API entrypoint
Expose CRR through the root package and API namespaces.
Step 2: Add starter configs
Ship a packaged offline config using Pendulum-v1 and the current random offline dataset path.
Step 3: Update package docs
Add CRR to the README and roadmap docs as part of the current offline package wave.
Task 5: Add Unexecuted Test Coverage
Files:
- Create: tests/test_crr_update.py
- Create: tests/test_crr_trainer_smoke.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_experiment_manager.py
- Modify: tests/test_checkpoint_workflows.py
- Modify: tests/test_package_smoke.py
Step 1: Add unit coverage
Add a learner test that runs a single CRR update call and checks the returned metric keys.
Step 2: Add trainer smoke coverage
Add a small offline smoke test that checks checkpoint creation and eval wiring.
Step 3: Extend public-surface expectations
Update package export and registry tests so CRR is treated as a first-class shipped algorithm.
Step 4: Keep test execution deferred
Add the tests, but do not execute them until the user explicitly asks.