CRR Phase 8 Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a narrow but package-usable CRR implementation for offline continuous-control training on top of the existing dataset, checkpoint, and managed API surfaces.

Architecture: Reuse the current offline training shape already used by AWAC, CQL, and IQL instead of introducing another runtime family. Implement CRR with the shared MLPSACModel, target critics, offline dataset sampling, and standard package evaluation / prediction wiring.

Tech Stack: Python, PyTorch, Gymnasium, existing rl_training offline dataset and experiment infrastructure


Task 1: Lock The Narrow CRR Scope

Files:

  • Create: docs/plans/2026-03-12-crr-phase8.md
  • Modify: README.md
  • Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
  • Modify: docs/plans/2026-03-12-mainstream-rl-package-design.md

Step 1: Freeze the first-release boundaries

Document the v1 constraints:

  • continuous Box actions only
  • flat vector observations only
  • offline dataset training only
  • no sequence model path
  • no distributed runtime
  • no image observations in v1

Step 2: Explain why CRR is the next low-friction wave

Record that CRR is a practical follow-on for two reasons: current offline RL libraries still ship it, and it reuses the existing AWAC/CQL/IQL infrastructure instead of requiring a new runtime.

Step 3: Keep verification deferred

Document that test execution remains intentionally deferred until the user explicitly requests it.

Task 2: Add The CRR Learner

Files:

  • Create: src/rl_training/algorithms/crr.py
  • Modify: src/rl_training/algorithms/__init__.py

Step 1: Implement the critic update

Add a twin-critic update using the existing MLPSACModel and target network path for offline continuous control.
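
As a rough illustration of what a twin-critic update with target networks usually involves, the sketch below computes a TD target from the minimum of two target critics, a summed MSE loss, and a Polyak target update. It does not use MLPSACModel; the function names, the min-over-critics target, and the default gamma and tau values are assumptions made for this example, not confirmed details of the package.

```python
import torch
import torch.nn.functional as F


def twin_critic_td_target(
    reward: torch.Tensor,   # shape (batch,)
    done: torch.Tensor,     # shape (batch,), 1.0 where the episode terminated
    next_q1: torch.Tensor,  # target critic 1 at (s', a'), a' sampled from the policy
    next_q2: torch.Tensor,  # target critic 2 at (s', a')
    gamma: float = 0.99,    # assumed discount factor
) -> torch.Tensor:
    """TD target using the smaller of the two target-critic estimates."""
    with torch.no_grad():
        next_q = torch.min(next_q1, next_q2)
        return reward + gamma * (1.0 - done) * next_q


def twin_critic_loss(q1: torch.Tensor, q2: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sum of the two critics' mean-squared TD errors."""
    return F.mse_loss(q1, target) + F.mse_loss(q2, target)


def soft_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = 0.005) -> None:
    """Polyak-average the online parameters into the target network."""
    with torch.no_grad():
        for t, o in zip(target_net.parameters(), online_net.parameters()):
            t.mul_(1.0 - tau).add_(o, alpha=tau)
```

In an offline setting the batch comes from the fixed dataset, while the next action a' is sampled from the current policy.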

Step 2: Implement the conservative actor regression update

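Weight the policy's log-likelihood of each dataset action by a critic-computed advantage: the critic value of the dataset action minus a baseline taken over actions sampled from the current policy. Support the minimal package-relevant knobs: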

  • advantage_type: mean or max
  • weight_type: binary or exp
  • beta
  • n_action_samples
  • max_weight
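
To make the knobs above concrete, here is a minimal sketch of how the advantage and the policy weight could be computed from critic outputs. The function names and default values are assumptions for illustration; only the knob names come from the plan.

```python
import torch


def crr_advantage(
    q_data: torch.Tensor,    # shape (batch,): Q(s, a) for dataset actions
    q_policy: torch.Tensor,  # shape (batch, n_action_samples): Q(s, a_j), a_j ~ policy
    advantage_type: str = "mean",
) -> torch.Tensor:
    """Advantage of the dataset action over a baseline from sampled policy actions."""
    if advantage_type == "mean":
        baseline = q_policy.mean(dim=1)
    elif advantage_type == "max":
        baseline = q_policy.max(dim=1).values
    else:
        raise ValueError(f"unknown advantage_type: {advantage_type!r}")
    return q_data - baseline


def crr_weight(
    advantage: torch.Tensor,
    weight_type: str = "exp",
    beta: float = 1.0,         # assumed default temperature
    max_weight: float = 20.0,  # assumed default clip value
) -> torch.Tensor:
    """Per-sample weight for the policy's log-likelihood term."""
    if weight_type == "binary":
        # Keep only dataset actions the critic rates above the baseline.
        return (advantage > 0).to(advantage.dtype)
    if weight_type == "exp":
        # Exponential weighting, clipped so a few samples cannot dominate.
        return torch.exp(advantage / beta).clamp(max=max_weight)
    raise ValueError(f"unknown weight_type: {weight_type!r}")
```

With weight_type set to binary the update acts as filtered behavior cloning, while exp keeps every sample but emphasizes high-advantage actions; beta only affects the exp variant.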

Step 3: Expose public loss helpers

Add a readable crr_loss(...) function and export CRR / CRRAlgorithm through the shared algorithms package.
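
The plan leaves the signature of crr_loss(...) open, so the sketch below uses a deliberately different, hypothetical name and only shows the general shape such a helper might take: a weighted negative log-likelihood over dataset actions, returned together with a small metrics dict. The metric keys are assumptions as well.

```python
import torch


def crr_actor_loss_sketch(
    log_prob: torch.Tensor,  # shape (batch,): log pi(a | s) for dataset actions
    weight: torch.Tensor,    # shape (batch,): advantage-derived weights
) -> tuple[torch.Tensor, dict[str, float]]:
    """Weighted negative log-likelihood; hypothetical stand-in for a loss helper."""
    # Weights act as constants: no gradient should flow into the critic here.
    weight = weight.detach()
    loss = -(weight * log_prob).mean()
    metrics = {
        "actor_loss": loss.item(),           # hypothetical metric key
        "weight_mean": weight.mean().item(), # hypothetical metric key
    }
    return loss, metrics
```

Returning metrics alongside the loss makes it straightforward for a unit test to assert on metric keys after a single update call.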

Task 3: Add The Offline CRR Trainer

Files:

  • Create: src/rl_training/runtime/crr_trainer.py
  • Modify: src/rl_training/experiment/registry.py

Step 1: Reuse the existing offline dataset path

Build the trainer on _infer_env_spaces(...) and _build_offline_dataset(...) from the current offline stack.

Step 2: Preserve the shared control surface

Keep support for:

  • eval_interval
  • early stopping callbacks
  • offline epoch / update budgets
  • learning-rate schedules
  • checkpoint save / resume
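
As an illustration of how eval_interval, an update budget, and an early stopping callback can interact, here is a self-contained sketch. PatienceEarlyStopping and run_updates are hypothetical names invented for this example; the package's actual callback interface may differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PatienceEarlyStopping:
    """Stop after `patience` consecutive evaluations without improvement."""

    patience: int = 3
    min_delta: float = 0.0
    best: float = float("-inf")
    bad_evals: int = 0

    def __call__(self, eval_return: float) -> bool:
        """Return True when training should stop."""
        if eval_return > self.best + self.min_delta:
            self.best = eval_return
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def run_updates(
    total_updates: int,
    eval_interval: int,
    evaluate: Callable[[int], float],
    should_stop: Callable[[float], bool],
) -> int:
    """Run up to `total_updates` steps and return the step at which training ended."""
    for step in range(1, total_updates + 1):
        # One gradient update on an offline batch would happen here.
        if step % eval_interval == 0 and should_stop(evaluate(step)):
            return step
    return total_updates
```

A call such as run_updates(100, 10, evaluate, PatienceEarlyStopping(patience=2)) evaluates every 10 updates and stops once two consecutive evaluations fail to beat the best return seen so far.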

Step 3: Reuse standard evaluation / prediction

Evaluate with the current continuous-action stochastic-policy evaluation helper and expose prediction through the checkpoint workflow.

Task 4: Wire CRR Into The Package Surface

Files: - Modify: src/rl_training/api/algorithms.py - Modify: src/rl_training/api/__init__.py - Modify: src/rl_training/__init__.py - Create: configs/crr/pendulum.yaml - Create: src/rl_training/assets/configs/crr/pendulum.yaml

Step 1: Add the managed API entrypoint

Expose CRR through the root package and API namespaces.

Step 2: Add starter configs

Ship a packaged offline config using Pendulum-v1 and the current random offline dataset path.

Step 3: Update package docs

Add CRR to the README and roadmap docs as part of the current offline package wave.

Task 5: Add Unexecuted Test Coverage

Files:

  • Create: tests/test_crr_update.py
  • Create: tests/test_crr_trainer_smoke.py
  • Modify: tests/test_package_api_exports.py
  • Modify: tests/test_public_api.py
  • Modify: tests/test_experiment_manager.py
  • Modify: tests/test_checkpoint_workflows.py
  • Modify: tests/test_package_smoke.py

Step 1: Add unit coverage

Add a learner test that runs one update call and checks the returned CRR metric keys.

Step 2: Add trainer smoke coverage

Add a small offline smoke test that checks checkpoint creation and eval wiring.

Step 3: Extend public-surface expectations

Update package export and registry tests so CRR is treated as a first-class shipped algorithm.

Step 4: Keep test execution deferred

Add the tests, but do not run them until the user explicitly asks.