AWR Phase 14 Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Add a narrow but package-usable AWR implementation for offline continuous-control training by reusing the current actor/value model lane and the package's discounted returns-to-go data processing.
Architecture: Reuse MLPIQLModel for tanh-Gaussian policy plus value regression, and make the packaged AWR path intentionally narrow: value learning from discounted returns-to-go, then behavior cloning weighted by exponentiated return advantages. Reuse the current offline dataset builder, reward processing, checkpoint, eval, predict, schedule, and early-stopping surfaces instead of creating a new trainer family.
Tech Stack: Python, PyTorch, Gymnasium, existing rl_training offline dataset, returns-to-go, and experiment infrastructure
Task 1: Freeze The Narrow AWR Scope
Files:
- Create: docs/plans/2026-03-12-awr-phase14.md
- Modify: README.md
- Modify: docs/plans/2026-03-12-mainstream-rl-package-design.md
- Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
- Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md
Step 1: Freeze v1 boundaries
Document the first packaged AWR release as:
- continuous Box actions only
- flat vector observations only
- offline dataset training only
- single-process trainer only
- discounted returns-to-go computed from the processed reward stream in v1
- actor/value updates only, with no separate Q-critic path in this phase
- no recurrent path
- no image observations
- no distributed runtime
Step 2: Record the package rationale
Explain that AWR is the next low-friction 2019 / 2020 batch-RL wave because:
- it remains a recognizable actor-regression baseline in current offline / imitation libraries
- it reuses the current actor/value mental model instead of requiring a new sequence or world-model stack
- it productizes the package's returns-to-go processing for another shipped algorithm
Step 3: Keep test execution deferred
Record that tests are added but intentionally not executed until the user explicitly requests it.
Task 2: Add The AWR Learner
Files:
- Create: src/rl_training/algorithms/awr.py
- Modify: src/rl_training/algorithms/__init__.py
Step 1: Reuse the actor/value model family
Build AWR on MLPIQLModel and keep the learner narrow by using the policy head plus value head only in v1.
Step 2: Implement package-narrow AWR losses
Keep the package v1 behavior small and explicit:
- regress the value head to discounted returns-to-go
- estimate per-sample advantages as returns_to_go - value(obs)
- optionally normalize those advantages before weighting
- train the actor with exponentiated advantage-weighted behavior log-probabilities
- expose knobs for beta and max_weight
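The loss steps above can be sketched without any framework as follows. This is a minimal illustration under stated assumptions, not the package's awr_loss(...): the signature, the default values, the mean-squared-error value target, the use of beta as a temperature that divides the advantage, and the normalize_advantages and eps names are all hypothetical. A PyTorch learner would operate on tensors and would treat the weights as constants (detached) when backpropagating the actor loss.

```python
import math
from statistics import fmean, pstdev


def awr_loss(log_probs, values, returns_to_go, beta=1.0, max_weight=20.0,
             normalize_advantages=True, eps=1e-8):
    """Return (actor_loss, value_loss) for one batch of per-sample scalars.

    Hypothetical sketch: names, defaults, and the MSE value target are
    assumptions, not the package's actual helper.
    """
    if beta <= 0.0 or max_weight <= 0.0:
        raise ValueError("beta and max_weight must be positive")

    # Value head: mean squared error against discounted returns-to-go.
    value_loss = fmean((v - g) ** 2 for v, g in zip(values, returns_to_go))

    # Per-sample advantage: returns_to_go - value(obs).
    advantages = [g - v for v, g in zip(values, returns_to_go)]
    if normalize_advantages:
        mean, std = fmean(advantages), pstdev(advantages)
        advantages = [(a - mean) / (std + eps) for a in advantages]

    # Weight = min(exp(advantage / beta), max_weight). Clamping the exponent
    # first gives the same result and avoids float overflow in math.exp.
    log_cap = math.log(max_weight)
    weights = [math.exp(min(a / beta, log_cap)) for a in advantages]

    # Actor: weighted negative log-likelihood of the dataset actions.
    actor_loss = -fmean(w * lp for w, lp in zip(weights, log_probs))
    return actor_loss, value_loss
```

In this sketch a smaller beta sharpens the weighting toward high-advantage samples, while max_weight bounds how much any single transition can dominate a batch.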
Step 3: Expose public loss helpers
Add a readable awr_loss(...) helper and export AWR / AWRAlgorithm through the shared algorithms package.
Task 3: Add The Offline AWR Trainer
Files:
- Create: src/rl_training/runtime/awr_trainer.py
- Modify: src/rl_training/experiment/registry.py
Step 1: Reuse offline dataset and returns processing
Build the trainer on _infer_env_spaces(...) and _build_offline_dataset(...) from the current offline path, then derive discounted returns-to-go from the processed reward stream.
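A minimal sketch of the returns-to-go derivation is shown below. It assumes the processed reward stream is a flat sequence with a parallel dones flag marking the last transition of each episode. The function name, the argument layout, and the choice not to bootstrap at truncated episode ends are assumptions for illustration, not the package's existing helper.

```python
def discounted_returns_to_go(rewards, dones, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1}, resetting at episode ends.

    Hypothetical helper: `dones[t]` is True on the final transition of an
    episode, so no reward from the next episode leaks backwards.
    """
    if len(rewards) != len(dones):
        raise ValueError("rewards and dones must have the same length")
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the stream backwards so each step reuses the return after it.
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # nothing follows the last step of an episode
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, with rewards [1, 1, 1, 2, 2], episode ends after the third and fifth transitions, and gamma=0.5, the sketch yields [1.75, 1.5, 1.0, 3.0, 2.0]. A dataset that stops mid-episode is treated here as if the episode ended at the last recorded step.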
Step 2: Preserve shared controls
Keep support for:
- eval_interval
- early stopping callbacks
- offline epoch / update budgets
- learning-rate schedules
- checkpoint save / resume
Step 3: Reuse standard evaluation and prediction
Evaluate and predict through the current deterministic actor helper path already used by IQL-family algorithms.
Task 4: Wire AWR Into The Package Surface
Files:
- Create: configs/awr/pendulum.yaml
- Create: src/rl_training/assets/configs/awr/pendulum.yaml
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py
- Modify: README.md
Step 1: Add the managed API entrypoint
Expose AWR through the root package and API namespaces.
Step 2: Ship a starter config
Add a packaged offline Pendulum-v1 config with narrow AWR defaults.
Step 3: Update package docs
Document AWR as a return-weighted offline actor/value baseline on the same dataset path.
Task 5: Add Unexecuted Coverage
Files:
- Create: tests/test_awr_update.py
- Create: tests/test_awr_trainer_smoke.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_experiment_manager.py
- Modify: tests/test_checkpoint_workflows.py
- Modify: tests/test_package_smoke.py
- Modify: tests/test_cli.py
Step 1: Add learner-level coverage
Add unit tests that cover awr_loss(...), rejection of invalid beta / max_weight values, and a single update call.
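One possible shape for the invalid-hyperparameter check is sketched below. validate_awr_hparams is a hypothetical stand-in for whatever validation the learner ends up performing, and the rule it enforces (both knobs positive and finite) is an assumption. The test is written in a pytest-discoverable style but uses only the standard library.

```python
import math


def validate_awr_hparams(beta, max_weight):
    """Hypothetical validator: both knobs must be positive and finite."""
    for name, value in (("beta", beta), ("max_weight", max_weight)):
        if not math.isfinite(value) or value <= 0.0:
            raise ValueError(f"{name} must be positive and finite, got {value!r}")


def test_invalid_awr_hparams_are_rejected():
    bad_pairs = [
        (0.0, 20.0),           # zero temperature
        (-1.0, 20.0),          # negative temperature
        (1.0, 0.0),            # weight cap of zero
        (float("nan"), 20.0),  # non-finite temperature
    ]
    for beta, max_weight in bad_pairs:
        try:
            validate_awr_hparams(beta, max_weight)
        except ValueError:
            continue
        raise AssertionError(f"accepted beta={beta}, max_weight={max_weight}")
```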
Step 2: Add trainer smoke coverage
Add a small offline smoke test that checks checkpoint creation, returns-to-go metrics, and eval wiring.
Step 3: Extend package-surface expectations
Update public exports, managed API, checkpoint workflow, and packaged-config tests so AWR is treated as a shipped algorithm.
Step 4: Keep execution deferred
Do not run the tests until the user explicitly asks for test execution.