
MOPO V1 Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a narrow but honest MOPO baseline for offline vector-observation, continuous-action control by combining a learned dynamics ensemble, uncertainty-penalized synthetic rollouts, and an SAC-style policy learner.

Architecture: Keep the release deliberately small and aligned with the current offline RL package shape. Reuse the existing offline dataset builder, MLPSACModel, evaluation path, checkpointing, and managed API, but add a dedicated ensemble dynamics model plus a trainer that first fits the dynamics model on the offline dataset and then alternates between regenerating a synthetic replay buffer from real states and training the actor-critic on mixed real/synthetic transitions. Explicitly do not implement image observations, discrete actions, learned terminal prediction, online environment collection, or the full paper-scale MOPO training stack in this batch.
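The alternation described above (periodic synthetic-buffer refresh, then actor-critic updates on mixed batches) could be sketched as follows. This is a minimal illustration, not the planned trainer: `training_schedule`, `sample_mixed_batch`, `refresh_every`, and `real_ratio` are hypothetical names that do not appear in this plan.

```python
import random


def training_schedule(total_steps, refresh_every):
    """Yield (step, refresh) pairs; refresh is True on steps where the
    synthetic replay buffer would be regenerated from real states."""
    for step in range(total_steps):
        yield step, step % refresh_every == 0


def sample_mixed_batch(real, synthetic, batch_size, real_ratio, rng):
    """Draw a batch with about `real_ratio` real transitions and the rest
    synthetic. Falls back to real data only if the synthetic buffer is empty."""
    n_real = batch_size if not synthetic else round(batch_size * real_ratio)
    batch = [rng.choice(real) for _ in range(n_real)]
    batch += [rng.choice(synthetic) for _ in range(batch_size - n_real)]
    return batch
```

Sampling with replacement keeps the sketch short; a real buffer would more likely sample tensor indices.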

Tech Stack: Python 3.10, PyTorch, Gymnasium, pytest, existing rl_training offline dataset and continuous-control infrastructure.


Task 1: Add failing MOPO coverage

Files:
- Create: tests/test_mopo_dynamics_model.py
- Create: tests/test_mopo_update.py
- Create: tests/test_mopo_trainer_smoke.py
- Create: tests/test_mopo_reference_script.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_checkpoint_workflows.py
- Modify: tests/test_experiment_manager.py
- Modify: tests/test_cli.py
- Modify: tests/test_package_smoke.py

Step 1: Write the failing tests
- The ensemble dynamics model predicts means/logvars for reward plus next-state delta and can sample synthetic transitions.
- mopo_model_loss(...) returns named dynamics metrics, and MOPO.update_model(...) / MOPO.update(...) return the expected metrics.
- train_mopo() writes a checkpoint and evaluation metrics on Pendulum-v1 with random offline data.
- The root, api, and algorithms package exports include MOPO.
- Checkpoint workflows can evaluate and resume a saved mopo checkpoint.
- The packaged config resolves outside the repo root, and the reference script runs as a smoke command.

Step 2: Run the tests to verify they fail
Run: pytest -q tests/test_mopo_dynamics_model.py tests/test_mopo_update.py tests/test_mopo_trainer_smoke.py tests/test_mopo_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py tests/test_cli.py tests/test_package_smoke.py
Expected: FAIL with missing mopo modules / exports.

Task 2: Implement ensemble dynamics model

Files:
- Create: src/rl_training/models/mlp_mopo.py
- Modify: src/rl_training/models/__init__.py

Step 1: Write minimal implementation
- Add a compact MLP ensemble that takes (obs, action) and predicts Gaussian parameters for (delta_obs, reward).
- Expose a method to sample synthetic transitions and compute ensemble disagreement.
- Keep the scope explicit: vector observations only, continuous actions only, no learned terminal model.

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_mopo_dynamics_model.py
Expected: PASS.
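As a rough illustration of the ensemble described in Task 2, Step 1, the sketch below shows one way such a model could look in PyTorch. It is not the contents of mlp_mopo.py: the class name `EnsembleDynamics`, the layer sizes, the log-variance clamp range, and the disagreement measure (largest distance of any member's mean from the ensemble mean) are all assumptions made for the example.

```python
import torch
from torch import nn


class EnsembleDynamics(nn.Module):
    """Hypothetical MLP ensemble predicting Gaussian parameters for
    (delta_obs, reward) from (obs, action). Vector observations only."""

    def __init__(self, obs_dim, action_dim, ensemble_size=5, hidden_dim=64):
        super().__init__()
        self.obs_dim = obs_dim
        out_dim = obs_dim + 1  # next-state delta plus scalar reward
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim + action_dim, hidden_dim),
                nn.SiLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.SiLU(),
                nn.Linear(hidden_dim, 2 * out_dim),  # means and log-variances
            )
            for _ in range(ensemble_size)
        ])

    def forward(self, obs, action):
        x = torch.cat([obs, action], dim=-1)
        out = torch.stack([member(x) for member in self.members])  # (E, B, 2 * out_dim)
        mean, logvar = out.chunk(2, dim=-1)
        # Assumed clamp range, used only to keep variances numerically sane.
        return mean, logvar.clamp(-10.0, 2.0)

    @torch.no_grad()
    def sample(self, obs, action):
        """Sample one synthetic transition per input row from a random member."""
        mean, logvar = self(obs, action)
        num_members, batch_size = mean.shape[0], mean.shape[1]
        member_idx = torch.randint(num_members, (batch_size,))
        rows = torch.arange(batch_size)
        m, lv = mean[member_idx, rows], logvar[member_idx, rows]
        pred = m + torch.randn_like(m) * (0.5 * lv).exp()
        next_obs = obs + pred[:, : self.obs_dim]
        reward = pred[:, self.obs_dim]
        # Assumed disagreement measure: max distance from the ensemble mean.
        disagreement = (mean - mean.mean(dim=0)).norm(dim=-1).max(dim=0).values
        return next_obs, reward, disagreement
```

No terminal flag is returned, matching the stated scope of no learned terminal model. The disagreement measure is only one option, and the actual implementation may use a different uncertainty estimate.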

Task 3: Implement learner and trainer

Files:
- Create: src/rl_training/algorithms/mopo.py
- Create: src/rl_training/runtime/mopo_trainer.py
- Modify: src/rl_training/experiment/registry.py
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py
- Modify: src/rl_training/algorithms/__init__.py

Step 1: Write minimal implementation
- Implement MOPO as a composite algorithm that owns a policy learner and a dynamics ensemble.
- The dynamics update uses Gaussian NLL against reward and next-state deltas.
- The policy update reuses the current SAC learner on mixed real/synthetic transition batches.
- The trainer first pretrains the dynamics model, then periodically refreshes a synthetic replay buffer from real states using the current actor and uncertainty-penalized model rollouts.
- Support train / eval / resume / predict through registry wiring.

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_mopo_update.py tests/test_mopo_trainer_smoke.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py
Expected: PASS.
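The two numerical pieces named in Task 3, Step 1 (the Gaussian NLL dynamics loss and the uncertainty penalty on synthetic rewards) are small enough to sketch directly. The function names `gaussian_nll` and `penalized_reward` and the parameter `penalty_coef` are hypothetical; the plan's actual mopo_model_loss(...) may differ in reduction, constants, and returned metrics.

```python
import torch


def gaussian_nll(mean, logvar, target):
    """Diagonal-Gaussian negative log-likelihood, dropping the constant term
    and averaging over every element (ensemble, batch, and output dims).
    `target` of shape (B, D) broadcasts against `mean` of shape (E, B, D)."""
    inv_var = torch.exp(-logvar)
    return 0.5 * ((mean - target) ** 2 * inv_var + logvar).mean()


def penalized_reward(reward, uncertainty, penalty_coef):
    """Subtract a scaled uncertainty estimate from the model-predicted reward,
    so the policy is discouraged from exploiting poorly modeled regions."""
    return reward - penalty_coef * uncertainty
```

Here the target for the dynamics loss would be the concatenation of (next_obs - obs) and reward from the offline dataset, and `uncertainty` would come from the ensemble, for example its disagreement on the sampled state-action pair.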

Task 4: Add config, example, and docs

Files:
- Create: configs/mopo/pendulum.yaml
- Create: src/rl_training/assets/configs/mopo/pendulum.yaml
- Create: examples/mopo_pendulum_reference.py
- Modify: README.md
- Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md

Step 1: Write minimal implementation
- Add a runnable mopo Pendulum preset using random offline data.
- Add a reference script for a tiny offline run.
- Update the README and yearly sourcebook to mark MOPO as an implemented, narrow v1 and keep the scope explicit.

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_mopo_reference_script.py tests/test_cli.py tests/test_package_smoke.py
Expected: PASS.

Task 5: Regression verification

Files:
- Modify only if verification reveals regressions.

Step 1: Run focused regression coverage
Run: pytest -q tests/test_mopo_dynamics_model.py tests/test_mopo_update.py tests/test_mopo_trainer_smoke.py tests/test_mopo_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py tests/test_cli.py tests/test_package_smoke.py tests/test_sac_update.py tests/test_cql_trainer_smoke.py tests/test_rlpd_trainer_smoke.py
Expected: PASS.

Step 2: Run full suite
Run: pytest -q
Expected: PASS.