DrQ-v2 Phase 7 Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Add a narrow but package-usable DrQ-v2 implementation for pixel observations and continuous-action control.
Architecture: Reuse the existing single-process off-policy runtime shape instead of introducing a new actor-learner system. Implement DrQ-v2 as a CNN-backed actor-critic learner with replay-buffer training and random-crop image augmentation, then wire it through the shared registry, CLI, checkpoint, and packaged-config surfaces.
Tech Stack: Python, PyTorch, Gymnasium, existing rl_training runtime / registry / checkpoint infrastructure
Task 1: Phase Scope And Documentation Lock-In
Files:
- Create: docs/plans/2026-03-12-drqv2-phase7.md
- Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
- Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md
- Modify: README.md
Step 1: Document the first-release scope
Write down the non-negotiable v1 boundaries:
- image observations only
- continuous Box actions only
- single-process replay-buffer training only
- random-crop augmentation only
- no distributed collectors
- no world-model features
- no claim of paper-perfect reproduction
Step 2: Record why this wave is next
Explain that DrQ-v2 is the next practical follow-on after TRPO, Discrete SAC, and CrossQ because it expands the package's image-based continuous-control lane without requiring a distributed runtime redesign.
Step 3: Update user-facing docs
Add DrQ-v2 to the README algorithm surface, starter commands, and roadmap ordering. Keep the wording explicit that tests are intentionally not executed until the user requests verification.
Task 2: CNN-Backed DrQ-v2 Model
Files:
- Create: src/rl_training/models/cnn/drqv2.py
- Modify: src/rl_training/models/cnn/__init__.py
- Modify: src/rl_training/models/__init__.py
Step 1: Add the shared model types
Implement a CNN-backed model that supports:
- NatureCNN feature extraction for channel-first images
- tanh-bounded continuous actor outputs
- twin Q critics over encoded observations and continuous actions
- separate actor / critic parameter iterators
- deterministic and stochastic action sampling helpers
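As a rough illustration of the model shape listed above, the following sketch wires a NatureCNN-style encoder to a tanh-bounded actor and twin Q heads. The class and method names (PixelEncoder, SketchDrQv2Model, act, q_values, actor_parameters, critic_parameters) are hypothetical and are not the package's actual interface. The layer sizes and the fixed exploration std are assumptions, and feeding the actor detached features (so the encoder is trained by the critic loss only) is one common choice rather than a requirement of this plan.

```python
import torch
from torch import nn


class PixelEncoder(nn.Module):
    """NatureCNN-style encoder for channel-first uint8 images (illustrative)."""

    def __init__(self, obs_shape, feature_dim=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with a dummy forward pass.
        with torch.no_grad():
            flat_dim = self.convs(torch.zeros(1, *obs_shape)).shape[1]
        self.head = nn.Sequential(nn.Linear(flat_dim, feature_dim), nn.LayerNorm(feature_dim), nn.Tanh())

    def forward(self, obs):
        # Scale uint8 pixels into [0, 1] before the convolutions.
        return self.head(self.convs(obs.float() / 255.0))


class SketchDrQv2Model(nn.Module):
    """Hypothetical model: shared encoder, tanh actor, twin Q critics."""

    def __init__(self, obs_shape, action_dim, feature_dim=64, hidden_dim=128):
        super().__init__()
        self.encoder = PixelEncoder(obs_shape, feature_dim)
        self.actor = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, action_dim)
        )

        def q_head():
            return nn.Sequential(
                nn.Linear(feature_dim + action_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
            )

        self.q1, self.q2 = q_head(), q_head()

    def act(self, obs, std=0.2, deterministic=False):
        # Assumption: the actor sees detached features, so actor gradients never reach the encoder.
        mean = torch.tanh(self.actor(self.encoder(obs).detach()))
        if deterministic:
            return mean
        # Stochastic helper: Gaussian noise around the mean, clamped back into [-1, 1].
        return (mean + torch.randn_like(mean) * std).clamp(-1.0, 1.0)

    def q_values(self, obs, action):
        x = torch.cat([self.encoder(obs), action], dim=-1)
        return self.q1(x), self.q2(x)

    def actor_parameters(self):
        return self.actor.parameters()

    def critic_parameters(self):
        # The encoder is grouped with the critics in this sketch.
        yield from self.encoder.parameters()
        yield from self.q1.parameters()
        yield from self.q2.parameters()
```

Separate parameter iterators let the learner build one optimizer per group, for example torch.optim.Adam(model.actor_parameters()) and torch.optim.Adam(model.critic_parameters()).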
Step 2: Keep the interface package-friendly
Mirror the helper shape already used by MLPSACModel, MLPCrossQModel, and MLPDDPGModel so the trainer and registry layers can stay simple.
Step 3: Export the new symbols
Expose the new sample/model names through rl_training.models.cnn, rl_training.models, and any module-contract expectations.
Task 3: DrQ-v2 Learner And Trainer
Files:
- Create: src/rl_training/algorithms/drqv2.py
- Create: src/rl_training/runtime/drqv2_trainer.py
- Modify: src/rl_training/data/replay_buffer.py
- Modify: src/rl_training/runtime/__init__.py
Step 1: Extend replay-buffer support for image pipelines if needed
Ensure the replay buffer can store channel-first image observations in uint8 while still returning tensors that the learner can consume cleanly on-device.
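One way to meet this requirement is sketched below: a small ring buffer that keeps images as uint8 in NumPy arrays and only converts the sampled batch to tensors on the requested device. PixelReplayBuffer and its method names are hypothetical and do not describe the existing replay_buffer.py. The batch dictionary keys are also assumptions.

```python
import numpy as np
import torch


class PixelReplayBuffer:
    """Hypothetical ring buffer that stores channel-first images as uint8."""

    def __init__(self, capacity, obs_shape, action_dim):
        self.capacity = capacity
        self.obs = np.zeros((capacity, *obs_shape), dtype=np.uint8)
        self.next_obs = np.zeros((capacity, *obs_shape), dtype=np.uint8)
        self.actions = np.zeros((capacity, action_dim), dtype=np.float32)
        self.rewards = np.zeros((capacity, 1), dtype=np.float32)
        self.dones = np.zeros((capacity, 1), dtype=np.float32)
        self.index = 0
        self.size = 0

    def add(self, obs, action, reward, next_obs, done):
        i = self.index
        self.obs[i] = obs
        self.next_obs[i] = next_obs
        self.actions[i] = action
        self.rewards[i] = reward
        self.dones[i] = float(done)
        # Overwrite the oldest entry once the buffer is full.
        self.index = (i + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, device="cpu", rng=None):
        if rng is None:
            rng = np.random.default_rng()
        idx = rng.integers(0, self.size, size=batch_size)

        def to_tensor(array):
            return torch.as_tensor(array[idx], device=device)

        # Images stay uint8 here; the learner or encoder casts to float later.
        return {
            "obs": to_tensor(self.obs),
            "actions": to_tensor(self.actions),
            "rewards": to_tensor(self.rewards),
            "next_obs": to_tensor(self.next_obs),
            "dones": to_tensor(self.dones),
        }
```

Keeping pixels in uint8 uses a quarter of the memory of float32 storage, which matters because this layout stores both obs and next_obs for every transition.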
Step 2: Implement the learner
Add a readable DrQ-v2 learner that includes:
- twin critic TD targets
- delayed actor updates
- target-network soft updates
- random-crop augmentation on current and next observations
- metrics for critic loss, actor loss, target-Q mean, and update counts
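The learner pieces listed above can be sketched as a few small functions. This is a simplified illustration under stated assumptions: the crop uses replicate padding with integer offsets, the TD target is a one-step clipped double-Q target, and the function names (random_crop, td_target, soft_update, should_update_actor) as well as the default pad, gamma, tau, and delay values are hypothetical.

```python
import torch
import torch.nn.functional as F


def random_crop(obs, pad=4):
    """Pad each image by `pad` pixels (replicate) and crop back at a random offset."""
    n, _, h, w = obs.shape
    padded = F.pad(obs.float(), (pad, pad, pad, pad), mode="replicate")
    tops = torch.randint(0, 2 * pad + 1, (n,)).tolist()
    lefts = torch.randint(0, 2 * pad + 1, (n,)).tolist()
    # Each sample in the batch gets its own offset.
    crops = [padded[i, :, t:t + h, l:l + w] for i, (t, l) in enumerate(zip(tops, lefts))]
    return torch.stack(crops)


def td_target(reward, done, next_q1, next_q2, gamma=0.99):
    """One-step target: r + gamma * (1 - done) * min(Q1', Q2')."""
    return reward + gamma * (1.0 - done) * torch.min(next_q1, next_q2)


@torch.no_grad()
def soft_update(target, online, tau=0.01):
    """Polyak-average online parameters into the target network in place."""
    for target_param, param in zip(target.parameters(), online.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)


def should_update_actor(update_count, delay=2):
    """Delayed actor updates: run the actor step every `delay` critic updates."""
    return update_count % delay == 0
```

In a full update, random_crop would be applied independently to the current and next observation batches before encoding, and the target would be computed under torch.no_grad() so no gradient flows through the target critics.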
Step 3: Implement the trainer
Build a trainer that reuses the current replay-buffer control flow:
- infer image observation shape and continuous action shape
- scale normalized actor outputs into env action bounds
- support checkpoint load/save
- evaluate deterministically through the standard workflow
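The action-scaling step above is a simple affine map from the actor's [-1, 1] range to the environment's Box bounds. A minimal sketch, with scale_action as a hypothetical helper name:

```python
import numpy as np


def scale_action(normalized, low, high):
    """Map an action in [-1, 1] to [low, high] elementwise."""
    normalized = np.clip(np.asarray(normalized, dtype=np.float32), -1.0, 1.0)
    low = np.asarray(low, dtype=np.float32)
    high = np.asarray(high, dtype=np.float32)
    return low + 0.5 * (normalized + 1.0) * (high - low)
```

For example, with bounds of [-2, 2], a normalized action of 0.5 maps to 1.0, and out-of-range inputs are clipped before scaling.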
Step 4: Keep verification deferred
Add code and metrics paths only. Do not execute tests or smoke runs unless the user explicitly asks.
Task 4: Registry, API, And Packaged Config Surface
Files:
- Modify: src/rl_training/algorithms/__init__.py
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py
- Modify: src/rl_training/experiment/registry.py
- Modify: src/rl_training/runtime/workflows.py
- Create: configs/drqv2/pendulum_pixels.yaml
- Create: src/rl_training/assets/configs/drqv2/pendulum_pixels.yaml
Step 1: Add managed package entrypoints
Expose DrQv2 on the same public surface as the other algorithms so users can call it through the root package, the managed API, the registry, CLI config loading, checkpoint evaluation, and prediction.
Step 2: Add starter configs
Ship a starter pixel-control config using a continuous-control environment that can render RGB frames, with wrappers or env kwargs documented explicitly.
Step 3: Preserve the current package split
Keep DrQ-v2 in the core surface for now, while continuing to defer distributed actor-learner families and world models to later roadmap waves.
Task 5: Unexecuted Test Coverage
Files:
- Create: tests/test_drqv2_update.py
- Create: tests/test_drqv2_trainer_smoke.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_experiment_manager.py
- Modify: tests/test_cli.py
- Modify: tests/test_checkpoint_workflows.py
- Modify: tests/test_package_smoke.py
- Modify: tests/test_module_contracts.py
Step 1: Add unit coverage
Add a narrow learner test that validates tensor shapes, metric keys, and one update call for the new algorithm.
Step 2: Add a smoke trainer test
Register a tiny image continuous-control test environment and add a smoke test that verifies checkpoint creation and evaluation wiring for train_drqv2(...).
Step 3: Update public-surface tests
Extend the package export and experiment-manager expectations so DrQ-v2 is treated like the rest of the shipped algorithm surface.
Step 4: Leave execution to a later verification pass
Keep all test execution commands deferred until the user explicitly asks for verification.