
DrQ-v2 Phase 7 Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a narrow but package-usable DrQ-v2 implementation for pixel observations and continuous-action control.

Architecture: Reuse the existing single-process off-policy runtime shape instead of introducing a new actor-learner system. Implement DrQ-v2 as a CNN-backed actor-critic learner with replay-buffer training and random-crop image augmentation, then wire it through the shared registry, CLI, checkpoint, and packaged-config surfaces.

Tech Stack: Python, PyTorch, Gymnasium, existing rl_training runtime / registry / checkpoint infrastructure


Task 1: Phase Scope And Documentation Lock-In

Files:

  • Create: docs/plans/2026-03-12-drqv2-phase7.md
  • Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
  • Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md
  • Modify: README.md

Step 1: Document the first-release scope

Write down the non-negotiable v1 boundaries:

  • image observations only
  • continuous Box actions only
  • single-process replay-buffer training only
  • random-crop augmentation only
  • no distributed collectors
  • no world-model features
  • no claim of paper-perfect reproduction

Step 2: Record why this wave is next

Explain that DrQ-v2 is the next practical follow-on after TRPO, Discrete SAC, and CrossQ because it expands the package's image-based continuous-control lane without requiring a distributed runtime redesign.

Step 3: Update user-facing docs

Add DrQ-v2 to the README algorithm surface, starter commands, and roadmap ordering. State explicitly that tests are intentionally not executed until the user requests verification.

Task 2: CNN-Backed DrQ-v2 Model

Files:

  • Create: src/rl_training/models/cnn/drqv2.py
  • Modify: src/rl_training/models/cnn/__init__.py
  • Modify: src/rl_training/models/__init__.py

Step 1: Add the shared model types

Implement a CNN-backed model that supports:

  • NatureCNN feature extraction for channel-first images
  • tanh-bounded continuous actor outputs
  • twin Q critics over encoded observations and continuous actions
  • separate actor / critic parameter iterators
  • deterministic and stochastic action sampling helpers
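The model surface listed above can be sketched roughly as follows. This is a minimal illustration, not the package's actual module: the class name DrQv2ModelSketch, the method names, the layer sizes, and the Gaussian exploration noise are all assumptions. It also assumes that the encoder is trained through the critic loss only, so the actor sees detached features.

```python
import itertools

import torch
from torch import nn


class NatureCNN(nn.Module):
    """Nature-DQN style encoder for channel-first images with 0..255 pixel values."""

    def __init__(self, obs_shape, feature_dim=512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened size from a dummy frame
            flat_dim = self.convs(torch.zeros(1, *obs_shape)).shape[1]
        self.head = nn.Sequential(nn.Linear(flat_dim, feature_dim), nn.ReLU())

    def forward(self, obs):
        return self.head(self.convs(obs.float() / 255.0))


class DrQv2ModelSketch(nn.Module):
    """Hypothetical CNN-backed actor plus twin Q critics."""

    def __init__(self, obs_shape, action_dim, feature_dim=512, hidden_dim=256):
        super().__init__()
        self.encoder = NatureCNN(obs_shape, feature_dim)
        self.actor = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )
        self.q1 = self._q_head(feature_dim + action_dim, hidden_dim)
        self.q2 = self._q_head(feature_dim + action_dim, hidden_dim)

    @staticmethod
    def _q_head(in_dim, hidden_dim):
        return nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def actor_parameters(self):
        return self.actor.parameters()

    def critic_parameters(self):
        # Assumption: the encoder belongs to the critic's parameter group.
        return itertools.chain(
            self.encoder.parameters(), self.q1.parameters(), self.q2.parameters()
        )

    def act(self, obs, std=0.0):
        # Detach features so actor gradients never reach the encoder.
        mean = torch.tanh(self.actor(self.encoder(obs).detach()))
        if std == 0.0:
            return mean  # deterministic action in [-1, 1]
        return (mean + std * torch.randn_like(mean)).clamp(-1.0, 1.0)

    def q_values(self, obs, action):
        x = torch.cat([self.encoder(obs), action], dim=-1)
        return self.q1(x), self.q2(x)
```

Keeping actor_parameters() and critic_parameters() disjoint lets a learner build two optimizers without any name filtering. Note that the three-layer convolution stack needs frames of at least 36x36 pixels.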

Step 2: Keep the interface package-friendly

Mirror the helper shape already used by MLPSACModel, MLPCrossQModel, and MLPDDPGModel so the trainer and registry layers can stay simple.

Step 3: Export the new symbols

Expose the new sample/model names through rl_training.models.cnn and rl_training.models, and update any module-contract expectations to match.

Task 3: DrQ-v2 Learner And Trainer

Files:

  • Create: src/rl_training/algorithms/drqv2.py
  • Create: src/rl_training/runtime/drqv2_trainer.py
  • Modify: src/rl_training/data/replay_buffer.py
  • Modify: src/rl_training/runtime/__init__.py

Step 1: Extend replay-buffer support for image pipelines if needed

Ensure the replay buffer can store channel-first image observations in uint8 while still returning tensors that the learner can consume cleanly on-device.
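One way to meet this requirement is sketched below: frames stay in uint8 inside the buffer and are cast to float32 only after the sampled batch has been moved to the target device. The class name PixelReplayBufferSketch, its fields, and the dictionary batch layout are illustrative assumptions. The existing replay_buffer.py may already expose a different interface that only needs a dtype option.

```python
import numpy as np
import torch


class PixelReplayBufferSketch:
    """Ring buffer that stores frames as uint8 and casts to float32 when sampling."""

    def __init__(self, capacity, obs_shape, action_dim, seed=0):
        self.obs = np.zeros((capacity, *obs_shape), dtype=np.uint8)
        self.next_obs = np.zeros((capacity, *obs_shape), dtype=np.uint8)
        self.actions = np.zeros((capacity, action_dim), dtype=np.float32)
        self.rewards = np.zeros((capacity, 1), dtype=np.float32)
        self.dones = np.zeros((capacity, 1), dtype=np.float32)
        self.capacity, self.size, self.pos = capacity, 0, 0
        self.rng = np.random.default_rng(seed)

    def add(self, obs, action, reward, next_obs, done):
        i = self.pos
        self.obs[i], self.next_obs[i] = obs, next_obs
        self.actions[i], self.rewards[i], self.dones[i] = action, reward, float(done)
        self.pos = (self.pos + 1) % self.capacity  # overwrite the oldest slot when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, device="cpu"):
        idx = self.rng.integers(0, self.size, size=batch_size)

        def to_device(array):
            return torch.as_tensor(array[idx]).to(device)

        return {
            # Transfer the compact uint8 frames first, then cast on the device.
            "obs": to_device(self.obs).float(),
            "next_obs": to_device(self.next_obs).float(),
            "actions": to_device(self.actions),
            "rewards": to_device(self.rewards),
            "dones": to_device(self.dones),
        }
```

Storing uint8 uses a quarter of the memory of float32 frames, and transferring before casting keeps the host-to-device copy equally small.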

Step 2: Implement the learner

Add a readable DrQ-v2 learner that includes:

  • twin critic TD targets
  • delayed actor updates
  • target-network soft updates
  • random-crop augmentation on current and next observations
  • metrics for critic loss, actor loss, target-Q mean, and update counts
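The core pieces of the learner listed above can be sketched as three small helpers. This is a simplified illustration under stated assumptions: the function names are hypothetical, the augmentation is implemented as replicate-padding followed by a random crop back to the original size, and the target is a plain one-step TD target using the minimum of the twin target critics. Anything beyond what the list above names (for example multi-step returns or target-action noise) is left out.

```python
import torch
import torch.nn.functional as F


def random_crop(obs, pad=4):
    """Replicate-pad each image, then crop it back to its original size at a random offset."""
    n, _, h, w = obs.shape
    padded = F.pad(obs.float(), (pad, pad, pad, pad), mode="replicate")
    tops = torch.randint(0, 2 * pad + 1, (n,)).tolist()
    lefts = torch.randint(0, 2 * pad + 1, (n,)).tolist()
    crops = [
        padded[i, :, top:top + h, left:left + w]
        for i, (top, left) in enumerate(zip(tops, lefts))
    ]
    return torch.stack(crops)


def td_target(reward, done, next_q1, next_q2, gamma=0.99):
    """One-step target: r + gamma * (1 - done) * min(Q1', Q2')."""
    with torch.no_grad():
        return reward + gamma * (1.0 - done) * torch.min(next_q1, next_q2)


def soft_update(online, target, tau=0.01):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    with torch.no_grad():
        for p, tp in zip(online.parameters(), target.parameters()):
            tp.mul_(1.0 - tau).add_(p, alpha=tau)
```

In an update step, random_crop would be applied independently to the current and next observation batches before they reach the encoder. The actor update can then be gated by a simple counter check such as `update_count % actor_delay == 0`, where actor_delay is a hypothetical config value.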

Step 3: Implement the trainer

Build a trainer that reuses the current replay-buffer control flow:

  • infer image observation shape and continuous action shape
  • scale normalized actor outputs into env action bounds
  • support checkpoint load/save
  • evaluate deterministically through the standard workflow
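The action-scaling step in the list above is a plain affine map from the actor's tanh range to the environment's Box bounds. A minimal sketch, assuming the actor emits values in [-1, 1] and that a helper named scale_action (a hypothetical name) receives the bounds as arrays:

```python
import numpy as np


def scale_action(normalized, low, high):
    """Map actions from [-1, 1] to [low, high] element-wise."""
    normalized = np.clip(normalized, -1.0, 1.0)  # guard against noise overshoot
    return low + 0.5 * (normalized + 1.0) * (high - low)
```

For a symmetric bound such as [-2, 2] this reduces to multiplying by 2. The general form also handles asymmetric bounds, where 0 maps to the midpoint of the range.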

Step 4: Keep verification deferred

Add code and metrics paths only. Do not execute tests or smoke runs unless the user explicitly asks.

Task 4: Registry, API, And Packaged Config Surface

Files:

  • Modify: src/rl_training/algorithms/__init__.py
  • Modify: src/rl_training/api/algorithms.py
  • Modify: src/rl_training/api/__init__.py
  • Modify: src/rl_training/__init__.py
  • Modify: src/rl_training/experiment/registry.py
  • Modify: src/rl_training/runtime/workflows.py
  • Create: configs/drqv2/pendulum_pixels.yaml
  • Create: src/rl_training/assets/configs/drqv2/pendulum_pixels.yaml

Step 1: Add managed package entrypoints

Expose DrQv2 on the same public surface as the other algorithms so users can call it through the root package, the managed API, the registry, CLI config loading, checkpoint evaluation, and prediction.

Step 2: Add starter configs

Ship a starter pixel-control config using a continuous-control environment that can render RGB frames, with wrappers or env kwargs documented explicitly.

Step 3: Preserve the current package split

Keep DrQ-v2 in the core surface for now, while continuing to defer distributed actor-learner families and world models to later roadmap waves.

Task 5: Unexecuted Test Coverage

Files:

  • Create: tests/test_drqv2_update.py
  • Create: tests/test_drqv2_trainer_smoke.py
  • Modify: tests/test_package_api_exports.py
  • Modify: tests/test_public_api.py
  • Modify: tests/test_experiment_manager.py
  • Modify: tests/test_cli.py
  • Modify: tests/test_checkpoint_workflows.py
  • Modify: tests/test_package_smoke.py
  • Modify: tests/test_module_contracts.py

Step 1: Add unit coverage

Add a narrow learner test that validates tensor shapes, metric keys, and one update call for the new algorithm.

Step 2: Add a smoke trainer test

Register a tiny image continuous-control test environment and add a smoke test that verifies checkpoint creation and evaluation wiring for train_drqv2(...).

Step 3: Update public-surface tests

Extend the package export and experiment-manager expectations so DrQ-v2 is treated like the rest of the shipped algorithm surface.

Step 4: Leave execution to a later verification pass

Keep all test execution commands deferred until the user explicitly asks for verification.