EDAC Phase 12 Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a narrow but package-usable EDAC implementation for offline continuous-control training by reusing the current ensemble SAC-family path and adding the critic-diversity regularizer that makes EDAC distinct.

Architecture: Reuse the current offline dataset path, checkpoint / eval / predict plumbing, and the existing multi-critic MLPREDQModel instead of introducing a new model family. Implement EDAC as a package-narrow offline ensemble actor-critic that keeps the current fixed-alpha package convention, adds the gradient-diversity regularizer on data actions, and evaluates through the same deterministic continuous-control helpers already used by REDQ.

Tech Stack: Python, PyTorch, Gymnasium, existing rl_training offline dataset and experiment infrastructure


Task 1: Freeze The Narrow EDAC Scope

Files:

  • Create: docs/plans/2026-03-12-edac-phase12.md
  • Modify: README.md
  • Modify: docs/plans/2026-03-12-mainstream-rl-package-design.md
  • Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
  • Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md

Step 1: Freeze v1 boundaries

Document the first packaged EDAC release as:

  • continuous Box actions only
  • flat vector observations only
  • offline dataset training only
  • single-process trainer only
  • ensemble SAC-style actor / critic updates with critic-diversity regularization
  • fixed entropy coefficient alpha in this phase
  • no recurrent path
  • no image observations
  • no online fine-tuning runtime in this phase

Step 2: Record the package rationale

Explain why EDAC is the next low-friction addition in the 2022 offline wave:

  • it is a recognizable offline baseline, published at NeurIPS 2021 and still used as a comparison point in 2022-era work
  • it reuses the current continuous actor-critic mental model instead of forcing a sequence or world-model stack
  • it can be built on top of the existing multi-critic runtime pieces already present in the package

Step 3: Keep test execution deferred

Record that tests are added but intentionally not executed until the user explicitly requests it.

Task 2: Add The EDAC Learner

Files:

  • Create: src/rl_training/algorithms/edac.py
  • Modify: src/rl_training/algorithms/__init__.py

Step 1: Reuse the existing ensemble model family

Build EDAC on MLPREDQModel and keep the same narrow continuous offline assumptions as the current SAC-family offline algorithms.

Step 2: Implement the package-narrow EDAC losses

Keep the package v1 behavior small and explicit:

  • reuse the current tanh-Gaussian actor sampling path
  • compute target values from the minimum over target-critic ensemble members
  • add the critic-diversity penalty on action gradients across ensemble members
  • expose knobs for num_critics, eta, and fixed alpha
  • keep the current package convention of fixed alpha instead of introducing automatic entropy tuning in this phase
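
For reference, the losses listed above can be written out as follows. This is a sketch in the form commonly given for EDAC, adapted to the fixed-alpha convention. Here N stands for num_critics, \eta for eta, \alpha for the fixed entropy coefficient, and \bar{\phi}_j for target-critic parameters. The terminal flag d and the discount \gamma are assumptions about the dataset and config layout.

```latex
% Target value: minimum over target-critic ensemble members, fixed alpha
y = r + \gamma\,(1 - d)\,\Bigl(\min_{1 \le j \le N} Q_{\bar{\phi}_j}(s', a')
      - \alpha \log \pi_\theta(a' \mid s')\Bigr),
\qquad a' \sim \pi_\theta(\cdot \mid s')

% Normalized action gradient of each critic at the dataset action
\hat{g}_i(s, a) = \frac{\nabla_a Q_{\phi_i}(s, a)}{\lVert \nabla_a Q_{\phi_i}(s, a) \rVert_2}

% Critic objective: TD error plus the critic-diversity penalty
\mathcal{L}_{\text{critic}} = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}
  \Bigl[\,\sum_{i=1}^{N} \bigl(Q_{\phi_i}(s, a) - y\bigr)^2
  + \frac{\eta}{N - 1} \sum_{1 \le i \ne j \le N}
    \bigl\langle \hat{g}_i(s, a),\, \hat{g}_j(s, a) \bigr\rangle \Bigr]

% Actor objective: minimum over the ensemble, fixed alpha
\mathcal{L}_{\text{actor}} = \mathbb{E}_{s \sim \mathcal{D},\; \tilde{a} \sim \pi_\theta(\cdot \mid s)}
  \Bigl[\alpha \log \pi_\theta(\tilde{a} \mid s)
  - \min_{1 \le j \le N} Q_{\phi_j}(s, \tilde{a})\Bigr]
```

Setting \eta = 0 removes the diversity term and leaves a plain ensemble SAC-style update, which makes eta a convenient knob for ablation.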

Step 3: Expose public loss helpers

Add a readable edac_loss(...) helper plus a separate critic_diversity_loss(...) helper and export EDAC / EDACAlgorithm through the shared algorithms package.
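
A minimal sketch of what critic_diversity_loss(...) could look like in PyTorch is shown below. It is not the package implementation. TinyEnsembleCritic is a hypothetical stand-in for the multi-critic model (the real MLPREDQModel interface may differ), and the helper signature is an assumption. The penalty is the mean pairwise cosine similarity between per-critic action gradients at dataset actions, scaled by 1 / (num_critics - 1).

```python
import torch
from torch import nn


class TinyEnsembleCritic(nn.Module):
    """Hypothetical stand-in for the package's multi-critic model."""

    def __init__(self, obs_dim, act_dim, num_critics, hidden=32):
        super().__init__()
        self.members = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(obs_dim + act_dim, hidden),
                    nn.Tanh(),
                    nn.Linear(hidden, 1),
                )
                for _ in range(num_critics)
            ]
        )

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        # One Q-value per member and sample: shape [num_critics, batch].
        return torch.stack([m(x).squeeze(-1) for m in self.members], dim=0)


def critic_diversity_loss(critic, obs, act, eps=1e-10):
    """Pairwise similarity of normalized action gradients (assumed signature)."""
    num_critics = len(critic.members)
    # Differentiate with respect to the dataset actions, not the policy.
    act = act.detach().clone().requires_grad_(True)
    q_values = critic(obs, act)

    grads = []
    for i in range(num_critics):
        # create_graph=True keeps the penalty differentiable w.r.t. critic weights.
        (grad_i,) = torch.autograd.grad(q_values[i].sum(), act, create_graph=True)
        grads.append(grad_i)
    grads = torch.stack(grads, dim=1)  # [batch, num_critics, act_dim]
    grads = grads / (grads.norm(dim=-1, keepdim=True) + eps)

    similarity = torch.bmm(grads, grads.transpose(1, 2))  # [batch, N, N]
    off_diagonal = 1.0 - torch.eye(num_critics, device=act.device)
    pairwise_sum = (similarity * off_diagonal).sum(dim=(1, 2))
    return pairwise_sum.mean() / (num_critics - 1)
```

In a full edac_loss(...), this term would be multiplied by eta and added to the ensemble TD loss. Keeping it as a separate helper lets the unit tests check the regularizer in isolation.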

Task 3: Add The Offline EDAC Trainer

Files:

  • Create: src/rl_training/runtime/edac_trainer.py
  • Modify: src/rl_training/experiment/registry.py

Step 1: Reuse the offline dataset stack

Build the trainer on _infer_env_spaces(...) and _build_offline_dataset(...) from the current offline path, and evaluate through the current deterministic continuous-control ensemble helper.

Step 2: Preserve shared controls

Keep support for:

  • eval_interval
  • early stopping callbacks
  • offline epoch / update budgets
  • learning-rate schedules
  • checkpoint save / resume
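
How several of these controls interact can be sketched as a plain update loop. The code below is illustrative only: EarlyStopping, run_offline_updates, and the update_fn / eval_fn / save_fn callables are hypothetical names, not the package's trainer API, and learning-rate scheduling is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopping:
    """Hypothetical callback: stop after `patience` evals without improvement."""

    patience: int
    best: float = float("-inf")
    bad_evals: int = 0

    def should_stop(self, eval_return):
        if eval_return > self.best:
            self.best = eval_return
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def run_offline_updates(update_fn, eval_fn, save_fn, total_updates,
                        eval_interval, early_stopping=None, start_step=0):
    """Run gradient updates until the budget or the early-stop rule ends them.

    `start_step` models resuming from a checkpoint saved at that step.
    Returns the number of updates completed so far.
    """
    step = start_step
    while step < total_updates:
        update_fn(step)
        step += 1
        if step % eval_interval == 0:
            eval_return = eval_fn(step)
            save_fn(step)  # checkpoint at every evaluation boundary
            if early_stopping is not None and early_stopping.should_stop(eval_return):
                break
    return step
```

Resuming is then a matter of restoring model and optimizer state and passing the saved step as start_step, so the update budget counts from where the previous run stopped.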

Step 3: Reuse standard evaluation and prediction

Evaluate with the current deterministic ensemble-action helper and expose package prediction through checkpoint workflows.

Task 4: Wire EDAC Into The Package Surface

Files:

  • Modify: src/rl_training/api/algorithms.py
  • Modify: src/rl_training/api/__init__.py
  • Modify: src/rl_training/__init__.py
  • Create: configs/edac/pendulum.yaml
  • Create: src/rl_training/assets/configs/edac/pendulum.yaml
  • Modify: README.md

Step 1: Add the managed API entrypoint

Expose EDAC through the root package and API namespaces.

Step 2: Ship a starter config

Add a packaged offline Pendulum-v1 config with narrow EDAC defaults.
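
The knobs such a starter config would need to carry can be sketched as a small Python dataclass. Everything here is an assumption for illustration: the class name, the field names, and especially the numeric values are placeholders, not tuned or shipped defaults, and the real YAML schema may be organized differently.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EDACStarterConfig:
    """Hypothetical mirror of the EDAC knobs in a packaged Pendulum config."""

    env_id: str = "Pendulum-v1"
    num_critics: int = 10       # placeholder ensemble size
    eta: float = 1.0            # placeholder diversity-penalty weight
    alpha: float = 0.2          # placeholder fixed entropy coefficient
    gamma: float = 0.99         # placeholder discount factor
    eval_interval: int = 1000   # placeholder, in gradient updates

    def __post_init__(self):
        if self.num_critics < 2:
            raise ValueError("the diversity penalty needs at least two critics")
        if self.eta < 0:
            raise ValueError("eta must be non-negative")
        if self.alpha < 0:
            raise ValueError("alpha must be non-negative")
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
```

Calling asdict(EDACStarterConfig()) yields a plain dictionary that maps one-to-one onto flat YAML keys. Validating at construction time surfaces a bad num_critics or eta before any dataset is loaded.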

Step 3: Update package docs

Document EDAC as an ensemble-diversified offline actor-critic on the same continuous offline data path.

Task 5: Add Unexecuted Coverage

Files:

  • Create: tests/test_edac_update.py
  • Create: tests/test_edac_trainer_smoke.py
  • Modify: tests/test_package_api_exports.py
  • Modify: tests/test_public_api.py
  • Modify: tests/test_experiment_manager.py
  • Modify: tests/test_checkpoint_workflows.py
  • Modify: tests/test_package_smoke.py
  • Modify: tests/test_cli.py

Step 1: Add learner-level coverage

Add unit tests covering critic_diversity_loss(...), the metric keys returned by edac_loss(...), and a single update call.

Step 2: Add trainer smoke coverage

Add a small offline smoke test that checks checkpoint creation and eval wiring.

Step 3: Extend package-surface expectations

Update public exports, managed API, checkpoint workflow, and packaged-config tests so EDAC is treated as a shipped algorithm.

Step 4: Keep execution deferred

Do not run the tests until the user explicitly asks for test execution.