Cal-QL Phase 10 Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a narrow but package-usable Cal-QL implementation for offline continuous-control training, plus the minimum offline dataset support needed to carry or derive discounted returns-to-go for calibrated conservative value learning.

Architecture: Reuse the current offline SAC-family runtime instead of adding a new trainer family. Implement Cal-QL as a small extension of the existing CQL approach on top of MLPSACModel, and wire it into the packaged configs, managed API, and checkpoint/eval/predict flows. Extend TransitionDataset with optional returns_to_go support and a deterministic recomputation path so that reward transforms and dataset mixing stay compatible.

Tech Stack: Python, PyTorch, Gymnasium, existing rl_training offline dataset and experiment infrastructure


Task 1: Freeze The Narrow Cal-QL Scope

Files:

  • Create: docs/plans/2026-03-12-calql-phase10.md
  • Modify: README.md
  • Modify: docs/plans/2026-03-12-mainstream-rl-package-design.md
  • Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
  • Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md

Step 1: Freeze v1 boundaries

Document the first packaged Cal-QL release as:

  • continuous Box actions only
  • flat vector observations only
  • offline dataset training only
  • single-process trainer only
  • CQL-style conservative SAC baseline with calibrated returns-to-go support
  • no online fine-tuning runtime in this phase
  • no recurrent path
  • no image observations

Step 2: Record the package rationale

Explain that Cal-QL is the next low-friction addition from the 2022 offline wave because:

  • current public offline libraries still surface it
  • it extends the existing CQL baseline instead of creating a new runtime family
  • it is explicitly positioned for stronger offline initialization before later online fine-tuning

Step 3: Keep test execution deferred

Record that tests are added but intentionally not executed until the user explicitly requests it.

Task 2: Extend Offline Dataset Payloads For Returns-To-Go

Files:

  • Modify: src/rl_training/data/offline_dataset.py
  • Modify: src/rl_training/data/dataset_loaders.py
  • Modify: src/rl_training/data/offline_mixers.py
  • Modify: src/rl_training/data/__init__.py
  • Modify: src/rl_training/runtime/iql_trainer.py
  • Modify: tests/test_offline_dataset.py
  • Modify: tests/test_dataset_loaders.py

Step 1: Add optional returns_to_go storage

Allow TransitionDataset to carry optional discounted returns-to-go tensors without breaking existing callers that only expect the standard transition fields.
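One way to keep existing callers working is to make the new field optional with a default of None. The sketch below is illustrative only: TransitionDatasetSketch and its field names are hypothetical stand-ins, and the real TransitionDataset may store tensors or arrays rather than plain lists.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class TransitionDatasetSketch:
    """Hypothetical stand-in for TransitionDataset; field names are assumed."""

    observations: List[List[float]]
    actions: List[List[float]]
    rewards: List[float]
    next_observations: List[List[float]]
    dones: List[bool]
    # Optional payload: None means "not provided", so old callers are unaffected.
    returns_to_go: Optional[List[float]] = None

    def __post_init__(self) -> None:
        # When present, the payload must align one-to-one with the transitions.
        if self.returns_to_go is not None and len(self.returns_to_go) != len(self.rewards):
            raise ValueError("returns_to_go must have one entry per transition")


# An existing caller that only knows the standard transition fields.
legacy = TransitionDatasetSketch(
    observations=[[0.0], [1.0]],
    actions=[[0.1], [0.2]],
    rewards=[1.0, 2.0],
    next_observations=[[1.0], [2.0]],
    dones=[False, True],
)

# A Cal-QL caller attaches the extra payload without touching the other fields.
enriched = replace(legacy, returns_to_go=[3.0, 2.0])
```

Defaulting to None rather than an empty list lets downstream code tell "never provided" apart from "provided but empty", which matters when the trainer decides whether to derive the values itself.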

Step 2: Add a deterministic recomputation path

Provide a helper that recomputes discounted returns-to-go from the current reward and done arrays so reward scaling / shifting / clipping can happen before calibration targets are derived.
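The recomputation can be sketched as a single backward pass over the flat reward and done arrays. This is a framework-free illustration, not the package helper: the function name discounted_returns_to_go is hypothetical, and it assumes transitions are stored in episode order with done marking the last step of each episode.

```python
from typing import List, Sequence


def discounted_returns_to_go(
    rewards: Sequence[float], dones: Sequence[bool], gamma: float
) -> List[float]:
    """Backward pass computing G_t = r_t + gamma * G_{t+1}, reset at episode ends."""
    if len(rewards) != len(dones):
        raise ValueError("rewards and dones must have the same length")
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        if dones[t]:
            # Last step of an episode: nothing from the next episode leaks back.
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Two episodes of two steps each, stored back to back.
rewards = [1.0, 2.0, 3.0, 4.0]
dones = [False, True, False, True]
rtg = discounted_returns_to_go(rewards, dones, gamma=0.5)

# Apply the reward transform first, then recompute from the transformed rewards.
scaled_rewards = [0.5 * r for r in rewards]
scaled_rtg = discounted_returns_to_go(scaled_rewards, dones, gamma=0.5)
```

Because the function depends only on its inputs, rerunning it after any reward transform or dataset mix yields targets consistent with the rewards the learner actually sees. Episodes that end by truncation without a done flag would need a separate boundary signal, which this sketch does not handle.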

Step 3: Preserve rich fields through processing

Keep optional offline payload fields through:

  • TransitionDataset.from_dict(...)
  • file-backed dataset loading
  • mixed dataset assembly
  • action normalization in _process_offline_dataset(...)
  • sampling to trainer batches

Also fix the current action-normalization path so it no longer drops next_actions.
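A common cause of dropped fields is rebuilding the dataset object from a hand-picked subset of fields. The sketch below shows one way to avoid that with dataclasses.replace, which copies every field that is not explicitly overridden. OfflineBatchSketch and normalize_actions are hypothetical names, actions are one-dimensional for brevity, and rescaling next_actions alongside actions is an assumption about the intended behavior.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class OfflineBatchSketch:
    """Hypothetical container; the real dataset has more fields."""

    actions: List[float]
    rewards: List[float]
    next_actions: Optional[List[float]] = None
    returns_to_go: Optional[List[float]] = None


def normalize_actions(data: OfflineBatchSketch, low: float, high: float) -> OfflineBatchSketch:
    """Map actions from [low, high] to [-1, 1] while keeping every other field."""

    def scale(values: List[float]) -> List[float]:
        return [2.0 * (a - low) / (high - low) - 1.0 for a in values]

    # replace() carries over rewards and returns_to_go untouched, so optional
    # payloads cannot be lost by forgetting to pass them along.
    return replace(
        data,
        actions=scale(data.actions),
        next_actions=None if data.next_actions is None else scale(data.next_actions),
    )


raw = OfflineBatchSketch(
    actions=[-2.0, 0.0, 2.0],
    rewards=[0.1, 0.2, 0.3],
    next_actions=[0.0, 2.0, -2.0],
    returns_to_go=[0.5, 0.4, 0.3],
)
normalized = normalize_actions(raw, low=-2.0, high=2.0)
```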

Task 3: Add The Cal-QL Learner

Files:

  • Create: src/rl_training/algorithms/cal_ql.py
  • Modify: src/rl_training/algorithms/__init__.py

Step 1: Reuse the existing SAC / CQL model family

Build Cal-QL on MLPSACModel and keep the same narrow continuous offline assumptions as CQL.

Step 2: Implement calibrated conservative loss

Keep the package v1 behavior small and explicit:

  • reuse the current SAC target and actor update shape
  • reuse the random-action conservative penalty structure from the local CQL implementation
  • calibrate policy-sampled conservative values with max(sampled_q, returns_to_go)
  • require returns-to-go in training batches, but derive them automatically in the trainer from processed datasets
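The calibration step in the list above can be illustrated without any tensor library. This is a minimal sketch under stated assumptions: the function names are hypothetical, the penalty is a plain logsumexp over random-action and policy-sampled Q-values minus the dataset Q-value, and the importance-weight corrections that a CQL implementation may apply are omitted. The real cal_ql_loss(...) would operate on PyTorch tensors.

```python
import math
from typing import List, Sequence


def calibrate(sampled_q: Sequence[Sequence[float]], returns_to_go: Sequence[float]) -> List[List[float]]:
    """Lower-bound each policy-sampled Q-value by its state's return-to-go."""
    return [[max(q, g) for q in row] for row, g in zip(sampled_q, returns_to_go)]


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a non-empty sequence."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def calibrated_conservative_penalty(
    random_q: Sequence[Sequence[float]],
    policy_q: Sequence[Sequence[float]],
    data_q: Sequence[float],
    returns_to_go: Sequence[float],
) -> float:
    """Batch mean of logsumexp(candidate Q-values) minus Q(s, a_data)."""
    calibrated = calibrate(policy_q, returns_to_go)
    gaps = [
        logsumexp(list(r) + c) - d
        for r, c, d in zip(random_q, calibrated, data_q)
    ]
    return sum(gaps) / len(gaps)


# One state, two random-action samples, one policy sample.
random_q = [[0.0, 0.0]]
data_q = [1.0]
returns_to_go = [0.0]

# Both policy values sit below the return-to-go, so both are clipped up to it.
penalty_a = calibrated_conservative_penalty(random_q, [[-5.0]], data_q, returns_to_go)
penalty_b = calibrated_conservative_penalty(random_q, [[-10.0]], data_q, returns_to_go)
# A policy value above the return-to-go passes through and is still penalized.
penalty_c = calibrated_conservative_penalty(random_q, [[2.0]], data_q, returns_to_go)
```

In this sketch, once a policy-sampled value falls below the return-to-go, lowering it further no longer reduces the penalty (penalty_a equals penalty_b), so the regularizer stops rewarding the critic for driving those values below the observed return.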

Step 3: Expose public loss helpers

Add a readable cal_ql_loss(...) function and export CalQL / CalQLAlgorithm through the shared algorithms package.

Task 4: Add The Offline Cal-QL Trainer

Files:

  • Create: src/rl_training/runtime/cal_ql_trainer.py
  • Modify: src/rl_training/experiment/registry.py

Step 1: Reuse the offline dataset stack

Build the trainer on _infer_env_spaces(...) and _build_offline_dataset(...) from the current offline path, then derive discounted returns-to-go from the processed dataset using the configured gamma.

Step 2: Preserve shared controls

Keep support for:

  • eval_interval
  • early stopping callbacks
  • offline epoch / update budgets
  • learning-rate schedules
  • checkpoint save / resume

Step 3: Reuse standard evaluation and prediction

Evaluate with the current deterministic SAC helper and expose package prediction through checkpoint workflows.

Task 5: Wire Cal-QL Into The Package Surface

Files:

  • Modify: src/rl_training/api/algorithms.py
  • Modify: src/rl_training/api/__init__.py
  • Modify: src/rl_training/__init__.py
  • Create: configs/cal_ql/pendulum.yaml
  • Create: src/rl_training/assets/configs/cal_ql/pendulum.yaml
  • Modify: README.md

Step 1: Add the managed API entrypoint

Expose CalQL through the root package and API namespaces.

Step 2: Ship a starter config

Add a packaged offline Pendulum-v1 config that uses the random dataset path and narrow Cal-QL defaults.

Step 3: Update package docs

Document Cal-QL as a 2022 calibrated CQL follow-on and explain the optional returns_to_go payload / automatic derivation path.

Task 6: Add Unexecuted Coverage

Files:

  • Create: tests/test_cal_ql_update.py
  • Create: tests/test_cal_ql_trainer_smoke.py
  • Modify: tests/test_package_api_exports.py
  • Modify: tests/test_public_api.py
  • Modify: tests/test_experiment_manager.py
  • Modify: tests/test_checkpoint_workflows.py
  • Modify: tests/test_package_smoke.py
  • Modify: tests/test_cli.py

Step 1: Add learner-level coverage

Add a unit test for cal_ql_loss(...) metric keys and one update call.

Step 2: Add trainer smoke coverage

Add a small offline smoke test that checks checkpoint creation and eval wiring.

Step 3: Extend package-surface expectations

Update public exports, managed API, checkpoint workflow, and packaged-config tests so Cal-QL is treated as a shipped algorithm.

Step 4: Keep execution deferred

Do not run the tests until the user explicitly asks for test execution.