Cal-QL Phase 10 Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Add a narrow but package-usable Cal-QL implementation for offline continuous-control training, plus the minimum offline dataset support needed to carry or derive discounted returns-to-go for calibrated conservative value learning.
Architecture: Reuse the current offline SAC-family runtime instead of inventing a new trainer lane. Implement Cal-QL as a small extension of the existing CQL mental model on top of MLPSACModel, packaged configs, managed API wiring, and checkpoint/eval/predict flows, while extending TransitionDataset with optional returns_to_go support and a deterministic recomputation path so reward transforms and dataset mixing stay compatible.
Tech Stack: Python, PyTorch, Gymnasium, existing rl_training offline dataset and experiment infrastructure
Task 1: Freeze The Narrow Cal-QL Scope
Files:
- Create: docs/plans/2026-03-12-calql-phase10.md
- Modify: README.md
- Modify: docs/plans/2026-03-12-mainstream-rl-package-design.md
- Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
- Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md
Step 1: Freeze v1 boundaries
Document the first packaged Cal-QL release as:
- continuous Box actions only
- flat vector observations only
- offline dataset training only
- single-process trainer only
- CQL-style conservative SAC baseline with calibrated returns-to-go support
- no online fine-tuning runtime in this phase
- no recurrent path
- no image observations
Step 2: Record the package rationale
Explain that Cal-QL is the next low-friction addition from the 2022 offline wave because:
- current public offline libraries still surface it
- it extends the existing CQL baseline instead of creating a new runtime family
- it is explicitly positioned for stronger offline initialization before later online fine-tuning
Step 3: Keep test execution deferred
Record that tests are added but intentionally not executed until the user explicitly requests it.
Task 2: Extend Offline Dataset Payloads For Returns-To-Go
Files:
- Modify: src/rl_training/data/offline_dataset.py
- Modify: src/rl_training/data/dataset_loaders.py
- Modify: src/rl_training/data/offline_mixers.py
- Modify: src/rl_training/data/__init__.py
- Modify: src/rl_training/runtime/iql_trainer.py
- Modify: tests/test_offline_dataset.py
- Modify: tests/test_dataset_loaders.py
Step 1: Add optional returns_to_go storage
Allow TransitionDataset to carry optional discounted returns-to-go tensors without breaking existing callers that only expect the standard transition fields.
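One way to picture the optional field is the sketch below. It is not the real TransitionDataset: the class name TransitionDatasetSketch, the field names, and the from_dict behavior are all assumptions made for illustration. The point it shows is that a defaulted field plus a tolerant dictionary lookup keeps older payloads loadable.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Sequence


@dataclass
class TransitionDatasetSketch:
    """Hypothetical stand-in for TransitionDataset; field names are assumed."""

    observations: Sequence[Any]
    actions: Sequence[Any]
    rewards: Sequence[float]
    next_observations: Sequence[Any]
    dones: Sequence[bool]
    # Optional payload: callers that never pass it simply see None.
    returns_to_go: Optional[Sequence[float]] = None

    def __post_init__(self) -> None:
        # When present, the optional field must line up with the transitions.
        if self.returns_to_go is not None and len(self.returns_to_go) != len(self.rewards):
            raise ValueError("returns_to_go must have one entry per transition")

    @classmethod
    def from_dict(cls, payload: Mapping[str, Any]) -> "TransitionDatasetSketch":
        return cls(
            observations=payload["observations"],
            actions=payload["actions"],
            rewards=payload["rewards"],
            next_observations=payload["next_observations"],
            dones=payload["dones"],
            # .get() keeps payloads written before this field existed loadable.
            returns_to_go=payload.get("returns_to_go"),
        )
```

Keeping the default at None, rather than an empty list, lets the trainer distinguish "not provided, derive it" from "provided but empty".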
Step 2: Add a deterministic recomputation path
Provide a helper that recomputes discounted returns-to-go from the current reward and done arrays so reward scaling / shifting / clipping can happen before calibration targets are derived.
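The recomputation can be sketched as a single backward pass over the flat transition arrays. This is a dependency-free illustration, not the package helper: the function name discounted_returns_to_go is hypothetical, and it assumes that every done flag marks an episode boundary and that a trailing unfinished episode has zero return beyond the last stored step. A real implementation would likely operate on arrays or tensors and may need to treat timeouts differently.

```python
from typing import List, Sequence


def discounted_returns_to_go(
    rewards: Sequence[float],
    dones: Sequence[bool],
    gamma: float,
) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1}, resetting at episode ends."""
    if len(rewards) != len(dones):
        raise ValueError("rewards and dones must have the same length")
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of its successor.
    for t in range(len(rewards) - 1, -1, -1):
        if dones[t]:
            running = 0.0  # nothing after an episode end carries over
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Two episodes stored back to back, of lengths 3 and 2.
rewards = [1.0, 1.0, 1.0, 2.0, 2.0]
dones = [False, False, True, False, True]
rtg = discounted_returns_to_go(rewards, dones, gamma=0.5)
# rtg == [1.75, 1.5, 1.0, 3.0, 2.0]
```

Because the function depends only on its inputs, calling it after reward scaling, shifting, or clipping yields targets consistent with the transformed rewards.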
Step 3: Preserve rich fields through processing
Keep optional offline payload fields through:
- TransitionDataset.from_dict(...)
- file-backed dataset loading
- mixed dataset assembly
- action normalization in _process_offline_dataset(...)
- sampling to trainer batches
Also fix the current action-normalization path so it no longer drops next_actions.
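The field-preservation requirement can be illustrated with a dictionary-based sketch. The helper name normalize_action_fields, the dictionary layout, and the scalar bounds are assumptions for illustration only; the actual _process_offline_dataset(...) internals are not shown in this plan. The idea is to start from a copy of the whole payload and overwrite only the action-like fields, so nothing is dropped by omission.

```python
from typing import Any, Dict, List

# Fields that hold actions and therefore need the same rescaling.
ACTION_KEYS = ("actions", "next_actions")


def normalize_action_fields(
    dataset: Dict[str, Any], low: float, high: float
) -> Dict[str, Any]:
    """Rescale action fields from [low, high] to [-1, 1], keeping all other fields."""
    if high <= low:
        raise ValueError("high must be greater than low")

    def scale(action: List[float]) -> List[float]:
        return [2.0 * (a - low) / (high - low) - 1.0 for a in action]

    # Shallow copy first: optional fields such as returns_to_go pass through as-is.
    processed = dict(dataset)
    for key in ACTION_KEYS:
        if key in dataset:
            processed[key] = [scale(action) for action in dataset[key]]
    return processed


raw = {
    "actions": [[-2.0], [2.0]],
    "next_actions": [[0.0], [1.0]],
    "rewards": [0.0, 1.0],
    "returns_to_go": [0.5, 1.0],
}
processed = normalize_action_fields(raw, low=-2.0, high=2.0)
```

Building the result from a copy, instead of listing fields to keep, means a newly added optional field survives without further changes to this step.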
Task 3: Add The Cal-QL Learner
Files:
- Create: src/rl_training/algorithms/cal_ql.py
- Modify: src/rl_training/algorithms/__init__.py
Step 1: Reuse the existing SAC / CQL model family
Build Cal-QL on MLPSACModel and keep the same narrow continuous offline assumptions as CQL.
Step 2: Implement calibrated conservative loss
Keep the package v1 behavior small and explicit:
- reuse the current SAC target and actor update shape
- reuse the current random-action conservative penalty structure from the local CQL
- calibrate policy-sampled conservative values with max(sampled_q, returns_to_go)
- require returns-to-go in training batches, but derive them automatically in the trainer from processed datasets
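The calibration step in the list above can be sketched numerically as follows. This is a simplified, dependency-free illustration and not the package's cal_ql_loss(...): the function names and argument layout are hypothetical, and it leaves out the penalty weight, temperature, and importance-weight corrections that a full CQL-style loss may include. It shows only where max(sampled_q, returns_to_go) enters relative to the random-action terms.

```python
import math
from typing import List, Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def calibrated_conservative_penalty(
    random_q: Sequence[Sequence[float]],  # [batch][num_random_actions]
    policy_q: Sequence[Sequence[float]],  # [batch][num_policy_actions]
    data_q: Sequence[float],              # [batch], Q(s, a) on dataset actions
    returns_to_go: Sequence[float],       # [batch], calibration reference
) -> float:
    """Mean over the batch of logsumexp(sampled Q) - Q(dataset action)."""
    sizes = {len(random_q), len(policy_q), len(data_q), len(returns_to_go)}
    if len(sizes) != 1:
        raise ValueError("all inputs must share the same batch size")
    penalties: List[float] = []
    for rand_row, pol_row, q_data, rtg in zip(random_q, policy_q, data_q, returns_to_go):
        # Calibration: policy-sampled values are floored at the returns-to-go.
        calibrated = [max(q, rtg) for q in pol_row]
        # Random-action values keep the plain CQL form, with no floor applied.
        penalties.append(logsumexp(list(rand_row) + calibrated) - q_data)
    return sum(penalties) / len(penalties)
```

In an autograd setting, entries where returns_to_go wins the max contribute no gradient to the Q-network, so the penalty stops pushing those values below the reference return.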
Step 3: Expose public loss helpers
Add a readable cal_ql_loss(...) function and export CalQL / CalQLAlgorithm through the shared algorithms package.
Task 4: Add The Offline Cal-QL Trainer
Files:
- Create: src/rl_training/runtime/cal_ql_trainer.py
- Modify: src/rl_training/experiment/registry.py
Step 1: Reuse the offline dataset stack
Build the trainer on _infer_env_spaces(...) and _build_offline_dataset(...) from the current offline path, then derive discounted returns-to-go from the processed dataset using the configured gamma.
Step 2: Preserve shared controls
Keep support for:
- eval_interval
- early stopping callbacks
- offline epoch / update budgets
- learning-rate schedules
- checkpoint save / resume
Step 3: Reuse standard evaluation and prediction
Evaluate with the current deterministic SAC helper and expose package prediction through checkpoint workflows.
Task 5: Wire Cal-QL Into The Package Surface
Files:
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py
- Create: configs/cal_ql/pendulum.yaml
- Create: src/rl_training/assets/configs/cal_ql/pendulum.yaml
- Modify: README.md
Step 1: Add the managed API entrypoint
Expose CalQL through the root package and API namespaces.
Step 2: Ship a starter config
Add a packaged offline Pendulum-v1 config that uses the random dataset path and narrow Cal-QL defaults.
Step 3: Update package docs
Document Cal-QL as a 2022 calibrated CQL follow-on and explain the optional returns_to_go payload / automatic derivation path.
Task 6: Add Unexecuted Coverage
Files:
- Create: tests/test_cal_ql_update.py
- Create: tests/test_cal_ql_trainer_smoke.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_experiment_manager.py
- Modify: tests/test_checkpoint_workflows.py
- Modify: tests/test_package_smoke.py
- Modify: tests/test_cli.py
Step 1: Add learner-level coverage
Add a unit test for cal_ql_loss(...) metric keys and one update call.
Step 2: Add trainer smoke coverage
Add a small offline smoke test that checks checkpoint creation and eval wiring.
Step 3: Extend package-surface expectations
Update public exports, managed API, checkpoint workflow, and packaged-config tests so Cal-QL is treated as a shipped algorithm.
Step 4: Keep execution deferred
Do not run the tests until the user explicitly asks for test execution.