XQL Phase 11 Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Add a narrow but package-usable XQL implementation for offline continuous-control training by reusing the current IQL runtime lane and replacing the value update with the packaged extreme value loss.
Architecture: Reuse MLPIQLModel, the existing offline dataset path, and the current checkpoint / eval / predict plumbing instead of inventing a new runtime family. Implement XQL as an IQL-adjacent learner with the same actor / critic update shape, plus a small value-loss helper that supports a stable Gumbel-rescale objective and a package-default fallback to the familiar expectile path when needed.
Tech Stack: Python, PyTorch, Gymnasium, existing rl_training offline dataset and experiment infrastructure
Task 1: Freeze The Narrow XQL Scope
Files:
- Create: docs/plans/2026-03-12-xql-phase11.md
- Modify: README.md
- Modify: docs/plans/2026-03-12-mainstream-rl-package-design.md
- Modify: docs/plans/2026-03-12-rl-expansion-roadmap-design.md
- Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md
Step 1: Freeze v1 boundaries
Document the first packaged XQL release as:
- continuous Box actions only
- flat vector observations only
- offline dataset training only
- single-process trainer only
- IQL-style actor / critic updates with an XQL extreme value update
- no recurrent path
- no image observations
- no online fine-tuning runtime in this phase
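One way to make these boundaries enforceable is a small guard that runs before training starts. The sketch below is an assumption, not existing package code: check_xql_v1_spaces is a hypothetical name, and it recognises a Gymnasium Box by class name so the example needs no third-party imports.

```python
from typing import Any


def check_xql_v1_spaces(observation_space: Any, action_space: Any) -> None:
    """Hypothetical guard: raise ValueError when spaces fall outside the v1 scope."""
    # Continuous actions only. Duck-typed: Gymnasium's continuous space class is "Box".
    if type(action_space).__name__ != "Box":
        raise ValueError(
            f"XQL v1 supports Box actions only, got {type(action_space).__name__}"
        )
    # Flat vector observations only: rank-1 shapes such as (3,), not image-like (H, W, C).
    shape = tuple(observation_space.shape)
    if len(shape) != 1:
        raise ValueError(f"XQL v1 supports flat observations only, got shape {shape}")
```

Failing fast here would keep image or discrete-action environments from reaching the learner with a confusing shape error later on.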
Step 2: Record the package rationale
Explain that XQL is the next low-friction offline wave because:
- the official implementation is built on top of the IQL codebase
- it extends the current actor / critic / value decomposition instead of creating a new model family
- it adds a recognizable 2023 offline baseline without forcing a world-model or sequence stack
Step 3: Keep test execution deferred
Record that tests are added but intentionally not executed until the user explicitly requests it.
Task 2: Add The XQL Learner
Files:
- Create: src/rl_training/algorithms/xql.py
- Modify: src/rl_training/algorithms/__init__.py
Step 1: Reuse the existing IQL model family
Build XQL on MLPIQLModel and keep the same narrow continuous offline assumptions as IQL.
Step 2: Implement the packaged extreme value loss
Keep the package v1 behavior small and explicit:
- reuse the current IQL TD critic update
- reuse the current advantage-weighted actor regression update
- replace the value update with a stable XQL-style Gumbel-rescale loss over min(Q1, Q2) - V
- keep a vanilla_value_loss / expectile fallback for a narrow compatibility path
- expose knobs for loss_temperature, max_value_diff_exp, and max_advantage_weight
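The value update described above can be sketched numerically without any framework. This is a plain-Python illustration under stated assumptions, not the packaged code: the function signatures are guesses, vanilla_value_loss is treated as a boolean switch, and expectile is a hypothetical parameter name. A tensor version would also need to detach the rescaling shift from the gradient, which has no equivalent here.

```python
import math
from typing import Optional, Sequence


def gumbel_rescale_loss(
    diffs: Sequence[float],
    loss_temperature: float,
    max_value_diff_exp: Optional[float] = None,
) -> float:
    """Mean of (exp(z) - z - 1) * exp(-shift), where z = diff / loss_temperature."""
    z = [d / loss_temperature for d in diffs]
    if max_value_diff_exp is not None:
        # Clip the exponent so a few large residuals cannot dominate the batch.
        z = [min(v, max_value_diff_exp) for v in z]
    # Rescale by the batch maximum (floored at -1) so every exp() argument is <= 0.
    shift = max(max(z), -1.0)
    scale = math.exp(-shift)
    return sum(math.exp(v - shift) - v * scale - scale for v in z) / len(z)


def expectile_loss(diffs: Sequence[float], expectile: float) -> float:
    """IQL-style asymmetric squared loss, used here as the fallback path."""
    return sum(
        (expectile if d > 0 else 1.0 - expectile) * d * d for d in diffs
    ) / len(diffs)


def xql_value_loss(
    q1: Sequence[float],
    q2: Sequence[float],
    v: Sequence[float],
    *,
    loss_temperature: float,
    max_value_diff_exp: Optional[float] = None,
    vanilla_value_loss: bool = False,  # assumption: boolean switch to the fallback
    expectile: Optional[float] = None,  # assumption: only needed for the fallback
) -> float:
    # Both paths operate on the same residual: min(Q1, Q2) - V.
    diffs = [min(a, b) - c for a, b, c in zip(q1, q2, v)]
    if vanilla_value_loss:
        if expectile is None:
            raise ValueError("expectile is required when vanilla_value_loss=True")
        return expectile_loss(diffs, expectile)
    return gumbel_rescale_loss(diffs, loss_temperature, max_value_diff_exp)
```

Without the rescale, a residual of 1000 would overflow exp(); with it, gumbel_rescale_loss([1000.0], 1.0) stays finite, because the shift factors the largest exponential out of the objective.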
Step 3: Expose public loss helpers
Add readable xql_loss(...), xql_value_loss(...), and gumbel_rescale_loss(...) helpers and export XQL / XQLAlgorithm through the shared algorithms package.
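For the actor side of this task, the max_advantage_weight knob from Step 2 can be illustrated the same way. The sketch assumes the usual IQL-style form, where the weight is exp(inverse_temperature * advantage) capped at max_advantage_weight; the function names and the inverse_temperature parameter are hypothetical, and the source does not confirm this exact formulation.

```python
import math
from typing import List, Sequence


def clipped_advantage_weights(
    advantages: Sequence[float],
    inverse_temperature: float,
    max_advantage_weight: float,
) -> List[float]:
    """min(exp(inverse_temperature * A), max_advantage_weight) without overflow."""
    log_cap = math.log(max_advantage_weight)
    # Clip in log space: exp(min(x, log c)) equals min(exp(x), c), and exp() stays finite.
    return [math.exp(min(inverse_temperature * a, log_cap)) for a in advantages]


def weighted_actor_loss(weights: Sequence[float], log_probs: Sequence[float]) -> float:
    """Advantage-weighted regression: mean of -weight * log_prob of the dataset action."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(weights)
```

Clipping before exponentiating matters in plain Python because math.exp raises OverflowError for large arguments; a tensor implementation would typically clamp after the exponential instead.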
Task 3: Add The Offline XQL Trainer
Files:
- Create: src/rl_training/runtime/xql_trainer.py
- Modify: src/rl_training/experiment/registry.py
Step 1: Reuse the offline dataset stack
Build the trainer on _infer_env_spaces(...), _build_offline_dataset(...), and _evaluate_iql_policy(...) from the current offline path.
Step 2: Preserve shared controls
Keep support for:
- eval_interval
- early stopping callbacks
- offline epoch / update budgets
- learning-rate schedules
- checkpoint save / resume
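How a few of these controls interact can be sketched as a bare update loop. This is a hypothetical skeleton, not the existing trainer: run_offline_updates and its callback signatures are assumptions, and learning-rate schedules and checkpointing are left out to keep the example short.

```python
from typing import Callable, List, Optional, Tuple


def run_offline_updates(
    update_fn: Callable[[int], None],
    eval_fn: Callable[[int], float],
    total_updates: int,
    eval_interval: int,
    should_stop: Optional[Callable[[float], bool]] = None,
) -> List[Tuple[int, float]]:
    """Run up to total_updates steps, evaluating every eval_interval steps."""
    history: List[Tuple[int, float]] = []
    for step in range(1, total_updates + 1):
        update_fn(step)  # one gradient update on an offline batch
        if eval_interval > 0 and step % eval_interval == 0:
            score = eval_fn(step)
            history.append((step, score))
            # The early-stopping callback sees each eval score and may end the run.
            if should_stop is not None and should_stop(score):
                break
    return history
```

Keeping the loop dependent only on callbacks is one way to let XQL share the update budget and early-stopping behaviour already used by the IQL lane.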
Step 3: Reuse standard evaluation and prediction
Evaluate with the current deterministic IQL helper and expose package prediction through checkpoint workflows.
Task 4: Wire XQL Into The Package Surface
Files:
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py
- Create: configs/xql/pendulum.yaml
- Create: src/rl_training/assets/configs/xql/pendulum.yaml
- Modify: README.md
Step 1: Add the managed API entrypoint
Expose XQL through the root package and API namespaces.
Step 2: Ship a starter config
Add a packaged offline Pendulum-v1 config with narrow XQL defaults.
Step 3: Update package docs
Document XQL as an IQL-adjacent offline follow-on with an extreme value loss on the same offline data path.
Task 5: Add Unexecuted Coverage
Files:
- Create: tests/test_xql_update.py
- Create: tests/test_xql_trainer_smoke.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_experiment_manager.py
- Modify: tests/test_checkpoint_workflows.py
- Modify: tests/test_package_smoke.py
- Modify: tests/test_cli.py
Step 1: Add learner-level coverage
Add unit tests covering gumbel_rescale_loss(...), the metric keys returned by xql_loss(...), and a single update call.
Step 2: Add trainer smoke coverage
Add a small offline smoke test that checks checkpoint creation and eval wiring.
Step 3: Extend package-surface expectations
Update public exports, managed API, checkpoint workflow, and packaged-config tests so XQL is treated as a shipped algorithm.
Step 4: Keep execution deferred
Do not run the tests until the user explicitly asks for test execution.