OpenAI ES V1 Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a narrow but honest OpenAI ES baseline for vector-observation, continuous-action control by training a deterministic MLP policy with mirrored parameter perturbations and rank-based evolution updates.

Architecture: Keep the release deliberately small and aligned with the current search-based continuous-control package surface. Reuse the new deterministic search-policy model family, existing run directories, checkpointing, evaluation/prediction workflow, action scaling, and managed API, but add a dedicated OpenAI ES learner and trainer built around synchronous positive/negative perturbation rollouts with centered-rank utilities. Explicitly do not implement parallel workers, distributed gradient aggregation, novelty search, observation normalization across processes, or discrete-action variants in this batch.

Tech Stack: Python 3.10, PyTorch, Gymnasium, pytest, existing rl_training runtime and experiment infrastructure.


Task 1: Add failing OpenAI ES coverage

Files:
- Create: tests/test_openai_es_update.py
- Create: tests/test_openai_es_trainer_smoke.py
- Create: tests/test_openai_es_reference_script.py
- Modify: tests/test_package_api_exports.py
- Modify: tests/test_public_api.py
- Modify: tests/test_checkpoint_workflows.py
- Modify: tests/test_experiment_manager.py
- Modify: tests/test_cli.py
- Modify: tests/test_package_smoke.py

Step 1: Write the failing tests
- openai_es_loss(...) returns named search metrics for mirrored returns, utilities, and parameter updates.
- OpenAIES.update(...) consumes perturbations and mirrored rollout returns.
- train_openai_es() writes a checkpoint and evaluation metrics on Pendulum-v1.
- The root, api, and algorithms package exports include OpenAIES.
- Checkpoint workflows can evaluate and resume an openai_es checkpoint.
- The packaged config resolves outside the repo root, and the reference script runs as a smoke command.

Step 2: Run the tests to verify they fail
Run: pytest -q tests/test_openai_es_update.py tests/test_openai_es_trainer_smoke.py tests/test_openai_es_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py tests/test_cli.py tests/test_package_smoke.py
Expected: FAIL with missing openai_es modules / exports.

Task 2: Implement the OpenAI ES learner

Files:
- Create: src/rl_training/algorithms/openai_es.py
- Modify: src/rl_training/algorithms/__init__.py

Step 1: Write minimal implementation
- Reuse MLPARSModel as the deterministic search policy.
- Implement centered-rank utilities over mirrored returns and evolution-style parameter updates.
- Keep scope explicit: vector observations only, continuous Box actions only, synchronous rollout evaluation only.
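
The centered-rank and mirrored-update steps above can be sketched in NumPy as follows. This is a minimal illustration, not the planned implementation: the function names centered_ranks and mirrored_es_step, the flat parameter vector theta, and the 1 / (2 * N * sigma) scaling of the estimate are assumptions made for this example. The real learner operates on MLPARSModel parameters and may fold the scaling into the learning rate or an optimizer.

```python
import numpy as np


def centered_ranks(returns: np.ndarray) -> np.ndarray:
    """Map returns to rank-based utilities in [-0.5, 0.5].

    Ranks are computed jointly over every entry, so positive and
    negative rollouts compete in a single ranking.
    """
    flat = returns.ravel()
    ranks = np.empty(flat.size, dtype=np.float64)
    ranks[np.argsort(flat)] = np.arange(flat.size)  # 0 = worst return
    if flat.size > 1:
        ranks /= flat.size - 1  # scale ranks to [0, 1]
    return ranks.reshape(returns.shape) - 0.5


def mirrored_es_step(
    theta: np.ndarray,        # (D,) current flat policy parameters
    noise: np.ndarray,        # (N, D) one perturbation direction per pair
    returns_pos: np.ndarray,  # (N,) returns for theta + sigma * noise
    returns_neg: np.ndarray,  # (N,) returns for theta - sigma * noise
    sigma: float,
    learning_rate: float,
) -> np.ndarray:
    """Return updated parameters after one mirrored ES ascent step."""
    utilities = centered_ranks(np.stack([returns_pos, returns_neg], axis=1))
    # Each direction is weighted by how much better its positive
    # rollout ranked than its mirrored negative rollout.
    weights = utilities[:, 0] - utilities[:, 1]
    # Assumed scaling for this sketch: 1 / (2 * N * sigma).
    grad_estimate = weights @ noise / (2 * noise.shape[0] * sigma)
    return theta + learning_rate * grad_estimate
```

Because utilities depend only on rank order, rescaling or shifting the raw returns leaves the update unchanged, which is the main reason to prefer them over raw returns on environments such as Pendulum-v1 where return magnitudes vary widely.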

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_openai_es_update.py
Expected: PASS.

Task 3: Implement trainer and workflow integration

Files:
- Create: src/rl_training/runtime/openai_es_trainer.py
- Modify: src/rl_training/experiment/registry.py
- Modify: src/rl_training/api/algorithms.py
- Modify: src/rl_training/api/__init__.py
- Modify: src/rl_training/__init__.py

Step 1: Write minimal implementation
- Reuse the current continuous-control action scaling and evaluation path.
- Collect mirrored perturbation rollouts synchronously for each evolution update.
- Support train / eval / resume / predict through registry wiring.
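
The synchronous collection step above might look like the sketch below. It is hedged throughout: collect_mirrored_returns and the evaluate callable are hypothetical names introduced for illustration, and evaluate stands in for whatever the trainer uses to run one episode with a given flat parameter vector and return its total reward. No Gymnasium or rl_training calls are shown.

```python
from typing import Callable, Tuple

import numpy as np


def collect_mirrored_returns(
    theta: np.ndarray,
    evaluate: Callable[[np.ndarray], float],  # hypothetical episode runner
    num_pairs: int,
    sigma: float,
    rng: np.random.Generator,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Evaluate theta +/- sigma * noise for each sampled direction, in order."""
    noise = rng.standard_normal((num_pairs, theta.size))
    returns_pos = np.empty(num_pairs)
    returns_neg = np.empty(num_pairs)
    # Synchronous by design: one rollout at a time, no worker processes.
    for i in range(num_pairs):
        returns_pos[i] = evaluate(theta + sigma * noise[i])
        returns_neg[i] = evaluate(theta - sigma * noise[i])
    return noise, returns_pos, returns_neg
```

Returning the noise matrix alongside both return vectors gives the learner everything it needs for one update, which matches the plan's description of OpenAIES.update(...) consuming perturbations and mirrored rollout returns.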

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_openai_es_trainer_smoke.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py
Expected: PASS.

Task 4: Add config, example, and docs

Files:
- Create: configs/openai_es/pendulum.yaml
- Create: src/rl_training/assets/configs/openai_es/pendulum.yaml
- Create: examples/openai_es_pendulum_reference.py
- Modify: README.md
- Modify: docs/plans/2026-03-12-rl-yearly-sourcebook-design.md

Step 1: Write minimal implementation
- Add a runnable openai_es Pendulum preset.
- Add a reference script for a tiny synchronous run.
- Update README and yearly sourcebook to mark OpenAI ES as implemented in a narrow v1 form.

Step 2: Run the tests to verify they pass
Run: pytest -q tests/test_openai_es_reference_script.py tests/test_cli.py tests/test_package_smoke.py
Expected: PASS.

Task 5: Regression verification

Files:
- Modify only if verification reveals regressions.

Step 1: Run focused regression coverage
Run: pytest -q tests/test_openai_es_update.py tests/test_openai_es_trainer_smoke.py tests/test_openai_es_reference_script.py tests/test_package_api_exports.py tests/test_public_api.py tests/test_checkpoint_workflows.py tests/test_experiment_manager.py tests/test_cli.py tests/test_package_smoke.py tests/test_ars_update.py tests/test_ars_trainer_smoke.py tests/test_ddpg_update.py tests/test_ddpg_trainer_smoke.py tests/test_td3_update.py tests/test_td3_trainer_smoke.py
Expected: PASS.

Step 2: Run full suite
Run: pytest -q
Expected: PASS.