Benchmark Normalization and Best Checkpoint Tracking Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Make training runs benchmark-ready by adding optional score normalization metadata and automatic best-checkpoint tracking.
Architecture: Extend TrainConfig with an optional top-level benchmark mapping. Add a small benchmarking utility module that resolves score-normalization settings and computes human-normalized scores from evaluation returns. Update save_training_checkpoint(...) so every trainer gets checkpoints/best.pt, run metadata for the best checkpoint, and benchmark-augmented metrics without changing each trainer implementation. Update evaluate_checkpoint(...) to return normalized metrics when the saved config includes benchmark settings.
Tech Stack: Python 3.10+, existing checkpointing/run metadata flow, pytest.
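The optional top-level benchmark mapping described under Architecture could be carried on the config roughly as follows. This is only a sketch: `TrainConfigSketch` is a hypothetical stand-in for the real `TrainConfig`, and the `env_id` and `seed` fields are placeholders, since the plan does not list the existing fields.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class TrainConfigSketch:
    """Hypothetical stand-in for TrainConfig; env_id and seed are placeholder fields."""

    env_id: str
    seed: int = 0
    # Optional top-level mapping; None means the run has no benchmark settings.
    benchmark: dict[str, Any] | None = None

    def to_dict(self) -> dict[str, Any]:
        # asdict copies nested mappings, so the result is safe to serialize or mutate.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "TrainConfigSketch":
        benchmark = data.get("benchmark")
        if benchmark is not None and not isinstance(benchmark, dict):
            raise TypeError("benchmark must be a mapping when provided")
        return cls(env_id=data["env_id"], seed=data.get("seed", 0), benchmark=benchmark)
```

Keeping the field optional means configs saved before this change still load, and a round trip through `to_dict` / `from_dict` preserves the mapping for checkpoint restore.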
Task 1: Add failing benchmark tests
Files:
- Create: tests/test_benchmarking.py
- Modify: tests/test_checkpoint_workflows.py
Step 1: Write the failing tests
- Add a unit test that saves multiple checkpoints and verifies:
  - checkpoints/best.pt is created
  - the best checkpoint tracks the best eval_return_mean
  - normalized score metrics are added when benchmark score references are configured
- Add a workflow test proving evaluate_checkpoint(...) returns the normalized metric from the saved config.
Step 2: Run test to verify it fails
- Run: pytest -q tests/test_benchmarking.py tests/test_checkpoint_workflows.py
- Expected: failures because benchmark config, normalization, and best-checkpoint tracking do not exist yet.
Task 2: Implement benchmark config and normalization utilities
Files:
- Modify: src/rl_training/experiment/config.py
- Create: src/rl_training/experiment/benchmarking.py
- Modify: src/rl_training/cli.py
- Modify: src/rl_training/runtime/workflows.py
Step 1: Write minimal implementation
- Add top-level benchmark config support to TrainConfig, config loading, serialization, and checkpoint restore.
- Implement score normalization helpers for human-random scaling.
- Augment evaluate_checkpoint(...) with normalized metrics when benchmark config exists.
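Human-random scaling is conventionally computed as (score - random_score) / (human_score - random_score), so random play maps to 0.0 and human play to 1.0. The helpers below are a minimal sketch of that idea; the function names and the `eval_human_normalized_score` metric key are assumptions for illustration, not names fixed by this plan.

```python
from __future__ import annotations

from typing import Mapping, Optional


def human_normalized_score(raw: float, random_score: float, human_score: float) -> float:
    """Scale a raw return so random play maps to 0.0 and human play to 1.0."""
    span = human_score - random_score
    if span == 0:
        raise ValueError("human_score and random_score must differ")
    return (raw - random_score) / span


def augment_metrics(
    metrics: Mapping[str, float],
    benchmark: Optional[Mapping[str, float]],
) -> dict:
    """Return a copy of metrics, adding a normalized score when references are configured."""
    out = dict(metrics)
    # Without benchmark settings or an eval return there is nothing to normalize.
    if not benchmark or "eval_return_mean" not in out:
        return out
    if "random_score" in benchmark and "human_score" in benchmark:
        # Hypothetical key name; the real metric name is not specified in the plan.
        out["eval_human_normalized_score"] = human_normalized_score(
            out["eval_return_mean"],
            benchmark["random_score"],
            benchmark["human_score"],
        )
    return out
```

Returning the metrics unchanged when no benchmark mapping is present keeps the feature opt-in, which matches the "when benchmark config exists" condition above.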
Step 2: Run focused tests
- Run: pytest -q tests/test_benchmarking.py tests/test_checkpoint_workflows.py
Task 3: Implement best checkpoint tracking in run utilities
Files:
- Modify: src/rl_training/experiment/runs.py
- Modify: src/rl_training/runtime/run_utils.py
Step 1: Write minimal implementation
- Track best checkpoint according to benchmark.best_metric / benchmark.best_metric_mode, defaulting to eval_return_mean / max.
- Save/update checkpoints/best.pt.
- Persist best-checkpoint metadata in metadata.json.
- Add best-checkpoint fields to the metrics dict returned by trainers.
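One way the tracking steps above might fit together is sketched here. The function names, the `best_checkpoint` metadata key, and the `best_checkpoint_*` metric fields are all hypothetical; the real save_training_checkpoint(...) signature and metadata layout are not shown in this plan, so treat this as an illustration of the comparison-and-copy logic only.

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path


def is_improvement(candidate: float, best: float | None, mode: str = "max") -> bool:
    """Return True when candidate beats the best value recorded so far."""
    if mode not in ("max", "min"):
        raise ValueError(f"unsupported best_metric_mode: {mode!r}")
    if best is None:
        return True
    return candidate > best if mode == "max" else candidate < best


def update_best_checkpoint(
    run_dir: Path,
    checkpoint_path: Path,
    metrics: dict,
    best_metric: str = "eval_return_mean",
    mode: str = "max",
) -> dict:
    """Copy checkpoint_path to checkpoints/best.pt if it improves on the recorded best."""
    metadata_path = run_dir / "metadata.json"
    metadata = json.loads(metadata_path.read_text()) if metadata_path.exists() else {}
    best = metadata.get("best_checkpoint")  # hypothetical metadata key
    value = metrics.get(best_metric)
    previous = best["value"] if best else None
    if value is not None and is_improvement(value, previous, mode):
        best_path = run_dir / "checkpoints" / "best.pt"
        best_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(checkpoint_path, best_path)
        best = {
            "metric": best_metric,
            "mode": mode,
            "value": value,
            "source": checkpoint_path.name,
        }
        metadata["best_checkpoint"] = best
        metadata_path.write_text(json.dumps(metadata, indent=2))
    # Return a copy so the caller's metrics dict is not mutated.
    out = dict(metrics)
    if best:
        out["best_checkpoint_value"] = best["value"]
        out["best_checkpoint_source"] = best["source"]
    return out
```

Doing this inside the shared checkpoint-saving path, rather than in each trainer, is what lets every trainer pick up checkpoints/best.pt without trainer-specific changes.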
Step 2: Run focused tests
- Run: pytest -q tests/test_benchmarking.py tests/test_checkpoint_workflows.py
Task 4: Document benchmark config usage
Files:
- Modify: README.md
Step 1: Add docs
- Show benchmark config with random_score, human_score, and best_metric.
- Document checkpoints/best.pt and normalized eval metrics.
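As a starting point for the README example, the benchmark mapping might look like the following. It is written as a Python dict because the plan does not say which on-disk config format the project uses; the reference scores are illustrative values only, and `resolve_benchmark` is a hypothetical helper showing how the documented defaults could be filled in.

```python
from __future__ import annotations

from typing import Any, Mapping

# Illustrative values only; real reference scores depend on the environment.
EXAMPLE_CONFIG: dict[str, Any] = {
    "benchmark": {
        "random_score": 0.0,
        "human_score": 100.0,
        "best_metric": "eval_return_mean",
    },
}


def resolve_benchmark(config: Mapping[str, Any]) -> dict[str, Any] | None:
    """Return benchmark settings with defaults filled in, or None when absent."""
    raw = config.get("benchmark")
    if raw is None:
        return None
    # Defaults follow the plan: eval_return_mean / max unless overridden.
    resolved = {"best_metric": "eval_return_mean", "best_metric_mode": "max", **raw}
    if resolved["best_metric_mode"] not in ("max", "min"):
        raise ValueError("best_metric_mode must be 'max' or 'min'")
    return resolved
```

With this example, `resolve_benchmark(EXAMPLE_CONFIG)` keeps the three configured keys and adds `best_metric_mode` set to `"max"`, while a config with no benchmark mapping resolves to `None`.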
Task 5: Verification
Run:
- Focused: pytest -q tests/test_benchmarking.py tests/test_checkpoint_workflows.py
- Broader: pytest -q
Notes:
- This plan intentionally omits commits because the session instructions forbid committing unless explicitly requested.