跳转至

Zoo Leaderboard Compare-To Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a higher-level leaderboard switch that lets users compare peak performance versus final performance without needing to understand the full leaderboard metric alias set.

Architecture: Extend src/rl_training/zoo_cli.py with a leaderboard-only --compare-to latest|best option that resolves into the existing leaderboard metric alias layer. When no explicit --leaderboard-metric or raw --sort-by is provided, compare_to=latest should select latest-normalized on normalized benchmarks and latest-return on unnormalized ones; compare_to=best should analogously resolve to best-*. Surface the resolved compare_to and leaderboard_metric in leaderboard payload metadata and renderers.

Tech Stack: Python 3.10+, argparse, existing zoo CLI/report serializers, pytest.


Task 1: Add failing regression tests

Files: - Modify: tests/test_cli.py

Step 1: Write the failing tests - Add a leaderboard JSON test verifying --compare-to latest on a normalized benchmark resolves to latest-normalized and sorts by latest normalized score. - Add a leaderboard JSON test verifying --compare-to latest on an unnormalized manifest resolves to latest-return and sorts by latest return.

Step 2: Run test to verify it fails

Run: pytest -q tests/test_cli.py

Expected: FAIL because --compare-to is not supported yet.

Task 2: Implement compare-to resolution

Files: - Modify: src/rl_training/zoo_cli.py - Modify: src/rl_training/cli.py

Step 1: Write minimal implementation - Add --compare-to parser support in both CLI entry points. - Resolve compare_to through the existing leaderboard metric alias layer only when no explicit --leaderboard-metric or --sort-by is provided. - Carry compare_to into leaderboard payload metadata and text/CSV outputs when it is the active resolution path.

Step 2: Run focused tests

Run: pytest -q tests/test_cli.py

Expected: PASS.

Task 3: Document compare-to usage

Files: - Modify: README.md - Modify: zoo/README.md - Modify: src/rl_training/assets/zoo/README.md

Step 1: Add docs - Show --compare-to latest. - Explain that it is the higher-level “best vs latest” switch while --leaderboard-metric remains the lower-level metric selector.

Task 4: Verification

Run: - Focused: pytest -q tests/test_zoo_presets.py tests/test_cli.py - Broader: pytest -q

Notes: - This plan intentionally omits commits because the session instructions forbid committing unless explicitly requested.