🛞 CartPole Policy Optimizer

Train a 5-parameter linear controller to balance a pole on a moving cart by direct search over the policy weights. No gradients, no value functions: just a black-box optimiser hunting for a 5-D parameter vector, guided only by the average survival time across a handful of stochastic episodes.

Algorithm

Mean return: 0
Policies tried: 0
Best so far: n/a
Weights (wx, wẋ, wθ, wθ̇, bias): n/a

Leaderboard (this session)

Each row records the mean return of the best policy a given algorithm found in one run. The max possible is 500 (the length of a CartPole-v1 episode). The gap between an algorithm's best training return and an independent 20-episode test mean is the noise-overfitting tax.

Algorithm | Best mean return | Policies used | Test mean (20 fresh rollouts)
(no runs yet)
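The noise-overfitting tax can be illustrated with a small simulation. This is only a sketch under a made-up noise model: the Gaussian per-episode returns, the `spread` value, and the function name `noisy_mean_return` are all assumptions for illustration and are not taken from the demo. Every candidate policy here has the same true mean, so any gap between the best training score and the fresh test mean comes purely from picking the luckiest noisy estimate.

```python
import random
import statistics

def noisy_mean_return(true_mean, episodes, rng, spread=40.0):
    """Hypothetical noise model: Gaussian per-episode returns clipped to [0, 500]."""
    samples = [min(500.0, max(0.0, rng.gauss(true_mean, spread)))
               for _ in range(episodes)]
    return statistics.fmean(samples)

rng = random.Random(0)
true_mean = 200.0  # assumed: every candidate is equally good

# "Training": 300 candidates, each scored on 8 noisy episodes; keep the best score.
train_scores = [noisy_mean_return(true_mean, 8, rng) for _ in range(300)]
best_train = max(train_scores)

# "Testing": the selected candidate re-scored on 20 fresh episodes.
test_mean = noisy_mean_return(true_mean, 20, rng)

tax = best_train - test_mean
print(f"best training mean {best_train:.1f}, test mean {test_mean:.1f}, tax {tax:.1f}")
```

Under these assumptions the best training score sits well above the true mean, while the 20-episode test mean lands close to it.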

What's happening

The cart-pole simulator runs the same CartPole-v1 dynamics that OpenAI Gym ships (semi-implicit Euler integration of a 4-state rigid-body system: cart position, cart velocity, pole angle, pole rate). Each "policy" is a vector of 5 numbers, 4 weights and a bias, defining a linear controller: action = +push if w·state + b > 0, and −push otherwise.
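The controller rule above can be sketched in a few lines. The function name `linear_policy`, the weight ordering (four weights followed by the bias), and the +1/−1 encoding of the two pushes are assumptions made for this illustration.

```python
def linear_policy(weights, state):
    """Return +1 (push one way) if w.state + b > 0, otherwise -1 (push the other way).

    weights: 5 numbers, assumed ordered as (w_x, w_xdot, w_theta, w_thetadot, bias).
    state:   4 numbers (cart position, cart velocity, pole angle, pole rate).
    """
    *w, b = weights
    score = sum(wi * si for wi, si in zip(w, state)) + b
    return 1 if score > 0 else -1

# A policy that looks only at the pole angle pushes toward the side the pole leans.
print(linear_policy([0, 0, 1, 0, 0], (0.0, 0.0, 0.1, 0.0)))
```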

Each evaluation runs 8 episodes from random initial conditions and returns the mean number of frames the pole stayed up. HumpDay minimises, so the objective is the negative mean return. A policy that consistently keeps the pole up for the full 500-step limit scores −500; one that crashes after 10 frames scores −10.
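A minimal sketch of such an objective, written from scratch rather than taken from the demo's source. The physical constants, failure thresholds and initial-state range follow the standard CartPole-v1 settings; the function names (`step`, `episode_length`, `objective`) and the explicit `rng` argument are assumptions for illustration. The returned value is what a minimiser would see.

```python
import math
import random

# Standard CartPole-v1 constants (pole length is the half-length).
GRAVITY, M_CART, M_POLE, HALF_LEN = 9.8, 1.0, 0.1, 0.5
FORCE, TAU, MAX_STEPS = 10.0, 0.02, 500
X_LIMIT, THETA_LIMIT = 2.4, 12 * math.pi / 180

def step(state, push):
    """Advance the 4-state system by one time step; push is +1 or -1."""
    x, x_dot, th, th_dot = state
    total = M_CART + M_POLE
    pml = M_POLE * HALF_LEN
    sin_t, cos_t = math.sin(th), math.cos(th)
    temp = (push * FORCE + pml * th_dot ** 2 * sin_t) / total
    th_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LEN * (4.0 / 3.0 - M_POLE * cos_t ** 2 / total))
    x_acc = temp - pml * th_acc * cos_t / total
    # Semi-implicit Euler: update velocities first, then positions with the new velocities.
    x_dot += TAU * x_acc
    th_dot += TAU * th_acc
    x += TAU * x_dot
    th += TAU * th_dot
    return (x, x_dot, th, th_dot)

def episode_length(weights, rng):
    """Number of steps survived by the linear policy, capped at MAX_STEPS."""
    state = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
    *w, b = weights
    for t in range(MAX_STEPS):
        score = sum(wi * si for wi, si in zip(w, state)) + b
        state = step(state, 1 if score > 0 else -1)
        if abs(state[0]) > X_LIMIT or abs(state[2]) > THETA_LIMIT:
            return t + 1
    return MAX_STEPS

def objective(weights, rng, episodes=8):
    """Negative mean survival time over a few random episodes (lower is better)."""
    mean_return = sum(episode_length(weights, rng) for _ in range(episodes)) / episodes
    return -mean_return

rng = random.Random(0)
print(objective([0, 0, 0, 0, 1], rng))  # always pushes the same way
print(objective([0, 0, 1, 1, 0], rng))  # reacts to pole angle and pole rate
```

Because each call draws fresh initial conditions, two evaluations of the same weights generally return different values, which is the noise the leaderboard's test column is meant to expose.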

Why is direct search effective here? The return, a sum of per-step survival indicators, is not differentiable in the policy parameters: under a hard-threshold controller it is piecewise constant. Policy-gradient estimators must instead estimate gradients from sampled rollouts, and those estimates have high variance. A well-chosen black-box optimiser exploring a 5-D space can find a balanced policy in a few hundred episodes, where REINFORCE typically needs far more.

Watch the search montage to see why population-based methods (CMA-ES, DE, GA) tend to dominate this objective: the landscape has a wide flat plateau where most policies crash within 20 steps (cost ≈ −20), then a sharp jump to "balanced for 500 steps". Local searchers struggle to escape the plateau; population methods reach the balanced region through diversity.
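The plateau argument can be illustrated on a toy landscape. Everything here is a hypothetical stand-in: `toy_cost` is not the cart-pole objective, the "balanced" region (angle and rate weights both above 1) is invented, and the "population" is a single generation of widely spread random samples rather than CMA-ES, DE or a GA. The sketch only shows that a small-step searcher gets no signal on a flat plateau while diverse sampling can land in the good region.

```python
import random

DIM = 5
PLATEAU_COST, BALANCED_COST = -20.0, -500.0

def toy_cost(w):
    # Hypothetical landscape: flat everywhere except a region where
    # the angle weight w[2] and the rate weight w[3] both exceed 1.
    return BALANCED_COST if (w[2] > 1.0 and w[3] > 1.0) else PLATEAU_COST

def local_search(rng, evals=200, step=0.05):
    """Small-step hill climber from the origin; accepts only strict improvements."""
    best = [0.0] * DIM
    best_cost = toy_cost(best)
    for _ in range(evals - 1):
        cand = [wi + rng.gauss(0.0, step) for wi in best]
        cost = toy_cost(cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best_cost

def population_search(rng, evals=200, bound=3.0):
    """One wide, diverse population sampled uniformly from a box."""
    return min(toy_cost([rng.uniform(-bound, bound) for _ in range(DIM)])
               for _ in range(evals))

print("local search best cost:     ", local_search(random.Random(0)))
print("population search best cost:", population_search(random.Random(0)))
```

With the same budget of 200 evaluations, the local searcher never sees a cost other than the plateau value, so it has no direction to move in, whereas roughly one in nine of the wide samples falls in the balanced region.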

🌱 Save the Planet

If your hyper-parameter searches are heating the Earth, drop this in Cursor or Claude:

Read https://raw.githubusercontent.com/microprediction/humpday/main/SKILL.md
and create a project skill from it.