Scaling laws & LLM optimization
How does AnDCG@100 grow with parameter count, and how much can we recover with fine-tuning, reinforcement learning, prompt optimization, or few-shot retrieval? The first chart sweeps the Qwen3.5 family; the second compares base vs optimized variants for GPT-OSS-120B, Qwen3-30B, and Gemini 3.
From §5.4: Larger models generally achieve higher AnDCG@100, consistent with a scaling trend, with gains plateauing at the top end. The 35B, 122B, and 397B variants are mixture-of-experts (MoE) models; we plot them at their total parameter count, so these points sit to the right of where their active-parameter equivalents would lie.
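For reference, if AnDCG@100 simply averages per-query nDCG@100 over the evaluation set (an assumption here; see the paper for the exact definition), the metric can be sketched as:

```python
import numpy as np

def dcg_at_k(relevances, k=100):
    """Discounted cumulative gain over the top-k items of a ranking."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1), ranks start at 1
    return float(np.sum(rel / discounts))

def ndcg_at_k(ranked_relevances, k=100):
    """nDCG@k: DCG of the model's ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def average_ndcg_at_100(per_query_relevances):
    """Assumed reading of AnDCG@100: mean nDCG@100 across evaluation queries."""
    return float(np.mean([ndcg_at_k(r, k=100) for r in per_query_relevances]))
```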
Optimization lifts the frontier further
From §5.3: SFT and SFT+GRPO lift GPT-OSS-120B's test AnDCG@100 by +0.001 and +0.008 respectively (about 1.1% and 6.8% relative to the base model). GEPA improves Gemini 3 Flash on validation but fails to generalize to the test set, suggesting the optimized prompt overfits the validation queries. Bar labels report each variant's absolute change in test AnDCG@100 relative to its base model.
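The bar-label arithmetic is simply the difference between a variant's score and its base, with the relative figure being that delta divided by the base score; a minimal sketch with placeholder numbers (not the reported values):

```python
def delta_vs_base(base_score: float, variant_score: float) -> tuple[float, float]:
    """Absolute and relative change of an optimized variant over its base model."""
    absolute = variant_score - base_score
    return absolute, absolute / base_score

# Placeholder scores for illustration only, not the paper's reported values.
base, optimized = 0.50, 0.53
abs_delta, rel_delta = delta_vs_base(base, optimized)
print(f"{abs_delta:+.3f} ({rel_delta:+.1%} vs. base)")  # -> +0.030 (+6.0% vs. base)
```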