Test your judge on sample data to see exactly what inputs/outputs it receives
openai/gpt-5
anthropic/claude-3.5-sonnet
This will automatically: Export selected traces → Run weak models → Evaluate with judge
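
The three steps named above (export, run weak models, judge) can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: `Trace`, `JudgedResult`, `run_pipeline`, `exact_match_judge`, and the judge signature `(prompt, reference, candidate) -> score` are all hypothetical names chosen for the example, and the "weak model" is a local stub rather than a real API call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class Trace:
    """Hypothetical shape of a recorded trace."""
    trace_id: str
    prompt: str
    reference_output: str


@dataclass
class JudgedResult:
    trace_id: str
    model: str
    output: str
    score: float


def exact_match_judge(prompt: str, reference: str, candidate: str) -> float:
    """Toy judge (assumption): 1.0 if the candidate equals the reference, else 0.0."""
    return 1.0 if candidate == reference else 0.0


def run_pipeline(
    traces: Iterable[Trace],
    selected_ids: Set[str],
    weak_models: Dict[str, Callable[[str], str]],
    judge: Callable[[str, str, str], float],
) -> List[JudgedResult]:
    # Step 1: export only the selected traces.
    exported = [t for t in traces if t.trace_id in selected_ids]

    # Step 2: run every weak model on every exported prompt.
    generations = [
        (t, name, model(t.prompt))
        for t in exported
        for name, model in weak_models.items()
    ]

    # Step 3: score each generation with the judge.
    return [
        JudgedResult(t.trace_id, name, output, judge(t.prompt, t.reference_output, output))
        for t, name, output in generations
    ]


if __name__ == "__main__":
    sample = [Trace("a", "hi", "HI"), Trace("b", "yo", "no")]
    # "stub-upper" is a stand-in model that upper-cases its prompt.
    for result in run_pipeline(sample, {"a"}, {"stub-upper": str.upper}, exact_match_judge):
        print(result)
```

Running the judge on a couple of hand-written traces like this makes it easy to print the exact arguments it receives before running it on a full export.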