Test your judge on sample data to see exactly what inputs and outputs it receives.
This automatically runs three steps: export the selected traces → run the weak models → evaluate with the judge.
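The three steps can be pictured with a minimal Python sketch. Everything in it is hypothetical and for illustration only: the names `Trace`, `JudgeRecord`, `export_traces`, `run_weak_models`, `evaluate_with_judge`, and `dry_run_judge` are assumptions, not the tool's actual API, and the weak models and the judge are modeled as plain callables from string to string. The point is that each record keeps the exact text passed to the judge next to what the judge returned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Trace:
    """A stored trace (hypothetical shape: an id plus the original prompt)."""
    trace_id: str
    prompt: str


@dataclass
class JudgeRecord:
    """What the judge saw and what it returned for one model response."""
    trace_id: str
    model_name: str
    judge_input: str
    judge_output: str


def export_traces(traces: List[Trace], selected_ids: Set[str]) -> List[Trace]:
    # Step 1: keep only the traces the user selected.
    return [t for t in traces if t.trace_id in selected_ids]


def run_weak_models(
    traces: List[Trace], models: Dict[str, Callable[[str], str]]
) -> List[Tuple[Trace, str, str]]:
    # Step 2: run every weak model on every exported trace.
    return [(t, name, fn(t.prompt)) for t in traces for name, fn in models.items()]


def evaluate_with_judge(
    runs: List[Tuple[Trace, str, str]], judge: Callable[[str], str]
) -> List[JudgeRecord]:
    # Step 3: build the judge input, call the judge, and record both sides.
    records = []
    for trace, model_name, response in runs:
        judge_input = f"PROMPT:\n{trace.prompt}\n\nRESPONSE:\n{response}"
        records.append(
            JudgeRecord(trace.trace_id, model_name, judge_input, judge(judge_input))
        )
    return records


def dry_run_judge(
    traces: List[Trace],
    selected_ids: Set[str],
    models: Dict[str, Callable[[str], str]],
    judge: Callable[[str], str],
) -> List[JudgeRecord]:
    # Chain the three steps in order: export -> run weak models -> judge.
    selected = export_traces(traces, selected_ids)
    runs = run_weak_models(selected, models)
    return evaluate_with_judge(runs, judge)
```

Printing `judge_input` and `judge_output` for each returned record shows exactly what the judge received and produced on the sample data.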