Drop benchmark files here, or browse to select files

Supports: judge_*.json · *_full_results.json · *_summary.json · *_qa_results.jsonl
Loaded:
Filter:
# · ID · Type · Question · Hypothesis · Expected · Verdict · Failure · Latency · Source

Result Detail