tokenjam 0.5.1 n=5 tasks · k=1 sample(s) · deepseek:deepseek-reasoner → deepseek:deepseek-chat
| task | orig | cand | candidate detail |
|---|---|---|---|
| judged/refund-policy | 0/1 | 0/1 | deepeval:correctness@deepseek:deepseek-chat score=0.00 (>= 0.5) — The actual output does n |
| judged/capital | 1/1 | 1/1 | deepeval:correctness@deepseek:deepseek-chat score=1.00 (>= 0.5) — The actual output exactl |
| judged/retry-summary | 1/1 | 1/1 | deepeval:correctness@deepseek:deepseek-chat score=1.00 (>= 0.5) — The actual output states |
| judged/shipping | 0/1 | 0/1 | deepeval:correctness@deepseek:deepseek-chat score=0.20 (>= 0.5) — The actual output adds c |
| judged/define-llm | 0/1 | 0/1 | deepeval:correctness@deepseek:deepseek-chat score=0.30 (>= 0.5) — The actual output descri |