CheckLLM Competitor Benchmark
| Framework | Dataset | Metric family | AUC | Best F1 | Spearman ρ | n | Mean latency (ms) | Total cost (USD) | Rank |
|---|---|---|---|---|---|---|---|---|---|
| checkllm | halubench | hallucination | 0.783 | 0.796 | 0.544 | 200 | 2415.4 | 0.03430 | 1 |
| deepeval | halubench | hallucination | 0.553 | 0.701 | 0.151 | 200 | 4456.8 | 0.00000 | 3 |
| promptfoo | halubench | hallucination | 0.753 | 0.791 | 0.510 | 200 | 1801.8 | 0.02925 | 2 |
| deepeval | ragtruth | context_relevance | 0.435 | 0.854 | -0.100 | 200 | 20571.6 | 0.00000 | 3 |
| promptfoo | ragtruth | context_relevance | 0.500 | 0.854 | n/a | 200 | 1363.5 | 0.04230 | 2 |
| checkllm | ragtruth | context_relevance | 0.565 | 0.856 | 0.125 | 200 | 2350.9 | 0.06226 | 1 |
| checkllm | ragtruth | faithfulness | 0.754 | 0.861 | 0.424 | 200 | 11877.5 | 0.06127 | 1 |
| deepeval | ragtruth | faithfulness | 0.631 | 0.854 | 0.205 | 200 | 17191.2 | 0.00000 | 2 |
| promptfoo | ragtruth | faithfulness | 0.534 | 0.856 | 0.090 | 200 | 1692.8 | 0.04406 | 3 |
| checkllm | ragtruth | hallucination | 0.663 | 0.871 | 0.398 | 200 | 2728.3 | 0.04421 | 1 |
| deepeval | ragtruth | hallucination | 0.588 | 0.869 | 0.311 | 200 | 3669.5 | 0.00000 | 2 |
| promptfoo | ragtruth | hallucination | 0.513 | 0.855 | 0.081 | 200 | 1602.0 | 0.04412 | 3 |
| checkllm | truthfulqa | answer_relevancy | 0.546 | 0.667 | 0.085 | 400 | 6643.1 | 0.02132 | 1 |
| deepeval | truthfulqa | answer_relevancy | 0.438 | 0.667 | -0.122 | 400 | 30595.8 | 0.00000 | 2 |
| promptfoo | truthfulqa | answer_relevancy | 0.392 | 0.667 | -0.233 | 400 | 1175.6 | 0.02469 | 3 |
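
For reference, the sketch below shows one way the AUC, best-F1, and Spearman columns could be computed from per-example framework scores and binary gold labels, and how a rank like the one above follows from ordering by AUC within each dataset/metric-family group. The function, column names, and toy inputs are illustrative assumptions, not the benchmark's actual pipeline.

```python
# Minimal sketch, assuming each framework emits a continuous score per example
# and each example has a binary gold label. Names and toy data are illustrative.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.metrics import precision_recall_curve, roc_auc_score


def summarize(scores, labels):
    """Summarize one (framework, dataset, metric_family) group:
    ROC AUC, best F1 over all score thresholds, and Spearman rho
    between the framework's scores and the binary gold labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    precision, recall, _ = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    rho, _ = spearmanr(scores, labels)
    return {
        "auc": roc_auc_score(labels, scores),
        "best_f1": float(f1.max()),
        "spearman": float(rho),
        "n": len(labels),
    }


# Toy example (synthetic scores/labels, not benchmark data):
print(summarize([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]))

# The rank column is consistent with ordering by AUC within each
# dataset/metric_family group, e.g. for halubench hallucination:
halubench = pd.DataFrame(
    {
        "framework": ["checkllm", "deepeval", "promptfoo"],
        "auc": [0.783, 0.553, 0.753],
    }
)
halubench["rank"] = halubench["auc"].rank(ascending=False, method="min").astype(int)
print(halubench)
```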