Evaluation report · regenerated with stronger charts · 13 May 2026

File tools make the eval solvable

Analysis of /home/ssmith/source/hf-mcp-server/eval/file-tools/out/final-runs/20260513T203044Z-blended-compact. Across all scored runs, blended accuracy rises from 28.0% (323/1155) without file tools to 82.1% (948/1155) with default file tools. On the same 165-row blended set, all seven scored models improve; the average lift is 54.1 percentage points.

3 scenarios · 21 scored model-runs · 3,213 scored row evaluations · 6 Matplotlib SVG charts · all Grok data now scored
28.0% · 01-no-file-tools-blended · 323/1155 · 7 scored models
39.8% · 01-no-file-tools-hard · 359/903 · 7 scored models
82.1% · 02-default-file-tools-blended · 948/1155 · 7 scored models
01

What the data says

Main conclusion

File tools do not merely improve performance; they change the problem from mostly unsolved to mostly solved. The default file-tools condition scores 82.1% overall, against 28.0% on the same blended set without file tools.

Model story

responses-gpt-5-4-mini leads at 91.5%. kimi reaches 86.1%; grok is now scored and lands at 83.0%.

Category story

The largest gains land in the file-centric categories, as expected: repo_files (+85.7 pp), dataset_preview (+82.1 pp), file_search_or_filter (+82.1 pp), and exact_file_read (+74.4 pp).

Remaining work

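The remaining low-scoring areas are not about generic tool access; they are routing and intent distinctions: exact versus discovered file reads (exact_or_discovered_file_read, 28.6% with file tools) and ambiguous source-code repo-detail requests (source_code_ambiguous_repo_details, 28.6% with file tools, down from 57.1% without).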

02

Decision charts

These are Matplotlib-rendered SVG charts embedded directly in the report. They emphasize the decision questions: how much file tools help, where they help, which models benefit, and what remains to tune.

[Charts: six SVG images, Matplotlib v3.10.9 (https://matplotlib.org/), rendered 2026-05-13]
03

Completed file-tools ranking

1. responses-gpt-5-4-mini

91.5% ok · 151/165 rows · mean 2.91s

2. kimi

86.1% ok · 142/165 rows · mean 4.57s

3. grok

83.0% ok · 137/165 rows · mean 5.41s

4. minimax

80.6% ok · 133/165 rows · mean 2.37s

5. deepseek

79.4% ok · 131/165 rows · mean 3.57s

6. haiku

77.6% ok · 128/165 rows · mean 2.94s

7. sonnet

76.4% ok · 126/165 rows · mean 4.06s

04

Model-by-model lift

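| Model | No file tools | Default file tools | Lift | File-tools mean latency |
| --- | --- | --- | --- | --- |
| deepseek | 15.8% | 79.4% | +63.6 pp | 3.57s |
| kimi | 27.3% | 86.1% | +58.8 pp | 4.57s |
| responses-gpt-5-4-mini | 33.9% | 91.5% | +57.6 pp | 2.91s |
| grok | 26.1% | 83.0% | +57.0 pp | 5.41s |
| minimax | 31.5% | 80.6% | +49.1 pp | 2.37s |
| haiku | 30.9% | 77.6% | +46.7 pp | 2.94s |
| sonnet | 30.3% | 76.4% | +46.1 pp | 4.06s |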
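The lift column can be reproduced from the OK counts in the "All scored runs" table. The snippet below is a minimal sketch of that arithmetic, not the report's own tooling; the names `ok_counts` and `lift_pp` are illustrative. Because every model is scored on the same 165 rows, the mean of the per-model lifts equals the pooled lift.

```python
# OK counts out of 165 blended rows, copied from the "All scored runs" table.
ROWS = 165
ok_counts = {
    # model: (OK without file tools, OK with default file tools)
    "deepseek": (26, 131),
    "kimi": (45, 142),
    "responses-gpt-5-4-mini": (56, 151),
    "grok": (43, 137),
    "minimax": (52, 133),
    "haiku": (51, 128),
    "sonnet": (50, 126),
}


def lift_pp(before: int, after: int, rows: int = ROWS) -> float:
    """Lift in percentage points between two OK counts on the same row set."""
    return 100.0 * (after - before) / rows


lifts = {model: lift_pp(b, a) for model, (b, a) in ok_counts.items()}
average_lift = sum(lifts.values()) / len(lifts)

# Print models from largest to smallest lift, then the average.
for model, lift in sorted(lifts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<24} {lift:+.1f} pp")
print(f"average lift: {average_lift:.1f} pp")
```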
05

All scored runs

| Scenario | Model | OK | OK rate | Tool-call rate | Mean latency | Median latency |
| --- | --- | --- | --- | --- | --- | --- |
| 01-no-file-tools-blended | responses-gpt-5-4-mini | 56/165 | 33.9% | 89.1% | 3.07s | 2.85s |
| 01-no-file-tools-blended | minimax | 52/165 | 31.5% | 92.1% | 2.27s | 1.69s |
| 01-no-file-tools-blended | haiku | 51/165 | 30.9% | 86.1% | 3.63s | 3.12s |
| 01-no-file-tools-blended | sonnet | 50/165 | 30.3% | 84.2% | 4.24s | 3.48s |
| 01-no-file-tools-blended | kimi | 45/165 | 27.3% | 89.1% | 5.09s | 3.06s |
| 01-no-file-tools-blended | grok | 43/165 | 26.1% | 94.5% | 4.48s | 4.03s |
| 01-no-file-tools-blended | deepseek | 26/165 | 15.8% | 87.9% | 3.99s | 3.16s |
| 01-no-file-tools-hard | haiku | 52/129 | 40.3% | 90.7% | 4.20s | 3.24s |
| 01-no-file-tools-hard | kimi | 52/129 | 40.3% | 89.9% | 5.37s | 3.44s |
| 01-no-file-tools-hard | minimax | 52/129 | 40.3% | 96.1% | 2.63s | 1.92s |
| 01-no-file-tools-hard | deepseek | 51/129 | 39.5% | 90.7% | 5.46s | 3.82s |
| 01-no-file-tools-hard | grok | 51/129 | 39.5% | 95.3% | 5.10s | 4.44s |
| 01-no-file-tools-hard | responses-gpt-5-4-mini | 51/129 | 39.5% | 94.6% | 3.12s | 2.99s |
| 01-no-file-tools-hard | sonnet | 50/129 | 38.8% | 90.7% | 4.35s | 3.66s |
| 02-default-file-tools-blended | responses-gpt-5-4-mini | 151/165 | 91.5% | 93.9% | 2.91s | 2.72s |
| 02-default-file-tools-blended | kimi | 142/165 | 86.1% | 93.9% | 4.57s | 3.41s |
| 02-default-file-tools-blended | grok | 137/165 | 83.0% | 92.7% | 5.41s | 4.68s |
| 02-default-file-tools-blended | minimax | 133/165 | 80.6% | 95.2% | 2.37s | 1.97s |
| 02-default-file-tools-blended | deepseek | 131/165 | 79.4% | 94.5% | 3.57s | 2.67s |
| 02-default-file-tools-blended | haiku | 128/165 | 77.6% | 93.9% | 2.94s | 2.63s |
| 02-default-file-tools-blended | sonnet | 126/165 | 76.4% | 94.5% | 4.06s | 3.57s |
06

Category movement table

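| Category | No-tools N | No-tools rate | File-tools N | File-tools rate | Delta |
| --- | --- | --- | --- | --- | --- |
| repo_files | 448 | 0.0% | 448 | 85.7% | +85.7 pp |
| dataset_preview | 28 | 3.6% | 28 | 85.7% | +82.1 pp |
| file_search_or_filter | 28 | 0.0% | 28 | 82.1% | +82.1 pp |
| exact_file_read | 133 | 1.5% | 133 | 75.9% | +74.4 pp |
| mixed | 126 | 25.4% | 126 | 74.6% | +49.2 pp |
| exact_or_discovered_file_read | 7 | 0.0% | 7 | 28.6% | +28.6 pp |
| repo_details | 70 | 57.1% | 70 | 81.4% | +24.3 pp |
| dataset_structure | 56 | 75.0% | 56 | 98.2% | +23.2 pp |
| dataset_viewer | 28 | 67.9% | 28 | 85.7% | +17.9 pp |
| general_answer | 42 | 54.8% | 42 | 54.8% | +0.0 pp |
| search | 126 | 88.1% | 126 | 88.1% | +0.0 pp |
| write_or_upload | 28 | 75.0% | 28 | 75.0% | +0.0 pp |
| needs_context | 28 | 100.0% | 28 | 96.4% | -3.6 pp |
| source_code_ambiguous_repo_details | 7 | 57.1% | 7 | 28.6% | -28.6 pp |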
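The category rows can be reconciled against the headline totals. The sketch below assumes each published rate is an integer OK count divided by N and rounded to 0.1%, so the count can be recovered with `round(N * rate / 100)`. That recovery step is an assumption; the report does not list per-category OK counts, and the helper name `implied_ok` is illustrative.

```python
# (category, N, no-tools rate %, file-tools rate %) from the category table.
CATEGORIES = [
    ("repo_files", 448, 0.0, 85.7),
    ("dataset_preview", 28, 3.6, 85.7),
    ("file_search_or_filter", 28, 0.0, 82.1),
    ("exact_file_read", 133, 1.5, 75.9),
    ("mixed", 126, 25.4, 74.6),
    ("exact_or_discovered_file_read", 7, 0.0, 28.6),
    ("repo_details", 70, 57.1, 81.4),
    ("dataset_structure", 56, 75.0, 98.2),
    ("dataset_viewer", 28, 67.9, 85.7),
    ("general_answer", 42, 54.8, 54.8),
    ("search", 126, 88.1, 88.1),
    ("write_or_upload", 28, 75.0, 75.0),
    ("needs_context", 28, 100.0, 96.4),
    ("source_code_ambiguous_repo_details", 7, 57.1, 28.6),
]


def implied_ok(n: int, rate_pct: float) -> int:
    """Recover an integer OK count from a rate rounded to 0.1% (assumption)."""
    return round(n * rate_pct / 100)


total_n = sum(n for _, n, _, _ in CATEGORIES)
no_tools_ok = sum(implied_ok(n, r0) for _, n, r0, _ in CATEGORIES)
file_tools_ok = sum(implied_ok(n, r1) for _, n, _, r1 in CATEGORIES)

print(f"rows: {total_n}")
print(f"no file tools:      {no_tools_ok}/{total_n} = {100 * no_tools_ok / total_n:.1f}%")
print(f"default file tools: {file_tools_ok}/{total_n} = {100 * file_tools_ok / total_n:.1f}%")
```

Under that assumption the category rows sum to the same 1155 evaluations and the same 323 and 948 OK counts reported for the two blended scenarios, which suggests the categories partition the blended set.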
07

Failure signatures

01-no-file-tools-blended

  1. no_acceptable_call_set_matched — 832
  2. forbidden_tool:hub_repo_details — 76
  3. forbidden_tool:hf_doc_search — 8
  4. forbidden_tool:hub_repo_search — 8
  5. forbidden_tool:space_search — 2

01-no-file-tools-hard

  1. missing_files_operation — 406
  2. expected_hub_file_read — 112
  3. no_tool_call — 30
  4. expected_hub_repo_details — 30
  5. write_request_wrong_tool — 14
  6. expected_search_or_filter_tool — 8
  7. expected_search_or_repo_details — 3
  8. expected_repo_or_file_details — 1

02-default-file-tools-blended

  1. no_acceptable_call_set_matched — 207
  2. forbidden_tool:hub_repo_details — 17
  3. forbidden_tool:hub_repo_search — 9
  4. forbidden_tool:hf_doc_search — 7
  5. forbidden_tool:hub_file_read — 7
  6. forbidden_tool:paper_search — 1
  7. forbidden_tool:space_search — 1
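The signature counts can be set against the number of failed rows per scenario (total minus OK, from the scenario summaries). The sketch below only does that subtraction and summation. In each scenario the signature total exceeds the failed-row count, which is consistent with a single row carrying more than one signature; that reading is an inference from the totals, not something the report states.

```python
# (OK rows, total rows) per scenario, from the scenario summaries.
SCORED = {
    "01-no-file-tools-blended": (323, 1155),
    "01-no-file-tools-hard": (359, 903),
    "02-default-file-tools-blended": (948, 1155),
}

# Failure-signature counts copied from the lists above.
SIGNATURES = {
    "01-no-file-tools-blended": {
        "no_acceptable_call_set_matched": 832,
        "forbidden_tool:hub_repo_details": 76,
        "forbidden_tool:hf_doc_search": 8,
        "forbidden_tool:hub_repo_search": 8,
        "forbidden_tool:space_search": 2,
    },
    "01-no-file-tools-hard": {
        "missing_files_operation": 406,
        "expected_hub_file_read": 112,
        "no_tool_call": 30,
        "expected_hub_repo_details": 30,
        "write_request_wrong_tool": 14,
        "expected_search_or_filter_tool": 8,
        "expected_search_or_repo_details": 3,
        "expected_repo_or_file_details": 1,
    },
    "02-default-file-tools-blended": {
        "no_acceptable_call_set_matched": 207,
        "forbidden_tool:hub_repo_details": 17,
        "forbidden_tool:hub_repo_search": 9,
        "forbidden_tool:hf_doc_search": 7,
        "forbidden_tool:hub_file_read": 7,
        "forbidden_tool:paper_search": 1,
        "forbidden_tool:space_search": 1,
    },
}

# scenario -> (failed rows, sum of signature counts)
summary = {}
for scenario, (ok, total) in SCORED.items():
    failed = total - ok
    signature_total = sum(SIGNATURES[scenario].values())
    summary[scenario] = (failed, signature_total)
    print(f"{scenario}: {failed} failed rows, {signature_total} signatures")
```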
08

Tool usage profile

01-no-file-tools-blended

  1. hub_repo_details — 871
  2. hub_repo_search — 104
  3. hf_doc_search — 37
  4. space_search — 9
  5. paper_search — 7

01-no-file-tools-hard

  1. hub_repo_details — 735
  2. hub_repo_search — 80
  3. hf_doc_search — 16
  4. space_search — 3
  5. paper_search — 2

02-default-file-tools-blended

  1. hub_repo_details — 773
  2. hub_file_read — 149
  3. hub_repo_search — 113
  4. hf_doc_search — 34
  5. paper_search — 10
  6. space_search — 8
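Raw call counts are easier to compare across scenarios as shares of all calls. The sketch below converts the lists above into percentages; `call_shares` is an illustrative helper, not part of the eval harness. On these counts, hub_repo_details drops from roughly 85% of calls in the no-file-tools blended run to roughly 71% once hub_file_read is available.

```python
# Tool-call counts per scenario, copied from the lists above.
TOOL_CALLS = {
    "01-no-file-tools-blended": {
        "hub_repo_details": 871,
        "hub_repo_search": 104,
        "hf_doc_search": 37,
        "space_search": 9,
        "paper_search": 7,
    },
    "01-no-file-tools-hard": {
        "hub_repo_details": 735,
        "hub_repo_search": 80,
        "hf_doc_search": 16,
        "space_search": 3,
        "paper_search": 2,
    },
    "02-default-file-tools-blended": {
        "hub_repo_details": 773,
        "hub_file_read": 149,
        "hub_repo_search": 113,
        "hf_doc_search": 34,
        "paper_search": 10,
        "space_search": 8,
    },
}


def call_shares(counts: dict) -> dict:
    """Map each tool to its percentage of all calls in one scenario."""
    total = sum(counts.values())
    return {tool: 100.0 * n / total for tool, n in counts.items()}


shares = {scenario: call_shares(counts) for scenario, counts in TOOL_CALLS.items()}

for scenario, by_tool in shares.items():
    print(scenario)
    for tool, pct in by_tool.items():
        print(f"  {tool:<18} {pct:5.1f}%")
```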
09

Recommendations

Keep file tools on by default

All comparable models improve, with an average lift of 54.1 pp.

Focus tuning on intent routing

Capability gaps are much smaller after file tools. Next gains should come from better discrimination among repo listing, exact file read, search/filter, and ambiguous source-code requests.

Use category heatmaps as regression gates

Track repo_files, exact_file_read, file_search_or_filter, and exact_or_discovered_file_read separately; aggregate accuracy hides different failure modes.
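One way such a gate could look is sketched below. The baseline rates are the file-tools rates from the category movement table. The function name `regression_gate`, the 2 pp threshold, and the candidate run are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# File-tools rates (%) from the category movement table, used as the baseline.
BASELINE: Dict[str, float] = {
    "repo_files": 85.7,
    "exact_file_read": 75.9,
    "file_search_or_filter": 82.1,
    "exact_or_discovered_file_read": 28.6,
}


def regression_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop_pp: float = 2.0,  # hypothetical tolerance, not from the report
) -> List[str]:
    """Return tracked categories that are missing or fell by more than max_drop_pp."""
    failures = []
    for category, base_rate in baseline.items():
        new_rate = candidate.get(category)
        if new_rate is None or base_rate - new_rate > max_drop_pp:
            failures.append(category)
    return failures


# Hypothetical candidate run: invented numbers, for illustration only.
candidate_run = {
    "repo_files": 86.0,
    "exact_file_read": 70.0,
    "file_search_or_filter": 82.1,
    "exact_or_discovered_file_read": 42.9,
}

failed_categories = regression_gate(BASELINE, candidate_run)
print(failed_categories)  # ['exact_file_read']
```

Gating per category means a gain in one bucket cannot mask a drop in another, which is the failure mode aggregate accuracy hides.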

Choose models by frontier

responses-gpt-5-4-mini is the accuracy leader (91.5%, mean 2.91s); minimax is the fastest strong performer in this completed file-tools set (80.6%, mean 2.37s).
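The frontier can be made explicit as a Pareto check on accuracy and mean latency from the completed file-tools ranking. The sketch below treats a model as dominated when another model is at least as accurate and at least as fast, and strictly better on one of the two. Using these two axes alone is an assumption; cost or other criteria are not in the report.

```python
from typing import Dict, List, Tuple

# model: (OK rate %, mean latency in seconds) from the file-tools ranking.
MODELS: Dict[str, Tuple[float, float]] = {
    "responses-gpt-5-4-mini": (91.5, 2.91),
    "kimi": (86.1, 4.57),
    "grok": (83.0, 5.41),
    "minimax": (80.6, 2.37),
    "deepseek": (79.4, 3.57),
    "haiku": (77.6, 2.94),
    "sonnet": (76.4, 4.06),
}


def pareto_frontier(models: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return models not dominated on (higher accuracy, lower latency)."""
    frontier = []
    for name, (acc, lat) in models.items():
        dominated = any(
            o_acc >= acc and o_lat <= lat and (o_acc > acc or o_lat < lat)
            for other, (o_acc, o_lat) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


frontier = pareto_frontier(MODELS)
print(frontier)  # ['responses-gpt-5-4-mini', 'minimax']
```

On these numbers only two models sit on the frontier: the most accurate one and the fastest one. Every other model is both slower and less accurate than responses-gpt-5-4-mini.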