Evaluation report · regenerated with stronger charts · 13 May 2026
Analysis of /home/ssmith/source/hf-mcp-server/eval/file-tools/out/final-runs/20260513T203044Z-blended-compact. Across all scored runs, blended accuracy rises from 28.0% without file tools to 82.1% with default file tools. On the same 165-row blended set, every model improves; the average lift is 54.1 percentage points.
File tools do not merely improve performance; they move the task from mostly unsolved (28.0% overall) to mostly solved (82.1% overall under the default file-tools condition).
responses-gpt-5-4-mini leads at 91.5%. kimi reaches 86.1%; grok is now scored and lands at 83.0%.
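The biggest lifts are exactly where expected: repo_files (+85.7 pp), dataset_preview and file_search_or_filter (+82.1 pp each), and exact_file_read (+74.4 pp).
The remaining low-scoring areas are not a matter of generic tool access; they are routing and intent distinctions, such as exact versus discovered file reads (exact_or_discovered_file_read, 28.6% on 7 rows) and ambiguous source-code repo-detail requests (source_code_ambiguous_repo_details, 28.6% on 7 rows).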
These are Matplotlib-rendered SVG charts embedded directly in the report. They emphasize the decision questions: how much file tools help, where they help, which models benefit, and what remains to tune.
responses-gpt-5-4-mini · 91.5% ok · 151/165 rows · mean 2.91s
kimi · 86.1% ok · 142/165 rows · mean 4.57s
grok · 83.0% ok · 137/165 rows · mean 5.41s
minimax · 80.6% ok · 133/165 rows · mean 2.37s
deepseek · 79.4% ok · 131/165 rows · mean 3.57s
haiku · 77.6% ok · 128/165 rows · mean 2.94s
sonnet · 76.4% ok · 126/165 rows · mean 4.06s
| Model | No file tools | Default file tools | Lift | File-tools mean latency |
|---|---|---|---|---|
| deepseek | 15.8% | 79.4% | +63.6 pp | 3.57s |
| kimi | 27.3% | 86.1% | +58.8 pp | 4.57s |
| responses-gpt-5-4-mini | 33.9% | 91.5% | +57.6 pp | 2.91s |
| grok | 26.1% | 83.0% | +57.0 pp | 5.41s |
| minimax | 31.5% | 80.6% | +49.1 pp | 2.37s |
| haiku | 30.9% | 77.6% | +46.7 pp | 2.94s |
| sonnet | 30.3% | 76.4% | +46.1 pp | 4.06s |
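The lift column can be reproduced from the OK counts in the per-scenario table rather than from the rounded percentages. The following is a minimal sketch, not the report's own tooling; the names `ok_counts`, `lift_pp`, and `N_ROWS` are illustrative assumptions. Working from counts matters: subtracting the rounded rates for grok (83.0 − 26.1) gives 56.9, while the counts give 94/165, which rounds to the 57.0 shown in the table.

```python
# OK counts out of 165 blended rows, copied from the per-scenario table:
# (no file tools, default file tools). Variable names are illustrative.
N_ROWS = 165
ok_counts = {
    "deepseek": (26, 131),
    "kimi": (45, 142),
    "responses-gpt-5-4-mini": (56, 151),
    "grok": (43, 137),
    "minimax": (52, 133),
    "haiku": (51, 128),
    "sonnet": (50, 126),
}


def lift_pp(ok_before: int, ok_after: int, n: int) -> float:
    """Lift in percentage points, computed from raw counts and rounded once."""
    return round(100 * (ok_after - ok_before) / n, 1)


lifts = {model: lift_pp(b, a, N_ROWS) for model, (b, a) in ok_counts.items()}

# Mean lift across models, computed from unrounded per-model lifts.
mean_lift = round(
    sum(100 * (a - b) / N_ROWS for b, a in ok_counts.values()) / len(ok_counts), 1
)

for model, lift in sorted(lifts.items(), key=lambda kv: -kv[1]):
    print(f"{model}: +{lift} pp")
print(f"mean lift: +{mean_lift} pp")
```

Because every model is scored on the same 165 rows, the mean of the per-model lifts equals the pooled lift over all rows.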
| Scenario | Model | OK | OK rate | Tool-call rate | Mean latency | Median latency |
|---|---|---|---|---|---|---|
| 01-no-file-tools-blended | responses-gpt-5-4-mini | 56/165 | 33.9% | 89.1% | 3.07s | 2.85s |
| 01-no-file-tools-blended | minimax | 52/165 | 31.5% | 92.1% | 2.27s | 1.69s |
| 01-no-file-tools-blended | haiku | 51/165 | 30.9% | 86.1% | 3.63s | 3.12s |
| 01-no-file-tools-blended | sonnet | 50/165 | 30.3% | 84.2% | 4.24s | 3.48s |
| 01-no-file-tools-blended | kimi | 45/165 | 27.3% | 89.1% | 5.09s | 3.06s |
| 01-no-file-tools-blended | grok | 43/165 | 26.1% | 94.5% | 4.48s | 4.03s |
| 01-no-file-tools-blended | deepseek | 26/165 | 15.8% | 87.9% | 3.99s | 3.16s |
| 01-no-file-tools-hard | haiku | 52/129 | 40.3% | 90.7% | 4.20s | 3.24s |
| 01-no-file-tools-hard | kimi | 52/129 | 40.3% | 89.9% | 5.37s | 3.44s |
| 01-no-file-tools-hard | minimax | 52/129 | 40.3% | 96.1% | 2.63s | 1.92s |
| 01-no-file-tools-hard | deepseek | 51/129 | 39.5% | 90.7% | 5.46s | 3.82s |
| 01-no-file-tools-hard | grok | 51/129 | 39.5% | 95.3% | 5.10s | 4.44s |
| 01-no-file-tools-hard | responses-gpt-5-4-mini | 51/129 | 39.5% | 94.6% | 3.12s | 2.99s |
| 01-no-file-tools-hard | sonnet | 50/129 | 38.8% | 90.7% | 4.35s | 3.66s |
| 02-default-file-tools-blended | responses-gpt-5-4-mini | 151/165 | 91.5% | 93.9% | 2.91s | 2.72s |
| 02-default-file-tools-blended | kimi | 142/165 | 86.1% | 93.9% | 4.57s | 3.41s |
| 02-default-file-tools-blended | grok | 137/165 | 83.0% | 92.7% | 5.41s | 4.68s |
| 02-default-file-tools-blended | minimax | 133/165 | 80.6% | 95.2% | 2.37s | 1.97s |
| 02-default-file-tools-blended | deepseek | 131/165 | 79.4% | 94.5% | 3.57s | 2.67s |
| 02-default-file-tools-blended | haiku | 128/165 | 77.6% | 93.9% | 2.94s | 2.63s |
| 02-default-file-tools-blended | sonnet | 126/165 | 76.4% | 94.5% | 4.06s | 3.57s |
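The headline figures (28.0% and 82.1%) appear to be pooled rates: total OK rows divided by total scored rows per scenario. A minimal sketch of that aggregation follows, using the OK counts from the table above; the tuple layout and the name `pooled_ok_rate` are assumptions made for illustration, not the evaluation harness's actual schema.

```python
from collections import defaultdict

# (scenario, model, ok, n) copied from the table above.
rows = [
    ("01-no-file-tools-blended", "responses-gpt-5-4-mini", 56, 165),
    ("01-no-file-tools-blended", "minimax", 52, 165),
    ("01-no-file-tools-blended", "haiku", 51, 165),
    ("01-no-file-tools-blended", "sonnet", 50, 165),
    ("01-no-file-tools-blended", "kimi", 45, 165),
    ("01-no-file-tools-blended", "grok", 43, 165),
    ("01-no-file-tools-blended", "deepseek", 26, 165),
    ("01-no-file-tools-hard", "haiku", 52, 129),
    ("01-no-file-tools-hard", "kimi", 52, 129),
    ("01-no-file-tools-hard", "minimax", 52, 129),
    ("01-no-file-tools-hard", "deepseek", 51, 129),
    ("01-no-file-tools-hard", "grok", 51, 129),
    ("01-no-file-tools-hard", "responses-gpt-5-4-mini", 51, 129),
    ("01-no-file-tools-hard", "sonnet", 50, 129),
    ("02-default-file-tools-blended", "responses-gpt-5-4-mini", 151, 165),
    ("02-default-file-tools-blended", "kimi", 142, 165),
    ("02-default-file-tools-blended", "grok", 137, 165),
    ("02-default-file-tools-blended", "minimax", 133, 165),
    ("02-default-file-tools-blended", "deepseek", 131, 165),
    ("02-default-file-tools-blended", "haiku", 128, 165),
    ("02-default-file-tools-blended", "sonnet", 126, 165),
]


def pooled_ok_rate(rows):
    """Return {scenario: pooled OK rate in percent}, summing counts across models."""
    totals = defaultdict(lambda: [0, 0])  # scenario -> [ok, n]
    for scenario, _model, ok, n in rows:
        totals[scenario][0] += ok
        totals[scenario][1] += n
    return {s: round(100 * ok / n, 1) for s, (ok, n) in totals.items()}


pooled = pooled_ok_rate(rows)
for scenario, rate in pooled.items():
    print(f"{scenario}: {rate}%")
```

Pooling by counts keeps the result correct even if scenarios or models were scored on different numbers of rows, which a plain average of percentages would not.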
| Category | No-tools N | No-tools rate | File-tools N | File-tools rate | Delta |
|---|---|---|---|---|---|
| repo_files | 448 | 0.0% | 448 | 85.7% | +85.7 pp |
| dataset_preview | 28 | 3.6% | 28 | 85.7% | +82.1 pp |
| file_search_or_filter | 28 | 0.0% | 28 | 82.1% | +82.1 pp |
| exact_file_read | 133 | 1.5% | 133 | 75.9% | +74.4 pp |
| mixed | 126 | 25.4% | 126 | 74.6% | +49.2 pp |
| exact_or_discovered_file_read | 7 | 0.0% | 7 | 28.6% | +28.6 pp |
| repo_details | 70 | 57.1% | 70 | 81.4% | +24.3 pp |
| dataset_structure | 56 | 75.0% | 56 | 98.2% | +23.2 pp |
| dataset_viewer | 28 | 67.9% | 28 | 85.7% | +17.9 pp |
| general_answer | 42 | 54.8% | 42 | 54.8% | +0.0 pp |
| search | 126 | 88.1% | 126 | 88.1% | +0.0 pp |
| write_or_upload | 28 | 75.0% | 28 | 75.0% | +0.0 pp |
| needs_context | 28 | 100.0% | 28 | 96.4% | -3.6 pp |
| source_code_ambiguous_repo_details | 7 | 57.1% | 7 | 28.6% | -28.6 pp |
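To separate categories that improved from those that stayed flat or regressed, the table can be processed directly. The sketch below is an illustration under stated assumptions: it recovers integer OK counts from the rounded rates and N (an assumption that holds for these rows because N is small relative to the rounding precision), then recomputes each delta from counts. The names `table`, `recover_ok`, and `deltas` are hypothetical.

```python
# (category, N, no-tools rate %, file-tools rate %) copied from the table above.
table = [
    ("repo_files", 448, 0.0, 85.7),
    ("dataset_preview", 28, 3.6, 85.7),
    ("file_search_or_filter", 28, 0.0, 82.1),
    ("exact_file_read", 133, 1.5, 75.9),
    ("mixed", 126, 25.4, 74.6),
    ("exact_or_discovered_file_read", 7, 0.0, 28.6),
    ("repo_details", 70, 57.1, 81.4),
    ("dataset_structure", 56, 75.0, 98.2),
    ("dataset_viewer", 28, 67.9, 85.7),
    ("general_answer", 42, 54.8, 54.8),
    ("search", 126, 88.1, 88.1),
    ("write_or_upload", 28, 75.0, 75.0),
    ("needs_context", 28, 100.0, 96.4),
    ("source_code_ambiguous_repo_details", 7, 57.1, 28.6),
]


def recover_ok(n: int, rate_pct: float) -> int:
    """Recover the integer OK count behind a rate rounded to one decimal."""
    return round(n * rate_pct / 100)


# Delta in percentage points, recomputed from recovered counts.
deltas = {
    cat: round(100 * (recover_ok(n, after) - recover_ok(n, before)) / n, 1)
    for cat, n, before, after in table
}

regressed = sorted((c for c, d in deltas.items() if d < 0), key=deltas.get)
flat = [c for c, d in deltas.items() if d == 0]

print("regressed:", regressed)
print("flat:", flat)
```

Recomputing from counts avoids small rounding drift: for dataset_viewer, subtracting the rounded rates gives 17.8, whereas 5/28 rounds to the 17.9 reported in the table.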
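Failure reasons by scenario:
| Scenario | Failure reason | Count |
|---|---|---|
| 01-no-file-tools-blended | no_acceptable_call_set_matched | 832 |
| 01-no-file-tools-blended | forbidden_tool:hub_repo_details | 76 |
| 01-no-file-tools-blended | forbidden_tool:hf_doc_search | 8 |
| 01-no-file-tools-blended | forbidden_tool:hub_repo_search | 8 |
| 01-no-file-tools-blended | forbidden_tool:space_search | 2 |
| 01-no-file-tools-hard | missing_files_operation | 406 |
| 01-no-file-tools-hard | expected_hub_file_read | 112 |
| 01-no-file-tools-hard | no_tool_call | 30 |
| 01-no-file-tools-hard | expected_hub_repo_details | 30 |
| 01-no-file-tools-hard | write_request_wrong_tool | 14 |
| 01-no-file-tools-hard | expected_search_or_filter_tool | 8 |
| 01-no-file-tools-hard | expected_search_or_repo_details | 3 |
| 01-no-file-tools-hard | expected_repo_or_file_details | 1 |
| 02-default-file-tools-blended | no_acceptable_call_set_matched | 207 |
| 02-default-file-tools-blended | forbidden_tool:hub_repo_details | 17 |
| 02-default-file-tools-blended | forbidden_tool:hub_repo_search | 9 |
| 02-default-file-tools-blended | forbidden_tool:hf_doc_search | 7 |
| 02-default-file-tools-blended | forbidden_tool:hub_file_read | 7 |
| 02-default-file-tools-blended | forbidden_tool:paper_search | 1 |
| 02-default-file-tools-blended | forbidden_tool:space_search | 1 |
Tool calls by scenario:
| Scenario | Tool | Calls |
|---|---|---|
| 01-no-file-tools-blended | hub_repo_details | 871 |
| 01-no-file-tools-blended | hub_repo_search | 104 |
| 01-no-file-tools-blended | hf_doc_search | 37 |
| 01-no-file-tools-blended | space_search | 9 |
| 01-no-file-tools-blended | paper_search | 7 |
| 01-no-file-tools-hard | hub_repo_details | 735 |
| 01-no-file-tools-hard | hub_repo_search | 80 |
| 01-no-file-tools-hard | hf_doc_search | 16 |
| 01-no-file-tools-hard | space_search | 3 |
| 01-no-file-tools-hard | paper_search | 2 |
| 02-default-file-tools-blended | hub_repo_details | 773 |
| 02-default-file-tools-blended | hub_file_read | 149 |
| 02-default-file-tools-blended | hub_repo_search | 113 |
| 02-default-file-tools-blended | hf_doc_search | 34 |
| 02-default-file-tools-blended | paper_search | 10 |
| 02-default-file-tools-blended | space_search | 8 |
All comparable models improve, with an average lift of 54.1 pp.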
Capability gaps are much smaller after file tools. Next gains should come from better discrimination among repo listing, exact file read, search/filter, and ambiguous source-code requests.
Track repo_files, exact_file_read, file_search_or_filter, and exact_or_discovered_file_read separately; aggregate accuracy hides different failure modes.
responses-gpt-5-4-mini is the accuracy leader at 91.5%; minimax is the fastest strong performer in this completed file-tools set, at 80.6% with a 2.37s mean latency.
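The recommendation to track the file categories separately could be implemented as a per-category summary over row-level results. The sketch below is purely illustrative: the record fields (`category`, `ok`, `reason`) and the sample rows are hypothetical and do not come from the evaluation output; only the category and failure-reason names are taken from this report.

```python
from collections import Counter, defaultdict

# Hypothetical row-level records; field names and values are illustrative only.
records = [
    {"category": "repo_files", "ok": True, "reason": None},
    {"category": "repo_files", "ok": False, "reason": "no_acceptable_call_set_matched"},
    {"category": "exact_file_read", "ok": False, "reason": "forbidden_tool:hub_repo_details"},
    {"category": "exact_file_read", "ok": False, "reason": "forbidden_tool:hub_repo_details"},
    {"category": "exact_file_read", "ok": True, "reason": None},
    {"category": "exact_or_discovered_file_read", "ok": False, "reason": "forbidden_tool:hub_file_read"},
]


def summarize(records):
    """Per-category row count, OK rate (percent), and most common failure reason."""
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)

    summary = {}
    for category, rows in by_category.items():
        reasons = Counter(r["reason"] for r in rows if not r["ok"])
        top = reasons.most_common(1)
        summary[category] = {
            "n": len(rows),
            "ok_rate": round(100 * sum(r["ok"] for r in rows) / len(rows), 1),
            "top_failure": top[0][0] if top else None,
        }
    return summary


summary = summarize(records)
for category, stats in summary.items():
    print(category, stats)
```

Reporting the dominant failure reason next to each category's rate makes it visible when two categories with similar accuracy fail for different reasons, which a single aggregate figure would hide.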