Metadata-Version: 2.4
Name: mcp-tool-selection-bench
Version: 0.2.0
Summary: Benchmark LLM tool selection accuracy for MCP tools using the GitHub Copilot SDK
Author-email: skmanoj <skmanoj@github.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/skmanoj/mcp-tool-selection-bench
Project-URL: Repository, https://github.com/skmanoj/mcp-tool-selection-bench
Project-URL: Issues, https://github.com/skmanoj/mcp-tool-selection-bench/issues
Keywords: mcp,benchmark,llm,tool-selection,copilot
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: github-copilot-sdk>=0.1.0
Requires-Dist: pydantic>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"

# MCP Tool Selection Benchmark

Benchmark how accurately different LLM models select the correct MCP tool given natural language instructions, powered by the [GitHub Copilot SDK](https://github.com/github/copilot-sdk).

## How It Works

1. **Load** a tool registry (`tools.json`) and ground-truth test suite (`test_suite.json`).
2. **Generate** 5 query variations per instruction at different ambiguity levels (explicit → misleading) using a Copilot SDK model call.
3. **Evaluate** each variation against every selected model — the model is presented with the tools and must pick one or more.
4. **Score** selections via exact-match and partial-credit against ground truth.
5. **Report** per-model accuracy, confusion matrices, and optional description suggestions in JSON and HTML.

## Prerequisites

- Python ≥ 3.10
- [GitHub Copilot CLI](https://docs.github.com/copilot/how-tos/set-up/install-copilot-cli) installed and in PATH
- An active Copilot subscription (each query counts as a premium request)
- Authentication via the Copilot CLI login flow, or a `GH_TOKEN` / `GITHUB_TOKEN` environment variable (see below)
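
If you prefer token-based authentication, export a token before running the benchmark (the token must belong to an account with an active Copilot subscription):

```bash
# Either variable works; use a personal access token from a Copilot-licensed account.
export GH_TOKEN="<your-personal-access-token>"
```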

## Installation

From PyPI:

```bash
pip install mcp-tool-selection-bench
```

Or for development:

```bash
git clone https://github.com/skmanoj/mcp-tool-selection-bench.git
cd mcp-tool-selection-bench
pip install -e ".[dev]"
```

## Usage

### Auto-generate a test suite (optional)

Don't have a `test_suite.json`? Generate one from your tool definitions:

```bash
mcp-bench generate --tools tools.json --output test_suite.json
```

This uses an LLM to create 3–5 test cases per tool (mix of single-tool and multi-tool, varying ambiguity). **Review and edit the output** before benchmarking.

| Argument | Required | Default | Description |
|---|---|---|---|
| `--tools` | ✅ | — | Path to the tool registry JSON |
| `--output` | — | `test_suite.json` | Output path for the generated test suite |
| `--per-tool` | — | `4` | Number of test cases to generate per tool |
| `--model` | — | — | Model to use for generation |
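
For example, to generate five test cases per tool with an explicitly chosen generator model (the model name here is illustrative; use any model available to your Copilot subscription):

```bash
mcp-bench generate \
  --tools tools.json \
  --output test_suite.json \
  --per-tool 5 \
  --model gpt-5
```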

### Run a benchmark

```bash
mcp-bench run \
  --tools samples/tools.json \
  --test-suite samples/test_suite.json \
  --models gpt-5 claude-sonnet-4 \
  --output results.json
```

The `run` subcommand is the default — you can omit it for backward compatibility.

### CLI Arguments (`run`)

| Argument | Required | Default | Description |
|---|---|---|---|
| `--tools` | ✅ | — | Path to the tool registry JSON |
| `--test-suite` | ✅ | — | Path to the ground-truth test suite JSON |
| `--models` | ✅ | — | Space-separated model names to benchmark |
| `--output` | — | `results.json` | Output path for the report |
| `--variations` | — | `5` | Number of query variations per instruction |
| `--generator-model` | — | first `--models` entry | Model used to generate query variations |
| `-v` / `--verbose` | — | off | Enable DEBUG logging |
| `--html` | — | — | Output path for an HTML report with confusion-matrix heatmaps |
| `--suggest` | — | off | Generate tool-description improvement suggestions (extra API calls) |
| `--suggest-threshold` | — | `0.7` | Accuracy threshold below which to suggest description improvements |
| `--fail-under` | — | — | Exit with code 1 if any model's exact-match accuracy is below this value |
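
A fuller invocation that also writes an HTML report, asks for description-improvement suggestions, and gates on a minimum accuracy might look like this (paths, model names, and threshold are illustrative):

```bash
mcp-bench run \
  --tools samples/tools.json \
  --test-suite samples/test_suite.json \
  --models gpt-5 claude-sonnet-4 \
  --variations 5 \
  --output results.json \
  --html report.html \
  --suggest \
  --fail-under 0.8
```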

## Input Schemas

### `tools.json`

```json
[
  {
    "name": "search_issues",
    "description": "Search for issues in GitHub repositories",
    "parameters": {
      "type": "object",
      "properties": {
        "query": { "type": "string", "description": "Search query" }
      },
      "required": ["query"]
    },
    "metadata": { "category": "github", "tags": ["issues", "search"] }
  }
]
```
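
Before a run, a quick way to sanity-check the registry is to load it with the standard library and list the tool names (an illustrative sketch; the package itself defines Pydantic data models for these schemas in `src/mcp_bench/models.py`):

```python
import json

# Load the tool registry shown above and print a quick summary of each entry.
with open("tools.json") as f:
    tools = json.load(f)

for tool in tools:
    category = tool.get("metadata", {}).get("category", "uncategorized")
    print(f"{tool['name']} [{category}]: {tool['description']}")
```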

### `test_suite.json`

Supports both single-tool and multi-tool test cases:

```json
[
  {
    "instruction": "Find all open bugs in the react repo",
    "expected_tools": ["search_issues"]
  },
  {
    "instruction": "Find open bugs in react and check the CI logs",
    "expected_tools": ["search_issues", "get_job_logs"]
  }
]
```

For multi-tool cases, use `expected_tools` (ordered list). Single-tool cases using `expected_tool` (string) remain supported for backward compatibility.
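
The legacy single-tool form looks like this:

```json
[
  {
    "instruction": "Find all open bugs in the react repo",
    "expected_tool": "search_issues"
  }
]
```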

## Output

The report (`results.json`) contains:

- **`run_metadata`** — timestamp, models tested, query counts
- **`per_model_results`** — per-model accuracy (exact-match and partial-credit), per-instruction breakdown with each variation's selected tools and `exact_match` result
- **`summary`** — best model, accuracy ranking, best exact-match and partial-credit scores
- **`confusion_matrices`** — per-model confusion matrix (expected vs. selected tool counts)
- **`suggestions`** — tool-description improvement suggestions (when `--suggest` is used)

### Scoring Metrics

- **Exact-match accuracy**: Full ordered sequence of selected tools must match expected tools exactly
- **Partial-credit accuracy**: Ordered prefix matching — counts how many tools in sequence match from the start (e.g., expected `[A, B, C]`, selected `[A, B, X]` → 2/3 credit)
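
The two metrics are simple to express directly. Here is an illustrative sketch (not the library's actual scorer, which lives in `src/mcp_bench/scorer.py` and may differ in detail, e.g. in how an empty expectation is handled):

```python
def exact_match(expected: list[str], selected: list[str]) -> bool:
    """Full ordered sequence of selected tools must match the expected tools exactly."""
    return expected == selected


def partial_credit(expected: list[str], selected: list[str]) -> float:
    """Fraction of the expected sequence matched as an ordered prefix from the start."""
    matched = 0
    for exp, sel in zip(expected, selected):
        if exp != sel:
            break
        matched += 1
    return matched / len(expected) if expected else 1.0


# Example from above: expected [A, B, C], selected [A, B, X] -> 2/3 credit.
assert partial_credit(["A", "B", "C"], ["A", "B", "X"]) == 2 / 3
```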

### HTML Report

Use `--html report.html` to generate a visual report with:
- Model ranking table
- Confusion-matrix heatmaps (green = correct selections, red = incorrect selections)
- Description improvement suggestions (if `--suggest` was used)

## Regression Tracking (Diff)

Compare two benchmark runs to see what improved or regressed:

```bash
mcp-bench diff baseline.json current.json
mcp-bench diff baseline.json current.json --html diff.html
mcp-bench diff baseline.json current.json --fail-under 0.05  # fail if any model regressed >5%
```

The diff output shows per-model and per-instruction accuracy changes with ↑/↓/= indicators. Changes beyond ±5% are flagged as improved (green) or regressed (red).

## For MCP Server Maintainers

Want to benchmark tool selection accuracy in your MCP server's CI? Add a workflow that runs on every tool change:

1. **Add `tools.json` and `test_suite.json`** to your repo (see [Input Schemas](#input-schemas))
2. **Copy the template workflow** from [`examples/mcp-server-workflow.yml`](examples/mcp-server-workflow.yml) into `.github/workflows/`
3. **Create a `GH_TOKEN` repository secret** with a Copilot-licensed PAT
4. **Customize** the model list, paths, and thresholds in the workflow

The workflow will:
- Run the benchmark whenever tool files change
- Automatically diff against the previous run's results
- **Fail the build** if accuracy drops below the configured threshold
- Upload JSON + HTML reports as GitHub Actions artifacts
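
In essence, the benchmark step of such a workflow boils down to two CLI calls (a sketch with illustrative model names and thresholds; the previous run's report is assumed to have been downloaded as `baseline.json`, and `GH_TOKEN` must be present in the job environment via the repository secret):

```bash
# Run the benchmark against the checked-in tool registry and test suite
mcp-bench run \
  --tools tools.json \
  --test-suite test_suite.json \
  --models gpt-5 \
  --output results.json \
  --html report.html \
  --fail-under 0.7

# Compare against the previous run and fail if any model regressed by more than 5%
mcp-bench diff baseline.json results.json --fail-under 0.05
```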

## Running Tests

```bash
pytest tests/
```

## Project Structure

```
mcp-tool-selection-bench/
├── pyproject.toml
├── README.md
├── .github/workflows/
│   ├── ci.yml              # CI pipeline: test & build on every push/PR
│   └── publish.yml         # Publish to PyPI on GitHub Release
├── examples/
│   └── mcp-server-workflow.yml  # Template workflow for MCP server repos
├── src/mcp_bench/
│   ├── cli.py              # CLI entry point (run + diff + generate subcommands)
│   ├── models.py           # Pydantic data models
│   ├── prompts.py          # All prompt templates (centralised)
│   ├── generator.py        # Auto-generate test suites from tool definitions
│   ├── query_generator.py  # Generate query variations via Copilot SDK
│   ├── evaluator.py        # Run queries against models
│   ├── scorer.py           # Score, rank, and build confusion matrices
│   ├── advisor.py          # Description improvement suggestions
│   ├── diff.py             # Regression tracking (diff two runs)
│   ├── visualize.py        # HTML report with heatmaps & diff views
│   └── report.py           # Write JSON report
├── samples/
│   ├── tools.json          # Example tool registry (8 tools)
│   └── test_suite.json     # Example test suite (19 instructions)
└── tests/
```

## CI

Every push to `master` and every pull request triggers the CI pipeline (`.github/workflows/ci.yml`):

1. **Test** — runs `pytest` across Python 3.10, 3.12, and 3.13
2. **Build** — builds sdist + wheel and uploads as a GitHub Actions artifact

## Publishing

Creating a [GitHub Release](https://docs.github.com/en/repositories/releasing-projects-on-github) triggers `.github/workflows/publish.yml`, which builds and publishes to PyPI via [Trusted Publishers](https://docs.pypi.org/trusted-publishers/) (OIDC — no API tokens needed).

```bash
# Create a release via CLI
gh release create v0.2.0 --generate-notes
```

## License

MIT
