Metadata-Version: 2.4
Name: agent-vendor-verifier
Version: 0.1.0
Summary: Benchmark framework for comparing LLM agent tool-call quality across vendors.
Keywords: llm,agent,benchmark,tool-calling,vendor-evaluation
License-Expression: Apache-2.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Dist: jsonschema>=4.26.0
Requires-Dist: loguru>=0.7.3
Requires-Dist: megfile>=5.0.9
Requires-Dist: openai>=2.7.0
Requires-Dist: python-dotenv>=1.2.1
Requires-Dist: pyyaml>=6.0.3
Requires-Dist: requests>=2.32.5
Requires-Dist: tqdm>=4.67.1
Requires-Dist: transformers>=5.3.0
Requires-Dist: typing-extensions>=4.15.0
Requires-Python: >=3.10
Project-URL: Homepage, https://github.com/Prism-Shadow/Agent-Vendor-Verifier
Project-URL: Issues, https://github.com/Prism-Shadow/Agent-Vendor-Verifier/issues
Project-URL: Repository, https://github.com/Prism-Shadow/Agent-Vendor-Verifier
Description-Content-Type: text/markdown

# Agent Vendor Verifier

**Agent Vendor Verifier** is a benchmarking framework for evaluating **agent tool-call effectiveness** along several dimensions: tool-call correctness, schema compliance, request success and stability, and latency and throughput.

The benchmark aggregates these dimensions into a single, comparable **fusion score (IRF)**, enabling fair **cross-vendor comparison** of agent-style tool usage.

Agent Vendor Verifier is built upon [K2-Vendor-Verifier](https://github.com/MoonshotAI/K2-Vendor-Verifier).


## Why Agent Vendor Verifier?

* **Multi-dimensional metrics**: Go beyond “was a tool called” by measuring:
  - correctness,
  - schema compliance,
  - request success and stability,
  - latency and throughput.
* **Comparable fusion score**: Combine heterogeneous metrics into a single score for ranking and model/vendor selection.

## Metrics

For each sample, the benchmark records the `finish_reason` (e.g. `tool_calls`, `stop`, `others`) and optional tool-call validation results.  

| Metric | What it Evaluates | Direction |
|------|------------------|------------------------|
| **F1 Score** | Whether a model **triggers tool calls on the right samples**, compared against a designated baseline vendor | Higher is better |
| **Success Rate** | Whether requests **successfully complete without API or runtime errors** | Higher is better |
| **Schema Accuracy** | Whether **generated tool-call arguments conform to the declared JSON Schema** | Higher is better |
| **Avg Token** | **Token usage efficiency** per request (prompt + completion) | Lower is better |
| **Avg TTFT** | **Responsiveness**: time from request to first token (ms) | Lower is better |
| **TPS** | **Generation performance** during decoding (e.g. tokens/s) | Higher is better |
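
For illustration, a per-sample record could look roughly like the sketch below. The field names are assumptions for illustration only; the project's actual log format may differ.

```python
# Illustrative per-sample record (field names are assumptions, not the exact log format).
sample_result = {
    "sample_id": "tool-calls-000123",   # identifier of the benchmark sample
    "finish_reason": "tool_calls",      # "tool_calls", "stop", or others
    "tool_calls_valid": True,           # arguments passed JSON Schema validation
    "prompt_tokens": 842,               # usage reported by the API
    "completion_tokens": 117,
    "ttft_ms": 1268.4,                  # time to first token, in milliseconds
    "tps": 62.7,                        # decoding speed, tokens per second
    "error": None,                      # set when the request failed
}
```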


### F1 Score

F1 score measures whether a model **triggers tool calls on the correct samples**, compared against a designated **baseline vendor** for the same model. 
- A higher F1 indicates closer alignment with the baseline on **when to issue tool calls**.

| Model | Baseline Vendor |
| ----- | --------------- |
| Claude | Anthropic |
| Gemini | Google |
| DeepSeek | DeepSeek |
| MiniMax | MiniMax |
| GLM | Bigmodel |
| Kimi | Moonshot |



For the F1 computation, a sample whose response ended with `tool_calls` is treated as **positive**, and one that ended with `stop` or others as **negative**.

| Case | Baseline result | Current vendor result | Meaning |
|------|-----------------|------------------------|---------|
| **TP** | `tool_calls` | `tool_calls` | Both agree to trigger a tool call |
| **FP** | `stop` / others | `tool_calls` | False trigger |
| **FN** | `tool_calls` | `stop` / others | Missed trigger |
| **TN** | `stop` / others | `stop` / others | Both agree not to trigger |

The baseline is treated as **ground truth** for whether a tool call should occur.

F1 score computation:

$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$

$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

$$
\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

Note: The baseline vendor’s F1 is **1.0** by definition, since it is compared against itself.
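
As a minimal sketch, assuming each result has been reduced to its `finish_reason` string, the computation looks roughly like this (function and variable names are illustrative, not the project's internal API):

```python
def tool_call_f1(baseline_reasons, vendor_reasons):
    """F1 of tool-call triggering against a baseline vendor.

    Both arguments are lists of finish_reason strings aligned by sample;
    "tool_calls" is treated as the positive class.
    """
    tp = fp = fn = 0
    for base, cur in zip(baseline_reasons, vendor_reasons):
        base_pos = base == "tool_calls"
        cur_pos = cur == "tool_calls"
        if base_pos and cur_pos:
            tp += 1          # both trigger a tool call
        elif cur_pos:
            fp += 1          # false trigger
        elif base_pos:
            fn += 1          # missed trigger
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```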

### Success Rate

Success rate measures the proportion of requests that complete successfully, without API errors, timeouts, or runtime failures.

Success rate computation:

$$
\text{Success Rate} = \frac{\text{successful requests}}{\text{successful requests} + \text{failed requests}}
$$

### Schema Accuracy

Schema accuracy measures whether the arguments generated in tool calls conform to the declared JSON Schema provided in the request.
- Schema accuracy directly reflects tool-call output quality.

Schema accuracy computation:

$$
\text{Schema Accuracy} = \frac{\text{valid tool calls}}{\text{total tool calls}}
$$
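
A minimal sketch of such a check using the `jsonschema` dependency; the project's actual validator may apply additional rules (for example, handling arguments that are not valid JSON at all):

```python
import json
from jsonschema import Draft202012Validator


def schema_accuracy(tool_calls, schemas_by_tool):
    """Fraction of tool calls whose arguments validate against the declared schema.

    tool_calls: list of (tool_name, arguments_json_string) pairs.
    schemas_by_tool: mapping from tool name to its parameters JSON Schema.
    """
    valid = 0
    for name, args_json in tool_calls:
        try:
            args = json.loads(args_json)
            Draft202012Validator(schemas_by_tool[name]).validate(args)
            valid += 1
        except Exception:
            pass  # invalid JSON, unknown tool, or schema violation
    return valid / len(tool_calls) if tool_calls else 0.0
```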

### Average Tokens

Average tokens measures the average **total token usage per request**, including both prompt tokens and completion tokens.

- Lower values indicate better token efficiency.

### Average TTFT

Average TTFT (Time to First Token) measures the average latency, in milliseconds, from request submission to the arrival of the first generated token.
- Lower values indicate faster initial response.

### TPS

TPS (tokens per second) measures the average token generation rate during the decoding phase.
- Higher values indicate faster decoding performance.
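
As a rough sketch, these efficiency metrics could be derived from per-request measurements as follows (the record fields below are assumptions for illustration):

```python
def aggregate_efficiency(requests):
    """Aggregate Avg Token, Avg TTFT, and TPS over a non-empty list of request records.

    Each record is assumed to carry token usage and timing information, e.g.
    {"prompt_tokens": ..., "completion_tokens": ..., "ttft_ms": ..., "decode_seconds": ...}.
    """
    n = len(requests)
    avg_tokens = sum(r["prompt_tokens"] + r["completion_tokens"] for r in requests) / n
    avg_ttft_ms = sum(r["ttft_ms"] for r in requests) / n
    # TPS: completion tokens generated per second of decoding time.
    tps = sum(r["completion_tokens"] for r in requests) / sum(r["decode_seconds"] for r in requests)
    return avg_tokens, avg_ttft_ms, tps
```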


Together, these six metrics form the basis for the IRF fusion score used for final ranking.


## Fusion Score: IRF (Inverse Rank Fusion)

To combine heterogeneous metrics into a single score, Agent Vendor Verifier uses **Inverse Rank Fusion (IRF)** across six metrics:
- F1 Score  
- Success Rate  
- Schema Accuracy  
- Avg Token
- Avg TTFT  
- TPS

Together they cover **trigger correctness, request stability, argument validity, cost, and performance**, making IRF a balanced indicator of agent tool-call effectiveness.

### Step 1: Rank per metric

For each metric, rank all participating entities:

- **Higher is better**: F1, Success Rate, Schema Accuracy, TPS → descending order
- **Lower is better**: Avg Token, Avg TTFT → ascending order

### Step 2: Assign ranks

Each entity receives a rank $r$ starting from 1 (ties are handled by standard ranking rules).

### Step 3: IRF contribution

For each metric where an entity has a value, compute:

$$
\text{contribution} = \frac{1}{r + k}
$$

where **k = 5**, chosen so that the IRF score falls in $(0, 1]$: with six metrics, a top rank on every metric gives $6 \times \frac{1}{1 + 5} = 1$.

The final **IRF score** is the sum over all participating metrics:

$$
\text{IRF} = \sum_{\text{metrics}} \frac{1}{r_{\text{metric}} + 5}
$$

- Better ranks across more metrics yield higher IRF scores.
- The IRF score is comparable across different vendors serving the same model.
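
A minimal sketch of the IRF computation, with metric names as in the table above; tie handling here uses simple competition ranking and may differ from the project's exact rules:

```python
K = 5  # with six metrics, a top rank on every metric gives 6 * 1/(1+5) = 1

HIGHER_IS_BETTER = {
    "f1": True, "success_rate": True, "schema_accuracy": True,
    "tps": True, "avg_token": False, "avg_ttft": False,
}


def irf_scores(results):
    """results: mapping vendor -> {metric: value}. Returns vendor -> IRF score."""
    scores = {vendor: 0.0 for vendor in results}
    for metric, higher_better in HIGHER_IS_BETTER.items():
        for vendor, metrics in results.items():
            value = metrics[metric]
            # Competition ranking: rank = 1 + number of strictly better entities.
            better = sum(
                1 for other in results.values()
                if (other[metric] > value if higher_better else other[metric] < value)
            )
            scores[vendor] += 1.0 / (better + 1 + K)
    return scores
```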

## Performance

We evaluated the tool-call effectiveness of the latest models across multiple vendors and ranked them using the IRF score. The results are shown below:


### claude-haiku-4.5

| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
| --- | --- | --- | --- | --- | --- | --- | --- |
| anthropic (openrouter) | 0.8635 | 1 | 1 | 99.36 | 0.9346 | 3897 | 2571 |
| foxcode | 0.8357 | 1 | 0.7513 | 94.21 | 0.9891 | 4161 | 1945 |
| packyapi | 0.8040 | 1 | 0.7590 | 103.7 | 0.8989 | 4500 | 2504 |
| yunwu | 0.7583 | 1 | 0.7356 | 22.3 | 0.9041 | 4969 | 1412 |


### claude-opus-4.5

| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
| --- | --- | --- | --- | --- | --- | --- | --- |
| anthropic (openrouter) | 0.9217 | 1 | 1 | 50.29 | 0.9714 | 4696 | 2585 |
| packyapi | 0.8741 | 1 | 0.7959 | 53.28 | 0.9082 | 5062 | 2510 |
| yunwu | 0.8095 | 0.9967 | 0.8384 | 18.71 | 0.9024 | 6105 | 1435 |


### claude-sonnet-4.5

| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
| --- | --- | --- | --- | --- | --- | --- | --- |
| foxcode | 0.8774 | 1 | 0.8454 | 51.94 | 0.8000 | 4985 | 1861 |
| anthropic (openrouter) | 0.8357 | 1 | 1 | 33.95 | 0.8121 | 5175 | 2398 |
| yunwu | 0.8278 | 1 | 0.8304 | 18.94 | 0.8593 | 5649 | 1399 |
| packyapi | 0.7206 | 1 | 0.8293 | 48.36 | 0.7692 | 5676 | 2458 |


### deepseek-v3.2

| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
| --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek (openrouter) | 0.8234 | 0.9870 | 1 | 62.77 | 0.9854 | 1268 | 1859 |
| google-vertex (openrouter) | 0.7885 | 0.9995 | 0.6216 | 101.9 | 0.8480 | 777.3 | 1968 |
| siliconflow (openrouter) | 0.7706 | 0.9980 | 0.7282 | 151.5 | 0.8808 | 2465 | 2927 |
| atlascloud (openrouter) | 0.7567 | 1 | 0.7257 | 81.03 | 0.8438 | 1211 | 2368 |
| siliconflow | 0.7345 | 0.9815 | 0.7988 | 15.15 | 0.8633 | 5724 | 1719 |


### gemini-2.5-flash

| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
| --- | --- | --- | --- | --- | --- | --- | --- |
| google-vertex (openrouter) | 0.9524 | 1 | 1 | 249.3 | 0.6875 | 1651 | 1731 |
| gemini (openrouter) | 0.9048 | 0.9933 | 0.8810 | 216.1 | 0.6897 | 2626 | 1658 |


### glm-4.7

| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bigmodel | 0.9107 | 0.9725 | 1 | 123.4 | 0.8409 | 2243 | 1789 |
| z.ai (openrouter) | 0.8512 | 0.9980 | 0.8575 | 246 | 0.8291 | 4986 | 2303 |
| atlascloud (openrouter) | 0.8452 | 0.9975 | 0.8624 | 103.8 | 0.8327 | 1914 | 2311 |


### kimi-k2

| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
| --- | --- | --- | --- | --- | --- | --- | --- |
| siliconflow | 0.9158 | 1 | 0.8427 | 58.23 | 1 | 1313 | 1581 |
| siliconflow (openrouter) | 0.8690 | 0.9975 | 0.8355 | 92.81 | 0.9985 | 1115 | 1816 |
| moonshot ai (openrouter) | 0.8205 | 0.9905 | 1 | 46.39 | 1 | 2662 | 1857 |


### minimax-m2

| Vendor | IRF Score | Success Rate | F1 Score | TPS | Schema Accuracy | TTFT (ms) | Avg Token |
| --- | --- | --- | --- | --- | --- | --- | --- |
| google-vertex (openrouter) | 0.9217 | 0.9990 | 0.7662 | 347.6 | 1 | 491.7 | 2556 |
| minimax (openrouter) | 0.8512 | 0.9980 | 1 | 147.2 | 0.8205 | 1483 | 2278 |
| atlascloud (openrouter) | 0.8324 | 0.9960 | 0.7622 | 152.6 | 1 | 745.9 | 2443 |

## Dataset

We modified the open-source dataset from K2-Vendor-Verifier to fix errors that occurred when running it against other models and vendors. The dataset distribution details will be published separately.

## Supported Vendors

Agent Vendor Verifier currently supports the following vendors:
* Openrouter
* Gemini
* Siliconflow
* 429.icu
* packyapi
* yunwu
* foxcode

New vendors can be added by inheriting from the `Vendor` base class; see [the Vendor base class](./agent_vendor_verifier/vendor/vendor.py).
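
Purely as an illustration of the extension point, a new vendor adapter could look roughly like the sketch below. The import path, attribute, and method names are hypothetical; consult `vendor.py` for the real base-class interface.

```python
# Hypothetical sketch only -- consult agent_vendor_verifier/vendor/vendor.py
# for the real base-class interface before adding a vendor.
from agent_vendor_verifier.vendor.vendor import Vendor  # assumed import path


class MyVendor(Vendor):
    """Adapter for a new OpenAI-compatible endpoint (illustrative)."""

    name = "myvendor"  # hypothetical attribute; vendor name as used in config.yaml / .env

    def build_request(self, sample, config):
        # Translate a benchmark sample into the vendor's chat-completions payload.
        ...

    def parse_response(self, raw_response):
        # Extract finish_reason, tool calls, token usage, and timing data.
        ...
```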

## Supported Models

Because different vendors may use different names for the same model, Agent Vendor Verifier provides a unified model naming scheme. The models currently supported are listed below.

<table>
  <thead>
    <tr>
      <th>Model Series</th>
      <th>Unified Model Names</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Gemini</td>
      <td>gemini-2.5-flash</td>
    </tr>
    <tr>
      <td>Claude</td>
      <td>
        claude-sonnet-4.5<br>
        claude-opus-4.5<br>
        claude-haiku-4.5
      </td>
    </tr>
    <tr>
      <td>DeepSeek</td>
      <td>deepseek-v3.2</td>
    </tr>
    <tr>
      <td>GLM</td>
      <td>glm-4.7</td>
    </tr>
    <tr>
      <td>MiniMax</td>
      <td>minimax-m2.5</td>
    </tr>
    <tr>
      <td>Kimi</td>
      <td>kimi-k2.5</td>
    </tr>
  </tbody>
</table>

When using Agent Vendor Verifier for evaluation, you only need to consider the model names listed in the table above.

## Installation

```bash
pip install agent-vendor-verifier
```

The package requires Python 3.10 or newer and exposes the `agent-vendor-verifier` command-line entry point.

## Agent Vendor Verifier Usage


Prepare:

- Dataset: See [Dataset](#dataset).
- `config.yaml`: defines the models and vendors to benchmark.
- `.env`: defines the API keys.

Then run:

```bash
agent-vendor-verifier \
  --test-file-path "tool-calls/samples.jsonl" \
  --config-file-path "config.yaml" \
  --vendor-concurrency 5 \
  --request-concurrency 30 \
  --retries 10 \
  --timeout 30
```

## Configuration

### Define models and vendors to benchmark

`config.yaml` maps each unified model name to the vendor configurations evaluated for that model.

Top-level keys are model names. Each model contains a `vendors` list with the request settings for every vendor endpoint.

We provide an example; see [config_example](./config_example.yaml):

```yaml
minimax-m2.5:
  vendors:
    - name: openrouter
      model_id: minimax/minimax-m2.5
      api_key: xxxxxx
      provider: minimax/
      url: https://openrouter.ai/api/v1
      validator: openai
      is_baseline: true
```

Model names must follow this project’s unified naming convention (e.g. `gemini-2.5-flash`, `claude-sonnet-4.5`, `deepseek-v3.2`).

See [supported models](#supported-models) for details.

### Define API keys

API keys are provided via `.env`.
- Variable names must match the vendor names in `config.yaml`.

Example:

```bash
openrouter=sk-or-v1-xxxx
siliconflow=sk-xxxx
bigmodel=xxxx
gemini=xxxx
```