Metadata-Version: 2.4
Name: ollama-benchmarker
Version: 0.1.3
Summary: Comprehensive benchmark suite for Ollama models: LLM, embedding, and reranking evaluation
Project-URL: Repository, https://github.com/user/ollama-benchmarker
Author: BiWu Contributors
License-Expression: GPL-3.0-or-later
License-File: LICENSE
Keywords: benchmark,embedding,evaluation,llm,mmlu,mteb,ollama,reranking
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: click>=8.1.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: ollama>=0.4.0
Requires-Dist: rich>=13.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scipy>=1.11.0
Provides-Extra: dev
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# ollama-benchmarker

Comprehensive benchmark suite for Ollama models: LLM, embedding, and reranking evaluation

## Features

- 🧠 **LLM Benchmarks**: MMLU, MMLU-Pro, HellaSwag, ARC-Easy/Challenge, GSM8K, TruthfulQA (MC1/MC2), C-Eval, CMMLU, HumanEval, BBH
- 📊 **Embedding Benchmarks**: MTEB Classification, Clustering, Retrieval, STS; C-MTEB (Chinese)
- 🔄 **Reranking Benchmarks**: LLM Pointwise Reranking, Embedding Reranking, LLM Listwise Reranking
- 🤖 **Ollama Integration**: Direct integration with local Ollama API
- ⚡ **Interactive Selection**: Choose which models and benchmarks to run
- 📋 **Structured Output**: JSON output, rich tables, and ToolResult API pattern
- 🔧 **Agent Integration**: OpenAI function-calling tools for AI agent use

## Requirements

- Python 3.10+
- Ollama running locally (default: http://localhost:11434)
- Internet connection (to download HuggingFace datasets on the first run)

## Installation

```bash
pip install -e .
```

For development:
```bash
pip install -e ".[dev]"
```

## Quick Start

```bash
# One-command auto benchmark: detect GPU, pick models, run all applicable tests
ollama-bench auto

# Auto benchmark without interactive prompt (GPU-fitable models only)
ollama-bench auto --no-confirm

# Override VRAM for auto benchmark
ollama-bench auto --vram 24 --no-confirm

# List available benchmarks
ollama-bench list benchmarks

# List available models on Ollama
ollama-bench list models

# List only models that fit in GPU VRAM (no CPU offload needed)
ollama-bench list models --gpu-only

# Show GPU info and model VRAM analysis
ollama-bench gpu

# Run all LLM benchmarks on a model
ollama-bench run -m llama3 --category llm

# Run specific benchmarks
ollama-bench run -m llama3 -b mmlu hellaswag gsm8k

# Auto-select GPU-fitable models and run benchmarks
ollama-bench run --gpu-only --category llm

# Run full suite (all categories)
ollama-bench suite -m llama3

# Run suite only on models that fit in VRAM
ollama-bench suite --gpu-only

# Run with limited samples for quick testing
ollama-bench run -m llama3 -b mmlu -n 100

# Output results to JSON file
ollama-bench run -m llama3 --category llm --json -o results.json
```

## Usage

### CLI Commands

#### List Benchmarks

```bash
# List all benchmarks
ollama-bench list benchmarks

# List by category
ollama-bench list benchmarks --category llm
ollama-bench list benchmarks --category embedding
ollama-bench list benchmarks --category reranking
```

#### List Models

```bash
ollama-bench list models

# List only models that fit in GPU VRAM (no CPU offload)
ollama-bench list models --gpu-only
```

#### GPU & VRAM

```bash
# Show GPU info and which models can run fully on GPU
ollama-bench gpu

# Override VRAM size (useful when nvidia-smi not available)
ollama-bench gpu --vram 24

# Output GPU info as JSON
ollama-bench gpu --json
```

The `gpu` command:
- Detects NVIDIA GPUs via `nvidia-smi`
- Estimates VRAM requirements based on model parameter count and quantization
- Shows which models fit in available VRAM (no CPU offload needed)
- Lists models that require CPU offload
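The estimate above can be sketched as a first-order calculation: weight memory is parameter count times bytes per parameter at the given quantization, plus headroom for the KV cache and runtime buffers. The function below is an illustrative sketch of that idea, not the tool's actual heuristic; the 20% overhead factor is an assumption.

```python
def estimate_vram_gb(params_billion: float, quant_bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes (params * bits / 8) scaled by a
    fixed overhead multiplier for KV cache and runtime buffers."""
    weight_gb = params_billion * quant_bits / 8  # e.g. 7B at Q4 -> 3.5 GB of weights
    return weight_gb * overhead

# A 7B model at 4-bit quantization comes out around 4.2 GB by this estimate,
# so it fits on an 8 GB GPU without CPU offload; a 70B model at Q4 does not.
fits_8gb = estimate_vram_gb(7, 4) <= 8
```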

#### Auto Benchmark (Hardware-Aware)

```bash
# Interactive mode: detect GPU, show models, let you pick, run all applicable benchmarks
ollama-bench auto

# Non-interactive: auto-select all GPU-fitable models and run
ollama-bench auto --no-confirm

# Override VRAM and run non-interactive
ollama-bench auto --vram 16 --no-confirm

# Limit samples for quick test
ollama-bench auto --no-confirm -n 50

# Output results as JSON
ollama-bench auto --no-confirm --json -o auto_results.json
```

The `auto` command:
1. Detects your GPU VRAM via `nvidia-smi`
2. Lists all Ollama models with VRAM estimates
3. Presents an interactive multi-select menu (GPU-fitable models pre-selected)
4. Lets you pick models by number (e.g. `1,3,5` or `1-3` or `all` or `gpu`)
5. Automatically matches each model to its applicable benchmarks (LLM/embedding/reranking)
6. Runs all benchmarks and outputs results

Interactive selection options:
- Enter numbers: `1,3,5` or `1-3` (range)
- `all` - select all models
- `gpu` or Enter - select only GPU-fitable models (default)
- `q` - quit
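A parser for that selection syntax could look like the sketch below. This is an illustration of the grammar, not the tool's actual code; `gpu_indices` stands in for the pre-selected set of GPU-fitable models.

```python
def parse_selection(text: str, total: int, gpu_indices: list[int]) -> list[int]:
    """Parse '1,3,5', '1-3', 'all', or 'gpu'/'' into 1-based model indices."""
    text = text.strip().lower()
    if text in ("", "gpu"):          # Enter or 'gpu': GPU-fitable models only
        return gpu_indices
    if text == "all":                # every listed model
        return list(range(1, total + 1))
    indices: list[int] = []
    for part in text.split(","):
        if "-" in part:              # range like '1-3'
            lo, hi = part.split("-")
            indices.extend(range(int(lo), int(hi) + 1))
        else:
            indices.append(int(part))
    return [i for i in indices if 1 <= i <= total]  # drop out-of-range entries
```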

#### Run Benchmarks

```bash
# Run specific benchmarks on specific models
ollama-bench run -m llama3 -b mmlu hellaswag

# Run all LLM benchmarks
ollama-bench run -m llama3 --category llm

# Run on multiple models
ollama-bench run -m llama3 mistral qwen2 --category llm

# Auto-select models that fit in GPU VRAM
ollama-bench run --gpu-only --category llm

# Limit sample count
ollama-bench run -m llama3 -b mmlu -n 50

# Use custom Ollama host
ollama-bench run -m llama3 -b mmlu --host http://192.168.1.100:11434
```

#### Run Full Suite

```bash
# Full suite on a model
ollama-bench suite -m llama3

# Auto-select GPU-fitable models and run full suite
ollama-bench suite --gpu-only

# Only embedding benchmarks
ollama-bench suite -m nomic-embed-text --category embedding

# Only reranking
ollama-bench suite -m jina-reranker-v2-small --category reranking
```

### CLI Flags

| Flag | Description |
|------|-------------|
| `-V`, `--version` | Show version |
| `-v`, `--verbose` | Verbose output (debug logging) |
| `-o`, `--output` | Output to file path (JSON) |
| `--json` | JSON output format |
| `-q`, `--quiet` | Suppress non-essential output |

## Python API

```python
from ollama_benchmarker import ToolResult, list_benchmarks, run_benchmark, run_suite, discover_models, gpu_info, auto_run

# List benchmarks
result = list_benchmarks(category="llm")
print(result.success)    # True
print(result.data)       # List of benchmark info dicts

# Discover models
models = discover_models()
print(models.data)       # List of model info dicts

# Only models that fit in GPU VRAM
models = discover_models(gpu_only=True)
print(models.data)       # Only models that don't need CPU offload

# GPU information and VRAM analysis
info = gpu_info()
print(info.data["gpus"])                     # GPU details
print(info.data["gpu_fitable_models"])        # Models that fit in VRAM
print(info.data["offload_required_models"])   # Models needing CPU offload

# Auto-detect hardware, select GPU-fitable models, run all benchmarks
result = auto_run()
print(result.data["selected_models"])         # Number of models selected
print(result.data["results"])                 # Full benchmark results

# Run a benchmark
result = run_benchmark(benchmark_name="mmlu", model="llama3")
print(result.data["metric_value"])  # Accuracy score

# Run a suite
result = run_suite(models=["llama3", "mistral"], category="llm")
print(result.data)       # Full results for all models
```

## Agent Integration (OpenAI Function Calling)

```python
from ollama_benchmarker.tools import TOOLS, dispatch

# Use TOOLS in your OpenAI function-calling setup
# When you receive a tool call:
result = dispatch("ollama_bench_run_benchmark", {
    "benchmark_name": "mmlu",
    "model": "llama3"
})

# One-command auto benchmark: detect hardware, select models, run all
result = dispatch("ollama_bench_auto_run", {"gpu_only": True})

# Check GPU VRAM and filter models
result = dispatch("ollama_bench_gpu_info", {})

# Discover only GPU-fitable models
result = dispatch("ollama_bench_discover_models", {"gpu_only": True})
```
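In a full function-calling loop, the model returns tool calls whose arguments arrive as a JSON string; each call is decoded and routed to a handler. The sketch below shows that plumbing with a stand-in handler table (in practice the router is `ollama_benchmarker.tools.dispatch`); the tool-call dict shape follows the OpenAI chat-completions format.

```python
import json

# Stand-in handler table; in practice, route through ollama_benchmarker.tools.dispatch.
HANDLERS = {
    "ollama_bench_run_benchmark": lambda args: {
        "benchmark": args["benchmark_name"],
        "model": args["model"],
    },
}


def handle_tool_call(tool_call: dict) -> dict:
    """Decode an OpenAI-style tool call and route it to the named handler."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive JSON-encoded
    return HANDLERS[name](args)


# Example tool call as the chat completions API would deliver it:
call = {"function": {"name": "ollama_bench_run_benchmark",
                     "arguments": '{"benchmark_name": "mmlu", "model": "llama3"}'}}
out = handle_tool_call(call)
```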

## Benchmark Categories

### LLM Benchmarks

| Benchmark | Reference | Metric | Description |
|-----------|-----------|--------|-------------|
| MMLU | Hendrycks et al., 2020 | Accuracy | 57 subjects, multiple-choice |
| MMLU-Pro | Wang et al., 2024 | Accuracy | Harder MMLU with 10 choices |
| HellaSwag | Zellers et al., 2019 | Accuracy | Commonsense NLI sentence completion |
| ARC-Easy | Clark et al., 2018 | Accuracy | Grade-school science (easy) |
| ARC-Challenge | Clark et al., 2018 | Accuracy | Grade-school science (hard) |
| GSM8K | Cobbe et al., 2021 | Accuracy | Math word problems |
| TruthfulQA MC1 | Lin et al., 2021 | Accuracy | Single-true truthfulness |
| TruthfulQA MC2 | Lin et al., 2021 | Accuracy | Multiple-true truthfulness |
| C-Eval | Huang et al., 2023 | Accuracy | Chinese multi-discipline |
| CMMLU | Li et al., 2023 | Accuracy | Chinese massive multitask |
| HumanEval | Chen et al., 2021 | pass@1 | Code generation |
| BBH | Suzgun et al., 2022 | Accuracy | Hard reasoning tasks |

### Embedding Benchmarks

| Benchmark | Reference | Metric | Description |
|-----------|-----------|--------|-------------|
| embed_classification | Muennighoff et al., 2022 | Accuracy | k-NN classification |
| embed_clustering | Muennighoff et al., 2022 | V-measure | k-Means clustering |
| embed_retrieval | Muennighoff et al., 2022 | NDCG@10 | Cosine-similarity retrieval |
| embed_sts | Muennighoff et al., 2022 | Spearman | Semantic textual similarity |
| cmteb | Xiao et al., 2023 | Accuracy | Chinese MTEB |

### Reranking Benchmarks

| Benchmark | Reference | Metric | Description |
|-----------|-----------|--------|-------------|
| llm_reranking | Askari et al., 2023 | NDCG@5 | LLM pointwise reranking |
| embed_reranking | Muennighoff et al., 2022 | NDCG@5 | Embedding cosine reranking |
| llm_listwise_reranking | Askari et al., 2023 | NDCG@5 | LLM listwise reranking |

## References

All benchmark papers are available in the `pdf/` directory:

| Paper | arXiv ID |
|-------|----------|
| MMLU | 2009.03300 |
| HellaSwag | 1905.07830 |
| ARC | 1803.05457 |
| GSM8K | 2110.14168 |
| TruthfulQA | 2109.07958 |
| CMMLU | 2306.09212 |
| C-Eval | 2305.08322 |
| HumanEval | 2107.03374 |
| MMLU-Pro | 2406.01574 |
| BBH | 2210.09261 |
| HELM | 2211.09110 |
| MTEB | 2210.07316 |
| C-MTEB | 2309.07597 |
| BEIR | 2104.08663 |
| RankLLM | 2310.18548 |
| Jina ColBERT | 2402.14759 |

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Lint
ruff format . && ruff check .

# Type check
mypy ollama_benchmarker/
```

## License

GPL-3.0-or-later