Metadata-Version: 2.4
Name: ollama-benchmarker
Version: 0.1.0
Summary: Comprehensive benchmark suite for Ollama models: LLM, embedding, and reranking evaluation
Project-URL: Repository, https://github.com/user/ollama-benchmarker
Author: BiWu Contributors
License-Expression: GPL-3.0-or-later
License-File: LICENSE
Keywords: benchmark,embedding,evaluation,llm,mmlu,mteb,ollama,reranking
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: click>=8.1.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: ollama>=0.4.0
Requires-Dist: rich>=13.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scipy>=1.11.0
Provides-Extra: dev
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# ollama-benchmarker

Comprehensive benchmark suite for Ollama models: LLM, embedding, and reranking evaluation

## Features

- 🧠 **LLM Benchmarks**: MMLU, MMLU-Pro, HellaSwag, ARC-Easy/Challenge, GSM8K, TruthfulQA (MC1/MC2), C-Eval, CMMLU, HumanEval, BBH
- 📊 **Embedding Benchmarks**: MTEB Classification, Clustering, Retrieval, STS; C-MTEB (Chinese)
- 🔄 **Reranking Benchmarks**: LLM Pointwise Reranking, Embedding Reranking, LLM Listwise Reranking
- 🤖 **Ollama Integration**: Direct integration with local Ollama API
- ⚡ **Interactive Selection**: Choose which models and benchmarks to run
- 📋 **Structured Output**: JSON output, rich tables, and ToolResult API pattern
- 🔧 **Agent Integration**: OpenAI function-calling tools for AI agent use

## Requirements

- Python 3.10+
- Ollama running locally (default: http://localhost:11434)
- Internet connection (to download HuggingFace datasets on the first run)

## Installation

```bash
pip install -e .
```

For development:
```bash
pip install -e ".[dev]"
```

## Quick Start

```bash
# List available benchmarks
ollama-bench list benchmarks

# List available models on Ollama
ollama-bench list models

# Run all LLM benchmarks on a model
ollama-bench run -m llama3 --category llm

# Run specific benchmarks
ollama-bench run -m llama3 -b mmlu hellaswag gsm8k

# Run full suite (all categories)
ollama-bench suite -m llama3

# Run with limited samples for quick testing
ollama-bench run -m llama3 -b mmlu -n 100

# Output results to JSON file
ollama-bench run -m llama3 --category llm --json -o results.json
```

## Usage

### CLI Commands

#### List Benchmarks

```bash
# List all benchmarks
ollama-bench list benchmarks

# List by category
ollama-bench list benchmarks --category llm
ollama-bench list benchmarks --category embedding
ollama-bench list benchmarks --category reranking
```

#### List Models

```bash
ollama-bench list models
```

#### Run Benchmarks

```bash
# Run specific benchmarks on specific models
ollama-bench run -m llama3 -b mmlu hellaswag

# Run all LLM benchmarks
ollama-bench run -m llama3 --category llm

# Run on multiple models
ollama-bench run -m llama3 mistral qwen2 --category llm

# Limit sample count
ollama-bench run -m llama3 -b mmlu -n 50

# Use custom Ollama host
ollama-bench run -m llama3 -b mmlu --host http://192.168.1.100:11434
```

#### Run Full Suite

```bash
# Full suite on a model
ollama-bench suite -m llama3

# Only embedding benchmarks
ollama-bench suite -m nomic-embed-text --category embedding

# Only reranking
ollama-bench suite -m jina-reranker-v2-small --category reranking
```

### CLI Flags

| Flag | Description |
|------|-------------|
| `-V`, `--version` | Show version |
| `-v`, `--verbose` | Verbose output (debug logging) |
| `-o`, `--output` | Output to file path (JSON) |
| `--json` | JSON output format |
| `-q`, `--quiet` | Suppress non-essential output |

## Python API

```python
from ollama_benchmarker import ToolResult, list_benchmarks, run_benchmark, run_suite, discover_models

# List benchmarks
result = list_benchmarks(category="llm")
print(result.success)    # True
print(result.data)       # List of benchmark info dicts

# Discover models
models = discover_models()
print(models.data)       # List of model info dicts

# Run a benchmark
result = run_benchmark(benchmark_name="mmlu", model="llama3")
print(result.data["metric_value"])  # Accuracy score

# Run a suite
result = run_suite(models=["llama3", "mistral"], category="llm")
print(result.data)       # Full results for all models
```
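
A small sketch of combining these calls to run every benchmark in a category against one model and collect the scores. It assumes only the API shown above plus a `name` key in each benchmark info dict, which may differ from the actual schema.

```python
from ollama_benchmarker import list_benchmarks, run_benchmark

# Run every LLM benchmark against one model and collect the metric values.
# Assumes each benchmark info dict exposes a "name" key; adjust to the real schema.
scores = {}
listing = list_benchmarks(category="llm")
if listing.success:
    for info in listing.data:
        result = run_benchmark(benchmark_name=info["name"], model="llama3")
        if result.success:
            scores[info["name"]] = result.data["metric_value"]

for name, value in sorted(scores.items()):
    print(f"{name}: {value:.3f}")
```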

## Agent Integration (OpenAI Function Calling)

```python
from ollama_benchmarker.tools import TOOLS, dispatch

# Use TOOLS in your OpenAI function-calling setup
# When you receive a tool call:
result = dispatch("ollama_bench_run_benchmark", {
    "benchmark_name": "mmlu",
    "model": "llama3"
})
```
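
A minimal sketch of wiring `TOOLS` into an OpenAI Chat Completions request, assuming `TOOLS` follows the standard function-calling schema and that the `openai` package (not a dependency of this project) is installed; the model string is a placeholder.

```python
import json

from openai import OpenAI
from ollama_benchmarker.tools import TOOLS, dispatch

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any function-calling-capable model
    messages=[{"role": "user", "content": "Benchmark llama3 on MMLU."}],
    tools=TOOLS,
)

# Execute whatever tool calls the model requested via dispatch().
for call in response.choices[0].message.tool_calls or []:
    result = dispatch(call.function.name, json.loads(call.function.arguments))
    print(call.function.name, result.success)
```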

## Benchmark Categories

### LLM Benchmarks

| Benchmark | Reference | Metric | Description |
|-----------|-----------|--------|-------------|
| MMLU | Hendrycks et al., 2020 | Accuracy | 57 subjects, multiple-choice |
| MMLU-Pro | Wang et al., 2024 | Accuracy | Harder MMLU with 10 choices |
| HellaSwag | Zellers et al., 2019 | Accuracy | Commonsense NLI sentence completion |
| ARC-Easy | Clark et al., 2018 | Accuracy | Grade-school science (easy) |
| ARC-Challenge | Clark et al., 2018 | Accuracy | Grade-school science (hard) |
| GSM8K | Cobbe et al., 2021 | Accuracy | Math word problems |
| TruthfulQA MC1 | Lin et al., 2021 | Accuracy | Single-true truthfulness |
| TruthfulQA MC2 | Lin et al., 2021 | Accuracy | Multiple-true truthfulness |
| C-Eval | Huang et al., 2023 | Accuracy | Chinese multi-discipline |
| CMMLU | Li et al., 2023 | Accuracy | Chinese massive multitask |
| HumanEval | Chen et al., 2021 | pass@1 | Code generation |
| BBH | Suzgun et al., 2022 | Accuracy | Hard reasoning tasks |
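
For orientation, here is a minimal sketch of how a single MMLU-style multiple-choice item can be scored against a local Ollama model with the `ollama` client. The prompt format and answer parsing are illustrative only, not the prompts this suite actually uses; benchmark accuracy is simply the mean of this check over the dataset.

```python
import re

import ollama

client = ollama.Client(host="http://localhost:11434")

question = "What is the capital of France?"
choices = ["Berlin", "Madrid", "Paris", "Rome"]
gold = "C"

options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
prompt = f"{question}\n{options}\nAnswer with the letter of the correct option only."

response = client.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
reply = response["message"]["content"]

# Take the first standalone A-D letter as the model's answer.
match = re.search(r"\b([ABCD])\b", reply)
predicted = match.group(1) if match else None
print("correct" if predicted == gold else "wrong", "- model answered:", predicted)
```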

### Embedding Benchmarks

| Benchmark | Reference | Metric | Description |
|-----------|-----------|--------|-------------|
| embed_classification | Muennighoff et al., 2022 | Accuracy | k-NN classification |
| embed_clustering | Muennighoff et al., 2022 | V-measure | k-Means clustering |
| embed_retrieval | Muennighoff et al., 2022 | NDCG@10 | Cosine-similarity retrieval |
| embed_sts | Muennighoff et al., 2022 | Spearman | Semantic textual similarity |
| cmteb | Xiao et al., 2023 | Accuracy | Chinese MTEB |
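
As a sketch of the retrieval metric, the snippet below embeds a query and a few documents with the `ollama` client, ranks the documents by cosine similarity, and scores the ranking with scikit-learn's `ndcg_score`. The model name, toy corpus, and relevance labels are placeholders; the real benchmark runs over MTEB-style datasets.

```python
import numpy as np
import ollama
from sklearn.metrics import ndcg_score
from sklearn.metrics.pairwise import cosine_similarity

client = ollama.Client(host="http://localhost:11434")

query = "effects of caffeine on sleep"
docs = [
    "Caffeine consumed late in the day can delay sleep onset.",
    "The 2022 World Cup was held in Qatar.",
    "Adenosine receptors are blocked by caffeine, reducing sleepiness.",
]
relevance = np.array([[2, 0, 1]])  # graded relevance labels for the query

# Embed the query and documents in one call, then rank by cosine similarity.
vectors = np.array(client.embed(model="nomic-embed-text", input=[query] + docs)["embeddings"])
scores = cosine_similarity(vectors[:1], vectors[1:])  # shape (1, n_docs)

print("NDCG@10:", ndcg_score(relevance, scores, k=10))
```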

### Reranking Benchmarks

| Benchmark | Reference | Metric | Description |
|-----------|-----------|--------|-------------|
| llm_reranking | Askari et al., 2023 | NDCG@5 | LLM pointwise reranking |
| embed_reranking | Muennighoff et al., 2022 | NDCG@5 | Embedding cosine reranking |
| llm_listwise_reranking | Askari et al., 2023 | NDCG@5 | LLM listwise reranking |
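
And a rough sketch of pointwise LLM reranking, reusing the toy query and documents from the embedding example above: each candidate is scored for relevance by the model, candidates are ordered by that score, and the ordering is evaluated with NDCG@5. The prompt wording and score parsing below are illustrative, not the suite's actual template.

```python
import re

import numpy as np
import ollama
from sklearn.metrics import ndcg_score

client = ollama.Client(host="http://localhost:11434")

query = "effects of caffeine on sleep"
docs = [
    "Caffeine consumed late in the day can delay sleep onset.",
    "The 2022 World Cup was held in Qatar.",
    "Adenosine receptors are blocked by caffeine, reducing sleepiness.",
]
relevance = np.array([[2, 0, 1]])  # graded relevance labels


def score(doc: str) -> float:
    """Ask the model for a 0-10 relevance score and parse the first number."""
    prompt = (
        f"Query: {query}\nDocument: {doc}\n"
        "On a scale of 0 to 10, how relevant is the document to the query? "
        "Reply with a single number."
    )
    reply = client.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    match = re.search(r"\d+(\.\d+)?", reply["message"]["content"])
    return float(match.group()) if match else 0.0


scores = np.array([[score(d) for d in docs]])
print("NDCG@5:", ndcg_score(relevance, scores, k=5))
```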

## References

All benchmark papers are available in the `pdf/` directory:

| Paper | arXiv ID |
|-------|----------|
| MMLU | 2009.03300 |
| HellaSwag | 1905.07830 |
| ARC | 1803.05457 |
| GSM8K | 2110.14168 |
| TruthfulQA | 2109.07958 |
| CMMLU | 2306.09212 |
| C-Eval | 2305.08322 |
| HumanEval | 2107.03374 |
| MMLU-Pro | 2406.01574 |
| BBH | 2210.09261 |
| HELM | 2211.09110 |
| MTEB | 2210.07316 |
| C-MTEB | 2307.09371 |
| BEIR | 2104.08663 |
| RankLLM | 2310.18548 |
| Jina ColBERT | 2402.14759 |

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Lint
ruff format . && ruff check .

# Type check
mypy ollama_benchmarker/
```

## License

GPL-3.0-or-later