Metadata-Version: 2.4
Name: biwu
Version: 0.2.1
Summary: Comprehensive benchmark suite for LLM, embedding, and reranking models across Ollama, GGUF, ModelScope, and HuggingFace backends
Project-URL: Repository, https://github.com/user/biwu
Author: BiWu Contributors
License-Expression: GPL-3.0-or-later
License-File: LICENSE
Keywords: benchmark,embedding,evaluation,gguf,huggingface,llm,mmlu,modelscope,mteb,ollama,reranking
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: datasets>=2.14.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: rich>=13.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scipy>=1.11.0
Provides-Extra: all
Requires-Dist: huggingface-hub>=0.20.0; extra == 'all'
Requires-Dist: llama-cpp-python>=0.2.0; extra == 'all'
Requires-Dist: modelscope>=1.9.0; extra == 'all'
Requires-Dist: ollama>=0.4.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: gguf
Requires-Dist: llama-cpp-python>=0.2.0; extra == 'gguf'
Provides-Extra: huggingface
Requires-Dist: huggingface-hub>=0.20.0; extra == 'huggingface'
Requires-Dist: llama-cpp-python>=0.2.0; extra == 'huggingface'
Provides-Extra: modelscope
Requires-Dist: llama-cpp-python>=0.2.0; extra == 'modelscope'
Requires-Dist: modelscope>=1.9.0; extra == 'modelscope'
Provides-Extra: ollama
Requires-Dist: ollama>=0.4.0; extra == 'ollama'
Description-Content-Type: text/markdown

# BiWu (比武)

BiWu is a benchmark suite for LLM, embedding, and reranking models. It runs against Ollama, GGUF, ModelScope, and HuggingFace backends.

## Features

- 🧠 **LLM Benchmarks**: MMLU, MMLU-Pro, HellaSwag, ARC-Easy/Challenge, GSM8K, TruthfulQA (MC1/MC2), C-Eval, CMMLU, HumanEval, BBH
- 📊 **Embedding Benchmarks**: MTEB Classification, Clustering, Retrieval, STS; C-MTEB (Chinese)
- 🔄 **Reranking Benchmarks**: LLM Pointwise Reranking, Embedding Reranking, LLM Listwise Reranking
- 🔌 **Multi-Backend**: Ollama, GGUF (llama-cpp-python), ModelScope, HuggingFace
- ⚡ **Interactive Selection**: Choose which models and benchmarks to run
- 📋 **Structured Output**: JSON output, rich tables, and ToolResult API pattern
- 🔧 **Agent Integration**: OpenAI function-calling tools for AI agent use
- 🎮 **GPU VRAM-Aware**: Detects the GPU automatically and can restrict runs to models that fit in VRAM

## Requirements

- Python 3.10+
- For Ollama backend: Ollama running locally (default: http://localhost:11434)
- For GGUF backend: `llama-cpp-python` with CUDA support
- For ModelScope backend: `modelscope` + `llama-cpp-python`
- For HuggingFace backend: `huggingface-hub` + `llama-cpp-python`
- Internet connection (for dataset download, first run only)

## Installation

```bash
pip install -e .
```

With specific backends:
```bash
pip install -e ".[ollama]"      # Ollama backend
pip install -e ".[gguf]"        # GGUF backend
pip install -e ".[modelscope]"  # ModelScope backend
pip install -e ".[huggingface]" # HuggingFace backend
pip install -e ".[all]"         # All backends
pip install -e ".[dev]"         # Dev dependencies
```
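
After installing, it can be useful to confirm which optional backends are importable. The sketch below is not part of BiWu: the helper `available_backends` is hypothetical, and the mapping assumes the usual import names for the packages in the extras above (`llama-cpp-python` imports as `llama_cpp`, `huggingface-hub` as `huggingface_hub`).

```python
from importlib.util import find_spec

# Assumed mapping from each extra to the modules it needs at import time.
BACKEND_MODULES = {
    "ollama": ["ollama"],
    "gguf": ["llama_cpp"],
    "modelscope": ["modelscope", "llama_cpp"],
    "huggingface": ["huggingface_hub", "llama_cpp"],
}


def available_backends() -> dict:
    """Report, per backend, whether all of its modules can be located."""
    return {
        backend: all(find_spec(module) is not None for module in modules)
        for backend, modules in BACKEND_MODULES.items()
    }


if __name__ == "__main__":
    for backend, ok in available_backends().items():
        print(f"{backend:12s} {'installed' if ok else 'missing'}")
```

A backend reported as missing can be added with the matching `pip install -e ".[...]"` line above.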

## Quick Start

```bash
# One-command auto benchmark: detect GPU, pick models, run all applicable tests
biwu auto

# Auto benchmark without interactive prompt (only models that fit in GPU VRAM)
biwu auto --no-confirm

# Override VRAM for auto benchmark
biwu auto --vram 24 --no-confirm

# List available benchmarks
biwu list benchmarks

# List available models on Ollama
biwu list models

# List models that fit in GPU VRAM
biwu list models --gpu-only

# Show GPU info and model VRAM analysis
biwu gpu

# Run all LLM benchmarks on a model (Ollama)
biwu run -m llama3 --category llm

# Run benchmarks on a GGUF model
biwu run -m /path/to/model.gguf --backend gguf --category llm

# Run benchmarks on a ModelScope model
biwu run -m Qwen/Qwen2-7B-Instruct-GGUF --backend modelscope --category llm

# Run benchmarks on a HuggingFace model
biwu run -m TheBloke/Llama-2-7B-GGUF --backend huggingface --category llm

# Run specific benchmarks
biwu run -m llama3 -b mmlu hellaswag gsm8k

# Auto-select models that fit in GPU VRAM and run benchmarks
biwu run --gpu-only --category llm

# Run full suite (all categories)
biwu suite -m llama3

# Run suite only on models that fit in VRAM
biwu suite --gpu-only

# Run with limited samples for quick testing
biwu run -m llama3 -b mmlu -n 100

# Output results to JSON file
biwu run -m llama3 --category llm --json -o results.json
```

## Usage

### CLI Commands

#### List Benchmarks

```bash
biwu list benchmarks
biwu list benchmarks --category llm
biwu list benchmarks --category embedding
biwu list benchmarks --category reranking
```

#### List Models

```bash
# List models on Ollama (default backend)
biwu list models

# List models that fit in GPU VRAM
biwu list models --gpu-only

# List models for a specific backend
biwu list models --backend gguf
biwu list models --backend modelscope
```

#### GPU & VRAM

```bash
biwu gpu
biwu gpu --vram 24
biwu gpu --json
```

#### Auto Benchmark (Hardware-Aware)

```bash
# Interactive mode
biwu auto

# Non-interactive: auto-select all models that fit in GPU VRAM
biwu auto --no-confirm

# Override VRAM
biwu auto --vram 16 --no-confirm

# Use a specific backend
biwu auto --backend gguf --no-confirm
biwu auto --backend modelscope --no-confirm
```

The `auto` command:
1. Detects your GPU VRAM via `nvidia-smi`
2. Lists models with VRAM estimates
3. Presents an interactive multi-select menu (models that fit in GPU VRAM are pre-selected)
4. Matches each model to its applicable benchmarks (LLM/embedding/reranking)
5. Runs all benchmarks and outputs results
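
Steps 1 and 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not BiWu's implementation: the function names are hypothetical, the per-model VRAM estimates are supplied by the caller, and the `nvidia-smi` query reports total memory in MiB.

```python
import subprocess


def parse_vram_mib(nvidia_smi_output: str) -> list:
    """Parse one total-memory value (MiB) per GPU, skipping non-numeric lines."""
    lines = (line.strip() for line in nvidia_smi_output.splitlines())
    return [int(line) for line in lines if line.isdigit()]


def detect_vram_gib() -> float:
    """Return the VRAM of the largest GPU in GiB, or 0.0 if nvidia-smi fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return 0.0
    values = parse_vram_mib(out)
    return max(values) / 1024 if values else 0.0


def split_by_fit(models: dict, vram_gib: float) -> tuple:
    """Split models (name -> estimated GiB) into those that fit and the rest."""
    fits = [name for name, need in models.items() if need <= vram_gib]
    offload = [name for name, need in models.items() if need > vram_gib]
    return fits, offload
```

Passing a fixed number instead of `detect_vram_gib()` plays the same role as the `--vram` override.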

At the selection prompt, enter comma-separated indices (`1,3,5`), a range (`1-3`), `all`, `gpu` (the default: only the models that fit in GPU VRAM), or `q` to quit.
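
A parser for this selection syntax might look like the sketch below. It is a hypothetical helper, not BiWu's code, and it assumes that indices are 1-based, that empty input means the default, and that `q` cancels the run.

```python
from typing import Optional


def parse_selection(text: str, count: int, gpu_default: list) -> Optional[list]:
    """Turn a selection string into sorted 1-based indices.

    Returns None when the user quits. An empty string or "gpu" yields
    gpu_default; "all" yields every index from 1 to count.
    """
    text = text.strip().lower()
    if text == "q":
        return None
    if text in ("", "gpu"):
        return sorted(gpu_default)
    if text == "all":
        return list(range(1, count + 1))

    chosen = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "1-3" covers both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start > end or start < 1 or end > count:
            raise ValueError(f"selection out of range: {part!r}")
        chosen.update(range(start, end + 1))
    return sorted(chosen)
```

Malformed input raises `ValueError`, so a caller can catch it and prompt again.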

#### Run Benchmarks

```bash
# Ollama (default)
biwu run -m llama3 -b mmlu hellaswag

# GGUF
biwu run -m /path/to/model.gguf --backend gguf --category llm

# ModelScope
biwu run -m Qwen/Qwen2-7B-Instruct-GGUF --backend modelscope --category llm

# HuggingFace
biwu run -m TheBloke/Llama-2-7B-GGUF --backend huggingface --category llm

# Multiple models
biwu run -m llama3 mistral qwen2 --category llm

# GPU-only models
biwu run --gpu-only --category llm

# Custom Ollama host
biwu run -m llama3 -b mmlu --host http://192.168.1.100:11434
```
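
Before pointing `--host` at a remote machine, it may help to check that the Ollama server is reachable. The sketch below talks to Ollama's HTTP API directly (`GET /api/tags` lists local models) using only the standard library. The function names are hypothetical, and how BiWu itself queries Ollama is not shown here.

```python
import json
from urllib.request import urlopen


def model_names(tags_payload: dict) -> list:
    """Extract model names from an Ollama /api/tags response body."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def list_ollama_models(host: str = "http://localhost:11434",
                       timeout: float = 5.0) -> list:
    """Query the Ollama server at `host` and return its local model names."""
    url = host.rstrip("/") + "/api/tags"
    with urlopen(url, timeout=timeout) as response:
        return model_names(json.load(response))
```

If the host is unreachable, `list_ollama_models("http://192.168.1.100:11434")` raises `urllib.error.URLError`, which points to a network or server problem rather than a benchmark problem.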

#### Run Full Suite

```bash
biwu suite -m llama3
biwu suite --gpu-only
biwu suite -m nomic-embed-text --category embedding
biwu suite -m jina-reranker-v2-small --category reranking
biwu suite -m model.gguf --backend gguf --category full
```

### CLI Flags

| Flag | Description |
|------|-------------|
| `-V`, `--version` | Show version |
| `-v`, `--verbose` | Verbose output (debug logging) |
| `-o`, `--output` | Write results to the given file path (JSON) |
| `--json` | JSON output format |
| `-q`, `--quiet` | Suppress non-essential output |

## Python API

```python
from biwu import ToolResult, list_benchmarks, run_benchmark, run_suite, discover_models, gpu_info, auto_run

# List benchmarks
result = list_benchmarks(category="llm")
print(result.success)    # True
print(result.data)       # List of benchmark info dicts

# Discover models (Ollama by default)
models = discover_models()
print(models.data)       # List of model info dicts

# Discover models on specific backend
models = discover_models(backend="gguf")
models = discover_models(backend="modelscope")
models = discover_models(backend="huggingface")

# GPU-only models
models = discover_models(gpu_only=True)

# GPU information
info = gpu_info()
print(info.data["gpus"])
print(info.data["gpu_fitable_models"])
print(info.data["offload_required_models"])

# Auto-detect hardware, select models that fit in GPU VRAM, run all benchmarks
result = auto_run(gpu_only=True)

# Run a benchmark (Ollama)
result = run_benchmark(benchmark_name="mmlu", model="llama3")

# Run a benchmark (GGUF)
result = run_benchmark(benchmark_name="mmlu", model="/path/to/model.gguf", backend="gguf")

# Run a benchmark (ModelScope)
result = run_benchmark(benchmark_name="mmlu", model="Qwen/Qwen2-7B-Instruct-GGUF", backend="modelscope")

# Run a suite
result = run_suite(models=["llama3", "mistral"], category="llm")
```
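
Because every call returns a result object with `success` and `data`, callers usually want one place that checks the flag. The sketch below shows one way to do that. `FakeResult` is a stand-in so the example runs without BiWu installed, `data_or_raise` is a hypothetical helper, and the benchmark dict shape is invented for illustration. Only the `success` and `data` attributes are taken from the examples above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeResult:
    """Stand-in with the two attributes shown above; not BiWu's ToolResult."""
    success: bool
    data: Any = None


def data_or_raise(result: Any, action: str) -> Any:
    """Return result.data, or raise if the call reported failure."""
    if not result.success:
        raise RuntimeError(f"{action} failed")
    return result.data


# With BiWu installed, the call would look like:
#   benchmarks = data_or_raise(list_benchmarks(category="llm"), "list_benchmarks")
benchmarks = data_or_raise(FakeResult(True, [{"name": "mmlu"}]), "list_benchmarks")
print(benchmarks)
```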

## Agent Integration (OpenAI Function Calling)

```python
from biwu.tools import TOOLS, dispatch

# Use TOOLS in your OpenAI function-calling setup
result = dispatch("biwu_run_benchmark", {
    "benchmark_name": "mmlu",
    "model": "llama3"
})

# Auto benchmark
result = dispatch("biwu_auto_run", {"gpu_only": True})

# GPU info
result = dispatch("biwu_gpu_info", {})

# Discover models (with backend)
result = dispatch("biwu_discover_models", {"backend": "ollama", "gpu_only": True})

# Discover GGUF models
result = dispatch("biwu_discover_models", {"backend": "gguf"})
```
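
In an agent loop, each tool call returned by the model has to be routed to `dispatch` and its result sent back as text. The sketch below shows one possible glue function. `handle_tool_call` is hypothetical, the dispatcher is passed in so the example runs without BiWu, and serializing only `success` and `data` is an assumption about what the result exposes.

```python
import json
from typing import Any, Callable


def handle_tool_call(name: str, arguments_json: str,
                     dispatch_fn: Callable[[str, dict], Any]) -> str:
    """Run one tool call and return a JSON string for the tool-result message."""
    try:
        arguments = json.loads(arguments_json or "{}")
        result = dispatch_fn(name, arguments)
        # Assumption: the result exposes .success and .data like ToolResult.
        payload = {"success": result.success, "data": result.data}
    except Exception as exc:
        # Report the failure to the model instead of crashing the agent loop.
        payload = {"success": False, "error": str(exc)}
    return json.dumps(payload, default=str)
```

With the OpenAI Chat Completions API, `name` and `arguments_json` would come from `tool_call.function.name` and `tool_call.function.arguments`, and `dispatch_fn` would be `biwu.tools.dispatch`.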

## Benchmark Categories

### LLM Benchmarks

| Benchmark | Reference | Metric | Description |
|-----------|-----------|--------|-------------|
| MMLU | Hendrycks et al., 2020 | Accuracy | 57 subjects, multiple-choice |
| MMLU-Pro | Wang et al., 2024 | Accuracy | Harder MMLU with 10 choices |
| HellaSwag | Zellers et al., 2019 | Accuracy | Commonsense NLI sentence completion |
| ARC-Easy | Clark et al., 2018 | Accuracy | Grade-school science (easy) |
| ARC-Challenge | Clark et al., 2018 | Accuracy | Grade-school science (hard) |
| GSM8K | Cobbe et al., 2021 | Accuracy | Math word problems |
| TruthfulQA MC1 | Lin et al., 2021 | Accuracy | Single-true truthfulness |
| TruthfulQA MC2 | Lin et al., 2021 | Accuracy | Multiple-true truthfulness |
| C-Eval | Huang et al., 2023 | Accuracy | Chinese multi-discipline |
| CMMLU | Li et al., 2023 | Accuracy | Chinese massive multitask |
| HumanEval | Chen et al., 2021 | pass@1 | Code generation |
| BBH | Suzgun et al., 2022 | Accuracy | Hard reasoning tasks |
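
For reference, the pass@k metric reported for HumanEval can be computed with the unbiased estimator from Chen et al., 2021. The sketch below is a plain restatement of that formula. Whether BiWu draws several samples per problem is not stated here, so treat the multi-sample case as an assumption. With a single sample per problem, pass@1 is simply the fraction of problems solved.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples per problem, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product.
    estimate = 1.0
    for i in range(n - c + 1, n + 1):
        estimate *= 1.0 - k / i
    return 1.0 - estimate
```

The benchmark score is the mean of this value over all problems.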

### Embedding Benchmarks

| Benchmark | Reference | Metric | Description |
|-----------|-----------|--------|-------------|
| embed_classification | Muennighoff et al., 2022 | Accuracy | k-NN classification |
| embed_clustering | Muennighoff et al., 2022 | V-measure | k-Means clustering |
| embed_retrieval | Muennighoff et al., 2022 | NDCG@10 | Cosine-similarity retrieval |
| embed_sts | Muennighoff et al., 2022 | Spearman | Semantic textual similarity |
| cmteb | Xiao et al., 2023 | Accuracy | Chinese MTEB |
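
To make the retrieval row concrete, cosine-similarity retrieval ranks documents by the angle between their embedding and the query embedding. The sketch below uses plain Python lists and hypothetical function names. It illustrates the technique and is not BiWu's implementation.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query: list, docs: dict, top_k: int = 10) -> list:
    """Return up to top_k document ids, most similar to the query first."""
    ranked = sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]),
                    reverse=True)
    return ranked[:top_k]
```

The resulting order is what a ranking metric such as NDCG@10 then scores against the relevance labels.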

### Reranking Benchmarks

| Benchmark | Reference | Metric | Description |
|-----------|-----------|--------|-------------|
| llm_reranking | Askari et al., 2023 | NDCG@5 | LLM pointwise reranking |
| embed_reranking | Muennighoff et al., 2022 | NDCG@5 | Embedding cosine reranking |
| llm_listwise_reranking | Askari et al., 2023 | NDCG@5 | LLM listwise reranking |
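
NDCG@k, the metric in the table above, rewards placing relevant documents near the top of the ranking. The sketch below uses the linear-gain form and computes the ideal ordering from the same candidate list, which suits reranking a fixed set. The function names are hypothetical, and BiWu's exact variant may differ.

```python
import math


def dcg_at_k(relevances: list, k: int) -> float:
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances: list, k: int) -> float:
    """NDCG@k for relevance labels listed in the order the model ranked them."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists documents from most to least relevant scores 1.0, and a list with no relevant documents scores 0.0.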

## References

All benchmark papers are available in the `pdf/` directory:

| Paper | arXiv ID |
|-------|----------|
| MMLU | 2009.03300 |
| HellaSwag | 1905.07830 |
| ARC | 1803.05457 |
| GSM8K | 2110.14168 |
| TruthfulQA | 2109.07958 |
| CMMLU | 2306.09212 |
| C-Eval | 2305.08322 |
| HumanEval | 2107.03374 |
| MMLU-Pro | 2406.01574 |
| BBH | 2210.09261 |
| HELM | 2211.09110 |
| MTEB | 2210.07316 |
| C-MTEB | 2309.07597 |
| BEIR | 2104.08663 |
| RankLLM | 2310.18548 |
| Jina ColBERT | 2402.14759 |

## Development

```bash
pip install -e ".[dev]"
pytest tests/ -v
ruff format . && ruff check .
mypy biwu/
```

## License

GPL-3.0-or-later