Metadata-Version: 2.4
Name: ollama-arena
Version: 2.1.2
Summary: Pair-wise ELO evaluation arena for local LLMs.
Author: nazkari86-lab
License: MIT
Project-URL: Homepage, https://github.com/nazkari86-lab/ollama-arena
Project-URL: Repository, https://github.com/nazkari86-lab/ollama-arena
Project-URL: Issues, https://github.com/nazkari86-lab/ollama-arena/issues
Project-URL: Documentation, https://github.com/nazkari86-lab/ollama-arena#readme
Project-URL: Changelog, https://github.com/nazkari86-lab/ollama-arena/releases
Keywords: llm,ollama,benchmark,elo,evaluation
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28
Requires-Dist: rich>=13.0
Provides-Extra: web
Requires-Dist: fastapi>=0.100; extra == "web"
Requires-Dist: uvicorn[standard]>=0.22; extra == "web"
Provides-Extra: viz
Requires-Dist: plotly>=5.18; extra == "viz"
Provides-Extra: datasets
Requires-Dist: datasets>=2.14; extra == "datasets"
Requires-Dist: huggingface_hub>=0.20; extra == "datasets"
Provides-Extra: hf
Requires-Dist: transformers>=4.40; extra == "hf"
Requires-Dist: torch>=2.2; extra == "hf"
Requires-Dist: accelerate>=0.30; extra == "hf"
Provides-Extra: finetune
Requires-Dist: unsloth>=2024.8; extra == "finetune"
Requires-Dist: transformers>=4.40; extra == "finetune"
Requires-Dist: peft>=0.10; extra == "finetune"
Requires-Dist: trl>=0.9; extra == "finetune"
Requires-Dist: accelerate>=0.30; extra == "finetune"
Requires-Dist: datasets>=2.14; extra == "finetune"
Requires-Dist: torch>=2.2; extra == "finetune"
Requires-Dist: bitsandbytes>=0.43; sys_platform != "darwin" and extra == "finetune"
Provides-Extra: all
Requires-Dist: fastapi>=0.100; extra == "all"
Requires-Dist: uvicorn[standard]>=0.22; extra == "all"
Requires-Dist: plotly>=5.18; extra == "all"
Requires-Dist: datasets>=2.14; extra == "all"
Requires-Dist: huggingface_hub>=0.20; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Dynamic: license-file

# ollama-arena

A pair-wise evaluation harness for locally hosted language models. Runs
matches between two models on a shared task set, scores each response
deterministically (or with an LLM judge), and maintains an ELO rating
across runs.

```
pip install git+https://github.com/nazkari86-lab/ollama-arena.git
ollama-arena match --models llama3.2:3b,qwen2.5-coder:7b -n 20
```

```
match 1/1   llama3.2:3b  vs  qwen2.5-coder:7b
  code_001   1.00  vs  1.00   draw
  code_002   0.00  vs  1.00   B
  humaneval_3 1.00 vs 1.00    draw
  ...

rank  model                elo    W   L   D   matches  win%
1     qwen2.5-coder:7b    1271    7   1   2     10     70%
2     llama3.2:3b         1129    1   7   2     10     10%
```

## Why

When you have several local models, you want a quick answer to "which one
is better at *X*?" — without renting GPUs or signing up for a judging API.
Existing harnesses (lm-evaluation-harness, lighteval, simple-evals) are
absolute-score frameworks designed for paper-grade reporting; they are
overkill for the day-to-day "should I switch from llama3.2 to qwen2.5?"
question. ollama-arena answers that question with pair-wise battles, a
local SQLite ELO table, and built-in or HuggingFace task pools.

ELO rather than Glicko-2 because (a) the implementation is two lines, and
(b) for a moderate number of models the difference is negligible.

## Install

```
pip install git+https://github.com/nazkari86-lab/ollama-arena.git
```

Optional extras (append to the URL, or clone and `pip install '.[extra]'`):

| Extra | Adds |
|---|---|
| `[all]` | web dashboard, Plotly charts, HuggingFace datasets |
| `[hf]` | in-process TransformersBackend (torch, transformers) |
| `[finetune]` | Unsloth fine-tune pipeline — CUDA recommended |

```
# clone for extras
git clone https://github.com/nazkari86-lab/ollama-arena.git
cd ollama-arena
pip install '.[all]'
```

The HuggingFace and fine-tune extras pull large dependencies and are off by default.

## Quick start

```
ollama serve
ollama pull llama3.2:3b
ollama pull qwen2.5-coder:7b

ollama-arena match --models llama3.2:3b,qwen2.5-coder:7b --category coding -n 10
ollama-arena leaderboard
```

ELO state lives in `arena.db` in the working directory. Pass `--db` to
share a leaderboard between runs in different folders.

## Backends

Anything that exposes Ollama's native API or the OpenAI
`/v1/chat/completions` shape works without code changes:

```
ollama-arena --backend ollama   match ...        # default, :11434
ollama-arena --backend vllm     match ...        # :8000
ollama-arena --backend lmstudio match ...        # :1234
ollama-arena --backend llamacpp match ...        # :8080
ollama-arena --backend openai     --api-key sk-... match ...
ollama-arena --backend groq       --api-key gsk-... match ...
ollama-arena --backend together   --api-key tg-... match ...
ollama-arena --backend openrouter --api-key sk-or-... match ...
```

Or pass a full URL:

```
ollama-arena --backend http://192.168.1.50:8000/v1 match ...
```

A `TransformersBackend` is also available for in-process generation via
PyTorch; it is lazily imported so the dependency is optional.

## Tasks

The package ships with about 100 hand-written tasks across five
categories: coding (Python plus JS/TS/Rust/Go/C++), reasoning, security,
inspection, and planning. They are intended as a smoke-test starter pack,
not a definitive benchmark.

For serious work, load a HuggingFace dataset:

```
ollama-arena datasets                       # registered datasets
ollama-arena datasets --pull humaneval,gsm8k
ollama-arena match --dataset humaneval --models A,B -n 50
```

Registered loaders (more in `ollama_arena/datasets/loader.py`):

| name        | source            | reference                                     |
| ----------- | ----------------- | --------------------------------------------- |
| humaneval   | openai_humaneval  | Chen et al., 2021                             |
| mbpp        | mbpp              | Austin et al., 2021                           |
| mbpp_plus   | evalplus/mbppplus | Liu et al., 2023                              |
| gsm8k       | gsm8k             | Cobbe et al., 2021                            |
| mmlu        | cais/mmlu         | Hendrycks et al., 2021                        |
| bbh         | lukaemon/bbh      | Suzgun et al., 2022                           |
| multipl_e   | nuprl/MultiPL-E   | Cassano et al., 2022                          |
| hellaswag   | hellaswag         | Zellers et al., 2019                          |
| truthfulqa  | truthful_qa       | Lin et al., 2022                              |
| arc         | ai2_arc           | Clark et al., 2018                            |

Downloads are cached in `~/.cache/ollama_arena/datasets/`. Override with
`OLLAMA_ARENA_CACHE`.

## Scoring

Each task carries its own scorer:

- **coding** — extract the code block, append the task's test cases, and
  execute in the matching language sandbox. Score is 1.0 on a clean exit,
  0.0 otherwise.
- **math, knowledge** — numeric tolerance / multiple-choice letter match.
- **reasoning** — prefix or substring match against `expected_answer`.
- **security, inspection, planning** — keyword presence over an expected
  set of issues / key components.
- **open-ended** — when `task["use_judge"]` is set and the arena is
  constructed with `judge_model=...`, the LLMJudge grades each pair in
  both orderings (A then B, B then A) and averages, to suppress position
  bias. This is meaningfully more expensive — the judge is invoked twice
  per task, on top of the two model generations.

Code is executed in a subprocess with a hardened pattern filter
(`rm -rf`, `shell=True`, raw sockets, …) and a strict timeout. For
untrusted code, pass `use_docker=True` to `run_in_language()`; containers
run with `--network=none --read-only --memory=512m --cpus=1`.

## Languages

The sandbox dispatches by the `language` field on each task. Detected at
runtime from `$PATH`:

| language    | runtime needed                       |
| ----------- | ------------------------------------ |
| python      | python3                              |
| javascript  | node                                 |
| typescript  | tsx, ts-node, or deno                |
| rust        | rustc (edition 2021)                 |
| go          | go ≥ 1.20                            |
| cpp         | g++ or clang++ (-std=c++17)          |
| bash        | bash                                 |

`ollama-arena tasks` shows which languages are currently runnable.

## CLI

```
ollama-arena match        --models A,B [--category C] [--dataset NAME] [--difficulty L]
ollama-arena tournament   --models A,B,C,...
ollama-arena leaderboard
ollama-arena perf
ollama-arena list
ollama-arena tasks
ollama-arena datasets     [--pull NAMES] [--refresh NAMES]
ollama-arena finetune     --analyze | --generate | --train PATH
ollama-arena export       --out report.html
ollama-arena web          [--port 7860]
```

Global flags: `--backend`, `--api-key`, `--db`, `--ollama`.

## Python

```python
from ollama_arena import Arena

arena = Arena()                                      # Ollama on :11434
# arena = Arena(backend="vllm")
# arena = Arena(backend="groq", api_key="gsk_...")

arena.load_hf_dataset("humaneval", limit=50)

result = arena.run_match(
    "llama3.2:3b", "qwen2.5-coder:7b",
    category="coding", n=20,
)
print(result.elo_a_after, result.elo_b_after)
```

Round-robin between several models:

```python
arena.run_tournament(
    ["llama3.2:3b", "qwen2.5-coder:7b", "gemma2:9b"],
    category="reasoning", n_per_match=10,
)
```

LLM judge for open-ended responses:

```python
arena = Arena(judge_model="qwen2.5:32b-instruct")
# tasks marked {"use_judge": True} are graded by the judge in both orderings
```

Export a standalone HTML dashboard (Plotly):

```python
from ollama_arena.visualize import export_dashboard

export_dashboard(
    "report.html",
    leaderboard=arena.leaderboard(),
    matches=arena.match_history(limit=500),
    categories=["coding", "reasoning", "security", "planning", "inspection"],
    performance=arena.performance_stats(),
)
```

## Performance metrics

Every generation logs prompt tokens, output tokens, latency, tokens/sec,
and time-to-first-token. `ollama-arena perf` prints per-model
aggregates:

```
model              samples  tps mean  tps p95  lat mean  lat p95  ttft
llama3.2:3b           120     48.2     52.1     4.2s     6.3s    0.3s
qwen2.5-coder:7b      120     31.7     34.0     8.1s    11.2s    0.5s
```

These numbers are *backend* numbers — they include HTTP overhead, the
model server's scheduling, batching, and so on. They are useful as
relative comparisons within one backend; treat absolute values with care.

## Fine-tuning loop

A small pipeline turns arena failures into a teacher-distilled SFT
dataset, runs Unsloth LoRA on it, exports a GGUF and registers the
result as an Ollama model. End-to-end example:
[`examples/finetune_pipeline.py`](examples/finetune_pipeline.py).

CUDA is required for the Unsloth step.

## Limitations

- ELO updates per task, not per match. This converges faster but is
  noisier than the official chess formula for small sample sizes.
- The keyword-based scorers for security/inspection/planning are
  approximate. They reward mentioning the right thing, not necessarily
  understanding it. Use the LLM judge for higher-stakes scoring.
- Sandbox isolation without Docker relies on the subprocess timeout and
  the static pattern filter. Do not feed model output from untrusted
  sources to the host sandbox.
- HuggingFace dataset normalization is per-loader; some upstream schema
  changes will require updates to `loader.py`.

## Contributing

See `CONTRIBUTING.md`. The most useful contributions are new dataset
loaders, new language sandboxes, and new backends; each takes only a few
dozen lines.

## License

MIT. See `LICENSE`.

<details>
<summary>Logo</summary>

```
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⢄⡲⠖⠛⠉⠉⠉⠉⠉⠙⠛⠿⣿⣶⣦⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⣡⠖⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⣿⣿⣿⣿⣷⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⣡⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡔⢡⣶⠏⠀⠀⠀⠀⠀⠀⣠⣴⣶⣶⣶⣶⣶⣶⣦⣄⣸⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠌⢀⣿⠏⠀⠀⠀⠀⠀⠀⠸⠿⠋⠙⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡞⠀⡼⢿⣦⣄⠠⠤⠐⠒⠒⠒⠢⠤⣄⣠⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠀⠀⠀⣸⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⢠⠞⠁⠀⠀⠠⠇⣀⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠈⠙⠛⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢀⣴⣁⠀⣀⣤⣴⣾⣿⣿⣿⣿⡿⢿⣿⣶⣄⠀⠀⠀⠀⠀⣿⣷⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⡇⠀⢸⣿⣿⣿⡇⠘⠟⣻⣿⣧⠀⠀⠀⠀⢿⣿⣤⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⡿⠀⠀⠸⣿⠿⠋⠉⠁⠛⠻⠿⢿⣧⠀⠀⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⡿⠋⠁⠀⢀⣄⡀⠀⠀⠀⢀⣀⣤⣴⣿⣿⣧⠀⢀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⠏⢀⠀⢀⡴⠿⣿⣿⣷⣶⣾⣿⣿⣿⣿⣿⣿⣿⣇⠀⢷⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡄⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣤⣿⣷⡈⠀⠀⠀⠙⠻⣿⣿⣿⣿⠿⠛⠛⣻⣿⣿⡄⠈⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡄⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⣿⣄⠀⠀⠀⠀⠈⠋⢉⣠⣴⣾⣿⣿⣿⣿⣷⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⣿⢻⡏⢹⠙⡆⠀⠀⠀⠒⠚⢛⣉⣉⣿⣿⣿⣿⣿⣿⡇⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢀⡞⠁⠉⠀⠁⠀⣄⣀⣠⣴⣶⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⣈⡛⢻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀
⠀⠀⠀⠀⠛⠋⠉⠉⠉⠙⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠙⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡷⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⣻⠿⠿⢿⣿⠿⠿⠋⠁⠀⠙⣿⡁⠈⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡟⠛⠋⠉⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⠴⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣈⣹⣦⣴⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⡀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⣀⣀⣀⣀⣼⣿⣄⣀⣀⡄⠀⣀⣀⣠⣤⣶⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀⠀⠀
⠀⠀⠀⠀⠀⢰⠿⠿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠟⠉⠀⠀⣰⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀
⠀⠀⠀⢀⣤⣤⣤⣶⣿⣿⣿⣿⠿⠿⠟⠋⢹⠇⠀⠀⢀⣼⣿⣿⣿⣿⣿⡿⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇
⠀⢀⣴⣿⣿⣿⣿⣿⣿⣿⡟⠁⠀⠀⠀⢀⡏⠀⠀⢀⣾⠋⣹⣿⣿⣿⡟⠀⠀⣸⡟⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇
⢠⣿⣿⣿⣿⣿⣿⣿⣿⡟⠀⠀⠀⠀⠀⡼⠀⠀⢀⣾⠏⢀⣿⣿⣿⠋⠀⠀⣰⣿⣧⡀⠹⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇
```

</details>
