Metadata-Version: 2.4
Name: local-llm-speed
Version: 0.2.1
Summary: Terminal benchmark runner with automatic discovery of local LLM servers
Project-URL: Homepage, https://github.com/KazKozDev/llm-speed
Project-URL: Repository, https://github.com/KazKozDev/llm-speed
Project-URL: Issues, https://github.com/KazKozDev/llm-speed/issues
Author: llm-speed contributors
License-Expression: MIT
License-File: LICENSE
Keywords: benchmark,llama.cpp,llm,ollama,vllm
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.11
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.7; extra == 'dev'
Description-Content-Type: text/markdown

# llm-speed

Benchmark local LLM runtimes without confusing runtime speed with model differences.

![Status](https://img.shields.io/badge/status-experimental-orange)
[![CI](https://github.com/KazKozDev/llm-speed/actions/workflows/ci.yml/badge.svg)](https://github.com/KazKozDev/llm-speed/actions/workflows/ci.yml)
![License](https://img.shields.io/badge/license-MIT-blue.svg)

## Highlights

- Finds local Ollama and OpenAI-compatible servers without configuration.
- Measures time to first token (TTFT), total latency, prompt throughput, generation throughput, RAM and CPU usage, and energy use on macOS.
- Keeps cross-engine comparisons honest with pinned weights, quantization, revisions, and checksums.
- Installs, launches, tunes, and stops 15 supported runtimes one at a time.
- Stores every tuning run in SQLite and flags performance regressions.

## Demo

<!-- TODO: Add a 20-second terminal GIF showing discover -> tune -> HTML report. -->

```bash
llm-speed discover
llm-speed run --all-models --max-tokens 256 --output reports/run.json
```

## Overview

Local inference speed is not one number. It is a function of the model, exact weights, quantization,
runtime, launch flags, context length, and hardware state. `llm-speed` controls those variables,
runs streamed benchmarks, and records enough evidence to make the result useful later. It is built
for people comparing local LLM servers on their own machines.

## Motivation

A benchmark is easy to write and easy to get wrong. Comparing an Ollama tag against a different MLX
snapshot mostly measures the difference between two artifacts, not between two engines. Estimating
token counts from character counts can change the winner again. `llm-speed` treats model identity
and token provenance as part of the benchmark and refuses strict ranking when those facts are missing.

## Features

- Concurrent discovery across common localhost ports and custom URLs.
- Ollama and OpenAI-compatible streaming drivers.
- Generation, concurrency, long-context, prompt-processing, deterministic quality, MMLU, GPQA, and HumanEval benchmarks.
- Exact server counters with an isolated tokenizer fallback for pinned local models.
- Managed profiles for Ollama, LM Studio, llama.cpp, LocalAI, vLLM, text-generation-webui, and nine MLX runtimes.
- Verified GGUF downloads and immutable Hugging Face snapshots.
- Declarative configuration matrices with up to 256 variants per profile.
- JSON, CSV, HTML, SQLite history, comparisons, and regression detection.
- CPU, RAM, thermal-state, and optional CPU/GPU/ANE energy collection.
- Benchmark and launch-profile plugins through Python entry points.
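
To make the configuration-matrix feature concrete, here is a minimal sketch of expanding a declarative matrix into variants. The option names are hypothetical, and raising an error above the 256-variant cap is an assumption; the real tool may handle oversized matrices differently.

```python
from itertools import product

MAX_VARIANTS = 256  # cap stated in the feature list


def expand_variants(matrix: dict[str, list[object]]) -> list[dict[str, object]]:
    """Return one option dictionary per combination of the declared values."""
    keys = sorted(matrix)  # stable key order keeps variant numbering reproducible
    total = 1
    for key in keys:
        total *= len(matrix[key])
    if total > MAX_VARIANTS:
        raise ValueError(f"matrix expands to {total} variants, limit is {MAX_VARIANTS}")
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


# Hypothetical launch options: 2 x 3 x 2 = 12 variants.
variants = expand_variants(
    {
        "context_length": [2048, 8192],
        "batch_size": [128, 256, 512],
        "flash_attention": [False, True],
    }
)
print(len(variants))  # 12
print(variants[0])
```

Checking the size before building the product avoids materializing a huge list just to reject it.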

## Architecture

Components:

- `discovery` probes local endpoints and identifies their protocol.
- `drivers` normalize streaming responses and token accounting.
- `model_matrix` and `model_acquisition` prove which artifact each engine receives.
- `tuning` installs engines, isolates processes, expands variants, monitors resources, and ranks runs.
- `benchmarks`, `reporting`, and `history` turn measurements into reusable evidence.

Flow: discover or launch -> warm up -> stream -> measure -> validate -> rank -> persist.

## Tech Stack

- Python 3.11+ with no third-party runtime dependencies.
- `argparse`, `urllib`, `sqlite3`, and `subprocess` from the standard library.
- Hatchling for wheel and sdist builds.
- Ruff, strict Mypy, and unittest-compatible tests (pytest is included in the `dev` extra).
- GitHub Actions on Python 3.11, 3.12, and 3.13.

## Quick Start

Install the PyPI distribution; the command remains `llm-speed`:

```bash
pip install local-llm-speed
llm-speed discover
```

See [docs/setup.md](docs/setup.md) for authentication, managed engines, quality suites, and macOS
energy measurement.

## Usage

Benchmark servers that are already running:

```bash
llm-speed run --all-models --max-tokens 256 --output reports/run.json
```

Run a strict comparison on one pinned MLX snapshot:

```bash
llm-speed tune \
  --model qwen3-0.6b-mlx-4bit \
  --model-matrix model-matrix.qwen3-0.6b-mlx.json \
  --engine mlx-lm \
  --engine vllm-mlx \
  --full-suite
```

Inspect stored results:

```bash
llm-speed history
llm-speed compare 12 13
```
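
The idea of comparing two stored runs can be illustrated with a small `sqlite3` sketch. The table layout, the function names, and the 5% tolerance are all assumptions for illustration; they do not describe the schema `llm-speed` uses or the rule behind `llm-speed compare`.

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open a history database with a hypothetical single-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " id INTEGER PRIMARY KEY,"
        " engine TEXT NOT NULL,"
        " model TEXT NOT NULL,"
        " generation_tps REAL NOT NULL)"
    )
    return conn


def record_run(conn: sqlite3.Connection, engine: str, model: str, generation_tps: float) -> int:
    """Store one run and return its row id."""
    cursor = conn.execute(
        "INSERT INTO runs (engine, model, generation_tps) VALUES (?, ?, ?)",
        (engine, model, generation_tps),
    )
    conn.commit()
    row_id = cursor.lastrowid
    if row_id is None:
        raise RuntimeError("insert did not produce a row id")
    return row_id


def is_regression(
    conn: sqlite3.Connection,
    baseline_id: int,
    candidate_id: int,
    tolerance: float = 0.05,  # assumed threshold: more than 5% slower counts as a regression
) -> bool:
    """Report whether the candidate run is slower than the baseline beyond the tolerance."""
    rows = dict(
        conn.execute(
            "SELECT id, generation_tps FROM runs WHERE id IN (?, ?)",
            (baseline_id, candidate_id),
        )
    )
    if baseline_id not in rows or candidate_id not in rows:
        raise KeyError("both run ids must exist in the history")
    return rows[candidate_id] < rows[baseline_id] * (1 - tolerance)
```

A tolerance band matters because repeated runs on the same machine rarely produce identical throughput, so a strict less-than comparison would flag noise.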

## Project Structure

```text
src/llm_speed/
  benchmarks/        benchmark contracts and built-in suites
  drivers/           Ollama and OpenAI-compatible protocols
  tuning/            installation, lifecycle, monitoring, and ranking
  cli.py             command-line entry point
  model_matrix.py    strict cross-engine artifact identity
tests/               unit and lifecycle tests with fake local servers
model-matrix.*.json  reproducible MLX and GGUF examples
profiles.example.json
```

## Status

Stage: Experimental (`0.2.x`, PyPI classifier: Alpha).

Planned:

- Future work is tracked through GitHub issues and benchmark evidence.

## Testing

```bash
python -m unittest discover -s tests -v
ruff check .
mypy src
python -m build
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

---

MIT - see LICENSE

If you like this project, please give it a star ⭐

For questions, feedback, or support, reach out to:

- [LinkedIn](https://www.linkedin.com/in/kazkozdev/)
- [Email](mailto:kazkozdev@gmail.com)
