Metadata-Version: 2.4
Name: infermark
Version: 0.2.0
Summary: Benchmark any OpenAI-compatible LLM endpoint. TTFT, inter-token latency, throughput, P50-P99 — in one command.
Project-URL: Homepage, https://github.com/stef41/infermark
Project-URL: Repository, https://github.com/stef41/infermark
Project-URL: Issues, https://github.com/stef41/infermark/issues
Author: Zacharie Bhatti
License: Apache-2.0
License-File: LICENSE
Keywords: benchmark,inference,latency,llm,ollama,performance,tgi,throughput,vllm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: httpx>=0.24
Provides-Extra: all
Requires-Dist: click>=8.0; extra == 'all'
Requires-Dist: rich>=13.0; extra == 'all'
Provides-Extra: cli
Requires-Dist: click>=8.0; extra == 'cli'
Requires-Dist: rich>=13.0; extra == 'cli'
Description-Content-Type: text/markdown

# infermark

[![CI](https://github.com/stef41/infermark/actions/workflows/ci.yml/badge.svg)](https://github.com/stef41/infermark/actions/workflows/ci.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

**Know how fast your LLM endpoint actually is.**

infermark benchmarks any OpenAI-compatible API endpoint — vLLM, TGI, Ollama, SGLang, or anything behind `/v1/chat/completions`. It measures what matters: time to first token, inter-token latency, throughput under load, and tail latencies. One command, no config files, real numbers.

Both [llmperf](https://github.com/ray-project/llmperf) and [llm-bench](https://github.com/bentoml/llm-bench) were archived in 2025. infermark fills the gap.

![Demo](assets/demo.svg)

## What it measures

| Metric | What it tells you |
|--------|-------------------|
| **TTFT** | Time to first token — how long until streaming starts |
| **ITL** | Inter-token latency — smoothness of the stream |
| **Throughput** (tok/s) | Output tokens per second across all concurrent requests |
| **P50 / P95 / P99** | Tail latency distribution at each concurrency level |
| **Error rate** | Failed requests under load |
| **RPS** | Requests per second the server can sustain |

## Install

```bash
pip install infermark
```

With the CLI (rich tables, progress):

```bash
pip install infermark[cli]
```

## Quick start

### CLI

```bash
# Benchmark a local vLLM server
infermark run http://localhost:8000/v1 --model meta-llama/Llama-3-70B -n 50

# Sweep concurrency levels
infermark run http://localhost:8000/v1 -c 1,4,8,16,32,64 -n 100

# Save results as JSON
infermark run http://localhost:8000/v1 -o results.json

# Compare multiple endpoints
infermark compare vllm.json tgi.json ollama.json
```

### Python

```python
from infermark import BenchmarkConfig, run_benchmark

config = BenchmarkConfig(
    url="http://localhost:8000/v1",
    model="meta-llama/Llama-3-70B-Instruct",
    concurrency_levels=[1, 4, 8, 16, 32],
    n_requests=100,
    max_tokens=256,
)

report = run_benchmark(config)

# Best throughput
best = report.best_throughput()
print(f"Peak: {best.tokens_per_second:.1f} tok/s at concurrency {best.concurrency}")

# Lowest latency
low = report.lowest_latency()
print(f"Lowest P50: {low.latency.p50 * 1000:.1f} ms at concurrency {low.concurrency}")
```

### Async

```python
import asyncio
from infermark import BenchmarkConfig, run_benchmark_async

async def main():
    config = BenchmarkConfig(url="http://localhost:8000/v1", model="llama-3")
    report = await run_benchmark_async(config)
    print(f"Peak throughput: {report.best_throughput().tokens_per_second:.1f} tok/s")

asyncio.run(main())
```

## Compare endpoints

Find out whether vLLM, TGI, or Ollama is faster for your model and hardware:

![Comparison](assets/comparison.svg)

```bash
# Benchmark each endpoint separately
infermark run http://gpu1:8000/v1 --model llama-3 -o vllm.json
infermark run http://gpu2:8080/v1 --model llama-3 -o tgi.json
infermark run http://gpu3:11434/v1 --model llama-3 -o ollama.json

# Side-by-side comparison
infermark compare vllm.json tgi.json ollama.json
```

## Export formats

```bash
# JSON (for programmatic analysis)
infermark run http://localhost:8000/v1 -o report.json

# Markdown (paste into docs/PRs)
infermark run http://localhost:8000/v1 --markdown report.md
```

## Configuration

```python
BenchmarkConfig(
    url="http://localhost:8000/v1",     # Any OpenAI-compatible endpoint
    model="meta-llama/Llama-3-70B",     # Model name
    prompt="Explain relativity.",        # Prompt to send
    max_tokens=256,                      # Max output tokens per request
    concurrency_levels=[1, 4, 8, 16],   # Test these concurrency levels
    n_requests=100,                      # Requests per level
    timeout=120.0,                       # Per-request timeout (seconds)
    mode=BenchmarkMode.STREAMING,        # STREAMING or NON_STREAMING
    warmup=3,                            # Warmup requests before measurement
    api_key="sk-...",                    # Optional API key
)
```

## How it works

1. **Warmup** — Sends a few requests to prime the server's KV cache and JIT compilation
2. **For each concurrency level** — Fires N requests with M concurrent workers using `asyncio`
3. **Streaming measurement** — Parses SSE chunks to measure TTFT and inter-token latency
4. **Statistics** — Computes P50/P75/P90/P95/P99, mean, min, max, std from raw timings
5. **Report** — Rich terminal tables, JSON, or Markdown output

## Supported endpoints

Anything that speaks the OpenAI chat completions API:

- [vLLM](https://github.com/vllm-project/vllm)
- [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
- [SGLang](https://github.com/sgl-project/sglang)
- [Ollama](https://ollama.ai) (with `OLLAMA_ORIGINS=*`)
- [llama.cpp server](https://github.com/ggerganov/llama.cpp)
- [LiteLLM proxy](https://github.com/BerriAI/litellm)
- OpenAI, Anthropic (via compatible proxy), Together, Fireworks, etc.

## See Also

Part of the **stef41 LLM toolkit** — open-source tools for every stage of the LLM lifecycle:

| Project | What it does |
|---------|-------------|
| [tokonomics](https://github.com/stef41/tokonomics) | Token counting & cost management for LLM APIs |
| [datacrux](https://github.com/stef41/datacrux) | Training data quality — dedup, PII, contamination |
| [castwright](https://github.com/stef41/castwright) | Synthetic instruction data generation |
| [datamix](https://github.com/stef41/datamix) | Dataset mixing & curriculum optimization |
| [toksight](https://github.com/stef41/toksight) | Tokenizer analysis & comparison |
| [trainpulse](https://github.com/stef41/trainpulse) | Training health monitoring |
| [ckpt](https://github.com/stef41/ckpt) | Checkpoint inspection, diffing & merging |
| [quantbench](https://github.com/stef41/quantbench) | Quantization quality analysis |
| [modeldiff](https://github.com/stef41/modeldiff) | Behavioral regression testing |
| [vibesafe](https://github.com/stef41/vibesafe) | AI-generated code safety scanner |
| [injectionguard](https://github.com/stef41/injectionguard) | Prompt injection detection |

## License

Apache-2.0
