Metadata-Version: 2.4
Name: artemisllmbench
Version: 0.1.2
Summary: Artemis LLM Benchmark — correctness validation and performance benchmarking for any OpenAI-compatible LLM serving endpoint
Author-email: TurinTech AI <neda@turintech.ai>
License: MIT
Project-URL: Homepage, https://pypi.org/project/artemisllmbench
Project-URL: Documentation, https://pypi.org/project/artemisllmbench
Project-URL: Bug Tracker, https://turintech.ai
Keywords: llm,benchmark,vllm,inference,performance,validation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Benchmark
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: httpx>=0.28
Requires-Dist: click>=8.0
Requires-Dist: rich>=13.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: numpy>=1.24
Provides-Extra: dashboard
Requires-Dist: streamlit>=1.30; extra == "dashboard"
Requires-Dist: plotly>=5.0; extra == "dashboard"
Provides-Extra: full
Requires-Dist: artemisllmbench[dashboard]; extra == "full"
Requires-Dist: sentence-transformers>=2.2; extra == "full"

# Artemis LLM Benchmark

A Python CLI for **correctness validation and performance benchmarking** of LLM serving endpoints. Works with any OpenAI-compatible server — vLLM, Ollama, llama.cpp, and more.

---

## Install

```bash
pip install artemisllmbench                    # core
pip install "artemisllmbench[dashboard]"       # + Streamlit dashboard
pip install "artemisllmbench[full]"            # + dashboard + semantic similarity
```

> `[full]` adds `sentence-transformers` for semantic similarity checks (Layer 3 validity).

---

## What It Does

Artemis answers two questions after you optimize an LLM endpoint:
- **Did the optimization preserve correctness?** — multi-layer validity checks on every response
- **What is the publishable performance number?** — reproducible latency, throughput, and goodput metrics

---

## Quick Start

**Validate a single endpoint** — correctness + performance in one command:

```bash
artemisllmbench validate \
  --endpoint http://localhost:9000 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --hardware a100
```

**Compare stock vs optimized** — sequential benchmarking with full GPU resources for each:

```bash
artemisllmbench compare \
  --endpoint-a http://localhost:9000 \
  --endpoint-b http://localhost:9001 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --hardware a100
```

**Split-session compare** — when the stock endpoint is already torn down:

```bash
# Session 1: save the stock baseline
artemisllmbench baseline --endpoint http://localhost:9000 --model <model> --hardware a100

# Session 2: run the optimized candidate
artemisllmbench candidate --endpoint http://localhost:9000 --model <model> --hardware a100
```

The dashboard launches automatically after each run. Open it at `http://<your-ip>:8501`.

---

## Key Features

- **Multi-layer validity** — sanity, structural, semantic (embedding similarity ≥ 0.92), and exact-match checks catch regressions that latency numbers alone miss
- **Reproducible metrics** — TTFT, P95/P99 latency, ITL (inter-token latency), throughput, CV, drift, and spike detection
- **SLO / goodput tracking** — set `--slo-ttft` and `--slo-latency` thresholds; get the % of requests that met them
- **Streamlit dashboard** — live progress, side-by-side results, analytics charts, and a live response comparison panel
- **Fast mode** — `--fast` cuts runtime by ~75% for quick iteration checks
- **Cross-machine support** — endpoints can be on different hosts or different hardware

---

## Common Flags

| Flag | Description |
|------|-------------|
| `--fast` | Reduced runs (~75% faster). For quick checks only. |
| `--production` | Full 50-run sequential + concurrent load (default). |
| `--slo-ttft <ms>` | TTFT SLO threshold — enables goodput reporting. |
| `--slo-latency <ms>` | End-to-end latency SLO threshold. |
| `--plots` | ASCII charts inline in terminal output. |
| `--live` | Rich terminal live view during concurrent phases. |
| `--no-dashboard` | Skip auto-launching Streamlit. |
| `--port N` | Streamlit port (default: 8501). |

---

## Validity Layers

| Layer | Check | On failure |
|-------|-------|------------|
| 1 Sanity | Non-empty, complete sentence, token bounds | Hard fail |
| 2 Structural | JSON/Python syntax where required | Hard fail |
| 3 Semantic | Cosine similarity ≥ 0.92 vs. reference | Hard fail / warning |
| 4 Exact match | String equality (`control_prompt_v1` only) | Warning |

---

## Pre-flight Conformance Check

```bash
artemisllmbench check-conformance --endpoint http://localhost:9000
```

Verifies your endpoint speaks the required OpenAI-compatible SSE format before a full benchmark run.

---

**Full documentation and source:** `artemisllmbench --help` or `artemisllmbench <command> --help`
