Metadata-Version: 2.4
Name: splleed
Version: 0.1.0a3
Summary: LLM inference benchmarking harness with pluggable backends
Project-URL: Homepage, https://github.com/Bradley-Butcher/Splleed
Project-URL: Repository, https://github.com/Bradley-Butcher/Splleed
Project-URL: Issues, https://github.com/Bradley-Butcher/Splleed/issues
Author-email: Bradley Butcher <bradleybutcher@outlook.com>
License: MIT
License-File: LICENSE
Keywords: benchmark,inference,latency,llm,performance,tgi,vllm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Requires-Dist: httpx>=0.25
Requires-Dist: numpy<2.3,>=1.24
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: tqdm>=4.0
Provides-Extra: hf
Requires-Dist: datasets>=2.0; extra == 'hf'
Description-Content-Type: text/markdown

# splleed

LLM inference benchmarking with a Python-first API.

## Features

- **Python API**: Write benchmarks as scripts, not config files
- **Pluggable backends**: vLLM, TGI (more coming)
- **Comprehensive metrics**: TTFT, ITL, TPOT, throughput, E2E latency
- **Statistical rigor**: Multiple trials with confidence intervals
- **Flexible operation**: Connect to existing servers or let splleed manage them

## Installation

```bash
pip install splleed
```

For HuggingFace dataset support:
```bash
pip install "splleed[hf]"
```

Inference engines (vLLM, TGI) are **not** bundled - install them separately.

## Quick Start

```python
import asyncio
from splleed import Benchmark, VLLMConfig, SamplingParams

async def main():
    results = await Benchmark(
        backend=VLLMConfig(model="Qwen/Qwen2.5-0.5B-Instruct"),
        prompts=[
            "What is the capital of France?",
            "Explain quantum computing briefly.",
        ],
        concurrency=[1, 2, 4],
        trials=3,
        sampling=SamplingParams(max_tokens=100),
    ).run()

    results.print()
    results.save("results.json")

if __name__ == "__main__":
    asyncio.run(main())
```

## Connect vs Managed Mode

**Managed mode** - splleed starts and stops the server:
```python
backend = VLLMConfig(model="Qwen/Qwen2.5-0.5B-Instruct")
```

**Connect mode** - use an existing server:
```python
backend = VLLMConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    endpoint="http://localhost:8000",
)
```

## Using HuggingFace Datasets

```python
import asyncio

from datasets import load_dataset
from splleed import Benchmark, VLLMConfig

async def main():
    ds = load_dataset("tatsu-lab/alpaca", split="train")
    ds = ds.shuffle(seed=42).select(range(100))
    prompts = list(ds["instruction"])

    results = await Benchmark(
        backend=VLLMConfig(model="Qwen/Qwen2.5-3B-Instruct"),
        prompts=prompts,
        concurrency=[1, 2, 4, 8],
        trials=3,
    ).run()

    results.print()

if __name__ == "__main__":
    asyncio.run(main())
```

## Backend Configuration

### vLLM

```python
from splleed import VLLMConfig

backend = VLLMConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel=2,
    gpu_memory_utilization=0.9,
    quantization="awq",  # optional
    dtype="auto",
)
```

### TGI

```python
from splleed import TGIConfig

backend = TGIConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantize="bitsandbytes-nf4",  # optional
)
```

## Benchmark Modes

### Latency Mode (default)
Sequential requests to measure per-request latency without interference:
```python
Benchmark(..., mode="latency")
```

### Throughput Mode
Concurrent requests to measure maximum throughput:
```python
Benchmark(..., mode="throughput", concurrency=[1, 4, 8, 16])
```

### Serve Mode
Simulate realistic traffic with controlled arrival patterns:
```python
Benchmark(
    ...,
    mode="serve",
    arrival_rate=10.0,           # 10 requests/sec
    arrival_pattern="poisson",   # realistic traffic
    concurrency=[32],            # max concurrent requests
)
```

Arrival patterns:
- `poisson` - exponential inter-arrival times (realistic web traffic)
- `gamma` - configurable burstiness
- `constant` - fixed interval between requests
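
The patterns differ only in how the gap between consecutive request starts is drawn. A rough stdlib sketch of what each one means (illustrative only, not splleed's internals; the `cv` burstiness knob is made up for this example, and splleed's actual gamma parameterization may differ):

```python
import random

def inter_arrival_times(pattern: str, rate: float, n: int,
                        seed: int = 0, cv: float = 2.0) -> list[float]:
    """Illustrative gaps (seconds) between request starts, mean 1/rate."""
    rng = random.Random(seed)
    mean_gap = 1.0 / rate
    if pattern == "constant":
        return [mean_gap] * n
    if pattern == "poisson":
        # Poisson arrivals => exponentially distributed inter-arrival times
        return [rng.expovariate(rate) for _ in range(n)]
    if pattern == "gamma":
        # Gamma with shape 1/cv^2: cv > 1 gives burstier-than-Poisson traffic
        shape = 1.0 / (cv ** 2)
        return [rng.gammavariate(shape, mean_gap / shape) for _ in range(n)]
    raise ValueError(f"unknown pattern: {pattern}")
```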

## Benchmark Options

```python
Benchmark(
    backend=...,
    prompts=["..."],

    # Benchmark settings
    mode="latency",          # "latency", "throughput", or "serve"
    concurrency=[1, 4, 8],   # concurrency levels to test
    warmup=2,                # warmup iterations
    runs=10,                 # requests per concurrency level
    trials=3,                # independent trials for CI
    confidence_level=0.95,   # confidence interval level

    # Serve mode only
    arrival_rate=10.0,       # requests per second
    arrival_pattern="poisson",  # "poisson", "gamma", "constant"

    # Sampling parameters
    sampling=SamplingParams(
        max_tokens=100,
        temperature=0.0,
        top_p=1.0,
    ),
)
```

## Metrics

| Metric | Description |
|--------|-------------|
| TTFT | Time to first token |
| ITL | Inter-token latency |
| TPOT | Time per output token (mean ITL) |
| E2E | End-to-end request latency |
| Throughput | Tokens/sec |
| Goodput | % of requests meeting SLO |

All latency metrics include p50, p95, p99, and mean. With multiple trials, results include 95% confidence intervals.
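
For intuition, the per-level summaries amount to percentile and interval computations like the following stdlib sketch (not splleed's internals; splleed may well use a t-interval rather than this normal approximation for small trial counts):

```python
from statistics import NormalDist, mean, quantiles, stdev

def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    # p50/p95/p99 picked from the 99 cut points statistics.quantiles returns
    qs = quantiles(samples_ms, n=100)
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98], "mean": mean(samples_ms)}

def confidence_interval(trial_means: list[float],
                        level: float = 0.95) -> tuple[float, float]:
    # Normal-approximation CI over the means of independent trials
    m, s, n = mean(trial_means), stdev(trial_means), len(trial_means)
    half = NormalDist().inv_cdf(0.5 + level / 2) * s / n ** 0.5
    return (m - half, m + half)
```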

## Output Formats

```python
results.print()              # Rich table to console
results.save("out.json")     # JSON format
results.save("out.csv")      # CSV format

json_str = results.to_json()
csv_str = results.to_csv()
```

## License

MIT
