Metadata-Version: 2.3
Name: tokenspeed-trie
Version: 0.1.1.post20260523
Summary: A small harness for evaluating OpenAI-compatible inference endpoints with synthetic agentic workloads.
License: Apache-2.0
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Testing
Requires-Dist: chz
Requires-Dist: numpy
Requires-Dist: openai
Requires-Dist: rich
Requires-Dist: structlog
Requires-Dist: transformers
Requires-Dist: pytest ; extra == 'test'
Requires-Python: >=3.11
Provides-Extra: test
Description-Content-Type: text/markdown

# tokenspeed-trie

`trie` (trace replay inference evaluation) is a lightweight harness that exercises a running TokenSpeed inference endpoint with synthetic multi-turn workloads derived from production traces.

Prefill-heavy or decode-heavy synthetic benchmarks (1k/8k, 1k/1k, 8k/1k, etc.) don't capture real agentic traffic, which is multi-turn, carries high per-turn prefill from tool outputs, and stresses KV-cache management as context grows. `trie` replays that shape.

## Install

```bash
pip install tokenspeed-trie
```

The published package name on PyPI is `tokenspeed-trie`; the import name and the CLI command stay `trie`.

## Quick start

Three-minute smoke test against a running TokenSpeed endpoint. The example model below is `nvidia/Kimi-K2.5-NVFP4`; substitute any served model name. Make sure the engine is idle (no leftover traffic from a previous run) before starting.

```bash
trie \
  workload_path=agentic \
  endpoint=http://localhost:8000/v1 \
  model=nvidia/Kimi-K2.5-NVFP4 \
  tokenizer_model=nvidia/Kimi-K2.5-NVFP4 \
  concurrency=8 \
  duration=180 \
  stream=True \
  num_gpus=4
```

`workload_path` accepts a short alias (`agentic` / `qa` / `office`) backed by the [`lightseekorg/trie-dataset`](https://huggingface.co/datasets/lightseekorg/trie-dataset) mirror on Hugging Face, or a filesystem path to a custom JSONL. Aliases are downloaded into `/tmp/trie-dataset/` on first use and reused thereafter. Override the cache directory with `TRIE_DATASET_CACHE=/some/other/path` if `/tmp` isn't usable.

`model` is sent to the inference endpoint. `tokenizer_model` is loaded separately via `transformers.AutoTokenizer.from_pretrained(...)` to generate synthetic prompts at the requested token lengths. Pass `tokenizer_model=` explicitly when `model` is not a valid Hugging Face ID or local tokenizer path.

`stream=True` is required to surface `TTFT`, `TTFAT`, and `Decode TPS`.

Start TokenSpeed with `--enable-cache-report` so the server returns `usage.prompt_tokens_details.cached_tokens`; without it the `Cache hit rate (%)` columns are zero.
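
For orientation, the field sits inside the OpenAI-style `usage` object of each response. The sketch below reads it defensively from a response that has already been converted to a plain dict. `cached_tokens_from_usage` is a hypothetical helper for illustration, not part of `trie`, and the fallback to zero is an assumption modeled on the behavior described above.

```python
from typing import Any, Mapping


def cached_tokens_from_usage(usage: Mapping[str, Any]) -> int:
    """Return usage.prompt_tokens_details.cached_tokens, or 0 when it is absent."""
    # Hypothetical helper: servers started without cache reporting may omit
    # the details object entirely or send it as null.
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


with_report = {"prompt_tokens": 120, "prompt_tokens_details": {"cached_tokens": 96}}
without_report = {"prompt_tokens": 120}

print(cached_tokens_from_usage(with_report))     # 96
print(cached_tokens_from_usage(without_report))  # 0
```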

## Full benchmark sweep

Mirrors the methodology in [Applied Compute's inference benchmark](https://www.appliedcompute.com/research/inference-benchmark): three workloads × six concurrency levels × 2-hour runs. Total wall time: ~36 hours on a single node.

```bash
ENDPOINT=http://localhost:8000/v1
MODEL=nvidia/Kimi-K2.5-NVFP4

mkdir -p logs
for WL in agentic qa office; do
  for C in 8 16 24 32 40 48; do
    trie \
      workload_path=$WL \
      endpoint=$ENDPOINT \
      model=$MODEL \
      tokenizer_model=$MODEL \
      concurrency=$C \
      duration=7200 \
      stream=True \
      num_gpus=4 \
      2>&1 | tee logs/${WL}_c${C}.log
  done
done
```

`duration` is the deadline for launching new traces, not a hard stop. Once it elapses, the harness stops admitting new traces and lets those already in flight drain, so total wall time can exceed `duration`.

## Python API

```python
from trie import Client

client = Client(
    endpoint="http://localhost:8000/v1",
    model="nvidia/Kimi-K2.5-NVFP4",
)
client.sync_run("agentic", concurrency=8, duration=180, stream=True, num_gpus=4)
# Use client.run(...) directly if you're already inside an event loop.
```

## Workload aliases

| Alias | HF dataset filename | Cached path |
|---|---|---|
| `agentic` | `agentic_coding_8k.jsonl` | `/tmp/trie-dataset/agentic_coding_8k.jsonl` |
| `qa` | `code_qa_8k.jsonl` | `/tmp/trie-dataset/code_qa_8k.jsonl` |
| `office` | `office_work_8k.jsonl` | `/tmp/trie-dataset/office_work_8k.jsonl` |

Anything not in the alias table is treated as a filesystem path and passed through unchanged.
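
The lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not the harness's actual code: `resolve_workload_path` is a hypothetical name, and the download step on first use is omitted.

```python
import os

# Alias -> dataset filename, copied from the table above.
ALIASES = {
    "agentic": "agentic_coding_8k.jsonl",
    "qa": "code_qa_8k.jsonl",
    "office": "office_work_8k.jsonl",
}
DEFAULT_CACHE_DIR = "/tmp/trie-dataset"


def resolve_workload_path(workload_path: str) -> str:
    """Map a known alias to its cached file; return anything else unchanged."""
    filename = ALIASES.get(workload_path)
    if filename is None:
        # Not an alias: treat it as a filesystem path and pass it through.
        return workload_path
    cache_dir = os.environ.get("TRIE_DATASET_CACHE", DEFAULT_CACHE_DIR)
    return os.path.join(cache_dir, filename)


print(resolve_workload_path("qa"))
print(resolve_workload_path("./my_traces.jsonl"))
```

With `TRIE_DATASET_CACHE` unset, the first call resolves under `/tmp/trie-dataset/`; the second returns the custom path as given.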

## Custom workload format

Each JSONL row defines one trace:

- `num_turns` — number of tool-use turns
- `input_prompt_length` — initial user prompt token length
- `assistant_response_length` — per-turn assistant tokens (list of length `num_turns`)
- `tool_call_output_length` — per-turn tool result tokens (list of length `num_turns`)
- `tool_call_latency` — per-turn simulated delay in seconds (list of length `num_turns`)
- `final_assistant_response_length` — final assistant response tokens after all tool turns

Example row:

```json
{"num_turns": 2, "input_prompt_length": 32, "assistant_response_length": [16, 20], "tool_call_output_length": [8, 12], "tool_call_latency": [0.0, 0.0], "final_assistant_response_length": 64}
```

A trace produces `num_turns + 1` completion requests: one per tool-use turn, plus a final turn after the last tool result.
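
Before launching a long run against a custom JSONL, it can help to check each row against these rules. The following is a minimal sketch; `validate_trace` is a hypothetical helper, not part of `trie`, and the harness may enforce additional constraints that are not shown here.

```python
import json

# Fields that must be lists with exactly num_turns entries.
PER_TURN_FIELDS = (
    "assistant_response_length",
    "tool_call_output_length",
    "tool_call_latency",
)


def validate_trace(row: dict) -> int:
    """Check per-turn list lengths and return the number of completion requests."""
    num_turns = row["num_turns"]
    for field in PER_TURN_FIELDS:
        if len(row[field]) != num_turns:
            raise ValueError(
                f"{field} must have {num_turns} entries, got {len(row[field])}"
            )
    # Touch the scalar fields so a missing key fails early with a KeyError.
    row["input_prompt_length"]
    row["final_assistant_response_length"]
    # One request per tool-use turn, plus the final turn.
    return num_turns + 1


line = (
    '{"num_turns": 2, "input_prompt_length": 32, '
    '"assistant_response_length": [16, 20], "tool_call_output_length": [8, 12], '
    '"tool_call_latency": [0.0, 0.0], "final_assistant_response_length": 64}'
)
num_requests = validate_trace(json.loads(line))
print(num_requests)  # 3
```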

## Example output

```
[info     ] starting benchmark             concurrency=24 duration=300.0 model=nvidia/Kimi-K2.5-NVFP4 num_gpus=4 workload_templates=8192
[info     ] benchmark complete             completed_requests=… failed_requests=0 wall_time_s=… ...

                                                 Per-trace metrics
┏━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        ┃             ┃          ┃           ┃                    ┃                    ┃ Eligible cache hit rate ┃
┃ Metric ┃ Latency (s) ┃ TTFT (s) ┃ TTFAT (s) ┃ Decode TPS (tok/s) ┃ Cache hit rate (%) ┃                     (%) ┃
┡━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ mean   │      ...    │     ...  │      ...  │              ...   │               ...  │                    ...  │
└────────┴─────────────┴──────────┴───────────┴────────────────────┴────────────────────┴─────────────────────────┘

                                     Workload metrics
                                completed=N/N  trace/s=…
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                ┃  Overall ┃ Last 30s Window ┃ Steady State ┃ Steady State / GPU ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ total prompt tok/s    │    ...   │           ...   │         ...  │              ...   │
│ cached prompt tok/s   │    ...   │           ...   │         ...  │              ...   │
│ uncached prompt tok/s │    ...   │           ...   │         ...  │              ...   │
│ completion tok/s      │    ...   │           ...   │         ...  │              ...   │
└───────────────────────┴──────────┴─────────────────┴──────────────┴────────────────────┘
```

## Metrics

### Per-trace

- `Latency (s)` — end-to-end latency from the first request of a trace to the final response.
- `TTFT (s)` — (streaming) time to the first streamed token of the first request.
- `TTFAT (s)` — (streaming) time from trace start to the first streamed token of the **final** request. The user-visible first token in an agent that hides intermediate tool turns.
- `Decode TPS (tok/s)` — (streaming) mean post-TTFT decode throughput across the trace's requests.
- `Cache hit rate (%)` — server-reported `cached_prompt_tokens / prompt_tokens` over all requests in a trace.
- `Eligible cache hit rate (%)` — same numerator, denominator restricted to prompt tokens expected to be cacheable. Excludes the initial prompt and, on each turn, the tool output newly appended on that request:
  `sum_i cached_prompt_tokens_i / sum_i eligible_prompt_tokens_i`,
  where `eligible_prompt_tokens_0 = 0` and `eligible_prompt_tokens_i = prompt_tokens_{i-1} + completion_tokens_{i-1}` for `i > 0`.
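
As a worked illustration of the two cache metrics, the sketch below applies both formulas to a three-request trace. The prompt and completion sizes follow the example row from the custom workload format section, ignoring any chat-template tokens; the cached counts are made up for the example. `RequestUsage` and `cache_hit_rates` are hypothetical names, and returning 0 for an empty denominator is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RequestUsage:
    prompt_tokens: int
    completion_tokens: int
    cached_prompt_tokens: int


def cache_hit_rates(requests: list[RequestUsage]) -> tuple[float, float]:
    """Return (cache hit rate, eligible cache hit rate) in percent for one trace."""
    cached = sum(r.cached_prompt_tokens for r in requests)
    total_prompt = sum(r.prompt_tokens for r in requests)
    # eligible_0 = 0; eligible_i = prompt_{i-1} + completion_{i-1} for i > 0,
    # so every request except the last contributes to the denominator.
    eligible = sum(r.prompt_tokens + r.completion_tokens for r in requests[:-1])
    hit = 100.0 * cached / total_prompt if total_prompt else 0.0
    eligible_hit = 100.0 * cached / eligible if eligible else 0.0
    return hit, eligible_hit


trace = [
    RequestUsage(prompt_tokens=32, completion_tokens=16, cached_prompt_tokens=0),
    RequestUsage(prompt_tokens=56, completion_tokens=20, cached_prompt_tokens=48),
    RequestUsage(prompt_tokens=88, completion_tokens=64, cached_prompt_tokens=64),
]
hit, eligible_hit = cache_hit_rates(trace)
print(f"cache hit rate: {hit:.1f}%  eligible: {eligible_hit:.1f}%")
```

Here 112 of 176 prompt tokens are cached (about 63.6%), while the eligible denominator is 48 + 76 = 124 tokens (about 90.3%). The gap between the two numbers is the share of prompt tokens that could not have been cached in the first place.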

### Workload

- `trace/s` — completed traces per wall-clock second.
- `total / cached / uncached prompt tok/s` — aggregate prompt-token throughput, split by what the synthetic workload accounting expects to be cached vs. new.
- `completion tok/s` — aggregate completion-token throughput.

Each is reported under four columns:

- `Overall` — totals over the full benchmark wall time.
- `Last 30s Window` — slope of cumulative token counts over the most recent 30 seconds.
- `Steady State` — **the headline throughput metric**. Slope after dropping the first 20% of wall time as warmup. Avoids dilution from ramp-up and drain when fewer than `concurrency` traces are in flight. With `concurrency > 1` the completion curve depends on finish order, so the metric has small run-to-run variance even at fixed seed.
- `Steady State / GPU` — `Steady State / num_gpus` when `num_gpus` is set.
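
To make the `Steady State` column concrete, here is one way such a slope could be computed from sampled cumulative token counts. It assumes a least-squares line fit over the samples that remain after the warmup cut; the harness's actual estimator is not specified here, and `steady_state_rate` is a hypothetical name.

```python
import numpy as np


def steady_state_rate(times_s, cumulative_tokens, warmup_fraction=0.2):
    """Slope (tok/s) of cumulative tokens after dropping the warmup window."""
    t = np.asarray(times_s, dtype=float)
    y = np.asarray(cumulative_tokens, dtype=float)
    # Drop samples that fall inside the first warmup_fraction of wall time.
    cutoff = t[0] + warmup_fraction * (t[-1] - t[0])
    keep = t >= cutoff
    if keep.sum() < 2:
        raise ValueError("need at least two samples after warmup")
    # Assumption: least-squares fit; the first coefficient is the slope.
    slope, _intercept = np.polyfit(t[keep], y[keep], deg=1)
    return float(slope)


# Synthetic run: 50 tok/s during a 20 s ramp-up, then 200 tok/s.
t = np.arange(0, 101)
y = np.where(t < 20, 50 * t, 1000 + 200 * (t - 20))

overall = float(y[-1] / (t[-1] - t[0]))
steady = steady_state_rate(t, y)
print(f"overall={overall:.1f} tok/s  steady={steady:.1f} tok/s")
```

In this synthetic run the overall rate is diluted to 170 tok/s by the ramp-up, while the post-warmup slope recovers the 200 tok/s plateau. Dividing the result by `num_gpus` gives the per-GPU figure.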

Prompt-token throughputs use synthetic workload accounting; cache-hit metrics use server-reported usage. A divergence between the two usually points to a tokenizer mismatch between client and server.

## Known limitations

- Synthetic prompts are freshly random per trace, so cross-trace prefix sharing (e.g. a common system prompt or tool definitions) is not modeled and cache hit rates may be lower than in a deployment that shares prefixes.
- `Decode TPS` assumes the first streamed chunk carries exactly one token. Backends that buffer multiple tokens into the first chunk overstate it slightly.
- `transformers` is unpinned; install the version whose tokenizer matches your inference server's. Mismatched versions can produce subtly different token counts and cause prompt-accounting drift.
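
The `Decode TPS` caveat above is easier to see with numbers. The sketch below assumes the metric is computed as `(completion_tokens - 1) / (latency - TTFT)`, which is what the one-token-first-chunk assumption suggests; the exact formula used by the harness is not stated here, and `decode_tps` is a hypothetical name.

```python
def decode_tps(completion_tokens: int, ttft_s: float, latency_s: float,
               first_chunk_tokens: int = 1) -> float:
    """Post-TTFT decode throughput in tok/s for a single streamed request."""
    decode_time = latency_s - ttft_s
    decoded = completion_tokens - first_chunk_tokens
    if decoded <= 0 or decode_time <= 0:
        return 0.0
    return decoded / decode_time


# 101 completion tokens, first token at 0.5 s, last token at 2.5 s.
reported = decode_tps(101, ttft_s=0.5, latency_s=2.5)
# Same request if the backend had actually packed 4 tokens into the first chunk.
actual = decode_tps(101, ttft_s=0.5, latency_s=2.5, first_chunk_tokens=4)
print(reported, actual)  # 50.0 48.5
```

The reported figure credits three tokens to the decode window that had already arrived with the first chunk, which is the slight overstatement noted in the list.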
