Latency
| | p50 | p90 | p99 |
| Queue wait | 2.1 ms | 8.4 ms | 41.2 ms |
| Prefill (server) | 10.1 ms | 38.3 ms | 98.5 ms |
| Decode (server) | 172.0 ms | 398.1 ms | 842.0 ms |
| Decode (per-tok) | 4.2 ms | 5.1 ms | 13.9 ms |
| End-to-end | 201.4 ms | 450.3 ms | 983.1 ms |
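The report does not say how its percentile columns are computed. As a rough sketch, assuming a nearest-rank definition over per-request samples, the p50/p90/p99 values for one row could be derived as follows. The `percentile` helper and the sample latencies are hypothetical and are not taken from the tool.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that very small p still returns the minimum.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in ms (illustrative only, not the report's raw data).
latencies_ms = [180.0, 195.5, 201.4, 210.2, 250.0, 320.7, 410.0, 450.3, 700.0, 983.1]

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
print(summary)
```

Percentiles are computed per row, so the stage rows are not expected to add up to the End-to-end row: the request at the p99 of queue wait is generally not the same request at the p99 of decode.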
KV Cache
Hit rate: 41.2%
Avg usage: 73.2%
Peak usage: 91.4%
Eviction rate: 12
Prefix Sharing
Unique prefixes: 23
Avg req/prefix: 36.8
Cacheable tokens: 72.1%
Actual hit rate: 41.2%
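The gap between cacheable tokens and the actual hit rate indicates how much prefix reuse is being left unrealized. A minimal sketch of that arithmetic, assuming both percentages are measured against the same denominator (total prompt tokens), which the report does not state:

```python
# Values copied from the Prefix Sharing panel.
cacheable_pct = 72.1   # share of tokens that could have been served from cache
actual_hit_pct = 41.2  # share of tokens that actually hit the cache

# Fraction of the theoretical reuse that was realized (assumes a shared denominator).
realized = actual_hit_pct / cacheable_pct
# Percentage points of potential hits that were missed.
missed_points = cacheable_pct - actual_hit_pct

print(f"{realized:.1%} of cacheable tokens hit the cache")
print(f"{missed_points:.1f} percentage points left unrealized")
```

Under that assumption, a little over half of the cacheable tokens are served from cache.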
Disaggregation Advisor
Recommendation: DISAGGREGATE
Reason: prefill contention is the dominant bottleneck at p99
| | Current | Projected | Improvement |
| Throughput (req/s) | 14.1 | 19.8 | +40% |
| p99 TTFT (ms) | 84.5 | 52.3 | -42% |
| p99 ITL (ms) | 9.8 | 5.2 | -47% |
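The Improvement column reads as a relative change from Current to Projected. A small sketch of that calculation, assuming the simple formula (projected - current) / current; `relative_change` is a hypothetical helper, and the advisor may compute its figures differently, for example from unrounded values:

```python
def relative_change(current, projected):
    """Relative change from current to projected, as a fraction of current."""
    return (projected - current) / current


# (current, projected) pairs copied from the advisor table.
rows = {
    "Throughput (req/s)": (14.1, 19.8),
    "p99 TTFT (ms)": (84.5, 52.3),
    "p99 ITL (ms)": (9.8, 5.2),
}

changes = {name: relative_change(cur, proj) for name, (cur, proj) in rows.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.0%}")
```

With the rounded values shown, throughput and ITL reproduce the table (+40% and -47%), while the TTFT row comes out near -38% rather than the -42% displayed, so that entry is worth checking against the tool's raw numbers.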
Optimal P/D ratio: 1:3 (2 prefill, 6 decode)
KV transfer overhead: 1.2 ms/req at 100 Gbps IB
Min network bandwidth: 50 Gbps (below this, disagg hurts)
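To see how the transfer overhead scales with link speed, here is a back-of-the-envelope sketch. It assumes the transfer is purely bandwidth-bound (no per-message latency, protocol overhead, or overlap with compute), which is a simplification and not necessarily the advisor's model. `kv_transfer_ms` and `implied_kv_bytes` are hypothetical names.

```python
def kv_transfer_ms(kv_bytes, link_gbps):
    """Time in ms to move kv_bytes over a link of link_gbps, bandwidth-bound only."""
    bytes_per_ms = link_gbps * 1e9 / 8 / 1e3  # Gbit/s -> bytes/ms
    return kv_bytes / bytes_per_ms


def implied_kv_bytes(transfer_ms, link_gbps):
    """Payload size in bytes implied by a transfer time at a given link speed."""
    return transfer_ms * link_gbps * 1e9 / 8 / 1e3


# Payload implied by the reported 1.2 ms/req at 100 Gbps.
payload = implied_kv_bytes(1.2, 100)
print(f"implied KV payload: {payload / 1e6:.1f} MB per request")

# Same payload at the reported minimum bandwidth and at an arbitrary slower link.
for gbps in (100, 50, 25):
    print(f"{gbps} Gbps: {kv_transfer_ms(payload, gbps):.1f} ms/req")
```

Under this model the reported figure corresponds to about 15 MB of KV cache per request, and halving the bandwidth doubles the per-request overhead. The 50 Gbps break-even point itself depends on how the advisor weighs that overhead against the projected TTFT and ITL gains, which the report does not spell out.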