

What it's for
  • Offline legal-domain chat and clause/issue classification on consumer hardware
  • Drafting and triage behind your own document-retrieval layer
  • Picking a quant variant by workload shape, not just RAM budget

Audience: local-LLM power users and legal-tech builders who want an offline legal chat model on a consumer GPU, for drafting and triage support rather than legal advice.

Quant economics: quality × speed per build

Variant               Perplexity  tok/s  LegalBench (n=50, contains)
Q4_K_M                5.986       29.4   0.62
Q5_K_M (sweet spot)   5.938       20.2   0.72
Q6_K                  5.925       22.4   0.68
Q8_0                  5.914        7.3   0.66
F16                   5.917       10.9   0.68

Lower perplexity is better; tok/s was measured on the DGX Spark (GB10, 128 GB unified memory).
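One way to read the table by workload shape is to ask which variants are not beaten on both speed and task score at once. The sketch below is a minimal illustration using only the numbers from the table. The `Variant` class, the function names, and the choice of tok/s and LegalBench as the two axes are assumptions made for this example; they are not the page's own "quality index", which is not defined here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    perplexity: float  # lower is better
    tok_s: float       # higher is better
    legalbench: float  # "contains" score, n=50, higher is better


# Figures copied from the table above.
VARIANTS = [
    Variant("Q4_K_M", 5.986, 29.4, 0.62),
    Variant("Q5_K_M", 5.938, 20.2, 0.72),
    Variant("Q6_K", 5.925, 22.4, 0.68),
    Variant("Q8_0", 5.914, 7.3, 0.66),
    Variant("F16", 5.917, 10.9, 0.68),
]


def dominates(a: Variant, b: Variant) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.tok_s >= b.tok_s and a.legalbench >= b.legalbench
    better = a.tok_s > b.tok_s or a.legalbench > b.legalbench
    return no_worse and better


def pareto_front(variants: list[Variant]) -> list[Variant]:
    """Variants that no other variant dominates on (tok/s, LegalBench)."""
    return [v for v in variants if not any(dominates(o, v) for o in variants)]


def throughput_gap(slow: Variant, fast: Variant) -> float:
    """Fraction by which `slow` trails `fast` in tok/s."""
    return 1.0 - slow.tok_s / fast.tok_s


by_name = {v.name: v for v in VARIANTS}
front = [v.name for v in pareto_front(VARIANTS)]
gap = throughput_gap(by_name["Q8_0"], by_name["F16"])

print(front)         # ['Q4_K_M', 'Q5_K_M', 'Q6_K']
print(f"{gap:.0%}")  # 33%
```

On these two axes Q8_0 and F16 drop out because Q6_K matches or beats both on score while running faster. Adding perplexity as a third axis would put Q8_0 back on the front, since it has the lowest perplexity in the table.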

[Figure: efficiency curve, quality index × tok/s]
Known drift (bounded · honest)
  • LegalBench scored with a lenient "contains" matcher: The LegalBench mini-eval (n=50) scores by substring "contains" match, which is more forgiving than strict exact match. Read the 62–72% range as an upper bound on exact-match accuracy, not as a strict-accuracy figure. Q5_K_M tops the table at 36/50.
  • Q8_0 sustained-throughput anomaly: Q8_0 generates at 7.3 tok/s, about 33% below F16's 10.9 and slower than every K-quant. This is the same continued-pretrain-shape Q8_0 slowdown seen on the finance card. Perplexity favors Q8_0 (5.914, the lowest in the table), but Q6_K (22.4 tok/s) is the safer throughput pick.
  • Not legal advice: This is a 7B model derived from the upstream Mistral base, intended for drafting, triage, and classification support, not for legal advice or filing decisions. No jurisdiction-specific validation is claimed.
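
To make the first caveat concrete, the sketch below contrasts a substring "contains" matcher with a strict exact-match one. It is an illustration only: the normalization step, the function names, and the sample prediction/gold pairs are all hypothetical, and none of it is taken from the actual eval harness or from LegalBench items.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, gold: str) -> bool:
    """Strict rubric: the whole answer must equal the gold label."""
    return normalize(prediction) == normalize(gold)


def contains_match(prediction: str, gold: str) -> bool:
    """Lenient rubric: the gold label only has to appear as a substring."""
    return normalize(gold) in normalize(prediction)


def accuracy(pairs, matcher) -> float:
    """Share of (prediction, gold) pairs the matcher accepts."""
    return sum(matcher(p, g) for p, g in pairs) / len(pairs)


# Hypothetical (prediction, gold) pairs, invented for illustration.
SAMPLES = [
    ("Yes", "yes"),
    ("Yes, the clause is an indemnification provision.", "yes"),
    ("No", "yes"),
    ("This is not hearsay, so the answer is no.", "no"),
    ("Yes, there is no doubt.", "no"),  # wrong answer, still "contains" the label
]

strict = accuracy(SAMPLES, exact_match)
lenient = accuracy(SAMPLES, contains_match)

print(strict)   # 0.2
print(lenient)  # 0.8
```

Under the same normalization, every exact match is also a contains match, so the lenient score can never fall below the strict one. The last sample shows the other side of that leniency: a wrong answer that happens to include the label text still counts as correct.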