Generated 2026-04-06 | Branch: feature/pipeline-middleware | 417 tests passing
| Config | Scorer | Strategy | NMSE | vs Baseline | Cosine Sim | Compress (ms) | Quality |
|---|---|---|---|---|---|---|---|
| baseline | - | - | 0.009327 | - | 0.9953 | 0.432 | |
| palm+tiered | palm | tiered | 0.009362 | +0.4% | 0.9953 | 0.564 | |
| palm+delta | palm | delta | 0.003755 | -59.7% | 0.9981 | 0.571 | |
| snr+delta2 | snr | delta2 | 0.003755 | -59.7% | 0.9981 | 0.483 | |
| sheaf+delta2 | sheaf | delta2 | 0.003752 | -59.8% | 0.9981 | 0.579 | |
| fisher+delta | fisher | delta | 0.003755 | -59.7% | 0.9981 | 0.499 | |
| palm+window | palm | window | 0.018908 | +102.7% | 0.9905 | 0.264 | |
| fisher+tiered | fisher | tiered | 0.116099 | +1145% | 0.9402 | 0.341 | |
| palm+tiered+stab | palm | tiered stability | 0.009319 | -0.1% | 0.9953 | 0.551 | |
| snr+delta2+lyap | snr | delta2 lyapunov | 0.003747 | -59.8% | 0.9981 | 0.480 | |
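
For readers checking the columns above, the following is a minimal sketch of how the metrics might be computed. It assumes NMSE means squared error normalized by signal energy and that "vs Baseline" is the relative NMSE change in percent; neither definition is stated in this report, and the function names are illustrative rather than part of tqai.

```python
import math

def nmse(original, reconstructed):
    """Normalized MSE: sum((x - x_hat)^2) / sum(x^2). Assumed definition."""
    err = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    energy = sum(x * x for x in original)
    return err / energy

def cosine_similarity(a, b):
    """Cosine of the angle between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vs_baseline(nmse_value, baseline_nmse):
    """Relative NMSE change in percent; negative means lower error than baseline."""
    return 100.0 * (nmse_value - baseline_nmse) / baseline_nmse

# Recompute two "vs Baseline" entries from the NMSE column of the table.
baseline = 0.009327
print(f"{vs_baseline(0.003755, baseline):+.1f}%")  # palm+delta  -> -59.7%
print(f"{vs_baseline(0.018908, baseline):+.1f}%")  # palm+window -> +102.7%
```

Under these assumed definitions the recomputed percentages match the table, which suggests the column is a plain relative difference against the baseline row.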
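The delta and delta2 strategies cut NMSE by about 60% relative to baseline (roughly 0.0038 vs 0.0093) by exploiting inter-step temporal redundancy. This is consistent with the second-order residual insight of QuantSparse (arXiv:2509.23681), although first-order delta reaches nearly the same NMSE in this benchmark.
Best config without a monitor: sheaf+delta2 (NMSE 0.003752, CosSim 0.9981). With the Lyapunov monitor, snr+delta2+lyap is marginally lower (NMSE 0.003747).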
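
To illustrate why residual coding helps on temporally smooth data, here is a toy sketch that quantizes a synthetic sequence directly, with a first-order residual, and with a second-order (linear extrapolation) residual. The absmax quantizer, the 4-bit width, and the synthetic signal are all assumptions chosen for the demonstration; this is not the tqai or QuantSparse implementation.

```python
import math

def quantize(values, bits=4):
    """Symmetric absmax uniform quantizer; returns the dequantized values."""
    scale = max(abs(v) for v in values) or 1.0
    levels = 2 ** (bits - 1) - 1
    return [round(v / scale * levels) / levels * scale for v in values]

def nmse(original, reconstructed):
    """Normalized MSE over a whole sequence of vectors (assumed definition)."""
    err = sum((x - y) ** 2
              for a, b in zip(original, reconstructed)
              for x, y in zip(a, b))
    energy = sum(x * x for a in original for x in a)
    return err / energy

def compress_sequence(steps, order, bits=4):
    """Quantize each step's residual against a prediction from past reconstructions.

    order=0: no prediction (direct quantization)
    order=1: predict x_t from the previous reconstruction (delta)
    order=2: predict x_t by linear extrapolation of the last two (delta2)
    """
    recon = []
    for t, x in enumerate(steps):
        if order == 2 and t >= 2:
            pred = [2 * a - b for a, b in zip(recon[-1], recon[-2])]
        elif order >= 1 and t >= 1:
            pred = recon[-1]
        else:
            pred = [0.0] * len(x)
        residual = [xi - pi for xi, pi in zip(x, pred)]
        dequantized = quantize(residual, bits)
        recon.append([pi + qi for pi, qi in zip(pred, dequantized)])
    return recon

# Synthetic, slowly varying sequence: 20 steps of a 32-dimensional vector.
steps = [[math.sin(0.1 * t + i) + 0.05 * t for i in range(32)] for t in range(20)]

results = {
    name: nmse(steps, compress_sequence(steps, order))
    for name, order in [("direct", 0), ("delta", 1), ("delta2", 2)]
}
for name, value in results.items():
    print(f"{name:>6}: NMSE = {value:.6f}")
```

Because the residuals span a much smaller range than the raw values, the same number of quantization levels covers them more finely, which is the effect the finding above attributes to temporal redundancy.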
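The window strategy (palm+window) compresses about 39% faster than baseline (0.264 ms vs 0.432 ms) by reusing cached quantized outputs. Trade-off: roughly 2x higher NMSE (0.018908 vs 0.009327). It is best suited to real-time inference where latency matters more than distortion.
Best config: palm+window (0.264 ms compress)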
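
The latency/distortion trade-off of similarity-based cache reuse can be sketched as follows. The `WindowCache` class, its cosine-similarity test, and the 0.99 threshold are hypothetical choices for illustration; the actual logic in strategies/window.py may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity with a guard for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

class WindowCache:
    """Reuse the last compressed output while inputs stay similar (hypothetical)."""

    def __init__(self, compress_fn, threshold=0.99):
        self.compress_fn = compress_fn
        self.threshold = threshold
        self.cached_input = None
        self.cached_output = None
        self.hits = 0
        self.misses = 0

    def compress(self, x):
        similar = (
            self.cached_input is not None
            and cosine(x, self.cached_input) >= self.threshold
        )
        if similar:
            # Cache hit: skip compression entirely, accept a stale output.
            self.hits += 1
            return self.cached_output
        # Cache miss: compress and remember both input and output.
        self.misses += 1
        self.cached_input = list(x)
        self.cached_output = self.compress_fn(x)
        return self.cached_output

# Stand-in compressor: round to one decimal place.
cache = WindowCache(lambda x: [round(v, 1) for v in x])
cache.compress([1.0, 2.0, 3.0])    # miss: nothing cached yet
cache.compress([1.01, 2.0, 3.0])   # hit: nearly identical input
cache.compress([-3.0, 1.0, 0.5])   # miss: input changed direction
print(cache.hits, cache.misses)    # 1 2
```

Every hit saves one compression call but returns the output of an older input, which is where the extra distortion comes from.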
fisher+tiered reaches NMSE 0.116, about 12x worse than baseline, because the squared-activation proxy overestimates importance. A true gradient-based Fisher estimate or offline calibration is needed.
Fix: Use Fisher for offline GA calibration only, not for runtime scoring.
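
One way a squared-activation proxy can mislead is shown in the toy example below. It assumes a two-channel linear model with squared loss and compares the proxy against a squared-gradient sensitivity (in the spirit of an empirical Fisher diagonal). The model, weights, and data are invented for illustration and say nothing about how scorers/fisher.py is written.

```python
import random

random.seed(0)

# Channel 0 has large activations but almost no influence on the output;
# channel 1 has small activations but a large weight.
weights = [0.01, 1.0]
samples = [[random.gauss(0.0, 10.0), random.gauss(0.0, 1.0)] for _ in range(1000)]
targets = [random.gauss(0.0, 1.0) for _ in samples]

def squared_activation_proxy(samples):
    """Mean squared activation per channel (the runtime proxy)."""
    n = len(samples)
    return [sum(s[i] ** 2 for s in samples) / n for i in range(len(samples[0]))]

def squared_gradient_score(samples, targets, weights):
    """Mean squared gradient of the loss with respect to each activation.

    loss = 0.5 * (w . x - y)^2, so d(loss)/d(x_i) = (w . x - y) * w_i
    """
    n = len(samples)
    scores = [0.0] * len(weights)
    for x, y in zip(samples, targets):
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, w in enumerate(weights):
            scores[i] += (err * w) ** 2 / n
    return scores

proxy = squared_activation_proxy(samples)
grad = squared_gradient_score(samples, targets, weights)
print("proxy ranks channel 0 higher:   ", proxy[0] > proxy[1])
print("gradient ranks channel 1 higher:", grad[1] > grad[0])
```

In this setup the proxy assigns the most importance to the channel that matters least to the loss, which is the failure mode described above.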
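Adding the stability or Lyapunov monitor has negligible impact on compress time while enabling runtime adaptation: in the table above, monitored configs are within about 2-3% of their unmonitored counterparts (0.551 vs 0.564 ms for palm+tiered, 0.480 vs 0.483 ms for snr+delta2). palm+tiered+stab also slightly improves NMSE over plain palm+tiered (0.009319 vs 0.009362).
Recommendation: Always enable the stability monitor.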
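
As a rough picture of what attention entropy tracking could look like, the sketch below keeps an exponential moving average of the softmax entropy and flags steps that deviate sharply from it. The `EntropyMonitor` class, the decay of 0.9, and the tolerance of 0.5 nats are assumptions for illustration, not the tqai stability monitor itself.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

class EntropyMonitor:
    """Flag steps whose attention entropy departs from its running average."""

    def __init__(self, decay=0.9, tolerance=0.5):
        self.decay = decay
        self.tolerance = tolerance
        self.ema = None

    def update(self, scores):
        h = entropy(softmax(scores))
        if self.ema is None:
            self.ema = h  # first observation seeds the average
            return False
        unstable = abs(h - self.ema) > self.tolerance
        self.ema = self.decay * self.ema + (1.0 - self.decay) * h
        return unstable

monitor = EntropyMonitor()
flags = [
    monitor.update([0.0, 0.0, 0.0, 0.0]),   # uniform attention, seeds the EMA
    monitor.update([0.0, 0.0, 0.0, 0.0]),   # unchanged
    monitor.update([20.0, 0.0, 0.0, 0.0]),  # attention collapses onto one key
]
print(flags)  # [False, False, True]
```

The per-step cost is one softmax and one entropy sum, which is in line with the small compress-time overhead reported above.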
| Model | head_dim | Best Config | Best NMSE | CosSim | Compress ms |
|---|---|---|---|---|---|
| Qwen2.5-0.5B | 64 | sheaf+delta2 | 0.003645 | 0.9982 | 0.151 |
| Qwen2.5-3B | 128 | sheaf+delta2 | 0.003693 | 0.9981 | 0.214 |
| Qwen2.5-7B | 128 | snr+delta2 | 0.003707 | 0.9981 | 0.258 |
| Gemma-2B | 256 | palm+delta | 0.003794 | 0.9981 | 0.211 |
| Gemma-7B | 256 | snr+delta2 | 0.003772 | 0.9981 | 1.468 |
| Llama-3.1-8B | 128 | palm+delta | 0.003746 | 0.9981 | 0.555 |

| Model | head_dim | Best Config | Best NMSE | CosSim | Compress ms |
|---|---|---|---|---|---|
| WAN2.2-5B | 128 | snr+delta2 | 0.003743 | 0.9981 | 0.641 |

| Paper | Technique | Module | Status |
|---|---|---|---|
| QuantSparse | Second-order Δ² | strategies/delta2.py | Full |
| DiTFastAttn | Step sharing | strategies/delta.py | Full |
| DiTFastAttn | Window Attn | strategies/window.py | Partial |
| BSA | KV saliency | scorers/bsa.py | Full |
| BSA | Q sparsity | — | Needs kernel |
| Fisher-Rao | FIM scoring | scorers/fisher.py | Proxy only |
| Sheaf Theory | Harmonicity | scorers/sheaf.py | Full |
| Copresheaf | Per-head codebooks | codebook/registry.py | Registry ready |
| SparseDiT | Layer allocation | skip_layers config | Full |
| VDiT Analysis | Non-sparse layers | skip_layers | Full |
| Spherical Attn | L2-norm attn | — | Not impl |
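
| Component | Role | Description |
|---|---|---|
| palm | scorer | EMA novelty/surprise (TurboQuant) |
| snr | scorer | Diffusion schedule SNR (Min-SNR) |
| fisher | scorer | Squared activation proxy (APTQ) |
| sheaf | scorer | Laplacian harmonicity (Sheaf Theory) |
| bsa | scorer | Block centroid saliency (BSA) |
| tiered | strategy | Dual-quantizer routing by score |
| delta | strategy | First-order inter-step Δ |
| delta2 | strategy | Second-order Δ² (QuantSparse) |
| window | strategy | Similarity-based cache reuse |
| stability | monitor | Attention entropy tracking |
| lyapunov | monitor | FTLE divergence detection |
| llm | model family | HuggingFace / mlx-lm autoregressive |
| dit | model family | Diffusers DiT (SD3, Flux) |
| wan | model family | WAN 2.2 (TI2V-5B, A14B) |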
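
The "dual-quantizer routing by score" idea behind the tiered strategy can be sketched like this. The 0.5 threshold, the 8-bit and 4-bit widths, and the absmax quantizer are assumptions made for the example; the real tiered strategy may route differently.

```python
def quantize(values, bits):
    """Symmetric absmax uniform quantizer; returns the dequantized values."""
    scale = max(abs(v) for v in values) or 1.0
    levels = 2 ** (bits - 1) - 1
    return [round(v / scale * levels) / levels * scale for v in values]

def tiered_compress(vectors, scores, threshold=0.5, high_bits=8, low_bits=4):
    """Send high-scoring vectors to the fine quantizer, the rest to the coarse one."""
    out = []
    for vec, score in zip(vectors, scores):
        bits = high_bits if score >= threshold else low_bits
        out.append((bits, quantize(vec, bits)))
    return out

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# The same vector scored as important (0.9) and as unimportant (0.1).
vec = [0.11, -0.52, 0.93]
(bits_hi, recon_hi), (bits_lo, recon_lo) = tiered_compress([vec, vec], [0.9, 0.1])
print(bits_hi, squared_error(vec, recon_hi))
print(bits_lo, squared_error(vec, recon_lo))
```

The quality of the result depends entirely on the scorer: a scorer that ranks the wrong entries as unimportant routes them to the coarse quantizer, which is how a poor score can inflate NMSE.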
# LLM (Qwen, Gemma, Llama) — best quality
tqai run "prompt" -m Qwen/Qwen2.5-7B --scorer palm --strategy delta
# LLM — best quality with monitoring
pipeline = {"scorer": "palm", "strategy": "delta", "monitor": "stability"}
# DiT / WAN 2.2 — second-order delta with SNR schedule
pipeline = {"scorer": "snr", "strategy": "delta2", "monitor": "lyapunov",
            "scorer_kwargs": {"schedule": "cosine"}}
# DiT with layer protection (identify non-sparse layers first)
pipeline = {"scorer": "sheaf", "strategy": "delta2",
            "skip_layers": [0, 1, 28, 29]}  # protect first/last layers
# Fast inference (lower quality, lower latency)
pipeline = {"scorer": "palm", "strategy": "window"}
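
Since these configs are plain dictionaries, a typo in a component name is easy to miss. The following is a hypothetical standalone checker, not part of tqai, that validates a pipeline dict against the scorer, strategy, and monitor names listed in this report; treating every key as optional is an assumption.

```python
# Component names taken from this report; the helper itself is illustrative.
KNOWN_COMPONENTS = {
    "scorer": {"palm", "snr", "fisher", "sheaf", "bsa"},
    "strategy": {"tiered", "delta", "delta2", "window"},
    "monitor": {"stability", "lyapunov"},
}

def validate_pipeline(pipeline):
    """Return a list of problems; an empty list means nothing looked wrong."""
    problems = []
    for key, allowed in KNOWN_COMPONENTS.items():
        value = pipeline.get(key)
        if value is not None and value not in allowed:
            problems.append(
                f"unknown {key} {value!r}; expected one of {sorted(allowed)}"
            )
    skip_layers = pipeline.get("skip_layers", [])
    if not all(isinstance(i, int) and i >= 0 for i in skip_layers):
        problems.append("skip_layers must contain non-negative integers")
    return problems

good = {"scorer": "snr", "strategy": "delta2", "monitor": "lyapunov",
        "scorer_kwargs": {"schedule": "cosine"}}
bad = {"scorer": "palm", "strategy": "windw"}  # misspelled strategy

print(validate_pipeline(good))  # []
print(validate_pipeline(bad))
```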