tqai v0.4 Pipeline Benchmark Report

Generated 2026-04-06 | Branch: feature/pipeline-middleware | 417 tests passing

Scorers: 5 (palm, snr, fisher, sheaf, bsa)
Strategies: 4 (tiered, delta, delta2, window)
Monitors: 2 (stability, lyapunov)
Adapters: 3 (llm, dit, wan)
Models Tested: 7 (Qwen, Gemma, Llama, WAN 2.2)
Papers Covered: 12/13 (QuantSparse, DiTFastAttn, BSA, ...)

Aggregate Results (Mean Across All Models)

Config            Scorer  Strategy  Monitor    NMSE      vs Baseline  Cosine Sim  Compress (ms)
baseline          -       -         -          0.009327  -            0.9953      0.432
palm+tiered       palm    tiered    -          0.009362  +0.4%        0.9953      0.564
palm+delta        palm    delta     -          0.003755  -59.7%       0.9981      0.571
snr+delta2        snr     delta2    -          0.003755  -59.7%       0.9981      0.483
sheaf+delta2      sheaf   delta2    -          0.003752  -59.8%       0.9981      0.579
fisher+delta      fisher  delta     -          0.003755  -59.7%       0.9981      0.499
palm+window       palm    window    -          0.018908  +102.7%      0.9905      0.264
fisher+tiered     fisher  tiered    -          0.116099  +1145%       0.9402      0.341
palm+tiered+stab  palm    tiered    stability  0.009319  -0.1%        0.9953      0.551
snr+delta2+lyap   snr     delta2    lyapunov   0.003747  -59.8%       0.9981      0.480
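The report does not define its metrics, so the following is a minimal sketch under the usual definitions: NMSE as squared error divided by the reference signal's energy, cosine similarity as the normalized dot product, and "vs Baseline" as the relative NMSE change in percent. The function names are illustrative and are not taken from tqai.

```python
import math

def nmse(reference, approx):
    """Normalized MSE: ||reference - approx||^2 / ||reference||^2."""
    err = sum((r - a) ** 2 for r, a in zip(reference, approx))
    power = sum(r * r for r in reference)
    return err / power

def cosine_sim(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def vs_baseline(nmse_value, baseline_nmse):
    """Relative NMSE change in percent; negative means lower error than baseline."""
    return 100.0 * (nmse_value - baseline_nmse) / baseline_nmse

# NMSE values copied from the aggregate table.
print(round(vs_baseline(0.003752, 0.009327), 1))  # sheaf+delta2 -> -59.8
print(round(vs_baseline(0.018908, 0.009327), 1))  # palm+window  -> 102.7
```

Under these definitions the "vs Baseline" column can be reproduced from the NMSE column alone.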

Key Findings

Best Quality: Delta Strategies

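Delta and delta2 strategies cut mean NMSE by about 60% relative to baseline (roughly 0.0038 vs 0.0093) by exploiting inter-step temporal redundancy. This is consistent with the second-order residual insight of QuantSparse (arXiv:2509.23681). All delta-based configs land within about 0.2% of each other, so the choice of scorer matters little once a delta strategy is used.

Best config without a monitor: sheaf+delta2 (NMSE 0.003752, CosSim 0.9981). With the Lyapunov monitor, snr+delta2+lyap is marginally lower (NMSE 0.003747, CosSim 0.9981).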
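To illustrate why quantizing inter-step residuals can reduce error, here is a toy sketch. It is not the tqai implementation: the uniform quantizer, the synthetic sine-wave "activations", and the names `quantize` and `run` are all assumptions made for illustration. Order 1 quantizes the difference from the previous reconstruction, and order 2 quantizes the difference from a linear extrapolation of the previous two reconstructions.

```python
import math

def quantize(vec, bits=4):
    """Symmetric uniform quantizer with a per-vector scale (toy stand-in)."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in vec)
    if peak == 0.0:
        return [0.0] * len(vec)
    scale = peak / levels
    return [round(v / scale) * scale for v in vec]

def nmse(ref, approx):
    return sum((r - a) ** 2 for r, a in zip(ref, approx)) / sum(r * r for r in ref)

def run(steps, order):
    """Mean NMSE over a sequence; order 0 = direct, 1 = delta, 2 = delta2."""
    history = []  # reconstructions of earlier steps (what a decoder would hold)
    errors = []
    for x in steps:
        if order >= 2 and len(history) >= 2:
            # Second order: predict by linear extrapolation of the last two steps.
            pred = [2 * a - b for a, b in zip(history[-1], history[-2])]
        elif order >= 1 and len(history) >= 1:
            # First order: predict that nothing changed since the last step.
            pred = history[-1]
        else:
            pred = [0.0] * len(x)
        residual = quantize([xi - pi for xi, pi in zip(x, pred)])
        recon = [pi + ri for pi, ri in zip(pred, residual)]
        history.append(recon)
        errors.append(nmse(x, recon))
    return sum(errors) / len(errors)

# Slowly varying synthetic data: 16 channels over 20 steps.
steps = [[math.sin(0.1 * t + i) for i in range(16)] for t in range(20)]
for name, order in [("direct", 0), ("delta", 1), ("delta2", 2)]:
    print(name, run(steps, order))
```

Because the residual has a much smaller dynamic range than the signal itself, the same bit budget yields a finer step size. The size of the gain on this synthetic data says nothing about the figures in the table, which come from real models.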

Fastest: Window Strategy

The window strategy compresses about 39% faster than baseline (0.264 ms vs 0.432 ms) by reusing cached quantized outputs. Trade-off: roughly 2x the baseline NMSE (0.0189 vs 0.0093). Best suited to real-time inference where latency matters more than distortion.

Best config: palm+window (0.264ms compress)
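The cache-reuse idea can be sketched as follows. This is a hypothetical illustration, not the code in strategies/window.py: the `WindowCache` class, the cosine-similarity test, and the 0.99 threshold are all assumptions.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

class WindowCache:
    """Reuse the last quantized output while the input stays similar."""

    def __init__(self, quantize, threshold=0.99):
        self.quantize = quantize
        self.threshold = threshold
        self.ref_input = None       # input that produced the cached output
        self.cached_output = None
        self.hits = 0

    def compress(self, x):
        if self.ref_input is not None and cosine_sim(x, self.ref_input) >= self.threshold:
            self.hits += 1
            return self.cached_output  # skip quantization entirely
        self.ref_input = list(x)
        self.cached_output = self.quantize(x)
        return self.cached_output

def round_quantize(x):
    """Toy quantizer: snap every value to a multiple of 1/8."""
    return [round(v * 8) / 8 for v in x]

cache = WindowCache(round_quantize)
a = cache.compress([1.0, 0.5, -0.25])
b = cache.compress([1.01, 0.5, -0.25])  # nearly identical: served from cache
c = cache.compress([-1.0, 0.5, 0.25])   # dissimilar: quantized again
print(cache.hits)  # 1
```

A cache hit returns a stale output, which is where both the latency saving and the added distortion come from. The threshold sets the balance between the two.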

Fisher Scorer Needs Calibration

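fisher+tiered reaches NMSE 0.116, about 12x the baseline (0.0093), and cosine similarity drops to 0.9402, because the squared-activation proxy over-estimates importance. fisher+delta is unaffected (NMSE 0.003755), which suggests the damage is confined to score-based tier routing. The scorer needs a true gradient-based Fisher estimate or offline calibration.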

Fix: Use Fisher for offline GA calibration only, not runtime scoring.
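One way a squared-activation proxy can mislead is shown in the toy example below. It assumes a single linear readout with squared-error loss, which is an illustrative setup and not how scorers/fisher.py works (the report does not show that code). A channel with a large activation but a near-zero downstream weight scores high under the proxy, yet its squared gradient is tiny.

```python
def squared_activation_proxy(acts):
    """Proxy score: the squared activation of each channel."""
    return [a * a for a in acts]

def squared_gradient(acts, weights, target):
    """Squared gradient of 0.5 * (w . a - target)^2 with respect to each activation.

    A single-sample stand-in for a diagonal Fisher estimate, which would
    average this quantity over many samples.
    """
    residual = sum(w * a for w, a in zip(weights, acts)) - target
    return [(residual * w) ** 2 for w in weights]

acts = [10.0, 0.1]     # channel 0 is large in magnitude
weights = [0.01, 1.0]  # but the loss barely depends on it
target = 1.0

proxy = squared_activation_proxy(acts)
fisher = squared_gradient(acts, weights, target)
print("proxy ranks channel", proxy.index(max(proxy)), "highest")
print("gradient ranks channel", fisher.index(max(fisher)), "highest")
```

The two scores rank the channels in opposite order here. If tier routing trusts the proxy, precision is spent on the channel the loss is least sensitive to.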

Monitors Add Minimal Overhead

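Adding the stability or Lyapunov monitor has negligible impact on compress time while enabling runtime adaptation: in the aggregate table the monitored configs are within about 2% of their unmonitored counterparts (0.551 vs 0.564 ms for palm+tiered, 0.480 vs 0.483 ms for snr+delta2). palm+tiered+stab also slightly improves NMSE over plain palm+tiered (0.009319 vs 0.009362).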

Recommendation: Always enable stability monitor.

Per-Model Results

LLM Models

Model         head_dim  Best Config   Best NMSE  CosSim  Compress (ms)
Qwen2.5-0.5B  64        sheaf+delta2  0.003645   0.9982  0.151
Qwen2.5-3B    128       sheaf+delta2  0.003693   0.9981  0.214
Qwen2.5-7B    128       snr+delta2    0.003707   0.9981  0.258
Gemma-2B      256       palm+delta    0.003794   0.9981  0.211
Gemma-7B      256       snr+delta2    0.003772   0.9981  1.468
Llama-3.1-8B  128       palm+delta    0.003746   0.9981  0.555

DiT Models (Video Generation)

Model         head_dim  Best Config   Best NMSE  CosSim  Compress (ms)
WAN2.2-5B     128       snr+delta2    0.003743   0.9981  0.641

Paper Coverage Matrix

Paper           Technique           Module                Status
QuantSparse     Second-order Δ²     strategies/delta2.py  Full
DiTFastAttn     Step sharing        strategies/delta.py   Full
DiTFastAttn     Window Attn         strategies/window.py  Partial
BSA             KV saliency         scorers/bsa.py        Full
BSA             Q sparsity          -                     Needs kernel
Fisher-Rao      FIM scoring         scorers/fisher.py     Proxy only
Sheaf Theory    Harmonicity         scorers/sheaf.py      Full
Copresheaf      Per-head codebooks  codebook/registry.py  Registry ready
SparseDiT       Layer allocation    skip_layers config    Full
VDiT Analysis   Non-sparse layers   skip_layers           Full
Spherical Attn  L2-norm attn        -                     Not impl

Plugin Registry

Scorers

palm EMA novelty/surprise (TurboQuant)

snr Diffusion schedule SNR (Min-SNR)

fisher Squared activation proxy (APTQ)

sheaf Laplacian harmonicity (Sheaf Theory)

bsa Block centroid saliency (BSA)
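For the snr entry above, the quantities involved can be sketched with the standard cosine noise schedule and the Min-SNR-γ weight min(SNR, γ) / SNR. How tqai turns these into a per-step importance score is not stated in this report, so the function names and the defaults (s = 0.008, γ = 5) are assumptions.

```python
import math

def alpha_bar(t, T, s=0.008):
    """Cumulative signal fraction under the cosine schedule, for 0 <= t <= T."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def snr(t, T):
    """Signal-to-noise ratio at step t; valid for 0 < t <= T (infinite at t = 0)."""
    ab = alpha_bar(t, T)
    return ab / (1.0 - ab)

def min_snr_weight(t, T, gamma=5.0):
    """Min-SNR-gamma weight: clamps the influence of low-noise steps."""
    value = snr(t, T)
    return min(value, gamma) / value

T = 50
for t in (1, 25, 49):
    print(t, snr(t, T), min_snr_weight(t, T))
```

SNR falls monotonically as t grows. The weight is 1 for noisy steps (SNR below γ) and shrinks toward 0 for nearly clean steps.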

Strategies

tiered Dual-quantizer routing by score

delta First-order inter-step Δ

delta2 Second-order Δ² (QuantSparse)

window Similarity-based cache reuse
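The tiered entry above, dual-quantizer routing by score, might look like the sketch below. The bit widths (8 and 3), the threshold, and the function names are hypothetical choices for illustration and are not taken from tqai.

```python
def make_quantizer(bits):
    """Build a symmetric uniform quantizer with a per-vector scale."""
    levels = 2 ** (bits - 1) - 1

    def quantize(vec):
        peak = max(abs(v) for v in vec)
        if peak == 0.0:
            return [0.0] * len(vec)
        scale = peak / levels
        return [round(v / scale) * scale for v in vec]

    return quantize

def tiered_compress(tokens, scores, threshold,
                    high=make_quantizer(8), low=make_quantizer(3)):
    """Send high-scoring tokens to the fine quantizer, the rest to the coarse one."""
    out = []
    for vec, score in zip(tokens, scores):
        quantizer = high if score >= threshold else low
        out.append(quantizer(vec))
    return out

def sq_err(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

tokens = [[0.9, -0.33, 0.1], [0.9, -0.33, 0.1]]
scores = [0.8, 0.2]  # e.g. produced by a scorer such as palm or snr
high_out, low_out = tiered_compress(tokens, scores, threshold=0.5)
print(sq_err(tokens[0], high_out), sq_err(tokens[1], low_out))
```

The same vector is reconstructed far more accurately on the high tier, so the overall quality of this strategy depends on the scorer ranking tokens sensibly.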

Monitors

stability Attention entropy tracking

lyapunov FTLE divergence detection
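As an illustration of the stability entry above, an attention-entropy tracker could be sketched like this. The `EntropyMonitor` class, its tolerance, and the choice to compare against the first observed entropy are assumptions. The report does not describe how the actual monitor decides that attention has drifted.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention row."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

class EntropyMonitor:
    """Flag steps whose attention entropy drifts away from the first observation."""

    def __init__(self, tolerance=0.5):
        self.tolerance = tolerance
        self.baseline = None

    def update(self, logits):
        h = attention_entropy(softmax(logits))
        if self.baseline is None:
            self.baseline = h
            return False
        return abs(h - self.baseline) > self.tolerance

monitor = EntropyMonitor()
print(monitor.update([0.0, 0.0, 0.0, 0.0]))   # first call sets the baseline: False
print(monitor.update([0.1, 0.0, 0.0, 0.0]))   # nearly uniform: False
print(monitor.update([10.0, 0.0, 0.0, 0.0]))  # collapsed onto one key: True
```

A pipeline could respond to a flagged step by, for example, falling back to a more conservative quantizer for that step.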

Adapters

llm HuggingFace / mlx-lm autoregressive

dit Diffusers DiT (SD3, Flux)

wan WAN 2.2 (TI2V-5B, A14B)

Recommended Configurations

# LLM (Qwen, Gemma, Llama) — best quality
tqai run "prompt" -m Qwen/Qwen2.5-7B --scorer palm --strategy delta

# LLM — best quality with monitoring
pipeline = {"scorer": "palm", "strategy": "delta", "monitor": "stability"}

# DiT / WAN 2.2 — second-order delta with SNR schedule
pipeline = {"scorer": "snr", "strategy": "delta2", "monitor": "lyapunov",
            "scorer_kwargs": {"schedule": "cosine"}}

# DiT with layer protection (identify non-sparse layers first)
pipeline = {"scorer": "sheaf", "strategy": "delta2",
            "skip_layers": [0, 1, 28, 29]}  # protect first/last layers

# Fast inference (lower quality, lower latency)
pipeline = {"scorer": "palm", "strategy": "window"}
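
A mistyped plugin name in one of these dicts is easy to miss, so a small pre-flight check can help. The sketch below is a hypothetical helper, not part of tqai. It only compares names against the Plugin Registry listed in this report and makes no claim about how tqai itself validates configs.

```python
# Names copied from the Plugin Registry section of this report.
REGISTRY = {
    "scorer": {"palm", "snr", "fisher", "sheaf", "bsa"},
    "strategy": {"tiered", "delta", "delta2", "window"},
    "monitor": {"stability", "lyapunov"},
}

def validate_pipeline(config):
    """Return a list of problems; an empty list means every plugin name is known."""
    problems = []
    for key, allowed in REGISTRY.items():
        if key in config and config[key] not in allowed:
            problems.append(f"unknown {key}: {config[key]!r}")
    return problems

good = {"scorer": "snr", "strategy": "delta2", "monitor": "lyapunov",
        "scorer_kwargs": {"schedule": "cosine"}}
bad = {"scorer": "palm", "strategy": "windw"}

print(validate_pipeline(good))  # []
print(validate_pipeline(bad))   # ["unknown strategy: 'windw'"]
```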