turbo_attn (imported as `tqkv`)
Copyright 2026 Arbi City (Dmitri Evseev <dmitri.evseev@arbi.city>)

This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, You can obtain one at:

    https://mozilla.org/MPL/2.0/

A copy of the License is included as `LICENSE` at the root of this repository.

================================================================================
ORIGINATED CONTRIBUTIONS
================================================================================

The following techniques and components were originated by Arbi City and
are released under MPL-2.0 with the request that downstream uses
preserve attribution. See `ATTRIBUTION.md` for technique-by-technique
provenance, novelty notes, and file references.

Kernel-level contributions (Arbi City, 2026):

  * FA4 inline-dequant — subclass of FlashAttention-4's CuTeDSL forward
    that overrides load_K / load_V to dequantize compressed KV bytes
    directly into MMA register tiles, eliminating the decompress buffer.
    File: tqkv/kernels/cuda/prefill/fa4_tq.py.

  * V4 split-D Q-once smem layout — full sQ + aliased sKV (96 KB)
    allowing K and V to share a single 32 KB chunk-sized smem region
    sequentially under barrier protection.

  * Dual-kernel SWA decode dispatcher — split-K kernel for layers at
    sw=0; unified BLOCK_M=1 kernel for SWA layers at sw>0.
    Files: tqkv/kernels/_cuda_decode_splitk.cu,
    tqkv/kernels/_cuda_decode_unified.cu.

  * Cooperative bf16→fp32 smem dtype swap in the decode kernel.

  * Multi-warp compress pack at w=2 (occupancy 41.5% → 80.7%).
    File: tqkv/kernels/cuda_compress_store.py.

  * Per-layer bit-allocation pipeline using logit-KLD distortion +
    greedy bang-per-byte solver, with `lossless` / `balanced` /
    `aggressive` profiles. Files: tqkv/calibration/build_distortion_table.py,
    tqkv/calibration/solve_bits.py, tqkv/calibration/calibrate_model.py.

  * Fisher-weighted Lloyd's calibration using attention-mass as the
    weighting proxy. Files: tqkv/calibrate_online.py and
    tqkv/calibration/.
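
The greedy bang-per-byte idea behind the bit-allocation pipeline can be
sketched as follows. This is a minimal illustration, not the code in
tqkv/calibration/solve_bits.py: the function name `greedy_allocate`, the
table layout, and the use of a single bit-width per layer (rather than
separate K and V widths) are assumptions made for the example.

```python
from typing import Dict, Sequence


def greedy_allocate(
    distortion: Dict[str, Dict[int, float]],
    bytes_per_bit: Dict[str, float],
    budget_bytes: float,
    choices: Sequence[int] = (2, 4, 8),
) -> Dict[str, int]:
    """Hypothetical greedy solver: upgrade the layer with the best
    distortion reduction per added byte until the budget is exhausted.

    distortion[layer][bits] is a measured distortion (for example a
    logit KLD); lower is better. bytes_per_bit[layer] is the storage
    cost of one extra bit of width for that layer.
    """
    widths = sorted(choices)
    # Start every layer at the cheapest width.
    alloc = {layer: widths[0] for layer in distortion}
    used = sum(bytes_per_bit[layer] * bits for layer, bits in alloc.items())
    if used > budget_bytes:
        raise ValueError("budget is below the minimum-width footprint")

    while True:
        best = None  # (gain per byte, layer, next width, extra bytes)
        for layer, bits in alloc.items():
            idx = widths.index(bits)
            if idx + 1 == len(widths):
                continue  # already at the widest setting
            nxt = widths[idx + 1]
            extra = bytes_per_bit[layer] * (nxt - bits)
            if used + extra > budget_bytes:
                continue  # upgrade does not fit
            gain = distortion[layer][bits] - distortion[layer][nxt]
            if gain <= 0:
                continue  # upgrade does not help
            ratio = gain / extra
            if best is None or ratio > best[0]:
                best = (ratio, layer, nxt, extra)
        if best is None:
            return alloc
        _, layer, nxt, extra = best
        alloc[layer] = nxt
        used += extra
```

Under these assumptions, a layer whose distortion drops sharply between
2 and 4 bits is upgraded before a layer that barely benefits, which is
the behaviour the `lossless` / `balanced` / `aggressive` profiles would
tune through the byte budget.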

Integration-level contributions (Arbi City, 2026):

  * vLLM v1 attention-backend plugin with FULL_AND_PIECEWISE CUDAGraph
    capture across both prefill and decode.
    Files: tqkv/integrations/vllm/plugin.py, backend.py.

  * Per-group BlockPool patch for hybrid models (attention + Mamba/GDN);
    proposed upstream to vLLM.

  * SGLang attention-backend integration (pool, backend, metadata,
    CUDAGraph). Files: tqkv/integrations/sglang/.

  * Hybrid GatedDeltaNet support end-to-end (validated on Qwen3.5-27B-AWQ
    TP=2 at 1.36M context).

  * Hybrid MoE support end-to-end (validated on LFM2-8B-A1B and
    Qwen3.6-MoE TP=2).

================================================================================
THIRD-PARTY COMPONENTS AND ATTRIBUTION
================================================================================

This product includes software developed by third parties listed below.
Their licenses apply to those components in addition to the MPL-2.0
license covering this repository as a whole.

--------------------------------------------------------------------------------
TurboQuant codec
--------------------------------------------------------------------------------

The codec layer (Walsh-Hadamard rotation, Lloyd-Max codebook construction,
sign-removed rotation, n_centroids parameterization) is the contribution
of Zandieh et al., "TurboQuant" (ICLR 2026), and underlies turbo_attn's
compression scheme. turbo_attn extends the codec with:

  * bf16 norm storage (was fp32 in initial drafts)
  * Multi-corpus calibration (c4 + chat + code + math, multi-run pooled)
  * Per-layer per-(k,v) bit allocation via logit-KLD distortion
  * Fisher-weighted Lloyd's centroid fitting
  * Asymmetric K/V across all nine {2,4,8}² combinations

See ATTRIBUTION.md §"Codec layer" for the technique-by-technique split.
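
For readers unfamiliar with the two codec building blocks named above,
the sketch below shows an orthonormal fast Walsh-Hadamard transform and
a one-dimensional weighted Lloyd's fit. It is an illustration only and
is neither the TurboQuant reference code nor turbo_attn's
implementation: the function names, the quantile initialisation, and
the use of a plain per-sample weight as a stand-in for the
attention-mass proxy are assumptions.

```python
import math
from typing import List, Sequence


def fwht(x: Sequence[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (length must be 2**k)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [v * scale for v in out]


def weighted_lloyd(
    samples: Sequence[float],
    weights: Sequence[float],
    n_centroids: int,
    iters: int = 25,
) -> List[float]:
    """1-D Lloyd's algorithm where each sample carries a weight."""
    order = sorted(samples)
    # Assumed initialisation: evenly spaced quantiles of the samples.
    centroids = [
        order[(2 * k + 1) * len(order) // (2 * n_centroids)]
        for k in range(n_centroids)
    ]
    for _ in range(iters):
        num = [0.0] * n_centroids
        den = [0.0] * n_centroids
        for s, w in zip(samples, weights):
            # Assign the sample to its nearest centroid.
            k = min(range(n_centroids), key=lambda c: abs(s - centroids[c]))
            num[k] += w * s
            den[k] += w
        # Move each centroid to the weighted mean of its cell.
        centroids = [
            num[k] / den[k] if den[k] > 0 else centroids[k]
            for k in range(n_centroids)
        ]
    return sorted(centroids)
```

With uniform weights this reduces to ordinary Lloyd's iteration; larger
weights pull a centroid toward the samples that matter most, which is
the intent of a weighted centroid fit.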

--------------------------------------------------------------------------------
FlashAttention / FlashAttention-4 (vendored)
--------------------------------------------------------------------------------

Directory: `tqkv/csrc/third_party/flash_attention_cute/`

This directory contains a vendored copy of the FlashAttention-4 CuTeDSL
forward kernel by Tri Dao and the FlashAttention team, distributed under
the Apache-2.0 license. The original LICENSE is preserved at
`tqkv/csrc/third_party/flash_attention_cute/LICENSE`.

turbo_attn does not modify the vendored code. The TQ-specific
inline-dequant logic lives entirely in
`tqkv/kernels/cuda/prefill/fa4_tq.py` as a subclass of the upstream
`FlashAttentionForwardSm120` class.

--------------------------------------------------------------------------------
vLLM
--------------------------------------------------------------------------------

turbo_attn registers as a vLLM attention backend via vLLM's plugin
entry-point system. vLLM is © 2023 the vLLM team and is distributed
under the Apache-2.0 license. See https://github.com/vllm-project/vllm.

A small set of patches against vLLM is required until upstream merges
the `CacheDType` Literal relaxation; see `docker/PATCHES.md`.

--------------------------------------------------------------------------------
SGLang
--------------------------------------------------------------------------------

turbo_attn registers as an SGLang attention backend via a runtime
monkey-patch of `_init_pools`. SGLang is © the SGLang Project and is
distributed under the Apache-2.0 license. See
https://github.com/sgl-project/sglang.
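
The runtime monkey-patch pattern mentioned above can be illustrated
generically. The sketch below does not use SGLang at all: `StandInRunner`
is a hypothetical stand-in class, and `install_pool_patch`, `kv_pool`,
and the wrapper behaviour are assumptions chosen only to show how a
method such as `_init_pools` can be wrapped once, idempotently, at
import time.

```python
import functools
from typing import Any, Callable


class StandInRunner:
    """Hypothetical stand-in for a framework class owning `_init_pools`."""

    def _init_pools(self) -> None:
        self.kv_pool = "default-pool"


def install_pool_patch(cls: type, make_pool: Callable[[Any], Any]) -> None:
    """Wrap cls._init_pools so the pool it builds is replaced afterwards."""
    original = cls._init_pools
    if getattr(original, "_patched", False):
        return  # already installed; keep the patch idempotent

    @functools.wraps(original)
    def patched(self, *args, **kwargs):
        result = original(self, *args, **kwargs)  # run the stock setup
        self.kv_pool = make_pool(self.kv_pool)    # then swap in our pool
        return result

    patched._patched = True
    cls._init_pools = patched


install_pool_patch(StandInRunner, lambda pool: f"compressed({pool})")
install_pool_patch(StandInRunner, lambda pool: f"compressed({pool})")  # no-op

runner = StandInRunner()
runner._init_pools()
```

Calling the original first and replacing its result afterwards keeps the
patch small and leaves the rest of the framework's setup untouched, at
the cost of depending on a private method name that may change between
releases.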

================================================================================
CITATION
================================================================================

If you build on turbo_attn (kernels, integration patterns, or the
per-layer bit-allocation pipeline), please cite this work via the
metadata in `CITATION.cff` and credit the originated contributions
listed above. If you build on the underlying TurboQuant codec, please
also cite Zandieh et al., "TurboQuant" (ICLR 2026).

Per MPL-2.0 §3.4, this NOTICE and the accompanying LICENSE must be
preserved in any redistribution of the Covered Software.
