turbo_attn (imported as `tqkv`)
Copyright 2026 Arbi City (Dmitri Evseev <dmitri.evseev@arbi.city>)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
A copy of the License is included as `LICENSE` at the root of this repository.
You may also obtain a copy at:

    http://www.apache.org/licenses/LICENSE-2.0

================================================================================
ORIGINATED CONTRIBUTIONS
================================================================================

The following techniques and components were originated by Arbi City and
are released under Apache-2.0 with the request that downstream uses
preserve attribution. See `ATTRIBUTION.md` for technique-by-technique
provenance, novelty notes, and file references.

Kernel-level contributions (Arbi City, 2026):

  * FA4 inline-dequant — subclass of FlashAttention-4's CuTeDSL forward
    that overrides load_K / load_V to dequantize compressed KV bytes
    directly into MMA register tiles, eliminating the decompress buffer.
    File: tqkv/kernels/cuda/prefill/fa4_tq.py.

  * V4 split-D Q-once smem layout — full sQ + aliased sKV (96 KB)
    allowing K and V to share a single 32 KB chunk-sized smem region
    sequentially under barrier protection.

  * Dual-kernel SWA decode dispatcher — split-K kernel for sliding-window
    layers at sw=0; unified BLOCK_M=1 kernel for sw>0 SWA layers.
    Files: tqkv/kernels/_cuda_decode_splitk.cu,
    tqkv/kernels/_cuda_decode_unified.cu.

  * Cooperative bf16→fp32 smem dtype swap in the decode kernel.

  * Multi-warp compress pack at w=2 (occupancy 41.5% → 80.7%).
    File: tqkv/kernels/cuda_compress_store.py.

  * Per-layer bit-allocation pipeline using logit-KLD distortion +
    greedy bang-per-byte solver, with `lossless` / `balanced` /
    `aggressive` profiles. Files: scripts/build_distortion_table.py,
    scripts/solve_bits.py, scripts/calibrate_model.py.

  * Fisher-weighted Lloyd's calibration using attention-mass as the
    weighting proxy. File: tqkv/calibrate_online.py and scripts/.

Integration-level contributions (Arbi City, 2026):

  * vLLM v1 attention-backend plugin with FULL_AND_PIECEWISE CUDAGraph
    capture across both prefill and decode.
    File: tqkv/integrations/vllm/plugin.py, backend.py.

  * Per-group BlockPool patch for hybrid models (attention + Mamba/GDN);
    proposed upstream to vLLM.

  * SGLang attention-backend integration (pool, backend, metadata,
    CUDAGraph). Files: tqkv/integrations/sglang/.

  * Hybrid GatedDeltaNet support end-to-end (validated on Qwen3.5-27B-AWQ
    TP=2 at 1.36M context).

  * Hybrid MoE support end-to-end (validated on LFM2-8B-A1B and
    Qwen3.6-MoE TP=2).

================================================================================
THIRD-PARTY COMPONENTS AND ATTRIBUTION
================================================================================

This product includes software developed by third parties listed below.
Their licenses apply to those components in addition to the Apache-2.0
License covering this repository as a whole.

--------------------------------------------------------------------------------
TurboQuant codec
--------------------------------------------------------------------------------

The codec layer (Walsh-Hadamard rotation, Lloyd-Max codebook construction,
sign-removed rotation, n_centroids parameterization) is the contribution
of Zandieh et al., "TurboQuant" (ICLR 2026), and underlies turbo_attn's
compression scheme. turbo_attn extends the codec with:

  * bf16 norm storage (was fp32 in initial drafts)
  * Multi-corpus calibration (c4 + chat + code + math, multi-run pooled)
  * Per-layer per-(k,v) bit allocation via logit-KLD distortion
  * Fisher-weighted Lloyd's centroid fitting
  * Asymmetric K/V across all nine {2,4,8}² combinations

See ATTRIBUTION.md §"Codec layer" for the technique-by-technique split.

--------------------------------------------------------------------------------
FlashAttention / FlashAttention-4 (vendored)
--------------------------------------------------------------------------------

Directory: `tqkv/csrc/third_party/flash_attention_cute/`

This directory contains a vendored copy of the FlashAttention-4 CuTeDSL
forward kernel by Tri Dao and the FlashAttention team, distributed under
the Apache-2.0 license. The original LICENSE is preserved at
`tqkv/csrc/third_party/flash_attention_cute/LICENSE`.

turbo_attn does not modify the vendored code. The TQ-specific
inline-dequant logic lives entirely in
`tqkv/kernels/cuda/prefill/fa4_tq.py` as a subclass of the upstream
`FlashAttentionForwardSm120` class.

--------------------------------------------------------------------------------
vLLM
--------------------------------------------------------------------------------

turbo_attn registers as a vLLM attention backend via vLLM's plugin
entry-point system. vLLM is © 2023 the vLLM team and is distributed
under the Apache-2.0 license. See https://github.com/vllm-project/vllm.

A small set of patches against vLLM is required while upstream merges
the `CacheDType` Literal relaxation; see `docker/PATCHES.md`.

--------------------------------------------------------------------------------
SGLang
--------------------------------------------------------------------------------

turbo_attn registers as an SGLang attention backend via a runtime
monkey-patch of `_init_pools`. SGLang is © the SGLang Project and is
distributed under the Apache-2.0 license. See
https://github.com/sgl-project/sglang.

================================================================================
CITATION
================================================================================

If you build on turbo_attn — kernels, integration patterns, or the
per-layer bit-allocation pipeline — please cite this work via the
metadata in `CITATION.cff` and credit the originating contribution
listed above. If you build on the underlying TurboQuant codec, please
cite Zandieh et al., ICLR 2026, in addition.

This NOTICE file is required to be reproduced unchanged in any
redistribution per Section 4(d) of the Apache 2.0 License.
