Metadata-Version: 2.4
Name: b12x
Version: 0.1.0
Summary: Unapologetically SM120-only CuTe DSL kernels for NVFP4 GEMM and MoE.
Author: Luke
License-Expression: Apache-2.0
Requires-Python: <4.0,>=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.10.0
Requires-Dist: cuda-python
Requires-Dist: nvidia-cutlass-dsl[cu13]==4.4.1
Requires-Dist: apache-tvm-ffi!=0.1.8,!=0.1.8.post0,<0.2,>=0.1.6
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"

# b12x

`b12x` is an unapologetically SM120-only CuTe DSL kernel library for Blackwell-class NVFP4 GEMM and routed Mixture-of-Experts inference.

The project is intentionally narrow. It is not a generic CUDA kernel collection for LLM inference. It is a focused package for shipping a small number of high-performance kernels and the runtime glue needed to launch them cleanly from PyTorch and `sglang`/`vLLM`.

## What It Includes

- Clean, standalone CuTe DSL dense NVFP4 GEMM kernels
- A fused tensor-parallel MoE path for NVFP4 expert weights
- CuTe DSL FP4 packing, scaling, and quantization helpers
- Torch reference implementations for correctness checking (see the quantization sketch after this list)
- Lightweight integration surfaces for PyTorch and `sglang`
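
As an illustration of what the torch reference paths verify, here is a minimal pure-PyTorch sketch of NVFP4-style block quantization: 4-bit e2m1 payloads with one scale per 16 elements. The function name is hypothetical and the FP32 block scales are a simplification (NVFP4 stores block scales in FP8 e4m3 alongside a global scale); this is not the package's actual implementation.

```python
import torch

# The eight non-negative magnitudes representable in FP4 e2m1.
E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quant_dequant_ref(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Illustrative round-trip: per-16-element block scaling, then
    round-to-nearest e2m1. Real NVFP4 keeps block scales in FP8 (e4m3)
    plus a global FP32 scale; FP32 scales are used here for clarity.
    Assumes x.numel() is a multiple of `block`."""
    values = E2M1_VALUES.to(x.device)
    xb = x.reshape(-1, block).float()
    scale = xb.abs().amax(dim=-1, keepdim=True) / 6.0  # 6.0 = max |e2m1|
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    scaled = xb / scale
    # Nearest representable magnitude, then restore the sign.
    idx = (scaled.abs().unsqueeze(-1) - values).abs().argmin(dim=-1)
    q = values[idx] * scaled.sign()
    return (q * scale).reshape(x.shape).to(x.dtype)
```

Round-tripping a tensor through a reference like this and comparing against kernel output is the basic correctness-check pattern.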

## What It Does Not Try To Be

- Multi-architecture
- Backward-compatible with pre-SM120 GPUs
- A model-serving framework
- A wrapper around inherited FlashInfer runtime code

## Requirements

- NVIDIA Blackwell SM120 GPU (compute capability 12.0; see the check below)
- CUDA 13 toolchain
- A CUDA 13 build of PyTorch, `torch>=2.10.0`
- `nvidia-cutlass-dsl[cu13]==4.4.1`
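
Since the kernels are SM120-only, a quick runtime check can save a confusing failure later. This is a generic PyTorch sketch rather than a b12x API; SM120 corresponds to CUDA compute capability 12.0.

```python
import torch

# Verify the GPU and the CUDA build of PyTorch before importing the kernels.
major, minor = torch.cuda.get_device_capability()
if (major, minor) != (12, 0):
    raise RuntimeError(f"b12x targets SM120 only; this GPU reports sm_{major}{minor}")

# torch.version.cuda reports the CUDA version this PyTorch wheel was built with.
print("PyTorch CUDA build:", torch.version.cuda)  # expect a 13.x string
```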

## Package Layout

- `b12x.gemm`
  - Dense NVFP4 GEMM kernels
- `b12x.integration`
  - Public runtime entrypoints such as `b12x_moe_fp4` (introspected in the sketch below)
- `b12x.moe.fused`
  - The fused MoE kernel, scheduler, and reference paths
- `b12x.quant`
  - Expert-weight quantization helpers
- `b12x.sglang`
  - `sglang` integration shims
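
The entrypoint signatures are not documented here, so rather than guess at them, introspect the installed wheel. The import path below comes from the layout above; everything else is standard library.

```python
import inspect

from b12x.integration import b12x_moe_fp4  # public entrypoint named above

# Ask the installed package for the real signature instead of assuming one.
print(inspect.signature(b12x_moe_fp4))
print(inspect.getdoc(b12x_moe_fp4))
```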

The published wheel contains only the `b12x` package tree. Benchmarks, experiments, and tests remain in the source repo.
