Metadata-Version: 2.4
Name: b12x
Version: 0.3.0
Summary: Unapologetically SM120-only CuTe DSL kernels for NVFP4 GEMM and MoE.
Author: Luke
License-Expression: Apache-2.0
Requires-Python: <4.0,>=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.10.0
Requires-Dist: cuda-python
Requires-Dist: nvidia-cutlass-dsl[cu13]==4.4.1
Requires-Dist: apache-tvm-ffi!=0.1.8,!=0.1.8.post0,<0.2,>=0.1.6
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"

# b12x

`b12x` is an SM120-only CuTe DSL kernel library for Blackwell NVFP4 dense GEMM
and routed Mixture-of-Experts inference.

It is intentionally narrow. It is not a generic CUDA kernel collection or a
full model-serving stack, and it does not target any other GPU architecture,
including SM100. It is a focused package: a small number of hand-tuned, high-performance
SM120 kernels plus the runtime glue needed to launch them cleanly from PyTorch and `sglang`.

## Installation

### Runtime install

```bash
python -m pip install b12x
```

### Development install from source

```bash
git clone <repo-url>
cd b12x
python -m pip install -e '.[dev]'
```

## Requirements

- Blackwell SM120 GPU (compute capability 12.0; see the quick check below)
- CUDA 13 toolchain
- Python `>=3.10,<4.0`
- CUDA 13 build of PyTorch, `torch>=2.10.0`
- `nvidia-cutlass-dsl[cu13]==4.4.1`
- FlashInfer, only if you want reference and benchmark comparisons; it is not a runtime dependency
- A Qwen3.5-397B-A17B NVFP4 checkpoint, pointed to by `B12X_MODEL_PATH`, for the end-to-end MoE benchmark
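
A quick way to sanity-check the GPU, CUDA, and PyTorch requirements from Python
(a minimal sketch using only standard `torch` introspection, not a `b12x` API):

```python
# Environment sanity check: standard torch/CUDA introspection only.
import sys

import torch

assert sys.version_info >= (3, 10), "b12x requires Python >= 3.10"
assert torch.cuda.is_available(), "no CUDA device visible"

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability : {major}.{minor}")        # SM120 reports 12.0
print(f"torch CUDA toolkit : {torch.version.cuda}")   # expect a 13.x build
print(f"torch version      : {torch.__version__}")    # expect >= 2.10.0
```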

## Package layout

- `b12x.cute`
  - Low-level CuTe and FP4 helpers
- `b12x.gemm`
  - Standalone dense NVFP4 GEMM
- `b12x.integration`
  - Public runtime entrypoints such as `b12x_moe_fp4`
- `b12x.moe.fused`
  - Static and dynamic fused MoE kernels and reference paths
- `b12x.quant`
  - Torch-side NVFP4 packing and quantization helpers
- `b12x.sglang`
  - Thin `sglang` integration shims

## MoE runtime contract

- `b12x.integration.tp_moe.b12x_moe_fp4` requires a caller-owned workspace.
- `b12x` selects its fused MoE backend from shape alone:
  - compact routed workloads use the static backend
  - all larger routed workloads use the dynamic backend
- Use `allocate_tp_moe_workspace(...)` for one exact unchunked launch shape.
- Use `allocate_tp_moe_workspace_pool()` for variable-size or chunked workloads.
- Keep one workspace pool per process/device, and let the pool partition scratch by CUDA stream internally.
- During CUDA graph capture, `output=` must also be caller-owned and stable across replays (see the sketch below).
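
A rough shape of that contract, as a sketch rather than the real signature:
`b12x_moe_fp4`, the `allocate_tp_moe_*` helpers, and the `output=` keyword are
named above, but the import path of the helpers, the `workspace=` keyword, and
the elided positional arguments are assumptions; check the docstrings in
`b12x.integration.tp_moe` for the actual API.

```python
# Sketch of the caller-owned-workspace contract. The helper import path, the
# workspace= keyword, and the elided positional arguments are assumptions;
# only b12x_moe_fp4, the allocate_tp_moe_* helpers, and output= are documented.
import torch

from b12x.integration.tp_moe import (
    b12x_moe_fp4,
    allocate_tp_moe_workspace_pool,  # assumed to live beside b12x_moe_fp4
)

device = torch.device("cuda")

# One pool per process and device; it partitions scratch per CUDA stream
# internally, so a single pool covers variable-size and chunked workloads.
workspace = allocate_tp_moe_workspace_pool()

# Caller-owned output buffer. Under CUDA graph capture, allocate it once and
# keep its address stable across replays.
num_tokens, hidden_size = 8, 4096  # illustrative shapes; dtype is also illustrative
output = torch.empty(num_tokens, hidden_size, dtype=torch.bfloat16, device=device)

# The fused-MoE backend (static vs. dynamic) is selected from the routed shape
# alone; the caller never picks it explicitly.
b12x_moe_fp4(
    ...,                  # activations, routing ids/weights, packed NVFP4 expert weights
    workspace=workspace,  # caller-owned scratch (keyword name is a guess)
    output=output,        # required and address-stable during graph capture
)
```

For a single fixed, unchunked launch shape, `allocate_tp_moe_workspace(...)` is
the lighter-weight alternative to the pool, per the contract above.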

## Benchmarks and tests

### Benchmarks

- `benchmarks/benchmark_moe.py`
  - End-to-end Qwen3.5-397B TP=4 MoE benchmark
  - `micro` batch profile: `[1, 2, 4, 8]`
  - `sglang-single-request` batch profile: `[1, 23, 80]`
  - `chunked-prefill` batch profile: `[8192, 16384, 24576, 32768]`
- `benchmarks/benchmark_dense_gemm.py`
  - Dense FP4 GEMM vs FlashInfer/cuDNN/CUTLASS

### Tests

- `tests/test_tp_moe_reference.py`
  - Independent oracle-backed MoE correctness test
- `tests/test_moe_equivalence.py`
  - Real-weight smoke and CUDA-graph replay routing-safety checks
- `tests/test_gemm_stack.py`
  - Dense GEMM exactness vs FlashInfer/cuDNN

## Common commands

```bash
# Graph-first benchmark defaults with auto-dispatch
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py

# Measure eager launches instead of CUDA graph replay
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --no-cuda-graph

# Include routing in the timed region
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --include-routing

# Use the recorded single-request sglang profile
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --batch-size-profile sglang-single-request

# Graph-first prefill-scale sweep aligned with chunked-prefill serving
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --batch-size-profile chunked-prefill

# Multi-layer CUDA-graph replay validation with real consecutive MoE layers
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --graph-mode multi-layer --reference none --validate none

# Dense GEMM microbenchmark
python benchmarks/benchmark_dense_gemm.py

# Oracle-backed MoE correctness
python tests/test_tp_moe_reference.py --impls b12x --scale-contract per-expert

# Real-weight CUDA-graph smoke
pytest tests/test_moe_equivalence.py
```
