Metadata-Version: 2.4
Name: warpscope
Version: 0.1.0
Summary: In-kernel %globaltimer profiler for warp-specialized CUDA kernels (Perfetto/Chrome timelines).
Author: warpscope contributors
License: MIT
Project-URL: Homepage, https://github.com/YangWang92/warpscope
Keywords: cuda,profiler,globaltimer,perfetto,gpu,warp-specialized,tracing
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries
Classifier: Environment :: GPU :: NVIDIA CUDA
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21
Provides-Extra: torch
Requires-Dist: torch; extra == "torch"
Provides-Extra: perfetto
Requires-Dist: tg4perfetto; extra == "perfetto"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Dynamic: license-file

# warpscope

English | [中文](README.zh.md)

> Credits: the idea and design all come from **侯博涵 (Hou Bohan)**'s write-up
> ([zhihu](https://zhuanlan.zhihu.com/p/2054305616391304228)); the wire format and the
> host-side decode / Perfetto export are adapted from **Apache TVM TIRx `CudaProfiler`**
> ([bench.py](https://github.com/apache/tvm/blob/main/python/tvm/tirx/bench.py),
> [docs](https://tvm.apache.org/docs/tirx/native_basics/cuda/profiling.html)).
> The implementation in this repo was **written by Claude Opus (Anthropic)**.

In-kernel `%globaltimer` profiler for **warp-specialized CUDA kernels**. Bracket the
logical stages inside a kernel (TMA load, MMA, softmax, epilogue, ...) with
`start`/`end` markers; one leader thread per logical group stamps the GPU global timer
into a buffer you pass as an ordinary kernel argument. Decode it on the host into
per-`(block, group)` durations or a **Perfetto / Chrome trace** to see how the
producer and consumer warp-groups actually overlap — something total launch time and
SM-level counters can't show.

It is **not** zero cost (a timer read + a global store + a block fence per event), so
it is a debugging/analysis tool. Build with the profiler disabled for production.

## Layout

```
warpscope/
  include/
    warpscope.cuh         # device header (header-only, NVRTC-safe)
    warpscope_host.hpp    # host decoder + Chrome-trace writer (header-only, pure C++)
  *.py                    # Python: Profiler buffer mgmt, decode, trace export
examples/                 # toy CUDA program (pure C++ path) + python driver
tests/                    # wire-format + decode tests
```

## Install

```bash
pip install -e .            # core (numpy only)
pip install -e ".[torch]"   # + GPU buffer allocation
pip install -e ".[dev]"     # + pytest
```

## Device side (CUDA C++)

```cpp
#include <warpscope.cuh>
enum : uint32_t { EvWait = 0, EvWork = 1 };

__global__ void k(..., uint64_t* prof, uint32_t stride,
                  uint32_t num_groups, uint32_t num_blocks, uint32_t max_rec) {
    ws::Profiler<true> p;     // <false> compiles to a no-op
    const uint32_t warp = threadIdx.x / 32, lane = threadIdx.x % 32;

    if (warp == 0) {                                  // e.g. TMA producer = group 0
        p.init(prof, stride, /*group=*/0, num_groups, num_blocks,
               /*leader=*/lane == 0, max_rec);
        { WS_REGION(p, EvWait); /* barrier wait */ }  // RAII start/end
        { WS_REGION(p, EvWork); /* issue work  */ }
        p.finalize();
    }
    // ... other warp-groups: init with their own group id + one leader each ...
}
```

Build: `nvcc -I"$(warpscope --include)" -arch=sm_100a my.cu`

## Host side — pick one

**Pure C++ (header-only, no Python):**

```cpp
#include <warpscope_host.hpp>
std::vector<uint64_t> h(slots);                 // cudaMemcpy buffer back into h
ws::write_chrome_trace(h.data(), h.size(),
    /*events*/ {"wait", "work"},
    /*groups*/ {"tma", "umma", "utccp", "epilogue"},
    "trace.json");                              // open in chrome://tracing or perfetto
```

**Python:**

```python
import warpscope as ws
prof = ws.Profiler(num_blocks=num_sms, num_groups=4, max_records_per_lane=64)
launch(..., prof.ptr)        # pass the device pointer
torch.cuda.synchronize()
res = prof.decode(event_names={0: "wait", 1: "work"},
                  group_names={0: "tma", 1: "umma", 2: "utccp", 3: "epilogue"})
res.print_durations()
res.to_perfetto("trace.json")   # Chrome JSON; opens in ui.perfetto.dev too
```

## Output

The raw output is a `uint64` buffer. Both host paths turn it into a **Chrome Trace
Event JSON** file (`pid = block`, `tid = group`, `ts/dur` in microseconds) that opens
directly in `chrome://tracing` and <https://ui.perfetto.dev>. A native
`.perfetto-trace` writer is available via the optional `tg4perfetto` dependency.

## Wire format (v1, shared ABI)

```
record = (globaltimer_lo32 << 32) | tag32
tag32  = (block_group << 12) | (event_id << 2) | event_type
block_group = block_idx * num_groups + group_id
event_type : 0=begin 1=end 2=instant 3=finalize
buf[0] header = (num_groups << 32) | num_blocks
```

Identical to the format used by TIRx/flashinfer, so traces are cross-tool compatible.

## Caveats

- Zero the buffer before launch (the decoder treats 0 as empty).
- Exactly one leader thread per `(block, group)` lane (two writers clobber the cursor).
- `%globaltimer_lo` is 32-bit ns: ~tens-of-ns resolution and a ~4.29 s wrap.
- Persistent grids stream records — cap with `max_records_per_lane` (host) which is
  also enforced device-side via `init(..., max_records_per_lane=...)`.
- The fence + store perturb tight pipelines; keep events coarse and compare against an
  unprofiled (`ws::Profiler<false>`) build.

## Credits & License

Licensed under the **MIT License** (see [LICENSE](LICENSE)).

- **侯博涵 (Hou Bohan)** — original idea and write-up:
  <https://zhuanlan.zhihu.com/p/2054305616391304228>
- **Apache TVM TIRx `CudaProfiler`** (Apache-2.0) — wire format + host decode/Perfetto
  export are adapted from it:
  <https://github.com/apache/tvm/blob/main/python/tvm/tirx/bench.py> ·
  <https://tvm.apache.org/docs/tirx/native_basics/cuda/profiling.html>
- The CUDA/Python implementation in this repository was **written by Claude Opus (Anthropic)**.
