Generate a CUDA kernel for fp32 sum reduction over the last dimension: out = sum(x, dim=-1).
