Generate a CUDA kernel for fp32 mean over the last dimension: out = mean(x, dim=-1).
