Generate a CUDA kernel for fp32 cumulative max along the last dimension.
