Generate a CUDA kernel for fp32 max over the last dimension: out = max(x, dim=-1).values.
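
One possible shape for such a kernel is sketched below; it is a minimal sketch, not a tuned implementation. It assumes x is contiguous and row-major, with all leading dimensions flattened into `rows` and the last dimension of length `cols`, and that both are at least 1. The names `row_max_fp32_kernel` and `row_max_fp32`, the block size of 256, and the one-block-per-row layout are illustrative choices, not part of the request. NaN handling is also unspecified in the request: `fmaxf` returns the non-NaN operand, so this sketch ignores NaNs rather than propagating them.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical kernel name. One thread block reduces one row of length `cols`.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void row_max_fp32_kernel(const float* __restrict__ x,
                                    float* __restrict__ out,
                                    int rows, int cols) {
    int row = blockIdx.x;
    if (row >= rows) return;  // uniform across the block, so safe before __syncthreads
    const float* row_ptr = x + (size_t)row * cols;

    // Each thread scans a strided slice of the row.
    float m = -INFINITY;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        m = fmaxf(m, row_ptr[c]);

    // Reduce within each warp using shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        m = fmaxf(m, __shfl_down_sync(0xffffffffu, m, offset));

    // Lane 0 of each warp publishes its partial maximum.
    __shared__ float warp_max[32];
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;
    if (lane == 0) warp_max[warp] = m;
    __syncthreads();

    // The first warp reduces the per-warp partials and writes the result.
    if (warp == 0) {
        int num_warps = blockDim.x >> 5;
        m = (lane < num_warps) ? warp_max[lane] : -INFINITY;
        for (int offset = 16; offset > 0; offset >>= 1)
            m = fmaxf(m, __shfl_down_sync(0xffffffffu, m, offset));
        if (lane == 0) out[row] = m;
    }
}

// Hypothetical host helper: copies a row-major [rows, cols] array to the
// device, launches one block per row, and copies the [rows] result back.
cudaError_t row_max_fp32(const float* h_x, float* h_out, int rows, int cols) {
    float* d_x = nullptr;
    float* d_out = nullptr;
    size_t in_bytes = (size_t)rows * cols * sizeof(float);
    size_t out_bytes = (size_t)rows * sizeof(float);

    cudaError_t err = cudaMalloc((void**)&d_x, in_bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void**)&d_out, out_bytes);
    if (err != cudaSuccess) { cudaFree(d_x); return err; }

    err = cudaMemcpy(d_x, h_x, in_bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        row_max_fp32_kernel<<<rows, 256>>>(d_x, d_out, rows, cols);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, out_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_out);
    return err;
}
```

One block per row suits rows of a few hundred elements or more. For very short rows, assigning one warp per row would likely waste fewer threads, and for very long rows with few of them, splitting each row across several blocks may help; both are variations on the same reduction pattern.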
