Generate a CUDA kernel for fp32 min over the last dimension: out = min(x, dim=-1).values.
