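Generate a CUDA kernel that computes the numerically stable softmax numerator for fp16 input: out = exp(x - max(x, dim=-1)), where the max is taken along the last dimension and broadcast back over each row.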
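
A minimal sketch of one way to do this, not a tuned implementation. It assumes a contiguous row-major input viewed as `rows x cols`, with the last dimension as `cols`, and assigns one thread block to each row. The names `softmax_numerator_fp16` and `launch_softmax_numerator_fp16`, the block size of 256, and the choice to do the arithmetic in fp32 are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cmath>

// Hypothetical kernel name. One thread block handles one row.
// Assumes blockDim.x is a power of two and that the launch provides
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void softmax_numerator_fp16(const __half* __restrict__ x,
                                       __half* __restrict__ out,
                                       int rows, int cols) {
    extern __shared__ float smax[];

    const int row = blockIdx.x;
    if (row >= rows) return;  // uniform across the block, so no barrier is skipped

    const __half* xr = x + (size_t)row * cols;
    __half* outr = out + (size_t)row * cols;

    // Pass 1: each thread scans a strided slice of the row for its maximum.
    float local_max = -INFINITY;
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        local_max = fmaxf(local_max, __half2float(xr[c]));
    }
    smax[threadIdx.x] = local_max;
    __syncthreads();

    // Tree reduction in shared memory to get the row maximum.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + s]);
        }
        __syncthreads();
    }
    const float row_max = smax[0];

    // Pass 2: subtract the maximum in fp32, exponentiate, round back to fp16.
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        outr[c] = __float2half(expf(__half2float(xr[c]) - row_max));
    }
}

// Hypothetical host-side launcher; d_x and d_out are device pointers.
void launch_softmax_numerator_fp16(const __half* d_x, __half* d_out,
                                   int rows, int cols, cudaStream_t stream) {
    if (rows <= 0 || cols <= 0) return;
    const int threads = 256;  // power of two, as the reduction requires
    const size_t shmem = threads * sizeof(float);
    softmax_numerator_fp16<<<rows, threads, shmem, stream>>>(d_x, d_out, rows, cols);
}
```

Subtracting the row maximum keeps every exponent argument at or below zero, so each output lies in [0, 1] and cannot overflow fp16. Very negative differences may still round to zero in half precision. For short rows, a warp-per-row variant using shuffle instructions would likely waste fewer threads; that is left out of this sketch.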
