Generate a CUDA kernel for numerically stable fp16 softmax over the last dimension.
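
One possible shape for such a kernel is sketched below. It is a minimal illustration under stated assumptions, not a tuned implementation: the names `softmax_lastdim_fp16` and `launch_softmax_lastdim_fp16` are hypothetical, the input is assumed to be a contiguous row-major `[rows, cols]` buffer of `__half`, one thread block handles one row, and the block size is assumed to be a power of two. Stability comes from subtracting the row maximum before exponentiating, and the reductions accumulate in fp32 so that the sum does not overflow or lose precision in fp16.

```cuda
#include <assert.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical kernel: one thread block per row of a contiguous [rows, cols] tensor.
// Assumes blockDim.x is a power of two and blockDim.x floats of dynamic shared memory.
__global__ void softmax_lastdim_fp16(const __half* __restrict__ x,
                                     __half* __restrict__ y,
                                     int cols) {
    extern __shared__ float smem[];
    const __half* row_in = x + (size_t)blockIdx.x * cols;
    __half* row_out = y + (size_t)blockIdx.x * cols;
    const int tid = threadIdx.x;

    // Pass 1: row maximum, computed in fp32.
    float local_max = -INFINITY;
    for (int i = tid; i < cols; i += blockDim.x)
        local_max = fmaxf(local_max, __half2float(row_in[i]));
    smem[tid] = local_max;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] = fmaxf(smem[tid], smem[tid + s]);
        __syncthreads();
    }
    const float row_max = smem[0];
    __syncthreads();  // all threads have read smem[0] before it is overwritten

    // Pass 2: sum of exp(x - max); every exponent is <= 0, so expf cannot overflow.
    float local_sum = 0.0f;
    for (int i = tid; i < cols; i += blockDim.x)
        local_sum += expf(__half2float(row_in[i]) - row_max);
    smem[tid] = local_sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    const float inv_sum = 1.0f / smem[0];

    // Pass 3: normalize and round back to fp16 only at the final store.
    for (int i = tid; i < cols; i += blockDim.x)
        row_out[i] = __float2half(expf(__half2float(row_in[i]) - row_max) * inv_sum);
}

// Hypothetical host-side launcher; d_x and d_y are device pointers.
void launch_softmax_lastdim_fp16(const __half* d_x, __half* d_y,
                                 int rows, int cols, cudaStream_t stream = 0) {
    assert(rows > 0 && cols > 0);
    const int threads = 256;  // power of two, as the tree reduction requires
    softmax_lastdim_fp16<<<rows, threads, threads * sizeof(float), stream>>>(
        d_x, d_y, cols);
}
```

A call such as `launch_softmax_lastdim_fp16(d_x, d_y, rows, cols)` would process every row independently. The sketch recomputes the exponentials in the third pass rather than caching them, and it does not special-case rows that consist entirely of `-inf` or contain `+inf`, which would produce NaN. Warp-shuffle reductions or `__half2` vectorized loads would likely be faster for production use.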
