Generate a CUDA kernel for fp16 HardSigmoid activation: out = clamp(x / 6 + 0.5, 0, 1). Single input tensor x, fp16.
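
One way to read the formula above is as a minimal sketch like the following, not a definitive implementation. Everything it names is an assumption that the request does not state: the kernel name `hardsigmoid_fp16`, the host helper `run_hardsigmoid`, the separate output buffer, the grid-stride loop, the block size of 256, and the choice to do the arithmetic in fp32 before rounding back to fp16.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical kernel name. Computes out[i] = clamp(x[i] / 6 + 0.5, 0, 1).
// Assumption: each element is widened to fp32, evaluated, and rounded back
// to fp16, so the kernel does not depend on native fp16 arithmetic support.
__global__ void hardsigmoid_fp16(const __half* __restrict__ x,
                                 __half* __restrict__ out, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    // Grid-stride loop: correct for any n, whatever grid size was launched.
    for (; i < n; i += stride) {
        const float v = __half2float(x[i]) / 6.0f + 0.5f;
        // fmaxf/fminf return the non-NaN operand, so a NaN input yields 0 here.
        out[i] = __float2half(fminf(fmaxf(v, 0.0f), 1.0f));
    }
}

// Hypothetical host helper for illustration: converts floats to fp16,
// runs the kernel, and returns the fp16 results widened back to float.
std::vector<float> run_hardsigmoid(const std::vector<float>& in) {
    const size_t n = in.size();
    std::vector<float> result(n);
    if (n == 0) return result;

    std::vector<__half> h_x(n), h_out(n);
    for (size_t i = 0; i < n; ++i) h_x[i] = __float2half(in[i]);

    const size_t bytes = n * sizeof(__half);
    __half* d_x = nullptr;
    __half* d_out = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;  // assumed block size, not tuned
    const int blocks = static_cast<int>((n + threads - 1) / threads);
    hardsigmoid_fp16<<<blocks, threads>>>(d_x, d_out, n);

    // The blocking copy waits for the kernel and surfaces any earlier error.
    const cudaError_t err =
        cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_x);
    cudaFree(d_out);
    for (size_t i = 0; i < n; ++i) result[i] = __half2float(h_out[i]);
    return result;
}

int main() {
    const std::vector<float> in = {-6.0f, -3.0f, -1.5f, 0.0f, 1.5f, 3.0f, 6.0f};
    const std::vector<float> out = run_hardsigmoid(in);
    for (size_t i = 0; i < in.size(); ++i)
        std::printf("hardsigmoid(%g) = %g\n", in[i], out[i]);
    return 0;
}
```

By the formula, inputs at or below -3 saturate to 0 and inputs at or above 3 saturate to 1. For large tensors, a variant that loads `__half2` pairs would likely move memory more efficiently, but it needs an even-length tail case and 4-byte alignment, so this sketch stays scalar.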
