Generate a CUDA kernel for fp16 SiLU activation: out = x * sigmoid(x).
