Generate a CUDA kernel for fp16 Softsign activation on a single input tensor x: out = x / (1 + |x|). Elementwise, output shape equals input shape.
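
One way the request above could be met is sketched here; it is not a definitive implementation. The names `softsign_fp16_kernel`, `softsign_fp16`, and `softsign_fp16_host` are hypothetical, as are the block size of 256 and the grid cap of 65535. The sketch assumes a contiguous tensor addressed as a flat array of `n` elements, and it assumes that doing the arithmetic in fp32 and rounding once to fp16 on store is acceptable.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: out[i] = x[i] / (1 + |x[i]|), elementwise.
// Each value is widened to fp32, evaluated, then rounded back to fp16.
__global__ void softsign_fp16_kernel(const __half* __restrict__ x,
                                     __half* __restrict__ out,
                                     size_t n) {
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    // Grid-stride loop, so a capped grid still covers all n elements.
    for (size_t i = idx; i < n; i += stride) {
        float v = __half2float(x[i]);
        out[i] = __float2half(v / (1.0f + fabsf(v)));
    }
}

// Hypothetical launcher: d_x and d_out are device pointers to n elements each.
cudaError_t softsign_fp16(const __half* d_x, __half* d_out, size_t n,
                          cudaStream_t stream = 0) {
    if (n == 0) return cudaSuccess;  // nothing to launch
    const int threads = 256;         // assumed block size, not tuned
    size_t needed = (n + threads - 1) / threads;
    int blocks = (int)(needed < 65535 ? needed : 65535);
    softsign_fp16_kernel<<<blocks, threads, 0, stream>>>(d_x, d_out, n);
    return cudaGetLastError();
}

// Hypothetical host helper for experiments: converts floats to fp16,
// runs the kernel, and returns the results converted back to float.
std::vector<float> softsign_fp16_host(const std::vector<float>& in) {
    size_t n = in.size();
    std::vector<float> result(n);
    if (n == 0) return result;

    std::vector<__half> h(n);
    for (size_t i = 0; i < n; ++i) h[i] = __float2half(in[i]);

    __half* d_x = nullptr;
    __half* d_out = nullptr;
    size_t bytes = n * sizeof(__half);
    cudaError_t err = cudaMalloc(&d_x, bytes);
    assert(err == cudaSuccess);
    err = cudaMalloc(&d_out, bytes);
    assert(err == cudaSuccess);

    err = cudaMemcpy(d_x, h.data(), bytes, cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);
    err = softsign_fp16(d_x, d_out, n);
    assert(err == cudaSuccess);
    // This blocking copy on the default stream also waits for the kernel.
    err = cudaMemcpy(h.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    cudaFree(d_x);
    cudaFree(d_out);
    for (size_t i = 0; i < n; ++i) result[i] = __half2float(h[i]);
    return result;
}
```

Because the operation is elementwise, the output buffer has the same element count and shape as the input, and the tensor's shape never enters the kernel. Note that under this formula an infinite input evaluates to inf / inf and so yields NaN rather than ±1; a caller that needs the limiting value would have to special-case it. A vectorized variant that loads `__half2` pairs is a possible refinement but is not shown here.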
