Generate a CUDA kernel for fp16 SwiGLU-style fused multiply: out = x * sigmoid(gate) for same-shaped fp16 tensors.
