Generate a CUDA kernel for fp16 fused bias add and GELU: out = gelu(x + bias) for same-shaped tensors.
