Generate a CUDA kernel for an fp32 fused tanh-add: out = tanh(x) + y, computed elementwise, where x, y, and out are same-shaped tensors.
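
One possible shape for such a kernel is sketched below, to make the formula out = tanh(x) + y concrete. It assumes the tensors are contiguous fp32 buffers flattened to n elements, and that out does not alias x or y. The names fused_tanh_add_kernel, fused_tanh_add, and run_fused_tanh_add are hypothetical, as are the block size of 256 and the grid cap. It is a minimal sketch under those assumptions, not a tuned implementation.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s\n",                  \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Elementwise out[i] = tanh(x[i]) + y[i] over n contiguous fp32 values.
// Assumption: out does not alias x or y (required by __restrict__).
// The grid-stride loop lets any grid size cover any n.
__global__ void fused_tanh_add_kernel(const float* __restrict__ x,
                                      const float* __restrict__ y,
                                      float* __restrict__ out,
                                      size_t n) {
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = idx; i < n; i += stride) {
        out[i] = tanhf(x[i]) + y[i];
    }
}

// Hypothetical launcher. All three pointers must be device pointers.
cudaError_t fused_tanh_add(const float* d_x, const float* d_y, float* d_out,
                           size_t n, cudaStream_t stream = 0) {
    if (n == 0) return cudaSuccess;
    const int threads = 256;          // assumed block size, not tuned
    const size_t max_blocks = 65535;  // assumed cap; the loop covers the rest
    size_t needed = (n + threads - 1) / threads;
    int blocks = (int)(needed < max_blocks ? needed : max_blocks);
    fused_tanh_add_kernel<<<blocks, threads, 0, stream>>>(d_x, d_y, d_out, n);
    return cudaGetLastError();
}

// Hypothetical host-side convenience wrapper: copies inputs to the device,
// runs the kernel, and copies the result back.
std::vector<float> run_fused_tanh_add(const std::vector<float>& x,
                                      const std::vector<float>& y) {
    assert(x.size() == y.size());  // inputs must be same-shaped
    const size_t n = x.size();
    std::vector<float> out(n);
    if (n == 0) return out;
    const size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr, *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, bytes));
    CUDA_CHECK(cudaMalloc(&d_y, bytes));
    CUDA_CHECK(cudaMalloc(&d_out, bytes));
    CUDA_CHECK(cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(fused_tanh_add(d_x, d_y, d_out, n));
    CUDA_CHECK(cudaDeviceSynchronize());
    CUDA_CHECK(cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaFree(d_y));
    CUDA_CHECK(cudaFree(d_out));
    return out;
}
```

Fusing the two operations means each element is read from x and y once and written to out once, with no intermediate tanh(x) buffer in global memory. Since the operation is memory-bound, a vectorized variant using float4 loads might help on aligned buffers, but that is left out here to keep the sketch short.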
