Generate a CUDA kernel for fp32 fused bias add and ReLU: out = max(x + bias, 0) for same-shaped tensors.
