Generate a CUDA kernel for fp32 fused add and ReLU: out = relu(x + y).
