Generate a CUDA kernel for fp32 elementwise hypotenuse of two tensors x and y of the same shape: out = sqrt(x*x + y*y). Two input tensors x and y, fp32.
