Generate a CUDA kernel that computes the fp32 mean squared error between two input tensors, predictions and targets, of equal shape: out = mean((predictions - targets)^2), reduced over all elements to a single scalar.
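
One possible shape for such a kernel is sketched below, as an illustration rather than a definitive implementation. It assumes both tensors are contiguous fp32 buffers of n elements, that the block size is a multiple of 32 and at most 1024, and that out is zeroed before launch. The names mse_kernel, warp_reduce_sum, and mse_host are hypothetical, and CUDA error checking is omitted for brevity. Each thread accumulates squared differences in a grid-stride loop, each block reduces its partial sums with warp shuffles and shared memory, and one thread per block adds the block's share of the mean to out with atomicAdd.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Sum a value across the 32 lanes of a warp; lane 0 ends up with the total.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Hypothetical kernel: accumulates mean((predictions - targets)^2) into *out.
// Assumes *out == 0 on entry and blockDim.x is a multiple of 32 (max 1024).
__global__ void mse_kernel(const float* __restrict__ predictions,
                           const float* __restrict__ targets,
                           float* out, size_t n) {
    __shared__ float warp_sums[32];  // one slot per warp in the block

    // Grid-stride loop: each thread sums its share of squared differences.
    float acc = 0.0f;
    const size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        const float d = predictions[i] - targets[i];
        acc += d * d;
    }

    // Reduce within each warp, then stage one partial sum per warp.
    acc = warp_reduce_sum(acc);
    const int lane = threadIdx.x & 31;
    const int warp = threadIdx.x >> 5;
    if (lane == 0) warp_sums[warp] = acc;
    __syncthreads();

    // The first warp reduces the per-warp partials to a block total.
    if (warp == 0) {
        const int num_warps = blockDim.x >> 5;
        acc = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        acc = warp_reduce_sum(acc);
        // One atomic per block: add this block's contribution to the mean.
        if (lane == 0) atomicAdd(out, acc / (float)n);
    }
}

// Hypothetical host wrapper: copies inputs to the device, launches the
// kernel, and returns the scalar result. Error checks omitted for brevity.
float mse_host(const float* h_pred, const float* h_tgt, size_t n) {
    assert(n > 0);
    const size_t bytes = n * sizeof(float);
    float *d_pred = nullptr, *d_tgt = nullptr, *d_out = nullptr;
    cudaMalloc(&d_pred, bytes);
    cudaMalloc(&d_tgt, bytes);
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_pred, h_pred, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_tgt, h_tgt, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));  // kernel accumulates into *out

    const int threads = 256;
    const size_t want = (n + threads - 1) / threads;
    const int blocks = (int)(want < 1024 ? want : 1024);  // cap the grid; the loop covers the rest
    mse_kernel<<<blocks, threads>>>(d_pred, d_tgt, d_out, n);

    float mse = 0.0f;
    cudaMemcpy(&mse, d_out, sizeof(float), cudaMemcpyDeviceToHost);  // blocking copy, runs after the kernel

    cudaFree(d_pred);
    cudaFree(d_tgt);
    cudaFree(d_out);
    return mse;
}
```

As a usage check, calling mse_host on two host arrays where every prediction is 1.5 and every target is 1.0 should return a value very close to 0.25, since each squared difference is 0.25. Two caveats on this design: float atomicAdd makes the summation order nondeterministic across blocks, and plain fp32 accumulation can lose precision for very large n. If either matters, a second reduction pass over per-block partial sums or a wider accumulator would be a reasonable alternative.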
