Generate a CUDA kernel that computes an fp32 reverse cumulative sum over the last dimension of a 2-D input tensor x: out = flip(cumsum(flip(x, dim=-1), dim=-1), dim=-1), i.e. out[i, j] is the sum of x[i, k] over all k >= j. The output shape equals the input shape.
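
For checking a kernel's output, the flip/cumsum/flip formula can be read as a right-to-left running sum within each row. The following is a minimal CPU reference sketch in C++, not the requested CUDA kernel. It assumes contiguous row-major storage, and the helper name reverse_cumsum_ref is hypothetical; neither is specified above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for validating a GPU kernel (not the kernel itself).
// Assumption: x is stored contiguously in row-major order with shape [rows, cols].
// Returns out with the same shape, where
//   out[i, j] = x[i, j] + x[i, j+1] + ... + x[i, cols-1].
std::vector<float> reverse_cumsum_ref(const std::vector<float>& x,
                                      std::size_t rows, std::size_t cols) {
    assert(x.size() == rows * cols);  // shape must match the buffer length
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < rows; ++i) {
        float running = 0.0f;  // fp32 accumulator, matching the fp32 requirement
        // Walk the row from the last column to the first, so each output
        // element is the sum of itself and everything to its right.
        for (std::size_t j = cols; j-- > 0;) {
            running += x[i * cols + j];
            out[i * cols + j] = running;
        }
    }
    return out;
}
```

For example, the 2x3 input [[1, 2, 3], [4, 5, 6]] yields [[6, 5, 3], [15, 11, 6]]. A parallel scan on the GPU generally adds the elements in a different order than this sequential loop, so comparing kernel output against such a reference is better done with a small tolerance than with exact equality.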
