Generate a CUDA kernel for fp16 RMSNorm over the last dimension: out = x / sqrt(mean(x*x) + 1e-5).
