Generate a fused fp16 RMSNorm plus SiLU kernel over the last dimension: y = rmsnorm(x); out = y * sigmoid(y).
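
A fused kernel for this spec could be checked against a small pure-Python reference such as the sketch below. The request leaves several things open, so the sketch makes assumptions: there is no learnable scale weight, `eps` defaults to 1e-6, the sum of squares is accumulated in higher precision, and the intermediate `y` is not rounded to fp16 (in a fused kernel it would normally stay in registers). The names `to_fp16`, `rmsnorm_silu_row`, and `rmsnorm_silu` are hypothetical.

```python
import math
import struct


def to_fp16(v: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", v))[0]


def rmsnorm_silu_row(row, eps=1e-6):
    """One row: y = x / sqrt(mean(x^2) + eps); out = y * sigmoid(y)."""
    x = [to_fp16(v) for v in row]              # inputs are fp16 values
    mean_sq = sum(v * v for v in x) / len(x)   # assumed: higher-precision accumulation
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)   # eps is an assumed default
    out = []
    for v in x:
        y = v * inv_rms                        # normalized value, kept unrounded
        silu = y / (1.0 + math.exp(-y))        # same as y * sigmoid(y)
        out.append(to_fp16(silu))              # only the output is stored as fp16
    return out


def rmsnorm_silu(rows, eps=1e-6):
    """Apply the fused op independently over the last dimension of a 2-D list."""
    return [rmsnorm_silu_row(r, eps) for r in rows]
```

A kernel's fp16 output would be compared to this reference with a tolerance rather than bit-for-bit, since reduction order and the exp approximation used on the device can differ.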
