Generate a CUDA kernel for fp32 clamp: out = min(max(x, -1.0), 1.0).
