Generate a CUDA kernel for fp16 LeakyReLU activation on a single input tensor x: out = x if x >= 0 else 0.01 * x. Elementwise, output shape equals input shape.
