Metadata-Version: 2.4
Name: lfm-onnx-hf
Version: 0.1.1
Summary: Standalone LFM ONNX inference with first-run Hugging Face download and local cache
Author: Carlo Moro
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24
Requires-Dist: onnxruntime>=1.20
Requires-Dist: tokenizers>=0.20
Requires-Dist: huggingface_hub>=0.24

# LFM ONNX HF Library

## Install

```bash
pip install lfm-onnx-hf
```

## Python Usage

### Basic Prompt (sync, stream=False)

```python
from lfm_onnx_hf import LFMOnnxEngine, GenerationConfig

engine = LFMOnnxEngine()

text = engine.basic_prompt(
    "What is the capital of France?",
    stream=False,
    generation=GenerationConfig(
        max_new_tokens=64,
        temperature=0.0,
        top_k=50,
        repetition_penalty=1.05,
        seed=42,
    ),
)
print(text)
```

### Basic Prompt (sync, stream=True)

```python
from lfm_onnx_hf import LFMOnnxEngine

engine = LFMOnnxEngine()

for chunk in engine.basic_prompt("Write a one-line poem about the sea.", stream=True):
    print(chunk, end="", flush=True)
print()
```

### Basic Prompt + Assistant Prefill

```python
from lfm_onnx_hf import LFMOnnxEngine

engine = LFMOnnxEngine()

text = engine.basic_prompt(
    "Return strict JSON with fields city and country.",
    assistant_prefill="```json\n",
    stream=False,
)
print(text)
```

### Chat Input (sync, stream=False)

```python
from lfm_onnx_hf import LFMOnnxEngine

engine = LFMOnnxEngine()

turns = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "My name is Ana."},
    {"role": "assistant", "content": "Nice to meet you, Ana."},
    {"role": "user", "content": "What is my name?"},
]

text = engine.chat_input(turns, stream=False)
print(text)
```
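
Because `chat_input` receives the full list of turns on every call, continuing a conversation means appending each reply before asking the next question. The following is a minimal sketch of that bookkeeping. It assumes `chat_input(turns, stream=False)` returns the reply as a plain string, as the example above suggests; `ask` is a hypothetical helper and not part of the library.

```python
from typing import Any


def ask(engine: Any, history: list[dict[str, str]], user_text: str) -> str:
    """Send one user turn and record both sides of the exchange in `history`.

    Hypothetical helper: assumes engine.chat_input(turns, stream=False)
    returns the assistant reply as a string.
    """
    history.append({"role": "user", "content": user_text})
    reply = engine.chat_input(history, stream=False)
    # Store the reply so the next call sees the whole conversation.
    history.append({"role": "assistant", "content": reply})
    return reply
```

Start with `history = [{"role": "system", "content": "Be concise."}]`, then call `ask(engine, history, "My name is Ana.")` followed by `ask(engine, history, "What is my name?")`.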

### Chat Input (sync, stream=True)

```python
from lfm_onnx_hf import LFMOnnxEngine

engine = LFMOnnxEngine()

turns = [{"role": "user", "content": "Give me 3 short productivity tips."}]

for chunk in engine.chat_input(turns, stream=True):
    print(chunk, end="", flush=True)
print()
```

### Full Generate API (sync)

```python
from lfm_onnx_hf import LFMOnnxEngine

engine = LFMOnnxEngine()

messages = [{"role": "user", "content": "Explain recursion in 2 sentences."}]
text, stats = engine.generate(
    messages=messages,
    max_new_tokens=80,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    seed=7,
    assistant_prefill="",
)

print(text)
print(stats)
```

### Async Usage (stream=False and stream=True)

```python
import asyncio
from lfm_onnx_hf import LFMOnnxEngine


async def main():
    engine = LFMOnnxEngine()

    # async non-stream
    text = await engine.basic_prompt_async(
        "One word for water in French?",
        stream=False,
    )
    print(text)

    # async stream
    stream_iter = await engine.chat_input_async(
        [{"role": "user", "content": "List 5 planets."}],
        stream=True,
    )
    async for chunk in stream_iter:
        print(chunk, end="", flush=True)
    print()


asyncio.run(main())
```

## Hugging Face Source

By default, the first run downloads the model from the following Hugging Face location and caches it locally:

- Repo: `cnmoro/LFM-Q4-GGUFS`
- Subfolder: `2_5_350m`
- Model: `model_q4.slim.spec.strip.min.onnx`
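
For reference, the same file can be fetched by hand with `huggingface_hub`. This is only a sketch of an equivalent manual download, not necessarily what the library does internally, and the library may keep its own cache location. `DEFAULT_SOURCE` and `fetch_default_model` are illustrative names.

```python
from typing import Optional

# Default location listed above; the names in this block are illustrative only.
DEFAULT_SOURCE = {
    "repo_id": "cnmoro/LFM-Q4-GGUFS",
    "subfolder": "2_5_350m",
    "filename": "model_q4.slim.spec.strip.min.onnx",
}


def fetch_default_model(cache_dir: Optional[str] = None) -> str:
    """Download the default model file (or reuse a cached copy) and return its local path."""
    # Imported lazily so the constants above are usable without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(cache_dir=cache_dir, **DEFAULT_SOURCE)
```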

## CLI Usage

### Basic

```bash
lfm-onnx-hf \
  --prompt "What is the capital of France?" \
  --max-new-tokens 64 \
  --temperature 0.0
```

### Stream Output

```bash
lfm-onnx-hf \
  --prompt "Write a short haiku about rain" \
  --stream
```

### Assistant Prefill

```bash
lfm-onnx-hf \
  --prompt "Return strict JSON with fields city and country" \
  --assistant-prefill $'```json\n'
```

### Multi-turn Messages

```bash
lfm-onnx-hf \
  --messages-json '[{"role":"system","content":"Be concise."},{"role":"user","content":"Summarize photosynthesis in one paragraph."}]'
```
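
Hand-writing JSON inside shell quotes is error-prone once the content contains quotes or apostrophes. A small sketch that builds the same command line from Python with only the standard library; `build_messages_command` is a hypothetical helper, and it relies only on the `--messages-json` flag shown above.

```python
import json
import shlex


def build_messages_command(messages: list[dict[str, str]]) -> str:
    """Return a shell-safe lfm-onnx-hf command line for the given chat turns."""
    payload = json.dumps(messages, ensure_ascii=False)
    # shlex.join quotes each argument so the JSON reaches the CLI unchanged.
    return shlex.join(["lfm-onnx-hf", "--messages-json", payload])
```

The returned string can be pasted into a POSIX shell. When launching the CLI from Python instead, passing the argument list straight to `subprocess.run` avoids shell quoting altogether.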

### HF Options

```bash
lfm-onnx-hf \
  --prompt "What is the capital of France?" \
  --repo-id cnmoro/LFM-Q4-GGUFS \
  --subfolder 2_5_350m \
  --model model_q4.slim.spec.strip.min.onnx \
  --download-max-retries 8 \
  --download-initial-backoff 1.5
```
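
The two download flags suggest a retry loop with a growing delay between attempts. The exact policy is not documented here, so the following is only a sketch of one common scheme (exponential backoff that doubles the delay after each failure). `retry_with_backoff` is a hypothetical helper, and the assumption that the retry count excludes the first attempt is mine.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 8,
    initial_backoff: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait and retry, doubling the delay each time.

    Assumption: max_retries counts retries after the first attempt.
    """
    delay = initial_backoff
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error
            sleep(delay)
            delay *= 2
            attempt += 1
```

With the values from the command above, the waits would be 1.5 s, 3 s, 6 s, and so on, up to eight retries. The `sleep` parameter exists only so the delay can be replaced in tests.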

### Common CLI Options

- `--repo-id`
- `--subfolder`
- `--model`
- `--revision`
- `--token`
- `--cache-root`
- `--prompt`
- `--system`
- `--messages-json`
- `--max-new-tokens`
- `--temperature`
- `--top-k`
- `--repetition-penalty`
- `--seed`
- `--stream`
- `--assistant-prefill`
- `--benchmark-runs`
- `--provider`
- `--intra-op-threads`
- `--inter-op-threads`
- `--download-max-retries`
- `--download-initial-backoff`
