Metadata-Version: 2.4
Name: llm-host
Version: 0.2.4
Summary: OpenAI-compatible inference server: LLM (vLLM) + Whisper + Kokoro TTS, optionally exposed via ngrok
License-Expression: MIT
Project-URL: Homepage, https://github.com/saurabhkpxt/colab_ngrok_1
Project-URL: Repository, https://github.com/saurabhkpxt/colab_ngrok_1
Project-URL: Issues, https://github.com/saurabhkpxt/colab_ngrok_1/issues
Keywords: llm,inference,whisper,tts,openai,vllm,ngrok,colab,kokoro
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Internet :: WWW/HTTP :: HTTP Servers
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: flask>=3.0
Requires-Dist: pyngrok>=7.0
Requires-Dist: faster-whisper>=1.0
Requires-Dist: kokoro>=0.9
Requires-Dist: soundfile
Requires-Dist: numpy
Requires-Dist: requests
Provides-Extra: vllm
Requires-Dist: vllm>=0.6.0; extra == "vllm"

# llm-host

OpenAI-compatible inference server that runs an **LLM** (via vLLM), **Whisper** transcription/translation, and **Kokoro TTS** on a GPU and exposes them all at a single URL — optionally via ngrok for a public endpoint.

Default model: **Qwen/Qwen3.5-2B** + **whisper-small** — tuned to run on a T4 GPU (15 GB VRAM). Swap to larger models via `--model` / `--whisper-model` or env vars.

Designed for Google Colab (T4 / L4 / A100) but works on any GPU machine with CUDA.

---

## Install

```bash
pip install llm-host

# vLLM is not a default dependency (GPU/CUDA-specific build); install it separately
# (equivalently, via the package extra: pip install "llm-host[vllm]")
pip install "vllm>=0.6.0"

# Kokoro TTS requires espeak-ng for phonemization
apt-get install -y espeak-ng   # Debian/Ubuntu/Colab
```

## Quickstart

**With ngrok (public URL):**
```bash
llm-host \
  --ngrok-token YOUR_NGROK_TOKEN \
  --hf-token    YOUR_HF_TOKEN
```

**Without ngrok (localhost / LAN only):**
```bash
llm-host --hf-token YOUR_HF_TOKEN
# accessible at http://localhost:5001  and  http://<server-ip>:5001
```

Or with environment variables:

```bash
NGROK_TOKEN=xxx HF_TOKEN=xxx llm-host
```

Without `--ngrok-token` the server binds to `0.0.0.0` and prints both the
localhost and network IP URLs.  Pass `--ngrok-token` to get a public ngrok URL.
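
Model loading can take a while after launch, so a client may want to wait for the server before sending requests. The following is a minimal sketch, assuming `GET /health` answers with HTTP 200 once the service is up (the README only says it reports service status). `http_ok` and `wait_until_healthy` are hypothetical helper names, not part of `llm-host`.

```python
import time
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # URLError/HTTPError and connection failures are all OSError subclasses.
        return False


def wait_until_healthy(base_url, timeout_s=600.0, interval_s=5.0, probe=http_ok):
    """Poll <base_url>/health until `probe` succeeds or the timeout expires."""
    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example (needs a running server):
# wait_until_healthy("http://localhost:5001")
```

The `probe` parameter exists only so the loop can be exercised without a live server.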

---

## Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET`  | `/` | Dashboard UI |
| `GET`  | `/health` | Service status |
| `GET`  | `/v1/models` | List models |
| `POST` | `/v1/chat/completions` | LLM chat (streaming supported) |
| `POST` | `/v1/audio/transcriptions` | Whisper STT (keep source language) |
| `POST` | `/v1/audio/translations` | Whisper STT → English |
| `POST` | `/v1/audio/speech` | Kokoro TTS |

All API endpoints are **OpenAI-compatible**: pass the server URL plus `/v1` (the ngrok URL, or `http://localhost:5001/v1` when running without ngrok) as `base_url` to any OpenAI SDK.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://<ngrok-id>.ngrok-free.app/v1",
    api_key="dummy",
)

# Chat  (model name = --served-model-name, default is last part of --model)
# Default: "Qwen3.5-2B" — check active name via GET /v1/models
resp = client.chat.completions.create(
    model="Qwen3.5-2B",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Transcription
with open("audio.wav", "rb") as f:
    text = client.audio.transcriptions.create(model="whisper-1", file=f)

# TTS (stream the audio straight to disk)
with client.audio.speech.with_streaming_response.create(
    model="tts-1", input="Hello!", voice="nova"
) as speech:
    speech.stream_to_file("out.mp3")
```
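
To find the model name to pass in `model=`, the active names can be read from `GET /v1/models`. This is a rough sketch using only the standard library, assuming the gateway returns the usual OpenAI list shape (`{"object": "list", "data": [{"id": ...}]}`). `model_ids` and `fetch_model_ids` are illustrative names, not part of the package.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style `GET /v1/models` response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str) -> list:
    """GET <base_url>/models and return the ids it lists."""
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return model_ids(json.load(resp))


# Example (needs a running server; base_url ends in /v1 as in the snippet above):
# print(fetch_model_ids("http://localhost:5001/v1"))
```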

---

## Model configuration

### LLM (reasoning / chat)

Set via `--model` or `MODEL=` env var. Default is **Qwen/Qwen3.5-2B** — runs on T4, no HuggingFace token required.

```bash
# Default — T4-friendly, no HF token needed
llm-host

# Larger Qwen3 variants (A100 recommended)
MODEL=Qwen/Qwen3-8B  llm-host
MODEL=Qwen/Qwen3-14B llm-host
MODEL=Qwen/Qwen3-32B llm-host

# Llama 3.1 (gated — requires HF token + accepted licence)
MODEL=meta-llama/Llama-3.1-8B-Instruct HF_TOKEN=hf_xxx llm-host

# AWQ-quantized Llama (lower LLM VRAM; pairing it with large-v3 Whisper still needs an A100)
MODEL=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 QUANTIZATION=awq llm-host
```

The model is served under its last path component by default (e.g. `Qwen3.5-2B`).
Override with `--served-model-name` / `SERVED_MODEL_NAME=`.
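
The naming rule can be sketched as follows. This illustrates the behaviour described above and is not the package's actual code; `default_served_name` and `served_name` are hypothetical helpers.

```python
from typing import Optional


def default_served_name(model_id: str) -> str:
    """Last path component of a HuggingFace model ID."""
    return model_id.rstrip("/").rsplit("/", 1)[-1]


def served_name(model_id: str, override: Optional[str] = None) -> str:
    """Use the explicit override when given, else fall back to the default rule."""
    return override or default_served_name(model_id)


print(served_name("Qwen/Qwen3.5-2B"))
print(served_name("meta-llama/Llama-3.1-8B-Instruct", override="llama"))
```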

### Whisper (speech-to-text)

Set via `--whisper-model` or `WHISPER_MODEL=` env var. Default: `small`.

| Model | VRAM | Speed | Accuracy |
|-------|------|-------|----------|
| `tiny` | ~1 GB | fastest | lowest |
| `base` | ~1 GB | fast | low |
| `small` | ~2 GB | fast | good |
| `medium` | ~5 GB | moderate | better |
| `large-v2` | ~10 GB | slow | high |
| `large-v3` | ~10 GB | slow | **highest** |
| `large-v3-turbo` | ~6 GB | fast | high |

```bash
WHISPER_MODEL=small          llm-host   # default, T4-friendly (~2 GB VRAM)
WHISPER_MODEL=medium         llm-host   # better accuracy, ~5 GB VRAM
WHISPER_MODEL=large-v3       llm-host   # highest accuracy, ~10 GB VRAM (A100)
WHISPER_MODEL=large-v3-turbo llm-host   # good balance on A100, ~6 GB VRAM
```
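
As a rough aid for choosing a size, the table above can be turned into a small lookup that picks the most accurate model fitting a VRAM budget. The VRAM figures are the approximate ones from the table. The numeric accuracy ranks and the `pick_whisper_model` helper are assumptions made for this sketch, not something `llm-host` provides.

```python
from typing import Optional

# (name, approx VRAM in GB from the table, accuracy rank; higher is better)
WHISPER_MODELS = [
    ("tiny", 1, 0),
    ("base", 1, 1),
    ("small", 2, 2),
    ("medium", 5, 3),
    ("large-v3-turbo", 6, 4),
    ("large-v2", 10, 4),
    ("large-v3", 10, 5),
]


def pick_whisper_model(free_vram_gb: float) -> Optional[str]:
    """Return the most accurate model whose VRAM estimate fits the budget."""
    fitting = [m for m in WHISPER_MODELS if m[1] <= free_vram_gb]
    if not fitting:
        return None
    # Highest accuracy first; among equals prefer the smaller footprint.
    name, _, _ = max(fitting, key=lambda m: (m[2], -m[1]))
    return name


print(pick_whisper_model(3.75))
```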

---

## CLI options

```
llm-host --help

  --ngrok-token              ngrok authtoken (optional; omit for localhost/LAN only)
  --hf-token                 HuggingFace token (needed only for gated models)
  --model                    HuggingFace model ID (default: Qwen/Qwen3.5-2B)
  --served-model-name        name used in API calls (default: last part of --model)
  --quantization             awq | bitsandbytes | none  (default: none)
  --whisper-model            tiny | base | small | medium | large-v1 | large-v2 | large-v3 | large-v3-turbo
                             (default: small)
  --tts-voice                alloy | echo | fable | onyx | nova | shimmer  (default: alloy)
  --vllm-port                internal vLLM port (default: 8000)
  --gateway-port             public gateway port (default: 5001)
  --gpu-memory-utilization   vLLM GPU memory fraction (default: 0.75).
                             vLLM pre-allocates this share of VRAM for model + KV cache +
                             CUDA graphs (vLLM ≥0.21 includes CUDA graph memory here).
                             0.75 works on T4 for 2B models; use ~0.85 for 8B+ on A100.
  --max-model-len            context length (default: 8192)
  --no-vllm                  skip starting vLLM (use existing instance)
```
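
The effect of `--gpu-memory-utilization` is simple arithmetic: vLLM reserves that fraction of total VRAM, and whatever remains is shared by Whisper, Kokoro, and other CUDA overhead. The sketch below is a back-of-the-envelope estimate, not a measurement, and `vram_split` is a hypothetical helper.

```python
def vram_split(total_gb: float, gpu_memory_utilization: float) -> tuple:
    """Return (GB reserved by vLLM, GB left for everything else)."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    vllm_gb = total_gb * gpu_memory_utilization
    return vllm_gb, total_gb - vllm_gb


# T4 with the 15 GB figure used in this README and the default 0.75:
vllm_gb, rest_gb = vram_split(15, 0.75)
print(f"vLLM: {vllm_gb} GB, remaining: {rest_gb} GB")
```

With the default on a T4 this leaves 3.75 GB outside vLLM, which is why the small Whisper model fits alongside a 2B LLM while `large-v3` does not.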

All flags can also be set via `UPPER_SNAKE_CASE` environment variables:

```bash
MODEL=Qwen/Qwen3-14B \
WHISPER_MODEL=large-v3-turbo \
NGROK_TOKEN=xxx \
llm-host
```
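
The flag-to-variable naming can be sketched as below, assuming every flag follows the `UPPER_SNAKE_CASE` rule exactly as the examples in this README do (`--whisper-model` becomes `WHISPER_MODEL`). How a boolean flag such as `--no-vllm` maps is not documented here, so it is left out. `flag_to_env` and `env_for` are illustrative names.

```python
def flag_to_env(flag: str) -> str:
    """Convert a CLI flag such as --gpu-memory-utilization to its env var name."""
    return flag.lstrip("-").replace("-", "_").upper()


def env_for(flags: dict) -> dict:
    """Build an environment mapping from {flag: value} pairs."""
    return {flag_to_env(flag): str(value) for flag, value in flags.items()}


print(env_for({"--model": "Qwen/Qwen3-14B", "--gpu-memory-utilization": 0.85}))
```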

---

## TTS voices

| Voice | Character | Kokoro name |
|-------|-----------|-------------|
| `alloy` | Neutral female | af_heart |
| `echo` | Male | am_echo |
| `fable` | British female | bf_emma |
| `onyx` | Deep male | am_adam |
| `nova` | Energetic female | af_nova |
| `shimmer` | Soft female | af_bella |

Raw Kokoro voice names (e.g. `af_sky`) are also accepted directly.
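
The alias behaviour can be pictured as a dictionary lookup with a pass-through for unknown names. This is a sketch built from the table above, not the package's implementation, and `resolve_voice` is a hypothetical helper.

```python
from typing import Optional

# OpenAI-style voice name -> Kokoro voice name (taken from the table above)
VOICE_ALIASES = {
    "alloy": "af_heart",
    "echo": "am_echo",
    "fable": "bf_emma",
    "onyx": "am_adam",
    "nova": "af_nova",
    "shimmer": "af_bella",
}


def resolve_voice(voice: Optional[str], default: str = "alloy") -> str:
    """Map an OpenAI voice alias to a Kokoro name; pass raw Kokoro names through."""
    name = voice or default
    return VOICE_ALIASES.get(name, name)


print(resolve_voice("nova"))
print(resolve_voice("af_sky"))
```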

---

## License

MIT

---

## Contact

For issues, questions, or feedback, reach out at **saurabh.kpxt@gmail.com**.
