Metadata-Version: 2.4
Name: llm-host
Version: 0.1.0
Summary: OpenAI-compatible inference server: Llama 3.1 8B + Whisper + Kokoro TTS exposed via ngrok
License-Expression: MIT
Project-URL: Homepage, https://github.com/saurabhkpxt/colab_ngrok_1
Project-URL: Repository, https://github.com/saurabhkpxt/colab_ngrok_1
Project-URL: Issues, https://github.com/saurabhkpxt/colab_ngrok_1/issues
Keywords: llm,inference,whisper,tts,openai,vllm,ngrok,colab,kokoro
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Internet :: WWW/HTTP :: HTTP Servers
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: flask>=3.0
Requires-Dist: pyngrok>=7.0
Requires-Dist: faster-whisper>=1.0
Requires-Dist: kokoro>=0.9
Requires-Dist: soundfile
Requires-Dist: numpy
Requires-Dist: requests
Provides-Extra: vllm
Requires-Dist: vllm>=0.6.0; extra == "vllm"

# llm-host

OpenAI-compatible inference server that runs **Llama 3.1 8B** (via vLLM), **Whisper large-v3** transcription, and **Kokoro TTS** on a GPU and exposes them all at a single public URL via ngrok.

Designed for Google Colab (T4 / L4 / A100) but works on any GPU machine with CUDA.

---

## Install

```bash
pip install llm-host

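# vLLM is not installed by default (GPU/CUDA-specific build); install it separately
pip install "vllm>=0.6.0"
# or pull it in through the package's optional extra
pip install "llm-host[vllm]"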

# Kokoro TTS requires espeak-ng for phonemization
apt-get install -y espeak-ng   # Debian/Ubuntu/Colab
```

## Quickstart

**With ngrok (public URL):**
```bash
llm-host \
  --ngrok-token YOUR_NGROK_TOKEN \
  --hf-token    YOUR_HF_TOKEN
```

**Without ngrok (localhost / LAN only):**
```bash
llm-host --hf-token YOUR_HF_TOKEN
# accessible at http://localhost:5001  and  http://<server-ip>:5001
```

Or with environment variables:

```bash
NGROK_TOKEN=xxx HF_TOKEN=xxx llm-host
```

Without `--ngrok-token`, the server binds to `0.0.0.0` and prints both the
localhost and network IP URLs. Pass `--ngrok-token` to get a public ngrok URL as well.
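
Model loading may take a while after launch, so a client script might want to wait before sending requests. The following is a minimal sketch, not part of llm-host: it polls `GET /health` and treats an HTTP 200 as "ready". That interpretation is an assumption, since the response body of `/health` is not documented here; `health_url` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Build the /health URL from a gateway base URL (hypothetical helper)."""
    return base_url.rstrip("/") + "/health"


def wait_until_healthy(base_url: str, timeout_s: float = 600.0,
                       interval_s: float = 5.0) -> bool:
    """Poll GET /health until it answers HTTP 200 or the timeout expires.

    Assumption: a 200 response means the gateway is up. Only the status
    code is checked because the response body format is not documented.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url(base_url), timeout=10) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a pause
        time.sleep(interval_s)
    return False
```

For a local run this could be called as `wait_until_healthy("http://localhost:5001")`, using the default gateway port.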

---

## Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET`  | `/` | Dashboard UI |
| `GET`  | `/health` | Service status |
| `GET`  | `/v1/models` | List models |
| `POST` | `/v1/chat/completions` | Llama 3.1 8B (streaming supported) |
| `POST` | `/v1/audio/transcriptions` | Whisper large-v3 |
| `POST` | `/v1/audio/speech` | Kokoro TTS |

All API endpoints are **OpenAI-compatible**: pass the ngrok URL (with the `/v1` suffix) as `base_url` to any OpenAI SDK.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://<ngrok-id>.ngrok-free.app/v1",
    api_key="dummy",  # placeholder; the OpenAI SDK requires a non-empty key
)

# Chat
resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

# Transcription (the SDK returns an object; the text is in `.text`)
with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
print(transcript.text)

# TTS (streams the audio response to disk)
with client.audio.speech.with_streaming_response.create(
    model="tts-1", input="Hello!", voice="nova"
) as speech:
    speech.stream_to_file("out.mp3")
```
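
The endpoint table notes that `/v1/chat/completions` supports streaming. As a rough illustration without the OpenAI SDK, the sketch below sends `"stream": true` and reads the reply as server-sent events. It assumes the gateway emits the usual OpenAI streaming format (`data: {...}` lines ending with `data: [DONE]`); `iter_delta_text` and `stream_chat` are hypothetical helper names, not part of llm-host.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_delta_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


def stream_chat(base_url: str, prompt: str,
                model: str = "llama-3.1-8b-instruct") -> Iterator[str]:
    """POST a streaming chat request and yield text as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer dummy"},  # placeholder key
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # Iterating over the response yields the body line by line.
        yield from iter_delta_text(resp)
```

A caller could print tokens as they arrive with `for piece in stream_chat("https://<ngrok-id>.ngrok-free.app/v1", "Hello!"): print(piece, end="", flush=True)`.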

---

## CLI options

```
llm-host --help

  --ngrok-token              ngrok authtoken (optional; omit for localhost/LAN only)
  --hf-token                 HuggingFace token (for gated models)
  --model                    HuggingFace model ID (default: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4)
  --quantization             awq | bitsandbytes | none  (default: awq)
  --whisper-model            tiny | base | small | medium | large-v3  (default: large-v3)
  --tts-voice                alloy | echo | fable | onyx | nova | shimmer  (default: alloy)
  --vllm-port                internal vLLM port (default: 8000)
  --gateway-port             public gateway port (default: 5001)
  --gpu-memory-utilization   vLLM GPU memory fraction (default: 0.82)
  --max-model-len            context length (default: 8192)
  --no-vllm                  skip starting vLLM (use existing instance)
```

All flags can also be set via `UPPER_SNAKE_CASE` environment variables; for example, `--hf-token` becomes `HF_TOKEN`.
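
To make the naming rule concrete, here is a small sketch of how a flag name could map to its environment variable. It is an illustration under assumptions, not llm-host's actual parsing code: `flag_to_env` and `resolve_option` are hypothetical names, and the precedence shown (command-line value first, then environment, then default) is assumed rather than documented.

```python
import os
from typing import Mapping, Optional


def flag_to_env(flag: str) -> str:
    """Convert a CLI flag such as --ngrok-token to NGROK_TOKEN."""
    return flag.lstrip("-").replace("-", "_").upper()


def resolve_option(flag: str, cli_value: Optional[str] = None,
                   default: Optional[str] = None,
                   environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Pick a value for a flag.

    Assumed precedence: explicit CLI value, then environment, then default.
    """
    if cli_value is not None:
        return cli_value
    env = os.environ if environ is None else environ
    return env.get(flag_to_env(flag), default)
```

Under this rule, `--gpu-memory-utilization` would correspond to `GPU_MEMORY_UTILIZATION`.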

---

## TTS voices

| Voice | Character | Kokoro name |
|-------|-----------|-------------|
| `alloy` | Neutral female | af_heart |
| `echo` | Male | am_echo |
| `fable` | British female | bf_emma |
| `onyx` | Deep male | am_adam |
| `nova` | Energetic female | af_nova |
| `shimmer` | Soft female | af_bella |

Raw Kokoro voice names (e.g. `af_sky`) are also accepted directly.
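
The lookup described above can be pictured as a dictionary with a pass-through fallback. The mapping below copies the table; `resolve_voice` is a hypothetical helper name, and the sketch does not check that a raw name is a real Kokoro voice, which the actual server may or may not do.

```python
# OpenAI-style voice alias -> Kokoro voice name, taken from the table above.
OPENAI_TO_KOKORO = {
    "alloy": "af_heart",
    "echo": "am_echo",
    "fable": "bf_emma",
    "onyx": "am_adam",
    "nova": "af_nova",
    "shimmer": "af_bella",
}


def resolve_voice(name: str) -> str:
    """Map an alias to its Kokoro name; pass any other name through unchanged."""
    return OPENAI_TO_KOKORO.get(name, name)
```

With this sketch, `resolve_voice("nova")` gives `af_nova`, while `resolve_voice("af_sky")` is returned as-is.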

---

## License

MIT
