⚡ Inference Server

Base URL (all request paths below are relative to it)
LLM
Llama 3.1 8B · AWQ-INT4 · vLLM
Transcription
Whisper large-v3 · lazy-loaded
Text-to-Speech
Kokoro 82M · lazy-loaded
Endpoints
GET /health Service status
curl
curl /health
Response
{"ok": true, "vllm": true, "whisper": true, "tts": true}
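A client can use the /health payload to decide which endpoints are safe to call. The following is a minimal standard-library sketch, not part of the server: the helper names `fetch_health` and `unavailable_services` and the `http://localhost:8000` address are assumptions made for illustration, while the field names come from the sample response above.

```python
import json
import urllib.request

# Assumption: replace with the server's actual base URL.
BASE_URL = "http://localhost:8000"


def fetch_health(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET /health and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def unavailable_services(health: dict) -> list[str]:
    """Return the backends whose flag is missing or false."""
    return [name for name in ("vllm", "whisper", "tts") if not health.get(name)]


# Offline demonstration using the sample response shown above.
sample = json.loads('{"ok": true, "vllm": true, "whisper": true, "tts": true}')
print(unavailable_services(sample))  # → []
```

Against a running server, `unavailable_services(fetch_health())` would give the same kind of list for the live status.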
GET /v1/models List models
curl
curl /v1/models
POST /v1/chat/completions Llama 3.1 8B · streaming
Parameter    Type          Required  Description
model        string        required  Must be llama-3.1-8b-instruct
messages     array         required  Array of {role, content} objects
max_tokens   int           optional  Max tokens to generate
temperature  float         optional  0 = deterministic, default 1.0
top_p        float         optional  Nucleus sampling threshold
stream       bool          optional  Stream tokens via SSE
stop         string/array  optional  Stop sequence(s)
curl — non-streaming
curl /v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user",   "content": "What is the capital of France?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
curl — streaming
curl /v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Tell me a short joke"}],
    "max_tokens": 128,
    "stream": true
  }'
Python — openai SDK
from openai import OpenAI
# base_url must be absolute: prefix /v1 with the server's base URL.
client = OpenAI(base_url="/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
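With "stream": true the reply arrives as server-sent events rather than a single JSON body. The sketch below shows one way to reassemble the text, assuming the server emits the OpenAI-style chunk format that vLLM produces: `data: {json}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`. The function name `iter_content_deltas` and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style SSE chunk lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                yield piece


# Hypothetical stream, shaped like the assumed chunk format.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_content_deltas(sample_lines)))  # → Hello
```

When using the openai SDK instead, passing stream=True to client.chat.completions.create returns an iterator of chunk objects, and the same fragment is available as chunk.choices[0].delta.content.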
POST /v1/audio/transcriptions Whisper large-v3
Parameter                  Type    Required  Description
file                       file    required  Audio file: wav, mp3, m4a, ogg, flac, webm
model                      string  required  Any value accepted, e.g. whisper-1
response_format            string  optional  json · text · verbose_json · srt · vtt
language                   string  optional  ISO-639-1 code (e.g. en, hi). Auto-detect if omitted.
timestamp_granularities[]  string  optional  "word" for word-level timestamps (use with verbose_json)
curl — basic
curl /v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-1"
curl — verbose JSON with word timestamps
curl /v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper-1" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"
curl — SRT subtitles
curl /v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1" \
  -F "response_format=srt"
Python — openai SDK
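from openai import OpenAI
# base_url must be absolute: prefix /v1 with the server's base URL.
client = OpenAI(base_url="/v1", api_key="dummy")

# Open in binary mode; the SDK uploads the file as multipart form data.
with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )
print(transcript.text)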
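When verbose_json is combined with word-level timestamps, the response can be post-processed into a per-word timing list. This is a sketch under the assumption that the server mirrors the OpenAI response shape, with a top-level `words` array of `{word, start, end}` objects (times in seconds); the helper name `format_word_timings` and the sample payload are hypothetical.

```python
def format_word_timings(transcript: dict) -> list[str]:
    """Render each word as 'start-end word', with times in seconds."""
    lines = []
    for w in transcript.get("words", []):
        lines.append(f"{w['start']:.2f}-{w['end']:.2f} {w['word']}")
    return lines


# Hypothetical verbose_json payload, trimmed to the fields used here.
sample = {
    "text": "Hello world",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.42},
        {"word": "world", "start": 0.5, "end": 0.9},
    ],
}
for line in format_word_timings(sample):
    print(line)
```

A payload without a `words` array (for example, plain json output) simply yields an empty list.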
POST /v1/audio/speech Kokoro TTS · 6 voices
Parameter        Type    Required  Description
input            string  required  Text to synthesize (max ~500 words)
voice            string  optional  OpenAI alias or raw Kokoro voice name (default: alloy)
response_format  string  optional  mp3 (default) · wav · opus · aac · flac · pcm
speed            float   optional  0.5 – 2.0, default 1.0
model            string  optional  Any value accepted, e.g. tts-1
VOICES
Alias    Description       Kokoro voice
alloy    Neutral female    af_heart
echo     Male              am_echo
fable    British female    bf_emma
onyx     Deep male         am_adam
nova     Energetic female  af_nova
shimmer  Soft female       af_bella
curl — basic mp3
curl /v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Hello, welcome to the debate.","voice":"nova"}' \
  --output speech.mp3
curl — WAV with speed control
curl /v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "This is slightly faster speech.",
    "voice": "onyx",
    "speed": 1.25,
    "response_format": "wav"
  }' \
  --output speech.wav
Python — openai SDK
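from openai import OpenAI
# base_url must be absolute: prefix /v1 with the server's base URL.
client = OpenAI(base_url="/v1", api_key="dummy")

# Stream the audio to disk as it arrives instead of buffering it in memory.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    input="Hello from the debate stage.",
    voice="nova",
) as response:
    response.stream_to_file("output.mp3")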
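The documented limits for /v1/audio/speech can be checked on the client before a request is sent. The sketch below is one possible approach, not the server's own validation: `build_speech_request` and `resolve_voice` are hypothetical helpers, the alias table and format list are copied from the parameter and voice tables above, and the empty-input check is an added assumption.

```python
# Alias to Kokoro voice mapping, copied from the VOICES table.
VOICE_ALIASES = {
    "alloy": "af_heart", "echo": "am_echo", "fable": "bf_emma",
    "onyx": "am_adam", "nova": "af_nova", "shimmer": "af_bella",
}
FORMATS = ("mp3", "wav", "opus", "aac", "flac", "pcm")


def resolve_voice(voice: str) -> str:
    """Map an OpenAI alias to its Kokoro name; pass raw names through."""
    return VOICE_ALIASES.get(voice, voice)


def build_speech_request(text: str, voice: str = "alloy", speed: float = 1.0,
                         response_format: str = "mp3",
                         model: str = "tts-1") -> dict:
    """Validate against the documented ranges and return the JSON body."""
    if not text.strip():
        raise ValueError("input must not be empty")  # client-side assumption
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if response_format not in FORMATS:
        raise ValueError(f"response_format must be one of {FORMATS}")
    return {"model": model, "input": text, "voice": voice,
            "speed": speed, "response_format": response_format}


body = build_speech_request("This is slightly faster speech.",
                            voice="onyx", speed=1.25, response_format="wav")
print(body["voice"], "->", resolve_voice(body["voice"]))  # → onyx -> am_adam
```

The voice is sent unchanged because the endpoint accepts both aliases and raw Kokoro names; `resolve_voice` only shows which Kokoro voice an alias corresponds to.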