Metadata-Version: 2.4
Name: endpoint-vps
Version: 0.1.0
Summary: Free Open-Source LLM VPS — Ultra-Fast Inference on Kaggle
Project-URL: Homepage, https://github.com/myth-tools/endpoint
Project-URL: Documentation, https://github.com/myth-tools/endpoint#readme
Project-URL: Source, https://github.com/myth-tools/endpoint
Project-URL: Issues, https://github.com/myth-tools/endpoint/issues
Project-URL: Changelog, https://github.com/myth-tools/endpoint/releases
Project-URL: Funding, https://github.com/sponsors/myth-tools
Author: Myth Org
Author-email: Shesher Hasan <contact@shesher.work.gd>
Maintainer: Myth Org
Maintainer-email: Shesher Hasan <contact@shesher.work.gd>
License: GPL-3.0-or-later
License-File: LICENSE
Keywords: cloud,endpoint,inference,kaggle,llama.cpp,llm,openai,vps
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Framework :: FastAPI
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Requires-Dist: kaggle>=2.2.0
Requires-Dist: pyyaml>=6.0.3
Requires-Dist: requests>=2.34.2
Requires-Dist: rich>=15.0.0
Provides-Extra: dev
Requires-Dist: bandit>=1.9.4; extra == 'dev'
Requires-Dist: mypy>=2.1.0; extra == 'dev'
Requires-Dist: pytest>=9.0.3; extra == 'dev'
Requires-Dist: ruff>=0.15.15; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0.12.20260518; extra == 'dev'
Requires-Dist: types-requests>=2.33.0.20260518; extra == 'dev'
Provides-Extra: engine
Requires-Dist: fastapi>=0.136.3; extra == 'engine'
Requires-Dist: huggingface-hub>=1.17.0; extra == 'engine'
Requires-Dist: pydantic>=2.13.4; extra == 'engine'
Requires-Dist: uvicorn[standard]>=0.48.0; extra == 'engine'
Description-Content-Type: text/markdown

<div align="center">

# ⚡ Endpoint VPS

**Turn Kaggle's Free Tier into a Production-Grade LLM Inference Server**

[![Version](https://img.shields.io/badge/version-0.1.0-blue?style=flat-square)]()
[![License](https://img.shields.io/badge/License-GPLv3-green?style=flat-square)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.11+-brightgreen?style=flat-square)](.python-version)
[![Platform](https://img.shields.io/badge/platform-Linux%20%7C%20macOS-lightgrey?style=flat-square)]()
[![PyPI](https://img.shields.io/badge/pypi-endpoint-006dad?style=flat-square)](https://pypi.org/project/endpoint/)
[![Code Style](https://img.shields.io/badge/code%20style-ruff-000000?style=flat-square)]()
[![Type Checked](https://img.shields.io/badge/type%20checked-mypy-039dfc?style=flat-square)]()

**Deploy an OpenAI-compatible LLM inference server on Kaggle with a single command — zero cost, zero cloud bills.**

[Quick Start](#quick-start) •
[Commands](#command-reference) •
[Configuration](#configuration) •
[API Reference](#api-reference) •
[Installation](#installation)

<br>

```bash
uvx endpoint boot           # One-shot deploy (no install)
pip install endpoint        # Or install globally
endpoint init               # Setup wizard
endpoint boot               # Deploy the VPS
```

</div>

---

## Table of Contents

- [What is Endpoint?](#what-is-endpoint)
- [Features](#features)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [Command Reference](#command-reference)
- [Configuration](#configuration)
- [API Reference](#api-reference)
- [Environment Variables](#environment-variables)
- [Project Structure](#project-structure)
- [Development](#development)
- [License](#license)

---

## What is Endpoint?

**Endpoint** is a Python CLI tool that provisions a free, persistent, OpenAI-compatible LLM inference server on Kaggle's infrastructure. It combines three components into one seamless workflow:

1. **Kaggle Notebook** — Builds and runs `llama.cpp` (or `sd-server` for images/video) on Kaggle's free GPU (T4 x2, P100) or CPU with automatic model loading
2. **Cloudflare Tunnel** — Exposes the server via a secure HTTPS tunnel (trycloudflare.com) with an optional Cloudflare Worker proxy for custom domains and rate limiting
3. **CLI** — Manages the full lifecycle: boot, stop, monitor, pull models, update settings, and stream logs — all from your terminal

**Key differentiator:** Unlike managed API services, you retain full control: choose any GGUF model, tune every inference parameter, and pay nothing. llama.cpp runs on Kaggle's multi-core CPUs by default, with optional GPU acceleration.

---

## Features

### Core Platform
- **Zero Cost** — Runs entirely on Kaggle's free tier (CPU: unlimited hours, GPU: 30h/week, TPU: 20h/week)
- **OpenAI-Compatible API** — Full `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models` with streaming SSE
- **Multi-Model** — Pull any GGUF model from HuggingFace (text, image, video, embeddings, voice, multimodal)
- **Model Hot-Swap** — Switch models at runtime without redeploying
- **Interactive Boot Wizard** — Guided model selection with real-time HuggingFace metadata fetching
- **Persistent Storage** — Models and settings survive session restarts via Kaggle datasets

### Inference Capabilities
- **Chat Completions** — Full OpenAI API with streaming (SSE), function calling, logprobs, stop sequences
- **Text Completions** — Legacy completions endpoint
- **Embeddings** — Generate vector embeddings for RAG pipelines
- **Image Generation** — Stable Diffusion via sd-server (txt2img, img2img)
- **Video Generation** — Wan/LTX video models via sd-server
- **Tokenization** — Tokenize/detokenize, content moderation, reranking
- **Reasoning Models** — DeepSeek-R1 and reasoning-aware model support

### Performance & Optimization
- **llama.cpp** — CPU/GPU inference compiled with `-march=native` and `-O3`
- **Flash Attention** — Reduced VRAM usage and faster context processing
- **Multi-GPU** — Tensor split support for T4 x2 configurations
- **KV Cache** — Configurable cache size, quantization, and management
- **Auto-Batching** — Dynamic batch size calculation based on context length
- **Connection Pooling** — Persistent HTTP sessions with connection reuse

### Operations
- **Real-Time Logs** — SSE-based log streaming from the engine
- **Live Status** — Watch boot progress and engine health in real-time
- **Idle Auto-Termination** — Configurable timeout shuts down idle VPS to save resources
- **Rate Limiting** — Per-IP rate limiting middleware in both CLI and Cloudflare proxy
- **Graceful Shutdown** — SIGINT/SIGTERM handling with clean resource cleanup
- **Shell Completions** — Bash and Zsh tab completion for all commands and flags

### Security & Reliability
- **Cloudflare Proxy** — Optional Worker-based proxy with KV-backed tunnel map and SHA-256 API key hashing
- **API Key Auth** — Bearer token authentication on all inference endpoints
- **GPU/TPU Detection** — Automatic compute capability detection for optimal binary selection
- **Auto-Retry** — Upstream proxy retries on transient network failures
- **Atomic Config Writes** — Crash-safe configuration file updates
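
The SHA-256 API key hashing listed above means a store only needs to hold a digest, never the key itself. The snippet below is a minimal sketch of that idea, not the proxy's actual code: the function names `hash_api_key` and `key_matches` are hypothetical, and the hex encoding of the digest is an assumption.

```python
import hashlib
import hmac


def hash_api_key(api_key: str) -> str:
    """Return the hex SHA-256 digest of an API key (64 hex characters)."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def key_matches(presented_key: str, stored_hash: str) -> bool:
    """Compare a presented key against a stored digest in constant time."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)
```

A lookup table keyed by `hash_api_key(key)` can then map each digest to a tunnel URL without ever persisting the plaintext key.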

---

## Quick Start

### Prerequisites

- [Kaggle account](https://kaggle.com) (phone verified for API access)
- [Kaggle API token](https://www.kaggle.com/settings) saved to `~/.kaggle/kaggle.json`:
  ```json
  {"username":"your-username","key":"kgat_xxxxxxxxxxxxxxxx"}
  ```
- Python 3.11+ (the package metadata declares `Requires-Python: >=3.11`)
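
Before running any command, it can help to confirm that the token file is well-formed. The following is a small sketch under the assumption that the file uses exactly the two fields shown above; `load_kaggle_credentials` is a hypothetical helper and not part of the Endpoint CLI.

```python
import json
from pathlib import Path


def load_kaggle_credentials(path=None):
    """Read kaggle.json and return (username, key), raising if a field is missing."""
    if path is None:
        path = Path.home() / ".kaggle" / "kaggle.json"  # location named in the prerequisites
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [field for field in ("username", "key") if not data.get(field)]
    if missing:
        raise ValueError(f"{path} is missing: {', '.join(missing)}")
    return data["username"], data["key"]
```

Calling `load_kaggle_credentials()` with no arguments checks the default location; a `ValueError` or `FileNotFoundError` indicates the token still needs to be set up.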

### Installation

Choose your preferred method:

```bash
# Run instantly without installing (recommended for first try):
uvx endpoint boot

# Install globally with pip:
pip install endpoint

# Or with uv:
uv add endpoint

# Or with pipx (isolated environment):
pipx install endpoint

# From source:
git clone https://github.com/myth-tools/endpoint.git
cd endpoint
make install
```

### Configure

```bash
# Set your Kaggle API token:
export KAGGLE_API_TOKEN='kgat_xxxx'

# Run the interactive setup wizard:
endpoint init
```

The wizard prompts for your Kaggle username, kernel slug, and default model selection. It creates `~/.config/endpoint/endpoint-config.yaml` with all settings.

### Deploy

```bash
# Interactive boot with model selection and configuration:
endpoint boot

# Boot with GPU acceleration (T4 x2, up to 70B param models):
endpoint boot --gpu

# Boot with TPU acceleration (v5e-8, up to 100B param models):
endpoint boot --tpu

# Boot without live streaming status:
endpoint boot --no-watch
```

### Verify & Use

```bash
# Show API endpoint with example curl commands:
endpoint base-url

# Test API connectivity:
endpoint connect

# List deployed models:
endpoint models

# Stream engine logs:
endpoint logs
```

### Stop

```bash
# Graceful shutdown with cache cleanup:
endpoint stop
```

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        YOUR TERMINAL                            │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────┐ │
│  │ endpoint CLI │  │ Shell Complet│  │ Config (YAML)          │ │
│  │ (commands.py)│  │ (bash/zsh)   │  │ ~/.config/endpoint/    │ │
│  └──────┬───────┘  └──────────────┘  └────────────────────────┘ │
└─────────┼───────────────────────────────────────────────────────┘
          │ ① boot / stop / pull / settings
          ▼
┌─────────────────────────────────────────────────────────────────┐
│                    KAGGLE NOTEBOOK (VPS)                        │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                    FastAPI Engine                          │ │
│  │  ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐   │ │
│  │  │ Chat/Embed  │  │ Image Gen    │  │ Video Gen        │   │ │
│  │  │ /v1/chat    │  │ /v1/images   │  │ /v1/video        │   │ │
│  │  │ /v1/embed   │  │ sd-server    │  │ sd-server        │   │ │
│  │  └──────┬──────┘  └──────┬───────┘  └───────┬──────────┘   │ │
│  │         │                │                  │              │ │
│  │  ┌──────┴────────────────┴──────────────────┴──────────┐   │ │
│  │  │               llama.cpp backend                     │   │ │
│  │  │  GGUF models • KV cache • Flash attn • Batching     │   │ │
│  │  └─────────────────────┬───────────────────────────────┘   │ │
│  └────────────────────────┼───────────────────────────────────┘ │
│                           │                                     │
│  ┌────────────────────────┴────────────────────────────────┐    │
│  │              cloudflared tunnel                         │    │
│  │  https://xxxx.trycloudflare.com → engine:5003           │    │
│  └────────────────────────┬────────────────────────────────┘    │
└───────────────────────────┼─────────────────────────────────────┘
                            │ ③ proxy requests
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                  CLOUDFLARE WORKER (Optional)                   │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  proxy-worker.js                                           │ │
│  │  • Rate limiting (100 req/min per IP)                      │ │
│  │  • KV-backed tunnel map (SHA-256 hashed API keys)          │ │
│  │  • Request proxying with retry + timeout                   │ │
│  │  • CORS headers for browser clients                        │ │
│  └────────────────────────────────────────────────────────────┘ │
│  Domain: api.endpoint.dpdns.org → tunnel                        │
└─────────────────────────────────────────────────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │  YOUR APPLICATION       │
              │  OpenAI SDK / curl / etc│
              └─────────────────────────┘
```
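
The "100 req/min per IP" limit in the worker box can be pictured as a fixed-window counter. This is an illustrative Python sketch only: the real worker is JavaScript (`proxy-worker.js`) and its algorithm is not documented here, so the class name `FixedWindowLimiter`, its parameters, and the fixed-window strategy are all assumptions.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Hypothetical per-IP limiter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit=100, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(lambda: [0.0, 0])  # ip -> [window_start, count]

    def allow(self, ip):
        """Return True if this request from `ip` fits in the current window."""
        now = self.clock()
        entry = self._hits[ip]
        if now - entry[0] >= self.window:
            entry[0], entry[1] = now, 0  # start a fresh window
        if entry[1] >= self.limit:
            return False  # caller would answer with HTTP 429
        entry[1] += 1
        return True
```

A fixed window is the simplest option; it permits short bursts at window boundaries, which a sliding window or token bucket would smooth out.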

### Signal Flow

The CLI and notebook communicate via **ntfy.sh** pub/sub for push-based signaling:

```
CLI ──boot──▶ Kaggle ──STATUS──▶ ntfy.sh ──▶ CLI (streamed in terminal)
CLI ──stop──▶ ntfy.sh ──KILL────▶ Kaggle ──shutdown──▶ ntfy.sh ──▶ CLI
CLI ◀──WS URL──────────────────── ntfy.sh ◀──base64── Kaggle (tunnel acquired)
```
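
To make the flow above concrete, here is a rough sketch of how a subscriber might interpret messages from an ntfy.sh JSON stream. It relies on ntfy's documented stream format, where each line is a JSON object with an `event` field. The payload layout is an assumption: the diagram only indicates that the tunnel URL travels base64-encoded, and both function names are hypothetical.

```python
import base64
import json


def parse_ntfy_line(line: str):
    """Return the message text of one ntfy JSON stream line, or None for non-message events."""
    event = json.loads(line)
    if event.get("event") != "message":
        return None  # "open" and "keepalive" events carry no payload
    return event.get("message", "")


def decode_tunnel_url(message: str) -> str:
    """Decode a base64-encoded tunnel URL (payload format assumed, not confirmed)."""
    return base64.b64decode(message).decode("utf-8")
```

In practice the lines would come from a streaming GET on `https://ntfy.sh/<topic>/json`, with the topic derived from the configured prefix; the exact topic naming scheme is not specified here.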

---

## Command Reference

All 19 commands with complete flag documentation.

### Global Flags

Available before any subcommand:

| Flag | Alias | Description |
|------|-------|-------------|
| `--help` | `-h` | Show help message and exit |
| `--version` | `-v` | Show version and build information |
| `--gpu` | `-g` | GPU accelerator (T4 x2; max 70B params). Mutually exclusive with `--tpu` |
| `--tpu` | `-t` | TPU v5e-8 accelerator (max 100B params). Mutually exclusive with `--gpu` |

---

### `endpoint boot`

Deploy the LLM VPS to Kaggle — interactive model selection, configuration, and deployment.

**Flags:**

| Flag | Description |
|------|-------------|
| `--no-watch` | Build and push without streaming status signals |
| `--p100` | Use P100 GPU accelerator instead of default T4 x2 |

**Use case:** Starting a fresh inference session. Run this first after configuration.

**Examples:**

```bash
endpoint boot                  # Interactive boot (recommended)
endpoint boot --gpu            # Boot with T4 x2 GPU accelerator
endpoint boot --tpu            # Boot with TPU v5e-8 accelerator
endpoint boot --no-watch       # Boot without streaming status
endpoint boot --p100           # Boot with P100 GPU accelerator
```

```
$ endpoint boot

MODEL SELECTION
  Current default: bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
  Enter any HuggingFace GGUF model ID (e.g. Qwen/Qwen2.5-1.5B-Instruct-GGUF)

  Model ID [Enter to keep default]: Qwen/Qwen2.5-1.5B-Instruct-GGUF
  Type:         text
  Context:      32,768 tokens

  Available GGUF files:
    1) Qwen2.5-1.5B-Instruct-Q4_K_M.gguf (0.99 GB)
    2) Qwen2.5-1.5B-Instruct-Q5_K_M.gguf (1.14 GB)
    3) Qwen2.5-1.5B-Instruct-Q6_K.gguf (1.32 GB)
    4) Qwen2.5-1.5B-Instruct-Q8_0.gguf (1.70 GB)

  Enter number or filename [1]: 1

MODEL CONFIGURATION
  max_tokens [2048]:
  temperature [0.7]:
  top_p [0.9]:
  context_length [32768]:

✓ Model and configuration saved.
DEPLOYMENT
  Kernel "shesher/endpoint-llm-vps" is "none"
  Building notebook...
  Pushing to Kaggle...
  ...
```

---

### `endpoint stop`

Stop the running VPS instance — sends kill signal, deletes kernel, clears cached tunnel URL and API key.

**Use case:** Tear down the VPS when done to free Kaggle resources and avoid spending weekly accelerator quota on an idle session.

**Flags:** None.

**Examples:**

```bash
endpoint stop
```

```
✓ Stop signal sent.
✓ Kernel deleted.
✓ Cached credentials cleared.
```

---

### `endpoint status`

Show VPS kernel state, tunnel URL, and deployed models using cached data with background refresh.

**Use case:** Quickly check whether your VPS is running, what models are deployed, and get quick-access links.

**Flags:** None.

**Examples:**

```bash
endpoint status
```

```
Kernel: running (shesher/endpoint-llm-vps)
  URL: https://random-1234.trycloudflare.com

Deployed Models:
  • bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
  • Qwen/Qwen2.5-1.5B-Instruct-GGUF

Quick links:
  endpoint base-url    — Show API endpoint
  endpoint logs        — Stream engine logs
  endpoint stop        — Stop VPS
```

---

### `endpoint watch`

Stream boot and engine status signals in real-time via ntfy.sh.

**Use case:** Monitor deployment progress in a separate terminal while `endpoint boot --no-watch` runs.

**Flags:** None.

**Examples:**

```bash
endpoint watch
```

```
$ endpoint watch
WATCHING endpoint-shesher...
  → notebook-cell-execution-begins
  → Tunnel: https://random-1234.trycloudflare.com
  → Endpoint IS ONLINE
^C Stopped.
```

---

### `endpoint kill-all`

Terminate all running Kaggle kernels. Lists active kernels, asks for confirmation, then stops them in parallel.

**Use case:** Clean slate — stop every running Kaggle kernel when you have multiple stale sessions, or automate cleanup in scripts.

**Flags:**

| Flag | Alias | Description |
|------|-------|-------------|
| `--yes` | `-y` | Skip interactive confirmation prompt |
| `--all` | — | Also delete inactive (completed/error) kernels after stopping active ones |

**Examples:**

```bash
endpoint kill-all                    # Interactive mode
endpoint kill-all --yes              # Non-interactive (script-friendly)
endpoint kill-all --yes --all        # Kill everything including completed
```

```
$ endpoint kill-all --all
Checking kernels...
  Active: shesher/endpoint-llm-vps, shesher/experiment-01
  Inactive: shesher/old-test

Stop these 2 active kernels? [y/N] y
  ✓ Stopped shesher/endpoint-llm-vps
  ✓ Stopped shesher/experiment-01
  ✓ Deleted shesher/old-test
Done. Stopped 2, errors 0.
```

---

### `endpoint register`

Register the tunnel URL with the Cloudflare proxy. Fetches the API key from the engine and registers the tunnel so the proxy forwards requests.

**Use case:** After booting, if proxy registration failed or needs to be re-done (e.g., tunnel URL changed, proxy was restarted).

**Flags:**

| Flag | Alias | Description |
|------|-------|-------------|
| `--force` | `-f` | Skip the proxy health check and register anyway |

**Examples:**

```bash
endpoint register                    # Normal registration
endpoint register --force            # Force-register even if proxy is unresponsive
```

```
✓ Registered tunnel with proxy.
Test with:
  curl https://your-proxy.workers.dev/v1/models \
    -H "Authorization: Bearer sk-xxxxxxxxxxxx"
```

---

### `endpoint base-url`

Show the API endpoint URL with example curl commands for chat, embeddings, and model listing.

**Use case:** Get the URL and auth header needed to call the API from `curl`, Python, or any OpenAI-compatible client.

**Flags:** None.

**Examples:**

```bash
endpoint base-url
```

```
$ endpoint base-url
Endpoint: https://random-1234.trycloudflare.com

Chat Completion:
  curl -X POST https://random-1234.trycloudflare.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <key>" \
    -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'

List Models:
  curl https://random-1234.trycloudflare.com/v1/models \
    -H "Authorization: Bearer <key>"

Embeddings:
  curl -X POST https://random-1234.trycloudflare.com/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <key>" \
    -d '{"model":"default","input":"Hello world"}'
```
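
The same chat completion call can be made from Python with only the standard library. The sketch below mirrors the curl example; `build_chat_request` is a hypothetical helper, and the base URL and key are placeholders to be replaced with the values printed by `endpoint base-url`.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, prompt: str, model: str = "default"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

To send it, pass the request to `urllib.request.urlopen(req, timeout=60)` and decode the response body with `json.load`. Any OpenAI-compatible client should work the same way when pointed at the endpoint URL.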

---

### `endpoint connect`

Test API connectivity by querying `/v1/models`. Prints the raw JSON response on success.

**Use case:** Quick health check — verify the VPS API is responding before sending inference requests.

**Flags:** None.

**Examples:**

```bash
endpoint connect
```

```
$ endpoint connect
CONNECTIVITY TEST
✓ API is responding!
{
  "object": "list",
  "data": [
    {"id": "default", "object": "model", ...}
  ]
}
```

---

### `endpoint logs`

Stream real-time VPS engine logs via Server-Sent Events (SSE). Press `Ctrl+C` to stop.

**Use case:** Debug model loading, monitor inference requests, troubleshoot errors on the VPS in real-time.

**Flags:** None.

**Examples:**

```bash
endpoint logs
```

```
$ endpoint logs
ENGINE LOGS (Ctrl+C to stop)
[2026-05-30 12:00:00] INFO     Starting llama.cpp server...
[2026-05-30 12:00:05] INFO     Loading model default...
[2026-05-30 12:01:00] INFO     POST /v1/chat/completions 200 2.3s
^C Log stream ended.
```
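
For readers who want to consume the log stream from their own tooling, the sketch below parses standard Server-Sent Events framing. The engine's exact log route and payload shape are not documented in this section, so this assumes plain `data:` lines separated by blank lines; `iter_sse_data` is a hypothetical helper.

```python
def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # SSE strips a single leading space after the colon
            buffer.append(value)
        elif line == "" and buffer:
            yield "\n".join(buffer)  # a blank line terminates the event
            buffer = []
        # other lines (comments starting with ":", unknown fields) are ignored
    if buffer:
        yield "\n".join(buffer)  # lenient: flush a final event with no trailing blank line
```

The function accepts any iterable of decoded lines, so it can be fed from a streaming HTTP response or from a saved log file.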

---

### `endpoint models`

List deployed models on the VPS. Without flags, queries the running VPS for its current model roster.

**Use case:** See what models are available for inference, verify a model was pulled successfully, or inspect model metadata.

**Flags:**

| Flag | Alias | Description |
|------|-------|-------------|
| `--builtin` | `-b` | Show built-in models from local config (no VPS query needed) |

**Examples:**

```bash
endpoint models                      # Query VPS for deployed models
endpoint models --builtin            # Show models defined in config
```

```
$ endpoint models --builtin
#   Name                                             Size    HuggingFace Repo
1   Qwen2.5-Coder-3B-Instruct-abliterated (default)  1.99 GB bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
2   Qwen2.5-1.5B-Instruct                            0.95 GB Qwen/Qwen2.5-1.5B-Instruct-GGUF

Hint:
  endpoint pull <hf-repo>   Deploy a new model
  endpoint boot             Start the VPS with these models
```

---

### `endpoint pull [model]`

Pull a GGUF model from HuggingFace onto the VPS. The model is downloaded in the background on the VPS.

**Use case:** Add a new model (e.g., a fine-tune, larger/smaller variant, or different model family) without re-deploying the VPS.

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `model` | No | HuggingFace model ID (e.g. `Qwen/Qwen2.5-1.5B-Instruct-GGUF`). Prompts if omitted |

**Examples:**

```bash
endpoint pull Qwen/Qwen2.5-1.5B-Instruct-GGUF
endpoint pull                          # Interactive prompt
```

---

### `endpoint remove [model]`

Remove a deployed model from the VPS to free disk space.

**Use case:** Free up storage on the VPS by removing unused models, or replace a model with a different quantization.

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `model` | No | Model ID to remove. Prompts if omitted |

**Examples:**

```bash
endpoint remove Qwen/Qwen2.5-1.5B-Instruct-GGUF
```

```
Removing model: Qwen/Qwen2.5-1.5B-Instruct-GGUF
✓ Model removed.
```

---

### `endpoint settings [action] [key] [value]`

View or update engine parameters on the running VPS.

**Use case:** Tweak inference parameters (temperature, context length, threads) without restarting the VPS, or inspect current configuration.

**Arguments:**

| Action | Required | Description |
|--------|----------|-------------|
| *(none)* | — | View all current settings (default when no arguments given) |
| `update` | Yes | Change a setting. Requires `key` and `value` arguments |

**Usage:**

```bash
endpoint settings                          # View all current settings
endpoint settings update <key> <value>     # Update a specific setting
```

**Supported settings keys:**

| Key | Type | Description |
|-----|------|-------------|
| `context_length` | int | Context window size in tokens |
| `batch_size` | int | Batch size |
| `threads` | int | CPU inference threads |
| `temperature` | float | Sampling temperature (0.0–2.0) |
| `top_p` | float | Nucleus sampling threshold (0.0–1.0) |
| `top_k` | int | Top-K sampling |
| `repeat_penalty` | float | Repetition penalty (1.0 = none) |
| `min_p` | float | Min-P sampling threshold |
| `typical_p` | float | Typical sampling threshold |
| `flash_attn` | bool | Enable flash attention |
| `max_tokens` | int | Maximum tokens to generate |
| `ngl` | int | GPU layers to offload (0=CPU, 999=max) |
| `mlock` | bool | Lock model in physical RAM |
| `no_mmap` | bool | Disable memory-mapped model loading |
| `cache_size` | int | KV cache size |
| `ignore_eos` | bool | Ignore end-of-sequence token |
| `seed` | int | Random seed for reproducibility |
| `presence_penalty` | float | Presence penalty (-2.0–2.0) |
| `frequency_penalty` | float | Frequency penalty (-2.0–2.0) |

**Examples:**

```bash
endpoint settings update temperature 0.3
endpoint settings update flash_attn true
endpoint settings update context_length 16384
endpoint settings update threads 8
```

```
$ endpoint settings
SETTINGS
{
  "context_length": 8192,
  "batch_size": 512,
  "temperature": 0.7,
  "top_p": 0.9,
  ...
}
Use: endpoint settings update <key> <value>
```
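
Because `settings update` receives its value as a string, the value has to be converted to the type listed in the table above. The following is a sketch of one way to do that, not the CLI's actual parser: `SETTING_TYPES` covers only a subset of keys, and the accepted boolean spellings are an assumption.

```python
SETTING_TYPES = {  # subset of the supported keys and their documented types
    "context_length": int,
    "threads": int,
    "temperature": float,
    "top_p": float,
    "flash_attn": bool,
}


def coerce_setting(key: str, raw: str):
    """Convert a command-line string to the type the settings table lists for `key`."""
    if key not in SETTING_TYPES:
        raise KeyError(f"unknown setting: {key}")
    kind = SETTING_TYPES[key]
    if kind is bool:
        lowered = raw.strip().lower()
        if lowered in ("true", "1", "yes", "on"):
            return True
        if lowered in ("false", "0", "no", "off"):
            return False
        raise ValueError(f"{key} expects a boolean, got {raw!r}")
    return kind(raw)  # int() and float() raise ValueError on malformed input
```

For example, `coerce_setting("flash_attn", "true")` yields a real boolean rather than the non-empty string `"true"`, which would otherwise always be truthy.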

---

### `endpoint init`

Interactive setup wizard for first-time configuration. Generates `~/.config/endpoint/endpoint-config.yaml`.

**Use case:** First-time setup — creates the configuration file with your Kaggle username, kernel slug, and default model selection.

**Flags:** None.

**Examples:**

```bash
endpoint init
```

```
$ endpoint init
ENDPOINT SETUP WIZARD

Kaggle username [shesher]:
Kernel slug [endpoint-llm-vps]:

Default model:
  1) Qwen2.5-Coder-3B-Instruct-abliterated (default, 1.99 GB)
  2) Qwen2.5-1.5B-Instruct (0.95 GB)
Enter number or HF repo ID [1]: 2

✓ Configuration saved to ~/.config/endpoint/endpoint-config.yaml

Next steps:
  endpoint doctor       — Check dependencies
  endpoint boot         — Deploy the VPS
```

---

### `endpoint doctor`

Run system diagnostics. Checks for required CLI tools (`kaggle`, `curl`, `jq`, `git`), Python modules, Kaggle API token validity, config status, and proxy health.

**Use case:** Verify your environment is ready before deploying, or diagnose issues when something isn't working.

**Flags:** None.

**Examples:**

```bash
endpoint doctor
```

```
$ endpoint doctor
CLI Tools:
  ✓ kaggle     found
  ✓ curl       found
  ✓ jq         found
  ✓ git        found

Python:
  ✓ yaml       module available

Kaggle API:
  ✓ Token found
  ✓ API working (1 kernel listed)

Config:
  ✓ Username:   shesher
  ✓ Kernel ID:  shesher/endpoint-llm-vps

Proxy:
  ✓ Proxy healthy — 200 OK

✓ All checks passed.
```

---

### `endpoint provider-config`

Show Myth Org provider identity, live API key, and deployed models from the VPS with metadata (context length, type, reasoning capability). Optionally filter by model name and see usage examples.

**Use case:** Inspect provider details for integration with OpenAI-compatible clients, or get ready-to-use curl/Python examples for a specific model.

**Flags:**

| Flag | Alias | Description |
|------|-------|-------------|
| `--model` | `-m` | Filter displayed models (exact, prefix, or substring match) |

**Examples:**

```bash
endpoint provider-config                         # Show all provider info
endpoint provider-config --model Qwen            # Filter to Qwen models
endpoint provider-config -m "Qwen2.5-Coder"      # Short flag; matches by prefix here
```

```
PROVIDER CONFIGURATION
  Provider ID:     myth
  Provider Name:   MYTH Org
  Base URL:        https://random-1234.trycloudflare.com
  API Key:         sk-xxxxxxxxxxxx

Deployed Models:
  Model ID                        Display Name                      Context    Type    Reasoning
  Qwen2.5-Coder-3B-Instruct-...   Qwen 2.5 Coder 3B Abliterated     32K        text    no

Usage Examples (curl, Python):
  curl -X POST https://random-1234.trycloudflare.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-xxxxxxxxxxxx" \
    -d '{"model":"Qwen2.5-Coder-3B-Instruct-...","messages":[{"role":"user","content":"Hello"}]}'
```

---

### `endpoint autocomplete [shell]`

Install shell completion for `bash` or `zsh`. Appends the completion function to `~/.bashrc` or `~/.zshrc`.

**Use case:** Enable tab-completion for all `endpoint` commands, subcommands, and flags in your shell.

**Arguments:**

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `shell` | No | `bash` | Shell type: `bash` or `zsh` |

**Examples:**

```bash
endpoint autocomplete                # Install bash completions (default)
source ~/.bashrc                     # Reload to activate

endpoint autocomplete zsh            # Install zsh completions
source ~/.zshrc                      # Reload to activate
```

```
✓ Bash completion installed. Restart your shell or run: source ~/.bashrc
```

---

### `endpoint help [command]`

Show help for a specific command. Without arguments, prints the general help listing all commands.

**Use case:** Quick reference for a command's flags and usage without checking the full documentation.

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `command` | No | Command name to get help for |

**Examples:**

```bash
endpoint help                        # General help
endpoint help boot                   # Help for boot command
endpoint help kill-all               # Help for kill-all
endpoint help settings               # Help for settings
```

```
$ endpoint help boot
usage: endpoint boot [-h] [--no-watch] [--p100]

Deploy LLM VPS to Kaggle and stream status

options:
  -h, --help   show this help message and exit
  --no-watch   Build and push without streaming
  --p100       Use P100 GPU (default T4 x2)
```

---

### `endpoint version`

Show version and build information.

**Use case:** Verify which version of Endpoint you have installed, confirm the active config path, and check the kernel name.

**Flags:** None.

**Examples:**

```bash
endpoint version
```

```
$ endpoint version
Endpoint CLI v0.1.0
License: GPLv3+
Engine version: 0.1.0
Config: /home/user/.config/endpoint/endpoint-config.yaml
Kernel: shesher/endpoint-llm-vps
```

---

## Configuration

Endpoint uses a YAML configuration file loaded in layers, listed here from lowest to highest precedence:

1. **Bundled defaults** — `endpoint/data/endpoint-config.yaml`
2. **User config** — `~/.config/endpoint/endpoint-config.yaml`
3. **Local override** — `./endpoint-config.yaml` (project root, gitignored)
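
Layered loading of this kind is commonly implemented as a recursive merge. The sketch below assumes that later layers win and that nested sections are merged key by key rather than replaced wholesale; neither detail is confirmed by this document, and `deep_merge` and `resolve_config` are hypothetical names.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins, merging nested dicts recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers: dict) -> dict:
    """Fold layers left to right, so later layers take precedence."""
    result: dict = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result
```

Under these assumptions, a local file that sets only `llm.temperature` would leave every other `llm` key at the value supplied by the user config or the bundled defaults.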

### Full Schema

```yaml
# ── Identity ──────────────────────────────────────────────────────
identity:
  kaggle_username: "your-username"         # Kaggle username
  kernel_slug: "endpoint-llm-vps"          # Kaggle kernel slug
  api_key: ""                              # API key (auto-provisioned on boot)

# ── Engine ────────────────────────────────────────────────────────
engine:
  version: "0.1.0"                         # Engine version
  engine_port: 5003                        # FastAPI server port
  accelerator: "cpu"                       # cpu | gpu_t4 | gpu_p100 | tpu

# ── Signal (ntfy.sh) ─────────────────────────────────────────────
signal:
  topic_prefix: "endpoint"                 # ntfy.sh topic prefix

# ── LLM Inference ─────────────────────────────────────────────────
llm:
  port: 8080                               # llama.cpp server port
  context_length: 8192                     # Context window (tokens)
  batch_size: 512                          # Batch size
  threads: 4                               # CPU threads
  flash_attn: true                         # Flash attention
  max_tokens: 2048                         # Max generation tokens
  temperature: 0.7                         # Sampling temperature
  top_p: 0.9                               # Nucleus sampling
  top_k: 40                                # Top-K sampling
  repeat_penalty: 1.1                      # Repetition penalty
  min_p: 0.0                               # Min-P sampling
  typical_p: 0.0                           # Typical sampling
  ngl: 0                                   # GPU layers (0=CPU, 999=max)
  tensor_split: ""                         # Multi-GPU split: "1,1"
  mlock: false                             # Lock model in RAM
  no_mmap: false                           # Disable memory mapping
  cache_size: 0                            # KV cache size (0=auto)
  chat_template: ""                        # Jinja2 template override
  ignore_eos: false                        # Ignore EOS token

# ── Image Generation (sd-server) ─────────────────────────────────
image:
  port: 8081                               # sd-server port
  steps: 20                                # Denoising steps
  cfg_scale: 7.0                           # Guidance scale
  width: 1024                              # Output width
  height: 1024                             # Output height
  sampler: "euler"                         # Sampler type

# ── Video Generation ─────────────────────────────────────────────
video:
  port: 8081                               # sd-server port (shared)
  fps: 12                                  # Frames per second
  frames: 41                               # Total frames
  cfg_scale: 5.0                           # Video CFG scale
  steps: 20                                # Denoising steps

# ── Models ────────────────────────────────────────────────────────
default_model: "bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF"
default_model_file: "*.Q4_K_M.gguf"
default_model_index: 0

models:
  - name: "Qwen2.5-Coder-3B-Instruct-abliterated"
    hf_repo: "bartowski/Qwen2.5-Coder-3B-Instruct-abliterated-GGUF"
    hf_file: "*.Q4_K_M.gguf"
    size_gb: 1.99
    temperature: 0.7
    top_p: 0.9
    top_k: 40
  - name: "Qwen2.5-1.5B-Instruct"
    hf_repo: "Qwen/Qwen2.5-1.5B-Instruct-GGUF"
    hf_file: "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
    size_gb: 0.95
    temperature: 0.7
    top_p: 0.9
    top_k: 40

# ── Proxy ─────────────────────────────────────────────────────────
proxy:
  enabled: true
  domain: "api.endpoint.dpdns.org"
  url: "https://api.endpoint.dpdns.org"
  base_url: "https://api.endpoint.dpdns.org/v1"
  register_url: "https://api.endpoint.dpdns.org/__register"
  unregister_url: "https://api.endpoint.dpdns.org/__unregister"

# ── VPS Packages ─────────────────────────────────────────────────
packages:
  system:
    - curl
    - wget
    - tar
    - jq
    - python3-dev
    - pip
  python:
    - huggingface_hub
    - requests
    - fastapi
    - uvicorn[standard]
    - pydantic
```
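
Several keys in the `llm` block correspond to llama.cpp server options. The sketch below shows one way a subset of them could be translated into `llama-server` arguments. The helper name `llama_server_args` is hypothetical and the engine's actual translation may differ; only the flag names come from llama.cpp.

```python
def llama_server_args(llm: dict) -> list[str]:
    """Translate a subset of the `llm` config keys into llama-server flags.

    Hypothetical helper for illustration; not the engine's real code.
    Defaults mirror the values shown in the config above.
    """
    args = [
        "--port", str(llm.get("port", 8080)),
        "--ctx-size", str(llm.get("context_length", 8192)),
        "--batch-size", str(llm.get("batch_size", 512)),
        "--threads", str(llm.get("threads", 4)),
        "--n-gpu-layers", str(llm.get("ngl", 0)),
    ]
    # Optional flags are only emitted when the config enables them.
    if llm.get("tensor_split"):
        args += ["--tensor-split", llm["tensor_split"]]
    if llm.get("mlock"):
        args.append("--mlock")
    if llm.get("no_mmap"):
        args.append("--no-mmap")
    return args
```

For example, `llama_server_args({"ngl": 999, "tensor_split": "1,1"})` would offload all layers and split them evenly across two GPUs.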

### Accelerator Reference

| Accelerator | CLI Flag | Kaggle Type | Max Model Size | Quota |
|-------------|----------|-------------|---------------|-------|
| CPU | *(default)* | CPU only | 20B params | Unlimited |
| GPU T4 x2 | `--gpu` / `-g` | `NvidiaTeslaT4` | 70B params | 30h/week |
| GPU P100 | `--p100` | `NvidiaTeslaP100` | 40B params | 30h/week |
| TPU v5e-8 | `--tpu` / `-t` | `TpuV5E8` | 100B params | 20h/week |

---

## API Reference

Once deployed, the engine exposes a full OpenAI-compatible REST API.

### Public Endpoints (no auth required)

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/` | Root status: service info, docs link |
| `GET` | `/health` | Health check: backend status, model loaded, uptime |
| `GET` | `/tunnel` | Tunnel URL and status |
| `GET` | `/metrics` | Request count, latency, error rate, latency buckets |
| `GET` | `/docs` | Swagger UI documentation |
| `GET` | `/openapi.json` | OpenAPI schema (JSON) |
| `GET` | `/v1/apikey` | Get the current API key |

### Authenticated Endpoints (Bearer token required)

#### Chat & Text

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/chat/completions` | Chat completions with streaming SSE support |
| `POST` | `/v1/completions` | Legacy text completions |

**Chat Request** accepts the following OpenAI-compatible parameters: `model`, `messages`, `stream`, `temperature`, `top_p`, `max_tokens`, `stop`, `frequency_penalty`, `presence_penalty`, `logprobs`, `top_logprobs`, `seed`, `n`, `user`, and `reasoning_effort`.
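
As a minimal sketch, a request body using several of these parameters can be assembled as follows. The helper `build_chat_request` and the default values chosen here are illustrative assumptions, not part of the engine.

```python
import json


def build_chat_request(prompt: str, **overrides) -> dict:
    """Build a chat completions payload; keyword arguments override defaults.

    Hypothetical helper; the defaults below are example values only.
    """
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": 256,
        "seed": 42,
    }
    payload.update(overrides)
    return payload


# Serialize to JSON for use as the POST body of /v1/chat/completions.
body = json.dumps(build_chat_request("Hello!", stop=["\n\n"], max_tokens=64))
```

The resulting `body` string can be sent with any HTTP client, together with the `Authorization: Bearer` header.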

#### Models

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/v1/models` | List all models with metadata (type, context, size, reasoning) |
| `GET` | `/v1/models/{model_id}` | Get specific model details |
| `POST` | `/v1/models/pull?model=<id>` | Pull model from HuggingFace |
| `DELETE` | `/v1/models/{model_id}` | Remove a model |

#### Embeddings & Rerank

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/embeddings` | Generate text embeddings |
| `POST` | `/v1/rerank` | Re-rank documents by relevance |

#### Image & Video

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/images/generations` | Generate images (txt2img) |
| `POST` | `/v1/video/generations` | Generate video |
| `POST` | `/v1/video/edits` | Edit video |

#### Tokenization & Moderation

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/tokenize` | Tokenize text |
| `POST` | `/v1/detokenize` | Detokenize token IDs |
| `POST` | `/v1/moderations` | Content moderation |

#### Settings & Logs

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/v1/settings` | Get current engine settings |
| `POST` | `/v1/settings` | Update engine settings |
| `GET` | `/v1/logs` | Get recent log lines |
| `GET` | `/v1/logs/stream` | SSE stream of live logs |
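
`/v1/logs/stream` is delivered as Server-Sent Events. The payload format of each event is not documented here, so the sketch below handles only the standard SSE framing (`data:` lines terminated by a blank line) and yields the raw payload text. The function name `iter_sse_data` is hypothetical.

```python
from collections.abc import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format strips a single leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line ends the event; multi-line data is joined with "\n".
            yield "\n".join(buffer)
            buffer = []
        # Comment lines (starting with ":") and other fields are ignored.
    if buffer:
        # Lenient: flush a trailing event that was not closed by a blank line.
        yield "\n".join(buffer)
```

The function accepts any iterable of decoded lines, for example the lines of a streaming HTTP response body.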

### Usage Examples

```bash
# Chat with streaming
curl -X POST https://your-tunnel.trycloudflare.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxxx" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-tunnel.trycloudflare.com/v1",
    api_key="sk-xxxx",
)
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    # Print tokens as they arrive; skip chunks that carry no choices
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
```

```python
# Generate embeddings (reuses the `client` created in the previous example)
response = client.embeddings.create(
    model="default",
    input=["Hello world", "How are you?"],
)
print(response.data[0].embedding[:5])  # First 5 dimensions
```

---

## Environment Variables

### CLI Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `KAGGLE_API_TOKEN` | Yes* | — | Kaggle API token (`kgat_...`). Alternative to `kaggle.json` |
| `ENDPOINT_API_KEY` | Auto-set | — | Engine API key (set by CLI during boot) |
| `CONFIG` | No | — | Path to YAML config file override |

*Required if `~/.kaggle/kaggle.json` is not present.

### Engine Variables (set on the Kaggle VPS)

| Variable | Default | Description |
|----------|---------|-------------|
| `LLAMA_PORT` | `8080` | llama.cpp server port |
| `SD_PORT` | `8081` | sd-server port |
| `ENGINE_PORT` | `5003` | FastAPI engine port |
| `LLM_SIGNAL_TOPIC` | `""` | ntfy.sh signal topic (auto-derived) |
| `LLM_CONTROL_TOPIC` | `""` | ntfy.sh control topic (auto-derived) |
| `SESSION_ID` | `"default"` | Session identifier |
| `ENGINE_VERSION` | `"0.1.0"` | Engine version string |
| `LLM_ACCELERATOR` | — | Accelerator type (`cpu`, `gpu_t4`, etc.) |
| `LLM_IDLE_TIMEOUT` | `3600` | Idle shutdown timeout in seconds (`0` = disable) |
| `RATE_LIMIT` | `"1"` | Enable rate limiting (`1`/`true`/`yes`) |
| `LLM_DEBUG` | — | Enable verbose error messages |
| `LLM_CONTEXT_LEN` | — | Context length override |
| `LLM_BATCH_SIZE` | — | Batch size override |
| `LLM_MAX_TOKENS` | — | Max tokens override |
| `LLM_NGL` | — | GPU layers override |
| `LLM_TENSOR_SPLIT` | — | Tensor split override |
| `DEFAULT_MODEL` | — | Default model ID override |
| `DEFAULT_MODEL_FILE` | — | Default model filename override |
| `MODEL_TYPE` | — | Model type override |
| `HF_TOKEN` | — | HuggingFace token for gated model downloads |
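
To illustrate how the documented defaults apply, the sketch below reads a few of these variables. The function `read_engine_env` and the returned key names are hypothetical, and treating `RATE_LIMIT` case-insensitively is an assumption; the engine's own parsing may differ.

```python
import os

TRUTHY = {"1", "true", "yes"}  # values the table lists as enabling rate limiting


def read_engine_env(env=None) -> dict:
    """Read selected engine variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "llama_port": int(env.get("LLAMA_PORT", "8080")),
        "sd_port": int(env.get("SD_PORT", "8081")),
        "engine_port": int(env.get("ENGINE_PORT", "5003")),
        "session_id": env.get("SESSION_ID", "default"),
        # 0 disables the idle shutdown timer.
        "idle_timeout": int(env.get("LLM_IDLE_TIMEOUT", "3600")),
        "rate_limit": env.get("RATE_LIMIT", "1").strip().lower() in TRUTHY,
    }
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests.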

---

## Project Structure

```
endpoint/
├── endpoint/                        # CLI package (pip-installable)
│   ├── __init__.py                  # Version string
│   ├── __main__.py                  # python -m endpoint support
│   ├── main.py                      # Argparse parser, dispatch
│   ├── commands.py                  # All 19 command implementations
│   ├── core.py                      # Config, VPSClient, console, signals, cache
│   ├── py.typed                     # PEP 561 type marker
│   └── data/                        # Bundled defaults
│       ├── endpoint-config.yaml
│       └── endpoint-config.example.yaml
├── engine/                          # Inference server (runs on Kaggle)
│   ├── engine.py                    # FastAPI app: all API endpoints, lifecycle
│   └── models_config.py             # Model management, GGUF parsing, settings
├── scripts/                         # Build & automation
│   ├── master_build_notebook.py     # Kaggle notebook generator (962 lines)
│   ├── update_and_embed.py          # Engine payload sync into notebook
│   ├── lint.py                      # Code quality pipeline (ruff + mypy + pytest)
│   └── release.py                   # PyPI and GitHub release automation
├── cloudflare/                      # Cloudflare proxy worker
│   ├── deploy.py                    # Wrangler-based deployment orchestration
│   ├── proxy-worker.js              # CF Worker: rate limiting, KV tunnel map
│   ├── wrangler.toml                # Wrangler configuration
│   └── wrangler.example.toml        # Example wrangler config
├── tests/
│   └── sanity_test.py               # 26-test static analysis suite
├── man/
│   └── endpoint.1                   # Man page (roff format)
├── Makefile                         # Build, lint, test, publish targets
├── pyproject.toml                   # Package metadata + tool configuration
├── uv.lock                          # Dependency lock file
├── endpoint-config.yaml             # Local config (gitignored)
└── README.md
```

---

## Development

### Setup

```bash
git clone https://github.com/myth-tools/endpoint.git
cd endpoint
make install      # or: uv sync
```

### Commands

```bash
make all          # Full pipeline: lint + test (default target)
make lint         # Run ruff linter (format check + lint)
make test         # Run pytest test suite
make build        # Build PyPI wheel + sdist
make install      # Editable install with uv
make release      # Dry-run release build
make publish      # Build & publish to PyPI
make clean        # Remove build artifacts
```

### Quality Standards

- **Linting:** ruff with full ruleset (E, F, I, N, W, UP, B, SIM, ARG, etc.)
- **Formatting:** ruff formatter (compatible with Black)
- **Type Checking:** mypy for CLI, engine, scripts, and tests
- **Testing:** pytest with 26 static analysis tests covering syntax, security, naming, config schema, base64 roundtrip, version consistency, import resolution, license headers, file sizes, and more
- **CI:** `make all` runs the full pipeline: lint → test

---

## License

[GNU General Public License v3.0 or later](LICENSE) © 2024-2026 Shesher Hasan.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
