Metadata-Version: 2.4
Name: koda-llm
Version: 0.1.0
Summary: Run LLMs locally. One command.
Author: Ryan Cuff
License: MIT License
        
        Copyright (c) 2026 Ryan Cuff
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/rjcuff/koda
Project-URL: Repository, https://github.com/rjcuff/koda
Project-URL: Bug Tracker, https://github.com/rjcuff/koda/issues
Keywords: llm,ai,local,inference,llama,ollama
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: llama-cpp-python
Requires-Dist: huggingface_hub
Requires-Dist: fastapi
Requires-Dist: uvicorn[standard]
Requires-Dist: typer
Requires-Dist: rich
Requires-Dist: pyyaml
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: httpx; extra == "dev"
Dynamic: license-file

# koda

**Run any LLM locally. One command.**

[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![Platform](https://img.shields.io/badge/platform-macOS%20%7C%20Linux%20%7C%20Windows-lightgrey.svg)]()

Koda downloads and runs quantized LLMs on your machine. No cloud, no API keys, no Docker. It speaks the Ollama and OpenAI protocols, so any compatible client works out of the box.

```bash
koda pull llama3.2
koda run llama3.2
```

Inspired by [Ollama](https://ollama.com). Built with [llama.cpp](https://github.com/ggerganov/llama.cpp) + FastAPI.

---

## Requirements

- **Python 3.12+**
- **RAM:** 4 GB minimum (8 GB+ recommended for 7B models)
- **Disk:** roughly 2–5 GB per model
- **GPU:** optional but recommended — CUDA (Linux/Windows) or Metal (Apple Silicon)

---

## Install

### macOS / Linux

```bash
curl -fsSL https://raw.githubusercontent.com/rjcuff/koda/main/install.sh | bash
```

The installer detects your platform and automatically builds `llama-cpp-python` with GPU support (CUDA or Metal) if available. After install, run:

```bash
source ~/.bashrc   # or ~/.zshrc on zsh
koda version
```

### Windows

```powershell
irm https://raw.githubusercontent.com/rjcuff/koda/main/install.ps1 | iex
```

### Manual

```bash
git clone https://github.com/rjcuff/koda
cd koda
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
```

#### GPU support (optional; speeds up inference considerably)

**CUDA (Linux / Windows):**
```bash
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir
pip install -e .
```

**Apple Silicon (Metal):**
```bash
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-cache-dir
pip install -e .
```

CPU-only works fine if you skip the above — inference is just slower.
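
To confirm which backend your build actually ended up with, one quick check is the `llama_supports_gpu_offload` binding that recent `llama-cpp-python` releases expose (treat this as a hint, not a guarantee — older versions may not ship it):

```bash
# Prints True if the installed llama-cpp-python build can offload layers to a GPU
python -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"
```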

---

## Quick Start

```bash
# 1. Download a model
koda pull llama3.2

# 2. Chat interactively
koda run llama3.2

# 3. Or start the API server
koda serve
```

Type `/bye` to exit an interactive session.

---

## Commands

| Command | Description |
|---------|-------------|
| `koda pull <model>` | Download a model from HuggingFace |
| `koda list` | Show downloaded models |
| `koda list --available` | Show all pullable models |
| `koda run <model>` | Start an interactive chat session |
| `koda run <model> --system "..."` | Chat with a custom system prompt |
| `koda run <model> --ctx 8192` | Set the context window size |
| `koda run --kodafile Kodafile` | Run with a Kodafile config |
| `koda serve` | Start the API server on :11434 |
| `koda serve --host 0.0.0.0 --port 8080` | Custom host and port |
| `koda create` | Generate a Kodafile template |
| `koda version` | Show Koda version |
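
For example, to expose the API server to other machines on your network and query it from a second box (a sketch — substitute your host's LAN address for `<host-ip>`):

```bash
koda pull llama3.2
koda serve --host 0.0.0.0 --port 8080

# from another machine on the same network
curl http://<host-ip>:8080/api/tags
```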

---

## Available Models

| Name | Description | Size |
|------|-------------|------|
| `llama3.2` / `llama3.2:3b` | Meta Llama 3.2 3B Instruct | 2.0 GB |
| `llama3.1` / `llama3.1:8b` | Meta Llama 3.1 8B Instruct | 4.9 GB |
| `mistral` | Mistral 7B Instruct v0.3 | 4.4 GB |
| `phi3` | Microsoft Phi-3 Mini 4K | 2.2 GB |
| `gemma2` | Google Gemma 2 2B Instruct | 1.6 GB |
| `qwen2.5` | Qwen 2.5 7B Instruct | 4.7 GB |
| `deepseek-r1` | DeepSeek R1 Distill Qwen 7B | 4.7 GB |

All models use Q4_K_M quantization — a good balance of quality and size. Run `koda list --available` to see the current list.
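
The tags in the table work anywhere a model name is accepted, for example:

```bash
koda pull llama3.1:8b
koda run llama3.1:8b --ctx 8192
```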

---

## API Server

```bash
koda serve
# Listening on http://127.0.0.1:11434
```

The server implements both the Ollama and OpenAI API protocols, so you can point any compatible client at it without code changes.

### Ollama-compatible endpoints

```bash
# List downloaded models
curl http://localhost:11434/api/tags

# Text generation
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "prompt": "Why is the sky blue?"}'

# Chat completion
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'

# List running models
curl http://localhost:11434/api/ps

# Pull a model
curl http://localhost:11434/api/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2"}'

# Delete a model
curl -X DELETE http://localhost:11434/api/delete \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2"}'
```

### OpenAI-compatible endpoints

Drop-in replacement for any client that supports a custom base URL:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="koda")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```

```bash
# List models
curl http://localhost:11434/v1/models

# Chat completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
```

Streaming works on both protocols — set `"stream": true` in the request body.
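
For instance, with the OpenAI Python client shown above, streaming is just a matter of passing `stream=True` and iterating the chunks (a minimal sketch):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="koda")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about local inference"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content  # None for chunks that carry no text
    if delta:
        print(delta, end="", flush=True)
print()
```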

---

## Python Library

Use Koda directly in your Python code. No server, no daemon, no subprocess.

```python
from koda import Koda

k = Koda()

# Download a model (no-op if already present)
k.pull("llama3.2")

# Chat — returns a string
reply = k.chat("llama3.2", [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
])
print(reply)

# Streaming chat — returns a token iterator
for token in k.chat("llama3.2", [{"role": "user", "content": "Tell me a story"}], stream=True):
    print(token, end="", flush=True)

# Raw text completion (no chat template applied)
text = k.generate("llama3.2", "The capital of France is")
print(text)

# Streaming completion
for token in k.generate("llama3.2", "Once upon a time", stream=True):
    print(token, end="", flush=True)

# Manage loaded models
print(k.models())    # all models in the registry
print(k.loaded())    # models currently in memory
k.unload("llama3.2") # free memory
```

**Custom context window:**
```python
k = Koda(n_ctx=8192)
```
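
As a sketch of how the library composes into an application, here is a minimal multi-turn REPL built only on the `pull` and streaming `chat` calls documented above (the `/bye` convention is borrowed from the CLI):

```python
from koda import Koda

k = Koda()
k.pull("llama3.2")  # no-op if the model is already downloaded

history = [{"role": "system", "content": "You are a helpful assistant."}]
while True:
    user = input("you> ")
    if user.strip() == "/bye":
        break
    history.append({"role": "user", "content": user})
    reply = ""
    # stream tokens as they arrive, then keep the full reply in the history
    for token in k.chat("llama3.2", history, stream=True):
        print(token, end="", flush=True)
        reply += token
    print()
    history.append({"role": "assistant", "content": reply})
```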

---

## Kodafile

A `Kodafile` is a YAML config file that defines a model's behavior — useful for project-specific assistants or repeatable setups.

```bash
koda create            # writes a Kodafile template in the current directory
koda run --kodafile Kodafile
```

**Kodafile format:**
```yaml
base: llama3.2
system: You are a concise coding assistant. Respond in plain text, no markdown.
parameters:
  n_ctx: 8192
  temperature: 0.7
  top_p: 0.9
  repeat_penalty: 1.1
```

| Field | Description | Default |
|-------|-------------|---------|
| `base` | Model name (required) | — |
| `system` | System prompt | `"You are a helpful assistant."` |
| `parameters.n_ctx` | Context window size (tokens) | `4096` |
| `parameters.temperature` | Sampling temperature | `0.8` |
| `parameters.top_p` | Nucleus sampling | — |
| `parameters.top_k` | Top-K sampling | — |
| `parameters.repeat_penalty` | Repetition penalty | — |
| `parameters.max_tokens` | Max tokens to generate (`-1` = unlimited) | `-1` |
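
A concrete example: a Kodafile for a terse code-review assistant, using only the fields documented above (the model choice and parameter values are illustrative):

```yaml
base: qwen2.5
system: You are a strict code reviewer. Point out bugs and style issues; keep it short.
parameters:
  n_ctx: 8192
  temperature: 0.2
  top_k: 40
  max_tokens: 512
```

Run it with `koda run --kodafile Kodafile` from the directory that contains the file.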

---

## Project Structure

```
koda/
├── koda/
│   ├── api.py          # Python library API (Koda class)
│   ├── cli.py          # CLI commands: pull, list, run, serve, create
│   ├── config.py       # Paths and defaults (~/.koda/)
│   ├── inference.py    # Model loading + thread-safe in-memory cache
│   ├── kodafile.py     # Kodafile YAML config format
│   ├── pull.py         # HuggingFace model downloads
│   ├── registry.py     # Model name → HuggingFace repo mapping
│   └── server.py       # FastAPI server — Ollama + OpenAI APIs
├── install.sh          # macOS / Linux one-line installer
├── install.ps1         # Windows one-line installer
└── pyproject.toml
```

Models are stored in `~/.koda/models/`.
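
To see what is actually on disk, ordinary shell tools are enough:

```bash
ls -lh ~/.koda/models/   # individual model files
du -sh ~/.koda/models/   # total space used
```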

---

## Stack

| Component | Library |
|-----------|---------|
| Inference | [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) |
| API server | [FastAPI](https://fastapi.tiangolo.com) + [uvicorn](https://www.uvicorn.org) |
| CLI | [Typer](https://typer.tiangolo.com) + [Rich](https://github.com/Textualize/rich) |
| Model downloads | [huggingface_hub](https://github.com/huggingface/huggingface_hub) |
| GPU backends | CUDA (Linux/Windows) · Metal (Apple Silicon) · CPU fallback |

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for how to add models, endpoints, and commands.

---

## License

[MIT](LICENSE)
