Metadata-Version: 2.4
Name: llamacpp-cli
Version: 0.1.8
Summary: Ollama-like CLI wrapper around llama.cpp
License: MIT
Project-URL: Homepage, https://github.com/joeyjiaojg/llamacpp-cli
Project-URL: Source, https://github.com/joeyjiaojg/llamacpp-cli
Project-URL: Bug Tracker, https://github.com/joeyjiaojg/llamacpp-cli/issues
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: click>=8.1
Requires-Dist: requests>=2.31
Requires-Dist: rich>=13.0
Requires-Dist: huggingface-hub>=0.20
Requires-Dist: fastapi>=0.104
Requires-Dist: uvicorn[standard]>=0.24
Requires-Dist: httpx>=0.25
Requires-Dist: structlog>=24.1.0
Requires-Dist: prometheus-client>=0.19
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"

# llamacpp-cli

Ollama-like CLI wrapper around llama.cpp. Provides a simple command-line interface that mirrors Ollama's subcommands but powered by llama.cpp as the backend inference engine.

## Features

- **pull** - Download GGUF models from Hugging Face
- **run** - Run models interactively using llama.cpp
- **serve** - Start the llama.cpp server
- **lb-proxy** - Multi-backend load balancer proxy (NEW!)
- **list** - List downloaded models
- **ps** - Show running llama.cpp processes
- **rm** - Remove a downloaded model
- **search** - Search Hugging Face for GGUF models
- **install** - Install/update llama.cpp binaries

## Installation

### From PyPI

```bash
pip install llamacpp-cli
```

### From Source

```bash
pip install -e .
```

## Quick Start

### 1. Install llama.cpp binaries

```bash
llamacpp install
```

This downloads the latest llama.cpp release to `~/.llamacpp/bin/`.

### 2. Pull a model

```bash
llamacpp pull unsloth/gemma-3-270m-it-GGUF:Q4_K_M
```

Or use a short alias:

```bash
llamacpp pull gemma3:270m
```

### 3. Run interactively

```bash
llamacpp run gemma3:270m
```

### 4. Start the server

```bash
llamacpp serve -m gemma3:270m
```

The server runs at `http://0.0.0.0:8080` with OpenAI-compatible API.

#### CPU-Optimized Presets

For CPU-only servers, use presets optimized for different workloads:

```bash
# Code tasks (default): 16K context, 2-4 parallel requests
llamacpp serve --preset code

# Chat/conversational: 8K context, 4-6 parallel requests
llamacpp serve --preset chat

# Fast queries: 4K context, 6-8 parallel requests
llamacpp serve --preset fast

# Large codebases: 32K context, 1 parallel request (slower)
llamacpp serve --preset max-context
```

See [CPU_OPTIMIZATION.md](docs/CPU_OPTIMIZATION.md) for detailed tuning guide.

## Commands

```
llamacpp pull <model>      Download GGUF model from Hugging Face
llamacpp run <model>       Run a model interactively
llamacpp serve             Start the llama.cpp server
llamacpp lb-proxy          Start multi-backend load balancer (see LB_PROXY.md)
llamacpp list              List downloaded models
llamacpp ps                Show running processes
llamacpp rm <model>        Remove a model
llamacpp search <query>    Search for models on Hugging Face
llamacpp install           Install/update llama.cpp binaries
```

### Load Balancer Proxy

For distributing requests across multiple machines, use the load balancer:

```bash
# Auto-discover backends on your network
llamacpp lb-proxy --discover-subnet 192.168.1.0/24

# Or specify backends manually
llamacpp lb-proxy -b http://machine1:8000 -b http://machine2:8000
```

See [LB_PROXY.md](LB_PROXY.md) for detailed documentation on:
- Model-aware routing
- Least-connections load balancing
- Auto-discovery and health checks
- Configuration options

## Model Names

Model names can be specified in multiple ways:

- Full Hugging Face path: `unsloth/gemma-3-270m-it-GGUF:Q4_K_M`
- Short format: `namespace/model:quantization` (e.g., `gemma3:270m`)
- Short name: `gemma3:270m`, `qwen3`, `llama3:8b`

Alias support is planned for future releases.

## Configuration

- Models are stored in `~/.llamacpp/models/`
- Binaries are installed to `~/.llamacpp/bin/`
- Database (SQLite) is at `~/.llamacpp/llamacpp.db`

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `LLAMACPP_BIN_DIR` | Directory for llama.cpp binaries | `~/.llamacpp/bin` |
| `LLAMACPP_MODEL_DIR` | Directory for models | `~/.llamacpp/models` |

## Usage with LLM CLI

This package also registers as an LLM plugin for the `llm` CLI:

```bash
# Install the plugin (requires llm and llama-cpp-python)
pip install llm-llama-cpp llama-cpp-python

# Register a model
llm llama-cpp add-model ~/.llamacpp/models/gemma-3-270m-it-Q4_K_M.gguf --alias gemma3:270m

# Use with llm
llm -m gemma3:270m "Your prompt here"
```

## Development

```bash
# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run a single test file
pytest tests/test_foo.py

# Lint
ruff check .

# Format
ruff format .
```

## Publishing to PyPI

### Prerequisites

1. Create a PyPI account at https://pypi.org/
2. Install build tools:

```bash
pip install build twine
```

### Build and Publish

1. Update version in `pyproject.toml`:

```toml
[project]
version = "0.1.0"
```

2. Build the package:

```bash
python -m build
```

This creates distributable archives in `dist/`.

3. Upload to PyPI:

```bash
twine upload dist/*
```

You'll be prompted for your PyPI username and password.

For Test PyPI (testing first):

```bash
twine upload --repository testpypi dist/*
```

### Using uv (Alternative)

```bash
# Install uv if not already
pip install uv

# Build
uv build

# Publish to PyPI
uv publish

# Or Test PyPI
uv publish --test
```

## License

MIT
