Metadata-Version: 2.4
Name: llmstack-cli
Version: 0.1.0
Summary: One command. Full LLM stack. Zero config.
Author: mara-werils
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: ai,cli,docker,inference,llm,openai,rag
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Requires-Dist: docker>=7.0
Requires-Dist: httpx>=0.27
Requires-Dist: psutil>=5.9
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: fastapi>=0.115; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: starlette>=0.40; extra == 'dev'
Provides-Extra: gateway
Requires-Dist: fastapi>=0.115; extra == 'gateway'
Requires-Dist: starlette>=0.40; extra == 'gateway'
Requires-Dist: uvicorn[standard]>=0.30; extra == 'gateway'
Description-Content-Type: text/markdown

<div align="center">
  <h1 align="center">llmstack</h1>
  <p align="center"><strong>One command. Full LLM stack. Zero config.</strong></p>
  <p align="center">Stop wiring Docker containers. Start building AI apps.</p>
</div>

<p align="center">
  <a href="https://pypi.org/project/llmstack-cli/"><img src="https://img.shields.io/pypi/v/llmstack-cli?color=blue" alt="PyPI"></a>
  <a href="https://github.com/mara-werils/llmstack/actions/workflows/ci.yml"><img src="https://github.com/mara-werils/llmstack/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="https://github.com/mara-werils/llmstack/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-Apache%202.0-green" alt="License"></a>
  <a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.11+-blue" alt="Python"></a>
</p>

---

**llmstack** spins up a production-grade LLM stack locally with a single command. It auto-detects your hardware, picks the optimal inference backend, and wires everything together.

```bash
pip install llmstack-cli
llmstack init
llmstack up
```

That's it. You now have a full LLM API running locally.

## Architecture

```
                          llmstack up
                               |
                    +----------v---------+
                    |  Hardware Detect   |
                    |NVIDIA / Apple / CPU|
                    +----------+---------+
                               |
         +----------+----------+----------+-----------+
         |          |          |          |           |
     +---v----+ +---v----+ +---v----+ +---v----+ +----v-----+
     | Qdrant | | Redis  | | Ollama | |  TEI   | |  Gateway |
     | Vector | | Cache  | |   or   | | Embed  | |  FastAPI |
     |   DB   | |        | |  vLLM  | |        | |  OpenAI- |
     +--------+ +--------+ +--------+ +--------+ |compatible|
       :6333      :6379     :11434      :8002    +----+-----+
                                                      | :8000
                                                +-----v------+
                                                | Prometheus |
                                                | + Grafana  |
                                                +------------+
                                                    :8080
```

## What you get

| Layer | Service | Default | Port |
|-------|---------|---------|------|
| Inference | Ollama / vLLM (auto) | llama3.2 | 11434 |
| Embeddings | TEI / Ollama (auto) | bge-m3 | 8002 |
| Vector DB | Qdrant | - | 6333 |
| Cache | Redis | 256MB LRU | 6379 |
| API Gateway | FastAPI (OpenAI-compatible) | auth + rate limit | 8000 |
| Dashboard | Grafana + Prometheus | pre-built panels | 8080 |

## How it works

```
llmstack init       # Detects hardware, generates llmstack.yaml
                    # Picks optimal backend: vLLM for NVIDIA 16GB+, Ollama otherwise

llmstack up         # Boots services in order with health checks:
                    # Qdrant -> Redis -> Inference -> Embeddings -> Gateway -> Metrics

llmstack status     # Shows health of all running services
llmstack logs ollama # Stream inference logs
llmstack down       # Stops everything
```
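
Under the hood this is ordered container startup with health-check polling. Here is a rough sketch of the idea using the Docker SDK and httpx that llmstack already depends on; the images, ports, and health URLs are illustrative, not llmstack's actual code:

```python
# Illustrative sketch only -- not llmstack's actual implementation.
# Assumes Docker is running and each service exposes an HTTP health URL.
import time

import docker
import httpx

client = docker.from_env()

# Hypothetical boot order mirroring the sequence described above.
SERVICES = [
    ("qdrant", "qdrant/qdrant:latest", {"6333/tcp": 6333}, "http://localhost:6333/readyz"),
    ("redis", "redis:7-alpine", {"6379/tcp": 6379}, None),  # no HTTP health endpoint
    ("ollama", "ollama/ollama:latest", {"11434/tcp": 11434}, "http://localhost:11434/api/tags"),
]

def wait_healthy(url: str, timeout: float = 120.0) -> None:
    """Poll an HTTP endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if httpx.get(url, timeout=2.0).status_code == 200:
                return
        except httpx.HTTPError:
            pass
        time.sleep(1.0)
    raise TimeoutError(f"service at {url} never became healthy")

for name, image, ports, health_url in SERVICES:
    client.containers.run(image, name=name, detach=True, ports=ports)
    if health_url:
        wait_healthy(health_url)
```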

### Use the API

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello!"}]}'
```

Works with **any OpenAI-compatible client**: LangChain, LlamaIndex, Vercel AI SDK, openai-python.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_KEY")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
```
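
Streaming works through the same client (the gateway speaks SSE, see the comparison table below):

```python
# Stream tokens as they are generated (SSE under the hood).
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```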

## Auto hardware detection

| Your hardware | Backend | Why |
|---|---|---|
| NVIDIA GPU 16GB+ VRAM | vLLM | Max throughput, PagedAttention |
| NVIDIA GPU <16GB | Ollama | Lower memory overhead |
| Apple Silicon (M1-M4) | Ollama | Metal acceleration |
| CPU only | Ollama | GGUF quantized models |
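
The decision in the table above boils down to something like the following sketch (illustrative only, not llmstack's real detection code):

```python
# Illustrative only -- llmstack's actual detection logic may differ.
import shutil
import subprocess

def pick_backend() -> str:
    """Choose an inference backend following the table above."""
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        vram_mb = max(int(line) for line in out.splitlines() if line.strip())
        return "vllm" if vram_mb >= 16 * 1024 else "ollama"
    # Apple Silicon and plain CPU both fall back to Ollama.
    return "ollama"
```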

## Presets

```bash
llmstack init --preset chat    # Minimal: inference + cache + gateway
llmstack init --preset rag     # + Qdrant + embeddings for RAG apps
llmstack init --preset agent   # 70B model + 16K context + longer timeouts
```

## Configuration

One file: `llmstack.yaml`

```yaml
version: "1"

models:
  chat:
    name: llama3.2
    backend: auto              # auto | ollama | vllm
    context_length: 8192
  embeddings:
    name: bge-m3

services:
  vectors:
    provider: qdrant
    port: 6333
  cache:
    provider: redis
    max_memory: 256mb

gateway:
  port: 8000
  auth: api_key
  rate_limit: 100/min
  cors: ["*"]

observe:
  metrics: true
  dashboard_port: 8080
```
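
Config parsing is built on Pydantic v2 (see Tech stack). A sketch of how `llmstack.yaml` can be validated; the model classes here simply mirror the YAML above and are not llmstack's actual classes:

```python
# Sketch of validating llmstack.yaml with Pydantic v2 + PyYAML (illustrative only).
from typing import Literal

import yaml
from pydantic import BaseModel

class ChatModel(BaseModel):
    name: str
    backend: Literal["auto", "ollama", "vllm"] = "auto"
    context_length: int = 8192

class EmbeddingsModel(BaseModel):
    name: str

class Models(BaseModel):
    chat: ChatModel
    embeddings: EmbeddingsModel

class Gateway(BaseModel):
    port: int = 8000
    auth: str = "api_key"
    rate_limit: str = "100/min"
    cors: list[str] = ["*"]

class StackConfig(BaseModel):
    version: str
    models: Models
    gateway: Gateway

with open("llmstack.yaml") as f:
    cfg = StackConfig.model_validate(yaml.safe_load(f))
print(cfg.models.chat.backend)
```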

## CLI

| Command | Description |
|---------|-------------|
| `llmstack init [--preset]` | Create config with smart defaults |
| `llmstack up [--attach]` | Start all services |
| `llmstack down [--volumes]` | Stop and clean up |
| `llmstack status` | Health check all services |
| `llmstack logs <service>` | Stream service logs |
| `llmstack doctor` | Diagnose system issues |
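
The CLI itself is a Typer app (see Tech stack). A minimal skeleton in the spirit of the commands above, not the project's real source:

```python
# Illustrative Typer + Rich skeleton -- not llmstack's actual CLI code.
import typer
from rich.console import Console
from rich.table import Table

app = typer.Typer(help="One command. Full LLM stack. Zero config.")
console = Console()

@app.command()
def status() -> None:
    """Health check all services."""
    table = Table("Service", "Port", "Health")
    # In the real tool these rows would come from live health probes.
    table.add_row("qdrant", "6333", "up")
    table.add_row("gateway", "8000", "up")
    console.print(table)

@app.command()
def down(volumes: bool = typer.Option(False, "--volumes")) -> None:
    """Stop and clean up."""
    console.print(f"Stopping services (volumes={volumes})")

if __name__ == "__main__":
    app()
```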

## Observability

When `observe.metrics: true`, llmstack boots Prometheus + Grafana with a pre-built dashboard:

- **Request rate** per endpoint
- **Latency** p50 / p99 histograms
- **Token throughput** (input + output)
- **Error rate** (4xx / 5xx)
- **Service health** (up/down)

Access at `http://localhost:8080` (login: admin / llmstack)
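
Prefer raw numbers over the dashboard? Prometheus's HTTP API can be queried directly. The port and metric name below are assumptions (Prometheus's default port and a common naming convention), not something llmstack documents:

```python
# Query Prometheus directly with httpx.
# Assumptions: Prometheus is reachable on its default port 9090 and the
# gateway exports a counter named something like llmstack_requests_total.
import httpx

resp = httpx.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "rate(llmstack_requests_total[5m])"},
)
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```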

## Plugins

Extend llmstack with new backends via pip:

```bash
pip install llmstack-cli-plugin-chromadb
# then set vectors.provider: chromadb in llmstack.yaml
```

Create your own: implement `ServiceBase` and register it through Python entry points. See [CONTRIBUTING.md](CONTRIBUTING.md).
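
A plugin is essentially a service class plus an entry point, roughly like the sketch below. `ServiceBase`'s real interface lives in llmstack; the method names, attributes, and entry-point group shown here are assumptions, so check CONTRIBUTING.md for the actual contract.

```python
# Illustrative plugin skeleton. The Protocol below only stands in for
# llmstack's real ServiceBase so this sketch is self-contained; its
# attribute and method names are assumptions.
from typing import Protocol

class ServiceBase(Protocol):  # stand-in for llmstack's real ServiceBase
    name: str
    image: str
    port: int
    def healthcheck_url(self) -> str: ...

class ChromaDBService:
    name = "chromadb"
    image = "chromadb/chroma:latest"
    port = 8001

    def healthcheck_url(self) -> str:
        # e.g. Chroma's heartbeat endpoint
        return f"http://localhost:{self.port}/api/v1/heartbeat"

# In pyproject.toml the plugin would then register the class under an
# entry-point group (group name here is an assumption):
# [project.entry-points."llmstack.services"]
# chromadb = "llmstack_cli_plugin_chromadb.service:ChromaDBService"
```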

## Why llmstack?

| | llmstack | Ollama | Harbor | AnythingLLM | LiteLLM |
|---|---|---|---|---|---|
| One-command full stack | Yes | No (inference only) | Partial | Partial | No (proxy only) |
| Auto hardware detection | Yes | No | No | No | No |
| OpenAI-compatible API | Yes | Yes | Varies | No | Yes |
| Built-in vector DB | Yes | No | Config needed | Bundled | No |
| Built-in embeddings | Yes | No | No | Bundled | No |
| Caching (Redis) | Yes | No | No | No | No |
| Auth + rate limiting | Yes | No | No | Yes | Yes |
| Observability dashboard | Yes | No | Partial | No | Partial |
| Plugin ecosystem | Yes | No | No | No | No |
| SSE streaming | Yes | Yes | Yes | Yes | Yes |

## Tech stack

- **CLI**: [Typer](https://typer.tiangolo.com/) + [Rich](https://rich.readthedocs.io/)
- **Config**: [Pydantic v2](https://docs.pydantic.dev/)
- **Gateway**: [FastAPI](https://fastapi.tiangolo.com/)
- **Containers**: [Docker SDK for Python](https://docker-py.readthedocs.io/)
- **Metrics**: Prometheus + Grafana

## Requirements

- Python 3.11+
- Docker

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.

## License

Apache-2.0
