Metadata-Version: 2.4
Name: kestrel
Version: 0.2.1
Summary: a fast, efficient inference engine for moondream
Requires-Python: <3.14,>=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch<2.11,>=2.4
Requires-Dist: tokenizers>=0.15
Requires-Dist: safetensors>=0.4
Requires-Dist: torch-c-dlpack-ext>=0.1.3
Requires-Dist: httpx>=0.27
Requires-Dist: apache-tvm-ffi<0.2,>=0.1.5
Requires-Dist: kestrel-native==0.1.3
Requires-Dist: kestrel-kernels==0.2.1; sys_platform == "linux"
Requires-Dist: huggingface-hub>=0.20

# Kestrel

![Kestrel Overview](https://raw.githubusercontent.com/m87-labs/kestrel/main/assets/kestrel-overview.png)

High-performance inference engine for the [Moondream](https://moondream.ai) vision-language model.

Kestrel is the inference engine behind [Photon](https://moondream.ai/p/photon), Moondream's on-device deployment option. Most users should install via `pip install moondream` — this repo is the internal engine for those who need direct access.

Kestrel provides async, micro-batched inference with streaming support, paged KV caching, and optimized CUDA kernels. It's designed for production deployments where throughput and latency matter.

## Features

- **Async micro-batching** — Cooperative scheduler batches heterogeneous requests without compromising per-request latency
- **Streaming** — Real-time token streaming for query and caption tasks
- **Multi-task** — Visual Q&A, captioning, point detection, object detection, and segmentation
- **Paged KV cache** — Efficient memory management for high concurrency
- **Prefix caching** — Radix tree-based caching for repeated prompts and images
- **LoRA adapters** — Parameter-efficient fine-tuning support with automatic cloud loading

## Requirements

- Python 3.10 to 3.13
- NVIDIA GPU. Optimized kernels are provided for SM80 (A100), SM86 (A40, A10, RTX 3090), SM87 (Jetson Orin), SM89 (L4, L40S), and SM90 (H100, H200, GH200). Other GPUs may work but have not been tested.
- `MOONDREAM_API_KEY` environment variable or `api_key` parameter (get a key from [moondream.ai](https://moondream.ai))

## Installation

```bash
pip install kestrel
```

For Jetson Orin, see the [Jetson setup guide](docs/jetson.md).

## Model Access

Kestrel supports both Moondream 3 and Moondream 2:

| Model | Repository | Notes |
|-------|------------|-------|
| Moondream 2 | [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2) | Public, no approval needed |
| Moondream 3 | [moondream/moondream3-preview](https://huggingface.co/moondream/moondream3-preview) | Gated; access is granted automatically on request |

For Moondream 3, request access (granted automatically), then authenticate with `huggingface-cli login` or set `HF_TOKEN`.

## Quick Start

```python
import asyncio

from kestrel.config import RuntimeConfig
from kestrel.engine import InferenceEngine


async def main():
    # Weights are automatically downloaded from HuggingFace on first run.
    # Use model="moondream2" or model="moondream3-preview".
    cfg = RuntimeConfig(model="moondream2")

    # Create the engine (loads model and warms up)
    engine = await InferenceEngine.create(cfg, api_key="your-key-here")

    try:
        # Load an image (JPEG, PNG, or WebP bytes)
        with open("photo.jpg", "rb") as f:
            image = f.read()

        # Visual question answering
        result = await engine.query(
            image=image,
            question="What's in this image?",
            settings={"temperature": 0.2, "max_tokens": 512},
        )
        print(result.output["answer"])
    finally:
        # Clean up, even if the request above fails
        await engine.shutdown()


asyncio.run(main())
```

## Tasks

Kestrel supports several vision-language tasks through dedicated methods on the engine.

### Query (Visual Q&A)

Ask questions about an image:

```python
result = await engine.query(
    image=image,
    question="How many people are in this photo?",
    settings={
        "temperature": 0.2,  # Lower = more deterministic
        "top_p": 0.9,
        "max_tokens": 512,
    },
)
print(result.output["answer"])
```

### Caption

Generate image descriptions:

```python
result = await engine.caption(
    image,
    length="normal",  # "short", "normal", or "long"
    settings={"temperature": 0.2, "max_tokens": 512},
)
print(result.output["caption"])
```

### Point

Locate objects as normalized (x, y) coordinates:

```python
result = await engine.point(image, "person")
print(result.output["points"])
# [{"x": 0.5, "y": 0.3}, {"x": 0.8, "y": 0.4}]
```

Coordinates are normalized to [0, 1], where (0, 0) is the top-left corner of the image.
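
To draw or crop at these locations, the normalized values need to be scaled by the image size. The following is a minimal sketch of that conversion; `to_pixel_points` is a hypothetical helper, not part of Kestrel, and it assumes you already know the image width and height in pixels.

```python
def to_pixel_points(points, width, height):
    """Scale normalized points (dicts with "x" and "y") to pixel coordinates.

    Hypothetical helper, not part of Kestrel. Values are rounded to the
    nearest pixel and clamped so that x == 1.0 or y == 1.0 still maps to a
    valid index.
    """
    pixels = []
    for p in points:
        px = min(max(round(p["x"] * width), 0), width - 1)
        py = min(max(round(p["y"] * height), 0), height - 1)
        pixels.append((px, py))
    return pixels
```

For example, `to_pixel_points(result.output["points"], 640, 480)` would return a list of `(x, y)` integer tuples for a 640x480 image.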

### Detect

Detect objects as bounding boxes:

```python
result = await engine.detect(
    image,
    "car",
    settings={"max_objects": 10},
)
print(result.output["objects"])
# [{"x_min": 0.1, "y_min": 0.2, "x_max": 0.5, "y_max": 0.6}, ...]
```

Bounding box coordinates are normalized to [0, 1].
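
The same scaling applies to boxes. The sketch below converts one detection into a `(left, top, right, bottom)` pixel tuple, which is the ordering Pillow's `Image.crop` expects; `box_to_pixels` is a hypothetical helper and the image size is assumed to be known.

```python
def box_to_pixels(box, width, height):
    """Convert a normalized box dict to (left, top, right, bottom) pixels.

    Hypothetical helper, not part of Kestrel. `box` is expected to carry the
    "x_min", "y_min", "x_max", "y_max" keys shown in the detect output.
    """
    left = round(box["x_min"] * width)
    top = round(box["y_min"] * height)
    right = round(box["x_max"] * width)
    bottom = round(box["y_max"] * height)
    return (left, top, right, bottom)
```

A typical use would be `[box_to_pixels(obj, w, h) for obj in result.output["objects"]]`.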

### Segment

Generate a segmentation mask (Moondream 3 only):

```python
result = await engine.segment(image, "dog")
seg = result.output["segments"][0]
print(seg["svg_path"])  # SVG path data for the mask
print(seg["bbox"])      # {"x_min": ..., "y_min": ..., "x_max": ..., "y_max": ...}
```

Note: Segmentation requires Moondream 3 and separate model weights. Contact [moondream.ai](https://moondream.ai) for access.

## Streaming

For longer responses, you can stream tokens as they're generated:

```python
with open("photo.jpg", "rb") as f:
    image = f.read()

stream = await engine.query(
    image=image,
    question="Describe this scene in detail.",
    stream=True,
    settings={"max_tokens": 1024},
)

# Print tokens as they arrive
async for chunk in stream:
    print(chunk.text, end="", flush=True)

# Get the final result with metrics
result = await stream.result()
print(f"\n\nGenerated {result.metrics.output_tokens} tokens")
```

Streaming is supported for the `query` and `caption` methods.
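
If you want the full text as well as incremental output, one option is to accumulate chunks while iterating. This is a small sketch, assuming only that the stream is an async iterator whose chunks expose a `.text` attribute, as in the example above; `collect_text` is a hypothetical helper, not a Kestrel API.

```python
import asyncio


async def collect_text(stream, on_chunk=None):
    """Consume an async stream of chunks and return the concatenated text.

    Hypothetical helper, not part of Kestrel. `on_chunk`, if given, is called
    with each piece of text as it arrives (for example, to print it).
    """
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk.text)
        parts.append(chunk.text)
    return "".join(parts)
```

With the stream from the example above, `text = await collect_text(stream, on_chunk=lambda t: print(t, end="", flush=True))` would print tokens as they arrive and return the complete answer.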

## Response Format

All methods return an `EngineResult` with these fields:

```python
result.output          # Dict with task-specific output ("answer", "caption", "points", etc.)
result.finish_reason   # "stop" (natural end) or "length" (hit max_tokens)
result.metrics         # Timing and token counts
```

The `metrics` object contains:

```python
result.metrics.input_tokens     # Number of input tokens (including image)
result.metrics.output_tokens    # Number of generated tokens
result.metrics.prefill_time_ms  # Time to process input
result.metrics.decode_time_ms   # Time to generate output
result.metrics.ttft_ms          # Time to first token
```
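
These fields can be combined into a decode throughput figure. The sketch below assumes `decode_time_ms` is in milliseconds, as the name suggests; `decode_tokens_per_second` is a hypothetical helper, not part of Kestrel.

```python
def decode_tokens_per_second(output_tokens, decode_time_ms):
    """Return generated tokens per second of decode time.

    Hypothetical helper, not part of Kestrel. Returns 0.0 when no decode
    time was recorded, to avoid dividing by zero.
    """
    if decode_time_ms <= 0:
        return 0.0
    return output_tokens / (decode_time_ms / 1000.0)
```

For a result object, this would be called as `decode_tokens_per_second(result.metrics.output_tokens, result.metrics.decode_time_ms)`.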

## Using Finetunes

If you've created a finetuned model through the [Moondream API](https://moondream.ai), you can use it by passing the adapter ID:

```python
result = await engine.query(
    image=image,
    question="What's in this image?",
    settings={"adapter": "01J5Z3NDEKTSV4RRFFQ69G5FAV@1000"},
)
```

The adapter ID format is `{finetune_id}@{step}` where:
- `finetune_id` is the ID of your finetune job
- `step` is the training step/checkpoint to use

Adapters are automatically downloaded and cached on first use.
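
It can be useful to validate an adapter ID before sending a request. The following sketch splits the `{finetune_id}@{step}` format described above. It assumes the step is a non-negative integer, as in the example ID; `AdapterRef` and `parse_adapter_id` are hypothetical names, not part of Kestrel.

```python
from typing import NamedTuple


class AdapterRef(NamedTuple):
    """Parsed adapter ID (hypothetical type, not part of Kestrel)."""

    finetune_id: str
    step: int


def parse_adapter_id(adapter_id):
    """Split "{finetune_id}@{step}" into its two parts.

    Assumption: the step is a non-negative integer. Raises ValueError if the
    string does not match that shape.
    """
    finetune_id, sep, step = adapter_id.rpartition("@")
    if not sep or not finetune_id or not step.isdigit():
        raise ValueError(f"expected '<finetune_id>@<step>', got {adapter_id!r}")
    return AdapterRef(finetune_id, int(step))
```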

## Configuration

### RuntimeConfig

```python
RuntimeConfig(
    model="moondream3-preview",  # or "moondream2"
    max_batch_size=4,            # Max concurrent requests
)
```
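
Because the engine methods are coroutines, one way to give the scheduler several requests to batch is to issue them concurrently. The sketch below assumes `engine.query` can be awaited from multiple tasks at once and returns the result shape shown in Quick Start. `query_many` and `max_in_flight` are hypothetical names, not part of Kestrel, and the right relationship between `max_in_flight` and `max_batch_size` is not specified here.

```python
import asyncio


async def query_many(engine, images, question, max_in_flight=4):
    """Ask the same question about several images concurrently.

    Hypothetical helper, not part of Kestrel. A semaphore caps how many
    requests are awaiting the engine at once; answers are returned in the
    same order as `images`.
    """
    limit = asyncio.Semaphore(max_in_flight)

    async def ask(image):
        async with limit:
            result = await engine.query(image=image, question=question)
            return result.output["answer"]

    return await asyncio.gather(*(ask(image) for image in images))
```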

### Environment Variables

| Variable | Description |
|----------|-------------|
| `MOONDREAM_API_KEY` | Required unless `api_key` is passed to `InferenceEngine.create`. Get a key from [moondream.ai](https://moondream.ai). |
| `HF_HOME` | Override HuggingFace cache directory for downloaded weights (default: `~/.cache/huggingface`). |
| `HF_TOKEN` | HuggingFace token for gated models like Moondream 3. Alternatively, run `huggingface-cli login`. |
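
If your application accepts the key from more than one place, it may help to resolve it once and fail early. This is a minimal sketch; `resolve_api_key` is a hypothetical helper, and giving the explicit argument precedence over the environment variable is an assumption of this example, not documented Kestrel behavior.

```python
import os


def resolve_api_key(explicit=None):
    """Return an API key from the argument or MOONDREAM_API_KEY.

    Hypothetical helper, not part of Kestrel. The explicit argument wins if
    both are set (an assumption of this sketch). Raises RuntimeError if
    neither is available.
    """
    key = explicit or os.environ.get("MOONDREAM_API_KEY")
    if not key:
        raise RuntimeError(
            "No API key found: pass api_key or set MOONDREAM_API_KEY"
        )
    return key
```

The returned value could then be passed as `api_key=resolve_api_key()` when creating the engine.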

## Triton Inference Server

Kestrel can be deployed as a [Triton Inference Server](https://github.com/triton-inference-server/server) backend. See the [Triton setup guide](triton/README.md).

## Benchmarks

Throughput and latency for the `query` task are tracked in [PERFORMANCE.md](./PERFORMANCE.md), with results broken out by GPU.

## License

Kestrel requires a Moondream API key for billing. See [moondream.ai/pricing](https://moondream.ai/pricing) for plans.
