Metadata-Version: 2.4
Name: ai-track
Version: 0.1.9
Summary: Universal AI runtime for local and remote inference.
License-Expression: MIT
Project-URL: Source, https://github.com/langelabs/ai-track
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=1.0
Requires-Dist: pydantic>=2.0
Requires-Dist: numpy>=1.26
Requires-Dist: pillow>=10.0
Requires-Dist: huggingface-hub>=0.24
Requires-Dist: tqdm>=4.66
Provides-Extra: macos
Requires-Dist: mlx>=0.24; sys_platform == "darwin" and extra == "macos"
Requires-Dist: mlx-embeddings>=0.0.4; sys_platform == "darwin" and extra == "macos"
Requires-Dist: mlx-lm>=0.24; sys_platform == "darwin" and extra == "macos"
Requires-Dist: mlx-vlm>=0.4; sys_platform == "darwin" and extra == "macos"
Requires-Dist: mlx-audio>=0.4; sys_platform == "darwin" and extra == "macos"
Requires-Dist: mflux>=0.5; sys_platform == "darwin" and extra == "macos"
Provides-Extra: cuda
Requires-Dist: torch>=2.2; sys_platform == "linux" and extra == "cuda"
Requires-Dist: vllm<0.21,>=0.20; sys_platform == "linux" and extra == "cuda"
Requires-Dist: transformers>=4.45; sys_platform == "linux" and extra == "cuda"
Requires-Dist: accelerate>=1.0; sys_platform == "linux" and extra == "cuda"
Requires-Dist: diffusers>=0.30; sys_platform == "linux" and extra == "cuda"
Dynamic: license-file

[![CI](https://github.com/langelabs/ai-track/actions/workflows/ci.yml/badge.svg)](https://github.com/langelabs/ai-track/actions/workflows/ci.yml)

![ai-track logo](https://raw.githubusercontent.com/langelabs/ai-track/main/assets/logo_light.png)

`ai-track` is a universal AI runtime library for local and remote inference.
It chooses the best available execution tier automatically, keeps the core
package lightweight, and exposes an OpenAI-style client surface so application
code can stay backend-agnostic.

## What it does

- Routes requests through local or remote inference automatically.
- Supports macOS MLX backends for on-device inference.
- Supports CUDA backends for GPU inference with vLLM and Hugging Face models.
- Falls back to a remote OpenAI-compatible client when no local backend fits.
- Exposes a familiar client surface for chat, embeddings, images, audio, and
  transcription.

## Architecture

The codebase is split into the following layers:

- `track.inference` contains the runtime primitives and backend implementations.
- `track.hub` contains the public routing layer that decides whether a model
  should use local inference or a remote client.
- `track.contracts` contains shared dataclasses, protocols, and base interfaces.
- `track.utils` contains shared helper functions for devices, storage, audio,
  chat message handling, and transcription input preparation.

The runtime is centered around `LocalAI`, which can manage:

- chat generation
- embeddings
- image generation
- text-to-speech
- speech-to-text transcription
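
As a sketch, the `chat_config` and `transcription_config` keyword arguments from
the examples later in this README can be combined on a single runtime (combining
them here is an assumption; the model ids are the same placeholders used in those
examples):

```python
from track.inference import AiModel, InferenceConfig, LocalAI, TranscriptionModelConfig

runtime = LocalAI(
    # Local chat model with its generation settings.
    chat_config=AiModel(
        default=True,
        location="local",
        type="llm",
        status="available",
        model="mlx-community/qwen2",
        alias="Qwen2",
        inference_config=InferenceConfig(max_tokens=256, temperature=0.2),
    ),
    # Local speech-to-text model.
    transcription_config=TranscriptionModelConfig(
        model_id="openai/whisper-small",
        alias="Whisper Small",
    ),
)
```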

### Runtime selection

The runtime chooses a backend automatically when you do not pass one explicitly:

- macOS resolves to the MLX backend
- CUDA-capable Linux systems resolve to the CUDA backend
- everything else stays available through the remote OpenAI-compatible path

You can still force a backend explicitly when you need to.
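
For example, to pin the runtime to the CUDA backend instead of relying on
auto-detection (the `backend="cuda"` keyword also appears in the transcription
example below; constructing the runtime without any model config here is purely
illustrative):

```python
from track.inference import LocalAI

# Bypass platform auto-detection and select the CUDA backend explicitly.
runtime = LocalAI(backend="cuda")
```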

## Local-first routing

Routing is local-first:

1. The hub checks whether the selected model is local.
2. If the runtime can serve it locally, the request stays on-device.
3. Otherwise the hub falls back to a remote OpenAI-compatible client.

This keeps local inference fast and private when available while preserving a
reliable remote fallback.

## Public API

The main entrypoints are:

```python
from track import hub, inference
from track.hub import AiHub
from track.inference import LocalAI
```

`LocalAI` exposes the local runtime directly and can also return an
OpenAI-style client.

`AiHub` resolves a final client for a selected model and is the preferred way to
route requests from application code.

## OpenAI-style client

The local compatibility layer mirrors the shape of the OpenAI Python client.
It supports:

- `client.chat.completions.create(...)`
- `client.embeddings.create(...)`
- `client.images.generate(...)`
- `client.audio.speech.create(...)`
- `client.audio.transcriptions.create(...)`

### Example: chat

```python
from track.hub import AiHub
from track.inference import AiModel, InferenceConfig, LocalAI

# Declare a local chat model and its generation settings.
chat_model = AiModel(
    default=True,
    location="local",
    type="llm",
    status="available",
    model="mlx-community/qwen2",
    alias="Qwen2",
    inference_config=InferenceConfig(max_tokens=256, temperature=0.2),
)

# The runtime serves the model locally and keeps a remote fallback configured.
runtime = LocalAI(
    chat_config=chat_model,
    remote_api_key="sk-example",
    remote_base_url="https://openrouter.ai/api/v1",
)

# The hub resolves the final client: local when possible, remote otherwise.
hub = AiHub(local_ai=runtime)
client = hub.get_client(chat_model)

response = client.chat.completions.create(
    model=chat_model.model,
    messages=[
        {"role": "user", "content": "Summarize this architecture."},
    ],
)

print(response.choices[0].message["content"])
```
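
### Example: embeddings

Embeddings go through the same client surface. The sketch below reuses the
`client` from the chat example above; the embedding model id is a placeholder,
and the response is assumed to mirror the OpenAI embeddings shape:

```python
embedding_response = client.embeddings.create(
    model="your-org/your-embedding-model",
    input=["Summarize this architecture.", "Local-first routing"],
)

# Assuming an OpenAI-shaped response: one embedding vector per input string.
print(len(embedding_response.data[0].embedding))
```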

### Example: transcription

```python
from track.inference import LocalAI, TranscriptionModelConfig

runtime = LocalAI(
    backend="cuda",
    transcription_config=TranscriptionModelConfig(
        model_id="openai/whisper-small",
        alias="Whisper Small",
    ),
)

result = runtime.transcribe("sample.wav")
print(result.text)
```

### Example: OpenAI-style transcription

The same transcription can also go through the OpenAI-style client, reusing the
`runtime` configured in the previous example:

```python
client = runtime.get_client()
result = client.audio.transcriptions.create(
    model="openai/whisper-small",
    file="sample.wav",
)
print(result.text)
```

## Installation

The core package is intentionally small and works without the optional local
backends.

### Core install

```bash
uv sync
```

If you want to install from PyPI with `pip`, use:

```bash
pip install ai-track
```

For the latest pre-release published from the main branch, use:

```bash
pip install --pre ai-track
```

### macOS MLX extras

```bash
uv sync --extra macos
```

For `pip`:

```bash
pip install "ai-track[macos]"
```

The macOS extra installs the full MLX runtime stack used by local inference,
including the base `mlx` package alongside `mlx-embeddings`, `mlx-lm`,
`mlx-vlm`, `mlx-audio`, and `mflux`.

MLX chat support is limited to model architectures that the installed
`mlx_vlm` package can actually load. If `mlx_vlm` does not support a model's
chat architecture, register that model only for the modalities you intend to
use instead of advertising generic text/chat support.

For embeddings, MLX checkpoints that expose a native `.embed()` method are
used directly. Embedding-focused MLX checkpoints that rely on the
`mlx-embeddings` loader are also supported. Generic MLX checkpoints can be
used for embeddings through hidden-state fallback pooling when the full MLX
stack is installed.

For local embedding-only models, declare explicit capabilities so downstream
apps do not boot the MLX chat backend unnecessarily:

```python
from track.contracts import AiModel, AiModelCapabilities

embedding_model = AiModel(
    provider="local",
    model_id="your-org/your-embedding-model",
    alias="embedding-model",
    capabilities=AiModelCapabilities(
        embedding_input=True,
        embedding_output=True,
    ),
)
```

### CUDA extras

```bash
uv sync --extra cuda
```

For `pip`:

```bash
pip install "ai-track[cuda]"
```

The CUDA extra brings in the GPU-oriented runtime stack, including vLLM,
Transformers, Diffusers, and PyTorch-based helpers.

`ai-track` validates CUDA support against the pinned `vllm` 0.20.x minor line,
so the dependency is capped rather than left as an open-ended lower bound; newer
`vllm` releases are not assumed to be compatible until they have been validated.

For local embedding-only models, declare explicit capabilities so downstream
apps do not register unrelated modalities and accidentally boot unused CUDA
backends:

```python
from track.contracts import AiModel, AiModelCapabilities

embedding_model = AiModel(
    provider="local",
    model_id="Qwen/Qwen3-Embedding-0.6B",
    alias="qwen-embedding",
    capabilities=AiModelCapabilities(
        embedding_input=True,
        embedding_output=True,
    ),
)
```

## Testing

Run the full unit suite with:

```bash
uv run pytest -q tests
```

The tests focus on:

- hub routing decisions
- backend selection
- OpenAI-style client compatibility
- multimodal cleanup behavior
- transcription support
- CUDA factory selection

## Development notes

- Prefer `track.hub` for routing decisions.
- Keep optional imports lazy so the core package stays importable without MLX
  or CUDA dependencies (see the sketch after this list).
- Add docstrings and type hints to new helpers and edited functions.
- Reuse shared helpers where both MLX and CUDA backends need the same logic.
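
A minimal sketch of the lazy-import pattern from the list above (the helper name
is illustrative, not an existing function in the package):

```python
import importlib
from typing import Any


def load_optional_backend(module_name: str = "mlx.core") -> Any:
    """Import an optional backend module only when it is actually needed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:  # core install without the "macos" or "cuda" extra
        raise RuntimeError(
            f"{module_name} is required for this backend; install the matching "
            "extra, e.g. pip install 'ai-track[macos]' or 'ai-track[cuda]'"
        ) from exc
```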
