Metadata-Version: 2.4
Name: ai-factory-sdk
Version: 0.2.0.dev5
Summary: Python SDK for the AI Factory Compute API
Project-URL: Homepage, https://pypi.org/project/ai-factory-sdk/
Author-email: AI Factory Team <software.platform@ai-at.eu>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: httpx>=0.28
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: typer>=0.12
Description-Content-Type: text/markdown

# AI Factory SDK

Python SDK for the [AI Factory Compute API](https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1/docs) — submit and manage HPC jobs from Python.

## Features

- Synchronous and asynchronous clients (`AIFactoryClient`, `AsyncAIFactoryClient`)
- Typed request/response models with Pydantic validation
- Job polling with configurable timeout and retry (`client.wait()`)
- Automatic retry on transient errors (429, 5xx)
- PEP 561 compatible — full type annotation coverage
- `ai-factory` CLI for shell workflows (`ai-factory jobs list/get/submit-container/cancel`)

## Installation

```bash
pip install ai-factory-sdk
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv add ai-factory-sdk
```

### Pre-release versions

Development builds published from the `dev` branch use PEP 440 pre-release
suffixes (e.g., `0.2.0.dev1`). Install them with:

```bash
pip install ai-factory-sdk --pre
```

## Quick Start

```python
from ai_factory.sdk import AIFactoryClient, JobRequest

# Credentials resolve from ~/.ai-factory/config.yaml, env vars, or constructor
# args (see "Configuration" below). Passed explicitly here for clarity:
with AIFactoryClient(
    api_key="dev-portal-api-key",
    slurm_token="slurm-jwt",
    slurm_user="jane",
) as client:
    # Submit a job
    resp = client.submit_job(
        JobRequest(name="hello", script="#!/bin/bash\necho Hello from SLURM")
    )
    print(f"Submitted job {resp.job_id}")

    # Wait for completion
    if resp.job_id is not None:
        detail = client.wait(str(resp.job_id), timeout=3600)
        print(f"Job finished with status: {detail.status}")
```

### Async Usage

```python
import asyncio
from ai_factory.sdk import AsyncAIFactoryClient, JobRequest

async def main():
    async with AsyncAIFactoryClient(
        api_key="dev-portal-api-key",
        slurm_token="slurm-jwt",
        slurm_user="jane",
    ) as client:
        resp = await client.submit_job(
            JobRequest(name="async-job", script="#!/bin/bash\nsleep 10 && echo done")
        )
        if resp.job_id is not None:
            detail = await client.wait(str(resp.job_id))
            print(detail.status)

asyncio.run(main())
```

### Container Jobs

```python
from ai_factory.sdk import AIFactoryClient, ContainerJobRequest

with AIFactoryClient(
    api_key="dev-portal-api-key",
    slurm_token="slurm-jwt",
    slurm_user="jane",
) as client:
    resp = client.submit_container(
        ContainerJobRequest(
            name="gpu-training",
            image="docker://nvcr.io/nvidia/pytorch:24.01-py3",
            container_command="python train.py",
            gres="gpu:a40:1",
            time_limit=120,
        )
    )
```

## Configuration

The Compute API sits behind an APISIX gateway, so **two distinct credentials**
are required (see
[onboarding & auth flow](https://gitlab.tuwien.ac.at/ai-factory/monorepo/-/blob/dev/docs/architecture/onboarding-auth-flow.md)):

- **`api_key`** — the Developer Portal API key. The SDK sends it as the
  `apikey` header so APISIX's `key-auth` plugin lets the request through.
- **`slurm_token`** — the Slurm JWT (`scontrol token`). The Compute API
  forwards it to the upstream Slurm REST endpoints.

Credentials resolve from three sources, in priority order:

1. **Explicit constructor arguments**:
   `AIFactoryClient(api_key=..., slurm_token=..., slurm_user=...)` (the same
   arguments apply to `AsyncAIFactoryClient`).
2. **Environment variables**.
3. **YAML config file** at `~/.ai-factory/config.yaml`.

If a required value is missing from all three sources, the client constructor
raises `ValueError` with a message listing all three options.

| Parameter | Environment Variable | Config File Key | Default |
|-----------|---------------------|-----------------|---------|
| `base_url` | `AI_FACTORY_API_URL` | `api_url` | `https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1` |
| `api_key` | `AI_FACTORY_API_KEY` | `api_key` | *(required — Developer Portal key, sent as `apikey`)* |
| `slurm_token` | `AI_FACTORY_SLURM_TOKEN` | `slurm_token` | *(required — Slurm JWT, sent as `X-SLURM-USER-TOKEN`)* |
| `slurm_user` | `AI_FACTORY_SLURM_USER` | `slurm_user` | *(required)* |
| `timeout` | — | — | `30.0` (HTTP timeout in seconds) |
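
The priority order above can be illustrated with a small standalone sketch. This is not the SDK's implementation: `resolve_credential` and `ENV_VARS` are hypothetical names, and only the environment variable names and config keys are taken from the table.

```python
import os
from typing import Mapping, Optional

# Environment variable names come from the table above; everything else in
# this block is a hypothetical illustration, not SDK code.
ENV_VARS = {
    "api_key": "AI_FACTORY_API_KEY",
    "slurm_token": "AI_FACTORY_SLURM_TOKEN",
    "slurm_user": "AI_FACTORY_SLURM_USER",
}


def resolve_credential(
    name: str,
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    file_config: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the first value found: explicit argument, env var, config file."""
    env = os.environ if env is None else env
    file_config = {} if file_config is None else file_config

    if explicit is not None:          # 1. constructor argument wins
        return explicit
    if ENV_VARS[name] in env:         # 2. then the environment variable
        return env[ENV_VARS[name]]
    if name in file_config:           # 3. then the YAML config file
        return file_config[name]
    raise ValueError(
        f"Missing {name!r}: pass {name}=..., set {ENV_VARS[name]}, "
        f"or add {name!r} to ~/.ai-factory/config.yaml"
    )
```

Passing `env` and `file_config` as plain mappings keeps the sketch testable without touching the real environment or home directory.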

### Config file

Example `~/.ai-factory/config.yaml`:

```yaml
api_url: "https://aifactory.ai-factory.datalab.tuwien.ac.at/compute-api/v1"
api_key: "your-developer-portal-api-key"
slurm_token: "eyJhbGciOiJSUzI1NiIs..."   # scontrol token output
slurm_user: "jane.doe"
```

Secure the file so only your user can read it:

```bash
chmod 600 ~/.ai-factory/config.yaml
```

When a client is constructed, the SDK emits a `UserWarning` if the file is
group- or world-accessible. A malformed or unreadable file raises
`ConfigFileError` (a subclass of `SDKError`).
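
To check the file yourself before the SDK warns, a permission test along these lines should work on POSIX systems. This is a minimal sketch with a hypothetical helper name; the SDK's own check may differ in detail.

```python
import stat
from pathlib import Path


def is_group_or_world_accessible(path: Path) -> bool:
    """Return True if any group or other permission bit is set on the file."""
    mode = path.stat().st_mode
    # S_IRWXG / S_IRWXO cover read, write, and execute for group / others.
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


if __name__ == "__main__":
    config = Path.home() / ".ai-factory" / "config.yaml"
    if config.exists() and is_group_or_world_accessible(config):
        print(f"{config} is readable by other users; run: chmod 600 {config}")
```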

## Command-Line Interface

Installing the SDK also registers the `ai-factory` console script for users
who prefer shell workflows or want to drive the platform from bash:

```bash
# Set credentials once (or use ~/.ai-factory/config.yaml — same resolution chain as Client())
export AI_FACTORY_API_KEY="your-developer-portal-api-key"     # -> apikey header
export AI_FACTORY_SLURM_TOKEN="eyJhbGciOi..."                 # -> X-SLURM-USER-TOKEN
export AI_FACTORY_SLURM_USER="jane.doe"

ai-factory --version                    # print SDK version and exit
ai-factory jobs list                    # table output
ai-factory jobs list --json             # machine-readable
ai-factory jobs get 459381              # single job detail
ai-factory jobs submit-container \
    --name training-run \
    --image docker://nvcr.io/nvidia/pytorch:24.01-py3 \
    --command "python train.py" \
    --partition GPU-a100 \
    --gres gpu:a40:1 \
    --time-limit 120
ai-factory jobs cancel 459381
```

Every subcommand supports `--help` and `--json`. Errors map to distinct
exit codes so shell scripts can branch on the failure mode (codes start at
`10` so they do not collide with Click/Typer's argument-parse exit `2`):

| Exit code | Meaning |
|-----------|---------|
| `0` | success |
| `2` | usage error (raised by Typer for unknown options/missing arguments) |
| `10` | configuration error (missing credentials, bad config file) |
| `11` | authentication failed (expired/invalid token) |
| `12` | resource not found (e.g. unknown job ID) |
| `13` | API error (server returned non-success status) |
| `14` | other SDK error |
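
A script that drives the CLI from Python can branch on these codes. The sketch below copies the meanings from the table; `describe_exit` and `run_cli` are hypothetical helpers, and `run_cli` assumes the `ai-factory` console script is on `PATH`.

```python
import subprocess

# Exit codes and meanings as listed in the table above.
EXIT_MEANINGS = {
    0: "success",
    2: "usage error",
    10: "configuration error",
    11: "authentication failed",
    12: "resource not found",
    13: "API error",
    14: "other SDK error",
}


def describe_exit(code: int) -> str:
    """Translate a CLI exit code into the meaning documented in the table."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_cli(args: list[str]) -> tuple[int, str]:
    """Run `ai-factory <args>` and return (exit code, documented meaning)."""
    proc = subprocess.run(
        ["ai-factory", *args], capture_output=True, text=True, check=False
    )
    return proc.returncode, describe_exit(proc.returncode)
```

For example, `run_cli(["jobs", "get", "459381"])` would return code `12` with "resource not found" if the job ID is unknown.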

The CLI shares the credential resolution chain with `AIFactoryClient`:
explicit env vars take precedence over `~/.ai-factory/config.yaml`.

There is no `ai-factory jobs wait` subcommand yet. Poll with `jobs get --json`
in a script until `.status` is one of `completed` / `errored` / `cancelled`,
or use the Python `client.wait()` method directly.
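
A polling loop around `jobs get --json` might look like the following. It is a sketch under two assumptions that this README does not confirm: the `--json` output is a single JSON object with a top-level `status` field, and the terminal status strings are spelled exactly as listed above. `get_job_json` and `wait_for_job` are hypothetical names.

```python
import json
import subprocess
import time
from typing import Callable

# Terminal states as named in the paragraph above (casing is an assumption).
TERMINAL_STATUSES = {"completed", "errored", "cancelled"}


def get_job_json(job_id: str) -> dict:
    """Fetch one job via the CLI; assumes --json prints a single JSON object."""
    proc = subprocess.run(
        ["ai-factory", "jobs", "get", job_id, "--json"],
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit code raises CalledProcessError
    )
    return json.loads(proc.stdout)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = get_job_json,
    interval: float = 10.0,
    timeout: float = 3600.0,
) -> dict:
    """Poll until the job reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r}")
        time.sleep(interval)
```

The `fetch` parameter exists so the loop can be exercised with a stub instead of a live cluster.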

## API Reference

### Clients

| Class | Description |
|-------|-------------|
| `AIFactoryClient` | Synchronous client (context manager) |
| `AsyncAIFactoryClient` | Asynchronous client (async context manager) |

### Methods

| Method | Description |
|--------|-------------|
| `submit_job(request)` | Submit a Slurm job script |
| `submit_container(request)` | Submit a containerised job |
| `get_job(job_id)` | Get job details by ID |
| `list_jobs(...)` | List jobs with optional filters and pagination |
| `cancel_job(job_id)` | Cancel a running or pending job |
| `wait(job_id, ...)` | Poll until the job reaches a terminal state |

### Request Models

| Model | Fields |
|-------|--------|
| `JobRequest` | `name`, `script`, `partition`, `tasks`, `cpus_per_task`, `time_limit`, `gres`, `standard_output`, `standard_error` |
| `ContainerJobRequest` | `name`, `image`, `container_command`, `partition`, `tasks`, `cpus_per_task`, `time_limit`, `gres`, `standard_output`, `standard_error` |

### Response Models

| Model | Fields |
|-------|--------|
| `SubmitJobResponse` | `job_id`, `output_dir`, `logs_url` |
| `JobDetail` | `job_id`, `name`, `status`, `partition`, `nodes`, `exit_code`, `duration`, `start_time`, `end_time`, `submit_time`, `working_directory`, `standard_output`, `standard_error`, `gres`, `output_dir`, `logs_url` |
| `JobListItem` | `job_id`, `name`, `status`, `duration`, `start_time`, `end_time` |
| `JobList` | `jobs`, `total`, `limit`, `offset` |
| `CancelJobResponse` | `message` |
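
The `JobList` fields suggest offset-based pagination. The helper below is a sketch that assumes `list_jobs` accepts `limit` and `offset` keyword arguments; this README does not spell out its signature, so check the SDK before relying on it. `iter_all_jobs` is a hypothetical name.

```python
from typing import Any, Iterator


def iter_all_jobs(client: Any, page_size: int = 50) -> Iterator[Any]:
    """Yield every job by walking pages until `total` is reached.

    Assumes client.list_jobs(limit=..., offset=...) returns an object with
    `jobs` and `total` attributes, as in the JobList row above.
    """
    offset = 0
    while True:
        page = client.list_jobs(limit=page_size, offset=offset)
        yield from page.jobs
        offset += len(page.jobs)
        # Stop on an empty page too, in case `total` changes between calls.
        if not page.jobs or offset >= page.total:
            return
```

With a real client this would be used as `for job in iter_all_jobs(client): ...` inside the `with AIFactoryClient(...)` block.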

### Exceptions

| Exception | When |
|-----------|------|
| `SDKError` | Base for all SDK errors |
| `APIError` | Non-2xx HTTP response |
| `AuthError` | 401 or 403 response |
| `NotFoundError` | 404 response |
| `WaitTimeoutError` | `wait()` exceeded its deadline |
| `ConfigFileError` | `~/.ai-factory/config.yaml` unreadable or malformed |

## Requirements

- Python >= 3.11
- [httpx](https://www.python-httpx.org/) >= 0.28
- [pydantic](https://docs.pydantic.dev/) >= 2.0
- [PyYAML](https://pyyaml.org/) >= 6.0
- [Typer](https://typer.tiangolo.com/) >= 0.12

## End-to-end verification

An SDK-driven Path 2 test lives at `test/e2e/test_sdk_path2.py` in the monorepo.
It submits a real container job through the published-shape `AIFactoryClient`,
polls until the job reaches a terminal state, and validates the `JobDetail` schema.

Run it locally against staging:

```bash
export COMPUTE_API_URL="https://aifactory-dev.ai-factory.datalab.tuwien.ac.at/compute-api/v1"
export APISIX_CI_API_KEY="your-developer-portal-api-key"    # sent as the apikey header
export SLURM_USERNAME="$(whoami)"
export SLURM_USER_TOKEN="$(scontrol token | cut -d= -f2)"   # Slurm JWT, rotates often

uv run pytest test/e2e/test_sdk_path2.py -v -s -m sdk_e2e
```

In CI, trigger the manual `sdk-e2e` job in the `post-deploy-verify` stage.
The job is intentionally manual because each run queues a real Slurm job.

## License

[MIT](LICENSE)
