Metadata-Version: 2.4
Name: lqck
Version: 0.1.1
Summary: Liquid cluster kit to launch and manage long-lived Slurm services jobs.
Project-URL: Repository, https://github.com/Liquid4All/cluster-kit
Project-URL: Issues, https://github.com/Liquid4All/cluster-kit/issues
Author: Liquid AI
License-Expression: MIT
License-File: LICENSE
Keywords: cluster,service,slurm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Distributed Computing
Requires-Python: >=3.9
Provides-Extra: dev
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Description-Content-Type: text/markdown

# cluster-kit

Cluster toolkit for Liquid AI. First package: **`cluster_kit.slurm_services`** — a
lightweight SDK to launch and manage long-lived Slurm *service* jobs (e.g. an
inference server used as an LLM judge) whose lifecycle is bound to one or more
consumer jobs.

No database, no central worker: the **shared filesystem is the registry** and
**`squeue` is the source of truth** for liveness. Multiple consumers can share
one warm service and it self-reaps once nobody is using it.

The SDK is cluster-agnostic. The bundled examples target a GPU Slurm cluster
that schedules by QoS (`--gpus-per-node` / `--cpus-per-gpu`), runs ROCm, has
`uv` preinstalled, and exposes a shared `$HOME` — adjust the resource flags and
server entrypoint for your own cluster.

## Install

```bash
uv add lqck        # or: pip install lqck
```

The distribution is **`lqck`**; the import package is **`cluster_kit`**
(`pip install lqck`, then `import cluster_kit` / `python -m cluster_kit.slurm_services`).
Runtime is dependency-free (stdlib + the Slurm CLI).

Local dev: `uv sync --extra dev && uv run pytest` (no cluster needed).

## Run it on the cluster

Everything goes through Slurm. The **consumer/driver is itself a job**, and the
SDK does a nested `sbatch` from inside it to launch the GPU-backed **service**
job; the two find each other via the registry and talk over HTTP.

```bash
cd cluster-kit && mkdir -p logs

# Smallest end-to-end: bring up LiquidAI/LFM2.5-350M and send one "Hello".
sbatch examples/hello_lfm.slurm
tail -f logs/lfm-hello_*.log        # submit -> RUNNING -> healthy -> response

# Real pattern: a training job that shares a warm judge with other runs.
sbatch examples/train_with_judge.slurm
```

Service stdout (sglang startup) lands in `~/.slurm_services/<name>/service-<jobid>.out`.

Preview the exact `sbatch` without submitting anything:

```bash
uv run python -m cluster_kit.slurm_services ensure \
    --name lfm-hello --entrypoint python \
    --arg=-m --arg=sglang.launch_server \
    --arg=--model-path --arg=LiquidAI/LFM2.5-350M --arg=--host --arg=0.0.0.0 --arg=--port --arg=8000 \
    --gpus-per-node 1 --cpus-per-gpu 16 \
    --container-image lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi30x-20260428 \
    --dry-run
# --port (8000), the cluster's --container-mount defaults, --health-path, and the
# warm-reuse window all have defaults; pass them only to override.
```

## Python API

```python
from cluster_kit.slurm_services import Resources, ServiceSpec, slurm_service

spec = ServiceSpec(
    name="lfm-hello",
    # No server script: the SDK runs this command inside the container, publishes
    # host:port, and health-gates it. --host 0.0.0.0 so consumers reach it across nodes.
    entrypoint="python",
    args=["-m", "sglang.launch_server", "--model-path", "LiquidAI/LFM2.5-350M",
          "--host", "0.0.0.0", "--port", "8000"],
    resources=Resources(gpus_per_node=1, cpus_per_gpu=16, time_limit="00:30:00"),
    container_image="lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi30x-20260428",
    # port (8000), the cluster's container_mounts, the /health check, and a warm-
    # reuse window default in — override (e.g. port=9000, container_mounts=[...],
    # idle_timeout_s=600) only when the service needs something different.
)

with slurm_service(spec) as svc:   # blocks until /health passes
    reply = say_hello(svc.url)     # POST {svc.url}/v1/chat/completions
# released on exit; reaped even on SIGKILL / node loss
```

The CLI (`ensure` / `heartbeat` / `release`) is the same thing for shell-driven
`*.slurm` scripts — see `examples/train_with_judge.slurm`.

## Logging

Verbose by default; set `SLURM_SERVICES_LOG` to `DEBUG` / `INFO` / `WARNING`. At
INFO you always get the exact, copy-pasteable submission and every state change:

```
INFO [slurm_service] submitting: sbatch --parsable --job-name=llm-judge ... wrapper.sh
INFO [slurm_service] submitted job 90210
INFO [slurm_service] job 90210 state: PENDING -> RUNNING
INFO [slurm_service] service 'llm-judge' healthy at node01:8000 (job 90210)
```

## How it works

`ensure_service` (and the `slurm_service` context manager over it) runs inside
the consumer job and: checks the registry for a healthy same-fingerprint service
to reuse; otherwise takes an atomic lock, renders a wrapper around your
`entrypoint`, and `sbatch`es it; polls `squeue` to RUNNING then `/health` to 200;
registers a lease and returns a `Handle`. On exit it drops the lease.

Reaping is belt-and-suspenders (since `atexit` doesn't fire on SIGKILL/node loss):

- **Consumer side** — a background thread renews this consumer's lease file.
- **Service side** — a watcher inside the service job `scancel`s itself once no
  lease is live (fresh heartbeat, or the lease's parent job still in `squeue`),
  after an `idle_timeout_s` grace window.

Key design choices:

- **Lease set, not a single parent** — the service stays up while ≥1 consumer
  holds a live lease, so concurrent runs share it; one consumer is just N=1.
- **Fingerprint-gated reuse** — a same-name service running a *different* model
  raises `FingerprintMismatch` rather than handing back the wrong endpoint.
- **`--export` omitted by default** so the service inherits the consumer job's
  modules + venv (set `export_env="NONE"` for a clean env).
- **Registry location** — per-user `~/.slurm_services` by default (`$HOME` is
  shared on the cluster, so it's reachable from every node); set
  `$SLURM_SERVICES_ROOT` to a shared path for team-wide sharing.

## Layout

```
src/cluster_kit/slurm_services/
  __init__.py   slurm_service(), ensure_service(), release_service(), Handle, exceptions
  __main__.py   CLI: ensure / heartbeat / release (+ --dry-run)
  config.py     Resources, HealthCheck, ServiceSpec (+ fingerprint)
  slurm.py      sbatch/squeue/scancel/sacct wrappers
  registry.py   shared-dir lookup-or-create lock + lease set
  health.py     HTTP /health polling gate
  heartbeat.py  lease renewer (consumer) + self-suicide watcher (service)
  wrapper.py    generated batch wrapper
  logutil.py    logging
examples/
  hello_lfm.slurm        `sbatch` this for the smallest end-to-end run
  hello_lfm.py           the Python driver it runs (inline entrypoint + container_image)
  train_with_judge.slurm shell-driven consumer: shared judge + training
```

## Roadmap

**Possible future improvements**

- Hetjob co-scheduling if/when the cluster supports `--hetjob` (today: dependency
  + client-side health gate, which is authoritative regardless).
- Service restart / endpoint hot-swap (today: dependent consumers fail fast).
- Optional fire-and-forget status POST for dashboard visibility (never a dependency).
