Metadata-Version: 2.4
Name: lqck
Version: 0.1.0a1
Summary: Liquid cluster kit to launch and manage long-lived Slurm services jobs.
Project-URL: Repository, https://github.com/Liquid4All/cluster-kit
Project-URL: Issues, https://github.com/Liquid4All/cluster-kit/issues
Author: Liquid AI
License-Expression: MIT
License-File: LICENSE
Keywords: cluster,hpc,inference,service,sglang,slurm,vllm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Distributed Computing
Requires-Python: >=3.9
Provides-Extra: dev
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Description-Content-Type: text/markdown

# cluster-kit

Cluster toolkit for Liquid AI. First package: **`cluster_kit.slurm_services`** — a
lightweight SDK to launch and manage long-lived Slurm *service* jobs (e.g. an
inference server used as an LLM judge) whose lifecycle is bound to one or more
consumer jobs.

No database, no central worker: the **shared filesystem is the registry** and
**`squeue` is the source of truth** for liveness. Multiple consumers can share
one warm service and it self-reaps once nobody is using it.

The SDK is cluster-agnostic. The bundled examples target a GPU Slurm cluster
that schedules by QoS (`--gpus-per-node` / `--cpus-per-gpu`), runs ROCm, has
`uv` preinstalled, and exposes a shared `$HOME` — adjust the resource flags and
server entrypoint for your own cluster.

## Install

```bash
uv add lqck        # or: pip install lqck
```

The distribution is **`lqck`**; the import package is **`cluster_kit`**
(`pip install lqck`, then `import cluster_kit` / `python -m cluster_kit.slurm_services`).
Runtime is dependency-free (stdlib + the Slurm CLI).

Local dev: `uv sync --extra dev && uv run pytest` (no cluster needed).

## Run it on the cluster

Everything goes through Slurm. The **consumer/driver is itself a job**, and the
SDK does a nested `sbatch` from inside it to launch the GPU-backed **service**
job; the two find each other via the registry and talk over HTTP.

```bash
cd cluster-kit && mkdir -p logs

# Smallest end-to-end: bring up LiquidAI/LFM2.5-350M and send one "Hello".
sbatch examples/hello_lfm.slurm
tail -f logs/lfm-hello_*.log        # submit -> RUNNING -> healthy -> response

# Real pattern: a training job that shares a warm judge with other runs.
sbatch examples/train_with_judge.slurm
```

Service stdout (sglang startup) lands in `~/.slurm_services/<name>/service-<jobid>.out`.

Preview the exact `sbatch` without submitting anything:

```bash
uv run python -m cluster_kit.slurm_services ensure \
    --name lfm-hello --entrypoint "$PWD/examples/start_model_server.sh" \
    --gpus-per-node 1 --cpus-per-gpu 16 \
    --port 8000 --env MODEL=LiquidAI/LFM2.5-350M --env PORT=8000 --dry-run
```

## Python API

```python
from cluster_kit.slurm_services import HealthCheck, Resources, ServiceSpec, slurm_service

spec = ServiceSpec(
    name="lfm-hello",
    entrypoint="examples/start_model_server.sh",   # reused as-is; serves an OpenAI API
    resources=Resources(gpus_per_node=1, cpus_per_gpu=16, time_limit="00:30:00"),
    env={"MODEL": "LiquidAI/LFM2.5-350M", "PORT": "8000"},
    port=8000,
    health_check=HealthCheck(path="/health", timeout_s=600),
    idle_timeout_s=600,            # keep warm for the next run to reuse
    fingerprint_keys=["MODEL"],    # only reuse a service running this model
)

with slurm_service(spec) as svc:   # blocks until /health passes
    reply = say_hello(svc.url)     # POST {svc.url}/v1/chat/completions
# released on exit; reaped even on SIGKILL / node loss
```

The CLI (`ensure` / `heartbeat` / `release`) is the same thing for shell-driven
`*.slurm` scripts — see `examples/train_with_judge.slurm`.

## Logging

Verbose by default; set `SLURM_SERVICES_LOG` to `DEBUG` / `INFO` / `WARNING`. At
INFO you always get the exact, copy-pasteable submission and every state change:

```
INFO [slurm_service] submitting: sbatch --parsable --job-name=llm-judge ... wrapper.sh
INFO [slurm_service] submitted job 90210
INFO [slurm_service] job 90210 state: PENDING -> RUNNING
INFO [slurm_service] service 'llm-judge' healthy at node01:8000 (job 90210)
```

## How it works

`ensure_service` (and the `slurm_service` context manager over it) runs inside
the consumer job and: checks the registry for a healthy same-fingerprint service
to reuse; otherwise takes an atomic lock, renders a wrapper around your
`entrypoint`, and `sbatch`es it; polls `squeue` to RUNNING then `/health` to 200;
registers a lease and returns a `Handle`. On exit it drops the lease.

Reaping is belt-and-suspenders (since `atexit` doesn't fire on SIGKILL/node loss):

- **Consumer side** — a background thread renews this consumer's lease file.
- **Service side** — a watcher inside the service job `scancel`s itself once no
  lease is live (fresh heartbeat, or the lease's parent job still in `squeue`),
  after an `idle_timeout_s` grace window.

Key design choices:

- **Lease set, not a single parent** — the service stays up while ≥1 consumer
  holds a live lease, so concurrent runs share it; one consumer is just N=1.
- **Fingerprint-gated reuse** — a same-name service running a *different* model
  raises `FingerprintMismatch` rather than handing back the wrong endpoint.
- **`--export` omitted by default** so the service inherits the consumer job's
  modules + venv (set `export_env="NONE"` for a clean env).
- **Registry location** — per-user `~/.slurm_services` by default (`$HOME` is
  shared on the cluster, so it's reachable from every node); set
  `$SLURM_SERVICES_ROOT` to a shared path for team-wide sharing.

## Layout

```
src/cluster_kit/slurm_services/
  __init__.py   slurm_service(), ensure_service(), release_service(), Handle, exceptions
  __main__.py   CLI: ensure / heartbeat / release (+ --dry-run)
  config.py     Resources, HealthCheck, ServiceSpec (+ fingerprint)
  slurm.py      sbatch/squeue/scancel/sacct wrappers
  registry.py   shared-dir lookup-or-create lock + lease set
  health.py     HTTP /health polling gate
  heartbeat.py  lease renewer (consumer) + self-suicide watcher (service)
  wrapper.py    generated batch wrapper
  logutil.py    logging
examples/
  hello_lfm.slurm        `sbatch` this for the smallest end-to-end run
  hello_lfm.py           the Python driver it runs
  train_with_judge.slurm shell-driven consumer: shared judge + training
  start_model_server.sh  server entrypoint: sglang-ROCm container (OpenAI API + /health)
```

## Status & roadmap

The SDK is implemented and unit-tested. Remaining work and possible follow-ups:

**To do**
- Tag the first release (`v0.1.0`) so the git-install pin in *Install* resolves.
- Validate one real end-to-end run on the AMD cluster — confirm the sglang-ROCm
  image serves the chosen LFM2 model and that a CPU-only consumer job schedules —
  then retire the old 2-node judge setup.

**Possible future improvements**
- Hetjob co-scheduling if/when the cluster supports `--hetjob` (today: dependency
  + client-side health gate, which is authoritative regardless).
- Service restart / endpoint hot-swap (today: dependent consumers fail fast).
- A private package index, only if git-install friction shows up.
- Optional fire-and-forget status POST for dashboard visibility (never a dependency).
