SLURM Runtimes
scitex-agent-container ships two SLURM runtimes:
| Runtime | Use when |
|---|---|
| `slurm` | You want one agent per SLURM job. The agent’s sbatch wrapper holds the allocation and runs the agent. |
| `slurm-tenant` | You want many agents inside one allocation. Book a reservation once, launch agents as tenants, and pay the queue wait only once for the whole pool. Requires a reservation booked with `--tmux-server`. |
Both runtimes are bastion-initiated (SSH from a non-HPC host into the SLURM submission host). No persistent daemons or outbound tunnels are ever installed — compatible with HPC policies that forbid them (reference: 2026-04-26 IT Security ruling on Spartan).
Single-agent (runtime: slurm)
Submit one agent as one sbatch job with auto-resubmit:
apiVersion: scitex-agent-container/v3
kind: Agent
spec:
  runtime: slurm
  model: opus
  slurm:
    partition: cascade
    cpus_per_task: 4
    mem: "16G"
    time_limit: "7-00:00:00"
    auto_resubmit: true
    hooks:
      pre_agent: ~/path/to/module-load.sh  # source 'module load Python/3.11.3' etc.
Lifecycle:
sac agent start head-spartan/head-spartan.yaml # submits sbatch on the SLURM submission host
sac agent status head-spartan # squeue + tmux pane state
sac agent attach head-spartan # srun --pty + tmux attach on the compute node
sac agent logs head-spartan -n 100 # tmux capture-pane via srun --overlap
sac agent stop head-spartan # scancel + clear local state
The slurm.hooks.pre_agent script is sourced (not exec’d) inside
the sbatch wrapper, so any env it sets persists into the agent
process. This is where module load Python/3.11.3 lives on Lmod
clusters.
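The difference between sourcing and executing a hook can be checked locally. The following is a minimal sketch, not part of scitex-agent-container: it assumes `bash` is on `PATH`, and the helper names (`env_after`, `demo`) and the variable `DEMO_MODULE_LOADED` are hypothetical stand-ins for what a real `module load` would export.

```python
import os
import subprocess
import tempfile


def env_after(mode: str, script_path: str, var: str) -> str:
    """Run a hook script, then report the value of `var` seen by the next command.

    mode="source" runs the hook in the same shell (like the sbatch wrapper does);
    mode="exec" runs it in a child bash process.
    """
    if mode == "source":
        hook = f'source "{script_path}"'
    elif mode == "exec":
        hook = f'bash "{script_path}"'
    else:
        raise ValueError(f"unknown mode: {mode}")
    # The follow-up printf stands in for the agent process started after the hook.
    cmd = f'{hook}; printf "%s" "${{{var}:-}}"'
    # Start from an environment where the variable is definitely unset.
    env = {k: v for k, v in os.environ.items() if k != var}
    result = subprocess.run(
        ["bash", "-c", cmd], capture_output=True, text=True, check=True, env=env
    )
    return result.stdout


def demo() -> tuple:
    """Return (value after sourcing, value after executing) for a one-line hook."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as fh:
        fh.write("export DEMO_MODULE_LOADED=yes\n")
        path = fh.name
    try:
        return (
            env_after("source", path, "DEMO_MODULE_LOADED"),
            env_after("exec", path, "DEMO_MODULE_LOADED"),
        )
    finally:
        os.unlink(path)
```

Calling `demo()` returns the value seen after sourcing and the value seen after executing; only the sourced variant keeps the export, which is why the hook has to be sourced for the agent to inherit it.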
Multi-tenant (runtime: slurm-tenant)
Cuts queue wait from minutes per agent launch to one ssh round-trip
per launch. Pattern: book the node once via scitex-hpc, then start
tenant agents into it.
Step 1 — book the reservation
scitex-hpc reservations book dev-pool \
    --host spartan --partition cascade \
    --cpus 8 --mem 32G --time 7-0 \
    --tmux-server sac --persistent
Two flags matter:
- `--tmux-server sac` bootstraps a long-lived tmux server as PID 1 of the sbatch script. Tenant tmux sessions connect to this server via `tmux -L sac …` and so live in the job’s cgroup, not in transient `srun --overlap` step cgroups (which would kill them on step exit). Without this flag, `runtime: slurm-tenant` will refuse to start with a clear error message.
- `--persistent` enables walltime auto-resubmit via SLURM’s `SIGUSR1` signal: when the job is 1 hour from walltime, the trap calls `sbatch "$0"` to resubmit itself in place. The reservation’s friendly name stays stable; only the SLURM `job_id` changes. Use `scitex-hpc reservations refresh dev-pool` to update the cached id.
Step 2 — write tenant YAMLs
A minimal tenant yaml:
apiVersion: scitex-agent-container/v3
kind: Agent
spec:
  runtime: slurm-tenant
  model: sonnet
  slurm:
    reservation: dev-pool  # name of the existing scitex-hpc lease
  claude:
    flags: [--dangerously-skip-permissions]
Drop multiple yamls under $SCITEX_AGENT_CONTAINER_YAML_DIRS (or
~/.scitex/agent-container/agents/) following the dir-as-SSoT
layout (<name>/<name>.yaml).
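To make the `<name>/<name>.yaml` layout concrete, here is a hedged sketch of how such a directory could be scanned. It is not the discovery code used by `sac`; the function names are hypothetical, and it assumes that `$SCITEX_AGENT_CONTAINER_YAML_DIRS` separates multiple directories the same way `PATH` does.

```python
import os
from pathlib import Path

# Fallback location named in the text above.
DEFAULT_DIR = Path.home() / ".scitex" / "agent-container" / "agents"


def yaml_dirs(environ=None) -> list:
    """Return the directories to scan, falling back to the default location."""
    environ = os.environ if environ is None else environ
    raw = environ.get("SCITEX_AGENT_CONTAINER_YAML_DIRS", "")
    # Assumption: entries are separated like PATH entries (os.pathsep).
    dirs = [Path(p).expanduser() for p in raw.split(os.pathsep) if p]
    return dirs or [DEFAULT_DIR]


def discover_agents(dirs) -> dict:
    """Map agent name -> yaml path for every <name>/<name>.yaml found."""
    found = {}
    for root in dirs:
        if not root.is_dir():
            continue
        for child in sorted(root.iterdir()):
            candidate = child / f"{child.name}.yaml"
            # Only a yaml named after its own directory counts as an agent.
            if child.is_dir() and candidate.is_file():
                found.setdefault(child.name, candidate)
    return found
```

Under this sketch, a directory `dev-helper/` containing `dev-helper.yaml` is picked up, while a stray top-level yaml or a yaml whose name differs from its directory is ignored.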
Step 3 — start agents into the allocation
sac agent start dev-helper.yaml # tmux session in dev-pool's allocation
sac agent start doc-builder.yaml # second session, same allocation
sac agent start test-runner.yaml # third, same allocation
Each sac agent start becomes one tmux -L sac new-session inside the
existing reservation — no new sbatch is submitted. The whole
operation is one ssh round-trip per agent.
Or launch them all at once:
sac agent start --all # discovers all yamls, starts each
Step 4 — operate them
sac list # registry view; tenants show alongside other agents
sac agent attach dev-helper # srun --pty + tmux -L sac attach -t sac-dev-helper
sac agent logs dev-helper -n 100 # tmux capture-pane via srun --overlap
sac agent stop dev-helper # tmux kill-session (does NOT release the allocation)
sac agent stop --all # kill every tenant; reservation still alive
Stopping a tenant only kills its tmux session; the reservation outlives its tenants. Releasing the reservation is a separate scitex-hpc CLI call:
scitex-hpc reservations release dev-pool
Architectural notes
- Why the tmux server has to be PID 1 of the job: SLURM kills all processes in a step’s cgroup when the step ends. A `tmux new-session` spawned via `srun --jobid --overlap` runs in such a step and gets killed within ~2 seconds (verified live on spartan-bm021 2026-04-28). The `Reservation.book(tmux_server="sac", ...)` call wires `tmux -L sac new-session -d -s _root 'sleep infinity'` into the sbatch script body, making the tmux server itself the job’s main process. Tenants then connect via the same `-L sac` socket and their sessions are siblings of `_root`, all in the job’s cgroup.
- Why bastion-initiated: every `sac` call from your laptop results in `ssh <host> 'bash -lc "srun --jobid=… --overlap …"'`. The HPC side never initiates a connection back. No persistent daemons, no autossh, no cloudflared, no `crontab @reboot`: the whole architecture is policy-compliant by construction (matches the 2026-04-26 IT Security ruling on Spartan).
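The shape of one bastion-initiated call can be sketched as follows. This is an illustration of the quoting involved, not the command `sac` actually builds: the function name, the `agent_cmd` parameter, and the fixed `sac-` session prefix (inferred from the `sac-dev-helper` session name shown earlier) are assumptions.

```python
import shlex


def tenant_start_command(host, job_id, agent, agent_cmd, tmux_server="sac"):
    """Build the local argv for starting one tenant session over ssh.

    The result mirrors: ssh <host> 'bash -lc "srun --jobid=... --overlap ..."'
    """
    session = f"sac-{agent}"  # assumed naming scheme
    tmux = ["tmux", "-L", tmux_server, "new-session", "-d", "-s", session, *agent_cmd]
    srun = ["srun", f"--jobid={job_id}", "--overlap", *tmux]
    # Quote twice: once for bash -lc on the remote side, once for ssh's remote shell.
    remote = "bash -lc " + shlex.quote(shlex.join(srun))
    return ["ssh", host, remote]


if __name__ == "__main__":
    print(tenant_start_command("spartan", 12345, "dev-helper", ["claude"]))
```

Everything after the host name travels as a single string, so the whole operation stays one outbound ssh invocation; the new tmux session is created on the server listening on the `-L` socket, which is why it survives the end of the `srun` step.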
Migration path
To move an existing runtime: slurm agent into a multi-tenant
allocation:
1. Book a reservation that fits the agent’s resource requirements (`--cpus`, `--mem`, `--time`, `--partition`).
2. Change the agent’s yaml from `runtime: slurm` to `runtime: slurm-tenant` and replace the entire `slurm:` block with `slurm: {reservation: <pool-name>}`.
3. `sac agent stop` the old agent (or wait for its job to walltime-out).
4. `sac agent start` the migrated yaml.
The migrated agent runs in the same kind of shell context (Python venv,
environment variables), with one exception: tenants do not get the
slurm.hooks.pre_agent lifecycle, because the reservation owns the sbatch
script. If your agent needs module load etc., wire that into the
reservation’s hold_body via Reservation.book(hold_body=...).
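The yaml change in the migration can be sketched on an already-parsed document. The helper below is hypothetical and works on plain dictionaries (parse and dump the yaml with whatever loader you already use); it also returns the dropped `pre_agent` hook, as a reminder that its contents need to move into the reservation.

```python
import copy


def migrate_to_tenant(agent: dict, pool_name: str):
    """Return (migrated agent dict, dropped pre_agent hook or None).

    The input is left untouched; only runtime and the slurm block change.
    """
    spec = agent.get("spec") or {}
    if spec.get("runtime") != "slurm":
        raise ValueError("expected an agent with runtime: slurm")
    migrated = copy.deepcopy(agent)
    old_slurm = migrated["spec"].get("slurm") or {}
    dropped_hook = (old_slurm.get("hooks") or {}).get("pre_agent")
    migrated["spec"]["runtime"] = "slurm-tenant"
    # The entire slurm block is replaced, so partition, mem, hooks etc. disappear.
    migrated["spec"]["slurm"] = {"reservation": pool_name}
    return migrated, dropped_hook
```

If the second return value is not `None`, the old agent relied on a hook that tenants do not run, so whatever it set up has to be provided by the reservation instead.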
Troubleshooting
- `runtime: slurm-tenant requires spec.slurm.reservation`: the yaml is missing the `reservation` field.
- `reservation 'foo' was not booked with tmux_server set`: re-book with `--tmux-server sac`.
- Tenant tmux session disappears immediately: almost certainly the reservation was booked without `--tmux-server`. Run `scitex-hpc reservations get <name>` and check that `"extras": {"tmux_server": "sac"}` is set.
- `sac agent attach` exits immediately: same as above, or the session was killed externally. Run `sac agent logs` first to see whether the process inside crashed.
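The first two checks can be reproduced by hand before starting a tenant. The sketch below is a hypothetical preflight helper, not the validation code inside `sac`; it assumes the output of `scitex-hpc reservations get <name>` is a JSON object containing the `extras` key shown above.

```python
import json


def tenant_preflight(spec: dict, reservation_json: str):
    """Return the matching error message from the list above, or None if both checks pass."""
    slurm = spec.get("slurm") or {}
    name = slurm.get("reservation")
    if not name:
        return "runtime: slurm-tenant requires spec.slurm.reservation"
    # Assumption: the reservation record is JSON with an optional "extras" object.
    record = json.loads(reservation_json)
    if not (record.get("extras") or {}).get("tmux_server"):
        return f"reservation '{name}' was not booked with tmux_server set"
    return None
```

A `None` result only rules out these two configuration errors; a session that was killed externally or a process that crashed inside tmux still needs `sac agent logs` to diagnose.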