SLURM Runtimes

scitex-agent-container ships two SLURM runtimes:

  • slurm: use when you want one agent per SLURM job. The agent’s sbatch wrapper holds the allocation, runs claude in tmux on the compute node, and auto-resubmits before walltime via a SIGUSR1 trap. This is what production agents like head-spartan use.

  • slurm-tenant: use when you want many agents inside one allocation. Book a reservation once, launch agents as tenants, and pay the queue wait only once for the whole pool. Requires scitex-hpc>=0.6.1 (provides the Reservation primitive with tmux-server bootstrap).

Both runtimes are bastion-initiated (SSH from a non-HPC host into the SLURM submission host). No persistent daemons or outbound tunnels are ever installed — compatible with HPC policies that forbid them (reference: 2026-04-26 IT Security ruling on Spartan).

Single-agent (runtime: slurm)

Submit one agent as one sbatch job with auto-resubmit:

apiVersion: scitex-agent-container/v3
kind: Agent
spec:
  runtime: slurm
  model: opus
  slurm:
    partition: cascade
    cpus_per_task: 4
    mem: "16G"
    time_limit: "7-00:00:00"
    auto_resubmit: true
    hooks:
      pre_agent: ~/path/to/module-load.sh   # sourced by the sbatch wrapper; e.g. runs 'module load Python/3.11.3'

Lifecycle:

sac agent start head-spartan/head-spartan.yaml   # submits sbatch on the SLURM submission host
sac agent status head-spartan                    # squeue + tmux pane state
sac agent attach head-spartan                    # srun --pty + tmux attach on the compute node
sac agent logs head-spartan -n 100               # tmux capture-pane via srun --overlap
sac agent stop head-spartan                      # scancel + clear local state

The slurm.hooks.pre_agent script is sourced (not exec’d) inside the sbatch wrapper, so any env it sets persists into the agent process. This is where module load Python/3.11.3 lives on Lmod clusters.
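The difference between sourcing and executing a hook can be checked locally without SLURM. The sketch below is an illustration only: the temporary hook file and the DEMO_HOOK_VAR variable are made-up stand-ins, and a real pre_agent hook would run something like module load Python/3.11.3 instead.

```shell
# Hypothetical stand-in for a pre_agent hook (not a real scitex file).
hook="$(mktemp)"
echo 'export DEMO_HOOK_VAR=loaded' > "$hook"

# Executed as a child process: the export dies with the child.
unset DEMO_HOOK_VAR
sh "$hook"
after_exec="${DEMO_HOOK_VAR:-unset}"

# Sourced: the export lands in the current shell ...
. "$hook"
after_source="${DEMO_HOOK_VAR:-unset}"

# ... and is inherited by the next process started from it
# (a child shell stands in for the agent process here).
seen_by_child="$(sh -c 'echo "${DEMO_HOOK_VAR:-unset}"')"

rm -f "$hook"
echo "exec: $after_exec, source: $after_source, child: $seen_by_child"
```

Only the sourced variant leaves the variable visible to the child process, which is why the wrapper sources the hook.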

Multi-tenant (runtime: slurm-tenant)

The slurm-tenant runtime cuts the cost of launching an agent from minutes of queue wait to one ssh round-trip. The pattern: book the node once via scitex-hpc, then start tenant agents into it.

Step 1 — book the reservation

scitex-hpc reservations book dev-pool \
    --host spartan --partition cascade \
    --cpus 8 --mem 32G --time 7-0 \
    --tmux-server sac --persistent

Two flags matter:

  • --tmux-server sac bootstraps a long-lived tmux server as PID 1 of the sbatch script. Tenant tmux sessions connect to this server via tmux -L sac and so live in the job’s cgroup, not in transient srun --overlap step cgroups (which would kill them on step exit). Without this flag, runtime: slurm-tenant will refuse to start with a clear error message.

  • --persistent enables walltime auto-resubmit via SLURM’s SIGUSR1 signal — when the job is 1 hour from walltime, the trap calls sbatch "$0" to resubmit itself in place. The reservation’s friendly name stays stable; only the SLURM job_id changes. Use scitex-hpc reservations refresh dev-pool to update the cached id.
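The walltime auto-resubmit pattern can be sketched as a small batch script. This is a generic SLURM idiom under stated assumptions, not the script scitex-hpc actually generates: the DRY_RUN switch, the resubmit function, and the 30-second stand-in for the hold process are all illustrative. With DRY_RUN=1 (the default here) the script simulates the signal itself, so it can be run without SLURM.

```shell
#!/bin/bash
#SBATCH --time=7-00:00:00
#SBATCH --signal=B:USR1@3600   # ask SLURM to signal the batch shell 3600 s before walltime

# DRY_RUN and resubmit are illustrative names, not part of scitex-hpc.
DRY_RUN="${DRY_RUN:-1}"
resubmit_count=0

resubmit() {
    resubmit_count=$((resubmit_count + 1))
    if [ "$DRY_RUN" = "1" ]; then
        echo "dry run: would call sbatch on this script"
    else
        sbatch "$0"   # same script, new job id
    fi
}
trap resubmit USR1

# Hold the allocation with a background process plus wait: the shell
# defers traps while a foreground command runs, but wait returns as
# soon as a trapped signal arrives.
sleep 30 &
hold_pid=$!

if [ "$DRY_RUN" = "1" ]; then
    # Stand-in for SLURM: deliver the signal ourselves after one second.
    ( sleep 1; kill -USR1 $$ ) &
fi

wait "$hold_pid" || true               # interrupted by USR1, then the trap runs
kill "$hold_pid" 2>/dev/null || true   # sketch only: stop the hold process
echo "resubmit_count=$resubmit_count"
```

A real hold script would keep waiting after the trap fires rather than exit, so the current job keeps serving tenants until SLURM ends it.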

Step 2 — write tenant YAMLs

A minimal tenant yaml:

apiVersion: scitex-agent-container/v3
kind: Agent
spec:
  runtime: slurm-tenant
  model: sonnet
  slurm:
    reservation: dev-pool       # name of the existing scitex-hpc lease
  claude:
    flags: [--dangerously-skip-permissions]

Drop multiple yamls under $SCITEX_AGENT_CONTAINER_YAML_DIRS (or ~/.scitex/agent-container/agents/) following the dir-as-SSoT layout (<name>/<name>.yaml).
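The <name>/<name>.yaml layout is easy to check by hand. The helper below is a hypothetical sketch of such a discovery pass over a single root directory; it is not sac's own implementation, and how sac splits multiple entries in $SCITEX_AGENT_CONTAINER_YAML_DIRS is not covered here.

```shell
# Hypothetical helper: list yamls that follow <name>/<name>.yaml under one root.
discover_agent_yamls() {
    yaml_root="$1"
    for agent_dir in "$yaml_root"/*/; do
        agent_name="$(basename "$agent_dir")"
        if [ -f "$agent_dir$agent_name.yaml" ]; then
            printf '%s\n' "$agent_dir$agent_name.yaml"
        fi
    done
}

# Demo on a throwaway directory: one well-formed agent, one stray yaml.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/dev-helper" "$demo_root/stray"
touch "$demo_root/dev-helper/dev-helper.yaml" "$demo_root/stray/other.yaml"

found="$(discover_agent_yamls "$demo_root")"
echo "$found"
rm -rf "$demo_root"
```

Only dev-helper/dev-helper.yaml is listed; the yaml whose name does not match its directory is skipped.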

Step 3 — start agents into the allocation

sac agent start dev-helper.yaml         # tmux session in dev-pool's allocation
sac agent start doc-builder.yaml        # second session, same allocation
sac agent start test-runner.yaml        # third, same allocation

Each sac agent start becomes one tmux -L sac new-session inside the existing reservation — no new sbatch is submitted. The whole operation is one ssh round-trip per agent.

Or launch them all at once:

sac agent start --all                   # discovers all yamls, starts each

Step 4 — operate them

sac list                          # registry view; tenants show alongside other agents
sac agent attach dev-helper             # srun --pty + tmux -L sac attach -t sac-dev-helper
sac agent logs dev-helper -n 100        # tmux capture-pane via srun --overlap
sac agent stop dev-helper               # tmux kill-session (does NOT release the allocation)
sac agent stop --all                    # kill every tenant; reservation still alive

Stopping a tenant only kills its tmux session; the reservation outlives its tenants. Releasing the reservation is a separate scitex-hpc CLI call:

scitex-hpc reservations release dev-pool

Architectural notes

  • Why the tmux server has to be PID 1 of the job: SLURM kills all processes in a step’s cgroup when the step ends. A tmux new-session spawned via srun --jobid --overlap runs in such a step and gets killed within ~2 seconds (verified live on spartan-bm021 2026-04-28). The Reservation.book(tmux_server="sac", ...) call wires tmux -L sac new-session -d -s _root 'sleep infinity' into the sbatch script body — making the tmux server itself the job’s main process. Tenants then connect via the same -L sac socket and their sessions are siblings of _root, all in the job’s cgroup.

  • Why bastion-initiated: every sac call from your laptop results in ssh <host> 'bash -lc "srun --jobid=… --overlap …"'. The HPC side never initiates a connection back. No persistent daemons, no autossh, no cloudflared, no crontab @reboot — the whole architecture is policy-compliant by construction (matches the 2026-04-26 IT Security ruling on Spartan).
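The shape of a bastion-initiated call can be made concrete with a small helper that prints the command instead of running it. This is a sketch: build_tenant_cmd is a hypothetical name, the job id 12345 is a placeholder, and the exact command line sac builds may differ.

```shell
# Hypothetical helper: print the ssh command for one remote step, without running it.
build_tenant_cmd() {
    ssh_host="$1"
    job_id="$2"
    remote_cmd="$3"
    printf "ssh %s 'bash -lc \"srun --jobid=%s --overlap %s\"'\n" \
        "$ssh_host" "$job_id" "$remote_cmd"
}

# 12345 is a placeholder job id; tmux list-sessions is just an example payload.
cmd="$(build_tenant_cmd spartan 12345 'tmux -L sac list-sessions')"
echo "$cmd"
```

The helper does no escaping, so a remote command containing quotes would need extra care. The point is only that every operation is a single outbound ssh from the bastion, with nothing left listening on the HPC side.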

Migration path

To move an existing runtime: slurm agent into a multi-tenant allocation:

  1. Book a reservation that fits the agent’s resource requirements (--cpus, --mem, --time, --partition).

  2. Change the agent’s yaml from runtime: slurm to runtime: slurm-tenant and replace the entire slurm: block with slurm: {reservation: <pool-name>}.

  3. sac agent stop the old agent (or, if auto_resubmit is off, wait for its job to hit walltime).

  4. sac agent start the migrated yaml.

The migrated agent runs in the same kind of shell context as before (Python venv, env vars), with one difference: tenants don’t get the slurm.hooks.pre_agent lifecycle, since the reservation owns the sbatch script. If your agent needs module load or similar setup, wire that into the reservation’s hold_body via Reservation.book(hold_body=...).

Troubleshooting

  • runtime: slurm-tenant requires spec.slurm.reservation — the yaml is missing the reservation field.

  • reservation 'foo' was not booked with tmux_server set — re-book with --tmux-server sac.

  • Tenant tmux session disappears immediately — almost certainly the reservation was booked without --tmux-server. Run scitex-hpc reservations get <name> and check "extras": {"tmux_server": "sac"} is set.

  • sac agent attach exits immediately — same as above; or the session was killed externally. Run sac agent logs first to see whether the process inside crashed.
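The tmux_server check can be scripted. The sketch below assumes that scitex-hpc reservations get prints JSON containing the "extras" fragment quoted above; the has_tmux_server helper and the sample documents (including their "name" field) are made up for illustration.

```shell
# Hypothetical helper: reads reservation JSON on stdin and succeeds
# if a non-empty "tmux_server" value is present.
has_tmux_server() {
    grep -Eq '"tmux_server"[[:space:]]*:[[:space:]]*"[^"]+"'
}

# Made-up sample documents; real output may carry more fields.
sample_ok='{"name": "dev-pool", "extras": {"tmux_server": "sac"}}'
sample_bad='{"name": "dev-pool", "extras": {}}'

if printf '%s\n' "$sample_ok" | has_tmux_server; then ok_result=set; else ok_result=missing; fi
if printf '%s\n' "$sample_bad" | has_tmux_server; then bad_result=set; else bad_result=missing; fi

echo "with tmux_server: $ok_result, without: $bad_result"
```

Against a live reservation the same helper would be used as scitex-hpc reservations get dev-pool | has_tmux_server, assuming the JSON output format described above.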