Metadata-Version: 2.4
Name: smtop
Version: 0.10.7
Summary: Live Slurm GPU and job monitor.
Author: Ren Jiaxi
License-Expression: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# smtop

`smtop` is a live Slurm GPU and job monitor derived from `slmtop`. It installs
as a separate package and command, so the original `slmtop` package remains
untouched.

Run `smtop` in a terminal to refresh continuously. Press `q` to quit.

Useful options:

- `smtop --once`: print one snapshot.
- `smtop --free`: only show nodes with free GPUs.
- `smtop -d 5`: refresh every five seconds.
- `smtop -n 10`: refresh ten times and exit.
- `smtop --gpu-metrics local`: show live `nvidia-smi` utilization, memory, power, temperature, and fan data for the current node.
- `smtop --gpu-metrics ssh`: query Slurm GPU nodes with `ssh <node> nvidia-smi ...`.
- `smtop --gpu-interval 10`: sample GPU telemetry every ten seconds while still refreshing Slurm state at the main delay.
- `smtop --no-ssh-unlock`: disable helper jobs and use plain ssh telemetry only.
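
The `-d` and `--gpu-interval` options imply two independent cadences: Slurm state refreshes at the main delay, while GPU telemetry is re-sampled less often. The sketch below illustrates that idea only; `plan_refreshes` and its parameters are hypothetical names, not `smtop`'s actual scheduler.

```python
def plan_refreshes(delay: float, gpu_interval: float, count: int) -> list[tuple[float, bool]]:
    """Return (time_offset, sample_gpu) for each of `count` refreshes.

    Slurm state is refreshed at every tick. GPU telemetry is sampled on the
    first tick and then only once `gpu_interval` seconds have elapsed since
    the previous sample. (Illustrative assumption, not smtop's real logic.)
    """
    plan = []
    last_gpu_sample = None
    for i in range(count):
        now = i * delay
        due = last_gpu_sample is None or now - last_gpu_sample >= gpu_interval
        if due:
            last_gpu_sample = now
        plan.append((now, due))
    return plan


# Roughly what `smtop -d 5 --gpu-interval 10 -n 5` would imply under this model.
for offset, sample_gpu in plan_refreshes(delay=5, gpu_interval=10, count=5):
    print(f"t={offset:>4}s  slurm=refresh  gpu={'sample' if sample_gpu else 'reuse'}")
```

Under this model, ticks that do not sample reuse the most recent telemetry.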

The main table is node-level: each row aggregates all GPUs on one Slurm node.
GPU utilization is averaged, GPU memory and power are summed, and temperature is
that of the hottest GPU on the node.
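
The aggregation rule above can be sketched as follows. The `GpuSample` type, its field names, and the units are assumptions made for illustration; `smtop`'s internal data model may differ.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class GpuSample:
    """One GPU's telemetry sample (hypothetical field names and units)."""
    util_pct: float
    mem_used_mib: float
    mem_total_mib: float
    power_w: float
    temp_c: float


def aggregate_node(gpus: list[GpuSample]) -> dict[str, float]:
    """Collapse per-GPU samples into one node-level row."""
    if not gpus:
        raise ValueError("node has no GPU samples")
    return {
        "util_pct": fmean([g.util_pct for g in gpus]),     # averaged
        "mem_used_mib": sum(g.mem_used_mib for g in gpus),  # summed
        "mem_total_mib": sum(g.mem_total_mib for g in gpus),
        "power_w": sum(g.power_w for g in gpus),            # summed
        "temp_c": max(g.temp_c for g in gpus),              # hottest GPU
    }


row = aggregate_node([
    GpuSample(50.0, 1000.0, 16000.0, 100.0, 60.0),
    GpuSample(100.0, 3000.0, 16000.0, 250.0, 72.0),
])
print(row)
```

Reporting the maximum temperature rather than the mean keeps a single overheating GPU visible in the node row.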

The live curses UI uses an `nvitop`-style layout with boxed node telemetry,
block bars for memory and utilization, cluster resource bars, and a boxed job
queue.

Interactive controls:

- Up/down arrows select node or job rows; Tab switches between the node and job panels.
- Mouse clicks select visible node or job rows; mouse wheel moves selection within the panel under the cursor.
- Click outside selectable rows, or press `c`, to clear the current selection.
- Press Enter on a selected GPU node to open `nvitop` on that node over ssh; press `q` in `nvitop` to return to `smtop`.
- Press `k` on your own selected job to open a confirmation dialog for `scancel`.

By default, `smtop` uses ssh telemetry and, for nodes that reject ssh, submits
CPU-only helper jobs to establish persistent SSH master connections. Helper jobs
request one CPU, 100M memory, no GPU, and sleep for 15 minutes by default. They
are submitted in parallel. For unlocked nodes, `smtop` starts a persistent SSH
telemetry channel while the helper job is active, then cancels the helper job
after that node returns its first successful GPU telemetry sample. Use
`--unlock-hold-jobs` if your cluster needs the helper jobs to stay alive for
every telemetry refresh, or `--unlock-keep-jobs` to leave helper jobs running
after `smtop` exits. To avoid repeated helper churn on nodes that keep rejecting
or dropping telemetry, `smtop` submits at most three helper jobs per node per
run by default; change this with `--unlock-max-attempts`.
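
As a rough illustration of the helper-job shape and the per-node attempt cap described above, the sketch below builds an `sbatch` command line and tracks attempts. The flags are standard `sbatch` options, but the job name, the time limit, and the helper names (`helper_sbatch_command`, `UnlockBudget`) are assumptions; the command `smtop` actually submits may differ.

```python
from collections import Counter


def helper_sbatch_command(node: str, sleep_minutes: int = 15) -> list[str]:
    """Build an sbatch argv for a CPU-only helper job pinned to `node`."""
    return [
        "sbatch",
        "--parsable",                       # print only the job ID
        f"--nodelist={node}",               # run on the node to unlock
        "--cpus-per-task=1",                # one CPU
        "--mem=100M",                       # 100M memory
        f"--time={sleep_minutes + 1}",      # assumed: small margin over the sleep
        "--job-name=smtop-unlock",          # assumed job name
        f"--wrap=sleep {sleep_minutes * 60}",
    ]  # no GPU is requested, so the job is CPU-only


class UnlockBudget:
    """Allow at most `max_attempts` helper submissions per node per run."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self._used: Counter[str] = Counter()

    def try_acquire(self, node: str) -> bool:
        """Record an attempt and return True, or return False if exhausted."""
        if self._used[node] >= self.max_attempts:
            return False
        self._used[node] += 1
        return True


budget = UnlockBudget(max_attempts=3)
if budget.try_acquire("node01"):
    print(" ".join(helper_sbatch_command("node01")))
```

Tracking attempts per node rather than globally means one persistently failing node does not consume the budget of the others.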

Nodes that Slurm reports as unavailable, such as `DOWN` or `NODE_FAIL`, are not
unlocked or sampled; the node table reports the Slurm state instead.
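
A minimal sketch of such a state filter follows. It assumes state strings in the style Slurm prints, where flags may be joined with `+` (for example `IDLE+DRAIN`) and a one-character suffix such as `*` may be appended. The set of unavailable states contains only the two examples named above; the set `smtop` actually checks is not documented here.

```python
# Only the two states named in this README; the real list may be longer.
UNAVAILABLE_STATES = {"DOWN", "NODE_FAIL"}

# One-character suffixes that Slurm may append to a node state.
STATE_SUFFIXES = "*~#!%$@^-"


def is_sampleable(state: str) -> bool:
    """Return False for nodes that should be neither unlocked nor sampled."""
    base = state.upper().split("+")[0].rstrip(STATE_SUFFIXES)
    return base not in UNAVAILABLE_STATES


for state in ("IDLE", "MIXED+DRAIN", "DOWN*", "NODE_FAIL"):
    action = "sample" if is_sampleable(state) else f"show state {state}"
    print(f"{state:<12} -> {action}")
```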

Unlocking retries SSH master setup for up to 60 seconds by default. The curses UI
keeps accepting `q` and `r` while telemetry and unlock work runs in the
background. If a node still cannot be sampled, the `ERR` column prefers unlock
diagnostics such as `unlock submit denied`, `unlock PD Resources`,
`unlock timeout`, `unlock metric denied`, or `unlock nvidia missing` over the
original access-denied message.
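
The preference rule for the `ERR` column can be expressed in a few lines. This is a sketch with hypothetical names (`err_text`, `access_error`, `unlock_diag`), not the actual `smtop` implementation.

```python
from typing import Optional


def err_text(access_error: Optional[str], unlock_diag: Optional[str]) -> str:
    """Pick the message for the ERR column.

    An unlock diagnostic, when present, is more specific than the original
    access-denied message, so it takes priority.
    """
    return unlock_diag or access_error or ""


print(err_text("access denied", "unlock timeout"))
print(err_text("access denied", None))
```

The first call yields the unlock diagnostic; the second falls back to the original message because no diagnostic exists.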

When Enter opens `nvitop` for a GPU node, `smtop` temporarily suspends its curses
screen and starts `ssh -tt <node> nvitop`. Existing `smtop` SSH masters and
persistent telemetry channels stay open; they are only cleaned up when `smtop`
itself exits. Use `--nvitop-command` to override the remote command.
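
The suspend-and-resume pattern described above can be sketched with the standard `curses` module. The function names and the way the remote command is passed are assumptions; `smtop`'s own code may be structured differently.

```python
import curses
import subprocess


def nvitop_ssh_argv(node: str, remote_command: str = "nvitop") -> list[str]:
    """Build the ssh argv; -tt forces a TTY so the remote TUI can draw."""
    return ["ssh", "-tt", node, remote_command]


def run_remote_tui(stdscr: "curses.window", node: str,
                   remote_command: str = "nvitop") -> int:
    """Suspend curses, run the remote TUI over ssh, then restore the screen."""
    curses.def_prog_mode()   # remember the current curses terminal modes
    curses.endwin()          # hand the terminal back to the child process
    try:
        return subprocess.call(nvitop_ssh_argv(node, remote_command))
    finally:
        curses.reset_prog_mode()  # restore the saved curses modes
        stdscr.refresh()          # redraw the screen on return


print(nvitop_ssh_argv("node01"))
```

Because only the screen is suspended, background connections owned by the parent process are unaffected while the child runs.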
