Metadata-Version: 2.2
Name: sgpustat
Version: 0.1.1
Summary: A summary of GPU usage on a SLURM cluster
Author: Sam McCarthy
License: MIT
Project-URL: Repository, https://gitlab.surrey.ac.uk/sm0049/sgpustat
Project-URL: Issues, https://gitlab.surrey.ac.uk/sm0049/sgpustat/-/issues
Keywords: slurm,gpu,cluster,hpc,monitoring
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: System Administrators
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: System :: Monitoring
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tabulate
Requires-Dist: termcolor>=2.1
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"

## sgpustat

`sgpustat` is a command-line utility that summarizes GPU usage on a SLURM cluster. Its name follows the convention of the other SLURM tools (`squeue`, `sinfo`, `scontrol`, ...). The tool can be used in two ways:
1. To query the current usage of GPUs on the cluster.
2. To launch a daemon that logs usage over time. This log can later be queried for usage statistics.

Each invocation makes exactly two `scontrol` calls, so it stays fast even on busy clusters. GPU accounting is exact, including on nodes with NVIDIA MIG instances and for jobs submitted with untyped `--gres=gpu:N` requests.

This project began as a fork of [albanie/slurm_gpustat](https://github.com/albanie/slurm_gpustat); the implementation has since been rewritten.

### Installation

Install via `pip install sgpustat`. The pre-rename `slurm_gpustat` command is kept as an alias.

The code is split by responsibility: parsing and accounting logic lives in [core.py](sgpustat/core.py), data collection in [collect.py](sgpustat/collect.py), rendering in [render.py](sgpustat/render.py), the logging daemon in [daemon.py](sgpustat/daemon.py), and the CLI entry point in [cli.py](sgpustat/cli.py).

### Usage

To print a summary of current activity:

`sgpustat`

To print a summary of current activity on particular partitions, e.g. `debug` and `normal`:

`sgpustat -p debug,normal` or `sgpustat --partition debug,normal`

To include a per-node breakdown of available GPUs:

`sgpustat --verbose`

To output machine-readable CSV:

`sgpustat --raw`
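
The CSV output can be consumed from a script. The following is a minimal sketch, not part of `sgpustat` itself: the helper names `parse_raw` and `read_sgpustat_raw` are hypothetical, and because the column layout of `--raw` is not described here, the rows are returned as plain lists of strings without assuming a header.

```python
import csv
import io
import subprocess


def parse_raw(text):
    """Split CSV text into rows of strings, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def read_sgpustat_raw(extra_args=()):
    """Run `sgpustat --raw` and return its output as a list of rows.

    Assumes `sgpustat` is on PATH; `extra_args` is a hypothetical
    convenience for passing flags such as ("-p", "debug,normal").
    """
    result = subprocess.run(
        ["sgpustat", "--raw", *extra_args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_raw(result.stdout)
```

Keeping the parsing step separate from the subprocess call makes it possible to test the parser against saved output on a machine without SLURM.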

Output is colorized when stdout is a terminal. Passing `--color 0` or setting the `NO_COLOR` environment variable disables color; `--color 1` forces it (e.g. when piping to `less -R`).
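
The color rules above can be sketched as a small decision function. This is an illustration only, not the tool's actual code: the name `should_colorize` is hypothetical, and it assumes that an explicit `--color` value takes precedence over `NO_COLOR` and that an empty `NO_COLOR` value is ignored.

```python
import os
import sys


def should_colorize(color_flag=None, environ=None, stream=None):
    """Decide whether to emit ANSI colors.

    color_flag: None when --color was not given, otherwise 0 or 1.
    environ and stream default to os.environ and sys.stdout.
    """
    environ = os.environ if environ is None else environ
    stream = sys.stdout if stream is None else stream

    # Assumption: an explicit --color value wins over everything else.
    if color_flag is not None:
        return bool(color_flag)
    # Assumption: NO_COLOR counts only when set to a non-empty value.
    if environ.get("NO_COLOR"):
        return False
    # Default: colorize only when writing to a terminal.
    return stream.isatty()
```

Taking the environment and stream as parameters keeps the function easy to check without touching the real terminal.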

To start the logging daemon:

`sgpustat --action daemon-start`

To view a summary of logged data:

`sgpustat --action history`

### Example output

```
SLURM Cluster GPU Status
========================

GPU Summary

+----------------------------+-------+----------+-------------+
| GPU model                  |   all |   online |   available |
+============================+=======+==========+=============+
| total                      |   214 |      193 |          51 |
+----------------------------+-------+----------+-------------+
| nvidia_geforce_rtx_3090    |    68 |       53 |          11 |
+----------------------------+-------+----------+-------------+
| nvidia_geforce_rtx_2080_ti |    54 |       54 |          22 |
+----------------------------+-------+----------+-------------+
| nvidia_a100-sxm4-80gb      |    36 |       32 |           0 |
+----------------------------+-------+----------+-------------+

----------------------------------------------------------------------

Usage by User

+---------+------------------------+-------------------------------+
| User    |   Total GPUs Allocated | Count per GPU Type            |
+=========+========================+===============================+
| user01  |                     24 | nvidia_geforce_rtx_2080_ti:24 |
+---------+------------------------+-------------------------------+
```

With `--verbose`, each GPU type is broken down per node:

```
nvidia_geforce_rtx_3090: 11 available
  -> gpunode14: 2 nvidia_geforce_rtx_3090 [cpu: 56/64, gpu: 6/8, mem: 376G/500G] [user02,user03]
  -> gpunode15: 4 nvidia_geforce_rtx_3090 [cpu: 56/64, gpu: 4/8, mem: 180G/500G] [user02]
```

### Notes on accounting

* "all" counts every configured GPU; "online" excludes GPUs on nodes whose state contains DRAIN/DOWN/MAINT/etc.; "available" counts unallocated GPUs on online nodes.
* GPU inventory is read from each node's `Gres=` field (not `CfgTRES`, whose typed entries can be incomplete for MIG profiles).
* Per-job allocations come from the per-node `GRES=...(IDX:...)` detail lines of `scontrol show job -dd`, falling back to the job's typed `AllocTRES` and then to `TresPerNode`.
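
The three counters can be illustrated with a short sketch. It is not the code in `core.py`: the function names, the node dictionary layout (`state`, `gres`, `allocated`), and the `OFFLINE_FLAGS` tuple are assumptions made for this example, and the flag list is deliberately partial. The `Gres=` values follow the usual `gpu:<type>:<count>` form, optionally with a parenthesized suffix such as `(S:0-1)`.

```python
import re
from collections import Counter

# Assumption: a partial list of state flags that take a node offline.
OFFLINE_FLAGS = ("DRAIN", "DOWN", "MAINT")


def parse_gres(gres):
    """Return a Counter mapping GPU type to count for one Gres= value."""
    counts = Counter()
    # Drop parenthesized parts such as "(S:0-1)" or "(null)".
    cleaned = re.sub(r"\([^)]*\)", "", gres)
    for entry in filter(None, cleaned.split(",")):
        parts = entry.split(":")
        if parts[0] != "gpu" or not parts[-1].isdigit():
            continue  # not a GPU entry, or no numeric count
        if len(parts) == 2:
            gpu_type = "gpu"  # untyped entry, e.g. "gpu:2"
        elif len(parts) == 3:
            gpu_type = parts[1]
        else:
            continue
        counts[gpu_type] += int(parts[-1])
    return counts


def summarize(nodes):
    """Compute (all, online, available) GPU counts per type."""
    total, online, available = Counter(), Counter(), Counter()
    for node in nodes:
        inventory = parse_gres(node["gres"])
        total.update(inventory)
        if any(flag in node["state"] for flag in OFFLINE_FLAGS):
            continue  # offline nodes count toward "all" only
        online.update(inventory)
        for gpu_type, count in inventory.items():
            used = node["allocated"].get(gpu_type, 0)
            available[gpu_type] += max(count - used, 0)
    return total, online, available
```

For example, a `MIXED` node with `gpu:rtx_3090:8(S:0-1)` and six GPUs allocated contributes 8 to "all", 8 to "online", and 2 to "available", while a drained node with the same inventory contributes only to "all".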

### Dependencies

* `Python >= 3.8`
* `tabulate`
* `termcolor >= 2.1`

### Tests

Run `python -m pytest tests/`. No SLURM installation is required; the tests run against recorded `scontrol` fixtures.
