Metadata-Version: 2.4
Name: slurmwatch
Version: 0.1.1
Summary: Live, process-isolated node-local hardware telemetry for active Slurm jobs
Author-email: Youzhi Yu <yuyouzhi666@icloud.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/PursuitOfDataScience/slurmwatch
Project-URL: Repository, https://github.com/PursuitOfDataScience/slurmwatch.git
Project-URL: Documentation, https://github.com/PursuitOfDataScience/slurmwatch#readme
Project-URL: Issues, https://github.com/PursuitOfDataScience/slurmwatch/issues
Keywords: slurm,hpc,monitoring,gpu,tui
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: System :: Hardware
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: textual<9,>=0.53
Requires-Dist: pynvml<12,>=11.5
Provides-Extra: nvidia
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.3; extra == "dev"
Requires-Dist: mypy>=1.8; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Requires-Dist: setuptools-scm>=8; extra == "dev"
Dynamic: license-file

<h1 align="center">slurmwatch</h1>

<p align="center">
  <strong>Live, per-process CPU / memory / GPU telemetry for a running Slurm job — with a plain-language efficiency verdict.</strong>
</p>

<p align="center">
  <a href="https://github.com/PursuitOfDataScience/slurmwatch/actions/workflows/ci.yml"><img src="https://github.com/PursuitOfDataScience/slurmwatch/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="https://pypi.org/project/slurmwatch/"><img src="https://img.shields.io/pypi/v/slurmwatch.svg?cache=bust" alt="PyPI"></a>
  <img src="https://img.shields.io/badge/python-3.10%2B-blue.svg" alt="Python 3.10+">
  <img src="https://img.shields.io/badge/license-MIT-green.svg" alt="MIT License">
  <a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/badge/lint-ruff-261230.svg" alt="Ruff"></a>
</p>

<p align="center">
  <img src="https://raw.githubusercontent.com/PursuitOfDataScience/slurmwatch/main/assets/demo.gif" width="860" alt="slurmwatch live TUI dashboard: per-process CPU, memory, and GPU telemetry for a Slurm job. Memory climbs from safe into the OOM-guard WARNING and CRITICAL bands while the allocation-efficiency verdict flags an idle GPU (1 of 2 active).">
</p>

## Features

- **Efficiency verdict** — grades CPU, memory, and GPU (`GOOD` / `UNDERUSED` / `IDLE` / `WARNING`) so you know when you're wasting cores or GPUs.
- **Per-process GPU attribution** — NVML sees only *your* PIDs, so a neighbor's job never inflates your numbers.
- **Honest memory** — working set (RSS minus reclaimable cache) with a configurable OOM guard.
- **Works anywhere** — full live telemetry on the node; auto-falls back to Slurm accounting (`sstat`) from a login node.
- **Zero config** — `slurmwatch <jobid>` auto-discovers jobs, cgroup v1/v2, and whether it's on the node. No flags to memorize.
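
The "honest memory" idea above can be illustrated with a minimal sketch. It assumes a cgroup v2 layout and takes the working set as `memory.current` minus the `inactive_file` entry of `memory.stat` (the convention cAdvisor uses); this may differ from what slurmwatch computes internally. The function names, the `SAFE` label, and the 80% / 95% thresholds are hypothetical and chosen only for illustration.

```python
from typing import Dict


def parse_memory_stat(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.stat file."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(memory_current: int, memory_stat: str) -> int:
    """Usage minus inactive (easily reclaimable) page cache, floored at zero."""
    inactive_file = parse_memory_stat(memory_stat).get("inactive_file", 0)
    return max(memory_current - inactive_file, 0)


def oom_band(working_set: int, limit: int,
             warn: float = 0.80, critical: float = 0.95) -> str:
    """Classify the working set against the limit (thresholds are hypothetical)."""
    if limit <= 0:
        return "UNKNOWN"  # no usable limit to compare against
    fraction = working_set / limit
    if fraction >= critical:
        return "CRITICAL"
    if fraction >= warn:
        return "WARNING"
    return "SAFE"


# Made-up sample: 8 GiB in use, 1 GiB of it inactive page cache, 8 GiB limit.
sample_stat = "anon 6442450944\nfile 2147483648\ninactive_file 1073741824\n"
ws = working_set_bytes(8 * 1024**3, sample_stat)
print(ws, oom_band(ws, 8 * 1024**3))
```

Subtracting only the inactive page cache keeps the figure close to what the kernel would struggle to reclaim, which is why it is a more useful OOM signal than raw usage.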

## Install

```bash
pip install slurmwatch
```

Requires Python 3.10+ and Linux with cgroup v1 or v2. One install works across a mixed cluster: GPU monitoring (NVIDIA, via `pynvml`) auto-activates on GPU nodes and is silently skipped on CPU-only nodes. Works with `pipx` / `uv` too — e.g. `uv tool install slurmwatch`.
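
The auto-activation described above can be approximated with a small probe. This is a sketch of one way to detect a usable NVML, not slurmwatch's own detection logic; `nvidia_gpu_count` is a hypothetical helper, while the `pynvml` calls (`nvmlInit`, `nvmlDeviceGetCount`, `nvmlShutdown`, `NVMLError`) are part of that library.

```python
from typing import Optional


def nvidia_gpu_count() -> Optional[int]:
    """Return the number of GPUs NVML reports, or None if NVML is unusable."""
    try:
        import pynvml
    except ImportError:
        return None  # pynvml is not installed
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no driver or NVML library: treat as a CPU-only node
    try:
        return pynvml.nvmlDeviceGetCount()
    except pynvml.NVMLError:
        return None
    finally:
        pynvml.nvmlShutdown()  # release NVML after a successful init
```

Returning `None` instead of raising lets a caller skip GPU panels quietly on CPU-only nodes.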

## Usage

```bash
slurmwatch                       # auto-discover and attach to your running job
slurmwatch 12345                 # attach to a job (array: 12345_3, het: 12345+1)
slurmwatch --demo                # try the live TUI right now — no Slurm needed
slurmwatch 12345 --once --json   # one machine-readable snapshot, then exit
slurmwatch 12345 --log run.jsonl # headless logging (JSON Lines or CSV)
```
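
A file written with `--log run.jsonl` can be read back one record at a time. The sketch below assumes only that each non-empty line is a standalone JSON object; the field names inside each record are not documented here, so the code does not rely on any of them. `read_jsonl` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one decoded object per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```

For example, `for record in read_jsonl("run.jsonl"): print(sorted(record))` lists the keys present in each record without assuming a schema.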

For full live telemetry, run slurmwatch on the node executing the job:
`srun --jobid 12345 --overlap slurmwatch 12345`. From a login node you get an `sstat` summary (peak memory, CPU time, and allocation) instead. GPU *utilization* isn't available remotely, because Slurm tracks the GPU count, not per-device utilization.

TUI keys: `c`/`m`/`g`/`v` focus a panel, arrows/`PgUp`/`PgDn` scroll, `q` quits.

Exit codes: `0` success · `1` runtime failure · `2` bad config. Errors go to stderr, so piped `--once`/`--log` output stays clean.
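
Because errors go to stderr and the exit codes are fixed, a script can wrap the CLI safely. The sketch below uses only the flags and exit codes documented above; it assumes that `--once --json` prints a single JSON document to stdout, and `build_command` and `snapshot` are hypothetical helper names.

```python
import json
import subprocess
from typing import List

# Exit codes as documented above.
EXIT_MEANINGS = {0: "success", 1: "runtime failure", 2: "bad config"}


def build_command(job_id: str, on_node: bool = True) -> List[str]:
    """Build the snapshot command, wrapped in srun when not on the job's node."""
    command = ["slurmwatch", job_id, "--once", "--json"]
    if not on_node:
        command = ["srun", "--jobid", job_id, "--overlap", *command]
    return command


def snapshot(job_id: str, on_node: bool = True) -> dict:
    """Run one snapshot and return the parsed JSON, raising on a non-zero exit."""
    result = subprocess.run(
        build_command(job_id, on_node), capture_output=True, text=True
    )
    if result.returncode != 0:
        meaning = EXIT_MEANINGS.get(result.returncode, "unknown")
        raise RuntimeError(
            f"slurmwatch exited {result.returncode} ({meaning}): "
            f"{result.stderr.strip()}"
        )
    return json.loads(result.stdout)  # assumes stdout holds one JSON document
```

Keeping stdout for data and stderr for diagnostics means the JSON can be parsed directly, while the stderr text is preserved in the raised error.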

See `slurmwatch --help` for the full flag list. Behavior is also tunable via `SLURMWATCH_*` environment variables — e.g. `SLURMWATCH_OOM_WARN`, `SLURMWATCH_GPU_IDLE_PCT`, `SLURMWATCH_POLL_INTERVAL` (plus ASCII mode and more).
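
When launching slurmwatch from a script, the `SLURMWATCH_*` variables can be set for that one process without touching the shell. In this sketch, `slurmwatch_env` is a hypothetical helper and the value `"2"` is a placeholder: the accepted formats and units are assumptions, so check `slurmwatch --help` before relying on them.

```python
import os
from typing import Dict, Mapping, Optional

PREFIX = "SLURMWATCH_"


def slurmwatch_env(overrides: Mapping[str, str],
                   base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with SLURMWATCH_* overrides applied."""
    for name in overrides:
        if not name.startswith(PREFIX):
            raise ValueError(f"not a slurmwatch variable: {name}")
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


# Placeholder value: the expected format of this variable is an assumption.
env = slurmwatch_env({"SLURMWATCH_POLL_INTERVAL": "2"}, base={"PATH": "/usr/bin"})
print(env)
```

The resulting mapping can be passed as `env=env` to `subprocess.run`, which scopes the overrides to that single invocation.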

## Library

```python
import asyncio
from slurmwatch import TelemetryCollector, resolve_job_context

async def sample(job_id: str):
    collector = TelemetryCollector(resolve_job_context(job_id))
    await collector.start()
    try:
        print((await collector.next_snapshot()).to_json())
    finally:
        await collector.stop()

asyncio.run(sample("12345"))
```
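
To gather several samples rather than one, the same start / `next_snapshot` / stop pattern can be wrapped in a loop. This sketch relies only on the methods shown above and assumes `next_snapshot()` can be awaited repeatedly; `collect_snapshots` and its `timeout` parameter are hypothetical additions, not part of the slurmwatch API.

```python
import asyncio
from typing import Any, List


async def collect_snapshots(collector: Any, count: int,
                            timeout: float = 30.0) -> List[str]:
    """Start the collector, gather `count` JSON snapshots, then always stop it."""
    snapshots: List[str] = []
    await collector.start()
    try:
        for _ in range(count):
            # Bound each wait so a stalled collector cannot hang the caller.
            snapshot = await asyncio.wait_for(collector.next_snapshot(), timeout)
            snapshots.append(snapshot.to_json())
    finally:
        await collector.stop()
    return snapshots
```

With the imports from the example above, `asyncio.run(collect_snapshots(TelemetryCollector(resolve_job_context("12345")), 5))` would return five JSON strings, or raise `asyncio.TimeoutError` if a sample takes longer than `timeout` seconds.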

## Limitations

- NVIDIA-only GPU support (no AMD/ROCm).
- Single-node view — multi-node jobs show data for the node you're on.
- Live GPU utilization and working-set memory require running on the job's node.

## License

MIT
