Metadata-Version: 2.1
Name: gpumanager
Version: 0.1.0
Summary: Lightweight NVIDIA GPU utilization sampler, Slack reporter, and systemd timer helper
Author: OpenAI Codex
License: MIT
Keywords: gpu,nvidia,slack,systemd,monitoring,cli
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: System :: Systems Administration
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tomli>=1.1.0; python_version < "3.11"
Requires-Dist: backports.zoneinfo>=0.2.1; python_version < "3.9"

# gpumanager

`gpumanager` is a lightweight Python CLI tool for sampling NVIDIA GPU utilization, storing minute-by-minute CSV snapshots, aggregating utilization over a reporting window, and sending per-GPU summaries to Slack.

The installable Python distribution and the CLI entry point are both named `gpumanager`.

## Features

- Samples NVIDIA GPU utilization with `nvidia-smi`
- Stores one CSV file per sample
- Aggregates average utilization by GPU UUID
- Sends reports to Slack via webhook
- Supports interactive configuration
- Installs user-level `systemd` services and timers
- Uses minimal dependencies and stays close to the standard library
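
Sampling via `nvidia-smi` typically means querying its machine-readable CSV mode and parsing the result. A minimal sketch (the exact query fields used by `gpumanager` are an assumption; `parse_nvidia_smi_csv` and `query_gpus` are illustrative names, not the tool's API):

```python
import csv
import io
import subprocess

QUERY_FIELDS = "index,uuid,name,utilization.gpu"

def parse_nvidia_smi_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        rows.append({
            "gpu_index": int(fields[0]),
            "gpu_uuid": fields[1],
            "gpu_name": fields[2],
            "util_gpu": int(fields[3]),
        })
    return rows

def query_gpus():
    """Run nvidia-smi and return one dict per GPU (requires nvidia-smi in PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_nvidia_smi_csv(out)

# Parsing a captured sample line (no GPU needed):
sample = "0, GPU-aaa, NVIDIA A100, 35\n1, GPU-bbb, NVIDIA A100, 2\n"
print(parse_nvidia_smi_csv(sample))
```

Separating the parser from the subprocess call keeps the parsing logic testable on machines without a GPU.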

## Requirements

- Linux
- Python 3.8+
- NVIDIA GPU
- `nvidia-smi` in `PATH`
- `systemd` (recommended, for automatic scheduling)

## Installation

On older Python versions, pip automatically installs small compatibility dependencies:

- `tomli` on Python < 3.11
- `backports.zoneinfo` on Python < 3.9
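
The `tomli` shim is needed because `tomllib` only entered the standard library in Python 3.11, and `tomli` exposes the same API. A common sketch of the conditional import:

```python
# tomllib is stdlib on Python 3.11+; tomli is the drop-in backport below that.
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib  # pip installs this automatically on Python < 3.11

config = tomllib.loads('[general]\nserver_name = "AICA_H100"\n')
print(config["general"]["server_name"])  # AICA_H100
```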

```bash
pip install .
# or
pipx install .
```

After publishing:

```bash
pip install gpumanager
# or
pipx install gpumanager
```

## Quick Start

```bash
gpumanager init
gpumanager sample
gpumanager test
gpumanager install-systemd
```

## Commands

- `gpumanager init`
- `gpumanager config set`
- `gpumanager config show`
- `gpumanager sample`
- `gpumanager report`
- `gpumanager test`
- `gpumanager delete-csv`
- `gpumanager status`
- `gpumanager install-systemd`
- `gpumanager uninstall-systemd`

## Configuration

The tool searches for configuration in this order:

1. Path passed with `--config`
2. `GPUMANAGER_CONFIG`
3. `~/.config/gpumanager/config.toml`
4. `/etc/gpumanager/config.toml`
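
A first-existing-file lookup over that precedence list can be sketched like this (`resolve_config_path` is an illustrative helper, not the tool's actual API):

```python
import os
from pathlib import Path

def resolve_config_path(cli_path=None):
    """Return the first existing config file in precedence order, else None."""
    candidates = []
    if cli_path:                                        # 1. --config argument
        candidates.append(Path(cli_path))
    env_path = os.environ.get("GPUMANAGER_CONFIG")      # 2. environment variable
    if env_path:
        candidates.append(Path(env_path))
    candidates.append(Path.home() / ".config/gpumanager/config.toml")  # 3.
    candidates.append(Path("/etc/gpumanager/config.toml"))             # 4.
    for path in candidates:
        if path.is_file():
            return path
    return None
```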

Example:

```toml
[slack]
webhook_url = "https://hooks.slack.com/services/..."

[storage]
csv_dir = "/var/lib/gpumanager"

[report]
send_time = "09:00"
interval = "1d"

[general]
timezone = "Asia/Seoul"
server_name = "AICA_H100"
```

## Before Running Reports

Before `gpumanager` can collect data and send Slack notifications, a few things must be in place:

- `nvidia-smi` must work on the server
- A valid Slack incoming webhook URL must be configured
- The CSV storage directory must be writable
- For automatic collection and reporting, the user-level `systemd` timers must be enabled

Quick manual verification:

```bash
nvidia-smi
gpumanager config show
gpumanager sample
gpumanager test
```
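
Slack incoming webhooks accept an HTTP POST with a JSON body containing a `text` field. A minimal stdlib-only sender sketch (the webhook URL is a placeholder; this is not necessarily how `gpumanager` sends messages internally):

```python
import json
import urllib.request

def build_slack_request(webhook_url, text):
    """Build a JSON POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_to_slack(webhook_url, text, timeout=10):
    """POST the message; Slack returns HTTP 200 with body 'ok' on success."""
    req = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status == 200
```

Keeping request construction separate from the network call makes the payload easy to verify without actually posting to Slack.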

## Automatic Scheduling

`gpumanager` does not start background collection on its own. To run sampling every minute and reporting at the configured send time, install and enable the user timers.

Install unit files:

```bash
gpumanager install-systemd
```

Enable timers:

```bash
systemctl --user daemon-reload
systemctl --user enable --now gpumanager-sample.timer gpumanager-report.timer
```

Check timer status:

```bash
systemctl --user status gpumanager-sample.timer
systemctl --user status gpumanager-report.timer
```

Stop automatic scheduling later if needed:

```bash
systemctl --user disable --now gpumanager-sample.timer gpumanager-report.timer
gpumanager uninstall-systemd
```

## Sampling

Each sample creates a CSV file named like:

```text
2026-03-22T16-21-00.csv
```
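
A filesystem-safe timestamp of that shape (colons in the time replaced with hyphens) can be produced with `strftime`; the fixed KST offset below is an illustrative stand-in for the configured `Asia/Seoul` timezone:

```python
from datetime import datetime, timezone, timedelta

# Illustrative fixed offset standing in for the configured Asia/Seoul timezone.
kst = timezone(timedelta(hours=9))
stamp = datetime(2026, 3, 22, 16, 21, 0, tzinfo=kst)

# Colons are awkward in filenames, so the time part uses hyphens instead.
filename = stamp.strftime("%Y-%m-%dT%H-%M-%S") + ".csv"
print(filename)  # 2026-03-22T16-21-00.csv
```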

Each CSV contains one row per GPU:

```csv
timestamp,gpu_index,gpu_uuid,gpu_name,util_gpu
2026-03-22T16:21:00+09:00,0,GPU-aaa,NVIDIA A100,35
2026-03-22T16:21:00+09:00,1,GPU-bbb,NVIDIA A100,2
```
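
Aggregating average utilization by GPU UUID across many such files amounts to summing and counting per UUID. A sketch over in-memory CSV text (`average_by_uuid` is an illustrative helper, not the tool's API):

```python
import csv
import io
from collections import defaultdict

def average_by_uuid(csv_texts):
    """Average util_gpu per gpu_uuid across any number of sample CSVs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            sums[row["gpu_uuid"]] += float(row["util_gpu"])
            counts[row["gpu_uuid"]] += 1
    # Round to two decimals, matching the report format described below.
    return {uuid: round(sums[uuid] / counts[uuid], 2) for uuid in sums}

sample1 = (
    "timestamp,gpu_index,gpu_uuid,gpu_name,util_gpu\n"
    "2026-03-22T16:21:00+09:00,0,GPU-aaa,NVIDIA A100,35\n"
    "2026-03-22T16:21:00+09:00,1,GPU-bbb,NVIDIA A100,2\n"
)
sample2 = (
    "timestamp,gpu_index,gpu_uuid,gpu_name,util_gpu\n"
    "2026-03-22T16:22:00+09:00,0,GPU-aaa,NVIDIA A100,40\n"
    "2026-03-22T16:22:00+09:00,1,GPU-bbb,NVIDIA A100,3\n"
)
print(average_by_uuid([sample1, sample2]))  # {'GPU-aaa': 37.5, 'GPU-bbb': 2.5}
```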

## Report Format

Reports use the configured `general.server_name` as the bracketed name prefix. Average GPU utilization is rounded to two decimal places.

Example:

```text
[AICA_H100] 2025.09.06 16:49:32 KST
Window: last 1h
GPU 0: 31.38%
GPU 1: 29.39%
GPU 2: 31.57%
GPU 3: 56.36%
GPU 4: 61.25%
GPU 5: 61.52%
GPU 6: 59.88%
GPU 7: 63.93%
```
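
Rendering that layout from per-GPU averages is a few lines of string formatting. A sketch (`format_report` is an illustrative helper; the real report builder may differ):

```python
def format_report(server_name, stamp, window_label, averages):
    """Render the report text: header line, window line, one line per GPU index."""
    lines = [f"[{server_name}] {stamp}", f"Window: last {window_label}"]
    for index in sorted(averages):
        lines.append(f"GPU {index}: {averages[index]:.2f}%")
    return "\n".join(lines)

print(format_report("AICA_H100", "2025.09.06 16:49:32 KST", "1h",
                    {0: 31.38, 1: 29.39}))
```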

## systemd

`gpumanager install-systemd` installs user services into `~/.config/systemd/user/`:

- `gpumanager-sample.service`
- `gpumanager-sample.timer`
- `gpumanager-report.service`
- `gpumanager-report.timer`
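
For orientation, a minutely sample timer in the standard user-unit layout might look like the following; this is an illustrative sketch, and the exact content written by `install-systemd` may differ:

```ini
# ~/.config/systemd/user/gpumanager-sample.timer (illustrative)
[Unit]
Description=Sample GPU utilization every minute

[Timer]
OnCalendar=*-*-* *:*:00
Persistent=true

[Install]
WantedBy=timers.target
```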

Then enable them with:

```bash
systemctl --user daemon-reload
systemctl --user enable --now gpumanager-sample.timer gpumanager-report.timer
```

## Notes

- The report timer runs daily at the configured `send_time`
- The report window is controlled by `report.interval`
- Missing samples are ignored during aggregation
- The README content is used as the package long description, so this setup guide will also be visible on package index web pages after publishing
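
Interval strings like the `"1d"` shown in the configuration example can be parsed into a `timedelta` with a small lookup table; the supported unit suffixes below are an assumption, and `parse_interval` is an illustrative helper, not the tool's API:

```python
from datetime import timedelta

# Assumed unit suffixes; the README only shows "1d" and "1h".
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}

def parse_interval(spec):
    """Parse strings like '30m', '1h', or '1d' into a timedelta."""
    value, unit = int(spec[:-1]), spec[-1]
    if unit not in _UNITS:
        raise ValueError(f"unsupported interval unit: {unit!r}")
    return timedelta(**{_UNITS[unit]: value})

print(parse_interval("1d"))  # 1 day, 0:00:00
print(parse_interval("1h"))  # 1:00:00
```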
