Metadata-Version: 2.4
Name: gmprof
Version: 0.1.0
Summary: GPU VRAM profiler (mprof-style) for NVIDIA GPUs
Home-page: https://github.com/fabulousmatin/gmprof
Author: Matin Bazrafshan
Author-email: Matin Bazrafshan <matinbazrafshan2003@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/fabulousmatin/gmprof
Project-URL: Repository, https://github.com/fabulousmatin/gmprof
Project-URL: Issues, https://github.com/fabulousmatin/gmprof/issues
Keywords: gpu,vram,profiler,nvidia,nvml,cuda
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Debuggers
Classifier: Topic :: System :: Monitoring
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: nvidia-ml-py3
Requires-Dist: psutil
Requires-Dist: tabulate
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Provides-Extra: plot
Requires-Dist: matplotlib; extra == "plot"
Provides-Extra: examples
Requires-Dist: cupy-cuda12x; extra == "examples"
Requires-Dist: matplotlib; extra == "examples"
Dynamic: license-file

# gmprof

`gmprof` is a small NVIDIA GPU VRAM profiler for Python. It is inspired by
`memory_profiler`/`mprof`, but focuses on GPU memory instead of CPU RAM.

It provides:

- an `mprof`-style CLI for sampling a subprocess and writing `.dat` files
- plotting and text reports for sampled `.dat` files
- `vram_overview` as both a decorator and context manager
- `vram_profile` as both a decorator and context manager
- Linux per-process VRAM accounting through NVML
- Windows fallback to total device VRAM when per-process memory is unavailable

## Requirements

- Python 3.8+
- NVIDIA GPU and NVIDIA driver
- NVML access through `nvidia-ml-py3`

Core install:

```bash
pip install gmprof
```

Install plotting support:

```bash
pip install "gmprof[plot]"
```

Install the CuPy example dependencies for CUDA 12:

```bash
pip install "gmprof[examples]"
```

If you use CUDA 11, install the CuPy wheel that matches your CUDA runtime
instead of `cupy-cuda12x`.

## Important Caveats

- `gmprof run` and `vram_overview` are sampling-based. If the sampling interval
  is too long, short-lived GPU allocations can be missed, and repeated runs of
  the same code may produce slightly different results because the samples land
  at different points in the workload. For more precise VRAM measurements,
  especially when tracking short-lived allocations or small differences, use a
  smaller sampling interval; see the [results](examples/results) for a
  comparison. This caveat does not apply to `vram_profile`, which samples at
  Python line events.

- `vram_profile` can significantly increase runtime because it samples on
  every executed line in the profiled function/block. Its `time` column is best
  used for comparing lines relative to each other, not as an exact benchmark of
  unprofiled application speed.
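
The sampling caveat can be illustrated without a GPU. The sketch below is a toy simulation, not part of `gmprof`: `vram_at` and `sampled_peak` are hypothetical helpers, and the workload (a 266 MiB baseline with a 40 ms spike to 1290 MiB) is made up for the illustration.

```python
def vram_at(ms: int) -> int:
    """Hypothetical workload: 266 MiB baseline, 40 ms spike to 1290 MiB."""
    return 1290 if 320 <= ms < 360 else 266


def sampled_peak(interval_ms: int, offset_ms: int = 0, duration_ms: int = 1000) -> int:
    """Peak seen by a sampler that polls every interval_ms, starting at offset_ms."""
    return max(vram_at(t) for t in range(offset_ms, duration_ms, interval_ms))


coarse = sampled_peak(100)                 # samples at 0, 100, 200, ... miss the spike
shifted = sampled_peak(100, offset_ms=30)  # same interval, different phase: 330 ms hits it
fine = sampled_peak(10)                    # a 10 ms interval cannot step over a 40 ms spike

print(coarse, shifted, fine)  # 266 1290 1290
```

The first two calls use the same interval and differ only in when the sampler starts, which is why repeated runs can report different peaks.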

## Platform Behavior

On Linux, `gmprof` uses NVML per-process accounting when available. With
`include_children=True`, child processes are included in the measurement.

On Windows, NVIDIA tooling often does not expose per-process VRAM and reports
process memory as unavailable. In that case, `gmprof` automatically reports
total device VRAM and emits a warning the first time it falls back.
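
The fallback decision can be sketched as a small pure function. This is an illustration under stated assumptions, not `gmprof`'s actual code: `resolve_vram` is a hypothetical name, and the inputs are assumed to come from NVML, for example `(p.pid, p.usedGpuMemory)` pairs from `pynvml.nvmlDeviceGetComputeRunningProcesses(handle)` and `pynvml.nvmlDeviceGetMemoryInfo(handle).used` for the device total. The Python bindings report `usedGpuMemory` as `None` when the driver does not expose per-process memory.

```python
from typing import Iterable, Optional, Set, Tuple


def resolve_vram(
    process_entries: Iterable[Tuple[int, Optional[int]]],
    tracked_pids: Set[int],
    device_used_bytes: int,
) -> Tuple[int, str]:
    """Return (bytes, scope) for the tracked PIDs (hypothetical helper).

    process_entries: (pid, used_bytes) pairs; used_bytes is None when
    per-process memory is unavailable.
    tracked_pids: the target PID plus any child PIDs to include.
    """
    total = 0
    for pid, used in process_entries:
        if pid not in tracked_pids:
            continue
        if used is None:
            # Per-process accounting is unavailable: fall back to the device total.
            return device_used_bytes, "device_total"
        total += used
    return total, "process"


MIB = 1024 * 1024
linux_like = resolve_vram([(100, 256 * MIB), (101, 10 * MIB), (999, 512 * MIB)], {100, 101}, 2000 * MIB)
windows_like = resolve_vram([(100, None)], {100}, 2000 * MIB)
print(linux_like[1], windows_like[1])  # process device_total
```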

## Python API

### Overview Decorator

```python
from gmprof import vram_overview


@vram_overview(device=0, label="train_step")
def train_step():
    ...


train_step()
```

`vram_overview` reports start, end, peak, delta, peak delta, and elapsed time.

### Line-By-Line Decorator

```python
from gmprof import vram_profile


@vram_profile(device=0, label="allocations")
def allocations():
    ...


allocations()
```

`vram_profile` reports each executed line with elapsed line time, used VRAM,
and delta from the previous measured line.

### Context Managers

```python
from gmprof import vram_overview, vram_profile


with vram_overview(device=0, label="block"):
    ...


with vram_profile(device=0, label="line_block"):
    ...
```

The decorator and context-manager forms expose the same profiling behavior.

### Python Arguments

`vram_overview(...)`

| Argument | Default | Meaning |
| --- | --- | --- |
| `device` | `0` | NVIDIA GPU index to inspect. |
| `interval` | `0.01` | Sampling interval in seconds for peak tracking. Shorter intervals catch shorter peaks but add overhead. |
| `label` | `None` | Optional name shown in the printed report. Defaults to the function name or `"overview"`. |
| `include_children` | `True` | Include child process VRAM when per-process accounting is available. |
| `pid` | current process | Process ID to measure. Usually left as default. |

`vram_profile(...)`

| Argument | Default | Meaning |
| --- | --- | --- |
| `device` | `0` | NVIDIA GPU index to inspect. |
| `label` | `None` | Optional name shown in the printed report. Defaults to the function name or `"profile"`. |
| `include_children` | `True` | Include child process VRAM when per-process accounting is available. |
| `pid` | current process | Process ID to measure. Usually left as default. |

## CLI

Profile a command:

```bash
gmprof run -i 0.1 -o profile.dat -- python train.py
```

Child processes are included by default. Exclude them when needed:

```bash
gmprof run --no-children -o profile.dat -- python train.py
```

Generate a report:

```bash
gmprof report profile.dat
```

Plot sampled VRAM:

```bash
gmprof plot profile.dat -o profile.png --no-show
```

### CLI Options

Global options:

| Option | Meaning |
| --- | --- |
| `-h`, `--help` | Show help. |
| `--version` | Print the package version. |

`gmprof run`

| Option | Default | Meaning |
| --- | --- | --- |
| `-o`, `--out` | `gmprofile_TIMESTAMP.dat` | Output `.dat` file. |
| `-i`, `--interval` | `0.1` | Sampling interval in seconds. Lower values catch shorter peaks and add overhead. |
| `-c`, `--include-children` | enabled | Include child process VRAM. |
| `--no-children` | disabled | Exclude child processes. |
| `-d`, `--device` | `0` | GPU device index. |
| `cmd` | required | Command to profile, usually after `--`. |

`gmprof plot`

| Option | Default | Meaning |
| --- | --- | --- |
| `dat_file` | required | Input `.dat` file from `gmprof run`. |
| `-o`, `--output` | none | Save plot to this path, for example `.png` or `.pdf`. |
| `-t`, `--title` | generated | Plot title. |
| `--no-show` | disabled | Save without opening an interactive plot window. |

`gmprof report`

| Option | Default | Meaning |
| --- | --- | --- |
| `dat_file` | required | Input `.dat` file from `gmprof run`. |
| `-o`, `--output` | none | Save report text to this path. |
| `-f`, `--format` | `text` | Report format. Currently only `text`. |

### `.dat` Format

`gmprof run` writes a text file with `#`-prefixed metadata comments, a header
row, and one whitespace-separated sample per line. The sample columns are:

| Column | Meaning |
| --- | --- |
| `timestamp` | Wall-clock sample time. |
| `vram_mib` | VRAM usage in MiB. |
| `scope` | `process` for per-process samples, or `device_total` when fallback is used. |
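
A `.dat` file can be read back with the standard library. The sketch below is a minimal parser written against the layout shown in the example file in this README; `parse_dat` is a hypothetical helper, not a `gmprof` API, and it assumes timestamps of the form `YYYY-MM-DD HH:MM:SS.mmm`.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Sample = Tuple[datetime, float, str]

EXAMPLE = """\
# gmprof profiling data
# pid: 786600
# interval: 0.01
timestamp vram_mib scope
2026-06-29 14:42:40.181 0.000 process
2026-06-29 14:42:40.293 256.000 process
"""


def parse_dat(text: str) -> Tuple[Dict[str, str], List[Sample]]:
    """Split .dat text into metadata and (timestamp, vram_mib, scope) samples."""
    meta: Dict[str, str] = {}
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # "# key: value" comments become metadata; comments without ":" are skipped.
            key, sep, value = line[1:].partition(":")
            if sep:
                meta[key.strip()] = value.strip()
            continue
        if line.startswith("timestamp"):
            continue  # column header row
        # The timestamp itself contains a space, so split from the right.
        stamp, mib, scope = line.rsplit(None, 2)
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
        samples.append((when, float(mib), scope))
    return meta, samples


meta, samples = parse_dat(EXAMPLE)
print(meta["pid"], len(samples), samples[-1][1])  # 786600 2 256.0
```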

## Example Output


```text
@vram_overview: start/end/peak for the workload
[gmprof:decorator_overview] device=0 scope=process | start=258.0 MiB | end=266.0 MiB | peak=1.3 GiB | delta=8.0 MiB | peak_delta=1.0 GiB | time=0.468s
```


```text
@vram_profile: line-by-line usage for the same workload
[gmprof:decorator_profile] device=0 | scope=process | time=1.396s
+----------+---------+------------------------------------------------+-----------+------------+
|   lineno | time    | code                                           | used      | delta      |
+==========+=========+================================================+===========+============+
|       22 | 0.0956s | assert cp is not None                          | 266.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       23 | 0.0927s | a = cp.ones((8192, 8192), dtype=cp.float32)    | 522.0 MiB | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       24 | 0.1097s | cp.cuda.Stream.null.synchronize()              | 522.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       25 | 0.1044s | b = cp.full((8192, 8192), 2, dtype=cp.float32) | 778.0 MiB | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       26 | 0.0900s | cp.cuda.Stream.null.synchronize()              | 778.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       27 | 0.0815s | c = a + b                                      | 1.0 GiB   | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       28 | 0.0851s | cp.cuda.Stream.null.synchronize()              | 1.0 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       29 | 0.0803s | d = c @ b                                      | 1.3 GiB   | +256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       30 | 0.0832s | cp.cuda.Stream.null.synchronize()              | 1.3 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       31 | 0.0757s | del a                                          | 1.3 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       32 | 0.0759s | cp.get_default_memory_pool().free_all_blocks() | 1.0 GiB   | -256.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       33 | 0.0851s | cp.cuda.Stream.null.synchronize()              | 1.0 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       34 | 0.0907s | del b, c, d                                    | 1.0 GiB   | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
|       35 | 0.0835s | cp.get_default_memory_pool().free_all_blocks() | 266.0 MiB | -768.0 MiB |
+----------+---------+------------------------------------------------+-----------+------------+
|       36 | 0.0785s | cp.cuda.Stream.null.synchronize()              | 266.0 MiB | 0.0 B      |
+----------+---------+------------------------------------------------+-----------+------------+
```
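
The `+256.0 MiB` deltas in the table follow from the array sizes: an 8192 × 8192 `float32` array holds 8192 · 8192 · 4 bytes. The sketch below checks that arithmetic and the running totals. `array_mib` and `human` are hypothetical helpers written for this illustration; `gmprof`'s own formatter may differ.

```python
def array_mib(shape, itemsize):
    """Size in MiB of a dense array with the given shape and bytes per element."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize / 2**20


def human(mib):
    """Format MiB the way the table above displays it (assumed rounding)."""
    return f"{mib / 1024:.1f} GiB" if mib >= 1024 else f"{mib:.1f} MiB"


step = array_mib((8192, 8192), 4)  # float32 is 4 bytes per element
used = 266.0                       # baseline from the first table row
totals = []
for _ in ("a", "b", "c", "d"):     # four arrays allocated in sequence
    used += step
    totals.append(human(used))

print(step, totals)  # 256.0 ['522.0 MiB', '778.0 MiB', '1.0 GiB', '1.3 GiB']
```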


### `examples/results/gmprof_fast.dat`

```text
# gmprof profiling data
# pid: 786600
# include_children: True
# device: 0
# interval: 0.01
# start_time: 2026-06-29 14:42:36.939
timestamp vram_mib scope
2026-06-29 14:42:37.055 0.000 process
2026-06-29 14:42:37.225 0.000 process
2026-06-29 14:42:37.336 0.000 process
2026-06-29 14:42:37.443 0.000 process
2026-06-29 14:42:37.537 0.000 process
2026-06-29 14:42:37.651 0.000 process
2026-06-29 14:42:37.738 0.000 process
2026-06-29 14:42:37.832 0.000 process
2026-06-29 14:42:37.923 0.000 process
2026-06-29 14:42:38.020 0.000 process
2026-06-29 14:42:38.110 0.000 process
2026-06-29 14:42:38.196 0.000 process
2026-06-29 14:42:38.302 0.000 process
2026-06-29 14:42:38.412 0.000 process
2026-06-29 14:42:38.512 0.000 process
2026-06-29 14:42:38.613 0.000 process
2026-06-29 14:42:38.712 0.000 process
2026-06-29 14:42:38.805 0.000 process
2026-06-29 14:42:38.899 0.000 process
2026-06-29 14:42:38.991 0.000 process
2026-06-29 14:42:39.082 0.000 process
2026-06-29 14:42:39.172 0.000 process
2026-06-29 14:42:39.264 0.000 process
2026-06-29 14:42:39.362 0.000 process
2026-06-29 14:42:39.459 0.000 process
2026-06-29 14:42:39.547 0.000 process
2026-06-29 14:42:39.636 0.000 process
2026-06-29 14:42:39.722 0.000 process
2026-06-29 14:42:39.813 0.000 process
2026-06-29 14:42:39.903 0.000 process
2026-06-29 14:42:39.996 0.000 process
2026-06-29 14:42:40.181 0.000 process
2026-06-29 14:42:40.293 256.000 process
2026-06-29 14:42:40.435 1026.000 process
2026-06-29 14:42:40.556 1290.000 process
2026-06-29 14:42:40.681 266.000 process
2026-06-29 14:42:40.806 266.000 process
2026-06-29 14:42:40.930 266.000 process
2026-06-29 14:42:41.032 266.000 process
2026-06-29 14:42:41.133 266.000 process
2026-06-29 14:42:41.246 256.000 process
2026-06-29 14:42:41.406 0.000 process
```

### `examples/results/gmprof_fast_report.txt`

```text
============================================================
GMPROF REPORT
============================================================

COMMAND INFO
----------------------------------------
PID:         786600
Device:      0
Children:    True
Start Time:  2026-06-29 14:42:36.939
Scope:       process

SAMPLING INFO
----------------------------------------
Interval:    0.01s
Samples:     42

VRAM USAGE STATISTICS
----------------------------------------
Minimum:     0.0 B
Maximum:     1.3 GiB
Mean:        99.0 MiB
Median:      0.0 B
Std Dev:     260.9 MiB
Total Δ:     1.3 GiB

TIMELINE SUMMARY
----------------------------------------
First:       2026-06-29 14:42:37.055 - 0.000 MiB
Last:        2026-06-29 14:42:41.406 - 0.000 MiB
Peak:        2026-06-29 14:42:40.556 - 1290.000 MiB

============================================================
============================================================
```
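
The statistics in this report can be reproduced from the 42 samples in `gmprof_fast.dat` with Python's `statistics` module. The sketch below is not `gmprof`'s report code. It assumes the sample (n − 1) standard deviation, which matches the reported `260.9 MiB`, and it assumes `Total Δ` is maximum minus minimum, which is consistent with the reported `1.3 GiB`.

```python
import statistics

# vram_mib column of examples/results/gmprof_fast.dat, in file order.
values = [0.0] * 32 + [256.0, 1026.0, 1290.0] + [266.0] * 5 + [256.0, 0.0]

count = len(values)
minimum = min(values)
maximum = max(values)
mean = statistics.mean(values)
median = statistics.median(values)
std_dev = statistics.stdev(values)  # sample standard deviation (n - 1)
total_delta = maximum - minimum     # assumed definition of "Total Δ"

print(count, mean, median, round(std_dev, 1), total_delta)
# 42 99.0 0.0 260.9 1290.0
```

The median is zero because most samples were taken before the workload allocated any VRAM, so the mean and maximum are more informative for this run.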

### Plot Files

The same code measured with different sampling intervals:

![gmprof CLI plot](examples/results/gmprof.png)

![gmprof fast CLI plot](examples/results/gmprof_fast.png)

## License

MIT License. See [LICENSE](LICENSE).
