Metadata-Version: 2.4
Name: nvidia-gpu-admin-tools
Version: 2025.11.21
Summary: NVIDIA GPU administration and configuration tools for managing Confidential Computing modes and debugging
Project-URL: Homepage, https://github.com/fpolica91/gpu-admin-tools
Project-URL: Issues, https://github.com/fpolica91/gpu-admin-tools/issues
Project-URL: Repository, https://github.com/fpolica91/gpu-admin-tools.git
Author: NVIDIA CORPORATION & AFFILIATES
License: MIT
Keywords: admin,cc-mode,confidential-computing,gpu,nvidia,nvlink
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: System :: Hardware
Classifier: Topic :: System :: Systems Administration
Requires-Python: >=3.8
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

# NVIDIA GPU Admin Tools

This utility is used for various configuration including the Confidential Computing modes of supported GPUs as well as some debug/test tasks. It is designed to be run as a privileged python3 command.

Supported CC modes are:

- on
  - All supported GPU security features are enabled (e.g., bus encryption, performance counters off)
- devtools
  - All supported GPU security features are enabled, however blocks preventing DevTools profiling/debugging are lifted
- off
  - The GPU operates in its default mode; no supplementary confidential computing features are enabled

## Most Commonly Used Examples
##### Query the CC mode of all GPUs the system
` sudo python3 ./nvidia_gpu_tools.py --devices gpus --query-cc-mode`
##### Query the CC mode of first 4 GPUs the system
` sudo python3 ./nvidia_gpu_tools.py --devices gpus[0:4] --query-cc-mode`
##### Enable CC mode on all GPUs
` sudo python3 ./nvidia_gpu_tools.py --devices gpus --set-cc-mode=on --reset-after-cc-mode-switch `
##### Disable CC mode on a specific GPU in the system
` sudo python3 ./nvidia_gpu_tools.py --devices 45:00.0 --set-cc-mode=off --reset-after-cc-mode-switch`


##### Generic debug dump from GPU
` sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=45:00.0 --debug-dump --log debug`
##### Debug dump of NVLINK state
` sudo python3 ./nvidia_gpu_tools.py --gpu-bdf=45:00.0 --nvlink-debug-dump --log debug`

## Usage
  ```bash
sudo python3 nvidia_gpu_tools.py --help

NVIDIA GPU Tools version v2025.03.26o
Command line arguments: ['nvidia_gpu_tools.py', '--help']
usage: nvidia_gpu_tools.py [-h] [--devices DEVICES] [--gpu GPU]
                           [--gpu-bdf GPU_BDF] [--gpu-name GPU_NAME]
                           [--no-gpu]
                           [--log {debug,info,warning,error,critical}]
                           [--mmio-access-type {devmem,sysfs}]
                           [--recover-broken-gpu]
                           [--set-next-sbr-to-fundamental-reset]
                           [--reset-with-sbr] [--reset-with-flr]
                           [--reset-with-os] [--remove-from-os]
                           [--sysfs-bind SYSFS_BIND] [--sysfs-unbind]
                           [--query-ecc-state] [--query-cc-mode]
                           [--query-cc-settings] [--query-ppcie-mode]
                           [--query-ppcie-settings] [--query-prc-knobs]
                           [--set-cc-mode {off,on,devtools}]
                           [--reset-after-cc-mode-switch]
                           [--test-cc-mode-switch]
                           [--reset-after-ppcie-mode-switch]
                           [--set-ppcie-mode {off,on}]
                           [--test-ppcie-mode-switch]
                           [--set-bar0-firewall-mode {off,on}]
                           [--query-bar0-firewall-mode]
                           [--query-l4-serial-number] [--query-module-name]
                           [--clear-memory] [--debug-dump]
                           [--nvlink-debug-dump]
                           [--knobs-reset-to-defaults-list]
                           [--knobs-reset-to-defaults KNOBS_RESET_TO_DEFAULTS [KNOBS_RESET_TO_DEFAULTS ...]]
                           [--knobs-reset-to-defaults-assume-no-pending-changes]
                           [--knobs-reset-to-defaults-test] [--noop]
                           [--force-ecc-on-after-reset] [--test-ecc-toggle]
                           [--query-mig-mode] [--force-mig-off-after-reset]
                           [--test-mig-toggle]
                           [--block-nvlink BLOCK_NVLINK [BLOCK_NVLINK ...]]
                           [--block-all-nvlinks] [--test-nvlink-blocking]
                           [--dma-test] [--test-pcie-p2p]
                           [--read-sysmem-pa READ_SYSMEM_PA]
                           [--write-sysmem-pa WRITE_SYSMEM_PA WRITE_SYSMEM_PA]
                           [--read-config-space READ_CONFIG_SPACE]
                           [--write-config-space WRITE_CONFIG_SPACE WRITE_CONFIG_SPACE]
                           [--read-bar0 READ_BAR0]
                           [--write-bar0 WRITE_BAR0 WRITE_BAR0]
                           [--read-bar1 READ_BAR1]
                           [--write-bar1 WRITE_BAR1 WRITE_BAR1]
                           [--ignore-nvidia-driver]
                           {} ...

positional arguments:
  {}

options:
  -h, --help            show this help message and exit
  --devices DEVICES     Generic device selector supporting multiple comma-separated specifiers:
                        - 'gpus' - Find all NVIDIA GPUs
                        - 'gpus[n]' - Find nth NVIDIA GPU
                        - 'gpus[n:m]' - Find NVIDIA GPUs from index n to m
                        - 'nvswitches' - Find all NVIDIA NVSwitches
                        - 'nvswitches[n]' - Find nth NVIDIA NVSwitch
                        - 'vendor:device' - Find devices matching 4-digit hex vendor:device ID
                        - 'domain:bus:device.function' - Find device at specific BDF address
  --gpu GPU
  --gpu-bdf GPU_BDF     Select a single GPU by providing a substring of the
                        BDF, e.g. '01:00'.
  --gpu-name GPU_NAME   Select a single GPU by providing a substring of the
                        GPU name, e.g. 'T4'. If multiple GPUs match, the first
                        one will be used.
  --no-gpu              Do not use any of the GPUs; commands requiring one
                        will not work.
  --log {debug,info,warning,error,critical}
  --mmio-access-type {devmem,sysfs}
                        On Linux, specify whether to do MMIO through /dev/mem
                        or /sys/bus/pci/devices/.../resourceN
  --recover-broken-gpu  Attempt recovering a broken GPU (unresponsive config
                        space or MMIO) by performing an SBR. If the GPU is
                        broken from the beginning and hence correct config
                        space wasn't saved then reenumarate it in the OS by
                        sysfs remove/rescan to restore BARs etc.
  --set-next-sbr-to-fundamental-reset
                        Configure the GPU to make the next SBR same as
                        fundamental reset. After the SBR this setting resets
                        back to False. Supported on H100 only.
  --reset-with-sbr      Reset the GPU with SBR and restore its config space
                        settings, before any other actions
  --reset-with-flr      Reset the GPU with FLR and restore its config space
                        settings, before any other actions
  --reset-with-os       Reset with OS through /sys/.../reset
  --remove-from-os      Remove from OS through /sys/.../remove
  --sysfs-bind SYSFS_BIND
                        Bind devices to the specified driver
  --sysfs-unbind        Unbind devices from the current driver
  --query-ecc-state     Query the ECC state of the GPU
  --query-cc-mode       Query the current Confidential Computing (CC) mode of
                        the GPU.
  --query-cc-settings   Query the Confidential Computing (CC) settings of the
                        GPU.This prints the lower level setting knobs that
                        will take effect upon GPU reset.
  --query-ppcie-mode    Query the current Protected PCIe (PPCIe) mode of the
                        GPU or switch.
  --query-ppcie-settings
                        Query the Protected PPCIe (PPCIe) settings of the GPU
                        or switch.This prints the lower level setting knobs
                        that will take effect upon GPU or switch reset.
  --query-prc-knobs     Query all the Product Reconfiguration (PRC) knobs.
  --set-cc-mode {off,on,devtools}
                        Configure Confidentail Computing (CC) mode. The
                        choices are off (disabled), on (enabled) or devtools
                        (enabled in DevTools mode).The GPU needs to be reset
                        to make the selected mode active. See --reset-after-
                        cc-mode-switch for one way of doing it.
  --reset-after-cc-mode-switch
                        Reset the GPU after switching CC mode such that it is
                        activated immediately.
  --test-cc-mode-switch
                        Test switching CC modes.
  --reset-after-ppcie-mode-switch
                        Reset the GPU or switch after switching PPCIe mode
                        such that it is activated immediately.
  --set-ppcie-mode {off,on}
                        Configure Protected PCIe (PPCIe) mode. The choices are
                        off (disabled) or on (enabled).The GPU or switch needs
                        to be reset to make the selected mode active. See
                        --reset-after-ppcie-mode-switch for one way of doing
                        it.
  --test-ppcie-mode-switch
                        Test switching PPCIE mode.
  --set-bar0-firewall-mode {off,on}
                        Configure BAR0 firewall mode. The choices are off
                        (disabled) or on (enabled).
  --query-bar0-firewall-mode
                        Query the current BAR0 firewall mode of the GPU.
                        Blackwell+ only.
  --query-l4-serial-number
                        Query the L4 certificate serial number without the
                        MSB. The MSB could be either 0x41 or 0x40 based on the
                        RoT returning the certificate chain.
  --query-module-name   Query the module name (aka physical ID and module ID).
                        Supported only on H100 SXM and NVSwitch_gen3
  --clear-memory        Clear the contents of the GPU memory. Supported on
                        Pascal+ GPUs. Assumes the GPU has been reset with SBR
                        prior to this operation and can be comined with
                        --reset-with-sbr if not.
  --debug-dump          Dump various state from the device for debug
  --nvlink-debug-dump   Dump NVLINK debug state.
  --knobs-reset-to-defaults-list
                        Show the supported knobs and their default state
  --knobs-reset-to-defaults KNOBS_RESET_TO_DEFAULTS [KNOBS_RESET_TO_DEFAULTS ...]
                        Set various device configuration knobs to defaults.
                        Supported on Turing+ GPUs and NvSwitch_gen3. See
                        --knobs-reset-to-defaults-list for the list of
                        supported knobs and their defaults on a specific
                        device. The option can be specified multiple times to
                        list specific knobs or 'all' can be used to indicate
                        all supported ones should be reset.
  --knobs-reset-to-defaults-assume-no-pending-changes
                        Indicate that the device was reset after last time any
                        knobs were modified. This allows the reset to defaults
                        to be slightly optimized by querying the current state
  --knobs-reset-to-defaults-test
                        Test knob setting and resetting
  --noop                An empty option that can be used to separate nargs=+
                        options from positional arguments
  --force-ecc-on-after-reset
                        Force ECC to be enabled after a subsequent GPU reset
  --test-ecc-toggle     Test toggling ECC mode.
  --query-mig-mode      Query whether MIG mode is enabled.
  --force-mig-off-after-reset
                        Force MIG mode to be disabled after a subsequent GPU
                        reset
  --test-mig-toggle     Test toggling MIG mode.
  --block-nvlink BLOCK_NVLINK [BLOCK_NVLINK ...]
                        Block the specified NVLinks. NVLinks will be blocked
                        until a subsequent GPU reset (SBR on A100, FLR or SBR
                        on Hopper GPUs [based on OOB configuration], FLR or
                        SBR on Blackwell and later). Supported on A100 and
                        later GPUs that have NVLinks.
  --block-all-nvlinks   Block all NVLinks. See --block-nvlink for more
                        details.
  --test-nvlink-blocking
                        Test blocking NVLinks.
  --dma-test            Check that GPUs are able to perform DMA to all/most of
                        available system memory.
  --test-pcie-p2p       Check that all GPUs are able to perform DMA to each
                        other.
  --read-sysmem-pa READ_SYSMEM_PA
                        Use GPU's DMA to read 32-bits from the specified
                        sysmem physical address
  --write-sysmem-pa WRITE_SYSMEM_PA WRITE_SYSMEM_PA
                        Use GPU's DMA to write specified 32-bits to the
                        specified sysmem physical address
  --read-config-space READ_CONFIG_SPACE
                        Read 32-bits from device's config space at specified
                        offset
  --write-config-space WRITE_CONFIG_SPACE WRITE_CONFIG_SPACE
                        Write 32-bit to device's config space at specified
                        offset
  --read-bar0 READ_BAR0
                        Read 32-bits from GPU BAR0 at specified offset
  --write-bar0 WRITE_BAR0 WRITE_BAR0
                        Write 32-bit to GPU BAR0 at specified offset
  --read-bar1 READ_BAR1
                        Read 32-bits from GPU BAR1 at specified offset
  --write-bar1 WRITE_BAR1 WRITE_BAR1
                        Write 32-bit to GPU BAR1 at specified offset
  --ignore-nvidia-driver
                        Do not treat nvidia driver apearing to be loaded as an
                        error