Metadata-Version: 2.4
Name: brokkr-diagnostics
Version: 0.3.9
Summary: NVIDIA GPU and InfiniBand diagnostics toolkit
Author-email: Brokkr <fabricio.policarpo@hydrahost.com>
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24.4
Requires-Dist: nvidia-gpu-admin-tools>=2025.11.21
Requires-Dist: termcolor>=2.4.0
Requires-Dist: prompt_toolkit>=3.0.0
Provides-Extra: dev
Requires-Dist: pyinstaller>=6.0.0; extra == "dev"

# Brokkr Diagnostics

A comprehensive diagnostic toolkit for NVIDIA GPU and InfiniBand infrastructure. Designed for data center operators, HPC administrators, and GPU cluster managers who need deep visibility into hardware health and configuration.

## Features

- **NVIDIA Driver Diagnostics** - Version checks, module loading, ABI compatibility, installation logs
- **GPU Hardware Analysis** - GPU identification, memory, clocks, power states, thermal monitoring
- **CUDA Environment** - CUDA runtime, compute capability, library versions
- **PCIe Topology** - Bus enumeration, link speed/width validation, topology mapping
- **Kernel Monitoring** - SystemD services, kernel logs, error detection, dmesg analysis
- **System Information** - NUMA topology, /proc filesystem analysis, memory configuration
- **InfiniBand Diagnostics** - IB device enumeration, port status, link state
- **GPU Reset** - Safe GPU reset functionality for stuck or hung devices
- **Confidential Computing** - Enable/disable NVIDIA Confidential Compute mode (requires root)

## Installation

```bash
pip install brokkr-diagnostics
```

## Usage

### Interactive Mode

Launch the interactive menu to explore diagnostics:

```bash
brokkr-diagnostics
```

### Command-Line Mode

Run specific diagnostics non-interactively:

```bash
# Check NVIDIA driver installation and versions
brokkr-diagnostics --run driver

# Analyze GPU hardware state
brokkr-diagnostics --run gpu

# Inspect PCIe topology
brokkr-diagnostics --run lspci

# Check SystemD services
brokkr-diagnostics --run services

# Scan kernel logs for errors
brokkr-diagnostics --run kernel

# View system information (NUMA, /proc)
brokkr-diagnostics --run system

# InfiniBand diagnostics
brokkr-diagnostics --run ib

# CUDA environment checks
brokkr-diagnostics --run cuda

# Run all diagnostics
brokkr-diagnostics --run all

# Reset all GPUs (requires appropriate permissions)
brokkr-diagnostics --run reset

# Enable Confidential Compute mode (requires root)
sudo brokkr-diagnostics --run "cc on"

# Disable Confidential Compute mode (requires root)
sudo brokkr-diagnostics --run "cc off"
```

## Diagnostic Modules

### Driver Diagnostics (`driver`)

Validates NVIDIA driver installation and configuration:
- `nvidia-smi` version and availability
- Driver version from `/proc/driver/nvidia` and `nvidia-smi`
- Kernel module version magic (vermagic) for ABI compatibility
- Installation/uninstallation logs
- DKMS compilation logs
- Loaded modules status (`nvidia`, `nvidia_drm`, `nvidia_modeset`, `nvidia_uvm`, `nvidia_peermem`)
- Module parameters and configuration

### GPU Hardware (`gpu`)

Comprehensive GPU hardware state analysis:
- GPU enumeration and identification
- Memory capacity and utilization
- Clock speeds (graphics, memory, SM)
- Power state and consumption
- Temperature monitoring
- Fan speed
- PCIe generation and link width
- Compute mode
- Persistence mode
- ECC status and errors
- Performance state (P-state)

### PCIe Diagnostics (`lspci`)

PCIe bus topology and link analysis:
- Device enumeration on PCIe bus
- Link speed and width (current vs. maximum)
- Bus topology mapping
- Device class identification
- Vendor and device IDs

### Service Status (`services`)

SystemD service monitoring:
- NVIDIA persistence daemon
- Fabric manager
- GPU driver services
- Service health and uptime

### Kernel Logs (`kernel`)

Kernel message analysis:
- Recent kernel logs (`dmesg`)
- NVIDIA-related messages
- Error detection and filtering
- Hardware error events
- PCIe link errors
- GPU initialization logs

### System Information (`system`)

Linux system configuration:
- NUMA topology
- CPU information from `/proc/cpuinfo`
- Memory configuration from `/proc/meminfo`
- Kernel version from `/proc/version`
- System uptime

### InfiniBand (`ib`)

InfiniBand network diagnostics:
- IB device enumeration
- Port status and state
- Link layer information
- Physical state
- Rate and link speed

### CUDA Diagnostics (`cuda`)

CUDA environment validation:
- CUDA runtime version
- Compute capability per GPU
- CUDA library paths
- nvcc compiler availability
- CUDA sample compilation tests

## Requirements

- Python >= 3.8
- Linux operating system (Ubuntu, CentOS, RHEL, etc.)
- NVIDIA GPU (for GPU-related diagnostics)
- Root/sudo access (for some operations like GPU reset and Confidential Compute)

## Dependencies

- `numpy>=1.24.4`
- `nvidia-gpu-admin-tools>=2025.11.21`
- `termcolor>=2.4.0`
- `prompt_toolkit>=3.0.0`

## Use Cases

### Data Center Operations
- Pre-deployment hardware validation
- Post-maintenance verification
- Troubleshooting GPU issues
- Configuration auditing

### HPC Clusters
- Node health checks
- GPU allocation validation
- Performance baseline verification
- InfiniBand network validation

### ML/AI Infrastructure
- Training environment validation
- Multi-GPU setup verification
- CUDA environment checks
- Memory and thermal monitoring

### CI/CD Pipelines
- Automated hardware testing
- Regression detection
- Configuration compliance

## License

MIT License

## Author

Brokkr by Hydra Host
Contact: fabricio.policarpo@hydrahost.com

## Contributing

Issues and pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
