Metadata-Version: 2.4
Name: brokkr-diagnostics
Version: 0.5.4
Summary: GPU, storage, thermal & infrastructure diagnostics toolkit
Author-email: Brokkr <fabricio.policarpo@hydrahost.com>
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: nvidia-gpu-admin-tools>=2025.11.21
Requires-Dist: termcolor>=2.4.0
Requires-Dist: prompt_toolkit>=3.0.0
Requires-Dist: nvidia-ml-py>=13.590.48
Provides-Extra: dev
Requires-Dist: pyinstaller>=6.0.0; extra == "dev"

# Brokkr Diagnostics

Comprehensive hardware diagnostics for NVIDIA GPU nodes. Collects GPU state, PCIe health, NVLink topology, kernel errors, storage/NVMe health, thermal/IPMI data, and InfiniBand status — structured JSON output, async execution, graceful degradation.

Built for data center operators, HPC administrators, and GPU cluster managers.

## Installation

```bash
pip install brokkr-diagnostics
```

## Quick Start

```bash
# Interactive REPL with tab completion
brokkr-diagnostics --interactive

# Run a specific diagnostic
brokkr-diagnostics --run gpu

# Run all diagnostics
brokkr-diagnostics --run all

# Run and send all diagnostics to Brokkr
brokkr-diagnostics --send

# Run and send a specific diagnostic to Brokkr
brokkr-diagnostics --send gpu
```

## Commands

| Command | Aliases | Description |
|---------|---------|-------------|
| `driver` | | NVIDIA driver version & installation checks |
| `gpu` | | GPU hardware — ECC, retired pages, remapped rows, thermals, power, clocks, PCIe link, processes |
| `nvlink` | | NVLink status, error counters & topology |
| `lspci` | `pcie` | PCIe topology, link status, AER error counters, ACS & IOMMU |
| `kernel` | `logs` | Kernel log analysis — XID/SXid, storage errors, IOMMU faults, OOM, hung tasks |
| `services` | | NVIDIA systemd service status |
| `system` | `proc` | System info — /proc files, NUMA topology |
| `storage` | `nvme`, `disk` | NVMe SMART, error logs, RAID, LVM, filesystem read-only detection |
| `thermal` | `ipmi`, `sensors` | CPU/board thermals (lm-sensors) & IPMI/BMC sensors + System Event Log |
| `ib` | | InfiniBand device/port status, error & traffic counters |
| `cuda` | `cuda-tests` | CUDA memory allocation & P2P bandwidth tests |
| `reset` | | GPU reset sequence (requires root) |

## Usage Examples

### GPU Hardware State

```bash
brokkr-diagnostics --run gpu
```

Returns per-GPU: ECC error counts (volatile + aggregate), retired pages (SBE/DBE/pending), remapped rows (correctable/uncorrectable/pending/failure), memory usage, power draw vs limits, clock speeds, PCIe link gen/width, temperature (GPU + HBM), running processes.

### Kernel Log Analysis

```bash
brokkr-diagnostics --run kernel
```

Scans dmesg, journalctl, and log files for errors across domains: NVIDIA XID/SXid errors, NVSwitch failures, storage/NVMe I/O errors, IOMMU page faults, OOM kills, hung tasks, and soft/hard lockups. Each error is categorized by domain and severity.

### PCIe Diagnostics

```bash
brokkr-diagnostics --run lspci
```

Enumerates NVIDIA PCIe devices, checks link speed/width degradation, reads AER error counters (correctable, fatal, nonfatal) from sysfs, and checks ACS/IOMMU status for P2P readiness.

### Storage & NVMe Health

```bash
brokkr-diagnostics --run storage
```

Collects NVMe SMART data (critical warnings, media errors, spare capacity, wear), NVMe error logs, MD RAID array status, LVM layout, and detects read-only filesystem remounts.

### Thermal & IPMI

```bash
brokkr-diagnostics --run thermal
```

Reads CPU/board thermals via lm-sensors and collects IPMI BMC sensor readings (fans, temps, voltages, PSUs) with threshold status. Parses the IPMI System Event Log for hardware fault history.

### NVLink Topology

```bash
brokkr-diagnostics --run nvlink
```

Per-GPU NVLink link state, version, remote device type, and error counters (CRC flit, CRC data, replay, recovery). Includes GPU-pair topology mapping.

### NVIDIA Driver

```bash
brokkr-diagnostics --run driver
```

Driver version, VBIOS versions, kernel module info (vermagic, parameters), /proc/driver/nvidia state, DKMS logs, installation logs.

### InfiniBand

```bash
brokkr-diagnostics --run ib
```

IB device enumeration with firmware/hardware info, per-port state and rate, error counters, and traffic stats from sysfs.

### GPU Reset

```bash
sudo brokkr-diagnostics --run reset
```

Kills GPU processes, stops NVIDIA services, unloads kernel modules, performs PCIe bus reset, then reloads everything.

## Sending Diagnostics to Brokkr

The `--send` (`-s`) flag runs diagnostics and POSTs the results to the Brokkr API. This is used for remote monitoring and automated health reporting.

```bash
# Send all diagnostics
brokkr-diagnostics --send

# Send a specific diagnostic
brokkr-diagnostics --send gpu
```

When sending all diagnostics, commands that require root or are synchronous-only (e.g., `reset`) are automatically skipped.

### Authentication

Credentials are resolved in this order:

1. **Creds file** — `/var/lib/brokkr/phone-home-creds.json` (keys: `cipher`, `signature`)
2. **Fallback script** — Parsed from `/var/lib/cloud/scripts/per-boot/90-phone-home.sh` if the creds file is missing or incomplete
3. **Environment variables** — `BROKKR_AUTH_CIPHER` and `BROKKR_AUTH_SIGNATURE` override any file-based values

### Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `BROKKR_API_URL` | `https://brokkr.hydrahost.com/api/v1` | API base URL |
| `BROKKR_AUTH_CIPHER` | *(from creds file)* | Authentication cipher |
| `BROKKR_AUTH_SIGNATURE` | *(from creds file)* | Authentication signature |
| `BROKKR_TIMEOUT` | `30` | HTTP request timeout in seconds |

## Optional System Tools

These enhance output when available. Missing tools are detected and skipped — never crashes.

| Tool | Package | Used by |
|------|---------|---------|
| `smartctl` | `smartmontools` | NVMe SMART health |
| `nvme` | `nvme-cli` | NVMe error & smart logs |
| `ipmitool` | `ipmitool` | IPMI sensors & System Event Log |
| `sensors` | `lm-sensors` | CPU/board thermals |
| `lspci` | `pciutils` | PCIe topology & AER counters |
| `mdadm` | `mdadm` | MD RAID status |

## Requirements

- Python >= 3.10
- Linux
- NVIDIA GPU with drivers loaded (for GPU diagnostics)
- Root/sudo for some operations (GPU reset, IPMI, dmesg)

## License

MIT
