Metadata-Version: 2.4
Name: brokkr-diagnostics
Version: 0.6.1
Summary: GPU, storage, thermal & infrastructure diagnostics toolkit
Author-email: Brokkr <fabricio.policarpo@hydrahost.com>
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: nvidia-gpu-admin-tools>=2025.11.21
Requires-Dist: termcolor>=2.4.0
Requires-Dist: prompt_toolkit>=3.0.0
Requires-Dist: nvidia-ml-py>=13.590.48
Requires-Dist: aiohttp>=3.13.3
Provides-Extra: dev
Requires-Dist: pyinstaller>=6.0.0; extra == "dev"

# Brokkr Diagnostics

Comprehensive hardware diagnostics for NVIDIA GPU nodes. Detects GPU health issues, PCIe degradation, NVLink errors, Fabric Manager problems, ECC/memory faults, kernel errors, storage/NVMe health, thermal/IPMI data, and InfiniBand status. A unified health check auto-detects machine topology and produces a single HEALTHY / DEGRADED / UNHEALTHY verdict.

Built for data center operators, HPC administrators, and GPU cluster managers.

## Installation

```bash
pip install brokkr-diagnostics
```

## Quick Start

```bash
# Machine health check: single HEALTHY / DEGRADED / UNHEALTHY verdict
brokkr-diagnostics --run health

# Interactive REPL with tab completion
brokkr-diagnostics --interactive

# Run a specific diagnostic
brokkr-diagnostics --run gpu

# Run all diagnostics
brokkr-diagnostics --run all

# Run and send all diagnostics to Brokkr
brokkr-diagnostics --send
```

## Health Check

The `health` command auto-detects your machine topology (PCIe-only, NVLink, or NVSwitch) and runs targeted checks to produce a single verdict:

```bash
brokkr-diagnostics --run health
```

**What it checks:**

| Category | Checks | When |
|----------|--------|------|
| **gpu_presence** | All GPUs enumerable via NVML | Always |
| **driver** | Driver loaded, persistence mode | Always |
| **memory_health** | ECC errors, row remapping, pending repairs | Always |
| **pcie** | Link speed/width at max, zero fatal errors | Always |
| **thermal_power** | Temperature headroom, no HW throttling, violation time | Always |
| **kernel_errors** | No XID/SXid, OOM kills, GPU fallen off bus in dmesg | Always |
| **p2p_communication** | GPU-to-GPU memory copy works for all pairs | Multi-GPU |
| **nvlink** | All links active, zero CRC/replay errors | NVLink systems |
| **fabric_manager** | FM running, version matches driver, NVSwitch count | NVSwitch systems |
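
The "When" column above can be read as a selection rule over the detected topology. The following is a minimal sketch of that rule, not the tool's actual implementation: the function name `categories_for` and the topology labels `"pcie"` and `"nvlink"` are assumptions (only `"nvswitch"` appears in the sample output), and it assumes NVSwitch systems also count as NVLink systems.

```python
# Categories the table marks as "Always".
ALWAYS = [
    "gpu_presence",
    "driver",
    "memory_health",
    "pcie",
    "thermal_power",
    "kernel_errors",
]


def categories_for(gpu_count: int, topology_type: str) -> list:
    """Pick check categories for a machine (hypothetical helper).

    topology_type is assumed to be one of "pcie", "nvlink", "nvswitch".
    """
    selected = list(ALWAYS)
    if gpu_count > 1:
        # "Multi-GPU" row: P2P copies only make sense with two or more GPUs.
        selected.append("p2p_communication")
    if topology_type in ("nvlink", "nvswitch"):
        # Assumption: NVSwitch machines are also NVLink machines.
        selected.append("nvlink")
    if topology_type == "nvswitch":
        selected.append("fabric_manager")
    return selected


print(categories_for(8, "nvswitch"))
```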

**Output:**
- `HEALTHY` — all checks pass
- `DEGRADED` — warnings present (e.g., thermal headroom low, correctable errors)
- `UNHEALTHY` — failures detected (e.g., PCIe x2 instead of x16, FM version mismatch, P2P broken)
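
One way to picture how per-category results roll up into the three verdicts is a worst-status-wins reduction. This is a sketch under stated assumptions: the sample output only shows the status value `"pass"`, so the `"warn"` and `"fail"` values and the helper name `overall_status` are hypothetical.

```python
def overall_status(category_statuses: list) -> str:
    """Reduce per-category statuses to one verdict (worst status wins).

    Assumes statuses are "pass", "warn", or "fail"; only "pass" is
    confirmed by the sample output in this README.
    """
    if "fail" in category_statuses:
        return "UNHEALTHY"
    if "warn" in category_statuses:
        return "DEGRADED"
    return "HEALTHY"


print(overall_status(["pass", "warn", "pass"]))
```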

**Known issues detection:** When P2P failures are detected, the tool cross-references the GPU architecture against a known-issues database and suggests specific fixes (e.g., Blackwell UVM HMM workaround).

## Commands

| Command | Aliases | Description |
|---------|---------|-------------|
| `health` | `status`, `check` | Unified machine health check (auto-detects topology) |
| `driver` | | NVIDIA driver version & installation checks |
| `gpu` | | GPU hardware — ECC, retired pages, remapped rows, thermals, power, clocks, PCIe link, utilization, violations, BAR1, architecture |
| `nvlink` | | NVLink status, error counters, topology, P2P status matrix, throughput measurement |
| `lspci` | `pcie` | PCIe topology, link status, AER error counters, ACS & IOMMU |
| `kernel` | `logs` | Kernel log analysis — XID/SXid, storage errors, IOMMU faults, OOM, hung tasks |
| `services` | | NVIDIA systemd service status |
| `system` | `proc` | System info — /proc files, NUMA topology |
| `storage` | `nvme`, `disk` | NVMe SMART, error logs, RAID, LVM, filesystem read-only detection |
| `thermal` | `ipmi`, `sensors` | CPU/board thermals (lm-sensors) & IPMI/BMC sensors + System Event Log |
| `ib` | | InfiniBand device/port status, error & traffic counters |
| `cuda` | `cuda-tests` | CUDA memory allocation & P2P bandwidth tests |
| `fabric` | `fm`, `nvswitch` | Fabric Manager health, NVSwitch enumeration, version validation |
| `ecc` | `ras`, `memory-health` | Deep ECC/RAS — per-location errors, row remapper histogram, RMA recommendation |
| `gpu_debug` | `debug`, `hw-debug` | Register-level GPU debug via gpu-admin-tools (requires root) |
| `reset` | | GPU reset sequence (requires root) |

## Usage Examples

### Machine Health Check

```bash
brokkr-diagnostics --run health
```

```json
{
  "overall_status": "HEALTHY",
  "topology": {
    "gpu_count": 8,
    "gpu_name": "NVIDIA A100-SXM4-40GB",
    "topology_type": "nvswitch",
    "nvlink_per_gpu": 12,
    "nvswitch_count": 6
  },
  "categories": [
    {"category": "gpu_presence", "status": "pass", "checks": [...]},
    {"category": "memory_health", "status": "pass", "checks": [...]},
    {"category": "p2p_communication", "status": "pass", "checks": [...]},
    {"category": "nvlink", "status": "pass", "checks": [...]},
    {"category": "fabric_manager", "status": "pass", "checks": [...]}
  ],
  "issue_count": 0,
  "issues": []
}
```
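
A report of this shape is easy to post-process in a monitoring script. The sketch below assumes the report is available as JSON text (how you capture it from the CLI is not specified here) and uses a trimmed, valid version of the sample above; `summarize` is a hypothetical helper name.

```python
import json

# Trimmed version of the sample report, with "checks" left empty.
SAMPLE = """
{
  "overall_status": "HEALTHY",
  "topology": {"gpu_count": 8, "topology_type": "nvswitch"},
  "categories": [
    {"category": "gpu_presence", "status": "pass", "checks": []},
    {"category": "nvlink", "status": "pass", "checks": []}
  ],
  "issue_count": 0,
  "issues": []
}
"""


def summarize(report_text: str) -> tuple:
    """Return (overall_status, names of categories that did not pass)."""
    report = json.loads(report_text)
    not_passing = [
        entry["category"]
        for entry in report["categories"]
        if entry["status"] != "pass"
    ]
    return report["overall_status"], not_passing


status, not_passing = summarize(SAMPLE)
print(status, not_passing)
```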

### GPU Hardware State

```bash
brokkr-diagnostics --run gpu
```

Per-GPU: ECC errors, retired pages, remapped rows, memory usage, power (draw/limits/default/violations), clock speeds with full 9-bit throttle reason decode, PCIe link gen/width with error counters, temperature with shutdown/slowdown thresholds, BAR1 memory, GPU utilization, performance state, architecture, MIG mode, accounting mode.
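
The throttle reason decode mentioned above amounts to unpacking a bitmask. The sketch below uses the bit values of NVML's `nvmlClocksThrottleReason*` constants; the short label strings and the function name are illustrative, not the tool's output format. In practice the mask would typically come from `pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)`.

```python
# Bit values follow NVML's nvmlClocksThrottleReason* constants.
# The label strings are illustrative names chosen for this sketch.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle_reasons(mask: int) -> list:
    """Return the label of every throttle-reason bit set in mask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


print(decode_throttle_reasons(0x48))
```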

### Deep ECC/RAS Diagnostics

```bash
brokkr-diagnostics --run ecc
```

Per-memory-location ECC counters (DRAM vs SRAM), row remapper spare-capacity histogram (Ampere+), pending repairs, InfoROM validation, and automated RMA recommendation.

### NVLink Diagnostics

```bash
brokkr-diagnostics --run nvlink
```

Per-GPU NVLink link state, version, remote device type, error counters. P2P status matrix with NVLink/atomics capability per GPU pair. NVLink throughput delta measurement. Per-link version validation and NVSwitch link count consistency check.

### Fabric Manager & NVSwitch

```bash
brokkr-diagnostics --run fabric
```

NVSwitch device enumeration (count, BIOS version, UUID), Fabric Manager service health, driver/FM version match validation, per-GPU fabric state, FM log error extraction, NVSwitch count validation against known platform configurations.

### Register-Level GPU Debug

```bash
sudo brokkr-diagnostics --run gpu_debug
```

Uses nvidia-gpu-admin-tools for hardware register access (BAR0 MMIO). Detects broken/unresponsive GPUs, security faults, and NVLink training states (ACTIVE/TRAIN/FAULT/RCVY/SHUTDOWN) that are invisible to NVML. Reports physical module ID for SXM GPUs.

### Kernel Log Analysis

```bash
brokkr-diagnostics --run kernel
```

Scans dmesg, journalctl, and log files for errors across domains: NVIDIA XID errors (enriched with severity and recommended action from a 40+ code lookup table), NVSwitch SXid errors, storage/NVMe I/O errors, IOMMU page faults, OOM kills, hung tasks, and soft/hard lockups.
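
To illustrate the XID part of this scan, here is a minimal sketch that extracts the device and code from an `NVRM: Xid` kernel log line. The regular expression, the two-entry note table, and the sample line are assumptions for illustration; the tool's own lookup table is larger and also carries severity and recommended actions.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")

# Tiny illustrative subset of XID meanings, not the tool's table.
XID_NOTES = {
    48: "double-bit ECC error",
    79: "GPU has fallen off the bus",
}


def parse_xid(line: str):
    """Return a dict for an XID log line, or None if the line has no XID."""
    match = XID_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "device": match.group("device"),
        "code": code,
        "note": XID_NOTES.get(code, "unknown"),
    }


sample = "[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
print(parse_xid(sample))
```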

### PCIe Diagnostics

```bash
brokkr-diagnostics --run lspci
```

PCIe device enumeration, link speed/width degradation detection, AER error counters (correctable, fatal, nonfatal) from sysfs and lspci, ACS status on upstream bridges (P2P blocking), IOMMU status and GPU group mapping.
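
Link width degradation can be spotted by comparing two sysfs attributes that Linux exposes per PCI device, `current_link_width` and `max_link_width`. The sketch below shows only that comparison; the function name is hypothetical, link speed is left out for brevity, and this is not necessarily how the tool reads the values.

```python
from pathlib import Path


def link_width_degraded(device_dir: Path) -> bool:
    """True if the device negotiated fewer PCIe lanes than it supports.

    device_dir is a sysfs device directory, for example
    /sys/bus/pci/devices/0000:3b:00.0 (address shown for illustration).
    """
    current = int((device_dir / "current_link_width").read_text().strip())
    maximum = int((device_dir / "max_link_width").read_text().strip())
    return current < maximum
```

A device running at x2 while advertising x16 would return `True`, which corresponds to the "PCIe x2 instead of x16" failure mentioned under Health Check.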

### Storage & NVMe Health

```bash
brokkr-diagnostics --run storage
```

NVMe SMART data (critical warnings, media errors, spare capacity, wear), error logs, MD RAID status, LVM layout, read-only filesystem detection.
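
Read-only filesystem detection can be approximated by scanning mount options in the `/proc/mounts` format. This is a sketch of the idea only: `read_only_mounts` is a hypothetical helper, and a real check would likely ignore filesystems that are read-only by design (for example squashfs images).

```python
def read_only_mounts(mounts_text: str) -> list:
    """Return mount points whose option list contains "ro".

    Expects text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        options = fields[3].split(",")
        if "ro" in options:
            result.append(fields[1])
    return result


sample = (
    "/dev/nvme0n1p2 / ext4 rw,relatime 0 0\n"
    "/dev/nvme1n1 /data xfs ro,relatime 0 0\n"
)
print(read_only_mounts(sample))
```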

### Thermal & IPMI

```bash
brokkr-diagnostics --run thermal
```

CPU/board thermals via lm-sensors, IPMI BMC sensor readings with threshold status, IPMI System Event Log for hardware fault history.

## Sending Diagnostics to Brokkr

The `--send` (`-s`) flag runs diagnostics and POSTs the results to the Brokkr API for remote monitoring.

```bash
# Send all diagnostics
brokkr-diagnostics --send

# Send a specific diagnostic
brokkr-diagnostics --send gpu
```

### Authentication

Credentials are resolved in this order:

1. **Creds file** — `/var/lib/brokkr/phone-home-creds.json` (keys: `cipher`, `signature`)
2. **Fallback script** — Parsed from `/var/lib/cloud/scripts/per-boot/90-phone-home.sh`
3. **Environment variables** — `BROKKR_AUTH_CIPHER` and `BROKKR_AUTH_SIGNATURE`
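
The resolution order above can be sketched as a chain of fallbacks. This is an illustration under assumptions, not the tool's code: `resolve_credentials` is a hypothetical name, and because the format of the fallback script is not documented here, parsing it is delegated to a caller-supplied `parse_script` function.

```python
import json
import os
from pathlib import Path

CREDS_FILE = Path("/var/lib/brokkr/phone-home-creds.json")
FALLBACK_SCRIPT = Path("/var/lib/cloud/scripts/per-boot/90-phone-home.sh")


def resolve_credentials(creds_file=CREDS_FILE, fallback_script=FALLBACK_SCRIPT,
                        parse_script=None, env=None):
    """Return (cipher, signature) from the first source that has both, else None."""
    env = os.environ if env is None else env

    # 1. Creds file with "cipher" and "signature" keys.
    if creds_file.is_file():
        data = json.loads(creds_file.read_text())
        if "cipher" in data and "signature" in data:
            return data["cipher"], data["signature"]

    # 2. Fallback script. Its format is not documented in this README, so
    #    parsing is left to a hypothetical caller-supplied function.
    if parse_script is not None and fallback_script.is_file():
        parsed = parse_script(fallback_script.read_text())
        if parsed is not None:
            return parsed

    # 3. Environment variables.
    cipher = env.get("BROKKR_AUTH_CIPHER")
    signature = env.get("BROKKR_AUTH_SIGNATURE")
    if cipher and signature:
        return cipher, signature
    return None
```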

### Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `BROKKR_API_URL` | `https://brokkr.hydrahost.com/api/v1` | API base URL |
| `BROKKR_AUTH_CIPHER` | *(from creds file)* | Authentication cipher |
| `BROKKR_AUTH_SIGNATURE` | *(from creds file)* | Authentication signature |
| `BROKKR_TIMEOUT` | `30` | HTTP request timeout in seconds |
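
As a rough illustration of how these defaults behave, the sketch below reads the two non-credential settings with the documented defaults. `load_config` and the returned key names are hypothetical; the timeout is converted to a float so it could be passed to something like `aiohttp.ClientTimeout(total=...)`.

```python
import os


def load_config(env=None) -> dict:
    """Read API URL and timeout, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "api_url": env.get("BROKKR_API_URL", "https://brokkr.hydrahost.com/api/v1"),
        "timeout": float(env.get("BROKKR_TIMEOUT", "30")),
    }


print(load_config({"BROKKR_TIMEOUT": "10"}))
```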

## Optional System Tools

Missing tools are detected at runtime and the checks that depend on them are skipped, so an absent tool does not crash the run.

| Tool | Package | Used by |
|------|---------|---------|
| `smartctl` | `smartmontools` | NVMe SMART health |
| `nvme` | `nvme-cli` | NVMe error & smart logs |
| `ipmitool` | `ipmitool` | IPMI sensors & System Event Log |
| `sensors` | `lm-sensors` | CPU/board thermals |
| `lspci` | `pciutils` | PCIe topology & AER counters |
| `mdadm` | `mdadm` | MD RAID status |
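
To see ahead of time which of these tools are absent on a node, a lookup on `PATH` is enough. The sketch below uses the standard library's `shutil.which` and the tool/package pairs from the table; `missing_tools` is a hypothetical helper, and the tool's own detection may work differently.

```python
import shutil

# Tool name -> package name, taken from the table above.
OPTIONAL_TOOLS = {
    "smartctl": "smartmontools",
    "nvme": "nvme-cli",
    "ipmitool": "ipmitool",
    "sensors": "lm-sensors",
    "lspci": "pciutils",
    "mdadm": "mdadm",
}


def missing_tools(which=shutil.which) -> dict:
    """Return {tool: package} for every optional tool not found on PATH."""
    return {
        tool: package
        for tool, package in OPTIONAL_TOOLS.items()
        if which(tool) is None
    }


for tool, package in missing_tools().items():
    print(f"{tool} not found; install package {package} to enable its checks")
```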

## Dependencies

| Package | Purpose |
|---------|---------|
| `nvidia-ml-py` | NVML Python bindings (pynvml) — GPU metrics |
| `nvidia-gpu-admin-tools` | Register-level GPU access — broken GPU detection, NVLink training state |
| `termcolor` | Colored console output |
| `prompt_toolkit` | Interactive CLI |
| `aiohttp` | HTTP client for Brokkr API |

## Requirements

- Python >= 3.10
- Linux
- NVIDIA GPU with drivers loaded
- Root/sudo for `gpu_debug`, `reset`, IPMI access, and reading dmesg

## License

MIT
