⚠️ These actions will stop ALL workloads and cause system downtime!
🔁 System Reboot (IPMI Required)
Warm reboot via IPMI when soft recovery fails
🔍 When this triggers:
All soft recovery actions failed
GPU requires a driver reload (only possible via reboot)
Uncorrectable memory errors (ECC)
GPU firmware errors
⚙️ What it does:
Sends IPMI reset command to BMC
System performs warm reboot (~2-5 min)
All services restart automatically
GPU driver reinitializes fresh
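The reset request described above could be issued with the `ipmitool` CLI. This is a minimal sketch, not the dashboard's actual implementation: the function names, the `lanplus` interface, and the choice to pass the password through the `IPMI_PASSWORD` environment variable (`-E`) are assumptions made for illustration.

```python
import subprocess

# Map the two hard recovery actions to ipmitool's "chassis power" subcommands.
IPMI_ACTIONS = {"reboot": "reset", "power_cycle": "cycle"}

def build_ipmi_command(bmc_host, user, action):
    """Return the ipmitool argv for a hard recovery action (hypothetical helper)."""
    if action not in IPMI_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host, "-U", user,
        "-E",  # read the password from the IPMI_PASSWORD environment variable
        "chassis", "power", IPMI_ACTIONS[action],
    ]

def send_ipmi(bmc_host, user, action, timeout_s=30):
    """Run the command against the BMC; True if ipmitool exits with status 0."""
    result = subprocess.run(
        build_ipmi_command(bmc_host, user, action),
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.returncode == 0
```

Keeping command construction separate from execution makes it possible to log or review the exact command before anything that causes downtime is sent.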
🔄 Recovery Flow:
Soft Reset Failed → PCI Reset Failed → System Reboot
System Reboot Failed/Stuck → Power Cycle (if enabled)
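The escalation order in this flow can be sketched as a simple ladder that stops at the first step that succeeds. The step names and the `actions` mapping are hypothetical; each entry is assumed to be a callable that returns a truthy value on success.

```python
def run_recovery(actions, power_cycle_enabled=False):
    """Try each recovery step in order.

    Returns (name of the step that worked, or None, list of (step, ok) attempts).
    """
    ladder = ["soft_reset", "pci_reset", "system_reboot"]
    if power_cycle_enabled:
        # Power cycle is only attempted when explicitly enabled.
        ladder.append("power_cycle")

    attempts = []
    for step in ladder:
        ok = bool(actions[step]())
        attempts.append((step, ok))
        if ok:
            return step, attempts
    return None, attempts
```

For example, if the soft and PCI resets fail but the reboot succeeds, `run_recovery` returns `"system_reboot"` and never reaches the power cycle step.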
❗ Downtime: ~2-5 minutes. All containers/VMs will be stopped.
⚡ Power Cycle (IPMI Required)
Hard power off/on - last resort when reboot fails or system is stuck
🔍 When this triggers:
System is stuck during shutdown/reboot
GPU firmware halted (microcontroller error)
Reboot did not complete within the timeout
System completely unresponsive
⚙️ What it does:
Sends IPMI power cycle command to BMC
Server powers off completely
BMC waits ~5 seconds
Server powers back on and boots (~3-8 min total)
❗ Downtime: ~3-8 minutes. Equivalent to pulling the power cord. No graceful shutdown.
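One way to detect the "reboot did not complete within the timeout" trigger is to poll a health check until a deadline passes. This is a sketch under stated assumptions: `is_up` is a hypothetical caller-supplied check (for example, an SSH or ping probe), and the 300-second default simply mirrors the upper end of the ~2-5 minute reboot estimate; the dashboard's real timeout is not given in the text.

```python
import time

def wait_for_host(is_up, timeout_s=300, interval_s=10,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll is_up() until it returns True or timeout_s elapses.

    Returns True if the host came back in time, False if the caller
    should escalate (for example, to a power cycle).
    """
    deadline = clock() + timeout_s
    while True:
        if is_up():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `clock` and `sleep` parameters are injectable so the timeout logic can be exercised without actually waiting.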
💾 Disk Cleanup (SSH Required)
Auto-clean when disk usage exceeds 90%
🔍 What it cleans (safe items only):
Docker build cache (docker builder prune)
Dangling images (unused, untagged)
Old container logs (>7 days)
Systemd journal logs (>7 days)
✅ Safe: Does NOT delete stopped containers, volumes, or running workload data.
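The threshold check and the safe cleanup commands might look like the following sketch. The function names are hypothetical. The Docker and `journalctl` invocations are standard CLI commands, but the exact set the dashboard runs is an assumption; container log rotation is left out because there is no single Docker command for it.

```python
import shutil

THRESHOLD_PERCENT = 90  # from the text: clean when usage exceeds 90%

def usage_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def cleanup_commands(retention_days=7):
    """Commands that only remove caches, dangling images, and old journal logs."""
    return [
        ["docker", "builder", "prune", "--force"],  # build cache
        ["docker", "image", "prune", "--force"],    # dangling images only (default)
        ["journalctl", f"--vacuum-time={retention_days}d"],
    ]

def plan_cleanup(percent_used):
    """Return the commands to run, or an empty list below the threshold."""
    return cleanup_commands() if percent_used > THRESHOLD_PERCENT else []
```

None of these commands touch stopped containers or volumes, which matches the safety note above.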
🏷️ Auto Maintenance Flag
Create maintenance task when recovery fails repeatedly
🔍 When this triggers:
All enabled recovery actions have been tried and failed
Same GPU error recurs more than 3 times in 24 hours
Multiple power cycles needed in a short period
⚙️ What it does:
Creates a maintenance task in the dashboard
Records all recovery attempts and outcomes
Flags a likely hardware issue (for example, GPU replacement)
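The recurrence trigger (more than 3 occurrences of the same GPU error in 24 hours) can be sketched as a sliding-window count. The function name and the input shape (a list of timestamps for one error on one GPU) are assumptions for illustration.

```python
from datetime import datetime, timedelta

def needs_maintenance(error_times, now, max_errors=3,
                      window=timedelta(hours=24)):
    """True if more than max_errors occurrences fall within the window ending at `now`."""
    recent = [t for t in error_times if now - window <= t <= now]
    return len(recent) > max_errors
```

With the defaults, three errors in the last 24 hours do not raise the flag, while a fourth does.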
Predictive Failure Analytics
⚠️ Data Requirement: Predictions require at least 30 days of historical data for accuracy.
Results improve with more data (90+ days recommended).
📊 Predictions are best-effort estimates, similar to weather forecasts. Use them as guidelines for proactive maintenance, not as guarantees.
Root Cause Analysis
Tip: Describe issues not captured in SEL, like PCI errors or GPU issues
Try asking:
Which servers need maintenance?
Show ECC errors
NVIDIA server health
Why memory issues on brickbox-06?
👋 Hi! I'm your AI assistant. I have full context of your server fleet (devices, IPs, 72h of events and sensors). Ask me anything about maintenance, health, or issues.