📊 Overview
IPMI Monitor is a web-based tool for monitoring server hardware via IPMI (Intelligent Platform Management Interface) and Redfish APIs. It provides real-time visibility into your server fleet's health.
What It Monitors
- System Event Log (SEL) - Hardware events, errors, warnings
- Sensor Readings - Temperature, voltage, fan speed, power consumption
- Hardware Inventory - CPU, memory, storage, GPU information
- Connectivity Status - BMC and OS reachability
- Power State - On/off status with remote control
🚀 Quick Start
1. Add Your First Server
- Go to Settings → Manage Servers
- Click ➕ Add New Server
- Enter the BMC IP address (e.g., 192.168.1.100)
- Give it a friendly name (e.g., server-01)
- Click Add Server
2. Configure IPMI Credentials
If your servers use custom IPMI credentials:
- Click the server in the list to edit
- Enter the IPMI username and password
- Click 🔌 Test BMC to verify
- Save changes
3. View Server Health
Return to the Dashboard to see your servers. Click any server card to view detailed events, sensors, and inventory.
🎯 Key Concepts
BMC (Baseboard Management Controller)
A dedicated processor on the server motherboard that operates independently of the main CPU. It allows remote monitoring and management even when the server is powered off or the OS has crashed.
IPMI vs Redfish
| IPMI | Redfish |
|---|---|
| Legacy protocol (UDP port 623) | Modern REST API (HTTPS port 443) |
| Widely supported | More detailed information |
| Binary protocol | JSON responses |
BMC IP vs OS IP
- BMC IP - The management network IP (often an address ending in 0, e.g., 192.168.1.100)
- OS IP - The server's main network IP where the OS runs (often the next address, ending in 1, e.g., 192.168.1.101)
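The example addresses in this guide pair each BMC IP with an OS IP one address higher (192.168.1.100 and 192.168.1.101). The sketch below shows how such a pairing could be computed. The function name `guess_os_ip` and the fixed offset of 1 are assumptions for illustration; your network may follow a different scheme, and this is not necessarily how IPMI Monitor derives the OS IP.

```python
import ipaddress


def guess_os_ip(bmc_ip: str, offset: int = 1) -> str:
    """Guess the OS IP by adding a fixed offset to the BMC IP.

    ASSUMPTION: the "+1" pairing is inferred from the example addresses
    in this guide; it is not a documented rule.
    """
    return str(ipaddress.ip_address(bmc_ip) + offset)


print(guess_os_ip("192.168.1.100"))  # 192.168.1.101
```

If your servers do not follow a fixed pattern, set the OS IP explicitly for each server instead.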
📱 Dashboard
The main dashboard shows all monitored servers in a grid view.
Server Cards
Each card displays:
- Server Name and BMC IP
- Status Badge: 🟢 Online, 🔴 Offline, 🟡 Warning
- Event Count: Events recorded in the last 24 hours
- Temperature: Current CPU/inlet temperature
Auto-Refresh
Data refreshes automatically every 60 seconds. Event collection runs every 5 minutes by default (configurable via POLL_INTERVAL).
🖥️ Server Details
Click any server card to view detailed information across three tabs.
Events Tab
Shows System Event Log (SEL) entries with:
- Timestamp - When the event occurred
- Severity - Critical (🔴), Warning (🟡), Info (🔵)
- Description - Event message from BMC
Event Actions
- Clear DB Events - Remove from IPMI Monitor only (BMC unaffected)
- Clear BMC SEL - Clear actual BMC log (⚠️ Admin only, use carefully)
Sensors Tab
Real-time sensor readings including:
- Temperature sensors (CPU, inlet, exhaust, DIMMs)
- Voltage sensors (3.3V, 5V, 12V, battery)
- Fan speeds (RPM)
- Power consumption (Watts)
Inventory Tab
Hardware information collected via IPMI FRU, Redfish, and SSH:
- System manufacturer, model, serial number
- CPU model, core count
- Memory total, slots used
- Storage devices with sizes
- GPU information (if present)
📋 Events & Logs
Common Event Types
| Event | Meaning | Action |
|---|---|---|
| Correctable ECC Error | Memory error detected and corrected | Monitor frequency; replace DIMM if recurring |
| Uncorrectable ECC Error | Memory error that couldn't be fixed | Replace DIMM immediately |
| Temperature Threshold | Component exceeded temperature limit | Check cooling, clean dust, verify airflow |
| Fan Failure | Fan stopped or below speed threshold | Replace fan ASAP to prevent overheating |
| Power Supply Failure | PSU issue detected | Check/replace PSU, verify redundancy |
🌡️ Sensor Readings
Temperature Guidelines
| Sensor | Normal | Warning | Critical |
|---|---|---|---|
| CPU Temperature | < 70°C | 70-85°C | > 85°C |
| Inlet Temperature | < 30°C | 30-40°C | > 40°C |
| DIMM Temperature | < 60°C | 60-75°C | > 75°C |
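To make the band boundaries in the table explicit, here is a minimal sketch that classifies a reading. The sensor keys (`cpu`, `inlet`, `dimm`) and the function name are hypothetical; only the threshold values come from the table. A reading exactly on a boundary (for example 70°C or 85°C for the CPU) is treated as Warning, matching the "70-85°C" range.

```python
# Thresholds copied from the table: (warning starts at, critical above), in °C.
TEMP_LIMITS = {
    "cpu": (70, 85),
    "inlet": (30, 40),
    "dimm": (60, 75),
}


def classify_temperature(sensor: str, celsius: float) -> str:
    """Return 'normal', 'warning', or 'critical' for a temperature reading."""
    warn_from, crit_above = TEMP_LIMITS[sensor]
    if celsius > crit_above:
        return "critical"
    if celsius >= warn_from:
        return "warning"
    return "normal"


print(classify_temperature("cpu", 72))    # warning
print(classify_temperature("inlet", 25))  # normal
```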
Voltage Guidelines
| Rail | Normal Range |
|---|---|
| 3.3V | 3.1V - 3.5V |
| 5V | 4.75V - 5.25V |
| 12V | 11.4V - 12.6V |
| VBAT (Backup Battery) | 2.8V - 3.3V |
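The 5V and 12V ranges above correspond to ±5% of the nominal voltage (5 × 0.95 = 4.75, 12 × 1.05 = 12.6). A range check against the table might look like the sketch below; the dictionary keys and function name are illustrative, not part of IPMI Monitor.

```python
# Normal ranges copied from the table: (low, high) in volts.
VOLTAGE_RANGES = {
    "3.3V": (3.1, 3.5),
    "5V": (4.75, 5.25),
    "12V": (11.4, 12.6),
    "VBAT": (2.8, 3.3),
}


def voltage_in_range(rail: str, volts: float) -> bool:
    """Return True if the reading falls inside the rail's normal range (inclusive)."""
    low, high = VOLTAGE_RANGES[rail]
    return low <= volts <= high


print(voltage_in_range("12V", 11.9))  # True
print(voltage_in_range("5V", 4.7))    # False
```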
🔧 Hardware Inventory
Data Sources
| Source | Data Collected | Requirements |
|---|---|---|
| IPMI FRU | Manufacturer, model, serial, board info | IPMI access |
| IPMI SDR | Sensor list, CPU/DIMM counts | IPMI access |
| Redfish API | Detailed CPU, memory, storage, GPU | Redfish-enabled BMC |
| SSH to OS | Exact CPU model, memory config, drives | SSH enabled + credentials |
Collecting Inventory
Inventory is collected automatically during setup. To refresh:
- Go to Server Detail → Inventory tab
- Click 📦 Collect Inventory
- Wait for collection to complete
For bulk collection: Settings → Manage Servers → 📦 Collect All Inventory
⚙️ Manage Servers
Adding Servers
Go to Settings → Manage Servers → Add New Server:
- BMC IP - The IPMI management IP
- Server Name - A friendly name for identification
- OS IP - Optional, for SSH inventory collection
- Protocol - Auto (recommended), IPMI only, or Redfish only
Editing Servers
Click any server in the list to open the edit dialog:
- Change name, IPs, protocol
- Set custom IPMI credentials
- Configure SSH credentials
- Test BMC - Verify IPMI connection
- Test SSH - Verify SSH connection
- Check Redfish - Test Redfish availability
Bulk Import
Import servers from a YAML or JSON file. Mount your config file at /app/config/servers.yaml:
# servers.yaml example
servers:
  - name: server-01
    bmc_ip: 192.168.1.100
    server_ip: 192.168.1.101
  - name: server-02
    bmc_ip: 192.168.1.102
    server_ip: 192.168.1.103
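Before mounting a file, it can help to check it for obvious mistakes. The sketch below validates the same structure as a Python dictionary and prints it as JSON. It assumes a JSON import file uses the same keys as the YAML example (`servers`, `name`, `bmc_ip`, `server_ip`), which this guide does not confirm, and that `server_ip` is optional. The function `validate_servers` is hypothetical.

```python
import ipaddress
import json


def validate_servers(config: dict) -> list[str]:
    """Return a list of problems; an empty list means no issues were found."""
    problems = []
    seen_bmc_ips = set()
    for i, server in enumerate(config.get("servers", [])):
        if not server.get("name"):
            problems.append(f"entry {i}: missing name")
        bmc_ip = server.get("bmc_ip")
        if bmc_ip is None:
            problems.append(f"entry {i}: missing bmc_ip")
        elif bmc_ip in seen_bmc_ips:
            problems.append(f"entry {i}: duplicate bmc_ip {bmc_ip!r}")
        seen_bmc_ips.add(bmc_ip)
        # server_ip is treated as optional; validate it only when present.
        for key in ("bmc_ip", "server_ip"):
            value = server.get(key)
            if value is None:
                continue
            try:
                ipaddress.ip_address(value)
            except ValueError:
                problems.append(f"entry {i}: {key} is not a valid IP: {value!r}")
    return problems


config = {
    "servers": [
        {"name": "server-01", "bmc_ip": "192.168.1.100", "server_ip": "192.168.1.101"},
        {"name": "server-02", "bmc_ip": "192.168.1.102", "server_ip": "192.168.1.103"},
    ]
}
print(json.dumps(config, indent=2))
print(validate_servers(config))  # []
```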
🔐 SSH Configuration
SSH enables detailed inventory collection from the server's OS.
Enable SSH
- Go to Settings → SSH tab
- Toggle Enable SSH to OS
- Configure default credentials
SSH Key Management
Store SSH keys centrally and assign them to servers:
- Click ➕ Add New Key
- Give it a name (e.g., "Production Key")
- Paste the private key content (it begins with -----BEGIN OPENSSH PRIVATE KEY-----)
- Use the dropdown in server edit to assign
Per-Server Overrides
Each server can have custom SSH settings different from the defaults:
- Custom OS IP (if different from BMC IP pattern)
- Custom username
- Different SSH key
- Custom port
🔔 Alerts & Rules
Alert Rules
Pre-configured rules watch for:
- Temperature exceeding thresholds
- Fan speed below minimum
- ECC memory errors
- Power supply issues
- Critical BMC events
Creating Custom Rules
- Go to Settings → Alerts
- Click Add Rule
- Select alert type and condition
- Set threshold and severity
- Enable notification channels
Cooldown
Each rule has a cooldown period to prevent alert spam. Default is 5-15 minutes depending on severity.
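As an illustration of how a cooldown suppresses repeat alerts, here is a minimal sketch. It assumes the cooldown is tracked per rule and per server, which this guide does not specify; the class `CooldownTracker` is hypothetical and not IPMI Monitor's actual implementation.

```python
import time
from typing import Optional


class CooldownTracker:
    """Allow an alert to fire only if its cooldown has elapsed."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown_seconds = cooldown_seconds
        self._last_fired = {}  # (rule, server) -> timestamp of last fired alert

    def should_fire(self, rule: str, server: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        key = (rule, server)
        last = self._last_fired.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down: suppress the alert
        self._last_fired[key] = now
        return True


tracker = CooldownTracker(cooldown_seconds=300)  # 5 minutes
print(tracker.should_fire("fan_speed_low", "server-01", now=0))    # True
print(tracker.should_fire("fan_speed_low", "server-01", now=120))  # False
print(tracker.should_fire("fan_speed_low", "server-01", now=300))  # True
```

Suppressed alerts do not reset the timer in this sketch, so a persistent fault fires again once per cooldown period.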
📬 Notifications
Telegram Setup
- Message @BotFather on Telegram
- Create a new bot with /newbot
- Copy the bot token
- Get your chat ID (message @userinfobot)
- Paste both in Settings → Notifications → Telegram
- Click Test to verify
Email Setup
Configure SMTP settings for email notifications. Works with Gmail, SendGrid, or any SMTP server.
Webhook
Send alerts to Slack, Discord, or custom endpoints. Webhooks receive JSON payloads with alert details.
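If you point the webhook at a custom endpoint, that endpoint has to parse the JSON payload itself. The sketch below turns a payload into a Slack-style message body (`{"text": ...}`). The payload field names `severity`, `server`, and `message` are assumptions made for this example; inspect a real payload from your installation to see the actual fields.

```python
import json


def to_slack_message(alert_json: str) -> dict:
    """Convert an alert payload into a Slack incoming-webhook body.

    ASSUMPTION: the field names below are hypothetical, not a documented schema.
    """
    alert = json.loads(alert_json)
    severity = str(alert.get("severity", "unknown")).upper()
    server = alert.get("server", "unknown server")
    message = alert.get("message", "")
    return {"text": f"[{severity}] {server}: {message}"}


payload = '{"severity": "critical", "server": "server-01", "message": "Fan Failure"}'
print(to_slack_message(payload))  # {'text': '[CRITICAL] server-01: Fan Failure'}
```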
🛡️ Security & Users
User Roles
| Role | Permissions |
|---|---|
| Admin | Full access: manage users, security, AI features, power control |
| Read-Write | Manage servers, run power commands, but not user management |
| Read-Only | View only - no changes allowed |
Anonymous Access
Enable to allow viewing the dashboard without login. Anonymous users get read-only access.
📊 Prometheus & Grafana Integration
IPMI Monitor provides a built-in Prometheus exporter for integration with your existing monitoring stack.
Prometheus Metrics Endpoint
Metrics are exposed at /metrics in Prometheus text format:
http://your-ipmi-monitor:5000/metrics
Available Metrics
| Metric | Type | Description |
|---|---|---|
| ipmi_server_reachable | Gauge | Whether BMC is reachable (1=yes, 0=no) |
| ipmi_server_power_on | Gauge | Server power state (1=on, 0=off) |
| ipmi_temperature_celsius | Gauge | Temperature readings per sensor |
| ipmi_fan_speed_rpm | Gauge | Fan speed readings |
| ipmi_voltage_volts | Gauge | Voltage sensor readings |
| ipmi_power_watts | Gauge | Power consumption |
| ipmi_events_total | Gauge | Total events collected per server |
| ipmi_events_critical_24h | Gauge | Critical events in last 24h |
| ipmi_events_warning_24h | Gauge | Warning events in last 24h |
| ipmi_total_servers | Gauge | Total monitored servers |
| ipmi_reachable_servers | Gauge | Number of reachable servers |
| ipmi_alerts_total | Gauge | Total fired alerts |
| ipmi_alerts_unacknowledged | Gauge | Unacknowledged alerts |
| ipmi_last_collection_timestamp | Gauge | Unix timestamp of last collection |
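To show what the Prometheus text format looks like and how a script could read it without a Prometheus server, here is a minimal parser sketch. The sample text is hand-written for illustration: the values are made up, and the `server` label name is an assumption (the `sensor` label appears in the alert examples in this guide). The parser handles only simple lines and is not a full implementation of the exposition format.

```python
import re

# Hypothetical sample output; real label names and values may differ.
SAMPLE = '''\
# HELP ipmi_temperature_celsius Temperature readings per sensor
# TYPE ipmi_temperature_celsius gauge
ipmi_temperature_celsius{server="server-01",sensor="CPU1 Temp"} 54
ipmi_server_reachable{server="server-01"} 1
ipmi_total_servers 2
'''

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>.*)\})?'              # optional {label="value",...}
    r'\s+(?P<value>\S+)'                    # sample value
    r'(?:\s+\d+)?$'                         # optional timestamp (ignored)
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```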
Prometheus Configuration
Add this to your prometheus.yml:
scrape_configs:
  - job_name: 'ipmi-monitor'
    static_configs:
      - targets: ['ipmi-monitor:5000']
    scrape_interval: 60s
    scrape_timeout: 30s
    metrics_path: /metrics
Choose the target that matches your setup:
- ipmi-monitor:5000 - Docker network (container name)
- localhost:5000 - Same host
- 192.168.1.50:5000 - Remote IP
Pre-built Grafana Dashboard
We provide a ready-to-import Grafana dashboard with:
- Fleet Overview - Total servers, reachable count, alerts
- Server Health - Per-server temperature, power, events
- Event Timeline - Critical/warning events over time
- Temperature Heatmap - Temperature trends across fleet
- Alert History - Alert counts and status
Import Dashboard
- Go to Grafana → Dashboards → Import
- Download the dashboard JSON from: github.com/cryptolabsza/ipmi-monitor/grafana/dashboards/ipmi-monitor.json
- Upload or paste the JSON
- Select your Prometheus data source
- Click Import
Example Grafana Alerts
Create Grafana alerts based on IPMI Monitor metrics:
# High Temperature Alert
ipmi_temperature_celsius{sensor=~"CPU.*"} > 80
# Server Unreachable
ipmi_server_reachable == 0
# Critical Events Spike
delta(ipmi_events_critical_24h[1h]) > 5
# Multiple Servers Down
count(ipmi_server_reachable == 0) > 2
/metrics reads cached data from the last collection cycle (default: every 5 minutes). Faster scrape intervals won't give you fresher data - they'll just read the same values repeatedly.
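The arithmetic behind this is simple: with a 60-second scrape interval and a 5-minute collection cycle, roughly four out of every five scrapes return unchanged values. The sketch below works that out and also shows how `ipmi_last_collection_timestamp` could be used to detect stale data. The function names and the "two missed cycles" staleness threshold are assumptions for illustration.

```python
import time
from typing import Optional

POLL_INTERVAL_SECONDS = 300  # default collection cycle (5 minutes)


def redundant_scrapes(scrape_interval_seconds: int) -> int:
    """Scrapes per collection cycle that re-read unchanged cached values."""
    return max(POLL_INTERVAL_SECONDS // scrape_interval_seconds - 1, 0)


def is_stale(last_collection_ts: float, now: Optional[float] = None,
             max_missed_cycles: int = 2) -> bool:
    """True if the last collection is older than the allowed number of cycles.

    ASSUMPTION: two missed cycles is an arbitrary threshold for this example.
    """
    now = time.time() if now is None else now
    return now - last_collection_ts > max_missed_cycles * POLL_INTERVAL_SECONDS


print(redundant_scrapes(60))       # 4
print(is_stale(1000, now=1500))    # False (500 s old, limit is 600 s)
print(is_stale(1000, now=1700))    # True  (700 s old)
```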
🤖 AI Features
Premium AI features provide intelligent analysis of your server fleet.
Features Included
- Fleet Health Summaries - Daily overview of all servers
- Maintenance Tasks - AI-identified work items with priorities
- Predictive Analytics - Failure predictions before they happen
- Root Cause Analysis - Deep analysis of specific events
- AI Chat - Interactive assistant for questions
Getting Started
- Go to Settings → AI Features
- Click Start Free Trial
- Sign up for a CryptoLabs account
- AI features activate automatically
Pricing
- 1 month free trial
- Then $100/month for up to 50 servers
- +$15 per additional 10 servers
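As a worked example of the pricing after the trial, the sketch below computes the monthly cost for a given fleet size. It assumes a partial block of 10 additional servers is billed as a full block, which the pricing list above does not state; confirm with CryptoLabs before budgeting.

```python
import math


def monthly_cost(servers: int) -> int:
    """Monthly cost in USD after the free trial.

    ASSUMPTION: each started block of 10 servers beyond 50 costs $15.
    """
    if servers < 0:
        raise ValueError("servers must be non-negative")
    extra_servers = max(servers - 50, 0)
    return 100 + 15 * math.ceil(extra_servers / 10)


print(monthly_cost(50))  # 100
print(monthly_cost(60))  # 115
print(monthly_cost(75))  # 145 (25 extra servers -> 3 blocks of 10)
```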
💬 AI Chat
Ask questions about your servers in natural language.
Example Questions
- "Which servers have high temperatures?"
- "Show me servers with ECC errors"
- "What maintenance is needed this week?"
- "Explain this error: [paste event]"
- "How do I add a new server?"
- "What does ECC mean?"
Tips for Better Responses
- Be specific about which server if asking about one
- Include time ranges when relevant ("in the last 24 hours")
- Ask follow-up questions for more detail
🔧 Maintenance Tasks
AI analyzes events and sensors to generate maintenance work items.
Priority Levels
| Priority | Meaning | Timeframe |
|---|---|---|
| Critical | Immediate risk of outage | Today |
| High | Component degrading | This week |
| Medium | Needs attention | Next maintenance window |
| Low | Monitor and plan | When convenient |
Task Information
Each task includes:
- Affected Servers - Specific server names
- Component - What hardware needs attention
- Reason - Why this task was generated
- Suggested Action - What to do
- Evidence - Supporting data from events/sensors
🔍 Troubleshooting
Server Shows Offline
- Verify BMC IP is reachable: ping 192.168.1.100
- Check IPMI credentials in server edit
- Use Test BMC button to diagnose
- Verify firewall allows UDP port 623 (IPMI)
- Try accessing BMC web interface directly
SSH Test Fails
- "Permission denied" - Wrong SSH key or password
- "Connection refused" - SSH not running or wrong port
- "No route to host" - Wrong IP or network issue
- "error in libcrypto" - Key format issue, re-paste the key
Missing Inventory Data
- Enable SSH in Settings → SSH tab
- Configure SSH credentials for the server
- Click Collect Inventory
- Check SSH connectivity with Test SSH button
No Events Showing
- Wait for collection cycle (default 5 minutes)
- Verify server is enabled in settings
- Some BMCs have empty SEL by default
- Check BMC firmware supports SEL
📚 Glossary
| Term | Definition |
|---|---|
| BMC | Baseboard Management Controller - dedicated processor for server management |
| IPMI | Intelligent Platform Management Interface - protocol for BMC communication |
| Redfish | Modern REST API alternative to IPMI |
| SEL | System Event Log - BMC's record of hardware events |
| FRU | Field Replaceable Unit - hardware inventory data |
| SDR | Sensor Data Record - sensor configuration data |
| ECC | Error Correcting Code - memory error detection/correction |
| DIMM | Dual Inline Memory Module - RAM stick |
| PSU | Power Supply Unit |
| VBAT | Backup battery voltage (usually CR2032 for CMOS) |
| iDRAC | Dell's BMC implementation |
| iLO | HPE's (formerly HP's) BMC implementation |
🔌 API Reference
IPMI Monitor provides a REST API for integration.
Authentication
API endpoints require session authentication. Log in with a POST to /login.
Key Endpoints
GET /api/servers - List all servers
GET /api/servers/managed - List managed servers
GET /api/server/{ip}/events - Get server events
GET /api/server/{ip}/sensors - Get sensor readings
GET /api/servers/{ip}/inventory - Get hardware inventory
POST /api/servers/{ip}/inventory - Collect inventory
GET /api/auth/status - Check auth status
POST /api/test/bmc - Test BMC connection
POST /api/test/ssh - Test SSH connection
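A session-based client for these endpoints could be sketched with the Python standard library as follows. Several details are assumptions, labeled in the comments: the host and port are reused from the metrics example, the login form field names (`username`, `password`) are guesses, and the responses are assumed to be JSON. Only the endpoint paths come from the list above.

```python
import json
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "http://your-ipmi-monitor:5000"  # placeholder host, as in the metrics example


def endpoint(path_template: str, ip: str = "") -> str:
    """Build a full URL, substituting {ip} in the path if present."""
    return BASE_URL + path_template.replace("{ip}", urllib.parse.quote(ip, safe=""))


def make_session() -> urllib.request.OpenerDirector:
    """Create an opener that keeps the session cookie between requests."""
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(CookieJar()))


def login(session: urllib.request.OpenerDirector, username: str, password: str) -> None:
    # ASSUMPTION: /login accepts form-encoded "username" and "password" fields.
    data = urllib.parse.urlencode({"username": username, "password": password}).encode()
    session.open(BASE_URL + "/login", data=data, timeout=10).close()


def get_json(session: urllib.request.OpenerDirector, url: str):
    # ASSUMPTION: the API responds with JSON.
    with session.open(url, timeout=10) as response:
        return json.load(response)


print(endpoint("/api/server/{ip}/events", "192.168.1.100"))
# http://your-ipmi-monitor:5000/api/server/192.168.1.100/events
```

A typical flow would be `session = make_session()`, then `login(session, ...)`, then `get_json(session, endpoint("/api/servers"))`, run against a reachable instance.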