Metadata-Version: 2.4
Name: server-guardian-mcp
Version: 1.0.3
Summary: MCP server to monitor and manage remote Linux servers via SSH. 63 tools: health checks, log search, APM, SLOs, anomaly detection, auto-remediation, live dashboard, CIS benchmarks, CVE scanning, database monitoring, compliance reports, team RBAC, PagerDuty/Telegram/OpsGenie.
Author: Md Nazish Arman
License-Expression: LicenseRef-Proprietary
License-File: LICENSE
Keywords: anomaly-detection,compliance,dashboard,devops,docker,linux,mcp,monitoring,playbooks,server,ssh,vps
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: System :: Systems Administration
Requires-Python: >=3.10
Requires-Dist: mcp>=1.0.0
Requires-Dist: paramiko>=3.0.0
Requires-Dist: starlette>=0.36.0
Requires-Dist: uvicorn>=0.27.0
Description-Content-Type: text/markdown

# Server Guardian MCP

A server management MCP server with **63 tools**, **8 connection types**, and **16 modules**: log search, access log APM, SLO tracking, anomaly detection, auto-remediation playbooks, CIS benchmarks, CVE scanning, database monitoring, network monitoring, file integrity, live web dashboard, compliance reports, public status pages, team RBAC, and PagerDuty/Telegram/OpsGenie alerts, all driven through Claude. No agents. Just SSH.

> **"The AI SRE that lives in your terminal. SSH into any server, diagnose any problem, fix it automatically — all through a conversation with Claude. No agents. No SaaS bills. No PromQL."**

## Live Dashboard

```bash
python -m server_guardian_mcp dashboard           # start on port 8080
python -m server_guardian_mcp dashboard --port 9090
```

Real-time web UI with auto-refresh every 30 seconds. Dark theme, Chart.js charts for CPU/memory/disk trends, active alerts feed, incident timeline.

## Why Server Guardian?

| What you say to Claude | What happens |
|------------------------|-------------|
| "Is my server okay?" | SSH in, check CPU/RAM/disk/temp, detect anomalies vs baseline |
| "Why is production slow?" | Check processes, disk, logs, access log APM, identify the bottleneck |
| "Search logs for OOM errors" | Index logs in SQLite, search with pattern detection, show error rates |
| "Show me endpoint latency" | Parse nginx/Apache access logs: p50/p95/p99 latency, error rates, slowest endpoints |
| "Are we meeting our SLOs?" | Track uptime/latency/error targets, calculate error budget remaining |
| "What happened overnight?" | Generate incident narrative from alerts, service events, playbook runs |
| "Fix it automatically" | Run playbooks: clear disk, restart services, renew SSL certs |
| "Run a security audit" | 61 CIS benchmark checks + CVE scan + rootkit detection + FIM |
| "Generate a compliance report" | Branded HTML report with score (A-F) for SOC2/ISO prep |
| "How's the database?" | Slow query analysis, connection counts, replication lag, table sizes |
| "Am I overpaying?" | Rightsizing analysis: "CPU at 0.4%, memory at 7.7% — downsize to save 50%" |
| "What connects to what?" | Map service dependencies from active network connections |
| "Write the postmortem" | Auto-generate structured postmortem from incident timeline |
| "Create a status page" | Public-facing uptime page for customers (replaces $29/mo tools) |

## Benchmarks vs Alternatives

| Feature | Server Guardian | ssh-mcp | mcp-ssh-manager | HomeButler |
|---------|:-:|:-:|:-:|:-:|
| **Total tools** | **63** | 2 | 37 | 20 |
| **Connection types** | **8** | 1 | 1 | 1 |
| Log search + pattern detection | **Yes** | - | - | - |
| Access log APM (p50/p95/p99) | **Yes** | - | - | - |
| SLO tracking + error budgets | **Yes** | - | - | - |
| Smart anomaly detection | **Yes** | - | - | - |
| Auto-remediation playbooks | **Yes** | - | - | - |
| CIS benchmark (61 checks) | **Yes** | - | - | - |
| CVE scanning + rootkit detection | **Yes** | - | - | - |
| File integrity monitoring | **Yes** | - | - | - |
| Database monitoring (MySQL/PG) | **Yes** | - | - | - |
| Network bandwidth monitoring | **Yes** | - | - | - |
| Service dependency mapping | **Yes** | - | - | - |
| Root cause correlation | **Yes** | - | - | - |
| Resource rightsizing | **Yes** | - | - | - |
| Multi-step API tests | **Yes** | - | - | - |
| Maintenance windows | **Yes** | - | - | - |
| Public status page | **Yes** | - | - | - |
| AI postmortem generation | **Yes** | - | - | - |
| Live web dashboard (Chart.js) | **Yes** | - | - | - |
| Compliance report (SOC2/ISO) | **Yes** | - | - | - |
| Team RBAC (admin/operator/viewer) | **Yes** | - | - | - |
| PagerDuty / Telegram / OpsGenie | **Yes** | - | - | - |
| Background watchdog daemon | **Yes** | - | - | Yes |
| Email / Slack / Discord alerts | **Yes** | - | - | Yes |
| Multi-cloud (AWS/GCP/Azure) | **Yes** | - | - | - |
| Docker container management | **Yes** | - | Yes | Yes |

## Quick Install

### Claude Code (recommended)
```bash
claude mcp add server-guardian -- uvx server-guardian-mcp
```

### pip
```bash
pip install server-guardian-mcp
claude mcp add server-guardian -- python -m server_guardian_mcp
```

### From source
```bash
pip install -e .
claude mcp add server-guardian -- python -m server_guardian_mcp
```

## Setup (2 minutes)

### 1. Create your .env
```bash
cp .env.example .env
```

### 2. Add your servers

```env
# SSH (most common)
SERVER_PROD=ssh,203.0.113.10,22,deploy,key,~/.ssh/prod_key,Production

# Local machine
SERVER_LOCAL=local,,,,,My Machine

# Docker / Kubernetes / AWS SSM / GCP / Azure / WinRM also supported
```
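
Each `SERVER_*` value above is a comma-separated record. As a rough illustration of how such a line could be read, here is a minimal sketch. The `ServerEntry` class and `parse_server_entry` helper are hypothetical, and the field layout (type first, label last) is an assumption inferred from the two examples, not the package's actual parser.

```python
from dataclasses import dataclass


@dataclass
class ServerEntry:
    name: str          # taken from the variable name, e.g. "PROD"
    kind: str          # connection type: "ssh", "local", ...
    label: str         # human-readable label (assumed to be the last field)
    fields: list[str]  # everything between type and label, left uninterpreted


def parse_server_entry(key: str, value: str) -> ServerEntry:
    """Split one SERVER_<NAME>=... line into its parts (hypothetical helper)."""
    if not key.startswith("SERVER_"):
        raise ValueError(f"not a server variable: {key}")
    parts = [p.strip() for p in value.split(",")]
    if len(parts) < 2:
        raise ValueError(f"expected at least a type and a label: {value!r}")
    return ServerEntry(
        name=key.removeprefix("SERVER_"),
        kind=parts[0],
        label=parts[-1],
        fields=parts[1:-1],
    )


prod = parse_server_entry(
    "SERVER_PROD", "ssh,203.0.113.10,22,deploy,key,~/.ssh/prod_key,Production"
)
local = parse_server_entry("SERVER_LOCAL", "local,,,,,My Machine")
print(prod.kind, prod.fields[0], prod.label)
print(local.kind, local.label)
```

The middle fields are kept as an uninterpreted list because their meaning depends on the connection type.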

### 3. Auto-discover existing servers
> "Discover my SSH servers" — reads ~/.ssh/config and shows ready-to-paste .env lines.

### 4. Add aliases (optional)
```env
SERVER_ALIASES=prod:PROD,stg:STAGING,dev:DEV
```

## All 63 Tools

### Core Server Management (6)
| Tool | What it does |
|------|-------------|
| `list_all_servers` | Show all servers with online/offline status and latency |
| `check_server_health` | Full snapshot: CPU, RAM, disk, swap, temp, load, top processes, network |
| `run_shell_commands` | Run one or more shell commands on any server |
| `run_shell_script` | Run multi-line bash scripts with shared variables |
| `fetch_system_logs` | Fetch dmesg/syslog/journal/auth/nginx/custom logs with grep filter |
| `list_running_processes` | Processes sorted by CPU or memory, with name filter |

### Service Management (5)
| Tool | What it does |
|------|-------------|
| `manage_systemd_service` | Start/stop/restart/enable/disable/status/logs for any systemd service |
| `list_all_services` | List ALL systemd services, filter by running/failed/inactive |
| `find_failed_services` | Find every crashed/failed service in one call |
| `restart_failed_services` | Bulk restart failed services — pass names or "ALL_FAILED" |
| `watch_service_status` | Quick is-active + is-enabled check for specific services |

### Monitoring & Alerting (5)
| Tool | What it does |
|------|-------------|
| `check_ssl_certificate` | SSL cert expiry, chain, issuer for any domain (no SSH) |
| `check_http_endpoint` | HTTP status, response time, headers for any URL (no SSH) |
| `monitor_server_health` | Health check + store in SQLite + auto-alert on thresholds |
| `monitor_endpoints` | Check HTTP/SSL targets + store + alert on failures |
| `get_active_alerts` | Show unresolved alerts grouped by severity |

### Log Search & APM (2)
| Tool | What it does |
|------|-------------|
| `search_logs` | Index logs in SQLite, search with pattern detection, extract error rates |
| `analyze_access_logs` | Nginx/Apache APM — per-endpoint p50/p95/p99 latency, error rates, throughput, top IPs |

### SLO Tracking & Reporting (4)
| Tool | What it does |
|------|-------------|
| `manage_slos` | Define uptime/latency/error rate targets, track compliance, error budgets |
| `generate_postmortem_tool` | Structured incident postmortem from alerts, services, playbook data |
| `generate_status_page_tool` | Public-facing status page for customers (replaces Better Stack $29/mo) |
| `get_weekly_report` | Weekly health summary for email or team review |

### Database Monitoring (2)
| Tool | What it does |
|------|-------------|
| `query_database` | Run SQL queries on MySQL, PostgreSQL, or SQLite on any server |
| `monitor_database` | Slow queries, connections, replication lag, table sizes (MySQL/PostgreSQL auto-detected) |

### Network Monitoring (2)
| Tool | What it does |
|------|-------------|
| `inspect_network` | Listening ports, active connections, interfaces, DNS, routing |
| `monitor_network` | Bandwidth per interface, connection states, TCP retransmissions, throughput rates |

### Security & Compliance (6)
| Tool | What it does |
|------|-------------|
| `run_security_audit` | 10-point security check (SSH, firewall, logins, updates, sudo) |
| `run_cis_benchmark` | 61 CIS Linux Benchmark checks across filesystem, network, SSH, PAM, logging |
| `scan_vulnerabilities` | CVE scanning (package versions), rootkit detection, crypto miner detection |
| `check_file_integrity` | FIM — hash critical files (/etc/passwd, sshd_config, etc.), detect unauthorized changes |
| `manage_firewall` | UFW/iptables: status, allow, deny, delete rules, enable/disable |
| `generate_compliance_report_tool` | Branded HTML report with score (A-F), suitable for SOC2/ISO |

### Docker (2)
| Tool | What it does |
|------|-------------|
| `list_docker_containers` | Containers with CPU, memory, network, block I/O stats |
| `fetch_docker_logs` | Container logs with grep filter and time range |

### Disk & Files (4)
| Tool | What it does |
|------|-------------|
| `analyze_disk_usage` | Find largest items, files >100MB, inode usage |
| `read_remote_file` | Read files on server (tail/head/all) with metadata |
| `upload_file_to_server` | SFTP upload with size verification |
| `download_file_from_server` | SFTP download |

### Multi-Server (2)
| Tool | What it does |
|------|-------------|
| `run_on_all_servers` | Same commands on multiple servers — pass ["ALL"] for all |
| `compare_across_servers` | Spot config drift: same command, side-by-side results |

### System Administration (4)
| Tool | What it does |
|------|-------------|
| `manage_cron_jobs` | List, add, remove cron jobs on any server |
| `manage_users` | List users, user info, add SSH keys, list keys, who is logged in |
| `manage_packages` | List/install/remove/upgrade packages (apt, yum, dnf, apk auto-detected) |
| `manage_nginx` | Status, list sites, show config, test, reload, restart, access/error logs |

### Git Deploy (1)
| Tool | What it does |
|------|-------------|
| `git_deploy` | Status, pull, log, branch, switch, stash, diff on server git repos |

### Discovery (1)
| Tool | What it does |
|------|-------------|
| `discover_ssh_servers` | Auto-discover servers from ~/.ssh/config with ready-to-paste .env lines |

### Dashboard & Analytics (6)
| Tool | What it does |
|------|-------------|
| `multi_server_dashboard` | One-call summary of ALL servers: health, CPU, RAM, disk, failed services |
| `get_monitoring_history` | Query health trends, service events, endpoint checks from SQLite |
| `get_incident_timeline` | Chronological event log for a server |
| `forecast_disk_usage` | Predict when disk will be full based on growth rate |
| `generate_html_dashboard` | Self-contained HTML status page — open in any browser |
| `resolve_alert` | Mark an alert as resolved |

### Intelligence & Automation (3)
| Tool | What it does |
|------|-------------|
| `detect_anomalies_tool` | Statistical anomaly detection — flags metrics >2.5 sigma from baseline |
| `replay_incident` | Generate chronological narrative from alerts, service events, playbook runs |
| `manage_playbooks` | Auto-remediation: disk cleanup, service restart, SSL renewal, custom playbooks |

### Team & Integrations (3)
| Tool | What it does |
|------|-------------|
| `team_manage` | RBAC user management: admin/operator/viewer roles with API keys |
| `check_integrations` | Status and test for PagerDuty, Telegram, OpsGenie |
| `live_dashboard_info` | How to start the live web dashboard and available API endpoints |

### Advanced Operations (5)
| Tool | What it does |
|------|-------------|
| `run_api_test_tool` | Multi-step API tests with variable extraction and assertions |
| `manage_maintenance_windows` | Suppress alerts during planned work |
| `get_rightsizing_recommendations` | Identify over/under-provisioned resources to save costs |
| `map_service_dependencies` | Discover service topology from active network connections |
| `analyze_root_cause` | Correlate anomalies across metrics, services, alerts for root cause analysis |

## Access Log APM

Much of what an APM agent reports, with zero agent install. Parses nginx/Apache access logs and reports the metrics listed below.

```
Tell Claude: "analyze access logs on PROD"
```

- Per-endpoint latency percentiles (p50, p95, p99)
- Error rates (4xx, 5xx) per endpoint
- Throughput (requests per endpoint)
- Slowest endpoints ranked
- Status code breakdown
- Top IPs by request volume
- URL normalization (replaces IDs/UUIDs with placeholders)
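
To make the list above concrete, here is a minimal sketch of URL normalization plus per-endpoint percentiles. It assumes each request has already been parsed into a `(path, status, seconds)` tuple, which in turn assumes the access log format records a request time. The function names, the `:id`/`:uuid` placeholders, and the nearest-rank percentile method are illustrative choices, not necessarily what `analyze_access_logs` does.

```python
import math
import re
from collections import defaultdict

UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
NUMERIC_ID_RE = re.compile(r"(?<=/)\d+(?=/|$)")


def normalize_path(path: str) -> str:
    """Collapse UUIDs and numeric IDs so /users/1 and /users/2 share a bucket."""
    path = path.split("?", 1)[0]  # drop the query string
    path = UUID_RE.sub(":uuid", path)
    return NUMERIC_ID_RE.sub(":id", path)


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(requests: list[tuple[str, int, float]]) -> dict[str, dict[str, float]]:
    """Group requests by normalized path and compute latency and 5xx stats."""
    latencies = defaultdict(list)
    errors = defaultdict(int)
    for path, status, seconds in requests:
        key = normalize_path(path)
        latencies[key].append(seconds)
        if status >= 500:
            errors[key] += 1
    report = {}
    for key, values in latencies.items():
        values.sort()
        report[key] = {
            "count": len(values),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
            "error_rate": errors[key] / len(values),
        }
    return report


sample = [
    ("/users/1", 200, 0.10),
    ("/users/2?x=1", 200, 0.20),
    ("/users/3", 500, 0.90),
    ("/health", 200, 0.01),
]
report = summarize(sample)
print(report["/users/:id"])
```

Without normalization, every distinct ID would become its own "endpoint" and the percentiles would be meaningless.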

## Log Search & Pattern Detection

```
Tell Claude: "search logs on PROD for OOM" or "show me log patterns"
```

- Fetches logs via SSH, indexes in SQLite for future searching
- Pattern detection — clusters similar log lines, shows frequency
- Error rate extraction (log-to-metrics)
- Supports journal, syslog, auth, nginx, or any custom log path
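
The indexing and clustering ideas above can be sketched with the standard library alone. The table layout, the function names, and the masking rule (replace numbers with placeholders) are assumptions made for illustration; the real `search_logs` schema and clustering logic may differ.

```python
import re
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (server TEXT, source TEXT, line TEXT)")


def index_lines(server: str, source: str, lines: list[str]) -> None:
    """Store fetched log lines so later searches do not need another SSH call."""
    conn.executemany(
        "INSERT INTO logs (server, source, line) VALUES (?, ?, ?)",
        [(server, source, line) for line in lines],
    )


def search(server: str, needle: str) -> list[str]:
    """Substring search over indexed lines (SQLite LIKE, case-insensitive for ASCII)."""
    rows = conn.execute(
        "SELECT line FROM logs WHERE server = ? AND line LIKE ?",
        (server, f"%{needle}%"),
    )
    return [row[0] for row in rows]


def pattern_of(line: str) -> str:
    """Mask hex and decimal numbers so similar lines collapse to one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "{hex}", line)
    return re.sub(r"\d+", "{num}", line)


def top_patterns(lines: list[str], limit: int = 5) -> list[tuple[str, int]]:
    """Return the most frequent line templates with their counts."""
    return Counter(pattern_of(line) for line in lines).most_common(limit)


lines = [
    "Out of memory: Killed process 1234 (java)",
    "Out of memory: Killed process 5678 (java)",
    "Accepted publickey for deploy from 203.0.113.7 port 50122",
]
index_lines("PROD", "syslog", lines)
hits = search("PROD", "out of memory")
patterns = top_patterns(lines)
print(len(hits), patterns[0])
```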

## SLO Tracking & Error Budgets

```
Tell Claude: "create an SLO for 99.9% uptime on PROD"
Tell Claude: "show me SLO status"
```

- Define uptime, latency, or error rate targets
- Track compliance from stored health/endpoint data
- Calculate error budget remaining and burn rate
- Configurable measurement windows (7d, 30d, 90d)
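
For intuition on error budgets: a 99.9% uptime target over a 30-day window allows 0.1% of 43,200 minutes, or 43.2 minutes of downtime. The sketch below applies the same arithmetic to a count of health checks. The `error_budget` function and its return keys are hypothetical, and it assumes each stored check counts equally toward availability.

```python
def error_budget(target_pct: float, total_checks: int, failed_checks: int) -> dict[str, float]:
    """Error budget for an availability SLO measured over a window of checks."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    # Number of failed checks the target tolerates within the window.
    allowed_failures = total_checks * (1 - target_pct / 100)
    achieved_pct = 100 * (total_checks - failed_checks) / total_checks
    # Fraction of the budget already spent; a 100% target has no budget at all.
    consumed = failed_checks / allowed_failures if allowed_failures else float("inf")
    return {
        "achieved_pct": achieved_pct,
        "allowed_failures": allowed_failures,
        "budget_remaining_pct": 100 * (1 - consumed),
    }


# One check per minute for 30 days, 10 of which failed.
budget = error_budget(target_pct=99.9, total_checks=43_200, failed_checks=10)
print(budget)
```

A negative `budget_remaining_pct` means the SLO has already been missed for the window.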

## CIS Benchmark & Vulnerability Scanning

```
Tell Claude: "run CIS benchmark on PROD"
Tell Claude: "scan for vulnerabilities on PROD"
```

- **61 CIS Linux Benchmark checks** across: filesystem, software updates, boot security, process hardening, network config, SSH, PAM, user management, logging, cron
- **CVE scanning** — lists installed packages, checks for security updates
- **Rootkit detection** — hidden processes, suspicious kernel modules, SUID files, crypto miners, suspicious cron jobs
- **File integrity monitoring** — hashes critical files, alerts on unauthorized changes
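
File integrity monitoring boils down to hashing a set of files, storing the result, and diffing against it later. The sketch below shows that loop locally. SHA-256 and the function names are assumptions for illustration; the tool itself hashes files on the remote server and may use a different digest.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths: list[Path]) -> dict[str, str]:
    """Map each existing file path to its current hash."""
    return {str(p): hash_file(p) for p in paths if p.is_file()}


def diff_snapshots(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report files that changed, disappeared, or appeared since the baseline."""
    return {
        "modified": sorted(p for p in baseline if p in current and baseline[p] != current[p]),
        "missing": sorted(p for p in baseline if p not in current),
        "added": sorted(p for p in current if p not in baseline),
    }


# Demo on a throwaway file standing in for a critical config file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sshd_config"
    target.write_text("PermitRootLogin no\n")
    baseline = snapshot([target])
    target.write_text("PermitRootLogin yes\n")
    changes = diff_snapshots(baseline, snapshot([target]))

print(changes)
```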

## Database Monitoring

```
Tell Claude: "monitor database on PROD"
```

- **MySQL**: slow query log, connection stats, replication lag, table sizes, processlist
- **PostgreSQL**: pg_stat_statements, connections, replication, table sizes, lock analysis, cache hit ratio
- Auto-detects which database is installed

## Network Monitoring

```
Tell Claude: "monitor network on PROD"
```

- Bandwidth per interface (bytes/sec, Mbps)
- Connection state tracking (ESTABLISHED, TIME_WAIT, CLOSE_WAIT)
- TCP retransmission rates
- Historical trends stored in SQLite
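
Bandwidth figures like these are typically derived from two readings of the kernel's cumulative byte counters. As a sketch, assuming the counters come from `/proc/net/dev` on Linux (the data source `monitor_network` actually uses is not stated here, and the function names are illustrative):

```python
def parse_proc_net_dev(text: str) -> dict[str, tuple[int, int]]:
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes; field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates_mbps(
    before: dict[str, tuple[int, int]],
    after: dict[str, tuple[int, int]],
    seconds: float,
) -> dict[str, tuple[float, float]]:
    """Convert two counter readings taken `seconds` apart into (rx, tx) Mbps."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name not in before:
            continue
        rx0, tx0 = before[name]
        rates[name] = (
            (rx1 - rx0) * 8 / seconds / 1e6,
            (tx1 - tx0) * 8 / seconds / 1e6,
        )
    return rates


HEADER = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
first = parse_proc_net_dev(HEADER + "  eth0: 1000000 100 0 0 0 0 0 0 2000000 200 0 0 0 0 0 0\n")
second = parse_proc_net_dev(HEADER + "  eth0: 13500000 900 0 0 0 0 0 0 4500000 700 0 0 0 0 0 0\n")
rates = rates_mbps(first, second, seconds=10)
print(rates)
```

The sketch ignores counter wraparound and interface resets, which a production collector would need to handle.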

## Resource Rightsizing

```
Tell Claude: "rightsizing recommendations for PROD"
```

- Analyzes CPU, memory, disk usage over time
- Identifies over-provisioned resources ("CPU at 0.4% — downsize from 16 to 8 cores")
- Identifies under-provisioned resources ("Memory at 92% — upgrade RAM")
- Cost savings estimates

## Service Dependency Mapping

```
Tell Claude: "map dependencies on PROD"
```

- Parses active TCP connections to discover which processes talk to which endpoints
- Groups by process (nginx -> database:5432, app -> redis:6379)
- Stored in SQLite for historical tracking

## Root Cause Analysis

```
Tell Claude: "analyze root cause on PROD"
```

- Correlates metric spikes with service failures and alerts
- Detects cascading failure patterns
- Identifies resource exhaustion as cause of service crashes
- Temporal correlation across all monitoring data

## Smart Anomaly Detection

```
Tell Claude: "detect anomalies on PROD"
```

- Builds baselines per metric grouped by hour and day of week
- Flags values >2.5 standard deviations from the mean
- No ML dependencies — pure statistics from SQLite data
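
The approach described above can be sketched in a few lines: bucket historical samples by weekday and hour, then compare a new value against that bucket's mean and standard deviation. The function names and the minimum sample count are assumptions; only the 2.5 sigma threshold and the hour/day-of-week grouping come from the description.

```python
import statistics
from collections import defaultdict
from datetime import datetime

SIGMA_THRESHOLD = 2.5


def build_baselines(samples):
    """samples: (timestamp, value) pairs -> {(weekday, hour): (mean, stdev)}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        key: (statistics.fmean(values), statistics.stdev(values))
        for key, values in buckets.items()
        if len(values) >= 2  # a sample standard deviation needs two points
    }


def is_anomalous(baselines, ts: datetime, value: float) -> bool:
    """True if value is more than SIGMA_THRESHOLD deviations from its bucket mean."""
    baseline = baselines.get((ts.weekday(), ts.hour))
    if baseline is None:
        return False  # no history for this slot yet, so nothing to compare against
    mean, stdev = baseline
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > SIGMA_THRESHOLD


# CPU percentage at 09:00 on four consecutive Mondays.
history = [
    (datetime(2024, 1, 1, 9, 0), 20.0),
    (datetime(2024, 1, 8, 9, 0), 22.0),
    (datetime(2024, 1, 15, 9, 0), 19.0),
    (datetime(2024, 1, 22, 9, 0), 21.0),
]
baselines = build_baselines(history)
monday_morning = datetime(2024, 1, 29, 9, 30)
print(is_anomalous(baselines, monday_morning, 21.0))
print(is_anomalous(baselines, monday_morning, 60.0))
```

Grouping by hour and weekday keeps a routine Monday-morning load spike from being flagged just because it is high relative to a weekend night.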

## Auto-Remediation Playbooks

**5 built-in playbooks:**

| Playbook | Trigger | Action |
|----------|---------|--------|
| `disk_cleanup` | Disk > 90% | Clear journal, /tmp, old logs, package cache |
| `restart_failed_services` | Failed services detected | Restart each failed service |
| `high_memory_cleanup` | Memory > 95% | Drop filesystem caches |
| `high_cpu_investigation` | CPU load > 3x cores | Log top CPU consumers |
| `ssl_renewal` | SSL cert < 7 days | Run certbot renew, reload nginx |

Custom playbooks: drop JSON files in `~/.server-guardian-mcp/playbooks/`

## Public Status Page

```
Tell Claude: "generate a status page"
```

- Self-hosted uptime page for customers
- Shows server and endpoint health
- Active incidents section
- Auto-refreshes every 60 seconds
- Replaces Better Stack ($29/mo) and Instatus ($20/mo) — free

## Multi-Step API Tests

```
Tell Claude: "test my API"
```

- Chain API calls: login -> extract token -> call API with token -> verify response
- Variable extraction from JSON responses
- Assertions: status code, body content, response time
- Save and re-run named tests
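
The login, extract, call, verify chain above can be illustrated without any network access. In this sketch the step dictionary layout (`name`, `method`, `url`, `headers`, `extract`, `expect_status`), the `run_steps` function, and the stubbed `fake_send` transport are all hypothetical; they are not the format `run_api_test_tool` uses.

```python
import json
import string


def extract(body: str, dotted_path: str):
    """Pull a value out of a JSON body using a dotted key path like 'data.token'."""
    value = json.loads(body)
    for part in dotted_path.split("."):
        value = value[part]
    return value


def run_steps(steps, send):
    """Run steps in order, substituting ${var} with values from earlier responses."""
    variables = {}
    results = []
    for step in steps:
        url = string.Template(step["url"]).substitute(variables)
        headers = {
            key: string.Template(value).substitute(variables)
            for key, value in step.get("headers", {}).items()
        }
        status, body = send(step["method"], url, headers)
        passed = status == step.get("expect_status", 200)
        results.append((step["name"], passed))
        if not passed:
            break  # later steps depend on this one, so stop here
        for name, path in step.get("extract", {}).items():
            variables[name] = extract(body, path)
    return results


def fake_send(method, url, headers):
    """Stand-in for a real HTTP client so the example runs offline."""
    if url.endswith("/login"):
        return 200, json.dumps({"data": {"token": "abc123"}})
    if headers.get("Authorization") == "Bearer abc123":
        return 200, json.dumps({"user": "deploy"})
    return 401, "{}"


steps = [
    {"name": "login", "method": "POST", "url": "https://api.example.test/login",
     "extract": {"token": "data.token"}},
    {"name": "profile", "method": "GET", "url": "https://api.example.test/me",
     "headers": {"Authorization": "Bearer ${token}"}},
]
results = run_steps(steps, fake_send)
print(results)
```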

## Maintenance Windows

```
Tell Claude: "create maintenance window for PROD for 2 hours"
```

- Suppress alerts during planned work
- Configurable duration
- List and delete windows
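
Conceptually, alert suppression is a time-range check performed before an alert is sent. A minimal sketch, with hypothetical `MaintenanceWindow`, `create_window`, and `is_suppressed` names and an assumed half-open interval (the end time itself is not suppressed):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    server: str
    start: datetime
    end: datetime


def create_window(server: str, start: datetime, hours: float) -> MaintenanceWindow:
    """Build a window that starts at `start` and lasts `hours` hours."""
    return MaintenanceWindow(server, start, start + timedelta(hours=hours))


def is_suppressed(windows: list[MaintenanceWindow], server: str, at: datetime) -> bool:
    """True if an alert for `server` at time `at` falls inside any window."""
    return any(w.server == server and w.start <= at < w.end for w in windows)


start = datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc)
windows = [create_window("PROD", start, hours=2)]
during = datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc)
after = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
print(is_suppressed(windows, "PROD", during), is_suppressed(windows, "PROD", after))
```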

## Compliance Reports

```
Tell Claude: "generate a compliance report for PROD"
```

- Security score (0-100) with letter grade (A-F)
- Detailed check results with pass/fail/warning badges
- Active alerts section
- Print-friendly, works in any browser
- Suitable for SOC2/ISO prep and client deliverables

## Team Mode (RBAC)

```env
GUARDIAN_TEAM_MODE=true
GUARDIAN_API_KEY=sg_your_api_key_here
```

| Role | Permissions |
|------|------------|
| **admin** | Full access — all tools, user management |
| **operator** | Run commands, restart services, deploy — no user management |
| **viewer** | Read-only — view health, logs, alerts, dashboards |
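
The role table maps naturally onto a permission lookup. The sketch below is illustrative only: the permission names (`read`, `execute`, `manage_users`) and the `is_allowed` helper are assumptions, not the package's internal model.

```python
# Each role is a superset of the one before it, mirroring the table above.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "operator": {"read", "execute"},
    "admin": {"read", "execute", "manage_users"},
}


def is_allowed(role: str, permission: str) -> bool:
    """Unknown roles get no permissions at all (deny by default)."""
    return permission in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("operator", "execute"), is_allowed("operator", "manage_users"))
```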

## External Integrations

```env
PAGERDUTY_ROUTING_KEY=your-routing-key
TELEGRAM_BOT_TOKEN=your-bot-token
TELEGRAM_CHAT_ID=your-chat-id
OPSGENIE_API_KEY=your-api-key
```

## Background Watchdog

Runs independently of Claude, with no AI and no API cost. Monitors 24/7 and sends alerts via email, Slack, and Discord.

```bash
python -m server_guardian_mcp watchdog           # run forever
python -m server_guardian_mcp watchdog --once    # run one cycle
```

### Alert thresholds
| Condition | Severity |
|-----------|----------|
| Disk > 90% | Critical |
| Disk > 80% | Warning |
| CPU load > 2x cores | Warning |
| Temperature > 85C | Warning |
| Server unreachable | Critical |
| Failed services | Warning |
| HTTP endpoint down | Critical |
| SSL cert < 7 days | Critical |
| SSL cert < 30 days | Warning |
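
The threshold table can be read as a small rule function. This sketch encodes exactly the rows above; the `classify` name and the metric keys (`disk_pct`, `load`, `cores`, and so on) are hypothetical, chosen only for the example.

```python
def classify(metrics: dict) -> list[tuple[str, str]]:
    """Map one health snapshot to (condition, severity) pairs per the table above."""
    if not metrics.get("reachable", True):
        return [("Server unreachable", "Critical")]  # nothing else can be measured
    alerts = []
    disk = metrics.get("disk_pct", 0.0)
    if disk > 90:
        alerts.append(("Disk > 90%", "Critical"))
    elif disk > 80:
        alerts.append(("Disk > 80%", "Warning"))
    if metrics.get("load", 0.0) > 2 * metrics.get("cores", 1):
        alerts.append(("CPU load > 2x cores", "Warning"))
    if metrics.get("temp_c", 0.0) > 85:
        alerts.append(("Temperature > 85C", "Warning"))
    if metrics.get("failed_services"):
        alerts.append(("Failed services", "Warning"))
    if metrics.get("endpoint_up") is False:
        alerts.append(("HTTP endpoint down", "Critical"))
    days = metrics.get("ssl_days_left")
    if days is not None:
        if days < 7:
            alerts.append(("SSL cert < 7 days", "Critical"))
        elif days < 30:
            alerts.append(("SSL cert < 30 days", "Warning"))
    return alerts


alerts = classify({"disk_pct": 92, "load": 3.0, "cores": 4, "ssl_days_left": 20})
print(alerts)
```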

## Connection Types

| Type | Connects to | Requires |
|------|------------|----------|
| `ssh` | Linux/Mac servers | paramiko (included) |
| `local` | Your own machine | nothing |
| `docker` | Docker containers | docker CLI |
| `winrm` | Windows servers | `pip install pywinrm` |
| `k8s` | Kubernetes pods | kubectl CLI |
| `aws-ssm` | AWS EC2 instances | aws CLI |
| `gcloud` | GCP Compute Engine | gcloud CLI |
| `azure` | Azure VMs | az CLI |

## Security

- **Command blocklist** — blocks rm -rf, fork bombs, reverse shells
- **Sensitive file protection** — blocks .pem, .key, .env, /etc/shadow
- **SQL safety** — read-only by default
- **Read-only mode** — `GUARDIAN_MODE=readonly`
- **Rate limiting** — 30 calls/min per tool
- **Audit logging** — all invocations logged with sensitive param redaction
- **Shell injection prevention** — shlex.quote on all inputs
- **Output capped at 512KB** per command
- **File integrity monitoring** — detect unauthorized file changes
- **CIS benchmark compliance** — 61 security checks
- **CVE + rootkit scanning** — detect known vulnerabilities and malware
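
Two of the protections above, the command blocklist and `shlex.quote`, can be illustrated briefly. The regular expressions and helper names here are assumptions for the example, not the package's actual rules. A regex blocklist is deliberately coarse and complements quoting rather than replacing it.

```python
import re
import shlex

# Illustrative patterns only; a real blocklist would cover many more cases.
BLOCKED_PATTERNS = [
    re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*f|\brm\s+-[a-zA-Z]*f[a-zA-Z]*r"),  # rm -rf / rm -fr
    re.compile(r":\(\)\s*\{.*\};\s*:"),  # classic bash fork bomb
]


def is_blocked(command: str) -> bool:
    """True if the command matches any blocked pattern."""
    return any(pattern.search(command) for pattern in BLOCKED_PATTERNS)


def grep_command(pattern: str, path: str) -> str:
    """Build a grep command with user input quoted so it cannot break out."""
    return f"grep -- {shlex.quote(pattern)} {shlex.quote(path)}"


command = grep_command("error; rm -rf /", "/var/log/syslog")
print(command)
print(is_blocked("rm -rf /var/www"), is_blocked("ls -la /var/www"))
```

Because the search pattern is quoted, the shell passes `error; rm -rf /` to grep as a single argument instead of running it as a second command.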

## Architecture

- **63 MCP tools** across 16 modules
- **8 connection adapters** (SSH, Local, Docker, WinRM, K8s, AWS SSM, GCloud, Azure)
- **15 SQLite tables** (health, services, endpoints, alerts, audit, baselines, playbooks, users, logs, SLOs, file hashes, network, maintenance, API tests, dependencies)
- **Background watchdog** with email/Slack/Discord/PagerDuty/Telegram/OpsGenie alerts
- **Live web dashboard** (Starlette + Chart.js)
- **Statistical anomaly detection** engine
- **Auto-remediation** playbook engine
- **Access log APM** parser
- **CIS benchmark + CVE scanner**
- **Database monitoring** (MySQL + PostgreSQL)
- **Network monitoring** with bandwidth tracking
- **SLO tracking** with error budgets
- **Team RBAC** (admin/operator/viewer)
- **Compliance report** generator
- **Public status page** generator

## Requirements

- Python 3.10+
- `mcp>=1.0.0`
- `paramiko>=3.0.0`
- `uvicorn>=0.27.0`
- `starlette>=0.36.0`

## License

**Proprietary** — Copyright (c) 2026 Md Nazish Arman. All rights reserved.

Free for personal, non-commercial evaluation only. Commercial use, business use, or any revenue-generating use requires a paid license. See [LICENSE](LICENSE) for full terms.

## Author

**Md Nazish Arman**
