Metadata-Version: 2.4
Name: vmcluster-mcp
Version: 1.1.0
Summary: MCP server for autonomous multi-VM cluster orchestration on libvirt/QEMU
Project-URL: Homepage, https://github.com/hornc/vmcluster-mcp
Project-URL: Repository, https://github.com/hornc/vmcluster-mcp
Project-URL: Issues, https://github.com/hornc/vmcluster-mcp/issues
Author-email: Chris Horn <chompinbits@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: cluster,kvm,libvirt,mcp,qemu,testing,vm
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: System :: Systems Administration
Requires-Python: >=3.11
Requires-Dist: asyncssh>=2.22.0
Requires-Dist: libvirt-python>=12.1.0
Requires-Dist: mcp[cli]>=1.26.0
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pyyaml>=6.0.3
Description-Content-Type: text/markdown

# vmcluster-mcp

An MCP server for autonomous multi-VM cluster orchestration on libvirt/QEMU. Manages the full lifecycle of KVM virtual machine clusters — provisioning, starting, stopping, snapshotting, SSH execution, artifact distribution, and fault injection — through a structured tool interface designed for AI agents.

## Table of Contents

- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Quick Start (First 15 Minutes)](#quick-start-first-15-minutes)
- [Configuration](#configuration)
- [Topology Files](#topology-files)
- [Integration: VS Code (GitHub Copilot)](#integration-vs-code-github-copilot)
- [Integration: Claude CLI](#integration-claude-cli)
- [Available Tools](#available-tools)
- [Canonical Agent Workflow](#canonical-agent-workflow)
- [Troubleshooting](#troubleshooting)
- [Development](#development)

---

## Overview

`vmcluster-mcp` is a general-purpose MCP server. It manages clusters of KVM/QEMU virtual machines and produces a `ClusterHandle` — a typed descriptor passed to downstream consumers for direct SSH access. The server has no knowledge of what runs inside VMs; it knows nodes, networks, snapshots, and artifacts.

**Design principles:**
- **Topology-as-data** — cluster shape is declared in a YAML file, not constructed imperatively
- **Structured outputs** — all tools return typed Pydantic models serialized as JSON; no free-text parsing
- **Stateless server** — all persistent state lives in libvirt and on disk; safe to restart at any time
- **Idempotent operations** — `cluster_define` and related tools are safe to call multiple times

---

## Prerequisites

- Linux host with KVM/QEMU and libvirt installed (`libvirtd` running)
- Python 3.11+
- `qemu-img` available in `PATH`
- `genisoimage` or `mkisofs` for cloud-init ISO generation
- `iptables`, `tc` (from `iproute2`), and `rsync` for fault/artifact tools
- Permission to run libvirt and host network commands (`sudo` access is usually required)
- `uv` (recommended) or `pip` for installation

Typical package set on Ubuntu/Debian:

```bash
sudo apt-get update
sudo apt-get install -y \
  qemu-kvm libvirt-daemon-system libvirt-clients \
  qemu-utils cloud-image-utils genisoimage \
  iproute2 iptables rsync
```

```bash
# Verify libvirt access
virsh list --all

# Verify qemu-img
qemu-img --version

# Verify ISO tooling (genisoimage or mkisofs)
command -v genisoimage || command -v mkisofs

# Verify tc and iptables
tc -V
iptables --version
```

---

## Installation

### From PyPI (recommended)

Install the latest release:

```bash
pip install vmcluster-mcp
```

Or run directly without installing (via [uv](https://docs.astral.sh/uv/)):

```bash
uvx vmcluster-mcp
```

> **Prerequisite:** `libvirt-python` builds a C extension, so it needs the libvirt and Python development headers at install time. Install them before running `pip install`:
>
> ```bash
> # Ubuntu/Debian
> sudo apt-get install -y libvirt-dev python3-dev pkg-config gcc
>
> # Fedora/RHEL
> sudo dnf install -y libvirt-devel python3-devel pkgconf-pkg-config gcc
>
> # Arch Linux
> sudo pacman -S libvirt pkgconf gcc
> ```

### From source (development)

```bash
git clone https://github.com/hornc/vmcluster-mcp.git
cd vmcluster-mcp

# Create virtual environment and install
uv venv
uv pip install -e .
```

---

## Quick Start (First 15 Minutes)

This path is for first-time setup on a single Linux host.

1. Create required directories and SSH key:

```bash
sudo mkdir -p /etc/vmcluster/topologies /etc/vmcluster/ssh
sudo mkdir -p /var/lib/vmcluster/{overlays,artifacts/trees,faults}
sudo ssh-keygen -t ed25519 -f /etc/vmcluster/ssh/vmcluster_id_ed25519 -N ""
```

2. Create `/etc/vmcluster/config.yaml`:

```yaml
topology_dir: /etc/vmcluster/topologies
overlay_dir: /var/lib/vmcluster/overlays
artifact_registry: /var/lib/vmcluster/artifacts/registry.json
artifact_store_dir: /var/lib/vmcluster/artifacts/trees
fault_registry: /var/lib/vmcluster/faults/registry.json
ssh_key_path: /etc/vmcluster/ssh/vmcluster_id_ed25519
ssh_user: root
libvirt_uri: qemu:///system
log_level: INFO
```

3. Prepare a base image used by the example topology:

```bash
sudo mkdir -p /var/lib/vmcluster/images
sudo wget -O /tmp/ubuntu-24.04-server-cloudimg-amd64.img \
  https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img
sudo qemu-img convert -f qcow2 -O qcow2 \
  /tmp/ubuntu-24.04-server-cloudimg-amd64.img \
  /var/lib/vmcluster/images/ubuntu-6.8-base.qcow2
sudo qemu-img info /var/lib/vmcluster/images/ubuntu-6.8-base.qcow2
```

4. Add your first topology file in `/etc/vmcluster/topologies/` (see example below).

5. Run the server locally to verify it starts (the `.venv` path below assumes a source install; for a PyPI install, use `python -m vmcluster_mcp` from that environment):

```bash
VMCLUSTER_CONFIG=/etc/vmcluster/config.yaml .venv/bin/python -m vmcluster_mcp
```

6. Connect from your MCP client (VS Code or Claude) and run this smoke flow:

```text
cluster_define("example-3node")
cluster_start("example-3node", wait_for_ssh=True)
cluster_status("example-3node")
node_exec("example-3node", "controller", "uname -r")
snapshot_create("example-3node", "baseline")
cluster_stop("example-3node")
```

7. Clean up when finished:

```text
cluster_destroy("example-3node", remove_overlays=True)
```

---

## Configuration

The server can be configured via a YAML file and/or environment variables. Environment variables take precedence over the config file, which takes precedence over defaults.

### Config file

Default location: `/etc/vmcluster/config.yaml`. Override with `VMCLUSTER_CONFIG` env var.

```yaml
# /etc/vmcluster/config.yaml

topology_dir: /etc/vmcluster/topologies      # Where topology YAML files live
overlay_dir: /var/lib/vmcluster/overlays     # Where per-node qcow2 overlays are created
artifact_registry: /var/lib/vmcluster/artifacts/registry.json
artifact_store_dir: /var/lib/vmcluster/artifacts/trees
fault_registry: /var/lib/vmcluster/faults/registry.json
ssh_key_path: /etc/vmcluster/ssh/vmcluster_id_ed25519
ssh_user: root
libvirt_uri: qemu:///system
log_level: INFO
```

### Environment variables

| Variable | Config key | Default |
|----------|-----------|---------|
| `VMCLUSTER_CONFIG` | *(config file path)* | `/etc/vmcluster/config.yaml` |
| `VMCLUSTER_TOPOLOGY_DIR` | `topology_dir` | `/etc/vmcluster/topologies` |
| `VMCLUSTER_OVERLAY_DIR` | `overlay_dir` | `/var/lib/vmcluster/overlays` |
| `VMCLUSTER_ARTIFACT_REGISTRY` | `artifact_registry` | `/var/lib/vmcluster/artifacts/registry.json` |
| `VMCLUSTER_ARTIFACT_STORE_DIR` | `artifact_store_dir` | `/var/lib/vmcluster/artifacts/trees` |
| `VMCLUSTER_FAULT_REGISTRY` | `fault_registry` | `/var/lib/vmcluster/faults/registry.json` |
| `VMCLUSTER_SSH_KEY_PATH` | `ssh_key_path` | `/etc/vmcluster/ssh/vmcluster_id_ed25519` |
| `VMCLUSTER_SSH_USER` | `ssh_user` | `root` |
| `VMCLUSTER_LIBVIRT_URI` | `libvirt_uri` | `qemu:///system` |
| `VMCLUSTER_LOG_LEVEL` | `log_level` | `INFO` |
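
The precedence described at the top of this section (environment variable over config file over built-in default) amounts to a lookup along these lines. This is an illustrative sketch only; the real loader lives in `config.py`, and the keys and defaults shown are taken from the table above.

```python
import os
from pathlib import Path

import yaml

# Keys and built-in defaults from the table above (subset shown).
DEFAULTS = {
    "topology_dir": "/etc/vmcluster/topologies",
    "libvirt_uri": "qemu:///system",
    "log_level": "INFO",
}


def resolve_setting(key: str) -> str:
    """Resolve one setting: env var > config file > default (illustrative only)."""
    env_name = f"VMCLUSTER_{key.upper()}"
    if env_name in os.environ:  # 1. environment variable wins
        return os.environ[env_name]
    config_path = Path(os.environ.get("VMCLUSTER_CONFIG", "/etc/vmcluster/config.yaml"))
    if config_path.exists():  # 2. then the YAML config file
        data = yaml.safe_load(config_path.read_text()) or {}
        if key in data:
            return str(data[key])
    return DEFAULTS[key]  # 3. finally the built-in default


print(resolve_setting("libvirt_uri"))
```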

### Quick setup

```bash
# Create directories
sudo mkdir -p /etc/vmcluster/topologies /etc/vmcluster/ssh
sudo mkdir -p /var/lib/vmcluster/{overlays,artifacts/trees,faults}

# Generate SSH key for VM access
sudo ssh-keygen -t ed25519 -f /etc/vmcluster/ssh/vmcluster_id_ed25519 -N ""
```

---

## Topology Files

Topology files are YAML files placed in `topology_dir`. The agent references topologies by filename (without `.yaml`).

```yaml
# /etc/vmcluster/topologies/example-3node.yaml

cluster_name: example-3node
base_image: /var/lib/vmcluster/images/ubuntu-6.8-base.qcow2
overlay_dir: /var/lib/vmcluster/overlays/

network:
  name: clusternet-example
  bridge: virbr-example0
  subnet: 192.168.100.0/24

nodes:
  - name: controller
    role: control
    vcpus: 2
    memory_mb: 2048
    ip: 192.168.100.10
    extra_disks:
      - path: /var/lib/vmcluster/disks/data0.qcow2
        size_gb: 20
        bus: virtio

  - name: worker-0
    role: worker
    vcpus: 2
    memory_mb: 2048
    ip: 192.168.100.11

  - name: client-0
    role: client
    vcpus: 2
    memory_mb: 1024
    ip: 192.168.100.20

ssh:
  key_path: /etc/vmcluster/ssh/vmcluster_id_ed25519
  user: root
  connect_timeout_s: 30

snapshots:
  baseline: clean-boot   # Logical name for snapshot_revert("baseline")
```

Node IPs are configured statically via cloud-init — no DHCP is used. Each node gets a NoCloud ISO injected at first boot.
The libvirt bridge named under `network.bridge` is created when `cluster_define` defines the topology network, so it does not need to pre-exist on the host.
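
As a rough sketch of how a topology name becomes a validated model, the snippet below resolves `<name>.yaml` inside `topology_dir` and parses it. The field names mirror the example above; the actual loader and models live in `topology/parser.py` and `topology/schema.py` and will differ in detail.

```python
from pathlib import Path

import yaml
from pydantic import BaseModel


class NodeSpec(BaseModel):
    # Per-node fields from the example above (extra keys are ignored by default).
    name: str
    role: str
    vcpus: int
    memory_mb: int
    ip: str


class Topology(BaseModel):
    cluster_name: str
    base_image: str
    nodes: list[NodeSpec]


def load_topology(topology_dir: Path, name: str) -> Topology:
    """Resolve a topology by name (without the .yaml suffix) and validate it."""
    raw = yaml.safe_load((topology_dir / f"{name}.yaml").read_text())
    return Topology(**raw)


# Example: load_topology(Path("/etc/vmcluster/topologies"), "example-3node")
```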

---

## Integration: VS Code (GitHub Copilot)

Add the server to your VS Code MCP configuration. Open **Settings → MCP** or edit `.vscode/mcp.json` in your workspace (or the user-level `mcp.json` via the **MCP: Open User Configuration** command).

If you installed from source into a virtualenv (recommended):

```json
{
  "servers": {
    "vmcluster-mcp": {
      "type": "stdio",
      "command": "/path/to/vmcluster-mcp/.venv/bin/python",
      "args": ["-m", "vmcluster_mcp"],
      "env": {
        "VMCLUSTER_CONFIG": "/etc/vmcluster/config.yaml"
      }
    }
  }
}
```

If you prefer ephemeral launch with `uv run --with`:

```json
{
  "servers": {
    "vmcluster-mcp": {
      "type": "stdio",
      "command": "uv",
      "args": [
        "run",
        "--with", "git+https://github.com/chompinbits/vmcluster-mcp.git",
        "python", "-m", "vmcluster_mcp"
      ],
      "env": {
        "VMCLUSTER_TOPOLOGY_DIR": "/etc/vmcluster/topologies",
        "VMCLUSTER_OVERLAY_DIR": "/var/lib/vmcluster/overlays",
        "VMCLUSTER_SSH_KEY_PATH": "/etc/vmcluster/ssh/vmcluster_id_ed25519",
        "VMCLUSTER_LIBVIRT_URI": "qemu:///system"
      }
    }
  }
}
```

After saving, restart the MCP server from the VS Code MCP panel. The tools will appear in Copilot Chat under the `vmcluster-mcp` server.

---

## Integration: Claude CLI

### `claude` (Anthropic Claude CLI / Claude Desktop)

For Claude Desktop, add the server to `claude_desktop_config.json` in the Claude app's configuration directory. For the `claude` CLI, prefer registering it with `claude mcp add` (see below):

```json
{
  "mcpServers": {
    "vmcluster-mcp": {
      "command": "python",
      "args": ["-m", "vmcluster_mcp"],
      "env": {
        "VMCLUSTER_CONFIG": "/etc/vmcluster/config.yaml"
      }
    }
  }
}
```

If using a virtualenv:

```json
{
  "mcpServers": {
    "vmcluster-mcp": {
      "command": "/path/to/vmcluster-mcp/.venv/bin/python",
      "args": ["-m", "vmcluster_mcp"],
      "env": {
        "VMCLUSTER_CONFIG": "/etc/vmcluster/config.yaml"
      }
    }
  }
}
```

For the `claude` CLI (interactive terminal), register the server persistently:

```bash
claude mcp add vmcluster-mcp -- python -m vmcluster_mcp
```

Verify the server is loaded:

```bash
claude mcp list
```

---

## Available Tools

All tools return `ToolResult[T]` — a structured JSON object with `success: bool`, `result: T | null`, and `error: { code, message, recoverable } | null`.
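
The envelope corresponds roughly to a generic Pydantic model like the sketch below. Field names are taken from the description above; the actual definitions live in `models.py`.

```python
from typing import Generic, Optional, TypeVar

from pydantic import BaseModel

T = TypeVar("T")


class ToolError(BaseModel):
    code: str
    message: str
    recoverable: bool


class ToolResult(BaseModel, Generic[T]):
    success: bool
    result: Optional[T] = None
    error: Optional[ToolError] = None


# A failed call might serialize as (error code is hypothetical):
# {"success": false, "result": null,
#  "error": {"code": "SSH_TIMEOUT", "message": "...", "recoverable": true}}
```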

### Cluster Lifecycle and Recovery

| Tool | Description |
|------|-------------|
| `cluster_define(topology_name)` | Provision a cluster from a topology file: create network, per-node overlay disks, cloud-init ISOs, and libvirt domain definitions. Idempotent. |
| `cluster_start(cluster_name, wait_for_ssh, ssh_timeout_s)` | Boot all stopped nodes. Optionally waits for SSH on all nodes (strict: one failure = `success=False`). |
| `cluster_stop(cluster_name, mode)` | Stop all running nodes. `mode="shutdown"` (ACPI) or `mode="destroy"` (force-off). |
| `cluster_destroy(cluster_name, remove_overlays)` | Undefine all domains, destroy the network. Optionally delete overlay disk files. |
| `cluster_status(cluster_name)` | Return per-node domain state and SSH reachability. SSH is checked in parallel only for running nodes. |
| `cluster_handle(cluster_name)` | Return a `ClusterHandle` with node SSH descriptors, `artifact_path`, and `kernel_version` (fetched via SSH). Requires running cluster. |
| `node_crash(cluster_name, node, restart_after, wait_for_ssh, ssh_timeout_s)` | Simulate an unclean node failure (`virsh destroy`) and optionally restart/wait for SSH. |

### Remote Command Execution

| Tool | Description |
|------|-------------|
| `node_exec(cluster_name, node_name, command, timeout_s)` | Run a command on one node and return structured stdout/stderr/exit metadata. |
| `node_exec_all(cluster_name, command, nodes, require_all, timeout_s)` | Run a command on many nodes in parallel with per-node results and failure map. |

### Snapshot Management

| Tool | Description |
|------|-------------|
| `snapshot_create(cluster_name, snapshot_name, include_memory)` | Create disk snapshots for all nodes in the cluster. |
| `snapshot_list(cluster_name)` | List snapshots with per-node disk metadata. |
| `snapshot_revert(cluster_name, snapshot_name, restart_after, wait_for_ssh, ssh_timeout_s)` | Revert all nodes to a named snapshot and optionally restart/verify SSH. |
| `snapshot_delete(cluster_name, snapshot_name)` | Delete a named snapshot across all nodes (best effort with per-node status). |

### Artifact Management

| Tool | Description |
|------|-------------|
| `artifact_register(source_path, build_type, kernel_version, metadata)` | Register a local build tree and get a content-addressed artifact id. |
| `artifact_list()` | List registered artifacts. |
| `artifact_diff(artifact_id_a, artifact_id_b)` | Diff modules/binaries between two artifacts. |
| `artifact_sync(cluster_name, artifact_id, nodes, force, dest_base)` | Sync artifact content to target nodes over SSH/rsync. |
| `artifact_install(cluster_name, artifact_id, nodes, install_mode, dest_base)` | Install synced artifacts on nodes with structured per-node install status. |

### Network Fault Injection

| Tool | Description |
|------|-------------|
| `net_partition(cluster_name, partition_a, partition_b)` | Insert symmetric iptables partition rules between node groups. |
| `net_impair(cluster_name, source_node, target_node, latency_ms, jitter_ms, loss_pct, corrupt_pct, reorder_pct)` | Apply tc netem impairment on a source node tap interface. |
| `net_heal(cluster_name, fault_handle)` | Remove a specific fault and deregister its handle. |
| `net_heal_all(cluster_name)` | Remove all active faults for a cluster. |
| `net_fault_list(cluster_name)` | List all active fault handles and parameters from fault registry. |

### Kernel Observability

| Tool | Description |
|------|-------------|
| `dmesg_mark(cluster_name, nodes)` | Write a shared marker into `/dev/kmsg` on target nodes. |
| `dmesg_collect(cluster_name, nodes, since_marker, filter_level)` | Collect and classify dmesg lines (`all`, `warn+`, `err+`). |

### Return types

**`ClusterStatus`** — returned by `cluster_define`, `cluster_start`, `cluster_stop`, `cluster_destroy`, `cluster_status`:
```json
{
  "cluster_name": "example-3node",
  "network_active": true,
  "nodes": [
    {
      "name": "controller",
      "role": "control",
      "ip": "192.168.100.10",
      "domain_state": "running",
      "ssh_reachable": true
    }
  ]
}
```

**`ClusterHandle`** — returned by `cluster_handle`:
```json
{
  "cluster_name": "example-3node",
  "artifact_path": "/opt/vmcluster/artifacts",
  "kernel_version": "6.8.0-51-generic",
  "nodes": [
    {
      "name": "controller",
      "role": "control",
      "ip": "192.168.100.10",
      "ssh_port": 22,
      "ssh_user": "root",
      "ssh_key_path": "/etc/vmcluster/ssh/vmcluster_id_ed25519"
    }
  ]
}
```

Most non-lifecycle tools follow the same envelope with their own typed `result`
payload (for example `ExecResult`, `SnapshotInfo`, `NetFaultInfo`, `SyncStatus`).
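
Because the handle carries everything needed for direct SSH access, downstream code can consume it without going back through the MCP server. The sketch below is not part of the server; it only assumes the JSON fields shown above and uses `asyncssh` (already a dependency) as an example client.

```python
import asyncio
import json

import asyncssh


async def run_on_all_nodes(handle_json: str, command: str) -> dict[str, str]:
    """Run a command on every node described by a ClusterHandle (consumer-side sketch)."""
    handle = json.loads(handle_json)
    outputs: dict[str, str] = {}
    for node in handle["nodes"]:
        async with asyncssh.connect(
            node["ip"],
            port=node["ssh_port"],
            username=node["ssh_user"],
            client_keys=[node["ssh_key_path"]],
            known_hosts=None,  # test VMs use throwaway host keys
        ) as conn:
            result = await conn.run(command, check=True)
            outputs[node["name"]] = result.stdout.strip()
    return outputs


# Example: asyncio.run(run_on_all_nodes(handle_json, "uname -r"))
```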

---

## Canonical Agent Workflow

```text
# 1. Define the cluster (idempotent — safe to call multiple times)
cluster_define("example-3node")

# 2. Start all nodes and wait for SSH
cluster_start("example-3node", wait_for_ssh=True)

# 3. Get cluster handle for downstream SSH use
handle = cluster_handle("example-3node")

# 4. Check status at any time
cluster_status("example-3node")

# 5. Graceful shutdown
cluster_stop("example-3node", mode="shutdown")

# 6. Full teardown (remove overlays too)
cluster_destroy("example-3node", remove_overlays=True)
```

### Extended workflow (artifacts + faults + observability, pseudo-notation)

The flow below shows the intended sequence of tool calls.

```text
# Register and deploy build artifacts
artifact_id = artifact_register("/path/to/build/tree").result.artifact_id
artifact_sync("example-3node", artifact_id)
artifact_install("example-3node", artifact_id)

# Add a network impairment and inspect active faults
fault = net_impair("example-3node", source_node="worker-0", latency_ms=150)
net_fault_list("example-3node")

# Mark and collect dmesg around your test window
markers = dmesg_mark("example-3node")
dmesg_collect("example-3node", since_marker=markers["worker-0"], filter_level="warn+")

# Heal injected faults
net_heal("example-3node", fault.result.handle_id)
```

---

## Troubleshooting

### `cluster_define` fails creating overlays

- Ensure the base image path in the topology exists and is readable.
- Validate host tool availability: `qemu-img --version`.
- Confirm the overlay directory is writable by the user running the MCP server.

### SSH timeouts in `cluster_start` or `snapshot_revert`

- Confirm cloud-init configured the static IPs expected by the topology.
- Verify key/user pair: `VMCLUSTER_SSH_KEY_PATH`, `VMCLUSTER_SSH_USER`.
- Increase `ssh_timeout_s` for cold boots.

### Fault tools fail (`iptables`/`tc` errors)

- Ensure the MCP process has the privileges required for host networking commands.
- Confirm `iptables` and `tc` are installed and executable.
- Validate that the libvirt bridge name in the topology matches the active host interface.

### `artifact_sync` or `artifact_install` partial failures

- Use `node_exec_all(..., command="df -h")` to verify remote disk space.
- Verify SSH connectivity and remote path permissions under `dest_base`.
- Re-run with narrowed `nodes=[...]` to isolate problematic hosts.

### Snapshot delete blocked

- `snapshot_delete` refuses to remove active backing snapshots by design.
- Revert or switch the active disk chain first, then delete the snapshot.

### Useful host checks

```bash
virsh list --all
virsh net-list --all
ip -br link
sudo iptables -S | head
sudo tc qdisc show
```

---

## Development

```bash
# Clone and install with dev dependencies
git clone https://github.com/hornc/vmcluster-mcp.git
cd vmcluster-mcp
uv venv && uv pip install -e '.[dev]'

# Run tests
.venv/bin/pytest

# Lint
.venv/bin/ruff check vmcluster_mcp/

# Run the server directly (stdio mode)
.venv/bin/python -m vmcluster_mcp
```

### Project structure

```
vmcluster_mcp/
  cluster/          # Cluster lifecycle tools (define, start, stop, destroy, status, handle, crash)
    libvirt_client.py   # Thread-safe async libvirt wrapper
    domain_builder.py   # KVM domain XML generation
    network_builder.py  # libvirt NAT network XML generation
    cloud_init.py       # cloud-init NoCloud ISO generation
  exec/             # Remote command execution tools (node_exec, node_exec_all)
    ssh.py          # SSH client and connection pool management
  snapshot/         # Snapshot tools (create, list, revert, delete)
    manager.py      # Snapshot operations
  artifact/         # Artifact tools (register, list, diff, sync, install)
    installer.py    # Remote artifact installation
    registry.py     # Content-addressed artifact registry
    syncer.py       # rsync-based artifact synchronization
  net/              # Network fault tools (partition, impair, heal, list)
    fault_registry.py   # Persistent fault registry
    fault.py        # iptables/tc fault implementation
  observe/          # Kernel observability tools (dmesg_mark, dmesg_collect)
    classifier.py   # dmesg line classification
    dmesg.py        # dmesg collection and parsing
  topology/         # Topology YAML parsing and schema
    parser.py       # Topology loader
    schema.py       # Topology models
  models.py         # Shared Pydantic models (ToolResult, ClusterStatus, ClusterHandle, …)
  config.py         # Configuration loading (YAML + env vars)
  server.py         # FastMCP server instance and structured_tool_handler
```
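
When adding a new tool, the wiring follows the FastMCP pattern from the `mcp` Python SDK. The sketch below is a minimal, self-contained example of that pattern; the tool name and result model are hypothetical, and the repository's own registration goes through `server.py` and its `structured_tool_handler`, which may differ from what is shown.

```python
from mcp.server.fastmcp import FastMCP
from pydantic import BaseModel

mcp = FastMCP("vmcluster-mcp")


class PingResult(BaseModel):
    message: str


@mcp.tool()
def ping(text: str) -> PingResult:
    """Hypothetical tool: echoes its input back as a structured result."""
    return PingResult(message=f"pong: {text}")


if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```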
