Metadata-Version: 2.4
Name: modelpulse
Version: 0.4.2
Summary: End-to-end partial-weight transfer pipeline.
Author-email: Mohammad Sufiyan <moahmmadsufiyan152@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/MdSufiyan005/ModelPulse
Project-URL: Source, https://github.com/MdSufiyan005/ModelPulse
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: httpx>=0.27
Requires-Dist: websockets>=12.0
Requires-Dist: typer>=0.12
Requires-Dist: fastapi>=0.111
Requires-Dist: uvicorn[standard]>=0.30
Requires-Dist: rich>=13.7
Requires-Dist: llama-cpp-python>=0.2.90
Requires-Dist: psutil>=5.9
Requires-Dist: groq>=0.18.0
Requires-Dist: huggingface_hub>=0.30.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: scipy>=1.11.0
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"

# ModelPulse 🚀

**End-to-end partial-weight transfer pipeline for edge LLM inference.**

ModelPulse enables a unique "Zero-Disk" inference strategy: **Device A** (Server) serves model shards over the network, while **Device B** (Client/Bridge) reconstructs the model entirely in RAM and runs inference via `llama.cpp` without ever writing the full GGUF to physical storage.

## Data Flow Diagram

```
┌─────────────────────────────────────────────────────────────┐
│                     Server (Device A)                       │
│                  FastAPI @ 0.0.0.0:8000                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  WebSocket /ws (Control Plane)   HTTP (Data Plane)          │
│  ├─ MODEL_READY                  ├─ GET /manifest           │
│  ├─ PING/PONG                    ├─ GET /shards/*           │
│  ├─ METRICS                      └─ POST /metrics           │
│  └─ ACK/BYE                                                 │
│                                                             │
│  /models/upload (Multipart)                                 │
│  └─ Accept manifest.json + *.shard files                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
         ↑                              ↑
         │                              │
         │ WS connect                   │ HTTP GET/POST
         │ + MODEL_READY signal         │ + shard stream
         │                              │
┌────────┴──────────────────────────────┴─────────────────────┐
│                   Client (Device B)                         │
│                       Bridge CLI                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Connect WebSocket → Send HELLO                          │
│  2. Receive MODEL_READY → Fetch manifest (HTTP)             │
│  3. Download shards (HTTP streaming)                        │
│  4. Assemble GGUF in /dev/shm                               │
│  5. Load with llama.cpp                                     │
│  6. Run inference                                           │
│  7. Send METRICS → Wait for next MODEL_READY signal         │
│     (event-driven — no polling, no restart required)        │ 
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
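
For orientation, here is a minimal sketch of the bridge side of this control plane in Python. The JSON framing and field names are assumptions for illustration; the real message schema lives in the bridge/server code.

```python
# Minimal sketch of the bridge's control-plane loop (assumed JSON frames
# with a "type" field; not the actual wire format).
import asyncio
import json

import websockets  # already a ModelPulse dependency

async def control_loop(ws_url: str = "ws://127.0.0.1:8000/ws") -> None:
    async with websockets.connect(ws_url) as ws:
        await ws.send(json.dumps({"type": "HELLO"}))      # step 1: announce
        async for raw in ws:                              # event-driven, no polling
            msg = json.loads(raw)
            if msg.get("type") == "MODEL_READY":
                # ...fetch manifest + shards over HTTP, assemble, load, infer...
                await ws.send(json.dumps({"type": "METRICS", "data": {}}))

if __name__ == "__main__":
    asyncio.run(control_loop())
```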



---

## ✨ Key Features

- **🛡️ Zero-Disk Strategy**: Models are assembled in `tmpfs` (`/dev/shm`), ensuring no persistent GGUF footprint on the client's disk.
- **🔄 Dynamic Model Swapping**: Upload new models to the server at runtime; connected clients automatically unload, pull, and reload the new model without a restart.
- **⚡ Delta Updates (New!)**: Update only the changed tensors in a model. The bridge patches its in-memory GGUF in real time, downloading only a fraction of the full model size (a rough diff sketch follows this list).
- **📊 Real-time Telemetry**: Detailed inference metrics (TTFT, tok/s, RAM delta, CPU temp) are streamed back to the server for centralized monitoring.
- **🛠️ Integrated Benchmarking**: Built-in suite to stress-test edge devices and validate performance across different quantization levels.
- **🌐 Network Agnostic**: Works seamlessly over local networks, Tailscale, or any HTTP/WS-capable connection.
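
To make the delta idea concrete, here is a rough sketch of hash-based shard diffing. This is an illustration only, not the project's actual auto-diff implementation:

```python
# Sketch: find .shard files whose content changed between a base and a new
# shard directory, by SHA-256 (illustrative; not the real helpers.py code).
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def changed_shards(base_dir: str, new_dir: str) -> list[str]:
    base = {p.name: sha256(p) for p in Path(base_dir).glob("*.shard")}
    return [
        p.name for p in sorted(Path(new_dir).glob("*.shard"))
        if base.get(p.name) != sha256(p)  # new or modified shards
    ]
```

A delta upload would then only need to ship the shards this returns, plus the updated manifest.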

---

## 📦 Installation

Install `ModelPulse` from PyPI:

```bash
pip install modelpulse
```

Alternatively, install directly from the repository for the latest dev features:

```bash
pip install git+https://github.com/MdSufiyan005/ModelPulse.git
```

*Note: Ensure `llama-cpp-python`'s build dependencies are installed on your system (e.g., `build-essential` and `python3-dev` on Debian/Ubuntu).*

---

## 🔄 Workflow

### 1. Prepare Shards
Convert a monolithic `.gguf` file into a shard directory:

```bash
modelpulse server convert my_model.gguf ./my-shards/
```

### 2. Start the Server
Start the control plane on Device A. Use `--log-dir` to specify where inference metrics are saved.

```bash
modelpulse server run --host 0.0.0.0 --port 8000 --log-dir ./results
```

### 3. Run the Bridge
Connect your edge device to the server. It will wait for a model to be assigned.

```bash
modelpulse bridge run http://<server-ip>:8000
```

### 4. Dynamic Upload
Upload your prepared shards to the server. All connected bridges will instantly receive the update.

```bash
# Full Baseline Upload
modelpulse server upload "qwen-3.5-2b" "./my-shards/"

# Delta Update (Auto-Diff)
modelpulse server upload "qwen-3.5-2b-v2" "./new-shards/" --base "qwen-3.5-2b" --base-dir "./old-shards/"
```

---

## 📋 Command Reference

### `modelpulse server run`
Start the FastAPI control plane.

| Option | Default | Description |
| :--- | :--- | :--- |
| `--shard-dir`, `-d` | `./models-storage` | Root directory for model storage |
| `--host` | `127.0.0.1` | Bind address |
| `--port` | `8000` | Listening port |
| `--log-dir` | Current directory | Directory to save `metrics.jsonl` |
| `--ping-interval` | `20.0` | WebSocket ping interval (seconds) |

### `modelpulse server upload`
Upload models or delta patches to the control plane.

| Option | Default | Description |
| :--- | :--- | :--- |
| `model_id` | (Required) | Unique slug for the new model |
| `paths` | (Required) | Shard directory or list of `.shard` files |
| `--base` | `None` | Base model ID for delta update |
| `--base-dir` | `None` | Local directory of base model for auto-diff |
| `--server` | `http://127.0.0.1:8000` | Target server URL |

### `modelpulse server convert`
Convert a monolithic GGUF file into tensor-level shards.

| Argument | Description |
| :--- | :--- |
| `gguf_path` | Path to the monolithic .gguf file |
| `output_dir` | Directory to store the generated shards |

### `modelpulse bridge run`
Connect to a server and enter the inference loop.

| Option | Default | Description |
| :--- | :--- | :--- |
| `host` | (Required) | Server URL (e.g., `http://100.64.0.5:8000`) |
| `--prompt` | `None` (listen mode) | Send a single prompt then wait for further updates |
| `--benchmark`, `-b` | `false` | Run the standard benchmark suite |
| `--max-tokens`, `-m` | `256` | Token generation limit |
| `--temperature` | `0.7` | Sampling temperature |
| `--n-ctx` | `2048` | Context window size |
| `--perplexity`, `-p` | `false` | Compute perplexity score during benchmark |

### `modelpulse agent run`
Run the iterative quantization + deployment optimizer.

| Option | Default | Description |
| :--- | :--- | :--- |
| `hf_model_id` | (Required) | Hugging Face model repo ID containing GGUF files |
| `--base-model-name` | (Required) | Stable model slug prefix used for iterations |
| `--hf-gguf-filename` | `""` | Optional specific GGUF filename to pick from the HF snapshot |
| `--gguf-path` | `None` | Optional local GGUF path (skip HF download) |
| `--device-name` | (Required) | Target device name |
| `--ram-gb` | (Required) | Device RAM in GB |
| `--cpu` | (Required) | Target CPU model/name |
| `--gpu` | `""` | Optional GPU model/name |
| `--network` | `zerotier` | Network type (`zerotier`, `tailscale`, `cloudflare`, `lan`) |
| `--max-iterations` | `4` | Number of optimization rounds (`1-10`) |
| `--blockwise-top-k` | `64` | Top-K changed shards to send in blockwise delta mode |
| `--prefer-quality` | `false` | Prefer quality over speed when planning/scoring |
| `--require-llm-planner` | `false` | Require the Groq planner and fail fast instead of using the heuristic fallback |
| `--verbose-tool-logs` | `false` | Print full quantization tool logs (default is compact agent output) |
| `--server` | `http://127.0.0.1:8000` | ModelPulse server URL |
| `--workspace` | `./.modelpulse-agent` | Agent artifact workspace |
| `--hf-cache-dir` | `./.modelpulse-agent/hf-cache` | Hugging Face cache directory |
| `--groq-model` | `llama-3.3-70b-versatile` | Groq planner model |
| `--quant-bin` | `llama-quantize` | Quantization binary path/name |
| `--llama-cpp-dir` | `None` | Optional llama.cpp source dir. If omitted, bundled `modelpulse/llama.cpp` is used when available |

---

## 📁 Project Layout

```bash
modelpulse/
├── modelpulse/
│   ├── main.py                 # Unified CLI entry point: bridge/server/agent
│   ├── server/
│   │   ├── app.py              # FastAPI app (HTTP + WS control/data plane)
│   │   ├── cli.py              # Server commands: run/upload/convert
│   │   ├── connection.py       # WS client manager
│   │   └── helpers.py          # File hash helpers
│   └── agent/
│       ├── cli.py              # Agent command and Rich output
│       ├── orchestrator.py     # Iterative optimization loop
│       ├── planner.py          # Groq planner + heuristic fallback
│       ├── quantization.py     # Quant plan + llama-quantize execution
│       ├── downloader.py       # Hugging Face GGUF fetch + selection
│       ├── toolchain.py        # Auto-detect/build llama-quantize
│       └── models.py           # Agent dataclasses/report model
├── README.md
└── pyproject.toml
```

---

## 💾 The Zero-Disk Strategy

ModelPulse leverages the Linux `tmpfs` (RAM-backed filesystem) to satisfy `llama.cpp`'s requirement for a file path while keeping the actual data off physical storage:

1. **Pull**: Bridge fetches `manifest.json`.
2. **Stream**: Bridge pulls `.shard` files (tensor by tensor) into memory.
3. **Assemble**: Bridge calculates GGUF layout and writes bytes to `/dev/shm/sb_<pid>.gguf`.
4. **Load**: `llama-cpp-python` loads the model via `mmap` from the RAM-backed file.
5. **Clean**: Once the model is unloaded, the virtual file is unlinked and memory is reclaimed.
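
A minimal sketch of steps 1-5, assuming a hypothetical `shards` list in the manifest (the real bridge computes the exact GGUF layout; naive concatenation is shown here only for brevity):

```python
# Sketch of the zero-disk flow. Manifest keys and response payloads are
# assumptions; only the endpoint paths match the data plane above.
import os

import httpx
from llama_cpp import Llama

SERVER = "http://127.0.0.1:8000"  # assumption: server reachable on localhost
shm_path = f"/dev/shm/sb_{os.getpid()}.gguf"

with httpx.Client(base_url=SERVER) as client:
    manifest = client.get("/manifest").json()       # 1. pull manifest
    with open(shm_path, "wb") as out:               # 2-3. stream + assemble
        for shard in manifest["shards"]:            # hypothetical key
            with client.stream("GET", f"/shards/{shard}") as resp:
                for chunk in resp.iter_bytes():
                    out.write(chunk)

llm = Llama(model_path=shm_path, n_ctx=2048, use_mmap=True)  # 4. mmap load
print(llm("Hello", max_tokens=16)["choices"][0]["text"])

del llm
os.unlink(shm_path)  # 5. unlink; tmpfs memory is reclaimed
```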

---

## 📡 Networking (Tailscale, ZeroTier, Cloudflare)

For easy cross-device connectivity without port forwarding:

### 1. ZeroTier (Recommended for Large Models)
ZeroTier creates a virtual LAN with no payload limits, making it ideal for multi-GB model uploads.
```bash
# Connect Bridge to Server's ZeroTier IP
modelpulse bridge run http://10.147.17.100:8000 --benchmark
```

### 2. Tailscale
Standard virtual private networking:
```bash
# Get IP on Server
tailscale ip  # e.g., 100.66.170.100

# Connect Bridge
modelpulse bridge run http://100.66.170.100:8000
```

### 3. Cloudflare Tunnel
Good for public access, but note the **100 MB upload limit** on the free tier, which may affect `server upload` commands.
```bash
# Connect Bridge to public tunnel URL
modelpulse bridge run https://modelpulse.your-domain.com
```

---

## 🤖 Agentic Quant Optimization

ModelPulse now includes an iterative optimization agent that:
- accepts model + device requirements,
- uses a Groq planner to choose quantization strategy (`full_quant` or `tensor_blockwise`),
- quantizes and converts GGUF to shards,
- deploys each iteration via the same `modelpulse server upload` flow (full + auto-diff delta),
- reads benchmark metrics from `/results/latest`,
- computes KL divergence over changed tensor shards in blockwise mode and can send only the top-K shards (see the sketch after the run example below),
- uses metric context for next iteration and recommends the best-fit model.

```bash
export GROQ_API_KEY="..."
modelpulse agent run "Qwen/Qwen2.5-0.5B-Instruct-GGUF" \
  --base-model-name "qwen-0.5b" \
  --hf-gguf-filename "qwen2.5-0.5b-instruct-f16.gguf" \
  --device-name "edge-rpi-5" \
  --ram-gb 8 \
  --cpu "Cortex-A76" \
  --network zerotier \
  --server http://10.147.17.100:8000 \
  --max-iterations 4 \
  --blockwise-top-k 64 \
  --require-llm-planner \
  --llama-cpp-dir "/home/haider/Coding/build-edgeopt/b-device/agent/Chiseled/llama.cpp"
```
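
For blockwise mode, the shard-ranking idea can be sketched as follows. This is a hypothetical helper built on the existing `numpy`/`scipy` dependencies, not the agent's internal API:

```python
# Sketch: rank changed shards by KL divergence between histograms of their
# old and new tensor values, then keep the top-K (cf. --blockwise-top-k).
import numpy as np
from scipy.stats import entropy

def shard_kl(old: np.ndarray, new: np.ndarray, bins: int = 256) -> float:
    """KL(old || new) over histograms of the flattened tensor values."""
    lo = float(min(old.min(), new.min()))
    hi = float(max(old.max(), new.max()))
    p, _ = np.histogram(old, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(new, bins=bins, range=(lo, hi), density=True)
    eps = 1e-12  # avoid log(0) and division by zero
    return float(entropy(p + eps, q + eps))

def top_k_shards(pairs: dict[str, tuple[np.ndarray, np.ndarray]], k: int) -> list[str]:
    """pairs maps shard name -> (old tensor, new tensor); returns top-K names."""
    scores = {name: shard_kl(o, n) for name, (o, n) in pairs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```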

The agent auto-downloads the GGUF from Hugging Face into its cache (unless `--gguf-path` is provided)
and auto-resolves or builds `llama-quantize` if it is missing.

If `GROQ_API_KEY` is unset or empty, or the Groq SDK is unavailable, planner decisions fall back
to an internal heuristic so optimization can still run end-to-end.
Use `--require-llm-planner` to disable the fallback and force LLM-only planning.

Artifacts are saved in `./.modelpulse-agent/`, including `optimization-report.json`.

---

<p align="center">
  <i>Built with ❤️ for Edge AI and Decentralized Inference.</i>
</p>
