Metadata-Version: 2.4
Name: tasc-hpc-daemon
Version: 0.1.1
Summary: HPC cluster daemon for bridging AI agents to remote compute resources
Author: Tiptree Advanced Systems Corporation, Miles Qi Li
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: websockets>=14.0

# HPC Daemon

A lightweight daemon that runs on HPC clusters or remote servers, bridging AI agents to local PTY shells and SLURM job schedulers over WebSocket.

## How It Works

AI agents run in isolated cloud sandboxes and cannot directly reach machines behind firewalls. The daemon solves this with a reverse-proxy pattern:

1. The daemon runs on your server and opens an **outbound** WebSocket to the Tiptree platform
2. The agent client also connects **outbound** to the same platform
3. The platform routes messages between them, keyed on your user identity

Because the daemon initiates the connection, no inbound firewall rules are needed.
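
The routing idea can be pictured with a small in-memory model. This is only a sketch: the `Relay` class, the role names, and the message shape are hypothetical stand-ins for illustration, not the platform's actual implementation or wire format.

```python
import asyncio
from collections import defaultdict


class Relay:
    """Hypothetical in-memory stand-in for the platform's message router."""

    def __init__(self) -> None:
        # One inbox per (user_id, role); role is "daemon" or "agent".
        self._inboxes: dict[tuple[str, str], asyncio.Queue] = defaultdict(asyncio.Queue)

    def connect(self, user_id: str, role: str) -> asyncio.Queue:
        # Both sides dial out to the relay; neither accepts inbound connections.
        return self._inboxes[(user_id, role)]

    async def send(self, user_id: str, from_role: str, message: dict) -> None:
        # Forward to the opposite role registered under the same user identity.
        to_role = "agent" if from_role == "daemon" else "daemon"
        await self._inboxes[(user_id, to_role)].put(message)


async def demo() -> dict:
    relay = Relay()
    daemon_inbox = relay.connect("user-1", "daemon")
    relay.connect("user-1", "agent")
    # The agent's message reaches the daemon only via the relay.
    await relay.send("user-1", "agent", {"type": "run", "command": "hostname"})
    return await daemon_inbox.get()
```

Running `asyncio.run(demo())` returns the message the agent sent, delivered to the daemon's inbox without either side listening on a port.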

### Execution Modes

- **PTY mode** (synchronous) — interactive shell sessions for quick commands
- **Job mode** (asynchronous) — batch job submission with automatic wake-on-complete callbacks
  that are persisted locally and retried until delivery succeeds

The daemon auto-detects SLURM. If `sbatch` is available, jobs go through SLURM; otherwise they run as local background processes.
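
The "persisted locally and retried" behavior of job mode resembles an outbox pattern. The sketch below illustrates that pattern with SQLite; the table name, columns, and function names are assumptions made for the example and do not describe the daemon's real schema.

```python
import sqlite3
from typing import Callable


def init_outbox(conn: sqlite3.Connection) -> None:
    # Hypothetical table: one row per finished job awaiting delivery.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS callbacks ("
        " job_id TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " delivered INTEGER NOT NULL DEFAULT 0)"
    )


def record_completion(conn: sqlite3.Connection, job_id: str, payload: str) -> None:
    # Persist before attempting delivery so a crash or disconnect cannot lose it.
    conn.execute(
        "INSERT OR IGNORE INTO callbacks (job_id, payload) VALUES (?, ?)",
        (job_id, payload),
    )
    conn.commit()


def flush_outbox(conn: sqlite3.Connection, deliver: Callable[[str, str], bool]) -> int:
    """Try every undelivered callback once; return how many were delivered."""
    rows = conn.execute(
        "SELECT job_id, payload FROM callbacks WHERE delivered = 0"
    ).fetchall()
    sent = 0
    for job_id, payload in rows:
        if deliver(job_id, payload):
            conn.execute(
                "UPDATE callbacks SET delivered = 1 WHERE job_id = ?", (job_id,)
            )
            sent += 1
    conn.commit()
    return sent
```

Calling `flush_outbox` periodically gives at-least-once delivery: a callback stays in the table until `deliver` reports success.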

## Installation

Requirements:

- Python 3.10+
- A [Tiptree](https://tiptreesystems.com) account

```bash
pip install tasc-hpc-daemon
```

This installs the `hpc-daemon` command and its only dependency (`websockets`).

## Setup

```bash
hpc-daemon setup \
    --email you@example.com \
    --url https://althea.tiptreesystems.com
```

For the development environment, use:

```bash
hpc-daemon setup \
    --email you@example.com \
    --url https://althea.dev.tiptreesystems.com
```

The `--url` value must be the Tiptree app/platform base URL that exposes `/otp`,
`/auth`, and `/hpc` routes. Do not use the marketing domain, such as
`https://tiptreesystems.com` or `https://dev.tiptreesystems.com`.

The setup wizard:

1. Authenticates via a one-time code sent to your email
2. Presents a **disclaimer** about remote code execution risks
3. Creates an API key for daemon authentication
4. Prompts for an optional **skill** (built-in cluster-specific guidance, or a custom server description)
5. Prompts for **directory restrictions** (where the agent is allowed to write; defaults to `$SCRATCH/tiptree-workspace` or `~/tiptree-workspace`)
6. Prompts for a **job working directory** and optional **server instructions**
7. Registers the daemon with the platform

The daemon ID defaults to the machine's hostname. Override with `--daemon-id`.

Re-running setup for the same daemon ID updates the existing profile (no duplicates).
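
The hostname default and the "no duplicates" behavior amount to an upsert keyed on the daemon ID. The following is a minimal sketch of that idea, assuming a hypothetical `profiles` table with two columns; the real schema and the place where the update happens (locally or on the platform) are not specified here.

```python
import socket
import sqlite3
from typing import Optional


def save_profile(
    conn: sqlite3.Connection,
    cluster_name: str,
    daemon_id: Optional[str] = None,
) -> str:
    """Insert or update a profile; return the daemon ID that was used."""
    # Fall back to the machine's hostname when no ID is given.
    daemon_id = daemon_id or socket.gethostname()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS profiles ("
        " daemon_id TEXT PRIMARY KEY,"
        " cluster_name TEXT NOT NULL)"
    )
    # The primary key turns a second setup run into an update, not a new row.
    conn.execute(
        "INSERT INTO profiles (daemon_id, cluster_name) VALUES (?, ?) "
        "ON CONFLICT(daemon_id) DO UPDATE SET cluster_name = excluded.cluster_name",
        (daemon_id, cluster_name),
    )
    conn.commit()
    return daemon_id
```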

### Non-Interactive Setup

For automated deployments:

```bash
hpc-daemon setup \
    --email you@example.com \
    --url https://althea.tiptreesystems.com \
    --no-interview \
    --allowed-dirs ~/workspace ~/scratch \
    --skill mila-hpc \
    --cluster-name my-cluster
```

Use `--cluster-name` to set a human-readable name for the cluster (defaults to the machine's hostname).

> **Warning:** `--no-interview` skips **all** interactive prompts (disclaimer, skill selection, directory restrictions, working directory, server instructions). OTP is still required. Without `--allowed-dirs`, the agent gets unrestricted filesystem write access.
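
For deployment scripts written in Python, one way to avoid the unrestricted-write pitfall is to build the command through a helper that refuses an empty directory list. The helper name `build_setup_command` is hypothetical; the flags are the ones shown above.

```python
import os
from typing import Optional, Sequence


def build_setup_command(
    email: str,
    url: str,
    allowed_dirs: Sequence[str],
    skill: Optional[str] = None,
    cluster_name: Optional[str] = None,
) -> list[str]:
    """Build an argv list for `hpc-daemon setup --no-interview`."""
    if not allowed_dirs:
        # Without --allowed-dirs the agent gets unrestricted write access.
        raise ValueError("refusing to run --no-interview without --allowed-dirs")
    cmd = [
        "hpc-daemon", "setup",
        "--email", email,
        "--url", url,
        "--no-interview",
        "--allowed-dirs",
    ]
    # No shell is involved when argv is passed as a list, so expand "~" here.
    cmd += [os.path.expanduser(d) for d in allowed_dirs]
    if skill:
        cmd += ["--skill", skill]
    if cluster_name:
        cmd += ["--cluster-name", cluster_name]
    return cmd
```

The result can be passed to `subprocess.run(cmd, check=True)`. Because the one-time code is still required, the run is not fully unattended.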

## Running

```bash
# Start in foreground
hpc-daemon start

# Start in background (persists after logout)
nohup hpc-daemon start 2>&1 &

# Check status
hpc-daemon status

# View logs
tail -f ~/.hpc_daemon/<daemon_id>.log

# Stop
hpc-daemon stop

# List registered daemons
hpc-daemon list
```

If only one daemon profile exists, `--daemon-id` is auto-detected. With multiple profiles, specify it explicitly (e.g., `hpc-daemon start --daemon-id mila-login-1`).
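
The selection rule can be summarized in a few lines. This is a sketch of the rule as described, with a hypothetical function name and error messages, not the CLI's actual code.

```python
from typing import Optional, Sequence


def resolve_daemon_id(profiles: Sequence[str], requested: Optional[str] = None) -> str:
    """Pick the daemon profile to act on."""
    if requested is not None:
        # An explicit --daemon-id must match a registered profile.
        if requested not in profiles:
            raise LookupError(f"no profile named {requested!r}")
        return requested
    if len(profiles) == 1:
        # Exactly one profile: it is selected automatically.
        return profiles[0]
    if not profiles:
        raise LookupError("no profiles found; run setup first")
    raise LookupError("multiple profiles found; pass --daemon-id")
```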

To force local mode on a SLURM cluster (jobs run as background processes instead of `sbatch`):

```bash
LOCAL_MODE=1 hpc-daemon start
```
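
Combined with the `sbatch` auto-detection described under Execution Modes, the backend choice might look like the sketch below. The function name and the `"slurm"`/`"local"` labels are hypothetical, and treating only the value `1` as enabling local mode is an assumption; the daemon may accept other values.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def select_job_backend(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return "slurm" or "local" (labels are illustrative)."""
    # Assumption: LOCAL_MODE=1 is the only value that forces local mode.
    if env.get("LOCAL_MODE") == "1":
        return "local"
    # Otherwise use SLURM only when sbatch is found on PATH.
    return "slurm" if which("sbatch") else "local"
```

The `env` and `which` parameters exist only so the rule can be exercised without a real SLURM installation.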

## State

All configuration, job records, and callback delivery state are stored in
`~/.hpc_daemon/state.db` (SQLite). PID files and logs live in the same directory.

## Safety

The daemon enforces directory restrictions via bash function wrappers injected into PTY sessions. File-modifying commands (`rm`, `rmdir`, `mv`, `cp`, `mkdir`, `touch`, `tee`), directory navigation (`cd`, `pushd`, `popd`), and output redirections are validated against the allowed directory list configured during setup.
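
The daemon performs this validation in bash wrappers. The underlying path check can be illustrated in Python as follows; this is a sketch of the general technique (resolve the path, then test containment), and `is_path_allowed` is a hypothetical name rather than a function from `guardrails.py`.

```python
import os
from typing import Iterable


def is_path_allowed(path: str, allowed_dirs: Iterable[str]) -> bool:
    """Return True if path resolves to a location inside an allowed directory."""
    # Resolve "~", "..", and symlinks before comparing, so that a path like
    # "allowed/../elsewhere" cannot slip through.
    target = os.path.realpath(os.path.expanduser(path))
    for allowed in allowed_dirs:
        root = os.path.realpath(os.path.expanduser(allowed))
        # commonpath compares whole path components, so "/tmp/ws-other"
        # is not mistaken for a child of "/tmp/ws".
        if os.path.commonpath([target, root]) == root:
            return True
    return False
```

A plain string-prefix test would be simpler but would wrongly accept sibling directories that share a name prefix, which is why the sketch compares path components.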

Job ownership is also enforced: an agent can cancel only the jobs it submitted.

These guardrails operate at the shell level and are not a hard security boundary. They protect against accidental damage, not against a determined adversary.

> **Shared machines:** Your API key is stored in `~/.hpc_daemon/state.db`. The file and directory are restricted to your user account (`0600`/`0700`), so other users on the same login node cannot read it. However, if multiple people share the same Unix account, they all have access. Do not use the daemon on a shared account.
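
To confirm the permissions on a given machine, a short check along these lines can help. The helper name `is_private` is hypothetical; the paths to inspect are `~/.hpc_daemon` and `~/.hpc_daemon/state.db`.

```python
import os
import stat


def is_private(path: str) -> bool:
    """Return True if group and others have no permission bits on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # 0600 and 0700 both pass: only the owner bits may be set.
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

For example, `is_private(os.path.expanduser("~/.hpc_daemon/state.db"))` should be true after setup. The check says nothing about who else can log in to the owning account.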

## Project Structure

```
hpc_daemon/
├── core/
│   ├── cli.py            # CLI (setup, start, stop, status, list)
│   ├── config.py         # Runtime config, SLURM detection
│   ├── ws.py             # WebSocket connection and message routing
│   └── pty_session.py    # PTY shell session management
├── setup/
│   ├── wizard.py         # Setup wizard and daemon registration
│   └── prompts.py        # Interactive setup prompts
├── jobs/
│   ├── handlers.py       # Job submit/status handlers
│   ├── interfaces.py     # SlurmInterface, LocalJobInterface
│   ├── models.py         # JobRecord, JobState
│   └── monitor.py        # Background job polling and callbacks
├── api_client/
│   ├── auth.py           # OTP signin, API key creation
│   ├── daemon_registry.py # Daemon registration
│   └── http.py           # HTTP client utilities
├── skills/
│   ├── __init__.py       # Skill discovery and loading
│   └── mila-hpc.md       # Built-in Mila cluster skill
├── db.py                 # SQLite database (profiles, jobs)
├── guardrails.py         # Directory restriction enforcement
└── __main__.py           # Entry point
```
