Metadata-Version: 2.4
Name: sixtytwo-cli
Version: 0.3.1
Summary: Sixtytwo CLI: `sixtytwo rent` reserves reliability-backed GPUs; `sixtytwo` qualifies, monitors, and NCCL-benchmarks your own GPU clusters, with Slurm/SkyPilot integration.
Author: Sixtytwo, Inc.
License-Expression: LicenseRef-Sixtytwo-Commercial
Keywords: gpu,nccl,skypilot,slurm,cluster,benchmarking
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: System :: Distributed Computing
Classifier: Topic :: System :: Monitoring
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich<15,>=14.0
Requires-Dist: fastapi<1,>=0.115
Provides-Extra: server
Requires-Dist: fastapi<1,>=0.115; extra == "server"
Requires-Dist: uvicorn<1,>=0.34; extra == "server"
Requires-Dist: geoip2<6,>=4.8; extra == "server"
Provides-Extra: sentry
Requires-Dist: sentry-sdk[fastapi]<3,>=2.18; extra == "sentry"
Provides-Extra: gpu
Requires-Dist: nvidia-ml-py3>=7.352.0; extra == "gpu"
Requires-Dist: torch>=2.11; extra == "gpu"
Provides-Extra: skypilot
Requires-Dist: skypilot<0.12,>=0.9.0; extra == "skypilot"
Provides-Extra: skypilot-aws
Requires-Dist: skypilot[aws]<0.12,>=0.9.0; extra == "skypilot-aws"
Provides-Extra: skypilot-gcp
Requires-Dist: skypilot[gcp]<0.12,>=0.9.0; extra == "skypilot-gcp"
Provides-Extra: skypilot-lambda
Requires-Dist: skypilot[lambda]<0.12,>=0.9.0; extra == "skypilot-lambda"
Provides-Extra: skypilot-runpod
Requires-Dist: skypilot[runpod]<0.12,>=0.9.0; extra == "skypilot-runpod"
Dynamic: license-file

# Sixtytwo Platform Backend

Backend for the Sixtytwo rent platform.

It includes:

- CLI-first agent workflows: qualification, monitoring, trust registry
- Recovery orchestration and job optimization
- Rent-layer commerce surface (catalog / quote / reserve / checkout / settle / credits)
- FastAPI app that serves the rent storefront (static HTML in `../frontend-rent/`)
- Per-minute meter loop + Stripe billing + Google OAuth
- Prometheus `/metrics` exporter for Grafana

## Install

```bash
cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

To run against a config file at an explicit path, pass `--config` before the subcommand:

```bash
sixtytwo --config /path/to/sixtytwo.yaml init --cluster prod
```

## Main commands

```bash
sixtytwo init --cluster prod
sixtytwo setup --provider runpod --cluster prod-runpod --checkpoint-dir /workspace/checkpoints
sixtytwo doctor --json
sixtytwo test --quick --all
sixtytwo test --full gpu-01,gpu-02
sixtytwo console
sixtytwo launch --pre-check --recovery confirm python train.py
sixtytwo nodes
sixtytwo optimize train.sh
sixtytwo dashboard start  # boots uvicorn against the FastAPI app
```

## Grafana visualization

`sixtytwo metrics serve` exposes a Prometheus-format `/metrics` endpoint backed
by the local `TrustRegistry`. Teams use it to chart sixtytwo's
intelligence-layer signals (trust scores, fault counters, recovery downtime,
per-check status, fleet adverse rate) alongside DCGM in Grafana, without
sixtytwo needing its own UI.

```bash
sixtytwo metrics serve --host 0.0.0.0 --port 9620
```

Exposed metric families:

| metric | type | labels | what it answers |
|---|---|---|---|
| `sixtytwo_node_trust_score` | gauge | node_id, gpu_type, provider, status | which nodes you should trust |
| `sixtytwo_node_faults_total` | counter | node_id, gpu_type, provider | which nodes accumulate faults |
| `sixtytwo_node_status` | gauge | node_id, status | lifecycle state, carried in the `status` label (value is always 1) |
| `sixtytwo_check_status` | gauge | node_id, check, stage | latest qualification result (1/0.5/0/-1) |
| `sixtytwo_check_metric` | gauge | node_id, check, metric | numeric metrics from each check (TFLOPS, step_ms, ...) |
| `sixtytwo_fleet_*` | gauge | — | fleet-wide totals and adverse rate |
| `sixtytwo_recovery_events_total` | counter | status | recovery outcomes |
| `sixtytwo_recovery_downtime_seconds_*` | counter | — | cumulative recovery downtime and its sample count |

Wire it into Prometheus:

```yaml
scrape_configs:
  - job_name: sixtytwo
    static_configs:
      - targets: ["sixtytwo-host:9620"]
```

Then export the curated dashboard JSON and import it into Grafana:

```bash
sixtytwo metrics export-grafana --output sixtytwo-overview.json
# Grafana → Dashboards → Import → upload sixtytwo-overview.json
```

The dashboard correlates sixtytwo trust scores with DCGM XID counters when
both data sources are present, so you can see at a glance whether a
trust-score drop is being driven by hardware events.
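
For a quick check outside Grafana, the exposition text can also be parsed directly. The following is a minimal sketch, not part of sixtytwo: the sample payload, the `status` label values, the 0-to-1 score scale, and the helper names `parse_samples` and `low_trust_nodes` are all assumptions made for illustration. Only the metric and label names come from the table above.

```python
import re

# Hypothetical /metrics payload. The metric and label names follow the table
# above; the values, status strings, and 0..1 score scale are assumptions.
SAMPLE = """\
# TYPE sixtytwo_node_trust_score gauge
sixtytwo_node_trust_score{node_id="gpu-01",gpu_type="H100",provider="runpod",status="healthy"} 0.97
sixtytwo_node_trust_score{node_id="gpu-02",gpu_type="H100",provider="runpod",status="degraded"} 0.41
sixtytwo_node_faults_total{node_id="gpu-02",gpu_type="H100",provider="runpod"} 7
"""

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair: key="value", allowing backslash escapes inside the value.
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def low_trust_nodes(text, threshold):
    """Sorted node_ids whose trust score is strictly below the threshold."""
    return sorted(
        labels["node_id"]
        for name, labels, value in parse_samples(text)
        if name == "sixtytwo_node_trust_score" and value < threshold
    )


print(low_trust_nodes(SAMPLE, threshold=0.5))  # ['gpu-02']
```

To run this against a live exporter, replace `SAMPLE` with the body fetched from the `/metrics` endpoint, for example via `urllib.request.urlopen`, and choose a threshold that matches the score scale your deployment actually reports.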

## Deployment notes

- `recovery.live_execute = false` by default, so recovery plans are fully logged before you enable real scheduler actions.
- `SIXTYTWO_CONFIG` lets you run the CLI and platform from a fixed config path outside the working directory.
- `SIXTYTWO_HOST` and `SIXTYTWO_PORT` control the host and port that uvicorn binds to in production.
- `SIXTYTWO_FRONTEND_RENT_DIR` overrides where the rent storefront static files live (used by the container image, since installing into site-packages breaks the source-relative lookup).
- `sixtytwo setup` writes a provider profile and a bootstrap script under `.sixtytwo/bootstrap/`.
- `sixtytwo doctor --all-nodes` validates SSH reachability, `nvidia-smi`, `dcgmi`, topology capture, and checkpoint path access before the first live run.
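
The environment variables above could be resolved along the following lines. This is an illustrative sketch rather than sixtytwo's actual loader: the `RuntimeSettings` class, the `load_settings` helper, and every default value (`sixtytwo.yaml`, `127.0.0.1`, `8000`) are assumptions. Only the variable names come from the notes above.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RuntimeSettings:
    """Hypothetical container for the settings named in the notes above."""
    config_path: Path
    host: str
    port: int


def load_settings(env=None):
    """Resolve settings from an environment mapping (os.environ by default)."""
    env = os.environ if env is None else env

    port_raw = env.get("SIXTYTWO_PORT", "8000")  # default is an assumption
    try:
        port = int(port_raw)
    except ValueError:
        raise ValueError(f"SIXTYTWO_PORT must be an integer, got {port_raw!r}") from None
    if not 0 < port < 65536:
        raise ValueError(f"SIXTYTWO_PORT out of range: {port}")

    return RuntimeSettings(
        # Falling back to a file in the working directory is an assumption.
        config_path=Path(env.get("SIXTYTWO_CONFIG", "sixtytwo.yaml")),
        host=env.get("SIXTYTWO_HOST", "127.0.0.1"),  # default is an assumption
        port=port,
    )


settings = load_settings({"SIXTYTWO_HOST": "0.0.0.0", "SIXTYTWO_PORT": "9000"})
print(settings.host, settings.port, settings.config_path)
```

Passing the environment in as a plain mapping keeps the function easy to test without mutating `os.environ`.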

## Provider-ready flow

For a single real GPU box, whether rented from a provider such as RunPod, Lambda, or Vast or running as a colocated Ubuntu node:

```bash
sixtytwo setup \
  --provider runpod \
  --cluster runpod-prod \
  --checkpoint-dir /workspace/checkpoints

bash .sixtytwo/bootstrap/install-sixtytwo.sh
sixtytwo doctor --json
sixtytwo test --quick --all
```

For a small SSH-managed cluster:

```bash
sixtytwo setup \
  --provider ssh-cluster \
  --cluster prod-h100 \
  --nodes gpu-01,gpu-02,gpu-03 \
  --checkpoint-dir /mnt/checkpoints \
  --ssh-user ubuntu \
  --ssh-key-path ~/.ssh/id_ed25519

sixtytwo doctor --all-nodes
sixtytwo test --full --all
```

For topology-aware replacement, add rack / switch / rail metadata and a
standby pool to `sixtytwo.yaml`; recovery will prefer the standby that best
preserves the failed node's communication shape:

```yaml
recovery:
  standby_pool: [gpu-14-99, gpu-15-99]
topology:
  racks:
    rack-14:
      switch: tor-14
      power: [pdu-14a, pdu-14b]
      cooling: cdu-7
      nodes: [gpu-14-*]
  rails:
    rail-0: [gpu-14-01:gpu0, gpu-14-99:gpu0]
```
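
To make "best preserves the failed node's communication shape" concrete, here is one possible ranking, sketched under stated assumptions. It is not sixtytwo's actual selection logic: the scoring rule (same rack first, then number of shared rails) and the helper names `rack_of`, `rails_of`, and `pick_standby` are hypothetical. The config is the YAML above written as a Python dict so the sketch needs no YAML parser.

```python
from fnmatch import fnmatch

# The YAML above as plain Python data (power and cooling omitted for brevity).
CONFIG = {
    "recovery": {"standby_pool": ["gpu-14-99", "gpu-15-99"]},
    "topology": {
        "racks": {"rack-14": {"switch": "tor-14", "nodes": ["gpu-14-*"]}},
        "rails": {"rail-0": ["gpu-14-01:gpu0", "gpu-14-99:gpu0"]},
    },
}


def rack_of(node, topology):
    """Name of the first rack whose node patterns match, or None."""
    for rack, spec in topology.get("racks", {}).items():
        if any(fnmatch(node, pattern) for pattern in spec.get("nodes", [])):
            return rack
    return None


def rails_of(node, topology):
    """Set of rails that list any GPU of this node ("node:gpuN" entries)."""
    return {
        rail
        for rail, endpoints in topology.get("rails", {}).items()
        if any(endpoint.split(":", 1)[0] == node for endpoint in endpoints)
    }


def pick_standby(failed, config):
    """Pick the standby ranked highest by (same rack, shared rail count)."""
    topology = config["topology"]
    failed_rack = rack_of(failed, topology)
    failed_rails = rails_of(failed, topology)

    def score(candidate):
        same_rack = failed_rack is not None and rack_of(candidate, topology) == failed_rack
        shared_rails = len(failed_rails & rails_of(candidate, topology))
        return (int(same_rack), shared_rails)

    # Never propose the failed node itself as its own replacement.
    pool = [n for n in config["recovery"]["standby_pool"] if n != failed]
    return max(pool, key=score) if pool else None


print(pick_standby("gpu-14-01", CONFIG))  # gpu-14-99
```

With this config, `gpu-14-99` wins because it matches the `gpu-14-*` pattern of `rack-14` and shares `rail-0` with the failed node, while `gpu-15-99` matches neither.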
