Metadata-Version: 2.4
Name: vaultlayer
Version: 0.1.42
Summary: AI compute arbitrage CLI — move GPU training jobs between clouds automatically
Author: VaultLayer
License: MIT
Keywords: gpu,cloud,training,arbitrage,mlops,ai
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Distributed Computing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: click>=8.1.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: redis>=5.0.0
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: boto3>=1.35.0
Requires-Dist: anthropic>=0.40.0
Requires-Dist: huggingface-hub>=0.20.0
Provides-Extra: server
Requires-Dist: fastapi>=0.115.0; extra == "server"
Requires-Dist: uvicorn[standard]>=0.29.0; extra == "server"
Requires-Dist: supabase>=2.4.0; extra == "server"
Requires-Dist: resend>=2.0.0; extra == "server"
Requires-Dist: stripe>=9.0.0; extra == "server"
Requires-Dist: pydantic[email]>=2.0.0; extra == "server"
Requires-Dist: python-multipart>=0.0.9; extra == "server"

# VaultLayer

Run AI training jobs on managed GPU capacity with checkpointing, log
streaming, and provider failover.

```bash
pip install -U vaultlayer
vl init
vl run train.py
```

```
Job submitted
Training on vast_ai
Training output:
...
Training completed successfully.
```

---

## What It Does

VaultLayer sits between your training script and the cloud. It:

- **Checkpoints automatically** — syncs model weights + optimizer state to a zero-egress R2 Vault on every save
- **Detects interruptions** — intercepts AWS/GCP/Azure termination signals before your job dies
- **Migrates instantly** — provisions a replacement node on the cheapest available provider and resumes from last checkpoint
- **Tracks savings** — shows real-time cost vs what you would have paid on AWS On-Demand

No changes to your PyTorch or JAX code. No YAML configs. No PhD-level infra knowledge required.
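
The checkpoint-and-resume behavior listed above can be pictured with a minimal, standard-library sketch. It is not VaultLayer's implementation: the names `request_stop`, `install_handler`, `save_checkpoint`, `load_checkpoint`, and `train`, the JSON checkpoint file, and the use of `SIGTERM` as the interruption signal are all assumptions made for illustration.

```python
import json
import os
import signal
import tempfile
import threading
from pathlib import Path
from typing import Callable

# Hypothetical stop flag; set when an interruption notice arrives.
stop_requested = threading.Event()


def request_stop(signum=None, frame=None) -> None:
    """Signal-handler-compatible callback that asks the loop to stop."""
    stop_requested.set()


def install_handler() -> None:
    """Assumption: the provider's interruption reaches the process as SIGTERM."""
    signal.signal(signal.SIGTERM, request_stop)


def save_checkpoint(path: Path, state: dict) -> None:
    """Write atomically: temp file in the same directory, then rename."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_name, path)


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0}


def train(
    path: Path,
    total_steps: int,
    should_stop: Callable[[], bool] = stop_requested.is_set,
) -> dict:
    """Toy loop: resume from the checkpoint, save after every step."""
    state = load_checkpoint(path)
    while state["step"] < total_steps and not should_stop():
        state["step"] += 1  # stand-in for one real training step
        save_checkpoint(path, state)
    return state
```

Calling `install_handler()` once at startup and then `train(Path("ckpt.json"), total_steps=1000)` gives a loop that exits after the current step when `SIGTERM` arrives and continues from the saved step on the next launch. The atomic rename means a node that dies mid-write leaves the previous checkpoint intact.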

### Commands

```bash
# Training
vl run train.py
vl ps
vl logs <job-id> --follow
vl stop <job-id>

# Dataset storage (no S3 required)
vl sync ./data --dataset-id my-dataset
vl run --data r2://my-dataset train.py
vl datasets
```

---

## Supported Providers

| Provider | Type | Status |
|---|---|---|
| Vast.ai | Marketplace | Production-included |
| RunPod | Neocloud | Production-included |
| Lambda Labs | Neocloud | Production-included |
| AWS Spot | Hyperscaler | Production-included for validated failover paths |
| AWS On-Demand | Hyperscaler | Internal testing |
| GCP, CoreWeave, Crusoe, Nebius, Voltage Park, Hyperstack, Azure | Mixed | Pending validation |

Current provider status lives in
[docs/provider_test_matrix.md](docs/provider_test_matrix.md) and
[docs/provider_testing_matrix.md](docs/provider_testing_matrix.md).
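
Combined with the "cheapest available provider" behavior, the status column suggests a filter-then-minimum selection. The sketch below is a guess at that shape, not VaultLayer's broker logic: `Quote`, `pick_cheapest`, and the `exclude` parameter are hypothetical, and the prices are arbitrary placeholders rather than real quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    provider: str
    status: str               # status text as shown in the provider table
    usd_per_gpu_hour: float   # placeholder value, not a real price


# Placeholder prices chosen only to make the ordering visible.
QUOTES = [
    Quote("Vast.ai", "Production-included", 1.0),
    Quote("RunPod", "Production-included", 2.0),
    Quote("Lambda Labs", "Production-included", 3.0),
    Quote("AWS Spot", "Production-included for validated failover paths", 4.0),
    Quote("AWS On-Demand", "Internal testing", 0.5),
]


def pick_cheapest(quotes, exclude=()):
    """Return the lowest-priced production-included quote not in `exclude`."""
    eligible = [
        q for q in quotes
        if q.status.startswith("Production-included") and q.provider not in exclude
    ]
    if not eligible:
        raise LookupError("no production-included provider available")
    return min(eligible, key=lambda q: q.usd_per_gpu_hour)
```

Passing the provider that was just interrupted in `exclude` models failover: the next-cheapest eligible provider is chosen, and providers still in testing are never selected regardless of price.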

---

## Model Size Support

| Model Size | Method | Checkpoint Size | Status |
|---|---|---|---|
| 1B | QLoRA | small | Validated smoke path |
| 3B | QLoRA | small | Validated matrix path |
| 7B | QLoRA | medium | Validated matrix path |
| 72B | QLoRA | large | Routed to 96GB+ high-VRAM capacity |
| varies | Full fine-tune / multi-GPU | varies | Future work |
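
A back-of-envelope estimate hints at why the 72B row needs high-VRAM capacity. QLoRA keeps the base weights in 4 bits, which is 0.5 bytes per parameter, and the sketch computes only that footprint. Adapter weights, optimizer state, activations, and quantization constants add overhead that is not modeled here, and `weights_gb` is a hypothetical helper rather than VaultLayer's routing logic.

```python
BYTES_PER_PARAM_4BIT = 0.5  # 4 bits per parameter


def weights_gb(params_billion: float) -> float:
    """Approximate size of 4-bit base weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM_4BIT / 1e9


for size in (1, 3, 7, 72):
    print(f"{size}B -> ~{weights_gb(size):.1f} GB of 4-bit weights")
```

The 72B weights alone come to roughly 36 GB before any training overhead, which is consistent with routing that size to larger cards, although the exact threshold VaultLayer uses is not stated here.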

---

## Tech Stack

| Layer | Technology | Cost |
|---|---|---|
| Code + Docs | GitHub (this repo) | Free |
| CI/CD | GitHub Actions | Free (2k min/mo) |
| Vault / Storage | Cloudflare R2 | Free up to 10GB |
| Agent Runtime | Railway | Free $5/mo credit |
| Webhooks | Cloudflare Workers | Free 100k req/day |
| Agent Message Queue | Upstash Redis | Free 10k cmd/day |

---

## Repository Structure

```
vaultlayer/
├── README.md
├── docs/
│   ├── PRD.md              # Full product requirements
│   ├── ARCHITECTURE.md     # System design + agent topology
│   └── AGENTS.md           # Agent specs + build order
├── dashboard/
│   └── index.html          # Savings dashboard prototype
└── src/
    ├── cli/
    │   ├── main.py
    │   ├── run.py
    │   ├── checkpoint_template.py
    │   └── init.py
    ├── vaultlayer/
    │   └── _resume_hook.py
    ├── agents/
    │   ├── orchestration/
    │   ├── pricing/
    │   ├── watchdog/
    │   │   └── signals.py
    │   ├── vault/
    │   ├── broker/
    │   ├── finops/
    │   └── namespace/
    └── shared/
```

---

## SLA

VaultLayer tracks job completion, checkpoint persistence, and resume behavior.
Public SLA numbers are not committed during beta; see
[docs/SLA_SLI.md](docs/SLA_SLI.md) for definitions.

---

## Dataset Storage (No S3 Required)

VaultLayer's Neutral Zone (Cloudflare R2) is a first-class storage provider. Users without an AWS
account or any other cloud storage account can upload training data directly and train from it on any provider.

```bash
# Upload from your laptop / on-prem server
vl sync ./training-data --dataset-id my-dataset

# Train — data is mounted at /mnt/vaultlayer on every provisioned node
vl run --data r2://my-dataset train.py

# See what you're storing and the monthly cost
vl datasets
```
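
Inside a training script, the mount point can be resolved with a local fallback so the same `train.py` runs on a laptop and on a provisioned node. This is a minimal sketch: only the `/mnt/vaultlayer` path comes from this README, while `resolve_data_root`, `list_files`, the fallback directory, and the assumption that dataset files sit directly under the mount are illustrative.

```python
from pathlib import Path

VAULTLAYER_MOUNT = Path("/mnt/vaultlayer")  # mount point named in this README


def resolve_data_root(
    mount: Path = VAULTLAYER_MOUNT,
    fallback: Path = Path("./training-data"),
) -> Path:
    """Prefer the mounted dataset when present, else use a local copy."""
    if mount.is_dir():
        return mount
    if fallback.is_dir():
        return fallback
    raise FileNotFoundError(f"no dataset found at {mount} or {fallback}")


def list_files(root: Path) -> list[Path]:
    """Return every regular file under `root`, sorted for reproducibility."""
    return sorted(p for p in root.rglob("*") if p.is_file())
```

A script would call `list_files(resolve_data_root())` once at startup and hand the result to its data loader.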

**Pricing:**

| Action | Cost |
|--------|------|
| Upload (local → R2) | Free |
| Storage | $0.020 / GB / month (Cloudflare R2 base rate of $0.015 plus a 30% markup comes to $0.0195, rounded up to $0.020) |
| Read (R2 → training node) | $0.00 (R2 charges no egress fees) |
| S3 mirror (one-time) | AWS egress charge (~$0.09/GB, first 100 GB/month free) |
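
The storage rate can be checked with a short calculation. The sketch assumes Cloudflare's published R2 standard storage rate of $0.015 per GB-month as the base; the 30% markup and the $0.020 billed rate come from the table above, and the function names are illustrative.

```python
from decimal import Decimal

R2_BASE_USD_PER_GB_MONTH = Decimal("0.015")  # assumption: R2 standard storage rate
MARKUP = Decimal("0.30")                     # 30% markup from the pricing table
BILLED_USD_PER_GB_MONTH = Decimal("0.020")   # billed rate from the pricing table


def marked_up_rate(base: Decimal = R2_BASE_USD_PER_GB_MONTH,
                   markup: Decimal = MARKUP) -> Decimal:
    """Base rate plus markup, before rounding to the billed rate."""
    return base * (1 + markup)


def monthly_storage_cost(gb, rate: Decimal = BILLED_USD_PER_GB_MONTH) -> Decimal:
    """Monthly storage charge in USD, rounded to cents."""
    return (Decimal(str(gb)) * rate).quantize(Decimal("0.01"))


print(marked_up_rate())            # 0.01950
print(monthly_storage_cost(10))    # 0.20
print(monthly_storage_cost(500))   # 10.00
```

Using `Decimal` avoids binary floating-point rounding in the currency arithmetic. Whether storage inside a plan's quota is billed at this rate or included in the plan is not stated here.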

**Storage quotas by plan:**

| Plan | Storage limit |
|------|--------------|
| Free | 10 GB |
| Pro | 500 GB |
| Enterprise | Unlimited |

Datasets are soft-deleted with `vl datasets --delete <id>`: billing stops immediately, and
R2 objects are purged within 24 hours.

---

## Getting Started

```bash
pip install -U vaultlayer
vl init
vl run train.py
```

---

## License

MIT License. © 2026 VaultLayer
