From a single GPU VM to a fully serverless, auto-scaling inference pipeline on GCP.
Use arrow keys or swipe to navigate · Press F for fullscreen
A fine-tuned LLM pipeline that takes arbitrary text and returns a structured bias analysis: not just a binary flag, but actionable detail.
Compute Engine VM · HuggingFace Transformers · SSE
Instead of waiting 20–30s for a full response, we stream tokens as they generate, giving users a live experience that feels instant.
// token during generation data: {"t": "biased"} // final structured result data: {"result": {...BiasResult}}
HuggingFace model.generate() is a single blocking call. A second user waits in a queue behind the first.
VM runs 24/7 regardless of traffic. A demo with sporadic usage still pays for idle GPU time.
Every code change requires SSH → git pull → restart. No CI/CD, no rollback, no audit trail.
One VM = one instance. Hardware failure, OOM, or crash takes the whole service down with no auto-recovery.
Continuous batching · PagedAttention · OpenAI-compatible API
POST /v1/chat/completions: a drop-in replacement for OpenAI. Our demo app just points an openai.OpenAI client at the vLLM URL.
vLLM natively streams via SSE. Our demo /analyze/stream endpoint just proxies the token stream: same UX, zero extra code.
Split architecture · CI/CD · The bumps along the way
Every push to main or deploy triggers a smart, path-filtered build.
Changing only src/** skips the vLLM rebuild. Changing only Dockerfile.vllm skips the demo rebuild. Each service redeploys only when its own files change.
No long-lived keys in secrets. GitHub OIDC token is exchanged for a short-lived GCP access token via Workload Identity Federation. Bound to VectorInstitute/unbias-plus repo only.
The $1,100/month problem · Cold starts · Warm weights from GCS
With --min-instances=1, the GPU instance never shuts down. You pay 24/7 even with zero users.
When the service is at zero and the first request arrives, Cloud Run must start a fresh container. For vLLM, this means loading model weights into GPU memory before it can serve anything.
Cloud Run gen2 mounts GCS buckets as a filesystem via GCS FUSE. The vLLM container sees /hf-cache as a regular directory, no special download code needed.
Setting HF_HOME=/hf-cache means the HF downloader writes its cache structure to GCS. Subsequent loads find the model via the exact same cache key, fully transparent.
--startup-probe= httpGet.path=/health, failureThreshold=80, periodSeconds=15, timeoutSeconds=5 // 80 × 15s = 20 min window // covers cached load (~6min) // and first-ever HF download
| Deployment | Idle cost/mo | Peak cost/mo (3 instances) | Cold start |
|---|---|---|---|
| GCE VM · always on, L40 | ~$1,400 | ~$1,400 (no scaling) | 0s — always warm |
| Cloud Run GPU · min=1, L4 | ~$1,036 | ~$3,108 (3 × $1,036) | 0s — always warm |
| Cloud Run GPU · min=0 + GCS cache, L4 | ~$0 | ~$3,108 (3 × $1,036, only while serving) | ~7 min from GCS |
At constant 24/7 load on all 3 instances, cost would be 3 × $1,036 = ~$3,108/mo. With min=0, you only pay while requests are actively served — at a few hours of use per day, real cost is ~$50–150/mo.
vs g2-standard-8 VM ($623/mo always-on): Per single instance, Cloud Run costs more per hour (~$1.42/hr vs ~$0.85/hr) and breaks even at ~60% utilization; below that, scale-to-zero wins. But the comparison isn't equal: a VM is fixed at 1 GPU with no auto-scaling. Matching Cloud Run's 0→3 burst requires 3 VMs at $1,869/mo always-on, vs Cloud Run's $0 at idle.
Fast to set up. Added SSE streaming for live UX. Hit walls: no concurrency, $1,400/mo, manual deploys.
Continuous batching + PagedAttention solves concurrency. OpenAI-compatible API made the proxy trivial.
CPU demo app + GPU vLLM server. Each scales independently. CI/CD via GitHub Actions + WIF.
min-instances=0. GCS FUSE mount serves weights in ~7 min. ~10× cost reduction at idle.