Vector Institute · Engineering Deep Dive

UnBias Plus
Deployment Journey

From a single GPU VM to a fully serverless, auto-scaling inference pipeline on GCP.

Cloud Run GPU vLLM GCS Model Cache Scale to Zero GitHub Actions CI/CD

Use arrow keys or swipe to navigate · Press F for fullscreen

What is UnBias Plus?

A fine-tuned LLM pipeline that takes arbitrary text and returns a structured bias analysis: not just a binary flag, but actionable detail.

Detects biased segments with character-level offsets for UI highlighting
Classifies severity, bias type, and explains its reasoning
Suggests neutral replacement phrases per segment
Rewrites the full text as a neutrally-worded alternative
Model
Qwen3-8B-UnBias-Plus-SFT
Parameters
8B
Context window
8,192 tokens
Output format
Structured JSON → BiasResult
POST /analyze/stream
→ SSE token stream + final result
01

The MVP

Compute Engine VM · HuggingFace Transformers · SSE

MVP Architecture

User Browser POST /analyze/stream ← SSE token stream HTTPS COMPUTE ENGINE VM · us-central1-a L4 GPU · 24 GB VRAM FastAPI /analyze /analyze/stream /health input capped at 1,000 chars · model ctx: 8,192 tokens HuggingFace Transformers AutoModelForCausalLM + TextIteratorStreamer generate() in background thread → yield tokens Qwen3-8B-UnBias-Plus-SFT weights loaded into L4 VRAM at startup · BF16 ~16 GB
Deployment
  • Single always-on VM (L4, 24 GB VRAM)
  • Model loaded into VRAM at startup
  • FastAPI serves /analyze and /analyze/stream
  • Input capped at 1,000 chars — model context window is 8,192 tokens
  • Manual deploy via SSH
The SSE innovation

Instead of waiting 20–30s for a full response, we stream tokens as they generate, giving users a live experience that feels instant.

Streaming: {"t": "biased"}
SSE event format
// token during generation
data: {"t": "biased"}

// final structured result
data: {"result": {...BiasResult}}

MVP Drawbacks

No Concurrency

HuggingFace model.generate() is a single blocking call. A second user waits in a queue behind the first.

Request timeline:
User 1
generating...
User 2
⏳ waiting...
User 3
⏳ waiting even longer...
Always-On Cost

VM runs 24/7 regardless of traffic. A demo with sporadic usage still pays for idle GPU time.

~$623
per month
100%
idle billing
Manual Deployments

Every code change requires SSH → git pull → restart. No CI/CD, no rollback, no audit trail.

Single Point of Failure

One VM = one instance. Hardware failure, OOM, or crash takes the whole service down with no auto-recovery.

02

Enter vLLM

Continuous batching · PagedAttention · OpenAI-compatible API

Why vLLM?

The problem with HuggingFace generate()

Sequential batching: requests queue up
Req 1 done
Req 2
Req 3

Continuous batching (vLLM)

Continuous batching: GPU never idles
Req 1 done
Req 2 done
Req 3 done
GPU utilization
sequential R1 idle R2 idle R3 ~40%
vLLM R1 R2 R3 R1 R2 ~100%
→ same GPU time, 3× the requests served
PagedAttention
Naive: pre-allocated contiguous block
Seq 1 wasted
Seq 2 wasted
PagedAttention: VRAM physical layout
Seq 1 Seq 2 free
→ pages interleave freely, no wasted gaps
OpenAI-compatible API

POST /v1/chat/completions: a drop-in replacement for OpenAI. Our demo app just points an openai.OpenAI client at the vLLM URL.

Streaming

vLLM natively streams via SSE. Our demo /analyze/stream endpoint just proxies the token stream: same UX, zero extra code.

03

Cloud Run GPU

Split architecture · CI/CD · The bumps along the way

New Architecture: Two Services

GitHub push to main or deploy GitHub Actions detect changes build images push to GAR deploy services Artifact Registry us-east4 GCP · us-east4 Demo App Cloud Run CPU python:3.11-slim 2 vCPU · 2 GiB · scales 0-10 FastAPI proxy → VLLM_BASE_URL ~$5-20/mo idle vLLM Server Cloud Run GPU NVIDIA L4 · 8 vCPU · 32 GiB vllm serve Qwen3-8B continuous batching · concurrency=4 HF_HOME=/hf-cache scales 0-3 instances VLLM_BASE_URL /v1/chat/completions GCS Bucket unbias-plus-model-cache Qwen3-8B weights ~16 GB mounted at /hf-cache (GCS FUSE) Browser unbias-plus-demo
Demo App
CPU only · scales to zero
No GPU cost when idle
vLLM Server
GPU · NVIDIA L4
Independent scaling
GCS Cache
Persistent model weights
Warm starts in ~7 min

GitHub Actions CI/CD Pipeline

Every push to main or deploy triggers a smart, path-filtered build.

git push main / deploy or dispatch Detect Changes paths-filter demo: src/**, Dockerfile vllm: Dockerfile.vllm Build Demo python:3.11-slim push to GAR · GHA cache Build vLLM vllm/vllm-openai push to GAR · GHA cache Deploy vLLM Cloud Run GPU us-east4 · L4 · startup probe if success or skipped Deploy Demo Cloud Run CPU fetch vLLM URL inject VLLM_BASE_URL health check → summary ✓ Live both services Deploy Demo Cloud Run CPU health check
Smart path filtering

Changing only src/** skips the vLLM rebuild. Changing only Dockerfile.vllm skips the demo rebuild. Each service redeploys only when its own files change.

Keyless auth via WIF

No long-lived keys in secrets. GitHub OIDC token is exchanged for a short-lived GCP access token via Workload Identity Federation. Bound to VectorInstitute/unbias-plus repo only.

04

Scale to Zero
+ GCS Model Cache

The $1,100/month problem · Cold starts · Warm weights from GCS

The Cost Problem: min-instances=1

With --min-instances=1, the GPU instance never shuts down. You pay 24/7 even with zero users.

Monthly cost breakdown (1 always-on instance)
NVIDIA L4 GPU~$490
8 vCPU (--no-cpu-throttling)~$378
32 GiB memory~$168
Total (1 instance) ~$1,036/mo
Up to 3 instances under load = ~$3,400/mo
For a research demo with sporadic traffic? Way too much.

Scale to Zero: How It Works

ZERO $0.00/hr no instances COLD START ~7 min loading model from GCS WARM serving traffic fast responses SCALE OUT 2-3 instances high traffic first req ready high load load drops 15 min no traffic Billed only when instances are running

What is a "cold start"?

When the service is at zero and the first request arrives, Cloud Run must start a fresh container. For vLLM, this means loading model weights into GPU memory before it can serve anything.

Cold start timeline comparison
Without GCS cache (download from HuggingFace)
⬇ HF download ~10 min
GPU load
~12-15 min
With GCS cache (load from mounted bucket)
mount
GCS load ~4.5 min
compile
~7 min ✓
compile cache persisted via VLLM_CACHE_ROOT → saves ~40s on repeat cold starts
After the first ever cold start, all subsequent starts are fast: the model is already in GCS. The first deployment is the only slow one.

GCS Model Cache: How It Works

First cold start (one time only) vLLM Container HF_HOME=/hf-cache /hf-cache → GCS FUSE model not in cache yet fetch HuggingFace Hub API Qwen3-8B ~16 GB write via GCS FUSE GCS Bucket: unbias-plus-model-cache hub/models--vector-institute--Qwen3-8B .../ *.safetensors · config.json · tokenizer.json All subsequent cold starts vLLM Container HF_HOME=/hf-cache reads directly from GCS ↗ ~7 min load (267s weights + 40s compile) --add-volume= name=hf-cache, type=cloud-storage, bucket=unbias-plus-model-cache
GCS FUSE mount

Cloud Run gen2 mounts GCS buckets as a filesystem via GCS FUSE. The vLLM container sees /hf-cache as a regular directory, no special download code needed.

HuggingFace cache compatibility

Setting HF_HOME=/hf-cache means the HF downloader writes its cache structure to GCS. Subsequent loads find the model via the exact same cache key, fully transparent.

Startup probe config
--startup-probe=
  httpGet.path=/health,
  failureThreshold=80,
  periodSeconds=15,
  timeoutSeconds=5
// 80 × 15s = 20 min window
// covers cached load (~6min)
// and first-ever HF download

Cost Comparison: The Full Picture

Deployment Idle cost/mo Peak cost/mo (3 instances) Cold start
GCE VM · always on, L40 ~$1,400 ~$1,400 (no scaling) 0s — always warm
Cloud Run GPU · min=1, L4 ~$1,036 ~$3,108 (3 × $1,036) 0s — always warm
Cloud Run GPU · min=0 + GCS cache, L4 ~$0 ~$3,108 (3 × $1,036, only while serving) ~7 min from GCS
Where does $1,036/mo come from?
8 vCPU (always-on rate) ~$378/mo
32 GiB memory ~$168/mo
1× NVIDIA L4 GPU ~$490/mo
1 instance running 24/7 ~$1,036/mo

At constant 24/7 load on all 3 instances, cost would be 3 × $1,036 = ~$3,108/mo. With min=0, you only pay while requests are actively served — at a few hours of use per day, real cost is ~$50–150/mo.

vs g2-standard-8 VM ($623/mo always-on): Per single instance, Cloud Run costs more per hour (~$1.42/hr vs ~$0.85/hr) and breaks even at ~60% utilization; below that, scale-to-zero wins. But the comparison isn't equal: a VM is fixed at 1 GPU with no auto-scaling. Matching Cloud Run's 0→3 burst requires 3 VMs at $1,869/mo always-on, vs Cloud Run's $0 at idle.

Final Architecture

CI/CD PIPELINE GitHub push → main/deploy WIF keyless auth GH Actions path-filter build images docker buildx GHA cache Artifact Registry us-east4 demo:sha vllm:sha gcloud deploy deploy-vllm deploy-demo startup probe health check GCP · us-east4 Browser /analyze/stream SSE ← tokens Demo App Cloud Run CPU 2 vCPU · 2 GiB min=0 · max=10 FastAPI + demo UI scales to zero ✓ OpenAI client VLLM_BASE_URL vLLM Server Cloud Run GPU NVIDIA L4 · 8 vCPU · 32 GiB vllm serve Qwen3-8B continuous batching · concur=4 HF_HOME=/hf-cache --no-gpu-zonal-redundancy min=0 · max=3 ✓ GCS Model Cache unbias-plus-model-cache Qwen3-8B weights ~16 GB GCS FUSE · Compute SA objectAdmin mount /hf-cache

Summary: What We Built

MVP: GCE VM (L40)

Fast to set up. Added SSE streaming for live UX. Hit walls: no concurrency, $1,400/mo, manual deploys.

Switched to vLLM

Continuous batching + PagedAttention solves concurrency. OpenAI-compatible API made the proxy trivial.

Split into two Cloud Run services

CPU demo app + GPU vLLM server. Each scales independently. CI/CD via GitHub Actions + WIF.

Scale to zero + GCS model cache

min-instances=0. GCS FUSE mount serves weights in ~7 min. ~10× cost reduction at idle.

Key Principles

  • Separate concerns: inference and serving are independent services
  • Cost follows usage: pay for GPU only when someone is actually using it
  • Warm starts, not always-on: cache model weights on GCS, not the compute
  • UX first: SSE streaming makes a 20s inference feel interactive
  • Automate everything: every push deploys, path filtering keeps it fast
Still to do
No auth or rate limiting — GPU quota (3 instances) is the natural circuit breaker
No quantization yet — pre-quantizing to FP8/AWQ would halve GCS load time and VRAM usage
~$0
idle cost
~7 min
cold start
concurrency (vLLM)
CI/CD
on every push