What's New in Sandcastle

Every version ships something you'll actually use. No roadmap fluff. No "coming soon." Shipped code or nothing.

v0.32.2 Your Sandbox, Your Silicon (Latest) May 19, 2026

Anthropic shipped a managed-agents update on May 19. Most teams will read the blog post and move on. We extracted everything. Self-hosted sandboxes that keep your org key inside your boundary. A Memory MCP server that fills the gap Anthropic left when they declared memory_stores incompatible with self-hosted runs. MCP tunnels with WIF token exchange so your private Jira can be reached by a managed agent without static secrets. Live work-queue telemetry in the dashboard. Three production case-study blueprints. Five sandbox cookbooks. 150 new tests, validated across six iterations. Zero regressions. You wanted control over where your model runs. Here it is.

Headline

Self-hosted sandboxes - your code, your silicon

Five providers in the runtime enum: cloudflare, daytona, modal, vercel, docker. Each gets a cookbook under deploy/cookbooks/ with a copy-paste-ready spawn script and image. Your org key never leaves your boundary - the worker refuses to start if ANTHROPIC_API_KEY is present in the sandbox env. Only sk-ant-oat01- environment-scoped keys get past the gate. A foot-gun that would have leaked your org budget now raises before the first network call.
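The start-up gate described above could look roughly like this. A minimal sketch, not Sandcastle's actual code: the function name check_sandbox_env, the SandboxKeyError class, and the SANDBOX_ENV_KEY variable are made up for illustration. Only ANTHROPIC_API_KEY and the sk-ant-oat01- prefix come from the release notes.

```python
from typing import Mapping

ENV_SCOPED_PREFIX = "sk-ant-oat01-"  # prefix named in the release notes


class SandboxKeyError(RuntimeError):
    """Raised before any network call when the sandbox env looks unsafe."""


def check_sandbox_env(env: Mapping[str, str], key_var: str = "SANDBOX_ENV_KEY") -> str:
    # Assumption: the environment-scoped key lives in its own variable.
    # "SANDBOX_ENV_KEY" is a hypothetical name, not a documented one.
    if "ANTHROPIC_API_KEY" in env:
        raise SandboxKeyError("ANTHROPIC_API_KEY must not be present in the sandbox env")
    key = env.get(key_var, "")
    if not key.startswith(ENV_SCOPED_PREFIX):
        raise SandboxKeyError(f"{key_var} must hold an environment-scoped key")
    return key
```

Calling it with os.environ at the top of the worker entry point keeps the failure ahead of the first network call.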

New

Memory MCP server

Anthropic explicitly documented that memory_stores cannot be combined with self-hosted sandboxes. We filled the gap. sandcastle.engine.memory_mcp_server wraps mem0 + persistent Qdrant + Anthropic Haiku for memory decisions. Four tools, two resources, one prompt. Helm chart + docker-compose. NetworkPolicy egress to Cloudflare CIDR only.

New

MCP tunnels with WIF auth

Gated beta, header mcp-client-2025-11-20. Your internal Jira / Snowflake / Confluence is reachable from a managed agent through a cloudflared tunnel authenticated with WIF tokens. No static secrets. 60s skew-cached token exchange. Manual cert mode also supported for air-gapped setups.

New

Live work-queue dashboard

WorkQueuePanel on the run detail page, SSE-driven. Depth, sparkline (Recharts), pill (green < 5, amber 5-50, red > 50), aria-live="polite", exponential backoff up to 30 s. You see the backlog before the pager fires.
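The pill thresholds and the reconnect backoff reduce to two small functions. The real panel is dashboard code; Python is used here only to illustrate the logic. The function names, the 1 s base delay, and the doubling factor are assumptions - the release notes state only the thresholds and the 30 s cap.

```python
def pill_color(depth: int) -> str:
    """Map queue depth to the pill colour: green < 5, amber 5-50, red > 50."""
    if depth < 5:
        return "green"
    if depth <= 50:
        return "amber"
    return "red"


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff for stream reconnects, capped at `cap` seconds.

    Assumption: delays double from a 1 s base. Only the 30 s cap is
    stated in the release notes.
    """
    return min(cap, base * (2 ** attempt))
```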

New

Webhook-driven workers

New session.status_run_started event in SUPPORTED_EVENTS. A worker can now run as a webhook handler instead of a long-poll loop - cutting RAM, latency, and your AWS bill. HMAC verification accepts both the sha256= prefixed and bare digest forms.
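Handling both signature forms takes only a few lines with the standard library. A sketch under assumptions: the function name verify_signature and the hex encoding of the digest are illustrative, not taken from Sandcastle's source.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Accept either 'sha256=<hex>' or a bare '<hex>' digest."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = header_value.strip()
    if received.startswith("sha256="):
        received = received[len("sha256="):]
    # compare_digest runs in constant time, so a mismatch does not leak
    # how many leading characters were correct.
    return hmac.compare_digest(expected.encode("utf-8"), received.lower().encode("utf-8"))
```

Verify against the raw request body, before any JSON parsing, since re-serialized JSON rarely matches the signed bytes.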

Templates

Three production case studies

Not tutorials. Blueprints. Amplitude designer (multiagent + accessibility specialist + computer-use + Cloudflare). Clay GTM (project_manager + researcher / writer / qualifier + Daytona). Rogo analyst on private data (Vercel + WIF tunnel + risk_level high + mandatory approval + eval gate).

Quality

Tested six different ways

150 new tests. Sequential re-run, file isolation, randomized + pollution cluster, pytest-repeat 10x stress, module + CLI smoke, consolidated verdict. Zero flakes. Zero new regression categories. Documented in the PR.

Milestone

By the numbers

150 - New v0.32.2 tests
5 - Sandbox cookbooks
6 - Validation iterations
0 - Regressions

v0.32 Claude Agents Deep Integration May 16, 2026

You shipped the agent. The client wants the integration. The auditor wants the trail. The user wants you to ask the next question without restarting the whole workflow. v0.32 is the answer to every one of those. Sandcastle now exposes every Anthropic Managed Agents primitive shipped under the managed-agents-2026-04-01 beta umbrella, plus the things Anthropic doesn't ship: cryptographically verifiable trajectory replay, a Skills publisher that turns workflows into uploadable Claude Skills, and an Agent SDK runtime for teams that want in-process execution. Two weeks of work, 169 new tests, one release.

Headline

Every Anthropic primitive surfaced

Memory Stores, Multiagent coordinator, Outcomes API, Webhooks. Anthropic shipped all four under the same beta header in April and May 2026. Sandcastle now wires each one into the YAML and the dashboard, so a single workflow can attach versioned memory, spawn 20 parallel specialist agents, define outcomes the eval pipeline reads automatically, and emit lifecycle events to a webhook endpoint.

Add three lines of YAML, get a multi-agent system that meets your eval gate.

New

Skills Publisher

sandcastle publish-skills --upload converts every workflow into a tar.gz SKILL.md package with strict frontmatter and uploads to /v1/skills. Your workflows are now callable from every Anthropic Skills-aware client. Pair with v0.31 MCP-first publishing and Sandcastle covers both major distribution channels.

New

Trajectory Replay

New type: trajectory-replay step computes SHA-256 over a recorded tool-call sequence and diffs against a candidate run. Sandcastle's audit trail is a hash chain, so the replay is cryptographically verifiable - a property neither LangSmith nor Braintrust ships. EU AI Act auditors love this.
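The idea behind the replay check can be sketched in a few lines: hash a canonical encoding of the recorded tool calls, then locate the first call where a candidate run diverges. The function names and the canonical-JSON encoding are assumptions for illustration; the release notes do not describe how the step serializes a trajectory.

```python
import hashlib
import json
from typing import Any, Optional


def trajectory_digest(calls: list[dict[str, Any]]) -> str:
    """SHA-256 over a canonical JSON encoding of the tool-call sequence."""
    # sort_keys + fixed separators make the encoding independent of dict order.
    canonical = json.dumps(calls, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def first_divergence(recorded: list[dict[str, Any]],
                     candidate: list[dict[str, Any]]) -> Optional[int]:
    """Index of the first differing call, or None if the runs match."""
    for i, (a, b) in enumerate(zip(recorded, candidate)):
        if a != b:
            return i
    if len(recorded) != len(candidate):
        # One run is a strict prefix of the other.
        return min(len(recorded), len(candidate))
    return None
```

Matching digests mean the two runs made identical calls in identical order. When they differ, first_divergence points at where to start reading.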

New

Agent SDK runtime

runtime: "agent-sdk" swaps Managed Agents for the in-process Claude Agent SDK. Built for EU-sovereignty, air-gapped, and regulated teams that can't send session state to Anthropic's hosted infra. Lazy-imports the SDK and raises a graceful error when it is not installed.
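A lazy import with a readable failure is a common pattern, and a version of it might look like this. The function name load_runtime and the error class are hypothetical, and the default module name claude_agent_sdk is an assumption about how the SDK is packaged.

```python
import importlib
from types import ModuleType


class RuntimeNotInstalled(RuntimeError):
    """Raised when an optional runtime dependency is missing."""


def load_runtime(module_name: str = "claude_agent_sdk") -> ModuleType:
    # Import only when the runtime is actually selected, so installs
    # without the SDK keep working for every other runtime.
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeNotInstalled(
            f"runtime 'agent-sdk' needs the '{module_name}' package; install it and retry"
        ) from exc
```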

Polish

Live Agent Reasoning panel

Run detail page now subscribes to an SSE event stream and shows agent.thinking, agent.tool_use, and agent.message events as they fire. Thread-grouped, collapsible. The dashboard finally shows what the agent is actually doing.

Fix

Five wire fixes (table stakes)

tools_enabled finally reaches the API (was parsed but ignored). temperature, max_tokens, thinking_budget on ManagedAgentConfig. stream config field actually used. Pricing table mapping every Claude 4.x model. Fallback chains of up to 5 templates.

New

Tool Search + examples convention

Mark tools with defer_loading: true and 1-5 invocation examples per tool. Anthropic's own measurements: tool-selection accuracy 49% to 74%, usable context up 85%, parameter accuracy 72% to 90%. New docs/tool-examples-convention.md.

New

Computer Use + MCP Elicitation

New type: computer-use step built on the computer_20251124 beta, with an 8-item safety pre-flight (including a prompt-injection guard, page-load deadline, and screenshot bounds). Plus, the 6th MCP primitive (Elicitation, spec rev 2025-11-25) lets workflows ask the user for a missing input mid-run.

Milestone

By the numbers

15,176 - Tests passing
169 - New v0.32 tests
9 - New modules
24 - Step types

v0.31 Compliance & Connections May 14, 2026

Eighty days. That is how long you have until the bulk of the EU AI Act's obligations start to apply. After 2 August 2026, "we'll figure it out later" stops being an acceptable answer. v0.31 is the answer you can ship today: a dedicated EU AI Act landing page your CISO can actually read, ten ready-made compliance workflow templates mapped to specific Articles and Annex IV, eval gates that block a regressing model from getting promoted, and MCP-first publishing that turns every workflow you've already built into a first-class tool inside Claude Desktop, Cursor, Windsurf, or any MCP client. The work was quiet. The result is loud.

Headline

Built for August 2, 2026

The EU AI Act deadline is on the calendar whether you like it or not. v0.31 ships a dedicated landing page mapping every Sandcastle control to a specific Article (9, 11, 12, 14, 25, 49, 50, 73) and Annex IV, plus ten production-ready compliance workflow templates: DPIA, vendor risk assessment, incident reporting, bias audit, model card generator, AI inventory, risk register, GDPR data-subject request, human oversight log, and Annex IV transparency report.

Bring the templates. Customize the prompts. Hand the auditor the audit trail. Done.

New

MCP-First Publishing

One command: sandcastle publish-mcp researcher. Paste the snippet into Claude Desktop, Cursor, or Windsurf. Your workflow is now a tool the AI client can call. Every workflow you already built becomes reachable from every MCP client in the ecosystem - the full five-primitive spec (tools, resources, prompts, sampling, roots) and a .well-known/mcp.json discovery manifest included.

New

Eval Gates That Block Bad Models

Define a golden dataset of inputs and expected scores. Set a minimum threshold. Promote with ?strict=true and Sandcastle runs the gate before flipping the production version. The new model regresses by even a hair? Promotion fails closed. Eval-driven development is no longer a future plan.
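A fail-closed gate can be sketched as a single function. This is an illustration under assumptions, not Sandcastle's gate: the name gate_passes is hypothetical, and the pass rule here (every case must reach both its expected score and the minimum threshold) is one plausible reading of the description.

```python
from typing import Callable, Sequence


def gate_passes(
    dataset: Sequence[tuple[str, float]],
    score_fn: Callable[[str], float],
    min_threshold: float,
) -> bool:
    """Return True only if every golden case scores well enough.

    Fails closed: an empty dataset or a scoring error blocks promotion.
    """
    if not dataset:
        return False
    try:
        for prompt, expected in dataset:
            score = score_fn(prompt)
            if score < min_threshold or score < expected:
                return False
    except Exception:
        # Any evaluator failure counts as a failed gate, never a pass.
        return False
    return True
```

The point of failing closed is that a broken evaluator can never promote a model by accident.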

Polish

A Dashboard That Doesn't Crash

Overview page split into twenty focused components, each wrapped in its own error boundary. One slow API endpoint no longer wipes the page. Every section that loads stays loaded. The 2,261-line monolith is now a 241-line orchestrator.

Fix

Cost Estimates That Tell the Truth

The token estimator used to assume 1,000 tokens per run for every workflow on Earth. Now it reads each workflow's actual run history, blends the price per million, and gives each one its own number. Evolution makes better keep/discard calls. AutoPilot stops lying about the cost of an experiment.
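One way to read "blends the price per million": weight the input and output prices by the token mix a workflow actually used, then apply that rate to its average run size. A sketch under that assumption, with a hypothetical RunRecord type and function name.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunRecord:
    input_tokens: int
    output_tokens: int


def estimate_run_cost(
    history: Sequence[RunRecord],
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimated dollars per run, based on this workflow's own history."""
    if not history:
        raise ValueError("no run history to estimate from")
    total_in = sum(r.input_tokens for r in history)
    total_out = sum(r.output_tokens for r in history)
    total = total_in + total_out
    if total == 0:
        return 0.0
    # Blended $/M tokens, weighted by the observed input/output mix.
    blended = (total_in * input_price_per_m + total_out * output_price_per_m) / total
    avg_tokens_per_run = total / len(history)
    return avg_tokens_per_run * blended / 1_000_000
```

For example, two runs of 1,000 + 1,000 and 3,000 + 1,000 tokens at $3 and $15 per million blend to $7 per million over an average of 3,000 tokens, or about $0.021 per run.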

Security

Codex Audit Rounds 9 + 10

Five HIGH findings closed: tenant-scoped cache and Mem0 scope, XSS-safe report config, SSRF-guarded WeasyPrint URL fetching, A2A budget enforcement, version-list 404 correctness. Plus one MEDIUM: cross-tenant prompt injection in the workflow generator. tenant_id threaded through every executor path so neighbouring tenants cannot see each other's cache hits, memory reads, or prompt context.

Polish

Dashboard accessibility pass

aria-expanded, aria-controls, aria-hidden, focus-visible rings, sr-only labels on every search and filter input, modal widths that scale up on tablet, disabled opacity bumped to WCAG-AA contrast, and localStorage wrapped in try/catch so private browsing degrades gracefully. Eight separate polish items in one release.

New

Per-workflow stats API

GET /api/workflows/stats returns run count, success rate, average cost, and last-run status for every workflow visible to your tenant. Cached in-process for 30 seconds. The dashboard grid now shows real metrics instead of placeholder data.

Milestone

By the numbers

15,014 - Tests passing
10 - Compliance YAMLs
5 - MCP primitives
80 - Days to deadline

v0.30 Agents Unleashed April 10, 2026

v0.30 is the biggest Sandcastle release yet. Claude Managed Agents are now workflow steps. Pick from 15 built-in agent templates - or describe what you need and AI designs the agent for you. Combine cloud agents, local models, OCR, and PDF reports in one YAML file. Plus: GLM-OCR with 94.6% accuracy runs locally for free, and every run now tells you exactly where your tokens go.

Headline

Claude Managed Agents as Workflow Steps

Delegate to Anthropic's cloud agent runtime. Agents get bash, files, web search, code execution. Use them alongside Mistral, oMLX, or any other provider. One YAML, best tool per step.

New

15 Built-in Agent Templates

researcher, coder, analyst, writer, reviewer, scraper, tester, devops, translator, designer, sql_expert, seo_specialist, legal_analyst, financial_analyst, project_manager. One line: agent_template: researcher

New

Describe Mode

Don't pick a template. Describe what you need: "Data analyst who creates charts from CSVs." AI designs the agent automatically - system prompt, tools, packages, network.

New

Agent Chaining

output_format (json/files/markdown), shared_files between agents, fallback_template on failure. Agents collaborate in multi-step pipelines.

New

GLM-OCR Engine

0.9B model, 94.6% accuracy, #1 on OmniDocBench. Runs locally via Ollama. Free. Handles tables, handwriting, math, code blocks. 4th OCR engine alongside pymupdf and Chandra.

Perf

Token Waste Detection

Every run generates a token efficiency report. Detects duplicate file reads, duplicate prompts, oversized context. output_max_tokens limits what passes between steps. Cache-aware execution.

Milestone

By the numbers

21 - Step types
15 - Agent templates
4 - OCR engines
325 - New tests

v0.28.1 Every Step Knows Its Story April 1, 2026

v0.28.1 is about making what you've built understandable. Steps explain why they exist. Errors tell you what went wrong. The CLI shows you what's happening in real-time. And LLM steps can now pull in fresh context from memory, web, or files before they run.

Headline

Self-Describing Workflows

Every step can declare its responsibility, source_hint, owner, and added_date. Three new CLI commands: sandcastle describe, sandcastle lint, sandcastle owners. Audit trail enriched. Dashboard shows metadata on steps and workflow cards.

New

context_query - Dynamic Context for LLM Steps

Steps fetch relevant context before execution. Four sources: Agent Memory (semantic search), web (Tavily), local files (keyword search), custom (shell command). Available as {context} in prompts.

UX

sandcastle run --stream

Live terminal output with colors: green for done, yellow for running, red for failed. Shows step timing, cost, and responsibility. Ctrl+C detaches the stream and leaves the run going in the background.

UX

Real-Time Step Progress

Dashboard shows live checkmarks via SSE as steps complete. No more refreshing to see progress.

UX

Detailed Errors + PDF Download

Full error messages in red callout boxes (not just "failed"). Download PDF button for report steps.

New

Batch Cancel + Semantic Search

Cancel running batches mid-flight. Template search with "Did you mean..." fuzzy suggestions.

Quality

540 New Tests

280 template e2e tests (top 20 templates). 260 provider integration tests (all 7 providers). Full validation coverage.

v0.28 Batch, Monitor, Report. March 29, 2026

v0.28 is where Sandcastle stops being a tool and becomes a platform. Batch-process thousands of items. Monitor your schedules. Generate beautiful PDF reports with charts. Parse scanned documents with AI. Auto-update without downtime. And now - run everything locally on Apple Silicon with oMLX.

Headline

Auto-Update System

CLI: sandcastle update. Dashboard: one-click with rollback. Stable/beta/pinned channels. Blackout windows. Enterprise approval policies. Never manually pip install again.

New

Batch Run

Upload 1000 items, Sandcastle fans out into parallel runs. Progress tracking, per-item results, aggregate cost.

POST /workflows/{name}/batch

New

PDF Report Engine

New step type: type: report. Professional-grade PDFs with cover page, table of contents, charts, KPI boxes, callouts. 3 themes. WeasyPrint under the hood.

pip install sandcastle-ai[report]

New

Schedule Monitor

Dashboard page showing all cron schedules: last run, next run countdown, success rate, pause/resume. No more guessing what failed overnight.

New

Chandra OCR

Parse scanned documents, handwriting, tables from images. 90+ languages including Czech. ocr_engine: chandra in parse step.

pip install sandcastle-ai[ocr]

New

oMLX - Local AI on Apple Silicon

7th provider. Run Llama 4, Mistral, Gemma, Qwen locally. Zero cloud costs. Zero data leaving your machine. OpenAI-compatible.

New

Workflow Version Diff

Every edit versioned. Compare any two versions with YAML diff highlighting. See exactly what changed.

UX

Activity Feed + Run Compare

Real-time activity stream on dashboard. Select 2 runs and compare side-by-side.

Milestone

By the numbers

7 - AI providers
20 - Step types
125 - Templates
44 - New oMLX tests

New

Deep Research Pipeline

OpenVLAW-inspired 6-step template: parallel research, judge, verify, typeset, audit. Built-in.

v0.2666b1 Hardened March 27, 2026

v0.2666 was the vision. v0.2666b1 is the proof. We shipped universal advisor, EU data residency, smart failover. Then we spent a week trying to break it. 1,875 new tests. 90 bug fixes. 10 security patches. Every edge case we found - fixed.

AI

AI Assistant got smarter

Generates workflow, finds errors, fixes them automatically. Suggests matching templates. Learns your style from existing workflows. Knows which tools you have configured.

UX

First 60 seconds redesigned

Auto-detects Ollama and API keys on first load. One-click demo workflow runs in 10 seconds. Onboarding wizard: 2 steps, not 5.

New

Failover Dashboard

See when and why providers failed over. Per-workflow cost recommendations with savings amounts.

New

Agent Memory Page

Search, browse, delete agent memories. Importance scores and decay visualization.

Security

Security hardened

Admin auth on evolution endpoints, template injection prevention, EU residency enforced on every LLM call, rate limiting, credential logging sanitized.

Performance

Faster everywhere

GZip compression, 2-phase dashboard loading, cache headers, database indexes, search debounce.

Milestone

By the numbers

1,875 - New tests
90 - Bug fixes
10 - Security patches
14,361 - Total tests

v0.2666 Your AI, Your Rules, Your Borders March 26, 2026

The release where Sandcastle stopped being an orchestrator and became a platform. You used to choose a provider and hope for the best. Now you describe what you want and the system figures out the rest - which model, which provider, what it costs, where the data lives, and what happens when something breaks.

Headline

Universal Advisor - one setting changes everything

Set SANDCASTLE_ADVISOR_PROVIDER=mistral and every AI feature - generation, evolution, evaluation, quality scoring - switches to Mistral. Tomorrow you want Claude? Change one line. Ollama on your laptop? Same line. Six providers, one setting, zero code changes.

Compliance

EU Data Residency

Set DATA_RESIDENCY=eu. That's it. All AI processing routes through EU providers or stays local. Try sending data to a US provider with EU mode on - Sandcastle won't let you. Not a promise. Enforcement.

Reliability

Smart Auto-Failover

Provider hits a rate limit at 3 AM? Sandcastle switches to the next one. No error. No alert. No intervention. You wake up, check the dashboard, see "Failover activated 2x overnight. $0.12 additional cost." That's the whole story.

Intelligence

SLO-Aware Routing

Critical analysis gets Claude Opus. Simple formatting gets Haiku. Not because you configured 47 rules - because Sandcastle matches model quality to task importance automatically. You set the priorities. It picks the brain.

Cost

Cost Intelligence - you'll finally know where the money goes

Per-provider cost breakdown. "Last 30 days: $120 via Claude. Same workloads via Mistral: $45." One click to see the math. And proactive recommendations that actually make sense - not just data, but advice. "Switch these 3 workflows to Mistral and save $75/month with EU residency included."

Integration

OpenClaw Integration

New step type: type: openclaw. Your workflows can now call OpenClaw agents as a step in any pipeline. One more thing you don't have to build yourself.

New

Document Parser

New step type: type: parse. PDF, DOCX, XLSX at 576 pages per second. No external service. No API key. No monthly bill. Just pip install sandcastle-ai[parse] and go.

UX

Dashboard that gets out of your way

Getting Started checklist for new users. Quick Run cards - jump straight to results. Collapsible sidebar. Step palette with categories. Backend configurator with copy-paste .env snippets. Lazy loading. The dashboard got simpler by getting smarter.

Milestone

By the numbers

6 - AI providers
20 - Step types
576 - Pages/sec parsed
0 - Lines to switch providers

v0.25.0 Evolution March 20, 2026

New

Workflow Evolution

Your workflows optimize themselves. Set an eval suite, click "Evolve", and Sandcastle autonomously mutates prompts, swaps models, and simplifies steps - keeping only changes that improve your score.

New

Composite Scoring

quality * confidence - cost_penalty - latency_penalty. Every mutation is evaluated mechanically. No subjective judgments.
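The formula translates directly into code. The function names and the keep rule (keep a mutation only if it beats the baseline score) are illustrative assumptions. How Sandcastle derives the two penalties is not stated, so they are passed in as plain numbers.

```python
def composite_score(
    quality: float,
    confidence: float,
    cost_penalty: float,
    latency_penalty: float,
) -> float:
    """quality * confidence - cost_penalty - latency_penalty."""
    return quality * confidence - cost_penalty - latency_penalty


def keep_mutation(baseline_score: float, candidate_score: float) -> bool:
    # Assumption: a mutation survives only if it strictly improves the score.
    return candidate_score > baseline_score
```

With quality 0.9, confidence 0.8, and penalties of 0.05 and 0.02, the score is 0.72 - 0.07 = 0.65.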

New

Mutation Operators

Three strategies: prompt refinement (LLM-guided), model swapping (haiku <-> sonnet based on quality/cost), and simplification (the AI learns that less is more).

New

Evolution Dashboard

Track every iteration. See the score evolve. Compare baseline vs best. Accept the winner with one click.

v0.24.0 Ecosystem March 19, 2026

New

Workflow as API

Your workflows are now production APIs. One click to publish, one curl to call. Your customers hit the endpoint, you see every run in the dashboard.

curl -X POST "https://api.example.com/api/v1/lead-enrichment" \
  -H "Authorization: Bearer $SANDCASTLE_KEY" \
  -H "Content-Type: application/json" \
  -d '{"company": "Acme Corp", "domain": "acme.com"}'

New

Living Dashboard

The dashboard shows real data now. Real-time sparklines update on every run. Anomaly detection catches cost spikes and error streaks automatically.

Heatmap shows 6 months of activity - click any day to drill down into individual runs.

New

Agent Marketplace

Publish your workflows to the community hub. Others rate them, install them, remix them.

Think npm for AI agents. One command to publish, one click to install.

New

Multi-Agent Delegation

Agents that call other agents. Dynamic routing: the output of one step picks which workflow runs next.

Full depth tracking, real-time progress events at every level of delegation.

New

File Upload

Workflows accept real documents now. Drop a PDF, CSV, or image into the run modal.

Text files are inlined into prompts. Binary files are passed as references to your agent steps.

v0.23.0 Enterprise Trust March 18, 2026

Compliance

EU AI Act Compliance

Risk classification per EU AI Act categories. Compliance mode that blocks high-risk workflows without human approval.

Transparency reports (Article 13), Annex IV technical documentation, global emergency stop.

Security

Tamper-Evident Audit Trail

SHA-256 hash chain on every event. Each entry links to the previous - if anyone modifies a record, the chain breaks.

Verify integrity via API. 7 executor hooks + 9 admin action hooks. 3 query endpoints.
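The hash-chain idea is easy to demonstrate: each entry's hash covers the previous entry's hash plus its own payload, so editing any record invalidates everything after it. A minimal sketch with hypothetical names and an assumed all-zero genesis value. The actual record layout is not described in the release notes.

```python
import hashlib
import json
from typing import Any

GENESIS = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, event: dict[str, Any]) -> str:
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(chain: list[dict[str, Any]], event: dict[str, Any]) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev, "hash": entry_hash(prev, event)})


def verify_chain(chain: list[dict[str, Any]]) -> bool:
    """Recompute every link; any edited or reordered record breaks it."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```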

Privacy

Privacy Router

7 PII patterns: email, phone, SSN, credit card, IP address, IBAN, date of birth.

Per-workflow or per-server config. Redact sensitive data before it touches your LLM, or run in audit-only mode.
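Redact mode versus audit-only mode can be illustrated with three of the seven pattern types. The regexes below are simplified stand-ins written for this example, not Sandcastle's patterns, and the function name and placeholder format are assumptions.

```python
import re

# Simplified illustrative patterns; production-grade detection needs
# stricter rules (e.g. octet range checks for IP addresses).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text: str, audit_only: bool = False) -> tuple[str, list[str]]:
    """Return (text, labels found). In audit-only mode the text is unchanged."""
    findings = [label for label, rx in PATTERNS.items() if rx.search(text)]
    if audit_only:
        return text, findings
    for label, rx in PATTERNS.items():
        text = rx.sub(f"[{label}]", text)
    return text, findings
```

Audit-only mode is useful for measuring how much PII a workflow sees before turning redaction on.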

Infrastructure

Browser Modes

LightPanda (10x faster CDP, zero memory bloat) and Browserbase (cloud, zero cold-start) join E2B, Docker, and local as sandbox options.

Playwright pre-baked in the Dockerfile. Switch backends with one env var.

Observability

OpenTelemetry

Workflow and step-level OTLP spans. Cost, duration, and token counts on every trace. Compatible with Jaeger, Tempo, Honeycomb, Datadog.

pip install sandcastle-ai[otel]

Integrations

5 New Connectors

Langfuse (LLM observability), Qdrant (vector search), GCS (Google Cloud Storage), Azure Blob Storage, and Exa (semantic web search).

63 - Total connectors

v0.22.0 Reliability & Cost March 17, 2026

Cost

Cost Estimation

POST /runs/estimate tells you what a workflow will cost before you run it. Per-step breakdown with model pricing baked in.

Set budget guardrails. Workflows that would exceed the budget are blocked before a single token is spent.

Security

Secret Scrubber

Catches credential URLs, PEM keys, Azure AccountKey, and JSON-quoted secrets before they reach logs or LLM context.

Two-layer defense. Idempotent - safe to run multiple times on the same payload.

Quality

Eval Framework

Detects quality regressions with IEEE 754-safe integer basis points. No more floating-point false positives that cause flaky CI.

Define expected outputs, run evals on every commit, catch regressions at 0.01 percentage point precision.
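Comparing scores as integer basis points sidesteps float noise such as 0.1 + 0.2 != 0.3. A sketch of the idea, assuming scores on a 0-1 scale (so 1 bp = 0.0001 = 0.01 percentage point). The function names and the rounding mode are illustrative choices.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_basis_points(score: float) -> int:
    """Convert a 0-1 score to integer basis points (10,000 bp = 100%)."""
    # Going through str() keeps the shortest repr of the float, so
    # binary rounding noise is removed before scaling.
    scaled = Decimal(str(score)) * 10_000
    return int(scaled.to_integral_value(rounding=ROUND_HALF_UP))


def regressed(baseline: float, candidate: float, tolerance_bp: int = 0) -> bool:
    """True if the candidate is worse than baseline by more than the tolerance."""
    return to_basis_points(candidate) < to_basis_points(baseline) - tolerance_bp
```

In raw floats, 0.3 < 0.1 + 0.2 is true, which would flag a regression that does not exist. In basis points both sides are 3000.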

Integrations

Composio Integration

500+ business app actions through a single step type. Gmail, Slack, Notion, Salesforce, HubSpot, and more - all wired in via the Composio connector.

One auth flow, any tool, zero custom code.