Anthropic shipped a managed-agents update on May 19. Most teams will read the blog post and move on. We extracted everything. Self-hosted sandboxes that keep your org key inside your boundary. A Memory MCP server that fills the gap Anthropic left when they declared memory_stores incompatible with self-hosted runs. MCP tunnels with WIF token exchange so your private Jira can be reached by a managed agent without static secrets. Live work-queue telemetry in the dashboard. Three production case-study blueprints. Five sandbox cookbooks. 150 new tests, validated across six iterations. Zero regressions. You wanted control over where your model runs. Here it is.
Self-hosted sandboxes - your code, your silicon
Five providers in the runtime enum: cloudflare, daytona, modal, vercel, docker. Each gets a cookbook under deploy/cookbooks/ with a copy-paste-ready spawn script and image. Your org key never leaves your boundary - the worker refuses to start if ANTHROPIC_API_KEY is present in the sandbox env. Only sk-ant-oat01- environment-scoped keys get past the gate. A foot-gun that could have drained your org budget now raises an error before the first network call.
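A minimal sketch of what such a startup gate could look like. The function name, the exception class, and the SANDBOX_ENV_KEY variable are hypothetical (the release notes do not name them); only ANTHROPIC_API_KEY and the sk-ant-oat01- prefix come from the text above.

```python
from typing import Mapping

ENV_KEY_PREFIX = "sk-ant-oat01-"
# Hypothetical variable name for the environment-scoped key (an assumption).
SCOPED_KEY_VAR = "SANDBOX_ENV_KEY"


class SandboxKeyError(RuntimeError):
    """Raised at startup, before any network call is attempted."""


def check_sandbox_env(env: Mapping[str, str]) -> str:
    """Return the environment-scoped key, or raise if the env is unsafe."""
    # The org key must never be present inside the sandbox.
    if "ANTHROPIC_API_KEY" in env:
        raise SandboxKeyError("ANTHROPIC_API_KEY must not be set in the sandbox env")
    key = env.get(SCOPED_KEY_VAR, "")
    # Only environment-scoped keys are allowed through.
    if not key.startswith(ENV_KEY_PREFIX):
        raise SandboxKeyError(f"{SCOPED_KEY_VAR} must start with {ENV_KEY_PREFIX}")
    return key
```

In a worker entrypoint this would run against os.environ as the very first step, so a misconfigured sandbox fails loudly instead of spending against the wrong key.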
Memory MCP server
Anthropic explicitly documented that memory_stores cannot be combined with self-hosted sandboxes. We filled the gap. sandcastle.engine.memory_mcp_server wraps mem0 + persistent Qdrant + Anthropic Haiku for memory decisions. Four tools, two resources, one prompt. Helm chart + docker-compose. NetworkPolicy egress to Cloudflare CIDR only.
MCP tunnels with WIF auth
Gated beta, header mcp-client-2025-11-20. Your internal Jira / Snowflake / Confluence is reachable from a managed agent through a cloudflared tunnel authenticated with WIF tokens. No static secrets. 60s skew-cached token exchange. Manual cert mode also supported for air-gapped setups.
Live work-queue dashboard
WorkQueuePanel on the run detail page, SSE-driven. Depth, sparkline (Recharts), pill (green < 5, amber 5-50, red > 50), aria-live="polite", exponential backoff up to 30 s. You see the backlog before the pager fires.
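The pill thresholds and the reconnect backoff can be sketched in a few lines. This is an illustration of the logic only, written in Python rather than the dashboard's own code; the function names, the 1 s base delay, and the doubling factor are assumptions, while the thresholds and the 30 s cap come from the description above.

```python
def pill_color(depth: int) -> str:
    """Map queue depth to a pill colour: green < 5, amber 5-50, red > 50."""
    if depth < 5:
        return "green"
    if depth <= 50:
        return "amber"
    return "red"


def backoff_delays(base: float = 1.0, cap: float = 30.0, attempts: int = 8) -> list[float]:
    """Exponential reconnect delays in seconds, doubling each time, capped at `cap`."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]
```

With the assumed defaults, the delays grow 1, 2, 4, 8, 16 and then stay at 30 seconds.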
Webhook-driven workers
New session.status_run_started event in SUPPORTED_EVENTS. A worker can now run as a webhook handler instead of a long-poll loop - cuts RAM, latency, and your AWS bill. HMAC verification accepts both the sha256=-prefixed and the bare-digest signature forms.
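A minimal sketch of HMAC verification that tolerates both signature forms, using only the Python standard library. The function name and the way the secret and header value are passed in are assumptions; Sandcastle's actual handler may differ.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Check an HMAC-SHA256 signature given as 'sha256=<hex>' or bare '<hex>'."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = header_value.strip()
    # Accept the prefixed form by stripping the prefix before comparing.
    if received.startswith("sha256="):
        received = received[len("sha256="):]
    # Constant-time comparison; bytes avoid errors on non-ASCII header input.
    return hmac.compare_digest(expected.encode("utf-8"), received.lower().encode("utf-8"))
```

Compute the signature over the raw request body, before any JSON parsing, otherwise re-serialization can change the bytes and break the check.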
Three production case studies
Not tutorials. Blueprints. Amplitude designer (multiagent + accessibility specialist + computer-use + Cloudflare). Clay GTM (project_manager + researcher / writer / qualifier + Daytona). Rogo analyst on private data (Vercel + WIF tunnel + risk_level high + mandatory approval + eval gate).
Tested six different ways
150 new tests. Sequential re-run, file isolation, randomized + pollution cluster, pytest-repeat 10x stress, module + CLI smoke, consolidated verdict. Zero flakes. Zero new regression categories. Documented in the PR.
You shipped the agent. The client wants the integration. The auditor wants the trail. The user wants you to ask the next question without restarting the whole workflow. v0.32 is the answer to every one of those. Sandcastle now exposes every Anthropic Managed Agents primitive shipped under the managed-agents-2026-04-01 beta umbrella, plus the things Anthropic doesn't ship: cryptographically verifiable trajectory replay, a Skills publisher that turns workflows into uploadable Claude Skills, and an Agent SDK runtime for teams that want in-process execution. Two weeks of work, 169 new tests, one release.
Every Anthropic primitive surfaced
Memory Stores, Multiagent coordinator, Outcomes API, Webhooks. Anthropic released all four under the same beta header in April and May 2026; Sandcastle now wires each one into the YAML and the dashboard so a single workflow can attach versioned memory, spawn 20 parallel specialist agents, define outcomes the eval pipeline reads automatically, and emit lifecycle events to a webhook endpoint.
Add three lines of YAML, get a multi-agent system that meets your eval gate.
Skills Publisher
sandcastle publish-skills --upload converts every workflow into a tar.gz SKILL.md package with strict frontmatter and uploads to /v1/skills. Your workflows are now callable from every Anthropic Skills-aware client. Pair with v0.31 MCP-first publishing and Sandcastle covers both major distribution channels.
Trajectory Replay
New type: trajectory-replay step computes SHA-256 over a recorded tool-call sequence and diffs against a candidate run. Sandcastle's audit trail is a hash chain, so the replay is cryptographically verifiable - a property neither LangSmith nor Braintrust ships. EU AI Act auditors love this.
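The idea behind a replay check can be sketched as follows: hash a canonical serialization of the recorded tool calls, hash the candidate run the same way, and report the first step where they diverge. The canonical-JSON encoding, the record shape, and both function names are assumptions for illustration, not Sandcastle's actual format.

```python
import hashlib
import json
from typing import Any, Optional


def trajectory_digest(tool_calls: list[dict[str, Any]]) -> str:
    """SHA-256 over a canonical JSON encoding of the tool-call sequence."""
    canonical = json.dumps(tool_calls, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def first_divergence(
    recorded: list[dict[str, Any]], candidate: list[dict[str, Any]]
) -> Optional[int]:
    """Index of the first differing call, or None if the sequences match."""
    for index, (expected, actual) in enumerate(zip(recorded, candidate)):
        if expected != actual:
            return index
    if len(recorded) != len(candidate):
        # One run is a strict prefix of the other.
        return min(len(recorded), len(candidate))
    return None
```

Comparing digests gives a cheap yes/no answer; the divergence index tells a reviewer where to look when the answer is no.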
Agent SDK runtime
runtime: "agent-sdk" swaps Managed Agents for the in-process Claude Agent SDK. For EU-sovereignty, air-gapped, and regulated teams that can't send session state to Anthropic's hosted infra. Lazy-imports the SDK; graceful error when not installed.
Live Agent Reasoning panel
Run detail page now subscribes to an SSE event stream and shows agent.thinking, agent.tool_use, and agent.message events as they fire. Thread-grouped, collapsible. The dashboard finally shows what the agent is actually doing.
Five wire fixes (table stakes)
tools_enabled finally reaches the API (was parsed but ignored). temperature, max_tokens, thinking_budget on ManagedAgentConfig. stream config field actually used. Pricing table mapping every Claude 4.x model. Fallback chains of up to 5 templates.
Tool Search + examples convention
Mark tools with defer_loading: true and 1-5 invocation examples per tool. Anthropic's own measurements: tool-selection accuracy 49% to 74%, usable context up 85%, parameter accuracy 72% to 90%. New docs/tool-examples-convention.md.
Computer Use + MCP Elicitation
New type: computer-use step with the computer_20251124 beta + an 8-item safety pre-flight (including a prompt-injection guard, page-load deadline, and screenshot bounds). Plus, the 6th MCP primitive (Elicitation, spec rev 2025-11-25) lets workflows ask the user for a missing input mid-run.
Eighty days. That is how long you have until the bulk of the EU AI Act's obligations start to apply. After 2 August 2026, "we'll figure it out later" stops being an acceptable answer. v0.31 is the answer you can ship today: a dedicated EU AI Act landing page your CISO can actually read, ten ready-made compliance workflow templates mapped to specific Articles and Annex IV, eval gates that block a regressing model from getting promoted, and MCP-first publishing that turns every workflow you've already built into a first-class tool inside Claude Desktop, Cursor, Windsurf, or any MCP client. The work was quiet. The result is loud.
Built for August 2, 2026
The EU AI Act deadline is on the calendar whether you like it or not. v0.31 ships a dedicated landing page mapping every Sandcastle control to a specific Article (9, 11, 12, 14, 25, 49, 50, 73) and Annex IV, plus ten production-ready compliance workflow templates: DPIA, vendor risk assessment, incident reporting, bias audit, model card generator, AI inventory, risk register, GDPR data-subject request, human oversight log, and Annex IV transparency report.
Bring the templates. Customize the prompts. Hand the auditor the audit trail. Done.
MCP-First Publishing
One command: sandcastle publish-mcp researcher. Paste the snippet into Claude Desktop, Cursor, or Windsurf. Your workflow is now a tool the AI client can call. Every workflow you already built becomes reachable from every MCP client in the ecosystem - the full five-primitive spec (tools, resources, prompts, sampling, roots) and a .well-known/mcp.json discovery manifest included.
Eval Gates That Block Bad Models
Define a golden dataset of inputs and expected scores. Set a minimum threshold. Promote with ?strict=true and Sandcastle runs the gate before flipping the production version. The new model regresses by even a hair? Promotion fails closed. Eval-driven development is no longer a future plan.
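A minimal sketch of a fail-closed gate. How the golden scores and the minimum threshold combine is an assumption here (every case must reach its expected score, and the mean must reach the threshold); the names GoldenCase and gate_passes are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    input: str
    expected_score: float


def gate_passes(
    cases: Sequence[GoldenCase],
    score_fn: Callable[[str], float],
    min_threshold: float,
) -> bool:
    """Return True only if the candidate clears the gate; any doubt means False."""
    if not cases:
        return False  # no evidence, no promotion
    try:
        scores = [score_fn(case.input) for case in cases]
    except Exception:
        return False  # a crashing evaluator blocks promotion instead of allowing it
    # A regression on any single case fails the gate.
    if any(score < case.expected_score for score, case in zip(scores, cases)):
        return False
    return sum(scores) / len(scores) >= min_threshold
```

The key design choice is that every failure path returns False: an empty dataset or a broken scorer blocks promotion just like a bad score does.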
A Dashboard That Doesn't Crash
Overview page split into twenty focused components, each wrapped in its own error boundary. One slow API endpoint no longer wipes the page. Every section that loads stays loaded. The 2,261-line monolith is now a 241-line orchestrator.
Cost Estimates That Tell the Truth
The token estimator used to assume 1,000 tokens per run for every workflow on Earth. Now it reads each workflow's actual run history, blends the price per million, and gives each one its own number. Evolution makes better keep/discard calls. AutoPilot stops lying about the cost of an experiment.
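Roughly, a history-based estimate could work like this: average the input and output tokens over past runs, then price each side per million tokens. The record shape, the function name, and the prices in the usage note are illustrative assumptions, not Sandcastle's pricing table.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class RunRecord:
    input_tokens: int
    output_tokens: int


def estimate_run_cost(
    history: Sequence[RunRecord],
    input_price_per_million: float,
    output_price_per_million: float,
) -> Optional[float]:
    """Estimated cost of one run in dollars, or None when there is no history."""
    if not history:
        return None
    avg_in = mean(record.input_tokens for record in history)
    avg_out = mean(record.output_tokens for record in history)
    # Blend the two per-million prices by the workflow's own token mix.
    return (avg_in * input_price_per_million + avg_out * output_price_per_million) / 1_000_000
```

For example, two runs of 2,000/500 and 4,000/1,500 input/output tokens at hypothetical prices of $3 and $15 per million give an estimate of about $0.024 per run.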
Codex Audit Rounds 9 + 10
Five HIGH findings closed: tenant-scoped cache and Mem0 scope, XSS-safe report config, SSRF-guarded WeasyPrint URL fetching, A2A budget enforcement, version-list 404 correctness. Plus one MEDIUM: cross-tenant prompt injection in the workflow generator. tenant_id threaded through every executor path so neighbouring tenants cannot see each other's cache hits, memory reads, or prompt context.
Dashboard accessibility pass
aria-expanded, aria-controls, aria-hidden, focus-visible rings, sr-only labels on every search and filter input, modal widths that scale up on tablet, disabled opacity bumped to WCAG-AA contrast, and localStorage wrapped in try/catch so private browsing degrades gracefully. Eight separate polish items in one release.
Per-workflow stats API
GET /api/workflows/stats returns run count, success rate, average cost, and last-run status for every workflow visible to your tenant. Cached in-process for 30 seconds. The dashboard grid now shows real metrics instead of placeholder data.
v0.30 is the biggest Sandcastle release yet. Claude Managed Agents are now workflow steps. Pick from 15 built-in agent templates - or describe what you need and AI designs the agent for you. Combine cloud agents, local models, OCR, and PDF reports in one YAML file. Plus: GLM-OCR with 94.6% accuracy runs locally for free, and every run now tells you exactly where your tokens go.
Claude Managed Agents as Workflow Steps
Delegate to Anthropic's cloud agent runtime. Agents get bash, files, web search, code execution. Use them alongside Mistral, oMLX, or any other provider. One YAML, best tool per step.
15 Built-in Agent Templates
researcher, coder, analyst, writer, reviewer, scraper, tester, devops, translator, designer, sql_expert, seo_specialist, legal_analyst, financial_analyst, project_manager. One line: agent_template: researcher
Describe Mode
Don't pick a template. Describe what you need: "Data analyst who creates charts from CSVs." AI designs the agent automatically - system prompt, tools, packages, network.
Agent Chaining
output_format (json/files/markdown), shared_files between agents, fallback_template on failure. Agents collaborate in multi-step pipelines.
GLM-OCR Engine
0.9B model, 94.6% accuracy, #1 on OmniDocBench. Runs locally via Ollama. Free. Handles tables, handwriting, math, code blocks. 4th OCR engine alongside pymupdf and Chandra.
Token Waste Detection
Every run generates a token efficiency report. Detects duplicate file reads, duplicate prompts, oversized context. output_max_tokens limits what passes between steps. Cache-aware execution.
v0.28.1 is about making what you've built understandable. Steps explain why they exist. Errors tell you what went wrong. The CLI shows you what's happening in real-time. And LLM steps can now pull in fresh context from memory, web, or files before they run.
Self-Describing Workflows
Every step can declare its responsibility, source_hint, owner, and added_date. Three new CLI commands: sandcastle describe, sandcastle lint, sandcastle owners. Audit trail enriched. Dashboard shows metadata on steps and workflow cards.
context_query - Dynamic Context for LLM Steps
Steps fetch relevant context before execution. Four sources: Agent Memory (semantic search), web (Tavily), local files (keyword search), custom (shell command). Available as {context} in prompts.
sandcastle run --stream
Live terminal output with colors: green for done, yellow for running, red for failed. Shows step timing, cost, and responsibility. Ctrl+C detaches the stream and leaves the run going in the background.
Real-Time Step Progress
Dashboard shows live checkmarks via SSE as steps complete. No more refreshing to see progress.
Detailed Errors + PDF Download
Full error messages in red callout boxes (not just "failed"). Download PDF button for report steps.
Batch Cancel + Semantic Search
Cancel running batches mid-flight. Template search with "Did you mean..." fuzzy suggestions.
540 New Tests
280 template e2e tests (top 20 templates). 260 provider integration tests (all 7 providers). Full validation coverage.
v0.28 is where Sandcastle stops being a tool and becomes a platform. Batch-process thousands of items. Monitor your schedules. Generate beautiful PDF reports with charts. Parse scanned documents with AI. Auto-update without downtime. And now - run everything locally on Apple Silicon with oMLX.
Auto-Update System
CLI: sandcastle update. Dashboard: one-click with rollback. Stable/beta/pinned channels. Blackout windows. Enterprise approval policies. Never manually pip install again.
Batch Run
Upload 1,000 items and Sandcastle fans them out into parallel runs. Progress tracking, per-item results, aggregate cost.
POST /workflows/{name}/batch
PDF Report Engine
New step type: type: report. Professional-grade PDFs with cover page, table of contents, charts, KPI boxes, callouts. 3 themes. WeasyPrint under the hood.
pip install sandcastle-ai[report]
Schedule Monitor
Dashboard page showing all cron schedules: last run, next run countdown, success rate, pause/resume. No more guessing what failed overnight.
Chandra OCR
Parse scanned documents, handwriting, and tables from images. 90+ languages including Czech. ocr_engine: chandra in parse step.
pip install sandcastle-ai[ocr]
oMLX - Local AI on Apple Silicon
7th provider. Run Llama 4, Mistral, Gemma, Qwen locally. Zero cloud costs. Zero data leaving your machine. OpenAI-compatible.
Workflow Version Diff
Every edit versioned. Compare any two versions with YAML diff highlighting. See exactly what changed.
Activity Feed + Run Compare
Real-time activity stream on dashboard. Select 2 runs and compare side-by-side.
Deep Research Pipeline
OpenVLAW-inspired 6-step template: parallel research, judge, verify, typeset, audit. Built-in.
v0.2666 was the vision. v0.2666b1 is the proof. We shipped universal advisor, EU data residency, smart failover. Then we spent a week trying to break it. 1,875 new tests. 90 bug fixes. 10 security patches. Every edge case we found - fixed.
AI Assistant got smarter
Generates a workflow, finds errors, and fixes them automatically. Suggests matching templates. Learns your style from existing workflows. Knows which tools you have configured.
First 60 seconds redesigned
Auto-detects Ollama and API keys on first load. One-click demo workflow runs in 10 seconds. Onboarding wizard: 2 steps, not 5.
Failover Dashboard
See when and why providers failed over. Per-workflow cost recommendations with savings amounts.
Agent Memory Page
Search, browse, delete agent memories. Importance scores and decay visualization.
Security hardened
Admin auth on evolution endpoints, template injection prevention, EU residency enforced on every LLM call, rate limiting, credential logging sanitized.
Faster everywhere
GZip compression, 2-phase dashboard loading, cache headers, database indexes, search debounce.
The release where Sandcastle stopped being an orchestrator and became a platform. You used to choose a provider and hope for the best. Now you describe what you want and the system figures out the rest - which model, which provider, what it costs, where the data lives, and what happens when something breaks.
Universal Advisor - one setting changes everything
Set SANDCASTLE_ADVISOR_PROVIDER=mistral and every AI feature - generation, evolution, evaluation, quality scoring - switches to Mistral. Tomorrow you want Claude? Change one line. Ollama on your laptop? Same line. Six providers, one setting, zero code changes.
EU Data Residency
Set DATA_RESIDENCY=eu. That's it. All AI processing routes through EU providers or stays local. Try sending data to a US provider with EU mode on - Sandcastle won't let you. Not a promise. Enforcement.
Smart Auto-Failover
Provider hits a rate limit at 3 AM? Sandcastle switches to the next one. No error. No alert. No intervention. You wake up, check the dashboard, see "Failover activated 2x overnight. $0.12 additional cost." That's the whole story.
SLO-Aware Routing
Critical analysis gets Claude Opus. Simple formatting gets Haiku. Not because you configured 47 rules - because Sandcastle matches model quality to task importance automatically. You set the priorities. It picks the brain.
Cost Intelligence - you'll finally know where the money goes
Per-provider cost breakdown. "Last 30 days: $120 via Claude. Same workloads via Mistral: $45." One click to see the math. And proactive recommendations that actually make sense - not just data, but advice. "Switch these 3 workflows to Mistral and save $75/month with EU residency included."
OpenClaw Integration
New step type: type: openclaw. Your workflows can now call OpenClaw agents as a step in any pipeline. One more thing you don't have to build yourself.
Document Parser
New step type: type: parse. PDF, DOCX, XLSX at 576 pages per second. No external service. No API key. No monthly bill. Just pip install sandcastle-ai[parse] and go.
Dashboard that gets out of your way
Getting Started checklist for new users. Quick Run cards - jump straight to results. Collapsible sidebar. Step palette with categories. Backend configurator with copy-paste .env snippets. Lazy loading. The dashboard got simpler by getting smarter.
Workflow Evolution
Your workflows optimize themselves. Set an eval suite, click "Evolve", and Sandcastle autonomously mutates prompts, swaps models, and simplifies steps - keeping only changes that improve your score.
Composite Scoring
quality * confidence - cost_penalty - latency_penalty. Every mutation is evaluated mechanically. No subjective judgments.
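The formula translates directly into code. A minimal sketch, assuming all four terms are already normalized to comparable scales; the function names and the keep-only-if-strictly-better rule are illustrative assumptions.

```python
def composite_score(
    quality: float, confidence: float, cost_penalty: float, latency_penalty: float
) -> float:
    """quality * confidence - cost_penalty - latency_penalty."""
    return quality * confidence - cost_penalty - latency_penalty


def keep_mutation(baseline_score: float, candidate_score: float) -> bool:
    """Keep a mutation only when it strictly improves on the baseline."""
    return candidate_score > baseline_score
```

A candidate with quality 0.9, confidence 0.8, and penalties of 0.05 and 0.02 scores about 0.65; a cheaper but lower-quality variant only survives if its smaller penalties outweigh the quality it gave up.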
Mutation Operators
Three strategies: prompt refinement (LLM-guided), model swapping (haiku <-> sonnet based on quality/cost), and simplification (the AI learns that less is more).
Evolution Dashboard
Track every iteration. See the score evolve. Compare baseline vs best. Accept the winner with one click.
Workflow as API
Your workflows are now production APIs. One click to publish, one curl to call. Your customers hit the endpoint, you see every run in the dashboard.
curl -X POST "https://api.example.com/api/v1/lead-enrichment" \
  -H "Authorization: Bearer $SANDCASTLE_KEY" \
  -H "Content-Type: application/json" \
  -d '{"company": "Acme Corp", "domain": "acme.com"}'
Living Dashboard
The dashboard shows real data now. Real-time sparklines update on every run. Anomaly detection catches cost spikes and error streaks automatically.
Heatmap shows 6 months of activity - click any day to drill down into individual runs.
Agent Marketplace
Publish your workflows to the community hub. Others rate them, install them, remix them.
Think npm for AI agents. One command to publish, one click to install.
Multi-Agent Delegation
Agents that call other agents. Dynamic routing: the output of one step picks which workflow runs next.
Full depth tracking, real-time progress events at every level of delegation.
File Upload
Workflows accept real documents now. Drop a PDF, CSV, or image into the run modal.
Text files are inlined into prompts. Binary files are passed as references to your agent steps.
EU AI Act Compliance
Risk classification per EU AI Act categories. Compliance mode that blocks high-risk workflows without human approval.
Transparency reports (Article 13), Annex IV technical documentation, global emergency stop.
Tamper-Evident Audit Trail
SHA-256 hash chain on every event. Each entry links to the previous - if anyone modifies a record, the chain breaks.
Verify integrity via API. 7 executor hooks + 9 admin action hooks. 3 query endpoints.
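How a hash chain makes tampering visible can be shown in a short sketch. The entry layout, the all-zero genesis value, and the canonical-JSON encoding are assumptions for illustration; Sandcastle's stored format is not described here.

```python
import hashlib
import json
from typing import Any

GENESIS_HASH = "0" * 64  # assumed starting value for the first entry


def _entry_hash(prev_hash: str, payload: dict[str, Any]) -> str:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain: list[dict[str, Any]], payload: dict[str, Any]) -> dict[str, Any]:
    """Append an event whose hash covers its payload and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {"prev_hash": prev_hash, "payload": payload, "hash": _entry_hash(prev_hash, payload)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict[str, Any]]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash depends on the one before it, changing a single old record invalidates that entry and everything after it.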
Privacy Router
7 PII patterns: email, phone, SSN, credit card, IP address, IBAN, date of birth.
Per-workflow or per-server config. Redact sensitive data before it touches your LLM, or run in audit-only mode.
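A minimal sketch of the redact versus audit-only behaviour, covering three of the seven pattern types. The regular expressions are deliberately simplified illustrations, not Sandcastle's production patterns, and the function name and placeholder format are assumptions.

```python
import re

# Simplified example patterns; real-world PII detection needs stricter rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan(text: str, redact: bool = True) -> tuple[str, dict[str, int]]:
    """Return (text, findings). In audit-only mode the text is left unchanged."""
    findings: dict[str, int] = {}
    for name, pattern in PATTERNS.items():
        replaced, count = pattern.subn(f"[{name.upper()}]", text)
        if count:
            findings[name] = count
            if redact:
                text = replaced
    return text, findings
```

Calling scan(prompt) before the LLM call gives the redacting behaviour; scan(prompt, redact=False) only reports what it found, which is useful for measuring exposure before turning redaction on.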
Browser Modes
LightPanda (10x faster CDP, zero memory bloat) and Browserbase (cloud, zero cold-start) join E2B, Docker, and local as sandbox options.
Playwright pre-baked in the Dockerfile. Switch backends with one env var.
OpenTelemetry
Workflow and step-level OTLP spans. Cost, duration, and token counts on every trace. Compatible with Jaeger, Tempo, Honeycomb, Datadog.
pip install sandcastle-ai[otel]
5 New Connectors
Langfuse (LLM observability), Qdrant (vector search), GCS (Google Cloud Storage), Azure Blob Storage, and Exa (semantic web search).
Cost Estimation
POST /runs/estimate tells you what a workflow will cost before you run it. Per-step breakdown with model pricing baked in.
Set budget guardrails. Workflows that would exceed the budget are blocked before a single token is spent.
Secret Scrubber
Catches credential URLs, PEM keys, Azure AccountKey, and JSON-quoted secrets before they reach logs or LLM context.
Two-layer defense. Idempotent - safe to run multiple times on the same payload.
Eval Framework
Detects quality regressions with IEEE 754-safe integer basis points. No more floating-point false positives that cause flaky CI.
Define expected outputs, run evals on every commit, catch regressions at 0.01 percentage point precision.
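Why integer basis points help can be seen in a small sketch. It assumes scores are fractions between 0 and 1 and that one basis point equals 0.01 percentage point; the function names and the tolerance parameter are hypothetical.

```python
def to_basis_points(score: float) -> int:
    """Convert a 0..1 score to integer basis points (1 bp = 0.01 percentage point)."""
    return round(score * 10_000)


def is_regression(baseline: float, candidate: float, tolerance_bp: int = 0) -> bool:
    """True when the candidate is worse than the baseline by more than the tolerance."""
    return to_basis_points(candidate) < to_basis_points(baseline) - tolerance_bp
```

In plain floats, 0.1 + 0.2 is not equal to 0.3, so a naive comparison can flag a phantom change. Converted to basis points both values become the same integer, while a genuine drop from 0.8735 to 0.8734 still registers as one basis point.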
Composio Integration
500+ business app actions through a single step type. Gmail, Slack, Notion, Salesforce, HubSpot, and more - all wired in via the Composio connector.
One auth flow, any tool, zero custom code.