Auto-Heal: Coverage Spectrum

The remaining 20% of failures don't fail suddenly — they degrade first. Layer 2 catches them pre-emptively. Net: 93% of all failures resolve without human touch.
Layer 1 · Crash Recovery ✅ Shipped 75%
Reactive — catch after failure
⚡ Process crash
Detection: Pulse health check fails
✅ Auto-restart in ~5s
💀 OOM kill
Detection: Pulse timeout → guard trip
✅ Restart with memory cap
🔄 Crash-loop
Detection: Repeated failures in window
✅ Retry (configurable) → circuit breaker
🫥 Zombie / timeout
Detection: Connection refused / timeout
✅ SIGKILL + restart
Layer 2 · Proactive Detection 🆕 New 18%
Proactive — catch before failure
📊 Memory bloat
RSS growth >5%/h for 3 samples
✅ Pre-emptive graceful restart (~90% success)
🔒 Stuck / deadlocked
No output >3x P95 response time
✅ SIGABRT + core dump + restart (~80% success)
🤖 Hallucinating
Output structure drift >3σ from 7d baseline
✅ Restart with fallback model (~50% success)
🌐 Upstream failure
Connection refused in first retries
✅ Circuit breaker + buffer + backoff (~70% success)
Human-Needed · Structured Diagnosis 🔍 Honest 7%
Cannot auto-fix — but arrives with full diagnostics
⚙️ Config errors
Pre-flight validation
📋 "SOUL.md references skill 'x' which doesn't exist"
🧩 Logic bugs
After 2+ restarts, different models
📋 "Output quality dropped 40% over 3 days, persisted across model fallback. See attached baseline comparison."
💾 Disk full
Auto-cleanup fails
📋 "Auto-cleanup freed 0 bytes. 3 largest files: ..."
💡 The 1000x insight: The remaining 20% don't fail suddenly. They degrade first. Memory leaks grow +6%/h for 6 hours before the OOM. Stuck agents stop producing output for 3x their normal response time before you notice. Hallucinating agents drift from their output baseline over hours before they produce visible garbage. Every "non-crash" failure leaves a detectable signature before it becomes fatal. If we watch the trends — not the crash — we catch and auto-heal pre-emptively.

📖 Same Failure, Three Worlds

Scenario: Memory leak kills agent at 3am
❌ No auto-heal
03:00 OOM kill
03:00:05 Guard trips
03:00:10 Cooldown
~~~~~ 4 hours gap ~~~~~
07:00 🔴 Dead — cryptic logs
3 alarms in 1 minute, human wakes to mystery
✅ Layer 1 only
03:00 OOM kill
03:00:05 Auto-restart
03:00:10 OOM again
03:00:15 Circuit trips
03:00:16 📱 Push alert
1 alert: "crash-loop after OOM" — still needs human to diagnose cause
✅ Layer 1 + 2
22:00 RSS trend: +4%/h
23:00 RSS trend: +6%/h → threshold breach
23:05 Pre-emptive graceful restart
23:06 RSS drops to baseline
🔇 Silent. Pre-emptive restart fixed it. Human never knows.
If restart fails: "3x restarts in 2h, RSS returns to 500MB+ within 30min. Likely cause: cache accumulation."

🔧 Why This Works Without New Infrastructure

Every signal Layer 2 needs already exists in the agent ecosystem
🖥️
RSS (Memory)
Every `ps` call has it. Pulse files already track process state. GS-013 §Process table.
⏱️
Response time P95
Cron output timestamps. Signal delivery timestamps. Kanban cycle times. All in state/metrics/.
📋
Output structure
Signal payloads from every agent run. Compare last 7 days of output shapes to detect drift.
🚦
Connection status
Pulse already tracks health endpoint responses. Retry counts are in the circuit breaker state.
No new agents. No new daemons. Just trend tracking over existing metrics — which is exactly what GS-013 already defines. Layer 2 is the logical completion of the measurement standard: once you wire all 22 metrics, you have the trend data to drive pre-emptive auto-heal.