Wrap your existing tool-calling loop and get crash-safety, an append-only step ledger, idempotent tool dispatch, and deterministic replay — backed entirely by Postgres. No broker. No rewrite of your agent.
The agent runs lookup_order → issue_refund → email_customer.
A worker is killed after issue_refund dispatches but before its
observation commits — the decisive crash window. A fresh worker re-leases the run,
rebuilds from the Postgres ledger, and re-dispatches with the same idempotency key.
    ---- timeline ----
    #1  [plan]
    #2  [tool_call]   c1   # lookup_order
    #3  [observation] c1
    #4  [plan]
    #5  [tool_call]   c2   # issue_refund
    ▸ resumed by host:6226 (attempt 2)   ← crashed here
    #6  [observation] c2   # re-dispatched, deduped
    ...
    #11 [final]
    -------------------
    run status        : succeeded
    dispatch attempts : 2   # physically called twice
    tool effects      : 1   # one actual refund
Run it yourself:
python -m avatar.cli demo
A real worker process is killed with SIGKILL — this is not a mock.
Then open the dashboard to see the
▸ resumed after crash divider in the ledger.
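The "dispatch attempts: 2, tool effects: 1" outcome can be illustrated with a minimal sketch. It uses an in-memory SQLite database as a stand-in for Postgres, and models the refund itself as a row guarded by a uniqueness constraint on (run_id, key). The `refunds` table, its columns, and the `issue_refund` signature here are hypothetical names chosen for illustration, not Avatar's actual schema or API.

```python
import sqlite3

# In-memory SQLite stands in for Postgres (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE refunds (
        run_id   TEXT    NOT NULL,
        key      TEXT    NOT NULL,  -- crash-stable idempotency key
        order_id TEXT    NOT NULL,
        cents    INTEGER NOT NULL,
        UNIQUE (run_id, key)
    )
    """
)

counters = {"dispatch_attempts": 0}


def issue_refund(run_id, key, order_id, cents):
    """Hypothetical tool: returns True only if this call created the refund."""
    counters["dispatch_attempts"] += 1
    cur = conn.execute(
        "INSERT OR IGNORE INTO refunds (run_id, key, order_id, cents) "
        "VALUES (?, ?, ?, ?)",
        (run_id, key, order_id, cents),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the insert to be ignored.
    return cur.rowcount == 1


# First worker dispatches, then is killed before the observation commits.
first = issue_refund("run-1", "c2", "42", 500)
# A fresh worker re-dispatches the same call with the same key.
second = issue_refund("run-1", "c2", "42", 500)

tool_effects = conn.execute("SELECT COUNT(*) FROM refunds").fetchone()[0]
print(counters["dispatch_attempts"], tool_effects)
```

The second call is physically made but changes nothing, because the key has already been recorded. In a real system the downstream service would need to honor the forwarded key in the same way for the guarantee to hold end to end.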
LangChain-style frameworks give you the agent loop but no durability — a deploy or a crash mid-run loses state, or double-fires a refund. Temporal gives you durability but no agent / tool / LLM semantics. Avatar is the only thing that is both agent-native and crash-safe, and its only infrastructure dependency is Postgres.
| | Crash-safe | Idempotent tools | Replay | Agent-native | Infra |
|---|---|---|---|---|---|
| LangChain / Agents SDK | no | no | no | yes | — |
| Temporal / Restate | yes | DIY | yes | no | cluster |
| Cloud queues (SQS/Celery) | at-least-once | DIY | no | no | broker |
| Avatar | yes | yes | yes | yes | Postgres |
The engine is a state machine over two Postgres tables. The
runs table is the queue; run_steps is an append-only
ledger. All run state is a pure fold over that ledger — which is what makes crash-resume
and replay deterministic.
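As a rough illustration of "state is a pure fold over the ledger", the sketch below reduces a list of step records into a run state. The `RunState` fields, the step dictionary shape, and the `apply_step` / `fold` helpers are assumptions made for this example; the engine's real step schema is not shown in this document.

```python
from dataclasses import dataclass, replace
from functools import reduce


@dataclass(frozen=True)
class RunState:
    status: str = "queued"
    pending_calls: tuple = ()  # tool_call ids that have no observation yet
    output: object = None


def apply_step(state, step):
    """Pure transition: returns a new state, never mutates the old one."""
    kind = step["kind"]
    if kind == "plan":
        return replace(state, status="running")
    if kind == "tool_call":
        return replace(state, pending_calls=state.pending_calls + (step["call_id"],))
    if kind == "observation":
        remaining = tuple(c for c in state.pending_calls if c != step["call_id"])
        return replace(state, pending_calls=remaining)
    if kind == "final":
        return replace(state, status="succeeded", output=step.get("output"))
    return state


def fold(ledger):
    """Rebuild run state from the append-only ledger, in order."""
    return reduce(apply_step, ledger, RunState())


# Ledger as it might look at the crash window: c2 was dispatched,
# but its observation was never committed.
ledger = [
    {"kind": "plan"},
    {"kind": "tool_call", "call_id": "c1"},
    {"kind": "observation", "call_id": "c1"},
    {"kind": "plan"},
    {"kind": "tool_call", "call_id": "c2"},
]

crashed = fold(ledger)
resumed = fold(ledger + [
    {"kind": "observation", "call_id": "c2"},
    {"kind": "final", "output": {"status": "done"}},
])
print(crashed.pending_calls, resumed.status)
```

Because `fold` depends only on the ledger contents, any worker that reads the same rows rebuilds the same state, and a pending call with no observation is exactly what a resuming worker would need to re-dispatch.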
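1. Enqueuing a run inserts a row into runs, status queued.
2. A worker leases the run (FOR UPDATE SKIP LOCKED) and heartbeats the lease, so each run has exactly one owner.
3. For each tool call, the worker commits the tool_call intent, dispatches the tool with its idempotency key, then commits the observation.

Lease + heartbeat + ledger replay. Kill -9 any worker; another resumes.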
A crash-stable key per tool call, enforced by UNIQUE(run_id, key).
Every plan / tool_call / observation step is committed before the next. Full audit, free.
Re-run a trace prefix from any step without re-calling the model or re-running tools.
allow / deny / require_approval evaluated before every tool dispatch.
Per-run budget_cap_cents; the run halts atomically before it breaches.
REST with a single static key and a live step stream. The dashboard is just a client.
Runs list, step-ledger timeline, live SSE, visible crash-resume markers, fork-here.
One infra dependency. The runs table is the queue. No Redis, no broker.
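One common way to make a budget check atomic is a single conditional UPDATE that only succeeds if the new total stays within the cap. The sketch below shows that pattern with in-memory SQLite standing in for Postgres. Only `budget_cap_cents` comes from the text above; the `spent_cents` column and the `try_spend` helper are hypothetical, and this is not necessarily how Avatar implements the check.

```python
import sqlite3

# In-memory SQLite stands in for Postgres (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE runs (
        id               TEXT PRIMARY KEY,
        budget_cap_cents INTEGER NOT NULL,
        spent_cents      INTEGER NOT NULL DEFAULT 0  -- hypothetical column
    )
    """
)
conn.execute("INSERT INTO runs (id, budget_cap_cents) VALUES ('run-1', 100)")
conn.commit()


def try_spend(run_id, cents):
    """Reserve budget in one statement; False means the run should halt."""
    cur = conn.execute(
        "UPDATE runs SET spent_cents = spent_cents + ? "
        "WHERE id = ? AND spent_cents + ? <= budget_cap_cents",
        (cents, run_id, cents),
    )
    conn.commit()
    # No row is updated when the spend would exceed the cap.
    return cur.rowcount == 1


results = [try_spend("run-1", 60), try_spend("run-1", 30), try_spend("run-1", 20)]
spent = conn.execute(
    "SELECT spent_cents FROM runs WHERE id = 'run-1'"
).fetchone()[0]
print(results, spent)
```

The check and the increment happen in the same statement, so there is no window in which two concurrent spends could both pass a separate read and then jointly exceed the cap.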
Two ways in. The whole stack — Postgres, control API, dashboard, and a worker — comes up with one command.
    # Postgres + control API + dashboard + 1 worker
    docker compose up

    # scale workers horizontally
    docker compose up --scale worker=3
Then open the dashboard (host port 8088 in this
environment), enqueue a run, and watch its live step timeline. Host ports are
overridable via AVATAR_API_HOST_PORT / AVATAR_PG_HOST_PORT.
    # from the repo root
    pip install -e .

    # the crash-resume proof (kills a real worker)
    python -m avatar.cli demo

    # run the API + dashboard, and a worker
    avatar serve
    avatar worker   # scale by running more
    from avatar import Avatar, tool, Plan, ToolCall

    app = Avatar(api_url="http://localhost:8088", api_key="dev-key")

    @tool(timeout=10, retries=2)
    def issue_refund(order_id, cents):
        # Your real side effect. Forward current_idempotency_key()
        # to the downstream for exactly-once end-to-end.
        return {"refunded": True}

    @app.agent("support-resolver")
    def resolve(state):
        if any(m["role"] == "tool" for m in state.messages):
            return Plan(final=True, output={"status": "done"})
        return Plan(tool_calls=[ToolCall(id="c1", name="issue_refund",
                                         arguments={"order_id": "42", "cents": 500})])

    run = app.runs.create(agent_ref="support-resolver", input={"ticket_id": 42})
    print(app.runs.wait(run["id"]))
Point the worker at your module with AVATAR_APP=yourpkg.agents. The engine
drives the durable loop; you write only the model call and the tools.
Backend engineers whose AI agents touch real, stateful, money-or-data-changing systems — issuing refunds, provisioning accounts, updating CRM/billing, writing to ledgers — and who have already been burned by an in-memory loop losing state or double-firing a side effect on a deploy or crash. If your agent only chats, you don't need Avatar. If it acts, you do.
This is single-purpose infrastructure, not a platform. It is not a SaaS, not multi-tenant, not a marketplace, not BYOK, not voice/avatars, not multi-agent orchestration. Those are sequenced behind the wedge — a hosted Avatar Cloud, governance, and a multi-agent fabric — built in the order that actually ships.