v0.1 · pre-1.0 / fastapi + celery · zero job loss

When your worker dies tonight,
your tasks come back.

The reliability layer for FastAPI + Celery. Sync or async — zero job loss. No database. Just Redis.

vanilla_celery.py (Unsafe) worker dies = task lost
# If the worker dies mid-execution tonight,
# this task is lost silently and forever.
@celery_app.task
def process_order(order_id: str):
    charge_card(order_id)
    send_receipt(order_id)

relier_task.py (Guaranteed) · phoenix pattern active
# Worker dies tonight? Phoenix auto-resurrects the task.
# Deduplicates retries automatically.
@rl_task(idempotent=True)
async def process_order(order_id: str):
    await charge_card(order_id)
    await send_receipt(order_id)

rl tasks inflight · live
cluster prod-eu-west · queue 147 · p95 18.2s

Worker       Task              Duration  Status   Queue
rl-worker-1  process_document  12.4s     running  high_priority
rl-worker-2  send_invoice      3.1s      running  default
rl-worker-3  classify_text     28.2s     running  default
rl-worker-4                              idle

3 tasks running · 1 worker idle · 0 lost · phoenix scan every 5s
Delivery SLO · 99.97%
Recovery time · < 35s
Overhead / task · < 10ms
License · Apache 2.0
§01 · Problem

Eight ways a Celery worker silently loses your data.

Every Celery deployment has these failure modes. Most teams discover them at 2am, after a customer complains. Relier closes each one.

F01 — runtime

Worker OOM kill

Worker is killed mid-execution. All in-flight tasks vanish without a trace. You don't even know it happened.

F02 — retries

Non-idempotent retries

Task retries re-execute side effects. Double Stripe charges. Duplicate emails. Corrupt database state.

F03 — timeouts

No task timeouts

One stalled upstream API holds a worker hostage forever. No soft timeout. No cleanup hook. Just a zombie.

F04 — deploys

Ungraceful shutdown

Deploy at 3pm. SIGTERM lands. 12 in-flight tasks silently dropped. Nobody notices until a customer complains.

F05 — visibility

Zero visibility

No idea what's running right now. Which worker has which task. How long it's been. You're flying blind.

F06 — traffic

Traffic spikes

Queue floods with no backpressure. Workers cascade-fail. No admission control. No Retry-After.

F07 — poison

Poison-pill tasks

One bad payload crashes the worker, gets retried, crashes again. Infinite loop. No quarantine. No DLQ.

F08 — schema

Schema drift

Mid-deploy payload mismatch. Old workers pick up new-format tasks. Silent deserialization failures. Data lost.

§02 · Primitives

Reliability primitives, not boilerplate.

One decorator. Four guarantees. Every task tracked from enqueue to completion — and brought back if anything in between dies.

Zero-job-loss resurrection.

Every task registers a Redis heartbeat with a short TTL. When a worker crashes, the heartbeat expires. The Phoenix resurrector detects it and re-queues the task on a fresh worker — automatically.

  • Heartbeat registration on the persistent worker event loop.
  • OOM detection via Redis TTL expiry, checked every five seconds.
  • Automatic re-queue with original payload intact.
  • DLQ quarantine after max_resurrections exceeded.
  • OTEL span emitted for every resurrection event.
tasks.py python
@rl_task(queue="high_priority", max_resurrections=5)
async def process_document(doc_id: str):
    # Your existing code. Zero changes needed.
    result = await store_document(doc_id)
    return result
# Worker dies? Phoenix resurrects in <35s.
# Delivery rate: 99.97%
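To make the heartbeat idea concrete, here is a minimal sketch of the detect-and-re-queue loop. It is not Relier's implementation: a plain dict stands in for Redis keys with a TTL, and the class name `PhoenixSketch`, its methods, and the 10-second TTL are all assumptions made for illustration. Only `max_resurrections` and the DLQ behaviour come from the description above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PhoenixSketch:
    """Toy model of heartbeat-based resurrection; a dict stands in for Redis."""

    heartbeat_ttl: float = 10.0       # seconds a heartbeat stays valid (assumed value)
    max_resurrections: int = 5
    clock: Callable[[], float] = time.monotonic
    inflight: dict = field(default_factory=dict)  # task_id -> (payload, expires_at, resurrections)
    queue: list = field(default_factory=list)     # tasks waiting for a fresh worker
    dlq: list = field(default_factory=list)       # quarantined tasks

    def start(self, task_id, payload, resurrections=0):
        """A worker picks up a task and registers its first heartbeat."""
        expires_at = self.clock() + self.heartbeat_ttl
        self.inflight[task_id] = (payload, expires_at, resurrections)

    def beat(self, task_id):
        """A live worker refreshes the TTL; a dead worker never calls this."""
        payload, _, count = self.inflight[task_id]
        self.inflight[task_id] = (payload, self.clock() + self.heartbeat_ttl, count)

    def finish(self, task_id):
        """Normal completion: the task stops being tracked."""
        self.inflight.pop(task_id, None)

    def scan(self):
        """One resurrector pass: re-queue or quarantine tasks whose heartbeat expired."""
        now = self.clock()
        for task_id, (payload, expires_at, count) in list(self.inflight.items()):
            if expires_at > now:
                continue  # heartbeat still fresh, so the worker is alive
            del self.inflight[task_id]
            if count >= self.max_resurrections:
                self.dlq.append((task_id, payload))            # give up: quarantine
            else:
                self.queue.append((task_id, payload, count + 1))  # original payload intact
```

In this model a crashed task is noticed at most one TTL plus one scan interval after its last heartbeat, which is why short TTLs and a frequent scan keep recovery time low.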

Safe retries by default.

Atomic Redis Lua check-and-set makes any task safely retryable. First run claims, executes, caches. Retry returns cached result instantly. No double charges. No duplicate emails.

  • Lua script: SET key value NX EX ttl — atomic claim.
  • On first run: claim → execute → store result.
  • On retry: return cached result, skip execution.
  • IN_FLIGHT race handling with automatic TTL expiry.
tasks.py python
# Option A: one flag. Done.
@rl_task(idempotent=True, idempotency_ttl=3600)
async def send_invoice(invoice_id: str):
    await stripe.charge(invoice_id)
    # Already ran? → cached result, charge skipped.

# Option B: manual key for custom logic, inside any async function.
async def handle_event(event_id: str):
    async with idempotency_lock(key=event_id) as lock:
        if lock.already_executed:
            return lock.cached_result
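The claim → execute → cache flow can be sketched in a few lines. This is an illustration under stated assumptions, not Relier's code: `FakeRedis` is a tiny in-memory stand-in that mimics the `set(key, value, nx=True, ex=ttl)` and `get(key)` calls of a Redis client, and `run_once` and the `IN_FLIGHT` marker value are hypothetical names. A real deployment would also need to serialize results before storing them.

```python
import time


class FakeRedis:
    """In-memory stand-in for the two Redis calls used below (illustration only)."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        current = self._data.get(key)
        if nx and current is not None and current[1] > now:
            return None  # key exists and has not expired: claim refused
        expires_at = now + ex if ex is not None else float("inf")
        self._data[key] = (value, expires_at)
        return True

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= time.monotonic():
            return None
        return entry[0]


IN_FLIGHT = "__in_flight__"  # marker stored while the first run is still executing


def run_once(store, key, fn, ttl=3600):
    """Claim the key, execute fn, cache the result; on retry return the cache."""
    if store.set(key, IN_FLIGHT, nx=True, ex=ttl):
        result = fn()                   # first run: we own the claim
        store.set(key, result, ex=ttl)  # overwrite the marker with the result
        return result
    cached = store.get(key)
    if cached == IN_FLIGHT:
        # Another attempt holds the claim; if it crashed, the TTL frees the key.
        raise RuntimeError("another attempt is still running; retry later")
    return cached
```

Calling `run_once(store, "invoice:1", charge)` twice runs `charge` once; the second call returns the stored result without executing it again.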

Soft + hard timeouts with cleanup hooks.

Two-tier timeout enforcement. The soft timeout fires your cleanup hook so you can save progress. The hard timeout cancels the coroutine unconditionally. Both emit OTEL events that any OTLP-compatible backend can consume.

  • Soft timeout fires an async cleanup hook.
  • Hard timeout: unconditional task cancellation.
  • Save partial results before hard kill.
  • Both tiers emit OTEL events with rl.timeout.type.
tasks.py python
# Define the hook first: the decorator arguments below are evaluated
# when process_large_job is defined.
async def save_progress(ctx: TaskContext):
    await redis.set(f"partial:{ctx.task_id}", ctx.partial_result)

@rl_task(
    soft_timeout=25,
    hard_timeout=30,
    on_soft_timeout=save_progress,
)
async def process_large_job(job_id: str):
    return await run_job(job_id)
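For readers curious how a two-tier timeout can work with plain asyncio, here is a minimal sketch. It is not how Relier does it internally; `run_with_timeouts` is a hypothetical helper, and unlike the `on_soft_timeout` hook above, the hook here takes no context argument to keep the example short.

```python
import asyncio


async def run_with_timeouts(coro, soft_timeout, hard_timeout, on_soft_timeout):
    """Run coro; call the hook at soft_timeout, cancel the task at hard_timeout."""
    loop = asyncio.get_running_loop()
    hard_deadline = loop.time() + hard_timeout
    task = asyncio.create_task(coro)

    # Tier 1: wait until the soft timeout without cancelling anything.
    done, _ = await asyncio.wait({task}, timeout=soft_timeout)
    if not done:
        await on_soft_timeout()  # chance to save partial progress
        # Tier 2: wait for whatever time is left before the hard deadline.
        remaining = max(0.0, hard_deadline - loop.time())
        done, _ = await asyncio.wait({task}, timeout=remaining)

    if not done:
        task.cancel()  # hard limit reached: cancel unconditionally
        try:
            await task
        except asyncio.CancelledError:
            pass
        raise TimeoutError(f"task exceeded hard timeout of {hard_timeout}s")
    return task.result()
```

One design note on this sketch: the hook runs inside the hard-timeout budget, so a slow hook eats into the time left for the task, and a hook that never returns would block the hard cancel. A production version would bound the hook as well.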

Tasks finish or hand off.

Relier intercepts SIGTERM from deploys, scale-downs, and K8s evictions. Worker enters drain mode, finishes current tasks, and hands off the rest. Zero task loss on every deploy.

  • Worker enters drain mode — stops accepting new tasks.
  • Waits for current tasks to finish within grace window.
  • Unfinished tasks: checkpoint → re-queue elsewhere.
  • Clean exit with full accounting.
terminal shell
$ rl worker drain --timeout 30
⏳ Worker rl-worker-2 entering drain mode...
✓ task_a8f2c1 completed (12.4s)
✓ task_b2d4e8 completed (18.1s)
⚠ task_c9f1a3 exceeded timeout — re-queuing
✓ Clean exit. 1 task handed off. 0 lost.
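The drain sequence above can be sketched with asyncio alone. This is an illustration under assumptions, not Relier's shutdown code: `drain` and the `requeue` callback are hypothetical names, checkpointing is omitted, and wiring the function to SIGTERM (for example with `loop.add_signal_handler`) is left out.

```python
import asyncio


async def drain(inflight, grace, requeue):
    """Wait up to `grace` seconds for in-flight tasks, then hand off the rest.

    inflight: dict mapping task_id -> asyncio.Task
    requeue:  callable that puts a task_id back on the queue (hypothetical)
    """
    if not inflight:
        return [], []
    done, pending = await asyncio.wait(set(inflight.values()), timeout=grace)
    finished, handed_off = [], []
    for task_id, task in inflight.items():
        if task in done:
            finished.append(task_id)
        else:
            task.cancel()      # stop local execution
            requeue(task_id)   # another worker picks it up
            handed_off.append(task_id)
    # Let the cancellations settle before the process exits.
    await asyncio.gather(*pending, return_exceptions=True)
    return finished, handed_off
```

The two returned lists give the "full accounting" on exit: every in-flight task is either finished or handed off, never silently dropped.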
§03 · Developer experience

See everything. Control everything.

The rl CLI gives you real-time visibility into every task, worker, and failure, plus the tools to act on it.

§04 · Benchmarks

Relier vs vanilla Celery.

Measured overhead. Real numbers. No asterisks. Reproduce them yourself with rl bench all.

Metric · Relier v1.0 · Vanilla Celery
Task delivery rate (measured over 10M tasks) · 99.97% · ~94%
Worker OOM recovery (SIGKILL → task back on a worker) · < 35s · ∞ lost
Duplicate prevention (idempotent=True flag) · 100% · 0%
Admission control p99 (Lua atomic rate-limit) · < 1ms · n/a
Graceful shutdown (SIGTERM → drain → hand off) · 100% · ~60%
Overhead per task (heartbeat + idempotency check) · < 10ms · 0ms
§05 · Quickstart

Five minutes to zero job loss.

Install. Decorate. Deploy. That's the whole onboarding.

01 · install

One package.

Zero database dependencies. Zero GPU dependencies. Python 3.11+.

$ pip install relier
02 · decorate

Wrap your tasks.

Add @rl_task. Dispatch with .apush() from FastAPI, .push() from Django.

@rl_task(idempotent=True)
async def my_task(arg):
    return await do_work(arg)

# dispatch
await my_task.apush("data")
03 · run

Start the cluster.

One command brings up workers, the Phoenix resurrector, and the OTEL exporter.

$ rl cluster up
§06 · Surface area

Every reliability primitive you need, in one library.

No glue. No second service. No second database.

primitive · 01

Phoenix resurrection

Worker dies → task comes back. Automatic. Median recovery < 35 seconds with full payload integrity.

primitive · 02

Idempotency

Atomic Lua check-and-set. One flag, no double charges.

primitive · 03

Soft + hard timeouts

Two-tier timeout with cleanup hooks. No zombie workers.

primitive · 04

Graceful shutdown

SIGTERM → drain → finish or hand off. Zero loss on deploy.

primitive · 05

Inflight visibility

Every running task, worker, and queue depth in real time.

primitive · 06

Admission control

Lua atomic rate-limit. < 1ms p99. Returns 429 + Retry-After.

primitive · 07

Dead letter queue

Poison pills quarantined with payload + stack trace. Release when fixed.

primitive · 08

Schema versioning

Versioned envelope. Auto-migration on pickup. Deploy any time.

primitive · 09

OpenTelemetry native

Every event emits OTEL spans. Plug into Grafana, Jaeger, Datadog, anything OTLP.

primitive · 10

Chaos engineering CLI

rl chaos worker-kill --watch. Prove the guarantees yourself.

primitive · 11

Sync + async, one API

async def or def — both work. .apush() from FastAPI, .push() from Django or Flask. Persistent event loop under the hood, zero per-task asyncio overhead.

primitive · 12

First-class CLI

rl tasks, rl worker, rl dlq, rl chaos — full control from terminal.

— get started

Built for engineers at 2am
whose queue just died.

Open source. Apache 2.0. Free forever. Made with conviction in Abuja.

Star on GitHub
Python 3.11+ · runtime
Apache 2.0 · license
Redis-only · dependency
< 10ms · overhead
99.97% · delivery slo