Metadata-Version: 2.4
Name: keel-circuit-breaker
Version: 0.1.0
Summary: Open/closed/half-open circuit breaker keyed by any string. Zero dependencies, async-friendly, observable.
Project-URL: Homepage, https://github.com/keelplatform/keel
Project-URL: Source, https://github.com/keelplatform/keel/tree/main/py/packages/circuit-breaker
Project-URL: Changelog, https://github.com/keelplatform/keel/blob/main/py/packages/circuit-breaker/CHANGELOG.md
Author: Raj Yakkali
License: MIT
Keywords: circuit-breaker,keel,llm,reliability,resilience
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Typing :: Typed
Requires-Python: >=3.11
Description-Content-Type: text/markdown

# keel-circuit-breaker

> Open/closed/half-open circuit breaker keyed by any string. Zero dependencies, async-friendly, observable. Skip a failing target during a cooldown, probe for recovery after.

Part of [Keel](https://github.com/keelplatform/keel) — a portfolio of small, vendor-neutral libraries. This is the first one, extracted from 1.5+ years of production use across 5 distinct LLM provider adapters.

## Why it exists

Every product that calls a flaky external service (LLM providers, third-party APIs, internal microservices) re-implements the same pattern: after N consecutive failures, stop hammering the failing target for a while; after a cooldown, let a request through to see if it recovered. `keel-circuit-breaker` is that pattern, battle-tested and keyed by any string — so one instance serves per-model, per-tenant, per-endpoint, or per-anything use cases.

## Install

```bash
pip install keel-circuit-breaker     # or: uv add keel-circuit-breaker
```

Zero runtime dependencies (stdlib only).

## Three worked examples

### 1. Guard an HTTP/LLM call (manual lifecycle)

```python
from keel_circuit_breaker import CircuitBreaker

breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=120.0)

async def call_model(model_key: str, prompt: str) -> str:
    if not breaker.is_available(model_key):
        raise RuntimeError(f"{model_key} circuit open — skip it")
    try:
        result = await some_async_http_call(prompt)  # placeholder for your own client call
        breaker.record_success(model_key)
        return result
    except Exception:
        breaker.record_failure(model_key)   # YOU decide this is a failure
        raise
```

### 2. The `call()` convenience wrapper

```python
from keel_circuit_breaker import CircuitBreaker, CircuitOpenError

breaker = CircuitBreaker()

async def ask(prompt: str):
    try:
        return await breaker.call(some_async_http_call, key="model-x", prompt=prompt)
    except CircuitOpenError as e:
        ...  # e.key tells you which key is open; fall back to another target
```

### 3. Multi-tenant rate-isolation (keyed by tenant, not model)

```python
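from keel_circuit_breaker import CircuitBreaker

breaker = CircuitBreaker(failure_threshold=5, cooldown_seconds=60.0)
tenant_id = "tenant-a"  # placeholder; use your own tenant identifier

# The key is any string (here, a tenant id). One tenant tripping the breaker
# does not affect another tenant's availability.
if breaker.is_available(tenant_id):
    ...  # serve this tenant's request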

## API

```python
CircuitBreaker(
    failure_threshold: int = 3,             # consecutive failures before opening
    cooldown_seconds: float = 120.0,        # time open before probing (half-open)
    logger: StructuredLogger | None = None, # default: logging.getLogger("keel_circuit_breaker")
)

breaker.is_available(key) -> bool           # should this target be called?
breaker.record_success(key) -> None         # fully resets the failure count
breaker.record_failure(key) -> None         # increments; opens at threshold
breaker.get_status(key) -> "closed" | "open" | "half_open"
await breaker.call(fn, key, *args, **kwargs) # convenience wrapper; raises CircuitOpenError
```

`StructuredLogger` is a small protocol — `info(event, **fields)` / `warning(event, **fields)`. `structlog`'s `BoundLogger` satisfies it directly (the original production logger). The default is a stdlib-`logging` adapter that routes fields into `extra=`.
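
As a rough illustration of that protocol shape, the sketch below defines a `typing.Protocol` with the two methods and a stdlib adapter that forwards fields through `extra=`. The class names `StructuredLoggerSketch` and `StdlibAdapterSketch` are hypothetical, and the package's own protocol and adapter may differ in detail.

```python
import logging
from typing import Any, Protocol


class StructuredLoggerSketch(Protocol):
    """Hypothetical stand-in for the protocol described above."""

    def info(self, event: str, **fields: Any) -> None: ...
    def warning(self, event: str, **fields: Any) -> None: ...


class StdlibAdapterSketch:
    """Illustrative adapter: the event becomes the message, fields go into extra=."""

    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger

    def info(self, event: str, **fields: Any) -> None:
        self._logger.info(event, extra=fields)

    def warning(self, event: str, **fields: Any) -> None:
        self._logger.warning(event, extra=fields)
```

One caveat when writing an adapter like this: stdlib `logging` raises `KeyError` if an `extra` key collides with a built-in `LogRecord` attribute such as `name` or `message`, so field names need to avoid those.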

## Design notes (read before changing behavior)

These are deliberate, load-bearing decisions carried over from production. They look like small choices; they aren't.

1. **Monotonic clock, never wall-clock.** Cooldowns use `time.monotonic()`. Wall-clock time (`time.time()`) can jump backward or forward under NTP step adjustments or an operator running `date`; a forward jump would end a cooldown early, and a backward jump could stretch it into next week. Don't "simplify" to `time.time()`.

2. **Permissive half-open.** Once the cooldown elapses, *all* concurrent `is_available()` callers see `True` until one calls `record_success`. This is **not** classic single-shot half-open (which lets exactly one probe through). It's intentional: when the caller bounds concurrency by other means (e.g., per-key rate limiting), a small burst of probes gives faster recovery. If you need single-shot half-open, wrap it or open an issue — a future `probe_concurrency` option may add it as a non-breaking choice.

3. **Single-event-loop only.** State is a plain dict, not lock-guarded. Designed for one event loop per process (the common async deployment). Concurrent mutation from multiple threads, or from event loops running in different threads of the same process, can corrupt state. Multi-thread safety is out of scope for `0.x`.

4. **Success fully resets the failure count (no decay).** One success forgives all accumulated failures, even at threshold-minus-one. Intentional: pick a healthy target back up immediately. A "decay over time" model would keep skipping a recovered target after transient errors.

5. **The state dict is unbounded by key.** Not a leak when the key space is bounded and application-controlled (a fixed set of models/tenants). If you pass *user-supplied* keys, this is a memory-growth / DoS vector — bound the key space yourself. LRU eviction may arrive at `1.0` if a real consumer needs it.

6. **The caller decides what counts as a failure.** The breaker never inspects responses or status codes. You call `record_failure()` explicitly. This is deliberate: an HTTP 429 (rate-limited) usually should *not* open the circuit; a 500, a timeout, or malformed output usually should. That policy varies per provider and belongs to you, not the breaker. Don't wrap this in "smart" auto-detection.

7. **The defaults are tuned, not arbitrary.** `failure_threshold=3` skips flaky targets quickly while absorbing 1–2 transient errors. `cooldown_seconds=120.0` matches typical provider rate-limit windows. They're calibrated for free-tier LLM provider behavior — override them for your own SLAs.

8. **Log field names are a contract.** Each state transition emits a structured record with **both** legacy field names (`model_key`, `failures`, `cooldown_seconds`; events `circuit_opened` / `circuit_closed` / `circuit_half_open`) **and** canonical namespaced fields (`keel.lib.name`, `keel.primitive`, `keel.event`, `keel.key`, `keel.failure_count`, `keel.cooldown_seconds`). Dual emission is preserved throughout the entire `0.x` lifecycle; the legacy aliases are dropped at `1.0.0`. Update any log-based dashboards to the `keel.*` fields before then.
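
Notes 1, 2, 4, and 5 can be pictured together in a minimal sketch. This is not the package's implementation: the class name `BreakerSketch`, the injectable `clock` parameter, and the choice to restart the cooldown on a failure at or past the threshold are all assumptions made for illustration. Logging is omitted.

```python
import time
from typing import Callable


class BreakerSketch:
    """Illustrative only; not the package's implementation."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 120.0,
        clock: Callable[[], float] = time.monotonic,  # hypothetical hook for tests
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures: dict[str, int] = {}     # unbounded by key (note 5)
        self._opened_at: dict[str, float] = {}  # monotonic timestamps (note 1)

    def get_status(self, key: str) -> str:
        opened_at = self._opened_at.get(key)
        if opened_at is None:
            return "closed"
        if self._clock() - opened_at >= self._cooldown:
            return "half_open"
        return "open"

    def is_available(self, key: str) -> bool:
        # Permissive half-open (note 2): every caller passes once the cooldown elapses.
        return self.get_status(key) != "open"

    def record_success(self, key: str) -> None:
        # Full reset, no decay (note 4).
        self._failures.pop(key, None)
        self._opened_at.pop(key, None)

    def record_failure(self, key: str) -> None:
        count = self._failures.get(key, 0) + 1
        self._failures[key] = count
        if count >= self._threshold:
            # Assumption: a failure at or past the threshold (re)starts the cooldown.
            self._opened_at[key] = self._clock()
```

Injecting the clock keeps the sketch testable without sleeping: a test can pass a function that returns a controlled value and advance it past `cooldown_seconds` to observe the `half_open` state.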

## Status

`0.1.0` — first release. Keel stays in `0.x` through its first year (breaking changes possible at minor bumps, always documented in the CHANGELOG; pin exact versions). Source lives in the [Keel monorepo](https://github.com/keelplatform/keel/tree/main/py/packages/circuit-breaker).

## License

MIT — see [LICENSE](https://github.com/keelplatform/keel/blob/main/LICENSE).
