Metadata-Version: 2.4
Name: agentfuse-runtime
Version: 0.2.2
Summary: Intelligent LLM agent cost optimization runtime — per-run budget enforcement, semantic caching, model routing across 15 providers
Author-email: Vinay Kumar Reddy Budideti <vinaybudideti@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/vinaybudideti/agentfuse
Project-URL: Repository, https://github.com/vinaybudideti/agentfuse
Project-URL: Bug Tracker, https://github.com/vinaybudideti/agentfuse/issues
Project-URL: Changelog, https://github.com/vinaybudideti/agentfuse/blob/main/README.md#changelog
Keywords: llm,ai-agents,cost-optimization,semantic-caching,budget-enforcement,model-routing,langchain,crewai,openai-agents,langgraph,mcp,pydantic-ai,openai,anthropic,gemini,deepseek,llm-gateway,token-counting,observability
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx>=0.27
Requires-Dist: numpy>=1.24
Requires-Dist: faiss-cpu>=1.7
Requires-Dist: sentence-transformers>=3.0
Requires-Dist: cachetools>=5.3
Requires-Dist: structlog>=24.0
Requires-Dist: tenacity>=8.2
Requires-Dist: tiktoken>=0.5
Provides-Extra: redis
Requires-Dist: redis>=5.0; extra == "redis"
Provides-Extra: otel
Requires-Dist: opentelemetry-sdk>=1.20; extra == "otel"
Requires-Dist: opentelemetry-api>=1.20; extra == "otel"
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.20; extra == "anthropic"
Provides-Extra: gemini
Requires-Dist: google-genai>=1.0; extra == "gemini"
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.3; extra == "langchain"
Provides-Extra: all
Requires-Dist: redis>=5.0; extra == "all"
Requires-Dist: opentelemetry-sdk>=1.20; extra == "all"
Requires-Dist: opentelemetry-api>=1.20; extra == "all"
Requires-Dist: openai>=1.0; extra == "all"
Requires-Dist: anthropic>=0.20; extra == "all"
Requires-Dist: google-genai>=1.0; extra == "all"
Requires-Dist: langchain-core>=0.3; extra == "all"
Requires-Dist: prometheus-client>=0.20; extra == "all"
Dynamic: license-file

# AgentFuse

> Intelligent LLM Agent Cost Optimization Runtime

![PyPI](https://img.shields.io/pypi/v/agentfuse-runtime)
![License](https://img.shields.io/badge/license-MIT-blue)
![Python](https://img.shields.io/badge/python-3.11+-blue)
![Tests](https://img.shields.io/badge/tests-1099%20passing-brightgreen)
![Coverage](https://img.shields.io/badge/core%20coverage-93%25-green)

AgentFuse is a production-grade Python SDK that optimizes LLM costs through intelligent model routing, semantic caching, graduated budget enforcement, and unified observability across OpenAI, Anthropic, Google Gemini, DeepSeek, Mistral, and 15+ providers. Built with insights from LiteLLM, Portkey, and Helicone architectures, and backed by research from 8 academic papers.

## The Problem

AI agents burn money without warning. A stuck loop can cost $50 in 10 minutes. Retries on a failed call can triple your bill. There is no built-in way to enforce per-run budgets across LLM providers.

## The Solution

AgentFuse intercepts every LLM call with a two-tier semantic cache (Redis L1 exact-match + FAISS L2 vector similarity) achieving 87.5% hit rate, and enforces graduated budget policies that downgrade models, compress context, and terminate gracefully instead of burning your budget.

## Key Numbers

| Metric | Value |
|---|---|
| Cache hit rate | 87.5% on repeated and paraphrased prompts |
| Cost reduction | 71.8% ($0.24 vs $0.87 for same workload) |
| Model routing savings | Up to 85% via intelligent complexity routing (RouteLLM-inspired) |
| Tokens saved | 179,445 per 100 calls |
| Integration effort | 1 line of code (`completion()` gateway) |
| Test suite | 1099 unit tests, 93% core coverage |
| Models supported | 34 with hot-reloadable pricing (GPT-5, Claude Opus 4.6, Gemini 2.5 Pro) |
| Providers supported | 15 (OpenAI, Anthropic, Gemini, DeepSeek, Mistral, Groq, Together, xAI, Fireworks, OpenRouter, Ollama, vLLM, Azure, Bedrock, SiliconFlow) |
| Framework integrations | 8 (OpenAI, Anthropic, LangChain, CrewAI, OpenAI Agents SDK, LangGraph, MCP, Pydantic AI) |
| Production subsystems | 45 core modules (cache, budget, routing, retry, dedup, alerting, anomaly detection, predictive routing, prompt compression, tool cost tracking, conversation estimation, hierarchical budgets) |

## Quickstart

```bash
pip install agentfuse-runtime
```

```python
from agentfuse import completion

# One function for ANY provider — budget-enforced, cached, cost-tracked
response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    budget_id="my_agent",
    budget_usd=5.00,
)

# Works with ANY provider — just change the model name:
response = completion(model="claude-sonnet-4-6", messages=[...], budget_id="run_2", budget_usd=3.00)
response = completion(model="gemini-2.5-pro", messages=[...], budget_id="run_3", budget_usd=1.00)
response = completion(model="deepseek/deepseek-chat", messages=[...])

# Enable intelligent routing — sends simple queries to cheap models automatically:
response = completion(model="gpt-4o", messages=[...], auto_route=True)
```

**Session API — all-in-one context manager (recommended):**
```python
from agentfuse import AgentSession

with AgentSession("my_agent", budget_usd=5.00) as session:
    response = session.completion(messages=[{"role": "user", "content": "What is Python?"}])
    session.record_tool_call("web_search", cost=0.01)  # track tool costs too

receipt = session.get_receipt()
# {'total_cost_usd': 0.52, 'llm_cost_usd': 0.51, 'tool_cost_usd': 0.01, 'cache_hit_rate': 0.3, ...}
```

**Check your spend (persists across restarts):**
```python
from agentfuse import get_spend_report
report = get_spend_report()
# {'total_usd': 4.52, 'by_model': {'gpt-4o': 3.1, 'claude-sonnet-4-6': 1.42}, ...}
```

**Legacy monkey-patch API** (still supported):
```python
from agentfuse import wrap_openai
import openai

wrap_openai(budget_usd=5.00, run_id="my_agent")
# All subsequent openai.chat.completions.create() calls are now
# budget-enforced and semantically cached
```

## Architecture

```
                         ┌──────────────────────────────────┐
                         │        AgentFuse Gateway          │
                         │    completion() / acompletion()   │
                         └───────────────┬──────────────────┘
                                         │
                         ┌───────────────▼──────────────────┐
                         │  1. Kill Switch Check             │
                         │  2. Input Validation (fail-fast)  │
                         │  3. Deprecation Warning           │
                         │  4. Rate Limiting (GCRA)          │
                         └───────────────┬──────────────────┘
                                         │
                         ┌───────────────▼──────────────────┐
                         │  5. Request Optimization          │
                         │  6. Intelligent Model Routing     │
                         │     (RouteLLM complexity-based)   │
                         │  7. Context Window Guard          │
                         └───────────────┬──────────────────┘
                                         │
            ┌────────────────────────────▼────────────────────────────┐
            │              8. Budget Check (Graduated)                │
            │                                                         │
            │  ┌────────┐ ┌───────────┐ ┌──────────┐ ┌────────────┐  │
            │  │  60%   │ │   80%     │ │   90%    │ │   100%     │  │
            │  │ Alert  │ │ Downgrade │ │ Compress │ │ Terminate  │  │
            │  │        │ │ GPT-4o → │ │+ degrade │ │ graceful   │  │
            │  │        │ │ 4o-mini  │ │          │ │ shutdown   │  │
            │  └────────┘ └───────────┘ └──────────┘ └────────────┘  │
            └────────────────────────────┬────────────────────────────┘
                                         │
            ┌────────────────────────────▼────────────────────────────┐
            │              9. Cache Lookup (Two-Tier)                  │
            │                                                         │
            │  L1: Redis exact-match (SHA-256)   → HIT? Return cached │
            │  L2: FAISS semantic (768d, 0.90)   → HIT? Return cached │
            └────────────────────────────┬────────────────────────────┘
                                         │ MISS
                         ┌───────────────▼──────────────────┐
                         │ 10. Key Pool Selection            │
                         │ 11. Request Deduplication          │
                         └───────────────┬──────────────────┘
                                         │
            ┌────────────────────────────▼────────────────────────────┐
            │             12. Provider API Call                        │
            │                                                         │
            │  OpenAI · Anthropic · Gemini · DeepSeek · Mistral       │
            │  Groq · Together · xAI · Fireworks · OpenRouter         │
            │  Ollama · vLLM · Azure · Bedrock · SiliconFlow          │
            │                                                         │
            │             13. Automatic Fallback on errors             │
            └────────────────────────────┬────────────────────────────┘
                                         │
                         ┌───────────────▼──────────────────┐
                         │ 14. Cost Recording (normalized)   │
                         │ 15. Cost Alerts (webhook)         │
                         │ 16. Anomaly Detection (EMA)       │
                         │ 17. Prometheus + OTel spans       │
                         │ 18. SpendLedger (JSONL)           │
                         └───────────────┬──────────────────┘
                                         │
                         ┌───────────────▼──────────────────┐
                         │ 19. Response Validation           │
                         │     (XSS, PII, security)          │
                         │ 20. Cache Store                    │
                         │     (L1+L2, CacheAttack defense)  │
                         └───────────────┬──────────────────┘
                                         │
                         ┌───────────────▼──────────────────┐
                         │           Response                │
                         └──────────────────────────────────┘
```

**Graduated Budget Policies:**
- **60% spent** — alert callback triggered
- **80% spent** — automatic model downgrade (e.g., GPT-4o → GPT-4o-mini, Claude Opus → Sonnet)
- **90% spent** — context compression (system prompt + last 6 messages) and model downgrade
- **100% spent** — graceful termination with partial results preserved

## Features

### Budget Enforcement
- Per-run budget limits with graduated degradation policies
- Atomic Redis Lua budget enforcement with microdollar precision (integer arithmetic, no floating-point drift)
- In-memory budget store with TTL and LRU eviction for zero-dependency deployments
- Async-safe budget operations via `asyncio.Lock`
- Anthropic overflow pricing: automatic 2x input / 1.5x output when requests exceed 200K input tokens
- Gemini Pro overflow pricing: automatic 2x when requests exceed 200K input tokens

### Two-Tier Semantic Cache
- **L1 (Redis):** SHA-256 exact-match lookup, sub-millisecond latency. Falls back to local `TTLCache` when Redis is unavailable
- **L2 (FAISS):** `IndexFlatIP` vector similarity search with `redis/langcache-embed-v2` embeddings (768-dim, purpose-built for semantic caching), 2-5ms latency
- Cross-model contamination prevention: model name in L1 hash key, model prefix post-filter on L2 results
- Tool-use queries never enter L2 semantic search (tool arguments are context-dependent)
- Requests with `temperature > 0.5` or side-effect tools (`send_email`, `execute_trade`, etc.) skip cache entirely
- 24-hour TTL with +/-10% jitter to prevent thundering herd
- FAISS index persistence to disk (`save_l2_index` / `load_l2_index`)
- LRU eviction at configurable max entries with full FAISS index rebuild

### Token Counting
- Provider-aware 4-tier fallback chain:
  - **Tier 1:** Exact local tokenizer — GPT-4o/GPT-4.1/o1/o3/o4 use `o200k_base`, GPT-4/GPT-3.5 use `cl100k_base`
  - **Tier 2:** Provider API counting (Anthropic/Gemini — planned)
  - **Tier 3:** `tiktoken cl100k_base` with safety margins — Anthropic 1.20x, Gemini 1.25x, Mistral 1.15x, DeepSeek/Llama 1.10x, Grok 1.15x
  - **Tier 4:** Character-based estimate (`len(text) / 3.5`) for unknown models
- Multimodal content block handling (Anthropic vision format)

### Model Registry
- Hot-reloadable pricing for 34 models at March 2026 rates (GPT-5, GPT-4.1, o3, o4-mini, Claude Sonnet/Haiku/Opus 4.x, Gemini 2.x, DeepSeek V3/R1, Mistral, Grok, Llama)
- Per-model-family cache discounts: GPT-5 (90%), GPT-4.1 (75%), GPT-4o (50%), Gemini (90%)
- 4-tier pricing lookup: user overrides, exact match, fine-tuned model (2x base price), unknown (zero + warning)
- LiteLLM remote refresh for automatic pricing updates
- `cached_input_cost()` for provider cache discounts (Anthropic 90% off, DeepSeek 90% off)
- Configurable refresh interval via `AGENTFUSE_REGISTRY_REFRESH_HOURS` environment variable

### Provider Router
- Automatic provider detection from model name with base URL routing for OpenAI-compatible providers
- 15 providers: OpenAI, Anthropic (native SDK), Gemini, DeepSeek, Mistral, Groq, Together AI, xAI, Fireworks AI, OpenRouter, Ollama (local), vLLM (self-hosted), Azure OpenAI, AWS Bedrock, SiliconFlow
- Fine-tuned model routing: `ft:gpt-4o:org:name` resolves to OpenAI
- `list_providers()` to enumerate all configured providers

### Unified Error Handling
- `classify_error()` across OpenAI, Anthropic, Google GenAI, and httpx exceptions
- Provider-specific handling:
  - OpenAI `insufficient_quota` (429) is **not** retryable (billing issue, not rate limit)
  - Anthropic `OverloadedError` (529) is **always** retryable
  - Context window exceeded (400) is **not** retryable
- `Retry-After` header extraction from provider responses
- `agentfuse_retry()` decorator with tenacity exponential backoff
- Cost-aware retry with automatic model downgrade on each retry attempt

### Security
- **API key protection:** `mask_api_key()` for safe logging, `validate_api_key_format()` per provider
- **Prompt injection detection:** `check_prompt_injection()` flags known injection patterns
- **Invisible character stripping:** Removes zero-width Unicode chars used in steganographic attacks
- **Response safety validation:** Prevents caching XSS, javascript: URIs, and data: URI payloads
- **Per-tenant cache isolation:** CacheAttack defense (arXiv 2601.23088) — no cross-tenant cache access
- **Dual-threshold verification:** Higher similarity required for cache writes (0.95) than reads (0.90)
- **Input validation:** Fail-fast on malformed inputs at gateway boundary
- **0 known vulnerabilities:** Clean `pip-audit` on all dependencies
- **No dangerous patterns:** No `eval()`, `exec()`, `subprocess`, `pickle` in codebase

### Reliability
- Semantic loop detection with FAISS sliding window and cost-aware thresholds
- Streaming cost middleware with real-time per-token cost tracking and abort capability
- Anthropic prompt caching middleware — auto-injects `cache_control` markers on static system messages above model-specific thresholds
- Structured JSON cost receipts with per-step logging (model, tokens, cost, cache tier, latency)
- Automatic model fallback on retryable errors (tries cheaper models from DEFAULT_CHAINS)
- Request deduplication — coalesces identical in-flight requests to avoid duplicate API calls
- GCRA rate limiting — smooth traffic shaping per tenant

### Framework Integrations

| Framework | Integration | Cache Support |
|---|---|---|
| OpenAI | `wrap_openai()` monkey-patch | Full (intercept + return cached) |
| Anthropic | `wrap_anthropic()` monkey-patch | Full (intercept + return cached) |
| LangChain | `AgentFuseChatModel` (BaseChatModel wrapper) | Full (wrapper checks cache before delegating) |
| CrewAI | `create_agentfuse_hooks()` (before/after hooks) | Full (before hook blocks call on cache hit) |
| OpenAI Agents SDK | `AgentFuseModel` / `AgentFuseModelProvider` | Full (async cache check in `get_response`) |
| LangGraph | `AgentFuseNode` + `budget_guard` conditional edge | Full (node-level cost tracking + budget guard) |
| MCP | MCP tool server (budget_check, estimate_cost, spend_report, kill) | N/A (exposes budget tools to MCP clients) |
| Pydantic AI | `wrap_pydantic_agent()` wrapper | Full (cost tracking + budget enforcement) |

### Observability
- **OpenTelemetry:** GenAI semantic convention v1.40 spans with `gen_ai.operation.name`, `gen_ai.provider.name`, `gen_ai.request.model` attributes
- **Structured Logging:** `structlog` JSON output with automatic OTel trace/span ID injection and Datadog `dd.trace_id` / `dd.span_id` correlation
- **Prometheus Metrics:**
  - `gen_ai_client_token_usage` — histogram, tokens per operation
  - `gen_ai_client_operation_duration_seconds` — histogram, latency
  - `agentfuse_cache_hits_total` / `agentfuse_cache_lookups_total` — counters by model and tier
  - `agentfuse_cost_usd_total` — counter by model and provider
  - `agentfuse_cost_per_request_usd` — histogram with buckets [0.0001 ... 10.0]
  - `agentfuse_budget_remaining_usd` — gauge per budget
  - `agentfuse_errors_total` — counter by error type and provider
  - `agentfuse_model_fallbacks_total` — counter by original and fallback model
- All observability calls wrapped in try/except — failures never propagate to user code

### Usage Normalization
- Unified `NormalizedUsage` dataclass across all providers
- Anthropic fix: `input_tokens` excludes cached tokens — AgentFuse adds `cache_read_input_tokens` + `cache_creation_input_tokens` for correct total
- OpenAI: `completion_tokens` already includes `reasoning_tokens` — no double-counting
- Gemini: `thoughts_token_count` billed as output — added to `candidates_token_count`

## Integration Examples

### OpenAI

```python
from agentfuse import wrap_openai
wrap_openai(budget_usd=5.00, run_id="my_agent")
```

### Anthropic

```python
from agentfuse import wrap_anthropic
wrap_anthropic(budget_usd=5.00, run_id="my_agent")
```

### LangChain

```python
from agentfuse.integrations.langchain import AgentFuseChatModel
model = AgentFuseChatModel(inner=ChatOpenAI(), budget=5.00)
response = model.invoke([HumanMessage(content="Hello")])
```

### CrewAI

```python
from agentfuse.integrations.crewai import create_agentfuse_hooks
before_hook, after_hook = create_agentfuse_hooks(budget=5.00)
```

### OpenAI Agents SDK

```python
from agentfuse.integrations.openai_agents import AgentFuseModelProvider
provider = AgentFuseModelProvider(inner=your_provider, budget=5.00)
result = await Runner.run(agent, run_config=RunConfig(model_provider=provider))
```

### LangGraph

```python
from agentfuse.integrations.langgraph import AgentFuseNode, budget_guard

node = AgentFuseNode(my_llm_node, budget_id="graph_run", budget_usd=5.00)
graph.add_conditional_edges("llm_node", budget_guard("graph_run"))
```

### MCP (Model Context Protocol)

```python
from agentfuse.integrations.mcp import MCP_TOOLS, handle_mcp_tool_call

# Register AgentFuse tools with any MCP-compatible agent
# Tools: agentfuse_budget_check, agentfuse_estimate_cost,
#         agentfuse_spend_report, agentfuse_kill
result = handle_mcp_tool_call("agentfuse_budget_check", {"run_id": "my_agent"})
```

### Pydantic AI

```python
from agentfuse.integrations.pydantic_ai import wrap_pydantic_agent

agent = wrap_pydantic_agent(my_agent, budget_usd=5.00)
```

### Programmatic Usage

```python
from agentfuse import BudgetEngine, ModelPricingEngine, TokenCounterAdapter

engine = BudgetEngine("run_123", budget_usd=5.00, model="gpt-4o")
pricing = ModelPricingEngine()
tokenizer = TokenCounterAdapter()

messages = [{"role": "user", "content": "What is the capital of France?"}]
tokens = tokenizer.count_messages(messages, "gpt-4o")
estimated_cost = pricing.input_cost("gpt-4o", tokens)

result_messages, active_model = engine.check_and_act(estimated_cost, messages)
# active_model may be downgraded if budget is running low
```

## Comparison

| Feature | AgentFuse | AgentBudget | LiteLLM | Portkey |
|---|---|---|---|---|
| Per-run budget enforcement | Yes | Yes | No | No |
| Semantic caching (87.5% hit rate) | Yes | No | No | Basic |
| Two-tier cache (Redis + FAISS) | Yes | No | No | No |
| Cross-model contamination prevention | Yes | No | No | No |
| Mid-run model switching (at 80% budget) | Yes | No | No | No |
| Context compression (at 90% budget) | Yes | No | No | No |
| Graceful termination + partial results | Yes | No | No | No |
| Semantic loop detection | Yes | No | No | No |
| Retry storm prevention with model downgrade | Yes | No | Yes | Yes |
| Streaming cost abort | Yes | No | No | No |
| Auto Anthropic prompt caching | Yes | No | No | No |
| Unified error classifier (3+ providers) | Yes | No | Partial | No |
| Anthropic/Gemini overflow pricing | Yes | No | No | No |
| OTel GenAI spans + Prometheus metrics | Yes | No | Yes | Yes |
| Structured cost receipts (JSON) | Yes | No | Basic | Basic |
| Framework integrations (LangChain, CrewAI, OpenAI Agents, LangGraph, MCP, Pydantic AI) | Yes | No | Yes | Yes |
| Hot-reloadable model pricing (34 models) | Yes | No | Yes | No |
| Atomic Redis Lua budget enforcement | Yes | No | No | No |
| Provider-aware token counting (6 providers) | Yes | No | Partial | No |
| FAISS index persistence | Yes | No | No | No |
| Framework integrations (8 frameworks) | Yes | No | Yes | Yes |
| Provider support (15 providers) | Yes | No | Yes | Partial |

## Install

```bash
pip install agentfuse-runtime              # Core (in-memory cache + budget)
pip install agentfuse-runtime[redis]       # + Redis cache and budget store
pip install agentfuse-runtime[otel]        # + OpenTelemetry tracing
pip install agentfuse-runtime[openai]      # + OpenAI SDK
pip install agentfuse-runtime[anthropic]   # + Anthropic SDK
pip install agentfuse-runtime[gemini]      # + Google Gemini SDK
pip install agentfuse-runtime[langchain]   # + LangChain Core
pip install agentfuse-runtime[all]         # Everything
```

**Requirements:** Python 3.11+

## Changelog

### v0.2.2 — PyPI Release (March 2026)

- Published to PyPI as `agentfuse-runtime`
- Added 3 new framework integrations: LangGraph, MCP (Model Context Protocol), Pydantic AI
- Added 3 new providers: Azure OpenAI, AWS Bedrock, SiliconFlow (15 total)
- Updated architecture diagram with full 20-step gateway flow
- 85 public exports, 1099 tests passing, 93% core coverage

### v0.2.1 — Production Fixes (March 2026)

- Fixed: environment variable rate limiting (`AGENTFUSE_RATE_LIMIT_RPS`) now works correctly
- Fixed: `configure(output_guardrails=...)` now properly sets module-level guardrails
- Fixed: output guardrails are checked before caching responses
- Fixed: streaming responses validated before cache storage (both OpenAI and Anthropic)
- Fixed: `acompletion()` now has automatic fallback chain matching sync `completion()`
- Fixed: async provider uses `asyncio.get_running_loop()` (deprecated `get_event_loop` removed)
- Fixed: async streaming injects `stream_options` for OpenAI usage tracking
- Added: SDK client caching across requests (reuses connection pools)
- Added: `AgentSession` async context manager (`async with`)
- Added: real-world end-to-end test script (`examples/e2e_real_test.py`)
- Added: CI coverage threshold enforcement
- Filled: all example files with working code
- 85 public exports, 1099 tests passing, 93% core coverage

### v0.2.0 — Production Rebuild (March 2026)

**New Modules:**
- `TwoTierCacheMiddleware` — Redis L1 exact-match + FAISS L2 semantic search with `redis/langcache-embed-v2`
- `ModelRegistry` — hot-reloadable pricing for 22+ models with LiteLLM remote refresh
- `ProviderRouter` — automatic provider detection and base URL routing for 12 providers
- `RedisBudgetStore` — atomic budget enforcement via Redis Lua scripts with microdollar precision
- `InMemoryBudgetStore` / `AsyncInMemoryBudgetStore` — thread-safe budget stores with TTL and LRU
- `NormalizedUsage` / `extract_usage()` — unified token usage extraction across all providers
- `classify_error()` / `ClassifiedError` — unified error classification with Retry-After extraction
- `agentfuse_retry()` — tenacity-based retry decorator with provider-aware predicate
- OTel GenAI spans, structlog JSON logging, Prometheus metrics (11 metric types)

**Framework Integrations (rebuilt):**
- LangChain: `AgentFuseChatModel` — `BaseChatModel` wrapper (callbacks are observe-only and cannot return cached responses)
- CrewAI: `create_agentfuse_hooks()` — before/after hooks with side-channel cached response injection
- OpenAI Agents SDK: `AgentFuseModel` / `AgentFuseModelProvider` — async Model interface with non-blocking cache store

**Bug Fixes:**
- Cache key design: SHA-256 with model always first component — cross-model contamination eliminated
- Token counting: GPT-4o/GPT-4.1/o-series use `o200k_base` (was incorrectly using `cl100k_base`)
- Safety margins: Anthropic 1.20x (was 1.15x), Gemini 1.25x (was 1.05x) — prevents budget underrun
- Thread safety: instance-level locks + `ContextVar` for per-run isolation (was using class-level locks causing serialization)
- L2 cache eviction: vectors stored alongside metadata for correct FAISS index rebuild (was losing all vectors)
- 90% budget policy: now correctly applies both model downgrade and context compression
- Retry module: uses `classify_error()` instead of string matching on exception names
- L2 metadata filter: correct prefix detection for Mistral, Grok, Llama models
- Double-wrapping prevention in `wrap_openai()` / `wrap_anthropic()`

**Pricing:**
- Anthropic overflow pricing (>200K input tokens: 2x input, 1.5x output)
- Gemini Pro overflow pricing (>200K input tokens: 2x)
- Cached input cost method for provider cache discounts (Anthropic 90%, DeepSeek 90%)

**Quality:**
- 260 unit tests, 0 construction-only tests, 86% core module coverage
- PEP 561 `py.typed` marker for type checking support

### v0.1.0 — Initial Prototype (February 2026)

- BudgetEngine with graduated policies, CacheMiddleware with Intent Atoms FAISS 3-tier cache
- `wrap_openai()`, `wrap_anthropic()` monkey-patches
- LoopDetectionMiddleware, CostAwareRetry, StreamingCostMiddleware, PromptCachingMiddleware, CostReceiptEmitter
- LangChain, CrewAI, OpenAI Agents SDK integrations (callback-based)
- 34 tests passing

## Roadmap

- [x] Python SDK v0.2.0 — production-ready
- [ ] TypeScript SDK
- [ ] Cloud dashboard with real-time cost monitoring
- [ ] Anomaly detection (per-agent-type baseline profiling)
- [ ] Batch pricing support (50% discount with 24h SLA)
- [ ] Multi-agent budget sharing

## Contributing

```bash
git clone https://github.com/vinaybudideti/agentfuse.git
cd agentfuse
pip install -e ".[all]"
pytest tests/unit/ -v
```

## License

MIT

---

Built by [@vinaybudideti](https://github.com/vinaybudideti)
