Response Caching¶
CachedProvider wraps any LLMProvider with a transparent cache — repeated or near-duplicate prompts return instantly at zero cost.
from meshflow import Agent
from meshflow.cache import SQLiteCache
agent = Agent(
name="analyst",
role="researcher",
cache=SQLiteCache("meshflow_cache.db"), # persists across restarts
)
# In-memory cache (per-process only):
agent = Agent(name="a", role="executor", cache=True) # uses InMemoryCache
Cache Backends¶
InMemoryCache¶
Thread-safe LRU cache. Data is lost when the process exits.
from meshflow.cache import InMemoryCache
cache = InMemoryCache(
max_size=1000, # LRU eviction when full
similarity_threshold=0.95, # cosine threshold for semantic fuzzy matching
semantic=True, # enable embedding-based near-duplicate lookup
)
cache.stats()
# {"backend": "in_memory", "size": 42, "hits": 100, "misses": 5, "hit_rate": 0.95}
SQLiteCache¶
Persistent cache backed by SQLite. Survives restarts and is shared across worker processes.
from meshflow.cache import SQLiteCache
cache = SQLiteCache(
path="meshflow_cache.db",
max_size=10_000, # oldest rows evicted beyond this limit
similarity_threshold=0.95,
ttl_s=3600.0, # expire entries after 1 hour (None = no expiry)
semantic=True,
)
cache.stats()
# {"backend": "sqlite", "path": "...", "size": 500, "hit_rate": 0.87}
Use path=":memory:" in tests to get a persistent-connection in-process SQLite store without writing to disk.
LLMCache Abstract Interface¶
Both backends implement LLMCache:
cache.get(key) # exact-key lookup → CacheEntry | None
cache.put(entry) # store a CacheEntry
cache.get_semantic(model, system, messages) # fuzzy lookup → CacheEntry | None
cache.invalidate(key) # remove one entry
cache.clear() # remove all entries
cache.stats() # hit/miss statistics dict
CachedProvider¶
Agent(cache=...) wraps the underlying provider automatically. To wrap a provider manually:
from meshflow.agents.base import AnthropicProvider
from meshflow.cache import InMemoryCache
from meshflow.cache.provider import CachedProvider
provider = CachedProvider(
provider=AnthropicProvider(),
cache=InMemoryCache(),
)
On each complete() call CachedProvider:
- Computes a deterministic SHA-256 key from
(model, system, messages). - Returns the cached response on an exact-key hit (cost reported as
$0.00). - Falls back to a semantic (embedding cosine) lookup if exact miss.
- Calls the underlying provider on cache miss, stores the result.
Streaming (stream_complete) bypasses the cache — chunks cannot be cheaply reassembled.
Semantic Fuzzy Matching¶
When semantic=True, the cache embeds the last user message and finds the stored entry with the highest cosine similarity. If similarity >= similarity_threshold the stored response is returned instead of making an LLM call.
The embedding chain used is the same as VectorStore: sentence-transformers → numpy BoW → char n-gram (no extra dependencies required).
# Lower threshold = more aggressive cache hits (but risk of wrong answers)
cache = SQLiteCache("c.db", similarity_threshold=0.90)
Anthropic Prompt Caching (cache_control)¶
AnthropicProvider automatically applies Anthropic's server-side prompt caching when a TokenBudgetTracker is active. This is different from MeshFlow's response cache — it caches the KV computation of large system prompts and tool definitions on Anthropic's side.
from meshflow.optimization.tracker import TokenBudgetTracker, active_tracker
with TokenBudgetTracker(max_tokens=100_000) as tracker:
active_tracker.set(tracker)
result = await agent.step("Summarise the contract")
# System prompt and tool schemas are sent with cache_control: {"type": "ephemeral"}
# Anthropic caches the KV state; subsequent calls pay 10% of the input token price
Cost impact:
- Cached input tokens cost approximately 10% of the normal input rate
- Typical savings: 70–90% on system-prompt tokens for agents that share a large system prompt across many calls
- Applies automatically to system prompts and the last tool schema in
complete_with_tools() - Requires
anthropic-beta: prompt-caching-2024-07-31header (added automatically)