Metadata-Version: 2.4
Name: recallm
Version: 0.2.0
Summary: Semantic caching for LLMs. Ask once, recall forever.
Author-email: Munim Ahmad <munimahmad2@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://recallm.dev
Project-URL: Repository, https://github.com/munimx/llm-semantic-cache
Project-URL: Documentation, https://recallm.dev/docs
Project-URL: Issues, https://github.com/munimx/llm-semantic-cache/issues
Project-URL: Changelog, https://github.com/munimx/llm-semantic-cache/blob/main/CHANGELOG.md
Keywords: llm,cache,semantic,embeddings,nlp,ai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0
Requires-Dist: numpy>=1.26
Requires-Dist: fastembed>=0.3
Requires-Dist: structlog>=24.0
Requires-Dist: prometheus-client>=0.20
Provides-Extra: torch
Requires-Dist: sentence-transformers>=3.0; extra == "torch"
Provides-Extra: redis
Requires-Dist: redis[asyncio]>=5.0; extra == "redis"
Requires-Dist: fakeredis[aioredis]>=2.23; extra == "redis"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: fakeredis[aioredis]>=2.23; extra == "dev"
Requires-Dist: mypy>=1.9; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: coverage>=7.0; extra == "dev"
Dynamic: license-file

# Recallm
Semantic caching for LLMs. Ask once, recall forever.

![PyPI](https://img.shields.io/pypi/v/recallm) ![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue) ![MIT License](https://img.shields.io/badge/license-MIT-green) ![CI](https://img.shields.io/github/actions/workflow/status/munimx/llm-semantic-cache/ci.yml?label=CI)

Exact-match caching is useless for LLMs — two users asking the same question in slightly different words both pay the full API cost. Recallm uses sentence embeddings to find near-matches and return cached responses instantly. The result: lower API costs, reduced latency, and no changes to your existing LLM client code.
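
The core idea is simple: embed each prompt, compare it against the embeddings of stored prompts, and serve the cached response when cosine similarity clears a threshold. Here is a minimal numpy sketch of that loop, for illustration only, not recallm's actual internals:

```python
import numpy as np

# Toy illustration of the core idea, not recallm's implementation: keep
# (embedding, response) pairs and reuse a response when cosine similarity
# to a stored prompt clears the threshold.
store: list[tuple[np.ndarray, str]] = []

def lookup(query_vec: np.ndarray, threshold: float = 0.85) -> str | None:
    for vec, response in store:
        sim = float(vec @ query_vec / (np.linalg.norm(vec) * np.linalg.norm(query_vec)))
        if sim >= threshold:
            return response  # near-match: serve the cached answer
    return None  # miss: make the real LLM call, then store its embedding and response
```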

## Install

```bash
pip install recallm
pip install "recallm[redis]"   # persistent cache, shared across workers
pip install "recallm[torch]"   # sentence-transformers embedder (700MB, PyTorch)
```

Once installed, import directly from the `recallm` package:

```python
from recallm import SemanticCache, CacheConfig, InMemoryStorage
```
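
With the `[redis]` extra, the in-memory backend can be swapped for Redis so the cache persists and is shared across workers. The sketch below is hedged: `RedisStorage` is named later in this README, but its import path and constructor signature are assumptions, so check the docs for the real API.

```python
from recallm import CacheConfig, SemanticCache
from recallm import RedisStorage  # import path is an assumption

# Hypothetical constructor signature; only the RedisStorage name appears
# in this README, not its arguments.
storage = RedisStorage(url="redis://localhost:6379/0")
cache = SemanticCache(storage=storage, config=CacheConfig(threshold="balanced"))
```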

## Quickstart

```python
from recallm import CacheConfig, InMemoryStorage, SemanticCache

storage = InMemoryStorage()
cache = SemanticCache(storage=storage, config=CacheConfig(threshold="balanced"))

def fake_llm(**kwargs):
    # Stand-in for a real chat-completion client; returns an OpenAI-style payload.
    return {"id": "resp-1", "choices": [{"message": {"content": "Paris"}}]}

cached = cache.wrap(fake_llm, mode="sync")

request = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "cache_context": {"user_id": "u-1", "document_id": "geo-v1"},
}

first = cached(**request)   # miss: calls fake_llm and stores response
second = cached(**request)  # hit: returns cached response
print(first["choices"][0]["message"]["content"], second["choices"][0]["message"]["content"])
```
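
The payoff of semantic matching is that a paraphrase should hit too. A hedged continuation of the snippet above; whether a given rewording clears the `"balanced"` threshold depends on the embedder:

```python
# Same question in different words; expected (not guaranteed) to be a cache hit.
paraphrase = {
    **request,
    "messages": [{"role": "user", "content": "What's the capital city of France?"}],
}
third = cached(**paraphrase)
print(third["choices"][0]["message"]["content"])  # "Paris", served from cache on a near-match
```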

## Debugging

Inspect cache behaviour during development with `cache.stats()`:

```python
stats = cache.stats()
print(stats.hit_rate)         # fraction of requests served from cache
print(stats.hits, stats.misses)
print(stats.avg_similarity)   # mean cosine similarity of cache hits
print(stats.namespace_sizes)  # entry counts per namespace
```

`stats()` returns a `CacheStats` dataclass and is intended for development and debugging. Use the Prometheus metrics for production observability.
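
Since recallm depends on `prometheus-client`, its metrics can be scraped alongside your own in production. The snippet below uses only the standard `prometheus_client` API and assumes recallm registers its collectors in the default registry; metric names are not documented in this README.

```python
from prometheus_client import start_http_server

# Expose the default registry over HTTP so Prometheus can scrape it.
# Assumes recallm's collectors live in the default registry.
start_http_server(8000)
```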

> **Deployment note:** `SemanticCache(...)` loads the embedding model synchronously.
> In async frameworks (FastAPI, etc.), use `await cache.async_warmup()` during startup
> instead of relying on the constructor — see [getting started](docs/getting-started.md).
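
For FastAPI, that typically means a lifespan hook. This is a hedged sketch of the pattern the note describes; exactly how construction avoids the synchronous load before `async_warmup()` runs is not spelled out in this README.

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI

from recallm import CacheConfig, InMemoryStorage, SemanticCache

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Build the cache and load the embedding model during startup, before
    # the app serves traffic, using the async path the note recommends.
    app.state.cache = SemanticCache(
        storage=InMemoryStorage(),
        config=CacheConfig(threshold="balanced"),
    )
    await app.state.cache.async_warmup()
    yield

app = FastAPI(lifespan=lifespan)
```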

Expected hit rates vary widely by workload, which is worth keeping in mind when deciding whether caching will pay off:

| Use case | Expected hit rate | Why |
|---|---|---|
| FAQ / support bot | 40–70% | High repetition, forgiving similarity |
| Document summarization | 20–50% | Same docs re-processed, template prompts |
| General chat assistant | 5–15% | High diversity, dynamic context |
| Code generation | 3–10% | Exact problem statements vary, strict threshold |
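
The similarity threshold can be tuned to the workload. Note that the preset below is hypothetical: this README only documents `"balanced"`, so check `CacheConfig` for the values your version actually accepts.

```python
from recallm import CacheConfig, InMemoryStorage, SemanticCache

# "strict" is a hypothetical preset, shown by analogy with the documented
# "balanced" value, for workloads like code generation that need tighter matching.
codegen_cache = SemanticCache(
    storage=InMemoryStorage(),
    config=CacheConfig(threshold="strict"),
)
```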

## Known limitations

- `stream=True` bypasses the cache entirely
- The Redis backend is not suitable for namespaces with more than 5,000 entries without partitioning
- Sync callers using `RedisStorage` have no timeout protection (since v0.1.0)

[Full docs](https://recallm.dev) · [Contributing](CONTRIBUTING.md) · [MIT License](LICENSE)
