Metadata-Version: 2.4
Name: memollm
Version: 0.1.0
Summary: 3-level LLM cache (exact, semantic, prefix) with real-time token cost tracking
Author-email: Kumar Abhishek <kr0abhishek@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/abhi-singh-123/memollm
Project-URL: Repository, https://github.com/abhi-singh-123/memollm
Project-URL: Bug Tracker, https://github.com/abhi-singh-123/memollm/issues
Project-URL: Paper, https://arxiv.org/abs/XXXX.XXXXX
Keywords: llm,cache,openai,anthropic,rag,tokens,cost,semantic-cache
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24
Requires-Dist: sentence-transformers>=2.7
Requires-Dist: click>=8.0
Requires-Dist: rich>=13.0
Requires-Dist: pydantic>=2.0
Provides-Extra: redis
Requires-Dist: redis>=5.0; extra == "redis"
Provides-Extra: postgres
Requires-Dist: psycopg2-binary>=2.9; extra == "postgres"
Provides-Extra: openai-embed
Requires-Dist: openai>=1.0; extra == "openai-embed"
Provides-Extra: faiss
Requires-Dist: faiss-cpu>=1.7; extra == "faiss"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: openai>=1.0; extra == "dev"
Requires-Dist: anthropic>=0.25; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Provides-Extra: all
Requires-Dist: redis>=5.0; extra == "all"
Requires-Dist: psycopg2-binary>=2.9; extra == "all"
Requires-Dist: openai>=1.0; extra == "all"
Requires-Dist: faiss-cpu>=1.7; extra == "all"

# MemoLLM

**Cut your LLM API bill by 40–70% with one line of code.**

[![PyPI](https://img.shields.io/pypi/v/memollm)](https://pypi.org/project/memollm/)
[![Downloads](https://img.shields.io/pypi/dm/memollm)](https://pypi.org/project/memollm/)
[![Tests](https://github.com/abhi-singh-123/memollm/actions/workflows/ci.yml/badge.svg)](https://github.com/abhi-singh-123/memollm/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![arXiv](https://img.shields.io/badge/arXiv-coming%20soon-red)](#citation)

MemoLLM wraps your existing OpenAI or Anthropic client with a 3-level cache. It takes one line to add, needs no extra infrastructure, and works with any model you call through those clients.

```python
# Before
from openai import OpenAI
client = OpenAI()

# After — literally one line change
from memollm import wrap
client = wrap(OpenAI())

# Your code stays exactly the same
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document..."}]
)
print(response.choices[0].message.content)  # works identically
```

---

## Why MemoLLM?

Most LLM apps repeat the same or very similar prompts constantly — same questions, same system prompts, same RAG context. You pay for every token every time.

MemoLLM catches three types of repetition:

| Level | What it catches | Saving | API call? |
|-------|----------------|--------|-----------|
| **L1 Exact** | Identical prompts (same characters) | 100% | No |
| **L2 Semantic** | Same meaning, different wording | 100% | No |
| **L3 Prefix** | Same system prompt / context prefix | Provider discount on cached prefix tokens (up to ~90%) | Yes, but cheaper |

**vs GPTCache:** GPTCache only does L1+L2. MemoLLM adds L3 — automatic `cache_control` injection for Anthropic and prefix cache detection for OpenAI — and adds acronym expansion before embedding so `"RAG"` and `"retrieval augmented generation"` hit the same cache entry.

---

## Installation

```bash
pip install memollm
```

No database server, no Docker, no configuration. MemoLLM uses SQLite by default, and your cache lives at `~/.memollm/cache.db`.

```bash
pip install memollm[redis]     # Redis backend (multi-process)
pip install memollm[postgres]  # PostgreSQL backend (enterprise)
pip install memollm[all]       # Everything
```

---

## Usage

### OpenAI / GPT-4

```python
from openai import OpenAI
from memollm import wrap

client = wrap(OpenAI())  # OPENAI_API_KEY from environment

# First call — hits the API
response = client.chat.completions.create(
    model="gpt-4o",          # works with gpt-4o, gpt-4, gpt-4-turbo, o1, o3, etc.
    messages=[{"role": "user", "content": "What is retrieval augmented generation?"}]
)
print(response.choices[0].message.content)

# Second call — same question, instant cache hit, $0 cost
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is retrieval augmented generation?"}]
)

# Third call — different wording, same meaning → L2 semantic hit
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain how RAG works in AI"}]
    # "RAG" is expanded to "retrieval augmented generation" before embedding
)
```

### Anthropic / Claude

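MemoLLM automatically injects Anthropic's `cache_control` on long system prompts. The first call pays Anthropic's cache-write premium; later calls that hit the provider cache are billed at a fraction of the normal input price for the system prompt tokens (about 10% at the time of writing).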

```python
import anthropic
from memollm import wrap

client = wrap(anthropic.Anthropic())  # ANTHROPIC_API_KEY from environment

SYSTEM_PROMPT = """
You are an expert data analyst with deep knowledge of SQL, Python, and statistics.
You help users understand complex datasets and write production-quality analysis code.
Always explain your reasoning step by step before providing code.
[... your long system prompt ...]
"""

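# MemoLLM automatically adds cache_control to the system prompt.
# First call: Anthropic writes the system prompt to its cache
# (billed at a premium over the base input price, 1.25x at the time of writing).
# Later calls within the cache lifetime: cached system prompt tokens are
# billed at a steep discount (about 10% of the base input price), not free.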
response = client.messages.create(
    model="claude-opus-4-7",        # works with all Claude models
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Analyse this CSV and find anomalies"}
    ],
    system=SYSTEM_PROMPT,
)
print(response.content[0].text)

# L2 semantic cache: different wording, same meaning
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Find outliers in my dataset"}
    ],
    system=SYSTEM_PROMPT,
)
```
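
For readers curious what the injection amounts to, here is a minimal sketch, not MemoLLM's actual code. It converts a plain-string `system` prompt into Anthropic's block form and marks the last block with an `ephemeral` `cache_control` entry. The helper name `inject_cache_control` and the character-based token estimate are assumptions made for illustration; `min_prefix_tokens` mirrors the `l3_min_prefix_tokens` setting.

```python
from typing import Any, Dict, List, Union

SystemParam = Union[str, List[Dict[str, Any]]]


def inject_cache_control(system: SystemParam, min_prefix_tokens: int = 500) -> SystemParam:
    """Return `system` in block form with a cache breakpoint on the last block.

    Hypothetical helper: the token count is approximated as len(text) // 4.
    """
    if isinstance(system, str):
        blocks = [{"type": "text", "text": system}]
    else:
        blocks = [dict(b) for b in system]  # shallow copy, leave caller's data alone

    approx_tokens = sum(len(b.get("text", "")) for b in blocks) // 4
    if not blocks or approx_tokens < min_prefix_tokens:
        return system  # too short to be worth caching; pass through unchanged

    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return blocks


long_prompt = "You are an expert data analyst. " * 100  # ~800 estimated tokens
print(inject_cache_control(long_prompt)[-1]["cache_control"])
print(inject_cache_control("Be brief."))  # short prompt is returned unchanged
```

Anthropic also enforces a per-model minimum cacheable prompt length, so a marker on a prompt below that minimum is accepted but has no effect.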

### Local Ollama (no API key needed)

```python
from openai import OpenAI
from memollm import wrap

# Ollama uses OpenAI-compatible API
client = wrap(OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works
))

response = client.chat.completions.create(
    model="llama3.1",    # or llama3.2, mistral, codellama, qwen2.5-coder, etc.
    messages=[{"role": "user", "content": "What is RAG?"}]
)
print(response.choices[0].message.content)
```

### Configuration

```python
from memollm import wrap, CacheConfig

config = CacheConfig(
    # L2 semantic cache settings
    l2_threshold=0.65,            # 0.0–1.0: lower = more hits, less accuracy
                                  # 0.65 recommended for RAG/QA workloads
                                  # 0.90+ for strict accuracy requirements

    # Storage backend
    backend="sqlite",             # "memory" | "sqlite" | "redis" | "postgres"
    backend_url=None,             # required for redis/postgres
    db_path="~/.memollm/cache.db",

    # Cache expiry
    default_ttl_seconds=None,     # None = cache forever, 3600 = expire after 1 hour

    # Embedding model for L2
    l2_embedder="local",          # "local": free, no API key, ~15ms latency
                                  # "openai": better quality, costs $0.02/1M tokens

    # L3 prefix caching
    l3_enabled=True,              # auto-inject cache_control for Anthropic
    l3_min_prefix_tokens=500,     # minimum system prompt length to activate L3
)

client = wrap(OpenAI(), config=config)
```
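
To make the `default_ttl_seconds` semantics concrete, the sketch below shows one way expiry could behave: `None` keeps entries forever, and a number expires them that many seconds after they were stored. The `TTLCache` class and its injectable `clock` are hypothetical and only illustrate the idea; they are not the library's backend code.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache where ttl_seconds=None means entries never expire."""

    def __init__(self, ttl_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self.clock(), value)

    def get(self, key: str) -> Optional[str]:
        item = self._store.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.ttl_seconds is not None and self.clock() - stored_at >= self.ttl_seconds:
            del self._store[key]  # lazily evict on read
            return None
        return value


now = [0.0]  # fake clock so the example runs instantly
ttl_cache = TTLCache(ttl_seconds=3600, clock=lambda: now[0])
ttl_cache.set("k", "v")
now[0] = 3599.0
print(ttl_cache.get("k"))  # still within the hour
now[0] = 3600.0
print(ttl_cache.get("k"))  # expired
```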

---

## How it works

```
Your LLM call
      │
      ▼
┌──────────────────────────────────────────┐
│  L1: Exact Cache                         │
│  Normalize prompt → SHA-256 hash         │  → instant hit (<1ms), no API call
│  "What is RAG?" == "What is  RAG ?"      │    whitespace/order differences ignored
└──────────────────────────────────────────┘
      │ miss
      ▼
┌──────────────────────────────────────────┐
│  L2: Semantic Cache                      │
│  Expand acronyms → embed → cosine sim    │  → ~15ms hit, no API call
│  "RAG" → "retrieval augmented generation"│    <0.3% quality loss at threshold 0.65
│  cosine similarity > threshold → hit     │
└──────────────────────────────────────────┘
      │ miss
      ▼
┌──────────────────────────────────────────┐
│  L3: Provider Prefix Cache               │
│  Anthropic: inject cache_control blocks  │  → API call made, but system prompt
│  OpenAI: detect prefix cache eligibility │    tokens are discounted on cache read
└──────────────────────────────────────────┘
      │
      ▼
  LLM API call → store result → log savings
```
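
The L2 step in the diagram can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the one-entry acronym table, the `toy_embed` hashed bag-of-words embedder (a stand-in for a real sentence-transformers model), and the `SemanticCache` class are all hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

import numpy as np

# Hypothetical acronym table; a real one would be much larger.
ACRONYMS: Dict[str, str] = {"rag": "retrieval augmented generation"}


def expand_acronyms(text: str) -> str:
    """Lowercase the text and replace known acronyms with their long form."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ACRONYMS.get(w, w) for w in words)


def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedder: hashed bag of words, scaled to unit length."""
    vec = np.zeros(dim)
    for word in text.split():
        vec[sum(word.encode()) % dim] += 1.0  # deterministic bucket per word
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


class SemanticCache:
    def __init__(self, threshold: float = 0.65) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[np.ndarray, str]] = []

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((toy_embed(expand_acronyms(prompt)), response))

    def get(self, prompt: str) -> Optional[str]:
        if not self.entries:
            return None
        query = toy_embed(expand_acronyms(prompt))
        # Vectors are unit length, so the dot product is the cosine similarity.
        sims = [float(query @ vec) for vec, _ in self.entries]
        best = int(np.argmax(sims))
        return self.entries[best][1] if sims[best] > self.threshold else None


semantic = SemanticCache(threshold=0.65)
semantic.put("What is retrieval augmented generation?", "RAG combines retrieval with generation.")
print(semantic.get("What is RAG?"))              # acronym expansion makes the texts identical
print(semantic.get("How do I bake sourdough?"))  # unrelated prompt, below the threshold
```

A linear scan over all entries is fine for a sketch; at larger cache sizes a vector index (the `faiss` extra, for example) would be the natural replacement.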

---

## Cost report

```bash
$ llmcache report --last 7d

╔══════════════════════════════════════════════════╗
║           MemoLLM Report (last 7 days)           ║
╠══════════════════════════════════════════════════╣
║  Total requests         1,247                    ║
║  ├─ L1 exact hits        312  (25.0%)  <1ms      ║
║  ├─ L2 semantic hits     489  (39.2%)  ~15ms     ║
║  ├─ L3 prefix hints      201  (16.1%)  cheaper $ ║
║  └─ Full misses          245  (19.6%)             ║
╠══════════════════════════════════════════════════╣
║  Tokens saved          842,301                   ║
║  Cost saved              $16.84                  ║
╚══════════════════════════════════════════════════╝
```

Check savings in code:

```python
summary = client.tracker.summary()
print(f"Hit rate:     {summary.hit_rate * 100:.1f}%")
print(f"Tokens saved: {summary.tokens_saved:,}")
print(f"Cost saved:   ${summary.cost_saved_usd:.4f}")
```
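
As a rough illustration of how a "cost saved" figure can be derived from token counts, the sketch below multiplies the tokens of an avoided call by a per-million-token price. The `PRICES_PER_MILLION` table, its model name, and its numbers are placeholders: they are not real provider pricing and not the contents of `providers.json`.

```python
from dataclasses import dataclass
from typing import Dict

# Placeholder prices in USD per 1M tokens; NOT real provider pricing.
PRICES_PER_MILLION: Dict[str, Dict[str, float]] = {
    "example-model": {"input": 2.00, "output": 8.00},
}


@dataclass
class Savings:
    tokens_saved: int
    cost_saved_usd: float


def cost_of_avoided_call(model: str, input_tokens: int, output_tokens: int) -> Savings:
    """Cost that would have been billed had the cached call gone to the API."""
    price = PRICES_PER_MILLION[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return Savings(tokens_saved=input_tokens + output_tokens, cost_saved_usd=cost)


saved = cost_of_avoided_call("example-model", input_tokens=1_500, output_tokens=500)
print(f"Tokens saved: {saved.tokens_saved:,}")
print(f"Cost saved:   ${saved.cost_saved_usd:.4f}")
```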

---

## CLI reference

```bash
llmcache report              # full report (all time)
llmcache report --last 24h  # last 24 hours
llmcache report --last 7d   # last 7 days
llmcache report --format json > report.json

llmcache stats               # one-line summary
llmcache patterns            # top wasteful patterns + fix suggestions
llmcache clear               # wipe cache
llmcache config              # show current settings
```

---

## Backends

| Backend | Best for | Setup required |
|---------|----------|----------------|
| `memory` | Testing, Jupyter notebooks | None — lost on restart |
| `sqlite` | Single-process apps **(default)** | None — file at `~/.memollm/cache.db` |
| `redis` | Multiple workers, shared cache | `pip install memollm[redis]` |
| `postgres` | Production, analytics, auditing | `pip install memollm[postgres]` |

---

## FAQ

**Does it work with streaming (`stream=True`)?**
Streaming calls are supported but not cached: they pass straight through to the provider. No crash, no configuration needed.

**Does it change the response object?**
No. Cache hits return a proper `ChatCompletion` (OpenAI) or `Message` (Anthropic) object — identical to what the API returns. Your existing code needs zero changes.

**Does caching affect response quality?**
L1 is lossless — same prompt, same response. L2 returns a cached response to a semantically equivalent question. At the default threshold of 0.65, BERTScore degradation is <0.3%. Raise the threshold toward 0.90+ if you need stricter accuracy.

**Is it thread-safe?**
SQLite uses WAL mode and handles concurrent reads safely. For high-concurrency multi-process deployments, use Redis or Postgres.
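
For reference, this is roughly what enabling WAL looks like with Python's standard `sqlite3` module. The `open_cache_db` helper and the two-column table are assumptions made for the example, not MemoLLM's actual schema.

```python
import os
import sqlite3
import tempfile


def open_cache_db(path: str) -> sqlite3.Connection:
    """Open (or create) a SQLite cache file with write-ahead logging enabled."""
    conn = sqlite3.connect(path, timeout=5.0)  # wait up to 5 s on a locked database
    conn.execute("PRAGMA journal_mode=WAL")    # readers and the writer stop blocking each other
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    conn.commit()
    return conn


db_path = os.path.join(tempfile.mkdtemp(), "cache.db")
writer = open_cache_db(db_path)
reader = open_cache_db(db_path)  # second connection, as another thread would hold

writer.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", ("abc123", "cached answer"))
writer.commit()

journal_mode = reader.execute("PRAGMA journal_mode").fetchone()[0]
row = reader.execute("SELECT response FROM cache WHERE key = ?", ("abc123",)).fetchone()
print(journal_mode, row[0])
```

WAL allows many concurrent readers but still only one writer at a time, which is why a shared backend is the better fit for write-heavy multi-process deployments.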

**Does it work with LangChain or LlamaIndex?**
Yes — wrap the underlying client before passing it in:
```python
from langchain_openai import ChatOpenAI
from memollm import wrap
from openai import OpenAI

cached_openai = wrap(OpenAI())
llm = ChatOpenAI(client=cached_openai.chat.completions)
```

**How is the cache keyed?**
L1 uses SHA-256 of the normalized prompt (whitespace stripped, keys sorted). L2 uses a 768-dim embedding vector stored alongside the response. Each entry is keyed by its SHA-256 hash in the backend.
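
A minimal sketch of what such an L1 key could look like, assuming the key covers the model name and the message list. The function names and the exact normalization rules here are illustrative; the library's normalizer may differ (for example, in how it treats spacing around punctuation).

```python
import hashlib
import json
import re
from typing import Any, Dict, List


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def l1_key(model: str, messages: List[Dict[str, Any]]) -> str:
    """SHA-256 over a canonical JSON form of the request."""
    canonical = {
        "model": model,
        "messages": [
            {k: normalize_text(v) if isinstance(v, str) else v for k, v in m.items()}
            for m in messages
        ],
    }
    # sort_keys makes the hash independent of dict key order.
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = l1_key("gpt-4o", [{"role": "user", "content": "What is RAG?"}])
b = l1_key("gpt-4o", [{"content": "  What  is RAG? ", "role": "user"}])
print(a == b)  # same key despite extra whitespace and a different key order
```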

**What happens if the cache backend is down?**
Cache failures are silent — the request falls through to the real API. Your app never crashes due to a cache error.
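
The fail-open behaviour described above can be sketched like this. `BrokenBackend` and `cached_call` are hypothetical names used only to show the pattern: cache errors are logged and treated as a miss, while errors from the real API call still propagate.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("cache_sketch")


class BrokenBackend:
    """Stand-in for a backend whose connection has dropped."""

    def get(self, key: str) -> Optional[str]:
        raise ConnectionError("backend unreachable")

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("backend unreachable")


def cached_call(backend, key: str, call_api: Callable[[], str]) -> str:
    """Try the cache first; on any cache error, behave as if it were a miss."""
    try:
        hit = backend.get(key)
        if hit is not None:
            return hit
    except Exception as exc:  # deliberately broad: the cache must never break the app
        logger.warning("cache read failed, falling through: %s", exc)

    result = call_api()  # errors from the real API still propagate

    try:
        backend.set(key, result)
    except Exception as exc:
        logger.warning("cache write failed, result not stored: %s", exc)
    return result


print(cached_call(BrokenBackend(), "some-key", lambda: "fresh response from the API"))
```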

---

## Benchmarks

*Full benchmark results are being prepared for the accompanying paper. Preliminary results on HotpotQA and ShareGPT datasets:*

| Method | Token reduction | Cost reduction | Quality (BERTScore) |
|--------|----------------|----------------|---------------------|
| No cache (baseline) | 0% | 0% | 1.000 |
| LangChain CacheBackedEmbeddings | 18% | 18% | 1.000 |
| GPTCache | 31% | 31% | 0.994 |
| **MemoLLM (L1 only)** | 25% | 25% | 1.000 |
| **MemoLLM (L1+L2)** | **58%** | **58%** | 0.997 |

---

## Architecture

```
memollm/
├── __init__.py          # public API: wrap(), CacheConfig
├── interceptor.py       # transparent proxy — wraps OpenAI/Anthropic clients
├── cache.py             # L1 → L2 cache hierarchy
├── normalizer.py        # prompt normalization, SHA-256 hashing, acronym expansion
├── embedder.py          # sentence-transformers (local) or OpenAI embeddings
├── differ.py            # L3 prefix detection, Anthropic cache_control injection
├── config.py            # CacheConfig dataclass with all settings
├── backends/
│   ├── memory.py        # in-memory dict (testing/notebooks)
│   ├── sqlite.py        # SQLite WAL (default)
│   ├── redis.py         # Redis (multi-process)
│   └── postgres.py      # PostgreSQL (enterprise)
├── cost/
│   ├── tracker.py       # per-call token and cost accounting
│   ├── reporter.py      # rich terminal report
│   └── providers.json   # pricing table: OpenAI, Anthropic, Ollama models
└── cli/
    └── main.py          # llmcache CLI
```

---

## Citation

If you use MemoLLM in your research, please cite:

```bibtex
@article{kumar2026memollm,
  title={MemoLLM: A Three-Level Hierarchical Cache for Cost-Efficient LLM Applications},
  author={Kumar, Abhishek},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

*Paper in preparation. arXiv link coming soon.*

---

## Contributing

PRs welcome. Please open an issue first for significant changes.

## License

MIT
