Metadata-Version: 2.4
Name: cachesaver
Version: 0.0.6
Summary: A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference
Author-email: Lars Klein <lars.klein@epfl.ch>
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: deepdiff
Requires-Dist: diskcache
Requires-Dist: nest_asyncio
Requires-Dist: openai
Requires-Dist: together
Requires-Dist: anthropic
Requires-Dist: google-genai
Requires-Dist: huggingface_hub
Requires-Dist: groq
Provides-Extra: transformers
Requires-Dist: transformers; extra == "transformers"
Requires-Dist: torch; extra == "transformers"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "test"
Requires-Dist: pytest-cov>=4.1.0; extra == "test"
Requires-Dist: deepdiff; extra == "test"
Requires-Dist: diskcache; extra == "test"

# Cache Saver

**A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference**

*Accepted at EMNLP 2025 (Findings)*

[![Paper](https://img.shields.io/badge/Arxiv-Paper-B31B1B?logo=arxiv)](https://aclanthology.org/2025.findings-emnlp.1402.pdf)
[![Project Page](https://img.shields.io/badge/Project-Page-blue?logo=github)](https://cachesaver.github.io/)
[![PyPI - Version](https://img.shields.io/pypi/v/cachesaver?logo=pypi&logoColor=white)](https://pypi.org/project/cachesaver/)

Cache Saver is a modular, plug-and-play, and asynchronous framework that facilitates high-level inference optimizations, integrating cleanly into existing systems without requiring changes to the end-user application logic or the underlying LLM. At its heart is a *namespace-aware list-valued cache* that ensures **statistical integrity** of LLM responses by generating *i.i.d.* responses within a namespace while enabling response **reuse** across namespaces, all while guaranteeing full **reproducibility**.

On average across five reasoning strategies, five benchmark tasks, and three LLMs, Cache Saver **reduces USD cost by ~25% and CO2 emissions by ~35%**. In practical scenarios such as benchmarking and ablation analysis, savings reach **up to 60%**.

```python
# Just change the import — everything else stays the same
from cachesaver.models.openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
```

## Installation

```bash
pip install cachesaver
```

For local HuggingFace Transformers inference:
```bash
pip install "cachesaver[transformers]"
```

## Quick Start

Replace your LLM client import with Cache Saver's — the rest of your code is unchanged:

```python
# Before
from openai import AsyncOpenAI

# After
from cachesaver.models.openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)

# Run again → A new sample is generated
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)

# Re-initialize the client and run again → The responses are retrieved from the cache
client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)
```
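
The snippets above use top-level `await`, which works in notebooks and async REPLs. In a plain script, wrap the calls in an async entry point, for example:

```python
import asyncio

from cachesaver.models.openai import AsyncOpenAI


async def main():
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": "What's the meaning of life?"}],
    )
    # The response object follows the OpenAI SDK, so the usual accessors apply.
    print(response.choices[0].message.content)


asyncio.run(main())
```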

A synchronous client is also available:

```python
from cachesaver.models.openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)
```

## Supported Providers

| Provider | Import |
|---|---|
| **OpenAI** | `from cachesaver.models.openai import AsyncOpenAI, OpenAI` |
| **Anthropic** | `from cachesaver.models.anthropic import AsyncAnthropic, Anthropic` |
| **Google Gemini** | `from cachesaver.models.gemini import AsyncGemini, Gemini` |
| **Together AI** | `from cachesaver.models.together import AsyncTogether` |
| **Groq** | `from cachesaver.models.groq import AsyncGroq, Groq` |
| **OpenRouter** | `from cachesaver.models.openrouter import AsyncOpenRouter, OpenRouter` |
| **HuggingFace (Inference Providers)** | `from cachesaver.models.huggingface import AsyncHuggingFace, HuggingFace` |
| **vLLM** | `from cachesaver.models.vllm import AsyncVLLM, VLLM` |
| **HuggingFace Transformers** | `from cachesaver.models.transformers import AsyncHFTransformers, HFTransformers` |


All cloud-provider clients expose the same interface as their original SDKs; just change the import.
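
For example, the Anthropic client mirrors the `anthropic` SDK's `messages.create` call (the model name below is only illustrative):

```python
from cachesaver.models.anthropic import AsyncAnthropic

client = AsyncAnthropic()

# Same call signature as the official anthropic SDK; only the import changed.
response = await client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=256,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
```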

## Key Features

### Statistical Integrity via Namespaced Caching

Unlike naive key-value caches, Cache Saver uses a **list-valued cache** managed through **namespaces**. Within a namespace, all responses to a given prompt are guaranteed to be *i.i.d.* — a response is never reused within the same namespace. Across namespaces, responses *are* reused via **stochastic coupling**, which is what drives the cost savings. This is critical for scenarios like stochastic sampling, uncertainty estimation, and policy diversity, where multiple independent responses to the same prompt are required.
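
A hypothetical sketch of how namespaces separate experiments, assuming the cloud clients accept the same `namespace` and `cachedir` constructor arguments as the local Transformers client shown later in this README:

```python
from cachesaver.models.openai import AsyncOpenAI

# Hypothetical constructor arguments, mirroring the local Transformers client below.
run_a = AsyncOpenAI(namespace="run_a", cachedir="./cache")
run_b = AsyncOpenAI(namespace="run_b", cachedir="./cache")

# Within "run_a", repeated requests for the same prompt return fresh i.i.d. samples.
# Across "run_a" and "run_b", those samples are reused instead of re-queried.
```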

### Reproducibility

Namespaces track which cached responses have been consumed, so re-running an experiment from scratch replays the exact same results in the exact same order — even for duplicate prompts.

```python
# Run 1 — calls the API
results_run1 = await classify(sentences, namespace="experiment_v1")

# Run 2 — new namespace, identical results from cache
results_run2 = await classify(sentences, namespace="experiment_v2")
assert results_run1 == results_run2  # Always true
```

### Error Recovery

Crash on item 7 of 10? Re-run and items 1–6 are served from cache instantly. Only items 7–10 hit the API.

```python
# Attempt 1 — crashes at item 7
try:
    results = await process(items, namespace="my_exp")
except RuntimeError:
    pass  # Items 1-6 are cached

# Attempt 2 — items 1-6 from cache, only 7-10 call API
results = await process(items, namespace="my_exp")
```

### Async Parallelism

Fully async-native. Use `asyncio.gather` for concurrent requests:

```python
import asyncio

results = await asyncio.gather(*[
    client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": prompt}],
    )
    for prompt in prompts
])
```

### Deterministic Async Ordering

When multiple async agents process the same prompt concurrently, Cache Saver keys cached responses by a stable request identifier rather than by request or completion order. A built-in reordering module ensures replays are deterministic regardless of which task finishes first.
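
In practice this means the order of results from `asyncio.gather` is stable across replays, regardless of how the underlying tasks were scheduled. A minimal illustration, reusing the client from the Quick Start:

```python
import asyncio

# Four concurrent requests for the same prompt.
prompts = ["Summarize Stoicism in one sentence."] * 4

results = await asyncio.gather(*[
    client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": p}],
    )
    for p in prompts
])

# On a cached replay, results[i] is the same response it was on the first run,
# even if the tasks originally completed in a different order.
```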

## Why It Works: Reuse Potential in LLM Reasoning

Multi-step reasoning strategies (Tree-of-Thought, ReAct, RAP, FoA, ReST-MCTS\*, etc.) are highly repetitive — **~50% of prompts are duplicates** both within a single method execution and across methods on the same task. Cache Saver exploits this redundancy across three practical scenarios:

<p align="center">
  <img src="docs/fig5.png" alt="Practical application results across cost, tokens, latency, and throughput" width="100%">
  <br>
  <em>Three practical scenarios using GPT-4.1-Nano across the benchmarks of Game of 24, HumanEval, and SciBench.</em>
</p>

The figure shows Cache Saver's impact across three practical ML scenarios. **A1-Hyperparameter tuning:** grid search over Tree-of-Thought configurations (tree width, depth, number of evaluations). **A2-Ablation analysis:** testing three variations of the FoA algorithm (removing the selection phase, backtracking, or resampling). **A3-Benchmarking:** comparing entirely different reasoning strategies (ToT, GoT, FoA).

The **blue bars** show the cost *without* Cache Saver. The **orange bars** show the *average* cost with Cache Saver. Because experiments share prompts, cached responses are reused and the average cost drops significantly. The **green bars** show the *marginal* cost, that is, the added cost of incorporating one more configuration, variation, or method into the experiment.

The reuse potential depends on how similar the experiments are: hyperparameter tuning (A1) achieves the highest savings (**6x** lower cost, token usage, and latency), since different configurations of the same method share most prompts. Ablation analysis (A2) achieves **2.5x** savings. Finally, benchmarking across different methods (A3) still achieves **2x** savings, a notable finding since even structurally different reasoning strategies share significant prompt overlap. These savings are **on top of** existing platform-level optimizations (paged attention, KV caching, prefix sharing, etc.).

## Architecture

Cache Saver composes four async pipeline components around your model:

| Component | Role |
|---|---|
| **Cacher** | Namespace-aware list-valued cache with per-key async mutexes. Tracks per-namespace usage counts for i.i.d. sampling. |
| **Deduplicator** | Merges duplicate prompts within a batch by (hash, namespace), combines `n` values, redistributes responses. |
| **Reorderer** | Sorts by stable identifier before processing, restores original order after. Ensures deterministic results. |
| **Batcher** | Async producer-consumer queue. Groups requests by `batch_size` with timeout. |
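
The Cacher's central idea, a namespace-aware list-valued cache with per-namespace usage counters, can be illustrated with a toy sketch (this is not Cache Saver's actual implementation):

```python
import asyncio
from collections import defaultdict


class ListValuedCache:
    """Toy namespace-aware, list-valued cache (illustrative only)."""

    def __init__(self):
        self._responses = defaultdict(list)       # prompt_key -> list of cached responses
        self._consumed = defaultdict(int)         # (prompt_key, namespace) -> responses used
        self._locks = defaultdict(asyncio.Lock)   # per-key async mutex, as in the Cacher

    async def get(self, prompt_key, namespace, generate):
        async with self._locks[prompt_key]:
            used = self._consumed[(prompt_key, namespace)]
            responses = self._responses[prompt_key]
            if used < len(responses):
                # Reuse a response that another namespace already paid for.
                response = responses[used]
            else:
                # This namespace has consumed every cached response: sample fresh,
                # so responses within a namespace stay i.i.d.
                response = await generate(prompt_key)
                responses.append(response)
            self._consumed[(prompt_key, namespace)] += 1
            return response
```

Within one namespace the usage counter only moves forward, so the same cached response is never returned twice; a second namespace starts its counter at zero and reuses what the first one already generated.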

## Local Model Inference

For HuggingFace Transformers models running on local GPUs:

```python
from cachesaver.models.transformers import AsyncHFTransformers

client = AsyncHFTransformers(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    namespace="local_exp",
    cachedir="./cache",
    batch_size=8,
)

response = await client.chat.completions.create(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_new_tokens=20,
)
```

## Examples

See the [`examples/`](examples/) directory:

- **[`tutorial.ipynb`](examples/tutorial.ipynb)** — Full walkthrough: quickstart, reproducibility, error recovery, parallelism, ReAct agents, Tree-of-Thought, and RAG pipelines.
- **[`providers_example.ipynb`](examples/providers_example.ipynb)** — Usage examples for all supported providers.

## Requirements

- Python >= 3.10

## Citation

```bibtex
@inproceedings{potamitis2025cache,
  title     = {Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible {LLM} Inference},
  author    = {Nearchos Potamitis and Lars Henning Klein and Bardia Mohammadi and Chongyang Xu and Attreyee Mukherjee and Niket Tandon and Laurent Bindschaedler and Akhil Arora},
  booktitle = {The 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  url       = {https://openreview.net/forum?id=2Nxih3ySSi}
}
```
