Metadata-Version: 2.4
Name: cogcache
Version: 0.2.0
Summary: Semantic LLM answer cache — reuse paraphrased queries, cut latency and token cost.
Project-URL: Homepage, https://github.com/AaronharveyHan/AI_Cost_Optimization
Project-URL: Repository, https://github.com/AaronharveyHan/AI_Cost_Optimization
Project-URL: Bug Tracker, https://github.com/AaronharveyHan/AI_Cost_Optimization/issues
Project-URL: Demo Playground, https://github.com/AaronharveyHan/cogcache-playground
Author-email: Aaron Harvey Han <aaronharveyhansen@gmail.com>
License: MIT
License-File: LICENSE
Keywords: ai,cache,cost-optimization,embedding,langchain,llm,openai,redis,semantic
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24.0
Provides-Extra: all
Requires-Dist: langchain-core>=0.1.0; extra == 'all'
Requires-Dist: openai>=1.30.0; extra == 'all'
Requires-Dist: prometheus-client>=0.19.0; extra == 'all'
Requires-Dist: redis>=5.0.0; extra == 'all'
Provides-Extra: bench
Requires-Dist: openai>=1.30.0; extra == 'bench'
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: hatchling>=1.21; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: twine>=5.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocs>=1.6; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.27; extra == 'docs'
Requires-Dist: pymdown-extensions>=10.0; extra == 'docs'
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.1.0; extra == 'langchain'
Provides-Extra: openai-judge
Requires-Dist: openai>=1.30.0; extra == 'openai-judge'
Provides-Extra: prometheus
Requires-Dist: prometheus-client>=0.19.0; extra == 'prometheus'
Provides-Extra: redis
Requires-Dist: redis>=5.0.0; extra == 'redis'
Description-Content-Type: text/markdown

# cogcache

> Semantic LLM answer cache — reuse paraphrased queries, cut latency and token cost.

[![PyPI](https://img.shields.io/pypi/v/cogcache.svg)](https://pypi.org/project/cogcache/)
[![Python](https://img.shields.io/pypi/pyversions/cogcache.svg)](https://pypi.org/project/cogcache/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**cogcache** caches LLM responses by **semantic similarity** instead of exact key match.
When a paraphrased query arrives, it returns the previous answer in milliseconds —
zero LLM tokens spent.

```
"What is semantic caching?"        → LLM call (4.2s, 320 tokens)
"What does semantic caching mean?" → Cache HIT (0.5ms, 0 tokens)   ← 99% savings
```
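
The idea behind the example above can be sketched in a few lines of plain Python. This illustrates cosine-similarity lookup only and is not cogcache's internal code: the `cosine` and `lookup` helpers and the hand-written 3-dimensional vectors are hypothetical stand-ins for a real embedding model and store.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lookup(query_vec, entries, threshold=0.92):
    """Return the cached answer most similar to query_vec, or None on a miss."""
    best_answer, best_score = None, threshold
    for vec, answer in entries:
        score = cosine(query_vec, vec)
        if score >= best_score:
            best_answer, best_score = answer, score
    return best_answer

# Toy vectors standing in for real embeddings (assumption: a real embedder
# maps paraphrases to nearby vectors and unrelated text to distant ones).
entries = [([0.9, 0.1, 0.0], "Semantic caching reuses answers for similar queries.")]
paraphrase = [0.88, 0.15, 0.02]
unrelated = [0.0, 0.2, 0.95]

hit = lookup(paraphrase, entries)   # close to the stored vector
miss = lookup(unrelated, entries)   # far from the stored vector
```

Only the paraphrase clears the threshold, so `hit` holds the stored answer and `miss` is `None`; in that second case the caller would fall through to the LLM.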

## Install

```bash
pip install cogcache                     # core library
pip install "cogcache[redis]"            # + Redis Stack backend (HNSW vector search)
pip install "cogcache[prometheus]"       # + Prometheus metrics sink
pip install "cogcache[openai-judge]"     # + LLM-as-Judge quality scoring
pip install "cogcache[langchain]"        # + LangChain BaseCache adapter
pip install "cogcache[all]"              # everything (quotes keep zsh from globbing the brackets)
```

## Quick start

```python
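from openai import OpenAI  # example client only; any LLM SDK works

from cogcache import CogniCache

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
cache = CogniCache(similarity_threshold=0.92)

def my_llm(query: str) -> str:
    # Your real LLM call here (OpenAI, Anthropic, DashScope, ...)
    return openai_client.chat.completions.create(...).choices[0].message.content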

# First call → LLM
answer = cache.query("What is gradient descent?", llm_fn=my_llm)

# Second call → cache hit, zero LLM cost
answer = cache.query("Explain gradient descent.", llm_fn=my_llm)
```

### As a decorator

```python
@cache.cached(threshold=0.90)
def ask_llm(query: str) -> str:
    return my_llm(query)

ask_llm("What is X?")   # LLM call
ask_llm("Tell me X.")   # cache hit
```

### With LangChain

```python
from cogcache.integrations.langchain import CogniCacheLangChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(cache=CogniCacheLangChain(cache))
```

## Features

| Feature | Enabled by default |
|---|---|
| Cosine-similarity semantic matching | ✅ |
| Pluggable stores: `MemoryStore` / `RedisStore` (Redis Stack HNSW) | ✅ |
| TTL eviction on read & write paths | ✅ |
| LLM-as-Judge with "write strict, hit lenient" policy | optional |
| Prometheus + JSON metrics sink | optional |
| Route / intent isolation (multi-tenant safe) | ✅ |
| Fail-open on backend failures | ✅ |
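
To make two of the rows above concrete, here is a minimal sketch of TTL eviction on both the read and write paths combined with per-route isolation. `TinyStore` is a hypothetical toy class, not cogcache's `MemoryStore`, and it matches keys exactly instead of by similarity to keep the example short. The `-1` convention for "no expiry" follows the `ttl` option shown under Configuration.

```python
import time

class TinyStore:
    """Toy in-memory store keyed by (route, key) with per-entry expiry."""

    def __init__(self, ttl=3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._data = {}  # (route, key) -> (expires_at or None, value)

    def _evict_expired(self):
        now = self.clock()
        expired = [k for k, (exp, _) in self._data.items()
                   if exp is not None and exp <= now]
        for k in expired:
            del self._data[k]

    def set(self, route, key, value):
        self._evict_expired()  # eviction on the write path
        expires_at = None if self.ttl < 0 else self.clock() + self.ttl
        self._data[(route, key)] = (expires_at, value)

    def get(self, route, key):
        self._evict_expired()  # eviction on the read path
        entry = self._data.get((route, key))
        return entry[1] if entry else None

# A fake clock makes the expiry visible without sleeping.
now = [0.0]
store = TinyStore(ttl=60, clock=lambda: now[0])
store.set("tenant-a", "q1", "answer for A")

same_route = store.get("tenant-a", "q1")    # visible to the route that wrote it
other_route = store.get("tenant-b", "q1")   # invisible to any other route
now[0] = 61.0
after_expiry = store.get("tenant-a", "q1")  # gone once the TTL has passed
```

Making the route part of the key is what keeps one tenant's answers from ever being served to another, regardless of how similar the queries are.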

## Configuration

```python
CogniCache(
    redis_url="redis://localhost:6379/0",   # None = in-memory
    similarity_threshold=0.92,               # 0.85–0.95 typical
    max_cache_size=10_000,
    ttl=3600,                                # -1 for no expiry
    vector_dim=512,                          # match your embedder
    enable_judge=True,                       # LLM Judge quality gate
    write_min_quality=0.8,
    judge_on_hit=False,                      # async hit-time warning
    embed_fn=my_custom_embedder,             # or use the default
    metrics=MetricsCollector(),              # observability hook
)
```

See the [tuning guide](https://github.com/AaronharveyHan/AI_Cost_Optimization/blob/main/docs/tuning.md)
for threshold selection, embedding model comparison, and Prometheus alert thresholds.
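
One way to reason about `similarity_threshold` is to sweep candidate values over a small labelled sample and watch how the hit rate trades off against wrong hits. The sketch below is an assumption about how such a check could look, not a procedure taken from the tuning guide; the `sweep` helper and the sample scores are hypothetical.

```python
def sweep(pairs, thresholds):
    """For each threshold, report the hit rate and the share of hits that are wrong."""
    rows = []
    for t in thresholds:
        hits = [same for score, same in pairs if score >= t]
        hit_rate = len(hits) / len(pairs)
        false_hit_rate = hits.count(False) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows

# Hypothetical labelled sample:
# (similarity score, do both queries really want the same answer?)
pairs = [
    (0.97, True), (0.94, True), (0.91, True),
    (0.90, False), (0.86, False), (0.80, False),
]

rows = sweep(pairs, [0.85, 0.90, 0.92, 0.95])
for t, hit_rate, false_hit_rate in rows:
    print(f"threshold={t:.2f} hit_rate={hit_rate:.2f} false_hit_rate={false_hit_rate:.2f}")
```

On this toy sample, raising the threshold removes the wrong hits at the cost of fewer hits overall; real traffic needs a larger labelled set before drawing conclusions.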

## When to use cogcache

- ✅ **High-QPS chatbots** where users phrase the same question in different ways
- ✅ **RAG systems** with repetitive paraphrased queries
- ✅ **Multi-tenant LLM APIs** where you bill per token
- ✅ **Demo / dev environments** where you want to skip LLM calls on repeat

- ❌ Personalized answers (use `route=user_id` isolation if you must)
- ❌ Real-time data (weather, prices): set a short TTL or skip caching

## Production readiness

- ✅ Thread-safe `MemoryStore` and `MetricsCollector`
- ✅ Fail-open: Redis disconnect / Judge crash never breaks your request path
- ✅ 49 unit tests, run with `pytest -q`
- ✅ Used in production in the [AI_Cost_Optimization](https://github.com/AaronharveyHan/AI_Cost_Optimization)
  reference deployment
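
The fail-open behaviour listed above can be pictured as a thin wrapper that treats every cache error as a miss. This is a minimal sketch of the pattern, not cogcache's implementation: `query_fail_open`, `cache_get`, `cache_set`, and `broken_backend` are hypothetical names introduced for illustration.

```python
import logging

logger = logging.getLogger("cache-sketch")

def query_fail_open(query, llm_fn, cache_get, cache_set):
    """Answer from the cache when possible; never let a cache error reach the caller."""
    try:
        cached = cache_get(query)
        if cached is not None:
            return cached
    except Exception:
        logger.warning("cache read failed, falling back to the LLM", exc_info=True)

    answer = llm_fn(query)  # errors from the LLM itself still propagate

    try:
        cache_set(query, answer)
    except Exception:
        logger.warning("cache write failed, returning the answer uncached", exc_info=True)
    return answer

def broken_backend(*args):
    """Simulates a backend that is down for both reads and writes."""
    raise ConnectionError("backend unreachable")

answer = query_fail_open(
    "What is X?",
    lambda q: f"LLM answer to: {q}",
    broken_backend,
    broken_backend,
)
```

Even with both cache operations raising, the caller still receives the LLM's answer; the only visible effects are the logged warnings and the missed savings.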

## Try it live

For a complete demo with FastAPI backend, admin dashboard, Prometheus exporter,
and Docker Compose setup, see the **cogcache-playground** repo:

```bash
git clone https://github.com/AaronharveyHan/cogcache-playground.git
cd cogcache-playground && docker compose up
# Open http://localhost:8000/admin
```

## License

MIT — see [LICENSE](https://github.com/AaronharveyHan/AI_Cost_Optimization/blob/main/LICENSE).
