Metadata-Version: 2.4
Name: atelya
Version: 0.1.19
Summary: AI-first memory OS: semantic recall that returns reusable KV. One lifecycle memory, three front doors (self-host serve, closed-LLM proxy, MCP server).
Author: amem
License: Apache-2.0
Keywords: memory,llm,kv-cache,rag,mcp,agents,vllm
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26
Requires-Dist: fastembed>=0.3
Requires-Dist: fastapi>=0.110
Requires-Dist: uvicorn>=0.27
Requires-Dist: pydantic>=2
Requires-Dist: mcp>=1.0
Provides-Extra: selfhost
Requires-Dist: vllm>=0.6; extra == "selfhost"
Requires-Dist: transformers>=4.40; extra == "selfhost"
Requires-Dist: torch; extra == "selfhost"
Provides-Extra: langgraph
Requires-Dist: langgraph>=0.2; extra == "langgraph"
Dynamic: license-file

# Atelya OS — KV-native memory for self-hosted agents

> Agent memory that lives in the model's **KV cache**, not just in text.
> Built for teams self-hosting open-model inference (vLLM / SGLang) with heavy, long-lived memory.

```bash
pip install atelya
```

Most memory systems store **text** and re-prefill it into the prompt **every turn** — so the cost of
serving memory grows linearly with turns. Atelya OS (`amem`) keeps the relevant working set as
**reused KV**: compute it once, reuse it, don't recompute memory each turn.

---

## What's measured (not estimated)

All numbers below were measured on a single RTX 4070 with `bench_real.py` / `amem_headtohead.py` in
this repo. Full methodology, per-query data, and honest limits: **[BENCHMARK.md](BENCHMARK.md)**.

- **Fidelity — 97.5% answer-for-answer agreement** with a cold full re-prefill of the same chunks
  (n = 200, LLM-judged). Reusing KV instead of recomputing it **does not change the answer**.
- **Cost — ~6x to ~54x less prefill.** Reusing KV vs re-prefilling the same retrieved set is **~6x
  cheaper per query** (n = 200, the conservative, default behavior); for a **stable session** the
  one-time working set amortizes to **~54x** (measured, 30 queries). Your real number lands in that
  range depending on how much the relevant memory changes per query. Break-even ~1 query.
- **Quality — at parity, not a win.** Head-to-head vs Mem0 at a **matched answerer + injection
  budget**, answer correctness was **60% (amem) vs 55% (Mem0)** — within noise at n = 20. We do
  **not** claim a recall-accuracy win: dedicated recall systems (Mem0, Zep, EverOS) lead that, and
  this comparison deliberately matches retrieval to isolate **cost**, so it does not reflect Mem0's
  stronger production recall pipeline.

> Honest framing for the cost number: lead with **~6x** (rigorous, n = 200). The **~54x** is the
> best case (stable working set + Mem0 storing raw turns); Mem0's real extraction injects fewer
> tokens and narrows the gap — but amem still never re-prefills its resident KV. See BENCHMARK.md.

---

## Who this is for

- **Full fit — you self-host vLLM / SGLang on a CUDA GPU**, with memory-heavy or long-horizon
  agents. You own inference, so you can inject KV -> you get the flat cost curve **and** the KV moat.
- **Partial fit — closed APIs (OpenAI) or Ollama.** You can't inject KV into a model you don't
  control, so amem **reduces** the per-turn memory bill but can't flatten it. The memory SDK still
  helps; the KV-reuse cost curve does not apply.

---

## Install

```bash
pip install atelya                 # memory SDK + CLI (no GPU needed)
pip install 'atelya[selfhost]'     # + vLLM / LMCache CacheBlend engine (CUDA GPU) — unlocks KV reuse
```

The ~6x–54x cost win needs the self-host engine (`[selfhost]`). **`pip install atelya` alone is the
memory-layer SDK; the KV moat requires inference you control.** The Python package runs without the
optional Rust engine (that engine is a commercial performance deepener, not required).

---

## Quickstart

> Verified against **amem 0.1.18**. `kv-serve` (the KV-reuse moat) needs `amem[selfhost]` + a CUDA GPU.
> No GPU? Swap in `amem proxy` (drop-in for Claude / GPT) — same SDK; the cost curve just doesn't apply.

Start the KV-reuse memory server (loads the vLLM + CacheBlend engine):

```bash
amem kv-serve            # serves on http://localhost:8000  (needs amem[selfhost] + a CUDA GPU)
```

Use it from the tiny SDK — the second question reuses the first one's KV instead of re-prefilling
the memory:

```python
from amem import Amem

mem = Amem("http://localhost:8000")
mem.remember("alice", ["Alice lives in Seattle and works at Boeing.",
                       "Alice's sister Maria lives in Denver."])
sid = mem.start_session("alice")                       # one resident working set for the chat
print(mem.ask("alice", "Where does Alice's sister live?", session=sid))
print(mem.ask("alice", "What company does Alice work for?", session=sid))   # different q -> reuses KV
```

Or drop it in behind the **standard OpenAI SDK** — point `base_url` at amem and pass `user` as the
memory namespace (per-user residency is on by default):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="amem", user="alice",
    messages=[{"role": "user", "content": "Where does Alice's sister live?"}])
print(r.choices[0].message.content)
```

---

## CLI

```
amem proxy        # closed-LLM drop-in (Claude / GPT) + cache orchestration   (light, no GPU)
amem mcp          # MCP server for Claude Desktop / Cursor / Claude Code       (stdio)
amem serve        # self-host serve (local open model + KV residency)          [amem[selfhost]]
amem kv-serve     # KV-native serve: CacheBlend reuse + KvPolicy eviction (moat) [amem[selfhost]]
amem bench        # reproducible cost-curve benchmark (KV-residency vs re-prefill) [amem[selfhost]]
amem version      # print version
```

Run `amem --help` for the authoritative list and flags.

---

## How it works — the bridge

```
text memory                amem (the bridge)                 KV memory
re-prefilled every turn  -> recall the relevant set, then  -> its precomputed KV is REUSED
(linear cost)               reuse its KV (not the text)       (one-time prefill, then ~flat)
```

- **CacheBlend** (via LMCache) reuses each chunk's attention KV **position-independently**, with
  ~15% selective recompute — so an arbitrary set of recalled chunks can be served from cached KV
  instead of a full re-prefill.
- A **value-model residency policy** (relevance x recency x reuse - size) keeps the hottest memory
  resident and tiers the rest to CPU/disk.

---

## What this is **not**

- **Not a recall-accuracy leaderboard claim.** Mem0 / Zep / EverOS lead LoCoMo / LongMemEval recall;
  amem composes with a recall layer. Its edge is the **cost of serving memory at parity fidelity**.
- **Transformer-only.** CacheBlend reuses per-token attention KV. SSM / Mamba-hybrid models keep a
  compressed recurrent state — there is no per-token KV to blend — so the KV-reuse moat does not
  apply to them.
- **A storage trade.** KV is ~1000x the size of the text it represents; amem tiers it to CPU/disk.
  You buy lower compute with more storage.

---

## Status & license

Experimental, in active development — expect rough edges. Apache-2.0.
