State-of-the-art retrieval over past conversations. Your words, stored exactly as you said them — retrieved with zero LLM calls at query time.
Tested on two widely used conversational memory benchmarks. No LLM in the loop: just embeddings, sparse retrieval, and a free cross-encoder reranker.
Disclaimer: results are compiled from multiple papers and evaluation reports. They are not directly comparable: backbone LLMs, prompting strategies, and evaluation setups differ, and Engram's figure is retrieval recall at 5 (R@5) while the other rows are reported as accuracy.
| System | LoCoMo score (accuracy unless noted) | LLM required | Open source | Source |
|---|---|---|---|---|
| Engram | 93.9% (R@5) | No | Yes (MIT) | This repo (reproducible) |
| EverMemOS | 86.76% – 93.05% | Yes | No | arXiv:2601.02163 |
| Zep | 85.22% | Yes | Partial | EverMemOS evaluation |
| MemOS | 80.76% | Yes | Partial | EverMemOS evaluation |
| Mem0 | 64.20% | Yes | Partial | EverMemOS evaluation |
| MemU | 61.15% | Yes | Partial | arXiv:2601.02163 |
| Other LLM-based (Hindsight, MemGPT, Letta) | ~83 – 92% | Yes | Varies | Secondary reports |
| Non-LLM (SLM variants) | ~74 – 75% | No | Yes | Secondary reports |
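The Engram row reports R@5, a retrieval metric, so it may help to see how such a number is typically computed. The sketch below uses one common definition (the fraction of relevant documents that appear in the top k results). This is an assumption: the benchmark scripts in this repo may use a different variant, such as a hit rate. The function name `recall_at_k` and the document IDs are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant IDs found among the top-k retrieved IDs.

    Assumed definition for illustration; not taken from Engram's benchmark code.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical example: two relevant turns, one of which lands in the top 5.
retrieved = ["d3", "d8", "d1", "d5", "d2", "d9"]
relevant = ["d1", "d9"]
score = recall_at_k(retrieved, relevant, k=5)
print(score)
```

A benchmark-level R@5 would then be an average of this value over all questions.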
Dense semantic search catches meaning. Sparse BM25 catches exact words. A cross-encoder reranker scores the finalists. Nothing is summarized.
bge-large bi-encoder (1024d) finds semantically similar past turns.
BM25 catches exact names, dates, and rare terms embeddings miss.
Reciprocal Rank Fusion combines both signals without per-query tuning.
Cross-encoder scores top candidates jointly for the final ranking.
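The fusion step in the pipeline above can be sketched in a few lines. This is a minimal, standalone illustration of Reciprocal Rank Fusion, not Engram's actual code: the function name `reciprocal_rank_fusion` and the turn IDs are hypothetical, and `k=60` is the constant from the original RRF paper, which may differ from what Engram uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Only ranks are used, so dense and BM25 scores
    never need to be put on a common scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from the two retrievers.
dense_ranking = ["t7", "t2", "t9"]
bm25_ranking = ["t2", "t4", "t7"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)  # ['t2', 't7', 't4', 't9']
```

Documents that both retrievers rank highly rise to the top, and the fused list is what a cross-encoder would then rerank.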
Long sessions dilute embeddings. Chunking at ~6 turns with 1-turn overlap keeps individual facts retrievable.
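As a rough illustration of this chunking scheme, the sketch below splits a session into windows of 6 turns that each share 1 turn with the previous window. The helper `chunk_turns` is hypothetical and is not part of Engram's API; the library's own chunker may handle session boundaries differently.

```python
def chunk_turns(turns, size=6, overlap=1):
    """Split a list of turns into windows of `size` that share `overlap` turns."""
    if size <= overlap:
        raise ValueError("size must be greater than overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(turns), step):
        chunks.append(turns[start:start + size])
        # Stop once a window has reached the end of the session.
        if start + size >= len(turns):
            break
    return chunks


# A hypothetical 13-turn session.
turns = [f"turn {i}" for i in range(1, 14)]
chunks = chunk_turns(turns)

for chunk in chunks:
    print(chunk[0], "...", chunk[-1])
# turn 1 ... turn 6
# turn 6 ... turn 11
# turn 11 ... turn 13
```

The shared turn means a fact stated at a chunk boundary stays next to its surrounding context in at least one chunk.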
Prepending a date tag such as [2024-01-15] to each document lets both dense and BM25 retrieval match temporal queries.
First-person turns don't contain the speaker's name, so entity-attribute queries fail. Prepending the name bridges the gap and lifts LoCoMo R@5 by about 3 points.
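The two prefixing tricks can be pictured with a small sketch. Everything here is an assumption made for illustration: the helper `format_document`, the per-turn placement of the speaker name, and the name "Alice" are hypothetical, and Engram's ingestion code may lay out the text differently.

```python
def format_document(turns, date, speaker_names):
    """Render a chunk of turns as one retrievable text document.

    The date tag goes first so temporal queries can match it; each turn is
    prefixed with the speaker's name so first-person statements can be tied
    to an entity.
    """
    lines = []
    for turn in turns:
        # Fall back to the raw role when no name is known for it.
        name = speaker_names.get(turn["role"], turn["role"])
        lines.append(f"{name}: {turn['content']}")
    return f"[{date}] " + "\n".join(lines)


turns = [
    {"role": "user", "content": "I'm switching our API from REST to GraphQL."},
    {"role": "assistant", "content": "What's driving the switch?"},
]
doc = format_document(turns, date="2025-01-15", speaker_names={"user": "Alice"})
print(doc)
# [2025-01-15] Alice: I'm switching our API from REST to GraphQL.
# assistant: What's driving the switch?
```

With the prefix in place, a query such as "what API is Alice moving to" shares the token "Alice" with the document, which the raw first-person turn would not.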
One pip install. Works locally with FAISS + SQLite, or plugs into Qdrant for cloud deployment.
```bash
# Install
$ pip install engram-search

# Initialize a memory store
$ engram init ./my_memories

# Ingest past conversations
$ engram ingest conversations.json --store ./my_memories

# Search
$ engram search "why did we switch to GraphQL" --store ./my_memories
```
```python
from engram.backends.faiss_backend import FaissBackend
from engram.backends.base import Document
from engram.ingestion.parser import session_to_documents
from engram.retrieval.embedder import Embedder
from engram.retrieval.pipeline import RetrievalPipeline

embedder = Embedder("bge-large")
backend = FaissBackend(path="./my_memories", dimension=1024)
pipeline = RetrievalPipeline(embedder=embedder)

turns = [
    {"role": "user", "content": "I'm switching our API from REST to GraphQL."},
    {"role": "assistant", "content": "What's driving the switch?"},
    {"role": "user", "content": "Too many round trips — 12 calls per screen."},
]

docs = session_to_documents(turns, session_id="s1", timestamp="2025-01-15")
results = pipeline.search("why did we switch to GraphQL", documents=docs, top_k=3)

for r in results:
    print(r.text)
```
```bash
# Point Engram at a managed Qdrant cluster
$ export ENGRAM_BACKEND=qdrant
$ export ENGRAM_QDRANT_URL=https://your-cluster.qdrant.io:6333
$ export ENGRAM_QDRANT_API_KEY=your-api-key

# Start the API server
$ pip install fastapi uvicorn
$ uvicorn engram.server:app --host 0.0.0.0 --port 8000

# Endpoints available
#   POST /ingest  — add conversations
#   POST /search  — retrieve memories
#   GET  /health  — health check
#   GET  /stats   — store statistics
```
```bash
# Install with MCP extras
$ pip install "engram-search[mcp]"
$ engram init ./engram_store
```

Add to claude_desktop_config.json (Claude Desktop, Cursor, Windsurf…):

```json
{
  "mcpServers": {
    "engram": {
      "command": "engram-mcp",
      "env": {
        "ENGRAM_STORE_PATH": "/absolute/path/to/engram_store"
      }
    }
  }
}
```

Restart the client. Engram exposes three tools:

```text
search_memory(query, top_k, min_score)  — retrieve relevant memories
add_memory(text, metadata)              — store a new memory fact
memory_stats()                          — count documents in the store
```
Even the best agents forget past interactions, lose long-term context, and rely on expensive reprocessing.
Engram fixes this at the infrastructure layer.
Retrieval only. Deterministic, reproducible, no per-query spend, no prompt drift, no rate limits.
Nothing is summarized or paraphrased on the way in. What you said is what gets returned.
FAISS + SQLite out of the box. Runs entirely on your machine. No API keys needed to get started.
Plug into Qdrant for multi-tenant, horizontally scalable memory. Same API, same accuracy.
Drop-in memory for any system that needs to remember conversations across time.
Recall user preferences, past decisions, and prior context across sessions — without re-feeding the full history into every prompt.
Pull a customer's full history on every interaction. No transcript dumps, no context bloat, no per-token cost.
Give autonomous agents persistent memory across runs without blowing up the context window.
Resolve references to prior conversations ("like we discussed last week") without re-embedding history on every turn.
Index dialogues, meeting transcripts, or support tickets with higher recall than vanilla semantic search.
Build personal assistants that actually remember what you told them — exact words, not summaries.
Shipping more integrations, more backends, and deeper temporal reasoning.
Drop-in memory modules for existing agent stacks.
Plug Engram into Claude Desktop, Cursor, Windsurf, or any MCP client with a few lines of config. See the MCP setup in the Quickstart.
Append turns to a live session without re-indexing.
Per-user namespaces for hosted deployments.
Non-blocking ingest/search for high-throughput workloads.
pgvector, Weaviate, and Pinecone adapters.
Improved date-grounding for "when did we…" queries.
Add MSC, DialogSum, and custom domain benchmarks.
Have a use case we're missing? Open an issue.
MIT licensed. Reproducible benchmarks. Drop it into your RAG pipeline today.