State-of-the-art retrieval over past conversations — 93.9% R@5 on LoCoMo, 98.4% on LongMemEval. No LLM calls. $0 per query. Your words, stored exactly as you said them.
Tested on two widely used conversational memory benchmarks. No LLM in the loop: just embeddings, sparse retrieval, and a free cross-encoder reranker.
Disclaimer: results are compiled from multiple papers and evaluation reports and are not directly comparable, because backbone LLMs, prompting strategies, and evaluation setups differ. Engram's figure is recall at 5 (R@5), a retrieval metric, as marked in its row.
| System | LoCoMo score | LLM required | Open source | Source |
|---|---|---|---|---|
| Engram | 93.9% (R@5) | No | Yes (MIT) | This repo (reproducible) |
| EverMemOS | 86.76% – 93.05% | Yes | No | arXiv:2601.02163 |
| Zep | 85.22% | Yes | Partial | EverMemOS evaluation |
| MemOS | 80.76% | Yes | Partial | EverMemOS evaluation |
| Mem0 | 64.20% | Yes | Partial | EverMemOS evaluation |
| MemU | 61.15% | Yes | Partial | arXiv:2601.02163 |
| Other LLM-based (Hindsight, MemGPT, Letta) | ~83 – 92% | Yes | Varies | Secondary reports |
| Non-LLM (SLM variants) | ~74 – 75% | No | Yes | Secondary reports |
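For readers unfamiliar with the R@5 figure in the table, here is a minimal sketch of one common definition of recall at k: the fraction of a query's evidence turns that appear in the top k results, averaged over queries. This is an assumption for illustration; the evaluation script in this repo may define it differently (for example, counting a query as a hit when any evidence turn is retrieved). The function names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k=5):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Two toy queries: the first finds one of two evidence turns, the second finds its only one.
runs = [
    (["a", "b", "c", "d", "e", "f"], {"b", "f"}),
    (["x", "y", "z"], {"y"}),
]
score = mean_recall_at_k(runs, k=5)
```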
Dense semantic search catches meaning. Sparse BM25 catches exact words. A cross-encoder reranker scores the finalists. Nothing is summarized.
bge-large bi-encoder (1024d) finds semantically similar past turns.
BM25 catches exact names, dates, and rare terms embeddings miss.
Reciprocal Rank Fusion combines both signals without per-query tuning.
Cross-encoder scores top candidates jointly for the final ranking.
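The fusion step above can be sketched in a few lines. This is a generic Reciprocal Rank Fusion implementation, not Engram's own code: each document scores the sum of `1 / (k + rank)` over the rankings it appears in. The constant `k=60` is the value commonly used in the RRF literature and is an assumption here, as are the function name and the toy turn IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Only ranks are used, so the raw dense and BM25
    scores never need to be calibrated against each other.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["t3", "t1", "t7"]   # hypothetical bi-encoder results
sparse_ranking = ["t1", "t9", "t3"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents that both retrievers rank highly ("t1", "t3") rise to the top, and the fused list would then go to the cross-encoder for final scoring.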
Long sessions dilute embeddings. Chunking at ~6 turns with 1-turn overlap keeps individual facts retrievable.
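A minimal sketch of that chunking scheme, assuming a session is a plain list of turns. The function name and defaults are illustrative, not Engram's API; the ingestion code in the repo may handle boundaries differently.

```python
def chunk_turns(turns, size=6, overlap=1):
    """Split a session into windows of `size` turns that share `overlap` turns."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(turns), step):
        chunks.append(turns[start:start + size])
        if start + size >= len(turns):
            break  # this window already reaches the end of the session
    return chunks


session = [f"turn {i}" for i in range(14)]
chunks = chunk_turns(session)
for chunk in chunks:
    print(chunk[0], "...", chunk[-1])
```

With 14 turns this yields three chunks, and the last turn of each chunk is repeated as the first turn of the next, so a fact stated at a boundary stays next to its context.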
Prepending the session date (for example, [2024-01-15]) to each document lets both dense and BM25 retrieval match temporal queries.
First-person turns don't contain the speaker's name, so entity-attribute queries fail. Prepending the speaker's name bridges the gap and lifts LoCoMo R@5 by about 3 points.
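The two prefixes described above can be illustrated as follows. The exact string layout (`[date] speaker: text`) is an assumption for this sketch, and `format_turn` is a hypothetical helper, not a function from the Engram package.

```python
def format_turn(speaker, text, date):
    """Prepend the session date and the speaker's name to a turn before indexing."""
    return f"[{date}] {speaker}: {text}"


raw = "I adopted a dog last week."
doc = format_turn("Alice", raw, "2024-01-15")
print(doc)
```

The raw turn mentions neither "Alice" nor a date, so a query such as "What pet does Alice have?" has no lexical overlap with the speaker's name; the prefixed document gives both the dense and BM25 retrievers something to match.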
One pip install. Works locally with FAISS + SQLite, or plugs into Qdrant for cloud deployment.

Command line:

    # Install
    $ pip install engram-search

    # Initialize a memory store
    $ engram init ./my_memories

    # Ingest past conversations
    $ engram ingest conversations.json --store ./my_memories

    # Search
    $ engram search "why did we switch to GraphQL" --store ./my_memories

Python:

    from engram.backends.faiss_backend import FaissBackend
    from engram.backends.base import Document
    from engram.ingestion.parser import session_to_documents
    from engram.retrieval.embedder import Embedder
    from engram.retrieval.pipeline import RetrievalPipeline

    # bge-large produces 1024-dimensional vectors, so the index dimension must match
    embedder = Embedder("bge-large")
    backend = FaissBackend(path="./my_memories", dimension=1024)
    pipeline = RetrievalPipeline(embedder=embedder)

    # A short session: each turn is a dict with a role and its content
    turns = [
        {"role": "user", "content": "I'm switching our API from REST to GraphQL."},
        {"role": "assistant", "content": "What's driving the switch?"},
        {"role": "user", "content": "Too many round trips: 12 calls per screen."},
    ]

    # Convert the session into documents, then search them
    docs = session_to_documents(turns, session_id="s1", timestamp="2025-01-15")
    results = pipeline.search("why did we switch to GraphQL", documents=docs, top_k=3)
    for r in results:
        print(r.text)

Cloud (Qdrant):

    # Point Engram at a managed Qdrant cluster
    $ export ENGRAM_BACKEND=qdrant
    $ export ENGRAM_QDRANT_URL=https://your-cluster.qdrant.io:6333
    $ export ENGRAM_QDRANT_API_KEY=your-api-key

    # Start the API server
    $ pip install fastapi uvicorn
    $ uvicorn engram.server:app --host 0.0.0.0 --port 8000

    # Endpoints available
    # POST /ingest   add conversations
    # POST /search   retrieve memories
    # GET  /health   health check
    # GET  /stats    store statistics

Retrieval only. Deterministic, reproducible, no per-query spend, no prompt drift, no rate limits.
Nothing is summarized or paraphrased on the way in. What you said is what gets returned.
FAISS + SQLite out of the box. Runs entirely on your machine. No API keys needed to get started.
Plug into Qdrant for multi-tenant, horizontally scalable memory. Same API, same accuracy.
MIT licensed. Reproducible benchmarks. Drop it into your RAG pipeline today.