#1 on LoCoMo benchmark — zero LLM required

Conversational memory that actually remembers.

93.9% LoCoMo R@5
0 LLM calls
$0 per query

State-of-the-art retrieval over past conversations. Your words, stored exactly as you said them — retrieved with zero LLM calls at query time.

$ pip install engram-search
MIT licensed
Local-first, cloud-ready
Python 3.9+

The shift

Memory that breaks → memory that works.

Without memory infrastructure

Forgets past turns
Re-embeds or paraphrases on every call
Per-query cost, rate limits, prompt drift

With Engram

93.9% recall@5 on LoCoMo, across sessions
Exact words, retrieved verbatim
$0 per query, deterministic, reproducible

Benchmarks

Independently verified on two benchmarks.

Tested on two widely used conversational memory benchmarks. No LLM in the loop: just embeddings, sparse retrieval, and a free cross-encoder reranker.

LoCoMo

1,982 questions · 10 conversations
93.9%
R@5 — top result on the benchmark
R@10: 95.0%
NDCG@5: 0.894
Single-hop: 90.4%
Temporal: 93.1%
Contextual: 97.1%
Adversarial: 94.6%

LongMemEval

500 questions
98.4%
R@5 — 492 of 500 questions retrieved
R@10: 99.4%
NDCG@5: 0.934
Multi-session: 99.2%
Single-session-user: 100.0%
Knowledge-update: 98.7%
Temporal-reasoning: 97.0%
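
For readers unfamiliar with the metrics, here is a minimal sketch of how R@k and NDCG@k can be computed for a single question. It assumes hit-style recall (a question counts as retrieved when at least one gold evidence document appears in the top k) and binary relevance. The function names are illustrative and are not taken from Engram's evaluation code.

```python
import math


def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant document is in the top k, else 0.0 (assumed R@k definition)."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: discounted gain of the ranking divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: the only relevant document is ranked second.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2"}
print(hit_at_k(ranked, relevant, 5))             # 1.0
print(round(ndcg_at_k(ranked, relevant, 5), 3))  # 0.631
```

Averaging the per-question hit over a benchmark gives the headline figure: 492 hits out of 500 questions is 0.984.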

LoCoMo benchmark comparison

Disclaimer: results are compiled from multiple papers and evaluation reports. They are not directly comparable due to differences in backbone LLMs, prompting strategies, and evaluation setups.

| System | LoCoMo accuracy | LLM required | Open source | Source |
| --- | --- | --- | --- | --- |
| Engram | 93.9% (R@5) | No | Yes (MIT) | This repo (reproducible) |
| EverMemOS | 86.76% – 93.05% | Yes | No | arXiv:2601.02163 |
| Zep | 85.22% | Yes | Partial | EverMemOS evaluation |
| MemOS | 80.76% | Yes | Partial | EverMemOS evaluation |
| Mem0 | 64.20% | Yes | Partial | EverMemOS evaluation |
| MemU | 61.15% | Yes | Partial | arXiv:2601.02163 |
| Other LLM-based (Hindsight, MemGPT, Letta) | ~83 – 92% | Yes | Varies | Secondary reports |
| Non-LLM (SLM variants) | ~74 – 75% | No | Yes | Secondary reports |

Architecture

Three-stage hybrid retrieval.

Dense semantic search catches meaning and sparse BM25 catches exact words. Reciprocal Rank Fusion merges the two candidate lists. A cross-encoder reranker scores the finalists. Nothing is summarized.

1. Dense

bge-large bi-encoder (1024d) finds semantically similar past turns.

2. Sparse

BM25 catches exact names, dates, and rare terms embeddings miss.

3. RRF fusion

Reciprocal Rank Fusion combines both signals without per-query tuning.

4. Rerank

Cross-encoder scores top candidates jointly for the final ranking.
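
The fusion step can be sketched in a few lines. This is a generic Reciprocal Rank Fusion, not Engram's actual implementation: the function name `rrf_fuse` is hypothetical, and the constant `k=60` is the value conventionally used in the RRF literature, which Engram may or may not share.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so only rank positions matter and raw scores never need calibrating.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep the output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["t3", "t1", "t7"]   # e.g. from the bi-encoder
sparse_ranking = ["t1", "t9", "t3"]  # e.g. from BM25
print(rrf_fuse([dense_ranking, sparse_ranking]))
# ['t1', 't3', 't9', 't7']
```

Documents that both retrievers rank highly (`t1`, `t3`) rise to the top, while documents found by only one retriever still survive as candidates for the reranker.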

Session chunking

Long sessions dilute embeddings. Chunking at ~6 turns with 1-turn overlap keeps individual facts retrievable.
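
As an illustration, a sliding window with the sizes mentioned above might look like the sketch below. The function name `chunk_turns` and its parameters are hypothetical; Engram's own chunker may handle boundaries differently.

```python
def chunk_turns(turns, size=6, overlap=1):
    """Split a session into windows of `size` turns, sharing `overlap` turns between neighbours."""
    if size <= overlap:
        raise ValueError("size must be greater than overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(turns), step):
        chunks.append(turns[start:start + size])
        if start + size >= len(turns):
            break  # the last window already reaches the end of the session
    return chunks


session = [f"turn {i}" for i in range(13)]
for chunk in chunk_turns(session):
    print(len(chunk), chunk[0], "->", chunk[-1])
```

With 13 turns this yields windows of 6, 6, and 3 turns, and the last turn of each window is repeated as the first turn of the next, so a fact stated at a boundary stays retrievable from either side.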

Timestamp prefix

Prepending the session date, e.g. [2024-01-15], to each document lets both dense and BM25 retrieval match temporal queries.

Speaker-name injection

First-person turns don't contain the speaker's name, so entity-attribute queries fail. Prepending the name bridges the gap and lifts LoCoMo R@5 by ~3 points.
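
A minimal sketch of both prefixes applied to one chunk follows. The `build_document` helper, the `speaker` key, and the exact text layout are assumptions made for illustration and are not Engram's internal document format.

```python
def build_document(turns, date):
    """Render a chunk of turns as one text block with a date prefix and speaker names."""
    lines = [f"{turn['speaker']}: {turn['content']}" for turn in turns]
    return f"[{date}] " + "\n".join(lines)


chunk = [
    {"speaker": "Alice", "content": "I adopted a beagle last spring."},
    {"speaker": "Bob", "content": "What did you name him?"},
]
print(build_document(chunk, "2024-01-15"))
```

The indexed text now contains both "Alice" and the date, so a query such as "what pet does Alice have" or "what happened in January 2024" has literal tokens to match even though the original turn was written in the first person.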

Quickstart

Running in two minutes.

One pip install. Works locally with FAISS + SQLite, or plugs into Qdrant for cloud deployment.

# Install
$ pip install engram-search

# Initialize a memory store
$ engram init ./my_memories

# Ingest past conversations
$ engram ingest conversations.json --store ./my_memories

# Search
$ engram search "why did we switch to GraphQL" --store ./my_memories

# Python API
from engram.backends.faiss_backend import FaissBackend
from engram.backends.base import Document
from engram.ingestion.parser import session_to_documents
from engram.retrieval.embedder import Embedder
from engram.retrieval.pipeline import RetrievalPipeline

# bge-large produces 1024-dimensional vectors, so the FAISS index dimension must match
embedder = Embedder("bge-large")
backend = FaissBackend(path="./my_memories", dimension=1024)
pipeline = RetrievalPipeline(embedder=embedder)

turns = [
    {"role": "user", "content": "I'm switching our API from REST to GraphQL."},
    {"role": "assistant", "content": "What's driving the switch?"},
    {"role": "user", "content": "Too many round trips — 12 calls per screen."},
]
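# Convert the raw turns into timestamped documents for one session
docs = session_to_documents(turns, session_id="s1", timestamp="2025-01-15")

# Retrieve the three most relevant documents and print their stored text
results = pipeline.search("why did we switch to GraphQL", documents=docs, top_k=3)
for r in results:
    print(r.text)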

# Cloud: point Engram at a managed Qdrant cluster
$ export ENGRAM_BACKEND=qdrant
$ export ENGRAM_QDRANT_URL=https://your-cluster.qdrant.io:6333
$ export ENGRAM_QDRANT_API_KEY=your-api-key

# Start the API server
$ pip install fastapi uvicorn
$ uvicorn engram.server:app --host 0.0.0.0 --port 8000

# Endpoints available
# POST /ingest   — add conversations
# POST /search   — retrieve memories
# GET  /health   — health check
# GET  /stats    — store statistics

# MCP: install with MCP extras
$ pip install "engram-search[mcp]"
$ engram init ./engram_store

# Add to your MCP client's config file (claude_desktop_config.json for Claude Desktop; Cursor, Windsurf… use their own)
{
  "mcpServers": {
    "engram": {
      "command": "engram-mcp",
      "env": {
        "ENGRAM_STORE_PATH": "/absolute/path/to/engram_store"
      }
    }
  }
}

# Restart the client. Engram exposes three tools:
#   search_memory(query, top_k, min_score)  — retrieve relevant memories
#   add_memory(text, metadata)              — store a new memory fact
#   memory_stats()                          — count documents in the store

Why Engram

LLMs are getting better — but memory is still broken.

Even the best agents forget past interactions, lose long-term context, and rely on expensive reprocessing.

Engram fixes this at the infrastructure layer.

Zero LLM calls

Retrieval only. Deterministic, reproducible, no per-query spend, no prompt drift, no rate limits.

Exact words preserved

Nothing is summarized or paraphrased on the way in. What you said is what gets returned.

Local-first

FAISS + SQLite out of the box. Runs entirely on your machine. No API keys needed to get started.

Cloud-ready

Plug into Qdrant for multi-tenant, horizontally scalable memory. Same API, same accuracy.

Use Cases

Where Engram fits.

Drop-in memory for any system that needs to remember conversations across time.

AI assistants with long-term memory

Recall user preferences, past decisions, and prior context across sessions — without re-feeding the full history into every prompt.

Customer support agents

Pull a customer's full history on every interaction. No transcript dumps, no context bloat, no per-token cost.

Agent memory layer

Give autonomous agents persistent memory across runs without blowing up the context window.

Multi-session chatbots

Resolve references to prior conversations ("like we discussed last week") without re-embedding history on every turn.

RAG over conversations

Index dialogues, meeting transcripts, or support tickets with higher recall than vanilla semantic search.

Personal AI with real history

Build personal assistants that actually remember what you told them — exact words, not summaries.

Roadmap

What's next.

Shipping more integrations, more backends, and deeper temporal reasoning.

LangChain + LlamaIndex integrations

Drop-in memory modules for existing agent stacks.

MCP server (shipped)

Plug Engram into Claude Desktop, Cursor, Windsurf, or any MCP client with a short config block. See the MCP setup in Quickstart.

Streaming ingestion

Append turns to a live session without re-indexing.

Multi-tenant isolation

Per-user namespaces for hosted deployments.

Async API

Non-blocking ingest/search for high-throughput workloads.

More backends

pgvector, Weaviate, and Pinecone adapters.

Temporal reasoning boost

Improved date-grounding for "when did we…" queries.

Benchmark expansion

Add MSC, DialogSum, and custom domain benchmarks.

Have a use case we're missing? Open an issue.

Ready to give your agent a memory?

MIT licensed. Reproducible benchmarks. Drop it into your RAG pipeline today.

$ pip install engram-search