ContextFit is open source — MIT licensed
research preview · MIT license

Agent memory that thinks in
tokens, not vectors

ContextFit retrieves the right prior conversation without embedding APIs, vector databases, or GPU hardware — at 0.4ms query latency and $0 per query.

$ pip install git+https://github.com/ContextFit/cf.git
View on GitHub Read the Whitepaper How it works →
61.5% Recall@1 · 499-case eval
0.4ms Query latency
375× Faster than embed retrieval
$0 API cost per query
41MB Total dependency footprint

Embedding-based memory has
six structural flaws

Every major agent memory system in 2026 converts conversations to dense vectors before retrieval. This works for factual lookups — and breaks for the queries agents most frequently need to answer.

🌀

Semantic averaging

An entire session compressed into one vector loses episode structure. "What should I cook tonight?" shares almost no vector proximity with "I just harvested zucchini from my garden" — even though that's exactly the session the agent needs.

💸

Compounding API cost

Every memory access requires an embedding call. At real agent throughput — thousands of turns per day — this compounds into a meaningful infrastructure line item, plus the cost of a vector database on top.

🔒

Data leaves the machine

Sending memory retrieval queries to an embedding API means your user's personal context crosses a network boundary on every turn. For privacy-sensitive use cases this is a non-starter.

🔧

Operational complexity

Embedding API + vector database + embedding model versioning + index management: four moving parts to provision, monitor, and keep synchronized.

🫥

Zero interpretability

When retrieval fails, the answer is "cosine distance was 0.62." There's no auditable explanation for why a session was or wasn't surfaced.

⚡

GPU dependency

Running embedding models locally requires PyTorch and benefits significantly from GPU acceleration — 500MB–2GB of dependencies before a single session is indexed.


Stay in token space,
end to end

The most valuable signals in conversational memory are structural, not semantic: what kind of memory did the user express? and does this episode's memory type match what this query needs? These questions are answerable with token-level pattern matching — no embedding model required.
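
As a sketch, token-level patterns can answer the first question directly. The marker patterns below are invented for illustration; ContextFit's actual extraction rules are described in the whitepaper.

import re

# Invented marker patterns (illustration only, not ContextFit's rules)
MARKERS = {
    "preference": re.compile(r"\bi (really )?(love|like|prefer|hate)\b"),
    "goal": re.compile(r"\b(i want to|i'm trying to|my goal is)\b"),
    "open_loop": re.compile(r"\b(i still need to|haven't \w+ yet|remind me)\b"),
    "temporal_update": re.compile(r"\b(i just|no longer|as of)\b"),
}

def memory_types(turn: str) -> list[str]:
    """Return every memory-signal type whose marker fires on this user turn."""
    t = turn.lower()
    return [name for name, pattern in MARKERS.items() if pattern.search(t)]

print(memory_types("I just harvested zucchini from my garden"))
# -> ['temporal_update']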

PRIMITIVE 01

Memory Atoms

Deterministic, domain-agnostic fact extraction from user-authored turns. Eight typed primitives — preference, goal, constraint, decision, temporal update, open loop, interest, entity — extracted with zero API calls.
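
A sketch of what a typed atom might carry; the schema below is illustrative, not ContextFit's actual one.

from dataclasses import dataclass
from typing import Literal

AtomType = Literal[
    "preference", "goal", "constraint", "decision",
    "temporal_update", "open_loop", "interest", "entity",
]

@dataclass(frozen=True)
class MemoryAtom:
    type: AtomType     # one of the eight typed primitives
    text: str          # the user-authored span the atom came from
    session_id: str    # source session, so every retrieval stays auditable

atom = MemoryAtom(
    type="temporal_update",
    text="I just harvested zucchini from my garden",
    session_id="s_garden_harvest",
)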

PRIMITIVE 02

Episode Relevance Scorer

Numeric session ranking by structural memory-signal type, not vector proximity. Answers "does this session have what this query type needs?" — the right question for vague advice queries.
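
A toy version of the idea, with invented type weights: score an episode by counting its atoms of the types the query class needs.

from collections import Counter

# Invented weights: which atom types an "advice" query is assumed to need
NEEDS = {"advice": {"preference": 1.0, "constraint": 1.0, "temporal_update": 1.5}}

def episode_score(atom_types: list[str], query_type: str) -> float:
    """Weighted count of this episode's atoms matching the query's needs."""
    counts = Counter(atom_types)
    weights = NEEDS.get(query_type, {})
    return sum(weights.get(t, 0.0) * n for t, n in counts.items())

# A session with a fresh state change outranks one with none.
print(episode_score(["temporal_update", "entity"], "advice"))  # -> 1.5
print(episode_score(["entity"], "advice"))                     # -> 0.0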

PRIMITIVE 03

Query Router

Deterministic zero-cost dispatch to the right retrieval mode per query. Vague advice → episode scorer. Specific facts → BM25. Temporal state → fusion. No LLM routing, no latency.
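
In sketch form; the trigger phrases below are assumptions, not the shipped routing rules.

def route(query: str) -> str:
    q = query.lower()
    if any(p in q for p in ("should i", "what should", "recommend", "ideas for")):
        return "episode_score"   # vague advice -> episode relevance scorer
    if any(p in q for p in ("still need", "currently", "latest", "as of")):
        return "fusion"          # temporal state -> fused retrieval
    return "bm25"                # specific facts -> plain BM25

print(route("what should I cook for dinner tonight?"))  # -> episode_score
print(route("what was the name of my dentist?"))        # -> bm25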

PRIMITIVE 04

Structural Reranker

Ten token-native features re-score BM25 candidates: lexical overlap, behavior-marker alignment, named entity overlap, question-type slot matching. Surpasses Cohere embeddings at zero cost.
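
Two of the ten features in sketch form, blended with the BM25 score; the weights and data here are illustrative.

import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def lexical_overlap(query: str, text: str) -> float:
    q = tokens(query)
    return len(q & tokens(text)) / max(len(q), 1)

def entity_overlap(query: str, text: str) -> float:
    # Crude named-entity proxy: capitalized surface tokens
    q = {w for w in query.split() if w[:1].isupper()}
    s = {w for w in text.split() if w[:1].isupper()}
    return len(q & s) / max(len(q), 1)

def rerank_score(query: str, text: str, bm25: float) -> float:
    return bm25 + 0.5 * lexical_overlap(query, text) + 0.5 * entity_overlap(query, text)

candidates = {
    "s_garden": ("I planted zucchini in the garden bed", 2.1),
    "s_meeting": ("Notes from the quarterly planning review", 2.0),
}
query = "what is growing in my garden?"
ranked = sorted(candidates, key=lambda sid: -rerank_score(query, *candidates[sid]))
print(ranked)  # -> ['s_garden', 's_meeting']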

raw session text → BPE tokenize → atom extraction → BM25 index → query router → structural rerank → ranked sessions

Beats Cohere. Trails OpenAI
by 1.6 pts. Costs $0.

Evaluated on a 499-case domain-agnostic agent memory benchmark across 8 behavioral categories and 26 domains — hand-crafted cases plus generated coverage.

System                           R@1     R@3     MRR     API cost       GPU
Mem0 (GPT-4o-mini + embed)       54.4%   81.0%   0.716   LLM + embed
Cohere embed-english-v3          58.7%   91.4%   0.751   embed API
ContextFit + slot matching       61.5%   93.2%   0.776   $0             ✓ CPU
OpenAI text-embedding-3-small    63.1%   96.6%   0.792   embed API
Recall@1 — 499-case agent memory eval: Mem0 54.4% · Cohere embed-v3 58.7% · ContextFit 61.5% · OpenAI embed-3-small 63.1%

open_loop_retrieval

+24.6 pts
88.5% vs 63.9% — structural markers dominate

temporal_supersession

+4.8 pts
52.4% vs 47.6% — state-change atoms win

multi_session_synthesis

−8.9 pts
78.6% vs 87.5% — active area of improvement

preference_recommendation

−25.8 pts
51.6% vs 77.4% — known semantic gap; being narrowed

LongMemEval-S Any@5

96.0%
Within 1.6 pts of the best published result (gbrain, 97.6%)

No database. No GPU.
No vendor lock-in.

The token-native architecture isn't just a performance choice — it's a deployment choice. ContextFit runs anywhere a Python process runs.

✓ eliminated

No vector database

The index is a directory of flat files — zstd-compressed token arrays, BM25 postings, LSH signatures. No Qdrant, no Pinecone, no pgvector. No service to start, no schema to migrate.
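
A sketch of the flat-file idea: round-tripping one session's token IDs through a zstd-compressed file. The file name and layout below are made up; see the repo for the actual index layout.

import os
import numpy as np
import zstandard as zstd

os.makedirs("memory_index", exist_ok=True)
token_ids = np.array([9906, 1917, 374, 264, 1296], dtype=np.uint32)  # e.g. from tiktoken

# Write: one session's token array, zstd-compressed, as a plain file
with open("memory_index/s_demo.tokens.zst", "wb") as f:
    f.write(zstd.ZstdCompressor().compress(token_ids.tobytes()))

# Read it back: no database, no service, just a file
with open("memory_index/s_demo.tokens.zst", "rb") as f:
    restored = np.frombuffer(
        zstd.ZstdDecompressor().decompress(f.read()), dtype=np.uint32
    )
assert (restored == token_ids).all()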

✓ eliminated

No GPU required

Zero PyTorch, zero CUDA, zero model weights. Every operation runs on CPU: BM25 scoring, roaring bitmap intersection, MinHash LSH, episode feature computation, structural reranking.
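
Two of those building blocks in miniature, using the same libraries ContextFit depends on; the data here is invented.

from pyroaring import BitMap
from datasketch import MinHash, MinHashLSH

# Roaring bitmap intersection: which session IDs contain both terms?
has_garden = BitMap([3, 17, 42])
has_zucchini = BitMap([17, 99])
print(list(has_garden & has_zucchini))  # -> [17]

# MinHash LSH: near-duplicate session detection, no model weights
def signature(tokens: list[str]) -> MinHash:
    m = MinHash(num_perm=64)
    for t in tokens:
        m.update(t.encode())
    return m

lsh = MinHashLSH(threshold=0.5, num_perm=64)
lsh.insert("s_a", signature(["garden", "zucchini", "harvest", "tonight"]))
print(lsh.query(signature(["garden", "zucchini", "harvest", "dinner"])))
# -> ['s_a'] (with high probability; MinHash is approximate)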

✓ 41MB total

Minimal footprint

Total dependency footprint is ~41MB (tiktoken + numpy + pyroaring + datasketch + zstandard). PyTorch alone is 500MB–2GB. Fits in a Lambda function, a slim container, or a mobile app bundle.

✓ offline-capable

Filesystem-native storage

Standard POSIX permissions. The index lives alongside your data — inside an encrypted vault, a git repo, a synced drive. Back it up with rsync. No DB dump procedures, no export formats.

Property              ContextFit              Embedding + vector DB
Database              None                    Qdrant / Chroma / pgvector
GPU                   Not required            Recommended for local models
Dependency size       ~41MB                   500MB–2GB+ (PyTorch alone)
Storage format        Plain files             DB-managed blobs
Permissions           POSIX filesystem        DB users / ACLs
Offline capable       Yes (default path)      No (API) / partial (local model)
Backup                Any file backup tool    DB dump + vector store export
API cost (default)    $0                      Per-embedding call
Query latency         0.4–9ms in-process      50–500ms+ (embed + vector search)

Simple, composable,
fully interpretable

Drop in as a memory layer. Query with a natural-language string. Get back ranked session IDs with source-linked evidence.

memory_retrieval.py
# Ingest sessions — no API calls, no GPU, ~2.7ms per session
from contextfit import RetrievalEngine

sessions = [...]  # your prior conversation sessions (see the docs for the format)

engine = RetrievalEngine()
engine.ingest_sessions(sessions)
engine.save("./memory_index")

# Query — auto-routes to the right retrieval mode
result = engine.query_auto(
    "what should I cook for dinner tonight?",
    top_k=5
)

# Returns ranked session IDs + route metadata
print(result["route"])        # → episode_score
print(result["session_ids"])  # → ["s_garden_harvest", ...]

# Or use individual modes directly
query = "what should I cook for dinner tonight?"
engine.rank_sessions_by_episode_score(query, top_k=10)
engine.query(query, method="hybrid", top_k=50)
# bm25_order, session_texts: candidate order and raw texts from a prior BM25 pass
engine.rerank_sessions_by_structure(query, bm25_order, session_texts)

Built in the open.
Free to use and extend.

MIT licensed. No cloud dependency. No vendor lock-in. Fork it, embed it, ship it.

INSTALL

pip install git+https://github.com/ContextFit/cf.git

REQUIRES

Python 3.10+ · CPU only · ~41MB deps · No DB

LICENSE

MIT — use freely in commercial and open-source projects

CONTRIBUTE

Issues, PRs, and benchmark contributions welcome


Research

Token-Native Agent Memory

The full technical whitepaper: architecture, four primitives, 499-case benchmark methodology, feature ablation, per-behavior analysis, and deployment architecture.

Read the Whitepaper View on GitHub