ContextFit retrieves the right prior conversation without embedding APIs, vector databases, or GPU hardware — at 0.4ms query latency and $0 per query.
Every major agent memory system in 2026 converts conversations to dense vectors before retrieval. This works for factual lookups — and breaks for the queries agents most frequently need to answer.
An entire session compressed into one vector loses episode structure. "What should I cook tonight?" has almost no vector proximity to "I just harvested zucchini from my garden" — even though that's exactly the session the agent needs.
Every memory access requires an embedding call. At real agent throughput — thousands of turns per day — this compounds into a meaningful line item on the infrastructure bill, plus the cost of a vector database on top.
Sending memory retrieval queries to an embedding API means your user's personal context crosses a network boundary on every turn. For privacy-sensitive use cases this is a non-starter.
An embedding API, a vector database, and an index-management pipeline: three separate systems to provision, monitor, and keep synchronized, with embedding-model versions to track across all of them.
When retrieval fails, the answer is "cosine distance was 0.62." There's no auditable explanation for why a session was or wasn't surfaced.
Running embedding models locally requires PyTorch and benefits significantly from GPU acceleration — 500MB–2GB of dependencies before a single session is indexed.
The most valuable signals in conversational memory are structural, not semantic: what kind of memory did the user express? and does this episode's memory type match what this query needs? These questions are answerable with token-level pattern matching — no embedding model required.
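As a deliberately simplified illustration of that claim: a marker table plus substring matching is enough to tag a turn with a memory type. The phrase lists and type names below are illustrative assumptions for exposition, not ContextFit's actual tables.

```python
# Illustrative only — marker phrases and type names are assumptions,
# not ContextFit's shipped tables.
MARKERS = {
    "preference":      ("i prefer", "i like", "i love", "i hate"),
    "goal":            ("i want to", "i'm trying to", "my goal is"),
    "constraint":      ("i can't", "i'm allergic to", "i have to avoid"),
    "temporal_update": ("i just", "yesterday i", "as of today"),
}

def memory_types(turn: str) -> set[str]:
    """Tag a user-authored turn via token-level pattern matching."""
    text = turn.lower()
    return {mtype for mtype, phrases in MARKERS.items()
            if any(p in text for p in phrases)}

print(memory_types("I just harvested zucchini from my garden"))
# → {'temporal_update'}
```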
- **Fact extraction.** Deterministic, domain-agnostic fact extraction from user-authored turns. Eight typed primitives — preference, goal, constraint, decision, temporal update, open loop, interest, entity — extracted with zero API calls.
- **Episode scoring.** Numeric session ranking by structural memory-signal type, not vector proximity. Answers "does this session have what this query type needs?" — the right question for vague advice queries.
- **Query routing.** Deterministic, zero-cost dispatch to the right retrieval mode per query (see the sketch after this list). Vague advice → episode scorer. Specific facts → BM25. Temporal state → fusion. No LLM routing, no added latency.
- **Structural reranking.** Ten token-native features re-score BM25 candidates: lexical overlap, behavior-marker alignment, named-entity overlap, question-type slot matching. Surpasses Cohere embeddings at zero cost.
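As promised above, a minimal sketch of what deterministic routing can look like. The cue lists and route names are illustrative assumptions, not ContextFit's actual dispatch rules.

```python
# Simplified routing heuristics — assumptions for exposition,
# not the shipped router's rules.
TEMPORAL_CUES = ("currently", "still", "latest", "as of", "right now")
ADVICE_CUES   = ("should i", "what should", "recommend", "any ideas")

def route(query: str) -> str:
    q = query.lower()
    if any(cue in q for cue in TEMPORAL_CUES):
        return "fusion"         # temporal state: fuse BM25 with episode scores
    if any(cue in q for cue in ADVICE_CUES):
        return "episode_score"  # vague advice: structural episode scorer
    return "bm25"               # specific facts: plain lexical retrieval

print(route("what should I cook for dinner tonight?"))  # → episode_score
```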
Evaluated on a 499-case domain-agnostic agent memory benchmark across 8 behavioral categories and 26 domains — hand-crafted cases plus generated coverage.
| System | R@1 | R@3 | MRR | API cost | Runs locally |
|---|---|---|---|---|---|
| Mem0 (GPT-4o-mini + embed) | 54.4% | 81.0% | 0.716 | LLM + embed | — |
| Cohere embed-english-v3 | 58.7% | 91.4% | 0.751 | embed API | — |
| ContextFit + slot matching | 61.5% | 93.2% | 0.776 | $0 | ✓ (CPU-only) |
| OpenAI text-embedding-3-small | 63.1% | 96.6% | 0.792 | embed API | — |
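R@k and MRR here are the standard retrieval metrics; for reference, a minimal implementation over ranked session lists:

```python
def recall_at_k(ranked_ids: list[str], gold_id: str, k: int) -> float:
    """1.0 if the gold session appears in the top k results, else 0.0."""
    return float(gold_id in ranked_ids[:k])

def mean_reciprocal_rank(cases: list[tuple[list[str], str]]) -> float:
    """Mean of 1/rank of the gold session across benchmark cases."""
    total = 0.0
    for ranked_ids, gold_id in cases:
        if gold_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(gold_id) + 1)
    return total / len(cases)
```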
The token-native architecture isn't just a performance choice — it's a deployment choice. ContextFit runs anywhere a Python process runs.
The index is a directory of flat files — zstd-compressed token arrays, BM25 postings, LSH signatures. No Qdrant, no Pinecone, no pgvector. No service to start, no schema to migrate.
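For illustration, reading one such file needs nothing beyond zstandard and numpy. The filename and dtype below are assumptions about the layout, not a documented format:

```python
import numpy as np
import zstandard as zstd

# Hypothetical file name and dtype — the real on-disk layout may differ.
with open("memory_index/session_tokens.zst", "rb") as f:
    raw = zstd.ZstdDecompressor().decompress(f.read())
token_ids = np.frombuffer(raw, dtype=np.uint32)  # flat token-ID array
```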
Zero PyTorch, zero CUDA, zero model weights. Every operation runs on CPU: BM25 scoring, roaring bitmap intersection, MinHash LSH, episode feature computation, structural reranking.
Total dependency footprint is ~41MB (tiktoken + numpy + pyroaring + datasketch + zstandard). PyTorch alone is 500MB–2GB. Fits in a Lambda function, a slim container, or a mobile app bundle.
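Two of those CPU-side building blocks, shown with the exact dependencies listed above (pyroaring and datasketch); the session keys and token sets are toy values:

```python
from pyroaring import BitMap
from datasketch import MinHash, MinHashLSH

# Roaring bitmap intersection: which sessions contain both tokens?
sessions_with_zucchini = BitMap([3, 17, 42])
sessions_with_garden   = BitMap([17, 42, 99])
print(sessions_with_zucchini & sessions_with_garden)  # → BitMap([17, 42])

# MinHash LSH: near-duplicate session detection, entirely on CPU.
def signature(tokens: list[str]) -> MinHash:
    m = MinHash(num_perm=128)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("s_garden_harvest",
           signature(["harvested", "zucchini", "garden", "tomatoes"]))
print(lsh.query(signature(["harvested", "zucchini", "garden"])))
# → ['s_garden_harvest'] (high estimated Jaccard similarity)
```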
Standard POSIX permissions. The index lives alongside your data — inside an encrypted vault, a git repo, a synced drive. Back it up with rsync. No DB dump procedures, no export formats.
| Property | ContextFit | Embedding + vector DB |
|---|---|---|
| Database | None | Qdrant / Chroma / pgvector |
| GPU | Not required | Recommended for local models |
| Dependency size | ~41MB | 500MB–2GB+ (PyTorch alone) |
| Storage format | Plain files | DB-managed blobs |
| Permissions | POSIX filesystem | DB users / ACLs |
| Offline capable | Yes (default path) | No (API) / Partial (local model) |
| Backup | Any file backup tool | DB dump + vector store export |
| API cost (default) | $0 | Per-embedding call |
| Query latency | 0.4–9ms in-process | 50–500ms+ (embed + vector search) |
Drop in as a memory layer. Query with a natural-language string. Get back ranked session IDs with source-linked evidence.
```python
# Ingest sessions — no API calls, no GPU, ~2.7ms per session
from contextfit import RetrievalEngine

engine = RetrievalEngine()
engine.ingest_sessions(sessions)
engine.save("./memory_index")

# Query — auto-routes to the right retrieval mode
result = engine.query_auto(
    "what should I cook for dinner tonight?",
    top_k=5,
)

# Returns ranked session IDs + route metadata
print(result["route"])        # → episode_score
print(result["session_ids"])  # → ["s_garden_harvest", ...]

# Or use individual modes directly
engine.rank_sessions_by_episode_score(query, top_k=10)
engine.query(query, method="hybrid", top_k=50)
engine.rerank_sessions_by_structure(query, bm25_order, session_texts)
```
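Reloading a persisted index in a fresh process would pair naturally with `save()` above. Note that `RetrievalEngine.load` here is an assumed counterpart, not a confirmed entry point; check the actual API surface:

```python
from contextfit import RetrievalEngine

# Assumed counterpart to engine.save() above — the real loading
# entry point may be named differently.
engine = RetrievalEngine.load("./memory_index")
result = engine.query_auto("did I mention a dinner plan recently?", top_k=3)
```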
MIT licensed. No cloud dependency. No vendor lock-in. Fork it, embed it, ship it.
pip install git+https://github.com/ContextFit/cf.git
Python 3.10+ · CPU only · ~41MB deps · No DB
MIT — use freely in commercial and open-source projects
Issues, PRs, and benchmark contributions welcome
The full technical whitepaper covers the architecture, the four primitives, the 499-case benchmark methodology, feature ablations, per-behavior analysis, and deployment architecture.