Looking for the managed version? Per-tenant Postgres, hallucination detection bundled, dashboard included. Extremis Cloud →
🧠 Extremis · MIT-licensed memory library

Memory that gets smarter the more your agent uses it.

Layered, learning memory for AI agents in two lines of config. Self-host the OSS library, or skip the infrastructure and use the managed version — same engine, your call.

✓ RL-scored retrieval ✓ Hybrid (semantic + BM25) ✓ Knowledge graph ✓ MCP-native (9 tools) ✓ Hallucination detection
agent.py
# pip install extremis
from extremis import Extremis

mem = Extremis()

mem.remember("User is building a WhatsApp AI")
hits = mem.recall("what is the user building?")

for r in hits:
    print(r.memory.content)
    print(r.reason)  # e.g. similarity 0.87 · score +2.0 · used 5× · 3d old

Two ways to run it

Same engine, your choice.

How it works

Three primitives. Everything else is consolidation that runs in the background.

1 mem.remember()

append to fsync'd log + episodic store

mem.remember(
  "user wants the SLA in writing",
  conversation_id="c1",
)
2 mem.recall()

ranked by cosine × RL score × recency

hits = mem.recall("SLA")
# returns ranked results,
# each with .reason and .verification
3 mem.report_outcome()

asymmetric 1.5× weight on negative signals

mem.report_outcome(
  [h.memory.id for h in hits[:2]],
  success=True,
)
Plus a nightly dream pass. A consolidation worker reads the day's episodes and distils durable facts into the semantic layer. On self-host you run it; on Cloud it runs for you.
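
The ranking and reinforcement steps above can be sketched in a few lines. This is a minimal illustration, not the library's actual implementation: `ScoredMemory`, `apply_outcome`, `rank_value`, the `tanh` mapping, and the 30-day half-life are all assumptions made for the example. Only the cosine × RL score × recency product and the 1.5× weight on negative signals come from the description above.

```python
import math
from dataclasses import dataclass

NEGATIVE_WEIGHT = 1.5  # from the text: negative signals weigh 1.5x


@dataclass
class ScoredMemory:
    """Hypothetical record; field names are illustrative only."""
    content: str
    similarity: float      # cosine similarity to the query, in [0, 1]
    rl_score: float = 0.0  # running reinforcement score
    age_days: float = 0.0


def apply_outcome(memory: ScoredMemory, success: bool, step: float = 1.0) -> float:
    """Asymmetric update: a failure costs 1.5x what a success earns."""
    delta = step if success else -NEGATIVE_WEIGHT * step
    memory.rl_score += delta
    return memory.rl_score


def rank_value(memory: ScoredMemory, half_life_days: float = 30.0) -> float:
    """cosine x RL score x recency, with assumed shapes for the last two."""
    # Assumption: squash the RL score into a positive multiplier in (0, 2).
    rl_factor = 1.0 + math.tanh(memory.rl_score / 4.0)
    # Assumption: exponential decay with a 30-day half-life.
    recency = 0.5 ** (memory.age_days / half_life_days)
    return memory.similarity * rl_factor * recency
```

With these assumed shapes, a memory that was rewarded once and then penalised once ends up below where it started, which is the point of the asymmetry: bad recalls fade faster than good ones climb.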

Every recall explains itself

Debuggable by default.

No black box. Every result carries a one-line reason — the same string the library returns whether you self-host or use Cloud. You see exactly why a memory surfaced, in plain English.

example reason strings

  • similarity 0.87 · score +2.0 · used 5× · 3d old
  • identity layer (×2 weight) · matched user's prior preference
  • downranked: judge flagged unverified at write time
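
To make the first example string concrete, here is a rough sketch of how such a reason line could be assembled from its parts. `format_reason` and its parameters are hypothetical names for illustration. The library builds the string for you and its internal formatting may differ.

```python
def format_reason(similarity: float, score: float, uses: int, age_days: int) -> str:
    """Join the ranking signals into a one-line, human-readable reason."""
    parts = [
        f"similarity {similarity:.2f}",
        f"score {score:+.1f}",   # explicit sign, e.g. +2.0 or -1.5
        f"used {uses}×",
        f"{age_days}d old",
    ]
    return " · ".join(parts)


print(format_reason(0.87, 2.0, 5, 3))
# similarity 0.87 · score +2.0 · used 5× · 3d old
```

Because the string is plain text, it drops straight into logs or a debug panel without extra tooling.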

Vs. the alternatives

What sets Extremis apart.

Feature Extremis Mem0 Letta Raw RAG
Layered memory (identity/semantic/episodic/procedural)
RL-scored retrieval (1.5× asymmetric on negatives)
Hybrid retrieval (semantic + BM25)
Per-recall reason strings
Knowledge graph built in partial
Hallucination detection bundled
Self-hostable (MIT)
Managed option available
MCP server (9 tools)

Benchmarks

LongMemEval-S · 500 QA instances · ~53 sessions each.

Hosted Extremis is the identical engine — same numbers, fully managed. Methodology: see the reproducible benchmark run. QA accuracy depends on the answerer model.

94.4% Retrieval R@5 (top-5 includes the answer session)
38.8% QA Accuracy (claude-haiku-4-5 as answerer)
~35ms p50 recall latency (local model · MPS · varies in prod)
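
For readers new to the metric, R@5 here means the share of questions whose top-5 retrieved sessions include an answer session. The sketch below shows that calculation under one assumption: a question counts as a hit if any of its answer sessions appears in the top k. The function name and the toy IDs are made up for illustration and are not benchmark data. The reproducible benchmark run is the authority on the exact definition.

```python
from typing import List, Sequence


def recall_at_k(
    ranked: Sequence[List[str]],
    answers: Sequence[List[str]],
    k: int = 5,
) -> float:
    """Fraction of questions with at least one answer session in the top k."""
    if not ranked:
        raise ValueError("need at least one question")
    hits = sum(
        1
        for top, gold in zip(ranked, answers)
        if set(top[:k]) & set(gold)
    )
    return hits / len(ranked)


# Toy example: two questions, only the first retrieves its answer session.
toy_ranked = [["s3", "s9", "s1"], ["s4", "s2", "s8"]]
toy_answers = [["s9"], ["s7"]]
print(recall_at_k(toy_ranked, toy_answers))  # 0.5
```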

Hallucination detection

Wrong memories are flagged, not stored quietly.

A two-tier verifier runs at write time: a fast NLI model first, then an LLM judge for grey-zone scores. Failing memories aren't silently dropped — they're tagged unverified and downranked at recall time. Every recall returns a verification trace you can inspect.

  • On self-host: configure your own thresholds, pick the NLI model, point at any judge LLM.
  • On Cloud: dashboard surfaces flagged memories as a triage queue and renders the trace tree with red rows on failures.
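
The two-tier flow described above can be sketched as follows. This is an illustration under stated assumptions, not the library's verifier: `verify`, `Verification`, and the three thresholds are hypothetical, and `nli` and `judge` stand in for whatever NLI model and judge LLM you plug in. Each is assumed to return a grounding score between 0 and 1.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # (claim, source) -> grounding score


@dataclass
class Verification:
    status: str  # "verified" or "unverified"
    trace: List[Tuple[str, float]] = field(default_factory=list)


def verify(
    claim: str,
    source: str,
    nli: Scorer,
    judge: Scorer,
    accept: float = 0.8,      # assumed: NLI at or above this passes outright
    reject: float = 0.3,      # assumed: NLI below this fails outright
    judge_pass: float = 0.5,  # assumed: judge threshold for grey-zone cases
) -> Verification:
    trace: List[Tuple[str, float]] = []

    # Tier 1: the fast NLI model settles the clear cases.
    nli_score = nli(claim, source)
    trace.append(("verifier.nli", nli_score))
    if nli_score >= accept:
        return Verification("verified", trace)
    if nli_score < reject:
        return Verification("unverified", trace)

    # Tier 2: only grey-zone scores pay for an LLM judge call.
    judge_score = judge(claim, source)
    trace.append(("verifier.judge", judge_score))
    status = "verified" if judge_score >= judge_pass else "unverified"
    return Verification(status, trace)
```

A memory that comes back `unverified` is kept and tagged rather than dropped, so recall can downrank it and the trace stays available for inspection.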

example: contradicted recall

extremis.recall               124ms ⌐
  embedder.embed             10ms ✓
  retrieve.hybrid            11ms ✓ (semantic + BM25)
  verifier.nli               14ms ⌐ grounded 0.42
  verifier.judge             47ms ⌐ grounded 0.18

why it failed:
  sources self-correct from 99.95% to 99.9%;
  extracted memory captured the pre-correction value.

what to try:
  mem.remember_now(layer="semantic", confidence=0.95)
  to pin the corrected fact.

Privacy

MIT-licensed. No lock-in. Cloud is optional.

The library is open-source under MIT. Self-host on SQLite locally or any of five production backends. If Cloud isn't for you (or shuts down tomorrow), point HostedClient.base_url at your own deployment and nothing else changes. Cloud is a convenience, not a dependency.

Pick the version that fits your stack.

Both run the same engine. Self-host gives you full control; Cloud gives you back the afternoon.