April 1, 2026
We ran four experiments across three fictional domains to measure how well structured knowledge objects preserve information compared to compressed summaries. The results show clear strengths and one serious weakness.
AI coding assistants maintain session memory through context summaries and markdown files. When conversations grow long, these summaries get compressed: specific numbers are dropped, rationale is flattened, constraints are generalized. The assistant retains a rough sense of what happened but loses the details.
This matters in practice. If an assistant forgets that a pipeline's batch size was reduced from 512 to 64 after an OOM crash, it may suggest 512 again. If it forgets that all database writes must go through the validation layer, it won't push back when someone proposes a direct insert.
We tested whether storing knowledge as structured objects (subject-predicate-value triples) preserves information that compressed summaries lose. All experiments use entirely fictional domains with zero prior knowledge in the model's training data.
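As an illustration of what a subject-predicate-value triple might look like, here is a minimal sketch. The class name, the `session` and `rationale` fields, and the `render` helper are assumptions made for this example, not the schema used in the benchmark. The sample fact is the batch-size case described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeObject:
    subject: str         # what the fact is about
    predicate: str       # which property of the subject
    value: str           # the exact value to preserve
    session: int = 0     # hypothetical field: session that recorded the fact
    rationale: str = ""  # hypothetical field: why the decision was made


def render(ko: KnowledgeObject) -> str:
    """Render one knowledge object as a single line of context."""
    line = f"{ko.subject} | {ko.predicate} | {ko.value}"
    if ko.rationale:
        line += f" (because: {ko.rationale})"
    return line


ko = KnowledgeObject("pipeline", "batch_size", "64",
                     session=2, rationale="OOM crash at 512")
print(render(ko))  # prints: pipeline | batch_size | 64 (because: OOM crash at 512)
```

The point of the structure is that the exact value (64) and its rationale travel together, so neither can be flattened away independently.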
Every domain in these experiments is invented. The facts, system names, team members, and technical decisions are all fictional. This ensures the model cannot answer from training data. Baselines confirm this: the no-memory condition scores 10-21%, reflecting pure guessing.
Scoring uses keyword matching. Each question has a set of expected keywords and a minimum number of them that must appear. A question passes only if the response contains at least that many expected terms. This is deliberately conservative: no partial credit is awarded.
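A scorer of this kind can be sketched in a few lines. This is an assumed reconstruction, not the benchmark's actual script: the function names, the case-insensitive substring match, and the sample question are all illustrative.

```python
def passes(response: str, expected_keywords: list[str], min_hits: int) -> bool:
    """Return True if the response contains at least min_hits expected keywords."""
    text = response.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits >= min_hits


def accuracy(results: list[bool]) -> float:
    """Fraction of questions passed; each question is all-or-nothing."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical question: both keywords are required to pass.
keywords = ["64", "OOM"]
good = passes("Batch size was cut to 64 after an OOM crash.", keywords, 2)
vague = passes("The batch size was reduced at some point.", keywords, 2)
print(good, vague, accuracy([good, vague]))  # prints: True False 0.5
```

Plain substring matching is a simplification: "64" would also match inside "640", so a stricter scorer might match on word boundaries instead.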
| Condition | What the model sees | Simulates |
|---|---|---|
| None | Empty context | Fresh session after context loss |
| .md | Compressed summary (~30 lines) | Markdown memory file after compaction |
| KO | All structured knowledge objects | Structured memory loaded at session start |
Domain: Nexus, a fictional bioinformatics pipeline for protein folding validation. 50 invented facts across 5 simulated sessions, 25 questions in 3 categories. All facts are fictional (enzyme names, threshold values, team decisions). 3 seeds, zero prior knowledge.
| Condition | Mean | Std Dev |
|---|---|---|
| None | 21.3% | 5.0% |
| .md | 22.7% | 1.9% |
| KO | 100.0% | 0.0% |
| Category | None | .md | KO |
|---|---|---|---|
| Fact retention | 0% | 0% | 100% |
| Decision recall | 12% | 0% | 100% |
| Constraint | 43% | 71% | 100% |
The .md condition performs almost identically to no memory on fact retention and decision recall. Both score near zero. Only constraint enforcement shows a gap: 71% for .md vs 43% for none, because even vague rules ("always validate inputs") are enough for the model to push back on some violations. KO memory retains everything.
Domain: Helios, a fictional quantum calibration platform for satellite constellations. 40 invented KOs across 4 categories (architecture, calibration science, team processes, operational constraints). 30 questions. Since every fact is fictional, the 0% sharing condition is a true zero-knowledge baseline: whatever the model scores there comes from guessing, not recall.
| Share % | KOs Shared | Accuracy |
|---|---|---|
| 0% | 0 | 13% |
| 10% | 4 | 27% |
| 25% | 10 | 50% |
| 50% | 20 | 67% |
| 75% | 30 | 90% |
| 100% | 40 | 97% |
Constraint enforcement benefits strongly from sharing. At 0% sharing, constraint questions score 30%; at 100% sharing, they reach 100%. Operational rules that the model cannot infer on its own are well served by structured storage.
The single failure at 100% sharing (29 of 30 questions, 97% rather than 100%) indicates the benchmark is not trivially easy.
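The mapping from share percentage to number of KOs shared can be sketched as follows. The write-up does not say how the shared subset was chosen, so the seeded random sample here is an assumption, as are the function and parameter names.

```python
import random


def shared_subset(kos: list, share_pct: float, seed: int = 0) -> list:
    """Pick the subset of KOs visible to the model at a given share percentage."""
    count = round(len(kos) * share_pct / 100)
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return rng.sample(kos, count)


kos = [f"helios-ko-{i}" for i in range(40)]  # placeholder KO identifiers
sizes = [len(shared_subset(kos, pct)) for pct in (0, 10, 25, 50, 75, 100)]
print(sizes)  # prints: [0, 4, 10, 20, 30, 40]
```

The subset sizes match the "KOs Shared" column in the table above.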
Domain: Aether, a fictional IoT fleet management platform. 30 KOs across 3 sessions where decisions evolve and override each other (e.g., session 1 sets a polling interval to 30s, session 3 changes it to 5s after a latency incident). 20 questions in 3 categories.
| Category (questions) | None | .md | KO |
|---|---|---|---|
| Cross-session fact (8) | 0% | 25% | 100% |
| Override detection (7) | 0% | 71% | 100% |
| Synthesis (5) | 40% | 80% | 100% |
| Overall (20) | 10% | 55% | 100% |
On cross-session facts, KO scores 100% vs .md's 25%. These are questions like "What threshold was set in session 1 and then referenced in session 3?" The .md summary compresses away the session-specific details. KO preserves them. The .md condition drops to 55% overall (from 65% in the previous run) due to tighter scoring on synthesis questions.
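The Overall row is consistent with a question-weighted average of the category rows. A quick check, assuming each category percentage is a rounded pass count over the question count shown in parentheses:

```python
# (number of questions, reported pass percentage per condition)
categories = {
    "cross_session_fact": (8, {"none": 0, "md": 25, "ko": 100}),
    "override_detection": (7, {"none": 0, "md": 71, "ko": 100}),
    "synthesis": (5, {"none": 40, "md": 80, "ko": 100}),
}


def overall(condition: str) -> float:
    """Recover the overall percentage from per-category pass counts."""
    passed = sum(round(n * pcts[condition] / 100) for n, pcts in categories.values())
    total = sum(n for n, _ in categories.values())
    return 100 * passed / total


print(overall("none"), overall("md"), overall("ko"))  # prints: 10.0 55.0 100.0
```

For .md this works out to 2 + 5 + 4 = 11 of 20 questions.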
Setup: 50 real Nexus KOs plus increasing numbers of distractor KOs drawn from 5 unrelated fictional domains. All KOs are dumped into context with no retrieval filtering. 25 questions about the Nexus facts only.
| Total KOs | Distractors | Accuracy |
|---|---|---|
| 50 | 0 | 100% |
| 100 | 50 | 92% |
| 250 | 200 | 12% |
| 500 | 450 | 8% |
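The unfiltered setup can be sketched as below. This is a guess at the mechanics, not the benchmark's code: the function name, the one-line-per-KO string format, and the shuffling step are assumptions.

```python
import random


def build_context(real_kos, distractor_pools, n_distractors, seed=0):
    """Mix real KOs with distractors and dump everything into one context string."""
    rng = random.Random(seed)
    pool = [ko for domain in distractor_pools for ko in domain]
    mixed = list(real_kos) + rng.sample(pool, n_distractors)
    rng.shuffle(mixed)  # assumed: no ordering hint about which KOs are relevant
    return "\n".join(mixed)


# Placeholder KOs: 50 real ones and 5 unrelated domains to draw distractors from.
real = [f"nexus | fact_{i} | value_{i}" for i in range(50)]
pools = [[f"domain{d} | fact_{i} | value_{i}" for i in range(100)] for d in range(5)]

context = build_context(real, pools, n_distractors=200)
print(len(context.splitlines()))  # prints: 250
```

Nothing in this path looks at the question, which is why every distractor competes with the relevant facts for the model's attention.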
Structured objects outperform compressed summaries for fact retention and decision recall. The gap is large: KO scores 100% where .md scores 0% on Nexus fact retention and decision recall, and 100% vs 25% on Aether cross-session facts. When you need exact values, thresholds, or rationale, compressed text loses them and structured storage does not.
Deduplication by subject and predicate eliminates the override problem. An earlier version of this benchmark showed .md outperforming KO on override detection (86% vs 43%), because KO stored both the old and the new value and the model could not tell which was current. The fix is straightforward: deduplicate by subject+predicate so only the latest version enters context. After dedup, KO scores 100% on all continuity categories.
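A minimal sketch of that deduplication step, assuming each KO carries a `session` number that orders versions. The dict layout and field names are illustrative, not the benchmark's schema. The sample data is the Aether polling-interval override.

```python
def deduplicate(kos: list[dict]) -> list[dict]:
    """Keep only the most recent KO for each (subject, predicate) pair."""
    latest = {}
    for ko in kos:
        key = (ko["subject"], ko["predicate"])
        # Later sessions win; ties go to the KO seen last.
        if key not in latest or ko["session"] >= latest[key]["session"]:
            latest[key] = ko
    return list(latest.values())


history = [
    {"subject": "fleet", "predicate": "polling_interval", "value": "30s", "session": 1},
    {"subject": "fleet", "predicate": "polling_interval", "value": "5s", "session": 3},
]
current = deduplicate(history)
print([ko["value"] for ko in current])  # prints: ['5s']
```

Only the session 3 value reaches the context, so the model never has to choose between conflicting versions.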
Sharing even a small fraction of knowledge has outsized impact. Going from 0% to 25% sharing nearly quadruples accuracy (13% to 50%). The value curve is steepest at the low end.
Raw context dumping does not scale. Accuracy holds at 92% with 100 KOs in context but collapses to 12% at 250 and 8% at 500. Any system that stores more than ~100 knowledge objects needs selective retrieval, not wholesale injection.
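Selective retrieval can take many forms. As one hedged illustration, the sketch below ranks KOs by word overlap with the question and keeps the top `k`. It is a deliberately simple stand-in (a real system might use embeddings or an index), and neither the function names nor the scoring rule come from the benchmark.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word set; underscores and pipes act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, kos: list[str], k: int = 10) -> list[str]:
    """Return up to k KOs sharing at least one word with the question, best first."""
    q = tokens(question)
    scored = [(len(q & tokens(ko)), ko) for ko in kos]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [ko for _, ko in relevant[:k]]


kos = [
    "pipeline | batch_size | 64",
    "fleet | polling_interval | 5s",
    "database | write_path | validation layer",
]
print(retrieve("What is the pipeline batch size?", kos, k=1))
# prints: ['pipeline | batch_size | 64']
```

Only the retrieved KOs would be injected into context, so the context size is bounded by `k` rather than by the size of the store.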
Benchmark run on Claude Sonnet (claude-sonnet-4-20250514), April 1, 2026. 4 experiments across 3 fictional domains (Nexus, Helios, Aether). All facts invented, zero prior knowledge confirmed by baselines. Full results and reproduction scripts at github.com/coretxinc/coretx-connect.