Metadata-Version: 2.4
Name: agent-episodic-memory
Version: 0.1.0
Summary: Agentic Context Engineering (ACE) middleware for LangChain — self-improving playbooks from the ICLR 2026 paper by Zhang et al.
Project-URL: Homepage, https://github.com/johanity/agent-episodic-memory
Project-URL: Repository, https://github.com/johanity/agent-episodic-memory
Project-URL: Issues, https://github.com/johanity/agent-episodic-memory/issues
Project-URL: Paper, https://arxiv.org/abs/2510.04618
Project-URL: Official ACE Implementation, https://github.com/ace-agent/ace
Author: Johan Bonilla
License-Expression: MIT
License-File: LICENSE
Keywords: ace,agentic-context-engineering,ai-agents,context-engineering,deepagents,delta-updates,evolving-playbook,iclr-2026,langchain,langgraph,llm,middleware,prompt-engineering,self-improving-agents
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: langchain-core>=1.0.0
Requires-Dist: langchain>=1.0.0
Provides-Extra: dev
Requires-Dist: mypy>=1.10.0; extra == 'dev'
Requires-Dist: ruff>=0.5.0; extra == 'dev'
Provides-Extra: test
Requires-Dist: hypothesis>=6.100.0; extra == 'test'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'test'
Requires-Dist: pytest>=8.0.0; extra == 'test'
Description-Content-Type: text/markdown

# agent_episodic_memory

[![PyPI](https://img.shields.io/pypi/v/agent_episodic_memory?label=%20)](https://pypi.org/project/agent_episodic_memory/)
[![License](https://img.shields.io/pypi/l/agent_episodic_memory)](https://opensource.org/licenses/MIT)
[![Python](https://img.shields.io/pypi/pyversions/agent_episodic_memory)](https://pypi.org/project/agent_episodic_memory/)

**Agentic Context Engineering (ACE) as a LangChain middleware.** Your agent learns from every run and stores strategies as an evolving playbook. Based on the [ICLR 2026 paper](https://arxiv.org/abs/2510.04618) by Zhang, Hu et al. (Stanford University, SambaNova Systems, and UC Berkeley).

One import. Self-improving agents. Delta updates instead of full context rewrites. Drop-in `AgentMiddleware` subclass for LangChain v1 `create_agent`.

## What it does

`agent_episodic_memory` treats agent context as an evolving playbook that accumulates strategies across runs. After each run, the middleware reflects on what worked and appends delta entries to the playbook. On the next run, the curated playbook is injected into the system prompt. Over time, the agent's context becomes more useful without full rewrites — the paper's key insight is that structured, incremental updates preserve detail that single-pass summarization erodes.

Three paper components map 1:1 to LangChain v1 middleware hooks:

- **Generator** → `wrap_model_call` — injects the current playbook into the system prompt
- **Reflector** → `after_model` — produces delta entries describing what the model did
- **Curator** → `after_agent` — deduplicates entries by content fingerprint

> **v0.1 scope.** This is an architecture port of the paper's three-component
> structure, not a full reproduction. The default Reflector is a zero-LLM
> heuristic that labels every entry `neutral` — it never infers
> `success`/`failure`, since a heuristic that guessed would poison the
> playbook with confident-wrong answers. The Curator deduplicates by content
> fingerprint only; the paper's grow-and-refine semantic merge is not
> implemented in v0.1. Plug in a real LLM Reflector by subclassing and
> overriding `reflect()` — see [How the three hooks work](#how-the-three-hooks-work).

The middleware is stateless across instances — the playbook lives on agent state under `state["ace_playbook"]`, so it persists across tool calls within a run and can be checkpointed by LangGraph across runs.
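The playbook state can be pictured as a list of small records. A minimal sketch, assuming a frozen-dataclass shape; only the `category`/`content`/`outcome` fields are implied by the docs below (the Curator fingerprints exactly that triple), everything else here is illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeltaEntry:
    """Illustrative shape of one playbook entry (not the package's actual schema)."""
    category: str  # e.g. "tool_use", "final_answer", "observation"
    content: str   # one strategy line
    outcome: str   # "success" | "failure" | "neutral"

# Conceptually, state["ace_playbook"] accumulates a list of such entries:
playbook = [
    DeltaEntry("tool_use", "read_file on config.json succeeded", "success"),
    DeltaEntry("final_answer", "summarize after three or fewer file reads", "neutral"),
]
```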

## Quick install

```bash
pip install agent_episodic_memory
```

```python
from langchain.agents import create_agent
from agent_episodic_memory import ACEMiddleware

agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[...],
    middleware=[ACEMiddleware()],
)
```

That's the whole integration. The playbook is empty on the first run and grows from there.

## vs. existing LangChain context middleware

|  | `agent_episodic_memory` | `SummarizationMiddleware` | `ContextEditingMiddleware` | `compact-middleware` |
|---|:---:|:---:|:---:|:---:|
| Learns across runs | ✅ | ❌ | ❌ | ❌ |
| Delta updates (not full rewrites) | ✅ | ❌ | ❌ | ❌ |
| Structured playbook state | ✅ | ❌ | ❌ | ❌ |
| Architecture from ICLR 2026 paper | ✅ | ❌ | ❌ | ❌ |
| Preserves detail across iterations | ✅ | partial | ❌ | partial |
| Zero LLM calls in the generator hook | ✅ | ❌ | ✅ | ❌ |
| Composes with `SummarizationMiddleware` | ✅ | — | ✅ | ✅ |

## Paper results (reference only — not reproduced by this package)

> **`agent_episodic_memory` is an architecture port, not a paper reproduction.** The
> numbers below are from Zhang et al.'s official implementation against the
> AppWorld / FiNER / Formula harnesses using DeepSeek-V3.1 as the backbone
> and an LLM-based Reflector + grow-and-refine Curator. This package ships
> the Generator/Reflector/Curator hook structure and a zero-LLM default
> Reflector — it does **not** ship the adaptation harness, the benchmark
> suites, or a trained Reflector, and therefore does not produce these
> metrics out of the box. For the canonical implementation, see
> [ace-agent/ace](https://github.com/ace-agent/ace).

From [Zhang et al., ICLR 2026](https://arxiv.org/abs/2510.04618):

- **+10.6%** on the AppWorld agent benchmark
- **+8.6%** on financial reasoning (FiNER + Formula)
- **−86.9%** average adaptation latency
- **−82.3%** latency and **−75.1%** rollouts vs. GEPA (offline AppWorld)
- **−91.5%** latency and **−83.6%** token dollar cost vs. Dynamic Cheatsheet (online FiNER)
- **91.8%** KV cache reuse during evaluation
- **Matches the top-1 ranked IBM CUGA** (60.3%) on AppWorld overall average, despite using the much smaller open-source DeepSeek-V3.1 instead of CUGA's GPT-4.1. With online adaptation, ACE also surpasses IBM CUGA by **8.4% in TGC and 0.7% in SGC** on the test-challenge split.

The IBM CUGA reference is used by the paper as a rough contextual benchmark — not a direct methodological comparison — to show ACE operates in a similar performance range using a much smaller open backbone.

## How the three hooks work

### Generator (`wrap_model_call`)

Before each model call, the middleware reads the current playbook from `state["ace_playbook"]` and injects it into the system message. If an existing system message is present, the playbook is appended; otherwise a new system message is constructed.
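The injection logic can be sketched as follows, with plain dicts standing in for LangChain message objects (the real middleware operates on `SystemMessage` instances, so this is a simplified assumption):

```python
def inject_playbook(messages, rendered_playbook):
    """Append the rendered playbook to an existing system message,
    or prepend a fresh system message when none exists."""
    msgs = list(messages)
    for i, m in enumerate(msgs):
        if m["role"] == "system":
            msgs[i] = {
                "role": "system",
                "content": m["content"] + "\n\n" + rendered_playbook,
            }
            return msgs
    return [{"role": "system", "content": rendered_playbook}] + msgs
```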

The rendered playbook groups entries by category (`tool_use`, `final_answer`, `observation`, …) and marks each with its outcome:

```
<playbook>
  <tool_use>
    [+] read_file on config.json succeeded
    [-] grep without file-type flag returned too many hits
  </tool_use>
  <final_answer>
    [+] summarize after three or fewer file reads
  </final_answer>
</playbook>
```

Outcomes are `[+]` success, `[-]` failure, `[o]` neutral. The system prompt instructs the model to prefer `[+]` patterns and avoid `[-]` patterns when they apply.
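The rendering described above can be sketched in a few lines — grouping `(category, content, outcome)` tuples by category and prefixing each with its outcome marker (the function name and exact indentation are assumptions, not the package's API):

```python
MARK = {"success": "[+]", "failure": "[-]", "neutral": "[o]"}

def render_playbook(entries):
    """Group entries by category and render the <playbook> block."""
    by_cat: dict[str, list[str]] = {}
    for category, content, outcome in entries:
        by_cat.setdefault(category, []).append(f"    {MARK[outcome]} {content}")
    lines = ["<playbook>"]
    for category, items in by_cat.items():
        lines.append(f"  <{category}>")
        lines.extend(items)
        lines.append(f"  </{category}>")
    lines.append("</playbook>")
    return "\n".join(lines)
```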

### Reflector (`after_model`)

After each model call, the middleware inspects the latest `AIMessage`. The default reflector is deterministic and zero-LLM: it categorizes the output as `tool_use` (has tool calls) or `final_answer` (pure text) and appends a `DeltaEntry` with outcome `neutral` — **it deliberately never claims `success` or `failure`**. A heuristic cannot know whether a final answer was correct, and labeling every final answer `success` would poison the playbook with confident-wrong entries that then feed forward into every future run. The paper explicitly warns against this failure mode.

Override the public `reflect()` method in a subclass to plug in an LLM-based Reflector that produces real `success`/`failure` labels. For example, a cheaper model that rates whether the step made progress toward the task goal, or a judge model that compares the final answer against a ground-truth signal. See the paper §4 for the full Reflector design the authors ran against AppWorld and FiNER.
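A sketch of the override pattern. The stub base class stands in for `ACEMiddleware` so the example runs standalone, the `reflect()` signature is an assumption, and the keyword check is a deliberately trivial placeholder for a real LLM judge call:

```python
class StubACEMiddleware:
    """Stand-in for ACEMiddleware; the real default labels everything neutral."""
    def reflect(self, ai_message):
        return {"category": "final_answer",
                "content": ai_message["content"],
                "outcome": "neutral"}

class JudgedACEMiddleware(StubACEMiddleware):
    """Override reflect() to assign real outcomes. A keyword heuristic
    stands in here for an LLM judge; do not ship this heuristic — it is
    exactly the confident-wrong trap the default avoids."""
    def reflect(self, ai_message):
        entry = super().reflect(ai_message)
        text = ai_message["content"].lower()
        if "error" in text or "traceback" in text:
            entry["outcome"] = "failure"
        return entry
```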

### Curator (`after_agent`)

At the end of the run, the Curator merges the accumulated `ace_pending_deltas` into the playbook. Entries are deduplicated by a content-hash fingerprint over `(category, content, outcome)`. **This is exact-match dedup only — the paper's grow-and-refine semantic merge is not implemented in v0.1.** Two near-identical strategies that differ by one token will coexist in the playbook as separate entries. The updated playbook is written back to state and is ready for the next run.
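The dedup step can be sketched as a hash over the `(category, content, outcome)` triple; the choice of SHA-256 and separator here is an assumption for illustration:

```python
import hashlib

def fingerprint(category, content, outcome):
    """Exact-match content fingerprint over (category, content, outcome)."""
    raw = "\x1f".join((category, content, outcome)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def merge(playbook, pending):
    """Append pending deltas whose fingerprint is not already in the playbook."""
    seen = {fingerprint(*e) for e in playbook}
    merged = list(playbook)
    for entry in pending:
        fp = fingerprint(*entry)
        if fp not in seen:
            seen.add(fp)
            merged.append(entry)
    return merged
```

Note that `("tool_use", "read file", "neutral")` and `("tool_use", "read a file", "neutral")` hash differently and both survive — exactly the one-token-apart coexistence described above.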

## Composition with other middleware

`ACEMiddleware` is designed to compose with LangChain's built-in middleware:

```python
from langchain.agents import create_agent
from langchain.agents.middleware import (
    HumanInTheLoopMiddleware,
    ModelFallbackMiddleware,
    SummarizationMiddleware,
    ToolRetryMiddleware,
)
from agent_episodic_memory import ACEMiddleware

agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[...],
    middleware=[
        ACEMiddleware(),              # evolves the playbook
        SummarizationMiddleware(...), # compacts history when long
        ToolRetryMiddleware(),        # retries flaky tools
        ModelFallbackMiddleware(...), # falls back on model errors
        HumanInTheLoopMiddleware(...),# gates sensitive tool calls
    ],
)
```

`ACEMiddleware` is designed to run first in the chain so the playbook is injected into the system message before any compaction or retry logic modifies the request.

## Multi-tenant deployments

> ⚠️ **The playbook is scoped to the LangGraph `thread_id`, not to any tenant or user concept.**

ACE stores the evolving playbook under `state["ace_playbook"]`, which is checkpointed per-thread by LangGraph. If two different users share a `thread_id` (even by accident — e.g. a deployment that reuses thread ids across anonymous sessions, or a supervisor that forks state across subagents), they will share the same playbook, and strategies learned from one user's runs will be injected into the other user's system prompts.

For any multi-tenant deployment:

- **Scope `thread_id` per tenant.** Use `thread_id = f"{tenant_id}:{conversation_id}"` or similar so the checkpointer can never accidentally cross-pollinate.
- **Do not share the same checkpointer thread across users.** Even for anonymous traffic, mint a fresh `thread_id` per session.
- **Consider wiping the playbook between logical contexts** if you have any scenario where cross-run learning is undesirable (e.g. one-shot question answering with no continuity).
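The per-tenant scoping above can be sketched as a tiny helper; `make_thread_id` is a hypothetical name (not shipped by this package), while the `{"configurable": {"thread_id": ...}}` config shape is standard LangGraph:

```python
def make_thread_id(tenant_id: str, conversation_id: str) -> str:
    """Hypothetical helper: namespace the LangGraph thread id per tenant
    so checkpointed playbooks can never cross tenants."""
    return f"{tenant_id}:{conversation_id}"

config = {"configurable": {"thread_id": make_thread_id("acme", "conv-42")}}
# passed as e.g. agent.invoke({"messages": [...]}, config=config)
```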

v0.1 does not ship a built-in namespace primitive for the playbook. If you need strict per-user memory isolation with shared infrastructure, use [langmem](https://github.com/langchain-ai/langmem) alongside ACEMiddleware — langmem provides namespace-scoped memory out of the box, and ACE composes with it cleanly.

## Limitations

Direct from the paper's *Limitations and Challenges* section:

> ACE's reliance on a reasonably strong Reflector: if the Reflector fails to extract meaningful insights from generated traces or outcomes, the constructed context may become noisy or even harmful.

In practice this means the default deterministic reflector (which categorizes by structural signals, not semantic understanding, and never assigns `success`/`failure`) can only organize the playbook, not steer it toward winning strategies. Tasks that need real outcome labels — nuanced interpretation of intent, judging answer correctness — require an LLM-based Reflector: subclass and override `reflect()` (see above).

The paper also notes that ACE is most beneficial in settings that demand detailed domain knowledge, complex tool use, or environment-specific strategies — not tasks already covered by base model weights or simple system prompts.

## Relationship to the official ACE implementation

The official paper repository is at [ace-agent/ace](https://github.com/ace-agent/ace) and contains the full offline/online adaptation framework the authors used for evaluation, including the AppWorld, FiNER, Formula, medical, and Text-to-SQL experiments.

`agent_episodic_memory` is a LangChain v1 middleware port of the paper's Generator/Reflector/Curator architecture. It does not reimplement the adaptation harness or the benchmark suites — the goal is to make the paper's context-engineering pattern available as a drop-in middleware in existing LangChain agents.

There is also [kayba-ai/agentic-context-engine](https://github.com/kayba-ai/agentic-context-engine), which provides a separate `ACERunner` framework that wraps LangChain Runnables from the outside. That implementation is complementary: it owns the run loop; `agent_episodic_memory` plugs into LangChain's existing run loop as a middleware subclass.

## Citation

If you use `agent_episodic_memory` in research, please cite the original paper:

```bibtex
@inproceedings{zhang2026ace,
  title={Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models},
  author={Zhang, Qizheng and Hu, Changran and Upasani, Shubhangi and Ma, Boyuan and Hong, Fenglu and Kamanuru, Vamsidhar and Rainton, Jay and Wu, Chen and Ji, Mengmeng and Li, Hanchen and Thakker, Urmish and Zou, James and Olukotun, Kunle},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.04618}
}
```

## Links

- **Paper (arXiv):** https://arxiv.org/abs/2510.04618
- **Official ACE implementation:** https://github.com/ace-agent/ace
- **ICLR 2026 poster page:** https://iclr.cc/virtual/2026/poster/10008343
- **LangChain middleware docs:** https://docs.langchain.com/oss/python/langchain/middleware
- **Issues:** https://github.com/johanity/agent-episodic-memory/issues

## License

MIT. See [LICENSE](LICENSE).
