jeevesagent.architecture.reflexion¶
Reflexion: verbal reinforcement learning via memory.
Shinn et al. 2023 — Reflexion: Language Agents with Verbal Reinforcement Learning. After each attempt, an evaluator scores the output. If the score falls below the threshold, a reflector produces a single-sentence “lesson”: written advice the agent can read on its next attempt.
Lesson storage modes¶
Two storage modes for the persisted lessons:

- Monotonic block (legacy default). Every lesson is appended to memory.<lessons_block_name> and shown to the agent on every subsequent attempt. Simple, but bloats context as lessons accumulate.
- Selective recall (recommended). Pass a VectorStore as lesson_store. Lessons are stored as embedded chunks; before each attempt, only the top-k most relevant lessons for the current task are retrieved and surfaced. This avoids context bloat and keeps advice scoped to where it applies. Pair with InMemoryVectorStore for in-process use, or PostgresVectorStore for cross-session learning.
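As a rough illustration of selective recall, here is a toy store — not the jeevesagent VectorStore API — that embeds lessons as bag-of-words vectors and returns the top-k most similar ones for a prompt. A real store uses learned embeddings, but the retrieval shape is the same:

```python
# Toy lesson store: bag-of-words "embeddings" + cosine top-k retrieval.
# ToyLessonStore is a hypothetical stand-in, not a jeevesagent class.
import math
from collections import Counter

class ToyLessonStore:
    def __init__(self):
        self._lessons: list = []  # (embedded key, lesson text) pairs

    @staticmethod
    def _embed(text: str) -> Counter:
        return Counter(text.lower().split())

    def add(self, key: str, lesson: str) -> None:
        # Keyed by the failing prompt so future recall can find it.
        self._lessons.append((self._embed(key), lesson))

    def query(self, prompt: str, top_k: int = 5) -> list:
        q = self._embed(prompt)

        def cosine(v: Counter) -> float:
            dot = sum(q[t] * v[t] for t in q)
            norm = math.sqrt(sum(c * c for c in q.values())) * math.sqrt(
                sum(c * c for c in v.values())
            )
            return dot / norm if norm else 0.0

        ranked = sorted(self._lessons, key=lambda kv: cosine(kv[0]), reverse=True)
        return [lesson for _, lesson in ranked[:top_k]]

store = ToyLessonStore()
store.add("extract dates from the email", "Normalize dates to ISO 8601.")
store.add("summarize the meeting notes", "Keep summaries under 100 words.")
recalled = store.query("extract all dates from this email thread", top_k=1)
```

Only the date-extraction lesson surfaces for the date-extraction prompt; the summarization lesson stays out of context.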
Pattern¶
For each attempt up to max_attempts:

1. Recall (selective-recall mode only): query lesson_store with the current prompt; write the top-k results into the working memory block for this attempt.
2. Run base architecture (default ReAct).
3. Evaluate. A text-only model call scores the output (0-1).
4. Threshold check. If score >= threshold, terminate.
5. Max-attempts check. If we’ve hit the cap, terminate.
6. Reflect. A text-only model call produces a single sentence identifying what went wrong.
7. Persist. Append (legacy) or add to lesson_store (selective recall) — keyed by the failing prompt so future recall can find it.
8. Reset. Clear session.messages so the base re-seeds its context. Cumulative usage and turn count carry across attempts.
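The loop above can be sketched in plain Python, with the model calls stubbed out as ordinary functions. run_base, evaluate, and reflect here are hypothetical stand-ins, not jeevesagent APIs:

```python
def reflexion_loop(prompt, run_base, evaluate, reflect,
                   max_attempts=3, threshold=0.8):
    """Retry loop: run, score, and either stop or reflect and retry."""
    lessons = []
    output, score = None, 0.0
    for attempt in range(1, max_attempts + 1):
        output = run_base(prompt, lessons)   # base architecture, lessons in context
        score = evaluate(prompt, output)     # evaluator call, returns 0-1
        if score >= threshold:               # threshold check: success
            break
        if attempt == max_attempts:          # max-attempts check: give up
            break
        lessons.append(reflect(prompt, output))  # reflect, then persist the lesson
        # reset: the real implementation clears session.messages here
    return output, score, attempt

# Stub functions: the agent succeeds only once it has a lesson to read.
calls = []
def run_base(prompt, lessons):
    calls.append(list(lessons))
    return "good answer" if lessons else "bad answer"

def evaluate(prompt, output):
    return 1.0 if output == "good answer" else 0.2

def reflect(prompt, output):
    return "Check the answer against the task before finishing."

output, score, attempts = reflexion_loop("task", run_base, evaluate, reflect)
```

With these stubs the first attempt fails, a lesson is persisted, and the second attempt passes the threshold, so the loop terminates after two attempts.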
Strengths¶
- Cross-session learning when paired with a persistent memory backend (legacy) or a persistent vector store (selective recall).
- Wraps any base that reads memory.working().
- Cheap: 1 evaluator + 1 reflector call per failed attempt.
Weaknesses¶
- Same-model evaluation. Self-grading is biased; the score may not match human judgment.
- Score parsing is best-effort. Falls back to 0.0 on parse failure (treated as a failed attempt).
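For illustration, a best-effort parser for the score: <number> first-line format that DEFAULT_EVALUATOR_PROMPT mandates might look like this. This is a sketch of the fallback behavior described above, not the module's actual parser:

```python
import re

def parse_score(evaluator_output: str) -> float:
    """Best-effort parse of 'score: <number between 0 and 1>'."""
    match = re.search(r"score:\s*([01](?:\.\d+)?)", evaluator_output, re.IGNORECASE)
    if match is None:
        return 0.0  # parse failure is treated as a failed attempt
    # Clamp defensively to the documented 0-1 range.
    return min(max(float(match.group(1)), 0.0), 1.0)

parse_score("score: 0.85\nGood coverage, minor gaps.")  # -> 0.85
parse_score("I think it went well!")                    # -> 0.0
```

The 0.0 fallback is conservative: an evaluator that fails to follow the output format forces another attempt rather than silently passing.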
Attributes¶
DEFAULT_EVALUATOR_PROMPT, DEFAULT_REFLECTOR_PROMPT
Classes¶
Reflexion: Wrap a base architecture with evaluator + reflector + lesson memory.
Module Contents¶
- class jeevesagent.architecture.reflexion.Reflexion(*, base: jeevesagent.architecture.base.Architecture | None = None, max_attempts: int = 3, threshold: float = 0.8, evaluator_prompt: str | None = None, reflector_prompt: str | None = None, lessons_block_name: str = 'reflexion_lessons', lesson_store: jeevesagent.vectorstore.base.VectorStore | None = None, top_k_lessons: int = 5)[source]¶
Wrap a base architecture with evaluator + reflector + lesson memory.
See module docstring for the full mechanism. Constructor parameters:

- base — architecture to retry. Default ReAct.
- max_attempts — cap on retries within a single run. Default 3.
- threshold — minimum evaluator score to terminate as success. Default 0.8.
- evaluator_prompt / reflector_prompt — override the default system prompts.
- lessons_block_name — memory working-block name for persisted lessons. Default "reflexion_lessons". Multiple Reflexion-wrapped agents in the same memory should pick distinct names.
- lesson_store — optional VectorStore enabling selective recall. When set, lessons are stored as embedded chunks and only the top_k_lessons most relevant lessons are surfaced on each attempt (instead of all past lessons). Avoids context bloat as lessons accumulate.
- top_k_lessons — how many lessons to recall per attempt (selective-recall mode only). Default 5.
- declared_workers() dict[str, jeevesagent.agent.api.Agent][source]¶
- async run(session: jeevesagent.architecture.base.AgentSession, deps: jeevesagent.architecture.base.Dependencies, prompt: str) collections.abc.AsyncIterator[jeevesagent.core.types.Event][source]¶
- name = 'reflexion'¶
- jeevesagent.architecture.reflexion.DEFAULT_EVALUATOR_PROMPT = Multiline-String¶
"""You are an evaluator scoring an agent's output against a task.

Score the output from 0.0 (completely failed) to 1.0 (fully successful).

Be calibrated:
- 1.0 = task is fully solved with no issues
- 0.7-0.9 = mostly correct, minor gaps
- 0.4-0.6 = partially correct, significant gaps
- 0.0-0.3 = wrong or missing key components

Output exactly one line in this format:
score: <number between 0 and 1>

Then on subsequent lines, briefly justify the score. The first line must match the score format exactly so it can be parsed."""
- jeevesagent.architecture.reflexion.DEFAULT_REFLECTOR_PROMPT = Multiline-String¶
"""You are a reflector that produces lessons for an agent that just fell short on a task.

Read the original task and the agent's failed attempt. Produce ONE sentence describing the most important thing the agent should do differently next time.

Be specific and concrete:
- Bad: "Be more careful."
- Good: "When asked to extract dates, always normalize to ISO 8601 format before returning."

Output ONLY the single sentence — no preamble, no list."""