Phase 1 built the generator. Phase 2 added episodic memory as a tool. Phase 3 closes the quality feedback loop: every generated answer is graded by a separate LLM judge (Prometheus), the score is persisted back to the engram that MemoryAgent already wrote, and the score is visible inline in the UI.
The key question from the Phase 2 retrospective: who receives the critique and updates memory? The answer is MemoryAgent, which already owns the episodic store. Phase 3 adds a second subscription — to critique.result — so it can annotate the engram it already wrote with the critic's score.
CriticAgent grades each answer published on response.generation on an absolute 1–5 scale. Pairwise comparison (RespondentA vs. RespondentB) requires a second generator instance and is deferred to Phase 4+. Using Prometheus for absolute scoring is pragmatic while there is only one respondent.

Each score is stored in the engram's metadata (critic_score). Over time this builds a quality-annotated episodic store. Phase 4 will use scores to weight retrieval and route reward signals to producing agents.

Phase 3 is complete when Story S6 passes against the live stack: a generated response triggers critique.result containing an integer score 1–5, the score is visible in the UI below the response, and MemoryAgent has updated the engram with that score.
New files:

- src/local/agents/critic_agent.py
- src/local/agents/critic_states.py
- src/local/agents/critic_actions.py
- src/local/agents/critic_transitions.py
- config/critic.yaml
- tests/test_critic_agent.py
- tests/stories/s6_critic_absolute.yaml

Modified files:

- run_local.py: add _start_critic() thread
- src/local/agents/memory_agent.py: subscribe to critique.result, update engram score
- src/local/agents/memory_agent_states.py: add UPDATING_SCORE state
- src/local/agents/memory_agent_actions.py: add UPDATE_SCORE action
- src/local/agents/memory_agent_transitions.py: add score update transitions
- src/local/ui/main_window.py: subscribe to critique.result, show score badge

No change needed: subjects.py already defines CRITIQUE = "critique.result".
1. GeneratorAgent publishes response.generation (payload: query, answer, session_id, query_id).
2. MemoryAgent receives response.generation → writes engram to ChromaDB with {"query_id": ..., "intent": ..., "entities": ...} metadata. New in Phase 3: the written document's ID equals query_id so it can be updated later.
3. CriticAgent receives response.generation → calls Prometheus via ollama.chat() with the absolute-grading prompt → parses score (1–5) and feedback text from the output.
4. CriticAgent publishes critique.result: {"score": 4, "feedback": "...", "session_id": ..., "query_id": ..., "query": ..., "answer": ...}
5. MemoryAgent receives critique.result → calls memory_service.update_engram_score(query_id, score) → updates the critic_score metadata field on the existing ChromaDB document.
6. The UI receives critique.result → finds the response widget by query_id → renders a small score badge (e.g., "● 4/5") below the answer text. The badge appears asynchronously after the answer is already shown.

CriticAgent states:

| State | Meaning |
|---|---|
| IDLE | Waiting for response.generation |
| RECEIVING | Envelope consumed, building prompt |
| GRADING | Prometheus ollama.chat() in flight |
| PUBLISHING | Publishing critique.result |
| ERROR | Prometheus call failed or score parse failed |
Actions: RECEIVE → START_GRADE → PUBLISH → RESET / FAIL → RESET
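The states and actions above can be sketched as a small lookup table. This is only an illustration: the enum and function names are hypothetical, the actual layout of critic_states.py, critic_actions.py and critic_transitions.py is not specified here, and the sketch assumes FAIL is legal from any in-flight state.

```python
from enum import Enum, auto


class CriticState(Enum):
    IDLE = auto()
    RECEIVING = auto()
    GRADING = auto()
    PUBLISHING = auto()
    ERROR = auto()


class CriticAction(Enum):
    RECEIVE = auto()
    START_GRADE = auto()
    PUBLISH = auto()
    RESET = auto()
    FAIL = auto()


# (current state, action) -> next state.
# Assumption: FAIL may occur in any in-flight state; the plan only
# states the path "FAIL -> RESET".
TRANSITIONS = {
    (CriticState.IDLE, CriticAction.RECEIVE): CriticState.RECEIVING,
    (CriticState.RECEIVING, CriticAction.START_GRADE): CriticState.GRADING,
    (CriticState.GRADING, CriticAction.PUBLISH): CriticState.PUBLISHING,
    (CriticState.PUBLISHING, CriticAction.RESET): CriticState.IDLE,
    (CriticState.RECEIVING, CriticAction.FAIL): CriticState.ERROR,
    (CriticState.GRADING, CriticAction.FAIL): CriticState.ERROR,
    (CriticState.PUBLISHING, CriticAction.FAIL): CriticState.ERROR,
    (CriticState.ERROR, CriticAction.RESET): CriticState.IDLE,
}


def next_state(state, action):
    """Return the next state, or raise ValueError for an illegal transition."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"illegal transition: {state.name} + {action.name}"
        ) from None
```

Keeping the table as data makes the happy path and the error path easy to assert in unit tests without starting the agent.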
Prometheus expects a specific format. The rubric is defined in config/critic.yaml and injected at runtime. CriticAgent never hardcodes the rubric in Python.
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a
score rubric representing evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly
based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5.
You should refer to the score rubric.
3. Output format: "Feedback: (write a feedback) [RESULT] (integer 1-5)"
4. Do not generate any other opening, closing, or explanations.
###The instruction to evaluate:
{query}
###Response to evaluate:
{answer}
###Score Rubrics:
{rubric}
###Feedback:
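One way to fill the template above at runtime is plain str.format. The sketch below is illustrative only: PROMPT_TEMPLATE and build_prompt are hypothetical names, and it assumes the rubric string has already been read from config/critic.yaml.

```python
PROMPT_TEMPLATE = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a
score rubric representing evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly
based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5.
You should refer to the score rubric.
3. Output format: "Feedback: (write a feedback) [RESULT] (integer 1-5)"
4. Do not generate any other opening, closing, or explanations.
###The instruction to evaluate:
{query}
###Response to evaluate:
{answer}
###Score Rubrics:
{rubric}
###Feedback:"""


def build_prompt(query, answer, rubric):
    """Fill the grading template.

    str.format does not re-interpret braces inside the substituted values,
    so answers that contain "{" or "}" (JSON, code) pass through unchanged.
    """
    return PROMPT_TEMPLATE.format(
        query=query,
        answer=answer,
        rubric=rubric.strip(),  # YAML block scalars usually end with a newline
    )
```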
    import re

    m = re.search(r'\[RESULT\]\s*([1-5])', text)
    score = int(m.group(1)) if m else None
If score is None (parse failure or timeout), CriticAgent publishes critique.result with "score": null. MemoryAgent skips the update if score is null. UI shows no badge.
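A slightly fuller sketch of the parsing step, which also extracts the feedback text and builds the critique.result payload with a null score on failure. The helper names parse_critique and build_payload are hypothetical; the payload fields follow the ones listed in this plan.

```python
import re

_RESULT_RE = re.compile(r"\[RESULT\]\s*([1-5])")


def parse_critique(text):
    """Split raw Prometheus output into (score, feedback).

    score is None when no "[RESULT] <1-5>" marker is found.
    """
    text = text or ""
    match = _RESULT_RE.search(text)
    if match is None:
        return None, text.strip()
    feedback = text[: match.start()].strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    return int(match.group(1)), feedback


def build_payload(raw_output, query, answer, session_id, query_id):
    """Build the critique.result payload; a parse failure yields score None."""
    score, feedback = parse_critique(raw_output)
    return {
        "score": score,  # None serialises to JSON null
        "feedback": feedback,
        "session_id": session_id,
        "query_id": query_id,
        "query": query,
        "answer": answer,
    }
```

Publishing the payload even when the score is None keeps downstream consumers on a single code path: they check one field instead of handling a missing event.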
    model: "prometheus:7b"   # must be available in Ollama: ollama pull prometheus:7b
    temperature: 0.0
    num_ctx: 4096
    grade_timeout: 30        # seconds; Prometheus can be slow
    rubric: |
      [Is the response accurate, helpful, and well-reasoned?]
      Score 1: The response is incorrect, harmful, or completely unhelpful.
      Score 2: The response is mostly wrong or missing important information.
      Score 3: The response is partially correct but incomplete or unclear.
      Score 4: The response is mostly correct with minor gaps.
      Score 5: The response is accurate, complete, and clearly explained.
Prerequisite: run ollama pull prometheus:7b (or the correct GGUF tag for Prometheus-7B-v2.0-Q4_K_M). The model name in config/critic.yaml must match exactly what ollama list shows. If the model is missing, CriticAgent logs an error on every request and publishes "score": null; the rest of the system continues normally.

MemoryAgent currently subscribes to response.generation only. Phase 3 adds a second subscription to critique.result. When a critique arrives, MemoryAgent looks up the engram by query_id and adds critic_score to its metadata.
MemoryService must store each engram with id=query_id. This allows update_engram_score(query_id, score) to do a direct document update rather than a search. This is a change to MemoryService.write_episodic() — it must accept and use the query_id as the ChromaDB document ID.
ChromaDB update() replaces the entire metadata dict — it does not merge. The existing engram metadata includes type, query, timestamp, intent, and entities. A bare update with only {"critic_score": score} would wipe them, breaking the where={"type": "episodic"} filter in search_episodic(). The implementation must read the existing metadata first, merge critic_score in, then update.
    def update_engram_score(self, query_id: str, score: int) -> None:
        """Merge critic_score into an existing engram's metadata."""
        result = self._collection.get(ids=[query_id])
        if not result["ids"]:
            logger.warning("MemoryService: engram %s not found, skipping score update", query_id)
            return
        existing_meta = (result["metadatas"] or [{}])[0]
        merged = {**existing_meta, "critic_score": score}
        self._collection.update(ids=[query_id], metadatas=[merged])
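The read-merge-update pattern can be exercised without ChromaDB by using an in-memory stand-in. FakeCollection below is a hypothetical test double, not part of any library: it implements only the calls used here and deliberately replaces metadata on update(), mirroring the behaviour this plan assumes.

```python
class FakeCollection:
    """In-memory stand-in for the engram collection (hypothetical test double)."""

    def __init__(self):
        self._metadatas = {}

    def add(self, ids, metadatas):
        for doc_id, meta in zip(ids, metadatas):
            self._metadatas[doc_id] = dict(meta)

    def get(self, ids):
        found = [i for i in ids if i in self._metadatas]
        return {
            "ids": found,
            "metadatas": [dict(self._metadatas[i]) for i in found],
        }

    def update(self, ids, metadatas):
        for doc_id, meta in zip(ids, metadatas):
            self._metadatas[doc_id] = dict(meta)  # replace, do not merge


def update_engram_score(collection, query_id, score):
    """Read existing metadata, merge critic_score in, write the merged dict back.

    Returns False when the engram does not exist.
    """
    result = collection.get(ids=[query_id])
    if not result["ids"]:
        return False
    existing = (result["metadatas"] or [{}])[0] or {}
    merged = {**existing, "critic_score": score}
    collection.update(ids=[query_id], metadatas=[merged])
    return True
```

A test that writes an engram with type "episodic", updates the score, and reads the metadata back confirms that the original fields survive the update.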
Add state UPDATING_SCORE and action UPDATE_SCORE. The transition path: IDLE + UPDATE_SCORE → UPDATING_SCORE → IDLE. This is a fast path (no LLM call) — just a ChromaDB metadata update.
The score arrives after the answer is already displayed. The UI must update the existing response widget, not append a new message. Each response widget is indexed by query_id in a dict on the conversation panel.
| Score | Color | Label |
|---|---|---|
| 5 | green (#22c55e) | ● 5/5 |
| 4 | green (#22c55e) | ● 4/5 |
| 3 | amber (#f59e0b) | ● 3/5 |
| 2 | red (#ef4444) | ● 2/5 |
| 1 | red (#ef4444) | ● 1/5 |
| null | gray | (hidden — no badge) |
Badge appears as a small right-aligned indicator at the bottom of the response bubble. Hovering shows the full feedback text in a tooltip.
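The badge table above maps directly onto a small pure function, which keeps the colour logic testable outside the UI. The name badge_for is hypothetical; the widget code would call something like it from set_score().

```python
def badge_for(score):
    """Return (label, hex_color) for a critic score, or None to hide the badge."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    if not 1 <= score <= 5:
        return None
    if score >= 4:
        color = "#22c55e"  # green
    elif score == 3:
        color = "#f59e0b"  # amber
    else:
        color = "#ef4444"  # red
    return f"● {score}/5", color
```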
The existing UI routes all bus events through BusLogger signals (response = Signal(dict), thinking_chunk = Signal(dict), etc.). BusLogger already has a CRITIQUE branch in log_envelope() that formats a plain text message — it must be upgraded to emit a typed signal instead, following the same pattern as response. Do not add a new signal on MainWindow directly.
    # In BusLogger: add alongside the existing response = Signal(dict):
    critique = Signal(dict)

    # In log_envelope(), replace the CRITIQUE text branch:
    elif subject == CRITIQUE:
        self.critique.emit({
            "score": raw.get("score"),
            "feedback": raw.get("feedback", ""),
            "query_id": raw.get("query_id", ""),
        })
        return  # do not also emit message()

    # In MainWindow._start_bus_monitor(): add alongside response.connect:
    self._bus_logger.critique.connect(self._on_critique)

    # _on_critique: looks up widget in self._pending by query_id, calls widget.set_score(score, feedback)
Note: the existing CRITIQUE log branch reads raw.get("verdict") — this must change to raw.get("feedback") when upgrading the branch. The payload field is feedback everywhere in this plan.
Step 1: Write critic_states.py, critic_actions.py, critic_transitions.py, config/critic.yaml, and the core grading logic in critic_agent.py. Unit test _grade(query, answer) with a mocked OllamaBackend: assert that a response containing [RESULT] 4 returns score 4, and that a malformed response returns None. Do not make a real Prometheus call in unit tests; that belongs in the integration story (step 6).
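The unit tests described above might look roughly like this. The sketch assumes a backend object with a chat(prompt) method that returns the model's text; the real OllamaBackend interface and the _grade signature may differ, so grade() here is a simplified, hypothetical stand-in.

```python
import re
from unittest.mock import Mock


def grade(backend, prompt):
    """Return an int score 1-5, or None on backend failure or parse failure."""
    try:
        text = backend.chat(prompt)  # hypothetical backend interface
    except Exception:
        return None  # never let a grading failure propagate
    match = re.search(r"\[RESULT\]\s*([1-5])", text or "")
    return int(match.group(1)) if match else None


def test_grade_parses_score():
    backend = Mock()
    backend.chat.return_value = "Feedback: Mostly correct. [RESULT] 4"
    assert grade(backend, "prompt") == 4
    backend.chat.assert_called_once_with("prompt")


def test_grade_returns_none_on_malformed_output():
    backend = Mock()
    backend.chat.return_value = "I think this answer is fine."
    assert grade(backend, "prompt") is None


def test_grade_returns_none_on_backend_error():
    backend = Mock()
    backend.chat.side_effect = TimeoutError("model too slow")
    assert grade(backend, "prompt") is None
```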
Step 2: Do this before wiring CriticAgent to the bus, because the full end-to-end loop requires stable engram IDs. Update write_episodic() to accept an optional query_id parameter and use it as the ChromaDB document ID (falling back to uuid4() if absent). Add update_engram_score(query_id, score) with the read-merge-update pattern (see MemoryAgent section above). Unit test: write an engram with a known query_id, call update_engram_score, then get() the document and assert critic_score is set and type="episodic" is preserved. Test the missing-ID case: it logs a warning and returns cleanly. Delete .chroma/ before the first run after this change.
Step 3: Add _start_critic() to run_local.py. CriticAgent subscribes to response.generation and publishes critique.result (payload fields: score, feedback, query_id, session_id). Start it after memory_agent. Verify manually: start the stack, ask a question, confirm critique.result appears in the bus monitor with a numeric score. With MemoryService already updated (step 2), the full path from generation → engram write → score annotation can now complete.
Step 4: Pass query_id from the response.generation payload to write_episodic() (step 2 already supports it). Add CRITIQUE to MemoryAgent's subscription list. Add _handle_critique(envelope), which reads query_id and score from the payload and calls memory_service.update_engram_score(). Update the state machine with the UPDATING_SCORE state and UPDATE_SCORE action. Unit test: call _handle_critique directly with a fake envelope and assert update_engram_score was called with the right args (mock MemoryService).
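A minimal sketch of the critique handler and its unit test. It is written as a free function taking the payload dict directly, which is an assumption: the real _handle_critique receives an envelope whose shape is not specified here.

```python
from unittest.mock import Mock


def handle_critique(payload, memory_service):
    """Forward a critique score to the memory service.

    Skips the update (returns False) when the score is null or the
    query_id is missing.
    """
    query_id = payload.get("query_id")
    score = payload.get("score")
    if not query_id or score is None:
        return False
    memory_service.update_engram_score(query_id, score)
    return True


def test_handle_critique_updates_score():
    service = Mock()
    assert handle_critique({"query_id": "q-1", "score": 4}, service) is True
    service.update_engram_score.assert_called_once_with("q-1", 4)


def test_handle_critique_skips_null_score():
    service = Mock()
    assert handle_critique({"query_id": "q-1", "score": None}, service) is False
    service.update_engram_score.assert_not_called()
```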
Step 5: Add critique = Signal(dict) to BusLogger. In log_envelope(), replace the existing CRITIQUE text branch (currently reads verdict) with an emit of the typed signal (fields: score, feedback, query_id) and return early, with no plain-text fallback. In _start_bus_monitor(), connect self._bus_logger.critique to self._on_critique. In _on_critique: look up the widget in self._pending by query_id, call widget.set_score(score, feedback). Add set_score() to StreamingResponseWidget: it appends a colored badge label and sets a tooltip with the feedback text.
Step 6: Write tests/stories/s6_critic_absolute.yaml. This is an integration test (requires live Prometheus in Ollama), not a unit test. One turn: a factual question with a clear correct answer. Assert: response.generation published (answer non-empty), critique.result published (score is int 1–5). Do not assert a specific score, since Prometheus output is non-deterministic. After both events fire, verify the engram in ChromaDB has critic_score set and type="episodic" still present.
- CriticAgent consumes response.generation on a separate thread from GeneratorAgent. The Prometheus grading call (5–30s) runs in CriticAgent's thread. GeneratorAgent is idle between queries; there is no shared resource contention. Do not introduce any synchronization between them.
- If grading fails or times out, CriticAgent publishes critique.result with "score": null. The conversation continues normally. The MemoryAgent score update and UI badge are both no-ops on null score. Never raise an exception that kills the thread.
- Engram IDs change in this phase: delete .chroma/ before the first run after this change. In tests: always pass a known query_id and verify by direct get().
- Signal.emit().
- critic_agent_tool.py)
- user.feedback → reward.event)