
LoCAL2 — Phase 3

CriticAgent: Absolute Grading · Memory Score Annotation · UI Score Badge

Context & Motivation

Phase 1 built the generator. Phase 2 added episodic memory as a tool. Phase 3 closes the quality feedback loop: every generated answer is graded by a separate LLM judge (Prometheus), the score is persisted back to the engram that MemoryAgent already wrote, and the score is visible inline in the UI.

The key question from the Phase 2 retrospective: who receives the critique and updates memory? The answer is MemoryAgent, which already owns the episodic store. Phase 3 adds a second subscription — to critique.result — so it can annotate the engram it already wrote with the critic's score.

Absolute vs. pairwise?
Phase 3 uses absolute grading (1–5). A single CriticAgent fires on every response.generation. Pairwise comparison (RespondentA vs. RespondentB) requires a second generator instance and is deferred to Phase 4+. Using Prometheus for absolute scoring is pragmatic while there is only one respondent.

What does "feedback loop" mean in Phase 3?
The score is persisted on the engram in ChromaDB as metadata (critic_score). Over time this builds a quality-annotated episodic store. Phase 4 will use scores to weight retrieval and route reward signals to producing agents.

Scope & Deliverables

Phase 3 is complete when Story S6 passes against the live stack: a generated response triggers critique.result containing an integer score 1–5, the score is visible in the UI below the response, and MemoryAgent has updated the engram with that score.

New files

  • src/local/agents/critic_agent.py (new)
  • src/local/agents/critic_states.py (new)
  • src/local/agents/critic_actions.py (new)
  • src/local/agents/critic_transitions.py (new)
  • config/critic.yaml (new)
  • tests/test_critic_agent.py (new)
  • tests/stories/s6_critic_absolute.yaml (new)

Modified files

  • run_local.py: add _start_critic() thread (modified)
  • src/local/agents/memory_agent.py: subscribe to critique.result, update engram score (modified)
  • src/local/agents/memory_agent_states.py: add UPDATING_SCORE state (modified)
  • src/local/agents/memory_agent_actions.py: add UPDATE_SCORE action (modified)
  • src/local/agents/memory_agent_transitions.py: add score update transitions (modified)
  • src/local/ui/main_window.py: subscribe to critique.result, show score badge (modified)

subjects.py: CRITIQUE = "critique.result" already defined — no change needed.

Data Flow

1. User submits query → GeneratorAgent produces answer → publishes response.generation (payload: query, answer, session_id, query_id)
2. MemoryAgent (existing) receives response.generation → writes engram to ChromaDB with {"query_id": ..., "intent": ..., "entities": ...} metadata. New in Phase 3: the written document's ID equals query_id so it can be updated later.
3. CriticAgent (new) receives response.generation → calls Prometheus via ollama.chat() with the absolute-grading prompt → parses score (1–5) and feedback text from output.
4. CriticAgent publishes critique.result: {"score": 4, "feedback": "...", "session_id": ..., "query_id": ..., "query": ..., "answer": ...}
5. MemoryAgent (updated) receives critique.result → calls memory_service.update_engram_score(query_id, score) → updates the critic_score metadata field on the existing ChromaDB document.
6. UI receives critique.result → finds the response widget by query_id → renders a small score badge (e.g., "● 4/5") below the answer text. The badge appears asynchronously after the answer is already shown.
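
The two payloads in steps 1 and 4 could be pinned down as typed dictionaries so that every consumer agrees on field names. This is a minimal sketch, not the project's actual schema: the class names and the to_critique helper are hypothetical, and only the field names come from the steps above.

```python
from typing import Optional, TypedDict


class GenerationPayload(TypedDict):
    """Payload of response.generation (step 1)."""
    query: str
    answer: str
    session_id: str
    query_id: str


class CritiquePayload(TypedDict):
    """Payload of critique.result (step 4); score is None when grading failed."""
    score: Optional[int]
    feedback: str
    session_id: str
    query_id: str
    query: str
    answer: str


def to_critique(gen: GenerationPayload, score: Optional[int], feedback: str) -> CritiquePayload:
    """Copy the identifying fields over so consumers can match by query_id."""
    return CritiquePayload(
        score=score,
        feedback=feedback,
        session_id=gen["session_id"],
        query_id=gen["query_id"],
        query=gen["query"],
        answer=gen["answer"],
    )
```

Carrying query_id through unchanged is what lets both MemoryAgent (step 5) and the UI (step 6) locate the right engram and widget.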

CriticAgent Design

State Machine

State       Meaning
IDLE        Waiting for response.generation
RECEIVING   Envelope consumed, building prompt
GRADING     Prometheus ollama.chat() in flight
PUBLISHING  Publishing critique.result
ERROR       Prometheus call failed or score parse failed

Actions: RECEIVE → START_GRADE → PUBLISH → RESET on success; FAIL → RESET on error.
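
The states and actions above can be sketched as two enums plus a transition table. This is only an illustration: the names CriticState, CriticAction, TRANSITIONS and step are hypothetical, and allowing FAIL only from GRADING is an assumption drawn from the ERROR row, not something the plan states.

```python
from enum import Enum, auto


class CriticState(Enum):
    IDLE = auto()
    RECEIVING = auto()
    GRADING = auto()
    PUBLISHING = auto()
    ERROR = auto()


class CriticAction(Enum):
    RECEIVE = auto()
    START_GRADE = auto()
    PUBLISH = auto()
    FAIL = auto()
    RESET = auto()


# (current state, action) -> next state. Anything not listed is illegal.
TRANSITIONS = {
    (CriticState.IDLE, CriticAction.RECEIVE): CriticState.RECEIVING,
    (CriticState.RECEIVING, CriticAction.START_GRADE): CriticState.GRADING,
    (CriticState.GRADING, CriticAction.PUBLISH): CriticState.PUBLISHING,
    (CriticState.PUBLISHING, CriticAction.RESET): CriticState.IDLE,
    # Assumption: FAIL is only reachable while grading (call or parse failure).
    (CriticState.GRADING, CriticAction.FAIL): CriticState.ERROR,
    (CriticState.ERROR, CriticAction.RESET): CriticState.IDLE,
}


def step(state: CriticState, action: CriticAction) -> CriticState:
    """Return the next state, or raise ValueError on an illegal transition."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} + {action.name}") from None
```

Keeping the table as plain data makes it easy to unit test every legal path and to reject anything else.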

Prometheus Absolute Grading Prompt

Prometheus expects a specific format. The rubric is defined in config/critic.yaml and injected at runtime. CriticAgent never hardcodes the rubric in Python.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a
score rubric representing evaluation criteria are given.
1. Write a detailed feedback that assesses the quality of the response strictly
   based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5.
   You should refer to the score rubric.
3. Output format: "Feedback: (write a feedback) [RESULT] (integer 1-5)"
4. Do not generate any other opening, closing, or explanations.

###The instruction to evaluate:
{query}

###Response to evaluate:
{answer}

###Score Rubrics:
{rubric}

###Feedback:
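
Filling the template at runtime might look like the sketch below. It assumes the template text and rubric have already been loaded as strings (for example from config/critic.yaml); build_prompt and PROMPT_TAIL are hypothetical names, and only the tail of the template is reproduced here to keep the example short.

```python
# Tail of the grading template; the full text above would be used in practice.
PROMPT_TAIL = (
    "###The instruction to evaluate:\n{query}\n\n"
    "###Response to evaluate:\n{answer}\n\n"
    "###Score Rubrics:\n{rubric}\n\n"
    "###Feedback:"
)


def build_prompt(template: str, query: str, answer: str, rubric: str) -> str:
    """Substitute the three placeholders; the rubric always comes from config."""
    # str.format does not re-parse substituted values, so braces inside
    # query or answer (e.g. JSON or code) are inserted verbatim.
    return template.format(query=query, answer=answer, rubric=rubric.strip())
```

Because the rubric is an argument rather than a constant, swapping rubrics is a config change and needs no Python edit.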

Score Extraction

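import re

# text is the raw Prometheus output. \b rejects multi-digit values such as "[RESULT] 45".
m = re.search(r'\[RESULT\]\s*([1-5])\b', text)
score = int(m.group(1)) if m else None  # None signals a parse failure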

If score is None (parse failure or timeout), CriticAgent publishes critique.result with "score": null. MemoryAgent skips the update if score is null. UI shows no badge.
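
Splitting the raw output into score and feedback, including the null case, could be sketched as follows. parse_grading is a hypothetical helper; the word boundary after the digit is a design choice made here so that a stray value like 45 is treated as a parse failure instead of a 4.

```python
import json
import re
from typing import Optional, Tuple

_RESULT_RE = re.compile(r"\[RESULT\]\s*([1-5])\b")


def parse_grading(text: str) -> Tuple[Optional[int], str]:
    """Split Prometheus output into (score, feedback); score is None on failure."""
    m = _RESULT_RE.search(text)
    score = int(m.group(1)) if m else None
    # Everything before [RESULT] is feedback; without a match, keep the whole text.
    feedback = (text[: m.start()] if m else text).strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    return score, feedback


# A None score serializes to JSON null, which is what subscribers check for.
payload = json.dumps({"score": parse_grading("garbled output")[0]})
```

Returning the unparsed text as feedback on failure keeps the raw model output available for debugging even when no badge is shown.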

config/critic.yaml

model: "prometheus:7b"       # must be available in Ollama: ollama pull prometheus:7b
temperature: 0.0
num_ctx: 4096
grade_timeout: 30            # seconds; Prometheus can be slow

rubric: |
  [Is the response accurate, helpful, and well-reasoned?]
  Score 1: The response is incorrect, harmful, or completely unhelpful.
  Score 2: The response is mostly wrong or missing important information.
  Score 3: The response is partially correct but incomplete or unclear.
  Score 4: The response is mostly correct with minor gaps.
  Score 5: The response is accurate, complete, and clearly explained.

Prometheus model must be pulled in Ollama before starting
Run ollama pull prometheus:7b (or the correct GGUF tag for Prometheus-7B-v2.0-Q4_K_M). The model name in config/critic.yaml must match exactly what ollama list shows. If the model is missing, CriticAgent will log an error on every request and publish "score": null; the rest of the system continues normally.

MemoryAgent Enhancement — Score Annotation

MemoryAgent currently subscribes to response.generation only. Phase 3 adds a second subscription to critique.result. When a critique arrives, MemoryAgent looks up the engram by query_id and adds critic_score to its metadata.

Engram ID convention (new constraint)

MemoryService must store each engram with id=query_id. This allows update_engram_score(query_id, score) to do a direct document update rather than a search. This is a change to MemoryService.write_episodic() — it must accept and use the query_id as the ChromaDB document ID.

New MemoryService method

This plan assumes that ChromaDB update() replaces the entire metadata dict rather than merging; confirm this against the installed ChromaDB version. The existing engram metadata includes type, query, timestamp, intent, and entities. Under that assumption, a bare update with only {"critic_score": score} would wipe them, breaking the where={"type": "episodic"} filter in search_episodic(). The implementation must therefore read the existing metadata first, merge critic_score in, then update, which is safe under either behavior.

def update_engram_score(self, query_id: str, score: int) -> None:
    """Merge critic_score into an existing engram's metadata."""
    result = self._collection.get(ids=[query_id])
    if not result["ids"]:
        logger.warning("MemoryService: engram %s not found — skipping score update", query_id)
        return
    existing_meta = (result["metadatas"] or [None])[0] or {}  # a stored metadata entry may itself be None
    merged = {**existing_meta, "critic_score": score}
    self._collection.update(ids=[query_id], metadatas=[merged])
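
The read-merge-update pattern can be exercised without a running ChromaDB by using an in-memory stand-in. In this sketch FakeCollection and update_score are hypothetical; the fake deliberately implements the replace-not-merge behavior this plan describes, so the test fails if the merge step is ever dropped.

```python
class FakeCollection:
    """Hypothetical stand-in for a ChromaDB collection (not the real API surface)."""

    def __init__(self):
        self._meta = {}

    def add(self, ids, metadatas):
        for doc_id, meta in zip(ids, metadatas):
            self._meta[doc_id] = dict(meta)

    def get(self, ids):
        found = [i for i in ids if i in self._meta]
        return {"ids": found, "metadatas": [self._meta[i] for i in found]}

    def update(self, ids, metadatas):
        # Replaces the stored metadata wholesale, as the plan assumes.
        for doc_id, meta in zip(ids, metadatas):
            self._meta[doc_id] = dict(meta)


def update_score(collection, query_id: str, score: int) -> bool:
    """Merge critic_score into existing metadata; False if the engram is missing."""
    result = collection.get(ids=[query_id])
    if not result["ids"]:
        return False
    existing = (result["metadatas"] or [None])[0] or {}
    collection.update(ids=[query_id], metadatas=[{**existing, "critic_score": score}])
    return True
```

A test against the real store would follow the same shape: write with a known ID, update, then get() and check both critic_score and type.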

State machine additions

Add state UPDATING_SCORE and action UPDATE_SCORE. The transition path: IDLE + UPDATE_SCORE → UPDATING_SCORE → IDLE. This is a fast path (no LLM call) — just a ChromaDB metadata update.

Race condition: CriticAgent may publish before MemoryAgent finishes write_episodic()
CriticAgent's Prometheus call typically takes 5–15 seconds, much longer than MemoryAgent's write_episodic(), so this race is unlikely in practice. Even so, MemoryService.update_engram_score() must tolerate a missing ID (engram not yet written) by logging a warning and returning without error. Retry logic is out of scope for Phase 3.

UI — Score Badge

The score arrives after the answer is already displayed. The UI must update the existing response widget, not append a new message. Each response widget is indexed by query_id in a dict on the conversation panel.

Score  Color            Label
5      green (#22c55e)  ● 5/5
4      green (#22c55e)  ● 4/5
3      amber (#f59e0b)  ● 3/5
2      red (#ef4444)    ● 2/5
1      red (#ef4444)    ● 1/5
null   gray             (hidden, no badge)

Badge appears as a small right-aligned indicator at the bottom of the response bubble. Hovering shows the full feedback text in a tooltip.
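
The score-to-badge mapping in the table can be kept in one small pure function, which is easy to test without Qt. badge_for is a hypothetical helper name; the colors and label format come from the table.

```python
from typing import Optional, Tuple

_COLORS = {5: "#22c55e", 4: "#22c55e", 3: "#f59e0b", 2: "#ef4444", 1: "#ef4444"}


def badge_for(score: Optional[int]) -> Optional[Tuple[str, str]]:
    """Return (label, hex color) for a score, or None when no badge should show."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or score not in _COLORS:
        return None
    return f"\u25cf {score}/5", _COLORS[score]
```

The widget's set_score() could then call this and hide the badge whenever it returns None.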

Signal wiring — follow existing BusLogger architecture

The existing UI routes all bus events through BusLogger signals (response = Signal(dict), thinking_chunk = Signal(dict), etc.). BusLogger already has a CRITIQUE branch in log_envelope() that formats a plain text message — it must be upgraded to emit a typed signal instead, following the same pattern as response. Do not add a new signal on MainWindow directly.

# In BusLogger — add alongside existing response = Signal(dict):
critique = Signal(dict)

# In log_envelope(), replace the CRITIQUE text branch:
elif subject == CRITIQUE:
    self.critique.emit({
        "score": raw.get("score"),
        "feedback": raw.get("feedback", ""),
        "query_id": raw.get("query_id", ""),
    })
    return    # do not also emit message()

# In MainWindow._start_bus_monitor() — add alongside response.connect:
self._bus_logger.critique.connect(self._on_critique)

# _on_critique: looks up widget in self._pending by query_id, calls widget.set_score(score, feedback)

Note: the existing CRITIQUE log branch reads raw.get("verdict") — this must change to raw.get("feedback") when upgrading the branch. The payload field is feedback everywhere in this plan.
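
The lookup performed by _on_critique can be sketched without Qt so the logic is testable in isolation. StubWidget stands in for StreamingResponseWidget and on_critique for the slot; both are hypothetical, and in the real UI this code would only ever run on the Qt thread after the signal is delivered.

```python
class StubWidget:
    """Hypothetical stand-in for StreamingResponseWidget."""

    def __init__(self):
        self.score = None
        self.feedback = ""

    def set_score(self, score, feedback):
        self.score, self.feedback = score, feedback


def on_critique(pending: dict, payload: dict) -> bool:
    """Attach the score to the widget registered under query_id, if any."""
    score = payload.get("score")
    widget = pending.get(payload.get("query_id", ""))
    # Unknown query_id or null score: leave the UI untouched (no badge).
    if widget is None or score is None:
        return False
    widget.set_score(score, payload.get("feedback", ""))
    return True
```

Returning a bool is only for the sketch's testability; a Qt slot would normally return nothing.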

Build Order

1. CriticAgent core: no bus, no UI

Write critic_states.py, critic_actions.py, critic_transitions.py, config/critic.yaml, and the core grading logic in critic_agent.py. Unit test _grade(query, answer) with a mocked OllamaBackend: assert that a response containing [RESULT] 4 returns score 4, and that a malformed response returns None. Do not make a real Prometheus call in unit tests — that belongs in the integration story (step 6).
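
The two unit tests described in this step might be sketched as below. The grade function is a hypothetical reduction of _grade, and the assumption that the backend exposes chat(prompt) returning a string is illustrative only; the real OllamaBackend interface may differ.

```python
import re
from unittest.mock import Mock


def grade(backend, prompt: str):
    """Hypothetical core of _grade: one backend call, then score extraction."""
    text = backend.chat(prompt)  # assumed signature: chat(prompt) -> str
    m = re.search(r"\[RESULT\]\s*([1-5])\b", text)
    return int(m.group(1)) if m else None


def test_grade_parses_score():
    backend = Mock()
    backend.chat.return_value = "Feedback: solid answer. [RESULT] 4"
    assert grade(backend, "prompt") == 4
    backend.chat.assert_called_once_with("prompt")


def test_grade_malformed_returns_none():
    backend = Mock()
    backend.chat.return_value = "I cannot evaluate this."
    assert grade(backend, "prompt") is None
```

Mocking at the backend boundary keeps these tests fast and independent of whether the Prometheus model is installed.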

2. MemoryService: engram ID convention + update method

Do this before wiring CriticAgent to the bus — the full end-to-end loop requires stable engram IDs. Update write_episodic() to accept an optional query_id parameter and use it as the ChromaDB document ID (falling back to uuid4() if absent). Add update_engram_score(query_id, score) with the read-merge-update pattern (see MemoryAgent section above). Unit test: write an engram with a known query_id, call update_engram_score, then get() the document and assert critic_score is set and type="episodic" is preserved. Test the missing-ID case: logs a warning and returns cleanly. Delete .chroma/ before the first run after this change.
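
The ID fallback described for write_episodic() is small enough to isolate. A minimal sketch, assuming IDs are plain strings; engram_id is a hypothetical helper name.

```python
import uuid
from typing import Optional


def engram_id(query_id: Optional[str] = None) -> str:
    """Use the caller's query_id as the document ID, else a random UUID."""
    # An empty string is treated like a missing ID.
    return query_id if query_id else str(uuid.uuid4())
```

Engrams written through the fallback path cannot be annotated later, since no critique.result will carry the generated UUID.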

3. Wire CriticAgent to bus + run_local.py

Add _start_critic() to run_local.py. CriticAgent subscribes to response.generation and publishes critique.result (payload fields: score, feedback, session_id, query_id, query, answer, matching step 4 of the data flow). Start it after memory_agent. Verify manually: start the stack, ask a question, confirm critique.result appears in the bus monitor with a numeric score. With MemoryService already updated (step 2), the full path from generation → engram write → score annotation can complete once MemoryAgent subscribes in step 4.

4. MemoryAgent: subscribe to critique.result

Pass query_id from response.generation payload to write_episodic() (step 2 already supports it). Add CRITIQUE to MemoryAgent's subscription list. Add _handle_critique(envelope) — reads query_id and score from payload, calls memory_service.update_engram_score(). Update state machine with UPDATING_SCORE state and UPDATE_SCORE action. Unit test: call _handle_critique directly with a fake envelope, assert update_engram_score was called with the right args (mock MemoryService).
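
The handler test from this step could look like the sketch below. handle_critique is a hypothetical free-function version of _handle_critique that takes the payload dict directly, because the envelope structure is not specified here; MemoryService is replaced by a Mock.

```python
from unittest.mock import Mock


def handle_critique(memory_service, payload: dict) -> None:
    """Forward a non-null score to MemoryService; ignore everything else."""
    score = payload.get("score")
    query_id = payload.get("query_id")
    if score is None or not query_id:
        return  # null score or missing ID: nothing to annotate
    memory_service.update_engram_score(query_id, score)


def test_handle_critique_forwards_score():
    service = Mock()
    handle_critique(service, {"query_id": "q1", "score": 4, "feedback": "ok"})
    service.update_engram_score.assert_called_once_with("q1", 4)


def test_handle_critique_skips_null_score():
    service = Mock()
    handle_critique(service, {"query_id": "q1", "score": None})
    service.update_engram_score.assert_not_called()
```

The second test covers the rule that a null score must leave the engram untouched.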

5. UI score badge: extend BusLogger

Add critique = Signal(dict) to BusLogger. In log_envelope(), replace the existing CRITIQUE text branch (currently reads verdict) with an emit of the typed signal (fields: score, feedback, query_id) and return early — no plain-text fallback. In _start_bus_monitor(), connect self._bus_logger.critique to self._on_critique. In _on_critique: look up the widget in self._pending by query_id, call widget.set_score(score, feedback). Add set_score() to StreamingResponseWidget: appends a colored badge label and sets a tooltip with the feedback text.

6. Story S6: end-to-end integration verification

Write tests/stories/s6_critic_absolute.yaml. This is an integration test (requires live Prometheus in Ollama), not a unit test. One turn: factual question with a clear correct answer. Assert: response.generation published (answer non-empty), critique.result published (score is int 1–5). Do not assert a specific score — Prometheus output is non-deterministic. After both events fire, verify the engram in ChromaDB has critic_score set and type="episodic" still present.

Constraints & Invariants

CriticAgent must not block GeneratorAgent
CriticAgent subscribes to response.generation and runs in its own thread, separate from GeneratorAgent. The Prometheus grading call (5–30s) runs in CriticAgent's thread. GeneratorAgent is idle between queries; there is no shared resource contention. Do not introduce any synchronization between them.
CriticAgent failure is non-fatal
If Prometheus times out, the model is missing, or score parsing fails, CriticAgent logs an error and publishes critique.result with "score": null. The conversation continues normally. The MemoryAgent score update and UI badge are both no-ops on null score. Never raise an exception that kills the thread.
MemoryAgent query_id constraint is a breaking change to write_episodic()
Currently write_episodic() generates its own document ID. Changing it to use query_id as the ID breaks any existing ChromaDB data (wrong ID format). Delete .chroma/ before the first run after this change. In tests: always pass a known query_id and verify by direct get().
Qt thread safety: critique.result arrives on background thread
The ZMQ subscriber loop runs in a non-Qt thread. All UI updates must go through a Qt Signal. Do not call any QWidget method directly from the subscriber callback — use Signal.emit().
Startup order: CriticAgent after tools
CriticAgent doesn't need to start before GeneratorAgent (it doesn't publish tool schemas). Add it after memory_agent in run_local.py. A short sleep (0.1s) after starting it is sufficient to let its socket connect.
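
The "failure is non-fatal" constraint above can be enforced by a wrapper that turns every failure mode into a None score. This is a sketch under assumptions: grade_safely and the grade_fn callable are hypothetical, and the 30-second default mirrors grade_timeout in config/critic.yaml.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

logger = logging.getLogger("critic")


def grade_safely(grade_fn, query: str, answer: str, timeout_s: float = 30.0):
    """Run grade_fn(query, answer) with a deadline; any failure yields None."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(grade_fn, query, answer).result(timeout=timeout_s)
    except FutureTimeout:
        logger.error("grading timed out after %.1fs", timeout_s)
    except Exception:
        # Missing model, connection error, parse bug: log it and keep running.
        logger.exception("grading failed")
    finally:
        # Do not block on a stuck call. The worker thread is not cancelled;
        # it finishes (or fails) in the background.
        pool.shutdown(wait=False)
    return None
```

Because the timeout only stops waiting and does not abort the underlying request, a client-side timeout on the Ollama call itself would still be worth adding where the backend supports one.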

Out of Scope (Phase 4+)