
LoCAL2 — Phase 5

Dual Respondents · Model & Sampling Variation · Pairwise Comparison

Context & Motivation

Phase 5 introduces a second generator (RespondentB) that runs the same query with different model or sampling parameters. Over time, pairwise comparisons reveal which configuration produces better answers for which query types — building a preference signal that accumulates on engrams.

Why not a behavioral subclass (self-critique, contrapuntal prompting)?
Gemma 4 already reasons from multiple angles natively. Explicit contrapuntal prompting produces answers that are more verbose and abstract without being more accurate — tests showed users prefer Gemma's natural output. Behaviorally different prompting also makes pairwise results harder to interpret: did B win because of its prompting style or its substance? Model and sampling variation produces genuinely different answers while keeping the generation strategy the same.
Why config flags rather than a separate class?
RespondentB is just GeneratorAgent with different parameters — no new behaviour, no extra code paths. A separate class would add complexity without adding capability. The variation is in the config, not the code. --respondent-b enables it; --model-b and --temperature-b configure it. All overridable at the command line for easy experimentation.
Why route pairwise results through memory rather than the current turn?
RespondentB runs independently and may finish after the UI has already shown RespondentA's answer. Waiting for both before responding would add latency on every query. The pairwise preference signal accumulates on engrams and shapes future retrieval — it feeds the next cycle, not the current one. No UI change needed.

Scope & Deliverables

Phase 5 is complete when: (1) --respondent-b starts a second generator with configurable model and temperature, (2) both publish response.generation with their respondent_id, (3) CriticAgent grades both and publishes pairwise.result, and (4) both engrams in ChromaDB carry pairwise_winner. Story S7 passes.

In scope

  • respondent_id field added to response.generation payload
  • --respondent-b, --model-b, --temperature-b CLI flags
  • CriticAgent pairwise buffer + pairwise Prometheus prompt + pairwise.result
  • New PAIRWISE_GRADING critic state
  • MemoryService.annotate_pairwise()
  • MemoryAgent subscribes to pairwise.result, new ANNOTATING_PAIRWISE state
  • PAIRWISE_RESULT = "pairwise.result" added to subjects.py
  • UI filters respondent_id="B" from conversation log
  • Tests + Story S7

Out of scope

  • Subclassing GeneratorAgent or adding behavioral variation
  • pairwise_winner weighting in retrieval (Phase 6)
  • Gemma-callable pairwise tool (critic_agent_tool.py)
  • Showing both responses in UI
  • Routing reward.event back to respondents
  • Gemma-initiated retry on low score

Command-Line Interface

# RespondentA only (default, unchanged)
python run_local.py

# RespondentB with same model, higher temperature
python run_local.py --respondent-b --temperature-b 0.7

# RespondentB with a larger model
python run_local.py --respondent-b --model-b gemma4:26b

# RespondentB with larger model and different temperature
python run_local.py --respondent-b --model-b gemma4:26b --temperature-b 0.5
  

When --respondent-b is omitted, the stack runs exactly as before. --model-b defaults to the value of --model (same model, only temperature differs). --temperature-b defaults to 0.5 when not specified.
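
The defaulting rules above could be wired up with argparse roughly as follows. This is a minimal sketch, not the actual run_local.py: the build_parser and parse_args helper names and the placeholder default for --model are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_local.py")
    # Assumption: --model already exists; its real default is not stated in this plan.
    parser.add_argument("--model", default="default-model")
    parser.add_argument("--respondent-b", action="store_true",
                        help="start a second generator (RespondentB)")
    # None means "not given", so the fallback to --model can happen after parsing.
    parser.add_argument("--model-b", default=None,
                        help="model for RespondentB (defaults to --model)")
    parser.add_argument("--temperature-b", type=float, default=0.5,
                        help="sampling temperature for RespondentB")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.model_b is None:
        args.model_b = args.model  # same model, only temperature differs
    return args
```

Resolving --model-b after parsing, rather than hard-coding a default, keeps it in step with whatever --model was set to on the same command line.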

Data Flow

1. UI publishes query.received with {query, query_id, session_id}.
2. RespondentA (existing GeneratorAgent) receives it → generates → publishes response.generation with {query_id, respondent_id: "A", answer, ...}.
3. RespondentB (same GeneratorAgent class, different model/temp) receives the same query.received → generates → publishes response.generation with {query_id: b_own_id, correlation_id: original_query_id, respondent_id: "B", answer, ...}.
4. MemoryAgent writes a separate engram for each: A's engram ID = query_id, B's engram ID = b_own_id.
5. CriticAgent receives A's answer (no tool_calls) → grades absolute → stores the result in the buffer keyed by correlation_id (= original query_id).
6. CriticAgent receives B's answer → grades absolute → finds A's result in the buffer → runs Prometheus pairwise → publishes pairwise.result {original_query_id, query_id_a, query_id_b, winner: "A"|"B", score_a, score_b}.
7. MemoryAgent receives pairwise.result → calls annotate_pairwise(query_id_a, query_id_b, winner) → writes pairwise_winner: true/false on both engrams.
8. UI shows only RespondentA's answer. BusLogger filters response.generation with respondent_id="B": it reaches the bus monitor text log but not the conversation panel.
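
The UI filter in the last step reduces to a single predicate on the payload. The sketch below is illustrative only: show_in_conversation is a hypothetical helper name, and how BusLogger actually emits its signals is not specified in this plan.

```python
def show_in_conversation(payload: dict) -> bool:
    """Return True when a response.generation payload belongs in the conversation panel."""
    # Payloads without respondent_id (pre-Phase-5 publishers) are treated as RespondentA.
    return payload.get("respondent_id", "A") != "B"
```

BusLogger would still write every event to the bus monitor text log and consult this check only before emitting to the conversation panel.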

RespondentB Identity & ID Scheme

RespondentB is GeneratorAgent instantiated with agent_id="respondent_b", respondent_id="B", and overridden model/temperature from CLI. No new class.

ID scheme: RespondentB generates its own query_id (new UUID) so its engram doesn't collide with A's in ChromaDB. It sets correlation_id = original_query_id from the incoming query.received envelope. CriticAgent buffers on correlation_id to match A and B from the same query. pairwise.result carries both IDs so MemoryAgent can annotate both engrams.

respondent_id="A" is additive and backward-compatible
GeneratorAgent gains a respondent_id constructor param (default "A"). It is included in response.generation payload. All existing consumers (MemoryAgent, CriticAgent, UI) that don't check respondent_id continue to work unchanged. CriticAgent and BusLogger are the only new consumers of this field.
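
A minimal sketch of the ID scheme, assuming the payload is a plain dict. build_generation_payload is a hypothetical helper, not part of GeneratorAgent. Setting correlation_id on A's payload as well is an assumption that keeps the matching key uniform; the plan only shows it explicitly for B.

```python
import uuid


def build_generation_payload(respondent_id: str, incoming_query_id: str, answer: str) -> dict:
    """Build a response.generation payload following the Phase 5 ID scheme."""
    if respondent_id == "B":
        # B mints its own ID so its engram does not collide with A's in ChromaDB.
        query_id = str(uuid.uuid4())
    else:
        query_id = incoming_query_id
    return {
        "query_id": query_id,
        # Both respondents point back at the original query so the critic can pair them.
        "correlation_id": incoming_query_id,
        "respondent_id": respondent_id,
        "answer": answer,
    }
```

For the same incoming query, A's payload keeps the original query_id while B's carries a fresh UUID, and both share the same correlation_id.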

CriticAgent — Pairwise Extension

Buffer design

_pairwise_buffer: dict[str, dict] keyed by correlation_id. Entry created when A is graded; pairwise fires when B arrives and A's entry exists. Max size 100 — oldest evicted when full. If B never arrives (B not running, or B had tool_calls), the buffer entry ages out on eviction. No hard timeout needed.
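
One way to get the bounded, oldest-first eviction described above is an OrderedDict. The PairwiseBuffer class and its method names are illustrative assumptions; the plan itself only specifies a dict keyed by correlation_id with a maximum size of 100. Removing A's entry once B has matched it is also an assumption.

```python
from collections import OrderedDict
from typing import Optional


class PairwiseBuffer:
    """Holds A's graded result until B's answer for the same query arrives."""

    def __init__(self, max_size: int = 100):
        self._entries = OrderedDict()  # correlation_id -> A's graded entry
        self._max_size = max_size

    def put_a(self, correlation_id: str, entry: dict) -> None:
        self._entries[correlation_id] = entry
        self._entries.move_to_end(correlation_id)
        # Evict the oldest entries once the buffer is over capacity.
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)

    def pop_a(self, correlation_id: str) -> Optional[dict]:
        """Return and remove A's entry, or None if A was never buffered or was evicted."""
        return self._entries.pop(correlation_id, None)

    def __len__(self) -> int:
        return len(self._entries)
```

Because unmatched entries simply fall off the old end, a B that never arrives costs at most one slot until eviction, which is why no timeout is needed.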

Pairwise Prometheus prompt

###Task Description:
Two responses to the same instruction are given. Determine which is better.
Output only "A" or "B" followed by a brief reason (one sentence).

###Instruction:
{query}

###Response A:
{answer_a}

###Response B:
{answer_b}

###Verdict:

State machine additions

New state          New actions                        Transitions
PAIRWISE_GRADING   START_PAIRWISE, PUBLISH_PAIRWISE   IDLE + START_PAIRWISE → PAIRWISE_GRADING
                                                      PAIRWISE_GRADING + PUBLISH_PAIRWISE → IDLE
                                                      PAIRWISE_GRADING + FAIL → ERROR
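
The three pairwise transitions can be written as a lookup table. The sketch below is self-contained and uses its own enums; the real critic_states.py, critic_actions.py and critic_transitions.py may be structured differently, and next_state is a hypothetical helper.

```python
from enum import Enum, auto


class CriticState(Enum):
    IDLE = auto()
    PAIRWISE_GRADING = auto()
    ERROR = auto()


class CriticAction(Enum):
    START_PAIRWISE = auto()
    PUBLISH_PAIRWISE = auto()
    FAIL = auto()


# (current state, action) -> next state, covering only the pairwise additions.
PAIRWISE_TRANSITIONS = {
    (CriticState.IDLE, CriticAction.START_PAIRWISE): CriticState.PAIRWISE_GRADING,
    (CriticState.PAIRWISE_GRADING, CriticAction.PUBLISH_PAIRWISE): CriticState.IDLE,
    (CriticState.PAIRWISE_GRADING, CriticAction.FAIL): CriticState.ERROR,
}


def next_state(state: CriticState, action: CriticAction) -> CriticState:
    try:
        return PAIRWISE_TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {action.name}") from None
```
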
Pairwise is non-fatal and optional
If Prometheus returns something other than "A" or "B", log a warning and skip publishing pairwise.result. The absolute scores (critic badge) are already published by that point, so pairwise is purely additive. If RespondentB is not running, no B answer ever arrives to match A's buffered entry and pairwise never fires; everything else works normally.
Tool-call skip applies to B too
If RespondentB's answer includes tool_calls, skip absolute grading and pairwise for B. Pairwise only fires when both A and B are gradeable (no tool_calls).
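
A sketch of how the pairwise prompt might be filled in and the verdict parsed so that an unusable reply stays non-fatal. build_pairwise_prompt and parse_verdict are hypothetical helper names, and the parsing rule (first token must be A or B, optionally quoted) is an assumption about what Prometheus returns.

```python
import re
from typing import Optional

PAIRWISE_PROMPT = """###Task Description:
Two responses to the same instruction are given. Determine which is better.
Output only "A" or "B" followed by a brief reason (one sentence).

###Instruction:
{query}

###Response A:
{answer_a}

###Response B:
{answer_b}

###Verdict:
"""

# The verdict must start with a standalone A or B, optionally wrapped in quotes.
_VERDICT_RE = re.compile(r"^\s*[\"']?([AB])\b")


def build_pairwise_prompt(query: str, answer_a: str, answer_b: str) -> str:
    return PAIRWISE_PROMPT.format(query=query, answer_a=answer_a, answer_b=answer_b)


def parse_verdict(raw: str) -> Optional[str]:
    """Return "A" or "B", or None when the reply cannot be read as a verdict."""
    match = _VERDICT_RE.match(raw)
    return match.group(1) if match else None
```

A None result is the cue to log a warning and skip publishing pairwise.result. Note that this lenient rule would read a reply that begins with the article "A" (for example "A better answer is B") as a vote for A, so a stricter pattern may be worth considering once real Prometheus output has been observed.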

MemoryService & MemoryAgent — Pairwise Annotation

MemoryService.annotate_pairwise()

def annotate_pairwise(self, query_id_a: str, query_id_b: str, winner: str) -> None:
    """Set pairwise_winner on both engrams; winner is "A" or "B"."""
    # Assumes a module-level logger = logging.getLogger(__name__).
    for qid, is_winner in [(query_id_a, winner == "A"), (query_id_b, winner == "B")]:
        result = self._collection.get(ids=[qid])
        if not result.get("ids"):
            logger.warning("MemoryService: engram %s not found for pairwise annotation", qid)
            continue
        # Read-merge-update: keep existing metadata; an engram stored without any yields None.
        existing = (result.get("metadatas") or [{}])[0] or {}
        self._collection.update(ids=[qid], metadatas=[{**existing, "pairwise_winner": is_winner}])

MemoryAgent additions

Subscription list grows to [RESPONSE_GENERATION, CRITIQUE, PAIRWISE_RESULT]. New _handle_pairwise() method. New ANNOTATING_PAIRWISE state with ANNOTATE_PAIRWISE action.
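
The handler itself can stay thin. The sketch below is written as a free function for testability, with the service passed in; in MemoryAgent it would be the _handle_pairwise() method, and the state transitions around it are omitted. The payload validation shown here is an assumption, not something the plan requires.

```python
import logging

logger = logging.getLogger(__name__)

REQUIRED_KEYS = ("query_id_a", "query_id_b", "winner")


def handle_pairwise(payload: dict, memory_service) -> bool:
    """Annotate both engrams from a pairwise.result payload; return True if annotated."""
    missing = [key for key in REQUIRED_KEYS if key not in payload]
    if missing or payload.get("winner") not in ("A", "B"):
        # A malformed result is dropped, never raised: pairwise is optional.
        logger.warning("MemoryAgent: ignoring malformed pairwise.result (missing=%s)", missing)
        return False
    memory_service.annotate_pairwise(
        payload["query_id_a"], payload["query_id_b"], payload["winner"]
    )
    return True
```
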

File Map

File                                          Status  Change
src/local/protocol/subjects.py                mod     Add PAIRWISE_RESULT = "pairwise.result"
src/local/agents/generator_agent.py           mod     Add respondent_id constructor param (default "A"); include in response.generation payload; generate own query_id when respondent_id="B"
src/local/agents/critic_agent.py              mod     Pairwise buffer; route on respondent_id; pairwise prompt; publish pairwise.result
src/local/agents/critic_states.py             mod     Add PAIRWISE_GRADING
src/local/agents/critic_actions.py            mod     Add START_PAIRWISE, PUBLISH_PAIRWISE
src/local/agents/critic_transitions.py        mod     Pairwise state transitions
src/local/services/memory_service.py          mod     Add annotate_pairwise(query_id_a, query_id_b, winner)
src/local/agents/memory_agent.py              mod     Subscribe to PAIRWISE_RESULT; add _handle_pairwise()
src/local/agents/memory_agent_states.py       mod     Add ANNOTATING_PAIRWISE
src/local/agents/memory_agent_actions.py      mod     Add ANNOTATE_PAIRWISE
src/local/agents/memory_agent_transitions.py  mod     Pairwise state transitions
src/local/ui/main_window.py                   mod     BusLogger: filter respondent_id="B" from conversation log signal; still appears in bus monitor text
run_local.py                                  mod     Add --respondent-b, --model-b, --temperature-b flags; _start_respondent_b() thread
tests/test_generator_agent.py                 mod     respondent_id field in payload; B generates own query_id
tests/test_critic_agent.py                    mod     Pairwise buffer, winner parsing, failure handling
tests/test_memory_service.py                  mod     annotate_pairwise: winner/loser flags, missing engram handled
tests/test_memory_agent.py                    mod     _handle_pairwise calls annotate_pairwise with correct IDs
tests/stories/s7_pairwise.yaml                new     Acceptance story: two response.generation events, pairwise.result on bus, engrams annotated

Build Order

1. Protocol: PAIRWISE_RESULT subject + respondent_id convention
   Add PAIRWISE_RESULT to subjects.py. Document the respondent_id field convention in a subjects.py comment. No behaviour change yet.

2. GeneratorAgent: respondent_id param + B query_id
   Add respondent_id: str = "A" to the constructor. Include it in the response.generation payload. When respondent_id == "B", generate own query_id (new UUID) and set correlation_id = original_query_id from the incoming envelope. Update unit tests.

3. run_local.py: CLI flags + RespondentB thread
   Add --respondent-b (flag), --model-b (str, default = --model value), --temperature-b (float, default 0.5). Add _start_respondent_b(model, temperature). Start the thread only when --respondent-b is set. Verify the stack starts cleanly with and without the flag.

4. CriticAgent: pairwise buffer + state machine + publish
   Add _pairwise_buffer. Route on respondent_id: A → grade absolute + buffer; B → grade absolute + look up A by correlation_id + run pairwise + publish pairwise.result. Add PAIRWISE_GRADING state, START_PAIRWISE/PUBLISH_PAIRWISE actions, transitions. Add unit tests for all paths.

5. MemoryService: annotate_pairwise()
   Read-merge-update both engrams. Unit tests: winner gets pairwise_winner: true, loser gets false, existing metadata preserved, missing engram handled gracefully.

6. MemoryAgent: pairwise subscription + state machine
   Subscribe to PAIRWISE_RESULT. Add _handle_pairwise(). Add ANNOTATING_PAIRWISE state and ANNOTATE_PAIRWISE action. Unit tests: calls annotate_pairwise with correct IDs from payload.

7. UI filter + Story S7
   In BusLogger, skip emitting the response signal when respondent_id == "B" (it still passes to the bus monitor text log). Write Story S7. Restart the stack with --respondent-b, run the story, verify pairwise.result on the bus and engram annotation in ChromaDB.

Story S7 — Pairwise Grading

story_id: S7
title: "Dual respondents — pairwise result published and engrams annotated"

turns:
  - query: "What is the boiling point of water at sea level?"
    expected_content:
      - "100"

notes: >
  Run with: python run_local.py --respondent-b --temperature-b 0.7

  After running, verify manually:

  1. Bus monitor: two response.generation events — one with respondent_id "A",
     one with respondent_id "B". Only A appears in the conversation panel.

  2. Bus monitor: pairwise.result event arrives after both are graded.
     Payload must contain winner ("A" or "B"), score_a, score_b,
     query_id_a, query_id_b.

  3. ChromaDB: both engrams annotated. Check with:
       python -c "
       import sys; sys.path.insert(0,'src')
       from local.services.memory_service import MemoryService
       svc = MemoryService()
       results = svc.search_episodic('boiling point water', n=5)
       for r in results:
           m = r['metadata']
           print(m.get('respondent_id','?'), 'winner:', m.get('pairwise_winner'))
       "
  

Key Constraints

Independent conversation histories
A and B maintain separate in-memory conversation histories (keyed by agent_id). B's answers never appear in A's history. Each agent_id is distinct: generator_agent vs respondent_b.
correlation_id is the pairwise matching key
CriticAgent buffers on correlation_id (= original query_id from query.received). When respondent_id="B", GeneratorAgent must set correlation_id = envelope.correlation_id (which carries the original query_id). If correlation_id is absent, pairwise is skipped for that pair.
--respondent-b is purely additive
Without the flag, the stack behaves identically to Phase 4. All pairwise-related code paths are gated on RespondentB's presence. No regression risk to the default single-respondent mode.
respondent_id added to MemoryAgent engrams
MemoryAgent should include respondent_id from the response.generation payload in the engram metadata, so that future retrieval can filter or weight by respondent. No extra plumbing is needed: response.generation carries respondent_id from this phase onward, and MemoryAgent only has to write it through.