Phase 5 introduces a second generator (RespondentB) that runs the same query with different model or sampling parameters. Over time, pairwise comparisons reveal which configuration produces better answers for which query types — building a preference signal that accumulates on engrams.
RespondentB is a GeneratorAgent with different parameters: no new behaviour, no extra code paths. A separate class would add complexity without adding capability. The variation is in the config, not the code. --respondent-b enables it; --model-b and --temperature-b configure it. Both can be overridden at the command line for easy experimentation.

Phase 5 is complete when: (1) --respondent-b starts a second generator with configurable model and temperature, (2) both publish response.generation with their respondent_id, (3) CriticAgent grades both and publishes pairwise.result, and (4) both engrams in ChromaDB carry pairwise_winner. Story S7 passes.
- respondent_id field added to response.generation payload
- --respondent-b, --model-b, --temperature-b CLI flags
- CriticAgent publishes pairwise.result
- PAIRWISE_GRADING critic state
- MemoryService.annotate_pairwise()
- MemoryAgent subscribes to pairwise.result, new ANNOTATING_PAIRWISE state
- PAIRWISE_RESULT = "pairwise.result" added to subjects.py
- BusLogger filters respondent_id="B" from conversation log
- critic_agent_tool.py)
- reward.event back to respondents

# RespondentA only (default, unchanged)
python run_local.py

# RespondentB with same model, higher temperature
python run_local.py --respondent-b --temperature-b 0.7

# RespondentB with a larger model
python run_local.py --respondent-b --model-b gemma4:26b

# RespondentB with larger model and different temperature
python run_local.py --respondent-b --model-b gemma4:26b --temperature-b 0.5
When --respondent-b is omitted, the stack runs exactly as before. --model-b defaults to the value of --model (same model, only temperature differs). --temperature-b defaults to 0.5 when not specified.
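The flag defaults described above can be sketched with argparse. This is only an illustration under assumptions: `build_parser` and `parse_args` are hypothetical helper names, and `"base-model"` is a placeholder default for --model, not the project's real value.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Local stack launcher (sketch)")
    # "base-model" is a placeholder, not the project's actual default.
    parser.add_argument("--model", default="base-model")
    parser.add_argument("--respondent-b", action="store_true",
                        help="start a second generator (RespondentB)")
    # None means "fall back to --model"; resolved after parsing.
    parser.add_argument("--model-b", default=None)
    parser.add_argument("--temperature-b", type=float, default=0.5)
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.model_b is None:
        # --model-b defaults to whatever --model resolved to.
        args.model_b = args.model
    return args


args = parse_args(["--respondent-b", "--model-b", "gemma4:26b"])
print(args.respondent_b, args.model_b, args.temperature_b)
```

Resolving --model-b after parsing, rather than in `add_argument`, is what lets it track a --model value supplied on the same command line.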
1. query.received is published with {query, query_id, session_id}.
2. RespondentA publishes response.generation with {query_id, respondent_id: "A", answer, ...}.
3. RespondentB receives query.received → generates → publishes response.generation with {query_id: b_own_id, correlation_id: original_query_id, respondent_id: "B", answer, ...}.
4. Both responses are stored as engrams: A's engram ID = query_id, B's engram ID = b_own_id.
5. CriticAgent grades both and matches them on correlation_id (= original query_id).
6. CriticAgent publishes pairwise.result {original_query_id, query_id_a, query_id_b, winner: "A"|"B", score_a, score_b}.
7. MemoryAgent receives pairwise.result → calls annotate_pairwise(query_id_a, query_id_b, winner) → writes pairwise_winner: true/false on both engrams.
8. BusLogger skips response.generation with respondent_id="B": it reaches the bus monitor text log but not the conversation panel.

RespondentB is GeneratorAgent instantiated with agent_id="respondent_b", respondent_id="B", and overridden model/temperature from CLI. No new class.
ID scheme: RespondentB generates its own query_id (new UUID) so its engram doesn't collide with A's in ChromaDB. It sets correlation_id = original_query_id from the incoming query.received envelope. CriticAgent buffers on correlation_id to match A and B from the same query. pairwise.result carries both IDs so MemoryAgent can annotate both engrams.
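A minimal sketch of the ID scheme, assuming the incoming query.received payload is a plain dict. `build_response_payload` is a hypothetical helper, not a function from the codebase, and the real payload carries more fields than shown.

```python
import uuid


def build_response_payload(incoming: dict, respondent_id: str, answer: str) -> dict:
    """Build a response.generation payload for RespondentA or RespondentB."""
    original_query_id = incoming["query_id"]
    payload = {"respondent_id": respondent_id, "answer": answer}
    if respondent_id == "B":
        # B stores its engram under a fresh ID so it cannot collide with A's,
        # and points back at the original query through correlation_id.
        payload["query_id"] = str(uuid.uuid4())
        payload["correlation_id"] = original_query_id
    else:
        # A keeps the original query_id, exactly as before Phase 5.
        payload["query_id"] = original_query_id
    return payload


incoming = {"query": "example", "query_id": "q-1", "session_id": "s-1"}
payload_a = build_response_payload(incoming, "A", "answer from A")
payload_b = build_response_payload(incoming, "B", "answer from B")
print(payload_a["query_id"], payload_b["correlation_id"])
```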
GeneratorAgent gains a respondent_id constructor param (default "A"). It is included in the response.generation payload. All existing consumers (MemoryAgent, CriticAgent, UI) that don't check respondent_id continue to work unchanged. CriticAgent and BusLogger are the only new consumers of this field.

CriticAgent keeps _pairwise_buffer: dict[str, dict] keyed by correlation_id. An entry is created when A is graded; pairwise fires when B arrives and A's entry exists. Max size is 100, and the oldest entry is evicted when the buffer is full. If B never arrives (B not running, or B had tool_calls), the buffer entry ages out on eviction. No hard timeout is needed.
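The bounded buffer described above could look roughly like this. It is a sketch, not the CriticAgent implementation: the `PairwiseBuffer` class and its method names are assumptions, and the source only specifies a dict keyed by correlation_id with a cap of 100 and oldest-first eviction.

```python
from collections import OrderedDict
from typing import Optional


class PairwiseBuffer:
    """Holds A's graded entry until B's response for the same query arrives."""

    def __init__(self, max_size: int = 100) -> None:
        self._entries: "OrderedDict[str, dict]" = OrderedDict()
        self._max_size = max_size

    def put_a(self, correlation_id: str, entry: dict) -> None:
        # Insertion order doubles as age, so the first item is the oldest.
        self._entries[correlation_id] = entry
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the oldest entry

    def pop_a(self, correlation_id: str) -> Optional[dict]:
        # Returns None when A was never buffered or has already aged out.
        return self._entries.pop(correlation_id, None)

    def __len__(self) -> int:
        return len(self._entries)


buf = PairwiseBuffer(max_size=2)
for key in ("q-1", "q-2", "q-3"):
    buf.put_a(key, {"answer": f"answer for {key}"})
print(len(buf), buf.pop_a("q-1"))
```

Because unmatched entries simply fall off the front, no timer is required, which matches the "no hard timeout" decision.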
###Task Description:
Two responses to the same instruction are given. Determine which is better.
Output only "A" or "B" followed by a brief reason (one sentence).
###Instruction:
{query}
###Response A:
{answer_a}
###Response B:
{answer_b}
###Verdict:
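One way to fill the template above and read the verdict back, as a sketch. `build_pairwise_prompt` and `parse_winner` are hypothetical names, and the parsing rule (first token must be a bare A or B) is an assumption about how strictly the critic output should be interpreted.

```python
import re
from typing import Optional

PAIRWISE_PROMPT = (
    "###Task Description:\n"
    "Two responses to the same instruction are given. Determine which is better.\n"
    'Output only "A" or "B" followed by a brief reason (one sentence).\n'
    "###Instruction:\n{query}\n"
    "###Response A:\n{answer_a}\n"
    "###Response B:\n{answer_b}\n"
    "###Verdict:\n"
)


def build_pairwise_prompt(query: str, answer_a: str, answer_b: str) -> str:
    return PAIRWISE_PROMPT.format(query=query, answer_a=answer_a, answer_b=answer_b)


def parse_winner(verdict: str) -> Optional[str]:
    # Accept an optional opening quote, then a standalone A or B.
    match = re.match(r"\s*[\"']?([AB])\b", verdict)
    return match.group(1) if match else None


prompt = build_pairwise_prompt("What is 2 + 2?", "4", "Four, I think.")
print(parse_winner("A. Response A is more direct."))
```

Returning None for anything that does not start with A or B lets the caller treat an unparseable verdict as a grading failure rather than guessing a winner.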
| New state | New actions | Transitions |
|---|---|---|
| PAIRWISE_GRADING | START_PAIRWISE, PUBLISH_PAIRWISE | IDLE + START_PAIRWISE → PAIRWISE_GRADING; PAIRWISE_GRADING + PUBLISH_PAIRWISE → IDLE; PAIRWISE_GRADING + FAIL → ERROR |
pairwise.result. Absolute scores (critic badge) are already published by then, so pairwise is additive. If RespondentB is not running, no B response ever arrives to match A's buffer entry and pairwise never fires; everything else works normally.
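The three pairwise transitions in the table above can be expressed as a lookup table. This is a sketch: the enum and function names are assumptions, and the existing critic states and actions outside the pairwise path are omitted.

```python
from enum import Enum, auto


class CriticState(Enum):
    IDLE = auto()
    PAIRWISE_GRADING = auto()
    ERROR = auto()


class CriticAction(Enum):
    START_PAIRWISE = auto()
    PUBLISH_PAIRWISE = auto()
    FAIL = auto()


# (current state, action) -> next state, one row per transition in the table.
TRANSITIONS = {
    (CriticState.IDLE, CriticAction.START_PAIRWISE): CriticState.PAIRWISE_GRADING,
    (CriticState.PAIRWISE_GRADING, CriticAction.PUBLISH_PAIRWISE): CriticState.IDLE,
    (CriticState.PAIRWISE_GRADING, CriticAction.FAIL): CriticState.ERROR,
}


def next_state(state: CriticState, action: CriticAction) -> CriticState:
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} + {action.name}") from None


state = next_state(CriticState.IDLE, CriticAction.START_PAIRWISE)
print(state.name)
```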
def annotate_pairwise(self, query_id_a: str, query_id_b: str, winner: str) -> None:
    # Flag each engram: True for the winner, False for the loser.
    for qid, is_winner in [(query_id_a, winner == "A"), (query_id_b, winner == "B")]:
        result = self._collection.get(ids=[qid])
        if not result.get("ids"):
            logger.warning("MemoryService: engram %s not found for pairwise annotation", qid)
            continue
        # Read-merge-update so existing metadata is preserved; tolerate a None entry.
        existing = (result.get("metadatas") or [{}])[0] or {}
        self._collection.update(ids=[qid], metadatas=[{**existing, "pairwise_winner": is_winner}])
MemoryAgent's subscription list grows to [RESPONSE_GENERATION, CRITIQUE, PAIRWISE_RESULT]. It gains a new _handle_pairwise() method and a new ANNOTATING_PAIRWISE state with an ANNOTATE_PAIRWISE action.
| File | Status | Change |
|---|---|---|
| src/local/protocol/subjects.py | mod | Add PAIRWISE_RESULT = "pairwise.result" |
| src/local/agents/generator_agent.py | mod | Add respondent_id constructor param (default "A"); include in response.generation payload; generate own query_id when respondent_id="B" |
| src/local/agents/critic_agent.py | mod | Pairwise buffer; route on respondent_id; pairwise prompt; publish pairwise.result |
| src/local/agents/critic_states.py | mod | Add PAIRWISE_GRADING |
| src/local/agents/critic_actions.py | mod | Add START_PAIRWISE, PUBLISH_PAIRWISE |
| src/local/agents/critic_transitions.py | mod | Pairwise state transitions |
| src/local/services/memory_service.py | mod | Add annotate_pairwise(query_id_a, query_id_b, winner) |
| src/local/agents/memory_agent.py | mod | Subscribe to PAIRWISE_RESULT; add _handle_pairwise() |
| src/local/agents/memory_agent_states.py | mod | Add ANNOTATING_PAIRWISE |
| src/local/agents/memory_agent_actions.py | mod | Add ANNOTATE_PAIRWISE |
| src/local/agents/memory_agent_transitions.py | mod | Pairwise state transitions |
| src/local/ui/main_window.py | mod | BusLogger: filter respondent_id="B" from conversation log signal; still appears in bus monitor text |
| run_local.py | mod | Add --respondent-b, --model-b, --temperature-b flags; _start_respondent_b() thread |
| tests/test_generator_agent.py | mod | respondent_id field in payload; B generates own query_id |
| tests/test_critic_agent.py | mod | Pairwise buffer, winner parsing, failure handling |
| tests/test_memory_service.py | mod | annotate_pairwise: winner/loser flags, missing engram handled |
| tests/test_memory_agent.py | mod | _handle_pairwise calls annotate_pairwise with correct IDs |
| tests/stories/s7_pairwise.yaml | new | Acceptance story: two response.generation events, pairwise.result on bus, engrams annotated |
Add PAIRWISE_RESULT to subjects.py. Document respondent_id field convention in subjects.py comment. No behaviour change yet.
Add respondent_id: str = "A" to the GeneratorAgent constructor. Include it in the response.generation payload. When respondent_id == "B", generate own query_id (new UUID) and set correlation_id = original_query_id from the incoming envelope. Update unit tests.
Add --respondent-b (flag), --model-b (str, default = --model value), --temperature-b (float, default 0.5). Add _start_respondent_b(model, temperature). Start thread only when --respondent-b is set. Verify stack starts cleanly with and without the flag.
Add _pairwise_buffer to CriticAgent. Route on respondent_id: A → grade absolute + buffer; B → grade absolute + look up A by correlation_id + run pairwise + publish pairwise.result. Add PAIRWISE_GRADING state, START_PAIRWISE/PUBLISH_PAIRWISE actions, transitions. Add unit tests for all paths.
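The routing in this step might be sketched as a single function. Everything here is illustrative: `route_response` is a hypothetical name, `grade` and `judge` are injected stand-ins for the absolute and pairwise LLM calls, and the buffer is a plain dict with no eviction.

```python
from typing import Callable, Optional


def route_response(
    payload: dict,
    buffer: dict,
    grade: Callable[[str], float],
    judge: Callable[[str, str], str],
) -> Optional[dict]:
    """Grade one response.generation payload; return a pairwise.result payload or None."""
    score = grade(payload["answer"])  # absolute grading happens for A and B alike
    if payload.get("respondent_id", "A") == "A":
        # A: remember the graded answer under the original query_id.
        buffer[payload["query_id"]] = {
            "query_id": payload["query_id"],
            "answer": payload["answer"],
            "score": score,
        }
        return None
    # B: look up A's entry through correlation_id; skip pairwise if it is missing.
    entry_a = buffer.pop(payload.get("correlation_id"), None)
    if entry_a is None:
        return None
    return {
        "original_query_id": entry_a["query_id"],
        "query_id_a": entry_a["query_id"],
        "query_id_b": payload["query_id"],
        "winner": judge(entry_a["answer"], payload["answer"]),
        "score_a": entry_a["score"],
        "score_b": score,
    }


buffer: dict = {}
grade = lambda text: float(len(text))  # toy scorer, stands in for the critic model
judge = lambda a, b: "B"               # toy judge, always prefers B
first = route_response({"query_id": "q-1", "respondent_id": "A", "answer": "short"},
                       buffer, grade, judge)
second = route_response({"query_id": "b-9", "correlation_id": "q-1",
                         "respondent_id": "B", "answer": "a longer answer"},
                        buffer, grade, judge)
print(second)
```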
In MemoryService.annotate_pairwise(), read-merge-update both engrams. Unit tests: winner gets pairwise_winner: true, loser gets false, existing metadata preserved, missing engram handled gracefully.
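The unit tests listed for this step can be exercised without ChromaDB by using an in-memory stand-in. In the sketch below, `FakeCollection` is a hypothetical test double that mimics only the `get(ids=...)` and `update(ids=..., metadatas=...)` calls used by annotate_pairwise, and the function is written standalone rather than as a MemoryService method.

```python
class FakeCollection:
    """In-memory stand-in for the two collection calls annotate_pairwise uses."""

    def __init__(self, metadatas: dict) -> None:
        self.store = {key: dict(value) for key, value in metadatas.items()}

    def get(self, ids: list) -> dict:
        found = [i for i in ids if i in self.store]
        return {"ids": found, "metadatas": [self.store[i] for i in found]}

    def update(self, ids: list, metadatas: list) -> None:
        for engram_id, metadata in zip(ids, metadatas):
            self.store[engram_id] = dict(metadata)


def annotate_pairwise(collection, query_id_a: str, query_id_b: str, winner: str) -> None:
    for qid, is_winner in [(query_id_a, winner == "A"), (query_id_b, winner == "B")]:
        result = collection.get(ids=[qid])
        if not result.get("ids"):
            continue  # missing engram: skip it, do not raise
        existing = (result.get("metadatas") or [{}])[0] or {}
        collection.update(ids=[qid], metadatas=[{**existing, "pairwise_winner": is_winner}])


collection = FakeCollection({
    "q-1": {"respondent_id": "A", "score": 4},
    "b-9": {"respondent_id": "B", "score": 5},
})
annotate_pairwise(collection, "q-1", "b-9", winner="B")
print(collection.store)
```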
In MemoryAgent, subscribe to PAIRWISE_RESULT. Add _handle_pairwise(). Add ANNOTATING_PAIRWISE state and ANNOTATE_PAIRWISE action. Unit tests: calls annotate_pairwise with correct IDs from payload.
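A handler for this step could be as small as the sketch below. `handle_pairwise` is written as a free function for illustration; the payload validation and the boolean return value are assumptions, not behaviour stated in the plan.

```python
import logging

logger = logging.getLogger(__name__)

REQUIRED_KEYS = ("query_id_a", "query_id_b", "winner")


def handle_pairwise(payload: dict, memory_service) -> bool:
    """Forward a pairwise.result payload to annotate_pairwise; True if it was applied."""
    missing = [key for key in REQUIRED_KEYS if key not in payload]
    if missing or payload.get("winner") not in ("A", "B"):
        # Malformed result: log and drop it instead of annotating with bad data.
        logger.warning("pairwise.result ignored, bad payload: %s", payload)
        return False
    memory_service.annotate_pairwise(
        payload["query_id_a"], payload["query_id_b"], payload["winner"]
    )
    return True
```

In a unit test, `memory_service` can be any object with an `annotate_pairwise` method that records its arguments.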
In BusLogger, skip emitting response signal when respondent_id == "B" (still passes to bus monitor text log). Write Story S7. Restart stack with --respondent-b, run story, verify pairwise.result on bus and engram annotation in ChromaDB.
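The BusLogger filter amounts to one conditional. In this sketch, plain lists stand in for the bus monitor text log and the conversation panel signal, and both function names are hypothetical.

```python
def should_emit_to_conversation(payload: dict) -> bool:
    # Payloads without respondent_id predate Phase 5 and are treated as A.
    return payload.get("respondent_id", "A") != "B"


def log_response(payload: dict, monitor_lines: list, conversation: list) -> None:
    # The bus monitor always sees the event, whichever respondent sent it.
    monitor_lines.append(f"response.generation {payload.get('respondent_id', 'A')}")
    if should_emit_to_conversation(payload):
        conversation.append(payload["answer"])


monitor: list = []
conversation: list = []
log_response({"respondent_id": "A", "answer": "from A"}, monitor, conversation)
log_response({"respondent_id": "B", "answer": "from B"}, monitor, conversation)
print(len(monitor), conversation)
```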
story_id: S7
title: "Dual respondents — pairwise result published and engrams annotated"
turns:
- query: "What is the boiling point of water at sea level?"
  expected_content:
    - "100"
notes: >
  Run with: python run_local.py --respondent-b --temperature-b 0.7
  After running, verify manually:
    1. Bus monitor: two response.generation events, one with respondent_id "A",
       one with respondent_id "B". Only A appears in the conversation panel.
    2. Bus monitor: pairwise.result event arrives after both are graded.
       Payload must contain winner ("A" or "B"), score_a, score_b,
       query_id_a, query_id_b.
    3. ChromaDB: both engrams annotated. Check with:
       python -c "
       import sys; sys.path.insert(0,'src')
       from local.services.memory_service import MemoryService
       svc = MemoryService()
       results = svc.search_episodic('boiling point water', n=5)
       for r in results:
           m = r['metadata']
           print(m.get('respondent_id','?'), 'winner:', m.get('pairwise_winner'))
       "
- generator_agent vs respondent_b.
- CriticAgent matches A and B on correlation_id (= original query_id from query.received). When respondent_id="B", GeneratorAgent must set correlation_id = envelope.correlation_id (which carries the original query_id). If correlation_id is absent, pairwise is skipped for that pair.
- MemoryAgent stores respondent_id from the response.generation payload in the engram metadata. This lets future retrieval filter or weight by respondent. Currently response.generation carries respondent_id; MemoryAgent writes it through.