LoCAL2

Loosely Coupled Agent Language model — Second Generation

Going-In Objectives

1 Native Conversation History
The generator receives the full conversation messages array. Gemma handles followup detection, pronoun resolution, contradiction detection, and multi-turn reasoning natively — without preprocessing or query rewriting. LoCAL1's resolve_query() suppressed these native capabilities; LoCAL2 restores them.
2 Tool-Native Architecture
Web search and memory recall are surfaced as tools Gemma calls when it decides they're needed. No explicit task decomposition, no Synthesizer aggregation, no Gateway classification. Gemma is the orchestrator; LoCAL2 provides the capabilities.
3 Externalized LLM Workings
Thinking tokens, tool calls, memory retrievals, and conversation state are first-class visible artifacts surfaced in the UI — not debug side channels. The goal is to make the model's internal reasoning process transparent and observable.

LoCAL1 vs LoCAL2

LoCAL1 — Explicit Orchestration

  • Analyst rewrites query before generator sees it
  • Analyst classifies depth, decomposes into task DAG
  • Gateway classifies realtime need, routes to web search
  • Synthesizer aggregates task results, arbitrates
  • Generator receives transformed query, not conversation
  • Thinking tokens stripped from history, debug side channel
  • 4k context default (often silently truncated)
  • Many agents, complex pub/sub surface

LoCAL2 — LLM as Orchestrator

  • Generator receives raw query + full conversation history
  • Gemma decides when to decompose, go deep, use tools
  • web_search is a tool Gemma calls when it needs it
  • No Synthesizer — single coherent generation turn
  • Generator sees the full conversation natively
  • Thinking tokens surfaced in UI as first-class output
  • num_ctx set explicitly (32k+)
  • Fewer agents, minimal bus surface

Architecture — Message Flow

UI / API
User submits query → publishes query.received
GeneratorAgent
Receives query + conversation history → calls Ollama chat endpoint with messages array + tools
↓ (Gemma emits tool call → GeneratorAgent publishes to bus → waits for tool.result → appends → resumes)
Tool: web_search
Gemma calls when it needs to find sources. GeneratorAgent publishes tool.request.web_search; WebSearchTool executes (SearXNG / Brave / Tavily), publishes tool.result.web_search with snippets + URLs. Gemma then decides whether to call web_fetch on a specific URL or synthesize from snippets directly.
Tool: web_fetch
Gemma calls when it needs full page content from a known URL (often after web_search). GeneratorAgent publishes tool.request.web_fetch; WebFetchTool retrieves and extracts the page text, publishes tool.result.web_fetch. Gemma orchestrates the search→fetch two-step natively.
Tool: recall_memory
Gemma calls when it wants prior context → backend prepends search_query: → ChromaDB similarity search → result appended to messages. Gemma passes raw text; prefix injected by orchestration layer.
Tool: save_memory
Gemma calls to explicitly save a high-signal fact → backend prepends search_document: → embed → write to ChromaDB. Selective, Gemma-driven. Complements automatic MemoryService ingestion.
GeneratorAgent
Publishes response.generation — answer text + thinking tokens + tool call log
↓ (async bus observers)
Critic
Observes response.generation → evaluates quality → publishes critique.result
MemoryService
Observes response.generation → automatically ingests every answer to episodic memory. Gemma's save_memory saves what it judges important; MemoryService saves everything.
RewardService
Observes user.feedback → routes reward.event to generator
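The prefix injection described for recall_memory and save_memory can be sketched as below. This is a minimal illustration, assuming the orchestration layer owns the prefixes; the function name prepare_for_embedding is hypothetical, while the search_query: and search_document: prefixes are the ones named in the flow above.

```python
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "


def prepare_for_embedding(text: str, *, is_query: bool) -> str:
    """Add the asymmetric-retrieval prefix before the text is embedded.

    Gemma passes raw text; the prefix is injected here so the model never
    has to know about the embedding convention.
    """
    prefix = QUERY_PREFIX if is_query else DOCUMENT_PREFIX
    cleaned = text.strip()
    if cleaned.startswith(prefix):
        return cleaned  # already prefixed, do not double up
    return prefix + cleaned
```

recall_memory would call this with is_query=True before the similarity search; save_memory would call it with is_query=False before writing the embedding.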

Gemma 4 + Ollama Constraints

Thinking tokens — separated by OllamaBackend
OllamaBackend already separates thinking from the response: response.text is the clean answer, response.thinking_text is the think block. Use response.text for conversation history; surface response.thinking_text to the UI via the response.generation payload. No explicit stripping code needed. (Note: regex fallback in ConversationService deferred — only if think tags are observed leaking through in practice.)
Set num_ctx explicitly
Ollama defaults to 4k context on hardware with <24GB VRAM. Always set num_ctx: 32000 (or higher) in config options. Native Gemma 4 supports up to 128k–256k tokens depending on variant; Ollama clips unless told otherwise.
Use the chat endpoint, not generate
The Ollama /api/chat endpoint (or OpenAI-compatible /v1/chat/completions) accepts a messages array and supports tool/function calling. The /api/generate endpoint is single-shot and stateless. LoCAL2 always uses the chat endpoint.
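As a rough sketch of what a chat-endpoint request looks like, the helper below builds the JSON body for /api/chat with an explicit num_ctx. The helper name build_chat_request is hypothetical, and the field names reflect the Ollama chat API as understood at the time of writing, so check them against the installed Ollama version.

```python
import json


def build_chat_request(model, messages, tools=None, num_ctx=32000, think=True):
    """Build the JSON body for a non-streaming POST to /api/chat."""
    body = {
        "model": model,
        "messages": list(messages),      # full conversation, not a single prompt
        "stream": False,
        "think": think,                  # ask for thinking separated from content
        "options": {"num_ctx": num_ctx}, # never rely on the default context size
    }
    if tools:
        body["tools"] = list(tools)      # omit the key entirely when no tools
    return body


if __name__ == "__main__":
    request = build_chat_request(
        "gemma4:12b",
        [{"role": "user", "content": "What is the capital of France?"}],
    )
    print(json.dumps(request, indent=2))
```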
Tool calling reliability scales with model size
gemma4:e2b (2B) may be unreliable for tool calling. The larger variants (27b, 31b) are more consistent. If tool calling is flaky, the signal is to upgrade the local model or use the Google GenAI API (which supports larger Gemma variants via client.chats.create()).

Core Components

Component | Responsibility | Type
GeneratorAgent | Receives queries, maintains conversation, calls tools, publishes answers. The primary agent; everything else is enrichment. | bus agent
web_search tool | Executes a web search query and returns result snippets with URLs. Called synchronously by the generator when Gemma requests it. | sync tool
recall_memory tool | Retrieves relevant episodic engrams from ChromaDB by embedding similarity. Called synchronously by the generator. | sync tool
MemoryService | Ingests answers to episodic memory. Observes response.generation. Also backs the recall_memory tool. | bus observer
Critic | Post-generation quality evaluation. Observes response.generation, scores the answer, publishes critique.result. | bus observer
RewardService | Routes user.feedback → reward.event to the generator. Carries over from LoCAL1 largely unchanged. | bus observer
ConversationService | Maintains per-session message history. Owned by the generator. Archives only the clean answer text for assistant turns (thinking is separated upstream). | internal

Observability — Externalizing LLM Workings

A going-in objective: make the model's internal reasoning process visible, not just the final answer.

Thinking Tokens

Gemma 4's <|think|> content is displayed in the UI as a collapsible reasoning section — shown before the final answer. Stripped from history after display but visible to the user in real time.

Tool Call Log

Each tool invocation is surfaced: which tool was called, the query/parameters sent, and what was returned. Gives the user visibility into when and why Gemma decided to search or recall.

Memory Recalls

When recall_memory is called, the retrieved engrams are shown: content, similarity score, age, and reward status. User can see what prior context influenced the answer.

Conversation State

The active messages array (minus thinking tokens) is inspectable. Shows how context has accumulated across the session and what the model is working with.

What Carries Over from LoCAL1

Component | Status | Notes
ZMQ bus / proxy | carry | Reuse as-is. Subjects change but the infrastructure is sound.
EngramService / Engram | carry | ChromaDB data layer carries over directly.
EmbeddingService | carry | nomic-embed-text asymmetric RAG pattern unchanged.
PersonalMemoryService | carry | Adapt: remove bus wiring, expose as tool interface.
RewardService | carry | Carries over largely unchanged.
Critic evaluation logic | carry | Strip pub/sub scaffolding, keep scoring policy.
ConversationService | carry | Adapt: max_turns bump; append_turn() receives clean text (thinking already separated upstream).
ParticipantLogWidget | carry | Main PySide6 window: sidebar + log view. Adapt: title "LoCAL2", memory panel deferred to Phase 2.
BusLogger / ThinkingLogger / EventLogger | carry | Adapt BusLogger: replace memory.engram / task.available display logic with response.generation display (answer preview + thinking indicator + tool call count).
ParticipantSettingsDialog | carry | Carry over verbatim. Depends on agent.get_settings_schema() / update_settings().
StarRatingWidget | carry | Carry over, not wired until Phase 4 (user.feedback signal).
LoCALSession | carry | Adapt: simplify OBSERVE list to 4 subjects; terminate on response.generation (not answer.dialog). Same publisher/subscriber pattern.
run_api.py (FastAPI gateway) | carry | Adapt: /query response adds thinking + tool_calls fields; /feedback stubbed to 501 until Phase 4. Moved to src/local/api/api.py.
scripts/run_generator.py pattern | carry | Replaced by run_local.py, the unified entry point (proxy + generator + optional API + optional UI).
OllamaBackend.chat() | carry | Already exists; just not used by the generator today.
Web search execution | gone | Gemma 4 native tool calling replaces the GatewayAgent approach entirely. WebSearchService is a fresh implementation backed by a configurable provider (SearXNG / Brave / Tavily).
GatewayAgent (as agent) | gone | Gone entirely: no classify step, no LoCAL1 gateway code reused.
AnalystAgent | gone | Decomposition and depth classification handled by Gemma natively.
SynthesizerAgent | gone | No task aggregation needed in tool-native architecture.
task.available / task.result subjects | gone | No task pipeline; tool calls are synchronous within generation.

Implementation Phases

1a

Generator core — conversation history, bus, UI, API  → detailed plan  DONE

  • GeneratorAgent: query.received → response.generation, streaming tool loop (max 5 iters)
  • ConversationService — multi-turn history, thinking-clean turns
  • Bus, proxy, PySide6 UI, FastAPI gateway
  • run_local.py unified entry point with --api / --headless
  • Thinking tokens surfaced in response payload and UI
  • Dynamic schema registration — tools announce JSON schema on tool.schema
  • Stories S1 (basic Q&A) + S2 (multi-turn followup) passing
1b

Tool bus — WebSearchTool + WebFetchTool  detailed plan →  DONE

  • WebSearchTool subscribes tool.request.web_search — SearXNG provider
  • WebFetchTool subscribes tool.request.web_fetch — BeautifulSoup extraction
  • GeneratorAgent: DISPATCHING_TOOL → WAITING_FOR_TOOL state transitions
  • Timeout / no-response handling → error string appended, generation resumes
  • Stories S3 (web_search fires) + S4 (search→fetch two-step) passing
2

Memory as tool — search_memory + MemoryAgent auto-ingest  DONE

  • search_memory tool — semantic search over ChromaDB episodic store via nomic embeddings
  • MemoryAgent — subscribes to response.generation, auto-ingests every Q&A turn with LLM classification (intent + entities)
  • Single memory tool design — Gemma calls search_memory for recall; MemoryAgent handles writes automatically. No explicit save tool.
  • Tool panel UI: Sm / Ws / Wf panels with Activity + Settings tabs
  • Story S5 (episodic recall — "how do I like my coffee?") passing
3

CriticAgent — absolute grading + memory score annotation  detailed plan →

  • CriticAgent subscribes to response.generation; calls Prometheus for absolute 1–5 grading
  • Publishes critique.result (score, feedback, query_id, session_id)
  • MemoryAgent subscribes to critique.result; annotates the engram with critic_score
  • UI shows score badge inline below each response (color-coded by score)
  • Prerequisite: ollama pull prometheus:7b (Prometheus-7B-v2.0-Q4_K_M)
  • Story S6 (critique.result published with integer score 1–5)
4

Dual respondents + Prometheus pairwise + reward signals

  • Second GeneratorAgent instance (RespondentB) with behavioural variation via config
  • Prometheus pairwise comparison via critic_agent_tool.py (Gemma-callable)
  • user.feedback (thumbs up/down) → RewardService → reward.event to producing agent — eventual direction is conversational feedback initiated by Gemma rather than a UI control
  • Score-weighted retrieval in SearchMemoryTool

Deferred to later phase: Low-score retry ("would you like me to try again?"). Correct design is Gemma initiating the retry conversation when it is aware of its own critic score — requires experimentation with how to surface critique.result back into Gemma's context.

5

Dual Respondents + Pairwise Comparison — detailed plan →

  • RespondentB = same GeneratorAgent class, different model/temperature via --respondent-b --model-b --temperature-b CLI flags. No subclass.
  • B is a shadow generator: thinking + response + dialog filtered from UI entirely; only A appears to user
  • CriticAgent pairwise buffer: grades A and B independently, runs Prometheus pairwise when both arrive, publishes pairwise.result
  • MemoryAgent annotates both engrams with pairwise_winner: true/false
  • respondent_id carried on every bus message (thinking, response, dialog, critique)
  • Story S7 (pairwise.result on bus, both engrams annotated) ✅ COMPLETE
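The pairwise buffer in the CriticAgent bullet above could be sketched roughly as follows. This is an illustration only: the class name PairwiseBuffer, the respondent ids "A" and "B", and keying on query_id are assumptions, not taken from the actual CriticAgent.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class PairwiseBuffer:
    """Hold answers per query until every expected respondent has reported."""

    expected: Tuple[str, ...] = ("A", "B")  # assumed respondent_id values
    _pending: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add(self, query_id: str, respondent_id: str, answer: str) -> Optional[Dict[str, str]]:
        """Store one answer; return the complete pair once both have arrived."""
        bucket = self._pending.setdefault(query_id, {})
        bucket[respondent_id] = answer
        if all(r in bucket for r in self.expected):
            # Both answers are in: hand them to the pairwise comparison
            # and forget the query so the buffer does not grow unbounded.
            return self._pending.pop(query_id)
        return None
```

The critic would call add() for each response.generation it observes and run the pairwise comparison only when a non-None pair comes back.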
6

Observability UI — detailed plan →

  • Critic panel (Cr): activity log of absolute grades + pairwise results; settings tab for critic.yaml
  • Memory inspector (🧠): table of episodic engrams with Age, Respondent, Score, Sentiment, Winner, Query; Refresh button; click to expand
  • Each participant has its own panel — no shared "busy" tool sidebar
  • Story S8 (Cr panel shows grade, 🧠 panel shows annotated engram)

Phase 1 — Detailed Implementation Plan

1a: Generator · Conversation History · Bus · UI · API

1b: WebSearchTool · WebFetchTool · Tool Bus

Scope & Deliverables

Phase 1 produces the minimum end-to-end system: a user can send a query on the bus, Gemma answers with full thinking surfaced, conversation history accumulates across turns, and Gemma can call web_search when it decides current data is needed.

In scope

  • Project scaffold (src layout, config, tests)
  • ZMQ bus carry-over (pub/sub, proxy)
  • Protocol layer (envelope, subjects, message types)
  • services/ollama_backend.py adapted for non-generator callers
  • services/conversation_service.py (thinking-clean turns)
  • WebSearchTool + WebFetchTool as bus participants
  • GeneratorAgent: publishes tool.request.*, waits for tool.result.*
  • generator_states.py + generator_transitions.py
  • config/generator.yaml, config/web_search.yaml
  • LoCALSession + FastAPI gateway (POST /query returns thinking + tool_calls)
  • run_local.py — unified entry point with --api and --headless flags
  • Unit tests + four acceptance stories (S1–S4) run via the API

Not in scope (Phase 2+)

  • ChromaDB / memory recall tool
  • MemoryService ingestion observer
  • Critic / Prometheus evaluation
  • Reward / feedback signal
  • Pairwise respondents
  • Observability UI
  • save_memory tool

Project Layout

Mirror LoCAL1's src/local/ tree, with the same PYTHONPATH=src convention. Each entry below is annotated as new, carry over, or adapt.

LoCAL2/
├── config/
│   ├── generator.yaml          model, num_ctx, system_prompt, tool_definitions
│   ├── web_search.yaml         max_results, fetch_enabled, synthesis config
│   └── system.yaml             ollama_debug, bus addresses, timeouts
├── src/
│   └── local/
│       ├── __init__.py
│       ├── config_loader.py        carry over verbatim (get_config / ConfigManager)
│       ├── protocol/
│       │   ├── __init__.py
│       │   ├── envelope.py         carry over (MessageEnvelope, envelope_debug_dict)
│       │   ├── message_types.py    carry over (EVENT, RESULT, TASK_REQUEST constants)
│       │   └── subjects.py         LoCAL2 subset (7 Phase 1 subjects, see Bus section below)
│       ├── transport/
│       │   ├── __init__.py
│       │   ├── zmq_pubsub.py       carry over (ZmqPublisher, ZmqSubscriber)
│       │   └── bus.py              carry over (MessageBus wrapper)
│       ├── runtime/
│       │   ├── __init__.py
│       │   └── proxy.py            carry over (ZMQ XPUB/XSUB proxy)
│       ├── agents/
│       │   ├── __init__.py
│       │   ├── generator_agent.py  new: publishes tool.request.*, waits for tool.result.*, publishes response.generation
│       │   ├── generator_states.py IDLE / RECEIVING / GENERATING / DISPATCHING_TOOL / WAITING_FOR_TOOL / PUBLISHING
│       │   └── generator_transitions.py  transition table
│       ├── tools/
│       │   ├── __init__.py
│       │   ├── web_search_tool.py subscribes tool.request.web_search; configurable provider
│       │   └── web_fetch_tool.py  subscribes tool.request.web_fetch; BeautifulSoup extraction
│       ├── services/
│       │   ├── __init__.py
│       │   ├── ollama_backend.py   adapt for non-generator callers (Critic etc.)
│       │   └── conversation_service.py adapt append_turn() to accept clean text
│       ├── ui/
│       │   ├── __init__.py
│       │   ├── agent_participant_gui.py adapt: title "LoCAL2", BusLogger → response.generation, memory panel deferred
│       │   └── star_rating_widget.py   carry over; not wired until Phase 4
│       ├── session/
│       │   ├── __init__.py
│       │   └── local_session.py        adapt: OBSERVE list (4 subjects), terminate on response.generation
│       └── api/
│           ├── __init__.py
│           └── api.py              adapt from run_api.py: /query returns thinking+tool_calls; /feedback → 501 until Phase 4
├── scripts/
│   └── run_local.py             unified entry point: proxy + generator + --api + --headless flags
└── tests/
    ├── __init__.py
    ├── test_conversation_service.py
    ├── test_generator_states.py
    └── stories/
        ├── S1_basic_qa.yaml        single turn, no tools, thinking surfaced
        ├── S2_multi_turn.yaml      followup resolves without re-stating context
        ├── S3_web_search.yaml      current-data query → web_search fires, snippets in answer
        └── S4_web_fetch.yaml       query needing full page → search then fetch two-step

Bus Subjects — Phase 1

Subject | Publisher | Subscriber(s) | Notes
query.received | UI / API / test harness | GeneratorAgent | Entry point. Payload: {session_id, query_id, query}
tool.request.web_search | GeneratorAgent | WebSearchTool | Payload: {query_id, tool: "web_search", args: {query}}
tool.result.web_search | WebSearchTool | GeneratorAgent | Payload: {query_id, result: "...", error: null}. query_id used to correlate.
tool.request.web_fetch | GeneratorAgent | WebFetchTool | Payload: {query_id, tool: "web_fetch", args: {url}}
tool.result.web_fetch | WebFetchTool | GeneratorAgent | Payload: {query_id, result: "...", error: null}
response.generation | GeneratorAgent | UI (Phase 5), Critic (Phase 3), MemoryService (Phase 2) | Full payload: answer, thinking, tool_calls, session_id
answer.dialog | GeneratorAgent | LoCALSession (API), history trackers | Lightweight: {session_id, query, answer}, no thinking tokens

Subjects user.feedback, reward.event, critique.result defined in subjects.py but not wired until Phase 3/4. Error path: if a tool agent doesn't respond within timeout, GeneratorAgent receives no tool.result.* and appends an error string to the messages array, then resumes generation.
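The request/result payloads and the query_id correlation can be sketched with a few small helpers. The helper names are hypothetical; the payload shapes and subject names follow the Phase 1 subject list above.

```python
def request_subject(tool: str) -> str:
    """Subject the generator publishes on, e.g. tool.request.web_search."""
    return f"tool.request.{tool}"


def result_subject(tool: str) -> str:
    """Subject the tool publishes its answer on, e.g. tool.result.web_search."""
    return f"tool.result.{tool}"


def build_tool_request(query_id: str, tool: str, args: dict) -> dict:
    """Payload for tool.request.*; query_id lets the result be matched later."""
    return {"query_id": query_id, "tool": tool, "args": dict(args)}


def build_tool_result(query_id: str, result=None, error=None) -> dict:
    """Payload for tool.result.*; exactly one of result / error is expected."""
    return {"query_id": query_id, "result": result, "error": error}


def is_result_for(payload: dict, query_id: str) -> bool:
    """True when a tool.result.* payload answers the in-flight query."""
    return payload.get("query_id") == query_id
```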

OllamaBackend — Adaptation

GeneratorAgent calls ollama.chat() directly in the tool call loop — it needs access to the raw response["message"] object to append to history. OllamaBackend is kept for non-generator callers (Critic in Phase 3) that just need a text response. Its chat() signature stays largely the same; the tool-call loop lives in GeneratorAgent, not here.

What stays the same

  • call() — unchanged (generate endpoint for Critic etc.)
  • chat() — returns bare text string for non-generator callers
  • _make_client() — unchanged
  • Never-raises contract — returns "" on failure

What changes

  • GeneratorAgent bypasses OllamaBackend entirely for the tool-call loop — calls ollama.chat() directly to keep response["message"] accessible
  • OllamaBackend still used by services that need a simple one-shot text response (Critic in Phase 3)
  • Remove GenerationResponse dataclass; it is not needed, because GeneratorAgent reads thinking directly from the thinking field of response["message"]

Thinking token handling — exact API

When think=True, Ollama separates thinking from the answer. Access pattern in GeneratorAgent:

response = ollama.chat(model=..., messages=..., tools=..., think=True)

message  = response["message"]
answer   = (message.get("content") or "").strip()
thinking = message.get("thinking") or ""  # thinking is a field of the message, not of the top-level response
# message is the raw assistant message; append it directly to history

Regex fallback: if thinking is empty but answer contains <|think|> tags, strip them and move to thinking. Add only if observed in practice.
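If the fallback does turn out to be needed, it might look like the sketch below. The function name is hypothetical, and the tag strings are parameters on purpose: the source names the <|think|> tag but not its closing delimiter, so the caller has to supply whatever pair is actually observed leaking.

```python
import re
from typing import Tuple


def split_leaked_thinking(answer: str, thinking: str,
                          open_tag: str, close_tag: str) -> Tuple[str, str]:
    """Move inline think blocks out of the answer when thinking came back empty."""
    if thinking or open_tag not in answer:
        return answer, thinking  # normal path: nothing leaked, nothing to do
    pattern = re.compile(
        re.escape(open_tag) + r"(.*?)" + re.escape(close_tag), re.DOTALL
    )
    leaked = [block.strip() for block in pattern.findall(answer)]
    cleaned = pattern.sub("", answer).strip()
    return cleaned, "\n".join(leaked)
```

An unterminated open tag is left in place, since there is no reliable way to tell where the thinking ends.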

ConversationService — Adaptation

LoCAL1's API is correct for LoCAL2. The key constraint: append_turn(session_id, user, assistant) must receive the clean answer text as assistant, never raw content with thinking tokens mixed in. Since Ollama already returns thinking separately from the answer content when think=True, no stripping logic is needed here.

What changes

  • No code changes to append_turn() — same signature
  • Caller (GeneratorAgent) passes only the clean answer text, never the answer concatenated with the thinking text
  • Increase _MAX_TURNS_PER_SESSION from 20 → 50 (larger context window)
  • Keep _MAX_SESSIONS = 50

What stays the same

  • OrderedDict session store with LRU eviction
  • get_history(session_id) returns list of message dicts
  • History format: [{"role":"user","content":"..."},{"role":"assistant","content":"..."}]
  • Eviction of oldest session when _MAX_SESSIONS exceeded
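The behaviour listed above (OrderedDict store, LRU eviction, per-session turn cap, message-dict history) can be sketched as follows. This is not LoCAL1's actual ConversationService; the class name ConversationStore and its constructor arguments are illustrative.

```python
from collections import OrderedDict
from typing import Dict, List


class ConversationStore:
    """Per-session history with a turn cap and least-recently-used eviction."""

    def __init__(self, max_sessions: int = 50, max_turns: int = 50) -> None:
        self._max_sessions = max_sessions
        self._max_turns = max_turns  # assumed to be a positive integer
        self._sessions: "OrderedDict[str, List[Dict[str, str]]]" = OrderedDict()

    def append_turn(self, session_id: str, user: str, assistant: str) -> None:
        """Record one user/assistant exchange; assistant must be clean text."""
        history = self._sessions.setdefault(session_id, [])
        self._sessions.move_to_end(session_id)  # mark as most recently used
        history.append({"role": "user", "content": user})
        history.append({"role": "assistant", "content": assistant})
        # Each turn is two messages, so keep the last 2 * max_turns entries.
        del history[:-2 * self._max_turns]
        while len(self._sessions) > self._max_sessions:
            self._sessions.popitem(last=False)  # drop least recently used

    def get_history(self, session_id: str) -> List[Dict[str, str]]:
        """Return a copy of the session's messages, or [] if unknown."""
        history = self._sessions.get(session_id)
        if history is None:
            return []
        self._sessions.move_to_end(session_id)
        return list(history)
```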

WebSearchTool + WebFetchTool — New bus participants

Gemma 4 natively distinguishes search (find sources) from fetch (extract page content). These are separate tools, separate bus participants. Gemma orchestrates the search→fetch two-step when it needs full page content — LoCAL2 does not apply a fetch-on-thin-snippets heuristic. No LoCAL1 gateway code reused. No synthesis LLM call — Gemma synthesizes the answer from raw results in its generation turn.

WebSearchTool

Subscribes to tool.request.web_search. Executes search via configured provider. Publishes tool.result.web_search.

Output format:
Today's date: 2026-05-30

[1] Title: ...
    Snippet: ...
    URL: https://...

[2] ...

Date prefix grounds time-sensitive answers. Up to max_results from config. URLs included so Gemma can call web_fetch on a specific one.
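A formatter producing this output might look like the sketch below. The function name and the result keys title, snippet, and url are assumptions about how a provider adapter would normalise its results.

```python
from datetime import date
from typing import Iterable, Mapping, Optional


def format_search_results(results: Iterable[Mapping[str, str]],
                          max_results: int = 5,
                          today: Optional[date] = None) -> str:
    """Render provider results as the dated, numbered text block Gemma reads."""
    today = today or date.today()
    lines = [f"Today's date: {today.isoformat()}", ""]
    for index, item in enumerate(list(results)[:max_results], start=1):
        lines.append(f"[{index}] Title: {item.get('title', '')}")
        lines.append(f"    Snippet: {item.get('snippet', '')}")
        lines.append(f"    URL: {item.get('url', '')}")
        lines.append("")  # blank line between entries
    return "\n".join(lines).rstrip()
```

Passing today explicitly keeps tests deterministic; in production the default picks up the current date.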

WebFetchTool

Subscribes to tool.request.web_fetch. Fetches the URL, extracts body text (BeautifulSoup), truncates to fetch_max_chars. Publishes tool.result.web_fetch.

Output format:
URL: https://...
Extracted text:
[first fetch_max_chars characters of body text]

On fetch failure (timeout, 4xx/5xx), publishes tool.result.web_fetch with error field set. Gemma reads the error and decides whether to try another URL.
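The truncation and error path can be sketched without the HTTP or BeautifulSoup parts. The helper name is hypothetical, and collapsing whitespace before truncating is an assumption added for illustration; the payload shape follows tool.result.web_fetch.

```python
from typing import Optional


def build_fetch_result(query_id: str, url: str,
                       text: Optional[str] = None,
                       error: Optional[str] = None,
                       fetch_max_chars: int = 2000) -> dict:
    """Build the tool.result.web_fetch payload for a success or a failure."""
    if error is not None:
        # Failure: no result text, only the error for Gemma to read.
        return {"query_id": query_id, "result": None, "error": error}
    # Collapse runs of whitespace, then cap the length.
    body = " ".join((text or "").split())[:fetch_max_chars]
    result = f"URL: {url}\nExtracted text:\n{body}"
    return {"query_id": query_id, "result": result, "error": None}
```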

WebSearchTool — configurable providers

Provider | Config key | Notes
SearXNG | provider: searxng | Self-hosted, free. Default for dev.
Brave Search | provider: brave | API key via env BRAVE_API_KEY.
Tavily | provider: tavily | API key via env TAVILY_API_KEY. Returns extracted content, so Gemma may not need to call web_fetch separately.
Mock | provider: mock | Returns canned results. Use in tests to avoid live HTTP.

Error and no-response handling

  • If WebSearchTool finds no results: publishes tool.result.web_search with error: "no results"
  • If WebFetchTool fails: publishes tool.result.web_fetch with error: "fetch failed: 404"
  • If the tool agent does not respond within timeout: GeneratorAgent takes the TOOL_TIMEOUT transition (WAITING_FOR_TOOL → GENERATING), appends "[tool unavailable]" as the tool result, and resumes generation
  • Gemma reads error strings and adapts — may retry with a different query or acknowledge the limitation
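The wait-with-timeout behaviour can be sketched with a standard-library queue standing in for the bus subscription. The function name, the use of queue.Queue, and the "[tool error: ...]" wording are assumptions; "[tool unavailable]" is the string given above.

```python
import queue
import time

TOOL_UNAVAILABLE = "[tool unavailable]"


def wait_for_tool_result(inbox: queue.Queue, query_id: str,
                         timeout_seconds: float) -> str:
    """Block until the matching tool.result.* payload arrives or time runs out.

    Always returns a string, so generation can resume either way.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return TOOL_UNAVAILABLE
        try:
            payload = inbox.get(timeout=remaining)
        except queue.Empty:
            return TOOL_UNAVAILABLE
        if payload.get("query_id") != query_id:
            continue  # stale result from an earlier query, keep waiting
        if payload.get("error"):
            return f"[tool error: {payload['error']}]"
        return str(payload.get("result") or "")
```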

UI — PySide6 Carry-Over

The LoCAL1 PySide6 UI structure carries over almost verbatim. One file needs meaningful adaptation (agent_participant_gui.py); the rest are carry-overs with minor changes.

File | Action | Key change
agent_participant_gui.py | adapt | Window title "LoCAL2". BusLogger.log(): replace memory.engram / task.available subject-specific blocks with response.generation display (answer preview, thinking indicator, tool call count). Disable episode & memory panel in Phase 1 (no ChromaDB yet) and show greyed-out "Memory (Phase 2)" tooltip.
star_rating_widget.py | carry | No changes. Not wired in Phase 1; used in Phase 4 for user.feedback.
scripts/run_generator.py | adapt | Subscribe to query.received (not request.generation). Wire GenerationResponse.thinking_text to ThinkingLogger directly (no separate thinking_callback attribute; thinking comes back in the response object). Remove status_callback / THINKING_TRACE bus publish for Phase 1.
scripts/run_proxy.py | carry | Carry over LoCAL1's run_proxy.py verbatim. ZMQ XPUB/XSUB proxy is unchanged.

BusLogger adaptation — response.generation display

LoCAL1's BusLogger.log() has subject-specific blocks for memory.engram and task.available. Replace those with a block for response.generation:

elif envelope.subject == "response.generation":
    answer = (raw_payload.get("answer") or "")[:80].replace("\n", " ")
    thinking = raw_payload.get("thinking") or ""
    tool_calls = raw_payload.get("tool_calls") or []
    think_ind = "◈ thinking" if thinking else ""
    tool_ind  = f"⚙ {len(tool_calls)} tool call(s)" if tool_calls else ""
    indicators = "  ".join(x for x in [think_ind, tool_ind] if x)
    generation_line = f"\n  Answer:    {answer}{'…' if len(answer)==80 else ''}"
    if indicators:
        generation_line += f"\n  Flags:     {indicators}"

Phase 1 memory panel — disabled gracefully

In Phase 1 there is no ChromaDB, so the ◈ Memory button should be disabled with tooltip "Memory (Phase 2)". The episode panel (≡) should also be disabled — "Episodes (Phase 2)". This avoids import errors and keeps the UI runnable before memory is wired. Re-enable both in Phase 2 when EngramService is available.

GeneratorAgent — New

The core agent. Subscribes to query.received, runs the tool call loop, publishes response.generation.

Tool call loop — exact Ollama API

def _generate(self, session_id, query):
    history = self._conv.get_history(session_id) or []
    messages = [SYSTEM_MSG] + history + [{"role": "user", "content": query}]
    tool_call_log = []
    options = {"num_ctx": cfg["num_ctx"]}
    max_iterations = cfg.get("max_tool_iterations", 5)
    message = {}  # stays empty only if max_iterations is 0

    for _ in range(max_iterations):
        response = ollama.chat(
            model=self._model,
            messages=messages,
            tools=TOOL_SCHEMAS,
            think=True,
            options=options,
        )
        message = response["message"]
        # Always append the raw assistant message to the working messages list
        messages.append(message)

        tool_calls = message.get("tool_calls") or []
        if not tool_calls:
            break  # final text answer, no more tool calls

        for tc in tool_calls:
            name = tc["function"]["name"]
            args = tc["function"]["arguments"]  # already a parsed dict
            tool_result = self._execute_tool(name, args)
            tool_call_log.append({"tool": name, "args": args, "result": tool_result})
            messages.append({
                "role": "tool",
                "content": str(tool_result),
                "tool_name": name,  # tells the model which tool produced this result
            })

    # If the iteration guard fired, this is whatever text the last turn produced.
    answer = (message.get("content") or "").strip()
    thinking = message.get("thinking") or ""  # thinking is a field of the message
    self._conv.append_turn(session_id, query, answer)
    return answer, thinking, tool_call_log

response["message"] is appended directly; never construct the assistant dict manually. Each tool result message carries the tool's name in the tool_name field (older Ollama examples used name). arguments is already a parsed dict, so no JSON decoding is needed; a tool implementation can unpack it with **args.

Tool definitions (in generator.yaml)

tools:
  - type: function
    function:
      name: web_search
      description: >
        Search the web for current information.
        Use when the query requires up-to-date
        facts, news, prices, or recent events.
      parameters:
        type: object
        properties:
          query:
            type: string
            description: Search query string
        required: [query]

response.generation payload

{
  "session_id": "abc123",
  "query": "what's the weather?",
  "answer": "...",
  "thinking": "...",
  "tool_calls": [
    {
      "tool": "web_search",
      "args": {"query": "..."},
      "result": "..."
    }
  ],
  "generator_id": "GeneratorAgent"
}
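A small builder keeps the full payload and the lightweight answer.dialog payload in step. The function name is hypothetical; the field names are the ones shown in the payload above and in the bus subject list.

```python
from typing import Dict, List, Tuple


def build_generation_payloads(session_id: str, query: str, answer: str,
                              thinking: str, tool_calls: List[dict],
                              generator_id: str = "GeneratorAgent"
                              ) -> Tuple[Dict, Dict]:
    """Return (response.generation payload, answer.dialog payload)."""
    full = {
        "session_id": session_id,
        "query": query,
        "answer": answer,
        "thinking": thinking,
        "tool_calls": list(tool_calls),
        "generator_id": generator_id,
    }
    # answer.dialog deliberately omits thinking and the tool call log.
    dialog = {"session_id": session_id, "query": query, "answer": answer}
    return full, dialog
```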

handle_envelope() contract

  • Transition: IDLE → RECEIVING on query.received
  • Extract session_id and query from envelope.payload
  • Transition: RECEIVING → GENERATING, call _generate()
  • Each tool call: transition GENERATING → DISPATCHING_TOOL → WAITING_FOR_TOOL → GENERATING
  • On final answer: transition GENERATING → PUBLISHING
  • Publish response.generation and answer.dialog
  • Transition: PUBLISHING → IDLE
  • On any exception: log error, transition back to IDLE, do not re-raise

State Machine — GeneratorAgent

States (generator_states.py)

  • IDLE — waiting for query
  • RECEIVING — envelope received, history loaded
  • GENERATING — ollama.chat() call in flight
  • DISPATCHING_TOOL — publishing tool.request.* to bus
  • WAITING_FOR_TOOL — blocking on tool.result.* envelope
  • PUBLISHING — writing response.generation to bus

Actions (generator_transitions.py)

  • RECEIVE_QUERY — query.received arrives
  • START_GENERATION — begin ollama.chat() call
  • DISPATCH_TOOL — tool_calls in response, publish request
  • TOOL_RESULT — tool.result.* received, append to messages
  • TOOL_TIMEOUT — no result within timeout, append error
  • FINISH — final text response, no tool calls
  • FAIL — exception → IDLE

Valid transitions

  • IDLE → RECEIVING (RECEIVE_QUERY)
  • RECEIVING → GENERATING (START_GENERATION)
  • GENERATING → DISPATCHING_TOOL (DISPATCH_TOOL)
  • DISPATCHING_TOOL → WAITING_FOR_TOOL (always)
  • WAITING_FOR_TOOL → GENERATING (TOOL_RESULT)
  • WAITING_FOR_TOOL → GENERATING (TOOL_TIMEOUT)
  • GENERATING → PUBLISHING (FINISH)
  • PUBLISHING → IDLE (always)
  • any → IDLE (FAIL)

While in WAITING_FOR_TOOL, incoming query.received envelopes are ignored (not queued — single active session in Phase 1). The GeneratorAgent's message loop must subscribe to both query.received and tool.result.* subjects and dispatch based on subject + current state.
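The states, actions, and valid transitions above map naturally onto a lookup table. The sketch below is one way to write it, not the contents of generator_states.py or generator_transitions.py; TOOL_DISPATCHED and PUBLISHED are hypothetical action names standing in for the two transitions marked "always".

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RECEIVING = auto()
    GENERATING = auto()
    DISPATCHING_TOOL = auto()
    WAITING_FOR_TOOL = auto()
    PUBLISHING = auto()


class Action(Enum):
    RECEIVE_QUERY = auto()
    START_GENERATION = auto()
    DISPATCH_TOOL = auto()
    TOOL_DISPATCHED = auto()  # assumed name for the "always" step
    TOOL_RESULT = auto()
    TOOL_TIMEOUT = auto()
    FINISH = auto()
    PUBLISHED = auto()        # assumed name for the "always" step
    FAIL = auto()


TRANSITIONS = {
    (State.IDLE, Action.RECEIVE_QUERY): State.RECEIVING,
    (State.RECEIVING, Action.START_GENERATION): State.GENERATING,
    (State.GENERATING, Action.DISPATCH_TOOL): State.DISPATCHING_TOOL,
    (State.DISPATCHING_TOOL, Action.TOOL_DISPATCHED): State.WAITING_FOR_TOOL,
    (State.WAITING_FOR_TOOL, Action.TOOL_RESULT): State.GENERATING,
    (State.WAITING_FOR_TOOL, Action.TOOL_TIMEOUT): State.GENERATING,
    (State.GENERATING, Action.FINISH): State.PUBLISHING,
    (State.PUBLISHING, Action.PUBLISHED): State.IDLE,
}


def next_state(state: State, action: Action) -> State:
    """Return the state reached by action; FAIL resets to IDLE from anywhere."""
    if action is Action.FAIL:
        return State.IDLE
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"invalid transition: {state.name} on {action.name}") from None
```

Rejecting unknown pairs with an exception makes it obvious when a query.received arrives in a state that should ignore it, rather than silently moving the agent.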

Configuration Files

config/generator.yaml

agent_id: GeneratorAgent
model: gemma4:12b
num_ctx: 32000
think: true
max_tool_iterations: 5
timeout_seconds: 120
system_prompt: |
  You are a helpful assistant. ...
tools:
  - type: function
    function:
      name: web_search
      description: >
        Search the web to find sources on a topic.
        Returns snippets and URLs. Use when you
        need to find information you don't have.
      parameters:
        type: object
        properties:
          query: {type: string}
        required: [query]
  - type: function
    function:
      name: web_fetch
      description: >
        Fetch the full text content of a specific
        URL. Use after web_search when you need
        more detail than the snippet provides.
      parameters:
        type: object
        properties:
          url: {type: string}
        required: [url]

config/web_search.yaml

provider: searxng        # searxng | brave | tavily | mock
max_results: 5
fetch_enabled: true      # fetch thin snippets; skip for tavily
fetch_max_chars: 2000
timeout_seconds: 15

# Provider-specific
searxng_url: http://localhost:8888
# brave_api_key: set via env BRAVE_API_KEY
# tavily_api_key: set via env TAVILY_API_KEY

config/system.yaml
bus_pub_address: tcp://127.0.0.1:5555
bus_sub_address: tcp://127.0.0.1:5556
ollama_debug: false

# Tests: set provider: mock in web_search.yaml
# to return canned results without HTTP calls

Acceptance Stories — Phase 1

Stories are the acceptance criteria. Phase 1 is not done until all four pass against the live stack.

S1 — Basic Q&A with thinking surfaced

Field | Value
turn 1 query | "What is the capital of France?"
expected subject | response.generation published
answer assertion | contains "Paris"
payload assertion | thinking field is non-empty string
payload assertion | tool_calls is empty list

S2 — Multi-turn: followup resolves without restating subject

Turn | Query | Assertion
1 | "Tell me about the Python GIL." | answer mentions Global Interpreter Lock
2 | "Why was it introduced?" | answer explains GIL's origin without re-asking what "it" is

Validates that conversation history is correctly passed in the messages array and Gemma resolves the pronoun natively.

S3 — web_search fires on current-data query

Field | Value
turn 1 query | "What major AI news broke this week?"
payload assertion | tool_calls has at least one entry with tool == "web_search"
bus assertion | tool.request.web_search and tool.result.web_search both appear in event log
answer assertion | answer references a specific recent event (not hallucinated)
note | Model-size dependent: gemma4:e2b may not reliably call tools. Run with 12b+ for this story.

S4 — search → fetch two-step

Field | Value
turn 1 query | "Summarize the content of the top result for: LoCAL2 Gemma tool calling"
payload assertion | tool_calls has entries for both web_search and web_fetch
bus assertion | four subjects appear in order: tool.request.web_search, tool.result.web_search, tool.request.web_fetch, tool.result.web_fetch
answer assertion | answer contains a summary drawn from page content, not just snippets
note | Gemma must decide independently to call web_fetch; no prompt engineering to force it.

Risks & Open Questions

Tool calling reliability

gemma4:e2b (2B) is unreliable for tool calling. S3 requires at least gemma4:12b. If local VRAM is insufficient, fall back to the Google GenAI API (gemma-4-27b-it via client.chats.create()). Make the backend configurable via generator.yaml: backend: ollama | google_genai.

Infinite tool call loop

If Gemma keeps calling web_search in a loop, the agent hangs. Add a max_tool_iterations: 5 guard in the loop — break and use whatever text is available after the limit. Log a warning when the guard fires.
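
A minimal sketch of the iteration guard, assuming a hypothetical chat callable that returns a dict with content and tool_calls keys, and a hypothetical run_tool callable that executes one tool call and returns its text. Neither is the real OllamaBackend or tool dispatch interface; they only stand in so the loop shape is visible.

```python
import logging
from typing import Callable

log = logging.getLogger("generator")

MAX_TOOL_ITERATIONS = 5  # mirrors max_tool_iterations: 5 from the plan


def generate_with_guard(
    chat: Callable[[list], dict],      # hypothetical: messages -> reply dict
    run_tool: Callable[[dict], str],   # hypothetical: tool call -> result text
    messages: list,
    max_tool_iterations: int = MAX_TOOL_ITERATIONS,
) -> str:
    """Run the chat/tool loop, giving up after max_tool_iterations rounds."""
    last_text = ""
    for _ in range(max_tool_iterations):
        reply = chat(messages)
        last_text = reply.get("content") or last_text
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return last_text  # normal exit: the model produced a final answer
        # Feed the tool results back so the next round can use them.
        messages.append({"role": "assistant",
                         "content": reply.get("content", ""),
                         "tool_calls": tool_calls})
        for call in tool_calls:
            messages.append({"role": "tool", "content": run_tool(call)})
    # Guard fired: keep whatever text is available and log a warning.
    log.warning("max_tool_iterations=%d reached; returning partial text",
                max_tool_iterations)
    return last_text
```

Bounding the loop with range() rather than a counter checked inside a while True keeps the exit condition in one place, so the guard cannot be skipped by an early continue.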

Think tag leakage

Some Ollama versions return thinking inline in message.content rather than in response.thinking. Add a one-pass regex strip in OllamaBackend.chat() as a fallback — only activates if tags are detected in the text field.
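
A possible shape for the fallback strip, assuming the leaked reasoning is wrapped in <think>...</think> tags. The tag name is an assumption (the plan only says "tags"), and split_inline_thinking is a hypothetical helper, not an existing OllamaBackend method.

```python
import re

# Assumed tag format; adjust if the model emits a different marker.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_inline_thinking(content: str, thinking: str = "") -> tuple:
    """Return (clean_content, thinking).

    Does nothing unless think tags are actually present in content, so
    responses that already separate the two fields pass through untouched.
    """
    if "<think>" not in content:
        return content, thinking
    extracted = "\n".join(part.strip() for part in _THINK_RE.findall(content))
    cleaned = _THINK_RE.sub("", content).strip()
    # Prefer the dedicated field when the backend filled it in as well.
    return cleaned, thinking or extracted
```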

Session ID source

Phase 1 has no UI, so the test harness publishes query.received. session_id is expected in every payload. If it is missing, a fresh UUID is generated for that query and the turn runs in stateless single-turn mode. This is graceful degradation, not a hard failure.
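
The fallback could look roughly like this. resolve_session_id is a hypothetical helper name, and the payload is assumed to be a plain dict.

```python
import uuid


def resolve_session_id(payload: dict) -> tuple:
    """Return (session_id, has_history).

    has_history is False when the id was generated here, which means the
    query is handled as a stateless single turn.
    """
    session_id = payload.get("session_id")
    if session_id:
        return str(session_id), True
    return str(uuid.uuid4()), False
```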

API Gateway + Unified Entry Point

run_local.py is the single command to start the whole stack. With --api it also starts a FastAPI server so queries can be submitted over HTTP — useful for testing without the UI and for automated story verification.

run_local.py — usage

# Default: proxy + generator + PySide6 UI
python scripts/run_local.py

# Add API server on port 8765 (alongside UI)
python scripts/run_local.py --api

# Headless (no UI) + API — useful for CI / story runner
python scripts/run_local.py --headless --api

# Override model
python scripts/run_local.py --model gemma4:27b --api

# Custom API port
python scripts/run_local.py --api --api-port 9000

run_local.py — startup sequence

  1. Start ZMQ proxy (XPUB/XSUB) in a daemon thread
  2. Start GeneratorAgent message loop in a daemon thread
  3. If --api: start uvicorn (FastAPI) in a daemon thread on --api-port
  4. If not --headless: create QApplication, build ParticipantLogWidget, call app.exec() (blocks main thread)
  5. If --headless: block on threading.Event().wait() until Ctrl-C
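
The headless path of this sequence can be sketched with the standard library alone. The run_proxy, run_generator, and run_api callables are placeholders for the real ZMQ proxy, GeneratorAgent loop, and uvicorn server; start_stack and block_headless are hypothetical names, not functions from run_local.py.

```python
import threading
from typing import Callable, Optional


def _spawn(name: str, target: Callable[[], None]) -> threading.Thread:
    """Start target in a daemon thread so it dies with the main thread."""
    thread = threading.Thread(target=target, name=name, daemon=True)
    thread.start()
    return thread


def start_stack(
    run_proxy: Callable[[], None],
    run_generator: Callable[[], None],
    run_api: Optional[Callable[[], None]] = None,
) -> list:
    """Steps 1 to 3: proxy first, then the generator, then the optional API."""
    threads = [_spawn("zmq-proxy", run_proxy), _spawn("generator", run_generator)]
    if run_api is not None:  # corresponds to the --api flag
        threads.append(_spawn("api", run_api))
    return threads


def block_headless(stop: threading.Event) -> None:
    """Step 5: park the main thread until stop is set or Ctrl-C arrives."""
    try:
        stop.wait()
    except KeyboardInterrupt:
        pass
```

Because every worker is a daemon thread, returning from block_headless is enough to shut the process down; nothing needs to be joined explicitly.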

LoCALSession adaptation

OBSERVE = [
    QUERY_RECEIVED,
    RESPONSE_GENERATION,
    ANSWER_DIALOG,
    CRITIQUE,          # Phase 3 — observed but ignored until then
]

_TRAIL_SECONDS = 2.0  # reduced from 3s

# stream() terminates on RESPONSE_GENERATION
# (not ANSWER_DIALOG as in LoCAL1) then waits
# the trailing window for critique.result

The session publishes query.received and filters received events by correlation_id. Same pattern as LoCAL1 — only the subjects and terminal condition change.
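
A simplified sketch of the filter and terminal condition. It assumes each event is a dict with subject, correlation_id, and a timestamp t in seconds, and it measures the trailing window from event timestamps; a real session would more likely use a receive timeout on the bus socket. The envelope shape is an assumption.

```python
from typing import Iterable, Iterator

RESPONSE_GENERATION = "response.generation"
_TRAIL_SECONDS = 2.0


def stream(
    events: Iterable[dict],
    correlation_id: str,
    trail_seconds: float = _TRAIL_SECONDS,
) -> Iterator[dict]:
    """Yield events for one query, then stop after the trailing window."""
    deadline = None
    for event in events:
        if deadline is not None and event["t"] > deadline:
            break  # trailing window elapsed
        if event.get("correlation_id") != correlation_id:
            continue  # belongs to another query
        yield event
        if deadline is None and event["subject"] == RESPONSE_GENERATION:
            # Terminal subject seen: keep listening briefly for late events
            # such as critique.result.
            deadline = event["t"] + trail_seconds
```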

API routes — Phase 1

Method | Path | Notes
GET | /health | Carry over verbatim
POST | /query | Adapt: response adds thinking + tool_calls
POST | /feedback | Returns 501 until Phase 4

# POST /query response
{
  "query_id": "...",
  "session_id": "...",
  "answer": "...",
  "thinking": "...",        # ← new
  "tool_calls": [...],      # ← new
  "events": ["query.received", "response.generation"],
  "event_log": [{"subject": "...", "t": 0.0}]
}

Story runner integration

Stories (S1–S4) run against the live stack via the API: start run_local.py --headless --api, then POST each turn to /query with the same session_id and assert on the response payload. This is simpler and more reliable than subscribing to the bus directly from the test runner.
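
One way a story runner could be structured, with the HTTP transport injected so the same code can be exercised against a fake. The request body shape ({"query", "session_id"}) is an assumption, since only the response shape is specified above; run_story and http_post_query are hypothetical names, and 8765 is the default API port from the usage section.

```python
import json
import urllib.request
import uuid
from typing import Callable


def http_post_query(body: dict, base_url: str = "http://127.0.0.1:8765") -> dict:
    """POST one turn to /query on a running run_local.py --headless --api."""
    request = urllib.request.Request(
        f"{base_url}/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)


def run_story(post_query: Callable[[dict], dict], turns: list) -> list:
    """Send each (query, check) turn under one session_id and assert on it."""
    session_id = str(uuid.uuid4())  # shared so history carries across turns
    responses = []
    for query, check in turns:
        response = post_query({"query": query, "session_id": session_id})
        if not check(response):
            raise AssertionError(f"story assertion failed for turn: {query!r}")
        responses.append(response)
    return responses
```

Against the live stack, S2 would then be run_story(http_post_query, [...]) with one check per turn.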

Build Order

Each step should be runnable/testable before moving to the next. Phase 1a completes at step 5; Phase 1b completes at step 8.

— PHASE 1a —

1 Scaffold + protocol layer

Create directory tree. Carry over config_loader, envelope, zmq_pubsub, proxy, bus. Write LoCAL2 subjects.py (all Phase 1 subjects including tool.request.* / tool.result.*). Verify: python -c "from local.protocol.subjects import QUERY_RECEIVED" runs.

2 OllamaBackend + ConversationService

Carry over and adapt. Write test_conversation_service.py. Verify ollama.chat() direct call returns thinking via getattr(response, "thinking", "").

3 State machine skeleton

Write generator_states.py and generator_transitions.py — all states including DISPATCHING_TOOL and WAITING_FOR_TOOL. Write test_generator_states.py.
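
As an illustration of what the skeleton might contain: only DISPATCHING_TOOL and WAITING_FOR_TOOL are named in this plan, so IDLE, GENERATING, and DONE below are hypothetical placeholders, and the transition table is a guess at a plausible shape rather than the intended design.

```python
from enum import Enum, auto


class GeneratorState(Enum):
    IDLE = auto()              # hypothetical
    GENERATING = auto()        # hypothetical
    DISPATCHING_TOOL = auto()  # named in the plan
    WAITING_FOR_TOOL = auto()  # named in the plan
    DONE = auto()              # hypothetical


S = GeneratorState

# Allowed next states for each state (assumed, for illustration only).
TRANSITIONS = {
    S.IDLE: frozenset({S.GENERATING}),
    S.GENERATING: frozenset({S.DISPATCHING_TOOL, S.DONE}),
    S.DISPATCHING_TOOL: frozenset({S.WAITING_FOR_TOOL}),
    S.WAITING_FOR_TOOL: frozenset({S.GENERATING}),
    S.DONE: frozenset({S.IDLE}),
}


def transition(current: GeneratorState, target: GeneratorState) -> GeneratorState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the table as data makes test_generator_states.py straightforward: each legal and illegal pair becomes one assertion.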

4 UI + LoCALSession + API + run_local.py

Carry over and adapt UI, session, api.py. Write run_local.py. Verify window launches, curl /health returns ok.

5 GeneratorAgent (no tools) → Stories S1 + S2 ✓ Phase 1a

Wire GeneratorAgent without tool schemas. POST /query returns answer + thinking. Stories S1 (basic Q&A, thinking surfaced) and S2 (multi-turn pronoun resolution) pass. Phase 1a complete.

— PHASE 1b —

6 WebSearchTool

Implement web_search_tool.py with configurable provider. Set provider: mock for unit test — verify tool.result.web_search published with correct payload and correlation ID.
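
A sketch of how the mock provider and the result payload could fit together. The payload fields (results, title, url, snippet), the request shape, and the PROVIDERS registry are assumptions for illustration; only the tool.result.web_search subject, the provider and max_results settings, and the correlation ID requirement come from the plan.

```python
from typing import Callable


def mock_search(query: str, max_results: int) -> list:
    """Return canned results without any HTTP call (provider: mock)."""
    canned = [
        {
            "title": f"Mock result {i}",
            "url": f"https://example.invalid/{i}",
            "snippet": f"Canned snippet for {query!r}",
        }
        for i in range(1, 11)
    ]
    return canned[:max_results]


# Hypothetical registry; searxng, brave, and tavily would be added here.
PROVIDERS = {"mock": mock_search}


def handle_search_request(request: dict, config: dict) -> dict:
    """Build the tool.result.web_search payload for one request."""
    provider: Callable = PROVIDERS[config["provider"]]
    results = provider(request["query"], config.get("max_results", 5))
    return {
        "subject": "tool.result.web_search",
        # Echo the id so the generator can match the result to its request.
        "correlation_id": request["correlation_id"],
        "results": results,
    }
```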

7 WebFetchTool

Implement web_fetch_tool.py with BeautifulSoup extraction. Unit test with a local HTML fixture — verify extracted text truncated to fetch_max_chars.
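
The extract-then-truncate step can be illustrated without third-party dependencies. This sketch deliberately swaps BeautifulSoup for the standard library's html.parser so it runs anywhere; the real tool is planned around BeautifulSoup, and extract_text is a hypothetical helper name. The 2000 default mirrors fetch_max_chars in config/web_search.yaml.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style contents."""

    _SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def extract_text(html: str, fetch_max_chars: int = 2000) -> str:
    """Return the page's visible text, truncated to fetch_max_chars."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)[:fetch_max_chars]
```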

8 GeneratorAgent tool loop + Stories S3 + S4 ✓ Phase 1b

Add tool schemas to GeneratorAgent, wire DISPATCHING_TOOL → WAITING_FOR_TOOL transitions, add timeout handling. Start full stack with run_local.py --headless --api. Stories S3 (web_search fires) and S4 (search→fetch two-step) pass. Phase 1b complete.

LoCAL2 Initial Architecture Plan · 2026-05-30