- query.received
- tool.request.web_search; WebSearchTool executes (SearXNG / Brave / Tavily), publishes tool.result.web_search with snippets + URLs. Gemma then decides whether to call web_fetch on a specific URL or synthesize from snippets directly.
- tool.request.web_fetch; WebFetchTool retrieves and extracts the page text, publishes tool.result.web_fetch. Gemma orchestrates the search→fetch two-step natively.
- search_query: → ChromaDB similarity search → result appended to messages. Gemma passes raw text; prefix injected by orchestration layer.
- search_document: → embed → write to ChromaDB. Selective, Gemma-driven. Complements automatic MemoryService ingestion.
- response.generation: answer text + thinking tokens + tool call log
- response.generation → evaluates quality → publishes critique.result
- response.generation → automatically ingests every answer to episodic memory. Gemma's save_memory saves what it judges important; MemoryService saves everything.
- user.feedback → routes reward.event to generator

- response.text is the clean answer, response.thinking_text is the think block. Use response.text for conversation history; surface response.thinking_text to the UI via the response.generation payload. No explicit stripping code needed. (Note: regex fallback in ConversationService deferred; add it only if think tags are observed leaking through in practice.)
- num_ctx: 32000 (or higher) in config options. Native Gemma 4 supports up to 128k–256k tokens depending on variant; Ollama clips unless told otherwise.
- /api/chat endpoint (or OpenAI-compatible /v1/chat/completions) accepts a messages array and supports tool/function calling. The /api/generate endpoint is single-shot and stateless. LoCAL2 always uses the chat endpoint.
- gemma4:e2b (2B) may be unreliable for tool calling. The larger variants (27b, 31b) are more consistent. If tool calling is flaky, the signal is to upgrade the local model or use the Google GenAI API (which supports larger Gemma variants via client.chats.create()).

| Component | Responsibility | Type |
|---|---|---|
| GeneratorAgent | Receives queries, maintains conversation, calls tools, publishes answers. The primary agent — everything else is enrichment. | bus agent |
| web_search tool | Executes a web search query and returns snippets and URLs (no synthesis step). Called synchronously by the generator when Gemma requests it. | sync tool |
| recall_memory tool | Retrieves relevant episodic engrams from ChromaDB by embedding similarity. Called synchronously by the generator. | sync tool |
| MemoryService | Ingests answers to episodic memory. Observes response.generation. Also backs the recall_memory tool. | bus observer |
| Critic | Post-generation quality evaluation. Observes response.generation, scores the answer, publishes critique.result. | bus observer |
| RewardService | Routes user.feedback → reward.event to the generator. Carries over from LoCAL1 largely unchanged. | bus observer |
| ConversationService | Maintains per-session message history. Owned by the generator. Strips thinking tokens before archiving assistant turns. | internal |
A going-in objective: make the model's internal reasoning process visible, not just the final answer.
Gemma 4's <|think|> content is displayed in the UI as a collapsible reasoning section — shown before the final answer. Stripped from history after display but visible to the user in real time.
Each tool invocation is surfaced: which tool was called, the query/parameters sent, and what was returned. Gives the user visibility into when and why Gemma decided to search or recall.
When recall_memory is called, the retrieved engrams are shown: content, similarity score, age, and reward status. User can see what prior context influenced the answer.
The active messages array (minus thinking tokens) is inspectable. Shows how context has accumulated across the session and what the model is working with.
| Component | Status | Notes |
|---|---|---|
| ZMQ bus / proxy | carry | Reuse as-is. Subjects change but the infrastructure is sound. |
| EngramService / Engram | carry | ChromaDB data layer carries over directly. |
| EmbeddingService | carry | nomic-embed-text asymmetric RAG pattern unchanged. |
| PersonalMemoryService | carry | Adapt: remove bus wiring, expose as tool interface. |
| RewardService | carry | Carries over largely unchanged. |
| Critic evaluation logic | carry | Strip pub/sub scaffolding, keep scoring policy. |
| ConversationService | carry | Adapt: max_turns bump; append_turn() receives clean text (thinking already separated upstream). |
| ParticipantLogWidget | carry | Main PySide6 window — sidebar + log view. Adapt: title "LoCAL2", memory panel deferred to Phase 2. |
| BusLogger / ThinkingLogger / EventLogger | carry | Adapt BusLogger: replace memory.engram / task.available display logic with response.generation display (answer preview + thinking indicator + tool call count). |
| ParticipantSettingsDialog | carry | Carry over verbatim. Depends on agent.get_settings_schema() / update_settings(). |
| StarRatingWidget | carry | Carry over, not wired until Phase 4 (user.feedback signal). |
| LoCALSession | carry | Adapt: simplify OBSERVE list to 4 subjects; terminate on response.generation (not answer.dialog). Same publisher/subscriber pattern. |
| run_api.py (FastAPI gateway) | carry | Adapt: /query response adds thinking + tool_calls fields; /feedback stubbed to 501 until Phase 4. Moved to src/local/api/api.py. |
| scripts/run_generator.py pattern | carry | Replaced by run_local.py — unified entry point (proxy + generator + optional API + optional UI). |
| OllamaBackend.chat() | carry | Already exists — just not used by the generator today. |
| Web search execution | gone | Gemma 4 native tool calling replaces the GatewayAgent approach entirely. WebSearchService is a fresh implementation backed by a configurable provider (SearXNG / Brave / Tavily). |
| GatewayAgent (as agent) | gone | Gone entirely — no classify step, no LoCAL1 gateway code reused. |
| AnalystAgent | gone | Decomposition and depth classification handled by Gemma natively. |
| SynthesizerAgent | gone | No task aggregation needed in tool-native architecture. |
| task.available / task.result subjects | gone | No task pipeline — tool calls are synchronous within generation. |
- query.received → response.generation, streaming tool loop (max 5 iters)
- run_local.py unified entry point with --api / --headless
- tool.schema
- tool.request.web_search: SearXNG provider
- tool.request.web_fetch: BeautifulSoup extraction
- search_memory tool: semantic search over ChromaDB episodic store via nomic embeddings
- response.generation, auto-ingests every Q&A turn with LLM classification (intent + entities)
- search_memory for recall; MemoryAgent handles writes automatically. No explicit save tool.
- response.generation; calls Prometheus for absolute 1–5 grading
- critique.result (score, feedback, query_id, session_id)
- critique.result; annotates the engram with critic_score
- ollama pull prometheus:7b (Prometheus-7B-v2.0-Q4_K_M)
- critic_agent_tool.py (Gemma-callable)

Deferred to later phase: Low-score retry ("would you like me to try again?"). Correct design is Gemma initiating the retry conversation when it is aware of its own critic score; this requires experimentation with how to surface critique.result back into Gemma's context.

- --respondent-b --model-b --temperature-b CLI flags. No subclass.
- pairwise.result
- pairwise_winner: true/false
- respondent_id carried on every bus message (thinking, response, dialog, critique)

Phase 1 produces the minimum end-to-end system: a user can send a query on the bus, Gemma answers with full thinking surfaced, conversation history accumulates across turns, and Gemma can call web_search when it decides current data is needed.

- services/ollama_backend.py adapted for non-generator callers
- services/conversation_service.py (thinking-clean turns)
- WebSearchTool + WebFetchTool as bus participants
- tool.request.*, waits for tool.result.*
- POST /query returns thinking + tool_calls
- run_local.py: unified entry point with --api and --headless flags

Mirror LoCAL1's src/local/ tree, with the same PYTHONPATH=src convention.

    LoCAL2/
    ├── config/
    │   ├── generator.yaml                    # model, num_ctx, system_prompt, tool_definitions
    │   ├── web_search.yaml                   # max_results, fetch_enabled, synthesis config
    │   └── system.yaml                       # ollama_debug, bus addresses, timeouts
    ├── src/
    │   └── local/
    │       ├── __init__.py
    │       ├── config_loader.py              # carry over verbatim (get_config / ConfigManager)
    │       ├── protocol/
    │       │   ├── __init__.py
    │       │   ├── envelope.py               # carry over (MessageEnvelope, envelope_debug_dict)
    │       │   ├── message_types.py          # carry over (EVENT, RESULT, TASK_REQUEST constants)
    │       │   └── subjects.py               # LoCAL2 subset (6 subjects; see Bus section below)
    │       ├── transport/
    │       │   ├── __init__.py
    │       │   ├── zmq_pubsub.py             # carry over (ZmqPublisher, ZmqSubscriber)
    │       │   └── bus.py                    # carry over (MessageBus wrapper)
    │       ├── runtime/
    │       │   ├── __init__.py
    │       │   └── proxy.py                  # carry over (ZMQ XPUB/XSUB proxy)
    │       ├── agents/
    │       │   ├── __init__.py
    │       │   ├── generator_agent.py        # new: publishes tool.request.*, waits for tool.result.*, publishes response.generation
    │       │   ├── generator_states.py       # IDLE / RECEIVING / GENERATING / DISPATCHING_TOOL / WAITING_FOR_TOOL / PUBLISHING
    │       │   └── generator_transitions.py  # transition table
    │       ├── tools/
    │       │   ├── __init__.py
    │       │   ├── web_search_tool.py        # subscribes tool.request.web_search; configurable provider
    │       │   └── web_fetch_tool.py         # subscribes tool.request.web_fetch; BeautifulSoup extraction
    │       ├── services/
    │       │   ├── __init__.py
    │       │   ├── ollama_backend.py         # adapt for non-generator callers (Critic etc.)
    │       │   └── conversation_service.py   # adapt append_turn() to accept clean text
    │       ├── ui/
    │       │   ├── __init__.py
    │       │   ├── agent_participant_gui.py  # adapt: title "LoCAL2", BusLogger → response.generation, memory panel deferred
    │       │   └── star_rating_widget.py     # carry over; not wired until Phase 4
    │       ├── session/
    │       │   ├── __init__.py
    │       │   └── local_session.py          # adapt: OBSERVE list (4 subjects), terminate on response.generation
    │       └── api/
    │           ├── __init__.py
    │           └── api.py                    # adapt from run_api.py: /query returns thinking+tool_calls; /feedback → 501 until Phase 4
    ├── scripts/
    │   └── run_local.py                      # unified entry point: proxy + generator + --api + --headless flags
    └── tests/
        ├── __init__.py
        ├── test_conversation_service.py
        ├── test_generator_states.py
        └── stories/
            ├── S1_basic_qa.yaml              # single turn, no tools, thinking surfaced
            ├── S2_multi_turn.yaml            # followup resolves without re-stating context
            ├── S3_web_search.yaml            # current-data query → web_search fires, snippets in answer
            └── S4_web_fetch.yaml             # query needing full page → search then fetch two-step

| Subject | Publisher | Subscriber(s) | Notes |
|---|---|---|---|
| query.received | UI / API / test harness | GeneratorAgent | Entry point. Payload: {session_id, query_id, query} |
| tool.request.web_search | GeneratorAgent | WebSearchTool | Payload: {query_id, tool: "web_search", args: {query}} |
| tool.result.web_search | WebSearchTool | GeneratorAgent | Payload: {query_id, result: "...", error: null}. query_id used to correlate. |
| tool.request.web_fetch | GeneratorAgent | WebFetchTool | Payload: {query_id, tool: "web_fetch", args: {url}} |
| tool.result.web_fetch | WebFetchTool | GeneratorAgent | Payload: {query_id, result: "...", error: null} |
| response.generation | GeneratorAgent | UI (Phase 5), Critic (Phase 3), MemoryService (Phase 2) | Full payload: answer, thinking, tool_call_log, session_id |
| answer.dialog | GeneratorAgent | LoCALSession (API), history trackers | Lightweight: {session_id, query, answer} — no thinking tokens |
Subjects user.feedback, reward.event, critique.result defined in subjects.py but not wired until Phase 3/4. Error path: if a tool agent doesn't respond within timeout, GeneratorAgent receives no tool.result.* and appends an error string to the messages array, then resumes generation.
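
The timeout path described above can be sketched as follows. This is a minimal illustration rather than the LoCAL2 implementation: it assumes tool results arrive as payload dicts on an in-process `queue.Queue` standing in for the ZMQ subscriber, and the names `wait_for_tool_result` and `inbox`, as well as the `[tool error]` prefix, are hypothetical. The `[tool unavailable]` string follows the error handling described elsewhere in this plan.

```python
import queue
import time


def wait_for_tool_result(inbox, query_id, timeout_seconds):
    """Return the result string for query_id, or an error string on timeout.

    inbox is a queue.Queue of payload dicts shaped like the tool.result.*
    payloads in the table: {"query_id": ..., "result": ..., "error": ...}.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "[tool unavailable]"
        try:
            payload = inbox.get(timeout=remaining)
        except queue.Empty:
            return "[tool unavailable]"
        if payload.get("query_id") != query_id:
            continue  # stale result from an earlier query; keep waiting
        if payload.get("error"):
            # Assumed convention: pass the error text to the model as the tool result
            return f"[tool error] {payload['error']}"
        return payload.get("result") or ""
```

Returning a string in every case keeps the caller simple: whatever comes back is appended to the messages array as the tool result, and generation resumes.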
GeneratorAgent calls ollama.chat() directly in the tool call loop — it needs access to the raw response["message"] object to append to history. OllamaBackend is kept for non-generator callers (Critic in Phase 3) that just need a text response. Its chat() signature stays largely the same; the tool-call loop lives in GeneratorAgent, not here.
- call(): unchanged (generate endpoint for Critic etc.)
- chat(): returns bare text string for non-generator callers
- _make_client(): unchanged
- "" on failure
- ollama.chat() directly to keep response["message"] accessible
- GenerationResponse dataclass: not needed; thinking is read from response["message"] directly in GeneratorAgent

When think=True, Ollama separates thinking from the answer. Access pattern in GeneratorAgent:

    response = ollama.chat(model=..., messages=..., tools=..., think=True)
    answer = (response["message"].get("content") or "").strip()
    # In the ollama Python client, thinking is a field of the message,
    # not of the top-level chat response
    thinking = response["message"].get("thinking") or ""
    # response["message"] is the raw message; append it directly to history

Regex fallback: if thinking is empty but answer contains <|think|> tags, strip them and move to thinking. Add only if observed in practice.
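
If the fallback turns out to be needed, it could look roughly like this. The sketch is an assumption-heavy illustration: the opening tag follows the text above, but the closing tag `<|/think|>` is a guess and should be replaced with whatever is actually observed leaking through; `split_thinking` is a hypothetical helper name.

```python
import re

# Assumed tag pair: the opening tag follows the text above; the closing tag
# is a guess and must be adjusted to the tags observed in practice.
OPEN_TAG = "<|think|>"
CLOSE_TAG = "<|/think|>"

_THINK_RE = re.compile(
    re.escape(OPEN_TAG) + r"(.*?)" + re.escape(CLOSE_TAG),
    flags=re.DOTALL,
)


def split_thinking(answer, thinking=""):
    """Move inline think blocks out of answer when thinking came back empty."""
    if thinking or OPEN_TAG not in answer:
        return answer, thinking  # nothing leaked; leave both untouched
    blocks = _THINK_RE.findall(answer)
    clean = _THINK_RE.sub("", answer).strip()
    return clean, "\n".join(block.strip() for block in blocks)
```

The function is a no-op whenever Ollama has already separated the thinking, so it is safe to leave in the call path once added.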
LoCAL1's API is correct for LoCAL2. The key constraint: append_turn(session_id, user, assistant) must receive the clean answer text as assistant, never the raw content with thinking tokens. Because thinking is already separated upstream (Ollama returns it in a separate field when think=True), no stripping logic is needed here.
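
The append_turn contract above can be sketched as a small in-memory store. This is only an illustration under stated assumptions: the limits of 50 turns and 50 sessions follow the values given in this section, while evicting the least recently used session once _MAX_SESSIONS is exceeded is an assumed policy, not confirmed LoCAL1 behavior.

```python
from collections import OrderedDict

_MAX_TURNS_PER_SESSION = 50
_MAX_SESSIONS = 50


class ConversationService:
    """Per-session history of clean user/assistant turns (no thinking text)."""

    def __init__(self):
        self._sessions = OrderedDict()  # session_id -> list of message dicts

    def append_turn(self, session_id, user, assistant):
        history = self._sessions.setdefault(session_id, [])
        self._sessions.move_to_end(session_id)  # mark as most recently used
        history.append({"role": "user", "content": user})
        history.append({"role": "assistant", "content": assistant})
        # Each turn is two messages; keep only the newest turns.
        del history[: max(0, len(history) - 2 * _MAX_TURNS_PER_SESSION)]
        # Assumed policy: evict the least recently used session when full.
        while len(self._sessions) > _MAX_SESSIONS:
            self._sessions.popitem(last=False)

    def get_history(self, session_id):
        # Return a copy so callers can extend it without mutating the store.
        return list(self._sessions.get(session_id, []))
```

Because get_history returns a plain list of role/content dicts, the generator can splice it straight into the messages array between the system message and the new user turn.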
- append_turn(): same signature
- result.text, not result.text + result.thinking_text
- _MAX_TURNS_PER_SESSION from 20 → 50 (larger context window)
- _MAX_SESSIONS = 50
- get_history(session_id) returns list of message dicts
- [{"role":"user","content":"..."},{"role":"assistant","content":"..."}]
- _MAX_SESSIONS exceeded

Gemma 4 natively distinguishes search (find sources) from fetch (extract page content). These are separate tools and separate bus participants. Gemma orchestrates the search→fetch two-step when it needs full page content; LoCAL2 does not apply a fetch-on-thin-snippets heuristic. No LoCAL1 gateway code is reused. There is no synthesis LLM call: Gemma synthesizes the answer from raw results in its generation turn.
Subscribes to tool.request.web_search. Executes search via configured provider. Publishes tool.result.web_search.
Output format:
Today's date: 2026-05-30
[1] Title: ...
Snippet: ...
URL: https://...
[2] ...
Date prefix grounds time-sensitive answers. Up to max_results from config. URLs included so Gemma can call web_fetch on a specific one.
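
The output format above could be produced by a small formatter along these lines. This is a sketch, not the project's code: `format_search_results` is a hypothetical name, and the input shape (dicts with `title`, `snippet` and `url` keys) is an assumption, since each provider returns its own structure and would need an adapter.

```python
from datetime import date


def format_search_results(results, max_results=5, today=None):
    """Render provider results as the numbered text block shown above.

    results is a list of dicts with "title", "snippet" and "url" keys
    (an assumed shape; real providers each need their own adapter).
    """
    today = today or date.today()
    lines = [f"Today's date: {today.isoformat()}"]
    for i, item in enumerate(results[:max_results], start=1):
        lines.append(f"[{i}] Title: {item.get('title', '')}")
        lines.append(f"Snippet: {item.get('snippet', '')}")
        lines.append(f"URL: {item.get('url', '')}")
    return "\n".join(lines)
```

Passing `today` explicitly keeps the formatter deterministic in tests; in production it falls back to the current date.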
Subscribes to tool.request.web_fetch. Fetches the URL, extracts body text (BeautifulSoup), truncates to fetch_max_chars. Publishes tool.result.web_fetch.
Output format:
URL: https://...
Extracted text: [body text, truncated to fetch_max_chars]
On fetch failure (timeout, 4xx/5xx), publishes tool.result.web_fetch with error field set. Gemma reads the error and decides whether to try another URL.
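
The extract-and-truncate step can be illustrated without any third-party dependency. Note the swap: the plan calls for BeautifulSoup, whereas this sketch uses the standard library's `html.parser` purely so that it runs anywhere; `extract_page_text` and `_TextExtractor` are hypothetical names, and the 2000-character default mirrors fetch_max_chars in the config.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_page_text(url, html, fetch_max_chars=2000):
    """Return the tool result text for an already-downloaded HTML page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = " ".join(parser.chunks)[:fetch_max_chars]
    return f"URL: {url}\nExtracted text: {text}"
```

Keeping the HTTP fetch out of this function makes the unit test with a local HTML fixture straightforward.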
| Provider | Config key | Notes |
|---|---|---|
| SearXNG | provider: searxng | Self-hosted, free. Default for dev. |
| Brave Search | provider: brave | API key via env BRAVE_API_KEY. |
| Tavily | provider: tavily | API key via env TAVILY_API_KEY. Returns extracted content — Gemma may not need to call web_fetch separately. |
| Mock | provider: mock | Returns canned results. Use in tests to avoid live HTTP. |
- tool.result.web_search with error: "no results"
- tool.result.web_fetch with error: "fetch failed: 404"
- "[tool unavailable]" as the tool result, resumes generation

The LoCAL1 PySide6 UI structure carries over almost verbatim. One file needs meaningful adaptation (agent_participant_gui.py); the rest are carry-overs with minor changes.

| File | Action | Key change |
|---|---|---|
| agent_participant_gui.py | adapt | Window title "LoCAL2". BusLogger.log(): replace memory.engram / task.available subject-specific blocks with response.generation display (answer preview, thinking indicator, tool call count). Disable episode & memory panel in Phase 1 (no ChromaDB yet) — show greyed-out "Memory (Phase 2)" tooltip. |
| star_rating_widget.py | carry | No changes. Not wired in Phase 1 — used in Phase 4 for user.feedback. |
| scripts/run_generator.py | adapt | Subscribe to query.received (not request.generation). Wire GenerationResponse.thinking_text to ThinkingLogger directly (no separate thinking_callback attribute — thinking comes back in the response object). Remove status_callback / THINKING_TRACE bus publish for Phase 1. |
| scripts/run_proxy.py | carry | Carry over LoCAL1's run_proxy.py verbatim. ZMQ XPUB/XSUB proxy is unchanged. |
LoCAL1's BusLogger.log() has subject-specific blocks for memory.engram and task.available. Replace those with a block for response.generation:

    elif envelope.subject == "response.generation":
        answer = (raw_payload.get("answer") or "")[:80].replace("\n", " ")
        thinking = raw_payload.get("thinking") or ""
        tool_calls = raw_payload.get("tool_calls") or []
        think_ind = "◈ thinking" if thinking else ""
        tool_ind = f"⚙ {len(tool_calls)} tool call(s)" if tool_calls else ""
        indicators = " ".join(x for x in [think_ind, tool_ind] if x)
        generation_line = f"\n Answer: {answer}{'…' if len(answer)==80 else ''}"
        if indicators:
            generation_line += f"\n Flags: {indicators}"

In Phase 1 there is no ChromaDB, so the ◈ Memory button should be disabled with tooltip "Memory (Phase 2)". The episode panel (≡) should also be disabled — "Episodes (Phase 2)". This avoids import errors and keeps the UI runnable before memory is wired. Re-enable both in Phase 2 when EngramService is available.
The core agent. Subscribes to query.received, runs the tool call loop, publishes response.generation.

    def _generate(self, session_id, query):
        history = self._conv.get_history(session_id) or []
        messages = [SYSTEM_MSG] + history + [{"role": "user", "content": query}]
        tool_call_log = []
        options = {"num_ctx": cfg["num_ctx"]}
        iterations = 0
        while iterations < cfg.get("max_tool_iterations", 5):
            iterations += 1
            response = ollama.chat(
                model=self._model,
                messages=messages,
                tools=TOOL_SCHEMAS,
                think=True,
                options=options,
            )
            # Always append the raw assistant message to history
            messages.append(response["message"])
            tool_calls = response["message"].get("tool_calls") or []
            if not tool_calls:
                break  # final text answer, no more tool calls
            for tc in tool_calls:
                name = tc["function"]["name"]
                args = tc["function"]["arguments"]  # already a parsed dict
                tool_result = self._execute_tool(name, args)
                tool_call_log.append({"tool": name, "args": args, "result": tool_result})
                messages.append({
                    "role": "tool",
                    "content": str(tool_result),
                    "name": name,  # required by Ollama tool protocol
                })
        answer = (response["message"].get("content") or "").strip()
        # thinking is a field of the message, not of the top-level response
        thinking = response["message"].get("thinking") or ""
        self._conv.append_turn(session_id, query, answer)
        return answer, thinking, tool_call_log

response["message"] is appended directly — never construct the assistant dict manually. Tool result requires the name field. arguments is already a parsed dict; unpack with **args.
Tool definition (YAML):

    tools:
      - name: web_search
        description: >
          Search the web for current information.
          Use when the query requires up-to-date
          facts, news, prices, or recent events.
        parameters:
          type: object
          properties:
            query:
              type: string
              description: Search query string
          required: [query]

response.generation payload:

    {
      "session_id": "abc123",
      "query": "what's the weather?",
      "answer": "...",
      "thinking": "...",
      "tool_calls": [
        {
          "tool": "web_search",
          "args": {"query": "..."},
          "result": "..."
        }
      ],
      "generator_id": "GeneratorAgent"
    }

- query.received
- session_id and query from envelope.payload
- _generate()
- response.generation and answer.dialog

States:

- IDLE: waiting for query
- RECEIVING: envelope received, history loaded
- GENERATING: ollama.chat() call in flight
- DISPATCHING_TOOL: publishing tool.request.* to bus
- WAITING_FOR_TOOL: blocking on tool.result.* envelope
- PUBLISHING: writing response.generation to bus

Transition events:

- RECEIVE_QUERY: query.received arrives
- START_GENERATION: begin ollama.chat() call
- DISPATCH_TOOL: tool_calls in response, publish request
- TOOL_RESULT: tool.result.* received, append to messages
- TOOL_TIMEOUT: no result within timeout, append error
- FINISH: final text response, no tool calls
- FAIL: exception → IDLE

While in WAITING_FOR_TOOL, incoming query.received envelopes are ignored (not queued; there is a single active session in Phase 1). The GeneratorAgent's message loop must subscribe to both query.received and tool.result.* subjects and dispatch based on subject + current state.
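
One way to encode the states and events listed above is a lookup table keyed by (state, event). The sketch below is illustrative only. The source and target of each transition are inferred from the descriptions, and two events, `REQUEST_PUBLISHED` and `PUBLISHED`, are hypothetical additions needed to leave DISPATCHING_TOOL and PUBLISHING; they do not appear in the plan.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RECEIVING = auto()
    GENERATING = auto()
    DISPATCHING_TOOL = auto()
    WAITING_FOR_TOOL = auto()
    PUBLISHING = auto()


class Event(Enum):
    RECEIVE_QUERY = auto()
    START_GENERATION = auto()
    DISPATCH_TOOL = auto()
    REQUEST_PUBLISHED = auto()  # hypothetical: tool.request.* is on the bus
    TOOL_RESULT = auto()
    TOOL_TIMEOUT = auto()
    FINISH = auto()
    PUBLISHED = auto()  # hypothetical: response.generation is on the bus
    FAIL = auto()


# (state, event) -> next state. Pairs are assumptions inferred from the
# descriptions above, not the project's actual transition table.
TRANSITIONS = {
    (State.IDLE, Event.RECEIVE_QUERY): State.RECEIVING,
    (State.RECEIVING, Event.START_GENERATION): State.GENERATING,
    (State.GENERATING, Event.DISPATCH_TOOL): State.DISPATCHING_TOOL,
    (State.DISPATCHING_TOOL, Event.REQUEST_PUBLISHED): State.WAITING_FOR_TOOL,
    (State.WAITING_FOR_TOOL, Event.TOOL_RESULT): State.GENERATING,
    (State.WAITING_FOR_TOOL, Event.TOOL_TIMEOUT): State.GENERATING,
    (State.GENERATING, Event.FINISH): State.PUBLISHING,
    (State.PUBLISHING, Event.PUBLISHED): State.IDLE,
}


def next_state(state, event):
    if event is Event.FAIL:
        return State.IDLE  # any exception resets the agent
    # Unknown pairs leave the state unchanged, so a query.received that
    # arrives while WAITING_FOR_TOOL is ignored rather than queued.
    return TRANSITIONS.get((state, event), state)
```

Treating unknown pairs as no-ops gives the "ignored, not queued" behaviour for free; a stricter variant could raise instead to surface unexpected envelopes during development.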
config/generator.yaml

    agent_id: GeneratorAgent
    model: gemma4:12b
    num_ctx: 32000
    think: true
    max_tool_iterations: 5
    timeout_seconds: 120
    system_prompt: |
      You are a helpful assistant. ...
    tools:
      - type: function
        function:
          name: web_search
          description: >
            Search the web to find sources on a topic.
            Returns snippets and URLs. Use when you
            need to find information you don't have.
          parameters:
            type: object
            properties:
              query: {type: string}
            required: [query]
      - type: function
        function:
          name: web_fetch
          description: >
            Fetch the full text content of a specific
            URL. Use after web_search when you need
            more detail than the snippet provides.
          parameters:
            type: object
            properties:
              url: {type: string}
            required: [url]

config/web_search.yaml

    provider: searxng        # searxng | brave | tavily | mock
    max_results: 5
    fetch_enabled: true      # fetch thin snippets; skip for tavily
    fetch_max_chars: 2000
    timeout_seconds: 15
    # Provider-specific
    searxng_url: http://localhost:8888
    # brave_api_key: set via env BRAVE_API_KEY
    # tavily_api_key: set via env TAVILY_API_KEY

config/system.yaml

    bus_pub_address: tcp://127.0.0.1:5555
    bus_sub_address: tcp://127.0.0.1:5556
    ollama_debug: false
    # Tests: set provider: mock in web_search.yaml
    # to return canned results without HTTP calls

Stories are the acceptance criteria. Phase 1 is not done until all four (S1–S4) pass against the live stack.
| Field | Value |
|---|---|
| turn 1 query | "What is the capital of France?" |
| expected subject | response.generation published |
| answer assertion | contains "Paris" |
| payload assertion | thinking field is non-empty string |
| payload assertion | tool_calls is empty list |
| Turn | Query | Assertion |
|---|---|---|
| 1 | "Tell me about the Python GIL." | answer mentions Global Interpreter Lock |
| 2 | "Why was it introduced?" | answer explains GIL's origin without re-asking what "it" is |
Validates that conversation history is correctly passed in the messages array and Gemma resolves the pronoun natively.
| Field | Value |
|---|---|
| turn 1 query | "What major AI news broke this week?" |
| payload assertion | tool_calls has at least one entry with tool == "web_search" |
| bus assertion | tool.request.web_search and tool.result.web_search both appear in event log |
| answer assertion | answer references a specific recent event (not hallucinated) |
| note | Model-size dependent — gemma4:e2b may not reliably call tools. Run with 12b+ for this story. |
| Field | Value |
|---|---|
| turn 1 query | "Summarize the content of the top result for: LoCAL2 Gemma tool calling" |
| payload assertion | tool_calls has entries for both web_search and web_fetch |
| bus assertion | four subjects appear in order: tool.request.web_search, tool.result.web_search, tool.request.web_fetch, tool.result.web_fetch |
| answer assertion | answer contains a summary drawn from page content, not just snippets |
| note | Gemma must decide independently to call web_fetch — no prompt engineering to force it. |
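
The "four subjects appear in order" bus assertion can be checked with a short helper over the event log. This is a sketch under assumptions: `subjects_in_order` is a hypothetical name, and the event log is taken to be a list of dicts with a `subject` key, as in the /query response shown in this plan.

```python
def subjects_in_order(event_log, expected):
    """True if expected subjects occur in event_log in that relative order."""
    subjects = iter(entry["subject"] for entry in event_log)
    # `in` on an iterator consumes it up to the match, so each expected
    # subject must be found after the previous one.
    return all(subject in subjects for subject in expected)


S4_SUBJECTS = [
    "tool.request.web_search",
    "tool.result.web_search",
    "tool.request.web_fetch",
    "tool.result.web_fetch",
]
```

Other subjects (for example query.received or response.generation) may appear between the expected ones without failing the check, which matches what a real event log looks like.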
gemma4:e2b (2B) is unreliable for tool calling. S3 requires at least gemma4:12b. If local VRAM is insufficient, fall back to the Google GenAI API (gemma-4-27b-it via client.chats.create()). Make the backend configurable via generator.yaml: backend: ollama | google_genai.
If Gemma keeps calling web_search in a loop, the agent hangs. Add a max_tool_iterations: 5 guard in the loop — break and use whatever text is available after the limit. Log a warning when the guard fires.
Some Ollama versions return thinking inline in message.content rather than in response.thinking. Add a one-pass regex strip in OllamaBackend.chat() as a fallback — only activates if tags are detected in the text field.
Phase 1 has no UI — the test harness publishes query.received. session_id should be a required field in the payload. If missing, generate a UUID per query (stateless single-turn mode). This is a graceful degradation, not a hard failure.
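
The graceful degradation for a missing session_id might be handled at the point where the payload is unpacked. A minimal sketch, assuming the payload is a plain dict; `resolve_session_id` is a hypothetical helper name.

```python
import uuid


def resolve_session_id(payload):
    """Return (session_id, stateless) for a query.received payload."""
    session_id = payload.get("session_id")
    if session_id:
        return session_id, False
    # Missing or empty: a fresh UUID gives this query its own empty history,
    # which amounts to stateless single-turn mode.
    return str(uuid.uuid4()), True
```

Returning the stateless flag alongside the ID lets the caller log the fallback without treating it as an error.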
run_local.py is the single command to start the whole stack. With --api it also starts a FastAPI server so queries can be submitted over HTTP — useful for testing without the UI and for automated story verification.

    # Default: proxy + generator + PySide6 UI
    python scripts/run_local.py

    # Add API server on port 8765 (alongside UI)
    python scripts/run_local.py --api

    # Headless (no UI) + API, useful for CI / story runner
    python scripts/run_local.py --headless --api

    # Override model
    python scripts/run_local.py --model gemma4:27b --api

    # Custom API port
    python scripts/run_local.py --api --api-port 9000

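
The flags used in these commands could be parsed with argparse roughly as follows. This is a sketch, not the actual run_local.py: `build_parser` is a hypothetical name, the default port of 8765 is taken from the example above, and treating an omitted --model as "use the value from generator.yaml" is an assumption.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="run_local.py")
    parser.add_argument("--api", action="store_true",
                        help="start the FastAPI server")
    parser.add_argument("--api-port", type=int, default=8765,
                        help="port for the API server")
    parser.add_argument("--headless", action="store_true",
                        help="run without the PySide6 UI")
    # Assumption: None means "keep the model configured in generator.yaml"
    parser.add_argument("--model", default=None,
                        help="override the configured model")
    return parser
```

For example, `build_parser().parse_args(["--headless", "--api"])` yields a namespace with `headless` and `api` set to True and `api_port` left at its default.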
- --api: start uvicorn (FastAPI) in a daemon thread on --api-port
- Without --headless: create QApplication, build ParticipantLogWidget, call app.exec() (blocks main thread)
- With --headless: block on threading.Event().wait() until Ctrl-C

In session/local_session.py:

    OBSERVE = [
        QUERY_RECEIVED,
        RESPONSE_GENERATION,
        ANSWER_DIALOG,
        CRITIQUE,  # Phase 3: observed but ignored until then
    ]
    _TRAIL_SECONDS = 2.0  # reduced from 3s
    # stream() terminates on RESPONSE_GENERATION
    # (not ANSWER_DIALOG as in LoCAL1) then waits
    # the trailing window for critique.result

The session publishes query.received and filters received events by correlation_id. Same pattern as LoCAL1 — only the subjects and terminal condition change.
| Method | Path | Notes |
|---|---|---|
| GET | /health | Carry over verbatim |
| POST | /query | Adapt: response adds thinking + tool_calls |
| POST | /feedback | Returns 501 until Phase 4 |
# POST /query response
{
"query_id": "...",
"session_id": "...",
"answer": "...",
"thinking": "...", # ← new
"tool_calls": [...], # ← new
"events": ["query.received", "response.generation"],
"event_log": [{"subject": "...", "t": 0.0}]
}
Stories (S1–S4) run against the live stack via the API: start run_local.py --headless --api, then POST each turn to /query with the same session_id and assert on the response payload. This is simpler and more reliable than subscribing to the bus directly from the test runner.
Each step should be runnable/testable before moving to the next. Phase 1a completes at step 5; Phase 1b completes at step 8.
PHASE 1a

1. Create directory tree. Carry over config_loader, envelope, zmq_pubsub, proxy, bus. Write LoCAL2 subjects.py (all Phase 1 subjects including tool.request.* / tool.result.*). Verify: python -c "from local.protocol.subjects import QUERY_RECEIVED" runs.
2. Carry over and adapt. Write test_conversation_service.py. Verify that a direct ollama.chat() call with think=True returns thinking via response["message"].get("thinking").
3. Write generator_states.py and generator_transitions.py, covering all states including DISPATCHING_TOOL and WAITING_FOR_TOOL. Write test_generator_states.py.
4. Carry over and adapt UI, session, api.py. Write run_local.py. Verify window launches, curl /health returns ok.
5. Wire GeneratorAgent without tool schemas. POST /query returns answer + thinking. Stories S1 (basic Q&A, thinking surfaced) and S2 (multi-turn pronoun resolution) pass. Phase 1a complete.

PHASE 1b

6. Implement web_search_tool.py with configurable provider. Set provider: mock for unit test; verify tool.result.web_search is published with correct payload and correlation ID.
7. Implement web_fetch_tool.py with BeautifulSoup extraction. Unit test with a local HTML fixture; verify extracted text is truncated to fetch_max_chars.
8. Add tool schemas to GeneratorAgent, wire DISPATCHING_TOOL → WAITING_FOR_TOOL transitions, add timeout handling. Start full stack with run_local.py --headless --api. Stories S3 (web_search fires) and S4 (search→fetch two-step) pass. Phase 1b complete.