Long sessions accumulate messages until the silent 20-turn truncation kicks in, losing early context. This phase adds a visible context gauge in the sidebar and a manual compaction action. Pressing the gauge asks Gemma to summarize the session so far, replaces the message history with a compact summary plus a verbatim tail, and reports tokens freed.
Ollama's streaming API delivers token counts on the final chunk of each iteration via
chunk.prompt_eval_count (tokens sent to the model) and chunk.eval_count
(tokens generated). We capture the last iteration's prompt_eval_count,
which reflects the true context size for that turn, including all tool-call results.
```python
def _generate(...) -> tuple[str, str, list[dict], int]:
    # existing loop, plus:
    last_chunk = None
    for chunk in ollama.chat(..., stream=True):
        last_chunk = chunk
        # ... existing handling unchanged ...
    # Counters arrive on the final chunk; fall back to 0 if missing or None.
    prompt_tokens = getattr(last_chunk, "prompt_eval_count", 0) or 0
    # return added as 4th element
    return answer, thinking, tool_call_log, prompt_tokens
```
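The capture logic can be tried without a running Ollama server. The sketch below is a minimal illustration, assuming chunks expose the counters as attributes (as the getattr call above implies). The helper name `last_prompt_tokens` and the fake stream are hypothetical and exist only for this example.

```python
from types import SimpleNamespace
from typing import Iterable


def last_prompt_tokens(stream: Iterable[object]) -> int:
    """Return prompt_eval_count from the final chunk of a stream, or 0."""
    last_chunk = None
    for chunk in stream:
        last_chunk = chunk
    # A missing attribute, a None value, or an empty stream all yield 0.
    return getattr(last_chunk, "prompt_eval_count", 0) or 0


# Hypothetical stand-in for a streamed response: only the final chunk
# carries the token counters.
fake_stream = [
    SimpleNamespace(done=False),
    SimpleNamespace(done=False),
    SimpleNamespace(done=True, prompt_eval_count=47200, eval_count=312),
]

tokens = last_prompt_tokens(fake_stream)
empty = last_prompt_tokens([])
```

Keeping only a reference to the last chunk avoids buffering the whole stream just to read one counter.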
```python
# New field in session entry (default 0 on migration / new sessions)
"token_count": int

# New methods:
def set_token_count(self, session_id: str | None, count: int) -> None
def get_token_count(self, session_id: str | None) -> int
```
After generation, RespondentA stores the count:
```python
self._conv.set_token_count(session_id, prompt_tokens)
```
token_count is included in the RESPONSE_GENERATION payload so the UI
can update the gauge without polling.
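One way the token-count accessors might behave is sketched below. This is not the actual ConversationService: `TokenCountStore`, its in-memory dict, and the `"default"` key used when session_id is None are all assumptions made for illustration. The part worth noting is the default of 0 for entries that predate the new field.

```python
from __future__ import annotations


class TokenCountStore:
    """Hypothetical stand-in for the token_count part of ConversationService."""

    DEFAULT_SESSION = "default"  # assumption: how a None session_id is resolved

    def __init__(self) -> None:
        self._sessions: dict[str, dict] = {}

    def _entry(self, session_id: str | None) -> dict:
        key = session_id or self.DEFAULT_SESSION
        entry = self._sessions.setdefault(key, {"messages": []})
        # Migration: entries created before this phase have no token_count.
        entry.setdefault("token_count", 0)
        return entry

    def set_token_count(self, session_id: str | None, count: int) -> None:
        # Clamp to zero so a bad value can never produce a negative gauge.
        self._entry(session_id)["token_count"] = max(0, int(count))

    def get_token_count(self, session_id: str | None) -> int:
        return self._entry(session_id)["token_count"]


store = TokenCountStore()
store.set_token_count("s1", 47200)
store._sessions["legacy"] = {"messages": []}  # simulates a pre-migration entry
```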
A small arc gauge sits in the sidebar between the 💬 (conversations) and ⟳ (new session) buttons.
It is a custom QWidget subclass named ContextGauge.
```python
class ContextGauge(QWidget):
    compact_requested = Signal()  # emitted on click

    def __init__(self, num_ctx: int, parent=None): ...
    def set_tokens(self, count: int): ...    # called from MainWindow on RESPONSE_GENERATION
    def paintEvent(self, event): ...         # draws arc + center text
    def mousePressEvent(self, event): ...    # emits compact_requested
```
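Arc drawing: QPainter.drawArc() with startAngle = 90 * 16 and spanAngle = fill * 360 * 16
(Qt measures angles in 1/16th-degree units, with zero at 3 o'clock and positive values running counter-clockwise). The arc therefore sweeps counter-clockwise from 12 o'clock.
Center shows a compact token count (e.g. 47K).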
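The arc arithmetic and the center label can be separated from Qt and checked on their own. The following is a sketch under stated assumptions: the helper names are hypothetical, the fill is clamped to the range 0 to 1, and the label rounds to the nearest thousand (the source only gives 47K as an example, so the rounding rule is a guess).

```python
FULL_CIRCLE = 360 * 16  # Qt angles are expressed in 1/16th of a degree
START_ANGLE = 90 * 16   # 12 o'clock; Qt's zero angle points at 3 o'clock


def fill_fraction(tokens: int, num_ctx: int) -> float:
    """Fraction of the context window in use, clamped to [0, 1]."""
    if num_ctx <= 0:
        return 0.0
    return min(max(tokens / num_ctx, 0.0), 1.0)


def arc_span(tokens: int, num_ctx: int) -> int:
    """Span angle for drawArc(); positive spans run counter-clockwise."""
    return round(fill_fraction(tokens, num_ctx) * FULL_CIRCLE)


def compact_label(tokens: int) -> str:
    """Short center text: raw count below 1000, otherwise thousands with K."""
    if tokens < 1000:
        return str(tokens)
    return f"{round(tokens / 1000)}K"


half_span = arc_span(64_000, 128_000)
label = compact_label(47_200)
```

Inside paintEvent these values would be passed as `painter.drawArc(rect, START_ANGLE, arc_span(tokens, num_ctx))`, where `rect` is whatever bounding rectangle the widget uses.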
Color thresholds:
| Fill | Color |
|---|---|
| < 60% | #7ec8a4 (green) |
| 60–85% | #c8a47e (amber) |
| > 85% | #c87e7e (red) |
Tooltip: "47,200 / 128,000 tokens — click to compact"
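The thresholds and tooltip format above translate into two small pure functions. This is a sketch with hypothetical function names; it reads the table as treating exactly 60% and exactly 85% as amber.

```python
def gauge_color(fill: float) -> str:
    """Map a fill fraction (0.0 to 1.0) to the hex color from the table."""
    if fill < 0.60:
        return "#7ec8a4"  # green
    if fill <= 0.85:
        return "#c8a47e"  # amber
    return "#c87e7e"      # red


def gauge_tooltip(tokens: int, num_ctx: int) -> str:
    """Tooltip text with thousands separators, matching the example above."""
    return f"{tokens:,} / {num_ctx:,} tokens — click to compact"


tip = gauge_tooltip(47_200, 128_000)
```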
Compaction is routed over the bus so GeneratorAgent owns all Ollama calls — consistent with the architecture invariant.
```python
COMPACTION_REQUEST = "compaction.request"
COMPACTION_RESULT = "compaction.result"
```
```python
COMPACTION_SYSTEM = (
    "You are a conversation summarizer. "
    "Produce a concise factual summary of the conversation below. "
    "Capture: the user's goals, key facts established, decisions made, "
    "open questions, and any user preferences or constraints stated. "
    "Be terse. Omit pleasantries and filler. "
    "Output only the summary — no preamble, no 'Here is a summary:'."
)
```
The call is a separate ollama.chat() with stream=False and no tools —
a one-shot summarization that does not touch the live conversation history until it succeeds.
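A possible shape for the one-shot call is sketched below. The chat function is injected so the example runs without Ollama; in the agent it would be `ollama.chat`. Several details are assumptions rather than the actual design: the transcript is flattened into a single user message, the response is read with dict-style access, and the names `render_transcript`, `summarize`, and `fake_chat` are hypothetical.

```python
from typing import Callable


def render_transcript(messages: list[dict]) -> str:
    """Flatten a message list into 'role: content' lines (assumed format)."""
    lines = []
    for msg in messages:
        content = msg.get("content") or ""
        if content:
            lines.append(f"{msg['role']}: {content}")
    return "\n".join(lines)


def summarize(chat: Callable[..., dict], model: str,
              system_prompt: str, messages: list[dict]) -> str:
    """Run a non-streaming, tool-free summarization; never mutates messages."""
    response = chat(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": render_transcript(messages)},
        ],
        stream=False,
    )
    summary = (response["message"]["content"] or "").strip()
    if not summary:
        # Raising keeps the caller from replacing history with nothing.
        raise ValueError("empty summary; history left untouched")
    return summary


def fake_chat(**kwargs) -> dict:
    """Test double that records its arguments and returns a canned reply."""
    fake_chat.last_call = kwargs
    return {"message": {"content": "  User is building a gauge.  "}}


history = [
    {"role": "user", "content": "Add a gauge."},
    {"role": "assistant", "content": "Done."},
]
summary = summarize(fake_chat, "example-model",
                    "You are a conversation summarizer.", history)
```

Because `summarize` only reads the history and raises on an empty result, the caller can defer the history replacement until a usable summary exists.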
```python
def replace_messages(self, session_id: str | None, messages: list[dict]) -> None:
    """Atomically replace the message list for a session (used by compaction)."""
```
Config key: compaction_tail_turns: 4 in config/generator.yaml.
The number of messages kept verbatim after the summary is compaction_tail_turns * 2 (one user and one assistant message per turn), so the default keeps 8.
Tool call messages in the tail are included as-is.
```text
[
  {"role": "assistant", "content": "[SUMMARY] The user is building LoCAL2 ..."},
  {"role": "user", "content": "...turn N-3 user..."},
  {"role": "assistant", "content": "...turn N-3 assistant..."},
  ... (tail_turns pairs)
]
```
The summary uses role: "assistant" because Gemma expects alternating
user/assistant turns. A system-role summary would work too, but assistant is safer across model versions.
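Assembling the compacted list could look like the sketch below. It is one interpretation, not the confirmed implementation: rather than slicing a fixed number of messages, it walks back to the start of the last `tail_turns` user messages, so that tool messages inside those turns come along as-is and the tail always begins with a user message. The function name is hypothetical.

```python
def build_compacted_history(messages: list[dict], summary: str,
                            tail_turns: int = 4) -> list[dict]:
    """Return [summary message] + the last tail_turns turns, verbatim."""
    start = len(messages)  # default: empty tail
    if tail_turns > 0:
        users_seen = 0
        # Walk backwards until tail_turns user messages have been passed.
        for i in range(len(messages) - 1, -1, -1):
            if messages[i]["role"] == "user":
                users_seen += 1
                start = i
                if users_seen == tail_turns:
                    break
    head = {"role": "assistant", "content": f"[SUMMARY] {summary}"}
    return [head, *messages[start:]]


# Six plain turns: q1/a1 ... q6/a6.
history = []
for n in range(1, 7):
    history.append({"role": "user", "content": f"q{n}"})
    history.append({"role": "assistant", "content": f"a{n}"})

compacted = build_compacted_history(history, "The user asked six questions.")
```

With the default of 4, the example keeps q3 through a6 after the summary. A new list is returned, so the original history stays intact until it is explicitly replaced.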
tokens_before = last stored token_count from ConversationService.
tokens_estimated_after = character count of new message list ÷ 4 (heuristic, shown as ~).
Log marker: ── compacted: 82,400 → ~14,200 tokens (68,200 freed) ──
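The estimate and the marker string can be computed as in this sketch. The function names are hypothetical, and only the `content` field is counted toward the character total, which is an assumption about what the heuristic covers.

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: total content characters divided by 4."""
    chars = sum(len(m.get("content") or "") for m in messages)
    return chars // 4


def compaction_marker(tokens_before: int, tokens_after: int) -> str:
    """Format the separator line shown in the chat log."""
    freed = max(tokens_before - tokens_after, 0)
    return (f"── compacted: {tokens_before:,} → ~{tokens_after:,} tokens "
            f"({freed:,} freed) ──")


marker = compaction_marker(82_400, 14_200)
estimate = estimate_tokens([{"content": "a" * 400}, {"content": "b" * 40}])
```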
The compaction marker is appended to the main chat log using the same
append_log() method used for session rejoins — a centred, dimmed separator line
with the token delta.
── compacted: 82,400 → ~14,200 tokens (68,200 freed) ──
The gauge fill drops visually to the estimated post-compaction level immediately. After the next generation turn the gauge snaps to the exact Ollama-reported count.
| File | Change |
|---|---|
| src/local/protocol/subjects.py | Add COMPACTION_REQUEST, COMPACTION_RESULT |
| src/local/services/conversation_service.py | token_count field; set/get_token_count(); replace_messages() |
| src/local/agents/generator_agent.py | Capture prompt_eval_count; subscribe to COMPACTION_REQUEST; _handle_compaction(); remove remaining debug print (line 134) |
| src/local/ui/main_window.py | ContextGauge widget; sidebar wiring; _on_compaction_result() |
| config/generator.yaml | Add compaction_tail_turns: 4 |
| tests/test_conversation_service.py | Tests for set/get_token_count, replace_messages |
| tests/test_generator_agent.py | Test prompt_eval_count stored after generation |