Here's my review of the cost/token tracking changes.

---

### expert_build/llm.py:_cost_tracker + reset/get/format/record helpers
VERDICT: CONCERN
CORRECTNESS: QUESTIONABLE
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: `get_cost_summary()` (line 39) returns a shallow copy via `dict(_cost_tracker)`. The nested `by_model` dict is shared with the internal tracker, so a caller doing `get_cost_summary()["by_model"]["claude"]["calls"] = 0` would corrupt the tracker. All current callers are read-only so this won't bite today, but the function's name implies a safe snapshot. Use `copy.deepcopy` or manually copy the nested dict. Minor issue, not a blocker.

---

### expert_build/llm.py:_parse_cli_json
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Correctly dispatches on `model.startswith("gemini")` vs Claude default. The `data.get("result") or output` fallback (line 98) treats both `None` and `""` as missing, falling back to the raw JSON string — this is intentional per commit 9ae6ace. Gemini multi-model token aggregation loop (lines 91-94) is correct. Cost is still recorded even on null-result fallback, which is the right call since tokens were consumed. The Claude branch correctly sums `input_tokens + cache_creation_input_tokens + cache_read_input_tokens` for total input.

---

### expert_build/llm.py:invoke
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: PARTIAL
INTEGRATION: WIRED
REASONING: The return value change from `stdout.decode()` to `_parse_cli_json(stdout.decode(), model)` is the key behavioral change. All downstream callers (exam.py, propose.py, coverage.py, summarize.py, pipeline.py) expect the model's text response, and `_parse_cli_json` extracts exactly that from the JSON wrapper. When the model's response is itself JSON (common in propose/coverage), it's embedded as a string in the `result` field, so extraction works correctly. The `invoke` function itself isn't directly tested with a real subprocess (parsing is tested via `_parse_cli_json`), but integration via `invoke_sync` is covered by existing tests that mock at the `invoke_sync` boundary.

---

### expert_build/llm.py:MODEL_COMMANDS
VERDICT: CONCERN
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: UNTESTED
INTEGRATION: WIRED
REASONING: Two changes here: (1) Adding `--output-format json` for Claude and `-o json` for Gemini — necessary for cost tracking, correct. (2) Adding `--skip-trust` for Gemini — this is unrelated to cost tracking and could change behavior (bypassing interactive trust prompts). It's probably needed for non-interactive pipeline usage, but it's an undocumented behavioral change bundled into a cost-tracking PR. No test verifies the command construction for either model.

---

### expert_build/cli.py:main (finally block)
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: UNTESTED
INTEGRATION: WIRED
REASONING: The `finally` block correctly runs on all exit paths: normal completion, KeyboardInterrupt (via sys.exit(130)), SystemExit (re-raised), and general exceptions. The lazy import avoids loading the module if the CLI exits early (e.g., `--help`). Output goes to stderr, which is correct — it won't pollute piped stdout. The guard `if cost:` avoids printing an empty line when no LLM calls were made (e.g., `init`, `status` commands). One minor edge: if stderr is closed/broken-pipe, the `print` could raise, but this is a standard CLI edge case.

---

### tests/test_llm.py
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: N/A
REASONING: Thorough test file covering: Claude JSON parsing with cache tokens, Gemini JSON parsing with multi-model aggregation, non-JSON fallback, null result fallback, non-dict JSON fallback, cost accumulation across calls, per-model tracking, reset, and format output with/without cost. The `autouse` fixture ensures test isolation. Missing: no test for Gemini `response: null` fallback (only Claude's `result: null` is tested), and no test for empty string response values. These are minor gaps.

---

### SELF_REVIEW
LIMITATIONS: Could not run the test suite to confirm all tests pass (including existing tests that might break from the `invoke()` return value change). Did not have access to the Gemini CLI docs to verify the `--skip-trust` and `-o json` flag behavior. Could not verify whether Claude's `--output-format json` flag produces exactly the JSON schema assumed by `_parse_cli_json`.

---

### FEATURE_REQUESTS
- Include the results of running the test suite (both new and existing tests) as an observation, so the reviewer can verify no regressions from the behavioral change to `invoke()`'s return value.
- When MODEL_COMMANDS or CLI invocation patterns change, include a sample of the actual CLI JSON output as an observation so the reviewer can verify the parsing logic matches reality.
