## Code Review: Cost/Token Tracking via `--output-format json`

### expert_build/llm.py — MODEL_COMMANDS change
VERDICT: CONCERN
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: UNTESTED
INTEGRATION: WIRED
REASONING: The command change from `["claude", "-p"]` to `["claude", "-p", "--output-format", "json"]` is the core behavioral change — `invoke()` now receives JSON instead of plain text. This is correctly compensated by `_parse_cli_json` extracting the text before returning. However, if the Claude or Gemini CLI ever returns non-JSON output (e.g., on warnings/deprecation notices printed to stdout), the fallback silently returns the raw JSON blob to callers who expect plain text. The Gemini command also gains `--skip-trust` which is unrelated to cost tracking — a bundled side-effect change that should be called out in the commit message. No integration test exercises `invoke()` → `_parse_cli_json` end-to-end with a subprocess mock.

---

### expert_build/llm.py:_parse_cli_json
VERDICT: CONCERN
CORRECTNESS: QUESTIONABLE
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Two edge cases: (1) `data.get("result", output)` falls back to `output` only when the key is absent, not when it's `None`. If Claude CLI returns `{"result": null, ...}`, the function returns `None` (not a string), which would break callers expecting `str`. (2) The `model.startswith("gemini")` dispatch means any model key starting with "gemini" (e.g., a hypothetical `"gemini-pro"`) routes through the Gemini parser, while all others use the Claude parser — this is reasonable given `MODEL_COMMANDS` only has two entries, but is implicit rather than explicit. The fallback for non-JSON is good defensive design.

---

### expert_build/llm.py — cost tracker (_cost_tracker, _record_cost, get_cost_summary, format_cost_summary, reset_cost_tracker)
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Clean accumulator pattern. `get_cost_summary()` returns `dict(_cost_tracker)` which is a shallow copy — the `by_model` sub-dict is shared by reference — but current callers only read, so this is fine. Not thread-safe, but all production callers use `invoke_sync` (which calls `asyncio.run()`), so concurrent mutation isn't a concern today. `format_cost_summary` correctly omits the dollar amount when cost is zero (Gemini case). Tests cover accumulation, per-model tracking, reset, and formatting.

---

### expert_build/cli.py:main — finally block
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: UNTESTED
INTEGRATION: WIRED
REASONING: The `finally` block correctly runs after normal exit, exceptions, and `KeyboardInterrupt`. The lazy import avoids import-time side effects. Printing to stderr is the right choice — it won't pollute stdout for commands whose output is piped. The `if cost:` guard suppresses output when no LLM calls were made (e.g., `expert-build init`). No test covers this integration point, but it's simple enough that the risk is low.

---

### tests/test_llm.py
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Good test coverage of the new functions. The `autouse` fixture ensures clean state per test. Tests cover Claude JSON parsing (including cache tokens), Gemini single-model and multi-model parsing, non-JSON fallback, non-dict JSON fallback, cost accumulation, per-model tracking, reset, and formatting with/without cost. Missing: a test for the `invoke()` → `_parse_cli_json` integration (subprocess mock returning JSON), and a test for the `result: null` edge case mentioned above.

---

### Existing test compatibility
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: All existing tests in `test_exam.py`, `test_coverage.py`, `test_propose.py`, and `test_summarize.py` mock `invoke_sync` at the caller's module level, so they bypass the real `invoke()` entirely. The mocks return plain text strings, which is still the correct return type since `_parse_cli_json` extracts text from JSON before returning. No existing tests should break from this change.

---

## Summary

The change is well-structured and the new code is correct for the common case. Two items worth addressing:

1. **`data.get("result", output)` when result is `None`** — minor edge case, consider `data.get("result") or output` if the CLI can return null results, or add an `isinstance(text, str)` guard.
2. **No integration test for `invoke()` with JSON output** — the critical path (subprocess returns JSON → `_parse_cli_json` extracts text) is only tested in isolation. A single test with a mocked `asyncio.create_subprocess_exec` returning a JSON blob would close this gap.

Neither is a blocker — the fallback behavior is safe, and the unit tests cover the parsing logic well.

---

### SELF_REVIEW
LIMITATIONS: Could not verify the actual `gemini` CLI flags (`--skip-trust`, `-o json`) against Gemini CLI documentation. Could not run the test suite to confirm all existing tests pass with the MODEL_COMMANDS change. Did not see the full `invoke()` function body after the diff was applied — relied on the diff hunks and observation data to verify the `_parse_cli_json` call site.

---

### FEATURE_REQUESTS
- Include the full post-edit function body for modified functions (not just pre-edit from observations), so the reviewer can verify the complete logic after changes are applied
- Run the test suite as an observation step and include pass/fail results — this would catch integration issues that static review misses
