## Code Review: JSON pseudo-tool-calling for propose and coverage

### expert_build/llm.py:extract_json
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Clean lift-and-generalize of `_extract_json` from exam.py. The array support is correctly implemented — when `[` precedes `{`, it tries array parse first, then falls through to object parse on failure (non-exclusive `if`, not `elif`). Return type `dict | list | None` is accurate. All 3 callers (exam, coverage, propose) import and use it. Tests cover plain, code-fenced, embedded, invalid, truncated, array, array-in-text, and array-with-fence cases.

---

### expert_build/llm.py:RETRY_JSON
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Constant moved from exam.py to llm.py. All 3 consumer modules (exam, coverage, propose) import it from the new location. The observation confirms 8 usages across the codebase, all pointing to `expert_build.llm`. No stale references to the old location remain.

---

### expert_build/coverage.py:cmd_cert_coverage
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Correctly switched from line-by-line text parsing (`result.strip().upper() != "NONE"`) to JSON parsing with retry. The retry fires only when `extract_json` fails to return a dict with `matching_ids`. If both attempts fail, `matches` stays empty and the code falls through to keyword matching (line ~135 in the full function body), preserving the existing fallback behavior. The `str(bid).strip()` coercion on each matching ID is defensive and correct.

---

### expert_build/propose.py:cmd_propose_beliefs
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Regex-based markdown parsing replaced with `extract_json` + list iteration. The output side preserves the `### [ACCEPT/REJECT] bid` markdown format for human review, so downstream `accept-beliefs` is unaffected. The retry path mirrors the coverage pattern: try once, if not a list, retry with `RETRY_JSON` appended, if still not a list, skip the batch with a warning. The `b.get("id", "unknown")` / `b.get("claim", "")` defaults are reasonable for malformed entries.

---

### expert_build/exam.py
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Clean removal of `_extract_json` and `RETRY_JSON` definitions, replaced with imports from `llm`. The `import json` removal is correct — `json` was only used by the removed `_extract_json`. All 4 call sites (`extract_answer` x2, `judge_answer` x2) now use the centralized `extract_json`. The `"answer" in data` / `"verdict" in data` checks on `extract_json`'s new `dict | list | None` return type are safe: for a list, `in` checks membership (not keys), so it returns False and falls through to retry — correct behavior.

---

### expert_build/prompts.py
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: `PROPOSE_BELIEFS` now requests a JSON array with `id`, `claim`, `source`, `source_url` fields — matches the parsing in `cmd_propose_beliefs`. `CERT_MATCH` now requests `{"matching_ids": [...]}` with an empty-array example for no matches — matches the parsing in `cmd_cert_coverage`. The prompt structures are clear and include examples to guide the LLM.

---

### tests/test_coverage.py
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: New test file with 5 tests covering: valid JSON matching, empty matches, retry on non-JSON response, code-fenced responses, and invalid belief ID filtering. The mocking approach (patching `check_model_available`, `invoke_sync`, `load_beliefs`) is correct and consistent with the test patterns in `test_propose.py`. The `beliefs_db` fixture creates an empty file to pass the existence check, then `load_beliefs` is patched — correct approach.

---

### tests/test_exam.py
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Import updated from `expert_build.exam._extract_json` to `expert_build.llm.extract_json`. Three new array-specific tests added (`test_extract_json_array`, `test_extract_json_array_in_text`, `test_extract_json_array_with_code_fence`). All existing tests updated to use the new import. No behavioral changes to existing test assertions.

---

### tests/test_propose.py
VERDICT: CONCERN
CORRECTNESS: QUESTIONABLE
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: PARTIAL
INTEGRATION: WIRED
REASONING: Tests properly updated to use JSON responses via `_json_beliefs` helper. The old `test_filter_regex_matches_accept_reject_placeholder` was replaced with `test_json_retry_on_bad_response` and `test_json_with_code_fence` — good substitution. However, `test_proposals_written_after_each_batch` has a subtle issue: `invoke_side_effect` always returns beliefs with id `"belief-from-batch-1"` (hardcoded), but `call_count` increments. After the crash on call 2, when the function is called for another batch, it returns the same `"belief-from-batch-1"` id again. The test's assertion `assert "belief-from-batch" in content` passes, but it's no longer testing that the *first* batch's output survived the crash — it could be the output from the third call. The old test used `f"belief-from-batch-{call_count}"` to distinguish batches; the new test lost that distinction. This weakens the crash-recovery test.

---

### SELF_REVIEW
LIMITATIONS: Could not see the full `exam.py` or `propose.py` files to verify whether the `re` import in exam.py is still needed by other functions, or whether `propose.py` has unused imports after the refactoring. Could not run the test suite to verify all tests pass.

---

### FEATURE_REQUESTS
- Include the full import section of modified files in observations so unused-import issues can be flagged
- Run the test suite as part of review and include pass/fail results in observations
