# Match Question Evaluator Overlay
# Covers match/classification questions using the categories + items schema.
# This overlay is appended to _base_evaluation.txt and provides only match-specific
# metric definitions. The base prompt handles: evaluation steps, output format, general rules.
# Note: integrity_check is fully handled by Step 0 in the base — not re-evaluated here.

## Match Question Evaluation (Categories + Items Schema)

This question is a match/classification question using the **categories + items schema**.
Apply the following type-specific evaluation rules in addition to the general evaluation procedure above.

**DOMAIN-GENERAL NOTE**: These rules apply to match questions across any subject — social studies, science, ELA, math, etc. Examples in this overlay are illustrative; apply rules to the actual domain.

### Schema Reference

A valid match question has this structure (2–4 categories; 4–8 items recommended):
```json
{
  "question": "<stem instructing the student to classify/match>",
  "categories": [
    {"id": "c1", "label": "<category label>"},
    {"id": "c2", "label": "<category label>"},
    {"id": "c3", "label": "<optional third category>"}
  ],
  "items": [
    {"id": "i1", "content": "<item phrase>", "correct_category": "c1"},
    {"id": "i2", "content": "<item phrase>", "correct_category": "c2"},
    {"id": "i3", "content": "<item phrase>", "correct_category": "c3"},
    {"id": "i4", "content": "<item phrase>", "correct_category": "c1"}
  ],
  "answer_evaluation": "<explanation of why each item belongs to its category>"
}
```

---

### Step 3.5: Match Format Validation (replaces MCQ/fill-in format check)

**CRITICAL — Run this structural validation before scoring any other metric. Document all violations under `specification_compliance`.**

**Required fields check:**
1. `categories` field MUST exist and be a non-empty array (2–4 objects, each with `id` and `label`)
2. `items` field MUST exist and be a non-empty array (recommended 4–8 objects, each with `id`, `content`, and `correct_category`)
3. `answer_evaluation` (or `answer_explanation`) MUST exist and be non-empty

**Referential Integrity (hard failure — `specification_compliance = 0.0`):**
- For EVERY item, its `correct_category` value MUST exactly match one of the `id` values in `categories`
- A single mismatch (typo, wrong id, extra space) = automatic format failure
- Reasoning template: "Item `<id>` has correct_category `'<value>'` which does not match any category id in the categories list"

**Category Balance (hard failure — `specification_compliance = 0.0`):**
- EVERY category `id` MUST appear as `correct_category` in at least one item — categories with zero items are invalid
- Reasoning template: "Category `'<id>'` (`<label>`) has no items assigned to it"

**ID Uniqueness:**
- All category `id` values must be unique strings across categories
- All item `id` values must be unique strings across items
- Duplicate IDs = format failure → `specification_compliance = 0.0`

**If structural checks pass**, proceed normally. If they fail, document under `specification_compliance` but do NOT automatically fail other content-quality metrics based solely on structural errors.
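
The structural checks above can be sketched in Python. This is a minimal illustration, not part of the evaluator: the function name `validate_match_structure` and the violation wording for the required-field and duplicate-id checks are assumptions, and the sketch assumes the question is already parsed JSON whose `categories` and `items` entries are objects with string ids. The two reasoning templates are reproduced from the rules above.

```python
from collections import Counter


def validate_match_structure(question: dict) -> list[str]:
    """Return Step 3.5 violations; an empty list means the structure passes."""
    violations: list[str] = []
    categories = question.get("categories") or []
    items = question.get("items") or []

    # Required fields check (the 4-8 item range is only recommended, so it is not enforced).
    if not 2 <= len(categories) <= 4:
        violations.append("categories must contain 2-4 objects")
    if any(not c.get("id") or not c.get("label") for c in categories):
        violations.append("every category needs an id and a label")
    if not items:
        violations.append("items must be a non-empty array")
    if any(not i.get(k) for i in items for k in ("id", "content", "correct_category")):
        violations.append("every item needs id, content, and correct_category")
    if not (question.get("answer_evaluation") or question.get("answer_explanation")):
        violations.append("answer_evaluation (or answer_explanation) is missing or empty")

    labels = {c.get("id"): c.get("label") for c in categories}

    # Referential integrity: exact string match, so "c2 " does not match "c2".
    for item in items:
        value = item.get("correct_category")
        if value not in labels:
            violations.append(
                f"Item {item.get('id')} has correct_category {value!r} "
                "which does not match any category id in the categories list"
            )

    # Category balance: every category id must be used by at least one item.
    used = {item.get("correct_category") for item in items}
    for cid, label in labels.items():
        if cid not in used:
            violations.append(f"Category {cid!r} ({label}) has no items assigned to it")

    # ID uniqueness within categories and within items.
    id_lists = (("category", [c.get("id") for c in categories]),
                ("item", [i.get("id") for i in items]))
    for kind, ids in id_lists:
        for dup in [x for x, n in Counter(ids).items() if n > 1]:
            violations.append(f"Duplicate {kind} id {dup!r}")
    return violations
```

Under this sketch, any returned violation corresponds to `specification_compliance = 0.0`, while the other content-quality metrics are still scored on their own merits.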

---

### 2. Factual Accuracy (Binary: 0.0 or 1.0)

**Pass (1.0) if:**
- All items are factually correct
- Every item is correctly assigned to its stated category
- No internal contradictions between item content and category assignment
- `answer_evaluation` accurately explains why each item belongs to its category
- All supporting fields (category labels, explanations) are consistent with the actual items and assignments

**What does NOT count as a factual error:**
Do NOT treat subtle interpretive differences, slight paraphrasing, or slightly loose pedagogical phrasing as factual errors. If an item description is broadly accurate in the domain sense (e.g., "makes laws" is an acceptable description of the Legislative branch even if a fuller description exists), that is not a factual_accuracy failure. Reserve factual_accuracy = 0.0 for clearly wrong real-world facts, contradictions, or wrong category assignments that would misteach students.

**Fail (0.0) if:**
- Any item is factually wrong (incorrect date, attribution, geographic claim, causal assertion, scientific fact)
- An item is assigned to the wrong category
- `answer_evaluation` contradicts the actual category assignments
- `answer_evaluation` references item ids, labels, or categories that do not exist in the question
- **FIELD MISMATCH**: `answer_evaluation` or `additional_details` describes items or assignments that do not match what is actually in the question

**Special rule for factual_accuracy:**
If the only concern is a nuanced judgment about wording or emphasis, you MUST:
- Set `factual_accuracy = 1.0`, AND
- If needed, address the issue under `educational_accuracy` or only in `suggested_improvements`

Do NOT set `factual_accuracy = 0.0` unless you can point to a concrete, unambiguous error in facts or a direct contradiction between an item and its assigned category.

**MANDATORY LITMUS TEST — write this in your reasoning before assigning factual_accuracy:**
Complete one of these two statements:
- "I identified the following factual error: [quote the specific wrong fact or wrong category assignment]" → score 0.0
- "I found no factual errors. All items are correctly assigned." → score 1.0

You MUST write one of these statements. If you wrote the second statement, factual_accuracy MUST be 1.0. Outputting factual_accuracy = 0.0 without a quoted specific error is invalid.
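
As an illustration, the litmus test can be checked mechanically against an evaluator's reasoning text. The sketch below is hypothetical: the function name `litmus_is_consistent` is an assumption, and it only checks that exactly one of the two statements appears and that the score agrees with it. It does not judge whether the quoted error is real.

```python
ERROR_PREFIX = "I identified the following factual error:"
CLEAN_STATEMENT = "I found no factual errors. All items are correctly assigned."


def litmus_is_consistent(reasoning: str, factual_accuracy: float) -> bool:
    """Check that the reasoning contains one litmus statement and the score matches it."""
    has_error = ERROR_PREFIX in reasoning
    has_clean = CLEAN_STATEMENT in reasoning
    if has_error == has_clean:
        # Neither statement was written, or both were: invalid either way.
        return False
    if has_clean:
        return factual_accuracy == 1.0
    # The error statement must be followed by a quoted, specific error.
    quoted = reasoning.split(ERROR_PREFIX, 1)[1].strip()
    return factual_accuracy == 0.0 and bool(quoted)
```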

**Image Verification — Two tiers of trust (if image is present):**

**Tier 1 — Structured image analysis** (header says "Programmatic image analysis"): GROUND TRUTH. Do NOT defer to the question's stated assignments if structured analysis contradicts them.

**Tier 2 — LLM visual interpretation only** (header says "LLM-based visual interpretation"): Best guess, NOT verified. Can be wrong about fine spatial or relational details.
→ If Tier 2 analysis contradicts an assignment BUT the assignment is internally consistent and logically sound, do NOT fail factual_accuracy. Give the content the benefit of the doubt.
→ Only fail based on Tier 2 when the error is gross and obvious.
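
The two tiers reduce to a small decision rule, sketched here under stated assumptions: the function name is hypothetical, the caller has already established that the image analysis contradicts an assignment, and raising on an unrecognised header is a choice made for this sketch rather than a rule from this overlay.

```python
def fails_on_image_contradiction(analysis_header: str, error_is_gross_and_obvious: bool) -> bool:
    """Decide whether a contradiction between image analysis and an assignment fails factual_accuracy."""
    if "Programmatic image analysis" in analysis_header:
        # Tier 1: structured analysis is ground truth, so any contradiction fails.
        return True
    if "LLM-based visual interpretation" in analysis_header:
        # Tier 2: benefit of the doubt unless the error is gross and obvious.
        return error_is_gross_and_obvious
    # Assumption of this sketch: unknown headers are surfaced instead of guessed at.
    raise ValueError("unrecognised image analysis header")
```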

---

### 3. Educational Accuracy (Binary: 0.0 or 1.0)

Assess whether the match question fulfills its educational intent. Educational intent may be:
- Explicit: Standards, grades, subjects mentioned in content
- Implicit: Infer from content complexity, vocabulary, question structure

---

#### Step A — Determine Question Intent (MANDATORY FIRST STEP)

Before evaluating any giveaway, apply the **INTERPRETING QUESTION INTENT** framework from the base procedure to classify this match question as one of:

**Worked example / instructional question:**
- Showing correct category assignments and the reasoning behind them is NOT a failure — that IS the instructional purpose
- Fail (0.0) only if the content is factually wrong, misleading, or clearly off-purpose
- Do NOT fail because a student could "copy" assignments from a walkthrough — that is how instruction works
- When a match question is embedded in an instructional article or lesson, default to treating it as a worked example unless there is clear evidence of independent assessment intent

**Practice problem / assessment:**
- Student is expected to classify independently before seeing answers
- Apply Steps B and C below

When uncertain, use the heuristics in INTERPRETING QUESTION INTENT. When genuinely ambiguous after those checks, default to treating the content as a worked example.

---

#### Step B — Check Reveal Cues (practice/assessment only)

Apply the **INTERPRETING UI & REVEAL CUES** framework from the base procedure. If `answer_evaluation`, correct assignments, or explanations appear behind a reveal cue (`"hidden": true`, "Click to show answer", shown after submission, or in help/feedback/hint fields) → treat as post-attempt content. Do NOT treat it as a giveaway regardless of what it contains.

`answer_evaluation` is always post-attempt by design — never treat it as student-visible pre-attempt content unless explicitly labelled otherwise.

---

#### Step C — Apply the Trivial Test (practice/assessment with no reveal gating)

An answer is a giveaway only if it is **trivially obtainable by the target audience** — the student can get correct category assignments by reading or copying without applying any grade-appropriate reasoning. This is audience-relative:
- What is trivializing for one grade may be appropriate scaffolding for another
- When pedagogical purpose is conceptual learning, visible context that still requires the student to reason is NOT a giveaway
- **When uncertain, default to NOT failing** — only fail when the answer is unambiguously given away with no reasoning required
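
Steps A through C form a gate sequence that can be sketched as a single function. The names `Intent` and `giveaway_may_fail` are hypothetical, and the three inputs stand for judgments the evaluator has already made. The sketch covers only the giveaway question; a worked example can still fail for being factually wrong, misleading, or off-purpose.

```python
from enum import Enum
from typing import Optional


class Intent(Enum):
    WORKED_EXAMPLE = "worked_example"
    PRACTICE = "practice"
    AMBIGUOUS = "ambiguous"


def giveaway_may_fail(intent: Intent, behind_reveal_cue: bool,
                      trivially_obtainable: Optional[bool]) -> bool:
    """Return True only when a giveaway is allowed to fail educational_accuracy."""
    # Step A: worked examples never fail as giveaways; ambiguous intent defaults to worked example.
    if intent is not Intent.PRACTICE:
        return False
    # Step B: content behind a reveal cue is post-attempt and is never a giveaway.
    if behind_reveal_cue:
        return False
    # Step C: fail only when the answer is unambiguously trivial (None means uncertain).
    return trivially_obtainable is True
```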

---

**Pass (1.0) if:**
- The classification task tests conceptual understanding of the stated standard
- Categories represent meaningful conceptual distinctions
- Items require students to apply understanding, not just recall list membership
- Appropriate for the apparent grade level and subject
- Standards referenced (if any) are accurately targeted

**Fail (0.0) if (practice/assessment only, no reveal gating):**
- The question only tests rote memorization — student could answer by copying a list embedded in the student-facing content
- Category labels make the correct answer trivially obvious without any reasoning (see match-specific patterns below)
- Any item names its own category, giving away the answer
- Assesses unrelated or tangential skills

**NOTE ON STIMULI**: A question is NOT penalized under educational_accuracy simply because an included stimulus is not strictly necessary to answer. Stimulus issues are evaluated under `stimulus_quality`.

**NOTE ON HELP/FEEDBACK CONTENT**: `answer_evaluation` and content in help, feedback, or hint fields is shown AFTER the student attempts. Do NOT treat post-attempt explanations as giveaways regardless of how detailed they are.

---

#### Match-Specific Giveaway Patterns (practice/assessment — automatic fail when no reveal gating)

- The question stem or student-facing fields enumerate which items belong to which categories before asking students to classify — the classification has already been done for them
- Category labels function as definitions that make classification trivial without reasoning — e.g., a category "Organisms that perform photosynthesis" when an item reads "converts sunlight into glucose using chlorophyll", or a category "Vertebrates" when an item reads "vertebrate animals with a backbone" (the label directly answers the classification)
- `additional_details` (if student-visible with no reveal gating) provides the answer key organised by category
- An item's content repeats a substantive word from its own category label, or closely paraphrases the label, allowing students to match by word-spotting rather than conceptual understanding; e.g., item "Occupying a position not included in the primary four-level social framework" assigned to category "Status Outside the Main Hierarchy" (the item is a near-paraphrase of the label)

---

### 4. Curriculum Alignment (Binary: 0.0 or 1.0)

**CRITICAL — Use Curriculum API Data When Provided:**
- If Curriculum API provided Standard Descriptions → verify categories and items address those standards
- If Curriculum API provided Learning Objectives → verify the conceptual distinctions being tested address those objectives
- If Curriculum API provided Assessment Boundaries → verify every item and category stays within scope
- Boundary violations MUST fail this metric (for GUARANTEED/HARD confidence)

**Pass (1.0) if:**
- Directly addresses relevant educational standards for subject/grade
- Categories and items reflect concepts and skills from curriculum standards
- Stays within appropriate assessment boundaries
- Avoids testing beyond scope of standards
- Complies with ALL Assessment Boundaries provided by Curriculum API

**Fail (0.0) if:**
- Significant misalignment with standards
- Items or categories test concepts outside scope
- Complexity inappropriate for standards
- Major deviations from curriculum objectives
- Violates any Assessment Boundary (for GUARANTEED/HARD confidence)

---

### 5. Clarity & Precision (Binary: 0.0 or 1.0)

**SCOPE: This metric evaluates SEMANTIC clarity only — whether the question wording is understandable to students at the target grade level. Format/structure requirements are evaluated in Specification Compliance, NOT here.**

**Pass (1.0) if:**
- Question stem clearly instructs the student what to do (e.g., "Classify each organism into the correct kingdom", "Sort each item as a cause or effect", "Match each term to its correct category")
- Category labels are short, parallel, and unambiguous
- Each item is a concise, substantive phrase that students can evaluate
- No item is so vague it could reasonably belong to multiple categories without clear pedagogical justification
- Grammar and structure are correct
- Vocabulary is appropriate for the target grade level (curriculum terms excepted)

**Fail (0.0) if:**
- Question stem does not make the classification task clear to the student
- Category labels are ambiguous or overlap in meaning such that a student cannot determine which category applies
- Items are vague (e.g., "economic factors", "a person") rather than specific enough to evaluate
- Items are so similar to each other they cannot be meaningfully distinguished
- Grammatical issues impede understanding
- **Merged non-word forms**: `themain`, `tothe`, `ofthe`, `inthe`, etc.
- **Grade-inappropriate vocabulary**: Uses words significantly above the target grade's reading level when a simpler alternative exists AND the word is NOT a curriculum term being taught

**What does NOT fail this metric:**
- Curriculum-specific terms that the standard explicitly teaches (e.g., "legislative", "photosynthesis", "metaphor")
- Decorative symbols or section dividers that don't create confusion
- Minor formatting artifacts that don't impede understanding

**NOTE**: Do NOT fail this metric for format violations (wrong item count, structural issues). Those belong in Specification Compliance.

---

### 6. Specification Compliance (Binary: 0.0 or 1.0)

**Evaluates whether the question follows the structural requirements of the match format AND any explicit skill specification provided.**

**Structural requirements for match (always enforced, see Step 3.5 for details):**
- 2–4 categories, each with a unique `id` and `label`
- 4–8 items (recommended), each with unique `id`, `content`, and a `correct_category` that references a valid category `id`
- Every category must have ≥1 item assigned
- `answer_evaluation` (or `answer_explanation`) must be present and non-empty

**For explicit skill specifications beyond the above:**

**If NO skill specification is provided (or spec is ambiguous/conflicting):**
- Only enforce structural requirements above; auto-pass for all other spec constraints

**If a CLEAR, EXPLICIT skill specification IS identified:**
- Evaluate compliance with word/character count, sentence structure, content constraints, stimulus requirements

**You may ONLY fail (0.0) for spec violations beyond structural requirements when ALL THREE conditions are met:**
1. You have identified a clear, explicit skill specification, AND
2. You can **quote the exact requirement text** from the spec, AND
3. You can **quote the exact content** in the question that violates that requirement

**If you cannot satisfy all three conditions, `specification_compliance` MUST be 1.0** (unless a structural failure from Step 3.5 already triggered 0.0).
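
The scoring rule for this metric can be summarised in a short sketch. The function name and parameters are hypothetical; `structural_violations` stands for whatever Step 3.5 reported, and the two quote parameters hold the verbatim text the evaluator was able to cite.

```python
def specification_compliance(structural_violations: list[str],
                             spec_is_clear: bool,
                             quoted_requirement: str = "",
                             quoted_violation: str = "") -> float:
    """Score specification_compliance from Step 3.5 results and the three-condition rule."""
    # A structural failure from Step 3.5 always yields 0.0.
    if structural_violations:
        return 0.0
    # Beyond structure, fail only when all three conditions hold.
    if spec_is_clear and quoted_requirement.strip() and quoted_violation.strip():
        return 0.0
    return 1.0
```

A missing, ambiguous, or conflicting spec is modelled here as `spec_is_clear=False`, which auto-passes everything except the structural requirements.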

---

### 7. Reveals Misconceptions (Binary: 0.0 or 1.0)

For match questions, "revealing misconceptions" means item placement is non-obvious enough to surface student errors.

**CRITICAL — Use Curriculum API Data When Provided:**
- If Curriculum API provided Common Misconceptions → verify items align with those specific error patterns
- Do NOT ignore provided misconceptions in favor of your own judgment

**Pass (1.0) if:**
- At least some items could plausibly be misclassified by a student with partial mastery
- The item placements reveal whether students understand the conceptual distinctions — not just memorization
- Distractors (ambiguous items) are plausible and relevant to known common misconceptions
- Has meaningful diagnostic value — getting it wrong indicates a specific conceptual gap

**Fail (0.0) if:**
- All items obviously belong to exactly one category — no student with any knowledge of the topic would hesitate
- No diagnostic value — correct answers don't indicate understanding, wrong answers don't indicate a specific gap
- Items are trivially obvious — any student with minimal topic exposure could assign every item correctly without analysis
- No connection to common misconceptions (especially those provided by Curriculum API)

**NOTE**: A task that feels "obvious in hindsight" to an expert is still diagnostically useful if the target student population is learning the concept. Fail only when there is genuinely no plausible confusion for the intended grade/level.

**Match-specific diagnostic signals:**

Good diagnostic value (PASS indicators):
- At least 2 items have meaningful "trap" potential — a student with partial mastery could plausibly assign them to the wrong category. Examples across domains: in science, "releases oxygen as a byproduct" could be misclassified under cellular respiration instead of photosynthesis; in social studies, an event that is both a cause and an effect in a chain requires knowing the causal sequence; in ELA, a metaphor that also functions as personification requires distinguishing the primary device
- Items that represent known common misconceptions when misclassified — the wrong-category placement is the predictable error students make, not a random one
- Items where correct classification requires distinguishing between conceptually adjacent categories (e.g., "cause" vs. "immediate effect" vs. "long-term consequence"; "producer" vs. "primary consumer" vs. "secondary consumer"; "theme" vs. "topic" vs. "main idea")

Poor diagnostic value (FAIL indicators):
- Every item contains a keyword from its correct category label (vocabulary matching, not conceptual classification)
- All items are well-known, prototypical examples that any student who attended class once would immediately recognise without any analysis
- The question is structurally a term → definition matching exercise

---

### 8. Difficulty Alignment (Binary: 0.0 or 1.0)

**CRITICAL — Use Curriculum API Difficulty Definitions when provided.** If they exist, use them. Do NOT create your own difficulty criteria when definitions exist.

**For match questions, difficulty is primarily driven by:**
- **Item ambiguity**: Items that could plausibly fit multiple categories require more careful analysis
- **Conceptual nuance**: Are the distinctions self-evident or subtle?
- **Category count**: More categories = more complex discrimination

**IMPORTANT — item count does NOT determine difficulty.** Having 4, 6, or 8 items is within the valid range for ANY difficulty level. Do NOT use item count as evidence that a question is Medium or Hard. A 6-item Easy question is perfectly valid if the items are unambiguous and the axis is simple.

**Default difficulty indicators for match (use only when Curriculum API definitions are not provided):**
- **Easy**: exactly 2 categories with clearly distinct conceptual opposition; items are unambiguous, prototypical examples; no item could plausibly belong to more than one category
- **Medium**: 2–3 categories; some items require weighing 2 attributes or considering context to classify correctly. A Medium question with only 2 categories must demonstrate meaningfully higher cognitive demand than an Easy question — if items are still unambiguous prototypical examples with no plausible alternative category, it reads as Easy regardless of the label. When both Easy and Medium versions of the same standard exist, the Medium version should use a different category axis or add a third category to increase discrimination — not just rephrase the same items with slightly harder vocabulary
- **Hard**: 3 or 4 categories with nuanced distinctions; several items could plausibly fit multiple categories and require fine-grained analysis; requires evaluating multiple attributes before classifying. A Hard question with fewer than 3 categories fails this metric — the multi-way distinction is definitional to Hard difficulty. 4 categories is valid when the concept genuinely supports a 4-way distinction; do not penalise a well-designed 4-category Hard question.

**HARD quality check for 3-category questions:**
A 3-category HARD question still fails if:
- One category is effectively a catch-all (e.g., "Other", "Neither", or a residual category that only exists to absorb leftover items)
- The three categories are not independently meaningful (e.g., splitting "Causes" into "Economic Causes" and "Political Causes" while adding "Effects" — the split of one dimension into two doesn't constitute a genuinely 3-way distinction for Hard)
- The categories are so broad that correct classification is still obvious for every item

**MEDIUM 2-CATEGORY CHECK (mandatory — run this when difficulty is MEDIUM and there are exactly 2 categories):**
A MEDIUM question with only 2 categories must have meaningfully higher cognitive demand than EASY. Apply this test:

Step 1 — Name the category axis: What single dimension separates the two categories? Write it down explicitly (e.g. "before vs. after", "cause vs. effect", "strength vs. weakness", "motivation vs. outcome").

Step 2 — Pattern check: The following axes are structurally EASY **when items are prototypical unambiguous examples**. If the axis matches any of these AND Step 3 confirms items are unambiguous, the question is EASY-level:
- Temporal binary: "before vs. after", "earlier vs. later", "origins vs. outcomes", "rise vs. fall"
- Directional binary: "cause vs. effect", "motivation vs. outcome", "intention vs. result", "foundation vs. consequence", "input vs. output", "problem vs. solution", "action vs. consequence", "stimulus vs. response"
- Evaluative binary: "strength vs. weakness", "advantage vs. disadvantage", "success vs. failure", "positive vs. negative"
- Role binary: any two-way split by actor role, level, or position (e.g. "government vs. citizen", "producer vs. consumer", "predator vs. prey", "narrator vs. character")

**Important — Step 3 overrides Step 2.** Step 2 names axes that are typically Easy in structure. But if Step 3 shows that items genuinely require substantive subject knowledge to classify — where distinctions are content-knowledge questions, not pattern-matching questions — the question can still qualify as MEDIUM. Never fail purely on axis name alone; always complete Step 3.

Step 3 — Ambiguity test: For at least 2 items, ask: "Could a student with partial mastery plausibly place this item in either category?" If ALL items unambiguously belong to exactly one category with no hesitation possible — this is EASY regardless of vocabulary difficulty or content sophistication.

**Subject-knowledge note:** Across any domain, "cause vs. effect", "before vs. after", and similar directional axes can legitimately be MEDIUM when items require genuine subject knowledge to classify — not just keyword-matching or obvious placement. For example, classifying events in a causal chain as cause or effect (where some events are genuinely both, or require knowing the sequence) passes the Step 3 ambiguity test and qualifies as MEDIUM even though the axis is "directional binary." Apply Step 3 rigorously and do NOT fail on axis name alone.

Fail (0.0) if the axis falls into any EASY pattern above **AND** items are prototypical unambiguous examples that require no genuine reasoning to classify. Hard vocabulary or long item text does not elevate an EASY structure to MEDIUM, but genuine item ambiguity does.

**HARD boundary-ambiguity test (mandatory — run this for every Hard question):**
For a Hard question to pass, items must create genuine cognitive conflict across category boundaries. Apply this two-step test:

Step 1 — Identify the category axis: What single conceptual dimension separates the categories (three or four)? (e.g., "time phase of an event", "type of biological process", "narrative role in a text", "scale of the effect"). If no single axis connects all of them, the categories may be orthogonal sub-domains rather than adjacent distinctions.

Step 2 — Item boundary test: For at least 2 items, a student with partial mastery must be able to construct a plausible (even if wrong) argument for assigning that item to a different category. Ask: "What is the wrong category a partially-knowing student would pick for this item, and why would they pick it?"

**Fail (0.0) if the boundary-ambiguity test cannot be satisfied:**
- If the categories are effectively labels for distinct, non-overlapping sub-groups where every item self-identifies which sub-group it belongs to, then the question is not Hard, regardless of the category labels or thematic relatedness. A student who knows the sub-groups answers instantly; a student who doesn't know them cannot be helped by reasoning. Examples of this pattern across domains: three named civilizations sorted by their geography (SS); three named scientists each associated with a distinct discovery (science); three literary characters each associated with a distinct trait (ELA); three historical eras each with obviously era-specific items (SS).
- Concretely: if you cannot name at least 2 items where a partially-knowing student might plausibly choose the wrong category and explain why, score difficulty_alignment = 0.0.
- "Thematically related" categories are not sufficient — the test is whether items generate cross-boundary confusion, not whether the categories share a theme.
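
The pass condition of the boundary-ambiguity test can be sketched as follows. `BoundaryCase` and `hard_boundary_test` are hypothetical names; each case records the evaluator's answer to the Step 2 question for one item. Passing this check is necessary but not sufficient, since the catch-all and independence checks for Hard questions still apply separately.

```python
from dataclasses import dataclass


@dataclass
class BoundaryCase:
    item_id: str          # item a partially-knowing student might misplace
    wrong_category: str   # the category they would plausibly pick instead
    reason: str           # why that wrong choice is plausible


def hard_boundary_test(category_count: int, cases: list[BoundaryCase]) -> float:
    """Return 0.0 when a Hard question cannot satisfy the boundary-ambiguity test."""
    # Hard requires at least a three-way distinction.
    if category_count < 3:
        return 0.0
    # Count distinct items with both a named wrong category and an explanation.
    documented = {c.item_id for c in cases if c.wrong_category.strip() and c.reason.strip()}
    return 1.0 if len(documented) >= 2 else 0.0
```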

**MANDATORY CROSS-LEVEL CHECK** (when curriculum definitions specify concrete parameters):
1. For EACH defined difficulty level, list its parameters and check whether the content matches
2. Determine which level the content ACTUALLY fits — not which level it is labeled as
3. If the content matches a DIFFERENT level than its label → score 0.0. If it matches → score 1.0.
This prevents confirmation bias. Do NOT start from the labeled level and try to justify it.
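
A minimal sketch of the cross-level check, assuming the curriculum definitions can be expressed as inclusive numeric ranges per parameter. The parameter names and ranges in `EXAMPLE_DEFINITIONS` are invented for illustration and do not come from any Curriculum API. The sketch also assumes that a label passes when it is among the levels the content fits, which is one reading of the rule when definitions overlap.

```python
# Hypothetical definitions: parameter name -> inclusive (low, high) range per level.
EXAMPLE_DEFINITIONS = {
    "Easy": {"category_count": (2, 2), "ambiguous_items": (0, 0)},
    "Medium": {"category_count": (2, 3), "ambiguous_items": (1, 2)},
    "Hard": {"category_count": (3, 4), "ambiguous_items": (2, 8)},
}


def fitting_levels(definitions: dict, content: dict) -> list[str]:
    """List every defined level whose parameters the content satisfies."""
    return [
        level
        for level, params in definitions.items()
        if all(low <= content[name] <= high for name, (low, high) in params.items())
    ]


def cross_level_score(labeled_level: str, definitions: dict, content: dict) -> float:
    """Score 1.0 only when the labeled level is one the content actually fits."""
    # Every level is checked first; the label is compared only at the end.
    return 1.0 if labeled_level in fitting_levels(definitions, content) else 0.0
```

Checking all levels before looking at the label mirrors the instruction not to start from the labeled level and justify it.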

**Pass (1.0) if:**
- Category count, item ambiguity, and conceptual nuance match the intended difficulty
- Cognitive demand (Depth of Knowledge, DoK 1–3) is appropriate for the grade and standard

**Fail (0.0) if:**
- Labeled "Hard" but has fewer than 3 categories, or its categories are obviously distinct with clearly separable items
- Labeled "Easy" but requires nuanced multi-attribute reasoning to classify correctly
- Complexity is significantly misaligned with the grade level

---

### 9. Passage Reference (Binary: 0.0 or 1.0)

**Pass (1.0) if:**
- When a passage/context is provided, the match question properly uses it
- When no passage is needed, the question is self-contained — items are evaluable without external context
- No passage is involved (treat as N/A and pass)

**Fail (0.0) if:**
- Match question explicitly instructs students to use a passage/table/diagram that does not exist ("Based on the chart above, classify each...")
- A provided passage is referenced but the question can't be answered from it
- References are confusing or incorrect

---

### 10. Distractor Quality — Item Plausibility and Balance (Binary: 0.0 or 1.0)

For match questions, "distractor quality" does NOT mean wrong answer choices. It means **item quality**: are the items well-crafted, parallel, balanced, and substantive?

**MANDATORY LABEL ECHO SCAN (run before scoring this metric):**
For every item, produce this exact output in your reasoning before scoring:

```
i1: label=["word1","word2",...] content=["word1","word2",...] → overlap=[] ✓
i2: label=["word1","word2",...] content=["word1","word2",...] → overlap=["X"] ⚠ ECHO
```

Rules for the scan:
- List every word in the label of the item's `correct_category` EXCEPT true function words (the, a, an, of, and, in, for, to, or, is, as, by, at, on, with, that, this, their, its)
- List every word in the item's `content` EXCEPT the same function words
- Check for exact matches AND near-matches (singular/plural, verb forms: "force"/"forces"/"forced", "reproduce"/"reproduction")
- Adjectives count: "thermal", "physical", "social", "structural", "external", "internal" etc.
- Also flag: does the item paraphrase its label without exact word overlap? (e.g. "relies on water" → category "Aquatic Processes")

Scoring: count the items flagged ⚠ ECHO. 0 items = pass. 1 item = borderline (note it, keep 1.0 unless the echoed word is the dominant word in the label). 2+ items = Fail (0.0).
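
The word-overlap part of the scan can be approximated in code. This is a rough sketch under several assumptions: the function names are hypothetical, near-matches are detected with a crude suffix-stripping `stem` that will miss some inflections and may over-flag others, each item with a non-empty overlap counts as one echo, and the paraphrase check and the dominant-word judgment for a single echo are left to the evaluator.

```python
import re

FUNCTION_WORDS = {
    "the", "a", "an", "of", "and", "in", "for", "to", "or", "is",
    "as", "by", "at", "on", "with", "that", "this", "their", "its",
}
# Crude suffix list, longest first; an assumption of this sketch, not a real stemmer.
SUFFIXES = ("tion", "ing", "ion", "es", "ed", "al", "s", "e")


def words(text: str) -> list[str]:
    """Lowercase the text and drop the function words listed in the scan rules."""
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    return [t for t in tokens if t not in FUNCTION_WORDS]


def stem(word: str) -> str:
    """Strip one common suffix so that reproduce/reproduction or force/forced match."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def label_echo_scan(question: dict) -> dict[str, list[str]]:
    """Map each item id to the label words echoed in that item's content."""
    labels = {c["id"]: c["label"] for c in question["categories"]}
    report = {}
    for item in question["items"]:
        content_stems = {stem(w) for w in words(item["content"])}
        label_words = words(labels[item["correct_category"]])
        report[item["id"]] = [w for w in label_words if stem(w) in content_stems]
    return report


def label_echo_score(report: dict[str, list[str]]) -> float:
    """Two or more echoing items fail; zero or one keeps 1.0 pending evaluator judgment."""
    echoing_items = sum(1 for overlap in report.values() if overlap)
    return 0.0 if echoing_items >= 2 else 1.0
```

A scan like this only pre-populates the required per-item output; the evaluator still reviews each flagged and unflagged item before scoring.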

**MANDATORY SYNTACTIC GIVEAWAY SCAN (run before scoring this metric):**
For every category, produce this exact output in your reasoning:

```
c1 "Label": i1=gerund, i2=past-tense, i3="The"+noun
c2 "Label": i4=past-tense, i5=past-tense, i6=past-tense
```

Then apply the threshold rule:
- Count how many items in each category share the same opening pattern (gerund / past-tense / "The"+ / "A [Name]"+ / noun phrase / etc.)
- **If 2 or more items in one category share a pattern AND no item in any other category uses that pattern → automatic FAIL (0.0). No exceptions.**
- Also check tonal giveaway: if items in one category consistently use negative language (failure, decline, erosion, instability) while another uses neutral or positive language → FAIL.
- Uniform structure ACROSS ALL categories is NOT a giveaway — asymmetry BY CATEGORY is the problem.

You MUST produce the per-category pattern table above before assigning a score. Claiming "no patterns found" without the table is not acceptable.
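
Once the per-category pattern table exists, the threshold rule is mechanical. The sketch below assumes the evaluator has already labelled each item's opening pattern; the function name `syntactic_giveaways` and the pattern strings are hypothetical. The tonal giveaway check is not covered.

```python
from collections import Counter


def syntactic_giveaways(patterns_by_category: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (category id, pattern) pairs that trigger the automatic fail."""
    flagged = []
    for cid, patterns in patterns_by_category.items():
        for pattern, count in Counter(patterns).items():
            # Does any item in a different category open with the same pattern?
            used_elsewhere = any(
                pattern in other
                for other_id, other in patterns_by_category.items()
                if other_id != cid
            )
            if count >= 2 and not used_elsewhere:
                flagged.append((cid, pattern))
    return flagged
```

With the sample table above (`c1`: gerund, past-tense, "The"+noun; `c2`: three past-tense items), nothing is flagged, because one `c1` item also opens in the past tense. A non-empty result corresponds to a score of 0.0.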

**Pass (1.0) if:**
- Items are parallel in grammatical structure within the question (all noun phrases, or all verb phrases, or all event descriptions — consistent within the set)
- Items are balanced in length — no single item is conspicuously shorter or longer than others in a way that signals its category
- Items are specific enough to require thinking, but not so specific they name their own category
- All items are plausible, substantive phrases grounded in the question's conceptual domain
- Consistent level of specificity and detail across all items

**Fail (0.0) if:**
- Structural inconsistency: some items are full sentences while others are single words or fragments
- Length imbalance reveals category placement (e.g., only one very long, detailed item that stands out)
- Items are too generic to be meaningful (e.g., "a person", "a place", "something economic")
- Items introduce concepts entirely outside the question's conceptual domain
- Some items are plausible and well-written while others are implausible or poorly written
- **Label echo**: Items' content shares a substantive word (noun, verb, or descriptive adjective) with their own correct_category label. Examples across domains: item "thermal energy transfers through direct contact" assigned to category "Thermal Conduction" (shared: "thermal"); item "recurring verse structure signals the poem's emotional shift" assigned to category "Structural Devices" (shared: "structural" ~ "structure"); item "organisms reproduce through binary fission" assigned to category "Asexual Reproduction" (shared: "reproduce" ~ "reproduction"). A single echoing item across the whole set is borderline; two or more echoing items is a clear fail. (The mandatory scan above already covers this; this bullet is a reminder to score accordingly.)
- **Syntactic giveaway pattern**: Items in the same category share a distinctive grammatical structure that items in other categories do not — allowing a student to sort by syntax alone. For example: all person-items start with "A [name]..." while concept-items start with "The [noun]...", or all cause-items are gerund phrases while effect-items are past-tense clauses.

**NOTE**: This is the match equivalent of MCQ distractor quality checks (grammatical_parallel, plausibility, homogeneity, specificity_balance, length_check). The same principles apply — consistency, plausibility, and non-telegraphing — but evaluated for item phrasing rather than answer choice phrasing.
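
**Illustrative sketch (label echo):** The label-echo rule above can be approximated mechanically. The Python below is a minimal sketch, not the mandatory scan itself. The stopword list, the `MIN_PREFIX` threshold, and the function names are all assumptions made for illustration. It treats two words as sharing a root when they are identical or share a prefix of at least five letters, which is enough to catch pairs such as "structure" ~ "structural".

```python
import re

# Hypothetical stopword list; a real scan would need a fuller one.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "on", "by",
             "for", "with", "through", "its", "is", "are", "vs"}
MIN_PREFIX = 5  # assumed threshold for treating two words as the same root


def words(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def same_root(a, b):
    """Crude root match: identical words, or a shared prefix of MIN_PREFIX+ letters."""
    if a == b:
        return True
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared >= MIN_PREFIX


def label_echoes(question):
    """Return (item_id, item_word, label_word) for each item that echoes its own label."""
    labels = {c["id"]: c["label"] for c in question["categories"]}
    hits = []
    for item in question["items"]:
        label_words = words(labels[item["correct_category"]])
        for w in words(item["content"]):
            match = next((lw for lw in label_words if same_root(w, lw)), None)
            if match:
                hits.append((item["id"], w, match))
                break  # one hit is enough to count this item
    return hits
```

Under these assumptions, zero hits is consistent with a pass on this bullet, one hit is borderline, and two or more items with hits is a clear fail. A prefix heuristic can miss true echoes and flag unrelated words, so treat its output as candidates to confirm by reading the items.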

---

### 11. Stimulus Quality (Binary: 0.0 or 1.0)

Evaluate whether any stimulus (image, diagram, passage, etc.) included with the match question is **harmful** to the educational experience.

**CORE PRINCIPLE — HARMFUL VS. HELPFUL:**
Stimuli should only fail if they are **harmful** — wrong, misleading, distracting, or confusing. Stimuli that are helpful, neutral, or simply present should pass.

**If NO stimulus is present:**
- PASS (1.0) — most match questions are self-contained

**What counts as ACCEPTABLE (PASS):**
1. **Necessary**: Required to classify (e.g., "Classify each region shown on the map")
2. **Scaffolding**: Helps visualize the classification context (e.g., a diagram of branches of government for a Government question)
3. **Illustrative / Engaging**: Shows context without interfering with the task
4. **Neutral**: Present but not distracting

**What counts as HARMFUL (FAIL):**
1. **WRONG/INACCURATE**: The stimulus shows factually incorrect information
2. **REVEALS ASSIGNMENTS**: An image or diagram visually groups items into their correct categories, giving away the answer (e.g., a diagram that already shows which powers belong to which branch)
3. **CONTRADICTS THE QUESTION**: The stimulus conflicts with the question's items or categories
4. **ACTIVELY DISTRACTING**: So elaborate or attention-grabbing that it interferes with the classification task
5. **MISLEADING**: Could reasonably lead students toward wrong category assignments
6. **POOR QUALITY**: Blurry, illegible, or missing critical elements the question references

**Match-specific rule**: A stimulus fails stimulus_quality if it trivializes the classification task by pre-answering it — for example, a diagram that already groups items visually into their correct categories, or an image that labels each item with its category name when the student is supposed to determine those assignments.

---

### 12. Mastery Learning Alignment (Binary: 0.0 or 1.0)

Assess whether the match question requires genuine understanding rather than surface-level copying.

**Pass (1.0) if the question meets AT LEAST ONE of these criteria:**
- **Application**: Requires applying knowledge to classify — not just re-reading a list provided in the question
- **Conceptual discrimination**: Students must reason about each item's attributes to determine its category (not just recall that something was listed under a category)
- **Multi-attribute reasoning**: Items require weighing more than one property to classify correctly
- **Diagnostic utility**: Can distinguish between students who understand the conceptual distinction vs. those who don't

**Fail (0.0) if ALL of these are true:**
- The correct category assignments are trivially obtainable without any reasoning (e.g., the question stem or student-facing fields already list which items belong to which categories)
- No meaningful conceptual discrimination is required — copying suffices
- No diagnostic value — getting it right doesn't indicate understanding of the distinction

**CURRICULUM-AWARE EXCEPTION**: If the Curriculum API Difficulty Definition for the labeled difficulty explicitly describes recall or recognition as the expected cognitive level (e.g., "one-step recall", "recalling a base equivalence", "identity facts"), then `mastery_learning_alignment` MUST be 1.0. The curriculum intentionally designed this tier for recall — penalizing recall would contradict the authoritative Difficulty Definition.

**Important clarification**: A match question where items "could be recalled" by a knowledgeable student is NOT a mastery learning failure if the student must still apply the conceptual distinction to classify. Fail only when no reasoning is required at all.
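
**Illustrative sketch (decision order):** The rules above can be read as an ordered decision: apply the curriculum-aware exception first, then fail only when every fail condition holds. The Python below is a minimal sketch of that ordering. The `MasteryFindings` class and its field names are hypothetical stand-ins for conclusions the evaluator reaches in its own reasoning; they are not part of the question schema.

```python
from dataclasses import dataclass


@dataclass
class MasteryFindings:
    # Hypothetical flags filled in from the evaluator's own reasoning.
    curriculum_expects_recall: bool         # Difficulty Definition names recall or recognition
    assignments_trivially_obtainable: bool  # stem or student-facing fields list the answers
    no_conceptual_discrimination: bool      # copying suffices
    no_diagnostic_value: bool               # a correct answer does not indicate understanding


def mastery_learning_alignment(f: MasteryFindings) -> float:
    """Apply the exception first, then fail only when all three fail conditions hold."""
    if f.curriculum_expects_recall:
        return 1.0
    all_fail_conditions = (
        f.assignments_trivially_obtainable
        and f.no_conceptual_discrimination
        and f.no_diagnostic_value
    )
    return 0.0 if all_fail_conditions else 1.0
```

Written this way, a question that is recall-heavy but still requires the student to apply the distinction scores 1.0, because at least one fail condition is false.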

---

### 13. Localization Quality (Binary: 0.0 or 1.0)

Evaluate cultural and linguistic appropriateness of the match question's items, categories, and stem.

**Pass (1.0) if:**
- Items and categories use neutral, universal educational contexts
- No inappropriate cultural specifics (festivals, landmarks, public figures) unless required by the curriculum standard
- The classification task is solvable without local cultural knowledge
- Zero sensitive content (religion, politics, dating, alcohol, gambling, adult topics) — unless required by curriculum (e.g., a history standard on religious conflicts)
- Gender-balanced or gender-neutral representation in any named examples
- No stereotyping of any groups in item content
- Inclusive and respectful of all backgrounds
- All references age-appropriate for target students

**Fail (0.0) if:**
- Items require local cultural knowledge that international students would not have
- Contains sensitive content that is not required by the curriculum standard
- Gender imbalance or stereotyping present in named examples or item content
- Disrespectful or exclusionary tone in any field

---

### 14. Cross-Difficulty Recycling Check (applies when batch contains multiple difficulties for the same substandard)

When the evaluation batch contains more than one match question for the same substandard (e.g. an Easy, a Medium, and a Hard on standard 7.8.3), compare them for item recycling.

**This check is only active when you can see multiple difficulty variants for the same substandard in the batch. Skip entirely if you only see one difficulty level for a given substandard.**

**What counts as recycling (flag under `difficulty_alignment`):**
- An item in the Medium question uses the same historical fact, event, person, or causal claim as an item in the Easy question — even if the phrasing is different (e.g. Easy: "Raising money to pay off the national debt" / Medium: "Raising money to pay back national debts" — same fact, minor verb change)
- The Medium and Easy share the same category axis (e.g. both use "Cause vs. Effect") with only vocabulary difficulty changed, not conceptual structure
- The Hard question uses the same items as the Medium but adds a third category without changing which facts are tested

**What does NOT count as recycling:**
- The same historical period or topic appears across difficulties (expected — all three target the same standard)
- Different facts about the same concept (e.g. different causes of the same event)
- Genuinely different category axes across difficulty levels (e.g. Easy tests "Before vs. After", Medium tests "Political vs. Economic vs. Social")

**Scoring impact:** Flag recycled items in the `difficulty_alignment` reasoning of the higher-difficulty question. If the majority of items in a Medium question are recycled from its Easy counterpart, set `difficulty_alignment = 0.0` for the Medium question: it is not genuinely harder, only relabeled.
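
**Illustrative sketch (recycling candidates):** Near-verbatim recycling can be surfaced with a simple token-overlap measure before applying the majority rule. The Python below is a minimal sketch under stated assumptions: Jaccard similarity over word sets, a naive plural strip, and a threshold of 0.6 are all choices made for illustration, not part of this overlay. Lexical overlap is only a proxy. Two items can state the same fact in entirely different words, so paraphrased recycling still requires judgment.

```python
import re


def fact_tokens(text):
    """Lowercase word set with a naive plural strip (an assumed normalization)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens}


def jaccard(a, b):
    """Token-set overlap between two item phrases, from 0.0 to 1.0."""
    ta, tb = fact_tokens(a), fact_tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def recycled_items(lower_items, higher_items, threshold=0.6):
    """Higher-difficulty items whose wording closely overlaps any lower-difficulty item."""
    return [high for high in higher_items
            if any(jaccard(high, low) >= threshold for low in lower_items)]


def majority_recycled(lower_items, higher_items, threshold=0.6):
    """True when more than half of the higher-difficulty items are flagged."""
    flagged = recycled_items(lower_items, higher_items, threshold)
    return len(flagged) > len(higher_items) / 2
```

With these assumptions, the Easy/Medium pair quoted above ("Raising money to pay off the national debt" / "Raising money to pay back national debts") shares 6 of 9 distinct tokens and is flagged at the 0.6 threshold.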

---

## MANDATORY SCORE SELF-CONSISTENCY CHECK (run immediately before outputting any scores)

Before writing your final scores, check each CRITICAL metric against your reasoning:

**`factual_accuracy` self-check:**
→ If your reasoning identified a concrete factual error (wrong real-world fact, item assigned to wrong category, field mismatch), score 0.0.
→ If your reasoning found NO factual errors — specifically if you wrote "C=0" or "no critical issues" — you MUST score `factual_accuracy = 1.0`. A score of 0.0 with C=0 reasoning is a contradiction and is not allowed.
→ Litmus test: can you quote the specific wrong fact or wrong category assignment? If not, score 1.0.

**`educational_accuracy` self-check:**
→ Same rule: if you cannot quote a specific educational accuracy failure, score 1.0.

**Why this matters:** A `factual_accuracy = 0.0` or `educational_accuracy = 0.0` score is a CRITICAL metric failure that makes the overall rating INFERIOR regardless of all other metrics. Only apply these scores when you have identified a definite, specific failure — not as a precaution or from uncertainty.
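
**Illustrative sketch (self-check):** The litmus test above amounts to a rule that the score on a CRITICAL metric follows from quotable evidence, not from the proposed score. The Python below is a minimal sketch of that rule. The function name and the list-of-quotes representation are hypothetical, chosen only to make the logic concrete.

```python
def reconcile(proposed_score, quoted_failures):
    """Return (required_score, was_corrected) for one CRITICAL metric.

    quoted_failures: specific wrong facts or wrong category assignments
    quoted from the reasoning (a hypothetical representation).
    """
    evidence = [q for q in quoted_failures if q.strip()]
    # A quotable failure requires 0.0; with nothing quotable, 0.0 is a contradiction.
    required = 0.0 if evidence else 1.0
    return required, required != proposed_score
```

For example, a proposed `factual_accuracy = 0.0` with an empty evidence list is corrected to 1.0, which mirrors the "C=0" contradiction described above.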

---

## Additional Guidance

- **Integrity check is always first**: Step 0 runs before anything else. If a violation is found, the entire evaluation is voided.
- **Structural validation is always second**: Run Step 3.5 before scoring content-quality metrics. Structural errors land in `specification_compliance`; do NOT cascade failures into other metrics.
- **Be consistent**: Apply the same standards to all match questions. Only score 0.0 when there is a concrete, specific issue for that metric.
- **Be reproducible**: Your evaluation should produce the same result if run again on the same content. Avoid subjective, "vibes-based" judgments.
- **Be specific**: Provide actionable advice in `suggested_improvements`. Cite specific item text, category labels, or `correct_category` values — not vague impressions.
- **One issue, one primary metric**: Each issue gets scored in ONE primary metric. Mention in other reasoning if relevant, but do NOT double-penalize.
- **Infer consistently**: When standards aren't explicit, infer grade level from item vocabulary and complexity and apply that inference consistently across all metrics.
- **Domain-general application**: These rules apply to match questions in any subject. The categories+items pattern appears in science (classify organisms by kingdom), ELA (classify literary devices), math (sort expressions by property), and social studies (classify powers by branch). Evaluate against the actual domain's standards, not social studies defaults.
- **Respect UI cues**: When reveal cues are present ("Click to show answer", `"hidden": true`), assume a proper UI that hides answers until the student requests them. Do not treat post-attempt answer keys as giveaways.
