# Sequence/Ordering Question Evaluator Overlay
# Covers questions where students must arrange items in the correct order (drag-and-drop ordering).
# This overlay is appended to _base_evaluation.txt and provides only sequence-specific
# metric definitions. The base prompt handles: evaluation steps, output format, general rules.

## Sequence/Ordering Question Evaluation

This question is a sequence (drag-and-drop ordering) question. The student is given a set of
items and must arrange them in the correct order. Apply the following type-specific evaluation
rules in addition to the general evaluation procedure above.

**Content structure for sequence questions:**
- `question`: The stem naming the ordering criterion (e.g., "Place the following events in
  chronological order from earliest to latest.")
- `items`: A list of objects with `id` and `content` — the items to be ordered.
- `correct_order`: A list of item IDs representing the correct sequence.
- `answer_explanation`: Step-by-step explanation of why each item comes before the next.
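
The structure above can be sketched as a small shape check, assuming the content arrives as a Python dict with these four keys. The helper name `validate_sequence_content` is hypothetical, and the requirement that `correct_order` list every item ID exactly once is an assumption implied by the field descriptions rather than stated in them.

```python
from collections import Counter

REQUIRED_FIELDS = ("question", "items", "correct_order", "answer_explanation")


def validate_sequence_content(content: dict) -> list[str]:
    """Return structural problems; an empty list means the shape looks valid."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in content]
    if problems:
        return problems

    item_ids = [item.get("id") for item in content["items"]]
    if len(set(item_ids)) != len(item_ids):
        problems.append("duplicate item ids")
    # Assumption: correct_order must list every item id exactly once.
    if Counter(content["correct_order"]) != Counter(item_ids):
        problems.append("correct_order is not a permutation of item ids")
    return problems


example = {
    "question": "Place the following events in chronological order from earliest to latest.",
    "items": [
        {"id": "i1", "content": "Luther posts the 95 Theses"},
        {"id": "i2", "content": "Council of Trent begins"},
    ],
    "correct_order": ["i1", "i2"],
    "answer_explanation": "The 95 Theses (1517) precede the Council of Trent (1545).",
}
print(validate_sequence_content(example))
```

A structural check like this says nothing about quality; it only confirms that the metrics below have well-formed input to work on.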

---

### Metric Definitions

Evaluate the sequence question on exactly these five metrics. Each metric uses a 1–5 integer
scale. Do NOT apply closed-ended metrics (distractor_quality, etc.) to sequence questions.

---

#### 1. orderability (1–5)
Is there a clear, defensible single correct sequence? Can a student who knows the curriculum
standard confidently determine one correct ordering?

**Item classification — classify EACH item as one of these four types:**

- `specific_event` — a named action with an identifiable starting moment. You can point to the specific event, law, publication, battle, or decree that STARTED it (e.g., "Luther posts 95 Theses", "Council of Trent begins", "Yoritomo is appointed shogun").
- `bounded_era` — an intellectual tradition or era whose position is unambiguously FIRST or LAST in the sequence. Only valid when (a) the sequence spans multiple centuries across clearly distinct major periods AND (b) the era's position before or after all others is self-evident. NOT valid in sequences spanning fewer than ~150 years.
- `ongoing_state` — a condition that was true for many decades or centuries with **no identifiable single starting moment**. The test: if you cannot name the specific event/law/publication that STARTED this condition → it is an ongoing_state → orderability ≤ 3.
- `process` — a gradual development that unfolded over many decades with no single starting moment. Same test: if no specific starting event can be named → process → orderability ≤ 3.

**The single test for ongoing_state / process:**
> "Can I name the specific moment this STARTED — a named law, event, publication, battle, or decree?"
> If NO → it is an ongoing_state or process → orderability ≤ 3.

**Canonical examples:**
- "Luther posts the 95 Theses" → specific_event (1517, one moment) ✓
- "Ancient Greek philosophers develop logic" → bounded_era as item 1 of a Greece→Rome→Christianity→Renaissance sweep spanning ~2000 years ✓
- "Scholars rely on geocentric models" → ongoing_state (150 CE–1600 CE, no start moment) ✗ → orderability ≤ 3
- "The Scientific Revolution introduces the scientific method" → process (1543–1700+, 150+ years) ✗ → orderability ≤ 3
- "Humanism encourages scholars to question authority" → process (200-year movement) ✗ → orderability ≤ 3
- "Feudal system becomes the primary economic structure" → process (no single start) ✗ → orderability ≤ 3
- "Christianity introduces the concept of individual worth" → ongoing_state (2000-year theological claim) ✗ → orderability ≤ 3
- "European monarchs gain power as the Church weakens" → ongoing_state (true 1300–1600, no single start) ✗ → orderability ≤ 3
- "The Reformation emerges as a formal movement" → process (1517–1555+) ✗ → orderability ≤ 3
- "Italian city-states establish wealth through Crusade-era trade" → process (11th–13th centuries) ✗ → orderability ≤ 3
- "The merchant class emerges in medieval towns" → process ✗ → orderability ≤ 3

**CRITICAL: Position in the sequence does NOT exempt an item from this check.**
- A process item placed FIRST is still a process item → INVALID.
- A process item placed LAST (as a "culminating result" or "final consequence") is still a process item → INVALID.
- The only exemption is a valid `bounded_era` — see definition above.

**What "could_be_elsewhere=yes" means for scoring:**
- Exactly one item has placement ambiguity (could defensibly swap with an adjacent item) → orderability = 4
- Two or more items have placement ambiguity → orderability = 3
- An item is classified ongoing_state or process → orderability ≤ 3 (regardless of position)

Score:
- **5** — Every item is a specific_event or valid bounded_era. Every item has an unambiguous position. No ongoing_state or process items.
- **4** — All items are specific_events or bounded_eras; exactly one has minor positional ambiguity (e.g., two events in the same decade where curriculum does not specify sub-ordering).
- **3** — Any item is an ongoing_state or process. OR two or more items have genuine placement ambiguity. OR a spanning consequence is present.
- **2** — Multiple orderings are defensible; correct_order is one valid interpretation but not the most clearly defensible.
- **1** — No defensible single sequence exists.
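
As a rough illustration, the caps stated above (process/ongoing_state → ≤ 3, one ambiguous item → 4, two or more → 3) can be expressed as a ceiling function. The `ItemAudit` type and `orderability_ceiling` name are hypothetical; the function only bounds the score from above, since scores of 2 or 1 still depend on the evaluator's judgment.

```python
from dataclasses import dataclass

VAGUE_TYPES = {"ongoing_state", "process"}


@dataclass
class ItemAudit:
    item_id: str
    event_type: str  # specific_event | bounded_era | ongoing_state | process
    could_be_elsewhere: bool


def orderability_ceiling(audits: list[ItemAudit]) -> int:
    """Highest orderability score the caps above permit for these audits."""
    # Any ongoing_state or process item caps the score at 3, regardless of position.
    if any(a.event_type in VAGUE_TYPES for a in audits):
        return 3
    ambiguous = sum(a.could_be_elsewhere for a in audits)
    if ambiguous >= 2:
        return 3
    if ambiguous == 1:
        return 4
    return 5


audits = [
    ItemAudit("i1", "specific_event", False),
    ItemAudit("i2", "process", False),
]
print(orderability_ceiling(audits))
```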

**Cross-domain examples:**
- BIOLOGY (5): "Arrange the stages of mitosis: Prophase → Metaphase → Anaphase → Telophase" — each step has a unique, unambiguous position.
- HISTORY (5): "Ancient Greek philosophers develop logic" as item 1 before Rome, Christianity, Renaissance — valid bounded era, unambiguous.
- HISTORY (3): "The Scientific Revolution introduces the scientific method" — process (1543–1700+), no single start moment.
- SCIENCE (4): "Tectonic plates begin to separate" — one event with minor regional dating uncertainty → 4, not 3.
- ELA (1): "Arrange these themes from least to most important" — subjective, no correct order.

---

#### 2. sequence_accuracy (1–5)
Is the `correct_order` factually, chronologically, or logically correct according to the
curriculum standard's expected knowledge?

- **5** — Fully accurate: every item is in the correct position per established chronological,
  causal, or procedural record.
- **4** — Accurate with one item whose position is defensible but could be disputed at advanced
  scholarly levels beyond grade expectations.
- **3** — One item is in a position that is questionable but not clearly wrong at this grade level.
- **2** — One item is clearly in the wrong position per widely taught curriculum content.
- **1** — Multiple items are in the wrong positions; correct_order contradicts the established
  record taught at this grade level.

**IMPORTANT — Curriculum-Standard Ordering Acceptance:**
If the ordering is consistent with what the curriculum standard teaches at this grade level,
accept it as correct (score 4 or 5) even if scholarly debate exists at a higher level.
Do NOT penalize for nuance outside grade-level expectations.

**IMPORTANT — Chronological vs. Causal Ordering:**
When the stem says "chronological order", evaluate sequence_accuracy using strict temporal
dates, not the causal narrative a textbook might use. Do NOT score ≤ 2 because:
- The causal narrative feels confusing while the dates are correct
- A textbook presents a different causal emphasis
"Chronologically accurate but conceptually confusing" → score 3 or 4, NOT 2.

**IMPORTANT — Require date evidence before claiming a chronological inversion:**
To score sequence_accuracy ≤ 2 for a "chronological inversion", you MUST:
1. State the approximate dates of the two allegedly inverted items from your knowledge.
2. Confirm the dates form the wrong order.
If you are uncertain of the approximate date of either item → default to
sequence_accuracy = 4 (uncertain, not confirmed wrong). Do NOT infer inversion from
causal logic alone: a causal narrative linking X to Y does not establish which item came
first in calendar time. Causal relationships are NOT the same as temporal order: an effect
can begin before a reinforcing cause arrives (e.g., early adoption before a major catalyst).

**IMPORTANT — When ongoing_state items are present:**
If orderability ≤ 3 because item iN is an ongoing_state/process, evaluate sequence_accuracy
only for the specific_event items. Do NOT also penalize sequence_accuracy for the same item —
that is double-counting. If the datable events are in correct order → sequence_accuracy = 4 or 5.
Only penalize sequence_accuracy if a specific_event item is itself in the wrong position.
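
A minimal sketch of the two rules above (date evidence before claiming an inversion, and skipping ongoing_state/process items), assuming the evaluator has already written down an approximate year for each specific_event item. The function name `datable_inversions` and the `dated` mapping are hypothetical.

```python
def datable_inversions(order, dated):
    """Return adjacent (earlier_id, later_id) pairs whose stated dates contradict `order`.

    `dated` maps item id -> approximate year for specific_event items only.
    Items without an entry (ongoing_state, process, or uncertain dates) are skipped,
    so they can never be the basis of an inversion claim.
    """
    datable = [(item_id, dated[item_id]) for item_id in order if item_id in dated]
    return [
        (a_id, b_id)
        for (a_id, a_year), (b_id, b_year) in zip(datable, datable[1:])
        if a_year > b_year
    ]


# i2 is a process item, so it has no entry in `dated`.
order = ["i1", "i2", "i3", "i4"]
dated = {"i1": 1440, "i3": 1517, "i4": 1545}
print(datable_inversions(order, dated))
```

An empty result means no inversion is supported by the stated dates, which under the rules above points to sequence_accuracy of 4 or 5 rather than ≤ 2.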

**Cross-domain examples:**
- HISTORY (5): Portuguese exploration (1415) placed before fall of Constantinople (1453) — chronologically correct.
- ENGINEERING (5): "Define → Research → Ideate → Prototype → Test → Evaluate" — correct procedural order.
- SCIENCE (5): "Chromosomes replicate → nuclear envelope dissolves → chromosomes align → cell splits" — correct mitosis order.
- HISTORY (2): Placing the printing press invention after the Reformation started — clearly inverted.
- SCIENCE (4): Minor textbook variation on geological process ordering — accept curriculum-standard version.

---

#### 3. item_granularity (1–5)
Are the **individual items** the right scope for the stated difficulty? This metric evaluates
each item's size independently — not how similar items are to each other.

- **5** — Every item is an appropriately-sized unit for the difficulty: items in easy questions
  cover whole phases or major milestones; items in hard questions cover closely adjacent sub-steps.
- **4** — Most items well-scoped; one item slightly too coarse or too fine.
- **3** — Several items poorly scoped (e.g., two or more items in a "hard" question each describe
  an entire century of history, making them trivially placeable).
- **2** — Items consistently mismatched: all items are sweeping broad strokes in a "hard" question,
  or all items are micro-details in an "easy" question.
- **1** — All items completely misscoped.

**Difficulty expectations for item scope:**
- **Easy**: Each item = a major, well-known milestone. Students who studied the topic would recognize it immediately.
- **Medium**: Each item = a named sub-step or phase within a larger process.
- **Hard**: Each item = a closely adjacent step where fine-grained knowledge of timing or causation is required.

**DISTINCTION from difficulty_calibration:**
`item_granularity` = "Is each item individually the right size?"
`difficulty_calibration` = "Is the total item COUNT right, and are items close enough together to be challenging as a set?"
These are different questions. Score them on different evidence.

---

#### 4. difficulty_calibration (1–5)
Does the **number of items** and the **inter-item similarity** match the stated difficulty?
(easy: 3–4 items; medium: 4–5 items; hard: 5–6 items)

- **5** — Item count is within the valid range AND inter-item similarity matches the difficulty.
  **For HARD scoring 5:** you MUST name at least 2 specific item-pairs a student could plausibly
  confuse and explain WHY (shared vocabulary, overlapping period, similar actors, adjacent steps).
  If you cannot name at least 2 such pairs → score 4, not 5.
  **NOTE: 3 items for easy is the valid lower bound — not a deduction. Score 5 if item count
  is anywhere within the stated range and similarity matches difficulty.**
- **4** — Item count correct; inter-item similarity slightly off (e.g., 5-item hard question but
  2 pairs are clearly distinct with no trap potential). OR 1–2 items contain inline dates but
  ordering still requires content knowledge beyond date-reading.
- **3** — Item count correct but inter-item similarity noticeably off. OR multiple items contain
  explicit dates making ordering trivially solvable by reading the dates rather than applying knowledge.
- **2** — Item count AND similarity both poorly matched (e.g., hard question with 3 obvious items;
  easy question with 6 closely adjacent steps). OR most items contain explicit dates.
- **1** — Completely mismatched.
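
The two mechanical parts of this metric, the item-count ranges and the two-pair requirement for a hard question scoring 5, can be sketched as follows. The names `ITEM_COUNT_RANGE`, `item_count_in_range`, and `hard_five_allowed` are hypothetical; judging inter-item similarity itself remains a matter for the evaluator.

```python
# Valid item counts per stated difficulty, taken from the ranges above (inclusive).
ITEM_COUNT_RANGE = {"easy": (3, 4), "medium": (4, 5), "hard": (5, 6)}


def item_count_in_range(difficulty: str, n_items: int) -> bool:
    """True if the item count is within the valid range for the stated difficulty."""
    low, high = ITEM_COUNT_RANGE[difficulty.lower()]
    return low <= n_items <= high


def hard_five_allowed(confusable_pairs: list[tuple[str, str]]) -> bool:
    """A hard question may score 5 only if at least 2 confusable pairs are named."""
    return len(confusable_pairs) >= 2


print(item_count_in_range("easy", 3))       # lower bound is valid, not a deduction
print(hard_five_allowed([("i2", "i3")]))    # only one named pair
```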

**Dates-in-items flag:** Items containing explicit years/centuries inline (e.g., "in 1492",
"c. 1440 CE", "during the 15th century") reduce the task to date-reading rather than
content knowledge. Apply the date deductions in the scale above (4 when only 1–2 items carry
dates and ordering still requires content knowledge; ≤ 3 when dates make the ordering
trivially solvable), unless the specific date is itself the curriculum fact being tested.
Example: "King John signs the Magna Carta" is fine; "King John signs the Magna Carta in
1215" is penalized if 1215 makes the ordering trivial.
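
One way to surface candidate items for this flag is a simple pattern scan. This is a heuristic sketch, not an exhaustive detector: the `INLINE_DATE` pattern and `items_with_inline_dates` helper are hypothetical, and any 3–4 digit number will match, so each hit still needs a human-style check that it is really a date.

```python
import re

# Heuristic: a standalone 3-4 digit number ("1492", "c. 1440 CE") or an ordinal century.
INLINE_DATE = re.compile(
    r"\b\d{3,4}\b|\b\d{1,2}(?:st|nd|rd|th)[ -]century\b",
    re.IGNORECASE,
)


def items_with_inline_dates(items):
    """Return the ids of items whose content appears to contain an explicit date."""
    return [item["id"] for item in items if INLINE_DATE.search(item["content"])]


items = [
    {"id": "i1", "content": "King John signs the Magna Carta"},
    {"id": "i2", "content": "King John signs the Magna Carta in 1215"},
    {"id": "i3", "content": "Printing spreads during the 15th century"},
    {"id": "i4", "content": "Luther posts the 95 Theses"},
]
print(items_with_inline_dates(items))
```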

**DISTINCTION from item_granularity:**
`difficulty_calibration` = "Is the COUNT right, and are items similar enough to each other to require careful reasoning?"
`item_granularity` = "Is each item individually the right size?"

---

#### 5. stem_clarity (1–5)
Does the question stem make the ordering criterion explicit? Do items avoid forward references
that reveal their relative position?

- **5** — Stem unambiguously names the ordering axis (e.g., "in chronological order", "from
  first step to last", "from cause to final effect"). Every item is self-contained.
- **4** — Stem clear but could be slightly more explicit; no forward references in items.
- **3** — Stem names an axis but vaguely (e.g., "put these in order" without specifying
  chronological/causal/procedural); OR one item contains a mild forward reference.
- **2** — Stem missing the ordering criterion entirely. OR two or more items contain forward references.
- **1** — Stem missing criterion AND multiple items leak their position through forward references.

**Forward reference examples (explicit position leaks like these are not "mild"; automatically score 1–2):**
- "Following the earlier stage, ..." — reveals that the item is not first
- "After the previous step was completed, ..." — reveals that another step precedes it
- "As a consequence of the above, ..." — reveals relative position
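
The example phrasings above can be turned into a rough scan that returns the exact matched phrase, which is convenient because any forward-reference claim has to quote the offending text. The pattern covers only wording similar to the three examples and is an assumption, not a complete list; `FORWARD_REFERENCE` and `find_forward_references` are hypothetical names.

```python
import re

# Matches phrasings like "Following the earlier ...", "After the previous ...",
# "As a consequence of the above ...". Deliberately narrow; extend as needed.
FORWARD_REFERENCE = re.compile(
    r"\b(?:following|after|as a (?:consequence|result) of)\s+the\s+"
    r"(?:earlier|previous|preceding|above)\b",
    re.IGNORECASE,
)


def find_forward_references(items):
    """Return (item_id, exact_phrase) for each item that leaks its position."""
    hits = []
    for item in items:
        match = FORWARD_REFERENCE.search(item["content"])
        if match:
            hits.append((item["id"], match.group(0)))
    return hits


items = [
    {"id": "i1", "content": "Following the earlier stage, the cell divides"},
    {"id": "i2", "content": "Luther posts the 95 Theses"},
    {"id": "i3", "content": "As a consequence of the above, trade expands"},
]
print(find_forward_references(items))
```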

**Stem-criterion match check:** Does the criterion stated in the stem match the actual
ordering logic in `correct_order`? If the stem says "chronological order" but items are
ordered causally in a way that contradicts strict calendar dates → stem_clarity = 3, note
the mismatch.

---

### Anti-Fabrication Rule — All Metrics

Before claiming ANY issue exists across ANY metric, you MUST quote or cite the specific evidence:
- Claiming "stem is missing ordering criterion" → quote the EXACT stem text showing what is absent.
- Claiming "item has a forward reference" → quote the EXACT item phrase that is a forward reference.
- Claiming "item could be placed elsewhere" → state WHAT specific knowledge would allow a student to defensibly place it differently.
- Claiming "item is too coarse or too fine" → state the specific scope mismatch with the difficulty level.
- Claiming "dates in the sequence are wrong" → state the actual dates and what the correct dates are.

**If you cannot provide the specific quote or evidence → the issue does not exist. Do NOT cite issues you cannot support with the actual text.**

---

### Required Evaluation Procedure

**Step 1 — Per-item positional audit:**
For EACH item in `correct_order`, write:
```
[iN] <era/date> | event_type=specific_event/bounded_era/ongoing_state/process | could_be_elsewhere=yes/no
```
- `era/date`: approximate period this item represents
- `event_type`: use the four-type classification from orderability above
  - If ongoing_state or process → immediately note: "orderability ≤ 3 because item iN is [type]"
- `could_be_elsewhere`: yes if a student who studied the curriculum could defensibly swap this item to a different position
  - 1 item yes → orderability = 4
  - 2+ items yes → orderability = 3
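
For downstream checking, an audit line in the format above could be parsed roughly as follows. This sketch assumes item IDs look like `i1`, `i2`, and so on, and that each line uses exactly the two `|` separators shown; `AUDIT_LINE` and `parse_audit_line` are hypothetical names.

```python
import re

AUDIT_LINE = re.compile(
    r"^\[(?P<item_id>i\d+)\]\s*(?P<era>.*?)\s*\|\s*event_type=(?P<event_type>\w+)"
    r"\s*\|\s*could_be_elsewhere=(?P<elsewhere>yes|no)\s*$"
)

EVENT_TYPES = {"specific_event", "bounded_era", "ongoing_state", "process"}


def parse_audit_line(line: str) -> dict:
    """Parse one Step 1 audit line into a dict; raise ValueError if malformed."""
    match = AUDIT_LINE.match(line.strip())
    if match is None or match["event_type"] not in EVENT_TYPES:
        raise ValueError(f"malformed audit line: {line!r}")
    return {
        "item_id": match["item_id"],
        "era": match["era"],
        "event_type": match["event_type"],
        "could_be_elsewhere": match["elsewhere"] == "yes",
    }


print(parse_audit_line("[i2] 1517 | event_type=specific_event | could_be_elsewhere=no"))
```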

**Step 1a — Same-source check:**
Scan all item pairs: do any two items describe content FROM the same document, treaty, council session, or event that occurred at the same time? If yes → "orderability ≤ 3: items iX and iY both derive from [same source] — they are unorderable."

**Step 2 — Explicit metric scoring with evidence:**
- `orderability=N` because: [cite which items have issues; reference Step 1 findings]
- `sequence_accuracy=N` because: [state chronological/logical evidence; if ongoing-state present, evaluate only specific_event items]
- `item_granularity=N` because: [state whether each individual item is the right scope for difficulty]
- `difficulty_calibration=N` because: [state item count; for hard=5, name ≥2 confusable pairs with reason]
- `stem_clarity=N` because: [quote exact stem; quote any forward references; check criterion match]

**Step 3 — Map to binary output metrics:**

Map each sequence-specific metric to the corresponding output metric:
- `sequence_accuracy` ≤ 2 → set `factual_accuracy = 0.0`; otherwise `factual_accuracy = 1.0`
- `orderability` ≤ 2 → set `educational_accuracy = 0.0`; otherwise `educational_accuracy = 1.0`
- `item_granularity` ≤ 3 → set `difficulty_alignment = 0.0`; otherwise `difficulty_alignment = 1.0`
- `difficulty_calibration` ≤ 3 → set `curriculum_alignment = 0.0`; otherwise `curriculum_alignment = 1.0`
- `stem_clarity` ≤ 3 → set `clarity_precision = 0.0`; otherwise `clarity_precision = 1.0`

All other standard metrics not applicable to sequence questions should default to `1.0`.
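
The mapping above can be written as a small lookup table. Note that the failing threshold differs by metric (≤ 2 for the first two, ≤ 3 for the rest). The names `BINARY_MAP` and `to_binary_metrics` are hypothetical, and the sketch covers only the five mapped metrics, not the other standard metrics that default to `1.0`.

```python
# (sequence metric, output metric, highest 1-5 score that still maps to 0.0)
BINARY_MAP = [
    ("sequence_accuracy", "factual_accuracy", 2),
    ("orderability", "educational_accuracy", 2),
    ("item_granularity", "difficulty_alignment", 3),
    ("difficulty_calibration", "curriculum_alignment", 3),
    ("stem_clarity", "clarity_precision", 3),
]


def to_binary_metrics(scores: dict) -> dict:
    """Convert the five 1-5 sequence scores into the binary output metrics."""
    return {
        output: 0.0 if scores[source] <= fail_at else 1.0
        for source, output, fail_at in BINARY_MAP
    }


scores = {
    "sequence_accuracy": 5,
    "orderability": 3,
    "item_granularity": 4,
    "difficulty_calibration": 3,
    "stem_clarity": 5,
}
print(to_binary_metrics(scores))
```

In this example an orderability of 3 still maps to `1.0`, while a difficulty_calibration of 3 maps to `0.0`, which shows why the per-metric threshold matters.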

**Step 4 — Overall:**
The overall score and rating are computed **automatically** by the system from your binary metric scores. Set `overall.score` to `0.5` (placeholder) and provide your reasoning summarizing the strengths and weaknesses.

**A metric scoring 4 means a real, articulable issue was found.** Do NOT score 4 "to be safe."
If you cannot name the specific problem with a metric → score 5.

**Items explicitly named in the substandard are curriculum-valid; do not flag as "beyond grade level."**

**Do NOT apply closed-ended checks to sequence questions:**
- distractor_quality (no distractors exist)
- answer_option parallelism
- option label checks
