{% extends "base.html" %} {% block title %}{{ validation_name }}{% endblock %} {% block content %}

← Back to validation

{{ validation_name }}

{% if paper_title %}

{{ paper_title }}

{% endif %}

Comparing {{ alignment.n_actual }} human comment{{ 's' if alignment.n_actual != 1 }} against {{ alignment.n_ai }} AI comment{{ 's' if alignment.n_ai != 1 }}.

Download report (.md) Download calibration delta (.json)
{% if llm_provider and llm_model %}

LLM: {{ llm_provider }} / {{ llm_model }}  ·  Base URL: {{ llm_base_url }} {% if launched_at %}  ·  Launched: {{ launched_at }} {% endif %} {% if ended_at %}  ·  Ended: {{ ended_at }} {% endif %}

{% endif %}
Intermediate artifacts — for debugging why pairs got the verdicts they did

These files are written to the run directory on every validation. They're intermediate steps — feel free to ignore unless something looks wrong.

{% if run_files %}
{# One macro for the three run-files tables so column changes land in one place. #}
{% macro run_files_table(files) -%}
<table>
  <thead>
    <tr><th>File</th><th>Size</th><th>Description</th></tr>
  </thead>
  <tbody>
    {% for f in files %}
    <tr>
      <td>{{ f.abs_path }}</td>
      <td>{{ f.size }}</td>
      <td>{{ f.description }}</td>
    </tr>
    {% endfor %}
  </tbody>
</table>
{%- endmacro %}

Source files on disk

Everything produced by this validation lives in a single directory. Paths below are absolute so you can copy-paste straight into a shell.

Run directory: {{ run_files.run_dir }}

{% if run_files.inputs %}

Inputs

{{ run_files_table(run_files.inputs) }} {% endif %} {% if run_files.outputs %}

Outputs

{{ run_files_table(run_files.outputs) }} {% endif %} {% if run_files.internal %}
Internal artifacts ({{ run_files.internal|length }})
{{ run_files_table(run_files.internal) }}
{% endif %}
{% endif %}
{{ metrics.recall }} Recall fraction of human comments the AI caught
{{ metrics.precision }} Precision fraction of AI comments humans also raised
{{ metrics.f1 }} F1 harmonic mean of precision and recall
{{ metrics.severity_weighted_recall }} Weighted recall major misses penalized more
{{ metrics.n_hits }} Hits
{{ metrics.n_misses }} Misses
{{ metrics.n_false_alarms }} False alarms

Per-persona performance

Emitted — total AI comments this reviewer produced. Caught — distinct human comments this reviewer helped match (as primary or supporting); can exceed Emitted when one AI comment matches several human comments. False alarms — AI comments that matched no human comment (sim < 0.35). Noise ratio — False alarms ÷ Emitted.

<table>
  <thead>
    <tr><th>Reviewer ID</th><th>Persona</th><th>Emitted</th><th>Caught</th><th>False alarms</th><th>Noise ratio</th></tr>
  </thead>
  <tbody>
    {% for ps in calibration.persona_stats | sort(attribute='reviewer_id') %}
    <tr>
      <td>{{ ps.reviewer_id or '—' }}</td>
      <td>{{ ps.persona }}</td>
      <td>{{ ps.comments_emitted }}</td>
      <td>{{ ps.actual_comments_helped_catch }}</td>
      <td>{{ ps.false_alarms }}</td>
      <td>{{ ps.noise_ratio if ps.noise_ratio is not none else '—' }}</td>
    </tr>
    {% endfor %}
  </tbody>
</table>
{% if calibration.sub_rating_attributions %}

Sub-rating attributions

<table>
  <thead>
    <tr><th>Reviewer</th><th>Sub-rating</th><th>Value</th><th>Expected persona</th><th>Verdict</th></tr>
  </thead>
  <tbody>
    {% for a in calibration.sub_rating_attributions %}
    <tr>
      <td>{{ a.reviewer_label }}</td>
      <td>{{ a.sub_rating }}</td>
      <td>{{ a.value }}/{{ a.scale }}</td>
      <td>{{ a.expected_persona or '—' }}</td>
      <td>{{ a.failure_mode.replace('_', ' ') }}</td>
    </tr>
    {% endfor %}
  </tbody>
</table>
{% endif %}

Hits, misses, and false alarms

Hits ({{ alignment.hits|length }}) — human comments the AI caught
{# Render one AI comment (primary or supporting) inside a hit's drill-down.
   Kept as a local macro so primary and supporting blocks stay visually identical. #}
{% macro ai_match_card(ai, sim) -%}
{{ ai.reviewer_id or '—' }} {% if ai.persona %} / {{ ai.persona }}{% endif %} {% if ai.comment_id %}  ({{ ai.comment_id }}) {% endif %} {% if ai.severity %}  {{ ai.severity|upper }} {% endif %}  sim {{ "%.2f"|format(sim) }}
{% if ai.summary %}

{{ ai.summary }}

{% endif %} {% if ai.description %}

{{ ai.description }}

{% endif %}
{%- endmacro %} {% for h in alignment.hits %}
{{ h.actual.severity|upper }} {{ h.actual.text }}
Matched by {{ h.primary_ai.reviewer_id }} / {{ h.primary_ai.persona }} (sim {{ "%.2f"|format(h.primary_sim) }}) {% if h.supporting_ai %}  ·  plus {{ h.supporting_ai|length }} more{% endif %}  — click to see AI comment text

Primary match:

{{ ai_match_card(h.primary_ai, h.primary_sim) }} {% if h.supporting_ai %}

Also matched by {{ h.supporting_ai|length }} additional AI comment{{ 's' if h.supporting_ai|length != 1 }}:

{% for sup in h.supporting_ai %} {{ ai_match_card(sup.ai, sup.sim) }} {% endfor %} {% endif %}
{% endfor %}
Misses ({{ alignment.misses|length }}) — human comments the AI failed to raise
{% for m in alignment.misses %}
{{ m.severity|upper }} {{ m.text }}

Category: {{ m.category or 'uncategorized' }} · best AI similarity {{ "%.2f"|format(m.best_sim) }}

{% endfor %}
False alarms ({{ alignment.false_alarms|length }}) — AI comments no human raised
{% for fa in alignment.false_alarms %}
{{ fa.reviewer_id }} / {{ fa.persona }} {{ fa.summary }}
{% endfor %}
{% if calibration.suggestions %}

Calibration suggestions

{% for s in calibration.suggestions %}
{{ s.type }} {{ s.target_persona or s.category or (s.missing_personas_in_selection|join(', ') if s.missing_personas_in_selection else '') }}

{{ s.rationale }}

{% if s.prompt_patch_hint %}

Hint: {{ s.prompt_patch_hint }}

{% endif %} {% if s.fix_hint %}

Fix: {{ s.fix_hint }}

{% endif %} {% if s.example_misses %}
Example misses ({{ s.example_misses|length }})
{% for m in s.example_misses %}
  • {{ m }}
{% endfor %}
{% endif %}
{% endfor %}
{% endif %} {% endblock %}