| Metric | {{ config_a }} | {{ config_b }} | % Change | |
|---|---|---|---|---|
| {{ d.name }} | {% if d.a_na %}N/A{% else %}{{ "%.4f"|format(d.value_a) }}{% endif %} | {% if d.b_na %}N/A{% else %}{{ "%.4f"|format(d.value_b) }}{% endif %} | {% if d.a_na and d.b_na %}- | {% else %}{{ "%+.1f"|format(d.pct_change) }}% | {% endif %}
Each point is a query. Points above the diagonal improved in {{ config_b }}; points below regressed.
{% if scatter_plot.legend %}Average relevance score (0–3) at each result position. Lines should slope downward if ranking places the most relevant results first.
| Check | Failed ({{ config_a }}) | Failed ({{ config_b }}) | Delta |
|---|---|---|---|
| {{ c.display_name }} {% if check_descriptions and check_descriptions.get(c.name) %} ? {% endif %} | {{ c.failed_a }} | {{ c.failed_b }} | {{ "%+d"|format(c.delta) }} |
| Original Query | Corrected Query | Verdict | Reasoning |
|---|---|---|---|
| {{ c.original_query }} | {{ c.corrected_query }} | {{ c.verdict }} | {{ c.reasoning }} |
| Query | {{ config_a }} | {{ config_b }} | % Change |
|---|---|---|---|
| {% if w.has_details %}{{ w.query }}{% else %}{{ w.query }}{% endif %} | {{ "%.4f"|format(w.value_a) }} | {{ "%.4f"|format(w.value_b) }} | {{ "%+.1f"|format(w.pct_change) }}% |
|
{% for label, results in [(config_a, w.results_a), (config_b, w.results_b)] %}
{% if results %}
{{ label }}
| |||||||||||
| Query | {{ config_a }} | {{ config_b }} | % Change |
|---|---|---|---|
| {% if l.has_details %}{{ l.query }}{% else %}{{ l.query }}{% endif %} | {{ "%.4f"|format(l.value_a) }} | {{ "%.4f"|format(l.value_b) }} | {{ "%+.1f"|format(l.pct_change) }}% |
|
{% for label, results in [(config_a, l.results_a), (config_b, l.results_b)] %}
{% if results %}
{{ label }}
| |||||||||||
| {{ query_type_display_names.get(qt, qt) }} {% if query_type_descriptions and query_type_descriptions.get(qt) %} ? {% endif %} | |||
|---|---|---|---|
| Metric | {{ config_a }} | {{ config_b }} | % Change |
| {{ m.name }} | {% if t.value_a is not none %}{{ "%.4f"|format(t.value_a) }}{% else %}-{% endif %} | {% if t.value_b is not none %}{{ "%.4f"|format(t.value_b) }}{% else %}-{% endif %} | {% if t.pct_change is not none %}{{ "%+.1f"|format(t.pct_change) }}%{% else %}-{% endif %} |
| Query | Product | Detail |
|---|---|---|
| {{ c.query }} | {{ c.product_id }} | {{ c.detail }} |
| Metric | Value |
|---|---|
| {{ metric_display_names[m.metric_name] if metric_display_names and metric_display_names.get(m.metric_name) else m.metric_name }} {% if metric_descriptions and metric_descriptions.get(m.metric_name) %} ? {% endif %} | {% if m.query_count is not none and m.query_count == 0 %} N/A (no queries with attribute constraints) {% elif m.query_count is not none and m.total_queries is not none and m.query_count < m.total_queries %} {{ "%.4f"|format(m.value) }} (n={{ m.query_count }} of {{ m.total_queries }} queries) {% else %} {{ "%.4f"|format(m.value) }} {% endif %} |
Bars should decrease from left to right if your ranking places the most relevant results first.
Distribution of per-query NDCG@10 scores. Red bins indicate poor relevance; green bins indicate strong relevance.
| Metric | {% for qt in query_types %}{{ query_type_display_names.get(qt, qt) }} {% if query_type_descriptions and query_type_descriptions.get(qt) %} ? {% endif %} | {% endfor %}
|---|{% for qt in query_types %}---|{% endfor %}
| {{ metric_display_names[m.metric_name] if metric_display_names and metric_display_names.get(m.metric_name) else m.metric_name }} | {% for qt in query_types %}{% if m.by_query_type.get(qt) is not none %}{{ "%.4f"|format(m.by_query_type[qt]) }}{% else %}-{% endif %} | {% endfor %}
| Check | Passed | Failed |
|---|---|---|
{% if has_failures %}
{% set cf = check_failures[name] %}
▶ {{ counts.display_name }} {% if check_descriptions and check_descriptions.get(name) %} ? {% endif %}
{% for item in cf["entries"] %}
{{ item.query }}
{% if item.product_id %}{{ item.product_id }}{% endif %}
{{ item.detail }}
{{ item.severity }}
{% endfor %}
{% if cf["total"] > cf["entries"]|length %}
…and {{ cf["total"] - cf["entries"]|length }} more
{% endif %}
|
{{ counts.passed_display }} | {{ counts.failed }} |
| Original Query | Corrected Query | Verdict | Reasoning |
|---|---|---|---|
| {{ c.original_query }} | {{ c.corrected_query }} | {{ c.verdict }} | {{ c.reasoning }} |
| Query | Query Type | NDCG@10 | Failed Checks |
|---|---|---|---|
| {% if wq.anchor_id %}{{ wq.query }}{% else %}{{ wq.query }}{% endif %} | {{ wq.query_type }} | {{ "%.4f"|format(wq.ndcg) }} | {{ wq.failed_checks }} |
Click a query to expand individual product scores and LLM reasoning. Queries are sorted by average score, worst first.
| # | Product | Score | Attributes | Reasoning |
|---|---|---|---|---|
| {{ j.product.position + 1 }} |
{{ j.product.title }} {{ j.product.product_id }} {% if j.product.category or j.product.price > 0 or j.product.in_stock is false %}
{% if j.product.category %}{{ j.product.category }}{% endif %}
{% if j.product.price > 0 %}${{ "%.2f"|format(j.product.price) }}{% endif %}
{% if j.product.in_stock is false %}Out of stock{% endif %}
{% endif %}
|
{{ j.score }}/3 | {{ j.attribute_verdict }} |
{{ j.reasoning }}
{% if j.metadata.get('failed_checks') %}
{% for fc in j.metadata.failed_checks %}
{{ fc.check_name }}: {{ fc.detail }}{% if not loop.last %}; {% endif %}
{% endfor %}
{% endif %}
|