{% if not engines %}

No results yet

To generate results, first run: uv run assay benchmark --engine pdftoolbox

{% endif %}

Engine summary

Engine TP FP FN TN Accuracy Precision Recall F1 Runtime
{% for engine in engines %}{% set s = engine_summaries[engine] %}{% set r = (reports | selectattr("engine", "equalto", engine) | list)[0] %}
{{ engine }} {{ s.tp }} {{ s.fp }} {{ s.fn }} {{ s.tn }} {{ "%.1f%%" | format(s.accuracy * 100) }} {{ "%.1f%%" | format(s.precision * 100) }} {{ "%.1f%%" | format(s.recall * 100) }} {{ "%.2f" | format(s.f1) }} {{ "%.1fs" | format(r.aggregate_runtime_ms / 1000.0) }}
{% endfor %}

Per-rule breakdown

TP / FP / FN per (rule × engine), summed across variants.

Rule{% for engine in engines %} {{ engine }}{% endfor %}
{% for row in rule_breakdown %}
{{ row.rule_id }}{% for engine in engines %}{% set cell = row.engines.get(engine, {"tp": 0, "fp": 0, "fn": 0}) %} {{ cell.tp }} / {{ cell.fp }} / {{ cell.fn }}{% endfor %}
{% endfor %}
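
The per-rule rows above could be assembled along the following lines. This is a minimal sketch, not the tool's actual aggregation code: the function name, the (rule_id, engine, variant, outcome) tuple layout, and the example rule IDs are assumptions made for illustration. Only the row shape (rule_id plus an engines mapping of tp/fp/fn counts) is taken from the template.

```python
from collections import defaultdict


def build_rule_breakdown(results):
    """Sum TP/FP/FN per (rule, engine) across all variants.

    `results` is a hypothetical iterable of
    (rule_id, engine, variant, outcome) tuples, where outcome is
    one of "tp", "fp", "fn" (anything else is ignored here).
    """
    cells = defaultdict(lambda: dict(tp=0, fp=0, fn=0))
    for rule_id, engine, _variant, outcome in results:
        cell = cells[(rule_id, engine)]
        if outcome in cell:
            cell[outcome] += 1  # variants collapse into one cell

    # Regroup into one row per rule, keyed by engine inside the row.
    rows = defaultdict(dict)
    for (rule_id, engine), counts in cells.items():
        rows[rule_id][engine] = counts

    return [
        dict(rule_id=rule_id, engines=engines)
        for rule_id, engines in sorted(rows.items())
    ]
```

Jinja resolves row.rule_id and row.engines on plain dicts, so rows of this shape work with the loop above; an engine with no entry for a rule falls back to the zero cell supplied by row.engines.get.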

Reading this

TP (true positive)
Negative test for rule R; engine flagged R. Higher is better.
FP (false positive)
Positive baseline; engine flagged R anyway. Lower is better, since spurious flags lead to false-positive fatigue.
FN (false negative)
Negative test for rule R; engine missed it. Lower is better.
TN (true negative)
Positive baseline; engine correctly silent. Higher is better.
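
For reference, the summary columns follow from these four counts under the standard definitions. The sketch below assumes those textbook formulas and reports 0.0 when a denominator is zero; the function name is hypothetical, and the tool's own handling of empty denominators may differ.

```python
def summarize(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive accuracy, precision, recall and F1 from the four counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything the engine flagged, how much was correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything it should have flagged, how much it caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return dict(accuracy=accuracy, precision=precision, recall=recall, f1=f1)
```

The values are fractions between 0 and 1, which matches how the summary table multiplies accuracy, precision, and recall by 100 before printing them as percentages.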

Stub negatives (R0009-R0013, R0016-R0019, R0021-R0023, R0028-R0030, R0033, R0036, R1002) are excluded from scoring in v0.1.0; their full implementations land in v0.1.1.
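
To check whether a given rule falls in the excluded set, the ranges above can be expanded into explicit IDs. This is an illustrative sketch only: it assumes the ranges are inclusive and that IDs are "R" followed by four zero-padded digits, and the helper names are hypothetical rather than part of the tool.

```python
import re

STUB_SPEC = (
    "R0009-R0013, R0016-R0019, R0021-R0023, "
    "R0028-R0030, R0033, R0036, R1002"
)


def expand_rule_ids(spec: str) -> set:
    """Expand a comma-separated list of rule IDs and inclusive ranges."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        match = re.fullmatch(r"R(\d{4})(?:-R(\d{4}))?", part)
        if match is None:
            raise ValueError(f"unrecognized rule id or range: {part!r}")
        start = int(match.group(1))
        # A single ID is treated as a range of length one.
        end = int(match.group(2) or match.group(1))
        ids.update(f"R{n:04d}" for n in range(start, end + 1))
    return ids


STUB_RULES = expand_rule_ids(STUB_SPEC)


def is_scored(rule_id: str) -> bool:
    """Return True when the rule is not in the excluded stub set."""
    return rule_id not in STUB_RULES
```

Under these assumptions the list expands to 18 rule IDs.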