No results yet
Run `uv run assay benchmark --engine pdftoolbox` first.
Engine summary
| Engine | TP | FP | FN | TN | Accuracy | Precision | Recall | F1 | Runtime |
|---|---|---|---|---|---|---|---|---|---|
| {{ engine }} | {{ s.tp }} | {{ s.fp }} | {{ s.fn }} | {{ s.tn }} | {{ "%.1f%%" | format(s.accuracy * 100) }} | {{ "%.1f%%" | format(s.precision * 100) }} | {{ "%.1f%%" | format(s.recall * 100) }} | {{ "%.2f" | format(s.f1) }} | {{ "%.1fs" | format(r.aggregate_runtime_ms / 1000.0) }} |
Per-rule breakdown
TP / FP / FN per (rule × engine), summed across variants.
| Rule | {% for engine in engines %}{{ engine }} | {% endfor %}
|---|{% for engine in engines %}---|{% endfor %}
| {{ row.rule_id }} | {% for engine in engines %} {% set cell = row.engines.get(engine, {"tp": 0, "fp": 0, "fn": 0}) %}{{ cell.tp }} / {{ cell.fp }} / {{ cell.fn }} | {% endfor %}
Reading this
- TP (true positive): negative test for rule R (input that should trigger R); engine flagged R. Higher is better.
- FP (false positive): positive baseline (input that should not trigger R); engine flagged R anyway. Lower is better, since false positives cause alert fatigue.
- FN (false negative): negative test for rule R; engine missed it. Lower is better.
- TN (true negative): positive baseline; engine correctly silent. Higher is better.
Stub negatives (R0009-R0013, R0016-R0019, R0021-R0023, R0028-R0030, R0033, R0036, R1002) are excluded from scoring in v0.1.0; full implementation lands in v0.1.1.
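
For readers who want to check the summary columns by hand, the sketch below shows how Accuracy, Precision, Recall and F1 follow from the four counts using the standard definitions. The `Counts` container, the `summarize` helper and the zero-denominator convention are illustrative assumptions, not necessarily how assay computes them, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts for one engine (hypothetical container)."""
    tp: int
    fp: int
    fn: int
    tn: int


def _ratio(num: float, den: float) -> float:
    # Return 0.0 for an empty denominator instead of raising.
    # This convention is an assumption made for the sketch.
    return num / den if den else 0.0


def summarize(c: Counts) -> dict[str, float]:
    """Derive the summary metrics from TP / FP / FN / TN."""
    precision = _ratio(c.tp, c.tp + c.fp)   # flagged cases that were real
    recall = _ratio(c.tp, c.tp + c.fn)      # real cases that were flagged
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn),
        "precision": precision,
        "recall": recall,
        # Harmonic mean of precision and recall.
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    s = summarize(Counts(tp=8, fp=1, fn=2, tn=9))
    print("accuracy=%.1f%% f1=%.2f" % (s["accuracy"] * 100, s["f1"]))
```

With these example counts, accuracy is 17/20 and F1 equals 2·TP / (2·TP + FP + FN) = 16/19, which is a quick way to sanity-check a row of the engine summary.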