Top-1 and top-3 hit rate across the golden set
Per-feature precision at top-1, sorted worst-first
Stacked by kind — ideally every feature has something across multiple depths
Corpus-wide counts
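
The hit-rate and per-feature precision figures named above can be sketched as follows. This is a minimal illustration, not the dashboard's actual implementation. It assumes each golden-set entry has one expected item and a ranked result list, and it treats precision at top-1 as the share of a feature's queries whose first result is the expected one. The field names (`feature`, `expected`, `ranked`) and the sample data are hypothetical.

```python
from collections import defaultdict


def hit_rate_at_k(results, k):
    """Fraction of golden queries whose expected item appears in the top k."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r["expected"] in r["ranked"][:k])
    return hits / len(results)


def precision_at_1_by_feature(results):
    """Top-1 precision per feature, sorted worst-first (lowest score first)."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r["feature"]].append(r)
    scores = {f: hit_rate_at_k(rs, 1) for f, rs in grouped.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])


# Hypothetical golden set: field names and values are made up for illustration.
golden = [
    {"feature": "search", "expected": "a", "ranked": ["a", "b", "c"]},
    {"feature": "search", "expected": "b", "ranked": ["c", "b", "a"]},
    {"feature": "export", "expected": "x", "ranked": ["y", "z", "w"]},
    {"feature": "billing", "expected": "m", "ranked": ["m", "n", "o"]},
]

top1 = hit_rate_at_k(golden, 1)
top3 = hit_rate_at_k(golden, 3)
worst_first = precision_at_1_by_feature(golden)
```

Sorting ascending puts the weakest features at the top of the list, which matches the worst-first ordering in the caption.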