Ecommerce search evaluation powered by LLM-as-a-Judge. Domain-aware scoring, deterministic checks, and full IR metrics.
For search teams tired of eyeballing result pages.
Features
Every search result scored for relevance using structured, domain-aware rubrics tailored to your industry vertical.
Built-in domain expertise for automotive, electronics, fashion, groceries, medical, and more — with per-query overlay classification at zero extra cost.
Catch price outliers, missing titles, duplicate SKUs, and ranking mistakes with fast, rule-based checks that run before LLM scoring.
NDCG, MRR, MAP, Precision@K, and attribute-match metrics with rich HTML reports and A/B comparison views.
Sample output
| Metric | Value |
|---|---|
| NDCG@5 | 0.9829 |
| NDCG@10 | 0.9829 |
| MRR | 0.9800 |
| MAP | 0.9751 |
Bars should decrease top to bottom if ranking is correct.
| Original | Corrected | Verdict |
|---|---|---|
| intrior paint | interior paint | appropriate |
| led lighrs | red lights | inappropriate |
Veritail catches when autocorrect fixes a typo but changes the product.
| Query | Type | NDCG@10 |
|---|---|---|
| self drilling drywall anchors | long-tail | 0.7829 |
| pex crimp tool | broad | 0.8554 |
| 1/2 inch pex tubing 100ft red | long-tail | 0.9697 |
| Metric | Value |
|---|---|
| Avg Relevance | 2.94/3 |
| Avg Diversity | 2.50/3 |
| Total Flagged | 4 |
| Check | Passed | Failed |
|---|---|---|
| Near-duplicate products | 103 | 22 |
| Out-of-stock prominence | 124 | 1 |
| Price outlier | 108 | 17 |
Exact specification match. Brand, product type, and 2x10 dimensional rating all satisfied. In stock at position 1.
Query specifies 2x10 but this is rated for 2x8. A 2x8 hanger cannot safely support a 2x10 joist — dimensional mismatch makes this a structural failure risk.
Five of six suggestions target deck materials, but "decaf coffee" is completely off-domain — a grocery item with no relevance to home improvement, degrading the suggestion set.
How it works
Import your search queries. Veritail classifies each query and assigns domain-specific overlays automatically.
Your search adapter returns results. Deterministic checks flag defects, then the LLM judge scores each result on a 0–3 rubric.
Get IR metrics, per-query breakdowns, and rich HTML reports. Compare two search configurations side by side.
Quick start
$ pip install veritail # With Anthropic / Gemini support: $ pip install "veritail[cloud]"
# my_adapter.py from veritail import SearchAdapter, SearchResult class MyAdapter(SearchAdapter): def search(self, query: str, **kwargs) -> list[SearchResult]: # Call your search engine here return [ SearchResult( title="Product Name", price=29.99, url="https://example.com/product", ) ]
$ veritail --adapter my_adapter.MyAdapter \ --queries queries.csv \ --vertical electronics