{% extends "base.html" %} {% from "components/custom_dropdown.html" import render_dropdown %} {% from "components/help_macros.html" import tooltip, help_panel, help_step, help_tip %} {% set active_page = 'benchmark' %} {% block title %}Benchmark Configuration - Deep Research System{% endblock %} {% block extra_head %} {% endblock %} {% block content %}

Benchmark Configuration

Test and optimize your search configurations

View Past Results
{% call help_panel('benchmark-guidelines', 'Benchmark Guidelines', icon='info-circle', collapsed=true) %}

Purpose: Benchmarks help you evaluate whether your configuration works well; they are not intended for research papers or production use.

Responsible Usage: Use reasonable example counts to avoid overwhelming search engines. The default of 50 examples is a good size for testing.

Requirements: Benchmarks require an evaluation model for grading. Configure it under Evaluation Model Settings below. Default: OpenRouter with Claude 3.7 Sonnet.

Search Engine Recommendations

{{ help_tip("Shared Resources: When using SearXNG, reduce iterations and questions per iteration to minimize load.") }} {% endcall %}
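{#
  Illustrative only: against a shared SearXNG instance, values in the range
  of 1-2 iterations with 2-3 questions per iteration keep request volume
  low. These are suggestions, not project defaults; the values actually
  used come from the database configuration shown further down this page.
#}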
Give your benchmark run a descriptive name
Dataset Selection

SimpleQA

Fact-based questions with clear answers

Recommended: 50 examples provides a good balance for configuration testing

xbench-DeepSearch

100 deep research questions requiring multi-step information seeking

New: Advanced benchmark for testing deep research capabilities
100 challenging research questions available

BrowseComp

Complex browsing and comparison tasks

Poor Performance Warning: We currently achieve close to 0% accuracy on BrowseComp.
For testing only: capped at 20 examples, enough to satisfy curiosity about what this benchmark involves given the current near-zero accuracy.
Current Configuration

Active Database Settings

The benchmark will use all settings from your database configuration

Provider
Loading...
Model
Loading...
Search Tool
Loading...
Iterations
Loading...
Questions/Iter
Loading...
Strategy
Loading...
To change any settings, go to the Settings Dashboard
Evaluation Model Settings

Benchmark Evaluation Configuration

Configure the model used to grade benchmark results

Provider for the evaluation model
{{ render_dropdown(
    input_id="evaluation_model",
    dropdown_id="evaluation-model-dropdown",
    placeholder="Enter or select evaluation model",
    label="Evaluation Model",
    help_text="Model to grade benchmark results",
    allow_custom=true,
    show_refresh=true,
    refresh_aria_label="Refresh evaluation model list",
    data_initial_value=eval_settings.evaluation_model
) }}
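{#
  render_dropdown is assumed to emit a text input plus a suggestion list:
  allow_custom=true lets users type a model id not present in the fetched
  list, and data_initial_value pre-fills the field from eval_settings,
  which the route handler is expected to pass into this template. A
  provider field could follow the same pattern, e.g. (illustrative ids,
  not part of this page):
    {{ render_dropdown(
        input_id="evaluation_provider",
        dropdown_id="evaluation-provider-dropdown",
        placeholder="Select provider",
        label="Evaluation Provider",
        allow_custom=true
    ) }}
#}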
API endpoint for evaluation model
Temperature: 0 is recommended for consistent evaluation
Evaluation Model Selection: For accurate benchmark grading, use flagship models from major providers, such as the Claude Sonnet series or GPT-4-class models. Local models and smaller cloud models may produce inconsistent evaluations, skewing benchmark accuracy scores. That said, preliminary tests suggest local models can be adequate when the highest grading standards are not required.
50
Total Examples
Estimated time: 40-60 minutes
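{#
  Sanity check on the estimate shown above: 40-60 minutes for 50 examples
  works out to roughly 48-72 seconds per example. This assumes examples
  run one at a time; actual duration depends on the configured model,
  search tool, and iteration settings.
#}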
Current Benchmark:
0%
Status:
Initializing
Current Task:
Starting benchmark...
--%
Overall Accuracy
--
--
Est. Time Left
--
0
Completed
--
Avg Time/Example
SimpleQA: --% BrowseComp: --%

Current Question

No question being processed...
-- --
Waiting for benchmark to start...

All Results

No results yet...
{% endblock %}

{% block page_scripts %}
{% endblock %}