| Benchmark | Tasks | Max Steps | Multi-tab | Status |
|---|---|---|---|---|
| WebArena | 812 | 30 | ✓ | Available |
| WorkArena L1-L3 | 33-341 | 30-50 | ✗ | Available |
| VisualWebArena | 910 | 30 | ✓ | Available |
| AssistantBench | 214 | 30 | ✓ | Available |
| MiniWoB | 125 | 10 | ✗ | Available |
| OSWorld | 369 | Variable | ✗ | Available |
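For scripting against this table, the rows can be mirrored as plain data. The sketch below is illustrative only: `BenchmarkInfo`, `BENCHMARKS`, and `multi_tab_benchmarks` are hypothetical names, not part of AgentLab or BrowserGym. Ranges such as `33-341` and `Variable` are kept as text rather than guessed per level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkInfo:
    """One row of the benchmark table (hypothetical helper, not an AgentLab type)."""
    name: str
    tasks: str       # kept as text because some rows are ranges, e.g. "33-341"
    max_steps: str   # kept as text because some rows are "Variable" or ranges
    multi_tab: bool


# Values copied verbatim from the benchmark table.
BENCHMARKS = [
    BenchmarkInfo("WebArena", "812", "30", True),
    BenchmarkInfo("WorkArena L1-L3", "33-341", "30-50", False),
    BenchmarkInfo("VisualWebArena", "910", "30", True),
    BenchmarkInfo("AssistantBench", "214", "30", True),
    BenchmarkInfo("MiniWoB", "125", "10", False),
    BenchmarkInfo("OSWorld", "369", "Variable", False),
]


def multi_tab_benchmarks(rows):
    """Return the names of benchmarks marked as multi-tab, in table order."""
    return [row.name for row in rows if row.multi_tab]


if __name__ == "__main__":
    print(multi_tab_benchmarks(BENCHMARKS))
```

Keeping the rows frozen and in table order makes it easy to filter benchmarks before launching a study, for example to skip multi-tab benchmarks for an agent that only handles a single page.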
AgentLab is a framework for developing and evaluating agents on the benchmarks supported by BrowserGym. It provides building blocks for creating web agents, unified LLM APIs, and reproducibility features for rigorous research.
The framework runs large-scale agent experiments in parallel using Ray, includes several agent architectures, and maintains a unified leaderboard across benchmarks including WebArena, WorkArena, VisualWebArena, and AssistantBench.
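The parallel fan-out described above can be pictured with a small standard-library sketch. This is not AgentLab's implementation and does not use Ray: `ThreadPoolExecutor` stands in for the distributed backend, and `run_episode`, `run_parallel`, and the task name `click-button` are hypothetical placeholders chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_episode(task_name: str, seed: int) -> dict:
    """Placeholder for one agent episode; a real run would drive a browser."""
    # Hypothetical deterministic outcome so the sketch runs without a browser or LLM.
    success = (len(task_name) + seed) % 2 == 0
    return {"task": task_name, "seed": seed, "success": success}


def run_parallel(jobs, n_jobs: int = 5):
    """Run (task_name, seed) jobs on up to n_jobs workers, preserving input order."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(lambda job: run_episode(*job), jobs))


if __name__ == "__main__":
    jobs = [("click-button", seed) for seed in range(4)]
    results = run_parallel(jobs, n_jobs=2)
    succeeded = sum(r["success"] for r in results)
    print(succeeded, "of", len(results), "episodes succeeded")
```

Each job is independent (one task, one seed), which is what makes this kind of workload straightforward to spread across workers, whether threads on one machine or a Ray cluster.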
A unified environment for web agent research across multiple benchmarks. Provides standardized interfaces and evaluation metrics.
Enterprise-focused web agent benchmark with realistic workplace tasks. Multiple difficulty levels (L1, L2, L3) with high seed diversity.
Advanced benchmark with 682 compositional planning and reasoning tasks for evaluating autonomous agents in enterprise workflows.
Innovative approach for context trimming in web agents, enhancing efficiency and security.
pip install agentlab
playwright install
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(
    benchmark="miniwob",
    agent_args=[AGENT_4o_MINI],
    comment="My first study",
)

study.run(n_jobs=5)
agentlab-assistant --start_url https://www.google.com