WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks

Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste

Abstract

The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though recent LLMs seem capable of planning and reasoning given user instructions, their effectiveness in applying these capabilities for autonomous task solving remains underexplored. This is especially true in enterprise settings, where automated agents promise to have a high impact.

To fill this gap, we propose WorkArena++, a novel benchmark consisting of 682 tasks corresponding to realistic workflows routinely performed by knowledge workers. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents. Our empirical studies across state-of-the-art LLMs and vision-language models (VLMs), as well as human workers, reveal several challenges for such models to serve as useful assistants in the workplace.

In addition to the benchmark, we provide a mechanism to effortlessly generate thousands of ground-truth observation/action traces, which can be used for fine-tuning existing models. Overall, we expect this work to serve as a useful resource to help the community progress toward capable autonomous agents.

Key Features

Compositional Tasks

682 tasks designed to evaluate compositional planning and reasoning abilities, going beyond simple navigation to complex multi-step workflows.

Multi-Modal Evaluation

Comprehensive evaluation across state-of-the-art LLMs and vision-language models, with human baseline comparisons for realistic performance assessment.

Automated Trace Generation

Built-in mechanism to generate thousands of ground-truth observation/action traces for fine-tuning and training autonomous agents.

Enterprise Workflows

Realistic workflows performed by knowledge workers, focusing on planning, reasoning, retrieval, and contextual understanding capabilities.

Evaluation Capabilities

Planning & Problem-Solving

Tasks requiring multi-step planning, decomposition of complex problems, and strategic thinking to achieve enterprise-level objectives across ServiceNow platform workflows.

  • Multi-step workflow planning
  • Problem decomposition
  • Strategic decision making
  • Resource optimization

Logical & Arithmetic Reasoning

Evaluation of mathematical reasoning, logical inference, and quantitative analysis capabilities required for data-driven enterprise decision making and process optimization.

  • Mathematical computations
  • Logical inference chains
  • Data analysis tasks
  • Quantitative reasoning

Information Retrieval & Context

Assessment of agents' ability to retrieve relevant information from large knowledge bases, understand contextual relationships, and apply retrieved knowledge to solve complex tasks.

  • Knowledge base queries
  • Contextual understanding
  • Information synthesis
  • Cross-reference validation

Benchmark Statistics

  • Total Tasks: 682 (comprehensive task coverage across multiple enterprise scenarios)
  • Core Capabilities: 5 (Planning, Reasoning, Retrieval, Context, Problem-Solving)

Task Categories

  • Compositional Planning: Multi-step workflow coordination
  • Arithmetic Reasoning: Mathematical and quantitative analysis
  • Logical Reasoning: Complex inference and deduction
  • Information Retrieval: Knowledge base search and synthesis
  • Contextual Understanding: Enterprise domain comprehension
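
To make the idea of a compositional task concrete, the following is a minimal sketch of one way a multi-step workflow could be modeled: an ordered list of subtasks, each with its own validator, where the task counts as solved only when every step validates. All names here (`SubTask`, `CompositionalTask`, `build_example_task`, the state keys, and the example goal) are hypothetical and chosen for illustration; this is not the WorkArena++ API or its actual validation logic.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical types for illustration only; not the WorkArena++ API.
State = Dict[str, Any]


@dataclass
class SubTask:
    name: str
    validate: Callable[[State], bool]  # returns True once this step's goal holds


@dataclass
class CompositionalTask:
    goal: str
    subtasks: List[SubTask]

    def progress(self, state: State) -> int:
        """Count leading subtasks whose validators pass, stopping at the first failure."""
        done = 0
        for sub in self.subtasks:
            if not sub.validate(state):
                break
            done += 1
        return done

    def solved(self, state: State) -> bool:
        """The task is solved only if every subtask validates, in order."""
        return self.progress(state) == len(self.subtasks)


def build_example_task() -> CompositionalTask:
    # An invented workflow mixing retrieval, arithmetic, and a final action.
    return CompositionalTask(
        goal="Order three laptops at 400 each without exceeding the budget",
        subtasks=[
            SubTask("retrieve_budget", lambda s: "budget" in s),
            SubTask("compute_total", lambda s: s.get("total") == 3 * 400),
            SubTask(
                "submit_order",
                lambda s: s.get("submitted") is True
                and s.get("total", 0) <= s.get("budget", 0),
            ),
        ],
    )
```

In this sketch, an agent that retrieves the budget and computes the total but submits an order above budget still fails the task, which illustrates why retrieval, arithmetic, and planning have to succeed together rather than in isolation.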

Model Evaluation

  • LLMs: State-of-the-art language models
  • VLMs: Vision-language multimodal models
  • Human Baselines: Knowledge worker performance
  • Fine-tuning Data: Thousands of action traces
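
As a rough illustration of how observation/action traces could feed fine-tuning, the sketch below turns one trace into per-step prompt/completion pairs and serializes them as JSON Lines. The record layout (`Step`, `Trace`), the prompt template, and the action strings are assumptions made for this example; the format actually produced by the benchmark's trace generator may differ.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical record layout; the benchmark's real trace format may differ.
@dataclass
class Step:
    observation: str  # e.g. a text rendering of the current page
    action: str       # the ground-truth action taken from that observation


@dataclass
class Trace:
    task_goal: str
    steps: List[Step]


def to_finetuning_examples(trace: Trace) -> List[Dict[str, str]]:
    """Build one example per step: the prompt holds the goal, the actions
    taken so far, and the current observation; the completion is the action."""
    examples: List[Dict[str, str]] = []
    history: List[str] = []
    for step in trace.steps:
        prompt = "\n".join([
            f"Goal: {trace.task_goal}",
            "Previous actions: " + ("; ".join(history) if history else "none"),
            f"Observation: {step.observation}",
        ])
        examples.append({"prompt": prompt, "completion": step.action})
        history.append(step.action)
    return examples


def to_jsonl(examples: List[Dict[str, str]]) -> str:
    """Serialize examples as JSON Lines, one example per line."""
    return "\n".join(json.dumps(e) for e in examples)
```

A two-step trace yields two training examples, with the second prompt carrying the first action as history. Splitting by step is only one possible design; whole-trajectory examples are an equally plausible alternative.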

BibTeX

@article{boisvert2024workarena++,
  title={WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks},
  author={Boisvert, L{\'e}o and Thakkar, Megh and Gasse, Maxime and Caccia, Massimo and Le Sellier De Chezelles, Thibault and Cappart, Quentin and Chapados, Nicolas and Lacoste, Alexandre},
  journal={arXiv preprint arXiv:2407.05291},
  year={2024}
}