We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.
WorkArena is a suite of browser-based tasks designed to gauge how effectively web agents support the routine tasks of knowledge workers. Because it builds on the widely used ServiceNow platform, the benchmark helps assess how far such automation has progressed in modern knowledge work environments.
Goal: The agent must search for specific information in the company knowledge base.
The agent interacts with the user via BrowserGym's conversational interface.
Goal: The agent must fill a complex form with specific values for each field.
Goal: The agent must order items with specific configurations from the company's service catalog.
Goal: The agent must filter a list according to some specifications.
In this example, the agent struggles to manipulate the UI and fails to create the filter.
Goal: The agent must navigate to a specific application using the main menu.
Goal: The agent must answer a question that requires reading charts and (optionally) performing simple reasoning over them.
Agents navigating the open web, WorkArena, and WebArena via BrowserGym.
@misc{workarena2024,
title={WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?},
author={Alexandre Drouin and Maxime Gasse and Massimo Caccia and Issam H. Laradji and Manuel Del Verme and Tom Marty and Léo Boisvert and Megh Thakkar and Quentin Cappart and David Vazquez and Nicolas Chapados and Alexandre Lacoste},
year={2024},
eprint={2403.07718},
archivePrefix={arXiv},
primaryClass={cs.LG}
}