Abstract

We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 29 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

WorkArena Tasks

WorkArena is a suite of browser-based tasks designed to gauge how effectively web agents can support the routine work of knowledge workers. Because it builds on the widely used ServiceNow platform, the benchmark offers a practical way to assess how far such automation has come in modern knowledge-work environments.

Knowledge Bases

Goal: The agent must search for specific information in the company knowledge base.

The agent interacts with the user via BrowserGym's conversational interface.

Forms

Goal: The agent must fill a complex form with specific values for each field.

Service Catalogs

Goal: The agent must order items with specific configurations from the company's service catalog.

Lists

Goal: The agent must filter a list according to some specifications.

In this example, the agent struggles to manipulate the UI and fails to create the filter.

Menus

Goal: The agent must navigate to a specific application using the main menu.

Challenges and Solutions

Developing enterprise web-based agents is no small feat. Challenges include:
🌐 Navigating the 'World Wild Web' of Enterprise UI with huge and non-standard HTML files (100k+ tokens)
💾 Memory management (for POMDPs)
📈 Planning
📚 Domain-specific knowledge

To help agents navigate this space, we propose a new web environment with:
✅ Multimodal observations (AXtree, HTML, screenshots)
✅ High-level actions (click, fill, etc.)
✅ Chat interface for user interaction
✅ Robustness to the world wild web (e.g., iFrames, shadow DOMs)
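
To illustrate what "high-level actions" could look like in practice, here is a minimal sketch of an action executor that parses text actions such as click("a51") or fill("a12", "Laptop") emitted by an LLM and applies them to a page. Everything in it is an assumption made for illustration: the ToyPage class, the element ids, the execute helper and the two-action vocabulary are hypothetical stand-ins, not BrowserGym's actual API.

```python
import ast


class ToyPage:
    """Hypothetical stand-in for a browser page (not a real browser)."""

    def __init__(self):
        self.fields = {}    # element id -> text typed into it
        self.clicked = []   # element ids clicked, in order

    def click(self, element_id):
        self.clicked.append(element_id)

    def fill(self, element_id, value):
        self.fields[element_id] = value


# Assumed action vocabulary; a real environment would expose more actions.
ALLOWED_ACTIONS = {"click", "fill"}


def execute(action, page):
    """Parse one action string, e.g. fill("a12", "Laptop"), and apply it.

    The string is parsed with `ast` rather than `eval`, so only calls to
    whitelisted action names with literal arguments are accepted.
    """
    call = ast.parse(action.strip(), mode="eval").body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a function call: {action!r}")
    if call.func.id not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {call.func.id}")
    if call.keywords:
        raise ValueError("keyword arguments are not supported in this sketch")
    # literal_eval rejects anything that is not a plain literal.
    args = [ast.literal_eval(arg) for arg in call.args]
    getattr(page, call.func.id)(*args)


if __name__ == "__main__":
    page = ToyPage()
    for step in ['fill("a12", "Laptop")', 'click("a51")']:
        execute(step, page)
    print(page.fields, page.clicked)
```

Keeping the action space as a small whitelist of named calls means the agent's text output can be validated before anything touches the browser, and malformed or unexpected actions fail with a clear error instead of executing arbitrary code.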

BrowserGym

Results

WorkArena poses a real challenge! Even with GPT-4 and our top-performing configs, we're seeing only a 55% success rate. A SOTA OSS code LLM hasn't had a single successful episode yet.

✨ We're calling on the open-source community to help bridge this significant gap.


BrowserGym

All our experiments are conducted using BrowserGym, our new conversational gym environment for web agents!

🌐 Unified testing ground for MiniWob, WebArena, WorkArena
📣 We welcome new benchmark contributions!
🚀 Try our demo on the open web

Agents navigating on the open web, WorkArena and WebArena, via BrowserGym.



BibTeX

@misc{workarena2024,
  title={WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?},
  author={Alexandre Drouin and Maxime Gasse and Massimo Caccia and Issam H. Laradji and Manuel Del Verme and Tom Marty and Léo Boisvert and Megh Thakkar and Quentin Cappart and David Vazquez and Nicolas Chapados and Alexandre Lacoste},
  year={2024},
  eprint={2403.07718},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}