WebArena: A Realistic Web Environment for Building Autonomous Agents

1Carnegie Mellon University, 2Inspired Cognition
*Lead contributors. +Equal contribution.
{shuyanzh,fangzhex,gneubig}@cs.cmu.edu

WebArena is a standalone, self-hostable web environment for building autonomous agents. It provides websites from four popular categories, with functionality and data that mimic their real-world equivalents. To emulate human problem-solving, WebArena also embeds tools and knowledge resources as independent websites. WebArena introduces a benchmark on interpreting high-level, realistic natural language commands into concrete web-based interactions. We provide annotated programs that programmatically validate the functional correctness of each task.

WebArena Website Demos

The videos demonstrate various tasks that can be performed in WebArena.

Agent on Gitlab

"Set up a new, empty repository with the name awesome_llm_reading"

Agent on Shopping Website

"Tell me the status of my latest order and when will it arrive"

Realistic Tasks on WebArena

A high-level task that can be fully executed in WebArena. Completing such tasks requires sophisticated, long-term planning and reasoning capabilities. To accomplish the goal stated at the top, an agent first needs to find out which art museums are located in Pittsburgh by searching Wikipedia. Next, it should identify the location of each museum on a map and optimize the itinerary based on the information collected. Finally, the agent needs to update the README file in the appropriate repository with the planned route.

List of Tasks

Observation Space

We design the observation to be the URL and the content of a web page, with options to represent the content as a screenshot (left), HTML DOM tree (middle) and accessibility tree (right).
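The observation described above can be pictured as a small record: a URL plus one chosen rendering of the page content. The following is a minimal sketch under stated assumptions, not WebArena's actual interface; the names `ContentFormat`, `Observation`, and `make_observation`, as well as the sample URL and page content, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Union


class ContentFormat(Enum):
    """The three content representations mentioned in the text (names are assumptions)."""
    SCREENSHOT = "screenshot"
    HTML = "html"
    ACCESSIBILITY_TREE = "accessibility_tree"


@dataclass(frozen=True)
class Observation:
    """Hypothetical observation: the page URL plus one rendering of its content."""
    url: str
    content_format: ContentFormat
    content: Union[str, bytes]  # bytes for a screenshot, str for the text formats


def make_observation(
    url: str,
    renderings: Dict[ContentFormat, Union[str, bytes]],
    content_format: ContentFormat,
) -> Observation:
    """Select one of the available renderings of a page as the agent's observation."""
    return Observation(url=url, content_format=content_format,
                       content=renderings[content_format])


# Illustrative page with made-up content in all three formats.
renderings = {
    ContentFormat.SCREENSHOT: b"<raw image bytes>",
    ContentFormat.HTML: "<button id='new-project'>New project</button>",
    ContentFormat.ACCESSIBILITY_TREE: "[1] button 'New project'",
}
obs = make_observation("http://localhost/projects", renderings,
                       ContentFormat.ACCESSIBILITY_TREE)
```

Keeping the format as an explicit field lets the same agent loop be run against any of the three representations without changing the rest of the code.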

Evaluating Functional Correctness

We introduce two evaluation approaches. The top row measures the correctness of information-seeking tasks by comparing the predicted answer with the annotated reference, using one of three implementations. The bottom row programmatically checks whether the intermediate states during execution possess the anticipated properties specified by the intent.
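The two approaches can be sketched roughly as follows. This page does not name the three answer-comparison implementations, so the sketch assumes two plausible ones (an exact match and a "must include" check) and omits the third; the state check likewise assumes the environment state can be summarized as a dictionary of properties. All function names and the sample state are hypothetical, not WebArena's API.

```python
from typing import Any, Dict, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(predicted: str, reference: str) -> bool:
    """Assumed comparison 1: the answer equals the reference after normalization."""
    return normalize(predicted) == normalize(reference)


def must_include(predicted: str, required_phrases: Iterable[str]) -> bool:
    """Assumed comparison 2: every required phrase appears somewhere in the answer."""
    answer = normalize(predicted)
    return all(normalize(phrase) in answer for phrase in required_phrases)


def state_has_properties(state: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Programmatic check: the observed state contains every expected property value."""
    return all(key in state and state[key] == value
               for key, value in expected.items())


# Information-seeking task: compare a predicted answer with an annotated reference.
answer_ok = must_include("Your latest order is Pending.", ["pending"])

# State-changing task: verify the resulting state, here a made-up summary of the
# repository created for "Set up a new, empty repository with the name awesome_llm_reading".
observed_state = {"repo_name": "awesome_llm_reading", "file_count": 0}
state_ok = state_has_properties(
    observed_state, {"repo_name": "awesome_llm_reading", "file_count": 0}
)
```

Checking properties of the resulting state, instead of the exact sequence of actions, allows different valid action paths to count as correct.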

Related Work

A comparison between our benchmark and existing benchmarks on grounding natural language instructions to concrete executions. Our benchmark is implemented in the fully interactable, highly realistic WebArena environment. It features diverse tasks that humans may encounter in their daily routines. We design evaluation metrics to assess the functional correctness of task executions.

BibTeX

@article{zhou2023webarena,
  title={WebArena: A Realistic Web Environment for Building Autonomous Agents},
  author={Zhou, Shuyan and Xu, Frank F and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Bisk, Yonatan and Fried, Daniel and Alon, Uri and others},
  journal={arXiv preprint arXiv:2307.13854},
  url={https://webarena.dev},
  year={2023}
}