Agent on GitLab
"Set up a new, empty repository with the name awesome_llm_reading"
The videos demonstrate various tasks that can be performed in WebArena.
"Tell me the status of my latest order and when will it arrive"
A high-level task that can be fully executed in WebArena. Completing such tasks requires sophisticated long-term planning and reasoning capabilities. To accomplish the goal stated at the top, an agent needs to find out which art museums are located in Pittsburgh by searching Wikipedia. Next, it should identify the location of each museum on a map and optimize the itinerary based on the information collected. Finally, the agent needs to update the README file in the appropriate repository with the planned route.
We design the observation to consist of the URL and the content of a web page, with options to represent the content as a screenshot (left), an HTML DOM tree (middle), or an accessibility tree (right).
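To make the observation format concrete, here is a minimal sketch of an observation record that pairs a URL with page content in one of the three representations, plus a helper that flattens an accessibility tree into indented text an agent could read. This is not WebArena's own data model: the class names (ObservationType, AXNode, Observation), the render_ax_tree helper, the "[id] role 'name'" line format, and the example URL and page nodes are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Union


class ObservationType(Enum):
    # The three content representations named in the caption.
    SCREENSHOT = "screenshot"
    HTML_DOM = "html_dom"
    ACCESSIBILITY_TREE = "accessibility_tree"


@dataclass
class AXNode:
    # Hypothetical minimal accessibility-tree node: id, role, accessible name, children.
    node_id: int
    role: str
    name: str
    children: List["AXNode"] = field(default_factory=list)


@dataclass
class Observation:
    # The page URL plus its content in one chosen representation.
    url: str
    obs_type: ObservationType
    content: Union[bytes, str]  # e.g. PNG bytes for a screenshot, text otherwise


def render_ax_tree(node: AXNode, depth: int = 0) -> str:
    """Flatten an accessibility tree into indented text, one element per line."""
    lines = ["  " * depth + f"[{node.node_id}] {node.role} '{node.name}'"]
    for child in node.children:
        lines.append(render_ax_tree(child, depth + 1))
    return "\n".join(lines)


# Usage with a made-up page; the URL and node contents are illustrative only.
tree = AXNode(1, "RootWebArea", "Projects", [
    AXNode(2, "link", "New project"),
    AXNode(3, "searchbox", "Search GitLab"),
])
obs = Observation(
    url="http://localhost/projects",
    obs_type=ObservationType.ACCESSIBILITY_TREE,
    content=render_ax_tree(tree),
)
print(obs.content)
```

The text form prints one element per line with its id, which gives an agent a compact handle it could refer to when choosing an element; a screenshot observation would instead carry raw image bytes in the same content field.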
We introduce two evaluation approaches. The top row measures the correctness of information-seeking tasks: it compares the predicted answer with the annotated reference using three implementations. The bottom row programmatically checks whether the intermediate states during execution possess the anticipated properties specified by the intent.
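The two evaluation styles can be sketched as follows. The caption does not name the three answer-matching implementations, so this sketch assumes three plausible ones: exact match, "must include" (every required phrase appears), and a fuzzy match. The fuzzy variant here uses character-level similarity from Python's difflib as a stand-in and may differ from whatever the benchmark actually uses. The program-based check is modeled as a list of predicates over a snapshot of site state. All function names, the normalization rules, the 0.8 threshold, the example answers, and the state layout are hypothetical.

```python
import difflib
from typing import Any, Callable, Dict, List


def normalize(text: str) -> str:
    # Lowercase, collapse whitespace, and trim edge punctuation.
    return " ".join(text.lower().split()).strip(".,!?\"' ")


def exact_match(pred: str, ref: str) -> bool:
    # The normalized prediction must equal the normalized reference.
    return normalize(pred) == normalize(ref)


def must_include(pred: str, required: List[str]) -> bool:
    # Every required phrase must appear somewhere in the prediction.
    p = normalize(pred)
    return all(normalize(r) in p for r in required)


def fuzzy_match(pred: str, ref: str, threshold: float = 0.8) -> bool:
    # Stand-in fuzzy matcher: character-level similarity ratio in [0.0, 1.0].
    ratio = difflib.SequenceMatcher(None, normalize(pred), normalize(ref)).ratio()
    return ratio >= threshold


State = Dict[str, Any]


def check_state(state: State, checks: List[Callable[[State], bool]]) -> bool:
    # Program-based evaluation: every property check must hold on the state.
    return all(check(state) for check in checks)


# Usage. The answers and the state snapshot are invented for illustration;
# a real evaluator would read state from the live site during or after the episode.
print(exact_match("Delivered", "delivered."))  # True
print(must_include("Your order was delivered on March 3", ["delivered", "March 3"]))  # True
print(fuzzy_match("Carnegie Museum of Art", "the Carnegie Museum of Art"))  # True

# Hypothetical state for the task "create an empty repository awesome_llm_reading".
state = {"repositories": {"awesome_llm_reading": {"files": []}}}
checks = [
    lambda s: "awesome_llm_reading" in s["repositories"],
    lambda s: s["repositories"].get("awesome_llm_reading", {}).get("files") == [],
]
print(check_state(state, checks))  # True
```

Checking site state with predicates, rather than comparing the agent's action sequence to a reference trajectory, accepts any sequence of actions that reaches a state with the intended properties.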
A comparison between our benchmark and existing benchmarks on grounding natural language instructions to concrete executions. Our benchmark is implemented in our fully interactive, highly realistic WebArena environment. It features diverse tasks that humans may encounter in their daily routines. We design evaluation metrics to assess the functional correctness of task executions.
@article{zhou2023webarena,
title={WebArena: A Realistic Web Environment for Building Autonomous Agents},
author={Zhou, Shuyan and Xu, Frank F and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Bisk, Yonatan and Fried, Daniel and Alon, Uri and others},
journal={arXiv preprint arXiv:2307.13854},
url={https://webarena.dev},
year={2023}
}