GUI-Bee:
Align GUI Action Grounding to Novel Environments via Autonomous Exploration

1UC Santa Cruz, 2Adobe Research

* Co-advising. This work was partly performed when the first author interned at Adobe Research.

Abstract

Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent work on GUI action grounding leverages large GUI datasets to fine-tune MLLMs. However, the fine-tuning data inevitably covers only a limited set of GUI environments, and we find that the performance of the resulting models deteriorates in novel environments. We argue that when inference is known to involve novel environments, i.e., environments not seen during the previous fine-tuning, GUI grounding models should be further aligned to those environments to reach their full potential. To realize this, we first propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration, and then continuously fine-tune GUI grounding models with the collected data. Our agent leverages a novel Q-value-Incentive In-Context Reinforcement Learning (Q-ICRL) method to optimize exploration efficiency and data quality. Additionally, we introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments, and we demonstrate the effectiveness of the data collected by GUI-Bee in our experiments. Furthermore, we conduct an ablation study that validates the role of Q-ICRL in enhancing the efficiency of GUI-Bee.

App owners may want to deploy GUI action grounding models in their own specific GUI environments, which are potentially novel environments on which the models have not been trained. We propose building environment-aligned GUI action grounding models based on GUI grounding models from prior works. The aligned models are fine-tuned specifically for the novel environments to achieve better performance. Our proposed alignment process first explores the specific novel environment with the GUI-Bee agent to generate an exploration graph, and then fine-tunes the model with data from that graph. In the inference example at the bottom, the models encounter a query requiring knowledge of an environment-specific action outcome, which highlights the importance of the proposed alignment process.

Overview

We propose aligning GUI grounding models to novel environments, which includes
⓵ exploring the specific GUI environment,
⓶ generating high-quality data from the exploration, and
⓷ continuously fine-tuning the model with the collected data.


Autonomous Exploration with GUI-Bee and Q-ICRL

·Goal:

Autonomously predict and execute a minimal number of actions to reach a sequence of GUI screens that is as diverse as possible.



·Challenges:

1. Noisy Action Space and Action Validity: The GUI environment usually provides a set of candidate actions, but many are invalid, i.e., they correspond to non-executable elements.
2. Uncertain Screen Transitions: The outcomes of GUI actions are unknown in advance, and the screen transitions caused by actions are irreversible.



·Our Solution: Q-ICRL

(Refer to our paper for pseudo-code and more details.)

We equip GUI-Bee with a novel Q-value-Incentive In-Context Reinforcement Learning (Q-ICRL) algorithm. It is a training-free method designed to maximize exploration efficiency by treating the exploration process as a Markov Decision Process. It uses a memory-based Q-value function to quantify the outcomes of actions and employs in-context learning with a Multimodal Large Language Model (MLLM) to select the most promising action, balancing exploration and exploitation.
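As a minimal sketch of this idea (not the paper's exact algorithm), a memory-based Q-value function could look like the following. The class name, the optimistic initialization, the update rule, and the reward scheme are our own illustrative assumptions, and the MLLM-based in-context action ranking is replaced here by a simple epsilon-greedy arg-max over stored Q-values:

```python
import random
from collections import defaultdict

class QICRLExplorer:
    """Illustrative sketch of Q-value-incentivized exploration.
    Names and update rule are assumptions, not the paper's exact method."""

    def __init__(self, epsilon=0.1):
        # Memory-based Q-values: (screen, action) -> estimated exploration value.
        # Optimistic initialization (1.0) biases selection toward unexplored actions.
        self.q = defaultdict(lambda: 1.0)
        self.epsilon = epsilon

    def select_action(self, screen, candidate_actions):
        # In the paper, an MLLM ranks candidates in-context using the Q-values;
        # here we use the stored Q-values directly as a stand-in.
        if random.random() < self.epsilon:
            return random.choice(candidate_actions)
        return max(candidate_actions, key=lambda a: self.q[(screen, a)])

    def update(self, screen, action, reward):
        # Assumed reward scheme: e.g. 0 for invalid (non-executable) actions,
        # 1 for transitions that reach a previously unseen screen.
        key = (screen, action)
        self.q[key] = 0.5 * self.q[key] + 0.5 * reward
```

With this sketch, an action observed to be invalid has its Q-value pulled toward zero, so subsequent selections on the same screen prefer untried candidates, which is the exploration/exploitation balance described above.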



·Exploration Results:

The exploration process of GUI-Bee builds an exploration graph of visited screens and actions. Left: an example exploration graph showing screens connected by actions. Right: a zoomed-in view of the graph with examples of two GUI screens and some explored and unexplored actions (i.e., actions that were or were never selected during the exploration).
Illustration of the generation process of an exploration graph.
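The graph-building step described in these captions can be sketched as follows. `ExplorationGraph` and its method names are hypothetical, inferred from the figures rather than taken from the paper's code:

```python
class ExplorationGraph:
    """Sketch of the exploration graph: nodes are visited screens,
    directed edges are executed actions. Names are illustrative."""

    def __init__(self):
        self.screens = set()   # node set: screen identifiers
        self.edges = {}        # (source_screen, action) -> target_screen

    def record_transition(self, source, action, target):
        """Record the screens and the action edge observed during exploration.
        Returns True if the edge is new (which would trigger data annotation)."""
        self.screens.update([source, target])
        key = (source, action)
        is_new = key not in self.edges
        self.edges[key] = target
        return is_new

    def explored_actions(self, screen):
        """Actions that have been selected at least once on this screen."""
        return [a for (s, a) in self.edges if s == screen]
```

The `is_new` return value mirrors the trigger described in the annotation section: only newly added edges are sent onward for query generation.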

Autonomous Data Annotation with GUI-Bee

Once a new edge is added to the exploration graph during the exploration, we send the pair of screens connected by the edge to the MLLM (GPT-4o) to generate a list of action grounding queries (u^t) for the target element (e^t). The generation process uses a carefully crafted prompt, designed to ensure the queries cover both System 1 (focused on current screen content) and System 2 (anticipating interaction outcomes) grounding challenges.
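The annotation step could be sketched as below. The prompt text is a rough, hypothetical approximation of the "carefully crafted prompt" mentioned above (the actual prompt appears in the paper), and the function name and argument names are our own:

```python
def build_query_prompt(source_desc, target_desc, element_desc):
    """Hypothetical prompt builder for annotating a new exploration-graph edge.
    Illustrates covering both System 1 and System 2 query types; the real
    prompt used with GPT-4o is more elaborate."""
    return (
        "You are annotating a GUI action grounding dataset.\n"
        f"Screen before the action: {source_desc}\n"
        f"Screen after the action: {target_desc}\n"
        f"Target element e^t: {element_desc}\n"
        "Write grounding queries u^t of two kinds:\n"
        "1. System 1: refer to the element by its appearance or text "
        "on the current screen.\n"
        "2. System 2: refer to the element by the outcome of interacting "
        "with it, as shown on the screen after the action.\n"
    )
```

In practice such a prompt would be sent to the MLLM together with the two screenshots, and the returned queries paired with the target element's location to form fine-tuning examples.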

NovelScreenSpot Benchmark

NovelScreenSpot is a new human-annotated benchmark for evaluating GUI action grounding models in five diverse web GUI environments, as well as the performance improvements of models after they are continuously fine-tuned with new data. Unlike existing benchmarks that emphasize diversity across many environments, NovelScreenSpot provides greater data variation within each environment and includes a large number of queries about action outcomes, which require environment-specific knowledge.


Examples of the NovelScreenSpot benchmark. Each figure illustrates the GUI screen, the A11y string, the corresponding query, and the ground truth target element. We show one example for each environment in the benchmark.

Experiments

We employ the GUI-Bee agent to explore the five environments in NovelScreenSpot, conducting up to 400 exploration steps per environment. To ensure diverse screen data, the exploration is repeated three times per environment at varying screen resolutions. The resulting exploration statistics are summarized in the table. We use GPT-4o as the MLLM for GUI-Bee, and the cost of exploration is under $50 per environment.


Using the data generated from GUI-Bee's exploration, we continuously fine-tune three GUI grounding models for each explored environment. The table benchmarks GUI grounding models on NovelScreenSpot and on the Eventbrite environment of the Multimodal-Mind2Web benchmark, reporting model accuracy and, in parentheses, the absolute improvement over the vanilla models after continuous fine-tuning. The results demonstrate that the data collected by GUI-Bee significantly improves the performance of GUI action grounding models in novel environments.
We propose Depth-fixed DOM Diversity Counts (D3C) to evaluate the coverage of the exploration process. This plot shows the mean and standard deviation of D3C at various exploration steps across three runs in three environments. The GUI-Bee agent demonstrates wider exploration coverage.
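One plausible reading of such a metric (the exact definition of D3C is in the paper) is to count distinct DOM structures after truncating each visited screen's DOM tree at a fixed depth. The `(tag, children)` node representation and both function names below are our own assumptions:

```python
def truncate_dom(node, depth):
    """Return a hashable signature of a DOM tree truncated at `depth`.
    A node is represented as a (tag, children) pair; this representation
    is an assumption for illustration, not the paper's definition."""
    tag, children = node
    if depth == 0:
        return (tag,)
    return (tag, tuple(truncate_dom(c, depth - 1) for c in children))

def d3c(screen_doms, depth=3):
    """Count distinct depth-truncated DOM structures among visited screens:
    a rough, hypothetical stand-in for the D3C coverage metric."""
    return len({truncate_dom(dom, depth) for dom in screen_doms})
```

Under this reading, a higher count at a given exploration step means the agent has reached more structurally distinct screens, which is what the coverage plot compares across agents.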

BibTeX


@misc{fan2025guibeealignguiaction,
  title={GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration},
  author={Yue Fan and Handong Zhao and Ruiyi Zhang and Yu Shen and Xin Eric Wang and Gang Wu},
  year={2025},
  eprint={2501.13896},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.13896},
}