Powering the UI-Vision Benchmark with 10,000+ Desktop GUI Tasks

ServiceNow partnered with Turing to annotate and structure a 10,000-task dataset covering 83 real desktop applications. The dataset supports UI grounding, layout grouping, and multi-step action prediction for multimodal AI agents.

10,000+

GUI tasks delivered, including element grounding, layout grouping, and multi-step action prediction across real desktop interfaces.

70

annotators and QA specialists across India and Latin America.

83

desktop platforms covered, including VSCode, GIMP, LibreOffice, Inkscape, Firefox, VLC, and more.

Method: Agent evaluation
Domain: Tool use
Dataset scale: 10,000+ tasks
Capability: Data Packs

The Challenge

AI agents are generally trained on web or mobile UIs, but real-world enterprise users rely on complex desktop applications. The client set out to build the first large-scale benchmark for testing AI agents in desktop software environments.

The benchmark needed to support:

  • Element grounding: Identify the correct icon, menu item, or input field from a text query (a scoring sketch follows this list)
  • Layout grounding: Group elements by region or function
  • Action prediction: Predict full trajectories of GUI interactions such as clicks, scrolls, and hotkeys
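
As a minimal illustration of the element-grounding task, the sketch below counts a prediction as correct when the model's predicted click point lands inside the ground-truth bounding box of the target element. The function and field names are assumptions for illustration, not the benchmark's actual evaluation code.

```python
# Illustrative element-grounding check: a prediction counts as correct when the
# predicted click point lands inside the ground-truth element's bounding box.
# Function and field names are assumptions, not the benchmark's evaluation code.

from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def is_grounded(pred_point: Tuple[float, float], target_bbox: BBox) -> bool:
    """Return True if the predicted (x, y) click falls inside the target box."""
    x, y = pred_point
    x_min, y_min, x_max, y_max = target_bbox
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(points: List[Tuple[float, float]], boxes: List[BBox]) -> float:
    """Fraction of queries whose predicted click hits the correct element."""
    hits = sum(is_grounded(p, b) for p, b in zip(points, boxes))
    return hits / max(len(boxes), 1)


# Example: a click at (412, 87) against a button spanning (400, 80) to (460, 104).
print(is_grounded((412.0, 87.0), (400.0, 80.0, 460.0, 104.0)))  # True
```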

The annotation process required precise, tool-specific actions, exhaustive labeling, and frame-accurate screenshot capture, requirements that generic data vendors and crowdsourced labeling could not meet.

The Approach

Dataset structure

Turing deployed a team of more than 70 annotators based in India and Latin America. All contributors held technical degrees, were fluent in English with strong technical writing skills, and had prior experience in UI research and data labeling.

Each task consisted of three parts (a representative record layout is sketched after the list):

  1. Instruction design: Create realistic desktop-use prompts using open-source applications, such as “Open a new spreadsheet and apply conditional formatting to column B.”
  2. Action recording: Perform the action while screen recording and capture GUI events such as CLICK, DRAG_TO, SCROLL, and TYPING.
  3. Frame annotation: Extract screenshots before each action and annotate all visible UI elements with bounding boxes, tooltips, and alt text.
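
One way such a task record could be represented is sketched below. It assumes a JSON-like layout; the field names and example values are illustrative only and do not reflect the project's actual delivery schema.

```python
# Minimal sketch of one task record, assuming a JSON-like layout. Field names
# and values are illustrative only and do not reflect the delivery schema.

task = {
    "instruction": "Open a new spreadsheet and apply conditional formatting to column B.",
    "application": "LibreOffice Calc",
    "actions": [  # GUI events recorded during the screen capture, in order
        {"type": "CLICK", "target": "Format menu", "timestamp_ms": 1240},
        {"type": "CLICK", "target": "Conditional > Condition...", "timestamp_ms": 2980},
        {"type": "TYPING", "text": "B1:B100", "timestamp_ms": 5310},
    ],
    "frames": [  # one screenshot extracted before each action
        {
            "image": "frames/0001.png",
            "elements": [
                {
                    "bbox": [412, 8, 478, 30],  # x_min, y_min, x_max, y_max in pixels
                    "label": "Format menu",
                    "tooltip": None,
                    "alt_text": "Top-level 'Format' menu item",
                },
            ],
        },
    ],
}

print(len(task["actions"]), "recorded actions,", len(task["frames"]), "annotated keyframe")
```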

All actions followed a defined taxonomy of GUI events. Screenshot annotations followed strict standards for element coverage, cursor labeling, grouping logic, and text accuracy. Annotators worked at full-screen resolution and recorded actions using a proprietary annotation platform.
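
The event types named above (CLICK, DRAG_TO, SCROLL, TYPING) suggest how such a taxonomy might be encoded. The sketch below covers only those four event types and omits the rest of the project's taxonomy.

```python
# Sketch of a GUI-event taxonomy limited to the event types named in this case
# study (CLICK, DRAG_TO, SCROLL, TYPING); the project's full taxonomy is larger.

from enum import Enum


class GUIEvent(Enum):
    CLICK = "click"      # single click on a UI element
    DRAG_TO = "drag_to"  # press, move to a target location, release
    SCROLL = "scroll"    # vertical or horizontal scrolling
    TYPING = "typing"    # keyboard text entry into a focused field


def parse_event(name: str) -> GUIEvent:
    """Map a logged event name such as 'DRAG_TO' to its taxonomy entry."""
    return GUIEvent[name.upper()]


print(parse_event("drag_to"))  # GUIEvent.DRAG_TO
```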

Task coverage

  • Applications: The dataset covered 83 applications across categories such as productivity (LibreOffice, Zotero), creativity (GIMP, Blender), development (VSCode, Eclipse), and communication (Signal, Mastodon)
  • Task types: Ranging from atomic (e.g., “Insert an image”) to complex (e.g., “Create a table, apply filters, export as PDF”)
  • Interactions captured: Element location, motion (e.g., drag-and-drop), layout structure, and tool-based interactions

QA and review process

To maintain high quality across all delivered tasks, Turing implemented a dual-layer manual QA pipeline. All annotations, including bounding boxes, text labels, and action logs, were manually verified for screen-event fidelity, correct screenshot timing, and annotation consistency.
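
Automated consistency checks can complement this kind of manual review, for example confirming that every bounding box lies inside its screenshot and that action timestamps increase monotonically. The sketch below is a hypothetical helper along those lines, not Turing's QA tooling.

```python
# Hypothetical consistency checks of the kind that can complement manual QA:
# every bounding box must lie inside its screenshot, and action timestamps must
# be strictly increasing. Illustrative only, not Turing's QA tooling.

def bboxes_outside_frame(elements, width, height):
    """Return labels of annotated elements whose boxes fall outside the frame."""
    bad = []
    for el in elements:
        x_min, y_min, x_max, y_max = el["bbox"]
        if not (0 <= x_min < x_max <= width and 0 <= y_min < y_max <= height):
            bad.append(el["label"])
    return bad


def timestamps_increasing(actions):
    """Return True if each recorded action occurs strictly after the previous one."""
    times = [a["timestamp_ms"] for a in actions]
    return all(a < b for a, b in zip(times, times[1:]))


print(timestamps_increasing([{"timestamp_ms": 1240}, {"timestamp_ms": 2980}]))  # True
```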

Key Results

  • Delivered more than 10,000 end-to-end tasks covering diverse interaction types across 83 open-source applications
  • Annotated multiple keyframes per task, including UI components, tooltips, mouse events, and layout regions
  • Supported benchmark tasks such as element grounding, layout grouping, and multi-step action prediction

The Outcome

The resulting dataset was used to build UI-Vision, the first large-scale benchmark for multimodal agents in desktop environments. The benchmark reveals:

  • Where models struggle with dense, multi-layered layouts
  • How they fail at multi-step planning and execution
  • The challenges of grounding instructions in complex desktop interfaces

Turing’s work enabled the client to:

  • Scale the dataset rapidly with high-quality, reproducible annotations
  • Maintain fidelity across complex user flows
  • Benchmark frontier models on real-world, GUI-intensive tasks

Can your model navigate real desktop GUIs?

Request a sample with recorded user actions, annotated UI elements, and structured metadata across open-source desktop applications.

Request Sample

FAQ

What’s in the sample?

Each sample includes a complete GUI task with a screen recording, structured prompt, keyframes, and bounding box annotations.

How many platforms are covered?

83 open-source desktop apps across productivity, design, coding, communication, and multimedia.

Are tasks atomic or multi-step?

Tasks include both atomic and multi-step interactions, ranging from simple two-step navigations to complex workflows involving 10 or more actions.

Is this suitable for model fine-tuning or eval?

Yes. These tasks are fully labeled and formatted for grounding, prediction, and general agent benchmarking.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Want to test your AI agent on desktop interfaces?

Request a sample with screen recordings, multi-step action traces, and pixel-level annotations built to evaluate grounding, planning, and UI control.

Request Sample