ServiceNow partnered with Turing to annotate and structure a 10,000-task dataset covering 83 real desktop applications. The dataset supports UI grounding, layout grouping, and multi-step action prediction for multimodal AI agents.

AI agents are generally trained on web or mobile UIs, but real-world enterprise users rely on complex desktop applications. The client set out to build the first large-scale benchmark for testing AI agents in desktop software environments.
The benchmark needed to support UI grounding, layout grouping, and multi-step action prediction across real desktop applications.
The annotation process required precise, tool-specific actions, exhaustive labeling, and frame-accurate screenshot capture, none of which were achievable with generic data vendors or crowdsourcing.
Dataset structure
Turing deployed a team of more than 70 annotators based in India and Latin America. All contributors held technical degrees, were fluent in English and technical writing, and had prior experience in UI research and data labeling.
Each task consisted of three parts: a screen recording, a multi-step action trace, and pixel-level element annotations.
All actions followed a defined taxonomy of GUI events. Screenshot annotations followed strict standards for element coverage, cursor labeling, grouping logic, and text accuracy. Annotators worked in full-screen resolution and recorded actions using a proprietary annotation platform.
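The exact taxonomy and annotation schema are proprietary, but as a rough sketch of the kinds of records such standards imply (every name and field below is hypothetical, not the dataset's actual format), the data might be modeled like this:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class GUIEvent(Enum):
    """Illustrative GUI event types; the production taxonomy is proprietary."""
    CLICK = "click"
    DOUBLE_CLICK = "double_click"
    RIGHT_CLICK = "right_click"
    TYPE = "type"
    SCROLL = "scroll"
    DRAG = "drag"
    HOTKEY = "hotkey"

@dataclass
class BoundingBox:
    """Pixel-space rectangle for a UI element in a full-screen screenshot."""
    x: int
    y: int
    width: int
    height: int

@dataclass
class ElementAnnotation:
    """One labeled UI element: its box, visible text, and layout group."""
    box: BoundingBox
    text: Optional[str]   # visible text, if the element has any
    group_id: int         # layout group the element belongs to

@dataclass
class ActionStep:
    """One recorded action, tied to the frame-accurate keyframe it acts on."""
    event: GUIEvent
    target: ElementAnnotation
    keyframe: str                      # path to the captured screenshot
    typed_text: Optional[str] = None   # payload for TYPE events
```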
Task coverage
The dataset spans 83 open-source desktop applications across productivity, design, coding, communication, and multimedia. Tasks range from simple two-step navigations to complex workflows involving 10 or more actions.
QA and review process
To maintain quality across all delivered tasks, Turing implemented a dual-layer manual QA pipeline. Every annotation, including bounding boxes, text labels, and action logs, was verified by hand for screen-event fidelity, correct screenshot timing, and annotation consistency.
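The QA described here was manual. Purely to illustrate the kinds of consistency rules reviewers enforced (the checks and default resolution below are hypothetical, using the illustrative records sketched earlier), an automated pre-screen ahead of manual review might look like:

```python
def precheck_task(steps: list[ActionStep],
                  screen_w: int = 1920,
                  screen_h: int = 1080) -> list[str]:
    """Flag obvious annotation defects before manual review (hypothetical).

    Returns human-readable issues; an empty list means the task proceeds
    to the two manual QA layers.
    """
    issues = []
    for i, step in enumerate(steps):
        box = step.target.box
        # Every bounding box must lie fully inside the full-screen capture.
        if (box.x < 0 or box.y < 0
                or box.x + box.width > screen_w
                or box.y + box.height > screen_h):
            issues.append(f"step {i}: bounding box falls outside the screen")
        # TYPE events must record the text that was entered.
        if step.event is GUIEvent.TYPE and not step.typed_text:
            issues.append(f"step {i}: TYPE event is missing its typed_text")
        # Each action needs a frame-accurate keyframe reference.
        if not step.keyframe:
            issues.append(f"step {i}: missing keyframe path")
    return issues
```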
The resulting dataset was used to build UI-Vision, the first large-scale benchmark for multimodal agents in desktop environments, giving the client a rigorous way to evaluate grounding, planning, and UI control in real desktop software.
Request a sample with recorded user actions, annotated UI elements, and structured metadata across open-source desktop applications.
Frequently asked questions
What does a sample include?
Each sample includes a complete GUI task with a screen recording, structured prompt, keyframes, and bounding box annotations.
Which applications are covered?
83 open-source desktop apps across productivity, design, coding, communication, and multimedia.
What kinds of tasks are included?
Tasks include both atomic and multi-step interactions, ranging from simple two-step navigations to complex workflows involving 10 or more actions.
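Reusing the illustrative dataclasses sketched earlier, a minimal two-step navigation task might serialize along these lines (coordinates, labels, and file paths are invented for illustration):

```python
# A hypothetical two-step task: open the File menu, then click a menu item.
two_step_task = [
    ActionStep(
        event=GUIEvent.CLICK,
        target=ElementAnnotation(
            box=BoundingBox(x=12, y=4, width=48, height=22),
            text="File",
            group_id=1,            # menu-bar group
        ),
        keyframe="frames/0001.png",
    ),
    ActionStep(
        event=GUIEvent.CLICK,
        target=ElementAnnotation(
            box=BoundingBox(x=12, y=30, width=160, height=24),
            text="Open Recent",
            group_id=2,            # opened dropdown group
        ),
        keyframe="frames/0002.png",
    ),
]

assert not precheck_task(two_step_task)   # passes the hypothetical pre-screen
```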
Can the tasks be used for training and benchmarking?
Yes. These tasks are fully labeled and formatted for grounding, prediction, and general agent benchmarking.
What agreement is required to receive a sample?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How quickly is the sample delivered?
Within three business days after NDA execution.