How Turing and ServiceNow Collaborated on UI-Vision: A New Benchmark for GUI-Based AI Agents

Turing Staff
28 May 2025 · 3 mins read
LLM training and enhancement
GenAI
Business and Research

Graphical user interfaces (GUIs) play a central role in how we engage with software, yet intelligent agents still struggle to use them reliably. That's where UI-Vision comes in. Developed by ServiceNow researchers and collaborators, this new benchmark is designed to change that. With Turing's support, the desktop-focused benchmark enables thorough evaluation of AI agents as they navigate the diverse and dynamic world of software environments.

This collaboration showcases Turing's commitment to enhancing AI infrastructure and paving the way for practical AGI applications by improving how models understand, interact with, and learn from software's visual workflows.

What Are GUIs, and Why Do They Matter?

GUIs are visual interfaces—toolbars, icons, dropdowns—that make modern software intuitive. Unlike web-based environments with structured HTML or mobile screens optimized for touch, desktop GUIs are complex, inconsistent, and more challenging for agents to interpret.

Today’s agents often struggle with visual grounding, spatial reasoning, and dynamic actions like drag-and-drop. These limitations have slowed progress in building fully autonomous systems that can navigate software like humans.

Introducing UI-Vision

UI-Vision is the first large-scale benchmark specifically designed to evaluate intelligent agents on desktop GUI tasks. Created by ServiceNow researchers in collaboration with academic and industry partners like Turing, UI-Vision covers 83 real-world applications across categories such as productivity, development, creativity, and more.

What makes UI-Vision different?

  • Dense, Human-Annotated Data: Every GUI interaction—including clicks, drags, and keystrokes—is paired with bounding boxes and labels (a sketch of what one record might look like appears at the end of this section).
  • Three Core Tasks:
      • Element Grounding: Locate specific UI elements from a query.
      • Layout Grounding: Identify functional regions (e.g., toolbar or navigation pane).
      • Action Prediction: Given a task, predict the agent’s next interaction.

These tasks help assess an agent’s perception, reasoning, and action capabilities, laying the foundation for more capable, adaptive AI.
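
To make the data format more concrete, here is a minimal sketch of what a single annotation record and an element-grounding sample might look like. The class and field names are illustrative assumptions for this post, not the actual UI-Vision schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative only: these classes and field names are assumptions,
# not the actual UI-Vision schema.

@dataclass
class GUIAnnotation:
    """One human-annotated GUI interaction captured during a recorded task."""
    screenshot: str                         # path to the captured frame
    element_label: str                      # e.g., "Export as PDF button"
    bbox: Tuple[int, int, int, int]         # (x_min, y_min, x_max, y_max) in pixels
    action: str                             # e.g., "click", "drag", "type"
    keystrokes: Optional[str] = None        # text entered, if the action is typing

@dataclass
class GroundingSample:
    """An element-grounding sample: locate the element described by the query."""
    screenshot: str
    query: str                              # natural-language description of the target
    target_bbox: Tuple[int, int, int, int]  # ground-truth bounding box

# Example sample, purely for illustration.
sample = GroundingSample(
    screenshot="frames/app_042/step_03.png",
    query="the Save button in the top toolbar",
    target_bbox=(112, 18, 164, 46),
)
```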

Turing’s Role: Human Intelligence at Scale

Turing has been essential in crafting the high-quality training and evaluation data that powers UI-Vision. Our global annotation team, based primarily in India and Latin America, skillfully managed the most intricate aspects of developing this benchmark.

Key Contributions:

  • Expert Annotation: Annotators with degrees in engineering and data science received training to perform and record tasks across 83 applications.
  • Layered QA Process: Each annotation undergoes a rigorous, multi-stage review pipeline, including verification by subject matter experts and project leads.
  • Diversity by Design: The annotation team was intentionally global, allowing the dataset to reflect how users from various geographies interact with software, thus ensuring broader applicability for downstream models.

This human-in-the-loop effort reflects Turing’s belief that human intelligence is a core differentiator in building real-world AI systems.

Why UI-Vision Matters for AGI

UI-Vision enables structured, measurable evaluation of GUI agents across three levels of understanding: identifying elements, interpreting layouts, and predicting interactions.
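
As a rough illustration of how researchers might score one of these tasks, the sketch below computes a simple pointing-accuracy metric for element grounding: a prediction counts as correct when the predicted click point falls inside the ground-truth bounding box. This is a common convention in GUI grounding work and an assumption here, not UI-Vision's published evaluation protocol.

```python
from typing import Sequence, Tuple

# Hypothetical scoring sketch: pointing accuracy for element grounding.
# Assumes predictions are (x, y) click points and targets are pixel boxes.

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
Point = Tuple[int, int]           # (x, y) predicted click location

def point_in_box(point: Point, box: BBox) -> bool:
    """Return True if the predicted point lands inside the ground-truth box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

def grounding_accuracy(predictions: Sequence[Point], targets: Sequence[BBox]) -> float:
    """Fraction of samples where the predicted point hits the target element."""
    hits = sum(point_in_box(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets) if targets else 0.0

# Example: two correct predictions out of three.
preds = [(130, 30), (400, 250), (12, 700)]
gts = [(112, 18, 164, 46), (380, 240, 420, 270), (600, 50, 660, 90)]
print(f"element grounding accuracy: {grounding_accuracy(preds, gts):.2f}")  # 0.67
```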

Potential Impacts:

  • Better Training Signals: Developers can train agents on a broad range of real-world desktop software, beyond browser-based tasks.
  • Agent Benchmarking: Researchers can now compare model performance across standard tasks, highlighting areas where systems fall short (e.g., drag actions or spatial reasoning).
  • Enterprise Readiness: Desktop software is prevalent across various industries, and UI-Vision brings us closer to deploying agents in finance, design, logistics, and more.

Limitations:
UI-Vision is currently an offline benchmark, meaning it doesn't evaluate real-time interactions or explore multi-agent collaboration. Future extensions could address these gaps.

Conclusion

UI-Vision is crucial in assessing and enhancing intelligent agents in complex desktop environments. Turing’s role in powering the annotation process with rigorous human oversight highlights our commitment to improving AI infrastructure, enabling practical applications, and boosting real-world model performance. 

This collaboration demonstrates how Turing AGI Advancement and Turing Intelligence contribute to advancing AI, linking foundational research with applied enterprise impact.
