Creating 10,000+ Supervised GUI Tasks to Train General-Purpose Computer Agents

Built a dataset of GUI-based human-computer interaction tasks for pretraining and aligning a general-purpose computer-use agent. The dataset spans applications, operating systems, and task intents and combines timestamped screen recordings, step-by-step action logs, and action metadata.

10,000+

annotated GUI tasks with action logs, screenshots, and metadata.

5–100

action steps per task, with complexity governed by task design rules.

4

application intents covered: Office, Daily, Professional, and System tasks across real-world use cases.

Method: Dataset creation
Domain: GUI workflows
Dataset scale: 10,000+ tasks
Capability: Data Packs

The Challenge

Training a general-purpose GUI agent requires grounded, executable demonstrations of how people use real applications, including subtle decisions, inter-application handoffs, and UI interaction reasoning. The client needed:

  • Thousands of tasks with consistent structure, timestamped actions, and metadata
  • Realistic use-case prompts with both fully specified and underspecified trajectories
  • Technically fluent annotators to ensure coverage of edge cases
  • Granular QA systems to flag irrelevant actions, inaccurate coordinates, or vague prompts

The client required a dataset that could support both pretraining and fine-tuning of agents able to complete open-ended, real-world tasks across a wide GUI surface.

The Approach

Turing deployed a global network of expert trainers across Asia and Latin America, all vetted for analytical ability and tool fluency. Tasks were designed, executed, and reviewed using Turing’s internal labeling platform with custom QA automation and structured logging.

Dataset components

Each task included:

  • Prompt: human-authored, practical task framed as a user request
  • Video recording: screen-level task execution
  • Action log: all key events (click, scroll, type, drag, hotkey, press) with timestamps and coordinates
  • Screenshots: two per action (t=0 and t=settled), capturing GUI state
  • Metadata: application name, operating system, prompt, category (application intent), and domain (technical/non-technical)
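The components above can be sketched as a single task record. This is an illustrative sketch only; the field names, values, and file paths are assumptions for readability, not the dataset's actual schema.

```python
# Illustrative task record (field names and values are assumptions,
# not the dataset's actual schema).
task = {
    "prompt": "Export the Q3 budget sheet as a PDF and email it to finance",
    "video": "recordings/task_00042.mp4",
    "actions": [
        {
            "type": "click",            # click | scroll | type | drag | hotkey | press
            "timestamp_ms": 1830,
            "coordinates": (412, 287),  # screen-space (x, y)
            "screenshots": {
                "t0": "frames/00042_003_t0.png",        # GUI state as the action fires
                "settled": "frames/00042_003_set.png",  # GUI state once it settles
            },
        },
    ],
    "metadata": {
        "application": "LibreOffice Calc",
        "os": "Linux",
        "category": "office",     # application intent
        "domain": "non-technical",
    },
}

# Complexity follows the dataset's step-count split described below.
complexity = "basic" if len(task["actions"]) < 31 else "advanced"
```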

Tasks were distributed across:

  • Operating systems: macOS, Windows, Linux
  • Application intents: office, daily use, professional tools, and system tools
  • Task complexity: basic (fewer than 31 steps) and advanced (31 or more steps)
  • Application count: single-application, two-application, and multi-application workflows

QA process

Turing implemented a multi-layered QA and compliance framework:

  • Real-time automated checks: caught format, coordinate, or description-level issues
  • Manual review of all tasks: graded using a rubric spanning instruction following, action accuracy, metadata, grammar, and screenshot relevance
  • Spot checks and senior audits: validated consistency, complexity, and metadata coverage
  • Regular calibration sessions: improved reviewer alignment and minimized drift over time
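A real-time automated check of the kind described above can be sketched as a validator over a task record. The rules and thresholds here are assumptions for illustration; the production QA pipeline is not described in this case study.

```python
# Minimal sketch of an automated QA check (assumed rules and thresholds;
# not the production pipeline).
SCREEN_W, SCREEN_H = 2560, 1440
VALID_TYPES = {"click", "scroll", "type", "drag", "hotkey", "press"}

def qa_flags(task):
    """Return a list of issue strings for one task record."""
    flags = []
    # Description-level check: flag prompts too short to be actionable.
    if len(task.get("prompt", "").split()) < 5:
        flags.append("vague prompt")
    prev_ts = -1
    for i, action in enumerate(task.get("actions", [])):
        # Format check: every action must use a known event type.
        if action.get("type") not in VALID_TYPES:
            flags.append(f"action {i}: unknown type")
        # Coordinate check: clicks must land on the recorded screen.
        x, y = action.get("coordinates", (-1, -1))
        if not (0 <= x < SCREEN_W and 0 <= y < SCREEN_H):
            flags.append(f"action {i}: coordinates off-screen")
        # Timestamps must strictly increase along the trajectory.
        if action.get("timestamp_ms", 0) <= prev_ts:
            flags.append(f"action {i}: non-monotonic timestamp")
        prev_ts = action.get("timestamp_ms", 0)
    return flags
```

A task that passes returns an empty list; anything flagged would be routed to the manual review queue described above.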

Key Results

  • Created more than 10,000 pretraining-quality GUI tasks across diverse real-world applications
  • Captured fully structured action sequences with screenshots and metadata 
  • Delivered OS-level and application-level diversity, supporting real agent generalization
  • Flagged and filtered low-complexity or underspecified prompts using automated QA

The Outcome

The resulting dataset powers agent training workflows and instruction-following evaluations. It is used to:

  • Train LLM agents on grounded UI interaction sequences
  • Evaluate model reasoning through action descriptions and GUI logs
  • Fine-tune systems on multistep planning and intent realization
  • Support research on underspecified prompts, trajectory generalization, and intent satisfaction

Want to train agents on real GUI workflows?

Request a sample with a realistic user prompt, action log, timestamps, screenshots, and full task metadata covering operating system, application, complexity, and trajectory.

Request Sample


FAQ

What kinds of applications are included?

Office tools, browsers, system settings, design tools, spreadsheets, and more across Windows, macOS, and Linux.

How are tasks structured?

Each task includes a prompt, an action log with timestamps and coordinates, and a screenshot timeline capturing GUI state before and after each action.

How is this data used?

The dataset supports GUI agent pretraining, simulation, reward modeling, and general task grounding.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Need real-world human-computer interaction data?

Request a sample with step-level data including screenshots and timestamps.

Request Sample