Creating 10,000+ Supervised GUI Tasks to Train General-Purpose Computer Agents

Built a dataset of GUI-based human-computer interaction tasks for pretraining and aligning a general-purpose computer-use agent. The dataset spans applications, operating systems, and task intents and combines timestamped screen recordings, step-by-step action logs, and action metadata.

10,000+

annotated GUI tasks with action logs, screenshots, and metadata.

5–100

action steps per task, with complexity governed by task design rules.

4

application intents covered: Office, Daily, Professional, and System tasks across real-world use cases.

Method: Dataset creation
Domain: GUI workflows
Dataset scale: 10,000+ tasks
Capability: Data Packs

The Challenge

Training a general-purpose GUI agent requires grounded, executable demonstrations of how people use real applications, including subtle decisions, inter-application handoffs, and UI interaction reasoning. The client needed:

  • Thousands of tasks with consistent structure, timestamped actions, and metadata
  • Realistic use-case prompts with both fully specified and underspecified trajectories
  • Technically fluent annotators to ensure coverage of edge cases
  • Granular QA systems to flag irrelevant actions, inaccurate coordinates, or vague prompts

The client required a dataset that could support both pretraining and fine-tuning of agents able to complete open-ended, real-world tasks across a wide GUI surface.

The Approach

Turing deployed a global network of expert trainers across Asia and Latin America, all vetted for analytical ability and tool fluency. Tasks were designed, executed, and reviewed using Turing’s internal labeling platform with custom QA automation and structured logging.

Dataset components

Each task included:

  • Prompt: human-authored, practical task framed as a user request
  • Video recording: screen-level task execution
  • Action log: all key events (click, scroll, type, drag, hotkey, press) with timestamps and coordinates
  • Screenshots: two per action (t=0 and t=settled), capturing GUI state
  • Metadata: application name, operating system, prompt, category (application intent), and domain (technical/non-technical)
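The components above can be sketched as a single task record. This is an illustrative sketch only; the field names, values, and file paths are assumptions for readability, not the dataset's actual schema.

```python
# Illustrative task record (field names and values are assumptions,
# not the dataset's actual schema).
task = {
    "prompt": "Export the Q3 budget sheet as a PDF and email it to finance",
    "video": "recordings/task_00042.mp4",
    "actions": [
        {
            "type": "click",            # click | scroll | type | drag | hotkey | press
            "timestamp_ms": 1830,
            "coordinates": (412, 287),  # screen-space (x, y)
            "screenshots": {
                "t0": "frames/00042_003_t0.png",        # GUI state as the action fires
                "settled": "frames/00042_003_set.png",  # GUI state once it settles
            },
        },
    ],
    "metadata": {
        "application": "LibreOffice Calc",
        "os": "Linux",
        "category": "office",     # application intent
        "domain": "non-technical",
    },
}

# Complexity follows the dataset's step-count split described below.
complexity = "basic" if len(task["actions"]) < 31 else "advanced"
```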

Tasks were distributed across:

  • Operating systems: macOS, Windows, Linux
  • Application intents: office, daily use, professional tools, and system tools
  • Task complexity: basic (fewer than 31 steps) and advanced (31 or more steps)
  • Application count: single-application, two-application, and multi-application workflows

QA process

Turing implemented a multi-layered QA and compliance framework:

  • Real-time automated checks: caught format, coordinate, or description-level issues
  • Manual review of all tasks: graded using a rubric spanning instruction following, action accuracy, metadata, grammar, and screenshot relevance
  • Spot checks and senior audits: validated consistency, complexity, and metadata coverage
  • Regular calibration sessions: improved reviewer alignment and minimized drift over time
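A real-time automated check of the kind described above can be sketched as a validator over a task record. The rules and thresholds here are assumptions for illustration; the production QA pipeline is not described in this case study.

```python
# Minimal sketch of an automated QA check (assumed rules and thresholds;
# not the production pipeline).
SCREEN_W, SCREEN_H = 2560, 1440
VALID_TYPES = {"click", "scroll", "type", "drag", "hotkey", "press"}

def qa_flags(task):
    """Return a list of issue strings for one task record."""
    flags = []
    # Description-level check: flag prompts too short to be actionable.
    if len(task.get("prompt", "").split()) < 5:
        flags.append("vague prompt")
    prev_ts = -1
    for i, action in enumerate(task.get("actions", [])):
        # Format check: every action must use a known event type.
        if action.get("type") not in VALID_TYPES:
            flags.append(f"action {i}: unknown type")
        # Coordinate check: clicks must land on the recorded screen.
        x, y = action.get("coordinates", (-1, -1))
        if not (0 <= x < SCREEN_W and 0 <= y < SCREEN_H):
            flags.append(f"action {i}: coordinates off-screen")
        # Timestamps must strictly increase along the trajectory.
        if action.get("timestamp_ms", 0) <= prev_ts:
            flags.append(f"action {i}: non-monotonic timestamp")
        prev_ts = action.get("timestamp_ms", 0)
    return flags
```

A task that passes returns an empty list; anything flagged would be routed to the manual review queue described above.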

Key Results

  • Created more than 10,000 pretraining-quality GUI tasks across diverse real-world applications
  • Captured fully structured action sequences with screenshots and metadata 
  • Delivered OS-level and application-level diversity, supporting real agent generalization
  • Flagged and filtered low-complexity or underspecified prompts using automated QA

The Outcome

The resulting dataset powers agent training workflows and instruction-following evaluations. It is used to:

  • Train LLM agents on grounded UI interaction sequences
  • Evaluate model reasoning through action descriptions and GUI logs
  • Fine-tune systems on multistep planning and intent realization
  • Support research on underspecified prompts, trajectory generalization, and intent satisfaction

Want to train agents on real GUI workflows?

Request a sample with a realistic user prompt, action log, timestamps, screenshots, and full task metadata covering operating system, application, complexity, and trajectory.

Request Sample


FAQ

What kinds of applications are included?

Office tools, browsers, system settings, design tools, spreadsheets, and more across Windows, macOS, and Linux.

How are tasks structured?

Each task includes a prompt, an action log with timestamps and coordinates, and a screenshot timeline capturing GUI state before and after each action.

How is this data used?

The dataset supports GUI agent pretraining, simulation, reward modeling, and general task grounding.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Need real-world human-computer interaction data?

Request a sample with step-level data including screenshots and timestamps.

Request Sample