Evaluating Computer-Use Agents With 900+ Paired Tasks and Structured Failure Modes

Turing built a computer-use evaluation dataset of 900+ structured tasks to benchmark how effectively AI systems execute long-horizon workflows and handle controlled error conditions. Each task captures full interaction telemetry, including screen recordings and event logs, and pairs correct executions with structured mistake variants.

900+

tasks delivered, with 450+ parent–child task pairs designed for controlled agent evaluation

80–200+

user actions captured per task to stress-test long-horizon agent execution

6

workflow domains covered, including education, general computing, productivity, data science, development, and creativity

Method: Dataset generation
Domain: Agent evaluation
Dataset scale: 900+ tasks
Capability: Data packs

The Challenge

The client needed a reliable way to evaluate agent systems beyond single-turn outputs. Traditional evaluations measure output correctness, but real-world agent performance depends on step-by-step execution across tools, operating systems, and multi-application workflows.

In production environments, agents do not fail in binary ways. They:

  • Fail to complete the core objective
  • Introduce unintended side effects
  • Misinterpret instructions while producing plausible outputs
  • Break late in long-horizon execution chains

A simple pass/fail metric cannot distinguish catastrophic failure from recoverable deviation. Without structured failure categorization, diagnosing performance gaps becomes guesswork.

The client required a dataset capable of:

  • Measuring mistake handling in realistic, multi-step workflows
  • Differentiating catastrophic failure from recoverable error
  • Preserving identical task context while introducing controlled deviations
  • Maintaining domain, OS, and tool diversity without sacrificing experimental control
  • Ensuring reproducibility through execution logging and structured metadata

This required a tightly governed task-generation process that preserved real-world fidelity while eliminating unintended variability.

The Approach

Turing implemented a structured parent–child task generation framework grounded in mistake taxonomy, domain balancing, and reproducible execution capture.

1. Structured task pairing

Each workflow was authored as a structured pair:

  • A Parent task representing a successful execution of its prompt
  • A Child task representing a different successful execution aligned to its own prompt 
  • Shared metadata, tools, and environment configuration

Both tasks are independently correct and deterministic when evaluated in isolation.

The evaluation signal is generated when prompt and execution are intentionally swapped:

  • Parent prompt + Child execution
  • Child prompt + Parent execution

These swapped combinations produce controlled failure cases, each associated with a predefined mistake type based on the discrepancy between requested objective and executed workflow.

This design enabled systematic mistake classification without embedding errors into the source tasks themselves.
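The pairing-and-swap step can be pictured with a short sketch. This is a minimal illustration, not the delivered pipeline: the Task fields, the build_pair_variants helper, and the assumption that each swapped direction carries its own predefined mistake label are all illustrative.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative task record; field names are assumptions, not the delivered schema."""
    task_id: str
    prompt: str
    execution_log: str         # pointer to the recorded trajectory (video + event log)
    mistake_type_on_swap: str  # predefined label applied when this execution answers the other prompt

def build_pair_variants(parent: Task, child: Task) -> list[dict]:
    """Expand one parent-child pair into two correct and two mistake-injected evaluation items."""
    return [
        # Aligned combinations: prompt and execution match, so each is an independently correct task.
        {"prompt": parent.prompt, "execution": parent.execution_log, "label": "correct"},
        {"prompt": child.prompt,  "execution": child.execution_log,  "label": "correct"},
        # Swapped combinations: the executed workflow no longer satisfies the requested objective,
        # yielding a controlled failure case with a predefined mistake type.
        {"prompt": parent.prompt, "execution": child.execution_log, "label": child.mistake_type_on_swap},
        {"prompt": child.prompt,  "execution": parent.execution_log, "label": parent.mistake_type_on_swap},
    ]
```

Applied across 450+ pairs, the two aligned items plus the two swapped items per pair account for the 1800+ evaluable tasks cited in the results.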

2. Structured mistake taxonomy

All mistake variants were categorized into one of three types:

  • Critical mistake: The core objective is not achieved and requires task restart
  • Bad side effect: The objective is achieved, but harmful or costly consequences are introduced
  • Misunderstanding of the instruction: The agent executes a plausible but incorrect interpretation that does not cause any harmful or costly consequences

A standardized core-objective test ensured consistent classification across annotators and reviewers. Mistake types were evenly distributed to prevent skewed evaluation signals.
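A minimal sketch of how the core-objective test could feed this three-way taxonomy follows; the two boolean inputs and the classify_mistake name are assumptions that compress the real rubric, which is not spelled out here.

```python
from enum import Enum

class MistakeType(Enum):
    CRITICAL = "critical_mistake"          # core objective not achieved; the task must be restarted
    BAD_SIDE_EFFECT = "bad_side_effect"    # objective achieved, but with harmful or costly consequences
    MISUNDERSTANDING = "misunderstanding"  # plausible but incorrect interpretation, without harm

def classify_mistake(core_objective_met: bool, harmful_side_effect: bool) -> MistakeType:
    """Map the standardized core-objective test onto one of the three mistake types.

    Assumes the input already describes a swapped (mistake-injected) combination;
    fully aligned executions are correct references and are never classified here.
    """
    if not core_objective_met:
        return MistakeType.CRITICAL
    if harmful_side_effect:
        return MistakeType.BAD_SIDE_EFFECT
    return MistakeType.MISUNDERSTANDING
```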

3. Full trajectory capture

Each task included:

  • A structured prompt
  • Explicit subtasks derived using standardized authoring logic
  • Full-screen video recording of execution
  • Event-level logs capturing clicks, keystrokes, and scrolls
  • Timestamped screenshots prior to each action
  • Metadata including domain, OS, tool, persona, and storyline

This telemetry enabled reproducible scoring and debugging of agent behavior.
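As a rough sketch of how one such record might be laid out (class and field names here are illustrative assumptions, not the delivered schema):

```python
from dataclasses import dataclass, field

@dataclass
class ActionEvent:
    """One low-level user action from the event log."""
    timestamp_ms: int
    kind: str             # "click" | "keystroke" | "scroll"
    target: str           # e.g., the UI element clicked or key pressed
    screenshot_path: str  # screenshot captured immediately before the action

@dataclass
class TaskRecord:
    """One task with full trajectory capture, mirroring the telemetry listed above."""
    task_id: str
    prompt: str
    subtasks: list[str]   # explicit subtasks derived from the authoring logic
    video_path: str       # full-screen recording of the execution
    events: list[ActionEvent] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)  # domain, OS, tool, persona, storyline
```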

4. Controlled complexity and environment distribution

Tasks were calibrated by action count (see the bucketing sketch after the lists below):

  • Easy: 80–125 actions
  • Medium: 125–175 actions
  • Hard: 175–225 actions

To ensure representativeness:

  • Tasks were distributed across six defined domains
  • Operating systems were evenly balanced across Windows, macOS, and Linux
  • Tool usage reflected real-world software distribution, with 40% open-source and 60% closed-source tools
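The sketch below shows how action counts could be bucketed into the three bands and how batch balance could be summarized. The half-open ranges at the overlapping boundaries (125, 175), the function names, and the metadata keys are all assumptions.

```python
from collections import Counter

def difficulty_band(action_count: int) -> str:
    """Assign a difficulty band from the number of recorded user actions."""
    if 80 <= action_count < 125:
        return "easy"
    if 125 <= action_count < 175:
        return "medium"
    if 175 <= action_count <= 225:
        return "hard"
    raise ValueError(f"action count {action_count} falls outside the calibrated 80-225 range")

def distribution_report(tasks: list[dict]) -> dict:
    """Summarize domain, OS, and tool-license balance for a batch of task metadata."""
    return {
        "domain":  Counter(t["domain"] for t in tasks),
        "os":      Counter(t["os"] for t in tasks),       # target: even split across Windows, macOS, Linux
        "license": Counter(t["license"] for t in tasks),  # target: ~40% open-source, ~60% closed-source
    }
```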

5. Multi-layer quality assurance

A rubric-based QA framework validated:

  • Prompt determinism and clarity
  • Logical subtask sequencing
  • Consistency between video trajectory and event logs
  • Accurate mistake-type labeling
  • Absence of redundant or inefficient steps

Tasks failing rubric standards were reworked or re-recorded to maintain dataset integrity.
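A compact way to picture the rubric is a set of named pass/fail checks run over each task record. The check names below mirror the criteria above, while the placeholder predicates and helper names are assumptions, not the production validators.

```python
from typing import Callable

RubricCheck = Callable[[dict], bool]  # each check inspects a task record and returns pass/fail

def run_rubric(task: dict, checks: dict[str, RubricCheck]) -> dict[str, bool]:
    """Run every rubric check against one task and report per-criterion results."""
    return {name: check(task) for name, check in checks.items()}

def needs_rework(results: dict[str, bool]) -> bool:
    """A task failing any rubric criterion is sent back to be reworked or re-recorded."""
    return not all(results.values())

# Placeholder predicates stand in for the real validators.
example_checks: dict[str, RubricCheck] = {
    "prompt_deterministic": lambda t: bool(t.get("prompt")),
    "subtasks_sequenced":   lambda t: len(t.get("subtasks", [])) > 0,
    "video_matches_events": lambda t: t.get("video_path") is not None and bool(t.get("events")),
    "mistake_label_valid":  lambda t: t.get("mistake_type") in
                            {None, "critical_mistake", "bad_side_effect", "misunderstanding"},
    "no_redundant_steps":   lambda t: not t.get("redundant_steps_flagged", False),
}
```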

Key Results

  • Delivered 900+ deterministic computer-use tasks across 450+ structured parent–child pairs
  • Enabled production of 1800+ evaluable tasks through controlled prompt–execution swapping
  • Established a mistake-aware evaluation design enabling granular failure-mode analysis
  • Standardized long-horizon telemetry capture across operating systems
  • Created a reproducible benchmark framework for scaling computer-use evaluation
  • Produced structured metadata suitable for downstream scoring, debugging, and model iteration

The Outcome

The resulting dataset supports rigorous evaluation of computer-use agents by:

  • Comparing correct and mistake-injected trajectories under controlled conditions
  • Distinguishing between objective failure, harmful side effects, and instruction misunderstanding
  • Providing complete telemetry for evaluation, debugging, and iterative improvement
  • Stress-testing agent reliability across heterogeneous OS and software ecosystems

This framework enables measuring progress in computer-use capability itself, rather than relying on surface-level task-completion metrics.

Need structured computer-use evaluation data with controlled failure modes?

Request a sample paired-task set including full trajectory logs, mistake classification, and QA rubric documentation.

Request Sample


FAQ

What types of tasks are included?

The dataset includes multi-step computer-use workflows spanning productivity, development, data science, education, creativity, and general computing domains.

How are mistakes categorized?

Each mistake variant introduces exactly one structured error type: critical mistake, bad side effect, or misunderstanding of the instruction.

Are full execution traces included?

Yes. Each task includes synchronized video recordings, event logs, timestamps, and structured subtasks.

Is this dataset designed for training or evaluation?

It is structured primarily for evaluation and failure-mode analysis, though it can inform post-training workflows.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Evaluating agents that operate across real software environments?

Request structured computer-use task pairs designed to surface objective failures and subtle side-effect errors.

Request Sample
