Evaluating Computer-Use Agents With 900+ Paired Tasks and Structured Failure Modes

Turing built a computer-use evaluation dataset of 900+ structured tasks to benchmark how effectively AI systems execute long-horizon workflows and handle controlled error conditions. Each task captures full interaction telemetry, including screen recordings and event logs, and pairs correct executions with structured mistake variants.

900+

tasks delivered, with 450+ parent–child task pairs designed for controlled agent evaluation

80–200+

user actions captured per task to stress-test long-horizon agent execution

6

workflow domains covered, including education, general computing, productivity, data science, development, and creativity

Method: Dataset generation
Domain: Agent evaluation
Dataset scale: 900+ tasks
Capability: Data packs

The Challenge

The client needed a reliable way to evaluate agent systems beyond single-turn outputs. Traditional evaluations measure output correctness, but real-world agent performance depends on step-by-step execution across tools, operating systems, and multi-application workflows.

In production environments, agents do not fail in binary ways. They:

  • Fail to complete the core objective
  • Introduce unintended side effects
  • Misinterpret instructions while producing plausible outputs
  • Break late in long-horizon execution chains

A simple pass/fail metric cannot distinguish catastrophic failure from recoverable deviation. Without structured failure categorization, diagnosing performance gaps becomes guesswork.

The client required a dataset capable of:

  • Measuring mistake handling in realistic, multi-step workflows
  • Differentiating catastrophic failure from recoverable error
  • Preserving identical task context while introducing controlled deviations
  • Maintaining domain, OS, and tool diversity without sacrificing experimental control
  • Ensuring reproducibility through execution logging and structured metadata

This required a tightly governed task-generation process that preserved real-world fidelity while eliminating unintended variability.

The Approach

Turing implemented a structured parent–child task generation framework grounded in mistake taxonomy, domain balancing, and reproducible execution capture.

1. Structured task pairing

Each workflow was authored as a structured pair:

  • A Parent task representing a successful execution of its prompt
  • A Child task representing a different successful execution aligned to its own prompt 
  • Shared metadata, tools, and environment configuration

Both tasks are independently correct and deterministic when evaluated in isolation.

The evaluation signal is generated when prompt and execution are intentionally swapped:

  • Parent prompt + Child execution
  • Child prompt + Parent execution

These swapped combinations produce controlled failure cases, each associated with a predefined mistake type based on the discrepancy between requested objective and executed workflow.

This design enabled systematic mistake classification without embedding errors into the source tasks themselves.
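The pairing-and-swap step can be pictured with a short sketch. This is a minimal illustration, not the delivered pipeline: the Task fields, the build_pair_variants helper, and the assumption that each swapped direction carries its own predefined mistake label are all illustrative.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative task record; field names are assumptions, not the delivered schema."""
    task_id: str
    prompt: str
    execution_log: str         # pointer to the recorded trajectory (video + event log)
    mistake_type_on_swap: str  # predefined label applied when this execution answers the other prompt

def build_pair_variants(parent: Task, child: Task) -> list[dict]:
    """Expand one parent-child pair into two correct and two mistake-injected evaluation items."""
    return [
        # Aligned combinations: prompt and execution match, so each is an independently correct task.
        {"prompt": parent.prompt, "execution": parent.execution_log, "label": "correct"},
        {"prompt": child.prompt,  "execution": child.execution_log,  "label": "correct"},
        # Swapped combinations: the executed workflow no longer satisfies the requested objective,
        # yielding a controlled failure case with a predefined mistake type.
        {"prompt": parent.prompt, "execution": child.execution_log, "label": child.mistake_type_on_swap},
        {"prompt": child.prompt,  "execution": parent.execution_log, "label": parent.mistake_type_on_swap},
    ]
```

Applied across 450+ pairs, the two aligned items plus the two swapped items per pair account for the 1800+ evaluable tasks cited in the results.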

2. Structured mistake taxonomy

All mistake variants were categorized into one of three types:

  • Critical mistake: The core objective is not achieved and requires task restart
  • Bad side effect: The objective is achieved, but harmful or costly consequences are introduced
  • Misunderstanding of the instruction: The agent executes a plausible but incorrect interpretation that does not cause any harmful or costly consequences

A standardized core-objective test ensured consistent classification across annotators and reviewers. Mistake types were evenly distributed to prevent skewed evaluation signals.
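A minimal sketch of how the core-objective test could feed this three-way taxonomy follows; the two boolean inputs and the classify_mistake name are assumptions that compress the real rubric, which is not spelled out here.

```python
from enum import Enum

class MistakeType(Enum):
    CRITICAL = "critical_mistake"          # core objective not achieved; the task must be restarted
    BAD_SIDE_EFFECT = "bad_side_effect"    # objective achieved, but with harmful or costly consequences
    MISUNDERSTANDING = "misunderstanding"  # plausible but incorrect interpretation, without harm

def classify_mistake(core_objective_met: bool, harmful_side_effect: bool) -> MistakeType:
    """Map the standardized core-objective test onto one of the three mistake types.

    Assumes the input already describes a swapped (mistake-injected) combination;
    fully aligned executions are correct references and are never classified here.
    """
    if not core_objective_met:
        return MistakeType.CRITICAL
    if harmful_side_effect:
        return MistakeType.BAD_SIDE_EFFECT
    return MistakeType.MISUNDERSTANDING
```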

3. Full trajectory capture

Each task included:

  • A structured prompt
  • Explicit subtasks derived using standardized authoring logic
  • Full-screen video recording of execution
  • Event-level logs capturing clicks, keystrokes, and scrolls
  • Timestamped screenshots prior to each action
  • Metadata including domain, OS, tool, persona, and storyline

This telemetry enabled reproducible scoring and debugging of agent behavior.
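As a rough sketch of how one such record might be laid out (class and field names here are illustrative assumptions, not the delivered schema):

```python
from dataclasses import dataclass, field

@dataclass
class ActionEvent:
    """One low-level user action from the event log."""
    timestamp_ms: int
    kind: str             # "click" | "keystroke" | "scroll"
    target: str           # e.g., the UI element clicked or key pressed
    screenshot_path: str  # screenshot captured immediately before the action

@dataclass
class TaskRecord:
    """One task with full trajectory capture, mirroring the telemetry listed above."""
    task_id: str
    prompt: str
    subtasks: list[str]   # explicit subtasks derived from the authoring logic
    video_path: str       # full-screen recording of the execution
    events: list[ActionEvent] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)  # domain, OS, tool, persona, storyline
```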

4. Controlled complexity and environment distribution

Tasks were calibrated by action count (see the bucketing sketch after the lists below):

  • Easy: 80–125 actions
  • Medium: 125–175 actions
  • Hard: 175–225 actions

To ensure representativeness:

  • Tasks were distributed across six defined domains
  • Operating systems were evenly balanced across Windows, macOS, and Linux
  • Tool usage reflected real-world software distribution, with 40% open-source and 60% closed-source tools
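The sketch below shows how action counts could be bucketed into the three bands and how batch balance could be summarized. The half-open ranges at the overlapping boundaries (125, 175), the function names, and the metadata keys are all assumptions.

```python
from collections import Counter

def difficulty_band(action_count: int) -> str:
    """Assign a difficulty band from the number of recorded user actions."""
    if 80 <= action_count < 125:
        return "easy"
    if 125 <= action_count < 175:
        return "medium"
    if 175 <= action_count <= 225:
        return "hard"
    raise ValueError(f"action count {action_count} falls outside the calibrated 80-225 range")

def distribution_report(tasks: list[dict]) -> dict:
    """Summarize domain, OS, and tool-license balance for a batch of task metadata."""
    return {
        "domain":  Counter(t["domain"] for t in tasks),
        "os":      Counter(t["os"] for t in tasks),       # target: even split across Windows, macOS, Linux
        "license": Counter(t["license"] for t in tasks),  # target: ~40% open-source, ~60% closed-source
    }
```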

5. Multi-layer quality assurance

A rubric-based QA framework validated:

  • Prompt determinism and clarity
  • Logical subtask sequencing
  • Consistency between video trajectory and event logs
  • Accurate mistake-type labeling
  • Absence of redundant or inefficient steps

Tasks failing rubric standards were reworked or re-recorded to maintain dataset integrity.
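A compact way to picture the rubric is a set of named pass/fail checks run over each task record. The check names below mirror the criteria above, while the placeholder predicates and helper names are assumptions, not the production validators.

```python
from typing import Callable

RubricCheck = Callable[[dict], bool]  # each check inspects a task record and returns pass/fail

def run_rubric(task: dict, checks: dict[str, RubricCheck]) -> dict[str, bool]:
    """Run every rubric check against one task and report per-criterion results."""
    return {name: check(task) for name, check in checks.items()}

def needs_rework(results: dict[str, bool]) -> bool:
    """A task failing any rubric criterion is sent back to be reworked or re-recorded."""
    return not all(results.values())

# Placeholder predicates stand in for the real validators.
example_checks: dict[str, RubricCheck] = {
    "prompt_deterministic": lambda t: bool(t.get("prompt")),
    "subtasks_sequenced":   lambda t: len(t.get("subtasks", [])) > 0,
    "video_matches_events": lambda t: t.get("video_path") is not None and bool(t.get("events")),
    "mistake_label_valid":  lambda t: t.get("mistake_type") in
                            {None, "critical_mistake", "bad_side_effect", "misunderstanding"},
    "no_redundant_steps":   lambda t: not t.get("redundant_steps_flagged", False),
}
```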

Key Results

  • Delivered 900+ deterministic computer-use tasks across 450+ structured parent–child pairs
  • Enabled production of 1800+ evaluable tasks through controlled prompt–execution swapping
  • Established a mistake-aware evaluation design enabling granular failure-mode analysis
  • Standardized long-horizon telemetry capture across operating systems
  • Created a reproducible benchmark framework for scaling computer-use evaluation
  • Produced structured metadata suitable for downstream scoring, debugging, and model iteration

The Outcome

The resulting dataset supports rigorous evaluation of computer-use agents by:

  • Comparing correct and mistake-injected trajectories under controlled conditions
  • Distinguishing between objective failure, harmful side effects, and instruction misunderstanding
  • Providing complete telemetry for evaluation, debugging, and iterative improvement
  • Stress-testing agent reliability across heterogeneous OS and software ecosystems

This framework enables measuring progress in computer-use capability itself, rather than relying on surface-level task-completion metrics.

Need structured computer-use evaluation data with controlled failure modes?

Request a sample paired-task set including full trajectory logs, mistake classification, and QA rubric documentation.

Request Sample


FAQ

What types of tasks are included?

The dataset includes multi-step computer-use workflows spanning productivity, development, data science, education, creativity, and general computing domains.

How are mistakes categorized?

Each mistake variant introduces exactly one structured error type: critical mistake, bad side effect, or misunderstanding of the instruction.

Are full execution traces included?

Yes. Each task includes synchronized video recordings, event logs, timestamps, and structured subtasks.

Is this dataset designed for training or evaluation?

It is structured primarily for evaluation and failure-mode analysis, though it can inform post-training workflows.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Evaluating agents that operate across real software environments?

Request structured computer-use task pairs designed to surface objective failures and subtle side-effect errors.

Request Sample
