Powering ServiceNow’s EnterpriseOps-Gym: Benchmarking Enterprise Agents With Execution-Grounded Workflows

ServiceNow partnered with Turing to create EnterpriseOps-Gym, a benchmark designed to evaluate enterprise agents operating across realistic, multi-system workflows. The work focused on designing structured prompts, execution trajectories, and evaluation pipelines that measure how well models plan, execute, and adhere to policies in production-like environments.

  • 1,000+ evaluated prompts spanning enterprise workflows, with logged trajectories, tool interactions, and verifier scripts.
  • 7–30 step execution paths represented across prompts, enabling controlled tests of long-horizon planning difficulty.
  • 8 enterprise domains covered: HR, CSM, ITSM, Email, Calendar, Drive, Teams, and Hybrid workflows.

Method: Dataset generation
Domain: Agent evaluation
Dataset scale: 1,000+ tasks
Capability: Data packs
The Challenge

Enterprise agents must operate across systems with stateful actions, strict policies, and error propagation across workflows. Existing benchmarks focus on short tool sequences or static tasks and fail to capture:

  • Long-horizon planning across multiple steps
  • Stateful interactions with persistent system changes
  • Cross-domain orchestration between tools and data systems
  • Policy adherence and access constraints in enterprise environments

ServiceNow aimed to build a benchmark that could evaluate whether agents can reliably complete real workflows, not just generate correct responses.

The Approach

Turing contributed to the EnterpriseOps-Gym benchmark by designing structured evaluation tasks, execution trajectories, and validation mechanisms aligned with real enterprise workflows.

1. Enterprise workflow prompt design

Turing designed 1,000+ prompts grounded in real operational scenarios, including:

  • Customer service case management
  • HR workflows such as onboarding and case handling
  • IT service operations and incident resolution
  • Productivity workflows across Email, Calendar, Drive, and Teams
  • Hybrid workflows requiring cross-domain coordination

Each prompt required agents to retrieve data, apply policies, and execute multi-step workflows.
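As a rough illustration (the benchmark's actual schema is not public), a single task record might pair the prompt with its domain, the policies in force, and a pointer to its verifier. All field names and values below are hypothetical.

```python
# Hypothetical sketch of one EnterpriseOps-Gym-style task record.
# The real schema is not public; every field name here is an assumption.
task = {
    "task_id": "csm-0142",                       # illustrative identifier
    "domain": "CSM",                             # one of the 8 enterprise domains
    "prompt": (
        "A customer reports a billing error on case CS0041337. "
        "Verify their entitlement, update the case with your findings, "
        "and escalate to Tier 2 if the disputed amount exceeds policy limits."
    ),
    "policies": ["entitlement_check_required", "tier2_escalation_threshold"],
    "expected_steps": 12,                        # within the 7-30 step range
    "verifier": "verifiers/csm-0142_check.sql",  # deterministic final-state check
}
```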

2. Long-horizon execution modeling

Tasks were constructed with execution paths of 7 to 30 steps, requiring:

  • Sequential tool invocation
  • Dependency resolution across steps
  • State updates across systems
  • Correct ordering of actions

This allowed controlled evaluation of how performance changes as planning complexity increases.
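One way to make dependency resolution and action ordering mechanically checkable is to model each task as a dependency graph over steps; any valid trajectory must respect the resulting partial order. The sketch below uses Python's standard-library graphlib with hypothetical step and tool names, and is not the benchmark's actual representation.

```python
# Hypothetical sketch: a multi-step task as a dependency graph, so that
# correct ordering of actions can be checked mechanically.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+

@dataclass
class Step:
    name: str
    tool: str
    depends_on: list[str] = field(default_factory=list)

steps = [
    Step("lookup_case", tool="csm.get_case"),
    Step("check_entitlement", tool="hr.get_entitlement", depends_on=["lookup_case"]),
    Step("update_case", tool="csm.update_case", depends_on=["check_entitlement"]),
    Step("escalate", tool="csm.escalate", depends_on=["update_case"]),
]

# TopologicalSorter raises CycleError if the declared dependencies are
# inconsistent, and static_order() yields one valid execution order.
graph = {s.name: set(s.depends_on) for s in steps}
print(list(TopologicalSorter(graph).static_order()))
# ['lookup_case', 'check_entitlement', 'update_case', 'escalate']
```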

3. Expert reference trajectories

Turing produced expert-grounded execution paths for each task, including:

  • Step-by-step reasoning aligned to system constraints
  • Tool calls with parameters and expected outcomes
  • Logged execution traces capturing system interactions

These trajectories served as reference standards for evaluating model behavior.
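In sketch form, a reference trajectory might look like the following, with hypothetical tool names and fields: each entry pairs the expert's reasoning with a tool call, its parameters, and the outcome expected from the system.

```python
# Illustrative only: a minimal shape for an expert reference trajectory.
# Tool names, arguments, and fields are assumptions, not the actual format.
reference_trajectory = [
    {
        "step": 1,
        "reasoning": "Retrieve the case to confirm the reported billing error.",
        "tool_call": {"name": "csm.get_case", "args": {"case_id": "CS0041337"}},
        "expected_outcome": {"state": "open", "category": "billing"},
    },
    {
        "step": 2,
        "reasoning": "Policy requires an entitlement check before any update.",
        "tool_call": {"name": "hr.get_entitlement", "args": {"account": "ACME"}},
        "expected_outcome": {"entitled": True},
    },
    # ...remaining steps follow the same pattern, up to 30 per task
]
```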

4. Verifier-based evaluation framework

Rather than judging model outputs alone, each task was validated with deterministic verification scripts:

  • SQL-based checks confirmed whether workflows completed correctly
  • Verification covered task completion, system state integrity, policy compliance, and side effects
  • Multiple valid execution paths were allowed as long as the final state conditions were satisfied

This ensured that evaluation measured actual execution success, not just plausible responses.
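A minimal sketch of such a verifier follows, assuming the environment's final state is queryable as SQL; sqlite3 stands in here for the real backing store, and all table and column names are hypothetical.

```python
# Hypothetical verifier: deterministic checks against the final system state.
import sqlite3

def verify_final_state(db_path: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        # Task completion: the target case must have been escalated to Tier 2.
        (escalated,) = conn.execute(
            "SELECT COUNT(*) FROM cases "
            "WHERE case_id = 'CS0041337' AND tier = 2 AND state = 'escalated'"
        ).fetchone()
        # Side effects: no other case may have been modified along the way.
        (collateral,) = conn.execute(
            "SELECT COUNT(*) FROM audit_log "
            "WHERE case_id != 'CS0041337' AND action = 'update'"
        ).fetchone()
        # Any execution path passes, provided these final-state conditions hold.
        return escalated == 1 and collateral == 0
    finally:
        conn.close()
```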

5. Multi-layer quality assurance

Turing applied a structured QA pipeline:

  • Reviewer audits for task correctness and realism
  • Feasibility checks to ensure tasks were solvable within the environment
  • Consistency validation across prompts, trajectories, and verification logic

This ensured that tasks reflected real enterprise workflows and supported reliable benchmarking.
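To give a feel for the automated portion of such a pipeline, the sketch below runs two of the checks described above on a task's trajectory: feasibility (every tool the trajectory calls exists in the environment) and consistency (the step count stays within the 7–30 range). Field names mirror the earlier sketches and are likewise assumptions. Reviewer audits would sit on top of checks like these, since realism judgments cannot be fully automated.

```python
# Hypothetical QA checks over a task bundle (prompt, trajectory, verifier).
def qa_check(trajectory: list[dict], available_tools: set[str]) -> list[str]:
    issues = []
    # Feasibility: every tool the expert trajectory invokes must exist.
    for entry in trajectory:
        tool = entry["tool_call"]["name"]
        if tool not in available_tools:
            issues.append(f"step {entry['step']}: unknown tool {tool!r}")
    # Consistency: step count must stay within the benchmark's 7-30 range.
    if not 7 <= len(trajectory) <= 30:
        issues.append(f"trajectory has {len(trajectory)} steps, outside 7-30")
    return issues
```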

Key Results

  • Contributed 1,000+ enterprise workflow prompts to EnterpriseOps-Gym
  • Enabled evaluation across 8 enterprise domains with realistic system interactions
  • Established execution-grounded benchmarking using verifier-based validation
  • Modeled long-horizon planning (7–30 steps) to stress-test agent reasoning
  • Supported benchmarking of stateful, multi-system workflows rather than isolated tasks

The Outcome

ServiceNow used EnterpriseOps-Gym to benchmark frontier models and uncovered clear limitations in real-world agent performance. Even the top-performing model achieved only 37.4% task completion, with performance dropping further in policy-heavy and cross-domain workflows.

The evaluation revealed that:

  • Planning, not tool use, is the primary bottleneck, with human-authored plans improving performance by 14–35 percentage points
  • Performance declines with longer workflows, highlighting challenges in maintaining multi-step reasoning
  • Policy compliance and safe refusal remain unreliable, even for top models

Need to benchmark agents in real enterprise workflows?

Request a sample of execution-grounded tasks designed for multi-step, tool-driven evaluation.

Request Sample

FAQ

What is EnterpriseOps-Gym?

EnterpriseOps-Gym is a benchmark designed by ServiceNow to evaluate agentic planning and tool use in realistic enterprise environments with stateful workflows and policy constraints.

What did Turing contribute to this benchmark?

Turing designed 1,000+ enterprise prompts, execution trajectories, and validation pipelines that enabled structured evaluation of agent workflows.

How were the tasks evaluated?

Tasks were evaluated using verifier scripts that check the final system state, ensuring workflows are correctly completed rather than relying only on output quality.

What makes this benchmark different from traditional ones?

It evaluates agents on long-horizon, multi-step workflows with real system interactions, rather than short, stateless tasks.

How complex are the workflows?

Tasks range from 7 to 30 steps, requiring coordinated tool use and state management across systems.

How fast can I get a sample?

Within three business days after NDA execution.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

Evaluating long-horizon planning and tool use?

Work with Turing to design benchmarks that test real-world agent performance beyond simple tasks.

Talk to an Expert
