Powering ServiceNow’s EnterpriseOps-Gym: Benchmarking Enterprise Agents With Execution-Grounded Workflows

ServiceNow partnered with Turing to create EnterpriseOps-Gym, a benchmark designed to evaluate enterprise agents operating across realistic, multi-system workflows. The work focused on designing structured prompts, execution trajectories, and evaluation pipelines that measure how well models plan, execute, and adhere to policies in production-like environments.

  • 1,000+ evaluated prompts spanning enterprise workflows, with logged trajectories, tool interactions, and verifier scripts.
  • 7–30 step execution paths represented across prompts, enabling controlled tests of long-horizon planning difficulty.
  • 8 enterprise domains covered: HR, CSM, ITSM, Email, Calendar, Drive, Teams, and Hybrid workflows.

Method: Dataset generation
Domain: Agent evaluation
Dataset scale: 1,000+ tasks
Capability: Data packs
The Challenge

Enterprise agents must operate across systems with stateful actions, strict policies, and error propagation across workflows. Existing benchmarks focus on short tool sequences or static tasks and fail to capture:

  • Long-horizon planning across multiple steps
  • Stateful interactions with persistent system changes
  • Cross-domain orchestration between tools and data systems
  • Policy adherence and access constraints in enterprise environments

ServiceNow aimed to build a benchmark that could evaluate whether agents can reliably complete real workflows, not just generate correct responses.

The Approach

Turing contributed to the EnterpriseOps-Gym benchmark by designing structured evaluation tasks, execution trajectories, and validation mechanisms aligned with real enterprise workflows.

1. Enterprise workflow prompt design

Turing designed 1,000+ prompts grounded in real operational scenarios, including:

  • Customer service case management
  • HR workflows such as onboarding and case handling
  • IT service operations and incident resolution
  • Productivity workflows across Email, Calendar, Drive, and Teams
  • Hybrid workflows requiring cross-domain coordination

Each prompt required agents to retrieve data, apply policies, and execute multi-step workflows.
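As a rough illustration (the benchmark's actual schema is not public), a single task record might pair the prompt with its domain, the policies in force, and a pointer to its verifier. All field names and values below are hypothetical.

```python
# Hypothetical sketch of one EnterpriseOps-Gym-style task record.
# The real schema is not public; every field name here is an assumption.
task = {
    "task_id": "csm-0142",                       # illustrative identifier
    "domain": "CSM",                             # one of the 8 enterprise domains
    "prompt": (
        "A customer reports a billing error on case CS0041337. "
        "Verify their entitlement, update the case with your findings, "
        "and escalate to Tier 2 if the disputed amount exceeds policy limits."
    ),
    "policies": ["entitlement_check_required", "tier2_escalation_threshold"],
    "expected_steps": 12,                        # within the 7-30 step range
    "verifier": "verifiers/csm-0142_check.sql",  # deterministic final-state check
}
```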

2. Long-horizon execution modeling

Tasks were constructed with execution paths of 7 to 30 steps, requiring:

  • Sequential tool invocation
  • Dependency resolution across steps
  • State updates across systems
  • Correct ordering of actions

This allowed controlled evaluation of how performance changes as planning complexity increases.
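One way to make dependency resolution and action ordering mechanically checkable is to model each task as a dependency graph over steps; any valid trajectory must respect the resulting partial order. The sketch below uses Python's standard-library graphlib with hypothetical step and tool names, and is not the benchmark's actual representation.

```python
# Hypothetical sketch: a multi-step task as a dependency graph, so that
# correct ordering of actions can be checked mechanically.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+

@dataclass
class Step:
    name: str
    tool: str
    depends_on: list[str] = field(default_factory=list)

steps = [
    Step("lookup_case", tool="csm.get_case"),
    Step("check_entitlement", tool="hr.get_entitlement", depends_on=["lookup_case"]),
    Step("update_case", tool="csm.update_case", depends_on=["check_entitlement"]),
    Step("escalate", tool="csm.escalate", depends_on=["update_case"]),
]

# TopologicalSorter raises CycleError if the declared dependencies are
# inconsistent, and static_order() yields one valid execution order.
graph = {s.name: set(s.depends_on) for s in steps}
print(list(TopologicalSorter(graph).static_order()))
# ['lookup_case', 'check_entitlement', 'update_case', 'escalate']
```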

3. Expert reference trajectories

Turing produced expert-grounded execution paths for each task, including:

  • Step-by-step reasoning aligned to system constraints
  • Tool calls with parameters and expected outcomes
  • Logged execution traces capturing system interactions

These trajectories served as reference standards for evaluating model behavior.
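In sketch form, a reference trajectory might look like the following, with hypothetical tool names and fields: each entry pairs the expert's reasoning with a tool call, its parameters, and the outcome expected from the system.

```python
# Illustrative only: a minimal shape for an expert reference trajectory.
# Tool names, arguments, and fields are assumptions, not the actual format.
reference_trajectory = [
    {
        "step": 1,
        "reasoning": "Retrieve the case to confirm the reported billing error.",
        "tool_call": {"name": "csm.get_case", "args": {"case_id": "CS0041337"}},
        "expected_outcome": {"state": "open", "category": "billing"},
    },
    {
        "step": 2,
        "reasoning": "Policy requires an entitlement check before any update.",
        "tool_call": {"name": "hr.get_entitlement", "args": {"account": "ACME"}},
        "expected_outcome": {"entitled": True},
    },
    # ...remaining steps follow the same pattern, up to 30 per task
]
```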

4. Verifier-based evaluation framework

Rather than judging model outputs alone, each task was validated with deterministic verification scripts:

  • SQL-based checks confirmed whether workflows completed correctly
  • Verification covered task completion, system state integrity, policy compliance, and side effects
  • Multiple valid execution paths were allowed as long as the final state conditions were satisfied

This ensured that evaluation measured actual execution success, not just plausible responses.
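A minimal sketch of such a verifier follows, assuming the environment's final state is queryable as SQL; sqlite3 stands in here for the real backing store, and all table and column names are hypothetical.

```python
# Hypothetical verifier: deterministic checks against the final system state.
import sqlite3

def verify_final_state(db_path: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        # Task completion: the target case must have been escalated to Tier 2.
        (escalated,) = conn.execute(
            "SELECT COUNT(*) FROM cases "
            "WHERE case_id = 'CS0041337' AND tier = 2 AND state = 'escalated'"
        ).fetchone()
        # Side effects: no other case may have been modified along the way.
        (collateral,) = conn.execute(
            "SELECT COUNT(*) FROM audit_log "
            "WHERE case_id != 'CS0041337' AND action = 'update'"
        ).fetchone()
        # Any execution path passes, provided these final-state conditions hold.
        return escalated == 1 and collateral == 0
    finally:
        conn.close()
```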

5. Multi-layer quality assurance

Turing applied a structured QA pipeline:

  • Reviewer audits for task correctness and realism
  • Feasibility checks to ensure tasks were solvable within the environment
  • Consistency validation across prompts, trajectories, and verification logic

This ensured that tasks reflected real enterprise workflows and supported reliable benchmarking.
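To give a feel for the automated portion of such a pipeline, the sketch below runs two of the checks described above on a task's trajectory: feasibility (every tool the trajectory calls exists in the environment) and consistency (the step count stays within the 7–30 range). Field names mirror the earlier sketches and are likewise assumptions. Reviewer audits would sit on top of checks like these, since realism judgments cannot be fully automated.

```python
# Hypothetical QA checks over a task bundle (prompt, trajectory, verifier).
def qa_check(trajectory: list[dict], available_tools: set[str]) -> list[str]:
    issues = []
    # Feasibility: every tool the expert trajectory invokes must exist.
    for entry in trajectory:
        tool = entry["tool_call"]["name"]
        if tool not in available_tools:
            issues.append(f"step {entry['step']}: unknown tool {tool!r}")
    # Consistency: step count must stay within the benchmark's 7-30 range.
    if not 7 <= len(trajectory) <= 30:
        issues.append(f"trajectory has {len(trajectory)} steps, outside 7-30")
    return issues
```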

Key Results

  • Contributed 1,000+ enterprise workflow prompts to EnterpriseOps-Gym
  • Enabled evaluation across 8 enterprise domains with realistic system interactions
  • Established execution-grounded benchmarking using verifier-based validation
  • Modeled long-horizon planning (7–30 steps) to stress-test agent reasoning
  • Supported benchmarking of stateful, multi-system workflows rather than isolated tasks

The Outcome

ServiceNow used EnterpriseOps-Gym to benchmark frontier models and uncovered clear limitations in real-world agent performance. Even the top-performing model achieved only 37.4% task completion, with performance dropping further in policy-heavy and cross-domain workflows.

The evaluation revealed that:

  • Planning, not tool use, is the primary bottleneck, with human-authored plans improving performance by 14–35 percentage points
  • Performance declines with longer workflows, highlighting challenges in maintaining multi-step reasoning
  • Policy compliance and safe refusal remain unreliable, even for top models

Need to benchmark agents in real enterprise workflows?

Request a sample of execution-grounded tasks designed for multi-step, tool-driven evaluation.

Request Sample

FAQ

What is EnterpriseOps-Gym?

EnterpriseOps-Gym is a benchmark designed by ServiceNow to evaluate agentic planning and tool use in realistic enterprise environments with stateful workflows and policy constraints.

What did Turing contribute to this benchmark?

Turing designed 1,000+ enterprise prompts, execution trajectories, and validation pipelines that enabled structured evaluation of agent workflows.

How were the tasks evaluated?

Tasks were evaluated using verifier scripts that check the final system state, ensuring workflows are correctly completed rather than relying only on output quality.

What makes this benchmark different from traditional ones?

It evaluates agents on long-horizon, multi-step workflows with real system interactions, rather than short, stateless tasks.

How complex are the workflows?

Tasks range from 7 to 30 steps, requiring coordinated tool use and state management across systems.

How fast can I get a sample?

Within three business days after NDA execution.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

Evaluating long-horizon planning and tool use?

Work with Turing to design benchmarks that test real-world agent performance beyond simple tasks.

Talk to an Expert
