Building Production-Ready RL Gyms for Commercial Agent Workflows Across 4 Platforms

Designed and delivered a sandboxed RL environment for training and evaluating AI agents on real-world commercial sales workflows. The RL Gym packaged 100+ structured workflows across four enterprise platforms, with realistic UI replicas, natural-language prompts, step-level verifiers, and Docker-containerized delivery ready for agent experimentation at scale.

  • 100+ workflows delivered with UI replicas, natural-language prompts, step-level verifiers, and Docker packaging
  • 4 enterprise platforms covered: LinkedIn Sales Navigator, HubSpot, Outreach, and Calendly
  • Pass@3 complexity framework applied across all workflows to calibrate difficulty and produce reliable RL training signals

Method: RL environments
Domain: UI environments
Dataset scale: 100+ workflows
Capability: RL environments

The Challenge

Training reinforcement learning agents on commercial software tasks requires more than UI screenshots and prompts. Agents need structured environments where actions have consequences, task completion is objectively verifiable, and difficulty is calibrated to produce useful learning signals.

The client needed a gym environment that could:

  • Replicate the behavior of real enterprise sales tools without exposing live production systems
  • Represent the full breadth of commercial sales workflows, from inbound lead qualification to outbound multi-channel sequencing and scheduling
  • Provide granular, assertion-based verification that could score partial and full task completion automatically
  • Scale across integration tiers, from single-platform tasks to complex four-platform workflows requiring coordinated actions across tools
  • Package everything in a reproducible, portable format that their training infrastructure could consume directly

The Approach

Turing designed the RL Gym as a layered system, combining sandboxed UI environments, structured workflow blueprints, a verifier API framework, and a difficulty calibration methodology grounded in execution outcomes.

1. Sandboxed UI environments per platform

For each of the four platforms, Turing built a self-contained UI replica populated with realistic seed data. 

Each environment was designed to support all UI states required by the workflows assigned to that platform, including:

  • Edge states triggered by multi-step sequences
  • Conditional branches
  • Cross-platform handoffs

Environments were scoped to cover each platform's core capabilities, including:

  • Decision-maker verification and lead prioritization in LinkedIn Sales Navigator
  • Lead capture, enrichment, and email personalization in HubSpot
  • Email sequencing and engagement tracking in Outreach
  • Scheduling assignment and meeting management in Calendly

2. Structured workflow blueprints

Each workflow was defined as a structured blueprint containing four components:

  • Goal: The business outcome the workflow is intended to achieve
  • Prompt: A natural-language instruction an agent receives to initiate the workflow
  • Steps: An ordered sequence of operations, each specifying the platform, the task objective, and the UI-level actions required
  • Verifiers: Granular assertions that confirm successful task completion at each step

Workflows were split between inbound and outbound sales motions and covered business functions including lead qualification and scheduling, personalized outreach and sequencing, opportunity creation and pipeline progression, and dormant lead re-engagement.

Task and prompt design was grounded in real-world usage patterns shared by the client, ensuring workflows reflected authentic professional behavior rather than synthetic or hypothetical scenarios.

3. Verifier API framework

Verification was handled through a standardized API layer shared across all four platform environments. Each gym exposed consistent endpoints for:

  • Initializing runs
  • Storing and retrieving environment state
  • Executing assertions
  • Returning pass/fail results with structured metadata

The verification system supported:

  • JQ-based state inspection
  • Operator-driven assertion logic
  • Multi-assertion rollup into task-level pass/fail outcomes

This allowed the client's training harness to consume structured reward signals without needing to interpret raw UI state.
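The assertion-and-rollup logic can be sketched as follows. Paths here use a simplified dotted form rather than full JQ, and all function and field names are illustrative assumptions, not the actual verifier API.

```python
# Minimal sketch of operator-driven assertions rolled up into a
# task-level pass/fail outcome with structured metadata.

def resolve(state: dict, path: str):
    """Walk a dotted path like '.lead.status' through nested dicts."""
    node = state
    for key in path.lstrip(".").split("."):
        node = node[key]
    return node

# Assumed operator set; the real framework's operators are not shown.
OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "contains": lambda actual, expected: expected in actual,
    "gte": lambda actual, expected: actual >= expected,
}

def run_assertions(state: dict, assertions: list[dict]) -> dict:
    results = [
        {
            "path": a["path"],
            "passed": OPERATORS[a["op"]](resolve(state, a["path"]),
                                         a["expected"]),
        }
        for a in assertions
    ]
    # Multi-assertion rollup: the task passes only if every assertion passes.
    return {"passed": all(r["passed"] for r in results),
            "assertions": results}

state = {"lead": {"status": "qualified", "emails_sent": 3}}
outcome = run_assertions(state, [
    {"path": ".lead.status", "op": "equals", "expected": "qualified"},
    {"path": ".lead.emails_sent", "op": "gte", "expected": 2},
])
# outcome["passed"] is True: both assertions hold against the state
```

The structured result dict is the reward-signal shape described above: the training harness reads the rollup and per-assertion metadata instead of raw UI state.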

Cross-platform workflows used a shared run_id to coordinate state across gyms, enabling the verifier to assess actions taken across different platforms at the end of execution for a given task. Each platform environment is a standalone deployment; changes made by the agent in one gym do not propagate in real time to another. Tasks and initial data were designed so that this real-time integration is not required; correctness is evaluated holistically once the full task execution is complete.
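The shared-run_id coordination might look like the sketch below, where each standalone gym keeps its own state store and the verifier gathers everything at the end of execution. Class and method names are hypothetical; the real gyms expose this through their verifier API endpoints.

```python
# Hedged sketch of end-of-run, cross-platform state collection keyed by
# a shared run_id. Each gym is modeled as an isolated in-memory store.

class GymStateStore:
    def __init__(self, platform: str):
        self.platform = platform
        self._runs: dict[str, dict] = {}  # run_id -> final state snapshot

    def save(self, run_id: str, state: dict) -> None:
        self._runs[run_id] = state

    def fetch(self, run_id: str) -> dict:
        return self._runs.get(run_id, {})

def collect_final_state(run_id: str, gyms: list[GymStateStore]) -> dict:
    """Gather each standalone gym's state so correctness can be judged
    holistically after execution, without real-time cross-gym sync."""
    return {gym.platform: gym.fetch(run_id) for gym in gyms}

hubspot = GymStateStore("HubSpot")
calendly = GymStateStore("Calendly")
hubspot.save("run-42", {"lead": {"stage": "meeting_booked"}})
calendly.save("run-42", {"event": {"invitee": "lead@example.com"}})
final = collect_final_state("run-42", [hubspot, calendly])
```

Because evaluation happens only on the collected snapshot, the gyms stay fully isolated deployments, which matches the no-real-time-propagation design described above.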

4. Complexity calibration via Pass@3

Workflow difficulty was calibrated using a Pass@3 methodology: each workflow was executed three times, and the ratio of passes to failures determined its complexity classification. 

Given the capabilities of current computer-use agent models, all workflows were targeted at the Hard difficulty band: tasks where models fail frequently, exposing capability boundaries and providing model-breaking signal.
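The banding logic is simple enough to sketch directly. Note that the exact mapping from pass counts to bands is an assumption for illustration; the source only states that the pass/fail ratio over three runs determined the classification.

```python
# Illustrative Pass@3 banding: classify a workflow from the number of
# passing executions out of 3. The thresholds below are assumed.

def classify_pass_at_3(passes: int) -> str:
    if not 0 <= passes <= 3:
        raise ValueError("passes must be between 0 and 3")
    if passes == 3:
        return "Easy"    # model succeeds consistently
    if passes >= 1:
        return "Medium"  # intermittent success
    return "Hard"        # fails all attempts: model-breaking signal
```

Under this scheme, the delivered workflows would all sit in the 0-passes band, which is what makes them useful for exposing capability boundaries.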

5. Docker packaging and delivery

The complete gym environment for each platform was delivered as a Docker container, including the sandboxed UI, seed data, workflow blueprints, verifier scripts, and API layer. Containers were documented with usage instructions and update procedures to support integration into the client's training infrastructure with minimal friction.

Key Results

  • Delivered more than 100 structured commercial workflows across four enterprise platforms
  • Covered all four integration tiers: single-platform, two-platform, three-platform, and four-platform workflows, with 50+ tasks requiring coordinated execution across all four tools within a single workflow
  • Implemented a standardized verifier API framework shared across all gym environments, enabling consistent reward signal generation across platforms
  • Applied Pass@3 difficulty calibration across all workflows to ensure the gym produces actionable RL training signal
  • Packaged all environments as Docker containers with full documentation, ready for direct integration into agent training pipelines

The Outcome

The client received a fully operational RL Gym for training and evaluating AI agents on commercial sales workflows. With sandboxed environments, structured prompts, assertion-based verification, and calibrated difficulty, the gym provides the infrastructure needed to move from isolated task completion to multi-platform, multi-step agent behavior.

This foundation enables the client to:

  • Train agents on realistic inbound and outbound sales workflows without exposing production systems
  • Measure task completion objectively through verifier-grounded pass/fail outcomes
  • Diagnose agent failure modes at the step level, not just at the workflow level
  • Scale experimentation across platform combinations and integration tiers
  • Iterate on agent capabilities with reproducible benchmarks as models improve

Want to train agents on real commercial workflows?

Request a sample workflow package including a structured prompt, step-level verifiers, and Docker environment for a single-platform or multi-platform scenario.

Request Sample


FAQ

What platforms are covered in the RL Gym?

The gym covers LinkedIn Sales Navigator, HubSpot, Outreach, and Calendly, with workflows ranging from single-platform tasks to four-platform sequences requiring coordinated actions across all tools.

How is task completion verified?

Each workflow includes granular assertions evaluated through a standardized verifier API. Assertions use JQ-based state inspection and operator logic to confirm whether the agent correctly completed each step, returning structured pass/fail results.

How was workflow difficulty calibrated?

Difficulty was determined using a Pass@3 framework: each workflow was executed three times, and the ratio of passes to failures classified it as Easy, Medium, or Hard.

Are the gym environments isolated from live systems?

Yes. Each environment is a sandboxed replica populated with synthetic but realistic seed data, packaged as a Docker container. No live platform credentials or production data are required.

Can the gym be extended to additional platforms or workflows?

Yes. The gym infrastructure and platform environments are non-exclusive and designed to support additional workflow development.

How fast can I get a sample?

Within three business days after NDA execution.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

Building agents that need to operate across enterprise sales tools?

Get RL environments grounded in real commercial workflows and calibrated for training signal quality.

Request Sample
