Building Production-Grade Web Environments for Computer-Use Agent RL Training

Turing built production-grade simulated web environments across food delivery and retail platforms, paired with 500+ verifier-backed tasks designed to stress-test browser-use agents during reinforcement learning. Each environment shipped with realistic seed data, dynamic verification, and difficulty calibration tied to current SOTA agent performance.

500+

verifier-backed tasks delivered across simulated food delivery and retail environments, spanning 100+ task templates.

>50%

tasks calibrated as model-breaking against SOTA computer-use agents, with hard tasks defined as Pass@10 < 2 against SOTA computer-use agents.

End-to-end

environment delivery, including simulated UI, realistic seed data (products, menus, images, reviews, inventory), dynamic verifiers, and performance-optimized infrastructure.

Method: RL environments
Domain: Web environments
Dataset scale: 500+ tasks
Capability: RL environments
Building Production-Grade Web Environments and 500+ Verifier-Backed Tasks for Computer-Use Agent RL Training

The Challenge

The client needed realistic web environments to train and evaluate browser-use agents through reinforcement learning. Existing public benchmarks for browser agents face structural limitations:

  • Static fixtures that agents can memorize rather than reason through
  • Narrow task pools that fail to capture the variety of real shopping and ordering workflows
  • Verification logic tied to fixed expected states, breaking when inventory or pricing changes
  • Insufficient task difficulty, with most public tasks already saturated by frontier computer-use agents
  • UI layers that lack production-grade behavior, such as realistic latency, paginated results, and multi-fulfillment flows

The client required complete environments that captured realistic shopping experiences end-to-end, paired with task pools calibrated to remain hard for SOTA computer-use agents. Key requirements included:

  • Production-grade UI behavior across discovery, browsing, cart, checkout, and post-purchase flows
  • Seed data that matched real-world distributions in pricing, variants, images, users, reviews, and inventory
  • Verifier-backed tasks where success conditions were calculated dynamically against current environment state, not hardcoded against static fixtures
  • Support for both static grading and partial grading (using LLM as a judge) to handle deterministic and non-deterministic tasks
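The dual grading modes described above can be sketched as a simple dispatch: deterministic tasks are scored with exact assertions, while open-ended tasks fall back to an LLM judge that returns partial credit. This is an illustrative sketch only; the `Task` shape, operator set, and judge interface are assumptions, not the client's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    deterministic: bool  # True -> static assertion grading
    assertions: list     # (actual, operator, expected) triples

# Hypothetical operator table for static grading.
OPS = {"==": lambda a, b: a == b,
       "<=": lambda a, b: a <= b,
       "in": lambda a, b: a in b}

def grade_static(assertions) -> bool:
    """Binary grade: every assertion must hold."""
    return all(OPS[op](actual, expected) for actual, op, expected in assertions)

def grade(task: Task, llm_judge: Callable[[Task], float]) -> float:
    # Deterministic tasks get exact-match grading; non-deterministic
    # tasks are scored by an LLM judge returning partial credit in [0, 1].
    if task.deterministic:
        return 1.0 if grade_static(task.assertions) else 0.0
    return llm_judge(task)
```

In practice the judge would receive the full transcript and environment state; here it is reduced to a callable for clarity.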

The Approach

Turing delivered simulated environments across food delivery and retail platforms using a tightly coupled UI, data, and task pipeline. Each environment was built to client specification and tuned against SOTA agent performance before delivery.

1. Environment specification and scoping

Each environment was scoped from a detailed client specification covering must-have and next-phase product features. Coverage included:

  • Food delivery: Restaurant discovery, menu browsing, cart and checkout, order history and reordering, ratings and reviews, multi-address and scheduling, and order tracking
  • Retail: Product discovery and search, variant selection, local inventory and fulfillment options (shipping, delivery, pickup), mixed-fulfillment cart and checkout, returns and customer service, and accessibility and localization

Both environments supported multiple existing personas with saved payment methods, guest checkout flows, and password reset paths, enabling tasks to assume varying user contexts.

2. Realistic seed data generation

Seed data was generated and manually quality-checked to match real-world distributions:

  • Food delivery: 600+ restaurants spanning 50+ cuisine categories across multiple metro locations, with full menus, item-level images, and hundreds of human-like reviews per popular restaurant
  • Retail: Thousands of SKUs across major departments (grocery, household, electronics, apparel, home, and more), with rich variant matrices, unit pricing, nutrition and allergen flags where applicable, store-level inventory, and review distributions calibrated to category norms

All static resources, including product and menu images, were self-hosted to ensure reproducibility. Pricing followed category-specific bands, promotions reflected realistic cadence, and review volume and distribution mirrored typical e-commerce patterns.
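As a rough illustration of category-specific price bands, seed prices can be drawn log-uniformly within per-category ranges so that budget SKUs dominate, loosely matching real catalog distributions. The band values and category names below are invented for the example; the delivered environments' actual distributions are not public.

```python
import math
import random

# Illustrative price bands (min, max in USD) -- assumed, not actual data.
PRICE_BANDS = {
    "grocery": (0.99, 24.99),
    "electronics": (9.99, 1999.99),
    "apparel": (4.99, 149.99),
}

def seed_price(category: str, rng: random.Random) -> float:
    """Sample a price within the category band, skewed toward cheaper items."""
    lo, hi = PRICE_BANDS[category]
    # Log-uniform sampling: uniform in log space, so low prices are denser.
    price = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return round(price, 2)
```

A seeded `random.Random` keeps generation reproducible, which matters when the same catalog must be rebuilt across evaluation pods.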

3. Dynamic verification framework

Tasks did not rely on hardcoded expected states. Instead, every task shipped with a unified get_verification function that returned:

  • A boolean pass/fail result
  • The actual environment state at evaluation time
  • An array of assertions paired with operators
  • Explicit expected state calculation, derived dynamically from current inventory, pricing, and catalog state
  • Full state mapping from environment state to assertion-specific state

Where applicable, tasks included verifiable checkpoints, for example after store selection, after promo application, or after delivery slot selection, enabling partial-credit signal and finer-grained failure diagnosis.
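The verification contract above can be sketched as follows: expected state is recomputed from the live catalog at evaluation time, and the result bundles the pass/fail verdict, the actual state, and the individual assertions. Field names, the operator table, and the `expected_fn` hook are assumptions for illustration, not the client's actual schema.

```python
from typing import Any, Callable

# Hypothetical operator table pairing assertions with comparators.
OPS: dict[str, Callable[[Any, Any], bool]] = {
    "==": lambda a, b: a == b,
    ">=": lambda a, b: a >= b,
    "contains": lambda a, b: b in a,
}

def get_verification(env_state: dict, task_spec: dict) -> dict:
    # Expected state is derived dynamically from *current* inventory,
    # pricing, and catalog state, so inventory or price changes do not
    # invalidate the task.
    expected = task_spec["expected_fn"](env_state)
    assertions = []
    for field, op in task_spec["checks"]:
        actual = env_state["cart"].get(field)
        assertions.append({
            "field": field, "op": op,
            "actual": actual, "expected": expected[field],
            "passed": OPS[op](actual, expected[field]),
        })
    return {
        "passed": all(a["passed"] for a in assertions),  # boolean verdict
        "actual_state": env_state,                       # state at eval time
        "assertions": assertions,                        # per-check detail
    }
```

Checkpoint verification reuses the same shape: each checkpoint is a smaller `task_spec` evaluated mid-trajectory, yielding partial-credit signal.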

4. Task authoring across seven capability axes

Tasks were authored across 100+ templates, covering the following capability axes:

  • GUI comprehension: Parsing layout and structure of category pages, product details, cart, and checkout
  • Element identification: Locating filters, variant selectors, fulfillment toggles, store selectors, and promo inputs
  • Action execution: Add to cart, change variants, select delivery slots, apply coupons, choose pickup
  • Multi-step task planning: Navigating discovery to checkout with split fulfillment, substitutions, and returns
  • Information retrieval: Extracting unit pricing, allergen info, warranty terms, ETA windows, and tax breakdowns
  • Information integration: Comparing items across brands and choosing best value across price, delivery, and promo eligibility
  • Decision making: Selecting optimal fulfillment based on inventory and speed; choosing substitutes within preferences

5. Difficulty calibration against SOTA agents

Every task was calibrated against a SOTA computer-use agent using Pass@10 measurement, mapped to three difficulty bands:

  • Easy: Pass@10 > 7
  • Medium: 2 ≤ Pass@10 ≤ 7
  • Hard: Pass@10 < 2

The delivered task mix targeted 15% easy, 35% medium, and 50% hard, ensuring that more than half of all tasks remained model-breaking against SOTA computer-use agents at delivery time. Tasks that proved too easy after evaluation were revised, replaced, or escalated in difficulty.
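The banding logic reduces to counting passes over ten reference-agent attempts. A minimal sketch, assuming a `run_task` callable that executes one attempt and returns whether it passed:

```python
from typing import Callable

def difficulty_band(run_task: Callable[[], bool], n: int = 10) -> str:
    """Map Pass@10 against a reference agent to a difficulty band."""
    passes = sum(run_task() for _ in range(n))
    if passes > 7:
        return "easy"
    if passes < 2:
        return "hard"    # model-breaking at delivery time
    return "medium"
```

Tasks landing in "easy" after calibration would be revised, replaced, or escalated, keeping the delivered mix at the 15/35/50 target.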

6. Production-grade infrastructure

Because UI, data, and tasks were tightly coupled, changes at any layer required validating the full pipeline. The environments were tuned for production-realistic performance, including database query optimization to sustain target QPS thresholds during agent evaluation. Self-hosted assets, version-controlled data, and a manual QA process spanning UI, data, and prompts ensured environment stability across teams.

Key Results

  • Delivered simulated web environments across food delivery and retail platforms, with full UI, seed data, and task coverage
  • Authored more than 500 verifier-backed tasks across 100+ templates, each calibrated against SOTA computer-use agent performance
  • Achieved the target 50%+ model-breaking difficulty ratio, with hard tasks consistently producing Pass@10 < 2 against SOTA computer-use agents
  • Built dynamic verification logic that evaluated agent success against current environment state rather than static fixtures, enabling repeated training rollouts without staleness
  • Established checkpoint-based verification for multi-step tasks, providing partial-credit signal during RL training
  • Tuned environment infrastructure for production-grade QPS and stability across distributed evaluation pods

The Outcome

The client now has production-grade RL environments for training and evaluating browser-use agents on realistic shopping and ordering workflows. Because verification is dynamic and seed data reflects real-world distributions, the environments support repeated training rollouts without losing signal, and tasks remain meaningful as agents improve.

This foundation enables:

  • Reinforcement learning rollouts on tasks that genuinely stress current SOTA computer-use agents
  • Differentiated evaluation across capability axes, from element identification to multi-step decision making
  • Repeatable, dynamically verified evaluation across model versions and training checkpoints
  • Extension to additional verticals using the same environment, data, task, and verifier pattern

Need realistic web environments to train computer-use agents?

Request a sample of verifier-backed tasks across simulated retail and food delivery environments, calibrated to break SOTA browser agents.

Request Sample


FAQ

What environments are included?

The delivery includes simulated web environments across food delivery and retail platforms, with restaurants, menus, ordering flows, multi-department product catalogs, and shipping, delivery, and store pickup fulfillment options.

How is task difficulty calibrated?

Difficulty is measured using Pass@10 against a SOTA computer-use agent. Easy tasks pass more than 7 of 10 attempts, medium tasks pass between 2 and 7, and hard tasks pass fewer than 2.

How does verification work with dynamic environment state?

Each task ships with a get_verification function that calculates expected state dynamically from current inventory, pricing, and catalog data. This allows the environment to evolve without invalidating the task pool.

Can the environments support reinforcement learning rollouts?

Yes. The environments are designed for repeated rollouts during RL training, with dynamic verification, checkpointed task evaluation, and production-grade infrastructure tuned for sustained QPS.

Can this approach extend to other verticals?

Yes. The same UI, seed data, task, and verifier pattern can be applied to additional domains such as travel booking, banking, or enterprise SaaS workflows.

How fast can I get a sample?

Within three business days after NDA execution.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

Building or evaluating browser-use agents?

Work with Turing to design simulated web environments and verifier-backed task pools that stress-test agents on realistic, multi-step workflows.

Request Sample
