Training and Evaluating Extension Tool Use with 1000+ Prompts and RLHF Preference Pairs

Turing created high-quality prompts and RLHF preference pairs to improve the model’s ability to use built-in extensions across core customer journeys (CUJs). The dataset included complex tool chaining, fallback reasoning, and stepwise Thought / Code / Explain formatting to support model alignment on grounded tool use.

  • 1000+ diverse prompts across 8 CUJs, including positive and negative examples.
  • Multi-tool coverage across YouTube, Browse, Search, and Maps.
  • RLHF pairs annotated with preference ratings and user-written demonstrations.

Method: RLHF data collection
Domain: Browse / YouTube / Search tool use
Dataset scale: 1,000+ tasks
Capability: Data Packs

The Challenge

The client needed a reliable dataset to:

  • Evaluate how well the model triggers the right built-in extension tool or tools for complex tasks
  • Train preference models that can distinguish correct tool selection, chaining, and grounding
  • Capture reasoning failures in under-triggered or missequenced responses
  • Benchmark tool-specific behavior such as fallback to search or clarification prompts

Existing SFT and evaluation data did not offer:

  • Complex prompt coverage across CUJs
  • Tool-chaining logic across real-world queries
  • RLHF-style comparisons for instruction tuning

The Approach

Prompt generation

Turing designed more than 1,000 novel prompts spanning key CUJs, including Browse, YouTube, Search, and Maps. Each CUJ included both positive and negative examples to test the model's ability to select the correct extensions and sequence tool calls correctly.

Prompts were labeled by CUJ type, category, tool type, and complexity (trivial, moderate, complex). The goal was to surface challenging edge cases such as multi-tool workflows, missing or inferred tool usage, and incorrect or redundant tool activation. Prompts were also designed to trigger fallback reasoning when URLs were broken or incomplete.
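As an illustration only, here is a minimal sketch of how one such labeled prompt record might be represented. The schema and field names (cuj, category, tools, complexity, is_negative) are assumptions, not the client's actual format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PromptRecord:
    """One labeled prompt in the evaluation/training set (illustrative schema)."""
    prompt: str                 # natural-language user request
    cuj: str                    # core customer journey, e.g. "YouTube"
    category: str               # task category within the CUJ
    tools: List[str]            # expected extension tool sequence, e.g. ["Browse", "Maps"]
    complexity: str             # "trivial" | "moderate" | "complex"
    is_negative: bool = False   # True if the prompt should NOT trigger a tool

# Example: a complex prompt expected to chain multiple tools
example = PromptRecord(
    prompt="Find a highly rated ramen place near the venue in this link and summarize reviews.",
    cuj="Maps",
    category="local-discovery",
    tools=["Browse", "Maps", "Search"],
    complexity="complex",
)
```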

RLHF preference collection

The team collected side-by-side preference ratings for the model’s tool-based responses. Annotators compared model outputs, rated them on a rubric, and supplied user-written demonstrations when both responses were suboptimal.

Each example was annotated with structured blocks: Hidden Thought, Code, and Explain. The hidden thoughts framed tool rationale, planned multi-step chaining, and distinguished fallback intent from over-triggering. Metadata indicated whether a step was incomplete or marked as the final response, aligning with the client’s internal data conventions.
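A hedged sketch of how a preference pair and its structured blocks could be serialized follows; class and field names such as ResponseStep, hidden_thought, and rubric_scores are illustrative assumptions rather than the delivered data format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ResponseStep:
    """One step of a tool-using response, mirroring the Hidden Thought / Code / Explain blocks."""
    hidden_thought: str          # tool rationale and planned chaining (not shown to users)
    code: str                    # tool call(s) issued at this step
    explain: str                 # user-facing explanation of the result
    is_final: bool = False       # whether this step is marked as the final response

@dataclass
class PreferencePair:
    """Side-by-side comparison of two model responses to the same prompt."""
    prompt_id: str
    response_a: List[ResponseStep]
    response_b: List[ResponseStep]
    preference: str                                      # e.g. "A", "B", or "neither"
    rubric_scores: dict                                   # per-criterion annotator ratings
    demonstration: Optional[List[ResponseStep]] = None    # user-written ideal response when both fail
```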

Edge case and failure modeling

The team explicitly designed prompts and demonstrations to surface failure cases, including:

  • Clarifying questions for underspecified prompts
  • Handling unsupported tool behaviors such as liking or saving YouTube videos
  • Inaccessible or suboptimal links
  • Tool over-triggering and fallback logic, such as falling back from Browse to Search

Fallback strategies were structured to reflect realistic conversational repair behaviors, including user re-prompts and alternate tool paths.
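As an example of the kind of repair behavior these demonstrations captured, the sketch below shows a Browse-to-Search fallback. The browse and search callables and the control flow are purely illustrative and do not reflect the extension tools' real interfaces.

```python
def answer_with_fallback(url: str, query: str, browse, search) -> str:
    """Illustrative fallback: try Browse on the given URL, fall back to Search if it fails.

    `browse` and `search` stand in for the built-in extension tools; their actual
    interfaces are not part of the published dataset description.
    """
    try:
        page = browse(url)                  # primary path: ground the answer in the linked page
        if page and page.strip():
            return f"Based on the linked page: {page[:500]}"
    except Exception:
        pass                                # broken or inaccessible link
    results = search(query)                 # conversational repair: alternate tool path
    return f"The link was unavailable, so here is what Search found: {results[:500]}"
```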

Key Results

  • Built a benchmark-grade dataset of more than 1000 tool-centric prompts and RLHF examples
  • Covered 8 CUJs with tool-specific tags and reasoning formats
  • Generated hundreds of demonstrations where models failed to trigger the correct tool
  • Enabled training of instruction-tuned and reward models grounded in tool accuracy
  • Delivered structured examples with hidden thoughts, fallback logic, and chained responses

The Outcome

This dataset supports:

  • The client’s preference model training on real-world tool usage
  • Chain-of-thought training for multi-tool reasoning
  • Structured evaluation for CUJ coverage and fallback handling
  • Research on tool grounding, misuse, and recovery behaviors

Turing’s work enabled the client to accelerate toolchain maturity across Browse, YouTube, Search, and Maps with traceable, fine-grained evaluation data.

Need tool-centric preference data for real-world CUJs?

Request prompts with real-world chaining, fallback logic, and reasoning metadata.

Request Sample


FAQ

What’s included in the dataset?

Each sample includes a prompt, CUJ label, tool tags, query complexity, and an RLHF pair with side-by-side ratings and user-written demonstrations.

Are demonstrations included?

Yes. Many samples include user-written thoughts, code, and explanations where model outputs failed or under-triggered.

Is this SFT or RLHF data?

Both. Prompt generation supports SFT and fine-tuning, while the preference pairs are structured for RLHF.

Can I filter by tool, failure type, or prompt complexity?

Yes. The dataset is annotated for query type, category, tool sequence, and error mode.
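Assuming the records are exported with the annotation fields described above, a filter might look like the following; the file name and column names are illustrative assumptions.

```python
import pandas as pd

# Load the annotated dataset; file name and column names are illustrative.
df = pd.read_json("tool_use_dataset.jsonl", lines=True)

# Complex prompts involving Browse where the model under-triggered the tool
subset = df[
    (df["complexity"] == "complex")
    & (df["tools"].apply(lambda tools: "Browse" in tools))
    & (df["error_mode"] == "under-triggered")
]
print(len(subset), "matching examples")
```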

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Want to test or train models on extension tool behavior?

Request labeled prompts and preference data across YouTube, Maps, Browse, Search, and more.

Request Sample