Training and Evaluating Extension Tool Use with 1000+ Prompts and RLHF Preference Pairs

Turing created high-quality prompts and RLHF preference pairs to improve the model’s ability to use built-in extensions across core customer journeys (CUJs). The dataset included complex tool chaining, fallback reasoning, and stepwise Thought / Code / Explain formatting to support model alignment on grounded tool use.

  • 1000+ diverse prompts across 8 CUJs, including positive and negative examples.
  • Multi-tool coverage across YouTube, Browse, Search, and Maps.
  • RLHF pairs annotated with preference ratings and user-written demonstrations.

Method: RLHF data collection
Domain: Browse / YouTube / Search tool use
Dataset scale: 1,000+ tasks
Capability: Data Packs

The Challenge

The client needed a reliable dataset to:

  • Evaluate how well the model triggers the right built-in extension tool or tools for complex tasks
  • Train preference models that can distinguish correct tool selection, chaining, and grounding
  • Capture reasoning failures in under-triggered or missequenced responses
  • Benchmark tool-specific behavior such as fallback to search or clarification prompts

Existing SFT and evaluation data did not offer:

  • Complex prompt coverage across CUJs
  • Tool-chaining logic across real-world queries
  • RLHF-style comparisons for instruction tuning

The Approach

Prompt generation

Turing designed more than 1,000 novel prompts spanning key CUJs, including Browse, YouTube, Search, and Maps. Each CUJ included both positive and negative examples to test the model's ability to select the correct extensions and sequence tool calls correctly.

Prompts were labeled by CUJ type, category, tool type, and complexity (trivial, moderate, complex). The goal was to surface challenging edge cases such as multi-tool workflows, missing or inferred tool usage, and incorrect or redundant tool activation. Prompts were also designed to trigger fallback reasoning when URLs were broken or incomplete.
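As an illustration only, here is a minimal sketch of how one such labeled prompt record might be represented. The schema and field names (cuj, category, tools, complexity, is_negative) are assumptions, not the client's actual format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PromptRecord:
    """One labeled prompt in the evaluation/training set (illustrative schema)."""
    prompt: str                 # natural-language user request
    cuj: str                    # core customer journey, e.g. "YouTube"
    category: str               # task category within the CUJ
    tools: List[str]            # expected extension tool sequence, e.g. ["Browse", "Maps"]
    complexity: str             # "trivial" | "moderate" | "complex"
    is_negative: bool = False   # True if the prompt should NOT trigger a tool

# Example: a complex prompt expected to chain multiple tools
example = PromptRecord(
    prompt="Find a highly rated ramen place near the venue in this link and summarize reviews.",
    cuj="Maps",
    category="local-discovery",
    tools=["Browse", "Maps", "Search"],
    complexity="complex",
)
```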

RLHF preference collection

The team collected side-by-side preference ratings for the model’s tool-based responses. Annotators compared model outputs, rated them on a rubric, and supplied user-written demonstrations when both responses were suboptimal.

Each example was annotated with structured blocks: Hidden Thought, Code, and Explain. The hidden thoughts framed tool rationale, planned multi-step chaining, and distinguished fallback intent from over-triggering. Metadata indicated whether a step was incomplete or marked as the final response, aligning with the client’s internal data conventions.
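A hedged sketch of how a preference pair and its structured blocks could be serialized follows; class and field names such as ResponseStep, hidden_thought, and rubric_scores are illustrative assumptions rather than the delivered data format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ResponseStep:
    """One step of a tool-using response, mirroring the Hidden Thought / Code / Explain blocks."""
    hidden_thought: str          # tool rationale and planned chaining (not shown to users)
    code: str                    # tool call(s) issued at this step
    explain: str                 # user-facing explanation of the result
    is_final: bool = False       # whether this step is marked as the final response

@dataclass
class PreferencePair:
    """Side-by-side comparison of two model responses to the same prompt."""
    prompt_id: str
    response_a: List[ResponseStep]
    response_b: List[ResponseStep]
    preference: str                                      # e.g. "A", "B", or "neither"
    rubric_scores: dict                                   # per-criterion annotator ratings
    demonstration: Optional[List[ResponseStep]] = None    # user-written ideal response when both fail
```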

Edge case and failure modeling

The team explicitly designed prompts and demonstrations to surface failure cases, including:

  • Clarifying questions for underspecified prompts
  • Handling unsupported tool behaviors such as liking or saving YouTube videos
  • Inaccessible or suboptimal links
  • Tool over-triggering and fallback logic, such as falling back from Browse to Search

Fallback strategies were structured to reflect realistic conversational repair behaviors, including user re-prompts and alternate tool paths.
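As an example of the kind of repair behavior these demonstrations captured, the sketch below shows a Browse-to-Search fallback. The browse and search callables and the control flow are purely illustrative and do not reflect the extension tools' real interfaces.

```python
def answer_with_fallback(url: str, query: str, browse, search) -> str:
    """Illustrative fallback: try Browse on the given URL, fall back to Search if it fails.

    `browse` and `search` stand in for the built-in extension tools; their actual
    interfaces are not part of the published dataset description.
    """
    try:
        page = browse(url)                  # primary path: ground the answer in the linked page
        if page and page.strip():
            return f"Based on the linked page: {page[:500]}"
    except Exception:
        pass                                # broken or inaccessible link
    results = search(query)                 # conversational repair: alternate tool path
    return f"The link was unavailable, so here is what Search found: {results[:500]}"
```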

Key Results

  • Built a benchmark-grade dataset of more than 1000 tool-centric prompts and RLHF examples
  • Covered 8 CUJs with tool-specific tags and reasoning formats
  • Generated hundreds of demonstrations where models failed to trigger the correct tool
  • Enabled training of instruction-tuned and reward models grounded in tool accuracy
  • Delivered structured examples with hidden thoughts, fallback logic, and chained responses

The Outcome

This dataset supports:

  • The client’s preference model training on real-world tool usage
  • Chain-of-thought training for multi-tool reasoning
  • Structured evaluation for CUJ coverage and fallback handling
  • Research on tool grounding, misuse, and recovery behaviors

Turing’s work enabled the client to accelerate toolchain maturity across Browse, YouTube, Search, and Maps with traceable, fine-grained evaluation data.

Need tool-centric preference data for real-world CUJs?

Request prompts with real-world chaining, fallback logic, and reasoning metadata.

Request Sample


FAQ

What’s included in the dataset?

Each sample includes a prompt, CUJ label, tool tags, query complexity, and an RLHF pair with side-by-side ratings and user-written demonstrations.

Are demonstrations included?

Yes. Many samples include user-written thoughts, code, and explanations where model outputs failed or under-triggered.

Is this SFT or RLHF data?

Both. Prompt generation supports SFT and fine-tuning, while the preference pairs are structured for RLHF.

Can I filter by tool, failure type, or prompt complexity?

Yes. The dataset is annotated for query type, category, tool sequence, and error mode.
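Assuming the records are exported with the annotation fields described above, a filter might look like the following; the file name and column names are illustrative assumptions.

```python
import pandas as pd

# Load the annotated dataset; file name and column names are illustrative.
df = pd.read_json("tool_use_dataset.jsonl", lines=True)

# Complex prompts involving Browse where the model under-triggered the tool
subset = df[
    (df["complexity"] == "complex")
    & (df["tools"].apply(lambda tools: "Browse" in tools))
    & (df["error_mode"] == "under-triggered")
]
print(len(subset), "matching examples")
```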

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Want to test or train models on extension tool behavior?

Request labeled prompts and preference data across YouTube, Maps, Browse, Search, and more.

Request Sample