Created high-quality prompts and RLHF preference pairs to improve the model’s ability to use built-in extensions across core customer journeys (CUJs). The dataset included complex tool chaining, fallback reasoning, and stepwise thought/code/explain formatting to support model alignment on grounded tool use.

The client needed a reliable dataset to improve how the model selects, sequences, and falls back across built-in extensions in core customer journeys. Existing SFT and evaluation data did not offer coverage of complex tool chaining, fallback reasoning, or structured thought/code/explain formatting at the level needed for grounded tool use.
Prompt generation
Turing designed more than 1,000 novel prompts spanning key CUJs, including Browse, YouTube, Search, and Maps. Each CUJ included both positive and negative examples to test the model's ability to select the correct extensions and sequence calls correctly.
Prompts were labeled by CUJ type, category, tool type, and complexity (trivial, moderate, complex). The goal was to surface challenging edge cases such as multi-tool workflows, missing or inferred tool usage, and incorrect or redundant tool activation. Prompts were also designed to trigger fallback reasoning when URLs were broken or incomplete.
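For illustration, a single labeled prompt record might look like the sketch below. The field names and values are assumptions that simply mirror the labeling dimensions described above (CUJ, category, tool type, complexity, positive/negative coverage, fallback triggers); they are not the client's actual schema.

```python
# Hypothetical example of one labeled prompt record. All field names and
# values are illustrative assumptions, not the client's real format.
prompt_record = {
    "prompt": ("Open the trailer from this link and tell me how long it is: "
               "https://youtu.be/INCOMPLETE"),   # placeholder incomplete URL
    "cuj": "YouTube",                            # core customer journey
    "category": "media_lookup",                  # assumed category label
    "tool_types": ["youtube", "search"],         # extensions expected on the solution path
    "complexity": "moderate",                    # trivial | moderate | complex
    "is_positive_example": True,                 # False when no tool should be triggered
    "expects_fallback": True,                    # broken/incomplete URL should trigger repair
}
```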
RLHF preference collection
The team collected side-by-side preference ratings for the model’s tool-based responses. Annotators compared model outputs, rated them on a rubric, and supplied user-written demonstrations when both responses were suboptimal.
Each example was annotated with structured blocks: Hidden Thought, Code, and Explain. These hidden thoughts framed tool rationale, planned multi-step chaining, and distinguished fallback intent from over-triggering. Metadata included whether the step was incomplete or marked as the final response, aligning with the client’s internal data conventions.
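As a rough sketch, a preference pair with these structured blocks and completion metadata could be organized as follows; the field names, rubric scale, and tool-call strings are illustrative assumptions rather than the client's internal conventions.

```python
# Hypothetical sketch of one RLHF preference pair with Hidden Thought / Code /
# Explain blocks and side-by-side ratings. Names and values are assumptions.
preference_pair = {
    "prompt_id": "maps-complex-0042",
    "responses": {
        "A": {
            "hidden_thought": "User wants a drive time; one Maps call is sufficient, no chaining needed.",
            "code": 'maps.directions(origin="SFO", destination="Mountain View", mode="driving")',
            "explain": "Driving from SFO to Mountain View takes roughly 35 minutes in light traffic.",
            "is_final_response": True,
        },
        "B": {
            "hidden_thought": "Search the web for the drive time, then open a result in Browse.",
            "code": 'search.query("SFO to Mountain View drive time")',
            "explain": "Here are some pages that discuss the drive time.",
            "is_final_response": False,          # redundant chaining / over-triggered tools
        },
    },
    "rating": {"preferred": "A", "score_A": 5, "score_B": 2},   # side-by-side rubric ratings
    "user_demonstration": None,                  # filled in when both responses are suboptimal
}
```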
Edge case and failure modeling
The team explicitly designed prompts and demonstrations to surface failure cases, including missing or merely inferred tool usage, incorrect or redundant tool activation, multi-tool workflows sequenced out of order, and broken or incomplete URLs.
Fallback strategies were structured to reflect realistic conversational repair behaviors, including user re-prompts and alternate tool paths.
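The minimal sketch below illustrates, under assumed stand-in tool interfaces, the kind of fallback behavior the demonstrations were written to capture: when the primary tool path fails on a broken URL, repair the turn through an alternate tool path rather than re-triggering the same tool.

```python
# Minimal sketch of the fallback pattern described above. `browse` and `search`
# are stand-in callables, not real extension APIs.
def answer_with_fallback(url: str, query: str, browse, search) -> str:
    try:
        page_text = browse(url)                  # primary path: open the page directly
        return f"From {url}: {page_text[:200]}"
    except Exception:                            # assumed failure mode for a broken/incomplete URL
        top_result = search(query)[0]            # alternate tool path (conversational repair)
        return f"That link did not load, but a search suggests: {top_result}"
```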
This dataset supports SFT and fine-tuning through the generated prompts, RLHF through the structured preference pairs, and fine-grained evaluation across query type, category, tool sequence, and error mode.
Turing’s work enabled the client to accelerate toolchain maturity across Browse, YouTube, Search, and Maps with traceable, fine-grained evaluation data.
Frequently asked questions

What does each sample include?
Each sample includes a prompt, CUJ label, tool tags, query complexity, and an RLHF pair with side-by-side ratings and user-written demonstrations.

Do samples include human-written demonstrations?
Yes. Many samples include user-written thoughts, code, and explanations where model outputs failed or under-triggered.

Is the data intended for SFT or RLHF?
Both. Prompt generation supports SFT and fine-tuning, while the preference pairs are structured for RLHF.

Is the dataset annotated beyond the raw prompts?
Yes. The dataset is annotated for query type, category, tool sequence, and error mode.

What agreement is required to receive a sample?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How quickly is a sample delivered?
Within three business days after NDA execution.
Request labeled prompts and preference data across YouTube, Maps, Browse, Search, and more.