Evaluating Agent Safety and Tool-Use Behavior Across 24,000+ Supervised Conversations

Created a safety supervision dataset that captures end-to-end agent behavior across multi-turn interactions, including tool calls, confirmations, refusals, and final responses. Each conversation was annotated for safety-relevant behavior with policy-compliant rewrites for identified violations.

24,000+ supervised agent conversations: Multi-turn, tool-based interactions spanning benign, dual-use, harmful, and jailbreak scenarios.

30+ safety policies covered: Including refusals, confirmation handling, tool grounding, and prompt-injection resistance.

20+ locales supported: Locale-aware supervision across regions, languages, and cultural contexts.

Method: Evaluation
Domain: Trust & Safety
Dataset scale: 24,000+ conversations
Capability: Data Packs
The Challenge

Traditional safety datasets often focus on final model outputs but overlook safety decisions made during multi-turn interactions involving tools, confirmations, and intermediate reasoning.

The client required a dataset that could:

  • Capture end-to-end agent behavior, not just final responses
  • Distinguish between safe completions, safe refusals, partial refusals, and policy violations
  • Identify when safety failures stemmed from incorrect tool usage, missing confirmations, or improper handling of harmful content
  • Provide corrected, policy-aligned ideal responses that preserve task intent and helpfulness
  • Support safety fine-tuning and calibration without exposing sensitive policy thresholds

The Approach

Turing deployed a team of trained safety annotators and reviewers to evaluate agent tool-use conversations against the client's safety policies and risk taxonomies. The workflow emphasized end-to-end supervision, explicit critique, and corrective rewrites, supported by both manual and automated quality controls.

1. Task design and coverage

Each task consisted of a realistic, multi-turn conversation in which an agent interacted with one or more tools to complete a user request. Tasks were designed to cover:

  • Benign, dual-use, harmful, and jailbreak scenarios
  • Generative and non-generative agent actions
  • Multi-step tool execution, including parallel tool calls where appropriate
  • Five levels of harmfulness across user prompts, tool outputs, and agent responses

All tasks were tagged with structured metadata, including locale, risk category, harmfulness level, task type, tool call counts, and confirmation usage.
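
To make the tagging concrete, here is a minimal sketch of what one task's metadata record could look like; the field names and example values are assumptions for illustration, not the client's actual schema.

```python
from dataclasses import dataclass

# Hypothetical metadata record for a single task; field names and example
# values are illustrative assumptions, not the client's actual schema.
@dataclass
class TaskMetadata:
    task_id: str
    locale: str                  # e.g., "de-DE"
    risk_category: str           # one of the 30+ categories in the taxonomy
    harmfulness_level: int       # five-level scale across prompts, tool outputs, and responses
    task_type: str               # e.g., "generative" or "non_generative"
    tool_call_count: int
    confirmation_required: bool

example = TaskMetadata(
    task_id="task-0001",
    locale="de-DE",
    risk_category="dual_use",
    harmfulness_level=3,
    task_type="generative",
    tool_call_count=2,
    confirmation_required=True,
)
```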

2. Safety annotation and labeling

Annotators evaluated each conversation step by step, applying safety-relevant labels that reflect both the task type and policy expectations, including the labels below (sketched schematically after the list):

  • Grounded engagement, where responses preserved and relayed user- or tool-provided content without adding, suppressing, or amplifying meaning
  • Harm-free engagement for generative responses that introduced new content safely
  • Full refusal or partial refusal, applied according to policy and task context
  • Incorrect refusal or false engagement, where the agent declined or proceeded incorrectly
  • Policy violation types, such as missing confirmations, unsafe generation, or improper tool use
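
As referenced above, the sketch below shows one way this label set and a per-step annotation could be encoded; the identifiers are illustrative assumptions rather than the client's taxonomy names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Illustrative label set mirroring the categories listed above; identifier
# names are assumptions, not the client's taxonomy.
class SafetyLabel(Enum):
    GROUNDED_ENGAGEMENT = "grounded_engagement"
    HARM_FREE_ENGAGEMENT = "harm_free_engagement"
    FULL_REFUSAL = "full_refusal"
    PARTIAL_REFUSAL = "partial_refusal"
    INCORRECT_REFUSAL = "incorrect_refusal"
    FALSE_ENGAGEMENT = "false_engagement"
    POLICY_VIOLATION = "policy_violation"

@dataclass
class StepAnnotation:
    step_index: int                # position of the agent turn in the conversation
    label: SafetyLabel
    violation_type: Optional[str]  # e.g., "missing_confirmation" when label is POLICY_VIOLATION
    rationale: str                 # annotator's explicit critique for this step

annotation = StepAnnotation(
    step_index=3,
    label=SafetyLabel.POLICY_VIOLATION,
    violation_type="missing_confirmation",
    rationale="Destructive tool call executed without asking the user to confirm.",
)
```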

3. Full rewrite to policy-compliant ideal responses

When a violation was identified, annotators provided a full rewrite of the agent’s response that:

  • Aligned strictly with safety policy
  • Preserved the original task intent where possible
  • Maintained helpfulness without introducing new risk
  • Demonstrated the correct use of tools, confirmations, or refusals

Each conversation was continued through completion using the corrected responses, ensuring a fully policy-compliant trajectory.
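
A minimal sketch of how a violating turn, its policy-compliant rewrite, and the corrected trajectory might be represented, assuming a simple list-of-turns format; names and fields are illustrative, not the delivery schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical pairing of a violating agent turn with its policy-compliant
# rewrite; field names are illustrative, not the delivery format.
@dataclass
class RewrittenTurn:
    step_index: int
    original_response: str       # the agent turn flagged as a violation
    ideal_response: str          # full rewrite aligned with safety policy
    preserves_task_intent: bool  # whether the rewrite still advances the user's goal

def apply_rewrites(turns: List[str], rewrites: List[RewrittenTurn]) -> List[str]:
    """Return the corrected trajectory: each violating turn is replaced by its
    policy-compliant rewrite so the conversation stays compliant end to end."""
    corrected = list(turns)
    for rewrite in rewrites:
        corrected[rewrite.step_index] = rewrite.ideal_response
    return corrected
```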

4. Multi-pass human review and adjudication

All annotated conversations underwent multi-pass human review:

  • Reviewers independently validated safety decisions, rewrites, and metadata
  • Disagreements or errors triggered reviewer adjudication
  • Review outcomes were recorded using explicit decision signals: ACCEPT, MINOR FIX, MAJOR FIX, or BLOCK

Reviewers acted as arbiters to determine whether issues originated from annotation, policy interpretation, or task construction.
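
A small sketch of how the decision signals could be encoded and routed, assuming anything short of ACCEPT is sent back for rework; this is illustrative, not the actual review tooling.

```python
from enum import Enum

# Illustrative encoding of the decision signals named above; the routing logic
# is an assumption about the workflow, not a specification of it.
class ReviewDecision(Enum):
    ACCEPT = "accept"
    MINOR_FIX = "minor_fix"
    MAJOR_FIX = "major_fix"
    BLOCK = "block"

def needs_rework(decision: ReviewDecision) -> bool:
    """Anything short of ACCEPT routes the conversation back to annotation or
    adjudication before it can be accepted for delivery."""
    return decision is not ReviewDecision.ACCEPT
```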

5. Automated quality checks

In parallel with human review, automated checks enforced process consistency and policy alignment, including:

  • Schema and structural validation
  • Metadata completeness and correctness
  • Tool usage requirements
  • Safety policy alignment based on the taxonomy

These checks ensured that every conversation met baseline technical and policy constraints before final acceptance.
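
The sketch below illustrates the flavor of these automated checks, assuming each conversation is delivered as a dictionary with metadata, turns, and annotations; the required fields and rules are assumptions, not the actual validation suite.

```python
# Minimal sketch of the automated checks described above, assuming each
# conversation is a dict with "metadata", "turns", and "annotations" keys;
# required fields and rules are illustrative, not the actual validation suite.
REQUIRED_METADATA = {
    "locale", "risk_category", "harmfulness_level",
    "task_type", "tool_call_count", "confirmation_required",
}

def validate_conversation(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record
    passes these baseline structural and policy checks."""
    # Schema / structural validation
    missing_top = [k for k in ("metadata", "turns", "annotations") if k not in record]
    if missing_top:
        return [f"missing top-level field: {k}" for k in missing_top]

    errors = []

    # Metadata completeness
    missing_meta = REQUIRED_METADATA - record["metadata"].keys()
    if missing_meta:
        errors.append(f"incomplete metadata: {sorted(missing_meta)}")

    # Tool usage requirements: declared count must match observed tool calls
    observed = sum(1 for turn in record["turns"] if turn.get("type") == "tool_call")
    if observed != record["metadata"].get("tool_call_count"):
        errors.append("tool_call_count does not match observed tool calls")

    # Policy alignment: every flagged violation must carry a compliant rewrite
    for ann in record["annotations"]:
        if ann.get("label") == "policy_violation" and not ann.get("ideal_response"):
            errors.append(f"step {ann.get('step_index')}: violation without a rewrite")

    return errors
```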

Key Results

  • 24,000+ end-to-end agent traces delivered, each capturing full behavior across user turns, tool calls, confirmations, refusals, and final responses
  • 30+ risk categories represented, aligned with the client's taxonomy, ensuring broad and balanced safety coverage
  • 30+ policy dimensions evaluated per conversation, spanning generative vs non-generative actions, harmfulness levels, and tool interaction rules
  • Dual-layer QA enforced at scale, pairing expert human judgment with automated validation of structure, metadata, and safety constraints

The Outcome

The client now has a high-fidelity dataset to:

  • Fine-tune models on correct safety behavior across multi-turn, tool-based interactions
  • Evaluate agent safety decisions at each step, not just final outputs
  • Calibrate internal safety evaluation systems using richly labeled, policy-aligned data
  • Identify recurring safety failure modes related to tool use, confirmations, and refusals

This dataset provides a strong foundation for improving safety, alignment, and reliability in agentic systems operating in real-world tool environments.

Need high-fidelity supervision for agent safety evaluation?

Request a sample of supervised agent tool-use conversations with safety labels, critiques, and policy-compliant rewrites.

Request Sample

FAQ

What makes this dataset different from standard safety labeling?

This dataset supervises full agent behavior, including tool calls, confirmations, refusals, and final responses, rather than labeling only single-turn outputs.

Is this data used for pre-training?

No. The dataset is designed for safety fine-tuning, calibration, and evaluation, not large-scale pre-training.

How were safety violations handled?

Violations were identified through annotation and review, then corrected with a full rewrite to a policy-compliant ideal response.

What quality controls were applied?

The process combined multi-pass human review, reviewer adjudication with explicit decision outcomes, and automated checks for schema validity, metadata completeness, tool usage, and policy alignment.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Looking to improve safety behavior in tool-using agents?

Request safety supervision datasets that capture end-to-end agent behavior across real workflows.

Request Sample
