Building 2,000+ Human-Grounded Theory-of-Mind Dialogues for Persuasion Research
Delivered a large-scale, double-blind persuasion dialogue dataset annotated with structured Theory-of-Mind (ToM) reflections. Each task captured belief evolution, mid-conversation mental state inferences, and post-dialogue belief updates, providing human-grounded supervision signals to improve model performance on ToM-focused benchmarks such as PersuasiveToM, NegotiationToM, and ToMATO.
2,000+ double-blind persuasion dialogues delivered, each with full pre-test, dialogue, mid-check, and post-test annotations.
10-dimension QA rubric enforced, covering dialogue coherence, persuasion strategy, ToM depth, belief evolution consistency, and language appropriateness.
100% human-authored content: zero LLM-generated dialogue or annotations permitted, verified through QA review.

The Challenge
The client needed a large-scale dataset to improve model performance on ToM benchmarks focused on persuasive dialogue, where current models consistently underperform relative to human baselines. Existing approaches fell short in several ways:
- Standard dialogue datasets capture only surface text and persuasion outcomes, omitting the internal cognitive states that drive belief change
- Static ToM benchmarks test mental state inference in isolation, without capturing how beliefs evolve across multiple turns of real conversation
- Synthetic reasoning and chain-of-thought data lack the grounded human psychology needed to train models on genuine inference under uncertainty
- Datasets that expose internal states directly break the double-blind conditions that make persuasion realistic and psychologically valid
The client required a dataset that could capture both observable persuasion behavior and the unspoken cognitive layer beneath it, collected under conditions that preserved real-world persuasion dynamics. This meant annotating not just what participants said, but what they thought, felt, and inferred about their partner throughout the conversation.
The Approach
Turing designed and executed a structured data collection pipeline built around double-blind persuasive dialogue, mid-conversation ToM annotation, and belief evolution tracking.
1. Double-blind persuasion dialogue design
Each task paired two participants in assigned roles: a Believer, who held and defended a genuine belief, and a Challenger, who attempted to shift that belief through respectful argument. Topics spanned ethics and morality, public policy, technology, and social issues.
Key structural constraints preserved double-blind conditions throughout:
- Neither participant had access to the other's internal reflections, belief strengths, or stated reasoning
- No persuasion strategies were disclosed or prompted in advance
- Participants constructed arguments under genuine uncertainty about the other's mindset
Each dialogue consisted of a minimum of 15 turns, with individual messages capped at 200 words to maintain focus and prevent front-loading of arguments. Participants were guided to present one argument per turn, advance reasoning progressively, and avoid validation templates or generic open-ended questions.
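The structural limits above lend themselves to an automated pre-check before human QA. The following is a minimal sketch, not the pipeline Turing used: the `Turn` type, the role labels, and the `validate_dialogue` helper are hypothetical names, and it assumes one turn equals one message and that words are counted by whitespace splitting.

```python
from dataclasses import dataclass

MIN_TURNS = 15                # minimum turns per dialogue, as stated in the text
MAX_WORDS_PER_MESSAGE = 200   # per-message word cap, as stated in the text


@dataclass
class Turn:
    role: str   # hypothetical labels: "believer" or "challenger"
    text: str


def validate_dialogue(turns: list[Turn]) -> list[str]:
    """Return human-readable violations; an empty list means the dialogue passes."""
    problems = []
    if len(turns) < MIN_TURNS:
        problems.append(f"only {len(turns)} turns; minimum is {MIN_TURNS}")
    for i, turn in enumerate(turns, start=1):
        # Assumption: a "word" is any whitespace-delimited token.
        n_words = len(turn.text.split())
        if n_words > MAX_WORDS_PER_MESSAGE:
            problems.append(
                f"turn {i} has {n_words} words; cap is {MAX_WORDS_PER_MESSAGE}"
            )
    return problems
```

A check like this covers only the countable constraints. Whether a participant presented one argument per turn or avoided validation templates still requires human review.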
2. Structured ToM mid-checks
At fixed intervals during each dialogue, both participants completed private mid-conversation reflections designed to capture subtext, not summaries. Each mid-check required structured responses across two perspectives:
- Self-inference: underlying beliefs driving current responses; underlying emotions influencing behavior; underlying goals, motives, or preferences shaping strategy.
- Partner-inference: beliefs inferred in the conversational partner; emotions attributed to the partner; goals or intentions the participant attributed to the other.
Participants were explicitly instructed to describe unspoken thoughts, avoid paraphrasing dialogue, and focus on internal reasoning and assumptions. These mid-checks provided human-grounded mental state annotations that are rare in existing datasets.
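One way to picture a single mid-check is as a record with two mirrored perspectives. The sketch below is illustrative only: the class names, the `after_turn` field, and the `empty_fields` helper are hypothetical, and it models just the six fields named in the two bullets above, so the delivered schema may contain more.

```python
from dataclasses import dataclass, fields


@dataclass
class MentalState:
    beliefs: str    # underlying or inferred beliefs
    emotions: str   # underlying or attributed emotions
    goals: str      # goals, motives, preferences, or intentions


@dataclass
class MidCheck:
    participant_role: str           # hypothetical label, e.g. "believer"
    after_turn: int                 # hypothetical: turn index of the checkpoint
    self_inference: MentalState     # what the participant reports about themselves
    partner_inference: MentalState  # what the participant infers about the partner


def empty_fields(check: MidCheck) -> list[str]:
    """List the perspective.field names that were left blank."""
    missing = []
    for perspective in ("self_inference", "partner_inference"):
        state = getattr(check, perspective)
        for f in fields(state):
            if not getattr(state, f.name).strip():
                missing.append(f"{perspective}.{f.name}")
    return missing
```

Keeping the two perspectives as the same type makes it straightforward to compare what one participant inferred about the partner with what the partner self-reported at the same checkpoint.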
3. Belief evolution tracking
Each task captured belief dynamics across three stages:
- Pre-test: participants provided initial belief strength on a 0–100 scale, three supporting reasons with individual strength ratings, and a minimum 75-word belief explanation
- Mid-conversation: participants reflected on evolving beliefs, emotions, and inferences at fixed dialogue checkpoints
- Post-test: participants re-rated overall belief strength, updated individual reason strengths, and explained any changes in a minimum 75-word summary
Internal consistency rules governed the relationship between overall belief strength and individual reason ratings, ensuring that changes in one were reflected coherently in the other. This produced a longitudinal record of belief stability, partial shifts, and resistance patterns across each conversation.
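The pre-test and post-test requirements can be sketched as a small validation routine. The source does not spell out the internal consistency rules, so the `direction_consistent` check below is an assumed stand-in (overall belief and summed reason strengths should not move in opposite directions), not the rule actually applied. The 0–100 scale for individual reasons is also an assumption, and all names are hypothetical.

```python
from dataclasses import dataclass

MIN_EXPLANATION_WORDS = 75   # minimum explanation length, as stated in the text


@dataclass
class BeliefReport:
    overall_strength: int         # 0-100 scale, as stated in the text
    reason_strengths: list[int]   # three ratings; 0-100 scale is an assumption
    explanation: str


def validate_report(report: BeliefReport) -> list[str]:
    """Return violations of the stated format requirements."""
    problems = []
    if not 0 <= report.overall_strength <= 100:
        problems.append("overall strength outside 0-100")
    if len(report.reason_strengths) != 3:
        problems.append("expected exactly three reason ratings")
    if any(not 0 <= r <= 100 for r in report.reason_strengths):
        problems.append("reason strength outside 0-100")
    if len(report.explanation.split()) < MIN_EXPLANATION_WORDS:
        problems.append(f"explanation shorter than {MIN_EXPLANATION_WORDS} words")
    return problems


def direction_consistent(pre: BeliefReport, post: BeliefReport) -> bool:
    """ASSUMED rule: overall and summed reason strengths must not move oppositely."""
    overall_delta = post.overall_strength - pre.overall_strength
    reason_delta = sum(post.reason_strengths) - sum(pre.reason_strengths)
    # A product below zero means one rose while the other fell.
    return overall_delta * reason_delta >= 0
```

Under this assumed rule, a participant whose overall belief rises while every supporting reason weakens would be flagged for reviewer attention rather than rejected automatically.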
4. Multi-dimensional QA rubric
All tasks were reviewed against a multi-dimensional QA rubric covering:
- Coherence and flow for both Believer and Challenger: whether arguments were built progressively, responded to the other party's points, and were free of grammatical issues
- Justification and persuasion strategy: whether participants used evidence, analogies, personal anecdotes, or examples to support their positions and engage with the other's core arguments
- Mid-check expression for both roles: whether reflections captured genuinely unsaid insights across beliefs, emotions, and preferences, rather than restating dialogue
- Pre- and post-test belief expression and evolution: whether explanations were coherent, consistent with stated belief strengths, and internally non-contradictory
- Tone and language appropriateness: a binary pass/fail check for civil, non-abusive language
QA reviewers were also trained to detect LLM usage through linguistic markers, treating flagged vocabulary as evidence of non-compliance.
Key Results
- More than 2,000 persuasion dialogue tasks delivered, each spanning pre-test, multi-turn dialogue, mid-conversation ToM annotations, and post-test belief updates
- 8 structured ToM dimensions captured per mid-check per participant, providing human-reported first- and second-order mental state data across the full dialogue arc
- 100% double-blind collection; no participant had access to the other's internal states, preserving the inferential uncertainty that characterizes real persuasion
- Zero synthetic content; all dialogue and annotations were human-authored, with LLM detection protocols enforced throughout QA
The Outcome
The client received a human-grounded persuasion dataset structured to provide the supervision signals required for ToM-focused model training. By capturing what participants said, what they privately believed, and what they inferred about their partner, the dataset addresses a fundamental gap in existing dialogue resources: the absence of the cognitive layer that underlies real persuasion.
Specifically, the dataset supports:
- Training models to infer beliefs, emotions, and intentions from dialogue without access to explicit state disclosures
- Improving belief tracking consistency across multi-turn conversations, a primary failure mode on benchmarks such as PersuasiveToM, NegotiationToM, and ToMATO
- Reducing intention misclassification by providing supervision that distinguishes surface compliance from genuine internal belief change
- Strengthening mental state representations to support downstream gains on ToM Application tasks, including persuasion strategy prediction and effectiveness judgment
Need human-grounded Theory-of-Mind data for persuasion model training?
Request a sample of double-blind persuasion dialogues with structured mid-conversation belief and mental state annotations.
FAQ
What makes this dataset different from standard persuasion or dialogue datasets?
Most persuasion datasets capture only surface dialogue and outcome labels. This dataset adds structured mid-conversation ToM annotations, private to each participant, covering self-reported beliefs, emotions, and desires, as well as inferred partner mental states. All data is collected under double-blind conditions, preserving the inferential uncertainty of real persuasion.
How were mid-checks designed to avoid restating dialogue?
Participants were explicitly instructed to describe unspoken thoughts rather than summarize what had been said. The eight-field structure separated self-inference from partner-inference across beliefs, emotions, and preferences. QA rubric criteria specifically penalized surface restatement, with tasks rejected when mid-checks failed to provide genuinely unsaid insight across two or more dimensions.
How was LLM usage detected and prevented?
QA reviewers were trained to identify LLM-associated vocabulary patterns in both dialogue and annotations. The guidelines explicitly prohibited unnatural phrasing typically produced by language models, and flagged vocabulary was treated as evidence of non-compliance. All tasks were human-authored, with this standard enforced throughout the review process.
How does this dataset support performance on benchmarks like PersuasiveToM?
The dataset directly targets the ToM Reasoning component of such benchmarks by supplying human-grounded supervision for belief, intention, and mental state tracking across dialogue turns. Improving the quality and consistency of mental state representations supports downstream gains on ToM Application tasks, including strategy prediction and effectiveness judgment, which depend on accurate inference of evolving beliefs and intentions.
What’s the NDA process?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How fast can I get a sample?
Within three business days after NDA execution.


