Advancing Multimodal Reasoning with 200+ PhD-Curated ImageQA Tasks
Created a human-authored, spatial-reasoning ImageQA dataset with over 200 expert-curated tasks for multimodal evaluation. Each task requires grounded reasoning over visual information, with the image serving as a core component of interpreting the question or constructing the response.
200+
high-difficulty ImageQA tasks across STEM disciplines, including computer science, engineering, mathematics, physics, biology, chemistry, medicine, and architecture
3-phase
review cycle spanning domain-expert creation, peer-expert validation, and non-expert search trials
100%
client acceptance, with each sample passing rigorous review and benchmarking thresholds

The Challenge
The client needed a benchmark dataset to evaluate multimodal model performance on tasks requiring:
- Spatial and visual reasoning grounded in diagrams, graphs, and technical illustrations
- Expert-authored question design spanning multiple scientific disciplines
- Resistance to web search and AI model shortcuts
- Structured explanations and gold-standard justifications for preference and reward modeling
Existing VQA datasets focus primarily on captioning, classification, or object recognition rather than on domain-specific reasoning grounded in images.
The Approach
Turing implemented a multi-phase creation and validation pipeline purpose-built for evaluating multimodal reasoning through images. The process ensured each ImageQA task was validated by experts and resistant to shortcuts via search or AI models.
1. Expert-level question authoring
- Domain experts, primarily PhDs, designed questions grounded in non-Googleable images: diagrams, graphs, and custom visuals
- Each task required spatial or symbolic reasoning and was formatted as a multiple-choice question with one correct answer and three distractors
- Questions were paired with explanations and source references
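For illustration, here is a minimal sketch of what a single task record could look like, assuming a simple Python schema; the field names, example values, and URL are hypothetical and do not reflect the client's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class ImageQATask:
    """Illustrative schema for one ImageQA task (all field names are hypothetical)."""
    task_id: str
    domain: str                     # e.g., "physics" or "architecture"
    image_path: str                 # the diagram, graph, or custom visual the question depends on
    question: str
    options: dict[str, str]         # four choices keyed "A"-"D": one correct answer, three distractors
    correct_option: str             # key of the correct choice
    explanation: str                # structured rationale supporting the answer
    references: list[str] = field(default_factory=list)  # source citations

task = ImageQATask(
    task_id="phys-0042",
    domain="physics",
    image_path="images/incline_free_body_diagram.png",
    question="Based on the free-body diagram, which force component keeps the block stationary?",
    options={"A": "Normal force", "B": "Static friction", "C": "Applied force", "D": "Tension"},
    correct_option="B",
    explanation="Resolving forces along the incline shows static friction balances the gravitational component.",
    references=["https://example.org/placeholder-source"],
)
```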
2. Multi-stage validation
Expert review:
Two independent domain experts attempted each question without access to the author’s answer key. A question advanced only when both experts independently selected the same answer and that answer matched the author’s key. After comparison with the key, reviewers either confirmed the correct response and provided structured feedback or flagged discrepancies for revision with detailed guidance.
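The advancement rule described above reduces to a simple consensus check. A minimal sketch, assuming each reviewer's blind answer and the author's key are available as option letters (the function name is illustrative):

```python
def passes_expert_review(reviewer_a: str, reviewer_b: str, author_key: str) -> bool:
    """A question advances only if both blind reviewers agree and match the author's key."""
    return reviewer_a == reviewer_b == author_key

# Both reviewers independently chose "B" and the author's key is "B": the question advances.
assert passes_expert_review("B", "B", "B")
# Any disagreement or mismatch with the key sends the question back for revision.
assert not passes_expert_review("B", "C", "B")
```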
Revision cycle:
- The original author reviewed peer feedback and revised questions for clarity, difficulty, or objectivity
- Only questions that passed expert validation advanced to the next stage
Novice adversarial testing:
- Two independent reviewers from unrelated domains attempted to solve each question using unrestricted web search and general-purpose tools
- Tasks solvable through search engines or general-purpose LLMs were flagged for revision or removal
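The novice reviewers worked manually, but the LLM portion of this check could in principle be scripted. A hedged sketch, reusing the illustrative ImageQATask schema above; ask_llm is a placeholder for whatever general-purpose multimodal model client is used:

```python
def is_shortcut_solvable(task, ask_llm, n_trials: int = 2) -> bool:
    """Flag a task if a general-purpose model answers it correctly on every trial.

    ask_llm(prompt, image_path) is a hypothetical callable that wraps a
    general-purpose multimodal model and returns an option key such as "B".
    """
    prompt = f"{task.question}\nOptions: {task.options}\nAnswer with a single letter."
    hits = sum(ask_llm(prompt, task.image_path) == task.correct_option for _ in range(n_trials))
    return hits == n_trials

# Tasks for which this returns True would be sent back to the author or removed from the pool.
```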
3. Structural compliance and final QA
- All tasks were reviewed for benchmark-aligned criteria including spatial reasoning, image dependency, non-searchability, objectivity, and answerability
- Questions were formatted for consistency and markdown compatibility, with embedded images and answer logic structured per the client’s standard
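As an illustration of the formatting step, a short sketch that renders a task record (using the illustrative schema above) as markdown with an embedded image; the layout shown is not the client's actual standard.

```python
def render_task_markdown(task) -> str:
    """Render an ImageQA task as markdown with an embedded image (layout is illustrative)."""
    options = "\n".join(f"- **{key}.** {text}" for key, text in task.options.items())
    return (
        f"### {task.task_id} ({task.domain})\n\n"
        f"![task image]({task.image_path})\n\n"
        f"{task.question}\n\n"
        f"{options}\n\n"
        f"**Answer:** {task.correct_option}\n\n"
        f"**Explanation:** {task.explanation}\n"
    )
```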
This pipeline ensured each task was human-authored, adversarially tested, image-dependent, and validated for multimodal evaluation.
Key Results
- Over 200 questions curated by PhDs across 20+ spatial-reasoning-heavy domains
- 100% client acceptance rate
- Structured explanations and reference links provided for every answer
- Used as a benchmark to evaluate frontier multimodal models
The Outcome
The resulting dataset provides a new standard for multimodal evaluation, offering:
- Image-grounded tasks for spatially grounded, symbolic, and scientific reasoning
- Signal for identifying failure modes in multimodal models
- A vetted pool of questions used to evaluate frontier multimodal systems across disciplines
Every task met defined criteria for image dependency, reasoning depth, domain correctness, and adversarial robustness.
Training reward models for grounded, visual reasoning?
Request ImageQA-style tasks designed for scientific, spatial, and symbolic reasoning, including distractors, image anchors, and peer-reviewed explanations.
Request Sample

FAQ
What types of images were used?
Charts, diagrams, structural schematics, graphs, biomedical imagery, and custom visuals, many drawn by hand or generated with code.
Were the questions validated against AI models?
Yes. Every task was tested against general-purpose LLMs. If solvable, it was revised or discarded.
What domains are covered?
The domains included computer science, engineering, math, physics, biology, chemistry, medicine, architecture, and more.
Can this data be reused for RLHF or reward modeling?
Yes. Each task includes structured rationale and distractor logic suitable for agent training and preference modeling.
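As one hedged example of that reuse, a multiple-choice task with a rationale can be unrolled into chosen/rejected pairs for preference or reward modeling; the pair format below is illustrative, not the client's delivery schema.

```python
def to_preference_pairs(task) -> list[dict]:
    """Turn one MCQ task into (chosen, rejected) pairs for preference modeling.

    The chosen response pairs the correct option with its structured rationale;
    each distractor becomes a rejected response. The format is illustrative.
    """
    prompt = f"[image: {task.image_path}]\n{task.question}"
    chosen = f"{task.correct_option}. {task.options[task.correct_option]}\nRationale: {task.explanation}"
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": f"{key}. {text}"}
        for key, text in task.options.items()
        if key != task.correct_option
    ]
```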
What’s the NDA process?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How fast can I get a sample?
Within three business days after NDA execution.
Want image-grounded evaluation data designed to test frontier models?
Request PhD-authored ImageQA samples built to resist retrieval shortcuts and evaluate spatial reasoning.