Built a 20K-sample dataset to surface model failures in scientific chart reasoning, enabling more accurate evaluation, reward shaping, and subfigure calibration.
A leading LLM research team partnered with Turing to explore reasoning failures in scientific chart understanding. The goal was to evaluate and improve how multimodal models handle subplots, trends, and layout logic across complex, domain-specific figures.
Multimodal models fail to reliably reason over subplots, interpret overlapping trends across subplots, extract accurate data values, or align data series with their legend entries. The lab needed high-quality prompts that would surface these failures across subplots, trends, and layout logic.
Dataset
Real-world scientific charts with ≥3 subplots: line, scatter, histogram, bar, and hybrid layouts. Each CoT includes 15–30+ reasoning steps, grounded in real visual elements.

Evaluation
All outputs undergo multi-step human expert and automated review, with adjudication.

With this annotated CoT data, the lab is now able to run more accurate evaluations, shape rewards, and calibrate subfigure-level reasoning.

Request a curated sample with chart image, QA pair, long-form reasoning, and step-level annotations. Evaluate how your model handles trend comparison, legend cues, and subfigure alignment.

Scope a Pilot with Turing
Sample access requires a standard mutual NDA; Turing returns the countersignature within one business day.
Samples are delivered within 3 business days of NDA execution.
Request a sample from the Scientific Chart CoT Dataset, including chart image, detailed QA, reasoning trace, and common failure annotations.
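For reference, a single delivered record could be sketched roughly as below. This is an illustrative assumption of how the components named above (chart image, QA pair, reasoning trace, step-level and failure annotations) might be organized; the field names and values are hypothetical, not the dataset's actual schema.

```python
# Illustrative sketch only: hypothetical field names, not the dataset's actual schema.
sample_record = {
    "chart_image": "example_figure.png",   # real-world scientific chart with >=3 subplots
    "qa_pair": {
        "question": "Which subplot shows the steepest decline after the midpoint of the x-axis?",
        "answer": "Subplot (c), based on the dashed series identified in the legend.",
    },
    "reasoning_trace": [                    # long-form CoT, typically 15-30+ steps
        {"step": 1, "text": "Identify the three subplots and their shared x-axis."},
        {"step": 2, "text": "Match the dashed line to its legend entry in each subplot."},
        {"step": 3, "text": "Compare slopes of the matched series after the x-axis midpoint."},
        # ...remaining steps grounded in specific visual elements...
    ],
    "step_annotations": [                   # step-level labels for evaluation / reward shaping
        {"step": 1, "grounded_in": "subplot layout", "correct": True},
        {"step": 2, "grounded_in": "legend", "correct": True},
    ],
    "failure_annotations": [                # common failure modes this item is designed to probe
        "trend comparison across subplots",
        "legend-to-series alignment",
    ],
}
```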