Revealing Systemic Chart Reasoning Gaps with 20K+ Expert CoTs

Built a 20K-sample dataset to surface model failures in scientific chart reasoning, enabling more accurate evaluation, reward shaping, and subfigure calibration.

20,000+

Expert-written tasks and rewrites: Built to evaluate figure reasoning across 7 domains.

98%+

Accuracy on descriptive QA: Backed by expert annotation and multi-layered review.

7–8 pt

Estimated accuracy lift: After fine-tuning on trend comparison and visual reasoning tasks.

Industry: AI Research
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement

A leading LLM research team partnered with Turing to explore reasoning failures in scientific chart understanding. The goal was to evaluate and improve how multimodal models handle subplots, trends, and layout logic across complex, domain-specific figures.

The Challenge

Multimodal models fail to reliably reason over subplots, interpret overlapping trends across subplots, report accurate data values, or align readings with legends. The lab needed high-quality prompts that would:

  • Surface reasoning breakdowns in charts with 3–6 subplots, overlapping trends, ambiguous axes, or indirect labels
  • Provide long-form, 15–30+ step chain-of-thought (CoT) explanations, fully grounded in the chart image, not external context
  • Simulate real-world evaluation use cases across research-grade plots and performance dashboards

The Approach

Dataset

  • 10,000+ question-and-answer tasks designed to break SOTA models, each with a long-form CoT (15–30+ steps)
  • 10,000+ CoT rewrites to support model fine-tuning and reward model design
  • Domains: Computer Science, Economics, Mathematics, Physics, Quant. Biology, Quant. Finance, and Statistics
  • Multi-hop questions requiring trend comparison, subplot differentiation, and legend decoding
  • Prompt construction:
    a. Questions requiring multiple steps of deductive reasoning
    b. Visual-only grounding in the chart itself, with no outside context from the image caption, the source paper, or elsewhere
  • Answer format (an illustrative record sketch follows this list):
    a. Step-by-step CoT using LaTeX and precise numerics
    b. Concise answer summary for eval alignment
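
To make the sample structure concrete, below is a minimal, hypothetical sketch of how one record might be laid out; the class and field names are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical record layout for one chart-reasoning sample.
# Field names are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChartCoTSample:
    chart_image_path: str   # source figure with >= 3 subplots
    domain: str             # e.g., "Physics" or "Quant. Finance"
    question: str           # multi-hop question grounded only in the figure
    cot_steps: List[str] = field(default_factory=list)  # 15-30+ reasoning steps, LaTeX allowed
    final_answer: str = ""  # concise summary used for eval alignment

sample = ChartCoTSample(
    chart_image_path="figures/example_0001.png",
    domain="Statistics",
    question="Across subplots (a)-(c), which series crosses the dashed baseline first?",
    cot_steps=[
        "Step 1: Locate the dashed baseline at $y = 0.5$ in each subplot.",
        "Step 2: In subplot (a), the blue series crosses the baseline near $x \\approx 12$.",
        # ...15-30+ steps in the real data
    ],
    final_answer="The blue series in subplot (a) crosses the baseline first.",
)
```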

Evaluation

  • PhD-level expert analysts performed signal-detection tests on 100+ sample charts to ensure nuanced understanding of axes, legends, and encodings
  • Every sample reviewed for factuality, flow, and grounding at the step level
  • Automated checks enforced structure, notation, and reasoning clarity (a minimal example of such a check follows this list)
  • Achieved:
    a. ≥ 98% accuracy on descriptive CoTs
    b. ≥ 95% on full reasoning sequences
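
The structural checks referenced above can be pictured as simple programmatic gates. The sketch below is an assumption about what such a gate might look like, not Turing's actual QA pipeline; it only verifies step count, step numbering, and the presence of a concise final answer.

```python
import re

def passes_structure_check(cot_steps: list[str], final_answer: str) -> bool:
    """Illustrative structural gate for one CoT sample (not the production pipeline)."""
    # The dataset spec calls for 15-30+ reasoning steps per sample.
    has_enough_steps = len(cot_steps) >= 15
    # Explicit numbering lets reviewers align step-level factuality checks.
    steps_numbered = all(re.match(r"Step \d+:", step) for step in cot_steps)
    # A concise final answer is required for eval alignment.
    has_final_answer = bool(final_answer.strip())
    return has_enough_steps and steps_numbered and has_final_answer
```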

Key Metrics

  • Revealed failure patterns in axis interpretation, subplot confusion, and overconfident trend mislabeling
  • Estimated to improve model accuracy by 7–8 percentage points on trend comparison after supervised fine-tuning (SFT)
  • Internal tests showed measurable gains in trend-following and figure reasoning

The Outcome

With this annotated CoT data, the lab is now able to:

  • Train models on open-ended chart reasoning and subfigure calibration
  • Evaluate grounding precision across color, shape, and axis cues
  • Penalize visual hallucinations via reward model shaping (see the sketch after this list)
  • Extend evaluation to business dashboards and KPI charts
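
As one hedged illustration of the reward-shaping idea, the sketch below subtracts a fixed penalty for each chart element a response mentions that is not grounded in the figure's annotations; the function, element sets, and penalty value are assumptions for demonstration, not the lab's actual reward model.

```python
def shaped_reward(base_reward: float,
                  mentioned_elements: set[str],
                  grounded_elements: set[str],
                  penalty_per_hallucination: float = 0.2) -> float:
    """Penalize references to chart elements that do not exist in the figure."""
    hallucinated = mentioned_elements - grounded_elements
    return base_reward - penalty_per_hallucination * len(hallucinated)

# A response citing a "green series" absent from the legend loses part of its reward.
reward = shaped_reward(
    base_reward=1.0,
    mentioned_elements={"blue series", "red series", "green series"},
    grounded_elements={"blue series", "red series"},
)
print(reward)  # 0.8
```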

Stress-test your multimodal model on scientific figures

Request a curated sample with chart image, QA pair, long-form reasoning, and step-level annotations. Evaluate how your model handles trend comparison, legend cues, and subfigure alignment.

Scope a Pilot with Turing

FAQ

How long are the CoTs?

Each CoT includes 15–30+ reasoning steps, grounded in real visual elements.

What types of figures are included?

Real-world scientific charts with ≥3 subplots: line, scatter, histogram, bar, and hybrid layouts.

What’s the QA process?

All outputs undergo multi-step review combining human experts and automated checks, with adjudication.

What’s the NDA process?

Turing uses a standard mutual NDA and returns a countersignature within one business day.

How soon can I test it?

Samples are delivered within 3 business days of NDA execution.

Want to see where your model breaks?

Request a sample from the Scientific Chart CoT Dataset, including chart image, detailed QA, reasoning trace, and common failure annotations.

Scope a Pilot with Turing