Revealing Systemic Chart Reasoning Gaps with 20K+ Expert CoTs

Built a 20K-sample dataset to surface model failures in scientific chart reasoning, enabling more accurate evaluation, reward shaping, and subfigure calibration.

20,000+

Expert-written tasks and rewrites: Built to evaluate figure reasoning across 7 domains.

98%+

Accuracy on descriptive QA: Backed by expert annotation and multi-layered review.

7–8 pt

Estimated accuracy lift: After fine-tuning on trend comparison and visual reasoning tasks.

Industry: AI Research
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement

A leading LLM research team partnered with Turing to explore reasoning failures in scientific chart understanding. The goal was to evaluate and improve how multimodal models handle subplots, trends, and layout logic across complex, domain-specific figures.

The Challenge

Multimodal models fail to reliably reason over subplots, interpret overlapping trends across subplots, report accurate data values, or align readings with legends. The lab needed high-quality prompts that would:

  • Surface reasoning breakdowns in charts with 3–6 subplots, overlapping trends, ambiguous axes, or indirect labels
  • Provide long-form, 15–30+ step chain-of-thought (CoT) explanations, fully grounded in the chart image, not external context
  • Simulate real-world evaluation use cases across research-grade plots and performance dashboards

The Approach

Dataset

  • 10,000+ question-and-answer tasks designed to break SOTA models, each with a long-form CoT (15–30+ steps)
  • 10,000+ CoT rewrites to support model fine-tuning and reward model design
  • Domains: Computer Science, Economics, Mathematics, Physics, Quant. Biology, Quant. Finance, and Statistics
  • Multi-hop questions requiring trend comparison, subplot differentiation, and legend decoding
  • Prompt construction:
    a. Questions requiring multiple steps of deductive reasoning
    b. Visual-only grounding in the chart itself, with no outside context from the image caption, the source paper, or elsewhere
  • Answer format (an illustrative record sketch follows this list):
    a. Step-by-step CoT using LaTeX and precise numerics
    b. Concise answer summary for eval alignment
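
To make the sample structure concrete, below is a minimal, hypothetical sketch of how one record might be laid out; the class and field names are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical record layout for one chart-reasoning sample.
# Field names are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChartCoTSample:
    chart_image_path: str   # source figure with >= 3 subplots
    domain: str             # e.g., "Physics" or "Quant. Finance"
    question: str           # multi-hop question grounded only in the figure
    cot_steps: List[str] = field(default_factory=list)  # 15-30+ reasoning steps, LaTeX allowed
    final_answer: str = ""  # concise summary used for eval alignment

sample = ChartCoTSample(
    chart_image_path="figures/example_0001.png",
    domain="Statistics",
    question="Across subplots (a)-(c), which series crosses the dashed baseline first?",
    cot_steps=[
        "Step 1: Locate the dashed baseline at $y = 0.5$ in each subplot.",
        "Step 2: In subplot (a), the blue series crosses the baseline near $x \\approx 12$.",
        # ...15-30+ steps in the real data
    ],
    final_answer="The blue series in subplot (a) crosses the baseline first.",
)
```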

Evaluation

  • PhD-level expert analysts performed signal-detection tests on 100+ sample charts to ensure nuanced understanding of axes, legends, and encodings
  • Every sample reviewed for factuality, flow, and grounding at the step level
  • Automated checks enforced structure, notation, and reasoning clarity (a minimal example of such a check follows this list)
  • Achieved:
    a. ≥ 98% accuracy on descriptive CoTs
    b. ≥ 95% on full reasoning sequences
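
The structural checks referenced above can be pictured as simple programmatic gates. The sketch below is an assumption about what such a gate might look like, not Turing's actual QA pipeline; it only verifies step count, step numbering, and the presence of a concise final answer.

```python
import re

def passes_structure_check(cot_steps: list[str], final_answer: str) -> bool:
    """Illustrative structural gate for one CoT sample (not the production pipeline)."""
    # The dataset spec calls for 15-30+ reasoning steps per sample.
    has_enough_steps = len(cot_steps) >= 15
    # Explicit numbering lets reviewers align step-level factuality checks.
    steps_numbered = all(re.match(r"Step \d+:", step) for step in cot_steps)
    # A concise final answer is required for eval alignment.
    has_final_answer = bool(final_answer.strip())
    return has_enough_steps and steps_numbered and has_final_answer
```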

Key Metrics

  • Revealed failure patterns in axis interpretation, subplot confusion, and overconfident trend mislabeling
  • Estimated to improve model accuracy by 7–8 percentage points on trend comparison after supervised fine-tuning (SFT)
  • Internal tests showed measurable gains in trend-following and figure reasoning

The Outcome

With this annotated CoT data, the lab is now able to:

  • Train models on open-ended chart reasoning and subfigure calibration
  • Evaluate grounding precision across color, shape, and axis cues
  • Penalize visual hallucinations via reward model shaping (see the sketch after this list)
  • Extend evaluation to business dashboards and KPI charts
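
As one hedged illustration of the reward-shaping idea, the sketch below subtracts a fixed penalty for each chart element a response mentions that is not grounded in the figure's annotations; the function, element sets, and penalty value are assumptions for demonstration, not the lab's actual reward model.

```python
def shaped_reward(base_reward: float,
                  mentioned_elements: set[str],
                  grounded_elements: set[str],
                  penalty_per_hallucination: float = 0.2) -> float:
    """Penalize references to chart elements that do not exist in the figure."""
    hallucinated = mentioned_elements - grounded_elements
    return base_reward - penalty_per_hallucination * len(hallucinated)

# A response citing a "green series" absent from the legend loses part of its reward.
reward = shaped_reward(
    base_reward=1.0,
    mentioned_elements={"blue series", "red series", "green series"},
    grounded_elements={"blue series", "red series"},
)
print(reward)  # 0.8
```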

Stress-test your multimodal model on scientific figures

Request a curated sample with chart image, QA pair, long-form reasoning, and step-level annotations. Evaluate how your model handles trend comparison, legend cues, and subfigure alignment.

Scope a Pilot with Turing

FAQ

How long are the CoTs?

Each CoT includes 15–30+ reasoning steps, grounded in real visual elements.

What types of figures are included?

Real-world scientific charts with ≥3 subplots: line, scatter, histogram, bar, and hybrid layouts.

What’s the QA process?

All outputs undergo multi-step review combining human experts and automated checks, with adjudication.

What’s the NDA process?

Turing uses a standard mutual NDA and returns a countersignature within one business day.

How soon can I test it?

Samples are delivered within 3 business days of NDA execution.

Want to see where your model breaks?

Request a sample from the Scientific Chart CoT Dataset, including chart image, detailed QA, reasoning trace, and common failure annotations.

Scope a Pilot with Turing