STEM datasets for post-training evaluation and reasoning

Human-authored datasets, benchmarks, and tools for evaluating and improving scientific and mathematical reasoning in LLMs.

STEM datasets

Human-authored datasets across STEM domains to support scientific accuracy, alignment training, and symbolic rigor at scale.

Math, Physics, Chemistry, and Biology Datasets

Datasets curated to test logical structure, problem-solving accuracy, and formal rigor, grounded in real-world scientific domains.
Request STEM Data Packs

Chain-of-Thought + Stepwise Reasoning Packs

Trace-based reasoning examples scored for fidelity, designed for training and reward shaping; a record-format sketch follows below.
Request CoT Datasets
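
As a rough illustration, here is a hypothetical shape for one such record: a question, a stepwise trace, and per-step fidelity scores. The field names and scoring scheme are assumptions for illustration, not a published schema.

```python
# Hypothetical chain-of-thought record with step-level fidelity scores.
# Field names are illustrative only, not an actual dataset schema.

cot_record = {
    "question": "A train travels 120 km in 1.5 hours. What is its average speed?",
    "steps": [
        {"text": "Average speed = distance / time.", "fidelity": 1.0},
        {"text": "120 km / 1.5 h = 80 km/h.", "fidelity": 1.0},
    ],
    "final_answer": "80 km/h",
}

# Step-level scores can drive reward shaping, e.g. mean fidelity as a dense reward.
mean_fidelity = sum(s["fidelity"] for s in cot_record["steps"]) / len(cot_record["steps"])
print(mean_fidelity)  # 1.0
```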

High-Throughput Training Data

Structured datasets delivered at high throughput, optimized for SFT, RLHF, and symbolic alignment workflows.
Request Sample Datasets
Domain-specific dataset development

Lean-Based Proof QA Datasets

Iterative proof generation in Lean 4 paired with informal math questions, supporting symbolic reasoning and fine-tuned verification; a minimal example pair is sketched below.
Request Symbolic Reasoning Datasets
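
To show what a proof-QA pair can look like, here is a minimal sketch: an informal question in natural language alongside a formal Lean 4 statement and proof. The theorem name and phrasing are illustrative, not drawn from the actual dataset.

```lean
/-
  Informal question (the natural-language half of the pair):
  "Prove that adding zero on the left leaves a natural number unchanged."
-/
-- Formal half of the pair; name and statement are illustrative only.
theorem my_zero_add (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ k ih => rw [Nat.add_succ, ih]
```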

Benchmarks and evaluation

Rubric-aligned benchmarks and structured diagnostics that surface STEM-specific model weaknesses and reasoning gaps.

VLM-Bench

Benchmark model reasoning on over 700 vision–language tasks grounded in STEM, logic, and world knowledge.
Download Report

GPQA, AIME, and MMLU-Pro Comparisons

See how your models stack up on established benchmarks, or define your own test sets with domain-specific metrics.
Run a Diagnostic
Search-resistant problem formulation

High-Difficulty STEM Benchmarks

Evaluate your model’s capability on problems that current SOTA models fail to solve, paired with rubric-based grading and expert-written reference answers.
Run a Diagnostic

RL environments for STEM workflows

Evaluate agents on real-world STEM tasks, generate fine-tuning trajectories, and train reward models in reproducible, high-fidelity environments.

UI-Based RL Environments for Interface Agents

Evaluate scientific reasoning agents within virtual lab environments that simulate physical, chemical, or biological systems.
Request UI Agent Environments

MCP Environments for Function-Calling Agents

Train agents on function calling and tool execution inside sandboxed server environments. Includes tool schemas, reward verifiers, and seed databases; a schema-and-verifier sketch follows below.
Request Function-Calling Environments
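
For a sense of how these pieces fit together, below is a minimal Python sketch of a tool schema paired with a reward verifier over recorded tool calls. The calculator tool, field names, and verifier logic are all hypothetical, not the shipped schema or environment format.

```python
# Hypothetical function-calling environment record: one tool schema plus a
# reward verifier over a recorded trajectory. Structure is illustrative only.

calculator_tool = {
    "name": "evaluate_expression",
    "description": "Evaluate a basic arithmetic expression and return the result.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

def verify_trajectory(tool_calls: list[dict], expected_result: float) -> float:
    """Return 1.0 if any recorded tool call produced the expected result, else 0.0."""
    for call in tool_calls:
        if call.get("tool") == "evaluate_expression" and call.get("result") == expected_result:
            return 1.0
    return 0.0

# Example: a single correct call earns full reward.
trajectory = [{"tool": "evaluate_expression", "arguments": {"expression": "6*7"}, "result": 42.0}]
print(verify_trajectory(trajectory, expected_result=42.0))  # 1.0
```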

End-to-End Evaluation and Training Loops

Each RL environment includes prompts, verifiers, analytics harnesses, and trajectory outputs, enabling evaluation diagnostics, reward shaping, and supervised fine-tuning at scale; a minimal loop is sketched below.
Request RL Environments
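
Below is a minimal sketch of how those pieces compose, assuming a hypothetical environment interface (reset/step/verify) and a toy single-step task. It is not a real SDK, only an illustration of how prompts, verifiers, and trajectory outputs can feed reward shaping and SFT data selection.

```python
# Minimal end-to-end evaluation loop over a hypothetical RL environment.
# The ToyEnv interface and the agent callable are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Trajectory:
    prompt: str
    steps: list[tuple[str, str]] = field(default_factory=list)  # (action, observation)
    reward: float = 0.0

class ToyEnv:
    """Stand-in environment: one prompt, rewards any action containing '42'."""
    def reset(self) -> tuple[str, str]:
        return "What is 6 * 7?", "calculator available"
    def step(self, action: str) -> tuple[str, bool]:
        return f"executed: {action}", True            # single-step episode
    def verify(self, steps: list[tuple[str, str]]) -> float:
        return 1.0 if any("42" in action for action, _ in steps) else 0.0

def run_episode(env, act) -> Trajectory:
    prompt, obs = env.reset()
    traj = Trajectory(prompt=prompt)
    done = False
    while not done:
        action = act(prompt, obs)
        obs, done = env.step(action)
        traj.steps.append((action, obs))
    traj.reward = env.verify(traj.steps)              # verifier score for reward shaping
    return traj

traj = run_episode(ToyEnv(), act=lambda prompt, obs: "answer 42")
print(traj.reward)  # 1.0 -> eligible for an SFT or reward-model training set
```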

Research and case studies

Scale STEM reasoning with expert-built datasets

Train, fine-tune, or evaluate models on structured STEM tasks, backed by domain-reviewed data and traceable QA.

Talk to a Researcher