Human-authored datasets, benchmarks, and tools for evaluating and improving scientific and mathematical reasoning in LLMs.

Human-authored datasets across STEM domains that support scientific accuracy, alignment training, and symbolic rigor at scale.
Rubric-aligned benchmarks and structured diagnostics that surface STEM-specific model weaknesses and reasoning gaps.
Reproducible, high-fidelity environments for evaluating agents on real-world STEM tasks, generating fine-tuning trajectories, and training reward models.
Structured STEM tasks for training, fine-tuning, or evaluating models, backed by domain-reviewed data and traceable QA.