Research-grade datasets and evaluation resources across finance, law, medicine, and economics.

Curated QA and reasoning tasks across specialized fields, built for depth, accuracy, and domain fidelity.
Research-grade benchmarks and diagnostics built to surface failure modes and measure verified performance in domain-specific systems.
Evaluate reasoning agents on real-world finance, economics, legal, and medical tasks; generate fine-tuning trajectories; and train reward models in reproducible, high-fidelity environments.
From tax code to triage, our data helps you train and evaluate models for high-stakes reasoning.