Get Curated Research Datasets

Access benchmark-quality RL, multimodal, vision, and STEM datasets to accelerate your post-training research. Choose from pre-defined packs or create custom datasets tailored to your experiments.

Request Data Packs

Dataset Catalog

Choose from our curated data collections—each optimized for post-training research and ready to request:

Audio Datasets

Train models on diverse audio inputs: speech, noisy recordings, and advanced ASR scenarios.

Vision Datasets

Enhance detection, segmentation, and multimodal reasoning with curated image and video corpora.

STEM Reasoning Datasets

Challenge LLMs with structured math, physics, and chemistry problem sets and step-by-step solutions.

Coding Datasets

Benchmark code generation and reasoning with real-world coding challenges, CoT examples, and function-calling data.

Gaming & Simulation Datasets

Evaluate reinforcement-learning and agentic performance using industry-standard gaming environments and scenarios.

Medical & Document Datasets

Validate model performance on clinical records, medical imaging, and document Q&A tasks with domain-specific corpora.

Custom Data Packs

Work with our team to assemble bespoke datasets—across modalities, languages, and domains—to suit your unique research needs.

Why These Datasets

Proven Benchmark Relevance

Each collection aligns with real-world tasks, from Chatbot Arena benchmarks to STEM exam challenges, so your models are tested against the standards that matter most.

Expert Curation & Quality

All datasets are reviewed and cleaned by PhD-level researchers, delivering reliable, high-fidelity samples ready for immediate integration.

Transparent Specifications

Detailed metadata on volume, modalities, languages, and licensing accompanies every sample—giving you the context you need for reproducible research.

Frequently Asked Questions

How long does it take to receive sample data?

Samples are delivered by email, typically within 48 hours of your request, so you can begin integration and evaluation without delay.

Can I request multiple datasets at once?

Yes, you can select any combination of pre-defined packs or custom datasets in a single request form, and we’ll bundle them in one delivery.

What formats and modalities are supported?

We provide samples in machine-learning-ready formats (e.g., image folders for vision data, CSV/JSON for tabular and text data, WAV for audio). Samples are available for every category listed in the catalog: audio, vision, STEM reasoning, coding, gaming and simulation, and medical and document data.
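
As a rough illustration of what a first look at a delivered sample might involve, the sketch below walks a folder and reports basic facts about the CSV, JSON, and WAV files it finds. The function name `summarize_sample` and the folder layout are assumptions made for this example, not part of any documented delivery format; it uses only the Python standard library.

```python
import csv
import json
import wave
from collections import Counter
from pathlib import Path


def summarize_sample(root):
    """Count files by extension and collect basic facts for known formats.

    Hypothetical helper: the actual sample layout may differ.
    """
    root = Path(root)
    counts = Counter()
    details = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        ext = path.suffix.lower()
        counts[ext] += 1
        rel = path.relative_to(root).as_posix()
        if ext == ".csv":
            with path.open(newline="", encoding="utf-8") as f:
                rows = list(csv.reader(f))
            # Assumes the first row is a header.
            details[rel] = {"rows": max(len(rows) - 1, 0)}
        elif ext == ".json":
            with path.open(encoding="utf-8") as f:
                data = json.load(f)
            details[rel] = {"top_level_type": type(data).__name__}
        elif ext == ".wav":
            with wave.open(str(path), "rb") as w:
                details[rel] = {
                    "sample_rate_hz": w.getframerate(),
                    "channels": w.getnchannels(),
                    "duration_s": w.getnframes() / w.getframerate(),
                }
    return dict(counts), details
```

Calling `summarize_sample("path/to/sample")` returns a per-extension file count and a per-file summary, which can be compared against the metadata that accompanies the sample.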

How do you license sample data?

Sample datasets are provided under a research-only license. For full-pack access or commercial use, we’ll follow up to discuss terms and pricing.

Can I get a custom data pack if I don’t see what I need?

Absolutely—use the “Custom Data Packs” option in the catalog to describe your requirements, and our team will work with you to assemble the right dataset.

What happens after I receive samples?

Once you have the curated sample files and their metadata, our team will reach out to discuss full-pack access, volume, pricing, and any custom adjustments.

Ready for Frontier Model Data?

Request your data packs today and accelerate your research.
