Multimodal datasets for post-training evaluation and agent reasoning

Get curated datasets and structured RL environments to stress-test models on the hardest multimodal tasks, including noisy speech, vision-language QA, and interface reasoning.

Multimodal datasets

Curated datasets across audio, vision, and interface-agent tasks designed to test reasoning and generalization across modalities.

Audio Datasets

Train models on nuanced, multilingual audio data across 60+ locales, covering disfluency, emotion, prosody, overlapping speech, and diarized multi-speaker tasks.
Request Audio Datasets
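
As a rough illustration of what these tasks cover, a single diarized, multi-speaker record could look like the sketch below; the field names and values are hypothetical, not the actual delivery schema.

```python
# A hypothetical diarized, multi-speaker record; field names are
# illustrative, not the actual delivery schema.
sample = {
    "audio_path": "clips/en_us_0001.wav",
    "locale": "en-US",  # one of 60+ locales
    "segments": [
        # Speaker-attributed segments with start/end times in seconds.
        {"speaker": "S1", "start": 0.00, "end": 3.42,
         "text": "Well, um... I, I think so, yeah.",  # disfluencies preserved
         "emotion": "hesitant"},
        {"speaker": "S2", "start": 2.90, "end": 5.10,  # overlaps S1 in time
         "text": "Are you sure?", "emotion": "neutral"},
    ],
}
```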

Vision Datasets

Evaluate visual perception and reasoning through paired image-text tasks and STEM-based multimodal QA.
Request Vision Datasets
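
For a sense of shape, a paired image-text QA item might be structured like this hypothetical record; the schema is illustrative only, not the dataset's actual format.

```python
# A hypothetical paired image-text QA record for STEM visual reasoning;
# the schema is illustrative, not the dataset's actual format.
vqa_item = {
    "image_path": "figures/circuit_042.png",
    "question": "What is the equivalent resistance between nodes A and B?",
    "choices": ["2 ohms", "4 ohms", "6 ohms", "8 ohms"],
    "answer": "2 ohms",  # two 4-ohm resistors in parallel
    "domain": "physics",
}
```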

Interface & Agent Datasets

Train and test interface-aware agents using GUI supervision, process traces, and interaction datasets grounded in real-world workflows.
Request Interface Agent Datasets

Benchmarks and evaluation

Research-grade benchmarks and diagnostics built to surface failure modes and measure verified performance in multimodal systems.

VLM-Bench

Benchmark model reasoning on over 700 vision-language tasks grounded in STEM, logic, and world knowledge.
Download Report

Evaluation Diagnostics

Run structured evaluations to identify model weaknesses and failure modes across multimodal domains.
Run a Diagnostic
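
As one hypothetical sketch of what a structured diagnostic aggregates, the snippet below tallies annotated failure modes per domain; the record fields and error taxonomy are assumptions, not the actual harness.

```python
from collections import Counter

def failure_modes(results):
    """Tally failure modes per domain from a list of evaluation records.

    Each record is assumed (hypothetically) to carry `domain`, `correct`,
    and an annotated `error_type` such as "ocr", "spatial", or "logic".
    """
    modes = Counter()
    for r in results:
        if not r["correct"]:
            modes[(r["domain"], r["error_type"])] += 1
    return modes.most_common()

results = [
    {"domain": "chart-qa", "correct": False, "error_type": "ocr"},
    {"domain": "geometry", "correct": False, "error_type": "spatial"},
    {"domain": "chart-qa", "correct": True, "error_type": None},
]
print(failure_modes(results))
# [(('chart-qa', 'ocr'), 1), (('geometry', 'spatial'), 1)]
```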
STEM VQA Sets

Stress-test models on graduate-level visual reasoning tasks in STEM domains, with problems formulated to resist search-based lookup.
Run a Diagnostic

RL environments for multimodal agents

Evaluate agents on real tasks, generate fine-tuning trajectories, and train reward models in reproducible, high-fidelity environments.

UI-Based RL Environments for Interface Agents

Evaluate computer-use agents inside interactive UI clones of apps like Jira, Salesforce, and Zendesk. These environments simulate real human interaction via mouse/keyboard input and event tracking.
Request UI Agent Environments
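
To make the interaction model concrete, here is a minimal gym-style sketch of an agent loop against a UI clone; the UIEnvironment class, its methods, and the action format are all hypothetical stand-ins, not the actual environment API.

```python
# A minimal, hypothetical sketch of a gym-style loop against a UI clone.
# The class, methods, and action format are assumptions, not the real API.
class UIEnvironment:
    def __init__(self, app: str, task: str):
        self.app, self.task, self._steps = app, task, 0

    def reset(self) -> dict:
        # Observation: a screenshot plus an accessibility tree of the screen.
        return {"screenshot": b"...", "a11y_tree": {"role": "window"}}

    def step(self, action: dict):
        self._steps += 1
        done = self._steps >= 3  # stand-in for a real task-completion verifier
        # `info["events"]` carries the tracked mouse/keyboard event log used
        # to verify that the agent followed the intended workflow.
        return self.reset(), float(done), done, {"events": [action]}

env = UIEnvironment(app="jira-clone", task="triage_open_bugs")
obs, done = env.reset(), False
while not done:
    # An agent policy would choose this from `obs`; hard-coded here.
    action = {"type": "click", "x": 412, "y": 188}
    obs, reward, done, info = env.step(action)
```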

MCP Environments for Function-Calling Agents

Train agents on function calling and tool execution inside sandboxed server environments. Includes tool schemas, reward verifiers, and seed databases.
Request Function-Calling Environments
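
As an illustration of the three pieces named above, the sketch below shows a hypothetical tool schema, seed database, and reward verifier; none of these are the environments' actual formats.

```python
# Hypothetical shapes for a tool schema, seed database, and reward
# verifier; illustrative only, not the environments' actual formats.

# A callable tool described in JSON-Schema style.
CREATE_TICKET = {
    "name": "create_ticket",
    "description": "Open a support ticket in the sandboxed database.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["title"],
    },
}

# The seed state the sandbox starts each episode from.
seed_db = {"tickets": []}

# A reward verifier that inspects the post-episode database state.
def verify(db: dict) -> float:
    return 1.0 if any(t.get("priority") == "high" for t in db["tickets"]) else 0.0

# After an episode in which the agent called create_ticket:
seed_db["tickets"].append({"title": "Outage", "priority": "high"})
print(verify(seed_db))  # 1.0
```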

End-to-End Evaluation and Training Loops

Each RL environment includes prompts, verifiers, analytics harnesses, and trajectory outputs, enabling evaluation diagnostics, reward shaping, and supervised fine-tuning at scale.
Request RL Environments
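
One way such a loop could fit together, as a hypothetical sketch: prompts go in, a verifier scores each rollout, and verified trajectories are written out for supervised fine-tuning. The rollout shape, verifier, and output format below are illustrative.

```python
import json

def run_episode(prompt: str) -> dict:
    # Stand-in rollout; the real harness drives the agent in the environment.
    return {"prompt": prompt, "actions": ["tool_call(...)"],
            "final_state": {"ok": True}}

def verifier(traj: dict) -> float:
    return 1.0 if traj["final_state"]["ok"] else 0.0

prompts = ["Escalate the highest-priority ticket", "Find duplicate invoices"]
with open("trajectories.jsonl", "w") as f:
    for p in prompts:
        traj = run_episode(p)
        traj["reward"] = verifier(traj)  # signal for reward shaping / filtering
        if traj["reward"] > 0:           # keep verified successes for SFT
            f.write(json.dumps(traj) + "\n")
```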

Ready to stress-test your multimodal model?

Work with Turing to generate, evaluate, or scale multimodal data tailored to your use case.

Talk to a Researcher