Coding datasets for post-training evaluation and agent reasoning

Reasoning-first datasets and benchmarks for function-calling, secure coding, and real-world software development.

Coding datasets

Structured prompts and real-world tasks to evaluate and improve model reasoning across software engineering workflows.

Structured Reasoning Datasets

Competitive programming tasks and rubric-aligned prompts that evaluate logic depth, planning, and correctness in code.
Request Reasoning Datasets
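
For illustration, here is roughly what a rubric-aligned record could look like; the field names, rubric dimensions, and scales below are assumptions, not the delivered schema.

```python
# Hypothetical sketch of a rubric-aligned reasoning record; field names and
# rubric dimensions are illustrative, not the actual dataset schema.
from dataclasses import dataclass, field

@dataclass
class RubricScore:
    logic_depth: int      # 1-5: soundness and depth of the reasoning chain
    planning: int         # 1-5: quality of the decomposition before coding
    correctness: int      # 1-5: whether the final program passes the reference tests

@dataclass
class ReasoningTask:
    task_id: str
    prompt: str                        # competitive-programming style problem statement
    reference_solution: str            # canonical accepted solution
    unit_tests: list[str] = field(default_factory=list)
    rubric: RubricScore | None = None  # human-assigned scores for a model response

# Example record (contents are placeholders)
example = ReasoningTask(
    task_id="cp-0001",
    prompt="Given an array of integers, return the length of the longest increasing subsequence.",
    reference_solution="def lis(a): ...",
    unit_tests=["assert lis([10, 9, 2, 5, 3, 7, 101, 18]) == 4"],
    rubric=RubricScore(logic_depth=4, planning=5, correctness=5),
)
print(example.task_id, example.rubric)
```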

Chain-of-Thought Coding Traces

Stepwise code generation prompts with human-verified CoT traces, useful for reward modeling and SFT.
Request CoT Coding Traces
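
As a hedged example, a human-verified CoT trace might be folded into a chat-format SFT example along these lines; the JSON layout and helper function are illustrative assumptions, not the delivered format.

```python
# Illustrative only: a minimal converter from a CoT coding trace into a
# chat-style SFT example.
import json

trace = {
    "prompt": "Write a function that checks whether a string is a palindrome.",
    "cot_steps": [
        "Normalize the string by lowercasing and stripping non-alphanumerics.",
        "Compare the normalized string to its reverse.",
    ],
    "final_code": "def is_palindrome(s):\n    t = ''.join(c.lower() for c in s if c.isalnum())\n    return t == t[::-1]",
}

def to_sft_example(trace: dict) -> dict:
    """Fold verified CoT steps and the final program into one assistant turn."""
    reasoning = "\n".join(f"Step {i+1}: {s}" for i, s in enumerate(trace["cot_steps"]))
    return {
        "messages": [
            {"role": "user", "content": trace["prompt"]},
            {"role": "assistant", "content": f"{reasoning}\n\n```python\n{trace['final_code']}\n```"},
        ]
    }

print(json.dumps(to_sft_example(trace), indent=2))
```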

Multimodal Code Tasks

Applied coding problems with multimodal inputs and real-world constraints, suited to agent-based workflows.
Request Industry Datasets
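
A possible shape for one of these records is sketched below; the field names (image_path, constraints, acceptance_tests) are hypothetical and only show how an image input pairs with a spec and machine-checkable constraints.

```python
# Hypothetical record layout for a multimodal coding task: an image input
# (e.g. a UI mockup or architecture diagram) paired with a textual spec and
# checkable constraints. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class MultimodalCodeTask:
    task_id: str
    image_path: str                 # screenshot, mockup, or diagram the agent must read
    instruction: str                # what to build or fix
    constraints: list[str] = field(default_factory=list)   # real-world limits (deps, runtime, style)
    acceptance_tests: list[str] = field(default_factory=list)

task = MultimodalCodeTask(
    task_id="mm-0042",
    image_path="assets/login_form_mockup.png",
    instruction="Implement the login form shown in the mockup as a React component.",
    constraints=["No external UI libraries", "Must pass accessibility checks"],
    acceptance_tests=["renders email and password fields", "submit is disabled until both are filled"],
)
print(task)
```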

Benchmarks and evaluation

Containerized benchmarks and scoring systems that test model performance in realistic development environments.

SWE-bench++

Evaluate coding agents on real GitHub tasks using containerized environments and verified trajectories.
Explore Benchmark
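
The containerized evaluation pattern can be approximated as below; this is not the actual SWE-bench++ harness, and the Docker image name, patch path, and test command are placeholders.

```python
# Minimal sketch of containerized evaluation, assuming one Docker image per
# GitHub task and a test command that exits 0 on success.
import subprocess

def evaluate_patch(image: str, patch_file: str, test_cmd: str) -> bool:
    """Apply an agent-generated patch inside the task container and run its tests."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{patch_file}:/tmp/agent.patch:ro",
            image,
            "bash", "-lc", f"git apply /tmp/agent.patch && {test_cmd}",
        ],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0  # pass/fail signal for the trajectory

if __name__ == "__main__":
    passed = evaluate_patch(
        image="example.registry/swe-task:issue-1234",   # placeholder image
        patch_file="/tmp/model_output.patch",
        test_cmd="pytest -q tests/test_issue_1234.py",
    )
    print("resolved" if passed else "unresolved")
```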

VLM-bench

Benchmark model reasoning on over 700 vision–language tasks grounded in STEM, logic, and world knowledge, built with search-resistant problem formulation.
Download Report

CodeBench

900+ multilingual coding tasks with deterministic pass/fail scoring. Built for Aider compatibility, regression testing, and QA.
Request Sample Data
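
Deterministic pass/fail scoring reduces to running each task's tests and reading the exit code, as in this simplified sketch; the directory layout and run_tests.sh entry point are assumptions rather than CodeBench's actual format.

```python
# Simplified sketch of deterministic pass/fail scoring: each task ships a
# test command whose exit code decides the result, so reruns always agree.
import subprocess
from pathlib import Path

def score_task(task_dir: Path, timeout_s: int = 60) -> str:
    """Return 'pass' or 'fail' based solely on the task's test exit code."""
    try:
        result = subprocess.run(
            ["bash", "run_tests.sh"],          # placeholder test entry point
            cwd=task_dir,
            capture_output=True,
            timeout=timeout_s,
        )
        return "pass" if result.returncode == 0 else "fail"
    except subprocess.TimeoutExpired:
        return "fail"                          # timeouts count as failures

results = {p.name: score_task(p) for p in sorted(Path("codebench_tasks").glob("task_*"))}
pass_rate = sum(v == "pass" for v in results.values()) / max(len(results), 1)
print(f"{pass_rate:.1%} of {len(results)} tasks passed")
```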

RL environments for coding workflows

Evaluate coding agents on real-world programming tasks, generate fine-tuning trajectories, and train reward models in reproducible, high-fidelity environments.

UI-Based RL Environments for Code Agents

Evaluate code-generation and debugging agents inside interactive IDE replicas that simulate real developer environments. These environments track edits, capture compile results, and run tests to measure functional accuracy.
Request UI Agent Environments
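
A Gym-style edit, compile, and test loop might look like the toy sketch below; the observation fields, reward shaping, and FakeIDEEnv class are illustrative stand-ins, not the environments' actual API.

```python
# Toy stand-in for an IDE-replica environment that mimics tracking edits,
# compiling, and running tests; names and reward values are assumptions.
from dataclasses import dataclass

@dataclass
class IDEObservation:
    open_file: str
    file_contents: str
    compiler_output: str
    failing_tests: list[str]

class FakeIDEEnv:
    """Toy environment: one buggy file, one failing test."""

    def reset(self) -> IDEObservation:
        return IDEObservation("calc.py", "def add(a, b): return a - b", "", ["test_add"])

    def step(self, edit: str) -> tuple[IDEObservation, float, bool]:
        fixed = "return a + b" in edit
        obs = IDEObservation("calc.py", edit, "", [] if fixed else ["test_add"])
        reward = 1.0 if fixed else 0.0        # functional accuracy as the reward signal
        return obs, reward, fixed             # done when all tests pass

env = FakeIDEEnv()
obs = env.reset()
obs, reward, done = env.step("def add(a, b): return a + b")
print(reward, done)
```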

MCP Environments for Function-Calling Agents

Train agents to call APIs, manage toolchains, and execute scripts within sandboxed development environments. Includes tool schemas, reward verifiers, and seed databases.
Request Function-Calling Environments
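
For a rough sense of the pieces, here is a minimal tool schema and reward verifier; the schema fields and the verify_call helper are assumptions that only illustrate how an agent's call can be checked against an expected target.

```python
# Illustrative only: a minimal tool schema plus a reward verifier that checks
# whether the agent called the expected tool with the required arguments.
import json

tool_schema = {
    "name": "create_ticket",
    "description": "File a bug ticket in the seeded issue tracker.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["title", "severity"],
    },
}

def verify_call(call: dict, expected: dict) -> float:
    """Return 1.0 if the agent called the right tool with the expected arguments."""
    if call.get("name") != expected["name"]:
        return 0.0
    args = call.get("arguments", {})
    ok = all(args.get(k) == v for k, v in expected["arguments"].items())
    return 1.0 if ok else 0.0

agent_call = {"name": "create_ticket", "arguments": {"title": "Login 500s", "severity": "high"}}
print(verify_call(agent_call, {"name": "create_ticket", "arguments": {"severity": "high"}}))
print(json.dumps(tool_schema, indent=2))
```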

End-to-End Evaluation and Training Loops

Each RL environment includes prompts, verifiers, analytics harnesses, and trajectory outputs, enabling evaluation diagnostics, reward shaping, and supervised fine-tuning at scale.
Request RL Environments
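
One way such a loop can fit together is sketched below with stand-in policy and verifier functions; the trajectory fields and JSONL output are illustrative assumptions, not the shipped schema.

```python
# Sketch of an end-to-end loop: run episodes, verify outcomes, log diagnostics,
# and keep verified-successful rollouts as fine-tuning data.
import json, random

def policy(prompt: str) -> str:                 # placeholder for the model under training
    return random.choice(["return a + b", "return a - b"])

def verifier(action: str) -> float:             # placeholder reward verifier
    return 1.0 if "a + b" in action else 0.0

def run_episode(prompt: str) -> dict:
    action = policy(prompt)
    return {"prompt": prompt, "action": action, "reward": verifier(action)}

trajectories = [run_episode("Fix add() so the unit tests pass.") for _ in range(8)]

# Diagnostics for evaluation, plus a filtered set reusable as SFT data.
mean_reward = sum(t["reward"] for t in trajectories) / len(trajectories)
with open("trajectories.jsonl", "w") as f:
    for t in trajectories:
        if t["reward"] > 0:                     # keep only verified-successful rollouts
            f.write(json.dumps(t) + "\n")
print(f"mean reward: {mean_reward:.2f}")
```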

Research and case studies

Ready to benchmark or debug your coding model?

Request sample data, access trajectory logs, or run a scoped SWE-bench++ evaluation.

Talk to a Researcher