Created more than 3,800 simulation tasks featuring expert-authored prompts, rewrites, and structured error labels designed to surface execution, logic, and visual flaws in AI-generated physics simulations.

Models generating physics simulation code often produce outputs that execute but are logically incorrect, visually inconsistent, or unresponsive. The client required a QA solution that spanned Python and JavaScript, encompassed both 2D and 3D simulations, and could evolve with feedback.
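As a generic sketch of the kind of 2D task in scope (an illustrative example, not one drawn from the dataset), consider a projectile simulation using simple Euler integration — the sort of per-frame update loop that PyGame or P5.js would drive; rendering is omitted here so the physics stands alone:

```python
# Minimal 2D projectile simulation (Euler integration).
# In the dataset's stacks this update would run inside a
# PyGame or P5.js frame callback; rendering is omitted.

GRAVITY = -9.81  # m/s^2
DT = 0.01        # timestep in seconds

def simulate_projectile(vx, vy, dt=DT):
    """Advance a projectile from the origin until it returns to y = 0."""
    x, y = 0.0, 0.0
    trajectory = [(x, y)]
    while True:
        vy += GRAVITY * dt   # gravity acts on vertical velocity
        x += vx * dt
        y += vy * dt
        if y <= 0.0:         # projectile has landed
            break
        trajectory.append((x, y))
    return trajectory

path = simulate_projectile(vx=3.0, vy=4.0)
```

A model-generated version of this task might run without error yet apply gravity to the wrong axis or skip the landing check — exactly the class of "executes but is logically incorrect" failure the dataset is built to surface.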
Dataset
Turing established a structured, repeatable process to generate simulation QA tasks at scale. Each task included a prompt, a model critique, an expert rewrite, and detailed issue labels.
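To illustrate the shape of such a task record (field names here are hypothetical, not the client's actual schema), the components described above could be modeled as:

```python
from dataclasses import dataclass, field

@dataclass
class SimulationTask:
    """Hypothetical record for one simulation QA task."""
    prompt: str        # expert-authored simulation prompt
    model_output: str  # code produced by the model under test
    critique: str      # reviewer's analysis of what went wrong
    rewrite: str       # corrected, executable version
    issue_labels: list = field(default_factory=list)  # structured error labels

task = SimulationTask(
    prompt="Simulate an elastic collision between two balls in 2D.",
    model_output="...",  # model-generated code (elided)
    critique="Balls overlap after impact; momentum is not conserved.",
    rewrite="...",       # expert rewrite (elided)
    issue_labels=["simulation logic", "visual"],
)
```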
Coverage areas
All outputs adhered to strict internal QA standards and annotation guidelines. Each rewrite underwent independent review for prompt completeness, visual fidelity, and execution performance.
The final dataset enabled the client to evaluate and improve model-generated simulations across runtime, performance, visual, and simulation-logic dimensions.
When models learn to simulate physical behavior, not just generate code or images, they begin to reason about the world in ways that resemble human intuition and causality. The implications extend far beyond QA benchmarks:
By enabling models to learn through simulation, we unlock foundational capabilities in planning, design, prediction, and embodied cognition that the next generation of general-purpose AI will require.
Request a labeled task with prompt, rewrite, and detailed failure modes, or access off-the-shelf 2D/3D simulation datasets and agent-ready evaluations for robotics, frontend, and world modeling tasks.
Each sample includes a full prompt, model critique, rewrite, and detailed issue labels.
Is the data original?
Yes. All data is authored by Turing, with no external or reused content included.
Which languages and frameworks are covered?
Python (PyGame, Matplotlib) and JavaScript (P5.js, Matter.js, Three.js, Cannon.js), delivered as HTML and module-based Python script formats.
How are errors labeled?
The dataset labels errors across four main categories: runtime, performance, visual, and simulation logic.
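For illustration (the category names come from the description above; the enum itself is a hypothetical encoding, not the dataset's actual format), the four label categories could be represented as:

```python
from enum import Enum

class ErrorCategory(Enum):
    """Four top-level error categories for simulation QA labels."""
    RUNTIME = "runtime"                    # code crashes or fails to execute
    PERFORMANCE = "performance"            # runs, but too slowly or stutters
    VISUAL = "visual"                      # renders incorrectly or inconsistently
    SIMULATION_LOGIC = "simulation logic"  # executes, but the physics is wrong

# Example: tag a failure where a ball visibly passes through a wall.
labels = [ErrorCategory.SIMULATION_LOGIC, ErrorCategory.VISUAL]
```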
What agreement is required before receiving data?
A standard mutual NDA. Turing returns a countersigned agreement within one business day.
How quickly is a sample delivered?
Within three business days of NDA execution.
Request a labeled sample featuring prompt adherence checks, fidelity rewrites, and execution-grade visual QA.