Turing created a Python dataset built around RLEF-style execution feedback, with independently authored prompts, solutions, and tests to support both evaluation and training.

The client wanted to improve their model’s performance on a dataset similar to the Aider benchmark. They aimed to achieve this through a Reinforcement Learning with Execution Feedback (RLEF)-inspired workflow, in which each task is authored together with an ideal code solution and unbiased unit tests that verify correctness.
Turing deployed a team of experienced Python engineers and evaluators to design, validate, and review execution-grounded programming tasks. The workflow emphasized strict separation of responsibilities, reproducibility through execution, and reviewer arbitration to ensure that every task was internally consistent and evaluation-ready.
1. Dataset structure and metadata
Each task followed a consistent structure and included rich metadata defining scope and intent. Metadata captured the task’s domain, taxonomy classification, target use case, and model context. The taxonomy was derived from the Aider Polyglot benchmark structure and extended to support broader Python programming patterns beyond benchmark-specific formats.
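To make this concrete, a single task record might be organized along the lines of the sketch below. The field names (task_id, domain, taxonomy, use_case, model_context) are illustrative stand-ins, not the project’s actual schema.

```python
# Hypothetical task record sketch -- field names are illustrative, not the actual schema.
task_record = {
    "task_id": "py-algorithms-0001",
    "metadata": {
        "domain": "algorithms",                      # task's subject area
        "taxonomy": "data-structures/intervals",     # taxonomy path derived from Aider Polyglot
        "use_case": "execution-grounded evaluation", # intended downstream use
        "model_context": "single-file Python, stdlib only",
    },
    "prompt": "...",        # authored by contributor A
    "requirements": "...",  # authored by contributor A
    "solution": "...",      # authored by contributor A
    "tests": "...",         # authored independently by contributor B
}
```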
2. Prompt and requirements authoring
For each task, a contributor authored a detailed prompt and a dedicated “Requirements” section. Together, these defined the problem to be solved, input and output formats, function signatures, constraints, and edge cases.
The prompt and requirements were written to be complete and unambiguous, with enough detail that another contributor could write correct unit tests without referencing the solution code.
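As an illustration, a prompt and requirements pair for a small, hypothetical interval-merging task (invented here for demonstration, not drawn from the dataset) might read as follows:

```python
# Illustrative prompt/requirements pair for a hypothetical task (not from the dataset).
PROMPT = (
    "Implement merge_intervals(intervals), which merges overlapping closed integer "
    "intervals and returns the merged list sorted by start value."
)

REQUIREMENTS = """
- Signature: merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]
- Input intervals may be unsorted; intervals that overlap or touch (e.g. (1, 3) and (3, 5))
  merge into a single interval ((1, 5)).
- An empty input list returns an empty list.
- The input list must not be mutated.
- Raise ValueError if any interval has start > end.
"""
```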
3. Reference solution development
The reference Python solution was authored by the same contributor who wrote the prompt and requirements. The solution adhered strictly to the documented expectations and introduced no behavior beyond what the prompt and requirements specified. This ensured that independently written tests would pass based solely on alignment with the stated requirements.
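Continuing the hypothetical example above, a reference solution written strictly to that spec might look like this sketch:

```python
# Sketch of a reference solution for the hypothetical spec above; it implements only
# the documented behavior, nothing beyond the prompt and requirements.
def merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]:
    for start, end in intervals:
        if start > end:
            raise ValueError(f"invalid interval: ({start}, {end})")

    merged: list[tuple[int, int]] = []
    for start, end in sorted(intervals):       # sorted() copies, so the input is never mutated
        if merged and start <= merged[-1][1]:  # overlaps or touches the previous interval
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```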
4. Independent unit test authoring
Unit tests were written by a second contributor who had access only to the prompt and requirements, not the reference solution. This ensured that tests validated the stated problem rather than a specific implementation and that no implicit assumptions were encoded in the test logic.
The tests were designed to run unchanged against model-generated Python solutions as well as the reference solution.
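For the same hypothetical task, independently authored pytest tests might look like the sketch below. Every assertion traces back to a stated requirement rather than to the reference implementation, and the `solution` module name is an assumption about how tasks are packaged.

```python
# Illustrative tests authored from the prompt/requirements alone (hypothetical task).
import pytest

from solution import merge_intervals  # module name is an assumption about task packaging


def test_merges_overlapping_and_touching_intervals():
    assert merge_intervals([(1, 3), (2, 6), (8, 10)]) == [(1, 6), (8, 10)]
    assert merge_intervals([(1, 3), (3, 5)]) == [(1, 5)]


def test_empty_input_returns_empty_list():
    assert merge_intervals([]) == []


def test_input_is_not_mutated():
    data = [(5, 7), (1, 3)]
    merge_intervals(data)
    assert data == [(5, 7), (1, 3)]


def test_invalid_interval_raises_value_error():
    with pytest.raises(ValueError):
        merge_intervals([(4, 2)])
```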
5. Execution-based validation
All reference solutions and unit tests were executed in sandboxed environments using containerized runtimes. Execution failures were investigated and resolved before tasks were accepted, ensuring consistent behavior across environments and eliminating ambiguity around correctness.
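A minimal execution harness in this spirit is sketched below. The container image name, file layout (solution.py and test_solution.py in one task directory), and timeout are assumptions for illustration, not the production pipeline configuration.

```python
# Minimal sketch of containerized test execution. The image name, mount layout,
# and timeout are illustrative assumptions, not the actual pipeline settings.
import subprocess
from pathlib import Path


def run_task_tests(task_dir: Path, image: str = "python-tasks:3.11") -> bool:
    """Run a task's pytest suite inside a throwaway container; True means all tests passed."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                     # no network access inside the sandbox
        "-v", f"{task_dir.resolve()}:/task:ro",  # mount the task directory read-only
        "-w", "/task",
        image,                                   # assumed to have Python and pytest preinstalled
        "python", "-m", "pytest", "-q", "-p", "no:cacheprovider", "test_solution.py",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    return result.returncode == 0
```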
6. Model stress testing and failure annotation
To confirm that tasks meaningfully exercised reasoning and instruction-following constraints, evaluators ran them against state-of-the-art language models using internal tooling. Generated outputs were reviewed, and observed failure patterns were tagged using structured issue categories. This step validated that tasks were non-trivial and grounded in realistic coding challenges.
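In outline, the stress-testing step pairs a model call with the same execution harness and records structured tags for any failures. The generate_solution and run_tests callables below stand in for internal tooling, and the tag names are illustrative, not the actual issue taxonomy.

```python
# Sketch of stress-testing a task against a model and tagging observed failures.
# The injected callables and the tag list are illustrative placeholders.
from dataclasses import dataclass, field


@dataclass
class StressResult:
    task_id: str
    passed: bool
    issue_tags: list[str] = field(default_factory=list)


ISSUE_TAGS = ["constraint_violation", "edge_case_miss", "wrong_signature", "runtime_error"]


def stress_test(task_id: str, prompt: str, requirements: str,
                generate_solution, run_tests) -> StressResult:
    """Generate a candidate solution, execute the task's tests against it, and tag failures."""
    candidate = generate_solution(f"{prompt}\n\n{requirements}")  # model call via internal tooling
    passed, failure_summary = run_tests(task_id, candidate)       # reuse the execution harness
    tags = [] if passed else classify_failure(failure_summary)
    return StressResult(task_id=task_id, passed=passed, issue_tags=tags)


def classify_failure(failure_summary: str) -> list[str]:
    # Placeholder heuristic: in the actual workflow, evaluators assign tags by review.
    summary = failure_summary.lower()
    return [tag for tag in ISSUE_TAGS if tag.replace("_", " ") in summary] or ["unclassified"]
```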
7. Quality assurance and arbitration
Reviewers conducted structured QA passes across tasks. When discrepancies arose, reviewers acted as arbiters to determine whether issues originated from the prompt and requirements, the reference solution, or the unit tests. This arbitration process ensured consistency and prevented silent drift across the dataset.
The project delivered a clean, execution-grounded Python dataset suitable for both evaluation and training workflows. With strict separation between prompts, solutions, and tests, the dataset provides reliable signals for identifying reasoning gaps, instruction-following failures, and correctness issues.
The resulting tasks can be used to evaluate code generation systems under realistic constraints, support execution-based training methods, and stress-test models beyond surface-level pass rates. This foundation enables teams to improve reliability in LLM-based software development agents using concrete, reproducible feedback.
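As one simplified illustration of how this feedback can drive training, an RLEF-style reward can be derived directly from per-test execution outcomes. The shaping below (fraction of tests passed) is a sketch, not the specific reward used in any particular pipeline.

```python
# Hedged sketch: turning per-test execution outcomes into a scalar reward, as an
# RLEF-style training loop might. The pass-fraction shaping is illustrative only.
def execution_reward(test_outcomes: dict[str, bool]) -> float:
    """Map per-test pass/fail results to a reward in [0.0, 1.0]."""
    if not test_outcomes:
        return 0.0
    return sum(test_outcomes.values()) / len(test_outcomes)


# Example: 3 of 4 independently authored tests pass for a model-generated solution.
reward = execution_reward({
    "test_merges_overlapping_and_touching_intervals": True,
    "test_empty_input_returns_empty_list": True,
    "test_input_is_not_mutated": True,
    "test_invalid_interval_raises_value_error": False,
})
assert reward == 0.75
```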
Request a curated sample including prompts, requirements, independently authored tests, code solution, and execution feedback.
How does this dataset differ from other coding datasets?
This dataset enforces strict separation between prompts, reference solutions, and unit tests, with independent test authorship and execution-based validation. Tests are written from the prompt and requirements rather than the solution, then executed in sandboxed environments.
Can the tasks be executed and reproduced?
Yes. All tasks were validated in sandboxed, containerized environments to ensure reproducibility. Reference solutions were required to pass all associated test cases before inclusion, and any failures were reviewed and resolved through arbitration. The dataset is packaged with a README and supporting files to enable smooth execution and integration into training or evaluation workflows.
Can the dataset be used for training as well as evaluation?
Yes. The execution-grounded structure supports both evaluation and training workflows that rely on concrete execution feedback.
What agreement is required to receive a sample?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How quickly is the sample delivered?
Within three business days after NDA execution.