Creating 10,000+ RLEF-Ready Python Tasks for Model Evaluation and Training

Created a Python dataset for RLEF-style workflows, with independently authored prompts, solutions, and tests that support evaluation and training grounded in execution feedback.

  • 10,000+ Python programming tasks created with execution-grounded validation
  • Strict acceptance criteria required reference solutions to pass all tests before inclusion
  • Sandboxed execution environments enforced consistent, reproducible behavior

Method: Dataset Generation
Domain: Coding
Dataset scale: 10,000+ tasks
Capability: Data Packs

The Challenge

The client wanted to improve their model’s performance on an Aider-benchmark-like dataset. They aimed to achieve this through a Reinforcement Learning with Execution Feedback (RLEF)-inspired workflow, in which each task is authored together with an ideal code solution and unbiased unit tests that verify correctness.

The Approach

Turing deployed a team of experienced Python engineers and evaluators to design, validate, and review execution-grounded programming tasks. The workflow emphasized strict separation of responsibilities, reproducibility through execution, and reviewer arbitration to ensure that every task was internally consistent and evaluation-ready.

1. Dataset structure and metadata

Each task followed a consistent structure and included rich metadata defining scope and intent. Metadata captured the task’s domain, taxonomy classification, target use case, and model context. The taxonomy was derived from the Aider Polyglot benchmark structure and extended to support broader Python programming patterns beyond benchmark-specific formats.
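
For illustration, a single task record and its metadata might be represented roughly as follows; the field names and taxonomy labels in this sketch are assumptions, not the dataset’s actual schema.

```python
# Illustrative sketch of one task record and its metadata.
# Field names, taxonomy labels, and values are hypothetical, not the real schema.
task_record = {
    "task_id": "py-000123",
    "metadata": {
        "domain": "coding",
        "language": "python",
        "taxonomy": "algorithms/intervals",       # Aider-Polyglot-derived classification (illustrative)
        "target_use_case": "rlef_training",       # or "evaluation"
        "model_context": "standalone_function",   # scope the model is expected to produce
    },
    "prompt": "...",              # authored by contributor A
    "requirements": "...",        # authored by contributor A
    "reference_solution": "...",  # authored by contributor A
    "unit_tests": "...",          # authored independently by contributor B
}
```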

2. Prompt and requirements authoring

For each task, a contributor authored a detailed prompt and a dedicated “Requirements” section. Together, these defined the problem to be solved, input and output formats, function signatures, constraints, and edge cases.

The prompt and requirements were written to be complete and unambiguous, with enough detail that another contributor could write correct unit tests without referencing the solution code.
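
As a purely illustrative example (not a task from the delivered dataset), a prompt and requirements pair for a hypothetical interval-merging task might read:

```python
# Hypothetical prompt and requirements; the merge_intervals task is invented for this sketch.
PROMPT = """
Write a function merge_intervals(intervals) that merges overlapping closed
integer intervals and returns the merged intervals sorted by start value.
"""

REQUIREMENTS = """
- Input: a list of [start, end] pairs with start <= end; the list may be empty.
- Intervals that overlap or touch (e.g. [1, 3] and [3, 5]) must be merged.
- Output: a new list of [start, end] pairs sorted by start; the input must not be mutated.
- Raise ValueError if any interval has start > end.
"""
```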

3. Reference solution development

The reference Python solution was authored by the same contributor who wrote the prompt and requirements. The solution adhered strictly to the documented expectations and introduced no behavior beyond what the prompt and requirements specified. This ensured that independently written tests would pass based solely on alignment with the prompt and requirements.
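
Continuing the illustrative example, a reference solution written strictly against those requirements might look like this sketch:

```python
def merge_intervals(intervals):
    """Merge overlapping or touching closed integer intervals.

    Illustrative reference solution for the hypothetical task above; it
    implements only the behavior stated in the prompt and requirements.
    """
    for start, end in intervals:
        if start > end:
            raise ValueError(f"invalid interval: [{start}, {end}]")

    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:  # overlaps or touches the previous interval
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])        # new list, so the input is never mutated
    return merged
```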

4. Independent unit test authoring

Unit tests were written by a second contributor who had access only to the prompt and requirements, not the reference solution. This ensured that tests validated the stated problem rather than a specific implementation and that no implicit assumptions were encoded in the test logic.

The tests were designed to be executable against any candidate solution, whether the reference implementation or a model-generated one.
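
For the same hypothetical task, tests written from the prompt and requirements alone, without sight of the reference code, might look like this pytest sketch (the solution module name is assumed):

```python
import pytest

from solution import merge_intervals  # assumed module layout for the candidate solution


def test_merges_overlapping_and_touching_intervals():
    assert merge_intervals([[1, 3], [2, 6], [8, 10]]) == [[1, 6], [8, 10]]
    assert merge_intervals([[1, 3], [3, 5]]) == [[1, 5]]  # touching intervals merge


def test_empty_input_returns_empty_list():
    assert merge_intervals([]) == []


def test_input_is_not_mutated():
    data = [[5, 7], [1, 3]]
    merge_intervals(data)
    assert data == [[5, 7], [1, 3]]


def test_invalid_interval_raises_value_error():
    with pytest.raises(ValueError):
        merge_intervals([[4, 2]])
```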

5. Execution-based validation

All reference solutions and unit tests were executed in sandboxed environments using containerized runtimes. Execution failures were investigated and resolved before tasks were accepted, ensuring consistent behavior across environments and eliminating ambiguity around correctness.
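
A simplified sketch of that validation step, assuming Docker is available and a runner image with Python and pytest preinstalled (the image name and file layout below are hypothetical):

```python
import subprocess
from pathlib import Path


def run_task_tests(task_dir: Path, image: str = "rlef-runner:py3.11", timeout: int = 120) -> bool:
    """Execute a task's unit tests inside a throwaway container.

    Assumes task_dir contains solution.py and test_solution.py; returns a
    deterministic pass/fail signal based on the pytest exit code.
    """
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",                     # no network access inside the sandbox
            "-v", f"{task_dir.resolve()}:/task:ro",  # mount the task files read-only
            "-w", "/task",
            image,
            "python", "-m", "pytest", "-q", "-p", "no:cacheprovider", "test_solution.py",
        ],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0
```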

6. Model stress testing and failure annotation

To confirm that tasks meaningfully exercised reasoning and instruction-following constraints, evaluators ran them against state-of-the-art language models using internal tooling. Generated outputs were reviewed, and observed failure patterns were tagged using structured issue categories. This step validated that tasks were non-trivial and grounded in realistic coding challenges.
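
A minimal sketch of how such failure annotations might be recorded; the issue categories shown are illustrative, not the internal tooling’s actual taxonomy:

```python
from dataclasses import dataclass, field
from enum import Enum


class IssueCategory(Enum):
    # Illustrative failure categories; the project's internal taxonomy may differ.
    WRONG_OUTPUT = "wrong_output"
    MISSED_EDGE_CASE = "missed_edge_case"
    CONSTRAINT_VIOLATION = "constraint_violation"      # e.g. mutated input despite the requirements
    INSTRUCTION_NOT_FOLLOWED = "instruction_not_followed"
    RUNTIME_ERROR = "runtime_error"


@dataclass
class FailureAnnotation:
    task_id: str
    model_name: str
    categories: list[IssueCategory] = field(default_factory=list)
    notes: str = ""


# Example annotation for a model output that crashed on empty input.
annotation = FailureAnnotation(
    task_id="py-000123",
    model_name="example-model",
    categories=[IssueCategory.RUNTIME_ERROR, IssueCategory.MISSED_EDGE_CASE],
    notes="IndexError raised when the interval list is empty.",
)
```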

7. Quality assurance and arbitration

Reviewers conducted structured QA passes across tasks. When discrepancies arose, reviewers acted as arbiters to determine whether issues originated from the prompt and requirements, the reference solution, or the unit tests. This arbitration process ensured consistency and prevented silent drift across the dataset.

Key Results

  • Created a dataset of more than 10,000 Python tasks for RLEF-style fine-tuning or evaluation, with strict prompt, solution, and test separation
  • Ensured every task produced a deterministic execution signal
  • Eliminated test leakage by enforcing independent test authorship
  • Produced structured metadata to support analysis and targeted use
  • Established a repeatable workflow for building execution-grounded code tasks

The Outcome

The project delivered a clean, execution-grounded Python dataset suitable for both evaluation and training workflows. With strict separation between prompts, solutions, and tests, the dataset provides reliable signals for identifying reasoning gaps, instruction-following failures, and correctness issues.

The resulting tasks can be used to evaluate code generation systems under realistic constraints, support execution-based training methods, and stress-test models beyond surface-level pass rates. This foundation enables teams to improve reliability in LLM-based software development agents using concrete, reproducible feedback.

Need RLEF-ready tasks for execution-guided training?

Request a curated sample including prompts, requirements, independently authored tests, code solutions, and execution feedback.

Request Sample

FAQ

How is this different from SWE-bench or HumanEval?

This dataset enforces strict separation between prompts, reference solutions, and unit tests, with independent test authorship and execution-based validation. Tests are written from the prompt and requirements rather than the solution, then executed in sandboxed environments.

Is the dataset fully executable?

Yes. All tasks were validated in sandboxed, containerized environments to ensure reproducibility. Reference solutions were required to pass all associated test cases before inclusion, and any failures were reviewed and resolved through arbitration. The dataset is packaged with a README and supporting files to enable smooth execution and integration into training or evaluation workflows.

Can this dataset be used for training as well as evaluation?

Yes. The execution-grounded structure supports both evaluation and training workflows that rely on concrete execution feedback.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Want to surface subtle code-generation failures?

Request sample tasks designed to expose reasoning, constraint handling, and instruction-following gaps through execution-based validation.

Request Sample
