Delivering 12,000+ SWE-Bench-Style Tasks With Execution-Grounded Validation

Turing built an end-to-end SWE-bench-style data generation pipeline that sources, validates, annotates, and delivers execution-grounded software engineering tasks across 10+ programming languages. Each task pairs a real GitHub issue and its PR with a validated patch and a dockerized test harness, supporting both fail-to-pass and pass-to-pass evaluation signals for model training and benchmarking.

12,000+

SWE-bench-style tasks delivered across 10+ programming languages, including Python, JavaScript/TypeScript, Java, Go, Rust, and more.

6-stage

production pipeline spanning public PR sourcing, hybrid quality annotations, trajectory generation, and 3x validation for reproducibility.

15-criteria

early rejection framework combined with dual-axis scoring and golden patch validation.

Method: Dataset generation
Domain: Coding
Dataset scale: 12,000+ tasks
Capability: Data packs

The Challenge

The client needed SWE-bench-style data that could move beyond the original benchmark's Python-only, single-source design. Existing public SWE-bench datasets posed several limitations:

  • Limited to Python repositories, leaving major language ecosystems uncovered
  • Dependent on a finite pool of mineable open-source PRs, with diminishing returns as repositories were exhausted
  • Vulnerable to ambiguous issue descriptions and to tests that are brittle or that "cheat" by depending on accessor methods the issue never implies
  • Lacking systematic mechanisms to confirm that a generated patch truly resolves an issue, rather than passing a narrow set of tests

The client required a scalable production pipeline capable of:

  • Generating SWE-bench-style tasks across multiple programming languages with consistent quality standards
  • Validating each task through dockerized execution rather than static review
  • Detecting and correcting issue ambiguity, test misalignment, and missing accessor hints before delivery

The Approach

Turing built a production pipeline with clearly separated ownership at each step, combining engineering throughput with multi-layer human review and automated execution validation.

1. Public repository sourcing

The sourcing pipeline identified candidate repositories and PRs that satisfied strict eligibility criteria, including language composition, build system compatibility, test count thresholds, and PR recency. For each candidate, a JSON spec was generated automatically and the resulting task was evaluated against a reference agent.
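As a rough illustration of how the four eligibility criteria named above could be applied, the sketch below filters a candidate and reports which criteria it fails. The `Candidate` fields, the list of build systems, and every threshold are hypothetical placeholders chosen for the example, not the pipeline's actual values.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    repo: str
    primary_language_share: float  # fraction of code in the target language
    build_system: str
    test_count: int
    pr_merged_on: date


# Hypothetical thresholds for illustration only.
SUPPORTED_BUILD_SYSTEMS = {"pip", "npm", "maven", "gradle", "cargo", "go"}
MIN_LANGUAGE_SHARE = 0.8
MIN_TEST_COUNT = 20
MAX_PR_AGE_DAYS = 730


def rejection_reasons(c: Candidate, today: date) -> list[str]:
    """Return the names of the eligibility criteria the candidate fails."""
    reasons = []
    if c.primary_language_share < MIN_LANGUAGE_SHARE:
        reasons.append("language_composition")
    if c.build_system not in SUPPORTED_BUILD_SYSTEMS:
        reasons.append("build_system")
    if c.test_count < MIN_TEST_COUNT:
        reasons.append("test_count")
    if (today - c.pr_merged_on).days > MAX_PR_AGE_DAYS:
        reasons.append("pr_recency")
    return reasons
```

A candidate is eligible when the returned list is empty. Returning reasons rather than a single boolean keeps the rejection auditable.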

2. Hybrid annotation with multi-layer review

The annotation stage combined synthetic pre-fill with multi-layer human review:

  • E1 (Expert 1) and E2 (Expert 2): Synthetic responses generated programmatically across all rating dimensions, including early rejection checks, issue clarity, test-to-issue alignment, and hint requirements
  • Evaluator 1 (5+ YOE expert): Reviewed both synthetic responses, applied independent judgment, and consolidated the final verdict
  • Evaluator 2 (8+ YOE expert): Performed calibration review and stamped delivery quality

Each task was evaluated against a 15-criterion early rejection framework covering test name stability, duplication, base/before/after consistency, fail-to-pass and pass-to-pass alignment, log harness integrity, and PR size limits. Surviving tasks were scored on a 0–3 scale across two axes:

  • Issue clarity: Is the problem statement specific enough that a model can produce a correct solution without guessing?
  • Test-to-issue alignment: Do the tests cover the issue without being too strict (false negatives) or too lenient (false positives)?
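The two-step gate described above (early rejection first, then dual-axis scoring) can be sketched as follows. The source does not state which end of the 0–3 scale is best or where the acceptance cutoff sits, so this example assumes that 0 is the best score and uses a hypothetical cutoff of 1. The check name in the usage line is also illustrative.

```python
from dataclasses import dataclass

VALID_SCORES = range(0, 4)  # the 0-3 scale


@dataclass(frozen=True)
class Annotation:
    failed_rejection_checks: tuple[str, ...]  # names of failed early-rejection criteria
    issue_clarity: int
    test_alignment: int

    def __post_init__(self):
        # Guard against scores outside the 0-3 scale.
        for name in ("issue_clarity", "test_alignment"):
            if getattr(self, name) not in VALID_SCORES:
                raise ValueError(f"{name} must be between 0 and 3")


def verdict(a: Annotation, max_score: int = 1) -> str:
    """Assumes lower is better; max_score is a hypothetical cutoff."""
    if a.failed_rejection_checks:
        return "rejected_early"
    if a.issue_clarity > max_score or a.test_alignment > max_score:
        return "needs_rework"
    return "accepted"


print(verdict(Annotation(("duplicate_test_names",), 0, 0)))
```

Scoring only the tasks that survive early rejection mirrors the order in the text: cheap structural checks run before the more subjective ratings.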

When a test required an accessor method not implied by the issue, annotators authored a structured hint specifying the exact file path and function signature, preventing tests from "cheating" while preserving implementation freedom.
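One possible shape for such a hint is shown below: a record holding only the file path and function signature, appended to the problem statement. The class name, the "Required interface" heading, and the example path and signature are all hypothetical. The source does not specify the actual hint format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessorHint:
    file_path: str   # where the accessor must live
    signature: str   # exact signature the tests call


def append_hints(problem_statement: str, hints: list[AccessorHint]) -> str:
    """Append interface hints to an issue without describing the implementation."""
    if not hints:
        return problem_statement
    lines = [f"- {h.file_path}: {h.signature}" for h in hints]
    return problem_statement.rstrip() + "\n\nRequired interface:\n" + "\n".join(lines)


# Hypothetical example values.
hint = AccessorHint("src/cache/store.py", "def get_entry_count(self) -> int")
print(append_hints("Cache grows without bound.", [hint]))
```

The hint names what must exist and where, and nothing about how it is implemented, which is how it preserves implementation freedom.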

3. Pool insertion and golden PR validation

Validated instances were converted to JSON format, agent runs were triggered for specified models and configurations, and completed tasks were promoted to "golden PR" status. This stage produced the canonical task pool from which trajectories and deliveries were drawn.

4. Trajectory generation

Depending on the client's evaluation needs, tasks were drawn from the pool to generate full execution trajectories, including agent reasoning, tool calls, patch attempts, and test outcomes. Trajectories were structured to support both fine-tuning and evaluation workflows.
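To make the four trajectory components concrete, here is a minimal sketch of a record that captures reasoning, tool calls, the final patch, and test outcomes, serialized as one JSON line per trajectory. The field names and the JSONL layout are assumptions for illustration. The delivered schema is not described in the source.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    reasoning: str     # the agent's stated rationale for this step
    tool: str          # tool that was invoked
    tool_input: str    # arguments passed to the tool
    observation: str   # what the tool returned


@dataclass
class Trajectory:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    final_patch: str = ""
    test_outcomes: dict[str, str] = field(default_factory=dict)


def to_jsonl_line(t: Trajectory) -> str:
    """Serialize one trajectory as a single JSON line (nested steps included)."""
    return json.dumps(asdict(t), sort_keys=True)
```

Keeping each step's reasoning next to its tool call and observation lets the same record serve as a fine-tuning example or as an evaluation trace.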

5. 3x validation

Every task underwent a final, repeated validation pass before delivery:

  • Repository remained publicly accessible and had not been made private
  • Source PR remained merged and was not reopened
  • Pass-to-pass and fail-to-pass test sets remained correct and consistent across repeated runs
  • No required files were empty or corrupted
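The consistency check across repeated runs can be sketched as a comparison of per-test outcomes. The example assumes each run is summarized as a mapping from test name to a status string. Both that representation and the `MISSING` marker are hypothetical.

```python
def stable_outcomes(runs: list[dict[str, str]]) -> tuple[bool, list[str]]:
    """Return (all runs agree, sorted names of tests whose outcome varied)."""
    all_tests = set().union(*runs)  # every test name seen in any run
    unstable = sorted(
        name
        for name in all_tests
        # A test absent from a run counts as a distinct "MISSING" outcome.
        if len({run.get(name, "MISSING") for run in runs}) > 1
    )
    return (not unstable, unstable)


runs = [
    {"test_parse": "PASSED", "test_io": "PASSED"},
    {"test_parse": "PASSED", "test_io": "FAILED"},
    {"test_parse": "PASSED", "test_io": "PASSED"},
]
print(stable_outcomes(runs))  # (False, ['test_io'])
```

A test that flips outcome in any of the three runs is flagged, since it cannot provide a dependable fail-to-pass or pass-to-pass signal.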

Key Results

  • Delivered more than 12,000 SWE-bench-style tasks spanning 10+ programming languages, expanding meaningful coverage well beyond Python-only benchmarks
  • Established a hybrid synthetic-plus-human annotation framework that combined the throughput of automated pre-fill with the rigor of multi-pass review
  • Enforced 15 early rejection criteria and dual-axis scoring across every task, producing a verified subset rather than raw mined data

The Outcome

The client now has a production-grade SWE-bench-style data pipeline that produces execution-grounded, multi-language coding tasks at scale. By combining public PR sourcing, hybrid annotation, and three independent rounds of golden validation, the pipeline produces tasks that are reproducible, unambiguous, and resistant to evaluation shortcuts. 

This foundation supports:

  • Multi-language model evaluation across realistic software engineering scenarios, including bug fixes, feature implementation, performance improvements, and refactors
  • Fine-tuning workflows that rely on stable fail-to-pass execution signals
  • Benchmarking frontier models on languages outside the saturated Python set
  • Continuous expansion of the task pool as new languages, repositories, and PR patterns become available

Need multi-language SWE-bench-style tasks with verified fail-to-pass signals?

Request a curated sample of execution-grounded coding tasks across languages including Python, JavaScript, Java, Rust, and C#, each validated through dockerized golden patch testing.

FAQ

Which programming languages are covered?

The pipeline produces SWE-bench-style tasks across 10+ languages, including Python, JavaScript/TypeScript, Java, Go, Rust, Ruby, and C#.

How is task quality ensured?

Every task passes through a 15-criterion early rejection framework, dual-axis scoring on issue clarity and test-to-issue alignment, multi-layer review, and three independent golden validation runs before delivery.

What is the difference between fail-to-pass and pass-to-pass tests?

Fail-to-pass tests must fail before the patch is applied and pass after, confirming the patch resolves the issue. Pass-to-pass tests must pass both before and after, confirming the patch does not introduce regressions.
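The two definitions can be expressed as a short classification over test results from before and after the patch. This sketch assumes results are given as mappings from test name to `"PASSED"` or `"FAILED"`, and that a test absent before the patch (for example, one added by the PR) counts as failing. Both are assumptions for illustration.

```python
def classify(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Split tests that pass after the patch into fail-to-pass and pass-to-pass."""
    fail_to_pass, pass_to_pass = [], []
    for name, after_status in after.items():
        if after_status != "PASSED":
            continue  # a test that fails after the patch belongs to neither set
        # Assumption: a test missing from the pre-patch run is treated as failing.
        if before.get(name, "FAILED") == "PASSED":
            pass_to_pass.append(name)
        else:
            fail_to_pass.append(name)
    return {"fail_to_pass": sorted(fail_to_pass), "pass_to_pass": sorted(pass_to_pass)}


before = {"test_fix": "FAILED", "test_existing": "PASSED", "test_broken": "PASSED"}
after = {"test_fix": "PASSED", "test_existing": "PASSED", "test_broken": "FAILED", "test_new": "PASSED"}
print(classify(before, after))
# {'fail_to_pass': ['test_fix', 'test_new'], 'pass_to_pass': ['test_existing']}
```

Here `test_broken` passes before the patch and fails after it, so it lands in neither set. It is a regression.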

Can this dataset be used for both training and evaluation?

Yes. The execution-grounded structure, with stable fail-to-pass and pass-to-pass signals, supports fine-tuning workflows, RLEF-style training, and benchmarking across model versions.

What’s the NDA process?

Turing uses a standard mutual NDA and provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Building or benchmarking coding agents beyond Python?

Request SWE-bench-style samples designed to evaluate real-world software engineering performance across diverse software ecosystems, including bug fixes, feature requests, and performance improvements.
