Curating 200+ SWE-Bench Java Tasks to Evaluate Model Breakpoints and Patch Generalization

Delivered 200+ SWE-bench Java tasks from 20+ GitHub repositories, each including a trainer-authored issue prompt and a validated patch. The dataset combined model-solvable and model-breaking examples, enabling evaluation of model performance across varied difficulty levels while ensuring prompts remained neutral, testable, and solution-agnostic.

  • 200+ SWE-bench Java tasks, curated from open-source pull requests
  • 30% model-solvable tasks, intentionally sampled to balance difficulty
  • 70% model-breaking tasks to expose model failure patterns across repo contexts

Method: Dataset Generation
Domain: Coding
Dataset scale: 200+ tasks
Capability: Data Packs

The Challenge

Existing code benchmarks rarely test model solvability across realistic patch distributions. The client sought to stress-test their model's ability to resolve real-world bugs, rather than hand-picked or synthetic examples.

Key challenges included:

  • Finding testable, standalone issues in mature Java repositories
  • Ensuring issue descriptions were clear but not solution-revealing
  • Balancing solvable and hard samples while staying within a real PR distribution
  • Controlling for data leakage and trivial patches

The Approach

Repository and task identification

Turing identified 20+ open-source GitHub repositories that met strict criteria:

  • ≥95% code written in Java
  • Maven-based builds with standardized test output paths
  • ≥50 PRs modifying both test and source code
  • Pass/fail test behavior aligned with patch application
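
The repository criteria above can be screened programmatically before any manual review. The sketch below is illustrative rather than the actual pipeline: the GitHub REST endpoints for language breakdown and file contents are real, but the helper name, the 95% threshold parameter, and the top-level pom.xml heuristic are assumptions.

```python
import requests

API = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}  # add an auth token for realistic rate limits

def check_repo_eligibility(owner: str, repo: str, min_java_share: float = 0.95) -> bool:
    """Screen a candidate repository: >=95% Java and a Maven build at the root."""
    # /languages returns bytes of code per language, e.g. {"Java": 123456, "Shell": 789}
    langs = requests.get(f"{API}/repos/{owner}/{repo}/languages", headers=HEADERS).json()
    total = sum(langs.values()) or 1
    if langs.get("Java", 0) / total < min_java_share:
        return False

    # A top-level pom.xml is a cheap proxy for a Maven build with standardized
    # surefire test output paths (target/surefire-reports).
    pom = requests.get(f"{API}/repos/{owner}/{repo}/contents/pom.xml", headers=HEADERS)
    return pom.status_code == 200
```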

Each PR had to meet the following validation criteria:

  • Include meaningful logic changes (excluding trivial renames or version bumps)
  • Contain reproducible test failures that passed after patch application
  • Avoid changes to high-level interfaces or class structures that risk compilation errors
  • Provide a self-contained issue, or a problem derivable from the PR content
  • Have been created after January 2024
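
Several of these PR checks reduce to a filter over the changed file paths and the creation date. The sketch below is a hypothetical illustration that assumes the file list has already been fetched (for example via the GitHub pulls API) and that repositories follow the standard Maven layout (src/main/java, src/test/java).

```python
from datetime import datetime, timezone

CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)  # only PRs created after January 2024

def is_candidate_pr(changed_files: list[str], created_at: datetime) -> bool:
    """Keep PRs that touch both source and test code and carry a non-trivial change."""
    if created_at < CUTOFF:
        return False

    touches_source = any("src/main/java" in f and f.endswith(".java") for f in changed_files)
    touches_tests = any("src/test/java" in f and f.endswith(".java") for f in changed_files)

    # Version bumps, doc edits, and build-file-only changes carry no meaningful logic change.
    trivial_only = all(f.endswith(("pom.xml", ".md", ".properties")) for f in changed_files)

    return touches_source and touches_tests and not trivial_only
```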

Dockerized execution and test validation

For each selected repository, the team created a dedicated Docker image capable of running the full test suite in a clean, isolated environment. This ensured consistent behavior across machines and eliminated environment-specific variance.

Within this environment, the team:

  • Executed tests against the original commit to confirm reproducible failures
  • Executed tests against the patched commit to confirm resolution
  • Recorded both fail-to-pass and pass-to-pass outcomes
  • Verified that compilation succeeded before and after patch application

Only pull requests exhibiting stable, reproducible test behavior were retained.
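
In practice this validation can be scripted around each repository's Docker image and Maven's surefire reports. The outline below is a simplified sketch under assumed conventions (repository mounted at /repo, JUnit XML written to target/surefire-reports, which is Maven's default location); it is not the exact harness used for delivery.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

def run_tests(image: str, workdir: Path) -> dict[str, bool]:
    """Run 'mvn test' inside the repo's Docker image and map each test case to pass/fail."""
    # A non-zero exit code is expected on the pre-patch commit, so it is not treated as an error.
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{workdir.resolve()}:/repo", "-w", "/repo",
         image, "mvn", "test"],
        capture_output=True,
    )
    results: dict[str, bool] = {}
    # Maven's surefire plugin writes one JUnit XML report per test class by default.
    for report in workdir.glob("**/target/surefire-reports/TEST-*.xml"):
        for case in ET.parse(report).getroot().iter("testcase"):
            name = f"{case.get('classname')}.{case.get('name')}"
            failed = case.find("failure") is not None or case.find("error") is not None
            results[name] = not failed
    return results

def classify(before: dict[str, bool], after: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Split tests into fail-to-pass (bug reproducers) and pass-to-pass (regression guards)."""
    fail_to_pass = [t for t, ok in after.items() if ok and before.get(t) is False]
    pass_to_pass = [t for t, ok in after.items() if ok and before.get(t) is True]
    return fail_to_pass, pass_to_pass
```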

Issue description curation

When no original issue was linked to the PR, Turing’s trainers:

  • Authored new issue descriptions based on the PR diff and intent
  • Wrote problem-focused, not fix-guiding, descriptions
  • Added context only when necessary for reproducibility
  • Aligned descriptions with the test diff to preserve evaluability without revealing the solution

This format intentionally diverged from prior benchmarks (like SWE-bench), where issue hints can bias solutions. Prompts were designed to state the problem clearly without implying the fix.
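
One lightweight way to check that a drafted prompt stays solution-agnostic is to flag verbatim overlap with identifiers the gold patch introduces. The heuristic below is a hypothetical reviewer aid rather than part of the delivered tooling, and its regex-based identifier extraction will surface false positives that still need human judgment.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]{4,}")  # Java-ish identifiers, 5+ chars to limit noise

def leaked_identifiers(issue_text: str, patch_diff: str) -> set[str]:
    """Return identifiers introduced by the patch that also appear verbatim in the issue prompt."""
    added_lines = [
        line[1:] for line in patch_diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    added_ids = {tok for line in added_lines for tok in IDENT.findall(line)}
    prompt_tokens = set(IDENT.findall(issue_text))
    return added_ids & prompt_tokens

# A non-empty result flags the prompt for trainer review; it does not automatically reject it.
```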

Task balance

To maintain evaluation granularity:

  • ≈30% of tasks were designated as model-solvable
  • ≈70% exposed failure points, including complex logic changes, ambiguous context, or multi-file patches
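
Once each task has been labeled solvable or breaking against a reference model, the split can be enforced with simple stratified sampling. A minimal sketch, assuming a boolean solvable flag on each task record:

```python
import random

def balance_tasks(tasks: list[dict], target_solvable: float = 0.30, seed: int = 0) -> list[dict]:
    """Keep all model-breaking tasks and sample solvable ones to hit roughly a 30/70 split."""
    rng = random.Random(seed)
    solvable = [t for t in tasks if t["solvable"]]
    breaking = [t for t in tasks if not t["solvable"]]
    # n_s / (n_s + n_b) = target  =>  n_s = n_b * target / (1 - target)
    n_solvable = min(len(solvable), round(len(breaking) * target_solvable / (1 - target_solvable)))
    return rng.sample(solvable, n_solvable) + breaking
```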

Key Results

  • Delivered a dataset of 200+ real-bug Java tasks, balanced for solvability and realism
  • Authored neutral, reproducible prompts across PRs with and without original issues
  • Ran automated patch and test validation pipelines enforcing fail-to-pass and pass-to-pass behavior
  • Enabled controlled evaluation of model successes and failure modes
  • Delivered Dockerized, test-ready samples integrated with the client’s patch evaluation framework

The Outcome

This dataset enables model builders to evaluate performance across real-world Java software tasks:

  • Tests model patching ability on authentic bugs
  • Distinguishes between solvable tasks and hard breakpoints
  • Reveals failure patterns in LLM-generated code under controlled conditions
  • Supports deeper analysis of generalization, instruction following, and patch realism

Trainer-authored prompts, paired with reproducible test cases, make the dataset a valuable resource for fine-tuning, evaluation, and failure-mode diagnosis.

Want to evaluate your model on patch-level bug fixing?

Request a sample task featuring a curated issue prompt, validated patch, pass/fail test states, and metadata on difficulty, solvability, and repository source.

Request Sample

FAQ

What’s in the sample?

Each sample includes a SWE-bench-style Java task with an issue prompt, validated patch, and complete test context.

Are tasks balanced by difficulty?

Yes. Approximately 30% are model-solvable and 70% are model-breaking examples to ensure balanced evaluation across difficulty levels.

What’s special about the prompts?

They are trainer-written to describe the problem clearly without guiding the fix, while remaining grounded, reproducible, and testable.

How is correctness validated?

Each task was verified through fail-to-pass test harnesses confirming correct patch behavior.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after the NDA is executed.

Is your model ready for unstructured bug reports and ambiguous diffs?

Request a sample of hand-curated coding tasks, including both model-solvable and failure-triggering examples, to diagnose model strengths and blind spots.

Request Sample

AGI Advance Newsletter

Weekly updates on frontier benchmarks, evals, fine-tuning, and agentic workflows, read by top labs and AI practitioners.

Subscribe Now