Curating 200+ SWE-Bench Java Tasks to Evaluate Model Breakpoints and Patch Generalization

Delivered 200+ SWE-bench Java tasks from 20+ GitHub repositories, each including a trainer-authored issue prompt and a validated patch. The dataset combined model-solvable and model-breaking examples, enabling evaluation of model performance across varied difficulty levels while ensuring prompts remained neutral, testable, and solution-agnostic.

  • 200+ SWE-bench Java tasks, curated from open-source pull requests
  • 30% model-solvable tasks, intentionally sampled to balance difficulty
  • 70% model-breaking tasks to expose model failure patterns across repo contexts

Method: Dataset Generation
Domain: Coding
Dataset scale: 200+ tasks
Capability: Data Packs

The Challenge

Existing code benchmarks rarely test model solvability across realistic patch distributions. The client sought to stress-test their model's ability to resolve real-world bugs, rather than hand-picked or synthetic examples.

Key challenges included:

  • Finding testable, standalone issues in mature Java repositories
  • Ensuring issue descriptions were clear but not solution-revealing
  • Balancing solvable and hard samples while staying within a real PR distribution
  • Controlling for data leakage and trivial patches

The Approach

Repository and task identification

Turing identified 20+ open-source GitHub repositories that met strict criteria:

  • ≥95% code written in Java
  • Maven-based builds with standardized test output paths
  • ≥50 PRs modifying both test and source code
  • Pass/fail test behavior aligned with patch application
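
The repository criteria above can be screened programmatically before any manual review. The sketch below is illustrative rather than the actual pipeline: the GitHub REST endpoints for language breakdown and file contents are real, but the helper name, the 95% threshold parameter, and the top-level pom.xml heuristic are assumptions.

```python
import requests

API = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}  # add an auth token for realistic rate limits

def check_repo_eligibility(owner: str, repo: str, min_java_share: float = 0.95) -> bool:
    """Screen a candidate repository: >=95% Java and a Maven build at the root."""
    # /languages returns bytes of code per language, e.g. {"Java": 123456, "Shell": 789}
    langs = requests.get(f"{API}/repos/{owner}/{repo}/languages", headers=HEADERS).json()
    total = sum(langs.values()) or 1
    if langs.get("Java", 0) / total < min_java_share:
        return False

    # A top-level pom.xml is a cheap proxy for a Maven build with standardized
    # surefire test output paths (target/surefire-reports).
    pom = requests.get(f"{API}/repos/{owner}/{repo}/contents/pom.xml", headers=HEADERS)
    return pom.status_code == 200
```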

Each PR had to meet the following validation criteria:

  • Include meaningful logic changes (excluding trivial renames or version bumps)
  • Contain reproducible test failures that passed after patch application
  • Avoid changes to high-level interfaces or class structures that risk compilation errors
  • Provide a self-contained issue, or a problem derivable from the PR content
  • Have been created after January 2024
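
Several of these PR checks reduce to a filter over the changed file paths and the creation date. The sketch below is a hypothetical illustration that assumes the file list has already been fetched (for example via the GitHub pulls API) and that repositories follow the standard Maven layout (src/main/java, src/test/java).

```python
from datetime import datetime, timezone

CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)  # only PRs created after January 2024

def is_candidate_pr(changed_files: list[str], created_at: datetime) -> bool:
    """Keep PRs that touch both source and test code and carry a non-trivial change."""
    if created_at < CUTOFF:
        return False

    touches_source = any("src/main/java" in f and f.endswith(".java") for f in changed_files)
    touches_tests = any("src/test/java" in f and f.endswith(".java") for f in changed_files)

    # Version bumps, doc edits, and build-file-only changes carry no meaningful logic change.
    trivial_only = all(f.endswith(("pom.xml", ".md", ".properties")) for f in changed_files)

    return touches_source and touches_tests and not trivial_only
```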

Dockerized execution and test validation

For each selected repository, the team created a dedicated Docker image capable of running the full test suite in a clean, isolated environment. This ensured consistent behavior across machines and eliminated environment-specific variance.

Within this environment, the team:

  • Executed tests against the original commit to confirm reproducible failures
  • Executed tests against the patched commit to confirm resolution
  • Recorded both fail-to-pass and pass-to-pass outcomes
  • Verified that compilation succeeded before and after patch application

Only pull requests exhibiting stable, reproducible test behavior were retained.
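
In practice this validation can be scripted around each repository's Docker image and Maven's surefire reports. The outline below is a simplified sketch under assumed conventions (repository mounted at /repo, JUnit XML written to target/surefire-reports, which is Maven's default location); it is not the exact harness used for delivery.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

def run_tests(image: str, workdir: Path) -> dict[str, bool]:
    """Run 'mvn test' inside the repo's Docker image and map each test case to pass/fail."""
    # A non-zero exit code is expected on the pre-patch commit, so it is not treated as an error.
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{workdir.resolve()}:/repo", "-w", "/repo",
         image, "mvn", "test"],
        capture_output=True,
    )
    results: dict[str, bool] = {}
    # Maven's surefire plugin writes one JUnit XML report per test class by default.
    for report in workdir.glob("**/target/surefire-reports/TEST-*.xml"):
        for case in ET.parse(report).getroot().iter("testcase"):
            name = f"{case.get('classname')}.{case.get('name')}"
            failed = case.find("failure") is not None or case.find("error") is not None
            results[name] = not failed
    return results

def classify(before: dict[str, bool], after: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Split tests into fail-to-pass (bug reproducers) and pass-to-pass (regression guards)."""
    fail_to_pass = [t for t, ok in after.items() if ok and before.get(t) is False]
    pass_to_pass = [t for t, ok in after.items() if ok and before.get(t) is True]
    return fail_to_pass, pass_to_pass
```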

Issue description curation

When no original issue was linked to the PR, Turing’s trainers:

  • Authored new issue descriptions based on the PR diff and intent
  • Wrote problem-focused, not fix-guiding, descriptions
  • Added context only when necessary for reproducibility
  • Aligned descriptions with the test diff to preserve evaluability without revealing the solution

This format intentionally diverged from prior benchmarks (like SWE-bench), where issue hints can bias solutions. Prompts were designed to state the problem clearly without implying the fix.
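
One lightweight way to check that a drafted prompt stays solution-agnostic is to flag verbatim overlap with identifiers the gold patch introduces. The heuristic below is a hypothetical reviewer aid rather than part of the delivered tooling, and its regex-based identifier extraction will surface false positives that still need human judgment.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]{4,}")  # Java-ish identifiers, 5+ chars to limit noise

def leaked_identifiers(issue_text: str, patch_diff: str) -> set[str]:
    """Return identifiers introduced by the patch that also appear verbatim in the issue prompt."""
    added_lines = [
        line[1:] for line in patch_diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    added_ids = {tok for line in added_lines for tok in IDENT.findall(line)}
    prompt_tokens = set(IDENT.findall(issue_text))
    return added_ids & prompt_tokens

# A non-empty result flags the prompt for trainer review; it does not automatically reject it.
```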

Task balance

To maintain evaluation granularity:

  • ≈30% of tasks were designated as model-solvable
  • ≈70% exposed failure points, including complex logic changes, ambiguous context, or multi-file patches
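
Once each task has been labeled solvable or breaking against a reference model, the split can be enforced with simple stratified sampling. A minimal sketch, assuming a boolean solvable flag on each task record:

```python
import random

def balance_tasks(tasks: list[dict], target_solvable: float = 0.30, seed: int = 0) -> list[dict]:
    """Keep all model-breaking tasks and sample solvable ones to hit roughly a 30/70 split."""
    rng = random.Random(seed)
    solvable = [t for t in tasks if t["solvable"]]
    breaking = [t for t in tasks if not t["solvable"]]
    # n_s / (n_s + n_b) = target  =>  n_s = n_b * target / (1 - target)
    n_solvable = min(len(solvable), round(len(breaking) * target_solvable / (1 - target_solvable)))
    return rng.sample(solvable, n_solvable) + breaking
```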

Key Results

  • Delivered a dataset of 200+ real-bug Java tasks, balanced for solvability and realism
  • Authored neutral, reproducible prompts across PRs with and without original issues
  • Ran automated patch and test validation pipelines enforcing fail-to-pass and pass-to-pass behavior
  • Enabled controlled evaluation of model successes and failure modes
  • Delivered Dockerized, test-ready samples integrated with the client’s patch evaluation framework

The Outcome

This dataset enables model builders to evaluate performance across real-world Java software tasks:

  • Tests model patching ability on authentic bugs
  • Distinguishes between solvable tasks and hard breakpoints
  • Reveals failure patterns in LLM-generated code under controlled conditions
  • Supports deeper analysis of generalization, instruction following, and patch realism

Trainer-authored prompts, paired with reproducible test cases, make the dataset a valuable resource for fine-tuning, evaluation, and failure-mode diagnosis.

Want to evaluate your model on patch-level bug fixing?

Request a sample task featuring a curated issue prompt, validated patch, pass/fail test states, and metadata on difficulty, solvability, and repository source.

Request Sample

FAQ

What’s in the sample?

Each sample includes a SWE-bench-style Java task with an issue prompt, validated patch, and complete test context.

Are tasks balanced by difficulty?

Yes. Approximately 30% are model-solvable and 70% are model-breaking examples to ensure balanced evaluation across difficulty levels.

What’s special about the prompts?

They are trainer-written to describe the problem clearly without guiding the fix, while remaining grounded, reproducible, and testable.

How is correctness validated?

Each task was verified through fail-to-pass test harnesses confirming correct patch behavior.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after the NDA is executed.

Is your model ready for unstructured bug reports and ambiguous diffs?

Request a sample of hand-curated coding tasks, including both model-solvable and failure-triggering examples, to diagnose model strengths and blind spots.

Request Sample

AGI Advance Newsletter

Weekly updates on frontier benchmarks, evals, fine-tuning, and agentic workflows, read by top labs and AI practitioners.

Subscribe Now