Creating a 1,500-Task Real-World Software Engineering Benchmark with E2E UI Test Oracles

Designed a large-scale software-engineering benchmark composed of high-quality tasks drawn from a complex, open-source codebase. Each task includes a self-contained prompt derived from a real issue report and a solution-agnostic grader that accepts any valid solution and rejects invalid ones.

2,000+

Resolved issues reviewed: Each issue was evaluated for clarity, reproducibility, and testability.

1,500+

Benchmark tasks retained: Only issues with a clear description and an objective method for verifying correctness were included.

100%

E2E UI test graders: Experts authored UI tests that validate behavior from an end-user perspective, ensuring any correct fix passes even if its implementation differs, while invalid fixes fail.

Industry: AI Research
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement

The Challenge

Modern evaluations assess reasoning, math, competitive programming, and code generation, but rarely reflect the messy reality of shipping fixes in a live product: respecting codebase conventions and avoiding regressions. Most SWE benchmarks grade solutions using unit tests from the original PRs. This approach is efficient but often too narrow, susceptible to grader gaming, and biased toward code that already had unit tests.

The client required a benchmark closer to real software delivery: E2E validation that exercises full end-user flows, captures cross-component interactions, and resists gaming more effectively than unit-test-only grading.

The Approach

Dataset

Source pool

  • 2,000+ resolved issues from the target repository
  • Each issue linked to a PR that purported to resolve it

Task design targets

Each task evaluates whether a model can:

  1. Understand and act on a realistic engineering prompt in a complex codebase
  2. Produce a patch that resolves the issue as verified by the E2E UI test
  3. Avoid regressions guarded by both the project’s existing tests and the newly authored E2E test
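
To make these checks concrete, the sketch below shows how a grading harness might apply a candidate patch and then run both the project’s existing test suite and the newly authored E2E test. This is a minimal illustration under assumed conventions: the function name, command arguments, and result shape are not the client’s actual harness.

```python
import subprocess

def evaluate_candidate(repo_dir: str, patch_file: str,
                       existing_tests_cmd: list[str],
                       e2e_test_cmd: list[str]) -> dict:
    """Apply a model-generated patch, then require both the repository's
    existing tests (regression guard) and the new E2E UI test (oracle)
    to pass."""
    # Apply the candidate patch to a clean checkout of the repository.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return {"resolved": False, "reason": "patch did not apply"}

    # Regression guard: the project's own test suite must still pass.
    regressions = subprocess.run(existing_tests_cmd, cwd=repo_dir)

    # Oracle: the solution-agnostic E2E UI test must now pass.
    e2e = subprocess.run(e2e_test_cmd, cwd=repo_dir)

    return {
        "resolved": regressions.returncode == 0 and e2e.returncode == 0,
        "existing_tests_passed": regressions.returncode == 0,
        "e2e_test_passed": e2e.returncode == 0,
    }
```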

Discard criteria

We excluded candidates if:

  • The issue was vague, dependent on inaccessible context, or not self-contained
  • The issue was non-reproducible or untestable. Reproducibility was essential for designing an E2E test that fails when the issue exists and passes once resolved
  • The original PR did not clearly fix the issue. A valid fix was required because the original PR served as the “fixed” baseline for grader validation; together with reproducibility, this guarantees the E2E test fails in the broken state and passes in the fixed state (see the validation sketch below)
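
The fail-to-pass requirement above can be checked mechanically. The sketch below assumes the E2E test is runnable as a shell command that exits non-zero on failure, and that the commits immediately before and after the original PR are known; the helper names and arguments are hypothetical.

```python
import subprocess

def checkout(repo_dir: str, commit: str) -> None:
    # Force-checkout the requested commit so the working tree matches it exactly.
    subprocess.run(["git", "checkout", "--force", commit], cwd=repo_dir, check=True)

def e2e_passes(repo_dir: str, e2e_test_cmd: list[str]) -> bool:
    # The E2E test is assumed to exit 0 on pass, non-zero on fail.
    return subprocess.run(e2e_test_cmd, cwd=repo_dir).returncode == 0

def validate_oracle(repo_dir: str, broken_commit: str, fixed_commit: str,
                    e2e_test_cmd: list[str]) -> bool:
    """Keep a task only if its E2E test fails before the original PR
    and passes after it (fail-to-pass)."""
    checkout(repo_dir, broken_commit)
    fails_when_broken = not e2e_passes(repo_dir, e2e_test_cmd)

    checkout(repo_dir, fixed_commit)
    passes_when_fixed = e2e_passes(repo_dir, e2e_test_cmd)

    return fails_when_broken and passes_when_fixed
```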

Why E2E UI tests?
Compared to unit-test-only grading, E2E tests:

  • Reflect complete user workflows across layers, revealing integration issues unit tests miss
  • Are solution-agnostic by design. Experts ensured that each test fairly evaluates any valid fix, not just the specific patch from the original PR, and rejects any invalid fix that fails to address the issue 
  • Are harder to game, reducing the risk of narrow patches that satisfy a single assertion without fixing real behavior
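
For illustration only, the test below shows the style of solution-agnostic check described above, written with Playwright’s Python API. The tooling, URL, labels, and saved-filter scenario are all assumptions rather than details of the actual benchmark; the point is that the assertion targets user-visible behavior, so any patch that genuinely fixes that behavior passes, regardless of how it is implemented.

```python
# A minimal sketch using Playwright's sync API (pip install playwright).
from playwright.sync_api import sync_playwright

def test_saved_filter_survives_reload():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Drive the flow exactly as an end user would.
        page.goto("http://localhost:3000/projects")
        page.get_by_role("button", name="Filter").click()
        page.get_by_label("Status").select_option("Open")
        page.get_by_role("button", name="Apply").click()

        # The (hypothetical) reported bug: the filter was lost after a reload.
        page.reload()
        assert page.get_by_label("Status").input_value() == "Open"

        browser.close()
```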

Evaluation

To ensure benchmark integrity, we implemented three quality controls:

  • Issue clarity. Our experts scored issues for clarity and completeness (1–5 scale). Only clear and well-specified issues were included.
  • Grader validity. Each E2E test was constructed to pass for any reasonable fix and fail for invalid patches.
  • Difficulty tagging. Each task was assigned a difficulty rating based on how challenging it would be for a professional software engineer to resolve. To ensure fair and consistent assessment, our expert engineers first familiarized themselves with the target repository, its architecture, and development conventions, allowing them to accurately judge the relative complexity of each task.

Candidates failing clarity, reproducibility, or testability requirements were removed. Only tasks meeting the defined quality threshold were included in the final benchmark.
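
As an illustration of how these quality controls might be recorded per task, the sketch below defines a simple reviewer record and the threshold check used to filter tasks. Every field name, value, and threshold here is hypothetical, not the delivered schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class TaskRecord:
    """Illustrative shape of a reviewed benchmark task (hypothetical fields)."""
    issue_id: str
    pr_url: str
    e2e_test_path: str
    clarity_score: int        # 1-5, as scored by expert reviewers
    difficulty: str           # e.g. "easy" | "medium" | "hard"
    oracle_validated: bool    # fail-to-pass check succeeded

    def meets_quality_bar(self, min_clarity: int = 4) -> bool:
        # Tasks below the clarity threshold or without a validated
        # oracle are excluded from the final benchmark.
        return self.clarity_score >= min_clarity and self.oracle_validated

# Example usage with made-up values:
record = TaskRecord("ISSUE-123", "https://example.com/pr/456",
                    "tests/e2e/test_issue_123.py", 5, "medium", True)
print(asdict(record), record.meets_quality_bar())
```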

Key Metrics

  • Screened 2,000+ resolved issues spanning varied difficulty levels and task types (from bug fixes to feature requests)
  • Delivered 1,500+ tasks, each including a clear issue report, the original PR, a solution-agnostic E2E UI test, and reviewer metadata (e.g., task difficulty)
  • Excluded ~500 candidates due to non-reproducible or non-testable issues

The Outcome

A trustworthy, end-to-end software-engineering benchmark that:

  • Fairly evaluates models on real issues. Passing a sample means the model produced a patch that resolved the reported issue without introducing regressions or breaking existing functionality. Failure indicates that the fix either did not address the issue or created unintended side effects.
  • Reveals the model’s true capabilities and weaknesses. All issue descriptions are clear, and all tests are designed to be fair and solution-agnostic, ensuring each model’s result accurately reflects its underlying strengths and limitations. A model’s failure therefore signals where it truly struggles.
  • Highlights meaningful failure modes. The benchmark identifies where models struggle most when handling complex, end-to-end software engineering tasks. Each clearly defined issue and fair, solution-agnostic grader provides actionable insight into how and why a model failed, enabling targeted improvement strategies.
  • Raises the bar beyond unit tests. By validating complete user flows, the benchmark better represents real-world engineering work and reduces susceptibility to narrow test hacks.

Want to evaluate where your model breaks in complex engineering tasks?

Request a sample that reveals real model failure points using high-fidelity E2E tests across bug fixes, features, and regressions.

Request Sample

FAQ

What does each benchmark task include?

Each task includes the issue report, a link to the original PR, the authored E2E UI test, and expert-labeled metadata covering issue clarity, PR resolution, and task difficulty.

How were tasks verified?

Each issue description was verified for clarity and completeness. Once confirmed, the PR was validated to ensure it properly fixed the issue. If both criteria were met, an E2E test was designed to fairly verify the fix.

What kinds of issues are included?

The benchmark includes diverse tasks such as bug fixes, performance improvements, and feature requests.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

When will I receive the sample?

Within three business days after NDA execution.

How does your model handle messy, real-world software bugs?

Request human-curated engineering tasks with vetted issue reports, blind-evaluated tests, and difficulty labels.

Request Sample