Designed a large-scale software-engineering benchmark composed of high-quality tasks drawn from a complex, open-source codebase. Each task includes a self-contained prompt derived from a real issue report and a solution-agnostic grader that accepts any valid solution and rejects invalid ones.

Modern evaluations assess reasoning, math, competitive programming, and code generation, but rarely reflect the messy reality of shipping fixes in a live product: respecting codebase conventions and avoiding regressions. Most SWE benchmarks grade solutions using unit tests from the original PRs. This approach is efficient but often too narrow, susceptible to grader gaming, and biased toward code that already had unit tests.
The client required a benchmark closer to real software delivery: E2E validation that exercises full end-user flows, captures cross-component interactions, and resists gaming more effectively than unit-test-only grading.
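As a rough illustration of how such solution-agnostic, E2E-based grading can be wired up, the sketch below applies a candidate patch to a clean checkout, rebuilds the product, and runs the task's end-to-end suite. The repository layout, build commands (npm), and test runner invocation (Playwright) are assumptions for illustration only, not the benchmark's actual tooling.

```python
import subprocess
from pathlib import Path

# Hypothetical harness for solution-agnostic E2E grading: the candidate patch is
# applied to a clean checkout, the app is rebuilt, and the task's E2E test is run.
# All commands and paths are placeholders, not the benchmark's real interface.

def grade_candidate(repo_dir: Path, patch_file: Path, e2e_test: str) -> bool:
    """Return True only if the candidate patch makes the task's E2E test pass."""
    def run(cmd: list[str]) -> bool:
        return subprocess.run(cmd, cwd=repo_dir).returncode == 0

    if not run(["git", "apply", str(patch_file)]):                 # patch must apply cleanly
        return False
    if not (run(["npm", "ci"]) and run(["npm", "run", "build"])):  # rebuild the product
        return False
    # Any implementation that makes the end-user flow work is accepted;
    # the grader never inspects the diff itself.
    return run(["npx", "playwright", "test", e2e_test])
```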
Dataset
Source pool
Task design targets
Each task evaluates whether a model can:
Discard criteria
We excluded candidates if:
Why E2E UI tests?
Compared to unit-test-only grading, E2E tests:
Evaluation
To ensure benchmark integrity, we implemented three quality controls:
Candidates failing clarity, reproducibility, or testability requirements were removed. Only tasks meeting the defined quality threshold were included in the final benchmark.
A trustworthy, end-to-end software-engineering benchmark that:
Request a sample that reveals real model failure points using high-fidelity E2E tests across bug fixes, features, and regressions.
Each task includes an issue report, a link to the PR and the designed UI test, and expert-labeled metadata for issue clarity, PR resolution, and task difficulty.
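A minimal sketch of what such a task record could look like, assuming Python and illustrative field names; the label scales (e.g. a 1-5 clarity score) are hypothetical and not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"

@dataclass
class BenchmarkTask:
    task_id: str
    issue_report: str        # self-contained prompt taken from the real issue
    pr_url: str              # link to the reference PR
    e2e_test_path: str       # the designed UI test used for grading
    issue_clarity: int       # expert label, e.g. 1-5 (scale assumed)
    pr_resolution: str       # expert judgment of whether the PR resolves the issue
    difficulty: Difficulty   # expert-labeled task difficulty
```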
Each issue description was verified for clarity and completeness. Once confirmed, the PR was validated to ensure it properly fixed the issue. If both criteria were met, an E2E test was designed to fairly verify the fix.
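For illustration, the sketch below shows the shape such an E2E UI test might take using Playwright's Python API: it drives the end-user flow the issue describes and asserts on the visible outcome rather than on internal functions. The URL, selectors, and expected text are placeholders, not taken from the benchmark.

```python
from playwright.sync_api import sync_playwright, expect

# Hypothetical E2E check for a bug-fix task: instead of asserting on internal
# functions, the test walks the user-facing flow and checks the visible result.

def test_issue_fix_visible_to_user() -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000/projects")  # app under test (assumed URL)
        page.get_by_role("button", name="New project").click()
        page.get_by_label("Project name").fill("demo")
        page.get_by_role("button", name="Create").click()
        # The regression in the issue: the new project did not appear in the list.
        expect(page.get_by_text("demo")).to_be_visible()
        browser.close()
```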
The benchmark includes diverse tasks such as bug fixes, performance improvements, and feature requests.
What agreement is required before receiving a sample? A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How soon is the sample delivered? Within three business days after NDA execution.
Request human-curated engineering tasks with vetted issue reports, blind-evaluated tests, and difficulty labels.