Curated 500 expert-verified tasks from real GitHub issues and pull requests (PRs). Each task includes a clear, self-contained issue report and a test designed to accept any correct solution, not only the original fix.

Modern benchmarks demonstrate model strengths in reasoning, math, competitive programming, and code generation. Yet they rarely capture the real-world complexity of software engineering, where fixes must land in live repositories, respect project conventions, and prevent regressions.
The client required a benchmark grounded in real software engineering workflows, incorporating bug reports and feature requests. These issue reports typically describe:
Success in this setting means producing a patch that resolves the issue without introducing regressions elsewhere in the codebase.
While a benchmark can be assembled automatically from open-source GitHub repositories by collecting issues, pull requests, and their corresponding tests, such automation often introduces quality risks. For example:
To build a valid benchmark, manual expert verification was essential to ensure clarity, test fairness, and real-world fidelity.
Turing curated a benchmark of 500 expert-verified software engineering tasks, each grounded in a real-world GitHub issue. Every task included:
Source pool:
Each task was designed to evaluate whether a model could:
Tasks were discarded if:
Evaluation
To ensure benchmark integrity, Turing implemented a multi-layer human review protocol:
Only tasks that passed all review thresholds were included in the final benchmark.
A trustworthy software engineering benchmark that:
Request a benchmark with a fair grader that evaluates models on real-world software engineering tasks.
Each task includes an issue report, a link to the PR and corresponding test, and expert-labeled metadata for clarity, fairness, and difficulty.
Each candidate task received seven expert reviews to ensure issue clarity and test fairness.
The dataset includes diverse issue types such as bug fixes, performance enhancements, code refactoring, and feature requests. It maintains strong repository diversity, ensuring balanced representation across project types, domains, and coding practices.
Only tests that accept any correct fix and reject invalid ones are included; tests overfitted to the original patch are excluded.
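To make the distinction concrete, here is a minimal sketch under an assumed scenario: a hypothetical issue reports that a `slugify` helper drops accented characters instead of transliterating them. The function name, the bug, and both tests are illustrative, not taken from any benchmark task.

```python
import inspect
import re
import unicodedata


def slugify(title: str) -> str:
    # One candidate fix -- one of many possible correct implementations.
    text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def test_slugify_transliterates_accents():
    # Solution-agnostic test: pins down the *behavior* the issue
    # describes, so any correct fix passes.
    assert slugify("Café Déjà Vu") == "cafe-deja-vu"


def test_slugify_uses_nfkd_normalization():
    # Overfitted test (the kind the benchmark rejects): it asserts an
    # internal detail of one particular patch, so an equally correct
    # alternative fix (e.g., a translation table) would fail.
    assert "NFKD" in inspect.getsource(slugify)
```

The first test survives any correct rewrite of `slugify`; the second breaks as soon as a correct solution takes a different implementation route, which is exactly the failure mode expert reviewers screen out.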
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
Within three business days after NDA execution.
Request human-curated engineering tasks with vetted issue reports, blind-evaluated tests, and difficulty labels.