Benchmarking Model Fidelity with 500 Expert-Verified Software Engineering Tasks
Curated 500 expert-verified tasks from real GitHub issues and pull requests (PRs). Each task includes a clear, self-contained issue report and a test designed to accept any correct solution, not only the original fix.
2,000+ resolved GitHub issues reviewed: each linked to a PR with associated tests before expert curation.
500 benchmark tasks retained: only issues with clear issue reports and solution-agnostic tests were accepted.
7x independent reviews per task, ensuring test validity, issue clarity, and difficulty calibration.

The Challenge
Modern benchmarks demonstrate model strengths in reasoning, math, competitive programming, and code generation. Yet they rarely capture the real-world complexity of software engineering, where fixes must land in live repositories, respect project conventions, and prevent regressions.
The client required a benchmark grounded in real software engineering workflows, incorporating bug reports and feature requests. These issue reports typically describe:
- Observed versus expected behavior
- Relevant logs, environments, or reproduction steps
- The scope and impact of the change
Success in this setting means producing a patch that resolves the issue without introducing regressions elsewhere in the codebase.
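The pass/fail criterion above can be sketched as a simple rule: a patch counts as a resolution only if the issue's test passes and the rest of the repository's test suite still passes. This is an illustrative sketch, not Turing's actual harness; the names `TaskResult` and `grade` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    issue_test_passed: bool        # does the test tied to the issue now pass?
    regression_suite_passed: bool  # does the repo's existing suite still pass?

def grade(result: TaskResult) -> str:
    # Both conditions must hold: the issue is fixed AND nothing else broke.
    if result.issue_test_passed and result.regression_suite_passed:
        return "pass"
    return "fail"
```

A patch that fixes the issue but breaks an unrelated test is graded as a failure, which is what makes regression safety part of the benchmark.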
While a benchmark can be assembled automatically from open-source GitHub repositories by collecting issues, pull requests, and their corresponding tests, such automation often introduces quality risks. For example:
- Vague or underspecified issue descriptions may unfairly penalize models
- Tests tied too closely to the original PR may reject other valid solutions
- Brittle or poorly designed tests may accept invalid fixes, inflating perceived model capability
To build a valid benchmark, manual expert verification was essential to ensure clarity, test fairness, and real-world fidelity.
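The difference between a fair test and an overfitted one can be made concrete with a toy example. Suppose an issue reports that trailing slashes are not stripped from paths, and the original PR fixed it with `rstrip`. The function and both tests below are hypothetical, constructed only to illustrate the distinction.

```python
def normalize_path(p: str) -> str:
    # A patched function: any correct fix must strip trailing slashes
    # while leaving the root path "/" intact.
    return p.rstrip("/") or "/"

def overfitted_test():
    # Checks behavior, but ALSO pins an incidental detail of the original
    # PR (that the fix used rstrip), so an equally correct if/else rewrite
    # would be rejected. Tests like this were grounds for discarding a task.
    assert normalize_path("a/b///") == "a/b"
    assert "rstrip" in normalize_path.__code__.co_names

def fair_test():
    # Checks observable behavior only, so it accepts any valid fix.
    assert normalize_path("a/b///") == "a/b"
    assert normalize_path("/") == "/"
    assert normalize_path("a") == "a"
```

Reviewers rejected tasks whose tests resembled `overfitted_test`, keeping only those whose tests, like `fair_test`, are solution-agnostic.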
The Approach
Turing curated a benchmark of 500 expert-verified software engineering tasks, each grounded in a real-world GitHub issue. Every task included:
- A clear, actionable issue report describing a bug or a feature request
- The associated PR that resolved the issue
- The PR’s corresponding test, used to verify resolution
Source pool:
- 2,000+ resolved issues from diverse open-source repositories
- Each issue was paired with a PR that fixed it and a test verifying resolution
Each task was designed to evaluate whether a model could:
- Understand and act on a realistic software bug report
- Write a patch that resolves the issue, verified by a fair test (i.e., one that accepts any valid fix)
- Avoid regressions or codebase violations
Tasks were discarded if:
- The issue was vague, dependent on inaccessible context, or not self-contained
- The test was overly specific to the original solution, rejecting other valid fixes, or was poorly written, allowing invalid ones
Evaluation
To ensure benchmark integrity, Turing implemented a multi-layer human review protocol:
- Expert review panel: Each candidate task was evaluated by seven independent software engineers using GitHub links to the issue, PR, and test
- Scoring rubric: Reviewers scored each task on:
- Issue clarity (1–5): Can the issue be resolved without additional assumptions?
- Test validity (1–5): Does the test fairly validate all correct fixes and reject invalid ones?
- Task difficulty (1–5): How challenging is the fix, assuming full understanding?
- Flagging system: Reviewers flagged concerns such as:
- Overfitted or flaky tests
- Underspecified or misleading prompts
- Non-reproducible bugs or non-deterministic behavior
Only tasks that passed all review thresholds were included in the final benchmark.
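The acceptance rule can be sketched as follows. The threshold values and field names here are illustrative assumptions; the source states only that seven independent reviews were required, a rubric was scored, and flagged tasks were excluded.

```python
from statistics import mean

# Hypothetical acceptance rule combining seven reviewers' rubric scores.
# Threshold values (4.0) are assumed for illustration, not published.
def accept_task(reviews: list[dict],
                min_clarity: float = 4.0,
                min_validity: float = 4.0) -> bool:
    if len(reviews) < 7:
        return False  # every task requires seven independent reviews
    if any(r.get("flags") for r in reviews):
        return False  # any flag (flaky test, vague prompt, ...) rejects the task
    clarity = mean(r["clarity"] for r in reviews)
    validity = mean(r["test_validity"] for r in reviews)
    return clarity >= min_clarity and validity >= min_validity
```

Under a rule like this, a single reviewer's flag is enough to exclude a task, which matches the conservative ~75% rejection rate reported below.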
Key Metrics
- Screened 2,000+ resolved GitHub issues across diverse open-source repositories
- Curated 500 high-confidence tasks, each with a clear issue report and fair test, along with review metadata such as task difficulty
- Applied 7x blind expert reviews per sample to score clarity, fairness, and difficulty
- Rejected ~75% of candidates due to vague issue reports or tests that unfairly rejected valid fixes
The Outcome
A trustworthy software engineering benchmark that:
- Fairly evaluates models on real issues. Passing a sample means the model created a patch that resolved the issue without breaking anything in the repository. Failing indicates that the patch either did not fix the issue or introduced regressions
- Reveals true capability by removing impossible or misleading items (e.g., vague issues or over-specific tests), ensuring that failures reflect actual model weaknesses
- Highlights model failure modes, identifying weaknesses in handling software engineering tasks. Each failure provides insight into how and why the model struggled, supporting targeted improvement strategies for future performance
Want to evaluate your model on software engineering tasks?
Request a benchmark with a fair grader that evaluates models on real-world software engineering tasks.
FAQ
What does each benchmark task include?
Each task includes an issue report, a link to the PR and corresponding test, and expert-labeled metadata for clarity, fairness, and difficulty.
How were tasks verified?
Each candidate task received seven expert reviews to ensure issue clarity and test fairness.
What kinds of issues are included?
The dataset includes diverse issue types such as bug fixes, performance enhancements, code refactoring, and feature requests. It maintains strong repository diversity, ensuring balanced representation across project types, domains, and coding practices.
How are tests validated?
Only tests that accept any correct fix and reject invalid ones are included, rather than tests limited to the original patch.
What’s the NDA process?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
When will I receive the sample?
Within three business days after NDA execution.