Benchmarking Model Fidelity with 500 Expert-Verified Software Engineering Tasks

Turing curated 500 expert-verified tasks from real GitHub issues and pull requests (PRs). Each task includes a clear, self-contained issue report and a test designed to accept any correct solution, not only the original fix.

2,000+

Resolved GitHub issues reviewed: Each linked to a PR with associated tests before expert curation.

500

Benchmark tasks retained: Only issues with clear issue reports and solution-agnostic tests were accepted.

7x

Independent reviews per task, ensuring test validity, issue clarity, and difficulty calibration.

Industry: AI Research
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement

The Challenge

Modern benchmarks demonstrate model strengths in reasoning, math, competitive programming, and code generation. Yet they rarely capture the real-world complexity of software engineering, where fixes must land in live repositories, respect project conventions, and prevent regressions.

The client required a benchmark grounded in real software engineering workflows, incorporating bug reports and feature requests. These issue reports typically describe:

  • Observed versus expected behavior
  • Relevant logs, environments, or reproduction steps
  • The scope and impact of the change

Success in this setting means producing a patch that resolves the issue without introducing regressions elsewhere in the codebase.

While a benchmark can be assembled automatically from open-source GitHub repositories by collecting issues, pull requests, and their corresponding tests, such automation often introduces quality risks. For example:

  • Vague or underspecified issue descriptions may unfairly penalize models
  • Tests tied too closely to the original PR may reject other valid solutions
  • Brittle or poorly designed tests may accept invalid fixes, inflating perceived model capability

To build a valid benchmark, manual expert verification was essential to ensure clarity, test fairness, and real-world fidelity.

The Approach

Turing curated a benchmark of 500 expert-verified software engineering tasks, each grounded in a real-world GitHub issue. Every task included:

  • A clear, actionable issue report describing a bug or a feature request
  • The associated PR that resolved the issue
  • The PR’s corresponding test, used to verify resolution

Source pool:

  • 2,000+ resolved issues from diverse open-source repositories
  • Each issue was paired with a PR that fixed it and a test verifying resolution
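Each curated task can be thought of as a small record linking an issue, its resolving PR, and the verifying test. The sketch below is illustrative only; the field names and URLs are assumptions, not the client's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkTask:
    """Hypothetical task record; fields are illustrative, not the real schema."""
    repo: str        # source repository, e.g. "org/project"
    issue_url: str   # link to the self-contained issue report
    pr_url: str      # link to the PR that resolved the issue
    test_path: str   # test used to verify resolution

# Example record with placeholder URLs (not a real benchmark task).
task = BenchmarkTask(
    repo="org/project",
    issue_url="https://github.com/org/project/issues/1",
    pr_url="https://github.com/org/project/pull/2",
    test_path="tests/test_fix.py",
)
```

Keeping the record immutable (`frozen=True`) reflects that a curated task should not change once it passes review.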

Each task was designed to evaluate whether a model could:

  • Understand and act on a realistic software bug report
  • Write a patch that resolves the issue and can be verified by a fair test (i.e., one that accepts any valid fix)
  • Avoid regressions or codebase violations

Tasks were discarded if:

  • The issue was vague, dependent on inaccessible context, or not self-contained
  • The test was either overly specific to the original solution (rejecting other valid fixes) or poorly written (accepting invalid ones)
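The retention and discard rules above reduce to a simple predicate over per-task review judgments. The flag names below are assumptions introduced for illustration, not the project's actual review fields.

```python
def keep_task(issue_is_self_contained: bool,
              test_accepts_any_valid_fix: bool,
              test_rejects_invalid_fixes: bool) -> bool:
    """A candidate is retained only if the issue is self-contained and the
    test is both solution-agnostic and sound. Flag names are illustrative."""
    return (issue_is_self_contained
            and test_accepts_any_valid_fix
            and test_rejects_invalid_fixes)

keep_task(True, True, True)    # retained
keep_task(True, False, True)   # overfitted test: discarded
keep_task(False, True, True)   # vague issue: discarded
```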

Evaluation

To ensure benchmark integrity, Turing implemented a multi-layer human review protocol:

  • Expert review panel: Each candidate task was evaluated by seven independent software engineers using GitHub links to the issue, PR, and test
  • Scoring rubric: Reviewers scored each task on:
    - Issue clarity (1–5): Can the issue be resolved without additional assumptions?
    - Test validity (1–5): Does the test fairly validate all correct fixes and reject invalid ones?
    - Task difficulty (1–5): How challenging is the fix, assuming full understanding?
  • Flagging system: Reviewers flagged concerns such as:
    - Overfitted or flaky tests
    - Underspecified or misleading prompts
    - Non-reproducible bugs or non-deterministic behavior

Only tasks that passed all review thresholds were included in the final benchmark.
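One plausible way to combine the seven reviewers' rubric scores is a per-dimension mean threshold plus a veto on any flag. The cutoff values below are assumptions for illustration; the actual thresholds were not published.

```python
from statistics import mean

# Hypothetical thresholds; actual review cutoffs were not published.
CLARITY_MIN = 4.0
VALIDITY_MIN = 4.0

def passes_review(clarity: list[int], validity: list[int],
                  flags: list[str]) -> bool:
    """Aggregate seven reviewers' 1-5 scores; any raised flag disqualifies."""
    return (mean(clarity) >= CLARITY_MIN
            and mean(validity) >= VALIDITY_MIN
            and not flags)

passes_review([5, 4, 5, 4, 5, 4, 5], [4, 4, 5, 5, 4, 4, 5], [])      # passes
passes_review([5, 5, 5, 5, 5, 5, 5], [5, 5, 5, 5, 5, 5, 5],
              ["flaky test"])                                        # vetoed
```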

Key Metrics

  • Screened 2,000+ resolved GitHub issues across diverse open-source repositories
  • Curated 500 high-confidence tasks, each with a clear issue report and fair test, along with review metadata such as task difficulty
  • Applied 7x blind expert reviews per sample to score clarity, fairness, and difficulty
  • Rejected ~75% of candidates due to vague issue reports or tests that unfairly rejected valid fixes

The Outcome

A trustworthy software engineering benchmark that:

  • Fairly evaluates models on real issues. Passing a sample means the model created a patch that resolved the issue without breaking anything in the repository. Failing indicates that the patch either did not fix the issue or introduced regressions
  • Reveals true capability by removing impossible or misleading items (e.g., vague issues or over-specific tests), ensuring that failures reflect actual model weaknesses
  • Highlights model failure modes, identifying weaknesses in handling software engineering tasks. Each failure provides insight into how and why the model struggled, supporting targeted improvement strategies for future performance
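At evaluation time, the pass/fail semantics above come down to three steps: apply the model's patch, run the task's test, and run the repository's existing suite to catch regressions. A minimal sketch using `git` and `pytest` via `subprocess`; the commands, paths, and function name are assumptions, not the client's harness.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, task_test: str) -> bool:
    """Hypothetical harness: a patch passes only if it applies cleanly,
    the task's test passes, and the pre-existing suite still passes."""
    applied = subprocess.run(["git", "-C", repo_dir, "apply", patch_file])
    if applied.returncode != 0:
        return False  # patch does not apply cleanly
    fix_ok = subprocess.run(["pytest", task_test],
                            cwd=repo_dir).returncode == 0
    no_regressions = subprocess.run(["pytest"],
                                    cwd=repo_dir).returncode == 0
    return fix_ok and no_regressions
```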

Want to evaluate your model on software engineering tasks?

Request a benchmark with a fair grader that evaluates models on real-world software engineering tasks.

Request Sample


FAQ

What does each benchmark task include?

Each task includes an issue report, a link to the PR and corresponding test, and expert-labeled metadata for clarity, fairness, and difficulty.

How were tasks verified?

Each candidate task received seven expert reviews to ensure issue clarity and test fairness.

What kinds of issues are included?

The dataset includes diverse issue types such as bug fixes, performance enhancements, code refactoring, and feature requests. It maintains strong repository diversity, ensuring balanced representation across project types, domains, and coding practices.

How are tests validated?

Only tests that accept any correct fix and reject invalid ones are included, rather than tests limited to the original patch.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

When will I receive the sample?

Within three business days after NDA execution.

How does your model handle messy, real-world software bugs?

Request human-curated engineering tasks with vetted issue reports, blind-evaluated tests, and difficulty labels.

Request Sample