Benchmarking Model Fidelity with 500 Expert-Verified Software Engineering Tasks

Turing curated 500 expert-verified tasks from real GitHub issues and pull requests (PRs). Each task includes a clear, self-contained issue report and a test designed to accept any correct solution, not only the original fix.

2,000+

Resolved GitHub issues reviewed: Each linked to a PR with associated tests before expert curation.

500

Benchmark tasks retained: Only issues with clear issue reports and solution-agnostic tests were accepted.

7x

Independent reviews per task, ensuring test validity, issue clarity, and difficulty calibration.

Industry: AI Research
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement

The Challenge

Modern benchmarks demonstrate model strengths in reasoning, math, competitive programming, and code generation. Yet they rarely capture the real-world complexity of software engineering, where fixes must land in live repositories, respect project conventions, and prevent regressions.

The client required a benchmark grounded in real software engineering workflows, incorporating bug reports and feature requests. These issue reports typically describe:

  • Observed versus expected behavior
  • Relevant logs, environments, or reproduction steps
  • The scope and impact of the change

Success in this setting means producing a patch that resolves the issue without introducing regressions elsewhere in the codebase.

While a benchmark can be assembled automatically from open-source GitHub repositories by collecting issues, pull requests, and their corresponding tests, such automation often introduces quality risks. For example:

  • Vague or underspecified issue descriptions may unfairly penalize models
  • Tests tied too closely to the original PR may reject other valid solutions
  • Brittle or poorly designed tests may accept invalid fixes, inflating perceived model capability

To build a valid benchmark, manual expert verification was essential to ensure clarity, test fairness, and real-world fidelity.

The Approach

Turing curated a benchmark of 500 expert-verified software engineering tasks, each grounded in a real-world GitHub issue. Every task included:

  • A clear, actionable issue report describing a bug or a feature request
  • The associated PR that resolved the issue
  • The PR’s corresponding test, used to verify resolution

Source pool:

  • 2,000+ resolved issues from diverse open-source repositories
  • Each issue was paired with a PR that fixed it and a test verifying resolution
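Each curated task can be thought of as a small record linking an issue, its resolving PR, and the verifying test. The sketch below is illustrative only; the field names and URLs are assumptions, not the client's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkTask:
    """Hypothetical task record; fields are illustrative, not the real schema."""
    repo: str        # source repository, e.g. "org/project"
    issue_url: str   # link to the self-contained issue report
    pr_url: str      # link to the PR that resolved the issue
    test_path: str   # test used to verify resolution

# Example record with placeholder URLs (not a real benchmark task).
task = BenchmarkTask(
    repo="org/project",
    issue_url="https://github.com/org/project/issues/1",
    pr_url="https://github.com/org/project/pull/2",
    test_path="tests/test_fix.py",
)
```

Keeping the record immutable (`frozen=True`) reflects that a curated task should not change once it passes review.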

Each task was designed to evaluate whether a model could:

  • Understand and act on a realistic software bug report
  • Write a patch that resolves the issue and can be verified by a fair test (i.e., one that accepts any valid fix)
  • Avoid regressions or codebase violations

Tasks were discarded if:

  • The issue was vague, dependent on inaccessible context, or not self-contained
  • The test was either overly specific to the original solution (rejecting other valid fixes) or poorly written (accepting invalid ones)
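The retention and discard rules above reduce to a simple predicate over per-task review judgments. The flag names below are assumptions introduced for illustration, not the project's actual review fields.

```python
def keep_task(issue_is_self_contained: bool,
              test_accepts_any_valid_fix: bool,
              test_rejects_invalid_fixes: bool) -> bool:
    """A candidate is retained only if the issue is self-contained and the
    test is both solution-agnostic and sound. Flag names are illustrative."""
    return (issue_is_self_contained
            and test_accepts_any_valid_fix
            and test_rejects_invalid_fixes)

keep_task(True, True, True)    # retained
keep_task(True, False, True)   # overfitted test: discarded
keep_task(False, True, True)   # vague issue: discarded
```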

Evaluation

To ensure benchmark integrity, Turing implemented a multi-layer human review protocol:

  • Expert review panel: Each candidate task was evaluated by seven independent software engineers using GitHub links to the issue, PR, and test
  • Scoring rubric: Reviewers scored each task on:
    - Issue clarity (1–5): Can the issue be resolved without additional assumptions?
    - Test validity (1–5): Does the test fairly validate all correct fixes and reject invalid ones?
    - Task difficulty (1–5): How challenging is the fix, assuming full understanding?
  • Flagging system: Reviewers flagged concerns such as:
    - Overfitted or flaky tests
    - Underspecified or misleading prompts
    - Non-reproducible bugs or non-deterministic behavior

Only tasks that passed all review thresholds were included in the final benchmark.
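One plausible way to combine the seven reviewers' rubric scores is a per-dimension mean threshold plus a veto on any flag. The cutoff values below are assumptions for illustration; the actual thresholds were not published.

```python
from statistics import mean

# Hypothetical thresholds; actual review cutoffs were not published.
CLARITY_MIN = 4.0
VALIDITY_MIN = 4.0

def passes_review(clarity: list[int], validity: list[int],
                  flags: list[str]) -> bool:
    """Aggregate seven reviewers' 1-5 scores; any raised flag disqualifies."""
    return (mean(clarity) >= CLARITY_MIN
            and mean(validity) >= VALIDITY_MIN
            and not flags)

passes_review([5, 4, 5, 4, 5, 4, 5], [4, 4, 5, 5, 4, 4, 5], [])      # passes
passes_review([5, 5, 5, 5, 5, 5, 5], [5, 5, 5, 5, 5, 5, 5],
              ["flaky test"])                                        # vetoed
```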

Key Metrics

  • Screened 2,000+ resolved GitHub issues across diverse open-source repositories
  • Curated 500 high-confidence tasks, each with a clear issue report and fair test, along with review metadata such as task difficulty
  • Applied 7x blind expert reviews per sample to score clarity, fairness, and difficulty
  • Rejected ~75% of candidates due to vague issue reports or tests that unfairly rejected valid fixes

The Outcome

A trustworthy software engineering benchmark that:

  • Fairly evaluates models on real issues. Passing a sample means the model created a patch that resolved the issue without breaking anything in the repository. Failing indicates that the patch either did not fix the issue or introduced regressions
  • Reveals true capability by removing impossible or misleading items (e.g., vague issues or over-specific tests), ensuring that failures reflect actual model weaknesses
  • Highlights model failure modes, identifying weaknesses in handling software engineering tasks. Each failure provides insight into how and why the model struggled, supporting targeted improvement strategies for future performance
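At evaluation time, the pass/fail semantics above come down to three steps: apply the model's patch, run the task's test, and run the repository's existing suite to catch regressions. A minimal sketch using `git` and `pytest` via `subprocess`; the commands, paths, and function name are assumptions, not the client's harness.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, task_test: str) -> bool:
    """Hypothetical harness: a patch passes only if it applies cleanly,
    the task's test passes, and the pre-existing suite still passes."""
    applied = subprocess.run(["git", "-C", repo_dir, "apply", patch_file])
    if applied.returncode != 0:
        return False  # patch does not apply cleanly
    fix_ok = subprocess.run(["pytest", task_test],
                            cwd=repo_dir).returncode == 0
    no_regressions = subprocess.run(["pytest"],
                                    cwd=repo_dir).returncode == 0
    return fix_ok and no_regressions
```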

Want to evaluate your model on software engineering tasks?

Request a benchmark with a fair grader that evaluates models on real-world software engineering tasks.

Request Sample


FAQ

What does each benchmark task include?

Each task includes an issue report, a link to the PR and corresponding test, and expert-labeled metadata for clarity, fairness, and difficulty.

How were tasks verified?

Each candidate task received seven expert reviews to ensure issue clarity and test fairness.

What kinds of issues are included?

The dataset includes diverse issue types such as bug fixes, performance enhancements, code refactoring, and feature requests. It maintains strong repository diversity, ensuring balanced representation across project types, domains, and coding practices.

How are tests validated?

Only tests that accept any correct fix and reject invalid ones are included, rather than tests limited to the original patch.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

When will I receive the sample?

Within three business days after NDA execution.

How does your model handle messy, real-world software bugs?

Request human-curated engineering tasks with vetted issue reports, blind-evaluated tests, and difficulty labels.

Request Sample