Evaluate reasoning where it matters most. Real code, real issues, real reproducibility.

SWE-Bench++ measures how coding agents reason through authentic GitHub problems, pairing every task with a containerized environment and verified trajectory. It extends the original SWE-Bench with multilingual repositories, reproducible pull requests, and fairness-driven validation, so labs can trust what their reasoning scores mean, not just how high they are.

Advancing code reasoning with reproducible benchmarks

Public SWE datasets provide familiarity but not fidelity: they age fast, repeat test logic, and reward memorization over reasoning. SWE-Bench++ closes those gaps with multilingual coverage, transparent versioning, and validated human-AI QA. Each evaluation begins with scoped, real-world tasks designed to test reasoning depth before scaling to full repositories, creating a continuous, auditable signal for model improvement and post-training research.

Core SWE-Bench++ capabilities

Each evaluation module begins with a scoped set of tasks designed to verify reasoning before expanding to larger or more complex repos.

Scalable sourcing and filtering

11 languages, 6 engineering domains, and 7 issue types, drawn from GitHub repositories exceeding 500 stars and 50K lines of code to mirror real-world complexity.

Intelligent data curation for high-quality PRs

All tickets are dated November 2024 or later; 80% are medium-to-hard, multi-file patches, each verified to defeat SOTA models such as GPT-5 and Claude Opus 4.
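
A minimal sketch of how these sourcing and curation thresholds might be expressed in code, assuming a hypothetical CandidatePR record; the field names and filter logic are illustrative, not the production pipeline.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical candidate record; field names are illustrative, not a published schema.
@dataclass
class CandidatePR:
    repo_stars: int
    repo_loc: int
    merged_on: date
    files_changed: int
    difficulty: str  # e.g. "easy", "medium", or "hard"

CUTOFF = date(2024, 11, 1)  # tickets dated November 2024 or later

def passes_curation(pr: CandidatePR) -> bool:
    """Apply the headline SWE-Bench++ filters described above."""
    return (
        pr.repo_stars >= 500
        and pr.repo_loc >= 50_000
        and pr.merged_on >= CUTOFF
        and pr.files_changed >= 2                # multi-file patches
        and pr.difficulty in {"medium", "hard"}
    )
```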

Agentic dockerization

Template-based scaffolding combined with LLM automation creates safe, high-quality Dockerfiles for fast, reproducible environments.
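
For illustration, the template side of that scaffolding could look like the sketch below; the template fields and the task dictionary keys (base_image, install_cmd, and so on) are assumptions, not the actual template set.

```python
from string import Template

# Illustrative template only; real templates, fields, and validation steps are not shown here.
DOCKERFILE_TEMPLATE = Template("""\
FROM $base_image
WORKDIR /workspace
RUN git clone $repo_url . && git checkout $base_commit
RUN $install_cmd
CMD $test_cmd
""")

def render_dockerfile(task: dict) -> str:
    """Fill a language-specific Dockerfile template with task metadata.

    In this sketch, an upstream step (templated defaults plus LLM suggestions)
    is assumed to supply the values below.
    """
    return DOCKERFILE_TEMPLATE.substitute(
        base_image=task["base_image"],    # e.g. "python:3.11-slim"
        repo_url=task["repo_url"],
        base_commit=task["base_commit"],
        install_cmd=task["install_cmd"],  # e.g. "pip install -e ."
        test_cmd=task["test_cmd"],        # e.g. '["pytest", "-q"]'
    )
```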

LLM-powered quality control

An LLM serves as the final QA step, automatically assessing issue clarity and test-to-issue alignment.
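
As a rough sketch of that gate, the check below assumes a generic `llm` callable that maps a prompt string to a text reply; the prompt wording, score scale, and threshold are illustrative.

```python
import json

# Illustrative prompt; the real QA rubric and scoring scale are not published here.
QA_PROMPT = """You are reviewing a candidate benchmark task.

Issue:
{issue}

Fail-to-pass tests:
{tests}

Rate issue clarity and test-to-issue alignment from 1 to 5 and reply with JSON:
{{"clarity": <int>, "alignment": <int>, "reason": "<short explanation>"}}"""

def llm_quality_gate(llm, issue_text: str, test_snippets: str, min_score: int = 4) -> bool:
    """Accept a task only if the reviewer model scores both axes highly.

    `llm` is assumed to be any callable that takes a prompt string and returns
    the model's text reply; it is not tied to a specific vendor API.
    """
    reply = llm(QA_PROMPT.format(issue=issue_text, tests=test_snippets))
    verdict = json.loads(reply)
    return verdict["clarity"] >= min_score and verdict["alignment"] >= min_score
```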

Diagnostic feedback

A hybrid log-parser converts unstructured test outputs into structured, actionable data for granular failure analysis and automated debugging.
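
For example, the fragment below turns raw pytest-style result lines into structured records; it is a simplified, single-framework sketch rather than the full hybrid parser.

```python
import re
from typing import Iterator

# Matches lines such as "FAILED tests/test_auth.py::test_login - AssertionError: ..."
PYTEST_RESULT = re.compile(
    r"^(?P<status>PASSED|FAILED|ERROR)\s+(?P<file>\S+?)::(?P<test>\S+)"
    r"(?:\s+-\s+(?P<detail>.*))?$"
)

def parse_test_log(log: str) -> Iterator[dict]:
    """Yield one structured record per recognized test-result line."""
    for line in log.splitlines():
        match = PYTEST_RESULT.match(line.strip())
        if match:
            yield {
                "status": match["status"],
                "file": match["file"],
                "test": match["test"],
                "detail": match["detail"] or "",
            }
```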

Trajectory-powered fine-tuning

ReAct traces bundled with each task accelerate hill-climb SFT and shorten experimentation cycles.
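
A minimal sketch of how bundled traces might be turned into SFT pairs; the JSONL layout and the "observation", "thought", and "action" field names are assumptions, not the published trace schema.

```python
import json

def trace_to_sft_examples(trace_path: str) -> list[dict]:
    """Convert a ReAct trajectory into prompt/completion pairs for SFT.

    Assumes each JSONL line holds one step with hypothetical fields
    "observation", "thought", and "action"; the real schema may differ.
    """
    examples = []
    with open(trace_path) as f:
        for line in f:
            step = json.loads(line)
            examples.append({
                "prompt": step["observation"],
                "completion": f"Thought: {step['thought']}\nAction: {step['action']}",
            })
    return examples
```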

See benchmark results

View live data, performance charts, and analysis from current SWE-Bench++ runs.
Compare models and agents across languages, issue types, and reasoning depths to track progress on verified GitHub tasks.

What powers SWE-Bench++

Multilingual repository base

11 languages, including Python, Java, C++, Go, JavaScript, and Ruby, for broad code reasoning coverage.

Diverse engineering domains

Web, mobile, data science, infrastructure, IoT, and security repositories drawn from active, high-quality open-source projects.

Real-world task profiles

Bug fixes, refactors, performance improvements, and dependency updates sourced from repositories exceeding 500 stars and 50K lines of code.

Validated and versioned trajectories

Each issue–fix pair includes reasoning traces, execution logs, and structured diagnostics, all version-controlled for transparent comparison and reproducible fine-tuning.
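
As an illustration of what one such record could contain, the type below names the pieces listed above; the actual field names and layout of the released data may differ.

```python
from typing import TypedDict

class TrajectoryRecord(TypedDict):
    """Illustrative layout of a versioned issue-fix record; the released schema may differ."""
    task_id: str
    dataset_version: str         # e.g. a release tag for transparent comparison
    issue: str                   # original issue text
    gold_patch: str              # verified fix as a unified diff
    reasoning_trace: list[dict]  # ReAct-style thought/action/observation steps
    execution_log: str           # raw container test output
    diagnostics: list[dict]      # structured failures from the log parser
```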

Reproducible infrastructure

Containerized environments and standardized Docker templates guarantee consistent runtime conditions and environment parity across model families.

Strengthen models and agents with SWE-Bench++

Generate verified issues and reasoning trajectories that improve code reasoning on SWE-Bench++ and other coding benchmarks.

Start Hillclimb