SWE-Bench++ measures how coding agents reason through authentic GitHub problems, pairing every task with a containerized environment and a verified trajectory. It extends the original SWE-Bench with multilingual repositories, reproducible pull requests, and fairness-driven validation, so labs can trust what their reasoning scores mean, not just how high they are.







Public SWE datasets provide familiarity but not fidelity: they age fast, repeat test logic, and reward memorization over reasoning. SWE-Bench++ closes those gaps with multilingual coverage, transparent versioning, and validated human-AI QA. Each evaluation module begins with a scoped set of real-world tasks designed to verify reasoning depth before expanding to larger and more complex repositories, creating a continuous, auditable signal for model improvement and post-training research.

View live data, performance charts, and analysis from current SWE-Bench++ runs.
Compare models and agents across languages, issue types, and reasoning depths to track progress on verified GitHub tasks.
Languages spanning Python, Java, C++, Go, JavaScript, and Ruby for broad code reasoning coverage.
Web, mobile, data science, infrastructure, IoT, and security repositories drawn from active, high-quality open-source projects.
Bug fixes, refactors, performance optimizations, and dependency updates sourced from repositories exceeding 500 stars and 50K lines of code.
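
As a rough illustration, the selection criteria above could be applied with a small sourcing filter. The sketch below assumes the public GitHub search API and an external line counter; only the 500-star and 50K-line thresholds come from the criteria, and the function names are illustrative, not part of the benchmark tooling.

    # Illustrative repository-sourcing sketch, assuming the public GitHub
    # search API; thresholds mirror the selection criteria above.
    import requests

    MIN_STARS = 500
    MIN_LOC = 50_000  # lines of code, counted after cloning

    def candidate_repos(language: str, per_page: int = 50) -> list[str]:
        """Return full names of repositories above the star threshold."""
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": f"stars:>{MIN_STARS} language:{language}",
                    "per_page": per_page},
            timeout=30,
        )
        resp.raise_for_status()
        return [item["full_name"] for item in resp.json()["items"]]

    def meets_size_threshold(loc: int) -> bool:
        # The search API does not expose lines of code, so LOC is assumed
        # to be measured separately (e.g. with a line-counting tool) after
        # cloning, then checked against the threshold here.
        return loc >= MIN_LOC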
Each issue–fix pair includes reasoning traces, execution logs, and structured diagnostics, all version-controlled for transparent comparison and reproducible fine-tuning.
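
As an illustration of what one such record might contain, here is a hypothetical Python shape; the field names are assumptions for readability, not the published schema.

    # Hypothetical shape of a single SWE-Bench++ record; field names are
    # illustrative, not the released data format.
    from dataclasses import dataclass

    @dataclass
    class TaskRecord:
        repo: str                       # e.g. "owner/project"
        issue_id: str                   # upstream GitHub issue reference
        fix_commit: str                 # commit or PR that resolves the issue
        language: str                   # Python, Java, C++, Go, JavaScript, Ruby
        reasoning_trace: list[str]      # step-by-step agent trajectory
        execution_log: str              # captured test / runtime output
        diagnostics: dict[str, str]     # structured pass / failure signals
        dataset_version: str = "0.0.0"  # version tag for reproducible comparison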
Containerized environments and standardized Docker templates guarantee consistent runtime conditions and environment parity across model families.
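
A minimal sketch of how that parity can be enforced, assuming the standard docker CLI; the image name and test command shown are placeholders rather than the benchmark's actual templates.

    # Minimal environment-parity sketch: run a task's tests inside its
    # version-pinned container image via the docker CLI.
    import subprocess

    def run_in_container(image: str, workdir: str, test_cmd: list[str]) -> int:
        """Execute a task's test command inside its pinned Docker image so
        every model family sees the same runtime conditions."""
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{workdir}:/task",   # mount the checked-out repository
            "-w", "/task",              # run from the repository root
            image,                      # version-pinned environment image
            *test_cmd,
        ]
        return subprocess.run(cmd, check=False).returncode

    # Placeholder usage: replay a task's test suite in its pinned image.
    # run_in_container("example/task-env:py3.11", "/tmp/repo", ["pytest", "-q"])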
Generate verified issues and reasoning trajectories that improve code reasoning on SWE-Bench++ and other coding benchmarks.