This week’s edition introduces SWE-bench++, a major upgrade to how we evaluate autonomous coding agents. Built on over 7,000 real-world tasks spanning 9 languages and 11 repository types, the benchmark includes reproducible dockerized environments, structured trajectory capture for fine-tuning, and multilingual leaderboards. We also dive into a new study questioning whether reinforcement learning actually expands model reasoning, a deep benchmark suite for spatiotemporal vision-language understanding, and a bold proposal to co-evolve human and AI cognition, not just models.
This week, we’re introducing SWE-bench++, a next-gen benchmark and training suite for evaluating software engineering agents at scale. Spanning more than 7,000 real-world tasks across 9 programming languages and 11 repository types, SWE-bench++ pushes beyond prior benchmarks with reproducible environments, trajectory logging, and multilingual support.
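To make "trajectory logging" concrete, here is a minimal sketch of what a structured agent-trajectory record could look like for a benchmark of this kind. The field names (`task_id`, `steps`, `resolved`, and so on) are illustrative assumptions for this sketch, not SWE-bench++’s actual schema.

```python
import json

# Hypothetical trajectory record: an agent's interaction with a repo task,
# captured as alternating agent actions and environment observations.
# All field names here are assumptions for illustration only.
trajectory = {
    "task_id": "example-repo-1234",   # hypothetical task identifier
    "language": "python",
    "steps": [
        {"role": "agent", "action": "open_file", "args": {"path": "src/app.py"}},
        {"role": "env", "observation": "file contents ..."},
        {"role": "agent", "action": "apply_patch", "args": {"diff": "..."}},
        {"role": "env", "observation": "tests passed: 12/12"},
    ],
    "resolved": True,
}

# Serialize one record per line (JSONL), a common format for
# feeding captured trajectories into fine-tuning pipelines.
record = json.dumps(trajectory)
parsed = json.loads(record)
```

Capturing trajectories in a flat, serializable form like this is what lets the same benchmark run double as supervised fine-tuning data.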
Here’s what we’re seeing:
💡 Why it matters: In the era of autonomous coding agents, evaluation must move from synthetic bugs to complex, real-world repos. SWE-bench++ sets a new bar for what it means to “solve software engineering.”
🎉 Fast Company named Turing Co-Founder & CEO Jonathan Siddharth one of the 20 Innovators Shaping the Future of AI in 2025.
Jonathan has long believed that advancing AI requires pairing powerful models with the right data, systems, and talent. His work building Turing into the world’s leading research accelerator reflects that vision in action, helping frontier labs and enterprises move from general intelligence to real, measurable outcomes.
Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.
Partner with Turing to fine-tune, validate, and deploy models that learn continuously.