AGI Advance: Weekly AI & AGI Insights (Dec 9, 2025)

Turing Staff
10 Dec 2025 · 3 min read

This week’s edition introduces SWE-bench++, a major upgrade to how we evaluate autonomous coding agents. Built on over 7,000 real-world tasks spanning 9 languages and 11 repository types, the benchmark includes reproducible dockerized environments, structured trajectory capture for fine-tuning, and multilingual leaderboards. We also dive into a new study questioning whether reinforcement learning actually expands model reasoning, an open benchmark and data suite for spatiotemporal vision-language understanding, and a bold proposal to co-evolve human and AI cognition, not just models.

What we're doing

This week, we’re introducing SWE-bench++, a next-gen benchmark and training suite for evaluating software engineering agents at scale. Built on over 7,000 real-world tasks across 9 programming languages and 11 repository types, SWE-bench++ pushes beyond prior benchmarks with reproducible environments, trajectory logging, and multilingual support.

Here’s what we’re seeing:

  • Dockerized, reproducible environments: Template-guided scaffolding ensures fair, scalable evaluation across codebases.
  • Agentic trajectory capture: Every successful run logs structured reasoning steps, enabling SFT and DPO fine-tuning (see the sketch after this list).
  • Pass@1 leaderboards: Claude Sonnet 4.5 and GPT-5 (Aug 2025) top the charts with ~20% resolution rates.
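
For a concrete sense of how captured trajectories can feed into fine-tuning, here is a minimal sketch of one logged run. The schema, field names, and file path are illustrative assumptions, not the actual SWE-bench++ format:

```python
import json

# Hypothetical trajectory record; every field name here is illustrative,
# not the actual SWE-bench++ schema.
trajectory = {
    "task_id": "owner__repo-1234",  # assumed task identifier format
    "language": "python",
    "steps": [
        {"role": "agent", "action": "read_file",
         "args": {"path": "src/app.py"}},
        {"role": "env", "observation": "def handler(request): ..."},
        {"role": "agent", "action": "apply_patch",
         "args": {"diff": "--- a/src/app.py\n+++ b/src/app.py"}},
        {"role": "env", "observation": "tests: 12 passed, 0 failed"},
    ],
    "resolved": True,  # successful runs become SFT examples or DPO positives
}

# One JSON record per line keeps logs easy to stream into training pipelines.
with open("trajectories.jsonl", "a") as f:
    f.write(json.dumps(trajectory) + "\n")
```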

💡 Why it matters: In the era of autonomous coding agents, evaluation must move from synthetic bugs to complex, real-world repos. SWE-bench++ sets a new bar for what it means to “solve software engineering.”

Explore SWE-bench++

What we're celebrating

🎉 Fast Company named Turing Co-Founder & CEO Jonathan Siddharth one of the 20 Innovators Shaping the Future of AI in 2025.

Jonathan has long believed that advancing AI requires pairing powerful models with the right data, systems, and talent. His work building Turing into the world’s leading research accelerator reflects that vision in action, helping frontier labs and enterprises move from general intelligence to real, measurable outcomes.

See the full Fast Company list

What we're reading

  • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
    This study challenges the common assumption that reinforcement learning with verifiable rewards (RLVR) expands an LLM’s reasoning capacity. The authors show that while RL improves pass@1 rates by making sampling more efficient, it doesn’t expand the model’s reasoning boundary; the base models already contain the successful strategies RLVR exploits. In fact, RL training often narrows the model’s exploration space. In contrast, distillation from stronger models (like DeepSeek-R1) does introduce new reasoning paths. The paper argues that more agentic, multi-turn interaction strategies are needed to truly push model reasoning forward (see the pass@k sketch after this list).
  • PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
    This paper introduces PerceptionLM (PLM), a fully open, distillation-free vision-language framework designed to advance detailed image and video understanding. PLM is trained entirely on open data and expands capabilities through two major contributions: PLM-FGQA, a 2.4M-sample fine-grained video question-answer dataset capturing subtle aspects of human actions, and PLM-STC, a 476K-sample spatio-temporal captioning dataset with precise region tracking and timestamped descriptions. To evaluate these skills, the authors introduce PLM-VideoBench, a benchmark that tests “what,” “where,” “when,” and “how” reasoning in videos. Trained on a mix of synthetic and new human-curated datasets, PLM achieves performance competitive with or superior to open-source models, and even proprietary systems like GPT-4o, across 40+ benchmarks.
  • AI & Human Co-Improvement for Safer Co-Superintelligence
    This paper introduces Co-Improvement, a framework that reimagines AI progress not as autonomous self-improvement, but as collaborative advancement between humans and AI. Rather than building AI systems that independently conduct research and optimize themselves, the authors advocate for systems that are designed to co-research with humans, enhancing both human cognition and AI capabilities. The goal is co-superintelligence: a safe, bidirectional augmentation of intelligence through joint problem-solving, method design, experimentation, and evaluation. The paper argues that this approach accelerates paradigm shifts, mitigates alignment risks, and keeps human values central in the development of future AI systems.
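
To make the pass@1 vs. reasoning-boundary distinction in the first paper above concrete, here is the standard unbiased pass@k estimator from Chen et al. (2021), which studies like this typically report. The sample counts below are made-up numbers for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    drawn from n samples is correct, given c of the n samples passed."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up numbers: a model solving 10 of 200 samples looks weak at k=1 but
# strong at k=100, which is how RLVR can raise pass@1 while the base model
# catches up, or even wins, at large k.
print(pass_at_k(n=200, c=10, k=1))    # 0.05
print(pass_at_k(n=200, c=10, k=100))  # ~0.999
```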

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

Subscribe & Read More

Ready to Optimize Your Model for Real-World Needs?

Partner with Turing to fine-tune, validate, and deploy models that learn continuously.

Optimize Continuously