AGI Advance: Weekly AI & AGI Insights (Nov 18, 2025)

Turing Staff
18 Nov 2025
4 mins read
LLM training and enhancement
AGI_Advance_Newsletter

This week’s edition highlights how clean, synthetic trajectory data can power meaningful gains in model performance without scaling up model size. We explore how Turing’s multilingual coding traces helped Qwen 2.5 Coder models climb from near-zero to as many as 25 solved tasks on SWE-Bench, offering a clear case for fine-tuning with process-level supervision. Also in this issue: Anthropic uncovers the first autonomous AI-powered cyber attack, Google Research introduces a new learning paradigm to battle catastrophic forgetting, and Cognizant sets a new milestone for long-horizon LLM task execution.

What we're thinking

This week, we explored how Turing's multilingual trajectory data measurably improves performance on software engineering benchmarks. In a series of fine-tuning experiments, our team validated that synthetic, pass-only agent traces generated from Turing's internal dataset help open-weight models climb the SWE-Bench leaderboard.
Here’s what we’re seeing:

  • SFT-ready traces generated on top of Turing’s internal dataset: We used Claude Sonnet 4 with a custom wrapper to generate agentic trajectories on challenging multilingual GitHub issues. A novel generation-assistance process let us produce passing trajectories even for "model-breaking" tasks.
  • Consistent lift across scales: The Qwen 2.5 Coder 7B model went from 0/300 to 20/300 task completions with only 800 Turing samples, while the 32B variant climbed from 4/300 to 25/300. We also observed a consistent, roughly linear increase as we kept doubling the Turing data in the training mix.
  • Synthetic doesn’t mean sloppy: Even with 40% of the training data unreviewed and purely synthetic, models fine-tuned on our trajectories still outperformed larger baselines such as Qwen 72B Instruct.

Training better coding agents isn’t just about bigger models; it’s about cleaner, more human-like paths to solutions. With the right data mix and balanced scaffolding, even synthetic traces can push models toward SOTA performance.
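To make the "pass-only" idea concrete, here is a minimal sketch of how a trajectory dataset might be filtered down to SFT-ready examples. The record fields (`tests_passed`, `issue_description`, `steps`) and the flattening scheme are illustrative assumptions, not Turing's actual pipeline.

```python
# Minimal sketch: build an SFT dataset from agent trajectories,
# keeping only traces whose final patch passed the task's tests
# ("pass-only" filtering). All field names are hypothetical.

def passes_tests(trace):
    # In a real pipeline this would re-run the repo's test suite
    # against the patch produced at the end of the trajectory.
    return trace.get("tests_passed", False)

def to_sft_example(trace):
    # Flatten the multi-turn trajectory into a prompt/completion
    # pair, preserving step-level (process) supervision.
    prompt = trace["issue_description"]
    completion = "\n".join(
        f"[{step['action']}] {step['output']}" for step in trace["steps"]
    )
    return {"prompt": prompt, "completion": completion}

def build_sft_dataset(traces):
    return [to_sft_example(t) for t in traces if passes_tests(t)]

traces = [
    {"issue_description": "Fix off-by-one in pagination",
     "steps": [{"action": "edit", "output": "patched paginator.py"}],
     "tests_passed": True},
    {"issue_description": "Speed up CSV parser",
     "steps": [{"action": "edit", "output": "broken attempt"}],
     "tests_passed": False},
]
dataset = build_sft_dataset(traces)
print(len(dataset))  # only the passing trajectory is kept
```

The key design choice is that supervision quality comes from the outcome filter, not from manual review of every trace, which is what lets largely synthetic data still train strong models.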

What we're celebrating

🎉 ServiceNow AI Research × Turing × Mila: GroundCUA and GroundNext

This week, ServiceNow AI Research released GroundCUA, a high-quality dataset of over 3 million annotations spanning 80+ real-world desktop applications, designed to teach agents how humans actually use software like IDEs, spreadsheets, design tools, and browsers. Turing contributed frontier-grade data and feedback to help scale human-grounded demonstrations.

Read the paper

What we're reading

  • Disrupting the First Reported AI-Orchestrated Cyber Espionage Campaign
    Anthropic’s Threat Intelligence team uncovered and disrupted the first known large-scale cyber espionage campaign largely executed by an autonomous AI system. A Chinese state-sponsored group, GTG-1002, used Claude Code to independently perform 80–90% of attack operations, spanning reconnaissance, vulnerability discovery, credential harvesting, and data exfiltration across 30+ targets, including tech firms and government agencies. The actor manipulated the AI into performing each attack phase under the guise of penetration testing. The operation revealed both the power and danger of autonomous AI in offensive security, underscoring the urgent need for robust safeguards and AI-assisted defense systems.
  • Introducing Nested Learning: A New ML Paradigm for Continual Learning
    Google Research introduces Nested Learning, a novel machine learning approach that reimagines models as systems of interwoven optimization problems, each with distinct update frequencies. This unified view of architecture and training helps mitigate catastrophic forgetting, a core challenge in continual learning. Their prototype model, Hope, shows strong gains in language modeling, reasoning, and long-context tasks, outperforming Titans and Transformers. Key innovations include continuum memory systems, deep optimizers, and self-modifying architectures, bringing ML design closer to how the human brain processes and retains knowledge.
  • Solving a Million-Step LLM Task With Zero Errors
    Cognizant AI Lab introduces MAKER, a system that successfully executes over one million LLM steps without error, a milestone in long-horizon task reliability. Instead of building bigger models, MAKER uses massively decomposed agentic processes (MDAPs): it breaks down tasks into minimal subtasks, uses voting-based error correction, and discards uncertain outputs via “red flagging”. Tested on the Towers of Hanoi benchmark, the approach showed logarithmic cost scaling, even for error-sensitive tasks. It also hints at a new direction in AI scaling, less about raw model size, more about process architecture and modularity.
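The voting-plus-red-flagging idea behind MAKER can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `sample_step` stands in for an LLM proposing one micro-step, `None` stands in for a red-flagged (malformed or abstaining) output, and the survivors are resolved by majority vote.

```python
# Sketch of MDAP-style step execution: sample each micro-step k times,
# discard red-flagged outputs, and take the majority of the survivors.
from collections import Counter
from itertools import cycle

def solve_step(sample_step, k=5):
    votes = []
    for _ in range(k):
        out = sample_step()
        if out is None:  # red flag: discard uncertain/malformed output
            continue
        votes.append(out)
    if not votes:
        raise RuntimeError("all samples red-flagged; re-decompose the step")
    winner, _count = Counter(votes).most_common(1)[0]
    return winner

# Toy stand-in for a model proposing the next Towers of Hanoi move:
# it usually returns the correct move, occasionally errs or abstains.
proposals = cycle(["A->C", "A->C", None, "A->B", "A->C"])
move = solve_step(lambda: next(proposals))
print(move)  # "A->C" wins the vote 3 to 1
```

Because each subtask is tiny and independently checked, per-step error rates can be driven low enough that even million-step chains complete without error, which is the scaling argument MAKER makes.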

Where we’ll be

Turing will be at this major AI conference in the coming month. Join us to discuss the future of AGI:

  • NeurIPS 2025
    [Mexico City | Nov 30 – Dec 5]
    [San Diego Convention Center | Dec 2 – 7]

    The Neural Information Processing Systems Foundation is a non-profit that promotes research in AI and ML by organizing a leading annual conference focused on ethical, diverse, and interdisciplinary collaboration.

If you’re attending, reach out; we’d love to connect and exchange insights!

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

[Subscribe & Read More]

Ready to Optimize Your Model for Real-World Needs?

Partner with Turing to fine-tune, validate, and deploy models that learn continuously.

Optimize Continuously