AGI Advance: Weekly AI & AGI Insights (Dec 30, 2025)

Turing Staff
30 Dec 2025 · 3 mins read

This week’s edition spotlights the next benchmark frontier for coding agents. We introduce Code Review Bench, a 6,296-task dataset built from real GitHub PRs to evaluate how well LLMs judge correctness, catch bugs, and critique tradeoffs. We also look at the emergence of 1,000-layer RL networks with qualitatively new behaviors, a new gating mechanism that eliminates attention sinks in LLMs, and a surprising reason diffusion models generalize so well: their training dynamics delay memorization until long after high-quality generation emerges.

What we're doing

This week, we’re highlighting Code Review Bench, a 6,296-task benchmark purpose-built to evaluate LLMs on code review, not just code generation. While today’s agents excel at unit-test-verified fixes, that’s not how real engineering works. Code review captures deeper signals: bug severity, design critique, contextual judgment, and productivity tradeoffs.

Here’s what we’re doing:

  • Built from real-world PRs: Each task is drawn from actual GitHub workflows and labeled as either APPROVE or REQUEST_CHANGES, with added hints to disambiguate reviewer intent.
  • Public + commercial split: We’ve open-sourced a 1,200-task subset on Hugging Face, with the full 6,296-task set available for licensing.
  • Frontier model evals: Claude Sonnet 4.5 tops the overall success rate (50.8%), while GPT-5 Codex leads in bug catching (89.15%), revealing distinct strengths across agent classes.

💡 Code Review Bench helps evaluate how well models reason through ambiguity, critique tradeoffs, and elevate quality.
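
To give a sense of how an evaluation over these verdict labels can be consumed, here is a minimal sketch of scoring a model's per-task review decisions against the public subset. The Hugging Face dataset path and field names below are illustrative placeholders, not the published schema; consult the dataset card for the real one.

```python
# Minimal sketch: compare a model's review verdicts with Code Review Bench labels.
# The dataset path ("org/code-review-bench-public") and the "id"/"verdict" field
# names are hypothetical placeholders for illustration only.
from datasets import load_dataset

VALID_VERDICTS = {"APPROVE", "REQUEST_CHANGES"}

def score_verdicts(predictions: dict[str, str],
                   dataset_path: str = "org/code-review-bench-public") -> float:
    """Return accuracy of predicted verdicts against the gold labels."""
    tasks = load_dataset(dataset_path, split="test")  # placeholder path and split
    correct = total = 0
    for task in tasks:
        gold = task["verdict"]              # hypothetical field: APPROVE or REQUEST_CHANGES
        pred = predictions.get(task["id"])  # hypothetical field: unique task id
        if pred in VALID_VERDICTS:
            correct += int(pred == gold)
        total += 1
    return correct / max(total, 1)
```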

Explore the benchmark

What we're saying

🗣️ Jonathan Siddharth, Founder & CEO:

In a conversation with Harry Stebbings, Jonathan explains why the next wave of AI progress depends on tasks that demand real expertise, real reasoning, and real-world judgment. These signals do not exist on the public internet. They cannot be scraped. They must be created by people who understand the work at a deep level.

Watch the full episode

What we're reading

  • 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
    This paper tackles a long-standing bottleneck in reinforcement learning: unlike vision and language, RL has not seen large gains from scaling model size and still relies largely on shallow 2–5 layer networks. The authors show that in self-supervised, goal-conditioned RL, scaling depth up to 1024 layers dramatically changes this picture, using a residual, layer-normalized architecture with contrastive RL. Across locomotion, navigation, and manipulation tasks, deeper networks deliver 2×–50× performance gains, with some humanoid maze tasks improving by over 50×, and exhibit sharp “phase transitions” where qualitatively new behaviors emerge. Notably, agents evolve from collapsing or brute-force motion to upright walking and even acrobatic wall-vaulting as depth increases. (A minimal architectural sketch follows this list.)
  • Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
    This paper studies why gating helps attention, showing that a simple sigmoid gate applied after Scaled Dot-Product Attention (SDPA) consistently improves LLM performance, stability, and scaling. Across 1.7B dense models and 15B MoE models trained on up to 3.5T tokens, SDPA output gating reduces perplexity by ~0.2 and improves benchmarks like MMLU by up to 2 points, outperforming parameter-matched baselines. The authors trace these gains to two mechanisms: added non-linearity that increases the expressiveness of attention’s low-rank projections, and query-dependent sparsity that suppresses irrelevant context. Crucially, this sparse gating eliminates attention sinks, reducing first-token attention from ~47% to under 5%, stabilizes training at higher learning rates, and enables strong long-context extrapolation, delivering over 10-point gains on RULER at 64k–128k context lengths. (A minimal sketch of the gating mechanism follows this list.)
  • Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training
    This paper explains why diffusion models often generalize instead of memorizing, even when overparameterized, by identifying an implicit dynamical regularization in training. The authors show two distinct timescales: a fast generalization time (τgen) when sample quality becomes high, and a much later memorization time (τmem) that grows linearly with dataset size. This creates a widening training window where models learn the population score but not dataset-specific details, so early stopping yields high-quality, non-memorized samples. Experiments with U-Nets and a theoretical random-features analysis both confirm this separation of timescales as the key mechanism preventing memorization. (A sketch of the resulting early-stopping window follows this list.)
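
To make the first paper's recipe concrete, below is a minimal PyTorch sketch of a very deep residual, layer-normalized MLP of the kind the authors scale toward ~1024 layers. The widths, depth, and the contrastive pairing described in the docstring are illustrative assumptions, not the exact published architecture.

```python
# Sketch of a very deep residual, layer-normalized MLP encoder for
# contrastive, goal-conditioned RL. Widths and depth are illustrative.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm residual connection keeps very deep stacks trainable.
        return x + self.ff(self.norm(x))

class DeepEncoder(nn.Module):
    """Maps a state (or goal) to an embedding. In contrastive RL, one encoder embeds
    states and another embeds goals, and the loss pulls together embeddings of
    state-goal pairs drawn from the same trajectory."""
    def __init__(self, in_dim: int, dim: int = 256, depth: int = 1024, out_dim: int = 64):
        super().__init__()
        self.proj_in = nn.Linear(in_dim, dim)
        self.blocks = nn.Sequential(*[ResidualBlock(dim) for _ in range(depth)])
        self.proj_out = nn.Linear(dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj_out(self.blocks(self.proj_in(x)))
```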
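
For the second paper, here is a minimal sketch of the core idea: an elementwise, query-dependent sigmoid gate applied to the SDPA output. The head count, gate parameterization, and exact placement are assumptions for illustration rather than the authors' precise design.

```python
# Sketch of sigmoid gating applied to the output of scaled dot-product attention (SDPA).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)   # query-dependent gate, computed from the layer input
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim) for SDPA.
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn_out = attn_out.transpose(1, 2).reshape(b, t, d)
        # Sigmoid gate after SDPA: adds non-linearity and lets each token
        # suppress attention output it does not need (query-dependent sparsity).
        gated = torch.sigmoid(self.gate(x)) * attn_out
        return self.proj(gated)
```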
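
And for the third paper, the practical takeaway is an early-stopping window: train past the generalization time τgen but stop before the memorization time τmem, which grows with dataset size. The sketch below assumes hypothetical monitors, e.g., an FID-style quality proxy and a nearest-neighbor check against the training set; it is not the authors' procedure.

```python
# Illustrative early-stopping rule for the tau_gen < step < tau_mem window.
# `sample_quality` and `memorization_score` come from assumed monitors: a quality
# proxy (higher is better) and the fraction of generated samples that are
# near-duplicates of training data (nearest-neighbor distance below a cutoff).
def should_stop(sample_quality: float,
                memorization_score: float,
                quality_threshold: float = 0.9,
                memorization_threshold: float = 0.05) -> bool:
    """Stop once samples are high quality (past tau_gen) and before memorization rises."""
    if memorization_score >= memorization_threshold:
        return True  # memorization is starting; prefer an earlier checkpoint
    return sample_quality >= quality_threshold  # quality has saturated; safe to stop
```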

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

[Subscribe & Read More]

Ready to Optimize Your Model for Real-World Needs?

Partner with Turing to fine-tune, validate, and deploy models that learn continuously.

Optimize Continuously