This week’s edition focuses on stress-testing symbolic reasoning at the highest level. Turing delivered 1,000+ HLE-grade math prompts to break top-performing LLMs like Nova, R1, and Qwen across domains like Algebra, Topology, and Discrete Math. We also celebrate Taylor Bradley’s recognition in The Modern People Leader podcast, and dive into new research explaining why scaling works, how unlabeled data unlocks learning, and why “Aha!” moments in reasoning models may be mostly an illusion.
What we're doing
This week, we’re highlighting how Turing delivered 1,000+ HLE-grade math prompts designed to break state-of-the-art LLMs on symbolic reasoning, multi-step logic, and graduate-level problem formulation. Aligned with the rigor of the original Humanity’s Last Exam (HLE) dataset, each task was validated for correctness, clarity, and model-breakage impact.
Here’s what we delivered:
- 1,000+ expert-authored math prompts: Spanning 10+ subdomains, including Algebra, Discrete Math, Topology, and Analysis.
- 100% dual-layer QA: Every task reviewed for novelty, answer validity, LaTeX formatting, and subdomain coverage.
- Breakage-verified tasks: All prompts broke two internal models; over 50% also broke an external SOTA model (Nova, R1, Sonnet, or Qwen).
This dataset provides a reusable testbed for diagnosing symbolic reasoning gaps, verifying evaluator quality, and building reward models that understand real math.
What we're celebrating
🎉 Taylor Bradley ranks #2 on The Modern People Leader’s Top 10 episodes of 2025
In a year full of noise about AI in HR, Taylor cut straight to execution, breaking down how his team onboarded 800 people in five days using prompt libraries, systemized templates, and structured workflows.
What we're reading
- Optimal Mistake Bounds for Transductive Online Learning
This paper resolves a 30-year-old open problem by precisely quantifying how much unlabeled data helps in online learning. While standard online learning requires d mistakes, where d is the Littlestone dimension, the authors prove that transductive online learning needs only Θ(√d) mistakes, establishing a quadratic separation between the two settings. They show this bound is tight, with a matching lower bound for all hypothesis classes and a constructive upper bound via carefully designed classes and algorithms. The result demonstrates that advance access to the unlabeled instance sequence can quadratically reduce mistakes in online learning, in sharp contrast to the PAC setting where unlabeled data provides no such benefit.
- Superposition Yields Robust Neural Scaling
This paper explains why loss in large language models follows a robust power law with model size by identifying representation superposition as the key mechanism. Using a controlled toy model, the authors show that when models operate in strong superposition – representing far more features than their hidden dimension via overlapping vectors – loss generically scales as 1 / model width, independent of the underlying feature-frequency distribution. In contrast, under weak superposition, power-law loss scaling only emerges if feature frequencies themselves follow a power law, with exponents tied to the data distribution. Empirically, the authors demonstrate that real LLMs lie firmly in the strong superposition regime, with token representations exhibiting 1/m overlap geometry and observed loss exponents ≈ 1, consistent with Chinchilla-style scaling.
- The Illusion of Insight in Reasoning Models
This paper investigates whether so-called “Aha!” moments in reasoning models reflect genuine intrinsic self-correction or are merely artifacts of unstable inference. Analyzing 1M+ reasoning traces across training checkpoints, domains (math, cryptic crosswords, spatial puzzles), temperatures, and model families, the authors find that mid-trace reasoning shifts are rare (≈6%) and usually reduce accuracy rather than improve it. Formal “Aha!” events that both introduce a reasoning pivot and measurably increase correctness are vanishingly uncommon, even under lenient definitions. Crucially, while spontaneous shifts do not help even under high uncertainty, externally triggered reconsideration conditioned on high entropy reliably boosts performance, delivering up to +8.4 percentage points on MATH-500.
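To get a feel for the quadratic separation in the first paper above, here is a toy numeric sketch (constants and lower-order terms omitted, so these are asymptotic orders rather than exact counts): standard online learning incurs on the order of d mistakes for a class of Littlestone dimension d, while the transductive setting needs only on the order of √d.

```python
import math

# Illustrative only: Theta(d) vs Theta(sqrt(d)) mistake bounds,
# with constants dropped. d is the Littlestone dimension.
for d in (16, 100, 10_000):
    standard = d                    # standard online learning: ~d mistakes
    transductive = math.isqrt(d)    # transductive: ~sqrt(d) mistakes
    print(f"d={d}: standard ~{standard}, transductive ~{transductive}")
```

The gap widens quickly: at d = 10,000 the transductive learner's bound is two orders of magnitude smaller, which is what makes advance access to the unlabeled instance sequence so valuable in this setting.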
Stay ahead with AGI Advance
Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.
Ready to Optimize Your Model for Real-World Needs?
Partner with Turing to fine-tune, validate, and deploy models that learn continuously.


