AGI Advance: Weekly AI & AGI Insights (June 23, 2026)

Turing Staff
25 Jun 2026 | 4 mins read

This week, we highlight how Turing built a large-scale artifact generation benchmark spanning 1,500+ validated outputs across PPTX, DOCX, PDF, HTML, and infographic formats, generated by leading AI providers under realistic enterprise conditions. We also introduce the Advanced PhD Reasoning Rubrics Data Pack and cover new research on long-horizon agent benchmarks, highly capable small reasoning models, and efficient test-time computation scaling.

What we're doing

This week, we're highlighting how Turing built a large-scale artifact generation benchmark to evaluate whether leading AI models can reliably produce enterprise document formats from realistic prompts, across providers, formats, and complexity levels. 

Here's what we delivered:

  • 1,500+ validated artifacts across PPTX, HTML, DOCX, PDF, and infographic formats, generated by AI providers including Claude, Gemini, OpenAI, Perplexity, Manus, and NanoBanana. Each query-provider pair ran in one controlled session, and every run was logged with provider, model version, format type, complexity level, and QA status
  • Four complexity levels covering the full enterprise instruction-following range, from basic topic-based generation to high-specificity prompts combining detailed content requirements, multiple formatting constraints, and citation obligations
  • 99.9% artifact acceptance rate, enforced through format-specific QA criteria for every artifact type and a ten-category failure taxonomy that enabled systematic analysis of failure patterns across providers and formats

💡 By combining format-specific QA with a structured failure taxonomy, this benchmark reveals where models succeed, where they fail, and why aggregate pass rates often hide the patterns that matter for deployment.
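
To make the point about aggregate pass rates concrete, here is a minimal Python sketch of a per-run log and a failure tally. It is an illustration only: the field names, the placeholder providers (`provider_a`, `provider_b`), the failure category labels, and the sample rows are all hypothetical and are not taken from the benchmark itself.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunRecord:
    """One logged generation run (hypothetical schema, not the benchmark's)."""
    provider: str
    model_version: str
    format_type: str
    complexity_level: int          # 1 (basic) to 4 (high specificity)
    qa_passed: bool
    failure_category: Optional[str] = None  # set only when QA fails


def acceptance_rate(runs: List[RunRecord]) -> float:
    """Share of runs that passed QA, over all runs."""
    return sum(r.qa_passed for r in runs) / len(runs)


def failures_by_slice(runs: List[RunRecord]) -> Counter:
    """Count failed runs per (provider, format, failure category)."""
    return Counter(
        (r.provider, r.format_type, r.failure_category)
        for r in runs
        if not r.qa_passed
    )


# Made-up sample rows, for illustration only.
runs = [
    RunRecord("provider_a", "v1", "PPTX", 1, True),
    RunRecord("provider_a", "v1", "PPTX", 4, False, "layout_overflow"),
    RunRecord("provider_a", "v1", "DOCX", 2, True),
    RunRecord("provider_b", "v1", "PPTX", 4, True),
    RunRecord("provider_b", "v1", "PDF", 3, False, "missing_citation"),
    RunRecord("provider_a", "v1", "PPTX", 3, False, "layout_overflow"),
]

print(f"aggregate pass rate: {acceptance_rate(runs):.2f}")
for slice_key, count in failures_by_slice(runs).most_common():
    print(slice_key, count)
```

In this toy data the aggregate rate is a single number, while the per-slice tally shows that failures cluster in one provider-format pair. That is the kind of pattern a structured taxonomy is meant to surface.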

Read the full case study

What we're celebrating

🎉 Introducing the Advanced PhD Reasoning Rubrics Data Pack

Turing released the Advanced PhD Reasoning Rubrics Data Pack: 1,106 expert-authored PhD-level tasks across Computer Science, Data Science, and Chemistry, each paired with weighted atomic rubrics that evaluate the reasoning process, not just final answers.

Calibrated across 16 evaluation rounds with pass rates from 0% to 50% on SOTA models, it's built for RL, reward modeling, post-training, and reasoning failure analysis.
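
As a rough illustration of how weighted atomic rubrics can score the reasoning process rather than only the final answer, here is a small Python sketch. The data pack's actual schema and aggregation rule are not described here, so the `RubricItem` fields, the weights, the example criteria, and the normalized weighted-sum formula are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RubricItem:
    """One atomic criterion (hypothetical structure)."""
    description: str
    weight: float
    satisfied: bool


def rubric_score(items: List[RubricItem]) -> float:
    """Weighted fraction of criteria satisfied, in the range [0, 1]."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    earned = sum(item.weight for item in items if item.satisfied)
    return earned / total_weight


# Made-up criteria: the final answer is right, but one reasoning step is not.
items = [
    RubricItem("States the recurrence correctly", 3.0, True),
    RubricItem("Justifies the base case", 1.0, False),
    RubricItem("Derives the closed form", 4.0, True),
    RubricItem("Reports the correct final answer", 2.0, True),
]

print(rubric_score(items))
```

Under this assumed rule, a response with a correct final answer still loses credit for the unjustified step, which is the distinction a process-level rubric is meant to capture.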

Explore the dataset

What we're reading

  • Agents’ Last Exam
    Researchers from UC Berkeley and collaborators introduced Agents’ Last Exam (ALE), a benchmark designed to measure whether AI agents can complete economically valuable, long-horizon professional work, rather than just answer questions or solve isolated coding tasks. The benchmark spans 1,490 task instances across 55 subdomains and 13 industry clusters, covering fields such as engineering, manufacturing, life sciences, finance, healthcare, legal work, and media production.

    Current frontier agents perform far below saturation. The strongest configuration, Codex with GPT-5.5, achieves an overall pass rate of 24.0%, while the hardest Last-Exam tier remains effectively unsolved, with most leading agents scoring 0–2.6% pass rates and average full-pass performance below 1% across mainstream systems.
  • VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models
    Researchers at Weibo AI introduce VibeThinker-3B, a compact reasoning model that explores how far verifiable reasoning can be pushed within a strict 3B-parameter budget. Using a training pipeline that combines curriculum-based SFT, multi-domain RL, offline self-distillation, and instruction alignment, the model achieves performance typically associated with much larger frontier systems.

    VibeThinker-3B scores 94.3 on AIME 2026, 80.2 Pass@1 on LiveCodeBench v6, and 76.4 on IMO-AnswerBench, while a claim-level test-time scaling method (CLR) boosts AIME 2026 to 97.1 and IMO-AnswerBench to 80.6. It also achieves a 96.1% acceptance rate on recent unseen LeetCode contests, matching or exceeding several flagship models hundreds of times larger.
  • LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling
    Researchers introduce LoopCoder-v2, a family of 7B Parallel Loop Transformer (PLT) coding models that investigate how much latent computation can be added through recurrent loops before performance saturates. Unlike standard Transformers, PLTs reuse the same block multiple times, enabling deeper reasoning without increasing parameter count.

    Training variants with 1–4 loops on 18T tokens, the authors find a strongly non-monotonic scaling effect: the 2-loop model delivers the best results, improving SWE-bench Verified from 43.0% to 64.4% and Multi-SWE from 14.0% to 31.0%, while 3- and 4-loop variants regress significantly.
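
The weight-sharing idea behind looped models, reusing one block several times so that depth grows while the parameter count stays fixed, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: `SharedBlock` is a hypothetical residual feed-forward layer standing in for a full Transformer block, and the sketch does not reproduce the Parallel Loop Transformer's actual architecture, its parallelism, or its training setup.

```python
import random
from typing import List

Vector = List[float]


class SharedBlock:
    """Toy residual feed-forward block (stand-in for a Transformer block)."""

    def __init__(self, d_model: int, d_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w_in = [[rng.gauss(0.0, 0.02) for _ in range(d_hidden)]
                     for _ in range(d_model)]
        self.w_out = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                      for _ in range(d_hidden)]

    def num_parameters(self) -> int:
        return (sum(len(row) for row in self.w_in)
                + sum(len(row) for row in self.w_out))

    def __call__(self, x: Vector) -> Vector:
        d_hidden = len(self.w_out)
        # Linear projection followed by ReLU.
        hidden = [max(0.0, sum(x[i] * self.w_in[i][j] for i in range(len(x))))
                  for j in range(d_hidden)]
        # Project back to the model width and add the residual connection.
        update = [sum(hidden[j] * self.w_out[j][k] for j in range(d_hidden))
                  for k in range(len(x))]
        return [xi + ui for xi, ui in zip(x, update)]


def looped_forward(block: SharedBlock, x: Vector, n_loops: int) -> Vector:
    """Apply the same block n_loops times; no new weights are created."""
    for _ in range(n_loops):
        x = block(x)
    return x


block = SharedBlock(d_model=8, d_hidden=32)
x = [1.0] * 8
for n_loops in (1, 2, 4):
    out = looped_forward(block, x, n_loops)
    print(n_loops, block.num_parameters(), round(out[0], 6))
```

The parameter count printed on each line is identical, since only the number of passes through the shared block changes. The sketch says nothing about accuracy. The non-monotonic result reported above, where two loops beat three and four, is an empirical finding of the paper and is not modeled here.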

Where we’ll be

🔹 ICML 2026 — International Conference on Machine Learning
📍 Seoul, South Korea | 🗓️ July 6-11

ICML is one of the world’s leading machine learning conferences, highlighting frontier research across AI, data science, and applied domains from vision to robotics.

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

