AGI Advance: Weekly AI & AGI Insights (May 26, 2026)

Turing Staff
27 May 2026 | 4 mins read

Most training datasets capture what people say, not what they believe, infer, or feel beneath the surface. This week, we highlight how Turing delivered 2,000+ double-blind persuasion dialogues annotated with structured first- and second-order mental-state reflections, providing rare supervision for belief modeling and cognitive reasoning. We share why Turing is acquiring private codebases and paying up to $100K+ for the operational context frontier models need. We also cover new research on agentic execution in Gemini 3.5, forecasting scientific breakthroughs under temporal constraints, and source-level self-evolving agent systems.

What we're doing

This week, we're highlighting how Turing delivered a human-grounded persuasion dialogue dataset to improve model performance on Theory-of-Mind benchmarks. Unlike standard dialogue datasets that capture only surface text and persuasion outcomes, this dataset annotates the cognitive layer beneath conversation, covering what participants believed, felt, and inferred about their partner across every turn.

Here's what we delivered:

  • 2,000+ double-blind persuasion dialogues across ethics, public policy, technology, and social issues, each spanning pre-test belief capture, multi-turn dialogue, structured mid-conversation ToM reflections, and post-test belief updates
  • Structured ToM dimensions per mid-check per participant, capturing self-reported beliefs, emotions, and goals alongside inferred partner beliefs, emotions, and intentions, providing first- and second-order mental-state annotations that are rare in existing dialogue datasets
  • 100% human-authored content, with zero LLM-generated dialogue or annotations permitted and active LLM detection protocols enforced throughout QA review
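
For readers who think in schemas, one dialogue record with the stages listed above could be laid out roughly as follows. This is a minimal sketch, not the delivered format: every class and field name here (`MentalState`, `MidCheck`, `PersuasionDialogue`, `belief_shift`) and the numeric stance scale are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MentalState:
    """Hypothetical container for one set of ToM dimensions."""
    beliefs: str
    emotions: str
    goals_or_intentions: str


@dataclass
class MidCheck:
    """A structured mid-conversation reflection by one participant."""
    after_turn: int
    participant: str
    self_report: MentalState        # first-order: what I believe, feel, want
    partner_inference: MentalState  # second-order: what I think my partner believes, feels, intends


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class PersuasionDialogue:
    topic: str
    # Assumed representation: participant id -> stance on a numeric scale.
    pre_test_belief: Dict[str, int]
    turns: List[Turn] = field(default_factory=list)
    mid_checks: List[MidCheck] = field(default_factory=list)
    post_test_belief: Dict[str, int] = field(default_factory=dict)

    def belief_shift(self, participant: str) -> int:
        """Post-test stance minus pre-test stance for one participant."""
        return self.post_test_belief[participant] - self.pre_test_belief[participant]


# Usage with made-up values.
dialogue = PersuasionDialogue(topic="public policy", pre_test_belief={"A": 2, "B": 6})
dialogue.turns.append(Turn(speaker="A", text="I am not convinced this policy helps."))
dialogue.mid_checks.append(
    MidCheck(
        after_turn=1,
        participant="B",
        self_report=MentalState("The policy helps", "calm", "persuade A"),
        partner_inference=MentalState("A doubts the policy", "skeptical", "test my argument"),
    )
)
dialogue.post_test_belief = {"A": 4, "B": 6}
print(dialogue.belief_shift("A"))
```

Keeping self-reports and partner inferences as separate fields of the same mid-check makes it straightforward to compare what one participant inferred against what the other actually reported.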

💡 Models underperform on ToM benchmarks because training data captures what people say, not what they think. By annotating the unspoken cognitive layer of real double-blind persuasion dialogues, this dataset gives models the supervision signal they need to close the gap.


What we're saying

🗣️ Your Codebase Is Worth More Than You Think

Private codebases aren't just engineering assets; they're training assets. The decisions, tradeoffs, and systems thinking captured in your repos, Jira tickets, PRDs, architecture docs, and support threads are exactly the kind of operational context frontier AI models need to reason about real software workflows.

Turing is acquiring high-quality codebases along with the context around them, and paying up to $100K+ depending on complexity. If you've built something real, this is your chance to turn it into impact and income.


What we're reading

  • Gemini 3.5: Frontier Intelligence with Action
    Google introduced Gemini 3.5, a new model family focused on agentic execution, coding, and long-horizon workflows, starting with the release of 3.5 Flash. The model delivers frontier-level performance at high speed, outperforming Gemini 3.1 Pro on benchmarks like Terminal-Bench 2.1 (76.2%), MCP Atlas (83.6%), and GDPval-AA (1656 Elo), while generating tokens 4× faster than other frontier models.

    Designed for real-world agents, 3.5 Flash powers multi-step workflows, collaborative subagents, codebase maintenance, UI generation, and enterprise automation through Google Antigravity and Gemini Enterprise. Google also announced Gemini Spark, a persistent personal AI agent built on 3.5 Flash that can operate continuously on behalf of users.
  • Forecasting Scientific Progress with Artificial Intelligence
    This paper introduces CUSP (Cutoff-conditioned Unseen Scientific Progress), a benchmark for testing whether AI systems can forecast future scientific breakthroughs under strict temporal knowledge constraints. Built from 4,760 scientific milestones and 17,429 forecasting tasks across biology, AI, chemistry, medicine, and physics, CUSP evaluates feasibility prediction, mechanistic reasoning, solution generation, and date forecasting.

    Results show a consistent gap between scientific reasoning and scientific forecasting. Frontier models like GPT-5.4 perform well on identifying plausible approaches (81.9% MCQ accuracy) but remain near chance on feasibility prediction (~50%) and systematically mispredict when discoveries will occur. Models also exhibit strong overconfidence and response bias, with performance remaining surprisingly similar on both pre- and post-cutoff discoveries.
  • MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
    This paper introduces MOSS, a framework that enables autonomous agents to rewrite their own source code, extending self-evolution beyond prompts, skills, and memory into the agent harness itself. Unlike prior systems limited to text-mutable artifacts, MOSS can modify routing, hooks, state management, and dispatch logic directly at the code level.

    MOSS uses a deterministic multi-stage pipeline: it collects real production failures, localizes root causes, generates and reviews code patches through external coding agents, validates candidates in ephemeral trial workers, and deploys successful versions through rollback-safe container swaps.

    On OpenClaw, a single autonomous evolution cycle improved the average score across four agent benchmark tasks from 0.25 to 0.61, demonstrating that source-level self-rewriting can reliably fix structural failures unreachable through prompt or skill updates alone.
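
The staged flow described for MOSS (collect failures, localize, patch, review, validate, deploy with rollback) can be pictured with a small control-flow sketch. This is not the paper's code: the function `run_evolution_cycle`, the `Candidate` type, and every callable passed in are hypothetical stand-ins, and the real system's stages are far richer than these stubs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Candidate:
    failure: str
    root_cause: str
    patch: str


def run_evolution_cycle(
    failures: Iterable[str],
    localize: Callable[[str], str],
    generate_patch: Callable[[str, str], str],
    review: Callable[[Candidate], bool],
    validate: Callable[[Candidate], bool],
    deploy: Callable[[Candidate], None],
    rollback: Callable[[], None],
) -> Optional[Candidate]:
    """Run the stages in a fixed order; return the first candidate deployed."""
    for failure in failures:
        root_cause = localize(failure)               # localize the root cause
        patch = generate_patch(failure, root_cause)  # e.g. via an external coding agent
        candidate = Candidate(failure, root_cause, patch)
        if not review(candidate):                    # patch review gate
            continue
        if not validate(candidate):                  # trial run in an isolated worker
            continue
        try:
            deploy(candidate)                        # swap in the new version
        except Exception:
            rollback()                               # restore the previous version
            continue
        return candidate
    return None


# Usage with trivial stubs.
deployed = []
result = run_evolution_cycle(
    failures=["dispatch timeout"],
    localize=lambda failure: "dispatch logic",
    generate_patch=lambda failure, cause: "increase timeout",
    review=lambda candidate: True,
    validate=lambda candidate: True,
    deploy=lambda candidate: deployed.append(candidate.patch),
    rollback=lambda: deployed.clear(),
)
print(result)
```

Each gate sits in front of deployment, so in this sketch a patch that fails review or validation never reaches the live system, and a failed deployment triggers the rollback callable.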

Where we’ll be

🔹 LLM Researchers Happy Hour
📍 Mountain View, California | 🗓️ May 27

Connect with AI researchers advancing today’s SOTA foundation models and enterprise AI leaders driving real-world innovation.

🔹 CVPR 2026: IEEE/CVF Conference on Computer Vision and Pattern Recognition
📍 Denver, Colorado | 🗓️ June 3-7

CVPR is the world's premier conference that brings together researchers and practitioners to share significant advancements in computer vision, pattern recognition, and AI.

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

