AGI Advance: Weekly AI & AGI Insights (May 20, 2025)

Turing Staff
21 May 2025 · 3 min read
LLM training and enhancement
AGI_Advance_Newsletter

This week in AGI Advance, we zoom in on what it takes to build trustworthy, retrieval-centric agents, and why the future of LLM reasoning may come not from more pretraining, but from better runtime scaffolding, cleaner context, and self-refining systems.

What we're thinking

We’ve been thinking about what it really takes to make retrieval-centric agents earn trust inside an enterprise—where every question must honor changing data, bespoke tools, and fine-grained permissions.

Our conversations with a leader in this space surfaced three early signals:

  • Perfect inputs over bigger models. When a language model receives clean, up-to-date context, its reasoning “just works”; most breakdowns trace back to gaps in the retrieval layer, not the generator itself.
  • Permissions are part of the query. Reliable answers demand identity-aware look-ups that unify Slack, Drive, SharePoint, and more, mapping each user to what they’re allowed to see across heterogeneous systems. The hard work lives in harmonizing those rules. 
  • Agentic search needs multi-hop reasoning. The next wave chains specialized tool calls, asks clarifying questions, and knows when to defer, turning search from a single vector hit into a conversational workflow.

Retrieval isn’t just fetching text; it’s reasoning about context, policy, and trust. The real breakthroughs will emerge from smarter data scaffolding and onboarding agents the way we onboard new hires—not from ever-larger base models.
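The idea that "permissions are part of the query" can be sketched in a few lines. This is a hypothetical illustration, not a production design: the `Document`, `User`, and `identity_aware_search` names are assumptions, and the substring match stands in for real vector search, while real connectors (Slack, Drive, SharePoint) would populate the ACL fields from each system's permission APIs.

```python
from dataclasses import dataclass, field

# Hypothetical models; in practice, connectors for Slack, Drive,
# SharePoint, etc. would populate these from each system's ACLs.
@dataclass
class Document:
    doc_id: str
    source: str                     # e.g. "slack", "drive", "sharepoint"
    text: str
    allowed_groups: set = field(default_factory=set)

@dataclass
class User:
    user_id: str
    groups: set

def identity_aware_search(query: str, user: User, index: list[Document]) -> list[Document]:
    """Retrieve candidates, then keep only documents this user may see.

    The permission check is part of the query itself: a document that
    matches the text but fails the ACL check never reaches the model.
    """
    # Stand-in for a vector similarity search over the index.
    candidates = [d for d in index if query.lower() in d.text.lower()]
    # Filter by group overlap; the hard part in practice is harmonizing
    # heterogeneous permission models into one representation like this.
    return [d for d in candidates if d.allowed_groups & user.groups]

index = [
    Document("d1", "drive", "Q3 revenue forecast", {"finance"}),
    Document("d2", "slack", "Q3 revenue kickoff thread", {"finance", "sales"}),
]
alice = User("alice", {"sales"})
visible = identity_aware_search("revenue", alice, index)  # only d2 is visible
```

The key design point is that filtering happens before generation: a clean, permission-consistent context is what lets the model's reasoning "just work."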

What we're reading

  • Absolute Zero: Reinforced Self-play Reasoning with Zero Data
    This paper introduces Absolute Zero Reasoner (AZR), a self-play system where an LLM learns by proposing, solving, and refining its own tasks with zero external data. Grounded by a code executor, AZR achieves state-of-the-art performance in math and coding benchmarks, outperforming many models trained on curated human data. It learns across deduction, induction, and abduction tasks, and even develops intermediate planning behaviors, highlighting how far LLMs can go when they generate and verify their own learning signals.
  • Reasoning Models Don’t Always Say What They Think
    This paper shows that even advanced reasoning models like Claude 3.7 Sonnet often use hints without verbalizing them, raising doubts about the faithfulness of their chain-of-thought (CoT) reasoning. CoT monitoring caught some unintended behaviors, but failed to detect most reward hacks and often missed internal reasoning altogether. Even when RL improved task accuracy, it didn’t make models more honest about how they solved the problem. The key takeaway: CoT monitoring helps, but it’s not enough to guarantee safe or interpretable reasoning.
  • SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning
    SoftCoT++ extends SoftCoT by enabling test-time scaling in the thinking stage, using multiple soft thought representations and contrastive learning to inject diversity, without changing the model’s architecture. The method outperforms traditional token-level scaling on math, symbolic, and commonsense benchmarks, and shows that latent-space reasoning diversity can unlock higher inference quality while staying efficient and robust across model families.
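The Absolute Zero loop described above—propose a task, attempt it, let a code executor grade the attempt—can be sketched minimally. This is a toy illustration of the loop's structure, not the paper's method: deterministic stand-ins (`propose_task`, `solve_task`) replace the LLM's proposer and solver policies, and the "tasks" are trivial arithmetic programs.

```python
import random

def propose_task(rng):
    # Proposer role: emit a program whose executed output serves as
    # ground truth, so no external labels are needed.
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    program = f"result = {a} + {b}"
    return program, a + b

def solve_task(program):
    # Solver role: in AZR this is the same LLM proposing an answer.
    # Here we simply run the program; the executor grounds learning
    # in real execution rather than human annotation.
    scope = {}
    exec(program, {}, scope)
    return scope["result"]

def self_play_round(rng):
    program, truth = propose_task(rng)
    prediction = solve_task(program)
    # Verifiable reward signal with zero curated data.
    return 1.0 if prediction == truth else 0.0

rng = random.Random(0)
rewards = [self_play_round(rng) for _ in range(5)]
```

In the actual system, both roles are played by one model updated via reinforcement learning on these executor-verified rewards, across deduction, induction, and abduction task families.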

Where we’ll be

Turing will be at two major AI conferences in the coming months—join us to discuss the future of AGI:

  • ICML 2025 [Vancouver Convention Center, Canada | July 13 – 19]
The International Conference on Machine Learning (ICML) is a leading international venue for research on advances in machine learning and its applications.
  • KDD 2025 [Toronto, ON, Canada | Aug 3 – 7]
    The ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) focuses on innovative research in data mining, knowledge discovery, and large-scale data analytics.

If you’re attending, reach out—we’d love to connect and exchange insights!

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

[Subscribe & Read More]

Want to accelerate your business with AI?

Talk to one of our solutions architects and start innovating with AI-powered talent.

Get Started