AGI Advance: Weekly AI & AGI Insights (Jan 27, 2026)

Turing Staff
28 Jan 20264 mins read
LLM training and enhancement
AGI_Advance_Newsletter

This week’s edition focuses on agent safety in the real world. Turing built a dataset of 24,000+ multi-turn conversations, capturing how AI agents make decisions across tool use, refusals, and final response, annotated step by step across 30+ safety dimensions. Additionally, Jonathan Siddharth speaks at Axios House in Davos about why enterprise is the proving ground for superintelligence, and we dig into new research on dynamic context discovery, tool orchestration, and execution-grounded data generation.

What we're doing

This week, we’re spotlighting how Turing helped a client build a dataset of 24,000+ multi-turn conversations to evaluate and improve agent safety in tool-rich environments. Unlike traditional datasets focused only on final responses, this effort supervised every step, from tool calls to confirmations, refusals, and rewrites across 30+ safety dimensions.

Here’s what we delivered:

  • 24,000+ supervised traces across benign, dual-use, harmful, and jailbreak scenarios
  • Step-level safety labels and policy-aligned rewrites, covering tool misuse, confirmation handling, and refusal breakdowns
  • Multi-pass QA and automation to enforce structural, behavioral, and policy compliance at scale

💡 Real safety failures don’t just happen at the final output; they unfold across decisions, tools, and turns. This dataset captures them all.

Read the full case study

What we're saying

🗣️ Jonathan Siddharth at Axios House, Davos

In a conversation with Axios Publisher Nicholas Johnston, Turing CEO Jonathan Siddharth made the case for why enterprise is the real proving ground for superintelligence.

“The models are capable of X, but we’re only extracting X minus delta of value. Closing that gap is where the next breakthroughs will come from.”

Jonathan shared why real-world deployment across banks, life sciences, and government is the key to uncovering model failure modes, surfacing missing enterprise knowledge, and building the systems where intelligence becomes infrastructure.

Watch the episode here

What we're reading

  • Dynamic Context Discovery
    This post introduces dynamic context discovery, a context-engineering pattern where agents pull in information only when needed instead of relying on large, static prompts. Cursor applies this by treating tool outputs, chat history, MCP tool metadata, skills, and terminal sessions as files, allowing agents to search, read, and load context incrementally. This approach reduces token usage, avoids context pollution, and improves recovery after summarization by letting agents re-query prior details rather than relying on lossy summaries. A key result is a 46.9% reduction in total agent tokens in MCP-heavy runs by dynamically loading tool descriptions instead of injecting them upfront. Overall, the design reframes context as a queryable workspace, enabling longer, more reliable agent trajectories with better task focus.
  • AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning
    This paper addresses a limitation of multimodal LLMs: rigid and brittle tool usage, where models struggle to decide which tools to use, when to use them, and how to compose them over long reasoning horizons, especially with unseen tools. The authors introduce AdaReasoner, a family of MLLMs that learn tool use as a general reasoning skill via three components: a scalable multi-turn trajectory curation pipeline, Tool-GRPO (a reinforcement learning method that optimizes tool selection and sequencing based on end-task success), and an adaptive learning mechanism that decouples tool logic from specific tool names and tasks. As a result, AdaReasoner learns to adopt useful tools, discard irrelevant ones, and modulate tool-use frequency based on task feedback, even generalizing to new tools introduced only at inference time. Empirically, the 7B model improves by +24.9% on average across benchmarks like VSP, Jigsaw, and GUIQA, and outperforms strong proprietary systems including GPT-5 on multiple visual reasoning tasks, showing that effective tool orchestration can outweigh raw model scale.
  • SAGE: Steerable Agentic Data Generation for Deep Search with Execution Feedback
    This paper tackles the scarcity and cost of high-quality training data for deep search agents, where questions require long, multi-step retrieval and reasoning chains. The authors introduce SAGE, a dual-agent pipeline in which a data generator creates question–answer pairs with a target difficulty (measured by search steps), and a search agent attempts to solve them, providing execution feedback to iteratively refine correctness and difficulty. Compared to resampling-only baselines, execution feedback substantially increases the proportion of questions that are both correct and genuinely hard, producing data that requires more diverse reasoning strategies. Training search agents on SAGE-generated data yields up to 27% relative gains in-domain and up to 23% out-of-domain over popular datasets like HotpotQA and Musique. Notably, agents trained on SAGE data also transfer from Wikipedia-based retrieval to Google Search at inference time without additional training, showing strong generalization.

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

[Subscribe & Read More]

Ready to Optimize Your Model for Real-World Needs?

Partner with Turing to fine-tune, validate, and deploy models that learn continuously.

Optimize Continuously