AGI Advance: Weekly AI & AGI Insights (May 5, 2026)

Turing Staff
06 May 2026 · 5 mins read

This edition highlights how Turing delivered an execution-grounded evaluation benchmark for agentic HR workflows, one that goes beyond output scoring to validate tool sequencing, parameter usage, and end-to-end task completion through verifier-based checks. The reading list covers expert upcycling for MoE models, Claude's performance on human-unsolvable bioinformatics problems, and a new agent architecture that argues context quality matters more than context length. We also discuss why the next decade of AI will be won by whoever closes the data-deployment loop fastest.

What we're doing

This week, we're highlighting how Turing delivered an execution-grounded evaluation benchmark for measuring how agentic models perform across real-world HR workflows. Unlike benchmarks that score final responses, this framework validates whether agents correctly complete multi-step processes (tool selection, parameter usage, and workflow sequencing) through verifier-based checks against actual system state.

Here's what we delivered:

  • 100+ execution-grounded workflow tasks across core HR operations, including onboarding, background checks, case management, and immigration processing
  • 1,000+ runs per model with pass@10 scoring (a scoring sketch follows this list), enabling stable cross-model comparison that separates signal from noise, with the top model at ~70% pass rate vs. ~50% and ~5% for the others
  • A structured failure taxonomy that classifies every failed run by root cause (tool selection, parameter hallucination, sequencing errors, or recovery failures), revealing that tool sequencing alone accounts for over 80% of failures in top models
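
The case study doesn't publish its scoring code, but pass@k is a standard metric; here is a minimal sketch, assuming the benchmark uses the unbiased estimator from the Codex paper (Chen et al., 2021):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: the probability that at least one of k runs,
    drawn without replacement from n total runs with c verified
    passes, succeeds. math.comb already returns 0 when n - c < k,
    so the guard is only for readability."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a task run 10 times with 3 verifier-accepted completions
print(pass_at_k(n=10, c=3, k=10))  # 1.0  (solved at least once in 10)
print(pass_at_k(n=10, c=3, k=3))   # ~0.708
```

With roughly 10 runs per task (1,000+ runs over 100+ tasks), pass@10 effectively asks whether a task was solved at least once; per-task values would then be averaged across the suite, though the exact aggregation is our assumption.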

💡 Agents that produce plausible responses can still fail silently on real workflows. By grounding evaluation in verifier-validated execution rather than output inspection, this benchmark surfaces the failure modes that matter most (wrong tools, wrong order, wrong parameters) and gives training teams the signal to fix them.
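
Turing's verifier interface isn't public; purely as an illustration of what "verifier-based checks against actual system state" can look like, here is a hypothetical sketch. Every name in it (the HR sandbox, the tool names, the trace format) is invented:

```python
from dataclasses import dataclass
from enum import Enum, auto

class FailureMode(Enum):
    """Mirrors the case study's failure taxonomy."""
    TOOL_SELECTION = auto()
    PARAMETER_HALLUCINATION = auto()
    SEQUENCING_ERROR = auto()
    RECOVERY_FAILURE = auto()

@dataclass
class Verdict:
    passed: bool
    failure: FailureMode | None = None

def precedes(calls: list[str], first: str, second: str) -> bool:
    """True if `first` occurs before the first occurrence of `second`."""
    return (first in calls and second in calls
            and calls.index(first) < calls.index(second))

def verify_onboarding(trace, sandbox) -> Verdict:
    """Grade final system state and tool order, not the agent's prose.

    `trace` is the agent's ordered tool-call log; `sandbox` is a
    stateful mock of the HR system. Both are hypothetical objects.
    """
    # End-to-end completion: did an employee record actually land?
    if not sandbox.employee_exists("jane.doe"):
        return Verdict(False, FailureMode.TOOL_SELECTION)
    # Parameter usage: were the right values written?
    if sandbox.get_employee("jane.doe").start_date != "2026-06-01":
        return Verdict(False, FailureMode.PARAMETER_HALLUCINATION)
    # Sequencing: the background check must precede provisioning.
    calls = [step.tool for step in trace]
    if not precedes(calls, "run_background_check", "provision_account"):
        return Verdict(False, FailureMode.SEQUENCING_ERROR)
    return Verdict(True)
```

The design point is that every check inspects post-hoc state or the tool-call log, so a fluent but wrong final answer cannot pass.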

Read the full case study

What we're saying

🗣️ Enterprise Superintelligence Will Be Won on the Data and Deployment Loop

In a new post, Turing CEO Jonathan Siddharth highlights that the next axis of AI advantage isn't algorithms or compute alone; it's the speed of the loop between data and deployment. 

Data companies don't see deployment. Deployment companies don't see data.

Turing operates on both sides: generating training data, evals, and RL environments for frontier labs while deploying agentic solutions into Fortune 500 enterprises across financial services, life sciences, healthcare, and more. Every deployment surfaces capability gaps that inform the next round of data and model improvement. The next decade, Jonathan argues, will be won by the systems and partnerships that close this loop fastest.

Read the full post

What we're reading

  • Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
    Amazon introduces expert upcycling, a method for progressively expanding MoE capacity mid-training by duplicating existing experts and extending the router, while keeping top-K routing fixed to preserve per-token inference cost. Rather than training a large MoE from scratch, the approach starts with a smaller E-expert model, then expands to mE experts at a chosen transition point, inheriting learned representations as a warm initialization (a rough sketch of the expansion step appears after this list). In 7B→13B total-parameter experiments, the upcycled model matches the fixed-size baseline across 11 downstream benchmarks while saving ~32% of GPU hours, and ~67% when an existing checkpoint is reused. A key contribution is utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when the continued pre-training budget is limited. Ablations show that gradient norm is the recommended default selection criterion and that at least 50% of the continued pre-training budget is needed for strong gap closure; expert upcycling consistently outperforms sparse upcycling (dense→MoE) across all tested activation ratios, with the advantage widening as activation ratios decrease.
  • Evaluating Claude’s Bioinformatics Research Capabilities with BioMysteryBench
    Anthropic introduces BioMysteryBench, a 99-question bioinformatics benchmark built from real-world datasets where answers are derived from controllable, objectively verifiable properties of the data rather than subjective scientific conclusions. Unlike prior benchmarks, it is method-agnostic, allows models unrestricted access to tools and databases, and, crucially, does not require questions to be human-solvable, enabling evaluation on problems that panels of domain experts could not crack. On 76 human-solvable tasks, the latest Claude generations perform on par with or ahead of human experts, sometimes using entirely different analytical strategies. On 23 human-difficult tasks, Claude Mythos Preview achieves a 30% solve rate, outperforming panels of five domain experts. A deeper reliability analysis reveals a striking pattern: on human-solvable problems, models are strongly bimodal (they either solve a problem consistently or not at all), while on human-difficult problems nearly half of correct answers come from reasoning paths the model cannot reliably reproduce. The results are reinforced by Genentech and Roche's independently developed CompBioBench, where Claude Opus 4.6 reaches 81% overall and 69% on the hardest questions.
  • GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
    This paper introduces GenericAgent (GA), a general-purpose agent system built to maximize context information density. The core argument is that long-horizon agent performance is determined not by context length but by how much decision-relevant information is maintained within a finite context budget. GA realizes this through four mechanisms: a minimal set of nine atomic tools that reduce prompt overhead and action-space ambiguity; a hierarchical on-demand memory that keeps only a lightweight orientation layer visible by default while retrieving deeper knowledge only when needed; a self-evolution pipeline that converts verified trajectories into reusable SOPs and executable code; and a layered context truncation and compression system that prevents unbounded context growth during long executions (see the truncation sketch after this list). Across benchmarks, GA matches or outperforms Claude Code and OpenClaw on task completion while using substantially fewer tokens; on Lifelong AgentBench, GA achieves 100% accuracy at 15.5% of OpenClaw's token cost. Its self-evolution mechanism reduces token consumption by up to 89.6% across repeated runs of the same task, with the gains compounding most on complex multi-step workflows.
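
For the expert-upcycling item above: Amazon's implementation isn't part of the summary, so the sketch below is only our reading of the core move, duplicating high-utility experts and extending the router while top-K routing stays fixed. The linear gate, the utility scores, and all shapes are assumptions about the architecture:

```python
import copy
import torch
import torch.nn as nn

def upcycle_experts(experts: nn.ModuleList, router: nn.Linear,
                    utility: torch.Tensor, m: int):
    """Expand E experts to m*E via utility-guided, non-uniform duplication.

    `utility` holds one importance score per expert (the paper's
    recommended default is the gradient norm). Top-K routing is left
    unchanged, so per-token inference cost is preserved.
    """
    E, target = len(experts), m * len(experts)
    # Every expert keeps one copy; extra slots go to high-utility experts.
    extra = (utility / utility.sum() * (target - E)).round().long()
    counts = 1 + extra
    counts[counts.argmax()] += target - counts.sum()  # fix rounding drift

    new_experts, new_rows = nn.ModuleList(), []
    for i, expert in enumerate(experts):
        for _ in range(int(counts[i])):
            new_experts.append(copy.deepcopy(expert))   # warm initialization
            new_rows.append(router.weight[i].clone())   # inherit the gate row
    new_router = nn.Linear(router.in_features, target, bias=False)
    with torch.no_grad():
        new_router.weight.copy_(torch.stack(new_rows))
    return new_experts, new_router
```

Duplicated gate rows start identical, so copies of an expert initially receive the same tokens; one would expect small perturbations or the continued pre-training phase (the paper's ≥50% budget finding) to break that symmetry.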
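
And for GenericAgent's layered truncation (third item), a hypothetical sketch of the idea: a protected orientation layer, newest-first retention, and crude compression of whatever no longer fits. The word-count "tokenizer" and string-slicing "compression" are stand-ins for a real tokenizer and an LLM summarizer:

```python
def build_context(orientation: str, turns: list[str], budget: int,
                  count=lambda s: len(s.split())) -> list[str]:
    """Keep the orientation layer always visible, keep recent turns
    verbatim, and degrade older turns to stubs once the budget is hit."""
    kept, used = [], count(orientation)
    for turn in reversed(turns):                  # newest first
        if used + count(turn) <= budget:
            kept.append(turn)                     # fits verbatim
            used += count(turn)
        else:
            stub = turn[:60] + " [compressed]"    # summarizer stand-in
            if used + count(stub) <= budget:
                kept.append(stub)
                used += count(stub)
    return [orientation] + kept[::-1]             # restore chronological order
```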

Where we’ll be

🔹 CVPR 2026 — IEEE/CVF Conference on Computer Vision and Pattern Recognition
📍 Denver, Colorado | 🗓️ June 3-7

CVPR is the world's premier computer vision conference, bringing together researchers and practitioners to share significant advances in computer vision, pattern recognition, and AI.

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

[Subscribe & Read More]

Ready to Optimize Your Model for Real-World Needs?

Partner with Turing to fine-tune, validate, and deploy models that learn continuously.

Optimize Continuously