This week’s edition dives into the limits of frontier models and how to surface them with precision. We spotlight Turing’s delivery of 5,000+ HLE-grade STEM problems designed to benchmark deep scientific reasoning, along with our collaboration with AI at Meta and Hugging Face on OpenEnv, an open framework for evaluating agents in real, reproducible RL environments. Also in this issue: fluid intelligence in generative models, a contamination-free medical benchmark, and an autonomous math research agent tackling Erdős problems and beyond.
What we're thinking
This week, we’re highlighting how Turing partnered with a frontier AI lab to design and deliver 5,000+ HLE-grade STEM problems purpose-built for benchmarking next-generation language models. Unlike saturated academic datasets, this corpus was engineered to stress deep scientific and mathematical reasoning under strict structural and evaluation constraints.
Here’s what we delivered:
- 5,000+ graduate- to PhD-level problems designed for high-sensitivity frontier model benchmarking
- 100% client acceptance rate, with every problem meeting correctness, precision, and SOTA model-breaking standards
- 40+ STEM subdomains across physics, chemistry, biology, and mathematics, grounded in leading academic taxonomies and research frameworks
💡 Hard problems aren’t enough. Frontier benchmarking requires an evaluation-safe structure, answer uniqueness, calibrated difficulty, and domain depth.
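To make those requirements concrete, here is a minimal screening sketch over candidate items. The schema, field names, and the 25% SOTA pass-rate cutoff are illustrative assumptions, not our production pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    problem_id: str
    subdomain: str                    # e.g., "physics/quantum-field-theory"
    statement: str                    # self-contained problem text
    final_answer: str                 # single canonical, machine-checkable answer
    answer_format: str                # e.g., "closed-form expression" or "numeric, 3 sig figs"
    sota_pass_rate: float             # observed fraction of frontier-model attempts that succeed
    distractor_answers: list[str] = field(default_factory=list)  # plausible wrong answers

def screening_issues(item: BenchmarkItem, max_sota_pass_rate: float = 0.25) -> list[str]:
    """Return the reasons an item fails screening; an empty list means it is accepted."""
    issues = []
    if not item.final_answer.strip():
        issues.append("no canonical answer: item is not evaluation-safe")
    if item.final_answer in item.distractor_answers:
        issues.append("answer is not unique among the listed candidates")
    if item.sota_pass_rate > max_sota_pass_rate:
        issues.append("difficulty too low: SOTA models solve it too often")
    if "/" not in item.subdomain:
        issues.append("subdomain is not mapped to the agreed taxonomy")
    return issues
```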
What we're celebrating
🎉 Turing × AI at Meta × Hugging Face
Frontier labs are moving beyond synthetic benchmarks, and we’re proud to partner with AI at Meta and Hugging Face on OpenEnv, a rigorous evaluation framework for testing agents in real, reproducible RL environments.
Here’s what we’ve learned:
- Multi-step reasoning is still the dominant failure point
- Ambiguity degrades reliability across tool-use paths
- Tool selection isn’t enough; argument validation and execution order are key (see the sketch after this list)
- Feedback loops improve recovery and agent stability
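Here is that sketch: a minimal harness illustrating the last two points, assuming a hypothetical JSON tool-call format, made-up tool schemas, and a simple retry policy. None of this is the OpenEnv API; it is just the shape of argument validation, execution-order checks, and error feedback that we keep seeing matter.

```python
import json

# Hypothetical tool schemas: required arguments plus a simple execution-order constraint.
TOOL_SCHEMAS = {
    "search_docs": {"required": ["query"]},
    "write_file": {"required": ["path", "content"], "must_follow": ["search_docs"]},
}

def validate_call(call: dict, history: list[str]) -> str | None:
    """Return an error message the agent can recover from, or None if the call is valid."""
    schema = TOOL_SCHEMAS.get(call.get("tool"))
    if schema is None:
        return f"unknown tool: {call.get('tool')!r}"
    missing = [arg for arg in schema["required"] if arg not in call.get("args", {})]
    if missing:
        return f"missing arguments: {missing}"          # selecting the right tool is not enough
    for prereq in schema.get("must_follow", []):
        if prereq not in history:
            return f"execution-order violation: call {prereq} before {call['tool']}"
    return None

def run_step(agent, observation: str, history: list[str], max_retries: int = 3):
    """Feedback loop: validation errors go back to the agent as observations until it recovers."""
    for _ in range(max_retries):
        call = json.loads(agent.act(observation))        # agent proposes a tool call as JSON
        error = validate_call(call, history)
        if error is None:
            history.append(call["tool"])
            return call
        observation = f"tool error: {error}"             # structured error drives recovery
    raise RuntimeError("agent did not produce a valid tool call within the retry budget")
```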
What we're reading
- GENIUS: Generative Fluid Intelligence Evaluation Suite
This paper argues that current unified multimodal models (UMMs) excel at Crystallized Intelligence (recalling learned visual knowledge) but lack Generative Fluid Intelligence (GFI): the ability to reason, induce patterns, and adapt under novel, ad-hoc constraints. To measure this gap, the authors introduce GENIUS, a 510-sample benchmark structured around three primitives (Implicit Pattern Induction, Ad-hoc Constraint Execution, and Contextual Knowledge Adaptation), evaluated via hybrid metrics for rule compliance, visual consistency, and aesthetic quality. Testing 12 leading open-source and proprietary models reveals striking deficits: even the strongest model scores only 57.19 overall, with failures concentrated in contextual adaptation, where models default to pre-trained priors instead of following novel rules. Diagnostic experiments show that models often understand constraints in VQA form but fail to execute them visually, exposing an “execution gap.” The authors further propose a training-free attention adjustment mechanism that rebalances token-level gradients during inference, yielding consistent gains and offering a concrete pathway toward activating latent fluid reasoning in generative models.
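The “execution gap” is easy to state operationally. A minimal sketch, assuming per-sample boolean labels for VQA-style understanding and visual execution (our naming, not the paper’s):

```python
def execution_gap(samples: list[dict]) -> float:
    """Fraction of constraints a model understands (VQA probe) but fails to execute visually."""
    understood = [s for s in samples if s["understood_vqa"]]
    if not understood:
        return 0.0
    not_executed = [s for s in understood if not s["executed_visually"]]
    return len(not_executed) / len(understood)

# Example: 3 of 4 understood constraints were not executed -> gap of 0.75.
samples = [
    {"understood_vqa": True,  "executed_visually": False},
    {"understood_vqa": True,  "executed_visually": False},
    {"understood_vqa": True,  "executed_visually": True},
    {"understood_vqa": True,  "executed_visually": False},
    {"understood_vqa": False, "executed_visually": False},
]
print(execution_gap(samples))  # 0.75
```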
- LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
This paper introduces LiveMedBench, a medical benchmark designed to address two core weaknesses of existing evaluations: data contamination and temporal misalignment with evolving medical knowledge. The dataset currently includes 2,756 real-world clinical cases across 38 specialties and multiple languages, paired with 16,702 case-specific rubric criteria, constructed via a Multi-Agent Clinical Curation Framework that verifies clinical plausibility against authoritative guidelines. Instead of lexical overlap or holistic LLM-as-a-Judge scoring, the authors propose an Automated Rubric-based Evaluation Framework, which decomposes the grading of each response into granular, weighted criteria and shows significantly stronger alignment with human experts (Pearson ρ = 0.54 vs. 0.26 for LLM-as-a-Judge). Across 38 evaluated LLMs, the best model achieves only 39.2%, and 84% of models degrade on post-cutoff cases, confirming contamination and knowledge-obsolescence effects. Error analysis reveals that the dominant bottleneck is not factual recall but contextual application, with 35–48% of failures arising from an inability to tailor medical knowledge to patient-specific constraints.
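A rubric-based scorer of this kind is straightforward to sketch. The criterion format, the judge interface, and the weighted aggregation below are assumptions in the spirit of the paper, not its implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str   # e.g., "adjusts dosing for the patient's renal impairment"
    weight: float      # relative clinical importance

def rubric_score(response: str, criteria: list[Criterion],
                 judge: Callable[[str, str], bool]) -> float:
    """Weighted fraction of case-specific criteria the response satisfies.

    judge(response, criterion_description) returns True if the criterion is met;
    in practice this would be an LLM call constrained to a single yes/no decision.
    """
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if judge(response, c.description))
    return earned / total if total else 0.0
```

Scoring each fine-grained criterion independently, rather than issuing one holistic grade, is what makes the aggregate easier to align with expert judgment.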
- Towards Autonomous Mathematics Research
This paper introduces Aletheia, a math research agent built on Gemini Deep Think that iteratively generates, verifies, and revises long-form proofs in natural language, aiming to move from Olympiad-level problem solving to research mathematics. Aletheia combines advanced inference-time scaling, tool use (search and browsing), and a generator–verifier–reviser loop, achieving 95.1% accuracy on IMO-Proof Bench (advanced) and strong results on PhD-level exercises. Beyond benchmarks, it contributed to several research outputs, including a fully AI-generated paper on eigenweights in arithmetic geometry, human–AI collaborative papers, and semi-autonomous evaluation of 700 Erdős problems, resolving four open questions. However, large-scale auditing shows most autonomous attempts remain flawed, with only 6.5% meaningfully correct among 200 vetted Erdős candidates, highlighting reliability limits. The authors propose a taxonomy of Autonomous Mathematics Research Levels to standardize transparency around AI contribution and mathematical significance.
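The generator–verifier–reviser pattern itself is simple to write down. A minimal sketch, with the model interfaces and stopping rule as placeholders (the actual system runs on Gemini Deep Think with search and browsing tools):

```python
def attempt_proof(problem: str, generate, verify, revise, max_rounds: int = 5):
    """Iteratively generate, verify, and revise a long-form natural-language proof.

    generate(problem) and revise(problem, proof, critique) return candidate proofs;
    verify(problem, proof) returns (accepted, critique). All three stand in for
    model calls in this sketch.
    """
    proof = generate(problem)
    for _ in range(max_rounds):
        accepted, critique = verify(problem, proof)
        if accepted:
            return proof                               # verifier signs off on the proof
        proof = revise(problem, proof, critique)       # critique feeds the next revision
    return None                                        # unresolved within the round budget
```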
Where we’ll be
🔹 LLM Researchers Happy Hour
📍 Mountain View, California | 🗓️ March 5
Join Turing co-founders Jonathan Siddharth and Vijay Krishnan, along with Foundation Capital’s Ashu Garg, for an evening of discussion on the future of LLMs and AI with researchers and leaders from OpenAI, Anthropic, Meta, Google, Microsoft, and more.
Stay ahead with AGI Advance
Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.
Ready to optimize your model for real-world needs?
Partner with Turing to fine-tune, validate, and deploy models that learn continuously.


