This week’s edition focuses on the next frontier of multimodal evaluation. Turing built a 200+ task ImageQA dataset curated by PhDs across 20+ STEM disciplines, designed to stress-test models on spatial, symbolic, and scientific reasoning. We also celebrate our partnership with Anthropic to bring Claude Enterprise into production systems, and dig into research updates spanning Google’s Gemini 3 Deep Think, the dLLM diffusion language modeling framework, and a decoding-time strategy that deterministically eliminates hallucinated packages.
What we're doing
This week, we’re highlighting how Turing built a high-difficulty ImageQA dataset designed to stress-test multimodal models on spatial, symbolic, and scientific reasoning. Unlike standard VQA datasets, which center on captioning or object recognition, every task in this set demands deep, image-dependent reasoning across STEM domains.
Here’s what we delivered:
- 200+ PhD-authored ImageQA tasks spanning 20+ STEM-heavy disciplines
- A 3-phase validation pipeline: expert creation, peer expert verification, and adversarial search testing (sketched below)
- 100% client acceptance, with every question meeting strict image-dependency, reasoning depth, and non-searchability criteria
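To picture how those three phases gate delivery, here is a minimal illustrative sketch; it is not Turing’s internal tooling, and the field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ImageQATask:
    question: str
    image_path: str
    discipline: str
    expert_verified: bool = False   # phase 2: peer expert verification passed
    survives_search: bool = False   # phase 3: adversarial search testing passed

def accept(task: ImageQATask) -> bool:
    # A task ships only if it clears both downstream gates; phase 1
    # (expert creation) is implied by the task record existing at all.
    return task.expert_verified and task.survives_search
```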
💡 Multimodal intelligence isn’t about describing images. It’s about reasoning through them.
What we're celebrating
🎉 Turing × Anthropic
We’re excited to announce that Turing is joining Anthropic as a launch partner to customize Claude Enterprise for complex, real-world enterprise agents.
Together with Anthropic, we’re embedding Claude directly into enterprise systems, designing human-in-the-loop workflows, and deploying agentic AI with the governance, oversight, and operational rigor required to scale.
What we're reading
- Gemini 3 Deep Think: Advancing Science, Research and Engineering
Google has released a major upgrade to Gemini 3 Deep Think, its specialized reasoning mode built for complex scientific and engineering challenges. Designed in collaboration with researchers, the system targets open-ended problems with incomplete data and unclear solution paths, combining deep theoretical reasoning with practical application.
On rigorous benchmarks, Deep Think achieves 48.4% on Humanity’s Last Exam (no tools), 84.6% on ARC-AGI-2, an Elo of 3455 on Codeforces, and gold-medal-level performance on the 2025 International Math, Physics, and Chemistry Olympiads. It also demonstrates strength in advanced theoretical physics with 50.5% on CMT-Benchmark.
- dLLM: Simple Diffusion Language Modeling
This paper introduces dLLM, an open-source framework that standardizes the full diffusion language model (DLM) pipeline (training, inference, and evaluation) into a modular, extensible system. While recent DLMs share common components, implementations are fragmented and difficult to reproduce; dLLM unifies Masked Diffusion (MDLM), Block Diffusion (BD3LM), and related variants under a consistent HuggingFace-based interface.
Beyond reproducing and finetuning open-weight models like LLaDA and Dream, dLLM provides minimal recipes for building small DLMs from scratch. Notably, it demonstrates that both BERT-style encoders and autoregressive LMs can be converted into functional DLMs using only lightweight SFT, without architectural changes or large-scale retraining. The framework also integrates efficient inference methods (e.g., Fast-dLLM) and a unified evaluation pipeline that reproduces official benchmark results.
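To make the conversion idea concrete, here is a minimal sketch of masked-diffusion-style SFT on a stock HuggingFace masked-LM encoder. The model choice, masking schedule, and training loop are our illustrative assumptions, not dLLM’s actual API:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical starting point: any HuggingFace masked-LM encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def diffusion_sft_step(input_ids):
    # Sample a masking rate t ~ U(0, 1), the hallmark of masked diffusion
    # (vs. BERT's fixed 15%); a real recipe would also skip special tokens.
    t = torch.rand(()).item()
    mask = torch.rand(input_ids.shape) < t
    noisy = input_ids.clone()
    noisy[mask] = tokenizer.mask_token_id
    labels = input_ids.clone()
    labels[~mask] = -100            # compute loss on masked positions only
    loss = model(input_ids=noisy, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The only real change from ordinary masked-LM finetuning is the randomized masking rate, which is why lightweight SFT suffices to repurpose an existing encoder.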
- PackMonitor: Enabling Zero Package Hallucinations Through Decoding-Time Monitoring
This paper presents PackMonitor, a plug-and-play framework that eliminates package hallucinations in LLM-generated installation commands by enforcing validity at decoding time. Instead of reducing hallucinations probabilistically, PackMonitor constrains generation using a DFA built from authoritative package lists (e.g., PyPI), masking any token path that would produce a non-existent package.
Across five LLMs and two benchmarks, PackMonitor cuts hallucination rates from as high as 16–26% down to 0%, adds only ~0.05–0.3s of latency, and preserves general coding performance (no drop on HumanEval). The results show that when validity is decidable, hallucinations can be deterministically prevented rather than merely mitigated.
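As a rough sketch of the core decoding-time idea (not the paper’s implementation), the code below masks logits so no sampling path can ever spell a name outside a known whitelist; the toy whitelist, prefix-set stand-in for the DFA, and vocabulary handling are all simplifying assumptions:

```python
import torch

# Stand-in whitelist; PackMonitor builds its DFA from authoritative
# registries such as a full PyPI snapshot.
VALID_PACKAGES = {"numpy", "requests", "scipy"}

def build_prefix_set(packages):
    # All prefixes of all valid names (including the full names);
    # these act as the DFA's live states.
    return {name[:i] for name in packages for i in range(1, len(name) + 1)}

PREFIXES = build_prefix_set(VALID_PACKAGES)

def mask_invalid_tokens(logits, partial_name, id_to_token):
    # Zero out (set to -inf) any token whose concatenation with the
    # package name generated so far would leave the valid-prefix set.
    masked = logits.clone()
    for token_id, token_str in id_to_token.items():
        if partial_name + token_str not in PREFIXES:
            masked[token_id] = float("-inf")
    return masked
```

Presumably the real system only activates this constraint inside the package-name span of an install command and handles tokenizer edge cases the prefix set above glosses over; the key property is the same, though: invalid continuations get probability zero, not merely low probability.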
Where we’ll be
🔹 LLM Researchers Happy Hour
📍 Mountain View, California | 🗓️ March 5
Join Turing co-founders Jonathan Siddharth and Vijay Krishnan, along with Foundation Capital’s Ashu Garg, for an evening of discussion on the future of LLMs and AI with researchers and leaders from OpenAI, Anthropic, Meta, Google, Microsoft, and more.
Stay ahead with AGI Advance
Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.
Ready to Optimize Your Model for Real-World Needs?
Partner with Turing to fine-tune, validate, and deploy models that learn continuously.