When frontier models converge on the same benchmark scores, differentiation disappears, even if real capability hasn’t plateaued. This week, we introduce HLE++, a calibrated STEM dataset engineered to restore measurable pass@k separation across graduate-to-PhD difficulty bands. We also celebrate Open RL hitting #1 on Hugging Face, and unpack GPT-5.4’s release alongside new research on autonomous agents, reasoning efficiency, and real-world governance failures.
What we're doing
As Humanity’s Last Exam (HLE) benchmarks saturate, progress becomes invisible. At Turing, we built HLE++ to restore headroom: calibrated, graduate-to-PhD-level STEM problems designed to preserve measurable pass@k separation on frontier systems.
Here’s what we're offering:
- Packs of 5,000+ graduate-to-PhD-level problems, validated on Opus 4.5 Extended and GPT-5.2 Thinking
- Deterministic, single-answer formats, 100% original and search-resistant
- Calibrated difficulty bands, including pass@8 ≈ 0 headroom sets for SFT and low-positive bands for RL
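To make the "pass@8 ≈ 0" and "low-positive" bands concrete, here is a minimal sketch of the standard unbiased pass@k estimator (Chen et al.'s formulation, widely used for code and reasoning benchmarks): given n sampled generations of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k). The specific n and thresholds below are illustrative assumptions, not HLE++'s calibration procedure.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# "pass@8 ≈ 0 headroom" band: no correct samples among n generations
print(pass_at_k(n=100, c=0, k=8))              # 0.0

# "low-positive" band for RL: rare successes still yield usable reward signal
print(round(pass_at_k(n=100, c=1, k=8), 3))    # ≈ 0.08
```

The headroom band (pass@8 ≈ 0) leaves room for SFT to teach new capability, while the low-positive band gives RL a sparse but nonzero reward signal to amplify.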
💡 When benchmarks saturate, measured progress plateaus even as capability grows. HLE++ restores measurable signal with evaluation-safe, frontier-calibrated data built for real separation.
What we're celebrating
🎉 Open RL is live and trending #1 on Hugging Face
We released Open RL, a collection of HLE-grade, objectively verifiable STEM reasoning tasks across physics, math, biology, and chemistry, and it’s already #1 trending on Hugging Face.
If we want models that truly reason, our benchmarks must be rigorous, reproducible, and objectively checkable. Open RL is built to raise that standard.
What we're reading
- Introducing GPT‑5.4
OpenAI introduces GPT-5.4 (and GPT-5.4 Pro) as a unified frontier model optimized for professional knowledge work, coding, and agentic tool use across ChatGPT, the API, and Codex. A core contribution is native computer-use capability for general-purpose automation across software and web workflows, alongside up to 1M tokens of context in Codex/API for long-horizon planning, execution, and verification. GPT-5.4 sets new or improved results on agentic and tool benchmarks, including GDPval 83.0% (wins or ties), OSWorld-Verified 75.0%, Toolathlon 54.6%, and BrowseComp 82.7%, while matching or slightly improving coding performance on SWE-Bench Pro (57.7%) relative to GPT-5.3-Codex. The release also adds tool search, reducing token overhead in tool-heavy settings (reported 47% token reduction on MCP-Atlas tasks at equal accuracy), and improves factuality (reported 33% fewer false claims vs GPT-5.2 on flagged prompts). Deployment includes expanded cyber safeguards under OpenAI’s Preparedness Framework, with GPT-5.4 positioned as a more capable, more token-efficient foundation for real-world agent workflows.
- Agents of Chaos
This paper reports a two-week live red-teaming study of autonomous LLM agents with persistent memory, email, Discord, shell access, and tool use. Across 11 case studies, researchers observed unauthorized compliance, sensitive data disclosure, denial-of-service, identity spoofing, indirect prompt injection, and cross-agent failure propagation. The failures stem from structural gaps: no stakeholder model, weak identity verification, limited self-awareness of competence and resources, and poor handling of multi-agent dynamics. The study argues that realistic agent deployments introduce governance and accountability risks not captured by static benchmarks, calling for stronger identity, authorization, and oversight frameworks.
- Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
This paper introduces Draft-Thinking, a training framework that reduces the cost of long chain-of-thought reasoning without sacrificing accuracy. Instead of compressing outputs post hoc, it teaches models to internalize concise “draft-style” reasoning via supervised distillation and staged reinforcement learning. On MATH500, it cuts reasoning tokens by 82.6% with only a 2.6% drop in accuracy, achieving a 5.6× improvement in token efficiency. An adaptive prompting mode further allows models to switch between concise and deep reasoning based on problem difficulty, making reasoning budget a flexible, model-controlled behavior.
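The reported figures are mutually consistent under one plausible reading: if "token efficiency" means accuracy per reasoning token relative to the baseline, and the 2.6% accuracy drop is relative (both assumptions on our part, not stated by the paper), the arithmetic recovers the 5.6× claim:

```python
# Sanity-check the Draft-Thinking numbers. Assumptions (ours, not the
# paper's): token efficiency = accuracy-per-token vs. baseline, and the
# 2.6% accuracy drop is relative rather than absolute.
token_fraction = 1.0 - 0.826      # 82.6% fewer reasoning tokens
accuracy_fraction = 1.0 - 0.026   # 2.6% relative accuracy drop

efficiency_gain = accuracy_fraction / token_fraction
print(f"{efficiency_gain:.1f}x")  # ≈ 5.6x, matching the reported figure
```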
Stay ahead with AGI Advance
Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.
Ready to Optimize Your Model for Real-World Needs?
Partner with Turing to fine-tune, validate, and deploy models that learn continuously.

