This week’s AGI Advance shares how Turing stress-tested frontier LLMs with a 2,000-sample LSAT-grade dataset, uncovering reasoning failures in logic, inference, and comprehension. The effort achieved a 97% acceptance rate and mapped 20+ failure types.
We’re also celebrating the release of Apriel-1.5-15B-Thinker, a compact model developed by ServiceNow that rivals DeepSeek-R1 at 1/40th its size. Additionally, we dive into research on coding agents in complex codebases, the limits of reasoning in physics tasks, and why exponential AI progress is still widely underestimated.
This week, we’re spotlighting how Turing built a benchmark-grade LSAT dataset to uncover reasoning blind spots in frontier LLMs. Designed to push models beyond pattern recognition, this dataset targeted weaknesses in logic games, reading comprehension, and multi-step argumentative reasoning.
Here’s what we’re seeing:
🎉Turing × ServiceNow: Apriel-1.5-15B-Thinker
ServiceNow released a 15B-parameter model that matches DeepSeek-R1-0528’s performance at just 1/40th the size. It runs on a single GPU and already rivals frontier models, scoring 52 on the Artificial Analysis Intelligence Index and 62 on IFBench. The model hasn’t even undergone RL training yet. Turing supported this effort with high-quality tuning data across code, agentic tasks, and complex reasoning. Hats off to the teams on both sides for pushing compact model capabilities forward.
Turing will be at two major AI conferences in the coming months. Join us to discuss the future of AGI!
If you’re attending, reach out—we’d love to connect and exchange insights!
Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.
Partner with Turing to fine-tune, validate, and deploy models that learn continuously.