This week’s AGI Advance shares how Turing stress-tested frontier LLMs with a 2,000-sample LSAT-grade dataset, uncovering reasoning failures in logic, inference, and comprehension. The effort achieved a 97% acceptance rate and mapped 20+ failure types.
We’re also celebrating the release of Apriel-1.5-15B-Thinker, a compact model developed by ServiceNow that rivals DeepSeek-R1 at 1/40th its size. Additionally, we dive into research on coding agents in complex codebases, the limits of reasoning in physics tasks, and why exponential AI progress is still widely underestimated.
This week, we’re spotlighting how Turing built a benchmark-grade LSAT dataset to uncover reasoning blind spots in frontier LLMs. Designed to push models beyond pattern recognition, this dataset targeted weaknesses in logic games, reading comprehension, and multi-step argumentative reasoning.
Here’s what we’re seeing:
🎉Turing × ServiceNow: Apriel-1.5-15B-Thinker
ServiceNow released a 15B-parameter model that matches DeepSeek-R1-0528’s performance at just 1/40th the size. It runs on a single GPU and already rivals frontier models, scoring 52 on the Artificial Analysis Intelligence Index and 62 on IFBench. The model hasn’t even undergone RL training yet. Turing supported this effort with high-quality tuning data across code, agentic tasks, and complex reasoning. Hats off to the teams on both sides for pushing compact model capabilities forward.
Turing will be at two major AI conferences in the coming months. Join us to discuss the future of AGI!
If you’re attending, reach out—we’d love to connect and exchange insights!
Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.
Partner with Turing to fine-tune, validate, and deploy models that learn continuously.