AGI Advance: Weekly AI & AGI Insights (Nov 4, 2025)

Turing Staff
05 Nov 2025 · 3 min read

This week, we’re spotlighting how Turing built a 7,000+ sample SlideVQA dataset to stress-test multimodal models on real business and STEM visuals, from misread charts to flawed floorplan logic. On the research front, we explore LIMI, an agent model trained on just 78 demonstrations that outperforms frontier agents; a precision-mismatch fix for RL fine-tuning; and a benchmark that challenges models to autonomously conduct real-world LLM research.

What we're doing

In this engagement, Turing helped a frontier AI lab build over 7,000 expert-verified SlideVQA tasks to benchmark and fine-tune LMMs for real-world slide reasoning. Each task was designed to surface model failures in visual grounding, multi-hop reasoning, and layout understanding across business, STEM, and finance decks.

Here’s what we’re seeing:

  • 7,000+ model-breaking prompts: Generated across 20+ knowledge domains using realistic slide visuals, including charts, diagrams, blueprints, and tables.
  • 100% visual grounding required: All prompts require visual references, such as charts, infographics, maps, or diagrams.
  • 3-tier difficulty structure: Tasks labeled Easy, Medium, or Hard based on logical steps, visual complexity, and cross-slide dependencies.

💡 From chart misreads to layout confusion, this dataset exposes how models perceive real-world slides, and where their reasoning breaks down.

Read the full case study

What we're reading

  • LIMI: Less is More for Agency
    LIMI is a new agentic LLM that dramatically outperforms larger models on autonomous task execution using only 78 training examples. Trained on real-world collaborative tasks like coding and scientific workflows, LIMI achieved 73.5% on AgencyBench, outperforming models like GLM-4.5 (45.1%) and Kimi-K2-Instruct (24.1%). It also generalized well across tool use and reasoning benchmarks, including TAU2 and SciCode. The secret? Not more data, but strategically curated, high-quality agentic demonstrations. This challenges the “more is better” paradigm and offers a blueprint for data-efficient agent training.
  • Defeating the Training-Inference Mismatch via FP16
Researchers show that instability in RL fine-tuning for LLMs stems from a precision mismatch between training and inference engines, specifically when using BF16. Their solution is simple: switch to FP16. Across multiple algorithms (GRPO, GSPO, PG), frameworks (VeRL, Oat), model families (Qwen, OctoThinker), and architectures (LoRA, MoE), FP16 yielded more stable optimization, faster convergence, and higher evaluation scores. On benchmarks where perfect scores are attainable, it enabled 99% training accuracy, outperformed BF16 even with complex algorithmic patches, and generalized better on evals like AIME 2024.
  • InnovatorBench: Evaluating Agents’ Ability to Conduct Innovative LLM Research
    This benchmark tests whether LLM agents can autonomously complete research-grade tasks like dataset creation, reward shaping, and scaffold design, based on actual NeurIPS and ICLR papers. Using the ResearchGym platform, models like Claude 4, GPT-5, and GLM-4.5 were evaluated across 20 multi-day challenges. While Claude led overall, all models struggled with long-horizon reasoning and fragile execution, highlighting the gap between current agent capabilities and real-world scientific research.

Where we’ll be

Turing will be at this major AI conference in the coming month—join us to discuss the future of AGI:

  • NeurIPS 2025
    [Mexico City | Nov 30 – Dec 5]
    [San Diego Convention Center | Dec 2 – 7]

    The Neural Information Processing Systems Foundation is a non-profit that promotes research in AI and ML by organizing a leading annual conference focused on ethical, diverse, and interdisciplinary collaboration.

If you’re attending, reach out—we’d love to connect and exchange insights!

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

[Subscribe & Read More]

Ready to Optimize Your Model for Real-World Needs?

Partner with Turing to fine-tune, validate, and deploy models that learn continuously.

Optimize Continuously