AGI Advance: Weekly AI & AGI Insights (Nov 4, 2025)

Turing Staff
05 Nov 2025 · 3 min read

This week, we’re spotlighting how Turing built a 7,000+ sample SlideVQA dataset to stress-test multimodal models on real business and STEM visuals, from misread charts to flawed floorplan logic. On the research front, we explore LIMI, an agent model trained on just 78 demonstrations that outperforms frontier agents; a precision-mismatch fix for RL fine-tuning; and a benchmark that challenges models to autonomously conduct real-world LLM research.

What we're doing

In this engagement, Turing helped a frontier AI lab build over 7,000 expert-verified SlideVQA tasks to benchmark and fine-tune LMMs for real-world slide reasoning. Each task was designed to surface model failures in visual grounding, multi-hop reasoning, and layout understanding across business, STEM, and finance decks.

Here’s what we’re seeing:

  • 7,000+ model-breaking prompts: Generated across 20+ knowledge domains using realistic slide visuals, including charts, diagrams, blueprints, and tables.
  • 100% visual grounding required: All prompts require visual references, such as charts, infographics, maps, or diagrams.
  • 3-tier difficulty structure: Tasks labeled Easy, Medium, or Hard based on logical steps, visual complexity, and cross-slide dependencies.

💡 From chart misreads to layout confusion, this dataset exposes how models perceive real-world slides, and where their reasoning breaks down.

Read the full case study

What we're reading

  • LIMI: Less is More for Agency
    LIMI is a new agentic LLM that dramatically outperforms larger models on autonomous task execution using only 78 training examples. Trained on real-world collaborative tasks like coding and scientific workflows, LIMI achieved 73.5% on AgencyBench, outperforming models like GLM-4.5 (45.1%) and Kimi-K2-Instruct (24.1%). It also generalized well across tool use and reasoning benchmarks, including TAU2 and SciCode. The secret? Not more data, but strategically curated, high-quality agentic demonstrations. This challenges the “more is better” paradigm and offers a blueprint for data-efficient agent training.
  • Defeating the Training-Inference Mismatch via FP16
Researchers show that instability in RL fine-tuning for LLMs stems from a precision mismatch between training and inference engines, specifically when using BF16. Their solution is simple: switch to FP16. Across multiple algorithms (GRPO, GSPO, PG), frameworks (VeRL, Oat), model families (Qwen, OctoThinker), and architectures (LoRA, MoE), FP16 yielded more stable optimization, faster convergence, and higher evaluation scores. On benchmarks where perfect scores are attainable, it enabled 99% training accuracy, outperformed BF16 even with complex algorithmic patches, and generalized better on evals like AIME 2024.
  • InnovatorBench: Evaluating Agents’ Ability to Conduct Innovative LLM Research
    This benchmark tests whether LLM agents can autonomously complete research-grade tasks like dataset creation, reward shaping, and scaffold design, based on actual NeurIPS and ICLR papers. Using the ResearchGym platform, models like Claude 4, GPT-5, and GLM-4.5 were evaluated across 20 multi-day challenges. While Claude led overall, all models struggled with long-horizon reasoning and fragile execution, highlighting the gap between current agent capabilities and real-world scientific research.

Where we’ll be

Turing will be at this major AI conference in the coming month—join us to discuss the future of AGI:

  • NeurIPS 2025
    [Mexico City | Nov 30 – Dec 5]
    [San Diego Convention Center | Dec 2 – 7]

    The Neural Information Processing Systems Foundation is a non-profit that promotes research in AI and ML by organizing a leading annual conference focused on ethical, diverse, and interdisciplinary collaboration.

If you’re attending, reach out—we’d love to connect and exchange insights!

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

[Subscribe & Read More]

Ready to Optimize Your Model for Real-World Needs?

Partner with Turing to fine-tune, validate, and deploy models that learn continuously.

Optimize Continuously