AGI Advance: Weekly AI & AGI Insights (Dec 23, 2025)

Turing Staff
23 Dec 2025 · 4 mins read

This week’s edition explores a new frontier in training data. With the launch of Project Lazarus, we’re preserving the full operational history of defunct startups, including codebases, design docs, testing artifacts, and infrastructure traces, to train agents on the messy, high-stakes reality of building real products. We also discuss how Mistral OCR 3 redefines document parsing for agents, how AWS and UW’s skill-learning RL framework lets agents improve from experience, and how new research warns of homogenized LLM outputs even in open-ended tasks.

What we're doing

Last week, we launched Project Lazarus, a groundbreaking initiative to preserve the complete operational history of defunct companies as training data for frontier models. While today’s datasets are optimized for publication, instruction, or synthetic control, they fail to capture the texture of real work: incomplete specs, deadline-driven tradeoffs, and human judgment under pressure.

Here’s what we’re doing:

  • Unfiltered reality, not synthetic instruction: We’re acquiring full private repositories, design docs, testing logs, infrastructure manifests, and internal wikis from defunct startups across FinTech, pharmaceuticals, manufacturing, and SaaS.
  • Reconstructing end-to-end complexity: This data provides causal traces of actual work, making it possible to train agents that can reason over months of system evolution, handle partial information, and debug ambiguous specs.
  • Foundations for autonomous agents: The result is a pretraining and benchmarking corpus where LLMs can learn multi-turn, tool-rich workflows not from toy environments, but from how real companies actually shipped.

💡 The world doesn’t need more synthetic examples; it needs data grounded in reality. Project Lazarus is how we preserve it.

Learn more

What we're celebrating

🎉 Databricks × Turing: OfficeQA for Grounded Enterprise Reasoning

Databricks released OfficeQA, a benchmark for long-context, cross-document reasoning on real-world PDFs, built with question contributions from Turing. The dataset spans 246 QA pairs over 89,000+ pages of U.S. Treasury Bulletins, testing AI agents on analytical depth, retrieval precision, and answer grounding. It sets a new bar for enterprise-relevant evaluation.

Explore the benchmark

What we're reading

  • Introducing Mistral OCR 3
    Mistral OCR 3 sets a new bar in document intelligence with a 74% win rate over its predecessor on forms, handwriting, low-quality scans, and complex tables. Unlike OCR tools tuned for narrow use cases, OCR 3 generalizes across enterprise documents, reconstructing table layouts with HTML fidelity, parsing forms with handwritten entries, and digitizing noisy historical records. Available via API or drag-and-drop in the Document AI Playground, it's now being used to power invoice automation, archive digitization, and agent-ready document processing at scale. A minimal API sketch follows this reading list.
  • Reinforcement Learning for Self-Improving Agent with Skill Library
    AWS and the University of Wisconsin introduce SAGE (Skill-Augmented GRPO for self-Evolution), an RL framework that helps LLM agents accumulate and reuse executable skills across multi-task chains. Applied to AppWorld, SAGE uses Sequential Rollout and a novel Skill-Integrated Reward to train agents that learn from experience, storing reusable function-call chains in a skill library. The result: a 60.7% scenario completion rate on unseen tasks, an 8.9% gain over prior RL baselines, achieved with 59% fewer tokens and fewer steps. As agents move from static prompts to lifelong learning, SAGE charts a path toward long-horizon, tool-using intelligence. A toy skill-library sketch appears after this list.
  • Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
    This paper introduces the Artificial Hivemind problem, showing that large language models frequently converge on the same ideas when answering open-ended questions, both within a single model and across different model families. To study this systematically, the authors release INFINITY-CHAT, a dataset of ~26K real-world open-ended prompts paired with a new taxonomy and 31K+ dense human annotations capturing pluralistic preferences. Using embedding-based similarity metrics, they find that even with high-temperature or diversity-oriented decoding, responses often cluster tightly, with average similarities exceeding 0.8 within models and 0.7–0.8 across models. The analysis also reveals that reward models and LLM judges struggle to align with human judgments when multiple answers are equally valid, even when those answers are of comparable overall quality. Together, the work exposes a structural risk of homogenized AI outputs and provides a benchmark for studying diversity, pluralism, and long-term alignment in open-ended generation. A short similarity-scoring sketch appears after this list.
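
For readers who want to try the document-parsing workflow from the first item, here is a minimal sketch. It assumes the OCR endpoint already exposed in the Mistral Python SDK (client.ocr.process) also serves OCR 3; the model identifier and document URL below are placeholders, not confirmed names.

```python
# Minimal sketch: parse a PDF with the Mistral OCR endpoint.
# Assumption: the SDK call used for earlier OCR releases also serves OCR 3;
# swap in the official OCR 3 model identifier from the Mistral docs.
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",  # placeholder model name
    document={
        "type": "document_url",
        "document_url": "https://example.com/sample-invoice.pdf",  # placeholder URL
    },
)

# Each parsed page is returned as markdown-style text, ready for downstream agents.
for page in ocr_response.pages:
    print(page.markdown)
```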
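
The skill library at the heart of SAGE can be pictured as a store of function-call chains recorded from successful rollouts and retrieved for related tasks. The sketch below is purely illustrative: the class names and the naive keyword retrieval are our assumptions, not the paper's implementation.

```python
# Hypothetical skill-library sketch in the spirit of SAGE: store function-call
# chains from successful rollouts and retrieve candidates for a new task.
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    description: str       # natural-language summary of what the chain accomplishes
    call_chain: list[str]   # ordered tool/function calls recorded from a rollout


@dataclass
class SkillLibrary:
    skills: list[Skill] = field(default_factory=list)

    def add(self, skill: Skill) -> None:
        """Store a skill after a successful rollout so later tasks can reuse it."""
        self.skills.append(skill)

    def retrieve(self, task: str, k: int = 3) -> list[Skill]:
        """Rank stored skills by naive keyword overlap with the task description."""
        task_words = set(task.lower().split())
        ranked = sorted(
            self.skills,
            key=lambda s: len(task_words & set(s.description.lower().split())),
            reverse=True,
        )
        return ranked[:k]


# Example: a chain recorded on one task is surfaced for a similar new task.
library = SkillLibrary()
library.add(Skill(
    name="create_playlist",
    description="create a playlist and add songs in the music app",
    call_chain=["music.create_playlist(...)", "music.add_song(...)"],
))
print([s.name for s in library.retrieve("add songs to a new playlist")])
```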
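
The homogeneity result rests on embedding-based similarity between responses sampled for the same open-ended prompt. Below is one way to compute such a score with sentence-transformers; the embedding model and toy responses are assumptions, not the paper's exact setup.

```python
# Illustrative homogeneity check: embed several sampled answers to one
# open-ended prompt and report their average pairwise cosine similarity.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

responses = [
    "Start small: automate one repetitive task and measure the time saved.",
    "Begin by automating a single repetitive task and tracking the time saved.",
    "Pick one recurring chore, automate it, and quantify the hours recovered.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = model.encode(responses, convert_to_tensor=True, normalize_embeddings=True)

pairwise = [
    util.cos_sim(embeddings[i], embeddings[j]).item()
    for i, j in combinations(range(len(responses)), 2)
]
print(f"average pairwise similarity: {sum(pairwise) / len(pairwise):.2f}")
# Values near the paper's 0.8 range indicate the answers express near-identical ideas.
```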

Stay ahead with AGI Advance

Turing is leading the charge in bridging AI research with real-world applications. Subscribe to AGI Advance for weekly insights into breakthroughs, research, and industry shifts that matter.

[Subscribe & Read More]

Ready to Optimize Your Model for Real-World Needs?

Partner with Turing to fine-tune, validate, and deploy models that learn continuously.

Optimize Continuously