Building a 1,500+ artifact benchmark to evaluate SOTA model performance across enterprise document formats
Turing built a large-scale artifact generation benchmark, generating and validating document and infographic artifacts across output formats, AI providers, and query complexity levels. The project delivered a model- and format-aware benchmark dataset with a 99.9% artifact acceptance rate, giving the client a clean signal for evaluating enterprise-grade artifact generation.
1,500+
artifacts generated and validated across PPTX, HTML, DOCX, PDF, JSON, Excel, CSV, TXT, and infographics.
99.9%
artifact acceptance rate achieved, reflecting strong execution discipline and format compliance enforcement at scale.
4
query complexity levels covered, testing model performance from basic generation through high-complexity enterprise instruction-following.

The challenge
The client needed a benchmark that could reliably test whether leading AI models could generate required artifact formats from realistic enterprise prompts, at different levels of complexity, across a diverse matrix of document and infographic types.
Key challenges included:
- Multi-provider execution consistency: Running identical benchmark queries across AI providers, including Claude, Gemini, OpenAI, Perplexity, Manus, and NanoBanana, using native provider UIs without automation, while maintaining controlled, one-session-per-query execution discipline.
- Format compliance validation: Verifying that every generated artifact matched the requested format and satisfied the layout, schema, and content constraints specific to its artifact type.
- Failure taxonomy and follow-up handling: Systematically managing the full range of generation failure modes, from wrong-format outputs and missing downloadable files to citation loss after download and provider clarification loops, without losing execution integrity.
- File-grounded artifact collection: Supporting prompts that required reference file uploads alongside queries, introducing additional execution complexity around file handling, upload sequencing, and generation timing.
The approach
Turing deployed a structured execution framework with artifact-specific QA, a standardized failure taxonomy, and detailed metadata logging across every provider run.
1. Four-level query complexity coverage
Benchmark queries were designed across four complexity types to stress-test frontier models at every level of instruction difficulty:
- Type 1: Short, topic-based prompts testing basic artifact generation and structural inference.
- Type 2: Structure, layout, chart, table, and visual style constraints testing instruction-following precision.
- Type 3: Specific data points, regulatory details, and company information testing content inclusion and organization.
- Type 4: High-specificity enterprise prompts combining detailed content requirements with multiple formatting constraints, restrictions, and citation requirements.
2. Native provider execution and clarification handling
All benchmark prompts were executed directly in native provider UIs, with one controlled session maintained per query-provider pair:
- Our experts submitted queries exactly as specified, without API automation or UI scripting
- Where providers asked clarifying questions, a minimal-response policy was applied and the clarification was logged
- Follow-up prompts were used where necessary to convert research or narrative outputs into the exact requested artifact format
- All generation timings, statuses, and operator comments were recorded per run
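A per-run log entry of the kind described above could be sketched as follows. This is a minimal illustration, not the project's actual schema: the `RunRecord` class, its field names, and the `to_log_line` helper are all assumptions chosen to mirror the items the text says were recorded (timings, statuses, operator comments, clarifications, follow-ups).

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class RunRecord:
    """One query-provider run. All field names are illustrative assumptions."""
    query_id: str
    provider: str
    model_version: str
    artifact_format: str       # e.g. "PPTX", "CSV"
    complexity_type: int       # 1-4, matching the four query types
    generation_seconds: float
    status: str                # e.g. "accepted" or "failed"
    clarification_asked: bool = False
    follow_up_used: bool = False
    operator_comment: str = ""

    def __post_init__(self) -> None:
        # Reject complexity values outside the four defined query types.
        if self.complexity_type not in (1, 2, 3, 4):
            raise ValueError("complexity_type must be 1, 2, 3, or 4")


def to_log_line(record: RunRecord) -> str:
    """Serialize a run as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one JSON line per run keeps the log append-only and easy to filter later by provider, format, or complexity type.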
3. Artifact-specific format compliance QA
Every generated artifact was validated against format-specific QA criteria before acceptance:
- PPTX artifacts were checked for correct slide count, layout constraints, and presence of required tables, charts, and citations
- HTML artifacts were verified for browser rendering, semantic structure, and accessibility constraints
- JSON artifacts were validated for schema compliance, required field presence, and nesting depth
- Excel artifacts were checked for required tabs, formula integrity, and chart or pivot table presence
- CSV, TXT, PDF, and DOCX artifacts were each assessed against their own structural and formatting requirements
Artifacts that failed format compliance checks were logged with a standardized failure code rather than accepted, ensuring the benchmark dataset reflected true generation success rates.
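As a rough illustration of what format-specific checks like these might look like for two of the simpler formats, the sketch below validates a JSON artifact for required fields and nesting depth, and a CSV artifact for required columns and consistent row width. The function names, parameters, and problem messages are hypothetical; the actual QA criteria used in the benchmark are not specified in this write-up.

```python
import csv
import io
import json


def nesting_depth(value) -> int:
    """Depth of nested dicts/lists; a scalar has depth 0."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


def check_json_artifact(text: str, required_fields: set, max_depth: int) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    missing = sorted(required_fields - set(data))
    if missing:
        problems.append(f"missing fields: {missing}")
    if nesting_depth(data) > max_depth:
        problems.append(f"nesting depth exceeds {max_depth}")
    return problems


def check_csv_artifact(text: str, required_columns: list) -> list:
    """Check the header row and that every row has the header's width."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    problems = []
    missing = [c for c in required_columns if c not in header]
    if missing:
        problems.append(f"missing columns: {missing}")
    # Line numbers are 1-based, so the first data row is line 2.
    ragged = [i for i, row in enumerate(rows[1:], start=2) if len(row) != len(header)]
    if ragged:
        problems.append(f"rows with wrong column count: {ragged}")
    return problems
```

Returning a list of problems rather than a single boolean makes it straightforward to map each problem to a failure code instead of silently rejecting the artifact.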
4. Structured failure taxonomy and metadata logging
Every failed or partial run was classified using a defined taxonomy of ten failure types: no artifact generated, wrong format, download failure, provider error, safety refusal, timeout, file upload failure, corrupt file, partial output, and clarification loop. This taxonomy enabled systematic analysis of failure patterns across providers and formats, rather than treating all failures as equivalent rejection events.
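One way to encode the ten failure types and count them by provider and format is sketched below. The enum member names follow the taxonomy listed above; the string codes, the `tally_failures` helper, and the tuple layout of its input are assumptions made for illustration only.

```python
from collections import Counter
from enum import Enum


class FailureType(Enum):
    """The ten failure types named in the taxonomy; string codes are assumed."""
    NO_ARTIFACT_GENERATED = "no_artifact_generated"
    WRONG_FORMAT = "wrong_format"
    DOWNLOAD_FAILURE = "download_failure"
    PROVIDER_ERROR = "provider_error"
    SAFETY_REFUSAL = "safety_refusal"
    TIMEOUT = "timeout"
    FILE_UPLOAD_FAILURE = "file_upload_failure"
    CORRUPT_FILE = "corrupt_file"
    PARTIAL_OUTPUT = "partial_output"
    CLARIFICATION_LOOP = "clarification_loop"


def tally_failures(runs) -> Counter:
    """Count failures per (provider, format, failure type).

    `runs` is an iterable of (provider, artifact_format, failure) tuples,
    where failure is a FailureType, or None for an accepted run.
    """
    counts = Counter()
    for provider, artifact_format, failure in runs:
        if failure is not None:
            counts[(provider, artifact_format, failure)] += 1
    return counts


# Usage with made-up placeholder runs (not benchmark data):
example_runs = [
    ("ProviderA", "CSV", FailureType.WRONG_FORMAT),
    ("ProviderA", "CSV", FailureType.WRONG_FORMAT),
    ("ProviderB", "PPTX", FailureType.DOWNLOAD_FAILURE),
    ("ProviderB", "PPTX", None),
]
example_counts = tally_failures(example_runs)
```

Keying the counts on the full (provider, format, failure type) triple is what lets failure patterns be compared across providers and formats instead of collapsing into a single rejection count.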
Key findings
Benchmark execution surfaced a consistent set of operational and format-specific patterns across providers and artifact types:
Operational failure patterns
| Pattern | Observed Behavior |
|---|---|
| Downloadable artifact issues | Some providers generated content successfully but did not immediately produce a downloadable file, requiring additional steps to retrieve the artifact |
| Follow-up prompts required | Several outputs needed a follow-up prompt to convert research or narrative content into the exact requested file format |
| Intermediate wrong format | Some providers first produced Markdown, PDF, or narrative summaries before generating the requested CSV, TXT, or other target format |
| Citation and source loss | Sources used during generation were not preserved in the downloaded artifact in some cases, particularly for TXT and research-heavy outputs |
| Clarifying questions | Certain providers asked clarifying questions before generating artifacts, especially for complex or underspecified prompts |
| Manual completion | A subset of artifacts required manual creation or operator completion where direct model output was not sufficient |
Model-level observations
| Model / Provider | Observed Behavior |
|---|---|
| Claude Deep Research 4.5 Opus | Strong coverage across document formats; occasionally produced research summaries before delivering the exact target artifact and sometimes asked clarifying questions |
| OpenAI 5.2 Pro Deep Research | Performed well on research-heavy outputs; some runs required follow-up handling to obtain downloadable final files |
| ChatGPT 5.2 Pro | Generated PPTX outputs directly in downloadable format across tracked runs |
| Gemini Pro Deep Research | Broad format coverage across business-document outputs; some cases required additional handling around downloadable output generation |
| Gemini Pro Canvas | Used specifically for PPTX artifact generation |
| M365 Research Agent / Think Deeper | Used across document formats; several runs reflected downloadable-file handling and provider workflow constraints |
| Perplexity Pro / Pro Deep Research | Used primarily for TXT and Markdown-style outputs; citation visibility after download was a recurring issue for some runs |
| Manus 1.6 Max / Max Wide Research | Handled PPTX and file-grounded document, PDF, and XLSX generation; runs reflected both direct artifact generation and cases requiring download or follow-up handling |
Format-level observations
| Format | Observed Behavior |
|---|---|
| PPTX | Largest tracked format group; also showed the highest rate of downloadable-output handling and manual completion cases |
| CSV | Higher-complexity prompts frequently produced Markdown or PDF summaries before the requested CSV was generated |
| TXT | Generally feasible across providers; citation visibility in downloaded files was a recurring issue for some provider runs |
| DOCX / PDF / XLSX | File-grounded workstream tasks showed strong acceptance rates, particularly when reference files were uploaded correctly |
| HTML / JSON / Excel | Generated across multiple providers; acceptance tracked at the format-compliance level |
Key results
- Delivered more than 1,500 validated artifacts, each logged with provider, model version, format type, query complexity, generation time, and QA status
- Achieved a 99.9% artifact acceptance rate, reflecting strong execution discipline and format compliance enforcement
- Maintained full complexity coverage across all document formats, enabling comparison of provider behavior from basic generation through high-complexity enterprise instruction-following
The outcome
The client received a clean, model- and format-aware benchmark dataset with granular execution metadata across every artifact. The benchmark provides reliable signal for evaluating enterprise artifact generation capability and for diagnosing where specific providers and formats fall short under realistic enterprise prompting conditions.
This foundation enables the client to:
- Compare artifact generation quality across providers at every complexity level, from basic generation to high-specificity enterprise instruction-following
- Identify format- and model-specific failure patterns using a structured taxonomy rather than aggregate pass/fail metrics
- Extend the benchmark framework to additional models, formats, or query types using a validated, repeatable execution workflow
- Make informed decisions about artifact generation service capabilities grounded in real execution data rather than provider benchmarks
Need benchmark execution across AI providers and artifact formats?
Request a sample benchmark dataset covering multi-format artifact generation, format compliance validation, and structured failure analysis across leading AI providers.
Request Sample
FAQ
What artifact formats and providers were covered?
The benchmark covered PPTX, HTML, DOCX, PDF, JSON, Excel, CSV, TXT, and infographics across providers, including Claude, Gemini, OpenAI, Microsoft 365, Perplexity, Flux, and NanoBanana.
What query complexity levels were included?
Four levels: open (short, topic-based), format-constrained (layout and structure requirements), content-specific (data points and regulatory details), and complex (high-specificity enterprise prompts with multiple constraints).
How were failures handled and classified?
Every failed or partial run was classified using a ten-category failure taxonomy covering no artifact generated, wrong format, download failure, provider error, safety refusal, timeout, file upload failure, corrupt file, partial output, and clarification loop.
What’s the NDA process?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How fast can I get a sample?
Within three business days after NDA execution.
Evaluating AI providers for enterprise artifact generation capability?
Work with Turing to design and execute format-aware benchmarks across providers, complexity levels, and document types.