Building a 1,500+ artifact benchmark to evaluate SOTA model performance across enterprise document formats

Turing built a large-scale benchmark for artifact generation, producing and validating document and infographic artifacts across output formats, AI providers, and query complexity levels. The project delivered a model- and format-aware benchmark dataset with a 99.9% artifact acceptance rate, giving the client a clean signal for evaluating enterprise-grade artifact generation.

  • 1,500+ artifacts generated and validated across PPTX, HTML, DOCX, PDF, JSON, Excel, CSV, TXT, and infographics.
  • 99.9% artifact acceptance rate achieved, reflecting strong execution discipline and format compliance enforcement at scale.
  • 4 query complexity levels covered, testing model performance from basic generation through high-complexity enterprise instruction-following.

Method: Evaluation
Domain: Multimodality
Dataset scale: 1,500+ artifacts
Capability: Benchmarks

The challenge

The client needed a benchmark that could reliably test whether leading AI models can produce the requested artifact formats from realistic enterprise prompts, at different levels of complexity, across a diverse matrix of document and infographic types.

Key challenges included:

  • Multi-provider execution consistency: Running identical benchmark queries across AI providers, including Claude, Gemini, OpenAI, Perplexity, Manus, and NanoBanana, using native provider UIs without automation, while maintaining controlled, one-session-per-query execution discipline.
  • Format compliance validation: Verifying that every generated artifact matched the requested format and satisfied the layout, schema, and content constraints specific to its artifact type.
  • Failure taxonomy and follow-up handling: Systematically managing the full range of generation failure modes, from wrong-format outputs and missing downloadable files to citation loss after download and provider clarification loops, without losing execution integrity.
  • File-grounded artifact collection: Supporting prompts that required reference file uploads alongside queries, introducing additional execution complexity around file handling, upload sequencing, and generation timing.

The approach

Turing deployed a structured execution framework with artifact-specific QA, a standardized failure taxonomy, and detailed metadata logging across every provider run.

1. Four-level query complexity coverage

Benchmark queries were designed across four complexity types to stress-test frontier models at every level of instruction difficulty:

  • Type 1: Short, topic-based prompts testing basic artifact generation and structural inference.
  • Type 2: Structure, layout, chart, table, and visual style constraints testing instruction-following precision.
  • Type 3: Specific data points, regulatory details, and company information testing content inclusion and organization.
  • Type 4: High-specificity enterprise prompts combining detailed content requirements with multiple formatting constraints, restrictions, and citation requirements.

2. Native provider execution and clarification handling

All benchmark prompts were executed directly in native provider UIs, with one controlled session maintained per query-provider pair:

  • Our experts submitted queries exactly as specified, without API automation or UI scripting
  • Where providers asked clarifying questions, a minimal-response policy was applied and the clarification was logged
  • Follow-up prompts were used where necessary to convert research or narrative outputs into the exact requested artifact format
  • All generation timings, statuses, and operator comments were recorded per run
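The one-session-per-pair rule above can be illustrated with a small sketch. This is not Turing's tooling (execution was manual, in native provider UIs); it only shows how a run sheet could be enumerated and checked for duplicate sessions. The `Session` class, the function names, and the query IDs are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Session:
    """One controlled session for a single query-provider pair (hypothetical type)."""
    query_id: str
    provider: str


def build_sessions(query_ids, providers):
    """Return exactly one Session per (query, provider) pair."""
    return [Session(q, p) for q, p in product(query_ids, providers)]


def find_duplicate_pairs(sessions):
    """Return pairs that appear more than once, which would break the one-session rule."""
    seen, duplicates = set(), set()
    for s in sessions:
        key = (s.query_id, s.provider)
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


# Hypothetical identifiers, for illustration only.
sessions = build_sessions(["q001", "q002"], ["Claude", "Gemini", "OpenAI"])
print(len(sessions))                   # 6 sessions: 2 queries x 3 providers
print(find_duplicate_pairs(sessions))  # set()
```

A run sheet built this way gives operators a fixed checklist, so a repeated or missing pair is visible before any results are compared.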

3. Artifact-specific format compliance QA

Every generated artifact was validated against format-specific QA criteria before acceptance:

  • PPTX artifacts were checked for correct slide count, layout constraints, and presence of required tables, charts, and citations
  • HTML artifacts were verified for browser rendering, semantic structure, and accessibility constraints
  • JSON artifacts were validated for schema compliance, required field presence, and nesting depth
  • Excel artifacts were checked for required tabs, formula integrity, and chart or pivot table presence
  • CSV, TXT, PDF, and DOCX artifacts were each assessed against their own structural and formatting requirements

Artifacts that failed format compliance checks were logged with a standardized failure code rather than accepted, ensuring the benchmark dataset reflected true generation success rates.
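As a rough illustration of what format-specific checks like these might look like, the sketch below covers two of the cases listed: JSON required fields and nesting depth, and a PPTX slide count. The function names and thresholds are assumptions, not the QA criteria used in the project, and the slide count relies on the common `ppt/slides/slideN.xml` part naming inside the PPTX zip package.

```python
import io
import json
import re
import zipfile


def json_depth(value):
    """Nesting depth of a parsed JSON value; scalars have depth 0."""
    if isinstance(value, dict):
        return 1 + max((json_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((json_depth(v) for v in value), default=0)
    return 0


def check_json(text, required_fields, max_depth):
    """Return a list of problems; an empty list means the artifact passes."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = [f"missing field: {f}" for f in required_fields if f not in data]
    depth = json_depth(data)
    if depth > max_depth:
        problems.append(f"nesting depth {depth} exceeds {max_depth}")
    return problems


# Assumed naming convention for slide parts inside the package.
SLIDE_PART = re.compile(r"ppt/slides/slide\d+\.xml")


def pptx_slide_count(file_bytes):
    """Count slide parts in a PPTX package, or return None if it is not a readable zip."""
    try:
        with zipfile.ZipFile(io.BytesIO(file_bytes)) as package:
            return sum(1 for name in package.namelist() if SLIDE_PART.fullmatch(name))
    except zipfile.BadZipFile:
        return None


# Hypothetical artifact content and required fields.
print(check_json('{"title": "Q3", "rows": [{"a": 1}]}', ["title", "rows", "sources"], max_depth=3))
# ['missing field: sources']
```

Returning a list of problems rather than a single pass/fail flag makes it straightforward to map each problem to a failure code when an artifact is rejected.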

4. Structured failure taxonomy and metadata logging

Every failed or partial run was classified using a defined taxonomy of ten failure types: no artifact generated, wrong format, download failure, provider error, safety refusal, timeout, file upload failure, corrupt file, partial output, and clarification loop. This taxonomy enabled systematic analysis of failure patterns across providers and formats, rather than treating all failures as equivalent rejection events.
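One way to picture the taxonomy is as an enumeration that every failed run must map to, with failures then tallied per provider and format. The sketch below is a minimal illustration under that assumption: the ten code names follow the list above, but the identifiers, the `tally_failures` helper, and the sample rows are hypothetical and are not results from the benchmark.

```python
from collections import Counter
from enum import Enum


class FailureCode(Enum):
    """The ten failure types named in the taxonomy (identifier spellings are assumed)."""
    NO_ARTIFACT_GENERATED = "no_artifact_generated"
    WRONG_FORMAT = "wrong_format"
    DOWNLOAD_FAILURE = "download_failure"
    PROVIDER_ERROR = "provider_error"
    SAFETY_REFUSAL = "safety_refusal"
    TIMEOUT = "timeout"
    FILE_UPLOAD_FAILURE = "file_upload_failure"
    CORRUPT_FILE = "corrupt_file"
    PARTIAL_OUTPUT = "partial_output"
    CLARIFICATION_LOOP = "clarification_loop"


def tally_failures(failed_runs):
    """Count failures per (provider, format, code) instead of one aggregate reject count."""
    return Counter((provider, fmt, code) for provider, fmt, code in failed_runs)


# Made-up rows for illustration; these are not results from the benchmark.
failed_runs = [
    ("provider_a", "CSV", FailureCode.WRONG_FORMAT),
    ("provider_a", "CSV", FailureCode.WRONG_FORMAT),
    ("provider_b", "TXT", FailureCode.DOWNLOAD_FAILURE),
]
for key, count in tally_failures(failed_runs).most_common():
    print(key, count)
```

Using a closed enumeration means an operator cannot log a free-text failure reason, which keeps counts comparable across providers and formats.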

Key findings

Benchmark execution surfaced a consistent set of operational and format-specific patterns across providers and artifact types:

Operational failure patterns

| Pattern | Observed Behavior |
| --- | --- |
| Downloadable artifact issues | Some providers generated content successfully but did not immediately produce a downloadable file, requiring additional steps to retrieve the artifact |
| Follow-up prompts required | Several outputs needed a follow-up prompt to convert research or narrative content into the exact requested file format |
| Intermediate wrong format | Some providers first produced Markdown, PDF, or narrative summaries before generating the requested CSV, TXT, or other target format |
| Citation and source loss | Sources used during generation were not preserved in the downloaded artifact in some cases, particularly for TXT and research-heavy outputs |
| Clarifying questions | Certain providers asked clarifying questions before generating artifacts, especially for complex or underspecified prompts |
| Manual completion | A subset of artifacts required manual creation or operator completion where direct model output was not sufficient |

Model-level observations

| Model / Provider | Observed Behavior |
| --- | --- |
| Claude Deep Research 4.5 Opus | Strong coverage across document formats; occasionally produced research summaries before delivering the exact target artifact and sometimes asked clarifying questions |
| OpenAI 5.2 Pro Deep Research | Performed well on research-heavy outputs; some runs required follow-up handling to obtain downloadable final files |
| ChatGPT 5.2 Pro | Generated PPTX outputs directly in downloadable format across tracked runs |
| Gemini Pro Deep Research | Broad format coverage across business-document outputs; some cases required additional handling around downloadable output generation |
| Gemini Pro Canvas | Used specifically for PPTX artifact generation |
| M365 Research Agent / Think Deeper | Used across document formats; several runs reflected downloadable-file handling and provider workflow constraints |
| Perplexity Pro / Pro Deep Research | Used primarily for TXT and Markdown-style outputs; citation visibility after download was a recurring issue for some runs |
| Manus 1.6 Max / Max Wide Research | Handled PPTX and file-grounded document, PDF, and XLSX generation; runs reflected both direct artifact generation and cases requiring download or follow-up handling |

Format-level observations

| Format | Observed Behavior |
| --- | --- |
| PPTX | Largest tracked format group; also showed the highest rate of downloadable-output handling and manual completion cases |
| CSV | Higher-complexity prompts frequently produced Markdown or PDF summaries before the requested CSV was generated |
| TXT | Generally feasible across providers; citation visibility in downloaded files was a recurring issue for some provider runs |
| DOCX / PDF / XLSX | File-grounded workstream tasks showed strong acceptance rates, particularly when reference files were uploaded correctly |
| HTML / JSON / Excel | Generated across multiple providers; acceptance tracked at the format-compliance level |

Key results

  • Delivered more than 1,500 validated artifacts, with every artifact logged with provider, model version, format type, query complexity, generation time, and QA status
  • Achieved a 99.9% artifact acceptance rate, reflecting strong execution discipline and format compliance enforcement
  • Maintained full complexity coverage across all document formats, enabling comparison of provider behavior from basic generation through high-complexity enterprise instruction-following
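To make the per-artifact metadata and the acceptance-rate figure concrete, here is a minimal sketch of a record carrying the six logged fields and a helper that computes acceptance as accepted records over total records. The field names, the `"accepted"` and `"rejected"` labels, and the sample rows are assumptions for illustration; the actual logging schema is not described in this case study.

```python
from dataclasses import dataclass


@dataclass
class ArtifactRecord:
    """One logged artifact (field names are assumed, based on the metadata listed above)."""
    provider: str
    model_version: str
    format_type: str
    complexity_type: int       # 1 to 4
    generation_seconds: float
    qa_status: str             # "accepted" or "rejected" (assumed labels)


def acceptance_rate(records):
    """Share of records whose QA status is "accepted", as a percentage."""
    if not records:
        raise ValueError("no records to evaluate")
    accepted = sum(1 for r in records if r.qa_status == "accepted")
    return 100.0 * accepted / len(records)


# Made-up rows for illustration; these are not results from the benchmark.
records = [
    ArtifactRecord("provider_a", "model-x", "PPTX", 1, 95.0, "accepted"),
    ArtifactRecord("provider_a", "model-x", "CSV", 4, 310.5, "rejected"),
    ArtifactRecord("provider_b", "model-y", "JSON", 2, 42.0, "accepted"),
    ArtifactRecord("provider_b", "model-y", "TXT", 3, 120.0, "accepted"),
]
print(acceptance_rate(records))  # 75.0
```

Because each record keeps provider, format, and complexity alongside the QA status, the same helper can be applied to any filtered subset to compare acceptance across those dimensions.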

The outcome

The client received a clean, model- and format-aware benchmark dataset with granular execution metadata across every artifact. The benchmark provides reliable signal for evaluating enterprise artifact generation capability and for diagnosing where specific providers and formats fall short under realistic enterprise prompting conditions.

This foundation enables the client to:

  • Compare artifact generation quality across providers at every complexity level, from basic generation to high-specificity enterprise instruction-following
  • Identify format- and model-specific failure patterns using a structured taxonomy rather than aggregate pass/fail metrics
  • Extend the benchmark framework to additional models, formats, or query types using a validated, repeatable execution workflow
  • Make informed decisions about artifact generation capabilities, grounded in real execution data rather than provider-reported benchmarks

Need benchmark execution across AI providers and artifact formats?

Request a sample benchmark dataset covering multi-format artifact generation, format compliance validation, and structured failure analysis across leading AI providers.

FAQ

What artifact formats and providers were covered?

The benchmark covered PPTX, HTML, DOCX, PDF, JSON, Excel, CSV, TXT, and infographics across providers, including Claude, Gemini, OpenAI, Microsoft 365, Perplexity, Flux, and NanoBanana.

What query complexity levels were included?

Four levels: open (short, topic-based), format-constrained (layout and structure requirements), content-specific (data points and regulatory details), and complex (high-specificity enterprise prompts with multiple constraints).

How were failures handled and classified?

Every failed or partial run was classified using a ten-category failure taxonomy covering no artifact generated, wrong format, download failure, provider error, safety refusal, timeout, file upload failure, corrupt file, partial output, and clarification loop.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Evaluating AI providers for enterprise artifact generation capability?

Work with Turing to design and execute format-aware benchmarks across providers, complexity levels, and document types.
