Building a Document Understanding Dataset Across 15,000+ OCR, Summarization, and Translation Tasks
Delivered a large-scale document understanding dataset spanning OCR, summarization, and translation tasks across 10+ languages. The dataset covers 10+ document subdomains, from handwritten notes and rotated scans to printed financial reports and web screenshots, sourced from diverse real-world origins to reflect the full complexity of documents an AI agent encounters in production.
- 15,000+ tasks delivered across single-page and multi-page documents, spanning OCR, summarization, and translation capabilities.
- 95%+ summarization accuracy achieved across both single-page and multi-page tasks.
- 10+ document subdomains covered, including printed documents, scanned records, handwriting, forms, slides, web screenshots, financial reports, and academic papers.

The Challenge
The client needed a high-quality dataset to improve their AI agent's ability to understand documents across various formats, languages, and content types. The dataset had to reflect the full complexity of real-world document understanding: structured layouts, handwritten content, multi-column formats, rotated or skewed images, and mathematically dense documents, all sourced from genuinely diverse origins spanning different regions, time periods, and document styles.
Key challenges included:
- Sourcing diverse, high-quality document images across 10+ subdomains and multiple languages without duplication or quality degradation, covering everything from contemporary web screenshots to older scanned records
- Producing OCR transcriptions that preserved semantic structure, including tables, headers, checkboxes, signatures, mathematical notation, and inline formatting
- Generating accurate, appropriately concise, and correctly structured summaries as a mix of paragraphs and bullet points without introducing assumptions or external information
- Handling multi-page documents of up to 16 pages while maintaining reading order, section coherence, and formatting consistency across every page
- Enforcing strict quality controls to catch both surface-level errors and sophisticated mistakes such as misidentified superscripts, misread hyphens and em dashes, and unjustified text misrendered with extra spacing
The Approach
Turing deployed a structured sourcing, annotation, and quality assurance workflow spanning all three capabilities: OCR, summarization, and translation.
1. Automated sourcing and human validation
Document images were sourced through a combination of an automated internal pipeline and third-party vendors. The automated pipeline first identified candidate images from publicly available sources and applied programmatic checks for resolution, language, domain classification, and duplication. For subdomains where suitable images were not available through the pipeline, images were sourced externally through vendors.
A human validation layer then reviewed each candidate before annotation began, rejecting tasks that contained:
- Personally identifiable information
- Blurry or illegible content
- AI-generated images
- NSFW material
- Duplicate images within the same task
This combined sourcing and validation approach ensured that annotators only worked on documents that met baseline quality and diversity requirements.
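The programmatic checks described above can be sketched as a simple gating function. This is an illustrative sketch only, not the client's actual pipeline: the field names, the exact-hash deduplication, and the 500-pixel threshold (borrowed from the multilingual rejection rule later in this study) are all assumptions.

```python
import hashlib

# Illustrative sketch of programmatic sourcing checks: resolution, language,
# subdomain classification, and duplication. Field names and thresholds are
# hypothetical; real pipelines would likely add perceptual (near-duplicate)
# hashing on top of the exact-match check shown here.

MIN_PIXELS = 500  # assumed minimum dimension, per the multilingual rejection rule


def check_candidate(meta: dict, image_bytes: bytes, seen_hashes: set) -> list:
    """Return a list of rejection reasons; an empty list means the image passes."""
    reasons = []
    if min(meta["width"], meta["height"]) < MIN_PIXELS:
        reasons.append("resolution below threshold")
    if meta.get("language") not in meta.get("allowed_languages", []):
        reasons.append("language outside assigned set")
    if not meta.get("subdomain"):
        reasons.append("missing subdomain classification")
    # Exact-duplicate detection via a content hash shared across tasks.
    digest = hashlib.sha256(image_bytes).hexdigest()
    if digest in seen_hashes:
        reasons.append("duplicate image")
    else:
        seen_hashes.add(digest)
    return reasons
```

Candidates that pass these automated gates would then move on to the human validation layer for the checks that are hard to automate, such as PII and AI-generated imagery.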
2. OCR transcription with structural fidelity
Each OCR task required annotators to produce a semantically faithful transcription in Markdown format, using LaTeX for all mathematical notation. Key standards included:
- Following human semantic reading order throughout
- Preserving structural elements, including headers, tables, forms, checkboxes, bullet styles, indentation, signatures, and special characters
- Rendering superscripts and subscripts using HTML tags outside of math contexts
- Excluding design elements with no semantic meaning, such as decorative lines or background engravings
- Manually verifying all numbers and characters in multi-page documents, using auto-generated OCR only as a reference
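To make these conventions concrete, a hypothetical transcription fragment (invented for illustration, not taken from the dataset) might look like this: Markdown for structure, LaTeX for math, HTML tags for superscripts outside math contexts, and checkboxes preserved as checkboxes.

```markdown
# Quarterly Report

| Item     | Q1 Revenue |
| -------- | ---------- |
| Licenses | $1,200     |

Growth is computed as $r = (V_f / V_i)^{1/n} - 1$.

Floor area is reported in m<sup>2</sup>.

- [x] Reviewed by finance
- [ ] Approved by legal
```

Decorative rules and background engravings in the source image would simply be omitted, since they carry no semantic meaning.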
3. Summarization with structured formatting rules
Summarization tasks were governed by strict formatting and content rules. Key requirements included:
- Single-page summaries written as a single paragraph of three to six sentences
- Multi-page summaries structured as a mix of prose, bolded section headings drawn directly from the source document, and bullet points capped at 20 words each, with no bullet set exceeding eight items
- Original reading order maintained throughout, with no content combined across sections
- No interpretations, assumptions, external information, or personal pronouns permitted
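Rules this mechanical lend themselves to automated checking before human review. The sketch below encodes the stated limits (3 to 6 sentences, 20-word bullets, at most eight bullets per set); the sentence-splitting heuristic and function names are illustrative assumptions, not the project's actual tooling.

```python
import re

# Hypothetical validators for the summary formatting rules. The numeric limits
# come from the guidelines above; the parsing heuristics are illustrative only.


def validate_single_page(summary: str) -> list:
    """Check the single-paragraph, 3-6 sentence rule for single-page summaries."""
    errors = []
    if "\n" in summary.strip():
        errors.append("single-page summary must be one paragraph")
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    if not 3 <= len(sentences) <= 6:
        errors.append(f"expected 3-6 sentences, found {len(sentences)}")
    return errors


def validate_bullets(bullets: list) -> list:
    """Check the 20-word cap per bullet and the eight-item cap per set."""
    errors = []
    if len(bullets) > 8:
        errors.append("bullet set exceeds eight items")
    for i, bullet in enumerate(bullets):
        if len(bullet.split()) > 20:
            errors.append(f"bullet {i + 1} exceeds 20 words")
    return errors
```

Checks like these catch counting errors cheaply, leaving reviewers free to focus on content rules that resist automation, such as detecting interpretation or external information.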
4. Multilingual OCR and translation
The multilingual component extended OCR and translation capabilities across single-page and multi-page documents in assigned non-English languages. Annotators applied the same transcription and formatting standards as the English OCR workflow, with tasks rejected for:
- Mixed-language content
- Illegible text
- Images below 500 pixels in resolution
- Content sourced from the same website across multiple tasks
5. Quality assurance
Turing implemented a multi-layer QA process combining automated checks, human review, and dedicated spot-checking.
- Agentic reviewer: An automated reviewer was implemented at the task level to check accuracy across multiple QA parameters, including schema compliance, formatting rules, subdomain classification, and content validity. Human reviewers could override flags where appropriate.
- L1 review: A first human review pass checked surface-level compliance, including formatting, structural accuracy, and task type requirements.
- L2 review: A second human review pass validated transcription and summarization accuracy, catching sophisticated errors such as misidentified superscripts, em dash and hyphen confusion, unjustified text misrendering, and incorrect sentence counts in summaries.
- Spot-check team: A dedicated spot-check team conducted final random sampling across approved tasks to validate sustained quality at scale.
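The layered flow above can be sketched as a simple gating sequence. The layer names match this section; the `Task` structure, flag handling, and callable interfaces are illustrative assumptions about how such a pipeline might be wired, not the actual implementation.

```python
from dataclasses import dataclass, field

# Minimal sketch of a layered review pipeline: agentic flags feed human review,
# and a task must clear L1 and L2 before becoming eligible for spot-checks.


@dataclass
class Task:
    content: str
    flags: list = field(default_factory=list)
    status: str = "pending"


def run_pipeline(task: Task, agentic_checks, l1_review, l2_review) -> Task:
    # 1. Agentic reviewer: automated flags, which humans may override downstream.
    task.flags = agentic_checks(task.content)
    # 2. L1: surface-level compliance (formatting, structure, task type).
    if not l1_review(task):
        task.status = "rejected_l1"
        return task
    # 3. L2: transcription and summarization accuracy (subtle errors).
    if not l2_review(task):
        task.status = "rejected_l2"
        return task
    # Approved tasks form the pool sampled by the spot-check team.
    task.status = "approved"
    return task
```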
Errors tracked and corrected included:
- Missing headers, footers, or page numbers
- Location descriptions or background detail in transcriptions
- Formatting errors in tables and forms
- Incorrect heading level usage
- Failure to replicate boxed text
- Misidentified document subdomains
- Incorrect sentence counts or bullet point usage in single-page summaries
- Inclusion of PII or NSFW content
Key Results
- Delivered more than 15,000 tasks across single-page and multi-page documents
- Achieved 95%+ summarization accuracy across both single-page and multi-page tasks
- Covered 10+ document subdomains across OCR, summarization, and translation capabilities, reflecting diverse real-world document types, formats, and origins
- Supported 10+ languages in the multilingual component with consistent annotation standards applied throughout
- Applied a multi-layer QA process combining an agentic reviewer, L1 and L2 human review, and dedicated spot-checking
The Outcome
The client received a production-ready document understanding dataset built to train and evaluate an AI agent on realistic, high-complexity document tasks. With consistent formatting standards, strict content rules, broad subdomain and language coverage, and a multi-layer QA process, the dataset provides reliable signal across OCR accuracy, summarization quality, and translation.
This foundation supports:
- Training agents to extract and structure information from real-world documents with varied layouts, formats, and origins
- Evaluating summarization quality across single-page and multi-page documents with mixed content types
- Extending document understanding capabilities across languages with consistent annotation standards
- Benchmarking agent performance across a diverse and representative sample of document subdomains
Need a document understanding dataset spanning OCR, summarization, and translation tasks?
Request a sample of annotated documents across real-world subdomains, including multi-page layouts, handwriting, forms, and financial reports.
FAQ
What document subdomains are covered?
The dataset spans 10+ subdomains including printed documents, scanned files, handwriting, photo documents, web screenshots, rotated or skewed images, slides, forms, receipts and invoices, cropped tables, financial reports, and academic papers.
How were multi-page documents handled?
Multi-page tasks ranged from 4 to 16 pages. Annotators followed the original reading order across all pages and applied consistent formatting, structural fidelity, and content accuracy standards throughout each document.
How was quality ensured?
All tasks went through a four-layer QA process combining an agentic automated reviewer, L1 and L2 human review passes, and a dedicated spot-check team. This process caught both surface-level and sophisticated errors before final delivery.
Does the dataset support multilingual use cases?
Yes. The multilingual component covers the same subdomains and annotation standards as the English OCR workflow, applied to documents in 10+ languages.
What’s the NDA process?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How fast can I get a sample?
Within three business days after NDA execution.