Building a Document Understanding Dataset Across 15,000+ OCR, Summarization, and Translation Tasks
Delivered a large-scale document understanding dataset spanning OCR, summarization, and translation tasks across 10+ languages. The dataset covers 10+ document subdomains, from handwritten notes and rotated scans to printed financial reports and web screenshots, sourced from diverse real-world origins to reflect the full complexity of documents an AI agent encounters in production.
- 15,000+ tasks delivered across single-page and multi-page documents, spanning OCR, summarization, and translation capabilities.
- 95%+ summarization accuracy achieved across both single-page and multi-page tasks.
- 10+ document subdomains covered, including printed documents, scanned records, handwriting, forms, slides, web screenshots, financial reports, and academic papers.

The Challenge
The client needed a high-quality dataset to improve their AI agent's ability to understand documents across various formats, languages, and content types. The dataset had to reflect the full complexity of real-world document understanding: structured layouts, handwritten content, multi-column formats, rotated or skewed images, and mathematically dense documents, all sourced from genuinely diverse origins spanning different regions, time periods, and document styles.
Key challenges included:
- Sourcing diverse, high-quality document images across 10+ subdomains and multiple languages without duplication or quality degradation, covering everything from contemporary web screenshots to older scanned records
- Producing OCR transcriptions that preserved semantic structure, including tables, headers, checkboxes, signatures, mathematical notation, and inline formatting
- Generating accurate, appropriately concise, and correctly structured summaries as a mix of paragraphs and bullet points without introducing assumptions or external information
- Handling multi-page documents of up to 16 pages while maintaining reading order, section coherence, and formatting consistency across every page
- Enforcing strict quality controls to catch both surface-level errors and sophisticated mistakes such as misidentified superscripts, misread hyphens and em dashes, and unjustified text misrendered with extra spacing
The Approach
Turing deployed a structured sourcing, annotation, and quality assurance workflow spanning all three capabilities: OCR, summarization, and translation.
1. Automated sourcing and human validation
Document images were sourced through a combination of an automated internal pipeline and third-party vendors. The automated pipeline first identified candidate images from publicly available sources and applied programmatic checks for resolution, language, domain classification, and duplication. For subdomains where suitable images were not available through the pipeline, images were sourced externally through vendors.
A human validation layer then reviewed each candidate before annotation began, rejecting tasks that contained:
- Personally identifiable information
- Blurry or illegible content
- AI-generated images
- NSFW material
- Duplicate images within the same task
This combined sourcing and validation approach ensured that annotators only worked on documents that met baseline quality and diversity requirements.
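The programmatic checks described above can be sketched as a simple gating function. This is an illustrative sketch only, not the client's actual pipeline: the field names, the exact-hash deduplication, and the 500-pixel threshold (borrowed from the multilingual rejection rule later in this study) are all assumptions.

```python
import hashlib

# Illustrative sketch of programmatic sourcing checks: resolution, language,
# subdomain classification, and duplication. Field names and thresholds are
# hypothetical; real pipelines would likely add perceptual (near-duplicate)
# hashing on top of the exact-match check shown here.

MIN_PIXELS = 500  # assumed minimum dimension, per the multilingual rejection rule


def check_candidate(meta: dict, image_bytes: bytes, seen_hashes: set) -> list:
    """Return a list of rejection reasons; an empty list means the image passes."""
    reasons = []
    if min(meta["width"], meta["height"]) < MIN_PIXELS:
        reasons.append("resolution below threshold")
    if meta.get("language") not in meta.get("allowed_languages", []):
        reasons.append("language outside assigned set")
    if not meta.get("subdomain"):
        reasons.append("missing subdomain classification")
    # Exact-duplicate detection via a content hash shared across tasks.
    digest = hashlib.sha256(image_bytes).hexdigest()
    if digest in seen_hashes:
        reasons.append("duplicate image")
    else:
        seen_hashes.add(digest)
    return reasons
```

Candidates that pass these automated gates would then move on to the human validation layer for the checks that are hard to automate, such as PII and AI-generated imagery.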
2. OCR transcription with structural fidelity
Each OCR task required annotators to produce a semantically faithful transcription in Markdown format, using LaTeX for all mathematical notation. Key standards included:
- Following human semantic reading order throughout
- Preserving structural elements, including headers, tables, forms, checkboxes, bullet styles, indentation, signatures, and special characters
- Rendering superscripts and subscripts using HTML tags outside of math contexts
- Excluding design elements with no semantic meaning, such as decorative lines or background engravings
- Manually verifying all numbers and characters in multi-page documents, using auto-generated OCR only as a reference
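To make these conventions concrete, a hypothetical transcription fragment (invented for illustration, not taken from the dataset) might look like this: Markdown for structure, LaTeX for math, HTML tags for superscripts outside math contexts, and checkboxes preserved as checkboxes.

```markdown
# Quarterly Report

| Item     | Q1 Revenue |
| -------- | ---------- |
| Licenses | $1,200     |

Growth is computed as $r = (V_f / V_i)^{1/n} - 1$.

Floor area is reported in m<sup>2</sup>.

- [x] Reviewed by finance
- [ ] Approved by legal
```

Decorative rules and background engravings in the source image would simply be omitted, since they carry no semantic meaning.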
3. Summarization with structured formatting rules
Summarization tasks were governed by strict formatting and content rules. Key requirements included:
- Single-page summaries written as a single paragraph of three to six sentences
- Multi-page summaries structured as a mix of prose, bolded section headings drawn directly from the source document, and bullet points capped at 20 words each, with no bullet set exceeding eight items
- Original reading order maintained throughout, with no content combined across sections
- No interpretations, assumptions, external information, or personal pronouns permitted
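Rules this mechanical lend themselves to automated checking before human review. The sketch below encodes the stated limits (3 to 6 sentences, 20-word bullets, at most eight bullets per set); the sentence-splitting heuristic and function names are illustrative assumptions, not the project's actual tooling.

```python
import re

# Hypothetical validators for the summary formatting rules. The numeric limits
# come from the guidelines above; the parsing heuristics are illustrative only.


def validate_single_page(summary: str) -> list:
    """Check the single-paragraph, 3-6 sentence rule for single-page summaries."""
    errors = []
    if "\n" in summary.strip():
        errors.append("single-page summary must be one paragraph")
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    if not 3 <= len(sentences) <= 6:
        errors.append(f"expected 3-6 sentences, found {len(sentences)}")
    return errors


def validate_bullets(bullets: list) -> list:
    """Check the 20-word cap per bullet and the eight-item cap per set."""
    errors = []
    if len(bullets) > 8:
        errors.append("bullet set exceeds eight items")
    for i, bullet in enumerate(bullets):
        if len(bullet.split()) > 20:
            errors.append(f"bullet {i + 1} exceeds 20 words")
    return errors
```

Checks like these catch counting errors cheaply, leaving reviewers free to focus on content rules that resist automation, such as detecting interpretation or external information.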
4. Multilingual OCR and translation
The multilingual component extended OCR and translation capabilities across single-page and multi-page documents in assigned non-English languages. Annotators applied the same transcription and formatting standards as the English OCR workflow, with tasks rejected for:
- Mixed-language content
- Illegible text
- Images below 500 pixels in resolution
- Content sourced from the same website across multiple tasks
5. Quality assurance
Turing implemented a multi-layer QA process combining automated checks, human review, and dedicated spot-checking.
- Agentic reviewer: An automated reviewer was implemented at the task level to check accuracy across multiple QA parameters, including schema compliance, formatting rules, subdomain classification, and content validity. Human reviewers could override flags where appropriate.
- L1 review: A first human review pass checked surface-level compliance, including formatting, structural accuracy, and task type requirements.
- L2 review: A second human review pass validated transcription and summarization accuracy, catching sophisticated errors such as misidentified superscripts, em dash and hyphen confusion, unjustified text misrendering, and incorrect sentence counts in summaries.
- Spot-check team: A dedicated spot-check team conducted final random sampling across approved tasks to validate sustained quality at scale.
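The layered flow above can be sketched as a simple gating sequence. The layer names match this section; the `Task` structure, flag handling, and callable interfaces are illustrative assumptions about how such a pipeline might be wired, not the actual implementation.

```python
from dataclasses import dataclass, field

# Minimal sketch of a layered review pipeline: agentic flags feed human review,
# and a task must clear L1 and L2 before becoming eligible for spot-checks.


@dataclass
class Task:
    content: str
    flags: list = field(default_factory=list)
    status: str = "pending"


def run_pipeline(task: Task, agentic_checks, l1_review, l2_review) -> Task:
    # 1. Agentic reviewer: automated flags, which humans may override downstream.
    task.flags = agentic_checks(task.content)
    # 2. L1: surface-level compliance (formatting, structure, task type).
    if not l1_review(task):
        task.status = "rejected_l1"
        return task
    # 3. L2: transcription and summarization accuracy (subtle errors).
    if not l2_review(task):
        task.status = "rejected_l2"
        return task
    # Approved tasks form the pool sampled by the spot-check team.
    task.status = "approved"
    return task
```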
Errors tracked and corrected included:
- Missing headers, footers, or page numbers
- Location descriptions or background detail in transcriptions
- Formatting errors in tables and forms
- Incorrect heading level usage
- Failure to replicate boxed text
- Misidentified document subdomains
- Incorrect sentence counts or bullet point usage in single-page summaries
- Inclusion of PII or NSFW content
Key Results
- Delivered more than 15,000 tasks across single-page and multi-page documents
- Achieved 95%+ summarization accuracy across both single-page and multi-page tasks
- Covered 10+ document subdomains across OCR, summarization, and translation capabilities, reflecting diverse real-world document types, formats, and origins
- Supported 10+ languages in the multilingual component with consistent annotation standards applied throughout
- Applied a multi-layer QA process combining an agentic reviewer, L1 and L2 human review, and dedicated spot-checking
The Outcome
The client received a production-ready document understanding dataset built to train and evaluate an AI agent on realistic, high-complexity document tasks. With consistent formatting standards, strict content rules, broad subdomain and language coverage, and a multi-layer QA process, the dataset provides reliable signal across OCR accuracy, summarization quality, and translation.
This foundation supports:
- Training agents to extract and structure information from real-world documents with varied layouts, formats, and origins
- Evaluating summarization quality across single-page and multi-page documents with mixed content types
- Extending document understanding capabilities across languages with consistent annotation standards
- Benchmarking agent performance across a diverse and representative sample of document subdomains
Need a document understanding dataset spanning OCR, summarization, and translation tasks?
Request a sample of annotated documents across real-world subdomains, including multi-page layouts, handwriting, forms, and financial reports.
FAQ
What document subdomains are covered?
The dataset spans 10+ subdomains including printed documents, scanned files, handwriting, photo documents, web screenshots, rotated or skewed images, slides, forms, receipts and invoices, cropped tables, financial reports, and academic papers.
How were multi-page documents handled?
Multi-page tasks ranged from 4 to 16 pages. Annotators followed the original reading order across all pages and applied consistent formatting, structural fidelity, and content accuracy standards throughout each document.
How was quality ensured?
All tasks went through a four-layer QA process combining an agentic automated reviewer, L1 and L2 human review passes, and a dedicated spot-check team. This process caught both surface-level and sophisticated errors before final delivery.
Does the dataset support multilingual use cases?
Yes. The multilingual component covers the same subdomains and annotation standards as the English OCR workflow, applied to documents in 10+ languages.
What’s the NDA process?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How fast can I get a sample?
Within three business days after NDA execution.