Delivering 3,000+ Multi-Modal Web QA Tasks Across Code Edits, Sketches, and Visual Reasoning

Delivered a multimodal dataset combining real-world code edits, visual question-answering (VQA), and structural sketches derived from website screenshots. The dataset enables the evaluation and improvement of LLMs across layout interpretation, UI navigation, and visual-textual reasoning tasks.

1,500+ real web screenshots: Used as grounding input for code rewrites, VQA tasks, and sketch-based UI layout representations.

3,000+ multimodal supervision tasks: Spanning HTML/CSS/JS code edits, functional and reasoning-based VQA, and structured layout sketches.

3 distinct output types: Code edits, sketch drawings, and VQA annotations.

Industry: Software Development
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement
The Challenge

Understanding how models interpret web interfaces requires more than simple text extraction or classification. To evaluate real-world capabilities, the client needed:

  • Accurate simulation of code-based UI behavior changes
  • Structured sketch representations capturing layout hierarchy and grouping
  • Challenging VQA questions addressing UI logic, function, and multi-step reasoning
  • Consistency in labeling, grouping, and prompt structure
  • Clear documentation of input-output transformations for each modality

The Approach

Turing implemented a three-pronged annotation pipeline:

1. Code edit tasks

  • Modified full-page HTML files with inline JavaScript/CSS
  • Created prompts at multiple difficulty levels:
    - Basic: Style tweaks and text changes
    - Intermediate: Navigation edits, accessibility fixes, and hover effects
    - Advanced: Structural or behavioral logic such as carousel fixes or performance optimizations
  • Ensured all changes directly reflected the prompt with no extraneous code
  • Packaged each edit with before-and-after files, a commit-ready structure, and prompt documentation (a sample task record is sketched after this list)
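
The case study does not publish the delivery schema for code-edit tasks, so the following TypeScript is only a rough sketch of what one record could look like. The interface and every field name (taskId, beforeHtml, afterHtml, and so on) are illustrative assumptions, not the client's actual format; the before/after snippet shows the kind of single-rule change a "Basic" style-tweak prompt would produce.

```typescript
// Hypothetical shape of a single code-edit task record; all field names
// are illustrative assumptions, not the delivered schema.
interface CodeEditTask {
  taskId: string;
  difficulty: "basic" | "intermediate" | "advanced";
  prompt: string;         // natural-language instruction the edit must satisfy
  beforeHtml: string;     // full page as received, with inline CSS/JS
  afterHtml: string;      // edited page; only prompt-relevant code changes
  screenshotPath: string; // grounding screenshot for the page
}

// A "basic" style-tweak example: the prompt asks for a color change only,
// so the after file differs from the before file by a single inline style.
const sampleTask: CodeEditTask = {
  taskId: "edit-0001",
  difficulty: "basic",
  prompt: "Change the hero button background to dark blue.",
  beforeHtml: `<button class="hero-cta" style="background:#e63946">Get started</button>`,
  afterHtml: `<button class="hero-cta" style="background:#1d3557">Get started</button>`,
  screenshotPath: "screenshots/landing-page.png",
};
```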

2. Web sketches

  • Created hand-drawn or tool-based layouts from screenshots
  • Grouped UI components by role (Header, Footer, Nav Links, Sidebar, IMG, Buttons, etc.)
  • Captured exact text and component positioning using a standardized annotation format (illustrated after this list)
  • Delivered high-resolution files ready for sketch-to-layout model training
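
The standardized annotation format itself is not reproduced in this case study. The sketch below shows one plausible way the role tags, exact on-screen text, and positions could be encoded; the ComponentRole values echo the roles listed above, while the normalized bounding-box convention and all field names are assumptions.

```typescript
// Illustrative sketch-annotation schema (assumed, not the delivered spec):
// each UI component carries a role tag, its visible text, and a normalized
// bounding box, with nesting used to capture grouping and layout hierarchy.
type ComponentRole =
  | "Header" | "Footer" | "NavLinks" | "Sidebar" | "IMG" | "Button" | "Text";

interface SketchComponent {
  role: ComponentRole;
  text?: string;                                      // exact on-screen text, if any
  box: [x: number, y: number, w: number, h: number];  // normalized 0-1 coordinates
  children?: SketchComponent[];                       // grouping reflects layout hierarchy
}

// Example: a header group containing a logo image and the nav links.
const headerGroup: SketchComponent = {
  role: "Header",
  box: [0, 0, 1, 0.08],
  children: [
    { role: "IMG", text: "logo", box: [0.02, 0.01, 0.08, 0.06] },
    { role: "NavLinks", text: "Home | Pricing | Docs", box: [0.6, 0.02, 0.35, 0.04] },
  ],
};
```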

3. Visual Question Answering (VQA)

  • Authored five high-quality questions per screenshot, spanning:
    - Functional (navigation, interactivity)
    - Complex reasoning (multi-step inference)
    - General image understanding (basic layout and content)
  • Ensured multi-step logic and OCR integration where required
  • Maintained strict alignment between questions and visual evidence
  • Provided answers with optional bounding boxes for spatial grounding (a sample record follows this list)
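
As with the other modalities, the exact VQA record layout is not published here. The sketch below assumes a simple schema in which each question carries a type tag matching the categories listed above, a free-text answer, and an optional bounding box for spatial grounding; the field names and sample values are hypothetical.

```typescript
// Hypothetical VQA record; question types mirror the three categories above,
// and the bounding box is the optional spatial grounding mentioned in the FAQ.
interface VqaRecord {
  screenshotPath: string;
  questionType: "functional" | "complex-reasoning" | "general";
  question: string;
  answer: string;
  boundingBox?: [x: number, y: number, w: number, h: number]; // pixel coordinates
}

// Example of a reasoning-style question grounded in a single screenshot.
const example: VqaRecord = {
  screenshotPath: "screenshots/pricing-page.png",
  questionType: "complex-reasoning",
  question:
    "Which navigation tab is currently selected, and what content section does it control?",
  answer:
    "The 'Pricing' tab is selected; it controls the plan comparison table below the header.",
  boundingBox: [812, 44, 140, 32],
};
```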

Multi-stage quality review

To ensure accuracy and consistency, each task underwent a two-step human validation process:

  • Initial annotation by trained contributors
  • Independent review by a second annotator to verify completeness, alignment with prompt, and logic accuracy

Reworks were triggered if responses failed to meet updated quality criteria such as:

  • Multi-step reasoning in complex questions
  • Visual-textual alignment in answers
  • No subjective or irrelevant content
  • Strict adherence to prompt coverage for code edits

This double-review system aligned with updated SOPs and directly contributed to higher annotation consistency across modalities.

Key Results

  • Annotated 1,500+ real web screenshots across all tasks
  • Created 3,000+ multimodal supervision tasks spanning HTML/CSS/JS code edits, functional and reasoning-based VQA, and structured layout sketches
  • Authored thousands of multi-step, image-grounded Q&A pairs
  • Covered 20+ real-world web domains including ecommerce, SaaS, education, publishing, dashboards, and search platforms
  • Validated each code edit for prompt alignment, functionality, and visual accuracy
  • Sketched components using 10+ standardized tags to ensure layout clarity and reuse
  • Delivered instruction-following rewrites in HTML/CSS/JS with inline comments and metadata

The Outcome

The resulting dataset, which includes deliberately challenging, model-breaking tasks, provides a strong foundation for evaluating and fine-tuning models across:

  • Sketch-to-code generation
  • Visual code editing
  • Multimodal QA from static UI layouts
  • OCR, logic, and design reasoning

Need to test your model’s ability to follow visual instructions?

Get annotated tasks combining prompt-based code rewrites, functional logic, and layout interpretation.

Request Sample


FAQ

What domains were covered?

The dataset covers more than 20 domains across ecommerce, search, education, dashboards, publishing, SaaS, and related areas.

Are bounding boxes included?

Yes. Some questions include optional bounding box annotations for spatial grounding.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

How well does your model understand web interfaces?

Request a sample to evaluate layout grouping, reasoning QA, and instruction-following.

Request Sample