Delivering 3,000+ Multi-Modal Web QA Tasks Across Code Edits, Sketches, and Visual Reasoning

Delivered a multimodal dataset combining real-world code edits, visual question-answering (VQA), and structural sketches derived from website screenshots. The dataset enables the evaluation and improvement of LLMs across layout interpretation, UI navigation, and visual-textual reasoning tasks.

1,500+ real web screenshots: Used as grounding input for code rewrites, VQA tasks, and sketch-based UI layout representations.

3,000+ multimodal supervision tasks: Spanning HTML/CSS/JS code edits, functional and reasoning-based VQA, and structured layout sketches.

3 distinct output types: Code edits, sketch drawings, and VQA annotations.

Industry: Software Development
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement
The Challenge

Understanding how models interpret web interfaces requires more than simple text extraction or classification. To evaluate real-world capabilities, the client needed:

  • Accurate simulation of code-based UI behavior changes
  • Structured sketch representations capturing layout hierarchy and grouping
  • Challenging VQA questions addressing UI logic, function, and multi-step reasoning
  • Consistency in labeling, grouping, and prompt structure
  • Clear documentation of input-output transformations for each modality

The Approach

Turing implemented a three-pronged annotation pipeline:

1. Code edit tasks

  • Modified full-page HTML files with inline JavaScript/CSS
  • Created prompts at multiple difficulty levels:
    - Basic: Style tweaks and text changes
    - Intermediate: Navigation edits, accessibility fixes, and hover effects
    - Advanced: Structural or behavioral logic such as carousel fixes or performance optimizations
  • Ensured all changes directly reflected the prompt with no extraneous code
  • Packaged each edit with before-and-after files, a commit-ready structure, and prompt documentation (a sample task record is sketched after this list)
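
The case study does not publish the delivery schema for code-edit tasks, so the following TypeScript is only a rough sketch of what one record could look like. The interface and every field name (taskId, beforeHtml, afterHtml, and so on) are illustrative assumptions, not the client's actual format; the before/after snippet shows the kind of single-rule change a "Basic" style-tweak prompt would produce.

```typescript
// Hypothetical shape of a single code-edit task record; all field names
// are illustrative assumptions, not the delivered schema.
interface CodeEditTask {
  taskId: string;
  difficulty: "basic" | "intermediate" | "advanced";
  prompt: string;         // natural-language instruction the edit must satisfy
  beforeHtml: string;     // full page as received, with inline CSS/JS
  afterHtml: string;      // edited page; only prompt-relevant code changes
  screenshotPath: string; // grounding screenshot for the page
}

// A "basic" style-tweak example: the prompt asks for a color change only,
// so the after file differs from the before file by a single inline style.
const sampleTask: CodeEditTask = {
  taskId: "edit-0001",
  difficulty: "basic",
  prompt: "Change the hero button background to dark blue.",
  beforeHtml: `<button class="hero-cta" style="background:#e63946">Get started</button>`,
  afterHtml: `<button class="hero-cta" style="background:#1d3557">Get started</button>`,
  screenshotPath: "screenshots/landing-page.png",
};
```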

2. Web sketches

  • Created hand-drawn or tool-based layouts from screenshots
  • Grouped UI components by role (Header, Footer, Nav Links, Sidebar, IMG, Buttons, etc.)
  • Captured exact text and component positioning using a standardized annotation format (illustrated after this list)
  • Delivered high-resolution files ready for sketch-to-layout model training
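
The standardized annotation format itself is not reproduced in this case study. The sketch below shows one plausible way the role tags, exact on-screen text, and positions could be encoded; the ComponentRole values echo the roles listed above, while the normalized bounding-box convention and all field names are assumptions.

```typescript
// Illustrative sketch-annotation schema (assumed, not the delivered spec):
// each UI component carries a role tag, its visible text, and a normalized
// bounding box, with nesting used to capture grouping and layout hierarchy.
type ComponentRole =
  | "Header" | "Footer" | "NavLinks" | "Sidebar" | "IMG" | "Button" | "Text";

interface SketchComponent {
  role: ComponentRole;
  text?: string;                                      // exact on-screen text, if any
  box: [x: number, y: number, w: number, h: number];  // normalized 0-1 coordinates
  children?: SketchComponent[];                       // grouping reflects layout hierarchy
}

// Example: a header group containing a logo image and the nav links.
const headerGroup: SketchComponent = {
  role: "Header",
  box: [0, 0, 1, 0.08],
  children: [
    { role: "IMG", text: "logo", box: [0.02, 0.01, 0.08, 0.06] },
    { role: "NavLinks", text: "Home | Pricing | Docs", box: [0.6, 0.02, 0.35, 0.04] },
  ],
};
```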

3. Visual Question Answering (VQA)

  • Authored five high-quality questions per screenshot, spanning:
    - Functional (navigation, interactivity)
    - Complex reasoning (multi-step inference)
    - General image understanding (basic layout and content)
  • Ensured multi-step logic and OCR integration where required
  • Maintained strict alignment between questions and visual evidence
  • Provided answers with optional bounding boxes for spatial grounding (a sample record follows this list)
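
As with the other modalities, the exact VQA record layout is not published here. The sketch below assumes a simple schema in which each question carries a type tag matching the categories listed above, a free-text answer, and an optional bounding box for spatial grounding; the field names and sample values are hypothetical.

```typescript
// Hypothetical VQA record; question types mirror the three categories above,
// and the bounding box is the optional spatial grounding mentioned in the FAQ.
interface VqaRecord {
  screenshotPath: string;
  questionType: "functional" | "complex-reasoning" | "general";
  question: string;
  answer: string;
  boundingBox?: [x: number, y: number, w: number, h: number]; // pixel coordinates
}

// Example of a reasoning-style question grounded in a single screenshot.
const example: VqaRecord = {
  screenshotPath: "screenshots/pricing-page.png",
  questionType: "complex-reasoning",
  question:
    "Which navigation tab is currently selected, and what content section does it control?",
  answer:
    "The 'Pricing' tab is selected; it controls the plan comparison table below the header.",
  boundingBox: [812, 44, 140, 32],
};
```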

Multi-stage quality review

To ensure accuracy and consistency, each task underwent a two-step human validation process:

  • Initial annotation by trained contributors
  • Independent review by a second annotator to verify completeness, alignment with prompt, and logic accuracy

Reworks were triggered if responses failed to meet updated quality criteria such as:

  • Multi-step reasoning in complex questions
  • Visual-textual alignment in answers
  • No subjective or irrelevant content
  • Strict adherence to prompt coverage for code edits

This double-review system aligned with updated SOPs and directly contributed to higher annotation consistency across modalities.

Key Results

  • Annotated 1,500+ real web screenshots across all tasks
  • Created 3,000+ multimodal supervision tasks spanning HTML/CSS/JS code edits, functional and reasoning-based VQA, and structured layout sketches
  • Authored thousands of multi-step, image-grounded Q&A pairs
  • Covered 20+ real-world web domains including ecommerce, SaaS, education, publishing, dashboards, and search platforms
  • Validated each code edit for prompt alignment, functionality, and visual accuracy
  • Sketched components using 10+ standardized tags to ensure layout clarity and reuse
  • Delivered instruction-following rewrites in HTML/CSS/JS with inline comments and metadata

The Outcome

The resulting dataset, which includes deliberately challenging, model-breaking tasks, provides a strong foundation for evaluating and fine-tuning models across:

  • Sketch-to-code generation
  • Visual code editing
  • Multimodal QA from static UI layouts
  • OCR, logic, and design reasoning

Need to test your model’s ability to follow visual instructions?

Get annotated tasks combining prompt-based code rewrites, functional logic, and layout interpretation.

Request Sample


FAQ

What domains were covered?

The dataset covers more than 20 domains across ecommerce, search, education, dashboards, publishing, SaaS, and related areas.

Are bounding boxes included?

Yes. Some questions include optional bounding box annotations for spatial grounding.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

How well does your model understand web interfaces?

Request a sample to evaluate layout grouping, reasoning QA, and instruction-following.

Request Sample