Delivered a multimodal dataset combining real-world code edits, visual question-answering (VQA), and structural sketches derived from website screenshots. The dataset enables the evaluation and improvement of LLMs across layout interpretation, UI navigation, and visual-textual reasoning tasks.

Understanding how models interpret web interfaces requires more than simple text extraction or classification. To evaluate real-world capabilities, the client needed tasks that combine layout interpretation, UI navigation, and visual-textual reasoning over real website screenshots.
Turing implemented a three-pronged annotation pipeline:
1. Code edit tasks
2. Web sketches
3. Visual Question Answering (VQA)
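For illustration only, the sketch below shows one way a single combined task record from such a pipeline could be structured. The field and class names are assumptions for this example, not the delivered schema.

# Hypothetical sketch of one combined task record; names are illustrative
# assumptions, not the actual delivered schema.
from dataclasses import dataclass, field
from typing import Optional, List, Tuple

@dataclass
class CodeEdit:
    prompt: str               # natural-language edit instruction
    original_code: str        # source snippet before the edit
    edited_code: str          # reference rewrite after the edit

@dataclass
class VQAItem:
    question: str                                      # question about the screenshot
    answer: str                                         # reference answer
    bounding_box: Optional[Tuple[int, int, int, int]] = None  # optional (x, y, w, h) grounding

@dataclass
class TaskRecord:
    screenshot_path: str      # website screenshot the task is derived from
    sketch_path: str          # structural sketch of the page layout
    code_edit: CodeEdit       # code edit task tied to the page
    vqa_items: List[VQAItem] = field(default_factory=list)
    domain: str = ""          # e.g. "ecommerce", "education", "SaaS"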
Multi-stage quality review
To ensure accuracy and consistency, each task underwent a two-step human validation process. Reworks were triggered whenever a response failed to meet the updated quality criteria.
This double-review system aligned with updated SOPs and directly contributed to higher annotation consistency across modalities.
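As a rough illustration of how such a double-review flow can be expressed, the sketch below runs two independent validation passes and flags a rework on the first failure. The reviewer roles and criteria checks are assumptions for this example, not the client's actual SOP.

# Illustrative two-step review with rework triggers; criteria are assumed,
# passed in as named check functions over a task record.
def review_task(record, first_reviewer, second_reviewer, criteria):
    """Run two independent validation passes; flag a rework if either fails."""
    for reviewer in (first_reviewer, second_reviewer):
        failed = [name for name, check in criteria.items() if not check(record, reviewer)]
        if failed:
            return {"status": "rework", "reviewer": reviewer, "failed_criteria": failed}
    return {"status": "approved"}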
The resulting dataset, which includes model-breaking tasks, provides a strong foundation for evaluating and fine-tuning models across layout interpretation, UI navigation, and visual-textual reasoning.
Get annotated tasks combining prompt-based code rewrites, functional logic, and layout interpretation.
Request Sample

How many domains does the dataset cover?
The dataset covers more than 20 domains across ecommerce, search, education, dashboards, publishing, SaaS, and related areas.

Does the dataset include spatial grounding annotations?
Yes. Some questions include optional bounding box annotations for spatial grounding.

What agreement is required to receive a sample?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How soon is a sample delivered?
Within three business days after NDA execution.
Request a sample to evaluate layout grouping, reasoning QA, and instruction-following.