Over 200 long-form math responses were annotated with strict binary correctness judgments for every step, each accompanied by a written justification. Annotators followed a zero-tolerance error-carry-forward policy, supporting robust analysis of chain-of-thought consistency in long-form mathematical outputs.

Salesforce AI Research developed Hard2Verify, a benchmark to evaluate how well verifiers can assess step-level correctness in long-form math solutions. The benchmark uses open-ended, Olympiad-level problems sourced from recent competitions like IMO, Putnam, and INMO. Responses were generated directly by frontier models such as GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4, with no modifications, preserving natural reasoning errors for realistic evaluation.
Salesforce needed a trusted partner to verify multi-step reasoning chains at the level of rigor the evaluation required.
Turing’s role was to execute high-precision annotation, reviewer alignment, and delivery at scale, following Salesforce’s internal evaluation rubric.
Turing evaluated over 200 long-form math responses, each containing 8–15 structured reasoning steps, generated by frontier reasoning models. For every step, annotators issued a binary Correct/Incorrect judgment and authored a short written justification, following a strict zero-tolerance rubric.
Annotation was performed by a team of rigorously vetted experts, including graduates, postgraduates, and PhDs in mathematics, and followed Salesforce's benchmark rubric with multi-round review for accuracy and auditability.
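For illustration only, the sketch below shows one way such step-level annotations could be represented in code. The class and field names are hypothetical and do not reflect Salesforce's or Turing's actual schema; they simply mirror the deliverable described above: a per-step Correct/Incorrect label plus a short written justification.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepAnnotation:
    """One annotator judgment for a single reasoning step (hypothetical schema)."""
    step_index: int      # 1-based position within the solution
    label: str           # "Correct" or "Incorrect"
    justification: str   # short written rationale, required for every step

@dataclass
class AnnotatedResponse:
    """A fully annotated long-form model response (hypothetical schema)."""
    problem_id: str              # e.g., a competition problem identifier
    model: str                   # e.g., "GPT-5", "Gemini 2.5 Pro", "Claude Sonnet 4"
    steps: List[StepAnnotation]  # typically 8-15 steps per response
```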
Salesforce used this dataset to benchmark how well verifiers assess step-level correctness in long-form math solutions.
Partner with Turing to annotate symbolic CoT logic with binary step grading and reasoning trace validation.
Turing provided step-level Correct/Incorrect labels with written justifications for each model-generated math proof, using Salesforce’s rubric.
Flagged errors included logic leaps, invalid theorem applications, missing cases, and, per the error-carry-forward rule, any result that builds on an earlier error.
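The error-carry-forward rule can be illustrated with a small sketch. The function and the dependency representation below are hypothetical; in the actual annotation process, dependence on earlier steps is judged by the annotator rather than computed.

```python
def apply_error_carry_forward(labels, depends_on):
    """
    Propagate errors forward: a step is Incorrect if it was judged Incorrect
    on its own, or if any earlier step it relies on is Incorrect.

    labels:      list of "Correct"/"Incorrect" verdicts, one per step (0-indexed)
    depends_on:  depends_on[i] lists indices of earlier steps that step i uses
                 (hypothetical representation for illustration only)
    """
    final = []
    for i, label in enumerate(labels):
        carried_error = any(final[j] == "Incorrect" for j in depends_on[i])
        final.append("Incorrect" if label == "Incorrect" or carried_error else label)
    return final

# Example: step 3 builds on step 2, which contains a logic leap.
labels = ["Correct", "Incorrect", "Correct", "Correct"]
depends_on = [[], [0], [1], [0, 2]]
print(apply_error_carry_forward(labels, depends_on))
# ['Correct', 'Incorrect', 'Incorrect', 'Incorrect']
```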
Evaluation only. These samples were used to assess model behavior, not to train or fine-tune models.
Every task passed through a 3-stage review. Final delivery required approval from senior reviewers.
Yes. Turing can apply the same step-level judgment and justification methodology to coding, domain-specific, or STEM questions.
A standard mutual NDA; Turing returns a countersignature within one business day.
Within 3 business days of NDA execution.
Get step-level reasoning traces with formal logic-consistency checks and written justifications.