Evaluating Olympiad-Grade Math Reasoning for Salesforce AI Research

Over 200 model-generated math solutions were annotated with strict binary correctness labels at every step, along with written justifications. Annotators followed a zero-tolerance error-carry-forward policy, ensuring robust analysis of chain-of-thought consistency in long-form mathematical outputs.

200+

Model-generated math solutions: Annotated at the step level with binary labels and justifications.

500+

Hours of expert annotation: Delivered by PhDs, postdocs, and domain reviewers.

100%

Compliance with strict evaluation rubric: Including zero-tolerance carry-forward logic and justification standards set by Salesforce.

Industry: Software Development
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement

The Challenge

Salesforce AI Research developed Hard2Verify, a benchmark to evaluate how well verifiers can assess step-level correctness in long-form math solutions. The benchmark uses open-ended, Olympiad-level problems sourced from recent competitions like IMO, Putnam, and INMO. Responses were generated directly by frontier models such as GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4, with no modifications, preserving natural reasoning errors for realistic evaluation.

Salesforce needed a trusted partner to verify multi-step reasoning chains. The evaluation required:

  • Strict binary judgments (Correct/Incorrect) for each reasoning step
  • Zero tolerance for carry-forward errors; if one step was incorrect, all dependent steps were marked incorrect
  • Short written justifications for each step, explaining correctness judgments
  • High reviewer alignment, given the interpretive difficulty of distinguishing parallel from dependent reasoning branches

Turing’s role was to execute high-precision annotation, reviewer alignment, and delivery at scale, following Salesforce’s internal evaluation rubric.

The Approach

Turing evaluated over 200 long-form math responses, each containing 8–15 structured reasoning steps, generated by frontier reasoning models. For every step, annotators issued a binary Correct/Incorrect judgment and authored a short written justification, following a strict zero-tolerance rubric:

  • Correct: Logically and computationally sound, aligned with prior steps
  • Incorrect: Any flaw, whether logical, procedural, or inherited via carry-forward, led to rejection
  • Carry-forward logic: Once a step was marked Incorrect, all dependent steps were also considered invalid (see the sketch after this list)
  • No partial credit: Incomplete logic, vague claims, or unverifiable “hand-wavy” statements were rejected
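As a rough illustration of this rubric, the sketch below shows how zero-tolerance carry-forward propagation might be applied over a step-level annotation. The StepAnnotation fields, the depends_on dependency list, and the apply_carry_forward helper are hypothetical names introduced for this example only; they are not Salesforce's Hard2Verify schema or Turing's internal tooling.

```python
from dataclasses import dataclass, field

# Hypothetical step-level annotation record; field names are illustrative,
# not the actual Hard2Verify schema.
@dataclass
class StepAnnotation:
    index: int              # position of the step in the solution
    label: str              # "Correct" or "Incorrect" (annotator's local judgment)
    justification: str      # short written rationale for the judgment
    depends_on: list = field(default_factory=list)  # indices of earlier steps this step builds on

def apply_carry_forward(steps):
    """Zero-tolerance carry-forward: once a step is Incorrect, every step
    that depends on it, directly or transitively, is also marked Incorrect."""
    invalid = set()
    for step in sorted(steps, key=lambda s: s.index):
        inherited = any(dep in invalid for dep in step.depends_on)
        if step.label == "Incorrect" or inherited:
            invalid.add(step.index)
            if inherited and step.label == "Correct":
                step.label = "Incorrect"
                step.justification += " [Inherited error via carry-forward.]"
    return steps

# Example: step 2 contains an algebra slip; step 3 builds on it, so it is
# invalidated even though its local reasoning is sound. Step 4 is a parallel
# branch that does not depend on step 2, so it keeps its Correct label.
annotated = apply_carry_forward([
    StepAnnotation(1, "Correct", "Setup matches the problem statement."),
    StepAnnotation(2, "Incorrect", "Sign error when expanding the product.", depends_on=[1]),
    StepAnnotation(3, "Correct", "Valid deduction from the previous line.", depends_on=[2]),
    StepAnnotation(4, "Correct", "Independent bound; does not use step 2.", depends_on=[1]),
])
for s in annotated:
    print(s.index, s.label)
```

Modeling dependencies explicitly is what separates parallel branches, which survive an upstream error, from dependent branches, which inherit it; that distinction is the interpretive difficulty noted under The Challenge.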

Annotation was performed by a team of rigorously vetted experts, including graduates, postgraduates, and PhDs in mathematics, and followed Salesforce's benchmark rubric with multi-round review for accuracy and auditability.

Key Results

  • Annotated 200+ model responses, each broken down and evaluated step-by-step
  • Applied error-carry-forward logic and justification writing with full reviewer alignment
  • Achieved 100% delivery pass-through via Salesforce’s internal 3-tier review pipeline
  • Maintained an average annotation time of 90 minutes per model output and 63 minutes per review

The Outcome

Salesforce used this dataset to:

  • Evaluate LLM performance on open-domain math tasks with strict reasoning fidelity
  • Detect early-stage reasoning flaws that derail multi-step logic chains
  • Align model outputs with human-grade reasoning standards for future fine-tuning and scoring research

Need a human-led verifier for complex math reasoning?

Partner with Turing to annotate symbolic CoT logic with binary step grading and reasoning trace validation.

Request Sample


FAQ

What exactly did Turing annotate?

Turing provided step-level Correct/Incorrect labels with written justifications for each model-generated math proof, using Salesforce’s rubric.

What kinds of errors were caught?

Logic leaps, invalid theorem applications, missing cases, and any result based on an earlier error, per the error-carry-forward rule.

Was this for training or evaluation?

Evaluation only. These samples were used to assess model behavior, not train or fine-tune it.

How did reviewers ensure accuracy?

Every task passed through a 3-stage review. Final delivery required approval from senior reviewers.

Can this workflow be extended to other domains?

Yes. Turing can apply the same step-level judgment and justification methodology to coding, STEM, and other domain-specific questions.

What’s the NDA process?

A standard mutual NDA; Turing returns a countersignature within one business day.

How fast can I get a sample?

Within 3 business days of NDA execution.

Want to evaluate chain-of-thought logic with human precision?

Get step-level reasoning traces with formal logic consistency and justification writing.

Request Sample