Evaluating Olympiad-Grade Math Reasoning for Salesforce AI Research

Over 200 model-generated math solutions were annotated with strict binary correctness labels at every step, along with written justifications. Annotators followed a zero-tolerance error-carry-forward policy, ensuring robust analysis of chain-of-thought consistency in long-form mathematical outputs.

200+

Model-generated math solutions: Annotated at the step level with binary labels and justifications.

500+

Hours of expert annotation: Delivered by PhDs, postdocs, and domain reviewers.

100%

Compliance with strict evaluation rubric: Including zero-tolerance carry-forward logic and justification standards set by Salesforce.

Industry: Software Development
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement

The Challenge

Salesforce AI Research developed Hard2Verify, a benchmark to evaluate how well verifiers can assess step-level correctness in long-form math solutions. The benchmark uses open-ended, Olympiad-level problems sourced from recent competitions like IMO, Putnam, and INMO. Responses were generated directly by frontier models such as GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4, with no modifications, preserving natural reasoning errors for realistic evaluation.

Salesforce needed a trusted partner to verify multi-step reasoning chains. The evaluation required:

  • Strict binary judgments (Correct/Incorrect) for each reasoning step
  • Zero tolerance for carry-forward errors; if one step was incorrect, all dependent steps were marked incorrect
  • Short written justifications for each step, explaining correctness judgments
  • High reviewer alignment, given the interpretive difficulty of distinguishing parallel from dependent reasoning branches

Turing’s role was to execute high-precision annotation, reviewer alignment, and delivery at scale, following Salesforce’s internal evaluation rubric.

The Approach

Turing evaluated over 200 long-form math responses, each containing 8–15 structured reasoning steps, generated by frontier reasoning models. For every step, annotators issued a binary Correct/Incorrect judgment and authored a short written justification, following a strict zero-tolerance rubric:

  • Correct: Logically and computationally sound, aligned with prior steps
  • Incorrect: Any flaw, whether logical, procedural, or inherited via carry-forward, led to rejection
  • Carry-forward logic: Once a step was marked Incorrect, all dependent steps were also considered invalid (see the sketch after this list)
  • No partial credit: Incomplete logic, vague claims, or unverifiable “hand-wavy” statements were rejected
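As a rough illustration of this rubric, the sketch below shows how zero-tolerance carry-forward propagation might be applied over a step-level annotation. The StepAnnotation fields, the depends_on dependency list, and the apply_carry_forward helper are hypothetical names introduced for this example only; they are not Salesforce's Hard2Verify schema or Turing's internal tooling.

```python
from dataclasses import dataclass, field

# Hypothetical step-level annotation record; field names are illustrative,
# not the actual Hard2Verify schema.
@dataclass
class StepAnnotation:
    index: int              # position of the step in the solution
    label: str              # "Correct" or "Incorrect" (annotator's local judgment)
    justification: str      # short written rationale for the judgment
    depends_on: list = field(default_factory=list)  # indices of earlier steps this step builds on

def apply_carry_forward(steps):
    """Zero-tolerance carry-forward: once a step is Incorrect, every step
    that depends on it, directly or transitively, is also marked Incorrect."""
    invalid = set()
    for step in sorted(steps, key=lambda s: s.index):
        inherited = any(dep in invalid for dep in step.depends_on)
        if step.label == "Incorrect" or inherited:
            invalid.add(step.index)
            if inherited and step.label == "Correct":
                step.label = "Incorrect"
                step.justification += " [Inherited error via carry-forward.]"
    return steps

# Example: step 2 contains an algebra slip; step 3 builds on it, so it is
# invalidated even though its local reasoning is sound. Step 4 is a parallel
# branch that does not depend on step 2, so it keeps its Correct label.
annotated = apply_carry_forward([
    StepAnnotation(1, "Correct", "Setup matches the problem statement."),
    StepAnnotation(2, "Incorrect", "Sign error when expanding the product.", depends_on=[1]),
    StepAnnotation(3, "Correct", "Valid deduction from the previous line.", depends_on=[2]),
    StepAnnotation(4, "Correct", "Independent bound; does not use step 2.", depends_on=[1]),
])
for s in annotated:
    print(s.index, s.label)
```

Modeling dependencies explicitly is what separates parallel branches, which survive an upstream error, from dependent branches, which inherit it; that distinction is the interpretive difficulty noted under The Challenge.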

Annotation was performed by a team of rigorously vetted experts, including graduates, postgraduates, and PhDs in mathematics, and followed Salesforce's benchmark rubric with multi-round review for accuracy and auditability.

Key Results

  • Annotated 200+ model responses, each broken down and evaluated step-by-step
  • Applied error-carry-forward logic and justification writing with full reviewer alignment
  • Achieved 100% delivery pass-through via Salesforce’s internal 3-tier review pipeline
  • Maintained an average annotation time of 90 minutes per model output and 63 minutes per review

The Outcome

Salesforce used this dataset to:

  • Evaluate LLM performance on open-domain math tasks with strict reasoning fidelity
  • Detect early-stage reasoning flaws that derail multi-step logic chains
  • Align model outputs with human-grade reasoning standards for future fine-tuning and scoring research

Need a human-led verifier for complex math reasoning?

Partner with Turing to annotate symbolic CoT logic with binary step grading and reasoning trace validation.

Request Sample


FAQ

What exactly did Turing annotate?

Turing provided step-level Correct/Incorrect labels with written justifications for each model-generated math proof, using Salesforce’s rubric.

What kinds of errors were caught?

Logic leaps, invalid theorem applications, missing cases, and any result based on an earlier error, per the error-carry-forward rule.

Was this for training or evaluation?

Evaluation only. These samples were used to assess model behavior, not train or fine-tune it.

How did reviewers ensure accuracy?

Every task passed through a 3-stage review. Final delivery required approval from senior reviewers.

Can this workflow be extended to other domains?

Yes. Turing can apply the same step-level judgment and justification methodology to coding, STEM, and other domain-specific questions.

What’s the NDA process?

A standard mutual NDA; Turing returns a countersignature within one business day.

How fast can I get a sample?

Within 3 business days of NDA execution.

Want to evaluate chain-of-thought logic with human precision?

Get step-level reasoning traces with formal logic consistency and justification writing.

Request Sample