Delivering 1000+ HLE-Grade Math Prompts to Benchmark SOTA Models

Delivered high-difficulty math prompts aligned with the rigor of the Humanity’s Last Exam (HLE) dataset. Each prompt was designed to break state-of-the-art (SOTA) LLMs while maintaining novelty, correctness, and reviewer traceability.

1000+

Research-level math prompts spanning 10+ subdomains, each reviewed for novelty, clarity, and solution validity.

100%

Review coverage: Dual-layer expert QA to ensure correctness, difficulty calibration, and formatting compliance.

2x

Model break criteria: Every question was required to break two internal benchmark models, and at least 50% also had to break a third, external SOTA model during evaluation.

Industry: Software Development
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement

The Challenge

The client required a benchmark-grade math dataset capable of exposing LLM weaknesses in symbolic reasoning, multi-step logic, and problem formulation.

Each prompt needed to:

  • Match or exceed the difficulty of original HLE tasks
  • Be completely novel and non-retrievable via web search
  • Include a verifiable final answer and rationale
  • Break multiple SOTA models under standardized evaluation criteria

The Approach

To meet the client’s technical and benchmarking standards, Turing implemented a multi-step process focused on prompt novelty, model breakage, and graduate-level precision.

Dataset design

Prompts were designed to mirror the structure and rigor of the original HLE benchmark. Two format types were included:

  • Exact match questions (90%) requiring closed-form numerical or symbolic answers
  • Multiple-choice questions (10%) with five options and one correct answer

Each prompt was required to break two internal benchmark models, and 50% were also required to break a third, external SOTA model during evaluation. The dataset was distributed across more than 10 subdomains, including:

  • Algebra
  • Analysis
  • Geometry
  • Topology
  • Discrete Math
  • Probability
  • Statistics
  • Applied Math

Reviewer criteria

Each task was reviewed using a 10-item checklist, covering:

  • Subdomain correctness
  • Graduate or PhD proficiency level
  • Final answer accuracy
  • Model breakage validity
  • Novelty check with link trace
  • LaTeX formatting and prompt grammar

Quality assurance

Turing used a dual-review system. Every prompt underwent two rounds of human QA to ensure clarity, compliance, and correctness. Reviewers tracked:

  • Common failure modes (e.g., ambiguous notation, invalid symbolic solutions)
  • Subdomain gaps to ensure even coverage
  • Time-per-prompt and reviewer consistency via an internal dashboard

A custom novelty-checker flagged prompts with high retrieval risk, and each was manually verified via Google search. Uniqueness was enforced at both conceptual and numeric levels.
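The internal novelty-checker itself is proprietary, but the underlying idea can be sketched as a simple n-gram overlap test: a prompt is flagged as high retrieval risk when too many of its exact phrases appear verbatim in known reference material (function names, the 5-gram size, and the 20% threshold below are illustrative assumptions, not the actual tooling):

```python
def ngrams(text: str, n: int = 5) -> set:
    """Return the set of n-word phrases (5-grams by default) in the text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def retrieval_risk(prompt: str, corpus: list, threshold: float = 0.2) -> bool:
    """Flag a prompt as high retrieval risk if a large fraction of its
    5-gram phrases appears verbatim in any known reference document.
    (Illustrative sketch; thresholds and corpus are assumptions.)"""
    grams = ngrams(prompt)
    if not grams:
        return False
    for doc in corpus:
        overlap = len(grams & ngrams(doc)) / len(grams)
        if overlap >= threshold:
            return True  # likely retrievable; send for manual verification
    return False
```

In the pipeline described above, a flagged prompt would then go to the manual Google-search verification step rather than being rejected outright.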

Internal Evaluation & Dataset Impact

Subdomain distribution

We matched the distribution of the original HLE dataset, covering 10+ domains such as Discrete Math (27.5%), Algebra (18.2%), and Analysis (16.9%), with supporting coverage across Topology, Geometry, Applied Math, and more.


Model breakage results

We tested a subset of the dataset against four established models: Nova, R1, Sonnet, and Qwen.

  • All prompts achieved Replication-level breakage (broke two internal models)
  • Over 50% achieved Advanced-level breakage, also failing a third external model
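The tiered breakage criteria can be expressed as a small classifier (a hypothetical helper written for illustration; the names "Replication" and "Advanced" come from the results above, everything else is an assumption):

```python
def breakage_tier(internal_failures: int, external_failed: bool) -> str:
    """Classify a prompt under the tiered breakage criteria:
    Replication = broke both internal benchmark models;
    Advanced    = additionally broke an external SOTA model.
    A prompt that fails to break both internal models is not accepted."""
    if internal_failures < 2:
        return "not accepted"
    return "Advanced" if external_failed else "Replication"
```

Per the FAQ below, "breaking" a model here means it returned an incorrect final answer, not merely a flawed intermediate step.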

Math performance and failure analysis

Key Results

  • Delivered 1000+ expert-level prompts in <30 days across 10+ subdomains
  • Ensured 100% novelty and formatting compliance via automated checks and human QA
  • Met the target of breaking an external SOTA model with 50% of the final dataset

The Outcome

This project enabled the client to:

  • Evaluate symbolic reasoning and math instruction-following using an HLE-aligned dataset
  • Diagnose chain-of-thought failures with a verified, model-breaking test bed
  • Scale prompt generation through a structured, research-grade QA pipeline
  • Extend evaluation and fine-tuning initiatives with reusable tasks and reviewer guidelines

Stress-test your model with research-level math QA

Request a sample of verified math tasks to build better reward models, evaluators, and chain-of-thought responses.

Request Sample


FAQ

What subdomains were covered?

Domains included Algebra, Geometry, Topology, Analysis, Discrete Math, Applied Math, Probability, and Statistics.

What was required for a prompt to “break” a model?

The model had to return an incorrect final answer. Reasoning flaws did not qualify unless they resulted in a wrong outcome.

Were rationales required?

Yes. Every prompt included a concise, verified rationale supporting its final answer.

How was formatting ensured?

Formatting consistency was maintained through exclusive use of LaTeX syntax with \( and \). Each prompt was verified against defined formatting standards and reviewer Standard Operating Procedures (SOPs).
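A basic automated check of the kind described above could look like this minimal sketch (an illustrative function, not the actual SOP tooling; it assumes the rule that inline math must use \( and \) rather than bare $ delimiters):

```python
import re

def latex_format_ok(prompt: str) -> bool:
    r"""Check two basic formatting rules (illustrative assumptions):
    1. Inline math must use \( ... \), so bare $ delimiters are disallowed.
    2. Every \( must have a matching \)."""
    # Reject any $ that is not escaped as \$
    if re.search(r"(?<!\\)\$", prompt):
        return False
    # Delimiters must balance
    return prompt.count(r"\(") == prompt.count(r"\)")
```

A real pipeline would add further checks (compilable LaTeX, grammar, SOP-specific rules), but even this catches the two most common delimiter errors.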

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Need model-breaking math QA for evaluation or fine-tuning?

Request graduate-level prompts with final answers and chain-of-thought rationales.

Request Sample