Delivered high-difficulty math prompts aligned with the rigor of the Humanity’s Last Exam (HLE) dataset. Each prompt was designed to break state-of-the-art (SOTA) LLMs while maintaining novelty, correctness, and reviewer traceability.

The client required a benchmark-grade math dataset capable of exposing LLM weaknesses in symbolic reasoning, multi-step logic, and problem formulation.
Each prompt needed to be novel, verifiably correct, traceable for reviewers, and difficult enough to break SOTA models.
To meet the client’s technical and benchmarking standards, Turing implemented a multi-step process focused on prompt novelty, model breakage, and graduate-level precision.
Dataset design
Prompts were designed to mirror the structure and rigor of the original HLE benchmark, using its two question formats: exact-match and multiple-choice.
Each prompt was required to break two internal benchmark models, and 50% were also required to break a third external SOTA model during evaluation. The dataset was distributed across more than 10 subdomains, including Algebra, Geometry, Topology, Analysis, Discrete Math, Applied Math, Probability, and Statistics.
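As a rough sketch, the dataset-level breakage constraints can be checked programmatically. The field names below are hypothetical, not the production tooling:

```python
# Hypothetical sketch: verify the breakage constraints over a delivered batch.
# A prompt "breaks" a model when the model returns an incorrect final answer.

def meets_breakage_constraints(prompts):
    """Every prompt must break both internal models; at least 50%
    must additionally break the external SOTA model."""
    for p in prompts:
        if not (p["breaks_internal_a"] and p["breaks_internal_b"]):
            return False
    external_breaks = sum(p["breaks_external"] for p in prompts)
    return external_breaks >= len(prompts) / 2

sample = [
    {"breaks_internal_a": True, "breaks_internal_b": True, "breaks_external": True},
    {"breaks_internal_a": True, "breaks_internal_b": True, "breaks_external": False},
]
print(meets_breakage_constraints(sample))  # True: all break internal, 1/2 break external
```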
Reviewer criteria
Each task was reviewed against a 10-item checklist covering correctness, novelty, clarity, formatting, and compliance with reviewer SOPs.
Quality assurance
Turing used a dual-review system: every prompt underwent two rounds of human QA, with reviewers tracking clarity, compliance, and correctness at each round.
A custom novelty checker flagged prompts with high retrieval risk, and each flagged prompt was manually verified via Google search. Uniqueness was enforced at both the conceptual and numeric levels.
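One plausible way to approximate such a novelty check (a minimal sketch with a hypothetical threshold, not the actual checker) is to flag prompts whose word n-grams overlap heavily with a reference corpus:

```python
# Hypothetical novelty-check sketch: flag prompts whose n-grams overlap
# heavily with known problem text, indicating high retrieval risk.
# Flagged prompts would then go to manual Google-search verification.

def ngrams(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def retrieval_risk(prompt, corpus, n=5):
    """Fraction of the prompt's n-grams that appear anywhere in the corpus."""
    grams = ngrams(prompt, n)
    if not grams:
        return 0.0
    seen = set().union(*(ngrams(doc, n) for doc in corpus))
    return len(grams & seen) / len(grams)

def flag_for_review(prompts, corpus, threshold=0.2):
    return [p for p in prompts if retrieval_risk(p, corpus) >= threshold]
```

A verbatim copy of a corpus problem scores 1.0 and is flagged; a genuinely new problem statement scores near 0.0.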
Subdomain distribution
We matched the distribution of the original HLE dataset, covering 10+ domains such as Discrete Math (27.5%), Algebra (18.2%), and Analysis (16.9%), with supporting coverage across Topology, Geometry, Applied Math, and more.
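Matching a target distribution for a fixed batch size can be done with the largest-remainder method, so per-domain counts sum exactly to the total. A minimal sketch (the batch size and "Other" bucket below are illustrative, not the delivered figures):

```python
# Hypothetical sketch: allocate prompt counts per subdomain to match
# the HLE target distribution, using the largest-remainder method.

def allocate(targets, total):
    """targets: {subdomain: fraction}; returns {subdomain: count} summing to total."""
    raw = {d: frac * total for d, frac in targets.items()}
    counts = {d: int(v) for d, v in raw.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining slots to the largest fractional remainders.
    for d in sorted(raw, key=lambda d: raw[d] - counts[d], reverse=True)[:leftover]:
        counts[d] += 1
    return counts

targets = {"Discrete Math": 0.275, "Algebra": 0.182, "Analysis": 0.169, "Other": 0.374}
print(allocate(targets, 200))
```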
Model breakage results
We tested a subset of the dataset against four established models: Nova, R1, Sonnet, and Qwen.
This project gave the client a verified, benchmark-grade math dataset for exposing model weaknesses and improving evaluation pipelines.

Request a sample of verified math tasks to build better reward models, evaluators, and chain-of-thought responses.
Frequently asked questions

What domains were covered?
Domains included Algebra, Geometry, Topology, Analysis, Discrete Math, Applied Math, Probability, and Statistics.
What counted as model breakage?
The model had to return an incorrect final answer. Reasoning flaws did not qualify unless they resulted in a wrong outcome.
Did prompts include rationales?
Yes. Every prompt included a concise, correct rationale for its final answer.
How was formatting consistency maintained?
Through exclusive use of LaTeX syntax with \( and \). Each prompt was verified against defined formatting standards and reviewer Standard Operating Procedures (SOPs).
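A minimal sketch of such a formatting check, with rules inferred from the description rather than taken from the actual reviewer SOP:

```python
# Hypothetical formatting lint: require \( ... \) inline-math delimiters,
# reject $-style math, and catch unbalanced delimiter pairs.

def format_issues(prompt):
    issues = []
    if "$" in prompt:
        issues.append("uses $ delimiters instead of \\( \\)")
    if prompt.count("\\(") != prompt.count("\\)"):
        issues.append("unbalanced \\( \\) pair")
    return issues

print(format_issues("Compute \\(x^2 + 1\\)."))  # [] -- passes the check
```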
What agreement is required before receiving a sample?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How quickly is a sample delivered?
Within three business days after NDA execution.
Request graduate-level prompts with final answers and chain-of-thought rationales.