Created a 2K-sample dataset to uncover reasoning blind spots in frontier LLMs such as GPT-5, using adversarial LSAT-style questions across logic games, reading comprehension, and logical reasoning.
A global frontier lab partnered with Turing to evaluate how its most advanced language models handled LSAT-caliber reasoning.
While most models exhibit strong pattern recognition, they still struggle with precise logical thinking, often misapplying conditionals, ignoring quantifiers, or failing to recognize contradictory premises. The client needed a diagnostic benchmark that would surface these blind spots.
The goal was to strengthen the model’s comprehension, argumentative reasoning, and formal logic, not only for LSAT-style tasks, but also for broader reasoning benchmarks like SATBench, GPQA, MMLU, and the AI2 Reasoning Challenge.
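To make the first of those failure modes concrete, here is a minimal truth-table check in Python. The propositions and the `implies` helper are our own illustrative choices, not part of the delivered dataset; the sketch shows why "affirming the consequent," a typical way of misapplying a conditional, is invalid.

```python
from itertools import product

# Toy illustration (our own example, not drawn from the client's dataset) of one
# failure mode above: misapplying a conditional by "affirming the consequent",
# i.e. concluding P from the premises (P -> Q) and Q.
def implies(p: bool, q: bool) -> bool:
    return (not p) or q

# Search the truth table for a row where both premises hold but P is false.
counterexamples = [
    (p, q)
    for p, q in product([True, False], repeat=2)
    if implies(p, q) and q and not p
]
print(counterexamples)  # [(False, True)] -> the inference is invalid
```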
Dataset
Turing delivered 2,000+ curated samples, each pairing an original LSAT-style question with a correct answer key, a failed model output, and a detailed critique.
Example: In one task, the model was asked to schedule a set of demos across time slots under multiple conditional and biconditional rules. It selected a configuration that violated the adjacency constraints, exposing a failure to apply chained logic correctly, a common LSAT reasoning trap.
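A minimal sketch of that style of check is shown below, using a hypothetical set of demos and rules rather than the client's actual task: it enumerates slot assignments, tests each against a conditional, a biconditional, and an adjacency rule, and rejects the kind of configuration a model might wrongly accept.

```python
from itertools import permutations

# Hypothetical logic-game instance (illustrative only, not the client's task):
# five demos A-E are assigned to time slots 1-5, one demo per slot.
DEMOS = "ABCDE"

def satisfies_rules(order):
    """Return True if an assignment (order[i] is the demo in slot i+1) obeys all rules."""
    slot = {demo: i + 1 for i, demo in enumerate(order)}
    return all([
        # Conditional: if A is in slot 1, then B must be in slot 2.
        slot["A"] != 1 or slot["B"] == 2,
        # Biconditional: C is in slot 3 exactly when D is in slot 4.
        (slot["C"] == 3) == (slot["D"] == 4),
        # Adjacency: B and E must occupy consecutive slots.
        abs(slot["B"] - slot["E"]) == 1,
    ])

# Brute-force the 120 possible orderings, the same elimination a careful
# test-taker performs on the answer choices.
valid = [order for order in permutations(DEMOS) if satisfies_rules(order)]
print(len(valid), "orderings satisfy every rule")

# A configuration of the kind a model might wrongly accept: it keeps B and E
# adjacent and respects the biconditional, but breaks the chained conditional
# (A sits in slot 1 while B is pushed to slot 3).
candidate = ("A", "E", "B", "C", "D")
print(satisfies_rules(candidate))  # False
```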
Evaluation
Turing implemented a multi-tier QA process to ensure that all questions, critiques, and answer keys were aligned to LSAT norms.
This hybrid approach surfaced more than 20 recurring model failure patterns, including the misapplied conditionals, ignored quantifiers, and chained-logic errors described above.
With this LSAT-style benchmark, the client can now pinpoint where its models' formal logic breaks down and track progress on broader reasoning benchmarks such as SATBench, GPQA, MMLU, and the AI2 Reasoning Challenge.
Get access to SOTA model-breaking prompts with critiques, rubrics, and traceable reasoning errors.
Request a sample: one or more LSAT-style tasks with correct answers, failed model outputs, and detailed critiques.
Are the tasks drawn from real LSAT exams? No, all tasks are original and never reused from past LSAT papers.
Which question types are covered? Logic games, logical reasoning, and reading comprehension, with full subtype coverage.
How is quality assured? All tasks pass a multi-layer human + agentic QA process and adhere to LSAT formatting and logic-complexity standards.
What agreement is required? A standard mutual NDA; Turing returns the countersignature within one business day.
How soon is the sample delivered? Within 3 business days of NDA execution.
Go beyond pattern-matching with benchmark data built for formal logic and argumentation.