Stress-Testing Frontier Models with 2K+ Expert-Written LSAT Questions

Created a 2,000+-sample dataset to uncover reasoning blind spots in frontier LLMs like GPT-5, using adversarial LSAT-style questions across logic games, reading comprehension, and logical reasoning.

2,000+

Model-breaking LSAT samples: Spanning logic games, reading comprehension, and multi-step argumentative reasoning.

97%

Acceptance rate: Validated through QA pipelines, expert audits, and programmatic checks.

20+

Distinct failure types tracked: From negation misreads and quantifier errors to logic-chain breakdowns and answer misalignment.

Industry: AI Research
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement
Stress-testing frontier models on LSAT-grade reasoning

The Challenge

A global frontier lab partnered with Turing to evaluate how its most advanced language models handled LSAT-caliber reasoning.

While most models exhibit strong pattern recognition, they still struggle with precise logical thinking, often misapplying conditionals, ignoring quantifiers, or failing to recognize contradictory premises. The client needed a diagnostic benchmark that would:

  • Generate model-breaking LSAT prompts across logic games, logical reasoning, and reading comprehension
  • Include correct answers, critiques of the model's incorrect outputs, and human-grade reasoning chains
  • Cover LSAT subtypes like conditional logic, flaw identification, principle matching, logic games (grouping, sequencing, matching), and multi-passage inference
  • Capture failure modes including overreliance on surface similarity, flawed formal logic, and misinterpreted causal relationships

The goal was to strengthen the model’s comprehension, argumentative reasoning, and formal logic, not only for LSAT-style tasks, but also for broader reasoning benchmarks like SATBench, GPQA, MMLU, and the AI2 Reasoning Challenge.

The Approach

Dataset

Turing delivered 2,000+ curated samples, each with the following (a hypothetical record layout is sketched after this list):

  • Original LSAT-style questions, never reused from public benchmarks
  • Correct and incorrect model outputs
  • Human-authored critiques of each incorrect model answer, including:
    a. Summary of the flaw
    b. Structured feedback by reasoning aspect (e.g., inference structure, answer elimination logic, principle misalignment)
  • Coverage across:
    a. Logical reasoning: Flaws, Sufficient Assumptions, Match the Flaw, Inference
    b. Logic games: Sequencing, Matching, Hybrid setups with conditional rules
    c. Reading comprehension: Main Point, Inference, Attitude, Primary Purpose, Analogies

Example: In one task, the model was asked to schedule a set of demos across time slots under multiple conditional and biconditional rules. It incorrectly selected a configuration that violated the adjacency constraints, exposing a failure to apply chained logic correctly, which is a common LSAT reasoning trap.
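
To make that failure concrete, here is a minimal sketch of how such a configuration can be checked programmatically. The demos, slots, and rules are invented placeholders rather than the actual task; the point is that a valid answer must survive every chained rule at once.

```python
from itertools import permutations

DEMOS = ["A", "B", "C", "D"]   # four demos to schedule (placeholders)
SLOTS = [1, 2, 3, 4]           # four consecutive time slots

def satisfies(assignment: dict) -> bool:
    """Return True only if a demo -> slot mapping obeys every rule."""
    # Conditional rule: if A is in slot 1, then B must be in slot 2.
    if assignment["A"] == 1 and assignment["B"] != 2:
        return False
    # Biconditional rule: C is in slot 3 if and only if D is in slot 4.
    if (assignment["C"] == 3) != (assignment["D"] == 4):
        return False
    # Adjacency constraint: B and C must occupy consecutive slots.
    if abs(assignment["B"] - assignment["C"]) != 1:
        return False
    return True

# Enumerate every ordering and keep only the configurations that pass.
valid = [
    dict(zip(DEMOS, order))
    for order in permutations(SLOTS)
    if satisfies(dict(zip(DEMOS, order)))
]
print(len(valid), "valid configurations, e.g.", valid[0])
```

A configuration that looks plausible slot by slot can still fail once the rules are applied jointly, which is exactly the chained-logic behavior these samples were built to probe.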

Evaluation

Turing implemented a multi-tier QA process to ensure that all questions, critiques, and answer keys were aligned with LSAT norms:

  • Expert-level auditors: Annotators and reviewers included students and alumni from renowned global universities such as Stanford and Columbia. This ensured that questions and critiques reflected true LSAT difficulty and precision.
  • Agentic LLM reviewer integration: We deployed a custom agentic LLM reviewer aligned to the client's rubrics. The reviewer flagged logic and critique errors before delivery and was iteratively improved with tighter prompt controls and conditional logic. It enabled:
    a. A reduction in false negatives from 50% → 22%
    b. A reduction in false positives from 30% → 11%
  • Programmatic QA checks: We layered automation to catch frequent formatting, logic, and delivery schema issues.
  • SOP-driven pass/fail tagging (see the sketch after this list) for:
    a. Answer correctness
    b. Reasoning chain completeness
    c. Model critique structure
    d. Principle conformity and logic chain resolution
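
As a rough illustration of the programmatic layer and the SOP-driven tagging, the sketch below applies a few pass/fail checks to one sample. The field names and rules are assumptions chosen to mirror the four tags above, not the actual delivery schema or SOP.

```python
# Illustrative pass/fail tagging for one sample. Field names and checks
# are assumptions, not the client's actual schema or SOP.
VALID_CHOICES = {"A", "B", "C", "D", "E"}
REQUIRED_FIELDS = ("question", "answer_key", "model_output", "critique", "reasoning_chain")

def qa_tags(sample: dict) -> dict:
    """Return a pass/fail tag for each SOP criterion."""
    critique = sample.get("critique", {})
    return {
        # Format check only; actual correctness is confirmed by human reviewers.
        "answer_correctness": sample.get("answer_key") in VALID_CHOICES,
        "reasoning_chain_completeness": bool(sample.get("reasoning_chain", "").strip()),
        "critique_structure": {"flaw_summary", "aspect_feedback"} <= set(critique),
        "schema_completeness": all(field in sample for field in REQUIRED_FIELDS),
    }

sample = {
    "question": "Which one of the following most accurately describes the flaw ...",
    "answer_key": "C",
    "model_output": "B",
    "reasoning_chain": "The argument treats a correlation as sufficient evidence of causation ...",
    "critique": {"flaw_summary": "Causal misread", "aspect_feedback": {"inference_structure": "..."}},
}
print(qa_tags(sample))   # a sample fails delivery if any tag is False
```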

This hybrid approach surfaced over 20 recurrent model failure patterns, including:

  • Misinterpreted negations and double negatives
  • Inverted conditional logic (illustrated after this list)
  • Answer choices selected by topic match rather than by logical form
  • Inconsistent elimination logic
  • Invalid analogies in reasoning structure
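
One of these patterns, inverted conditional logic, is easy to pin down mechanically: from "if P then Q" the contrapositive ("if not Q then not P") follows, but the inverse ("if not P then not Q") does not. The quick truth-table sketch below makes the gap explicit.

```python
# Truth-table sketch of the "inverted conditional logic" trap:
# P -> Q is equivalent to its contrapositive, not to its inverse.
from itertools import product

def implies(a: bool, b: bool) -> bool:
    return (not a) or b

print("P      Q      P->Q   ~Q->~P  ~P->~Q")
for p, q in product([True, False], repeat=2):
    original = implies(p, q)                # the stated rule
    contrapositive = implies(not q, not p)  # valid inference
    inverse = implies(not p, not q)         # the trap
    print(p, q, original, contrapositive, inverse)
# original and contrapositive always agree; the inverse diverges
# when P is False and Q is True.
```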

[Flowchart: Stress-testing frontier models on LSAT-grade reasoning]

Key Results

  • 97% overall QA pass rate across 2,000+ samples
  • Highly granular failure traceability per LSAT subtype
  • Used internally by the client to update instruction-following scaffolds and reward models
  • Enhanced test harness for model evaluation with chain-of-thought tracing and structured logic alignment

The Outcome

With this LSAT-style benchmark, the client is now able to:

  • Stress-test LLMs on human-caliber logic traps
  • Debug formal logic comprehension via structured critiques
  • Align model output to rule-based reasoning, not just language patterning
  • Establish clear, testable baselines across logical reasoning and comprehension modes

Benchmark your model against LSAT-grade tasks

Get access to prompts that break SOTA models, with critiques, rubrics, and traceable reasoning errors.

Request Sample


FAQ

What’s included in the LSAT sample?

One or more LSAT-style tasks with correct answers, failed model outputs, and detailed critiques.

Are these from public exams?

No, all tasks are original and never reused from past LSAT papers.

What types of LSAT sections are covered?

Logic games, logical reasoning, and reading comprehension, with full subtype coverage.

What’s the quality guarantee?

All tasks pass a multi-layer human + agentic QA process and adhere to LSAT formatting and logic complexity standards.

What’s the NDA process?

A standard mutual NDA; Turing returns a countersignature within one business day.

How fast can I get a sample?

Within 3 business days of NDA execution.

How many logic traps is your model still falling for?

Go beyond pattern-matching with benchmark data built for formal logic and argumentation.

Request Sample