Benchmarking RTL Agents with 1,500+ Real-World Verilog Tasks for NVIDIA’s CVDP

Built a benchmark-grade dataset to evaluate LLMs on hardware design challenges, including Register-Transfer Level (RTL) generation, debugging, module reuse, and functional verification. The dataset supported both agentic and non-agentic settings, incorporated real-world simulation harnesses, and exposed model failure modes in complex, tool-driven workflows.

1,500+

RTL design tasks: Authored and tested across agentic and non-agentic problem types.

10+

Hardware design categories covered: Including code completion, RTL-to-spec mapping, testbench generation, assertion creation, and bug fixing.

Multi-tool

Simulation support: Commercial and open-source verification toolchains, including Cadence Xcelium, Icarus Verilog, and additional testing environments.

Industry: Computer Hardware Manufacturing
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement

The Challenge

Hardware design presents challenges that standard code generation benchmarks can’t capture. The client needed a dataset that would:

  • Reflect production-level RTL workflows, including debugging, testbenching, and module reuse
  • Evaluate multi-turn problem-solving across deeply nested repository structures and multi-file codebases
  • Enable evaluation using open-source and commercial EDA tools like Icarus Verilog and Cadence Xcelium
  • Support both non-agentic (single-turn) and agentic (multi-turn, tool-aware) evaluation modes, with robust simulation infrastructure to detect logic and syntax flaws
  • Integrate seamlessly with EDA tools and multi-file workflows

Previous benchmarks were limited in scope, often yielding high pass rates (>60%) due to oversimplified prompts and constrained test setups. The client required a more difficult benchmark with meaningful failure diversity and tooling realism.

The Approach

Dataset

Turing delivered a curated, simulation-ready dataset of 1,500+ RTL design problems, encompassing multiple complexity tiers:

  • Copilot datapoints: Single-file RTL prompts with a compact context and a golden solution
  • Agentic datapoints: Multi-file, multi-step tasks with tool-invocation requirements and broader problem framing
  • Heavy datapoints: Full Git-style projects involving bug resolution, architectural comprehension, and simulation execution, with contexts exceeding 200k tokens

The dataset spanned 13 categories across design generation, verification, and comprehension, including RTL-to-spec mapping, checker and assertion generation, and tool-invoked debugging. Each datapoint included:

  • A clear prompt and minimal context
  • A correct, simulation-passable reference solution
  • A test harness for pass/fail validation
  • Metadata for category, complexity, and coverage tracking
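As a rough illustration of the four components listed above, a datapoint could be modeled as a small record. This is a hypothetical sketch for clarity; the field names and example values are ours, not the actual CVDP schema.

```python
from dataclasses import dataclass

@dataclass
class RTLDatapoint:
    """One benchmark task (illustrative schema; field names are hypothetical)."""
    prompt: str                    # the task description shown to the model
    context_files: dict[str, str]  # filename -> Verilog source given as context
    golden_solution: str           # reference RTL verified to pass simulation
    harness: str                   # testbench file used for pass/fail validation
    category: str                  # e.g. "code completion", "spec-to-RTL"
    complexity: str                # "copilot", "agentic", or "heavy"

# Example "copilot"-tier datapoint (contents are placeholders)
dp = RTLDatapoint(
    prompt="Implement a 4-bit up-counter with synchronous reset.",
    context_files={"counter.v": "module counter(...);\n// TODO\nendmodule\n"},
    golden_solution="module counter(...); /* reference RTL */ endmodule",
    harness="counter_tb.v",
    category="code completion",
    complexity="copilot",
)
```

The same record shape extends naturally to agentic and heavy tiers by adding more context files and tool-invocation metadata.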

Evaluation

To meet semiconductor-grade QA standards, each datapoint underwent a multi-stage review process:

  • Manual review by engineers trained in Verilog, RTL design, and functional verification
  • Harness-based pass/fail testing with Icarus Verilog, plus Cadence Xcelium for advanced cases
  • LLM-based quality filtering for ambiguity, test-solution alignment, and behavioral validity
  • Task labeling for difficulty, coverage, and category fit, followed by filtering for determinism and evaluation clarity

This pipeline ensured that each retained task tested a substantive capability, such as prompt alignment, specification translation, or multi-module reasoning, rather than trivial syntax fixes.
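The harness-based pass/fail step above can be sketched as a small runner around the open-source Icarus Verilog toolchain (`iverilog` to compile, `vvp` to simulate). This is a minimal illustration under our own assumptions, not the actual review tooling; in particular, we assume the testbench prints "PASS" on success.

```python
import os
import subprocess
import tempfile

def run_harness(solution_v: str, testbench_v: str) -> bool:
    """Compile a candidate RTL solution against its testbench with Icarus
    Verilog and simulate it. Assumes the testbench signals success by
    printing 'PASS' and exiting normally (an assumption of this sketch)."""
    with tempfile.TemporaryDirectory() as tmp:
        sol = os.path.join(tmp, "solution.v")
        tb = os.path.join(tmp, "tb.v")
        sim_bin = os.path.join(tmp, "sim.vvp")
        with open(sol, "w") as f:
            f.write(solution_v)
        with open(tb, "w") as f:
            f.write(testbench_v)
        try:
            # Compile: any syntax error fails the datapoint immediately.
            compiled = subprocess.run(
                ["iverilog", "-o", sim_bin, sol, tb],
                capture_output=True, text=True)
            if compiled.returncode != 0:
                return False
            # Simulate and read the testbench's verdict from stdout.
            sim = subprocess.run(
                ["vvp", sim_bin], capture_output=True, text=True)
            return sim.returncode == 0 and "PASS" in sim.stdout
        except FileNotFoundError:
            # Toolchain not installed; treat the run as a failure.
            return False
```

A runner like this catches both syntax flaws (at compile time) and logic flaws (at simulation time), which is what distinguishes simulation-backed validation from string-matching checks.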

Hardware development workflow

Key Results

  • Authored and tested 1,500+ simulation-ready RTL tasks across agentic and non-agentic settings
  • Created a Hard Subset of tasks for advanced reasoning, requiring deeper VLSI expertise, long-context handling, and architectural constraint comprehension
  • Tasks delivered across 13 RTL task categories, including spec-to-RTL mapping, assertion/testbench generation, and debugging
  • Enabled cycle-accurate detection of failure modes for corner cases like FSM errors, signal width mismatches, and semantic violations
  • Supported real-world evaluation of frontier models (Claude, GPT, LLaMA) using pass@1 and BLEU metrics
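For context, the pass@1 metric cited above is typically computed with the standard unbiased pass@k estimator (sample n completions per task, count the c that pass the harness, then average across tasks). The sketch below shows that standard formula for illustration; it is not necessarily the exact scoring code used for CVDP.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples per task, c of which
    pass the harness. For k == 1 this reduces to the plain pass rate c/n."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level pass@1: average the per-task estimates.
# (samples, passes) per task -- illustrative numbers only.
per_task = [(10, 3), (10, 0), (10, 10)]
score = sum(pass_at_k(n, c, 1) for n, c in per_task) / len(per_task)
```

Averaging per-task estimates rather than pooling raw samples keeps each task equally weighted regardless of how many completions were drawn for it.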

The Outcome

The dataset built by Turing became the foundation for Comprehensive Verilog Design Problems (CVDP), now the most challenging hardware design benchmark available:

  • GPT-4o dropped from 63% (on prior benchmarks) to 29% on CVDP
  • Claude 3.7 Sonnet peaked at just 33.56% on non-agentic code generation, revealing real-world capability gaps
  • Agentic tasks showed an additional 10–20% drop-off, reflecting tool-based reasoning challenges
  • Structured tasks enabled category-level error clustering and root-cause analysis

The benchmark now serves as a standard for evaluating LLMs in hardware design, and was purpose-built for extensibility as model capabilities advance.

Where does your model fail in RTL design?

Request a benchmark sample featuring a debugging task, expected behavior specification, and signal-accurate pass/fail evaluation.

Request Sample


FAQ

What’s in the sample?

Each sample includes a task prompt, full context, verified golden solution, and corresponding test harness.

What tools are required?

Simulations run on open-source tools such as Icarus Verilog and cocotb, while advanced tasks may require commercial tools like Cadence Xcelium and JasperGold.

Is this agent-compatible?

Yes. Tasks can be evaluated in agentic (multi-turn with tool use) or non-agentic (single-prompt) settings.

How complex are the tasks?

Tasks range from basic RTL code edits to multi-turn, agentic bug fixes exceeding 200k tokens.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Is your agent ready for tool-integrated RTL workflows?

Request a multi-turn Verilog design challenge with a harness-based simulator setup, purpose-built for real agentic evaluation.

Request Sample