Benchmarking RTL Agents with 1,500+ Real-World Verilog Tasks for NVIDIA’s CVDP
Built a benchmark-grade dataset to evaluate LLMs on hardware design challenges, including Register-Transfer Level (RTL) generation, debugging, module reuse, and functional verification. The dataset supported both agentic and non-agentic settings, incorporated real-world simulation harnesses, and exposed model failure modes in complex, tool-driven workflows.
1,500+
RTL design tasks: Authored and tested across agentic and non-agentic problem types.
10+
hardware design categories covered: Including code completion, RTL-to-spec mapping, testbench generation, assertion creation, and bug fixing.
Multi-tool
simulation support: Commercial and open-source verification toolchains, including Cadence Xcelium, Icarus Verilog, and additional testing environments.

The Challenge
Hardware design presents challenges that standard code generation benchmarks can’t capture. The client needed a dataset that would:
- Reflect production-level RTL workflows, including debugging, testbenching, and module reuse
- Evaluate multi-turn problem-solving across deeply nested repository structures and multi-file codebases
- Enable evaluation using open-source and commercial EDA tools like Icarus Verilog and Cadence Xcelium
- Support both non-agentic (single-turn) and agentic (multi-turn, tool-aware) evaluation modes, with robust simulation infrastructure to detect logic and syntax flaws
- Integrate seamlessly with EDA tools and multi-file workflows
Previous benchmarks were limited in scope, often yielding high pass rates (>60%) due to oversimplified prompts and constrained test setups. The client required a more difficult benchmark with meaningful failure diversity and tooling realism.
The Approach
Dataset
Turing delivered a curated, simulation-ready dataset of 1,500 RTL design problems, encompassing multiple complexity tiers:
- Copilot datapoints: Single-file RTL prompts with compact context and golden solution
- Agentic datapoints: Multi-file, multi-step tasks with tool-invocation requirements and broader problem framing
- Heavy datapoints: Full Git-style projects involving bug resolution, architectural comprehension, and simulation execution, with contexts exceeding 200k tokens
The dataset spanned 13 categories across design generation, verification, and comprehension, including RTL-to-spec mapping, checker and assertion generation, and tool-invoked debugging. Each datapoint included:
- A clear prompt and minimal context
- A correct, simulation-passable reference solution
- A test harness for pass/fail validation
- Metadata for category, complexity, and coverage tracking
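The four components above can be pictured as a single record per task. The sketch below is illustrative only: the field names and values are hypothetical, not the actual CVDP schema.

```python
# Illustrative sketch only: field names and values are hypothetical,
# not the actual CVDP schema.
datapoint = {
    "id": "rtl_fsm_debug_0042",                # hypothetical identifier
    "category": "rtl_debugging",               # one of the 13 task categories
    "complexity": "agentic",                   # copilot | agentic | heavy
    "prompt": "Fix the faulty state transition in the FSM below.",
    "context": {"rtl/fsm.v": "...", "docs/spec.md": "..."},  # input files
    "golden_solution": {"rtl/fsm.v": "..."},   # simulation-passable reference
    "harness": {"tb/fsm_tb.v": "..."},         # testbench for pass/fail validation
    "metadata": {"tokens": 1850, "coverage": ["fsm", "state_encoding"]},
}

# A retained task must carry all four components listed above.
required = {"prompt", "golden_solution", "harness", "metadata"}
assert required <= datapoint.keys()
```

Keeping the harness and golden solution in the same record is what makes each task self-validating: a candidate solution either passes the bundled testbench or it does not.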
Evaluation
To meet semiconductor-grade QA standards, each datapoint underwent a multi-stage review process:
- Manual review by engineers trained in Verilog, RTL design, and functional verification
- Harness-based pass/fail testing with Icarus Verilog, plus Cadence Xcelium for advanced cases
- LLM-based quality filtering for ambiguity, test-solution alignment, and behavioral validity
- Task labeling for difficulty, coverage, and category fit, followed by filtering for determinism and evaluation clarity
This pipeline ensured that each retained task tested a substantive capability, such as prompt alignment, specification translation, or multi-module reasoning, rather than trivial syntax fixes.
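The harness-based step above reduces to scoring a simulator log. A minimal sketch of that scoring, assuming the (hypothetical) convention that a testbench prints a single PASS/FAIL verdict line before finishing; real harnesses may use richer protocols:

```python
# Sketch of harness-based pass/fail scoring. Assumes the hypothetical
# convention that the testbench prints "TEST PASSED" or "TEST FAILED"
# (e.g. via $display) and that any ERROR line invalidates a pass.
def verdict(sim_log: str) -> bool:
    """Return True only if the log shows a pass and no failure markers."""
    lines = sim_log.splitlines()
    failed = any("TEST FAILED" in ln or "ERROR" in ln for ln in lines)
    passed = any("TEST PASSED" in ln for ln in lines)
    return passed and not failed

print(verdict("loading fsm_tb\nTEST PASSED"))    # True
print(verdict("TEST PASSED\nERROR: X on rst_n")) # False: the error wins
```

Requiring both a positive verdict and the absence of error markers is what lets the pipeline flag tasks that pass syntactically but fail behaviorally.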
Key Results
- Authored and tested 1,500+ simulation-ready RTL tasks across agentic and non-agentic settings
- Created a Hard Subset of tasks for advanced reasoning, requiring deeper VLSI expertise, long-context handling, and architectural constraint comprehension
- Tasks delivered across 13 RTL task categories, including spec-to-RTL mapping, assertion/testbench generation, and debugging
- Enabled cycle-accurate detection of failure modes for corner cases like FSM errors, signal width mismatches, and semantic violations
- Supported real-world evaluation of frontier models (Claude, GPT, LLaMA) across pass@1 and BLEU metrics
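The pass@1 figures reported above are conventionally computed with the unbiased pass@k estimator standard in code-generation benchmarks (Chen et al., 2021); whether CVDP uses exactly this estimator is an assumption. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them passing.

    pass@k = 1 - C(n-c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the raw pass rate c/n: 3 passes out of 10 samples.
print(pass_at_k(10, 3, 1))  # ~0.3
```

For k=1 the estimator is simply the fraction of sampled solutions that pass the harness, which is why pass@1 pairs naturally with the deterministic pass/fail testbenches described above.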
The Outcome
The dataset built by Turing became the foundation for Comprehensive Verilog Design Problems (CVDP), now the most challenging hardware design benchmark available:
- GPT-4o dropped from 63% (on prior benchmarks) to 29% on CVDP
- Claude 3.7 Sonnet peaked at just 33.56% on non-agentic code generation, revealing real-world capability gaps
- Agentic tasks showed an additional 10–20% drop-off, reflecting tool-based reasoning challenges
- Structured tasks enabled category-level error clustering and root-cause analysis
The benchmark now serves as a standard for evaluating LLMs in hardware design, and was purpose-built for extensibility as model capabilities advance.
Where does your model fail in RTL design?
Request a benchmark sample featuring a debugging task, expected behavior specification, and signal-accurate pass/fail evaluation.
Request Sample
FAQ
What’s in the sample?
Each sample includes a task prompt, full context, verified golden solution, and corresponding test harness.
What tools are required?
Simulations run on open-source tools such as Icarus Verilog and cocotb, while advanced tasks may require commercial tools like Cadence Xcelium and JasperGold.
Is this agent-compatible?
Yes. Tasks can be evaluated in agentic (multi-turn with tool use) or non-agentic (single-prompt) settings.
How complex are the tasks?
Tasks range from basic RTL code edits to multi-turn, agentic bug fixes exceeding 200k tokens.
What’s the NDA process?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How fast can I get a sample?
Within three business days after NDA execution.
Is your agent ready for tool-integrated RTL workflows?
Request a multi-turn Verilog design challenge with a harness-based simulator setup, purpose-built for real agentic evaluation.