Benchmarking RTL Agents with 1,500+ Real-World Verilog Tasks for NVIDIA’s CVDP
Built a benchmark-grade dataset to evaluate LLMs on hardware design challenges, including Register-Transfer Level (RTL) generation, debugging, module reuse, and functional verification. The dataset supported both agentic and non-agentic settings, incorporated real-world simulation harnesses, and exposed model failure modes in complex, tool-driven workflows.
1,500+
RTL design tasks: Authored and tested across agentic and non-agentic problem types.
10+
hardware design categories covered: Including code completion, RTL-to-spec mapping, testbench generation, assertion creation, and bug fixing.
Multi-tool
simulation support: Commercial and open-source verification toolchains, including Cadence Xcelium, Icarus Verilog, and additional testing environments.

The Challenge
Hardware design presents challenges that standard code generation benchmarks can’t capture. The client needed a dataset that would:
- Reflect production-level RTL workflows, including debugging, testbenching, and module reuse
- Evaluate multi-turn problem-solving across deeply nested repository structures and multi-file codebases
- Enable evaluation using open-source and commercial EDA tools like Icarus Verilog and Cadence Xcelium
- Support both non-agentic (single-turn) and agentic (multi-turn, tool-aware) evaluation modes, with robust simulation infrastructure to detect logic and syntax flaws
- Integrate seamlessly with EDA tools and multi-file workflows
Previous benchmarks were limited in scope, often yielding high pass rates (>60%) due to oversimplified prompts and constrained test setups. The client required a more difficult benchmark with meaningful failure diversity and tooling realism.
The Approach
Dataset
Turing delivered a curated, simulation-ready dataset of 1,500 RTL design problems, encompassing multiple complexity tiers:
- Copilot datapoints: Single-file RTL prompts with compact context and golden solution
- Agentic datapoints: Multi-file, multi-step tasks with tool-invocation requirements and broader problem framing
- Heavy datapoints: Full Git-style projects involving bug resolution, architectural comprehension, and simulation execution, with contexts exceeding 200k tokens
The dataset spanned 13 categories across design generation, verification, and comprehension, including RTL-to-spec mapping, checker and assertion generation, and tool-invoked debugging. Each datapoint included:
- A clear prompt and minimal context
- A correct, simulation-passable reference solution
- A test harness for pass/fail validation
- Metadata for category, complexity, and coverage tracking
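The four components above can be pictured as a single record per task. The sketch below is illustrative only: the field names and values are hypothetical, not the actual CVDP schema.

```python
# Illustrative sketch only: field names and values are hypothetical,
# not the actual CVDP schema.
datapoint = {
    "id": "rtl_fsm_debug_0042",                # hypothetical identifier
    "category": "rtl_debugging",               # one of the 13 task categories
    "complexity": "agentic",                   # copilot | agentic | heavy
    "prompt": "Fix the faulty state transition in the FSM below.",
    "context": {"rtl/fsm.v": "...", "docs/spec.md": "..."},  # input files
    "golden_solution": {"rtl/fsm.v": "..."},   # simulation-passable reference
    "harness": {"tb/fsm_tb.v": "..."},         # testbench for pass/fail validation
    "metadata": {"tokens": 1850, "coverage": ["fsm", "state_encoding"]},
}

# A retained task must carry all four components listed above.
required = {"prompt", "golden_solution", "harness", "metadata"}
assert required <= datapoint.keys()
```

Keeping the harness and golden solution in the same record is what makes each task self-validating: a candidate solution either passes the bundled testbench or it does not.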
Evaluation
To meet semiconductor-grade QA standards, each datapoint underwent a multi-stage review process:
- Manual review by engineers trained in Verilog, RTL design, and functional verification
- Harness-based pass/fail testing with Icarus Verilog, plus Cadence Xcelium for advanced cases
- LLM-based quality filtering for ambiguity, test-solution alignment, and behavioral validity
- Task labeling for difficulty, coverage, and category fit, followed by filtering for determinism and evaluation clarity
This pipeline ensured that each retained task tested a substantive capability, such as prompt alignment, specification translation, or multi-module reasoning, rather than trivial syntax fixes.
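The harness-based step above reduces to scoring a simulator log. A minimal sketch of that scoring, assuming the (hypothetical) convention that a testbench prints a single PASS/FAIL verdict line before finishing; real harnesses may use richer protocols:

```python
# Sketch of harness-based pass/fail scoring. Assumes the hypothetical
# convention that the testbench prints "TEST PASSED" or "TEST FAILED"
# (e.g. via $display) and that any ERROR line invalidates a pass.
def verdict(sim_log: str) -> bool:
    """Return True only if the log shows a pass and no failure markers."""
    lines = sim_log.splitlines()
    failed = any("TEST FAILED" in ln or "ERROR" in ln for ln in lines)
    passed = any("TEST PASSED" in ln for ln in lines)
    return passed and not failed

print(verdict("loading fsm_tb\nTEST PASSED"))    # True
print(verdict("TEST PASSED\nERROR: X on rst_n")) # False: the error wins
```

Requiring both a positive verdict and the absence of error markers is what lets the pipeline flag tasks that pass syntactically but fail behaviorally.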
Key Results
- Authored and tested 1,500+ simulation-ready RTL tasks across agentic and non-agentic settings
- Created a Hard Subset of tasks for advanced reasoning, requiring deeper VLSI expertise, long-context handling, and architectural constraint comprehension
- Tasks delivered across 13 RTL task categories, including spec-to-RTL mapping, assertion/testbench generation, and debugging
- Enabled cycle-accurate detection of failure modes for corner cases like FSM errors, signal width mismatches, and semantic violations
- Supported real-world evaluation of frontier models (Claude, GPT, LLaMA) across pass@1 and BLEU metrics
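The pass@1 figures reported above are conventionally computed with the unbiased pass@k estimator standard in code-generation benchmarks (Chen et al., 2021); whether CVDP uses exactly this estimator is an assumption. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them passing.

    pass@k = 1 - C(n-c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the raw pass rate c/n: 3 passes out of 10 samples.
print(pass_at_k(10, 3, 1))  # ~0.3
```

For k=1 the estimator is simply the fraction of sampled solutions that pass the harness, which is why pass@1 pairs naturally with the deterministic pass/fail testbenches described above.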
The Outcome
The dataset built by Turing became the foundation for Comprehensive Verilog Design Problems (CVDP), now the most challenging hardware design benchmark available:
- GPT-4o dropped from 63% (on prior benchmarks) to 29% on CVDP
- Claude 3.7 Sonnet peaked at just 33.56% on non-agentic code generation, revealing real-world capability gaps
- Agentic tasks showed an additional 10–20% drop-off, reflecting tool-based reasoning challenges
- Structured tasks enabled category-level error clustering and root-cause analysis
The benchmark now serves as a standard for evaluating LLMs in hardware design, and was purpose-built for extensibility as model capabilities advance.
Where does your model fail in RTL design?
Request a benchmark sample featuring a debugging task, expected behavior specification, and signal-accurate pass/fail evaluation.
Request Sample
FAQ
What’s in the sample?
Each sample includes a task prompt, full context, verified golden solution, and corresponding test harness.
What tools are required?
Simulations run on open-source tools such as Icarus Verilog and cocotb, while advanced tasks may require commercial tools like Cadence Xcelium and JasperGold.
Is this agent-compatible?
Yes. Tasks can be evaluated in agentic (multi-turn with tool use) or non-agentic (single-prompt) settings.
How complex are the tasks?
Tasks range from basic RTL code edits to multi-turn, agentic bug fixes exceeding 200k tokens.
What’s the NDA process?
A standard mutual NDA. Turing provides the countersigned agreement within one business day.
How fast can I get a sample?
Within three business days after NDA execution.
Is your agent ready for tool-integrated RTL workflows?
Request a multi-turn Verilog design challenge with a harness-based simulator setup, purpose-built for real agentic evaluation.