Building 200+ Labeled Python Notebooks for NL-to-ML Tasks

Delivered a dataset of more than 200 Python notebooks generated from natural language prompts. Each notebook followed a standardized structure covering data loading, exploration, modeling, and artifact generation. The notebooks were designed for traceability, reproducibility, and CUJ (customer user journey) coverage.

200+

fully structured notebooks mapped from natural language prompts to end-to-end ML pipelines

10+

predefined step types labeled in every notebook, including data loading, exploration, wrangling, modeling, and evaluation.

100%

artifact coverage, including charts, tables, and summaries.

Method: Dataset creation
Domain: Code generation
Dataset scale: 200+ notebooks generated
Capability: Coding

The Challenge

The client needed a high-quality dataset of real-world, structured Python notebooks that would:

  • Translate data science questions into reproducible workflows
  • Use actual BigQuery tables across finance, economics, engineering, software, and other domains
  • Follow a fixed, interpretable step structure that aligned with internal ML pipelines
  • Include complete code artifacts and summary insights for each analysis

The goal was to standardize multi-step analytical workflows across variable tasks while ensuring output was correct, modular, and visually documented.

The Approach

Turing implemented a rigorous notebook generation and review protocol with built-in traceability.

Task framing

  • Used a set of natural language prompts to guide problem framing, such as “What factors influence retail churn?”
  • Paired each task with a real BigQuery dataset
  • Selected appropriate CUJs such as classification, regression, time series, or clustering (one such framed task is sketched below)
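
For concreteness, a framed task can be pictured as a small record pairing the prompt, the grounding BigQuery table, and the chosen CUJ. The sketch below is illustrative only; the field names and table ID are assumptions rather than the client’s actual schema.

```python
# Hypothetical task-framing record; field names and the table ID are
# illustrative assumptions, not the client's actual schema.
from dataclasses import dataclass

@dataclass
class FramedTask:
    prompt: str          # natural language question driving the notebook
    bigquery_table: str  # real BigQuery table the analysis is grounded in
    cuj: str             # e.g. "classification", "regression", "time series"

example_task = FramedTask(
    prompt="What factors influence retail churn?",
    bigquery_table="project.retail_demo.customer_activity",  # assumed table ID
    cuj="classification",
)
```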

Notebook structure

  • Defined 10 canonical steps (data loading → exploration → cleaning → wrangling → visualization → splitting → modeling → optimization → evaluation → summary); a minimal sketch of this structure follows the list
  • Ensured each step included:
    - A labeled markdown block
    - Library annotations
    - Output charts and tables with standardized naming
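
A minimal sketch of this structure, assuming a simple header template, is shown below: the ten canonical step names in order, plus a helper that renders each step’s labeled markdown block with its library annotations. The header wording is an assumption, not the delivered template.

```python
# Minimal sketch of the 10 canonical steps and the labeled markdown block that
# opens each one; the header wording is an assumption, not the delivered template.
CANONICAL_STEPS = [
    "data loading", "exploration", "cleaning", "wrangling", "visualization",
    "splitting", "modeling", "optimization", "evaluation", "summary",
]

def step_markdown(step_index: int, libraries: list[str]) -> str:
    """Render the labeled markdown block that opens a step."""
    name = CANONICAL_STEPS[step_index]
    libs = ", ".join(libraries) if libraries else "none"
    return f"## Step {step_index + 1}: {name}\n\nLibraries used: {libs}"

print(step_markdown(6, ["pandas", "scikit-learn"]))
# ## Step 7: modeling
#
# Libraries used: pandas, scikit-learn
```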

Artifact management

  • Saved all key outputs, such as .png, .json, and .parquet files, using a strict naming convention
  • Created subfolders per notebook, with per-step artifacts verified post-execution (an illustrative layout and check are sketched below)
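
The sketch below shows what such a layout could look like: artifacts written into a per-notebook, per-step subfolder under a deterministic filename, plus a post-execution check that flags steps with no saved output. The folder structure, filename pattern, and helper names are assumptions for illustration, not the convention used in the delivery.

```python
# Assumed per-notebook artifact layout and a post-execution completeness check;
# the folder structure and filename pattern are illustrative, not the
# convention used in the delivery.
from pathlib import Path
import json

def save_json_artifact(notebook_id: str, step: str, name: str, payload: dict) -> Path:
    """Write a JSON artifact into the notebook's per-step subfolder."""
    out_dir = Path("artifacts") / notebook_id / step
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{notebook_id}__{step}__{name}.json"  # strict naming
    out_path.write_text(json.dumps(payload, indent=2))
    return out_path

def steps_missing_artifacts(notebook_id: str, steps: list[str]) -> list[str]:
    """Post-execution check: list steps whose artifact folder is empty or absent."""
    root = Path("artifacts") / notebook_id
    return [s for s in steps
            if not ((root / s).is_dir() and any((root / s).iterdir()))]

# e.g. artifacts/nb_0042/evaluation/nb_0042__evaluation__metrics.json
save_json_artifact("nb_0042", "evaluation", "metrics", {"auc": 0.91})  # toy value
print(steps_missing_artifacts("nb_0042", ["data_loading", "evaluation"]))
# ['data_loading']
```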

Reviewer protocol

  • Each notebook was reviewed for step correctness, prompt clarity, and CUJ alignment (a sketch of such a review record follows this list)
  • An LLM was used as a reasoning assistant, but not a code generator; final code was manually authored and verified
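
A review record under this protocol might be captured along the lines sketched below; the criteria mirror the bullets above, while the schema itself is an illustrative assumption.

```python
# Hypothetical reviewer checklist record; the criteria mirror the protocol
# above, but the schema itself is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class ReviewResult:
    notebook_id: str
    steps_correct: bool  # every canonical step runs and matches its label
    prompt_clear: bool   # prompt is unambiguous and answerable from the data
    cuj_aligned: bool    # chosen CUJ (e.g. classification) fits the prompt

    @property
    def approved(self) -> bool:
        return self.steps_correct and self.prompt_clear and self.cuj_aligned

review = ReviewResult("nb_0042", steps_correct=True, prompt_clear=True, cuj_aligned=True)
assert review.approved
```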

Key Results

  • Delivered more than 200 Python notebooks with verified, labeled steps
  • Created artifact-complete workflows with reproducible output at each analysis step
  • Spanned multiple domains and CUJs including time series forecasting, geospatial analysis, and statistical testing
  • Standardized a scalable pattern for prompt-to-notebook transformation in real-world ML pipelines

The Outcome

This dataset directly powered the launch of a new natural-language-to-notebook generation feature in the client’s platform. Users can now enter a single prompt into the client’s notebook interface and receive a complete, multi-step data science workflow in response.

This capability is enabled by the prompt-response structures and labeled notebooks delivered by Turing.

Need structured NL-to-code data for ML workflow agents?

Request a dataset of labeled Python notebooks built from real prompts and datasets with modular steps, artifacts, and CUJ mappings.

Request Sample


FAQ

What’s included in each sample notebook?

Each notebook starts from a natural language task prompt and includes a complete, step-labeled ML workflow from data loading and exploration to modeling and evaluation.

How are steps structured?

Each step includes a markdown label, the prompt it answers, the libraries used, and a series of code cells.

What kinds of tasks are covered?

The dataset spans classification, regression, clustering, time series forecasting, geospatial analysis, and statistical testing, mapped to real-world BigQuery tables across multiple domains.

What kinds of artifacts are included?

Each step produces saved outputs such as .png charts, .json visualizations, and .parquet tables. All artifacts are verified and aligned to the step that generated them.

Can I use this to train or evaluate NL-to-code agents?

Yes. This dataset is ideal for training or benchmarking agents that translate natural language into multi-step data science workflows with modular logic.

How is code quality ensured?

All notebooks were authored and reviewed by human data scientists; an LLM served only as a reasoning assistant, never as the code generator.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Designing NL-to-code agents for data science workflows?

Request labeled, domain-grounded notebooks with stepwise reasoning, code outputs, and full artifact trails from natural language inputs.

Request Sample