Building 200+ Labeled Python Notebooks for NL-to-ML Tasks

Delivered a dataset of more than 200 Python notebooks generated from natural language prompts. Each notebook followed a standardized structure covering data loading, exploration, modeling, and artifact generation. The notebooks were designed for traceability, reproducibility, and CUJ (customer user journey) coverage.

200+

fully structured notebooks mapped from natural language prompts to end-to-end ML pipelines

10+

predefined step types labeled in every notebook, including data loading, exploration, wrangling, modeling, and evaluation.

100%

artifact coverage, including charts, tables, and summaries.

Method: Dataset creation
Domain: Code generation
Dataset scale: 200+ notebooks generated
Capability: Coding

The Challenge

The client needed a high-quality dataset of real-world, structured Python notebooks that would:

  • Translate data science questions into reproducible workflows
  • Use actual BigQuery tables across finance, economics, engineering, software, and other domains
  • Follow a fixed, interpretable step structure that aligned with internal ML pipelines
  • Include complete code artifacts and summary insights for each analysis

The goal was to standardize multi-step analytical workflows across variable tasks while ensuring output was correct, modular, and visually documented.

The Approach

Turing implemented a rigorous notebook generation and review protocol with built-in traceability.

Task framing

  • Used a set of natural language prompts to guide problem framing, such as “What factors influence retail churn?”
  • Paired each task with a real BigQuery dataset
  • Selected appropriate CUJs such as classification, regression, time series, or clustering (one such framed task is sketched below)
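
For concreteness, a framed task can be pictured as a small record pairing the prompt, the grounding BigQuery table, and the chosen CUJ. The sketch below is illustrative only; the field names and table ID are assumptions rather than the client’s actual schema.

```python
# Hypothetical task-framing record; field names and the table ID are
# illustrative assumptions, not the client's actual schema.
from dataclasses import dataclass

@dataclass
class FramedTask:
    prompt: str          # natural language question driving the notebook
    bigquery_table: str  # real BigQuery table the analysis is grounded in
    cuj: str             # e.g. "classification", "regression", "time series"

example_task = FramedTask(
    prompt="What factors influence retail churn?",
    bigquery_table="project.retail_demo.customer_activity",  # assumed table ID
    cuj="classification",
)
```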

Notebook structure

  • Defined 10 canonical steps (data loading → exploration → cleaning → wrangling → visualization → splitting → modeling → optimization → evaluation → summary); a minimal sketch of this structure follows the list
  • Ensured each step included:
    - A labeled markdown block
    - Library annotations
    - Output charts and tables with standardized naming
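
A minimal sketch of this structure, assuming a simple header template, is shown below: the ten canonical step names in order, plus a helper that renders each step’s labeled markdown block with its library annotations. The header wording is an assumption, not the delivered template.

```python
# Minimal sketch of the 10 canonical steps and the labeled markdown block that
# opens each one; the header wording is an assumption, not the delivered template.
CANONICAL_STEPS = [
    "data loading", "exploration", "cleaning", "wrangling", "visualization",
    "splitting", "modeling", "optimization", "evaluation", "summary",
]

def step_markdown(step_index: int, libraries: list[str]) -> str:
    """Render the labeled markdown block that opens a step."""
    name = CANONICAL_STEPS[step_index]
    libs = ", ".join(libraries) if libraries else "none"
    return f"## Step {step_index + 1}: {name}\n\nLibraries used: {libs}"

print(step_markdown(6, ["pandas", "scikit-learn"]))
# ## Step 7: modeling
#
# Libraries used: pandas, scikit-learn
```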

Artifact management

  • Saved all key outputs, such as .png, .json, and .parquet files, using a strict naming convention
  • Created subfolders per notebook, with per-step artifacts verified post-execution (an illustrative layout and check are sketched below)
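
The sketch below shows what such a layout could look like: artifacts written into a per-notebook, per-step subfolder under a deterministic filename, plus a post-execution check that flags steps with no saved output. The folder structure, filename pattern, and helper names are assumptions for illustration, not the convention used in the delivery.

```python
# Assumed per-notebook artifact layout and a post-execution completeness check;
# the folder structure and filename pattern are illustrative, not the
# convention used in the delivery.
from pathlib import Path
import json

def save_json_artifact(notebook_id: str, step: str, name: str, payload: dict) -> Path:
    """Write a JSON artifact into the notebook's per-step subfolder."""
    out_dir = Path("artifacts") / notebook_id / step
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{notebook_id}__{step}__{name}.json"  # strict naming
    out_path.write_text(json.dumps(payload, indent=2))
    return out_path

def steps_missing_artifacts(notebook_id: str, steps: list[str]) -> list[str]:
    """Post-execution check: list steps whose artifact folder is empty or absent."""
    root = Path("artifacts") / notebook_id
    return [s for s in steps
            if not ((root / s).is_dir() and any((root / s).iterdir()))]

# e.g. artifacts/nb_0042/evaluation/nb_0042__evaluation__metrics.json
save_json_artifact("nb_0042", "evaluation", "metrics", {"auc": 0.91})  # toy value
print(steps_missing_artifacts("nb_0042", ["data_loading", "evaluation"]))
# ['data_loading']
```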

Reviewer protocol

  • Each notebook was reviewed for step correctness, prompt clarity, and CUJ alignment (a sketch of such a review record follows this list)
  • An LLM was used as a reasoning assistant, but not a code generator; final code was manually authored and verified
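
A review record under this protocol might be captured along the lines sketched below; the criteria mirror the bullets above, while the schema itself is an illustrative assumption.

```python
# Hypothetical reviewer checklist record; the criteria mirror the protocol
# above, but the schema itself is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class ReviewResult:
    notebook_id: str
    steps_correct: bool  # every canonical step runs and matches its label
    prompt_clear: bool   # prompt is unambiguous and answerable from the data
    cuj_aligned: bool    # chosen CUJ (e.g. classification) fits the prompt

    @property
    def approved(self) -> bool:
        return self.steps_correct and self.prompt_clear and self.cuj_aligned

review = ReviewResult("nb_0042", steps_correct=True, prompt_clear=True, cuj_aligned=True)
assert review.approved
```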

Key Results

  • Delivered more than 200 Python notebooks with verified, labeled steps
  • Created artifact-complete workflows with reproducible output at each analysis step
  • Spanned multiple domains and CUJs including time series forecasting, geospatial analysis, and statistical testing
  • Standardized a scalable pattern for prompt-to-notebook transformation in real-world ML pipelines

The Outcome

This dataset directly powered the launch of a new natural-language-to-notebook generation feature in the client’s platform. Users can now enter a single prompt into the client’s notebook interface and receive a complete, multi-step data science workflow in response.

This capability is enabled by the prompt-response structures and labeled notebooks delivered by Turing.

Need structured NL-to-code data for ML workflow agents?

Request a dataset of labeled Python notebooks built from real prompts and datasets with modular steps, artifacts, and CUJ mappings.

Request Sample


FAQ

What’s included in each sample notebook?

Each notebook starts from a natural language task prompt and includes a complete, step-labeled ML workflow from data loading and exploration to modeling and evaluation.

How are steps structured?

Each step includes a markdown label, the prompt it answers, the libraries used, and a series of code cells.

What kinds of tasks are covered?

The dataset spans classification, regression, clustering, time series forecasting, geospatial analysis, and statistical testing, mapped to real-world BigQuery tables across multiple domains.

What kinds of artifacts are included?

Each step produces saved outputs such as .png charts, .json visualizations, and .parquet tables. All artifacts are verified and aligned to the step that generated them.

Can I use this to train or evaluate NL-to-code agents?

Yes. This dataset is ideal for training or benchmarking agents that translate natural language into multi-step data science workflows with modular logic.

How is code quality ensured?

All notebooks were authored and reviewed by human data scientists; an LLM served only as a reasoning assistant, never as the code generator.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Designing NL-to-code agents for data science workflows?

Request labeled, domain-grounded notebooks with stepwise reasoning, code outputs, and full artifact trails from natural language inputs.

Request Sample