For Developers

Introduction to DAGsHub and DVC in Machine Learning for Beginners

Every machine learning problem demands a unique solution suited to its distinctive characteristics. There is no one-size-fits-all approach that works wonders for every problem, and finding the ideal configuration for each unique solution costs considerable time and money.

To counter this problem, organizations track their experiments to ensure optimal performance. Recording these experiments by hand quickly becomes a daunting task, and it doesn't solve the issue of managing tons of data manually in a spreadsheet.

In such cases, we approach the problem differently. Actionable insights from previous experiments help in designing better ones, enhancing the productivity of the entire process. Here's where DVC (Data Version Control) and DagsHub come into the picture.

In this article, we will discover what DagsHub and DVC are and how DagsHub makes it simple for machine learning beginners to track different experiments. We will also train models on the classic Iris dataset and run various experiments backed by classification models.

That isn't all. At the end, we will also visualize and compare the results using DAGsHub's interactive dashboards. So before we get down to business, here's a quick brief on what DVC, FastDS, and DAGsHub really are.

What is DagsHub?

DAGsHub is a web platform built on top of GitHub and DVC that brings together a variety of open-source tools. It is optimized for data science and is the equivalent of GitHub for machine learning: it helps data scientists and machine learning engineers share their experiments, code, data, and models.

In short, we can say that it is oriented towards the open-source community. It lets you share, review, and reuse work that has already been done, bringing the GitHub experience to machine learning.

That isn't all the DAGsHub platform offers: it comes with experiment tracking, MLflow integration, ML pipeline visualization, and plenty more such helpful features. It helps machine learning engineers version their code, data, models, and experiments. Catch a glance at some of the incredible features of DagsHub:

  • Comparison of ML metrics
  • Data version control
  • Track ML experiments
  • MLflow integration and pipeline visualization
  • DVC and Git integration
  • Label annotation via Label Studio and others

The best part of DAGsHub is how easy it is to play around with its various features, and the way the whole platform is focused on helping data scientists and machine learning engineers. To put it simply, it eases MLOps for solving a machine learning problem by

  • Storing the results of the experiment
  • Creating the data pipeline
  • Mirroring the repository

Now, let’s get acquainted with DVC in machine learning.

What is DVC?

DVC is a Python library and framework for data version control. It stores data and model files much as Git stores code, but it is designed primarily for data. With DVC, you keep small metadata files describing the different versions of your data in Git, while storing the actual data elsewhere.

Likewise, the syntax of DVC is very much like Git's, so it is easy to learn DVC if you are already familiar with Git and its commands. However, DVC requires setting up remote storage for the data itself, which is where DagsHub comes to the rescue.

You can install it using the following command.

pip install dvc

Another term that you will come across while working with DagsHub and DVC is FastDS. As stated on its official site, FastDS is an "open-source command-line wrapper around Git and DVC, designed to minimize the chances of human error, automate repetitive tasks, and provide a smoother landing for new users."

This means FastDS helps machine learning engineers version control the code and the data at the same time. We can say that:

FastDS = git + DVC 

How to use DagsHub to track ML experiments?

DagsHub is an easy and efficient option to track machine learning experiments in place of using a spreadsheet for the same task. It eliminates the need to track hundreds of parameters manually, which can be the root cause of many errors.

This web platform based on open-source tools is a lifesaver for machine learning practitioners and data scientists. In this article, we will illustrate how you can log and visualize experiments using DagsHub.

Let’s get started.

Assuming you have a basic understanding of sklearn and Git, we will get you acquainted with some prerequisites, followed by a demonstration of using DagsHub for experiment tracking.

Creating a file management system

In the following example, we are going to create a file management system that we will use in the rest of the demonstration. Here’s how you can create it.

│   model.ipynb
│
├───data
│
└───models


In the above file structure, model.ipynb is the notebook that generates the different models. data and models are two additional folders residing in the working directory. The data folder contains the dataset we will be working on, while the models folder stores the pickle files of the various models created during each experiment run.
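If you prefer to set this structure up programmatically, here is a minimal sketch (assuming you run it from the working directory):

```python
import os

# create the data and models folders used in the rest of the tutorial;
# exist_ok avoids an error if you re-run the cell
for folder in ("data", "models"):
    os.makedirs(folder, exist_ok=True)
```

The model.ipynb notebook itself is simply created alongside these folders.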

Building a function to churn out models for experiment

We will use the Iris dataset, where the task is to classify each entry into one of three classes of iris plant based on its physical characteristics. Here's how we create a function to harvest some models for our experiment pipeline.

#dependencies
import pandas as pd
import dagshub
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score
import pickle

#function to log the model parameters
def model_run(model_type):
    #reading the data
    df = pd.read_csv('data/iris.csv')
    #splitting into features and labels
    X = df[['SepalLengthCm','SepalWidthCm','PetalLengthCm']]
    y = df['Species']
    #train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)
    with dagshub.dagshub_logger() as logger:
        #model definition
        model = model_type(random_state=42)
        #log the model parameters
        logger.log_hyperparams(model_class=type(model).__name__)
        logger.log_hyperparams({'model': model.get_params()})
        #training the model
        model.fit(X_train, y_train)
        #predicting on the held-out test set
        y_pred = model.predict(X_test)
        #log the model's performance
        logger.log_metrics({'accuracy': round(accuracy_score(y_test, y_pred), 3)})
        #saving the model
        file_name = model_type.__name__ + '_model.sav'
        pickle.dump(model, open('models/' + file_name, 'wb'))

#running an experiment
model_run(RandomForestClassifier)


In the above code, the dependencies section imports the Python packages the function needs. The function then reads the data using pandas.

Next, it splits the dataset into features and labels, and then into train and test sets. 30% of the total data is held out for testing using sklearn's train_test_split.
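To see concretely what a 30% split means, here is a tiny self-contained example using made-up data rather than the Iris file:

```python
from sklearn.model_selection import train_test_split

# 10 toy samples with binary labels
X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)

# 30% of 10 samples are held out for testing
print(len(X_train), len(X_test))  # → 7 3
```

Fixing random_state makes the split reproducible across runs, which matters when you want experiments to be comparable.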

Next, the DagsHub logger records the model's hyperparameters and metrics. The model's fit function is called on the training set, predictions are made on the test set, and the trained model is saved into the models folder using Python's pickle.
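The pickle step is the standard save-and-reload round trip. A self-contained sketch (using sklearn's built-in copy of the Iris data instead of the CSV, so it runs anywhere):

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42).fit(X, y)

# save the trained model, as the article's function does
with open("RandomForestClassifier_model.sav", "wb") as f:
    pickle.dump(model, f)

# reload it later for inference
with open("RandomForestClassifier_model.sav", "rb") as f:
    restored = pickle.load(f)

# the restored model makes identical predictions
assert (restored.predict(X) == model.predict(X)).all()
```

Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.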

If the above code runs as it should, you will find a .sav file in the models folder, and metrics.csv and params.yml in the working directory. You can verify this against the file structure below.

│   model.ipynb
│   metrics.csv
│   params.yml
│
├───data
│       iris.csv
│
└───models
        RandomForestClassifier_model.sav
Tracking experiment using DagsHub Logger

We will start by pushing the files to DagsHub. But before that, we need to set up a remote repository and then initialize DVC and Git in our working folder. Catch a glance at the below-given image to understand more.

[Image: DagsHub dashboard]

Once you log in to the DagsHub dashboard, click on the “+Create” button located in the top right corner.

When you click on it, choose “new repository” from the dropdown menu. Once you click on it, you will have the below-given window on your screen.

[Image: New Repository window in the DagsHub dashboard]

Start by adding the repository name and your remote repository is all set. Next, we will initialize git on the working directory.

git init

git remote add origin https://dagshub.com/srishti.chaudhary/dagshub-tutorial.git

Next, we initialize DVC to configure DagsHub as DVC remote storage with a few additional steps for the purpose of experiment tracking.

pip install dvc
dvc init
dvc remote add origin https://dagshub.com/srishti.chaudhary/dagshub-tutorial.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user srishti.chaudhary
dvc remote modify origin --local password your_token

Once done, we push the experiment files to Git, followed by adding the data and model files to DVC remote storage on DagsHub. You will also find .gitignore and .dvc files in the models and data folders; these should also be pushed to Git.
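For reference, a .dvc file is just a small YAML pointer that Git tracks in place of the real data. Its contents look roughly like the following sketch (the hash and size here are illustrative placeholders, not real values):

```yaml
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 5107
  path: iris.csv
```

Git versions this pointer, while DVC uses the hash to fetch the matching data from remote storage.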

After completing the above tasks, you can view the files in the repository. Next, click on the experiments tab where you can find our experiment as the first entry using the random forest classifier. You can run as many experiments as you want since there is no limit to them.
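To get several entries in the experiments tab, you would call the article's model_run function once per classifier. The idea is shown below as a self-contained sketch, without the DagsHub logger and using sklearn's built-in copy of the Iris data so it runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)

# one "experiment" per classifier, mirroring repeated model_run calls
results = {}
for model_type in (RandomForestClassifier, LogisticRegression, SVC):
    model = model_type(random_state=42).fit(X_train, y_train)
    results[model_type.__name__] = round(
        accuracy_score(y_test, model.predict(X_test)), 3
    )

print(results)
```

With the logger in place, each iteration would overwrite metrics.csv and params.yml, and committing after each run gives DagsHub one experiment entry per classifier to compare.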

Conclusion

This article looked at DAGsHub and DVC in a nutshell. We also saw how we can compare the results of different ML experiments using DAGsHub. You can explore all the features of this remarkable version control platform by playing around with it.

Author

  • Author

    Srishti Chaudhary

Srishti is a competent content writer and marketer with expertise in niches like cloud tech, big data, web development, and digital marketing. She looks forward to growing her tech knowledge and skills.

Frequently Asked Questions

What is DagsHub?
DagsHub is a community-first platform that is built on top of open-source tools for machine learning.

What does the dvc checkout command do?
The dvc checkout command restores the corresponding versions of DVC-tracked files and directories from the cache to the workspace. It limits what needs to be checked out.

What is model versioning in machine learning?
Model versioning in machine learning is the process of tracking and managing the changes made to a model that has already been built.

How do you clone a repository?
You can use two options to clone a repository: DVC and Git.

What is DVC?
DVC, or Data Version Control, is an experiment management tool for data science and machine learning that makes ML models reproducible and shareable.

How do you get started with DagsHub storage?
Start by copying the DVC link, which you can find on the repo's homepage. Next, add it to your local project as a remote. This will get you started using DagsHub storage.
