Before analyzing data or feeding it to machine learning algorithms, you should always make sure that the data is clean and suitable. It’s equally vital to understand recurring patterns and correlations that may be present in the data. This article examines this process, known as exploratory data analysis (EDA), in Python.
Pandas is a widely used data manipulation library for Python. It’s mainly used for preprocessing data, and it also helps with exploring the data and storing it after preprocessing.
Exploratory data analysis is a process of data analytics that seeks to understand data in depth and learn its different characteristics, often using a graphical approach. It allows you to get a better understanding of your data and find useful patterns.
It’s very important to understand data in detail before analyzing it or running it through any algorithm. Patterns must be recognized and the importance of variables determined. Likewise, you need to know the variables that don’t play a significant part in determining the output. Some will also have correlations with other variables. Further, errors must be identified in said data.
Example: When organizing a trip, there are many aspects that need to be sorted through and planned. Locations need to be explored, budgets calculated, timelines set, and travel options checked.
Similarly, when building a machine learning algorithm, you need to be sure that the data makes sense. The main purpose of EDA is to get the data clean and well understood enough that you are ready to apply your machine learning algorithm.
The main reason for EDA is to detect errors and understand various patterns in data. It allows analysts to understand the data better before making any assumptions. The output of EDA helps businesses know their customers better, expand their operations, and make better decisions.
EDA enables you to know whether the selected features are suitable for the model, whether they are required for it, and whether they correlate with one another. Based on this, you either perform further data preprocessing or move ahead with modeling.
Once the EDA is complete and insights are ready, the features can be used to train machine learning models. In each workflow, the last step is to report insights to an analyst. And, while a data scientist can explain each piece of code, you still need to keep the audience in mind.
EDA yields different plots, graphs, frequency distributions, and correlation matrices with hypotheses. They enable you to understand what the data is about and what insights can be derived from exploring the dataset.
The EDA process comprises many steps as explained in detail below:
A top priority is understanding the different types of data and their characteristics. A smart first step is to begin with the describe() function. With Pandas, you can call describe() on a DataFrame to generate descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset’s distribution, excluding NaN values.
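    import pandas as pd
    from sklearn.datasets import load_diabetes

    # Note: load_turkey is not a scikit-learn loader. load_diabetes, a small
    # bundled dataset, is swapped in here (an assumption) so the example runs.
    diabetes = load_diabetes()
    X = diabetes.data
    y = diabetes.target
    columns = diabetes.feature_names

    # Create a DataFrame with named columns
    diabetes_df = pd.DataFrame(X, columns=columns)

    # Summary statistics (count, mean, std, min, quartiles, max) per column
    print(diabetes_df.describe())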
It’s rarely possible to know in advance how homogeneous and clean your data will be, as values may go missing during collection for any number of reasons. Such missing data needs to be handled carefully as it degrades the model’s performance metrics. It can lead to wrong predictions and introduce bias into the model you use.
Many options are available for handling missing data. The choice you make depends on the nature of the data and the values that are missing. Here are some techniques that can help:
Fill in the missing values
The most common technique is to replace the missing values with a summary statistic, such as the mean, median, or mode, of the feature to which the value belongs.
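As a minimal sketch of this technique, the snippet below fills a numeric column with its mean and a categorical column with its mode. The DataFrame and its column names (age, city) are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric gap and one categorical gap.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Paris", "Rome", None, "Rome"],
})

# Numeric column: fill with the column mean (missing values are skipped
# when the mean is computed).
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: fill with the most frequent value (the mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

The median is often a safer choice than the mean when the column contains outliers, since a few extreme values can pull the mean away from the typical value.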
Drop NULL or a missing value
The fastest and easiest technique is to drop the rows that contain NULL or missing values. However, it’s not usually recommended as it reduces the sample size and, with it, the quality of the model. This is because it works by deleting every observation in which any variable is missing.
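The effect on sample size can be seen in a short sketch using the Pandas dropna() method. The data here is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four rows have a missing value.
df = pd.DataFrame({
    "height": [1.70, np.nan, 1.82, 1.65],
    "weight": [65.0, 80.0, np.nan, 58.0],
})

# Drop every row that has at least one missing value.
cleaned = df.dropna()

print(f"rows before: {len(df)}, rows after: {len(cleaned)}")
```

Here half of the observations are lost even though each dropped row was missing only a single value, which is why this approach is usually reserved for datasets with few gaps.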
Predict the missing values using an ML algorithm
This is often the most effective technique for handling missing data, although it takes more work than the other two. You can use either a classification or a regression model to predict the missing value, depending on whether the feature is categorical or numerical.
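One possible way to do this, sketched below, is to train a regression model on the rows where the value is known and use it to predict the rows where it is missing. The choice of scikit-learn's LinearRegression and the column names (years, salary) are assumptions made for illustration; any suitable model could take its place.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data in which salary grows linearly with years of experience.
df = pd.DataFrame({
    "years": [1.0, 2.0, 3.0, 4.0, 5.0],
    "salary": [31000.0, 32000.0, np.nan, 34000.0, 35000.0],
})

# Split into rows where the target is known and rows where it is missing.
known = df[df["salary"].notna()]
missing = df[df["salary"].isna()]

# Fit the model on the known rows only.
model = LinearRegression()
model.fit(known[["years"]], known["salary"])

# Predict the missing values and write them back into the DataFrame.
df.loc[missing.index, "salary"] = model.predict(missing[["years"]])

print(df)
```

For a categorical feature, a classifier would replace the regressor in the same pattern.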
An outlier separates or differentiates itself from the crowd. Sometimes, an outlier can be the outcome of a mistake that occurs during data collection or an indication of differences in the data. Below are some techniques that will help in detecting and handling outliers:
Scatterplot: A scatterplot is a mathematical diagram that uses Cartesian coordinates to display the values of two variables in a dataset. The data is displayed as a collection of points, where the value of one variable determines each point’s position on the horizontal axis and the value of the other determines its position on the vertical axis.
Interquartile range: The IQR is a measure of statistical dispersion, and is equal to the difference between the upper and lower quartiles. IQR = Q3 - Q1.
Boxplot: A boxplot is a technique that graphically depicts a group of numerical data with its quartiles. The box will extend from Q1 to Q3 data points with a line at median Q2.
Z-score: The Z-score is the signed number of standard deviations by which the value of an observation or data point lies above or below the mean. When calculating the Z-score, you center and rescale the data and then look for data points that are far from zero.
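The IQR and Z-score techniques above can be sketched in a few lines of Pandas. The sample values are hypothetical. The 1.5 × IQR fences are a common convention, and the Z-score cutoff of 2.5 is an assumption chosen for this small sample (cutoffs between 2 and 3 are typical).

```python
import pandas as pd

# Hypothetical sample with one obviously extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 95])

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score: center and rescale, then flag points far from zero.
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z_scores.abs() > 2.5]

print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

Note that a single extreme value inflates the standard deviation, so in small samples the Z-score of an outlier can stay below 3; the IQR rule is less sensitive to this effect.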
You can get the unique values in a particular column using the unique() function, which returns the distinct values that are present in the data. To get the number of unique values, use nunique().
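A short sketch with a hypothetical species column shows the difference between the two calls, along with value_counts(), which reports how often each value occurs.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"species": ["cat", "dog", "cat", "bird", "dog", "cat"]})

# Distinct values, in order of first appearance.
unique_values = df["species"].unique()

# Number of distinct values.
unique_count = df["species"].nunique()

# Frequency of each distinct value.
counts = df["species"].value_counts()

print(unique_values)
print(unique_count)
print(counts)
```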
You can visualize the counts of the unique values in the data using the Seaborn library. Call the sns.countplot() function and specify the variable whose values you want to count. Although EDA has two different approaches, graphical and non-graphical, a blend of both should give you a bigger picture.
Understanding and knowing the data types that you explore is a crucial yet easy process. You can use the dtypes attribute of a DataFrame for this and see each column and its data type.
You can filter the data with a Boolean condition that expresses the logic you need, and then use the head function to preview the first rows of the result.
You can create a boxplot for any numerical column with a single line of code using the boxplot function.
The corr function lets you find the correlation among variables. It gives a fair idea of the strength of the relationship between each pair of variables. The values in the correlation matrix range from -1 to +1: +1 is a perfect positive correlation, -1 is a perfect negative correlation, and 0 indicates no linear correlation. You can also visualize the correlation matrix using the Seaborn library.
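The following is a minimal sketch of a count plot, assuming Seaborn and Matplotlib are installed. The species data and the output file name are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical categorical column.
df = pd.DataFrame({"species": ["cat", "dog", "cat", "bird", "dog", "cat"]})

# One bar per distinct value; bar height is the number of occurrences.
fig, ax = plt.subplots()
sns.countplot(data=df, x="species", ax=ax)
ax.set_title("Observations per species")

fig.savefig("species_counts.png")
```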
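As an illustration with hypothetical columns, the sketch below prints the type of every column and then uses select_dtypes() to pick out the numeric ones.

```python
import pandas as pd

# Hypothetical data with text, integer, float, and Boolean columns.
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "age": [31, 45],
    "height_m": [1.68, 1.80],
    "member": [True, False],
})

# One entry per column, giving its data type.
print(df.dtypes)

# Names of the numeric columns only (Boolean columns are not included).
numeric_columns = df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```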
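A small sketch of this pattern follows. The product data and the price threshold are hypothetical.

```python
import pandas as pd

# Hypothetical product data.
df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "price": [5.0, 12.5, 30.0, 8.0, 21.0],
})

# Keep only the rows that satisfy a Boolean condition.
expensive = df[df["price"] > 10]

# Sort the filtered rows and preview the first two with head().
top_two = expensive.sort_values("price", ascending=False).head(2)

print(top_two)
```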
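For example, assuming Matplotlib is installed, the Pandas boxplot() method draws the plot directly from a DataFrame. The score column and the output file name are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical scores with one unusually high value.
df = pd.DataFrame({"score": [52, 55, 57, 58, 60, 61, 63, 65, 98]})

# The boxplot itself is a single call; the rest is figure setup and labeling.
fig, ax = plt.subplots()
df.boxplot(column="score", ax=ax)
ax.set_ylabel("score")

fig.savefig("score_boxplot.png")
```

Points drawn beyond the whiskers are the values that the plot treats as potential outliers.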
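The sketch below computes a correlation matrix for three hypothetical variables, one pair that rises together and one pair that moves in opposite directions.

```python
import pandas as pd

# Hypothetical data: exam_score rises with hours_studied,
# while hours_gaming falls as hours_studied rises.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_gaming": [9, 8, 6, 5, 3],
})

# Pairwise Pearson correlation (the default method) between all columns.
corr_matrix = df.corr()

print(corr_matrix.round(2))
```

The diagonal of the matrix is always 1, since every variable is perfectly correlated with itself.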
You can get different relationships among the variables by visualizing the dataset. Here are some techniques to use:
Heatmap
The heatmap technique shows the distribution of a quantitative variable with all combinations of two categorical characteristics. When one of the two characteristics represents time, the evolution of the variable is easily seen using this map. A gradient color scaling method is used to represent the values.
The correlation between two variables is numeric and runs from -1 to +1. A value near -1 indicates a strong inverse relationship, a value near +1 indicates a strong direct relationship, and a value near 0 indicates no linear relationship.
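A correlation heatmap can be sketched as follows, assuming Seaborn and Matplotlib are installed. The data, the color map, and the output file name are illustrative choices rather than requirements.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data with one positive and one negative relationship.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_gaming": [9, 8, 6, 5, 3],
})

# Color each cell by its correlation value and print the value inside it.
# Fixing vmin and vmax to -1 and +1 keeps the color scale symmetric.
fig, ax = plt.subplots()
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")

fig.savefig("correlation_heatmap.png")
```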
Histogram
A histogram is a tool that helps you quickly assess the distribution of a variable in a form that is easy for anyone to interpret. Python offers different options to build and plot histograms.
Some people don’t appreciate the importance of EDA and tend to skip it in a machine learning project. Doing so is unfortunate, as it can lead to generating accurate models on the wrong data, generating inaccurate models, using resources inefficiently because outliers or missing values were never handled, overlooking inconsistent values, and not creating the right variable types during data preparation.
To return to the trip example from earlier, failing to explore the intended location can lead to problems with directions, travel errors, and unnecessary costs, all of which could be reduced by exploring beforehand. The same principle applies to machine learning problems. EDA is a vital part of the analysis as it lets you learn many things about the data at hand and helps you find answers to important questions.
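One of those options is Matplotlib's hist() function, sketched below on randomly generated data. The sample, the bin count, and the output file name are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample: 1000 draws from a normal distribution.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=50, scale=10, size=1000)

# Group the values into 20 equal-width bins and draw one bar per bin.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(samples, bins=20, edgecolor="black")
ax.set_xlabel("value")
ax.set_ylabel("frequency")

fig.savefig("histogram.png")
```

The number of bins changes how the shape appears, so it is worth trying a few values before drawing conclusions about the distribution.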