Feature Engineering in Machine Learning With Python: A Guide

Traditional machine learning is used for a wide range of purposes in many businesses today. The process usually involves gathering and storing data, preparing it, and using it to train machine-learning models. These models make regression, classification, or clustering predictions.
Data is also used to build complex deep-learning models focused on Natural Language Processing (NLP) or image processing. However, the accuracy of these models depends on the quality of the data used.
In this article, we will dive into the concept of feature engineering and explore how it helps improve model performance and accuracy. Feature engineering involves transforming raw data to provide more valuable signals to our machine-learning models. This article will show how to carry out feature engineering techniques using the Python programming language.
What is feature engineering?
Feature engineering is the process of transforming the features in a dataset to expose useful patterns, provide insight, and improve understanding of the data. This, in turn, improves the accuracy of a model trained on that data.
Features are the individual properties or characteristics of a record in a dataset. Engineering them, through manipulation, cleaning, and transformation, provides a better understanding of the dataset as a whole.
Data analysts and/or data scientists carry out the feature engineering process. It is necessary because raw data is rarely useful for building machine learning models on its own: it often contains missing values, inconsistencies, or irrelevant information.
Importance of feature engineering
Feature engineering is important in traditional machine learning. Its main benefits are the following:
1. Enhanced model performance with well-engineered features: When feature engineering techniques are applied to the features in a dataset, machine learning models are given reliable data that enables them to deliver better accuracy and results.
2. Improved data representation and pattern extraction: Properly engineered or transformed features provide reliable and detailed insights into data. This also aids data scientists or analysts in drawing out valuable conclusions from it.
3. Dimensionality reduction and prevention of overfitting: Dimensionality reduction involves removing or filtering out unhelpful or irrelevant features, which in turn yields better model performance, especially on high-dimensional data. It also reduces the chance of the model overfitting.
4. Handling missing data effectively: Feature engineering involves methods in which missing data are handled without harming model performance.
5. Incorporating domain knowledge into the model: Applying feature engineering techniques allows us to include domain knowledge by selecting useful features and removing irrelevant ones from the dataset before training the machine learning model.
Feature engineering techniques in Python
In this section, we will look into some feature engineering techniques in Python, what they do, and their uses.
Handling missing data
Data is gathered in a raw format, and much of it is unstructured. Such data often contains missing values, and machine learning models don't perform well when trained on data that contains them. There are several ways of handling missing values in a dataset.
Dropping or removing all records that contain missing values is one of those ways, but it leads to data loss, so it is generally not advisable. Let's look at other ways of handling missing values that don't carry the risk of data loss.
Let's consider a dataset that contains information about students, including their age, test scores, and grades. We will intentionally introduce some missing values in the dataset to demonstrate how to handle them using different techniques.
- Mean/Median/Mode imputation: This involves taking the mean, median, or mode of all the other values in a feature and filling the empty or missing entries with the derived value.
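As a minimal sketch of this technique, the snippet below builds a small, made-up student dataset with gaps in the 'Age' and 'TestScore' columns (all values are purely illustrative) and fills them in with SimpleImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical student data with a few missing values
df = pd.DataFrame({
    'Age': [15, 16, np.nan, 17, 15],
    'TestScore': [70, np.nan, 80, 90, np.nan],
    'Grade': ['B', 'C', 'B', 'A', 'C'],
})

# Replace each missing value with the column mean
# (strategy='median' or 'most_frequent' work the same way)
imputer = SimpleImputer(strategy='mean')
df[['Age', 'TestScore']] = imputer.fit_transform(df[['Age', 'TestScore']])
print(df)
```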
We used the SimpleImputer from Scikit-learn to fill in missing values in the 'Age' and 'TestScore' columns with their respective mean or median values. We can also use the most frequent value for categorical data. In this example, we used mean imputation for simplicity.
- Forward fill/Backward fill: The forward fill / backward fill method fills a missing value with the previous known value (forward fill) or with the next known value that follows it (backward fill).
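A minimal sketch, again on made-up student data; it uses the ffill() and bfill() shortcuts, which are equivalent to calling fillna() with the 'ffill' and 'bfill' options:

```python
import numpy as np
import pandas as pd

# Hypothetical student data with missing values
df = pd.DataFrame({
    'Age': [15, np.nan, 16, np.nan, 17],
    'TestScore': [70, 75, np.nan, 85, np.nan],
})

# Forward fill: copy the last known value downwards
df_ffill = df.ffill()

# Backward fill: copy the next known value upwards
df_bfill = df.bfill()

print(df_ffill)
print(df_bfill)
```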
We used the fillna() method with 'ffill' and 'bfill' methods to propagate the previous or next valid value to fill the missing values in the 'Age' and 'TestScore' columns.
- Interpolation: This is the process of estimating and filling in missing values using techniques like linear or polynomial interpolation.
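A short sketch on made-up student data, using linear interpolation (a polynomial fit can be requested via the method and order arguments):

```python
import numpy as np
import pandas as pd

# Hypothetical student data with missing values
df = pd.DataFrame({
    'Age': [15, np.nan, 16, np.nan, 17],
    'TestScore': [60, 65, np.nan, 75, 80],
})

# Linear interpolation estimates each gap from its neighbours;
# e.g. method='polynomial', order=2 would fit a curve instead
df_interp = df.interpolate(method='linear')
print(df_interp)
```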
We used the interpolate() method to fill in the missing values, creating a smooth progression between the existing data points.
- K-nearest neighbors imputation: Missing values are filled in by applying the KNN (k-nearest neighbors) algorithm to the other features of the dataset and using the values of the most similar records.
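The sketch below assumes a made-up student dataset and wraps KNeighborsRegressor in a small hypothetical helper, knn_impute(), that predicts one column's missing values from the other column; scikit-learn's KNNImputer is a ready-made alternative:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical student data with gaps in 'Age' and 'TestScore'
df = pd.DataFrame({
    'Age': [15, 16, np.nan, 17, 15, 16],
    'TestScore': [70, 75, 80, np.nan, 72, 78],
})

def knn_impute(data, target, predictors, k=2):
    """Fill missing `target` values with KNN predictions from `predictors`."""
    known = data[data[target].notna() & data[predictors].notna().all(axis=1)]
    missing = data[data[target].isna() & data[predictors].notna().all(axis=1)]
    if missing.empty:
        return data
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(known[predictors], known[target])
    data.loc[missing.index, target] = model.predict(missing[predictors])
    return data

df = knn_impute(df, 'Age', ['TestScore'])
df = knn_impute(df, 'TestScore', ['Age'])
print(df)
```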
We used the KNeighborsRegressor from Scikit-learn to predict the missing values based on the k-nearest neighbors of the missing data points in the 'Age' and 'TestScore' columns.
Handling categorical data
Most machine learning algorithms and models only work with numerical or boolean data, so strings or categorical values must be converted into a numerical format. The conversion is done using encoding techniques.
Let's consider a dataset containing information about fruits, including their type and color. We'll explore three techniques for handling categorical data: One-Hot Encoding, Label Encoding, and Target Encoding.
- One-hot encoding: Each categorical variable is converted into binary (0 and 1) vectors, with every category added to the dataset as a separate feature.
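A minimal sketch on a made-up fruit dataset (note that the sparse_output argument was named sparse before scikit-learn 1.2):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical fruit data
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Banana'],
    'Color': ['Red', 'Yellow', 'Red', 'Green'],
})

# drop='first' drops one category per feature to avoid multicollinearity;
# sparse_output=False returns a dense array instead of a sparse matrix
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded = encoder.fit_transform(df[['Fruit', 'Color']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
print(encoded_df)
```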
We used the OneHotEncoder from Scikit-learn to convert categorical features into binary vectors, where each category converts into a separate binary column. We dropped the first category to avoid multicollinearity issues. One-Hot Encoding is suitable when the categorical features do not have a natural order.
- Label encoding: The label encoding technique assigns an integer value to each category of a categorical variable.
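A minimal sketch on the same kind of made-up fruit data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical fruit data
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Banana'],
    'Color': ['Red', 'Yellow', 'Red', 'Green'],
})

# Assign an integer to each distinct category, column by column
for column in ['Fruit', 'Color']:
    df[column + '_encoded'] = LabelEncoder().fit_transform(df[column])
print(df)
```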
We used the LabelEncoder from Scikit-learn to transform each category in the 'Fruit' and 'Color' columns into numerical values. Label Encoding is useful when the categorical features have an ordinal relationship.
- Target encoding: This encoding scheme assigns the mean or median of the target variable to each category.
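The sketch below adds a made-up numeric target ('Price') to the fruit data purely for illustration, and uses the third-party category_encoders package (installed separately with pip):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

# Hypothetical fruit data with a made-up numeric target ('Price')
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Banana', 'Apple'],
    'Color': ['Red', 'Yellow', 'Red', 'Green', 'Red'],
    'Price': [1.0, 0.5, 3.0, 0.6, 1.2],
})

# Replace each category with a (smoothed) mean of the target for that category
encoder = ce.TargetEncoder(cols=['Fruit', 'Color'])
encoded_df = encoder.fit_transform(df[['Fruit', 'Color']], df['Price'])
print(encoded_df)
```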
We used the TargetEncoder from the category_encoders library to encode categorical features by replacing each category with the mean of the target variable for that category. Target Encoding is helpful when dealing with high-cardinality categorical variables.
Feature scaling
Feature scaling is a feature engineering method that transforms features into floating-point values within a common range, usually between 0 and 1. Because the features lie within the same range, none of them dominates the others.
Let's consider a dataset that contains information about students, including their age, test scores, and grades. We will demonstrate two feature scaling techniques: Min-Max Scaling (Normalization) and Standardization.
- Min-Max Scaling (Normalization): Min-max is a feature scaling technique that normalizes features in a dataset between a minimum and maximum value.
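A minimal sketch on made-up student data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical student data
df = pd.DataFrame({
    'Age': [15, 16, 17, 18, 15],
    'TestScore': [60, 75, 80, 95, 70],
})

# Rescale each column to the [0, 1] range
scaler = MinMaxScaler()
df[['Age', 'TestScore']] = scaler.fit_transform(df[['Age', 'TestScore']])
print(df)
```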
We used the MinMaxScaler from Scikit-learn to scale the features to a specified range (usually [0, 1]). This transformation preserves the original distribution of the data and is suitable when the data has a bounded range.
- Standardization: Standardization rescales the features in a dataset using the mean and standard deviation, so that each feature ends up with a mean of 0 and a standard deviation of 1.
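A minimal sketch on the same made-up student data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical student data
df = pd.DataFrame({
    'Age': [15, 16, 17, 18, 15],
    'TestScore': [60, 75, 80, 95, 70],
})

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
df[['Age', 'TestScore']] = scaler.fit_transform(df[['Age', 'TestScore']])
print(df)
```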
We used the StandardScaler from Scikit-learn to scale the features to have a mean of 0 and a standard deviation of 1. This technique is useful for algorithms that assume zero-centered, roughly normally distributed features, and unlike min-max scaling it does not force the values into a fixed range.
Creating polynomial features
Creating polynomial features is another method of feature engineering: existing features are raised to powers (and combined with one another) to create new polynomial features.
Let's consider a dataset containing information about houses, including the area and their corresponding sale prices. We will demonstrate how to create polynomial features to capture non-linear relationships between the house area and sale prices.
- Polynomial features in Python: These are features created from existing features in a dataset by raising them to the power n, where n is the chosen degree of the polynomial.
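The sketch below uses made-up 'Area' and 'SalePrice' figures; it expands the area into quadratic features and fits a linear regression on them:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical house data: area and sale price (illustrative figures only)
df = pd.DataFrame({
    'Area': [50, 80, 120, 160, 200],
    'SalePrice': [110000, 180000, 310000, 470000, 660000],
})

# degree=2 produces a bias column, 'Area' and 'Area' squared
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(df[['Area']])

# Fit a linear model on the expanded (quadratic) features
model = LinearRegression().fit(X_poly, df['SalePrice'])

# Predict the price of a house with an area of 100
print(model.predict(poly.transform(pd.DataFrame({'Area': [100]}))))
```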
In this example, we created a sample dataset with the 'Area' of houses and their corresponding 'SalePrice'. We then used PolynomialFeatures from Scikit-learn to create polynomial features to capture the non-linear relationship between the 'Area' and 'SalePrice'. We chose a degree of 2 (quadratic) to create polynomial features up to the square of the 'Area'.
The polynomial features help capture the non-linear patterns in the data, and we then used linear regression to fit a model to these features. The model can now predict the sale prices of houses based on their areas, accounting for the non-linear relationship.
Feature selection
Feature selection is a feature engineering technique that keeps only the most relevant or influential features in a dataset. It uses algorithms to determine which features have the greatest impact on, or strongest relationship with, the target variable. Training a model on only these selected features can improve the machine learning model's accuracy.
Let's consider a dataset that contains information about students' performance, including their study hours, test scores, grades, and participation in extracurricular activities. We will demonstrate two feature selection techniques: Univariate Feature selection, and L1 Regularization (Lasso).
- Univariate feature selection: Univariate feature selection scores each feature individually against the target variable using a statistical test and keeps only the highest-scoring features.
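A minimal sketch, assuming a made-up student-performance dataset with a hypothetical 'FinalGrade' column as the target:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical student-performance data; 'FinalGrade' is the target
df = pd.DataFrame({
    'StudyHours':      [2, 4, 6, 8, 10, 12],
    'TestScore':       [55, 60, 70, 78, 85, 92],
    'Extracurricular': [1, 0, 1, 0, 1, 0],
    'FinalGrade':      [50, 58, 68, 77, 84, 90],
})
X = df.drop(columns='FinalGrade')
y = df['FinalGrade']

# Keep the k features with the highest univariate F-scores
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)
print(X.columns[selector.get_support()])  # names of the selected features
```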
We used SelectKBest from Scikit-learn to select the top k features based on their relevance to the target variable, using the f_regression score function.
- L1 regularization (Lasso): The Lasso regression algorithm shrinks feature coefficients and can drive the least important ones to zero, effectively removing those features.
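A minimal sketch on the same made-up student-performance data; the alpha value and k are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical student-performance data; 'FinalGrade' is the target
df = pd.DataFrame({
    'StudyHours':      [2, 4, 6, 8, 10, 12],
    'TestScore':       [55, 60, 70, 78, 85, 92],
    'Extracurricular': [1, 0, 1, 0, 1, 0],
    'FinalGrade':      [50, 58, 68, 77, 84, 90],
})
X = df.drop(columns='FinalGrade')
y = df['FinalGrade']

# Standardize so the L1 penalty treats all features comparably
X_scaled = StandardScaler().fit_transform(X)

# A larger alpha shrinks more coefficients towards zero
lasso = Lasso(alpha=0.5).fit(X_scaled, y)
coefs = pd.Series(np.abs(lasso.coef_), index=X.columns)

# Keep the k features with the largest absolute coefficients
k = 2
print(coefs.nlargest(k).index.tolist())
```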
We used Lasso regression, which applies L1 regularization, to penalize features with low importance by driving their coefficients to zero. We selected the top k features with the highest absolute coefficients.
Conclusion
In this article, we discussed what feature engineering is, why it is important when training machine learning models, and how to implement these techniques using the Python programming language.
Feature engineering is a great skill to acquire as a data scientist or a machine learning engineer. Beyond the techniques listed in this article, other, more advanced techniques are used when dealing with computer vision, Natural Language Processing (NLP), or time series data.

Author
Ezeana Michael
Ezeana Michael is a data scientist with a passion for machine learning and technical writing. He has worked in the field of data science and has experience working with Python programming to derive insight from data, create machine learning models, and deploy them into production environments.