Feature Engineering in Machine Learning With Python: A Guide

Traditional machine learning is used for a wide range of purposes in many businesses today. The process usually involves gathering and storing data, preparing it, and using it to train machine-learning models. These models make regression, classification, or clustering predictions.
Data is also used to build complex deep-learning models focused on Natural Language Processing (NLP) or image processing. However, the accuracy of these models depends on the quality of the data used.
In this article, we will dive into the concept of feature engineering and explore how it helps improve model performance and accuracy. Feature engineering involves transforming raw data to provide more valuable signals to our machine-learning models. This article will show how to carry out feature engineering techniques using the Python programming language.
What is feature engineering?
Feature engineering is the process of transforming the features in a dataset to expose useful patterns, provide insight, and improve understanding of the data. This, in turn, improves the accuracy of a model trained on that data.
Features are the individual properties or characteristics of a record in a dataset. Engineering them, through manipulation, cleaning, and transformation, provides a better understanding of the dataset as a whole.
Data analysts and/or data scientists carry out the feature engineering process. It is necessary because raw data is rarely useful for building machine learning models on its own: it often contains missing values, inconsistencies, or irrelevant information.
Importance of feature engineering
Feature engineering is important in traditional machine learning. Its main benefits are the following:
1. Enhanced model performance with well-engineered features: When feature engineering techniques are applied to the features in a dataset, machine learning models are given reliable data that enables them to deliver better accuracy and results.
2. Improved data representation and pattern extraction: Properly engineered or transformed features provide reliable and detailed insights into data. This also aids data scientists or analysts in drawing out valuable conclusions from it.
3. Dimensionality reduction and prevention of overfitting: Dimensionality reduction involves removing or filtering out unhelpful or irrelevant features, which in turn yields better model performance, especially on high-dimensional data. It also reduces the chance of the model overfitting.
4. Handling missing data effectively: Feature engineering involves methods in which missing data are handled without harming model performance.
5. Incorporating domain knowledge into the model: Applying feature engineering techniques allows us to include domain knowledge by selecting useful features and removing irrelevant ones from the dataset before training the machine learning model.
Feature engineering techniques in Python
In this section, we will look into some feature engineering techniques in Python, what they do, and their uses.
Handling missing data
Data is gathered in a raw format, and much of it is unstructured. Such data often contains missing values, and machine learning models don't perform well when trained on data that contains them. There are several ways of handling missing values in a dataset.
Dropping or removing all records that contain missing values is one of those ways, but it leads to data loss, so it is generally not advisable. Let's look at other ways of handling missing values that don't carry the risk of data loss.
Let's consider a dataset that contains information about students, including their age, test scores, and grades. We will intentionally introduce some missing values in the dataset to demonstrate how to handle them using different techniques.
- Mean/Median/Mode imputation: This involves taking the mean, median, or mode of all the other values in a feature and filling the empty or missing entries with the derived value.
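As a minimal sketch of this technique, the snippet below builds a small, made-up student dataset with gaps in the 'Age' and 'TestScore' columns (all values are purely illustrative) and fills them in with SimpleImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical student data with a few missing values
df = pd.DataFrame({
    'Age': [15, 16, np.nan, 17, 15],
    'TestScore': [70, np.nan, 80, 90, np.nan],
    'Grade': ['B', 'C', 'B', 'A', 'C'],
})

# Replace each missing value with the column mean
# (strategy='median' or 'most_frequent' work the same way)
imputer = SimpleImputer(strategy='mean')
df[['Age', 'TestScore']] = imputer.fit_transform(df[['Age', 'TestScore']])
print(df)
```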
We used the SimpleImputer from Scikit-learn to fill in missing values in the 'Age' and 'TestScore' columns with their respective mean or median values. We can also use the most frequent value for categorical data. In this example, we used mean imputation for simplicity.
- Forward fill/Backward fill: The forward fill / backward fill method fills a missing value with the previous known value (forward fill) or with the next known value that follows it (backward fill).
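A minimal sketch, again on made-up student data; it uses the ffill() and bfill() shortcuts, which are equivalent to calling fillna() with the 'ffill' and 'bfill' options:

```python
import numpy as np
import pandas as pd

# Hypothetical student data with missing values
df = pd.DataFrame({
    'Age': [15, np.nan, 16, np.nan, 17],
    'TestScore': [70, 75, np.nan, 85, np.nan],
})

# Forward fill: copy the last known value downwards
df_ffill = df.ffill()

# Backward fill: copy the next known value upwards
df_bfill = df.bfill()

print(df_ffill)
print(df_bfill)
```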
We used the fillna() method with 'ffill' and 'bfill' methods to propagate the previous or next valid value to fill the missing values in the 'Age' and 'TestScore' columns.
- Interpolation: This is the process of estimating and filling in missing values using techniques like linear or polynomial interpolation.
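A short sketch on made-up student data, using linear interpolation (a polynomial fit can be requested via the method and order arguments):

```python
import numpy as np
import pandas as pd

# Hypothetical student data with missing values
df = pd.DataFrame({
    'Age': [15, np.nan, 16, np.nan, 17],
    'TestScore': [60, 65, np.nan, 75, 80],
})

# Linear interpolation estimates each gap from its neighbours;
# e.g. method='polynomial', order=2 would fit a curve instead
df_interp = df.interpolate(method='linear')
print(df_interp)
```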
We used the interpolate() method to fill in the missing values, creating a smooth progression between the existing data points.
- K-nearest neighbors imputation: Missing values are filled in by applying the KNN (k-nearest neighbors) algorithm to the other features of the dataset and using the values of the most similar records.
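The sketch below assumes a made-up student dataset and wraps KNeighborsRegressor in a small hypothetical helper, knn_impute(), that predicts one column's missing values from the other column; scikit-learn's KNNImputer is a ready-made alternative:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical student data with gaps in 'Age' and 'TestScore'
df = pd.DataFrame({
    'Age': [15, 16, np.nan, 17, 15, 16],
    'TestScore': [70, 75, 80, np.nan, 72, 78],
})

def knn_impute(data, target, predictors, k=2):
    """Fill missing `target` values with KNN predictions from `predictors`."""
    known = data[data[target].notna() & data[predictors].notna().all(axis=1)]
    missing = data[data[target].isna() & data[predictors].notna().all(axis=1)]
    if missing.empty:
        return data
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(known[predictors], known[target])
    data.loc[missing.index, target] = model.predict(missing[predictors])
    return data

df = knn_impute(df, 'Age', ['TestScore'])
df = knn_impute(df, 'TestScore', ['Age'])
print(df)
```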
We used the KNeighborsRegressor from Scikit-learn to predict the missing values based on the k-nearest neighbors of the missing data points in the 'Age' and 'TestScore' columns.
Handling categorical data
Most machine learning algorithms and models only work with numerical or boolean data, so strings or categorical values must be converted into a numerical format. The conversion is done using encoding techniques.
Let's consider a dataset containing information about fruits, including their type and color. We'll explore three techniques for handling categorical data: One-Hot Encoding, Label Encoding, and Target Encoding.
- One-hot encoding: Each categorical variable is converted into binary (0 and 1) vectors, with every category added to the dataset as a separate feature.
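A minimal sketch on a made-up fruit dataset (note that the sparse_output argument was named sparse before scikit-learn 1.2):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical fruit data
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Banana'],
    'Color': ['Red', 'Yellow', 'Red', 'Green'],
})

# drop='first' drops one category per feature to avoid multicollinearity;
# sparse_output=False returns a dense array instead of a sparse matrix
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded = encoder.fit_transform(df[['Fruit', 'Color']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
print(encoded_df)
```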
We used the OneHotEncoder from Scikit-learn to convert categorical features into binary vectors, where each category converts into a separate binary column. We dropped the first category to avoid multicollinearity issues. One-Hot Encoding is suitable when the categorical features do not have a natural order.
- Label encoding: The label encoding technique assigns an integer value to each category of a categorical variable.
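A minimal sketch on the same kind of made-up fruit data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical fruit data
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Banana'],
    'Color': ['Red', 'Yellow', 'Red', 'Green'],
})

# Assign an integer to each distinct category, column by column
for column in ['Fruit', 'Color']:
    df[column + '_encoded'] = LabelEncoder().fit_transform(df[column])
print(df)
```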
We used the LabelEncoder from Scikit-learn to transform each category in the 'Fruit' and 'Color' columns into numerical values. Label Encoding is useful when the categorical features have an ordinal relationship.
- Target encoding: This encoding scheme assigns the mean or median of the target variable to each category.
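The sketch below adds a made-up numeric target ('Price') to the fruit data purely for illustration, and uses the third-party category_encoders package (installed separately with pip):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

# Hypothetical fruit data with a made-up numeric target ('Price')
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Banana', 'Apple'],
    'Color': ['Red', 'Yellow', 'Red', 'Green', 'Red'],
    'Price': [1.0, 0.5, 3.0, 0.6, 1.2],
})

# Replace each category with a (smoothed) mean of the target for that category
encoder = ce.TargetEncoder(cols=['Fruit', 'Color'])
encoded_df = encoder.fit_transform(df[['Fruit', 'Color']], df['Price'])
print(encoded_df)
```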
We used the TargetEncoder from the category_encoders library to encode categorical features by replacing each category with the mean of the target variable for that category. Target Encoding is helpful when dealing with high-cardinality categorical variables.
Feature scaling
Feature scaling is a feature engineering method that transforms features into floating-point values within a common range, usually between 0 and 1. Because the features lie within the same range, none of them dominates the others.
Let's consider a dataset that contains information about students, including their age, test scores, and grades. We will demonstrate two feature scaling techniques: Min-Max Scaling (Normalization) and Standardization.
- Min-Max Scaling (Normalization): Min-max is a feature scaling technique that normalizes features in a dataset between a minimum and maximum value.
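A minimal sketch on made-up student data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical student data
df = pd.DataFrame({
    'Age': [15, 16, 17, 18, 15],
    'TestScore': [60, 75, 80, 95, 70],
})

# Rescale each column to the [0, 1] range
scaler = MinMaxScaler()
df[['Age', 'TestScore']] = scaler.fit_transform(df[['Age', 'TestScore']])
print(df)
```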
We used the MinMaxScaler from Scikit-learn to scale the features to a specified range (usually [0, 1]). This transformation preserves the original distribution of the data and is suitable when the data has a bounded range.
- Standardization: Standardization rescales the features in a dataset using the mean and standard deviation, so that each feature ends up with a mean of 0 and a standard deviation of 1.
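A minimal sketch on the same made-up student data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical student data
df = pd.DataFrame({
    'Age': [15, 16, 17, 18, 15],
    'TestScore': [60, 75, 80, 95, 70],
})

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
df[['Age', 'TestScore']] = scaler.fit_transform(df[['Age', 'TestScore']])
print(df)
```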
We used the StandardScaler from Scikit-learn to scale the features to have a mean of 0 and a standard deviation of 1. This technique is useful for algorithms that assume zero-centered, roughly normally distributed features, and unlike min-max scaling it does not force the values into a fixed range.
Creating polynomial features
Creating polynomial features is another method of feature engineering: existing features are raised to powers (and combined with one another) to create new polynomial features.
Let's consider a dataset containing information about houses, including the area and their corresponding sale prices. We will demonstrate how to create polynomial features to capture non-linear relationships between the house area and sale prices.
- Polynomial features in Python: These are features created from existing features in a dataset by raising them to the power n, where n is the chosen degree of the polynomial.
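The sketch below uses made-up 'Area' and 'SalePrice' figures; it expands the area into quadratic features and fits a linear regression on them:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical house data: area and sale price (illustrative figures only)
df = pd.DataFrame({
    'Area': [50, 80, 120, 160, 200],
    'SalePrice': [110000, 180000, 310000, 470000, 660000],
})

# degree=2 produces a bias column, 'Area' and 'Area' squared
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(df[['Area']])

# Fit a linear model on the expanded (quadratic) features
model = LinearRegression().fit(X_poly, df['SalePrice'])

# Predict the price of a house with an area of 100
print(model.predict(poly.transform(pd.DataFrame({'Area': [100]}))))
```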
In this example, we created a sample dataset with the 'Area' of houses and their corresponding 'SalePrice'. We then used PolynomialFeatures from Scikit-learn to create polynomial features to capture the non-linear relationship between the 'Area' and 'SalePrice'. We chose a degree of 2 (quadratic) to create polynomial features up to the square of the 'Area'.
The polynomial features help capture the non-linear patterns in the data, and we then used linear regression to fit a model to these features. The model can now predict the sale prices of houses based on their areas, accounting for the non-linear relationship.
Feature selection
Feature selection is a feature engineering technique that keeps only the most relevant or influential features in a dataset. It uses algorithms to determine which features have the greatest impact on, or strongest relationship with, the target variable. Training a model on only these selected features can improve the machine learning model's accuracy.
Let's consider a dataset that contains information about students' performance, including their study hours, test scores, grades, and participation in extracurricular activities. We will demonstrate two feature selection techniques: Univariate Feature selection, and L1 Regularization (Lasso).
- Univariate feature selection: Univariate feature selection scores each feature individually against the target variable using a statistical test and keeps only the highest-scoring features.
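A minimal sketch, assuming a made-up student-performance dataset with a hypothetical 'FinalGrade' column as the target:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical student-performance data; 'FinalGrade' is the target
df = pd.DataFrame({
    'StudyHours':      [2, 4, 6, 8, 10, 12],
    'TestScore':       [55, 60, 70, 78, 85, 92],
    'Extracurricular': [1, 0, 1, 0, 1, 0],
    'FinalGrade':      [50, 58, 68, 77, 84, 90],
})
X = df.drop(columns='FinalGrade')
y = df['FinalGrade']

# Keep the k features with the highest univariate F-scores
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)
print(X.columns[selector.get_support()])  # names of the selected features
```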
We used SelectKBest from Scikit-learn to select the top k features based on their relevance to the target variable, using the f_regression score function.
- L1 regularization (Lasso): The Lasso regression algorithm shrinks feature coefficients and can drive the least important ones to zero, effectively removing those features.
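A minimal sketch on the same made-up student-performance data; the alpha value and k are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical student-performance data; 'FinalGrade' is the target
df = pd.DataFrame({
    'StudyHours':      [2, 4, 6, 8, 10, 12],
    'TestScore':       [55, 60, 70, 78, 85, 92],
    'Extracurricular': [1, 0, 1, 0, 1, 0],
    'FinalGrade':      [50, 58, 68, 77, 84, 90],
})
X = df.drop(columns='FinalGrade')
y = df['FinalGrade']

# Standardize so the L1 penalty treats all features comparably
X_scaled = StandardScaler().fit_transform(X)

# A larger alpha shrinks more coefficients towards zero
lasso = Lasso(alpha=0.5).fit(X_scaled, y)
coefs = pd.Series(np.abs(lasso.coef_), index=X.columns)

# Keep the k features with the largest absolute coefficients
k = 2
print(coefs.nlargest(k).index.tolist())
```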
We used Lasso regression, which applies L1 regularization, to penalize features with low importance by driving their coefficients to zero. We selected the top k features with the highest absolute coefficients.
Conclusion
In this article, we discussed what feature engineering is, why it is important when training machine learning models, and how to implement these techniques using the Python programming language.
Feature engineering is a great skill to acquire as a data scientist or a machine learning engineer. Beyond the techniques listed in this article, other, more advanced techniques are used when dealing with computer vision, Natural Language Processing (NLP), or time series data.

Author
Ezeana Michael
Ezeana Michael is a data scientist with a passion for machine learning and technical writing. He has worked in the field of data science and has experience working with Python programming to derive insight from data, create machine learning models, and deploy them into production environments.