Imagine this: you're working for a company, you’ve generated your hypothesis, cleaned your data, created relevant features, and discussed the importance of variables. Now your stakeholders want to see a baseline model within an hour. What will you do?
You have a huge number of data points and many variables in your training set. In a situation like this, the best course would be to use Naive Bayes. It’s a technique for constructing classifiers that is very fast compared to other classification algorithms like logistic regression, support vector classifier, etc.
This article covers the basics of the Naive Bayes algorithm, how it works, and what goes on under the hood.
Naive Bayes belongs to the family of supervised machine learning algorithms that are mainly used for classification. In this context, ‘supervised’ means that the algorithm is trained with both input features and categorical outputs. But why is it called Naive? In basic terms, a Naive Bayes classifier assumes that, within a given class, the presence of a particular feature is unrelated to the presence of any other feature. In other words, the effect of an attribute value on a given class is independent of the values of the other attributes.
The model is easy to use and is especially useful for large datasets. Despite its simplicity, it often performs surprisingly well and can be competitive with far more sophisticated classification methods.
Before getting into the nitty-gritty of this algorithm, it’s important to understand the Bayes theorem and conditional probability as the algorithm works on the principle of the latter.
Here’s an example: a population of 12 people, some of whom love oranges, some grapes, some both, and some neither.
Image source: Author
4 people love only oranges, 3 love only grapes, 2 love both oranges and grapes, and the other 3 people don’t like either fruit. Below is a contingency table for the data we have:
Image source: Author
Now, what if you see a green object on the floor that turns out to be a grape? Suppose this tells you that the next person you meet will definitely love grapes. What is the probability that this person also loves oranges, knowing that they love grapes? In other words, what is the probability that someone loves grapes and oranges, given that you know they love grapes?
This can also be written in mathematical form:
Image source: Author
The vertical line is used to mean “given that”. You can read it as: what is the probability that someone likes grapes and oranges, given that they already love grapes? In statistical terms, this is called conditional probability.
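Since the original figure is not reproduced here, the expression being described can plausibly be written as follows. The event labels are illustrative wording, not the author's exact notation.

```latex
P(\text{grapes and oranges} \mid \text{grapes})
```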
Let’s calculate this probability. The contingency table already gives the probability that someone likes grapes and oranges, but without knowing for a fact that they like grapes. Since it wasn’t known whether they like grapes, the denominator was the total number of people in the population, which is 12. Now, since you know that the person already likes grapes, the population shrinks to only those who like grapes, which is 5.
Image source: Author
Just like before, there are only 2 people who like oranges and grapes so the numerator will be 2. Since you already know they love grapes, the denominator will be 5.
Image source: Author
Before it was known that the person liked grapes, the probability was 2/12 ≈ 0.17. This probability increased from 0.17 to 0.4 after learning that they like grapes. Similarly, you can calculate the probability that someone doesn’t like oranges, given that you know they love grapes.
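The arithmetic above can be checked with a short Python sketch. It assumes the counts of 4 and 3 refer to people who love only one of the two fruits, which is what makes the four groups add up to 12. The variable names are illustrative.

```python
from fractions import Fraction

# Counts from the example (assumption: 4 and 3 are "only oranges" / "only grapes").
only_oranges = 4
only_grapes = 3
both = 2
neither = 3
total = only_oranges + only_grapes + both + neither  # 12 people

# Everyone who loves grapes, whether or not they also love oranges.
grape_lovers = only_grapes + both  # 5 people

# Probability of loving both fruits with no extra information.
p_both = Fraction(both, total)

# Probability of loving both fruits given that the person loves grapes.
p_both_given_grapes = Fraction(both, grape_lovers)

print(f"P(both) = {p_both} = {float(p_both):.2f}")
print(f"P(both | grapes) = {p_both_given_grapes} = {float(p_both_given_grapes):.2f}")
```

Using `Fraction` keeps the values exact until they are printed, so the rounding to two decimals happens only at the end.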
Note that the probability may change if additional information about the problem is provided. This is what is done with machine learning problems. You need to predict something, given that you already know something about it.
Conditional probability can also be written as:
Image source: Author
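The figure is not reproduced here. The standard definition of conditional probability, which is presumably what it showed, is given below together with the grape example worked through the same formula. The symbols A and B are generic event names.

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)},
\qquad
P(\text{grapes and oranges} \mid \text{grapes}) = \frac{2/12}{5/12} = \frac{2}{5} = 0.4
```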
Bayes’ theorem or Bayes’ rule is named after Thomas Bayes. It’s a mathematical rule based on statistics and probability that aims to calculate the probability of one scenario based on its relationship with another scenario.
Consider this scenario: a friend invites you to play and tells you that he’s bringing a friend along. With no other information, there’s a 50% chance the friend is female. Your friend then texts you to ask if you remember Ariana. With this additional information, under Bayes’ theorem, the probability that the friend is female goes up.
Historically, Bayes’ theorem led to significant breakthroughs. It was even used to crack Enigma codes during World War II. Alan Turing, the famous British mathematician, used Bayes’ theorem to determine the German messaging code. He and his team used probability models to break down the almost infinite number of possible translations based on the messages that were most likely to be translatable, ultimately cracking the Enigma code.
Now that the basic concepts are clear, let’s understand this mathematically.
Consider that A and B are any two events. Using your understanding of conditional probability, you have:
Image source: Author
Image source: Author
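The derivation in the missing figures is most likely the standard one, sketched here under that assumption. Both conditional probabilities share the same joint probability, and equating the two expressions for it gives Bayes’ theorem.

```latex
\begin{aligned}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)} \\
P(A \cap B) &= P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \\
P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}
\end{aligned}
```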
The Naive Bayes algorithm is a classification technique based on Bayes’ theorem, which assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. It has many applications, including face recognition, NLP problems, and medical diagnosis.
Here’s an example: you have a dataset of texts and want to classify whether each one is written in a sports context or not. Let’s suppose that ‘a very good game’ is one of the texts in the dataset. Under the Naive Bayes assumption, you no longer look at the entire sentence, only at the individual words.
For the purpose of this example, word order is ignored, so ‘a very good game’ is the same as ‘a good very game’ and ‘game a good very’. The probability of the sentence is written as:
Image source: Author
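The figure is not reproduced here. Under the independence assumption, the probability of the sentence factors into one term per word. A plausible rendering, conditioning on a class label "Sports" that is assumed for illustration, is:

```latex
P(\text{a very good game} \mid \text{Sports})
  = P(\text{a} \mid \text{Sports}) \times P(\text{very} \mid \text{Sports})
  \times P(\text{good} \mid \text{Sports}) \times P(\text{game} \mid \text{Sports})
```

Because multiplication is commutative, any reordering of the four words gives the same product.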
When there are several X variables, the calculation is simplified by assuming that the Xs are independent of one another given the class, so:
Image source: Author
For n features X, the Naive Bayes formula becomes:
Image source: Author
which can be expressed as:
Image source: Author
To save time and effort, you can ignore the denominator, since it is the same for every class and does not affect which class scores highest. The formula finally becomes:
Image source: Author
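The figures for this chain of formulas are not reproduced here. A consistent reconstruction, assuming the standard form of the classifier, is sketched below. The symbols y (the class) and X_1, ..., X_n (the features) are notation chosen for this sketch.

```latex
\begin{aligned}
P(y \mid X_1, \dots, X_n)
  &= \frac{P(X_1, \dots, X_n \mid y)\,P(y)}{P(X_1, \dots, X_n)} \\
  &= \frac{P(y)\,\prod_{i=1}^{n} P(X_i \mid y)}{P(X_1, \dots, X_n)} \\
  &\propto P(y)\,\prod_{i=1}^{n} P(X_i \mid y) \\
\hat{y} &= \arg\max_{y}\; P(y)\,\prod_{i=1}^{n} P(X_i \mid y)
\end{aligned}
```

The second line applies the independence assumption, the third drops the denominator because it does not depend on y, and the last line picks the class with the largest score.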
Below is training data on which Naive Bayes algorithm is applied:
Image source: Author
Step 1: Make a Frequency table of the data.
Image source: Author
Step 2: Create a Likelihood table by computing the probabilities, for example P(Overcast) = 4/14 = 0.29.
Image source: Author
Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the predicted outcome.
Problem: Players will play if the weather is Rainy. Is this statement correct?
You can solve it using the above discussed method of posterior probability.
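P(Yes | Rainy) = P(Rainy | Yes) * P(Yes) / P(Rainy)
Here, you have P(Rainy | Yes) = 2/9 = 0.22, P(Rainy) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Rainy) = 0.22 * 0.64 / 0.36 = 0.39 (exactly 0.40 with unrounded fractions). Since this is below 0.5, P(No | Rainy) is about 0.60 and is the higher of the two. The model therefore predicts that players will not play on a rainy day, so the statement is not supported.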
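As a check on the calculation, here is a minimal Python sketch that applies Bayes’ theorem to the stated values. It also computes the posterior for No so the two classes can be compared. The values for No are not stated in the text; they are derived from the same counts (14 days, 9 Yes, 5 rainy days of which 2 are Yes) and are labeled as such in the code. The function name `posterior` is illustrative.

```python
from fractions import Fraction

# Values stated in the worked example.
p_rainy_given_yes = Fraction(2, 9)
p_yes = Fraction(9, 14)
p_rainy = Fraction(5, 14)

# Derived values (assumption: 14 days in total, so 5 are "No";
# 5 rainy days with 2 "Yes" leaves 3 rainy days that are "No").
p_rainy_given_no = Fraction(3, 5)
p_no = Fraction(5, 14)


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(class | x) = P(x | class) * P(class) / P(x)."""
    return likelihood * prior / evidence


p_yes_given_rainy = posterior(p_rainy_given_yes, p_yes, p_rainy)
p_no_given_rainy = posterior(p_rainy_given_no, p_no, p_rainy)

# The predicted class is the one with the larger posterior.
prediction = "Yes" if p_yes_given_rainy > p_no_given_rainy else "No"

print(f"P(Yes | Rainy) = {float(p_yes_given_rainy):.2f}")
print(f"P(No  | Rainy) = {float(p_no_given_rainy):.2f}")
print("Prediction:", prediction)
```

With exact fractions the Yes posterior is 2/5 rather than the 0.39 obtained from rounded inputs, and the two posteriors sum to 1 as expected.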
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in NLP problems like sentiment analysis, text classification, etc.
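To make the text-classification use concrete, here is a from-scratch sketch of a word-count Naive Bayes classifier. The five training sentences are made-up toy data for illustration only, not taken from the article. The add-one smoothing and the use of log probabilities are assumptions added so that unseen words and tiny products do not break the calculation.

```python
import math
from collections import Counter, defaultdict

# Made-up toy training set, for illustration only.
TRAIN = [
    ("a great game", "sports"),
    ("very clean match", "sports"),
    ("a clean but forgettable game", "sports"),
    ("the election was over", "not sports"),
    ("it was a close election", "not sports"),
]


def train(examples):
    """Count how often each class and each (class, word) pair occurs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(text, class_counts, word_counts, vocab):
    """Return the class maximizing log P(y) + sum of log P(word | y)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            # Add-one smoothing (an assumption) keeps an unseen word
            # from forcing the whole product to zero.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict("a very good game", *model))
print(predict("game a good very", *model))
```

Both calls return the same label, because the score depends only on which words appear, not on their order.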
Naive Bayes is one of the simplest algorithms that produces good results on text data. If you’re new to it, a good way to learn is to build an SMS spam classifier that labels a message as spam or not spam based on the words it contains. You can also learn Streamlit, an open-source app framework for machine learning and data science teams, to turn the model into an interactive web app in minutes.
Author is a seasoned writer with a reputation for crafting highly engaging, well-researched, and useful content that is widely read by many of today's skilled programmers and developers.