How to Create Naive Bayes Document Classification in Python?

The Naive Bayes text classification algorithm is a probabilistic model used in machine learning. Harry R. Felson and Robert M. Maxwell designed the first text classification method, which used the words in a document to classify it by authorship or genre.

Since then, Naive Bayes has become one of the most popular and effective classification methods for supervised learning. This article is an introduction to creating a simple Naive Bayes document classification system in Python.

Naive Bayes is a probability-based machine learning algorithm that uses Bayes' theorem with the assumption of “naive” independence between the variables (features), making it effective for small datasets. The Naive Bayes algorithms are most useful for classification problems and predictive modeling.

What is the Naive Bayes classification?

Naive Bayes is a probabilistic classification algorithm. It builds a probability model on strong independence assumptions about the features. These assumptions often do not hold in reality, which is why the method is called naive.

The probability model is derived from Bayes' theorem (credited to Thomas Bayes). Depending on the nature of the probability model, the Naive Bayes algorithm can be trained in a supervised learning setting.

A Naive Bayes model can be pictured as a large cube with the following dimensions:

  • Name of the input field.

  • Value of the input field. Depending on the input field type, the value range can be continuous or discrete. The Naive Bayes algorithm divides continuous fields into discrete bins.

  • Value of the target field.
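The binning of continuous fields mentioned in the list above can be sketched with NumPy. This is only an illustration: the field values and the cut points are made-up assumptions, and the text does not say how the bins are actually chosen.

```python
import numpy as np

# Hypothetical continuous input field (for example, an age in years).
ages = np.array([18.0, 25.5, 37.0, 52.0, 71.0])

# Hypothetical cut points: bin 0 is < 30, bin 1 is 30 to < 50, bin 2 is >= 50.
bin_edges = np.array([30.0, 50.0])

# np.digitize returns, for each value, the index of the bin it falls into.
age_bins = np.digitize(ages, bin_edges)
print(age_bins)
```

After this step each value is a small integer, so the field can be treated like any other discrete input.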

The Naive Bayes algorithm

Bayes theorem

Let's say you have defined a hypothesis H about your data and have observed some evidence E.

Bayes' theorem gives the probability that the hypothesis is true given the evidence. It multiplies the prior probability of the hypothesis by the probability of observing the evidence when the hypothesis is true.

It then divides that product by the overall probability of observing the evidence.

Pr(H|E) = Pr(H) * Pr(E|H) / Pr(E) 

Because we are classifying documents, the hypothesis is that the document belongs to category C. The evidence is the occurrence of the word W.
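As a small worked example of the formula, the sketch below plugs in counts from an imaginary corpus. All the numbers, the category "sports", and the word "goal" are illustrative assumptions, not data from this article.

```python
# Hypothetical counts from a tiny labelled corpus (illustrative numbers only).
n_docs = 100             # total number of documents
n_sports = 40            # documents labelled "sports"
n_goal = 25              # documents containing the word "goal"
n_goal_and_sports = 20   # "sports" documents containing "goal"

prior = n_sports / n_docs                   # Pr(H)   = Pr(sports)
likelihood = n_goal_and_sports / n_sports   # Pr(E|H) = Pr("goal" | sports)
evidence = n_goal / n_docs                  # Pr(E)   = Pr("goal")

posterior = prior * likelihood / evidence   # Pr(H|E) = Pr(sports | "goal")
print(round(posterior, 3))
```

The result agrees with counting directly: 20 of the 25 documents that contain "goal" are about sports.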

In classification tasks we compare two or more hypotheses, so we can use the ratio form of Bayes' theorem. The denominator Pr(W) is the same for every hypothesis and cancels out, which leaves only the numerators (for Bayes aficionados: the prior times the likelihood) to compare:

Pr(C₁|W) / Pr(C₂|W) = (Pr(C₁) * Pr(W|C₁)) / (Pr(C₂) * Pr(W|C₂))

Because a document contains many words, the formula becomes:

Pr(C₁|W₁, W₂, ..., Wn) / Pr(C₂|W₁, W₂, ..., Wn) =

(Pr(C₁) * Pr(W₁|C₁) * Pr(W₂|C₁) * ... * Pr(Wn|C₁)) /

(Pr(C₂) * Pr(W₁|C₂) * Pr(W₂|C₂) * ... * Pr(Wn|C₂))
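The ratio above can be computed directly once the priors and the per-word probabilities are known. The sketch below uses made-up probabilities for two hypothetical categories; the names `priors`, `word_probs`, `class_score`, and `posterior_ratio` are introduced here only for illustration.

```python
from math import prod

# Hypothetical priors Pr(C) and per-class word probabilities Pr(W|C).
priors = {"sports": 0.5, "politics": 0.5}
word_probs = {
    "sports":   {"goal": 0.20, "team": 0.15, "vote": 0.01},
    "politics": {"goal": 0.02, "team": 0.05, "vote": 0.25},
}

def class_score(words, category):
    """Prior times the product of the per-word likelihoods (the numerator)."""
    return priors[category] * prod(word_probs[category][w] for w in words)

def posterior_ratio(words, c1, c2):
    """Pr(c1 | words) / Pr(c2 | words); the shared denominator cancels out."""
    return class_score(words, c1) / class_score(words, c2)

doc = ["goal", "team"]
ratio = posterior_ratio(doc, "sports", "politics")
print(ratio)  # a ratio above 1 favours the first category
```

A ratio greater than 1 means the first category is the more probable one for this document.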

The Naive Bayes classifier can be used in the following applications:

  • Emails are automatically classified into folders such as "Family", "Friends", "Updates", and "Promotions".

  • Job listings are automatically tagged. Job listings in raw text format can be classified into categories such as "software development", "design", and "marketing".

  • Products are automatically categorized. We can classify products according to their descriptions into categories such as books, electronics, and clothing.

Despite its simplicity, Naive Bayes can match or even outperform very sophisticated classification methods on some problems, and its simplicity keeps it practical on very large datasets.

Pros and cons of Naive Bayes

Pros

  • The algorithm is fast and easy to use, so it predicts the class of test data very quickly.

  • It handles multiclass prediction problems well.

  • When the assumption of independence holds, a Naive Bayes classifier can outperform models such as logistic regression, and it needs less training data.

  • It performs better with categorical input variables than with numerical ones. For numerical input variables, a normal distribution (bell curve) is assumed.
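The bell-curve assumption for numerical inputs corresponds to what scikit-learn provides as GaussianNB. Here is a minimal sketch on made-up one-dimensional data; the feature values and the two class labels are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical numerical feature: class 0 clusters around 1, class 1 around 5.
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

# GaussianNB fits one normal distribution per feature and per class.
clf = GaussianNB().fit(X, y)

print(clf.theta_.ravel())              # per-class means of the feature
print(clf.predict([[1.1], [4.9]]))     # classes of two new values
```

Each new value is assigned to the class whose fitted bell curve (weighted by the class prior) makes it most probable.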

Cons

  • If the test data set contains a category of a categorical variable that was not observed in the training data set, the model assigns it a zero probability and cannot make a prediction. This is known as the "Zero Frequency" problem. A smoothing technique solves it; Laplace estimation is one of the simplest.

  • Naive Bayes is known to be a poor probability estimator, so the probability outputs from predict_proba should not be taken too seriously.

  • Naive Bayes assumes independent predictors. That sounds great in theory, but in real life it is almost impossible to find a set of predictors that are completely independent.
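To make the zero-frequency problem concrete, here is a minimal sketch of Laplace (add-one) smoothing for word probabilities within a single class. The documents, the vocabulary, and the helper name `word_likelihoods` are made up for illustration. In scikit-learn, the `alpha` parameter of MultinomialNB plays the same role and defaults to 1.0.

```python
from collections import Counter

def word_likelihoods(docs, vocabulary, alpha=1.0):
    """Estimate Pr(word | class) from one class's documents.

    alpha is the smoothing strength: 0.0 means no smoothing,
    1.0 is Laplace (add-one) smoothing.
    """
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts[w] for w in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denominator for w in vocabulary}

# Hypothetical training documents for a "spam" class.
spam_docs = ["win money now", "win prize"]
# "meeting" is in the vocabulary but never occurs in the spam documents.
vocabulary = ["win", "money", "now", "prize", "meeting"]

unsmoothed = word_likelihoods(spam_docs, vocabulary, alpha=0.0)
smoothed = word_likelihoods(spam_docs, vocabulary, alpha=1.0)

print(unsmoothed["meeting"])  # zero: any document with "meeting" scores 0
print(smoothed["meeting"])    # small but non-zero after smoothing
```

Without smoothing, a single unseen word drives the whole product of likelihoods to zero. With smoothing, every word keeps a small probability and the estimates still sum to 1.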

Naive Bayes assumption

Assuming that each word is independent of all the others simplifies the equation and, ultimately, the code.

This assumption is not true of real text, where the words that come before and after have a direct impact on the next or previous word. We make it anyway to simplify the math, and in practice it works quite well.

Naive Bayes is based on this assumption. With it, the likelihood of all the words in the numerator decomposes into a product of per-word likelihoods:

Pr(W₁, W₂, ..., Wn|C) = Pr(W₁|C) * Pr(W₂|C) * ... * Pr(Wn|C)
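One practical consequence of multiplying many per-word probabilities is numerical underflow: the product of hundreds of small numbers rounds to zero in floating point. A common workaround is to add log-probabilities instead. The sketch below uses a made-up document of 500 words that all have the same hypothetical probability.

```python
import math

# Hypothetical Pr(W|C) for each of the 500 words in a long document.
word_probs = [1e-3] * 500

# Multiplying the probabilities directly underflows to 0.0.
product = math.prod(word_probs)

# Summing the logarithms keeps the score finite and comparable.
log_score = sum(math.log(p) for p in word_probs)

print(product)
print(log_score)
```

Because the logarithm is monotonic, the class with the largest sum of log-probabilities (plus the log of its prior) is the same class that would win with the product.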

When to use Naive Bayes

Because it makes such strict assumptions about the data, a Naive Bayes classifier generally performs worse than a more complex classifier. It has some advantages, however:

  • Training and prediction are both very fast.

  • It provides straightforward probabilistic predictions.

  • It is usually pretty easy to interpret.

  • It has very few tunable parameters, if any.

These advantages make a Naive Bayes classifier a good choice as an initial baseline. If it performs well, you have a classifier for your problem that is very fast and easy to interpret.

If it does not perform well, you can explore more sophisticated models, with some baseline knowledge of how well they should perform.

How to execute Naive Bayes in Python

Let's get started by importing the libraries:

import numpy as np, pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score
sns.set() # use seaborn plotting style

We will now load the data (training and test data):

# Load the dataset (the training subset by default)
data_ = fetch_20newsgroups()
# Get the names of the text categories
text_categories_ = data_.target_names
# Define the training set
train_data_ = fetch_20newsgroups(subset="train", categories=text_categories_)
# Define the test set
test_data_ = fetch_20newsgroups(subset="test", categories=text_categories_)

Let's count the classes and samples:

print("Number of unique classes {}".format(len(text_categories_)))
print("Number of training samples {}".format(len(train_data_.data)))
print("Number of test samples {}".format(len(test_data_.data)))

You will get output as:

Number of unique classes 20
Number of training samples 11314
Number of test samples 7532

As a result, we have a 20-class text classification problem (the dataset contains posts from 20 newsgroups) with a training sample size of 11314 and a test sample size of 7532 documents.

Let's take a look at one of the test samples (index 3, that is, the fourth document):

print(test_data_.data[3])

You should see something like this printed out, since our data consists of texts (specifically, newsgroup posts):

Output

It looks like Ben Baz's mind and heart are also blind, not only his eyes.
>I used to respect him, today I lost the minimal amount of respect that
>I struggled to keep for him.
>To All Muslim netters: This is the same guy who gave a "Fatwah" that
>Saudi Arabia can be used by the United States to attack Iraq .

They were attacking the Iraqis to drive them out of Kuwait,
a country whose citizens have close blood and business ties
to Saudi citizens.  And me thinks if the US had not helped out
the Iraqis would have swallowed Saudi Arabia, too (or at
least the eastern oilfields).  And no Muslim country was doing
much of anything to help liberate Kuwait and protect Saudi
Arabia; indeed, in some masses of citizens were demonstrating
in favor of that butcher Saddam (who killed lotsa Muslims),
just because he was killing, raping, and looting relatively
rich Muslims and also thumbing his nose at the West.

So how would have *you* defended Saudi Arabia and rolled
back the Iraqi invasion, were you in charge of Saudi Arabia???

>Fatwah is as legitimate as this one. With that kind of "Clergy", it might
>be an Islamic duty to separate religion and politics, if religion
>means "official Clergy".

I think that it is a very good idea to not have governments have an
official religion (de facto or de jure), because with human nature
like it is, the ambitious and not the pious will always be the
ones who rise to power.  There are just too many people in this
world (or any country) for the citizens to really know if a
leader is really devout or if he is just a slick operator.

>
> 	 CAIRO, Egypt (UPI) -- The Cairo-based Arab Organization for Human
>  Rights (AOHR) Thursday welcomed the establishement last week of the
>  Committee for Defense of Legal Rights in Saudi Arabia and said it was
>  necessary to have such groups operating in all Arab countries.

You make it sound like these guys are angels, Ilyess.  (In your
clarinet posting you edited out some stuff; was it the following???)
Friday's New York Times reported that this group definitely is
more conservative than even Sheikh Baz and his followers (who
think that the House of Saud does not rule the country conservatively
enough).  The NYT reported that, besides complaining that the
government was not conservative enough, they have:

    - asserted that the (approx. 500,000) Shiites in the Kingdom
      are apostates, a charge that under Saudi (and Islamic) law
      brings the death penalty.  

      Diplomatic guy (Sheikh bin Jibrin), isn't he Ilyess?

    - called for severe punishment of the 40 or so women who
      drove in public a while back to protest the ban on
      women driving.  The guy from the group who said this,
      Abdelhamoud al-Toweijri, said that these women should
      be fired from their jobs, jailed, and branded as
      prostitutes.

      Is this what you want to see happen, Ilyess?  I've
      heard many Muslims say that the ban on women driving
      has no basis in the Qur'an, the ahadith, etc.
      Yet these folks not only like the ban, they want
      these women falsely called prostitutes?  

      If I were you, I'd choose my heroes wisely,
      Ilyess, not just reflexively rally behind
      anyone who hates anyone you hate.

    - say that women should not be allowed to work.

    - say that TV and radio are too immoral in the Kingdom.

Now, the House of Saud is neither my least nor my most favorite government
on earth; I think they restrict religious and political reedom a lot, among
other things.  I just think that the most likely replacements
for them are going to be a lot worse for the citizens of the country.
But I think the House of Saud is feeling the heat lately.  In the
last six months or so I've read there have been stepped up harassing
by the muttawain (religious police---*not* government) of Western women
not fully veiled (something stupid for women to do, IMO, because it
sends the wrong signals about your morality).  And I've read that
they've cracked down on the few, home-based expartiate religious
gatherings, and even posted rewards in (government-owned) newspapers
offering money for anyone who turns in a group of expartiates who
dare worship in their homes or any other secret place. So the
government has grown even more intolerant to try to take some of
the wind out of the sails of the more-conservative opposition.
As unislamic as some of these things are, they're just a small
taste of what would happen if these guys overthrow the House of
Saud, like they're trying to in the long run.

Is this really what you (and Rached and others in the general
west-is-evil-zionists-rule-hate-west-or-you-are-a-puppet crowd)
want, Ilyess?

--
Dave Bakken
==>"the President is doing a fine job, but the problem is we don't know what
	to do with her husband." James Carville (Clinton campaign strategist),2/93
==>"Oh, please call Daddy. Mom's far too busy."  Chelsea to nurse, CSPAN, 2/93

Next, we will build a Naive Bayes classifier and train it. Our example converts the collection of text documents into a matrix of TF-IDF features and passes it to the classifier. To chain the two steps, we will use the make_pipeline function.

# Build the model: TF-IDF features followed by a multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
# Train the model with the training data
model.fit(train_data_.data, train_data_.target)
# Predict the categories of the test data
predicted_categories = model.predict(test_data_.data)

The last line of the code predicts the labels of the test set.
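To see the same kind of pipeline at work without downloading the dataset, here is a minimal sketch on a made-up four-document corpus. The texts, the labels, and the helper name `predict_category` are illustrative assumptions, not part of the 20 newsgroups example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with two categories.
toy_texts = [
    "the team scored a late goal",
    "the striker scored twice in the match",
    "parliament passed the budget vote",
    "the senate vote on the new law",
]
toy_labels = ["sports", "sports", "politics", "politics"]

# Same two steps as in the article: TF-IDF features, then multinomial Naive Bayes.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

def predict_category(text, model=toy_model):
    """Return the predicted category for a single piece of text."""
    return model.predict([text])[0]

print(predict_category("a goal in the final match"))
print(predict_category("the vote on the budget"))
```

The pipeline applies the vectorizer and the classifier in one call, so raw strings can be passed straight to `fit` and `predict`.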

Here are the predicted category names:

print(np.array(test_data_.target_names)[predicted_categories])
array(['rec.autos', 'sci.crypt', 'alt.atheism', ..., 'rec.sport.baseball',
'comp.sys.ibm.pc.hardware', 'soc.religion.christian'],dtype='<U24')

Let's construct a multi-class confusion matrix to check if the model is suitable or if it only predicts certain text types correctly.

# Plot the confusion matrix (transposed, so rows are predictions and columns are true labels)
mat = confusion_matrix(test_data_.target, predicted_categories)
sns.heatmap(mat.T, square=True, annot=True, fmt="d", xticklabels=train_data_.target_names, yticklabels=train_data_.target_names)
plt.xlabel("true labels")
plt.ylabel("predicted label")
plt.show()
print("Accuracy: {}".format(accuracy_score(test_data_.target, predicted_categories)))
Accuracy: 0.7738980350504514
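A single accuracy number hides which classes the model confuses. As a rough sketch, per-class recall and precision can be read off the confusion matrix. The label arrays below are made up for illustration and are not results from the 20 newsgroups model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# In scikit-learn, rows are true labels and columns are predicted labels.
toy_mat = confusion_matrix(y_true, y_pred)

# Recall: correct predictions divided by the number of true samples per class.
recall_per_class = toy_mat.diagonal() / toy_mat.sum(axis=1)
# Precision: correct predictions divided by the number of predictions per class.
precision_per_class = toy_mat.diagonal() / toy_mat.sum(axis=0)

print(toy_mat)
print(recall_per_class)
print(precision_per_class)
print(accuracy_score(y_true, y_pred))
```

Note that the heatmap in the article plots the transposed matrix, so there the rows are the predicted labels.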

Naive Bayes is a powerful machine learning algorithm that you can use in Python to create your own spam filters and text classifiers. Naive Bayes classifiers are simple and robust probabilistic classifiers that are particularly useful for text classification tasks. The Naive Bayes algorithm relies on an assumption of conditional independence of features given a class, which is often a good first approximation to real-world phenomena.

Naive Bayes is a popular text classification technique that can quickly provide a reasonably accurate "guess" as to the category of a document. It is a probabilistic classifier and can give impressive results. It also scales nicely, allowing you to process thousands of documents. Its ability to keep up with new words can make it more accurate in predicting categories than some other popular methods.

Author

Sanskriti Singh

Sanskriti is a tech writer and a freelance data scientist. She has rich experience in writing technical content and is also interested in writing about mental health, productivity, and self-improvement.

Frequently Asked Questions

A Naive Bayes model is easy to build and is particularly useful for large data sets. Most developers prefer this model due to its simplicity. It is also known to outperform some highly sophisticated classification models.

Because it gives good results in multi-class problems, this model is commonly used for text classification. Spam e-mail identification and social media sentiment analysis are two of its main applications.

The Naive Bayes classifier is a collection of algorithms based on Bayes' theorem. They all share the same principle: every pair of features being classified is assumed to be independent of each other.

The Naive Bayes classifier is a probabilistic machine learning model. It is mainly used to analyze and classify text, which is a vast field for ML algorithms.

Naive Bayes is one of the easiest classifiers to use for building a quick machine learning model for predictions. It is a probabilistic classifier, which means it makes predictions based on the probability of an object belonging to a class.

Bayes' theorem also goes by the names Bayes' Rule and Bayes' Law. It determines the probability of a hypothesis given prior knowledge, and it depends on conditional probability.
