Introduction to Self-Supervised Learning in NLP

Feb 11, 2022•12 min read

Languages, frameworks, tools, and trends

Self-supervised learning (SSL) is a prominent part of deep learning. This is a legit method that is used to train most of the models as it can learn from the unlabeled data, making it easier to leverage a larger volume of raw data. But how is it done?

When neural networks are provided with data, these networks try to connect the patterns within the data and extract relevant features. These features are then used to make decisions, like classifying objects (image classification), predicting a number (regression), generating captions (caption generator), and more.

In this blog, we will discuss some techniques by which state-of-the-art deep learning models can be trained with less time and resources. We will also look into the details of self-supervised learning, its types, and the applications in which these models are used.

Transfer learning

Transfer learning.webp

Recently the IT industry saw an increase in the use of deep learning methods, thanks to the available large amounts of data and enough compute/processing power. This resulted in the training of heavier deep learning models on much larger datasets, particularly in computer vision and natural language processing tasks.

Therefore, the output models, which achieved state-of-the-art results on standard benchmarks, were made available as pre-trained models for use. It was mainly used to fine-tune custom datasets. This technique forms the basis for transfer learning.

Transfer learning is defined as transferring the knowledge gained by a model while learning one task. Then these trained models can be applied to a similar task with little modifications. You can also use pre-trained models like Resnet50, EfficientNet, and more. These models are trained on millions of images (like the ImageNet dataset) and fine-tuned on your data.

Let's consider an example to train an image classification model of Cats vs Dogs (i.e., PETs Dataset). You can use the ResNet50 pre-trained model that has been trained on the Imagenet dataset with 1000 classes. Use the model to fine-tune the PETS dataset, containing different images of cats and dogs.

Since you are using the pre-trained model, you are not starting the model training from scratch which means that the model has some knowledge about the appearance of a cat or dog. Now you can fine-tune the pre-trained model with the dataset and achieve good results quickly as compared to the training model from scratch.

But you can't use transfer learning when you have no pre-trained models available. Here, you can use the self-supervised learning technique to train models and produce accurate results in lesser time and with minimum resources.

Self-supervised learning

self-supervised learning.webp

Self-supervised learning is a technique used to train models in which the output labels are a part of the input data, thus no separate output labels are required. It is also known as predictive learning or pretext learning. In this method, the unsupervised problem is changed into a supervised one using auto-generation of labels. A classic example of this would be language models.

The language model is a word sequence prediction model trained to predict the next word based on the previous input sentence. This kind of task is a self-supervised learning task because you are not defining separate output labels. Instead, you are providing the texts as inputs and outputs in a specific way that will help the model to understand the fundamentals and style of the language used in the dataset (or the language used in the dataset).

Self-supervised learning is combined with transfer learning to create a more advances NLP model. When you don't have any pre-trained models for our dataset, you can create one using self-supervised learning. You can train a language model using the text corpus available in the train and test dataset.

You can train a language model by providing an independent variable like the text of a specific length, and then provide the same text by appending the next word as the output label. This works when a lot of text is provided in the model and it can easily find patterns and learn the basic style of the text.

In this way, the model will learn to predict the next word in the sentence. Usually, the language model generates text but it is not useful for any downstream task until it is fine-tuned.

The language model acts as a pre-trained model that can be used to conduct transfer learning i.e., fine-tuning with the same dataset or different datasets for any downstream tasks like text classification, sentiment analysis, and more.

One of the best resources to find the language models is HuggingFace. It has many models trained using corpora of text data of different styles and languages. Machine learning algorithms are broadly classified as

Supervised
Semi-supervised
Unsupervised
Reinforcement learning

Let's see what each of these means in brief:

Supervised learning

The supervised learning technique is a popular technique that helps with training your neural networks on labeled data for a specific task. In this technique, a machine learning model will have inputs and corresponding labels to learn about.

For example, you can consider image classification, regression analysis, or more. For instance, take a classroom where a student is learning from the teacher about any concept with different examples for better understanding.

Unsupervised learning

A deep learning technique that is used to find data from implicit patterns without explicitly training on labeled data is known as unsupervised learning models. Contrary to the supervised learning models, it doesn’t need a feedback loop and annotations for training.

In this technique, a machine learning model will get inputs and the model will find patterns from it and use those to predict the output. For example, clustering, and principal component analysis all come under this.

Types of machine learning algorithms.webp

Semi-supervised learning

It is a combination of both unsupervised and supervised learning models. This technique comes in handy when you have a small set of labeled data points for training the model. With the training process, you can use a pseudo-label and a set of labeled data for the rest of the dataset.

For instance, when a student learns how to deal with specific problems from their teacher, they have to figure out how to solve those problems by themselves. It is a semi-supervised learning model.

Reinforcement learning

It is a method used for training AI agents to learn environmental behavior in certain contexts with the help of a reward feedback policy. With this technique, a machine learning model learns from actions and rewards.

It depends on how agents take action in an environment for maximizing the awards received. Examples of this learning model are path planning, chess engine, a child who wants to win a stage game, and more.

Difference between supervised and unsupervised learning

Both supervised and unsupervised learning have distinct objectives and provide you with distinct solutions as per your requirement.

Supervised learning models and unsupervised learning models can be complementary learning models as both don’t require labeling of the datasets. Unsupervised learning models must be the superset of self-supervised learning as they don’t provide any feedback loops.

Difference between supervised and unsupervised learning_11zon.webp

Supervised vs unsupervised learning model.webp

On the other hand, the self-supervised learning model has many supervisory signals which act as responses in the process of the training. An unsupervised learning model focuses more on the model and not on the data. In contrast, the supervised learning model works the other way.

But, unsupervised learning models are exceptional at dimensionality reduction and clustering, whereas supervised learning models are a pretext technique for classification and regression tasks. Another major difference is that supervised learning functions around labeled data but unsupervised learning mainly deals with unlabeled data.

Why was self-supervised learning introduced?

Self-supervised learning model was designed to address the following common issues.

High Cost: To acquire labeled data, it was necessary to use more of the learning models. However, to acquire higher quality labeled data, you have to pay a higher price and the task might also be tedious and time-consuming.
Generic AI: The self-supervised learning model framework is closely related to embedding human cognition with machines. So the self-supervised machine is also considered artificial intelligence.
Lengthy lifecycle: While developing machine learning models, the data preparation lifecycle will be a highly lengthy process. You will need to filter, annotate, review, clean, and restructure the data according to your training framework.

The self-supervised learning model applications first came into existence because of the above-mentioned concerns. It not only overcomes the above concerns but also provides additional benefits like flexibility, and integrity of data which all come at a lower cost.

Self-supervised learning applications in NLP

SSL created huge steps in the NLP (Natural Language Processing) field. The self-supervised learning is widely used everywhere starting from application documentation processing, sentence completion, text suggestions, and more.

But, the learning abilities of the self-supervised model evolves majorly after the release of the Word2Vec research paper, which took the natural language processing domain to the next level. The model can predict the next word depending on the prior pattern, this is the idea behind word embedding approaches.

It is because of such improvements that came from the Word2Vec research paper, you will now be able to get a beneficial representation using the allotment of word-embedded systems which are used for scenarios like word prediction, sentence completion, and so on. BERT (Bidirectional Encoder Representations from Transformers) is one of the most eminent SSL methods that is used in natural language processing.

Now, let us discuss some of the vital applications of self-supervised learning models:

Next sentence prediction

With next sentence prediction, you can pick up two concurrent sentences from a document and an unspecified sentence from the same or different document, so you have a sentence 1, sentence 2, and sentence 3.

Then, you can ask the self-supervised learning model the relative position of sentence 1 to sentence 2, and the model will provide an output either with IsNotNextSentence or IsNextSentence. You can use the same self-supervised model with all the combinations.

You should consider the below scenarios:

The mission to the moon is finally starting after so many years.
You can watch TV when you reach home.
You can go home after you come from school.

Next sentence prediction.webp

When you ask an individual to reorder any different sentences which will be suitable for our logical understanding, they would mostly choose sentence 1 and 2 as one after the other. The vital reason for using this model is to predict sentences depending on the everlasting contextual dependencies.

BERT published a paper written by Google’s artificial intelligence team of researchers who have proficient in various NLP tasks like natural language inference, question answering, and many more.

For similar tasks, BERT will provide a great method for capturing the relationships between sentences that is not possible with other language modeling methods. Here is how the self-supervised model for NLP works:

For making BERT handle different varieties of downstream tasks, input representation definitively represents a set of sentences that are conjoint in a single sequence. A sequence cites an input token order to BERT.
The initial token of each sequence is a singular classification token. The concluding hidden state after this token is utilized as the average sequence presentation for classifying tokens.

You should differentiate between the sentences in two methods. Firstly, you should disconnect them with a unique token. Secondly, you have to add a learning model embedding each token which indicates whether it applies to sentence 1 or sentence 2.

You should denote the input embedding as 5, then the concluding hidden vector of the unique token as 3, and the concluding hidden vector for the ith input token as Ti. Then, vector 3 is used for the NSP application.

Auto-regressive language modeling

When you have auto-encoding models like BERT from transformers or you want to employ self-supervised learning for functions like sentence classification, then the application of SSL propositions occurs in the text generation domain.

Auto-regressive models like GPT (Generative Pre-trained Transformers) agreed on the vintage language modeling task. It will forecast the upcoming word after reading all the preceding ones. Models like these will respond to the decoder part of the mask and transformer at the top of the completed sentence before and after the text.

Now, let us understand more about how these models work by trying to understand the GPT training framework. The training approach will have two phases:

Unsupervised pre-training phase

This is the first stage. It will help you learn a powerful language model on a huge amount of text. When you have an unsupervised amount of tokens U = {u1, . . . . . . . . un}, you can use standard language modeling objectives for maximizing the following probability:

Unsupervised pre-training phase2.webp

In the above string, k is the measurement of the context window, and P is the dependent probability modeled with the help of neural networks with parameters. These variables are taught with the help of stochastic gradient descent.

Autoregressive Language Model.webp

You are training a complex transformer decoder for the self-supervised learning model, which will also act as an alternative to the transformer. It applies a complex operation over the input context tokens following position-wise feed-forward layers for producing an output distribution over the targetted tokens:

Unsupervised pre-training phase3.webp

In the above equation, U = ( u-k, . . . . . . . , u -1) is the context vector of tokens, We are the token embedding matrix, Wp is the position embedding matrix, and n is the total number of layers. This attention where every token can visit the context to h<0 will bring the self-supervised approach into the picture.

Supervised fine-tuning

In this step, you should think of a labeled dataset C, where every instance contains a set of input tokens,
x1, . . . . . , xm along with a label y. You will pass the inputs through the pre-trained model for obtaining the final transformer block parameters Wy for predicting y:

Supervised fine-tuning2.webp

It will provide us with the objective for maximizing:

Supervised fine-tuning 3.webp

When you want to add language modeling for the auxiliary purpose of the fine-tuning, it will help in understanding that improving the generality of the supervised learning model and increasing convergence is the best choice. You should optimize the below objective with a weight:

Supervised fine-tuning 4.webp

On the whole, the only extra variables you need when fine-tuning are Wy, and embeddings for delimiter tokens. The transformer architecture and training objectives are the input transformations for fine-tuning various tasks.

You must convert all the structured inputs into sequences of tokens that need to be processed by the pre-trained model and after that, we should process the linear + softmax layer. For various tasks, various types of processing are needed like Textual Entailment. There are various iterations required when you want to have an improvement over the original GPT model. It will also help you in understanding how you can use it for your requirements.

Wrapping up

When you are using the above techniques like transfer learning and self-supervised learning, you will be able to train the deep learning models even when there aren’t enough resources. It will provide us with a way to train deep learning models that are not task-specific but can support multiple tasks with the help of fine-tuning.

Self-supervised learning has highly helped in the development of AI systems, which can learn with less help. With GPT-3 and BERT you can see that SSL is easily used in natural language processing. This network aims to learn using good representations from unlabeled data.

It will highly reduce the dependency on huge amounts of data similar to the supervised learning model. At present, self-supervised learning is still a growing technology and developers are massively being hired to make the process smoother and efficient.

Author
Turing Staff