
Brief Introduction to Transformers and Their Power


Introduction to sequence models

Our lives are a course of events that either we or our circumstances control. Wherever you look, you can spot countless patterns. A series of elements in which each one depends on its predecessor is called a sequence. For example, the words in a sentence of any language follow a pre-defined structure in order to deliver a message. A sentence is nothing but a sequence of words.

In deep learning, we encounter sequences in datasets very frequently. Whether to use a sequence model depends on the presence of highly correlated features in the dataset and on what we expect from the model as an outcome. We use sequence models to learn the order of the elements and how they interact with one another over time, which helps forecast how the sequence will continue. Before the invention of transformers, the most popular designs for such models were Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Gated Recurrent Units (GRUs).

Despite their strong performance and wide adoption throughout the industry, these models faced a bottleneck: they struggled to learn long-distance dependencies within a sequence, so they could not always maintain the appropriate context. With such a problem and high demand for a solution, researchers came up with a concept: attention. Transformers work very closely with one form of it, self-attention. To read about these concepts in detail, check out the paper "Attention Is All You Need".

Introduction to Transformers

What is a Transformer

The transformer is a network architecture proposed in the paper "Attention Is All You Need". It consists of an encoder and a decoder. This architecture significantly improved the state of the art in neural machine translation (translating text from one language to another) using only attention mechanisms together with some dense and normalization layers. Additionally, transformers are much faster to train and simpler to parallelize than recurrent models. These two properties reduce the overall computational cost, which makes the architecture both powerful and efficient.

Let us get into a little detail of how exactly transformers work behind the scenes. First and foremost, they use attention, and especially self-attention, as the base concept. In simple words, self-attention can be imagined as 'everything in a sequence looking within the sequence itself'. There are mathematical equations involved in going deeper into the topic, but I will refrain from doing so in this introductory article. For more clarity, think of self-attention as taking the embedding vector for one of the words in the input sequence. This embedding vector goes through many computations with itself and with the embedding vectors of the other words in the sequence in order to establish how the words relate to each other. These interrelations then pass through a small feed-forward (perceptron) network, which introduces nonlinearity into the learning of the parameters. Positional embeddings, calculated to preserve the order in which the words appear in the input sequence, are also added to the word embedding vectors.
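To make the idea of self-attention a little more concrete, here is a minimal NumPy sketch of scaled dot-product self-attention for a single sequence. It is only an illustration under stated assumptions: the function names, the tiny dimensions, and the random weight matrices w_q, w_k, and w_v are hypothetical stand-ins for parameters that a real model would learn.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention.

    x: [seq_len, d_model] word embeddings for one sequence.
    Returns the attended vectors and the attention weights.
    """
    q = x @ w_q  # queries: what each word is looking for
    k = x @ w_k  # keys: what each word offers
    v = x @ w_v  # values: the content that gets mixed together
    # Every word scores every word in the same sequence (including itself).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Hypothetical toy sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)   # (4, 8)
print(weights.shape)  # (4, 4)
```

Each row of weights tells us how strongly one word attends to every word in the sequence, which is the 'looking within itself' described above.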



The left side of the diagram is the encoder, and the right side is the decoder. Both components take sequences in the form of embeddings as input and produce the necessary output. The encoder takes a batch of sentences represented as sequences of word tokens (numerical IDs assigned to the words of the input corpus so that they can be computed on) and encodes each word into a 512-dimensional embedding vector. Hence, the input size for the encoder is [batch_size, max input sentence length], which on encoding gives [batch_size, max input sentence length, 512]. The decoder, on the other hand, takes two inputs while training the network: the target sentence (also represented as a sequence of word tokens) and the outputs of the encoder. Please keep in mind that the encoder and decoder blocks are each stacked 'N' times, and the final output of the encoder is fed to the decoder at each of its 'N' levels.
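The shapes described above can be traced with a short NumPy sketch. It assumes a hypothetical vocabulary of 100 tokens and a random, untrained embedding table, and it uses the sinusoidal positional encoding from "Attention Is All You Need" as one possible way to compute the positional information. The paper's additional scaling of the embeddings is left out to keep the example short.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Sinusoidal encoding: even columns use sine, odd columns use cosine.
    pos = np.arange(max_len)[:, None]              # [max_len, 1]
    even_dims = np.arange(0, d_model, 2)[None, :]  # [1, d_model / 2]
    angles = pos / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Hypothetical sizes; only d_model = 512 comes from the article.
rng = np.random.default_rng(0)
batch_size, max_len, vocab_size, d_model = 2, 5, 100, 512

# A batch of sentences as integer token IDs: [batch_size, max_len].
token_ids = rng.integers(0, vocab_size, size=(batch_size, max_len))

# Embedding lookup table (random here, learned in a real model).
embedding_table = rng.normal(size=(vocab_size, d_model))
embedded = embedding_table[token_ids]  # [batch_size, max_len, 512]

# Add position information; broadcasting repeats it for every sentence.
encoder_input = embedded + positional_encoding(max_len, d_model)

print(token_ids.shape)      # (2, 5)
print(encoder_input.shape)  # (2, 5, 512)
```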

Functioning in brief

So far we have covered the core functioning of the transformer and how its parts connect to one another. A few components that appear in the architecture but have not been touched upon yet are worth a brief introduction before we wrap up the article:

  1. Multi-head attention: On a superficial level, imagine multi-head attention as a multi-tasker. Suppose you have a word for which the transformer needs to learn the next word. Multi-head attention performs several parallel attention computations (heads) for the same word, and each one produces a different result. One head might look at the tense of the word, another at the context of the text, another at the type of word (verb, noun, etc.), and so on. These results are then concatenated and combined by a linear layer, and at the output of the decoder a SoftMax over the vocabulary finds the most probable word.
  2. Masked multi-head attention: This works like the multi-head attention described above, except that it hides the words in the sequence that come after the word the decoder is currently working on. This masking prevents the transformer from looking into the future, so it learns realistically from the data.
  3. Residual connection: The directed arrows that bypass the attention module and go directly into an 'Add and Norm' block are called skip connections or residual connections. They help prevent the degradation of the network and maintain the gradient flow throughout it for robust training.
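The three components above can be put together in one small NumPy sketch: several attention heads run in parallel, an optional mask hides future positions, and a residual connection followed by normalization wraps the result. This is a simplified illustration, not a full implementation. The weights are random, the sizes are hypothetical, and the layer normalization omits the learned scale and shift that a real model would include.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Simplified: no learned scale or shift parameters.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads, causal=False):
    """x: [seq_len, d_model]. Returns (output, per-head attention weights)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # [seq_len, d_model] -> [num_heads, seq_len, d_head]
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # [heads, seq, seq]
    if causal:
        # True above the diagonal marks the "future" positions to hide.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # [heads, seq, d_head]
    # Concatenate the heads and mix them with a final linear projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Hypothetical toy sizes and random (untrained) weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

attended, weights = multi_head_attention(
    x, w_q, w_k, w_v, w_o, num_heads, causal=True
)

# 'Add and Norm': the input skips around attention and is added back in.
out = layer_norm(x + attended)

print(out.shape)      # (4, 8)
print(weights.shape)  # (2, 4, 4)
```

With causal=True, every attention weight above the diagonal is zero, so a position can only attend to itself and to earlier positions. Setting causal=False gives the unmasked multi-head attention used in the encoder.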

Famous Transformer models

Most of the state-of-the-art models currently in use are built on transformers. They power speech-to-text conversion, machine translation, text generation, paraphrasing, question answering, sentiment analysis, and much more. The following are some of the best-known models.

  • BERT: Stands for Bidirectional Encoder Representations from Transformers. This technique was designed by Google for natural language processing and is based on pre-trained transformers. By late 2020, BERT was used in almost every English-language query on Google’s search engine. For more information on BERT check out the link.
  • GPT-2, GPT-3: Stand for Generative Pre-trained Transformer 2 and 3, respectively. GPT models are used for natural language processing (NLP) tasks such as machine translation, question answering, text summarization, and many more. OpenAI publicly released the GPT-2 model, while GPT-3 is offered through OpenAI’s API. The biggest difference between the two is the scale at which they are built: GPT-3 has 175 billion parameters, while GPT-2 has only 1.5 billion. GPT-3 is the newer model and is considerably more capable than GPT-2. For more information, visit OpenAI’s page.

To conclude this article, we saw what sequence models are, how transformers arrived after the other types of sequence architectures, and how and why they proved themselves to be the most powerful among them. We then looked at the transformer's composition and how it works behind the scenes, followed by some of its most popular applications. There are many more examples that show just how useful and reliable transformers are.


