Our entire life is a course of events, controlled either by us or by our circumstances. Anywhere you look, you can spot countless patterns. Something that occurs in dependence on its predecessor is termed a sequence. For example, in any language, the words of a sentence follow a pre-defined structure to deliver a message. Sentences are nothing but sequences of words.
In deep learning, we encounter sequences very frequently in datasets. The deciding factor in whether to use a sequence model is the presence of highly correlated, order-dependent features in the dataset and what we expect from the model as an outcome. We use sequence models to learn the order of a sequence and how its elements interact with one another over time. This helps forecast how the sequence will continue. Before the invention of transformers, the most popular architectures for such models were Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Gated Recurrent Units (GRUs).
Despite their strong performance and wide adoption across the industry, these models faced a bottleneck: they were unable to learn long-distance dependencies within sequences and thus struggled to maintain appropriate context. With this problem and a high demand for a solution, researchers came up with a concept: attention. In transformers, we work very closely with self-attention. To read about these concepts in detail, check out the paper Attention Is All You Need.
The transformer is a network architecture proposed in the same paper listed above, consisting of an encoder and a decoder. This architecture managed to significantly improve the state of the art in Neural Machine Translation (translating text from one language to another) using only attention mechanisms along with some dense and normalization layers. Additionally, transformers are much faster to train and simpler to parallelize. These two properties reduce the overall computational investment and have proven to be very powerful yet efficient.
Let us get into a little detail about how exactly these work behind the scenes. First and foremost, they use attention, specifically self-attention, as their core concept. In simple words, self-attention can be imagined as 'everything in a sequence looking at everything else, including itself'. There are mathematical equations involved in getting deep into the topic, but I will refrain from doing so in this introductory article. For more clarity, think of self-attention as taking the embedding vector of one word in the input sequence and performing computations against itself and the embedding vectors of every other word in the sequence, in order to establish interrelations among them. These interrelations then pass through a small perceptron model to introduce nonlinearity into the learned parameters. The word embedding vectors are also combined with positional embeddings, which are calculated to preserve the order in which the words appear in the input sequence.
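The idea above can be sketched in a few lines of NumPy. This is a minimal, hedged illustration of scaled dot-product self-attention only: the projection matrices here are random stand-ins for learned weights, the sizes are made up for the example, and the perceptron (feed-forward) stage and positional embeddings are omitted.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) word embeddings.
    W_q, W_k, W_v: (d_model, d_k) projections (learned in a real model).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Every token's query is compared against every token's key:
    # this is the 'looking within itself' step.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns each row of scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mix of all value vectors.
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8          # toy sizes for illustration
X = rng.normal(size=(seq_len, d_model))  # stand-in word embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one enriched vector per input token
```

Note that the output keeps one vector per token, but each vector now carries information mixed in from every other position in the sequence.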
The left side of the diagram is the encoder, while the right side is the decoder. Both components take sequences in the form of embeddings as input and produce the necessary output. The encoder takes a batch of sentences represented as sequences of word tokens (numerical IDs assigned to the words of the input corpus) and encodes each word into a 512-dimensional embedding vector. Hence, the input to the encoder has shape [batch_size, max input sentence length], which after encoding becomes [batch_size, max input sentence length, 512]. The decoder, on the other hand, takes the target sentence (also represented as a sequence of word tokens) during training, as well as the outputs of the encoder, as its inputs. Please keep in mind that the top part of the decoder is stacked 'N' times, so the final output from the encoder is fed to the decoder at each of these 'N' levels.
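The shape bookkeeping above can be checked concretely. In this sketch the batch size, sentence length, and vocabulary size are arbitrary illustrative values, and the embedding table is random rather than learned; only the 512-dimensional embedding size comes from the article.

```python
import numpy as np

# Hypothetical sizes chosen for illustration only; d_model = 512 as in the text.
batch_size, max_len, vocab_size, d_model = 2, 10, 1000, 512

rng = np.random.default_rng(42)
# A batch of tokenized sentences: integer word IDs, padded to max_len.
tokens = rng.integers(0, vocab_size, size=(batch_size, max_len))
print(tokens.shape)    # (2, 10) -> [batch_size, max input sentence length]

# An (untrained) embedding table mapping each token ID to a 512-d vector.
embedding_table = rng.normal(size=(vocab_size, d_model))
embedded = embedding_table[tokens]  # integer-array indexing does the lookup
print(embedded.shape)  # (2, 10, 512) -> one 512-d vector per word
```

The real encoder then passes this [batch_size, max length, 512] tensor through its stacked attention and feed-forward layers without changing its shape.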
Until now, we have understood the core functioning of the transformer as well as how its parts connect. But a few components that are visible in the architecture and that we haven't touched upon are worth a brief introduction before wrapping up the article.
Most of the state-of-the-art models currently in use are built on transformers: speech-to-text conversion, machine translation, text generation, paraphrasing, question answering, sentiment analysis, and many more. Some of the best and most famous models in the field come from these areas.
To conclude this article, we saw what sequential models are, how transformers were developed after the other types of sequential architectures, and how and why they proved to be the most powerful among them all. We then delved into the transformer's composition and its workings behind the scenes, followed by some of its most popular applications. Beyond these, there are many more examples that show just how useful and reliable transformers are.