I’m going to explain the transformer encoder to you in a very simple way. If you are having trouble learning transformers, read this blog post all the way through; and if you are interested in working in the NLP field, you should know transformers, since most of the industry uses these state-of-the-art models for a wide range of jobs. Transformers, introduced in the paper “Attention Is All You Need,” are the state-of-the-art models for NLP tasks, surpassing traditional RNNs and LSTMs. Transformers overcome the challenge of capturing long-term dependencies by relying on self-attention rather than recurrence. They have revolutionised NLP and paved the way for architectures like BERT, GPT-3, and T5.
In this article, you will learn how the transformer encoder and the self-attention mechanism work, step by step.
We encountered a significant obstacle while working with RNNs and LSTMs: these recurrent models struggle to capture long-term dependencies and become increasingly expensive to compute on complex data. The paper “Attention Is All You Need” introduced a new design called the transformer to get over these constraints of conventional sequential networks, and transformers are now the most advanced models for a number of NLP applications.
For many NLP tasks, the transformer is currently the state-of-the-art model. Its introduction led to a significant advancement in the field of NLP and prepared the way for cutting-edge systems like BERT, GPT-3, T5, and others.
Let’s understand how the transformer and self-attention work with a language translation task. The transformer consists of an encoder-decoder architecture. We feed the input sentence (source sentence) to the encoder. The encoder learns the representation of the input sentence and sends that representation to the decoder. The decoder receives the representation learned by the encoder as input and generates the output sentence (target sentence).
Let’s say we want to translate a phrase from English to French. We feed the English sentence as input to the encoder, as indicated in the following figure. The encoder learns the representations of the given English sentence and passes those representations to the decoder. The decoder takes the encoder’s representation as input and generates the French sentence as output.
All well, but what precisely is happening here? How do the transformer’s encoder and decoder translate an English sentence (the source sentence) into a French sentence (the target sentence)? What precisely occurs within the encoder and the decoder? To keep this post brief, we will focus only on the encoder network here; we will cover the decoder component in a future article, for sure. Let’s find out in the sections that follow.
The encoder is just a neural network designed to receive an input and transform it into a different representation that a machine can work with. The transformer consists of a stack of N encoders. The output of one encoder is sent as input to the encoder above it. As shown in the following figure, we have a stack of N encoders, and each encoder sends its output to the encoder above it. The final encoder returns the representation of the given source sentence as output. We feed the source sentence as input to the encoder stack and get the representation of the source sentence as output:
The authors of the original paper, Attention Is All You Need, chose N = 6, which means that they stacked six encoders one on top of the other. Nevertheless, we can experiment with other values of N. Let’s keep N = 2 for simplicity and easier understanding.
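If you just want to see this stacking in action before we open up a block, here is a minimal sketch using PyTorch’s built-in encoder modules (assuming PyTorch is installed); the dimensions are illustrative, and the internals of each block are what the rest of this post explains:

```python
import torch
import torch.nn as nn

# One encoder block (multi-head attention + feedforward, explained below)
encoder_block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

# Stack N = 2 identical encoder blocks, as in our simplified setting
encoder_stack = nn.TransformerEncoder(encoder_block, num_layers=2)

x = torch.randn(1, 3, 512)   # (batch, sentence length, embedding dimension)
out = encoder_stack(x)       # representation of the source sentence
print(out.shape)             # torch.Size([1, 3, 512])
```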
Okay, the question is: how exactly does the encoder work? How does it generate the representations for a given source sentence (input sentence)? Let’s see what is inside an encoder.
From the above figure, we can understand that all the encoder blocks are identical. We can also observe that each encoder block consists of two components: a multi-head attention layer and a feedforward network.
Let’s get into the details and learn how exactly these two components work. To understand how multi-head attention works, we first need to understand the self-attention mechanism.
Let’s understand the self-attention mechanism with an example. Consider the following sentence:
I swam across the river to get to the other bank
In example 1 above, if I ask you to tell me the meaning of the word bank here, then in order to answer the question you have to understand the words that surround the word bank.
So, is it:
Bank == financial institution ?
Bank == the ground at the edge of a river ?
By reading the sentence, you can easily say that the word ‘bank’ here means the ground at the edge of a river.
Let’s see another example:
A dog ate the food because it was hungry
How can a machine understand what a word like “it” refers to in a given sentence? This is where the self-attention mechanism helps the machine to understand.
In the given sentence, A dog ate the food because it was hungry, our model will first compute the representation of the word A, next the representation of the word dog, then the representation of the word ate, and so on. While computing the representation of each word, it relates that word to all the other words in the sentence to understand more about it.
For instance, while computing the representation of the word it, our model relates the word it to all the other words in the sentence to understand more about the word it.
In the image below, our model connects the word “it” to every word in the phrase to calculate its representation. By doing so, our model understands that “it” is associated with “dog” and not “food” in the given sentence. The thickness of the line connecting “it” and “dog” is greater, indicating a higher score and a stronger relationship. This enables the machine to make predictions based on the higher score.
All right, but exactly how does this operate? Let’s learn more about the self-attention process in detail now that we have a fundamental understanding of what it is.
Assume I have:
SourceSentence = I am good
Tokenized = [‘I’, ‘am’, ‘good’]
Here, each word’s representation is nothing but its word embedding, obtained from a word embedding model.
From the above input matrix (embedding matrix), we can understand that the first row of the matrix is the embedding of the word I, the second row is the embedding of the word am, and the third row is the embedding of the word good. Thus the dimension of the input matrix is [sentence length x embedding dimension]. The number of words in our sentence (the sentence length) is 3. Let the embedding dimension be 3 for the purpose of this explanation. Then our input matrix (input embedding) has dimension [3, 3]. If you instead take the dimension as 512, as in the original paper, the shape would be [3 x 512]; for ease, we use [3, 3].
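Here is a tiny sketch of such an input matrix in NumPy; the numbers are arbitrary placeholders, not real learned embeddings, and the embedding dimension of 3 is purely for readability:

```python
import numpy as np

# Toy input embedding matrix X for "I am good": one row per word
X = np.array([
    [1.76, 2.22, 0.74],   # embedding of 'I'
    [0.40, 0.11, 0.63],   # embedding of 'am'
    [0.97, 1.90, 0.12],   # embedding of 'good'
])
print(X.shape)   # (3, 3) -> [sentence length x embedding dimension]
```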
We now generate three new matrices from the above matrix X: a query matrix Q, a key matrix K, and a value matrix V. Wait. What exactly are these three matrices? And why do we require them? They are used in the self-attention mechanism. In a moment, we’ll see how these three matrices are employed.
Let me offer an example to help you grasp and visualise self-attention. Suppose I am searching YouTube for good data science tutorials to help me learn data science. Even though the YouTube database is huge, it lets me type in a query and returns results from among all that data. So if I supply the query data science tutorial, that query is scored against the other data sequences (the keys), and whatever is most related to it (whatever has the highest score) is returned.
Let me return to the query, key, and value notions. Now consider how we can generate these three matrices for the self-attention mechanism. To generate them, we introduce three new weight matrices, W[Q], W[K], and W[V]. By multiplying the input matrix X by W[Q], W[K], and W[V], we obtain the query matrix Q, the key matrix K, and the value matrix V.
NOTE: The W[Q], W[K], and W[V] weight matrices are randomly initialised, and their optimal values are learnt during training. As we learn better weights, we obtain more accurate query, key, and value matrices.
As indicated in the diagram below, we multiply the input matrix X by the weight matrices W[Q], W[K], and W[V], yielding the query, key, and value matrices. Note that the values shown in the diagram are arbitrary numbers chosen for illustration, not accurate embeddings.
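Continuing the toy NumPy example from above, here is how the query, key, and value matrices are obtained; the weight matrices below are random stand-ins for W[Q], W[K], and W[V], which would normally be learnt during training:

```python
np.random.seed(0)                  # reproducible random weights
d_model, d_k = 3, 3                # tiny dimensions for illustration

W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = X @ W_Q    # query matrix: one query vector per word
K = X @ W_K    # key matrix:   one key vector per word
V = X @ W_V    # value matrix: one value vector per word
print(Q.shape, K.shape, V.shape)   # (3, 3) (3, 3) (3, 3)
```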
So why did we calculate the query, key, and value matrices? Let’s understand self-attention in 4 steps:
1. Compute the dot product between the query matrix Q and the transpose of the key matrix K; this gives a score for every pair of words in the sentence.
2. Divide the scores by the square root of the key dimension, √d_k, to keep them in a stable range.
3. Apply the softmax function to the scaled scores to turn them into attention weights that sum to 1.
4. Multiply the attention weights by the value matrix V to obtain the final self-attention output Z.
NOTE: The dot product between a query and a key indicates how similar they are. The stronger the relationship between two words, the higher the score.
And what may happen if we don’t undertake this type of scaling?
Without scaling, the magnitudes of the dot products grow with the size of the key vectors: the larger the key dimension, the larger the dot products can get. Very large scores push the softmax into regions where its gradients become extremely small, which can make the optimisation process unstable and model training suffer. Dividing by √d_k keeps the scores in a stable range.
This is how the self-attention mechanism operates in transformer-based encoders.
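Putting the four steps together, here is a minimal single-head self-attention sketch that reuses X, Q, K, V, and d_k from the snippets above:

```python
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.T                   # step 1: similarity between every query and key
scaled = scores / np.sqrt(d_k)     # step 2: scale by the square root of the key dimension
weights = softmax(scaled)          # step 3: softmax turns scores into attention weights
Z = weights @ V                    # step 4: weighted sum of the value vectors
print(Z.shape)                     # (3, 3) -> one attention-refined vector per word
```

Each row of Z is the new representation of the corresponding word, built by blending the value vectors of all the words according to their attention weights.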
So far, we have gained a comprehensive understanding of how the transformer’s encoder and the self-attention mechanism operate. I believe that knowing the architecture of the frameworks you use, and integrating them effectively into NLP tasks, is a crucial part of this line of work. In future articles, we will add sections on the decoder, BERT, large language models, and more. I suggest that you understand an architecture like this before deploying it anywhere, so that you feel more knowledgeable and engaged in data science.
Q. When was the attention mechanism first used?
A. The attention mechanism was first used in 2014 in computer vision, to try and understand what a neural network is looking at while making a prediction. This was one of the first steps towards interpreting the outputs of Convolutional Neural Networks (CNNs).
Q. Why do we use multi-head attention?
A. The idea behind multi-head attention is that instead of using a single attention head, we use several. This makes the attention matrix more accurate, because the model can attend to different parts of the input simultaneously, capturing different types of information and maintaining a richer representation. It also improves the model’s robustness and stability by reducing reliance on a single attention head and aggregating information from multiple perspectives.
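To make this concrete, here is a minimal NumPy sketch of multi-head attention; it is a simplified illustration rather than a production implementation, and the weight matrices and dimensions are assumptions for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split Q, K, V into `num_heads` smaller heads of size d_head
    Q = Q.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = K.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = V.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Each head runs scaled dot-product attention independently
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ V                       # (num_heads, seq_len, d_head)
    # Concatenate the heads and project back to the model dimension
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

d_model, num_heads = 512, 8
X = np.random.randn(3, d_model)                       # toy 3-word sentence
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) * 0.01 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)   # (3, 512)
```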
Q. Can the transformer encoder capture long-range dependencies?
A. Yes, the transformer encoder can capture long-range dependencies effectively. It achieves this through self-attention, which allows each position in the sequence to attend to all other positions, capturing relevant information regardless of distance. The parallel computation and multi-head attention mechanism further enhance the model’s ability to capture diverse relationships.