I’m going to explain the transformer encoder to you in a very simple way. If you are having trouble learning transformers, read this blog post all the way through; and if you are interested in working in the NLP field, you should know transformers, since most of the industry uses these state-of-the-art models for a wide range of jobs. Transformers, introduced in the paper “Attention Is All You Need,” are the state-of-the-art models for NLP tasks, surpassing traditional RNNs and LSTMs. Transformers overcome the challenge of capturing long-term dependencies by relying on self-attention rather than recurrence. They have revolutionised NLP and paved the way for architectures like BERT, GPT-3, and T5.
In this article, you will learn how the transformer encoder and the self-attention mechanism work, step by step.
We encountered a significant obstacle while working with RNNs and LSTMs: these recurrent models struggle to capture long-term dependencies and become increasingly expensive to compute on complex data. The paper “Attention Is All You Need” introduced a new design called the transformer to get over these constraints of conventional sequential networks, and transformers are now the most advanced models for a number of NLP applications.
For many NLP tasks, the transformer is currently the state-of-the-art model. Its introduction led to a significant advancement in the field of NLP and prepared the way for cutting-edge systems like BERT, GPT-3, T5, and others.
Let’s understand how the transformer and self-attention work with a language translation task. The transformer consists of an encoder-decoder architecture. We feed the input sentence (source sentence) to the encoder. The encoder learns the representation of the input sentence and sends that representation to the decoder. The decoder receives the representation learned by the encoder as input and generates the output sentence (target sentence).
Let’s say we want to translate a phrase from English to French. We feed the English sentence as input to the encoder, as indicated in the following figure. The encoder learns the representations of the given English sentence and passes those representations to the decoder. The decoder takes the encoder’s representation as input and generates the French sentence as output.
All well, but what precisely is happening here? How do the transformer’s encoder and decoder translate an English sentence (the source sentence) into a French sentence (the target sentence)? What precisely occurs within the encoder and the decoder? To keep this post brief, we will focus only on the encoder network here; we will cover the decoder component in a future article, for sure. Let’s find out in the sections that follow.
The encoder is just a neural network designed to receive an input and transform it into a different representation that a machine can work with. The transformer consists of a stack of N encoders. The output of one encoder is sent as input to the encoder above it. As shown in the following figure, we have a stack of N encoders, and each encoder sends its output to the encoder above it. The final encoder returns the representation of the given source sentence as output. We feed the source sentence as input to the encoder stack and get the representation of the source sentence as output:
The authors of the original paper, Attention Is All You Need, chose N = 6, which means that they stacked six encoders one on top of the other. Nevertheless, we can experiment with other values of N. Let’s keep N = 2 for simplicity and easier understanding.
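If you just want to see this stacking in action before we open up a block, here is a minimal sketch using PyTorch’s built-in encoder modules (assuming PyTorch is installed); the dimensions are illustrative, and the internals of each block are what the rest of this post explains:

```python
import torch
import torch.nn as nn

# One encoder block (multi-head attention + feedforward, explained below)
encoder_block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

# Stack N = 2 identical encoder blocks, as in our simplified setting
encoder_stack = nn.TransformerEncoder(encoder_block, num_layers=2)

x = torch.randn(1, 3, 512)   # (batch, sentence length, embedding dimension)
out = encoder_stack(x)       # representation of the source sentence
print(out.shape)             # torch.Size([1, 3, 512])
```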
Okay, the question is: how exactly does the encoder work? How does it generate the representations for a given source sentence (input sentence)? Let’s see what is inside an encoder.
From the above figure, we can understand that all the encoder blocks are identical. We can also observe that each encoder block consists of two components: a multi-head attention layer and a feedforward network.
Let’s get into the details and learn how exactly these two components work. To understand how multi-head attention works, we first need to understand the self-attention mechanism.
Let’s understand the self-attention mechanism with an example. Consider the following sentence:
I swam across the river to get to the other bank
In example 1 above, if I ask you to tell me the meaning of the word bank here, then in order to answer the question you have to understand the words that surround the word bank.
So, is it:
Bank == financial institution ?
Bank == the ground at the edge of a river ?
By reading the sentence, you can easily say that the word ‘bank’ here means the ground at the edge of a river.
Let’s see another example:
A dog ate the food because it was hungry
How can a machine understand what a word like “it” refers to in a given sentence? This is where the self-attention mechanism helps the machine to understand.
In the given sentence, A dog ate the food because it was hungry, our model will first compute the representation of the word A, next the representation of the word dog, then the representation of the word ate, and so on. While computing the representation of each word, it relates that word to all the other words in the sentence to understand more about it.
For instance, while computing the representation of the word it, our model relates the word it to all the other words in the sentence to understand more about the word it.
In the image below, our model connects the word “it” to every word in the phrase to calculate its representation. By doing so, our model understands that “it” is associated with “dog” and not “food” in the given sentence. The thickness of the line connecting “it” and “dog” is greater, indicating a higher score and a stronger relationship. This enables the machine to make predictions based on the higher score.
All right, but exactly how does this operate? Let’s learn more about the self-attention process in detail now that we have a fundamental understanding of what it is.
Assume I have:
SourceSentence = I am good
Tokenized = [‘I’, ‘am’, ‘good’]
Here, each word’s representation is nothing but its word embedding, obtained from a word embedding model.
From the above input matrix (embedding matrix), we can understand that the first row of the matrix is the embedding of the word I, the second row is the embedding of the word am, and the third row is the embedding of the word good. Thus the dimension of the input matrix is [sentence length x embedding dimension]. The number of words in our sentence (the sentence length) is 3. Let the embedding dimension be 3 for the purpose of this explanation. Then our input matrix (input embedding) has dimension [3, 3]. If you instead take the dimension as 512, as in the original paper, the shape would be [3 x 512]; for ease, we use [3, 3].
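Here is a tiny sketch of such an input matrix in NumPy; the numbers are arbitrary placeholders, not real learned embeddings, and the embedding dimension of 3 is purely for readability:

```python
import numpy as np

# Toy input embedding matrix X for "I am good": one row per word
X = np.array([
    [1.76, 2.22, 0.74],   # embedding of 'I'
    [0.40, 0.11, 0.63],   # embedding of 'am'
    [0.97, 1.90, 0.12],   # embedding of 'good'
])
print(X.shape)   # (3, 3) -> [sentence length x embedding dimension]
```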
We now generate three new matrices from the above matrix X: a query matrix Q, a key matrix K, and a value matrix V. Wait. What exactly are these three matrices? And why do we require them? They are used in the self-attention mechanism. In a moment, we’ll see how these three matrices are employed.
Let me offer an example to help you grasp and visualise self-attention. Suppose I am searching YouTube for good data science tutorials to help me learn data science. Even though the YouTube database is huge, it lets me type in a query and returns results from among all that data. So if I supply the query data science tutorial, that query is scored against the other data sequences (the keys), and whatever is most related to it (whatever has the highest score) is returned.
Let me return to the query, key, and value notions. Now consider how we can generate these three matrices for the self-attention mechanism. To generate them, we introduce three new weight matrices, W[Q], W[K], and W[V]. By multiplying the input matrix X by W[Q], W[K], and W[V], we obtain the query matrix Q, the key matrix K, and the value matrix V.
NOTE: The W[Q], W[K], and W[V] weight matrices are randomly initialised, and their optimal values are learnt during training. As we learn better weights, we obtain more accurate query, key, and value matrices.
As indicated in the diagram below, we multiply the input matrix X by the weight matrices W[Q], W[K], and W[V], yielding the query, key, and value matrices. Note that the values shown in the diagram are arbitrary numbers chosen for illustration, not accurate embeddings.
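Continuing the toy NumPy example from above, here is how the query, key, and value matrices are obtained; the weight matrices below are random stand-ins for W[Q], W[K], and W[V], which would normally be learnt during training:

```python
np.random.seed(0)                  # reproducible random weights
d_model, d_k = 3, 3                # tiny dimensions for illustration

W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = X @ W_Q    # query matrix: one query vector per word
K = X @ W_K    # key matrix:   one key vector per word
V = X @ W_V    # value matrix: one value vector per word
print(Q.shape, K.shape, V.shape)   # (3, 3) (3, 3) (3, 3)
```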
So why did we calculate the query, key, and value matrices? Let’s understand self-attention in 4 steps:
1. Compute the dot product between the query matrix Q and the transpose of the key matrix K; this gives a score for every pair of words in the sentence.
2. Divide the scores by the square root of the key dimension, √d_k, to keep them in a stable range.
3. Apply the softmax function to the scaled scores to turn them into attention weights that sum to 1.
4. Multiply the attention weights by the value matrix V to obtain the final self-attention output Z.
NOTE: The dot product between a query and a key indicates how similar they are. The stronger the relationship between two words, the higher the score.
And what may happen if we don’t undertake this type of scaling?
Without scaling, the magnitudes of the dot products grow with the size of the key vectors: the larger the key dimension, the larger the dot products can get. Very large scores push the softmax into regions where its gradients become extremely small, which can make the optimisation process unstable and model training suffer. Dividing by √d_k keeps the scores in a stable range.
This is how the self-attention mechanism operates in transformer-based encoders.
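Putting the four steps together, here is a minimal single-head self-attention sketch that reuses X, Q, K, V, and d_k from the snippets above:

```python
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.T                   # step 1: similarity between every query and key
scaled = scores / np.sqrt(d_k)     # step 2: scale by the square root of the key dimension
weights = softmax(scaled)          # step 3: softmax turns scores into attention weights
Z = weights @ V                    # step 4: weighted sum of the value vectors
print(Z.shape)                     # (3, 3) -> one attention-refined vector per word
```

Each row of Z is the new representation of the corresponding word, built by blending the value vectors of all the words according to their attention weights.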
So far, we have gained a comprehensive understanding of how the transformer’s encoder and the self-attention mechanism operate. I believe that knowing the architecture of the frameworks you use, and integrating them effectively into NLP tasks, is a crucial part of this line of work. In future articles, we will add sections on the decoder, BERT, large language models, and more. I suggest that you understand an architecture like this before deploying it anywhere, so that you feel more knowledgeable and engaged in data science.
Q. When was the attention mechanism first used?
A. The attention mechanism was first used in 2014 in computer vision, to try and understand what a neural network is looking at while making a prediction. This was one of the first steps towards interpreting the outputs of Convolutional Neural Networks (CNNs).
Q. Why do we use multi-head attention?
A. The idea behind multi-head attention is that instead of using a single attention head, we use several. This makes the attention matrix more accurate, because the model can attend to different parts of the input simultaneously, capturing different types of information and maintaining a richer representation. It also improves the model’s robustness and stability by reducing reliance on a single attention head and aggregating information from multiple perspectives.
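To make this concrete, here is a minimal NumPy sketch of multi-head attention; it is a simplified illustration rather than a production implementation, and the weight matrices and dimensions are assumptions for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split Q, K, V into `num_heads` smaller heads of size d_head
    Q = Q.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = K.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = V.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Each head runs scaled dot-product attention independently
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ V                       # (num_heads, seq_len, d_head)
    # Concatenate the heads and project back to the model dimension
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

d_model, num_heads = 512, 8
X = np.random.randn(3, d_model)                       # toy 3-word sentence
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) * 0.01 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)   # (3, 512)
```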
Q. Can the transformer encoder capture long-range dependencies?
A. Yes, the transformer encoder can capture long-range dependencies effectively. It achieves this through self-attention, which allows each position in the sequence to attend to all other positions, capturing relevant information regardless of distance. The parallel computation and multi-head attention mechanism further enhance the model’s ability to capture diverse relationships.