Introduction
Understanding BERT
Implementing BERT in TensorFlow
Conclusion
References
As you can see, this will be a fairly long post, but if you wish you can skip the BERT training section.
With recent developments in NLP and NLU, algorithms are getting better at understanding how humans communicate. One such model is BERT, which has captured the imagination of data scientists and led to the exploration of all sorts of new practices and applications at state-of-the-art accuracy.
Today BERT powers almost every query asked on Google Search, and Google has open-sourced it, so developers can use it for different tasks such as:
NSP – The model is given an input sentence and its job is to predict the sentence or word that follows (e.g., the text suggestions in Gmail).
Q&A – Just as we humans do, the model is given a question as input and its job is to predict the answer (Google Search).
Text Generation – Given enough training data, the model can generate text.
Translation – Given an input sentence in one language, the task is to predict its translation in the desired language (Google Translate), and much more (almost every kind of NLP and NLU task).
Note: these are tasks BERT can be used for, not things it does out of the box.
Before diving into the code, let's understand what BERT really is, how it differs from previous models designed for the same tasks, how it was trained, and what variations are available.
BERT is short for Bidirectional Encoder Representations from Transformers. It is a language model and, as the name suggests, there are three important ideas to focus on; let's understand them one by one:
Bi-directionality means the model can see the input from both directions (and even from the middle), which was not possible in earlier models, since they were uni-directional (left-to-right or right-to-left). Because those models could never see the entire sentence, they produced biased representations of words (encodings).
The main purpose of BERT is to generate encodings/embeddings in such a way that they are not biased, so they can later be used for the desired downstream task. Thanks to this kind of embedding, the model learns the general patterns of how humans talk and generates responses/outputs based on that, free of the earlier bias.
To do this, the BERT model consists of multiple encoder layers whose job is to preprocess the inputs and then encode them into embeddings that can later be used by another model. Refer to Figure 1:
BERT is from the transformer family but differs in that it has only encoder blocks, using just attention and feed-forward layers to generate the embeddings. Its hyperparameters also differ from the original transformer model (typically 12–16 attention heads).
Overall, it is a language model that can be integrated with many models and tasks and gives very good results.
Thanks to all of the above (more specifically, the attention heads and the bi-directionality), BERT can capture the meaning/semantics of a word in different contexts, something that was not possible with earlier, commonly used models like Word2Vec. To understand this, consider two sentences:
1. They were playing fair and square – here 'fair' means in accordance with the rules.
2. There was a fair organized nearby – here 'fair' means an event.
An algorithm like Word2Vec, however, will output the same embedding in both cases; it does not understand the context of the sentence, because it processes the input one word at a time (or strictly left-to-right / right-to-left) and never sees the entire sentence.
BERT, on the other hand, takes the input in parallel, so it can see the entire sentence, and its attention heads let it focus on the words it considers most important for the word being predicted. This allows for more generalized embeddings and a different representation of the same word in different contexts.
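Once the TensorFlow Hub preprocessing and encoder models are loaded (we do this in the implementation section below, as bert_preprocess_model and bert_encoder_model), a rough way to verify this behaviour is to compare the vectors BERT assigns to "fair" in each sentence. This is only a sketch: it assumes "fair" stays a single WordPiece token and lands at position 4 in both sentences (position 0 is [CLS]); in practice you would locate the token via the tokenizer.

import tensorflow as tf

sentences = ["They were playing fair and square",
             "There was a fair organized nearby"]
outputs = bert_encoder_model(bert_preprocess_model(sentences))

# Token-level (contextual) embeddings live in sequence_output
fair_by_the_rules = outputs["sequence_output"][0, 4]   # "fair" = according to rules
fair_the_event = outputs["sequence_output"][1, 4]      # "fair" = an event

# Cosine similarity between the two vectors; a static model like Word2Vec
# would give exactly 1.0 here, while BERT gives a value below 1.0
cosine = tf.reduce_sum(fair_by_the_rules * fair_the_event) / (
    tf.norm(fair_by_the_rules) * tf.norm(fair_the_event))
print(cosine.numpy())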
BERT's wide range of uses comes from its training procedure, which involves two steps: pre-training and fine-tuning. Let's look at each of them in more detail:
PRE-TRAINING: The model is trained on two self-supervised tasks, namely Masked Language Modelling (MLM) and Next Sentence Prediction (NSP).
MLM: In this task, some of the input tokens are masked (blanked out) with a [MASK] token and the model's job is to predict the word that belongs in each masked position; the final hidden vector for that position is fed to a softmax over the vocabulary, which outputs the predicted word.
However, during fine-tuning the [MASK] token never appears, which creates a mismatch that hurts generalization. So a slightly different procedure is applied: only 15% of the tokens are selected, and not all of them are replaced with the [MASK] token. Out of that 15%:
80% are replaced with the [MASK] token,
10% are replaced with a random token,
10% are left unchanged.
This allows the model to use its bi-directional learning ability, leading to better representations of words.
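To make the masking rule concrete, here is a minimal, purely illustrative sketch of the 15% / 80-10-10 procedure described above, written over word strings instead of WordPiece ids:

import random

def mask_tokens(tokens, mask_rate=0.15):
    masked = list(tokens)
    for i in range(len(masked)):
        if random.random() < mask_rate:       # select ~15% of the positions
            dice = random.random()
            if dice < 0.8:                    # 80% of those: replace with [MASK]
                masked[i] = "[MASK]"
            elif dice < 0.9:                  # 10%: replace with a random token
                masked[i] = random.choice(tokens)
            # remaining 10%: keep the original token unchanged
    return masked

print(mask_tokens("my dog is hairy and loves to play".split()))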
NSP: In this task, two sentences (A and B) are taken such that:
50% of the time B is the sentence that actually follows A (labelled IsNext),
50% of the time B is a random sentence from the corpus (labelled NotNext),
and the model's job is to learn the relationship/semantic link between the two sentences by performing a binary classification (is next / not next).
(Figure source: the BERT paper)
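As a rough sketch (not the exact pipeline from the paper), NSP training pairs could be built from a list of consecutive sentences like this:

import random

def make_nsp_pairs(sentences):
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            # 50% of the time B really follows A -> label 1 (IsNext)
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # 50% of the time B is some other sentence -> label 0 (NotNext)
            other = random.choice([s for j, s in enumerate(sentences) if j != i + 1])
            pairs.append((sentences[i], other, 0))
    return pairs

corpus = ["The man went to the store.",
          "He bought a gallon of milk.",
          "Penguins are flightless birds."]
print(make_nsp_pairs(corpus))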
Doing the above unlocks the model's relationship-capturing abilities and produces contextualized word embeddings. Having covered the training, let's look at the final aspects of the model.
FINE-TUNING: After pre-training, the model can be fine-tuned on a variety of tasks, such as sentence-pair classification, question answering, and named-entity recognition.
BERT is available in two variants:
BERT_Large – 24 encoder layers, 1024 hidden units, 16 attention heads, ~340M parameters (the variant that achieved the SOTA results)
BERT_Base – 12 encoder layers, 768 hidden units, 12 attention heads, ~110M parameters (the smaller model)
For most use cases, BERT_Base is sufficient and gives good results after fine-tuning.
Enough words! Let's get our hands dirty with some code. For understanding purposes, we will create word embeddings using the model and explore the different aspects of the preprocessing and encoder layers described earlier in a practical way.
As you saw in the diagram above, the pipeline has two parts, preprocessing and encoding, and luckily TensorFlow Hub allows us to use both parts of the model directly.
Head to the TensorFlow Hub website – a place where the pre-trained models can be accessed and used.
Scroll down and you will see the available BERT models:
Open the first model and the preprocessing file for it, one by one, and copy their URLs, as these can be used to download the required files directly. To do so, open a Python file and write the following:
preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
Here we are simply storing those URLs as strings in preprocess_url and encoder_url for later reference.
Using these, we can now create the preprocessor and encoder objects.
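A minimal way to do this, assuming the tensorflow_hub and tensorflow_text packages are installed and using the variable names the rest of this article relies on, is:

import tensorflow_hub as hub
import tensorflow_text  # registers the custom ops the preprocessing model needs

# Wrap the saved TF Hub models as Keras layers
bert_preprocess_model = hub.KerasLayer(preprocess_url)
bert_encoder_model = hub.KerasLayer(encoder_url)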
We just used hub.KerasLayer() to wrap the saved models as Keras layers and stored them in bert_preprocess_model and bert_encoder_model.
Creating Data
To create the data, we can simply make a Python list of test sentences:
test_text = ['I love pizza', 'India is my country', 'Italy is fun place to visit']
Here India and Italy refer to the same kind of thing (countries), while Italy and pizza share somewhat the same context. Our job here is to feed the model this related ('biased') data and see how BERT performs. So let's move on to the next part.
Preprocessing Input
As mentioned earlier, the preprocessor object is callable, so it can take in test_text and return the text_preprocessed dictionary:
text_preprocessed = bert_preprocess_model(test_text)
print(text_preprocessed.keys())
>> dict_keys(['input_mask', 'input_word_ids', 'input_type_ids'])
As can be seen, the inputs have been processed into the form the encoder expects: a mask, token ids, and type ids for every word of every input sentence. More on these keys is given in the sections below.
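As a quick sanity check (the shapes below assume the default sequence length of 128 used by this preprocessing model), each key maps to a tensor with one row per input sentence:

for key, value in text_preprocessed.items():
    print(key, value.shape)
# e.g. input_word_ids (3, 128), input_mask (3, 128), input_type_ids (3, 128)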
Encoding
To generate the embeddings, all we now need to do is pass this text_preprocessed dictionary to the encoder object we created earlier (bert_encoder_model):
bert_results = bert_encoder_model(text_preprocessed)
print(bert_results.keys())
>> dict_keys(['default', 'sequence_output', 'pooled_output', 'encoder_outputs'])
As can be seen, the output is again a dictionary, so understanding the keys from both the preprocessing and the encoding stage becomes crucial.
After preprocessing, three keys are generated, namely input_mask, input_type_ids, and input_word_ids. Let's understand them one by one:
input_mask
Let’s visualize the input mask used:
# input_mask
print(text_preprocessed['input_mask'])
Returns a (3, 128) tensor of ones and zeros: 1 marks a real token (the [CLS] token, the sentence's tokens, and [SEP]), while 0 marks the padding positions.
input_type_ids
To see the input_type_ids use:
# input_type_ids
print(text_preprocessed['input_type_ids'])
Returns a (3, 128) tensor of segment ids telling the model which sentence each token belongs to; since every example here is a single sentence, all values are 0.
input_word_ids
To visualize the input_word_ids use:
# input_word_ids
print(text_preprocessed['input_word_ids'])
Returns a (3, 128) tensor holding the WordPiece vocabulary id of every token, with the ids of [CLS] and [SEP] placed around each sentence and zeros used as padding.
After encoding, the dictionary contains pooled_output, sequence_output, and encoder_outputs (along with a default key). Let's understand the three main ones one by one.
pooled_output
To see the pooled_output:
print(bert_results['pooled_output'])
Returns a (3, 768) tensor: one 768-dimensional embedding per input sentence, representing each sentence as a whole.
sequence_output
Includes a contextual embedding for every token position of every input sentence (including the [CLS], [SEP], and padding positions).
So the shape of the sequence_output tensor becomes:
(number of sentences, sentence length – 128, number of hidden units – 768)
sequence_output can be accessed as:
print(bert_results['sequence_output'])
Returns a (3, 128, 768) tensor of token-level contextual embeddings.
encoder_outputs
encoder_outputs holds the intermediate output of each of the 12 encoder layers, and the output of the last layer is exactly the sequence_output. To verify this, run the following code:
bert_results['encoder_outputs'][-1] == bert_results['sequence_output']
Here we are simply checking whether the last element ([-1]) of encoder_outputs is the same as sequence_output using the == operator, which returns a boolean tensor:
Clearly, every entry comes back True.
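Finally, to give a flavour of the fine-tuning step discussed earlier, here is a minimal, hypothetical sketch that plugs the same two TF Hub URLs into a Keras model and adds a binary classification head on top of pooled_output (the training data, train_texts and train_labels, is assumed and not shown):

import tensorflow as tf
import tensorflow_hub as hub

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
preprocessed = hub.KerasLayer(preprocess_url, name="preprocessing")(text_input)
bert_outputs = hub.KerasLayer(encoder_url, trainable=True, name="BERT")(preprocessed)

# Classification head on top of the whole-sentence embedding
x = tf.keras.layers.Dropout(0.1)(bert_outputs["pooled_output"])
output = tf.keras.layers.Dense(1, activation="sigmoid", name="classifier")(x)

model = tf.keras.Model(text_input, output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_texts, train_labels, epochs=2)   # hypothetical training data

Setting trainable=True on the encoder layer is what turns this into fine-tuning rather than plain feature extraction.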
With this, we have covered all the basics required to implement BERT. I hope you found this useful; if so, kindly share the article and leave feedback, as that is what inspires me to write these pieces. For the more knowledge-hungry readers, I have provided some useful references below where you can gain in-depth knowledge of the topic.
Colab file: for all the code, refer here.
BERT details (in-depth): BERT (Bidirectional Encoder Representations from Transformers); for a visual explanation, refer here.
Inspiration: a humble and respectful thanks to codebasics, which inspired me to write this content.