Introduction
Understanding BERT
Implementing BERT in TensorFlow
Conclusion
References
As you can see, this will be a fairly long post, but if you wish you can skip the BERT training section.
With recent developments in NLP and NLU, algorithms are getting better at understanding how humans communicate. One such model is BERT, which has captured the imagination of data scientists and led to the exploration of all sorts of new practices and applications at state-of-the-art accuracy.
Today BERT powers almost every query asked on Google Search, and Google has open-sourced it, so developers can use it for different tasks such as:
NSP – The model is given an input sentence and its job is to predict the sentence or word that follows (e.g., the text suggestions in Gmail).
Q&A – Just as we humans do, the model is given a question as input and its job is to predict the answer (Google Search).
Text Generation – Given enough training data, the model can generate text.
Translation – Given an input sentence in one language, the task is to predict its translation in the desired language (Google Translate), and much more (almost every kind of NLP and NLU task).
Note: these are tasks BERT can be used for, not things it does out of the box.
Before diving into the code, let's understand what BERT really is, how it differs from previous models designed for the same tasks, how it was trained, and what variations are available.
BERT is short for Bidirectional Encoder Representations from Transformers. It is a language model and, as the name suggests, there are three important ideas to focus on; let's understand them one by one:
Bi-directionality means the model can see the input from both directions (and even from the middle), which was not possible in earlier models, since they were uni-directional (left-to-right or right-to-left). Because those models could never see the entire sentence, they produced biased representations of words (encodings).
The main purpose of BERT is to generate encodings/embeddings in such a way that they are not biased, so they can later be used for the desired downstream task. Thanks to this kind of embedding, the model learns the general patterns of how humans talk and generates responses/outputs based on that, free of the earlier bias.
To do this, the BERT model consists of multiple encoder layers whose job is to preprocess the inputs and then encode them into embeddings that can later be used by another model. Refer to Figure 1:
BERT is from the transformer family but differs in that it has only encoder blocks, using just attention and feed-forward layers to generate the embeddings. Its hyperparameters also differ from the original transformer model (typically 12–16 attention heads).
Overall, it is a language model that can be integrated with many models and tasks and gives very good results.
Thanks to all of the above (more specifically, the attention heads and the bi-directionality), BERT can capture the meaning/semantics of a word in different contexts, something that was not possible with earlier, commonly used models like Word2Vec. To understand this, consider two sentences:
1. They were playing fair and square – here 'fair' means in accordance with the rules.
2. There was a fair organized nearby – here 'fair' means an event.
An algorithm like Word2Vec, however, will output the same embedding in both cases; it does not understand the context of the sentence, because it processes the input one word at a time (or strictly left-to-right / right-to-left) and never sees the entire sentence.
BERT, on the other hand, takes the input in parallel, so it can see the entire sentence, and its attention heads let it focus on the words it considers most important for the word being predicted. This allows for more generalized embeddings and a different representation of the same word in different contexts.
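Once the TensorFlow Hub preprocessing and encoder models are loaded (we do this in the implementation section below, as bert_preprocess_model and bert_encoder_model), a rough way to verify this behaviour is to compare the vectors BERT assigns to "fair" in each sentence. This is only a sketch: it assumes "fair" stays a single WordPiece token and lands at position 4 in both sentences (position 0 is [CLS]); in practice you would locate the token via the tokenizer.

import tensorflow as tf

sentences = ["They were playing fair and square",
             "There was a fair organized nearby"]
outputs = bert_encoder_model(bert_preprocess_model(sentences))

# Token-level (contextual) embeddings live in sequence_output
fair_by_the_rules = outputs["sequence_output"][0, 4]   # "fair" = according to rules
fair_the_event = outputs["sequence_output"][1, 4]      # "fair" = an event

# Cosine similarity between the two vectors; a static model like Word2Vec
# would give exactly 1.0 here, while BERT gives a value below 1.0
cosine = tf.reduce_sum(fair_by_the_rules * fair_the_event) / (
    tf.norm(fair_by_the_rules) * tf.norm(fair_the_event))
print(cosine.numpy())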
BERT's wide range of uses comes from its training procedure, which involves two steps: pre-training and fine-tuning. Let's look at each of them in more detail:
PRE-TRAINING: The model is trained on two self-supervised tasks, namely Masked Language Modelling (MLM) and Next Sentence Prediction (NSP).
MLM: In this task, some of the input tokens are masked (blanked out) with a [MASK] token and the model's job is to predict the word that belongs in each masked position; the final hidden vector for that position is fed to a softmax over the vocabulary, which outputs the predicted word.
However, during fine-tuning the [MASK] token never appears, which creates a mismatch that hurts generalization. So a slightly different procedure is applied: only 15% of the tokens are selected, and not all of them are replaced with the [MASK] token. Out of that 15%:
80% are replaced with the [MASK] token,
10% are replaced with a random token,
10% are left unchanged.
This allows the model to use its bi-directional learning ability, leading to better representations of words.
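To make the masking rule concrete, here is a minimal, purely illustrative sketch of the 15% / 80-10-10 procedure described above, written over word strings instead of WordPiece ids:

import random

def mask_tokens(tokens, mask_rate=0.15):
    masked = list(tokens)
    for i in range(len(masked)):
        if random.random() < mask_rate:       # select ~15% of the positions
            dice = random.random()
            if dice < 0.8:                    # 80% of those: replace with [MASK]
                masked[i] = "[MASK]"
            elif dice < 0.9:                  # 10%: replace with a random token
                masked[i] = random.choice(tokens)
            # remaining 10%: keep the original token unchanged
    return masked

print(mask_tokens("my dog is hairy and loves to play".split()))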
NSP: In this task, two sentences (A and B) are taken such that:
50% of the time B is the sentence that actually follows A (labelled IsNext),
50% of the time B is a random sentence from the corpus (labelled NotNext),
and the model's job is to learn the relationship/semantic link between the two sentences by performing a binary classification (is next / not next).
(Figure source: the BERT paper)
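As a rough sketch (not the exact pipeline from the paper), NSP training pairs could be built from a list of consecutive sentences like this:

import random

def make_nsp_pairs(sentences):
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            # 50% of the time B really follows A -> label 1 (IsNext)
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # 50% of the time B is some other sentence -> label 0 (NotNext)
            other = random.choice([s for j, s in enumerate(sentences) if j != i + 1])
            pairs.append((sentences[i], other, 0))
    return pairs

corpus = ["The man went to the store.",
          "He bought a gallon of milk.",
          "Penguins are flightless birds."]
print(make_nsp_pairs(corpus))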
Doing the above unlocks the model's relationship-capturing abilities and produces contextualized word embeddings. Having covered the training, let's look at the final aspects of the model.
FINE-TUNING: After pre-training, the model can be fine-tuned on a variety of tasks, such as sentence-pair classification, question answering, and named-entity recognition.
BERT is available in two variants:
BERT_Large – 24 encoder layers, 1024 hidden units, 16 attention heads, ~340M parameters (the variant that achieved the SOTA results)
BERT_Base – 12 encoder layers, 768 hidden units, 12 attention heads, ~110M parameters (the smaller model)
For most use cases, BERT_Base is sufficient and gives good results after fine-tuning.
Enough words! Let's get our hands dirty with some code. For understanding purposes, we will create word embeddings using the model and explore the different aspects of the preprocessing and encoder layers described earlier in a practical way.
As you saw in the diagram above, the pipeline has two parts, preprocessing and encoding, and luckily TensorFlow Hub allows us to use both parts of the model directly.
Head to the TensorFlow Hub website – a place where the pre-trained models can be accessed and used.
Scroll down and you will see the available BERT models:
Open the first model and the preprocessing file for it, one by one, and copy their URLs, as these can be used to download the required files directly. To do so, open a Python file and write the following:
preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
Here we are simply storing those URLs as strings in preprocess_url and encoder_url for later reference.
Using these, we can now create the preprocessor and encoder objects.
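A minimal way to do this, assuming the tensorflow_hub and tensorflow_text packages are installed and using the variable names the rest of this article relies on, is:

import tensorflow_hub as hub
import tensorflow_text  # registers the custom ops the preprocessing model needs

# Wrap the saved TF Hub models as Keras layers
bert_preprocess_model = hub.KerasLayer(preprocess_url)
bert_encoder_model = hub.KerasLayer(encoder_url)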
We just used hub.KerasLayer() to wrap the saved models as Keras layers and stored them in bert_preprocess_model and bert_encoder_model.
Creating Data
To create the data, we can simply make a Python list of test sentences:
test_text = ['I love pizza', 'India is my country', 'Italy is fun place to visit']
Here India and Italy refer to the same kind of thing (countries), while Italy and pizza share somewhat the same context. Our job here is to feed the model this related ('biased') data and see how BERT performs. So let's move on to the next part.
Preprocessing Input
As mentioned earlier, the preprocessor object is callable, so it can take in test_text and return the text_preprocessed dictionary:
text_preprocessed = bert_preprocess_model(test_text)
print(text_preprocessed.keys())
>> dict_keys(['input_mask', 'input_word_ids', 'input_type_ids'])
As can be seen, the inputs have been processed into the form the encoder expects: a mask, token ids, and type ids for every word of every input sentence. More on these keys is given in the sections below.
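As a quick sanity check (the shapes below assume the default sequence length of 128 used by this preprocessing model), each key maps to a tensor with one row per input sentence:

for key, value in text_preprocessed.items():
    print(key, value.shape)
# e.g. input_word_ids (3, 128), input_mask (3, 128), input_type_ids (3, 128)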
Encoding
To generate the embeddings, all we now need to do is pass this text_preprocessed dictionary to the encoder object we created earlier (bert_encoder_model):
bert_results = bert_encoder_model(text_preprocessed)
print(bert_results.keys())
>> dict_keys(['default', 'sequence_output', 'pooled_output', 'encoder_outputs'])
As can be seen, the output is again a dictionary, so understanding the keys from both the preprocessing and the encoding stage becomes crucial.
After preprocessing, three keys are generated, namely input_mask, input_type_ids, and input_word_ids. Let's understand them one by one:
input_mask
Let’s visualize the input mask used:
# input_mask
print(text_preprocessed['input_mask'])
Returns a (3, 128) tensor of ones and zeros: 1 marks a real token (the [CLS] token, the sentence's tokens, and [SEP]), while 0 marks the padding positions.
input_type_ids
To see the input_type_ids use:
# input_type_ids
print(text_preprocessed['input_type_ids'])
Returns a (3, 128) tensor of segment ids telling the model which sentence each token belongs to; since every example here is a single sentence, all values are 0.
input_word_ids
To visualize the input_word_ids use:
# input_word_ids
print(text_preprocessed['input_word_ids'])
Returns a (3, 128) tensor holding the WordPiece vocabulary id of every token, with the ids of [CLS] and [SEP] placed around each sentence and zeros used as padding.
After encoding, the dictionary contains pooled_output, sequence_output, and encoder_outputs (along with a default key). Let's understand the three main ones one by one.
pooled_output
To see the pooled_output:
print(bert_results['pooled_output'])
Returns a (3, 768) tensor: one 768-dimensional embedding per input sentence, representing each sentence as a whole.
sequence_output
Includes a contextual embedding for every token position of every input sentence (including the [CLS], [SEP], and padding positions).
So the shape of the sequence_output tensor becomes:
(number of sentences, sentence length – 128, number of hidden units – 768)
sequence_output can be accessed as:
print(bert_results['sequence_output'])
Returns a (3, 128, 768) tensor of token-level contextual embeddings.
encoder_outputs
encoder_outputs holds the intermediate output of each of the 12 encoder layers, and the output of the last layer is exactly the sequence_output. To verify this, run the following code:
bert_results['encoder_outputs'][-1] == bert_results['sequence_output']
Here we are simply checking whether the last element ([-1]) of encoder_outputs is the same as sequence_output using the == operator, which returns a boolean tensor:
Clearly, every entry comes back True.
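Finally, to give a flavour of the fine-tuning step discussed earlier, here is a minimal, hypothetical sketch that plugs the same two TF Hub URLs into a Keras model and adds a binary classification head on top of pooled_output (the training data, train_texts and train_labels, is assumed and not shown):

import tensorflow as tf
import tensorflow_hub as hub

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
preprocessed = hub.KerasLayer(preprocess_url, name="preprocessing")(text_input)
bert_outputs = hub.KerasLayer(encoder_url, trainable=True, name="BERT")(preprocessed)

# Classification head on top of the whole-sentence embedding
x = tf.keras.layers.Dropout(0.1)(bert_outputs["pooled_output"])
output = tf.keras.layers.Dense(1, activation="sigmoid", name="classifier")(x)

model = tf.keras.Model(text_input, output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_texts, train_labels, epochs=2)   # hypothetical training data

Setting trainable=True on the encoder layer is what turns this into fine-tuning rather than plain feature extraction.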
With this, we have covered all the basics required to implement BERT. I hope you found this useful; if so, kindly share the article and leave feedback, as that is what inspires me to write these pieces. For the more knowledge-hungry readers, I have provided some useful references below where you can gain in-depth knowledge of the topic.
Colab file: for all the code, refer here.
BERT details (in-depth): BERT (Bidirectional Encoder Representations from Transformers); for a visual explanation, refer here.
Inspiration: a humble and respectful thanks to codebasics, which inspired me to write this content.