This article was published as a part of the Data Science Blogathon
BERT is just too good, so this article will be touching on BERT and sentence similarity!
A significant portion of NLP relies on similarity in high-dimensional spaces. Typically, an NLP pipeline will take some text, process it to produce a large vector/array representing that text, and then perform several transformations.
It is high-dimensional magic. At a high level, there is not much more to it. We need to understand what is happening in detail and implement it in Python too! So, let's get started.
Sentence similarity is one of the clearest examples of how powerful high-dimensional representations can be.
The thesis is this:
BERT, as we previously stated, is something of an MVP of NLP. A massive part of this is due to BERT's ability to embed the meaning of words into dense vectors.
We call them dense vectors because every value in the vector has a purpose for holding that value; this is in contrast to sparse vectors, such as one-hot encoded vectors, where most of the values are 0.
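As a quick illustration (a minimal sketch with made-up numbers, not part of the pipeline we build later), here is what a sparse one-hot vector looks like next to a small dense vector:

import numpy as np

# A sparse one-hot vector: a single position is 1, everything else is 0
vocab_size = 10
one_hot = np.zeros(vocab_size)
one_hot[3] = 1            # the token is identified only by its index

# A toy dense vector: every position holds a meaningful (learned) value
dense = np.array([0.12, -0.48, 0.91, 0.05, -0.33])

print(one_hot)  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
print(dense)    # [ 0.12 -0.48  0.91  0.05 -0.33]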
BERT excels at generating these dense vectors, and each encoder layer (there are several) outputs a set of them.
For BERT base, each of these is a vector of 768 values. Those 768 values hold our numerical representation of a single token, which we can use as a contextual word embedding.
The vector representing each token (produced by each encoder layer) is really one slice of a larger tensor of size 768 by the number of tokens.
We can take these tensors and transform them to create semantic representations of the input sequence. We can then use similarity metrics to measure the similarity between different sentences.
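The metric we will rely on throughout is cosine similarity. As a minimal sketch (the vectors here are made up purely for illustration), it can be computed directly from the dot product and the vector norms:

import numpy as np

def cosine_sim(a, b):
    # Cosine similarity = dot product divided by the product of the norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.20, 0.90, -0.40])
b = np.array([0.25, 0.80, -0.35])
print(cosine_sim(a, b))  # close to 1.0, since the vectors point in similar directions

Later on we use scikit-learn's cosine_similarity, which does the same thing for whole batches of vectors.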
The simplest and most commonly extracted tensor is the last_hidden_state tensor, conveniently output by the BERT model.
Of course, this is a fairly large tensor, at 512x768, and we want a single vector to apply our similarity measures to.
To do this, we need to convert our last_hidden_state tensor into a vector of 768 values.
To transform our last_hidden_state tensor into the desired vector, we use a mean pooling operation.
Each of the 512 tokens has its own 768 values. The pooling operation takes the mean of all token embeddings and compresses them into a single 768-dimensional vector, producing a 'sentence vector'.
At the same time, we can't just take the mean of the activations as they are. We first need to account for null padding tokens (which we should not include).
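To make the idea concrete before the full implementation below, here is a minimal sketch of masked mean pooling on toy tensors (the shapes are deliberately tiny; the real tensors are 512 tokens by 768 dimensions):

import torch

# Toy example: 1 sentence, 4 token positions, 3-dimensional embeddings
embeddings = torch.tensor([[[1.0, 2.0, 3.0],
                            [3.0, 4.0, 5.0],
                            [0.0, 0.0, 0.0],    # padding position
                            [0.0, 0.0, 0.0]]])  # padding position
attention_mask = torch.tensor([[1, 1, 0, 0]])   # 1 = real token, 0 = padding

# Expand the mask to the embedding dimension and zero out padding positions
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
masked = embeddings * mask

# Sum over the token axis and divide by the number of real tokens
summed = masked.sum(dim=1)
counts = torch.clamp(mask.sum(dim=1), min=1e-9)
mean_pooled = summed / counts
print(mean_pooled)  # tensor([[2., 3., 4.]]) -> the mean of the two real tokens only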
That covers the theory and logic behind the process, but how do we apply it in practice?
We'll describe two approaches: the easy way and the slightly more involved way.
The most straightforward way to do everything we just covered is through the sentence-transformers library, which wraps most of this process into a few lines of code.
Let's create some sentences, initialize our model, and encode them:
# Write some sentences to encode (sentences 0 and 2 carry the same meaning):
sen = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Encoding:
sen_embeddings = model.encode(sen)
sen_embeddings.shape
Output: (4, 768)
Great, we now have four sentence embeddings, each containing 768 values.
Next, we take those embeddings and find the cosine similarity between each of them. So for sentence 0:
Three years later, the coffin was still full of Jello.
We can find the most similar sentence using:
from sklearn.metrics.pairwise import cosine_similarity

# Let's calculate cosine similarity for sentence 0:
cosine_similarity(
    [sen_embeddings[0]],
    sen_embeddings[1:]
)
Output: array([[0.33088914, 0.7219258 , 0.5548363 ]], dtype=float32)
Index | Sentence                                                                                     | Similarity
1     | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.3309
2     | "The person box was packed with jelly many dozens of months later."                         | 0.7219
3     | "He found a leprechaun in his walnut shell."                                                 | 0.5547
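If we wanted to pick out the most similar sentence programmatically rather than reading it off the table, one small convenience sketch (not part of the original walkthrough) is to take the argmax of the scores:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

scores = cosine_similarity([sen_embeddings[0]], sen_embeddings[1:])[0]
best = int(np.argmax(scores)) + 1  # +1 because we compared against sentences 1 onwards
print(f"Most similar to sentence 0: sentence {best} (score {scores[best - 1]:.4f})")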
Now, here is the less convenient but more instructive approach.
Before diving into the second approach, it is worth noting that it does the same thing as the above, but one level lower.
from transformers import AutoTokenizer, AutoModel
import torch

# Initialize our model and tokenizer:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

# Tokenize the sentences like before:
sent = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]

# Initialize a dictionary that stores the tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sent:
    # Encode each sentence and append to the dictionary
    new_tokens = tokenizer.encode_plus(sentence, max_length=128, truncation=True,
                                       padding='max_length', return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# Reformat the lists of tensors into single tensors
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])
# Process the tokens through the model:
outputs = model(**tokens)
outputs.keys()
Output: odict_keys(['last_hidden_state', 'pooler_output'])
# The dense vector representations of the text are contained in the outputs' 'last_hidden_state' tensor
embeddings = outputs.last_hidden_state
embeddings
embeddings.shape
Output: torch.Size([4, 128, 768])
After producing our dense vector embeddings, we need to perform a mean pooling operation to create a single vector encoding (i.e., a sentence embedding).
To perform this mean pooling operation, we multiply every value in our embeddings tensor by its corresponding attention_mask value so that non-real (padding) tokens are ignored.
# To perform this operation, we first resize our attention_mask tensor:
att_mask = tokens['attention_mask']
att_mask.shape
Output: torch.Size([4, 128])
mask = att_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape
Output: torch.Size([4, 128, 768])
mask_embeddings = embeddings * mask
mask_embeddings.shape
Output: torch.Size([4, 128, 768])
# Then we sum what remains of the embeddings along axis 1:
summed = torch.sum(mask_embeddings, 1)
summed.shape
Output: torch.Size([4, 768])
# Then sum the number of values that must be given attention in each position of the tensor:
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape
Output: torch.Size([4, 768])
# Finally, divide the summed embeddings by the number of real tokens to get the mean:
mean_pooled = summed / summed_mask
mean_pooled
Once we have our dense vectors, we can compute the cosine similarity between each of them, which is the same logic we used earlier:
from sklearn.metrics.pairwise import cosine_similarity

# Let's calculate cosine similarity for sentence 0:
# convert from PyTorch tensor to numpy array first
mean_pooled = mean_pooled.detach().numpy()

# calculate
cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)
Output: array([[0.3308891 , 0.721926 , 0.55483633]], dtype=float32)
Index | Sentence                                                                                     | Similarity
1     | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.3309
2     | "The person box was packed with jelly many dozens of months later."                         | 0.7219
3     | "He found a leprechaun in his walnut shell."                                                 | 0.5548
We get almost identical results; the only difference is that the cosine similarity for index 3 has shifted from 0.5547 to 0.5548, an insignificant variation due to rounding.
That's all for this introduction to measuring the semantic similarity of sentences with BERT, using the sentence-transformers library and a lower-level implementation with PyTorch and transformers.
I hope you've enjoyed the article. Let me know if you have any questions or suggestions via LinkedIn or in the comments below.
Thanks for reading!
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.