Most machine learning algorithms cannot work with raw text; they need numerical input. So text data must be represented in a numerical form that our machine learning models can handle. Word embeddings are an efficient way of representing words as vectors, and they give similar vector representations to words with similar meanings. In this article, we are going to learn about fastText.
FastText is a word embedding technique that builds word vectors from character n-grams. It is an extension of the word2vec model. This article will study fastText and how to train the model available in Gensim. It also includes a brief introduction to the word2vec model.
Learning Objectives
In this article, you will learn what word embeddings are, how the word2vec and fastText models work, how fastText handles out-of-vocabulary words, and how to train a fastText model with Gensim.
Word embedding is an approach for representing words in vector form. It gives similar vector representations to words with similar meanings, which helps a model capture the linguistic meaning of a word. For example, consider four words: cricket, football, mountain, and sea. Cricket and football are related, and sea and mountain are related, so related words are given similar vector representations. Figure 1.1 shows that cricket and football are placed together, and mountain and sea are placed together. This helps the model learn the semantic meaning of words.
Figure 1.1
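To make "similar words get similar vectors" concrete, here is a minimal sketch with made-up two-dimensional vectors (real embeddings are learned and typically have hundreds of dimensions); the cosine similarity of related words is close to 1, while unrelated words score much lower.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2-D vectors, chosen by hand purely for illustration
cricket  = np.array([0.9, 0.1])
football = np.array([0.8, 0.2])
mountain = np.array([0.1, 0.9])

print(cosine_similarity(cricket, football))   # ~0.99: related words
print(cosine_similarity(cricket, mountain))   # ~0.22: unrelated words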
Some popular word embedding techniques are Word2Vec, GloVe, FastText, and ELMo. Word2Vec and GloVe operate at the word level, whereas FastText and ELMo operate at the character and sub-word level. In this article, we will study the FastText word embedding technique.
Word2Vec is a word embedding technique that represents words in vector form. It takes a whole corpus of words and provides embeddings for those words in a high-dimensional space. Word2Vec also preserves the semantic and syntactic relationships between words, so the model can be used to measure how related words are. The word2vec model uses two main architectures to compute the vectors: CBOW and Skip-gram.
In CBOW, the context words are given as input and the target word is predicted: if a sentence has a missing word, the model must predict that word from its surrounding words. In Skip-gram, the target word is given as input and the model predicts the probability of each context word.
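As a quick illustration (a toy corpus, not the article's dataset), the sketch below trains Word2Vec with Gensim; the sg parameter switches between CBOW (sg=0) and Skip-gram (sg=1). The size argument is the vector dimensionality in Gensim 3.x (renamed vector_size in Gensim 4).
from gensim.models import Word2Vec

# Tiny toy corpus, purely illustrative
toy_corpus = [["i", "want", "to", "learn", "fasttext"],
              ["i", "want", "to", "learn", "word2vec"]]

cbow_model = Word2Vec(toy_corpus, size = 50, window = 2, min_count = 1, sg = 0)  # CBOW
skip_model = Word2Vec(toy_corpus, size = 50, window = 2, min_count = 1, sg = 1)  # Skip-gram

print(cbow_model.wv["learn"][:5])  # first few components of the learned vector for 'learn'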
Word embedding techniques like word2vec and GloVe provide a distinct vector for each word in the vocabulary. As a result, they ignore the internal structure of words. This is a limitation for morphologically rich languages because the sub-word structure of words is ignored. Since many word forms in such languages follow regular rules, vector representations for these languages can be improved by using character-level information.
To improve vector representations for morphologically rich languages, FastText learns embeddings for character n-grams and represents a word by combining the embeddings of its n-grams. It is an extension of the word2vec model: word2vec provides embeddings for whole words, whereas fastText provides embeddings for character n-grams. Like word2vec, fastText uses the CBOW and Skip-gram architectures to compute the vectors.
FastText can also handle out-of-vocabulary words, i.e., it can produce embeddings for words that were not present in the training data.
Out-of-vocabulary (OOV) words are words that do not occur in the training data and are therefore not present in the model’s vocabulary. Word embedding models like word2vec and GloVe cannot provide embeddings for OOV words because they learn one embedding per word; if a new word appears, they have no vector for it.
Since FastText provides embeddings for character n-grams, it can also provide embeddings for OOV words: when an OOV word occurs, fastText builds its embedding from the embeddings of its character n-grams.
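A minimal sketch of this behaviour, using a toy corpus rather than the article’s dataset: the word 'epidemics' never appears in training, yet Gensim’s FastText still returns a vector for it because the vector is composed from its character n-grams (Gensim 3.x API assumed).
from gensim.models import FastText

toy_corpus = [["the", "epidemic", "spread", "quickly"],
              ["the", "outbreak", "was", "contained"]]

toy_model = FastText(toy_corpus, size = 50, window = 3, min_count = 1, min_n = 2, max_n = 4)

print('epidemics' in toy_model.wv.vocab)   # False: never seen during training
print(toy_model.wv['epidemics'][:5])       # a vector is still returned, built from character n-grams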
In FastText, each word is represented by the sum of the vector representations of its character n-grams, together with a vector for the word itself.
Consider the word “equal” with n = 3. The word will be represented by the character n-grams:
"<eq", "equ", "qua", "ual", "al>", and the special sequence "<equal>" (the whole word with boundary symbols).
So, the word embedding for the word ‘equal’ is given as the sum of the vector representations of all of its character n-grams and the word itself.
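A small helper (written just for this article, not part of any library) reproduces that decomposition; the '<' and '>' symbols mark the word boundaries.
def char_ngrams(word, n = 3):
    # Add boundary symbols, then slide a window of length n over the padded word
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("equal"))   # ['<eq', 'equ', 'qua', 'ual', 'al>']
# The special sequence '<equal>' (the whole word) is also added to this set.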
In the Continuous Bag of Words (CBOW) architecture, the context words around the target are given as input, and the model predicts the target word.
For example, take the sentence “I want to learn FastText.” The words “I,” “want,” “to,” and “FastText” are given as input, and the model predicts “learn” as output.
All the input and output words are one-hot encoded vectors of the same dimension. CBOW is trained with a neural network that has an input layer, a hidden layer, and an output layer. Figure 1.2 shows the working of CBOW.
Figure 1.2
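To make the CBOW objective concrete, the sketch below builds the (context, target) training pairs for the example sentence, using a context window of two words on each side (the window size is an illustrative choice).
sentence = ["i", "want", "to", "learn", "fasttext"]
window = 2

for i, target in enumerate(sentence):
    # All words within `window` positions of the target, excluding the target itself
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    print(context, "->", target)
# e.g. ['want', 'to', 'fasttext'] -> learn : the context words predict the target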
Skip-gram works like CBOW, but in reverse: the input is the target word, and the model predicts its context words. It also uses a neural network for training. Figure 1.3 shows the working of Skip-gram.
Figure 1.3
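The same sentence can be turned into Skip-gram training pairs, where each target word predicts one context word at a time (again an illustrative sketch with a window of two).
sentence = ["i", "want", "to", "learn", "fasttext"]
window = 2

for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            print(target, "->", sentence[j])
# e.g. learn -> want, learn -> to, learn -> fasttext : the target predicts each context word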
FastText can be viewed as an extension of word2vec, and the most significant differences follow from that: word2vec learns one vector per word and therefore cannot embed words it has never seen, whereas fastText learns vectors for character n-grams, builds word vectors from them, and can therefore embed out-of-vocabulary words. Because of the extra n-gram vectors, fastText models are also typically larger and slower to train.
This section explains how to train the fastText model. The fastText model is available in Gensim, a Python library for topic modeling, document indexing, and similarity retrieval with large corpora.
The dataset used in this article is taken from Kaggle: “Word Embedding Analysis on Covid-19 dataset”. The pre-processed dataset used below can be accessed here.
The first step is to import the necessary libraries and read the dataset,
from gensim.models.phrases import Phrases, Phraser
from gensim.models import FastText
import pandas as pd

df = pd.read_csv('medical_dataset.csv')
print(df.head())
To extract the most common and meaningful phrases (n-grams) from the dataset, the Phrases model from Gensim is used.
sent = [row.split() for row in df['Text']]
phrases = Phrases(sent, min_count = 30, progress_per = 10000)
sentences = phrases[sent]
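To get a feel for what the phrase detection did, you can transform a single tokenized row and look for tokens joined by an underscore (which exact phrases are found depends on the corpus):
# Detected collocations appear as single tokens joined by '_'
print(phrases[sent[0]][:20])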
The next step is to initialize the model and build its vocabulary. The main hyperparameters of the fastText model are as follows:
size: dimensionality of the word vectors (renamed vector_size in Gensim 4)
window: the number of words before and after the target word considered as its context
min_count: minimal number of word occurrences for a word to be kept in the vocabulary
min_n: minimum length of character n-grams
max_n: maximum length of character n-grams
workers: number of worker threads used for training
#Initializing the model
model = FastText(size = 100, window = 5, min_count = 5, workers = 4, min_n = 1, max_n = 4)
#Building Vocabulary
model.build_vocab(sentences)
print(len(model.wv.vocab.keys()))
Output:
As we can see, the vocabulary contains 30,734 words.
The model is trained on the phrase-transformed sentences created above for 100 epochs. The trained model is then saved using the joblib library.
#Training the model
model.train(sentences, total_examples = len(sentences), epochs=100)
# Saving the model
import joblib
path = 'FastText.joblib'
joblib.dump(model, path)
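The saved model can later be loaded back with joblib and queried exactly like the original; the query word below is just an example.
# Loading the model back from disk
loaded_model = joblib.load('FastText.joblib')
print(loaded_model.wv.most_similar("virus", topn = 5))  # example query word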
Finally, we can query the trained model. Checking the vocabulary and asking for the most similar words demonstrates both in-vocabulary lookups and fastText’s handling of OOV tokens.
# Check whether the word 'python' is present in the model's vocabulary
vocabulary = model.wv.vocab.keys()
'python' in vocabulary

# The five words most similar to 'python'
model.wv.most_similar("python", topn = 5)

# 'epidemic out-break' contains a space, so it can never be a token in the vocabulary (OOV)
'epidemic out-break' in vocabulary

# fastText can still find its nearest neighbours, because its vector is built from character n-grams
model.wv.most_similar("epidemic out-break", topn = 10)
This article briefly introduced word embeddings and word2vec and then explained FastText, a word embedding technique that provides embeddings for character n-grams instead of whole words. It also compared word2vec and fastText: as an extension of word2vec, fastText overcomes word2vec’s major disadvantage of not handling out-of-vocabulary words, although the performance of both models ultimately depends on the corpus. Finally, it provided a demo of training a fastText model in Gensim.