If you put a status update on Facebook about purchasing a car, don’t be surprised if Facebook serves you a car ad on your screen. This is not black magic! This is Facebook leveraging your text data to serve you better ads.
The picture below pokes fun at one such challenge of dealing with text data.
Well, the ad engine clearly failed in the above attempt to deliver the right ad, which is why it is so important to capture the context in which a word is used. This is a common problem in Natural Language Processing (NLP) tasks.
A single word with the same spelling and pronunciation (a homonym) can be used in multiple contexts, and a potential solution to the above problem is computing word representations.
Now, imagine the challenge for Facebook. Facebook deals with an enormous amount of text data on a daily basis in the form of status updates, comments, etc., and it is all the more important for Facebook to utilise this text data to serve its users better. Using the text generated by billions of users to compute word representations was a very time-consuming task until Facebook developed its own library, FastText, for word representations and text classification.
In this article, we will see how we can calculate word representations and perform text classification, all in a matter of seconds, compared to existing methods that took days to achieve the same performance.
FastText is an open-source library for text representation and classification developed by Facebook’s AI Research (FAIR) team. It is designed to efficiently handle large amounts of text data and provides tools for text classification, word representation, and text similarity computation.
At its core, FastText uses the concept of word embeddings, which are dense vector representations of words in a continuous vector space. Word embeddings capture semantic and syntactic relationships between words based on their distributional properties in a given text corpus.
FastText extends the idea of word embeddings beyond whole words to subword units. Instead of considering words as atomic units, FastText breaks them down into smaller pieces, such as character n-grams. By doing so, it can capture morphological information and handle out-of-vocabulary words efficiently.
The training process of FastText involves learning these word and subword embeddings using either the skipgram model or the continuous bag of words (CBOW) model, typically with negative sampling. CBOW predicts a target word from the surrounding context words (skipgram does the reverse, predicting the context from the target word), and negative sampling helps train the model efficiently even with large vocabularies.
FastText supports both unsupervised and supervised learning tasks. In the unsupervised setting, it can learn word embeddings solely based on the distributional properties of words in the training corpus. In the supervised setting, it can perform text classification tasks, where it learns to classify text documents into predefined categories.
FastText has gained popularity due to its ability to handle large-scale text data efficiently. It has been used for various applications, including text classification, language identification, information retrieval, and text similarity computation.
FastText is a library for efficient learning of word representations and sentence classification.
This library has gained a lot of traction in the NLP community and is a possible substitute for the gensim package, which provides word-vector functionality among other things. If you are new to word vectors and word representations in general, I suggest you read this article first.
But the question that we should be really asking is – How is FastText different from gensim Word Vectors?
FastText differs in the sense that word vectors (a.k.a. word2vec) treat every single word as the smallest unit whose vector representation is to be found, whereas FastText assumes a word to be formed of character n-grams. For example, sunny is composed of n-grams such as [sun, sunn, sunny] and [unny, nny, ny], where n can range from 1 up to the length of the word. This subword representation gives fastText an edge over word2vec or GloVe: it can produce better embeddings for rare words and can even build vectors for out-of-vocabulary words from their character n-grams.
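To make the subword idea concrete, here is a minimal Python sketch of breaking a word into character n-grams. It is only an illustration, not fastText's actual implementation (which also hashes the n-grams into buckets internally):

def char_ngrams(word, minn=3, maxn=6):
    # Wrap the word in boundary markers, as fastText does, and collect
    # all character n-grams of length minn..maxn (illustrative only).
    token = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(token) - n + 1):
            ngrams.append(token[i:i + n])
    return ngrams

print(char_ngrams("sunny"))
# ['<su', 'sun', 'unn', 'nny', 'ny>', '<sun', 'sunn', 'unny', 'nny>', ...]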
We will now look at the steps to install the fastText library below.
To make full use of the FastText library, please make sure you have a compiler with good C++11 support (a recent gcc or clang) and make installed.
If you do not have these prerequisites, I urge you to go ahead and install them first.
To install FastText, run the commands below:
git clone https://github.com/facebookresearch/fastText.git
cd fastText
make
You can check whether FastText has been properly installed by typing the below command inside the fastText folder:
./fasttext
If everything was installed correctly then, you should see the list of available commands for FastText as the output.
As stated earlier, FastText was designed for two specific purposes: word representation learning and text classification. We will see each of these steps in detail. Let us get started with learning word representations.
Words in their natural form cannot be used for any Machine Learning task in general. One way to use them is to transform the words into representations that capture some of their attributes. It is analogous to describing a person as [‘height’: 5.10, ‘weight’: 75, ‘colour’: ‘dusky’, etc.], where height, weight, etc. are attributes of the person. Similarly, word representations capture abstract attributes of words in such a way that similar words tend to have similar representations. There are primarily two methods used to develop word vectors – Skipgram and CBOW.
We will see how we can implement both these methods to learn vector representations for a sample text file using fasttext.
Learning word representations using Skipgram and CBOW models
./fasttext skipgram -input file.txt -output model
./fasttext cbow -input file.txt -output model
Let us see the parameters defined above in steps for easy understanding.
./fasttext – It is used to invoke the FastText library.
skipgram/cbow – It is where you specify whether skipgram or cbow is to be used to create the word representations.
-input – This parameter specifies that the word following it is the name of the file to be used for training. This argument should be used as is.
file.txt – a sample text file over which we wish to train the skipgram or cbow model. Change this name to the name of your own text file.
-output – This parameter specifies that the word following it is the name to be used for the model being created. This argument is to be used as is.
model – This is the name of the model created.
Running the above command will create two files named model.bin and model.vec. model.bin contains the model parameters, the dictionary and the hyperparameters, and can be used to compute word vectors. model.vec is a text file that contains the word vectors, one word per line.
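If you prefer Python over the command line, FastText also ships Python bindings (installable with pip install fasttext). A minimal sketch of the same training step, assuming the same file.txt sits in the working directory, might look like this:

import fasttext

# Train unsupervised word vectors; model can be "skipgram" or "cbow"
model = fasttext.train_unsupervised("file.txt", model="skipgram")

# Saves the equivalent of model.bin (the .vec text file is produced by the CLI)
model.save_model("model.bin")
print(len(model.words))  # size of the learned vocabulary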
Now that we have created our own word vectors, let’s see if we can do some common tasks like printing the word vector for a word, finding similar words, solving analogies, etc. using these word vectors.
Print word vectors of a word
In order to get the word vectors for a word or set of words, save them in a text file. For example, here is a sample text file named queries.txt that contains some random words. We will get the vector representation of these words using the model we trained above.
./fasttext print-word-vectors model.bin < queries.txt
To check the word vector for a single word without saving it into a file, you can do
echo "word" | ./fasttext print-word-vectors model.bin
Finding similar words
You can also find the words most similar to a given word. This functionality is provided by the nn parameter. Let’s see how we can find the most similar words to “happy”.
./fasttext nn model.bin
After typing the above command, the terminal will ask you to input a query word.
happy
by 0.183204
be 0.0822266
training 0.0522333
the 0.0404951
similar 0.036328
and 0.0248938
The 0.0229364
word 0.00767293
that 0.00138793
syntactic -0.00251774
The above is the result returned for the most similar words to happy. Interestingly, this feature could be used to correct spellings too. For example, when you enter a wrong spelling, it shows the correct spelling of the word if it occurred in the training file.
wrd
word 0.481091
words. 0.389373
words 0.370469
word2vec 0.354458
more 0.345805
and 0.333076
with 0.325603
in 0.268813
Word2vec 0.26591
or 0.263104
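For reference, the same nearest-neighbour query can be issued through the Python bindings; a sketch using the model.bin trained earlier:

import fasttext

model = fasttext.load_model("model.bin")
# Returns a list of (similarity, word) pairs for the k nearest neighbours
print(model.get_nearest_neighbors("happy", k=10))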
Analogies
FastText word vectors can also be used for analogy tasks of the form: what is to C what B is to A? Here, A, B and C are words.
The analogies functionality is provided by the parameter analogies. Let’s see this with the help of an example.
./fasttext analogies model.bin
The above command will ask you to input the words in the form A - B + C, but we just need to give three words separated by spaces.
happy sad angry
of 0.199229
the 0.187058
context 0.158968
a 0.151884
as 0.142561
The 0.136407
or 0.119725
on 0.117082
and 0.113304
be 0.0996916
Training on a very large corpus will produce better results.
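The Python bindings expose the same functionality through get_analogies; a sketch with the three query words used above:

import fasttext

model = fasttext.load_model("model.bin")
# Nearest neighbours of the combined vector, matching the CLI's "A - B + C" query form
print(model.get_analogies("happy", "sad", "angry"))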
As the name suggests, text classification is tagging each document in the text with a particular class. Sentiment analysis and email classification are classic examples of text classification. In this era of technology, millions of digital documents are generated each day. It would cost a huge amount of time as well as human effort to categorise them into reasonable categories like spam and non-spam, important and unimportant, and so on. The text classification techniques of NLP come to our rescue here. Let’s see how by doing some hands-on practice on a sentiment analysis problem. I have taken the data for this analysis from Kaggle.
Before we jump into the execution, there is a word of caution about the training file. The default format of the text file on which we want to train our model is:
__label__<X> <Text>
where __label__ is the prefix to the class and <X> is the class assigned to the document. Also, there should not be quotes around the document, and everything in one document should be on one line.
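For instance, the first couple of lines of such a training file might look like the following (made-up examples, shown only to illustrate the format):

__label__2 Loved this phone, the battery easily lasts two days.
__label__1 Terrible purchase, the screen cracked within a week.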
In fact, the reason why I have selected this data for this article is that it is already available exactly in the required default format. If you are completely new to FastText and implementing text classification for the very first time, I would strongly recommend using the data mentioned above.
In case your data has the label in some other format, don’t be bothered. FastText will take care of it once you pass a suitable argument. We will see how to do it in a moment. Just stick with the article.
After this briefing about text classification, let’s move ahead to the implementation part. We will be using the train.ft.txt file to train the model and the test.ft.txt file to predict.
# training the classifier
./fasttext supervised -input train.ft.txt -output model_kaggle -label __label__
Here, the parameters are the same as the ones mentioned while creating word representations. The only additional parameter is -label, which takes care of the format of the label specified. The file that you downloaded contains labels with the prefix __label__.
If you do not wish to use default parameters for training the model, then they can be specified during the training time. For example, if you explicitly want to specify the learning rate of the training process then you can use the argument -lr to specify the learning rate.
./fasttext supervised -input train.ft.txt -output model_kaggle -label __label__ -lr 0.5
Several other parameters, such as the number of epochs, the embedding dimension and the word n-gram length, can be tuned in the same way; the fastText documentation lists all of them along with their default values.
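Through the Python bindings, the same supervised training with a few tuned hyperparameters might look like the following sketch (the chosen values are only illustrative):

import fasttext

model = fasttext.train_supervised(
    input="train.ft.txt",
    label="__label__",   # prefix used for labels in the training file
    lr=0.5,              # learning rate
    epoch=5,             # number of passes over the data
    wordNgrams=2,        # use word bigrams as additional features
)
model.save_model("model_kaggle.bin")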
# Testing the result
./fasttext test model_kaggle.bin test.ft.txt
N 400000
P@1 0.916
R@1 0.916
Number of examples: 400000
Here, P@1 is the precision and R@1 is the recall.
# Predicting on the test dataset
./fasttext predict model_kaggle.bin test.ft.txt
# Predicting the top 3 labels
./fasttext predict model_kaggle.bin test.ft.txt 3
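The testing and prediction steps have straightforward counterparts in the Python bindings; a sketch using the model trained above (the input sentence is made up):

import fasttext

model = fasttext.load_model("model_kaggle.bin")

# Evaluate on the test file: returns (number of examples, precision@1, recall@1)
n, precision, recall = model.test("test.ft.txt")
print(n, precision, recall)

# Predict the top 3 labels (and their probabilities) for a single piece of text
labels, probs = model.predict("this product exceeded my expectations", k=3)
print(labels, probs)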
This model can also be used for computing sentence vectors. Let us see how we can compute a sentence vector using the following command.
echo "this is a sample sentence" | ./fasttext print-sentence-vectors model_kaggle.bin
0.008204 0.016523 -0.028591 -0.0019852 -0.0043028 0.044917 -0.055856 -0.057333 0.16713 0.079895 0.0034849 0.052638 -0.073566 0.10069 0.0098551 -0.016581 -0.023504 -0.027494 -0.070747 -0.028199 0.068043 0.082783 -0.033781 0.051088 -0.024244 -0.031605 0.091783 -0.029228 -0.017851 0.047316 0.013819 0.072576 -0.004047 -0.10553 -0.12998 0.021245 0.0019761 -0.0068286 0.021346 0.012595 0.0016618 0.02793 0.0088362 0.031308 0.035874 -0.0078695 0.019297 0.032703 0.015868 0.025272 -0.035632 0.031488 -0.027837 0.020735 -0.01791 -0.021394 0.0055139 0.009132 -0.0042779 0.008727 -0.034485 0.027236 0.091251 0.018552 -0.019416 0.0094632 -0.0040765 0.012285 0.0039224 -0.0024119 -0.0023406 0.0025112 -0.0022772 0.0010826 0.0006142 0.0009227 0.016582 0.011488 0.019017 -0.0043627 0.00014679 -0.003167 0.0016855 -0.002838 0.0050221 -0.00078066 0.0015846 -0.0018429 0.0016942 -0.04923 0.056873 0.019886 0.043118 -0.002863 -0.0087295 -0.033149 -0.0030569 0.0063657 0.0016887 -0.0022234
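The same sentence vector can be obtained from the Python bindings with get_sentence_vector; a sketch using the supervised model loaded above:

import fasttext

model = fasttext.load_model("model_kaggle.bin")
vec = model.get_sentence_vector("this is a sample sentence")
print(vec.shape)  # one dense vector for the whole sentence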
Like every library in active development, fastText has its own pros and cons, which are worth weighing for your use case.
Now, it’s time to take the plunge and actually play with some real datasets. So, are you ready to take on the challenge? Accelerate your NLP journey with the following practice problems:
Practice Problem: Identify the Sentiments – Identify the sentiment of tweets
Practice Problem: Twitter Sentiment Analysis – Detect hate speech in tweets
Q. Does FastText use a neural network?
A. Yes, FastText utilizes a neural network architecture. It employs a shallow neural network with a single hidden layer for training word and subword embeddings, and uses techniques such as continuous bag of words (CBOW) with negative sampling for learning. FastText is a neural network-based approach for efficient text representation and classification tasks.
Q. Which is better, BERT embeddings or FastText?
A. The choice between BERT embeddings and FastText depends on the specific task and requirements. BERT embeddings capture contextual information effectively, making them suitable for tasks like sentiment analysis and named entity recognition. FastText is more efficient for handling large-scale text data and handles out-of-vocabulary words well. Ultimately, the selection should be based on the specific needs of the application.
This article was aimed at making you aware of the FastText library as an alternative to the word2vec model, and at letting you build your first vector representation and text classification model.
For people who want to dig deeper into the difference in performance between fastText and gensim, you can visit this link, where a researcher has carried out the comparison using a Jupyter notebook and some standard text datasets.
Please feel free to try out this library and share your experiences in the comments below.