This article was published as a part of the Data Science Blogathon
Let’s look at a practical application of the supervised NLP model fastText: detecting sarcasm in news headlines. About 80% of all information is commonly estimated to be unstructured, and text is one of the most common types of unstructured data. Because of its chaotic nature, analyzing, understanding, organizing, and sorting textual information is complex and time-consuming. This is where NLP and text classification come in.
Text classification is a machine learning technique used to sort texts into categories. Using classifier models, companies can automatically structure all kinds of text, from emails and legal documents to social media posts, chatbot messages, and survey results. This saves time spent analyzing information, automates business processes, and supports data-driven business decisions.
fastText is a popular open-source text classification library that was published in 2015 by the Facebook Artificial Intelligence Research Lab. The company also provides pre-trained models: English word vectors (trained on English web crawl and Wikipedia data) and multilingual word vectors (trained models for 157 different languages). The library supports both supervised learning (text classification) and unsupervised learning (obtaining vector representations of words). In this article, we’ll look at how it can be used to categorize news headlines.
import pandas as pd
import fasttext
from sklearn.model_selection import train_test_split
import re
from gensim.parsing.preprocessing import STOPWORDS
from gensim.parsing.preprocessing import remove_stopwords
pd.options.display.max_colwidth = 1000
The dataset is a collection of news article headlines annotated as sarcastic (articles from the news outlet The Onion) or non-sarcastic (from HuffPost).
Datalink: https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection
is_sarcastic: 1 if the title is sarcastic, otherwise 0
headline: news article title
article_link: link to the original article
# Loading data
# Checking the number of variables and observations
df_headline.shape
(26709, 3)
# Display headline examples
df_headline.head(3)
# Display the number of sarcastic and non-sarcastic articles in the dataset and their percentage
df_headline.is_sarcastic.value_counts()
0    14985
1    11724
df_headline.is_sarcastic.value_counts(normalize=True)
0    0.561047
1    0.438953
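The loading step itself is not shown in the snippet above. As a minimal sketch, assuming the Kaggle download is a JSON Lines file (one JSON object per line), the records could be read as follows. The helper name `load_headlines` and the file name in the comments are assumptions, not taken from the article.

```python
import json

def load_headlines(path):
    """Read a JSON Lines file into a list of dicts, one per headline."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

# Assumed usage (the file name is hypothetical):
# records = load_headlines("Sarcasm_Headlines_Dataset.json")
# df_headline = pd.DataFrame(records)
# pandas can also read this layout directly:
# df_headline = pd.read_json("Sarcasm_Headlines_Dataset.json", lines=True)
```

Either route should yield the three columns described above: is_sarcastic, headline, and article_link.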
Here are some examples of sarcastic and non-sarcastic headlines:
df_headline[df_headline['is_sarcastic']==1].head(3)
df_headline[df_headline['is_sarcastic']==0].head(3)
One of the first steps to improve model performance is simple text preprocessing. Before we start building the classifier, we need to prepare the text: convert all words to lower case and remove punctuation, special characters, and numbers. To do this, let’s create a cleanup function and apply it to the headline column.
# Create a text cleanup function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\sa-zA-Z0-9@\[\]]', '', text)  # Removes punctuation
    text = re.sub(r'\w*\d+\w*', '', text)  # Removes tokens containing digits
    text = re.sub(r'\s{2,}', ' ', text)  # Collapses repeated whitespace
    return text
# Apply it to the title
df_headline['headline'] = df_headline['headline'].apply(clean_text)
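To see what this kind of cleanup does to a single headline, here is a small self-contained sketch. The patterns are one plausible version of the steps described (lowercase, strip punctuation, drop tokens containing digits, collapse whitespace). The function name `clean_headline` and the sample headline are illustrative only.

```python
import re

def clean_headline(text):
    text = text.lower()
    text = re.sub(r"[^\sa-z0-9@\[\]]", "", text)  # drop punctuation and special characters
    text = re.sub(r"\w*\d+\w*", "", text)         # drop tokens that contain digits
    text = re.sub(r"\s{2,}", " ", text)           # collapse runs of whitespace
    return text.strip()                           # extra step in this sketch: trim the ends

print(clean_headline("Area Man Wins $1,000,000, Still Sad!"))
# area man wins still sad
```

Note that removing a number leaves a gap of two spaces, which is why the whitespace step has to replace with a single space rather than an empty string.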
Before we start training the model, we need to split the data into training and test sets. Most often, 80% of the data is used for training the model and 20% for testing (accuracy verification), although the proportions can vary depending on the amount of data.
# Divide data into training and test sets
train, test = train_test_split(df_headline, test_size = 0.2)
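Conceptually, the split shuffles the rows and cuts them at the requested ratio. The following plain-Python sketch illustrates the idea with a fixed seed so the result is reproducible. `split_rows` is a hypothetical helper, not scikit-learn's implementation; in practice you would more likely pass `random_state` (and possibly `stratify`) to `train_test_split`.

```python
import math
import random

def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle rows with a fixed seed and return (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = math.ceil(len(rows) * test_size)  # round the test share up
    return rows[n_test:], rows[:n_test]

train_rows, test_rows = split_rows(range(26709))
print(len(train_rows), len(test_rows))
# 21367 5342
```

With 26,709 headlines, a 20% test share rounded up gives 5,342 test examples, which matches the sample size reported in the evaluation output later in the article.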
Next, we need to prepare files in txt format. By default, fastText expects each line to start with the prefix __label__ followed by the class, and then the text itself.
# Create text files for training the model with label and text
with open('train.txt', 'w') as f:
    for every_text, every_lbl in zip(train['headline'], train['is_sarcastic']):
        f.writelines(f'__label__{every_lbl} {every_text}\n')
with open('test.txt', 'w') as f:
    for every_text, every_lbl in zip(test['headline'], test['is_sarcastic']):
        f.writelines(f'__label__{every_lbl} {every_text}\n')
# Display what our training data now looks like
!head -n 10 train.txt
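The line format can be made explicit with a pair of small helpers. This is a sketch only; `to_fasttext_line` and `from_fasttext_line` are hypothetical names, and the default `__label__` prefix is assumed.

```python
PREFIX = "__label__"

def to_fasttext_line(label, text):
    """Build one training line: prefix + label, a space, then the text."""
    # One example per line, so embedded newlines and extra spaces are squeezed out
    text = " ".join(text.split())
    return f"{PREFIX}{label} {text}"

def from_fasttext_line(line):
    """Split a training line back into (label, text)."""
    tagged_label, _, text = line.strip().partition(" ")
    return tagged_label[len(PREFIX):], text

line = to_fasttext_line(1, "area man wins\nlottery")
print(line)
# __label__1 area man wins lottery
print(from_fasttext_line(line))
# ('1', 'area man wins lottery')
```

Squeezing out newlines matters because a stray line break inside a headline would otherwise be read as the start of a new, unlabeled example.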
To train the model, pass the path of the training file to fastText:
# First model without hyperparameter optimization
model1 = fasttext.train_supervised('train.txt')
# Create a function to display the training results of the model
def print_results(sample_size, precision, recall):
    precision = round(precision, 2)
    recall = round(recall, 2)
    print(f'{sample_size=}')
    print(f'{precision=}')
    print(f'{recall=}')
# Apply the function
print_results(*model1.test('test.txt'))
sample_size=5342
precision=0.85
recall=0.85
The results, while not perfect, look promising.
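Precision and recall are identical in the output above because every headline has exactly one true label and the model is asked for exactly one prediction per headline, so both reduce to the share of correct predictions. The sketch below shows this with made-up labels. `precision_recall` is a hypothetical helper that mirrors my understanding of how the counts work, not fastText's own code.

```python
def precision_recall(true_labels, predicted_labels):
    """Each argument holds one list of labels per example."""
    hits = sum(len(set(t) & set(p)) for t, p in zip(true_labels, predicted_labels))
    n_predicted = sum(len(p) for p in predicted_labels)  # denominator for precision
    n_true = sum(len(t) for t in true_labels)            # denominator for recall
    return hits / n_predicted, hits / n_true

truth = [["1"], ["0"], ["0"], ["1"]]
preds = [["1"], ["0"], ["1"], ["1"]]
print(precision_recall(truth, preds))
# (0.75, 0.75)
```

The two numbers would only diverge if the number of predicted labels differed from the number of true labels, for example when requesting more than one prediction per headline.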
Finding the best hyperparameters manually can be time-consuming. By default, fastText sees each training example only five times during training, which is quite few considering that we have only about 21,000 examples in our training set. The number of passes over each example (also known as the number of epochs) can be increased with the epoch argument:
# Second model with 25 epochs
model2 = fasttext.train_supervised('train.txt', epoch=25)
print_results(*model2.test('test.txt'))
sample_size=5342
precision=0.83
recall=0.83
As you can see, the accuracy of the model has not increased. Another way to change how quickly the model learns is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after each example is processed. A learning rate of 0 would mean that the model does not change at all and therefore does not learn anything. Good learning rates are in the range of 0.1 to 1.0. We can also set this hyperparameter manually with the lr argument:
# Third model with 10 epochs and a learning rate of 1.0
model3 = fasttext.train_supervised('train.txt', epoch=10, lr=1.0)
print_results(*model3.test('test.txt'))
sample_size=5342
precision=0.83
recall=0.83
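To make the role of the learning rate concrete, here is a toy gradient-descent sketch on a one-parameter loss. It is not fastText's optimizer, only an illustration of how the rate scales each update; the function `sgd` and its default values are made up for this example.

```python
def sgd(lr, steps=10, w=0.0, target=3.0):
    """Minimise (w - target)**2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * (w - target)  # derivative of the squared error
        w -= lr * grad           # the learning rate scales the update
    return w

for lr in (0.0, 0.1, 0.5):
    print(lr, round(sgd(lr), 3))
```

With a rate of 0 the parameter never moves from its starting value, with 0.1 it approaches the target gradually, and with 0.5 it lands on the target in a single step for this particular loss.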
Finally, we can improve the performance of the model by using word bigrams rather than just unigrams. This is especially important for classification tasks where word order matters, for example, sentiment analysis or detecting criticism and sarcasm. For this, we will set the wordNgrams argument to 2 in the model.
# Fourth model with 10 epochs, a learning rate of 1.0, and word bigrams
model4 = fasttext.train_supervised('train.txt', epoch=10, lr=1.0, wordNgrams=2)
print_results(*model4.test('test.txt'))
sample_size=5342
precision=0.86
recall=0.86
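The following sketch lists the unigrams and bigrams of a headline to show which extra features word n-grams contribute. It illustrates the idea only: fastText builds and hashes these features internally, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("area man wins lottery"))
# ['area', 'man', 'wins', 'lottery', 'area man', 'man wins', 'wins lottery']
```

Pairs such as "area man" carry information about word order that the individual words alone do not.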
Thanks to this sequence of steps, we were able to go from 85% to 86% precision:
preprocessing the text;
changing the number of epochs (using the epoch argument, standard range [5 to 50]);
changing the learning rate (using the lr argument, standard range [0.1 to 1.0]);
using word n-grams (using the wordNgrams argument, standard range [1 to 5]).
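If you want to explore these ranges by hand, one option is to enumerate a small grid of candidate settings and train one model per combination. The sketch below only generates the combinations; the specific values in `search_space` are arbitrary picks within the ranges above, and `candidate_configs` is a hypothetical helper.

```python
from itertools import product

# Arbitrary sample values inside the ranges listed above
search_space = {
    "epoch": [5, 25, 50],
    "lr": [0.1, 0.5, 1.0],
    "wordNgrams": [1, 2, 3],
}

def candidate_configs(space):
    """Yield one dict of keyword arguments per combination of values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(candidate_configs(search_space))
print(len(configs))
# 27
# Each config could then be passed on, for example:
# fasttext.train_supervised('train.txt', **config)
```

Even this small grid means 27 training runs, which is the cost that automatic tuning is meant to reduce.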
The fastText auto-tuning feature searches for the hyperparameters that give the highest F1 score. To use it, pass a validation file to the model through the autotuneValidationFile argument (here, the test dataset):
model5 = fasttext.train_supervised('train.txt', autotuneValidationFile='test.txt')
print_results(*model5.test('test.txt'))
sample_size=5342
precision=0.87
recall=0.87
You can also steer the hyperparameter search toward the score of a specific label by adding the autotuneMetric argument:
model6 = fasttext.train_supervised('train.txt', autotuneValidationFile='test.txt', autotuneMetric="f1:__label__1")
print_results(*model6.test('test.txt'))
sample_size=5342
precision=0.87
recall=0.87
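The metric f1:__label__1 refers to the F1 score computed for the sarcastic class only. As a sketch of what that number means, the following computes a per-label F1 from made-up true and predicted labels; `f1_for_label` is a hypothetical helper, not part of fastText.

```python
def f1_for_label(true_labels, predicted_labels, label):
    """F1 score for a single label, treating all other labels as negatives."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

truth = ["1", "0", "0", "1", "1"]
preds = ["1", "0", "1", "1", "0"]
print(round(f1_for_label(truth, preds, "1"), 3))
# 0.667
```

Optimizing this value puts the focus on how well sarcastic headlines are found, rather than on overall accuracy across both classes.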
Let’s save the model results and create a function to classify the new data:
# Save the model with optimized hyperparameters and the highest accuracy
model6.save_model('optimized.model')
fastText can also compress the model into a much smaller file, sacrificing only a little performance, through quantization.
model6.quantize(input='train.txt', retrain=True)
We can also simulate new data and test the model against real headlines. For this, we will use the News Aggregator Dataset (https://www.kaggle.com/uciml/news-aggregator-dataset) from Kaggle:
# Loading data
df_headline_test = pd.read_csv('uci-news-aggregator.csv')
# Display headlines
df_headline_test.TITLE.head(3)
Let’s apply the text classification function to the new headlines and create variables with the predicted label and its probability:
# Prepare new data for classification
df_headline_test['TITLE'] = df_headline_test['TITLE'].apply(clean_text)
# Create a function to classify text; it returns the top label
# (without the __label__ prefix) and its probability
def predict_sarcasm(text):
    labels, probabilities = model6.predict(text, k=1)
    return labels[0].replace('__label__', ''), float(probabilities[0])
# Store the predicted label and its probability in separate columns
predictions = df_headline_test['TITLE'].apply(predict_sarcasm)
df_headline_test['label'] = [label for label, _ in predictions]
df_headline_test['probability'] = [probability for _, probability in predictions]
# Display the share of predicted sarcastic and non-sarcastic headlines
df_headline_test.label.value_counts(normalize=True)
OUTPUT
0 0.710827
1 0.289173
We can see that about 29% of the headlines were classified as sarcasm.
In conclusion, it should be noted that fastText is not one of the most recent developments in text classification (the library was published in 2015). At the same time, it is a good starting point for beginners: for NLP text classification tasks of any complexity, the model has a significant advantage thanks to its ease of use, training speed, and automatic tuning of hyperparameters.