This article was published as a part of the Data Science Blogathon.
Suppose you work at an IT firm as a support desk specialist and have to handle hundreds of support tickets every day. The first thing you do with each ticket is classify it into one of the categories you have developed, such as “Credentials expired”, “Operating System Faulty”, “Hardware malfunctioning”, etc. Categorizing every support ticket manually would take a lot of time and effort. With text classification algorithms and Machine Learning, you can automate this task and save many hours of work.
Text classification is a machine learning task that assigns categories to input text. These categories are predefined and customizable; in the example above, “Operating System Faulty”, “Hardware malfunctioning”, and “Credentials expired” are the predefined categories into which you would want your existing and new input data to be sorted.
Source: https://www.pexels.com/photo/assorted-beans-placed-in-rows-on-white-fabric-5913170/
There are various applications of Text Classifications. A few of them include:
and so on…
There are currently various Machine Learning models that are used for Text Classification Problems, such as:
However, these models are most often demonstrated on numeric data. For text classification, we first need to convert the text into numerical data, which is where vectorization comes in. Before moving forward, let us briefly look at these models.
Support Vector Machine (SVM) is a classification and regression algorithm that looks for the decision boundary with the largest margin between classes, which helps it generalize and limits overfitting. SVMs tend to work best on small to medium-sized datasets, because training time grows quickly as the number of samples increases. Common SVM applications include image recognition, Customer Relationship Management (CRM) tools, text classification and extraction, etc.
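The Naive Bayes (NB) algorithm is based on Bayes' theorem and works on the principle of conditional probability, which measures the probability of an event given that another event has occurred. It is called “naive” because it assumes that the features are independent of one another given the class.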
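As a small worked example of conditional probability, the sketch below applies Bayes' theorem to a toy spam filter. All counts are made-up illustrative numbers, not data from this article, and the variable names are only chosen for readability.

```python
# Hypothetical counts for a toy spam filter (illustrative numbers only).
total_emails = 100
spam_emails = 40
spam_with_word = 30   # spam emails that contain the word "free"
ham_with_word = 6     # non-spam emails that contain the word "free"

p_spam = spam_emails / total_emails                       # P(spam)
p_word_given_spam = spam_with_word / spam_emails          # P(word | spam)
p_word = (spam_with_word + ham_with_word) / total_emails  # P(word)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```

A Naive Bayes classifier repeats this kind of calculation for every word in a text and every category, treating the words as independent of each other.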
Logistic Regression is a supervised learning algorithm that helps predict the probability of an event or an outcome. Common Logistic Regression problems involve binary classification of the input data, such as whether an email is spam or not, or whether a person likes hamburgers.
The logistic regression model is based on the logistic function, which is defined as:
Logistic function: f(x) = 1 / (1 + e^(-x))
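To get a feel for the logistic function, here is a minimal Python sketch. The function name logistic is just an illustrative choice; the formula is the one given above.

```python
import math

def logistic(x: float) -> float:
    """Map a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 gives exactly 0.5.
for x in (-4, 0, 4):
    print(x, round(logistic(x), 3))
```

Logistic regression feeds a weighted sum of the input features into this function, so the output can be read as a probability.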
Text Vectorization is a process through which text data are converted into numerical data. Various tools help with vectorization, such as:
In this article, we will focus on Text Classification using a combination of TF-IDF Vectorization and Logistic Regression. Let us first have a brief introduction to TF-IDF Vectorizer and Logistic Regressor.
Using the TF-IDF model, we can quantify the significance of each word in a document based on how often it appears in that document and how rare it is across the whole collection of documents. The result is a composite score representing the word's importance, calculated by multiplying the Term Frequency (TF) factor by the Inverse Document Frequency (IDF) factor.
Term Frequency (TF): This factor is the share of a document's words accounted for by a given word, and is calculated as:
TF = (count of the word in a document) / (total word count of that document)
Inverse Document Frequency (IDF): This factor is the logarithm of the ratio of the total number of documents to the number of documents that contain that particular word. It is calculated as:
IDF = log(total number of documents / number of documents containing the given word)
The higher the TF-IDF value, the more distinctive the word: it occurs often in a given document but rarely across the rest of the collection. The lower the value, the more common the word is. For example, commonly occurring words such as “and,” “the,” and “is” all have a very low TF-IDF value, close to zero.
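The TF and IDF formulas can be tried out by hand on a tiny example. The sketch below is only an illustration: the three titles are made up, the helper names (tf, idf, tf_idf) are hypothetical, words are split on whitespace, and the natural logarithm is assumed.

```python
import math

# Toy corpus (hypothetical titles, for illustration only).
docs = [
    "the lost city",
    "the city of gold",
    "the gold rush",
]
tokenized = [d.split() for d in docs]

def tf(word, tokens):
    # Count of the word in the document / total words in that document
    return tokens.count(word) / len(tokens)

def idf(word, corpus):
    # log(total documents / documents containing the word);
    # assumes the word appears in at least one document
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(word, tokens, corpus):
    return tf(word, tokens) * idf(word, corpus)

first = tokenized[0]
for word in first:
    print(word, round(tf_idf(word, first, tokenized), 3))
```

Because “the” appears in every document, its IDF is log(1) = 0 and so is its TF-IDF score, while “lost”, which appears in only one document, scores highest in the first title.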
Let us take a real-life example of text data and vectorize it using a TF-IDF vectorizer. We will be using Jupyter Notebook and Python for this example. So let us first import the necessary libraries in Jupyter.
import pandas as pd
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import _stop_words
TfidfVectorizer is the class we need, imported from sklearn.feature_extraction.text. The _stop_words module holds the list of common English words (stop words) that the vectorizer can filter out.
For this example, we will use the publicly available Internet Movie Database (IMDB) movie titles and genres dataset, which can be downloaded from here. I have downloaded the file named “title.basics.tsv.gz”. This is a huge file, around 150MB, with millions of rows. For simplicity, I have taken only the initial 1000+ entries of the dataset and split them into two files: the first 1028 rows form the training dataset (just a random number, no logic behind 1028), called imdb_train.csv, and the remaining 18 entries form imdb_test.csv. We will first train on the training dataset and then test our model on the unseen test dataset, letting the model classify the 18 movies into their genres. Finally, we will check how our model did by comparing a randomly chosen movie's predicted genre with its actual genre.
Let us load and display the training dataset as follows:
Python Code:
import pandas as pd
train_data = pd.read_csv('imdb_train.csv')
print(train_data.shape)
print(train_data.head())
print(train_data['genres'].unique())
We have 1058 movie titles along with their genres. There are 17 different genres in which 1058 movies are classified.
We will first collect the movie titles into an array that serves as our corpus.
corpus = train_data['primaryTitle'].values
corpus
Then, we will vectorize our corpus:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
First, we create the vectorizer object with vectorizer = TfidfVectorizer(stop_words='english'). Next, we convert the input text into a TF-IDF matrix with X = vectorizer.fit_transform(corpus). Finally, we print the words (features) selected for the TF-IDF matrix.
Vector_Text = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
Vector_Text['originalText'] = pd.Series(corpus)
Vector_Text
In the previous step, we visualized the document-term matrix built with TF-IDF. Now let us add the genres column back to the vectorized table.
ML_Data = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
ML_Data['genres'] = train_data['genres']
ML_Data.head()
Now let us separate the predictor columns from the target column:
Target = ML_Data.columns[-1]
Predictors = ML_Data.columns[:-1]
X = ML_Data[Predictors].values
y = ML_Data[Target].values
The Logistic Regression model helps estimate an event's probability based on a set of independent variables. We could try other models for classification, such as Naive Bayes or Decision Trees, but for simplicity, we are using Logistic Regression here. Readers are encouraged to try the other models and comment if those models produce a better result.
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

clf = LogisticRegression(C=5, solver='newton-cg', penalty='l2')
LOG = clf.fit(X, y)

# Predict on the same rows the model was trained on
pred = LOG.predict(X)

Test_Data = pd.DataFrame(data=X, columns=Predictors)
Test_Data['TargetVariable'] = y
Test_Data['Prediction'] = pred
print(Test_Data.head())

print(metrics.classification_report(y, pred))
# Rows are actual genres, columns are predicted genres
print(metrics.confusion_matrix(y, pred))

# Weighted F1 score, computed on the training data
F1_Score = metrics.f1_score(y, pred, average='weighted')
print('Weighted F1 score of the model on the training data:', round(F1_Score, 2))
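The report above is computed on the same rows the model was trained on, so it tends to look better than the model will do on new titles. One common way to get a more realistic estimate is to hold out part of the data before training. The sketch below shows the idea with a scikit-learn pipeline; the eight titles and their genres are made up for illustration, and names such as model and val_f1 are hypothetical. With the real data you would pass train_data['primaryTitle'] and train_data['genres'] instead.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical mini dataset (illustration only).
titles = [
    "A Quiet Family Story", "Letters from Home",
    "The Long Winter", "A Mother's Promise",
    "The Funny Neighbors", "Laughing All Day",
    "A Silly Wedding", "The Clumsy Waiter",
]
genres = ["Drama"] * 4 + ["Comedy"] * 4

# Keep 25% of the rows aside; stratify keeps both genres in each split
titles_train, titles_val, genres_train, genres_val = train_test_split(
    titles, genres, test_size=0.25, random_state=42, stratify=genres
)

# The pipeline fits the vectorizer on the training titles only
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(C=5, solver="newton-cg"),
)
model.fit(titles_train, genres_train)

val_pred = model.predict(titles_val)
val_f1 = metrics.f1_score(genres_val, val_pred, average="weighted")
print("Weighted F1 on held-out titles:", round(val_f1, 2))
```

Wrapping the vectorizer and the classifier in one pipeline also means raw titles can be passed straight to model.predict, without a separate transform step.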
In this step, we will load the test dataset and see how well our model predicts the movies' genres. We will define a function that converts the input titles into numeric vectors and returns the predicted genre for each one.
def genres_test(input_text):
    # Reuse the vectorizer that was fitted on the training titles
    X = vectorizer.transform(input_text)
    # clf is the LogisticRegression model fitted in the previous step
    Prediction = clf.predict(X)
    Result = pd.DataFrame(data=input_text, columns=['title'])
    Result['Prediction'] = Prediction
    return Result
Now, let's call the function:
movie_name = ["Flores y perlas"]
predicted_genre = genres_test(input_text=movie_name)
predicted_genre
Now let us compare the predicted genre with the original genre of the same title in our dataset.
test_data = pd.read_csv('imdb_test.csv')
test_data
We can see in row number 9 that the actual genre of the movie “Flores y perlas” is also “Drama”.
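A single predicted label does not show how confident the model is. Logistic Regression can also return a probability for every genre through predict_proba, which makes it possible to list the most likely genres for a title. The following is a minimal sketch under stated assumptions: the six training titles and their genres are made up, and the names toy_vec, toy_clf, and top_genres are hypothetical helpers, not part of the article's code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini dataset (illustration only).
toy_titles = [
    "A Quiet Family Story", "Letters from Home",
    "The Funny Neighbors", "A Silly Wedding",
    "Journey Across the Desert", "Voyage to the Lost Island",
]
toy_genres = ["Drama", "Drama", "Comedy", "Comedy", "Adventure", "Adventure"]

toy_vec = TfidfVectorizer(stop_words="english")
toy_clf = LogisticRegression(C=5, solver="newton-cg")
toy_clf.fit(toy_vec.fit_transform(toy_titles), toy_genres)

def top_genres(title, k=2):
    """Return the k most probable genres for one title, highest first."""
    probs = toy_clf.predict_proba(toy_vec.transform([title]))[0]
    order = np.argsort(probs)[::-1][:k]
    return [(str(toy_clf.classes_[i]), float(probs[i])) for i in order]

print(top_genres("A Silly Family Wedding"))
```

When the top two probabilities are close, the prediction deserves less trust than when one genre clearly dominates.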
In this article, we started by defining what Text Classification is in the field of Machine Learning and what its applications are. Then, we saw how text classification is carried out by first vectorizing the text data using a vectorizer such as Word2Vec, Bag of Words, or TF-IDF, and then applying a classical classification method, such as Naive Bayes, Decision Trees, or Logistic Regression.
We used a reduced IMDB movies dataset with just the movie titles and their genres. We trained the model on a portion of the dataset and then gave it new, unseen titles to predict their genres; for the title we checked, the predicted genre matched the actual one.
Key takeaways from this article are:
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.