We can use Machine Learning to categorize such harassment incidents without manual human intervention. Traditionally, if a report needs to be filed regarding a harassment case, the victim needs to provide a description of the case and fill out several forms. We can automate parts of this process by categorizing incidents based on their descriptions and then performing further processing based on the category of each case. This automatic classification of different forms of sexual harassment will help the authorities as well as the concerned organizations to partially automate the systems that file reports of such incidents.
Here our objective is to use short stories which were submitted online and to automatically categorize each story submitted by a user. We will consider the user stories as our training data. We will try to build a Machine Learning model which takes these stories/descriptions as input and predicts the categories of harassment each one belongs to. The main thing to keep in mind here is that a description may belong to multiple categories. For example, a description could indicate both Commenting and Groping cases.
Data-Source : https://github.com/swkarlekar/safecity
Authors: Sweta Karlekar & Mohit Bansal, University of North Carolina at Chapel Hill
The dataset has 7201 training samples, 990 validation samples, and 1701 test samples. Each data sample consists of a Description of the incident followed by whether it belongs to classes Commenting, Staring, and Groping. As it is a Multi-Label classification problem, each data point can belong to multiple classes.
Dataset Example
Initially, this problem is a Multi-Label Classification problem, but later we will map it to a Multi-Class classification problem, hence we can use multi-class classification metrics for evaluating our model. We use the following metrics :
LogLoss: The average negative log of the probability the model assigns to the correct class label. Log loss ranges from 0 to infinity, and a lower value means the model assigns higher probability to the correct classes. It is useful in our case because it evaluates the predicted probabilities across all classes rather than only the final predicted label.
Precision: Out of all the points predicted to belong to a certain class, the fraction that actually belong to that class.
Recall: Out of all the points that actually belong to a class, the fraction that were correctly classified.
F1-Score: The harmonic mean of Precision and Recall, which is high only when both precision and recall are high. It ranges from 0 (worst) to 1 (best).
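To make these definitions concrete, here is a minimal sketch that computes each metric by hand on a tiny made-up example. The function names and the toy probabilities are illustrative assumptions, not the project's own code; in practice a library such as scikit-learn would normally be used.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Average negative log probability assigned to the correct class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

def precision_recall_f1(y_true, y_pred, cls):
    """Precision, recall and F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data: 4 samples, 3 classes (values are made up for illustration)
y_true = [0, 1, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.5, 0.3],
]
# Predicted class = index of the highest probability in each row
y_pred = [max(range(3), key=lambda k: row[k]) for row in probs]

print("log loss:", round(log_loss(y_true, probs), 4))
print("class 1 (precision, recall, F1):", precision_recall_f1(y_true, y_pred, 1))
```

Note how the third sample is misclassified but still contributes a finite penalty to the log loss, because the model gave the true class a probability of 0.3 rather than 0.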
First, let’s have a look at the words of each description and how they can help us to gain insights into the classification. We find the top 15 most frequent words for each category and plot them.
We count the number of labels (no_of_labels) each data point belongs to. This leads to the following observations :
We find the top 15 most frequent words for each class label and plot their frequencies. For a visual representation, we also plot a word cloud using the text corpus for each label.
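The word counting step could be sketched as follows. This is a simplified illustration using only the standard library: the helper name `top_words` and the sample sentences are assumptions, tokenization is a plain whitespace split, and stop-word removal and plotting are left out.

```python
from collections import Counter

def top_words(descriptions, labels, k=15):
    """Return the k most frequent lowercase tokens among descriptions whose label flag is 1."""
    counts = Counter()
    for text, flag in zip(descriptions, labels):
        if flag == 1:
            counts.update(text.lower().split())
    return counts.most_common(k)

# Made-up descriptions and a made-up "Staring" label column
descriptions = [
    "a man was staring at me on the bus",
    "he kept staring and commenting loudly",
    "someone touched me on the bus",
]
staring = [1, 1, 0]

print(top_words(descriptions, staring, k=3))
```

The returned list of (word, count) pairs can be passed directly to a bar plot.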
Observations
Similarly, we can also have a look at the most frequent bigrams (2-words)
We count the number of words in each description and plot the distribution as well as the percentile values.
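A minimal sketch of the percentile computation, assuming NumPy is available. The helper name and the chosen percentiles are illustrative, not taken from the original notebook.

```python
import numpy as np

def word_count_percentiles(descriptions, percentiles=(0, 25, 50, 75, 90, 99, 100)):
    """Map each requested percentile to the corresponding word count."""
    counts = np.array([len(text.split()) for text in descriptions])
    values = np.percentile(counts, percentiles)
    return dict(zip(percentiles, values.tolist()))

# Made-up descriptions with 2, 4 and 6 words
sample = ["he stared", "a man followed me", "someone kept commenting on my clothes"]
print(word_count_percentiles(sample))
```

High percentiles (such as the 99th) are a convenient way to spot unusually long descriptions.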
We first used TF-IDF vectorizer to convert word data into a numerical representation.
1. We first use PCA to reduce the dimensions to 100
2. We then use t-SNE to reduce and visualize the data in 2 dimensions
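The two steps above can be sketched with scikit-learn as shown here. A random matrix stands in for the dense TF-IDF features, and the function name, perplexity value, and random seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_to_2d(X, n_pca=100, random_state=0):
    """Reduce dense features with PCA, then embed the result in 2-D with t-SNE."""
    # PCA cannot keep more components than samples or features
    n_pca = min(n_pca, X.shape[0], X.shape[1])
    X_pca = PCA(n_components=n_pca, random_state=random_state).fit_transform(X)
    tsne = TSNE(n_components=2, perplexity=30, random_state=random_state)
    return tsne.fit_transform(X_pca)

# Random matrix standing in for dense TF-IDF features
rng = np.random.default_rng(0)
X_demo = rng.random((200, 150))
X_2d = project_to_2d(X_demo)
print(X_2d.shape)
```

Running PCA first keeps t-SNE tractable, since t-SNE becomes slow on very high-dimensional input such as a full TF-IDF vocabulary.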
As we can see from the plot below, we cannot distinguish the classes using 2D t-SNE representations.
We use the following sequence of steps to clean the data
We try to extract features from the text descriptions which can help us further in classification. These are the features extracted
import numpy as np

# Find the average word length of each description
df_train["avg_word_length"] = df_train["Description"].apply(
    lambda x: np.mean([len(word) for word in x.split(" ")])
)
# Count the total number of words in each description
df_train["word_count"] = df_train["Description"].apply(lambda x: len(x.split(" ")))
To calculate these Text Scoring metrics we use the Textstat python library which provides functions to calculate the scores.
We can use the count_POS function used below to count the number of nouns, adverbs, verbs, adjectives, and pronouns in a given text.
In order to convert our text descriptions into numerical representations, we use the TF-IDF Weighted GloVe Embeddings technique using the following steps
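One common way to compute a TF-IDF weighted embedding is to average the word vectors of a description, weighting each word by its TF-IDF score. The sketch below illustrates that idea under several assumptions: the helper names are hypothetical, tiny 3-dimensional vectors stand in for the 300-dimensional GloVe vectors, and the smoothed IDF formula is the one scikit-learn uses by default. It is not necessarily the exact procedure used in the project.

```python
import math
from collections import Counter

import numpy as np

def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for each word in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

def weighted_embedding(tokens, vectors, idf, dim):
    """TF-IDF weighted average of the word vectors of one document."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for word, count in Counter(tokens).items():
        if word in vectors and word in idf:  # skip out-of-vocabulary words
            weight = count * idf[word]
            total += weight * vectors[word]
            weight_sum += weight
    return total / weight_sum if weight_sum else total

# Toy 3-dimensional vectors standing in for GloVe embeddings
vectors = {
    "bus": np.array([1.0, 0.0, 0.0]),
    "staring": np.array([0.0, 1.0, 0.0]),
    "man": np.array([0.0, 0.0, 1.0]),
}
docs = [["man", "staring", "bus"], ["bus", "bus"]]
idf = idf_weights(docs)
emb = weighted_embedding(docs[0], vectors, idf, dim=3)
print(emb)
```

Because "bus" appears in every toy document, it receives the lowest IDF and contributes least to the first document's embedding.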
To make modelling easier, and because we have only 3 labels, we pose this as a multi-class problem. A data point can belong to any combination of the 3 labels (including none of them), so there are 2³ = 8 possible combinations. We treat each combination as its own class, which converts the problem into a multi-class classification problem with 8 classes. We then convert our dataset labels according to these new classes.
Class 0: None (No Harassment)
Class 1: Staring
Class 2: Groping
Class 3: Commenting
Class 4: Commenting & Staring
Class 5: Staring & Groping
Class 6: Commenting & Groping
Class 7: Staring,Commenting & Groping
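The conversion from the three binary labels to the eight classes listed above can be sketched as a simple lookup. The dictionary and function names are illustrative assumptions; only the class numbering is taken from the list.

```python
# Keys are (Commenting, Staring, Groping) flags; values follow the class list above.
LABELS_TO_CLASS = {
    (0, 0, 0): 0,  # None (No Harassment)
    (0, 1, 0): 1,  # Staring
    (0, 0, 1): 2,  # Groping
    (1, 0, 0): 3,  # Commenting
    (1, 1, 0): 4,  # Commenting & Staring
    (0, 1, 1): 5,  # Staring & Groping
    (1, 0, 1): 6,  # Commenting & Groping
    (1, 1, 1): 7,  # Staring, Commenting & Groping
}

def to_multiclass(commenting, staring, groping):
    """Map one row of binary labels to its multi-class label."""
    return LABELS_TO_CLASS[(commenting, staring, groping)]

print(to_multiclass(commenting=1, staring=0, groping=1))
```

With a pandas DataFrame, the same function could be applied row by row to the three label columns to produce the new target column.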
We can see that the largest share of the data belongs to Class 0, and that Classes 0, 1 and 2 have relatively more samples than the remaining classes. This imbalance needs to be taken into consideration when we evaluate our predictions after modeling.
At the end of Feature engineering, we stack all the newly created features together. For each data point, we have 311 features and we get Train Data shape: (8189, 311) and Test Data Shape: (1701, 311)
311 Features = TF-IDF Weighted GloVe Embeddings (300) + Word_Count + Average_Word_Length + Difficulty Score + flesch_reading_ease + coleman_liau_index + flesch_kincaid_grade + noun_count + pronoun_count + verb_count + adjective_count + adverb_count
To report the evaluation metrics, we define a function plot_metrics(). It takes the trained model as an argument and prints all the needed metrics.
As the initial step, we will build a Dummy Classifier which acts as a benchmark for evaluating other models. We use the DummyClassifier model provided by sklearn which randomly classifies data.
We get a Log Loss of 26.80 for the test dataset. This value can be used as a baseline: if any model gives us a loss greater than this, its performance is even worse than that of a random model.
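A baseline of this kind can be sketched with scikit-learn as follows. The data here is random and purely illustrative, and `strategy="stratified"` is an assumption, since the text only says the classifier predicts randomly. With that strategy, `predict_proba` returns hard one-hot rows, so every wrong guess is penalized as if the true class had near-zero probability, which is one plausible reason a random baseline produces such a large log loss.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import log_loss

# Random stand-in data with 8 classes (not the real dataset)
rng = np.random.default_rng(0)
X_train_demo = rng.random((400, 5))
y_train_demo = rng.integers(0, 8, 400)
X_test_demo = rng.random((100, 5))
y_test_demo = rng.integers(0, 8, 100)

# Assumed strategy: sample predictions from the training class distribution
dummy = DummyClassifier(strategy="stratified", random_state=0)
dummy.fit(X_train_demo, y_train_demo)

proba = dummy.predict_proba(X_test_demo)
baseline = log_loss(y_test_demo, proba, labels=dummy.classes_)
print("baseline log loss:", baseline)
```

Any real model should score well below this number on the same test split.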
Logistic Regression
We train a Logistic Regression model with the dataset. Hyperparameter tuning is done using RandomizedSearchCV which gives us the best model with the optimal parameters.
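A minimal sketch of such a search is shown here. The search space (a log-uniform range for `C`), the number of iterations, and the random stand-in data are all assumptions made for illustration; the article does not state which parameters were tuned.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Random stand-in data with 3 classes (not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
y_demo = rng.integers(0, 3, 300)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # assumed search space
    n_iter=5,
    scoring="neg_log_loss",  # scikit-learn maximizes scores, so log loss is negated
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("cross-validated log loss:", -search.best_score_)
```

The fitted `search` object behaves like the best estimator, so it can be passed directly to a metrics function that calls `predict` and `predict_proba`.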
From the results, we can see that on test data our Logistic Regression model performs a lot better than our Dummy Classifier. But, looking at the Precision Matrix we can see that the model favors the majority classes. A higher number of samples belonging to the majority classes were predicted correctly. Very few points were predicted to belong to classes 5, 6, and 7 which are the minority classes.
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Search over tree depth and number of trees.
# LogLoss is assumed to be a scorer object defined elsewhere in the project.
params = {"max_depth": randint(1, 50), "n_estimators": randint(10, 200)}
xgb_hyp = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=params,
    verbose=1,
    n_jobs=-1,
    scoring=LogLoss,
)
xgb_hyp.fit(X_train, y_train)
print_metrics(xgb_hyp)
We observe that with XGBoost the test LogLoss drops from 1.5633 (Logistic Regression) to 1.3763. The precision matrix also indicates that we predict the classes better and that the majority classes are not favored as much. Even classes 5, 6, and 7 are predicted better than with the Logistic Regression model.
Exploring other models
We tried using other ML models to check if we can reduce the loss further. For each algorithm, we performed hyperparameter tuning to get the best parameters. The models we experimented with are: XGBoost, LightGBM, CatBoost, Logistic Regression, SVM, Random Forest, Naive Bayes and Decision Tree.
Ensemble models like XGBoost, LightGBM & CatBoost performed significantly better than the other algorithms.
As we noticed, the majority of the samples belong to Class 0. Because of this, the confusion matrices of the trained models show that the predictions favor Class 0. To address this, we try to build a model which follows a 2-step classification approach.
Prediction Steps: We first pass the sample to Model 1, which checks whether it belongs to Class 0 or not. If it doesn’t belong to Class 0, we pass the sample to Model 2, which predicts which of Classes 1 to 7 it belongs to.
We implement this model as a TwoStepClassifier class. It has these methods:
1. load_models(model1, model2) : We need to pass the trained model1 & model2 to this function.
2. predict_proba() and predict() are used for prediction
3. get_metrics() : prints the metrics
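The structure of such a class could look roughly like the sketch below. It is not the project's actual implementation: get_metrics() is omitted, the 0.5 threshold for Model 1 and the rule used to combine the two sets of probabilities are assumptions, and `FixedProba` is a hypothetical stand-in for trained models so that the example runs on its own.

```python
import numpy as np

class TwoStepClassifier:
    """Model 1 separates Class 0 from the rest; Model 2 separates Classes 1 to 7."""

    def load_models(self, model1, model2):
        self.model1 = model1  # binary: column 0 = P(class 0), column 1 = P(not class 0)
        self.model2 = model2  # 7 columns: P(class k | not class 0) for k = 1..7

    def predict_proba(self, X):
        p1 = self.model1.predict_proba(X)
        p2 = self.model2.predict_proba(X)
        # Assumed combination rule: P(k) = P(not class 0) * P(k | not class 0)
        return np.hstack([p1[:, :1], p1[:, 1:2] * p2])

    def predict(self, X):
        p1 = self.model1.predict_proba(X)
        p2 = self.model2.predict_proba(X)
        is_class0 = p1[:, 0] >= 0.5  # assumed decision threshold for Model 1
        # Model 2's column index i corresponds to class i + 1
        return np.where(is_class0, 0, p2.argmax(axis=1) + 1)

class FixedProba:
    """Hypothetical stand-in for a trained model: returns fixed probabilities."""

    def __init__(self, proba):
        self.proba = np.asarray(proba, dtype=float)

    def predict_proba(self, X):
        return self.proba[: len(X)]

clf = TwoStepClassifier()
clf.load_models(
    FixedProba([[0.9, 0.1], [0.2, 0.8]]),
    FixedProba([[1 / 7] * 7, [0.05, 0.05, 0.6, 0.1, 0.1, 0.05, 0.05]]),
)
X_demo = np.zeros((2, 3))
print(clf.predict(X_demo))
print(clf.predict_proba(X_demo).round(3))
```

Combining the probabilities by multiplication keeps each row summing to 1, so the output can be scored with log loss in the same way as the single-model classifiers.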
When we compare all the models, XGBoost and LightGBM perform significantly better than the other algorithms. Our custom 2-Step-Classifier also performed better than most algorithms, but not as well as XGBoost & LightGBM.
To provide an intuitive interface for users to try out our model, we have deployed the model on Heroku. The user can enter a short description of the harassment incident. The entered text is pre-processed and fed into the ML model, which predicts the category to which the incident belongs. Along with the prediction, we also display the predicted probability of the incident belonging to each class, for further inspection if needed. For quick testing, there is also an option to pick a random sentence from a set of sentences and predict its category, which gives a quick look at how the app works.
For the development of the frontend, we have used the Streamlit library, which makes developing a UI for ML apps simpler and more intuitive. You can find the link below along with the Github repo of the project.
Link: https://safe-city-clf.herokuapp.com/
Github Repo: https://github.com/AshishSalaskar1/Safe-City-Classification