Assorting and Locating Varied Forms of Sexual Harassment

yashaswi kakumanu Last Updated : 17 Aug, 2023

Introduction

Did you know that low reporting rates are one reason sexual harassment remains so prevalent? If victims do not report the harassment they have experienced, how can authorities help protect people from it, and how can offenders' behavior change? The case study Assorting and Locating Varied Forms of Sexual Harassment lets victims describe their experiences anonymously and categorizes the types of sexual harassment they describe. This speeds up identifying the category when filing a report, and analysis of the stories already filed on forums helps in providing safety precautions.

These safety precautions give individuals a heads-up by highlighting the locations where each type of sexual harassment is most frequently reported and the typical behavior of offenders. In the future, these predictions can benefit individuals by providing insights and raising awareness of the circumstances in which such events occur.

Learning Objectives

Predicting multi-label classification of various forms of harassment in society
Utilizing natural language processing techniques on the dataset
Iterating over traditional machine learning algorithms
Implementing convolutional neural networks
The blog discusses the application of these methods to address harassment-related issues

This article was published as a part of the Data Science Blogathon.

Table of contents

Introduction
Business Problem
Business Constraints
Dataset Description
Performance Metric
Preprocessing
Exploratory Data Analysis
Checking Null Values in the Dataset
Geographical Plot
Bar Plot
TSNE
Word Cloud
Machine Learning Models
Deep Learning Models
Deployment of Model
Conclusion
Future Work
Frequently Asked Questions

Business Problem

Here, victims' stories are categorized into three types of sexual harassment. We treat this as a multi-label classification problem because a victim can face one or more types of sexual harassment at the same time.

Business Constraints

Because this case study is a multi-label classification problem, a misclassification is no longer strictly right or wrong. A prediction containing a subset of the actual classes should be considered better than a prediction that contains none of them; for example, predicting two of the three labels correctly is better than predicting none of them correctly. We do not have any strict latency concerns. Interpretability is very important because it helps explain why a story is classified as a particular type of harassment.

Dataset Description

The data was collected from the Safecity online forum and the WIN World Survey (WWS), a market research and polling survey, used here for data on countries where sexual harassment is predominant. The dataset contains two features: Description, which contains the victims' stories, and Location, which contains the geolocation where the event took place.

The class label is multi-label and covers three types of sexual harassment that a victim may have experienced: Commenting, Ogling, and Groping.

[Figure: dataset description]

Performance Metric

In multi-label classification, the prediction for an instance is a set of labels, so a prediction can be fully correct, partially correct, or fully incorrect. This makes evaluating a multi-label classifier more challenging than evaluating a single-label classifier. To account for partial correctness, we can use the metrics below.

Accuracy: for a single instance, accuracy is the number of correctly predicted labels divided by the total number of labels in the union of the predicted and actual label sets. Overall accuracy is the average across all instances.

Metrics such as precision, recall, and F1-score can be computed on individual class labels and then averaged over all classes. This is termed macro averaging. Alternatively, we can compute them globally over all instances and all class labels. This is termed micro averaging.

We use Macro F1-score and Micro F1-score as metrics for multi-label classification.

Hamming loss is also used as a metric for multi-label classification; it is the proportion of incorrectly predicted labels to the total number of labels.
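To make these definitions concrete, here is a minimal pure-Python sketch that computes them on a tiny, made-up set of label vectors (columns ordered as commenting, ogling, groping). The function names and the toy data are illustrative assumptions, not the code used in this case study; in practice a library such as scikit-learn provides equivalent metrics.

```python
from typing import List, Tuple

Rows = List[List[int]]


def hamming_loss(y_true: Rows, y_pred: Rows) -> float:
    """Fraction of individual label slots that were predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def example_accuracy(y_true: Rows, y_pred: Rows) -> float:
    """Per-instance |intersection| / |union| of label sets, averaged over instances."""
    scores = []
    for tr, pr in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(tr, pr) if t == 1 and p == 1)
        union = sum(1 for t, p in zip(tr, pr) if t == 1 or p == 1)
        # Assumption: an instance with no true and no predicted labels counts as correct.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true: Rows, y_pred: Rows) -> Tuple[float, float]:
    """Return (macro F1, micro F1) for binary indicator label matrices."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for tr, pr in zip(y_true, y_pred) if tr[j] == 1 and pr[j] == 1)
        fp = sum(1 for tr, pr in zip(y_true, y_pred) if tr[j] == 0 and pr[j] == 1)
        fn = sum(1 for tr, pr in zip(y_true, y_pred) if tr[j] == 1 and pr[j] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores.
    macro = sum(_f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = _f1(sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts))
    return macro, micro


# Hypothetical toy data: 4 stories, 3 labels each.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]

macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Accuracy:", example_accuracy(y_true, y_pred))
print("Macro F1:", macro_f1, "Micro F1:", micro_f1)
```

Note how the second and third stories are only partially correct: they lower the scores without being counted as complete misses.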

Preprocessing

To obtain better insights, we first clean the data (removing symbols, punctuation, special characters, and so on). For text data, cleaning or preprocessing is as important as model building.

Below are the preprocessing steps we need to perform:

Lower casing
Removal of digits
Removal of punctuation
Removal of special characters
Removal of HTML tags
Removal of stop words
Expanding contractions
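The steps above can be sketched as a single cleaning function. This is a minimal illustration using only the standard library; the contraction rules and the stop-word set are small hypothetical samples (a real pipeline would likely use a fuller contraction map and a standard stop-word list), and the order of the steps is an assumption.

```python
import re

# Hypothetical, deliberately small contraction rules (applied in order).
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ve", " have"),
    (r"'ll", " will"),
    (r"'m", " am"),
]

# Hypothetical, deliberately small stop-word set.
STOP_WORDS = {
    "a", "an", "the", "i", "me", "my", "he", "she", "it", "they",
    "was", "were", "is", "are", "and", "or", "to", "of", "in", "on", "at",
}


def clean_story(text: str) -> str:
    """Apply the listed preprocessing steps to one victim story."""
    text = text.lower().replace("\u2019", "'")   # lower casing, normalise apostrophes
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags
    for pattern, replacement in CONTRACTIONS:     # expand contractions
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"\d+", " ", text)              # remove digits
    text = re.sub(r"[^a-z\s]", " ", text)         # remove punctuation and special characters
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]  # remove stop words
    return " ".join(tokens)


print(clean_story("<p>He didn't stop staring at me on Bus 42!</p>"))
```

Contractions are expanded before punctuation is stripped; otherwise the apostrophes needed to recognise them would already be gone.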

Exploratory Data Analysis

It is important to ensure that the data is ready for modelling. Exploratory Data Analysis (EDA) helps confirm that the data is ready for machine learning and makes it more usable. Without proper EDA, models often suffer from accuracy issues, and algorithms may not work at all. EDA also helps us understand the data and get better insights, so we perform it next.

Checking Null Values in the Dataset

df.isnull().sum()

We add an extra feature to the data frame that holds the number of words in each victim story, and then plot the distribution of this word count column.

[Figure: distribution plot of story word counts]

From the above plot, we can deduce that most victims share their experiences in under 100 words.
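As a rough sketch of the null check and the word-count feature described in this section, the snippet below uses a tiny made-up data frame. The column names `Description` and `Location` follow the dataset description; the rows, and the name `word_count` for the new column, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the real data has the same two feature columns.
df = pd.DataFrame({
    "Description": [
        "A man followed me at the station",
        "Boys were commenting near college",
        None,  # a missing story, to show what the null check reports
    ],
    "Location": ["Mexico", "Mexico", "Mexico"],
})

# Count missing values per column.
null_counts = df.isnull().sum()

# Number of whitespace-separated words per story; missing stories count as 0 words.
df["word_count"] = df["Description"].fillna("").str.split().str.len()

print(null_counts)
print(df[["Description", "word_count"]])
```

The `word_count` column can then be passed to any histogram or distribution plotting routine to produce a plot like the one discussed above.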

Geographical Plot

From the data frame, we take the Location column and count the number of times victims have reported harassment in each region. For the geographical plot, we construct a data frame of countries (the regions where sexual harassment was experienced) and the count of victims who reported from each one.

[Figure: geographical plot of reported cases by country]

From the above plot, we can deduce that the highest number of victim reports comes from the Mexico region (the brighter yellow region).
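The country-level data frame behind such a plot could be built roughly as follows. The location values are made-up placeholders (only Mexico is named in this article), and the column names `country` and `count` are assumptions; the resulting frame is what a choropleth plotting function would typically consume.

```python
import pandas as pd

# Hypothetical Location values; "Country B" and "Country C" are placeholders.
locations = pd.Series(
    ["Mexico", "Country B", "Mexico", "Country C", "Mexico", "Country B"],
    name="Location",
)

# One row per country with the number of reports from it, most frequent first.
region_counts = (
    locations.value_counts()
    .rename_axis("country")
    .reset_index(name="count")
)

print(region_counts)
```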

Bar Plot

We use bar plots to check the number of victim stories in each category.

We are creating a column ‘label’ as follows:

We label 1 when the person experiences only commenting harassment
We label 2 when the person experiences only ogling harassment
We label 3 when the person experiences only groping harassment
We label 4 when the person experiences only commenting and ogling harassment
We label 5 when the person experiences only ogling and groping harassment
We label 6 when the person experiences only commenting and groping harassment
We label 7 when the person doesn’t experience any harassment
We label 8 when the person experiences all three types of harassment at the same time
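The labelling scheme above amounts to a lookup from the three binary indicators to a single code. A minimal sketch, assuming each story carries 0/1 flags for commenting, ogling, and groping in that order (the names `LABEL_CODES` and `assign_label` are illustrative):

```python
from typing import Dict, Tuple

# (commenting, ogling, groping) -> label code, following the list above.
LABEL_CODES: Dict[Tuple[int, int, int], int] = {
    (1, 0, 0): 1,  # only commenting
    (0, 1, 0): 2,  # only ogling
    (0, 0, 1): 3,  # only groping
    (1, 1, 0): 4,  # commenting and ogling
    (0, 1, 1): 5,  # ogling and groping
    (1, 0, 1): 6,  # commenting and groping
    (0, 0, 0): 7,  # no harassment
    (1, 1, 1): 8,  # all three types
}


def assign_label(commenting: int, ogling: int, groping: int) -> int:
    """Map the three binary indicators of one story to its label code."""
    return LABEL_CODES[(commenting, ogling, groping)]


# Hypothetical rows of (commenting, ogling, groping) flags.
rows = [(1, 0, 0), (0, 1, 1), (1, 1, 1), (0, 0, 0)]
labels = [assign_label(*row) for row in rows]
print(labels)  # prints [1, 5, 8, 7]
```

Counting how often each code occurs gives the heights of the bars in the category bar plot.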

From the above bar plot, we can observe that women in Mexico have reported the highest number of sexual harassment incidents. We also need a clear intuition of the words that occur frequently in each category. Below are the bar plots of the most common unigrams, bigrams, and trigrams for each category.
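The counts behind such n-gram bar plots can be obtained with a few lines of standard-library Python. This is a sketch under the assumption that stories are already cleaned and whitespace-tokenizable; the helper name `top_ngrams` and the two example stories are made up.

```python
from collections import Counter
from typing import List, Tuple


def top_ngrams(stories: List[str], n: int, k: int = 3) -> List[Tuple[str, int]]:
    """Return the k most common n-grams (as strings) across all stories."""
    counts: Counter = Counter()
    for story in stories:
        tokens = story.split()
        # zip over n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return [(" ".join(gram), count) for gram, count in counts.most_common(k)]


# Hypothetical cleaned stories from a single category.
stories = [
    "man touched me on the bus",
    "man touched me at the station",
]

print(top_ngrams(stories, n=1))
print(top_ngrams(stories, n=2))
print(top_ngrams(stories, n=3, k=1))
```

Running this separately on the stories of each category gives the per-category unigram, bigram, and trigram frequencies to plot.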

Commenting Category

Ogling Category

Groping Category

TSNE

We vectorize the victim stories and then apply dimensionality reduction so that the harassment categories can be visualized easily.

t-SNE is stochastic, so multiple runs produce different visualizations. I therefore tried multiple perplexity values and iteration counts to obtain the above plot, which clearly indicates that the classes can be segregated from each other.
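A minimal sketch of this vectorize-then-reduce step with scikit-learn is shown below. The choice of TF-IDF as the vectorizer, the toy stories, and the perplexity values are assumptions for illustration; the article does not state which settings produced the final plot. Fixing `random_state` makes a single run repeatable, while looping over perplexities mirrors the trial-and-error described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Hypothetical cleaned stories.
stories = [
    "boys passed comments bus stop",
    "man kept staring street",
    "someone touched crowded bus",
    "guys commenting near college",
    "man stared walking home",
    "man groped crowded train",
]

# Dense TF-IDF matrix of shape (n_stories, n_terms).
vectors = TfidfVectorizer().fit_transform(stories).toarray()

# Perplexity must be smaller than the number of samples, hence the small values here.
embeddings = {}
for perplexity in (2, 3):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(vectors)

for perplexity, points in embeddings.items():
    print(perplexity, points.shape)
```

Each entry of `embeddings` holds one 2-D point per story, which can be scattered and coloured by harassment category.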

Word Cloud

We also built word clouds to visualize the most frequent words in each category.

Commenting Category

From the above, we can deduce that for commenting, most of the offenders were boys, and the events usually took place at college, at the station, on the bus, or at school.

Ogling Category

From the above, we can deduce that for ogling, most of the offenders were guys, and the events usually took place on the streets while the victims were walking, passing by, or going to college.

Groping Category

From the above, we can deduce that for groping, most of the offenders were men, and the events usually took place in crowded public places such as buses and stations while the victims were traveling.

Scatter Text

We use Scattertext to visualize the distinctive terms and their frequencies. A Scattertext plot treats the data as a binary classification problem over a categorical column, so we create a separate column with categorical values for each harassment type.
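The per-type categorical columns could be created along these lines. This sketch only prepares the data and does not call the plotting library itself; the column names (`Commenting`, `commenting_category`, and so on) and the category strings are assumptions.

```python
import pandas as pd

# Hypothetical rows with 0/1 indicator columns for each harassment type.
df = pd.DataFrame({
    "Description": ["boys passed comments bus stop", "man kept staring street"],
    "Commenting": [1, 0],
    "Ogling": [0, 1],
    "Groping": [0, 0],
})

# One categorical column per type, e.g. "commenting" vs "non commenting".
for harassment_type in ["Commenting", "Ogling", "Groping"]:
    name = harassment_type.lower()
    df[f"{name}_category"] = df[harassment_type].map({1: name, 0: f"non {name}"})

print(df[["commenting_category", "ogling_category", "groping_category"]])
```

Each of these columns can then serve as the two-valued category field for one plot.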

Scatter Text Plot for Commenting Category

From the above figure, we can deduce the top commenting and non-commenting words. The top-right of the chart shows the most-shared terms, and the bottom-left shows the least frequent of the shared terms.

Scatter Text Plot for Ogling Category

From the above figure, we can deduce the top ogling and non-ogling words. The top-right of the chart shows the most-shared terms, and the bottom-left shows the least frequent of the shared terms.

Scatter Text Plot for Groping Category

From the above figure, we can deduce the top groping and non-groping words. The top-right of the chart shows the most-shared terms, and the bottom-left shows the least frequent of the shared terms.

Machine Learning Models

For training, we did a basic train-test split and tried various models.

We trained various machine learning models using BOW, TF-IDF, and 300-dimensional GloVe vectors, and observed the values below for the respective metrics.
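As an illustration of one such configuration, the sketch below trains a Linear SVC on BOW features with scikit-learn. Wrapping the classifier in `OneVsRestClassifier` (one binary classifier per label) is an assumption about how the multi-label setup might be handled, and the stories, labels, and split parameters are made up, so the scores it prints say nothing about the real results.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical cleaned stories with (commenting, ogling, groping) indicator labels.
stories = [
    "boys passed comments at the bus stop",
    "a man kept staring at me on the street",
    "someone touched me in the crowded bus",
    "they were commenting and staring near college",
    "a guy stared and then touched me at the station",
    "men shouted comments while i walked home",
    "he kept staring while i waited for the train",
    "a man groped me in the crowded train",
]
labels = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0],
    [0, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1],
])

# Basic train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    stories, labels, test_size=0.25, random_state=42
)

# BOW vectorizer followed by one Linear SVC per label.
model = make_pipeline(CountVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

macro_f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)
micro_f1 = f1_score(y_test, y_pred, average="micro", zero_division=0)
print("Macro F1:", macro_f1)
print("Micro F1:", micro_f1)
print("Hamming loss:", hamming_loss(y_test, y_pred))
```

Swapping `CountVectorizer` for `TfidfVectorizer` gives the TF-IDF variant of the same pipeline.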

[Figure: metric values for the machine learning models]

From the above, we see that the highest Macro F1-score, 0.63, comes from Linear SVC with the BOW vectorizer. Moreover, the BOW and TF-IDF vectorizers outperform the GloVe vectors on each metric.

We next move on to deep learning models.

Deep Learning Models

CNN Model

We built a convolutional neural network, passing 300-dimensional GloVe vectors into the embedding layer.

[Figure: CNN deep learning model]

Because this is multi-label classification, the last layer uses a sigmoid activation, and we use the binary cross-entropy loss function.
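To show why sigmoid with binary cross-entropy suits the multi-label case, here is a small NumPy sketch of what that final layer and loss compute. The logits and targets are made-up numbers, not outputs of the actual network; in a Keras model the same idea would typically be expressed as a 3-unit dense output layer with `activation="sigmoid"` and `loss="binary_crossentropy"`.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Squash each logit independently into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true: np.ndarray, probs: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over every (story, label) pair."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(probs) + (1.0 - y_true) * np.log(1.0 - probs))
    return float(np.mean(losses))


# Hypothetical raw outputs of the last layer for 2 stories x 3 labels
# (commenting, ogling, groping).
logits = np.array([[2.0, -1.0, 0.5],
                   [-3.0, -0.5, 4.0]])
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

probs = sigmoid(logits)
predicted = (probs >= 0.5).astype(int)  # each label is thresholded on its own
loss = binary_cross_entropy(y_true, probs)

print(probs.round(3))
print(predicted)
print("loss:", loss)
```

Unlike softmax, the per-label probabilities do not have to sum to 1, so a story can be assigned several labels at once (the first toy story above is predicted as both commenting and groping).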

CNN-LSTM Model

We also built a CNN-LSTM model by passing 300-dimensional GloVe vectors into the embedding layer of a convolutional neural network and then adding an LSTM layer.

Summary of Both DL Models

[Figure: summary of metrics for both deep learning models]

Based on the above metrics, we choose the CNN as the best model.

Deployment of Model

I created a web app using Flask and deployed my best model. Below is a video of the deployed model running.

Conclusion

In conclusion, this blog sheds light on the pressing issue of sexual harassment and emphasizes the low reporting incidence as a contributing factor to its prevalence. It highlights the importance of victims reporting their experiences to enable authorities to guide people and drive a change in offender behavior.

This blog also discusses the implementation of natural language processing techniques, traditional machine learning algorithms, and convolutional neural networks, with the CNN model, augmented by an LSTM layer, generating superior results. Through this work, the aim is to empower individuals, provide guidance, and promote societal change in tackling the pervasive issue of sexual harassment.

Key Takeaways

CNN model outperformed traditional machine learning algorithms in predicting multi-label classification of harassment forms.
The LSTM layer used in the CNN model resulted in a significant improvement in performance metrics.
The article highlights the superiority of the CNN model and the impact of the LSTM layer on its performance.

Future Work

We need to gather more data so that it helps us to improve values of our performance metrics on test data set.
We can try BERT embeddings and FastText word embeddings.
We can work on our custom model in order to obtain enhanced values on performance metrics by changing architecture.

You can find my complete code over here.

Frequently Asked Questions

Q1: What is the significance of utilizing NLP techniques in this context?

A: NLP plays a vital role in analyzing textual data and extracting insights. In the context of predicting multi-label harassment, NLP preprocesses victim stories by cleaning the data, removing symbols, digits, and stop words.

Q2: How do CNNs contribute to the prediction of multi-label classifications in this study?

A: CNNs excel at processing structured data like images or text represented as word embeddings. In this article, CNNs process GloVe word embeddings of victim stories, capturing key features and patterns. The model’s convolutional layers extract relevant information, while subsequent layers learn complex relationships for multi-label predictions. Integrating CNNs in this study boosts the classification model’s performance.

Q3: What is the role of the LSTM layer in the CNN model? How does it improve performance?

A: The LSTM (Long Short-Term Memory) layer is a type of recurrent neural network layer that can effectively model sequential data. In the CNN-LSTM model, the LSTM layer is added after the convolutional layers. It helps capture the contextual dependencies and long-term relationships within the victim stories. By incorporating the LSTM layer, the model gains the ability to understand the sequential nature of the text, resulting in improved performance metrics, such as higher accuracy and F1-scores.

Q4: Which vectorization techniques do the machine learning and deep learning models use?

A: The study utilized various vectorization techniques: Bag-of-Words (BOW), TF-IDF, and GloVe word embeddings. BOW and TF-IDF count word occurrences and determine importance based on frequency. GloVe word embeddings represent words as dense vectors, capturing semantic meaning.

Q5: What performance metrics do we use to evaluate the multi-label classification models?

A: We use Macro F1-score and Micro F1-score, along with Hamming loss, to evaluate the multi-label classification models. The F1 metrics consider the correctness of predictions for each label individually and average them either across all instances (Micro) or across all labels (Macro).

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

yashaswi kakumanu

Passionate Machine Learning Engineer with expertise in agile methodology and cloud environments. Proficient in designing,
developing, testing, and deploying applications utilizing cloud technologies. Actively contributes to open-source projects,
demonstrating a commitment to advancing machine learning through continuous learning and improvement.
