In today’s challenging job market, individuals must gather reliable information to make informed career decisions. Glassdoor is a popular platform where employees anonymously share their experiences. However, the abundance of reviews can overwhelm job seekers. To address this, we build an NLP-driven system that automatically condenses Glassdoor reviews into insightful summaries. Our project explores the step-by-step process, from using Selenium for review collection to leveraging NLTK for summarization. These concise summaries provide valuable insights into company culture and growth opportunities, helping individuals align their career aspirations with suitable organizations. We also discuss limitations, such as interpretation differences and data collection errors, to ensure a comprehensive understanding of the summarization process.
The learning objective of this project is to develop a robust text summarization system that effectively condenses voluminous Glassdoor reviews into concise and informative summaries.
This article was published as a part of the Data Science Blogathon.
This project minimizes the effort of reading through a considerable volume of Glassdoor reviews by developing an automated text summarization system. By harnessing natural language processing (NLP) techniques and machine learning algorithms, the system extracts the most pertinent information from the reviews and generates compact, informative summaries. The project entails data collection from Glassdoor using Selenium, data preprocessing, and text summarization techniques that empower individuals to quickly grasp salient insights about an organization’s culture and work environment.
This project aims to assist people in interpreting an organization’s culture and work environment based on numerous Glassdoor reviews. Glassdoor, a highly used platform, has become a primary resource for individuals to gather insights about potential employers. However, the vast number of reviews on Glassdoor can be daunting, posing difficulties for individuals to distill useful insights effectively.
An organization’s culture, leadership style, work-life balance, advancement prospects, and overall employee happiness are key considerations that can significantly sway a person’s career decisions. But navigating through numerous reviews, each differing in length, style, and focus areas, is challenging. Furthermore, the lack of a concise, easy-to-understand summary only exacerbates the issue.
The task at hand, therefore, is to devise a system for summarizing text that can efficiently process the myriad of Glassdoor reviews and deliver succinct yet informative summaries. By automating this process, we aim to provide individuals with an exhaustive overview of a company’s characteristics in a user-friendly manner. The system will enable job hunters to quickly grasp key themes and sentiments from the reviews, facilitating a smoother decision-making process regarding job opportunities.
In resolving this problem, we aim to alleviate the information saturation faced by job seekers and empower them to make informed decisions that align with their career goals. The text summarization system developed through this project will be an invaluable resource for individuals seeking to understand an organization’s work climate and culture, providing them the confidence to navigate the employment landscape.
We aim to streamline the understanding of a company’s work culture and environment through Glassdoor reviews. Our strategy involves a systematic process encompassing data collection, preparation, and text summarization.
Imagine the case of Alex, a proficient software engineer who has been offered a position at Salesforce, a renowned tech firm. Alex wants to delve deeper into Salesforce’s work culture, environment, and employee satisfaction as part of their decision-making process.
With our method of condensing Glassdoor reviews, Alex can swiftly access the main points from many Salesforce-specific employee reviews. By leveraging the automated text summarization system we’ve created, Alex can obtain concise summaries that highlight key elements such as the firm’s team-oriented work culture, advancement opportunities, and overall employee contentment.
By reviewing these summaries, Alex can thoroughly understand Salesforce’s corporate characteristics without spending too much time reading the reviews. These summaries provide a compact yet insightful perspective, enabling Alex to make a decision that aligns with their career goals.
We will employ the Selenium library in Python to collect reviews from Glassdoor. The code snippets below walk through the process. We outline each step to maintain transparency and compliance with ethical scraping standards:
We begin by importing the necessary libraries, including Selenium, Pandas, and other essential modules, ensuring a comprehensive environment for data collection.
# Importing the necessary libraries
import selenium
from selenium import webdriver as wb
import pandas as pd
import time
from time import sleep
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import itertools
We establish the setup for the ChromeDriver by specifying the appropriate path where it is stored, thus allowing seamless integration with the Selenium framework.
# Changing the working directory to the path
# where the chromedriver is saved & setting
# up the chrome driver
%cd "PATH WHERE CHROMEDRIVER IS SAVED"
driver = wb.Chrome(r"YOUR PATH\chromedriver.exe")
driver.get('https://www.glassdoor.co.in/Reviews/Salesforce-Reviews-E11159.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng&filter.employmentStatus=PART_TIME&filter.employmentStatus=REGULAR')
We employ the driver.get() function to access the Glassdoor page housing the desired reviews. For this example, we specifically target the Salesforce reviews page.
Within a well-structured loop, we iterate through a predetermined number of pages, enabling systematic and extensive review extraction. This count can be adjusted based on individual requirements.
We proactively expand the review details during each iteration by interacting with the “Continue Reading” elements, facilitating a comprehensive collection of pertinent information.
We systematically locate and extract many review details, including review headings, job particulars (date, role, location), ratings, employee tenure, pros, and cons. These details are segregated and stored in separate lists, ensuring accurate representation.
By leveraging the capabilities of Pandas, we establish a temporary DataFrame (df_temp) to house the extracted information from each iteration. This iterative DataFrame is then appended to the primary DataFrame (df), allowing consolidation of the review data.
To manage the pagination process, we efficiently locate the “Next” button and initiate a click event, subsequently navigating to the next page of reviews. This systematic progression continues until all available reviews have been successfully acquired.
Finally, we proceed with essential data-cleaning operations, such as converting the “Date” column to a datetime format, resetting the index for improved organization, and sorting the DataFrame in descending order based on the review dates.
This meticulous approach ensures the comprehensive and ethical collection of many Glassdoor reviews, enabling further analysis and subsequent text summarization tasks.
# Importing the necessary libraries
import selenium
from selenium import webdriver as wb
import pandas as pd
import time
from time import sleep
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import itertools

# Changing the working directory to the path
# where the chromedriver is saved
# Setting up the chrome driver
%cd "C:\Users\akshi\OneDrive\Desktop"
driver = wb.Chrome(r"C:\Users\akshi\OneDrive\Desktop\chromedriver.exe")

# Accessing the Glassdoor page with specific filters
driver.get('https://www.glassdoor.co.in/Reviews/Salesforce-Reviews-E11159.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng&filter.employmentStatus=PART_TIME&filter.employmentStatus=REGULAR')

df = pd.DataFrame()
num = 20
for _ in itertools.repeat(None, num):
    # Expand truncated reviews by clicking the "Continue Reading" elements
    continue_reading = driver.find_elements_by_xpath(
        "//div[contains(@class,'v2__EIReviewDetailsV2__continueReading "
        "v2__EIReviewDetailsV2__clickable v2__EIReviewDetailsV2__newUiCta mb')]")
    for element in continue_reading:
        element.click()
    time.sleep(5)

    # Review headings
    review_heading = driver.find_elements_by_xpath("//a[contains(@class,'reviewLink')]")
    review_heading = pd.Series([i.text for i in review_heading])

    # Date, role and location details
    dets = driver.find_elements_by_xpath(
        "//span[contains(@class,'common__EiReviewDetailsStyle__newUiJobLine')]")
    dets = [i.text for i in dets]
    dates = [i.split(' - ')[0] for i in dets]
    role = [i.split(' - ')[1].split(' in ')[0] for i in dets]
    try:
        loc = [i.split(' - ')[1].split(' in ')[1]
               if i.find(' in ') != -1 else '-' for i in dets]
    except:
        loc = [i.split(' - ')[2].split(' in ')[1]
               if i.find(' in ') != -1 else '-' for i in dets]

    # Ratings, tenure, pros and cons
    rating = driver.find_elements_by_xpath("//span[contains(@class,'ratingNumber mr-xsm')]")
    rating = [i.text for i in rating]
    emp = driver.find_elements_by_xpath(
        "//span[contains(@class,'pt-xsm pt-md-0 css-1qxtz39 eg4psks0')]")
    emp = [i.text for i in emp]
    pros = driver.find_elements_by_xpath("//span[contains(@data-test,'pros')]")
    pros = [i.text for i in pros]
    cons = driver.find_elements_by_xpath("//span[contains(@data-test,'cons')]")
    cons = [i.text for i in cons]

    # Consolidate this page's reviews into the main DataFrame
    df_temp = pd.DataFrame(
        {
            'Date': pd.Series(dates),
            'Role': pd.Series(role),
            'Tenure': pd.Series(emp),
            'Location': pd.Series(loc),
            'Rating': pd.Series(rating),
            'Pros': pd.Series(pros),
            'Cons': pd.Series(cons)
        }
    )
    df = df.append(df_temp)

    # Move on to the next page of reviews
    try:
        driver.find_element_by_xpath(
            "//button[contains(@class,'nextButton css-1hq9k8 e13qs2071')]").click()
    except:
        print('No more reviews')

# Cleaning: parse dates, reset the index and sort by most recent
df['Date'] = pd.to_datetime(df['Date'])
df = df.reset_index()
del df['index']
df = df.sort_values('Date', ascending=False)
df
We get an output as follows.
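The scraped output is a DataFrame with one row per review. As an illustration of its structure (the rows below are hypothetical placeholder values, not real Glassdoor reviews), here is a miniature version along with the same cleaning steps applied in the scraper:

```python
import pandas as pd

# Hypothetical rows illustrating the structure of the scraped DataFrame;
# the real values come from the Glassdoor pages collected above.
df = pd.DataFrame({
    'Date': ['2023-05-01', '2023-04-20'],
    'Role': ['Software Engineer', 'Account Executive'],
    'Tenure': ['Current Employee, more than 3 years', 'Former Employee'],
    'Location': ['Bengaluru', '-'],
    'Rating': ['5.0', '3.0'],
    'Pros': ['Collaborative culture and strong benefits', 'Good pay'],
    'Cons': ['Fast pace can be demanding', 'Frequent territory changes'],
})

# Same cleaning steps as in the scraper: parse dates and sort newest first
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date', ascending=False).reset_index(drop=True)
print(df[['Date', 'Role', 'Rating']])
```

After cleaning, the `Date` column holds proper datetime values, which is what allows the descending sort to put the most recent reviews first.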
To generate summaries from the extracted reviews, we employ the NLTK library and apply various techniques for text processing and analysis. The code snippets below demonstrate the process.
We import the essential libraries: string, nltk, and the Counter class from the collections module. Together with pandas, these offer robust data manipulation, string processing, and text analysis functionality for the summarization workflow.
import string
import nltk
from nltk.corpus import stopwords
from collections import Counter
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
We filter the obtained reviews based on the desired role (Software Engineer in our scenario), ensuring relevance and context-specific analysis. Null values are removed, and the data is cleaned to facilitate accurate processing.
role = input('Input Role')
df = df.dropna()
df = df[df['Role'].str.contains(role)]
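Note that `str.contains` performs a substring match, so broader role titles also pass the filter. A quick illustration with invented rows (these are not real reviews) shows that "Senior Software Engineer" matches the input "Software Engineer":

```python
import pandas as pd

# Toy reviews DataFrame; the Role values are invented for illustration
df = pd.DataFrame({
    'Role': ['Software Engineer', 'Senior Software Engineer', 'Sales Manager', None],
    'Pros': ['Great team', 'Good pay', 'Nice perks', 'n/a'],
})

role = 'Software Engineer'
df = df.dropna()                            # drop rows with missing values
df = df[df['Role'].str.contains(role)]      # substring match on the role
print(df['Role'].tolist())
```

This behavior is usually desirable here, since it groups all seniority levels of a role into one analysis; an exact match (`df['Role'] == role`) could be used instead if that is too broad.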
Each review’s pros and cons are processed separately. We ensure lowercase consistency and eliminate punctuation using the translate() function. The text is then split into words, removing stopwords and specific context-related words. The resulting word lists, pro_words and con_words, capture the relevant information for further analysis.
pros = [i for i in df['Pros']]
cons = [i for i in df['Cons']]

# Split the pros into a list of words
all_words = []
pro_words = ' '.join(pros)
pro_words = pro_words.translate(str.maketrans('', '', string.punctuation))
pro_words = pro_words.split()
specific_words = ['great', 'work', 'get', 'good', 'company',
                  'lot', 'it’s', 'much', 'really', 'NAME', 'dont', 'every',
                  'high', 'big', 'many', 'like']
pro_words = [word for word in pro_words if word.lower() not in stop_words
             and word.lower() not in specific_words]
all_words += pro_words

# Split the cons into a list of words
con_words = ' '.join(cons)
con_words = con_words.translate(str.maketrans('', '', string.punctuation))
con_words = con_words.split()
con_words = [word for word in con_words if word.lower() not in stop_words
             and word.lower() not in specific_words]
all_words += con_words
Utilizing the Counter class from the collections module, we obtain word frequency counts for both pros and cons. This analysis allows us to identify the most frequently occurring words in the reviews, facilitating subsequent keyword extraction.
# Count the frequency of each word
pro_word_counts = Counter(pro_words)
con_word_counts = Counter(con_words)
To identify key themes and sentiments, we extract the top 10 most common words separately from the pros and cons using the most_common() method. We also handle the presence of common keywords between the two sets, ensuring a comprehensive and unbiased approach to summarization.
# Get the 10 most common words from the pros and cons
keyword_count = 10
top_pro_keywords = pro_word_counts.most_common(keyword_count)
top_con_keywords = con_word_counts.most_common(keyword_count)

# Check if there are any common keywords between the pros and cons
common_keywords = list(
    set([keyword for keyword, frequency in top_pro_keywords])
    .intersection([keyword for keyword, frequency in top_con_keywords]))

# Keep each common keyword only on the side where it occurs more often
for common_keyword in common_keywords:
    pro_frequency = pro_word_counts[common_keyword]
    con_frequency = con_word_counts[common_keyword]
    if pro_frequency > con_frequency:
        top_con_keywords = [(keyword, frequency) for keyword, frequency
                            in top_con_keywords if keyword != common_keyword]
        top_con_keywords = top_con_keywords[0:6]
    else:
        top_pro_keywords = [(keyword, frequency) for keyword, frequency
                            in top_pro_keywords if keyword != common_keyword]
        top_pro_keywords = top_pro_keywords[0:6]
top_pro_keywords = top_pro_keywords[0:5]
We conduct sentiment analysis on the pros and cons by defining lists of positive and negative words. Iterating over the word counts, we calculate the overall sentiment score, providing insights into the general sentiment expressed in the reviews.
To quantify the sentiment score, we divide the overall sentiment score by the total number of words in the reviews. Multiplying this by 100 yields the sentiment score percentage, offering a holistic view of the sentiment distribution within the data.
# Calculate the overall sentiment score by summing the
# frequencies of positive and negative words
positive_words = ["amazing", "excellent", "great", "good", "positive",
                  "pleasant", "satisfied", "happy", "pleased", "content",
                  "delighted", "gratified", "joyful", "lucky", "fortunate",
                  "glad", "thrilled", "overjoyed", "ecstatic", "relieved",
                  "impressed", "admirable", "valuing", "encouraging"]
negative_words = ["poor", "slow", "terrible", "horrible", "bad", "awful",
                  "unpleasant", "dissatisfied", "unhappy", "displeased",
                  "miserable", "disappointed", "frustrated", "angry", "upset",
                  "offended", "disgusted", "repulsed", "horrified", "afraid",
                  "terrified", "petrified", "panicked", "alarmed", "shocked",
                  "stunned", "dumbfounded", "baffled", "perplexed", "puzzled"]

positive_score = 0
negative_score = 0
for word, frequency in pro_word_counts.items():
    if word in positive_words:
        positive_score += frequency
for word, frequency in con_word_counts.items():
    if word in negative_words:
        negative_score += frequency
overall_sentiment_score = positive_score - negative_score

# Calculate the sentiment score in %
total_words = sum(pro_word_counts.values()) + sum(con_word_counts.values())
sentiment_score_percent = (overall_sentiment_score / total_words) * 100
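To make the formula concrete, here is a small worked example with toy word counts (the words and frequencies are invented for illustration, standing in for the real pro_word_counts and con_word_counts):

```python
from collections import Counter

# Toy frequency counts standing in for pro_word_counts / con_word_counts
toy_pro_counts = Counter({'excellent': 4, 'culture': 6, 'benefits': 5})
toy_con_counts = Counter({'poor': 2, 'slow': 1, 'management': 7})

toy_positive = {'excellent', 'great', 'good'}
toy_negative = {'poor', 'slow', 'terrible'}

# Sum the frequencies of sentiment-bearing words on each side
positive = sum(f for w, f in toy_pro_counts.items() if w in toy_positive)  # 4
negative = sum(f for w, f in toy_con_counts.items() if w in toy_negative)  # 2 + 1 = 3
overall = positive - negative                                              # 1

# Normalize by the total word count: (1 / 25) * 100 = 4.0%
total = sum(toy_pro_counts.values()) + sum(toy_con_counts.values())        # 15 + 10 = 25
percent = (overall / total) * 100
print(overall, percent)
```

Note that neutral words such as 'management' contribute to the denominator but not the numerator, so a review set dominated by neutral vocabulary yields a sentiment percentage close to zero.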
Print Results
We present the top 5 keywords for pros and cons, the overall sentiment score, sentiment score percentage, and the average rating in the reviews. These metrics offer valuable insights into the prevailing sentiments and user experiences related to the organization.
# Print the results
print("Top 5 keywords for pros:", top_pro_keywords)
print("Top 5 keywords for cons:", top_con_keywords)
print("Overall sentiment score:", overall_sentiment_score)
print("Sentiment score percentage:", sentiment_score_percent)
print('Avg rating given',df['Rating'].mean())
To capture the most relevant information, we create a bag-of-words model from the pros and cons sentences. We implement a scoring function that assigns a score to each sentence based on the occurrence of specific words or word combinations, ensuring an effective summary extraction process.
# Join the pros and cons into a single list of sentences
sentences = pros + cons

# Create a bag-of-words model for the sentences
# (each sentence is processed individually so words are counted once)
bow = {}
for sentence in sentences:
    words = sentence.translate(str.maketrans('', '', string.punctuation))
    words = words.split()
    for word in words:
        if word not in bow:
            bow[word] = 0
        bow[word] += 1
# Define a heuristic scoring function that assigns a score to each
# sentence based on the presence of certain words or word combinations
def score(sentence):
    words = sentence.split()
    score = 0
    for word in words:
        if word in ["good", "great", "excellent"]:
            score += 2
        elif word in ["poor", "bad", "terrible"]:
            score -= 2
        elif word in ["culture", "benefits", "opportunities"]:
            score += 1
        elif word in ["balance", "progression", "territory"]:
            score -= 1
    return score
# Score the sentences and sort them by score
scored_sentences = [(score(sentence), sentence) for sentence in sentences]
scored_sentences.sort(reverse=True)
We extract the top 10 scored sentences and aggregate them into a cohesive summary using the join() function. This summary encapsulates the most salient points and sentiments expressed in the reviews, providing a concise overview for decision-making purposes.
# Extract the top 10 scored sentences
top_sentences = [sentence for score, sentence in scored_sentences[:10]]
# Join the top scored sentences into a single summary
summary = " ".join(top_sentences)
Finally, we print the generated summary, a valuable resource for individuals seeking insights into the organization’s culture and work environment.
# Print the summary
print("Summary:")
print(summary)
As we see above, we get a crisp summary and a good understanding of the company culture, perks, and benefits specific to the Software Engineering role. By leveraging the capabilities of NLTK and employing robust text processing techniques, this approach enables effective keyword extraction, sentiment analysis, and the generation of informative summaries from the extracted Glassdoor reviews.
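The full pipeline of tokenization, keyword counting, and heuristic sentence scoring can be condensed into a minimal, self-contained sketch. The example sentences and the tiny hardcoded stopword list below are illustrative assumptions standing in for the scraped reviews and NLTK's stopword corpus:

```python
import string
from collections import Counter

# Invented example sentences standing in for scraped pros and cons
pros = ["Great culture and excellent benefits",
        "Good opportunities to learn"]
cons = ["Poor work life balance at times",
        "Career progression can be slow"]

# Tiny stand-in for NLTK's English stopword list
stop_words = {'and', 'to', 'at', 'can', 'be', 'the', 'a'}

def tokenize(texts):
    # Lowercase, strip punctuation, split into words, drop stopwords
    words = ' '.join(texts).lower()
    words = words.translate(str.maketrans('', '', string.punctuation))
    return [w for w in words.split() if w not in stop_words]

pro_counts = Counter(tokenize(pros))
con_counts = Counter(tokenize(cons))

def score(sentence):
    # Heuristic scoring, as in the article (lowercased for robustness)
    s = 0
    for word in sentence.lower().split():
        if word in {"good", "great", "excellent"}:
            s += 2
        elif word in {"poor", "bad", "terrible"}:
            s -= 2
        elif word in {"culture", "benefits", "opportunities"}:
            s += 1
        elif word in {"balance", "progression", "territory"}:
            s -= 1
    return s

# Rank sentences by score and keep the top 2 as the summary
sentences = pros + cons
ranked = sorted(sentences, key=score, reverse=True)
summary = ' '.join(ranked[:2])
print(summary)
```

On this toy input, the two positively scored pro sentences outrank both cons, so they form the summary; on real data the same ranking naturally surfaces the most sentiment-dense sentences from either side.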
The text summarization system developed here holds great potential in various practical scenarios, benefiting stakeholders such as job seekers, human resource professionals, and recruiters. Here are some noteworthy use cases:
- Job seekers can quickly gauge a company’s culture, work environment, and employee satisfaction before accepting an offer.
- Human resource professionals can monitor recurring themes in employee feedback to identify areas for improvement.
- Recruiters can use the summaries to understand how their organization is perceived and address candidate concerns proactively.
Our approach to summarizing Glassdoor reviews involves several limitations and potential challenges that must be considered. These include:
- Subjectivity of reviews: interpretations of culture and satisfaction differ from one employee to another, so summaries may not reflect every perspective.
- Data collection errors: scraping depends on Glassdoor’s page structure, and layout changes or missing fields can introduce gaps or inaccuracies.
- Loss of nuance: extractive summaries built from keyword heuristics can miss context that a full reading of the reviews would reveal.
Acknowledging and actively addressing these limitations is crucial to ensure the system’s integrity and usefulness. Regular evaluation, user feedback incorporation, and continuous refinement are essential for improving the summarization system and mitigating potential biases or challenges.
The project’s objective was to simplify the understanding of a company’s culture and work environment through numerous Glassdoor reviews. We’ve successfully built an efficient text summarization system by implementing a systematic method that includes data collection, preparation, and text summarization. The project has provided valuable insights and key learnings.
The lessons learned from the project include the importance of data quality, the challenges of subjective reviews, the significance of context in summarization, and the cyclical nature of system improvement. Using machine learning algorithms and natural language processing techniques, our text summarization system provides an efficient and thorough way to gain insights from Glassdoor reviews.
Q1. What is text summarization using NLP?
A. Text summarization employing NLP is an approach that harnesses natural language processing algorithms to generate condensed summaries from extensive textual data. It aims to extract crucial details and principal insights from the original text, offering a concise overview.
Q2. How do NLP techniques help in text summarization?
A. NLP techniques play a pivotal role in text summarization by facilitating the analysis and comprehension of textual information. They empower the system to discern pertinent details, extract key phrases, and synthesize essential elements, culminating in coherent summaries.
Q3. What are the benefits of text summarization using NLP?
A. Text summarization using NLP offers several benefits. It speeds up information assimilation by presenting condensed versions of lengthy documents, enables efficient decision-making by surfacing crucial ideas, and streamlines data handling for improved analysis.
Q4. What key techniques are used in NLP-based text summarization?
A. Key techniques employed in NLP-based text summarization encompass natural language comprehension, sentence parsing, semantic analysis, entity recognition, and machine learning algorithms. This combination of techniques enables the system to discern crucial sentences, extract significant phrases, and construct coherent summaries.
Q5. What types of content can NLP-based text summarization be applied to?
A. NLP-based text summarization is highly versatile and adaptable, finding applications across various domains. It effectively summarizes diverse textual sources, such as news articles, research papers, social media content, customer reviews, and legal documents, enabling insights and information extraction in different contexts.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.