Stress is a natural response of the body and mind to a demanding or challenging situation. It is the body’s way of reacting to external pressures or internal thoughts and feelings. Stress can be triggered by a variety of factors, such as work-related pressure, financial difficulties, relationship problems, health issues, or major life events. Stress detection, driven by data science and machine learning, aims to forecast stress levels in individuals or populations. By analyzing a variety of data sources, such as physiological measurements, behavioral data, and environmental factors, predictive models can identify patterns and risk factors associated with stress.
This proactive approach enables timely intervention and tailored support. Stress prediction holds potential in health care for early detection and personalized intervention as well as in occupational settings to optimize work environments. It can also inform public health initiatives and policy decisions. With the ability to predict stress, these models provide valuable insights for improving well-being and increasing resilience in individuals and communities.
This article was published as a part of the Data Science Blogathon.
Stress detection using machine learning involves collecting, cleaning, and preprocessing data. Feature engineering techniques are applied to extract meaningful information or create new features that can capture patterns related to stress. This may involve extracting statistical measures, applying frequency-domain analysis, or using time-series analysis to capture physiological or behavioral indicators of stress. Relevant features are extracted or engineered to enhance model performance.
Researchers train machine learning models like logistic regression, SVM, decision trees, random forests, or neural networks by utilizing labeled data to classify stress levels. They evaluate the performance of the models using metrics such as accuracy, precision, recall, and F1-score. Integration of the trained model into real-world applications enables real-time stress monitoring. Continuous monitoring, updates, and user feedback are crucial for improving accuracy.
It is crucial to consider ethical issues and privacy concerns when dealing with sensitive personal data related to stress. Proper informed consent, data anonymization, and secure data storage procedures should be followed to protect individuals’ privacy and rights. Ethical considerations, privacy, and data security are important during the entire process. Machine learning-based stress detection enables early intervention, personalized stress management, and improved well-being.
The “stress” dataset contains information related to stress levels. Without the specific structure and columns of the dataset, I can provide a general overview of what a data description for such a dataset might look like.
The dataset may contain numerical variables that represent quantitative measurements, such as age, blood pressure, heart rate, or stress levels measured on a scale. It may also include categorical variables that represent qualitative characteristics, such as gender, occupation categories, or stress levels classified into different categories (low, medium, high).
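As a quick illustration of this kind of data description, pandas can separate numerical and categorical variables with select_dtypes. The snippet below is a minimal sketch on hypothetical columns; it is not the dataset used in the rest of this article:
# Minimal sketch on hypothetical columns (not the dataset used below)
import pandas as pd

sample = pd.DataFrame({
    'age': [25, 34, 41],
    'heart_rate': [72, 88, 95],
    'gender': ['F', 'M', 'F'],
    'stress_level': ['low', 'high', 'medium'],
})

# Quantitative measurements
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
# Qualitative characteristics
categorical_cols = sample.select_dtypes(include='object').columns.tolist()

print('Numerical:', numeric_cols)        # ['age', 'heart_rate']
print('Categorical:', categorical_cols)  # ['gender', 'stress_level']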
# Array
import numpy as np
# Dataframe
import pandas as pd
#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# warnings
import warnings
warnings.filterwarnings('ignore')
#Data Reading
stress_c= pd.read_csv('/human-stress-prediction/Stress.csv')
# Copy
stress=stress_c.copy()
# Data
stress.head()
The info() call below lets you quickly assess the data types and spot missing or null values. This summary is useful when working with large datasets or performing data cleaning and preprocessing tasks.
# Info
stress.info()
Use the code stress.isnull().sum() to check for null values in the “stress” dataset and calculate the sum of null values in each column.
# Checking null values
stress.isnull().sum()
Next, generate statistical information about the “stress” dataset. Running this code gives a summary of descriptive statistics for each numerical column in the dataset.
# Statistical Information
stress.describe()
Exploratory Data Analysis (EDA) is a crucial step in understanding and analyzing a dataset. It involves visually exploring and summarizing the main characteristics, patterns, and relationships within the data.
lst=['subreddit','label']
plt.figure(figsize=(15,12))
for i in range(len(lst)):
    plt.subplot(1,2,i+1)
    a=stress[lst[i]].value_counts()
    lbl=a.index
    plt.title(lst[i]+'_Distribution')
    plt.pie(x=a,labels=lbl,autopct="%.1f %%")
plt.show()
Next, Matplotlib and Seaborn are used to create a count plot for the “stress” dataset. It visualizes the count of stress instances across different subreddits, with the stress labels differentiated by color.
plt.figure(figsize=(20,12))
plt.title('Subreddit wise stress count')
plt.xlabel('Subreddit')
sns.countplot(data=stress,x='subreddit',hue='label',palette='gist_heat')
plt.show()
Text preprocessing refers to the process of converting raw text data into a cleaner, more structured format that is suitable for analysis or modeling tasks. It typically involves a series of steps to remove noise, normalize text, and extract relevant features. The libraries needed for this text processing are imported below.
# Regular Expression
import re
# Handling string
import string
# NLP tool
import spacy
nlp=spacy.load('en_core_web_sm')
from spacy.lang.en.stop_words import STOP_WORDS
# Importing Natural Language Tool Kit for NLP operations
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
from collections import Counter
Common text preprocessing techniques include lowercasing, removing punctuation and digits, stop-word removal, and lemmatization. The functions below apply these steps:
# Defining a function for preprocessing
def preprocess(text, remove_digits=True):
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r"(?<!\w)\d+", "", text)
    text = re.sub(r"-(?!\w)|(?<!\w)-", "", text)
    text = text.lower()
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    nopunc = ' '.join([word for word in nopunc.split()
                       if word.lower() not in stopwords.words('english')])
    return nopunc

# Defining a function for lemmatization
def lemmatize(words):
    words = nlp(words)
    lemmas = []
    for word in words:
        lemmas.append(word.lemma_)
    return lemmas

# Converting the list of lemmas back into a string
def listtostring(s):
    str1 = ' '
    return (str1.join(s))

# Combining preprocessing and lemmatization into one cleaning step
def clean_text(input):
    word = preprocess(input)
    lemmas = lemmatize(word)
    return listtostring(lemmas)
# Creating a feature to store clean texts
stress['clean_text']=stress['text'].apply(clean_text)
stress.head()
Machine learning model building is the process of creating a mathematical representation or model that can learn patterns and make predictions or decisions from data. It involves training a model using a labeled dataset and then using that model to make predictions on new, unseen data.
Feature engineering means selecting or creating relevant features from the available data, with the aim of extracting meaningful information from the raw data so the model can learn patterns effectively.
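For the text data in this article, TF-IDF vectorization serves as the feature engineering step: each document becomes a sparse vector of term weights. The snippet below is a minimal standalone sketch on a toy corpus, separate from the pipeline that follows:
# Minimal TF-IDF sketch on a toy corpus (separate from the pipeline below)
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "feel anxious and overwhelmed at work",
    "calm relaxing weekend with family",
    "deadline pressure and constant worry",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)             # sparse matrix: documents x vocabulary
print(X.shape)                            # (3, number_of_unique_terms)
print(tfidf.get_feature_names_out()[:5])  # a few of the learned terms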
# Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
# Model Building
from sklearn.model_selection import (GridSearchCV, StratifiedKFold, KFold,
                                     train_test_split, cross_val_score, cross_val_predict)
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              AdaBoostClassifier)
from sklearn.neighbors import KNeighborsClassifier
#Model Evaluation
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score, f1_score, precision_score)
from sklearn.pipeline import Pipeline
# Time
from time import time
# Defining target & feature for ML model building
x=stress['clean_text']
y=stress['label']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
The next step is choosing an appropriate machine learning algorithm or model architecture based on the nature of the problem and the characteristics of the data. Different models, such as decision trees, support vector machines, or neural networks, have different strengths and weaknesses.
The selected model is then trained on the labeled data. This step involves feeding the training data to the model and allowing it to learn the patterns and relationships between the features and the target variable.
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer, then build and evaluate a Logistic Regression model
def model_lr_tf(x_train, x_test, y_train, y_test):
    global acc_lr_tf, f1_lr_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)

    ovr = LogisticRegression()

    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)

    # Model Evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_lr_tf = accuracy_score(y_test, y_pred)
    f1_lr_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time :', time()-t0)
    print('Accuracy: ', acc_lr_tf)
    print(10*'===========')
    print('Confusion Matrix: \n', conf)
    print(10*'===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
    return y_test, y_pred, acc_lr_tf
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer, then build and evaluate a MultinomialNB model
def model_nb_tf(x_train, x_test, y_train, y_test):
    global acc_nb_tf, f1_nb_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)

    ovr = MultinomialNB()

    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)

    # Model Evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_nb_tf = accuracy_score(y_test, y_pred)
    f1_nb_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time : ', time()-t0)
    print('Accuracy: ', acc_nb_tf)
    print(10*'===========')
    print('Confusion Matrix: \n', conf)
    print(10*'===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
    return y_test, y_pred, acc_nb_tf
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer, then build and evaluate a Decision Tree model
def model_dt_tf(x_train, x_test, y_train, y_test):
    global acc_dt_tf, f1_dt_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)

    ovr = DecisionTreeClassifier(random_state=1)

    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)

    # Model Evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_dt_tf = accuracy_score(y_test, y_pred)
    f1_dt_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time : ', time()-t0)
    print('Accuracy: ', acc_dt_tf)
    print(10*'===========')
    print('Confusion Matrix: \n', conf)
    print(10*'===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
    return y_test, y_pred, acc_dt_tf
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer, then build and evaluate a KNN model
def model_knn_tf(x_train, x_test, y_train, y_test):
    global acc_knn_tf, f1_knn_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)

    ovr = KNeighborsClassifier()

    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)

    # Model Evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_knn_tf = accuracy_score(y_test, y_pred)
    f1_knn_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time : ', time()-t0)
    print('Accuracy: ', acc_knn_tf)
    print(10*'===========')
    print('Confusion Matrix: \n', conf)
    print(10*'===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer, then build and evaluate a Random Forest model
def model_rf_tf(x_train, x_test, y_train, y_test):
    global acc_rf_tf, f1_rf_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)

    ovr = RandomForestClassifier(random_state=1)

    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)

    # Model Evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_rf_tf = accuracy_score(y_test, y_pred)
    f1_rf_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time : ', time()-t0)
    print('Accuracy: ', acc_rf_tf)
    print(10*'===========')
    print('Confusion Matrix: \n', conf)
    print(10*'===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer, then build and evaluate an Adaptive Boosting model
def model_ab_tf(x_train, x_test, y_train, y_test):
    global acc_ab_tf, f1_ab_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)

    ovr = AdaBoostClassifier(random_state=1)

    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)

    # Model Evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_ab_tf = accuracy_score(y_test, y_pred)
    f1_ab_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time : ', time()-t0)
    print('Accuracy: ', acc_ab_tf)
    print(10*'===========')
    print('Confusion Matrix: \n', conf)
    print(10*'===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
Model evaluation is a crucial step in machine learning to assess the performance and effectiveness of a trained model. It involves measuring how well each model generalizes to unseen data and whether it meets the desired objectives. Evaluate the trained model’s performance on the testing data. Calculate evaluation metrics such as accuracy, precision, recall, and F1-score to assess the model’s effectiveness in stress detection. Model evaluation provides insights into each model’s strengths, weaknesses, and suitability for the intended task.
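As a quick refresher on these metrics, the snippet below is a minimal sketch on made-up labels (independent of the functions defined above), showing how accuracy, precision, recall, and F1-score are computed with scikit-learn:
# Minimal sketch on made-up labels (1 = stress, 0 = no stress)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0]

print('Accuracy :', accuracy_score(y_true, y_hat))   # (TP + TN) / total
print('Precision:', precision_score(y_true, y_hat))  # TP / (TP + FP)
print('Recall   :', recall_score(y_true, y_hat))     # TP / (TP + FN)
print('F1-score :', f1_score(y_true, y_hat))         # harmonic mean of precision and recall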
# Evaluating Models
print('********************Logistic Regression*********************')
print('\n')
model_lr_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
print('********************Multinomial NB*********************')
print('\n')
model_nb_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
print('********************Decision Tree*********************')
print('\n')
model_dt_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
print('********************KNN*********************')
print('\n')
model_knn_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
print('********************Random Forest Bagging*********************')
print('\n')
model_rf_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
print('********************Adaptive Boosting*********************')
print('\n')
model_ab_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
This is a crucial step in machine learning to identify the best-performing model for a given task. When comparing models, it is important to have a clear objective in mind. Whether it is maximizing accuracy, optimizing for speed, or prioritizing interpretability, the evaluation metrics and techniques should align with the specific objective.
Consistency is key in model performance comparison. Using consistent evaluation metrics across all models ensures a fair and meaningful comparison. It is also important to split the data into training, validation, and test sets consistently across all models. By ensuring that the models are evaluated on the same data subsets, researchers enable a fair comparison of their performance.
By considering the factors above, researchers can conduct a comprehensive and fair model performance comparison, leading to informed decisions about model selection for the specific problem at hand.
# Creating tabular format for better comparison
tbl=pd.DataFrame()
tbl['Model']=pd.Series(['Logistic Regression','Multinomial NB',
                        'Decision Tree','KNN','Random Forest','Adaptive Boosting'])
tbl['Accuracy']=pd.Series([acc_lr_tf,acc_nb_tf,acc_dt_tf,acc_knn_tf,
                           acc_rf_tf,acc_ab_tf])
tbl['F1_Score']=pd.Series([f1_lr_tf,f1_nb_tf,f1_dt_tf,f1_knn_tf,
                           f1_rf_tf,f1_ab_tf])
tbl.set_index('Model')
# Best model on the basis of F1 Score
tbl.sort_values('F1_Score',ascending=False)
Cross-validation is indeed a valuable technique to help avoid overfitting when training machine learning models. It provides a robust evaluation of the model’s performance by using multiple subsets of the data for training and testing. It helps assess the model’s generalization capability by estimating its performance on unseen data.
# Using cross validation method to avoid overfitting
import statistics as st
vector = TfidfVectorizer()
x_train_v = vector.fit_transform(x_train)
x_test_v = vector.transform(x_test)
# Model building
lr =LogisticRegression()
mnb=MultinomialNB()
dct=DecisionTreeClassifier(random_state=1)
knn=KNeighborsClassifier()
rf=RandomForestClassifier(random_state=1)
ab=AdaBoostClassifier(random_state=1)
m =[lr,mnb,dct,knn,rf,ab]
model_name=['Logistic R','MultiNB','DecTRee','KNN','R forest','Ada Boost']
results, mean_results, p, f1_test=list(),list(),list(),list()
# Model fitting, cross-validating and evaluating performance
def algor(model):
    print('\n', i)
    pipe = Pipeline([('model', model)])
    pipe.fit(x_train_v, y_train)
    cv = StratifiedKFold(n_splits=5)
    n_scores = cross_val_score(pipe, x_train_v, y_train, scoring='f1_weighted',
                               cv=cv, n_jobs=-1, error_score='raise')
    results.append(n_scores)
    mean_results.append(st.mean(n_scores))
    print('f1-Score(train): mean= (%.3f), min= (%.3f), max= (%.3f), stdev= (%.3f)'
          % (st.mean(n_scores), min(n_scores), max(n_scores), np.std(n_scores)))
    y_pred = cross_val_predict(model, x_train_v, y_train, cv=cv)
    p.append(y_pred)
    f1 = f1_score(y_train, y_pred, average='weighted')
    f1_test.append(f1)
    print('f1-Score(test): %.4f' % (f1))

for i in m:
    algor(i)
# Model comparison By Visualizing
fig=plt.subplots(figsize=(20,15))
plt.title('MODEL EVALUATION BY CROSS VALIDATION METHOD')
plt.xlabel('MODELS')
plt.ylabel('F1 Score')
plt.boxplot(results,labels=model_name,showmeans=True)
plt.show()
x=stress['clean_text']
y=stress['label']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
vector = TfidfVectorizer()
x_train = vector.fit_transform(x_train)
x_test = vector.transform(x_test)
model_lr_tf=LogisticRegression()
model_lr_tf.fit(x_train,y_train)
y_pred=model_lr_tf.predict(x_test)
# Model Evaluation
conf=confusion_matrix(y_test,y_pred)
acc_lr=accuracy_score(y_test,y_pred)
f1_lr=f1_score(y_test,y_pred,average='weighted')
print('Accuracy: ',acc_lr)
print('F1 Score: ',f1_lr)
print(10*'===========')
print('Confusion Matrix: \n',conf)
print(10*'===========')
print('Classification Report: \n',classification_report(y_test,y_pred))
The dataset contains text messages or documents that are labeled as either stressed or non-stressed. The code loops through the two labels, creates a word cloud for each label using the WordCloud library, and displays the visualization. Each word cloud represents the most commonly used words in the respective category, with larger words indicating higher frequency. The choice of color map (‘winter’, ‘autumn’, ‘magma’, ‘viridis’, ‘plasma’) determines the color scheme of the word clouds. The resulting visualizations provide a concise representation of the most frequent words associated with stressed and non-stressed messages or documents.
Here are word clouds representing stressed and non-stressed words commonly associated with stress detection:
for label, cmap in zip([0, 1],
                       ['winter', 'autumn', 'magma', 'viridis', 'plasma']):
    text = stress.query('label == @label')['text'].str.cat(sep=' ')
    plt.figure(figsize=(12, 9))
    wc = WordCloud(width=1000, height=600, background_color="#f8f8f8", colormap=cmap)
    wc.generate_from_text(text)
    plt.imshow(wc)
    plt.axis("off")
    plt.title(f"Words Commonly Used in ${label}$ Messages", size=20)
    plt.show()
The new input data is preprocessed and features are extracted to match the model’s expectations. The predict function is then used to generate predictions based on the extracted features. Finally, the predictions are printed or utilized as required for further analysis or decision-making.
data=["""I don't have the ability to cope with it anymore. I'm trying,
but a lot of things are triggering me, and I'm shutting down at work,
just finding the place I feel safest, and staying there for an hour
or two until I feel like I can do something again. I'm tired of watching
my back, tired of traveling to places I don't feel safe, tired of
reliving that moment, tired of being triggered, tired of the stress,
tired of anxiety and knots in my stomach, tired of irrational thought
when triggered, tired of irrational paranoia. I'm exhausted and need
a break, but know it won't be enough until I journey the long road
through therapy. I'm not suicidal at all, just wishing this pain and
misery would end, to have my life back again."""]
data=vector.transform(data)
model_lr_tf.predict(data)
data=["""In case this is the first time you're reading this post...
We are looking for people who are willing to complete some
online questionnaires about employment and well-being which
we hope will help us to improve services for assisting people
with mental health difficulties to obtain and retain employment.
We are developing an employment questionnaire for people with
personality disorders; however we are looking for people from all
backgrounds to complete it. That means you do not need to have a
diagnosis of personality disorder – you just need to have an
interest in completing the online questionnaires. The questionnaires
will only take about 10 minutes to complete online. For your
participation, we’ll donate £1 on your behalf to a mental health
charity (Young Minds: Child & Adolescent Mental Health, Mental
Health Foundation, or Rethink)"""]
data=vector.transform(data)
model_lr_tf.predict(data)
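Note that both snippets above pass the raw text straight to the fitted vectorizer. To mirror the training pipeline more closely, the same clean_text preprocessing can be applied first; the snippet below is a minimal sketch on a made-up message, reusing clean_text, vector, and model_lr_tf from earlier:
# Optional: apply the same cleaning used on the training texts before vectorizing
new_texts = ["I can't stop worrying about everything lately."]   # made-up example message
new_texts_clean = [clean_text(t) for t in new_texts]
new_features = vector.transform(new_texts_clean)
print(model_lr_tf.predict(new_features))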
The application of machine learning techniques to predicting stress levels provides personalized insights for mental well-being. By analyzing a variety of factors such as numerical measurements (e.g., blood pressure, heart rate) and categorical characteristics (e.g., gender, occupation), machine learning models can learn patterns and make predictions about an individual’s stress level. With the ability to accurately detect and monitor stress levels, machine learning contributes to the development of proactive strategies and interventions to manage and enhance mental well-being.
We explored the insights from using machine learning in stress prediction and its potential to revolutionize our approach to addressing this critical issue.
In conclusion, this stress prediction analysis provides valuable insights into stress levels and their prediction using machine learning. Use the findings to develop tools and interventions for stress management, promoting overall well-being and improved quality of life.
Q: What are the benefits of using machine learning for stress detection from text?
A: 1. Objective Assessment: It provides an objective and data-driven approach to assess stress levels, eliminating potential biases that may arise in subjective assessments.
2. Scalability: Machine learning algorithms can process large volumes of text data efficiently, making it scalable for analyzing a wide range of textual expressions.
3. Real-time Monitoring: By automating stress detection, it enables real-time monitoring of stress levels, allowing for timely interventions and support.
4. Insights and Research: It can uncover insights and trends related to stress, contributing to the understanding of stress triggers, impacts, and potential interventions.
Q: What sources of text data can be used for stress detection?
A: 1. Social Media Posts: Textual content from platforms like Twitter, Facebook, or online forums where individuals express their thoughts and emotions.
2. Chat Logs: Conversational data from messaging apps, online support systems, or mental health chatbots.
3. Online Surveys or Questionnaires: Textual responses to questions related to stress or mental well-being.
4. Electronic Health Records: Clinical notes or patient narratives that contain relevant information about stress-related experiences.
Q: What are the challenges of detecting stress from text data?
A: 1. Textual expressions of stress can vary greatly across individuals, making it challenging to capture all relevant indicators and patterns.
2. Contextual understanding is crucial in stress detection, as the same text can be read differently depending on the context and individual.
3. Acquiring labeled data for training machine learning models can be time-consuming and resource-intensive, requiring expert input or subjective judgments.
4. Ensuring data privacy, confidentiality, and ethical handling of sensitive mental health information is paramount when working with text data related to stress.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.