WhatsApp Chat Analysis: End-to-End Data Analysis Project with Deployment

Raghav Agrawal Last Updated : 10 Mar, 2023
17 min read

Introduction

Machine learning projects always excite people and inspire them to learn more. But a machine learning model works on data, and before model construction we need to analyze and understand that data to identify its hidden patterns; this is the data analysis stage. WhatsApp Chat Data Analysis is a data-analysis engine where you can upload a WhatsApp chat in text format and generate a complete analysis report for a group or an individual.

This article was published as a part of the Data Science Blogathon.


Problem Statement and Objective

Everyone today uses WhatsApp for daily conversation. It has become one of the biggest business engines, where e-commerce businesses share product designs and details, accept orders, handle money transactions, and much more. Yet WhatsApp does not have analytics support through which people or businesses can analyze their daily or monthly activities to understand where they are lacking, what customers demand, sales, marketing, the activeness of group members, and many other things.

To address this, we aim to develop a complete interface where users can upload a WhatsApp chat in text format by exporting it from WhatsApp. It will offer two ways to study the chats: overall or per user. On submitting the chat, the engine will display a complete report with interactive, easy-to-understand graphs, giving the user an in-depth idea of how the business over WhatsApp is performing. The report will include the following analyses:

  1. Total number of messages
  2. Total words
  3. Number of Media and links shared
  4. Monthly and Daily Timeline: Chat activity on a daily basis and on a monthly basis.
  5. Most busy day and month – Which day of the week sees the most conversation, and which month of the year includes the most?
  6. Weekly activity map
  7. Most Busy Users
  8. Top and common words in conversation
  9. Emoji analysis

I hope you are excited to develop this amazing project.

Prerequisites for Data Analysis

  • Python: You should be familiar with Python basics and syntax.
  • Pandas: A Python library used to preprocess the data. We are working with a dataframe, so we will apply some Pandas processing functions. It is also used for the analysis itself.
  • Matplotlib: A Python library for data visualization.
  • Streamlit: A Python-based UI framework for creating web applications without HTML or CSS. The basics of Streamlit are sufficient to follow the syntax. Please refer to this article if you do not know about Streamlit or want to explore it.

How to Export Chat from WhatsApp?

The WhatsApp chat is our data for analysis. To get the text file of the chat, follow this simple four-step process. We are working with a 24-hour date-time format, so before exporting the chat file, set your phone's date and time setting to a 24-hour format.

  • Open any WhatsApp group or individual chat you want to analyze.
  • Tap the three dots in the top-right corner, then tap More.
  • Tap Export chat and select Without media.
  • Now you can share or download the text file of the chats.

Data Analysis: Project Development Steps

We are ready with the theoretical explanation of the project, and it is time for development. Before developing, we need to keep our steps clear, as listed below. Export one WhatsApp chat and create a new Jupyter notebook or Google Colab notebook.

  1. Load the text file and convert the chat to a dataframe.
  2. Create an analytics function to meet each objective.
  3. Create a Streamlit app that integrates each function to display our analysis.
  4. Deploy the app to the cloud so anyone can get an analysis of any chat.

Data Analysis: Create DataFrame from Chat File

We need to create a dataframe from the text file containing the WhatsApp chat. The first column will contain the user name and message, and the second column will contain the message’s date.

import re
import pandas as pd

#read the text file
path = "WhatsApp Chat with IT2 Shining Stars 2018-22.txt"
f = open(path, 'r', encoding='utf-8')
data = f.read()
print(type(data))

#regular expression to find the dates
pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'

#pass the pattern and data to split it to get the list of messages
messages = re.split(pattern, data)[1:]
print(messages)

#extract all dates
dates = re.findall(pattern, data)

#create dataframe
df = pd.DataFrame({'user_message': messages, 'message_date': dates})
# convert message_date type (two-digit year here; use %Y if your export shows a four-digit year)
df['message_date'] = pd.to_datetime(df['message_date'], format='%d/%m/%y, %H:%M - ')
df.rename(columns={'message_date': 'date'}, inplace=True)
df.head(4)

So, first, we load the file in read mode. Then we have to separate the messages and dates, so we use a regex (regular expression) pattern that matches the dates. Splitting the data on this pattern separates the messages from the dates, after which we can pick out all the dates by applying the same pattern with findall. The code snippet above includes comments to explain each statement.
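Note that the exported date format depends on your phone’s locale: this chat uses a two-digit year (%d/%m/%y), while the preprocessor later in this article uses a four-digit year (%d/%m/%Y). A minimal hedged sketch that tries both, in case you are unsure which format your export uses (parse_dates is a hypothetical helper, not part of the original code):

def parse_dates(series):
    # hypothetical helper: try both year widths seen in WhatsApp exports
    for fmt in ('%d/%m/%y, %H:%M - ', '%d/%m/%Y, %H:%M - '):
        try:
            return pd.to_datetime(series, format=fmt)
        except ValueError:
            continue
    raise ValueError('unrecognized date format in chat export')

# usage, in place of the single pd.to_datetime call above:
# df['message_date'] = parse_dates(df['message_date'])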

Separate the Message and User Name

We have a dataframe, but the user name and message are present in a single column. To separate them, we split the string on the colon that follows the user name (matched with alphanumeric characters) and pick the first part as the name and the rest as the message.

#separate Users and Message
users = []
messages = []
for message in df['user_message']:
    entry = re.split(r'([\w\W]+?):\s', message)
    if entry[1:]:  # user name
        users.append(entry[1])
        messages.append(" ".join(entry[2:]))
    else:
        users.append('group_notification')
        messages.append(entry[0])

df['user'] = users
df['message'] = messages
df.drop(columns=['user_message'], inplace=True)

Breaking the Date Column into Different Columns

We will break down the date column into multiple columns (year, month, day, hour, and minute) for better analysis.

#Extract multiple columns from the Date Column
df['only_date'] = df['date'].dt.date
df['year'] = df['date'].dt.year
df['month_num'] = df['date'].dt.month
df['month'] = df['date'].dt.month_name()
df['day'] = df['date'].dt.day
df['day_name'] = df['date'].dt.day_name()
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute

Now that the dataframe is ready, we can start our analysis and showcase each result with interactive graphs. We also need to add a period column representing the hour window in which each message was recorded (i.e., between which hour and the next). When we create the Streamlit app, we will provide an option for overall or user-level analysis, where the dataframe gets filtered, so our functions must run fine on both.

#add a period column that shows the hour window (24-hour format) of each message
period = []
for hour in df['hour']:
    if hour == 23:
        period.append(str(hour) + "-" + str('00'))
    elif hour == 0:
        period.append(str('00') + "-" + str(hour + 1))
    else:
        period.append(str(hour) + "-" + str(hour + 1))

df['period'] = period
df.head()

Data Analysis: Display Basic Statistics

We are supposed to provide an overview of the chats, including total messages, words, and media shared, to get an idea of how much conversation has happened.

Get the Total Number of Messages

To find the total number of messages, you only need the number of rows in the dataframe (or in the message column). To find the number of messages of a particular user, select that user first; the dataframe gets filtered, so the same code gives the correct result.

#Total Messages
df.shape[0]

Get the Total Number of Words

To find the total number of words, loop over the messages column and sum the lengths of the messages' word lists. In simple words, count the words in each message and sum them up.

#Total Number of words
words = []
for message in df['message']:
  words.extend(message.split())

print(len(words))

Get the Number of Media Messages

We selected the option without media when we exported the chat data to a text file. So in the text file, each media item is replaced by the placeholder text '<Media omitted>'. To count the number of media files shared, we count the occurrences of this placeholder.

#Number of Media Files shared
df[df['message'] == '<Media omitted>\n'].shape[0]

Get the Total Number of Links Shared

To count the links, Python has a handy library, urlextract, whose URL extractor can pull all the URLs from a given string into a list. So we find all the URLs in the messages and sum their count.

!pip install urlextract

#Number of Links Shared
from urlextract import URLExtract
extract = URLExtract()

links = []
for message in df['message']:
    links.extend(extract.find_urls(message))

print(len(links))

Data Analysis: Find the Busiest Users in a Group

These stats apply only to group-level analysis and will not work at the user level. We will find the top 5 users who sent the most messages. To find them, count the number of messages sent by each user, sort the counts in descending order, and extract the top five. You do not need to implement the whole process yourself, because Pandas' value_counts function does it directly. We display the result as a bar graph.

import matplotlib.pyplot as plt

x = df['user'].value_counts().head()
user_names = x.index
msg_count = x.values

plt.bar(user_names, msg_count)
plt.xticks(rotation='vertical')
plt.show()
"

Along with displaying the bar graph, we will also display the percentage of messages each user has sent. To find the percentage, divide each user's message count by the total number of messages and multiply by 100. After that, we round the value to 2 decimal places and convert it to a dataframe with renamed columns.

# note: pandas >= 2.0 names the reset_index() columns 'user' and 'count';
# the rename below matches pandas < 2.0, so adjust it for your version
new_df = round(((df['user'].value_counts() / df.shape[0]) * 100), 2).reset_index().rename(
        columns={'index': 'name', 'user': 'percent'})

new_df.head()

Display Top Words in a Chat

We will display a word cloud of the top words frequently used in the chats: words with a higher frequency appear larger in the cloud. The word cloud is generated from the message column, and Python directly supports this through the wordcloud library, which is widely used in text mining.

pip install wordcloud

We must clean the data a bit to find the most frequent words. If you want to see the stop words and the problems listed below in the data, run the code once without the transformations and then apply them to observe the difference.

  • Remove group notifications: There are many notifications mixed into the chat that have nothing to do with this analysis, so we need to remove them.
  • Remove '<Media omitted>': A lot of media was shared, and since we exported without media, the placeholder text is embedded in the chat and needs to be removed.
  • Remove stop words: Our WhatsApp chat data is in Hindi plus English, since many of us type in both languages. The standard Python stop-word lists support only English, so if your chats are only in English, those work directly; otherwise, you can download this file, which has stop words in both languages. Adapt it according to your chats. You can also remove punctuation or other frequently used variants; for example, some people write ‘Hi’ as ‘Hie.’
import string

def remove_stop_words(message):
    f = open('stop_hinglish.txt', 'r')
    # split into a list: membership testing on the raw string would match substrings
    stop_words = f.read().split('\n')
    y = []
    for word in message.lower().split():
        if word not in stop_words:
            y.append(word)
    return " ".join(y)

def remove_punctuation(message):
    return re.sub('[%s]' % re.escape(string.punctuation), '', message)

#Data Cleaning
temp = df[df['user'] != 'group_notification'].copy() #remove group notifications
temp = temp[temp['message'] != '<Media omitted>\n'] #remove media placeholder
temp['message'] = temp['message'].apply(remove_stop_words) #remove stopwords
temp['message'] = temp['message'].apply(remove_punctuation) #remove punctuation

#Draw the wordCloud
from wordcloud import WordCloud
plt.figure(figsize=(20, 10))
wc = WordCloud(width=1000,height=750,min_font_size=10,background_color='white')
cloud = wc.generate(temp['message'].str.cat(sep=" "))
plt.imshow(cloud)
  • The remove_stop_words function loads the stop-words file and, for each message, checks whether each word of the message appears in the stop-words list. If found, it excludes that word and keeps the remaining words of the message.
  • The remove_punctuation function removes any kind of punctuation. Python's string module provides all the punctuation characters, which we replace in the string with an empty string using a regex.
"

Find the Top 20 Most Common Words

This question is similar to the previous one, but the context is different: here we find the top 20 most frequently used words other than stop words. Stop words help form a sentence but carry no specific meaning for the context. We could write custom code with a dictionary that stores each word as a key and its count across all messages as the value, and then pick the words with the highest frequency; Python's Counter class does exactly this for us. The cleaning steps are the same as above.

temp = df[df['user'] != 'group_notification'].copy() #remove group notifications
temp = temp[temp['message'] != '<Media omitted>\n']  #remove media placeholder
temp['message'] = temp['message'].apply(remove_stop_words) #remove stop words
temp['message'] = temp['message'].apply(remove_punctuation) #remove punctuation

words = []
for message in temp['message']:
  words.extend(message.split())

#apply counter
from collections import Counter
most_common_df = pd.DataFrame(Counter(words).most_common(20))
most_common_df

Emoji Analysis

Expressions are a part of body language that convey your message to the other person. While chatting, we use different emojis to express different feelings. We will analyze which emojis are used and how many times each appears in a chat.

To count each emoji, you need to install one library, named emoji. After that, the code is very simple: first we find the emojis in each message and store them in one list, and then we count the occurrences of each emoji.

!pip install emoji

import emoji

emojis = []
for message in df['message']:
  # EMOJI_DATA exists in emoji >= 2.0; older versions expose UNICODE_EMOJI instead
  emojis.extend([c for c in message if c in emoji.EMOJI_DATA])

pd.DataFrame(Counter(emojis).most_common(len(Counter(emojis))))
"

Time-based Analysis

Now we will do a time-based analysis: the x-axis carries the timeline and the y-axis the number of messages, showing when users were most active. We will display it on both a monthly and a daily basis.

Monthly Chats Timeline

We will display a line chart showing the number of messages per month of each year. For this, we count the messages grouped by the year and month columns, and to label the x-axis we combine month and year into a single time column.

timeline = df.groupby(['year', 'month_num', 'month']).count()['message'].reset_index()
month_timeline = []

for i in range(timeline.shape[0]):
  month_timeline.append(timeline['month'][i] + "-" + str(timeline['year'][i]))

timeline['time'] = month_timeline

#draw plot
plt.figure(figsize=(12,6))
plt.plot(timeline['time'], timeline['message'])
plt.xticks(rotation='vertical')
plt.show()
"

Daily Timeline

Similarly, we can create a daily timeline by grouping the data by date and counting the number of messages. A line chart is perfect for displaying this analysis.

daily_timeline = df.groupby('only_date').count()['message'].reset_index()

plt.figure(figsize=(12,6))
plt.plot(daily_timeline['only_date'], daily_timeline['message'])
plt.show()

Day-based Activity Map

This analysis finds which day of the week has the highest number of chats, or in other words, the busiest day of the week.

busy_day = df['day_name'].value_counts()
plt.figure(figsize=(12, 6))
plt.bar(busy_day.index, busy_day.values, color='purple')
plt.title("Busy Day")
plt.xticks(rotation='vertical')
plt.show()

Monthly Activity Map

Find the month in which the most chats happened, i.e., the busiest month of the year, just as we found the busiest day above.

busy_month = df['month'].value_counts()
plt.figure(figsize=(12, 6))
plt.bar(busy_month.index, busy_month.values, color='orange')
plt.title("Busy Month")
plt.xticks(rotation='vertical')
plt.show()

At What Time Do Users Remain Active?

This is an interesting analysis that shows at which hours of the day a user is active or offline. A heatmap is a great graph for this: dark cells show when the user was offline, and brighter cells show when the user was active. It helps a business decide at what time to post an advertisement so that it gets more feedback and clicks within a short span and the ad ranks well.

import seaborn as sns
plt.figure(figsize=(18, 9))
sns.heatmap(df.pivot_table(index='day_name', columns='period', values='message', 
            aggfunc='count').fillna(0))
plt.yticks(rotation='vertical')
plt.show()

Creating Streamlit Web App after Data Analysis

Streamlit is a Python web framework used to create data apps without any knowledge of front-end technologies (HTML, CSS, and JS); it is often called the fastest way to build and deploy data apps. It includes pre-built HTML elements like buttons, sidebars, text boxes, and input fields. If you do not know about Streamlit, you can review this article.

Create one folder for storing the project files. First, you need to install the libraries required to create the app: open the command prompt or Anaconda prompt in the project folder directory and run the commands below one by one. After that, create the Python files listed below to organize the code where we will combine the complete analysis (a sketch of the resulting project layout follows the file list).

pip install streamlit
pip install urlextract
pip install matplotlib
pip install wordcloud
pip install emoji
  • preprocessor.py – At the start, we did some data preprocessing, so we store all of it in a preprocessing function in this separate file.
  • helper.py – We created different analyses (monthly, weekly, busy users, etc.), so we store a function for each analysis in this helper file.
  • app.py – The main web app file where the Streamlit code is written. Here we get the data and, using each helper function, display our analysis on the UI.
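
For reference, the project folder will end up looking roughly like this (the chat and stop-words files are the ones used earlier; the deployment files are added in a later section):

project-folder/
├── app.py              # main Streamlit UI
├── preprocessor.py     # chat text file -> dataframe
├── helper.py           # analytics functions
├── stop_hinglish.txt   # Hindi + English stop words
├── requirements.txt    # added later for deployment
├── Procfile            # added later for deployment
└── setup.sh            # added later for deployment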

I hope you have created the above files. We have done all the analysis and preprocessing in a Jupyter notebook, so we only need to combine it in the Streamlit files, which I provide in the code snippets below along with comments as explanation. Write the code below in the respective files.

preprocessor.py

In this file, we accept the text-file data, then create and return the dataframe with the same columns we prepared at the beginning.

import re
import pandas as pd

def preprocess(data):
    pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'

    messages = re.split(pattern, data)[1:]
    dates = re.findall(pattern, data)

    df = pd.DataFrame({'user_message': messages, 'message_date': dates})
    # convert message_date type (four-digit year here; use %y if your export shows a two-digit year)
    df['message_date'] = pd.to_datetime(df['message_date'], format='%d/%m/%Y, %H:%M - ')

    df.rename(columns={'message_date': 'date'}, inplace=True)

    users = []
    messages = []
    for message in df['user_message']:
        entry = re.split(r'([\w\W]+?):\s', message)
        if entry[1:]:  # user name
            users.append(entry[1])
            messages.append(" ".join(entry[2:]))
        else:
            users.append('group_notification')
            messages.append(entry[0])

    df['user'] = users
    df['message'] = messages
    df.drop(columns=['user_message'], inplace=True)

    df['only_date'] = df['date'].dt.date
    df['year'] = df['date'].dt.year
    df['month_num'] = df['date'].dt.month
    df['month'] = df['date'].dt.month_name()
    df['day'] = df['date'].dt.day
    df['day_name'] = df['date'].dt.day_name()
    df['hour'] = df['date'].dt.hour
    df['minute'] = df['date'].dt.minute

    period = []
    for hour in df['hour']:
        if hour == 23:
            period.append(str(hour) + "-" + str('00'))
        elif hour == 0:
            period.append(str('00') + "-" + str(hour + 1))
        else:
            period.append(str(hour) + "-" + str(hour + 1))

    df['period'] = period

    return df
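
A quick, optional way to sanity-check the function (using the example chat file from earlier):

if __name__ == '__main__':
    with open('WhatsApp Chat with IT2 Shining Stars 2018-22.txt', 'r', encoding='utf-8') as f:
        df = preprocess(f.read())
    print(df.shape)
    print(df.head())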

helper.py

In the helper file, we write all the analytics functions; each accepts the dataframe and the selected user as parameters and returns the required results.

from urlextract import URLExtract
from wordcloud import WordCloud
from collections import Counter
import pandas as pd
import string
import re
import emoji

extract = URLExtract()

def fetch_stats(selected_user, df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]

    # fetch the number of messages
    num_messages = df.shape[0]

    # fetch the total number of words
    words = []
    for message in df['message']:
        words.extend(message.split())

    # fetch number of media messages
    num_media_messages = df[df['message'] == '<Media omitted>\n'].shape[0]

    # fetch number of links shared
    links = []
    for message in df['message']:
        links.extend(extract.find_urls(message))

    return num_messages,len(words),num_media_messages,len(links)

#func will only work in group chat analysis
def most_busy_users(df):
    x = df['user'].value_counts().head()
    # note: pandas >= 2.0 names the reset_index() columns 'user' and 'count',
    # so adjust the rename below for your pandas version
    df = (round((df['user'].value_counts() / df.shape[0]) * 100, 2)
          .reset_index()
          .rename(columns={'index': 'name', 'user': 'percent'}))
    return x, df

def remove_stop_words(message):
    f = open('stop_hinglish.txt', 'r')
    # split into a list: membership testing on the raw string would match substrings
    stop_words = f.read().split('\n')
    y = []
    for word in message.lower().split():
        if word not in stop_words:
            y.append(word)
    return " ".join(y)

def remove_punctuation(message):
    return re.sub('[%s]' % re.escape(string.punctuation), '', message)

def create_wordcloud(selected_user,df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]

    temp = df[df['user'] != 'group_notification'].copy()
    temp = temp[temp['message'] != '<Media omitted>\n']
    temp['message'] = temp['message'].apply(remove_stop_words)
    temp['message'] = temp['message'].apply(remove_punctuation)

    wc = WordCloud(width=500,height=500,min_font_size=10,background_color='white')
    df_wc = wc.generate(temp['message'].str.cat(sep=" "))
    return df_wc

def most_common_words(selected_user,df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]

    temp = df[df['user'] != 'group_notification'].copy()
    temp = temp[temp['message'] != '<Media omitted>\n']
    temp['message'] = temp['message'].apply(remove_stop_words)
    temp['message'] = temp['message'].apply(remove_punctuation)
    words = []

    for message in temp['message']:
        words.extend(message.split())

    most_common_df = pd.DataFrame(Counter(words).most_common(20))
    return most_common_df

def emoji_helper(selected_user, df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]

    emojis = []
    for message in df['message']:
        emojis.extend([c for c in message if c in emoji.EMOJI_DATA])

    emoji_df = pd.DataFrame(Counter(emojis).most_common(len(Counter(emojis))))
    return emoji_df

def monthly_timeline(selected_user,df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]

    timeline = (df.groupby(['year', 'month_num', 'month'])
                  .count()['message']
                  .reset_index())
    month_timeline = []
    for i in range(timeline.shape[0]):
        month_timeline.append(timeline['month'][i]+"-"+str(timeline['year'][i]))

    timeline['time'] = month_timeline
    return timeline

def daily_timeline(selected_user,df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]
    daily_timeline = df.groupby('only_date').count()['message'].reset_index()
    return daily_timeline

def week_activity_map(selected_user,df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]
    return df['day_name'].value_counts()

def month_activity_map(selected_user,df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]
    return df['month'].value_counts()

def activity_heatmap(selected_user,df):
    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]

    user_heatmap = df.pivot_table(index='day_name', columns='period', 
                values='message', aggfunc='count').fillna(0)
    return user_heatmap

app.py

This is the main file displayed to the user. First, we initialize the Streamlit sidebar, where we accept the chat file in text format, the analysis level (overall or a specific user), and a submit button. Then we capture the button-click event: on click, we take the inputs, run all our analytics functions, and display the results in a report format on the UI.

import streamlit as st
import preprocessor, helper #local file func
import matplotlib.pyplot as plt
import seaborn as sns

st.sidebar.title("WhatsApp Chat Analyzer")

#create a file uploader to upload the txt file
uploaded_file = st.sidebar.file_uploader("Choose a file")
if uploaded_file is not None:
    bytes_data = uploaded_file.getvalue()
    #convert byte to string
    data = bytes_data.decode("utf-8")
    
    #call the preprocess func to give df
    df = preprocessor.preprocess(data)

    #provide an option to analyze data on group level or specific user
    #fetch unique users
    user_list = df['user'].unique().tolist()
    user_list.remove('group_notification')
    user_list.sort()
    user_list.insert(0, 'Overall')
    selected_user = st.sidebar.selectbox("Show analysis wrt",user_list)
    
    #button to analyze chat
    if st.sidebar.button("Show Analysis"):
        #Display basic stats in 4 cols
        num_messages, words, num_media_messages, num_links = helper.fetch_stats(selected_user,df)
        col1, col2, col3, col4 = st.columns(4)

        with col1:
            st.header(":blue[Total Messages]")
            st.title(num_messages)
        with col2:
            st.header(":blue[Total Words]")
            st.title(words)
        with col3:
            st.header(":blue[Media Shared]")
            st.title(num_media_messages)
        with col4:
            st.header(":blue[Links Shared]")
            st.title(num_links)

        #Monthly timeline
        st.title(":blue[Monthly Chat Timeline]")
        timeline = helper.monthly_timeline(selected_user, df)
        fig,ax = plt.subplots()
        ax.plot(timeline['time'], timeline['message'], color='green')
        plt.xticks(rotation='vertical')
        st.pyplot(fig)

        #Daily Timeline
        st.title(":blue[Daily Timeline]")
        daily_timeline = helper.daily_timeline(selected_user, df)
        fig, ax = plt.subplots()
        ax.plot(daily_timeline['only_date'], daily_timeline['message'], color='black')
        plt.xticks(rotation='vertical')
        st.pyplot(fig)

        # activity map
        st.title(':blue[Activity Map]')
        col1,col2 = st.columns(2)
        #weekly activity
        with col1:
            st.header(":green[Most busy day]")
            busy_day = helper.week_activity_map(selected_user,df)
            fig,ax = plt.subplots()
            ax.bar(busy_day.index,busy_day.values,color='purple')
            plt.xticks(rotation='vertical')
            st.pyplot(fig)
        #monthly activity
        with col2:
            st.header(":green[Most busy month]")
            busy_month = helper.month_activity_map(selected_user, df)
            fig, ax = plt.subplots()
            ax.bar(busy_month.index, busy_month.values,color='orange')
            plt.xticks(rotation='vertical')
            st.pyplot(fig)

        #time activity
        st.title("Weekly Activity Map")
        user_heatmap = helper.activity_heatmap(selected_user,df)
        fig,ax = plt.subplots()
        ax = sns.heatmap(user_heatmap)
        st.pyplot(fig)

        # finding the busiest users in the group(Group level)
        if selected_user == 'Overall':
            st.title(':blue[Most Busy Users]')
            x,new_df = helper.most_busy_users(df)
            fig, ax = plt.subplots()

            col1, col2 = st.columns(2)

            with col1:
                ax.bar(x.index, x.values,color='red')
                plt.xticks(rotation='vertical')
                st.pyplot(fig)
            with col2:
                st.dataframe(new_df)

        # WordCloud (Top Frequent words)
        st.title(":blue[Wordcloud]")
        df_wc = helper.create_wordcloud(selected_user,df)
        fig,ax = plt.subplots()
        ax.imshow(df_wc)
        st.pyplot(fig)

        # most common words
        st.title(':blue[Most common words]')
        most_common_df = helper.most_common_words(selected_user,df)
        fig,ax = plt.subplots()
        ax.barh(most_common_df[0],most_common_df[1])
        plt.xticks(rotation='vertical')
        st.pyplot(fig)

        # emoji analysis
        st.title(":blue[Emoji Analysis]")
        emoji_df = helper.emoji_helper(selected_user,df)
        if emoji_df.shape[0] > 0:
            col1,col2 = st.columns(2)
            with col1:
                st.dataframe(emoji_df)
            with col2:
                fig,ax = plt.subplots()
                ax.pie(emoji_df[1].head(),labels=emoji_df[0].head(),autopct="%0.2f")
                st.pyplot(fig)
        else:
            st.write(":red[No Emojis Send by this user]")
            

Running the App on your localhost

We are done with the coding, and now you must be wondering how it will look on a server. Open the command prompt in the project directory and run the command below. After it starts, you will get a localhost URL; copy it and open it in your browser to use the app.

streamlit run app.py

Deploy Streamlit App on Heroku after Data Analysis

We have created a WhatsApp Chat Analyzer that runs on our local server. While it runs only on the local system, only we can use the app; if we want the public to use it and give feedback, we need to deploy the app to the cloud. A cloud platform like Heroku lets us deploy the application and make it accessible through a URL.

Prepare Cloud Files

To deploy an application on the cloud, we need to provide some details, and to meet the platform's requirements we create a few files that tell the cloud how to understand and run the application.

1. Procfile

Create a file named Procfile, without any extension; it tells the cloud which command to run to start the application.

web: sh setup.sh && streamlit run app.py

2. Requirements

Create a file named requirements.txt listing all the libraries used to build the project, so that the cloud installs the required libraries before running the application. You can also pin library versions in this file.

streamlit
matplotlib
seaborn
urlextract
wordcloud
pandas
emoji

3. Setup File (setup.sh)

This file creates Streamlit's configuration directory on the cloud and writes the server settings before the app starts (the Procfile above runs it via sh setup.sh).

mkdir -p ~/.streamlit/

echo "\
[server]\n\
port = $PORT\n\
enableCORS = false\n\
headless = true\n\
\n\
" > ~/.streamlit/config.toml

Upload the Code to GitHub

There are two ways to deploy applications on Heroku: through GitHub or through the Heroku CLI. GitHub is the easier way to deploy the application to the cloud. Log in to GitHub, create a new repository, and copy the repository's SSH link to connect to it. Now open Git Bash in the project folder directory and run the below commands one after the other.
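
The exact remote URL is the SSH link you copied; a typical command sequence looks like this (the username and repository name are placeholders for your own):

git init
git add .
git commit -m "WhatsApp chat analyzer"
git branch -M main
git remote add origin git@github.com:<username>/<repository>.git
git push -u origin main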

Deploy to Heroku

Log in to Heroku and create a new app, giving it a unique name. After creating the app, connect it with the GitHub repository where the code is uploaded. It will ask for GitHub verification, after which the repository gets connected. Scroll down and click Deploy Branch on the main branch. Watch the logs; after a successful build, it gives you a unique URL through which the application is accessible. If the steps are unclear, you can refer to this previous blog.

Conclusion

Hurray! We have developed and deployed a data analysis project that analyzes a WhatsApp chat at the group and individual levels. We started with data ingestion, moved through analysis at different levels, and finished with cloud deployment. Let us recap the key learnings from this article.

  • We walked through the data analysis life cycle that precedes any machine learning model building.
  • We used data visualization charts like heatmaps, bar graphs, and line charts, and saw their importance in conveying the data.
  • Emojis (a form of body language) play an important role in conversation, and we learned at a basic level how to analyze emojis with Python. There are more emoji-analysis methods you can explore on the internet.

Resources

Below are the links to the code and files for easy access and for troubleshooting any errors while developing the project.

  • Python Notebook for data analysis: Colab
  • Streamlit Code Files: GitHub

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 

I am a software engineer with a keen passion for data science. I love to learn and explore different data-related techniques and technologies. Writing articles gives me the skill of research and the ability to help others understand what I have learned. I aspire to grow as a prominent data architect through my profession and pursue technical content writing as a passion.
