This article was published as a part of the Data Science Blogathon.
Who doesn't love chocolate? Not everyone likes dark chocolate, though, because it tastes bitter. Still, if you want to stay healthy or get through a stressful situation, this bad guy can give you some relief. Just take a bite of it.
Although this analysis is about dark chocolate, I don't think it will make you sad. You will learn quite a few things while reading this article. Just grab a cup of coffee and read on.
The data is collected from here. This data consists of more than 2500 rows, where every row contains details about different dark chocolate bars. The column descriptions are given below:
id – id number of the review
manufacturer – Name of the bar manufacturer
company_location – Location of the manufacturer
year_reviewed – From 2006 to 2021
bean_origin – Country of origin of the cocoa beans
bar_name – Name of the chocolate bar
cocoa_percent – Cocoa content of the bar (%)
num_ingredients – Number of ingredients
ingredients – B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), V (Vanilla), L (Lecithin), Sa (Salt)
review – Summary of most memorable characteristics of the chocolate bar
Here I ran a small experiment. Rather than importing the data from an Excel file with pandas, I used Heroku Postgres. If you are not interested in learning about it, just go to my GitHub repo and download the Excel file. Otherwise, read on.
At the time of writing, Heroku offered a free plan for hosting a PostgreSQL database. This is helpful when you are a beginner or want a quickly hosted database for experiments.
To create a database and add a table to it, read the article at this link. If you have already gone through that article, just carry on here. Otherwise, create the database and add the table as described there, and then come back to the next paragraph.
To keep the database credentials out of the code, I saved them in a Python dictionary and load them from there when needed. I am using the psycopg2 Python library, which lets us connect to a PostgreSQL database from Python.
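The pickle file that holds this dictionary has to be created once beforehand. Below is a minimal sketch of that step, assuming the key names used in this article; every value is a made-up placeholder, and the file is written to a temporary folder only so that the sketch leaves nothing behind.

```python
import os
import pickle
import tempfile

# Placeholder credentials: every value here is invented for illustration.
credential = {
    "Database": "my_database",
    "Host": "my-host.example.com",
    "User": "my_user",
    "Port": 5432,
    "Password": "my_password",
}

# Write the dictionary to a pickle file. A temporary folder is used here;
# in practice you would pick a location that is not committed to Git.
cred_path = os.path.join(tempfile.mkdtemp(), "heroku_database_credentials.pickle")
with open(cred_path, "wb") as cred_file:
    pickle.dump(credential, cred_file)

# Read it back to confirm that the round trip preserves the dictionary.
with open(cred_path, "rb") as cred_file:
    loaded = pickle.load(cred_file)

print(sorted(loaded))
```

Note that a pickle file is not encrypted. It only keeps the credentials out of the notebook, so the file itself should stay private.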
To connect to the database, we just read the pickle file in which the dictionary is saved and pass the required arguments to the connect() function of psycopg2.
import pickle
import psycopg2

# load the credentials dictionary from the pickle file
with open("heroku_database_credentials.pickle", "rb") as cred:
    credential = pickle.load(cred)

# open a connection to the Heroku Postgres database
conn = psycopg2.connect(
    database=credential['Database'],
    host=credential['Host'],
    user=credential['User'],
    port=credential['Port'],
    password=credential['Password'])
The code is self-explanatory, isn't it? I named the dictionary keys after the arguments of connect() (only capitalized), so there is not much to memorize. Now, let's check that everything works by running a SQL query with pandas.
import pandas as pd

query = "SELECT * FROM chocolate_database"
choco_data = pd.read_sql(query, conn)
choco_data.head()
If everything works as expected, we can see an output like below.
Now let’s see some basic info about the data frame. This can be done by the info() method of the pandas dataframe.
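As a rough illustration of what info() reports, here is a small sketch on a toy data frame. The three rows are invented for this example and are not taken from the dataset. It shows that a percentage stored as text is not treated as a numeric column.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Cocoa Percent": ["70%", "72%", "85%"],
    "Rating": [3.5, 3.0, 2.75],
})

# Prints the row count, the column names, the non-null counts and the dtypes.
toy.info()

# The '%' sign keeps the first column from being stored as a number.
percent_is_numeric = pd.api.types.is_numeric_dtype(toy["Cocoa Percent"])
rating_is_numeric = pd.api.types.is_numeric_dtype(toy["Rating"])
print(percent_is_numeric, rating_is_numeric)
```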
From the above image, we can see that the data has 2530 rows and 10 columns. The Cocoa Percent column has the object datatype because of the '%' sign. Also, some of the column names are too long. So, let's remove the '%' sign and shorten the column names slightly.
column_names = {
    "ref": "reference_no",
    "Company (Manufacturer)": "manufacturer",
    "Company Location": "company_loc",
    "Review Date": "review_date",
    "Country of Bean Origin": "bean_origin",
    "Specific Bean Origin or Bar Name": "bar_name",
    "Cocoa Percent": "cocoa_percent",
    "Ingredients": "ingredients",
    "Most Memorable Characteristics": "taste",
    "Rating": "rating"}

choco_data.rename(columns=column_names, inplace=True)
choco_data.head()
After running the above code, you will see the below output.
Now let’s convert the Cocoa Percent column to float data type.
choco_data['cocoa_percent'] = choco_data['cocoa_percent'].str.strip('%')
choco_data['cocoa_percent'] = choco_data['cocoa_percent'].astype('float')
choco_data['cocoa_percent'].head()
The output of the above code looks like this:
Now we are ready to do some analysis. Let’s see the summary statistics of the numerical columns.
choco_data.describe()
From the above summary statistics, we can see that:
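The reference_no and review_date columns hold numbers, but they work as labels rather than measurements. Let's see the summary statistics of these two, together with the other text columns, after converting them into categorical columns.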
choco_data['reference_no'] = choco_data['reference_no'].astype('object')
choco_data['review_date'] = choco_data['review_date'].astype('object')
choco_data_cat = choco_data.select_dtypes(include=['object'])
choco_data_cat.describe()
From the above result, we can say that:
The dataset has 630 unique reference numbers. Among them, reference number 414 appears most often.
There are 580 unique manufacturers in this dataset. Among them, the company named Soma appears most frequently.
Most of the companies are located in the U.S.A.
More reviews were recorded in 2015 than in any other year.
Venezuela is the most common origin of the cocoa beans.
Madagascar is the most frequent bar name.
The most frequently used combination of ingredients is Beans, Sugar, and Cocoa Butter.
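Statements like the ones above come from the unique, top and freq rows that describe() returns for non-numeric columns. A minimal sketch on a toy frame follows; the four rows are made up and only reuse names mentioned in this article.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "manufacturer": ["Soma", "Soma", "A. Morin", "Bonnat"],
    "review_date": [2015, 2015, 2019, 2015],
})

# Treat the year as a label rather than a number, as done in the article.
toy["review_date"] = toy["review_date"].astype("object")

# With no numeric columns left, describe() reports
# count, unique, top (most frequent value) and freq (its count).
summary = toy.describe()
print(summary)
```

Here top is the most frequent value of a column and freq is the number of times it occurs.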
Now, for the sake of simplicity, let's write a few helper functions so that we don't have to repeat messy code. Their implementations are given below.
import itertools
import pandas as pd
import plotly.graph_objects as go

def count_df(data, data_col):
    # count how often each value of data_col occurs and join the counts to the data
    choco_count = data[data_col].value_counts().rename_axis(data_col).reset_index(name='count')
    choco_data_with_counts = pd.merge(left=data, right=choco_count, left_on=data_col, right_on=data_col)
    return choco_data_with_counts

def number_indicator(val, title_text, row_num, col_num):
    # build a Plotly indicator that shows a single number
    return go.Indicator(mode='number', value=val,
                        number={'valueformat': '0,f'},
                        title={'text': title_text},
                        domain={'row': row_num, 'column': col_num})

def sort_sliced_dict(main_dict, is_reverse=True, item_count=None):
    # sort a dictionary by value and optionally keep only the first item_count items
    sorted_dict = {k: v for k, v in sorted(main_dict.items(), key=lambda item: item[1], reverse=is_reverse)}
    if item_count is not None:
        sorted_dict = dict(itertools.islice(sorted_dict.items(), item_count))
    return sorted_dict
The purpose of each function is given below.
count_df: This function joins the count of each value of a categorical column to the main data.
number_indicator: This function creates a number indicator in Plotly.
sort_sliced_dict: This function sorts a Python dictionary by value and optionally keeps only the first few items.
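To make the behaviour of two of these helpers concrete, here is a small sketch that repeats count_df and sort_sliced_dict as defined in this article and applies them to toy inputs. The bar names, ratings and taste counts are made up for illustration.

```python
import itertools
import pandas as pd

def count_df(data, data_col):
    # Same logic as the helper in the article.
    counts = data[data_col].value_counts().rename_axis(data_col).reset_index(name="count")
    return pd.merge(left=data, right=counts, left_on=data_col, right_on=data_col)

def sort_sliced_dict(main_dict, is_reverse=True, item_count=None):
    # Same logic as the helper in the article.
    sorted_dict = dict(sorted(main_dict.items(), key=lambda item: item[1], reverse=is_reverse))
    if item_count is not None:
        sorted_dict = dict(itertools.islice(sorted_dict.items(), item_count))
    return sorted_dict

# Toy inputs, invented for illustration only.
toy = pd.DataFrame({
    "bar_name": ["Madagascar", "Peru", "Madagascar"],
    "rating": [3.5, 3.0, 4.0],
})

# Every row gets a 'count' column with the frequency of its bar_name.
with_counts = count_df(toy, "bar_name")
print(with_counts)

# Keep the two largest entries of a dictionary, sorted by value.
top_two = sort_sliced_dict({"fruity": 5, "nutty": 2, "creamy": 7, "sour": 1}, item_count=2)
print(top_two)
```

The count column is what later makes it possible to filter out bars or manufacturers with only a few reviews.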
Here we find out which chocolate bar got the best average rating among all of them.
For this, I first keep only those chocolate bars that have at least 10 records. Then I group the data by the bar_name column and calculate the mean of the rating column. After that, I sort the values in descending order and display the first 10 rows.
import plotly.express as px
import plotly.graph_objects as go

# keeping only those chocolates which have a count of at least 10
choco_data_with_count = count_df(choco_data, 'bar_name')
choco_data_mod = choco_data_with_count[choco_data_with_count['count'] >= 10]

# grouping chocolates according to bar_name and calculating mean
avg_rating_by_bar = choco_data_mod.groupby('bar_name')['rating'].mean()
avg_rating_by_bar_df = avg_rating_by_bar.rename_axis('bar_name').reset_index(name='rating')
avg_rating_by_bar_df_sorted = avg_rating_by_bar_df.sort_values(by='rating', ascending=False).head(10)

# plotting the results
fig = px.bar(avg_rating_by_bar_df_sorted, x='bar_name', y='rating', log_y=True,
             color_continuous_scale='viridis', color='rating')
fig.update_layout(title={'text': 'Most Popular Chocolate Bar'})
fig.show()
From the above visualization, we can easily tell that the chocolate named Kokoa Kamili has the highest average rating among all of them. Let’s see where this chocolate is mostly manufactured.
company_loc_list = list(choco_data[choco_data['bar_name'].isin(['Kokoa Kamili'])]['company_loc'])
company_loc_dict = {i: company_loc_list.count(i) for i in company_loc_list}

fig = px.pie(values=list(company_loc_dict.values()),
             names=list(company_loc_dict.keys()),
             title='Most Common Location where Kokoa Kamili is Manufactured',
             color_discrete_sequence=px.colors.sequential.Aggrnyl)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(uniformtext_minsize=12)
fig.show()
Hmm, it seems that this chocolate is mostly manufactured in the U.S.A. What about the taste of this chocolate? Let’s see.
If we look at the taste column, we see that the tastes of each chocolate are separated by commas.
So, we have to one-hot encode those values. This can be done with the str.get_dummies() method of a pandas Series.
taste_encode = choco_data[choco_data['bar_name'].isin(['Kokoa Kamili'])]['taste'].str.get_dummies(sep=', ')

# merging variant and misspelled tastes into a single column each
taste_encode['nuts'] = taste_encode['nut'] + taste_encode['nuts']
taste_encode['rich_cocoa'] = taste_encode['rich'] + taste_encode['rich cocoa'] + taste_encode['rich cooa']

# 'rich cocoa' is dropped as well, so it is not counted twice next to 'rich_cocoa'
taste_encode.drop(['nut', 'rich', 'rich cocoa', 'rich cooa'], axis=1, inplace=True)
While one-hot encoding the taste column, I noticed some inconsistent spellings: nut and nuts mean the same thing, and rich cooa is a misspelling of rich cocoa. So I added those columns to the correctly spelled ones.
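The encoding and merging steps can be sketched on a tiny example. The four taste strings below are invented; only the spelling variants mirror the ones mentioned above.

```python
import pandas as pd

# Toy taste descriptions, invented for illustration only.
tastes = pd.Series([
    "fruity, nut",
    "nuts, rich cocoa",
    "fruity, rich cooa",
    "fruity, sweet",
])

# One column per distinct taste, with 1 where the taste is present in a row.
encoded = tastes.str.get_dummies(sep=", ")

# Fold the variant spellings into the correctly spelled columns.
encoded["nuts"] = encoded["nut"] + encoded["nuts"]
encoded["rich cocoa"] = encoded["rich cocoa"] + encoded["rich cooa"]
encoded = encoded.drop(columns=["nut", "rich cooa"])

# Number of rows in which each taste appears.
taste_counts = encoded.sum().to_dict()
print(taste_counts)
```

Summing each column gives the number of bars described with that taste, which is exactly what the pie chart of tastes needs.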
Now we are ready for the visualization. At first, we have to make a dictionary containing every taste as key and their counts as values. After that, we plot that dictionary. The code is given below.
tastes = list(taste_encode.columns)
taste_dict = {}
for taste in tastes:
    taste_dict[taste] = sum(taste_encode[taste])
taste_dict = sort_sliced_dict(taste_dict, is_reverse=True, item_count=8)

fig = go.Figure(data=[go.Pie(labels=list(taste_dict.keys()),
                             values=list(taste_dict.values()),
                             pull=[0.1, 0, 0, 0])])
fig.update_traces(textinfo='percent+label', textposition='inside')
fig.update_layout(uniformtext_minsize=12, title={'text': "Most Memorable Taste"})
fig.show()
It seems that fruity is the taste reviewers mention most often for this chocolate. So, that's all I know about this chocolate. Now, let's see which company got the highest average rating among all of them.
In business, trust is a huge thing. We always want to buy products from a company we can trust. Let's see who is lucky here.
Here I follow the same procedure as for finding the chocolate bar with the best average rating, this time keeping manufacturers with more than 10 records.
choco_data_with_sec_count = count_df(choco_data, 'manufacturer')
choco_data_mod2 = choco_data_with_sec_count[choco_data_with_sec_count['count'] > 10]

avg_rating_by_company = choco_data_mod2.groupby('manufacturer')['rating'].mean()
avg_rating_by_company_df = avg_rating_by_company.rename_axis('Company').reset_index(name='Rating')
avg_rating_by_company_df_sorted = avg_rating_by_company_df.sort_values(by='Rating', ascending=False).head(10)

fig = px.bar(avg_rating_by_company_df_sorted, x='Company', y='Rating', log_y=True,
             color_continuous_scale='inferno', color='Rating')
fig.update_layout(title={'text': 'The Best Chocolate Manufacturer'})
fig.show()
It seems that the chocolate maker Soma got the best average rating among them. If you Google it, you will also see that their Google rating is decent.
Let’s see how much cocoa they used in their chocolates.
# filtering those chocolates which are manufactured by Soma Chocomaker
soma_choco_data = choco_data[choco_data['manufacturer'].isin(['Soma'])]

# counting how often each cocoa percentage occurs
cocoa_list = list(soma_choco_data['cocoa_percent'])
cocoa_percent_dict = {}
for cocoa_percent in cocoa_list:
    if str(cocoa_percent) in cocoa_percent_dict:
        cocoa_percent_dict[str(cocoa_percent)] += 1
    else:
        cocoa_percent_dict[str(cocoa_percent)] = 1
cocoa_percent_dict = sort_sliced_dict(cocoa_percent_dict, item_count=5)

fig = px.pie(names=list(cocoa_percent_dict.keys()),
             values=list(cocoa_percent_dict.values()),
             title='Mostly used Cocoa Percentage in chocolates made by Soma ChocoMaker',
             color_discrete_sequence=px.colors.sequential.Agsunset)
fig.show()
Notice the similarity? I followed the same counting procedure as before. The only difference is that here I also sort the dictionary using the sort_sliced_dict function. The plot is shown below.
It seems that Soma Chocomaker uses 70% cocoa in most of the chocolates. Now let’s see which cocoa beans are used in their chocolates.
# filtering those chocolates which are manufactured by Soma Chocomaker
soma_choco_data = choco_data[choco_data['manufacturer'].isin(['Soma'])]

# creating the dictionary of bean origin counts
bean_dict = {}
bean_origins = list(soma_choco_data['bean_origin'])
for origin in bean_origins:
    if origin in bean_dict:
        bean_dict[origin] += 1
    else:
        bean_dict[origin] = 1
bean_dict = sort_sliced_dict(bean_dict, is_reverse=True, item_count=5)

# plotting the graph
fig = px.bar(x=list(bean_dict.keys()), y=list(bean_dict.values()),
             text=list(bean_dict.values()),
             color_continuous_scale='inferno',
             color=list(bean_dict.keys()))
fig.update_traces(textposition='outside')
fig.update_layout(title={'text': 'Beans providers for Soma'},
                  xaxis={'title_text': 'Beans Origin'},
                  yaxis={'title_text': 'Count'})
fig.show()
You can see from the above visualization that there is a bean origin named Blend. This is not a country, of course; it most likely denotes that Soma makes many of its chocolates by blending cocoa beans from different origins.
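The counting loops used above for the cocoa percentages and the bean origins can also be written with collections.Counter from the standard library. This is an alternative sketch, not the approach used in this article, and the list of origins is made up.

```python
from collections import Counter

# Toy list of bean origins, invented for illustration only.
bean_origins = ["Blend", "Peru", "Blend", "Madagascar", "Peru", "Blend"]

# Counter tallies the values; most_common(n) returns the n largest counts,
# already sorted in descending order.
origin_counts = Counter(bean_origins)
top_two = dict(origin_counts.most_common(2))
print(top_two)
```

most_common(n) covers both the sorting and the slicing that sort_sliced_dict performs.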
Now, what about the chocolate taste? Let’s see.
# one-hot encoding the tastes of the Soma chocolates
# (this step was missing; it follows the same pattern as for Kokoa Kamili)
taste_codo = soma_choco_data['taste'].str.get_dummies(sep=', ')

# merging the variant spellings
taste_codo['nutty'] = taste_codo['nut'] + taste_codo['nuts'] + taste_codo['nutty']
taste_codo['woody'] = taste_codo['woodsy'] + taste_codo['woody']
taste_codo['earthy'] = taste_codo['earth'] + taste_codo['earthy']
taste_codo.drop(['nut', 'nuts', 'woodsy', 'earth'], axis=1, inplace=True)

# making the taste dictionary
tasty_dict = {}
tasty_list = list(taste_codo.columns)
for taste in tasty_list:
    tasty_dict[taste] = sum(taste_codo[taste])
tasty_dict = sort_sliced_dict(tasty_dict, is_reverse=True, item_count=7)

# plotting the graph
fig = px.bar(x=list(tasty_dict.keys()), y=list(tasty_dict.values()),
             color_continuous_scale='viridis',
             color=list(tasty_dict.keys()))
fig.update_layout(title={'text': 'Most Memorable Taste of Chocolates made by Soma Chocomaker'},
                  xaxis={'title_text': 'Tastes'},
                  yaxis={'title_text': 'Count'})
fig.show()
It seems that most of the chocolates manufactured by Soma Chocomaker have a creamy taste.
While researching the ingredients used in chocolates, I discovered that some people avoid chocolates containing lecithin, which some people are allergic to. Is that reflected in this dataset? Let's see.
A sample of values of the ingredients column is given below.
You can see from the above image that the values follow the format "ingredient count- ingredients separated by commas". So, we have to split them apart. You may need to inspect the values first, because they can contain inconsistencies. In my case, I found extra spaces, so I strip those spaces and then split the values.
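The format described above can be illustrated on a single value with plain Python. The helper name parse_ingredients and the sample strings are hypothetical and exist only for this sketch.

```python
def parse_ingredients(raw):
    """Split a value such as '4- B,S,C,L' into (count, list of codes)."""
    # Everything before the first '-' is the count, the rest are the codes.
    count_part, _, codes_part = raw.strip().partition("-")
    codes = [code.strip() for code in codes_part.split(",") if code.strip()]
    return int(count_part), codes

# Sample value with stray spaces, invented for illustration only.
num_ingredients, codes = parse_ingredients("  4- B,S,C,L ")
has_lecithin = "L" in codes
print(num_ingredients, codes, has_lecithin)
```

Checking membership in the list of codes, rather than searching the raw string, avoids confusing one code with another that merely contains the same letter.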
# split those ingredients values
choco_data['ingredients'] = choco_data['ingredients'].str.strip(' ')
choco_data['num_ingredients'] = choco_data['ingredients'].str.split('-', expand=True)[0]
choco_data['main_ingredients'] = choco_data['ingredients'].str.split('-', expand=True)[1]
choco_data['main_ingredients'] = choco_data['main_ingredients'].str.strip(' ')
ingre_encode = choco_data['main_ingredients'].str.get_dummies(sep=',')

# concatenating lecithin column with the main data.
# containing lecithin is denoted by 1 and otherwise 0
choco_data_lecithin = pd.concat([choco_data, ingre_encode['L']], axis=1)
choco_data_lecithin.rename(columns={'L': 'ingredient_L'}, inplace=True)

# split the chocolate data into two parts - chocolate containing lecithin
# and chocolate that doesn't contain lecithin
choco_has_lecithin = choco_data_lecithin[choco_data_lecithin['ingredient_L'] == 1]
choco_has_no_lecithin = choco_data_lecithin[choco_data_lecithin['ingredient_L'] == 0]

# calculating the average rating for both groups
rating_by_choco_has_lecithin = choco_has_lecithin['rating'].mean()
rating_by_choco_has_no_lecithin = choco_has_no_lecithin['rating'].mean()

# plotting the graph
fig = go.Figure()
fig.add_trace(number_indicator(val=rating_by_choco_has_lecithin,
                               title_text="Average Rating of Bars having Lecithin",
                               row_num=0, col_num=0))
fig.add_trace(number_indicator(val=rating_by_choco_has_no_lecithin,
                               title_text="Average Rating of Bars having no Lecithin",
                               row_num=0, col_num=1))
fig.update_layout(grid={'rows': 1, 'columns': 2, 'pattern': 'independent'})
fig.show()
Indeed, it is reflected in the data: the average rating of chocolates containing lecithin is lower than that of chocolates without it.
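The comparison above boils down to a group-wise mean. Here is a minimal sketch of the same idea with groupby on a toy frame; the ingredient strings and ratings are invented, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "main_ingredients": ["B,S,C", "B,S,C,L", "B,S", "B,S,C,V,L"],
    "rating": [3.5, 3.0, 3.75, 2.5],
})

# Label each bar by whether the code 'L' (lecithin) is among its ingredients.
toy["group"] = [
    "lecithin" if "L" in codes.split(",") else "no lecithin"
    for codes in toy["main_ingredients"]
]

# Average rating per group.
mean_by_group = toy.groupby("group")["rating"].mean().to_dict()
print(mean_by_group)
```

A difference between the two means shows an association in the data. It does not by itself show that lecithin is the reason for the lower ratings.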
So, that's all I got. I know that this article is pretty long, but I think it is worth reading. Now you can easily
But the analysis doesn't end here. If you find something interesting, let me know in the comments. If I got something wrong, I am always here to listen.
If you read this article on Python and Plotly carefully, you can see that I don't use the Plotly library in the usual way. I did some experiments with it. If you would like a guide to Plotly, let me know.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.