Exploratory Analysis Using SPSS, Power BI, R Studio, Excel & Orange

Aryan Last Updated : 28 Dec, 2020

10 min read

Introduction

There’s a lot of talk about Business Analytics in today’s corporate world. But why is it so? Is it just another trend that people will forget soon? Well, this article aims to talk about major business analytics tools and then substantiate the stance of the corporate world.

Be approximately right rather than exactly wrong.

As rightly said by John turkey, “Be approximately right rather than exactly wrong”. He was an American mathematician best known for the development of the Fast Fourier Transform algorithm and box plot.

This Quote helps in understanding the need for Analysis in today’s corporate world. The analysis is very important as it helps in understanding the future.

Let me explain this with a real-time case: –

Abstract: The dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years (Mashable’s primary focus is on technology, lifestyle, and entertainment news. Website: www.mashable.com)

Case: Please carry out an Exploratory Data Analysis and create a compelling story based on the given dataset; also predict which Article will be more popular in the near future.

I have done the Analysis using:

1. R Studio:

It is an integrated development environment for R, a programming language for statistical computing and graphics. It is open-source software (Click here to download).

2. SPSS:

SPSS Statistics is a software package used for interactive, or batched, statistical analysis; IBM acquired it in 2009. (Click here to download).

3. Power BI:

Power BI is a business analytics service by Microsoft. It aims to provide interactive visualizations and business intelligence capabilities with an interface simple enough for end-users to create their reports and dashboards. (Click here to download).

4. Excel:

Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android, and iOS. It features calculation, graphing tools, pivot tables, and a macro programming language called Visual Basic for Applications. (Click here to download).

5. Orange:

Orange is an open-source data visualization, machine learning, and data mining toolkit. It features a visual programming front-end for explorative rapid qualitative data analysis and interactive data visualization. (Click here to download).

Let’s get started with Exploratory Analysis:

Exploratory Analysis is an approach to analyze data sets to summarise their main characteristics, often with visual methods. A statistical model can be used, but primarily exploratory Analysis is done for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Exploratory Analysis is a vast term; it includes:

# Variable Identification

# Univariate Analysis

# Bi-variate Analysis

# Missing values treatment

# Outlier treatment

# Variable transformation

# Variable creation

1st STAGE:

VARIABLE IDENTIFICATION

The First Step in Variable Identification is to identify the target (output) variables.

The second step is to identify the type of data (Like — Categorical, Numerical, Ration, etc.).

Attribute Information:

1. url: URL of the article (non-predictive)

2. n_tokens_title: Number of words in the title

3. n_tokens_content: Number of words in the content

4. n_unique_tokens: Rate of unique words in the content

5. n_non_stop_words: Rate of non-stop words in the content

6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content

7. num_hrefs: Number of links

8. num_imgs: Number of images

9. num_videos: Number of videos

10. average_token_length: Average length of the words in the content

11. num_keywords: Number of keywords in the metadata

12. data_channel_is_lifestyle: Is data channel ‘Lifestyle’?

13. data_channel_is_entertainment: Is the data channel ‘Entertainment’?

14. data_channel_is_bus: Is the data channel ‘Business’?

15. data_channel_is_socmed: Is the data channel ‘Social Media’?

16. data_channel_is_tech: Is the data channel ‘Tech’?

17. data_channel_is_world: Is the data channel ‘World’?

18. weekday_is_monday: Was the article published on a Monday?

19. weekday_is_tuesday: Was the article published on a Tuesday?

20. weekday_is_wednesday: Was the article published on a Wednesday?

21. weekday_is_thursday: Was the article published on a Thursday?

22. weekday_is_friday: Was the article published on a Friday?

23. weekday_is_saturday: Was the article published on a Saturday?

24. weekday_is_sunday: Was the article published on a Sunday?

25. is_weekend: Was the article published on the weekend?

26. shares: Number of shares (TARGET)

So, now we are done with VARIABLE IDENTIFICATION and will proceed with UNI-VARIATE & BI-VARIATE ANALYSIS.

2nd STAGE:

Exploratory Analysis univariate and bivariate

A. UNI_VARIATE ANALYSIS:

Uni-Variate Analysis helps in exploring variables one by one. Method to perform uni-variate Analysis will depend on whether the variable type is categorical or continuous.

B. BI-VARIATE ANALYSIS:

Bi-variate Analysis helps in finding out the relationship between those variables. Bi-variate Analysis can be performed for any combination of categorical and continuous variables. The combination could be Categorical & Categorical, Categorical & Continuous, and Continuous & Continuous. Different methods are used to tackle these combinations during the analysis process.

I have done these analyses by visualizing the Data as:

A PICTURE ALWAYS REINFORCES THE CONCEPT.

To analyze, there should be a common base. For this, the Median Value was taken as a base value from the share’s column. And two new columns were added.

Exploratory Analysis concept reinforcement

Two new columns were named POPULARITY (one is the Categorical column and the other is the Numerical column).

All the shares, which have a value less than or equal to 1500, are considered as Un-Popular and vice versa.

In the above Visualisation, it can be observed that the articles published by Mashable in the past two-years were popular among people.

Exploratory Analysis day wise publication

In the above two visualizations (publishing according to the days & popularity according to the days), it can be observed that Mashable usually publish fewer articles on weekends as people don’t like to read more articles on weekends the reason can be any — Maybe people will get only Saturday and Sunday as holidays, and they might want to relax or travel instead of reading articles.

Now, the popularity has been compared with the different themes (Analysis is based on the Data of past two-years):

From the above Visualisation, it can be observed that Business Articles were less popular on the Mashable website.

From the above Visualisation, it can be observed that Entertainment Articles are were less popular on the Mashable website.

From the above Visualisation, it can be observed that Lifestyle Articles were more popular on the Mashable website.

social media articles Exploratory Analysis

From the above Visualisation, it can be observed that Social Media Articles were more popular on the Mashable website.

From the above Visualisation, it can be observed that Techy Articles were more popular on the Mashable website.

From the above Visualisation, it can be observed that World Articles were less popular on the Mashable website.

From the above Visualisation, it can be observed that the Articles having more images and videos were more popular as compared to the articles having fewer images and videos.

Content is also a major factor on which popularity depends

It can be in any form like — unique, short or long, etc.

So, to analyze CONTENT, we again need to have a common base.

Again, the median value was taken as a common base and marked as:

Content less than or equal to 400 words as popular and vice versa.

From the above visualization, it can be observed that the short articles were more popular than the long articles.

So, now we are done with VARIABLE IDENTIFICATION, UNI-VARIATE & BI-VARIATE ANALYSIS and will proceed for MISSING VALUE TREATMENT (if any).

3rd STAGE:

MISSING VALUES

Now, there is a need to find the missing values and rectify them (if any).

Missing data can be treated in 4 ways:

a. Deletion

b. Mean/Median/Mode

c. Predictive Modelling

d. KNN

The Data has been checked using the DATA TABLE TOOL (Orange Data Mining software).

And it can be seen that the Data does not have any MISSING VALUES. Hence, we can skip this STAGE.

So, now we are done with VARIABLE IDENTIFICATION, UNI-VARIATE & BI-VARIATE ANALYSIS, MISSING VALUE TREATMENT and will proceed for OUTLIER TREATMENT (if any).

4th STAGE:

OUTLIERS were checked using BOX-PLOT and SCATTER-PLOT (R Studio). It was observed that there were a few OUTLIERS. The OUTLIERS were rectified by deleting them.

And again, BOX — PLOT & SCATTER — PLOT was used to check the OUTLIERS. Now, we don’t have any OUTLIER.

So, now we are done with VARIABLE IDENTIFICATION, UNI-VARIATE & BI-VARIATE ANALYSIS, MISSING VALUE TREATMENT, OUTLIERS, and will proceed for VARIABLE TRANSFORMATION & VARIABLE CREATION.

5th STAGE:

The final stage is to predict which article will be more popular in the near future. But for now, let’s check for any one of the articles.

Chosen Article — Social Media

(Social Media Article was randomly chosen).

For this, four different models were selected:

1. Decision Tree:

A Decision Tree is used to comprehend & predict both numerical values and categorical value problems. But there is a drawback that it generally results in overfitting of the data/information. Yet, we can dodge the over fittings by utilizing a pre-pruning approach, for instance, creating a tree with fewer leaves and branches.

2. Stochastic Gradient Descent:

The Stochastic Gradient Descent minimizes a chosen loss function with a linear function. The algorithm approximates a true gradient by considering one sample at a time and simultaneously updates the model based on the gradient of the loss function. For regression, it returns predictors as minimizers of the sum, i.e., M-estimators, and is especially useful for large-scale and sparse datasets.

3. Random Forest:

Random Forest is a tree-based learning algorithm with the power to form accurate decisions as it many decision trees together. As its name says — it’s a forest of trees. Hence, Random Forest takes more training time than a single decision tree. Each branch and leaf within the decision tree works on the random features to predict the output. Then this algorithm combines all the predictions of individual decision trees to generate the final prediction, and it can also deal with the missing values.

4. Artificial Neural Network:

ANN is like our brain; millions and billions of cells — called neurons, which processes information in the form of electric signals. Similarly, in ANN, the network structure has an input layer, a hidden layer, and the output layer. It is also called Multi-Layer Perceptron as it has multiple layers. The hidden layer is known as a “distillation layer” that distills some critical patterns from the data/information and passes it onto the next layer. It then makes the network quicker and more productive by distinguishing the data from the data sources, leaving out the excess data.

It captures a non-linear relationship between the inputs.
It helps in converting the information/data into more useful insight.

From the above image, it can be observed that the Artificial Neural Network has the lowest Mean Square Error.

So, this value is taken for further calculations.

Further calculations were done using SPSS-

H0 = Un-popularity

H1 = Popularity

In the Test of Normality Table, it can be seen that the significance level is less than alpha, i.e., 0.05. Alpha is a universal value. So, the null hypothesis can be rejected, and it can be concluded that the Social media article will be more prevalent soon.

But, for a Deep Analysis. Independent Samples T-test is used:

In the second table, it can be observed that:

In Levene’s Test for Equality of Variances column, the significance Level is less than the alpha value, so, the hypothesis of Levene’s Test can be rejected.

Now, we will look at the T-test for equality of means and corresponding Confidence Interval.

In the T calculated value, we can see that equal variances not assumed has lesser value than equal variance assumed. Hence, there is a slight risk that social media articles will get unpopular in the near future.

From the confidence Interval column, we can see that there is a 0.058% (Average of Upper Bond C.I and Lower Bond C.I) chance that social media articles will get unpopular.

The chances are significantly less, but still, Mashable needs to take some precautions:

Below mentioned are some SUGGESTIONS to reduce the risk:

INCREASE

# Number of embedded links.

# Number of images.

# Number of videos.

# Amount of subjectivity in content.

# Number of more popular words.

DECREASE

# Number of longer words in the content.

# Amount of multi-topic discussion in an article as it can confuse.

# Number of negative words as it can lead to controversies.

From the above case, it can be seen that there is a need for Analysis in today’s corporate world. And it is not a trend that will be vanished in some time; in fact, it is a feeling which helps in growing and developing the corporate world.

“The greatest power of analysis is when it forces you to notice what you never expected and helps you growing.”

Personally, I enjoyed writing this article and would love to learn from your feedback.

Contacts

In case you have any questions or any suggestions on what my next article should be about, please leave a comment below or mail me at [email protected].

If you want to keep updated with my latest articles and projects, follow me on Medium.

Connect with me via:

LinkedIn

Instagram

Aryan

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Exploratory Analysis Using SPSS, Power BI, R Studio, Excel & Orange

Introduction

1. R Studio:

2. SPSS:

3. Power BI:

4. Excel:

5. Orange:

Attribute Information:

A. UNI_VARIATE ANALYSIS:

B. BI-VARIATE ANALYSIS:

A PICTURE ALWAYS REINFORCES THE CONCEPT.

MISSING VALUES

1. Decision Tree:

2. Stochastic Gradient Descent:

3. Random Forest:

4. Artificial Neural Network:

DECREASE

Contacts

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk