Correlation Analysis Using R

vipin.shrivastava Last Updated : 27 Jan, 2021


This article was published as a part of the Data Science Blogathon.

Introduction

Can you tell how the price of gold will change if the stock market goes up, or how gold prices are associated with the stock market? Correlation can help answer such questions. It is one of the most common measures of association between two variables and one of the most widely used tools in analytics.

What is Correlation
Practical application using R
Conclusion

What is Correlation?

Correlation is a statistical measure of the relationship between two variables, that is, of how the two variables are linked with each other. It describes how one variable tends to change as the other changes; it does not by itself show that one variable causes the change in the other.

If the two variables increase or decrease together, they have a positive correlation. If one variable increases while the other decreases, they have a negative correlation. If a change in one variable is not accompanied by any systematic linear change in the other, they have zero correlation.

The Pearson correlation coefficient measures the degree of the linear relationship between two variables. It is represented by 𝝆 and calculated as:

𝜌 (𝑥, 𝑦) = 𝑐𝑜𝑣(𝑥, 𝑦) /(𝜎𝑥 × 𝜎𝑦 )

Where

𝑐𝑜𝑣(𝑥, 𝑦) = covariance of x and y

𝜎𝑥 = Standard deviation of x

𝜎𝑦 = Standard deviation of y

𝜌 (𝑥, 𝑦) = correlation between x and y
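As a minimal sketch, the formula above can be checked in R by building the coefficient from cov() and sd() and comparing it with the built-in cor(). The names x, y, rho_manual, and rho_builtin are illustrative choices, and the two swiss columns are the ones used later in this article. Both cov() and sd() use the sample (n - 1) denominator, which cancels in the ratio.

```r
# Two numeric variables from the built-in swiss dataset
x <- swiss$Fertility
y <- swiss$Infant.Mortality

# Correlation built directly from the definition: cov(x, y) / (sd(x) * sd(y))
rho_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in Pearson correlation for comparison
rho_builtin <- cor(x, y)

print(rho_manual)
print(rho_builtin)

# TRUE if the two values agree up to floating-point error
print(isTRUE(all.equal(rho_manual, rho_builtin)))
```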

The value of 𝜌 (𝑥, 𝑦) ranges from -1 to +1.

A positive value lies between 0 and 1, where 𝜌 (𝑥, 𝑦) = 1 indicates a perfect positive linear correlation between the variables.

A negative value lies between -1 and 0, where 𝜌 (𝑥, 𝑦) = -1 indicates a perfect negative linear correlation between the variables.

A value of 𝜌 (𝑥, 𝑦) = 0 indicates no linear correlation between the variables.
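To make the three cases concrete, here is a small sketch on made-up data (the vectors x, y_pos, y_neg, and y_zero are hypothetical and chosen only for illustration). The last case shows that a coefficient of zero rules out a linear relationship, not every kind of relationship.

```r
x <- -2:2

y_pos  <- 2 * x + 1   # increases exactly in step with x
y_neg  <- -3 * x      # decreases exactly as x increases
y_zero <- x^2         # depends on x, but not linearly

r_pos  <- cor(x, y_pos)    # perfect positive linear relationship
r_neg  <- cor(x, y_neg)    # perfect negative linear relationship
r_zero <- cor(x, y_zero)   # no linear relationship, although y is a function of x

print(c(r_pos = r_pos, r_neg = r_neg, r_zero = r_zero))
```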

Practical application of correlation using R

Determining the association between Fertility and Infant Mortality Rate (using the built-in dataset “swiss”)

Below is the code to compute the correlation:

1. Loading the dataset

> data1<-swiss
> head(data1, 4)

             Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary        80.2        17.0          15        12     9.96             22.2
Delemont          83.1        45.1           6         9    84.84             22.2
Franches-Mnt      92.5        39.7           5         5    93.40             20.2
Moutier           85.8        36.5          12         7    33.77             20.3

2. Creating a scatter plot using ggplot2 library

> library(ggplot2)
> ggplot(data1, aes(x = Fertility, y = Infant.Mortality)) + geom_point() +
+   geom_smooth(method = "lm", se = TRUE, color = "black")

3. Testing the assumptions (Linearity and Normality)

Linearity: Judged visually from the scatter plot (here the relationship looks roughly linear)

Normality: Using the Shapiro-Wilk test (a test of normality; here we check whether each variable is normally distributed)

> shapiro.test(data1$Fertility)

	Shapiro-Wilk normality test

data:  data1$Fertility
W = 0.97307, p-value = 0.3449

> shapiro.test(data1$Infant.Mortality)

	Shapiro-Wilk normality test

data:  data1$Infant.Mortality
W = 0.97762, p-value = 0.4978

Both p-values are greater than 0.05, so we fail to reject the null hypothesis of normality and can treat both variables as approximately normally distributed.
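The two Shapiro-Wilk calls can also be run in one step. The sketch below assumes the same two swiss columns and collects the p-values with sapply(); the names vars and p_values are illustrative, and the 0.05 cutoff simply mirrors the one used in the text.

```r
# Columns whose normality we want to check
vars <- swiss[, c("Fertility", "Infant.Mortality")]

# Run the Shapiro-Wilk test on each column and keep only the p-value
p_values <- sapply(vars, function(v) shapiro.test(v)$p.value)
print(round(p_values, 4))

# TRUE where we fail to reject normality at the 5% level
print(p_values > 0.05)
```

A Q-Q plot drawn with qqnorm() and qqline() is a common visual complement to this test.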

4. Correlation Coefficient

> cor(data1$Fertility,data1$Infant.Mortality)
[1] 0.416556

5. Checking for the significance

> Tes <- cor.test(swiss$Fertility, swiss$Infant.Mortality, method = "pearson")
> Tes

	Pearson's product-moment correlation

data:  swiss$Fertility and swiss$Infant.Mortality
t = 3.0737, df = 45, p-value = 0.003585
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1469699 0.6285366
sample estimates:
     cor 
0.416556

Since the p-value (0.003585) is less than 0.05, we can conclude that Fertility and Infant Mortality are significantly correlated, with a correlation coefficient of about 0.42.
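For readers curious where the t statistic and p-value come from, the sketch below recomputes them from the correlation coefficient using the standard formula t = r × sqrt(n - 2) / sqrt(1 - r²) with n - 2 degrees of freedom. The names res, r, n, t_manual, and p_manual are illustrative and not part of the original code.

```r
res <- cor.test(swiss$Fertility, swiss$Infant.Mortality, method = "pearson")

r <- unname(res$estimate)   # sample correlation coefficient
n <- nrow(swiss)            # number of observations (47 rows, so df = 45)

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

print(c(t_manual = t_manual, t_reported = unname(res$statistic)))
print(c(p_manual = p_manual, p_reported = res$p.value))
print(res$conf.int)         # 95% confidence interval for the correlation
```

The individual pieces of the result (estimate, statistic, p.value, conf.int) can be read from the returned object in the same way whenever they are needed for reporting.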

Conclusion

As we can see, there is a positive correlation between fertility and infant mortality rate. The point to note is that correlation is only a measure of association; it does not establish that one variable causes the other. It tells us the degree of association and whether the relationship is direct (positive) or inverse (negative).

Here we discussed only the Pearson correlation. There are other types as well, such as Kendall, Spearman, and point-biserial.
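As a brief sketch of two of these alternatives, the method argument of cor() switches between Pearson, Spearman, and Kendall coefficients on the same data. The variable names below are illustrative, and point-biserial correlation is not shown.

```r
x <- swiss$Fertility
y <- swiss$Infant.Mortality

pearson  <- cor(x, y, method = "pearson")   # linear association
spearman <- cor(x, y, method = "spearman")  # rank-based, monotonic association
kendall  <- cor(x, y, method = "kendall")   # rank-based, concordant vs discordant pairs

print(round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3))
```

The rank-based coefficients do not rely on the normality assumption tested earlier, which is one reason they are often preferred when that assumption is doubtful.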

Linearity is a property where the relationship between the variables can be graphically represented as a straight line.

Normality refers to the normal distribution (bell-shaped curve) of the data.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
