Get Started with Statistics for Data Science

Akil Last Updated : 22 Jul, 2021

10 min read

This article was published as a part of the Data Science Blogathon

What is Statistics?

The Oxford dictionary defines statistics as ‘the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.’

Ben Baumer, co-author of the book ‘Modern Data Science with R’, defines it as “the science of modeling the randomness present in real-world observations.”

In my own words, statistics is the study of collecting data and extracting information from it in order to make inferences and decisions. Results obtained from a sample are generalized to the whole population.

Types:

There are 2 broad categories of statistics:

1) Descriptive statistics

2) Inferential statistics

Descriptive statistics:

These represent the numbers that describe the entire data set. These numbers are like a summary of all the observations in the data. For instance, the average value of a data field is a descriptive statistic.

Descriptive statistics are classified into 5 major types:

i) Measure of frequency

This indicates the count-related attributes of the data, like the frequency of occurrence, count, and percent values.
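As a small illustration of these frequency measures, the sketch below counts how often each value occurs and converts the counts to percentages. The `grades` list is made-up data, used only for this example.

```python
from collections import Counter

# Hypothetical categorical data, made up for illustration
grades = ["A", "B", "A", "C", "B", "A", "D", "A"]

counts = Counter(grades)          # frequency of occurrence of each value
total = sum(counts.values())      # total number of observations

# Print each value with its count and its share of the total
for value, count in counts.most_common():
    percent = 100 * count / total
    print(f"{value}: count={count}, percent={percent:.1f}%")
```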


ii) Measure of central tendency:

This indicates the central aspect of the data. Mean, median, and mode are the measures that describe the central tendency of the data.

Input_data = {19,19,23,39,45,48,48,48,67}

Mean (average) : sum of all values/count of all values = 356/9=39.56

Median : middle value in the sorted data = 45

Mode : most occurring value = 48
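The same three values can be checked with Python's built-in statistics module. This is only a quick sketch using the Input_data values from above.

```python
import statistics

input_data = [19, 19, 23, 39, 45, 48, 48, 48, 67]

mean = statistics.mean(input_data)      # sum of values / count of values
median = statistics.median(input_data)  # middle value of the sorted data
mode = statistics.mode(input_data)      # most frequently occurring value

print(round(mean, 2), median, mode)
```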

Data might not always be symmetrical around the center. They may be populated more to the left or right.

Left skewed data:

More data towards the right end.

Eg – age-wise histogram plot of pension scheme account holders

In the case of left skew, Mean < Median


Right skewed data:

More data towards the left end (lower values in magnitude).

Eg – age-wise histogram plot of gaming account holders

In the case of right skew, Mean > Median
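The relation between the mean and the median in skewed data can be checked on small samples. The two age lists below are hypothetical values, chosen only to mimic the pension and gaming examples.

```python
import statistics

# Hypothetical ages: most values are high, with a few low ones (left skew)
pension_ages = [25, 40, 55, 60, 62, 63, 64, 65, 65, 66]
# Hypothetical ages: most values are low, with a few high ones (right skew)
gaming_ages = [18, 19, 19, 20, 21, 22, 24, 30, 45, 60]

for name, ages in [("left skew", pension_ages), ("right skew", gaming_ages)]:
    # The mean is pulled towards the long tail, the median is not
    print(name, "mean =", statistics.mean(ages), "median =", statistics.median(ages))
```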


iii) Measure of dispersion:

This indicates how spread out the data is. Standard deviation and variance are useful values that explain the spread of the data w.r.t. the mean.

Fig. This is a frequency plot called a histogram. The age values are plotted against the count of occurrences.

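Variance is the average of the squared differences between each data point and the mean, and standard deviation is the square root of the variance. Standard deviation can be read as the typical distance of a data point from the mean, expressed in the same unit as the data.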
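A minimal sketch of the computation, step by step, is given here. It uses the sample form (dividing by n − 1), which is what Python's statistics.stdev uses, and the four ages from the sample code later in this section. The plain average of absolute differences is printed as well, to show that it is a different number.

```python
import math
import statistics

ages = [20, 18, 28, 20]

mean = sum(ages) / len(ages)
squared_devs = [(x - mean) ** 2 for x in ages]   # squared distance from the mean
variance = sum(squared_devs) / (len(ages) - 1)   # sample variance (n - 1)
std = math.sqrt(variance)                        # back to the unit of the data

# For comparison only: the average absolute difference from the mean
mean_abs_dev = sum(abs(x - mean) for x in ages) / len(ages)

print(variance, std, mean_abs_dev)
```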

iv) Measure of position:

This is helpful to describe the values in relation to one another. Percentile ranks and quartiles give insight into the position of a value w.r.t. the other values. Unlike standard deviation, these are not much affected by the extreme values in the data.

Note – the quartiles are not evenly separated, because the values are not evenly spread
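The sketch below computes the quartiles of the Input_data values used earlier and repeats the calculation after replacing the largest value with a hypothetical extreme one. Quartile conventions differ slightly between libraries; this uses the default method of Python's statistics.quantiles.

```python
import statistics

input_data = [19, 19, 23, 39, 45, 48, 48, 48, 67]
# Same data, but the largest value is replaced by a made-up extreme value
with_outlier = input_data[:-1] + [670]

q1, q2, q3 = statistics.quantiles(input_data, n=4)  # the three quartile cut points
print(q1, q2, q3)
print(q2 - q1, q3 - q2)  # the gaps between quartiles are not equal

# The quartiles stay the same with the extreme value, while the mean shifts a lot
print(statistics.quantiles(with_outlier, n=4))
print(statistics.mean(input_data), statistics.mean(with_outlier))
```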

In real scenarios, the data might not be evenly distributed, i.e., it can be skewed. There can be more values towards the left end (lower values) or towards the right end (higher values).

The main characteristic of skewed data is that the mean and the median are not the same, because the mean is pulled towards the long tail.

v) Measures of association:

It is important to identify the relationship between variables.

Covariance

Covariance is a measure of how two variables vary together. The sign of the result indicates the type of relation between the two variables. A positive result indicates that they tend to increase together, while a negative result indicates that one tends to decrease as the other increases. The value of the covariance changes when the units of the variables change, for example the unit of the area of a house (sqft, etc.) or the currency of its price (INR, USD, etc.). Covariance values can range from -infinity to +infinity.

The covariance of two variables X and Y is the average of the products of each observation's deviations from the respective means: cov(X, Y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1), where x̄ and ȳ are the means of X and Y and n is the number of observations. Dividing by n − 1 gives the sample covariance; dividing by n gives the population covariance.
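A minimal sketch of the sample covariance, computed by hand, is shown below. It uses the Age and Mark values from the sample code later in this section and divides by n − 1, which is also the pandas default.

```python
ages = [20, 18, 28, 20]
marks = [90, 85, 95, 89]

n = len(ages)
mean_age = sum(ages) / n
mean_mark = sum(marks) / n

# Multiply the two deviations from the mean for each observation, then average
products = [(a - mean_age) * (m - mean_mark) for a, m in zip(ages, marks)]
cov = sum(products) / (n - 1)

print(round(cov, 6))
```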

Correlation

Correlation is a measure of how strongly two variables are related to each other. Correlation is not affected by the scale of the variables. It just mentions the strength of the relation between the variables.

While covariance only identifies if it is a positive or a negative relationship, correlation establishes both the direction (positive or negative) and strength of the relation between the two variables. Covariance cannot help to identify the strength of the relationship, as its values are affected by the unit of the two variables.

The Pearson correlation coefficient ranges from -1 to +1. Values near +/-1 indicate a strong relationship, while values near 0 indicate a weak relationship.

-1 – indicates a perfect negative relationship; values close to it mean that when one variable increases, the other decreases, like marks and rank.

0 – indicates no linear relation between them, or in other words, they are uncorrelated, like wind speed and student marks.

+1 – indicates a perfect positive relationship; values close to it mean that when one variable increases, the other also increases, like the sqft area of a house and the house price.

Generally, correlation is preferred over covariance, as it is not affected by the units of the two variables and it conveys the strength of the relationship.
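The effect of units can be sketched with a few lines of Python. The house areas and prices below are hypothetical numbers, and the helper functions covariance and correlation are written here only for illustration. Converting the area from sqft to square metres changes the covariance but leaves the correlation as it was.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical house data, made up for illustration
area_sqft = [800, 1000, 1200, 1500, 2000]
price_usd = [100_000, 130_000, 150_000, 200_000, 260_000]
area_sqm = [a * 0.092903 for a in area_sqft]  # same areas in square metres

print(covariance(area_sqft, price_usd), covariance(area_sqm, price_usd))
print(correlation(area_sqft, price_usd), correlation(area_sqm, price_usd))
```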

Some sample Python code to calculate the aforementioned values:

import statistics
import pandas as pd

data = [['fred', 20, 90], ['ron', 18, 85], ['bill', 28, 95], ['george', 20, 89]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age', 'Mark'])
# Covariance and correlation are defined only for numeric columns,
# so the text column 'Name' is left out
numeric = df[['Age', 'Mark']]
print('Count', len(df))
print('Mean', statistics.mean(df['Age']))
print('Median', statistics.median(df['Age']))
print('Mode', statistics.mode(df['Age']))
print('Std', statistics.stdev(df['Age']))
print('Variance', statistics.variance(df['Age']))
print('Covariance', numeric.cov())
print('Correlation', numeric.corr())

Output:

Count 4
Mean 21.5
Median 20.0
Mode 20
Std 4.43471156521669
Variance 19.666666666666668
Covariance             Age       Mark
Age   19.666667  17.166667
Mark  17.166667  16.916667
Correlation            Age      Mark
Age   1.000000  0.941159
Mark  0.941159  1.000000

Distributions

Distributions play a key role in understanding the data. Distributions are the frequency plot of all values for a variable in the data. The distributions are used to calculate the probability of values, based on the spread of the data. The function describing the probability of the values is called the probability density/mass function.

Based on the properties of the data and the spread, there are different types of distributions. Every distribution has a density function for finding probabilities.

Key differentiating points:

1) Data – discrete data (whole numbers with limited values, like months or gender) or continuous data (numbers that can take any value, like temperature or height)

2) Skewness – the distribution can be symmetric or skewed

3) Boundaries – the values can have strict lower or upper boundaries (like a mark cannot exceed 100)

4) Outliers or tail values – the extreme values of the data (usually outliers) can be rare or common

5) Number of modes – mode is the most frequently occurring value in the data. In case there is more than one most recurring value, then it might mean that there is more than one population in the data. The populations have to be separated and dealt with.

Density functions:

i) Probability mass function(PMF):

PMF also known as the discrete density function, is a density function for discrete data. This density function is used to calculate the probability of a certain value of a discrete random variable.

P(X=x) = f(x)

X = random variable

x = one possible value of variable X

The probability value is always a non-negative number. The sum of the probability of all possible values of X will equal 1.

Example:

A fair coin is tossed twice, and we want the probability that the result has exactly one head. The possible outcomes of the two tosses are {HH, HT, TH, TT}.

Let X be the random variable that counts the heads in the two tosses.

X can take the values {0, 1, 2}.

We need P(X=1) = P({HT,TH}) = number of outcomes in {HT,TH} / total number of outcomes = 2/4 = 0.5

Also, the sum of probability of all values of X will equal 1:

P(X=0) + P(X=1) + P(X=2)

=cnt({TT})/ cnt({HH, HT, TH, TT}) + cnt({HT,TH})/cnt({HH, HT, TH, TT}) + cnt({HH})/cnt({HH, HT, TH, TT})

= 1/4 + 2/4 + 1/4 = 1
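The coin example can be reproduced by listing all outcomes and counting heads. This is a small sketch that uses exact fractions, so the probabilities add up to exactly 1.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# X = number of heads in each outcome
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# PMF: P(X = x) = outcomes with x heads / total outcomes
pmf = {x: Fraction(count, len(outcomes)) for x, count in head_counts.items()}

for x in sorted(pmf):
    print(f"P(X={x}) = {pmf[x]}")
print("Sum =", sum(pmf.values()))
```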

ii) Probability Density Function(PDF):

For continuous variables, p(X=x) = 0, for all x∈R

P(weight = 76.34kg)=0, as it can be argued that weight 76.3!=76.34 and 76.341!=76.34

Hence it will be more appropriate to use P(weight between 76.3 to 76.4) for continuous variables. So instead of calculating the probability of the variable being a particular single value, the probability is calculated for the variable to be within a range of values.

The probability is the area under the density function f(x) between the limit values a and b, which is calculated by integration: P(a ≤ X ≤ b) = ∫ f(x) dx, taken from x = a to x = b.
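To make the idea of area concrete, the sketch below approximates the integral numerically with thin rectangles and compares it with the difference of two CDF values. The normal distribution with mean 70 and standard deviation 10 is an assumption made only for this example; the article does not state how weight is distributed.

```python
from statistics import NormalDist

# Assumed for illustration: weight ~ Normal(mean=70 kg, sd=10 kg)
weight = NormalDist(mu=70, sigma=10)

a, b = 76.3, 76.4
steps = 1000
width = (b - a) / steps

# Midpoint rule: sum the areas of thin rectangles under the density curve
area = sum(weight.pdf(a + (i + 0.5) * width) for i in range(steps)) * width

# The same probability obtained from the cumulative distribution function
exact = weight.cdf(b) - weight.cdf(a)

print(area, exact)
```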

iii) Cumulative distribution function(CDF):

The cumulative distribution function calculates the probability that a random variable X takes a value less than or equal to a particular value x. As this does not calculate the probability of a single value, this function is applicable to both discrete and continuous variables.

F(x) = P(X≤x), for all x∈R

For a continuous variable with density function f, this can also be written as F(x) = ∫ f(t) dt, taken from t = −∞ to t = x.

Example:

When rolling a fair die, every outcome has an equal probability of 1/6.

For the PMF, P(X=x) = 1/6 for every x in {1, 2, 3, 4, 5, 6}.

In the case of the CDF, P(X<=x) adds up the probabilities of x and all values of X below it: P(X<=1)=1/6, P(X<=2)=2/6, P(X<=3)=3/6, etc.
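The die example can be written as a running total of the PMF values. This is a small sketch with exact fractions.

```python
from fractions import Fraction
from itertools import accumulate

faces = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * len(faces)   # P(X = x) for each face
cdf = list(accumulate(pmf))           # P(X <= x) as a running total

for face, p, c in zip(faces, pmf, cdf):
    print(f"x={face}: P(X=x)={p}, P(X<=x)={c}")
```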

Python library to run these functions:

from scipy.stats import norm

# PDF and CDF of the standard normal distribution, evaluated at x = 0.01
print(norm.pdf(0.01), norm.cdf(0.01))

Types of distributions:

Some of the common distributions are explained here.

1) Uniform distribution:

The uniform distribution indicates that all of a variable’s outcomes are equally probable or that the probability is uniformly distributed.

Fig. rolling fair die results in uniform distribution
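A quick simulation shows the uniform shape. This is only a sketch: the seed and the number of rolls are arbitrary choices, and the observed shares will be close to 1/6 but not exactly equal to it.

```python
import random
from collections import Counter

rng = random.Random(42)   # fixed seed (arbitrary), so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(60_000)]
counts = Counter(rolls)

# Each face should appear in roughly one sixth of the rolls
for face in range(1, 7):
    print(face, round(counts[face] / len(rolls), 3))
```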

2) Binomial distribution:

This is a distribution well suited for discrete data.

A Bernoulli trial is an experiment that can have exactly two possible outcomes – success and failure. These outcomes are mutually exclusive.

The binomial distribution describes the number of successes in a fixed number of repeated Bernoulli trials, where the probability of success stays the same for every trial and the trials are independent of each other.

For example, consider a card is drawn from a deck of playing cards. To keep the probabilities the same, during the next draw, the drawn card is placed back in the deck.

The PMF, P(X = r) = ⁿC_r · p^r · (1 − p)^(n − r)

n – total number of trials

p – probability of success in a single Bernoulli trial

r – number of successful trials for which we are calculating the probability

ⁿC_r – number of combinations (formula = n! / (r!(n − r)!))

Example – If there are 4 draws from a deck of playing cards with replacement, what is the probability of getting exactly 2 hearts?

p=1/4 (13 hearts in a deck of 52 cards)

n=4

r=2

On substitution of the variables, P(X = 2) = ⁴C_2 · (1/4)^2 · (3/4)^2 = 6 · 9/256 ≈ 0.21
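The substitution can be checked with a few lines of Python. The function name binomial_pmf is made up for this sketch; math.comb gives the number of combinations.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# Exactly 2 hearts in 4 draws with replacement, p = 1/4
prob = binomial_pmf(r=2, n=4, p=0.25)
print(round(prob, 2))

# The probabilities of all possible counts of hearts (0 to 4) add up to 1
total = sum(binomial_pmf(r, 4, 0.25) for r in range(5))
print(total)
```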

3) Gaussian distribution:

Also known as the normal distribution, this is the most commonly used distribution, and it is often observed in real data. It has a bell-shaped curve and is symmetrical around the mean value. The bell shape arises because the data values near the mean are more frequent than the ones far away from the mean.

The normal distribution is defined for continuous data, and it is also often used as an approximation for discrete data.

Example of a Gaussian distribution data – marks scored by students in an exam. More students would have secured average marks, while there will be a few exceptional performers and a few underperformers.

Sample representation:

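Formula for PDF: f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)), where μ is the mean and σ is the standard deviation.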

Standard normal distribution (z-distribution):

When the normal distribution has a mean=0 and standard deviation=1, then it is called the standard normal distribution.

Here the formula gets simplified on substitution too: f(z) = (1 / √(2π)) · e^(−z² / 2)

When the normal distribution data is converted to a standard normal distribution, it aids in comparing different datasets with different means and standard deviations.

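The z-score of a data point x is z = (x − μ) / σ, i.e., the number of standard deviations by which x lies away from the mean. Values below and above the mean are distinguished by the sign: they have negative and positive z-scores respectively.

The data points on the distribution are no longer the actual data points x, but the z-scores of the actual data points.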

This Z-score aids in decision-making processes, especially in the study of outliers in the data.
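A minimal sketch of standardization follows, using z = (x − mean) / standard deviation on the four ages from the earlier sample code. After the conversion the values have a mean of 0 and a standard deviation of 1, and the sign shows whether a value lies below or above the mean.

```python
import statistics

ages = [20, 18, 28, 20]

mu = statistics.mean(ages)
sigma = statistics.stdev(ages)   # sample standard deviation

# z-score: how many standard deviations each value lies from the mean
z_scores = [(x - mu) / sigma for x in ages]

for x, z in zip(ages, z_scores):
    print(x, round(z, 2))
```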

4) Poisson distribution:

This is again a distribution suitable for discrete data. This distribution helps us to get the probability for a given number of events to occur within a fixed period of time.

Example – number of customers visiting a shop to purchase product A every week, which can help the shopkeeper to stock product A accordingly.

PMF is given by, P(X = k) = (λ^k · e^(−λ)) / k!

λ – average event occurrence (say 50 customers per week)

k – number of events for which probability is calculated (number of customers for which probability is calculated)
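The shop example can be sketched with the standard Poisson PMF, P(X = k) = λ^k · e^(−λ) / k!, using the average of 50 customers per week mentioned above. The function name poisson_pmf is made up for this sketch, and the direct formula is only suitable for moderate values of λ and k, because the powers and factorials grow very quickly.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events occur on average."""
    return lam**k * exp(-lam) / factorial(k)

lam = 50  # average number of customers per week

# Probability of exactly 40, 50 and 60 customers in a week
for k in (40, 50, 60):
    print(k, round(poisson_pmf(k, lam), 4))

# Probability of at most 60 customers, useful for deciding how much to stock
at_most_60 = sum(poisson_pmf(k, lam) for k in range(61))
print(round(at_most_60, 4))
```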

Inferential statistics:

Inferential statistics are generally used to infer details from the sample data, to make decisions for the actual population data.

Here, population refers to the entire set of real data, and sample refers to the much smaller portion of that data to which we have access for analysis.

This is a very broad area, and we cannot do it justice in this blog.

Endnote:

This content covers the basics of Statistics, mainly elaborating the Descriptive type and distributions.

Real learning comes from applying this knowledge to solve problems on actual data.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Akil

I am a Big data and cloud professional, with more than a decade of experience in data projects.
If interested, please check out my other articles at https://www.analyticsvidhya.com/blog/author/akilaram/
