A Guide To Complete Statistics For Data Science Beginners!

Harika Last Updated : 12 Nov, 2024


This article was published as a part of the Data Science Blogathon

Introduction:

In this article, we will learn all the important statistical concepts which are required for Data Science roles.

1. Difference between Parameter and Statistic

In day-to-day work, we keep talking about the population and the sample. So, it is important to know the terminology used to describe each of them.

A parameter is a number that describes the data from the population. And, a statistic is a number that describes the data from a sample.
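To make the distinction concrete, here is a minimal sketch using only Python's standard library. The population of heights is simulated and purely hypothetical; the names `population`, `mu`, and `x_bar` are chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (in cm) of 10,000 people
population = [random.gauss(165, 10) for _ in range(10_000)]

# Parameter: a number computed from the whole population
mu = statistics.fmean(population)

# Statistic: the same kind of number, computed from a sample of 100 people
sample = random.sample(population, 100)
x_bar = statistics.fmean(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

The two numbers are close but not identical, which is the reason the two are given different names and symbols.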

2. Statistics and its types

The Wikipedia definition of Statistics states that “it is a discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.”

It means, as part of statistical analysis, we collect, organize, and draw meaningful insights from the data either through visualizations or mathematical explanations.

Statistics is broadly categorized into two types:

Descriptive Statistics
Inferential Statistics

Descriptive Statistics:

As the name suggests, in descriptive statistics we describe the data using the mean, standard deviation, charts, or probability distributions.

Basically, as part of descriptive Statistics, we measure the following:

Frequency: the number of times a data point occurs
Central tendency: the centrality of the data – mean, median, and mode
Dispersion: the spread of the data – range, variance, and standard deviation
Position: percentiles and quantile ranks
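As a small illustration of frequency and position, the sketch below counts occurrences and computes a percentile with the standard library. The `scores` list is made up, and the `"inclusive"` quantile method is just one of several conventions.

```python
from collections import Counter
import statistics

scores = [55, 60, 60, 65, 70, 70, 70, 80, 85, 90]  # hypothetical test scores

# Frequency: how many times each value occurs
frequency = Counter(scores)
print(frequency[70])

# Position: 99 cut points that split the data into 100 equal parts
percentiles = statistics.quantiles(scores, n=100, method="inclusive")
p90 = percentiles[89]  # the 90th percentile
print(p90)
```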

Inferential Statistics:

In inferential statistics, we estimate the population parameters from a sample, or we run hypothesis tests to assess the assumptions made about the population parameters.

In simple terms, we take the descriptive statistics of a sample and generalize them to the population.

For example, suppose we are conducting a survey on the number of two-wheeler owners in a city. Assume the city has a total population of 5 lakh (500,000) people. We take a sample of 1000 people because it is impractical to collect and analyze data for the entire population.

From the survey, it is found that 800 people out of 1000 (80%) own a two-wheeler. So, we can generalize this result to the population and estimate that about 4 lakh (400,000) of the 5 lakh people own a two-wheeler.
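The arithmetic of this example can be sketched as follows. The margin of error at the end is an addition that is not part of the survey example; it assumes a simple random sample and uses the usual normal approximation for a proportion.

```python
import math

population_size = 500_000   # 5 lakh people in the city
sample_size = 1_000
owners_in_sample = 800

# Sample proportion (a statistic)
p_hat = owners_in_sample / sample_size

# Point estimate of the number of owners in the whole city
estimated_owners = p_hat * population_size
print(f"estimated owners: {estimated_owners:,.0f}")

# Assumption: approximate 95% margin of error for the proportion
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
print(f"proportion: {p_hat:.2f} +/- {margin:.3f}")
```

The margin is a reminder that 80% is an estimate from one sample, not the exact population value.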

3. Data Types and Level of Measurement

At a higher level, data is categorized into two types: Qualitative and Quantitative.

Qualitative data is non-numerical. Some of the examples are eye colour, car brand, city, etc.

On the other hand, Quantitative data is numerical, and it is again divided into Continuous and Discrete data.

Continuous data: It can take any value within a range, including fractional values. Examples are height, weight, time, distance, etc.

Discrete data: It can take only distinct, separate values. Examples are the number of laptops and the number of students in a class.

Discrete data is again divided into Categorical and Count Data.

Categorical data: It represents data that can be divided into groups. Examples are age group, sex, etc.

Count data: This data contains non-negative integers. Example: the number of children a couple has.

Figure: Data types (image by author)

Level of Measurement

In statistics, the level of measurement is a classification that describes the relationship between the values of a variable.

We have four fundamental levels of measurement. They are:

Nominal Scale
Ordinal Scale
Interval Scale
Ratio Scale

1. Nominal Scale: This scale contains the least information since the data have names/labels only. It can be used for classification. We cannot perform mathematical operations on nominal data because there is no numerical value to the options (numbers associated with the names can only be used as tags).

Example: Which country do you belong to? India, Japan, Korea.

2. Ordinal Scale: In comparison to the nominal scale, the ordinal scale has more information because along with the labels, it has order/direction.

Example: Income level – High income, medium income, low income.

3. Interval Scale: It is a numerical scale. The interval scale has more information than the nominal and ordinal scales. Along with the order, we know the difference between two values (the interval indicates the distance between two entities). However, it has no true zero point.

Mean, median, and mode can be used to describe the data.

Example: Temperature in Celsius or Fahrenheit, calendar year, etc.

4. Ratio Scale: The ratio scale has the most information about the data. Unlike the other three scales, the ratio scale has a true zero point, so ratios between values are meaningful. The ratio scale can be thought of as a combination of the Nominal, Ordinal, and Interval scales with a true zero added.

Example: Current weight, height, etc.

4. Moments of Business Decision

We have four moments of business decision that help us understand the data.

4.1. Measures of Central tendency

(It is also known as First Moment Business Decision)

These measures describe the centrality of the data. To keep it simple, this is the part of descriptive statistical analysis where a single value at the centre represents the entire dataset.

The central tendency of a dataset can be measured using:

Mean: It is the sum of all the data points divided by the total number of values in the data set. Mean cannot always be relied upon because it is influenced by outliers.

Median: It is the middlemost value of a sorted/ordered dataset. If the size of the dataset is even, then the median is calculated by taking the average of the two middle values.

Mode: It is the most repeated value in the dataset. Data with a single mode is called unimodal, data with two modes is called bimodal, and data with more than two modes is called multimodal.
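The three measures can be computed with Python's built-in `statistics` module. The dataset below is hypothetical and deliberately contains one outlier to show how the mean is pulled away from the median.

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 100]  # hypothetical values; 100 is an outlier

mean = statistics.mean(data)        # sum of the values divided by their count
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # list of the most frequent value(s)

print(f"mean={mean:.2f}, median={median}, modes={modes}")
```

Here the single outlier lifts the mean well above most of the values, while the median stays at the middle of the data.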

4.2. Measures of Dispersion

(It is also known as Second Moment Business Decision)

These measures describe the spread of the data around its centre.

Dispersion can be measured using:

Variance: It is the average squared distance of all the data points from their mean. The problem with variance is that its units are also squared.

Standard Deviation: It is the square root of the variance. It brings the measure back to the original units.

Range: It is the difference between the maximum and the minimum values of a dataset.

Measure	Population	Sample
Mean	µ = (Σ X_i)/N	x̄ = (Σ x_i)/n
Median	Middle value of the sorted data	Middle value of the sorted data
Mode	Most frequent value	Most frequent value
Variance	σ² = Σ(X_i – µ)²/N	s² = Σ(x_i – x̄)²/(n – 1)
Standard Deviation	σ = sqrt(Σ(X_i – µ)²/N)	s = sqrt(Σ(x_i – x̄)²/(n – 1))
Range	Max – Min	Max – Min
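The difference between the population and sample versions in the table is only the divisor (N versus n – 1). A minimal sketch with the standard library, using a small made-up dataset:

```python
import statistics

data = [4, 8, 6, 5, 7]  # hypothetical values with mean 6

# Population formulas: divide the sum of squared deviations by N
pop_var = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

# Sample formulas: divide the sum of squared deviations by n - 1
sample_var = statistics.variance(data)
sample_std = statistics.stdev(data)

data_range = max(data) - min(data)

print(pop_var, sample_var, data_range)
```

The sum of squared deviations here is 10, so the population variance is 10/5 and the sample variance is 10/4.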

4.3. Skewness

(It is also known as Third Moment Business Decision)

It measures the asymmetry in the data. The two types of Skewness are:

Positive/right-skewed: Data is said to be positively skewed if most of the data is concentrated to the left side and has a tail towards the right.

Negative/left-skewed: Data is said to be negatively skewed if most of the data is concentrated to the right side and has a tail towards the left.

The formula of Skewness is E[((X – µ)/σ)³] = E[Z³], where Z = (X – µ)/σ
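The skewness formula is the average of the cubed z-scores, which can be sketched directly. The helper name `skewness` and the datasets are hypothetical; the function uses the population standard deviation and applies no small-sample correction.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in values])

right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]     # long tail to the right
left_skewed = [-x for x in right_skewed]      # mirror image: tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
print(skewness(symmetric))     # approximately zero
```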

Figure: Positively skewed data (image by author)

Figure: Negatively skewed data (image by author)

4.4. Kurtosis

(It is also known as Fourth Moment Business Decision)

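It describes the central peakedness of the distribution and the heaviness of its tails. The three types of Kurtosis are:

Positive/leptokurtic: Has a sharp peak and heavier tails than the normal distribution

Negative/platykurtic: Has a flat, wide peak and lighter tails than the normal distribution

Mesokurtic: Has the same kurtosis as the normal distribution

The formula of (excess) Kurtosis is E[((X – µ)/σ)⁴] – 3 = E[Z⁴] – 3, where Z = (X – µ)/σ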
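The kurtosis formula can be sketched the same way, as the average of the fourth power of the z-scores minus 3. The helper `excess_kurtosis` and the simulated samples are hypothetical; heavy-tailed data should give a positive value and flat, light-tailed data a negative one.

```python
import random
import statistics

def excess_kurtosis(values):
    """Mean of the fourth power of the z-scores, minus 3."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((x - mu) / sigma) ** 4 for x in values]) - 3

random.seed(1)  # fixed seed for reproducibility
n = 20_000

normal = [random.gauss(0, 1) for _ in range(n)]       # mesokurtic
uniform = [random.uniform(-1, 1) for _ in range(n)]   # light tails
# The difference of two exponentials follows a Laplace distribution (heavy tails)
laplace = [random.expovariate(1) - random.expovariate(1) for _ in range(n)]

for name, sample in [("normal", normal), ("uniform", uniform), ("laplace", laplace)]:
    print(f"{name}: {excess_kurtosis(sample):.2f}")
```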

Figure: Kurtosis (image by author)

Together, Skewness and Kurtosis are called Shape Statistics.

5. Central Limit Theorem (CLT)

Instead of analyzing the entire population, we usually draw a sample for analysis. The problem with sampling is that the sample mean is a random variable: it varies from sample to sample. A random sample can never be an exact representation of the population. This phenomenon is called sampling variation.

To account for sampling variation, we use the Central Limit Theorem. According to the Central Limit Theorem:

1. The distribution of sample means follows a normal distribution if the population is normal.

2. The distribution of sample means is approximately normal even if the population is not normal, provided the sample size is large enough.

3. The grand average of all the sample means approaches the population mean.

4. As a rule of thumb, the sample size should be at least 30. In practice, the condition on the sample size (n) is:

n > 10(k₃)², where k₃ is the sample skewness.

n > 10|k₄|, where k₄ is the sample kurtosis.
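A small simulation can illustrate these statements. The sketch below assumes a strongly right-skewed (exponential) population and a sample size of 100; both choices, and the helper name `sample_means`, are hypothetical.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed for reproducibility

# Hypothetical non-normal population: exponential with mean 1
population = [random.expovariate(1.0) for _ in range(50_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

def sample_means(pop, n, repeats):
    """Draw `repeats` random samples of size n and return their means."""
    return [statistics.fmean(random.sample(pop, n)) for _ in range(repeats)]

means = sample_means(population, n=100, repeats=2_000)

grand_mean = statistics.fmean(means)  # should be close to mu
spread = statistics.stdev(means)      # should be close to sigma / sqrt(n)

print(f"population mean {mu:.3f}, grand mean of sample means {grand_mean:.3f}")
print(f"sigma/sqrt(n) {sigma / math.sqrt(100):.3f}, spread of sample means {spread:.3f}")
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the population itself is skewed.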

6. Probability distributions

In statistical terms, a distribution function is a mathematical expression that describes the probability of different possible outcomes for an experiment.

Please read this article of mine about different types of Probability distributions.

7. Graphical representations

Graphical representation refers to the use of charts or graphs to visualize, analyze and interpret numerical data.

For a single variable (Univariate analysis), we have a bar plot, line plot, frequency plot, dot plot, boxplot, and the Normal Q-Q plot.

We will be discussing the Boxplot and the Normal Q-Q plot.

7.1. Boxplot

A boxplot is a way of visualizing the distribution of data based on a five-number summary. It is used to identify the outliers in the data.

The five numbers are minimum, first Quartile (Q1), median (Q2), third Quartile (Q3), and maximum.

The box contains the middle 50% of the data. The lower whisker extends from Q1 down to the smallest data point that is not an outlier, and the upper whisker extends from Q3 up to the largest data point that is not an outlier.

The interquartile range (IQR) is the difference between the third and first quartiles. IQR = Q3 – Q1.

Outliers are the data points that lie below the lower limit or above the upper limit:

Lower limit = Q1 – 1.5 * (IQR)

Upper limit = Q3 + 1.5 * (IQR)
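The IQR rule can be sketched with the standard library. The dataset is made up, and the `"inclusive"` quartile method is an assumption; other quartile conventions give slightly different limits.

```python
import statistics

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]  # hypothetical values

# Three cut points: Q1, the median (Q2), and Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the two limits are flagged as outliers
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"limits: {lower_limit} to {upper_limit}, outliers: {outliers}")
```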

Check out my article on detecting outliers using a boxplot.

7.2. Normal Q-Q plot

A Normal Q-Q plot is a kind of scatter plot that is built from two sets of quantiles. It is used to check whether the data follows a normal distribution.

On the x-axis, we have the theoretical quantiles (Z-scores) of the standard normal distribution, and on the y-axis, we have the actual sample quantiles. If the points fall approximately on a straight line, the data is said to be normal.
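The points of a Normal Q-Q plot can be computed by hand, as in the sketch below. The helper `normal_qq_points` is hypothetical, and the plotting positions (i – 0.5)/n are one common convention among several.

```python
import statistics

def normal_qq_points(values):
    """Return (theoretical z-score, sample quantile) pairs for a Normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = statistics.NormalDist()  # mean 0, standard deviation 1
    # Assumption: plotting position (i - 0.5) / n for the i-th smallest value
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]

points = normal_qq_points([5.1, 4.8, 5.6, 5.0, 4.4])  # hypothetical measurements
for z, x in points:
    print(f"{z:+.3f}  {x}")
```

Passing the pairs to any scatter-plot function gives the Q-Q plot; points lying close to a straight line suggest approximate normality.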

8. Hypothesis Testing

Hypothesis testing in statistics is a way to test the assumptions made on the population parameters.

Check my article on Hypothesis Testing to read it in detail.

End Notes:

Thank you for reading till the end. By now, we are familiar with the important statistical concepts.

I hope this article is informative. Feel free to share it with your study buddies.
