At the heart of data science lies statistics, which has existed for centuries yet remains fundamentally essential in today’s digital age. Why? Because basic statistics concepts are the backbone of data analysis, enabling us to make sense of the vast amounts of data generated daily. It’s like conversing with data, where statistics helps us ask the right questions and understand the stories data tries to tell.
From predicting future trends and making decisions based on data to testing hypotheses and measuring performance, statistics is the tool that powers the insights behind data-driven decisions. It’s the bridge between raw data and actionable insights, making it an indispensable part of data science.
In this article, I have compiled the top 15 fundamental statistics concepts that every data science beginner should know!
We will learn some basic statistics concepts, but understanding where our data comes from and how we gather it is essential before diving deep into the ocean of data. This is where populations, samples, and various sampling techniques come into play.
Imagine we want to know the average height of people in a city. It’s impractical to measure everyone, so we take a smaller group (a sample) representing the larger population. The trick lies in how we select this sample. Techniques such as random, stratified, or cluster sampling help ensure our sample represents the population well, minimizing bias and making our findings more reliable.
By understanding populations and samples, we can confidently extend our insights from the sample to the whole population, making informed decisions without the need to survey everyone.
Data comes in various flavors, and knowing the type of data you’re dealing with is crucial for choosing the right statistical tools and techniques.
Imagine descriptive statistics as your first date with your data. It’s about getting to know the basics, the broad strokes that describe what’s in front of you. Descriptive statistics has two main types: central tendency and variability measures.
Measures of Central Tendency: These are like the data’s center of gravity. They give us a single value typical or representative of our data set.
Mean: The average is calculated by adding up all the values and dividing by the number of values. It’s like the overall rating of a restaurant based on all reviews. The mathematical formula for the average is given below:
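$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where $x_1, x_2, \dots, x_n$ are the observed values and $n$ is the number of observations.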
Median: The middle value when the data is ordered from smallest to largest. If the number of observations is even, it’s the average of the two middle numbers. It’s like finding the midpoint of a bridge.
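$$\text{Median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} & \text{if } n \text{ is even} \end{cases}$$

Here the $x_{(i)}$ are the values sorted in ascending order.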
If n is even, the median is the average of the two central numbers.
Mode: It is the most frequently occurring value in a data set. Think of it as the most popular dish at a restaurant.
Measures of Variability: While measures of central tendency bring us to the center, measures of variability tell us about the spread or dispersion.
Range: The difference between the highest and lowest values. It gives a basic idea of the spread.
Variance: Measures how far each number in the set is from the mean and thus from every other number in the set. For a sample, it’s calculated as:
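$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

where $\bar{x}$ is the sample mean and $n$ is the sample size.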
Standard Deviation: The square root of the variance, giving a measure of the typical distance of values from the mean. It’s like assessing the consistency of a baker’s cake sizes. It is represented as:
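$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

If you want to see these measures in action, here is a minimal sketch using Python’s built-in statistics module on a made-up list of cake diameters (the numbers are purely illustrative):

```python
import statistics

# Hypothetical cake diameters (in cm) from one baker
cake_sizes = [20, 22, 21, 20, 23, 25, 20, 22]

print("Mean:", statistics.mean(cake_sizes))          # arithmetic average
print("Median:", statistics.median(cake_sizes))      # middle value when sorted
print("Mode:", statistics.mode(cake_sizes))          # most frequent value
print("Range:", max(cake_sizes) - min(cake_sizes))   # highest minus lowest
print("Variance:", statistics.variance(cake_sizes))  # sample variance (n - 1 in the denominator)
print("Std dev:", statistics.stdev(cake_sizes))      # sample standard deviation
```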
Before we move to the next basic statistics concept, here’s a Beginner’s Guide to Statistical Analysis for you!
Data visualization is the art and science of telling stories with data. It turns complex results from our analysis into something tangible and understandable. It’s crucial for exploratory data analysis, where the goal is to uncover patterns, correlations, and insights from data without yet making formal conclusions.
We have an example of a bar chart (left) and a line chart (right) below.
Below is an example of a scatter plot and a histogram
Visualizations bridge raw data and human cognition, enabling us to interpret and make sense of complex datasets quickly.
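If you would like to reproduce charts like these yourself, here is a minimal matplotlib sketch with made-up data that draws the four chart types mentioned above:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Bar chart: comparing quantities across categories
axes[0, 0].bar(["A", "B", "C"], [5, 9, 3])
axes[0, 0].set_title("Bar chart")

# Line chart: a value tracked over time
axes[0, 1].plot(range(12), rng.normal(10, 2, 12))
axes[0, 1].set_title("Line chart")

# Scatter plot: relationship between two variables
x = rng.normal(0, 1, 100)
axes[1, 0].scatter(x, 2 * x + rng.normal(0, 1, 100))
axes[1, 0].set_title("Scatter plot")

# Histogram: distribution of a single variable
axes[1, 1].hist(rng.normal(0, 1, 1000), bins=30)
axes[1, 1].set_title("Histogram")

plt.tight_layout()
plt.show()
```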
Probability is the grammar of the language of statistics. It’s about the chance or likelihood of events happening. Understanding concepts in probability is essential for interpreting statistical results and making predictions.
Probability provides the foundation for making inferences about data and is critical to understanding statistical significance and hypothesis testing.
Probability distributions are like different species in the statistics ecosystem, each adapted to its niche of applications.
The normal distribution describes how data spreads around the mean, and its shape is summarized by the empirical rule, also known as the 68-95-99.7 rule.
This rule applies to a perfectly normal distribution and outlines the following:
About 68% of the data falls within one standard deviation of the mean.
About 95% falls within two standard deviations.
About 99.7% falls within three standard deviations.
Binomial Distribution: This distribution applies to situations with two outcomes (like success or failure) repeated several times. It helps model events like flipping a coin or taking a true/false test.
Poisson Distribution: Counts the number of times something happens over a specific interval or space. It’s ideal for situations where events happen independently and at a constant average rate, like the daily emails you receive.
Each distribution has its own set of formulas and characteristics, and choosing the right one depends on the nature of your data and what you’re trying to find out. Understanding these distributions allows statisticians and data scientists to model real-world phenomena and predict future events accurately.
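To build intuition, here is a small NumPy sketch (with arbitrarily chosen parameters) that draws samples from each of these distributions and compares the sample means with what theory predicts:

```python
import numpy as np

rng = np.random.default_rng(0)

normal = rng.normal(loc=170, scale=10, size=10_000)   # e.g., heights in cm
binomial = rng.binomial(n=10, p=0.5, size=10_000)     # e.g., heads in 10 coin flips
poisson = rng.poisson(lam=4, size=10_000)             # e.g., emails received per day

print("Normal:   sample mean =", round(normal.mean(), 2), "| theory: 170")
print("Binomial: sample mean =", round(binomial.mean(), 2), "| theory: n * p = 5")
print("Poisson:  sample mean =", round(poisson.mean(), 2), "| theory: lambda = 4")
```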
Think of hypothesis testing as detective work in statistics. It’s a method to test whether a particular theory about our data could be true. This process starts with two opposing hypotheses: the null hypothesis (H0), which assumes there is no effect or no difference, and the alternative hypothesis (H1), which represents the effect we suspect exists.
Example: Testing if a new diet program leads to weight loss compared to not following any diet.
Hypothesis testing involves choosing between these two based on the evidence (our data).
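As a concrete sketch of the diet example, here is a two-sample t-test with SciPy; the weight-change figures below are made up purely for illustration:

```python
from scipy import stats

# Hypothetical weight change (kg) after eight weeks
diet_group = [-2.1, -1.5, -3.0, -0.5, -2.4, -1.8, -2.9, -1.1]
control_group = [0.2, -0.4, 0.5, -0.1, 0.3, -0.6, 0.1, 0.0]

t_stat, p_value = stats.ttest_ind(diet_group, control_group)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# If p falls below our chosen significance level (e.g., 0.05), we reject the
# null hypothesis of "no difference" in favor of the alternative.
if p_value < 0.05:
    print("Evidence that the diet group differs from the control group.")
else:
    print("Not enough evidence to reject the null hypothesis.")
```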
Confidence intervals give us a range of values within which we expect the true population parameter (like a mean or proportion) to fall with a certain confidence level (commonly 95%). It’s like predicting a sports team’s final score with a margin of error; we’re saying, “We’re 95% confident the true score will be within this range.”
Constructing and interpreting confidence intervals helps us understand the precision of our estimates. The wider the interval, the less precise our estimate, and vice versa.
The figure above illustrates the concept of a confidence interval (CI) in statistics, using a sample distribution and its 95% confidence interval around the sample mean.
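A common formula for a confidence interval around a sample mean is $\bar{x} \pm t_{\alpha/2,\,n-1}\cdot\frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation. Here is a minimal SciPy sketch that computes a 95% interval for a made-up sample of heights:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of heights (cm)
heights = np.array([168, 172, 171, 169, 174, 170, 173, 167, 175, 171])

mean = heights.mean()
sem = stats.sem(heights)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(heights) - 1, loc=mean, scale=sem)

print(f"Sample mean: {mean:.1f} cm")
print(f"95% CI: ({ci_low:.1f}, {ci_high:.1f}) cm")
```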
Correlation and causation often get mixed up, but they are different:
Just because two variables are correlated does not mean one causes the other. This is a classic case of not confusing “correlation” with “causation.”
Simple linear regression is a way to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered an explanatory variable (independent), and the other is a dependent variable.
Simple linear regression helps us understand how changes in the independent variable affect the dependent variable. It’s a powerful tool for prediction and is foundational for many other complex statistical models. By analyzing the relationship between two variables, we can make informed predictions about how they will interact.
Simple linear regression assumes a linear relationship between the independent variable (explanatory variable) and the dependent variable. If the relationship between these two variables is not linear, then the assumptions of simple linear regression may be violated, potentially leading to inaccurate predictions or interpretations. Thus, verifying a linear relationship in the data is essential before applying simple linear regression.
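In equation form, the model is $y = \beta_0 + \beta_1 x + \varepsilon$, where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is the error term. Here is a minimal NumPy sketch that fits a line to synthetic data generated around $y = 1 + 2x$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1 + 2 * x + rng.normal(0, 1, size=x.size)   # true relationship plus noise

# np.polyfit with degree 1 returns [slope, intercept] (highest power first)
slope, intercept = np.polyfit(x, y, 1)
print(f"Fitted line: y = {intercept:.2f} + {slope:.2f} * x")

# Predict the outcome for a new value of the explanatory variable
x_new = 7.5
print("Prediction at x = 7.5:", round(intercept + slope * x_new, 2))
```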
Think of multiple linear regression as an extension of simple linear regression, but instead of trying to predict an outcome with one knight in shining armor (a single predictor), you have a whole team. It’s like upgrading from a one-on-one basketball game to an entire team effort, where each player (predictor) brings unique skills. The idea is to see how several variables together influence a single outcome.
However, with a bigger team comes the challenge of managing relationships, known as multicollinearity. It occurs when predictors are highly correlated with each other and carry overlapping information. Imagine two basketball players constantly trying to take the same shot; they can get in each other’s way. In regression, this can make it hard to see each predictor’s unique contribution, potentially skewing our understanding of which variables are truly significant.
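Here is a small scikit-learn sketch on synthetic data where one predictor is deliberately almost a copy of another; the correlation matrix makes the multicollinearity visible:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(0, 1, n)
x2 = 0.9 * x1 + rng.normal(0, 0.1, n)   # nearly redundant with x1
x3 = rng.normal(0, 1, n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 2 * x3 + rng.normal(0, 1, n)

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_.round(2))

# Large off-diagonal values (here between x1 and x2) signal overlapping information
print("Correlation matrix:\n", np.corrcoef(X, rowvar=False).round(2))
```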
While linear regression predicts continuous outcomes (like temperature or prices), logistic regression is used when the outcome is categorical (like yes/no, win/lose). Imagine trying to predict whether a team will win or lose based on various factors; logistic regression is your go-to strategy.
It transforms the linear equation so that its output falls between 0 and 1, representing the probability of belonging to a particular category. It’s like having a magic lens that converts continuous scores into a clear “this or that” view, allowing us to predict categorical outcomes.
The graphical representation illustrates an example of logistic regression applied to a synthetic binary classification dataset. The blue dots represent the data points, with their position along the x-axis indicating the feature value and the y-axis indicating the category (0 or 1). The red curve represents the logistic regression model’s prediction of the probability of belonging to class 1 (e.g., “win”) for different feature values. As you can see, the curve transitions smoothly from the probability of class 0 to class 1, demonstrating the model’s ability to predict categorical outcomes based on an underlying continuous feature.
The formula for logistic regression is given by:
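$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$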
This formula uses the logistic function to transform the linear equation’s output into a probability between 0 and 1. This transformation allows us to interpret the outputs as probabilities of belonging to a particular category based on the value of the independent variable x.
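To make this concrete, here is a minimal scikit-learn sketch on a synthetic binary dataset in the spirit of the figure described above (a single continuous feature, a 0/1 outcome):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(0, 2, size=(200, 1))                      # one continuous feature
y = (X[:, 0] + rng.normal(0, 1, 200) > 0).astype(int)    # noisy threshold rule -> 0/1 labels

model = LogisticRegression().fit(X, y)

# Predicted probability of class 1 for a few feature values
for value in (-3.0, 0.0, 3.0):
    p = model.predict_proba([[value]])[0, 1]
    print(f"x = {value:+.1f} -> P(class 1) = {p:.2f}")
```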
ANOVA (Analysis of Variance) and Chi-Square tests are like detectives in the statistics world, helping us solve different mysteries. ANOVA allows us to compare means across multiple groups to see if at least one is statistically different. Think of it as tasting samples from several batches of cookies to determine if any batch tastes significantly different.
On the other hand, the Chi-Square test is used for categorical data. It helps us understand if there’s a significant association between two categorical variables. For instance, is there a relationship between a person’s favorite genre of music and their age group? The Chi-Square test helps answer such questions.
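Here is a short SciPy sketch of both tests; all of the numbers are invented for illustration:

```python
from scipy import stats

# ANOVA: do three cookie batches share the same mean taste score?
batch_a = [7.1, 6.8, 7.4, 7.0, 6.9]
batch_b = [7.2, 7.5, 7.3, 7.6, 7.4]
batch_c = [6.2, 6.0, 6.4, 6.1, 6.3]
f_stat, p_anova = stats.f_oneway(batch_a, batch_b, batch_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Chi-Square: is favorite music genre associated with age group?
# Rows are age groups, columns are genres (observed counts).
observed = [[30, 10, 20],
            [20, 25, 15]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-Square: chi2 = {chi2:.2f}, p = {p_chi2:.4f}")
```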
The Central Limit Theorem (CLT) is a fundamental statistical principle that feels almost magical. It tells us that if you take enough samples from a population and calculate their means, those means will form a normal distribution (the bell curve), regardless of the population’s original distribution. This is incredibly powerful because it allows us to make inferences about populations even when we don’t know their exact distribution.
In data science, the CLT underpins many techniques, enabling us to use tools designed for normally distributed data even when our data doesn’t initially meet those criteria. It’s like finding a universal adapter for statistical methods, making many powerful tools applicable in more situations.
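A quick simulation shows the theorem at work: sample means drawn from a strongly skewed (exponential) population still pile up into a bell-shaped curve. The parameters below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
population = rng.exponential(scale=2.0, size=100_000)   # very non-normal population

# Repeatedly take samples of size 50 and record each sample's mean
sample_means = np.array([
    rng.choice(population, size=50).mean()
    for _ in range(5_000)
])

print("Population mean:", round(population.mean(), 2))
print("Mean of sample means:", round(sample_means.mean(), 2))
print("Std of sample means (≈ sigma / sqrt(n)):", round(sample_means.std(), 2))
```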
In predictive modeling and machine learning, the bias-variance tradeoff is a crucial concept that highlights the tension between two main types of error that can make our models go awry. Bias refers to errors from overly simplistic models that fail to capture the underlying trends; imagine trying to fit a straight line through a curved road, and you’ll miss the mark. Conversely, variance refers to errors from overly complex models that capture noise in the data as if it were an actual pattern, like tracing every twist and turn of a bumpy trail and thinking it’s the path forward.
The art lies in balancing these two to minimize the total error, finding the sweet spot where your model is just right: complex enough to capture the real patterns but simple enough to ignore the random noise. It’s like tuning a guitar; if the strings are too tight or too loose, it won’t sound right. Striking this balance is the essence of tuning our statistical models to predict outcomes accurately.
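As a toy illustration, here is a NumPy sketch on synthetic data that compares test error for a too-simple fit (degree 1), a moderate fit (degree 4), and a very flexible fit (degree 15); typically the moderate fit generalizes best:

```python
import numpy as np

rng = np.random.default_rng(5)
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 30)
x_test = np.sort(rng.uniform(0, 1, 30))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 30)

for degree in (1, 4, 15):
    coeffs = np.polyfit(x_train, y_train, degree)                 # fit on training data
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)  # evaluate on unseen data
    print(f"degree {degree:2d}: test MSE = {test_mse:.3f}")
```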
From statistical sampling to the bias-variance tradeoff, these principles are not mere academic notions but essential tools for insightful data analysis. They equip aspiring data scientists with the skills to turn vast data into actionable insights, emphasizing statistics as the backbone of data-driven decision-making and innovation in the digital age.
Have we missed any basic statistics concept? Let us know in the comment section below.
Explore our end-to-end statistics guide for data science to learn more about the topic!