15 Most Common Data Science Interview Questions

Subhadeep Mandal Last Updated : 27 Jun, 2022

9 min read

This article was published as a part of the Data Science Blogathon.

Interview Questions — Source – pinterest.com

Introduction

Job interviews are…well, hard! Some interviewers ask hard questions, while others ask relatively easy ones. As an interviewee, it is up to you to go in prepared. And in a domain as broad as Machine Learning, preparation can easily fall short, so you have to be ready for anything.

While preparing, you might have got stuck at a point where you wonder what more to read. Based on the 15-17 data science interviews I have attended, I have put together 15 important Data Science and Machine Learning questions that came up in almost all of them, and I recommend studying them thoroughly. This will help you use your time efficiently and focus on what is important, rather than wandering across the web in search of questions and answers.

15 Most Common Interview Questions

1. Explain the Difference between Classification and Regression with at least 1 example.

Ans. As the name suggests, Classification algorithms assign each input to one of two or more classes or categories, so the output is a discrete value. Regression algorithms, in contrast, predict continuous, real-valued outputs.

Let us illustrate the difference with an example. Using classification, a company can predict whether or not an employee will resign in the next 5-10 years from data such as salary, bonus, work experience and age. Using regression, the same company can predict a continuous quantity, such as product demand or the rate of manufacturing errors, from data such as purchase history, customer satisfaction, returns and complaints.

2. Which is your favourite clustering algorithm? Explain why.

Ans. My favourite clustering algorithm is K-Means Clustering. The simplicity of this algorithm draws my attention. In K-Means, all we have to do is follow these four simple steps:

1. Choose the number of clusters, k.

2. Specify the cluster seeds (the initial centroids).

3. Assign each data point to its nearest centroid.

4. Adjust each centroid to the mean of the points assigned to it.

Steps 3 and 4 are repeated until the assignments stop changing. Because each pass is cheap, K-Means scales well with the size of the dataset and is very efficient. The results are easy to interpret too.

Moreover, the elbow method is a very interesting and simple approach to choosing k. It plots the WCSS (Within-Cluster Sum of Squares) against the number of clusters k, which usually produces a curve that looks like an elbow!

The value of k at the elbow point, beyond which adding more clusters barely reduces the WCSS, is a good choice for the number of clusters in your data. It is a very elegant yet easy way to tune the model, which I admire!
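
The four steps and the elbow method can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names `sq_dist` and `kmeans`, the toy points and the random seed are assumptions made for this example, and the seeds are simply k randomly chosen data points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Run K-Means on a list of tuples; return (centroids, labels, wcss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: k random data points as seeds
    labels = None
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wcss


# Hypothetical toy data: two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Elbow method: compute the WCSS for several values of k and look for the bend.
wcss_by_k = {k: kmeans(points, k)[2] for k in range(1, 5)}
for k, wcss in wcss_by_k.items():
    print(f"k={k}: WCSS={wcss:.2f}")
```

On this toy data the WCSS drops sharply from k = 1 to k = 2, which puts the elbow at k = 2 and matches the two groups of points.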

3. What is the generalized equation of a supervised Machine Learning model? Explain it in your own words.

Ans. Almost every supervised Machine Learning model can be represented mathematically as y = f(x), where y is the dependent output variable and x is the independent input feature. f is a generalized representation of the Machine Learning model: it takes in the input features (x₁, x₂, x₃, …, x_n) and gives out the predicted output (ŷ).

4. What is bootstrap sampling? How is it useful in ML? Explain with a use-case.

Ans. The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement. It is used in applied Machine Learning to estimate the skill of machine learning models when making predictions on data not included in the training data.

In Machine Learning, bootstrap sampling is used in an ensemble algorithm called bootstrap aggregation, or Bagging. It helps reduce overfitting and improves the stability of a Machine Learning algorithm.

For example, the Bootstrap Aggregation algorithm creates multiple different models from a single training dataset by training each model on a different bootstrap sample. A very popular use case for bootstrap sampling itself is predicting election results for an area from sample data, instead of collecting election data for the entire population.
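
Resampling with replacement can be sketched with the standard library alone. The snippet below estimates a rough 95% interval for a sample mean. The data values, the function names `bootstrap_means` and `percentile_interval`, the number of resamples and the seed are all assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Mean of each bootstrap resample (same size as data, drawn with replacement)."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, lower=2.5, upper=97.5):
    """Simple percentile interval taken from the sorted bootstrap statistics."""
    ordered = sorted(values)
    lo_idx = int(len(ordered) * lower / 100)
    hi_idx = min(int(len(ordered) * upper / 100), len(ordered) - 1)
    return ordered[lo_idx], ordered[hi_idx]


# Hypothetical sample drawn from a larger population.
sample = [12, 15, 9, 20, 14, 17, 11, 13, 16, 18]

means = bootstrap_means(sample)
low, high = percentile_interval(means)
print(f"sample mean = {statistics.mean(sample):.2f}")
print(f"approx. 95% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

Bagging applies the same resampling idea to model training: each model in the ensemble is fitted on its own bootstrap sample and the predictions are then averaged or put to a vote.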

5. What are overfitting and underfitting? Explain them in statistical terms and suggest measures to reduce both.

Ans. Overfitting occurs when a Machine Learning model fits its training data points so closely, including noisy and inaccurate ones, that it starts giving wrong predictions on new data. The model becomes ‘overconfident’ in what it has seen and fails to generalize to data points outside its knowledge. Statistically, overfitting corresponds to low bias and high variance. It can be reduced by using a simpler model (for example, a linear model for linear data), pruning unnecessary nodes in a Decision tree to reduce its complexity, training with more data where necessary, etc.

Underfitting occurs when a Machine Learning model neither fits the training data nor generalizes to new data. The model fails to recognize the underlying trends in the data. It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. Statistically, underfitting corresponds to high bias and low variance. It can be reduced by trying other, more flexible Machine Learning algorithms, fitting more data, removing noise from the data, increasing the number of features through feature engineering, etc.

6. Define Euclidean, Manhattan and Minkowski distance. In what kind of problems are they used and why?

Ans. Euclidean distance: The Euclidean distance between two points is the length of the straight line that joins them. Consider two points A(x₁, y₁) and B(x₂, y₂) in the Cartesian plane. The Euclidean distance between them is:

d(A, B) = √((x₂ - x₁)² + (y₂ - y₁)²)

Manhattan distance: The Manhattan distance between two points in an n-dimensional space is the sum of the absolute differences in each dimension. Consider two points A(x₁, y₁) and B(x₂, y₂) in the Cartesian plane. The Manhattan distance between them is:

d(A, B) = |x₂ - x₁| + |y₂ - y₁|

Minkowski distance: The Minkowski distance is a generalized form of the Euclidean and Manhattan distances. For two points A(x₁, x₂) and B(y₁, y₂), the Minkowski distance of order p between them is:

d(A, B) = (|x₁ - y₁|^p + |x₂ - y₂|^p)^(1/p)

Setting p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
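
The relationship between the three distances is easy to check in code. In this small sketch, the function names and the sample points are assumptions made for the example. Manhattan and Euclidean are written as the p = 1 and p = 2 cases of Minkowski.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    """Manhattan distance: the Minkowski distance with p = 1."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Euclidean distance: the Minkowski distance with p = 2."""
    return minkowski(a, b, 2)


# Hypothetical points in the plane.
a, b = (1, 2), (4, 6)

print(manhattan(a, b))     # 7.0  (|1 - 4| + |2 - 6|)
print(euclidean(a, b))     # 5.0  (sqrt(3^2 + 4^2))
print(minkowski(a, b, 3))  # order 3, smaller than the Euclidean distance here
```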

7. Describe what you mean by bias-variance trade-off in ML.

Ans. Bias is the difference between the actual output and the model's average predicted output. In n-dimensional space, bias can be seen as the distance between the true value points and the predicted value points. A high bias means the model oversimplifies the data, and such a simplified approach leads to incorrect predictions and a large gap between true and predicted values. High bias leads to underfitting.

On the other hand, variance is how much the model's predictions vary when it is trained on different data. When data points are widely scattered, the model may try to learn every data point so perfectly that, instead of recognizing patterns, it ends up memorizing them. Memorization only leads to good performance as long as the model is predicting outputs for the training data. As soon as we give it new, unseen data to predict, its error will be high. Such behaviour occurs when the model tries to fit the noise and fluctuations in the data. High variance leads to overfitting.

So bias and variance pull in opposite directions, and the trade-off is about finding the balance between them that minimises the total error. Lowering the bias typically increases the variance, and vice versa. So we have to build the model in such a way that bias and variance stay balanced and neither overfitting nor underfitting occurs. This kind of tuning is called the Bias-Variance Tradeoff.

8. What is Vectorization? Why is Vectorization important in NLP use cases?

Ans. Machine Learning algorithms cannot work on raw text data; it has to be converted to numerical data first. Vectorization is a method of converting text data to numerical vectors so that the text can be easily analyzed or consumed by the Machine Learning algorithm.

In NLP, Vectorization is used to map words or phrases from a vocabulary to corresponding vectors of real numbers, which enables word prediction, word similarity/semantics and grammar checks.
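
One of the simplest vectorization schemes is bag-of-words, where each document becomes a vector of word counts over a shared vocabulary. The sketch below uses only the standard library. The function names and the two sample sentences are assumptions made for illustration, and real pipelines usually add proper tokenization, stop-word handling and weighting schemes such as TF-IDF.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of the unique lower-cased words across all documents."""
    words = {word for doc in documents for word in doc.lower().split()}
    return sorted(words)


def vectorize(document, vocabulary):
    """Count vector of the document over the given vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical two-document corpus.
docs = ["the cat sat", "the dog sat on the mat"]

vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```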

9. Mention some differences between covariance and correlation.

Ans. a. Correlation coefficients range from -1 to 1, whereas covariance values range from -∞ to +∞.

b. Correlation measures both the strength and the direction of the linear relationship between two variables. Covariance only indicates the direction in which two variables vary together; its magnitude depends on the units of the variables and is hard to interpret.

c. Changing the scale of the variables affects covariance, i.e. if the variables are multiplied by the same or by different constants, the calculated covariance between them changes. This doesn’t happen with correlation, which is not affected by a change of scale.
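
The effect of scale is easy to demonstrate numerically. In this sketch the height and weight values and the function names are made up for the example. Converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical measurements.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [50, 58, 65, 77, 85]
heights_cm = [h * 100 for h in heights_m]  # same data on a different scale

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)

print(f"covariance:  {cov_m:.3f} (m) vs {cov_cm:.3f} (cm)")
print(f"correlation: {corr_m:.4f} (m) vs {corr_cm:.4f} (cm)")
```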

10. Why is the Naive Bayes algorithm naive? Mention its advantages and disadvantages.

Ans. A naive Bayes classifier presumes that a particular feature of a class is unrelated to any other feature, given the class variable. Simply put, Naive Bayes assumes that the features are conditionally independent of each other. In real data this assumption often does not hold, yet the classifier applies it anyway to keep the probability calculation simple. This is the reason it’s called Naive.

Advantages

1. Extremely easy to implement and follows a simple probabilistic approach (Bayes’ Theorem).

2. Works well on small datasets, since it needs relatively little training data to estimate its probabilities.

3. Widely used for spam filtering and especially for text-classification tasks (as it works better when the attributes/features in the data are independent of each other).

Disadvantages

1. Handles numerical data less naturally: it has to assume a distribution for each numerical feature (usually a normal distribution) and works poorly when the data does not follow it.

2. The ‘naive’ approach of treating attributes as independent of each other often doesn’t work well in complex problems.

11. Explain a decision tree. What do you know about the ID3 algorithm?

Ans. A decision tree is a tree-like structure made up of nodes and edges. It has a root node, parent nodes and child nodes. Each parent node is split into two or more child nodes. The terminal nodes, or leaf nodes, represent the output variables.

ID3 stands for Iterative Dichotomiser 3, where Dichotomiser means something that divides. The algorithm builds a decision tree using a top-down greedy approach: it starts from the root and, at each step, selects the best feature to split on using Entropy/Information Gain. Each parent node is divided into two or more child nodes iteratively until no more division is possible.

12. Entropy or Gini Impurity? – define them and decide which is better for selecting the best features in your data.

Ans. Entropy: It is a measure of the impurity of a node. The higher the entropy of a node, the higher its impurity. For the best split, it is always suggested to choose the split whose child nodes have low entropy. For a node whose classes occur with proportions p₁, p₂, …, p_c, entropy is calculated by:

E = -(p₁ log₂ p₁ + p₂ log₂ p₂ + … + p_c log₂ p_c)

Gini Impurity: Like Entropy, Gini impurity also measures the impurity of a node, and hence how good a split is. It is calculated as 1 - (p₁² + p₂² + … + p_c²). In both Entropy and Gini impurity, a node that has multiple classes is considered impure and a node having a single class is pure.

If we compare the two methods, Gini Impurity is cheaper to compute than Entropy. Training with the Entropy criterion takes somewhat longer because Entropy is calculated using logarithms. The two criteria usually lead to very similar trees, so Gini impurity is often preferred over Entropy.
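
Both impurity measures take only a few lines to compute from the class labels in a node. In this sketch the function names and the example labels are assumptions made for illustration. It shows that a pure node scores 0 on both measures, while an evenly mixed two-class node gives the maximum values of 1.0 (entropy) and 0.5 (Gini).

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Proportion of each class among the labels in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1 - sum(p ** 2 for p in class_probabilities(labels))


# Hypothetical nodes.
pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]

print(entropy(pure), gini(pure))    # 0.0 0.0
print(entropy(mixed), gini(mixed))  # 1.0 0.5
```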

13. What are leaf nodes in the Decision tree?

Ans. Leaf nodes in a Decision tree are the end or terminal nodes, i.e. nodes that have no children. These nodes cannot be split any further. Hence, they represent the output classes or categories, such as YES or NO, or GOOD, BAD and SATISFACTORY.

14. What are the disadvantages of the Decision tree?

Ans. a. A Decision tree has a high chance of overfitting. This generally happens when the tree is allowed to grow deep, for example on a dataset with many features, so that it ends up memorizing the training data. An overfitted model will give wrong predictions for unseen data.

b. It tends to become more complex as we add more data to it. A Decision tree is also unstable: upon adding new data points, the splits get calculated again and the tree may be restructured substantially. This reduces the reliability of the model. Hence, an ensemble such as Random forest is often used instead of a single tree.

c. A Decision tree is vulnerable to noisy datasets. It tries to fit all the data, including the noise, and then tends to give wrong predictions.

15. Explain Information Gain in the Decision tree algorithm with its mathematical formula.

Ans. Information gain tells us how good a split between a parent node and its child nodes is. The higher the Information gain, the better the split. High entropy indicates that the classes in a node are mixed, while low entropy means that the node is close to pure. Hence, the more a split decreases the entropy, the higher the Information gain. Information gain is mathematically represented as:

Information Gain = E_parent - E_children

Here, E_parent is the entropy of the parent node, and E_children is the weighted average entropy of the child nodes, with each child weighted by the fraction of the parent's samples it receives.
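
As a worked illustration, the sketch below computes Information gain for two candidate splits of the same parent node. It assumes the usual convention that the children's entropy is averaged with weights proportional to their sizes. The function names and the labels are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the class labels in a node."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    e_children = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - e_children


# Hypothetical parent node with 5 "yes" and 5 "no" samples.
parent = ["yes"] * 5 + ["no"] * 5

perfect_split = [["yes"] * 5, ["no"] * 5]               # both children pure
useless_split = [["yes", "yes", "no", "no"],
                 ["yes", "yes", "yes", "no", "no", "no"]]  # children still 50/50

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A tree-building algorithm such as ID3 would pick the first split, because it removes all of the parent's entropy.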

Tips for Interviews

Some tips to bear in mind before and during data science interviews:

Always remember to explain your answers with pen and paper, using graphs/formulas as proof of concept. It increases your chance of getting selected, as data scientists are expected to be extremely detailed in whatever they do! Such a storytelling habit must be reflected in your interviews.
Avoid abbreviations like ML/DL/NLP/CV if you are interviewing for an entry-level role. Be as thorough as you can.
Do not list skills or frameworks that you have used barely 1-2 times; only list those you work with extensively.
Try to be honest and humble while answering questions. Learn to humbly admit that you don’t know the answer to a particular question rather than covering it up with irrelevant information. Most of the time, candidates get rejected for such arrogant behaviour.
Be friendly yet professional. Have confidence and don’t show up with a “stressed” and “anxious” face.

Conclusion on Interview Questions

I hope this article helps you get in a quick and efficient revision before your gruelling data science interviews.

Key takeaways from this article:

Very important concepts like the bias-variance tradeoff, classification, regression etc.
The pros and cons of Naive Bayes, one of the most used classifiers.
Important concepts like Entropy and Gini Impurity in Decision trees.
Crucial mathematical and statistical concepts like covariance and correlation, vectorization, and distance metrics.
Overfitting vs. underfitting, a must-know concept for any machine learning enthusiast.

Cheers and all the best! Hope you ace your data science interviews!

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Subhadeep Mandal

A Machine Learning and Deep Learning practitioner with a background in Computer Science Engineering. My work interests include Machine Learning, Deep Learning, Computer Vision and NLP, with expertise in Generative AI and Retrieval Augmented Generation.



