40 Interview Questions asked at Startups in Machine Learning / Data Science

Analytics Vidhya Last Updated : 15 Oct, 2020

Overview

  • Contains a list of widely asked interview questions based on machine learning and data science
  • The primary focus is to learn machine learning topics with the help of these questions
  • Crack data scientist job profiles with these questions

 

Introduction

Careful! These questions can make you think THRICE!

Machine learning and data science are widely seen as drivers of the next industrial revolution. This also means that numerous exciting startups are looking for data scientists. What better start could there be for your career?

Still, getting into these roles is not easy. You need to be genuinely excited about the idea, the team and the vision of the company. You might also face some really difficult technical questions along the way. The questions asked depend on what the startup does. Do they provide consulting? Do they build ML products? Always find this out before you begin your interview preparation.

To help you prepare for your next interview, I’ve prepared a list of 40 plausible and tricky questions that are likely to come your way in interviews. If you can understand and answer these questions, rest assured that you will put up a tough fight in your job interview.

Note: The key to answering these questions is a concrete, practical understanding of ML and related statistical concepts. You can get that know-how in our course ‘Introduction to Data Science’!

Or how about learning how to crack data science interviews from someone who has conducted hundreds of them? Check out the ‘Ace Data Science Interviews‘ course taught by Kunal Jain and Pranav Dar.


 

Interview Questions on Machine Learning

Q1. You are given a training data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimensionality of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)

 

Q2. Is rotation necessary in PCA? If yes, Why? What will happen if you don’t rotate the components?

 

Q3. You are given a data set. The data set has missing values which are spread within 1 standard deviation of the median. What percentage of the data would remain unaffected? Why?
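
One way to reason about Q3, as a rough sketch: assume the data are normally distributed (an assumption the question does not state), so that the median coincides with the mean. Under that assumption, the share of values lying within 1 standard deviation of the center can be checked with Python's standard library:

```python
from statistics import NormalDist

# Assumption for illustration: the variable follows a normal distribution,
# so median == mean and distances can be measured in standard deviations.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of values lying within 1 standard deviation of the center.
within_one_sd = std_normal.cdf(1.0) - std_normal.cdf(-1.0)

# Share of values outside that band, i.e. not touched by the missing values.
unaffected = 1.0 - within_one_sd

print(f"within 1 SD: {within_one_sd:.1%}")
print(f"unaffected:  {unaffected:.1%}")
```

If the real distribution is skewed or heavy-tailed, these shares will differ, so the figure only holds under the normality assumption.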

 

Q4. You are given a data set on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?
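
To see why 96% accuracy alone can be misleading in Q4, here is a minimal sketch. The class balance (4% positive cases) and the always-negative "model" are assumptions made up for illustration; the question does not state either.

```python
# Hypothetical labels: 1 = cancer, 0 = no cancer.
# The 4% positive rate is an assumption, not a figure from the question.
y_true = [1] * 40 + [0] * 960

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# How many of the actual cancer cases did the model detect?
actual_cases = sum(y_true)
detected_cases = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {accuracy:.2f}")
print(f"detected {detected_cases} of {actual_cases} cancer cases")
```

On imbalanced data like this, a model can reach high accuracy without detecting a single minority-class case, which is why metrics that focus on the minority class are usually checked as well.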

 

Q5. Why is naive Bayes so ‘naive’?

 

Q6. Explain prior probability, likelihood and marginal likelihood in the context of the naive Bayes algorithm.

 

Q7. You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than with the decision tree model. Can this happen? Why?

 

Q8. You are assigned a new project which involves helping a food delivery company save more money. The problem is that the company’s delivery team isn’t able to deliver food on time. As a result, their customers are unhappy. To keep them happy, the company ends up delivering food for free. Which machine learning algorithm can save them?

 

Q9. You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?

 

Q10. You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

 

Q11. After spending several hours, you are now anxious to build a high accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, none of the models performs better than the benchmark score. Finally, you decide to combine those models. Ensembled models are known to return high accuracy, but you are out of luck. What did you miss?

 
Q12. How is kNN different from k-means clustering?

 

Q13. How are True Positive Rate and Recall related? Write the equation.
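
As a small worked example for Q13: true positive rate and recall are two names for the same quantity, TP / (TP + FN). The labels and predictions below are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, fp, tn) for binary labels where 1 is the positive class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)

# True Positive Rate = Recall = TP / (TP + FN)
true_positive_rate = tp / (tp + fn)
recall = true_positive_rate

# For contrast, precision uses false positives instead: TP / (TP + FP)
precision = tp / (tp + fp)

print(f"TPR = recall = {recall:.2f}, precision = {precision:.2f}")
```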

 

Q14. You have built a multiple regression model. Your model R² isn’t as good as you wanted. To improve it, you remove the intercept term, and your model R² goes from 0.3 to 0.8. Is it possible? How?

 

Q15. After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check whether he’s right? Without losing any information, can you still build a better model?

 

Q16. When is Ridge regression favorable over Lasso regression?

 

Q17. A rise in global average temperature led to a decrease in the number of pirates around the world. Does that mean that the decrease in the number of pirates caused the climate change?

 

Q18. While working on a data set, how do you select important variables? Explain your methods.

 

Q19. What is the difference between covariance and correlation?
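
A short sketch for Q19, using made-up height and weight values: correlation is covariance divided by the product of the two standard deviations, so rescaling a variable changes the covariance but leaves the correlation unchanged.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical data, used only to illustrate the effect of units.
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weight_kg = [52.0, 58.0, 69.0, 75.0, 88.0]
height_cm = [h * 100 for h in height_m]  # same heights, different unit

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)

print(f"covariance:  metres={cov_m:.3f}  centimetres={cov_cm:.3f}")
print(f"correlation: metres={corr_m:.3f}  centimetres={corr_cm:.3f}")
```

Switching from metres to centimetres multiplies the covariance by 100, while the correlation stays the same and always lies between -1 and 1.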

 

Q20. Is it possible to capture the correlation between a continuous and a categorical variable? If yes, how?

 

Q21. Both being tree-based algorithms, how is random forest different from the gradient boosting algorithm (GBM)?

 

Q22. Running a binary classification tree algorithm is the easy part. Do you know how tree splitting takes place, i.e. how the tree decides which variable to split on at the root node and succeeding nodes?

 

Q23. You’ve built a random forest model with 10000 trees. You are delighted to get a training error of 0.00. But the validation error is 34.23. What is going on? Haven’t you trained your model perfectly?

 

Q24. You’ve got a data set to work on having p (no. of variables) > n (no. of observations). Why is OLS a bad option to work with? Which techniques would be best to use? Why?

 

Q25. What is a convex hull? (Hint: Think SVM)

 

Q26. We know that one hot encoding increases the dimensionality of a data set. But label encoding doesn’t. How?
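
The difference in Q26 can be seen on a tiny made-up example: label encoding keeps a single column of integer codes, while one hot encoding creates one new column per category level.

```python
# Hypothetical categorical variable with three levels.
colors = ["red", "green", "blue", "green", "red"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Label encoding: each level is mapped to an integer code, still one column.
label_index = {category: code for code, category in enumerate(categories)}
label_encoded = [label_index[c] for c in colors]

# One hot encoding: one 0/1 column per level, so three columns here.
one_hot_encoded = [[1 if c == category else 0 for category in categories]
                   for c in colors]

print("label encoded:", label_encoded)
print("one hot columns:", categories)
for row in one_hot_encoded:
    print(row)
```

Note that the integer codes from label encoding imply an ordering (blue < green < red here) that the original categories may not have.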

 

Q27. What cross validation technique would you use on time series data set? Is it k-fold or LOOCV?
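
One commonly used alternative for Q27 is forward chaining, where each fold trains only on observations that come before the test window, so the model never sees the future. The helper below is a minimal sketch; the function name `forward_chaining_splits` and the equal-sized windows are illustrative choices, not something the article prescribes.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs that respect time order.

    Fold k trains on the first k blocks and tests on block k + 1.
    """
    if n_folds < 1 or n_samples < n_folds + 1:
        raise ValueError("need at least n_folds + 1 samples")
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * block
        # The last fold absorbs any leftover observations.
        test_end = train_end + block if k < n_folds else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))


# Example: 12 time-ordered observations, 3 folds.
splits = list(forward_chaining_splits(n_samples=12, n_folds=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Plain k-fold or LOOCV would shuffle or interleave observations across time, which can leak future information into training on data with a trend.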

 

Q28. You are given a data set consisting of variables having more than 30% missing values. Let’s say, out of 50 variables, 8 variables have more than 30% missing values. How will you deal with them?

 

Q29. ‘People who bought this, also bought…’ recommendations seen on Amazon are a result of which algorithm?

 

Q30. What do you understand by Type I vs Type II error?

 

Q31. You are working on a classification problem. For validation purposes, you’ve randomly split the training data set into train and validation sets. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you are shocked to get poor test accuracy. What went wrong?

 

Q32. You have been asked to evaluate a regression model based on R², adjusted R² and tolerance. What will be your criteria?

 

Q33. In k-means or kNN, we use Euclidean distance to calculate the distance between nearest neighbors. Why not Manhattan distance?

 

Q34. Explain machine learning to me like a 5 year old.

 

Q35. I know that a linear regression model is generally evaluated using Adjusted R² or F value. How would you evaluate a logistic regression model?

 

Q36. Considering the long list of machine learning algorithms, given a data set, how do you decide which one to use?

 

Q37. Would you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?

 

Q38. When does regularization become necessary in machine learning?

 

Q39. What do you understand by the bias-variance trade-off?

 

Q40. OLS is to linear regression what maximum likelihood is to logistic regression. Explain the statement.

 

End Notes

You might have been able to answer all the questions, but the real value is in understanding them and generalizing your knowledge to similar questions. If you struggled with these questions, no worries: now is the time to learn, not to perform. Right now, you should focus on learning these topics thoroughly.

These questions are meant to give you broad exposure to the types of questions asked at startups in machine learning. I’m sure these questions will leave you curious enough to research these topics more deeply on your own. If you are planning to do so, that’s a good sign.

Did you like reading this article? Have you recently interviewed at a startup for a data scientist role? Do share your experience in the comments below. I’d love to hear about it.

Looking for a job in analytics? Check out currently hiring jobs in machine learning and data science.

Analytics Vidhya Content team

Responses From Readers

kavitha

thank you so much manish

Gianni

Thank you Manish, very helpfull to face on the true reality that a long long journey wait me :-)

Prof Ravi Vadlamani

Good collection compiled by you Mr Manish ! Kudos ! I am sure it will be very useful to the budding data scientists whether they face start-ups or established firms.
