Machine Learning, a prominent part of Artificial Intelligence, is currently one of the most sought-after skills in data science. If you are a data scientist, you need to be good at Python, SQL, and machine learning – no two ways about it. As part of DataFest 2017, we organized various skill tests so that data scientists could assess themselves on these critical skills. These tests covered Machine Learning, Deep Learning, Time Series problems, and Probability. This article lays out the solutions to the machine learning skill test questions, along with other important data science interview questions, so you can see how such machine learning exam questions and answers are framed and why they matter.
More than 1350 people registered for the Machine Learning skill test. The test was designed to test your conceptual knowledge in data science and machine learning. This article gives you a chance to test yourself in case you missed the real-time test. Here are the leaderboard rankings for all the participants in the Machine Learning Skilltest. You can also check out our online training in machine learning.
The skill test covers important data science topics, such as unsupervised and supervised learning, reinforcement learning, Bayes theorem, k-means clustering, recommender systems, linear regression, logistic regression, random forest, and more. It also quizzes you on Statistics concepts such as Normal distribution, p-value, etc.
These data science questions, along with hundreds of others, are part of our ‘Ace Data Science Interviews‘ course. It’s a comprehensive guide, with tons of resources, to crack data science interviews and land your dream job! And if you’re someone who’s just starting out on their data science journey, then do check out our most comprehensive program to master Machine Learning:
Take the test and find where you stand!
Below is the distribution of the overall scores that will help you evaluate your performance.
You can access the final scores here. More than 210 people participated in the machine learning skill test, and the highest score obtained was 36. Here are a few statistics about the distribution.
Mean Score: 19.36 | Median Score: 21 | Mode Score: 27
Now, let’s begin!
Which of the following statement is true in the following case?
A) Feature F1 is an example of a nominal variable.
B) Feature F1 is an example of an ordinal variable.
C) It doesn’t belong to any of the above categories.
D) Both of these
Solution: (B)
Ordinal variables are variables whose categories have some order. For example, grade A should be considered a higher grade than grade B.
A) PCA
B) K-Means clustering
C) KNN
D) None of the above
Solution: (A)
A deterministic algorithm is one whose output does not change across different runs. PCA gives the same result every time it is run on the same data, but k-means clustering does not, since its result depends on the random initialization of the cluster centers.
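A minimal sketch (using scikit-learn on made-up data) of this contrast: two PCA fits give identical projections, while two k-means fits with different random seeds can give different assignments.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# PCA: two independent fits on the same data give the same projection (deterministic).
p1 = PCA(n_components=2).fit_transform(X)
p2 = PCA(n_components=2).fit_transform(X)
print(np.allclose(p1, p2))  # True

# k-means: different random initializations can give different cluster assignments.
k1 = KMeans(n_clusters=3, n_init=1, random_state=1).fit_predict(X)
k2 = KMeans(n_clusters=3, n_init=1, random_state=2).fit_predict(X)
print((k1 == k2).all())  # often False (labels/assignments may differ)
```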
A) TRUE
B) FALSE
Solution: (A)
Y = X². Note that the two are not merely associated – one is a deterministic function of the other – and yet the Pearson correlation between them is 0.
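A small check with hypothetical data: for X symmetric around 0, Y = X² is perfectly determined by X, yet the Pearson correlation between X and Y is approximately 0 because there is no linear relationship.

```python
import numpy as np

X = np.linspace(-1, 1, 1001)    # symmetric around zero
Y = X ** 2                      # perfectly determined by X
print(np.corrcoef(X, Y)[0, 1])  # ~0 (no *linear* relationship)
```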
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3
Solution: (A)
In SGD, each iteration uses a batch that is generally a random sample of the data (in the extreme case, a single observation). In GD, each iteration uses all of the training observations.
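A minimal sketch (toy linear regression on made-up data) contrasting the two update rules: GD computes the gradient over the full training set each step, while SGD uses a single randomly chosen observation.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)

w_gd = np.zeros(3)
w_sgd = np.zeros(3)
lr = 0.1

for _ in range(1000):
    # Gradient descent: gradient over the full training set.
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)
    w_gd -= lr * grad

    # Stochastic gradient descent: gradient from one randomly chosen sample.
    i = rng.randint(len(y))
    grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
    w_sgd -= lr * grad_i

print(w_gd, w_sgd)  # both approach [1, -2, 0.5]
```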
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3
Solution: (B)
Usually, if we increase the depth of the trees, it will cause overfitting. The learning rate is not a hyperparameter in a random forest. An increase in the number of trees will cause underfitting.
Your data analysis is based on features like author name, number of articles written by the same author, etc. Which of the following evaluation metrics would you choose in that case?
A) Only 1
B) Only 2
C) Only 3
D) 1 and 3
E) 2 and 3
F) 1 and 2
Solution:(A)
You can think of the number of views of an article as a continuous target variable, which makes this a regression problem. So mean squared error is the appropriate evaluation metric.
A) 1 is tanh, 2 is ReLU, and 3 is SIGMOID activation functions.
B) 1 is SIGMOID, 2 is ReLU, and 3 is tanh activation functions.
C) 1 is ReLU, 2 is tanh, and 3 is SIGMOID activation functions.
D) 1 is tanh, 2 is SIGMOID, and 3 is ReLU activation functions.
Solution: (D)
The SIGMOID function has a range of (0, 1).
The tanh function has a range of (-1, 1).
The ReLU function has a range of [0, infinity).
So Option D is the right answer.
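A quick numerical check of the stated ranges (a small NumPy sketch): sigmoid stays in (0, 1), tanh in (-1, 1), and ReLU in [0, infinity).

```python
import numpy as np

x = np.linspace(-10, 10, 5)
sigmoid = 1 / (1 + np.exp(-x))
tanh = np.tanh(x)
relu = np.maximum(0, x)

print(sigmoid)  # values squeezed into (0, 1)
print(tanh)     # values squeezed into (-1, 1)
print(relu)     # negative inputs clipped to 0, positives pass through
```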
[0,0,0,1,1,1,1,1]
What is the entropy of the target variable?
A) -(5/8 log(5/8) + 3/8 log(3/8))
B) 5/8 log(5/8) + 3/8 log(3/8)
C) 3/8 log(5/8) + 5/8 log(3/8)
D) 5/8 log(3/8) – 3/8 log(5/8)
Solution: (A)
The formula for entropy is -Σ p(x) log p(x). Here P(target = 1) = 5/8 and P(target = 0) = 3/8, so the entropy is -(5/8 log(5/8) + 3/8 log(3/8)). So the answer is A.
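A minimal sketch computing this entropy directly from the target vector, using base-2 logarithms:

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
probs = np.array([np.mean(y == 0), np.mean(y == 1)])  # [3/8, 5/8]
entropy = -np.sum(probs * np.log2(probs))
print(entropy)  # -(5/8 log(5/8) + 3/8 log(3/8)) ≈ 0.954 bits
```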
You want to apply one hot encoding (OHE) on the categorical feature(s). What challenges may you face if you have applied OHE on a categorical variable of the training dataset?
A) All categories of the categorical variables are not present in the test dataset.
B) The frequency distribution of categories is different in the train compared to the test dataset.
C) Train and Test always have the same distribution.
D) Both A and B
E) None of these
Solution: (D)
Both are true. OHE will fail to encode categories that are present in the test set but not in the training set, so this is one of the main challenges of applying OHE. The challenge in option B is also real: you need to be more careful when applying OHE if the frequency distribution of categories differs between the train and test sets.
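A minimal sketch (with hypothetical column and category names) of challenge A: a category that appears only in the test set has no one-hot column learned from the training set, so the encoded train and test frames no longer line up.

```python
import pandas as pd

train = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi"]})
test = pd.DataFrame({"city": ["Mumbai", "Chennai"]})  # "Chennai" unseen in train

train_ohe = pd.get_dummies(train, columns=["city"])
test_ohe = pd.get_dummies(test, columns=["city"])

print(train_ohe.columns.tolist())  # ['city_Delhi', 'city_Mumbai']
print(test_ohe.columns.tolist())   # ['city_Chennai', 'city_Mumbai'] -- mismatch
```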
Which one of the following models depicts the skip-gram model?
A) A
B) B
C) Both A and B
D) None of these
Solution: (B)
Both models (model1 and model2) are used in the Word2vec algorithm. Model1 represents a CBOW model, whereas Model2 represents the Skip-gram model.
At a particular neuron for any given input, you get the output as “-0.0001”. Which of the following activation function could X represent?
A) ReLU
B) tanh
C) SIGMOID
D) None of these
Solution: (B)
The function is tanh, because its output range is (-1, 1); it is the only one of the listed activations that can produce a small negative value such as -0.0001.
A) TRUE
B) FALSE
Solution: (B)
Log loss cannot have negative values.
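A minimal sketch of why: each term of log loss is -log(p) for a probability p in (0, 1], which is always non-negative, so the averaged loss can never be negative.

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    p = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.99])
print(log_loss(y_true, y_prob))  # small positive number; never below 0
```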
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 1 and 3
F) 2 and 3
Solution: (E)
In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a “false positive”), while a type II error is incorrectly retaining a false null hypothesis (a “false negative”).
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1,2 and 3
Solution: (D)
Stemming is a rudimentary rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
Stop words are words that carry little relevance to the context of the data, for example is/am/are.
Object Standardization is also a good way to pre-process the text.
The two most famous dimensionality reduction algorithms used here are PCA and t-SNE. Let’s say you have applied both algorithms respectively on data “X”, and you got the datasets “X_projected_PCA” , “X_projected_tSNE”.
Which of the following statements is true for “X_projected_PCA” & “X_projected_tSNE”?
A) X_projected_PCA will have interpretation in the nearest neighbor space.
B) X_projected_tSNE will have interpretation in the nearest neighbor space.
C) Both will have interpretation in the nearest neighbor space.
D) None of them will have interpretation in the nearest neighbor space.
Solution: (B)
The t-SNE algorithm considers nearest-neighbor points while reducing the dimensionality of the data. So after using t-SNE, the reduced dimensions retain an interpretation in the nearest-neighbor space. This is not the case for PCA.
Context for Questions 16-17
Given below are three scatter plots for two features (Images 1, 2 & 3, from left to right).
A) Features in Image 1
B) Features in Image 2
C) Features in Image 3
D) Features in Images 1 & 2
E) Features in Images 2 & 3
F) Features in Images 3 & 1
Solution: (D)
In Image 1, features have a high positive correlation, whereas, in Image 2, there is a high negative correlation between the features. So in both images, the pair of features is an example of multicollinear features.
A) Only 1
B) Only 2
C) Only 3
D) Either 1 or 3
E) Either 2 or 3
Solution: (E)
You cannot remove both features, because then you will lose all of the information they carry. So you should either remove only one of the features or use a regularization method such as L1 or L2.
A) Only 1 is correct
B) Only 2 is correct
C) Either 1 or 2
D) None of these
Solution: (A)
After adding a feature to the feature space, whether that feature is important or not, the R-squared value does not decrease (it usually increases).
Now, you have added 2 in all values of X (i.e., new values become X+2), subtracting 2 from all values of Y (i.e., new values are Y-2), and Z remains the same. The new coefficients for (X,Y), (Y,Z), and (X,Z) are given by D1, D2 & D3, respectively. How do the values of D1, D2 & D3 relate to C1, C2 & C3?
A) D1= C1, D2 < C2, D3 > C3
B) D1 = C1, D2 > C2, D3 > C3
C) D1 = C1, D2 > C2, D3 < C3
D) D1 = C1, D2 < C2, D3 < C3
E) D1 = C1, D2 = C2, D3 = C3
F) Cannot be determined
Solution: (E)
Correlation between features does not change when you add or subtract a constant from a feature.
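A minimal sketch (made-up data) showing that the Pearson correlation is unchanged when a constant is added to or subtracted from a feature, matching D1 = C1, D2 = C2, D3 = C3.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100)
Y = 2 * X + rng.rand(100)
Z = rng.rand(100)

print(np.corrcoef(X, Y)[0, 1], np.corrcoef(X + 2, Y - 2)[0, 1])  # identical
print(np.corrcoef(Y, Z)[0, 1], np.corrcoef(Y - 2, Z)[0, 1])      # identical
print(np.corrcoef(X, Z)[0, 1], np.corrcoef(X + 2, Z)[0, 1])      # identical
```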
The majority class is observed 99% of the time in the training data. Your model has 99% accuracy after taking the predictions on the test set. Which of the following is true in such a case?
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
Solution: (A)
Refer to question number 4 in this article.
Which of the following statements is / are true for weak learners used in the ensemble model?
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) Only 1
E) Only 2
F) None of the above
Solution: (A)
Weak learners are confident only about a particular part of the problem. So they usually don't overfit, which means weak learners have low variance and high bias.
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1,2 and 3
Solution: (D)
A larger k-value means less bias towards overestimating the truly expected error (as training folds will be closer to the total dataset) and higher running time (as you are getting closer to the limit case: Leave-One-Out CV). We also need to consider the variance between the k folds accuracy while selecting the k.
Context for Questions 23-24
Cross-validation is an important step in machine learning for hyper parameter tuning. Let’s say you are tuning a hyper-parameter “max_depth” for GBM by selecting it from 10 different depth values (values are greater than 2) for a tree-based model using 5-fold cross-validation.
Time taken by an algorithm for training (on a model with max_depth 2) 4-fold is 10 seconds, and for the prediction on remaining 1-fold is 2 seconds.
Note: Ignore hardware dependencies from the equation.
A) Less than 100 seconds
B) 100 – 300 seconds
C) 300 – 600 seconds
D) More than or equal to 600 seconds
E) None of the above
F) Can’t estimate
Solution: (D)
Each fold for depth "2" in 5-fold cross-validation takes 10 seconds for training and 2 seconds for testing, so 5 folds take 12 * 5 = 60 seconds. Since we are searching over 10 depth values, the algorithm would take at least 60 * 10 = 600 seconds. But training and testing a model at a depth greater than 2 takes longer than at depth "2", so the overall time would be greater than 600 seconds.
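A minimal sketch of the lower-bound arithmetic: 5 folds, each costing (10 s train + 2 s predict) at max_depth = 2, over 10 candidate depth values.

```python
train_time, predict_time = 10, 2   # seconds, measured at max_depth = 2
folds, depth_values = 5, 10

lower_bound = folds * (train_time + predict_time) * depth_values
print(lower_bound)  # 600 seconds; deeper trees only add to this, so answer >= 600
```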
You want to select the right value against “max_depth” (from the given 10 depth values) and learning rate (from the given 5 different learning rates). In such cases, which of the following will represent the overall time?
A) 1000-1500 second
B) 1500-3000 Second
C) More than or equal to 3000 Second
D) None of these
Solution: (D)
Explanation is the same as above.
You want to choose a hyperparameter (H) based on the training error (TE) and validation error (VE).
H | TE  | VE
1 | 105 | 90
2 | 200 | 85
3 | 250 | 96
4 | 105 | 85
5 | 300 | 100
Which value of H will you choose based on the above table?
A) 1
B) 2
C) 3
D) 4
E) 5
Solution: (D)
Looking at the table, option D (H = 4) is the best choice: it has the lowest training error (105) together with the lowest validation error (85).
A) Transform data to zero mean
B) Transform data to zero median
C) Not possible
D) None of these
Solution: (A)
When the data has a zero mean vector, PCA will produce the same projections as SVD; otherwise, you have to center the data before taking the SVD.
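A minimal sketch (made-up data) showing that PCA projections match SVD projections once the data has been mean-centered; the components may differ only by a sign flip.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(50, 4)

pca_proj = PCA(n_components=2).fit_transform(X)

Xc = X - X.mean(axis=0)                 # center the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_proj = Xc @ Vt[:2].T                # project onto the top-2 right singular vectors

# Same projections up to a possible sign flip of each component.
print(np.allclose(np.abs(pca_proj), np.abs(svd_proj)))  # True
```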
Context for Questions 27-28
Assume there is a black box algorithm that takes training data with multiple observations (t1, t2, t3,…….. tn) and a new observation (q1).
The black box outputs the nearest neighbor of q1 (say ti) and its corresponding class label ci.
You can also think that this black box algorithm is the same as 1-NN (1-nearest neighbor).
Note: Where n (number of training observations) is very large compared to k.
A) TRUE
B) FALSE
Solution: (A)
In the first step, you pass an observation (q1) in the black box algorithm, so this algorithm returns the nearest observation and its class.
In the second step, you remove the nearest observation returned in the first step from the training data and input the observation (q1) again. The black box algorithm will again return a nearest observation and its class.
You need to repeat this procedure k times to collect the k nearest neighbors.
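A minimal sketch of this procedure, with a hypothetical helper `black_box_1nn` standing in for the black box: call it, drop the returned neighbor from the training set, repeat k times, then take a majority vote over the collected labels.

```python
from collections import Counter

def black_box_1nn(train, query):
    """Stand-in for the black box: returns (index, label) of the nearest point."""
    idx = min(train, key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i][0], query)))
    return idx, train[idx][1]

def knn_from_1nn(train, query, k):
    train = dict(train)                 # work on a copy we can shrink
    labels = []
    for _ in range(k):
        idx, label = black_box_1nn(train, query)
        labels.append(label)
        del train[idx]                  # remove the neighbor already found
    return Counter(labels).most_common(1)[0][0]

# Usage with made-up 2-D points: {index: (features, label)}
data = {0: ((1, 1), "A"), 1: ((1, 2), "A"), 2: ((5, 5), "B"), 3: ((6, 5), "B")}
print(knn_from_1nn(data, (1.5, 1.5), k=3))  # "A"
```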
A) 1
B) 2
C) 3
Solution: (A)
Explanation is the same as above.
Which of the following is in the right order?
Q30) Which of the following option is / are true for the interpretation of log loss as an evaluation metric?
A) 1 and 3
B) 2 and 3
C) 1 and 2
D) 1,2 and 3
Solution: (D)
Options are self-explanatory.
Context for Questions 31-32
Below are five samples given in the dataset.
Note: Visual distance between the points in the image represents the actual distance.
A) 0
B) 0.4
C) 0.8
D) 1
Solution: (C)
In Leave-One-Out cross-validation, we select (n-1) observations for training and 1 observation for validation. Treat each point in turn as the validation point and find its 3 nearest neighbors. If you repeat this procedure for all points, all the positive-class points in the figure are classified correctly, but the negative-class points are misclassified. Hence you will get 80% accuracy.
A) 1NN
B) 3NN
C) 4NN
D) All have the same leave-one-out error
Solution: (A)
With 1-NN, each point in this figure is misclassified under leave-one-out, which means 0% accuracy and hence the maximum leave-one-out error.
You are using logistic regression with L1 regularization.
Where C is the regularization parameter, and w1 & w2 are the coefficients of x1 and x2.
Which of the following option is correct when you increase the value of C from zero to a very large value?
A) First, w2 becomes zero, and then w1 becomes zero
B) First, w1 becomes zero, and then w2 becomes zero
C) Both become zero at the same time
D) Both cannot be zero even after a very large value of C
Solution: (B)
By looking at the image, we see that classification can be done efficiently using just x2. So w1 will become 0 first. As the regularization strength increases further, w2 will also come closer and closer to 0.
Now consider the points below and choose the option based on these points.
Note: All other hyper parameters are the same, and other factors are not affected.
A) Only 1
B) Only 2
C) Both 1 and 2
D) None of the above
Solution: (A)
Fitting a decision tree of depth 4 to such data is more likely to underfit. In the case of underfitting, you will have high bias and low variance.
A) 2 and 3
B) 1 and 3
C) 1 and 2
D) All of above
Solution: (D)
All of the options can be tuned to find the global minima.
You trained a model on the training dataset and got the below confusion matrix on the validation dataset.
Based on the above confusion matrix, choose which option(s) below will give you the correct predictions.
A) 1 and 3
B) 2 and 4
C) 1 and 4
D) 2 and 3
Solution: (C)
The accuracy (correct classification rate) is (50 + 100)/165, which is nearly equal to 0.91.
The true positive rate is how often you predict the positive class correctly, so the true positive rate is 100/105 ≈ 0.95; this is also known as "Sensitivity" or "Recall".
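A minimal sketch reproducing the two numbers quoted above, with the confusion-matrix counts inferred from the explanation (TP = 100, FN = 5, TN = 50, FP = 10, total = 165).

```python
TP, FN, TN, FP = 100, 5, 50, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)            # true positive rate / sensitivity

print(round(accuracy, 2))  # 0.91
print(round(recall, 2))    # 0.95
```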
A)1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
E) Can’t say
Solution: (E)
For all three options, increasing the value of the parameter does not necessarily improve performance. For example, if we use a very high value for the depth of the tree, the resulting tree may overfit the data and would not generalize well. On the other hand, if we use a very low value, the tree may underfit the data. So we can't say for sure that "higher is better."
Context for Questions 38-39
Imagine you have a 28 * 28 image, and you run a 3 * 3 convolution neural network on it with an input depth of 3 and an output depth of 8.
Note: Stride is 1, and you are using the same padding.
A) 28 width, 28 height, and 8 depth
B) 13 width, 13 height, and 8 depth
C) 28 width, 13 height, and 8 depth
D) 13 width, 28 height, and 8 depth
Solution: (A)
The formula for calculating output size is (N – F + 2P)/S + 1,
where N is the input size, F is the filter size, P is the padding, and S is the stride. With "same" padding, the spatial output size equals the input size, and the output depth equals the number of filters, so here the output is 28 x 28 x 8.
Read this article to get a better understanding.
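A minimal sketch of this formula as a helper function; the padding term is the assumption noted above, and P = 1 corresponds to "same" padding for a 3 x 3 filter with stride 1.

```python
def conv_output_size(n, f, stride=1, padding=0):
    # out = (N - F + 2P) / S + 1
    return (n - f + 2 * padding) // stride + 1

print(conv_output_size(28, 3, stride=1, padding=1))  # 28 ("same" padding)
print(conv_output_size(28, 3, stride=1, padding=0))  # 26 ("valid" padding)
```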
A) 28 width, 28 height, and 8 depth
B) 13 width, 13 height, and 8 depth
C) 28 width, 13 height, and 8 depth
D) 13 width, 28 height, and 8 depth
Solution: (B)
Explanation is the same as above.
Due to some reason, we forgot to tag the C values with visualizations. In that case, which of the following option best explains the C values for the images below (1,2,3 left to right, so C values are C1 for image1, C2 for image2, and C3 for image3 ) in the case of an rbf kernel?
A) C1 = C2 = C3
B) C1 > C2 > C3
C) C1 < C2 < C3
D) None of these
Solution: (C)
C is the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly. For large values of C, the optimization will choose a smaller-margin hyperplane that tries to classify more training points correctly. Read here.
I hope these questions and answers helped you test your knowledge and maybe learn a thing or two about Python, machine learning, and deep learning. You can find all the information about our upcoming skill tests and other events here.
We hope this article gave you a clear understanding of these machine learning exam questions and answers, and that working through them helps you prepare for and clear your interviews.
Key Takeaways
Here are some more articles and tutorials if you wish to explore machine learning further.
Machine Learning basics for a newbie
Deep Learning vs. Machine Learning – the essential differences you need to know!
Applied Machine Learning Course
Introduction to Data Science Course
Ace Data Science Interviews Course
For question 25, wouldn't Occam's Razor suggest choosing option 2? It's giving the same VE, but with a lower hyperparameter value. Considering that we should keep our hyperparameters, and hence our model, simpler, wouldn't option 2 be the choice? Option 4 may be overfitting the training data.
Hi Amit, what you are saying is true, but here the hyperparameter H doesn't have any interpretation. So in such a case you should choose the one which has lower training and validation error and also a close match between them. Best! Ankit Gupta
Hi, why is the correct answer for question 28 "Not Possible"? For example, to construct a 6-NN classifier from a 2-NN one, we can perform 2-NN three times each with two previous results discarded. Therefore the correct answer here should be "J must be a proper factor of K". Am I missing something here?
Hi Quan, Thanks for noticing it. It was marked incorrectly
It would be interesting to add the option J < K. I think this can be a solution too. Thus, "J must be a proper factor of K" is not a strict condition; it is just a sub-case of J < K.