By implementing cutting-edge technology like artificial intelligence (AI) and machine learning, businesses are attempting to increase the accessibility of information and services for consumers. These technologies are increasingly adopted in various business areas, including banking, finance, retail, manufacturing, and healthcare.
Roles in demand at organizations embracing AI include data scientists, artificial intelligence engineers, machine learning engineers, and data analysts. If you intend to apply for positions in this field, it is essential to know the types of machine learning interview questions that hiring managers could pose, because an ML interview demands rigorous preparation: in-depth knowledge of ML concepts and algorithms, technical and programming skills, and so on.
To help you streamline your efforts as you embark on this learning journey, I decided to start a series on the essential ML questions that one is expected to face during interviews. Each part will consist of 10 questions to provide brief and focused coverage of each topic. For the first part, I decided to deal with questions pertinent to Machine Learning and Statistics. This should provide you with sufficient background and revision material before your next interview. Over the remaining parts, I will deal with questions specific to Deep Learning, Computer Vision, NLP, Time Series Analysis, etc.
So if you are ready to start your dream career in ML, continue reading below to refresh your memory and add new knowledge to your existing know-how.
1. What are the Major Types of Machine Learning Algorithms?
Broadly, ML algorithms can be divided into three main categories:
A. Supervised Learning: These algorithms make predictions by inferring a function from labeled training data, i.e., the target variables are present.
If the target variable is continuous, the usual choice of algorithms is the various regression models (linear, quadratic, polynomial).
If the target variable is categorical, preferred algorithms include Logistic Regression, Naive Bayes, KNN, SVM, Decision Tree, Boosting Algorithms, Random Forest, etc.
B. Unsupervised Learning: These algorithms discover patterns or structure in a given set of data. The data for this purpose does not have any dependent variable or label to predict. Algorithms that fall into this category include Clustering Algorithms, Anomaly Detection, Latent Space Models, Singular Value Decomposition, Principal Component Analysis, etc.
C. Reinforcement Learning: These algorithms use a trial-and-error-based approach, and learning occurs based on the rewards received from the previous action.
Source: Experfy Insights
2. How can you Determine the Critical Variables from the Dataset you are Working with?
Several approaches can be used to select the essential variables from a dataset:
1. Identify and discard highly correlated variables before finalizing the important variables
2. Choose the variables based on the p-values obtained from hypothesis testing
3. Forward, backward and stepwise selection
4. Lasso Regression
5. Use Random Forest and select variables based on the feature importance plots
6. Select the top features based on their information gain with respect to the target
3. Explain Covariance and Correlation.
Covariance indicates the direction in which two random variables vary together: a positive value means they tend to increase together, and a negative value means one tends to decrease as the other increases. Its value can range from -∞ to +∞. The problem with covariance is that its magnitude depends on the units of the variables, so a change of scale in the data changes the covariance, and values are hard to compare across datasets without normalization.
Correlation is a statistical measure of how strongly two variables are linearly related. It is the covariance normalized by the standard deviations of the two variables, so its value ranges from -1 to +1 and is scale-independent.
Source: Experfy Insights
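To make the scale argument concrete, here is a minimal sketch in plain Python. The height and weight numbers are made up for illustration, and the helper names `covariance` and `correlation` are my own, not from any library. It computes the sample covariance and the Pearson correlation for the same heights expressed in metres and in centimetres.

```python
import math

def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Illustrative (made-up) data: heights in metres and weights in kilograms.
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weight_kg = [50, 58, 65, 72, 80]
height_cm = [h * 100 for h in height_m]  # same heights, different unit

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)

print(f"covariance:  metres={cov_m:.3f}  centimetres={cov_cm:.3f}")
print(f"correlation: metres={corr_m:.4f}  centimetres={corr_cm:.4f}")
```

Changing the unit multiplies the covariance by 100 but leaves the correlation untouched, which is why correlation is usually the easier number to compare across features.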
4. What is the “P” Value?
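The P-value is used to make the decision in a hypothesis test. It is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. Equivalently, it denotes the minimum significance level at which we can reject the null hypothesis. A lower P-value means stronger evidence against the null hypothesis, so we are more likely to reject it.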
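As a small worked example, the sketch below computes a two-sided p-value for a one-sample z-test using only the Python standard library. The scenario (a sample of 25 with mean 52, tested against a hypothesized mean of 50 with a known standard deviation of 5) and the function name `z_test_p_value` are assumptions chosen purely for illustration.

```python
import math
from statistics import NormalDist

def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-sided p-value for H0: population mean == mu0, with sigma known."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Probability of a statistic at least this extreme under H0, both tails.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical numbers, for illustration only.
z, p = z_test_p_value(sample_mean=52, mu0=50, sigma=5, n=25)
alpha = 0.05
print(f"z = {z:.2f}, p = {p:.4f}")
print("reject H0" if p < alpha else "fail to reject H0")
```

Here the p-value falls just below 0.05, so the null hypothesis would be rejected at the 5% level but not at the 1% level.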
5. What are Parametric and Non-parametric Models?
Parametric models have a fixed, finite number of parameters, and only the model's parameters are required to predict new data. Linear regression and logistic regression are examples.
Non-parametric models do not fix the number of parameters in advance; the effective number can grow with the amount of training data, allowing for more flexibility in predicting new data. To provide predictions, we need the model parameters and also the state of the observed data. KNN and decision trees are examples.
Tabular representation of the differences between Parametric and Non-parametric models
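One way to see the distinction in code is the rough sketch below, which uses toy data and helper names of my own choosing. It fits a straight line, a parametric model whose two numbers are all that prediction needs, and compares it with a k-nearest-neighbours regressor, a non-parametric model that has to keep the training data around.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (two parameters)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def knn_predict(x_new, xs, ys, k=2):
    """Average the targets of the k training points closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

# Toy data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

# Parametric: after fitting, the training data can be thrown away.
slope, intercept = fit_line(xs, ys)
line_prediction = slope * 6 + intercept

# Non-parametric: every prediction consults the stored training data.
knn_prediction = knn_predict(3.4, xs, ys, k=2)

print(slope, intercept, line_prediction, knn_prediction)
```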
6. What is the Difference between Sigmoid and Softmax functions?
The Sigmoid function is used for binary classification, where we have only two output classes, whereas the Softmax function is applied to multiclass problems. The inputs and outputs of the two functions therefore differ.
The sigmoid function receives a single input and outputs a single number between 0 and 1, representing the probability of belonging to the positive class; the probability of the other class is one minus that value.
The softmax function, in contrast, is vectorized, i.e., it receives a vector with as many entries as there are classes. The output vector contains the probability of belonging to each class, and its entries sum to 1.
Schematic Representation of the Activation Functions, Source: Nomidl
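Both functions are short enough to write out by hand. The following is a minimal sketch in plain Python, not tied to any particular framework; subtracting the maximum logit inside `softmax` is a common numerical-stability trick and does not change the result.

```python
import math

def sigmoid(z):
    """Map a single real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Map a vector of scores to a vector of probabilities that sums to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

score = 1.2                       # arbitrary example score
class_scores = [2.0, 1.0, 0.1]    # arbitrary example scores for three classes

p_positive = sigmoid(score)
class_probs = softmax(class_scores)
two_class = softmax([score, 0.0])  # softmax over two classes

print(p_positive, class_probs, two_class)
```

Note that softmax over the two scores `[z, 0]` gives the same probability for the first class as `sigmoid(z)`, so the sigmoid can be viewed as the two-class special case of softmax.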
7. How can the Normality of a Dataset be Determined?
The easiest way to check for normality is to plot the given data, for example as a histogram or a Q-Q plot. In addition, several formal normality tests exist, including:
Shapiro-Wilk Test
Anderson-Darling Test
Kolmogorov-Smirnov Test
Martinez-Iglewicz Test
D’Agostino Skewness Test
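To give a flavour of how such a test works, the sketch below computes the Kolmogorov-Smirnov statistic, i.e., the largest gap between the empirical distribution of a sample and a normal distribution fitted to it, using only the standard library. The two samples are synthetic and the function name is my own. Keep in mind that when the mean and standard deviation are estimated from the same sample, the standard K-S critical values no longer apply (the Lilliefors variant addresses this), so this sketch reports only the statistic, not a p-value.

```python
import math
from statistics import NormalDist, mean, stdev

def ks_statistic_vs_normal(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Synthetic samples: evenly spaced quantiles of a normal and of an
# exponential distribution (the latter is strongly right-skewed).
probs = [i / 100 for i in range(1, 100)]
normal_like = [NormalDist().inv_cdf(p) for p in probs]
skewed = [-math.log(1 - p) for p in probs]

d_normal = ks_statistic_vs_normal(normal_like)
d_skewed = ks_statistic_vs_normal(skewed)
print(f"normal-like sample: D = {d_normal:.3f}")
print(f"skewed sample:      D = {d_skewed:.3f}")
```

A small statistic means the sample sits close to the fitted normal curve; the skewed sample produces a much larger gap.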
8. How can the K-value be Selected for the K-means Clustering Algorithm?
The K value can be selected in two different ways: the Direct Method and the Statistical Testing Method.
1. Direct Method: This includes the elbow and silhouette methods
2. Statistical Testing Method: This includes the gap statistic
The silhouette method remains the most frequently used for determining the optimal K value.
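The sketch below illustrates the silhouette method on a tiny synthetic dataset with three obvious groups. It is a bare-bones, pure-Python k-means with a deterministic farthest-point initialization that I chose for reproducibility; the data and the helper names are assumptions for illustration, and in practice a library implementation would normally be used.

```python
import math

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point start."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(max_iter):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers

def inertia(points, labels, centers):
    """Within-cluster sum of squared distances (used by the elbow method)."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Synthetic data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]

inertias, silhouettes = {}, {}
for k in range(2, 6):
    labels, centers = kmeans(points, k)
    inertias[k] = inertia(points, labels, centers)
    silhouettes[k] = silhouette(points, labels)

best_k = max(silhouettes, key=silhouettes.get)
print("inertia per k:   ", inertias)
print("silhouette per k:", silhouettes)
print("best k by silhouette:", best_k)
```

The silhouette method picks the K with the highest mean score, while the elbow method looks for the K after which the inertia stops dropping sharply.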
9. How can you Handle Outliers in a Dataset?
Outliers are data points that differ significantly from the rest of the dataset. Approaches that can be used to discover outliers include the box plot, the Z-score, the scatter plot, etc.
Outliers can typically be handled with the following strategies:
1. The easiest way is to drop the outlier values
2. They can be separately marked as outliers and used as a different feature vector
3. The feature can alternatively be transformed to reduce the effect of the outlier
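The strategies above can be sketched with the interquartile-range rule that a box plot uses. The numbers are made up, and the 1.5 × IQR fences are the usual convention rather than a fixed rule. The sketch shows dropping the outlier, flagging it as a separate indicator, and clipping it to the fence as one simple form of transformation.

```python
from statistics import quantiles

# Made-up measurements with one obviously extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 95]

# Box-plot rule: anything beyond 1.5 * IQR from the quartiles is an outlier.
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]

# Strategy 1: drop the outliers.
cleaned = [v for v in values if lower <= v <= upper]

# Strategy 2: keep every row but mark outliers with an indicator feature.
is_outlier = [int(v < lower or v > upper) for v in values]

# Strategy 3: transform the feature, here by clipping to the fences.
clipped = [min(max(v, lower), upper) for v in values]

print("outliers:", outliers)
print("cleaned: ", cleaned)
print("flags:   ", is_outlier)
print("clipped: ", clipped)
```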
10. Explain the Differences between Loss and Cost Function.
The term loss function is used when dealing with a single data point, whereas the term cost function is used when the error is aggregated (summed or averaged) over multiple data points. In practice the two terms are often used interchangeably, since the cost is built directly from the per-point losses. Thus the loss function captures the difference between the actual and predicted values for a single data point, whereas the cost function aggregates that difference over the entire training data.
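A short sketch with squared error makes the distinction concrete. The targets and predictions are invented for illustration; whether the cost is defined as the sum or the mean of the losses is a matter of convention, so both are shown.

```python
def squared_loss(y_true, y_pred):
    """Loss: error for a single data point."""
    return (y_true - y_pred) ** 2

# Invented targets and predictions, for illustration only.
targets = [3.0, 5.0, 7.0]
predictions = [2.5, 5.0, 8.0]

# One loss value per data point.
losses = [squared_loss(t, p) for t, p in zip(targets, predictions)]

# Cost: the losses aggregated over the whole dataset.
total_cost = sum(losses)               # sum of squared errors
mean_cost = total_cost / len(losses)   # mean squared error (MSE)

print("per-point losses:", losses)
print("summed cost:", total_cost)
print("mean cost (MSE):", mean_cost)
```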
Conclusion on Machine Learning
Thus, in this first part of the series, we brushed up on the fundamental Machine Learning questions that one is expected to face. Knowing these thoroughly will be a boost to your preparation. To summarize, the key takeaways from this article are:
The different categories of machine learning: how and on what basis algorithms can be classified into supervised, unsupervised, and reinforcement learning.
Methods of determining the essential features of the data, including p-values and lasso regression, along with covariance and correlation and how to draw meaningful inferences from them.
The differences between parametric and non-parametric models.
The key differences between the sigmoid and softmax activation functions.
How to check whether a dataset is normally distributed, and the various tests for doing so.
How to select the K value for the K-means clustering algorithm.
Another critical factor affecting model performance, outliers, and the various ways you can handle them.
And finally, the differences between the cost and loss functions, two of the most common terms you might have used while developing your ML models.
These fundamental questions should be an excellent primer to build upon over the next few blogs to be followed. Stay tuned for the upcoming parts.
In part 2 of this series, I deal with Deep Learning and the essential aspects of DL. I hope this read adds something valuable to your existing technical know-how of Machine Learning!
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Advancing language model research by day and writing about my work online by night. I explore AI breakthroughs and transform complex studies into clear, engaging insights that empower professionals and enthusiasts alike.