This article was published as a part of the Data Science Blogathon
Linear Regression, a supervised technique, is one of the simplest Machine Learning algorithms. It is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables.
Therefore, it becomes necessary for every aspiring Data Scientist and Machine Learning Engineer to have a good knowledge of the Linear Regression algorithm.
In this article, we will discuss the most important questions on the Linear Regression algorithm, ranging from its fundamentals to more complex concepts. They will help you build a clear understanding of the algorithm and prepare for Data Science interviews.
In simple terms: It is a method of finding the best straight line that fits the given dataset, i.e. it tries to find the best linear relationship between the independent and dependent variables.
In technical terms: It is a supervised machine learning algorithm that finds the best linear-fit relationship between the independent and dependent variables in the given dataset. It is mostly done by minimizing the sum of squared residuals, a method known as Ordinary Least Squares (OLS).
As we know, the linear regression model is of the form:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$
The significance of the linear regression model lies in the fact that we can easily interpret and understand marginal changes in the independent variables (predictors) and observe their consequences on the dependent variable (response).
Therefore, a linear regression model is quite easy to interpret.
For example, if we increase the value of x1 by 1 unit, keeping the other variables constant, then the total increase in the value of y will be β1. The intercept term (β0) is the response when all the predictor terms are set to zero or not considered.
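The interpretation above can be checked numerically. The following is a minimal sketch, assuming hypothetical noise-free data generated from y = 2 + 3·x1 − x2 (the data, the helper `predict` and all variable names are illustrative and not from the original article):

```python
import numpy as np

# Hypothetical noise-free data generated from y = 2 + 3*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept β0
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def predict(x1, x2):
    """Prediction of the fitted model for a single (x1, x2) pair."""
    return beta[0] + beta[1] * x1 + beta[2] * x2

# Raising x1 by one unit while holding x2 fixed changes y by β1
effect_of_x1 = predict(1.5, 0.7) - predict(0.5, 0.7)

# With all predictors set to zero, the prediction is the intercept β0
at_zero = predict(0.0, 0.0)
```

Here `effect_of_x1` recovers the coefficient of x1 and `at_zero` recovers the intercept, which is the marginal-change reading described above.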
The basic assumptions of the Linear regression algorithm are as follows:
Now, let’s break these assumptions into different categories:
It is assumed that there exists a linear relationship between the dependent and the independent variables. Sometimes, this assumption is known as the ‘linearity assumption’.
Correlation: It measures the strength or degree of the relationship between two variables. It doesn’t capture causality. It is represented by a single value.
Regression: It measures how one variable affects another variable. Regression is all about model fitting. It tries to describe cause and effect, although a fitted model on its own does not prove causality. It is visualized by a regression line.
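To make the contrast concrete, here is a small sketch on hypothetical data (the data and variable names are assumptions for illustration). It shows that correlation is one symmetric number, while the regression line depends on which variable is treated as the response:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 4.0 * x + rng.normal(scale=2.0, size=200)  # hypothetical data

# Correlation: a single symmetric number in [-1, 1]
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]

# Regression: a fitted line that depends on which variable is the response
slope_y_on_x, intercept_y_on_x = np.polyfit(x, y, deg=1)
slope_x_on_y, intercept_x_on_y = np.polyfit(y, x, deg=1)

# For simple linear regression, the slope is r scaled by the ratio of spreads
slope_from_r = r_xy * y.std() / x.std()
```

Swapping x and y leaves the correlation unchanged but gives a different fitted line.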
Gradient descent is a first-order optimization algorithm. In linear regression, it is used to minimize the cost function, i.e. to find the values of the βs (estimators) at which the cost function reaches its minimum.
The working of gradient descent is similar to a ball rolling down a graph (ignoring inertia). The ball moves in the direction of steepest descent, that is, opposite to the gradient, and comes to rest on the flat surface that corresponds to the minimum.
Mathematically, the main objective of gradient descent for linear regression is to find the solution of the following expression:
ArgMin J(θ0, θ1), where J(θ0, θ1) represents the cost function of the linear regression. It is given by:
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h(x_i) - y_i\right)^2$
Here, h is the linear hypothesis model, defined as h = θ0 + θ1x,
y is the target column or output, and m is the number of data points in the training set.
Step-1: Gradient descent starts with a random solution.
Step-2: Based on the direction of the gradient, the solution is updated to a new value where the cost function is lower.
The updated value of each parameter is given by the formula:
$\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$, where α is the learning rate.
Repeat until convergence (i.e. until the loss function reaches its minimum).
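The two steps can be sketched in a few lines of NumPy. This is an illustrative sketch, not the article's own implementation: it assumes the common 1/(2m) scaling of the cost, a learning rate `alpha` and a fixed number of iterations in place of a convergence check, and all names and values are hypothetical.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(θ0, θ1) = (1 / 2m) * Σ (h(x_i) - y_i)^2 with h(x) = θ0 + θ1 * x."""
    m = len(x)
    residuals = theta0 + theta1 * x - y
    return (residuals ** 2).sum() / (2 * m)

def gradient_descent(x, y, alpha=0.5, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta0, theta1 = rng.normal(size=2)       # Step 1: random starting solution
    m = len(x)
    history = [cost(theta0, theta1, x, y)]
    for _ in range(n_iters):
        residuals = theta0 + theta1 * x - y
        grad0 = residuals.sum() / m           # partial derivative of J w.r.t. θ0
        grad1 = (residuals * x).sum() / m     # partial derivative of J w.r.t. θ1
        # Step 2: update both parameters against the gradient
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
        history.append(cost(theta0, theta1, x, y))
    return theta0, theta1, history

# Hypothetical noise-free data from y = 1 + 2x
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x
theta0, theta1, history = gradient_descent(x, y)
```

On this toy data the cost in `history` shrinks at every step and the parameters approach θ0 = 1 and θ1 = 2. With a learning rate that is too large, the updates would overshoot and diverge instead.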
Generally, a scatter plot is used to see whether linear regression is suitable for the given data. If the relationship looks roughly linear, we can go for a linear model. Plotting scatter plots is easy in the case of simple (univariate) linear regression.
But if we have more than one independent variable, i.e. the case of multivariate linear regression, then two-dimensional pairwise scatter plots, rotating plots, and dynamic graphs can be plotted to judge the suitability.
If, on the contrary, the relationship is not linear, we have to apply some transformations to make it linear.
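As one example of such a transformation, here is a sketch that assumes a hypothetical exponential relationship y = 2·exp(0.5·x). Taking the logarithm of y turns it into a straight line in x, which an ordinary linear fit can then recover:

```python
import numpy as np

# Hypothetical exponential relationship: y = 2 * exp(0.5 * x)
x = np.linspace(0.0, 5.0, 30)
y = 2.0 * np.exp(0.5 * x)

# log(y) = log(2) + 0.5 * x is exactly linear in x
slope, intercept = np.polyfit(x, np.log(y), deg=1)

# Map the fitted line back to the original parameters
a_hat, b_hat = np.exp(intercept), slope
```

Which transformation is appropriate (log, square root, reciprocal, and so on) depends on the shape seen in the scatter plot.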
Mainly, there are five metrics that are commonly used to evaluate the regression models:
The Q-Q plot is a graphical plot of the quantiles of two distributions against each other. In simple words, we plot quantiles against quantiles, and in linear regression the Q-Q plot is used to check the normality of the errors.
Whenever we interpret a Q-Q plot, we should concentrate on the ‘y = x’ line, along which the points fall when the errors are normally distributed. This line is sometimes also known as the 45-degree line.
Points on this line imply that the two distributions have the same quantiles. If you see a deviation from this line, one of the distributions could be skewed when compared to the other, i.e. to the normal distribution.
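The points of a normal Q-Q plot can be computed by hand, which makes the idea less of a black box. The sketch below uses assumed names and data, and the plotting positions (i − 0.5)/n are one common convention among several. It compares normal and skewed "residuals" against the 45-degree line:

```python
import numpy as np
from statistics import NormalDist

def normal_qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    # Plotting positions (i - 0.5) / n, one common convention among several
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    # Standardize so that normal residuals fall close to the y = x line
    sample = (r - r.mean()) / r.std(ddof=1)
    return theoretical, sample

rng = np.random.default_rng(0)
normal_res = rng.normal(size=500)        # hypothetical well-behaved residuals
skewed_res = rng.exponential(size=500)   # hypothetical skewed residuals

t_norm, s_norm = normal_qq_points(normal_res)
t_skew, s_skew = normal_qq_points(skewed_res)

# Mean absolute distance from the 45-degree line, a rough numerical summary
dev_normal = np.abs(s_norm - t_norm).mean()
dev_skewed = np.abs(s_skew - t_skew).mean()
```

Plotting `s_norm` against `t_norm` gives points that hug the y = x line, while the skewed sample bends away from it, which is what `dev_skewed > dev_normal` summarizes.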
The sum of the residuals in a linear regression model is 0 because the model assumes that the errors (residuals) are normally distributed with an expected value (mean) equal to 0, i.e.
$Y = \beta^T X + \varepsilon$
Here, Y is the dependent variable or the target column, and β is the vector of the estimates of the regression coefficients,
X is the feature matrix containing all the features as columns, and ε is the residual term such that $\varepsilon \sim N(0, \sigma^2)$.
Moreover, the sum of all the residuals equals the expected value of the residuals times the total number of observations in the dataset. Since the expectation of the residuals is 0, the sum of all the residual terms is zero. More precisely, when the model includes an intercept term, the OLS fit forces the fitted residuals to sum to exactly zero, whatever the error distribution.
Note: N(μ, σ²) is the standard notation for a normal distribution with mean μ and variance σ².
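A quick numerical check of this property, as a sketch on hypothetical data (the names and numbers are illustrative). It also shows that the residuals of an OLS fit sum to zero only when the model contains an intercept column:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=100)  # hypothetical data

# OLS fit with an intercept column of ones
X_with = np.column_stack([np.ones_like(x), x])
beta_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)
resid_with = y - X_with @ beta_with

# OLS fit forced through the origin (no intercept)
X_without = x.reshape(-1, 1)
beta_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)
resid_without = y - X_without @ beta_without

sum_with = resid_with.sum()        # zero up to floating-point error
sum_without = resid_without.sum()  # generally not zero
```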
RMSE and MSE are two of the most common measures of accuracy for linear regression.
MSE (Mean Squared Error) is defined as the average of all the squared errors(residuals) for all data points. In simple words, we can say it is an average of squared differences between predicted and actual values.
RMSE (Root Mean Squared Error) is the square root of the average of squared differences between predicted and actual values.
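RMSE stands for Root Mean Square Error, which is given by the formula:
$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
MSE stands for Mean Square Error, which is given by the formula:
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Here, y_i is the actual value, ŷ_i is the predicted value, and n is the number of data points.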
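Both metrics follow directly from their definitions. A minimal worked example with made-up actual and predicted values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical predictions

errors = y_true - y_pred          # [0.5, 0.0, -1.5, -1.0]
mse = np.mean(errors ** 2)        # (0.25 + 0 + 2.25 + 1) / 4 = 0.875
rmse = np.sqrt(mse)               # same units as y
```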
The increase in RMSE is larger than the increase in MAE as the test sample size increases. In general, as the variance of the error magnitudes increases, MAE remains steady but RMSE increases.
OLS stands for Ordinary Least Squares. The main objective of the linear regression algorithm is to find the coefficients (estimates) by minimizing the error term, i.e. the sum of squared errors. This process is known as OLS.
This method finds the best-fit line, known as the regression line, by minimizing the sum of squared differences between the observed and predicted values.
MAE stands for Mean Absolute Error, which is defined as the average of absolute or positive errors of all values. In simple words, we can say MAE is an average of absolute or positive differences between predicted values and the actual values.
MAPE stands for Mean Absolute Percent Error, which calculates the average absolute error in percentage terms. In simple words, it can be understood as the percentage average of absolute or positive errors.
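Written out, MAE is the mean of |y_i − ŷ_i| and MAPE is 100 times the mean of |y_i − ŷ_i| / |y_i|. A small worked example with made-up values:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 50.0, 400.0])   # hypothetical actual values
y_pred = np.array([110.0, 180.0, 55.0, 400.0])   # hypothetical predictions

abs_errors = np.abs(y_true - y_pred)                  # [10, 20, 5, 0]
mae = abs_errors.mean()                               # 35 / 4 = 8.75
mape = 100.0 * np.mean(abs_errors / np.abs(y_true))   # (0.1 + 0.1 + 0.1 + 0) / 4 * 100 = 7.5
```

Note that MAPE is undefined whenever an actual value is zero, so it is best suited to strictly positive targets.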
This question can be understood as asking why one should prefer the squared error over the absolute error.
1. In fact, the absolute error is often closer to what we want when making predictions from our model. But if we want to penalize more heavily those predictions that contribute the largest errors, the squared error is the better choice.
2. Moreover, in mathematical terms, the squared function is differentiable everywhere, while the absolute error is not differentiable at every point in its domain (its derivative is undefined at 0). This makes the squared error preferable for mathematical optimization techniques. To optimize the squared error, we can compute the derivative, set it equal to 0, and solve. But to optimize the absolute error, we require more complex techniques involving more computation.
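A tiny illustration of both points, using a hypothetical sample and a brute-force grid search (an illustrative choice, not how either loss is optimized in practice). For a constant prediction c, setting the derivative of the squared error to zero gives c = mean, whereas the absolute error is minimized at the median:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # hypothetical sample with one outlier
candidates = np.linspace(0.0, 100.0, 10001)     # candidate constant predictions c, step 0.01

# Total squared and absolute error for every candidate c
sse = ((data[:, None] - candidates[None, :]) ** 2).sum(axis=0)
sae = np.abs(data[:, None] - candidates[None, :]).sum(axis=0)

best_sq = candidates[sse.argmin()]    # lands at the mean, 22.0
best_abs = candidates[sae.argmin()]   # lands at the median, 3.0
```

The outlier drags the squared-error solution far more than the absolute-error one, which is the "heavier penalty on large errors" effect from point 1.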
3. Actually, we use the Root Mean Squared Error instead of Mean squared error so that the unit of RMSE and the dependent variable are equal and results are interpretable.
There are mainly two methods used for linear regression:
1. Ordinary Least Squares (Statistics domain):
To implement this in Scikit-learn, we use the LinearRegression() class.
2. Gradient Descent (Calculus family):
To implement this in Scikit-learn, we use the SGDRegressor() class.
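A side-by-side sketch of the two classes on hypothetical noise-free data. The data and the `SGDRegressor` settings (`alpha`, `max_iter`, `tol`, `random_state`) are illustrative assumptions, not recommendations from the article:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                 # features already on a unit scale
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]       # hypothetical noise-free target

# 1. Ordinary Least Squares: solved directly, no learning rate or iterations
ols = LinearRegression().fit(X, y)

# 2. Stochastic gradient descent: iterative, so it needs stopping settings.
#    A tiny alpha makes the default L2 penalty negligible for this comparison.
sgd = SGDRegressor(alpha=1e-8, max_iter=5000, tol=1e-8, random_state=0).fit(X, y)

sgd_intercept = float(np.ravel(sgd.intercept_)[0])
```

Both models expose `coef_` and `intercept_`. On well-scaled data like this they should agree closely, but SGD is sensitive to feature scaling, so standardizing the features first is usually advisable.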
The normal equation for linear regression is:
$\beta = (X^T X)^{-1} X^T Y$
This is also known as the closed-form solution for a linear regression model.
where,
$Y = \beta^T X$ is the equation that represents the model for the linear regression,
Y is the dependent variable or target column,
β is the vector of the estimates of the regression coefficient, which is arrived at using the normal equation,
X is the feature matrix that contains all the features in the form of columns. Note that the first column of the X matrix consists of all 1s, to incorporate the offset (intercept) value for the regression line.
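The closed-form solution translates almost directly into NumPy. The sketch below uses hypothetical data and coefficients, and it solves the linear system (XᵀX)β = XᵀY instead of forming the inverse explicitly, which is a numerically safer but equivalent choice:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
true_beta = np.array([4.0, 1.0, -2.0, 0.5])   # hypothetical [β0, β1, β2, β3]

# First column of ones carries the offset (intercept) term
X = np.column_stack([np.ones(len(features)), features])
Y = X @ true_beta

# Normal equation: solve (X^T X) β = X^T Y for β
beta = np.linalg.solve(X.T @ X, X.T @ Y)
```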
To answer the given question, let’s first understand the difference between the Normal equation and Gradient descent method for linear regression:
where,
‘k’ represents the maximum number of iterations used for the gradient descent algorithm, and
‘n’ is the total number of observations present in the training dataset.
Clearly, if we have large training data, the normal equation is not preferred because of its very high time complexity, but for small values of ‘n’, the normal equation is faster than gradient descent.
R-squared (R²), also known as the coefficient of determination, measures the proportion of the variation in the dependent variable (Y) that is explained by the independent variables (X) in a linear regression model.
The main problem with R-squared is that it always stays the same or increases as we add more independent variables. To overcome this problem, adjusted R-squared comes into the picture: it penalizes the addition of independent variables that do not improve the existing model.
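Both quantities can be computed from the residuals. The sketch below uses hypothetical data and a helper `r2_scores` introduced only for illustration. It applies the usual formulas R² = 1 − SS_res / SS_tot and adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors, and then adds a pure-noise predictor:

```python
import numpy as np

def r2_scores(X, y):
    """Return (R², adjusted R²) for an OLS fit with intercept on X."""
    n, p = X.shape
    X_design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    ss_res = ((y - X_design @ beta) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj_r2

rng = np.random.default_rng(0)
n = 60
x_useful = rng.normal(size=(n, 1))
y = 2.0 + 1.5 * x_useful[:, 0] + rng.normal(size=n)   # hypothetical data
x_noise = rng.normal(size=(n, 1))                      # unrelated to y

r2_small, adj_small = r2_scores(x_useful, y)
r2_big, adj_big = r2_scores(np.column_stack([x_useful, x_noise]), y)
```

Adding the noise column cannot lower R², whereas adjusted R² rises only if the new column reduces the residual error by more than the penalty for the extra parameter.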
There are two major flaws of R-squared:
Problem-1: As we add more and more predictors, R² never decreases, irrespective of the impact of the predictor on the model. As a result, the model can always appear to be a better fit the more independent variables (predictors) we add to it. This can be completely misleading.
Problem-2: Similarly, if our model has too many independent variables and too many high-order polynomial terms, we can also face the problem of over-fitting the data. Whenever the data is over-fitted, it can lead to a misleadingly high R² value, which eventually can lead to misleading predictions.
It is a phenomenon where two or more independent variables (predictors) are highly correlated with each other, i.e. one variable can be linearly predicted with the help of the other variables. It describes the inter-correlations and inter-associations among the independent variables. Multicollinearity is sometimes also known as collinearity.
It refers to the situation where the variability of a variable (in regression, the dependent variable or, equivalently, the residuals) is unequal across the range of values of a second variable that predicts it.
To detect heteroscedasticity, we can use graphs or statistical tests such as the Breusch-Pagan test and the NCV (non-constant variance) test.
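The idea behind the Breusch-Pagan test can be sketched by hand: regress the squared OLS residuals on the predictors and use n·R² of that auxiliary regression as the test statistic. This is the studentized (Koenker) form; the helper name and data below are hypothetical, and in practice a statistics library would also report the p-value.

```python
import numpy as np

def breusch_pagan_lm(X, y):
    """LM statistic n * R² from regressing squared OLS residuals on X."""
    n = len(y)
    X_design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    sq_resid = (y - X_design @ beta) ** 2
    # Auxiliary regression of the squared residuals on the same predictors
    gamma, *_ = np.linalg.lstsq(X_design, sq_resid, rcond=None)
    ss_res = ((sq_resid - X_design @ gamma) ** 2).sum()
    ss_tot = ((sq_resid - sq_resid.mean()) ** 2).sum()
    return n * (1.0 - ss_res / ss_tot)

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=300)
y_homo = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=300)       # constant noise
y_hetero = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x, size=300)  # noise grows with x

lm_homo = breusch_pagan_lm(x, y_homo)
lm_hetero = breusch_pagan_lm(x, y_hetero)
```

With one predictor, the statistic is compared against a chi-square distribution with 1 degree of freedom, whose 5% critical value is about 3.84. A value well above that, as for `lm_hetero` here, points to heteroscedasticity.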
The main disadvantages of linear regression are as follows:
VIF stands for Variance inflation factor, which measures how much variance of an estimated regression coefficient is increased due to the presence of collinearity between the variables. It also determines how much multicollinearity exists in a particular regression model.
First, it fits an ordinary least squares regression that has Xi as a function of all the other explanatory (independent) variables, and then calculates the VIF using the following formula:
$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$, where R_i² is the R-squared of that auxiliary regression.
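The procedure can be sketched directly from that description. The helper `vif` and the data are hypothetical: each column is regressed on the remaining columns, and VIF_i = 1 / (1 − R_i²) is computed from the fit:

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R_i²), where R_i² comes from regressing column i on the rest."""
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        target = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = ((target - others @ coef) ** 2).sum()
        ss_tot = ((target - target.mean()) ** 2).sum()
        out[i] = ss_tot / ss_res      # equals 1 / (1 - R_i²)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)                    # independent of the others
vifs = vif(np.column_stack([x1, x2, x3]))
```

The two nearly collinear columns get very large VIFs, while the independent column stays close to 1. A common rule of thumb treats values above 5 or 10 as a sign of problematic multicollinearity.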
We can carry out hypothesis testing in linear regression for the following purposes:
1. To check whether an independent variable (predictor) is significant or not for the prediction of the target variable. Two common methods for this are:
If the p-value of a particular independent variable is greater than a certain threshold (usually 0.05), then that independent variable is insignificant for the prediction of the target variable.
If the value of the regression coefficient corresponding to a particular independent variable is zero, then that variable is insignificant for the predictions of the dependent variable and has no linear relationship with it.
2. To verify whether the regression coefficients calculated by the linear regression algorithm are good estimators of the actual coefficients.
Yes, we can apply a linear regression algorithm to analyze time series data, but the results are not promising, and hence it is not advisable to do so.
The reasons why linear regression is not preferred for time-series data are as follows:
Thanks for reading!
I hope you enjoyed the questions and were able to test your knowledge about Linear Regression Algorithm.
Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology Jodhpur(IITJ). I am very enthusiastic about Machine learning, Deep Learning, and Artificial Intelligence.
The media shown in this article on Sign Language Recognition are not owned by Analytics Vidhya and are used at the Author’s discretion.