Machine learning models aim to understand patterns within data, enabling predictions, answers to questions, or a deeper understanding of concealed patterns. This iterative learning process involves the model acquiring patterns, testing against new data, adjusting parameters, and repeating until achieving satisfactory performance. The evaluation phase, essential for regression problems, employs loss functions. As a data scientist, it’s crucial to monitor regression metrics like mean squared error and R-squared to ensure the model doesn’t overfit the training data. Libraries like scikit-learn provide tools to train and evaluate regression models, helping data scientists build effective solutions.
This article was published as a part of the Data Science Blogathon.
Loss functions compare the model’s predicted values with actual values, gauging its efficacy in mapping the relationship between X (feature) and Y (target). The loss, indicating the disparity between predicted and actual values, guides model refinement. A higher loss denotes poorer performance, demanding adjustments for optimal training.
Selecting an appropriate loss function hinges on various factors such as the algorithm, data outliers, and the need for differentiability. With many options available, each with distinct properties, there is no universal solution. This article comprehensively lists regression loss functions, outlining their advantages and drawbacks. Implementable across various libraries, the code examples use NumPy to enhance the underlying mechanisms’ transparency.
Let’s delve into the world of regression loss functions without delay.
Here is a list of 13 evaluation metrics
Mean absolute error, or L1 loss, stands out as one of the simplest and easily comprehensible loss functions and evaluation metrics. It computes by averaging the absolute differences between predicted and actual values across the dataset. Mathematically, it represents the arithmetic mean of absolute errors, focusing solely on their magnitude, irrespective of direction. A lower MAE indicates superior model accuracy.
MAE formula is:
where
Python Code:
import numpy as np
def mean_absolute_error(true, pred):
"""
Calculates the Mean Absolute Error (MAE) between the true and predicted values.
Args:
true (numpy.ndarray): An array of true values.
pred (numpy.ndarray): An array of predicted values.
Returns:
float: The Mean Absolute Error.
"""
mae = np.mean(np.abs(true - pred))
return mae
In “Mean Bias Error,” bias reflects the tendency of a measurement process to overestimate or underestimate a parameter. It has a single direction, positive or negative. Positive bias implies an overestimated error, while negative bias implies an underestimated error. Mean Bias Error (MBE) calculates the mean difference between predicted and actual values, quantifying overall bias without considering absolute values. Similar to MAE, MBE differs in not taking the absolute value. Caution is needed with MBE, as positive and negative errors can cancel each other out.
The formula for MBE:
def mean_bias_error(true, pred):
bias_error = true - pred
mbe_loss = np.mean(np.sum(diff) / true.size)
return mbe_loss
Relative root mean square error Absolute Error is calculated by dividing the total absolute error by the absolute difference between the mean and the actual value. The formula for RAE is:
where y_bar is the mean of the n actual values.
RAE measures the performance of a predictive model and is expressed in terms of a ratio. The value of RAE can range from zero to one. A good model will have values close to zero, with zero being the best value. This error shows how the mean residual relates to the mean deviation of the target function from its mean.
def relative_absolute_error(true, pred):
true_mean = np.mean(true)
squared_error_num = np.sum(np.abs(true - pred))
squared_error_den = np.sum(np.abs(true - true_mean))
rae_loss = squared_error_num / squared_error_den
return rae_loss
Calculate Mean Absolute Percentage Error (MAPE) by dividing the absolute difference between the actual and predicted values by the actual value. This absolute percentage is averaged across the dataset. MAPE, also known as Mean Absolute Percentage Deviation (MAPD), increases linearly with error. Lower MAPE values indicate better model performance.
def mean_absolute_percentage_error(true, pred):
abs_error = (np.abs(true - pred)) / true
sum_abs_error = np.sum(abs_error)
mape_loss = (sum_abs_error / true.size) * 100
return mape_loss
MSE is one of the most common regression loss functions and an important error metric. In Mean Squared Error, also known as L2 loss, we calculate the error by squaring the difference between the predicted value and actual value and averaging it across the dataset.
MSE is also known as Quadratic loss as the penalty is not proportional to the error but to the square of the error. Squaring the error gives higher weight to the outliers, which results in a smooth gradient for small errors.
Optimization algorithms benefit from this penalization for large errors as it helps find the optimum values for parameters using the least squares method. MSE will never be negative since the errors are squared. The value of the error ranges from zero to infinity. MSE increases exponentially with an increase in error. A good model will have an MSE value closer to zero, indicating a better goodness of fit to the data.
def mean_squared_error(true, pred):
squared_error = np.square(true - pred)
sum_squared_error = np.sum(squared_error)
mse_loss = sum_squared_error / true.size
return mse_loss
Root Mean Square Error in Machine Learning (RMSE) is a popular metric used in machine learning and statistics to measure the accuracy of a predictive model. It quantifies the differences between predicted values and actual values, squaring the errors, taking the mean, and then finding the square root. RMSE provides a clear understanding of the model’s performance, with lower values indicating better predictive accuracy relative root mean square error.
It is computed by taking the square root of MSE. RMSE is also called the Root Mean Square Deviation. It measures the average magnitude of the errors and is concerned with the deviations from the actual value. RMSE value with zero indicates that the model has a perfect fit. The lower the RMSE, the better the model and its predictions. A higher relative root mean square error in machine learning indicates that there is a large deviation from the residual to the ground truth. RMSE can be used with different features as it helps in figuring out if the feature is improving the model’s prediction or not.
def root_mean_squared_error(true, pred):
squared_error = np.square(true - pred)
sum_squared_error = np.sum(squared_error)
rmse_loss = np.sqrt(sum_squared_error / true.size)
return rmse_loss
To calculate Relative Squared Error, you take the Mean Squared Error (MSE) and divide it by the square of the difference between the actual and the mean of the data. In other words, we divide the MSE of our model by the MSE of a model that uses the mean as the predicted value.
def relative_squared_error(true, pred):
true_mean = np.mean(true)
squared_error_num = np.sum(np.square(true - pred))
squared_error_den = np.sum(np.square(true - true_mean))
rse_loss = squared_error_num / squared_error_den
return rse_loss
The output value of RSE is expressed in terms of ratio. It can range from zero to one. A good model should have a value close to zero while a model with a value greater than 1 is not reasonable.
The Normalized RMSE is generally computed by dividing a scalar value. It can be in different ways like,
# implementation of NRMSE with standard deviation
def normalized_root_mean_squared_error(true, pred):
squared_error = np.square((true - pred))
sum_squared_error = np.sum(squared_error)
rmse = np.sqrt(sum_squared_error / true.size)
nrmse_loss = rmse/np.std(pred)
return nrmse_loss
Opting for the interquartile range can be the most suitable choice, especially when dealing with outliers. NRMSE proves effective for comparing models with different dependent variables or when modifications like log transformation or standardization occur. This metric addresses scale-dependency issues, facilitating comparisons across models of varying scales or datasets.
Relative Root Mean Squared Error (RRMSE) is a variant of Root Mean Square Error in Machine Learning (RMSE), gauging predictive model accuracy relative to the target variable range. It normalizes RMSE by the target variable range and presents it as a percentage for easy cross-dataset or cross-variable comparison. RRMSE, a dimensionless form of RMSE, scales residuals against actual values, allowing comparison of different measurement techniques.
def relative_root_mean_squared_error(true, pred):
num = np.sum(np.square(true - pred))
den = np.sum(np.square(pred))
squared_error = num/den
rrmse_loss = np.sqrt(squared_error)
return rrmse_loss
Root Mean Squared Logarithmic Error is calculated by applying log to the actual and the predicted values and then taking their differences. RMSLE is robust to outliers where the small and the large errors are treated evenly.
It penalizes the model more if the predicted value is less than the actual value while the model is less penalized if the predicted value is more than the actual value. It does not penalize high errors due to the log. Hence the model has a larger penalty for underestimation than overestimation. This can be helpful in situations where we are not bothered by overestimation but underestimation is not acceptable.
def root_mean_squared_log_error(true, pred):
square_error = np.square((np.log(true + 1) - np.log(pred + 1)))
mean_square_log_error = np.mean(square_error)
rmsle_loss = np.sqrt(mean_square_log_error)
return rmsle_loss
What if you want a function that learns about the outliers as well as ignores them? Well, Huber loss is the one for you. Huber loss is a combination of both linear and quadratic scoring methods. It has a hyperparameter delta (𝛿) which can be tuned according to the data. The loss will be linear (L1 loss) for values above delta and quadratic (L2 loss) for values below it. It balances and combines good properties of both MAE (Mean Absolute Error) and MSE (Mean Squared Error).
In other words, for loss values less than delta, MSE will be used and for loss values greater than delta, MAE will be used. The choice of delta (𝛿) is extremely critical because it defines our choice of the outlier. Huber loss reduces the weight we put on outliers for larger loss values by using MAE while for smaller loss values it maintains a quadratic function using MSE.
def huber_loss(true, pred, delta):
huber_mse = 0.5 * np.square(true - pred)
huber_mae = delta * (np.abs(true - pred) - 0.5 * (np.square(delta)))
return np.where(np.abs(true - pred) <= delta, huber_mse, huber_mae)
Log cosh calculates the logarithm of the hyperbolic cosine of the error. This function is smoother than quadratic loss. It works like MSE but is not affected by large prediction errors. It is quite similar to Huber loss in the sense that it is a combination of both linear and quadratic scoring methods.
def log_cosh(true, pred):
logcosh = np.log(np.cosh(pred - true))
logcosh_loss = np.sum(logcosh)
return logcosh_loss
The quantile regression loss function is applied to predict quantiles. The quantile is the value that determines how many values in the group fall below or above a certain limit. It estimates the conditional median or quantile of the response (dependent) variables across values of the predictor (independent) variables. The loss function is an extension of MAE except for the 50th percentile, where it is MAE. It provides prediction intervals even for residuals with non-constant variance and it does not assume a particular parametric distribution for the response.
γ represents the required quantile. The quantile values are selected based on how we want to weigh the positive and the negative errors. Unlike the squared difference loss used in linear regression models, this loss function is based on absolute differences.
In the loss function above, γ has a value between 0 and 1. When there is an underestimation, the first part of the formula will dominate and for overestimation, the second part will dominate. The chosen value of quantile(γ) gives different penalties for over-prediction and under prediction. When γ = 0.5, underestimation and overestimation are penalized by the same factor, and the median is obtained. When the value of γ is larger, overestimation is penalized more than underestimation. For example, when γ = 0.75 the model will penalize overestimation and it will cost three times as much as underestimation. Optimization algorithms based on gradient descent learn from the quantiles instead of the mean.
𝛾 represents the required quantile. The quantiles values are selected based on how we want to weigh the positive and the negative errors.
In the loss function above, 𝛾 has a value between 0 and 1. When there is an underestimation, the first part of the formula will dominate and for overestimation, the second part will dominate. The chosen value of quantile(𝛾) gives different penalties for over-prediction and under prediction. When 𝛾 = 0.5, underestimation and overestimation are penalized by the same factor and the median is obtained. When the value of 𝛾 is larger, overestimation is penalized more than underestimation. For example, when 𝛾 = 0.75 the model will penalize overestimation and it will cost three times as much as underestimation. Optimization algorithms based on gradient descent learn from the quantiles instead of the mean.
def quantile_loss(true, pred, gamma):
val1 = gamma * np.abs(true - pred)
val2 = (1-gamma) * np.abs(true - pred)
q_loss = np.where(true >= pred, val1, val2)
return q_loss
This comprehensive guide navigated through diverse regression loss functions, shedding light on their applications, advantages, and drawbacks. The article demystified complex metrics like MAE, MBE, RAE, MAPE, MSE, RMSE (the root mean squared error), RSE, NRMSE, RRMSE, and RMSLE, and introduced specialized losses like Huber, Log Cosh, and Quantile. It emphasized the nuanced factors influencing loss function selection, from algorithm types to outlier handling. Additionally, it covered the coefficient of determination (R-squared), and r2_score function from sklearn.metrics import, and adjusted r-squared, which are important evaluation metrics for assessing the performance of machine learning algorithms in regression problems.
Thank you for reading down here! I hope this article was helpful in your learning journey. I would love to hear in the comments about any other loss functions that I have missed. Happy Evaluating!
A. The four evaluation metrics are accuracy, precision, recall, and F1 score.
A. Evaluation metrics and scores measure a model’s performance, quantifying its ability to make correct predictions.
A. Evaluation metrics are important because they provide insights into a model’s effectiveness, guiding improvements and ensuring reliable results.
A. In NLP, evaluation metrics assess models based on metrics like BLEU, ROUGE, perplexity, and word error rate, reflecting their linguistic accuracy and coherence.
This is an excellent article. I feel it was very well laid out, structured, and easy to understand. One questions on the Python formula for Relative Root Mean Squared Error (RRMSE), is it missing the division by n?
I read through to the end and it was very educative. Thank you. Question: If you were to choose an evaluator for the comparison of predictions from multiple linear and nonlinear models trained using the same data with small outliers, which top three evaluators will you choose and why?
Thanks for the useful resource. Just want to let you know that you missed 1/n part in the Root Mean Squared Logarithmic Error (RMSLE).