AutoML (Automated Machine Learning) platforms are becoming increasingly popular because they automate the end-to-end process of applying machine learning. This brings the additional advantages of quicker, more straightforward solutions and models that quite often outperform hand-designed ones.
There are several paid and open-source AutoML platforms on the market, such as H2O, DataRobot, Google AutoML, TPOT, and Auto-Sklearn. All of them come with their pros and cons, and I won’t get into the debate over which one is the best. Instead, this article focuses on one of the latest features I observed in H2O AutoML: “Model Explainability”.
I also briefly explain various terms like SHAP Summary, Partial Dependence Plots, and Individual Conditional Expectation which, along with Variable importance, form the critical components of H2O AutoML’s model explainability interface.
Note: I have no personal bias towards H2O. It just happens to be the one I generally use.
We cannot introduce AutoML without mentioning the machine learning project’s life cycle, including data cleaning, feature selection/engineering, model selection, parameter optimization, and finally, model validation. Even with the latest technology advancements, traditional data science projects still incorporate many manual processes and remain repetitive and time-consuming.
AutoML was introduced to automate the entire process from data cleaning to parameter optimization. It provides enormous value for machine learning projects in terms of both performance and time-saving.
We use the well-known Telco Customer Churn dataset from Kaggle to walk through the explainability interface. The dataset has a mix of numeric and categorical variables, and our variable of interest is ‘Churn’, which identifies customers who left within the last month. We use the dataset in its raw format, as our focus is on explaining the model and not on model performance.
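For readers who want to reproduce the plots discussed below, here is a minimal sketch of how the explainability interface can be invoked from Python. The file path, the 80/20 split, and the `max_models` and `seed` values are my own assumptions for illustration, not the settings used for the plots in this article. The H2O imports sit inside the function so that defining it does not require a running H2O cluster.

```python
def run_automl_explain(csv_path, target="Churn", max_models=10, seed=1):
    """Train H2O AutoML on a churn CSV and render the explainability plots.

    csv_path, max_models and seed are illustrative assumptions.
    """
    # Imported lazily: H2O (and Java) are only needed when the function is called.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O cluster

    data = h2o.import_file(csv_path)
    # Treat the target as categorical so AutoML runs a classification task.
    data[target] = data[target].asfactor()

    # Hold out part of the data; the explanations are computed on this frame.
    train, test = data.split_frame(ratios=[0.8], seed=seed)

    # The data is used in raw form: every column except the target is a predictor.
    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(y=target, training_frame=train)

    # Explanations across all AutoML models (heatmaps, multi-model PD plots, ...)
    aml.explain(test)
    # Explanations for the single best model (variable importance, SHAP summary, ...)
    aml.leader.explain(test)
    return aml
```

Calling `run_automl_explain("path/to/telco_churn.csv")` (a hypothetical path) in a notebook should render the plots inline. The individual plots are also available as separate methods, for example `aml.varimp_heatmap()` or `aml.leader.shap_summary_plot(test)`.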
This plot shows the relative importance of the most important variables in the model. H2O displays each feature’s importance after scaling between 0 and 1.
Variable importance is calculated from the relative influence of each variable, mainly for tree-based models like Random Forest: whether that variable was picked to split on while building the trees, and how much the squared error (over all trees) improved (decreased) as a result.
It is straightforward to interpret this graph. The variable with the longest bar (the topmost one) is the most important, and the one with the shortest bar (the bottom-most one) is the least important.
In this case, ‘Contract’ is the most important variable, while ‘MultipleLines’ is the least important one when it comes to predicting churn.
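To make the 0-to-1 scaling concrete, here is a small sketch. It assumes the scaling simply divides every raw importance by the largest one, so the top variable gets 1.0. The raw numbers are made up for illustration and are not taken from the model in this article.

```python
def scale_importance(relative_importance):
    """Scale raw importances so that the most important variable gets 1.0."""
    top = max(relative_importance.values())
    scaled = {name: value / top for name, value in relative_importance.items()}
    # Sort from most to least important, the order used in the bar chart.
    return dict(sorted(scaled.items(), key=lambda item: item[1], reverse=True))


# Hypothetical raw importances (illustration only).
raw = {"MultipleLines": 5.0, "Contract": 200.0, "tenure": 120.0, "MonthlyCharges": 80.0}
scaled = scale_importance(raw)

for name, value in scaled.items():
    print(f"{name:15s} {value:.3f}")
```

The longest bar in the plot corresponds to the first entry here, and every other bar is read as a fraction of it.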
The variable importance heatmap shows variable importance across multiple models. By default, the models and variables are ordered by their similarity.
We can see that all the GBM models are grouped together on the x-axis, and the same is true of the Deep Learning models. The darker (redder) the color, the higher the importance of the variable for the corresponding model.
i.e. the ‘Contract’ variable is highly important for all the GBM models, while it is not important for any Deep Learning model. Similarly, ‘PaymentMethod’ is not important for most GBM models, while it is somewhat important for all Deep Learning models.
This plot shows the correlation between the model predictions. For classification tasks, the frequency of identical predictions is used. By default, models are ordered by their similarity (as measured by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit are highlighted using red-colored text.
We can see that models belonging to the same family have strong correlations with each other. The darker (redder) the color, the stronger the correlation between the two models.
i.e. GBM models have a much stronger correlation among themselves than with Deep Learning models. The GLM model seems to be correlated more with the Deep Learning models than with the GBM ones.
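For classification, the "frequency of identical predictions" mentioned above can be sketched in a few lines: for each pair of models, count the share of rows on which they predict the same class. The model names and predicted labels below are hypothetical, and the sketch leaves out the hierarchical clustering H2O uses to order the models.

```python
def agreement(preds_a, preds_b):
    """Fraction of rows on which two models predict the same class."""
    same = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return same / len(preds_a)


def agreement_matrix(predictions):
    """Pairwise agreement between all models, as a nested dict."""
    names = list(predictions)
    return {a: {b: agreement(predictions[a], predictions[b]) for b in names}
            for a in names}


# Hypothetical predicted labels for eight customers (illustration only).
predictions = {
    "GBM_1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"],
    "GBM_2": ["Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No"],
    "DeepLearning_1": ["No", "No", "Yes", "Yes", "No", "Yes", "No", "No"],
}
matrix = agreement_matrix(predictions)

for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

In this toy data the two GBM models agree with each other more often than either agrees with the Deep Learning model, which is the pattern the heatmap shows as darker cells within a model family.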
A SHAP value (SHAP is an acronym for SHapley Additive exPlanations) interprets the impact of having a particular value for a given variable, compared to the prediction we would make if that variable took some baseline value instead.
We can see that a higher monthly charge is associated with a large, positive (supporting) impact on churn. Here, ‘high’ comes from the color (red), and ‘large positive’ comes from the x value (a long tail to the right of the base axis at 0.0).
i.e. customers with higher monthly charges tend to churn more.
On the other hand, having online security is associated with a medium, negative (opposing) impact on churn. Here, ‘having’ comes from the color (red is yes and blue is no), and ‘medium negative’ comes from the x value (a medium tail to the left of the base axis at 0.0).
i.e. customers who have opted for online security tend to churn less.
However, the impact of a high monthly charge is more substantial than that of having online security, as we can see from the spread of the SHAP values.
Please refer to this Medium article for a simple yet clear interpretation of SHAP values.
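To make the "compared to a baseline value" idea concrete, here is a toy sketch that computes exact Shapley values by brute force: it averages each feature's marginal contribution over every order in which features can be switched from the baseline to their actual values. The churn score function, its coefficients, and the customer and baseline values are all hypothetical. This is not how H2O computes SHAP for its tree models; it only illustrates the definition.

```python
from itertools import permutations


def churn_score(x):
    """Hypothetical churn score, used only to illustrate the computation."""
    return (0.1
            + 0.006 * x["MonthlyCharges"]
            - 0.08 * x["OnlineSecurity"]
            - 0.0005 * x["MonthlyCharges"] * x["OnlineSecurity"])


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every feature ordering."""
    features = list(instance)
    orderings = list(permutations(features))
    phi = {name: 0.0 for name in features}
    for order in orderings:
        current = dict(baseline)            # start with every feature at its baseline
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch this feature to its actual value
            value = model(current)
            phi[name] += value - previous   # marginal contribution in this ordering
            previous = value
    return {name: total / len(orderings) for name, total in phi.items()}


# Hypothetical customer and baseline (illustration only).
instance = {"MonthlyCharges": 100.0, "OnlineSecurity": 1}
baseline = {"MonthlyCharges": 60.0, "OnlineSecurity": 0}
phi = shapley_values(churn_score, instance, baseline)

for name, value in phi.items():
    print(f"{name:15s} {value:+.3f}")
print("sum of contributions:", round(sum(phi.values()), 6))
print("score minus baseline:", round(churn_score(instance) - churn_score(baseline), 6))
```

The contributions always add up to the difference between the customer's score and the baseline score. In this toy setup the high monthly charge pushes the score up and online security pushes it down, mirroring the direction of the effects in the summary plot.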
While variable importance shows which variables affect predictions the most, partial dependence (PD) plots show how a variable affects predictions. For those familiar with linear or logistic regression models, PD plots can be interpreted similarly to the coefficients in those models. The effect of a variable is measured as the change in the mean response. A PD plot assumes independence between the feature for which it is computed and the rest.
This is useful for answering questions such as how the predicted churn rate changes as customer tenure increases.
We can see a downward trend in the above graph, indicating a drop in churn rate associated with a longer tenure.
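The "change in the mean response" can be computed by hand: for each value on a grid, force the feature to that value for every row, predict, and average the predictions. Below is a minimal sketch with a hypothetical model and made-up rows; the coefficients are chosen only so that the curve trends downward with tenure, as in the plot.

```python
def partial_dependence(model, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite the feature for every row, keep all other columns as they are.
        preds = [model({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


def churn_probability(row):
    """Hypothetical model: churn falls with tenure and rises with monthly charges."""
    score = 0.6 - 0.008 * row["tenure"] + 0.002 * row["MonthlyCharges"]
    return min(1.0, max(0.0, score))


# Hypothetical customers (illustration only).
rows = [
    {"tenure": 2, "MonthlyCharges": 90.0},
    {"tenure": 24, "MonthlyCharges": 55.0},
    {"tenure": 60, "MonthlyCharges": 20.0},
]
grid = [0, 12, 24, 36, 48, 60, 72]
pd_curve = partial_dependence(churn_probability, rows, "tenure", grid)

for tenure, mean_pred in zip(grid, pd_curve):
    print(f"tenure={tenure:2d}  mean predicted churn={mean_pred:.3f}")
```

Because every row keeps its own values for the other columns while tenure is overwritten, the method can create unrealistic combinations when features are correlated. That is the independence assumption mentioned above.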
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of a variable’s marginal influence on the response. ICE plots are similar to partial dependence plots (PDP): a PDP shows the average effect of a variable, while an ICE plot shows the effect for a single instance. This function will plot the effect for each decile.
While PDPs are simple and easy to understand, this simplicity hides potentially interesting relationships among individual instances. For example, if the predictions for one subset of instances trend negative as the feature increases while those for another subset trend positive, the averaging process might cancel them out.
ICE plots solve this problem. An ICE plot unpacks the PDP curve, which is the result of the averaging process. Instead of averaging the predictions, each ICE curve shows how the prediction for one instance changes as the feature value varies. When presented together in a single plot, the curves show relationships between subsets of the instances and differences in how individual instances behave.
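The cancelling-out effect is easy to reproduce with a toy example. In the sketch below, a hypothetical model gives feature `x` a positive effect for one group and a negative effect for another. Each ICE curve has a clear slope, yet their average, which is the PDP, is flat. The model, the rows, and the grid are all made up for illustration.

```python
def ice_curves(model, rows, feature, grid):
    """One curve per row: predictions as `feature` is varied over the grid."""
    return [[model({**row, feature: value}) for value in grid] for row in rows]


def average_curve(curves):
    """Point-wise mean of the ICE curves, which is the partial dependence curve."""
    return [sum(points) / len(points) for points in zip(*curves)]


def toy_model(row):
    """Hypothetical model: the effect of x reverses sign depending on the group."""
    slope = 0.05 if row["group"] == "A" else -0.05
    return 0.5 + slope * row["x"]


# Hypothetical instances (illustration only).
rows = [
    {"group": "A", "x": 1.0},
    {"group": "A", "x": 3.0},
    {"group": "B", "x": 2.0},
    {"group": "B", "x": 4.0},
]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]

curves = ice_curves(toy_model, rows, "x", grid)
pdp = average_curve(curves)

for row, curve in zip(rows, curves):
    print(row["group"], [round(p, 2) for p in curve])
print("PDP", [round(p, 2) for p in pdp])
```

Looking only at the PDP here, one would conclude that `x` has no effect at all, while the individual curves show two strong and opposite effects.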
Unfortunately, H2O didn’t create ICE plots for this dataset (not sure why!). So I use an ICE plot from another use case to explain.
i.e. the predicted survival for most of the passengers drops as age increases. However, for several passengers the predictions move in the opposite direction.
In this article, we explored different features of H2O’s model explainability interface, which appears promising but is still in its infancy, as we have few options to customize the visuals. Still, it is commendable that AutoML platforms are coming up with model explainability features that will help them gradually shed the ‘Black Box’ tag. I hope they will keep developing the platforms to accommodate more features of this kind.
It is to be noted that the onus is on us data guys to properly interpret these results and utilize them appropriately for the intended data science use cases.
As always, feedback and comments are most welcome 🙂
Please feel free to stay connected with me on LinkedIn.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.