So you’ve built your machine learning model. You’ve even taken the next step – often one of the least spoken about – of putting your model into production (or model deployment). Great – you should be all set to impress your end-users and your clients.
But wait – as a data science leader, your role in the project isn’t over yet. The machine learning model you and your team created and deployed now needs to be monitored carefully. There are different ways to monitor a model after deployment, and we’ll discuss them in this article.
We will first quickly recap what we covered in the first three articles of this practical machine learning series. Then, we will understand why “auto-healing” in machine learning is a myth and why every professional should be aware of this. Finally, we will dive into two types of post-production monitoring and understand where and how to use each.
This is the final article of my four-article series focusing on the various components involved in successfully implementing a data science project.
In this series on Practical Machine Learning for Leaders, we have so far discussed:
Once the optimal end-to-end system is deployed, do we declare victory and move on? No! Not yet, at least.
In this fourth (and final) article in this series, we will discuss the various post-production monitoring and maintenance-related aspects that the data science delivery leader needs to plan for once the Machine Learning (ML)-powered end product is deployed. The adage “Getting to the top is difficult, staying there is even harder” is most applicable in such situations.
There is a popular and dangerously incorrect myth about machine learning models that they auto-heal.
In particular, the expectation is that a machine learning model will continuously and automatically identify where it makes mistakes, find optimal ways to rectify those mistakes, and incorporate those changes in the system, all with almost no human intervention.
The reality is that such ‘auto-heal’ is at best a far-fetched dream.
Only a handful of machine learning techniques today are capable of learning from their mistakes as they try to complete a task. These techniques typically fall under the umbrella of Reinforcement Learning (RL). Even in the RL paradigm, several of the model parameters are carefully hand-tuned by a human expert and updated periodically.
And even if we assume that we have plenty of such products deployed in real-life situations, the existing data architectures (read ‘data silos’) within the organizations have to be completely overhauled for the data to seamlessly flow from the customer-facing compute environment to the compute environment that is used for building the machine learning models.
So, it is safe to say that in today’s world, the “auto” in auto-healing is almost non-existent for all practical purposes.
Let us now see why machine learning systems need healing in the first place. There are several aspects of the data ecosystem that can have a significantly negative impact on the performance of the system. I have listed some of these below.
A typical machine learning model is trained on only a fraction – say 10% – of the possible universe of data. This is either because of the scarcity of appropriately labeled data or because of the computational constraints of training on massive amounts of data.
The choice of the machine learning model and the training strategy should provide generalizability to the remaining 90% of the data. But there will still be data samples within this pool where the model output is incorrect or less than optimal.
In all real-world deployments of machine learning solutions, there will be a subset of the input data which comes from a system that the data science team has little control over. When those systems change the input, the data science teams are not always kept in the loop – largely because of the inherent complexity of today’s data pipelines.
Simple changes in the input data, like a type change from ‘scalar’ to ‘list’, can be relatively easily detected through basic sanity checks. But there are a variety of changes which are difficult to catch, have a substantially detrimental impact on the output of the machine learning system and unfortunately are not uncommon.
Consider, for example, a system deployed to automatically control the air conditioning of a server room. The machine learning system would obviously take the ambient temperature as one of the inputs.
It is fair to assume that the temperature sensors are controlled by a different ecosystem which may decide to change the unit of temperature from Celsius to Fahrenheit without necessarily informing the machine learning system owner. This change in input will have a significant impact on the performance of the system with absolutely no run-time exception thrown.
As the systems get complex, it is almost impossible to anticipate all such likely changes beforehand to encode exhaustive exception handling.
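As a sketch of the kind of basic sanity check this still calls for, the snippet below (a minimal illustration with made-up readings; the threshold and baseline statistics are assumptions) keeps a statistical summary of a numeric input at training time and raises an alert when a recent batch drifts far from it – exactly what a silent Celsius-to-Fahrenheit switch would trigger:

```python
import math

def build_baseline(values):
    """Summarize a training-time input feature by its mean and std."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"mean": mean, "std": math.sqrt(var)}

def drift_alert(baseline, recent, z_threshold=3.0):
    """Flag when a recent batch mean deviates sharply from the training mean."""
    recent_mean = sum(recent) / len(recent)
    z = abs(recent_mean - baseline["mean"]) / (baseline["std"] or 1e-9)
    return z > z_threshold

# Training data was in Celsius (~20-25 C); a silent switch to Fahrenheit
# shifts readings to ~68-77 F and trips the alert.
baseline = build_baseline([20.5, 21.0, 22.3, 24.1, 23.2])
assert not drift_alert(baseline, [21.7, 22.9, 20.8])
assert drift_alert(baseline, [69.0, 71.5, 74.2])
```

This catches shifts in scale or units that throw no run-time exception; in practice one would monitor many such statistics per input feature.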
The landscape of just about every business is changing quite rapidly. Words like Dunzo, Doordash, and Zelle, which didn’t exist a few years ago (and would hence be just marked as ‘out-of-vocabulary’), have now become keywords with significant interpretations.
Uber, which used to be associated only with transportation, can now be interpreted as food-related as well. Whole Foods, which had nothing to do with Amazon just a few years ago, can now influence Amazon’s financial reporting.
Further along, food delivery, which is today probably associated predominantly with a bachelor-like lifestyle in India, may get associated with working-young-parents-lifestyle in the near future.
What these examples show is that as new business models emerge, existing businesses venture into adjacent spaces, mergers, and acquisitions happen, and the human interpretation of a particular activity may change over time. This dynamic nature of data and its interpretation has serious implications for our machine learning model.
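One cheap way to track this kind of vocabulary drift is to monitor the out-of-vocabulary (OOV) rate of incoming text against the training vocabulary; a sustained rise suggests the world has moved past the training data. A minimal sketch (the vocabulary and token stream are hypothetical):

```python
def oov_rate(tokens, vocab):
    """Fraction of incoming tokens never seen in the training vocabulary."""
    if not tokens:
        return 0.0
    unseen = sum(1 for t in tokens if t.lower() not in vocab)
    return unseen / len(tokens)

# Illustrative training-time vocabulary and a recent token stream
# containing new brand names the model has never seen.
train_vocab = {"uber", "ride", "payment", "food", "delivery"}
stream = ["uber", "dunzo", "doordash", "payment", "zelle"]

rate = oov_rate(stream, train_vocab)
assert abs(rate - 0.6) < 1e-9  # 3 of the 5 tokens are new brand names
```

An alert when this rate crosses a threshold is a simple, model-agnostic signal that the input distribution has shifted.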
One human capability that is vastly superior to today’s machines is to weave in seemingly disparate sources of information to form a complete context to interpret a data point.
Consider this example from the fin-tech industry:
If we know that a financial account is of a UK resident, then it is relatively easy for both the machine and the human expert to interpret the word “BP” to mean “Bill Payment”. But if the same account holder travels to India and has a financial transaction description that has the word “BP”, human experts can very easily infer from all the context available to them that BP here likely stands for “Bharat Petroleum”.
A machine may find it near impossible to do such context-based switching. And yet, this is not a corner case. As machine learning systems become more and more mainstream, they will be expected to mimic the context-aware human behavior.
While we continue to build systematic ways in which context can be codified into the machine learning systems, we need to build (semi-)automatic techniques to monitor trends in the input and output data.
If we cannot auto-heal, what can be done then? The next best thing to do is to continuously track the health of the machine learning model against a set of key indicators and generate specific event-based alerts.
The obvious follow-up questions are: what are these key indicators, and which events trigger an alert? These questions are addressed by the proactive model monitoring framework.
The key element of the monitoring framework is to identify which input samples deviate significantly from the patterns seen in the training data and then have those samples closely examined by a human expert.
Unfortunately, there is no universal way of identifying which patterns are most relevant. Patterns of interest largely depend on the domain of the data, the nature of the business problem, and the machine learning model being used.
For example, in the Natural Language Processing (NLP) domain, some of the simple patterns could be:
A slightly more advanced pattern could be based on the machine learning technique used. For example, again in the NLP domain, assume that a distributed representation was used to place every word in an L-dimensional space.
We can quantify the word distribution in the training data using modeling techniques like Gaussian Mixture Models (GMMs). Then, given a test data sample, we compute the likelihood of the sample under the GMM. All data samples with a likelihood lower than a certain threshold can be marked as ‘non-representative’ (i.e., anomalous) and sent to the domain experts for further investigation.
Even more sophisticated patterns for identifying test samples of interest can be devised based on the knowledge of the business problem, the specifics of the data, or the specifics of the machine learning machinery used.
For instance, any machine learning solution can be thought of as a combination of multiple elemental ML components. As an example, a machine learning model for intent mining in a conversational agent may consist of three ML modules:
During the training phase, we can identify the relative proportion of the paths traversed by different training samples through these three modules and the corresponding predicted outputs.
During the model monitoring phase, we can identify the samples that led to a particular output but the path traversed through the three modules wasn’t one of the paths observed during the training phase for that output.
Note that to achieve this level of pattern-based model monitoring, the end-to-end solution needs to have a robust logging mechanism.
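A minimal sketch of such path-based monitoring (the module names, paths, and intent labels are hypothetical) could log the (path, output) pairs observed during training and flag unseen combinations at serving time:

```python
from collections import defaultdict

class PathMonitor:
    """Track (module-path, predicted-output) pairs seen during training and
    flag novel combinations at serving time for expert review."""

    def __init__(self):
        self.seen = defaultdict(set)  # output label -> set of module paths

    def record_training(self, path, output):
        self.seen[output].add(path)

    def is_novel(self, path, output):
        return path not in self.seen[output]

monitor = PathMonitor()
# Hypothetical per-sample module decisions logged during training:
monitor.record_training(("lang:en", "intent:order", "slot:food"), "ORDER_FOOD")
monitor.record_training(("lang:en", "intent:order", "slot:cab"), "BOOK_RIDE")

# Same output reached via a path never seen in training -> flag for review.
assert not monitor.is_novel(("lang:en", "intent:order", "slot:food"), "ORDER_FOOD")
assert monitor.is_novel(("lang:hi", "intent:order", "slot:food"), "ORDER_FOOD")
```

This is exactly where the robust logging mentioned above pays off: without per-module logs, the paths cannot be reconstructed at all.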
After the successful deployment of a machine learning-driven solution, the data science team will almost always feel like they have earned bragging rights like “our system has state-of-the-art 99% accuracy!”.
But instinctively (and rightfully so), the first thing that the customer-facing teams will ask is “what is the plan to address customer escalations on the 1%?”.
This calls for reactive model monitoring, which performs root-cause analysis (RCA) of customer escalations and provides an estimate of when the bugs will be fixed.
Reactive model monitoring is quite similar to proactive model monitoring, but there are subtle differences in the end goals.
Whereas proactive model maintenance identifies general patterns in the test data that are outliers relative to the training data, the goal of reactive model maintenance is to identify what led to an erroneous output on a specific test sample and how it can be rectified.
Because such rectifications are tailored to individual samples, the data science team needs to be cautious when accepting them: a fix that corrects one escalated sample can be detrimental to a wide range of other data samples.
Some of the other challenging aspects of reactive model maintenance: some bugs can be resolved by a simple change in a config file, while others need elaborate retraining of the ML model. Also, some bugs fall within the tolerance threshold of a typical user, while others are what I call ‘publicity-hungry’ bugs.
A ‘publicity-hungry’ bug is any incorrect behavior of the machine learning system that would be totally unexpected coming from a human expert.
For instance, in an ML-powered conversational agent, in response to the user’s query of “I am tired”, if the agent responds with “Hello Mr. Tired, how are you?”, then that is sure to get a lot of tweets and retweets and similar publicity! Such publicity-hungry bugs need immediate resolution.
The Service Level Agreements (SLAs) will thus need to be carefully crafted keeping in mind the severity of the bug on one hand and the systemic changes needed on the other hand.
Given this wide variety of sources that can degrade the performance of ML systems over time, and the intense pressure to fix issues within a given SLA, it can be tempting to have a ‘thin layer of rules’ which bypasses the ML machinery completely to address the immediate customer escalation.
Such a thin-layer or hot-fix approach is actually a ‘lazy-fix’ which has the potential to turn disastrous in the long run. Thus, such a thin-layer of rules should be touched only under extreme conditions and should not be allowed to get beyond a certain ‘thickness’.
When the pre-defined ‘thickness’ is reached, our machine learning model has to be retrained to address the issues encoded in the thin-layer.
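One way to keep such a layer honest is to make its ‘thickness’ explicit in code: a capped store of hot-fix rules that signals retraining once the cap is reached. A minimal sketch (the cap value, rules, and model stub are all illustrative):

```python
class RuleLayer:
    """A capped 'thin layer' of hot-fix rules applied before the ML model.
    When the number of rules reaches max_rules, retraining is signalled."""

    def __init__(self, max_rules=10):
        self.rules = {}           # exact escalated input -> forced output
        self.max_rules = max_rules

    def add_hotfix(self, query, forced_output):
        self.rules[query] = forced_output

    def needs_retraining(self):
        return len(self.rules) >= self.max_rules

    def predict(self, query, model_fn):
        if query in self.rules:   # rules bypass the model only for escalated inputs
            return self.rules[query]
        return model_fn(query)

layer = RuleLayer(max_rules=2)
layer.add_hotfix("I am tired", "Sorry to hear that. Want to take a break?")
assert layer.predict("I am tired", lambda q: "Hello Mr. Tired!") != "Hello Mr. Tired!"
assert not layer.needs_retraining()
layer.add_hotfix("BP", "Bill Payment")
assert layer.needs_retraining()  # cap hit: fold the rules back via retraining
```

Making the cap a hard, monitored number turns the ‘lazy fix’ into a forcing function for proper retraining.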
To borrow an analogy from the medical domain: addressing symptoms may not need an expert but if that is routinely substituted for a thorough diagnosis, the situation can precipitate quite rapidly.
Just like accurate medical diagnosis comes from analysis of the patient’s history, proactive model maintenance has to be broad enough to quickly help identify the root cause of a customer escalation.
Retraining a machine learning model that is already deployed in a live production environment is much easier said than done. For one, there are multiple ways to solve a particular data-driven problem, and as we see more data our choice of the model may change.
Secondly, the data science team that built the original model and the team that is maintaining the model may not readily agree on the best way to retrain the model. Moreover, the team that built the original model may have tried out a wide variety of training strategies/modeling techniques before settling on one.
This information is typically not documented and hence model retraining may very well lead to a net drop in the accuracy.
To add to the mix, a lot of the time the end client may prefer receiving consistent output over a now-correct-but-earlier-incorrect output. Here is what I mean by this:
Say your original speech recognition system would confuse “Tim” with “Jim” about 80% of the time. The end client estimated this frequency of error and has included mechanisms in their downstream processing to try both ‘Tim’ and ‘Jim’ with an 80-20 proportion.
Suddenly, when the retrained speech recognition system reduces the Tim/Jim confusion to only 10%, the end customer may not readily agree to make the necessary (potentially non-trivial) changes on their end. The business teams and the customer-facing teams may, in such cases, make a decision that certain customers will continue to get the old speech recognition system while the other customers will be migrated to the newer one.
This means the data science teams will now have to maintain two models! This opens up a whole new area of discussion called ‘technical debt of machine learning models’. Consistency can trump accuracy.
Turns out “Be Consistent!” is just as great a motivating phrase for ML models as it is for humans! An area I would love to discuss more, but not in this series.
“What’s in a name?” – William Shakespeare
Finally, the general perception is that the phrases ‘model maintenance’ and ‘model monitoring’ sound ‘uncool’ compared to ‘model building’.
In contrast, what I have seen is that the level of data science maturity, depth of big data engineering, and business understanding needed for ‘model maintenance’ is an order of magnitude more than what is needed for ‘model building’.
I am always tempted to rebrand ‘model maintenance’ as ‘model nurturing’, particularly in light of the critical role maintenance and monitoring play in ensuring customer delight.
If you are in the tech industry, there is no escaping the buzz around Artificial Intelligence, Machine Learning, Data Science and related keywords. I genuinely believe that all this focus on data-driven technologies will help bring in substantial efficiency in existing processes and help conquer new tech frontiers which have long been elusive.
However, the general expectations from these technologies are dangerously unrealistic, largely fed by the popular imagination of sci-fi literature. Part of it is also affirmed by what we see in some of the low-stakes consumer-AI applications.
When executive decision-makers set such expectations of their data science groups, they inadvertently ignore two important factors:
I am certain that data-driven technologies are the best solution to solve most of the problems that the tech world faces today. But, in the same breath, for these technologies to succeed, we need a holistic approach with the right expectations.
Through this four-article series, I am hoping to share my learnings of bridging the gap between a ‘prototype of a data-driven solution’ and an actual ‘data-driven solution deployed in the real-world with stringent SLAs’. I hope you will find these learnings valuable as you continue your journey on data-driven-transformation.
I would absolutely love to hear your thoughts on this. Please do share your comments below or reach out at [email protected] / [email protected].