This is the first post in a multi-part blog series on monitoring machine learning models. In this post we describe the importance of monitoring and how monitoring ML is different from application performance management (APM). In subsequent posts, we will dive deeper into what it takes to monitor machine learning by learning from top industry experts.
Any company that deploys software knows that their software can fail. To prepare for failure, development teams perform quality assurance (QA) by testing their code and iteratively deploying it to larger subsets of users. But these QA techniques can’t prevent all issues. That’s why teams have evolved monitoring practices to observe how software is performing. These teams instrument their applications with data-emitting sensors and set up alerts to signal when issues arise.
Curiously, many of the same teams well-versed in monitoring software don’t place the same emphasis on monitoring when operating their machine learning systems. One reason for this is speed: companies want to get their ML models out the door as quickly as possible. Another is lack of knowledge: the monitoring needs of ML systems are distinct from those of traditional software. Regardless of the reason, it’s critical to understand that failing to appropriately monitor machine learning models is like playing Russian roulette.
To introduce monitoring in the context of machine learning, we’ll start by explaining why it’s important to monitor ML models. We’ll then discuss how ML system monitoring differs from traditional application performance monitoring and explain the difficulties of monitoring machine learning. Finally, we’ll examine the monitoring needs of several concrete ML use cases and describe what could go wrong without proper monitoring.
Why is it Important to Monitor Machine Learning Models?
Consider how the development process of software applications differs from that of machine learning models.
In traditional software, we explicitly plan for and express how a system should react in specific circumstances. This is accomplished by writing automated tests to ensure applications perform as expected.
In contrast, the power of machine learning lies in how it generalizes from historical experience and reacts to new, unseen data without us explicitly describing each case. Rather than hard-coding logical rules, data scientists use algorithms that learn relationships between input data and what they wish to predict. These probabilistic rules describe how to transform input data, such as an image, into a prediction, such as the category of objects within the image.
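To make this distinction concrete, here is a minimal sketch (using scikit-learn on synthetic data, purely as an illustration) of fitting a model that learns a relationship from examples rather than having the rule written by hand:

```python
# A minimal sketch: rather than hand-coding rules, we fit a model that
# learns the relationship between inputs and the target from examples.
# The synthetic data stands in for historical experience.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 4))                  # stand-in for historical input features
y = (X[:, 0] + X[:, 1] > 1).astype(int)    # stand-in for historical labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # the "probabilistic rules" are learned, not written
print("held-out accuracy:", model.score(X_test, y_test))
```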
Since we can’t explicitly test for all of the possible cases a machine learning system will encounter, we need to continuously monitor the system to ensure it’s operating effectively. But even though ML systems are a type of software system, they have very different monitoring requirements.
How is ML Monitoring Different From Application Performance Monitoring (APM)?
Application performance management (APM) is the monitoring and management of the performance and availability of software applications. APM closely tracks two sets of performance metrics. The first set captures the performance experienced by end users; one example is application response time. The second set measures the computational resources used by an application, such as CPU usage or average memory consumed.
Since ML systems are a type of software system, it’s still important to monitor APM metrics. For example, we still need to ensure that the ML serving system is running and that it’s returning predictions with acceptable latency. But these metrics are just a subset of the metrics that should be tracked.
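As a rough illustration, the APM side of an ML service might look like the sketch below, which times each prediction call and logs when a latency budget is exceeded. The model object, the 250 ms budget, and logging as the alerting hook are all illustrative assumptions:

```python
# A minimal sketch of APM-style monitoring around an ML serving call:
# time each prediction, emit the latency as a metric, and warn when a
# latency budget is exceeded.
import logging
import time

LATENCY_BUDGET_MS = 250
logger = logging.getLogger("ml_serving")

def predict_with_latency_metric(model, features):
    start = time.perf_counter()
    prediction = model.predict([features])[0]
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info("prediction_latency_ms=%.1f", latency_ms)  # emitted for the APM system to scrape
    if latency_ms > LATENCY_BUDGET_MS:
        logger.warning("latency budget exceeded: %.1f ms", latency_ms)
    return prediction
```

Metrics like these tell us the service is up and responsive; they say nothing about whether its predictions are any good.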
Monitoring machine learning systems is about monitoring the quality of decision making that the system enables. The quality of predictions is a function of many things including:
- The quality of the data fed to models at inference time – The input data to a model can come from user requests, backend data processing jobs, and third-party systems (as we’ll see in the Uber Eats example below). Issues in any of these disparate systems can negatively impact the quality of a model’s predictions. Monitoring an ML system isn’t just about monitoring a model; it also involves monitoring all the data sources fed to the model.
- Modeling assumptions remaining relatively constant – Models learn relationships between inputs and outputs from historical data. Since the real world is dynamic, these relationships are constantly changing. This means that model performance naturally degrades with time. Models must adapt to changing conditions, but detecting these changes depends on robust, continuous monitoring.
- The robustness and stability of predictions – It’s understood that input features to machine learning models are not independent. Hence, changes in any part of the system, including hyper-parameters, learning settings, sampling methods, convergence thresholds, and data selection, can cause unpredictable changes to model output. This is known as CACE: Changing Anything Changes Everything.
This means that we need to continuously monitor whether the assumptions baked into the model at training time continue to hold at inference time. This form of monitoring is extremely difficult because it requires advanced statistical capabilities and careful tuning to prevent “alert fatigue,” i.e., too many false positives. But failing to catch violations of model assumptions negatively impacts both the user experience and business KPIs.
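As one simple illustration of this kind of check, the sketch below compares the training-time distribution of a single feature against recent serving data using a two-sample Kolmogorov-Smirnov test. The p-value threshold is an arbitrary placeholder and would need careful tuning, per feature and with corrections for many simultaneous tests, to avoid the alert fatigue mentioned above:

```python
# A minimal sketch of checking one training-time assumption at inference
# time: compare the training distribution of a feature against recent
# serving data with a two-sample Kolmogorov-Smirnov test. The p-value
# threshold is an illustrative placeholder.
from scipy.stats import ks_2samp

def feature_drifted(training_values, recent_values, p_threshold=0.01):
    """Return (drifted?, KS statistic) for one feature."""
    statistic, p_value = ks_2samp(training_values, recent_values)
    return p_value < p_threshold, statistic

# Hypothetical usage:
# drifted, stat = feature_drifted(train_df["prep_time"], last_day_df["prep_time"])
# if drifted:
#     alert(f"prep_time distribution shifted (KS statistic {stat:.3f})")
```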
Examples of Monitoring Machine Learning Applications
So far we’ve discussed the importance of monitoring ML systems at a conceptual level. To improve our understanding, let’s consider two concrete examples of machine learning monitoring. The first example describes a well-known consumer product: Uber Eats. The second example describes a common application of ML in industry: lead scoring.
Uber Eats Estimated Time of Delivery Model
Uber Eats is Uber’s food delivery service that allows users to order food through their mobile device and have it delivered by Uber’s drivers. When a customer orders food, machine learning models predict the estimated time-to-delivery (ETD) of the meal. This prediction is updated at each stage of the delivery process, including when the order is acknowledged, when the restaurant completes meal preparation, and when a driver is dispatched and picks up the meal.
To predict the total duration of this multi-stage process in real time, Uber’s data scientists have built regression models that require data from a variety of sources. These models depend on data from customer requests, historical information about restaurant performance, and real-time information describing restaurant load, traffic conditions, and available drivers. For instance, one such real-time signal is the average meal preparation time at a given restaurant during the last hour. Computing these quantities requires an extensive set of backend data processing jobs.
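As a toy illustration of one such job, the pandas sketch below computes a trailing one-hour average preparation time per restaurant from a table of completed orders. The table, column names, and in-memory computation are all assumptions; a production system would compute this as a streaming or scheduled backend job:

```python
# A toy sketch of one backend feature job: the trailing one-hour average
# meal preparation time per restaurant, computed with pandas.
import pandas as pd

orders = pd.DataFrame({
    "restaurant_id": [1, 1, 2, 1, 2],
    "completed_at": pd.to_datetime([
        "2021-01-01 12:05", "2021-01-01 12:20", "2021-01-01 12:30",
        "2021-01-01 12:50", "2021-01-01 13:10",
    ]),
    "prep_minutes": [12.0, 15.0, 22.0, 11.0, 25.0],
})

hourly_avg_prep = (
    orders.sort_values("completed_at")
          .set_index("completed_at")
          .groupby("restaurant_id")["prep_minutes"]
          .rolling("1h")            # trailing one-hour window per restaurant
          .mean()
)
print(hourly_avg_prep)
```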
Uber needs to monitor the distributions of the incoming data to detect deviations that would cause models to act in unexpected ways. Doing this for all of the different features, restaurants, and cities is nontrivial due to the dataset size and issues like trends and seasonality. In fact, Uber recently released a blog post describing Data Quality Monitor (DQM), their in-house system that automatically finds anomalies across datasets and alerts engineers. The fact that Uber invested in building a monitoring solution speaks to the importance of the problem.
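The details of DQM are beyond the scope of this post, but even a crude per-segment check conveys the idea. The sketch below compares today’s mean of a feature in each city against a historical baseline for the same day of week and flags large deviations; the three-sigma rule, the day-of-week grouping, and the column names are illustrative assumptions that ignore most of the trend and seasonality issues a real system must handle:

```python
# A crude per-segment data check: compare today's mean of a feature for
# each city against a historical baseline for the same day of week and
# flag deviations larger than three standard deviations.
import pandas as pd

def flag_feature_anomalies(history: pd.DataFrame, today: pd.DataFrame,
                           feature: str, z_threshold: float = 3.0) -> pd.DataFrame:
    # Both frames have columns: city, day_of_week, <feature>.
    baseline = history.groupby(["city", "day_of_week"])[feature].agg(["mean", "std"])
    current = today.groupby(["city", "day_of_week"])[feature].mean().rename("today_mean")
    joined = baseline.join(current, how="inner")
    joined["z_score"] = (joined["today_mean"] - joined["mean"]) / joined["std"]
    return joined[joined["z_score"].abs() > z_threshold]
```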
Lead Scoring for Sales Optimization
Lead scoring is a problem that teams often attempt to solve with machine learning. The business goal is to increase the likelihood of converting a lead. The ML goal is to accurately estimate the potential value of a lead and then personalize the prospect’s experience based on that predicted value. Depending on how predictions are utilized, incorrectly scoring subgroups of prospects can lead to missed sales opportunities and suboptimal marketing spend.
For instance, suppose a lead scoring model achieves satisfactory aggregate performance but underperforms on a subset of leads, such as leads originating from mobile advertisements. If only a small number of leads came from mobile ads historically, aggregate performance metrics wouldn’t capture this underperformance, leading to overly optimistic expectations. If the number of leads generated through mobile ads suddenly increases, the model will underperform on a larger percentage of total leads, which could negatively impact down-funnel performance.
Without monitoring, it would be difficult to detect why overall conversion is declining until it’s too late. Preventing this problem depends on continuously monitoring how the model performs on important subpopulations. In this example, these subpopulations can be defined by the lead source; properly defining them is domain specific.
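As a rough sketch of what this could look like for lead scoring, the code below evaluates a model separately for each lead source and flags sources where performance drops below a floor. The column names, the AUC metric, and the thresholds are illustrative assumptions:

```python
# A rough sketch of per-segment performance monitoring for lead scoring:
# evaluate the model separately for each lead source and flag sources
# whose AUC falls below a floor.
import pandas as pd
from sklearn.metrics import roc_auc_score

def underperforming_segments(scored_leads: pd.DataFrame,
                             min_auc: float = 0.65,
                             min_leads: int = 200) -> dict:
    # scored_leads columns: lead_source, predicted_score, converted (0/1)
    flagged = {}
    for source, group in scored_leads.groupby("lead_source"):
        if len(group) < min_leads or group["converted"].nunique() < 2:
            continue  # too few leads (or a single class) to evaluate reliably
        auc = roc_auc_score(group["converted"], group["predicted_score"])
        if auc < min_auc:
            flagged[source] = auc
    return flagged
```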
Conclusion
Using machine learning within applications allows us to build entirely new functionality that isn’t possible with traditional software development. Although ML can enable new products and services, models can fail in novel and unpredictable ways. These failures can range from minor nuisances, like inaccurate song recommendations, to errors in the life-and-death decisions made each second by autonomous vehicles.
Organizations that operate traditional software systems have relied on monitoring their applications to catch degrading performance as quickly as possible. Since machine learning systems form a subset of software, similar monitoring solutions are needed to ensure proper business performance. But implementing the exact same monitoring systems is insufficient. ML systems have additional monitoring needs due in part to how machine learning algorithms are built.
While we’ve discussed several examples of how to monitor particular machine learning systems, we’ve barely scratched the surface of this complex topic. This post is the first in a multi-part series on ML monitoring. In subsequent posts we’ll continue diving into ML monitoring with other industry experts.
If you’d like to know when these posts are published, sign up below to be notified by email.
Awesome post!
Your point about monitoring the model with all the data sources really resonates. Even more so, as more teams now deploy complex, multi-model systems, it becomes critical to monitor the whole system (e.g., pipelines, pre- and post-processing stages, multiple models) because issues rooted in one model or pipeline might impact other models.
Wholeheartedly agree with your sentiment that a real key is to go beyond aggregate metrics. I’ve seen countless examples in which issues begin manifesting in small segments of the data or inferences, and are not “caught” until customers complain or business metrics decline.
Hi Yotam. Thanks! I totally agree that it is critical to monitor the entire system, including the entire data pipeline.
Monitoring key data segments is super important. I believe doing this up front during the product planning stage is critical. One step in the planning process should be to identify these subslices of the data and instrument systems with proper logging so that the slices can be properly monitored. Understanding the costs of errors on these subslices is just as important, since an error on one subpopulation can be more costly than the same error on another.