Lessons Learned from 15 Years of Monitoring Machine Learning in Production

This is the second post in a multi-part blog series (find Part 1 here) on monitoring machine learning models in production. In this post we’ve invited Oren Razon, Co-Founder and CTO of superwise.ai to share his thoughts on what it takes to monitor ML systems. Drawing from over 15 years of experience helping enterprise companies and startups in Healthcare, Fintech, Marketing, AdTech, and Manufacturing, Oren has distilled his knowledge into 3 key lessons for data science teams thinking about developing monitoring solutions. Enjoy his wisdom!

The use of Machine Learning is becoming ubiquitous for a wide variety of use cases. But as data science teams ramp up their activities and deploy more models to production, the MLOps space remains ill-equipped to face the challenges of ensuring accuracy, performance, and proper maintenance once models are used in the real world. This new “day after deployment” phase is still overlooked, as most MLOps tools focus on the research and development phase. Data science and operational teams require better solutions suited to monitoring AI in production.

The current gap in the ecosystem exposes businesses to high risks, creates frustration among data science teams that need to “babysit” and diagnose the systems, and reduces the level of trust that organizations have in their machine learning processes. These circumstances limit the ability to efficiently scale AI operations and push teams to try and develop ad-hoc monitoring solutions.

Yet, developing a general-purpose monitoring solution that fits various ML use cases is not a straightforward task. Different machine learning models require different measurements. For instance, the performance metrics of a regression task are not the same as those of a multi-class classification task, and the distribution of structured data is measured differently than that of text or image data. What’s more, ML systems can be deployed in various ways, some of which may require special attention and custom capabilities. As an example, models requiring A/B testing differ from models that are deployed in “shadow mode”. While experienced data scientists may try to develop custom monitoring solutions, few of those solutions will manage to overcome the complexity of the task at hand.
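
To give a sense of why one-size-fits-all monitoring is hard, here is a minimal sketch of how even the basic performance metrics differ by task type. The registry and function names are my own hypothetical illustration, not something from the original post, and it assumes scikit-learn is available.

```python
# A minimal sketch (hypothetical): the "right" performance metrics depend on the
# type of model being monitored, so a generic solution must handle all of them.
from sklearn import metrics

# Hypothetical registry mapping task type -> metrics worth tracking once labels arrive.
METRICS_BY_TASK = {
    "binary_classification": {
        "precision": metrics.precision_score,
        "recall": metrics.recall_score,
    },
    "multiclass_classification": {
        "macro_f1": lambda y, p: metrics.f1_score(y, p, average="macro"),
    },
    "regression": {
        "mae": metrics.mean_absolute_error,
        "rmse": lambda y, p: metrics.mean_squared_error(y, p) ** 0.5,
    },
}


def evaluate(task_type: str, y_true, y_pred) -> dict:
    """Compute the task-appropriate metrics for a batch of labeled predictions."""
    return {name: fn(y_true, y_pred) for name, fn in METRICS_BY_TASK[task_type].items()}
```

And this only covers performance metrics; monitoring distributions of structured, text, or image data, or handling A/B versus shadow deployments, adds further dimensions that a home-grown script rarely covers.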

After deploying dozens of ML models in the last 15 years, both for enterprises and small startups in various domains including Healthcare, Fintech, Marketing, AdTech, Manufacturing, and more, I’d like to review some of the ad-hoc solutions I’ve seen used to monitor ML systems, describe their limitations, and offer practical tips for the “day after deployment”.

Lesson #1 – Monitoring performance metrics isn’t enough. You need to monitor input distributions.

A FinTech company helping banks determine whether or not to approve loan requests had developed a binary classification model that predicted the chances of a borrower defaulting on a loan. Using ML techniques, the company improved detection rates for high-risk loans.

During the research and development of the classification model, the main performance KPI to optimize was the precision of the “positive” class (in this case, “approved loans”), subject to a minimum approval rate. To ensure the model’s ongoing performance in production and trigger alerts if the precision level became suboptimal, the team ran a recurring weekly script that scanned historical requests. Note that in this case, the labels were not collected automatically and arrived after long delays – 6 months after the prediction, on average.

After a few months, they received numerous complaints. While the model’s approval rate was satisfactory, the approve/reject decisions made no sense to the bankers who used the model.

Without ground-truth labels for the predictions, the teams couldn’t be alerted. After a week of tedious investigation, they discovered a huge drift in the incoming request data, caused by marketing campaigns that targeted new audiences for loans. The model kept issuing predictions for this new audience, despite the fact that 1) no labels existed for the newly targeted segments, and 2) their data distribution differed so significantly from the training dataset that the model’s predictions were irrelevant.

Takeaway

You need to measure and monitor all relevant aspects of your model to detect issues at the right time, including:

  • the model’s input distribution and its level of drift
  • metrics describing the model’s inference outputs, as these can serve as proxies for detecting strange model behavior that leads to unexpected results, long before delayed labels reveal the problem

In the research phase you can and should evaluate your models with classical performance metrics, but in production, where labels may arrive late or not at all, these metrics alone fail to provide the right indications in a timely manner.
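
To make this takeaway concrete, here is a minimal sketch of what such label-free monitoring could look like. It is my own illustration, not the FinTech team’s actual code; the threshold, feature dictionaries, and helper names are hypothetical, and it assumes a stored training-time reference sample for each feature plus NumPy and SciPy.

```python
# A minimal sketch (hypothetical, not the team's pipeline): monitor input drift
# and prediction-level proxies without waiting months for labels.
import numpy as np
from scipy import stats

DRIFT_P_VALUE = 0.01  # hypothetical alerting threshold


def drift_report(reference: dict, production: dict) -> dict:
    """Compare each production feature against its training-time reference sample
    with a two-sample Kolmogorov-Smirnov test; small p-values suggest drift."""
    report = {}
    for feature, ref_values in reference.items():
        stat, p_value = stats.ks_2samp(ref_values, production[feature])
        report[feature] = {
            "ks_stat": float(stat),
            "p_value": float(p_value),
            "drifted": p_value < DRIFT_P_VALUE,
        }
    return report


def prediction_summary(scores: np.ndarray, threshold: float = 0.5) -> dict:
    """Label-free proxies for model behavior: score distribution and approval rate."""
    return {
        "approval_rate": float(np.mean(scores >= threshold)),
        "mean_score": float(np.mean(scores)),
        "p10_score": float(np.percentile(scores, 10)),
        "p90_score": float(np.percentile(scores, 90)),
    }
```

A weekly job comparing reports like these against the training baseline would have flagged the new loan audience as soon as the campaign traffic started flowing in, instead of six months later when labels finally arrived.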

Lesson #2 – Automatic retraining is not (necessarily) the solution

An AdTech company developed a regression model to estimate the potential revenue of new impressions and optimize their real-time bidding strategy. To overcome the ongoing temporal fluctuations of this highly dynamic ecosystem, the data science team collaborated with the engineering team to build a full orchestration flow in production. Each week, a new model would be retrained and automatically deployed based on data from the previous month.

Since the feedback delay cycles were relatively short, the team continuously measured model performance and even triggered retraining if the observed performance fell below a specific threshold. They thought they had it all covered!

However, a few weeks after deployment, the team noticed that model performance was degrading and negatively impacting business results. Strangely, retraining didn’t solve the issue. At first, they assumed the retraining simply needed more time to accumulate relevant data and recover performance.

After two weeks of poor results, the team identified the issue: the data engineering team had temporarily changed the source of several model inputs in the operational systems. This caused massive distribution shifts that degraded model performance during that period and made retraining useless. Even worse, after the model inputs were restored, the automatic retraining jobs still included the “dirty data” period in the training set, leading to suboptimal results.

Takeaway

Retraining “in the dark” is not enough. Once there’s data drift or a performance incident, you need to be able to investigate the underlying change and understand what actually happened.
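
As an illustration of this takeaway, here is a minimal sketch of how automatic retraining could be guarded by a data validation step. It is an assumption on my part rather than the AdTech team’s actual orchestration; the window structure, threshold, and function names are hypothetical, and it reuses a simple two-sample test as the drift check.

```python
# A minimal sketch (hypothetical): validate candidate training windows against a
# reference distribution, alert on anomalies, and keep flagged windows out of the
# training set instead of retraining "in the dark".
import logging

from scipy import stats

logger = logging.getLogger("retraining_guard")
DRIFT_P_VALUE = 0.01  # hypothetical threshold


def window_is_clean(window: dict, reference: dict) -> bool:
    """Return False if any feature in this time window drifted vs. the reference."""
    for feature, ref_values in reference.items():
        _, p_value = stats.ks_2samp(ref_values, window[feature])
        if p_value < DRIFT_P_VALUE:
            logger.warning(
                "Drift detected in feature %r (p=%.4f): investigate before retraining",
                feature,
                p_value,
            )
            return False
    return True


def select_training_windows(candidate_windows: list, reference: dict) -> list:
    """Keep only validated windows; the rest are surfaced for human investigation."""
    return [w for w in candidate_windows if window_is_clean(w, reference)]
```

With a guard like this, the “dirty data” period would have raised an alert pointing at the changed inputs, and the subsequent retraining jobs would not have silently absorbed it.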

Lesson #3 – You need high-resolution measurement for specific subpopulations.

A gaming company developed machine learning models to support its marketing activities. After extensive evaluation, the models were deployed, and the data science team even spent a great deal of effort implementing internal dashboards to observe the quality of the incoming data and the model performance.

After a period of encouraging results, the marketing team complained that they were suffering from high churn among their most important VIP customers, despite following the models’ recommendations. During this entire time, the data science team didn’t see anything suspicious in their dashboards.

After manually investigating all the model metrics for this specific VIP segment, they realized that while the model KPIs for the general user population were good, the KPIs were drifting and contained anomalies once filtered to the VIP segment alone. Because the segment was relatively small, the overall metrics masked the severity of the issue.

Takeaway

Having a broad overview is necessary for monitoring, but it is insufficient to guarantee optimal business performance. When looking at models in production, you need the ability to slice and dice the data and model metrics, reach lower levels of granularity, and understand the qualitative parameters that impact your business.
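
To illustrate the kind of slicing this takeaway calls for, here is a minimal sketch using pandas. The column names and segment labels are hypothetical, not the gaming company’s actual schema.

```python
# A minimal sketch (hypothetical column names): compute model KPIs per segment so
# that a small but critical subpopulation is not averaged away by global metrics.
import pandas as pd


def kpis_by_segment(df: pd.DataFrame) -> pd.DataFrame:
    """Expects columns: 'segment' (e.g. 'vip' / 'regular'), 'predicted_score',
    and 'churned' (0/1 once the outcome is known)."""
    return (
        df.groupby("segment")
        .agg(
            n_users=("churned", "size"),
            observed_churn=("churned", "mean"),
            mean_predicted_score=("predicted_score", "mean"),
        )
        .reset_index()
    )
```

Comparing the VIP row against the overall aggregate is exactly the view that was missing here: the global numbers looked healthy while the VIP slice was drifting.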

Final Thoughts

Overall, having the right tools in place to monitor ML is just the beginning of solving the issues of “the day after deployment”. As more and more models reach this new stage of the MLOps process, both data science and business operations teams need a shared and objective view of their ML activities. Each of them needs, in its own terms, a clear understanding of the system’s health and robust diagnostic capabilities to keep track of it.

For these reasons, at superwise.ai, we look at monitoring as a combination of two things: practical capabilities to extract timely, easily understandable insights about the ML process, and tools that create a common language between data science and operational teams. Beyond the practical, hands-on approach, we seek to empower each team to be more independent and gain insights into their daily work, with the goal of scaling their AI effectively and with confidence.

Oren Razon is the CTO and a Co-Founder of superwise.ai. You can connect with him on LinkedIn.

This is the second post in a multi-part series on ML monitoring. We’ll continue to explore this complex topic in subsequent posts. If you’d like to know when these posts are published, sign up below to be notified by email.
