As the leader of a machine learning team, one of my core responsibilities is to critically evaluate the areas of risk my team faces. This involves intimately understanding how the team serves stakeholders through our products, services, and analyses, and identifying blind spots in these offerings that could negatively impact those stakeholders. Once these risks are identified, it’s my job to determine how to mitigate them and “shore up our defenses”.
One of the largest categories of risk for any team offering machine learning-based solutions is the uncertainty associated with the ML models themselves. On one hand, models allow us to learn from historical experience and generalize to previously unseen data. On the other hand, since we can’t know precisely what data an ML system will see in the future, we have to take a leap of faith and hope that future data will be “similar enough” to the historical data. Reducing this inherent risk requires continuously monitoring an ML system to ensure that it’s operating effectively.
The question then is: what qualities should we look for in an ML monitoring system?
To answer this question, I partnered with Itai Bar Sinai, Chief Product Officer of Mona, to describe how an ML monitoring system can empower ML teams and reduce the inherent risk of production machine learning systems. Itai brings a wealth of experience in both ML engineering and production monitoring, having spent 4 years as a tech leader at Google Trends, and the last year and a half building an ML monitoring system at Mona.
Find and resolve issues faster
First and foremost, a good monitoring system helps ML teams go from “flying blind” and “reactive” to “full visibility” and “proactive”.
Many ML teams already have some production monitoring in place, even if they don’t recognize it as such. This phenomenon can be dubbed “monitoring by customer complaints”: you discover that something’s wrong because a customer tells you, or because a stakeholder reports that a business KPI has declined.
Once a complaint is received from a customer or business stakeholder, it can take hours if not days for ML teams to find the root cause by querying logs and running specific tests to understand the problem and come up with potential solutions. For some teams this situation escalates to continuous fire-fighting. For these teams, work items addressing complaints dominate the roadmap, and they constantly feel as though they’re falling short.
ML teams who achieve visibility and have the ability to proactively assess their data and models can confidently assume more accountability for the ML system’s entire lifecycle, both in research and production.
Visibility to the rescue
Achieving full visibility begins with methodically collecting data from your ML system.
A good monitoring system will enable effortless data collection, coupled with deep investigative Root Cause Analysis (RCA) tools. These tools should allow you to segment the data granularly (e.g., to look at all model runs for a specific customer in a specific region and time-range), to compare different environments (e.g., training and inference runs), and to track the statistical behavior of data and model KPIs (e.g., the distribution of features or a classifier’s average probability score). The ML team should be able to get a complete understanding of their data and model behavior, leveraging clear and customizable dashboards, reporting mechanisms and APIs.
When data is collected comprehensively and paired with the appropriate RCA tools, pinpointing the root causes of reported issues should be fast, sometimes even instant.
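To make the segmentation and comparison capabilities above concrete, here is a minimal sketch in Python. It assumes a hypothetical pandas DataFrame of model-run logs with customer_id, region, timestamp, and feature columns; none of these names come from any specific monitoring product’s API.

```python
import pandas as pd
from scipy.stats import ks_2samp

def segment(runs: pd.DataFrame, customer_id: str, region: str,
            start: str, end: str) -> pd.DataFrame:
    """Return all model runs for one customer in one region within a time range."""
    mask = (
        (runs["customer_id"] == customer_id)
        & (runs["region"] == region)
        & (runs["timestamp"].between(pd.Timestamp(start), pd.Timestamp(end)))
    )
    return runs[mask]

def compare_environments(train: pd.Series, inference: pd.Series) -> dict:
    """Compare one feature's distribution across training and inference runs."""
    stat, p_value = ks_2samp(train.dropna(), inference.dropna())
    return {
        "train_mean": train.mean(),
        "inference_mean": inference.mean(),
        "ks_statistic": stat,
        "p_value": p_value,  # a low p-value hints the two distributions diverged
    }

# Example: did "account_age_days" drift for customer X in the UK during one week?
# uk_runs = segment(inference_runs, "customer_X", "UK", "2021-05-01", "2021-05-08")
# report = compare_environments(training_runs["account_age_days"], uk_runs["account_age_days"])
```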
Getting alerted on the right issues at the right time
To help the ML team transition from “reactive” to “proactive”, a good monitoring system should alert on specific anomalies in data and model KPIs (upstream metrics), the moment the anomalies start to manifest. By focusing on these upstream metrics, the ML team gets a head start on taking corrective measures before the business KPIs are impacted.
To produce valuable and timely alerts, the monitoring system must leverage a catalog of anomaly detection models that find outliers, gradual shifts, and sudden changes in the data’s behavior.
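As an illustration of one entry in such a catalog, here is a rolling z-score check that flags sudden changes in a daily metric such as a classifier’s average probability score. This is a simplified sketch; a production system would combine several detectors, including outlier and gradual-drift models.

```python
import pandas as pd

def sudden_change_alerts(daily_metric: pd.Series, window: int = 30,
                         threshold: float = 3.0) -> pd.Series:
    """Flag days whose value deviates more than `threshold` std devs from the trailing window."""
    rolling_mean = daily_metric.rolling(window).mean().shift(1)  # exclude the current day
    rolling_std = daily_metric.rolling(window).std().shift(1)
    z_scores = (daily_metric - rolling_mean) / rolling_std
    return z_scores.abs() > threshold

# Example on a hypothetical daily series of average classifier scores:
# avg_score = inference_runs.groupby(inference_runs["timestamp"].dt.date)["score"].mean()
# alert_days = avg_score.index[sudden_change_alerts(avg_score)]
```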
The anomalies detected might indicate problems originating from a variety of root causes. Broadly speaking, these fall into one of three categories: biases, drifts, and data integrity issues. Biases and drifts are associated with the statistical properties of the production data relative to the assumptions in the model (and/or to the training data), whereas data integrity issues are typically associated with operational mishaps, e.g., a bug in a data pipeline or an unannounced format change in a 3rd-party data source.
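For the data integrity category, even a simple schema guard at ingestion time can catch an unannounced format change before it surfaces as a mysterious model anomaly. The expected schema and field names below are purely hypothetical.

```python
# Hypothetical expected schema for records arriving from a 3rd-party source.
EXPECTED_SCHEMA = {"user_id": str, "age": int, "country_code": str}

def integrity_violations(record: dict) -> list:
    """Return human-readable issues found in one incoming record."""
    issues = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            issues.append(
                f"{field} has type {type(record[field]).__name__}, expected {expected_type.__name__}"
            )
    if isinstance(record.get("age"), int) and not 0 <= record["age"] <= 120:
        issues.append(f"age out of range: {record['age']}")
    return issues

# integrity_violations({"user_id": "u1", "age": "34", "country_code": "US"})
# -> ["age has type str, expected int"]
```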
A good monitoring system should not only detect anomalous behavior, but also provide additional insight into an anomaly’s probable root cause, enabling you to take swift action without lengthy manual investigations.
Finding the needle in the anomaly haystack
Anomaly-detection systems are inherently noisy. In fact, many of the companies that try to build monitoring in-house eventually give up because their alerting is far too noisy. The reason is that each anomaly can manifest itself across many different perspectives (segments of the data), and it’s hard to tell which perspective is the right one to look through.
Consequently, alerting systems must have built-in noise-reduction and anomaly-clustering capabilities.
For example, consider a situation in which you have a very large customer, who accounts for
- most of your business in the UK
- most of your English language business
- and most of your healthcare sector business.
When there’s an issue with this customer’s data, which is unrelated to its geography, language or business sector, you would want to receive a single alert ("customer X has an issue"), not four ("you have an issue with the UK", "you have an issue with healthcare", "you have an issue with the English language", "you have an issue with customer X").
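One possible noise-reduction heuristic, sketched below under the assumption that each candidate alert carries the set of affected model-run IDs, is to keep only the most specific segment whenever several segments flag essentially the same runs. Real systems use more sophisticated anomaly clustering; this only illustrates the idea.

```python
from typing import Dict, List, Set

def deduplicate_alerts(candidate_alerts: Dict[str, Set[str]],
                       overlap_threshold: float = 0.9) -> List[str]:
    """candidate_alerts maps a segment name to its anomalous model-run IDs;
    returns the segments that should actually fire an alert."""
    # Consider the most specific segments (fewest anomalous runs) first.
    ordered = sorted(candidate_alerts, key=lambda seg: len(candidate_alerts[seg]))
    kept: List[str] = []
    for segment in ordered:
        runs = candidate_alerts[segment]
        already_covered = any(
            len(runs & candidate_alerts[kept_seg]) / len(runs) >= overlap_threshold
            for kept_seg in kept
        )
        if not already_covered:
            kept.append(segment)
    return kept

# In the example above, "customer X" is kept and the UK, healthcare, and
# English-language alerts are suppressed as near-duplicates of it:
# deduplicate_alerts({"customer X": runs_x, "UK": runs_uk,
#                     "healthcare": runs_health, "English": runs_en})
```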
Continuously improve models
Another significant value of a good ML monitoring system is that it allows you to continue your research while operating in production. It can show you where your models have blind spots, alert you to biases and data drifts, and expose you to potential features and meta-features that might not have been attainable in the research stage.
For example, you may have trained a sentiment analysis model on a labeled dataset of tweets in American English. The model runs well in production for some time but then gradually degrades. Upon investigation, you find that the lowest average confidence scores come from the Southern US, revealing a clear regional bias in the model. Since the geo-location of the texts was not available in the training dataset, you could not have picked up on this bias without seeing the model operate in production.
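The investigation itself can be as simple as grouping production confidence scores by region, as in this sketch; the DataFrame and column names are hypothetical.

```python
import pandas as pd

def confidence_by_region(predictions: pd.DataFrame) -> pd.DataFrame:
    """Average model confidence per region, lowest (most suspicious) first."""
    return (
        predictions.groupby("geo_region")["confidence"]
        .agg(["mean", "count"])
        .sort_values("mean")
    )

# confidence_by_region(production_predictions).head()
# A region sitting at the bottom with a large "count" is a strong hint of a
# training-data bias worth investigating further.
```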
With this new knowledge in hand, you can get new labeled data originating from the problematic regions and enhance your training and test sets. You can then decide to implement a data operations pipeline to continuously obtain labeled datasets from new regions as they appear in your production environment (your monitoring system should alert you to these new regions and trigger the pipeline via API).
This type of insight is a classic example of how you’re able to improve your research and reduce costs significantly by monitoring your data and models in production. Perhaps bias across geographical regions is an overly simplistic example, but a good monitoring system should help you find far less obvious meta-features and anomalies to take into account for further research.
Increase research and development speed
The regional bias example is also relevant through the lens of R&D prioritization and cadence. Knowing that a data field (e.g., geo-region) affects your model’s behavior can help you decide what to work on next, e.g., the automatic data labeling pipeline mentioned above. Better prioritization brings the right improvements to production faster.
Additionally, monitoring enables you to test hypotheses in production and avoid lengthy offline experiments to discover new features. For example, if your monitoring system allows you to feed business KPIs to it, you should be able to test whether certain data fields correlate with improvements in these KPIs. Such correlations may make these fields good feature candidates.
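As a rough sketch of such a hypothesis test, assuming a hypothetical DataFrame in which each model run has been joined with its downstream business KPI, you could rank candidate fields by their correlation with that KPI:

```python
import pandas as pd

def rank_feature_candidates(runs: pd.DataFrame, kpi: str,
                            candidate_fields: list) -> pd.Series:
    """Absolute Pearson correlation of each candidate field with the business KPI."""
    correlations = runs[candidate_fields].corrwith(runs[kpi])
    return correlations.abs().sort_values(ascending=False)

# rank_feature_candidates(joined_runs, kpi="converted",
#                         candidate_fields=["account_age_days", "session_length"])
# Correlation is only a first filter; promising fields still warrant a proper
# offline experiment before being promoted to model features.
```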
Beyond better prioritization, and the ability to test hypotheses in production, there is the intangible benefit of “peace of mind”. If you know you have full visibility into how new models behave in production, would you launch/release them earlier? Most ML teams would.
Version Benchmarking
Last but not least, a good ML monitoring system should include comprehensive benchmarking functionality, enabling ML teams to compare the performance of different model versions. Specifically, you would want to assess your model KPIs (whether the standard precision and recall, or other behavioral metrics such as confidence intervals) across the two model versions, current and new, in both A/B testing and shadow deployment scenarios.
Such capabilities are especially valuable because they expand both the metrics and the datasets (from test sets to production data) typically used in version benchmarking. Benchmarking in production lets you ensure that model quality is sustained (or even improves) with new version deployments. Additionally, you get stronger indications of the strengths and weaknesses of new model versions, allowing for smarter go/no-go decisions, and perhaps even more nuanced decisions about when to use certain versions of the model.
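A minimal sketch of such a comparison in a shadow-deployment setup, where both versions score the same production traffic and a labeled sample is available, might look like the following; the function and variable names are illustrative.

```python
from sklearn.metrics import precision_score, recall_score

def benchmark_versions(y_true, predictions_by_version: dict) -> dict:
    """Precision and recall per model version on the same labeled production sample."""
    return {
        version: {
            "precision": precision_score(y_true, preds),
            "recall": recall_score(y_true, preds),
        }
        for version, preds in predictions_by_version.items()
    }

# benchmark_versions(labels, {"current": current_preds, "candidate": candidate_preds})
# A go/no-go decision can then weigh not just the headline numbers but also where
# each version wins, e.g., per customer segment or region.
```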
Conclusion
Machine learning monitoring systems allow teams to reduce risk by continuously ensuring that ML systems are operating effectively. A proper monitoring system allows you to find and resolve issues before they negatively impact business KPIs. By finding new signals and understanding model weak spots, a monitoring solution also enables you to improve model performance by continuing to research in production. Finally, monitoring platforms help you to take your models to market faster by accelerating development cycles.
We’ll continue our exploration of production machine learning monitoring in our next post by interviewing Nimrod “Nemo” Tamir, Chief Technology Officer at Mona. Sign up below to get notified when that post is published!