Monitoring ML: Interview with Nimrod Tamir, Co-founder & CTO of Mona

This is the second post in a 2-part series about ML monitoring systems. In our first post we described the value proposition of a monitoring system. A great ML monitoring system, we explained, enables teams to quickly find and resolve issues, continuously improve their models, and increase the speed of research and development. In short, proper monitoring helps ML teams go from “reactive” to “proactive”.

In this post, I had the opportunity to interview Nimrod “Nemo” Tamir, co-founder and CTO of Mona. Nemo spent the better part of a decade leading the team responsible for the very popular Google Trends product. After that he led engineering at Voyager Labs, a company focused on delivering machine learning solutions for human behavior analysis. Today he’s leveraging his ML monitoring experience by leading the development of a monitoring platform at Mona.

In our interview we spoke about his experiences monitoring machine learning, what ML has to learn from application performance management (APM), and who should own ML monitoring within an organization. Along the way Nemo offers advice on when and how ML teams can incorporate monitoring into their products.

Interview

1. You spent 9 years at Google where you worked on Google Trends. Can you tell us about this experience, and what you learned about monitoring during that time?

I joined Google Tel Aviv in 2007, in the early days of the R&D center (which today is one of the largest R&D centers outside of the US). Back then Google encouraged us to come up with impactful projects and convince leadership why Google should invest resources in them, and this is how Google Trends was founded. I had the amazing experience of leading this team for 9 years, in charge of the global Google Trends product (along with a suite of related niche tools) and the related internal infrastructure, and was commonly known as “the Trends guy”.

Since Google Trends was the first product to launch out of the Tel Aviv office, we were the first to go through a checklist of items to get the required approvals for the launch, one of which was “monitoring”. At the time there weren’t any common white-box monitoring tools outside of Google, and monitoring was implemented either with external off-the-shelf products (which have no ability to encapsulate knowledge of the software internals) or done poorly over logging infrastructure. But Google already had an internal predecessor of what is now known as Prometheus, which provided exactly this functionality. So, I spent a few weeks learning it, became the local monitoring expert, and went on to mentor and approve launches for many other teams in the office.

Separately, as the owner of a data-heavy product, I noticed the gap this infrastructure has in offering a solution at the data level. These products were amazing at quickly identifying bugs and outages, memory leaks and specific network issues, but by design they lacked the ability to robustly monitor data behavior. A famous example of when this affected us is the failure of the Google Flu Trends product, which, due to unrecognized concept drift, stopped producing valid forecasts of where the flu would hit, even though the automatic model retraining mechanism was functioning properly. This was a lifetime lesson on the severe potential consequences of a lack of monitoring.

2. In our conversations you mentioned that in your next role as VP of Engineering at Voyager Labs you thought a lot about monitoring. What specific monitoring challenges did you see? How did you solve those?

At Voyager Labs I was in charge of the deployment of large ML engines, consisting of dozens of different types of models ranging from natural language analysis to image recognition, including unsupervised clustering, recommender systems and more. These algorithms used all kinds of ML techniques, including deep neural networks, various types of decision forests and other proprietary homegrown algorithms, developed by a large and capable data science team. As in many other organizations, that team was siloed and focused on the next generation of capabilities, and less involved in the operational aspects of its work. We reached a point where we had more model types than data scientists, and we simply didn’t have enough eyes to track these models’ behavior manually.

As a result, whenever business metrics were declining (which happened often), we had very little knowledge as to what was going on and how to fix it. If the results were bad enough, we would stop work on other projects and staff a team to pretty much rebuild the model. In conversations with colleagues in the same situation, I heard of similar experiences.

One way I tried to solve this issue at the time was to compartmentalize the models (and their pre- and post-processing steps) and track irregularities in the data flowing in and out of them, but fairly quickly I realized that each component in the system was impacted differently by changes in the data. It was clear that I needed guidance from the individual component developers regarding what to monitor in their "area".

The next approach was to build internal dashboards (on top of open-source logging, APM and visualization infrastructure), which really helped us get more transparency into what was going on in the models. But this approach had two major flaws. First, there was the high cost of maintenance, both in keeping the system working properly and in upgrading its capabilities. More importantly, connecting this data to an alerting mechanism that detects anomalies and proactively notifies about them proved to be a very hard task and created more noise than meaningful insights.

3. Many machine learning practitioners are not familiar with Application Performance Management (APM) systems. Can you provide background on APM systems?

APM refers to various types of solutions that help manage the performance of systems, from the business and product level (e.g. end-user experience) down to network and system performance monitoring. These solutions have existed for decades and are commonly used in enterprises.

Over the past few years, a new sub-field of APM called “white-box monitoring” has become very popular, at first for big-data systems and later for applications as well. This approach allows teams to convey important information about the nature of the software to the monitoring platform, essentially making it MUCH smarter at understanding what real issues occur in software in production. This is done by supplying various lightweight agents that export variables from the code (variables placed there just for that purpose), making it quick to set rules that aggregate them correctly and to build dashboards and alerts on top of those aggregations.
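To make the idea of exported variables concrete, here is a minimal illustrative sketch using the open-source Prometheus Python client (an example for this post, not something described in the interview): a developer exports a request counter and a latency histogram purely for monitoring, and dashboards and alerts are then built on top of their aggregations.

```python
# Minimal white-box monitoring sketch with the Prometheus Python client.
# The exported variables exist only so the monitoring system can scrape them.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("model_requests_total", "Model invocations", ["outcome"])
LATENCY = Histogram("model_latency_seconds", "Per-request inference latency")

def handle_request():
    with LATENCY.time():                         # record turnaround time
        time.sleep(random.uniform(0.01, 0.05))   # stand-in for real inference
    REQUESTS.labels(outcome="ok").inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at :8000/metrics for scraping
    while True:
        handle_request()
```

Aggregation rules, dashboards, and alerts are then defined over these series in the monitoring backend itself.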

Given the advantages this approach brings to big-data systems in production, it was no surprise to me that it prevailed over the last few years. It’s now dominating the market in both open-source projects (e.g. Prometheus) and paid services (e.g. DataDog). I refer to it as a revolution because it was the first time developers took part in the operational aspect of planning software monitoring.

4. What lessons from the APM world translate over to the world of machine learning?

To me the most important lesson is that system developers must be involved in planning the monitoring aspect of their software. They rarely are today; once development ends and the systems run in the real world, developers move on to other tasks, e.g. implementing new features, fixing bugs or addressing customer requests.

In best-in-class monitoring setups, developers provide crucial inputs to the monitoring system. For example, data engineers may provide the monitoring system with the expected behaviors of metrics like the turnaround time of certain recurring operations, or the expected size of a certain dynamic data structure over time, which has the potential to affect the overall system performance. Similarly, data scientists possess knowledge that’s critical for monitoring.

Another important lesson is that you don’t have to reinvent the wheel. APM, logging and visualization solutions are strong and capable infrastructure, but they are limited in their ability to solve the ML monitoring problem without a lot of additional heavy lifting. By finding the appropriate tools you’ll get much better results at a significantly lower price.

5. What made it possible for developers in big data engineering to effectively contribute their knowledge to the monitoring process? Could the same principles be reproduced in ML?

To me, a developer’s job is to develop a system, but this doesn’t imply any additional responsibilities once the product is developed. On the other hand, an engineer is in charge of building the system and making sure it lives to realize its goal, which is a very different requirement. Today in the big-data industry you see more and more engineering work done within the development teams: today’s big-data engineers possess knowledge of the operational aspects of their software, such as how it’s deployed, its required resources, its scalability, and how to release new features while keeping a smooth user experience. On one hand, they can’t be replaced in processes like resource planning (e.g. estimating extra monthly storage), defining QA needs (e.g. defining testing scopes), and on-call duty rotations for when a real-time issue is out of the automation/DevOps scope. On the other hand, if they have to learn all the ins and outs of the operational stack themselves, they won’t be able to do their jobs.

To bridge this gap, more capable and versatile DevOps and site reliability engineers are being hired to introduce the appropriate infrastructure to facilitate this process. This is done using many new, innovative tools: Kubernetes; newly available cloud services like FaaS, API gateways, queues, billing management and release management; and, in monitoring, APMs, visualization and incident management. I definitely believe the same type of evolution has already started to happen in the data science world.

6. Let’s talk about the technical side of monitoring ML and building monitoring platforms. The point of monitoring is to detect unhealthy behavior. How do you define “healthy behavior”? How do you achieve it?

The short answer is that healthy behavior is defined differently in each ML system. Each model (or, to be more exact, each component of the modeling process, like feature engineering or post-processing) has unique properties that are critical for monitoring. One way to define healthy behavior is by 1) choosing a monitoring context, 2) defining behavior metrics, and 3) determining segments of model activations and their granularity. If these are chosen correctly, consistent, stable distributions of the metrics within all sub-segments of the data indicate “healthy behavior”, and achieving health means continuously verifying that this is the case.

  1. Choosing the monitoring context – If we run different models on the same data, then to find anomalies resulting from the interaction of those models we must define a monitoring context and put all the required information in it. For example, suppose we have a system that runs text through language classification, categorization, and sentiment analysis models. When they are put in a single context, the underlying monitoring system can notify me about languages with lower category scores, or a specific category with lower sentiment confidence.
  2. Behavior metrics – Which metrics can tell me about the behavior of my model? For example, if we have a clustering algorithm, such metrics can be the number of clusters, their density, the average distance between clusters, the variance of cluster sizes, or even a tailored metric like the level of agreement of each cluster on a certain feature. Confidence intervals, intermediate results, and external measurements of the output can also serve as great metrics to track model behavior.
  3. Splitting data into sub-segments – Which information about the runs, like features or metadata, should be used to segment the model activations, and how do you segment when looking for granular anomalies? For example, if I have an age feature, should I segment it on a logarithmic or linear scale? Are there any special values (e.g. missing)? (See the sketch after this list for a concrete illustration of the last two items.)
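As a concrete illustration of items 2 and 3, here is a sketch that computes a few such behavior metrics for a clustering model and segments activations by an age feature on a log scale. The specific metric choices and the `age_segment` helper are illustrative assumptions, not part of any particular platform.

```python
# Illustrative behavior metrics for a clustering model, plus a simple
# log-scale segmentation of an "age" feature (hypothetical example).
import numpy as np
from sklearn.cluster import KMeans

def clustering_behavior_metrics(X: np.ndarray, n_clusters: int = 5) -> dict:
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    centers = km.cluster_centers_
    # Pairwise distances between cluster centers.
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    return {
        "n_clusters": n_clusters,
        "avg_center_distance": dists[np.triu_indices(n_clusters, k=1)].mean(),
        "cluster_size_variance": float(sizes.var()),
        "inertia": km.inertia_,  # rough proxy for overall cluster density
    }

def age_segment(age):
    """Bucket an age value on a log scale, with 'missing' as a special segment."""
    if age is None:
        return "missing"
    return f"age_1e{int(np.floor(np.log10(max(age, 1))))}"

if __name__ == "__main__":
    X = np.random.default_rng(0).random((1000, 8))
    print(clustering_behavior_metrics(X))
    print(age_segment(34), age_segment(None))  # -> age_1e1 missing
```

In a real setup these metrics would be computed per segment and over time, and their distributions compared against a stable baseline.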

Last but not least, defining health requires defining what is normal and what types of irregularities may indicate a real issue. For example, is a given feature expected to create a bias in a metric? A user’s geo-location is expected to bias the results of a language classifier, while age is not.
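As a rough, illustrative sketch of how such a bias check might look, one could compare a metric’s distribution inside a segment against the rest of the population; the metric and segment names below are invented for the example.

```python
# Test whether a segment of the data biases a behavior metric by comparing
# the metric's distribution inside the segment against everything else.
import numpy as np
from scipy.stats import ks_2samp

def segment_bias(metric: np.ndarray, in_segment: np.ndarray, alpha: float = 0.01):
    """Return (is_biased, p_value) for a boolean segment mask."""
    stat, p_value = ks_2samp(metric[in_segment], metric[~in_segment])
    return p_value < alpha, p_value

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    confidence = rng.beta(8, 2, size=5000)   # e.g. classifier confidence scores
    age_over_60 = rng.random(5000) < 0.2     # hypothetical age segment mask
    print(segment_bias(confidence, age_over_60))
```

An alerting layer would then suppress expected biases (like geo-location for a language classifier) and surface unexpected ones (like age).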

7. Since ML projects are so diverse, how is it possible to build a “general” platform for monitoring?

To me, the first question to answer is: how do you effectively convey the required knowledge of the ML system into the monitoring platform? In my opinion this requires a careful design of the appropriate configuration language. It should balance a challenging combination: On the one hand, such a language must be highly expressive and flexible, and be able to understand ML concepts like training phase, feature vectors and model versions; and on the other hand it should be clear, intuitive and easy to learn. But once this knowledge is there, a monitoring platform can serve as a robust and highly capable engineering arm for the data scientist.
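For a rough sense of the kind of knowledge such a configuration would capture, here is a purely hypothetical sketch written as a plain Python dict. The field names are invented for illustration and are not Mona’s actual configuration syntax.

```python
# Hypothetical monitoring configuration for the text pipeline example above:
# it names a context, the models and versions involved, the behavior metrics
# to track, the segmentation scheme, and the alerting policy.
monitoring_config = {
    "context": "text_analysis_pipeline",
    "models": {
        "language_classifier": {"version_field": "model_version"},
        "categorizer": {"version_field": "model_version"},
        "sentiment": {"version_field": "model_version"},
    },
    "phases": ["training", "inference"],
    "metrics": [
        {"name": "category_score", "type": "numeric"},
        {"name": "sentiment_confidence", "type": "numeric"},
    ],
    "segments": [
        {"feature": "language"},
        {"feature": "age", "buckets": "log", "special_values": ["missing"]},
    ],
    "alerts": [
        # e.g. alert when a metric's distribution drifts within any segment
        {"metric": "sentiment_confidence", "condition": "distribution_drift"},
    ],
}
```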

A second, more technical question is how to build a platform that translates this knowledge to provide the full monitoring suite that is tailored to the user – including data ingestion, investigation tools, intelligence tools and alerts, and benchmarking capabilities. Such an effort requires solid engineering capabilities to properly utilize resources to balance between scale, depth and user experience, and is a highly interesting big-data engineering problem.

8. There are a lot of open questions about the people and process involved in monitoring ML. In your opinion, who should own the monitoring piece?

I think there’s a huge organizational challenge in companies whose business relies on ML systems. Historically, and still today, it’s common to see ML developers who pass their models “over the fence” to production and consider their job done. While their model is being shipped to production, they often start working on new features or other models.

In other places, where the data science work is more highly integrated within the organization, I’ve seen various roles claim ownership of the algorithmic-level performance piece – most often it’s the head of data science, but in some places it could be the CTO or engineering lead, the CIO or DevOps manager, or the product manager/owner. The common denominator is that it’s the most senior person who gets woken up in the middle of the night when the algorithm misbehaves severely.

Lately a new trend is emerging in the form of Data Product Managers. I believe it’s the right framing of the required position, complementing the relatively new roles of Data Engineers (who connect the pieces) and Data Scientists (who develop ML systems or invent new algorithms).

9. What about planning? When should monitoring be planned for and who should be involved?

The best time to plan monitoring is right before launch. At this phase there’s enough knowledge about how the system should behave, but it’s not yet too late to prepare for things that may happen right after launching. At this phase a data scientist should consider scenarios that may occur, think about which metrics would show anomalies in each case, and in which data sub-segments. She should then work with the DevOps team to set up the monitoring she defined without spending effort on things outside of her expertise, ideally by utilizing good tools.

Later, after launch, there are usually a few iterations refining the monitoring setup, after which confidence is gradually gained and the monitoring system is left on its own to make sure it alerts when something bad is going on…

Which brings us to the most popular time at which teams actually start planning monitoring – after a ‘disaster’ has happened. If proper monitoring was lacking (which is usually the case), the justification for it unfortunately becomes obvious only after the model has misbehaved in a way that caused enough damage.

10. One challenge in managing data scientists is that they enjoy working on new problems. It’s hard to convince a data scientist to build a model and then monitor that model for as long as the system is in use. In terms of maintenance, how should teams set up workflows that allow the data science team to set up monitoring and then leave the model in the wild on its own? Is this possible?

I think data scientists are right not to want to spend their precious time ineffectively on engineering processes, and until very recently they simply had to. But I believe that if implementing monitoring were a two-hour process in which they think through how to defend their system and are able to express that using an appropriate language, it would be possible.

11. Final question. What recommendations do you have for companies that are deploying ML solutions today without monitoring? What’s the single lowest lift action they can take that will have the largest impact?

There are lots of different ways to invest in monitoring solutions to get different types of outcomes, and I guess it highly depends on the organization: how severe algorithmic issues may become, how stable their data is over time, and many more factors. My general advice for such a company would be:

  1. Don’t over-invest. Start a lean, iterative process and gradually add capabilities based on real pain or risk factors.
  2. Spend enough time planning in a data-driven way. This process shouldn’t take more than a few hours in which you prioritize your shortlist of most-worrisome risk factors and think of lean ways to be informed, but somehow it’s the most neglected phase of the process.
  3. Don’t build monitoring yourself. It’s sometimes very tempting to get exactly what you need with seemingly small effort, but the costs of such solutions over time usually become steep. It often takes just a few hours of researching possible solutions that could help you do what you need (based on your planning); otherwise, it’s well worth the price of getting professional advice early enough from someone you trust.

Conclusion

As companies continue to deploy and operationalize ML models, we will see new monitoring use-cases. If you’re responsible for production machine learning, regardless of whether you’re a data scientist or SRE, how is your team handling monitoring? What does your monitoring “stack” look like? I’d love to hear from you in the comments below.
