This is post 4 in my Ultimate Guide to Deploying Machine Learning Models. You can find the other posts in the series here.
In our previous post on machine learning deployment we discussed deploying models for batch inference. We described when batch inference is suitable, created a basic implementation using Python and cron, and mentioned several workflow tools for scheduling batch inference jobs in production workflows.
We also described several use cases where batch inference shouldn't be used to deploy ML models. In those situations, models need to serve predictions synchronously as requests arrive, in near real-time. This is the domain of online inference.
Online inference is considerably more complex than batch inference, primarily due to the latency constraints placed on systems that need to serve predictions in near-real time. Before implementing online inference or mentioning any tools, I think it’s important to cover specific challenges practitioners will face when deploying models in an online inference scheme.
To demonstrate these challenges we'll examine the system Uber designed to serve predictions to its UberEats customers. I also learned these lessons firsthand by failing on ML projects in the real world. My hope is that you can avoid repeating my mistakes by learning from this experience.
Challenges of Deploying Machine Learning Models for Online Inference
Deploying machine learning models for online inference is considerably more challenging than deploying models for batch inference. This difficulty arises from the latency restrictions of our system. Systems that must return predictions within a few hundred milliseconds can tolerate much less error than batch prediction systems that predict new samples once an hour, day, or week.
But these stricter latency requirements impose several non-obvious constraints on our ML systems. These issues are rarely discussed outside of conference papers or the communications of teams doing production machine learning, but it's imperative that any ML practitioner be aware of them.
Optimizing Feature Engineering for Online Inference
Data scientists spend the majority of their time building and testing different features for their machine learning models. But the environment in which this code is written and tested is very different from the environment in which models must operate for online inference.
For example, a feature engineering step is often written for and applied to a batch of samples at a time during exploratory development. But the constraints during development are considerably less stringent than those for online inference, where the same logic must run on an individual sample within a tight latency budget.
This logic might need to be optimized because it runs too slowly at inference time. But the data scientists writing the feature engineering steps might not be equipped to write this optimized logic, especially if the production stack requires a different programming language. For instance, I’ve known some large companies where data scientists develop models in Python but these models need to be served in Java or C.
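To make the gap concrete, here's a minimal sketch of the same (hypothetical) feature computed two ways: a vectorized pandas version typical of exploratory notebooks, and a single-request version of the kind an online service needs. The column and field names are illustrative, not taken from any particular system.

```python
import pandas as pd

# Batch version, typical of exploratory work: vectorized over a whole DataFrame.
def add_price_ratio_batch(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["price_ratio"] = out["item_price"] / out["category_avg_price"]
    return out

# Online version: operates on a single request payload with no DataFrame overhead,
# which matters when the feature must be computed within a tight latency budget.
def price_ratio_online(request: dict) -> float:
    return request["item_price"] / request["category_avg_price"]
```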
One solution to this problem is organizational: have one team responsible for prototyping models and another team responsible for deploying the most performant models. The prototyping team’s goal is to build the most performant model possible; the quality of their code in terms of extensibility and reliability is secondary. It’s the deployment team’s job to convert the experimental code into a high-quality and well-tested codebase.
Therefore the deployment team is responsible for optimizing the experimental feature generating logic. While this organizational specialization takes advantage of different expertise, the "handoff" between teams introduces its own set of problems.
For example, the feature engineering logic produced by the deployment team must exactly match that of the experimental team, which requires thorough parity tests between the two implementations (see the sketch below). I won't go any deeper into this organizational issue here, but suffice it to say it isn't a panacea.
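One way to keep the two implementations honest is a parity test that replays historical requests through both code paths and asserts the outputs match. Here's a minimal sketch assuming hypothetical `experimental_features` and `production_features` modules that each expose a `compute_features` function:

```python
import numpy as np

import experimental_features  # hypothetical: the data scientists' original logic
import production_features    # hypothetical: the deployment team's optimized port


def test_feature_parity(sample_requests):
    """Replay a fixture of historical requests through both implementations."""
    for request in sample_requests:
        expected = experimental_features.compute_features(request)
        actual = production_features.compute_features(request)
        # Fail loudly if the optimized port drifts from the experimental logic.
        np.testing.assert_allclose(actual, expected, rtol=1e-6)
```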
Generating Features From Multiple Data Sources
Another challenge that latency constraints impose on models during online inference is the complexity required to generate features from different data sources. Some features can be generated directly from the data available in the request. Consider an image classification model deployed behind a web API. "Feature generation" is simple in this case because the API request contains all of the data necessary for prediction, namely, the image to be classified.
This situation is considerably more complex when the model needs additional data that isn't available in the request payload. The application supporting online inference must fetch this data from other data sources, often a variety of disparate relational and non-relational databases. If these databases aren't optimized for querying individual records (as is the case for most big data systems), teams will need to set up processes to precompute and index the necessary model features.
To illustrate this challenge, consider the estimated time-to-delivery feature of the UberEats app. Each time a hungry user orders food from a restaurant, a machine learning model estimates when the food will be delivered. Quoting directly from the article: "Features for the model include information from the request (e.g., time of day, delivery location), [as well as] historical features (e.g. average meal prep time for the last seven days), and near-real time calculated features (e.g., average meal prep time for the last one hour)."
The article continues: "Models that are deployed online cannot access data stored in HDFS, and it is often difficult to compute some features in a performant manner directly from the online databases that back Uber’s production services (for instance, it is not possible to directly query the UberEATS order service to compute the average meal prep time for a restaurant over a specific period of time).
Instead, we allow features needed for online models to be precomputed and stored in Cassandra where they can be read at low latency at prediction time." This situation is illustrated in the following picture.
Setting up such a system is almost certainly outside most data scientists' wheelhouses. Realistically, an entire team of data engineers is needed to maintain, monitor, and administer this pipeline. If you're building models to be served in an online manner, it's imperative to consider the cost of generating complex feature sets.
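To illustrate the retrieval pattern itself (not Uber's actual implementation), here's a minimal sketch of request-time feature assembly that combines fields from the request payload with precomputed features read from a low-latency key-value store. I'm assuming a Redis-style store and illustrative key and feature names; Uber's system uses Cassandra for this role.

```python
import redis  # assumption: a Redis-style key-value store holds the precomputed features

feature_store = redis.Redis(host="localhost", port=6379, decode_responses=True)

def build_feature_vector(request: dict) -> dict:
    """Combine request-time features with precomputed features fetched at low latency."""
    restaurant_id = request["restaurant_id"]

    # Features available directly in the request payload.
    features = {
        "hour_of_day": request["hour_of_day"],
        "delivery_distance_km": request["delivery_distance_km"],
    }

    # Precomputed historical and near-real-time features, written to the store
    # by a batch job and a streaming job respectively (hypothetical key names).
    features["avg_prep_time_7d"] = float(
        feature_store.get(f"restaurant:{restaurant_id}:avg_prep_time_7d") or 0.0
    )
    features["avg_prep_time_1h"] = float(
        feature_store.get(f"restaurant:{restaurant_id}:avg_prep_time_1h") or 0.0
    )
    return features
```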
A/B Testing Models for Online Inference
How do you know if your new model will add more value than a previously deployed model? Offline metrics like accuracy, F1 score, and root mean squared error may be uncorrelated with product metrics like user engagement or member retention.
The metrics used to evaluate machine learning models in offline experiments are rarely the same as the metrics the business cares about. Consider a recommender model. Data scientists might optimize metrics like precision@k or mean average precision (MAP) on a held-out dataset. But the product team cares about metrics such as user engagement with the product. We'd like to think that improving a model as measured by the offline metrics will lead to corresponding gains in product metrics, but this is not always the case (see lesson 2).
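As a concrete example of an offline metric, here's a minimal sketch of precision@k, the fraction of the top-k recommended items that the user actually interacted with in held-out data. A lift in this number says nothing, by itself, about whether users will engage more with the product.

```python
def precision_at_k(recommended_items: list, relevant_items: set, k: int = 10) -> float:
    """Fraction of the top-k recommendations that appear in the user's relevant items."""
    top_k = recommended_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k
```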
In order to measure the impact of a new model on the business metrics that matter, data scientists must augment their evaluation strategy by running statistical A/B tests. To run an A/B test, the user population is randomly split into statistically identical groups, each of which experiences a different algorithm.
After a given period of time, measured differences in product metrics reveal which algorithm led to the greatest increase in the business metrics. Thus the offline performance of a machine learning model is just the first step in an evaluation pipeline. As Xavier Amatriain puts it, offline performance is an indication of which models are worth promoting to follow-up A/B tests (see slide 46).
The following image is a flow chart describing the two-part process for evaluating a model in offline and online experiments.
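To make the online half of that pipeline concrete, here's a minimal sketch of a two-proportion z-test comparing, say, click-through rates between the control and treatment groups. The metric, sample sizes, and significance threshold are illustrative assumptions.

```python
import math

def two_proportion_ztest(successes_a: int, n_a: int,
                         successes_b: int, n_b: int) -> tuple:
    """Two-sided z-test comparing the conversion rate of control (A) and treatment (B)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Example: 1,200 clicks out of 50,000 impressions (control) vs 1,320 out of 50,000 (treatment).
z, p = two_proportion_ztest(1200, 50_000, 1320, 50_000)
```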
Rollout Strategies for Online Inference
Implementing the online experiments described in the previous challenge is by no means simple. It requires additional infrastructure to reliably split incoming traffic, route requests to the different competing models, and carefully store the outputs so that experimental results can be analyzed. Data scientists should be responsible for setting up and analyzing experiments, not for configuring and maintaining the requisite infrastructure.
Data scientists should be able to demo new machine learning models on subsets of the user population before rolling the models out to the entire population. In Building Intelligent Systems, Geoff Hulten describes 4 ways of deploying intelligence to end users:
- Single Deployment – Intelligence is deployed to all users at once. This is the simplest approach but is problematic when the system makes high-cost mistakes.
- Silent Intelligence – Also known as "shadow mode," silent intelligence runs new models in parallel to existing models, but users do not see predictions of the new model. This provides an extra check on the quality of model predictions, but you don’t get to see how users react to the model.
- Controlled Rollout – Serves outputs from the new model to a small subset of users while the majority continue to see the previous model. Data from these interactions is used to decide when to expose the new model to more users. This limits the downside of an ineffective new model but is more complex to implement (see the sketch after this list).
- Flighting – Serves different models to statistically identical groups of users, allowing you to run online A/B tests and measure which model performs best.
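Here's a minimal sketch of the kind of deterministic traffic splitting a controlled rollout (or flighting) relies on, using hash-based bucketing. The rollout percentage and model names are illustrative assumptions.

```python
import hashlib

ROLLOUT_PERCENTAGE = 5  # hypothetical: expose the new model to 5% of users

def assign_model(user_id: str) -> str:
    """Deterministically assign a user to the new or previous model."""
    # Hashing the user id gives a stable bucket in [0, 100), so a given user
    # always sees the same model, and raising ROLLOUT_PERCENTAGE gradually
    # exposes more users without reshuffling existing assignments.
    bucket = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return "new_model" if bucket < ROLLOUT_PERCENTAGE else "previous_model"
```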
Model Monitoring
Once a model has been deployed its behavior must be monitored. A model’s predictive performance is expected to degrade over time as the environment changes. This phenomenon, known as concept drift, occurs when the distributions of the input features or output target shift away from the distribution upon which the model was originally trained.
Once concept drift has been detected, it can be alleviated by retraining the machine learning models. But detecting drift through monitoring is difficult, especially when the ground truth signal is observed days, weeks, or months after the prediction is produced.
One monitoring strategy is to track a proxy metric from the deployed model that can be measured continuously. For instance, tracking the distribution of the model's predictions can help detect drift even when ground truth labels aren't available in a timely manner. The observed distribution can be compared to the output distribution seen during training, and alerts can notify data scientists when the two diverge.
The challenge here, of course, is setting up and maintaining the required infrastructure. This is made even more difficult by the lack of available tooling. The only tools I've seen, across both open source projects and vendors, are from Seldon and Amazon SageMaker. We should expect to see more players enter the arena, since deploying models to production without appropriate monitoring in place can be a recipe for disaster.
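As a rough sketch of that idea, the snippet below compares the training-time prediction distribution with a recent window of live predictions using a two-sample Kolmogorov–Smirnov test. The choice of test and the alert threshold are illustrative assumptions, not a prescribed standard.

```python
import numpy as np
from scipy import stats

def check_output_drift(training_preds: np.ndarray,
                       recent_preds: np.ndarray,
                       alpha: float = 0.01) -> bool:
    """Flag drift when the recent prediction distribution diverges from training."""
    result = stats.ks_2samp(training_preds, recent_preds)
    # A small p-value means the two samples are unlikely to come from the
    # same distribution; True here means "alert the data scientists".
    return result.pvalue < alpha
```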
Conclusion
In general, deploying models for online inference is considerably more complex than the batch inference case. Aside from constraining the choice of learning algorithm, latency constraints imposed by online inference heavily influence the feature generation and data retrieval process. And describing the process of deploying new model versions as "non-trivial" is a laughable understatement. Each of these issues is compounded by the relative lack of mature infrastructure and tooling.
Now that we’re aware of some of the challenges of online inference, we can discuss how to implement online inference. My next post in the series will do just that.
What are your thoughts on this series so far? What other sub-topics would you want me to explore? Shoot me an email at luigi at mlinproduction.com or @ me on Twitter @MLinProduction with your thoughts!