Batch Inference vs Online Inference

Introduction

You’ve spent the last few weeks training a new machine learning model. After working with the product team to define the business objectives, translating those objectives into appropriate evaluation metrics, and completing several rounds of iterative feature engineering, you’re ready to deploy version 1 of the model. I applaud your progress!

But how do you deploy your model so that it can be used by others and generate real value? A web search may point you to tutorials discussing how to stand up a Flask front-end that serves your model. But does that architecture actually fit your use case?

The first question you need to answer is whether you should use batch inference or online inference to serve your models. What are the differences between these approaches? When should you favor one over the other? And how does this choice influence the technical details of the model deployment? In the following sections we’ll answer each of these questions and provide real world examples of both batch and online inference.

Batch Inference

What is Batch Inference?

Batch inference, or offline inference, is the process of generating predictions on a batch of observations. The batch jobs are typically run on some recurring schedule (e.g. hourly, daily). These predictions are then stored in a database and can be made available to developers or end users. Batch inference may sometimes take advantage of big data technologies such as Spark to generate predictions, allowing data scientists and machine learning engineers to use scalable compute resources to generate many predictions at once.
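
To make this concrete, a minimal batch inference job can be little more than a script that a scheduler such as cron runs each night. The sketch below assumes a scikit-learn model saved with joblib, a CSV file of observations, and a SQLite table for the output; all of these names are hypothetical stand-ins for your own storage.

    # nightly_batch_inference.py -- a minimal sketch of a scheduled batch job.
    # Assumes a scikit-learn model saved with joblib; file, table, and column
    # names are hypothetical, and the CSV is assumed to contain only feature columns.
    import joblib
    import pandas as pd
    import sqlite3

    def run_batch_job():
        model = joblib.load("model.joblib")          # deserialize the trained model
        features = pd.read_csv("todays_batch.csv")   # the batch of observations to score
        features["prediction"] = model.predict(features)

        # Persist the predictions so developers and end users can look them up later.
        with sqlite3.connect("predictions.db") as conn:
            features.to_sql("predictions", conn, if_exists="append", index=False)

    if __name__ == "__main__":
        run_batch_job()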

What are the Benefits of Batch Inference?

Batch inference affords data scientists several benefits. Since latency requirements are typically on the order of hours or days, latency is rarely a concern, which leaves data scientists free to use tools like Spark to generate predictions on large batches. Even when Spark or related technologies aren’t necessary, the infrastructure requirements for batch inference are simpler than those for online inference. For instance, rather than expose a trained model through a REST API, data scientists writing batch inference jobs may be able to simply deserialize the trained model on the machine that performs the batch inference. The predictions generated during batch inference can also be analyzed and post-processed before being seen by stakeholders.
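
When a batch really is large enough to warrant Spark, one common pattern is to broadcast the trained model to the executors and score each partition with a pandas UDF. This is a hedged sketch rather than a prescribed architecture: the model is assumed to be a picklable scikit-learn estimator, and the paths and column names are made up.

    # Sketch: distributed batch scoring with Spark. Paths and column names are
    # hypothetical; the model is assumed to fit in memory on each executor.
    import joblib
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("batch-scoring").getOrCreate()
    model = spark.sparkContext.broadcast(joblib.load("model.joblib"))  # ship the model to executors

    @pandas_udf(DoubleType())
    def score(feature_1: pd.Series, feature_2: pd.Series) -> pd.Series:
        # Each executor scores its partition locally with the broadcast model.
        features = pd.DataFrame({"feature_1": feature_1, "feature_2": feature_2})
        return pd.Series(model.value.predict(features))

    observations = spark.read.parquet("s3://my-bucket/observations/")
    scored = observations.withColumn("prediction", score("feature_1", "feature_2"))
    scored.write.mode("overwrite").parquet("s3://my-bucket/predictions/")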

What Challenges does Batch Inference Present?

While batch inference is simpler than online inference, this simplicity does present challenges. Most obviously, predictions generated in batch are not available in real time, which means they may not be available at all for new data. One example of this is a variation of the cold start problem. Say a new user signs up for a service like Netflix. If recommendations are generated in batch each night, the user will not be able to see personally tailored recommendations upon first signing up. One way to get around this problem is to serve that user recommendations from a model trained on similar users. For instance, the user may see recommendations generated for other users in the same age bracket or geographic location. The drawback of this approach is that there are more models to build, deploy, monitor, etc.
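
One way to picture that fallback: look up the user’s own batch-generated recommendations first, and fall back to recommendations precomputed for the user’s cohort when none exist yet. The dictionaries below are hypothetical stand-ins for whatever store holds the batch output.

    # Sketch of a cold-start fallback. In a real system these lookups would hit
    # a database or cache populated by the nightly batch job; plain dicts are
    # used here only to show the control flow.
    user_recommendations = {"user_123": ["title_a", "title_b"]}          # per-user batch output
    cohort_recommendations = {("18-24", "US"): ["title_c", "title_d"]}   # per-cohort batch output

    def get_recommendations(user_id, age_bracket, country):
        personalized = user_recommendations.get(user_id)
        if personalized:
            return personalized
        # Brand new user: no batch predictions exist yet, so fall back to the
        # recommendations generated for similar users.
        return cohort_recommendations.get((age_bracket, country), [])

    print(get_recommendations("brand_new_user", "18-24", "US"))  # -> ['title_c', 'title_d']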

Real World Examples of Batch Inference

In my last post I described lead scoring as a machine learning application where batch inference could be utilized. To reiterate that example: suppose your company has built a lead scoring model to predict whether new prospective customers will buy your product or service. The marketing team asks for new leads to be scored within 24 hours of entering the system. We can perform inference each night on the batch of leads generated that day, guaranteeing that leads are scored within the agreed-upon time frame.
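
Here is a hedged sketch of that nightly job, assuming leads live in a table with a created_at column and that the lead scoring model is a scikit-learn classifier saved with joblib; the table, column, and file names are illustrative.

    # Sketch: score the leads that entered the system in the last 24 hours so
    # the marketing team's 24-hour window is met. Names are hypothetical.
    from datetime import datetime, timedelta
    import joblib
    import pandas as pd
    import sqlite3

    FEATURE_COLUMNS = ["company_size", "industry_code", "pages_visited"]

    def score_new_leads():
        cutoff = datetime.utcnow() - timedelta(days=1)
        with sqlite3.connect("crm.db") as conn:
            leads = pd.read_sql(
                "SELECT * FROM leads WHERE created_at >= ?",
                conn,
                params=[cutoff.isoformat()],
            )
            model = joblib.load("lead_scoring_model.joblib")
            # Probability that each lead converts, written back for the marketing team.
            leads["score"] = model.predict_proba(leads[FEATURE_COLUMNS])[:, 1]
            leads[["lead_id", "score"]].to_sql("lead_scores", conn, if_exists="append", index=False)

    if __name__ == "__main__":
        score_new_leads()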

Or consider product recommendations on an ecommerce site like Amazon. Rather than generate new predictions each time a user logs on to Amazon, data scientists may decide to generate recommendations for users in batch and then cache these recommendations for easy retrieval when needed. Similarly, if you’re developing a service like Netflix where you recommend viewers a list of movies, it may not make sense to generate recommendations each time a user logs on. Instead, you might generate these recommendations in batch fashion on some recurring schedule.
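
As a sketch of this “generate in batch, cache for retrieval” pattern, the batch job could write each user’s recommendations to a key-value store such as Redis, and the application reads them back with a single lookup when the user logs on. Redis and the key naming scheme here are assumptions for illustration, not a claim about how Amazon or Netflix actually serve recommendations.

    # Sketch: cache batch-generated recommendations for cheap lookup at request
    # time. Assumes a local Redis instance; key names are made up.
    import json
    import redis

    cache = redis.Redis(host="localhost", port=6379)

    def store_recommendations(user_id, recommendations):
        # Called by the nightly batch job for each user it scores; expire after a day.
        cache.set(f"recs:{user_id}", json.dumps(recommendations), ex=60 * 60 * 24)

    def fetch_recommendations(user_id):
        # Called by the application when the user logs on.
        cached = cache.get(f"recs:{user_id}")
        return json.loads(cached) if cached else []

    store_recommendations("user_123", ["product_a", "product_b"])
    print(fetch_recommendations("user_123"))  # -> ['product_a', 'product_b']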

Note: I’m not sure whether Amazon and Netflix generate recommendations using batch or online inference. But these are examples where batch inference could be used.

Online Inference

What is Online Inference?

Online inference, also known as real-time inference or dynamic inference, is the process of generating machine learning predictions in real time upon request. Typically, these predictions are generated on a single observation of data at runtime, and they may be requested at any time of the day.

What are the Benefits of Online Inference?

Online inference allows us to take advantage of machine learning models in real time. This opens up an entirely new space of applications that can benefit from machine learning. Rather than wait hours or days for predictions to be generated in batch, we can generate predictions as soon as they are needed and serve them to users right away. Online inference also allows us to make predictions for brand new data, e.g. generating recommendations for new users as soon as they sign up.

What Challenges does Online Inference Present?

Typically, online inference faces more challenges than batch inference. Online inference tends to be more complex because of the added tooling and systems required to meet latency requirements. A system that needs to respond with a prediction within 100 ms is much harder to implement than a system with a service-level agreement of 24 hours: in those 100 ms, the system needs to retrieve any necessary data to generate predictions, perform inference, validate the model output (especially when this output is being sent to end users), and then (typically) return the results over a network. These technical challenges require data scientists to intimately understand:

  1. The process for retrieving features necessary for predictions – In many cases the data required for a prediction will be stored in multiple places. For instance, a prediction may require user data that is stored in a data warehouse. If the query to retrieve this data takes multiple seconds to return, the data may need to be cached for quicker retrieval, which requires additional technologies.

  2. The machine learning algorithm – Algorithms differ in the number of operations required to generate a single prediction. A deep recurrent neural network requires many more operations to perform inference than a logistic regression. Latency requirements may dictate the use of simpler models.

  3. Model outputs – Models may generate invalid predictions. For instance, if a regression model predicting housing prices generates a negative value, the inference service should have a policy layer that acts as a safeguard. This requires the data scientist to understand the potential flaws of the model outputs.

  4. Web technologies – If the model is exposed as a REST API, the data scientist should have at least a basic understanding of REST, the HTTP protocol (when to use GET vs. PUT vs. POST, etc.), and the client-server network model, not to mention the framework used to expose the model. A short sketch covering this point and the previous one follows this list.
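
To tie points 3 and 4 together, here is a minimal, hedged sketch of a model exposed as a REST endpoint with a simple policy layer over its output. Flask is used only because it was mentioned earlier; the model file, feature names, and the clamping rule are all hypothetical.

    # Minimal sketch of an online inference endpoint with a policy layer.
    # The model file, feature names, and validation rule are hypothetical.
    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("house_price_model.joblib")  # load once at startup, not per request

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()
        # 1. Assemble the features needed for the prediction (possibly from a cache).
        features = [[payload["square_feet"], payload["num_bedrooms"]]]
        # 2. Perform inference on the single observation.
        predicted_price = float(model.predict(features)[0])
        # 3. Policy layer: never return an invalid (negative) price to the user.
        predicted_price = max(predicted_price, 0.0)
        # 4. Return the result over the network as JSON.
        return jsonify({"predicted_price": predicted_price})

    if __name__ == "__main__":
        app.run(port=5000)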

Additionally, online inference systems require robust monitoring solutions. Data scientists should monitor the distributions of both the input data and the generated predictions to ensure that they remain similar to the distributions seen during training. If these distributions differ, it could mean that an error has occurred somewhere in the data pipeline, or that the underlying processes generating the data have changed. This concept is known as model drift. If model drift occurs, you will have to retrain your models to take the new samples into account.
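
One simple way to watch for this, offered here as a sketch rather than a full monitoring solution, is to compare the live distribution of a feature (or of the predictions themselves) against the training distribution with a two-sample test; the significance threshold and the synthetic data below are purely illustrative.

    # Sketch: flag a feature whose live distribution has drifted away from the
    # distribution seen at training time. The 0.01 threshold is arbitrary.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    training_values = rng.normal(loc=0.0, scale=1.0, size=10_000)  # stand-in for training data
    live_values = rng.normal(loc=0.5, scale=1.0, size=1_000)       # stand-in for recent requests

    statistic, p_value = ks_2samp(training_values, live_values)
    if p_value < 0.01:
        print(f"Possible drift detected (KS statistic={statistic:.3f}, p={p_value:.4f})")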

Real World Examples of Online Inference

Basically, online inference needs to be performed whenever predictions are required in real time. For instance, an UberEats estimated-time-to-delivery is generated whenever a user orders food through the service. It would not be possible to generate a batch of these estimates and then serve them to users. Imagine not knowing how long it will take to receive your order until after the food arrives. Other examples of applications that can benefit from online inference are augmented reality, virtual reality, human-computer interfaces, self-driving cars, and any consumer facing apps that allow users to query models in real time.

We mentioned recommendation systems earlier as examples where inferences may be generated in batch. Depending on the use case, recommendations may also be served in online fashion. For instance, if a web application provides recommendations for new travel destinations based on form input, online inference is required to serve the recommendations within a web response.

Conclusion

One of the first questions you’ll need to answer when deciding how to deploy your machine learning models is whether to use batch inference or online inference. This choice is mainly driven by product factors: who is using the inferences and how soon do they need them? If the predictions do not need to be served immediately, you may opt for the simplicity of batch inference. If predictions need to be served on an individual basis and within the time of a single web request, online inference is the way to go.

What are cases where you’ve served your models using batch inference or online inference? What helped you make your decision? I’d love to hear about your experiences in the comments below.

