Batch Inference for Machine Learning Deployment (Deployment Series: Guide 03)


This is post 3 in my Ultimate Guide to Deploying Machine Learning Models. You can find the other posts in the series here.

In our previous post on machine learning deployment we designed a software interface to simplify deploying models to production. In this post we’ll examine how to use that interface along with a job scheduling mechanism to deploy ML models to production within a batch inference scheme. Batch inference allows us to generate predictions on a batch of samples, usually on some recurring schedule.

We’ll describe when it is and isn’t suitable to deploy models in a batch inference scheme, how to implement batch inference using Python and cron, and how to upgrade that implementation using production-ready workflow managers. I’ve also created a downloadable tutorial demonstrating how to deploy batch inference using Google Composer, a managed Airflow service on Google Cloud Platform.

Let’s dive in!



When is Batch Inference Required?

In the first post of this series I described a few examples of how end users or systems might interact with the insights generated from machine learning models.

One example was building a lead scoring model whose outputs would be consumed by technical analysts. These analysts, who are capable of querying a relational database using SQL, would like to see lead scores each morning for all of the leads generated during the previous day.

This is a perfect example of batch inference because 1) predictions can be generated on a batch of samples, i.e. leads generated during the previous day, and 2) the predictions need to be generated once a day. In this case, deploying the lead scoring model means scheduling a daily batch job. This job will retrieve the new leads, deserialize a trained model, generate lead scores, and then persist these predictions to a database the analysts can access.

Another example from that post involved an ecommerce company emailing its customers product recommendations. These marketing emails are delivered to customers on Monday afternoons and Friday mornings in the customers’ local time zones and contain 5 recommendations each.

This is another use-case where batch inference can be utilized. Since recommendations must be generated for all customers, the batch of samples consists of all existing customers. The predictions can be generated twice a week, once before the Monday afternoon emails and once before the Friday morning emails. Since the emails are delivered to customers in their local time zones, we just need to ensure that the jobs complete before the earliest emails are composed.

It’s important to realize that there can be multiple ways to compose the batches of samples. In the previous example we could have defined batches by time zone or some other customer segment. This might actually make more sense if we built a separate collaborative filtering model per customer segment.

When to run the batch inference is also a free parameter. Rather than generate predictions twice a week, we could run the batch inference process once a week, pick the top 10 recommendations for each user, and split these 10 recommendations across the 2 emails. The tradeoff is that recommendations might become stale for very active shoppers.

Use batch inference to deploy ML models whenever predictions can be generated asynchronously on a batch of input samples, especially when predictions are needed on intervals longer than an hour.

When is Batch Inference Not Suitable?

We can take the product recommendation model from our previous example and imagine a situation in which deployment cannot be handled through batch inference. Suppose the ecommerce company wishes to display product recommendations to users on the company’s web or mobile application.

The Product team wants to serve customers recommendations in several different places on both applications and wants to generate these predictions using the users’ most recent activities, including recently viewed pages and search terms. The Product team’s request places several constraints on how our predictions should be generated, which in turn affect our deployment process.

First, recommendations must be available to different applications. This implies that our deployment process shouldn’t package the model with either the mobile or web application. Packaging the model with one of the applications would prevent the other app from easily accessing recommendations. Updating the model after retraining would also be difficult.

Second, recommendations should factor in users’ most recent activities. This constraint prevents us from using cached predictions generated on some recurring schedule. One way around this is to precompute recommendations more frequently, say every hour instead of every day, but this scheme would miss the last hour’s worth of activity for active users. And that’s if every batch inference job succeeds, which is never the case.

Anyone who runs batch jobs knows that jobs fail all the time.

Last but not least, there is a pressing latency requirement on predictions. Since users will be served recommendations on various mobile and web pages, predictions need to be generated on sub-second time scales to prevent the negative user experience of pages that render too slowly. Latency requirements are usually the top reason that models can’t be deployed using a batch inference scheme.

Batch inference isn’t suitable for deploying ML models when latency constraints require predictions in (near) real time.

Implementing Batch Inference for Machine Learning

At a bare minimum, implementing batch inference involves two components: the application logic and a mechanism to schedule and kick off that logic. In this section we’ll create a pseudocode implementation of batch inference using a simple Python script as our application logic and cron as our job scheduler.

Python Code for Batch Inference

Let’s create a script run_batch_inference.py that performs inference:

import argparse
import logging

# These helpers are assumed to live elsewhere in the project: Model implements
# the interface from the previous post in this series (from_remote, preprocess,
# predict_batch), while get_raw_inference_data and write_to_db handle reading
# raw samples from and writing predictions to the database. The module names
# here are placeholders.
from model import Model
from data import get_raw_inference_data, write_to_db

logging.basicConfig(level=logging.INFO)


def run_batch_inference(remote_model_path):
    '''
    Generate and store a batch of predictions.

    Args:
    ---
    remote_model_path - Path to serialized model stored on remote object store.
    '''
    logging.info('Running batch inference.')
    raw_data = get_raw_inference_data()

    logging.info('Retrieving serialized model from {}.'.format(
        remote_model_path))
    model = Model.from_remote(remote_model_path)
    X = model.preprocess(raw_data)
    predictions = model.predict_batch(X)

    logging.info('Writing predictions to database.')
    write_to_db(raw_data, predictions)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Retrieve arguments for batch inference.')
    parser.add_argument('--remote_model_path', type=str, required=True,
                        help='Remote path to serialized model.')
    args = parser.parse_args()
    run_batch_inference(remote_model_path=args.remote_model_path)

This pseudocode contains all of the basic pieces we need to implement batch inference.

The logic for batch inference is encapsulated in the run_batch_inference() function. We first retrieve the raw input data for prediction. This logic usually retrieves data from a database and can be parameterized as needed. For instance, in our batch lead scoring example this function might accept a date and return all leads that were created on that date. In our batch recommender system example, get_raw_inference_data might accept a timezone and query users by timezone.
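As a rough illustration, a date-parameterized version of get_raw_inference_data() for the lead scoring example might look something like the sketch below. The leads table, column names, and connection string are placeholders for your own database, not part of the actual implementation.

import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical connection string and table; swap in your own database details.
ENGINE = create_engine('postgresql://user:password@db-host:5432/crm')


def get_raw_inference_data(created_date):
    '''Return all leads created on the given date.'''
    query = text('SELECT * FROM leads WHERE created_at::date = :created_date')
    with ENGINE.connect() as conn:
        return pd.read_sql(query, conn, params={'created_date': created_date})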

Next we retrieve the deployed model and load it into memory. How can this be done? One way is to write a system that persists a fitted model to a remote object store like S3.

In my previous post on deployment, I described how a Model class could be used to define interfaces that perform ML workflow tasks. One such method, to_remote(), serializes a trained model and uploads it to a remote object store like S3 or Google Cloud Storage. That method returns the path to the serialized model, which we can pass to run_batch_inference.py to retrieve the model and load it into memory.
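To make that concrete, here’s a rough sketch of what those two methods might look like under the hood, using joblib for serialization and boto3 to move the artifact to and from S3. The use of joblib, the bucket/key layout, and the path parsing are assumptions for illustration; the previous post defines the actual interface.

import io

import boto3
import joblib


class Model:
    # ... training, preprocess, and predict_batch methods omitted ...

    def to_remote(self, bucket, key):
        '''Serialize this fitted model, upload it to S3, and return its path.'''
        buffer = io.BytesIO()
        joblib.dump(self, buffer)
        buffer.seek(0)
        boto3.client('s3').upload_fileobj(buffer, bucket, key)
        return 's3://{}/{}'.format(bucket, key)

    @classmethod
    def from_remote(cls, remote_model_path):
        '''Download a serialized model from S3 and load it into memory.'''
        bucket, key = remote_model_path.replace('s3://', '').split('/', 1)
        buffer = io.BytesIO()
        boto3.client('s3').download_fileobj(bucket, key, buffer)
        buffer.seek(0)
        return joblib.load(buffer)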

The next 2 lines process the raw data and generate predictions. Again, I’m relying on the existence of interface methods for accomplishing these tasks.

Finally, the write_to_db() function is responsible for writing the predictions to a database. I passed both raw_data and predictions to that function because raw_data usually contains necessary metadata such as ID fields. These could be the IDs of the leads in our lead scoring example or the IDs of the users we’re generating recommendations for in our recommender example.
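A minimal write_to_db() for the lead scoring example might look like the sketch below, assuming raw_data carries a lead_id column and predictions is an array of scores. The lead_scores table and connection details are placeholders.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; swap in your own database details.
ENGINE = create_engine('postgresql://user:password@db-host:5432/crm')


def write_to_db(raw_data, predictions):
    '''Persist predictions alongside the IDs they belong to.'''
    scores = pd.DataFrame({
        'lead_id': raw_data['lead_id'],
        'score': predictions,
        'scored_at': pd.Timestamp.utcnow(),
    })
    scores.to_sql('lead_scores', ENGINE, if_exists='append', index=False)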

Scheduling & Running the Batch Inference Logic

Now that we have the Python code that runs batch inference, we need to schedule that code to run on a recurring basis. There are many ways to do this, including using a job scheduler, Kubernetes CronJobs, and more. In this section we’ll schedule the jobs to run in the simplest manner possible: using cron. Cron is a time-based job scheduler used to run jobs periodically at fixed times.

Although it’s very easy to schedule jobs using cron, it lacks support for functionality such as automatically rerunning failed jobs, sending notifications, and more. We’ll describe alternatives to cron that provide these capabilities in the next section.

To use cron we first need to determine how frequently and when to run batch inference. We then convert this schedule to a cron expression and add a line to a crontab file with the cron expression and the command to run.

Let’s say we want to run our lead scoring batch inference job each morning at 3 am. The cron expression for this schedule is 0 3 * * *, so we’d add a line to the crontab that looks like the following:

0 3 * * * python3 run_batch_inference.py --remote_model_path <remote-model-path>

To generate product recommendations each Monday and Friday morning at 5 am, we’d add the following line to the crontab:

0 5 * * 1,5 python3 run_batch_inference.py --remote_model_path <remote-model-path>

We’ve deployed our models to run batch inference! The run_batch_inference.py script will run at the desired times, producing the predictions we need to add real business value to our organization.

Tools for Scheduling ML Batch Inference Jobs

In the previous sections we put together a basic batch inference implementation and deployment using a Python script and cron. This minimal tool set would trigger batch inference on a recurring basis, but it lacks a lot of extra functionality such as monitoring, automatic retries, and failure notifications. To the uninitiated these might seem like nice-to-haves, but if you’ve run batch jobs before, you know this additional functionality is required. Here I’d like to mention some tools that enable you to deploy batch inference in a performant and fault-tolerant manner.

Prefect

Prefect is a workflow management system that takes your code and transforms it into a robust, distributed pipeline. It comes with a complete UI for jobs, remote execution clusters, and automatic scheduling. Prefect makes it super easy to add notifications and alerts, and the documentation contains an extensive set of examples.
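As a rough illustration, here’s a minimal sketch of how the run_batch_inference() function from earlier could be wrapped in a Prefect flow with retries and a daily cron schedule. This assumes Prefect 1.x (prefect-core); the schedule API differs in later Prefect releases, and the flow name and model path are placeholders.

from datetime import timedelta

from prefect import Flow, task
from prefect.schedules import CronSchedule

# Assumes the run_batch_inference.py script from earlier is importable.
from run_batch_inference import run_batch_inference


@task(max_retries=3, retry_delay=timedelta(minutes=5))
def batch_inference_task(remote_model_path):
    # Reuse the same application logic; Prefect adds retries, scheduling,
    # and monitoring around it.
    run_batch_inference(remote_model_path=remote_model_path)


with Flow('lead-scoring-batch-inference', schedule=CronSchedule('0 3 * * *')) as flow:
    # Placeholder path to the serialized model on object storage.
    batch_inference_task('s3://my-bucket/models/lead_scoring/model.joblib')


if __name__ == '__main__':
    flow.run()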

Apache Airflow

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. Airflow features a rich user interface that makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. Airflow started at Airbnb in October 2014, so it’s a fairly mature tool with a large user base.
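To make this concrete, here’s a minimal sketch of an Airflow DAG that runs the run_batch_inference() function on the same daily 3 am schedule as our cron example. The DAG and task IDs and the model path are placeholders, and note that the PythonOperator import path moved between Airflow 1.10 and 2.x.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumes the run_batch_inference.py script from earlier is importable.
from run_batch_inference import run_batch_inference

with DAG(
    dag_id='lead_scoring_batch_inference',
    schedule_interval='0 3 * * *',  # daily at 3 am, as in the cron example
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(
        task_id='run_batch_inference',
        python_callable=run_batch_inference,
        # Placeholder path to the serialized model on object storage.
        op_kwargs={'remote_model_path': 's3://my-bucket/models/lead_scoring/model.joblib'},
        retries=2,
    )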

Kubernetes CronJobs

Kubernetes CronJobs allow users to run batch jobs on Kubernetes on a recurring schedule. It’s important to know that CronJobs don’t come with a management UI like workflow management tools such as Prefect and Airflow, though it’s possible to add these abilities through custom add-ons. CronJobs are most useful to data scientists if your organization is already running a Kubernetes cluster and has a staff of engineers supporting the system.

Check out my previous post on using CronJobs to deploy your machine learning models. Note also that Kubernetes doesn’t support multi-step workflows out of the box, but this functionality can be added with Argo Workflows.


Automation Tools – Traditional automation tools like Jenkins can be used to schedule batch training and batch inference jobs. These tools include error handling, notifications, and automatic retries, but weren’t built for machine learning specifically. Still, it might make sense to start with such a tool if one is already available at your company.

Machine Learning Platforms – Finally, it’s worth noting that there are now a host of machine learning platform solutions available that provide workflow management as well as other ML-centric capabilities. Open source options include Kubeflow and MLflow. Vendor solutions include Amazon SageMaker, cnvrg, and many others.

Conclusion

In this post we defined batch inference and reviewed examples of when it’s appropriate to deploy ML models within a batch inference scheme. Next we created a basic implementation of batch inference using Python and cron and briefly described how to improve this implementation using tools such as Prefect and Airflow. I’ve created a tutorial that demonstrates how to deploy batch inference using Google Composer, a managed Airflow service on Google Cloud Platform. Download it below!

As we saw in an earlier section, not all ML models can be deployed within a batch inference scheme. In our next post, we will describe how to deploy models to production for online inference. This use-case is significantly more complicated than deploying batch inference, so stay tuned for the next post!
