This is post 5 in my Ultimate Guide to Deploying Machine Learning Models. You can find the other posts in the series here.
In our previous post on machine learning deployment we discussed the challenges associated with serving ML models using online inference. These challenges include near real time feature generation, online model validation through A/B tests, rolling out different model versions, and monitoring deployed models.
In this post we’ll demonstrate how to implement online inference. We’ll begin by discussing when online inference is and isn’t required. Then we’ll implement online inference using 3 components: serving logic, machine learning code, and deployment configuration. Finally, we’ll show how to deploy your model to Google Cloud where it can autoscale to handle internet-scale traffic.
When is Online Inference Required?
In general, online inference is required whenever predictions are needed synchronously. In the first post of this series I described several examples of how end users or systems might interact with the predictions produced by machine learning models. Let’s discuss two examples in which predictions are needed in a synchronous manner.
One example involved an ecommerce company that wishes to recommend products to users after they log in to that company’s mobile or web application. Since users can log in at any time of day, recommendations need to be available upon request.
This requirement alone doesn’t necessitate online inference; hypothetically we could precompute predictions in batch, cache them, and then serve the cached predictions at runtime. However, suppose we wish to incorporate the user’s most recent activity into the recommendations.
For example, if a user interacts with a recommended product, we want to update that user’s recommendations with the context of that interaction (e.g., added the product to a cart, removed the product, etc.). It’s this reliance on near real time input data that forces us to deploy the recommender model in an online inference scheme. Predictions should be generated on-the-fly rather than precomputed at recurring intervals so that users’ most recent activities can be factored into the recommendations.
Further, the model should be decoupled from both the mobile and web apps. Decoupling the model from the applications enables data scientists to update the model, roll back to previous versions, and operate various rollout strategies much more easily.
As another example, consider the UberEats estimated time-to-delivery model that estimates when food will be delivered to a hungry customer. Batch precomputing predictions is out of the question in this case. Why?
Because precomputing predictions would require Uber to know ahead of time things like: which customers will order food, what restaurant they’ll order from, what they’ll order, which drivers are available, traffic conditions, restaurant conditions, etc.
These real time data constraints force Uber’s models to be deployed in an online inference scheme. Additionally, decoupling the model from both the mobile and web applications is desired to facilitate model updates, online validation, and prediction monitoring.
When is Online Inference Not Required?
Online inference is not required when you don’t need machine learning predictions immediately. Whenever latency constraints allow predictions to be produced asynchronously, batch prediction is preferred. This isn’t to say that asynchronous predictions can’t be served with online inference; rather, batch inference is much easier to implement and often considerably less expensive to maintain.
Online inference requires machines that are always running to respond to requests. Thus, you pay for these instances even when there aren’t any requests to respond to. And before anyone shouts out about "serverless" inference (e.g., AWS Lambda), remember that this adds an additional layer of complexity to your inference architecture (and keep in mind that under the hood, "serverless" requests are still served by always-on machines).
My recommendation: if you can serve machine learning by running periodic jobs to predict a batch of data, DO THAT!
It’s also worth mentioning the streaming machine learning case. Thus far, when I’ve referred to online inference, I’ve really meant near real time online inference. A web API that returns model predictions is near real time; there’s significantly less latency than in asynchronous batch inference, but the input data to the model is likely lagged by some amount of time.
As described in the challenges of online inference, generating features across historical data stores is complex and might involve a batch component depending on how that historical data is stored.
Rather than waiting for data to be collected and stored, streaming machine learning (really machine learning on streaming datasets) is about identifying patterns as data continuously arrives. Streaming ML adapts as data distributions change over time, often on different time horizons, and is particularly useful at scales when storing raw data is impractical.
This case is sufficiently different from near real time online inference and involves a specialized tooling set built to handle these data streams. Streaming ML is beyond the scope of this post. If you’re interested in learning more about streaming, leave me a comment below. If there’s enough interest, I’ll consider doing a future post on streaming ; )
Implementing Online Inference for Machine Learning
Now that we know when to use online inference, let’s demonstrate how to deploy machine learning models for online inference. First we’ll implement the serving logic of our application using the well-known Flask framework.
Then we’ll demonstrate how to use our Model interface to generate predictions. Finally, we’ll demonstrate how to configure and deploy our API using Google App Engine (GAE). GAE allows us to easily scale our application so we can serve machine learning at internet-scale.
Basic Implementation of Online Inference
Let’s create a basic implementation of online inference that contains three main components:
- Serving Logic – The logic responsible for accepting incoming requests and returning responses. We’ll define an API using the popular Flask library.
- "ML code" – This component is responsible for generating predictions. We’ll rely on the model interface we defined in Software Interfaces for Machine Learning Deployment.
- Deployment Configuration – Configuration responsible for deploying the API. We’ll deploy our API using Google App Engine.
In this example we’ll assume that all raw input data is available within the incoming request. In general, this won’t be true (see Challenges of Online Inference).
Serving Logic for Online Inference
The serving logic is responsible for accepting incoming requests and returning responses. Here we define an API with a single endpoint, /predict, which parses the request data, generates a prediction, and returns it. We also define an error handler. This API is contained within a single file, api.py.
import logging

from flask import Flask, request

app = Flask(__name__)


@app.route('/predict', methods=['POST'])
def predict():
    """Return a machine learning prediction."""
    data = request.get_json()
    logging.info('Incoming data: {}'.format(data))
    # Placeholder response until a model is wired in below.
    return {'prediction': None}


@app.errorhandler(500)
def server_error(e):
    logging.exception('An error occurred during a request.')
    return """
    An internal error occurred: <pre>{}</pre>
    See logs for full stacktrace.
    """.format(e), 500


if __name__ == '__main__':
    app.run(host='127.0.0.1', port=8080, debug=True)
The predict method expects POST requests with JSON payloads containing the raw input data.
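Before wiring in a model, it helps to sanity-check the endpoint. Here’s a minimal sketch of a client call, assuming the API is running locally (python api.py) and that the requests library is installed; the payload fields are purely illustrative:

```python
import requests

# Hypothetical raw input data; real field names depend on your model.
payload = {'user_id': 123, 'recent_product_ids': [456, 789]}

response = requests.post('http://127.0.0.1:8080/predict', json=payload)
print(response.json())  # {'prediction': None} until a model is loaded
```

With the serving skeleton in place, let’s now turn to the machine learning code.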
Machine Learning Code for Online Inference
Our API needs access to a fitted model in order to generate predictions. One way of doing this is to store the model in the repository that we ship off to Google App Engine. But this defeats our goal of creating an automated process. Ideally we’d like to use this same API logic for each new model we deploy across projects.
The main lesson of Software Interfaces for Machine Learning was that designing the correct interface up front saves time later on. Therefore, we can use the model interface defined in that post to drastically simplify our deployment.
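As a refresher, the API only relies on two pieces of that interface: a from_remote constructor that pulls a serialized model out of remote storage, and a predict method. Below is a rough sketch of what such a class might look like; the actual implementation in that post may differ, and this version assumes models are pickled and stored in Google Cloud Storage:

```python
import pickle

from google.cloud import storage


class Model:
    """Wraps a fitted estimator behind a stable predict() interface."""

    def __init__(self, estimator):
        self._estimator = estimator

    def predict(self, data):
        """Return predictions for raw input data."""
        return self._estimator.predict(data)

    @classmethod
    def from_remote(cls, model_path):
        """Download a pickled estimator from a path like
        gs://my-bucket/models/model.pkl and wrap it in a Model."""
        bucket_name, blob_name = model_path[len('gs://'):].split('/', 1)
        blob = storage.Client().bucket(bucket_name).blob(blob_name)
        return cls(pickle.loads(blob.download_as_bytes()))
```

In practice the wrapped object would be a full pipeline that accepts raw inputs, feature engineering included, which is exactly why defining this interface up front pays off.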
We define a function load_model that retrieves a serialized model from a remote file store and loads it into memory. By decorating the function with @app.before_first_request, we ensure that the method is called once before the first request. This is how we "embed" the model into our service:
import os  # add this import at the top of api.py

model = None

# Model is the interface from the earlier post; see the full api.py below for the import.
@app.before_first_request
def load_model():
    global model
    model_path = os.environ['REMOTE_MODEL_PATH']
    logging.info('Loading model: {}'.format(model_path))
    model = Model.from_remote(model_path)
We specify which model to load into memory by setting the REMOTE_MODEL_PATH environment variable to the remote file path of the model. Parameterizing this function by using an environment variable prevents us from hard coding any information about the specific model into the API. This allows us to use the same script across projects to deploy models for online inference. This generic solution facilitates automated and repeatable machine learning deployments.
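For local testing, the same parameterization just means exporting the variable before starting the app. A small sketch, assuming the full api.py shown later and a purely illustrative bucket path:

```bash
# Point the API at a serialized model, then run it locally.
export REMOTE_MODEL_PATH=gs://my-bucket/models/model.pkl
python api.py
```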
With the model loaded into memory, all that’s left to do is use the model to generate a prediction. Let’s update the predict method to do just that:
@app.route('/predict', methods=['POST'])
def predict():
    """Return a machine learning prediction."""
    global model
    data = request.get_json()
    logging.info('Incoming data: {}'.format(data))
    # Predictions must be JSON-serializable (e.g., call .tolist() on numpy arrays).
    prediction = model.predict(data)
    inp_out = {'input': data, 'prediction': prediction}
    # Log the input-output pair so deployed predictions can be monitored.
    logging.info(inp_out)
    return inp_out
Super simple. We use the global keyword to tell Python we wish to use the model variable from the global namespace. After retrieving the input data from the request, we call the predict method of the Model object. We then log and return both the input data and the prediction.
It’s worth mentioning the importance of logging in the context of machine learning deployments. Logs act as a form of telemetry. Logging the input-output pairs and aggregating these logs facilitates ML model monitoring. For example, we can monitor the distributions of the input data and determine whether concept drift is causing degraded predictive performance.
Comparing the distribution of the model’s outputs against the training set is another handy check. We can go a step further and add self-healing behavior: if we detect concept drift, we can automatically retrain the model. Effectively this closes the machine learning loop.
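As a concrete illustration, here is a minimal sketch of one such check, assuming the logged input-output pairs have already been parsed into arrays and that scipy is available; the feature and threshold are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp


def drift_detected(live_values, training_values, alpha=0.01):
    """Flag drift in a single numeric feature using a two-sample
    Kolmogorov-Smirnov test between live traffic and the training set."""
    _, p_value = ks_2samp(np.asarray(live_values), np.asarray(training_values))
    return p_value < alpha


# Example: compare the 'age' feature seen in production against training data.
# if drift_detected(live_ages, training_ages):
#     trigger_retraining()  # hypothetical entry point into your training pipeline
```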
Here is the complete api.py file:
import logging
import os

from flask import Flask, request

# The Model interface comes from the earlier post in this series;
# the module path below is an assumption about where it lives.
from model import Model

app = Flask(__name__)
model = None


@app.before_first_request
def load_model():
    """Load the serialized model into memory before the first request."""
    global model
    model_path = os.environ['REMOTE_MODEL_PATH']
    logging.info('Loading model: {}'.format(model_path))
    model = Model.from_remote(model_path)


@app.route('/predict', methods=['POST'])
def predict():
    """Return a machine learning prediction."""
    global model
    data = request.get_json()
    logging.info('Incoming data: {}'.format(data))
    prediction = model.predict(data)
    inp_out = {'input': data, 'prediction': prediction}
    logging.info(inp_out)
    return inp_out


@app.errorhandler(500)
def server_error(e):
    logging.exception('An error occurred during a request.')
    return """
    An internal error occurred: <pre>{}</pre>
    See logs for full stacktrace.
    """.format(e), 500


if __name__ == '__main__':
    app.run(host='127.0.0.1', port=8080, debug=True)
Note that the amount of "ML code" (the code responsible for actual machine learning) in our script is vanishingly small. Only two lines of api.py do any machine learning: the call to Model.from_remote and the call to model.predict. As D. Sculley et al. note in their well-known paper Hidden Technical Debt in Machine Learning Systems, this is true in general for real-world ML systems because the required surrounding infrastructure is vast and complex. ML code is just a small component of real-world ML systems.
Deployment Configuration for Online Inference
It’s time to deploy our machine learning API! There are many services we can use to deploy an API, including AWS, Azure, Heroku, etc. In this post we will use Google App Engine (GAE). GAE allows developers to build and deploy internet scale applications on a fully-managed platform with zero server management.
It supports many of the popular languages like Python, automatically scales up-and-down depending on application traffic, and comes with monitoring, logging, security, and diagnostics out-of-the-box. These features make App Engine a perfect solution for data scientists deploying their ML APIs for online inference.
App Engine offers two programming environments. The Standard environment offers rapid scaling and is extremely low cost, but is limited in several ways, including forcing code to run in a lightweight sandbox, preventing applications from writing to disk, and limiting the amount of CPU and RAM available to those apps.
In contrast, the Flexible environment runs your app in Docker containers on Google Compute Engine virtual machines with far fewer restrictions, which makes it a better fit for serving ML models.
Google App Engine Architecture
App Engine apps are organized hierarchically into:
- Services – Logical components of an application that can securely share App Engine features and communicate with one another. Each service consists of the source code from your app and the corresponding App Engine configuration files. The set of files that you deploy to a service represents a single version of that service; each time you deploy to the service, you create an additional version within it.
- Versions – Having multiple versions of your app within each service allows you to quickly switch between different versions of that app for rollbacks, testing, or other temporary events.
- Instances – The underlying compute resources that run the versions of a service. Your apps will scale up the number of instances that are running to provide consistent performance, or scale down to minimize idle instances and reduce costs.
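Once an app is deployed, you can inspect this hierarchy directly from the command line; for example:

```bash
gcloud app services list    # the services that make up your app
gcloud app versions list    # the deployed versions of each service
gcloud app instances list   # the instances currently running those versions
```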
Configuration
We’ll organize our API as a single service application where the service is responsible for handling client requests, generating predictions from a trained model, and returning those predictions to the client. The codebase for this architecture can be organized into a single directory:
├── app.yaml # The App Engine configuration file.
├── api.py # The API logic.
└── requirements.txt # Python requirements
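The contents of requirements.txt aren’t covered in this post, but at a minimum it needs Flask for the serving logic, gunicorn for the entrypoint defined below in app.yaml, and whatever the model interface depends on. A hedged sketch (in practice you’d pin exact versions):

```
Flask<2.3             # before_first_request was removed in Flask 2.3
gunicorn              # used by the app.yaml entrypoint below
google-cloud-storage  # assuming models are stored in Cloud Storage
scikit-learn          # or whichever library your Model actually wraps
```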
A Python app in App Engine is configured using an app.yaml file that contains CPU, memory, network, and disk resources, scaling, and other general settings, including environment variables. Here is the app.yaml file for our app:
runtime: python
env: flex
entrypoint: gunicorn -b :$PORT api:app

runtime_config:
  python_version: 3

resources:
  cpu: 2
  memory_gb: 8
  disk_size_gb: 16
The first several keys are general settings. The runtime setting specifies the name of the App Engine language runtime used by the application; here we’ve selected the python runtime, and the version is set in the runtime_config setting. The env setting selects the environment for the application; we’ve selected flex for the flexible environment. The entrypoint setting is the command used to start your application. The entrypoint above starts a gunicorn process that responds to HTTP requests on the port defined by the PORT environment variable.
The resources section controls the computing resources of our instances. App Engine assigns a machine type based on the amount of CPU and memory specified; the machine is guaranteed to have at least the resources specified, but may have more. The cpu setting specifies the number of cores and must be 1 or an even number between 2 and 96. The memory_gb setting is the RAM in GB; each CPU core requires total memory between 0.9 and 6.5 GB. The disk_size_gb setting is the disk size in GB and must be between 10 GB and 10240 GB.
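One thing the configuration above does not set is the REMOTE_MODEL_PATH environment variable that load_model reads. One way to supply it is through an env_variables section in app.yaml; the bucket path here is purely illustrative:

```yaml
env_variables:
  REMOTE_MODEL_PATH: gs://my-bucket/models/model.pkl
```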
For a complete list of all the supported elements in this configuration file, see the app.yaml reference.
Deploying the Application
The simplest way to deploy your application to App Engine is by using the gcloud app deploy command from the command line. This command automatically builds a container image using the Cloud Build service and then deploys that image to the App Engine flexible environment. The container will include any local modifications that you’ve made to the runtime image.
Before you can deploy your application, you must ensure that:
- The Owner of the GCP project has enabled App Engine.
- Your user account includes the required privileges.
To authenticate, run the gcloud init command to authorize gcloud to access Google Cloud Platform using your user account credentials. This initiates a multi-step process that involves visiting a URL in a web browser to authorize access. Once this step is complete, you are ready to deploy the application.
To deploy a version of your application’s service, run the following command from the directory where the app.yaml file of your service is located:
gcloud app deploy
By default, the deploy command generates a unique ID for the version that you deploy, deploys the version to the GCP project you configured the gcloud tool to use, and routes all traffic to the new version.
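If you’d rather stage a rollout than route all traffic to the new version immediately, gcloud supports that as well. A quick sketch, with illustrative version names:

```bash
# Deploy a new version without shifting any traffic to it.
gcloud app deploy --version=v2 --no-promote

# Then split traffic between versions, e.g. a 90/10 canary on the default service.
gcloud app services set-traffic default --splits=v1=0.9,v2=0.1
```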
Conclusion
Let’s recap what we accomplished in this post. In the first half of the post we discussed when deploying machine learning models in an online inference scheme is and isn’t required. Since online inference is more complex than batch inference we should deploy ML using online inference only when necessary. In the second half of our post we implemented online inference in three components.
Component 1 is the serving logic, responsible for handling web requests. Component 2 is the machine learning code responsible for generating predictions. Component 3 is the Google App Engine configuration for deploying our application to the cloud.
What kind of machine learning models have you deployed with online inference? What challenges have you faced using online inference in production? I’d love to hear about your experiences in the comments below or @ me on Twitter @MLinProduction!