This is Part III of the Docker for Machine Learning series. In Part II of the series we learned how to build custom Docker images and how to use volumes for persisting data in containers.
Introduction
In Part II of our Docker for Machine Learning series, we learned how to build our own Docker images by writing Dockerfiles. Today we’re going to take that a step further and show how to use Docker to perform model training and inference. There are many different ways that you can leverage Docker to deploy your models. My approach here involves training and serializing a model during the image build process and performing inference in a running container. Specifically, we will embed the trained model into the Docker image. Then, whenever we want to run inference, we simply need to run a container from that image, deserialize the model, and generate our predictions.
I find this architecture advantageous for several reasons. The first is simplicity: the only moving pieces are Docker and the model training code, so there is very little to understand or maintain. Second, it allows us to leverage Docker’s image tagging system to store and version control our models. It also allows us to use a container registry service, like Amazon’s Elastic Container Registry, to store and manage our models. Rather than worrying about persisting individual model components, we can store entire Docker images that contain all of the necessary model artifacts! But let’s not get too far ahead of ourselves yet.
After reading this post you’ll know how to:
- Train machine learning models during the Docker image build process
- Serialize your models within the image for easy retrieval
- Perform batch inference using Docker containers
Directory Structure
For context, let’s examine the files that we’re going to work with in this post. Our directory structure looks like this:
code
├── Dockerfile
├── inference.py
└── train.py
Here’s a rundown of each of the files:
- train.py – This script will load a training data set, train a model, generate evaluation metrics for the model, and serialize both the model and the evaluation metrics to a specific location. Much of this code comes from the scikit-learn documentation.
- inference.py – This script will be called to perform batch inference. It will load a model that has been previously serialized by train.py, perform inference on a dataset, and print the predictions out to the screen.
- Dockerfile – The Dockerfile to build our Docker image.
Training a Machine Learning Model in a Docker Image
If we want to embed a machine learning model into a Docker image, we first need to train a model on a dataset. Let’s write a file, train.py, that does just that.
import json
import os
from joblib import dump
import numpy as np
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
# #############################################################################
# Load directory paths for persisting model and metadata
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE = os.environ["MODEL_FILE"]
METADATA_FILE = os.environ["METADATA_FILE"]
MODEL_PATH = os.path.join(MODEL_DIR, MODEL_FILE)
METADATA_PATH = os.path.join(MODEL_DIR, METADATA_FILE)
# #############################################################################
# Load and split data
print("Loading data...")
boston = datasets.load_boston()
print("Splitting data...")
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
# #############################################################################
# Fit regression model
print("Fitting model...")
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}
clf = ensemble.GradientBoostingRegressor(**params)
clf.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, clf.predict(X_train))
test_mse = mean_squared_error(y_test, clf.predict(X_test))
metadata = {
    "train_mean_square_error": train_mse,
    "test_mean_square_error": test_mse
}
# #############################################################################
# Serialize model and metadata
print("Serializing model to: {}".format(MODEL_PATH))
dump(clf, MODEL_PATH)
print("Serializing metadata to: {}".format(METADATA_PATH))
with open(METADATA_PATH, 'w') as outfile:
    json.dump(metadata, outfile)
As mentioned in the previous section, much of the code in this file has been adapted from sklearn’s documentation. However, I’ve added code responsible for persisting the trained model and some additional metadata to a particular location in the filesystem. These locations are passed to the script as environment variables so that we don’t hard-code paths all over our code. Using environment variables also lets us set these locations in the Dockerfile (or pass them in as build arguments), making them available at build time. For instance, you and your fellow data scientists can agree upon a directory structure you’d like to use for your Docker machine learning workflow and embed these locations in a configuration file. Then, during your deployment process, your scripts would read the locations directly from the configuration file and perform model training or inference, as the sketch below illustrates.
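Here is a minimal sketch of that idea. The config.json file name and its keys are hypothetical, not part of the scripts in this post; the point is simply that a script prefers the environment variables set by Docker and falls back to the shared defaults otherwise.
import json
import os

# Hypothetical shared configuration file agreed upon by the team.
with open("config.json") as f:
    config = json.load(f)

# Prefer the environment variables set in the Dockerfile,
# falling back to the shared configuration otherwise.
MODEL_DIR = os.environ.get("MODEL_DIR", config["model_dir"])
MODEL_FILE = os.environ.get("MODEL_FILE", config["model_file"])
METADATA_FILE = os.environ.get("METADATA_FILE", config["metadata_file"])

MODEL_PATH = os.path.join(MODEL_DIR, MODEL_FILE)
METADATA_PATH = os.path.join(MODEL_DIR, METADATA_FILE)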
Now that we have our training script, let’s take a look at our Dockerfile.
FROM jupyter/scipy-notebook
RUN pip install joblib
RUN mkdir model
ENV MODEL_DIR=/home/jovyan/model
ENV MODEL_FILE=clf.joblib
ENV METADATA_FILE=metadata.json
COPY train.py ./train.py
COPY inference.py ./inference.py
RUN python3 train.py
If you’ve been following my previous posts, you’ll notice that we haven’t introduced any new Docker commands here. We start with the jupyter/scipy-notebook image as our base image. Next, we pip install joblib, which will be used for serializing and deserializing our trained model. We then create a new directory at /home/jovyan/model; this is where we will persist our trained model and metadata. Next, we set three environment variables that reference the newly created directory and the filenames we wish to use for the saved model and metadata files. Finally, we copy the training and inference scripts into the image and run the train.py script.
By running the train.py script as part of our image build process, we ensure that a machine learning model is fit and serialized to a specific location at build time. The beauty of this is that if our model training process fails, it does so at build time, which makes the issue easy to catch and debug. Further, we can use Docker’s image tagging and ID system to keep track of our trained models. Since each Docker build produces a tagged image with an Id, we can associate particular (Tag, Id) pairs with particular versions of a model. By generating metadata during the build, we can also associate the tag with the model’s metadata!
In order to build the image, we run the docker build command:
docker build -t docker-model -f Dockerfile .
Again, nothing new here. We use the file named Dockerfile and tag the image as docker-model. Here is the output from running the previous command.
$ docker build -t docker-model -f Dockerfile .
Sending build context to Docker daemon 7.68kB
Step 1/9 : FROM jupyter/scipy-notebook
---> 2fb85d5904cc
Step 2/9 : RUN pip install joblib
---> Using cache
---> 66f3447a0309
Step 3/9 : RUN mkdir model
---> Using cache
---> 25add49612f8
Step 4/9 : ENV MODEL_DIR=/home/jovyan/model
---> Using cache
---> c942e332b297
Step 5/9 : ENV MODEL_FILE=clf.joblib
---> Using cache
---> be7ff955f556
Step 6/9 : ENV METADATA_FILE=metadata.json
---> Using cache
---> 934551874dc7
Step 7/9 : COPY train.py ./train.py
---> Using cache
---> e8009734844f
Step 8/9 : COPY inference.py ./inference.py
---> b27b3376312a
Step 9/9 : RUN python3 train.py
---> Running in d0dd807e08d8
Loading data...
Splitting data...
Fitting model...
Serializing model to: /home/jovyan/model/clf.joblib
Serializing metadata to: /home/jovyan/model/metadata.json
Removing intermediate container d0dd807e08d8
---> ca401bbcd10f
Successfully built ca401bbcd10f
Successfully tagged docker-model:latest
In this case, our (Tag, Id) pair is (docker-model, ca401bbcd10f).
Viewing Model Artifacts
Now that we have built our image, we can inspect the model and metadata resources that we persisted during the training process. In order to do that, we need to run a container. Since our metadata is stored as a JSON file, let’s just print it out to the screen. The command to do this is:
docker run docker-model cat /home/jovyan/model/metadata.json
Here we are starting a container from our newly built docker-model image and running the cat command. This is the output of the command:
$ docker run docker-model cat /home/jovyan/model/metadata.json
{"train_mean_square_error": 1.767739146234438, "test_mean_square_error": 6.767464371311846}
For the sake of simplicity, I’ve only persisted the training and test mean squared errors from the training process. But this approach can be extended to persist all kinds of metadata, including the type of algorithm used, the features used in the model, distributions of the training data, etc. In fact, it makes a lot of sense to persist the metadata in a database. If you also persist the Docker image tag and ID along with the metadata, you’ve got yourself the beginning of a machine learning model versioning system. That sounds like it would be a good future blog post ;).
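In the meantime, here is a rough sketch of what recording a model version could look like. Everything in this snippet is illustrative rather than part of this post’s code: the registry file, the helper function, and the use of the Docker CLI through subprocess are all assumptions.
import json
import subprocess

def record_model_version(tag, registry_path="model_registry.jsonl"):
    # Look up the image ID for the given tag (assumes the Docker CLI is available).
    image_id = subprocess.check_output(
        ["docker", "images", "--format", "{{.ID}}", tag], text=True
    ).strip()
    # Read the metadata that was baked into the image at build time.
    metadata = json.loads(subprocess.check_output(
        ["docker", "run", "--rm", tag, "cat", "/home/jovyan/model/metadata.json"],
        text=True
    ))
    # Append a (tag, image ID, metadata) record; a real system might use a database.
    with open(registry_path, "a") as f:
        f.write(json.dumps({"tag": tag, "image_id": image_id, "metadata": metadata}) + "\n")

record_model_version("docker-model")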
Batch Inference in Docker
At this point, we have built a Docker image that contains a trained machine learning model. But the only reason we care about doing this is so we can make predictions! While it’s often necessary to allow clients to retrieve predictions in real time, there are many cases where it isn’t. For instance, a model that is used to generate a batch of predictions on a recurring schedule doesn’t need to be served as an API (although it could be). This is called batch inference.
For example, let’s say your company has built a lead scoring model to predict whether new prospective customers will buy your product or service. The marketing team would like to predict whether individual leads will convert, but they don’t necessarily need the predictions right away. Instead, they’re happy so long as new leads are scored within 24 hours of entering the system. Rather than performing inference as leads enter our system, we can perform inference each night on the batch of leads generated that day. This guarantees that all leads from the previous day are scored within the agreed-upon time frame.
How can we use our Docker image to perform batch inference? Since we’ve trained and serialized the model, this is as simple as deserializing the model and using it to perform inference on new data. Let’s look at inference.py which does just that.
import os
from joblib import load
import numpy as np
from sklearn import datasets
from sklearn.utils import shuffle
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE = os.environ["MODEL_FILE"]
METADATA_FILE = os.environ["METADATA_FILE"]
MODEL_PATH = os.path.join(MODEL_DIR, MODEL_FILE)
METADATA_PATH = os.path.join(MODEL_DIR, METADATA_FILE)
def get_data():
    """
    Return data for inference.
    """
    print("Loading data...")
    boston = datasets.load_boston()
    print("Splitting data...")
    X, y = shuffle(boston.data, boston.target, random_state=13)
    X = X.astype(np.float32)
    offset = int(X.shape[0] * 0.9)
    X_train, y_train = X[:offset], y[:offset]
    X_test, y_test = X[offset:], y[offset:]
    return X_test, y_test
print("Running inference...")
X, y = get_data()
# #############################################################################
# Load model
print("Loading model from: {}".format(MODEL_PATH))
clf = load(MODEL_PATH)
# #############################################################################
# Run inference
print("Scoring observations...")
y_pred = clf.predict(X)
print(y_pred)
This very simple example illustrates the pattern of retrieving data from some data source, deserializing your trained model, and then performing inference. My code isn’t realistic for several reasons. First, I’m making predictions on the same test set I used during training. Second, I’m just printing the predictions to standard output. In real life the data used during inference would be completely unseen. Continuing with my lead scoring example from earlier, the data could be new leads generated during the last day. After performing inference, we might then store these predictions in a database.
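To sketch what a more realistic version might look like, here is a hypothetical lead scoring job. The table names, column names, and database connection are all made up for illustration; the pattern is the same as inference.py, just pointed at real data and writing results back to a database.
import pandas as pd
from joblib import load

# Hypothetical feature columns used by the lead scoring model.
FEATURE_COLUMNS = ["company_size", "industry_code", "pages_visited", "days_since_signup"]

def score_new_leads(db_connection, model_path):
    # Pull the batch of leads created since yesterday (hypothetical table and schema).
    leads = pd.read_sql(
        "SELECT * FROM leads WHERE created_at >= NOW() - INTERVAL '1 day'",
        db_connection,
    )
    # Deserialize the model that was baked into the Docker image at build time.
    clf = load(model_path)
    # Score the batch and persist the predictions for the marketing team.
    leads["conversion_score"] = clf.predict(leads[FEATURE_COLUMNS])
    leads[["lead_id", "conversion_score"]].to_sql(
        "lead_scores", db_connection, if_exists="append", index=False
    )

With that caveat in mind, let’s run the simple version we actually built.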
The command to perform inference is:
docker run docker-model python3 inference.py
Here is the output from running that command.
$ docker run docker-model python3 inference.py
Running inference...
Loading data...
Splitting data...
Loading model from: /home/jovyan/model/clf.joblib
Scoring observations...
[ 15.32448686 27.68741572 24.22583752 31.94786177 10.41477704
34.08931725 22.05210667 11.58265489 13.40512651 42.84036647
33.03218733 15.77635169 23.93521876 19.88254786 25.43466604
20.55132127 13.68825729 47.42790362 17.64804854 21.51806638
22.57388848 16.97645106 16.25503893 20.57862843 14.57438158
11.81385445 24.78353556 37.65978361 30.35372977 19.67713185
23.19380271 24.98879019 18.65459129 30.18538175 8.97905481
13.8130382 14.1646968 17.3840622 19.83840166 24.69323175
20.4430731 15.32433651 25.8157052 16.47533793 19.2214524
19.86928427 21.47113681 21.56443118 24.64517965 22.43665872
22.20893399]
Summary
Congratulations – you now know how to use Docker to train machine learning models and perform batch inference! While this architecture allows us to train models and perform inference locally, we need a few additional components before we can deploy it to production. For instance, we would need some sort of job scheduling functionality to run the recurring inference jobs. We’d also want to perform hyperparameter tuning during the model training process. And while I’ve introduced how to leverage Docker’s image tagging and ID system for model version control, I haven’t shown how to do that. Yet.
In our next post, we’ll use Docker to perform online inference. This will require us to expose our trained model as a REST API.
If you found this tutorial helpful, please share it on LinkedIn, Twitter, or Facebook!