This is Part III of the Docker for Machine Learning series. In Part II of the series we learned how to build custom Docker images and how to use volumes for persisting data in containers.
Introduction
In Part II of our Docker for Machine Learning series, we learned how to build our own Docker images by writing Dockerfiles. Today we’re going to take that a step further and show how to use Docker to perform model training and inference. There are many different ways that you can leverage Docker to deploy your models. My approach here involves training and serializing a model during the image build process and performing inference in a running container. Specifically, we will embed the trained model into the Docker image. Then, whenever we want to run inference, we simply need to run a container from that image, deserialize the model, and generate our predictions.
I find this architecture advantageous for several reasons. The first is simplicity: the only moving pieces are Docker and the model training code, so there is very little to understand or maintain. Second, it allows us to leverage Docker’s image tagging system to store and version control our models. It also allows us to use a container registry service, like Amazon’s Elastic Container Registry, to store and manage our models. Rather than worrying about persisting individual model components, we can store entire Docker images that contain all of the necessary model artifacts! But let’s not get too far ahead of ourselves yet.
After reading this post you’ll know how to:
- Train machine learning models during the Docker image build process
- Serialize your models within the image for easy retrieval
- Perform batch inference using Docker containers
Directory Structure
For context, let’s examine the files that we’re going to work with in this post. Our directory structure looks like this:
code
├── Dockerfile
├── inference.py
└── train.py
Here’s a rundown of each of the files:
- train.py – This script will load a training data set, train a model, generate evaluation metrics for the model, and serialize both the model and the evaluation metrics to a specific location. Much of this code comes from the scikit-learn documentation.
- inference.py – This script will be called to perform batch inference. It will load a model that has been previously serialized by train.py, perform inference on a dataset, and print the predictions out to the screen.
- Dockerfile – The Dockerfile to build our Docker image.
Training a Machine Learning Model in a Docker Image
If we want to embed a machine learning model into a Docker image, we first need to train a model on a dataset. Let’s write a file, train.py, that does just that.
import json
import os
from joblib import dump
import numpy as np
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
# #############################################################################
# Load directory paths for persisting model and metadata
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE = os.environ["MODEL_FILE"]
METADATA_FILE = os.environ["METADATA_FILE"]
MODEL_PATH = os.path.join(MODEL_DIR, MODEL_FILE)
METADATA_PATH = os.path.join(MODEL_DIR, METADATA_FILE)
# #############################################################################
# Load and split data
print("Loading data...")
boston = datasets.load_boston()
print("Splitting data...")
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
# #############################################################################
# Fit regression model
print("Fitting model...")
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}
clf = ensemble.GradientBoostingRegressor(**params)
clf.fit(X_train, y_train)
train_mse = mean_squared_error(y_train, clf.predict(X_train))
test_mse = mean_squared_error(y_test, clf.predict(X_test))
metadata = {
    "train_mean_square_error": train_mse,
    "test_mean_square_error": test_mse
}
# #############################################################################
# Serialize model and metadata
print("Serializing model to: {}".format(MODEL_PATH))
dump(clf, MODEL_PATH)
print("Serializing metadata to: {}".format(METADATA_PATH))
with open(METADATA_PATH, 'w') as outfile:
    json.dump(metadata, outfile)
As mentioned in the previous section, much of the code in this file has been adapted from sklearn’s documentation. However, I’ve added code responsible for persisting the trained model and some additional metadata to a particular location in the filesystem. These locations are passed to the script as environment variables so that we don’t hard-code paths all over our code. Using environment variables also lets us set these locations in the Dockerfile (or pass them in as build arguments), making them available at build time. For instance, you and your fellow data scientists can agree upon a directory structure you’d like to use for your Docker machine learning workflow and embed these locations in a configuration file. Then, during your deployment process, your scripts would read the locations directly from the configuration file and perform model training or inference, as the sketch below illustrates.
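Here is a minimal sketch of that idea. The config.json file name and its keys are hypothetical, not part of the scripts in this post; the point is simply that a script prefers the environment variables set by Docker and falls back to the shared defaults otherwise.
import json
import os

# Hypothetical shared configuration file agreed upon by the team.
with open("config.json") as f:
    config = json.load(f)

# Prefer the environment variables set in the Dockerfile,
# falling back to the shared configuration otherwise.
MODEL_DIR = os.environ.get("MODEL_DIR", config["model_dir"])
MODEL_FILE = os.environ.get("MODEL_FILE", config["model_file"])
METADATA_FILE = os.environ.get("METADATA_FILE", config["metadata_file"])

MODEL_PATH = os.path.join(MODEL_DIR, MODEL_FILE)
METADATA_PATH = os.path.join(MODEL_DIR, METADATA_FILE)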
Now that we have our training script, let’s take a look at our Dockerfile.
FROM jupyter/scipy-notebook
RUN pip install joblib
RUN mkdir model
ENV MODEL_DIR=/home/jovyan/model
ENV MODEL_FILE=clf.joblib
ENV METADATA_FILE=metadata.json
COPY train.py ./train.py
COPY inference.py ./inference.py
RUN python3 train.py
If you’ve been following my previous posts, you’ll notice that we haven’t introduced any new Docker commands here. We start with the jupyter/scipy-notebook image as our base image. Next, we pip install joblib, which will be used for serializing and deserializing our trained model. We then create a new directory at /home/jovyan/model; this is where we will persist our trained model and metadata. Next, we set three environment variables that reference the newly created directory and the filenames we wish to use for the saved model and metadata files. Finally, we copy the training and inference scripts into the image and run the train.py script.
By running the train.py script as part of our image build process, we ensure that a machine learning model is fit and serialized to a specific location at build time. The beauty of this is that if our model training process fails, it does so at build time, which makes the issue easy to catch and debug. Further, we can use Docker’s image tagging and ID system to keep track of our trained models. Since each Docker build produces a tagged image with an Id, we can associate particular (Tag, Id) pairs with particular versions of a model. By generating metadata during the build, we can also associate the tag with the model’s metadata!
In order to build the image, we run the docker build command:
docker build -t docker-model -f Dockerfile .
Again, nothing new here. We use the file named Dockerfile and tag the image as docker-model. Here is the output from running the previous command.
$ docker build -t docker-model -f Dockerfile .
Sending build context to Docker daemon 7.68kB
Step 1/9 : FROM jupyter/scipy-notebook
---> 2fb85d5904cc
Step 2/9 : RUN pip install joblib
---> Using cache
---> 66f3447a0309
Step 3/9 : RUN mkdir model
---> Using cache
---> 25add49612f8
Step 4/9 : ENV MODEL_DIR=/home/jovyan/model
---> Using cache
---> c942e332b297
Step 5/9 : ENV MODEL_FILE=clf.joblib
---> Using cache
---> be7ff955f556
Step 6/9 : ENV METADATA_FILE=metadata.json
---> Using cache
---> 934551874dc7
Step 7/9 : COPY train.py ./train.py
---> Using cache
---> e8009734844f
Step 8/9 : COPY inference.py ./inference.py
---> b27b3376312a
Step 9/9 : RUN python3 train.py
---> Running in d0dd807e08d8
Loading data...
Splitting data...
Fitting model...
Serializing model to: /home/jovyan/model/clf.joblib
Serializing metadata to: /home/jovyan/model/metadata.json
Removing intermediate container d0dd807e08d8
---> ca401bbcd10f
Successfully built ca401bbcd10f
Successfully tagged docker-model:latest
In this case, our (Tag, Id) pair is (docker-model, ca401bbcd10f).
Viewing Model Artifacts
Now that we have built our image, we can inspect the model and metadata resources that we persisted during the training process. In order to do that, we need to run a container. Since our metadata is stored as a JSON file, let’s just print it out to the screen. The command to do this is:
docker run docker-model cat /home/jovyan/model/metadata.json
Here we are starting a container from our newly built docker-model image and running the cat command. This is the output of the command:
$ docker run docker-model cat /home/jovyan/model/metadata.json
{"train_mean_square_error": 1.767739146234438, "test_mean_square_error": 6.767464371311846}
For the sake of simplicity, I’ve only persisted the training and test mean squared errors from the training process. But this approach can be extended to persist all kinds of metadata, including the type of algorithm used, the features used in the model, distributions of the training data, etc. In fact, it makes a lot of sense to persist the metadata in a database. If you also persist the Docker image tag and ID along with the metadata, you’ve got yourself the beginning of a machine learning model versioning system. That sounds like it would be a good future blog post ;).
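In the meantime, here is a rough sketch of what recording a model version could look like. Everything in this snippet is illustrative rather than part of this post’s code: the registry file, the helper function, and the use of the Docker CLI through subprocess are all assumptions.
import json
import subprocess

def record_model_version(tag, registry_path="model_registry.jsonl"):
    # Look up the image ID for the given tag (assumes the Docker CLI is available).
    image_id = subprocess.check_output(
        ["docker", "images", "--format", "{{.ID}}", tag], text=True
    ).strip()
    # Read the metadata that was baked into the image at build time.
    metadata = json.loads(subprocess.check_output(
        ["docker", "run", "--rm", tag, "cat", "/home/jovyan/model/metadata.json"],
        text=True
    ))
    # Append a (tag, image ID, metadata) record; a real system might use a database.
    with open(registry_path, "a") as f:
        f.write(json.dumps({"tag": tag, "image_id": image_id, "metadata": metadata}) + "\n")

record_model_version("docker-model")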
Batch Inference in Docker
At this point, we have built a Docker image that contains a trained machine learning model. But the only reason we care about doing this is so we can make predictions! While it’s often necessary to allow clients to retrieve predictions in real time, there are many cases where it isn’t. For instance, a model that is used to generate a batch of predictions on a recurring schedule doesn’t need to be served as an API (although it could be). This is called batch inference.
For example, let’s say your company has built a lead scoring model to predict whether new prospective customers will buy your product or service. The marketing team would like to predict whether individual leads will convert, but they don’t necessarily need the predictions right away. Instead, they’re happy so long as new leads are scored within 24 hours of entering the system. Rather than performing inference as leads enter our system, we can perform inference each night on the batch of leads generated that day. This guarantees that all leads from the previous day are scored within the agreed-upon time frame.
How can we use our Docker image to perform batch inference? Since we’ve trained and serialized the model, this is as simple as deserializing the model and using it to perform inference on new data. Let’s look at inference.py which does just that.
import os
from joblib import load
import numpy as np
from sklearn import datasets
from sklearn.utils import shuffle
MODEL_DIR = os.environ["MODEL_DIR"]
MODEL_FILE = os.environ["MODEL_FILE"]
METADATA_FILE = os.environ["METADATA_FILE"]
MODEL_PATH = os.path.join(MODEL_DIR, MODEL_FILE)
METADATA_PATH = os.path.join(MODEL_DIR, METADATA_FILE)
def get_data():
    """
    Return data for inference.
    """
    print("Loading data...")
    boston = datasets.load_boston()
    print("Splitting data...")
    X, y = shuffle(boston.data, boston.target, random_state=13)
    X = X.astype(np.float32)
    offset = int(X.shape[0] * 0.9)
    X_train, y_train = X[:offset], y[:offset]
    X_test, y_test = X[offset:], y[offset:]
    return X_test, y_test
print("Running inference...")
X, y = get_data()
# #############################################################################
# Load model
print("Loading model from: {}".format(MODEL_PATH))
clf = load(MODEL_PATH)
# #############################################################################
# Run inference
print("Scoring observations...")
y_pred = clf.predict(X)
print(y_pred)
This very simple example illustrates the pattern of retrieving data from some data source, deserializing your trained model, and then performing inference. My code isn’t realistic for several reasons. First, I’m making predictions on the same test set I used during training. Second, I’m just printing the predictions to standard output. In real life the data used during inference would be completely unseen. Continuing with my lead scoring example from earlier, the data could be new leads generated during the last day. After performing inference, we might then store these predictions in a database.
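To sketch what a more realistic version might look like, here is a hypothetical lead scoring job. The table names, column names, and database connection are all made up for illustration; the pattern is the same as inference.py, just pointed at real data and writing results back to a database.
import pandas as pd
from joblib import load

# Hypothetical feature columns used by the lead scoring model.
FEATURE_COLUMNS = ["company_size", "industry_code", "pages_visited", "days_since_signup"]

def score_new_leads(db_connection, model_path):
    # Pull the batch of leads created since yesterday (hypothetical table and schema).
    leads = pd.read_sql(
        "SELECT * FROM leads WHERE created_at >= NOW() - INTERVAL '1 day'",
        db_connection,
    )
    # Deserialize the model that was baked into the Docker image at build time.
    clf = load(model_path)
    # Score the batch and persist the predictions for the marketing team.
    leads["conversion_score"] = clf.predict(leads[FEATURE_COLUMNS])
    leads[["lead_id", "conversion_score"]].to_sql(
        "lead_scores", db_connection, if_exists="append", index=False
    )

With that caveat in mind, let’s run the simple version we actually built.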
The command to perform inference is:
docker run docker-model python3 inference.py
Here is the output from running that command.
$ docker run docker-model python3 inference.py
Running inference...
Loading data...
Splitting data...
Loading model from: /home/jovyan/model/clf.joblib
Scoring observations...
[ 15.32448686 27.68741572 24.22583752 31.94786177 10.41477704
34.08931725 22.05210667 11.58265489 13.40512651 42.84036647
33.03218733 15.77635169 23.93521876 19.88254786 25.43466604
20.55132127 13.68825729 47.42790362 17.64804854 21.51806638
22.57388848 16.97645106 16.25503893 20.57862843 14.57438158
11.81385445 24.78353556 37.65978361 30.35372977 19.67713185
23.19380271 24.98879019 18.65459129 30.18538175 8.97905481
13.8130382 14.1646968 17.3840622 19.83840166 24.69323175
20.4430731 15.32433651 25.8157052 16.47533793 19.2214524
19.86928427 21.47113681 21.56443118 24.64517965 22.43665872
22.20893399]
Summary
Congratulations – you now know how to use Docker to train machine learning models and perform batch inference! While this architecture allows us to train models and perform inference locally, we need a few additional components before we can deploy it to production. For instance, we would need some sort of job scheduling functionality to run the recurring inference jobs. We’d also want to perform hyperparameter tuning during the model training process. And while I’ve introduced how to leverage Docker’s image tagging and ID system for model version control, I haven’t shown how to do that. Yet.
In our next post, we’ll use Docker to perform online inference. This will require us to expose our trained model as a REST API.
If you found this tutorial helpful, please share it on LinkedIn, Twitter, or Facebook!