Docker for Machine Learning – Part II

This is Part II of the Docker for Machine Learning series. In Part I of the series we learned how to run containers from prebuilt Docker Images. In this post you’ll learn how to build custom images by writing a Dockerfile. You’ll also learn about using Volumes for persisting data in containers.

Why Build your own Docker Images?

In Part I of our Docker for Machine Learning series, we learned how to run Docker containers using publicly available Docker images. While there are many Docker images available for use, some of which come with numerous machine learning tools and libraries preinstalled, you’ll most likely need to further customize these images in order to fit your needs. For instance, you may need to install proprietary software packages developed at your company or other open source packages. So how can we leverage Docker to build these custom environments?

Theoretically, one could start out with an available Docker image, such as the jupyter/scipy-notebook image, run bash in a container, and install all the necessary packages. But this would be a pain to replicate. We’d also prefer to be able to version control our build process. Luckily we can do both of those things by building custom Docker images.
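
For reference, the manual approach looks roughly like this (a sketch only; some-package stands in for whatever you'd actually install):

docker run -it --rm jupyter/scipy-notebook bash
# ...then, inside the container:
pip install some-package
# everything installed this way is lost once the container is removed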

In this post you’ll learn:

  • How to write a Dockerfile
  • How to build a custom Python image from the Dockerfile
  • How to run containers from this image
  • How to use Volumes to persist data in containers

Writing a Dockerfile

In order to build a Docker image, we need to write a Dockerfile, which is just a text file that contains the steps to build an image. To illustrate the process, let’s write out a Dockerfile that performs some common steps like installing system and Python packages and setting environment variables that we’ll need in our running containers.

Here is one such Dockerfile.

FROM jupyter/scipy-notebook

ARG SOME_ENV_VAR
ENV SOME_ENV_VAR=${SOME_ENV_VAR}

RUN pip install awscli --upgrade --user
RUN echo 'export PATH=~/.local/bin:$PATH' >> $HOME/.bashrc

COPY requirements.txt ./requirements.txt
RUN pip install -r requirements.txt
RUN rm ./requirements.txt

Let’s go through this file line-by-line.

The first line, FROM jupyter/scipy-notebook, tells Docker that we want to use the jupyter/scipy-notebook image as our base image. Every image build starts on top of some base image. The line ARG SOME_ENV_VAR defines a variable that users can pass to the builder at build time. In the next line, we use the ENV instruction to set an environment variable; that value is available to all subsequent instructions in the build stage and persists in containers run from the image. Notice the two-part process: ARG accepts a value at build time, and ENV stores that value as an environment variable. Keep in mind that build-time variables are not recommended for passing sensitive information such as credentials, since their values are visible in the image history.
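
To make the distinction concrete, here is a minimal, hypothetical snippet (the variable names are placeholders): an ARG by itself is only visible while the image is being built, which is why we copy its value into an ENV variable.

ARG BUILD_ONLY                    # visible to instructions during the build, not at runtime
ENV SOME_ENV_VAR=${BUILD_ONLY}    # persists into containers run from the image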

In line 6 we use the RUN instruction to install the AWS CLI. The command that follows RUN is executed in a shell, and the resulting filesystem changes are cached as a new image layer. This caching strategy lets Docker skip rebuilding parts of the image that haven't changed if you rebuild it later. Next, we add ~/.local/bin to the $PATH variable in the image, which allows us to use the aws command from any directory.
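
As an aside, a common pattern (not required here) is to chain related commands into a single RUN instruction, producing one cached layer instead of two. A sketch of that refactor:

RUN pip install awscli --upgrade --user && \
    echo 'export PATH=~/.local/bin:$PATH' >> $HOME/.bashrc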

The COPY requirements.txt ./requirements.txt line tells Docker to copy the file requirements.txt from our host machine into the filesystem of the image at path ./requirements.txt, relative to the working directory inherited from the jupyter/scipy-notebook base image. The next line pip installs the packages listed in requirements.txt, and the final line cleans up by removing the copied file. Note that this last cleanup step isn't required for the Dockerfile to be valid.
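
For context, requirements.txt is just a plain text file listing the Python packages you want installed. A hypothetical example might look like this (the packages are placeholders; list whatever your project needs):

# requirements.txt
pandas
scikit-learn
boto3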

Building a Docker Image

Now that we’ve written our Dockerfile, we need to build the Docker image from that file. The command for building our image is:

docker build --build-arg SOME_ENV_VAR=hello -t my-jupyter-image -f Dockerfile .

The docker build command builds an image from a Dockerfile. The -f Dockerfile parameter tells Docker which file to use; by default Docker looks for a file named Dockerfile, so I've included it just for completeness. The . at the end of the command specifies the build context. All files within the context are compressed and sent to the Docker daemon during the build; any files outside of the context aren't available to the build process. The --build-arg SOME_ENV_VAR=hello flag sets the build-time variable we declared in the Dockerfile with the ARG instruction. Finally, -t my-jupyter-image tags the image, i.e. gives it a human-readable name.
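
As a quick sanity check (assuming the build above succeeded), you can confirm that the build-time value was baked into the image's environment:

docker run --rm my-jupyter-image printenv SOME_ENV_VAR
# expected output: hello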

Running your Custom Docker image

Now we can run a container from this image using the same commands that we introduced in Part I. The following command runs a container.

docker run --rm -p 8888:8888 my-jupyter-image

Notice that when we run this command, the jupyter notebook command is still executed in the container. That's because the jupyter/scipy-notebook image defines the default command and entrypoint that execute when a container is run from the image. We'll get into how to define those commands in Part III of our Docker for Machine Learning series. For now it's enough to know that we've essentially inherited this behavior from the base image.
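
If you're curious what that inherited behavior actually is, docker inspect can show the entrypoint and default command stored in the image metadata:

docker inspect --format '{{.Config.Entrypoint}} {{.Config.Cmd}}' my-jupyter-image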

Volumes for Persisting Data

When using Docker, it’s important to remember that containers are meant to be ephemeral objects. This means that any data stored within a container is only stored temporarily. By default, files created inside a container do not persist when the container exits. One simple way to persist data past the lifecycle of a container is to use volumes. You can think of a volume as a file system mount that allows Docker to access and persist data to the filesystem of the host machine.

Connecting a volume to a container is as simple as appending an additional argument of the form -v HOST_PATH:CONTAINER_PATH to the docker run command. Let’s connect a volume to a container running the image we defined above.

docker run --rm -p 8888:8888 \
  -v /Users/luigi/Development:/home/jovyan/work my-jupyter-image

Now any files or subdirectories available in /Users/luigi/Development will be available at /home/jovyan/work. Further, any data persisted in the running container in that directory will be available outside of the container. Remember to substitute a path on your machine in place of /Users/luigi/Development.
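
To convince yourself that the data really persists, you can write a file from inside a throwaway container and then read it back on the host. A sketch (adjust the host path as above; on some setups you may need to sort out file ownership on the mounted directory):

docker run --rm -v /Users/luigi/Development:/home/jovyan/work my-jupyter-image \
  bash -c "echo 'written inside a container' > /home/jovyan/work/hello.txt"
cat /Users/luigi/Development/hello.txt
# the file is still there even though the container has exited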

Conclusion

In this post we examined how to build our own custom Docker images. We learned how to write a Dockerfile which defines the specification for building the image. Next we learned how to build the image using the docker build command and how to run a container from the image we built. Finally, we learned about using volumes to persist data outside of the running containers. At this point, you’ve learned enough to be able to define and run your own custom Docker images!

In our next post, we’re going to expand on these lessons by training machine learning models and using these models for inference. We’ll discuss how to set up a model training pipeline in a Docker container and how to use the images we build for inference.

If you’d like to be notified when the next post is published, sign up below to receive the blog post in your inbox. If you sign up, you’ll also receive a free PDF containing the code to build custom R images.
