Docker for Machine Learning – Part I

Why is Docker useful?

I will admit it, I cannot mention (or even think about) Docker without a large smile coming to my face. Ever since learning about Docker while working as a Data Engineer a few years ago, I’ve tried to adopt it into my workflow whenever possible. And while, strictly speaking, it’s not a machine learning tool per se, I’ve found that it helped me immensely while I was an ML Engineer, and it continues to help me today, both as an individual contributor and as a data science leader managing people and projects. What makes it such a useful tool for working with machine learning? And why is it worth investing your time in learning how to use Docker?

After discussing why Docker and machine learning are a winning combination, I’ll walk you through how I use Docker in my workflow. We’ll focus on leveraging pre-built images that are publicly available. In subsequent posts, we’ll build our own images and walk through how to deploy our models using Docker.

Since I use both R and Python in my day-to-day, I’ll show you how to use Docker with both languages. For simplicity, I’ll demonstrate using Python in this post, but you’ll have the chance to download the R examples at the end of the article. Feel free to take this code and adapt it to your own needs! The rest of the article assumes that you have Docker installed on your machine.

How does Docker help with machine learning?

  • Reproducibility – If you’re training models or analyzing data, you want to be able to reproduce your results. For instance, suppose you’re asked to perform exploratory data analysis on a dataset. You set up a local environment with the dataset and some Python packages, and fire up your Jupyter notebook. After exploring the data and building an initial model, you decide to share your work with teammates. But in order for your teammates to reproduce your analysis, they need to reproduce your whole stack. With Docker you can easily reproduce your working environment. It’s also super easy to port your work to another machine, which brings us to our next point.
  • Portability – It’s very advantageous as a Data Scientist to be able to seamlessly transition from working on your local machine to external clusters that offer additional resources: CPU, memory, GPUs, etc. We also want to be able to easily experiment with the new frameworks and tools that the community regularly develops and releases. Docker allows you to package your code and dependencies into containers that can then be ported to different machines, even if those machines run on different underlying hardware or operating systems. Another advantage of portability is the ability to easily collaborate on projects with teammates. I no longer have to spend an entire day (or week) setting up my environment before I can begin working on an existing codebase. If the project is “Docker-ized” and I have Docker installed, I can be productive immediately.
  • Ease of deployment – Related to portability, Docker simplifies the process of deploying machine learning models. Need to make your model available to external stakeholders? No problem: wrap your model in an API inside a container, deploy that container using a tool like Kubernetes, and there you are. Of course, I’m glossing over a few details here, but the point is that it’s relatively straightforward to go from iterating on a model in a Docker workflow to deploying that model as a container, at which point we can lean on the existing tools and processes for managing deployed containers.

Docker from the Command Line

Although the command line may not be the best environment for performing extensive analysis, I find myself constantly returning to a bash shell for running quick jobs. If I need to fire up a Python interpreter, I just spin up a Docker container directly from the command line. The command to run Python 3.6 is

docker run --rm -ti python:3.6 python

Let’s walk through this command. The basic format of the command is

docker [cmd] [image:tag] [cmd to execute in container]

Here we’re instructing Docker to run a new container from the python:3.6 image and to run python interactively within that container. The --rm flag tells Docker to remove the container once we exit the process. The -ti flags allocate a tty for the interactive process, i.e., they allow us to interact with the container from our terminal.
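To make the anatomy concrete, here are two variations of my own (illustrations, not from the official docs) that use the final slot to run a one-off command instead of an interactive interpreter:

# run a single Python statement and exit; --rm cleans up the container afterwards
docker run --rm python:3.6 python -c "print('Hello from inside the container')"

# check which Python version the image actually ships
docker run --rm python:3.6 python --version

Since these commands don’t need keyboard input, the -ti flags can be dropped.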

The truly incredible part of running these commands is that if we don’t have the python:3.6 image on our machine, Docker automatically downloads the image for us from the public Docker Hub registry. You can think of a Docker registry as a GitHub for Docker images. This allows us to take advantage of the massive number of images that have been published by other developers. Check out all of the different official Python images available. To see which images you have installed on your machine, run

docker images

from the command line.
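You can also download an image ahead of time without starting a container. A small sketch, assuming you haven’t pulled this tag yet:

# fetch the image from Docker Hub without running it
docker pull python:3.6

# confirm that it now appears in your local image list
docker images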

Running Different Versions of Python

If you want to run a different version of Python, all you have to do is change the tag (the part of the image name that comes after the colon) to match the version you wish to use. For example, you can run Python 2.7 with the command

docker run --rm -ti python:2.7 python

Pro tip: I often choose to execute bash in the containers I run. This gives me greater flexibility when running the container: I can locate files, install the system packages I need using apt-get, define environment variables, etc. The command to run bash in a container is

docker run --rm -ti python:2.7 bash
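Here’s a sketch of what a typical session might look like once you’re inside the container. The package name and variable are just examples of mine; the official python images are Debian-based, so apt-get is available:

docker run --rm -ti python:2.7 bash

# now inside the container, running as root:
apt-get update && apt-get install -y vim   # install a system package
export DATA_DIR=/tmp/data                  # define an environment variable
which python                               # locate the interpreter on the path
python                                     # drop into the interpreter when ready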

Running Jupyter notebooks

When I run more extensive analyses, I always use an interactive environment like Jupyter notebooks. Again, I take advantage of the public images that are available. The command to run Jupyter notebooks is

docker run --rm -p 8888:8888 jupyter/scipy-notebook

Note the new command line argument -p 8888:8888. The -p flag tells Docker to publish a port from the container to a port on the host machine. The integer before the colon is the port number on the host machine, and the integer after the colon is the port in the container. Why 8888 in this case? The jupyter/scipy-notebook image serves Jupyter notebooks on port 8888. If you run this command and navigate to localhost:8888 in your browser, you’ll see the Jupyter login screen. You will have to copy and paste the token from the command line into the browser to log in.

What do you think would happen if you changed the -p flag to -p 8000:8888? What about -p 8888:40? If you’re new to Docker, try to answer these questions first, and then test them out.
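One practical addition worth knowing (this is my own habit, not something the image requires): mount your current directory into the container so your notebooks survive after the container exits. The jupyter/scipy-notebook image runs as the jovyan user, whose home directory is /home/jovyan:

# mount the current directory into the container's work/ folder
docker run --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/scipy-notebook

Anything you save under work/ in the notebook interface now lands in the directory you launched from.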

ssh Into Running Containers

Oftentimes it’s very helpful to be able to open up a bash shell in an already running container. We can do this by executing a new command in that container. First, we need to locate the ID associated with the container. Assuming you have run one of the above commands, open another terminal window and run

docker ps

This command lists information about the containers that are currently running on your machine. Copy the value in the CONTAINER ID column and run the following command

docker exec -ti container_id bash

where you replace container_id with the value you’ve copied. This will put you into a bash shell in the running container. From here you can do things like locate files, install additional dependencies, define environment variables, etc.
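If copying container IDs gets tedious, you can name the container when you start it and refer to it by name instead. A minimal sketch (the name scipy-nb is just an example of mine):

# start Jupyter in the background (-d) with a memorable name
docker run --rm -d --name scipy-nb -p 8888:8888 jupyter/scipy-notebook

# the login token shows up in the container's logs
docker logs scipy-nb

# open a shell in the running container by name instead of by ID
docker exec -ti scipy-nb bash

# stop the container when you're done (--rm then removes it automatically)
docker stop scipy-nb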

Conclusion

As you can see, there is a lot you can do with Docker just by taking advantage of using pre-built images. By leveraging these images, we’ve been able to easily spin up different versions of Python and other data science tools like Jupyter notebooks. We’ve also learned about several Docker commands and the options associated with those commands. But this is just the tip of the iceberg when it comes to how Docker can improve your machine learning workflows.

In the next few blog posts, we’re going to discuss more advanced Docker features. First, we’re going to build our own Docker images which will allow us to fully customize our environments. Then we’ll use these Docker images to deploy our own models.

If you’d like to be notified when the next post is published, sign up below to receive the blog post in your inbox. You’ll also be able to download a free PDF containing the code to run both Python and R containers.
