Kubernetes for Machine Learning

One of the best parts of being a Data Scientist is the dynamic nature of the job. You’ll likely spend the majority of your time on feature engineering, building models, or running experiments. But depending on your role, you may also work with Data Engineers to build out robust data pipelines, consult with Product Managers to build machine-learning-powered features, or work with DevOps to serve your model in production. As a Data Scientist, you’re constantly learning something new, even if it’s not directly related to machine learning or statistics.

Over the last few years I’ve spent quite a bit of time learning about Kubernetes, an open-source container orchestration tool designed to automate the deployment, scaling, and management of containerized applications. Although the resources needed to run machine learning workloads are quite different from those of traditional application development, Kubernetes makes it much simpler to deploy and scale machine learning applications. For instance, you can use Kubernetes to parallelize hyperparameter search during model tuning, to generate predictions in batch during inference, or to serve models exposed as REST APIs.

That said, Kubernetes is typically considered outside the realm of data science, and there are relatively few resources designed to teach data scientists how to use it. Rather than focus on systems and architecture, which aren’t core data science competencies, I’d like to present Kubernetes through a series of data-science-focused applications. I’ll introduce and describe different Kubernetes concepts, but through the lens of tasks that a data scientist would accomplish during day-to-day work. We’ll start by talking about Pods, the building block of the Kubernetes ecosystem.

What is a Pod?

A Pod can be thought of as the atomic unit of the Kubernetes ecosystem. It’s the smallest object in the Kubernetes model that you can create or deploy. Conceptually, a Pod represents a single instance of an application.

Rather than running containers directly, Kubernetes runs one or more containers in a Pod. Sometimes a single instance of an application requires just a single container. Other times an application is composed of a few tightly coupled containers. Containers in the same Pod share the same resources and local network, so they can easily communicate with, yet be isolated from, one another.

Since a Pod represents a single instance of an application, it’s customary to deploy replicas of a Pod to handle heavy load. Other Kubernetes objects can be used to automate the management of Pod replicas and the load balancing among them. Besides load balancing, maintaining replicas of a Pod also promotes fault tolerance in case of Pod failure.

Interacting with Pods

Let’s examine how to create, view, update, and delete Pods. This section assumes you have access to a Kubernetes cluster and have the kubectl command-line client installed.

Creating a Pod

The best way of creating and deploying Kubernetes resources is to use configuration files. These files specify which objects to create, what metadata to assign those objects, and other configuration information such as the amount of resources those objects need.

To create a Pod, we need to create a Pod template, a specification for describing the metadata and containers that form the Pod. Let’s walk through a simple configuration file pod_public.yaml.

apiVersion: v1
kind: Pod
metadata:
  name: python3-pod
  labels:
    app: python3
spec:
  containers:
  - name: python3-container
    image: python:3.6
    command: ['python3', '-c', 'print("Hello, World!")']
  restartPolicy: Never

This YAML file contains four top-level keys. The apiVersion field specifies which version of the Kubernetes API to use. The kind field specifies which type of Kubernetes resource we wish to create; in this case, a Pod. The metadata field holds the Pod’s name and a set of labels, arbitrary key-value attributes that developers can attach to Kubernetes objects. The docs contain a recommended set of labels, but I would recommend adding your own machine-learning-specific metadata as well.
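As one illustration of machine-learning-specific metadata, here is a minimal Python sketch that assembles a label set and checks each value against the Kubernetes label-value syntax (at most 63 characters, alphanumeric at both ends, with dashes, underscores, and dots allowed in between). The label keys `model-type`, `experiment`, and `dataset` and the helper names are hypothetical, chosen for this example rather than taken from any Kubernetes convention, and the sketch checks values only, not keys.

```python
import re

# Kubernetes label values: empty, or up to 63 characters that start and end
# with an alphanumeric character, with '-', '_' and '.' allowed in between.
LABEL_VALUE = re.compile(r"([A-Za-z0-9]([A-Za-z0-9_.-]{0,61}[A-Za-z0-9])?)?")


def is_valid_label_value(value):
    """Return True if `value` is acceptable as a Kubernetes label value."""
    return LABEL_VALUE.fullmatch(value) is not None


def build_labels(app, **ml_labels):
    """Merge the `app` label with extra (hypothetical) ML-specific labels."""
    labels = {"app": app}
    # Python keyword arguments cannot contain '-', so map '_' to '-' in keys.
    labels.update({key.replace("_", "-"): value for key, value in ml_labels.items()})
    invalid = {k: v for k, v in labels.items() if not is_valid_label_value(v)}
    if invalid:
        raise ValueError(f"invalid label values: {invalid}")
    return labels


labels = build_labels(
    "python3",
    model_type="gradient-boosting",  # hypothetical label
    experiment="baseline-2019-05",   # hypothetical label
    dataset="boston-housing",        # hypothetical label
)
print(labels)
```

The resulting dictionary could be placed under metadata.labels in a Pod configuration file.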

The spec field specifies the characteristics you want the object to have. Almost every Kubernetes object includes a spec field, but its format differs from one object type to another (see the Kubernetes API Reference). The spec entry above specifies which containers we wish to run. Here we list a single container, with a specific name, Docker image, and command to run in the container. The restartPolicy field specifies whether Kubernetes should restart a container if it fails. In this case, we instruct Kubernetes not to restart the container.
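Because a configuration file is just structured data, the same Pod can also be described as a Python dictionary and serialized to JSON, which kubectl accepts alongside YAML. The sketch below mirrors the values used in pod_public.yaml; the helper name `make_pod_manifest` and the file name `pod_public.json` mentioned in the comments are assumptions made for this illustration.

```python
import json

# The three restart policies a Pod spec accepts.
VALID_RESTART_POLICIES = {"Always", "OnFailure", "Never"}


def make_pod_manifest(name, app, container_name, image, command,
                      restart_policy="Never"):
    """Build a single-container Pod manifest as a plain dict."""
    if restart_policy not in VALID_RESTART_POLICIES:
        raise ValueError(f"unknown restartPolicy: {restart_policy!r}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": app}},
        "spec": {
            "containers": [
                {"name": container_name, "image": image, "command": list(command)}
            ],
            "restartPolicy": restart_policy,
        },
    }


manifest = make_pod_manifest(
    name="python3-pod",
    app="python3",
    container_name="python3-container",
    image="python:3.6",
    command=["python3", "-c", 'print("Hello, World!")'],
)

# Serialize to JSON text. Writing this to a file such as pod_public.json
# (a hypothetical name) would let it be passed to `kubectl create -f`.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Generating manifests this way can be handy when many near-identical Pods are needed, for example one per hyperparameter setting, though a hand-written YAML file is usually easier to read for a single Pod.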

To create a Pod from the configuration file, we use the kubectl create command:

$ kubectl create -f pod_public.yaml
pod "python3-pod" created

We can view the status of the Pod by using the kubectl get command:

$ kubectl get pods
NAME          READY     STATUS      RESTARTS   AGE
python3-pod   0/1       Completed   0          3s
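When a script needs the same information, `kubectl get pods -o json` returns it in machine-readable form. The sketch below summarizes such output; the helper name `summarize_pods` is an assumption, and the sample document is a hand-written, heavily trimmed stand-in for real kubectl output, not a captured log.

```python
import json


def summarize_pods(pod_list_json):
    """Map each Pod name to its phase from `kubectl get pods -o json` output."""
    pod_list = json.loads(pod_list_json)
    return {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in pod_list.get("items", [])
    }


# Hand-written sample, trimmed to the fields used above (not real output).
sample = json.dumps({
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        {
            "kind": "Pod",
            "metadata": {"name": "python3-pod"},
            "status": {"phase": "Succeeded"},
        }
    ],
})

summary = summarize_pods(sample)
print(summary)
```

Against a live cluster, the input string could come from `subprocess.run(["kubectl", "get", "pods", "-o", "json"], capture_output=True, text=True, check=True).stdout`. A Pod whose containers all exited successfully reports the phase Succeeded, which the table view displays as Completed.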

We can also view the logs from within the Pod with the kubectl logs command:

$ kubectl logs python3-pod
Hello, World!

Deleting the Pod is as simple as using the kubectl delete command and specifying the configuration file used to create the Pod:

$ kubectl delete -f pod_public.yaml
pod "python3-pod" deleted

If you view the Pod status, you’ll find that there are no Pods available:

$ kubectl get pods
No resources found.

Creating a Pod from a Custom Docker Image

In the previous section we created a Pod with a single container. That container was run from the python:3.6 Docker image, which is already available on Docker Hub. But what if you’d like to create a Pod and run a container from a custom Docker image?

In that case, we first want to push our custom Docker image to a registry. We can then reference the image in our Pod configuration, and Kubernetes will take care of pulling the appropriate image. In our example, we will build a custom Docker image, tag the image, and then push that image to Docker Hub.

Here is a simple Dockerfile that copies a local Python file into an image:

FROM jupyter/scipy-notebook


And here is the script:

import numpy as np
import pandas as pd
from sklearn import datasets

boston = datasets.load_boston()
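data = np.column_stack((boston.data, boston.target))
df = pd.DataFrame(data, columns=[f for f in boston.feature_names] + ['target'])
# Print summary statistics, one row per column of the DataFrame
print(df.describe().T)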

This script loads a sample dataset into memory, creates a pandas DataFrame from the feature and target data, and then prints summary statistics to the screen.

We first build the image by using the docker build command:

$ docker build -t feature_analysis -f Dockerfile .
Sending build context to Docker daemon  6.656kB
Step 1/2 : FROM jupyter/scipy-notebook
 ---> 2fb85d5904cc
Step 2/2 : COPY ./
 ---> Using cache
 ---> 8286549588cc
Successfully built 8286549588cc
Successfully tagged feature_analysis:latest

Next, we tag the local image with the name of a repository on Docker Hub (I created this repository on Docker Hub):

$ docker tag feature_analysis:latest lpatruno/feature_analysis:latest

Finally, we push the newly tagged image to Docker Hub:

$ docker push lpatruno/feature_analysis:latest
The push refers to repository []
aa1500dcc7ee: Layer already exists 
03de148dfb0a: Layer already exists 
b0f3e4f91d7b: Layer already exists 
d678676e139c: Layer already exists 
f1c34378f44b: Layer already exists 
3e989afdb948: Layer already exists 
5d8e59e8fa3d: Layer already exists 
d0fac854ebed: Layer already exists 
4e4c852921cc: Layer already exists 
6db4e45cf563: Layer already exists 
b9c6b5375a6e: Layer already exists 
ec7a5c783ba6: Layer already exists 
305d55183e3e: Layer already exists 
e4da5278aad5: Layer already exists 
88fb11447873: Layer already exists 
c3c9a296a12d: Layer already exists 
69ff1caa4c1a: Layer already exists 
e9804e687894: Layer already exists 
e8482936e318: Layer already exists 
059ad60bcacf: Layer already exists 
8db5f072feec: Layer already exists 
67885e448177: Layer already exists 
ec75999a0cb1: Layer already exists 
65bdd50ee76a: Layer already exists 
latest: digest: sha256:5de36fe3475a0ef72971691cf225e398974de3a07962ceaac8e762daa9e32469 size: 5342

To create the Kubernetes Pod, all we need to do is replace the image and command values in the Pod specification. I’ve also updated some metadata values. The new configuration lives in pod_custom.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: custom-pod
  labels:
    app: custom
spec:
  containers:
  - name: custom-container
    image: lpatruno/feature_analysis:latest
    command: ['python3', '']
  restartPolicy: Never

We can then create the Pod with the kubectl create command:

$ kubectl create -f pod_custom.yaml
pod/custom-pod created

The kubectl describe command prints a detailed description of Kubernetes resources. In this case, we can view information about our custom-pod, including the events generated when Kubernetes pulled the Docker image from Docker Hub:

$ kubectl describe pod custom-pod
Name:               custom-pod
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               docker-desktop/
Start Time:         Sat, 04 May 2019 10:50:11 -0400
Labels:             app=custom
Annotations:        <none>
Status:             Succeeded
Containers:
  custom-container:
    Container ID:  docker://3d58b710716a1f908f7c9d80c258d56284c8d176661382545d71adde2b3ace66
    Image:         lpatruno/feature_analysis:latest
    Image ID:      docker-pullable://lpatruno/feature_analysis@sha256:5de36fe3475a0ef72971691cf225e398974de3a07962ceaac8e762daa9e32469
    Port:          <none>
    Host Port:     <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 04 May 2019 10:50:13 -0400
      Finished:     Sat, 04 May 2019 10:50:14 -0400
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/ from default-token-96lvl (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-96lvl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-96lvl
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations: for 300s
        for 300s
Events:
  Type    Reason     Age   From                     Message
  ----    ------     ----  ----                     -------
  Normal  Scheduled  53s   default-scheduler        Successfully assigned default/custom-pod to docker-desktop
  Normal  Pulling    52s   kubelet, docker-desktop  pulling image "lpatruno/feature_analysis:latest"
  Normal  Pulled     52s   kubelet, docker-desktop  Successfully pulled image "lpatruno/feature_analysis:latest"
  Normal  Created    52s   kubelet, docker-desktop  Created container
  Normal  Started    51s   kubelet, docker-desktop  Started container

Finally, we can view the logs:

$ kubectl logs custom-pod
         count        mean    ...            75%       max
CRIM     506.0    3.613524    ...       3.677083   88.9762
ZN       506.0   11.363636    ...      12.500000  100.0000
INDUS    506.0   11.136779    ...      18.100000   27.7400
CHAS     506.0    0.069170    ...       0.000000    1.0000
NOX      506.0    0.554695    ...       0.624000    0.8710
RM       506.0    6.284634    ...       6.623500    8.7800
AGE      506.0   68.574901    ...      94.075000  100.0000
DIS      506.0    3.795043    ...       5.188425   12.1265
RAD      506.0    9.549407    ...      24.000000   24.0000
TAX      506.0  408.237154    ...     666.000000  711.0000
PTRATIO  506.0   18.455534    ...      20.200000   22.0000
B        506.0  356.674032    ...     396.225000  396.9000
LSTAT    506.0   12.653063    ...      16.955000   37.9700
target   506.0   22.532806    ...      25.000000   50.0000

[14 rows x 8 columns]


Although Pods are the smallest Kubernetes objects that can be deployed, deploying them directly is not recommended. This is because naked Pods (Pods not bound to other objects) won’t be rescheduled in the event of a node failure. Instead, you should deploy higher-level abstractions that incorporate Pods.

For instance, a Job creates one or more Pods and ensures that a specified number of them terminate successfully. This is really useful when running batch machine learning jobs that perform feature engineering, model training, or batch inference. So useful, in fact, that I’ll cover how to deploy model training as a Kubernetes Job in my next post.
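As a rough preview of the idea, and not necessarily the approach the next post will take, a Job configuration wraps a Pod template much like the Pod specifications above. The sketch below builds one as a Python dictionary; the helper name `make_job_manifest`, the Job name `hello-job`, and the `completions` and `backoffLimit` values are all assumptions chosen for illustration.

```python
import json


def make_job_manifest(name, container, completions=1, backoff_limit=4):
    """Wrap a single container in a batch/v1 Job manifest (as a plain dict)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of Pods that must terminate successfully.
            "completions": completions,
            # Number of retries before the Job is marked as failed.
            "backoffLimit": backoff_limit,
            # The Pod template: the same shape as a Pod's spec.
            "template": {
                "spec": {
                    "containers": [container],
                    # Pods created by a Job must use Never or OnFailure.
                    "restartPolicy": "Never",
                }
            },
        },
    }


job = make_job_manifest(
    name="hello-job",  # hypothetical Job name
    container={
        "name": "python3-container",
        "image": "python:3.6",
        "command": ["python3", "-c", 'print("Hello, World!")'],
    },
    completions=3,
)
print(json.dumps(job, indent=2))
```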

