One of the best parts of being a Data Scientist is the dynamic nature of the job. You’ll likely spend a majority of your time feature engineering, building models, or running experiments. But depending on your role, you may also work with Data Engineers to build out robust data pipelines, consult with Product Managers to build machine learning-powered features, or work with DevOps to serve your model in production. As a Data Scientist you’re constantly learning something new, even if it’s not directly related to machine learning or statistics.
Over the last few years I’ve spent quite a bit of time learning about Kubernetes, an open source container orchestration tool designed to automate deployment, scaling, and management of containerized applications. Although the resources needed to run machine learning workloads are quite different from the needs of traditional application development, Kubernetes makes it extremely simple to deploy and scale up machine learning applications. For instance, you can use Kubernetes to parallelize hyperparameter search during model tuning, to generate predictions in batch during inference, or to serve models exposed as REST APIs.
That said, Kubernetes is typically considered outside the realm of data science and there are relatively few resources designed for data scientists to learn how to use Kubernetes. Rather than focus on systems and architecture, which aren’t data science core competencies, I’d like to present Kubernetes through a series of data science focused applications. I’ll introduce and describe different Kubernetes concepts, but through the lens of tasks that a data scientist would accomplish during day-to-day work. We’ll start by talking about Pods, the building block of the Kubernetes ecosystem.
What is a Pod?
A Pod can be thought of as the atomic unit of the Kubernetes ecosystem. It’s the smallest deployable object in the Kubernetes model that you can create or deploy. Conceptually, a Pod represents a single instance of an application.
Rather than running containers directly, Kubernetes runs one or more containers in a Pod. Sometimes a single instance of an application requires just a single container. Other times an application is composed of a few tightly coupled containers. Containers in the same Pod share the same resources and local network, so they can easily communicate with, yet be isolated from, one another.
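To make this concrete, here’s a minimal sketch of a Pod running two containers side by side; the names, images, and commands are illustrative, and we’ll walk through the format of these configuration files shortly. Because the containers share the Pod’s network, the sidecar can reach the server at localhost:
apiVersion: v1
kind: Pod
metadata:
  name: two-container-pod
spec:
  containers:
  - name: model-server              # main application container
    image: python:3.6
    command: ['python3', '-m', 'http.server', '8080']
  - name: health-check-sidecar      # helper that shares the Pod's network
    image: busybox
    command: ['sh', '-c', 'while true; do wget -qO- http://localhost:8080/ > /dev/null; sleep 60; done']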
Since a Pod represents a single instance of an application, it’s customary to deploy multiple replicas of a Pod to handle heavy load. Other Kubernetes objects can be used to automate the management of Pod replicas and the load balancing among them. Besides spreading load, maintaining replicas of a Pod also promotes fault tolerance in case a Pod fails.
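One such object is the Deployment, which keeps a desired number of identical Pods running for you. Here’s a minimal sketch, again with illustrative names:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: python3-deployment
spec:
  replicas: 3                 # maintain three identical Pods
  selector:
    matchLabels:
      app: python3
  template:                   # the Pod template each replica is created from
    metadata:
      labels:
        app: python3
    spec:
      containers:
      - name: python3-container
        image: python:3.6
        command: ['python3', '-m', 'http.server', '8080']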
Interacting with Pods
Let’s examine how to create, view, update, and delete Pods. This section assumes you have access to a Kubernetes cluster and have the kubectl command-line client installed.
Creating a Pod
The best way of creating and deploying Kubernetes resources is to use configuration files. These files specify which objects to create, what metadata to assign those objects, and other configuration information such as the amount of resources those objects need.
To create a Pod, we need to create a Pod template, a specification for describing the metadata and containers that form the Pod. Let’s walk through a simple configuration file pod_public.yaml.
apiVersion: v1
kind: Pod
metadata:
  name: python3-pod
  labels:
    app: python3
spec:
  containers:
  - name: python3-container
    image: python:3.6
    command: ['python3', '-c', 'print("Hello, World!")']
  restartPolicy: Never
This yaml file contains four top-level keys. The apiVersion field specifies which version of the Kubernetes API to use. The kind field specifies which type of Kubernetes resource we wish to create; in this case, we are creating a Pod object. The metadata field assigns the object a name and a set of labels, arbitrary key-value pairs developers can attach to Kubernetes objects. The docs contain a recommended set of labels, but I would recommend appending your own machine learning-specific metadata as well.
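For example, you might record which model, version, or experiment a Pod belongs to so you can filter Pods by those attributes later. The keys and values below are purely illustrative:
metadata:
  name: python3-pod
  labels:
    app: python3
    model-name: churn-classifier    # illustrative ML-specific labels
    model-version: "1.2.0"
    experiment-id: exp-042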
The spec field specifies the characteristics you want the object to have. Every Kubernetes object must contain a spec field, but the format of the object spec differs from object to object (see the Kubernetes API Reference). The spec above lists a single container, with a name, a Docker image, and a command to run inside the container. The restartPolicy field tells Kubernetes whether to restart the container if it fails; the valid values are Always, OnFailure, and Never. In this case, we instruct Kubernetes not to restart the container.
To create a Pod from the configuration file, we use the kubectl create command:
$ kubectl create -f pod_public.yaml
pod "python3-pod" created
We can view the status of the Pod with the kubectl get command:
$ kubectl get pods
NAME          READY   STATUS      RESTARTS   AGE
python3-pod   0/1     Completed   0          3s
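Because we attached the app: python3 label to the Pod, we can also look it up with a label selector, which becomes handy once a cluster is running many Pods:
$ kubectl get pods -l app=python3
This returns only the Pods whose labels match the selector.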
We can also view the logs from within the Pod with the kubectl logs command:
$ kubectl logs python3-pod
Hello, World!
Deleting the Pod is as simple as using the kubectl delete command and specifying the configuration file used to create the Pod:
$ kubectl delete -f pod_public.yaml
pod "python3-pod" deleted
If you view the Pod status, you’ll find that there are no Pods available:
$ kubectl get pods
No resources found.
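You can also delete a Pod by name rather than by configuration file:
$ kubectl delete pod python3-pod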
Creating a Pod from a Custom Docker Image
In the previous section we created a Pod with a single container. That container was run from the python:3.6 Docker image, which is already available on Docker Hub. But what if you’d like to create a Pod that runs a container from a custom Docker image?
In that case, we first want to push our custom Docker image to a registry. We can then reference the image in our Pod configuration, and Kubernetes will take care of pulling the appropriate image. In our example, we will build a custom Docker image, tag the image, and then push that image to Docker Hub.
Here is a simple Dockerfile that copies a local Python file into an image:
FROM jupyter/scipy-notebook
COPY feature_analysis.py ./feature_analysis.py
And here is the feature_analysis.py script:
import numpy as np
import pandas as pd
from sklearn import datasets

# Load the Boston housing dataset bundled with scikit-learn
boston = datasets.load_boston()

# Combine the feature matrix and target vector into a single array
data = np.column_stack((boston.data, boston.target))

# Build a dataframe and print summary statistics for each column
df = pd.DataFrame(data, columns=list(boston.feature_names) + ['target'])
print(df.describe().T)
This script loads a sample dataset into memory, creates a pandas dataframe from the feature and target data, and then prints summary statistics to the screen.
We first build the image using the docker build command:
$ docker build -t feature_analysis -f Dockerfile .
Sending build context to Docker daemon 6.656kB
Step 1/2 : FROM jupyter/scipy-notebook
---> 2fb85d5904cc
Step 2/2 : COPY feature_analysis.py ./feature_analysis.py
---> Using cache
---> 8286549588cc
Successfully built 8286549588cc
Successfully tagged feature_analysis:latest
Next, we tag the local image with the name of a repository I created on Docker Hub:
$ docker tag feature_analysis:latest lpatruno/feature_analysis:latest
Finally, we push the newly tagged image to the Docker Hub:
$ docker push lpatruno/feature_analysis:latest
The push refers to repository [docker.io/lpatruno/feature_analysis]
aa1500dcc7ee: Layer already exists
03de148dfb0a: Layer already exists
b0f3e4f91d7b: Layer already exists
d678676e139c: Layer already exists
f1c34378f44b: Layer already exists
3e989afdb948: Layer already exists
5d8e59e8fa3d: Layer already exists
d0fac854ebed: Layer already exists
4e4c852921cc: Layer already exists
6db4e45cf563: Layer already exists
b9c6b5375a6e: Layer already exists
ec7a5c783ba6: Layer already exists
305d55183e3e: Layer already exists
e4da5278aad5: Layer already exists
88fb11447873: Layer already exists
c3c9a296a12d: Layer already exists
69ff1caa4c1a: Layer already exists
e9804e687894: Layer already exists
e8482936e318: Layer already exists
059ad60bcacf: Layer already exists
8db5f072feec: Layer already exists
67885e448177: Layer already exists
ec75999a0cb1: Layer already exists
65bdd50ee76a: Layer already exists
latest: digest: sha256:5de36fe3475a0ef72971691cf225e398974de3a07962ceaac8e762daa9e32469 size: 5342
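With the image pushed, any machine with access to Docker Hub, including the nodes in our Kubernetes cluster, can pull it:
$ docker pull lpatruno/feature_analysis:latest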
To create the Kubernetes Pod, all we need to do is replace the image and command values in the Pod specification. I’ve also updated some metadata values. Here is pod_custom.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: custom-pod
  labels:
    app: custom
spec:
  containers:
  - name: custom-container
    image: lpatruno/feature_analysis:latest
    command: ['python3', 'feature_analysis.py']
  restartPolicy: Never
We can then create the Pod with the kubectl create command:
$ kubectl create -f pod_custom.yaml
pod/custom-pod created
The kubectl describe command prints a detailed description of Kubernetes resources. In this case, we can view information about our custom-pod, including the events generated as Kubernetes fetched the Docker image from Docker Hub:
$ kubectl describe pod custom-pod
Name:               custom-pod
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               docker-desktop/192.168.65.3
Start Time:         Sat, 04 May 2019 10:50:11 -0400
Labels:             app=custom
Annotations:        <none>
Status:             Succeeded
IP:                 10.1.0.62
Containers:
  custom-container:
    Container ID:  docker://3d58b710716a1f908f7c9d80c258d56284c8d176661382545d71adde2b3ace66
    Image:         lpatruno/feature_analysis:latest
    Image ID:      docker-pullable://lpatruno/feature_analysis@sha256:5de36fe3475a0ef72971691cf225e398974de3a07962ceaac8e762daa9e32469
    Port:          <none>
    Host Port:     <none>
    Command:
      python3
      feature_analysis.py
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 04 May 2019 10:50:13 -0400
      Finished:     Sat, 04 May 2019 10:50:14 -0400
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-96lvl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-96lvl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-96lvl
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age  From                     Message
  ----    ------     ---- ----                     -------
  Normal  Scheduled  53s  default-scheduler        Successfully assigned default/custom-pod to docker-desktop
  Normal  Pulling    52s  kubelet, docker-desktop  pulling image "lpatruno/feature_analysis:latest"
  Normal  Pulled     52s  kubelet, docker-desktop  Successfully pulled image "lpatruno/feature_analysis:latest"
  Normal  Created    52s  kubelet, docker-desktop  Created container
  Normal  Started    51s  kubelet, docker-desktop  Started container
Finally, we can view the logs:
$ kubectl logs custom-pod
         count        mean  ...         75%       max
CRIM     506.0    3.613524  ...    3.677083   88.9762
ZN       506.0   11.363636  ...   12.500000  100.0000
INDUS    506.0   11.136779  ...   18.100000   27.7400
CHAS     506.0    0.069170  ...    0.000000    1.0000
NOX      506.0    0.554695  ...    0.624000    0.8710
RM       506.0    6.284634  ...    6.623500    8.7800
AGE      506.0   68.574901  ...   94.075000  100.0000
DIS      506.0    3.795043  ...    5.188425   12.1265
RAD      506.0    9.549407  ...   24.000000   24.0000
TAX      506.0  408.237154  ...  666.000000  711.0000
PTRATIO  506.0   18.455534  ...   20.200000   22.0000
B        506.0  356.674032  ...  396.225000  396.9000
LSTAT    506.0   12.653063  ...   16.955000   37.9700
target   506.0   22.532806  ...   25.000000   50.0000

[14 rows x 8 columns]
Conclusion
Although Pods are the smallest Kubernetes object that can be deployed, it’s not recommended to deploy Pods directly. This is because naked Pods (Pods not bound to other objects) won’t be rescheduled in the event of a node failure. Instead, you should deploy higher-level abstractions, such as Deployments and Jobs, that create and manage Pods for you.
For instance, a Job creates one or more Pods and ensures that a specified number of them terminate successfully. This is extremely useful for running batch machine learning workloads that perform feature engineering, model training, or batch inference. So useful, in fact, that I’ll be covering how to deploy model training as a Kubernetes Job in my next post.
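As a preview, here’s a minimal sketch of a Job that wraps the feature analysis container from this post; the Job name and retry settings are illustrative:
apiVersion: batch/v1
kind: Job
metadata:
  name: feature-analysis-job
spec:
  completions: 1              # run until one Pod terminates successfully
  backoffLimit: 2             # retry a failed Pod at most twice
  template:                   # the same Pod template format we used above
    spec:
      containers:
      - name: custom-container
        image: lpatruno/feature_analysis:latest
        command: ['python3', 'feature_analysis.py']
      restartPolicy: Never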
If you’d like to be notified when that post is published, sign up below and I’ll send you an email as soon as it’s ready!