Kubernetes CronJobs for Machine Learning


In my previous post we discussed how to leverage Kubernetes Jobs to perform common production machine learning tasks such as model training and batch inference. Jobs allow us to reliably run batch processes in a fault tolerant way. Even if an underlying node in the cluster fails, Kubernetes will ensure that the Job is rescheduled on a new node.

One limitation of using Jobs for machine learning workloads is that Job objects need to be created manually. What if we want Jobs to run at specific times? Or what if we want to run some machine learning Jobs periodically on a recurring schedule? In this case, Kubernetes offers us the CronJob.

What is a CronJob

A CronJob creates Jobs on a time-based schedule similar to cron tasks on a Linux system. CronJobs are useful when you wish to create recurring tasks or run jobs at specific times. For example, you may wish to run recurring batch processes during periods of low activity. It’s important to note that CronJobs do not create Pods directly. Instead, a CronJob is only responsible for creating Jobs based on its schedule. The created Jobs are responsible for managing Pods that perform application logic.

How are CronJobs Useful for Machine Learning

CronJobs are quite useful in machine learning workflows. Suppose you’re building a feature store and need to generate features every hour from an operational data store. One way of producing these features is to use an hourly CronJob that reads from the data store, creates the features, and stores these in the feature store. As another example, consider a lead scoring model that performs batch inference each night. This can be deployed as a daily CronJob that loads a pretrained model, fetches new input data, performs inference, and persists the predictions.
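To make the feature store example concrete, here is a minimal sketch of the kind of script an hourly feature-generation CronJob might run. All names here (build_features, FeatureRow, the event fields) are illustrative, not part of any real pipeline; in a real job the events would be read from the operational data store and the rows written to the feature store.

```python
# Hypothetical hourly feature-generation script for a CronJob.
# Names and data shapes are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean


@dataclass
class FeatureRow:
    user_id: str
    event_count: int
    avg_amount: float


def build_features(events):
    """Aggregate raw events into simple per-user features."""
    by_user = {}
    for e in events:
        by_user.setdefault(e["user_id"], []).append(e["amount"])
    return [
        FeatureRow(uid, len(amounts), mean(amounts))
        for uid, amounts in sorted(by_user.items())
    ]


if __name__ == "__main__":
    # In production these events would come from the operational data
    # store, and the resulting rows would be persisted to the feature store.
    events = [
        {"user_id": "u1", "amount": 10.0},
        {"user_id": "u1", "amount": 30.0},
        {"user_id": "u2", "amount": 5.0},
    ]
    for row in build_features(events):
        print(row)
```

Packaged into a container image, a script like this could be scheduled with a CronJob using `schedule: "0 * * * *"` to run at the top of every hour.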

Interacting with CronJobs

In this post we will be using the lpatruno/k8-model Docker image and inference.py Python script from my previous post on Kubernetes Jobs. If you haven’t already, I recommend reading that post before continuing. Using those files we will create a CronJob that loads a pretrained model and performs batch inference on a recurring schedule. This is a common production machine learning pattern.

Note: This section assumes you have access to a Kubernetes cluster and have the kubectl command line client installed.

Creating a CronJob

To create a CronJob, we need to create a yaml file containing the configuration data. Let’s walk through inference.yaml, the config file for our batch inference CronJob:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: inference-cronjob
spec:
  schedule: "0 12 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: inference-container
            imagePullPolicy: Always
            image: lpatruno/k8-model:latest
            command: ["python3",  "inference.py"]
            env:
            - name: AWS_ACCESS_KEY_ID
              value: ""
            - name: AWS_SECRET_ACCESS_KEY
              value: ""
          restartPolicy: Never
      backoffLimit: 0

This yaml file contains four top-level keys. The apiVersion specifies which version of the Kubernetes API to use; batch/v1beta1 was the CronJob API at the time of writing (CronJobs graduated to batch/v1 in Kubernetes 1.21). The kind field specifies which type of Kubernetes resource we wish to create. In this case, we are creating a CronJob object. The metadata field holds data that identifies the object, such as its name, and can also carry labels, arbitrary key-value attributes developers can attach to Kubernetes objects. The docs contain a recommended set of labels, but I would recommend appending your own machine learning specific metadata as well. The spec field specifies the characteristics you want the resource to have. Every Kubernetes resource must contain a spec field, but the format of the object spec is different for different objects (see the Kubernetes API Reference).
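As a sketch of what that might look like, here is a hypothetical metadata block combining some of the recommended Kubernetes labels with machine-learning-specific ones. The app.kubernetes.io/* keys come from the Kubernetes docs; the model-name and model-version keys are our own illustrative convention, not a standard.

```yaml
metadata:
  name: inference-cronjob
  labels:
    app.kubernetes.io/name: inference-cronjob
    app.kubernetes.io/component: batch-inference
    # ML-specific labels -- the key names below are our own convention
    model-name: lead-scoring
    model-version: "1.2.0"
```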

The .spec field for the CronJob resource above contains two fields. The .spec.schedule field contains the cron formatted string that specifies when the CronJob should run. In my example, a new Job resource will be created each day at noon. The .spec.jobTemplate field contains the same fields that would appear in a Job spec field. In fact, I simply used the .spec field from my previous post.
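If the five-field cron format is unfamiliar, the fields are minute, hour, day of month, month, and day of week. Here is a minimal sketch of how an expression like "0 12 * * *" maps to firing times. It supports only plain numbers and "*" in each field; real cron also allows ranges, lists, and step values like "*/5".

```python
# Minimal sketch of five-field cron matching (numbers and "*" only).
from datetime import datetime


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` satisfies the cron expression `expr`."""
    fields = expr.split()  # minute, hour, day, month, weekday
    # cron weekdays: 0 = Sunday .. 6 = Saturday; Python: Monday = 0
    actual = [when.minute, when.hour, when.day, when.month,
              (when.weekday() + 1) % 7]
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))


# "0 12 * * *" fires exactly once per day, at 12:00
print(cron_matches("0 12 * * *", datetime(2019, 5, 22, 12, 0)))  # True
print(cron_matches("0 12 * * *", datetime(2019, 5, 22, 12, 1)))  # False
```

A "*" means "any value", so "0 12 * * *" reads as: minute 0, hour 12, any day, any month, any weekday.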

To create the CronJob, simply run

$ kubectl create -f inference.yaml
cronjob.batch/inference-cronjob created

Viewing CronJobs

We can view the scheduled CronJobs by running the kubectl get cronjobs command:

$ kubectl get cronjobs
NAME                SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
inference-cronjob   0 12 * * *   False     0        <none>          29s

Since CronJobs create Job resources, we can monitor the Jobs that are created and look for instances of inference-cronjob:

$ kubectl get jobs --watch
NAME                           COMPLETIONS   DURATION   AGE
inference-cronjob-1558532220   0/1                      0s
inference-cronjob-1558532220   0/1   0s    0s
inference-cronjob-1558532220   1/1   7s    7s

The --watch flag in the command above watches for any changes in the Jobs resources. Our CronJob has created a Job object named inference-cronjob-1558532220.

In order to view the logs from a Job created by a CronJob, we need to retrieve the Pod resource associated with that Job. To do that, we can run the kubectl get pods command and filter on the name of the Job object.

$ kubectl get pods --selector=job-name=inference-cronjob-1558532220
NAME                                 READY   STATUS      RESTARTS   AGE
inference-cronjob-1558532220-ddqsr   0/1     Completed   0          22s

We see that the Pod has completed successfully. With the name of the Pod in hand, we can view the logs from that Pod:

$ kubectl logs inference-cronjob-1558532220-ddqsr
Running inference...
Loading data...
Loading model from: /home/jovyan/model/clf.joblib
Scoring observations...
[ 15.32448686  27.68741572  24.17609927  31.94786177  10.40786467
  34.38871141  22.05210667  11.58265489  13.21049075  42.87157933
  33.03218733  15.77635169  23.93521876  19.79260258  25.43466604
  20.55132127  13.67733317  47.48979635  17.70069362  21.51806638
  22.57388848  16.97645106  16.25503893  20.57862843  14.57438158
  11.81385445  24.78353556  37.64333157  30.29062179  19.67713185
  23.19310437  25.06569372  18.65459129  30.26701253   8.97905481
  13.8130382   14.21123728  17.3840622   19.83840166  23.83861108
  20.44820805  15.32433651  25.8157052   16.47533793  19.2214524
  19.86928427  21.47113681  21.56443118  24.64517965  22.43665872
  22.25160648]

Success! Our batch inference is complete.

You can delete a CronJob with the kubectl delete command. This will also delete any Jobs and Pods created by the CronJob.

$ kubectl delete -f inference.yaml
cronjob.batch "inference-cronjob" deleted

Conclusion

Let’s briefly review our work. First, we created a CronJob configuration file called inference.yaml. This config specifies that the inference.py script should be run each day at noon. At that time, the CronJob is triggered and Kubernetes creates a Job object. This Job then creates a Pod object, which runs a container that runs the Python script. We can use the same pattern above to schedule recurring model training jobs. I’ll leave that as an exercise for the reader : )

So far in our Kubernetes series we’ve covered how to create Pod, Job, and CronJob resources. Jobs and CronJobs are great for running batch processes. But what if we need to perform online inference? In that case we’ll need to deploy an API that accepts incoming requests, performs inference, and returns the result. In our next post I’ll demonstrate how to use Kubernetes Deployments to deploy online inference.

If you’d like to be notified when that post is published, sign up below and I’ll send you an email as soon as it’s ready!

