5 Challenges to Running Machine Learning Systems in Production


As an applied machine learning practitioner, I’m acutely aware that we’re still in the steam-powered days of machine learning. With a few notable exceptions (I’m looking at you, FAANMG), most companies have only begun to use ML techniques in the past few years. And even fewer of these companies have been running ML systems at scale.

At a high level, running ML systems at scale is challenging for several reasons. The systems issues involved are often misunderstood, and although best practices are emerging quickly, they’re extremely decentralized: you’ll need to search across books, blog posts, conference talks, and GitHub repositories to find them.

One way to leverage what tech companies like Amazon and Microsoft have learned about running ML systems is to use a centralized platform. These platforms have been built to solve common issues encountered when deploying machine learning models at scale. Here I’d like to dig into 5 challenges to running machine learning systems at scale and how Amazon SageMaker addresses each of them.


5 Challenges to Running ML Systems in Production

Challenge 1: Organizing Machine Learning Experiments

Machine learning is an iterative process. You need to experiment with multiple combinations of data, learning algorithms, and model parameters, and keep track of the impact these changes have on predictive performance. Over time this iterative experimentation can result in thousands of model training runs and model versions, which makes it hard to track the best-performing models and their input configurations.

As in traditional software engineering, it’s rarely the case that a single person will work on developing a model over time. Teams experience turnover, goals change, and new datasets and features become available. So we should expect experimentation to continue long after a model is first built. It becomes increasingly difficult to compare active experiments with past experiments to identify opportunities for further incremental improvements. What’s needed is a system that keeps track of experimental metadata and the impact of different parameters on predictive performance.

SageMaker Experiments lets you organize, track, compare, and evaluate your machine learning experiments by recording results in a unified schema.

The top-level entity, an Experiment, is a collection of trials that are observed, compared, and evaluated as a group.

A Trial is a set of steps called trial components. Each trial component can take inputs such as datasets, algorithms, and parameters, and produce outputs such as models, metrics, datasets, and checkpoints.

Figure: Keep track of Experiments with SageMaker
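
To make this concrete, here is a minimal sketch of creating an Experiment and a Trial with the sagemaker-experiments (smexperiments) Python package. The experiment and trial names are placeholders, and the `experiment_config` dictionary is what you would pass to an estimator’s `fit()` call to associate a training job with the trial.

```python
# A minimal sketch of organizing runs with SageMaker Experiments using the
# sagemaker-experiments (smexperiments) package. Names such as
# "churn-prediction" are placeholders for illustration.
import boto3
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

sm_client = boto3.client("sagemaker")

# An Experiment groups related trials so they can be compared as a unit.
experiment = Experiment.create(
    experiment_name="churn-prediction",
    description="Compare feature sets and hyperparameters",
    sagemaker_boto_client=sm_client,
)

# Each Trial holds the trial components (preprocessing, training, etc.)
# produced by a single end-to-end run.
trial = Trial.create(
    trial_name="xgboost-depth-5",
    experiment_name=experiment.experiment_name,
    sagemaker_boto_client=sm_client,
)

# Passing an experiment_config to estimator.fit() associates the training
# job with this trial so its parameters and metrics are tracked.
experiment_config = {
    "ExperimentName": experiment.experiment_name,
    "TrialName": trial.trial_name,
    "TrialComponentDisplayName": "Training",
}
```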

Challenge 2: Debugging Model Training

Debugging model training jobs is a simple matter when you’re training models in an interactive programming environment such as Jupyter notebooks. If you’re running the code manually, you’ll be met with exceptions and stack traces if training errors out. You’ll also be able to visualize learning curves and other metrics if training succeeds. These diagnostics can reveal whether issues like overfitting or vanishing gradients have occurred.

But debugging training jobs becomes damn near impossible when model training jobs are running as automated batch processes on a recurring schedule. While job schedulers will rerun jobs that explicitly fail, they can’t easily check for issues like overfitting and vanishing gradients unless you code up custom solutions. And since your goal as a data science team is to deploy more and more models, this problem is only going to get worse.

SageMaker Debugger helps you inspect model training by monitoring, recording, and analyzing the data that captures the state of a training job. Debugger provides alerts that are automatically triggered when it detects common errors during model training, such as when gradient values get too low or too high, and allows you to interactively examine them. These capabilities can dramatically reduce the time needed to debug model training.

Errors are detected using Rules. A rule is Python code that detects certain conditions during training, for example, imbalanced training sets, gradients growing too large, or overfitting. SageMaker provides a set of common rules that work out of the box with popular frameworks like TensorFlow, PyTorch, and XGBoost. You can also build and configure custom rules.

Figure: Built-in Rules provided by SageMaker
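
As a rough sketch, here is how built-in rules can be attached to a training job with the SageMaker Python SDK. The training image, role, and instance settings are placeholders, not a specific recommended configuration.

```python
# A minimal sketch of attaching built-in Debugger rules to a training job
# with the SageMaker Python SDK. The training image and instance settings
# below are placeholders for illustration.
import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()

# Built-in rules watch tensors emitted during training and flag conditions
# such as vanishing gradients, overfitting, or a loss that stops improving.
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
]

estimator = Estimator(
    image_uri="<your-training-image>",  # placeholder
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    rules=rules,  # Debugger evaluates these while the job runs
)

# estimator.fit({"train": "s3://<bucket>/train"})  # start training with rules attached
```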

Challenge 3: Deploying Models to Production Environments

A machine learning model can only begin to add value to an organization when that model’s insights routinely become available to the users it was built for. The process of taking a trained ML model and making its predictions available to users or other systems is known as deployment. Deployment is entirely distinct from routine machine learning tasks like feature engineering, model selection, or model evaluation. As such, deployment is not very well understood among data scientists and ML engineers who lack backgrounds in software engineering or DevOps.

There are multiple factors to consider when deciding how to deploy a machine learning model:

  • how frequently predictions should be generated
  • whether predictions should be generated for a single instance at a time or a batch of instances
  • the number of applications that will access the model
  • the latency requirements of these applications

Batch inference allows us to generate predictions on a batch of samples, usually on some recurring schedule. Online inference is required whenever predictions are needed synchronously.

With SageMaker, after you train your model you can deploy it to get predictions in one of two ways.

First, you can set up a persistent endpoint to get one prediction at a time. SageMaker provides an HTTPS endpoint where your machine learning model is available to provide inferences. With this option, all of the details of setting up the endpoint and its networking requirements are taken care of for you automatically.
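
Here is a minimal sketch of the first option with the SageMaker Python SDK. The training job name, endpoint name, and instance type are placeholders.

```python
# A minimal sketch of hosting a trained model behind a persistent HTTPS
# endpoint. The training job name, endpoint name, and instance type are
# placeholders for illustration.
from sagemaker.estimator import Estimator

# Re-attach to a completed training job to get an estimator for its model.
estimator = Estimator.attach("<completed-training-job-name>")

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="my-realtime-endpoint",  # placeholder
)

# The endpoint now serves synchronous predictions; the exact payload format
# depends on the serializer configured for the model.
result = predictor.predict([[0.5, 1.2, 3.4]])
```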

Second, you can use SageMaker batch transform to get predictions for an entire dataset. With batch transform, you create a batch transform job using a trained model and a dataset stored in S3. Batch transform manages the compute resources required for inference, launching instances and deleting them when the job completes. The resulting inferences are saved to an S3 bucket.
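
And a minimal sketch of the second option, reusing the `estimator` from the previous sketch; the S3 paths and instance type are placeholders.

```python
# A minimal sketch of batch inference with SageMaker batch transform,
# reusing the `estimator` attached above. S3 paths are placeholders.
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/batch-predictions/",  # where results are written
)

# Run inference over an entire dataset stored in S3; SageMaker launches the
# instances, processes the data, and tears the instances down when done.
transformer.transform(
    data="s3://<bucket>/batch-input/",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()  # block until the job finishes; outputs land in output_path
```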

Challenge 4: Scaling Up Machine Learning Inference

You’ve deployed your models to endpoints so they can deliver value to your users. This is great progress, but don’t pat yourself on the back just yet. If all goes well your model endpoints might see dramatically higher workloads in the near future. If your organization starts to serve many more users, these increased demands can bring down your machine learning services.

ML models hosted as API endpoints need to respond to such changes in demand. The number of compute instances serving your models should increase when requests rise. When workload decreases, compute instances should be removed so that you don’t pay for instances you aren’t using.

SageMaker supports automatic scaling, aka autoscaling, for hosted models. Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. Autoscaling works by monitoring a target metric, e.g. CPU usage, and comparing it to a target value you assign. Additionally, you configure the minimum and maximum scaling capacity and a cool-down period to control scaling behavior and price.
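
As a rough sketch, autoscaling for an endpoint can be configured through the Application Auto Scaling API with boto3. The endpoint name, variant name, policy name, capacity limits, and target value below are placeholders.

```python
# A minimal sketch of enabling autoscaling for a hosted SageMaker endpoint
# via the Application Auto Scaling API. Names and numeric values are
# placeholders for illustration.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-realtime-endpoint/variant/AllTraffic"  # placeholder

# Register the endpoint variant as a scalable target with min/max capacity.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track a target number of invocations per instance: scale out when the
# metric exceeds the target, scale back in after the cool-down period.
autoscaling.put_scaling_policy(
    PolicyName="my-invocations-policy",  # placeholder
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```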

Challenge 5: Monitoring Models in Production

You’ve deployed your models and configured autoscaling, so surely it’s time to celebrate? Nope.

The real work is said to begin at this stage.

Models must be continuously monitored to detect and combat deviations in model quality, such as data drift. Early and proactive detection of these deviations enables you to take corrective actions, such as retraining models, auditing upstream systems, or fixing data quality issues, without having to monitor models manually or build additional tooling.

SageMaker Model Monitor automatically monitors models in production and notifies you when data quality issues arise.

Figure: SageMaker Model Monitor data flow.

You set up model monitoring in 4 steps (a minimal code sketch follows the list):

  1. First, you create a baseline from the training dataset. SageMaker uses Deequ, an open source library built on Apache Spark, to compute baseline schema constraints and statistics for each feature.
  2. Next, you configure your endpoint to capture data from incoming requests.
  3. Then you create a monitoring schedule specifying what data to collect, how often to collect it, how to analyze it, and which reports to produce.
  4. Finally, you inspect the reports, which compare the latest data with the baseline, and watch for reported violations as well as metrics and notifications from Amazon CloudWatch.
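
Here is a minimal sketch of these steps with the SageMaker Python SDK, assuming data capture is already enabled on the endpoint. The role, bucket paths, endpoint name, and schedule name are placeholders.

```python
# A minimal sketch of the Model Monitor workflow. Bucket paths, the endpoint
# name, and the schedule name are placeholders for illustration.
import sagemaker
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = sagemaker.get_execution_role()

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Step 1: compute baseline statistics and constraints from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://<bucket>/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://<bucket>/monitoring/baseline",
)

# Steps 2-3: with data capture enabled on the endpoint, schedule recurring
# analysis of the captured requests against the baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="my-data-quality-schedule",  # placeholder
    endpoint_input="my-realtime-endpoint",             # placeholder
    output_s3_uri="s3://<bucket>/monitoring/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)

# Step 4: review the generated reports and CloudWatch metrics for violations.
```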

Conclusion

Your precious time as a data scientist or ML engineer is better spent using a platform like SageMaker than building your own. Pete Skomoroch and Lukas Biewald, two experts with experience building machine-learning-driven organizations, both agree that companies should avoid building their own machine learning infrastructure. By leveraging the open source tools and commercial platforms that are now widely available, you can focus on building models that provide differentiated value rather than rebuilding commoditized software.

To help you learn how to use SageMaker, I’ve designed an online course focused on solving the challenges described in this post. If you want to improve your MLOps knowledge (you definitely should want to), then check out the course!

