Test-Driven Machine Learning Development (Deployment Series: Guide 07)

This is post 7 in my Ultimate Guide to Deploying Machine Learning Models. You can find the other posts in the series here.

In the previous post of my machine learning deployment series we discussed how a model registry serves multiple purposes, including storing model lineage, version, and configuration information. For example, a model registry can be queried to learn where a serialized model is stored. A registry also specifies a given model’s stage of development. You can query the registry to learn which version of a trained model is in development versus which version should be used at runtime for production inference. Just as with new software, a newly trained model starts out in a development stage before being promoted to staging and, finally, to production.

But how do you determine when to promote a model through this lifecycle? Let’s stand on the shoulders of giants and learn from software development best practices.

In traditional software development, code must be tested in a variety of ways before being promoted to production. For example, unit tests examine specific units of code, while integration tests verify how your application works with software that lives outside of it.

It’s important to test a machine learning model before promoting it to production. By properly testing a model you’ll gain confidence that the model works as intended. But what does "testing" mean in this context?

In this post we’ll define model testing and discuss several different ways of testing models using offline tests. These testing strategies enable us to "close the loop" on model deployment using CI/CD for machine learning.


Test-Driven Development

As described by Martin Fowler, Test-Driven Development (TDD)

is a technique for building software that guides software development by writing tests.

The process involves three simple steps that are repeated until a project is completed. These steps are:

  1. Write a test for the functionality you want to add.
  2. Write the functional code until the test passes.
  3. Refactor all code to make it well structured.

As a simple example, suppose you want to write a function that accepts a list of numbers and returns a list of those numbers squared:

def square_list(numbers):
    '''
    Accepts a list of numbers and returns a list of those numbers squared.
    '''
    pass

Before implementing the function, you first write tests that will only pass once the function is implemented correctly. Examples of tests for our square_list function include:

  • i) The function throws an exception if the numbers argument is not a list.
  • ii) A list is returned that’s the same length of the numbers argument.
  • iii) The number at index j of the returned list is equal to the square of the number at index j of the numbers argument.

These are just a subset of tests we could write. For example, test iii) assumes that each item in the numbers list is a number. Depending on the context, you might want to write a test to check for that condition as well.
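A minimal sketch of what these three tests might look like using pytest. The module name and the choice of TypeError are assumptions on my part, not part of the original example:

import pytest

from my_module import square_list  # assumes square_list lives in a module named my_module


def test_raises_if_numbers_is_not_a_list():
    # Test i): a non-list argument should raise an exception (TypeError chosen here)
    with pytest.raises(TypeError):
        square_list("not a list")


def test_returns_list_of_same_length():
    # Test ii): the returned list has the same length as the numbers argument
    numbers = [1, 2, 3, 4]
    assert len(square_list(numbers)) == len(numbers)


def test_each_element_is_squared():
    # Test iii): the number at index j of the result is the square of numbers[j]
    assert square_list([2, -3, 0.5]) == [4, 9, 0.25]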

After specifying and writing the test cases, we then implement the square_list function. Once the function passes all the tests, we can be fairly confident that our code achieves the desired functionality. And we have a set of tests that can be used to prevent regressions in the codebase. That’s a major benefit of test-driven development: the technique produces well-tested code, rather than just code.
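For completeness, one implementation that satisfies these tests (there are many valid ways to write it) could be:

def square_list(numbers):
    '''
    Accepts a list of numbers and returns a list of those numbers squared.
    '''
    if not isinstance(numbers, list):
        raise TypeError("numbers must be a list")
    return [n ** 2 for n in numbers]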

Testing Machine Learning Models

Just like regular software, machine learning models must be validated before being deployed. These validations, or tests, ensure that models are delivering high-quality predictions. Models that fail to deliver high-quality predictions can lead to disastrous outcomes for users and organizations. Whereas a poorly performing song recommender system may lead to listener dissatisfaction, an inaccurate object detector in an autonomous driving system can cause death. Clearly it’s best to do what we can to prevent these errors before deploying models to production.

As I mentioned previously, TDD involves writing test cases before implementing application functionality. But writing these tests assumes you know what to test in the first place. This is (usually) straightforward in deterministic systems as we saw in the case of the square_list function. But unlike traditional software, machine learning models are non-deterministic. This is succinctly described in The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction:

ML system testing is also more complex a challenge than testing manually coded systems, due to the fact that ML system behavior depends strongly on data and models that cannot be strongly specified a priori.

So how do we test our models? How do we circumvent the stochastic nature of machine learning, the fact that we can’t predict what data we’ll see "out in the wild"? Let’s examine three ways of writing machine learning model tests. Explicitly specifying these test cases will allow us to build automation into the model deployment process.

Figure. The Machine Learning Test Pyramid. ML requires more testing than traditional software engineering. Source.

1. Performance on a Held-Out Dataset

The first validation involves computing a model’s predictive performance on a held-out dataset and comparing it to a pre-determined minimum acceptable threshold.

During the product planning stage, the Product and Data Science teams should partner to decide on a performance metric and a minimum threshold of performance. This metric/threshold combination defines the minimum level of performance a model should achieve before being deployed to production.
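As a rough sketch, this check can live in the deployment pipeline as a simple validation function. The metric (AUC), the threshold value, and all names below are illustrative, not prescribed:

from sklearn.metrics import roc_auc_score

# Hypothetical threshold agreed on by the Product and Data Science teams during planning
MIN_HOLDOUT_AUC = 0.80


def validate_holdout_performance(model, X_holdout, y_holdout, threshold=MIN_HOLDOUT_AUC):
    '''Block promotion if the candidate model's held-out AUC falls below the agreed threshold.'''
    scores = model.predict_proba(X_holdout)[:, 1]
    auc = roc_auc_score(y_holdout, scores)
    if auc < threshold:
        raise ValueError(f"Held-out AUC {auc:.3f} is below the minimum threshold of {threshold}")
    return auc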

Although this is the most well-known model test and the simplest to implement, there are several things you should keep in mind.

Once the metric/threshold combination is specified, a data splitting strategy must be chosen. Contrary to what data science bootcamps and online tutorials will have you believe, properly splitting a dataset into training, validation, and test sets is more complex than uniform random splitting. The splitting process is problem-dependent: splits should imitate how the model will be used at inference time.

A typical example here is models that predict on data with a temporal component. Consider a lead scoring model used to prioritize outreach to new leads (I describe such a use-case in the first post in this series). Randomly splitting a dataset will produce biased error estimates. Since we seek to rank order newly generated leads, the model should be trained on an older cohort of leads and validated on a newer cohort of leads.
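For the lead scoring example, a time-based split might look something like this (the DataFrame and column names are hypothetical):

import pandas as pd


def temporal_split(leads: pd.DataFrame, cutoff_date: str, date_col: str = "created_at"):
    '''Train on an older cohort of leads and validate on a newer one, imitating
    how the model is used at inference time: scoring leads created after training.'''
    cutoff = pd.Timestamp(cutoff_date)
    train = leads[leads[date_col] < cutoff]
    valid = leads[leads[date_col] >= cutoff]
    return train, valid


# Example: train on leads created before 2020-01-01, validate on everything after
# train_df, valid_df = temporal_split(leads_df, "2020-01-01")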

Other issues include data leakage, hidden feedback loops, and datasets that aren’t representative of the general population.

Finally, beware that strong performance on a held-out dataset, even if the dataset is representative, does NOT mean the model will improve product or business metrics. We’ll discuss this issue more in a future post on online validation strategies. For now keep in mind that validating performance by using aggregate performance metrics on held-out datasets usually isn’t enough.


2. Performance on Specific Examples

Computing a model’s performance against a held-out dataset doesn’t tell us anything about how the model performs on specific examples. Sometimes there are samples of data where we always want a model to produce a specific outcome.

Let’s go back to our lead scoring model and assume we’re working for Company A, a B2B SaaS company selling expensive enterprise software. Sales teams typically have a good idea of the attributes that make up a qualified lead. These qualified lead "profiles" aren’t ML models – they’re usually just heuristics or simple conditional statements on certain input features. The Sales team at Company A believes that early stage startups can’t afford our product. So rather than waste the account executives’ time on companies that have been around for less than 6 months, the Sales team prefers to "nurture" these leads by adding them to automated email drip campaigns.

In this case, any deployed model should classify companies that have existed for less than 6 months as unqualified leads. Assuming the model has access to a company’s age as an input feature, we can construct a set of tests that assert this desired behavior. But how do we do that?

One way is to look at historical data and extract previous samples that fit the profile. We query the CRM for leads whose age is less than 6 months and extract the results. Whenever we train a new model, we predict on these test cases and assert that the prediction always equals unqualified.
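In code this might look like the following. The fixture path, the feature set, and the assumption that the model returns string class labels are all illustrative:

import pandas as pd


def check_young_companies_are_unqualified(model, fixture_path="fixtures/young_company_leads.csv"):
    '''Assert that leads from companies younger than 6 months are always scored as unqualified.
    The fixture is a frozen set of historical CRM leads matching that profile.'''
    young_companies = pd.read_csv(fixture_path)
    predictions = model.predict(young_companies)
    assert (predictions == "unqualified").all(), "Model qualified a company younger than 6 months"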

ML Engineer Emmanuel Ameisen describes the importance of such tests in his book Building Machine Learning Powered Applications:

We will also test predictions for specific inputs. This helps proactively detect regressions in prediction quality in new models and guarantee that any model we use always produces the expected output on these example inputs. When a new model shows better aggregate performance, it can be hard to notice whether its performance worsened on specific types of inputs. Writing such tests helps detect such issues more easily.

3. Performance on Critical Subpopulations

Evaluating a model’s performance on a held-out validation set involves reducing all of the information contained in a set of predictions down to a single number like accuracy or RMSE. Such an aggregation can produce misleading results. For instance, a model’s performance on a specific slice of data can be very different from how it performs across the entire dataset.

It’s very important that your model doesn’t systematically fail for important subpopulations in your data. These subpopulations could be defined by demographics like gender, age, or ethnicity, or they could be functional definitions such as the source of new leads, e.g. organic search vs. Facebook ads vs. LinkedIn ads. Even if your model achieves the minimum acceptable threshold of performance on the aggregate dataset, you could be in trouble once the model is deployed to production if the model is much less performant on a critical subpopulation.

Returning to our lead scoring example, suppose the model achieves great results in aggregate but underperforms for leads originating from mobile ads. Historically, few leads come from mobile ads, so this poor performance washes out in the aggregate. But imagine the marketing team decides to increase spend on mobile advertising, which leads to an increase in leads from mobile platforms. Now we’re in a sticky situation. Prior to the marketing campaign, the model underperformed on mobile leads but there were few of them. Now our model underperforms on a larger percentage of total leads. This can be disastrous for down-funnel performance depending on what decisions we make based on lead scores. And we may not be able to diagnose what’s wrong until it’s too late to course correct.


In order to prevent this problem from occurring we need to measure our model’s performance on important subpopulations before deploying that model to production. We can then evaluate the model by computing the evaluation metric on the whole dataset as well as on the important subslices. In our lead scoring example, this could mean computing accuracy across the whole validation set as well as on each lead source. Doing this will reveal poor performance in the model evaluation stage rather than at runtime when users can be affected.
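A sketch of this kind of sliced evaluation, assuming a pandas validation set and a hypothetical lead_source column:

from sklearn.metrics import accuracy_score


def evaluate_by_slice(model, X_valid, y_valid, slice_col="lead_source", min_accuracy=0.80):
    '''Compute accuracy on the full validation set and on each lead source,
    returning any slices that fall below the minimum acceptable threshold.'''
    metrics = {"overall": accuracy_score(y_valid, model.predict(X_valid))}
    for source, idx in X_valid.groupby(slice_col).groups.items():
        metrics[source] = accuracy_score(y_valid.loc[idx], model.predict(X_valid.loc[idx]))
    failing = {name: acc for name, acc in metrics.items() if acc < min_accuracy}
    return metrics, failing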

A really nice example of this was presented by Andrej Karpathy in his February 2020 talk at the ScaledML Conference. In his presentation, Karpathy discusses how the AI team at Tesla manually curates a set of specific test cases that must pass before a model is deployed to its self-driving vehicles. When it comes to detecting stop signs, the team has created separate test cases for stop signs that are heavily occluded, held by people, in construction zones, etc.

Figure: Andrej Karpathy discussing test-driven machine learning development at Tesla. Tesla’s autonomous vehicles must detect stop signs in a variety of different conditions. Source.

Conclusion

In this post we’ve examined several ways of testing machine learning models. Determining which validation procedures you should implement in your projects depends on the complexity of the application (Do models drive automated actions?), the business cost of model errors (e.g. recommending a song vs. recommending a medical intervention), and the resource constraints of your organization (Do you have infrastructure specialists?).

Here we looked at testing models in an offline setting. Offline tests are used to validate a model on historical data where we have access to the actual ground truth labels. But offline tests aren’t the only type of validations we can (or should) perform. Online tests are necessary to establish causal relationships between a model and some desired effect, such as increased user engagement or higher conversion rates. Implementing these additional validation steps involves more code and infrastructure. We’ll talk about online tests in our next post!

If you’re interested in learning more about model testing, you can download a doc I put together containing additional resources on ML testing!
