A/B Testing Machine Learning Models (Deployment Series: Guide 08)

This is post 8 in my Ultimate Guide to Deploying Machine Learning Models. You can find the other posts in the series here.

For readers interested in experimentation, I’m writing a new series on how to build an effective experimentation program at your company, based on my own experience doing so. In that series I describe the people, processes, and infrastructure necessary to run online controlled experiments (aka A/B tests) at scale. Check it out!

In the last post of my series on deploying machine learning models I described how to test ML models on historical data. These offline tests are used to validate data, infrastructure, and models before deploying a model to production.

Although offline tests can demonstrate adequate model performance on historical data, they cannot establish causal relationships between a model and user outcomes. When machine learning is introduced to drive specific user behavior, like increasing click-through rate or engagement, we also need to perform online validation, also known as experimentation.

In this post I describe experimentation outside of the context of machine learning and then discuss why online validation of machine learning models is necessary. We’ll talk about A/B tests as a technique to perform online validation and discuss an architecture that can be implemented to A/B test machine learning models. At the end of this post you can download a guide I created to demonstrate how to implement the A/B testing architecture on both Google Cloud and AWS.

Why is Online Experimentation Necessary?

To understand why it’s necessary to validate machine learning models with online tests, we first need to understand why online experimentation is necessary. This comes down to two ideas.

The first is that companies have goals. This might seem like a trivial statement, but it’s important to keep in mind. Goals include increasing annual recurring revenue, increasing user engagement, and lowering churn. Progress against such goals is measured by tracking specific metrics, known as key performance indicators (KPIs). To achieve its goals, a company takes actions, which can include launching marketing campaigns, building new features, or iterating on existing features.

This leads us to an important question. How does a company know that its actions are moving the company towards achieving its goals?

This question leads us to the second idea we need to understand: how to establish causality. Suppose a company changes its UI to drive a specific user behavior. For example, imagine the company changes the color of a button on its website from red to green in order to increase the click-thru rate of the button. Or imagine an e-commerce company that simplifies its checkout process in order to minimize the number of users who abandon their carts. How can these companies know, with sufficient confidence, that their actions are causing positive outcomes?

That’s where experimentation, aka online testing, comes in. Online experiments are used to measure the impact of an action against some baseline according to a specific metric. The changes are known as treatments and the baseline is known as the control. In our UI example, the control is the original button color (red) and the new color (green) is the treatment. The metric we measure is the click-through rate of the button.

A/B Testing can be used to determine whether changing the UI leads to higher conversions. Source.

To establish causality, we perform a randomized controlled experiment. One such experiment is known as an A/B test. In an A/B test, users are split into two distinct, non-overlapping cohorts. One cohort sees the treatment (the green button) and the other cohort sees the control (the red button). After a period of time established before the experiment begins, we measure the click-through rates of both cohorts. If the metric is significantly higher for the treatment, we conclude that the new color causes increased click-through and roll out the green button to all users. If the control has the higher click-through rate, we conclude that the green button doesn’t cause increased click-through and keep the original red button.
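
In practice, "significantly higher" is assessed with a hypothesis test on the observed rates of the two cohorts. Below is a minimal sketch of a two-proportion z-test in Python; the click and impression counts are made-up numbers for illustration.

from math import sqrt, erf

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)             # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # two-sided p-value
    return z, p_value

# Hypothetical counts: control (red button) vs. treatment (green button)
z, p = two_proportion_ztest(clicks_a=480, n_a=10_000, clicks_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # significant at alpha = 0.05 if p < 0.05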

Why A/B Test Machine Learning Models?

Let’s use the framework from the previous section to understand why it’s necessary to validate machine learning models with online tests.

Organizations invest time and money into building machine learning models in order to improve business results, and progress against those business goals is measured by tracking KPIs. But when data scientists and ML engineers build models on their machines, they don’t measure progress against these KPIs; they measure model performance against historical datasets. A model that performs well on offline tests and metrics does NOT necessarily drive the KPIs the business cares about. Why? Because we can’t establish causality through offline tests.

As an example, consider a video sharing platform that generates revenue through advertising (sound familiar?). The business generates profit when viewers click on ads. The product team knows that users who spend more time in the app tend to click on more ads, so they choose to optimize session watch time as a proxy metric. A new product manager comes along and hypothesizes that better recommendations will increase engagement and maximize session watch time. Time to bring in the data science team.

Fast forward to when the data science team has a new recommendation model built and ready for deployment. The model was validated on historical data through offline tests, but we don’t know whether it will increase time spent watching videos. Validating the original hypothesis that better recommendations improve engagement requires running a randomized controlled experiment, i.e. an A/B test.

Industry Examples of A/B Testing ML Models

In his Lessons learned from building practical deep learning systems lecture, Xavier Amatriain describes 12 lessons he’s learned from building and deploying deep learning systems in production. Lesson 9 of the talk emphasizes the need to validate machine learning models using online experimentation. According to Amatriain, positive offline performance is an indication to perform online tests via A/B tests. To test whether a new model should be deployed, data scientists should

Measure differences in metrics across statistically identical populations that each experience a different algorithm.

Once significant improvements have been observed during online tests, we can roll out new models to the user base. This implies an additional validation step prior to deploying a model to all users. According to Amatriain, how offline metrics correlate with A/B test results is not well understood.

Offline performance is an indication to make decisions on follow-up A/B tests. Source.

A similar idea is expressed in the 2019 paper 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com. In section 3, entitled Modeling: Offline Model Performance is Just a Health Check, the authors state that:

In Booking.com we are very much concerned with the value a model brings to our customers and our business. Such value is estimated through Randomized Controlled Trials (RCTs) and specific business metrics like conversion, customer service tickets or cancellations. A very interesting finding is that increasing the performance of a model, does not necessarily translates to a gain in value.

We stress that this lack of correlation is not between offline and online performance, but between offline performance gain and business value gain. At the same time we do not want to overstate the generality of this result, the external validity can be easily challenged by noting that these models work in a specific context, for a specific system, they are built in specific ways, they all target the same business metric, and furthermore they are all trying to improve it after a previous model already did it. Nevertheless we still find the lack of correlation a remarkable finding. Only where the offline metric is almost exactly the business metric, a correlation can be observed.

According to the authors this phenomenon can be explained by four factors:

  • Value Performance Saturation – It’s not possible to continue deriving business value from model improvements indefinitely.
  • Segment Saturation – Over time the size of the treatment groups decreases so it becomes more difficult to detect statistically significant gains in value.
  • Uncanny Valley effect – As model performance improves over time, certain users are unsettled by how well models predict their actions. This negatively affects the user experience.
  • Proxy Over-optimization – Models may over-optimize observable variables that are merely proxies for the underlying business objectives.
The phenomenon whereby users react negatively to highly accurate predictions is known as the Uncanny Valley effect. Source.

How to A/B Test ML Models

How can we run an A/B test to determine whether a new model is better than an incumbent model?

The first step when running an A/B test is to determine the business outcome we wish to achieve and to choose the metric by which we’ll measure progress against that outcome. In experimentation this metric is sometimes referred to as the Overall Evaluation Criterion (OEC). Oftentimes the OEC is a proxy metric for the desired business outcome rather than a direct measurement of it. One reason for preferring proxy metrics is speed: an OEC that can be measured on the order of hours or days allows us to incorporate experimental feedback quickly.

The next step is to determine the parameters of the experiment itself: the sample sizes (how users are split between the control and treatment groups) and the duration of the experiment. How users are split into cohorts determines which users will see the new machine learning model and which users will continue to see the currently deployed model.
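
A common way to implement the split is to assign each user deterministically based on a hash of their user ID, so the same user always lands in the same cohort. Here’s a minimal sketch of that idea; the salt string and the 50/50 split are arbitrary choices for illustration.

import hashlib

def assign_cohort(user_id, salt="recsys-ab-01", treatment_share=0.5):
    """Deterministically map a user to the 'treatment' or 'control' cohort."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000  # pseudo-uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

print(assign_cohort("user-42"))  # the same user ID always yields the same assignment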

The required sample size, and therefore the duration of the experiment, comes from a power analysis in which you choose the statistical power (the probability of detecting a real effect, i.e. one minus the false negative rate) and the significance level (the probability of rejecting the null hypothesis when it is actually true, i.e. the false positive rate). Much has been written about choosing these values (for example here and here) and there are calculators to help you determine the required sample size.
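
As a rough illustration, here is a minimal sample-size calculation for comparing two proportions, assuming a baseline click-through rate, a minimum detectable absolute lift, the conventional 5% significance level, and 80% power. The numbers are placeholders, not recommendations.

from statistics import NormalDist

def sample_size_per_group(p_baseline, min_detectable_lift, alpha=0.05, power=0.80):
    """Approximate users needed per cohort for a two-proportion z-test."""
    p1, p2 = p_baseline, p_baseline + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / min_detectable_lift ** 2

# e.g. a 5% baseline CTR and a minimum detectable absolute lift of 0.5 percentage points
print(round(sample_size_per_group(0.05, 0.005)))  # roughly 31,000 users per cohort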

An A/B Testing Architecture for Machine Learning Models

Now comes the challenging part: how do we actually run the test? Running the experiment means operating both models simultaneously and ensuring that the treatment and control groups see the correct model. Doing this correctly depends on your company’s data infrastructure and data model. Let’s discuss a naive implementation and then improve it. I’ll sketch the naive approach using Flask.

Suppose model_A is the currently deployed model that can be queried by making an HTTP request to the /predict endpoint.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # get_json() is a method call
    return jsonify(prediction=model_A.predict(features))

Now imagine that we split users into control and treatment groups based on their user ID. Users in the treatment group should see the new model, which we’ll call model_B. A naive A/B test implementation puts the routing logic that matches users to models inside the model serving application itself.

# IDs of users assigned to the treatment group
TREATMENT_IDS = set()

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = payload["features"]
    if payload["user_id"] in TREATMENT_IDS:
        return jsonify(prediction=model_B.predict(features))
    return jsonify(prediction=model_A.predict(features))

This architecture is captured in the following diagram.

A naive architecture for A/B testing models involves serving multiple models within a single application.

Why do I think this is a bad pattern? Three reasons.

The first is delegation of responsibility. Before we introduced the experiment, our API codebase was responsible only for handling requests to a trained model. That responsibility could be fulfilled by a lightweight application written in a few lines of code with a relatively small test suite (assuming the API is built on a well-tested framework). Adding the experiment logic to the API increases the application’s responsibilities, the size of the codebase, and the number of tests that need to be written.

The second reason comes down to mitigating risk by eliminating potential sources of error. Adding code always increases the application’s "failure surface": more code means more tests. The deployments themselves also increase the probability of introducing bugs; you’ll need to deploy code once to start the experiment and again when the experiment is over. We’d be better off writing generic API logic once, behind well constructed interfaces, and rarely updating that logic again.

Arguably the most important reason, at least operationally, is that this pattern relies on developers to make changes. Product experiments should be managed by product managers (or someone in a product role). Therefore we’d prefer that PMs be able to toggle experiments on and off whenever necessary without touching the codebase. This is straightforward if experiments are managed through configuration stored in a database. That configuration would map users to treatments and contain additional metadata such as when the experiment began, its intended duration, the parameters of the power analysis, and so on.
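
To make this concrete, here’s a sketch of what one such configuration record might look like, expressed as a Python dict. The field names and values are hypothetical; the actual schema will depend on your experimentation tooling.

experiment_config = {
    "experiment_id": "recsys-model-b-vs-a",
    "status": "running",  # a PM flips this to "stopped" to end the experiment
    "started_at": "2020-06-01T00:00:00Z",
    "planned_duration_days": 14,
    "power_analysis": {"alpha": 0.05, "power": 0.80, "min_detectable_lift": 0.005},
    "variants": {
        "control": {"model": "model_A", "traffic_share": 0.5},
        "treatment": {"model": "model_B", "traffic_share": 0.5},
    },
}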

A better approach involves adding an additional layer of abstraction between the clients requesting inference and the models. This layer is responsible for routing incoming requests based on the settings of the experiment. In this approach, each trained model is hosted in a separate environment and a routing application acts as an intermediary between the clients and the models. The routing application accepts incoming requests, determines which model to query based on the experiment configuration, and then routes the request to the application serving that model. The selected model returns a prediction to the routing application, which returns it to the client. This pattern is illustrated below.

A better architecture for A/B testing models involves adding a routing application responsible for routing incoming model requests to appropriate hosted models based on experiment configuration.
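
Here’s a minimal sketch of what such a routing service might look like in Flask. The model endpoints, the config loader, and the cohort-assignment helper are hypothetical placeholders; in practice the routing layer would read the experiment configuration from a database or experimentation platform rather than hard-coding it.

import hashlib

import requests
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical endpoints where each model is served as an independent service.
MODEL_ENDPOINTS = {
    "model_A": "http://model-a-service/predict",
    "model_B": "http://model-b-service/predict",
}

def load_experiment_config():
    # Placeholder: read this from a database or experimentation platform in practice.
    return {"status": "running", "variants": {"control": "model_A", "treatment": "model_B"}}

def assign_cohort(user_id):
    # Deterministic 50/50 split based on a hash of the user ID.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    return "treatment" if bucket else "control"

@app.route("/predict", methods=["POST"])
def route_prediction():
    payload = request.get_json()
    config = load_experiment_config()
    if config["status"] == "running":
        model_name = config["variants"][assign_cohort(payload["user_id"])]
    else:
        # Experiment is off: everyone sees the control model.
        model_name = config["variants"]["control"]
    response = requests.post(MODEL_ENDPOINTS[model_name], json=payload, timeout=1.0)
    return jsonify(response.json())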

This approach fixes the issues of the naive approach:

  • Delegation of responsibility – Our application now consists of two kinds of services. The model services are responsible for performing inference. The routing service is responsible for routing incoming requests to the appropriate model server based on the experiment configuration.
  • Eliminating Sources of Error in the Codebase – There’s no need to update the model serving codebase when running an experiment; just launch a new model server for each model you wish to test. There’s also no need to update the routing service codebase, provided it’s written to read from the experiment configuration (and that should be a requirement).
  • Toggling Experiments On/Off – Product managers can turn experiments on and off by updating the experiment configuration. No need to get developers involved.

Both Google Cloud (GCP) and AWS offer mechanisms to A/B test machine learning model deployments. If you’re interested, I’ve written a brief guide showing how to implement this architecture on both GCP and AWS that can be downloaded below.

Multi-Armed Bandits for Machine Learning Models

A/B tests are not the only type of online experiment. Multi-Armed Bandit (MAB) experiments learn from the data gathered during the test and dynamically shift traffic allocation in favor of better-performing variations. A bandit relies on this dynamic traffic allocation to continuously estimate the degree to which one variation is outperforming the others and routes the majority of traffic, in real time, to the winning variant; variations that perform poorly receive less and less traffic over time. This means a MAB tends to maximize the total reward (e.g. conversions) accrued during the test, which A/B testing does not, but it also means you can’t measure the impact of every variation with the same statistical confidence.
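
To give a flavor of how a bandit reallocates traffic, here’s a minimal Thompson-sampling sketch for two model variants with binary rewards (e.g. click / no click). The "true" conversion rates are simulated placeholders; in production the reward would come from logged user outcomes.

import random

# Beta(wins + 1, losses + 1) posterior over each variant's conversion rate
stats = {"model_A": {"wins": 0, "losses": 0}, "model_B": {"wins": 0, "losses": 0}}
TRUE_RATES = {"model_A": 0.048, "model_B": 0.056}  # unknown in practice; simulated here

def choose_variant():
    """Sample a rate from each posterior and route the request to the largest draw."""
    draws = {
        name: random.betavariate(s["wins"] + 1, s["losses"] + 1)
        for name, s in stats.items()
    }
    return max(draws, key=draws.get)

for _ in range(10_000):
    variant = choose_variant()
    converted = random.random() < TRUE_RATES[variant]  # stand-in for a real user outcome
    stats[variant]["wins" if converted else "losses"] += 1

# Traffic allocation skews toward the better-performing variant over time
print({name: s["wins"] + s["losses"] for name, s in stats.items()})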

For additional details on multi-armed bandits and how the approach differs from A/B tests, I encourage you to read these two articles:

Conclusion

Organizations perform online experiments such as A/B tests to understand whether their actions lead to desired outcomes. These actions can come in the form of marketing campaigns, new product features, or variations on existing products such as tweaks to sales copy or the UI. Introducing machine learning models with the purpose of influencing user behavior is another "lever" that needs to be validated through online experimentation.

Running A/B tests when deploying ML models requires additional code and infrastructure for configuring experiments and routing requests to the appropriate model. In this post we discussed one pattern for implementing A/B tests for machine learning models.


11 thoughts on “A/B Testing Machine Learning Models (Deployment Series: Guide 08)”

  1. Are canary tests a valid method to establish the causality between the model and the actions of the users?

  2. Since data seen by the model in an A/B test is unlabeled, how do I calculate the model accuracy for control and treatment groups?

    1. I’m not quite sure what you mean. Typically there is some lag between when a model prediction is served and when the actual signal (label) is observed. You have to collect both the predictions and the signals (oftentimes separately), store them, and then join them to calculate metrics.
