How Data Leakage Impacts Machine Learning Models

The silver bullet. A feature that led to AUC increasing from .6 to .8. After working on feature engineering for several months, I thought I had finally cracked the code and created a feature that pushed my machine learning model into production-ready territory. I was ready to pop the champagne. But like many silver bullets, this one wasn’t what it appeared to be.

The model I had trained appeared to achieve higher performance, but this was due to data leakage. Like overfitting, data leakage leads to models that appear performant during training but don’t generalize to unseen data at inference time. Although I was confident in my code, the model I trained would have exhibited much lower accuracy in production because the values of certain features would have been fundamentally different from what the model expected. The root cause of this data leakage was hidden in the countless updates made to the database that stored the data used for feature engineering. Like many other forms of data leakage, this wasn’t obvious at first glance and would have been very hard to detect in production.

Data leakage can occur for several reasons, many of which can be difficult to debug. In this post, we’ll define what data leakage is and how it occurs. We’ll then discuss steps you can take to identify and prevent data leakage before it occurs.

What is data leakage?

Data leakage occurs when data used at training time is unavailable at inference time. This can lead to overly optimistic estimated error rates, since the data that is unavailable at inference time was deemed predictive during the training process. These error rates can fool data scientists into deploying models that don’t actually achieve the acceptable thresholds of predictive accuracy. Sometimes these models will explicitly fail to render predictions at runtime. For example, if a trained model expects a certain input feature, but that feature isn’t passed to the model at inference time, the model will fail to render a prediction. Other times, however, the model will fail silently. For instance, suppose a specific feature from the training set is very predictive and that feature is passed to the model at runtime. Imagine also that the values of that feature are derived from a column in a database table whose values are updated over time. It follows that the distribution of that feature column will change over time, causing the predictive value of the feature to vary. If we wish for our models to generalize to unseen data, the input features should be distributed similarly at training and inference time.

Models failing silently are much more dangerous than models that explicitly fail. If a model raises an exception, we know to take the model offline and debug the issue. But if the model generates predictions, we may fail to identify that leakage has occurred. This is especially true if we don’t have systems in place to monitor and debug deployed models. Data leakage can lead to suboptimal user experiences, lost profits, and even life-threatening situations. While that may seem extreme at first glance, consider machine learning models deployed to predict patient outcomes in medical situations. It’s imperative that data scientists identify and prevent data leakage before deploying models to production.

How can data leakage occur during the training process?

Let’s take a look at a few common ways in which data leakage can occur during the model training process. This (incomplete) list is ordered by how difficult the leakage is to detect, from least to most difficult.

Using the same data for training and evaluation

One cause of data leakage is improperly splitting data into separate training, validation, and test sets. If a data scientist accidentally includes training data in the held out test set, then the metric used to evaluate the model error will underestimate the true generalization error. This is very similar to the concept of overfitting, which occurs when an overly complex model learns the random noise in the training data. Accidentally using the same data to train and evaluate a model will result in error metrics that underestimate the true error, because the model has already seen the data that it’s been asked to predict.

Even if the dataset is split correctly, leakage can occur if a data scientist peeks at the held out test data during the model building process. For example, after examining the distributions of the held out test data, a data scientist may generate additional features or update the model in some other way (e.g. regularization). Any resulting improvements in model error are invalid because the model has "seen" the held out test set.
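To make this concrete, here is a minimal scikit-learn sketch of a leak-free workflow, assuming a hypothetical students.csv with numeric features and a binary dropped_out label. The test set is carved out before any feature engineering, and all preprocessing is fit on the training split only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("students.csv")  # hypothetical dataset with numeric features
X, y = df.drop(columns=["dropped_out"]), df["dropped_out"]

# Split once, up front, before any feature engineering or scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)

# Fit the scaler on the training set only; transform val/test with it.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# The validation set guides modeling decisions; the test set is touched
# exactly once, at the very end, to estimate generalization error.
print(model.score(scaler.transform(X_val), y_val))
```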

Generating features from data used to calculate the target

Data leakage can occur if the input data and the target are related in some trivial way. For example, suppose a university admissions department is asked to create a student attrition model to predict students at risk of dropping out. To calculate the target, the data scientists retrieve historical log data from student registrations. If a student takes 0 credits in a semester, the student is labeled as having dropped out. Any student with more than 0 credits is considered not to have dropped out.

If the data scientist includes the number of credits a student takes in that semester as a feature, the model will trivially assign a higher probability of dropping out to students with fewer credits. This may not be a useful feature, since students take fewer credits for many reasons unrelated to dropping out. Also, this feature may be unavailable if the admissions department wants to predict attrition several semesters before a student actually drops out, which leads to the next cause of data leakage.
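One simple guard is to track exactly which columns feed the target calculation and exclude them from the feature set. Here is a minimal pandas sketch, with hypothetical file and column names:

```python
import pandas as pd

df = pd.read_csv("registrations.csv")  # hypothetical registration log

# The label is derived directly from credits taken in the semester.
df["dropped_out"] = (df["credits_this_semester"] == 0).astype(int)

# Keep an explicit list of columns used to construct the target and
# exclude them (and anything trivially derived from them) from training.
target_source_cols = ["credits_this_semester"]
X = df.drop(columns=["dropped_out"] + target_source_cols)
y = df["dropped_out"]
```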

Training a model with features that are unavailable at inference time

Building machine learning models in industry involves a different set of constraints than building models in academia or in competitions such as those on Kaggle. In a Kaggle competition, data scientists are provided with all the data used for training and evaluating a model. The competition then focuses on engineering the most predictive combination of features and algorithms. While feature engineering and modeling are an important part of building models in industry, you also need to consider the sources of training and testing data. Oftentimes, data comes from a variety of different data stores, each with different permissioning schemes, query interfaces, and latencies. Data scientists need to consider how these constraints will impact the inference process.

One such constraint is that the features used during training must also be available at inference time in order for the model to generate predictions. Generally speaking, data may be unavailable for two reasons. First, technical reasons might limit which datasets are available. For example, if your model must return predictions within 500 ms, but it takes several minutes for a query to return certain data, you shouldn’t use that data to train the model. You’ll either need to exclude that data or improve the data pipeline to reduce the query time.

The second reason that data may be unavailable at inference time is that the data generating process itself doesn’t produce the data at the appropriate time. This is a much more nuanced issue that can lead to leakage if left unidentified. Returning to our example of student attrition modeling, assume we determine that a student’s mid-semester grades are highly predictive of dropping out. When we train our models, we only look at historical data for which we have mid-semester grades. However, imagine that when deployed, our models need to generate predictions well before mid-semester grades become available. The field still exists in the database, but may be filled with NULL values. If the models can accept NULL features, as some tree-based models can, predictions will be produced, but they will severely underperform compared to training time.
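A lightweight guard against this failure mode is to compare missingness rates between training data and live inference data. Below is a minimal sketch, assuming pandas and hypothetical file paths and column names:

```python
import pandas as pd

def null_rate(df: pd.DataFrame, column: str) -> float:
    """Fraction of rows where the column is missing."""
    return df[column].isna().mean()

train_df = pd.read_parquet("train_features.parquet")      # hypothetical paths
infer_df = pd.read_parquet("inference_features.parquet")

# If a feature that was almost always populated during training is mostly
# NULL at inference time, it is effectively unavailable to the model.
for col in ["mid_semester_gpa"]:                           # hypothetical column
    train_missing = null_rate(train_df, col)
    infer_missing = null_rate(infer_df, col)
    if infer_missing - train_missing > 0.10:
        print(f"WARNING: {col} missing in {infer_missing:.0%} of inference rows "
              f"vs {train_missing:.0%} of training rows")
```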

Using features whose value changes over time

Sometimes all of the features a model expects are available at runtime, but the values of these features change over time. In my opinion, this can be the worst possible cause of data leakage because it can be incredibly difficult, sometimes impossible, to recognize. Further, this challenge often requires large investments in data engineering and infrastructure to solve.

This problem can arise when training and inference data are stored in a relational database used for operational purposes. If you train a model on historical data from the database, you may have no idea how many times particular fields were updated. For instance, suppose a feature is derived from a database column that changes when users perform particular actions. The data used for training may include values that have been updated many times, while at inference time that column may reflect a different reality based on user interactions up to that point. The problem is that you have no way of knowing this from the current state of the table. Typically, the only way to correct for it is to train on snapshots of the data at different historical points in time.
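To make the snapshot idea concrete, here is a small pandas sketch, with hypothetical table and column names, that joins each training example to the most recent snapshot of a feature as of the prediction timestamp rather than to the current value in the operational table:

```python
import pandas as pd

# labels: one row per (student_id, prediction_time, dropped_out) -- hypothetical
labels = pd.read_parquet("labels.parquet")
# snapshots: append-only history of (student_id, snapshot_time, credits) -- hypothetical
snapshots = pd.read_parquet("credit_snapshots.parquet")

labels = labels.sort_values("prediction_time")
snapshots = snapshots.sort_values("snapshot_time")

# For each label, take the latest snapshot recorded at or before the
# prediction time, so training features match what was knowable then.
train = pd.merge_asof(
    labels,
    snapshots,
    left_on="prediction_time",
    right_on="snapshot_time",
    by="student_id",
    direction="backward",
)
```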

How to prevent data leakage

Now that we’ve examined several causes of data leakage, let’s go over what you can do to identify and prevent data leakage from occurring.

  1. Properly split datasets into training, validation, and test sets. This helps prevent overfitting, which keeps you from deploying models that underperform in production.
  2. Do not look at the held out test set. Do not use this dataset for feature engineering. Again, this is to prevent overfitting.
  3. Be aware of how the target variable is calculated. It may or may not be wise to use certain data for feature engineering depending on how the target is calculated. Make a list of the variables used in the target calculation and think through whether including subsets of that data may bias your model.
  4. Understand the data generating process. As a data scientist, you should understand
  • what data is generated
  • where it comes from
  • when it’s generated
  • and how it’s generated.
    Including arbitrary fields in your model can be extremely detrimental, and you may not be able to detect this until it’s too late. Don’t be that data scientist.
  5. Understand the data pipeline. You want to know how long it takes to perform certain queries and the availability of the data stores. You want to know how data flows through the system. Data engineering is critical.
  6. Monitor the distributions of input data at inference time. Comparing these distributions to those from the training set can help you determine if the new data differs from your expectations (a minimal programmatic check is sketched after this list).
  7. Monitor the distributions of the outputs of your models. Sometimes the ground truth signal is available shortly after predictions are generated. Sometimes it’s available months after a prediction is generated. And sometimes, the ground truth signal may never be available because of hidden feedback loops in which the predictions themselves drive actions. Again, understand the data generating process.
  8. If possible, train on historical snapshots of data. This requires large investments in data engineering infrastructure to capture, store, and query data.
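On the monitoring point (item 6 above), here is a minimal sketch of a programmatic drift check, assuming SciPy and pandas and hypothetical feature names. It applies a two-sample Kolmogorov–Smirnov test to flag numeric features whose recent inference-time distribution differs from the training distribution:

```python
import pandas as pd
from scipy.stats import ks_2samp

train_df = pd.read_parquet("train_features.parquet")       # hypothetical paths
recent_df = pd.read_parquet("last_week_inference.parquet")

# Flag numeric features whose recent distribution differs significantly
# from the training distribution (two-sample KS test).
for col in ["credits_this_semester", "mid_semester_gpa"]:   # hypothetical columns
    stat, p_value = ks_2samp(train_df[col].dropna(), recent_df[col].dropna())
    if p_value < 0.01:
        print(f"Possible drift in {col}: KS statistic {stat:.3f}, p={p_value:.4f}")
```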

Conclusion

Data leakage occurs when machine learning models are trained on data that is unavailable at inference time, and it often leads to models that do not generalize to unseen data. Sometimes this leads to models that fail to generate predictions. Other times, this leads to models that fail silently, i.e. the models generate predictions, but those predictions are inaccurate. Detecting and correcting data leakage can be extremely difficult and often relies on additional investments in infrastructure and data engineering. To combat data leakage, data scientists should do all they can to gain domain expertise by understanding the data generating process and data pipeline.

