Storing Metadata from Machine Learning Experiments

Before deploying a machine learning model to production, data scientists spend a large amount of time conducting experiments. These experiments, which include determining which class of models to use and what types of features to include, produce a number of different artifacts. Without a standardized way of managing the resulting artifacts, data scientists will have a hard time reproducing their analyses and comparing the results of their experiments. In order to achieve reproducibility and comparability of machine learning experiments, data scientists need to store experimental metadata.

Before describing what metadata and artifacts to store, we’ll discuss why storing metadata is critical for the machine learning process. Then we’ll examine the different types of data to store.

Why Store Metadata from Machine Learning Experiments?

Building machine learning models is an iterative process. You start with an initial hypothesis about which input data is useful, build a set of features, and train several models. But even after hyperparameter tuning you’ll likely find that the best model isn’t as performant as you’d like. After conducting an error analysis and speaking with domain experts in the business, you get ideas for new features to build. Weeks later, you have a new model that achieves better performance. But how can you be sure that you’re making a fair, apples-to-apples comparison between the new model and the previous version?

One of the most important reasons for storing metadata from machine learning experiments is comparability, i.e. the ability to compare results across experiments. To compare the results in the example above, we need to be sure that we used the same training and test set splits as well as the same validation scheme. This may be easy in a team of one data scientist, but it becomes much harder when multiple data scientists are working on a single project. If individual data scientists build models independently, perhaps using different libraries and languages, comparing results without a standardized way of collecting and storing experiment metadata is harder still. In this case, even having the serialized model objects doesn’t guarantee comparability between experiments.

Capturing metadata is also critical to ensure reproducibility. Suppose that after several rounds of iterative experimentation, you build a model worthy of being productionized. You go back through your Jupyter notebooks but find that you’ve lost the hidden state in the notebooks and never persisted the actual trained model object. With the appropriate metadata you can retrain the same model and be on the path towards deploying it to production. Without that metadata, you may be stuck with the memories of a great model, but no way of reproducing your results.

What Metadata Should You Capture During Training?

Now that we’ve explained why storing metadata is important, let’s look at the different types of metadata we should store.

Data

The datasets used for model training and evaluation are critical for guaranteeing both comparability and reproducibility. While you may not wish to store copies of the actual underlying dataset, it’s useful to store a pointer to the data’s location. Other metadata includes the name and version of the dataset, its column names and types, and statistics of the dataset such as distributions of the input and target columns.
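As a rough sketch of what this could look like, the snippet below records a dataset pointer, version, schema, and a few summary statistics with pandas and writes them to a JSON file. The file paths, column names, and storage format are assumptions made for illustration, not a prescribed schema.

```python
import json

import pandas as pd

# Load the training data (path and column names are illustrative).
data_path = "data/churn_train_v1.csv"
df = pd.read_csv(data_path)

dataset_metadata = {
    "name": "churn_training",   # logical dataset name
    "version": "v1",            # dataset version tag
    "location": data_path,      # pointer to the data, not a copy of it
    "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
    "n_rows": len(df),
    "target_distribution": {
        str(k): float(v)
        for k, v in df["churned"].value_counts(normalize=True).items()
    },
    "feature_stats": df.describe().to_dict(),  # per-column summary statistics
}

# Persist alongside the rest of the experiment's metadata.
with open("dataset_metadata.json", "w") as f:
    json.dump(dataset_metadata, f, indent=2, default=str)
```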

Model

There are a number of attributes of the model you should store during the training process.

Model Type

One form of model metadata is the class of algorithm used. For regression models, this could be an elastic net or a support vector machine regressor. For classification problems, this might be a random forest or a gradient boosted tree classifier. A simple approach is to store the name of the framework and the class associated with the model. For example, you could store the value sklearn.linear_model.ElasticNet from the scikit-learn library or xgboost.Booster from the xgboost package. Storing these values allows you to easily instantiate new objects of the same class later on.
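As a minimal sketch of this idea, the snippet below derives the fully qualified class string from an estimator and later uses that string to instantiate a new object of the same class. Note that Python reports the class’s actual module, which for scikit-learn may be an internal submodule.

```python
import importlib

from sklearn.linear_model import ElasticNet

model = ElasticNet()

# Record the framework and fully qualified class of the estimator.
model_type = f"{type(model).__module__}.{type(model).__qualname__}"
# For scikit-learn this resolves to an internal module path such as
# "sklearn.linear_model._coordinate_descent.ElasticNet".

# Later, re-instantiate an object of the same class from the stored string.
module_name, class_name = model_type.rsplit(".", 1)
ModelClass = getattr(importlib.import_module(module_name), class_name)
new_model = ModelClass()
```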

Feature Preprocessing Steps

It’s rarely the case that data is readily available in a format ready for training. More often than not, the raw data must be transformed through a series of feature preprocessing steps into a format the machine learning algorithm can accept. This can include encoding categorical variables, dealing with missing values (through imputation or otherwise), centering, scaling, and so on. There may even be "higher-level" steps before this, such as merging data stored in different databases and computing aggregate statistics on denormalized entities. These transforms should be stored as part of the model.

The reason I include these steps as part of the model is simple: if the model is trained on transformed data, then it will expect data in that format going forward. To promote reproducibility, the feature preprocessing steps should be stored together with the fitted model in a single object. This simplifies the process of re-instantiating the fitted model at inference time. For example, scikit-learn’s Pipeline abstraction allows data scientists to chain together a series of preprocessing steps along with a final estimator.
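Here is a rough sketch of that pattern, chaining imputation, scaling, and one-hot encoding with an elastic net in a single scikit-learn Pipeline. The column names and the choice of final estimator are assumptions made for illustration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names are illustrative assumptions about the training data.
numeric_features = ["tenure_months", "monthly_spend"]
categorical_features = ["plan_type", "region"]

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_features),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# The preprocessing steps and the estimator live in one object, so the fitted
# pipeline can be persisted and reloaded as a single artifact, e.g. with
# joblib.dump(model, "model.joblib") after calling model.fit(X_train, y_train).
model = Pipeline([
    ("preprocess", preprocessor),
    ("estimator", ElasticNet(alpha=0.1)),
])
```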

Hyperparameters

Storing the hyperparameters used during the model training process is necessary for reproducibility. Often these are available within the fitted model object, but you may decide to persist these separately in order to build visualizations on top of the metadata. For instance, if you persist the model hyperparameters along with evaluation metrics from training, you can plot how the metrics vary over the hyperparameters, which is useful for the model selection process.
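A minimal sketch of persisting hyperparameters alongside metrics might look like the following; the field names and file name are placeholders for illustration.

```python
import json

from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.5)

# scikit-learn exposes hyperparameters via get_params(); other frameworks have
# similar accessors, or you can record the arguments you passed at construction.
experiment_record = {
    "hyperparameters": model.get_params(),
    "metrics": {"validation_rmse": None},  # filled in after evaluation
}

with open("experiment_record.json", "w") as f:
    json.dump(experiment_record, f, indent=2, default=str)
```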

Metrics

A typical process when training a new model is to prepare the training and validation datasets, find the optimal set of model hyperparameters using a search routine, and then evaluate the performance of the model on a held-out dataset that the model has not seen. During this process, a set of evaluation metrics is computed.

Suppose you’re training a classifier to perform a binary classification task and you’ve decided to use the area under the receiver operating characteristic curve (AUROC) as the evaluation metric. Let’s also assume that you’re performing grid search over a space of hyperparameter settings and that you’re using k-fold cross-validation. This routine implies that you’ll compute the AUROC kD + 1 times, where D is the number of different hyperparameter settings you try. The kD factor comes from the fact that k-fold cross-validation is performed for each of the hyperparameter settings; the + 1 comes from computing the AUROC on the held-out test set at the end of the hyperparameter optimization.
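The snippet below illustrates this accounting with scikit-learn’s GridSearchCV; the synthetic dataset, the random forest classifier, and the grid are assumptions for illustration. With D = 4 settings and k = 5 folds, cv_results_ holds the 20 per-fold validation scores, and the AUROC of the refit model on the held-out test set is the "+ 1".

```python
import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for the real training set.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}  # D = 4 settings
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,  # k = 5 folds, so k * D = 20 validation scores in total
)
search.fit(X_train, y_train)

# Per-fold validation AUROC for every hyperparameter setting (k * D values).
cv_metrics = {
    f"split{i}_test_score": search.cv_results_[f"split{i}_test_score"].tolist()
    for i in range(5)
}

# The "+ 1": AUROC of the refit best model on the held-out test set.
test_auroc = float(roc_auc_score(y_test, search.predict_proba(X_test)[:, 1]))

with open("metrics.json", "w") as f:
    json.dump(
        {"cv_metrics": cv_metrics, "test_auroc": test_auroc,
         "best_params": search.best_params_},
        f,
        indent=2,
    )
```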

I recommend persisting each of these metrics. Why? Persisting the metrics from the training process will help you understand whether your model is overfitting to the training set; you can check this by plotting learning curves from these values. Also, k-fold cross-validation is known to be relatively noisy depending on the dataset, so persisting these values allows you to perform a thorough error analysis. You’ll also be able to investigate how different hyperparameter settings affect the evaluation metrics, which can guide further hyperparameter optimization.

Persisting the metric computed on the held-out test set is important so that you have an idea of how the model will generalize to future unseen data. You’ll need this value to decide whether or not to deploy your model to production.

Context

Achieving reproducibility is complicated by the stochastic and dynamic nature of machine learning experiments. For instance, many algorithms used in machine learning begin by randomly initializing some value and then iteratively improving this guess by looping over subsets of data. To reproduce the outputs of this process, you’ll need access to the code that was run and the value used to seed the random number generator.
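A minimal example of fixing and recording the seed (the value 42 is an arbitrary choice for illustration):

```python
import random

import numpy as np

SEED = 42  # arbitrary; what matters is recording whichever value you use

random.seed(SEED)
np.random.seed(SEED)
# Frameworks with their own generators (e.g. PyTorch, TensorFlow) must be
# seeded separately; store SEED alongside the rest of your experiment metadata.
```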

The random seed is an example of program context. Context is information about the environment of a machine learning experiment which may or may not affect the experiment’s output. Other examples of context include the following (a sketch for capturing several of these items appears after the list):

  • Source code
  • Programming language and version
  • Dependencies
    • e.g., any packages installed through pip and the versions of those packages
  • Host information such as
    • System packages
    • Information about the CPU and operating system
    • Environment variables
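As a rough sketch, much of this context can be captured programmatically from within Python. The output file name and the exact set of fields below are assumptions for illustration.

```python
import json
import os
import platform
import sys
from importlib import metadata

# Snapshot of the runtime environment; extend with whatever context matters
# for your experiments (git commit hash, GPU details, etc.).
context = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "processor": platform.processor(),
    "installed_packages": {
        dist.metadata["Name"]: dist.version for dist in metadata.distributions()
    },
    "environment_variables": dict(os.environ),  # consider filtering out secrets
}

with open("environment.json", "w") as f:
    json.dump(context, f, indent=2)
```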

Luckily, containerization has made it extremely simple to store and version control context. You can use a tool like Docker to capture each of the items listed above. Writing a Dockerfile allows you to make explicit all the steps required to reproduce your entire programming environment, including which operating system to use, which system dependencies to install, which version of Python (or any other language) to run, and so on. You can take this a step further by persisting the built image to a registry. Then, rather than having to rebuild your whole environment later on, you can retrieve the image and run a new container from it. This is especially useful for time-consuming image builds.

Conclusion

Data scientists should seek to generate reproducible and comparable experiments. To achieve reproducibility and comparability, it is critical to store the artifacts produced as well as metadata related to the experiment. In this post, we’ve discussed why storing metadata is important and examined the different types of metadata to store.

