Model Registries for ML Deployment (Deployment Series: Guide 06)


This is post 6 in my Ultimate Guide to Deploying Machine Learning Models. You can find the other posts in the series here.

In the last several posts of our machine learning deployment series we discussed how to deploy models for batch inference and online inference. In each of these posts we relied on the Model class defined in the software interfaces for ML deployment post to handle storing and retrieving a serialized model to and from a remote object storage system like Amazon S3 or Google Cloud Storage. This enables us to run independent model training and inference processes, possibly even on separate machines. After a model is trained and stored remotely, it can be used during inference by referencing the model’s remote file path.

Although this system allows us to separate model training from inference, it lacks a mechanism for passing information between these two processes. For example, how would we inject the remote file path from the training process into the inference process? Doing this manually is one option, but we’d prefer to "close the loop", especially if we wish to automate our machine learning deployments.

What we need is a centralized tracking system for trained machine learning models, otherwise known as a model registry. Similar to a domain name registry or container registry, a model registry is a database that stores model lineage, versioning, and other configuration information.

In the rest of this post we’ll discuss what an ML model registry is, implement a simple registry of our own, and introduce the popular open source registry that ships with MLflow. Let’s dive in!



What is a Model Registry?

A machine learning model registry is a centralized tracking system that stores lineage, versioning, and related metadata for published machine learning models. A registry may capture governance data required for auditing purposes, such as who trained and published a model, which datasets were used for training, the values of metrics measuring predictive performance, and when the model was deployed to production. Model registries serve two major purposes:

  1. First, registries provide a mechanism to store model metadata.
  2. Second, registries "connect" independent model training and inference processes by acting as a communication layer. A well-constructed registry allows inference processes to correctly decide which published model to use when generating predictions.

What kind of model metadata should we store in a registry? In general, the answer depends on the requirements of your system and company. Machine learning applications operating in highly regulated industries require detailed audit trails, so comprehensive metadata should be stored. Less regulated industries may not require as much detail, but you should look beyond the bare minimum and capture metadata that will help your system scale as the modeling team and the number of applications (and users) grow.

Let’s start with a minimum set of metadata to store in a model registry. For each registered model, we should store an identifier, name, version, the date this version was added, the model’s predictive performance as measured by some evaluation metric, the remote path to the serialized model, and the model’s stage of deployment. The stage of deployment could include concepts like development, shadow-mode or production, but this can grow with your needs.

Additional metadata you might wish to include are:

  • a human-readable description of the model’s purpose
  • a human-readable description of how this model version differs from the previous version
  • which datasets were used for training
  • the git hash or Docker image ID of the code that generated the model
  • runtime metrics such as training time
  • who published the model.
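As a rough sketch, one way to picture a single registry record is as a small data structure. The field names below mirror the minimal set described above; the defaults and the specific stage values are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RegistryRecord:
    """Illustrative minimal metadata for one registered model version."""
    name: str                   # model name, e.g. 'lead_scoring'
    version: str                # version identifier
    metrics: dict               # evaluation metrics, e.g. {'accuracy': 0.8}
    remote_path: str            # path to the serialized model in object storage
    stage: str = 'DEVELOPMENT'  # e.g. DEVELOPMENT, SHADOW, PRODUCTION
    registered_date: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = RegistryRecord(
    name='lead_scoring',
    version='0.0.1',
    metrics={'accuracy': 0.8},
    remote_path='s3://models/lead_scoring::0_0_1',
)
print(record.stage)  # DEVELOPMENT by default
```

A relational table, as we build next, stores exactly these fields; the dataclass is just a way to see the record shape at a glance.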

Implementing a Model Registry

Let’s implement the minimal model registry proposed in the previous section. We’ll use a relational database to store the metadata and create a set of python functions to populate the registry. We’ll show how to use these functions along with the Model interface you can download in my Software Interfaces for Machine Learning Deployments post.

In order to make it as easy as possible to run the code in this section, we’ll restrict our dependencies to common Python data science libraries like Jupyter and pandas.

Model Registry Database

SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows access to the database using a nonstandard variant of the SQL query language. It’s possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL. The sqlite3 module provides a SQL interface to SQLite.

First, let’s create a database called registry.db:

import sqlite3

conn = sqlite3.connect('registry.db')

The conn object provides a connection to registry.db we’ll use to execute our queries.

Next, we’ll create a table called model_registry that includes the fields from the minimal model registry proposed in the previous section. Below is the schema for the model_registry table:

CREATE TABLE model_registry (
    id INTEGER PRIMARY KEY ASC,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    registered_date TEXT DEFAULT CURRENT_TIMESTAMP NOT NULL,
    metrics TEXT NOT NULL,
    remote_path TEXT NOT NULL,
    stage TEXT DEFAULT 'DEVELOPMENT' NOT NULL
);

To create the table we’ll create a Cursor object and call its execute() method:

cur = conn.cursor()
cur.execute("""
CREATE TABLE model_registry (
    id INTEGER PRIMARY KEY ASC,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    registered_date TEXT DEFAULT CURRENT_TIMESTAMP NOT NULL,
    metrics TEXT NOT NULL,
    remote_path TEXT NOT NULL,
    stage TEXT DEFAULT 'DEVELOPMENT' NOT NULL
);
""")
conn.commit()
cur.close()

Note that name is not declared UNIQUE: the registry stores multiple versions of the same model, so many rows will share a name.

We can use the pandas library to read from the table and return a pandas DataFrame:

import pandas as pd

pd.read_sql_query("SELECT * FROM model_registry;", conn)

Finally, let’s insert a row into the table and query the result:

values = ('lead_scoring', '0.0.1', 'accuracy: 0.8', 's3://models/lead_scoring::0_0_1')

cur = conn.cursor()
cur.execute("""
INSERT INTO model_registry 
(name, version, metrics, remote_path)
VALUES (?, ?, ?, ?)""", values)
conn.commit()
cur.close()

pd.read_sql_query("SELECT * FROM model_registry;", conn)

The SELECT query returns the following result:

id name version registered_date metrics remote_path stage
1 lead_scoring 0.0.1 2020-02-15 12:39:36 accuracy: 0.8 s3://models/lead_scoring::0_0_1 DEVELOPMENT

It’s looking awesome so far 😉

Model Registry API

Although we’ve built a basic database table that stores metadata on published models, we haven’t specified how this metadata is added to the database. In the previous section we wrote a SQL INSERT statement to insert dummy values into the table, but we don’t want to have data scientists reimplement this logic each time they publish a new model. We’d prefer to simplify this process by encoding common operations into a set of functions, otherwise known as an application programming interface (API).

There are several benefits to encoding the operations we wish to perform on the model registry in an API:

  1. Ease of use – Data scientists on the team don’t have to think about how to interact with the database. This frees them up to spend more time developing models. This also makes it easy for new data scientists on the team to come up to speed.
  2. Repeatability – Since we know that certain operations will be run many times, we can have one canonical implementation. No need to reimplement the same logic multiple times.
  3. Easier to test – We can easily write unit and integration tests to verify the API works as intended.
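To make the last point concrete, here’s a sketch of what a unit test might look like for the registry schema defined earlier. It uses an in-memory SQLite database so each test run starts from a clean, throwaway registry:

```python
import sqlite3

# Same schema as the model_registry table defined above.
SCHEMA = """
CREATE TABLE model_registry (
    id INTEGER PRIMARY KEY ASC,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    registered_date TEXT DEFAULT CURRENT_TIMESTAMP NOT NULL,
    metrics TEXT NOT NULL,
    remote_path TEXT NOT NULL,
    stage TEXT DEFAULT 'DEVELOPMENT' NOT NULL
);
"""

def test_new_models_default_to_development():
    # ':memory:' creates a fresh database that vanishes when the
    # connection closes, so the test leaves no files behind.
    conn = sqlite3.connect(':memory:')
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO model_registry (name, version, metrics, remote_path) "
        "VALUES (?, ?, ?, ?)",
        ('lead_scoring', '1', '{"accuracy": 0.8}', 's3://models/lead_scoring::v1'),
    )
    stage = conn.execute("SELECT stage FROM model_registry").fetchone()[0]
    assert stage == 'DEVELOPMENT'
    conn.close()

test_new_models_default_to_development()
```

The same pattern extends to testing the API functions themselves: construct the registry against an in-memory connection, exercise a method, and assert on the resulting rows.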

Designing & Implementing the Model Registry API

In order to design our API, let’s first define what operations we wish to perform:

  1. Publish newly trained models.
  2. Publish a new version of a model.
  3. Update the deployment stage of a published model.
  4. Get metadata associated with a productionized model.

Now we can encode these as a set of Python functions. Below is a ModelRegistry class with instance methods that carry out these operations.

import json

import pandas as pd

class ModelRegistry:
    def __init__(self, conn, table_name='model_registry'):
        self.conn = conn
        self.table_name = table_name
        
    def _insert(self, values):
        query = """
                INSERT INTO {} 
                (name, version, metrics, remote_path)
                VALUES (?, ?, ?, ?)""".format(self.table_name)
        self._query(query, values)

    def _query(self, query, values=None):
        cur = self.conn.cursor()
        cur.execute(query, values or ())
        self.conn.commit()
        cur.close()      
        
    def publish_model(self, model, name, metrics):
        version = 1
        remote_path = 's3://models/{}::v{}'.format(name, version)
        metrics_str = json.dumps(metrics)
        # model.to_remote(remote_path)
        self._insert((name, version, metrics_str, remote_path))
    
    def increment_version(self, model, name, metrics):
        version_query = """
                        SELECT 
                            version 
                        FROM 
                            {}
                        WHERE
                            name = ?
                        ORDER BY
                            registered_date DESC
                        LIMIT 1
                        ;""".format(self.table_name)
        version = pd.read_sql_query(version_query, self.conn, params=(name,))
        version = int(version.iloc[0]['version'])
        new_version = version + 1
        remote_path = 's3://models/{}::v{}'.format(name, new_version)
        # model.to_remote(remote_path)
        metrics_str = json.dumps(metrics)
        self._insert((name, new_version, metrics_str, remote_path))
    
    def update_stage(self, name, version, stage):
        query = """
                UPDATE
                    {}
                SET stage = ?
                WHERE 
                    name = ? AND
                    version = ?
                ;""".format(self.table_name)
        self._query(query, (stage, name, version))

    def get_production_model(self, name):
        query = """
                SELECT
                    *
                FROM
                    {}
                WHERE
                    name = ? AND
                    stage = 'PRODUCTION'
                ;""".format(self.table_name)
        return pd.read_sql_query(query, self.conn, params=(name,))

Let’s discuss each instance method of the ModelRegistry class.

  • __init__(self, conn, table_name='model_registry') – The class constructor accepts a sqlite3.Connection object and the name of the database table.
  • _insert(self, values) writes a new row to the registry table. The leading underscore indicates this method is part of the private API and shouldn’t be used outside of other ModelRegistry class methods.
  • _query(self, query, values=None) accepts a string containing a SQL query and a tuple of values to inject into the query. Again, the leading underscore indicates that this method is part of the private API.
  • publish_model(self, model, name, metrics) takes in a Model object, a name, and a set of model metrics. The Model object is defined in the Software Interfaces for ML Deployment post and includes a to_remote method for persisting a trained model to a remote filestore like S3. publish_model first persists the model to the remote filestore (the to_remote call is commented out above so the example runs without remote storage) and then inserts a new row into the registry table.
  • increment_version(self, model, name, metrics) increments the version of a previously published model. The method retrieves the most recent published model version and inserts a new row with the version incremented by 1.
  • update_stage(self, name, version, stage) updates the stage column of a specific model and version. This method is used to denote whether a model is suitable for production inference. Our implementation allows a client to pass in any value for the stage argument, but you’ll probably wish to limit the set of possible values and validate the client input.
  • get_production_model(self, name) retrieves the metadata of the model whose stage column equals "PRODUCTION". A production inference process can use this to look up the remote_path of the model that should be deserialized and loaded into memory to generate predictions.
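As noted for update_stage, you’ll likely want to validate the stage value rather than accept arbitrary strings. A minimal sketch of such validation follows; the particular set of allowed stages is an assumption, not part of the implementation above:

```python
# Hypothetical set of allowed deployment stages; adjust to your workflow.
ALLOWED_STAGES = {'DEVELOPMENT', 'SHADOW', 'PRODUCTION', 'ARCHIVED'}

def validate_stage(stage):
    """Return stage unchanged, or raise ValueError if it isn't allowed."""
    if stage not in ALLOWED_STAGES:
        raise ValueError(
            "Invalid stage {!r}; expected one of {}".format(
                stage, sorted(ALLOWED_STAGES)
            )
        )
    return stage

validate_stage('PRODUCTION')  # passes silently
```

A call to validate_stage at the top of update_stage would reject typos like 'PRODUTION' before they reach the database, where they would silently hide a model from get_production_model.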

Utilizing the Model Registry in Production

Now that we’ve designed and implemented a model registry, let’s demonstrate how the registry enables machine learning in production. In particular, the registry provides a mechanism for passing information between the model training and inference processes. These processes are independent in the sense that they’re run at different times and in different environments. But training and inference are also coupled – inference relies on using a specific model output from a specific training process. The registry facilitates independence by providing inference the information it needs at runtime. I’ll illustrate this by walking through an ML workflow.

Imagine training multiple models over time until developing a model that satisfies project requirements. Once that model is developed, we can use it for inference.

Note: After each code sample I run the following code to output the state of the registry database:

pd.read_sql_query("SELECT * FROM model_registry;", conn)

Model Training

The model selection process involves running multiple iterative experiments. Although individual experiments may yield models that fall short of our requirements, we’d like to store all of them. One reason is to preserve the ability to ensemble multiple models down the road.

We use the ModelRegistry class to publish a trained model to the registry. The following code instantiates a ModelRegistry object and publishes the trained model to the registry.

model = None # This would be replaced by the trained model.
name = 'lead_scoring'
metrics = {'accuracy': 0.8}

conn = sqlite3.connect('registry.db')
model_registry = ModelRegistry(conn=conn)
model_registry.publish_model(model=model, name=name, metrics=metrics)
id name version registered_date metrics remote_path stage
1 lead_scoring 1 2020-02-25 12:42:25 {"accuracy": 0.8} s3://models/lead_scoring::v1 DEVELOPMENT

Model Iteration and Retraining

The first version of our model rarely meets the minimum predictive performance thresholds required. Data scientists typically iterate on models and features over a period of several weeks before arriving at a model that’s ready for production. And once a model is in production, we need to retrain it periodically to combat concept drift.

Once a new version of a model has been trained, we can add a new entry to our registry with the increment_version method.

model_registry.increment_version(model=model, name=name, metrics={'accuracy': 0.85})
id name version registered_date metrics remote_path stage
1 lead_scoring 1 2020-02-25 12:42:25 {"accuracy": 0.8} s3://models/lead_scoring::v1 DEVELOPMENT
2 lead_scoring 2 2020-02-25 12:42:49 {"accuracy": 0.85} s3://models/lead_scoring::v2 DEVELOPMENT

Promoting a Model to Production

Once we’ve developed a version of a model that satisfies the minimum predictive performance requirements of our project, we can mark that model as ready for production within our model registry. Doing this in our implementation involves updating the stage column of the row corresponding to the model we wish to promote to production. This is done by calling the update_stage method.

model_registry.update_stage(name=name, version='2', stage="PRODUCTION")
id name version registered_date metrics remote_path stage
1 lead_scoring 1 2020-02-25 12:42:25 {"accuracy": 0.8} s3://models/lead_scoring::v1 DEVELOPMENT
2 lead_scoring 2 2020-02-25 12:42:49 {"accuracy": 0.85} s3://models/lead_scoring::v2 PRODUCTION

Retrieving a Model during Inference

Now that we’ve trained, iterated on, and promoted a model to production, we can use that model for inference! To retrieve the metadata corresponding to the production-ready model, we call the get_production_model method. We can then use the remote_path value to load the appropriate model into memory and perform inference.

model_registry.get_production_model(name=name)
id name version registered_date metrics remote_path stage
2 lead_scoring 2 2020-02-25 12:42:49 {"accuracy": 0.85} s3://models/lead_scoring::v2 PRODUCTION

MLflow: An Open Source Model Registry

The purpose of this post is to describe why a model registry is a necessary part of production machine learning systems and demonstrate how to implement a registry. But I should also mention the model registry component of the MLflow project.

MLflow is an open source machine learning platform that addresses several key ML challenges, including experimentation, reproducibility, and model deployment. The project originally provided three components (Tracking, Projects, and Models) that users can adopt together or separately. It has since added a Model Registry component for storing model metadata.

I’ve written some sample code to get you up and running with the MLflow registry. The code is available for download at the end of this post!

Conclusion

A machine learning model registry is a centralized tracking system that stores lineage, versioning, and related metadata for published machine learning models. The model registry provides a mechanism for passing information between separate model training and inference processes as well as enabling model governance and provenance.

If you’ve been following my machine learning deployment series, you know we’ve come a long way! In the next post, I will describe how to join the various components we’ve discussed to automate ML deployments.

If you’d like to be notified when the next post is published and receive my MLflow tutorial, sign up below!
