Model Tracking Concepts

Model registry

What is a model registry?

A model registry is a repository used to store and version trained machine learning (ML) models. Model registries greatly simplify the task of tracking models as they move through the ML lifecycle, from training to production deployments and ultimately retirement.

In addition to the models themselves, a model registry stores information (metadata) about the data and training jobs used to create the model. Tracking these requisite inputs is essential to establish lineage for ML models. In this way, a model registry serves a function analogous to version control systems (e.g. Git, SVN) and artifact repositories (e.g. Artifactory, PyPI) for traditional software.

Another way to think about model lineage is to consider all of the details that would be necessary to recreate a trained model from scratch. Establishing lineage through a model registry is a vital component of a robust MLOps architecture.

How does a model registry work?

Each model stored in a model registry is assigned a unique identifier, also known as a model ID or UUID. Many off-the-shelf registry tools also include a mechanism for tracking multiple versions of the same model. Data science and ML teams can use the model ID and version to refer unambiguously to specific models, both for comparison and for confidence in deployment.

Registry tools also allow parameters and metrics to be stored. For instance, training and evaluation jobs can write hyperparameter values and performance metrics (e.g. accuracy) when registering a model. Storing these values makes it simple to compare models: as teams develop new models, having this data on hand shows whether new versions improve on previous ones. Many registry tools also include a graphical interface to visualize these parameters and metrics.

Under the hood, model registries are generally composed of the following elements:

  • Object storage (such as Amazon S3 or Azure Blob Storage) to hold model artifacts and large binary files
  • A structured or semi-structured database to store model metadata
  • A graphical user interface (GUI) that can be used to inspect and compare trained models
  • A programmatic API that can be used to retrieve model artifacts and metadata by specifying a model ID or query
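The pieces above can be sketched as a toy in-memory registry in pure Python. This is only an illustration of the concepts (unique IDs, versioning, metadata, programmatic retrieval); the class and method names such as `ModelRegistry.register` are invented for this sketch, not any particular tool's API:

```python
import uuid

class ModelRegistry:
    """Toy in-memory registry: maps a model name to a list of versions."""

    def __init__(self):
        self._models = {}

    def register(self, name, artifact_uri, metadata=None):
        """Store a new version and return its (model_id, version)."""
        model_id = str(uuid.uuid4())           # unique identifier for this entry
        versions = self._models.setdefault(name, [])
        entry = {
            "model_id": model_id,
            "version": len(versions) + 1,      # monotonically increasing version
            "artifact_uri": artifact_uri,      # e.g. an S3 or Blob Storage key
            "metadata": metadata or {},        # parameters, metrics, lineage refs
        }
        versions.append(entry)
        return model_id, entry["version"]

    def get(self, name, version=None):
        """Retrieve a specific version (default: the latest one)."""
        versions = self._models[name]
        return versions[-1] if version is None else versions[version - 1]

registry = ModelRegistry()
registry.register("churn-model", "s3://models/churn/v1.pkl", {"accuracy": 0.91})
model_id, version = registry.register("churn-model", "s3://models/churn/v2.pkl",
                                      {"accuracy": 0.93})
latest = registry.get("churn-model")
```

A production registry adds durable storage, access control, and a GUI on top of this core shape, but the ID-plus-version lookup pattern is the same.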

What information should a model registry store?

A robust model registry should be able to store all details necessary to establish model lineage.

Model registry tools can also store input parameters to training jobs and performance metrics to enable comparisons between different models or versions of models. These elements can usually be captured completely by storing the following forms of information:

Software – The model registry should contain references to all software used to train the model. If custom code is used to transform data or train the model, the code should live in a separate version control system (e.g. Git) and the model registry should include the latest version ID from that system. Most projects also use external libraries or other software dependencies that must be tracked; this is a common oversight that can inhibit model reproducibility. Docker containers and Conda environments are common tools used to document and recreate software environments. When those tools are used, it makes sense to include a copy of the project Dockerfile or Conda environment file in the version control system or model registry.
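For instance, a minimal Conda environment file pins the software dependencies needed to reproduce a training run; the project name and package versions below are illustrative:

```yaml
# environment.yml – checked into version control alongside the training code
name: churn-model-training
channels:
  - conda-forge
dependencies:
  - python=3.10
  - scikit-learn=1.3.2
  - pandas=2.1.4
```

Recreating the environment from this file (`conda env create -f environment.yml`) reproduces the dependency set that the registered model was trained with.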

Data – Since ML models learn their behavior from data, reproducing a model requires access to the original training data. A model registry should contain a reference to a static copy, view, or snapshot of the original training data. Copies of data can be placed in object storage such as Azure Blob Storage and referenced in the model registry. If datasets are too large to be copied, organizations should consider data versioning tools such as DVC, or metadata and lineage tools such as Apache Atlas, to create snapshots and maintain lineage.
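One lightweight way to reference a data snapshot is to record its storage location together with a content hash, so the registry can later verify that the referenced data has not changed. This is a sketch under assumed conventions; the URI, field names, and sample bytes are made up:

```python
import hashlib
import json

def data_reference(snapshot_uri: str, data: bytes) -> dict:
    """Build a registry metadata entry referencing a dataset snapshot."""
    return {
        "snapshot_uri": snapshot_uri,                  # e.g. an object-storage key
        "sha256": hashlib.sha256(data).hexdigest(),    # fingerprint of the exact bytes
        "num_bytes": len(data),
    }

# In practice the bytes would be read from the stored snapshot file.
sample = b"age,churn\n42,0\n37,1\n"
ref = data_reference("s3://datasets/churn/2024-01-15.csv", sample)
print(json.dumps(ref, indent=2))
```

Storing the hash alongside the URI means a later training run can detect whether the "static" snapshot was silently overwritten, which would break lineage.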

Metrics – Most model registry tools have a system for storing named parameters as key/value pairs. Storing the values of input parameters and model performance metrics can help quickly compare models when new versions are created. Training jobs should write all configurable input parameters to the model registry. After training, evaluation and performance metrics should be written to the registry to quickly see if new models are performing better or worse than previous versions.
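As a sketch of why this matters, comparing logged metrics across versions becomes a one-liner once they are stored as key/value pairs; the parameter and metric values here are hypothetical:

```python
# Hypothetical logged records for three versions of the same model.
runs = [
    {"version": 1, "params": {"max_depth": 3}, "metrics": {"accuracy": 0.88}},
    {"version": 2, "params": {"max_depth": 5}, "metrics": {"accuracy": 0.91}},
    {"version": 3, "params": {"max_depth": 8}, "metrics": {"accuracy": 0.90}},
]

# Pick the version with the best accuracy across all recorded runs.
best = max(runs, key=lambda r: r["metrics"]["accuracy"])

# Did the newest version improve on its immediate predecessor?
improved = runs[-1]["metrics"]["accuracy"] > runs[-2]["metrics"]["accuracy"]
```

Here the comparison reveals that version 3 actually regressed relative to version 2, exactly the kind of signal that recorded metrics make visible before a deployment decision.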

Models – While the previous elements establish lineage, storing the model artifacts themselves allows organizations to deploy models quickly. ML frameworks generally have some mechanism for preserving a model artifact (for example, exporting a scikit-learn model to Python pickle format or a TensorFlow model to its custom SavedModel format). These artifact files should be stored in the model registry so that models are ready to be deployed to production whenever the business needs them.

Model packaging

Serialization

Serialization is a vital step in packaging an ML model, as it enables model portability, interoperability, and inference. Serialization converts an object or data structure (for example, variables, arrays, and tuples) into a storable artifact, such as a file or a memory buffer, that can be transported or transmitted across computer networks. The main purpose of serialization is to allow the serialized file to be reconstructed into its original data structure (for example, turning a serialized file back into an ML model variable) in a different environment. This way, a newly trained ML model can be serialized to a file, exported to a new environment, and deserialized back into an ML model variable or data structure for inference. A serialized file does not save or include any of the object's associated methods or implementation; it saves only the data structure itself as a storable artifact such as a file.

Serialization formats

Here are some popular serialization formats:

| Sr. No. | Format | File Extension | Framework | Quantization |
|---------|--------|----------------|-----------|--------------|
| 1 | Pickle | .pkl | scikit-learn | No |
| 2 | HDF5 | .h5 | Keras | Yes |
| 3 | ONNX | .onnx | TensorFlow, PyTorch, scikit-learn, Caffe, Keras, MXNet, iOS Core ML | Yes |
| 4 | PMML | .pmml | scikit-learn | No |
| 5 | TorchScript | .pt | PyTorch | Yes |
| 6 | Apple ML Model | .mlmodel | iOS Core ML | Yes |
| 7 | MLeap | .zip | PySpark | No |
| 8 | Protobuf | .pb | TensorFlow | Yes |

Addressing the interoperability issue

All of these serialization formats except ONNX share one problem: lack of interoperability. To address this, ONNX was developed as an open-source project supported by Microsoft, Baidu, Amazon, and other large companies. It enables a model trained in one framework (for example, scikit-learn) to be converted and then used, or even further trained, in another, such as TensorFlow. This has been a game changer for industrialized AI, as models can be made interoperable and framework-independent. ONNX has unlocked new avenues, such as federated learning and transfer learning. Serialized models also enable portability and batch inferencing (batch inference, or offline inference, is the method of generating predictions on a batch of data points or samples) in different environments.