Model Serving
Basic concepts
After training your model, the next step is to make it accessible to users so that they can send data and receive predictions (model inference). To do this, you need to deploy and serve the model. This section will take you through some basic concepts and practical examples of how to serve your models.
What is model serving?
Model serving means hosting machine-learning models (in the cloud or on premises) and making their functions available via an API so that applications can incorporate AI into their systems. Model serving is crucial: a business cannot offer AI products to a large user base without making its product accessible. Deploying a machine-learning model in production also involves resource management and model monitoring, including operational statistics and model drift.
What is an endpoint?
An endpoint is a stable and durable URL that can be used to invoke the model: you provide the required inputs and get the outputs back.
An endpoint provides:
- A stable and durable URL (like endpoint-name.region.inference.ml.azure.com).
- An authentication and authorization mechanism.
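For illustration, the sketch below calls such an endpoint over plain HTTP with Python's requests library. The URL, key, and input payload are placeholders, not values from a real workspace; the exact request schema depends on the deployed model.

```python
import requests

# Hypothetical scoring URL and key; use the values shown for your endpoint.
SCORING_URL = "https://endpoint-name.region.inference.ml.azure.com/score"
API_KEY = "<endpoint-key-or-token>"

# The payload schema is model-specific; this is just a placeholder.
payload = {"data": [[0.1, 0.2, 0.3]]}

response = requests.post(
    SCORING_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the model's predictions
```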
Online vs batch endpoints
Online endpoints
Online endpoints are used for online (real-time) inferencing. They deploy models behind a web server that can return predictions under the HTTP protocol.
You may use them when:
- you have low-latency requirements
- your model can answer the request in a relatively short amount of time
- your model's inputs fit in the HTTP payload of the request
- you need to scale to handle a large number of requests
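As a minimal sketch of real-time scoring with the Azure Machine Learning Python SDK v2 (azure-ai-ml), assuming an online endpoint and deployment already exist; the workspace, endpoint, and deployment names are hypothetical:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Hypothetical workspace coordinates.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Synchronous scoring: the request body is read from a local JSON file and
# the predictions come back in the HTTP response.
result = ml_client.online_endpoints.invoke(
    endpoint_name="clo-online-endpoint",   # hypothetical endpoint
    deployment_name="blue",                # hypothetical deployment
    request_file="sample-request.json",
)
print(result)
```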
Batch endpoints
Batch endpoints are used to do asynchronous batch inferencing on large volumes of data. They receive pointers to data and run jobs asynchronously to process the data in parallel on compute clusters. Finally, they store outputs on a data store for further analysis.
You may use them when:
- you have expensive models or pipelines that require a longer time to run.
- you want to operationalize machine learning pipelines and reuse components.
- you need to perform inference over large amounts of data, distributed in multiple files.
- you don't have low-latency requirements.
- your model's inputs are stored in a Storage Account or in an Azure Machine Learning data asset.
- you can take advantage of parallelization.
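A hedged sketch of invoking a batch endpoint with the same SDK, assuming the endpoint already exists and the inputs live in a datastore folder (names and paths are placeholders, and parameter names may differ slightly between SDK versions):

```python
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# A pointer to the data, not the data itself: the batch job reads the files
# from the datastore and writes its outputs back to a datastore.
input_data = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/workspaceblobstore/paths/clo-images/",
)

# Returns immediately with a job reference; scoring runs asynchronously on
# the endpoint's compute cluster.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="clo-batch-endpoint",  # hypothetical endpoint
    input=input_data,
)
ml_client.jobs.stream(job.name)  # optionally wait and stream logs
```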
What is deployment?
A deployment is a set of resources required for hosting the model or component that does the actual inferencing. A single endpoint can contain multiple deployments, which can host independent assets and consume different resources depending on what those assets require. Endpoints have a routing mechanism that routes requests generated by clients to specific deployments under the endpoint.
To function properly, each endpoint needs to have at least one deployment. Endpoints and deployments are independent Azure Resource Manager resources that appear in the Azure portal.
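To make the endpoint/deployment split concrete, here is a hedged sketch with the Azure Machine Learning Python SDK v2: one endpoint hosting two deployments, with the endpoint's traffic rules routing requests between them. The names and model references are hypothetical, and the sketch assumes MLflow-format models so no scoring script or environment is needed.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment, ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# One endpoint: the stable URL and the auth mechanism live here.
endpoint = ManagedOnlineEndpoint(name="clo-online-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Two deployments under the same endpoint, each hosting its own model version
# and consuming its own compute resources (hypothetical registered models).
for name, model in [("blue", "azureml:clo-model:1"), ("green", "azureml:clo-model:2")]:
    deployment = ManagedOnlineDeployment(
        name=name,
        endpoint_name=endpoint.name,
        model=model,
        instance_type="Standard_DS3_v2",
        instance_count=1,
    )
    ml_client.online_deployments.begin_create_or_update(deployment).result()

# The endpoint's routing mechanism: send 90% of requests to blue, 10% to green.
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```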
Model Serving using KServe
KServe is a highly scalable, standards-based model inference platform on Kubernetes for trusted AI.
Why Use KServe?
- KServe serves use cases ranging from the simplest to the most complex, simply and efficiently.
- KServe enables resource and cost optimization by letting you specify resource limits for deployments.
- KServe can handle both online and batch inference serving use cases.
- KServe is a standard model inference platform on Kubernetes, built for highly scalable use cases.
- It provides a performant, standardized inference protocol across ML frameworks.
- It supports modern serverless inference workloads with autoscaling, including scale-to-zero on GPU.
- It provides high scalability, density packing, and intelligent routing using ModelMesh.
- It offers simple and pluggable production ML serving, including prediction, pre/post-processing, monitoring, and explainability.
- It enables advanced deployments with canary rollouts, experiments, ensembles, and transformers.
KServe model serving methods
In KServe, models can be served using either of the following methods:
- the Python SDK, or
- a Kubernetes manifest YAML file, specifically configured for KServe
You can deploy the YAML file with tools such as Argo CD, kubectl, or the Kubeflow Central Dashboard.
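For the Python SDK route, a minimal sketch could look like the following. It assumes the kserve and kubernetes packages are installed and a KServe-enabled cluster is reachable; the InferenceService name, namespace, and storage URI are placeholders.

```python
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1TritonSpec,
)
from kubernetes import client

# Hypothetical InferenceService: a Triton predictor pointing at a model
# repository stored on a PVC (object storage URIs work the same way).
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="example-isvc", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            triton=V1beta1TritonSpec(
                storage_uri="pvc://example-pvc/onnx_model_repository",
            )
        )
    ),
)

# Submit the InferenceService to the cluster; KServe then creates the
# serving deployment, service, and route for it.
KServeClient().create(isvc)
```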
Computer vision examples
ONNX Model Deployment Guide
Introduction
This guide explains how to deploy a pipeline of ONNX models to Triton Inference Server for the Constant Level Oilers (CLO) application. The pipeline processes high-resolution images from robots to detect CLO objects and estimate their oil percentage. Multiple models are orchestrated as an ensemble using Triton Inference Server, enabling efficient and scalable inference for computer vision tasks.
Objectives
The pipeline processes high-resolution images to detect objects (spherical oilers) and visually estimate their oil percentage. It consists of the following models:
- Preprocessing: Resizes and preprocesses the input image.
- YOLOv8: Performs object detection on the preprocessed image.
- Postprocessing: Processes YOLO detections and prepares the image for further analysis.
- EfficientNet V2: Estimates the oil percentage from the processed image.
- Ensemble: Combines all four models into a single pipeline for seamless execution (the models are called sequentially 1 → 2 → 3 → 4 to estimate the oil percentage for a given input image).
Model details
Preprocessing
- Model Name: preprocessing
- Platform: onnxruntime_onnx
- Input: original_image_3840_2160 (dimensions: [1, 3, 2160, 3840], type: TYPE_FP32)
- Output: preprocessed_640_640_image (dimensions: [1, 3, 640, 640], type: TYPE_FP32)
YOLOv8
- Model Name: yolov8
- Platform: onnxruntime_onnx
- Input: preprocessed_640_640_image
- Output: yolo_detections (dimensions: [1, 300, 6], type: TYPE_FP32)
Postprocessing
- Model Name: postprocessing
- Platform: onnxruntime_onnx
- Input: original_image_3840_2160, yolo_detections
- Output: processed_image_384_384 (dimensions: [1, 3, 384, 384], type: TYPE_FP32)
EfficientNet V2
- Model Name: efficientnet_v2
- Platform: onnxruntime_onnx
- Input: processed_image_384_384
- Output: estimated_oil_pct (dimensions: [1, 1], type: TYPE_FP32)
Ensemble
- Model Name: ensemble
- Platform: ensemble
- Input: original_image_3840_2160
- Output: yolo_detections, processed_image_384_384, estimated_oil_pct
Deployment steps
- Prepare the Models: Place all ONNX models (model.onnx) in their respective 1/ folders, as explained in Folder Structure.
- Configure the Models: Verify that each model's config.pbtxt file defines the correct inputs, outputs, and platform.
- Deploy the Models: Use Triton Inference Server to deploy the models. Place the model folders in the Triton model repository.
- Run the Pipeline: Send an input image to the ensemble model. The ensemble executes the pipeline in sequence (see the client sketch after this list).
- Retrieve the Outputs: The ensemble model returns the outputs of the models in the ensemble pipeline, including yolo_detections, processed_image_384_384, and estimated_oil_pct.
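The sketch below shows one way to perform the "Run the Pipeline" and "Retrieve the Outputs" steps with Triton's Python HTTP client (tritonclient); the server address is hypothetical, and when the models are served through KServe the client should target the InferenceService's URL instead of localhost.

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical server address; with KServe, use the InferenceService host.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy input with the expected shape [1, 3, 2160, 3840]; in practice this is
# the robot image converted to CHW float32.
image = np.random.rand(1, 3, 2160, 3840).astype(np.float32)

inputs = [httpclient.InferInput("original_image_3840_2160", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)

outputs = [
    httpclient.InferRequestedOutput("yolo_detections"),
    httpclient.InferRequestedOutput("processed_image_384_384"),
    httpclient.InferRequestedOutput("estimated_oil_pct"),
]

# The ensemble runs preprocessing → yolov8 → postprocessing → efficientnet_v2.
result = client.infer(model_name="ensemble", inputs=inputs, outputs=outputs)
print(result.as_numpy("estimated_oil_pct"))  # shape [1, 1], TYPE_FP32
```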
Data flow
- Preprocessing: original_image_3840_2160 → preprocessed_640_640_image
- YOLOv8: preprocessed_640_640_image → yolo_detections
- Postprocessing: original_image_3840_2160 + yolo_detections → processed_image_384_384
- EfficientNet V2: processed_image_384_384 → estimated_oil_pct
Folder Structure
```
onnx_model_repository/
├── preprocessing/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
├── yolov8/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
├── postprocessing/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
├── efficientnet_v2/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── ensemble/
    ├── config.pbtxt
    └── 1/
```
Ensemble Folder Explanation
The ensemble folder does not contain an ONNX file because it is not a standalone model. Instead, it is a configuration that defines how multiple models work together in a pipeline using Triton Inference Server's ensemble scheduling feature. The folder must include a config.pbtxt file and an empty 1/ subfolder to comply with Triton's directory structure.
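As an illustration, the ensemble's config.pbtxt could look roughly like the abbreviated, hypothetical sketch below; only the first scheduling step is shown, and the real file must list all four steps with names and shapes matching the model details above.

```protobuf
name: "ensemble"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "original_image_3840_2160"
    data_type: TYPE_FP32
    dims: [ 1, 3, 2160, 3840 ]
  }
]
output [
  {
    name: "estimated_oil_pct"
    data_type: TYPE_FP32
    dims: [ 1, 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "original_image_3840_2160" value: "original_image_3840_2160" }
      output_map { key: "preprocessed_640_640_image" value: "preprocessed_640_640_image" }
    }
    # ... steps for yolov8, postprocessing, and efficientnet_v2 follow the same pattern.
  ]
}
```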
YAML Configuration
```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "hjam"
spec:
  predictor:
    triton:
      storageUri: "pvc://ml-autonomy-aurora-azure-file-data-pvc/hjam-data/AssetAutonomy/onnx_model_repository"
      args:
        - "--strict-model-config=true"
        - "--model-control-mode=explicit"
        - "--load-model=yolov8"
        - "--load-model=efficientnet_v2"
        - "--load-model=preprocessing"
        - "--load-model=postprocessing"
        - "--load-model=ensemble"
```
Explanation of YAML Elements
| Element | Description |
|---|---|
| apiVersion | Specifies the API version of the resource. |
| kind | Defines the type of Kubernetes resource. |
| metadata | Contains metadata about the resource, such as its name. |
| spec | Describes the desired state of the resource. |
| predictor | Specifies the type of predictor to use. |
| storageUri | Points to the location of the model repository. |
| args | Provides additional arguments to configure the Triton Inference Server. |
Additional Resources
For further guidance on working with ONNX models in this workflow, you may find the following notebooks helpful:
- Convert_Models_into_ONNX: This notebook explains how to convert PyTorch models into ONNX format, covering both scripting-based and tracing-based conversion methods.
- Triton_Inferencing: Provides comprehensive instructions and examples for interacting with ONNX models deployed via KServe, covering both individual ONNX model calls and the ensemble use case, as well as sample inference requests and response handling.