Model Serving
Basic concepts
After training your model, the next step is to make it accessible to users so that they can send data and receive predictions (model inference). To do this, you need to deploy and serve the model. This section will take you through some basic concepts and practical examples of how to serve your models.
What is model serving?
Model serving means hosting machine-learning models (in the cloud or on premises) and making their functions available via an API so that applications can incorporate AI into their systems. Model serving is crucial: a business cannot offer AI products to a large user base without making its product accessible. Deploying a machine-learning model in production also involves resource management and model monitoring, including operational statistics and model drift.
What is an endpoint?
An endpoint is a stable and durable URL that can be used to invoke the model: you provide the required inputs and get the outputs back.
An endpoint provides:
- A stable and durable URL (like endpoint-name.region.inference.ml.azure.com).
- An authentication and authorization mechanism.
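For illustration, the sketch below calls such an endpoint over plain HTTP with Python's requests library. The URL, key, and input payload are placeholders, not values from a real workspace; the exact request schema depends on the deployed model.

```python
import requests

# Hypothetical scoring URL and key; use the values shown for your endpoint.
SCORING_URL = "https://endpoint-name.region.inference.ml.azure.com/score"
API_KEY = "<endpoint-key-or-token>"

# The payload schema is model-specific; this is just a placeholder.
payload = {"data": [[0.1, 0.2, 0.3]]}

response = requests.post(
    SCORING_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the model's predictions
```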
Online vs batch endpoints
Online endpoints
Online endpoints are used for online (real-time) inferencing. They deploy models behind a web server that can return predictions under the HTTP protocol.
You may use them when:
- you have low-latency requirements
- your model can answer the request in a relatively short amount of time
- your model's inputs fit in the HTTP payload of the request
- you need to scale to handle a large number of requests
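As a minimal sketch of real-time scoring with the Azure Machine Learning Python SDK v2 (azure-ai-ml), assuming an online endpoint and deployment already exist; the workspace, endpoint, and deployment names are hypothetical:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Hypothetical workspace coordinates.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Synchronous scoring: the request body is read from a local JSON file and
# the predictions come back in the HTTP response.
result = ml_client.online_endpoints.invoke(
    endpoint_name="clo-online-endpoint",   # hypothetical endpoint
    deployment_name="blue",                # hypothetical deployment
    request_file="sample-request.json",
)
print(result)
```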
Batch endpoints
Batch endpoints are used to do asynchronous batch inferencing on large volumes of data. They receive pointers to data and run jobs asynchronously to process the data in parallel on compute clusters. Finally, they store outputs on a data store for further analysis.
You may use them when:
- you have expensive models or pipelines that require a longer time to run.
- you want to operationalize machine learning pipelines and reuse components.
- you need to perform inference over large amounts of data, distributed in multiple files.
- you don't have low-latency requirements.
- your model's inputs are stored in a Storage Account or in an Azure Machine Learning data asset.
- you can take advantage of parallelization.
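A hedged sketch of invoking a batch endpoint with the same SDK, assuming the endpoint already exists and the inputs live in a datastore folder (names and paths are placeholders, and parameter names may differ slightly between SDK versions):

```python
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# A pointer to the data, not the data itself: the batch job reads the files
# from the datastore and writes its outputs back to a datastore.
input_data = Input(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/workspaceblobstore/paths/clo-images/",
)

# Returns immediately with a job reference; scoring runs asynchronously on
# the endpoint's compute cluster.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="clo-batch-endpoint",  # hypothetical endpoint
    input=input_data,
)
ml_client.jobs.stream(job.name)  # optionally wait and stream logs
```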
What is deployment?
A deployment is a set of resources required for hosting the model or component that does the actual inferencing. A single endpoint can contain multiple deployments, which can host independent assets and consume different resources depending on what those assets require. Endpoints have a routing mechanism that routes requests generated by clients to specific deployments under the endpoint.
To function properly, each endpoint needs to have at least one deployment. Endpoints and deployments are independent Azure Resource Manager resources that appear in the Azure portal.
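To make the endpoint/deployment split concrete, here is a hedged sketch with the Azure Machine Learning Python SDK v2: one endpoint hosting two deployments, with the endpoint's traffic rules routing requests between them. The names and model references are hypothetical, and the sketch assumes MLflow-format models so no scoring script or environment is needed.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment, ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# One endpoint: the stable URL and the auth mechanism live here.
endpoint = ManagedOnlineEndpoint(name="clo-online-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Two deployments under the same endpoint, each hosting its own model version
# and consuming its own compute resources (hypothetical registered models).
for name, model in [("blue", "azureml:clo-model:1"), ("green", "azureml:clo-model:2")]:
    deployment = ManagedOnlineDeployment(
        name=name,
        endpoint_name=endpoint.name,
        model=model,
        instance_type="Standard_DS3_v2",
        instance_count=1,
    )
    ml_client.online_deployments.begin_create_or_update(deployment).result()

# The endpoint's routing mechanism: send 90% of requests to blue, 10% to green.
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```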
Model Serving using KServe
KServe is a highly scalable, standards-based model inference platform on Kubernetes for trusted AI.
Why Use KServe?
- KServe serves use cases ranging from the simplest to the most complex, simply and efficiently.
- KServe enables resource and cost optimization by letting you specify resource limits for deployments.
- KServe can handle both online and batch inference serving use cases.
- KServe is a standard model inference platform on Kubernetes, built for highly scalable use cases.
- It provides a performant, standardized inference protocol across ML frameworks.
- It supports modern serverless inference workloads with autoscaling, including scale-to-zero on GPU.
- It provides high scalability, density packing, and intelligent routing using ModelMesh.
- It offers simple and pluggable production ML serving, including prediction, pre/post-processing, monitoring, and explainability.
- It enables advanced deployments with canary rollouts, experiments, ensembles, and transformers.
KServe model serving methods
In KServe, models can be served using either of the following methods:
- the Python SDK, or
- a Kubernetes manifest YAML file, specifically configured for KServe
You can deploy the YAML file with tools such as Argo CD, kubectl, or the Kubeflow Central Dashboard.
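For the Python SDK route, a minimal sketch could look like the following. It assumes the kserve and kubernetes packages are installed and a KServe-enabled cluster is reachable; the InferenceService name, namespace, and storage URI are placeholders.

```python
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1TritonSpec,
)
from kubernetes import client

# Hypothetical InferenceService: a Triton predictor pointing at a model
# repository stored on a PVC (object storage URIs work the same way).
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="example-isvc", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            triton=V1beta1TritonSpec(
                storage_uri="pvc://example-pvc/onnx_model_repository",
            )
        )
    ),
)

# Submit the InferenceService to the cluster; KServe then creates the
# serving deployment, service, and route for it.
KServeClient().create(isvc)
```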
Computer vision examples
ONNX Model Deployment Guide
Introduction
This guide explains how to deploy a pipeline of ONNX models to Triton Inference Server for the Constant Level Oilers (CLO) application. The pipeline processes high-resolution images from robots to detect CLO objects and estimate their oil percentage. Multiple models are orchestrated as an ensemble using Triton Inference Server, enabling efficient and scalable inference for computer vision tasks.
Objectives
The pipeline processes high-resolution images to detect objects (spherical oilers) and visually estimate their oil percentage. It consists of the following models:
- Preprocessing: Resizes and preprocesses the input image.
- YOLOv8: Performs object detection on the preprocessed image.
- Postprocessing: Processes YOLO detections and prepares the image for further analysis.
- EfficientNet V2: Estimates the oil percentage from the processed image.
- Ensemble: Combines all four models into a single pipeline for seamless execution (the models are called sequentially 1 → 2 → 3 → 4 to estimate the oil percentage for a given input image).
Model details
Preprocessing
- Model Name: preprocessing
- Platform: onnxruntime_onnx
- Input: original_image_3840_2160 (dimensions: [1, 3, 2160, 3840], type: TYPE_FP32)
- Output: preprocessed_640_640_image (dimensions: [1, 3, 640, 640], type: TYPE_FP32)
YOLOv8
- Model Name: yolov8
- Platform: onnxruntime_onnx
- Input: preprocessed_640_640_image
- Output: yolo_detections (dimensions: [1, 300, 6], type: TYPE_FP32)
Postprocessing
- Model Name: postprocessing
- Platform: onnxruntime_onnx
- Input: original_image_3840_2160, yolo_detections
- Output: processed_image_384_384 (dimensions: [1, 3, 384, 384], type: TYPE_FP32)
EfficientNet V2
- Model Name: efficientnet_v2
- Platform: onnxruntime_onnx
- Input: processed_image_384_384
- Output: estimated_oil_pct (dimensions: [1, 1], type: TYPE_FP32)
Ensemble
- Model Name: ensemble
- Platform: ensemble
- Input: original_image_3840_2160
- Output: yolo_detections, processed_image_384_384, estimated_oil_pct
Deployment steps
- Prepare the Models: Place all ONNX models (model.onnx) in their respective 1/ folders, as explained in Folder Structure.
- Configure the Models: Verify that each model's config.pbtxt file defines the correct inputs, outputs, and platform.
- Deploy the Models: Use Triton Inference Server to deploy the models. Place the model folders in the Triton model repository.
- Run the Pipeline: Send an input image to the ensemble model. The ensemble executes the pipeline in sequence (see the client sketch after this list).
- Retrieve the Outputs: The ensemble model returns the outputs of the models in the ensemble pipeline, including yolo_detections, processed_image_384_384, and estimated_oil_pct.
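The sketch below shows one way to perform the "Run the Pipeline" and "Retrieve the Outputs" steps with Triton's Python HTTP client (tritonclient); the server address is hypothetical, and when the models are served through KServe the client should target the InferenceService's URL instead of localhost.

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical server address; with KServe, use the InferenceService host.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy input with the expected shape [1, 3, 2160, 3840]; in practice this is
# the robot image converted to CHW float32.
image = np.random.rand(1, 3, 2160, 3840).astype(np.float32)

inputs = [httpclient.InferInput("original_image_3840_2160", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)

outputs = [
    httpclient.InferRequestedOutput("yolo_detections"),
    httpclient.InferRequestedOutput("processed_image_384_384"),
    httpclient.InferRequestedOutput("estimated_oil_pct"),
]

# The ensemble runs preprocessing → yolov8 → postprocessing → efficientnet_v2.
result = client.infer(model_name="ensemble", inputs=inputs, outputs=outputs)
print(result.as_numpy("estimated_oil_pct"))  # shape [1, 1], TYPE_FP32
```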
Data flow
- Preprocessing: original_image_3840_2160 → preprocessed_640_640_image
- YOLOv8: preprocessed_640_640_image → yolo_detections
- Postprocessing: original_image_3840_2160 + yolo_detections → processed_image_384_384
- EfficientNet V2: processed_image_384_384 → estimated_oil_pct
Folder Structure
```
onnx_model_repository/
├── preprocessing/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
├── yolov8/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
├── postprocessing/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
├── efficientnet_v2/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── ensemble/
    ├── config.pbtxt
    └── 1/
```
Ensemble Folder Explanation
The ensemble folder does not contain an ONNX file because it is not a standalone model. Instead, it is a configuration that defines how multiple models work together in a pipeline using Triton Inference Server's ensemble scheduling feature. The folder must include a config.pbtxt file and an empty 1/ subfolder to comply with Triton's directory structure.
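As an illustration, the ensemble's config.pbtxt could look roughly like the abbreviated, hypothetical sketch below; only the first scheduling step is shown, and the real file must list all four steps with names and shapes matching the model details above.

```protobuf
name: "ensemble"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "original_image_3840_2160"
    data_type: TYPE_FP32
    dims: [ 1, 3, 2160, 3840 ]
  }
]
output [
  {
    name: "estimated_oil_pct"
    data_type: TYPE_FP32
    dims: [ 1, 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "original_image_3840_2160" value: "original_image_3840_2160" }
      output_map { key: "preprocessed_640_640_image" value: "preprocessed_640_640_image" }
    }
    # ... steps for yolov8, postprocessing, and efficientnet_v2 follow the same pattern.
  ]
}
```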
YAML Configuration
```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "hjam"
spec:
  predictor:
    triton:
      storageUri: "pvc://ml-autonomy-aurora-azure-file-data-pvc/hjam-data/AssetAutonomy/onnx_model_repository"
      args:
        - "--strict-model-config=true"
        - "--model-control-mode=explicit"
        - "--load-model=yolov8"
        - "--load-model=efficientnet_v2"
        - "--load-model=preprocessing"
        - "--load-model=postprocessing"
        - "--load-model=ensemble"
```
Explanation of YAML Elements
| Element | Description |
|---|---|
| apiVersion | Specifies the API version of the resource. |
| kind | Defines the type of Kubernetes resource. |
| metadata | Contains metadata about the resource, such as its name. |
| spec | Describes the desired state of the resource. |
| predictor | Specifies the type of predictor to use. |
| storageUri | Points to the location of the model repository. |
| args | Provides additional arguments to configure the Triton Inference Server. |
Additional Resources
For further guidance on working with ONNX models in this workflow, you may find the following notebooks helpful:
- Convert_Models_into_ONNX: This notebook explains how to convert PyTorch models into ONNX format, covering both scripting-based and tracing-based conversion methods.
- Triton_Inferencing: Provides comprehensive instructions and examples for interacting with ONNX models deployed via KServe, covering both individual ONNX model calls and the ensemble use case, as well as sample inference requests and response handling.