AKS Clusters
AI Platform allows you to use Azure Kubernetes Service (AKS) clusters as a compute and deployment target.
Benefits of using an AKS cluster
Using an AI Platform AKS cluster as a compute and deployment target provides the following:
- Use of Spot VM instances for training jobs and pipelines. Spot VM instances are cheaper than standard-tier instances.
- An isolated environment for your code and data, instead of an arbitrary serverless node provided by Microsoft.
- Secure data transfer. For example, when using a dedicated or shared AI Platform cluster as a compute target, the AI Platform team can, if needed, enable a private connection from pods to storage accounts or almost any PaaS service available on Azure. Data does not leave the Azure Virtual Network.
- AI Platform AKS clusters are always available. Whether shared or dedicated, an AKS cluster has a certain number of nodes that are always available for moderate resource requests. For example, 1 CPU and up to 8 GiB of memory can be provided as soon as a job or pipeline submission completes and the Azure ML job starts.
- Granular resource distribution (chosen by the client) for jobs, pipelines and deployments. See Scheduling for more details, and the sketch after this list.
- If you choose Spot instances or heavy resources such as a GPU pool, the AI Platform cluster performs an auto-scaling operation that takes time. Depending on the instance size, the operation can take between 5 and 15 minutes.
- AI Platform Kubeflow and other workloads. Any job or pipeline submitted to the AI Platform cluster can be instructed to use internal URLs and endpoints of models deployed with Kubeflow, or vice versa, without authentication and without traversing the public Internet.
- Deployments:
  - Custom FQDN: when using AI Platform as compute, you can choose a custom fully qualified domain name (FQDN) for the deployed models, for example: yourprefix.AI Platform.equinor.com.
  - Deployed models can be hosted alongside workloads that are not related to Azure ML, with almost any amount of requested resources, from 100 MB of RAM up to a whole CPU workload node or GPU node.
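As a concrete illustration of submitting work to an AI Platform compute target with a specific resource profile, here is a minimal sketch using the Azure ML Python SDK v2. The compute target name (aiplatform-aks), the instance type name (small-cpu-spot), the workspace details and the environment are hypothetical placeholders; the actual names are agreed with the AI Platform team, and the exact way of selecting an instance type may vary between SDK versions.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Connect to the Azure ML workspace (subscription, resource group and
# workspace names are placeholders for your own values).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Submit a training job to the attached AKS compute target.
# "aiplatform-aks" and "small-cpu-spot" are hypothetical names: the actual
# compute target and instance type names are provided by the AI Platform team.
job = command(
    code="./src",                      # local folder containing train.py
    command="python train.py",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # any curated or custom environment
    compute="aiplatform-aks",
    instance_type="small-cpu-spot",    # maps to a node pool / resource profile on the cluster
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)
```

Choosing a Spot or GPU instance type in this way is what may trigger the auto-scaling operation described above.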
Limitations & requirements
- The target AKS cluster and the Azure ML workspace must reside in the same subscription. Currently there is no way to connect an Azure ML workspace to an AKS cluster created in another subscription; this is a hard limitation at the Microsoft API level.
- The AI Platform team manages the target AKS cluster.
How to start using AI Platform AKS clusters
AI Platform offers three options for starting to use AKS clusters as a compute target for an Azure ML workspace:
Option 1:
AKS cluster & Azure ML workspace exist in the same subscription
If the AKS cluster managed by the AI Platform team already exists and the Azure ML workspace exists in the same subscription, create a ticket request for the following:
- Request the installation of the Azure ML cluster extension and attachment of the Azure ML workspace to the AKS cluster (see the sketch after this list).
- Provide the AI Platform team with your configuration information.
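The extension installation and workspace attachment are performed by the AI Platform team, but for reference, a minimal sketch of the attachment step with the Azure ML Python SDK v2 might look like the following. The compute name, namespace and resource IDs are placeholders, and the AKS resource ID must point to a cluster in the same subscription as the workspace (see Limitations & requirements).

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import KubernetesCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<workspace-resource-group>",
    workspace_name="<workspace-name>",
)

# The AKS cluster must live in the same subscription as the workspace.
aks_resource_id = (
    "/subscriptions/<subscription-id>"
    "/resourceGroups/<aks-resource-group>"
    "/providers/Microsoft.ContainerService/managedClusters/<aks-cluster-name>"
)

# "aiplatform-aks" and the namespace are hypothetical; the actual values are
# agreed with the AI Platform team when the ticket is handled.
compute = KubernetesCompute(
    name="aiplatform-aks",
    resource_id=aks_resource_id,
    namespace="<team-namespace>",
)

ml_client.compute.begin_create_or_update(compute).result()
```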
Option 2:
No AI Platform cluster is used
If you are not currently using an AI Platform AKS cluster, create a ticket request with the following:
- AKS cluster creation
- Azure ML Workspace creation (optional)
- Provide the AI Platform team with your configuration information.
Option 3:
Sandbox
Use a shared Azure ML workspace provided by AI Platform.
Contact the AI Platform team and provide your project WBS code.
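Once you have access to the shared sandbox workspace (or to your own workspace from options 1 and 2), you can verify from Python that the AI Platform compute target is visible before submitting work. This is a minimal sketch; the workspace details and the compute target name (aiplatform-aks) are placeholders provided by the AI Platform team.

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Workspace details are provided by the AI Platform team for the sandbox.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<shared-sandbox-workspace>",
)

# List compute targets and confirm the attached AKS cluster is present.
for compute in ml_client.compute.list():
    print(compute.name, compute.type)

# Or fetch the AI Platform compute target directly (name is hypothetical).
aks_compute = ml_client.compute.get("aiplatform-aks")
print(aks_compute.type)  # expected: "kubernetes"
```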
Equinor examples
Find Equinor-specific examples in the following repositories:
- Azure ML SDK materials and main repository
- Azure ML use cases repository
- Azure ML on AI Platform Compute: Nodepool Options
- AzureML on AI Platform Compute: Scheduling Options