Unverified Commit 93a27308 authored by devivasudevan's avatar devivasudevan Committed by GitHub
Browse files

docs: Add AKS storage guidance for Dynamo caches (#5581)


Signed-off-by: default avatarDevi Vasudevan <deviv@microsoft.com>
Signed-off-by: default avatardevivasudevan <49675305+devivasudevan@users.noreply.github.com>
Co-authored-by: default avatarSertaç Özercan <852750+sozercan@users.noreply.github.com>
parent 77e7b721
...@@ -58,6 +58,128 @@ kubectl get pods -n dynamo-system ...@@ -58,6 +58,128 @@ kubectl get pods -n dynamo-system
Follow the [Deploy Model/Workflow](../../../docs/kubernetes/installation_guide.md#next-steps) guide to deploy and test a model on your AKS cluster. Follow the [Deploy Model/Workflow](../../../docs/kubernetes/installation_guide.md#next-steps) guide to deploy and test a model on your AKS cluster.
## AKS Storage options for Model Caching and Runtime Data
For implementing tiered storage you can take advantage of the different storage options available in Azure such as:
| Storage Option | Performance | Best For |
|----------------|-------------|----------|
| Local CSI (Ephemeral Disk) | Very high | Fast model caching, warm restarts |
| [Azure Managed Lustre](https://learn.microsoft.com/en-us/azure/azure-managed-lustre/use-csi-driver-kubernetes) | Extremely high | Large multi-node models, shared cache |
| [Azure Disk (Managed Disk)](https://learn.microsoft.com/en-us/azure/aks/azure-csi-driver-volume-provisioning?tabs=dynamic-volume-blob%2Cnfs%2Ckubernetes-secret%2Cnfs-3%2Cgeneral%2Cgeneral2%2Cdynamic-volume-disk%2Cgeneral-disk%2Cdynamic-volume-files%2Cgeneral-files%2Cgeneral-files2%2Cdynamic-volume-files-mid%2Coptimize%2Csmb-share&pivots=csi-disk#create-azure-disk-pvs-using-built-in-storage-classes) | High | Persistent single-writer model cache |
| [Azure Files](https://learn.microsoft.com/en-us/azure/aks/azure-csi-driver-volume-provisioning?tabs=dynamic-volume-blob%2Cnfs%2Ckubernetes-secret%2Cnfs-3%2Cgeneral%2Cgeneral2%2Cdynamic-volume-disk%2Cgeneral-disk%2Cdynamic-volume-files%2Cgeneral-files%2Cgeneral-files2%2Cdynamic-volume-files-mid%2Coptimize%2Csmb-share&pivots=csi-files#use-a-persistent-volume-for-storage) | Medium | Shared small/medium models |
| [Azure Blob (via Fuse or init)](https://learn.microsoft.com/en-us/azure/aks/azure-csi-driver-volume-provisioning?tabs=dynamic-volume-blob%2Cnfs%2Ckubernetes-secret%2Cnfs-3%2Cgeneral%2Cgeneral2%2Cdynamic-volume-disk%2Cgeneral-disk%2Cdynamic-volume-files%2Cgeneral-files%2Cgeneral-files2%2Cdynamic-volume-files-mid%2Coptimize%2Csmb-share&pivots=csi-blob#create-a-pvc-using-built-in-storage-class) | Low–Medium | Cold model storage, bootstrap downloads |
Note: Azure Managed Lustre and Local CSI (ephemeral disk) are not installed by default in AKS and require additional setup before use. Azure Disk, Azure Files, and Azure Blob CSI drivers are available out of the box. See the [AKS CSI storage options documentation](https://learn.microsoft.com/azure/aks/csi-storage-drivers) for more details.
In the cache.yaml in the different [recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes), you can set the storageClassName to a predefined storage option that are available in your AKS cluster:
```bash
kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY
azureblob-csi blob.csi.azure.com Delete
azurefile file.csi.azure.com Delete
azurefile-csi file.csi.azure.com Delete
azurefile-csi-premium file.csi.azure.com Delete
azurefile-premium file.csi.azure.com Delete
default disk.csi.azure.com Delete
managed disk.csi.azure.com Delete
managed-csi disk.csi.azure.com Delete
managed-csi-premium disk.csi.azure.com Delete
managed-premium disk.csi.azure.com Delete
sc.azurelustre.csi.azure.com azurelustre.csi.azure.com Retain
```
The recommendation for storage options for the Dynamo caches are:
- Model Cache storing raw model artifacts, configuration files, tokenizers etc.<br>
- Persistence: Required to avoid repeated downloads and reduce cold-start latency.<br>
- Recommended storage: Azure Managed Lustre (shared, high throughput) or Azure Disk (single-replica, persistent).
- Compilation Cache stores backend-specific compiled artifacts (e.g., TensorRT engines).<br>
- Persistence: Optional<br>
- Recommended storage: Local CSI (fast, node-local) or Azure Disk (persistent when GPU configuration is fixed).
- Performance Cache stores runtime tuning and profiling data.<br>
- Persistence: Not required<br>
- Recommended storage: Local CSI (or other ephemeral storage).
cache.yaml example:
```bash
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 100Gi
storageClassName: "sc.azurelustre.csi.azure.com"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: compilation-cache
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 50Gi
storageClassName: "azurefile-csi"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: perf-cache
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 50Gi
storageClassName: "local-ephemeral"
```
## Running on AKS Spot VMs based GPU node pools
When deploying Dynamo on AKS with GPU-enabled [Spot VM](https://azure.microsoft.com/en-us/products/virtual-machines/spot) node pools, AKS will automatically apply the following taint to those Spot nodes to prevent standard workloads from being scheduled on them by default.
```bash
kubernetes.azure.com/scalesetpriority=spot:NoSchedule
```
Because of these taints, workloads (including the Dynamo CRD controller, Platform components, and any GPU workloads) must include below tolerations in their Helm charts. Without these tolerations, Kubernetes will not schedule pods onto the Spot VM node pools, and GPU resources will remain unused.
```bash
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
```
To schedule Dynamo platform components and jobs onto these nodes, use the provided dynamo/examples/deployments/AKS/values-aks-spot.yaml, which includes all required tolerations for:
- Dynamo operator controller manager
- Webhook CA inject and cert generation jobs
- etcd
- NATS
- MPI SSH key generation job
- Other core Dynamo platform pods
Use the following commands to install or upgrade Dynamo using the AKS Spot values file:
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace dynamo-system \
--create-namespace \
-f ./values-aks-spot.yaml
```
or
```bash
helm upgrade dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace dynamo-system \
-f ./values-aks-spot.yaml
```
## Clean Up Resources ## Clean Up Resources
If you want to clean up the Dynamo resources created during this guide, you can run the following commands: If you want to clean up the Dynamo resources created during this guide, you can run the following commands:
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Used to generate top-level secrets (overridden by custom-values.yaml)
# Subcharts configuration
# Dynamo operator configuration
dynamo-operator:
controllerManager:
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
dynamo:
mpiRun:
sshKeygen:
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
webhook:
certGenerator:
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
etcd:
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
nats:
podTemplate:
merge:
spec:
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
natsBox:
podTemplate:
merge:
spec:
tolerations:
- key: kubernetes.azure.com/scalesetpriority
operator: Equal
value: spot
effect: NoSchedule
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment