Commit 9616c86f authored by Erez Zarum, committed by GitHub

docs: EKS Auto Mode example (#7369)


Signed-off-by: Erez Zarum <erezz@amazon.com>
parent e252f82d
# Steps to create EKS cluster with EFS
## 1. Install CLIs
### a. Install AWS CLI (steps [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html))
```
sudo apt install unzip
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```
### b. Install Kubernetes CLI (steps [here](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html))
```
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.30.0/2024-05-12/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
```
### c. Install EKS CLI (steps [here](https://eksctl.io/installation/))
```
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
```
### d. Install Helm CLI (steps [here](https://docs.aws.amazon.com/eks/latest/userguide/helm.html))
```
curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 > get_helm.sh
chmod 700 get_helm.sh
./get_helm.sh
```
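Before moving on, it can help to confirm that all four CLIs are on your `PATH`. The following is a minimal sketch; `missing_clis` is a hypothetical helper name, not part of any of the tools installed above.

```bash
# Report which of the given commands are missing from PATH.
# Prints one missing name per line and returns non-zero if any are missing.
missing_clis() {
  local status=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Usage: check the four CLIs installed in this section.
# missing_clis aws kubectl eksctl helm || echo "Install the tools listed above first"
```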
## 2. Create an EKS cluster
In this example we create an EKS cluster with one `g6e.48xlarge` compute node, which has 8 NVIDIA L40S GPUs, and one `c5.2xlarge` CPU node for system workloads (the `sys-ng` node group; the EKS control plane itself is managed by AWS). We also set up EFA between the compute nodes.
### a. Configure AWS CLI
```
aws configure
```
### b. Create a config file for EKS cluster creation
```
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: <CLUSTER_NAME>
  version: "1.32"
  region: <REGION_NAME>
iam:
  withOIDC: true
managedNodeGroups:
  - name: sys-ng
    instanceType: c5.2xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
        awsLoadBalancerController: true
        cloudWatch: true
        albIngress: true
  - name: efa-compute-ng
    instanceType: g6e.48xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    volumeSize: 300
    efaEnabled: true
    privateNetworking: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
        awsLoadBalancerController: true
        cloudWatch: true
        albIngress: true
```
> [!NOTE]
> We set `minSize` and `desiredCapacity` to 1 because cluster creation does not succeed when the requested nodes are unavailable. For example, if you set `desiredCapacity` to 2 but two nodes are not available, cluster creation fails with a timeout even though no error is reported. The easiest way to avoid this is to create the cluster with 1 node and increase the number of nodes later in the EKS console. After you increase the number of nodes in your node groups, make sure the GPU nodes are in the same subnet; this is required for EFA to work.
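As an alternative to the EKS console, the node group can be scaled with `eksctl scale nodegroup`. The sketch below wraps that command and adds a small helper that counts distinct subnet IDs, so you can check the same-subnet requirement for EFA. The function names are hypothetical, and the `eks:nodegroup-name` tag filter in the usage comment is an assumption about how managed node group instances are tagged in your account.

```bash
# Scale a managed node group to a fixed size (desired = max = count).
scale_nodegroup() {
  local cluster="$1" nodegroup="$2" count="$3"
  eksctl scale nodegroup --cluster="$cluster" --name="$nodegroup" \
    --nodes="$count" --nodes-max="$count"
}

# Count distinct subnet IDs read from stdin (whitespace-separated).
# For EFA, the GPU nodes should report exactly 1.
count_distinct_subnets() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u | wc -l | tr -d ' '
}

# Usage:
# scale_nodegroup <CLUSTER_NAME> efa-compute-ng 2
# aws ec2 describe-instances \
#   --filters "Name=tag:eks:nodegroup-name,Values=efa-compute-ng" \
#   --query "Reservations[].Instances[].SubnetId" --output text | count_distinct_subnets
```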
### c. Create the EKS cluster
```
eksctl create cluster -f eks_cluster_config.yaml
```
## 3. Create an EFS file system
We'll need a common, shared storage location so that pods deployed to multiple nodes can load shards of the same model. This way, they can work together to serve inference requests for models too large to be loaded by the GPUs on a single node. In Kubernetes, these shared storage locations are referred to as persistent volumes. A persistent volume can be mounted into any number of pods and then accessed by processes running inside those pods as if it were part of the pod's file system. We will use EFS as the persistent volume.
Additionally, we will need to create a persistent volume claim, which we can use to assign the persistent volume to a pod.
### a. Create an IAM role
Follow the steps to create an IAM role for your EFS file system: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-iam-resources. This role will be used later when you install the EFS CSI Driver.
### b. Install EFS CSI driver
Install the EFS CSI Driver through the Amazon EKS add-on in the AWS console: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-install-driver. Once it is done, check the Add-ons section in the EKS console; the driver should show `Active` under Status.
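The add-on status can also be checked from the command line. This is a minimal sketch using `aws eks describe-addon`; `addon_status` is a hypothetical helper name, and the API reports the status in upper case (for example `ACTIVE`) rather than the `Active` label shown in the console.

```bash
# Print the status of the aws-efs-csi-driver add-on for a cluster.
addon_status() {
  local cluster="$1" region="$2"
  aws eks describe-addon \
    --cluster-name "$cluster" \
    --addon-name aws-efs-csi-driver \
    --region "$region" \
    --query "addon.status" \
    --output text
}

# Usage:
# addon_status <CLUSTER_NAME> <REGION_NAME>
```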
### c. Create EFS file system
Follow the steps to create an EFS file system: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/docs/efs-create-filesystem.md. In the last step, make sure you create the mount targets in the correct subnets; this determines whether your nodes can access the EFS file system.
## 4. Test
Follow the steps to check if your EFS file system is working properly with your nodes: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/multiple_pods. This test is going to mount your EFS file system on all of your available nodes and write a text file to the file system.
## 5. Create StorageClass
You can find your `fileSystemId` in the Amazon EFS console. It usually starts with `fs-`. Save the following manifest as `storageclass.yaml`:
```
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: efs.csi.aws.com
parameters:
  fileSystemId: fs-01e72da3fcdbf8a4d
  provisioningMode: efs-ap
  directoryPerms: "777"
  uid: "1000"
  gid: "1000"
```
```
kubectl apply -f storageclass.yaml
```
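To look up the `fileSystemId` for the StorageClass above from the command line, something like the following sketch may help. Both function names are hypothetical, and the ID check only assumes the `fs-` prefix followed by lowercase hexadecimal characters, as in the example ID in the manifest.

```bash
# List file system IDs and their Name tags in a region.
list_efs_ids() {
  aws efs describe-file-systems --region "$1" \
    --query "FileSystems[*].[FileSystemId,Name]" --output text
}

# Return 0 if the argument looks like an EFS file system ID ("fs-" + lowercase hex).
looks_like_efs_id() {
  case "$1" in
    fs-*[!0-9a-f]*) return 1 ;;  # contains a character that is not lowercase hex
    fs-?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
# list_efs_ids <REGION_NAME>
# looks_like_efs_id fs-01e72da3fcdbf8a4d && echo "ok"
```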
# Steps to install Dynamo Kubernetes Platform from Source
## 1. Build Dynamo Base Image
Create an ECR repository
```
aws configure
aws ecr create-repository --repository-name <ECR_REPOSITORY>
```
Build Image
```
export NAMESPACE=dynamo-system
export DOCKER_SERVER=<ECR_REGISTRY>
export DOCKER_USERNAME=AWS
export DOCKER_PASSWORD="$(aws ecr get-login-password --region <ECR_REGION>)"
export IMAGE_TAG=0.3.2.1
python container/render.py --framework=dynamo --target=runtime --output-short-filename
docker build -t dynamo:latest-vllm -f container/rendered.Dockerfile .
```
Push Image
```
docker tag dynamo:latest-vllm <ECR_REGISTRY>/<ECR_REPOSITORY>:$IMAGE_TAG
aws ecr get-login-password | docker login --username AWS --password-stdin <ECR_REGISTRY>
docker push <ECR_REGISTRY>/<ECR_REPOSITORY>:$IMAGE_TAG
```
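The `<ECR_REGISTRY>` placeholder used above is the private registry hostname for your account and Region. As a sketch, assuming the standard `aws` partition hostname format, it can be derived from the account ID; `ecr_registry` is a hypothetical helper name.

```bash
# Build the private ECR registry hostname for an account and Region.
# Assumes the standard "aws" partition (hostnames differ in other partitions).
ecr_registry() {
  local account_id="$1" region="$2"
  echo "${account_id}.dkr.ecr.${region}.amazonaws.com"
}

# Usage:
# ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# export DOCKER_SERVER=$(ecr_registry "$ACCOUNT_ID" <ECR_REGION>)
```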
## 2. Install Dynamo Kubernetes Platform
Build and Push Operator Image
```
cd deploy/operator
docker build -t $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG .
docker push $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG
```
Create secrets
```
kubectl create namespace ${NAMESPACE}
kubectl create secret docker-registry docker-imagepullsecret \
--docker-server=${DOCKER_SERVER} \
--docker-username=${DOCKER_USERNAME} \
--docker-password=${DOCKER_PASSWORD} \
--namespace=${NAMESPACE}
export HF_TOKEN=<HF_TOKEN>
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```
Install Dynamo Kubernetes Platform
```
cd deploy/helm/charts
helm install dynamo-crds ./crds/ \
--namespace default \
--wait \
--atomic
```
```
helm dep build ./platform/
# The namespace and the docker-imagepullsecret registry secret were already
# created in the "Create secrets" step above, so they are not recreated here.
# Install platform
helm install dynamo-platform ./platform/ \
--namespace ${NAMESPACE} \
--set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/kubernetes-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
--set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret"
```
Your pods should be running, as shown below:
```
ubuntu@ip-192-168-83-157:~/dynamo/examples/backends/vllm/deploy$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
dynamo-system dynamo-platform-dynamo-operator-controller-manager-86795c5f4j4k 2/2 Running 0 4h17m
dynamo-system dynamo-platform-etcd-0 1/1 Running 0 4h17m
dynamo-system dynamo-platform-nats-0 2/2 Running 0 4h17m
dynamo-system dynamo-platform-nats-box-5dbf45c748-bxqj7 1/1 Running 0 4h17m
```
# Steps to deploy vLLM example
## 1. Deploy Dynamo Graph
```
cd dynamo/examples/backends/vllm/deploy
vim agg_router.yaml #under metadata add namespace: dynamo-system and change image to your built base image
kubectl apply -f agg_router.yaml
```
Your pods should be running, as shown below:
```
ubuntu@ip-192-168-83-157:~/dynamo/examples/backends/vllm/deploy$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
dynamo-system dynamo-platform-dynamo-operator-controller-manager-86795c5f4j4k 2/2 Running 0 4h17m
dynamo-system dynamo-platform-etcd-0 1/1 Running 0 4h17m
dynamo-system dynamo-platform-nats-0 2/2 Running 0 4h17m
dynamo-system dynamo-platform-nats-box-5dbf45c748-bxqj7 1/1 Running 0 4h17m
dynamo-system vllm-agg-router-frontend-79d599bb9c-fg97p 1/1 Running 0 4m9s
dynamo-system vllm-agg-router-vllmdecodeworker-787d575485-hrcjp 1/1 Running 0 4m9s
dynamo-system vllm-agg-router-vllmdecodeworker-787d575485-zkwdd 1/1 Running 0 4m9s
```
Test the Deployment
```
# port-forward blocks the terminal: run it in a separate terminal, or append & to background it
kubectl port-forward deployment/vllm-agg-router-frontend 8000:8000 -n dynamo-system
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": false,
"max_tokens": 30
}'
```
You should see output similar to the following:
```
{"id":"chatcmpl-bbe52b36-90ed-4479-9872-89e1aa412aa7","choices":[{"index":0,"message":{"content":"<think>\nOkay, so the user wants me to develop a character background for an explorer named someone in Eldoria. The character is part of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753417848,"model":"Qwen/Qwen3-0.6B","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":{"prompt_tokens":196,"completion_tokens":29,"total_tokens":225,"prompt_tokens_details":null,"completion_tokens_details":null}}
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Create an Amazon EFS File System for Amazon EKS
This guide walks through creating an Amazon EFS file system and connecting it to your EKS cluster. The EFS CSI Driver was already installed as an addon via `eksctl.yaml` during cluster creation. Now we need to create the actual file system and make it available to Kubernetes workloads.
This filesystem will be used by Dynamo to store shared model weights and compilation cache across nodes.
## Prerequisites
- EKS cluster created following the [README](README.md)
- Environment variables set:
```bash
export AWS_REGION="us-east-1"
export CLUSTER_NAME="ai-dynamo"
export DYNAMO_NAMESPACE="dynamo-system"
```
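Because every later command depends on these variables, a quick guard can catch a missing export early. The sketch below uses `printenv`, so it only sees exported variables; `require_env` is a hypothetical helper name.

```bash
# Return non-zero and list the names of any required variables that are
# unset, empty, or not exported.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
# require_env AWS_REGION CLUSTER_NAME DYNAMO_NAMESPACE || echo "Set the variables above first"
```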
## Retrieve VPC and Subnet Information
Get the VPC ID associated with your EKS cluster:
```bash
export VPC_ID=$(aws eks describe-cluster \
--name $CLUSTER_NAME \
--region $AWS_REGION \
--query "cluster.resourcesVpcConfig.vpcId" \
--output text)
```
Get the CIDR range for the VPC (used for the security group rule):
```bash
export VPC_CIDR=$(aws ec2 describe-vpcs \
--vpc-ids $VPC_ID \
--query "Vpcs[0].CidrBlock" \
--output text)
```
## Create a Security Group for EFS
Create a security group that allows NFS traffic (port 2049) from within the VPC:
```bash
export EFS_SG_ID=$(aws ec2 create-security-group \
--group-name dynamo-efs-sg \
--description "Security group for EFS access from EKS" \
--vpc-id $VPC_ID \
--region $AWS_REGION \
--query "GroupId" \
--output text)
```
Add an inbound rule to allow NFS traffic from the VPC CIDR:
```bash
aws ec2 authorize-security-group-ingress \
--group-id $EFS_SG_ID \
--protocol tcp \
--port 2049 \
--cidr $VPC_CIDR \
--region $AWS_REGION
```
## Create the EFS File System
```bash
export EFS_FS_ID=$(aws efs create-file-system \
--performance-mode generalPurpose \
--throughput-mode elastic \
--encrypted \
--region $AWS_REGION \
--tags Key=Name,Value=dynamo-efs \
--query "FileSystemId" \
--output text)
```
Wait for the file system to become available:
```bash
aws efs describe-file-systems \
--file-system-id $EFS_FS_ID \
--region $AWS_REGION \
--query "FileSystems[0].LifeCycleState" \
--output text
```
You should see `available` before proceeding.
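Rather than re-running the command by hand, the check can be wrapped in a polling loop. This is a minimal sketch; `wait_for_efs` and the `POLL_SECONDS` variable are hypothetical names, and the attempt count and interval are arbitrary defaults.

```bash
# Poll until the file system reports "available", or give up after N attempts.
wait_for_efs() {
  local fs_id="$1" region="$2" attempts="${3:-30}" state i
  for i in $(seq 1 "$attempts"); do
    state=$(aws efs describe-file-systems \
      --file-system-id "$fs_id" \
      --region "$region" \
      --query "FileSystems[0].LifeCycleState" \
      --output text)
    [ "$state" = "available" ] && return 0
    echo "state: $state (attempt $i/$attempts)"
    sleep "${POLL_SECONDS:-5}"
  done
  return 1
}

# Usage:
# wait_for_efs "$EFS_FS_ID" "$AWS_REGION" || echo "File system did not become available"
```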
## Create Mount Targets
Mount targets allow your EKS nodes to access the EFS file system. You need one mount target per subnet where your nodes run.
Get the subnet IDs used by your EKS cluster:
```bash
export SUBNET_IDS=$(aws eks describe-cluster \
--name $CLUSTER_NAME \
--region $AWS_REGION \
--query "cluster.resourcesVpcConfig.subnetIds[]" \
--output text)
echo "Subnet IDs: $SUBNET_IDS"
```
Create a mount target in each subnet:
```bash
for SUBNET_ID in $(echo "$SUBNET_IDS" | tr '\t' '\n'); do
  echo "Creating mount target in subnet: $SUBNET_ID"
  aws efs create-mount-target \
    --file-system-id $EFS_FS_ID \
    --subnet-id $SUBNET_ID \
    --security-groups $EFS_SG_ID \
    --region $AWS_REGION 2>/dev/null || echo " Mount target already exists or subnet is in a duplicate AZ (this is OK)"
done
```
> **Note:** EFS allows only one mount target per Availability Zone. If multiple subnets are in the same AZ, the command will fail for the duplicates, which is expected and safe to ignore.
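The duplicate-AZ failures described in the note can also be avoided up front by selecting one subnet per Availability Zone before creating mount targets. The sketch below shows one way to do that; `first_subnet_per_az` is a hypothetical helper, and it simply keeps the first subnet listed for each AZ, which may not be the subnet you would choose by hand.

```bash
# Read "AZ<TAB>SubnetId" lines on stdin and print the first subnet seen per AZ.
first_subnet_per_az() {
  awk '!seen[$1]++ { print $2 }'
}

# Usage:
# aws ec2 describe-subnets \
#   --subnet-ids $SUBNET_IDS \
#   --region $AWS_REGION \
#   --query "Subnets[*].[AvailabilityZone,SubnetId]" \
#   --output text | first_subnet_per_az
```

Selecting subnets this way means a failed `create-mount-target` call points to a real problem, so its error output no longer needs to be discarded.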
Verify mount targets are available:
```bash
aws efs describe-mount-targets \
--file-system-id $EFS_FS_ID \
--region $AWS_REGION \
--query "MountTargets[*].{SubnetId:SubnetId,AZ:AvailabilityZoneName,State:LifeCycleState}" \
--output table
```
Wait until all mount targets show `available` in the State column before proceeding.
## Create Kubernetes StorageClass
Create a StorageClass that uses the EFS CSI driver with dynamic provisioning:
```bash
kubectl apply -f - << EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc-dynamic
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: "${EFS_FS_ID}"
  directoryPerms: "777"
  uid: "1000"
  gid: "1000"
EOF
```
## Create PersistentVolumeClaims
We create three separate PVCs because different Dynamo recipe examples reference each one individually:
* `model-cache` stores downloaded model weights (e.g. from HuggingFace).
* `compilation-cache` stores vLLM/TRT-LLM compilation artifacts.
* `perf-cache` stores benchmark traces and performance results.
```bash
# Create the namespace for Dynamo if it does not already exist
kubectl create namespace ${DYNAMO_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
# Create PVCs
kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ${DYNAMO_NAMESPACE}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: "efs-sc-dynamic"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
  namespace: ${DYNAMO_NAMESPACE}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: "efs-sc-dynamic"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-cache
  namespace: ${DYNAMO_NAMESPACE}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: "efs-sc-dynamic"
EOF
```
> **Note:** EFS is elastic. The `storage` value in the PVC is required by Kubernetes but does not limit the actual storage; EFS grows and shrinks automatically.
## Verify
Confirm the PVCs are bound:
```bash
kubectl get pvc -n ${DYNAMO_NAMESPACE}
```
You should see output similar to:
```
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
compilation-cache Bound pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 5Gi RWX efs-sc-dynamic <unset> 41s
model-cache Bound pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 5Gi RWX efs-sc-dynamic <unset> 42s
perf-cache Bound pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 5Gi RWX efs-sc-dynamic <unset> 41s
```
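For scripted setups, the same check can block until each claim is bound. This is a sketch built on `kubectl wait --for=jsonpath=...`, which requires a reasonably recent kubectl; `wait_for_pvcs_bound` is a hypothetical helper name and the 120-second timeout is an arbitrary choice.

```bash
# Wait for each named PVC in a namespace to reach the Bound phase.
wait_for_pvcs_bound() {
  local namespace="$1" pvc
  shift
  for pvc in "$@"; do
    kubectl wait --for=jsonpath='{.status.phase}'=Bound \
      "pvc/$pvc" -n "$namespace" --timeout=120s || return 1
  done
}

# Usage:
# wait_for_pvcs_bound "$DYNAMO_NAMESPACE" model-cache compilation-cache perf-cache
```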
## Cleanup
To delete the EFS resources when no longer needed:
```bash
# Delete the Kubernetes resources
kubectl delete pvc model-cache compilation-cache perf-cache -n ${DYNAMO_NAMESPACE}
kubectl delete storageclass efs-sc-dynamic
# Delete mount targets
for MT_ID in $(aws efs describe-mount-targets --file-system-id $EFS_FS_ID --region $AWS_REGION --query "MountTargets[*].MountTargetId" --output text); do
  aws efs delete-mount-target --mount-target-id $MT_ID --region $AWS_REGION
done
# Delete the EFS file system
aws efs delete-file-system --file-system-id $EFS_FS_ID --region $AWS_REGION
# Delete the security group
aws ec2 delete-security-group --group-id $EFS_SG_ID --region $AWS_REGION
```
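Mount target deletion is asynchronous, and `delete-file-system` is rejected while mount targets still exist, so running the cleanup commands back to back may fail. A possible workaround is to poll until no mount targets remain before deleting the file system. In this sketch, `wait_for_mount_targets_deleted` and `POLL_SECONDS` are hypothetical names with arbitrary defaults.

```bash
# Poll until the file system has no mount targets left, or give up after N attempts.
wait_for_mount_targets_deleted() {
  local fs_id="$1" region="$2" attempts="${3:-30}" remaining i
  for i in $(seq 1 "$attempts"); do
    remaining=$(aws efs describe-mount-targets \
      --file-system-id "$fs_id" \
      --region "$region" \
      --query "length(MountTargets)" \
      --output text)
    [ "$remaining" = "0" ] && return 0
    echo "mount targets remaining: $remaining (attempt $i/$attempts)"
    sleep "${POLL_SECONDS:-5}"
  done
  return 1
}

# Usage: run between the delete-mount-target loop and delete-file-system.
# wait_for_mount_targets_deleted "$EFS_FS_ID" "$AWS_REGION" && \
#   aws efs delete-file-system --file-system-id "$EFS_FS_ID" --region "$AWS_REGION"
```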
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  disruption:
    budgets:
      - nodes: 10%
    consolidateAfter: 300s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand
        - key: eks.amazonaws.com/instance-family
          operator: In
          values:
            - g5
            - g6
            - g6e
            - g7e
            - p5
            - p5e
            - p5en
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
          value: Exists
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - model-download.yaml
patches:
  - target:
      kind: Job
      name: model-download
    patch: |
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: model-download
      spec:
        template:
          spec:
            containers:
              - name: model-download
                resources:
                  requests:
                    memory: "4Gi"
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        requests:
          gpu: "1"
        limits:
          gpu: "1"
      extraPodSpec:
        nodeSelector:
          node.kubernetes.io/instance-type: g6e.2xlarge
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0-efa-amd64
          workingDir: /workspace/examples/backends/vllm
          command:
            - /bin/bash
            - -c
            - |
              exec python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    Frontend:
      componentType: frontend
      envs:
        - name: HF_HOME
          value: /home/dynamo/.cache/huggingface
      replicas: 1
      resources:
        requests:
          cpu: "8"
        limits:
          cpu: "8"
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
      extraPodSpec:
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: "topology.kubernetes.io/zone"
                labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-graph-deployment-name: "vllm-disagg"
                    nvidia.com/dynamo-component-type: "worker"
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0
          workingDir: /workspace
          command:
            - /bin/bash
            - -c
            - |
              exec python3 -m dynamo.frontend --router-mode kv --router-reset-states
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 2
      resources:
        requests:
          gpu: "2"
          custom:
            vpc.amazonaws.com/efa: "8"
        limits:
          gpu: "2"
          custom:
            vpc.amazonaws.com/efa: "8"
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
          useAsCompilationCache: true
      extraPodSpec:
        nodeSelector:
          node.kubernetes.io/instance-type: p5.48xlarge
          karpenter.sh/capacity-type: reserved
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: "topology.kubernetes.io/zone"
                labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-graph-deployment-name: "vllm-disagg"
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      nvidia.com/dynamo-component: "VllmPrefillWorker"
                  topologyKey: "kubernetes.io/hostname"
        mainContainer:
          env:
            - name: HF_HOME
              value: /home/dynamo/.cache/huggingface
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0-efa-amd64
          workingDir: /workspace
          command:
            - /bin/bash
            - -c
            - |
              exec python3 -m dynamo.vllm \
                --model Qwen/Qwen3-32B \
                --tensor-parallel-size 2 \
                --disaggregation-mode decode \
                --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config": {"backends": ["LIBFABRIC"]}}'
    VllmPrefillWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 6
      resources:
        requests:
          gpu: "2"
          custom:
            vpc.amazonaws.com/efa: "8"
        limits:
          gpu: "2"
          custom:
            vpc.amazonaws.com/efa: "8"
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
          useAsCompilationCache: true
      extraPodSpec:
        nodeSelector:
          node.kubernetes.io/instance-type: p5.48xlarge
          karpenter.sh/capacity-type: reserved
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: "topology.kubernetes.io/zone"
                labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-graph-deployment-name: "vllm-disagg"
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      nvidia.com/dynamo-component: "VllmPrefillWorker"
                  topologyKey: "kubernetes.io/hostname"
        mainContainer:
          env:
            - name: HF_HOME
              value: /home/dynamo/.cache/huggingface
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0-efa-amd64
          workingDir: /workspace
          command:
            - /bin/bash
            - -c
            - |
              exec python3 -m dynamo.vllm \
                --model Qwen/Qwen3-32B \
                --tensor-parallel-size 2 \
                --disaggregation-mode prefill \
                --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config": {"backends": ["LIBFABRIC"]}}'
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: "topology.kubernetes.io/zone"
                labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-graph-deployment-name: "vllm-disagg"
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        requests:
          gpu: "1"
        limits:
          gpu: "1"
      extraPodSpec:
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
        nodeSelector:
          node.kubernetes.io/instance-type: g6e.2xlarge
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: "topology.kubernetes.io/zone"
                labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-graph-deployment-name: "vllm-disagg"
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-component: "VllmPrefillWorker"
                topologyKey: "kubernetes.io/hostname"
        mainContainer:
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0-efa-amd64
          workingDir: /workspace/examples/backends/vllm
          command:
            - /bin/bash
            - -c
            - |
              exec python3 -m dynamo.vllm \
                --model Qwen/Qwen3-0.6B \
                --disaggregation-mode decode \
                --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cpu","kv_connector_extra_config": {"backends": ["LIBFABRIC"]}}'
    VllmPrefillWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 1
      resources:
        requests:
          gpu: "1"
        limits:
          gpu: "1"
      extraPodSpec:
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
        nodeSelector:
          node.kubernetes.io/instance-type: g6e.2xlarge
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: "topology.kubernetes.io/zone"
                labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-graph-deployment-name: "vllm-disagg"
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-component: "VllmDecodeWorker"
                topologyKey: "kubernetes.io/hostname"
        mainContainer:
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0-efa-amd64
          workingDir: /workspace/examples/backends/vllm
          command:
            - /bin/bash
            - -c
            - |
              exec python3 -m dynamo.vllm \
                --model Qwen/Qwen3-0.6B \
                --disaggregation-mode prefill \
                --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cpu","kv_connector_extra_config": {"backends": ["LIBFABRIC"]}}'
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    Frontend:
      componentType: frontend
      envs:
        - name: HF_HOME
          value: /home/dynamo/.cache/huggingface
      replicas: 1
      resources:
        requests:
          cpu: "8"
        limits:
          cpu: "8"
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
      extraPodSpec:
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: "topology.kubernetes.io/zone"
                labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-graph-deployment-name: "vllm-disagg"
                    nvidia.com/dynamo-component-type: "worker"
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0
          workingDir: /workspace
          command:
            - /bin/bash
            - -c
            - |
              exec python3 -m dynamo.frontend --router-mode kv --router-reset-states
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        requests:
          gpu: "1"
          custom:
            vpc.amazonaws.com/efa: "1"
        limits:
          gpu: "1"
          custom:
            vpc.amazonaws.com/efa: "1"
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
          useAsCompilationCache: true
      extraPodSpec:
        nodeSelector:
          node.kubernetes.io/instance-type: g7e.12xlarge
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: "topology.kubernetes.io/zone"
                labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-graph-deployment-name: "vllm-disagg"
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-component: "VllmPrefillWorker"
                topologyKey: "kubernetes.io/hostname"
        mainContainer:
          env:
            - name: HF_HOME
              value: /home/dynamo/.cache/huggingface
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0-efa-amd64
          workingDir: /workspace
          command:
            - /bin/bash
            - -c
            - |
              exec python3 -m dynamo.vllm \
                --model Qwen/Qwen3-32B \
                --disaggregation-mode decode \
                --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config": {"backends": ["LIBFABRIC"]}}'
    VllmPrefillWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 1
      resources:
        requests:
          gpu: "1"
          custom:
            vpc.amazonaws.com/efa: "1"
        limits:
          gpu: "1"
          custom:
            vpc.amazonaws.com/efa: "1"
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
          useAsCompilationCache: true
      extraPodSpec:
        nodeSelector:
          node.kubernetes.io/instance-type: g7e.12xlarge
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
        affinity:
          podAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: "topology.kubernetes.io/zone"
                labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-graph-deployment-name: "vllm-disagg"
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    nvidia.com/dynamo-component: "VllmDecodeWorker"
                topologyKey: "kubernetes.io/hostname"
        mainContainer:
          env:
            - name: HF_HOME
              value: /home/dynamo/.cache/huggingface
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0-efa-amd64
          workingDir: /workspace
          command:
            - /bin/bash
            - -c
            - |
              exec python3 -m dynamo.vllm \
                --model Qwen/Qwen3-32B \
                --disaggregation-mode prefill \
                --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config": {"backends": ["LIBFABRIC"]}}'
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_REGION}
availabilityZones:
${EKS_CP_AZS}
autoModeConfig:
  enabled: true
addonsConfig:
  disableDefaultAddons: true
addons:
  - name: aws-efs-csi-driver
    version: latest
    useDefaultPodIdentityAssociations: true