Commit 69fffdba authored by kYLe, committed by GitHub

docs: Add AIConfigurator and disagg example for Dynamo vLLM (#3183)


Signed-off-by: Kyle H <kylhuang@nvidia.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Signed-off-by: kYLe <kylhuang@nvidia.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>
parent 237e978f
......@@ -21,10 +21,11 @@ helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace
3. Model hosting with vLLM backend
This `agg_router.yaml` is adapted from the vLLM deployment [example](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/agg_router.yaml). It has the following customizations:
- Deployed `Qwen/Qwen2.5-1.5B-Instruct` model
- Use KV cache based routing in frontend deployment `--router-mode kv`
- Use KV cache based routing in frontend deployment via the `DYN_ROUTER_MODE=kv` environment variable
- Mounted a local cache folder `/YOUR/LOCAL/CACHE/FOLDER` for model artifacts reuse
- Created 4 replicas for this model deployment by setting `replicas: 4`
- Added `debug` flag environment variable for observability
Create a Kubernetes secret with your Hugging Face token, then deploy the models:
```sh
export HF_TOKEN=YOUR_HF_TOKEN
......@@ -43,7 +44,7 @@ and use following request to test the deployed model
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [
{
"role": "user",
......@@ -55,3 +56,23 @@ curl localhost:8000/v1/chat/completions \
}'
```
You can also benchmark the endpoint's performance with [AIPerf](https://github.com/ai-dynamo/aiperf/blob/main/README.md).
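Before running a full benchmark, a single timed request can confirm that the endpoint responds at all. The following is a minimal sketch, not a replacement for AIPerf: the helper names `build_payload` and `time_request` are hypothetical, the `max_tokens` field is assumed to be accepted by the OpenAI-compatible endpoint, and the frontend is assumed to be port-forwarded to `localhost:8000` as in the request above.

```sh
# Build a minimal chat-completions payload.
# $1 = model name, $2 = user prompt (plain text without quotes or backslashes).
build_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 32}' "$1" "$2"
}

# Send one request and print only the total wall-clock time in seconds.
# The response body is discarded; curl's -w option prints the timing.
time_request() {
  curl -s -o /dev/null -w '%{time_total}\n' \
    localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1" "$2")"
}
```

For example, `time_request "Qwen/Qwen2.5-1.5B-Instruct" "Hello"` prints the latency of one request. A single sample is only a sanity check; it says nothing about throughput under load.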
## 2. Deploy Single-Node-Sized Models using AIConfigurator
AIConfigurator helps you find a strong starting configuration for disaggregated serving. We can use it as guidance for serving Single-Node-Sized (SNS) models.
1. Install AIConfigurator
```sh
pip3 install aiconfigurator
```
2. Assume we have 2 GPU nodes with 16 H200 GPUs in total, and we want to deploy the Llama 3.1-70B-Instruct model with an optimal disaggregated serving configuration. Run AIConfigurator for this model:
```sh
aiconfigurator cli --model LLAMA3.1_70B --total_gpus 16 --system h200_sxm
```
From the output, you can see the Pareto curve with the suggested prefill/decode (P/D) settings:
![text](images/pareto.png)
3. Start serving with 1 prefill worker at tensor parallelism 4 and 1 decode worker at tensor parallelism 8, as AIConfigurator suggested. In `disagg_router.yaml`, replace `my-tag` with the latest Dynamo version and `/YOUR/LOCAL/CACHE/FOLDER` with your local cache folder path, then run the following command.
![text](images/settings.png)
```sh
kubectl apply -f disagg_router.yaml --namespace ${NAMESPACE}
```
4. Forward the port and test the performance as described in the section above.
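
As a quick sanity check on the configuration in step 3, the GPU budget can be worked out with simple arithmetic: each worker uses as many GPUs as its tensor-parallel degree. The sketch below assumes exactly that relationship and nothing more; `gpus_needed` is a hypothetical helper, not part of AIConfigurator or Dynamo.

```sh
# GPUs consumed by a disaggregated deployment.
# Arguments: prefill_workers prefill_tp decode_workers decode_tp
gpus_needed() {
  echo $(( $1 * $2 + $3 * $4 ))
}

TOTAL_GPUS=16                      # 2 nodes, 16 H200 GPUs in total
NEEDED=$(gpus_needed 1 4 1 8)      # 1 prefill worker at TP 4, 1 decode worker at TP 8
echo "needed=${NEEDED} total=${TOTAL_GPUS} spare=$(( TOTAL_GPUS - NEEDED ))"
# prints: needed=12 total=16 spare=4
```

With these numbers the deployment occupies 12 of the 16 GPUs, which matches the `gpu: "4"` and `gpu: "8"` limits on the prefill and decode workers in `disagg_router.yaml`.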
......@@ -8,78 +8,24 @@ metadata:
spec:
services:
Frontend:
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 30
failureThreshold: 10
readinessProbe:
exec:
command:
- /bin/sh
- -c
- 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 30
failureThreshold: 10
dynamoNamespace: vllm-agg-router
componentType: main
componentType: frontend
replicas: 1
resources:
requests:
cpu: "1"
memory: "2Gi"
limits:
cpu: "1"
memory: "2Gi"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/components/backends/vllm
command:
- /bin/sh
- -c
args:
- "python3 -m dynamo.frontend --http-port 8000 --router-mode kv"
envs:
- name: DYN_ROUTER_MODE
value: kv
VllmDecodeWorker:
envFromSecret: hf-token-secret
livenessProbe:
httpGet:
path: /live
port: 9090
periodSeconds: 5
timeoutSeconds: 30
failureThreshold: 1
readinessProbe:
httpGet:
path: /health
port: 9090
periodSeconds: 10
timeoutSeconds: 30
failureThreshold: 60
dynamoNamespace: vllm-agg-router
componentType: worker
replicas: 4
resources:
requests:
cpu: "10"
memory: "20Gi"
gpu: "1"
limits:
cpu: "10"
memory: "20Gi"
gpu: "1"
envs:
- name: DYN_SYSTEM_ENABLED
value: "true"
- name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
value: "[\"generate\"]"
- name: DYN_SYSTEM_PORT
value: "9090"
- name: DYN_LOG
value: "debug"
extraPodSpec:
......@@ -89,12 +35,6 @@ spec:
path: /YOUR/LOCAL/CACHE/FOLDER
type: DirectoryOrCreate
mainContainer:
startupProbe:
httpGet:
path: /health
port: 9090
periodSeconds: 10
failureThreshold: 60
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
volumeMounts:
- name: local-model-cache
......@@ -104,4 +44,4 @@ spec:
- /bin/sh
- -c
args:
- python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B 2>&1 | tee /tmp/vllm.log
- python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: vllm-v1-disagg-router
spec:
services:
Frontend:
dynamoNamespace: vllm-v1-disagg-router
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
envs:
- name: DYN_ROUTER_MODE
value: kv
VllmDecodeWorker:
dynamoNamespace: vllm-v1-disagg-router
envFromSecret: hf-token-secret
componentType: worker
replicas: 1
resources:
limits:
gpu: "8"
envs:
- name: DYN_LOG
value: "debug"
extraPodSpec:
volumes:
- name: local-model-cache
hostPath:
path: /YOUR/LOCAL/CACHE/FOLDER
type: DirectoryOrCreate
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/components/backends/vllm
volumeMounts:
- name: local-model-cache
mountPath: /root/.cache
command:
- /bin/sh
- -c
args:
- python3 -m dynamo.vllm --model meta-llama/Llama-3.1-70B-Instruct -tp 8
VllmPrefillWorker:
dynamoNamespace: vllm-v1-disagg-router
envFromSecret: hf-token-secret
componentType: worker
replicas: 1
resources:
limits:
gpu: "4"
envs:
- name: DYN_LOG
value: "debug"
extraPodSpec:
volumes:
- name: local-model-cache
hostPath:
path: /YOUR/LOCAL/CACHE/FOLDER
type: DirectoryOrCreate
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/components/backends/vllm
volumeMounts:
- name: local-model-cache
mountPath: /root/.cache
command:
- /bin/sh
- -c
args:
- python3 -m dynamo.vllm --model meta-llama/Llama-3.1-70B-Instruct -tp 4 --is-prefill-worker