docs: Add AIConfigurator and disagg example for Dynamo vLLM (#3183)

Signed-off-by: Kyle H <kylhuang@nvidia.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Signed-off-by: kYLe <kylhuang@nvidia.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>

Commit 69fffdba, authored Oct 21, 2025 by kYLe, committed by GitHub Oct 21, 2025
5 changed files
--- a/examples/basics/kubernetes/Distributed_Inference/README.md
+++ b/examples/basics/kubernetes/Distributed_Inference/README.md
@@ -21,10 +21,11 @@ helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace
 3. Model hosting with vLLM backend
 This `agg_router.yaml` is adapted from the vLLM deployment [example](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/agg_router.yaml). It has the following customizations:
 - Deployed `Qwen/Qwen2.5-1.5B-Instruct` model
- Use KV cache based routing in frontend deployment `--router-mode kv`
+- Use KV cache based routing in frontend deployment via the `DYN_ROUTER_MODE=kv` environment variable
 - Mounted a local cache folder `/YOUR/LOCAL/CACHE/FOLDER` for model artifacts reuse
 - Created 4 replicas for this model deployment by setting `replicas: 4`
 - Added `debug` flag environment variable for observability
+
 Create a K8S secret with your Huggingface token and then deploy the models
 ```sh
 export HF_TOKEN=YOUR_HF_TOKEN
@@ -43,7 +44,7 @@ and use following request to test the deployed model
 curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "model": "Qwen/Qwen3-0.6B",
+    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
    {
        "role": "user",
@@ -55,3 +56,23 @@ curl localhost:8000/v1/chat/completions \
  }'
  ```
 You can also benchmark the performance of the endpoint by [AIPerf](https://github.com/ai-dynamo/aiperf/blob/main/README.md)
+
+## 2. Deploy Single-Node-Sized Models using AIConfigurator
+AIConfigurator helps users find a strong starting configuration for disaggregated serving. We can use it as guidance for serving SNS (Single-Node-Sized) models.
+1. Install AIConfigurator
+```sh
+pip3 install aiconfigurator
+```
+2. Assume we have 2 GPU nodes with 16 H200 GPUs in total, and we want to deploy the Llama 3.1-70B-Instruct model with an optimal disaggregated serving configuration. Run AIConfigurator for this model:
+```sh
+aiconfigurator cli --model LLAMA3.1_70B --total_gpus 16 --system h200_sxm
+```
+From the output, you can see the Pareto curve with the suggested P/D (prefill/decode) settings:
+![text](images/pareto.png)
+3. Start serving with 1 prefill worker at tensor parallelism 4 and 1 decode worker at tensor parallelism 8, as AIConfigurator suggests. In `disagg_router.yaml`, replace `my-tag` with the latest Dynamo version and `/YOUR/LOCAL/CACHE/FOLDER` with your local cache folder path, then run the following command.
+![text](images/settings.png)
+```sh
+kubectl apply -f disagg_router.yaml --namespace ${NAMESPACE}
+```
+
+4. Forward the port and test out the performance as described in the section above.
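The GPU arithmetic behind step 3 can be checked before applying the manifest. The sketch below is not part of the commit. It takes the worker sizes from the README text and `disagg_router.yaml`, and assumes the 16 GPUs are split evenly across the 2 nodes (8 per node), which the README does not state explicitly. All variable names are illustrative.

```sh
#!/bin/sh
# Values taken from the README step and disagg_router.yaml.
TOTAL_GPUS=16        # 2 nodes, 16 H200 GPUs in total
NODES=2
PREFILL_TP=4         # VllmPrefillWorker: -tp 4, gpu limit "4"
PREFILL_REPLICAS=1
DECODE_TP=8          # VllmDecodeWorker: -tp 8, gpu limit "8"
DECODE_REPLICAS=1

# Assumption: GPUs are distributed evenly across the nodes.
GPUS_PER_NODE=$(( TOTAL_GPUS / NODES ))
GPUS_USED=$(( PREFILL_TP * PREFILL_REPLICAS + DECODE_TP * DECODE_REPLICAS ))
GPUS_FREE=$(( TOTAL_GPUS - GPUS_USED ))

# A pod is scheduled onto a single node, so each worker's GPU request
# must fit within one node, and the total must fit within the cluster.
FITS=yes
[ "$PREFILL_TP" -le "$GPUS_PER_NODE" ] || FITS=no
[ "$DECODE_TP" -le "$GPUS_PER_NODE" ] || FITS=no
[ "$GPUS_USED" -le "$TOTAL_GPUS" ] || FITS=no

echo "GPUs used: ${GPUS_USED}/${TOTAL_GPUS}, free: ${GPUS_FREE}, fits: ${FITS}"
```

With the values above, the deployment requests 12 of the 16 GPUs, leaving 4 idle unless replicas are added.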
--- a/examples/basics/kubernetes/Distributed_Inference/agg_router.yaml
+++ b/examples/basics/kubernetes/Distributed_Inference/agg_router.yaml
@@ -8,78 +8,24 @@ metadata:
 spec:
  services:
    Frontend:
-      livenessProbe:
-        httpGet:
-          path: /health
-          port: 8000
-        initialDelaySeconds: 60
-        periodSeconds: 60
-        timeoutSeconds: 30
-        failureThreshold: 10
-      readinessProbe:
-        exec:
-          command:
-            - /bin/sh
-            - -c
-            - 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
-        initialDelaySeconds: 60
-        periodSeconds: 60
-        timeoutSeconds: 30
-        failureThreshold: 10
      dynamoNamespace: vllm-agg-router
-      componentType: main
+      componentType: frontend
      replicas: 1
-      resources:
-        requests:
-          cpu: "1"
-          memory: "2Gi"
-        limits:
-          cpu: "1"
-          memory: "2Gi"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
-          workingDir: /workspace/components/backends/vllm
-          command:
-            - /bin/sh
-            - -c
-          args:
-            - "python3 -m dynamo.frontend --http-port 8000 --router-mode kv"
+      envs:
+        - name: DYN_ROUTER_MODE
+          value: kv
    VllmDecodeWorker:
      envFromSecret: hf-token-secret
-      livenessProbe:
-        httpGet:
-          path: /live
-          port: 9090
-        periodSeconds: 5
-        timeoutSeconds: 30
-        failureThreshold: 1
-      readinessProbe:
-        httpGet:
-          path: /health
-          port: 9090
-        periodSeconds: 10
-        timeoutSeconds: 30
-        failureThreshold: 60
      dynamoNamespace: vllm-agg-router
      componentType: worker
      replicas: 4
      resources:
-        requests:
-          cpu: "10"
-          memory: "20Gi"
-          gpu: "1"
        limits:
-          cpu: "10"
-          memory: "20Gi"
          gpu: "1"
      envs:
-        - name: DYN_SYSTEM_ENABLED
-          value: "true"
-        - name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
-          value: "[\"generate\"]"
-        - name: DYN_SYSTEM_PORT
-          value: "9090"
        - name: DYN_LOG
          value: "debug"
      extraPodSpec:
@@ -89,12 +35,6 @@ spec:
            path: /YOUR/LOCAL/CACHE/FOLDER
            type: DirectoryOrCreate
        mainContainer:
-          startupProbe:
-            httpGet:
-              path: /health
-              port: 9090
-            periodSeconds: 10
-            failureThreshold: 60
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
          volumeMounts:
          - name: local-model-cache
@@ -104,4 +44,4 @@ spec:
            - /bin/sh
            - -c
          args:
-            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B  2>&1 | tee /tmp/vllm.log
+            - python3 -m dynamo.vllm --model Qwen/Qwen2.5-1.5B-Instruct
--- a/examples/basics/kubernetes/Distributed_Inference/disagg_router.yaml
+++ b/examples/basics/kubernetes/Distributed_Inference/disagg_router.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: vllm-v1-disagg-router
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: vllm-v1-disagg-router
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+      envs:
+        - name: DYN_ROUTER_MODE
+          value: kv
+    VllmDecodeWorker:
+      dynamoNamespace: vllm-v1-disagg-router
+      envFromSecret: hf-token-secret
+      componentType: worker
+      replicas: 1
+      resources:
+        limits:
+          gpu: "8"
+      envs:
+        - name: DYN_LOG
+          value: "debug"
+      extraPodSpec:
+        volumes:
+        - name: local-model-cache
+          hostPath:
+            path: /YOUR/LOCAL/CACHE/FOLDER
+            type: DirectoryOrCreate
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          workingDir: /workspace/components/backends/vllm
+          volumeMounts:
+          - name: local-model-cache
+            mountPath: /root/.cache
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python3 -m dynamo.vllm --model meta-llama/Llama-3.1-70B-Instruct -tp 8
+    VllmPrefillWorker:
+      dynamoNamespace: vllm-v1-disagg-router
+      envFromSecret: hf-token-secret
+      componentType: worker
+      replicas: 1
+      resources:
+        limits:
+          gpu: "4"
+      envs:
+        - name: DYN_LOG
+          value: "debug"
+      extraPodSpec:
+        volumes:
+        - name: local-model-cache
+          hostPath:
+            path: /YOUR/LOCAL/CACHE/FOLDER
+            type: DirectoryOrCreate
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
+          workingDir: /workspace/components/backends/vllm
+          volumeMounts:
+          - name: local-model-cache
+            mountPath: /root/.cache
+          command:
+            - /bin/sh
+            - -c
+          args:
+            - python3 -m dynamo.vllm --model meta-llama/Llama-3.1-70B-Instruct -tp 4 --is-prefill-worker
--- a/examples/basics/kubernetes/Distributed_Inference/images/pareto.png
+++ b/examples/basics/kubernetes/Distributed_Inference/images/pareto.png
--- a/examples/basics/kubernetes/Distributed_Inference/images/settings.png
+++ b/examples/basics/kubernetes/Distributed_Inference/images/settings.png
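To exercise the disaggregated deployment, the README's test request can be reused with the model name swapped for the one served by `disagg_router.yaml`. The sketch below is not part of the commit. The `build_payload` helper, the prompt text, and the `max_tokens` value are illustrative assumptions, and the `curl` call is left commented out so the snippet runs without a live endpoint.

```sh
#!/bin/sh
# Model name as passed to `dynamo.vllm --model` in disagg_router.yaml.
MODEL="meta-llama/Llama-3.1-70B-Instruct"

# Hypothetical helper: builds an OpenAI-style chat completion body.
# It does not escape quotes inside the prompt, so keep the prompt simple.
build_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}' "$1" "$2" "$3"
}

PAYLOAD=$(build_payload "$MODEL" "Hello" 32)
echo "$PAYLOAD"

# Once the frontend port is forwarded to localhost:8000 (as in the README):
# curl localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```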