feat: add DGD example for global router + vllm (#5760)

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Commit d1697dc3 (signature unverified), authored Jan 28, 2026 by Hongkuan Zhou, committed by GitHub Jan 28, 2026
Showing 2 changed files with 386 additions and 9 deletions

examples/hierarchical_planner/README.md +74 -9

examples/hierarchical_planner/vllm-2p1d.yaml +312 -0

--- a/examples/hierarchical_planner/README.md
+++ b/examples/hierarchical_planner/README.md
@@ -8,7 +8,7 @@ SPDX-License-Identifier: Apache-2.0
 This example demonstrates a hierarchical routing setup with:
 - A **Global Router** that routes to different pools based on request characteristics
 - **Local Routers** in each pool namespace
-- **Mocker Workers** simulating prefill and decode backends
+- **Workers** (Mocker for local testing, vLLM for Kubernetes deployment)
 ## Architecture
@@ -23,28 +23,30 @@ This example demonstrates a hierarchical routing setup with:
        |                |                |
        v                v                v
   Prefill Pool 0   Prefill Pool 1   Decode Pool 0
-   (prefill_pool_0) (prefill_pool_1) (decode_pool_0)
+   (prefill-pool-0) (prefill-pool-1) (decode-pool-0)
        |                |                |
        v                v                v
   Local Router     Local Router     Local Router
        |                |                |
        v                v                v
-   Mocker Worker    Mocker Worker    Mocker Worker
+      Worker           Worker           Worker
   (prefill)        (prefill)        (decode)
 ```
 ## Configuration
 The `global_router_config.json` defines:
-- 2 prefill pools (`prefill_pool_0`, `prefill_pool_1`)
+- 2 prefill pools (`prefill-pool-0`, `prefill-pool-1`)
-- 1 decode pool (`decode_pool_0`)
+- 1 decode pool (`decode-pool-0`)
 - Grid-based pool selection strategy
 Pool selection is based on a 2x2 grid:
 - **Prefill**: (ISL, TTFT_target) maps to prefill pool index
 - **Decode**: (context_length, ITL_target) maps to decode pool index
-## Running the Example
+## Running Locally (with Mocker)
+For local testing without GPUs, use the mocker-based script:
 ```bash
 cd examples/hierarchical_planner
@@ -53,6 +55,63 @@ cd examples/hierarchical_planner
 This starts all components in the background and provides instructions for testing.
+## Kubernetes Deployment (with vLLM)
+The `vllm-2p1d.yaml` file provides a multi-DGD deployment with real vLLM workers (1 GPU each).
+### Prerequisites
+- Kubernetes cluster with GPU nodes
+- `hf-token-secret` secret containing your HuggingFace token
+- The Dynamo operator installed
+### Deployment
+The YAML uses environment variable placeholders:
+- `${K8S_NAMESPACE}` - Your Kubernetes namespace
+- `${VLLM_IMAGE}` - Dynamo vLLM runtime container image
+Use `envsubst` to substitute these before applying:
+```bash
+# Set your Kubernetes namespace and image
+export K8S_NAMESPACE=<your-k8s-namespace>
+export VLLM_IMAGE=<dynamo-vllm-image>
+# Deploy all DGDs
+envsubst < vllm-2p1d.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
+```
+### Verify Deployment
+```bash
+# Check DGD status
+kubectl get dgd -n ${K8S_NAMESPACE}
+# Check pods
+kubectl get pods -n ${K8S_NAMESPACE}
+# Check logs for a specific component
+kubectl logs -n ${K8S_NAMESPACE} -l nvidia.com/dynamo-component=Frontend
+```
+### Cleanup
+```bash
+export K8S_NAMESPACE=<your-k8s-namespace>
+export VLLM_IMAGE=<dynamo-vllm-image>
+envsubst < vllm-2p1d.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
+```
+### Namespace Convention
+The Dynamo operator prepends the Kubernetes namespace to the `dynamoNamespace` field:
+- K8s namespace: `my-namespace`
+- `dynamoNamespace: prefill-pool-0`
+- Actual Dynamo namespace: `my-namespace-prefill-pool-0`
+This is why the global router config and local router endpoints must use the full namespace path.
 ## Testing
 Once all components are running, send a request to the frontend:
@@ -68,6 +127,12 @@ curl -X POST http://localhost:8000/v1/chat/completions \
  }'
 ```
+For Kubernetes, port-forward the frontend service first:
+```bash
+kubectl port-forward -n ${K8S_NAMESPACE} svc/hierarchical-frontend-frontend 8000:8000
+```
 ## Request Flow
 1. Request arrives at **Frontend**
@@ -75,7 +140,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \
 3. Frontend sends prefill request to **Global Router** (registered as prefill)
 4. Global Router selects prefill pool based on (ISL, TTFT_target) grid
 5. Request forwarded to **Local Router** in selected prefill pool namespace
-6. Local Router forwards to **Mocker Worker** (prefill mode)
+6. Local Router forwards to **Worker** (prefill mode)
 7. Prefill response returns with `disaggregated_params`
 8. Frontend sends decode request to **Global Router** (registered as decode)
 9. Global Router selects decode pool based on (context_length, ITL_target) grid
@@ -83,7 +148,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \
 ## Customizing Pool Selection
-Edit `global_router_config.json` to change:
+Edit `global_router_config.json` (or the ConfigMap in `vllm-2p1d.yaml`) to change:
 - **Number of pools**: Adjust `num_prefill_pools`, `num_decode_pools` and corresponding namespace lists
 - **Selection grid**: Modify `isl_resolution`, `ttft_resolution` etc. to change grid granularity
@@ -92,4 +157,4 @@ Edit `global_router_config.json` to change:
 Example: To always route to pool 0 regardless of request characteristics:
 ```json
 "prefill_pool_mapping": [[0, 0], [0, 0]]
 ```
\ No newline at end of file
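
Reviewer note (illustrative, not part of the commit): the README describes pool selection as a 2x2 grid lookup but does not spell out the binning rule. The Python sketch below shows one plausible reading. It assumes equal-width bins between each `*_min` and `*_max`, clamping of out-of-range values, and a mapping indexed as `[isl_bin][ttft_bin]`. The helper names `bin_index` and `select_prefill_pool` are hypothetical, and the actual `dynamo.global_router` implementation may bin or index differently. The strategy values are copied from the ConfigMap in `vllm-2p1d.yaml`.

```python
# Values copied from the ConfigMap in vllm-2p1d.yaml.
PREFILL_STRATEGY = {
    "ttft_min": 10,
    "ttft_max": 1000,
    "ttft_resolution": 2,
    "isl_min": 0,
    "isl_max": 32000,
    "isl_resolution": 2,
    "prefill_pool_mapping": [[0, 1], [0, 1]],
}


def bin_index(value, lo, hi, resolution):
    """Place value into one of `resolution` equal-width bins over [lo, hi].

    ASSUMPTION: uniform bins with clamping; the real router may differ.
    """
    if hi <= lo or resolution < 1:
        raise ValueError("need hi > lo and resolution >= 1")
    idx = int((value - lo) / (hi - lo) * resolution)
    # Clamp so values below lo land in bin 0 and values at or above hi
    # land in the last bin.
    return max(0, min(resolution - 1, idx))


def select_prefill_pool(isl, ttft_target, strategy):
    """Look up the prefill pool index for an (ISL, TTFT target) pair."""
    isl_bin = bin_index(
        isl, strategy["isl_min"], strategy["isl_max"], strategy["isl_resolution"]
    )
    ttft_bin = bin_index(
        ttft_target,
        strategy["ttft_min"],
        strategy["ttft_max"],
        strategy["ttft_resolution"],
    )
    # ASSUMPTION: rows are ISL bins and columns are TTFT bins.
    return strategy["prefill_pool_mapping"][isl_bin][ttft_bin]


if __name__ == "__main__":
    for isl, ttft in [(1000, 100), (1000, 900), (40000, 900)]:
        print(isl, ttft, select_prefill_pool(isl, ttft, PREFILL_STRATEGY))
```

Under these assumptions, the mapping `[[0,1],[0,1]]` makes the chosen prefill pool depend only on the second index, and replacing it with `[[0, 0], [0, 0]]` pins every request to pool 0, which matches the README's own example.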
--- a/examples/hierarchical_planner/vllm-2p1d.yaml
+++ b/examples/hierarchical_planner/vllm-2p1d.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Multi-DGD deployment for hierarchical planner example with vLLM workers
+# Architecture:
+#   DGD 1 (hierarchical): Frontend + GlobalRouter
+#   DGD 2 (prefill-pool-0): Local Router + vLLM Prefill Worker (1 GPU)
+#   DGD 3 (prefill-pool-1): Local Router + vLLM Prefill Worker (1 GPU)
+#   DGD 4 (decode-pool-0): Local Router + vLLM Decode Worker (1 GPU)
+#
+# IMPORTANT: This file uses ${K8S_NAMESPACE} as a placeholder for the Kubernetes namespace.
+# The K8s operator prepends the K8s namespace to the Dynamo namespace.
+# For example, if K8S_NAMESPACE="my-namespace" and dynamoNamespace is "prefill-pool-0",
+# the actual Dynamo namespace becomes "my-namespace-prefill-pool-0".
+#
+# vLLM workers register at:
+#   - Prefill: <namespace>.prefill.generate
+#   - Decode:  <namespace>.backend.generate
+#
+# USAGE: See README.md for deployment instructions using envsubst.
+# =============================================================================
+# ConfigMap for global router configuration
+# =============================================================================
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: hierarchical-global-router-config
+data:
+  global_router_config.json: |
+    {
+        "num_prefill_pools": 2,
+        "num_decode_pools": 1,
+        "prefill_pool_dynamo_namespaces": ["${K8S_NAMESPACE}-prefill-pool-0", "${K8S_NAMESPACE}-prefill-pool-1"],
+        "decode_pool_dynamo_namespaces": ["${K8S_NAMESPACE}-decode-pool-0"],
+        "prefill_pool_selection_strategy": {
+            "ttft_min": 10,
+            "ttft_max": 1000,
+            "ttft_resolution": 2,
+            "isl_min": 0,
+            "isl_max": 32000,
+            "isl_resolution": 2,
+            "prefill_pool_mapping": [[0,1],[0,1]]
+        },
+        "decode_pool_selection_strategy": {
+            "itl_min": 10,
+            "itl_max": 100,
+            "itl_resolution": 2,
+            "context_length_min": 0,
+            "context_length_max": 32000,
+            "context_length_resolution": 2,
+            "decode_pool_mapping": [[0,0],[0,0]]
+        }
+    }
+---
+# =============================================================================
+# DGD 1: Frontend + Global Router (namespace: hierarchical)
+# =============================================================================
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: hierarchical-frontend
+spec:
+  envs:
+  - name: HF_TOKEN
+    valueFrom:
+      secretKeyRef:
+        key: HF_TOKEN
+        name: hf-token-secret
+  services:
+    Frontend:
+      componentType: frontend
+      dynamoNamespace: hierarchical
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --router-mode
+          - round-robin
+          - --namespace
+          - ${K8S_NAMESPACE}-hierarchical
+          command:
+          - python
+          - -m
+          - dynamo.frontend
+          image: ${VLLM_IMAGE}
+          workingDir: /workspace
+      replicas: 1
+    GlobalRouter:
+      componentType: default
+      dynamoNamespace: hierarchical
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --config
+          - /workspace/config/global_router_config.json
+          - --model-name
+          - Qwen/Qwen3-0.6B
+          - --default-ttft-target
+          - "100"
+          - --default-itl-target
+          - "10"
+          - --namespace
+          - ${K8S_NAMESPACE}-hierarchical
+          command:
+          - python
+          - -m
+          - dynamo.global_router
+          image: ${VLLM_IMAGE}
+          workingDir: /workspace
+          volumeMounts:
+          - mountPath: /workspace/config
+            name: global-router-config
+            readOnly: true
+        volumes:
+        - configMap:
+            name: hierarchical-global-router-config
+          name: global-router-config
+      replicas: 1
+---
+# =============================================================================
+# DGD 2: Prefill Pool 0 - Local Router + vLLM Worker (namespace: prefill-pool-0)
+# Actual Dynamo namespace: ${K8S_NAMESPACE}-prefill-pool-0
+# vLLM prefill worker registers at: ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate
+# =============================================================================
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: prefill-pool-0
+spec:
+  envs:
+  - name: HF_TOKEN
+    valueFrom:
+      secretKeyRef:
+        key: HF_TOKEN
+        name: hf-token-secret
+  services:
+    LocalRouter:
+      componentType: default
+      dynamoNamespace: prefill-pool-0
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --endpoint
+          - ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate
+          - --block-size
+          - "16"
+          - --no-track-active-blocks
+          command:
+          - python
+          - -m
+          - dynamo.router
+          image: ${VLLM_IMAGE}
+          workingDir: /workspace
+      replicas: 1
+    VllmPrefillWorker:
+      componentType: worker
+      subComponentType: prefill
+      dynamoNamespace: prefill-pool-0
+      envFromSecret: hf-token-secret
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --model
+          - Qwen/Qwen3-0.6B
+          - --is-prefill-worker
+          - --tensor-parallel-size
+          - "1"
+          - --gpu-memory-utilization
+          - "0.90"
+          - --block-size
+          - "16"
+          command:
+          - python3
+          - -m
+          - dynamo.vllm
+          image: ${VLLM_IMAGE}
+          workingDir: /workspace
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+        requests:
+          gpu: "1"
+---
+# =============================================================================
+# DGD 3: Prefill Pool 1 - Local Router + vLLM Worker (namespace: prefill-pool-1)
+# Actual Dynamo namespace: ${K8S_NAMESPACE}-prefill-pool-1
+# vLLM prefill worker registers at: ${K8S_NAMESPACE}-prefill-pool-1.prefill.generate
+# =============================================================================
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: prefill-pool-1
+spec:
+  envs:
+  - name: HF_TOKEN
+    valueFrom:
+      secretKeyRef:
+        key: HF_TOKEN
+        name: hf-token-secret
+  services:
+    LocalRouter:
+      componentType: default
+      dynamoNamespace: prefill-pool-1
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --endpoint
+          - ${K8S_NAMESPACE}-prefill-pool-1.prefill.generate
+          - --block-size
+          - "16"
+          - --no-track-active-blocks
+          command:
+          - python
+          - -m
+          - dynamo.router
+          image: ${VLLM_IMAGE}
+          workingDir: /workspace
+      replicas: 1
+    VllmPrefillWorker:
+      componentType: worker
+      subComponentType: prefill
+      dynamoNamespace: prefill-pool-1
+      envFromSecret: hf-token-secret
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --model
+          - Qwen/Qwen3-0.6B
+          - --is-prefill-worker
+          - --tensor-parallel-size
+          - "1"
+          - --gpu-memory-utilization
+          - "0.90"
+          - --block-size
+          - "16"
+          command:
+          - python3
+          - -m
+          - dynamo.vllm
+          image: ${VLLM_IMAGE}
+          workingDir: /workspace
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+        requests:
+          gpu: "1"
+---
+# =============================================================================
+# DGD 4: Decode Pool 0 - Local Router + vLLM Worker (namespace: decode-pool-0)
+# Actual Dynamo namespace: ${K8S_NAMESPACE}-decode-pool-0
+# vLLM decode worker registers at: ${K8S_NAMESPACE}-decode-pool-0.backend.generate
+# =============================================================================
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: decode-pool-0
+spec:
+  envs:
+  - name: HF_TOKEN
+    valueFrom:
+      secretKeyRef:
+        key: HF_TOKEN
+        name: hf-token-secret
+  services:
+    LocalRouter:
+      componentType: default
+      dynamoNamespace: decode-pool-0
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --endpoint
+          - ${K8S_NAMESPACE}-decode-pool-0.backend.generate
+          - --block-size
+          - "16"
+          - --kv-overlap-score-weight
+          - "0"
+          command:
+          - python
+          - -m
+          - dynamo.router
+          image: ${VLLM_IMAGE}
+          workingDir: /workspace
+      replicas: 1
+    VllmDecodeWorker:
+      componentType: worker
+      subComponentType: decode
+      dynamoNamespace: decode-pool-0
+      envFromSecret: hf-token-secret
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --model
+          - Qwen/Qwen3-0.6B
+          - --tensor-parallel-size
+          - "1"
+          - --gpu-memory-utilization
+          - "0.90"
+          - --block-size
+          - "16"
+          command:
+          - python3
+          - -m
+          - dynamo.vllm
+          image: ${VLLM_IMAGE}
+          workingDir: /workspace
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+        requests:
+          gpu: "1"
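
Reviewer note (illustrative, not part of the commit): the namespace convention above is easy to get wrong when pools are added, because the same prefixed name has to appear in the ConfigMap lists and in each Local Router `--endpoint` argument. The Python sketch below derives those strings from the rule stated in the README (`<k8s-namespace>-<dynamoNamespace>`) and the component names given in the YAML header comments (`prefill` for prefill workers, `backend` for decode workers). All function names are hypothetical helpers, and `check_pool_lists` covers only the two consistency rules shown here, not everything the global router may validate.

```python
# Component names taken from the header comments in vllm-2p1d.yaml.
WORKER_COMPONENT = {"prefill": "prefill", "decode": "backend"}


def full_dynamo_namespace(k8s_namespace, dynamo_namespace):
    """Apply the operator's prefixing rule described in the README."""
    return f"{k8s_namespace}-{dynamo_namespace}"


def worker_endpoint(k8s_namespace, dynamo_namespace, role):
    """Build the endpoint string a Local Router would point at."""
    namespace = full_dynamo_namespace(k8s_namespace, dynamo_namespace)
    return f"{namespace}.{WORKER_COMPONENT[role]}.generate"


def check_pool_lists(config):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for kind in ("prefill", "decode"):
        count = config[f"num_{kind}_pools"]
        namespaces = config[f"{kind}_pool_dynamo_namespaces"]
        if len(namespaces) != count:
            problems.append(
                f"{kind}: {count} pools declared, {len(namespaces)} namespaces listed"
            )
        mapping = config[f"{kind}_pool_selection_strategy"][f"{kind}_pool_mapping"]
        for row in mapping:
            for pool in row:
                if not 0 <= pool < count:
                    problems.append(
                        f"{kind}: mapping refers to pool {pool}, outside 0..{count - 1}"
                    )
    return problems


K8S_NAMESPACE = "my-namespace"  # example value from the README

# Only the fields needed by check_pool_lists are reproduced here.
config = {
    "num_prefill_pools": 2,
    "num_decode_pools": 1,
    "prefill_pool_dynamo_namespaces": [
        full_dynamo_namespace(K8S_NAMESPACE, f"prefill-pool-{i}") for i in range(2)
    ],
    "decode_pool_dynamo_namespaces": [
        full_dynamo_namespace(K8S_NAMESPACE, "decode-pool-0")
    ],
    "prefill_pool_selection_strategy": {"prefill_pool_mapping": [[0, 1], [0, 1]]},
    "decode_pool_selection_strategy": {"decode_pool_mapping": [[0, 0], [0, 0]]},
}

if __name__ == "__main__":
    print(worker_endpoint(K8S_NAMESPACE, "prefill-pool-0", "prefill"))
    print(worker_endpoint(K8S_NAMESPACE, "decode-pool-0", "decode"))
    print(check_pool_lists(config))
```

A check of this kind could be run against the rendered output of `envsubst` before `kubectl apply`, so that a pool count that disagrees with its namespace list is caught before the global router starts.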