OpenDAS / dynamo
Commit d1697dc3 (Unverified)
Authored Jan 28, 2026 by Hongkuan Zhou
Committed by GitHub, Jan 28, 2026

feat: add DGD example for global router + vllm (#5760)

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Parent: a379c1b1

Showing 2 changed files with 386 additions and 9 deletions:
examples/hierarchical_planner/README.md (+74 -9)
examples/hierarchical_planner/vllm-2p1d.yaml (+312 -0)
examples/hierarchical_planner/README.md @ d1697dc3
@@ -8,7 +8,7 @@ SPDX-License-Identifier: Apache-2.0
 This example demonstrates a hierarchical routing setup with:
 - A **Global Router** that routes to different pools based on request characteristics
 - **Local Routers** in each pool namespace
-- **Mocker Workers** simulating prefill and decode backends
+- **Workers** (Mocker for local testing, vLLM for Kubernetes deployment)
 ## Architecture
@@ -23,28 +23,30 @@ This example demonstrates a hierarchical routing setup with:
         |                  |                  |
         v                  v                  v
  Prefill Pool 0     Prefill Pool 1     Decode Pool 0
-(prefill_pool_0)   (prefill_pool_1)   (decode_pool_0)
+(prefill-pool-0)   (prefill-pool-1)   (decode-pool-0)
         |                  |                  |
         v                  v                  v
   Local Router       Local Router       Local Router
         |                  |                  |
         v                  v                  v
- Mocker Worker      Mocker Worker      Mocker Worker
+     Worker             Worker             Worker
    (prefill)          (prefill)           (decode)
 ```
 ## Configuration
 The `global_router_config.json` defines:
-- 2 prefill pools (`prefill_pool_0`, `prefill_pool_1`)
-- 1 decode pool (`decode_pool_0`)
+- 2 prefill pools (`prefill-pool-0`, `prefill-pool-1`)
+- 1 decode pool (`decode-pool-0`)
 - Grid-based pool selection strategy
 Pool selection is based on a 2x2 grid:
 - **Prefill**: (ISL, TTFT_target) maps to prefill pool index
 - **Decode**: (context_length, ITL_target) maps to decode pool index
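
The grid lookup described above could be sketched as follows. This is a minimal illustration, not the router's actual implementation: it assumes uniform bins between each `*_min` and `*_max`, clamps out-of-range values into the grid, and assumes the outer index of `prefill_pool_mapping` is the ISL bin and the inner index is the TTFT bin. The function names `grid_bin` and `select_prefill_pool` are hypothetical; the strategy values mirror the ConfigMap in `vllm-2p1d.yaml` from this commit.

```python
def grid_bin(value, lo, hi, resolution):
    """Return the bin index in [0, resolution - 1] for value.

    Assumption: bins are uniform between lo and hi, and values outside
    that range are clamped to the nearest edge bin.
    """
    if resolution < 1 or hi <= lo:
        raise ValueError("need resolution >= 1 and hi > lo")
    clamped = min(max(value, lo), hi)
    step = (hi - lo) / resolution
    # value == hi would land one past the last bin, so cap the index.
    return min(int((clamped - lo) / step), resolution - 1)


def select_prefill_pool(isl, ttft_target, strategy):
    """Pick a prefill pool index from the (ISL, TTFT_target) grid.

    Assumption: mapping[isl_bin][ttft_bin]; the real axis order may differ.
    """
    isl_idx = grid_bin(isl, strategy["isl_min"], strategy["isl_max"],
                       strategy["isl_resolution"])
    ttft_idx = grid_bin(ttft_target, strategy["ttft_min"], strategy["ttft_max"],
                        strategy["ttft_resolution"])
    return strategy["prefill_pool_mapping"][isl_idx][ttft_idx]


# Values copied from the ConfigMap in vllm-2p1d.yaml.
PREFILL_STRATEGY = {
    "ttft_min": 10, "ttft_max": 1000, "ttft_resolution": 2,
    "isl_min": 0, "isl_max": 32000, "isl_resolution": 2,
    "prefill_pool_mapping": [[0, 1], [0, 1]],
}

if __name__ == "__main__":
    print(select_prefill_pool(1000, 100, PREFILL_STRATEGY))
    print(select_prefill_pool(20000, 800, PREFILL_STRATEGY))
```

Under these assumptions, a tight TTFT target falls in the first TTFT bin and a relaxed one in the second, so the `[[0,1],[0,1]]` mapping splits traffic by TTFT target regardless of ISL.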
-## Running the Example
+## Running Locally (with Mocker)
+For local testing without GPUs, use the mocker-based script:
 ```bash
 cd examples/hierarchical_planner
@@ -53,6 +55,63 @@ cd examples/hierarchical_planner
 This starts all components in the background and provides instructions for testing.
+## Kubernetes Deployment (with vLLM)
+The `vllm-2p1d.yaml` file provides a multi-DGD deployment with real vLLM workers (1 GPU each).
+### Prerequisites
+- Kubernetes cluster with GPU nodes
+- `hf-token-secret` secret containing your HuggingFace token
+- The Dynamo operator installed
+### Deployment
+The YAML uses environment variable placeholders:
+- `${K8S_NAMESPACE}` - Your Kubernetes namespace
+- `${VLLM_IMAGE}` - Dynamo vLLM runtime container image
+Use `envsubst` to substitute these before applying:
+```bash
+# Set your Kubernetes namespace and image
+export K8S_NAMESPACE=<your-k8s-namespace>
+export VLLM_IMAGE=<dynamo-vllm-image>
+# Deploy all DGDs
+envsubst < vllm-2p1d.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
+```
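
`envsubst` replaces an unset variable with an empty string, which would silently produce namespaces such as `-prefill-pool-0`. As an optional alternative, the substitution step could be sketched in Python with the standard library's `string.Template`, which uses the same `${NAME}` syntax and raises on a missing value. The function name `render_manifest` and the image value in the demo are hypothetical and not part of this commit.

```python
from string import Template


def render_manifest(text, variables):
    """Substitute ${NAME} placeholders in text.

    Unlike envsubst, a placeholder without a value raises ValueError
    instead of being replaced by an empty string. Note that
    Template.substitute also raises ValueError on a stray '$' that is
    not part of a valid placeholder.
    """
    try:
        return Template(text).substitute(variables)
    except KeyError as missing:
        raise ValueError(f"placeholder {missing.args[0]} has no value") from None


if __name__ == "__main__":
    snippet = "image: ${VLLM_IMAGE}\nnamespace: ${K8S_NAMESPACE}-hierarchical\n"
    # Hypothetical values for illustration only.
    print(render_manifest(snippet, {
        "VLLM_IMAGE": "example.com/dynamo-vllm:tag",
        "K8S_NAMESPACE": "my-namespace",
    }))
```

The rendered text can then be piped to `kubectl apply -f -` in the same way as the `envsubst` output.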
+### Verify Deployment
+```bash
+# Check DGD status
+kubectl get dgd -n ${K8S_NAMESPACE}
+# Check pods
+kubectl get pods -n ${K8S_NAMESPACE}
+# Check logs for a specific component
+kubectl logs -n ${K8S_NAMESPACE} -l nvidia.com/dynamo-component=Frontend
+```
+### Cleanup
+```bash
+export K8S_NAMESPACE=<your-k8s-namespace>
+export VLLM_IMAGE=<dynamo-vllm-image>
+envsubst < vllm-2p1d.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
+```
+### Namespace Convention
+The Dynamo operator prepends the Kubernetes namespace to the `dynamoNamespace` field:
+- K8s namespace: `my-namespace`
+- `dynamoNamespace: prefill-pool-0`
+- Actual Dynamo namespace: `my-namespace-prefill-pool-0`
+This is why the global router config and local router endpoints must use the full namespace path.
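
The naming rule above can be written down as a small helper. This is only a sketch of the convention as the README and the comments in `vllm-2p1d.yaml` describe it (`<namespace>.prefill.generate` for prefill workers, `<namespace>.backend.generate` for decode workers); the function names are hypothetical and do not come from the Dynamo codebase.

```python
def full_dynamo_namespace(k8s_namespace, dynamo_namespace):
    """Join the K8s namespace and the dynamoNamespace field with a hyphen,
    following the convention described in the README."""
    return f"{k8s_namespace}-{dynamo_namespace}"


def endpoint_path(k8s_namespace, dynamo_namespace, component, endpoint="generate"):
    """Build '<full namespace>.<component>.<endpoint>', the form used by the
    --endpoint arguments of the local routers in vllm-2p1d.yaml."""
    namespace = full_dynamo_namespace(k8s_namespace, dynamo_namespace)
    return f"{namespace}.{component}.{endpoint}"


if __name__ == "__main__":
    print(full_dynamo_namespace("my-namespace", "prefill-pool-0"))
    print(endpoint_path("my-namespace", "prefill-pool-0", "prefill"))
    print(endpoint_path("my-namespace", "decode-pool-0", "backend"))
```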
 ## Testing
 Once all components are running, send a request to the frontend:
@@ -68,6 +127,12 @@ curl -X POST http://localhost:8000/v1/chat/completions \
 }'
 ```
+For Kubernetes, port-forward the frontend service first:
+```bash
+kubectl port-forward -n ${K8S_NAMESPACE} svc/hierarchical-frontend-frontend 8000:8000
+```
 ## Request Flow
 1. Request arrives at **Frontend**
@@ -75,7 +140,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \
 3. Frontend sends prefill request to **Global Router** (registered as prefill)
 4. Global Router selects prefill pool based on (ISL, TTFT_target) grid
 5. Request forwarded to **Local Router** in selected prefill pool namespace
-6. Local Router forwards to **Mocker Worker** (prefill mode)
+6. Local Router forwards to **Worker** (prefill mode)
 7. Prefill response returns with `disaggregated_params`
 8. Frontend sends decode request to **Global Router** (registered as decode)
 9. Global Router selects decode pool based on (context_length, ITL_target) grid
@@ -83,7 +148,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \
 ## Customizing Pool Selection
-Edit `global_router_config.json` to change:
+Edit `global_router_config.json` (or the ConfigMap in `vllm-2p1d.yaml`) to change:
 - **Number of pools**: Adjust `num_prefill_pools`, `num_decode_pools` and corresponding namespace lists
 - **Selection grid**: Modify `isl_resolution`, `ttft_resolution` etc. to change grid granularity
@@ -92,4 +157,4 @@ Edit `global_router_config.json` to change:
 Example: To always route to pool 0 regardless of request characteristics:
 ```json
 "prefill_pool_mapping": [[0, 0], [0, 0]]
 ```
\ No newline at end of file
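
When editing the mapping by hand, a quick consistency check may catch mistakes before deployment. The sketch below assumes that a mapping needs one row per bin of the first axis, one column per bin of the second axis, and that every entry is a valid pool index. The router's real validation rules are not shown in this commit, and `validate_mapping` is a hypothetical helper.

```python
def validate_mapping(mapping, rows, cols, num_pools):
    """Return a list of problems found in a pool mapping grid.

    Assumptions: the grid must be rows x cols (for prefill, for example,
    isl_resolution x ttft_resolution) and each cell must be an integer
    pool index in [0, num_pools - 1].
    """
    errors = []
    if len(mapping) != rows:
        errors.append(f"expected {rows} rows, got {len(mapping)}")
    for r, row in enumerate(mapping):
        if len(row) != cols:
            errors.append(f"row {r}: expected {cols} columns, got {len(row)}")
        for c, pool in enumerate(row):
            if not isinstance(pool, int) or not 0 <= pool < num_pools:
                errors.append(f"cell [{r}][{c}]: invalid pool index {pool!r}")
    return errors


if __name__ == "__main__":
    # The "always route to pool 0" example from the README, with 2 prefill pools.
    print(validate_mapping([[0, 0], [0, 0]], rows=2, cols=2, num_pools=2))
    # A mapping that points at a pool that does not exist.
    print(validate_mapping([[0, 2], [0, 0]], rows=2, cols=2, num_pools=2))
```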
examples/hierarchical_planner/vllm-2p1d.yaml (new file, mode 0 → 100644) @ d1697dc3
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Multi-DGD deployment for hierarchical planner example with vLLM workers
# Architecture:
# DGD 1 (hierarchical): Frontend + GlobalRouter
# DGD 2 (prefill-pool-0): Local Router + vLLM Prefill Worker (1 GPU)
# DGD 3 (prefill-pool-1): Local Router + vLLM Prefill Worker (1 GPU)
# DGD 4 (decode-pool-0): Local Router + vLLM Decode Worker (1 GPU)
#
# IMPORTANT: This file uses ${K8S_NAMESPACE} as a placeholder for the Kubernetes namespace.
# The K8s operator prepends the K8s namespace to the Dynamo namespace.
# For example, if K8S_NAMESPACE="my-namespace" and dynamoNamespace is "prefill-pool-0",
# the actual Dynamo namespace becomes "my-namespace-prefill-pool-0".
#
# vLLM workers register at:
# - Prefill: <namespace>.prefill.generate
# - Decode: <namespace>.backend.generate
#
# USAGE: See README.md for deployment instructions using envsubst.
# =============================================================================
# ConfigMap for global router configuration
# =============================================================================
apiVersion: v1
kind: ConfigMap
metadata:
  name: hierarchical-global-router-config
data:
  global_router_config.json: |
    {
      "num_prefill_pools": 2,
      "num_decode_pools": 1,
      "prefill_pool_dynamo_namespaces": ["${K8S_NAMESPACE}-prefill-pool-0", "${K8S_NAMESPACE}-prefill-pool-1"],
      "decode_pool_dynamo_namespaces": ["${K8S_NAMESPACE}-decode-pool-0"],
      "prefill_pool_selection_strategy": {
        "ttft_min": 10,
        "ttft_max": 1000,
        "ttft_resolution": 2,
        "isl_min": 0,
        "isl_max": 32000,
        "isl_resolution": 2,
        "prefill_pool_mapping": [[0,1],[0,1]]
      },
      "decode_pool_selection_strategy": {
        "itl_min": 10,
        "itl_max": 100,
        "itl_resolution": 2,
        "context_length_min": 0,
        "context_length_max": 32000,
        "context_length_resolution": 2,
        "decode_pool_mapping": [[0,0],[0,0]]
      }
    }
---
# =============================================================================
# DGD 1: Frontend + Global Router (namespace: hierarchical)
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: hierarchical-frontend
spec:
  envs:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: hf-token-secret
  services:
    Frontend:
      componentType: frontend
      dynamoNamespace: hierarchical
      extraPodSpec:
        mainContainer:
          args:
            - --router-mode
            - round-robin
            - --namespace
            - ${K8S_NAMESPACE}-hierarchical
          command:
            - python
            - -m
            - dynamo.frontend
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
    GlobalRouter:
      componentType: default
      dynamoNamespace: hierarchical
      extraPodSpec:
        mainContainer:
          args:
            - --config
            - /workspace/config/global_router_config.json
            - --model-name
            - Qwen/Qwen3-0.6B
            - --default-ttft-target
            - "100"
            - --default-itl-target
            - "10"
            - --namespace
            - ${K8S_NAMESPACE}-hierarchical
          command:
            - python
            - -m
            - dynamo.global_router
          image: ${VLLM_IMAGE}
          workingDir: /workspace
          volumeMounts:
            - mountPath: /workspace/config
              name: global-router-config
              readOnly: true
        volumes:
          - configMap:
              name: hierarchical-global-router-config
            name: global-router-config
      replicas: 1
---
# =============================================================================
# DGD 2: Prefill Pool 0 - Local Router + vLLM Worker (namespace: prefill-pool-0)
# Actual Dynamo namespace: ${K8S_NAMESPACE}-prefill-pool-0
# vLLM prefill worker registers at: ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: prefill-pool-0
spec:
  envs:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: hf-token-secret
  services:
    LocalRouter:
      componentType: default
      dynamoNamespace: prefill-pool-0
      extraPodSpec:
        mainContainer:
          args:
            - --endpoint
            - ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate
            - --block-size
            - "16"
            - --no-track-active-blocks
          command:
            - python
            - -m
            - dynamo.router
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
    VllmPrefillWorker:
      componentType: worker
      subComponentType: prefill
      dynamoNamespace: prefill-pool-0
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --is-prefill-worker
            - --tensor-parallel-size
            - "1"
            - --gpu-memory-utilization
            - "0.90"
            - --block-size
            - "16"
          command:
            - python3
            - -m
            - dynamo.vllm
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
      resources:
        limits:
          gpu: "1"
        requests:
          gpu: "1"
---
# =============================================================================
# DGD 3: Prefill Pool 1 - Local Router + vLLM Worker (namespace: prefill-pool-1)
# Actual Dynamo namespace: ${K8S_NAMESPACE}-prefill-pool-1
# vLLM prefill worker registers at: ${K8S_NAMESPACE}-prefill-pool-1.prefill.generate
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: prefill-pool-1
spec:
  envs:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: hf-token-secret
  services:
    LocalRouter:
      componentType: default
      dynamoNamespace: prefill-pool-1
      extraPodSpec:
        mainContainer:
          args:
            - --endpoint
            - ${K8S_NAMESPACE}-prefill-pool-1.prefill.generate
            - --block-size
            - "16"
            - --no-track-active-blocks
          command:
            - python
            - -m
            - dynamo.router
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
    VllmPrefillWorker:
      componentType: worker
      subComponentType: prefill
      dynamoNamespace: prefill-pool-1
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --is-prefill-worker
            - --tensor-parallel-size
            - "1"
            - --gpu-memory-utilization
            - "0.90"
            - --block-size
            - "16"
          command:
            - python3
            - -m
            - dynamo.vllm
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
      resources:
        limits:
          gpu: "1"
        requests:
          gpu: "1"
---
# =============================================================================
# DGD 4: Decode Pool 0 - Local Router + vLLM Worker (namespace: decode-pool-0)
# Actual Dynamo namespace: ${K8S_NAMESPACE}-decode-pool-0
# vLLM decode worker registers at: ${K8S_NAMESPACE}-decode-pool-0.backend.generate
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: decode-pool-0
spec:
  envs:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: hf-token-secret
  services:
    LocalRouter:
      componentType: default
      dynamoNamespace: decode-pool-0
      extraPodSpec:
        mainContainer:
          args:
            - --endpoint
            - ${K8S_NAMESPACE}-decode-pool-0.backend.generate
            - --block-size
            - "16"
            - --kv-overlap-score-weight
            - "0"
          command:
            - python
            - -m
            - dynamo.router
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
    VllmDecodeWorker:
      componentType: worker
      subComponentType: decode
      dynamoNamespace: decode-pool-0
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --tensor-parallel-size
            - "1"
            - --gpu-memory-utilization
            - "0.90"
            - --block-size
            - "16"
          command:
            - python3
            - -m
            - dynamo.vllm
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
      resources:
        limits:
          gpu: "1"
        requests:
          gpu: "1"