Commit d1697dc3 authored by Hongkuan Zhou, committed by GitHub

feat: add DGD example for global router + vllm (#5760)


Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
parent a379c1b1
@@ -8,7 +8,7 @@ SPDX-License-Identifier: Apache-2.0
This example demonstrates a hierarchical routing setup with:
- A **Global Router** that routes to different pools based on request characteristics
- **Local Routers** in each pool namespace
- **Mocker Workers** simulating prefill and decode backends
- **Workers** (Mocker for local testing, vLLM for Kubernetes deployment)
## Architecture
@@ -23,28 +23,30 @@ This example demonstrates a hierarchical routing setup with:
        |                   |                   |
        v                   v                   v
  Prefill Pool 0      Prefill Pool 1      Decode Pool 0
 (prefill_pool_0)    (prefill_pool_1)    (decode_pool_0)
 (prefill-pool-0)    (prefill-pool-1)    (decode-pool-0)
        |                   |                   |
        v                   v                   v
   Local Router        Local Router        Local Router
        |                   |                   |
        v                   v                   v
   Mocker Worker       Mocker Worker       Mocker Worker
      Worker              Worker              Worker
     (prefill)           (prefill)           (decode)
```
## Configuration
The `global_router_config.json` defines:
- 2 prefill pools (`prefill_pool_0`, `prefill_pool_1`)
- 1 decode pool (`decode_pool_0`)
- 2 prefill pools (`prefill-pool-0`, `prefill-pool-1`)
- 1 decode pool (`decode-pool-0`)
- Grid-based pool selection strategy
Pool selection is based on a 2x2 grid:
- **Prefill**: (ISL, TTFT_target) maps to prefill pool index
- **Decode**: (context_length, ITL_target) maps to decode pool index
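The grid lookup can be sketched as follows. This is a minimal illustration, not the router's actual implementation: it assumes each axis is split into equal-width bins between its min and max, that out-of-range values are clamped to the edge bins, and that the mapping is indexed as `mapping[isl_bin][ttft_bin]`. The function names `bin_index` and `select_prefill_pool` are hypothetical; the numeric bounds are the ones used in this example's Kubernetes config.

```python
def bin_index(value, lo, hi, resolution):
    """Map value to an equal-width bin in [0, resolution - 1], clamping out-of-range input."""
    if hi <= lo or resolution < 1:
        raise ValueError("need hi > lo and resolution >= 1")
    width = (hi - lo) / resolution
    idx = int((value - lo) // width)
    return max(0, min(resolution - 1, idx))


def select_prefill_pool(isl, ttft_target, strategy):
    """Look up the prefill pool index (assumed layout: row = ISL bin, column = TTFT bin)."""
    i = bin_index(isl, strategy["isl_min"], strategy["isl_max"], strategy["isl_resolution"])
    j = bin_index(
        ttft_target, strategy["ttft_min"], strategy["ttft_max"], strategy["ttft_resolution"]
    )
    return strategy["prefill_pool_mapping"][i][j]


# Values taken from this example's prefill selection strategy.
strategy = {
    "ttft_min": 10, "ttft_max": 1000, "ttft_resolution": 2,
    "isl_min": 0, "isl_max": 32000, "isl_resolution": 2,
    "prefill_pool_mapping": [[0, 1], [0, 1]],
}
```

Under these assumptions, `select_prefill_pool(1000, 100, strategy)` returns pool 0 and `select_prefill_pool(1000, 800, strategy)` returns pool 1. The decode lookup would follow the same pattern with (context_length, ITL_target).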
## Running the Example
## Running Locally (with Mocker)
For local testing without GPUs, use the mocker-based script:
```bash
cd examples/hierarchical_planner
@@ -53,6 +55,63 @@ cd examples/hierarchical_planner
This starts all components in the background and provides instructions for testing.
## Kubernetes Deployment (with vLLM)
The `vllm-2p1d.yaml` file provides a multi-DGD deployment with real vLLM workers (1 GPU each).
### Prerequisites
- Kubernetes cluster with GPU nodes
- A `hf-token-secret` Secret containing your Hugging Face token under the `HF_TOKEN` key
- The Dynamo operator installed
### Deployment
The YAML uses environment variable placeholders:
- `${K8S_NAMESPACE}` - Your Kubernetes namespace
- `${VLLM_IMAGE}` - Dynamo vLLM runtime container image
Use `envsubst` to substitute these before applying:
```bash
# Set your Kubernetes namespace and image
export K8S_NAMESPACE=<your-k8s-namespace>
export VLLM_IMAGE=<dynamo-vllm-image>
# Deploy all DGDs
envsubst < vllm-2p1d.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
```
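If `envsubst` is not available, the same placeholder substitution can be approximated with Python's standard library. The sketch below is an illustration only: `render_manifest` is a hypothetical helper, and the image name is a made-up placeholder value.

```python
import os
from string import Template


def render_manifest(text, env=None):
    """Replace ${VAR} placeholders from env, leaving unknown placeholders untouched."""
    env = os.environ if env is None else env
    return Template(text).safe_substitute(env)


# Two lines in the style of vllm-2p1d.yaml, rendered with example values.
sample = "image: ${VLLM_IMAGE}\nnamespace: ${K8S_NAMESPACE}-prefill-pool-0\n"
rendered = render_manifest(
    sample,
    {"K8S_NAMESPACE": "my-namespace", "VLLM_IMAGE": "example.invalid/vllm:tag"},
)
print(rendered)
```

One behavioral difference to keep in mind: `envsubst` replaces unset variables with an empty string, while `safe_substitute` leaves the placeholder in place, which makes a missing variable easier to spot before applying the manifest.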
### Verify Deployment
```bash
# Check DGD status
kubectl get dgd -n ${K8S_NAMESPACE}
# Check pods
kubectl get pods -n ${K8S_NAMESPACE}
# Check logs for a specific component
kubectl logs -n ${K8S_NAMESPACE} -l nvidia.com/dynamo-component=Frontend
```
### Cleanup
```bash
export K8S_NAMESPACE=<your-k8s-namespace>
export VLLM_IMAGE=<dynamo-vllm-image>
envsubst < vllm-2p1d.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
```
### Namespace Convention
The Dynamo operator prepends the Kubernetes namespace to the `dynamoNamespace` field:
- K8s namespace: `my-namespace`
- `dynamoNamespace: prefill-pool-0`
- Actual Dynamo namespace: `my-namespace-prefill-pool-0`
This is why the global router config and local router endpoints must use the full namespace path.
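The naming rule can be expressed as a small helper. This is a sketch for illustration: the function names are hypothetical, and the component names (`prefill` for prefill workers, `backend` for decode workers) follow the registration endpoints noted in the comments of `vllm-2p1d.yaml`.

```python
def dynamo_namespace(k8s_namespace, dgd_namespace):
    """Full Dynamo namespace: the K8s namespace prepended to the dynamoNamespace field."""
    return f"{k8s_namespace}-{dgd_namespace}"


def worker_endpoint(k8s_namespace, dgd_namespace, role):
    """Endpoint a local router should target for a pool's vLLM worker."""
    # Component names as listed in the manifest comments for vLLM workers.
    component = {"prefill": "prefill", "decode": "backend"}[role]
    return f"{dynamo_namespace(k8s_namespace, dgd_namespace)}.{component}.generate"


print(worker_endpoint("my-namespace", "prefill-pool-0", "prefill"))
print(worker_endpoint("my-namespace", "decode-pool-0", "decode"))
```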
## Testing
Once all components are running, send a request to the frontend:
@@ -68,6 +127,12 @@ curl -X POST http://localhost:8000/v1/chat/completions \
}'
```
For Kubernetes, port-forward the frontend service first:
```bash
kubectl port-forward -n ${K8S_NAMESPACE} svc/hierarchical-frontend-frontend 8000:8000
```
## Request Flow
1. Request arrives at **Frontend**
@@ -75,7 +140,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \
3. Frontend sends prefill request to **Global Router** (registered as prefill)
4. Global Router selects prefill pool based on (ISL, TTFT_target) grid
5. Request forwarded to **Local Router** in selected prefill pool namespace
6. Local Router forwards to **Mocker Worker** (prefill mode)
6. Local Router forwards to **Worker** (prefill mode)
7. Prefill response returns with `disaggregated_params`
8. Frontend sends decode request to **Global Router** (registered as decode)
9. Global Router selects decode pool based on (context_length, ITL_target) grid
@@ -83,7 +148,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \
## Customizing Pool Selection
Edit `global_router_config.json` to change:
Edit `global_router_config.json` (or the ConfigMap in `vllm-2p1d.yaml`) to change:
- **Number of pools**: Adjust `num_prefill_pools`, `num_decode_pools` and corresponding namespace lists
- **Selection grid**: Modify `isl_resolution`, `ttft_resolution` etc. to change grid granularity
@@ -92,4 +157,4 @@ Edit `global_router_config.json` to change:
Example: To always route to pool 0 regardless of request characteristics:
```json
"prefill_pool_mapping": [[0, 0], [0, 0]]
```
\ No newline at end of file
```
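When editing the config by hand, a quick consistency check can catch mismatched pool counts or mapping shapes before deployment. The sketch below is a hypothetical helper, not part of Dynamo: the checks reflect assumptions about what a consistent config looks like (one namespace per pool, a mapping whose shape matches the two resolutions, and pool indices within range), and the router itself may enforce different or additional rules. The row/column order of the mapping is also an assumption.

```python
def validate_pool_config(cfg):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    # Assumed layout: rows follow the first key, columns follow the second.
    axes = {
        "prefill": ("isl_resolution", "ttft_resolution"),
        "decode": ("context_length_resolution", "itl_resolution"),
    }
    for kind, (rows_key, cols_key) in axes.items():
        num_pools = cfg[f"num_{kind}_pools"]
        namespaces = cfg[f"{kind}_pool_dynamo_namespaces"]
        strategy = cfg[f"{kind}_pool_selection_strategy"]
        mapping = strategy[f"{kind}_pool_mapping"]
        if len(namespaces) != num_pools:
            problems.append(f"{kind}: {len(namespaces)} namespaces for {num_pools} pools")
        rows, cols = strategy[rows_key], strategy[cols_key]
        if len(mapping) != rows or any(len(row) != cols for row in mapping):
            problems.append(f"{kind}: mapping is not {rows}x{cols}")
        if any(not 0 <= pool < num_pools for row in mapping for pool in row):
            problems.append(f"{kind}: mapping references a pool outside 0..{num_pools - 1}")
    return problems


# Reduced copy of this example's config, keeping only the fields the checks read.
config = {
    "num_prefill_pools": 2,
    "num_decode_pools": 1,
    "prefill_pool_dynamo_namespaces": ["my-namespace-prefill-pool-0", "my-namespace-prefill-pool-1"],
    "decode_pool_dynamo_namespaces": ["my-namespace-decode-pool-0"],
    "prefill_pool_selection_strategy": {
        "ttft_resolution": 2, "isl_resolution": 2,
        "prefill_pool_mapping": [[0, 1], [0, 1]],
    },
    "decode_pool_selection_strategy": {
        "itl_resolution": 2, "context_length_resolution": 2,
        "decode_pool_mapping": [[0, 0], [0, 0]],
    },
}
print(validate_pool_config(config))
```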
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Multi-DGD deployment for hierarchical planner example with vLLM workers
# Architecture:
# DGD 1 (hierarchical): Frontend + GlobalRouter
# DGD 2 (prefill-pool-0): Local Router + vLLM Prefill Worker (1 GPU)
# DGD 3 (prefill-pool-1): Local Router + vLLM Prefill Worker (1 GPU)
# DGD 4 (decode-pool-0): Local Router + vLLM Decode Worker (1 GPU)
#
# IMPORTANT: This file uses ${K8S_NAMESPACE} as a placeholder for the Kubernetes namespace.
# The K8s operator prepends the K8s namespace to the Dynamo namespace.
# For example, if K8S_NAMESPACE="my-namespace" and dynamoNamespace is "prefill-pool-0",
# the actual Dynamo namespace becomes "my-namespace-prefill-pool-0".
#
# vLLM workers register at:
# - Prefill: <namespace>.prefill.generate
# - Decode: <namespace>.backend.generate
#
# USAGE: See README.md for deployment instructions using envsubst.
# =============================================================================
# ConfigMap for global router configuration
# =============================================================================
apiVersion: v1
kind: ConfigMap
metadata:
  name: hierarchical-global-router-config
data:
  global_router_config.json: |
    {
      "num_prefill_pools": 2,
      "num_decode_pools": 1,
      "prefill_pool_dynamo_namespaces": ["${K8S_NAMESPACE}-prefill-pool-0", "${K8S_NAMESPACE}-prefill-pool-1"],
      "decode_pool_dynamo_namespaces": ["${K8S_NAMESPACE}-decode-pool-0"],
      "prefill_pool_selection_strategy": {
        "ttft_min": 10,
        "ttft_max": 1000,
        "ttft_resolution": 2,
        "isl_min": 0,
        "isl_max": 32000,
        "isl_resolution": 2,
        "prefill_pool_mapping": [[0,1],[0,1]]
      },
      "decode_pool_selection_strategy": {
        "itl_min": 10,
        "itl_max": 100,
        "itl_resolution": 2,
        "context_length_min": 0,
        "context_length_max": 32000,
        "context_length_resolution": 2,
        "decode_pool_mapping": [[0,0],[0,0]]
      }
    }
---
# =============================================================================
# DGD 1: Frontend + Global Router (namespace: hierarchical)
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: hierarchical-frontend
spec:
  envs:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: hf-token-secret
  services:
    Frontend:
      componentType: frontend
      dynamoNamespace: hierarchical
      extraPodSpec:
        mainContainer:
          args:
            - --router-mode
            - round-robin
            - --namespace
            - ${K8S_NAMESPACE}-hierarchical
          command:
            - python
            - -m
            - dynamo.frontend
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
    GlobalRouter:
      componentType: default
      dynamoNamespace: hierarchical
      extraPodSpec:
        mainContainer:
          args:
            - --config
            - /workspace/config/global_router_config.json
            - --model-name
            - Qwen/Qwen3-0.6B
            - --default-ttft-target
            - "100"
            - --default-itl-target
            - "10"
            - --namespace
            - ${K8S_NAMESPACE}-hierarchical
          command:
            - python
            - -m
            - dynamo.global_router
          image: ${VLLM_IMAGE}
          workingDir: /workspace
          volumeMounts:
            - mountPath: /workspace/config
              name: global-router-config
              readOnly: true
        volumes:
          - configMap:
              name: hierarchical-global-router-config
            name: global-router-config
      replicas: 1
---
# =============================================================================
# DGD 2: Prefill Pool 0 - Local Router + vLLM Worker (namespace: prefill-pool-0)
# Actual Dynamo namespace: ${K8S_NAMESPACE}-prefill-pool-0
# vLLM prefill worker registers at: ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: prefill-pool-0
spec:
  envs:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: hf-token-secret
  services:
    LocalRouter:
      componentType: default
      dynamoNamespace: prefill-pool-0
      extraPodSpec:
        mainContainer:
          args:
            - --endpoint
            - ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate
            - --block-size
            - "16"
            - --no-track-active-blocks
          command:
            - python
            - -m
            - dynamo.router
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
    VllmPrefillWorker:
      componentType: worker
      subComponentType: prefill
      dynamoNamespace: prefill-pool-0
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --is-prefill-worker
            - --tensor-parallel-size
            - "1"
            - --gpu-memory-utilization
            - "0.90"
            - --block-size
            - "16"
          command:
            - python3
            - -m
            - dynamo.vllm
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
      resources:
        limits:
          gpu: "1"
        requests:
          gpu: "1"
---
# =============================================================================
# DGD 3: Prefill Pool 1 - Local Router + vLLM Worker (namespace: prefill-pool-1)
# Actual Dynamo namespace: ${K8S_NAMESPACE}-prefill-pool-1
# vLLM prefill worker registers at: ${K8S_NAMESPACE}-prefill-pool-1.prefill.generate
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: prefill-pool-1
spec:
  envs:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: hf-token-secret
  services:
    LocalRouter:
      componentType: default
      dynamoNamespace: prefill-pool-1
      extraPodSpec:
        mainContainer:
          args:
            - --endpoint
            - ${K8S_NAMESPACE}-prefill-pool-1.prefill.generate
            - --block-size
            - "16"
            - --no-track-active-blocks
          command:
            - python
            - -m
            - dynamo.router
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
    VllmPrefillWorker:
      componentType: worker
      subComponentType: prefill
      dynamoNamespace: prefill-pool-1
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --is-prefill-worker
            - --tensor-parallel-size
            - "1"
            - --gpu-memory-utilization
            - "0.90"
            - --block-size
            - "16"
          command:
            - python3
            - -m
            - dynamo.vllm
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
      resources:
        limits:
          gpu: "1"
        requests:
          gpu: "1"
---
# =============================================================================
# DGD 4: Decode Pool 0 - Local Router + vLLM Worker (namespace: decode-pool-0)
# Actual Dynamo namespace: ${K8S_NAMESPACE}-decode-pool-0
# vLLM decode worker registers at: ${K8S_NAMESPACE}-decode-pool-0.backend.generate
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: decode-pool-0
spec:
  envs:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: hf-token-secret
  services:
    LocalRouter:
      componentType: default
      dynamoNamespace: decode-pool-0
      extraPodSpec:
        mainContainer:
          args:
            - --endpoint
            - ${K8S_NAMESPACE}-decode-pool-0.backend.generate
            - --block-size
            - "16"
            - --kv-overlap-score-weight
            - "0"
          command:
            - python
            - -m
            - dynamo.router
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
    VllmDecodeWorker:
      componentType: worker
      subComponentType: decode
      dynamoNamespace: decode-pool-0
      envFromSecret: hf-token-secret
      extraPodSpec:
        mainContainer:
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --tensor-parallel-size
            - "1"
            - --gpu-memory-utilization
            - "0.90"
            - --block-size
            - "16"
          command:
            - python3
            - -m
            - dynamo.vllm
          image: ${VLLM_IMAGE}
          workingDir: /workspace
      replicas: 1
      resources:
        limits:
          gpu: "1"
        requests:
          gpu: "1"