<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Hierarchical Planner Example

This example demonstrates a hierarchical routing setup with:
- A **Global Router** that routes to different pools based on request characteristics
- **Local Routers** in each pool namespace
- **Workers** (Mocker for local testing, vLLM for Kubernetes deployment)

## Architecture

```
                    Frontend (round-robin routing)
                         |
                         v
                    Global Router
                   (registers as both prefill + decode)
                         |
        +----------------+----------------+
        |                |                |
        v                v                v
   Prefill Pool 0   Prefill Pool 1   Decode Pool 0
   (prefill-pool-0) (prefill-pool-1) (decode-pool-0)
        |                |                |
        v                v                v
   Local Router     Local Router     Local Router
        |                |                |
        v                v                v
      Worker           Worker           Worker
   (prefill)        (prefill)        (decode)
```

## Configuration

The `global_router_config.json` defines:
- 2 prefill pools (`prefill-pool-0`, `prefill-pool-1`)
- 1 decode pool (`decode-pool-0`)
- Grid-based pool selection strategy

Pool selection is based on a 2x2 grid:
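- **Prefill**: (ISL, TTFT_target), i.e. input sequence length and time-to-first-token target, maps to a prefill pool index
- **Decode**: (context_length, ITL_target), i.e. context length and inter-token-latency target, maps to a decode pool index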
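
As a rough illustration of the grid lookup, the sketch below buckets each of the two request attributes into a cell and reads the pool index from a mapping matrix. The value ranges (`isl_range`, `ttft_range`), the linear bucketing, and the function names are assumptions made for illustration; the actual Global Router may compute cell boundaries differently.

```python
def grid_cell(value, lo, hi, cells):
    """Return the cell index in [0, cells - 1] for value, clamping out-of-range input."""
    if hi <= lo or cells < 1:
        raise ValueError("need lo < hi and at least one cell")
    idx = int((value - lo) / (hi - lo) * cells)
    return max(0, min(cells - 1, idx))


def select_pool(x, y, mapping, x_range, y_range):
    """Look up the pool index for the grid cell containing (x, y)."""
    row = grid_cell(x, x_range[0], x_range[1], len(mapping))
    col = grid_cell(y, y_range[0], y_range[1], len(mapping[0]))
    return mapping[row][col]


# Hypothetical 2x2 prefill grid: short prompts go to pool 0, long prompts to pool 1.
prefill_pool_mapping = [[0, 0], [1, 1]]
isl_range = (0, 8192)        # assumed ISL bounds in tokens
ttft_range = (0.0, 1000.0)   # assumed TTFT target bounds in milliseconds

short_pool = select_pool(512, 200.0, prefill_pool_mapping, isl_range, ttft_range)
long_pool = select_pool(6000, 200.0, prefill_pool_mapping, isl_range, ttft_range)
print(short_pool, long_pool)
```

The same lookup would apply to decode with (context_length, ITL_target) and a decode mapping matrix.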

## Running Locally (with Mocker)

For local testing without GPUs, use the mocker-based script:

```bash
cd examples/hierarchical_planner
./run_example.sh
```

This starts all components in the background and prints instructions for testing.

## Kubernetes Deployment (with vLLM)

The `vllm-2p1d.yaml` file provides a deployment made up of multiple DynamoGraphDeployments (DGDs) with real vLLM workers (1 GPU each).

### Prerequisites

- Kubernetes cluster with GPU nodes
- `hf-token-secret` secret containing your HuggingFace token
- The Dynamo operator installed

### Deployment

The YAML uses environment variable placeholders:
- `${K8S_NAMESPACE}` - Your Kubernetes namespace
- `${VLLM_IMAGE}` - Dynamo vLLM runtime container image

Use `envsubst` to substitute these before applying:

```bash
# Set your Kubernetes namespace and image
export K8S_NAMESPACE=<your-k8s-namespace>
export VLLM_IMAGE=<dynamo-vllm-image>

# Deploy all DGDs
envsubst < vllm-2p1d.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
```
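
Because `envsubst` replaces unset variables with an empty string rather than failing, it can help to check the placeholders before applying. The following is a minimal sketch, not part of the example; the helper name `missing_variables` and the sample template text are made up for illustration.

```python
import re

# Matches ${NAME} placeholders of the form used in the YAML.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_variables(template_text, env):
    """Return the sorted names of ${VAR} placeholders that are unset or empty in env."""
    names = set(PLACEHOLDER.findall(template_text))
    return sorted(name for name in names if not env.get(name))


# Illustrative template text, not the real contents of vllm-2p1d.yaml.
sample = "namespace: ${K8S_NAMESPACE}\nimage: ${VLLM_IMAGE}\n"
missing = missing_variables(sample, {"K8S_NAMESPACE": "my-namespace"})
print(missing)
```

In practice the template text would come from reading `vllm-2p1d.yaml` and `env` would be `os.environ`.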

### Verify Deployment

```bash
# Check DGD status
kubectl get dgd -n ${K8S_NAMESPACE}

# Check pods
kubectl get pods -n ${K8S_NAMESPACE}

# Check logs for a specific component
kubectl logs -n ${K8S_NAMESPACE} -l nvidia.com/dynamo-component=Frontend
```

### Cleanup

```bash
export K8S_NAMESPACE=<your-k8s-namespace>
export VLLM_IMAGE=<dynamo-vllm-image>
envsubst < vllm-2p1d.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
```

### Namespace Convention

The Dynamo operator prepends the Kubernetes namespace to the `dynamoNamespace` field:
- K8s namespace: `my-namespace`
- `dynamoNamespace: prefill-pool-0`
- Actual Dynamo namespace: `my-namespace-prefill-pool-0`

This is why the global router config and local router endpoints must use the full namespace path.
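
The naming rule above can be expressed as a small helper, useful when generating the namespace lists for the global router config. The function name is hypothetical; only the `<k8s-namespace>-<dynamoNamespace>` pattern comes from the convention described here.

```python
def full_dynamo_namespace(k8s_namespace, dynamo_namespace):
    """Join the Kubernetes namespace and the dynamoNamespace field with a hyphen."""
    return f"{k8s_namespace}-{dynamo_namespace}"


pools = ["prefill-pool-0", "prefill-pool-1", "decode-pool-0"]
full_names = [full_dynamo_namespace("my-namespace", pool) for pool in pools]
print(full_names)
```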

## Testing

Once all components are running, send a request to the frontend:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 50,
    "stream": true
  }'
```
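
With `"stream": true`, the response is expected to arrive as server-sent events. The sketch below extracts the generated text from such a stream, assuming the frontend emits OpenAI-style chunks (`data: {...}` lines ending with `data: [DONE]`). The sample lines are hand-written for illustration and are not captured output from this example.

```python
import json


def iter_stream_text(lines):
    """Yield the text fragments carried by OpenAI-style streaming chunks."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample stream (assumed format).
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
reply = "".join(iter_stream_text(sample_lines))
print(reply)
```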

For Kubernetes, port-forward the frontend service first:

```bash
kubectl port-forward -n ${K8S_NAMESPACE} svc/hierarchical-frontend-frontend 8000:8000
```

## Request Flow

1. Request arrives at **Frontend**
2. Frontend's `PrefillRouter` detects both prefill and decode registered for the model
3. Frontend sends prefill request to **Global Router** (registered as prefill)
4. Global Router selects prefill pool based on (ISL, TTFT_target) grid
5. Request forwarded to **Local Router** in selected prefill pool namespace
6. Local Router forwards to **Worker** (prefill mode)
7. Prefill response returns with `disaggregated_params`
8. Frontend sends decode request to **Global Router** (registered as decode)
9. Global Router selects decode pool based on (context_length, ITL_target) grid
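10. Request is forwarded through the **Local Router** in the selected decode pool to a **Worker** (decode mode), and tokens stream back through the chain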

## Customizing Pool Selection

Edit `global_router_config.json` (or the ConfigMap in `vllm-2p1d.yaml`) to change:

- **Number of pools**: Adjust `num_prefill_pools`, `num_decode_pools` and corresponding namespace lists
- **Selection grid**: Modify `isl_resolution`, `ttft_resolution` etc. to change grid granularity
- **Pool mapping**: Edit `prefill_pool_mapping` and `decode_pool_mapping` matrices

Example: To always route to pool 0 regardless of request characteristics:
```json
"prefill_pool_mapping": [[0, 0], [0, 0]]
```
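
When editing the mapping matrices by hand, a quick sanity check can catch out-of-range pool indices before the router loads the config. This is a minimal sketch that assumes the mappings are rectangular lists of integer pool indices; the `validate_pool_mapping` helper and the sample values are illustrative and not part of the example.

```python
def validate_pool_mapping(mapping, num_pools, name):
    """Raise ValueError if mapping is ragged or refers to a pool index outside [0, num_pools)."""
    if not mapping or not all(isinstance(row, list) and row for row in mapping):
        raise ValueError(f"{name} must be a non-empty matrix")
    width = len(mapping[0])
    for r, row in enumerate(mapping):
        if len(row) != width:
            raise ValueError(f"{name} row {r} has {len(row)} entries, expected {width}")
        for c, pool in enumerate(row):
            if not isinstance(pool, int) or not 0 <= pool < num_pools:
                raise ValueError(f"{name}[{r}][{c}] = {pool!r} is not a valid pool index")


# Illustrative fragment using the field names mentioned above.
config = {
    "num_prefill_pools": 2,
    "num_decode_pools": 1,
    "prefill_pool_mapping": [[0, 0], [1, 1]],
    "decode_pool_mapping": [[0, 0], [0, 0]],
}
validate_pool_mapping(config["prefill_pool_mapping"], config["num_prefill_pools"], "prefill_pool_mapping")
validate_pool_mapping(config["decode_pool_mapping"], config["num_decode_pools"], "decode_pool_mapping")
```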

## SLA Planner with GlobalPlanner

Each pool can run an SLA Planner that reads throughput metrics and delegates autoscaling decisions
to a central **GlobalPlanner** service. The GlobalPlanner arbitrates across pools and executes
scaling via the Dynamo operator.

### Architecture with SLA Planners

```
Frontend (round-robin)
     |
     v
Global Router ─── GlobalPlanner  ◄─── scale decisions from pool planners
     |
     +──────────────────────────────────────+
     |                |                     |
Prefill Pool 0    Prefill Pool 1       Decode Pool 0
LocalRouter       LocalRouter          LocalRouter
Worker            Worker               Worker
Planner ──────►   Planner ──────►      Planner ──────►  (all → GlobalPlanner)
```

### SLA Planner configuration

The SLA Planner is configured via a JSON blob passed to `--config`. Key fields for the
global-planner environment:

| Field | Description |
|---|---|
| `environment` | `"global-planner"` to delegate scaling to GlobalPlanner |
| `global_planner_namespace` | Dynamo namespace of the DGD running GlobalPlanner |
| `mode` | `"prefill"` or `"decode"` |
| `throughput_metrics_source` | `"frontend"` (default) or `"router"` — see below |

### `throughput_metrics_source`

Controls where the SLA Planner reads aggregate throughput metrics (TTFT, ITL, request rate):

- **`frontend`** (default): reads `dynamo_frontend_*` histograms from the frontend service. Works
  for single-DGD disagg deployments where the planner and frontend share a namespace.

- **`router`**: reads `dynamo_component_router_*` histograms emitted by LocalRouter pods and
  scraped by cluster Prometheus. Required for hierarchical (multi-DGD) disagg deployments where
  the SLA Planner runs in a pool DGD namespace that is different from the frontend DGD namespace.

Use `throughput_metrics_source: "router"` whenever the planner is co-located with a pool
(not the frontend), i.e. in any GlobalPlanner setup.
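
The rule above can be sketched as a small helper that assembles the JSON passed to `--config`. Only the four fields from the table are used; the function name, its arguments, and the namespace values are assumptions for illustration, and a real planner config likely needs further fields.

```python
import json


def build_planner_config(mode, global_planner_namespace, planner_namespace, frontend_namespace):
    """Build a planner config dict, choosing the metrics source from namespace co-location."""
    if mode not in ("prefill", "decode"):
        raise ValueError("mode must be 'prefill' or 'decode'")
    # "frontend" only works when planner and frontend share a namespace.
    source = "frontend" if planner_namespace == frontend_namespace else "router"
    return {
        "environment": "global-planner",
        "global_planner_namespace": global_planner_namespace,
        "mode": mode,
        "throughput_metrics_source": source,
    }


# Hypothetical namespaces for a planner running inside a prefill pool.
pool_config = build_planner_config(
    mode="prefill",
    global_planner_namespace="my-namespace-global-planner",
    planner_namespace="my-namespace-prefill-pool-0",
    frontend_namespace="my-namespace-frontend",
)
print(json.dumps(pool_config))
```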

### Prometheus scraping for router metrics

The Dynamo operator Helm chart includes a PodMonitor that scrapes LocalRouter pods on port 9090.
LocalRouter pods must expose metrics on that port via:

```yaml
env:
  - name: DYN_SYSTEM_PORT
    value: "9090"
```

No standalone Prometheus is needed — the cluster-wide Prometheus picks up the PodMonitor
automatically.

### GlobalPlanner `--no-operation` mode

Pass `--no-operation` to GlobalPlanner to receive and log scale requests without executing them.
This is useful for observing planner behaviour before enabling live scaling:

```yaml
command: [python3, -m, dynamo.global_planner]
args: [--no-operation]
```

### Example deployments

Complete end-to-end examples are in `examples/backends/`:

| File | Description |
|---|---|
| `mocker/deploy/hplanner-mocker-test.yaml` | 2 prefill + 2 decode pools with Mocker workers; GlobalPlanner in no-op mode |
| `vllm/deploy/hplanner-vllm-test.yaml` | 2 prefill (TP1, TP2) + 1 decode pool with real vLLM workers |

Both use `envsubst` for substituting `${K8S_NAMESPACE}`, `${DYNAMO_IMAGE}`, etc.