# Hierarchical Planner Example

This example demonstrates a hierarchical routing setup with:

- A **Global Router** that routes to different pools based on request characteristics
- **Local Routers** in each pool namespace
- **Workers** (Mocker for local testing, vLLM for Kubernetes deployment)

## Architecture

```
          Frontend (round-robin routing)
                         |
                         v
Global Router (registers as both prefill + decode)
                         |
        +----------------+----------------+
        |                |                |
        v                v                v
 Prefill Pool 0   Prefill Pool 1    Decode Pool 0
(prefill-pool-0) (prefill-pool-1)  (decode-pool-0)
        |                |                |
        v                v                v
  Local Router     Local Router     Local Router
        |                |                |
        v                v                v
     Worker           Worker           Worker
    (prefill)        (prefill)        (decode)
```

## Configuration

The `global_router_config.json` defines:

- 2 prefill pools (`prefill-pool-0`, `prefill-pool-1`)
- 1 decode pool (`decode-pool-0`)
- Grid-based pool selection strategy

Pool selection is based on a 2x2 grid:

- **Prefill**: (ISL, TTFT_target) maps to prefill pool index
- **Decode**: (context_length, ITL_target) maps to decode pool index

## Running Locally (with Mocker)

For local testing without GPUs, use the mocker-based script:

```bash
cd examples/hierarchical_planner
./run_example.sh
```

This starts all components in the background and provides instructions for testing.

## Kubernetes Deployment (with vLLM)

The `vllm-2p1d.yaml` file provides a multi-DGD deployment with real vLLM workers (1 GPU each).

### Prerequisites

- Kubernetes cluster with GPU nodes
- `hf-token-secret` secret containing your HuggingFace token
- The Dynamo operator installed

### Deployment

The YAML uses environment variable placeholders:

- `${K8S_NAMESPACE}` - Your Kubernetes namespace
- `${VLLM_IMAGE}` - Dynamo vLLM runtime container image

Use `envsubst` to substitute these before applying:

```bash
# Set your Kubernetes namespace and image
export K8S_NAMESPACE=
export VLLM_IMAGE=

# Deploy all DGDs
envsubst < vllm-2p1d.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
```

### Verify Deployment

```bash
# Check DGD status
kubectl get dgd -n ${K8S_NAMESPACE}

# Check pods
kubectl get pods -n ${K8S_NAMESPACE}

# Check logs for a specific component
kubectl logs -n ${K8S_NAMESPACE} -l nvidia.com/dynamo-component=Frontend
```

### Cleanup

```bash
export K8S_NAMESPACE=
export VLLM_IMAGE=

envsubst < vllm-2p1d.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
```

### Namespace Convention

The Dynamo operator prepends the Kubernetes namespace to the `dynamoNamespace` field:

- K8s namespace: `my-namespace`
- `dynamoNamespace: prefill-pool-0`
- Actual Dynamo namespace: `my-namespace-prefill-pool-0`

This is why the global router config and local router endpoints must use the full namespace path.

## Testing

Once all components are running, send a request to the frontend:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 50,
    "stream": true
  }'
```

For Kubernetes, port-forward the frontend service first:

```bash
kubectl port-forward -n ${K8S_NAMESPACE} svc/hierarchical-frontend-frontend 8000:8000
```

## Request Flow

1. Request arrives at **Frontend**
2. Frontend's `PrefillRouter` detects both prefill and decode registered for the model
3. Frontend sends prefill request to **Global Router** (registered as prefill)
4. Global Router selects prefill pool based on (ISL, TTFT_target) grid
5. Request forwarded to **Local Router** in selected prefill pool namespace
6. Local Router forwards to **Worker** (prefill mode)
7. Prefill response returns with `disaggregated_params`
8. Frontend sends decode request to **Global Router** (registered as decode)
9. Global Router selects decode pool based on (context_length, ITL_target) grid
10. Tokens stream back through the chain

## Customizing Pool Selection

Edit `global_router_config.json` (or the ConfigMap in `vllm-2p1d.yaml`) to change:

- **Number of pools**: Adjust `num_prefill_pools`, `num_decode_pools` and corresponding namespace lists
- **Selection grid**: Modify `isl_resolution`, `ttft_resolution`, etc. to change grid granularity
- **Pool mapping**: Edit `prefill_pool_mapping` and `decode_pool_mapping` matrices

Example: To always route to pool 0 regardless of request characteristics:

```json
"prefill_pool_mapping": [[0, 0], [0, 0]]
```

## SLA Planner with GlobalPlanner

Each pool can run an SLA Planner that reads throughput metrics and delegates autoscaling decisions to a central **GlobalPlanner** service. The GlobalPlanner arbitrates across pools and executes scaling via the Dynamo operator.

### Architecture with SLA Planners

```
Frontend (round-robin)
      |
      v
Global Router ─── GlobalPlanner ◄─── scale decisions from pool planners
      |
      +──────────────────────────────────────+
      |                  |                   |
Prefill Pool 0     Prefill Pool 1      Decode Pool 0
LocalRouter        LocalRouter         LocalRouter
Worker             Worker              Worker
Planner ──────►    Planner ──────►     Planner ──────► (all → GlobalPlanner)
```

### SLA Planner configuration

The SLA Planner is configured via a JSON blob passed to `--config`.
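
A minimal sketch of assembling such a blob for a pool planner, using only the fields documented in this section. The helper names, the `my-namespace` Kubernetes namespace, and the `global-planner` value for `dynamoNamespace` are hypothetical placeholders. The namespace is composed following the operator's prefixing rule from the Namespace Convention section, and a real configuration will likely need further fields not covered here.

```python
import json
import shlex


def full_dynamo_namespace(k8s_namespace: str, dynamo_namespace: str) -> str:
    """Apply the operator's convention: '<k8s namespace>-<dynamoNamespace>'."""
    return f"{k8s_namespace}-{dynamo_namespace}"


def planner_config(mode: str, global_planner_namespace: str) -> dict:
    """Build an SLA Planner config that delegates scaling to GlobalPlanner."""
    if mode not in ("prefill", "decode"):
        raise ValueError(f"mode must be 'prefill' or 'decode', got {mode!r}")
    return {
        "environment": "global-planner",
        "global_planner_namespace": global_planner_namespace,
        "mode": mode,
        # Pool planners run outside the frontend DGD, so read router metrics.
        "throughput_metrics_source": "router",
    }


# Hypothetical values: "my-namespace" and "global-planner" are placeholders,
# not names taken from the example manifests.
gp_namespace = full_dynamo_namespace("my-namespace", "global-planner")
config = planner_config("prefill", gp_namespace)

# Quote the JSON so it survives being pasted into a shell command line.
print("--config", shlex.quote(json.dumps(config)))
```

Passing the JSON through `shlex.quote` keeps the blob intact as a single shell argument; in a Kubernetes manifest the same string would go into the container `args` instead.
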
Key fields for the global-planner environment:

| Field | Description |
|---|---|
| `environment` | `"global-planner"` to delegate scaling to GlobalPlanner |
| `global_planner_namespace` | Dynamo namespace of the DGD running GlobalPlanner |
| `mode` | `"prefill"` or `"decode"` |
| `throughput_metrics_source` | `"frontend"` (default) or `"router"`; see below |

### `throughput_metrics_source`

Controls where the SLA Planner reads aggregate throughput metrics (TTFT, ITL, request rate):

- **`frontend`** (default): reads `dynamo_frontend_*` histograms from the frontend service. Works for single-DGD disagg deployments where the planner and frontend share a namespace.
- **`router`**: reads `dynamo_component_router_*` histograms emitted by LocalRouter pods and scraped by cluster Prometheus. Required for hierarchical (multi-DGD) disagg deployments where the SLA Planner runs in a pool DGD namespace that is different from the frontend DGD namespace.

Use `throughput_metrics_source: "router"` whenever the planner is co-located with a pool (not the frontend), i.e. in any GlobalPlanner setup.

### Prometheus scraping for router metrics

The Dynamo operator Helm chart includes a PodMonitor that scrapes LocalRouter pods on port 9090. LocalRouter pods must expose metrics on that port via:

```yaml
env:
  - name: DYN_SYSTEM_PORT
    value: "9090"
```

No standalone Prometheus is needed; the cluster-wide Prometheus picks up the PodMonitor automatically.

### GlobalPlanner `--no-operation` mode

Pass `--no-operation` to GlobalPlanner to receive and log scale requests without executing them.
Useful for observing planner behaviour before enabling live scaling:

```yaml
command: [python3, -m, dynamo.global_planner]
args: [--no-operation]
```

### Example deployments

Complete end-to-end examples are in `examples/backends/`:

| File | Description |
|---|---|
| `mocker/deploy/hplanner-mocker-test.yaml` | 2 prefill + 2 decode pools with Mocker workers; GlobalPlanner in no-op mode |
| `vllm/deploy/hplanner-vllm-test.yaml` | 2 prefill (TP1, TP2) + 1 decode pool with real vLLM workers |

Both use `envsubst` for substituting `${K8S_NAMESPACE}`, `${DYNAMO_IMAGE}`, etc.