Commit 5d5fd243 authored by daiyaanarfeen, committed by GitHub

feat: GlobalPlanner --max-total-gpus for cluster-wide GPU budget (#7103)


Signed-off-by: Daiyaan <darfeen@nvidia.com>
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: Anish <80174047+athreesh@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: athreesh <anish.maddipoti@utexas.edu>
Co-authored-by: Anish <80174047+athreesh@users.noreply.github.com>
parent cf5f65f7
...@@ -31,7 +31,7 @@ CODEOWNERS @ai-dynamo/Devops ...@@ -31,7 +31,7 @@ CODEOWNERS @ai-dynamo/Devops
/components/src/dynamo/planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops /components/src/dynamo/planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
/components/src/dynamo/global_router/ @ai-dynamo/python-codeowners @ai-dynamo/Devops /components/src/dynamo/global_router/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
/components/src/dynamo/global_planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops /components/src/dynamo/global_planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
/examples/hierarchical_planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops /examples/global_planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
/components/src/dynamo/profiler/ @ai-dynamo/python-codeowners @ai-dynamo/Devops /components/src/dynamo/profiler/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
/tests/planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops /tests/planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
......
...@@ -3,10 +3,39 @@ ...@@ -3,10 +3,39 @@
# Global Planner # Global Planner
Centralized scaling execution service for hierarchical planner deployments. Centralized scaling execution service for multi-DGD planner deployments.
The Global Planner receives scaling decisions from distributed planners and executes The Global Planner receives scaling decisions from local planners and executes
replica updates against Kubernetes `DynamoGraphDeployment` resources. replica updates against Kubernetes `DynamoGraphDeployment` resources. It is useful
whenever multiple DGDs should delegate scaling through one centralized component,
whether or not those DGDs sit behind a single shared endpoint.
## What Problem This Solves
Without `GlobalPlanner`, each DGD's local planner scales only its own deployment directly.
That is fine for isolated deployments, but it becomes awkward when you want one place to:
- apply centralized scaling policies across multiple DGDs
- enforce shared constraints such as authorization or total GPU budget
- coordinate scaling for a single-endpoint, multi-pool deployment
`GlobalPlanner` solves that by becoming the common scale-execution endpoint for multiple local planners.
## Deployment Patterns
`GlobalPlanner` is used in two common patterns:
1. **Centralized scaling across independent DGDs**
Each DGD keeps its own normal local planner, but the local planners delegate scale execution to one `GlobalPlanner`. This is useful when separate deployments or models should share a global policy such as a total GPU budget. You do **not** need `GlobalRouter` or a single shared endpoint for this pattern.
2. **Hierarchical single-endpoint deployment**
Multiple pool DGDs for one model sit behind one public `Frontend` and one `GlobalRouter`. Each pool still has its own local planner, and those local planners delegate scaling to `GlobalPlanner`.
## Terminology
- **SLA Planner**: The normal `dynamo.planner` component that computes desired replica counts from SLA targets, profiles, and/or metrics.
- **Local planner**: An instance of that planner running inside one DGD or one pool.
- **GlobalPlanner**: The centralized execution and policy layer that receives scale requests from local planners and applies them to target DGDs.
- **Hierarchical planner**: An architecture term, not a separate binary. In practice it means multiple local planners feeding one `GlobalPlanner`, often together with `GlobalRouter`.
## Overview ## Overview
...@@ -50,6 +79,11 @@ DYN_NAMESPACE=global-infra python -m dynamo.global_planner \ ...@@ -50,6 +79,11 @@ DYN_NAMESPACE=global-infra python -m dynamo.global_planner \
DYN_NAMESPACE=global-infra python -m dynamo.global_planner --no-operation DYN_NAMESPACE=global-infra python -m dynamo.global_planner --no-operation
``` ```
```bash
# Enforce a maximum total GPU budget across managed pools
DYN_NAMESPACE=global-infra python -m dynamo.global_planner --max-total-gpus 16
```
### Arguments ### Arguments
Required environment variables: Required environment variables:
...@@ -65,6 +99,7 @@ CLI arguments: ...@@ -65,6 +99,7 @@ CLI arguments:
- `--managed-namespaces <ns1> <ns2> ...`: Allowlist for `caller_namespace`. If omitted, accepts all namespaces. - `--managed-namespaces <ns1> <ns2> ...`: Allowlist for `caller_namespace`. If omitted, accepts all namespaces.
- `--environment kubernetes`: Execution environment (currently only `kubernetes` is supported). - `--environment kubernetes`: Execution environment (currently only `kubernetes` is supported).
- `--no-operation`: Log incoming scale requests and return success without applying Kubernetes scaling. - `--no-operation`: Log incoming scale requests and return success without applying Kubernetes scaling.
- `--max-total-gpus <n>`: Reject scale requests that would push the managed pools above the configured total GPU cap.
## Scale Request Contract ## Scale Request Contract
...@@ -100,6 +135,7 @@ Response fields: ...@@ -100,6 +135,7 @@ Response fields:
## Related Documentation ## Related Documentation
- [Planner Guide](../../../../docs/components/planner/planner-guide.md) — Planner configuration and deployment workflow - [Planner Guide](../../../../docs/components/planner/planner-guide.md) — Planner configuration and deployment workflow
- [Global Planner Deployment Guide](../../../../docs/components/planner/global-planner.md) — Deployment patterns for `GlobalPlanner`, including multi-model coordination and single-endpoint multi-pool workflows
- [Planner Design](../../../../docs/design-docs/planner-design.md) — Planner architecture and algorithms - [Planner Design](../../../../docs/design-docs/planner-design.md) — Planner architecture and algorithms
Planners delegate to this service when planner config uses `environment: "global-planner"` and sets `global_planner_namespace`. Planners delegate to this service when planner config uses `environment: "global-planner"` and sets `global_planner_namespace`.
\ No newline at end of file
...@@ -76,6 +76,11 @@ async def main(runtime: DistributedRuntime, args): ...@@ -76,6 +76,11 @@ async def main(runtime: DistributedRuntime, args):
else: else:
logger.info("No-operation mode: DISABLED") logger.info("No-operation mode: DISABLED")
if args.max_total_gpus >= 0:
logger.info(f"Max total GPUs: {args.max_total_gpus}")
else:
logger.info("Max total GPUs: UNLIMITED")
logger.info("=" * 60) logger.info("=" * 60)
# Get K8s namespace (where GlobalPlanner pod is running) # Get K8s namespace (where GlobalPlanner pod is running)
...@@ -88,6 +93,7 @@ async def main(runtime: DistributedRuntime, args): ...@@ -88,6 +93,7 @@ async def main(runtime: DistributedRuntime, args):
managed_namespaces=args.managed_namespaces, managed_namespaces=args.managed_namespaces,
k8s_namespace=k8s_namespace, k8s_namespace=k8s_namespace,
no_operation=args.no_operation, no_operation=args.no_operation,
max_total_gpus=args.max_total_gpus,
) )
# Serve scale_request endpoint # Serve scale_request endpoint
......
...@@ -53,4 +53,12 @@ Examples: ...@@ -53,4 +53,12 @@ Examples:
help="Log incoming scale requests without executing them (useful for testing the e2e flow without actual K8s scaling)", help="Log incoming scale requests without executing them (useful for testing the e2e flow without actual K8s scaling)",
) )
parser.add_argument(
"--max-total-gpus",
type=int,
default=-1,
dest="max_total_gpus",
help="Maximum total GPUs across all managed pools. Requests that would exceed this limit are rejected. 0 means no GPU scaling is allowed. -1 (default) disables enforcement entirely.",
)
return parser return parser
...@@ -3,17 +3,16 @@ ...@@ -3,17 +3,16 @@
"""Handler for scale_request endpoint in GlobalPlanner.""" """Handler for scale_request endpoint in GlobalPlanner."""
import asyncio
import logging import logging
from dynamo.planner import KubernetesConnector from dynamo.planner import KubernetesConnector
from dynamo.planner.kube import KubernetesAPI
from dynamo.planner.scale_protocol import ScaleRequest, ScaleResponse, ScaleStatus from dynamo.planner.scale_protocol import ScaleRequest, ScaleResponse, ScaleStatus
from dynamo.runtime import DistributedRuntime, dynamo_endpoint from dynamo.runtime import DistributedRuntime, dynamo_endpoint
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
# Model name used for KubernetesConnector in remote execution mode
MANAGED_MODEL_NAME = "managed"
class ScaleRequestHandler: class ScaleRequestHandler:
"""Handles incoming scale requests in GlobalPlanner. """Handles incoming scale requests in GlobalPlanner.
...@@ -24,6 +23,14 @@ class ScaleRequestHandler: ...@@ -24,6 +23,14 @@ class ScaleRequestHandler:
3. Caches KubernetesConnector per DGD for efficiency 3. Caches KubernetesConnector per DGD for efficiency
4. Executes scaling via Kubernetes API 4. Executes scaling via Kubernetes API
5. Returns current replica counts 5. Returns current replica counts
Management modes:
- **Explicit** (``--managed-namespaces`` set): Only DGDs whose Dynamo
namespaces are listed are managed. Authorization rejects requests from
unlisted namespaces, and GPU budget only counts these DGDs.
- **Implicit** (no ``--managed-namespaces``): All DGDs in the Kubernetes
namespace are managed. Any caller is accepted, and GPU budget counts
every DGD discovered in the namespace.
""" """
def __init__( def __init__(
...@@ -32,6 +39,7 @@ class ScaleRequestHandler: ...@@ -32,6 +39,7 @@ class ScaleRequestHandler:
managed_namespaces: list, managed_namespaces: list,
k8s_namespace: str, k8s_namespace: str,
no_operation: bool = False, no_operation: bool = False,
max_total_gpus: int = -1,
): ):
"""Initialize the scale request handler. """Initialize the scale request handler.
...@@ -40,6 +48,7 @@ class ScaleRequestHandler: ...@@ -40,6 +48,7 @@ class ScaleRequestHandler:
managed_namespaces: List of authorized namespaces (None = accept all) managed_namespaces: List of authorized namespaces (None = accept all)
k8s_namespace: Kubernetes namespace where GlobalPlanner is running k8s_namespace: Kubernetes namespace where GlobalPlanner is running
no_operation: If True, log scale requests without executing K8s scaling no_operation: If True, log scale requests without executing K8s scaling
max_total_gpus: Maximum total GPUs across all managed pools (-1 = unlimited)
""" """
self.runtime = runtime self.runtime = runtime
# If managed_namespaces is None, accept all namespaces # If managed_namespaces is None, accept all namespaces
...@@ -48,7 +57,11 @@ class ScaleRequestHandler: ...@@ -48,7 +57,11 @@ class ScaleRequestHandler:
) )
self.k8s_namespace = k8s_namespace self.k8s_namespace = k8s_namespace
self.no_operation = no_operation self.no_operation = no_operation
self.max_total_gpus = max_total_gpus
self.connectors = {} # Cache of KubernetesConnector per DGD self.connectors = {} # Cache of KubernetesConnector per DGD
# Serializes budget-check + scale-execution so concurrent requests from
# different pools cannot both pass against the same pre-scale state.
self._scale_lock = asyncio.Lock()
if self.managed_namespaces: if self.managed_namespaces:
logger.info( logger.info(
...@@ -63,6 +76,122 @@ class ScaleRequestHandler: ...@@ -63,6 +76,122 @@ class ScaleRequestHandler:
"scale requests will be logged but not executed" "scale requests will be logged but not executed"
) )
if self.max_total_gpus >= 0:
logger.info(
f"GPU budget enforcement ENABLED: max {self.max_total_gpus} total GPUs"
)
self._populate_k8s_connectors()
else:
logger.info("GPU budget enforcement DISABLED (unlimited)")
def _managed_dgd_names(self) -> set[str] | None:
"""Derive the DGD names that this GlobalPlanner manages.
Returns:
A set of DGD names when in explicit mode, or None in implicit mode.
The Dynamo operator convention is:
DYN_NAMESPACE = "{k8s_namespace}-{dgd_name}"
so the DGD name is the Dynamo namespace with the k8s prefix stripped.
"""
if self.managed_namespaces is None:
return None
prefix = f"{self.k8s_namespace}-"
names = set()
for ns in self.managed_namespaces:
if ns.startswith(prefix):
names.add(ns[len(prefix) :])
else:
logger.warning(
f"Managed namespace '{ns}' does not start with "
f"expected prefix '{prefix}'; cannot derive DGD name"
)
return names
def _populate_k8s_connectors(self):
"""Pre-populate connectors for DGDs managed by this GlobalPlanner.
This ensures the GPU budget calculation accounts for DGDs that already
exist at startup, even if they haven't sent a scale request yet.
In explicit mode (--managed-namespaces set), only DGDs whose names
match the managed Dynamo namespaces are discovered.
In implicit mode, all DGDs in the k8s namespace are discovered.
"""
try:
kube_api = KubernetesAPI(self.k8s_namespace)
managed_names = self._managed_dgd_names()
dgds = kube_api.list_graph_deployments()
discovered = []
for dgd in dgds:
name = dgd.get("metadata", {}).get("name", "")
if not name:
continue
# In explicit mode, skip DGDs not in the managed set
if managed_names is not None and name not in managed_names:
continue
connector_key = f"{self.k8s_namespace}/{name}"
if connector_key not in self.connectors:
connector = KubernetesConnector(
dynamo_namespace="discovered",
k8s_namespace=self.k8s_namespace,
parent_dgd_name=name,
)
self.connectors[connector_key] = connector
discovered.append(name)
logger.info(f"Discovered {len(discovered)} existing DGDs: {discovered}")
except Exception as e:
logger.warning(f"Failed to discover existing DGDs: {e}")
def _calculate_total_gpus_after_request(self, request: ScaleRequest) -> int:
"""Calculate total GPUs across all managed DGDs if this request is granted.
For the requesting DGD, uses the desired replica counts from the request.
For all other known DGDs, uses their current replica counts.
NOTE: GPU count is read from spec.services[].resources.limits.gpu only.
GPUs specified via resources.requests.gpu or extraPodSpec resource
overrides are not counted.
"""
total_gpus = 0
requesting_key = f"{request.k8s_namespace}/{request.graph_deployment_name}"
for key, connector in self.connectors.items():
try:
deployment = connector.kube_api.get_graph_deployment(
connector.parent_dgd_name
)
except Exception as e:
logger.warning(f"Failed to read DGD for {key}: {e}")
continue
services = deployment.get("spec", {}).get("services", {})
for svc_spec in services.values():
sub_type = svc_spec.get("subComponentType", "")
if not sub_type:
continue
gpu_per_replica = int(
svc_spec.get("resources", {}).get("limits", {}).get("gpu", 0)
)
if gpu_per_replica == 0:
continue
replicas = svc_spec.get("replicas", 0)
# For the requesting DGD, use desired replicas from the request
if key == requesting_key:
for target in request.target_replicas:
if target.sub_component_type.value == sub_type:
replicas = target.desired_replicas
break
total_gpus += replicas * gpu_per_replica
return total_gpus
@dynamo_endpoint(ScaleRequest, ScaleResponse) @dynamo_endpoint(ScaleRequest, ScaleResponse)
async def scale_request(self, request: ScaleRequest): async def scale_request(self, request: ScaleRequest):
"""Process scaling request from a Planner. """Process scaling request from a Planner.
...@@ -115,7 +244,6 @@ class ScaleRequestHandler: ...@@ -115,7 +244,6 @@ class ScaleRequestHandler:
if connector_key not in self.connectors: if connector_key not in self.connectors:
connector = KubernetesConnector( connector = KubernetesConnector(
dynamo_namespace=request.caller_namespace, dynamo_namespace=request.caller_namespace,
model_name=MANAGED_MODEL_NAME, # Not used for remote execution
k8s_namespace=request.k8s_namespace, k8s_namespace=request.k8s_namespace,
parent_dgd_name=request.graph_deployment_name, parent_dgd_name=request.graph_deployment_name,
) )
...@@ -125,10 +253,35 @@ class ScaleRequestHandler: ...@@ -125,10 +253,35 @@ class ScaleRequestHandler:
connector = self.connectors[connector_key] connector = self.connectors[connector_key]
logger.debug(f"Reusing cached connector for {connector_key}") logger.debug(f"Reusing cached connector for {connector_key}")
# Execute scaling (request.target_replicas is already List[TargetReplica]) # Lock ensures the budget check and scale execution are atomic
await connector.set_component_replicas( # so concurrent requests from different pools cannot both pass
request.target_replicas, blocking=request.blocking # against the same pre-scale replica counts.
) async with self._scale_lock:
# Check GPU budget before scaling
if self.max_total_gpus >= 0:
total_gpus = self._calculate_total_gpus_after_request(request)
if total_gpus > self.max_total_gpus:
logger.warning(
f"Rejecting scale request from {request.caller_namespace}: "
f"would use {total_gpus} GPUs, exceeding max of {self.max_total_gpus}"
)
yield {
"status": ScaleStatus.ERROR.value,
"message": (
f"GPU budget exceeded: request would use {total_gpus} total GPUs, "
f"max allowed is {self.max_total_gpus}"
),
"current_replicas": {},
}
return
logger.info(
f"GPU budget check passed: {total_gpus}/{self.max_total_gpus} GPUs"
)
# Execute scaling (request.target_replicas is already List[TargetReplica])
await connector.set_component_replicas(
request.target_replicas, blocking=request.blocking
)
# Get current replica counts # Get current replica counts
current_replicas = {} current_replicas = {}
......
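The lock introduced in the handler above makes the budget check and the scale execution a single atomic step. The following is a minimal, self-contained sketch of that check-then-act pattern. The names `BudgetedScaler` and `request_scale` are hypothetical, and an in-memory dict stands in for the Kubernetes state; it illustrates the idea and is not the handler's actual code.

```python
import asyncio


class BudgetedScaler:
    """Hypothetical stand-in for the handler: tracks GPUs per pool in memory."""

    def __init__(self, max_total_gpus: int):
        self.max_total_gpus = max_total_gpus
        self.gpus_by_pool: dict[str, int] = {}
        self._scale_lock = asyncio.Lock()

    async def request_scale(self, pool: str, desired_gpus: int) -> bool:
        # Check and apply under one lock so two concurrent requests cannot
        # both be validated against the same pre-scale state.
        async with self._scale_lock:
            others = sum(g for p, g in self.gpus_by_pool.items() if p != pool)
            total = others + desired_gpus
            if self.max_total_gpus >= 0 and total > self.max_total_gpus:
                return False
            await asyncio.sleep(0)  # stands in for the awaited Kubernetes call
            self.gpus_by_pool[pool] = desired_gpus
            return True


async def demo() -> list[bool]:
    scaler = BudgetedScaler(max_total_gpus=16)
    # Two pools each ask for 12 GPUs at the same time; only one can fit.
    return list(
        await asyncio.gather(
            scaler.request_scale("prefill-0", 12),
            scaler.request_scale("decode-0", 12),
        )
    )


results = asyncio.run(demo())
```

Without the lock, both requests could pass the check while the other is suspended at the awaited call, and the combined result would exceed the cap.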
...@@ -163,7 +163,7 @@ If not provided, the middle of the configured range is used as default. ...@@ -163,7 +163,7 @@ If not provided, the middle of the configured range is used as default.
## Example ## Example
See `examples/hierarchical_planner/` for a complete example with: See `examples/global_planner/` for a complete example with:
- Global router configuration - Global router configuration
- Local router setup for each pool - Local router setup for each pool
- Mocker workers for testing - Mocker workers for testing
...@@ -58,6 +58,16 @@ class KubernetesAPI: ...@@ -58,6 +58,16 @@ class KubernetesAPI:
name=graph_deployment_name, name=graph_deployment_name,
) )
def list_graph_deployments(self) -> list[dict]:
"""List all DynamoGraphDeployments in the current namespace."""
result = self.custom_api.list_namespaced_custom_object(
group="nvidia.com",
version="v1alpha1",
namespace=self.current_namespace,
plural="dynamographdeployments",
)
return result.get("items", [])
def get_graph_deployment(self, graph_deployment_name: str) -> dict: def get_graph_deployment(self, graph_deployment_name: str) -> dict:
""" """
Get the parent DynamoGraphDeployment Get the parent DynamoGraphDeployment
......
...@@ -15,6 +15,8 @@ When both modes are enabled, throughput-based scaling provides a lower bound on ...@@ -15,6 +15,8 @@ When both modes are enabled, throughput-based scaling provides a lower bound on
> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner-guide.md) for a complete workflow including profiling and deployment. > **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner-guide.md) for a complete workflow including profiling and deployment.
> **Need multi-DGD coordination?** See [Global Planner Deployment Guide](global-planner.md). It covers both shared-policy coordination across multiple DGDs and the one-endpoint multi-pool pattern.
## Feature Matrix ## Feature Matrix
| Feature | Throughput-Based | Load-Based (Experimental) | | Feature | Throughput-Based | Load-Based (Experimental) |
...@@ -84,6 +86,7 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE ...@@ -84,6 +86,7 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
| Document | Description | | Document | Description |
|----------|-------------| |----------|-------------|
| [Planner Guide](planner-guide.md) | Deployment, configuration, integration, troubleshooting | | [Planner Guide](planner-guide.md) | Deployment, configuration, integration, troubleshooting |
| [Global Planner Deployment Guide](global-planner.md) | When to use `GlobalPlanner`, including multi-model coordination and single-endpoint multi-pool deployments |
| [Planner Examples](planner-examples.md) | DGDR YAML examples, sample configurations, advanced patterns | | [Planner Examples](planner-examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
| [SLA-Driven Profiling](../profiler/profiler-guide.md) | Pre-deployment profiling process and configuration | | [SLA-Driven Profiling](../profiler/profiler-guide.md) | Pre-deployment profiling process and configuration |
| [Planner Design](../../design-docs/planner-design.md) | Architecture deep-dive for contributors | | [Planner Design](../../design-docs/planner-design.md) | Architecture deep-dive for contributors |
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Global Planner Deployment Guide
---
This guide explains how to deploy `GlobalPlanner` and when to use it. `GlobalPlanner` is the centralized scaling execution layer for deployments where multiple DGDs should delegate scaling through one component, whether those DGDs expose separate endpoints or sit behind one shared endpoint.
## Why Global Planner?
Without `GlobalPlanner`, each DGD's local planner scales only its own deployment directly. That is fine for isolated deployments, but it becomes awkward when you want one place to:
- apply centralized scaling policy across multiple DGDs
- enforce shared constraints such as authorization or total GPU budget
- coordinate scaling for a single-endpoint, multi-pool deployment
`GlobalPlanner` solves that by becoming the common scale-execution endpoint for multiple local planners.
## Terminology
- **SLA Planner**: The normal `dynamo.planner` component that computes desired replica counts to maintain SLAs.
- **Local Planner**: A pool-local instance of an SLA planner inside one DGD.
- **Global Planner**: The centralized execution and policy layer that receives scale requests from local planners.
- **Single-endpoint multi-pool deployment**: One model endpoint backed by multiple DGDs for the same model. This pattern uses both `GlobalRouter` and `GlobalPlanner`.
## Deployment Patterns
Use `GlobalPlanner` in one of these two patterns:
| Pattern | Use when | Needs `GlobalRouter` | Public endpoint shape |
|---------|----------|----------------------|-----------------------|
| Multiple model endpoints or independent DGDs | Separate DGDs should share centralized scaling policy, such as authorization or total GPU budget | No | One endpoint per DGD, or however each DGD is exposed |
| One model endpoint, multiple DGDs | One model should be reachable through one public endpoint, but different request classes should land on different DGDs | Yes | One shared endpoint |
## Pattern 1: Multiple Model Endpoints Or Independent DGDs
Use this pattern when you have multiple DGDs, often for different models, and you want them to share centralized scaling policy without collapsing them into one endpoint.
Typical examples:
- DGD A: `qwen-0.6b` disaggregated deployment with its own local planner
- DGD B: `qwen-32b` disaggregated deployment with its own local planner
- one shared `GlobalPlanner` that all local planners delegate to
In this pattern:
- each DGD keeps its own normal local planner
- each local planner is configured with `environment: "global-planner"`
- all those planners point at the same `global_planner_namespace`
- each DGD keeps its own endpoint or frontend as needed
- you do **not** need `GlobalRouter`
This is the pattern to use when the goal is centralized scaling control across multiple deployments or models.
## Pattern 2: One Model Endpoint, Multiple DGDs
Use this pattern when all of the following are true:
- You want one public endpoint for a single model.
- You want different private pools for different request classes, such as short ISL vs. long ISL requests, or different latency targets.
- You want each pool to autoscale independently.
- You want routing and scale execution to be centralized instead of exposing multiple endpoints to clients.
Typical examples:
- short-input requests are cheaper on a smaller prefill pool
- long-input requests need a larger prefill pool
- decode capacity should scale independently from prefill capacity
If you only need one pool for one model, use a single Local Planner and DGD/DGDR instead.
## What You Deploy
In the current implementation, the single-endpoint pattern is composed from multiple resources:
| Resource | Purpose | Typical contents |
|----------|---------|------------------|
| Control DGD | Public entrypoint and centralized control plane | `Frontend`, `GlobalRouter`, `GlobalPlanner` |
| Prefill pool DGD(s) | Private prefill capacity pools | `LocalRouter`, prefill worker(s), `Planner` |
| Decode pool DGD(s) | Private decode capacity pools | `LocalRouter`, decode worker(s), `Planner` |
| Optional DGDR(s) | Generate or validate one optimized pool shape at a time | Model, workload, SLA, hardware inputs |
> **Current workflow**
>
> A single DGDR does **not** generate the full single-endpoint multi-pool topology today. Instead, run one DGDR or profiling job per intended pool, then compose the final control DGD plus pool DGDs manually.
## Architecture
```text
Client
  |
  v
Frontend (single public endpoint)
  |
  v
GlobalRouter
  |
  +--> Prefill pool 0 Dynamo namespace --> LocalRouter --> Prefill workers --> Pool Planner
  +--> Prefill pool 1 Dynamo namespace --> LocalRouter --> Prefill workers --> Pool Planner
  |
  +--> Decode pool 0 Dynamo namespace --> LocalRouter --> Decode workers --> Pool Planner
  +--> Decode pool 1 Dynamo namespace --> LocalRouter --> Decode workers --> Pool Planner

Pool Planners
  |
  v
GlobalPlanner
  |
  v
Kubernetes scaling updates on the target DGDs
```
The `Frontend` exposes a single model endpoint. `GlobalRouter` selects the best pool for each request. Each pool-local `Planner` decides how much capacity its own pool needs. `GlobalPlanner` receives those scale requests and applies the Kubernetes replica changes centrally.
## Prerequisites
- Dynamo Kubernetes Platform installed. See [Kubernetes Quickstart](../../kubernetes/README.md).
- Prometheus deployed and scraping router metrics. The global planner examples assume cluster Prometheus is available.
- Backend images available for your chosen framework (`vllm`, `sglang`, or `trtllm`).
- Secrets for model access, such as a Hugging Face token secret.
- A storage strategy for model weights if your workers should share a model cache PVC.
For throughput-based scaling, you also need profiling data for each pool. See [Profiler Guide](../profiler/profiler-guide.md).
## Inputs You Need To Decide Up Front
Before writing manifests, decide the following:
| Input | Why it matters | Example |
|-------|----------------|---------|
| Model name | All pools in one hierarchy serve the same model | `meta-llama/Llama-3.3-70B-Instruct` |
| Backend | Worker args and profiling flow depend on it | `vllm` |
| Pool inventory | Number of specialized prefill and decode pools | 2 prefill pools, 1 decode pool |
| Workload classes | Determines how many pool profiles you generate | short ISL, long ISL, long-context decode |
| SLA targets | Guides profiling and routing decisions | `ttft: 200 ms`, `itl: 20 ms` |
| Worker shape | Tensor parallelism, GPUs per worker, and memory footprint | TP1 prefill vs. TP2 prefill |
| Routing policy | Maps requests to pools at runtime | low-ISL requests -> pool 0 |
| Optional global budget | Caps total GPUs across managed pools | `--max-total-gpus 16` |
## Step 1: Profile Each Intended Pool Independently
Start by deciding what each pool should specialize in. Common examples:
- Prefill pool 0: lower-cost pool for short prompts.
- Prefill pool 1: larger pool for long prompts.
- Decode pool 0: standard decode pool for most requests.
For each intended pool, run a separate DGDR or profiling job with the workload and SLA that represent that pool.
Example DGDR skeleton:
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-prefill-short
spec:
  model: meta-llama/Llama-3.3-70B-Instruct
  backend: vllm
  image: nvcr.io/nvidia/ai-dynamo/dynamo-frontend:<tag>
  workload:
    isl: 2048
    osl: 256
  sla:
    ttft: 200.0
    itl: 20.0
  searchStrategy: rapid
  autoApply: false
```
Repeat this once per planned pool, changing the workload and SLA inputs for each request class.
What to keep from each profiling result:
- Worker shape (`tensor-parallel-size`, GPUs per worker, memory/caching settings).
- Planner profile data directory or generated ConfigMaps.
- Planner settings such as `prefill_engine_num_gpu` or `decode_engine_num_gpu`.
- Any backend-specific flags that differ across pools.
See [Planner Examples](planner-examples.md) and [Profiler Guide](../profiler/profiler-guide.md) for DGDR details.
## Step 2: Create The Control DGD
Deploy one control DGD that contains:
- `Frontend`: the single public model endpoint.
- `GlobalRouter`: chooses which pool receives each request.
- `GlobalPlanner`: receives scale requests from pool planners and applies replica changes.
The vLLM example topology is in [examples/global_planner/global-planner-vllm-test.yaml](../../../examples/global_planner/global-planner-vllm-test.yaml).
The `GlobalPlanner` section is minimal:
```yaml
GlobalPlanner:
  componentType: default
  replicas: 1
  extraPodSpec:
    mainContainer:
      image: ${DYNAMO_IMAGE}
      command:
        - python3
        - -m
        - dynamo.global_planner
      args:
        - --managed-namespaces
        - ${K8S_NAMESPACE}-gp-prefill-0
        - ${K8S_NAMESPACE}-gp-prefill-1
        - ${K8S_NAMESPACE}-gp-decode-0
```
The values passed to `--managed-namespaces` are the pool planners' **Dynamo namespaces** (`caller_namespace`), not raw Kubernetes namespaces. In many examples they share the same string prefix, but they are logically different identifiers.
**Management modes**: When `--managed-namespaces` is set (explicit mode), only the listed Dynamo namespaces are authorized to send scale requests, and only their corresponding DGDs count toward the GPU budget. DGD names are derived from the Dynamo namespace using the operator convention `DYN_NAMESPACE = {k8s_namespace}-{dgd_name}`. When omitted (implicit mode), any caller is accepted and all DGDs in the Kubernetes namespace count toward the GPU budget.
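The name derivation in explicit mode can be sketched as follows. This is an illustrative stand-alone function (`managed_dgd_names` is a hypothetical free-function version of the handler's private helper), and the `prod` namespace and pool names are made-up example values.

```python
from __future__ import annotations


def managed_dgd_names(
    k8s_namespace: str, managed_namespaces: list[str] | None
) -> set[str] | None:
    """Strip the '{k8s_namespace}-' prefix from each managed Dynamo namespace.

    Returns None in implicit mode (no allowlist), meaning "all DGDs".
    Namespaces that do not carry the expected prefix are skipped here.
    """
    if managed_namespaces is None:
        return None
    prefix = f"{k8s_namespace}-"
    return {ns[len(prefix):] for ns in managed_namespaces if ns.startswith(prefix)}


# Example values are assumptions chosen for illustration.
names = managed_dgd_names(
    "prod", ["prod-gp-prefill-0", "prod-gp-decode-0", "other-pool"]
)
```

Here `other-pool` does not start with `prod-`, so no DGD name can be derived from it and it does not contribute to the managed set.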
If you want the central executor to reject scale requests that exceed a total GPU budget, add `--max-total-gpus`. See [examples/global_planner/global-planner-gpu-budget.yaml](../../../examples/global_planner/global-planner-gpu-budget.yaml).
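To make the budget rule concrete, here is a simplified sketch of the projection that the check relies on: the requesting DGD is counted at its desired replica counts, and every other DGD at its current ones. The function names and the in-memory `dgds` dict are assumptions for illustration; the real handler reads each DGD spec from the Kubernetes API.

```python
def total_gpus_after_request(dgds: dict, requesting_key: str, desired: dict) -> int:
    """Project total GPUs if the request for `requesting_key` were granted."""
    total = 0
    for key, spec in dgds.items():
        for svc in spec.get("services", {}).values():
            sub_type = svc.get("subComponentType", "")
            if not sub_type:
                continue  # services without a sub-component type are not counted
            gpus = int(svc.get("resources", {}).get("limits", {}).get("gpu", 0))
            replicas = svc.get("replicas", 0)
            if key == requesting_key:
                replicas = desired.get(sub_type, replicas)
            total += replicas * gpus
    return total


def within_budget(total: int, max_total_gpus: int) -> bool:
    """-1 (any negative value) disables enforcement; 0 allows no GPUs at all."""
    return max_total_gpus < 0 or total <= max_total_gpus


# Made-up cluster state: 3 x 2-GPU prefill replicas and 4 x 1-GPU decode replicas.
dgds = {
    "prod/gp-prefill-1": {"services": {
        "Frontend": {"replicas": 1},
        "Worker": {"subComponentType": "prefill", "replicas": 3,
                   "resources": {"limits": {"gpu": "2"}}},
    }},
    "prod/gp-decode-0": {"services": {
        "Worker": {"subComponentType": "decode", "replicas": 4,
                   "resources": {"limits": {"gpu": "1"}}},
    }},
}

projected = total_gpus_after_request(dgds, "prod/gp-prefill-1", {"prefill": 7})
allowed = within_budget(projected, 16)
```

With a cap of 16, scaling the prefill pool to 7 replicas projects 18 GPUs and would be rejected, while 6 replicas projects exactly 16 and would pass, since only totals strictly above the cap are refused.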
## Step 3: Create One DGD Per Pool
Each private pool gets its own DGD. A pool DGD usually contains:
- `LocalRouter`
- one worker type (`prefill` or `decode`)
- one `Planner`
The planner inside each pool must be configured for `global-planner` mode so it delegates scaling to the control stack:
```json
{
  "environment": "global-planner",
  "global_planner_namespace": "${K8S_NAMESPACE}-gp-ctrl",
  "backend": "vllm",
  "mode": "prefill",
  "enable_load_scaling": false,
  "enable_throughput_scaling": true,
  "throughput_metrics_source": "router",
  "ttft": 2000,
  "prefill_engine_num_gpu": 2,
  "model_name": "${MODEL_NAME}",
  "profile_results_dir": "/workspace/tests/planner/profiling_results/H200_TP1P_TP1D"
}
```
`global_planner_namespace` must point to the control stack's **Dynamo namespace**. In the reference manifests, that is the namespace string passed to the control `Frontend` and `GlobalRouter`.
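Before applying a pool manifest, it can help to confirm that its planner config actually delegates. The check below is a small sketch; `delegates_to_global_planner` is a hypothetical helper and not part of `dynamo.planner`, and the namespace value is an example.

```python
import json


def delegates_to_global_planner(config: dict) -> bool:
    """True when the config selects global-planner mode and names a target namespace."""
    return config.get("environment") == "global-planner" and bool(
        config.get("global_planner_namespace")
    )


# Example planner config; the namespace string is an assumed value.
pool_config = json.loads(
    '{"environment": "global-planner",'
    ' "global_planner_namespace": "prod-gp-ctrl",'
    ' "mode": "prefill"}'
)
ok = delegates_to_global_planner(pool_config)
```

A config that sets `environment` to `global-planner` but omits `global_planner_namespace` fails this check, which matches the requirement that both fields be set.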
Use:
- `mode: "prefill"` for prefill-only pools
- `mode: "decode"` for decode-only pools
The worker and planner settings for each pool come from the pool-specific profiling result you created in Step 1.
In the reference vLLM example:
- `gp-prefill-0` uses a 1-GPU TP1 prefill worker
- `gp-prefill-1` uses a 2-GPU TP2 prefill worker
- `gp-decode-0` uses a 1-GPU TP1 decode worker
See [global-planner-vllm-test.yaml](../../../examples/global_planner/global-planner-vllm-test.yaml).
## Step 4: Configure GlobalRouter To Select Pools
`GlobalRouter` reads a JSON config that lists the pool namespaces and a routing grid for each request type.
Example:
```json
{
"num_prefill_pools": 2,
"num_decode_pools": 1,
"prefill_pool_dynamo_namespaces": [
"${K8S_NAMESPACE}-gp-prefill-0",
"${K8S_NAMESPACE}-gp-prefill-1"
],
"decode_pool_dynamo_namespaces": [
"${K8S_NAMESPACE}-gp-decode-0"
],
"prefill_pool_selection_strategy": {
"ttft_min": 10,
"ttft_max": 3000,
"ttft_resolution": 2,
"isl_min": 0,
"isl_max": 32000,
"isl_resolution": 2,
"prefill_pool_mapping": [[0, 1], [0, 1]]
},
"decode_pool_selection_strategy": {
"itl_min": 10,
"itl_max": 500,
"itl_resolution": 2,
"context_length_min": 0,
"context_length_max": 32000,
"context_length_resolution": 2,
"decode_pool_mapping": [[0, 0], [0, 0]]
}
}
```
The `prefill_pool_dynamo_namespaces` and `decode_pool_dynamo_namespaces` entries are **Dynamo namespaces** that the pool-local routers register under.
Important runtime behavior:
- Prefill pool selection uses **ISL + TTFT target**
- Decode pool selection uses **context length + ITL target**
- OSL is useful for **designing and profiling pools**, but it is **not a direct routing key** in the current `GlobalRouter`
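As a rough illustration of grid-based selection, the Python sketch below buckets a request's ISL and TTFT target into the 2x2 grid from the example config and looks up the pool index. The helper names (`bucket`, `select_prefill_pool`), the uniform bucketing, the clamping of out-of-range values, and the `[isl][ttft]` index order are all assumptions made for illustration. The real `GlobalRouter` logic may differ, so treat this as a mental model rather than a specification.

```python
def bucket(value: float, lo: float, hi: float, resolution: int) -> int:
    """Map value to a bucket index in [0, resolution - 1], clamping out-of-range input."""
    if hi <= lo or resolution < 1:
        raise ValueError("expected hi > lo and resolution >= 1")
    frac = (value - lo) / (hi - lo)
    return max(0, min(resolution - 1, int(frac * resolution)))


def select_prefill_pool(strategy: dict, isl: int, ttft_target: float) -> int:
    """Look up the prefill pool index for a request (hypothetical helper)."""
    isl_idx = bucket(isl, strategy["isl_min"], strategy["isl_max"],
                     strategy["isl_resolution"])
    ttft_idx = bucket(ttft_target, strategy["ttft_min"], strategy["ttft_max"],
                      strategy["ttft_resolution"])
    # Assumed index order: first index = ISL bucket, second = TTFT bucket.
    return strategy["prefill_pool_mapping"][isl_idx][ttft_idx]


# Values copied from the prefill_pool_selection_strategy example above.
strategy = {
    "ttft_min": 10, "ttft_max": 3000, "ttft_resolution": 2,
    "isl_min": 0, "isl_max": 32000, "isl_resolution": 2,
    "prefill_pool_mapping": [[0, 1], [0, 1]],
}

tight_pool = select_prefill_pool(strategy, isl=4000, ttft_target=200)
relaxed_pool = select_prefill_pool(strategy, isl=4000, ttft_target=2500)
print(tight_pool, relaxed_pool)
```

Under these assumptions, the example mapping sends requests with a tight TTFT target to pool 0 and requests with a relaxed target to pool 1, regardless of ISL.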
Clients can pass request targets through `extra_args`:
```json
{
"extra_args": {
"ttft_target": 200,
"itl_target": 20
}
}
```
For more details, see [Global Router README](../../../components/src/dynamo/global_router/README.md).
## Step 5: Deploy In Order
For a fresh cluster, the usual order is:
1. Install Dynamo platform and Prometheus.
2. Create secrets and PVCs needed by workers.
3. Create the `GlobalRouter` ConfigMap.
4. Apply the control DGD.
5. Apply the pool DGDs.
6. Wait for all DGDs to reach ready state.
7. Expose or port-forward the control `Frontend`.
Example:
```bash
export K8S_NAMESPACE=my-llama
export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
export DYNAMO_IMAGE=<dynamo-image>
export DYNAMO_VLLM_IMAGE=<vllm-image>
export STORAGE_CLASS_NAME=<rwx-storage-class>
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${K8S_NAMESPACE}
envsubst < examples/global_planner/global-planner-vllm-test.yaml | \
kubectl apply -n ${K8S_NAMESPACE} -f -
```
The single user-facing endpoint is the `Frontend` in the control DGD, not the pool DGDs.
## Step 6: Validate The Stack
Validate the deployment from outside in:
- Confirm the control `Frontend` is healthy and serving the model endpoint.
- Confirm `GlobalRouter` logs show requests being assigned to the expected pool namespaces.
- Confirm pool-local planners are producing scale requests.
- Confirm `GlobalPlanner` logs show accepted scale operations.
- Confirm the target DGDs' replica counts change as expected.
If you use Prometheus and Grafana, also inspect:
- TTFT and ITL over time
- per-pool worker counts
- per-pool request mix
- total GPU usage
## Recommended Workflow For New Deployments
For most teams, the easiest way to build this deployment is:
1. Design your pool classes from expected traffic patterns.
2. Run one DGDR per pool class to generate or validate the pool configuration.
3. Copy the selected worker shape and planner settings into the final pool DGDs.
4. Build one control DGD with `Frontend`, `GlobalRouter`, and `GlobalPlanner`.
5. Route all client traffic through the control `Frontend`.
This keeps profiling and pool selection simple while still giving you one public endpoint for the model.
## Current Limitations
- Single-endpoint `GlobalPlanner` deployments are assembled manually today. One DGDR does not emit the full control DGD plus pool DGDs topology.
- `GlobalRouter` routes by ISL/TTFT and context-length/ITL grids, not directly by OSL.
- In the single-endpoint pattern, all pools are expected to serve the same model.
## See Also
- [Planner README](README.md) — Planner overview and quick start
- [Planner Guide](planner-guide.md) — Planner configuration reference
- [Planner Examples](planner-examples.md) — DGDR examples for generating per-pool configs
- [Profiler Guide](../profiler/profiler-guide.md) — Pre-deployment profiling workflow
- [Global Planner README](../../../components/src/dynamo/global_planner/README.md) — Centralized scale execution
- [Global Router README](../../../components/src/dynamo/global_router/README.md) — Cross-pool request routing
- [vLLM global planner example](../../../examples/global_planner/global-planner-vllm-test.yaml) — End-to-end reference manifest
@@ -114,8 +114,19 @@ The planner receives its config via `--config /path/to/planner_config.json` whic
See the [Profiler Guide](../profiler/profiler-guide.md) for the full profiling workflow and how to configure pre-deployment sweeping.
## Hierarchical Deployments
If you want one public endpoint for a model but multiple private DGDs optimized for different request classes, use a hierarchical deployment:
- one control DGD with `Frontend`, `GlobalRouter`, and `GlobalPlanner`
- one or more prefill pool DGDs
- one or more decode pool DGDs
In the current workflow, run profiling independently for each intended pool, then compose the final control DGD plus pool DGDs manually. See [Global Planner Deployment Guide](global-planner.md).
## See Also
- [Planner README](README.md) — Quick overview
- [Global Planner Deployment Guide](global-planner.md): `GlobalPlanner` deployment patterns and single-endpoint multi-pool workflow
- [Planner Design](../../design-docs/planner-design.md) — Architecture internals
- [Profiler Guide](../profiler/profiler-guide.md) — How profiling data is generated
@@ -279,7 +279,7 @@ For full documentation on implementing KV event publishing for custom inference
For deployments with multiple worker pools, the **Global Router** enables hierarchical routing by sitting between the frontend and local routers. It selects the appropriate pool for each request based on configurable policies, supporting disaggregated topologies where pools are tuned for different workload characteristics.
- **Component details**: [`components/src/dynamo/global_router/`](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/global_router/)
- **Example**: [`examples/hierarchical_planner/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/hierarchical_planner/) - **Example**: [`examples/global_planner/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/global_planner/)
## See Also
......
@@ -35,6 +35,9 @@ Advanced disaggregated deployment with KV cache routing capabilities.
- `VLLMDecodeWorker`: Specialized decode-only worker
- `VLLMPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
### 5. **Global Planner Deployments** (see [`examples/global_planner/`](../../../global_planner/))
Centralized scaling across multiple DGDs via GlobalPlanner. Examples include single-endpoint multi-pool and multi-model GPU budget patterns. See the [global planner examples](../../../global_planner/) for details.
## CRD Structure
All templates use the **DynamoGraphDeployment** CRD:
@@ -121,6 +124,7 @@ Select the deployment pattern that matches your requirements:
- Use `disagg.yaml` for maximum performance
- Use `disagg_router.yaml` for high-performance with KV cache routing
- Use `disagg_planner.yaml` for SLA-optimized performance
- Use [global planner examples](../../../global_planner/) for centralized scaling across multiple DGDs
### 2. Customize Configuration
Edit the template to match your environment:
@@ -249,6 +253,7 @@ args:
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
- **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation-guide.md)
- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/components/planner/planner-guide.md)
- **Global Planner**: [Global Planner Deployment Guide](../../../../docs/components/planner/global-planner.md)
- **Examples**: [Deployment Examples](../../../../docs/getting-started/examples.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design-docs/disagg-serving.md), [KV-Aware Routing](../../../../docs/components/router/README.md)
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Global Planner Examples
Examples demonstrating **GlobalPlanner** — the centralized scaling execution layer that
enforces shared scaling policy across multiple DGDs.
## Example Manifests
| File | Pattern | Backend | Description |
|------|---------|---------|-------------|
| `global-planner-gpu-budget.yaml` | Multi-model, GPU budget | vLLM | 2 independent model DGDs + 1 control DGD with `--max-total-gpus` |
| `global-planner-vllm-test.yaml` | Single-endpoint, multi-pool | vLLM | 1 Frontend + GlobalRouter + GlobalPlanner, 2 prefill pools (TP1, TP2) + 1 decode pool |
| `global-planner-mocker-test.yaml` | Single-endpoint, multi-pool | Mocker | Same as above with Mocker workers; GlobalPlanner in `--no-operation` mode |
## Deployment Patterns
### Pattern 1: Multi-Model with GPU Budget (`global-planner-gpu-budget.yaml`)
Multiple independent DGDs, each serving a different model with its own Frontend.
A shared GlobalPlanner enforces a cluster-wide GPU cap.
```
DGD gp-ctrl: GlobalPlanner (--max-total-gpus)
DGD model-a: Frontend + VllmPrefillWorker + VllmDecodeWorker + Planner (MODEL_A)
DGD model-b: Frontend + VllmPrefillWorker + VllmDecodeWorker + Planner (MODEL_B)
```
- No GlobalRouter needed — each model has its own endpoint.
- Each DGD's local planner uses `environment: "global-planner"` to delegate scaling.
- GlobalPlanner rejects any scale request that would push total GPUs above the limit.
### Pattern 2: Single Endpoint, Multi-Pool (`global-planner-vllm-test.yaml`)
One public endpoint for a single model, backed by multiple specialized pools.
A GlobalRouter selects the best pool for each request.
```
DGD gp-ctrl: Frontend + GlobalRouter + GlobalPlanner
DGD gp-prefill-0: LocalRouter + VllmPrefillWorker (TP1) + Planner
DGD gp-prefill-1: LocalRouter + VllmPrefillWorker (TP2) + Planner
DGD gp-decode-0: LocalRouter + VllmDecodeWorker (TP1) + Planner
```
- GlobalRouter routes prefill requests by (ISL, TTFT target) and decode by (context length, ITL target).
- Each pool planner delegates scaling to GlobalPlanner.
## Prerequisites
- Dynamo Kubernetes Platform installed (see [Kubernetes Quickstart](../../docs/kubernetes/README.md))
- Cluster Prometheus scraping router metrics via PodMonitor
- HuggingFace token secret:
```bash
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=<your-token> -n ${K8S_NAMESPACE}
```
- A ReadWriteMany StorageClass for the shared model cache PVC
## Deploying
All manifests use `envsubst` for configuration. Set the required variables and apply:
### GPU Budget Example
```bash
export K8S_NAMESPACE=my-ns
export DYNAMO_IMAGE=<dynamo-image>
export DYNAMO_VLLM_IMAGE=<vllm-image>
export STORAGE_CLASS_NAME=<rwx-storage-class>
export MODEL_A=meta-llama/Llama-3.1-8B-Instruct
export MODEL_B=Qwen/Qwen3-8B
export MAX_TOTAL_GPUS=8
envsubst < global-planner-gpu-budget.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
```
### Single-Endpoint vLLM Example
```bash
export K8S_NAMESPACE=my-ns
export DYNAMO_IMAGE=<dynamo-image>
export DYNAMO_VLLM_IMAGE=<vllm-image>
export STORAGE_CLASS_NAME=<rwx-storage-class>
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
envsubst < global-planner-vllm-test.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
```
### Mocker Example (No GPUs)
```bash
export K8S_NAMESPACE=my-ns
export DYNAMO_IMAGE=<dynamo-image>
envsubst < global-planner-mocker-test.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
```
## Verifying
```bash
# Check DGD status
kubectl get dgd -n ${K8S_NAMESPACE}
# Check pods
kubectl get pods -n ${K8S_NAMESPACE}
# Watch GlobalPlanner logs for scale requests
kubectl logs -n ${K8S_NAMESPACE} \
-l nvidia.com/dynamo-component=GlobalPlanner -f
```
## Cleanup
```bash
envsubst < <manifest>.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
```
## SLA Planner Configuration
Each pool's local planner is configured via a JSON blob passed to `--config`.
Key fields for GlobalPlanner delegation:
| Field | Description |
|-------|-------------|
| `environment` | `"global-planner"` — delegates scaling to GlobalPlanner |
| `global_planner_namespace` | Dynamo namespace of the control DGD (e.g. `${K8S_NAMESPACE}-gp-ctrl`) |
| `mode` | `"disagg"`, `"prefill"`, or `"decode"` |
| `throughput_metrics_source` | `"router"` for multi-DGD setups (reads `dynamo_component_router_*` from Prometheus) |
| `max_gpu_budget` | Per-pool GPU limit (`-1` = unlimited, defer to GlobalPlanner) |
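Hand-editing the long single-quoted JSON string passed to `--config` is error-prone, so one option is to generate it. The Python sketch below assembles only the delegation-related fields from the table above. `build_planner_config` is a hypothetical helper, not part of Dynamo, and the `gp-ctrl` control DGD name is taken from the example manifests. A real config also needs the SLA and profiling fields shown in those manifests (for example `ttft`, `itl`, and `profile_results_dir`).

```python
import json


def build_planner_config(k8s_namespace: str, model_name: str,
                         mode: str = "disagg") -> str:
    """Return a compact JSON string for the planner's --config argument (illustrative)."""
    if mode not in ("disagg", "prefill", "decode"):
        raise ValueError(f"unsupported mode: {mode}")
    config = {
        "environment": "global-planner",
        # Dynamo namespace of the control DGD, assumed to be named gp-ctrl.
        "global_planner_namespace": f"{k8s_namespace}-gp-ctrl",
        "backend": "vllm",
        "mode": mode,
        "throughput_metrics_source": "router",
        "max_gpu_budget": -1,  # defer GPU limits to GlobalPlanner
        "model_name": model_name,
    }
    return json.dumps(config, separators=(",", ":"))


config_json = build_planner_config("my-ns", "Qwen/Qwen3-8B")
print(config_json)
```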
## GlobalPlanner Flags
| Flag | Description |
|------|-------------|
| `--max-total-gpus N` | Reject requests that would exceed N total GPUs across all managed DGDs. `0` = no GPU scaling allowed, `-1` (default) = unlimited |
| `--managed-namespaces NS...` | Only accept scale requests from listed Dynamo namespaces (default: accept all). See *Management Modes* below |
| `--no-operation` | Log scale requests without executing them (useful for dry-run testing) |
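The `--max-total-gpus` semantics can be pictured as a simple admission check. The Python sketch below is a minimal illustration and not the actual `GlobalPlanner` code: `accept_scale_request` and its arguments are hypothetical, and it assumes the total is the GPUs held by other managed DGDs plus the requested replica count times GPUs per replica.

```python
def accept_scale_request(other_gpus: int, target_replicas: int,
                         gpus_per_replica: int, max_total_gpus: int = -1) -> bool:
    """Return True if the request fits within the budget (illustrative only).

    other_gpus: GPUs already used by the other managed DGDs (assumed input).
    max_total_gpus: -1 means unlimited; 0 means no GPU allocation is allowed.
    """
    if max_total_gpus < 0:
        return True  # unlimited budget
    requested_total = other_gpus + target_replicas * gpus_per_replica
    return requested_total <= max_total_gpus


# Budget of 8 GPUs with 4 already in use elsewhere.
fits = accept_scale_request(other_gpus=4, target_replicas=2,
                            gpus_per_replica=2, max_total_gpus=8)
too_big = accept_scale_request(other_gpus=4, target_replicas=3,
                               gpus_per_replica=2, max_total_gpus=8)
print(fits, too_big)
```

The sketch does not distinguish scale-up from scale-down requests; how the real service treats a scale-down that still leaves the total above the limit is not covered here.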
### Management Modes
GlobalPlanner operates in one of two modes depending on whether `--managed-namespaces` is set:
- **Explicit mode** (`--managed-namespaces` provided): Only the listed Dynamo
namespaces are authorized to send scale requests, and only their corresponding
DGDs count toward the GPU budget. DGD names are derived from the Dynamo
namespace using the operator convention `DYN_NAMESPACE = {k8s_namespace}-{dgd_name}`.
- **Implicit mode** (no `--managed-namespaces`): Any caller is accepted, and all
DGDs in the Kubernetes namespace count toward the GPU budget.
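The two modes and the naming convention can be sketched in a few lines of Python. Both helpers below are hypothetical and exist only to illustrate the rules described above; they assume the `{k8s_namespace}-{dgd_name}` convention holds for every managed namespace.

```python
from typing import Optional, Sequence


def dgd_name_from_dynamo_namespace(dynamo_namespace: str, k8s_namespace: str) -> str:
    """Strip the '<k8s_namespace>-' prefix to recover the DGD name."""
    prefix = f"{k8s_namespace}-"
    if not dynamo_namespace.startswith(prefix):
        raise ValueError(
            f"{dynamo_namespace!r} does not start with {prefix!r}")
    return dynamo_namespace[len(prefix):]


def is_authorized(caller_namespace: str,
                  managed_namespaces: Optional[Sequence[str]]) -> bool:
    """Explicit mode checks the allow-list; implicit mode accepts any caller."""
    if not managed_namespaces:
        return True  # implicit mode: --managed-namespaces not set
    return caller_namespace in managed_namespaces


managed = ["my-ns-gp-prefill-0", "my-ns-gp-decode-0"]
budgeted_dgds = [dgd_name_from_dynamo_namespace(ns, "my-ns") for ns in managed]
print(budgeted_dgds)
print(is_authorized("my-ns-gp-prefill-1", managed))
```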
## Namespace Convention
The Dynamo operator builds each DGD's Dynamo namespace by prepending the Kubernetes namespace to the DGD name:
- K8s namespace: `my-ns`, DGD name: `gp-ctrl`
- Dynamo namespace: `my-ns-gp-ctrl`
This is why planner configs and router endpoints use the full `${K8S_NAMESPACE}-<dgd-name>` path.
## Further Reading
- [Global Planner Deployment Guide](../../docs/components/planner/global-planner.md)
- [Global Planner README](../../components/src/dynamo/global_planner/README.md)
- [Planner Configuration Guide](../../docs/components/planner/planner-guide.md)
- [Global Router README](../../components/src/dynamo/global_router/README.md)
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Minimal GlobalPlanner GPU budget example: 2 independent model DGDs sharing a
# GPU cap enforced by a central GlobalPlanner.
#
# Each model DGD is self-contained (Frontend + Workers + Planner) and serves a
# different model. The GlobalPlanner in the ctrl DGD rejects any scale request
# that would push the total GPU count across its managed DGDs above MAX_TOTAL_GPUS.
#
# The budget applies only to DGDs managed by this GlobalPlanner (see
# --managed-namespaces), not to every DGD in the cluster. In this example the
# ctrl DGD runs in implicit mode (no --managed-namespaces), so all DGDs in the
# same K8s namespace count toward the budget. To limit the budget to specific
# DGDs, pass --managed-namespaces with their Dynamo namespaces.
#
# Architecture:
# DGD gp-ctrl: GlobalPlanner (--max-total-gpus)
# DGD model-a: Frontend + VllmPrefillWorker + VllmDecodeWorker + Planner (MODEL_A)
# DGD model-b: Frontend + VllmPrefillWorker + VllmDecodeWorker + Planner (MODEL_B)
#
# Prerequisites:
# - Cluster Prometheus deployed and scraping pods via PodMonitor
# - HuggingFace token secret: kubectl create secret generic hf-token-secret \
# --from-literal=HF_TOKEN=<your-token> -n ${K8S_NAMESPACE}
#
# Usage:
# export K8S_NAMESPACE=... DYNAMO_IMAGE=... DYNAMO_VLLM_IMAGE=... STORAGE_CLASS_NAME=...
# export MODEL_A=meta-llama/Llama-3.1-8B-Instruct MODEL_B=Qwen/Qwen3-8B MAX_TOTAL_GPUS=8
# envsubst < global-planner-gpu-budget.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
# envsubst < global-planner-gpu-budget.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: ${K8S_NAMESPACE}-planner
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: dynamo-platform-dynamo-operator-planner
subjects:
- kind: ServiceAccount
name: default
namespace: ${K8S_NAMESPACE}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: hf-model-cache
spec:
accessModes:
- ReadWriteMany
storageClassName: ${STORAGE_CLASS_NAME}
resources:
requests:
storage: 50Gi
---
# ── Control plane: GlobalPlanner only ────────────────────────────────────────
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: gp-ctrl
spec:
services:
GlobalPlanner:
componentType: default
replicas: 1
extraPodSpec:
mainContainer:
image: ${DYNAMO_IMAGE}
command:
- python3
- -m
- dynamo.global_planner
args:
- --max-total-gpus
- "${MAX_TOTAL_GPUS}"
---
# ── Model A: self-contained disagg serving DGD ──────────────────────────────
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: model-a
spec:
services:
Frontend:
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: ${DYNAMO_IMAGE}
workingDir: /workspace
command:
- python3
- -m
- dynamo.frontend
args:
- --model-name
- ${MODEL_A}
VllmPrefillWorker:
envFromSecret: hf-token-secret
componentType: worker
subComponentType: prefill
replicas: 1
resources:
limits:
gpu: "1"
extraPodSpec:
volumes:
- name: hf-model-cache
persistentVolumeClaim:
claimName: hf-model-cache
mainContainer:
image: ${DYNAMO_VLLM_IMAGE}
workingDir: /workspace/examples/backends/vllm
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- ${MODEL_A}
- --tensor-parallel-size
- "1"
- --is-prefill-worker
volumeMounts:
- name: hf-model-cache
mountPath: /home/dynamo/.cache/huggingface/hub
VllmDecodeWorker:
envFromSecret: hf-token-secret
componentType: worker
subComponentType: decode
replicas: 1
resources:
limits:
gpu: "1"
extraPodSpec:
volumes:
- name: hf-model-cache
persistentVolumeClaim:
claimName: hf-model-cache
mainContainer:
image: ${DYNAMO_VLLM_IMAGE}
workingDir: /workspace/examples/backends/vllm
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- ${MODEL_A}
- --tensor-parallel-size
- "1"
volumeMounts:
- name: hf-model-cache
mountPath: /home/dynamo/.cache/huggingface/hub
Planner:
componentType: planner
replicas: 1
extraPodSpec:
mainContainer:
image: ${DYNAMO_IMAGE}
command:
- python3
- -m
- dynamo.planner
args:
- --config
- '{"environment":"global-planner","global_planner_namespace":"${K8S_NAMESPACE}-gp-ctrl","backend":"vllm","mode":"disagg","enable_load_scaling":false,"enable_throughput_scaling":true,"throughput_metrics_source":"router","ttft":2000,"itl":200,"max_gpu_budget":-1,"prefill_engine_num_gpu":1,"decode_engine_num_gpu":1,"model_name":"${MODEL_A}","profile_results_dir":"/workspace/tests/planner/profiling_results/H200_TP1P_TP1D"}'
---
# ── Model B: self-contained disagg serving DGD ──────────────────────────────
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: model-b
spec:
services:
Frontend:
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: ${DYNAMO_IMAGE}
workingDir: /workspace
command:
- python3
- -m
- dynamo.frontend
args:
- --model-name
- ${MODEL_B}
VllmPrefillWorker:
envFromSecret: hf-token-secret
componentType: worker
subComponentType: prefill
replicas: 1
resources:
limits:
gpu: "1"
extraPodSpec:
volumes:
- name: hf-model-cache
persistentVolumeClaim:
claimName: hf-model-cache
mainContainer:
image: ${DYNAMO_VLLM_IMAGE}
workingDir: /workspace/examples/backends/vllm
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- ${MODEL_B}
- --tensor-parallel-size
- "1"
- --is-prefill-worker
volumeMounts:
- name: hf-model-cache
mountPath: /home/dynamo/.cache/huggingface/hub
VllmDecodeWorker:
envFromSecret: hf-token-secret
componentType: worker
subComponentType: decode
replicas: 1
resources:
limits:
gpu: "1"
extraPodSpec:
volumes:
- name: hf-model-cache
persistentVolumeClaim:
claimName: hf-model-cache
mainContainer:
image: ${DYNAMO_VLLM_IMAGE}
workingDir: /workspace/examples/backends/vllm
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- ${MODEL_B}
- --tensor-parallel-size
- "1"
volumeMounts:
- name: hf-model-cache
mountPath: /home/dynamo/.cache/huggingface/hub
Planner:
componentType: planner
replicas: 1
extraPodSpec:
mainContainer:
image: ${DYNAMO_IMAGE}
command:
- python3
- -m
- dynamo.planner
args:
- --config
- '{"environment":"global-planner","global_planner_namespace":"${K8S_NAMESPACE}-gp-ctrl","backend":"vllm","mode":"disagg","enable_load_scaling":false,"enable_throughput_scaling":true,"throughput_metrics_source":"router","ttft":2000,"itl":200,"max_gpu_budget":-1,"prefill_engine_num_gpu":1,"decode_engine_num_gpu":1,"model_name":"${MODEL_B}","profile_results_dir":"/workspace/tests/planner/profiling_results/H200_TP1P_TP1D"}'
@@ -13,8 +13,8 @@
# DGD gp-decode-1: LocalRouter + MockerDecode + Planner
#
# Usage:
# envsubst < hplanner-mocker-test.yaml | kubectl apply -n ${K8S_NAMESPACE} -f - # envsubst < global-planner-mocker-test.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
# envsubst < hplanner-mocker-test.yaml | kubectl delete -n ${K8S_NAMESPACE} -f - # envsubst < global-planner-mocker-test.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
apiVersion: v1
kind: ConfigMap
metadata:
......
@@ -18,8 +18,8 @@
#
# Usage:
# export K8S_NAMESPACE=... DYNAMO_IMAGE=... DYNAMO_VLLM_IMAGE=... MODEL_NAME=... STORAGE_CLASS_NAME=...
# envsubst < hplanner-vllm-test.yaml | kubectl apply -n ${K8S_NAMESPACE} -f - # envsubst < global-planner-vllm-test.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
# envsubst < hplanner-vllm-test.yaml | kubectl delete -n ${K8S_NAMESPACE} -f - # envsubst < global-planner-vllm-test.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Hierarchical Planner Example
This example demonstrates a hierarchical routing setup with:
- A **Global Router** that routes to different pools based on request characteristics
- **Local Routers** in each pool namespace
- **Workers** (Mocker for local testing, vLLM for Kubernetes deployment)
## Architecture
```
Frontend (round-robin routing)
|
v
Global Router
(registers as both prefill + decode)
|
+----------------+----------------+
| | |
v v v
Prefill Pool 0 Prefill Pool 1 Decode Pool 0
(prefill-pool-0) (prefill-pool-1) (decode-pool-0)
| | |
v v v
Local Router Local Router Local Router
| | |
v v v
Worker Worker Worker
(prefill) (prefill) (decode)
```
## Configuration
The `global_router_config.json` defines:
- 2 prefill pools (`prefill-pool-0`, `prefill-pool-1`)
- 1 decode pool (`decode-pool-0`)
- Grid-based pool selection strategy
Pool selection is based on a 2x2 grid:
- **Prefill**: (ISL, TTFT_target) maps to prefill pool index
- **Decode**: (context_length, ITL_target) maps to decode pool index
## Running Locally (with Mocker)
For local testing without GPUs, use the mocker-based script:
```bash
cd examples/hierarchical_planner
./run_example.sh
```
This starts all components in the background and provides instructions for testing.
## Kubernetes Deployment (with vLLM)
The `vllm-2p1d.yaml` file provides a multi-DGD deployment with real vLLM workers (1 GPU each).
### Prerequisites
- Kubernetes cluster with GPU nodes
- `hf-token-secret` secret containing your HuggingFace token
- The Dynamo operator installed
### Deployment
The YAML uses environment variable placeholders:
- `${K8S_NAMESPACE}` - Your Kubernetes namespace
- `${VLLM_IMAGE}` - Dynamo vLLM runtime container image
Use `envsubst` to substitute these before applying:
```bash
# Set your Kubernetes namespace and image
export K8S_NAMESPACE=<your-k8s-namespace>
export VLLM_IMAGE=<dynamo-vllm-image>
# Deploy all DGDs
envsubst < vllm-2p1d.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
```
### Verify Deployment
```bash
# Check DGD status
kubectl get dgd -n ${K8S_NAMESPACE}
# Check pods
kubectl get pods -n ${K8S_NAMESPACE}
# Check logs for a specific component
kubectl logs -n ${K8S_NAMESPACE} -l nvidia.com/dynamo-component=Frontend
```
### Cleanup
```bash
export K8S_NAMESPACE=<your-k8s-namespace>
export VLLM_IMAGE=<dynamo-vllm-image>
envsubst < vllm-2p1d.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
```
### Namespace Convention
The Dynamo operator prepends the Kubernetes namespace to the `dynamoNamespace` field:
- K8s namespace: `my-namespace`
- `dynamoNamespace: prefill-pool-0`
- Actual Dynamo namespace: `my-namespace-prefill-pool-0`
This is why the global router config and local router endpoints must use the full namespace path.
## Testing
Once all components are running, send a request to the frontend:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"max_tokens": 50,
"stream": true
}'
```
For Kubernetes, port-forward the frontend service first:
```bash
kubectl port-forward -n ${K8S_NAMESPACE} svc/hierarchical-frontend-frontend 8000:8000
```
## Request Flow
1. Request arrives at **Frontend**
2. Frontend's `PrefillRouter` detects both prefill and decode registered for the model
3. Frontend sends prefill request to **Global Router** (registered as prefill)
4. Global Router selects prefill pool based on (ISL, TTFT_target) grid
5. Request forwarded to **Local Router** in selected prefill pool namespace
6. Local Router forwards to **Worker** (prefill mode)
7. Prefill response returns with `disaggregated_params`
8. Frontend sends decode request to **Global Router** (registered as decode)
9. Global Router selects decode pool based on (context_length, ITL_target) grid
10. Tokens stream back through the chain
## Customizing Pool Selection
Edit `global_router_config.json` (or the ConfigMap in `vllm-2p1d.yaml`) to change:
- **Number of pools**: Adjust `num_prefill_pools`, `num_decode_pools` and corresponding namespace lists
- **Selection grid**: Modify `isl_resolution`, `ttft_resolution` etc. to change grid granularity
- **Pool mapping**: Edit `prefill_pool_mapping` and `decode_pool_mapping` matrices
Example: To always route to pool 0 regardless of request characteristics:
```json
"prefill_pool_mapping": [[0, 0], [0, 0]]
```
## SLA Planner with GlobalPlanner
Each pool can run an SLA Planner that reads throughput metrics and delegates autoscaling decisions
to a central **GlobalPlanner** service. The GlobalPlanner arbitrates across pools and executes
scaling via the Dynamo operator.
### Architecture with SLA Planners
```
Frontend (round-robin)
|
v
Global Router ─── GlobalPlanner ◄─── scale decisions from pool planners
|
+──────────────────────────────────────+
| | |
Prefill Pool 0 Prefill Pool 1 Decode Pool 0
LocalRouter LocalRouter LocalRouter
Worker Worker Worker
Planner ──────► Planner ──────► Planner ──────► (all → GlobalPlanner)
```
### SLA Planner configuration
The SLA Planner is configured via a JSON blob passed to `--config`. Key fields for the
global-planner environment:
| Field | Description |
|---|---|
| `environment` | `"global-planner"` to delegate scaling to GlobalPlanner |
| `global_planner_namespace` | Dynamo namespace of the DGD running GlobalPlanner |
| `mode` | `"prefill"` or `"decode"` |
| `throughput_metrics_source` | `"frontend"` (default) or `"router"` — see below |
### `throughput_metrics_source`
Controls where the SLA Planner reads aggregate throughput metrics (TTFT, ITL, request rate):
- **`frontend`** (default): reads `dynamo_frontend_*` histograms from the frontend service. Works
for single-DGD disagg deployments where the planner and frontend share a namespace.
- **`router`**: reads `dynamo_component_router_*` histograms emitted by LocalRouter pods and
scraped by cluster Prometheus. Required for hierarchical (multi-DGD) disagg deployments where
the SLA Planner runs in a pool DGD namespace that is different from the frontend DGD namespace.
Use `throughput_metrics_source: "router"` whenever the planner is co-located with a pool
(not the frontend), i.e. in any GlobalPlanner setup.
### Prometheus scraping for router metrics
The Dynamo operator Helm chart includes a PodMonitor that scrapes LocalRouter pods on port 9090.
LocalRouter pods must expose metrics on that port via:
```yaml
env:
- name: DYN_SYSTEM_PORT
value: "9090"
```
No standalone Prometheus is needed — the cluster-wide Prometheus picks up the PodMonitor
automatically.
### GlobalPlanner `--no-operation` mode
Pass `--no-operation` to GlobalPlanner to receive and log scale requests without executing them.
Useful for observing planner behaviour before enabling live scaling:
```yaml
command: [python3, -m, dynamo.global_planner]
args: [--no-operation]
```
### Example deployments
Complete end-to-end examples are in `examples/backends/`:
| File | Description |
|---|---|
| `mocker/deploy/hplanner-mocker-test.yaml` | 2 prefill + 2 decode pools with Mocker workers; GlobalPlanner in no-op mode |
| `vllm/deploy/hplanner-vllm-test.yaml` | 2 prefill (TP1, TP2) + 1 decode pool with real vLLM workers |
Both use `envsubst` for substituting `${K8S_NAMESPACE}`, `${DYNAMO_IMAGE}`, etc.
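
For readers who want to preview the substitution without `envsubst`, the following is a rough Python analogue using the standard library's `string.Template`, which understands the same `${NAME}` syntax. It is only a sketch: the helper name and the sample value are assumptions, and unlike `envsubst`, `safe_substitute` leaves unknown placeholders in place rather than replacing them with an empty string.

```python
from string import Template
from typing import Mapping


def render_manifest(text: str, variables: Mapping[str, str]) -> str:
    """Replace ${NAME} placeholders in `text` with values from `variables`."""
    # safe_substitute keeps placeholders it has no value for instead of raising.
    return Template(text).safe_substitute(variables)


# A line in the style of the example manifests; the namespace value is illustrative.
manifest_line = "- ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate"
rendered = render_manifest(manifest_line, {"K8S_NAMESPACE": "my-namespace"})
print(rendered)
```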
{
"num_prefill_pools": 2,
"num_decode_pools": 1,
"prefill_pool_dynamo_namespaces": ["prefill_pool_0", "prefill_pool_1"],
"decode_pool_dynamo_namespaces": ["decode_pool_0"],
"prefill_pool_selection_strategy": {
"ttft_min": 10,
"ttft_max": 1000,
"ttft_resolution": 2,
"isl_min": 0,
"isl_max": 32000,
"isl_resolution": 2,
"prefill_pool_mapping": [[0,1],[0,1]]
},
"decode_pool_selection_strategy": {
"itl_min": 10,
"itl_max": 100,
"itl_resolution": 2,
"context_length_min": 0,
"context_length_max": 32000,
"context_length_resolution": 2,
"decode_pool_mapping": [[0,0],[0,0]]
}
}
\ No newline at end of file
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Hierarchical Planner Example
# Run each command in a separate terminal, in order from bottom to top.
# Wait a few seconds between starting each component.
# ============================================================================
# frontend + global_router
# ============================================================================
# A namespace must be specified so that the mockers are not registered to the frontend.
# "dynamo" cannot be used because it is reserved for all namespaces.
python -m dynamo.frontend \
--router-mode round-robin \
--namespace hierarchical &
python -m dynamo.global_router \
--config examples/hierarchical_planner/global_router_config.json \
--model-name Qwen/Qwen3-0.6B \
--default-ttft-target 100 \
--default-itl-target 10 \
--namespace hierarchical &
# ============================================================================
# prefill_pool_0 - local router + mocker worker (prefill)
# ============================================================================
DYN_NAMESPACE=prefill_pool_0 python -m dynamo.router \
--endpoint prefill_pool_0.worker.generate \
--router-block-size 16 \
--no-router-track-active-blocks & # prefill router does not need to track active blocks
python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \
--endpoint dyn://prefill_pool_0.worker.generate \
--disaggregation-mode prefill \
--block-size 16 &
# ============================================================================
# prefill_pool_1 - local router + mocker worker (prefill)
# ============================================================================
DYN_NAMESPACE=prefill_pool_1 python -m dynamo.router \
--endpoint prefill_pool_1.worker.generate \
--router-block-size 16 \
--no-router-track-active-blocks & # prefill router does not need to track active blocks
python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \
--endpoint dyn://prefill_pool_1.worker.generate \
--disaggregation-mode prefill \
--block-size 16 &
# ============================================================================
# decode_pool_0 - local router + mocker worker (decode)
# ============================================================================
DYN_NAMESPACE=decode_pool_0 python -m dynamo.router \
--endpoint decode_pool_0.worker.generate \
--router-block-size 16 \
--router-kv-overlap-score-weight 0 &
python -m dynamo.mocker \
--model-path Qwen/Qwen3-0.6B \
--endpoint dyn://decode_pool_0.worker.generate \
--disaggregation-mode decode \
--block-size 16 &
# ============================================================================
# test request
# ============================================================================
# wait for all components to start
# curl -X POST http://localhost:8000/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "Qwen/Qwen3-0.6B",
# "messages": [{"role": "user", "content": "Hello!"}],
# "max_tokens": 50,
# "stream": true
# }'
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Multi-DGD deployment for hierarchical planner example with vLLM workers
# Architecture:
# DGD 1 (hierarchical): Frontend + GlobalRouter
# DGD 2 (prefill-pool-0): Local Router + vLLM Prefill Worker (1 GPU)
# DGD 3 (prefill-pool-1): Local Router + vLLM Prefill Worker (1 GPU)
# DGD 4 (decode-pool-0): Local Router + vLLM Decode Worker (1 GPU)
#
# IMPORTANT: This file uses ${K8S_NAMESPACE} as a placeholder for the Kubernetes namespace.
# The K8s operator prepends the K8s namespace to the Dynamo namespace.
# For example, if K8S_NAMESPACE="my-namespace" and dynamoNamespace is "prefill-pool-0",
# the actual Dynamo namespace becomes "my-namespace-prefill-pool-0".
#
# vLLM workers register at:
# - Prefill: <namespace>.prefill.generate
# - Decode: <namespace>.backend.generate
#
# USAGE: See README.md for deployment instructions using envsubst.
# =============================================================================
# ConfigMap for global router configuration
# =============================================================================
apiVersion: v1
kind: ConfigMap
metadata:
name: hierarchical-global-router-config
data:
global_router_config.json: |
{
"num_prefill_pools": 2,
"num_decode_pools": 1,
"prefill_pool_dynamo_namespaces": ["${K8S_NAMESPACE}-prefill-pool-0", "${K8S_NAMESPACE}-prefill-pool-1"],
"decode_pool_dynamo_namespaces": ["${K8S_NAMESPACE}-decode-pool-0"],
"prefill_pool_selection_strategy": {
"ttft_min": 10,
"ttft_max": 1000,
"ttft_resolution": 2,
"isl_min": 0,
"isl_max": 32000,
"isl_resolution": 2,
"prefill_pool_mapping": [[0,1],[0,1]]
},
"decode_pool_selection_strategy": {
"itl_min": 10,
"itl_max": 100,
"itl_resolution": 2,
"context_length_min": 0,
"context_length_max": 32000,
"context_length_resolution": 2,
"decode_pool_mapping": [[0,0],[0,0]]
}
}
---
# =============================================================================
# DGD 1: Frontend + Global Router (namespace: hierarchical)
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: hierarchical-frontend
spec:
envs:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
key: HF_TOKEN
name: hf-token-secret
services:
Frontend:
componentType: frontend
dynamoNamespace: hierarchical
extraPodSpec:
mainContainer:
args:
- --router-mode
- round-robin
- --namespace
- ${K8S_NAMESPACE}-hierarchical
command:
- python
- -m
- dynamo.frontend
image: ${VLLM_IMAGE}
workingDir: /workspace
replicas: 1
GlobalRouter:
componentType: default
dynamoNamespace: hierarchical
extraPodSpec:
mainContainer:
args:
- --config
- /workspace/config/global_router_config.json
- --model-name
- Qwen/Qwen3-0.6B
- --default-ttft-target
- "100"
- --default-itl-target
- "10"
- --namespace
- ${K8S_NAMESPACE}-hierarchical
command:
- python
- -m
- dynamo.global_router
image: ${VLLM_IMAGE}
workingDir: /workspace
volumeMounts:
- mountPath: /workspace/config
name: global-router-config
readOnly: true
volumes:
- configMap:
name: hierarchical-global-router-config
name: global-router-config
replicas: 1
---
# =============================================================================
# DGD 2: Prefill Pool 0 - Local Router + vLLM Worker (namespace: prefill-pool-0)
# Actual Dynamo namespace: ${K8S_NAMESPACE}-prefill-pool-0
# vLLM prefill worker registers at: ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: prefill-pool-0
spec:
envs:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
key: HF_TOKEN
name: hf-token-secret
services:
LocalRouter:
componentType: default
dynamoNamespace: prefill-pool-0
extraPodSpec:
mainContainer:
args:
- --endpoint
- ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate
- --router-block-size
- "16"
- --no-router-track-active-blocks
command:
- python
- -m
- dynamo.router
image: ${VLLM_IMAGE}
workingDir: /workspace
replicas: 1
VllmPrefillWorker:
componentType: worker
subComponentType: prefill
dynamoNamespace: prefill-pool-0
envFromSecret: hf-token-secret
extraPodSpec:
mainContainer:
args:
- --model
- Qwen/Qwen3-0.6B
- --disaggregation-mode
- prefill
- --tensor-parallel-size
- "1"
- --gpu-memory-utilization
- "0.90"
- --block-size
- "16"
command:
- python3
- -m
- dynamo.vllm
image: ${VLLM_IMAGE}
workingDir: /workspace
replicas: 1
resources:
limits:
gpu: "1"
requests:
gpu: "1"
---
# =============================================================================
# DGD 3: Prefill Pool 1 - Local Router + vLLM Worker (namespace: prefill-pool-1)
# Actual Dynamo namespace: ${K8S_NAMESPACE}-prefill-pool-1
# vLLM prefill worker registers at: ${K8S_NAMESPACE}-prefill-pool-1.prefill.generate
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: prefill-pool-1
spec:
envs:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
key: HF_TOKEN
name: hf-token-secret
services:
LocalRouter:
componentType: default
dynamoNamespace: prefill-pool-1
extraPodSpec:
mainContainer:
args:
- --endpoint
- ${K8S_NAMESPACE}-prefill-pool-1.prefill.generate
- --router-block-size
- "16"
- --no-router-track-active-blocks
command:
- python
- -m
- dynamo.router
image: ${VLLM_IMAGE}
workingDir: /workspace
replicas: 1
VllmPrefillWorker:
componentType: worker
subComponentType: prefill
dynamoNamespace: prefill-pool-1
envFromSecret: hf-token-secret
extraPodSpec:
mainContainer:
args:
- --model
- Qwen/Qwen3-0.6B
- --disaggregation-mode
- prefill
- --tensor-parallel-size
- "1"
- --gpu-memory-utilization
- "0.90"
- --block-size
- "16"
command:
- python3
- -m
- dynamo.vllm
image: ${VLLM_IMAGE}
workingDir: /workspace
replicas: 1
resources:
limits:
gpu: "1"
requests:
gpu: "1"
---
# =============================================================================
# DGD 4: Decode Pool 0 - Local Router + vLLM Worker (namespace: decode-pool-0)
# Actual Dynamo namespace: ${K8S_NAMESPACE}-decode-pool-0
# vLLM decode worker registers at: ${K8S_NAMESPACE}-decode-pool-0.backend.generate
# =============================================================================
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: decode-pool-0
spec:
envs:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
key: HF_TOKEN
name: hf-token-secret
services:
LocalRouter:
componentType: default
dynamoNamespace: decode-pool-0
extraPodSpec:
mainContainer:
args:
- --endpoint
- ${K8S_NAMESPACE}-decode-pool-0.backend.generate
- --router-block-size
- "16"
- --router-kv-overlap-score-weight
- "0"
command:
- python
- -m
- dynamo.router
image: ${VLLM_IMAGE}
workingDir: /workspace
replicas: 1
VllmDecodeWorker:
componentType: worker
subComponentType: decode
dynamoNamespace: decode-pool-0
envFromSecret: hf-token-secret
extraPodSpec:
mainContainer:
args:
- --model
- Qwen/Qwen3-0.6B
- --tensor-parallel-size
- "1"
- --gpu-memory-utilization
- "0.90"
- --block-size
- "16"
command:
- python3
- -m
- dynamo.vllm
image: ${VLLM_IMAGE}
workingDir: /workspace
replicas: 1
resources:
limits:
gpu: "1"
requests:
gpu: "1"