feat: GlobalPlanner --max-total-gpus for cluster-wide GPU budget (#7103)

Signed-off-by: Daiyaan <darfeen@nvidia.com>
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: Anish <80174047+athreesh@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: athreesh <anish.maddipoti@utexas.edu>
Co-authored-by: Anish <80174047+athreesh@users.noreply.github.com>

Unverified Commit 5d5fd243 authored Mar 11, 2026 by daiyaanarfeen Committed by GitHub Mar 11, 2026
20 changed files
--- a/CODEOWNERS
+++ b/CODEOWNERS
@@ -31,7 +31,7 @@ CODEOWNERS @ai-dynamo/Devops
 /components/src/dynamo/planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
 /components/src/dynamo/global_router/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
 /components/src/dynamo/global_planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
-/examples/hierarchical_planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
+/examples/global_planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
 /components/src/dynamo/profiler/ @ai-dynamo/python-codeowners @ai-dynamo/Devops
 /tests/planner/ @ai-dynamo/python-codeowners @ai-dynamo/Devops

--- a/components/src/dynamo/global_planner/README.md
+++ b/components/src/dynamo/global_planner/README.md
@@ -3,10 +3,39 @@
 # Global Planner
-Centralized scaling execution service for hierarchical planner deployments.
+Centralized scaling execution service for multi-DGD planner deployments.
-The Global Planner receives scaling decisions from distributed planners and executes
-replica updates against Kubernetes `DynamoGraphDeployment` resources.
+The Global Planner receives scaling decisions from local planners and executes
+replica updates against Kubernetes `DynamoGraphDeployment` resources. It is useful
+whenever multiple DGDs should delegate scaling through one centralized component,
+whether or not those DGDs sit behind a single shared endpoint.
+## What Problem This Solves
+Without `GlobalPlanner`, each DGD's local planner scales only its own deployment directly.
+That is fine for isolated deployments, but it becomes awkward when you want one place to:
+- apply centralized scaling policies across multiple DGDs
+- enforce shared constraints such as authorization or total GPU budget
+- coordinate scaling for a single-endpoint, multi-pool deployment
+`GlobalPlanner` solves that by becoming the common scale-execution endpoint for multiple local planners.
+## Deployment Patterns
+`GlobalPlanner` is used in two common patterns:
+1. **Centralized scaling across independent DGDs**
+   Each DGD keeps its own normal local planner, but the local planners delegate scale execution to one `GlobalPlanner`. This is useful when separate deployments or models should share a global policy such as a total GPU budget. You do **not** need `GlobalRouter` or a single shared endpoint for this pattern.
+2. **Hierarchical single-endpoint deployment**
+   Multiple pool DGDs for one model sit behind one public `Frontend` and one `GlobalRouter`. Each pool still has its own local planner, and those local planners delegate scaling to `GlobalPlanner`.
+## Terminology
+- **SLA Planner**: The normal `dynamo.planner` component that computes desired replica counts from SLA targets, profiles, and/or metrics.
+- **Local planner**: An instance of that planner running inside one DGD or one pool.
+- **GlobalPlanner**: The centralized execution and policy layer that receives scale requests from local planners and applies them to target DGDs.
+- **Hierarchical planner**: An architecture term, not a separate binary. In practice it means multiple local planners feeding one `GlobalPlanner`, often together with `GlobalRouter`.
 ## Overview
@@ -50,6 +79,11 @@ DYN_NAMESPACE=global-infra python -m dynamo.global_planner \
 DYN_NAMESPACE=global-infra python -m dynamo.global_planner --no-operation
 ```
+```bash
+# Enforce a maximum total GPU budget across managed pools
+DYN_NAMESPACE=global-infra python -m dynamo.global_planner --max-total-gpus 16
+```
 ### Arguments
 Required environment variables:
@@ -65,6 +99,7 @@ CLI arguments:
 - `--managed-namespaces <ns1> <ns2> ...`: Allowlist for `caller_namespace`. If omitted, accepts all namespaces.
 - `--environment kubernetes`: Execution environment (currently only `kubernetes` is supported).
 - `--no-operation`: Log incoming scale requests and return success without applying Kubernetes scaling.
+- `--max-total-gpus <n>`: Reject scale requests that would push the managed pools above the configured total GPU cap.
 ## Scale Request Contract
@@ -100,6 +135,7 @@ Response fields:
 ## Related Documentation
 - [Planner Guide](../../../../docs/components/planner/planner-guide.md) — Planner configuration and deployment workflow
+- [Global Planner Deployment Guide](../../../../docs/components/planner/global-planner.md) — Deployment patterns for `GlobalPlanner`, including multi-model coordination and single-endpoint multi-pool workflows
 - [Planner Design](../../../../docs/design-docs/planner-design.md) — Planner architecture and algorithms
 Planners delegate to this service when planner config uses `environment: "global-planner"` and sets `global_planner_namespace`.
\ No newline at end of file
--- a/components/src/dynamo/global_planner/__main__.py
+++ b/components/src/dynamo/global_planner/__main__.py
@@ -76,6 +76,11 @@ async def main(runtime: DistributedRuntime, args):
    else:
        logger.info("No-operation mode: DISABLED")
+    if args.max_total_gpus >= 0:
+        logger.info(f"Max total GPUs: {args.max_total_gpus}")
+    else:
+        logger.info("Max total GPUs: UNLIMITED")
    logger.info("=" * 60)
    # Get K8s namespace (where GlobalPlanner pod is running)
@@ -88,6 +93,7 @@ async def main(runtime: DistributedRuntime, args):
        managed_namespaces=args.managed_namespaces,
        k8s_namespace=k8s_namespace,
        no_operation=args.no_operation,
+        max_total_gpus=args.max_total_gpus,
    )
    # Serve scale_request endpoint

--- a/components/src/dynamo/global_planner/argparse_config.py
+++ b/components/src/dynamo/global_planner/argparse_config.py
@@ -53,4 +53,12 @@ Examples:
        help="Log incoming scale requests without executing them (useful for testing the e2e flow without actual K8s scaling)",
    )
+    parser.add_argument(
+        "--max-total-gpus",
+        type=int,
+        default=-1,
+        dest="max_total_gpus",
+        help="Maximum total GPUs across all managed pools. Requests that would exceed this limit are rejected. 0 means no GPU scaling is allowed. -1 (default) disables enforcement entirely.",
+    )
    return parser
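
Reviewer note: the three value ranges of `--max-total-gpus` are easy to misread, since `0` and `-1` behave very differently. The sketch below mirrors the `add_argument` call from this diff in a stand-alone parser. `build_parser` and `describe_budget` are hypothetical names used only for illustration and are not part of the commit.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser exposing only the GPU budget flag from this diff."""
    parser = argparse.ArgumentParser(prog="global_planner_sketch")
    parser.add_argument(
        "--max-total-gpus",
        type=int,
        default=-1,
        dest="max_total_gpus",
        help="Total GPU cap; 0 allows no GPUs, -1 disables enforcement.",
    )
    return parser


def describe_budget(max_total_gpus: int) -> str:
    """Hypothetical helper: summarize how a flag value is interpreted.

    The handler in this commit enables enforcement when the value is >= 0,
    so any negative value behaves like the default of -1.
    """
    if max_total_gpus < 0:
        return "unlimited"
    if max_total_gpus == 0:
        return "no GPUs allowed"
    return f"capped at {max_total_gpus} GPUs"


parser = build_parser()
default_args = parser.parse_args([])
capped_args = parser.parse_args(["--max-total-gpus", "16"])

print(describe_budget(default_args.max_total_gpus))  # unlimited
print(describe_budget(capped_args.max_total_gpus))   # capped at 16 GPUs
print(describe_budget(0))                            # no GPUs allowed
```

Using `-1` as the "off" sentinel keeps `0` available as a meaningful cap, which can be handy for temporarily freezing all GPU-consuming replicas.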
--- a/components/src/dynamo/global_planner/scale_handler.py
+++ b/components/src/dynamo/global_planner/scale_handler.py
@@ -3,17 +3,16 @@
 """Handler for scale_request endpoint in GlobalPlanner."""
+import asyncio
 import logging
 from dynamo.planner import KubernetesConnector
+from dynamo.planner.kube import KubernetesAPI
 from dynamo.planner.scale_protocol import ScaleRequest, ScaleResponse, ScaleStatus
 from dynamo.runtime import DistributedRuntime, dynamo_endpoint
 logger = logging.getLogger(__name__)
-# Model name used for KubernetesConnector in remote execution mode
-MANAGED_MODEL_NAME = "managed"
 class ScaleRequestHandler:
    """Handles incoming scale requests in GlobalPlanner.
@@ -24,6 +23,14 @@ class ScaleRequestHandler:
    3. Caches KubernetesConnector per DGD for efficiency
    4. Executes scaling via Kubernetes API
    5. Returns current replica counts
+    Management modes:
+    - **Explicit** (``--managed-namespaces`` set): Only DGDs whose Dynamo
+      namespaces are listed are managed. Authorization rejects requests from
+      unlisted namespaces, and GPU budget only counts these DGDs.
+    - **Implicit** (no ``--managed-namespaces``): All DGDs in the Kubernetes
+      namespace are managed. Any caller is accepted, and GPU budget counts
+      every DGD discovered in the namespace.
    """
    def __init__(
@@ -32,6 +39,7 @@ class ScaleRequestHandler:
        managed_namespaces: list,
        k8s_namespace: str,
        no_operation: bool = False,
+        max_total_gpus: int = -1,
    ):
        """Initialize the scale request handler.
@@ -40,6 +48,7 @@ class ScaleRequestHandler:
            managed_namespaces: List of authorized namespaces (None = accept all)
            k8s_namespace: Kubernetes namespace where GlobalPlanner is running
            no_operation: If True, log scale requests without executing K8s scaling
+            max_total_gpus: Maximum total GPUs across all managed pools (-1 = unlimited)
        """
        self.runtime = runtime
        # If managed_namespaces is None, accept all namespaces
@@ -48,7 +57,11 @@ class ScaleRequestHandler:
        )
        self.k8s_namespace = k8s_namespace
        self.no_operation = no_operation
+        self.max_total_gpus = max_total_gpus
        self.connectors = {}  # Cache of KubernetesConnector per DGD
+        # Serializes budget-check + scale-execution so concurrent requests from
+        # different pools cannot both pass against the same pre-scale state.
+        self._scale_lock = asyncio.Lock()
        if self.managed_namespaces:
            logger.info(
@@ -63,6 +76,122 @@ class ScaleRequestHandler:
                "scale requests will be logged but not executed"
            )
+        if self.max_total_gpus >= 0:
+            logger.info(
+                f"GPU budget enforcement ENABLED: max {self.max_total_gpus} total GPUs"
+            )
+            self._populate_k8s_connectors()
+        else:
+            logger.info("GPU budget enforcement DISABLED (unlimited)")
+    def _managed_dgd_names(self) -> set[str] | None:
+        """Derive the DGD names that this GlobalPlanner manages.
+        Returns:
+            A set of DGD names when in explicit mode, or None in implicit mode.
+        The Dynamo operator convention is:
+            DYN_NAMESPACE = "{k8s_namespace}-{dgd_name}"
+        so the DGD name is the Dynamo namespace with the k8s prefix stripped.
+        """
+        if self.managed_namespaces is None:
+            return None
+        prefix = f"{self.k8s_namespace}-"
+        names = set()
+        for ns in self.managed_namespaces:
+            if ns.startswith(prefix):
+                names.add(ns[len(prefix) :])
+            else:
+                logger.warning(
+                    f"Managed namespace '{ns}' does not start with "
+                    f"expected prefix '{prefix}'; cannot derive DGD name"
+                )
+        return names
+    def _populate_k8s_connectors(self):
+        """Pre-populate connectors for DGDs managed by this GlobalPlanner.
+        This ensures the GPU budget calculation accounts for DGDs that already
+        exist at startup, even if they haven't sent a scale request yet.
+        In explicit mode (--managed-namespaces set), only DGDs whose names
+        match the managed Dynamo namespaces are discovered.
+        In implicit mode, all DGDs in the k8s namespace are discovered.
+        """
+        try:
+            kube_api = KubernetesAPI(self.k8s_namespace)
+            managed_names = self._managed_dgd_names()
+            dgds = kube_api.list_graph_deployments()
+            discovered = []
+            for dgd in dgds:
+                name = dgd.get("metadata", {}).get("name", "")
+                if not name:
+                    continue
+                # In explicit mode, skip DGDs not in the managed set
+                if managed_names is not None and name not in managed_names:
+                    continue
+                connector_key = f"{self.k8s_namespace}/{name}"
+                if connector_key not in self.connectors:
+                    connector = KubernetesConnector(
+                        dynamo_namespace="discovered",
+                        k8s_namespace=self.k8s_namespace,
+                        parent_dgd_name=name,
+                    )
+                    self.connectors[connector_key] = connector
+                discovered.append(name)
+            logger.info(f"Discovered {len(discovered)} existing DGDs: {discovered}")
+        except Exception as e:
+            logger.warning(f"Failed to discover existing DGDs: {e}")
+    def _calculate_total_gpus_after_request(self, request: ScaleRequest) -> int:
+        """Calculate total GPUs across all managed DGDs if this request is granted.
+        For the requesting DGD, uses the desired replica counts from the request.
+        For all other known DGDs, uses their current replica counts.
+        NOTE: GPU count is read from spec.services[].resources.limits.gpu only.
+        GPUs specified via resources.requests.gpu or extraPodSpec resource
+        overrides are not counted.
+        """
+        total_gpus = 0
+        requesting_key = f"{request.k8s_namespace}/{request.graph_deployment_name}"
+        for key, connector in self.connectors.items():
+            try:
+                deployment = connector.kube_api.get_graph_deployment(
+                    connector.parent_dgd_name
+                )
+            except Exception as e:
+                logger.warning(f"Failed to read DGD for {key}: {e}")
+                continue
+            services = deployment.get("spec", {}).get("services", {})
+            for svc_spec in services.values():
+                sub_type = svc_spec.get("subComponentType", "")
+                if not sub_type:
+                    continue
+                gpu_per_replica = int(
+                    svc_spec.get("resources", {}).get("limits", {}).get("gpu", 0)
+                )
+                if gpu_per_replica == 0:
+                    continue
+                replicas = svc_spec.get("replicas", 0)
+                # For the requesting DGD, use desired replicas from the request
+                if key == requesting_key:
+                    for target in request.target_replicas:
+                        if target.sub_component_type.value == sub_type:
+                            replicas = target.desired_replicas
+                            break
+                total_gpus += replicas * gpu_per_replica
+        return total_gpus
    @dynamo_endpoint(ScaleRequest, ScaleResponse)
    async def scale_request(self, request: ScaleRequest):
        """Process scaling request from a Planner.
@@ -115,7 +244,6 @@ class ScaleRequestHandler:
            if connector_key not in self.connectors:
                connector = KubernetesConnector(
                    dynamo_namespace=request.caller_namespace,
-                    model_name=MANAGED_MODEL_NAME,  # Not used for remote execution
                    k8s_namespace=request.k8s_namespace,
                    parent_dgd_name=request.graph_deployment_name,
                )
@@ -125,10 +253,35 @@ class ScaleRequestHandler:
                connector = self.connectors[connector_key]
                logger.debug(f"Reusing cached connector for {connector_key}")
-            # Execute scaling (request.target_replicas is already List[TargetReplica])
-            await connector.set_component_replicas(
-                request.target_replicas, blocking=request.blocking
-            )
+            # Lock ensures the budget check and scale execution are atomic
+            # so concurrent requests from different pools cannot both pass
+            # against the same pre-scale replica counts.
+            async with self._scale_lock:
+                # Check GPU budget before scaling
+                if self.max_total_gpus >= 0:
+                    total_gpus = self._calculate_total_gpus_after_request(request)
+                    if total_gpus > self.max_total_gpus:
+                        logger.warning(
+                            f"Rejecting scale request from {request.caller_namespace}: "
+                            f"would use {total_gpus} GPUs, exceeding max of {self.max_total_gpus}"
+                        )
+                        yield {
+                            "status": ScaleStatus.ERROR.value,
+                            "message": (
+                                f"GPU budget exceeded: request would use {total_gpus} total GPUs, "
+                                f"max allowed is {self.max_total_gpus}"
+                            ),
+                            "current_replicas": {},
+                        }
+                        return
+                    logger.info(
+                        f"GPU budget check passed: {total_gpus}/{self.max_total_gpus} GPUs"
+                    )
+                # Execute scaling (request.target_replicas is already List[TargetReplica])
+                await connector.set_component_replicas(
+                    request.target_replicas, blocking=request.blocking
+                )
            # Get current replica counts
            current_replicas = {}
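
Reviewer note: the budget arithmetic and the reason for `_scale_lock` can be checked without a cluster. The sketch below is a simplified model, not the handler itself: it works on plain dicts shaped like the `spec.services` section read above, and `total_gpus_after_request`, `BudgetGate`, the replica counts, and the cap of 10 are all assumptions made for illustration. The `asyncio.sleep(0)` call stands in for the Kubernetes update.

```python
import asyncio


def total_gpus_after_request(deployments, requesting_key, desired):
    """Sum GPUs over all DGDs, using desired replicas for the requesting DGD.

    deployments: {"<k8s_ns>/<dgd_name>": DGD dict}
    desired: {sub_component_type: desired_replicas} for the requesting DGD
    """
    total = 0
    for key, dgd in deployments.items():
        for svc in dgd.get("spec", {}).get("services", {}).values():
            sub_type = svc.get("subComponentType", "")
            gpus = int(svc.get("resources", {}).get("limits", {}).get("gpu", 0))
            if not sub_type or gpus == 0:
                continue  # services without a sub-component type or GPU limit
            replicas = svc.get("replicas", 0)
            if key == requesting_key:
                replicas = desired.get(sub_type, replicas)
            total += replicas * gpus
    return total


class BudgetGate:
    """Hypothetical model of check-then-scale under a single lock."""

    def __init__(self, deployments, max_total_gpus):
        self.deployments = deployments
        self.max_total_gpus = max_total_gpus
        self._lock = asyncio.Lock()

    async def try_scale(self, key, desired):
        async with self._lock:
            total = total_gpus_after_request(self.deployments, key, desired)
            if self.max_total_gpus >= 0 and total > self.max_total_gpus:
                return False
            await asyncio.sleep(0)  # stand-in for the Kubernetes update
            for svc in self.deployments[key]["spec"]["services"].values():
                sub_type = svc.get("subComponentType", "")
                if sub_type in desired:
                    svc["replicas"] = desired[sub_type]
            return True


def pool(sub_type, gpus, replicas):
    """Build a minimal DGD dict with one GPU worker service."""
    worker = {
        "subComponentType": sub_type,
        "replicas": replicas,
        "resources": {"limits": {"gpu": str(gpus)}},
    }
    return {"spec": {"services": {"worker": worker, "Planner": {"replicas": 1}}}}


async def main():
    deployments = {
        "ns/gp-prefill-0": pool("prefill", 1, 2),  # 2 GPUs
        "ns/gp-prefill-1": pool("prefill", 2, 2),  # 4 GPUs
        "ns/gp-decode-0": pool("decode", 1, 2),    # 2 GPUs
    }
    gate = BudgetGate(deployments, max_total_gpus=10)
    results = await asyncio.gather(
        gate.try_scale("ns/gp-prefill-1", {"prefill": 3}),  # 8 -> 10 GPUs
        gate.try_scale("ns/gp-decode-0", {"decode": 4}),    # would reach 12
    )
    final_total = total_gpus_after_request(deployments, "", {})
    return results, final_total


results, final_total = asyncio.run(main())
print(results, final_total)  # [True, False] 10
```

Each request on its own fits the cap of 10 when measured against the starting 8 GPUs, but together they would reach 12. Because the check and the update happen under one lock, the second request is evaluated against the already-updated state and is rejected.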

--- a/components/src/dynamo/global_router/README.md
+++ b/components/src/dynamo/global_router/README.md
@@ -163,7 +163,7 @@ If not provided, the middle of the configured range is used as default.
 ## Example
-See `examples/hierarchical_planner/` for a complete example with:
+See `examples/global_planner/` for a complete example with:
 - Global router configuration
 - Local router setup for each pool
 - Mocker workers for testing
--- a/components/src/dynamo/planner/kube.py
+++ b/components/src/dynamo/planner/kube.py
@@ -58,6 +58,16 @@ class KubernetesAPI:
            name=graph_deployment_name,
        )
+    def list_graph_deployments(self) -> list[dict]:
+        """List all DynamoGraphDeployments in the current namespace."""
+        result = self.custom_api.list_namespaced_custom_object(
+            group="nvidia.com",
+            version="v1alpha1",
+            namespace=self.current_namespace,
+            plural="dynamographdeployments",
+        )
+        return result.get("items", [])
    def get_graph_deployment(self, graph_deployment_name: str) -> dict:
        """
        Get the parent DynamoGraphDeployment
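
Reviewer note: `list_graph_deployments` returns every DGD in the namespace, and the handler then narrows that list in explicit mode using the `{k8s_namespace}-{dgd_name}` convention. The sketch below reproduces that filtering on plain dicts so the effect on the GPU budget scope is easy to see. `managed_dgd_names`, `select_managed`, and the sample names are illustrative assumptions, and the warning the real code logs for non-matching namespaces is omitted.

```python
def managed_dgd_names(managed_namespaces, k8s_namespace):
    """Return DGD names for explicit mode, or None for implicit mode."""
    if managed_namespaces is None:
        return None
    prefix = f"{k8s_namespace}-"
    # Namespaces without the expected prefix cannot be mapped to a DGD name.
    return {ns[len(prefix):] for ns in managed_namespaces if ns.startswith(prefix)}


def select_managed(dgd_items, managed_names):
    """Keep the names of DGDs that count toward the GPU budget."""
    selected = []
    for item in dgd_items:
        name = item.get("metadata", {}).get("name", "")
        if not name:
            continue
        if managed_names is not None and name not in managed_names:
            continue
        selected.append(name)
    return selected


# Shaped like the "items" list returned by the custom objects API.
items = [
    {"metadata": {"name": "gp-ctrl"}},
    {"metadata": {"name": "gp-prefill-0"}},
    {"metadata": {"name": "gp-prefill-1"}},
    {"metadata": {"name": "gp-decode-0"}},
]

explicit = managed_dgd_names(
    ["my-llama-gp-prefill-0", "my-llama-gp-decode-0", "other-team-pool"],
    "my-llama",
)
print(sorted(explicit))                 # ['gp-decode-0', 'gp-prefill-0']
print(select_managed(items, explicit))  # ['gp-prefill-0', 'gp-decode-0']
print(select_managed(items, None))      # all four names
```

In this example `gp-prefill-1` exists in the namespace but is not listed, so in explicit mode its GPUs would not count toward the cap. Listing every pool in `--managed-namespaces` avoids that gap.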

--- a/docs/components/planner/README.md
+++ b/docs/components/planner/README.md
@@ -15,6 +15,8 @@ When both modes are enabled, throughput-based scaling provides a lower bound on
 > **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner-guide.md) for a complete workflow including profiling and deployment.
+> **Need multi-DGD coordination?** See [Global Planner Deployment Guide](global-planner.md). It covers both shared-policy coordination across multiple DGDs and the one-endpoint multi-pool pattern.
 ## Feature Matrix
 | Feature | Throughput-Based | Load-Based (Experimental) |
@@ -84,6 +86,7 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
 | Document | Description |
 |----------|-------------|
 | [Planner Guide](planner-guide.md) | Deployment, configuration, integration, troubleshooting |
+| [Global Planner Deployment Guide](global-planner.md) | When to use `GlobalPlanner`, including multi-model coordination and single-endpoint multi-pool deployments |
 | [Planner Examples](planner-examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
 | [SLA-Driven Profiling](../profiler/profiler-guide.md) | Pre-deployment profiling process and configuration |
 | [Planner Design](../../design-docs/planner-design.md) | Architecture deep-dive for contributors |

--- a/docs/components/planner/global-planner.md
+++ b/docs/components/planner/global-planner.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Global Planner Deployment Guide
+---
+This guide explains how to deploy `GlobalPlanner` and when to use it. `GlobalPlanner` is the centralized scaling execution layer for deployments where multiple DGDs should delegate scaling through one component, whether those DGDs expose separate endpoints or sit behind one shared endpoint.
+## Why Global Planner?
+Without `GlobalPlanner`, each DGD's local planner scales only its own deployment directly. That is fine for isolated deployments, but it becomes awkward when you want one place to:
+- apply centralized scaling policy across multiple DGDs
+- enforce shared constraints such as authorization or total GPU budget
+- coordinate scaling for a single-endpoint, multi-pool deployment
+`GlobalPlanner` solves that by becoming the common scale-execution endpoint for multiple local planners.
+## Terminology
+- **SLA Planner**: The normal `dynamo.planner` component that computes desired replica counts to maintain SLAs.
+- **Local Planner**: A pool-local instance of an SLA planner inside one DGD.
+- **Global Planner**: The centralized execution and policy layer that receives scale requests from local planners.
+- **Single-endpoint multi-pool deployment**: One model endpoint backed by multiple DGDs for the same model. This pattern uses both `GlobalRouter` and `GlobalPlanner`.
+## Deployment Patterns
+Use `GlobalPlanner` in one of these two patterns:
+| Pattern | Use when | Needs `GlobalRouter` | Public endpoint shape |
+|---------|----------|----------------------|-----------------------|
+| Multiple model endpoints or independent DGDs | Separate DGDs should share centralized scaling policy, such as authorization or total GPU budget | No | One endpoint per DGD, or however each DGD is exposed |
+| One model endpoint, multiple DGDs | One model should be reachable through one public endpoint, but different request classes should land on different DGDs | Yes | One shared endpoint |
+## Pattern 1: Multiple Model Endpoints Or Independent DGDs
+Use this pattern when you have multiple DGDs, often for different models, and you want them to share centralized scaling policy without collapsing them into one endpoint.
+Typical examples:
+- DGD A: `qwen-0.6b` disaggregated deployment with its own local planner
+- DGD B: `qwen-32b` disaggregated deployment with its own local planner
+- one shared `GlobalPlanner` that all local planners delegate to
+In this pattern:
+- each DGD keeps its own normal local planner
+- each local planner is configured with `environment: "global-planner"`
+- all those planners point at the same `global_planner_namespace`
+- each DGD keeps its own endpoint or frontend as needed
+- you do **not** need `GlobalRouter`
+This is the pattern to use when the goal is centralized scaling control across multiple deployments or models.
+## Pattern 2: One Model Endpoint, Multiple DGDs
+Use this pattern when all of the following are true:
+- You want one public endpoint for a single model.
+- You want different private pools for different request classes, such as short ISL vs. long ISL requests, or different latency targets.
+- You want each pool to autoscale independently.
+- You want routing and scale execution to be centralized instead of exposing multiple endpoints to clients.
+Typical examples:
+- short-input requests are cheaper on a smaller prefill pool
+- long-input requests need a larger prefill pool
+- decode capacity should scale independently from prefill capacity
+If you only need one pool for one model, use a single Local Planner and DGD/DGDR instead.
+## What You Deploy
+In the current implementation, the single-endpoint pattern is composed from multiple resources:
+| Resource | Purpose | Typical contents |
+|----------|---------|------------------|
+| Control DGD | Public entrypoint and centralized control plane | `Frontend`, `GlobalRouter`, `GlobalPlanner` |
+| Prefill pool DGD(s) | Private prefill capacity pools | `LocalRouter`, prefill worker(s), `Planner` |
+| Decode pool DGD(s) | Private decode capacity pools | `LocalRouter`, decode worker(s), `Planner` |
+| Optional DGDR(s) | Generate or validate one optimized pool shape at a time | Model, workload, SLA, hardware inputs |
+> **Current workflow**
+>
+> A single DGDR does **not** generate the full single-endpoint multi-pool topology today. Instead, run one DGDR or profiling job per intended pool, then compose the final control DGD plus pool DGDs manually.
+## Architecture
+```text
+Client
+  |
+  v
+Frontend (single public endpoint)
+  |
+  v
+GlobalRouter
+  |
+  +--> Prefill pool 0 Dynamo namespace --> LocalRouter --> Prefill workers --> Pool Planner
+  +--> Prefill pool 1 Dynamo namespace --> LocalRouter --> Prefill workers --> Pool Planner
+  |
+  +--> Decode pool 0 Dynamo namespace  --> LocalRouter --> Decode workers  --> Pool Planner
+  +--> Decode pool 1 Dynamo namespace  --> LocalRouter --> Decode workers  --> Pool Planner
+Pool Planners
+  |
+  v
+GlobalPlanner
+  |
+  v
+Kubernetes scaling updates on the target DGDs
+```
+The `Frontend` exposes a single model endpoint. `GlobalRouter` selects the best pool for each request. Each pool-local `Planner` decides how much capacity its own pool needs. `GlobalPlanner` receives those scale requests and applies the Kubernetes replica changes centrally.
+## Prerequisites
+- Dynamo Kubernetes Platform installed. See [Kubernetes Quickstart](../../kubernetes/README.md).
+- Prometheus deployed and scraping router metrics. The global planner examples assume cluster Prometheus is available.
+- Backend images available for your chosen framework (`vllm`, `sglang`, or `trtllm`).
+- Secrets for model access, such as a Hugging Face token secret.
+- A storage strategy for model weights if your workers should share a model cache PVC.
+For throughput-based scaling, you also need profiling data for each pool. See [Profiler Guide](../profiler/profiler-guide.md).
+## Inputs You Need To Decide Up Front
+Before writing manifests, decide the following:
+| Input | Why it matters | Example |
+|-------|----------------|---------|
+| Model name | All pools in one hierarchy serve the same model | `meta-llama/Llama-3.3-70B-Instruct` |
+| Backend | Worker args and profiling flow depend on it | `vllm` |
+| Pool inventory | Number of specialized prefill and decode pools | 2 prefill pools, 1 decode pool |
+| Workload classes | Determines how many pool profiles you generate | short ISL, long ISL, long context decode |
+| SLA targets | Guides profiling and routing decisions | `ttft: 200 ms`, `itl: 20 ms` |
+| Worker shape | Tensor parallelism, GPUs per worker, and memory footprint | TP1 prefill vs. TP2 prefill |
+| Routing policy | Maps requests to pools at runtime | low-ISL requests -> pool 0 |
+| Optional global budget | Caps total GPUs across managed pools | `--max-total-gpus 16` |
+## Step 1: Profile Each Intended Pool Independently
+Start by deciding what each pool should specialize in. Common examples:
+- Prefill pool 0: lower-cost pool for short prompts.
+- Prefill pool 1: larger pool for long prompts.
+- Decode pool 0: standard decode pool for most requests.
+For each intended pool, run a separate DGDR or profiling job with the workload and SLA that represent that pool.
+Example DGDR skeleton:
+```yaml
+apiVersion: nvidia.com/v1beta1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: llama-prefill-short
+spec:
+  model: meta-llama/Llama-3.3-70B-Instruct
+  backend: vllm
+  image: nvcr.io/nvidia/ai-dynamo/dynamo-frontend:<tag>
+  workload:
+    isl: 2048
+    osl: 256
+  sla:
+    ttft: 200.0
+    itl: 20.0
+  searchStrategy: rapid
+  autoApply: false
+```
+Repeat this once per planned pool, changing the workload and SLA inputs for each request class.
+What to keep from each profiling result:
+- Worker shape (`tensor-parallel-size`, GPUs per worker, memory/caching settings).
+- Planner profile data directory or generated ConfigMaps.
+- Planner settings such as `prefill_engine_num_gpu` or `decode_engine_num_gpu`.
+- Any backend-specific flags that differ across pools.
+See [Planner Examples](planner-examples.md) and [Profiler Guide](../profiler/profiler-guide.md) for DGDR details.
+## Step 2: Create The Control DGD
+Deploy one control DGD that contains:
+- `Frontend`: the single public model endpoint.
+- `GlobalRouter`: chooses which pool receives each request.
+- `GlobalPlanner`: receives scale requests from pool planners and applies replica changes.
+The vLLM example topology is in [examples/global_planner/global-planner-vllm-test.yaml](../../../examples/global_planner/global-planner-vllm-test.yaml).
+The `GlobalPlanner` section is minimal:
+```yaml
+GlobalPlanner:
+  componentType: default
+  replicas: 1
+  extraPodSpec:
+    mainContainer:
+      image: ${DYNAMO_IMAGE}
+      command:
+        - python3
+        - -m
+        - dynamo.global_planner
+      args:
+        - --managed-namespaces
+        - ${K8S_NAMESPACE}-gp-prefill-0
+        - ${K8S_NAMESPACE}-gp-prefill-1
+        - ${K8S_NAMESPACE}-gp-decode-0
+```
+The values passed to `--managed-namespaces` are the pool planners' **Dynamo namespaces** (`caller_namespace`), not raw Kubernetes namespaces. In many examples they share the same string prefix, but they are logically different identifiers.
+**Management modes**: When `--managed-namespaces` is set (explicit mode), only the listed Dynamo namespaces are authorized to send scale requests, and only their corresponding DGDs count toward the GPU budget. DGD names are derived from the Dynamo namespace using the operator convention `DYN_NAMESPACE = {k8s_namespace}-{dgd_name}`. When omitted (implicit mode), any caller is accepted and all DGDs in the Kubernetes namespace count toward the GPU budget.
+If you want the central executor to reject scale requests that exceed a total GPU budget, add `--max-total-gpus`. See [examples/global_planner/global-planner-gpu-budget.yaml](../../../examples/global_planner/global-planner-gpu-budget.yaml).
+## Step 3: Create One DGD Per Pool
+Each private pool gets its own DGD. A pool DGD usually contains:
+- `LocalRouter`
+- one worker type (`prefill` or `decode`)
+- one `Planner`
+The planner inside each pool must be configured for `global-planner` mode so it delegates scaling to the control stack:
+```json
+{
+  "environment": "global-planner",
+  "global_planner_namespace": "${K8S_NAMESPACE}-gp-ctrl",
+  "backend": "vllm",
+  "mode": "prefill",
+  "enable_load_scaling": false,
+  "enable_throughput_scaling": true,
+  "throughput_metrics_source": "router",
+  "ttft": 2000,
+  "prefill_engine_num_gpu": 2,
+  "model_name": "${MODEL_NAME}",
+  "profile_results_dir": "/workspace/tests/planner/profiling_results/H200_TP1P_TP1D"
+}
+```
+`global_planner_namespace` must point to the control stack's **Dynamo namespace**. In the reference manifests, that is the namespace string passed to the control `Frontend` and `GlobalRouter`.
+Use:
+- `mode: "prefill"` for prefill-only pools
+- `mode: "decode"` for decode-only pools
+The worker and planner settings for each pool come from the pool-specific profiling result you created in Step 1.
+In the reference vLLM example:
+- `gp-prefill-0` uses a 1-GPU TP1 prefill worker
+- `gp-prefill-1` uses a 2-GPU TP2 prefill worker
+- `gp-decode-0` uses a 1-GPU TP1 decode worker
+See [global-planner-vllm-test.yaml](../../../examples/global_planner/global-planner-vllm-test.yaml).
+## Step 4: Configure GlobalRouter To Select Pools
+`GlobalRouter` reads a JSON config that lists the pool namespaces and a routing grid for each request type.
+Example:
+```json
+{
+  "num_prefill_pools": 2,
+  "num_decode_pools": 1,
+  "prefill_pool_dynamo_namespaces": [
+    "${K8S_NAMESPACE}-gp-prefill-0",
+    "${K8S_NAMESPACE}-gp-prefill-1"
+  ],
+  "decode_pool_dynamo_namespaces": [
+    "${K8S_NAMESPACE}-gp-decode-0"
+  ],
+  "prefill_pool_selection_strategy": {
+    "ttft_min": 10,
+    "ttft_max": 3000,
+    "ttft_resolution": 2,
+    "isl_min": 0,
+    "isl_max": 32000,
+    "isl_resolution": 2,
+    "prefill_pool_mapping": [[0, 1], [0, 1]]
+  },
+  "decode_pool_selection_strategy": {
+    "itl_min": 10,
+    "itl_max": 500,
+    "itl_resolution": 2,
+    "context_length_min": 0,
+    "context_length_max": 32000,
+    "context_length_resolution": 2,
+    "decode_pool_mapping": [[0, 0], [0, 0]]
+  }
+}
+```
+The `prefill_pool_dynamo_namespaces` and `decode_pool_dynamo_namespaces` entries are **Dynamo namespaces** that the pool-local routers register under.
+Important runtime behavior:
+- Prefill pool selection uses **ISL + TTFT target**
+- Decode pool selection uses **context length + ITL target**
+- OSL is useful for **designing and profiling pools**, but it is **not a direct routing key** in the current `GlobalRouter`
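To make the grid idea concrete, here is a minimal Python sketch of how a request could be mapped to a prefill pool with the example config above. It is an illustration only: the equal-width binning, the clamping of out-of-range values, and the `[isl_idx][ttft_idx]` index order are assumptions, and `bin_index` and `select_prefill_pool` are hypothetical names. With the opposite index order the roles of ISL and TTFT would swap, so check the Global Router README for the actual behavior.

```python
def bin_index(value, lo, hi, resolution):
    """Map value to one of `resolution` equal-width bins over [lo, hi], clamped."""
    step = (hi - lo) / resolution
    idx = int((value - lo) // step)
    return max(0, min(idx, resolution - 1))


def select_prefill_pool(strategy, isl, ttft_target):
    """Return the prefill pool index for a request (illustrative only)."""
    isl_idx = bin_index(isl, strategy["isl_min"], strategy["isl_max"],
                        strategy["isl_resolution"])
    ttft_idx = bin_index(ttft_target, strategy["ttft_min"], strategy["ttft_max"],
                         strategy["ttft_resolution"])
    # Assumed layout: rows are ISL bins, columns are TTFT bins.
    return strategy["prefill_pool_mapping"][isl_idx][ttft_idx]


strategy = {
    "ttft_min": 10, "ttft_max": 3000, "ttft_resolution": 2,
    "isl_min": 0, "isl_max": 32000, "isl_resolution": 2,
    "prefill_pool_mapping": [[0, 1], [0, 1]],
}

pool_tight = select_prefill_pool(strategy, isl=4000, ttft_target=200)
pool_loose = select_prefill_pool(strategy, isl=4000, ttft_target=2500)
print(pool_tight, pool_loose)
```

Under these assumptions, the two rows of the example mapping are identical, so only the TTFT bin decides which prefill pool is chosen.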
+Clients can pass request targets through `extra_args`:
+```json
+{
+  "extra_args": {
+    "ttft_target": 200,
+    "itl_target": 20
+  }
+}
+```
+For more details, see [Global Router README](../../../components/src/dynamo/global_router/README.md).
+## Step 5: Deploy In Order
+For a fresh cluster, the usual order is:
+1. Install Dynamo platform and Prometheus.
+2. Create secrets and PVCs needed by workers.
+3. Create the `GlobalRouter` ConfigMap.
+4. Apply the control DGD.
+5. Apply the pool DGDs.
+6. Wait for all DGDs to reach ready state.
+7. Expose or port-forward the control `Frontend`.
+Example:
+```bash
+export K8S_NAMESPACE=my-llama
+export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
+export DYNAMO_IMAGE=<dynamo-image>
+export DYNAMO_VLLM_IMAGE=<vllm-image>
+export STORAGE_CLASS_NAME=<rwx-storage-class>
+# HF_TOKEN must already be exported in this shell.
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="${HF_TOKEN}" \
+  -n "${K8S_NAMESPACE}"
+envsubst < examples/global_planner/global-planner-vllm-test.yaml | \
+  kubectl apply -n ${K8S_NAMESPACE} -f -
+```
+The single user-facing endpoint is the `Frontend` in the control DGD, not the pool DGDs.
+## Step 6: Validate The Stack
+Validate the deployment from outside in:
+- Confirm the control `Frontend` is healthy and serving the model endpoint.
+- Confirm `GlobalRouter` logs show requests being assigned to the expected pool namespaces.
+- Confirm pool-local planners are producing scale requests.
+- Confirm `GlobalPlanner` logs show accepted scale operations.
+- Confirm the target DGDs' replica counts change as expected.
+If you use Prometheus and Grafana, also inspect:
+- TTFT and ITL over time
+- per-pool worker counts
+- per-pool request mix
+- total GPU usage
+## Recommended Workflow For New Deployments
+For most teams, the easiest way to build this deployment is:
+1. Design your pool classes from expected traffic patterns.
+2. Run one DGDR per pool class to generate or validate the pool configuration.
+3. Copy the selected worker shape and planner settings into the final pool DGDs.
+4. Build one control DGD with `Frontend`, `GlobalRouter`, and `GlobalPlanner`.
+5. Route all client traffic through the control `Frontend`.
+This keeps profiling and pool selection simple while still giving you one public endpoint for the model.
+## Current Limitations
+- Single-endpoint `GlobalPlanner` deployments are assembled manually today. A single DGDR does not emit the full topology (the control DGD plus the pool DGDs).
+- `GlobalRouter` routes by ISL/TTFT and context-length/ITL grids, not directly by OSL.
+- In the single-endpoint pattern, all pools are expected to serve the same model.
+## See Also
+- [Planner README](README.md) — Planner overview and quick start
+- [Planner Guide](planner-guide.md) — Planner configuration reference
+- [Planner Examples](planner-examples.md) — DGDR examples for generating per-pool configs
+- [Profiler Guide](../profiler/profiler-guide.md) — Pre-deployment profiling workflow
+- [Global Planner README](../../../components/src/dynamo/global_planner/README.md) — Centralized scale execution
+- [Global Router README](../../../components/src/dynamo/global_router/README.md) — Cross-pool request routing
+- [vLLM global planner example](../../../examples/global_planner/global-planner-vllm-test.yaml) — End-to-end reference manifest
--- a/docs/components/planner/planner-guide.md
+++ b/docs/components/planner/planner-guide.md
@@ -114,8 +114,19 @@ The planner receives its config via `--config /path/to/planner_config.json` whic
 See the [Profiler Guide](../profiler/profiler-guide.md) for the full profiling workflow and how to configure pre-deployment sweeping.
+## Hierarchical Deployments
+If you want one public endpoint for a model but multiple private DGDs optimized for different request classes, use a hierarchical deployment:
+- one control DGD with `Frontend`, `GlobalRouter`, and `GlobalPlanner`
+- one or more prefill pool DGDs
+- one or more decode pool DGDs
+In the current workflow, run profiling independently for each intended pool, then compose the final control DGD plus pool DGDs manually. See [Global Planner Deployment Guide](global-planner.md).
 ## See Also
 - [Planner README](README.md) — Quick overview
+- [Global Planner Deployment Guide](global-planner.md) — `GlobalPlanner` deployment patterns and single-endpoint multi-pool workflow
 - [Planner Design](../../design-docs/planner-design.md) — Architecture internals
 - [Profiler Guide](../profiler/profiler-guide.md) — How profiling data is generated
--- a/docs/components/router/router-examples.md
+++ b/docs/components/router/router-examples.md
@@ -279,7 +279,7 @@ For full documentation on implementing KV event publishing for custom inference
 For deployments with multiple worker pools, the **Global Router** enables hierarchical routing by sitting between the frontend and local routers. It selects the appropriate pool for each request based on configurable policies, supporting disaggregated topologies where pools are tuned for different workload characteristics.
 - **Component details**: [`components/src/dynamo/global_router/`](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/global_router/)
-- **Example**: [`examples/hierarchical_planner/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/hierarchical_planner/)
+- **Example**: [`examples/global_planner/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/global_planner/)
 ## See Also

--- a/examples/backends/vllm/deploy/README.md
+++ b/examples/backends/vllm/deploy/README.md
@@ -35,6 +35,9 @@ Advanced disaggregated deployment with KV cache routing capabilities.
 - `VLLMDecodeWorker`: Specialized decode-only worker
 - `VLLMPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
+### 5. **Global Planner Deployments** (see [`examples/global_planner/`](../../../global_planner/))
+Centralized scaling across multiple DGDs via GlobalPlanner. Examples include single-endpoint multi-pool and multi-model GPU budget patterns. See the [global planner examples](../../../global_planner/) for details.
 ## CRD Structure
 All templates use the **DynamoGraphDeployment** CRD:
@@ -121,6 +124,7 @@ Select the deployment pattern that matches your requirements:
 - Use `disagg.yaml` for maximum performance
 - Use `disagg_router.yaml` for high-performance with KV cache routing
 - Use `disagg_planner.yaml` for SLA-optimized performance
+- Use [global planner examples](../../../global_planner/) for centralized scaling across multiple DGDs
 ### 2. Customize Configuration
 Edit the template to match your environment:
@@ -249,6 +253,7 @@ args:
 - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
 - **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation-guide.md)
 - **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/components/planner/planner-guide.md)
+- **Global Planner**: [Global Planner Deployment Guide](../../../../docs/components/planner/global-planner.md)
 - **Examples**: [Deployment Examples](../../../../docs/getting-started/examples.md)
 - **Architecture Docs**: [Disaggregated Serving](../../../../docs/design-docs/disagg-serving.md), [KV-Aware Routing](../../../../docs/components/router/README.md)

--- a/examples/global_planner/README.md
+++ b/examples/global_planner/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+# Global Planner Examples
+Examples demonstrating **GlobalPlanner** — the centralized scaling execution layer that
+enforces shared scaling policy across multiple DGDs.
+## Example Manifests
+| File | Pattern | Backend | Description |
+|------|---------|---------|-------------|
+| `global-planner-gpu-budget.yaml` | Multi-model, GPU budget | vLLM | 2 independent model DGDs + 1 control DGD with `--max-total-gpus` |
+| `global-planner-vllm-test.yaml` | Single-endpoint, multi-pool | vLLM | 1 Frontend + GlobalRouter + GlobalPlanner, 2 prefill pools (TP1, TP2) + 1 decode pool |
+| `global-planner-mocker-test.yaml` | Single-endpoint, multi-pool | Mocker | Same pattern with Mocker workers (2 prefill + 2 decode pools); GlobalPlanner in `--no-operation` mode |
+## Deployment Patterns
+### Pattern 1: Multi-Model with GPU Budget (`global-planner-gpu-budget.yaml`)
+Multiple independent DGDs, each serving a different model with its own Frontend.
+A shared GlobalPlanner enforces a cluster-wide GPU cap.
+```
+DGD gp-ctrl:    GlobalPlanner (--max-total-gpus)
+DGD model-a:    Frontend + VllmPrefillWorker + VllmDecodeWorker + Planner  (MODEL_A)
+DGD model-b:    Frontend + VllmPrefillWorker + VllmDecodeWorker + Planner  (MODEL_B)
+```
+- No GlobalRouter needed — each model has its own endpoint.
+- Each DGD's local planner uses `environment: "global-planner"` to delegate scaling.
+- GlobalPlanner rejects any scale request that would push total GPUs above the limit.
+### Pattern 2: Single Endpoint, Multi-Pool (`global-planner-vllm-test.yaml`)
+One public endpoint for a single model, backed by multiple specialized pools.
+A GlobalRouter selects the best pool for each request.
+```
+DGD gp-ctrl:      Frontend + GlobalRouter + GlobalPlanner
+DGD gp-prefill-0: LocalRouter + VllmPrefillWorker (TP1) + Planner
+DGD gp-prefill-1: LocalRouter + VllmPrefillWorker (TP2) + Planner
+DGD gp-decode-0:  LocalRouter + VllmDecodeWorker  (TP1) + Planner
+```
+- GlobalRouter routes prefill requests by (ISL, TTFT target) and decode by (context length, ITL target).
+- Each pool planner delegates scaling to GlobalPlanner.
+## Prerequisites
+- Dynamo Kubernetes Platform installed (see [Kubernetes Quickstart](../../docs/kubernetes/README.md))
+- Cluster Prometheus scraping router metrics via PodMonitor
+- HuggingFace token secret:
+  ```bash
+  kubectl create secret generic hf-token-secret \
+    --from-literal=HF_TOKEN=<your-token> -n ${K8S_NAMESPACE}
+  ```
+- A ReadWriteMany StorageClass for the shared model cache PVC
+## Deploying
+All manifests use `envsubst` for configuration. Set the required variables and apply:
+### GPU Budget Example
+```bash
+export K8S_NAMESPACE=my-ns
+export DYNAMO_IMAGE=<dynamo-image>
+export DYNAMO_VLLM_IMAGE=<vllm-image>
+export STORAGE_CLASS_NAME=<rwx-storage-class>
+export MODEL_A=meta-llama/Llama-3.1-8B-Instruct
+export MODEL_B=Qwen/Qwen3-8B
+export MAX_TOTAL_GPUS=8
+envsubst < global-planner-gpu-budget.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
+```
+### Single-Endpoint vLLM Example
+```bash
+export K8S_NAMESPACE=my-ns
+export DYNAMO_IMAGE=<dynamo-image>
+export DYNAMO_VLLM_IMAGE=<vllm-image>
+export STORAGE_CLASS_NAME=<rwx-storage-class>
+export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
+envsubst < global-planner-vllm-test.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
+```
+### Mocker Example (No GPUs)
+```bash
+export K8S_NAMESPACE=my-ns
+export DYNAMO_IMAGE=<dynamo-image>
+envsubst < global-planner-mocker-test.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
+```
+## Verifying
+```bash
+# Check DGD status
+kubectl get dgd -n ${K8S_NAMESPACE}
+# Check pods
+kubectl get pods -n ${K8S_NAMESPACE}
+# Watch GlobalPlanner logs for scale requests
+kubectl logs -n ${K8S_NAMESPACE} \
+  -l nvidia.com/dynamo-component=GlobalPlanner -f
+```
+## Cleanup
+```bash
+envsubst < <manifest>.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
+```
+## SLA Planner Configuration
+Each pool's local planner is configured via a JSON blob passed to `--config`.
+Key fields for GlobalPlanner delegation:
+| Field | Description |
+|-------|-------------|
+| `environment` | `"global-planner"` — delegates scaling to GlobalPlanner |
+| `global_planner_namespace` | Dynamo namespace of the control DGD (e.g. `${K8S_NAMESPACE}-gp-ctrl`) |
+| `mode` | `"disagg"`, `"prefill"`, or `"decode"` |
+| `throughput_metrics_source` | `"router"` for multi-DGD setups (reads `dynamo_component_router_*` from Prometheus) |
+| `max_gpu_budget` | Per-pool GPU limit (`-1` = unlimited, defer to GlobalPlanner) |
+## GlobalPlanner Flags
+| Flag | Description |
+|------|-------------|
+| `--max-total-gpus N` | Reject requests that would exceed N total GPUs across all managed DGDs. `0` = no GPU scaling allowed, `-1` (default) = unlimited |
+| `--managed-namespaces NS...` | Only accept scale requests from listed Dynamo namespaces (default: accept all). See *Management Modes* below |
+| `--no-operation` | Log scale requests without executing them (useful for dry-run testing) |
+### Management Modes
+GlobalPlanner operates in one of two modes depending on whether `--managed-namespaces` is set:
+- **Explicit mode** (`--managed-namespaces` provided): Only the listed Dynamo
+  namespaces are authorized to send scale requests, and only their corresponding
+  DGDs count toward the GPU budget. DGD names are derived from the Dynamo
+  namespace using the operator convention `DYN_NAMESPACE = {k8s_namespace}-{dgd_name}`.
+- **Implicit mode** (no `--managed-namespaces`): Any caller is accepted, and all
+  DGDs in the Kubernetes namespace count toward the GPU budget.
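The budget rule from the flags table can be sketched in a few lines of Python. This illustrates the documented semantics only (`-1` means unlimited; otherwise a request is rejected when the projected total exceeds the cap) and is not GlobalPlanner's implementation. `check_scale_request` and the per-DGD GPU accounting are hypothetical; which DGDs appear in the mapping would depend on the management mode.

```python
def check_scale_request(current_gpus, dgd_name, requested_gpus, max_total_gpus):
    """Return (accepted, projected_total) for one scale request.

    current_gpus: mapping of managed DGD name -> GPUs it uses now.
    requested_gpus: GPUs the DGD would use after the request is applied.
    max_total_gpus: -1 means unlimited; otherwise the cap across managed DGDs.
    """
    others = sum(g for name, g in current_gpus.items() if name != dgd_name)
    projected_total = others + requested_gpus
    if max_total_gpus == -1:
        return True, projected_total
    return projected_total <= max_total_gpus, projected_total


# Two model DGDs, each running 1 prefill + 1 decode worker with 1 GPU apiece.
usage = {"model-a": 2, "model-b": 2}

ok_small = check_scale_request(usage, "model-a", 4, max_total_gpus=8)
ok_large = check_scale_request(usage, "model-a", 7, max_total_gpus=8)
ok_unlimited = check_scale_request(usage, "model-a", 7, max_total_gpus=-1)
print(ok_small, ok_large, ok_unlimited)
```

The check uses the projected total after the request, so one DGD growing can be rejected because of GPUs already held by another.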
+## Namespace Convention
+The Dynamo operator derives each DGD's Dynamo namespace by prepending the Kubernetes namespace to the DGD name:
+- K8s namespace: `my-ns`, DGD name: `gp-ctrl`
+- Dynamo namespace: `my-ns-gp-ctrl`
+This is why planner configs and router endpoints use the full `${K8S_NAMESPACE}-<dgd-name>` path.
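The convention is simple enough to express as a pair of helpers. This Python sketch assumes only the `{k8s_namespace}-{dgd_name}` rule stated above; both function names are hypothetical. Recovering the DGD name requires knowing the Kubernetes namespace, because either part may itself contain hyphens.

```python
def dynamo_namespace(k8s_namespace, dgd_name):
    """Build the Dynamo namespace for a DGD using the operator convention."""
    return f"{k8s_namespace}-{dgd_name}"


def dgd_name_from_dynamo_namespace(dyn_namespace, k8s_namespace):
    """Recover the DGD name by stripping the known Kubernetes namespace prefix."""
    prefix = f"{k8s_namespace}-"
    if not dyn_namespace.startswith(prefix):
        raise ValueError(
            f"{dyn_namespace!r} does not belong to K8s namespace {k8s_namespace!r}"
        )
    return dyn_namespace[len(prefix):]


ctrl_ns = dynamo_namespace("my-ns", "gp-ctrl")
ctrl_dgd = dgd_name_from_dynamo_namespace(ctrl_ns, "my-ns")
print(ctrl_ns, ctrl_dgd)
```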
+## Further Reading
+- [Global Planner Deployment Guide](../../docs/components/planner/global-planner.md)
+- [Global Planner README](../../components/src/dynamo/global_planner/README.md)
+- [Planner Configuration Guide](../../docs/components/planner/planner-guide.md)
+- [Global Router README](../../components/src/dynamo/global_router/README.md)
--- a/examples/global_planner/global-planner-gpu-budget.yaml
+++ b/examples/global_planner/global-planner-gpu-budget.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Minimal GlobalPlanner GPU budget example: 2 independent model DGDs sharing a
+# GPU cap enforced by a central GlobalPlanner.
+#
+# Each model DGD is self-contained (Frontend + Workers + Planner) and serves a
+# different model. The GlobalPlanner in the ctrl DGD rejects any scale request
+# that would push the total GPU count across its managed DGDs above MAX_TOTAL_GPUS.
+#
+# The budget applies only to DGDs managed by this GlobalPlanner (see
+# --managed-namespaces), not to every DGD in the cluster. In this example the
+# ctrl DGD runs in implicit mode (no --managed-namespaces), so all DGDs in the
+# same K8s namespace count toward the budget. To limit the budget to specific
+# DGDs, pass --managed-namespaces with their Dynamo namespaces.
+#
+# Architecture:
+#   DGD gp-ctrl:    GlobalPlanner (--max-total-gpus)
+#   DGD model-a:    Frontend + VllmPrefillWorker + VllmDecodeWorker + Planner  (MODEL_A)
+#   DGD model-b:    Frontend + VllmPrefillWorker + VllmDecodeWorker + Planner  (MODEL_B)
+#
+# Prerequisites:
+#   - Cluster Prometheus deployed and scraping pods via PodMonitor
+#   - HuggingFace token secret: kubectl create secret generic hf-token-secret \
+#       --from-literal=HF_TOKEN=<your-token> -n ${K8S_NAMESPACE}
+#
+# Usage:
+#   export K8S_NAMESPACE=... DYNAMO_IMAGE=... DYNAMO_VLLM_IMAGE=... STORAGE_CLASS_NAME=...
+#   export MODEL_A=meta-llama/Llama-3.1-8B-Instruct MODEL_B=Qwen/Qwen3-8B MAX_TOTAL_GPUS=8
+#   envsubst < global-planner-gpu-budget.yaml | kubectl apply  -n ${K8S_NAMESPACE} -f -
+#   envsubst < global-planner-gpu-budget.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: ${K8S_NAMESPACE}-planner
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: dynamo-platform-dynamo-operator-planner
+subjects:
+  - kind: ServiceAccount
+    name: default
+    namespace: ${K8S_NAMESPACE}
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: hf-model-cache
+spec:
+  accessModes:
+    - ReadWriteMany
+  storageClassName: ${STORAGE_CLASS_NAME}
+  resources:
+    requests:
+      storage: 50Gi
+---
+# ── Control plane: GlobalPlanner only ────────────────────────────────────────
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: gp-ctrl
+spec:
+  services:
+    GlobalPlanner:
+      componentType: default
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: ${DYNAMO_IMAGE}
+          command:
+            - python3
+            - -m
+            - dynamo.global_planner
+          args:
+            - --max-total-gpus
+            - "${MAX_TOTAL_GPUS}"
+---
+# ── Model A: self-contained disagg serving DGD ──────────────────────────────
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: model-a
+spec:
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: ${DYNAMO_IMAGE}
+          workingDir: /workspace
+          command:
+            - python3
+            - -m
+            - dynamo.frontend
+          args:
+            - --model-name
+            - ${MODEL_A}
+    VllmPrefillWorker:
+      envFromSecret: hf-token-secret
+      componentType: worker
+      subComponentType: prefill
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        volumes:
+          - name: hf-model-cache
+            persistentVolumeClaim:
+              claimName: hf-model-cache
+        mainContainer:
+          image: ${DYNAMO_VLLM_IMAGE}
+          workingDir: /workspace/examples/backends/vllm
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+          args:
+            - --model
+            - ${MODEL_A}
+            - --tensor-parallel-size
+            - "1"
+            - --is-prefill-worker
+          volumeMounts:
+            - name: hf-model-cache
+              mountPath: /home/dynamo/.cache/huggingface/hub
+    VllmDecodeWorker:
+      envFromSecret: hf-token-secret
+      componentType: worker
+      subComponentType: decode
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        volumes:
+          - name: hf-model-cache
+            persistentVolumeClaim:
+              claimName: hf-model-cache
+        mainContainer:
+          image: ${DYNAMO_VLLM_IMAGE}
+          workingDir: /workspace/examples/backends/vllm
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+          args:
+            - --model
+            - ${MODEL_A}
+            - --tensor-parallel-size
+            - "1"
+          volumeMounts:
+            - name: hf-model-cache
+              mountPath: /home/dynamo/.cache/huggingface/hub
+    Planner:
+      componentType: planner
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: ${DYNAMO_IMAGE}
+          command:
+            - python3
+            - -m
+            - dynamo.planner
+          args:
+            - --config
+            - '{"environment":"global-planner","global_planner_namespace":"${K8S_NAMESPACE}-gp-ctrl","backend":"vllm","mode":"disagg","enable_load_scaling":false,"enable_throughput_scaling":true,"throughput_metrics_source":"router","ttft":2000,"itl":200,"max_gpu_budget":-1,"prefill_engine_num_gpu":1,"decode_engine_num_gpu":1,"model_name":"${MODEL_A}","profile_results_dir":"/workspace/tests/planner/profiling_results/H200_TP1P_TP1D"}'
+---
+# ── Model B: self-contained disagg serving DGD ──────────────────────────────
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: model-b
+spec:
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: ${DYNAMO_IMAGE}
+          workingDir: /workspace
+          command:
+            - python3
+            - -m
+            - dynamo.frontend
+          args:
+            - --model-name
+            - ${MODEL_B}
+    VllmPrefillWorker:
+      envFromSecret: hf-token-secret
+      componentType: worker
+      subComponentType: prefill
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        volumes:
+          - name: hf-model-cache
+            persistentVolumeClaim:
+              claimName: hf-model-cache
+        mainContainer:
+          image: ${DYNAMO_VLLM_IMAGE}
+          workingDir: /workspace/examples/backends/vllm
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+          args:
+            - --model
+            - ${MODEL_B}
+            - --tensor-parallel-size
+            - "1"
+            - --is-prefill-worker
+          volumeMounts:
+            - name: hf-model-cache
+              mountPath: /home/dynamo/.cache/huggingface/hub
+    VllmDecodeWorker:
+      envFromSecret: hf-token-secret
+      componentType: worker
+      subComponentType: decode
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        volumes:
+          - name: hf-model-cache
+            persistentVolumeClaim:
+              claimName: hf-model-cache
+        mainContainer:
+          image: ${DYNAMO_VLLM_IMAGE}
+          workingDir: /workspace/examples/backends/vllm
+          command:
+            - python3
+            - -m
+            - dynamo.vllm
+          args:
+            - --model
+            - ${MODEL_B}
+            - --tensor-parallel-size
+            - "1"
+          volumeMounts:
+            - name: hf-model-cache
+              mountPath: /home/dynamo/.cache/huggingface/hub
+    Planner:
+      componentType: planner
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: ${DYNAMO_IMAGE}
+          command:
+            - python3
+            - -m
+            - dynamo.planner
+          args:
+            - --config
+            - '{"environment":"global-planner","global_planner_namespace":"${K8S_NAMESPACE}-gp-ctrl","backend":"vllm","mode":"disagg","enable_load_scaling":false,"enable_throughput_scaling":true,"throughput_metrics_source":"router","ttft":2000,"itl":200,"max_gpu_budget":-1,"prefill_engine_num_gpu":1,"decode_engine_num_gpu":1,"model_name":"${MODEL_B}","profile_results_dir":"/workspace/tests/planner/profiling_results/H200_TP1P_TP1D"}'
--- a/examples/backends/mocker/deploy/hplanner-mocker-test.yaml
+++ b/examples/backends/mocker/deploy/hplanner-mocker-test.yaml
@@ -13,8 +13,8 @@
 #   DGD gp-decode-1:  LocalRouter + MockerDecode  + Planner
 #
 # Usage:
-#   envsubst < hplanner-mocker-test.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
+#   envsubst < global-planner-mocker-test.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
-#   envsubst < hplanner-mocker-test.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
+#   envsubst < global-planner-mocker-test.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
 apiVersion: v1
 kind: ConfigMap
 metadata:

--- a/examples/backends/vllm/deploy/hplanner-vllm-test.yaml
+++ b/examples/backends/vllm/deploy/hplanner-vllm-test.yaml
@@ -18,8 +18,8 @@
 #
 # Usage:
 #   export K8S_NAMESPACE=... DYNAMO_IMAGE=... DYNAMO_VLLM_IMAGE=... MODEL_NAME=... STORAGE_CLASS_NAME=...
-#   envsubst < hplanner-vllm-test.yaml | kubectl apply   -n ${K8S_NAMESPACE} -f -
+#   envsubst < global-planner-vllm-test.yaml | kubectl apply   -n ${K8S_NAMESPACE} -f -
-#   envsubst < hplanner-vllm-test.yaml | kubectl delete  -n ${K8S_NAMESPACE} -f -
+#   envsubst < global-planner-vllm-test.yaml | kubectl delete  -n ${K8S_NAMESPACE} -f -
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:

--- a/examples/hierarchical_planner/README.md
+++ b/examples/hierarchical_planner/README.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
--->
-# Hierarchical Planner Example
-This example demonstrates a hierarchical routing setup with:
-- A **Global Router** that routes to different pools based on request characteristics
-- **Local Routers** in each pool namespace
-- **Workers** (Mocker for local testing, vLLM for Kubernetes deployment)
-## Architecture
-```
-                    Frontend (round-robin routing)
-                         |
-                         v
-                    Global Router
-                   (registers as both prefill + decode)
-                         |
-        +----------------+----------------+
-        |                |                |
-        v                v                v
-   Prefill Pool 0   Prefill Pool 1   Decode Pool 0
-   (prefill-pool-0) (prefill-pool-1) (decode-pool-0)
-        |                |                |
-        v                v                v
-   Local Router     Local Router     Local Router
-        |                |                |
-        v                v                v
-      Worker           Worker           Worker
-   (prefill)        (prefill)        (decode)
-```
-## Configuration
-The `global_router_config.json` defines:
-- 2 prefill pools (`prefill-pool-0`, `prefill-pool-1`)
-- 1 decode pool (`decode-pool-0`)
-- Grid-based pool selection strategy
-Pool selection is based on a 2x2 grid:
-- **Prefill**: (ISL, TTFT_target) maps to prefill pool index
-- **Decode**: (context_length, ITL_target) maps to decode pool index
-## Running Locally (with Mocker)
-For local testing without GPUs, use the mocker-based script:
-```bash
-cd examples/hierarchical_planner
-./run_example.sh
-```
-This starts all components in the background and provides instructions for testing.
-## Kubernetes Deployment (with vLLM)
-The `vllm-2p1d.yaml` file provides a multi-DGD deployment with real vLLM workers (1 GPU each).
-### Prerequisites
-- Kubernetes cluster with GPU nodes
-- `hf-token-secret` secret containing your HuggingFace token
-- The Dynamo operator installed
-### Deployment
-The YAML uses environment variable placeholders:
-- `${K8S_NAMESPACE}` - Your Kubernetes namespace
-- `${VLLM_IMAGE}` - Dynamo vLLM runtime container image
-Use `envsubst` to substitute these before applying:
-```bash
-# Set your Kubernetes namespace and image
-export K8S_NAMESPACE=<your-k8s-namespace>
-export VLLM_IMAGE=<dynamo-vllm-image>
-# Deploy all DGDs
-envsubst < vllm-2p1d.yaml | kubectl apply -n ${K8S_NAMESPACE} -f -
-```
-### Verify Deployment
-```bash
-# Check DGD status
-kubectl get dgd -n ${K8S_NAMESPACE}
-# Check pods
-kubectl get pods -n ${K8S_NAMESPACE}
-# Check logs for a specific component
-kubectl logs -n ${K8S_NAMESPACE} -l nvidia.com/dynamo-component=Frontend
-```
-### Cleanup
-```bash
-export K8S_NAMESPACE=<your-k8s-namespace>
-export VLLM_IMAGE=<dynamo-vllm-image>
-envsubst < vllm-2p1d.yaml | kubectl delete -n ${K8S_NAMESPACE} -f -
-```
-### Namespace Convention
-The Dynamo operator prepends the Kubernetes namespace to the `dynamoNamespace` field:
-- K8s namespace: `my-namespace`
-- `dynamoNamespace: prefill-pool-0`
-- Actual Dynamo namespace: `my-namespace-prefill-pool-0`
-This is why the global router config and local router endpoints must use the full namespace path.
-## Testing
-Once all components are running, send a request to the frontend:
-```bash
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [{"role": "user", "content": "Hello, how are you?"}],
-    "max_tokens": 50,
-    "stream": true
-  }'
-```
-For Kubernetes, port-forward the frontend service first:
-```bash
-kubectl port-forward -n ${K8S_NAMESPACE} svc/hierarchical-frontend-frontend 8000:8000
-```
-## Request Flow
-1. Request arrives at **Frontend**
-2. Frontend's `PrefillRouter` detects both prefill and decode registered for the model
-3. Frontend sends prefill request to **Global Router** (registered as prefill)
-4. Global Router selects prefill pool based on (ISL, TTFT_target) grid
-5. Request forwarded to **Local Router** in selected prefill pool namespace
-6. Local Router forwards to **Worker** (prefill mode)
-7. Prefill response returns with `disaggregated_params`
-8. Frontend sends decode request to **Global Router** (registered as decode)
-9. Global Router selects decode pool based on (context_length, ITL_target) grid
-10. Tokens stream back through the chain
-## Customizing Pool Selection
-Edit `global_router_config.json` (or the ConfigMap in `vllm-2p1d.yaml`) to change:
-- **Number of pools**: Adjust `num_prefill_pools`, `num_decode_pools` and corresponding namespace lists
-- **Selection grid**: Modify `isl_resolution`, `ttft_resolution` etc. to change grid granularity
-- **Pool mapping**: Edit `prefill_pool_mapping` and `decode_pool_mapping` matrices
-Example: To always route to pool 0 regardless of request characteristics:
-```json
-"prefill_pool_mapping": [[0, 0], [0, 0]]
-```
-## SLA Planner with GlobalPlanner
-Each pool can run an SLA Planner that reads throughput metrics and delegates autoscaling decisions
-to a central **GlobalPlanner** service. The GlobalPlanner arbitrates across pools and executes
-scaling via the Dynamo operator.
-### Architecture with SLA Planners
-```
-Frontend (round-robin)
-     |
-     v
-Global Router ─── GlobalPlanner  ◄─── scale decisions from pool planners
-     |
-     +──────────────────────────────────────+
-     |                |                     |
-Prefill Pool 0    Prefill Pool 1       Decode Pool 0
-LocalRouter       LocalRouter          LocalRouter
-Worker            Worker               Worker
-Planner ──────►   Planner ──────►      Planner ──────►  (all → GlobalPlanner)
-```
-### SLA Planner configuration
-The SLA Planner is configured via a JSON blob passed to `--config`. Key fields for the
-global-planner environment:
-| Field | Description |
-|---|---|
-| `environment` | `"global-planner"` to delegate scaling to GlobalPlanner |
-| `global_planner_namespace` | Dynamo namespace of the DGD running GlobalPlanner |
-| `mode` | `"prefill"` or `"decode"` |
-| `throughput_metrics_source` | `"frontend"` (default) or `"router"` — see below |
-### `throughput_metrics_source`
-Controls where the SLA Planner reads aggregate throughput metrics (TTFT, ITL, request rate):
-- **`frontend`** (default): reads `dynamo_frontend_*` histograms from the frontend service. Works
-  for single-DGD disagg deployments where the planner and frontend share a namespace.
-- **`router`**: reads `dynamo_component_router_*` histograms emitted by LocalRouter pods and
-  scraped by cluster Prometheus. Required for hierarchical (multi-DGD) disagg deployments where
-  the SLA Planner runs in a pool DGD namespace that is different from the frontend DGD namespace.
-Use `throughput_metrics_source: "router"` whenever the planner is co-located with a pool
-(not the frontend), i.e. in any GlobalPlanner setup.
-### Prometheus scraping for router metrics
-The Dynamo operator Helm chart includes a PodMonitor that scrapes LocalRouter pods on port 9090.
-LocalRouter pods must expose metrics on that port via:
-```yaml
-env:
-  - name: DYN_SYSTEM_PORT
-    value: "9090"
-```
-No standalone Prometheus is needed — the cluster-wide Prometheus picks up the PodMonitor
-automatically.
-### GlobalPlanner `--no-operation` mode
-Pass `--no-operation` to GlobalPlanner to receive and log scale requests without executing them.
-Useful for observing planner behaviour before enabling live scaling:
-```yaml
-command: [python3, -m, dynamo.global_planner]
-args: [--no-operation]
-```
-### Example deployments
-Complete end-to-end examples are in `examples/backends/`:
-| File | Description |
-|---|---|
-| `mocker/deploy/hplanner-mocker-test.yaml` | 2 prefill + 2 decode pools with Mocker workers; GlobalPlanner in no-op mode |
-| `vllm/deploy/hplanner-vllm-test.yaml` | 2 prefill (TP1, TP2) + 1 decode pool with real vLLM workers |
-Both use `envsubst` for substituting `${K8S_NAMESPACE}`, `${DYNAMO_IMAGE}`, etc.
--- a/examples/hierarchical_planner/global_router_config.json
+++ b/examples/hierarchical_planner/global_router_config.json
-{
-    "num_prefill_pools": 2,
-    "num_decode_pools": 1,
-    "prefill_pool_dynamo_namespaces": ["prefill_pool_0", "prefill_pool_1"],
-    "decode_pool_dynamo_namespaces": ["decode_pool_0"],
-    "prefill_pool_selection_strategy": {
-        "ttft_min": 10,
-        "ttft_max": 1000,
-        "ttft_resolution": 2,
-        "isl_min": 0,
-        "isl_max": 32000,
-        "isl_resolution": 2,
-        "prefill_pool_mapping": [[0,1],[0,1]]
-    },
-    "decode_pool_selection_strategy": {
-        "itl_min": 10,
-        "itl_max": 100,
-        "itl_resolution": 2,
-        "context_length_min": 0,
-        "context_length_max": 32000,
-        "context_length_resolution": 2,
-        "decode_pool_mapping": [[0,0],[0,0]]
-    }
-}
\ No newline at end of file
--- a/examples/hierarchical_planner/run_example.sh
+++ b/examples/hierarchical_planner/run_example.sh
-#!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-# Hierarchical Planner Example
-# Run each command in a separate terminal, in order from bottom to top.
-# Wait a few seconds between starting each component.
-# ============================================================================
-# frontend + global_router
-# ============================================================================
-# need to specify a namespace so that mockers are not registered to frontend
-# and cannot use "dynamo" because that is reserved for all namespaces
-python -m dynamo.frontend \
-  --router-mode round-robin \
-  --namespace hierarchical &
-python -m dynamo.global_router \
-  --config examples/hierarchical_planner/global_router_config.json \
-  --model-name Qwen/Qwen3-0.6B \
-  --default-ttft-target 100 \
-  --default-itl-target 10 \
-  --namespace hierarchical &
-# ============================================================================
-# prefill_pool_0 - local router + mocker worker (prefill)
-# ============================================================================
-DYN_NAMESPACE=prefill_pool_0 python -m dynamo.router \
-  --endpoint prefill_pool_0.worker.generate \
-  --router-block-size 16 \
-  --no-router-track-active-blocks &  # prefill router does not need to track active blocks
-python -m dynamo.mocker \
-  --model-path Qwen/Qwen3-0.6B \
-  --endpoint dyn://prefill_pool_0.worker.generate \
-  --disaggregation-mode prefill \
-  --block-size 16 &
-# ============================================================================
-# prefill_pool_1 - local router + mocker worker (prefill)
-# ============================================================================
-DYN_NAMESPACE=prefill_pool_1 python -m dynamo.router \
-  --endpoint prefill_pool_1.worker.generate \
-  --router-block-size 16 \
-  --no-router-track-active-blocks &  # prefill router does not need to track active blocks
-python -m dynamo.mocker \
-  --model-path Qwen/Qwen3-0.6B \
-  --endpoint dyn://prefill_pool_1.worker.generate \
-  --disaggregation-mode prefill \
-  --block-size 16 &
-# ============================================================================
-# decode_pool_0 - local router + mocker worker (decode)
-# ============================================================================
-DYN_NAMESPACE=decode_pool_0 python -m dynamo.router \
-  --endpoint decode_pool_0.worker.generate \
-  --router-block-size 16 \
-  --router-kv-overlap-score-weight 0 &
-python -m dynamo.mocker \
-  --model-path Qwen/Qwen3-0.6B \
-  --endpoint dyn://decode_pool_0.worker.generate \
-  --disaggregation-mode decode \
-  --block-size 16 &
-# ============================================================================
-# test request
-# ============================================================================
-# wait for all components to start
-# curl -X POST http://localhost:8000/v1/chat/completions \
-#   -H "Content-Type: application/json" \
-#   -d '{
-#     "model": "Qwen/Qwen3-0.6B",
-#     "messages": [{"role": "user", "content": "Hello!"}],
-#     "max_tokens": 50,
-#     "stream": true
-#   }'
--- a/examples/hierarchical_planner/vllm-2p1d.yaml
+++ b/examples/hierarchical_planner/vllm-2p1d.yaml
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Multi-DGD deployment for hierarchical planner example with vLLM workers
-# Architecture:
-#   DGD 1 (hierarchical): Frontend + GlobalRouter
-#   DGD 2 (prefill-pool-0): Local Router + vLLM Prefill Worker (1 GPU)
-#   DGD 3 (prefill-pool-1): Local Router + vLLM Prefill Worker (1 GPU)
-#   DGD 4 (decode-pool-0): Local Router + vLLM Decode Worker (1 GPU)
-#
-# IMPORTANT: This file uses ${K8S_NAMESPACE} as a placeholder for the Kubernetes namespace.
-# The K8s operator prepends the K8s namespace to the Dynamo namespace.
-# For example, if K8S_NAMESPACE="my-namespace" and dynamoNamespace is "prefill-pool-0",
-# the actual Dynamo namespace becomes "my-namespace-prefill-pool-0".
-#
-# vLLM workers register at:
-#   - Prefill: <namespace>.prefill.generate
-#   - Decode:  <namespace>.backend.generate
-#
-# USAGE: See README.md for deployment instructions using envsubst.
-# =============================================================================
-# ConfigMap for global router configuration
-# =============================================================================
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: hierarchical-global-router-config
-data:
-  global_router_config.json: |
-    {
-        "num_prefill_pools": 2,
-        "num_decode_pools": 1,
-        "prefill_pool_dynamo_namespaces": ["${K8S_NAMESPACE}-prefill-pool-0", "${K8S_NAMESPACE}-prefill-pool-1"],
-        "decode_pool_dynamo_namespaces": ["${K8S_NAMESPACE}-decode-pool-0"],
-        "prefill_pool_selection_strategy": {
-            "ttft_min": 10,
-            "ttft_max": 1000,
-            "ttft_resolution": 2,
-            "isl_min": 0,
-            "isl_max": 32000,
-            "isl_resolution": 2,
-            "prefill_pool_mapping": [[0,1],[0,1]]
-        },
-        "decode_pool_selection_strategy": {
-            "itl_min": 10,
-            "itl_max": 100,
-            "itl_resolution": 2,
-            "context_length_min": 0,
-            "context_length_max": 32000,
-            "context_length_resolution": 2,
-            "decode_pool_mapping": [[0,0],[0,0]]
-        }
-    }
---
-# =============================================================================
-# DGD 1: Frontend + Global Router (namespace: hierarchical)
-# =============================================================================
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: hierarchical-frontend
-spec:
-  envs:
-  - name: HF_TOKEN
-    valueFrom:
-      secretKeyRef:
-        key: HF_TOKEN
-        name: hf-token-secret
-  services:
-    Frontend:
-      componentType: frontend
-      dynamoNamespace: hierarchical
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --router-mode
-          - round-robin
-          - --namespace
-          - ${K8S_NAMESPACE}-hierarchical
-          command:
-          - python
-          - -m
-          - dynamo.frontend
-          image: ${VLLM_IMAGE}
-          workingDir: /workspace
-      replicas: 1
-    GlobalRouter:
-      componentType: default
-      dynamoNamespace: hierarchical
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --config
-          - /workspace/config/global_router_config.json
-          - --model-name
-          - Qwen/Qwen3-0.6B
-          - --default-ttft-target
-          - "100"
-          - --default-itl-target
-          - "10"
-          - --namespace
-          - ${K8S_NAMESPACE}-hierarchical
-          command:
-          - python
-          - -m
-          - dynamo.global_router
-          image: ${VLLM_IMAGE}
-          workingDir: /workspace
-          volumeMounts:
-          - mountPath: /workspace/config
-            name: global-router-config
-            readOnly: true
-        volumes:
-        - configMap:
-            name: hierarchical-global-router-config
-          name: global-router-config
-      replicas: 1
---
-# =============================================================================
-# DGD 2: Prefill Pool 0 - Local Router + vLLM Worker (namespace: prefill-pool-0)
-# Actual Dynamo namespace: ${K8S_NAMESPACE}-prefill-pool-0
-# vLLM prefill worker registers at: ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate
-# =============================================================================
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: prefill-pool-0
-spec:
-  envs:
-  - name: HF_TOKEN
-    valueFrom:
-      secretKeyRef:
-        key: HF_TOKEN
-        name: hf-token-secret
-  services:
-    LocalRouter:
-      componentType: default
-      dynamoNamespace: prefill-pool-0
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --endpoint
-          - ${K8S_NAMESPACE}-prefill-pool-0.prefill.generate
-          - --router-block-size
-          - "16"
-          - --no-router-track-active-blocks
-          command:
-          - python
-          - -m
-          - dynamo.router
-          image: ${VLLM_IMAGE}
-          workingDir: /workspace
-      replicas: 1
-    VllmPrefillWorker:
-      componentType: worker
-      subComponentType: prefill
-      dynamoNamespace: prefill-pool-0
-      envFromSecret: hf-token-secret
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --model
-          - Qwen/Qwen3-0.6B
-          - --disaggregation-mode
-          - prefill
-          - --tensor-parallel-size
-          - "1"
-          - --gpu-memory-utilization
-          - "0.90"
-          - --block-size
-          - "16"
-          command:
-          - python3
-          - -m
-          - dynamo.vllm
-          image: ${VLLM_IMAGE}
-          workingDir: /workspace
-      replicas: 1
-      resources:
-        limits:
-          gpu: "1"
-        requests:
-          gpu: "1"
---
-# =============================================================================
-# DGD 3: Prefill Pool 1 - Local Router + vLLM Worker (namespace: prefill-pool-1)
-# Actual Dynamo namespace: ${K8S_NAMESPACE}-prefill-pool-1
-# vLLM prefill worker registers at: ${K8S_NAMESPACE}-prefill-pool-1.prefill.generate
-# =============================================================================
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: prefill-pool-1
-spec:
-  envs:
-  - name: HF_TOKEN
-    valueFrom:
-      secretKeyRef:
-        key: HF_TOKEN
-        name: hf-token-secret
-  services:
-    LocalRouter:
-      componentType: default
-      dynamoNamespace: prefill-pool-1
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --endpoint
-          - ${K8S_NAMESPACE}-prefill-pool-1.prefill.generate
-          - --router-block-size
-          - "16"
-          - --no-router-track-active-blocks
-          command:
-          - python
-          - -m
-          - dynamo.router
-          image: ${VLLM_IMAGE}
-          workingDir: /workspace
-      replicas: 1
-    VllmPrefillWorker:
-      componentType: worker
-      subComponentType: prefill
-      dynamoNamespace: prefill-pool-1
-      envFromSecret: hf-token-secret
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --model
-          - Qwen/Qwen3-0.6B
-          - --disaggregation-mode
-          - prefill
-          - --tensor-parallel-size
-          - "1"
-          - --gpu-memory-utilization
-          - "0.90"
-          - --block-size
-          - "16"
-          command:
-          - python3
-          - -m
-          - dynamo.vllm
-          image: ${VLLM_IMAGE}
-          workingDir: /workspace
-      replicas: 1
-      resources:
-        limits:
-          gpu: "1"
-        requests:
-          gpu: "1"
---
-# =============================================================================
-# DGD 4: Decode Pool 0 - Local Router + vLLM Worker (namespace: decode-pool-0)
-# Actual Dynamo namespace: ${K8S_NAMESPACE}-decode-pool-0
-# vLLM decode worker registers at: ${K8S_NAMESPACE}-decode-pool-0.backend.generate
-# =============================================================================
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: decode-pool-0
-spec:
-  envs:
-  - name: HF_TOKEN
-    valueFrom:
-      secretKeyRef:
-        key: HF_TOKEN
-        name: hf-token-secret
-  services:
-    LocalRouter:
-      componentType: default
-      dynamoNamespace: decode-pool-0
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --endpoint
-          - ${K8S_NAMESPACE}-decode-pool-0.backend.generate
-          - --router-block-size
-          - "16"
-          - --router-kv-overlap-score-weight
-          - "0"
-          command:
-          - python
-          - -m
-          - dynamo.router
-          image: ${VLLM_IMAGE}
-          workingDir: /workspace
-      replicas: 1
-    VllmDecodeWorker:
-      componentType: worker
-      subComponentType: decode
-      dynamoNamespace: decode-pool-0
-      envFromSecret: hf-token-secret
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --model
-          - Qwen/Qwen3-0.6B
-          - --tensor-parallel-size
-          - "1"
-          - --gpu-memory-utilization
-          - "0.90"
-          - --block-size
-          - "16"
-          command:
-          - python3
-          - -m
-          - dynamo.vllm
-          image: ${VLLM_IMAGE}
-          workingDir: /workspace
-      replicas: 1
-      resources:
-        limits:
-          gpu: "1"
-        requests:
-          gpu: "1"