Commit de6e4f30 authored by Julien Mancuso, committed by GitHub

feat: add Topology Aware Scheduling support (#6375)


Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
parent 23fc8db7
......@@ -89,6 +89,8 @@ navigation:
path: kubernetes/deployment/multinode-deployment.md
- page: Grove
path: kubernetes/grove.md
- page: Topology Aware Scheduling
path: kubernetes/topology-aware-scheduling.md
# ==================== User Guides ====================
- section: User Guides
......
......@@ -220,6 +220,7 @@ Key customization points include:
- **[GitOps Deployment with FluxCD](fluxcd.md)** - For advanced users
- **[Logging](observability/logging.md)** - For logging setup
- **[Multinode Deployment](deployment/multinode-deployment.md)** - For multinode deployment
- **[Topology Aware Scheduling](topology-aware-scheduling.md)** - Configure topology-aware workload placement
- **[Grove](grove.md)** - For grove details and custom installation
- **[Monitoring](observability/metrics.md)** - For monitoring setup
- **[Model Caching with Fluid](model-caching-with-fluid.md)** - For model caching with Fluid
......@@ -399,6 +399,7 @@ _Appears in:_
| `eppConfig` _[EPPConfig](#eppconfig)_ | EPPConfig defines EPP-specific configuration options for Endpoint Picker Plugin components.<br />Only applicable when ComponentType is "epp". | | Optional: \{\} <br /> |
| `frontendSidecar` _[FrontendSidecarSpec](#frontendsidecarspec)_ | FrontendSidecar configures an auto-generated frontend sidecar container.<br />When specified, the operator injects a fully configured frontend container<br />with all standard Dynamo environment variables, health probes, and ports.<br />This eliminates the need to manually specify these in extraPodSpec.containers. (GAIE) | | Optional: \{\} <br /> |
| `checkpoint` _[ServiceCheckpointConfig](#servicecheckpointconfig)_ | Checkpoint configures container checkpointing for this service.<br />When enabled, pods can be restored from checkpoint files for faster cold start. | | Optional: \{\} <br /> |
| `topologyConstraint` _[TopologyConstraint](#topologyconstraint)_ | TopologyConstraint for this service. packDomain is required.<br />When both this and spec.topologyConstraint.packDomain are set, packDomain<br />must be narrower than or equal to the spec-level packDomain. | | Optional: \{\} <br /> |
#### DynamoComponentDeploymentSpec
......@@ -440,6 +441,7 @@ _Appears in:_
| `eppConfig` _[EPPConfig](#eppconfig)_ | EPPConfig defines EPP-specific configuration options for Endpoint Picker Plugin components.<br />Only applicable when ComponentType is "epp". | | Optional: \{\} <br /> |
| `frontendSidecar` _[FrontendSidecarSpec](#frontendsidecarspec)_ | FrontendSidecar configures an auto-generated frontend sidecar container.<br />When specified, the operator injects a fully configured frontend container<br />with all standard Dynamo environment variables, health probes, and ports.<br />This eliminates the need to manually specify these in extraPodSpec.containers. (GAIE) | | Optional: \{\} <br /> |
| `checkpoint` _[ServiceCheckpointConfig](#servicecheckpointconfig)_ | Checkpoint configures container checkpointing for this service.<br />When enabled, pods can be restored from checkpoint files for faster cold start. | | Optional: \{\} <br /> |
| `topologyConstraint` _[TopologyConstraint](#topologyconstraint)_ | TopologyConstraint for this service. packDomain is required.<br />When both this and spec.topologyConstraint.packDomain are set, packDomain<br />must be narrower than or equal to the spec-level packDomain. | | Optional: \{\} <br /> |
#### DynamoGraphDeployment
......@@ -641,6 +643,7 @@ _Appears in:_
| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs are environment variables applied to all services in the deployment unless<br />overridden by service-specific configuration. | | Optional: \{\} <br /> |
| `backendFramework` _string_ | BackendFramework specifies the backend framework (e.g., "sglang", "vllm", "trtllm"). | | Enum: [sglang vllm trtllm] <br /> |
| `restart` _[Restart](#restart)_ | Restart specifies the restart policy for the graph deployment. | | Optional: \{\} <br /> |
| `topologyConstraint` _[SpecTopologyConstraint](#spectopologyconstraint)_ | TopologyConstraint is the deployment-level topology constraint.<br />When set, topologyProfile is required and names the ClusterTopology CR to use.<br />packDomain is optional here — it can be omitted when only services carry constraints.<br />Services without their own topologyConstraint inherit from this value. | | Optional: \{\} <br /> |
#### DynamoGraphDeploymentStatus
......@@ -1221,6 +1224,63 @@ _Appears in:_
| `size` _[Quantity](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#quantity-resource-api)_ | | | |
#### SpecTopologyConstraint
SpecTopologyConstraint defines deployment-level topology placement requirements.
It carries both the topology profile (which ClusterTopology CR to use) and an
optional default pack domain that services without their own constraint inherit.
_Appears in:_
- [DynamoGraphDeploymentSpec](#dynamographdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `topologyProfile` _string_ | TopologyProfile is the name of the ClusterTopology CR that defines the<br />topology hierarchy for this deployment. | | MinLength: 1 <br /> |
| `packDomain` _[TopologyDomain](#topologydomain)_ | PackDomain is the default topology domain to pack pods within.<br />Optional — omit when only services carry constraints. | | Pattern: `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` <br />Optional: \{\} <br /> |
#### TopologyConstraint
TopologyConstraint defines service-level topology placement requirements.
The topology profile is inherited from the deployment-level SpecTopologyConstraint;
only the pack domain is specified here.
_Appears in:_
- [DynamoComponentDeploymentSharedSpec](#dynamocomponentdeploymentsharedspec)
- [DynamoComponentDeploymentSpec](#dynamocomponentdeploymentspec)
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `packDomain` _[TopologyDomain](#topologydomain)_ | PackDomain is the topology domain to pack pods within. Must match a<br />domain defined in the referenced ClusterTopology CR. | | Pattern: `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` <br /> |
#### TopologyDomain
_Underlying type:_ _string_
TopologyDomain is a free-form topology level identifier.
Domain names are defined by the cluster admin in the ClusterTopology CR.
Common examples: "region", "zone", "datacenter", "block", "rack", "host", "numa".
Must match `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` (lowercase alphanumeric,
may contain hyphens but must not start or end with one).
_Validation:_
- Pattern: `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$`
_Appears in:_
- [SpecTopologyConstraint](#spectopologyconstraint)
- [TopologyConstraint](#topologyconstraint)
#### VolumeMount
......
......@@ -190,6 +190,21 @@ Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MO
- prefill # For disaggregated prefill workers
```
### Topology Aware Scheduling
You can optionally pack related pods within a topology domain (e.g., rack or block) to reduce inter-node latency, which is especially beneficial for disaggregated serving workloads. Add a `topologyConstraint` at the deployment level, the service level, or both:
```yaml
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology  # name of the ClusterTopology CR (required)
    packDomain: rack
  services:
    VllmWorker:
      # ...
```
This requires Grove and a `ClusterTopology` CR configured by your cluster admin. For full details, available domains, hierarchy rules, and examples, see **[Topology Aware Scheduling](../topology-aware-scheduling.md)**.
### Image Pull Secret Configuration
#### Automatic Discovery and Injection
......
......@@ -38,6 +38,7 @@ These systems provide enhanced scheduling capabilities including topology-aware
- Multi-level horizontal auto-scaling
- Custom startup ordering for components
- Resource-aware rolling updates
- [Topology Aware Scheduling](../topology-aware-scheduling.md) — pack pods within a rack, block, or other topology domain for lower latency
[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a Kubernetes native scheduler optimized for AI workloads at large scale.
......
......@@ -49,7 +49,7 @@ PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-sched
Supports pluggable horizontal auto-scaling solutions to scale PodCliqueSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements.
### Network Topology-Aware Scheduling
Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication.
Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication. Dynamo exposes this capability through the `topologyConstraint` field on DynamoGraphDeployment resources, so users can opt in to topology-aware placement without interacting with Grove internals. See the [Topology Aware Scheduling guide](./topology-aware-scheduling.md) for configuration details and examples.
### Custom Startup Dependencies
Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Topology Aware Scheduling
---
Topology Aware Scheduling (TAS) lets you control where Dynamo places inference workload pods relative to the cluster's network topology. By packing related pods within the same rack, block, or other topology domain, you reduce inter-node latency and improve throughput — especially for disaggregated serving where prefill, decode, and routing components communicate frequently.
TAS is **opt-in**. Existing deployments without topology constraints continue to work unchanged.
## Prerequisites
| Requirement | Details |
|-------------|---------|
| **Grove** | Installed on the cluster. See the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md). |
| **ClusterTopology CR** | A cluster-scoped `ClusterTopology` resource configured by the cluster admin, mapping topology domain names to node labels. See [Grove documentation](https://github.com/NVIDIA/grove) for setup instructions. |
| **KAI Scheduler** | [KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is required by Grove for topology-aware pod placement. |
| **Dynamo operator** | The latest Dynamo operator Helm chart includes read-only RBAC for `clustertopologies.grove.io` via a dedicated ClusterRole. This works for both cluster-wide and namespace-restricted operator deployments — no extra configuration is needed. |
## Topology Domains
Topology domains are **free-form** identifiers defined by the cluster admin in the `ClusterTopology` CR. Common examples include `region`, `zone`, `datacenter`, `block`, `rack`, `host`, and `numa`, but any name matching the pattern `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` is valid (no leading or trailing hyphens).
Domain names must match exactly what is configured in the `ClusterTopology` CR referenced by `topologyProfile`. During DGD creation, the Dynamo webhook validates that every `packDomain` exists in the referenced `ClusterTopology`.
When you specify a `packDomain`, the scheduler packs all replicas of the constrained component within a single instance of that domain. For example, `packDomain: rack` means "place all pods within the same rack." For multinode services, the constraint applies per replica instead; see Multinode Considerations.
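The naming rule above can be checked locally before a DGD is submitted. The following is a minimal Python sketch, not the webhook's implementation: the helper name `is_valid_domain` is hypothetical, and only the pattern itself comes from this page. It checks syntax only; whether the domain exists in the referenced `ClusterTopology` CR is a separate check.

```python
import re

# Pattern quoted from this page: lowercase alphanumerics and hyphens,
# with no leading or trailing hyphen.
DOMAIN_PATTERN = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def is_valid_domain(name: str) -> bool:
    """Return True if name is a syntactically valid topology domain.

    Hypothetical helper for illustration; it does not consult the cluster.
    """
    # fullmatch requires the whole string to match the pattern.
    return DOMAIN_PATTERN.fullmatch(name) is not None


for candidate in ["rack", "nvl-block", "-rack", "Rack", ""]:
    print(repr(candidate), is_valid_domain(candidate))
```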
## Topology Profile
Every DGD that uses topology constraints must reference a `ClusterTopology` CR by name via the `topologyProfile` field. This field is set at `spec.topologyConstraint` (the deployment level) and is inherited by all services — services must not set `topologyProfile` themselves.
The `topologyProfile` tells the Dynamo operator and the underlying framework which topology hierarchy to use for scheduling and validation.
## Enabling TAS on a DGD
Add a `topologyConstraint` field to your `DynamoGraphDeployment` at the deployment level, at the service level, or both. The deployment level must include a `topologyProfile`; its `packDomain` is optional. Each service-level constraint must specify a `packDomain`.
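The `topologyProfile` requirement can be illustrated with a small pre-flight check. This is a sketch under assumptions: it operates on a DGD `spec` already parsed into a Python dict, and the function name `missing_topology_profile` is hypothetical rather than part of Dynamo.

```python
def missing_topology_profile(spec: dict) -> bool:
    """Return True if any topology constraint is set but topologyProfile is absent.

    Hypothetical helper; `spec` is the parsed `spec` section of a DGD manifest.
    """
    spec_constraint = spec.get("topologyConstraint") or {}
    services = spec.get("services") or {}
    # A constraint counts whether it sits at the deployment or the service level.
    any_constraint = bool(spec_constraint) or any(
        (svc or {}).get("topologyConstraint") for svc in services.values()
    )
    return any_constraint and not spec_constraint.get("topologyProfile")


# A service-level constraint without a deployment-level topologyProfile.
spec = {
    "services": {
        "VllmWorker": {"topologyConstraint": {"packDomain": "rack"}},
        "Frontend": {},
    }
}
print(missing_topology_profile(spec))  # True
```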
### Example 1: Deployment-Level Constraint (Services Inherit)
All services inherit the deployment-level constraint. This is the simplest configuration when you want uniform topology packing.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
    packDomain: zone
  services:
    VllmWorker:
      dynamoNamespace: my-llm
      componentType: worker
      replicas: 2
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```
### Example 2: Service-Level Constraint Only
Only the specified service gets topology packing. Other services are scheduled without topology constraints. The deployment level must still set `topologyProfile`.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
  services:
    VllmWorker:
      dynamoNamespace: my-llm
      componentType: worker
      replicas: 2
      multinode:
        nodeCount: 4
      topologyConstraint:
        packDomain: rack
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "8"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-4-Maverick-17B-128E
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```
### Example 3: Mixed (Deployment-Level Default + Per-Service Override)
Set a broad constraint at the deployment level and a narrower override on specific services. Service-level constraints must be **equal to or narrower than** the deployment-level constraint (determined by the ordering in the `ClusterTopology` CR).
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
    packDomain: zone
  services:
    VllmWorker:
      dynamoNamespace: my-llm
      componentType: worker
      replicas: 2
      multinode:
        nodeCount: 4
      topologyConstraint:
        packDomain: block  # narrower than zone, so valid
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "8"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-4-Maverick-17B-128E
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      # inherits zone from spec.topologyConstraint
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```
## Hierarchy Rules
When **both** a deployment-level and a service-level `topologyConstraint` are set, the service's `packDomain` must be **equal to or narrower** than the deployment-level `packDomain`. "Narrower" is determined by the ordering of levels in the referenced `ClusterTopology` CR — levels appearing later in the `spec.levels` array are considered narrower.
The Dynamo webhook rejects the DGD at creation time if a service constraint is broader than the deployment constraint (when validating against a `ClusterTopology` CR).
When only one level is set (deployment-level only or service-level only), no hierarchy check applies.
| Configuration | Behavior |
|---------------|----------|
| `spec.topologyConstraint` set, service has none | Service inherits the deployment-level constraint |
| `spec.topologyConstraint` set, service also set | Both applied; service must be narrower or equal |
| `spec.topologyConstraint.topologyProfile` set, no `packDomain` at spec | Profile is provided for service-level constraints only |
| Neither set | No topology constraints (default) |
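The inheritance and hierarchy rules in this section can be sketched in a few lines of Python. This is an illustration, not the webhook's code: the function name `effective_pack_domains` is hypothetical, `spec` is a DGD `spec` parsed into a dict, and `levels` is assumed to be the list of domain names from the `ClusterTopology` CR's `spec.levels`, ordered broadest first. The level names in the usage example are an assumed ordering.

```python
from typing import Dict, List, Optional


def effective_pack_domains(spec: dict, levels: List[str]) -> Dict[str, Optional[str]]:
    """Resolve each service's packDomain, applying inheritance and the hierarchy rule.

    Hypothetical helper. `levels` is ordered broadest first, so a later index
    means a narrower domain.
    """
    default = (spec.get("topologyConstraint") or {}).get("packDomain")
    resolved: Dict[str, Optional[str]] = {}
    for name, svc in (spec.get("services") or {}).items():
        own = ((svc or {}).get("topologyConstraint") or {}).get("packDomain")
        if own is not None and default is not None:
            # The hierarchy check applies only when both levels are set.
            # list.index raises ValueError for a domain missing from levels.
            if levels.index(own) < levels.index(default):
                raise ValueError(
                    f"{name}: packDomain {own!r} is broader than {default!r}"
                )
        # A service without its own constraint inherits the deployment default.
        resolved[name] = own if own is not None else default
    return resolved


levels = ["zone", "block", "rack", "host"]  # assumed ordering, broadest first
spec = {
    "topologyConstraint": {"topologyProfile": "my-cluster-topology", "packDomain": "zone"},
    "services": {
        "VllmWorker": {"topologyConstraint": {"packDomain": "block"}},
        "Frontend": {},
    },
}
print(effective_pack_domains(spec, levels))
```

With the Example 3 layout used here, `VllmWorker` keeps `block` and `Frontend` inherits `zone`. Swapping the two domains would raise `ValueError`, mirroring the webhook rejection described above.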
## Field Reference
| Field | Level | Required | Description |
|-------|-------|----------|-------------|
| `topologyProfile` | `spec.topologyConstraint` | Yes (when any constraint is set) | Name of the `ClusterTopology` CR defining the topology hierarchy. |
| `topologyProfile` | service-level `topologyConstraint` | N/A (not in schema) | Inherited from `spec.topologyConstraint`. The service-level type does not include this field. |
| `packDomain` | `spec.topologyConstraint` | Optional | Default pack domain for services that don't specify their own. |
| `packDomain` | service-level `topologyConstraint` | Required | Pack domain for this service. Must match a level in the `ClusterTopology` CR. |
## Multinode Considerations
For multinode services (services with a `multinode` section), the topology constraint is applied at the **scaling group** level rather than on individual worker pods. This is important because a multinode service spawns `replicas × nodeCount` pods — for example, 2 replicas with `nodeCount: 4` produces 8 pods across 8 nodes. Applying the constraint at the scaling group level means the scheduler packs each replica's set of nodes within the requested domain, without over-constraining individual pods to a single host.
For example, with this configuration:
```yaml
VllmWorker:
  replicas: 2
  multinode:
    nodeCount: 4
  topologyConstraint:
    packDomain: rack
```
Each replica's 4 nodes are packed within a single rack. The two replicas may land in different racks (the constraint applies per-replica, not across all replicas).
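The arithmetic and the per-replica grouping can be made concrete with a short Python sketch. The pod names below are hypothetical placeholders, not the names Dynamo or Grove actually generate; the point is only that each inner list is packed as one unit.

```python
from typing import List


def pods_per_replica(replicas: int, node_count: int) -> List[List[str]]:
    """Group placeholder pod names by replica.

    Each inner list represents one replica's pods, which the scheduler packs
    within a single instance of the requested domain. Names are illustrative.
    """
    return [
        [f"replica-{r}-node-{n}" for n in range(node_count)]
        for r in range(replicas)
    ]


groups = pods_per_replica(replicas=2, node_count=4)
# Two independently packed groups, eight pods in total.
print(len(groups), sum(len(group) for group in groups))  # prints: 2 8
```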
**Recommendation:** For multinode services, use `rack` or `block` as the `packDomain` to keep workers within a high-bandwidth domain while still allowing the scheduler to spread them across hosts within that domain. Avoid `host` for multinode services, as packing multiple nodes onto one host is not meaningful.
## Immutability
Topology constraints **cannot be changed after the DGD is created**. This includes:
- Adding a topology constraint to a DGD or service that did not have one
- Removing an existing topology constraint
- Changing the `topologyProfile` value
- Changing the `packDomain` value
To change topology constraints, **delete and recreate** the DGD. This matches the behavior of the underlying framework, which enforces immutability on topology constraints for generated resources.
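Before applying an update, it can help to check whether it touches any topology constraint and would therefore be rejected. The following Python sketch compares two parsed DGD `spec` dicts. Both function names are hypothetical, and the comparison is a simplification of whatever the operator actually enforces; for instance, it treats adding a new service that has no constraint as unchanged.

```python
def topology_constraints(spec: dict) -> dict:
    """Collect the topologyConstraint set at the spec level and on each service.

    Hypothetical helper; entries without a constraint are omitted.
    """
    collected = {}
    if spec.get("topologyConstraint"):
        collected["<spec>"] = spec["topologyConstraint"]
    for name, svc in (spec.get("services") or {}).items():
        constraint = (svc or {}).get("topologyConstraint")
        if constraint:
            collected[name] = constraint
    return collected


def topology_changed(old_spec: dict, new_spec: dict) -> bool:
    """Return True if an update adds, removes, or alters a topology constraint."""
    return topology_constraints(old_spec) != topology_constraints(new_spec)


old = {
    "topologyConstraint": {"topologyProfile": "my-cluster-topology", "packDomain": "zone"},
    "services": {"VllmWorker": {"replicas": 2}},
}
# Changing replicas leaves the topology constraints untouched.
scaled = {**old, "services": {"VllmWorker": {"replicas": 4}}}
print(topology_changed(old, scaled))  # False
```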
## Monitoring Topology Enforcement
When any topology constraint is set, the DGD status includes a `TopologyLevelsAvailable` condition that reports whether the topology levels referenced by your constraints still exist in the cluster topology.
**Healthy state:**
```yaml
status:
  conditions:
    - type: Ready
      status: "True"
    - type: TopologyLevelsAvailable
      status: "True"
      reason: AllTopologyLevelsAvailable
      message: "All required topology levels are available in the cluster topology"
```
**Degraded state** (e.g., an admin removed a topology level from the `ClusterTopology` CR after deployment):
```yaml
status:
  conditions:
    - type: Ready
      status: "True"
    - type: TopologyLevelsAvailable
      status: "False"
      reason: TopologyLevelsUnavailable
      message: "Topology level 'rack' is no longer available in the cluster topology"
```
When topology levels become unavailable, Dynamo emits a **Warning** event on the DGD. The deployment may still appear `Ready` because the underlying framework keeps pods running, but topology placement is no longer guaranteed.
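Because `Ready` can stay `True` while topology placement is degraded, a health check should look at `TopologyLevelsAvailable` explicitly. The sketch below assumes the DGD `status` has already been fetched and parsed into a Python dict (for example, from `kubectl get ... -o json`); the function names are hypothetical.

```python
from typing import Optional


def topology_condition(status: dict) -> Optional[dict]:
    """Return the TopologyLevelsAvailable condition, or None if it is absent."""
    for condition in status.get("conditions") or []:
        if condition.get("type") == "TopologyLevelsAvailable":
            return condition
    return None


def topology_degraded(status: dict) -> bool:
    """Return True only when the condition is present with status "False".

    An absent condition is treated as "nothing to report", on the assumption
    that no topology constraint is set on the DGD.
    """
    condition = topology_condition(status)
    return condition is not None and condition.get("status") == "False"


# The degraded example from this page, as a parsed dict.
degraded_status = {
    "conditions": [
        {"type": "Ready", "status": "True"},
        {
            "type": "TopologyLevelsAvailable",
            "status": "False",
            "reason": "TopologyLevelsUnavailable",
            "message": "Topology level 'rack' is no longer available in the cluster topology",
        },
    ]
}
print(topology_degraded(degraded_status))  # True
```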
## Troubleshooting
### DGD rejected: "ClusterTopology not found"
The Dynamo webhook validates that the `ClusterTopology` CR referenced by `topologyProfile` exists when any topology constraint is set. If it cannot read the `ClusterTopology` CR:
- Verify that the cluster admin has created the `ClusterTopology` resource named in `topologyProfile`. See the [Grove documentation](https://github.com/NVIDIA/grove) for setup.
- Verify that the Dynamo operator has RBAC to read `clustertopologies.grove.io` (included in the default Helm chart).
### DGD rejected: "packDomain not found in cluster topology"
The specified `packDomain` does not exist as a level in the referenced `ClusterTopology` CR. Check which domains are defined:
```bash
kubectl get clustertopology <topology-profile-name> -o yaml
```
Ensure the domain you are requesting (e.g., `rack`) is configured in the `ClusterTopology` with a corresponding node label.
### DGD rejected: "topologyProfile is required"
Any DGD that has a topology constraint (at the spec or service level) must set `spec.topologyConstraint.topologyProfile` to the name of a `ClusterTopology` CR. Add the `topologyProfile` field to `spec.topologyConstraint`.
### Pods stuck in Pending
The scheduler cannot satisfy the topology constraint. Common causes:
- Not enough free capacity within a single instance of the requested domain (e.g., requesting 8 GPUs packed in one rack, but no rack has 8 available GPUs).
- Node labels do not match the `ClusterTopology` configuration.
Inspect scheduler events for details:
```bash
kubectl describe pod <pod-name> -n <namespace>
```
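The first cause above, insufficient capacity in any single domain instance, can be reasoned about with a rough Python sketch. Everything here is illustrative: the function name, the rack names, and the free-GPU counts are hypothetical, and a real scheduler also accounts for per-node fit and other resources.

```python
from typing import Dict, List


def domains_with_capacity(free_gpus: Dict[str, int], required: int) -> List[str]:
    """Return the domain instances whose free GPU count covers the request.

    Hypothetical helper; an empty result corresponds to pods staying Pending.
    """
    return sorted(name for name, free in free_gpus.items() if free >= required)


# Hypothetical cluster state: no single rack has 8 free GPUs.
free_by_rack = {"rack-a": 6, "rack-b": 4}
print(domains_with_capacity(free_by_rack, required=8))  # []
```

Here the cluster has 10 free GPUs in total, yet the request cannot be packed because neither rack alone can hold it.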
### TopologyLevelsAvailable is False
The DGD was deployed successfully, but the topology definition has since changed. The underlying framework detected that one or more required topology levels are no longer available.
- Check the condition message for specifics.
- Inspect the `ClusterTopology` CR to see if a domain was removed or renamed.
- If the topology was intentionally changed, delete and recreate the DGD to pick up the new topology.
### DGD rejected: hierarchy violation
A service-level `packDomain` is broader than the deployment-level `packDomain`. "Broader" and "narrower" are determined by the order of levels in the `ClusterTopology` CR — levels appearing earlier in `spec.levels` are broader.
Ensure service-level constraints are equal to or narrower than the deployment-level constraint.