| `enabled` _boolean_ | Deprecated: This field is ignored. | | |
| `minReplicas` _integer_ | Deprecated: This field is ignored. | | |
| `maxReplicas` _integer_ | Deprecated: This field is ignored. | | |
| `behavior` _[HorizontalPodAutoscalerBehavior](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#horizontalpodautoscalerbehavior-v2-autoscaling)_ | Deprecated: This field is ignored. | | |
| `metrics` _[MetricSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#metricspec-v2-autoscaling) array_ | Deprecated: This field is ignored. | | |
...
...
@@ -165,7 +168,7 @@ _Appears in:_
| `dynamoNamespace` _string_ | DynamoNamespace is deprecated and will be removed in a future version.<br/>The DGD Kubernetes namespace and DynamoGraphDeployment name are used to construct the Dynamo namespace for each component | | Optional: \{\}<br/> |
| `globalDynamoNamespace` _boolean_ | GlobalDynamoNamespace indicates that the Component will be placed in the global Dynamo namespace | | |
| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,<br/>GPUs/devices, and any runtime-specific resources. | | |
| `autoscaling` _[Autoscaling](#autoscaling)_ | Autoscaling config for this component (replica range, target utilization, etc.). | | |
| `autoscaling` _[Autoscaling](#autoscaling)_ | Deprecated: This field is deprecated and ignored. Use DynamoGraphDeploymentScalingAdapter<br/>with HPA, KEDA, or Planner for autoscaling instead. See docs/kubernetes/autoscaling.md<br/>for migration guidance. This field will be removed in a future API version. | | |
| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. | | |
| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br/>environment variables in the component containers. | | |
| `volumeMounts` _[VolumeMount](#volumemount) array_ | VolumeMounts references PVCs defined at the top level for volumes to be mounted by the component. | | |
...
...
@@ -176,8 +179,9 @@ _Appears in:_
| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows overriding the main pod spec configuration.<br/>It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br/>that allows overriding the main container configuration. | | |
| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | |
| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | |
| `replicas` _integer_ | Replicas is the desired number of Pods for this component when autoscaling is not used. | | |
| `replicas` _integer_ | Replicas is the desired number of Pods for this component.<br/>When scalingAdapter is enabled (default), this field is managed by the<br/>DynamoGraphDeploymentScalingAdapter and should not be modified directly. | | Minimum: 0 <br/> |
| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | |
| `scalingAdapter` _[ScalingAdapter](#scalingadapter)_ | ScalingAdapter configures whether this service uses the DynamoGraphDeploymentScalingAdapter.<br/>When enabled (default), replicas are managed via DGDSA and external autoscalers can scale<br/>the service using the Scale subresource. When disabled, replicas can be modified directly. | | |
#### DynamoComponentDeploymentSpec
...
...
@@ -202,7 +206,7 @@ _Appears in:_
| `dynamoNamespace` _string_ | DynamoNamespace is deprecated and will be removed in a future version.<br/>The DGD Kubernetes namespace and DynamoGraphDeployment name are used to construct the Dynamo namespace for each component | | Optional: \{\}<br/> |
| `globalDynamoNamespace` _boolean_ | GlobalDynamoNamespace indicates that the Component will be placed in the global Dynamo namespace | | |
| `resources` _[Resources](#resources)_ | Resources requested and limits for this component, including CPU, memory,<br/>GPUs/devices, and any runtime-specific resources. | | |
| `autoscaling` _[Autoscaling](#autoscaling)_ | Autoscaling config for this component (replica range, target utilization, etc.). | | |
| `autoscaling` _[Autoscaling](#autoscaling)_ | Deprecated: This field is deprecated and ignored. Use DynamoGraphDeploymentScalingAdapter<br/>with HPA, KEDA, or Planner for autoscaling instead. See docs/kubernetes/autoscaling.md<br/>for migration guidance. This field will be removed in a future API version. | | |
| `envs` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Envs defines additional environment variables to inject into the component containers. | | |
| `envFromSecret` _string_ | EnvFromSecret references a Secret whose key/value pairs will be exposed as<br/>environment variables in the component containers. | | |
| `volumeMounts` _[VolumeMount](#volumemount) array_ | VolumeMounts references PVCs defined at the top level for volumes to be mounted by the component. | | |
...
...
@@ -213,8 +217,9 @@ _Appears in:_
| `extraPodSpec` _[ExtraPodSpec](#extrapodspec)_ | ExtraPodSpec allows overriding the main pod spec configuration.<br/>It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field<br/>that allows overriding the main container configuration. | | |
| `livenessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | LivenessProbe to detect and restart unhealthy containers. | | |
| `readinessProbe` _[Probe](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#probe-v1-core)_ | ReadinessProbe to signal when the container is ready to receive traffic. | | |
| `replicas` _integer_ | Replicas is the desired number of Pods for this component when autoscaling is not used. | | |
| `replicas` _integer_ | Replicas is the desired number of Pods for this component.<br/>When scalingAdapter is enabled (default), this field is managed by the<br/>DynamoGraphDeploymentScalingAdapter and should not be modified directly. | | Minimum: 0 <br/> |
| `multinode` _[MultinodeSpec](#multinodespec)_ | Multinode is the configuration for multinode components. | | |
| `scalingAdapter` _[ScalingAdapter](#scalingadapter)_ | ScalingAdapter configures whether this service uses the DynamoGraphDeploymentScalingAdapter.<br/>When enabled (default), replicas are managed via DGDSA and external autoscalers can scale<br/>the service using the Scale subresource. When disabled, replicas can be modified directly. | | |
#### DynamoGraphDeployment
...
...
@@ -314,6 +319,83 @@ _Appears in:_
| `deployment` _[DeploymentStatus](#deploymentstatus)_ | Deployment tracks the auto-created DGD when AutoApply is true.<br/>Contains name, namespace, state, and creation status of the managed DGD. | | Optional: \{\}<br/> |
#### DynamoGraphDeploymentScalingAdapter
DynamoGraphDeploymentScalingAdapter provides a scaling interface for individual services
within a DynamoGraphDeployment. It implements the Kubernetes scale
subresource, enabling integration with HPA, KEDA, and custom autoscalers.
The adapter acts as an intermediary between autoscalers and the DGD,
ensuring that only the adapter controller modifies the DGD's service replicas.
This prevents conflicts when multiple autoscaling mechanisms are in play.
| `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | |
| `replicas` _integer_ | Replicas is the desired number of replicas for the target service.<br/>This field is modified by external autoscalers (HPA/KEDA/Planner) or manually by users. | | Minimum: 0 <br/>Required: \{\}<br/> |
| `dgdRef` _[DynamoGraphDeploymentServiceRef](#dynamographdeploymentserviceref)_ | DGDRef references the DynamoGraphDeployment and the specific service to scale. | | Required: \{\}<br/> |
#### DynamoGraphDeploymentScalingAdapterStatus
DynamoGraphDeploymentScalingAdapterStatus defines the observed state of DynamoGraphDeploymentScalingAdapter
| `replicas` _integer_ | Replicas is the current number of replicas for the target service.<br/>This is synced from the DGD's service replicas and is required for the scale subresource. | | |
| `selector` _string_ | Selector is a label selector string for the pods managed by this adapter.<br/>Required for HPA compatibility via the scale subresource. | | |
| `lastScaleTime` _[Time](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#time-v1-meta)_ | LastScaleTime is the last time the adapter scaled the target service. | | |
#### DynamoGraphDeploymentServiceRef
DynamoGraphDeploymentServiceRef identifies a specific service within a DynamoGraphDeployment
| `name` _string_ | Name of the DynamoGraphDeployment | | MinLength: 1 <br/>Required: \{\}<br/> |
| `serviceName` _string_ | ServiceName is the key name of the service within the DGD's spec.services map to scale | | MinLength: 1 <br/>Required: \{\}<br/> |
| `disable` _boolean_ | Disable indicates whether the ScalingAdapter should be disabled for this service.<br/>When false (default), a DGDSA is created and owns the replicas field.<br/>When true, no DGDSA is created and replicas can be modified directly in the DGD. | false | |
This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the `sglang-agg` example from `examples/backends/sglang/deploy/agg.yaml`.
## Example DGD
All examples in this guide use the following DGD:
```yaml
# examples/backends/sglang/deploy/agg.yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
  namespace: default
spec:
  services:
    Frontend:
      dynamoNamespace: sglang-agg
      componentType: frontend
      replicas: 1
    decode:
      dynamoNamespace: sglang-agg
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
```
**Key identifiers:**
- **DGD name**: `sglang-agg`
- **Namespace**: `default`
- **Services**: `Frontend`, `decode`
- **dynamo_namespace label**: `default-sglang-agg` (used for metric filtering)
## Overview
Dynamo provides flexible autoscaling through the `DynamoGraphDeploymentScalingAdapter` (DGDSA) resource. When you deploy a DGD, the operator automatically creates one adapter per service (unless explicitly disabled). These adapters implement the Kubernetes [Scale subresource](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource), enabling integration with:
| Autoscaler | Description | Best For |
|------------|-------------|----------|
| **KEDA** | Event-driven autoscaling (recommended) | Most use cases |
> **⚠️ Deprecation Notice**: The `spec.services[X].autoscaling` field in DGD is **deprecated and ignored**. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with `autoscaling` configured, you'll see a warning. Remove the field to silence the warning.
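
As an illustration of the migration, a service that still carries the deprecated block might look like the fragment here. The values are hypothetical; the field names (`enabled`, `minReplicas`, `maxReplicas`) come from the `Autoscaling` type in the API reference. Deleting the whole `autoscaling` block silences the warning, and the replica bounds move to whichever autoscaler targets the adapter.

```yaml
# Fragment of a DGD that predates DGDSA (values are hypothetical)
spec:
  services:
    decode:
      componentType: worker
      replicas: 1
      autoscaling:        # deprecated and ignored: delete this block
        enabled: true
        minReplicas: 1    # becomes minReplicas (HPA) or minReplicaCount (KEDA)
        maxReplicas: 10   # becomes maxReplicas (HPA) or maxReplicaCount (KEDA)
```
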
1. You deploy a DGD with services (Frontend, decode)
2. The operator auto-creates one DGDSA per service
3. Autoscalers (KEDA, HPA, Planner) target the adapters via the `/scale` subresource
4. The adapter controller syncs replica changes to the DGD
5. The DGD controller reconciles the underlying pods
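
As a rough sketch of what step 2 produces, the adapter for the `decode` service might look like the manifest here. The field names (`replicas`, `dgdRef.name`, `dgdRef.serviceName`) come from the API reference. Placing them under `spec` and the object name `sglang-agg-decode` are assumptions based on the example `kubectl get dgdsa` output in this guide, and the operator may set additional metadata that is not shown.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentScalingAdapter
metadata:
  name: sglang-agg-decode   # assumed naming: <DGD name>-<lowercased service name>
  namespace: default
spec:                       # assumption: replicas and dgdRef live under spec
  replicas: 1               # desired replicas; written by HPA/KEDA/Planner or manually
  dgdRef:
    name: sglang-agg        # the DynamoGraphDeployment
    serviceName: decode     # key in the DGD's spec.services map
```

Autoscalers reference this object in `scaleTargetRef` by `kind` and `name`, as the HPA and KEDA examples in this guide do.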
## Viewing Scaling Adapters
After deploying the `sglang-agg` DGD, verify the auto-created adapters:
```bash
kubectl get dgdsa -n default
# Example output:
# NAME                  DGD          SERVICE    REPLICAS   AGE
# sglang-agg-frontend   sglang-agg   Frontend   1          5m
# sglang-agg-decode     sglang-agg   decode     1          5m
```
## Replica Ownership Model
When DGDSA is enabled (the default), it becomes the **source of truth** for replica counts. This follows the same pattern as Kubernetes Deployments owning ReplicaSets.
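
Since the adapter owns the replica count, a manual change is made on the adapter rather than on the DGD. One declarative way to do that, sketched here, is a merge patch applied with `kubectl patch dgdsa sglang-agg-decode -n default --type merge --patch-file replicas-patch.yaml`. The file name is hypothetical, and the patch assumes `replicas` sits under the adapter's `spec`.

```yaml
# replicas-patch.yaml (hypothetical file name)
# Merge patch for the sglang-agg-decode adapter; assumes replicas lives under spec.
spec:
  replicas: 3
```

If an HPA, KEDA, or Planner also targets the same adapter, it will likely overwrite a manually set value on its next evaluation.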
```bash
# Error: spec.services[decode].replicas cannot be modified directly when scaling adapter is enabled;
# use 'kubectl scale dgdsa/sglang-agg-decode --replicas=3' or update the DynamoGraphDeploymentScalingAdapter instead
```
## Disabling DGDSA for a Service
If you want to manage replicas directly in the DGD (without autoscaling), you can disable the scaling adapter per service:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
spec:
  services:
    Frontend:
      replicas: 2
      scalingAdapter:
        disable: true  # ← No DGDSA created, direct edits allowed
    decode:
      replicas: 1  # ← DGDSA created by default, managed via adapter
```
**When to disable DGDSA:**
- You want simple, manual replica management
- You don't need autoscaling for that service
- You prefer direct DGD edits over adapter-based scaling
**When to keep DGDSA enabled (default):**
- You want to use HPA, KEDA, or Planner for autoscaling
- You want a clear separation between "desired scale" (adapter) and "deployment config" (DGD)
- You want protection against accidental direct replica edits
## Autoscaling with Dynamo Planner
The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions based on inference-specific metrics like Time To First Token (TTFT), Inter-Token Latency (ITL), and KV cache utilization.
**When to use Planner:**
- You want LLM-optimized autoscaling out of the box
- You need coordinated scaling across prefill/decode services
- You want SLA-driven scaling (e.g., target TTFT < 500ms)
**How Planner works:**
Planner is deployed as a service component within your DGD. It:
1. Queries Prometheus for frontend metrics (request rate, latency, etc.)
2. Uses profiling data to predict optimal replica counts
3. Scales prefill/decode workers to meet SLA targets
**Deployment:**
The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](../planner/sla_planner_quickstart.md) for complete instructions.
For more details, see the [SLA Planner documentation](../planner/sla_planner.md).
## Autoscaling with Kubernetes HPA
The Horizontal Pod Autoscaler (HPA) is Kubernetes' native autoscaling solution.
**When to use HPA:**
- You have simple, predictable scaling requirements
- You want to use standard Kubernetes tooling
- You need CPU or memory-based scaling
> **Note**: For custom metrics (like TTFT or queue depth), consider using [KEDA](#autoscaling-with-keda-recommended) instead; it is simpler to configure.
### Basic HPA (CPU-based)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 0
```
### HPA with Dynamo Metrics
Dynamo exports several metrics useful for autoscaling. These are available at the `/metrics` endpoint on each frontend pod.
> **See also**: For a complete list of all Dynamo metrics, see the [Metrics Reference](../observability/metrics.md). For Prometheus and Grafana setup, see the [Prometheus and Grafana Setup Guide](../observability/prometheus-grafana.md).
#### Available Dynamo Metrics
| Metric | Type | Description | Good for scaling |
> **Note**: If you have Prometheus Adapter installed, either uninstall it first (`helm uninstall prometheus-adapter -n monitoring`) or install KEDA with `--set metricsServer.enabled=false` to avoid API conflicts.
### Example: Scale Decode Based on TTFT
Using the `sglang-agg` DGD from `examples/backends/sglang/deploy/agg.yaml`:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15  # Check metrics every 15 seconds
  cooldownPeriod: 60   # Wait 60s before scaling down
  triggers:
  - type: prometheus
    metadata:
      # Update this URL to match your Prometheus service