Commit 7752ce21 authored by Anish, committed by GitHub

docs: planner 3-tier documentation restructure (#5876)


Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent 8aa7335e
...@@ -15,4 +15,9 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
Please refer to [planner docs](../../../../docs/planner/planner_intro.rst) for planner documentation.
# Planner
SLA-driven autoscaling controller for Dynamo inference graphs.
- **User docs**: [docs/planner/](/docs/planner/) (deployment, configuration, examples)
- **Design docs**: [docs/design_docs/planner_design.md](/docs/design_docs/planner_design.md) (architecture, algorithms)
# Planner Design
> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/planner/](/docs/planner/).
## Overview
The Planner is Dynamo's autoscaling controller. It observes system metrics, predicts future load, and adjusts prefill/decode worker replica counts to proactively meet SLA targets. This document covers the internal architecture, algorithms, and design trade-offs.
## Architecture
```text
┌──────────────────────────────────────────────────────────┐
│ Planner Component │
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌────────────────┐ │
│ │ Metric │ │ Load │ │ Performance │ │
│ │ Collector │ │ Predictor │ │ Interpolator │ │
│ │ (Prometheus) │ │ (ARIMA/etc.) │ │ (JSON data) │ │
│ └───────┬───────┘ └───────┬───────┘ └───────┬────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Scaling Algorithm │ │
│ └───────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌───────────────────────▼───────────────────────────┐ │
│ │ Connector Layer │ │
│ │ ┌───────────────────┐ ┌───────────────────────┐ │ │
│ │ │ KubernetesConn. │ │ VirtualConn. │ │ │
│ │ │ (PATCH DGD) │ │ (Runtime bridge) │ │ │
│ │ └───────────────────┘ └───────────────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
```
## Scaling Algorithm
### Step 1: Metric Collection
Every `adjustment_interval` seconds, the planner queries Prometheus for:
- Average TTFT and ITL over the interval
- Total request count
- Average input sequence length (ISL) and output sequence length (OSL)
The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes histograms and counters.
### Step 2: Correction Factor Calculation
The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
```text
prefill_correction = actual_ttft / expected_ttft
decode_correction = actual_itl / expected_itl
```
These factors account for hard-to-model effects such as:
- **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state
- **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT
- **Chunked prefill in decode**: Small prefills processed in decode engine affect ITL
- **Metric variance**: Average ISL/OSL may not represent the actual distribution
The correction factors are applied as multipliers to the next scaling decision. Setting `--no-correction` disables this for debugging or when cold-start artifacts dominate.
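The ratios above can be sketched in a few lines. This is a minimal illustration, not the planner's code: the function name, the zero guard, and the sample latencies are all assumptions made for the example.

```python
def correction_factor(actual_ms: float, expected_ms: float) -> float:
    """Ratio of observed to profiled latency; > 1 means slower than profiled."""
    if expected_ms <= 0:
        # Assumption: fall back to a neutral factor when no expectation exists.
        return 1.0
    return actual_ms / expected_ms


# Hypothetical numbers: profiling predicted 200 ms TTFT, Prometheus reports 260 ms.
prefill_correction = correction_factor(actual_ms=260.0, expected_ms=200.0)
# Hypothetical numbers: profiling predicted 20 ms ITL, Prometheus reports 18 ms.
decode_correction = correction_factor(actual_ms=18.0, expected_ms=20.0)
print(prefill_correction, decode_correction)
```

A factor above 1 means the system is slower than the profile predicted (for example, because of queueing); a factor below 1 means it is faster (for example, because of prefix cache hits).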
### Step 3: Load Prediction
The planner forecasts three values for the next interval:
- `next_num_req`: Number of requests
- `next_isl`: Average input sequence length
- `next_osl`: Average output sequence length
Four predictor implementations are available:
| Predictor | Algorithm | Best For |
| ------------ | ---------------------------------------- | -------------------------------- |
| **Constant** | `next = current` | Stable workloads, long intervals |
| **ARIMA** | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns |
| **Kalman** | Local linear trend Kalman filter | Bursty traffic |
| **Prophet** | Facebook Prophet time-series model | Complex seasonality |
All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`).
### Step 4: Replica Calculation
**Prefill replicas:**
```python
# Predicted prefill load in tokens/s; the correction factor can only scale it down
predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
```
The prefill correction factor has a linear effect on throughput because prefill is single-batched.
**Decode replicas:**
```python
# Apply correction to the ITL SLA target
corrected_itl = target_itl / decode_correction
# Find best throughput/GPU that achieves corrected ITL at predicted context length
throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
    itl=corrected_itl,
    context_length=next_isl + next_osl / 2
)
# Calculate required replicas
decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)
```
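To make the arithmetic concrete, here is a small worked example of both formulas. It is a sketch under stated assumptions: the function names are illustrative, the per-GPU throughput values are hypothetical constants rather than interpolator output, and the decode throughput is assumed to have already been looked up at the corrected ITL.

```python
from math import ceil


def prefill_replicas(next_num_req, next_isl, interval_s, prefill_correction,
                     thpt_per_gpu, gpus_per_engine):
    # Predicted prefill load in tokens/s, scaled down (never up) by the correction factor.
    predicted_load = next_num_req * next_isl / interval_s * min(1, prefill_correction)
    return ceil(predicted_load / thpt_per_gpu / gpus_per_engine)


def decode_replicas(next_num_req, next_osl, interval_s, thpt_per_gpu, gpus_per_engine):
    # Predicted decode load in generated tokens/s, divided by per-engine capacity.
    return ceil(next_num_req * next_osl / interval_s / thpt_per_gpu / gpus_per_engine)


# Hypothetical inputs: 1800 requests per 180 s interval, ISL 3000, OSL 150.
p = prefill_replicas(1800, 3000, 180, prefill_correction=0.8,
                     thpt_per_gpu=10000, gpus_per_engine=1)
d = decode_replicas(1800, 150, 180, thpt_per_gpu=500, gpus_per_engine=1)
print(p, d)
```

With these inputs the prefill load is 30,000 tokens/s before correction and 24,000 tokens/s after, which needs three single-GPU engines at 10,000 tokens/s each; the decode load of 1,500 tokens/s also needs three engines at 500 tokens/s each.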
### Step 5: Scaling Execution
The planner calls `connector.set_component_replicas()` with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.
## Connector Design
### Interface
```python
from abc import ABC


class PlannerConnector(ABC):
    async def add_component(self, component_name): ...

    async def remove_component(self, component_name): ...

    # Extended interface (not on the ABC, but implemented by both connectors):
    #   async def set_component_replicas(self, targets, blocking)
    #   async def validate_deployment(self, ...)
    #   async def wait_for_deployment_ready(self)
```
### KubernetesConnector
Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
**Design decisions:**
- Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
- Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
- Validates deployment structure on startup: checks that prefill and decode services exist and model names match
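The service lookup and the patch body can be sketched as plain dictionary manipulation. This is an illustration only: the `spec.services.<name>.replicas` layout, the helper names, and the legacy fallback names are assumptions made for the example, not the connector's actual code.

```python
def find_service(dgd: dict, sub_type: str, legacy_name: str) -> str:
    """Return the service name whose subComponentType matches, else the legacy name."""
    services = dgd.get("spec", {}).get("services", {})
    for name, svc in services.items():
        if svc.get("subComponentType") == sub_type:
            return name
    if legacy_name in services:
        return legacy_name
    raise KeyError(f"no {sub_type} service found in DGD")


def build_replica_patch(dgd: dict, num_prefill: int, num_decode: int) -> dict:
    """Build a merge-patch body that only touches the two replica counts."""
    # Legacy names below are placeholders, not the planner's real fallback names.
    prefill = find_service(dgd, "prefill", legacy_name="PrefillWorker")
    decode = find_service(dgd, "decode", legacy_name="DecodeWorker")
    return {"spec": {"services": {
        prefill: {"replicas": num_prefill},
        decode: {"replicas": num_decode},
    }}}


# Hypothetical DGD fragment for illustration.
dgd = {"spec": {"services": {
    "Frontend": {"replicas": 1},
    "VllmPrefillWorker": {"subComponentType": "prefill", "replicas": 1},
    "DecodeWorker": {"replicas": 1},  # resolved through the legacy-name fallback
}}}
patch = build_replica_patch(dgd, num_prefill=3, num_decode=2)
print(patch)
```

Patching only the replica fields leaves the rest of the DGD untouched, so the operator reconciles nothing but the worker counts.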
### VirtualConnector
Used for non-Kubernetes environments (e.g., custom orchestrators). It writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.
**Scaling decision flow:**
1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
2. External system reads decision via `client.wait()`
3. External system executes scaling
4. External system reports completion via `client.complete(decision)`
5. Planner sees `scaled_decision_id >= decision_id` and proceeds
**Timeout**: If scaling isn't acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.
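The acknowledgement check in step 5 and the timeout can be sketched as a polling loop. This is a simplified, synchronous illustration: `read_scaled_id` is a hypothetical callback standing in for the runtime lookup, and the real planner goes through `VirtualConnectorCoordinator` rather than code like this.

```python
import time


def scaling_acknowledged(decision_id: int, scaled_decision_id: int) -> bool:
    # Step 5: the external system has caught up once its acknowledged id
    # reaches the id of the decision the planner wrote.
    return scaled_decision_id >= decision_id


def wait_for_ack(decision_id, read_scaled_id, timeout_s=1800.0, poll_s=1.0,
                 clock=time.monotonic, sleep=time.sleep) -> bool:
    """Return True if acknowledged, False if the timeout elapsed first."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if scaling_acknowledged(decision_id, read_scaled_id()):
            return True
        sleep(poll_s)
    return False


# Hypothetical external system that acknowledges decision 3 on the third poll.
acks = iter([0, 0, 3])
print(wait_for_ack(3, lambda: next(acks), timeout_s=10, poll_s=0))
```

Comparing with `>=` rather than `==` means a late acknowledgement of a newer decision also unblocks the planner.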
## Performance Interpolation
The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
Two interpolators are maintained:
- **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT
- **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL
The interpolators use the profiling sweep granularity to determine precision. Finer granularity requires more profiling samples but yields more accurate interpolation.
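As a rough illustration of what interpolation over a profiling sweep means, the sketch below does one-dimensional piecewise-linear interpolation from ISL to TTFT. It is an assumption-laden simplification: the sample points are invented, the clamping behaviour is a choice made for the example, and the real interpolators work on NPZ data with more dimensions.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation, clamped to the profiled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical prefill sweep: ISL (tokens) -> measured TTFT (ms).
isl_points = [1000, 2000, 4000, 8000]
ttft_points = [60.0, 120.0, 260.0, 600.0]
print(interpolate(isl_points, ttft_points, 3000))
```

More sweep points shrink the gap between neighbouring samples, which is why a finer granularity improves accuracy at the cost of more profiling runs.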
## Initialization
The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
After the delay:
1. Initialize the connector (K8s or Virtual based on `--environment`)
2. Validate deployment structure
3. Load profiling results
4. Build interpolators
5. Initialize load predictor
6. Enter main scaling loop
## Performance Considerations
- **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. The default of 180s is conservative; workloads with fast model loading can use shorter intervals.
- **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. The default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
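The warm-up fallback described in the last bullet can be sketched as a thin wrapper around any forecasting model. The function names and the stand-in model are hypothetical; the sketch only shows the shape of the fallback, not the planner's implementation.

```python
def forecast_next(history, model_predict, min_points):
    """Use the model once enough history exists, else repeat the last value."""
    if not history:
        raise ValueError("need at least one observation")
    if len(history) < min_points:
        return history[-1]  # constant-predictor fallback during warm-up
    return model_predict(history)


def naive_trend(history):
    # Stand-in for ARIMA/Prophet/Kalman: extend the last observed step.
    return history[-1] + (history[-1] - history[-2])


print(forecast_next([100], naive_trend, min_points=3))
print(forecast_next([100, 120, 140], naive_trend, min_points=3))
```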
## Known Limitations
1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
5. **Load-based planner deprecated**: The load-based code path exists but is non-functional with current backends (no prefill queue metrics).
## Future Work
- Support aggregated (non-disaggregated) scaling mode for single-worker deployments
- Multi-DGD coordination for shared-cluster scenarios
- Distribution-aware interpolation (beyond mean ISL/OSL)
- Adaptive adjustment interval based on observed scaling latency
## File Map
| File | Size | Purpose |
| ---------------------------- | ---- | ----------------------------------------------------- |
| `planner_core.py` | 36k | Main scaling loop, algorithm implementation |
| `perf_interpolation.py` | 13k | NPZ data loading and throughput/latency interpolation |
| `load_predictor.py` | 16k | ARIMA, Prophet, Kalman, Constant predictors |
| `pre_swept_results_utils.py` | 12k | Pre-computed H100/H200 profiling data loader |
| `kubernetes_connector.py` | 11k | K8s API integration for DGD scaling |
| `kube.py` | 7.4k | Low-level K8s client wrapper |
| `exceptions.py` | 7.2k | Custom exception hierarchy |
| `prometheus.py` | 7.3k | Prometheus query builder and client |
| `defaults.py` | 8.1k | Default configs, backend name mappings |
| `planner_argparse.py` | 6.2k | CLI argument definitions |
...@@ -88,3 +88,4 @@ Quickstart
Distributed Runtime <design_docs/distributed_runtime.md>
Request Plane <design_docs/request_plane.md>
Event Plane <design_docs/event_plane.md>
Planner Design <design_docs/planner_design.md>
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Planner
The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](sla_planner_quickstart.md) for a complete workflow including profiling and deployment.
## Feature Matrix
| Category | Feature | Status |
|----------|---------|--------|
| **Backend** | Local (bare metal) | Deprecated |
| | Kubernetes | Supported |
| **LLM Framework** | vLLM | Supported |
| | TensorRT-LLM | Supported |
| | SGLang | Supported |
| **Serving Type** | Aggregated | Unsupported |
| | Disaggregated | Supported |
| **Scaling Mode** | SLA-based (TTFT/ITL targets) | Supported (primary) |
| | Load-based (KV cache/queue thresholds) | Deprecated |
| **Load Predictors** | ARIMA | Supported |
| | Prophet | Supported |
| | Kalman filter | Supported |
| | Constant (current = next) | Supported |
| **Connectors** | KubernetesConnector (native DGD scaling) | Supported |
| | VirtualConnector (external environments) | Supported |
## Quick Start
### Prerequisites
- Dynamo platform installed on Kubernetes ([Installation Guide](/docs/kubernetes/installation_guide.md))
- kube-prometheus-stack installed ([Metrics Setup](/docs/kubernetes/observability/metrics.md))
- Pre-deployment profiling completed ([Profiling Guide](/docs/benchmarks/sla_driven_profiling.md))
### Deploy with DGDR (Recommended)
The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:
```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```
This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Quick Start](sla_planner_quickstart.md) for the full workflow.
### Deploy with DGD (Manual)
For manual control, use the disaggregated planner templates:
```bash
# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
```
## Documentation
| Document | Description |
|----------|-------------|
| [Planner Guide](planner_guide.md) | Deployment, configuration, integration, troubleshooting |
| [Planner Examples](planner_examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
| [SLA Planner Quick Start](sla_planner_quickstart.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| [SLA-based Planner](sla_planner.md) | Scaling algorithm, correction factors, load prediction details |
| [Load-based Planner](load_planner.md) | Legacy load-based scaling (deprecated) |
| [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) | Pre-deployment profiling process and configuration |
| [Planner Design](/docs/design_docs/planner_design.md) | Architecture deep-dive for contributors |
## Configuration Reference
### Key Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
| `--environment` | `kubernetes` | Deployment environment |
| `--adjustment-interval` | `180` | Seconds between scaling decisions |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--isl` | `3000` | Expected average input sequence length |
| `--osl` | `150` | Expected average output sequence length |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| `--no-correction` | `false` | Disable correction factors |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |
## Monitoring
### Grafana Dashboard
Deploy the planner dashboard:
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```
The dashboard shows:
- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)
### Prometheus Metrics
The planner queries the frontend's `/metrics` endpoint via Prometheus. Required metrics:
- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
# Planner Examples
Practical examples for deploying the SLA Planner with different configurations. For deployment concepts, see the [Planner Guide](planner_guide.md). For a quick overview, see the [Planner README](README.md).
## Basic Examples
### Minimal DGDR with AIC (Fastest)
The simplest way to deploy with the SLA planner. Uses AI Configurator for offline profiling (20-30 seconds instead of hours):
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-aic
spec:
  model: Qwen/Qwen3-32B
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: Qwen/Qwen3-32B
        aicBackendVersion: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  autoApply: true
```
Deploy:
```bash
export NAMESPACE=your-namespace
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```
### Online Profiling (Real Measurements)
Standard online profiling runs real GPU measurements for more accurate results. Takes 2-4 hours:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-online
spec:
  model: meta-llama/Llama-3.3-70B-Instruct
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: false
        prefillInterpolationGranularity: 16
        decodeInterpolationGranularity: 6
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  autoApply: true
```
Deploy:
```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_dgdr.yaml -n $NAMESPACE
```
Available sample DGDRs in `benchmarks/profiler/deploy/`:
- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator
- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
> **Profiling Config Cases**: Before 0.8.1, fields under `profilingConfig.config` used snake_case. Starting with 0.8.1, they use camelCase. snake_case is still accepted for backward compatibility, but the example DGDRs use camelCase.
## Kubernetes Examples
### MoE Models (SGLang)
For Mixture-of-Experts models like DeepSeek-R1, use the SGLang backend:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-moe
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: false
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
  autoApply: true
```
Deploy:
```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml -n $NAMESPACE
```
### Using Existing DGD Configs (Custom Setups)
Reference an existing DynamoGraphDeployment config via ConfigMap:
**Step 1: Create ConfigMap from your DGD config:**
```bash
kubectl create configmap deepseek-r1-config \
--from-file=disagg.yaml=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
```
**Step 2: Reference it in your DGDR:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml # Must match the key used in --from-file
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: deepseek-ai/DeepSeek-V3
        aicBackendVersion: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
  autoApply: true
```
The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` and `spec.backend` into the final configuration.
### Inline Configuration (Simple Use Cases)
For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler auto-generates a basic DGD configuration:
```yaml
profilingConfig:
  config:
    sla:
      isl: 8000
      osl: 200
      ttft: 200.0
      itl: 10.0
    hardware:
      minNumGpusPerEngine: 2
      maxNumGpusPerEngine: 8
      gpuType: h200_sxm
    sweep:
      prefillInterpolationGranularity: 16
      decodeInterpolationGranularity: 6
```
### Mocker Deployment (Testing)
Deploy a mocker backend that simulates GPU timing behavior without real GPUs. Useful for:
- Large-scale experiments without GPU resources
- Testing planner behavior and infrastructure
- Validating deployment configurations
```yaml
spec:
  model: <model-name>
  backend: trtllm # Real backend for profiling
  useMocker: true # Deploy mocker instead of real backend
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: true
        aicSystem: h100_sxm
  autoApply: true
```
Profiling runs against the real backend (via GPUs or AIC). The mocker deployment then uses profiling data to simulate realistic timing.
### Model Cache PVC (0.8.1+)
For large models, use a pre-populated PVC instead of downloading from HuggingFace. See [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) for configuration details.
## Advanced Examples
### Custom Load Predictors
#### Warm-starting with Trace Data
Pre-load predictors with historical request patterns before live traffic:
```yaml
# In planner arguments
args:
- --load-predictor arima
- --load-predictor-warmup-trace /data/trace.jsonl
- --load-predictor-log1p
```
The trace file should be in mooncake-style JSONL format with request-count, ISL, and OSL samples.
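To show what such a trace contains, the sketch below groups trace records into adjustment intervals and computes the three series the predictors consume. The field names (`timestamp` in milliseconds, `input_length`, `output_length`) are an assumption based on the public Mooncake trace layout, and `summarize_trace` is a hypothetical helper, not part of the planner.

```python
import json
from collections import defaultdict


def summarize_trace(lines, interval_s=180):
    """Group JSONL trace records into intervals of request count, avg ISL, avg OSL."""
    buckets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # Assumed fields: timestamp (ms), input_length, output_length.
        idx = int(rec["timestamp"] // (interval_s * 1000))
        buckets[idx].append((rec["input_length"], rec["output_length"]))
    summary = []
    for idx in sorted(buckets):
        reqs = buckets[idx]
        n = len(reqs)
        summary.append({
            "num_req": n,
            "avg_isl": sum(isl for isl, _ in reqs) / n,
            "avg_osl": sum(osl for _, osl in reqs) / n,
        })
    return summary


# Hypothetical three-request trace spanning two 180 s intervals.
sample = [
    '{"timestamp": 0, "input_length": 1000, "output_length": 100}',
    '{"timestamp": 90000, "input_length": 3000, "output_length": 200}',
    '{"timestamp": 200000, "input_length": 500, "output_length": 50}',
]
summary = summarize_trace(sample, interval_s=180)
print(summary)
```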
#### Kalman Filter Tuning
For workloads with rapid changes, tune the Kalman filter:
```yaml
args:
- --load-predictor kalman
- --kalman-q-level 2.0 # Higher = more responsive to level changes
- --kalman-q-trend 0.5 # Higher = trend changes faster
- --kalman-r 5.0 # Lower = trusts new measurements more
- --kalman-min-points 3 # Fewer points before forecasting starts
- --load-predictor-log1p # Often helps with request-rate series
```
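For intuition about what these knobs control, here is a minimal local linear trend Kalman filter in plain Python. It is a textbook sketch, not the planner's predictor: the class and method names are illustrative, and the initial covariance of 1e3 is an assumption chosen so that early observations dominate.

```python
class LocalLinearTrendKalman:
    """State is (level, trend); each step the level advances by the trend."""

    def __init__(self, q_level: float, q_trend: float, r: float):
        self.q_level, self.q_trend, self.r = q_level, q_trend, r
        self.level, self.trend = None, 0.0
        # 2x2 state covariance stored as scalars; large values mean "uncertain".
        self.p00, self.p01, self.p10, self.p11 = 1e3, 0.0, 0.0, 1e3

    def observe(self, z: float) -> None:
        if self.level is None:
            self.level = z  # initialise the level from the first observation
            return
        # Predict: x = F x, P = F P F^T + Q with F = [[1, 1], [0, 1]].
        level = self.level + self.trend
        p00 = self.p00 + self.p01 + self.p10 + self.p11 + self.q_level
        p01 = self.p01 + self.p11
        p10 = self.p10 + self.p11
        p11 = self.p11 + self.q_trend
        # Update with measurement z (only the level is observed, H = [1, 0]).
        s = p00 + self.r
        k0, k1 = p00 / s, p10 / s
        residual = z - level
        self.level = level + k0 * residual
        self.trend = self.trend + k1 * residual
        self.p00, self.p01 = (1 - k0) * p00, (1 - k0) * p01
        self.p10, self.p11 = p10 - k1 * p00, p11 - k1 * p01

    def predict_next(self) -> float:
        return self.level + self.trend


kf = LocalLinearTrendKalman(q_level=2.0, q_trend=0.5, r=5.0)
for requests in range(10, 101, 10):  # steadily rising load: 10, 20, ..., 100
    kf.observe(requests)
print(round(kf.predict_next(), 1))
```

Larger process noise (`q_level`, `q_trend`) increases the gains `k0` and `k1`, so the filter follows new measurements more closely; a larger measurement noise `r` decreases them and smooths the forecast.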
#### Prophet for Seasonal Workloads
For workloads with daily/weekly patterns:
```yaml
args:
- --load-predictor prophet
- --prophet-window-size 100 # Larger window for seasonal detection
- --load-predictor-log1p
```
### Virtual Connector
For non-Kubernetes environments, use the VirtualConnector to communicate scaling decisions:
```python
from dynamo._core import DistributedRuntime, VirtualConnectorClient


async def run_scaling_loop(distributed_runtime: DistributedRuntime, namespace: str):
    # Initialize client
    client = VirtualConnectorClient(distributed_runtime, namespace)

    # Main loop: watch for planner decisions and execute them
    while True:
        # Block until the planner makes a new scaling decision
        await client.wait()

        # Read the decision
        decision = await client.get()
        print(f"Scale to: prefill={decision.num_prefill_workers}, "
              f"decode={decision.num_decode_workers}, "
              f"id={decision.decision_id}")

        # Execute scaling in your environment
        # (scale_prefill_workers / scale_decode_workers are your own functions)
        scale_prefill_workers(decision.num_prefill_workers)
        scale_decode_workers(decision.num_decode_workers)

        # Report completion
        await client.complete(decision)
```
See `components/planner/test/test_virtual_connector.py` for a full working example.
### Planner Configuration Passthrough
Pass planner-specific settings through the DGDR:
```yaml
profilingConfig:
  config:
    planner:
      plannerMinEndpoint: 2
```
### Review Before Deploy (autoApply: false)
Disable auto-deployment to inspect the generated DGD:
```yaml
spec:
  autoApply: false
```
After profiling completes:
```bash
# Extract and review generated DGD
kubectl get dgdr sla-aic -n $NAMESPACE \
-o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
# Review and modify as needed
vi my-dgd.yaml
# Deploy manually
kubectl apply -f my-dgd.yaml -n $NAMESPACE
```
### Profiling Artifacts with PVC
Save detailed profiling artifacts (plots, logs, raw data) to a PVC:
```yaml
spec:
  profilingConfig:
    outputPVC: "dynamo-pvc"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
```
Setup:
```bash
export NAMESPACE=your-namespace
deploy/utils/setup_benchmarking_resources.sh
```
Access results:
```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```
## Related Documentation
- [Planner README](README.md) -- Overview and quick start
- [Planner Guide](planner_guide.md) -- Deployment, configuration, integration
- [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive
- [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference)
- [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md)
# Planner Guide
Deployment, configuration, and integration guide for the Dynamo SLA Planner. For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](/docs/design_docs/planner_design.md).
## Deployment
### Prerequisites
Before deploying the planner, ensure:
- **Dynamo platform installed** with the operator running (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running** (required for SLA planner metric collection)
- **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images)
- **Sufficient GPU resources** available in your cluster for profiling
- **Runtime images available** that contain both profiler and runtime components
### Container Images
Each DGDR requires container images for the profiling and deployment process:
**profilingConfig.profilerImage** (Required):
The container image used for the profiling job. Must contain the profiler code and dependencies for SLA-based profiling.
**deploymentOverrides.workersImage** (Optional):
The container image used for DGD worker components (frontend, workers, planner). Used for:
- Temporary DGDs created during online profiling (for performance measurements)
- The final DGD deployed after profiling completes
If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. Public images are available from 0.6.1 onward.
```yaml
spec:
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Optional
```
### What is a DynamoGraphDeploymentRequest (DGDR)?
A **DGDR** is a Kubernetes Custom Resource that serves as the primary interface for deploying models with specific performance and resource constraints. It specifies:
- **What** model to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
The Dynamo Operator watches for DGDRs and automatically:
1. Discovers available GPU resources in your cluster
2. Runs profiling (online or offline) to find optimal configurations
3. Generates an optimized DynamoGraphDeployment (DGD) configuration
4. Deploys the DGD to your cluster
**Key Benefits:**
- **Declarative**: Specify what you want, not how to achieve it
- **Automated**: No manual profiling job setup or result processing
- **SLA-Driven**: Ensures deployments meet your performance requirements
- **Integrated**: Works seamlessly with the Dynamo Operator
### DGDR Workflow
The DGDR workflow automates the entire process from SLA specification to deployment:
1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information
2. **Automatic Profiling**: The operator profiles your model to find optimal configurations
3. **Auto-Deploy**: The system deploys the optimal configuration that meets your SLAs
```mermaid
flowchart TD
A[Create DGDR] --> B[DGDR Controller]
B --> C{Profiling Method}
C -->|Online| D[Run Profiling Job<br/>2-4 hours]
C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
D --> F[Generate DGD Config]
E --> F
F --> G[Auto-Deploy DGD]
G --> H[Monitor & Scale]
style A fill:#e1f5fe
style D fill:#fff3e0
style E fill:#e8f5e8
style G fill:#f3e5f5
style H fill:#fff8e1
```
### Monitoring Progress
Watch DGDR status:
```bash
# View status
kubectl get dgdr -n $NAMESPACE
# Detailed status
kubectl describe dgdr sla-aic -n $NAMESPACE
# Watch profiling job logs
kubectl logs -f job/profile-sla-aic -n $NAMESPACE
```
**DGDR Status States:**
- `Pending`: Initial state, preparing to profile
- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- `Deploying`: Generating and applying DGD configuration
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)
### Relationship to DGD
- **DGDR**: High-level "intent" -- what you want deployed
- **DGD**: Low-level "implementation" -- how it's deployed
The DGDR controller generates a DGD that:
- Uses optimal TP configurations from profiling
- Includes the SLA planner for autoscaling
- Has deployment and engine settings tuned for your SLAs
The generated DGD is tracked via labels:
```yaml
metadata:
  labels:
    dgdr.nvidia.com/name: sla-aic
    dgdr.nvidia.com/namespace: your-namespace
```
## Configuration
### DGDR Configuration
#### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `spec.model` | string | Model identifier (e.g., `meta-llama/Llama-3-70b`) |
| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
| `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |
#### Optional Fields
| Field | Type | Description |
|-------|------|-------------|
| `spec.deploymentOverrides.workersImage` | string | Container image for DGD workers. If omitted, uses image from base config. |
| `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) |
| `spec.useMocker` | boolean | Deploy mocker instead of real backend (default: false) |
| `spec.deploymentOverrides` | object | Customize metadata and image for auto-created DGD |
#### SLA Configuration
```yaml
sla:
  isl: 3000   # Average input sequence length (tokens)
  osl: 150    # Average output sequence length (tokens)
  ttft: 200   # Target Time To First Token (milliseconds, float)
  itl: 20     # Target Inter-Token Latency (milliseconds, float)
```
**Choosing SLA Values:**
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed)
- **ITL**: Token generation latency target (lower = more GPUs needed)
- **Trade-offs**: Tighter SLAs require more GPU resources
For comprehensive documentation of all configuration options, see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
### Profiling Methods
Choose between **online profiling** (real measurements, 2-4 hours) or **offline profiling** with AI Configurator (estimated, 20-30 seconds):
```yaml
# Online Profiling (Default)
sweep:
  useAiConfigurator: false

# Offline Profiling (AI Configurator): use this block instead of the one above
sweep:
  useAiConfigurator: true
  aicSystem: h200_sxm
  aicHfId: Qwen/Qwen3-32B
  aicBackendVersion: "0.20.0"
```
For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-methods).
### Load Predictors
The SLA planner forecasts the number of requests, ISL, and OSL in the next adjustment interval. Four prediction models are supported:
#### Constant Predictor
- **Use case**: Stable workloads with long prediction intervals
- **Behavior**: Assumes next load equals current load
- **Configuration**: `load-predictor: "constant"`
#### ARIMA Predictor
- **Use case**: Time-series data with trends and seasonality
- **Behavior**: Uses auto-ARIMA to fit optimal model parameters
- **Configuration**: `load-predictor: "arima"`
- **Tunable parameters**:
  - `--load-predictor-log1p`: model `log1p(y)` instead of `y`. If not set, ARIMA starts in raw space, and if it collapses to `(0,d,0)`, it falls back to `log1p` automatically.
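The `log1p` option can be pictured as a thin wrapper around any forecaster: transform the history with `log1p`, forecast in the transformed space, then map the result back with `expm1`. The sketch below is illustrative only; `forecast_with_log1p` and `mean_forecaster` are hypothetical names, not part of the planner.

```python
import math
from typing import Callable, Sequence


def mean_forecaster(history: Sequence[float]) -> float:
    """Stand-in forecaster (hypothetical): predict the mean of the history."""
    return sum(history) / len(history)


def forecast_with_log1p(
    history: Sequence[float],
    forecaster: Callable[[Sequence[float]], float],
) -> float:
    """Forecast in log1p space and map the prediction back to raw space."""
    # log1p(y) = log(1 + y) is defined at y = 0, which suits count-like series.
    transformed = [math.log1p(max(y, 0.0)) for y in history]
    prediction = forecaster(transformed)
    # expm1 inverts log1p; clamp at zero because loads cannot be negative.
    return max(math.expm1(prediction), 0.0)
```

With a bursty series such as `[0, 0, 99]`, the raw mean is 33, while the `log1p`-space forecast is pulled far less by the single spike, which is why the transform often helps request-count series.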
#### Kalman Predictor
- **Use case**: Low-latency online forecasting (observe 1 -> predict 1) with smooth adaptation
- **Behavior**: Local linear trend Kalman filter (fast online updates; good default when ARIMA collapses to mean-only)
- **Configuration**: `load-predictor: "kalman"`
- **Tunable parameters**:
  - `--kalman-q-level`: process noise for level (higher = more responsive)
  - `--kalman-q-trend`: process noise for trend (higher = trend changes faster)
  - `--kalman-r`: measurement noise (lower = trusts new measurements more)
  - `--kalman-min-points`: minimum points before forecasting
  - `--load-predictor-log1p`: model `log1p(y)` instead of `y` (often helps request-rate/count series)
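To make the tunables concrete, here is a minimal, dependency-free sketch of a local linear trend Kalman filter. It is not the planner's implementation: the class name, the default values, and the fallback to the last observation before `min_points` samples are all assumptions made for illustration.

```python
class LocalLinearTrendKalman:
    """Illustrative local linear trend Kalman filter (state = [level, trend])."""

    def __init__(self, q_level=1.0, q_trend=0.1, r=10.0, min_points=5):
        self.q_level = q_level        # process noise for the level
        self.q_trend = q_trend        # process noise for the trend
        self.r = r                    # measurement noise
        self.min_points = min_points  # observations required before forecasting
        self.level = 0.0
        self.trend = 0.0
        # Covariance P = [[p00, p01], [p01, p11]]; large p11 = unknown initial trend.
        self.p00, self.p01, self.p11 = r, 0.0, 1e6
        self.n_obs = 0
        self.last_y = 0.0

    def observe(self, y):
        """Fold one new measurement into the state estimate."""
        if self.n_obs == 0:
            self.level = y  # initialise the level from the first sample
        else:
            # Predict step: x = F x, P = F P F^T + Q with F = [[1, 1], [0, 1]].
            level_pred = self.level + self.trend
            p00 = self.p00 + 2.0 * self.p01 + self.p11 + self.q_level
            p01 = self.p01 + self.p11
            p11 = self.p11 + self.q_trend
            # Update step with observation matrix H = [1, 0].
            innovation = y - level_pred
            s = p00 + self.r
            k0, k1 = p00 / s, p01 / s  # Kalman gain
            self.level = level_pred + k0 * innovation
            self.trend = self.trend + k1 * innovation
            self.p00 = (1.0 - k0) * p00
            self.p01 = (1.0 - k0) * p01
            self.p11 = p11 - k1 * p01
        self.n_obs += 1
        self.last_y = y

    def predict_next(self):
        """One-step-ahead forecast; falls back to the last value when data is scarce."""
        if self.n_obs < self.min_points:
            return self.last_y
        return self.level + self.trend
```

Raising `q_level` or `q_trend` inflates the predicted covariance and therefore the gain, so the filter follows new samples more closely; raising `r` has the opposite effect.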
#### Prophet Predictor
- **Use case**: Complex seasonal patterns and trend changes
- **Behavior**: Facebook's [Prophet](https://facebook.github.io/prophet/) model for time-series forecasting
- **Configuration**: `load-predictor: "prophet"`
- **Tunable parameters**:
  - `--prophet-window-size`: bounds internal history to control refit cost
  - `--load-predictor-log1p`: model `log1p(y)` instead of `y`
#### Warm-starting Load Predictors (Optional)
You can warm-start load predictors with a mooncake-style JSONL trace file:
- **CLI argument**: `--load-predictor-warmup-trace <path/to/trace.jsonl>`
- **Effect**: preloads predictors with historical request-count / ISL / OSL samples extracted from the trace
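As a rough picture of what warm-starting involves, the sketch below buckets a trace into fixed intervals and derives per-interval request count, average ISL, and average OSL. It assumes each JSONL record carries `timestamp` (milliseconds), `input_length`, and `output_length` fields, as in the public Mooncake traces; the function name and output keys are hypothetical, and the planner's own loader may differ.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def bucket_trace(lines: Iterable[str], interval_s: float) -> List[Dict[str, float]]:
    """Aggregate trace records into per-interval load samples."""
    # bucket index -> [request count, total input tokens, total output tokens]
    buckets = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        index = int(record["timestamp"] // (interval_s * 1000))
        bucket = buckets[index]
        bucket[0] += 1
        bucket[1] += record["input_length"]
        bucket[2] += record["output_length"]

    samples = []
    for index in sorted(buckets):  # intervals with no requests are skipped
        count, isl_sum, osl_sum = buckets[index]
        samples.append(
            {"num_req": count, "avg_isl": isl_sum / count, "avg_osl": osl_sum / count}
        )
    return samples
```

A file can be passed directly, for example `with open("trace.jsonl") as f: samples = bucket_trace(f, interval_s=180)`, where 180 matches the default `--adjustment-interval`.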
### Planner Scaling Parameters
| Argument | Default | Description |
|----------|---------|-------------|
| `--adjustment-interval` | `180` | Seconds between scaling decisions |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--isl` | `3000` | Expected average input sequence length |
| `--osl` | `150` | Expected average output sequence length |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| `--no-correction` | `false` | Disable correction factors |
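The interaction between `--max-gpu-budget`, `--min-endpoint`, and the per-engine GPU counts can be illustrated with a small clamping function. This is one plausible reading of how such limits combine, not the planner's actual algorithm; the function name and the proportional scale-down rule are assumptions.

```python
from typing import Tuple


def clamp_to_budget(
    num_prefill: int,
    num_decode: int,
    prefill_engine_num_gpu: int = 1,
    decode_engine_num_gpu: int = 1,
    max_gpu_budget: int = 8,
    min_endpoint: int = 1,
) -> Tuple[int, int]:
    """Clamp desired replica counts to a GPU budget (illustrative rule only)."""
    # Enforce the per-worker-type floor first.
    num_prefill = max(num_prefill, min_endpoint)
    num_decode = max(num_decode, min_endpoint)

    total_gpus = (
        num_prefill * prefill_engine_num_gpu + num_decode * decode_engine_num_gpu
    )
    if total_gpus <= max_gpu_budget:
        return num_prefill, num_decode

    # Over budget: shrink prefill proportionally (integer math avoids float rounding),
    # then give decode whatever GPUs remain, never exceeding the original request.
    scaled_prefill = max(min_endpoint, (num_prefill * max_gpu_budget) // total_gpus)
    remaining = max_gpu_budget - scaled_prefill * prefill_engine_num_gpu
    scaled_decode = max(
        min_endpoint, min(num_decode, remaining // decode_engine_num_gpu)
    )
    return scaled_prefill, scaled_decode
```

Note that the `min_endpoint` floor takes priority in this sketch, so a budget that is too small for the minimum replica counts would still be exceeded.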
#### Planner Configuration Passthrough
Add planner-specific settings in the DGDR:
```yaml
profilingConfig:
  config:
    planner:
      plannerMinEndpoint: 2
```
## Integration
### Prometheus Setup
The planner queries Prometheus for frontend request metrics. The components are connected as follows:
```mermaid
flowchart LR
Frontend --"/metrics"--> Prometheus
Planner --"query API"--> Prometheus
Planner --"scaling decisions"--> Workers
Frontend -.->|"requests"| Workers
```
**Components:**
- **Frontend**: Serves requests and exposes `/metrics`
- **Prometheus**: Scrapes frontend metrics every 5s (configurable in podmonitor manifest)
- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
- **Workers**: Prefill and decode (backend) workers that handle inference
The planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with request count, ISL, OSL, TTFT, and ITL in the correct format. The Dynamo frontend provides these metrics automatically.
**Prometheus endpoint configuration:**
| Variable | Default |
|----------|---------|
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` |
If you see errors like "Failed to resolve prometheus service", ensure `PROMETHEUS_ENDPOINT` points to your Prometheus service.
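When debugging endpoint problems, it can help to issue a query by hand against the same endpoint. The sketch below uses the standard Prometheus HTTP API (`/api/v1/query`) with only the Python standard library. The helper names are hypothetical, and the example query `up` is a built-in Prometheus series rather than one of the planner's metrics, whose names are not listed here.

```python
import json
import os
import urllib.parse
import urllib.request
from typing import List, Optional

DEFAULT_ENDPOINT = (
    "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
)


def build_query_url(promql: str, endpoint: Optional[str] = None) -> str:
    """Build an instant-query URL, honouring PROMETHEUS_ENDPOINT when set."""
    base = endpoint or os.environ.get("PROMETHEUS_ENDPOINT", DEFAULT_ENDPOINT)
    params = urllib.parse.urlencode({"query": promql})
    return f"{base.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload: dict) -> List[float]:
    """Extract sample values from a Prometheus instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    # Each result carries "value": [unix_timestamp, "value-as-string"].
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


def query_prometheus(
    promql: str, endpoint: Optional[str] = None, timeout_s: float = 10.0
) -> List[float]:
    """Run an instant query and return the sample values."""
    url = build_query_url(promql, endpoint)
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return parse_instant_vector(json.load(response))
```

For example, `query_prometheus("up")` run from a pod inside the cluster should return one value per scrape target; a DNS resolution error at this step points to a wrong `PROMETHEUS_ENDPOINT`.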
### Virtual Deployment
The SLA planner supports virtual deployment mode for customized environments (e.g., custom orchestrators) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing Kubernetes resources.
The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of PATCHing DGD resources, it writes scaling decisions and waits for the external environment to acknowledge completion.
#### Scaling Decision Flow
1. **Decision Generation**: The planner calculates optimal worker counts
2. **Change Detection**: Skips scaling if target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"`
3. **Readiness Check**: Verifies previous scaling operations completed by checking `scaled_decision_id >= decision_id`
4. **Timeout Handling**: If not acknowledged within 30 minutes (1800 seconds), proceeds with new decisions
5. **Completion Tracking**: Optionally waits for scaling completion confirmation (blocking mode)
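The change-detection, readiness, and timeout steps above reduce to two small predicates. The following is a sketch under stated assumptions: the function names and the use of plain timestamps in seconds are hypothetical, and only the `scaled_decision_id >= decision_id` comparison and the 1800-second timeout come from the description above.

```python
from typing import Tuple


def needs_scaling(current: Tuple[int, int], target: Tuple[int, int]) -> bool:
    """Step 2: skip scaling when (prefill, decode) targets match current counts."""
    return current != target


def can_issue_new_decision(
    decision_id: int,
    scaled_decision_id: int,
    issued_at_s: float,
    now_s: float,
    timeout_s: float = 1800.0,
) -> bool:
    """Steps 3-4: proceed if the last decision was acknowledged or has timed out."""
    if scaled_decision_id >= decision_id:
        return True  # the external environment acknowledged the previous decision
    # Not acknowledged yet: only proceed once the timeout has elapsed.
    return (now_s - issued_at_s) >= timeout_s
```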
#### Configuration
To use virtual deployment mode:
```yaml
environment: "virtual"
backend: "vllm" # or "sglang"
```
#### Deployment Environment Requirements
The external deployment environment must use `VirtualConnectorClient`:
```python
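from dynamo._core import DistributedRuntime, VirtualConnectorClient

# `distributed_runtime` is an existing DistributedRuntime instance and
# `namespace` is the Dynamo namespace shared with the planner.
client = VirtualConnectorClient(distributed_runtime, namespace)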
```
1. **Monitor Planner**: Continuously watch for scaling decisions: `await client.wait()` (blocks until change)
2. **Parse Decisions**: Read values: `decision = await client.get()`
3. **Execute Scaling**: Apply the scaling decisions to your infrastructure
4. **Acknowledge Completion**: Mark done: `await client.complete(decision)`
A scaling decision (returned by `client.get()`) contains:
- `num_prefill_workers`: Target number of prefill workers (-1 if not set)
- `num_decode_workers`: Target number of decode workers (-1 if not set)
- `decision_id`: Incremental ID for each scaling decision
See `components/planner/test/test_virtual_connector.py` for a full example.
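The four steps can be combined into a single handler, sketched below. Only `wait()`, `get()`, `complete()`, and the three decision fields come from the description above; `handle_one_decision`, `apply_scaling`, and the `FakeClient`/`FakeDecision` stand-ins (included so the sketch runs without a Dynamo runtime) are hypothetical.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Optional


@dataclass
class FakeDecision:
    """Stand-in for the object returned by client.get() (hypothetical)."""
    num_prefill_workers: int
    num_decode_workers: int
    decision_id: int


@dataclass
class FakeClient:
    """Stand-in for VirtualConnectorClient, for dry runs only (hypothetical)."""
    decision: FakeDecision
    completed_ids: List[int] = field(default_factory=list)

    async def wait(self) -> None:
        await asyncio.sleep(0)  # a real client blocks until a decision changes

    async def get(self) -> FakeDecision:
        return self.decision

    async def complete(self, decision: FakeDecision) -> None:
        self.completed_ids.append(decision.decision_id)


async def handle_one_decision(
    client,
    apply_scaling: Callable[[Optional[int], Optional[int]], Awaitable[None]],
):
    """Wait for one scaling decision, apply it, then acknowledge it."""
    await client.wait()            # 1. monitor the planner
    decision = await client.get()  # 2. parse the decision
    # A value of -1 means the planner did not set a target for that worker type.
    prefill = None if decision.num_prefill_workers < 0 else decision.num_prefill_workers
    decode = None if decision.num_decode_workers < 0 else decision.num_decode_workers
    await apply_scaling(prefill, decode)  # 3. execute scaling in your environment
    await client.complete(decision)       # 4. acknowledge completion
    return decision
```

In a real integration, `client` would be the `VirtualConnectorClient` instance and `handle_one_decision` would run in a loop; acknowledging only after `apply_scaling` returns keeps the planner's readiness check meaningful.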
### Grafana Dashboard
Deploy the planner Grafana dashboard:
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```
Follow [Dynamo Metrics Collection on Kubernetes](/docs/kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**.
The dashboard displays:
- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours
- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus
- **Predicted Metrics**: Planner's load predictions and recommended replica counts
- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance
> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your deployment namespace.
## DGDR Immutability
DGDRs are **immutable**. To update SLAs or configuration:
1. Delete the existing DGDR: `kubectl delete dgdr sla-aic -n $NAMESPACE`
2. Create a new DGDR with updated specifications
## Manual Deployment Control
### Option 1: Use DGDR-Generated Configuration (Recommended)
Disable auto-deployment to review the generated DGD before applying:
```yaml
spec:
  autoApply: false
```
Then manually extract and apply:
```bash
# Extract generated DGD from DGDR status
kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -f -
# Or save to file first for review/modification
kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
vi my-dgd.yaml
kubectl apply -f my-dgd.yaml -n $NAMESPACE
```
### Option 2: Use Standalone Planner Templates (Advanced)
For advanced use cases, use the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:
```bash
# After profiling completes, profiling data is stored in ConfigMaps
kubectl get configmap dgdr-output-<dgdr-name> -n $NAMESPACE -o yaml
kubectl get configmap planner-profile-data -n $NAMESPACE -o yaml
# Update PROMETHEUS_ENDPOINT in the template, then deploy
kubectl apply -f examples/backends/<backend>/deploy/disagg_planner.yaml -n $NAMESPACE
```
## Accessing Profiling Artifacts
By default, profiling jobs save essential data to ConfigMaps. For detailed artifacts, configure the DGDR to use `dynamo-pvc`:
**ConfigMaps (always created):**
- Generated DGD configuration
- Profiling data for Planner (`.json` files)
**PVC (optional):**
- Performance plots (PNGs)
- DGD configuration and logs for each profiled deployment
- AIPerf profiling artifacts
- Raw profiling data (`.npz` files)
- Profiler log
```bash
# Setup PVC
deploy/utils/setup_benchmarking_resources.sh
# Access results after profiling
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```
## Troubleshooting
### Quick Diagnostics
```bash
# Check DGDR status and events
kubectl describe dgdr sla-aic -n $NAMESPACE
# Check operator logs
kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=dynamo-operator --tail=100
# Check profiling job logs
kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
```
### Common Issues
| Issue | Quick Fix |
|-------|-----------|
| **DGDR stuck in Pending** | Check GPU availability: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'` |
| **Image pull errors** | Verify secret exists: `kubectl get secret nvcr-imagepullsecret -n $NAMESPACE` |
| **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
| **Prometheus errors** | Ensure `PROMETHEUS_ENDPOINT` env var points to your Prometheus service |
For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/benchmarks/sla_driven_profiling.md#troubleshooting).
## Related Documentation
- [Planner README](README.md) -- Overview and quick start
- [Planner Examples](planner_examples.md) -- DGDR YAML examples and sample configurations
- [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive for contributors
- [DGDR API Reference](/docs/kubernetes/api_reference.md)
- [Pre-Deployment Profiling](/docs/benchmarks/sla_driven_profiling.md)
- [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)
...@@ -77,6 +77,9 @@ Key features include:
:hidden:
Overview <self>
Planner README <README>
Planner Guide <planner_guide>
Planner Examples <planner_examples>
SLA Planner Quick Start <sla_planner_quickstart>
SLA-Driven Profiling <../benchmarks/sla_driven_profiling.md>
SLA-based Planner <sla_planner.md>