---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Planner
---

## Why LLM Inference Needs a Different Autoscaler

Scaling a traditional web service is straightforward: watch CPU or request rate, add replicas when load is high, and remove them when it is low. Tools like the Kubernetes Horizontal Pod Autoscaler (HPA) and KEDA work well for this because the relationship between load and latency is roughly linear. Twice the requests means roughly twice the CPU, so a simple threshold policy keeps response times stable.

LLM inference breaks these assumptions:

- **Latency depends on request content, not just request count.** A single request with a 32K-token prompt consumes orders of magnitude more compute than a short one. Two requests per second can mean completely different GPU loads depending on input/output sequence lengths.
- **Prefill and decode have different scaling characteristics.** In disaggregated serving, prefill is compute-bound (scales with input length) while decode is memory-bound (scales with concurrent sequences and KV cache usage). A single replica count doesn't capture both.
- **The metrics that matter aren't standard.** The SLAs users care about — Time to First Token (TTFT) and Inter-Token Latency (ITL) — don't map cleanly to CPU utilization or request throughput. HPA can't target "keep P95 TTFT under 500ms" because that requires understanding the relationship between sequence lengths, GPU memory pressure, and latency.
- **Scaling decisions are expensive.** Spinning up a GPU worker takes minutes, not seconds. Overscaling wastes GPU-hours at cloud prices; underscaling violates SLAs. The autoscaler needs to predict demand, not just react to it.

The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It understands engine profiling data, tracks per-worker GPU utilization, predicts traffic patterns, and makes scaling decisions that directly target TTFT and ITL SLAs — not proxy metrics.

> **New to the Planner?** Start with the [Planner Guide](planner-guide.md) for a complete workflow including profiling and deployment.

> **Need multi-DGD coordination?** See the [Global Planner Guide](global-planner.md) for shared-policy coordination across multiple DGDs and single-endpoint multi-pool deployments.

## Scaling Modes

The Planner supports two scaling modes that can run independently or together:

- **Throughput-based scaling**: Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No pre-deployment data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.

When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
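
As a rough illustration of the floor behavior described above, the following Python sketch combines the two recommendations. The function name, its parameters, and the clamping to a minimum replica count are assumptions made for illustration, not the Planner's actual implementation.

```python
def combine_replicas(throughput_floor: int, load_based_target: int,
                     min_endpoint: int = 1) -> int:
    """Pick a replica count when both scaling modes are enabled.

    Hypothetical helper: the throughput-based result acts as a floor,
    and the load-based result may only raise the count above it.
    """
    # Never go below the configured minimum replica count (assumed behavior).
    floor = max(throughput_floor, min_endpoint)
    # Load-based scaling adjusts above the floor, never below it.
    return max(floor, load_based_target)


if __name__ == "__main__":
    # A burst pushes the load-based target above the long-term floor.
    print(combine_replicas(throughput_floor=3, load_based_target=5))
    # When load drops, the throughput-based floor still holds.
    print(combine_replicas(throughput_floor=3, load_based_target=2))
```

Taking the maximum keeps the long-interval capacity plan authoritative while still letting the short-interval signal react to bursts.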

## Feature Matrix

| Feature | Throughput-Based | Load-Based |
|---------|:----------------:|:----------:|
| **Deployment** | | |
| Disaggregated | Supported | Supported |
| Aggregated | Supported | Supported |
| **LLM Framework** | | |
| SGLang | Supported | Supported |
| TensorRT-LLM | Supported | Supported |
| vLLM | Supported | Supported |
| **Requires Pre-deployment Data** | Yes (self-benchmark or profiler) | No |
| **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A |
| **Router** | | |
| Any (round-robin, random, etc.) | Supported | Not supported |
| KV Router | Supported | Supported |
| **Connectors** | | |
| KubernetesConnector | Supported | Supported |
| VirtualConnector | Supported | Supported |

## When to Use Which Mode

- **Throughput-based scaling** should be enabled whenever engine performance data is available (through self-benchmark or pre-deployment profiling). It provides stable, prediction-based capacity planning.
- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring pre-deployment data.
- **Both modes together**: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling.
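
The guidance above can be sketched as a small helper that picks planner arguments. It uses only flags that appear elsewhere in this document. The function name and its two boolean inputs are hypothetical, and the sketch assumes throughput-based scaling stays on by default, as listed in the Configuration Reference.

```python
def planner_args(has_perf_data: bool, bursty_traffic: bool) -> list[str]:
    """Return planner CLI arguments for a deployment (illustrative only)."""
    args: list[str] = []
    # Load-based scaling helps with bursty traffic, and it is the only
    # option when no pre-deployment performance data exists.
    if bursty_traffic or not has_perf_data:
        args.append("--enable-loadbased-scaling")
    # Throughput-based scaling needs engine performance data; turn it off
    # when none is available.
    if not has_perf_data:
        args.append("--disable-throughput-scaling")
    return args


if __name__ == "__main__":
    print(planner_args(has_perf_data=True, bursty_traffic=False))
    print(planner_args(has_perf_data=True, bursty_traffic=True))
    print(planner_args(has_perf_data=False, bursty_traffic=True))
```

An empty list means the defaults apply, which in this sketch corresponds to throughput-based scaling only.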

## Quick Start

### Prerequisites

- Dynamo platform installed on Kubernetes ([Installation Guide](../../kubernetes/installation-guide.md))
- kube-prometheus-stack installed ([Metrics Setup](../../kubernetes/observability/metrics.md))

Throughput-based scaling also requires pre-deployment engine performance data, obtained through self-benchmark mode or by following the [Profiling Guide](../profiler/profiler-guide.md).

### Throughput-Based Scaling (with DGDR)

The fastest path to a throughput-based planner deployment is a DynamoGraphDeploymentRequest (DGDR), which automatically profiles your model:

```bash
kubectl apply -f components/src/dynamo/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```

See [Planner Guide](planner-guide.md) for the full workflow.

### Load-Based Scaling (without profiling)

To deploy with load-based scaling only (no profiling required), add these arguments to the planner service in your DGD:

```yaml
args:
  - --enable-loadbased-scaling
  - --disable-throughput-scaling
  - --loadbased-adjustment-interval=5
```

The planner will auto-discover the frontend metrics endpoint from the DGD. See [disagg_planner.yaml](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/deploy/disagg_planner.yaml) for a complete example.

### Manual DGD Deployment

For manual control with throughput-based scaling, use the disaggregated planner templates:

```bash
# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
```

## Current Limitations

### Load-based scaling

Load-based scaling has the following known limitations. Throughput-based scaling is not affected by any of these.

**Requires ForwardPassMetrics (FPM).** Load-based scaling uses per-engine, per-iteration metrics delivered via the Dynamo event plane (ForwardPassMetrics). FPM is currently available only for vLLM and is automatically enabled when the engine uses `InstrumentedScheduler` and `DYN_FORWARDPASS_METRIC_PORT` is set. The KV Router is **not** required for load-based scaling.

### General

**In-flight requests during scale-down.** When the Planner scales down a worker, the worker is terminated without waiting for in-flight requests to complete. Requests that were mid-prefill on the terminated worker will fail. In disaggregated deployments, this can also affect decode workers that were waiting on KV cache transfers from the terminated prefill worker. **Workaround:** Set `--min-endpoint` to a value that avoids scaling below your steady-state traffic floor, and use a lower `--loadbased-scaling-down-sensitivity` value to reduce the frequency of scale-down events.

## Documentation

| Document | Description |
|----------|-------------|
| [Planner Guide](planner-guide.md) | Deployment, configuration, integration |
| [Planner Design](../../design-docs/planner-design.md) | Architecture and algorithm internals |
| [Planner Examples](planner-examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
| [Global Planner Guide](global-planner.md) | Multi-DGD coordination, shared GPU budgets, single-endpoint multi-pool deployments |

## Configuration Reference

### Key Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| **Common** | | |
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`sglang`, `trtllm`, `vllm`) |
| `--mode` | `disagg` | Planner mode (`disagg`, `prefill`, `decode`, `agg`) |
| `--environment` | `kubernetes` | Deployment environment |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| **Throughput-based scaling** | | |
| `--enable-throughput-scaling` | `true` | Enable throughput-based scaling |
| `--adjustment-interval` | `180` | Seconds between throughput-based scaling decisions |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| **Load-based scaling** | | |
| `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
| `--loadbased-adjustment-interval` | `5` | Seconds between FPM regression updates and load-based scaling decisions |
| `--max-num-fpm-samples` | `64` | Maximum retained FPM observations for regression |
| `--fpm-sample-bucket-size` | `16` | Number of buckets for observation retirement (must be a perfect square) |
| `--loadbased-scaling-down-sensitivity` | `80` | Scale-down sensitivity, 0-100 (0 = never, 100 = aggressive) |
| `--loadbased-metric-samples` | `10` | Number of metric samples per adjustment interval |
| `--loadbased-min-observations` | `5` | Minimum observations before regression activates |
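
To show how the GPU-related arguments might interact, here is a minimal Python sketch of a budget check. It assumes the budget is the sum of replicas multiplied by GPUs per engine for each worker type. That formula and both function names are illustrative assumptions, not the Planner's documented logic.

```python
def gpus_required(prefill_replicas: int, decode_replicas: int,
                  prefill_engine_num_gpu: int = 1,
                  decode_engine_num_gpu: int = 1) -> int:
    """Total GPUs used by a candidate replica layout (assumed formula)."""
    return (prefill_replicas * prefill_engine_num_gpu
            + decode_replicas * decode_engine_num_gpu)


def fits_budget(prefill_replicas: int, decode_replicas: int,
                prefill_engine_num_gpu: int = 1,
                decode_engine_num_gpu: int = 1,
                max_gpu_budget: int = 8) -> bool:
    """Check a layout against --max-gpu-budget (default 8 in the table)."""
    used = gpus_required(prefill_replicas, decode_replicas,
                         prefill_engine_num_gpu, decode_engine_num_gpu)
    return used <= max_gpu_budget


if __name__ == "__main__":
    # 2 prefill engines with 2 GPUs each plus 3 single-GPU decode engines.
    print(gpus_required(2, 3, prefill_engine_num_gpu=2))
    print(fits_budget(2, 3, prefill_engine_num_gpu=2))
```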

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |

## Monitoring

### Grafana Dashboard

Deploy the planner dashboard:

```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```

The dashboard shows:
- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- FPM regression model status

### Prometheus Metrics

**Throughput-based scaling** pulls traffic metrics from the cluster-wide Prometheus server:
- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
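
For readers who want to inspect these metrics by hand, the sketch below builds a P95 query URL for the Prometheus instant-query endpoint (`/api/v1/query`). The metric name `example_ttft_seconds` is a placeholder, not a name taken from Dynamo, and the helper function is hypothetical. Substitute the histogram your frontend actually exports.

```python
from urllib.parse import urlencode

# Default value of PROMETHEUS_ENDPOINT from the Environment Variables table.
PROMETHEUS_ENDPOINT = (
    "http://prometheus-kube-prometheus-prometheus"
    ".monitoring.svc.cluster.local:9090"
)


def p95_query_url(histogram_metric: str, window: str = "5m",
                  endpoint: str = PROMETHEUS_ENDPOINT) -> str:
    """Build an instant-query URL for the P95 of a Prometheus histogram."""
    # histogram_quantile works on the per-bucket rate, grouped by `le`.
    promql = (
        f"histogram_quantile(0.95, "
        f"sum(rate({histogram_metric}_bucket[{window}])) by (le))"
    )
    return f"{endpoint}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # "example_ttft_seconds" is a hypothetical metric name.
    print(p95_query_url("example_ttft_seconds"))
```

The function only constructs the URL, so it can be checked without cluster access. Fetch the result with any HTTP client from inside the cluster.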

**Load-based scaling** uses ForwardPassMetrics (FPM) from the Dynamo event plane:
- Per-iteration wall time, scheduled prefill/decode tokens, and queued request status
- Delivered via `FpmEventSubscriber` with automatic engine discovery and lifecycle tracking
- No router `/metrics` scraping required
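
To make the regression idea concrete, the following Python sketch fits a least-squares line of per-iteration wall time against scheduled tokens over a bounded sample window. It is a simplified stand-in, not the Planner's implementation: the class name is hypothetical, and it drops the oldest sample when full instead of using the bucket-based retirement described in the Configuration Reference. The defaults mirror `--max-num-fpm-samples` and `--loadbased-min-observations`.

```python
from collections import deque
from typing import Optional, Tuple


class WindowedLinearFit:
    """Least-squares fit of wall time vs. scheduled tokens (illustrative)."""

    def __init__(self, max_samples: int = 64, min_observations: int = 5):
        # deque(maxlen=...) silently discards the oldest sample when full.
        self.samples = deque(maxlen=max_samples)
        self.min_observations = min_observations

    def observe(self, scheduled_tokens: float, wall_time_ms: float) -> None:
        self.samples.append((scheduled_tokens, wall_time_ms))

    def fit(self) -> Optional[Tuple[float, float]]:
        """Return (slope, intercept), or None if the fit is not yet usable."""
        n = len(self.samples)
        if n < self.min_observations:
            return None
        mean_x = sum(x for x, _ in self.samples) / n
        mean_y = sum(y for _, y in self.samples) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.samples)
        if var_x == 0:
            return None  # all samples share one x value; slope is undefined
        cov = sum((x - mean_x) * (y - mean_y) for x, y in self.samples)
        slope = cov / var_x
        return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    model = WindowedLinearFit()
    for tokens in (100, 200, 300, 400, 500, 600):
        model.observe(tokens, 0.01 * tokens + 2.0)  # synthetic data
    print(model.fit())
```

With a fitted slope and intercept, a caller could estimate iteration time for a given token load and compare it against a latency target.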