sla_planner_quickstart.md 9.65 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# SLA Planner Quick Start Guide

Complete workflow to deploy SLA-based autoscaling for Dynamo deployments. This guide consolidates all necessary steps into a clear, sequential process.

> [!IMPORTANT]
> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](/docs/kubernetes/installation_guide.md).

## Overview

The SLA Planner automatically scales prefill and decode workers to meet your TTFT (Time To First Token) and ITL (Inter-Token Latency) targets.

The deployment process consists of two mandatory phases:

1. **Pre-Deployment Profiling** (2-4 hours) - Generates performance data
2. **SLA Planner Deployment** (5-10 minutes) - Enables autoscaling

> [!TIP]
> **Fast Profiling with AI Configurator**: For TensorRT-LLM users, we provide AI Configurator (AIC) that can complete profiling in 20-30 seconds using performance simulation instead of real deployments. Support for vLLM and SGLang coming soon. See [AI Configurator section](/docs/benchmarks/pre_deployment_profiling.md#running-the-profiling-script-with-aiconfigurator) in the Profiling Guide.

```mermaid
flowchart TD
    A[Start Setup] --> B{Profiling Done?}
    B -->|No| C[Run Profiling<br/>2-4 hours]
    C --> D[Verify Results]
    D --> E[Deploy Planner<br/>5-10 minutes]
    B -->|Yes| E
    E --> F[Test System]
    F --> G[Ready!]

    style A fill:#e1f5fe
    style C fill:#fff3e0
    style E fill:#e8f5e8
    style G fill:#f3e5f5
    style B fill:#fff8e1
```

37
38
39
40
41
## Prerequisites

Before deploying the SLA planner, ensure:
- **Dynamo platform installed** (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- **[kube-prometheus-stack](/docs/kubernetes/metrics.md) installed and running.** By default, the prometheus server is not deployed in the `monitoring` namespace. If it is deployed to a different namespace, set `dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090"`.
42
43
- **Benchmarking resources setup** (see [Kubernetes utilities for Dynamo Benchmarking and Profiling](../../deploy/utils/README.md)) The script will create a `dynamo-pvc` with `ReadWriteMany` access, if your cluster's default storageClassName does not allow `ReadWriteMany`, you need to specify a different storageClassName in `deploy/utils/manifests/pvc.yaml` which does support `ReadWriteMany`.

44
45
46
47

## Pre-Deployment Profiling

Deploying planner starts with running pre-deployment profiling.
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135

> [!WARNING]
> **MANDATORY**: Pre-deployment profiling must be completed before deploying SLA planner. This process analyzes your model's performance characteristics to determine optimal tensor parallelism configurations and scaling parameters.

### Step 1.1: Set Up Profiling Environment

Set up your Kubernetes namespace for profiling (one-time per namespace). If your namespace is already set up, skip this step.

```bash
export NAMESPACE=your-namespace
```

**Prerequisites**: Ensure all dependencies are installed:
```bash
pip install -r deploy/utils/requirements.txt
```

### Step 1.2: Inject Your Configuration

Use the injector utility to place your DGD manifest into the PVC:

```bash
# Use default disagg.yaml config
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml

# Or use a custom disagg config file
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml
```

> **Note**: All paths must start with `/data/` for security reasons.

### Step 1.3: Configure SLA Targets

For dense models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml`:

```yaml
spec:
  template:
    spec:
      containers:
        - name: profile-sla
          args:
            - --isl
            - "3000" # average ISL is 3000 tokens
            - --osl
            - "150" # average OSL is 150 tokens
            - --ttft
            - "200" # target TTFT is 200ms
            - --itl
            - "20" # target ITL is 20ms
            - --backend
            - <vllm/sglang>
```

For MoE models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml` instead.

### Step 1.4: Run Profiling

Set the container image and config path:

```bash
export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
export DGD_CONFIG_FILE=/data/configs/disagg.yaml
```

Run profiling:

```bash
# for dense models
envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -

# for MoE models
envsubst < benchmarks/profiler/deploy/profile_sla_moe_job.yaml | kubectl apply -f -

# using aiconfigurator instead of real sweeping (see below for more details)
envsubst < benchmarks/profiler/deploy/profile_sla_aic_job.yaml | kubectl apply -f -
```

### Step 1.5: Monitor Profiling Progress

```bash
kubectl get jobs -n $NAMESPACE
kubectl logs job/profile-sla -n $NAMESPACE
```

> [!NOTE]
> **Time Investment**: This profiling process is comprehensive and typically takes **2-4 hours** to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings.

136
### Step 1.6: Download Profiling Results
137
138
139
140
141
142
143
144
145
146
147
148
149
150

If you want to view the profiling results and performance plots:

```bash
# Download to directory
python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results
```

For detailed information about the output structure, performance plots, and how to analyze the results, see the [Viewing Profiling Results](/docs/benchmarks/pre_deployment_profiling.md#viewing-profiling-results) section in the Profiling Guide.

**Verify Success**: Look for terminal output like:
```
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
151
152
153
...
Final DGD config with planner: {...}
Deploying the optimized DGD with planner...
154
155
```

156
157
158
159
160
161
162
### Step 1.7: Deploy the DGD with Planner

```bash
kubectl apply -f ./results/config_with_planner.yaml
```

### Step 1.8: Wait for Deployment to be Ready
163
164
165
166
167
168
169
170
171
172
173
174
175

```bash
kubectl get pods -n $NAMESPACE
```

**Expected pods** (all should be `1/1 Running`):
```
vllm-disagg-planner-frontend-*            1/1 Running
vllm-disagg-planner-planner-*             1/1 Running
vllm-disagg-planner-backend-*             1/1 Running
vllm-disagg-planner-prefill-*             1/1 Running
```

176
### Step 1.9: Test the System
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197

```bash
# Port forward to frontend
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000

# Send a request
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "Hello, how are you?"
    }
    ],
    "stream":true,
    "max_tokens": 30
  }'
```

198
### Step 1.10: Monitor Scaling
199
200
201
202
203
204
205
206
207
208

```bash
# Check planner logs for scaling decisions
kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner --tail=10
```

**Expected successful output** (after streaming requests):
```
New adjustment interval started!
Observed num_req: X.XXX isl: X.XXX osl: X.XXX
209
Observed ttft: X.XXms itl: X.XXms
210
211
212
Number of prefill workers: 1, number of decode workers: 1
```

213
## Production Readiness
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251

### Monitoring Metrics

- **Basic metrics** (request count): Available with any request type
- **Latency metrics** (TTFT/ITL): Available for both streaming and non-streaming requests
- **Scaling decisions**: Require sufficient request volume

### Troubleshooting

**Connection Issues:**
```bash
# Verify Prometheus is accessible
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
curl "http://localhost:9090/api/v1/query?query=up"
```

**Missing Metrics:**
```bash
# Check frontend metrics
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
curl http://localhost:8000/metrics | grep nv_llm_http_service
```

**Worker Issues:**
- Large models can take 10+ minutes to initialize
- Check worker logs: `kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-backend`
- Ensure GPU resources are available for workers

**Unknown Field subComponentType:**

If you encounter the following error when applying the deployment:
```bash
Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"
```
This is because the `subComponentType` field has only been added in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). You can upgrade the CRD version by following the instructions [here](/docs/kubernetes/installation_guide.md).

## Next Steps

252
- **Architecture Details**: See [SLA-based Planner Architecture](/docs/planner/sla_planner.md) for technical details
253
254
255
256
257
258
259
260
261
262
263
264
265
266
- **Performance Tuning**: See [Pre-Deployment Profiling Guide](/docs/benchmarks/pre_deployment_profiling.md) for advanced profiling options
- **Load Testing**: See [SLA Planner Load Test](/tests/planner/README.md) for comprehensive testing tools

## Quick Reference

| Phase | Duration | Purpose | Status Check |
|-------|----------|---------|--------------|
| Profiling | 2-4 hours | Generate performance data | `kubectl logs job/profile-sla` |
| Deployment | 5-10 minutes | Enable autoscaling | `kubectl get pods` |
| Testing | 5 minutes | Verify functionality | `kubectl logs deployment/planner` |

---

> [!TIP]
267
> **Need Help?** If you encounter issues, check the [troubleshooting section](#troubleshooting) or refer to the detailed guides linked in [Next Steps](#next-steps).