sla_planner_quickstart.md 16.9 KB
Newer Older
1
# SLA-Driven Profiling and Planner Deployment Quick Start Guide
2

3
Complete workflow to deploy SLA-optimized Dynamo models using DynamoGraphDeploymentRequests (DGDR). This guide shows how to automatically profile models and deploy them with optimal configurations that meet your Service Level Agreements (SLAs).
4
5
6
7
8
9

> [!IMPORTANT]
> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](/docs/kubernetes/installation_guide.md).

## Overview

10
The DGDR workflow automates the entire process from SLA specification to deployment:
11

12
13
14
1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information in a DGDR Custom Resource
2. **Automatic Profiling**: The Dynamo Operator automatically profiles your model to find optimal configurations
3. **Auto-Deploy**: The system automatically deploys the optimal configuration that meets your SLAs
15
16
17

```mermaid
flowchart TD
18
19
20
21
22
23
24
25
    A[Create DGDR] --> B[DGDR Controller]
    B --> C{Profiling Method}
    C -->|Online| D[Run Profiling Job<br/>2-4 hours]
    C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
    D --> F[Generate DGD Config]
    E --> F
    F --> G[Auto-Deploy DGD]
    G --> H[Monitor & Scale]
26
27

    style A fill:#e1f5fe
28
    style D fill:#fff3e0
29
30
    style E fill:#e8f5e8
    style G fill:#f3e5f5
31
    style H fill:#fff8e1
32
33
```

34
## What is a DynamoGraphDeploymentRequest (DGDR)?
35

36
A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. Think of it as a "deployment order" where you specify:
37

38
39
40
41
42
- **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
43

44
45
46
47
48
The Dynamo Operator watches for DGDRs and automatically:
1. Discovers available GPU resources in your cluster
2. Runs profiling (online or offline) to find optimal configurations
3. Generates an optimized DynamoGraphDeployment (DGD) configuration
4. Deploys the DGD to your cluster
49

50
51
52
53
54
**Key Benefits:**
- **Declarative**: Specify what you want, not how to achieve it
- **Automated**: No manual profiling job setup or result processing
- **SLA-Driven**: Ensures deployments meet your performance requirements
- **Integrated**: Works seamlessly with the Dynamo Operator
55

56
## Prerequisites
57

58
59
60
61
62
63
64
Before creating a DGDR, ensure:
- **Dynamo platform installed** with the operator running (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running** (required for SLA planner)
- **Profiling PVC created** (see [Benchmarking Resource Setup](/deploy/utils/README.md#benchmarking-resource-setup#BenchmarkingResourceSetup))
- **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images)
- **Sufficient GPU resources** available in your cluster for profiling
- **Runtime images available** that contain both profiler and runtime components
65

66
### Container Images
67

68
Each DGDR requires you to specify container images for the profiling and deployment process:
69

70
71
**profilingConfig.profilerImage** (Required):
Specifies the container image used for the profiling job itself. This image must contain the profiler code and dependencies needed for SLA-based profiling.
72

73
74
75
76
**deploymentOverrides.workersImage** (Optional):
Specifies the container image used for DynamoGraphDeployment worker components (frontend, workers, planner). This image is used for:
- Temporary DGDs created during online profiling (for performance measurements)
- The final DGD deployed after profiling completes
77

78
If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. You may use our public images (0.6.1 and later) or build and push your own.
79

80
81
82
83
84
85
```yaml
spec:
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Optional
86
87
```

88
89
90
91
92
## Quick Start: Deploy with DGDR

### Step 1: Create Your DGDR

Dynamo provides sample DGDR configurations in `benchmarks/profiler/deploy/`. You can use these as starting points:
93

94
95
96
97
**Available Sample DGDRs:**
- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator (TensorRT-LLM)
- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
98

99
Or, you can create your own DGDR for your own needs:
100
101

```yaml
102
103
104
105
106
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-deployment  # Change the name
  namespace: default         # Change the namespace
107
spec:
108
109
  model: "Qwen/Qwen3-0.6B"     # Update to your model
  backend: vllm                # Backend: vllm, sglang, or trtllm
110

111
112
113
114
115
116
117
118
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Required
    config:
      sla:
        isl: 3000    # Adjust to your workload
        osl: 150     # Adjust to your workload
        ttft: 200    # Your target (ms)
        itl: 20      # Your target (ms)
119

120
121
      sweep:
        use_ai_configurator: false  # Set to true for fast profiling (TensorRT-LLM only)
122

123
124
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Optional
125

126
  autoApply: true  # Auto-deploy after profiling
127
128
```

129
130
> [!TIP]
> For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
131

132
### Step 2: Apply the DGDR
133

134
The rest of this quickstart will use the DGDR sample that uses AIC profiling. If you use a different DGDR file and/or name, be sure to adjust the commands accordingly.
135

136
137
138
```bash
export NAMESPACE=your-namespace
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
139
140
```

141
142
143
144
145
The Dynamo Operator will immediately begin processing your request.

### Step 3: Monitor Progress

Watch the DGDR status:
146
147

```bash
148
149
150
151
152
153
154
155
# View status
kubectl get dgdr -n $NAMESPACE

# Detailed status
kubectl describe dgdr sla-aic -n $NAMESPACE

# Watch profiling job logs
kubectl logs -f job/profile-sla-aic -n $NAMESPACE
156
157
```

158
159
160
161
162
163
164
**DGDR Status States:**
- `Pending`: Initial state, preparing to profile
- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- `Deploying`: Generating and applying DGD configuration
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)

165
> [!NOTE]
166
> With AI Configurator, profiling completes in **20-30 seconds**! This is much faster than online profiling which takes 2-4 hours.
167

168
### Step 4: Access Your Deployment
169

170
Once the DGDR reaches `Ready` state, your model is deployed and ready to serve:
171
172

```bash
173
174
# Find the frontend service
kubectl get svc -n $NAMESPACE | grep trtllm-disagg
175

176
177
# Port-forward to access locally
kubectl port-forward svc/trtllm-disagg-frontend 8000:8000 -n $NAMESPACE
178

179
180
# Test the endpoint
curl http://localhost:8000/v1/models
181
182
```

183
## DGDR Configuration Details
184

185
### Required Fields
186

187
188
189
190
191
192
| Field | Type | Description |
|-------|------|-------------|
| `spec.model` | string | Model identifier (e.g., "meta-llama/Llama-3-70b") |
| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
| `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |
193

194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
### Optional Fields

| Field | Type | Description |
|-------|------|-------------|
| `spec.deploymentOverrides.workersImage` | string | Container image for DGD worker components. If omitted, uses image from base config file. |
| `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) |
| `spec.deploymentOverrides` | object | Customize metadata (name, namespace, labels, annotations) and image for auto-created DGD |

### SLA Configuration

The `sla` section defines performance requirements and workload characteristics:

```yaml
sla:
  isl: 3000      # Average input sequence length (tokens)
  osl: 150       # Average output sequence length (tokens)
  ttft: 200      # Target Time To First Token (milliseconds, float)
  itl: 20        # Target Inter-Token Latency (milliseconds, float)
212
213
```

214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
**Choosing SLA Values:**
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed)
- **ITL**: Token generation latency target (lower = more GPUs needed)
- **Trade-offs**: Tighter SLAs require more GPU resources

### Profiling Methods

Choose between **online profiling** (real measurements, 2-4 hours) or **offline profiling** with AI Configurator (estimated, 20-30 seconds):

```yaml
# Online Profiling (Default)
sweep:
  use_ai_configurator: false

# Offline Profiling (AI Configurator - TensorRT-LLM only)
sweep:
  use_ai_configurator: true
232
233
234
  aic_system: h200_sxm
  aic_model_name: QWEN3_32B
  aic_backend_version: "0.20.0"
235
```
236
237
238
239

> [!NOTE]
> For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-methods).

240
### Hardware Configuration
241

242
For details on hardware configuration and GPU discovery options, see [Hardware Configuration in SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md#hardware-configuration).
243

244
245
246
247
### Advanced Configuration

#### Using Existing DGD Configs (Recommended for Custom Setups)

248
If you have an existing DynamoGraphDeployment config (e.g., from `examples/backends/*/deploy/disagg.yaml` or custom recipes), you can reference it via ConfigMap:
249
250

**Step 1: Create ConfigMap from your DGD config file:**
251
252

```bash
253
254
255
256
kubectl create configmap deepseek-r1-config \
  --from-file=disagg.yaml=/path/to/your/disagg.yaml \
  --namespace $NAMESPACE \
  --dry-run=client -o yaml | kubectl apply -f -
257
258
```

259
**Step 2: Reference the ConfigMap in your DGDR:**
260

261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml  # Must match the key used in --from-file
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        use_ai_configurator: true
      aic:
        system: h200_sxm
        model_name: DEEPSEEK_V3
        backend_version: "0.20.0"

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"

  autoApply: true
292
293
```

294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
> **What's happening**: The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` into `deployment.model` and `spec.backend` into `engine.backend` in the final configuration.

#### Inline Configuration (Simple Use Cases)

For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler will auto-generate a basic DGD configuration from your `model` and `backend`:

```yaml
profilingConfig:
  config:
    # SLA targets (required for profiling)
    sla:
      isl: 8000   # Input sequence length
      osl: 200    # Output sequence length
      ttft: 200.0 # Time To First Token (ms)
      itl: 10.0   # Inter-Token Latency (ms)

    # Hardware constraints (optional)
    hardware:
      min_num_gpus_per_engine: 2
      max_num_gpus_per_engine: 8
      gpu_type: h200_sxm

    # Profiling sweep settings (optional)
    sweep:
318
319
      prefill_interpolation_granularity: 16  # Number of samples for prefill ISL sweep
      decode_interpolation_granularity: 6    # Number of samples for decode sweep
320
```
321
322
323
324
325
326
327
328
329
330
331

> **Note**: `engine.config` is a **file path** to a DGD YAML file, not inline configuration. Use ConfigMapRef (recommended) or leave it unset to auto-generate.

#### Planner Configuration Passthrough
Add planner-specific settings. Planner arguments use a `planner_` prefix:

```yaml
profilingConfig:
  config:
    planner:
      planner_min_endpoint: 2
332
333
```

334
## Understanding Profiling Results
335

336
For details about the profiling process, performance plots, and interpolation data, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-process-details).
337

338
## Advanced Topics
339

340
### DGDR Immutability
341

342
343
344
345
346
347
348
DGDRs are **immutable** - if you need to update SLAs or configuration:

1. Delete the existing DGDR: `kubectl delete dgdr sla-aic`
2. Create a new DGDR with updated specifications

### Manual Deployment Control

349
350
351
352
353
There are two ways to manually control deployment after profiling:

#### Option 1: Use DGDR-Generated Configuration (Recommended)

Disable auto-deployment to review the generated DGD before applying:
354
355
356
357

```yaml
spec:
  autoApply: false
358
359
```

360
Then manually extract and apply the generated DGD:
361

362
```bash
363
364
365
366
367
368
369
370
# Extract generated config
kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedConfig}' > my-dgd.yaml

# Review and modify if needed
vi my-dgd.yaml

# Deploy manually
kubectl apply -f my-dgd.yaml -n $NAMESPACE
371
372
```

373
374
375
376
377
378
379
380
381
The generated DGD includes optimized configurations and the SLA planner component.

#### Option 2: Use Standalone Planner Templates (Advanced)

For advanced use cases, you can manually deploy using the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:

```bash
# After profiling completes, profiling data is stored on the PVC at /data

382
383
384
385
386
387
388
389
390
391
# OPTIONAL: Download profiling results for local inspection
# Create access pod (skip this step if access pod is already running)
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s

# Download the data
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling_data

# Cleanup
kubectl delete pod pvc-access-pod -n $NAMESPACE
392
393
394
395
396
397
398

# Update backend planner manifest as needed, then deploy
kubectl apply -f examples/backends/<backend>/deploy/disagg_planner.yaml -n $NAMESPACE
```

> **Note**: The standalone templates are provided as examples and may need customization for your model and requirements. The DGDR-generated configuration (Option 1) is recommended as it's automatically tuned to your profiling results and SLA targets.

399
### Relationship to DynamoGraphDeployment (DGD)
400

401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
- **DGDR**: High-level "intent" - what you want deployed
- **DGD**: Low-level "implementation" - how it's deployed

The DGDR controller generates a DGD that:
- Uses optimal TP configurations from profiling
- Includes SLA planner for autoscaling
- Has deployment and engine settings tuned for your SLAs

The generated DGD is tracked via labels:
```yaml
metadata:
  labels:
    dgdr.nvidia.com/name: sla-aic
    dgdr.nvidia.com/namespace: your-namespace
```

## Troubleshooting

### Quick Diagnostics
420
421

```bash
422
423
424
425
426
427
428
429
# Check DGDR status and events
kubectl describe dgdr sla-aic -n $NAMESPACE

# Check operator logs
kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=dynamo-operator --tail=100

# Check profiling job logs
kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
430
431
```

432
### Common Issues
433

434
435
436
437
438
439
440
| Issue | Quick Fix |
|-------|-----------|
| **DGDR stuck in Pending** | Check GPU availability: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'` |
| **Image pull errors** | Verify secret exists: `kubectl get secret nvcr-imagepullsecret -n $NAMESPACE` |
| **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
441

442
443
> [!NOTE]
> For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/benchmarks/sla_driven_profiling.md#troubleshooting).
444

445
## Configuration Reference
446

447
For comprehensive documentation of all DGDR configuration options, see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
448

449
450
451
452
453
454
455
456
457
458
459
460
461
462
This includes detailed explanations of:
- **SLA Configuration**: ISL, OSL, TTFT, ITL with use cases and trade-offs
- **Hardware Configuration**: GPU constraints and search space control
- **Sweep Configuration**: Profiling behavior and interpolation settings
- **AI Configurator Configuration**: System types, model mappings, backend versions
- **Planner Configuration**: Autoscaling and adjustment parameters
- **Complete Examples**: Full DGDRs for online, offline (AIC), and MoE profiling

## Related Documentation

- [DGDR API Reference](/docs/kubernetes/api_reference.md)
- [Pre-Deployment Profiling Details](/docs/benchmarks/sla_driven_profiling.md)
- [SLA Planner Architecture](/docs/planner/sla_planner.md)
- [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)