sla_planner_quickstart.md 19.9 KB
Newer Older
1
# SLA-Driven Profiling and Planner Deployment Quick Start Guide
2

3
Complete workflow to deploy SLA-optimized Dynamo models using DynamoGraphDeploymentRequests (DGDR). This guide shows how to automatically profile models and deploy them with optimal configurations that meet your Service Level Agreements (SLAs).
4
5
6
7
8
9

> [!IMPORTANT]
> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](/docs/kubernetes/installation_guide.md).

## Overview

10
The DGDR workflow automates the entire process from SLA specification to deployment:
11

12
13
14
1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information in a DGDR Custom Resource
2. **Automatic Profiling**: The Dynamo Operator automatically profiles your model to find optimal configurations
3. **Auto-Deploy**: The system automatically deploys the optimal configuration that meets your SLAs
15
16
17

```mermaid
flowchart TD
18
19
20
21
22
23
24
25
    A[Create DGDR] --> B[DGDR Controller]
    B --> C{Profiling Method}
    C -->|Online| D[Run Profiling Job<br/>2-4 hours]
    C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
    D --> F[Generate DGD Config]
    E --> F
    F --> G[Auto-Deploy DGD]
    G --> H[Monitor & Scale]
26
27

    style A fill:#e1f5fe
28
    style D fill:#fff3e0
29
30
    style E fill:#e8f5e8
    style G fill:#f3e5f5
31
    style H fill:#fff8e1
32
33
```

34
## What is a DynamoGraphDeploymentRequest (DGDR)?
35

36
A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. Think of it as a "deployment order" where you specify:
37

38
39
40
41
42
- **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
43

44
45
46
47
48
The Dynamo Operator watches for DGDRs and automatically:
1. Discovers available GPU resources in your cluster
2. Runs profiling (online or offline) to find optimal configurations
3. Generates an optimized DynamoGraphDeployment (DGD) configuration
4. Deploys the DGD to your cluster
49

50
51
52
53
54
**Key Benefits:**
- **Declarative**: Specify what you want, not how to achieve it
- **Automated**: No manual profiling job setup or result processing
- **SLA-Driven**: Ensures deployments meet your performance requirements
- **Integrated**: Works seamlessly with the Dynamo Operator
55

56
## Prerequisites
57

58
59
60
61
62
63
Before creating a DGDR, ensure:
- **Dynamo platform installed** with the operator running (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running** (required for SLA planner)
- **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images)
- **Sufficient GPU resources** available in your cluster for profiling
- **Runtime images available** that contain both profiler and runtime components
64

65
### Container Images
66

67
Each DGDR requires you to specify container images for the profiling and deployment process:
68

69
70
**profilingConfig.profilerImage** (Required):
Specifies the container image used for the profiling job itself. This image must contain the profiler code and dependencies needed for SLA-based profiling.
71

72
73
74
75
**deploymentOverrides.workersImage** (Optional):
Specifies the container image used for DynamoGraphDeployment worker components (frontend, workers, planner). This image is used for:
- Temporary DGDs created during online profiling (for performance measurements)
- The final DGD deployed after profiling completes
76

77
If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. You may use our public images (0.6.1 and later) or build and push your own.
78

79
80
81
82
83
84
```yaml
spec:
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Optional
85
86
```

87
88
89
90
91
## Quick Start: Deploy with DGDR

### Step 1: Create Your DGDR

Dynamo provides sample DGDR configurations in `benchmarks/profiler/deploy/`. You can use these as starting points:
92

93
94
**Available Sample DGDRs:**
- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
95
- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator
96
- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
97

98
Or, you can create your own DGDR for your own needs.
99

100
101
> [!TIP]
> For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
102

103
### Step 2: Apply the DGDR
104

105
The rest of this quickstart will use the DGDR sample that uses AIC profiling. If you use a different DGDR file and/or name, be sure to adjust the commands accordingly.
106

107
108
109
```bash
export NAMESPACE=your-namespace
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
110
111
```

112
113
114
115
116
The Dynamo Operator will immediately begin processing your request.

### Step 3: Monitor Progress

Watch the DGDR status:
117
118

```bash
119
120
121
122
123
124
125
126
# View status
kubectl get dgdr -n $NAMESPACE

# Detailed status
kubectl describe dgdr sla-aic -n $NAMESPACE

# Watch profiling job logs
kubectl logs -f job/profile-sla-aic -n $NAMESPACE
127
128
```

129
130
131
132
133
134
135
**DGDR Status States:**
- `Pending`: Initial state, preparing to profile
- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- `Deploying`: Generating and applying DGD configuration
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)

136
> [!NOTE]
137
> With AI Configurator, profiling completes in **20-30 seconds**! This is much faster than online profiling which takes 2-4 hours.
138

139
### Step 4: Access Your Deployment
140

141
Once the DGDR reaches `Ready` state, your model is deployed and ready to serve:
142
143

```bash
144
145
# Find the frontend service
kubectl get svc -n $NAMESPACE | grep trtllm-disagg
146

147
148
# Port-forward to access locally
kubectl port-forward svc/trtllm-disagg-frontend 8000:8000 -n $NAMESPACE
149

150
151
# Test the endpoint
curl http://localhost:8000/v1/models
152
153
```

154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
### Step 5 (Optional): Access the Planner Grafana Dashboard

If you want to monitor the SLA Planner's decision-making in real-time, you can deploy the Planner Grafana dashboard.

```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```

Follow the instructions in [Dynamo Metrics Collection on Kubernetes](../kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**.

The dashboard displays:
- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours
- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus
- **Predicted Metrics**: Planner's load predictions and recommended replica counts
- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance

> [!TIP]
> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your specific deployment namespace.

173
## DGDR Configuration Details
174

175
### Required Fields
176

177
178
179
180
181
182
| Field | Type | Description |
|-------|------|-------------|
| `spec.model` | string | Model identifier (e.g., "meta-llama/Llama-3-70b") |
| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
| `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |
183

184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
### Optional Fields

| Field | Type | Description |
|-------|------|-------------|
| `spec.deploymentOverrides.workersImage` | string | Container image for DGD worker components. If omitted, uses image from base config file. |
| `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) |
| `spec.deploymentOverrides` | object | Customize metadata (name, namespace, labels, annotations) and image for auto-created DGD |

### SLA Configuration

The `sla` section defines performance requirements and workload characteristics:

```yaml
sla:
  isl: 3000      # Average input sequence length (tokens)
  osl: 150       # Average output sequence length (tokens)
  ttft: 200      # Target Time To First Token (milliseconds, float)
  itl: 20        # Target Inter-Token Latency (milliseconds, float)
202
203
```

204
205
206
207
208
209
210
211
212
213
214
215
216
**Choosing SLA Values:**
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed)
- **ITL**: Token generation latency target (lower = more GPUs needed)
- **Trade-offs**: Tighter SLAs require more GPU resources

### Profiling Methods

Choose between **online profiling** (real measurements, 2-4 hours) or **offline profiling** with AI Configurator (estimated, 20-30 seconds):

```yaml
# Online Profiling (Default)
sweep:
217
  useAiConfigurator: false
218

219
# Offline Profiling (AI Configurator)
220
sweep:
221
222
223
224
  useAiConfigurator: true
  aicSystem: h200_sxm
  aicHfId: Qwen/Qwen3-32B
  aicBackendVersion: "0.20.0"
225
```
226
227
228
229

> [!NOTE]
> For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-methods).

230
### Hardware Configuration
231

232
For details on hardware configuration and GPU discovery options, see [Hardware Configuration in SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md#hardware-configuration).
233

234
235
236
237
### Advanced Configuration

#### Using Existing DGD Configs (Recommended for Custom Setups)

238
If you have an existing DynamoGraphDeployment config (e.g., from `examples/backends/*/deploy/disagg.yaml` or custom recipes), you can reference it via ConfigMap:
239
240

**Step 1: Create ConfigMap from your DGD config file:**
241
242

```bash
243
244
245
246
kubectl create configmap deepseek-r1-config \
  --from-file=disagg.yaml=/path/to/your/disagg.yaml \
  --namespace $NAMESPACE \
  --dry-run=client -o yaml | kubectl apply -f -
247
248
```

249
**Step 2: Reference the ConfigMap in your DGDR:**
250

251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml  # Must match the key used in --from-file
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
272
273
274
275
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: deepseek-ai/DeepSeek-V3
        aicBackendVersion: "0.20.0"
276
277
278
279
280

  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"

  autoApply: true
281
282
```

283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
> **What's happening**: The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` into `deployment.model` and `spec.backend` into `engine.backend` in the final configuration.

#### Inline Configuration (Simple Use Cases)

For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler will auto-generate a basic DGD configuration from your `model` and `backend`:

```yaml
profilingConfig:
  config:
    # SLA targets (required for profiling)
    sla:
      isl: 8000   # Input sequence length
      osl: 200    # Output sequence length
      ttft: 200.0 # Time To First Token (ms)
      itl: 10.0   # Inter-Token Latency (ms)

    # Hardware constraints (optional)
    hardware:
301
302
303
      minNumGpusPerEngine: 2
      maxNumGpusPerEngine: 8
      gpuType: h200_sxm
304
305
306

    # Profiling sweep settings (optional)
    sweep:
307
308
      prefillInterpolationGranularity: 16  # Number of samples for prefill ISL sweep
      decodeInterpolationGranularity: 6    # Number of samples for decode sweep
309
```
310
311
312
313

> **Note**: `engine.config` is a **file path** to a DGD YAML file, not inline configuration. Use ConfigMapRef (recommended) or leave it unset to auto-generate.

#### Planner Configuration Passthrough
314
Add planner-specific settings:
315
316
317
318
319

```yaml
profilingConfig:
  config:
    planner:
320
      plannerMinEndpoint: 2
321
322
```

323
## Understanding Profiling Results
324

325
For details about the profiling process, performance plots, and interpolation data, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-process-details).
326

327
## Advanced Topics
328

329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
### Mocker Deployment

Instead of a real DGD that uses GPU resources, you can deploy a mocker deployment that uses simulated engines rather than GPUs. Mocker is available in all backend images and uses profiling data to simulate realistic GPU timing behavior. It is useful for:
- Large-scale experiments without GPU resources
- Testing Planner behavior and infrastructure
- Validating deployment configurations

To deploy mocker instead of the real backend, set `useMocker: true`:

```yaml
spec:
  model: <model-name>
  backend: trtllm  # Real backend for profiling (vllm, sglang, or trtllm)
  useMocker: true  # Deploy mocker instead of real backend

  profilingConfig:
    profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
    ...
  autoApply: true
```

Profiling still runs against the real backend (via GPUs or AIC) to collect performance data. The mocker deployment then uses this data to simulate realistic timing behavior.

352
353
354
355
### Using a Model Cache PVC

For large models, you can use a pre-populated PVC containing model weights instead of downloading from HuggingFace. See [Model Cache PVC](/docs/benchmarks/sla_driven_profiling.md#model-cache-pvc-advanced) for configuration details.

356
### DGDR Immutability
357

358
359
360
361
362
363
364
DGDRs are **immutable** - if you need to update SLAs or configuration:

1. Delete the existing DGDR: `kubectl delete dgdr sla-aic`
2. Create a new DGDR with updated specifications

### Manual Deployment Control

365
366
367
368
369
There are two ways to manually control deployment after profiling:

#### Option 1: Use DGDR-Generated Configuration (Recommended)

Disable auto-deployment to review the generated DGD before applying:
370
371
372
373

```yaml
spec:
  autoApply: false
374
375
```

376
Then manually extract and apply the generated DGD:
377

378
```bash
379
380
# Extract generated DGD from DGDR status
kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -f -
381

382
383
# Or save to file first for review/modification
kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
384

385
vi my-dgd.yaml
386
kubectl apply -f my-dgd.yaml -n $NAMESPACE
387
388
```

389
The generated DGD includes optimized configurations and the SLA planner component. The required `planner-profile-data` ConfigMap is automatically created when profiling completes, so the DGD will deploy successfully.
390
391
392
393
394
395

#### Option 2: Use Standalone Planner Templates (Advanced)

For advanced use cases, you can manually deploy using the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:

```bash
396
# After profiling completes, profiling data is automatically stored in ConfigMaps
397

398
399
400
# OPTIONAL: Inspect profiling results stored in ConfigMaps
# View the generated DGD configuration
kubectl get configmap dgdr-output-<dgdr-name> -n $NAMESPACE -o yaml
401

402
403
# View the planner profiling data (JSON format)
kubectl get configmap planner-profile-data -n $NAMESPACE -o yaml
404

405
406
# Update the PROMETHEUS_ENDPOINT environment variable in the planner template
# to match your cluster's Prometheus service location (see comments in the template)
407
408
409
410
411
412

# Update backend planner manifest as needed, then deploy
kubectl apply -f examples/backends/<backend>/deploy/disagg_planner.yaml -n $NAMESPACE
```

> **Note**: The standalone templates are provided as examples and may need customization for your model and requirements. The DGDR-generated configuration (Option 1) is recommended as it's automatically tuned to your profiling results and SLA targets.
413
414
>
> **Important - Prometheus Configuration**: The planner queries Prometheus to get frontend request metrics for scaling decisions. If you see errors like "Failed to resolve prometheus service", ensure the `PROMETHEUS_ENDPOINT` environment variable in your planner configuration correctly points to your Prometheus service. See the comments in the example templates for details.
415

416
### Relationship to DynamoGraphDeployment (DGD)
417

418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
- **DGDR**: High-level "intent" - what you want deployed
- **DGD**: Low-level "implementation" - how it's deployed

The DGDR controller generates a DGD that:
- Uses optimal TP configurations from profiling
- Includes SLA planner for autoscaling
- Has deployment and engine settings tuned for your SLAs

The generated DGD is tracked via labels:
```yaml
metadata:
  labels:
    dgdr.nvidia.com/name: sla-aic
    dgdr.nvidia.com/namespace: your-namespace
```

434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
### Accessing Detailed Profiling Artifacts

By default, profiling jobs save essential data to ConfigMaps for planner integration. For advanced users who need access to detailed artifacts (logs, performance plots, AIPerf results, etc), configure the DGDR to use `dynamo-pvc`. This is optional and will not affect the functionality of profiler or Planner.

**What's available in ConfigMaps (always created):**
- Generated DGD configuration
- Profiling data for Planner (`.json` files)

**What's available in PVC if attached to DGDR (optional):**
- Performance plots (PNGs)
- DGD configuration and logs of all services for each profiled deployment
- AIPerf profiling artifacts for each AIPerf run
- Raw profiling data (`.npz` files)
- Profiler log

**Setup:**

1. Set up the benchmarking PVC:
```bash
export NAMESPACE=your-namespace
deploy/utils/setup_benchmarking_resources.sh
```

2. Add `outputPVC` to your DGDR's `profilingConfig`:
```yaml
spec:
  profilingConfig:
    outputPVC: "dynamo-pvc"
    config:
      # ... rest of config
```

3. After profiling completes, access results:
```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```

474
475
476
## Troubleshooting

### Quick Diagnostics
477
478

```bash
479
480
481
482
483
484
485
486
# Check DGDR status and events
kubectl describe dgdr sla-aic -n $NAMESPACE

# Check operator logs
kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=dynamo-operator --tail=100

# Check profiling job logs
kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
487
488
```

489
### Common Issues
490

491
492
493
494
495
496
497
| Issue | Quick Fix |
|-------|-----------|
| **DGDR stuck in Pending** | Check GPU availability: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'` |
| **Image pull errors** | Verify secret exists: `kubectl get secret nvcr-imagepullsecret -n $NAMESPACE` |
| **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
498

499
500
> [!NOTE]
> For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/benchmarks/sla_driven_profiling.md#troubleshooting).
501

502
## Configuration Reference
503

504
For comprehensive documentation of all DGDR configuration options, see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
505

506
507
508
509
510
511
512
513
514
515
516
517
518
519
This includes detailed explanations of:
- **SLA Configuration**: ISL, OSL, TTFT, ITL with use cases and trade-offs
- **Hardware Configuration**: GPU constraints and search space control
- **Sweep Configuration**: Profiling behavior and interpolation settings
- **AI Configurator Configuration**: System types, model mappings, backend versions
- **Planner Configuration**: Autoscaling and adjustment parameters
- **Complete Examples**: Full DGDRs for online, offline (AIC), and MoE profiling

## Related Documentation

- [DGDR API Reference](/docs/kubernetes/api_reference.md)
- [Pre-Deployment Profiling Details](/docs/benchmarks/sla_driven_profiling.md)
- [SLA Planner Architecture](/docs/planner/sla_planner.md)
- [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)