---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Profiler Examples
---

Complete examples for profiling with DynamoGraphDeploymentRequests (DGDRs).

## DGDR Examples

### Dense Model: Rapid

Fast profiling (~30 seconds):

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen-0-6b
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
```

### Dense Model: Thorough

Profiling with real GPU measurements:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"
  searchStrategy: thorough
```

### MoE Model

Multi-node MoE profiling with SGLang:

> [!IMPORTANT]
> The PVC referenced by `modelCache.pvcName` must already exist in the same namespace and contain
> the model weights at the specified `pvcModelPath`. The DGDR controller does not create or
> populate the PVC — it only mounts it into the profiling job and deployed workers.

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sglang-moe
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"

  hardware:
    numGpusPerNode: 8

  modelCache:
    pvcName: "model-cache"
    pvcModelPath: "deepseek-r1"      # path within the PVC
```

### Private Model

For gated or private Hugging Face models, pass your token via an environment variable injected
into the profiling job. Create the secret first:

```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n "${NAMESPACE}"
```

Then reference it in your DGDR:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-private
spec:
  model: "meta-llama/Llama-3.1-8B-Instruct"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"

  overrides:
    profilingJob:
      template:
        spec:
          containers: []    # required placeholder; leave empty to inherit defaults
          initContainers:
            - name: profiler
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token-secret
                      key: HF_TOKEN
```

### Custom SLA Targets

Control how the profiler optimizes your deployment by specifying latency targets and workload
characteristics.

**Explicit TTFT + ITL targets** (default mode):

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: low-latency-dense
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"

  sla:
    ttft: 500      # Time To First Token target in milliseconds
    itl: 20        # Inter-Token Latency target in milliseconds

  workload:
    isl: 2000      # expected input sequence length (tokens)
    osl: 500       # expected output sequence length (tokens)
```

**End-to-end latency target** (alternative to `ttft` + `itl`):

```yaml
spec:
  # ... (other spec fields omitted)
  sla:
    e2eLatency: 10000    # total request latency budget in milliseconds
```
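To get a feel for how the two SLA styles relate, the sketch below estimates an end-to-end latency from a TTFT target, an ITL target, and an output length. It uses the common approximation that the first token arrives after TTFT and each remaining token adds one ITL. This is an illustrative rule of thumb, not a description of how the profiler evaluates `e2eLatency`, and the function name `estimate_e2e_latency_ms` is hypothetical.

```python
def estimate_e2e_latency_ms(ttft_ms, itl_ms, osl_tokens):
    """Approximate end-to-end latency in milliseconds.

    Assumes the first token arrives after ttft_ms and each of the
    remaining (osl_tokens - 1) tokens adds one itl_ms gap.
    """
    if osl_tokens < 1:
        raise ValueError("osl_tokens must be at least 1")
    return ttft_ms + itl_ms * (osl_tokens - 1)


# Values from the explicit TTFT + ITL example: ttft=500, itl=20, osl=500
print(estimate_e2e_latency_ms(500, 20, 500))  # 10480
```

Under this approximation, the explicit targets in the first example correspond to roughly 10.5 seconds per request, which is in the same range as the 10000 ms budget shown here.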

**Optimization objective without explicit targets** (maximize throughput or minimize latency):

```yaml
spec:
  # ... (other spec fields omitted)
  sla:
    optimizationType: throughput    # or: latency
```

### Overrides

Use `overrides` to customize the profiling job pod spec, for example to add tolerations for
GPU node taints or to inject environment variables.

**GPU node toleration** (common on GKE and shared clusters):

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: dense-with-tolerations
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.0.0"

  overrides:
    profilingJob:
      template:
        spec:
          containers: []    # required placeholder; leave empty to inherit defaults
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
```

**Override the generated DynamoGraphDeployment** (e.g., to use a custom worker image):

```yaml
spec:
  # ... (other spec fields omitted)
  overrides:
    dgd:
      apiVersion: nvidia.com/v1alpha1
      kind: DynamoGraphDeployment
      spec:
        services:
          VllmWorker:
            extraEnvs:
              - name: CUSTOM_ENV
                value: "my-value"
```

## SGLang Runtime Profiling

Profile SGLang workers at runtime via HTTP endpoints:

```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
  -H "Content-Type: application/json" \
  -d '{"output_dir": "/tmp/profiler_output"}'

# Run inference requests to generate profiling data...

# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```
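If you prefer to drive the same endpoints from a script, the following is a minimal Python sketch of the curl calls above, using only the standard library. The helper names (`build_profile_request`, `send`, `run_profiled`) are hypothetical and are not part of Dynamo or SGLang. The host, port, paths, and request body are taken from the curl example, and the handling of the response is an assumption.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9090"  # same host and port as the curl example


def build_profile_request(action, output_dir=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for start_profile or stop_profile."""
    if action not in ("start_profile", "stop_profile"):
        raise ValueError(f"unsupported action: {action}")
    body = None
    headers = {}
    if output_dir is not None:
        # JSON body matching the curl example: {"output_dir": "..."}
        body = json.dumps({"output_dir": output_dir}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"{base_url}/engine/{action}", data=body, headers=headers, method="POST"
    )


def send(request, timeout=30):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run_profiled(workload, output_dir="/tmp/profiler_output"):
    """Start profiling, run the caller-supplied workload, then stop profiling."""
    send(build_profile_request("start_profile", output_dir=output_dir))
    try:
        workload()  # e.g. a function that issues inference requests
    finally:
        # Stop profiling even if the workload raises
        send(build_profile_request("stop_profile"))
```

Wrapping the workload in `try`/`finally` is a design choice of this sketch so that profiling is stopped even when the inference requests fail.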

A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:

```bash
python examples/backends/sglang/test_sglang_profile.py
```

View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.