---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Profiler Examples
---

Complete examples for profiling with `DynamoGraphDeploymentRequest` (DGDR) resources.

## DGDR Examples

### Dense Model: Rapid

Fast profiling (~30 seconds):

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen-0-6b
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
```

### Dense Model: Thorough

Profiling with real GPU measurements:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"
  searchStrategy: thorough
```

### MoE Model

Multi-node MoE profiling with SGLang:

> [!IMPORTANT]
> The PVC referenced by `modelCache.pvcName` must already exist in the same namespace and contain
> the model weights at the specified `pvcModelPath`. The DGDR controller does not create or
> populate the PVC — it only mounts it into the profiling job and deployed workers.

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sglang-moe
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"

  hardware:
    numGpusPerNode: 8

  modelCache:
    pvcName: "model-cache"
    pvcModelPath: "deepseek-r1"      # path within the PVC
```

### Private Model

For gated or private Hugging Face models, pass your token through an environment variable injected
into the profiling job. Create the secret first:

```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n ${NAMESPACE}
```

Then reference it in your DGDR:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-private
spec:
  model: "meta-llama/Llama-3.1-8B-Instruct"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"

  overrides:
    profilingJob:
      template:
        spec:
          containers: []    # required placeholder; leave empty to inherit defaults
          initContainers:
            - name: profiler
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token-secret
                      key: HF_TOKEN
```

### Custom SLA Targets

Control how the profiler optimizes your deployment by specifying latency targets and workload
characteristics.

**Explicit TTFT + ITL targets** (default mode):

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: low-latency-dense
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"

  sla:
    ttft: 500      # Time To First Token target in milliseconds
    itl: 20        # Inter-Token Latency target in milliseconds

  workload:
    isl: 2000      # expected input sequence length (tokens)
    osl: 500       # expected output sequence length (tokens)
```

**End-to-end latency target** (alternative to ttft+itl):

```yaml
spec:
  ...
  sla:
    e2eLatency: 10000    # total request latency budget in milliseconds
```
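
The two SLA modes can be related with a back-of-the-envelope estimate. The sketch below assumes that a request's end-to-end latency is roughly the TTFT plus one ITL for every output token after the first. This is a common approximation, not necessarily the formula the profiler uses, and `estimate_e2e_ms` is a hypothetical helper introduced only for illustration.

```python
def estimate_e2e_ms(ttft_ms: float, itl_ms: float, osl: int) -> float:
    """Approximate end-to-end latency in milliseconds.

    Assumption: the first token arrives after ttft_ms, and each of the
    remaining (osl - 1) output tokens arrives itl_ms after the previous one.
    """
    if osl < 1:
        raise ValueError("osl must be at least 1")
    return ttft_ms + itl_ms * (osl - 1)


# Values taken from the explicit TTFT + ITL example: ttft=500, itl=20, osl=500.
e2e_ms = estimate_e2e_ms(ttft_ms=500, itl_ms=20, osl=500)
print(f"estimated end-to-end latency: {e2e_ms} ms")
```

Under this assumption the explicit targets work out to about 10.5 seconds per request, which is in the same range as the 10000 ms `e2eLatency` budget shown above.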

### Overrides

Use `overrides` to customize the profiling job pod spec — for example to add tolerations for
GPU node taints or inject environment variables.

**GPU node toleration** (common on GKE and shared clusters):

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: dense-with-tolerations
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0"

  overrides:
    profilingJob:
      template:
        spec:
          containers: []    # required placeholder; leave empty to inherit defaults
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
```
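
If you generate DGDR manifests from a script instead of writing YAML by hand, the same override can be attached programmatically. The following is a minimal sketch: `with_gpu_toleration` is a hypothetical helper, and the field names are copied from the YAML above without being validated against the CRD schema. The output is JSON, which `kubectl apply -f -` accepts in place of YAML.

```python
import json


def with_gpu_toleration(dgdr: dict) -> dict:
    """Return a copy of a DGDR manifest with a GPU toleration on the profiling job."""
    manifest = json.loads(json.dumps(dgdr))  # deep copy via a JSON round-trip
    pod_spec = (
        manifest.setdefault("spec", {})
        .setdefault("overrides", {})
        .setdefault("profilingJob", {})
        .setdefault("template", {})
        .setdefault("spec", {})
    )
    # Empty placeholder list, mirroring the `containers: []` line in the YAML.
    pod_spec.setdefault("containers", [])
    pod_spec.setdefault("tolerations", []).append(
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    )
    return manifest


base = {
    "apiVersion": "nvidia.com/v1beta1",
    "kind": "DynamoGraphDeploymentRequest",
    "metadata": {"name": "dense-with-tolerations"},
    "spec": {
        "model": "Qwen/Qwen3-0.6B",
        "image": "nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.1.0",
    },
}

# Print the manifest as JSON; the input dict is left unmodified.
print(json.dumps(with_gpu_toleration(base), indent=2))
```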

**Override the generated DynamoGraphDeployment** (e.g., to use a custom worker image):

```yaml
spec:
  ...
  overrides:
    dgd:
      apiVersion: nvidia.com/v1alpha1
      kind: DynamoGraphDeployment
      spec:
        services:
          VllmWorker:
            extraEnvs:
              - name: CUSTOM_ENV
                value: "my-value"
```

## SGLang Runtime Profiling

Profile SGLang workers at runtime via HTTP endpoints:

```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
  -H "Content-Type: application/json" \
  -d '{"output_dir": "/tmp/profiler_output"}'

# Run inference requests to generate profiling data...

# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```
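
The same two calls can be wrapped in Python so that profiling always stops, even if the workload raises an exception. This is a minimal sketch using only the standard library. The endpoints, port, and payload are taken from the `curl` example above; the helper names (`start_profile_request`, `stop_profile_request`, `profiling`) are hypothetical and not part of any SGLang or Dynamo API.

```python
import json
import urllib.request
from contextlib import contextmanager

# Port taken from the curl example; adjust it for your worker.
BASE_URL = "http://localhost:9090"


def start_profile_request(output_dir: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request that starts profiling and sets the trace output directory."""
    body = json.dumps({"output_dir": output_dir}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/engine/start_profile",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stop_profile_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request that stops profiling."""
    return urllib.request.Request(f"{base_url}/engine/stop_profile", method="POST")


@contextmanager
def profiling(output_dir: str, base_url: str = BASE_URL):
    """Start profiling on entry and stop it on exit, even if the body raises."""
    urllib.request.urlopen(start_profile_request(output_dir, base_url)).close()
    try:
        yield
    finally:
        urllib.request.urlopen(stop_profile_request(base_url)).close()


# Usage (requires a running SGLang worker):
# with profiling("/tmp/profiler_output"):
#     ...  # send inference requests here
```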

A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:

```bash
python examples/backends/sglang/test_sglang_profile.py
```

View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.