---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Profiler Guide
---

# Profiler Guide

## Overview

The Dynamo Profiler analyzes model inference performance and generates optimized deployment configurations in the form of DynamoGraphDeployments (DGDs). Given a model, hardware, and SLA targets, it determines the best parallelization strategy, selects optimal prefill and decode engine configurations, and produces a ready-to-deploy DGD YAML.

The profiler accepts a `DynamoGraphDeploymentRequestSpec` (DGDR) as input and uses [AI Configurator (AIC)](https://github.com/ai-dynamo/aiconfigurator) for performance simulation, candidate enumeration, and configuration picking. When the planner is enabled, the profiler additionally generates engine interpolation curves used for runtime autoscaling.

## Workflow

The profiler follows this pipeline:

```mermaid
flowchart TD
    Input["DGDR Spec"] --> Validate["Validate + Gate Checks"]
    Validate --> Strategy{searchStrategy?}

    Strategy -->|rapid| AICCheck{"AIC supports\nmodel/hw/backend?"}
    Strategy -->|thorough| Enumerate["Enumerate candidates\nvia AIC"]

    AICCheck -->|yes| Simulate["AIC Simulation\n+ Picking"]
    AICCheck -->|no| Naive["Naive Config\nGeneration"]

    Enumerate --> Deploy["Deploy + Benchmark\neach candidate"]
    Deploy --> Pick["AIC Picking"]

    Simulate --> DGDGen["DGD Generation"]
    Pick --> DGDGen
    Naive --> DGDGen

    DGDGen --> PlannerCheck{"Planner\nenabled?"}
    PlannerCheck -->|yes| Interpolation["Interpolation\nCurves"]
    PlannerCheck -->|no| MockerCheck

    Interpolation --> AddPlanner["Add Planner\nService + ConfigMaps"]
    AddPlanner --> MockerCheck{"Mocker\nenabled?"}

    MockerCheck -->|yes| Mocker["Output Mocker DGD"]
    MockerCheck -->|no| RealDGD["Output Real DGD"]

    Mocker --> Final["final_config.yaml"]
    RealDGD --> Final
```

### Stage-by-stage walkthrough

1. **Validation**: The DGDR spec is validated — required fields checked (`image`, `hardware.gpuSku`, `hardware.numGpusPerNode`), SLA targets verified, and gate checks applied (see [Gate Checks](#gate-checks-and-constraints)).

2. **Search Strategy**: The profiler branches based on `searchStrategy`:
   - **Rapid**: Uses AIC simulation to estimate performance across parallelization configs. Needs no GPUs and completes in ~30 seconds.
   - **Thorough**: Enumerates candidate parallelization configs via AIC, deploys each on real GPUs, benchmarks with AIPerf, then picks the best. Takes 2-4 hours and supports disaggregated mode only.

3. **Picking**: The profiler selects the best configuration using one of three modes, determined automatically from the DGDR spec (see [Picking Modes](#picking-modes)).

4. **DGD Generation**: The picked configuration is rendered into a complete DGD YAML via AIC's generator pipeline, including correct parallelization, replica counts, container image, and PVC mounts.

5. **Interpolation** (planner only): When the planner is enabled, the profiler generates detailed performance interpolation curves — TTFT vs ISL for prefill, ITL vs KV-cache utilization for decode. These are saved into ConfigMaps for the planner to use at runtime.

6. **Final Assembly**: The planner service is added to the DGD if enabled. If mocker is enabled, the mocker DGD is used instead of real workers. The result is written to `final_config.yaml`.
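
The flowchart and walkthrough above can be read as a small decision procedure. The sketch below is illustrative only: `profiler_stages` and the stage names are hypothetical labels chosen for this example, not identifiers from the profiler's code.

```python
def profiler_stages(search_strategy, aic_supported, planner_enabled, mocker_enabled):
    """Return the ordered stages the workflow diagram would visit (hypothetical names)."""
    stages = ["validate"]

    if search_strategy == "rapid":
        # Rapid: simulate with AIC when it supports the model/hardware/backend,
        # otherwise fall back to naive config generation.
        stages.append("aic_simulation_and_picking" if aic_supported
                      else "naive_config_generation")
    elif search_strategy == "thorough":
        # Thorough: enumerate candidates, benchmark each on real GPUs, then pick.
        stages += ["enumerate_candidates", "deploy_and_benchmark", "aic_picking"]
    else:
        raise ValueError(f"unknown searchStrategy: {search_strategy!r}")

    stages.append("dgd_generation")

    if planner_enabled:
        # Planner only: interpolation curves, then planner service + ConfigMaps.
        stages += ["interpolation_curves", "add_planner_service"]

    stages.append("output_mocker_dgd" if mocker_enabled else "output_real_dgd")
    stages.append("write_final_config")
    return stages


if __name__ == "__main__":
    for stage in profiler_stages("rapid", aic_supported=True,
                                 planner_enabled=True, mocker_enabled=False):
        print(stage)
```

Modeling the pipeline as a flat list of stage names keeps the three branch points (search strategy, planner, mocker) easy to compare against the diagram.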

## Search Strategies

### Rapid

Uses AIC's performance simulation to estimate optimal configurations without deploying real engines. Completes in ~30 seconds.

```yaml
searchStrategy: rapid
```

- Supports all backends: vLLM, SGLang, TensorRT-LLM
- If AIC does not support the model/hardware/backend combination, the profiler falls back to a naive config based on a memory-fit tensor-parallel (TP) size calculation
- No GPU resources consumed during profiling
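
This guide does not spell out the memory-fit calculation behind the naive fallback. As a rough intuition only, the sketch below picks the smallest TP size at which the model weights fit in GPU memory. Everything in it is an assumption made for illustration: the function name `naive_tp_size`, the power-of-two search, the 90% usable-memory fraction, and the fact that it ignores KV cache and activation memory.

```python
def naive_tp_size(model_weight_gib, gpu_mem_gib, num_gpus_per_node, usable_fraction=0.9):
    """Smallest power-of-two TP size whose per-GPU weight shard fits in memory.

    Hypothetical helper: the real profiler's fallback logic may differ.
    """
    budget_gib = gpu_mem_gib * usable_fraction  # assumed headroom for runtime overhead
    tp = 1
    while tp <= num_gpus_per_node:
        if model_weight_gib / tp <= budget_gib:
            return tp
        tp *= 2
    raise ValueError("model weights do not fit on a single node under these assumptions")


if __name__ == "__main__":
    # Example numbers are made up: 140 GiB of weights on 80 GiB GPUs, 8 GPUs per node.
    print(naive_tp_size(140, 80, 8))
```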

### Thorough

Enumerates candidate parallelization configs, deploys each as a real K8s workload, and benchmarks with AIPerf.

```yaml
searchStrategy: thorough
```

- Only disaggregated mode is supported
- Does not support `auto` backend — specify `vllm`, `sglang`, or `trtllm`
- Takes 2-4 hours depending on the number of candidates
- Provides the highest accuracy because measurements come from real hardware

## Picking Modes

The profiler automatically selects a picking mode based on the DGDR spec:

### Autoscale

Triggered when the **planner is enabled** (scaling enabled in `features.planner`). Picks prefill and decode engines independently, each with 1 replica. The planner handles scaling at runtime.

### Load Match

Triggered when a **target load** is specified (`workload.requestRate` or `workload.concurrency`). Finds the configuration that serves the target load with the minimum number of GPUs under SLA.

```yaml
workload:
  requestRate: 5.0   # target 5 req/s
```

### Default

Triggered when there is **no planner and no target load**. Maximizes throughput for the available GPU budget under SLA.
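
The three triggers can be summarized as a short selection function. This is a sketch under stated assumptions: `select_picking_mode` is a hypothetical name, the planner state is passed in as a plain boolean rather than read from a specific DGDR field, and the precedence of the planner over a target load (when both are present) simply follows the order in which the modes are listed here.

```python
def select_picking_mode(planner_enabled, workload=None):
    """Map the DGDR-level triggers described above to a picking mode name."""
    workload = workload or {}
    if planner_enabled:
        # Autoscale: prefill and decode picked independently, 1 replica each.
        return "autoscale"
    if workload.get("requestRate") is not None or workload.get("concurrency") is not None:
        # Load Match: serve the target load with the fewest GPUs under SLA.
        return "load_match"
    # Default: maximize throughput for the available GPU budget under SLA.
    return "default"


if __name__ == "__main__":
    print(select_picking_mode(False, {"requestRate": 5.0}))
```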

## Planner Integration

When the planner is enabled, the profiler generates engine interpolation data needed for throughput-based autoscaling. The `pre_deployment_sweeping_mode` field controls how this data is produced:

```yaml
features:
  planner:
    pre_deployment_sweeping_mode: rapid   # rapid | thorough | none
    enable_throughput_scaling: true
```

- **rapid**: Uses AIC simulation to generate interpolation curves (~30s, no GPUs)
- **thorough**: Deploys the selected engine config on real GPUs and sweeps across ISL/concurrency ranges (2-4h)
- **none**: Skips interpolation. Only valid when using load-based scaling without throughput-based scaling.
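
To illustrate what an interpolation curve provides at runtime, the sketch below estimates TTFT at an unsampled ISL from a handful of sampled points. It assumes piecewise-linear interpolation with clamping at the edges, which may not match what the planner actually does. The function name `interpolate_curve` and the sample values are made up for this example.

```python
from bisect import bisect_left


def interpolate_curve(xs, ys, x):
    """Piecewise-linear interpolation over sorted samples, clamped to the sampled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)          # xs[i-1] < x <= xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


if __name__ == "__main__":
    # Hypothetical prefill samples: input sequence length (tokens) -> TTFT (ms).
    isl_samples = [1000, 2000, 4000]
    ttft_ms_samples = [100.0, 200.0, 500.0]
    print(interpolate_curve(isl_samples, ttft_ms_samples, 3000))
```

The same helper would apply to the decode curve by swapping in KV-cache utilization for the x-axis and ITL for the y-axis.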

The profiler saves two ConfigMaps into the generated DGD:
- **planner-config-XXXX**: Serialized `PlannerConfig` JSON (with `profile_results_dir` pointing to the profiling data mount)
- **planner-profile-data-XXXX**: Prefill and decode interpolation data (JSON)

See the [Planner Guide](../planner/planner-guide.md) for the full `PlannerConfig` reference.

## Mocker

When `features.mocker.enabled: true`, the profiler outputs a mocker DGD that simulates engine behavior without real GPUs. This is useful for testing planner behavior and validating configurations at scale.

Mocker requires pre-deployment sweeping to generate simulated performance profiles — `pre_deployment_sweeping_mode` cannot be `none` when mocker is enabled.

## Gate Checks and Constraints

The profiler enforces these rules at startup:

| Condition | Behavior |
|-----------|----------|
| `searchStrategy: thorough` + `backend: auto` | Rejected. Specify a concrete backend. |
| AIC unsupported + `enable_throughput_scaling: true` | Rejected. Throughput planner requires AIC support. |
| AIC unsupported + `pre_deployment_sweeping_mode: rapid` | Falls back to `none` with a warning. |
| `e2eLatency` provided without `ttft: null, itl: null` | Rejected by SLA validator. When using `e2eLatency`, explicitly null out `ttft` and `itl`. |
| SLA unachievable | Warning logged, SLA updated to best achievable value. |
| Load-match needs more GPUs than available | Warning logged. |
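
The first three rows of the table can be expressed as a small pre-flight check. The sketch below is not the profiler's validator: `gate_check` and its parameters are hypothetical, and it covers only the backend and AIC-support rules, leaving out the SLA and GPU-budget rows.

```python
def gate_check(search_strategy, backend, aic_supported,
               enable_throughput_scaling, sweeping_mode):
    """Return (effective_sweeping_mode, warnings); raise ValueError on a rejected spec."""
    warnings = []

    if search_strategy == "thorough" and backend == "auto":
        raise ValueError("thorough search requires a concrete backend "
                         "(vllm, sglang, or trtllm)")

    if not aic_supported and enable_throughput_scaling:
        raise ValueError("throughput-based scaling requires AIC support")

    if not aic_supported and sweeping_mode == "rapid":
        # Rapid sweeping relies on AIC simulation, so it degrades to 'none'.
        warnings.append("AIC unsupported: pre_deployment_sweeping_mode falls back to 'none'")
        sweeping_mode = "none"

    return sweeping_mode, warnings


if __name__ == "__main__":
    print(gate_check("rapid", "auto", aic_supported=False,
                     enable_throughput_scaling=False, sweeping_mode="rapid"))
```

Returning warnings separately from raising errors mirrors the table's split between "Rejected" rows and rows that only log a warning.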

## CLI Usage

The profiler can be run directly for local development and testing:

```bash
python -m dynamo.profiler --config <spec.yaml>
```

Here `<spec.yaml>` is a DGDR spec, given either as a path to a JSON or YAML file or as an inline JSON string.

### Operational flags

| Flag | Default | Description |
|------|---------|-------------|
| `--output-dir` | `profiling_results` | Directory for output files |
| `--deployment-timeout` | `3600` | Max seconds to wait for K8s deployment readiness |
| `--prefill-interpolation-granularity` | `16` | Number of ISL samples for prefill interpolation |
| `--decode-interpolation-granularity` | `6` | Number of samples for decode interpolation |
| `--dry-run` | `false` | Skip all deployments and benchmarking (dev mode) |
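
For scripted local runs, the command line can be assembled programmatically. The sketch below uses only the module name and flags listed above. `build_profiler_command` is a hypothetical helper, and it assumes `--dry-run` is a bare switch that takes no value, which this guide does not confirm.

```python
import shlex


def build_profiler_command(spec, output_dir="profiling_results", dry_run=False):
    """Build the argv list for a local profiler run (does not execute anything)."""
    cmd = ["python", "-m", "dynamo.profiler",
           "--config", spec,
           "--output-dir", output_dir]
    if dry_run:
        cmd.append("--dry-run")  # assumed to be a valueless boolean switch
    return cmd


if __name__ == "__main__":
    cmd = build_profiler_command("spec.yaml", dry_run=True)
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where the profiler is installed.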

### Output

The profiler writes `final_config.yaml` to the output directory. When the planner is enabled, this is a multi-document YAML containing ConfigMaps + DGD. The `profiler_status.yaml` file tracks job status (`success` / `failed`).

## Support Matrix

| Backend | Dense Models | MoE Models |
|---------|-------------|------------|
| vLLM | ✅ | 🚧 |
| SGLang | ✅ | ✅ |
| TensorRT-LLM | ✅ | 🚧 |

## Troubleshooting

### SLA Cannot Be Met

The profiler logs a warning and updates the SLA to the best achievable value. To improve results:
- Relax SLA targets (increase TTFT/ITL)
- Add more GPU resources
- Try a different backend
- Use a smaller or quantized model

### Profiling Takes Too Long

- Use `searchStrategy: rapid` for ~30s profiling
- Reduce interpolation granularity
- Reduce the GPU search space via hardware constraints

### Out of Memory During Profiling

- Reduce `max_batch_size` in engine config
- Skip larger TP configurations by constraining hardware
- Use a quantized model variant

### Image Pull Errors

Ensure image pull secrets are configured in your namespace for the container registry.

## See Also

- [Profiler README](README.md) — Quick overview and feature matrix
- [Profiler Examples](profiler-examples.md) — Complete DGDR YAML examples
- [Planner Guide](../planner/planner-guide.md) — PlannerConfig reference and scaling modes
- [DGDR API Reference](../../kubernetes/api-reference.md) — Full DGDR specification