---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Profiler Guide
---

## Overview

The Dynamo Profiler analyzes model inference performance and generates optimized deployment configurations (DynamoGraphDeployments, or DGDs). Given a model, hardware, and SLA targets, it determines the best parallelization strategy, selects optimal prefill and decode engine configurations, and produces a ready-to-deploy DGD YAML.

The profiler accepts a `DynamoGraphDeploymentRequestSpec` (DGDR) as input and uses [AI Configurator (AIC)](https://github.com/ai-dynamo/aiconfigurator) for performance simulation, candidate enumeration, and configuration picking. When the planner is enabled, the profiler additionally generates engine interpolation curves used for runtime autoscaling.

## Workflow

The profiler follows this pipeline:

```mermaid
flowchart TD
    Input["DGDR Spec"] --> Validate["Validate + Gate Checks"]
    Validate --> Strategy{searchStrategy?}

    Strategy -->|rapid| AICCheck{"AIC supports\nmodel/hw/backend?"}
    Strategy -->|thorough| Enumerate["Enumerate candidates\nvia AIC"]

    AICCheck -->|yes| Simulate["AIC Simulation\n+ Picking"]
    AICCheck -->|no| Naive["Naive Config\nGeneration"]

    Enumerate --> Deploy["Deploy + Benchmark\neach candidate"]
    Deploy --> Pick["AIC Picking"]

    Simulate --> DGDGen["DGD Generation"]
    Pick --> DGDGen
    Naive --> DGDGen

    DGDGen --> PlannerCheck{"Planner\nenabled?"}
    PlannerCheck -->|yes| Interpolation["Interpolation\nCurves"]
    PlannerCheck -->|no| MockerCheck

    Interpolation --> AddPlanner["Add Planner\nService + ConfigMaps"]
    AddPlanner --> MockerCheck{"Mocker\nenabled?"}

    MockerCheck -->|yes| Mocker["Output Mocker DGD"]
    MockerCheck -->|no| RealDGD["Output Real DGD"]

    Mocker --> Final["final_config.yaml"]
    RealDGD --> Final
```

### Stage-by-stage walkthrough

1. **Validation**: The DGDR spec is validated: required fields (`image`, `hardware.gpuSku`, `hardware.numGpusPerNode`) are checked, SLA targets are verified, and gate checks are applied (see [Gate Checks](#gate-checks-and-constraints)).

2. **Search Strategy**: The profiler branches based on `searchStrategy`:
   - **Rapid**: Uses AIC simulation to estimate performance across parallelization configs. Needs no GPUs and completes in ~30 seconds.
   - **Thorough**: Enumerates candidate parallelization configs via AIC, deploys each on real GPUs, benchmarks with AIPerf, then picks the best. Takes 2-4 hours and supports disaggregated mode only.

3. **Picking**: The profiler selects the best configuration using one of three modes, determined automatically from the DGDR spec (see [Picking Modes](#picking-modes)).

4. **DGD Generation**: The picked configuration is rendered into a complete DGD YAML via AIC's generator pipeline, including correct parallelization, replica counts, container image, and PVC mounts.

5. **Interpolation** (planner only): When the planner is enabled, the profiler generates detailed performance interpolation curves: TTFT (time to first token) versus ISL (input sequence length) for prefill, and ITL (inter-token latency) versus KV-cache utilization for decode. These are saved into ConfigMaps for the planner to use at runtime.

6. **Final Assembly**: The planner service is added to the DGD if enabled. If mocker is enabled, the mocker DGD is used instead of real workers. The result is written to `final_config.yaml`.

## Search Strategies

### Rapid

Uses AIC's performance simulation to estimate optimal configurations without deploying real engines. Completes in ~30 seconds.

```yaml
searchStrategy: rapid
```

- Supports all backends: vLLM, SGLang, TensorRT-LLM
- If the model/hardware/backend combination is not supported by AIC, falls back to a naive config (memory-fit TP calculation)
- No GPU resources consumed during profiling
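
The "memory-fit TP calculation" used for the naive fallback is not spelled out in this guide. The sketch below shows one plausible reading, assuming the goal is the smallest power-of-two tensor-parallel degree whose weight shard fits in a fraction of one GPU's memory. The function name `naive_tp`, the `usable_fraction` parameter, and its default are illustrative assumptions, not the profiler's actual logic.

```python
def naive_tp(model_weight_gib: float, gpu_mem_gib: float,
             gpus_per_node: int, usable_fraction: float = 0.8) -> int:
    """Smallest power-of-two TP degree whose weight shard fits on one GPU.

    Hypothetical helper: usable_fraction reserves headroom for KV cache and
    activations; the real profiler may use a different rule.
    """
    tp = 1
    while tp <= gpus_per_node:
        # Each GPU holds 1/tp of the weights under tensor parallelism.
        if model_weight_gib / tp <= gpu_mem_gib * usable_fraction:
            return tp
        tp *= 2
    raise ValueError("model weights do not fit on a single node")


if __name__ == "__main__":
    # Made-up example: 140 GiB of weights on 80 GiB GPUs, 8 GPUs per node.
    print(naive_tp(140, 80, 8))
```

With these example inputs, a shard of 35 GiB is the first one that fits in the 64 GiB budget, so the sketch returns a TP degree of 4.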

### Thorough

Enumerates candidate parallelization configs, deploys each as a real K8s workload, and benchmarks with AIPerf.

```yaml
searchStrategy: thorough
```

- Only disaggregated mode is supported
- Does not support the `auto` backend; specify `vllm`, `sglang`, or `trtllm`
- Takes 2-4 hours depending on the number of candidates
- Provides highest accuracy since measurements come from real hardware

## Picking Modes

The profiler automatically selects a picking mode based on the DGDR spec:

### Autoscale

Triggered when the **planner is enabled** (scaling enabled in `features.planner`). Picks prefill and decode engines independently, each with 1 replica. The planner handles scaling at runtime.

### Load Match

Triggered when a **target load** is specified (`workload.requestRate` or `workload.concurrency`). Finds the configuration that serves the target load with the minimum number of GPUs under SLA.

```yaml
workload:
  requestRate: 5.0   # target 5 req/s
```
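
As a rough illustration of "minimum number of GPUs under SLA", the sketch below compares candidates by the GPU count needed to reach a target request rate. It assumes each candidate is a single unit with a known SLA-compliant throughput per replica. The `Candidate` type, its fields, and all numbers are made up for the example, and the real picker may treat prefill and decode engines separately.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    gpus_per_replica: int
    max_rate_under_sla: float  # req/s one replica sustains within SLA (made-up)


def gpus_needed(candidate: Candidate, target_rate: float) -> int:
    """GPUs required to serve target_rate with whole replicas of one candidate."""
    replicas = math.ceil(target_rate / candidate.max_rate_under_sla)
    return replicas * candidate.gpus_per_replica


def load_match(candidates: List[Candidate], target_rate: float) -> Candidate:
    """Pick the candidate that serves target_rate with the fewest GPUs."""
    return min(candidates, key=lambda c: gpus_needed(c, target_rate))


if __name__ == "__main__":
    candidates = [
        Candidate("tp1", gpus_per_replica=1, max_rate_under_sla=0.8),
        Candidate("tp2", gpus_per_replica=2, max_rate_under_sla=2.0),
        Candidate("tp4", gpus_per_replica=4, max_rate_under_sla=3.5),
    ]
    best = load_match(candidates, target_rate=5.0)
    print(best.name, gpus_needed(best, 5.0))
```

With these illustrative numbers, `tp2` wins with 6 GPUs (3 replicas of 2 GPUs), compared with 7 GPUs for `tp1` and 8 for `tp4`.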

### Default

Triggered when there is **no planner and no target load**. Maximizes throughput for the available GPU budget under SLA.
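
The three triggers can be summarized as a small decision function. This is a minimal sketch, not the profiler's code: it assumes the spec is available as a plain dict, that a non-empty `features.planner` block means the planner is enabled, and that the planner check takes precedence over a target load. The function name and the returned mode strings are hypothetical labels.

```python
def select_picking_mode(spec: dict) -> str:
    """Return 'autoscale', 'load_match', or 'default' for a DGDR-like dict."""
    features = spec.get("features") or {}
    workload = spec.get("workload") or {}

    # Assumption: any non-empty planner block counts as "planner enabled".
    if features.get("planner"):
        return "autoscale"

    # A target load is either a request rate or a concurrency level.
    if (workload.get("requestRate") is not None
            or workload.get("concurrency") is not None):
        return "load_match"

    # No planner and no target load.
    return "default"


if __name__ == "__main__":
    print(select_picking_mode({"workload": {"requestRate": 5.0}}))
```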

## Planner Integration

When the planner is enabled, the profiler generates engine interpolation data needed for throughput-based autoscaling. The `pre_deployment_sweeping_mode` field controls how this data is produced:

```yaml
features:
  planner:
    pre_deployment_sweeping_mode: rapid   # rapid | thorough | none
    enable_throughput_scaling: true
```

- **rapid**: Uses AIC simulation to generate interpolation curves (~30s, no GPUs)
- **thorough**: Deploys the selected engine config on real GPUs and sweeps across ISL/concurrency ranges (2-4h)
- **none**: Skips interpolation. Only valid when using load-based scaling without throughput-based scaling.
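
To show what an interpolation curve gives a consumer, the sketch below estimates prefill TTFT at an arbitrary ISL from a handful of sampled points using piecewise-linear interpolation. The sample values are invented for illustration, and both the function name and the choice of linear interpolation with clamping are assumptions. The actual data format and interpolation method used by the planner are not described here.

```python
from bisect import bisect_left
from typing import Sequence


def interpolate(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Piecewise-linear interpolation over sorted xs, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # first index with xs[i] >= x, so 1 <= i < len(xs)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


if __name__ == "__main__":
    # Made-up prefill samples: ISL in tokens, TTFT in milliseconds.
    isl = [1000, 2000, 4000]
    ttft_ms = [50.0, 120.0, 300.0]
    print(interpolate(isl, ttft_ms, 3000))
```

A higher interpolation granularity adds more sample points, which narrows the gap that each linear segment has to bridge.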

The profiler saves two ConfigMaps into the generated DGD:
- **planner-config-XXXX**: Serialized `PlannerConfig` JSON (with `profile_results_dir` pointing to the profiling data mount)
- **planner-profile-data-XXXX**: Prefill and decode interpolation data (JSON)

See the [Planner Guide](../planner/planner-guide.md) for the full `PlannerConfig` reference.

## Mocker

When `features.mocker.enabled: true`, the profiler outputs a mocker DGD that simulates engine behavior without real GPUs. This is useful for testing planner behavior and validating configurations at scale.

Mocker requires pre-deployment sweeping to generate simulated performance profiles, so `pre_deployment_sweeping_mode` cannot be `none` when mocker is enabled.

## Gate Checks and Constraints

The profiler enforces these rules at startup:

| Condition | Behavior |
|-----------|----------|
| `searchStrategy: thorough` + `backend: auto` | Rejected. Specify a concrete backend. |
| AIC unsupported + `enable_throughput_scaling: true` | Rejected. Throughput planner requires AIC support. |
| AIC unsupported + `pre_deployment_sweeping_mode: rapid` | Falls back to `none` with a warning. |
| `e2eLatency` provided without `ttft: null, itl: null` | Rejected by SLA validator. When using `e2eLatency`, explicitly null out `ttft` and `itl`. |
| SLA unachievable | Warning logged, SLA updated to best achievable value. |
| Load-match needs more GPUs than available | Warning logged. |
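
The first three rows of the table can be read as the following checks. This is a hedged sketch with hypothetical names (`GateResult`, `gate_check`, and the `aic_supported` flag); it takes the relevant values as plain arguments instead of assuming where they live in the spec, and it does not cover the SLA or load-match rows.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GateResult:
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)
    sweeping_mode: str = "none"  # effective mode after any fallback


def gate_check(search_strategy: str, backend: str, aic_supported: bool,
               throughput_scaling: bool, sweeping_mode: str) -> GateResult:
    """Apply the rejection and fallback rules from the table above."""
    result = GateResult(sweeping_mode=sweeping_mode)

    if search_strategy == "thorough" and backend == "auto":
        result.errors.append("thorough search needs a concrete backend")

    if not aic_supported and throughput_scaling:
        result.errors.append("throughput scaling requires AIC support")

    if not aic_supported and sweeping_mode == "rapid":
        # Not rejected: downgrade the sweeping mode and warn.
        result.sweeping_mode = "none"
        result.warnings.append("AIC unsupported; sweeping mode set to none")

    return result


if __name__ == "__main__":
    print(gate_check("thorough", "auto", True, False, "rapid"))
```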

## CLI Usage

The profiler can be run directly for local development and testing:

```bash
python -m dynamo.profiler --config <spec.yaml>
```

Here `<spec.yaml>` is a DGDR spec, given as a JSON or YAML file or as an inline JSON string.

### Operational flags

| Flag | Default | Description |
|------|---------|-------------|
| `--output-dir` | `profiling_results` | Directory for output files |
| `--deployment-timeout` | `3600` | Max seconds to wait for K8s deployment readiness |
| `--prefill-interpolation-granularity` | `16` | Number of ISL samples for prefill interpolation |
| `--decode-interpolation-granularity` | `6` | Number of samples for decode interpolation |
| `--dry-run` | `false` | Skip all deployments and benchmarking (dev mode) |
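
When scripting local runs, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper that only builds the argument list from the flags in the table. It assumes `--dry-run` is a bare switch with no value, which the table does not confirm.

```python
import sys
from typing import List


def build_profiler_command(spec: str,
                           output_dir: str = "profiling_results",
                           deployment_timeout: int = 3600,
                           dry_run: bool = False) -> List[str]:
    """Build an argv list for `python -m dynamo.profiler` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "dynamo.profiler",
        "--config", spec,
        "--output-dir", output_dir,
        "--deployment-timeout", str(deployment_timeout),
    ]
    if dry_run:
        # Assumption: the flag is a switch and takes no argument.
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    print(build_profiler_command("spec.yaml", dry_run=True))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where the profiler is installed.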

### Output

The profiler writes `final_config.yaml` to the output directory. When the planner is enabled, this is a multi-document YAML file containing the ConfigMaps and the DGD. The `profiler_status.yaml` file tracks job status (`success` / `failed`).

## Support Matrix

| Backend | Dense Models | MoE Models |
|---------|-------------|------------|
| vLLM | ✅ | 🚧 |
| SGLang | ✅ | ✅ |
| TensorRT-LLM | ✅ | 🚧 |

## Troubleshooting

### SLA Cannot Be Met

The profiler logs a warning and updates the SLA to the best achievable value. To improve results:
- Relax SLA targets (increase TTFT/ITL)
- Add more GPU resources
- Try a different backend
- Use a smaller or quantized model

### Profiling Takes Too Long

- Use `searchStrategy: rapid` for ~30s profiling
- Reduce interpolation granularity
- Reduce the GPU search space via hardware constraints

### Out of Memory During Profiling

- Reduce `max_batch_size` in engine config
- Skip larger TP configurations by constraining hardware
- Use a quantized model variant

### Image Pull Errors

Ensure image pull secrets are configured in your namespace for the container registry.

## See Also

- [Profiler README](README.md) — Quick overview and feature matrix
- [Profiler Examples](profiler-examples.md) — Complete DGDR YAML examples
- [Planner Guide](../planner/planner-guide.md) — PlannerConfig reference and scaling modes
- [DGDR API Reference](../../kubernetes/api-reference.md) — Full DGDR specification