# Pre-Deployment Profiling

> [!TIP]
> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).

## Profiling Script

To help Dynamo deployments meet their SLA, we provide a pre-deployment script that profiles model performance under different parallelization mappings. It recommends a parallelization mapping for the prefill and decode workers, along with planner configurations. To use the script, provide the target input sequence length (ISL), output sequence length (OSL), time-to-first-token (TTFT) SLA, and inter-token latency (ITL) SLA.

> [!NOTE]
> **Time Investment**: This profiling process is comprehensive and typically takes **a few hours** to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings. This upfront investment ensures your deployment meets SLA requirements and operates efficiently.

Support matrix:

| Backends | Model Types | Supported |
| --- | --- | --- |
| vLLM | Dense | ✅ |
| vLLM | MoE | 🚧 |
| SGLang | Dense | ✅ |
| SGLang | MoE | ✅ |
| TensorRT-LLM | Dense | ✅ |
| TensorRT-LLM | MoE | 🚧 |

> [!NOTE]
> The script considers a fixed ISL/OSL without KV cache reuse. If the real ISL/OSL has a large variance or a significant amount of KV cache can be reused, the result might be inaccurate.

We assume there are no piggy-backed prefill requests in the decode engine. A few short piggy-backed prefill requests should not affect the ITL much under most conditions, but if there are many of them, the estimated ITL might be inaccurate.

The script first detects the number of GPUs available on the current node (multi-node engines are not supported yet). It then profiles prefill and decode performance with different TP sizes:

- **Prefill**: Because there is no in-flight batching (the ISL is assumed to be long enough to saturate the GPU), the script directly measures the TTFT of a request with the given ISL and no KV cache reuse.
- **Decode**: Because the ITL (or iteration time) depends on how many requests are in flight, the script measures the ITL under different numbers of in-flight requests, ranging from 1 to the maximum number of requests that the engine's KV cache can hold. To keep piggy-backed prefill requests from affecting the measurement, the script enables KV cache reuse and warms up the engine by issuing the same prompts before measuring the ITL. Because the KV cache is large enough for all the requests, it retains the KV cache of the pre-computed prompts, so the prefill phase is skipped while the ITL is measured.
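
As a rough illustration of the decode sweep described above, the sketch below derives the largest number of in-flight requests that fit in the KV cache and picks a few concurrency levels between 1 and that maximum. The function name, the even spacing, the assumption that each request occupies up to ISL + OSL tokens, and the example cache size are all illustrative assumptions, not the profiler's actual implementation.

```python
def decode_concurrency_sweep(max_kv_tokens: int, isl: int, osl: int, num_points: int = 6) -> list[int]:
    """Return concurrency levels from 1 up to the KV-cache-limited maximum."""
    # Assumption: a request holds at most ISL + OSL tokens of KV cache.
    max_concurrency = max_kv_tokens // (isl + osl)
    if max_concurrency < 1:
        raise ValueError("KV cache cannot hold a single request of this length")
    if max_concurrency == 1 or num_points < 2:
        return [1]
    # Evenly spaced levels (an assumed spacing), deduplicated and sorted.
    step = (max_concurrency - 1) / (num_points - 1)
    levels = {1 + round(i * step) for i in range(num_points)}
    return sorted(levels)


# Hypothetical engine with room for 100,000 KV tokens, ISL=3000, OSL=150.
print(decode_concurrency_sweep(100_000, 3000, 150))
```

A real sweep might space the levels differently, for example more densely at low concurrency where the ITL changes fastest.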

### GPU Resource Usage

**Important**: Profiling tests different tensor parallelism (TP) configurations **sequentially**, not in parallel. This means:

- **One TP configuration at a time**: Each tensor parallelism size (TP1, TP2, TP4, TP8, etc.) is tested individually
- **Full GPU access**: Each TP configuration gets exclusive access to all available GPUs during its profiling run
- **Resource isolation**: No interference between different TP configurations during testing
- **Accurate measurements**: Each configuration is profiled under identical resource conditions

This sequential approach ensures:
- **Precise performance profiling** without resource conflicts
- **Consistent GPU allocation** for fair comparison across TP sizes
- **Reliable cleanup** between different TP configuration tests
- **Accurate SLA compliance verification** for each configuration
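
The sequential sweep can be pictured with the minimal sketch below. It assumes TP sizes are powers of two bounded by the number of available GPUs (matching the TP1, TP2, TP4, TP8 examples above); `candidate_tp_sizes` is a hypothetical helper, and the loop body only prints where a real run would deploy, measure, and clean up.

```python
def candidate_tp_sizes(num_gpus: int) -> list[int]:
    """Powers of two that fit on the available GPUs: 1, 2, 4, ..."""
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    sizes = []
    tp = 1
    while tp <= num_gpus:
        sizes.append(tp)
        tp *= 2
    return sizes


for tp in candidate_tp_sizes(8):
    # One configuration at a time: nothing else uses the GPUs during this step.
    print(f"profiling TP{tp} with exclusive access to the GPUs")
```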

After the profiling finishes, two plots will be generated in the `output-dir`. For example, here are the profiling results for `components/backends/vllm/deploy/disagg.yaml`:

![Prefill Performance](../../docs/images/h100_prefill_performance.png)
![Decode Performance](../../docs/images/h100_decode_performance.png)

For prefill, the script plots the TTFT for different TP sizes and selects the TP size that meets the target TTFT SLA and delivers the best throughput per GPU. Based on how close the TTFT of the selected TP size is to the SLA, the script also recommends upper and lower bounds for the prefill queue size to be used in the planner.

For decode, the script plots the ITL for different TP sizes and numbers of in-flight requests. Similarly, it selects the point that satisfies the ITL SLA and delivers the best throughput per GPU, and it recommends upper and lower bounds for the KV cache utilization rate to be used in the planner.

In summary, the script recommends the best TP size for prefill and decode and, if you use the load-based planner, the upper and lower bounds of the prefill queue size and the decode KV cache utilization rate. The following information is printed to the terminal:
```
2025-05-16 15:20:24 - __main__ - INFO - Analyzing results and generate recommendations...
2025-05-16 15:20:24 - __main__ - INFO - Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for prefill queue size: 0.24/0.10
2025-05-16 15:20:24 - __main__ - INFO - Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for decode kv cache utilization: 0.20/0.10
```
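
The selection rule described above (meet the latency SLA, then maximize throughput per GPU) can be sketched as follows. The `TpResult` type, the `select_best` function, and the measurements are made up for illustration; the profiler's own data structures may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TpResult:
    tp: int
    latency_ms: float    # TTFT for prefill, ITL for decode
    thpt_per_gpu: float  # tokens/s/GPU


def select_best(results: list[TpResult], sla_ms: float) -> Optional[TpResult]:
    """Among results that meet the SLA, return the one with the highest throughput per GPU."""
    feasible = [r for r in results if r.latency_ms <= sla_ms]
    if not feasible:
        return None  # no configuration satisfies the SLA
    return max(feasible, key=lambda r: r.thpt_per_gpu)


# Made-up prefill measurements, for illustration only.
prefill = [
    TpResult(tp=1, latency_ms=310.0, thpt_per_gpu=9600.0),
    TpResult(tp=2, latency_ms=160.0, thpt_per_gpu=9300.0),
    TpResult(tp=4, latency_ms=85.0, thpt_per_gpu=8800.0),
]
best = select_best(prefill, sla_ms=200.0)
print(best)
```

With these numbers TP1 has the highest throughput per GPU but misses the 200 ms target, so the sketch falls back to the best configuration among those that meet it.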

After finding the best TP size for prefill and decode, the script interpolates TTFT against ISL, and ITL against active KV cache usage and decode context length. This gives a more accurate performance estimate when the ISL and OSL change, and the SLA planner uses the result. The results are saved to `<output_dir>/<decode/prefill>_tp<best_tp>_interpolation`. Update the prefill and decode TP sizes in the config file to match the best TP sizes reported by the profiling script.

### Prefill Interpolation Data

In the prefill engine, prefills are usually done with a batch size of 1, so only the ISL (excluding prefix cache hits) affects the iteration time. The script profiles the selected prefill TP configuration across different ISLs and records the TTFT and prefill throughput per GPU at each ISL.

For dense models, the script profiles different TP sizes.
For MoE models, the script only profiles different TEP sizes, since DEP is generally not the optimal prefill configuration.
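
To make the prefill interpolation concrete, the sketch below fits a quadratic to TTFT as a function of ISL with `numpy.polyfit`, in line with the quadratic fit mentioned for the TTFT interpolation plot later in this document. The data points are synthetic stand-ins, and `estimate_ttft_ms` is a hypothetical helper, not part of the profiler or the SLA planner.

```python
import numpy as np

# Synthetic stand-ins for the profiled ISLs and TTFTs (milliseconds).
prefill_isl = np.array([500, 1000, 2000, 4000, 8000], dtype=float)
prefill_ttft = 5.0 + 0.01 * prefill_isl + 1e-6 * prefill_isl**2

# Least-squares quadratic fit: TTFT ~ a * ISL^2 + b * ISL + c.
coeffs = np.polyfit(prefill_isl, prefill_ttft, deg=2)


def estimate_ttft_ms(isl: float) -> float:
    """Evaluate the fitted quadratic at the given ISL."""
    return float(np.polyval(coeffs, isl))


print(f"estimated TTFT at ISL=3000: {estimate_ttft_ms(3000):.2f} ms")
```

With real measurements the fit will not be exact, so it is worth checking the residuals before relying on estimates far outside the profiled ISL range.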

### Decode Interpolation Data

In the decode engine, decode requests are added in flight, and the iteration time (or ITL) depends on both the context length and the real-time load of the engine. We capture the real-time load with two quantities: the active KV usage and the average context length. The active KV usage determines the cost of the memory-bound attention kernel, while the active KV usage divided by the average context length (which is proportional to the number of in-flight requests) determines the cost of the compute-bound MLP kernel. For example, the figure below shows the ITL of the DS-Distilled Llama 8B model on H100 with TP4. The ITL grows near-linearly with active KV usage under a fixed context length, and the slope increases as the context length decreases.

For dense models, the script profiles different TP sizes.
For MoE models, the script profiles different DEP sizes. TEP decode engines for low latency will be supported in the future.

![images](../../docs/images/itl_interpolation.png)

The script profiles the selected decode TP configuration across different active kv blocks and average context length.
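
One way to relate these quantities is sketched below, assuming the active KV tokens are approximated as `(ISL + OSL / 2) * concurrency` and the KV usage is that number divided by the engine's total KV tokens. Both function names and the example numbers are hypothetical.

```python
def active_kv_usage(concurrency: int, max_kv_tokens: int, isl: int, osl: int) -> float:
    """Fraction of the KV cache that is active for a given number of in-flight requests."""
    avg_context_length = isl + osl / 2  # average context length over the decode phase
    return avg_context_length * concurrency / max_kv_tokens


def concurrency_for_kv_usage(target_kv_usage: float, max_kv_tokens: int, isl: int, osl: int) -> int:
    """Number of in-flight requests that yields roughly the target active KV usage."""
    if not 0.0 < target_kv_usage <= 1.0:
        raise ValueError("target_kv_usage must be in (0, 1]")
    avg_context_length = isl + osl / 2
    return max(1, int(target_kv_usage * max_kv_tokens / avg_context_length))


# Hypothetical engine with 100,000 KV tokens, ISL=3000, OSL=150.
n = concurrency_for_kv_usage(0.5, 100_000, 3000, 150)
print(n, active_kv_usage(n, 100_000, 3000, 150))
```

Because concurrency is an integer, the achieved usage is generally slightly below the target, so the achieved value is the one to record.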

### Output Format of Interpolation Data

After suggesting the optimal TP configuration, the script generates two `.npz` files that describe the performance characteristics of the prefill and decode engines in their suggested parallel configurations:
* `${benchmark_result_dir}/selected_prefill_interpolation/raw_data.npz`
  * `prefill_isl`: a 1D NumPy array of the ISLs used to profile the prefill engine.
  * `prefill_ttft`: a 1D NumPy array of the TTFTs, in milliseconds, at the corresponding ISLs when the prefill engine runs each prefill request exclusively (i.e., with a batch size of 1).
  * `prefill_thpt_per_gpu`: a 1D NumPy array of the prefill throughput per GPU, in tokens per second per GPU, at the corresponding ISLs.
* `${benchmark_result_dir}/selected_decode_interpolation/raw_data.npz`
  * `max_kv_tokens`: a 1D NumPy array with a single element holding the total number of KV tokens in the decode engine.
  * `x_kv_usage`: a 1D NumPy array of the fraction of active KV blocks (in the range [0, 1]) used to profile the decode engine. The number of active KV blocks can be controlled by varying `(ISL + OSL / 2) * concurrency`.
  * `y_context_length`: a 1D NumPy array of the average context lengths (ISL + OSL / 2) used to profile the decode engine.
  * `z_itl`: a 1D NumPy array of the ITLs, in milliseconds, at the corresponding active KV usage and context length. To skip the prefill stage while maintaining the context length, turn on KV reuse and warm up the engine with the prompts before running the actual profiling.
  * `z_thpt_per_gpu`: a 1D NumPy array of the decode throughput per GPU, in tokens per second per GPU, at the corresponding active KV usage and context length.

SLA planner can work with any interpolation data that follows the above format. For best results, use fine-grained and high coverage interpolation data for the prefill and decode engines.
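
If you produce interpolation data yourself, a quick structural check along the lines below may help catch mistakes early. It is a minimal sketch: `check_npz` is a hypothetical helper, the requirement that the per-sample arrays have equal length is an assumption inferred from the word "corresponding" above, and the file written here contains synthetic values.

```python
import os
import tempfile

import numpy as np

PREFILL_KEYS = {"prefill_isl", "prefill_ttft", "prefill_thpt_per_gpu"}
DECODE_KEYS = {"max_kv_tokens", "x_kv_usage", "y_context_length", "z_itl", "z_thpt_per_gpu"}


def check_npz(path: str, required_keys: set[str]) -> None:
    """Raise ValueError if keys are missing or the arrays are not 1D and equal length."""
    with np.load(path) as data:
        missing = required_keys - set(data.files)
        if missing:
            raise ValueError(f"{path} is missing keys: {sorted(missing)}")
        arrays = {k: data[k] for k in required_keys}
    if any(a.ndim != 1 for a in arrays.values()):
        raise ValueError("all arrays must be 1D")
    # max_kv_tokens is a single-element array, so it is excluded from the length check.
    lengths = {len(a) for k, a in arrays.items() if k != "max_kv_tokens"}
    if len(lengths) > 1:
        raise ValueError("per-sample arrays must have the same length")


# Write a tiny synthetic decode file and validate it.
tmp_dir = tempfile.mkdtemp()
decode_path = os.path.join(tmp_dir, "raw_data.npz")
np.savez(
    decode_path,
    max_kv_tokens=np.array([100_000]),
    x_kv_usage=np.array([0.1, 0.5, 0.9]),
    y_context_length=np.array([3075.0, 3075.0, 3075.0]),
    z_itl=np.array([4.0, 6.0, 8.0]),
    z_thpt_per_gpu=np.array([60.0, 200.0, 270.0]),
)
check_npz(decode_path, DECODE_KEYS)
print("decode file OK")
```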


## Detailed Kubernetes Profiling Instructions

> [!TIP]
> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md).

This section provides detailed technical information for advanced users who need to customize the profiling process.

### Configuration Options

**For dense models**, configure `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml`:

```yaml
spec:
  template:
    spec:
      containers:
        - name: profile-sla
          args:
            - --isl
            - "3000" # average ISL is 3000 tokens
            - --osl
            - "150" # average OSL is 150 tokens
            - --ttft
            - "200" # target TTFT is 200ms (float, in milliseconds)
            - --itl
            - "20" # target ITL is 20ms (float, in milliseconds)
            - --backend
            - <vllm/sglang>
```

**For MoE models**, use `profile_sla_moe_job.yaml` with TEP/DEP configuration instead.

### Auto-Configuration

To automatically configure the profiling job based on the hardware and model information, supply the `--model` argument to the profiling script. The following arguments will be automatically set:
- `--config`: will use the default config file (`components/backends/<backend>/deploy/disagg.yaml`) with model updated to the provided model name
- `--min-num-gpus-per-engine`: will be set to the minimum number of GPUs per engine based on the model size and hardware information
- `--max-num-gpus-per-engine`: will be set to the maximum number of GPUs per engine based on the model size and hardware information
- `--num-gpus-per-node`: will be set to the number of GPUs per node based on the hardware information
- `--is-moe-model`: will be set based on the HF config file
- `--max-context-length`: will be set to the maximum context length supported by the model based on the HF config file

### Advanced Configuration

- **Model caching**: For large models, create a multi-attach PVC to cache the model. See [recipes](../../recipes/README.md) for details.
- **Custom disaggregated configurations**: Use the manifest injector to place custom DGD configurations in the PVC.
- **Planner Config Passthrough**: To specify custom planner configurations (e.g., `adjustment-interval` or `load-predictor`) in the generated or deployed DGD config, add a `planner-` prefix to the argument. For example, to specify `--adjustment-interval=60` in SLA planner, add `--planner-adjustment-interval=60` arg to the profiling job.
- **Resource allocation**: Modify the job YAML to adjust GPU and memory requirements.
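
The `planner-` prefix convention above amounts to a simple renaming, which the sketch below illustrates. `extract_planner_args` is a hypothetical function written for this example; it only handles the `--key=value` form and is not the profiler's actual argument handling.

```python
def extract_planner_args(profiler_args: list[str], prefix: str = "--planner-") -> list[str]:
    """Return the arguments carrying the planner prefix, with the prefix stripped."""
    planner_args = []
    for arg in profiler_args:
        if arg.startswith(prefix):
            # "--planner-adjustment-interval=60" becomes "--adjustment-interval=60".
            planner_args.append("--" + arg[len(prefix):])
    return planner_args


args = ["--isl", "3000", "--planner-adjustment-interval=60"]
print(extract_planner_args(args))
```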

### Viewing Profiling Results

After the profiling job completes successfully, the results are stored in the persistent volume claim (PVC) created during Step 2.

To download the results:

```bash
# Download to directory
python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results

# Download without any of the auto-created config.yaml files used in profiling
python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results --no-config
```

The script will:
* Deploy a temporary access pod
* Download all files maintaining directory structure
* Clean the pod up automatically

#### File Structure

The profiling results directory contains the following structure:
```
/workspace/data/profiling_results/
├── prefill_performance.png                    # Main prefill performance plot
├── decode_performance.png                     # Main decode performance plot
├── prefill_tp1/                               # Individual TP profiling directories
...
├── decode_tp1/
...
├── selected_prefill_interpolation/
│   ├── raw_data.npz                           # Prefill interpolation data
│   ├── prefill_ttft_interpolation.png         # TTFT vs ISL plot
│   └── prefill_throughput_interpolation.png   # Throughput vs ISL plot
├── selected_decode_interpolation/
│   ├── raw_data.npz                           # Decode interpolation data
│   └── decode_tp{best_tp}.png                 # 3D ITL surface plot
└── config_with_planner.yaml                   # Generated DGD config with planner
```

#### Viewing Performance Plots

The profiling generates several performance visualization files:

**Main Performance Plots:**
- **`prefill_performance.png`**: Shows TTFT (Time To First Token) performance across different tensor parallelism (TP) sizes
- **`decode_performance.png`**: Shows ITL (Inter-Token Latency) performance across different TP sizes and in-flight request counts

**Interpolation Plots:**
- **`selected_prefill_interpolation/prefill_ttft_interpolation.png`**: TTFT vs Input Sequence Length with quadratic fit
- **`selected_prefill_interpolation/prefill_throughput_interpolation.png`**: Prefill throughput vs Input Sequence Length
- **`selected_decode_interpolation/decode_tp{best_tp}.png`**: 3D surface plot showing ITL vs KV usage and context length

#### Understanding the Data Files

The `.npz` files contain raw profiling data that can be loaded and analyzed using Python:

```python
import numpy as np

# Load prefill data
prefill_data = np.load('selected_prefill_interpolation/raw_data.npz')
print("Prefill data keys:", list(prefill_data.keys()))

# Load decode data
decode_data = np.load('selected_decode_interpolation/raw_data.npz')
print("Decode data keys:", list(decode_data.keys()))
```
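
Once loaded, the decode arrays can be queried in many ways. The sketch below uses a plain nearest-neighbor lookup over (KV usage, context length) as the simplest possible example; this is a stand-in technique chosen for brevity, not the interpolation method used by SLA planner, and the arrays here are synthetic. With real data, pass `decode_data["x_kv_usage"]`, `decode_data["y_context_length"]`, and `decode_data["z_itl"]` instead.

```python
import numpy as np


def nearest_itl_ms(x_kv_usage, y_context_length, z_itl, kv_usage: float, context_length: float) -> float:
    """Return the ITL of the profiled point closest to (kv_usage, context_length)."""
    x = np.asarray(x_kv_usage, dtype=float)
    y = np.asarray(y_context_length, dtype=float)
    # Normalize context length so both axes contribute comparably to the distance.
    y_scale = y.max() if y.max() > 0 else 1.0
    dist = (x - kv_usage) ** 2 + ((y - context_length) / y_scale) ** 2
    return float(np.asarray(z_itl)[np.argmin(dist)])


# Synthetic stand-in for the decode data (ITL in milliseconds).
x = np.array([0.1, 0.5, 0.9, 0.1, 0.5, 0.9])
y = np.array([1000, 1000, 1000, 4000, 4000, 4000])
z = np.array([4.0, 7.0, 10.0, 3.5, 5.0, 6.5])
print(nearest_itl_ms(x, y, z, kv_usage=0.45, context_length=3500))
```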

### Troubleshooting

#### Image Pull Authentication Errors

If you see `ErrImagePull` or `ImagePullBackOff` errors with 401 unauthorized messages:

1. Ensure the `nvcr-imagepullsecret` exists in your namespace:
   ```bash
   kubectl get secret nvcr-imagepullsecret -n $NAMESPACE
   ```

2. Verify the service account was created with the image pull secret:
   ```bash
   kubectl get serviceaccount dgdr-profiling-job -n $NAMESPACE -o yaml
   ```

3. The service account should show `imagePullSecrets` containing `nvcr-imagepullsecret`.

If it doesn't, create the secret in the same namespace:

```bash
export NGC_API_KEY=<your-ngc-api-key-here>
kubectl create secret docker-registry nvcr-imagepullsecret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password="$NGC_API_KEY" -n $NAMESPACE
```


## Running the Profiling Script with AI Configurator

> [!NOTE]
> **TensorRT-LLM Only**: AI Configurator currently supports TensorRT-LLM only. Support for vLLM and SGLang is coming soon.

The profiling script can be run much faster using AI Configurator to estimate performance numbers instead of running real Dynamo deployments. This completes profiling in 20-30 seconds using performance simulation.

**Advantages** of `--use-ai-configurator`:
* Script completes in seconds rather than hours
* No Kubernetes or GPU access required
* Ideal for rapid prototyping and testing

**Disadvantages**:
* Estimated performance may contain errors, especially for out-of-distribution input dimensions
* Limited list of supported models, systems, and backends
* Less accurate than real deployment profiling

### Prerequisites

Install AI Configurator:
```bash
pip install aiconfigurator
```

If using a local environment, also install:
```bash
pip install -r deploy/utils/requirements.txt
```

### Check Support Matrix

View supported models, systems, and backends:
```bash
aiconfigurator cli --help
```

**Supported configurations:**
```
Models: GPT_7B, GPT_13B, GPT_30B, GPT_66B, GPT_175B, LLAMA2_7B, LLAMA2_13B, LLAMA2_70B, LLAMA3.1_8B, LLAMA3.1_70B, LLAMA3.1_405B, MOE_Mixtral8x7B, MOE_Mixtral8x22B, DEEPSEEK_V3, KIMI_K2, QWEN2.5_1.5B, QWEN2.5_7B, QWEN2.5_32B, QWEN2.5_72B, QWEN3_32B, QWEN3_235B, QWEN3_480B, Nemotron_super_v1.1

Systems: h100_sxm, h200_sxm

Backends: trtllm (vllm and sglang support coming soon)
```

### Running Fast Profiling

Example command for TensorRT-LLM:
```bash
# --aic-backend is optional and falls back to --backend if not provided.
# --ttft and --itl are the target TTFT and ITL in milliseconds (float).
python3 -m benchmarks.profiler.profile_sla \
   --config ./components/backends/trtllm/deploy/disagg.yaml \
   --backend trtllm \
   --use-ai-configurator \
   --aic-system h200_sxm \
   --aic-model-name QWEN3_32B \
   --aic-backend trtllm \
   --aic-backend-version 0.20.0 \
   --isl 3000 \
   --osl 150 \
   --ttft 200 \
   --itl 20
```

The output will be written to `./profiling_results/` and can be used directly with SLA planner deployment.