# GPU Memory Parameters by Engine

How vLLM, sglang, and TensorRT-LLM interpret memory-related parameters, and how
to estimate total GPU VRAM usage for each.

---

## Quick Reference

| Parameter | vLLM | sglang | TensorRT-LLM |
|---|---|---|---|
| Memory fraction | `--gpu-memory-utilization` | `--mem-fraction-static` | `free_gpu_memory_fraction` (YAML/override) |
| Fraction base | Total VRAM | Total VRAM | Free VRAM (after model load) |
| Default fraction | 0.90 | 0.90 | 0.90 |
| Max sequence length | `--max-model-len` | `--context-length` | `max_seq_len` (YAML/override) |
| KV cache size override | `--kv-cache-memory-bytes` | N/A | `max_gpu_total_bytes` (broken in 1.3.0rc5) |

---

## 1. vLLM

### How `--gpu-memory-utilization` works

This is a fraction of **total** GPU VRAM. The engine budgets everything within
this limit:

```
budget = total_vram * gpu_memory_utilization

KV cache = budget - model_weights - peak_activations - framework_overhead
```

At startup, vLLM profiles actual model weight and activation memory, then
pre-allocates the remaining budget as KV cache blocks. The KV pool size is fixed
for the lifetime of the engine.
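
As a rough worked example of the budget arithmetic above, the sketch below plugs in illustrative numbers. The function name and all input values (48 GiB GPU, 14 GiB weights, 2 GiB activations, 1 GiB overhead) are assumptions for illustration; vLLM measures the real values during its startup profiling run.

```python
def vllm_kv_cache_gib(total_vram_gib, gpu_memory_utilization,
                      model_weights_gib, peak_activations_gib,
                      framework_overhead_gib):
    """KV cache left once everything else is carved out of the budget."""
    budget = total_vram_gib * gpu_memory_utilization
    kv = (budget - model_weights_gib - peak_activations_gib
          - framework_overhead_gib)
    if kv <= 0:
        raise ValueError(f"budget of {budget:.1f} GiB leaves no room for KV cache")
    return kv

# Illustrative inputs only: 48 GiB GPU, default fraction 0.90.
kv = vllm_kv_cache_gib(48, 0.90, 14, 2, 1)
print(f"{kv:.1f} GiB")  # 43.2 GiB budget - 17 GiB = 26.2 GiB
```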

### How `--max-model-len` works

Sets the maximum total sequence length (input + output tokens). Longer sequences
require more KV cache per request. If the requested `max-model-len` needs more
KV cache than the budget allows, vLLM errors at startup:

```
ValueError: ... X GiB KV cache is needed, which is larger than the available
KV cache memory (Y GiB). ...
```

Reducing `--max-model-len` is the most effective way to reduce VRAM when the
model fits but the KV cache doesn't.
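
To get a feel for how strongly `--max-model-len` drives the KV requirement, the sketch below computes the KV cache one full-length sequence needs and the longest length a given KV budget can hold. It assumes the startup check is about fitting at least one full-length sequence, and it uses the 128 KiB per-token figure for Llama-3.1-8B in BF16; the function names are hypothetical.

```python
GIB = 1024 ** 3

def kv_needed_gib(per_token_bytes, max_model_len):
    """KV cache one sequence of max_model_len tokens would occupy."""
    return per_token_bytes * max_model_len / GIB

def longest_len_that_fits(per_token_bytes, kv_available_gib):
    """Largest max_model_len (in tokens) a KV budget can hold for one sequence."""
    return int(kv_available_gib * GIB) // per_token_bytes

per_token = 128 * 1024  # bytes per token, Llama-3.1-8B in BF16

print(kv_needed_gib(per_token, 131072))       # 16.0
print(longest_len_that_fits(per_token, 4.0))  # 32768
```

With only 4 GiB of KV cache available, a 131072-token limit would fail the check, while anything up to 32768 tokens would pass under these assumptions.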

### How `--kv-cache-memory-bytes` works

When set, this overrides the automatic KV cache sizing from
`gpu-memory-utilization`. The engine allocates exactly this many bytes for KV
cache regardless of the fraction. This means `gpu-memory-utilization` still
controls the *overall* VRAM budget (and thus whether the model fits), but the
KV cache portion is pinned to the explicit byte value.

Consequence for profiling: if a script uses `--kv-cache-memory-bytes`,
changing `DYN_GPU_MEMORY_FRACTION_OVERRIDE` (which maps to
`--gpu-memory-utilization`) won't change the KV cache size, only the leftover
headroom for activations and overhead.

### Estimating total GPU usage

```
total_vram ≈ model_weights + kv_cache + activations + overhead

model_weights ≈ num_params * bytes_per_param
                (e.g. 7B * 2 bytes for BF16 = 14 GB ≈ 13 GiB)

kv_cache_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
                     (the factor of 2 is for K and V tensors)

kv_cache_total = kv_cache_per_token * max_model_len * max_concurrent_seqs

overhead ≈ engine-dependent (auto-computed by estimate_worker_vram):
           vllm:   1.2 + 1.0 * sqrt(params_b) GiB  (0.6B≈2.0, 8B≈4.0)
           sglang: 2.5 + 1.5 * sqrt(params_b) GiB  (0.6B≈3.7, 8B≈6.7)
           trtllm: 2.0 + 1.2 * sqrt(params_b) GiB  (0.6B≈2.9, 8B≈5.4)
```

Rule of thumb: set `gpu-memory-utilization` so that
`total_vram * fraction >= model_weights + 2 GiB`. The rest becomes KV cache.
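
The overhead formulas above are easy to evaluate by hand. The sketch below is a plain Python restatement of them, not the actual `estimate_worker_vram` code; the dictionary and function names are made up for illustration.

```python
import math

# (base GiB, coefficient on sqrt(params in billions)), from the formulas above.
OVERHEAD_COEFFS = {
    "vllm": (1.2, 1.0),
    "sglang": (2.5, 1.5),
    "trtllm": (2.0, 1.2),
}

def overhead_gib(engine, params_b):
    """Estimated engine overhead in GiB for a model with params_b billion params."""
    base, coeff = OVERHEAD_COEFFS[engine]
    return base + coeff * math.sqrt(params_b)

for engine in OVERHEAD_COEFFS:
    print(engine, round(overhead_gib(engine, 0.6), 1), round(overhead_gib(engine, 8), 1))
# vllm 2.0 4.0
# sglang 3.7 6.7
# trtllm 2.9 5.4
```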

---

## 2. sglang

### How `--mem-fraction-static` works

Like vLLM, this is a fraction of **total** GPU VRAM:

```
budget = total_vram * mem_fraction_static

KV cache pool = budget - model_weights
```

The budget covers model weights and the KV cache pool. Activations and CUDA
graph buffers are allocated *outside* this budget from the remaining VRAM.
This is slightly different from vLLM (which includes activations in the budget).

sglang recommends keeping 5-8 GiB free for activations and overhead. If you
see OOM errors, decrease `--mem-fraction-static` in steps of 0.01-0.05.
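
A quick way to sanity-check a candidate fraction is to compute both the KV pool it yields and the headroom it leaves outside the budget. The sketch below does that with illustrative inputs (48 GiB GPU, 15 GiB of weights); the function name is hypothetical.

```python
def sglang_split(total_vram_gib, mem_fraction_static, model_weights_gib):
    """Return (kv_pool_gib, headroom_gib) for a given static fraction."""
    budget = total_vram_gib * mem_fraction_static
    kv_pool = budget - model_weights_gib
    # Everything outside the budget is what activations and CUDA graphs can use.
    headroom = total_vram_gib - budget
    return kv_pool, headroom

kv_pool, headroom = sglang_split(48, 0.85, 15)
print(f"KV pool {kv_pool:.1f} GiB, headroom {headroom:.1f} GiB")
# KV pool 25.8 GiB, headroom 7.2 GiB
```

Here the 7.2 GiB of headroom falls inside the 5-8 GiB range mentioned above; at 0.90 it would drop to 4.8 GiB.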

### How `--context-length` works

Equivalent to vLLM's `--max-model-len`. Defaults to the model's native context
window. Reducing it shrinks the per-request KV cache requirement and allows more
concurrent sequences.

### Estimating total GPU usage

```
total_vram ≈ model_weights + kv_cache_pool + activations_and_overhead

kv_cache_pool = total_vram * mem_fraction_static - model_weights

activations_and_overhead ≈ 1-8 GiB (depends on model size, batch size, seq len;
                           ~1-2 GiB for small models like 0.6B,
                           ~5-8 GiB for larger models like 8B+ with CUDA graphs)
```

---

## 3. TensorRT-LLM

### How `free_gpu_memory_fraction` works

This is a fraction of **free** VRAM (not total). The engine:

1. Loads model weights and builds the TRT engine (fixed cost).
2. Queries remaining free GPU memory.
3. Allocates `free_memory * free_gpu_memory_fraction` for the KV cache pool.

```
kv_cache = free_vram_after_model_load * free_gpu_memory_fraction
```

This means the same fraction yields different absolute KV cache sizes depending
on how much VRAM the model consumed. A 5 GiB model on a 48 GiB GPU leaves
~43 GiB free; fraction=0.24 gives ~10 GiB KV cache. A 30 GiB model leaves
~18 GiB free; fraction=0.24 gives only ~4 GiB.
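
The two cases above can be reproduced with a few lines of arithmetic. This is a simplified sketch: it treats free VRAM as total minus model size and ignores the CUDA context and other allocations, and the function name is hypothetical.

```python
def trtllm_kv_cache_gib(total_vram_gib, model_gib, free_gpu_memory_fraction):
    """KV cache size when the fraction applies to VRAM left after model load."""
    free_after_load = total_vram_gib - model_gib
    return free_after_load * free_gpu_memory_fraction

print(round(trtllm_kv_cache_gib(48, 5, 0.24), 1))   # 10.3
print(round(trtllm_kv_cache_gib(48, 30, 0.24), 1))  # 4.3
```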

Set it in the YAML config, or override it via the CLI flag or env var:

```bash
--override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'
DYN_TRTLLM_OVERRIDE_ENGINE_ARGS='{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'
```

### How `max_seq_len` works

Maximum total sequence length. Defaults to the model's native context.
Sequences exceeding this limit are rejected at runtime.

**VRAM impact: none (PyTorch backend).** Reducing `max_seq_len` from 40960 to
2048 had zero effect on total VRAM or KV cache size in testing (Qwen3-0.6B,
trtllm 1.3.0rc5). The PyTorch backend does not pre-allocate internal buffers
proportional to `max_seq_len`; KV cache size is determined solely by
`free_gpu_memory_fraction`. This differs from vLLM/sglang, where reducing
context length measurably reduces memory.

Override via:

```bash
--override-engine-args '{"max_seq_len": 4096}'
```

### Override gotcha: sub-dict replacement

Overriding any field inside `kv_cache_config` **replaces the entire sub-dict**.
If your YAML has `enable_block_reuse: true` and you override only
`free_gpu_memory_fraction`, you lose `enable_block_reuse`. Always re-include
all fields you need:

```json
{"kv_cache_config": {"free_gpu_memory_fraction": 0.15, "enable_block_reuse": true}}
```
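
The effect is the same as a top-level dictionary update, where a nested dict in the override replaces the nested dict in the base instead of being merged into it. The sketch below mimics that observed behavior with plain Python dicts; it is an illustration, not the actual override code, and the config values are examples.

```python
import copy

yaml_config = {
    "kv_cache_config": {"free_gpu_memory_fraction": 0.85, "enable_block_reuse": True},
    "max_seq_len": 4096,
}

# Override that names only one field of the sub-dict.
partial_override = {"kv_cache_config": {"free_gpu_memory_fraction": 0.15}}
lossy = copy.deepcopy(yaml_config)
lossy.update(partial_override)  # top-level keys are replaced wholesale
print(lossy["kv_cache_config"])
# {'free_gpu_memory_fraction': 0.15}  (enable_block_reuse is gone)

# Override that restates every field it needs.
full_override = {
    "kv_cache_config": {"free_gpu_memory_fraction": 0.15, "enable_block_reuse": True}
}
safe = copy.deepcopy(yaml_config)
safe.update(full_override)
print(safe["kv_cache_config"])
# {'free_gpu_memory_fraction': 0.15, 'enable_block_reuse': True}
```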

### How `max_num_tokens` works

Maximum batched input tokens per iteration. Primarily a throughput knob.

**VRAM impact: none.** Reducing it from 8192 to 256 had no measurable effect
on total VRAM (41,465 MiB at baseline vs 41,643 MiB at 256, within noise). The
slight *increase* is because the smaller activation footprint lets the fraction
claim marginally more KV cache.

### `max_gpu_total_bytes` (broken)

Intended as an absolute byte cap for KV cache. As of trtllm 1.3.0rc5, this
field is **ignored**. Setting 5 GiB cap with `free_gpu_memory_fraction=0.95`
still allocated ~42 GiB of KV cache. Setting `free_gpu_memory_fraction=0.0`
with only `max_gpu_total_bytes` causes `"Impossible to fit any sequence in
kvCache"`. Do not rely on this field.

### Override precedence

```
--override-engine-args JSON  >  --extra-engine-args YAML  >  CLI flags
```

The `DYN_TRTLLM_OVERRIDE_ENGINE_ARGS` env var is equivalent to
`--override-engine-args`. It avoids shell quoting issues in scripts whose
argument parsers consume unknown flags before forwarding `"$@"`.

### Estimating total GPU usage

```
total_vram ≈ model_weights + engine_overhead + kv_cache

model_weights ≈ num_params * bytes_per_param / tensor_parallel_size
engine_overhead ≈ 2.0 + 1.2 * sqrt(params_b) GiB  (CUDA context + TRT buffers + activations)
kv_cache = free_vram_after_model_load * free_gpu_memory_fraction
```

Engine overhead is auto-computed by `estimate_worker_vram` when called with the
`trtllm` engine name.  Examples: 0.6B → 2.9 GiB, 8B → 5.4 GiB, 30B → 8.6 GiB.

### Empirical validation (Qwen3-0.6B, RTX 6000 Ada 48 GiB, trtllm 1.3.0rc5)

Controlled test: single worker via agg.sh, one override at a time.

| # | Override | Total VRAM | KV Cache | Tokens |
|---|---------|-----------|----------|--------|
| 1 | Baseline (YAML frac=0.85) | 41,465 MiB | 38.04 GiB | 356,160 |
| 2 | `free_gpu_memory_fraction=0.15` | 9,383 MiB | 6.71 GiB | 62,848 |
| 3 | `max_num_tokens=256` | 41,643 MiB | 38.26 GiB | 358,208 |
| 4 | `max_seq_len=4096` | 41,469 MiB | 38.05 GiB | 356,192 |
| 5 | `max_seq_len=2048` | 41,469 MiB | 38.05 GiB | 356,192 |
| 6 | seq=4096 + frac=0.15 | 9,383 MiB | 6.71 GiB | 62,848 |
| 7 | tokens=256 + seq=4096 + frac=0.15 | 9,377 MiB | 6.75 GiB | 63,200 |

**Conclusion:** `free_gpu_memory_fraction` is the **sole effective knob** for
trtllm VRAM control. Neither `max_seq_len` nor `max_num_tokens` reduces memory.
Combining overrides (test 7) produces no additional benefit over the fraction
alone (test 2).
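
As a consistency check on the table, dividing the reported KV cache size by the reported token count should give the per-token KV footprint of the model. The sketch below does this for tests 1, 2, and 7, using only the numbers in the table.

```python
GIB = 1024 ** 3
KIB = 1024

# test number -> (KV cache in GiB, token capacity), copied from the table.
rows = {1: (38.04, 356_160), 2: (6.71, 62_848), 7: (6.75, 63_200)}

per_token_kib = {
    test: kv_gib * GIB / tokens / KIB for test, (kv_gib, tokens) in rows.items()
}
for test, value in per_token_kib.items():
    print(f"test {test}: {value:.1f} KiB/token")
# Each row comes out at about 112.0 KiB/token.
```

All three rows land on roughly 112 KiB per token, which is the BF16 figure for Qwen3-0.6B (28 layers, 8 KV heads, head dim 128), so the token counts scale with the KV cache size as expected.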

---

## Why vLLM/sglang fractions are NOT interchangeable with TensorRT-LLM

Consider wanting 10 GiB of KV cache on a 48 GiB GPU with a 5 GiB model:

| Engine | Fraction meaning | Calculation | Result |
|---|---|---|---|
| vLLM | 10/48 = 0.21 of total | `48 * 0.21 = 10 GiB` budget (minus 5 GiB model = 5 GiB KV) | Wrong: need a higher fraction |
| sglang | Same as vLLM | Same math | Same problem |
| TensorRT-LLM | 10/43 = 0.23 of free | `43 * 0.23 = 10 GiB` KV cache | Correct |

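For vLLM/sglang, you need at least `(model + kv) / total = (5 + 10) / 48 ≈ 0.31`
to get 10 GiB of KV cache with a 5 GiB model. vLLM needs a slightly higher value
still, because its budget also has to cover activations and framework overhead.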
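
The two conversions can be written out as follows. This is a sketch of the arithmetic only, with hypothetical function names; it is not the implementation of the `gpu_utils.sh` helpers. The `extra_gib` parameter is an assumption added to account for activations and overhead that vLLM places inside its budget.

```python
def total_fraction_for_kv(total_gib, model_gib, kv_gib, extra_gib=0.0):
    """vLLM/sglang style: fraction of total VRAM covering weights + KV (+ extra)."""
    return (model_gib + kv_gib + extra_gib) / total_gib

def free_fraction_for_kv(total_gib, model_gib, kv_gib):
    """TensorRT-LLM style: fraction of the VRAM left after the model loads."""
    return kv_gib / (total_gib - model_gib)

# 10 GiB of KV cache, 5 GiB model, 48 GiB GPU.
print(f"{total_fraction_for_kv(48, 5, 10):.4f}")  # 0.3125
print(f"{free_fraction_for_kv(48, 5, 10):.4f}")   # 0.2326
```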

The helper functions in `gpu_utils.sh` handle these differences:
- `gpu_gb_to_total_fraction`: for vLLM/sglang (fraction of total VRAM)
- `gpu_gb_to_free_fraction`: for TensorRT-LLM (fraction of free VRAM)
- `gpu_worker_fraction <engine>`: unified wrapper — reads `_EW_*` vars from
  `estimate_worker_vram` and calls the right function for the engine.

Launch scripts use `gpu_worker_fraction` so they all follow the same pattern:

```bash
estimate_worker_vram "$MODEL" "$SEQ_LEN" "$CONCURRENCY" trtllm
GPU_MEM_FRACTION=$(gpu_worker_fraction trtllm)
```

---

## KV Cache Memory Per Token

The formula for KV cache memory per token is the same across all engines:

```
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
```

| Model | Layers | KV Heads | Head Dim | Dtype | Per Token |
|---|---|---|---|---|---|
| Qwen3-0.6B | 28 | 8 | 128 | BF16 | 112 KiB |
| Llama-3.1-8B | 32 | 8 | 128 | BF16 | 128 KiB |
| Llama-3.1-70B | 80 | 8 | 128 | BF16 | 320 KiB |
| Qwen2.5-VL-7B | 28 | 4 | 128 | BF16 | 56 KiB |

To estimate KV cache for a given context length:

```
kv_cache_gib = kv_bytes_per_token * max_model_len * max_concurrent_seqs / (1024^3)
```
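
The per-token values in the table and the context-length estimate can be reproduced with a short script. The function names are illustrative, and the final call uses made-up values for `max_model_len` and `max_concurrent_seqs`.

```python
KIB = 1024
GIB = 1024 ** 3

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    # Factor of 2 covers the separate K and V tensors; BF16 is 2 bytes/element.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def kv_cache_gib(per_token_bytes, max_model_len, max_concurrent_seqs):
    return per_token_bytes * max_model_len * max_concurrent_seqs / GIB

# (layers, KV heads, head dim) from the table above.
models = {
    "Qwen3-0.6B": (28, 8, 128),
    "Llama-3.1-8B": (32, 8, 128),
    "Llama-3.1-70B": (80, 8, 128),
    "Qwen2.5-VL-7B": (28, 4, 128),
}
for name, shape in models.items():
    print(name, kv_bytes_per_token(*shape) // KIB, "KiB")
# Qwen3-0.6B 112 KiB
# Llama-3.1-8B 128 KiB
# Llama-3.1-70B 320 KiB
# Qwen2.5-VL-7B 56 KiB

# Llama-3.1-8B with 4 concurrent sequences of 8192 tokens each.
print(kv_cache_gib(kv_bytes_per_token(32, 8, 128), 8192, 4))  # 4.0
```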

---

## `DYN_GPU_MEMORY_FRACTION_OVERRIDE`

Environment variable used by Dynamo's VRAM profiler to binary-search for the
minimum memory fraction a script needs.

- Maps to `--gpu-memory-utilization` in vLLM and `--mem-fraction-static` in sglang.
- For TensorRT-LLM, maps to `kv_cache_config.free_gpu_memory_fraction` via
  `--override-engine-args`.
- Launch scripts use `gpu_worker_fraction <engine>` to compute the default
  fraction; the override bypasses this and splits the raw value between workers.
- Scripts that use `--kv-cache-memory-bytes` (vLLM) bypass the fraction-based KV
  cache sizing, making the profiler's fraction override ineffective for KV cache.
  Those scripts should warn when `DYN_GPU_MEMORY_FRACTION_OVERRIDE` is set.
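
The search itself can be pictured as a standard bisection over the fraction. The sketch below is a generic version under stated assumptions, not the profiler's actual code: `launch_ok` is a hypothetical callback standing in for "launch the script with `DYN_GPU_MEMORY_FRACTION_OVERRIDE` set to this value and report whether it came up", and the bounds and tolerance are arbitrary.

```python
def min_working_fraction(launch_ok, lo=0.05, hi=0.95, tol=0.01):
    """Smallest fraction in [lo, hi], within tol, for which launch_ok is True.

    Assumes launch_ok is monotonic: once a fraction works, larger ones do too.
    """
    if not launch_ok(hi):
        raise RuntimeError("script fails even at the upper bound")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if launch_ok(mid):
            hi = mid  # mid works; try something smaller
        else:
            lo = mid  # mid fails; the answer is larger
    return hi

# Toy stand-in: pretend any fraction >= 0.31 launches successfully.
found = min_working_fraction(lambda f: f >= 0.31)
print(f"{found:.3f}")
```

The monotonicity assumption is what breaks for scripts that pin KV cache with `--kv-cache-memory-bytes`: changing the fraction no longer changes the KV cache size, so the search result says little about the KV cache the script needs.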