# GPU Memory Calculation and Configuration

This guide explains how to calculate GPU memory requirements and properly configure `gpu_memory_utilization` for vLLM-Omni stages.

## Overview

`gpu_memory_utilization` is a critical parameter that controls how much GPU memory each stage can use. It's specified as a fraction between 0.0 and 1.0, where:
- `0.8` means 80% of the GPU's total memory
- `1.0` means 100% of the GPU's total memory (not recommended, since it leaves no buffer for overhead)

## How Memory is Calculated

### Memory Allocation Formula

For each stage, vLLM-Omni calculates the requested memory as:

```
requested_memory = total_gpu_memory × gpu_memory_utilization
```

The system checks that:
```
free_memory ≥ requested_memory
```

If this condition is not met, the stage will fail to initialize with an error message showing the memory requirements.
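
A minimal sketch of this check in Python (the helper name `check_memory_budget` is illustrative, not a vLLM-Omni API; it assumes a CUDA-capable PyTorch install):

```python
import torch

def check_memory_budget(gpu_memory_utilization: float, device: int = 0) -> int:
    """Mirror the admission check: fail if free memory can't cover the request."""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    requested_bytes = int(total_bytes * gpu_memory_utilization)
    if free_bytes < requested_bytes:
        raise RuntimeError(
            f"Free memory ({free_bytes / 1e9:.1f} GB) is less than the "
            f"requested {requested_bytes / 1e9:.1f} GB "
            f"(utilization={gpu_memory_utilization})"
        )
    return requested_bytes
```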

### Memory Components

The total memory used by a stage includes:

1. **Model Weights**: The size of the model parameters loaded on the GPU
2. **KV Cache**: Memory for storing key-value cache during generation
3. **Activation Memory**: Temporary memory for intermediate computations
4. **System Overhead**: Memory used by CUDA, PyTorch, and other system components
5. **Non-Torch Memory**: Memory allocated outside of PyTorch (e.g., CUDA graphs)

### Example Calculation

For a GPU with 80GB total memory:
- `gpu_memory_utilization: 0.8` → 64GB available for the stage
- `gpu_memory_utilization: 0.6` → 48GB available for the stage
- `gpu_memory_utilization: 0.15` → 12GB available for the stage

## Setting Up `gpu_memory_utilization`

### Step 1: Determine GPU Memory

First, check your GPU's total memory:

```bash
# Using nvidia-smi
nvidia-smi --query-gpu=memory.total --format=csv

# Or using Python
python -c "import torch; print(f'{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
```

### Step 2: Estimate Model Memory Requirements

#### For Autoregressive (AR) Stages

AR stages typically need more memory due to:
- Large model weights
- KV cache for attention
- Activation buffers

#### For Diffusion/Generation Stages

Diffusion stages (like code2wav) typically need less memory:
- Smaller model components
- Different memory access patterns

**Typical values:**
- `0.1 - 0.3` for most diffusion stages
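
As a rough starting point for choosing a value, one rule of thumb is to budget the weight footprint plus ~30% headroom for activations and overhead, then grow from there. The helper below is a sketch under that assumption, not a vLLM-Omni utility; raising the value beyond this floor gives the KV cache more room:

```python
def suggest_utilization(weight_gb: float, total_gpu_gb: float,
                        headroom_factor: float = 1.3) -> float:
    # Lower bound: weights plus ~30% for activations and CUDA overhead.
    # Raise the result if you want a larger KV cache.
    return round(min(weight_gb * headroom_factor / total_gpu_gb, 0.95), 2)

print(suggest_utilization(14, 80))  # ~14 GB FP16 model on an 80 GB GPU -> 0.23
```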

### Step 3: Consider Multi-Stage Scenarios

When multiple stages share the same GPU, you must ensure the sum of their `gpu_memory_utilization` values doesn't exceed 1.0.

**Example: Two stages on GPU 0**
```yaml
stage_args:
  - stage_id: 0
    runtime:
      devices: "0"
    engine_args:
      gpu_memory_utilization: 0.6  # Uses 60% of GPU 0

  - stage_id: 1
    runtime:
      devices: "0"
    engine_args:
      gpu_memory_utilization: 0.3  # Uses 30% of GPU 0
      # Total: 90% of GPU 0 (safe, leaves 10% buffer)
```

**Important:** If stages run on different GPUs, each can use up to 1.0 independently.
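
A quick sanity check for this constraint, sketched in Python (the `stages` structure below simply mirrors the YAML above; it is not how vLLM-Omni parses configs):

```python
from collections import defaultdict

stages = [
    {"stage_id": 0, "devices": "0", "gpu_memory_utilization": 0.6},
    {"stage_id": 1, "devices": "0", "gpu_memory_utilization": 0.3},
]

# Sum the requested fraction per GPU across all stages that use it.
totals = defaultdict(float)
for stage in stages:
    for device in stage["devices"].split(","):
        totals[device] += stage["gpu_memory_utilization"]

for device, total in totals.items():
    assert total <= 1.0, f"GPU {device} oversubscribed: {total:.0%}"
    print(f"GPU {device}: {total:.0%} requested")  # GPU 0: 90% requested
```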

### Step 4: Account for Tensor Parallelism

When using `tensor_parallel_size > 1`, the model is split across multiple GPUs, so each GPU needs less memory.

**Example: 2-way tensor parallelism**
```yaml
stage_args:
  - stage_id: 0
    runtime:
      devices: "0,1"  # Uses both GPUs
    engine_args:
      tensor_parallel_size: 2
      gpu_memory_utilization: 0.6  # 60% per GPU
      # Weights are split across the 2 GPUs, so each holds ~half the model parameters
```
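
A worked example of the arithmetic, using assumed numbers (a hypothetical 30B-parameter FP16 model; not tied to any specific checkpoint):

```python
num_parameters = 30e9   # assumed model size
dtype_size = 2          # bytes per FP16 parameter
tp_size = 2
total_gpu_gb = 80
gpu_memory_utilization = 0.6

weights_gb = num_parameters * dtype_size / 1e9      # 60 GB for the full model
per_gpu_weights_gb = weights_gb / tp_size           # 30 GB of weights per GPU
budget_gb = gpu_memory_utilization * total_gpu_gb   # 48 GB budget per GPU
print(f"{budget_gb - per_gpu_weights_gb:.0f} GB per GPU left for KV cache and overhead")
```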

## Examples

### Qwen3-Omni-MoE on 2x H100-80GB

```yaml
stage_args:
  - stage_id: 0  # Thinker stage with TP=2
    runtime:
      devices: "0,1"
    engine_args:
      tensor_parallel_size: 2
      gpu_memory_utilization: 0.6  # 48GB per GPU

  - stage_id: 1  # Talker stage
    runtime:
      devices: "1"
    engine_args:
      gpu_memory_utilization: 0.3  # 24GB on GPU 1

  - stage_id: 2  # Code2Wav stage
    runtime:
      devices: "0"
    engine_args:
      gpu_memory_utilization: 0.1  # 8GB on GPU 0
```
**Note:** In this configuration, stages 0 and 2 share GPU 0 (60% + 10% = 70%), and stages 0 and 1 share GPU 1 (60% + 30% = 90%); both sums stay below 1.0. The co-located stages also run at different times in the pipeline, so their memory usage doesn't overlap.

## Troubleshooting

### Error: "Free memory is less than desired GPU memory utilization"

This means the GPU doesn't have enough free memory when the stage starts.

**Solutions:**
1. Free up memory by closing other processes
2. Reduce `gpu_memory_utilization` for this stage
3. Use a GPU with more memory
4. Move the stage to a different GPU

### Error: OOM during inference

The stage initialized but ran out of memory during processing.

**Solutions:**
1. Reduce `max_num_batched_tokens`
2. Reduce `max_batch_size` in runtime config
3. Lower `gpu_memory_utilization` slightly
4. Enable quantization if supported

### Memory Not Fully Utilized

If you see low memory usage, you can:
1. Increase `gpu_memory_utilization` to allow larger KV cache
2. Increase `max_num_batched_tokens` for better batching
3. Check if other stages are limiting throughput

## Useful Formulas for Memory Calculation

### KV Cache Memory

The KV cache size depends on:
- Number of sequences in batch
- Sequence length (prompt + generation)
- Model hidden size
- Number of attention heads
- Number of layers

Approximate formula:
```
kv_cache_memory ≈ batch_size × seq_len × hidden_size × num_layers × 2 × dtype_size
```
The factor of 2 accounts for the key and value tensors. For standard multi-head attention, `hidden_size` equals `num_heads × head_dim`; models using grouped-query attention store proportionally less.
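
A worked example with assumed 7B-class dimensions (`hidden_size=4096`, 32 layers, FP16; illustrative only):

```python
batch_size = 8
seq_len = 4096        # prompt + generated tokens
hidden_size = 4096
num_layers = 32
dtype_size = 2        # bytes per FP16 element

kv_cache_bytes = batch_size * seq_len * hidden_size * num_layers * 2 * dtype_size
print(f"{kv_cache_bytes / 1e9:.1f} GB")  # ~17.2 GB
```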

### Model Weight Memory

```
model_memory ≈ num_parameters × dtype_size
```

For example:
- 7B parameters in FP16: ~14GB
- 7B parameters in FP32: ~28GB
- 7B parameters in INT8: ~7GB

### Activation Memory

Activation memory is typically smaller but varies with:
- Batch size
- Sequence length
- Model architecture

It's usually 10-30% of model weight memory during inference.
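
Putting the three components together, here is a back-of-the-envelope budget check (a sketch, not a vLLM-Omni tool; the 20% activation fraction is the midpoint of the 10-30% rule of thumb above, and system overhead is not modeled):

```python
def fits_budget(num_params: float, dtype_size: int, kv_cache_gb: float,
                total_gpu_gb: float, gpu_memory_utilization: float,
                activation_fraction: float = 0.2) -> bool:
    weights_gb = num_params * dtype_size / 1e9
    needed_gb = weights_gb * (1 + activation_fraction) + kv_cache_gb
    budget_gb = total_gpu_gb * gpu_memory_utilization
    print(f"need ~{needed_gb:.0f} GB, budget {budget_gb:.0f} GB")
    return needed_gb <= budget_gb

# 7B FP16 model with a ~17 GB KV cache on an 80 GB GPU at 0.6 utilization
print(fits_budget(7e9, 2, 17, 80, 0.6))  # need ~34 GB, budget 48 GB -> True
```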