# GPU Memory Calculation and Configuration

This guide explains how to calculate GPU memory requirements and properly configure `gpu_memory_utilization` for vLLM-Omni stages.

## Overview

`gpu_memory_utilization` is a critical parameter that controls how much GPU memory each stage can use. It's specified as a fraction between 0.0 and 1.0, where:
- `0.8` means 80% of the GPU's total memory
- `1.0` means 100% of the GPU's total memory (not recommended, since it leaves no buffer for overhead)

## How Memory is Calculated

### Memory Allocation Formula

For each stage, vLLM-Omni calculates the requested memory as:

```
requested_memory = total_gpu_memory × gpu_memory_utilization
```

The system checks that:
```
free_memory ≥ requested_memory
```

If this condition is not met, the stage will fail to initialize with an error message showing the memory requirements.
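
A minimal sketch of this check in Python (the helper name `check_memory_budget` is illustrative, not a vLLM-Omni API; it assumes a CUDA-capable PyTorch install):

```python
import torch

def check_memory_budget(gpu_memory_utilization: float, device: int = 0) -> int:
    """Mirror the admission check: fail if free memory can't cover the request."""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    requested_bytes = int(total_bytes * gpu_memory_utilization)
    if free_bytes < requested_bytes:
        raise RuntimeError(
            f"Free memory ({free_bytes / 1e9:.1f} GB) is less than the "
            f"requested {requested_bytes / 1e9:.1f} GB "
            f"(utilization={gpu_memory_utilization})"
        )
    return requested_bytes
```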

### Memory Components

The total memory used by a stage includes:

1. **Model Weights**: The size of the model parameters loaded on the GPU
2. **KV Cache**: Memory for storing key-value cache during generation
3. **Activation Memory**: Temporary memory for intermediate computations
4. **System Overhead**: Memory used by CUDA, PyTorch, and other system components
5. **Non-Torch Memory**: Memory allocated outside of PyTorch (e.g., CUDA graphs)

### Example Calculation

For a GPU with 80GB total memory:
- `gpu_memory_utilization: 0.8` → 64GB available for the stage
- `gpu_memory_utilization: 0.6` → 48GB available for the stage
- `gpu_memory_utilization: 0.15` → 12GB available for the stage

## Setting Up `gpu_memory_utilization`

### Step 1: Determine GPU Memory

First, check your GPU's total memory:

```bash
# Using nvidia-smi
nvidia-smi --query-gpu=memory.total --format=csv

# Or using Python
python -c "import torch; print(f'{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
```

### Step 2: Estimate Model Memory Requirements

#### For Autoregressive (AR) Stages

AR stages typically need more memory due to:
- Large model weights
- KV cache for attention
- Activation buffers

#### For Diffusion/Generation Stages

Diffusion stages (like code2wav) typically need less memory:
- Smaller model components
- Different memory access patterns

**Typical values:**
- `0.1 - 0.3` for most diffusion stages
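
As a rough starting point for choosing a value, one rule of thumb is to budget the weight footprint plus ~30% headroom for activations and overhead, then grow from there. The helper below is a sketch under that assumption, not a vLLM-Omni utility; raising the value beyond this floor gives the KV cache more room:

```python
def suggest_utilization(weight_gb: float, total_gpu_gb: float,
                        headroom_factor: float = 1.3) -> float:
    # Lower bound: weights plus ~30% for activations and CUDA overhead.
    # Raise the result if you want a larger KV cache.
    return round(min(weight_gb * headroom_factor / total_gpu_gb, 0.95), 2)

print(suggest_utilization(14, 80))  # ~14 GB FP16 model on an 80 GB GPU -> 0.23
```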

### Step 3: Consider Multi-Stage Scenarios

When multiple stages share the same GPU, you must ensure the sum of their `gpu_memory_utilization` values doesn't exceed 1.0.

**Example: Two stages on GPU 0**
```yaml
stage_args:
  - stage_id: 0
    runtime:
      devices: "0"
    engine_args:
      gpu_memory_utilization: 0.6  # Uses 60% of GPU 0

  - stage_id: 1
    runtime:
      devices: "0"
    engine_args:
      gpu_memory_utilization: 0.3  # Uses 30% of GPU 0
      # Total: 90% of GPU 0 (safe, leaves 10% buffer)
```

**Important:** If stages run on different GPUs, each can use up to 1.0 independently.
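
A quick sanity check for this constraint, sketched in Python (the `stages` structure below simply mirrors the YAML above; it is not how vLLM-Omni parses configs):

```python
from collections import defaultdict

stages = [
    {"stage_id": 0, "devices": "0", "gpu_memory_utilization": 0.6},
    {"stage_id": 1, "devices": "0", "gpu_memory_utilization": 0.3},
]

# Sum the requested fraction per GPU across all stages that use it.
totals = defaultdict(float)
for stage in stages:
    for device in stage["devices"].split(","):
        totals[device] += stage["gpu_memory_utilization"]

for device, total in totals.items():
    assert total <= 1.0, f"GPU {device} oversubscribed: {total:.0%}"
    print(f"GPU {device}: {total:.0%} requested")  # GPU 0: 90% requested
```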

### Step 4: Account for Tensor Parallelism

When using `tensor_parallel_size > 1`, the model is split across multiple GPUs, so each GPU needs less memory.

**Example: 2-way tensor parallelism**
```yaml
stage_args:
  - stage_id: 0
    runtime:
      devices: "0,1"  # Uses both GPUs
    engine_args:
      tensor_parallel_size: 2
      gpu_memory_utilization: 0.6  # 60% per GPU
      # Weights are split across the 2 GPUs, so each holds ~half the model parameters
```
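
A worked example of the arithmetic, using assumed numbers (a hypothetical 30B-parameter FP16 model; not tied to any specific checkpoint):

```python
num_parameters = 30e9   # assumed model size
dtype_size = 2          # bytes per FP16 parameter
tp_size = 2
total_gpu_gb = 80
gpu_memory_utilization = 0.6

weights_gb = num_parameters * dtype_size / 1e9      # 60 GB for the full model
per_gpu_weights_gb = weights_gb / tp_size           # 30 GB of weights per GPU
budget_gb = gpu_memory_utilization * total_gpu_gb   # 48 GB budget per GPU
print(f"{budget_gb - per_gpu_weights_gb:.0f} GB per GPU left for KV cache and overhead")
```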

## Examples

### Qwen3-Omni-MoE on 2x H100-80GB

```yaml
stage_args:
  - stage_id: 0  # Thinker stage with TP=2
    runtime:
      devices: "0,1"
    engine_args:
      tensor_parallel_size: 2
      gpu_memory_utilization: 0.6  # 48GB per GPU

  - stage_id: 1  # Talker stage
    runtime:
      devices: "1"
    engine_args:
      gpu_memory_utilization: 0.3  # 24GB on GPU 1

  - stage_id: 2  # Code2Wav stage
    runtime:
      devices: "0"
    engine_args:
      gpu_memory_utilization: 0.1  # 8GB on GPU 0
```
**Note:** In this configuration, stages 0 and 2 share GPU 0 (60% + 10% = 70%), and stages 0 and 1 share GPU 1 (60% + 30% = 90%); both sums stay below 1.0. The co-located stages also run at different times in the pipeline, so their memory usage doesn't overlap.

## Troubleshooting

### Error: "Free memory is less than desired GPU memory utilization"

This means the GPU doesn't have enough free memory when the stage starts.

**Solutions:**
1. Free up memory by closing other processes
2. Reduce `gpu_memory_utilization` for this stage
3. Use a GPU with more memory
4. Move the stage to a different GPU

### Error: OOM during inference

The stage initialized but ran out of memory during processing.

**Solutions:**
1. Reduce `max_num_batched_tokens`
2. Reduce `max_batch_size` in runtime config
3. Lower `gpu_memory_utilization` slightly
4. Enable quantization if supported

### Memory Not Fully Utilized

If you see low memory usage, you can:
1. Increase `gpu_memory_utilization` to allow larger KV cache
2. Increase `max_num_batched_tokens` for better batching
3. Check if other stages are limiting throughput

## Useful Formulas for Memory Calculation

### KV Cache Memory

The KV cache size depends on:
- Number of sequences in batch
- Sequence length (prompt + generation)
- Model hidden size
- Number of attention heads
- Number of layers

Approximate formula:
```
kv_cache_memory ≈ batch_size × seq_len × hidden_size × num_layers × 2 × dtype_size
```
The factor of 2 accounts for the key and value tensors. For standard multi-head attention, `hidden_size` equals `num_heads × head_dim`; models using grouped-query attention store proportionally less.
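
A worked example with assumed 7B-class dimensions (`hidden_size=4096`, 32 layers, FP16; illustrative only):

```python
batch_size = 8
seq_len = 4096        # prompt + generated tokens
hidden_size = 4096
num_layers = 32
dtype_size = 2        # bytes per FP16 element

kv_cache_bytes = batch_size * seq_len * hidden_size * num_layers * 2 * dtype_size
print(f"{kv_cache_bytes / 1e9:.1f} GB")  # ~17.2 GB
```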

### Model Weight Memory

```
model_memory ≈ num_parameters × dtype_size
```

For example:
- 7B parameters in FP16: ~14GB
- 7B parameters in FP32: ~28GB
- 7B parameters in INT8: ~7GB

### Activation Memory

Activation memory is typically smaller but varies with:
- Batch size
- Sequence length
- Model architecture

It's usually 10-30% of model weight memory during inference.
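
Putting the three components together, here is a back-of-the-envelope budget check (a sketch, not a vLLM-Omni tool; the 20% activation fraction is the midpoint of the 10-30% rule of thumb above, and system overhead is not modeled):

```python
def fits_budget(num_params: float, dtype_size: int, kv_cache_gb: float,
                total_gpu_gb: float, gpu_memory_utilization: float,
                activation_fraction: float = 0.2) -> bool:
    weights_gb = num_params * dtype_size / 1e9
    needed_gb = weights_gb * (1 + activation_fraction) + kv_cache_gb
    budget_gb = total_gpu_gb * gpu_memory_utilization
    print(f"need ~{needed_gb:.0f} GB, budget {budget_gb:.0f} GB")
    return needed_gb <= budget_gb

# 7B FP16 model with a ~17 GB KV cache on an 80 GB GPU at 0.6 utilization
print(fits_budget(7e9, 2, 17, 80, 0.6))  # need ~34 GB, budget 48 GB -> True
```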