# Parameter Offload

## 📖 Overview

LightX2V implements an advanced parameter offload mechanism designed for large-model inference on hardware with limited memory. By intelligently managing model weights across the memory hierarchy, the system strikes a good balance between inference speed and memory usage.

**Core Features:**
- **Block/Phase-level Offload**: Efficiently manages model weights in block/phase units for optimal memory usage
  - **Block**: The basic computational unit of Transformer models, containing complete Transformer layers (self-attention, cross-attention, feedforward networks, etc.), serving as a larger memory management unit
  - **Phase**: Finer-grained computational stages within blocks, containing individual computational components (such as self-attention, cross-attention, feedforward networks, etc.), providing more precise memory control
- **Multi-tier Storage Support**: GPU → CPU → Disk hierarchy with intelligent caching
- **Asynchronous Operations**: Overlaps computation and data transfer using CUDA streams
- **Disk/NVMe Serialization**: Supports secondary storage when memory is insufficient

## 🎯 Offload Strategies

### Strategy 1: GPU-CPU Block/Phase Offload

**Use Case**: Insufficient GPU memory but sufficient system memory

**How It Works**: Manages model weights in block or phase units between GPU and CPU memory, utilizing CUDA streams to overlap computation and data transfer. Blocks contain complete Transformer layers, while Phases are individual computational components within blocks.

<div align="center">
<img alt="GPU-CPU block/phase offload workflow" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig1_en.png" width="75%">
</div>

<div align="center">
<img alt="Swap operation" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig2_en.png" width="75%">
</div>

<div align="center">
<img alt="Swap concept" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig3_en.png" width="75%">
</div>


**Block vs Phase Explanation** (a rough sketch follows this list):
- **Block Granularity**: A larger memory-management unit containing a complete Transformer layer (self-attention, cross-attention, feedforward network, etc.); suitable when memory is relatively sufficient, since it reduces management overhead
- **Phase Granularity**: A finer-grained memory-management unit containing an individual computational component (self-attention, cross-attention, feedforward network, etc.); suitable for memory-constrained scenarios, since it allows more flexible memory control
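
The difference can be pictured as choosing the unit of transfer. The module names below are generic placeholders, not LightX2V's internal identifiers:

```python
# Illustrative only: how one Transformer block's weights are grouped under each
# offload granularity. Module names are placeholders, not LightX2V's real names.
block_unit = ["self_attn", "cross_attn", "ffn"]          # "block": one transfer per layer
phase_units = [["self_attn"], ["cross_attn"], ["ffn"]]   # "phase": one transfer per component
```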

**Key Features:**
- **Asynchronous Transfer**: Uses three CUDA streams with different priorities to overlap computation and transfer (see the sketch after this list)
  - Compute stream (priority=-1): High priority, handles current computation
  - GPU load stream (priority=0): Medium priority, handles CPU to GPU prefetching
  - CPU load stream (priority=0): Medium priority, handles GPU to CPU offloading
- **Prefetch Mechanism**: Preloads the next block/phase to GPU in advance
- **Intelligent Caching**: Maintains weight cache in CPU memory
- **Stream Synchronization**: Ensures correctness of data transfer and computation
- **Swap Operation**: Rotates block/phase positions after computation for continuous execution
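
The following is a minimal sketch of how this overlap can be arranged with PyTorch CUDA streams. The helper methods (`prefetch_to_gpu`, `offload_to_cpu`, `forward`) and the control flow are illustrative assumptions, not LightX2V's actual code:

```python
import torch

# Three streams, mirroring the priorities described above.
compute_stream  = torch.cuda.Stream(priority=-1)  # high priority: current block's compute
gpu_load_stream = torch.cuda.Stream(priority=0)   # prefetch next block CPU -> GPU
cpu_load_stream = torch.cuda.Stream(priority=0)   # offload finished block GPU -> CPU

def run_blocks(blocks, x):
    with torch.cuda.stream(gpu_load_stream):
        blocks[0].prefetch_to_gpu()                # warm up: load the first block

    for i, block in enumerate(blocks):
        gpu_load_stream.synchronize()              # wait until this block's weights have arrived
        with torch.cuda.stream(gpu_load_stream):
            if i + 1 < len(blocks):
                blocks[i + 1].prefetch_to_gpu()    # prefetch the next block in parallel
        with torch.cuda.stream(compute_stream):
            x = block.forward(x)                   # compute on the current block
        compute_stream.synchronize()
        with torch.cuda.stream(cpu_load_stream):
            block.offload_to_cpu()                 # free GPU memory for the block just used
    return x
    # The actual implementation also rotates ("swaps") block/phase slots after
    # each step for continuous execution; that bookkeeping is omitted here.
```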




### Strategy 2: Disk-CPU-GPU Block/Phase Offload (Lazy Loading)

**Use Case**: Both GPU memory and system memory are insufficient

**How It Works**: Builds on Strategy 1 by introducing disk storage, forming a three-tier hierarchy (Disk → CPU → GPU). The CPU serves as a cache pool of configurable size, which makes this strategy suitable for devices with limited CPU memory.


<div align="center">
<img alt="Disk-CPU-GPU block/phase offload workflow" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig4_en.png" width="75%">
</div>


<div align="center">
<img alt="Working steps" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig5_en.png" width="75%">
</div>

**Key Features:**
- **Lazy Loading**: Model weights are loaded from disk on-demand, avoiding loading the entire model at once
- **Intelligent Caching**: CPU memory buffer uses FIFO strategy with configurable size
- **Multi-threaded Prefetch**: Uses multiple disk worker threads for parallel loading
- **Asynchronous Transfer**: Uses CUDA streams to overlap computation and data transfer
- **Swap Rotation**: Achieves continuous computation through position rotation, avoiding repeated loading/offloading

**Working Steps** (a simplified sketch follows this list):
- **Disk Storage**: Model weights are stored on SSD/NVMe by block, one `.safetensors` file per block
- **Task Scheduling**: When a block/phase is needed, a priority task queue assigns it to disk worker threads
- **Asynchronous Loading**: Multiple disk threads load weight files from disk into the CPU memory buffer in parallel
- **Intelligent Caching**: The CPU memory buffer manages its cache with a FIFO strategy and a configurable size
- **Cache Hit**: If the weights are already in the cache, they are transferred directly to the GPU without a disk read
- **Prefetch Transfer**: Cached weights are asynchronously transferred to GPU memory (using the GPU load stream)
- **Compute Execution**: Weights on the GPU are used for computation (using the compute stream) while the next block/phase continues to be prefetched in the background
- **Swap Rotation**: After computation completes, block/phase positions are rotated for continuous computation
- **Memory Management**: When the CPU cache is full, the oldest cached block/phase is evicted automatically (FIFO)
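
To make the flow concrete, here is a minimal sketch of the disk-to-CPU-cache path, assuming one `.safetensors` file per block as described above. The class, method names, and queue protocol are illustrative assumptions, not LightX2V's implementation:

```python
import queue
import threading
from collections import OrderedDict
from safetensors.torch import load_file

class CpuWeightCache:
    """Bounded FIFO cache of per-block weight files, filled by disk worker threads."""

    def __init__(self, max_blocks, num_disk_workers=2):
        self.cache = OrderedDict()                  # FIFO cache: block_path -> weights
        self.max_blocks = max_blocks
        self.lock = threading.Lock()
        self.tasks = queue.PriorityQueue()          # (priority, block_path)
        for _ in range(num_disk_workers):
            threading.Thread(target=self._disk_worker, daemon=True).start()

    def prefetch(self, block_path, priority=0):
        self.tasks.put((priority, block_path))      # schedule an asynchronous disk read

    def _disk_worker(self):
        while True:
            _, block_path = self.tasks.get()
            weights = load_file(block_path)         # read the block's .safetensors from disk
            with self.lock:
                if len(self.cache) >= self.max_blocks:
                    self.cache.popitem(last=False)  # FIFO eviction of the oldest block
                self.cache[block_path] = weights

    def get(self, block_path):
        with self.lock:
            if block_path in self.cache:            # cache hit: no disk read needed
                return self.cache[block_path]
        return load_file(block_path)                # cache miss: synchronous fallback load
```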



## ⚙️ Configuration Parameters

### GPU-CPU Offload Configuration

```python
config = {
    "cpu_offload": True,
    "offload_ratio": 1.0,           # Offload ratio (0.0-1.0)
    "offload_granularity": "block", # Offload granularity: "block" or "phase"
    "lazy_load": False,             # Disable lazy loading
}
```

### Disk-CPU-GPU Offload Configuration

```python
config = {
    "cpu_offload": True,
    "lazy_load": True,              # Enable lazy loading
    "offload_ratio": 1.0,           # Offload ratio
    "offload_granularity": "phase", # Recommended to use phase granularity
    "num_disk_workers": 2,          # Number of disk worker threads
    "offload_to_disk": True,        # Enable disk offload
}
```

**Intelligent Cache Key Parameters** (an illustrative combined example follows this list):
- `max_memory`: Controls CPU cache size, affects cache hit rate and memory usage
- `num_disk_workers`: Controls number of disk loading threads, affects prefetch speed
- `offload_granularity`: Controls cache granularity (block or phase), affects cache efficiency
  - `"block"`: Cache management in complete Transformer layer units
  - `"phase"`: Cache management in individual computational component units
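
Putting these together, a lazy-loading configuration that tunes the cache might look like the sketch below; the specific values (and the unit in which `max_memory` is expressed) are illustrative assumptions, not recommendations:

```python
# Hypothetical tuning example combining the cache-related parameters above.
config = {
    "cpu_offload": True,
    "offload_to_disk": True,
    "lazy_load": True,
    "offload_granularity": "phase",  # finer-grained entries allow more flexible eviction
    "num_disk_workers": 2,           # more workers speed up prefetch but raise I/O pressure
    "max_memory": 8,                 # CPU cache budget: larger -> higher hit rate, more RAM
}
```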

**Offload Configuration for Non-DIT Model Components (T5, CLIP, VAE):**

The offload behavior of these components follows these rules:
- **Default Behavior**: If not specified separately, T5, CLIP, VAE will follow the `cpu_offload` setting
- **Independent Configuration**: The offload strategy can also be set separately for each component for fine-grained control

**Configuration Example**:
```json
{
    "cpu_offload": true,           // DIT model offload switch
    "t5_cpu_offload": false,       // T5 encoder independent setting
    "clip_cpu_offload": false,     // CLIP encoder independent setting
    "vae_cpu_offload": false       // VAE encoder independent setting
}
```

For memory-constrained devices, a progressive offload strategy is recommended (sketched after this list):

1. **Step 1**: Only enable `cpu_offload`; keep `t5_cpu_offload`, `clip_cpu_offload`, and `vae_cpu_offload` disabled
2. **Step 2**: If memory is still insufficient, gradually enable CPU offload for T5, CLIP, and VAE
3. **Step 3**: If memory is still insufficient, combine quantization with CPU offload or enable `lazy_load`
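
As a concrete illustration of this progression (parameter names follow the examples above; the combinations are a sketch, not tuned recommendations):

```python
# Step 1: offload only the DIT model.
step1 = {"cpu_offload": True,
         "t5_cpu_offload": False, "clip_cpu_offload": False, "vae_cpu_offload": False}

# Step 2: additionally offload T5 / CLIP / VAE.
step2 = {"cpu_offload": True,
         "t5_cpu_offload": True, "clip_cpu_offload": True, "vae_cpu_offload": True}

# Step 3: add lazy loading from disk (combine with quantization if needed).
step3 = {**step2, "offload_to_disk": True, "lazy_load": True,
         "offload_granularity": "phase", "num_disk_workers": 2}
```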

**Practical Experience**:
- **RTX 4090 24GB + 14B Model**: Usually it is enough to enable `cpu_offload`, explicitly set the other components' offload to `false`, and use the FP8-quantized model
- **Smaller Memory GPUs**: Need to combine quantization, CPU offload, and lazy loading
- **Quantization Schemes**: Refer to [Quantization Documentation](../method_tutorials/quantization.md) to select appropriate quantization strategy


**Configuration File Reference**:
- **Wan2.1 Series Models**: Refer to [offload config files](https://github.com/ModelTC/lightx2v/tree/main/configs/offload)
- **Wan2.2 Series Models**: Refer to the [wan22 config files](https://github.com/ModelTC/lightx2v/tree/main/configs/wan22) with the `4090` suffix

## 🎯 Usage Recommendations
- 🔄 GPU-CPU Block/Phase Offload: Suitable when GPU memory is insufficient (e.g., RTX 3090/4090, 24GB) but system memory is ample (>64/128GB)

- 💾 Disk-CPU-GPU Block/Phase Offload: Suitable when both GPU memory (e.g., RTX 3060/4090, 8GB) and system memory (16/32GB) are insufficient

- 🚫 No Offload: Suitable for high-end hardware where peak performance is the priority


## 🔍 Troubleshooting

### Common Issues and Solutions

1. **Disk I/O Bottleneck**
   - Solution: Use an NVMe SSD and increase `num_disk_workers`


2. **Memory Buffer Overflow**
   - Solution: Increase `max_memory` or reduce `num_disk_workers`

3. **Loading Timeout**
   - Solution: Check disk performance and optimize the file system


**Note**: This offload mechanism is designed specifically for LightX2V; it takes full advantage of the asynchronous execution capabilities of modern hardware and significantly lowers the hardware barrier for large-model inference.