# Lightx2v Parameter Offloading Mechanism Documentation

## 📖 Overview

Lightx2v implements a parameter offloading mechanism for running large-model inference on hardware with limited memory. It balances speed against memory use by moving model weights between GPU memory, CPU memory, and disk as they are needed.

**Core Features:**
- **Block/Phase Offloading**: Manages model weights in block or phase units
  - **Block**: The basic computational unit of a Transformer model, containing a complete Transformer layer (self-attention, cross-attention, feed-forward network, etc.); the larger memory management unit
  - **Phase**: A finer-grained computational stage within a block, containing a single computational component (such as self-attention, cross-attention, or the feed-forward network); gives more precise memory control
- **Multi-level Storage Support**: GPU → CPU → Disk hierarchy with intelligent caching
- **Asynchronous Operations**: Uses CUDA streams to overlap computation and data transfer
- **Disk/NVMe Serialization**: Supports secondary storage when memory is insufficient

## 🎯 Offloading Strategies

### Strategy 1: GPU-CPU Block/Phase Offloading

**Applicable Scenarios**: GPU VRAM is insufficient but system memory is adequate

**Working Principle**: Manages model weights in block or phase units between GPU and CPU memory, utilizing CUDA streams to overlap computation and data transfer. Blocks contain complete Transformer layers, while phases are individual computational components within blocks.

**Block vs Phase Explanation**:
- **Block Granularity**: Larger memory management units, each a complete Transformer layer (self-attention, cross-attention, feed-forward network, etc.). Suited to scenarios with more available memory; fewer units mean lower management overhead
- **Phase Granularity**: Smaller memory management units, each a single computational component (such as self-attention, cross-attention, or the feed-forward network). Suited to memory-constrained scenarios; offers more flexible memory control

```
GPU-CPU Block/Phase Offloading Workflow:

╔═════════════════════════════════════════════════════════════════╗
║                        🎯 GPU Memory                            ║
╠═════════════════════════════════════════════════════════════════╣
║                                                               ║
║  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐ ║
║  │ 🔄 Current      │    │ ⏳ Prefetch     │    │ 📤 To Offload   │ ║
║  │ block/phase N   │◄──►│ block/phase N+1 │◄──►│ block/phase N-1 │ ║
║  └─────────────────┘    └─────────────────┘    └─────────────────┘ ║
║         │                       │                       │         ║
║         ▼                       ▼                       ▼         ║
║  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         ║
║  │ Compute     │    │ GPU Load    │    │ CPU Load    │         ║
║  │ Stream      │    │ Stream      │    │ Stream      │         ║
║  │(priority=-1)│    │(priority=0) │    │(priority=0) │         ║
║  └─────────────┘    └─────────────┘    └─────────────┘         ║
╚═════════════════════════════════════════════════════════════════╝

╔═════════════════════════════════════════════════════════════════╗
║                        💾 CPU Memory                            ║
╠═════════════════════════════════════════════════════════════════╣
║                                                               ║
║  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ║
║  │ 📥 Cache    │ │ 📥 Cache    │ │ 📥 Cache    │ │ 📥 Cache    │ ║
║  │ block/phase │ │ block/phase │ │ block/phase │ │ block/phase │ ║
║  │    N-2      │ │    N-1      │ │     N       │ │    N+1      │ ║
║  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ ║
║         ▲               ▲               ▲               ▲         ║
║         │               │               │               │         ║
║  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ║
║  │ CPU Load    │ │ CPU Load    │ │ CPU Load    │ │ CPU Load    │ ║
║  │ Stream      │ │ Stream      │ │ Stream      │ │ Stream      │ ║
║  │(priority=0) │ │(priority=0) │ │(priority=0) │ │(priority=0) │ ║
║  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ ║
║                                                               ║
║  💡 CPU memory stores multiple blocks/phases, forming cache pool ║
║  🔄 GPU load stream prefetches from CPU cache, CPU load stream  ║
║     offloads to CPU cache                                        ║
╚═════════════════════════════════════════════════════════════════╝


╔═════════════════════════════════════════════════════════════════╗
║                        🔄 Swap Operation Flow                   ║
╠═════════════════════════════════════════════════════════════════╣
║                                                               ║
║  Step 1: Parallel Execution Phase                              ║
║  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐ ║
║  │ 🔄 Compute      │    │ ⏳ Prefetch     │    │ 📤 Offload      │ ║
║  │ block/phase N   │    │ block/phase N+1 │    │ block/phase N-1 │ ║
║  │ (Compute Stream)│    │(GPU Load Stream)│    │(CPU Load Stream)│ ║
║  └─────────────────┘    └─────────────────┘    └─────────────────┘ ║
║                                                               ║
║  Step 2: Swap Rotation Phase                                   ║
║  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐ ║
║  │ 🔄 Compute      │    │ ⏳ Prefetch     │    │ 📤 Offload      │ ║
║  │ block/phase N+1 │    │ block/phase N+2 │    │ block/phase N   │ ║
║  │ (Compute Stream)│    │(GPU Load Stream)│    │(CPU Load Stream)│ ║
║  └─────────────────┘    └─────────────────┘    └─────────────────┘ ║
║                                                               ║
║  Swap Concept: Achieves continuous computation through position ║
║  rotation, avoiding repeated loading/unloading                  ║
╚═════════════════════════════════════════════════════════════════╝

╔═════════════════════════════════════════════════════════════════╗
║                        💡 Swap Core Concept                     ║
╠═════════════════════════════════════════════════════════════════╣
║                                                               ║
║  🔄 Traditional vs Swap Method Comparison:                     ║
║                                                               ║
║  Traditional Method:                                            ║
║  ┌─────────────┐    ┌──────────┐    ┌─────────┐    ┌────────┐ ║
║  │ Compute N   │───►│ Offload N│───►│ Load N+1│───►│Compute │ ║
║  │             │    │          │    │         │    │N+1     │ ║
║  └─────────────┘    └──────────┘    └─────────┘    └────────┘ ║
║       ❌ Serial execution, waiting time, low efficiency        ║
║                                                               ║
║  Swap Method:                                                  ║
║  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         ║
║  │ Compute N   │    │ Prefetch    │    │ Offload     │         ║
║  │(Compute     │    │N+1          │    │N-1          │         ║
║  │ Stream)     │    │(GPU Load    │    │(CPU Load    │         ║
║  └─────────────┘    │ Stream)     │    │ Stream)     │         ║
║                     └─────────────┘    └─────────────┘         ║
║       ✅ Parallel execution, no waiting time, high efficiency  ║
║                                                               ║
║  🎯 Swap Advantages:                                           ║
║  • Avoids repeated loading/unloading of same data              ║
║  • Achieves continuous computation through position rotation   ║
║  • Maximizes GPU utilization                                   ║
║  • Reduces memory fragmentation                                ║
╚═════════════════════════════════════════════════════════════════╝
```
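The rotation in the diagram can be written down as a per-step schedule. The following is a minimal sketch, not Lightx2v code: `Step` and `swap_schedule` are hypothetical names, and the handling of the first and last unit (nothing to offload before unit 0, nothing to prefetch after the last unit) is an assumption.

```python
from typing import Iterator, NamedTuple, Optional


class Step(NamedTuple):
    compute: int             # unit running on the compute stream
    prefetch: Optional[int]  # unit being copied CPU -> GPU (GPU load stream)
    offload: Optional[int]   # unit being copied GPU -> CPU (CPU load stream)


def swap_schedule(num_units: int) -> Iterator[Step]:
    """Yield, for each step N, the units that are handled in parallel."""
    for n in range(num_units):
        prefetch = n + 1 if n + 1 < num_units else None  # assumed: no wrap-around
        offload = n - 1 if n > 0 else None
        yield Step(compute=n, prefetch=prefetch, offload=offload)


for step in swap_schedule(4):
    print(step)
```

Each yielded step corresponds to one column set of the "Swap Operation Flow" box: while unit N computes, unit N+1 is loaded and unit N-1 is written back.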

**Key Features:**
- **Asynchronous Transfer**: Uses three CUDA streams to run computation and transfers in parallel; the compute stream has higher priority than the two transfer streams (a lower priority number means higher priority)
  - Compute Stream (priority=-1): High priority, responsible for the current computation
  - GPU Load Stream (priority=0): Default priority, responsible for prefetching from CPU to GPU
  - CPU Load Stream (priority=0): Default priority, responsible for offloading from GPU to CPU
- **Prefetch Mechanism**: Preloads the next block/phase to GPU
- **Intelligent Caching**: Maintains weight cache in CPU memory
- **Stream Synchronization**: Ensures correctness of data transfer and computation
- **Swap Operation**: Rotates block/phase positions after computation completion for continuous processing
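
The prefetch-then-synchronize pattern behind these features can be illustrated without a GPU. In the sketch below a single background thread stands in for the GPU load stream and `time.sleep` stands in for real work; `load_to_gpu`, `compute`, and `run_pipeline` are hypothetical names, and the real mechanism uses CUDA streams rather than threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_to_gpu(idx: int) -> str:
    """Stand-in for an asynchronous CPU -> GPU weight copy."""
    time.sleep(0.01)
    return f"weights[{idx}]"


def compute(weights: str) -> str:
    """Stand-in for running one block/phase on the compute stream."""
    time.sleep(0.01)
    return f"out({weights})"


def run_pipeline(num_units: int) -> list:
    outputs = []
    if num_units == 0:
        return outputs
    # One worker thread plays the role of the GPU load stream.
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_to_gpu, 0)
        for n in range(num_units):
            weights = pending.result()  # synchronize: unit N must be resident
            if n + 1 < num_units:
                pending = loader.submit(load_to_gpu, n + 1)  # prefetch unit N+1
            outputs.append(compute(weights))  # runs while the prefetch proceeds
    return outputs


print(run_pipeline(3))
```

The `pending.result()` call plays the part of stream synchronization: computation on unit N never starts before its weights have arrived, yet the load of unit N+1 is already in flight while unit N computes.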


### Strategy 2: Disk-CPU-GPU Block/Phase Offloading (Lazy Loading)

**Applicable Scenarios**: Both GPU VRAM and system memory are insufficient

**Working Principle**: Adds disk storage to Strategy 1, forming a three-level storage hierarchy (Disk → CPU → GPU). CPU memory still serves as a cache pool, but its size is configurable, which suits devices with limited CPU memory.

```
Disk-CPU-GPU Block/Phase Offloading Workflow:

╔═════════════════════════════════════════════════════════════════╗
║                        💿 SSD/NVMe Storage                     ║
╠═════════════════════════════════════════════════════════════════╣
║                                                                 ║
║  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ║
║  │ 📁 block_0  │ │ 📁 block_1  │ │ 📁 block_2  │ │ 📁 block_N  │ ║
║  │ .safetensors│ │ .safetensors│ │ .safetensors│ │ .safetensors│ ║
║  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ ║
║         │               │               │               │         ║
║         ▼               ▼               ▼               ▼         ║
║  ┌─────────────────────────────────────────────────────────────┐ ║
║  │                    🎯 Disk Worker Thread Pool               │ ║
║  │                                                             │ ║
║  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │ ║
║  │  │ Disk Thread │ │ Disk Thread │ │ Disk Thread │            │ ║
║  │  │     1       │ │     2       │ │     N       │            │ ║
║  │  │(Async Load) │ │(Async Load) │ │(Async Load) │            │ ║
║  │  └─────────────┘ └─────────────┘ └─────────────┘            │ ║
║  │         │               │               │                   │ ║
║  │         └───────────────┼───────────────┘                   │ ║
║  │                         ▼                                   │ ║
║  │  ┌─────────────────────────────────────────────────────────┐ │ ║
║  │  │                 📋 Priority Task Queue                  │ │ ║
║  │  │              (Manages disk loading task scheduling)     │ │ ║
║  │  └─────────────────────────────────────────────────────────┘ │ ║
║  └─────────────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════╝

╔═════════════════════════════════════════════════════════════════╗
║                        💾 CPU Memory Buffer                     ║
╠═════════════════════════════════════════════════════════════════╣
║                                                               ║
║  ┌─────────────────────────────────────────────────────────────┐ ║
║  │                    🎯 FIFO Intelligent Cache                │ ║
║  │                                                             │ ║
║  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ║
║  │  │ 📥 Cache    │ │ 📥 Cache    │ │ 📥 Cache    │ │ 📥 Cache    │ ║
║  │  │ block/phase │ │ block/phase │ │ block/phase │ │ block/phase │ ║
║  │  │    N-2      │ │    N-1      │ │     N       │ │    N+1      │ ║
║  │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ ║
║  │         ▲               ▲               ▲               ▲         ║
║  │         │               │               │               │         ║
║  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ║
║  │  │ CPU Load    │ │ CPU Load    │ │ CPU Load    │ │ CPU Load    │ ║
║  │  │ Stream      │ │ Stream      │ │ Stream      │ │ Stream      │ ║
║  │  │(priority=0) │ │(priority=0) │ │(priority=0) │ │(priority=0) │ ║
║  │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ ║
║  │                                                             │ ║
║  │  💡 Configurable Size 🎯 FIFO Eviction 🔄 Cache Hit/Miss    │ ║
║  └─────────────────────────────────────────────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════╝

╔═════════════════════════════════════════════════════════════════╗
║                        🎯 GPU Memory                            ║
╠═════════════════════════════════════════════════════════════════╣
║                                                               ║
║  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐ ║
║  │ 🔄 Current      │    │ ⏳ Prefetch     │    │ 📤 To Offload   │ ║
║  │ block/phase N   │◄──►│ block/phase N+1 │◄──►│ block/phase N-1 │ ║
║  └─────────────────┘    └─────────────────┘    └─────────────────┘ ║
║         │                       │                       │         ║
║         ▼                       ▼                       ▼         ║
║  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         ║
║  │ Compute     │    │ GPU Load    │    │ CPU Load    │         ║
║  │ Stream      │    │ Stream      │    │ Stream      │         ║
║  │(priority=-1)│    │(priority=0) │    │(priority=0) │         ║
║  └─────────────┘    └─────────────┘    └─────────────┘         ║
╚═════════════════════════════════════════════════════════════════╝

╔═════════════════════════════════════════════════════════════════╗
║                        🔄 Complete Workflow                     ║
╠═════════════════════════════════════════════════════════════════╣
║                                                               ║
║  Step 1: Cache Miss Handling                                   ║
║  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         ║
║  │ 💿 Disk     │───►│ 💾 CPU Cache│───►│ 🎯 GPU      │         ║
║  │ (On-demand  │     │ (FIFO       │    │ Memory      │         ║
║  │  loading)   │     │  Management)│    │ (Compute    │         ║
║  └─────────────┘    └─────────────┘    │ Execution)  │         ║
║                                        └─────────────┘         ║
║                                                               ║
║  Step 2: Cache Hit Handling                                    ║
║  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         ║
║  │ 💿 Disk     │    │ 💾 CPU Cache│───►│ 🎯 GPU      │         ║
║  │ (Skip       │     │ (Direct     │    │ Memory      │         ║
║  │  loading)   │     │  Access)    │    │ (Compute    │         ║
║  └─────────────┘    └─────────────┘    │ Execution)  │         ║
║                                        └─────────────┘         ║
║                                                               ║
║  Step 3: Memory Management                                      ║
║  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         ║
║  │ 💿 Disk     │    │ 💾 CPU Cache│    │ 🎯 GPU      │         ║
║  │ (Persistent │     │ (FIFO       │    │ Memory      │         ║
║  │  Storage)   │     │  Eviction)  │    │ (Swap       │         ║
║  └─────────────┘    └─────────────┘    │ Rotation)   │         ║
║                                        └─────────────┘         ║
╚═════════════════════════════════════════════════════════════════╝

Work Steps:
1. Disk Storage: Model weights are stored by block on SSD/NVMe, one .safetensors file per block
2. Task Scheduling: When a block/phase is needed, the priority task queue assigns the load to a disk worker thread
3. Async Loading: Multiple disk threads read weight files from disk into the CPU memory buffer in parallel
4. Intelligent Caching: The CPU memory buffer is managed as a FIFO cache with configurable size
5. Cache Hit: If the weights are already cached, they are transferred directly to the GPU without a disk read
6. Prefetch Transfer: Cached weights are transferred asynchronously to GPU memory (on the GPU load stream)
7. Compute Execution: The GPU computes with the resident weights (on the compute stream) while the next block/phase is prefetched in the background
8. Swap Rotation: After computation completes, block/phase positions are rotated so computation can continue
9. Memory Management: When the CPU cache is full, the earliest-loaded weight blocks/phases are evicted automatically
```
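
The cache hit, cache miss, and FIFO eviction steps can be sketched in a few lines. This is an illustration under stated assumptions, not the Lightx2v implementation: `FifoWeightCache` is a hypothetical class, capacity is counted in entries rather than bytes, and `load_from_disk` is whatever callable reads one block/phase.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class FifoWeightCache:
    """CPU-side cache that evicts entries in the order they were loaded."""

    def __init__(self, capacity: int, load_from_disk: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_from_disk = load_from_disk
        self._entries = OrderedDict()  # insertion order == load order
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self.hits += 1    # cache hit: no disk read, order left unchanged
            return self._entries[key]
        self.misses += 1      # cache miss: load on demand from disk
        value = self.load_from_disk(key)
        if len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)  # evict the earliest-loaded entry
        self._entries[key] = value
        return value

    def keys(self):
        return list(self._entries)


cache = FifoWeightCache(2, lambda k: f"block_{k}.safetensors")
for key in (0, 1, 0, 2):
    cache.get(key)
print(cache.keys(), cache.hits, cache.misses)
```

Note that a hit does not refresh an entry's position: under FIFO, block 0 is evicted when block 2 arrives even though it was just read, which is the difference from an LRU policy.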

**Key Features:**
- **Lazy Loading**: Model weights loaded from disk on-demand, avoiding loading entire model at once
- **Intelligent Caching**: CPU memory buffer uses FIFO strategy with configurable size
- **Multi-threaded Prefetching**: Uses multiple disk worker threads for parallel loading
- **Asynchronous Transfer**: Uses CUDA streams to overlap computation and data transfer
- **Swap Rotation**: Achieves continuous computation through position rotation, avoiding repeated loading/unloading
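
Multi-threaded prefetching through a priority task queue might look roughly like the following. It is a sketch using only the Python standard library; `load_blocks`, `STOP`, and the choice to give earlier keys higher priority are assumptions made for illustration.

```python
import queue
import threading

STOP = -1  # sentinel key telling a worker thread to exit


def load_blocks(keys, load_fn, num_disk_workers=2):
    """Load every key with worker threads; earlier keys are served first."""
    tasks = queue.PriorityQueue()
    cache = {}

    def worker():
        while True:
            _, key = tasks.get()
            try:
                if key == STOP:
                    return
                cache[key] = load_fn(key)  # disk -> CPU memory buffer
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_disk_workers)]
    for t in threads:
        t.start()
    for priority, key in enumerate(keys):  # lower number = dequeued earlier
        tasks.put((priority, key))
    for _ in threads:
        tasks.put((float("inf"), STOP))    # sorted after every real task
    for t in threads:
        t.join()
    return cache


print(sorted(load_blocks([2, 0, 1], lambda k: f"block_{k}.safetensors")))
```

Because the sentinels carry the lowest priority, a worker only sees one once the queue holds no more real loads, so all requested blocks are read before the threads exit.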



## ⚙️ Configuration Parameters

### GPU-CPU Offloading Configuration

```python
config = {
    "cpu_offload": True,
    "offload_ratio": 1.0,           # Offload ratio (0.0-1.0)
    "offload_granularity": "block", # Offload granularity: "block" or "phase"
    "lazy_load": False,             # Disable lazy loading
}
```

### Disk-CPU-GPU Offloading Configuration

```python
config = {
    "cpu_offload": True,
    "lazy_load": True,              # Enable lazy loading
    "offload_ratio": 1.0,           # Offload ratio
    "offload_granularity": "phase", # Recommended to use phase granularity
    "num_disk_workers": 2,          # Number of disk worker threads
    "offload_to_disk": True,        # Enable disk offloading
    "offload_path": ".",            # Disk offload path
}
```
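
Before launching a run it can help to sanity-check these dictionaries. The helper below is hypothetical and not part of Lightx2v; the rules it encodes (value ranges, defaults, and that lazy loading builds on CPU offloading) are assumptions read off the comments and strategy descriptions in this document.

```python
def validate_offload_config(config: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    ratio = config.get("offload_ratio", 1.0)  # assumed default
    if not 0.0 <= ratio <= 1.0:
        problems.append("offload_ratio must be within 0.0-1.0")
    if config.get("offload_granularity", "block") not in ("block", "phase"):
        problems.append('offload_granularity must be "block" or "phase"')
    if config.get("lazy_load", False):
        # Assumed: the disk strategy is layered on top of GPU-CPU offloading.
        if not config.get("cpu_offload", False):
            problems.append("lazy_load requires cpu_offload")
        if config.get("num_disk_workers", 1) < 1:
            problems.append("num_disk_workers must be at least 1")
    return problems


disk_config = {
    "cpu_offload": True,
    "lazy_load": True,
    "offload_ratio": 1.0,
    "offload_granularity": "phase",
    "num_disk_workers": 2,
}
print(validate_offload_config(disk_config))
```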

**Intelligent Cache Key Parameters:**
- `max_memory`: Controls CPU cache size, affects cache hit rate and memory usage
- `num_disk_workers`: Controls number of disk loading threads, affects prefetch speed
- `offload_granularity`: Controls cache granularity (block or phase), affects cache efficiency
  - `"block"`: Cache management in units of complete Transformer layers
  - `"phase"`: Cache management in units of individual computational components

Complete configuration files are available in the [config](https://github.com/ModelTC/lightx2v/tree/main/configs/offload) directory.

## 🎯 Usage Recommendations

```
╔═════════════════════════════════════════════════════════════════╗
║                        📋 Configuration Guide                   ║
╠═════════════════════════════════════════════════════════════════╣
║                                                                 ║
║  🔄 GPU-CPU Block/Phase Offloading:                            ║
║        Suitable for insufficient GPU VRAM (RTX 3090/4090 24G)  ║
║        but adequate system memory (>64/128G)                   ║
║  💾 Disk-CPU-GPU Block/Phase Offloading:                       ║
║        Suitable for insufficient GPU VRAM (RTX 3060/4060 8G)   ║
║        and system memory (16/32G)                              ║
║  🚫 No Offload: Suitable for high-end hardware configurations, ║
║        pursuing optimal performance                             ║
║                                                                 ║
╚═════════════════════════════════════════════════════════════════╝
```
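
The guide above reduces to a simple decision rule. The sketch below is a deliberate simplification: `suggest_offload_config` is a hypothetical helper, and comparing the model's weight size directly against VRAM and RAM ignores activations, other models in the pipeline, and operating-system overhead.

```python
def suggest_offload_config(model_gb: float, vram_gb: float, ram_gb: float) -> dict:
    """Pick a starting configuration from rough memory sizes (all in GB)."""
    if model_gb <= vram_gb:
        # Weights fit in VRAM: no offloading needed.
        return {"cpu_offload": False}
    if model_gb <= ram_gb:
        # Weights fit in system memory: GPU-CPU offloading.
        return {"cpu_offload": True, "lazy_load": False, "offload_granularity": "block"}
    # Neither fits: Disk-CPU-GPU offloading with lazy loading.
    return {"cpu_offload": True, "lazy_load": True, "offload_granularity": "phase"}


print(suggest_offload_config(model_gb=28, vram_gb=24, ram_gb=128))
```

Treat the result as a starting point and adjust after measuring actual memory use.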

## 🔍 Troubleshooting

### Common Issues and Solutions

1. **Disk I/O Bottleneck**
   ```
   Solution: Use NVMe SSD, increase num_disk_workers
   ```

2. **Memory Buffer Overflow**
   ```
   Solution: Increase max_memory or decrease num_disk_workers
   ```

3. **Loading Timeout**
   ```
   Solution: Check disk performance, optimize file system
   ```

**Note**: This offloading mechanism is designed specifically for Lightx2v. By overlapping computation with data transfer, it lowers the hardware requirements for large-model inference.