# Offload

## 📖 Overview

LightX2V implements a state-of-the-art parameter offloading mechanism designed for efficient large-model inference under limited hardware resources. By intelligently managing model weights across a multi-level memory hierarchy and dynamically scheduling them between GPU, CPU, and disk storage, the system achieves an excellent speed-memory balance.

**Core Features:**
- **Intelligent Granularity Management**: Supports both Block and Phase offloading granularities for flexible memory control
  - **Block Granularity**: Complete Transformer layers as management units, containing self-attention, cross-attention, feed-forward networks, etc., suitable for memory-sufficient environments
  - **Phase Granularity**: Individual computational components as management units, providing finer-grained memory control for memory-constrained deployment scenarios
- **Multi-level Storage Architecture**: GPU → CPU → Disk three-tier storage hierarchy with intelligent caching strategies
- **Asynchronous Parallel Processing**: CUDA stream-based asynchronous computation and data transfer for maximum hardware utilization
- **Persistent Storage Support**: SSD/NVMe disk storage support for ultra-large model inference deployment

## 🎯 Offloading Strategy Details

### Strategy 1: GPU-CPU Granularity Offloading

**Applicable Scenarios**: GPU VRAM insufficient but system memory resources adequate

**Technical Principle**: Establishes an efficient weight-scheduling mechanism between GPU and CPU memory, managing model weights in Block or Phase units. CUDA streams' asynchronous capabilities are leveraged to execute computation and data transfer in parallel. A Block contains a complete Transformer layer structure, while a Phase corresponds to an individual computational component within a layer.

**Granularity Selection Guide**:
- **Block Granularity**: Suitable for memory-sufficient environments, reduces management overhead and improves overall performance
- **Phase Granularity**: Suitable for memory-constrained environments, provides more flexible memory control and optimizes resource utilization
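
To make the two granularities concrete, below is a minimal sketch of how a single Transformer layer decomposes into offload units. The module names and sizes are illustrative assumptions, not the actual LightX2V class layout.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative layer; real blocks also carry norms, modulation, etc."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

block = TransformerBlock()

# "block" granularity: the whole layer is a single offload unit.
block_units = [block]

# "phase" granularity: each computational component is its own unit,
# so peak VRAM residency is one phase rather than one full layer.
phase_units = [block.self_attn, block.cross_attn, block.ffn]

for unit in phase_units:
    unit.to("cuda")  # load one unit's weights into VRAM
    # ... run this unit's computation here ...
    unit.to("cpu")   # release VRAM before loading the next unit
```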

<div align="center">
<img alt="GPU-CPU Block/Phase Offloading Workflow" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig1_en.png" width="75%">
</div>

<div align="center">
<img alt="Swap Mechanism Core Concept" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig2_en.png" width="75%">
</div>

<div align="center">
<img alt="Asynchronous Execution Flow" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig3_en.png" width="75%">
</div>

**Technical Features:**
- **Multi-stream Parallel Architecture**: Employs three CUDA streams with different priorities to parallelize computation and transfer
  - Compute Stream (priority=-1): High priority, responsible for current computation tasks
  - GPU Load Stream (priority=0): Medium priority, responsible for weight prefetching from CPU to GPU
  - CPU Load Stream (priority=0): Medium priority, responsible for weight offloading from GPU to CPU
- **Intelligent Prefetching Mechanism**: Predictively loads next Block/Phase based on computation progress
- **Efficient Cache Management**: Maintains weight cache pool in CPU memory for improved access efficiency
- **Stream Synchronization Guarantee**: Ensures temporal correctness of data transfer and computation
- **Position Rotation Optimization**: Achieves continuous computation through Swap operations, avoiding repeated loading/unloading
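
A minimal PyTorch sketch of this three-stream schedule is shown below. The double-buffered slots, tensor shapes, and stand-in matmul are assumptions made for illustration; this is not the LightX2V implementation.

```python
import torch

# Lower priority value = higher scheduling priority (CUDA convention).
compute_stream = torch.cuda.Stream(priority=-1)   # runs the current unit's math
gpu_load_stream = torch.cuda.Stream(priority=0)   # prefetches weights CPU -> GPU
cpu_load_stream = torch.cuda.Stream(priority=0)   # would write weights GPU -> CPU;
                                                  # unused here since the weights are
                                                  # read-only and slots are just reused

# Pinned CPU memory is required for truly asynchronous host<->device copies.
cpu_blocks = [torch.randn(1024, 1024).pin_memory() for _ in range(8)]
gpu_slots = [torch.empty(1024, 1024, device="cuda") for _ in range(2)]  # double buffer

x = torch.randn(1, 1024, device="cuda")
gpu_slots[0].copy_(cpu_blocks[0])  # synchronous warm-up load of the first block

for i in range(len(cpu_blocks)):
    cur, nxt = gpu_slots[i % 2], gpu_slots[(i + 1) % 2]

    # The slot being prefetched into was last read by block i-1's computation.
    gpu_load_stream.wait_stream(compute_stream)
    if i + 1 < len(cpu_blocks):
        with torch.cuda.stream(gpu_load_stream):
            nxt.copy_(cpu_blocks[i + 1], non_blocking=True)

    with torch.cuda.stream(compute_stream):
        x = x @ cur  # stand-in for the real Block/Phase computation

    # The next iteration's computation must wait for its prefetch to land.
    compute_stream.wait_stream(gpu_load_stream)

torch.cuda.synchronize()
```

Each iteration the two GPU slots swap roles between "computing" and "being prefetched into", which is the position-rotation (Swap) optimization listed above.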

### Strategy 2: Disk-CPU-GPU Three-Level Offloading (Lazy Loading)

**Applicable Scenarios**: Both GPU VRAM and system memory are insufficient (heavily constrained environments)

**Technical Principle**: Introduces a disk storage layer on top of Strategy 1, constructing a Disk→CPU→GPU three-level storage architecture. The CPU acts as a configurable intelligent cache pool, making the strategy suitable for a wide range of memory-constrained deployment environments.

<div align="center">
<img alt="Disk-CPU-GPU Three-Level Offloading Architecture" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig4_en.png" width="75%">
</div>

<div align="center">
<img alt="Complete Workflow" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig5_en.png" width="75%">
</div>

**Execution Steps Details:**
1. **Disk Storage Layer**: Model weights organized by Block on SSD/NVMe, each Block corresponding to one .safetensors file
2. **Task Scheduling Layer**: Priority queue-based intelligent scheduling system for disk loading task assignment
3. **Asynchronous Loading Layer**: Multi-threaded parallel reading of weight files from disk to CPU memory buffer
4. **Intelligent Cache Layer**: CPU memory buffer using FIFO strategy for cache management with dynamic size configuration
5. **Cache Hit Optimization**: Direct transfer to GPU when weights are already in cache, avoiding disk I/O overhead
6. **Prefetch Transfer Layer**: Weights in cache asynchronously transferred to GPU memory via GPU load stream
7. **Compute Execution Layer**: Weights on GPU perform computation (compute stream) while background continues prefetching next Block/Phase
8. **Position Rotation Layer**: Swap rotation after computation completion for continuous computation flow
9. **Memory Management Layer**: Automatic eviction of the earliest-loaded weight Blocks/Phases (FIFO) when the CPU cache is full
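
The sketch below illustrates steps 1-5 and 9: a priority task queue, multiple disk worker threads loading per-Block `.safetensors` files, and a FIFO-evicted CPU cache. The file naming scheme, class structure, and block-count cap (standing in for a byte-based `max_memory`) are illustrative assumptions, not the LightX2V implementation.

```python
import queue
import threading
from collections import OrderedDict
from safetensors.torch import load_file  # one .safetensors file per Block

class BlockCache:
    """FIFO CPU cache fed by disk worker threads (illustrative sketch)."""

    def __init__(self, offload_path=".", max_blocks=4, num_disk_workers=2):
        self.path = offload_path
        self.max_blocks = max_blocks        # stands in for a byte-based max_memory
        self.cache = OrderedDict()          # insertion order = FIFO eviction order
        self.pending = {}                   # block index -> threading.Event
        self.lock = threading.Lock()
        self.tasks = queue.PriorityQueue()  # (priority, block index)
        for _ in range(num_disk_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def prefetch(self, idx, priority=0):
        with self.lock:
            if idx in self.cache or idx in self.pending:
                return                      # cache hit or load already in flight
            self.pending[idx] = threading.Event()
        self.tasks.put((priority, idx))

    def _worker(self):
        while True:
            _, idx = self.tasks.get()
            weights = load_file(f"{self.path}/block_{idx}.safetensors")
            with self.lock:
                while len(self.cache) >= self.max_blocks:
                    self.cache.popitem(last=False)  # evict earliest-loaded entry
                self.cache[idx] = weights
                self.pending.pop(idx).set()

    def get(self, idx):
        self.prefetch(idx, priority=-1)     # on a miss, load with high priority
        with self.lock:
            event = self.pending.get(idx)
        if event is not None:
            event.wait()
        with self.lock:
            return self.cache[idx]          # a real cache must pin in-use entries

cache = BlockCache(offload_path=".", max_blocks=4, num_disk_workers=2)
cache.prefetch(1)       # warm the cache in the background
weights = cache.get(0)  # blocks until block_0.safetensors is in memory
```

`get()` returns CPU tensors; a real implementation would pin these buffers and hand them to the GPU load stream (step 6), and must also protect in-use entries from eviction.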

**Technical Features:**
- **On-demand Loading Mechanism**: Model weights loaded from disk only when needed, avoiding loading entire model at once
- **Configurable Cache Strategy**: CPU memory buffer supports FIFO strategy with dynamically adjustable size
- **Multi-threaded Parallel Loading**: Leverages multiple disk worker threads for parallel data loading
- **Asynchronous Transfer Optimization**: CUDA stream-based asynchronous data transfer for maximum hardware utilization
- **Continuous Computation Guarantee**: Achieves continuous computation through position rotation mechanism, avoiding repeated loading/unloading operations

## ⚙️ Configuration Parameters Details

### GPU-CPU Offloading Configuration

```python
config = {
    "cpu_offload": True,            # Enable CPU offloading functionality
    "offload_ratio": 1.0,           # Offload ratio (0.0-1.0), 1.0 means complete offloading
    "offload_granularity": "block", # Offload granularity selection: "block" or "phase"
    "lazy_load": False,             # Disable lazy loading mode
}
```

### Disk-CPU-GPU Offloading Configuration

```python
config = {
    "cpu_offload": True,            # Enable CPU offloading functionality
    "lazy_load": True,              # Enable lazy loading mode
    "offload_ratio": 1.0,           # Offload ratio setting
    "offload_granularity": "phase", # Recommended to use phase granularity for better memory control
    "num_disk_workers": 2,          # Number of disk worker threads
    "offload_to_disk": True,        # Enable disk offloading functionality
    "offload_path": ".",            # Disk offload path configuration
}
```

**Key Parameters of the Intelligent Cache:**
- `max_memory`: Upper limit of the CPU cache size in lazy-load mode; directly affects cache hit rate and memory usage
- `num_disk_workers`: Number of disk-loading threads; affects data prefetch speed
- `offload_granularity`: Cache management granularity; affects cache efficiency and memory utilization
  - `"block"`: Cache management in units of complete Transformer layers, suitable for memory-sufficient environments
  - `"phase"`: Cache management in units of individual computational components, suitable for memory-constrained environments

Complete configuration files can be found in the [Official Configuration Repository](https://github.com/ModelTC/lightx2v/tree/main/configs/offload).

## 🎯 Deployment Strategy Recommendations

- 🔄 GPU-CPU Granularity Offloading: Suitable for insufficient GPU VRAM (RTX 3090/4090 24G) but adequate system memory (>64G)
  - Advantages: Balances performance and memory usage, suitable for medium-scale model inference

- 💾 Disk-CPU-GPU Three-Level Offloading: Suitable for limited GPU VRAM (e.g., 8G cards such as the RTX 3060) and insufficient system memory (16-32G)
  - Advantages: Supports ultra-large model inference with the lowest hardware threshold

- 🚫 No Offload Mode: Suitable for high-end hardware configurations pursuing optimal inference performance
  - Advantages: Maximizes computational efficiency, suitable for latency-sensitive application scenarios
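
The recommendations above can be folded into a small, hypothetical selection helper. The thresholds mirror the bullet points (the 40G "no offload" cutoff is an arbitrary stand-in for high-end hardware), and `choose_offload_config` is not an official LightX2V API.

```python
import torch
import psutil  # third-party package for querying system memory

def choose_offload_config() -> dict:
    """Pick an offload strategy from the recommendations above (illustrative)."""
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 2**30
    ram_gb = psutil.virtual_memory().total / 2**30

    if vram_gb >= 40:            # high-end hardware: no offload needed
        return {"cpu_offload": False}

    if ram_gb > 64:              # adequate system memory: GPU-CPU offloading
        return {
            "cpu_offload": True,
            "offload_ratio": 1.0,
            "offload_granularity": "block",
            "lazy_load": False,
        }

    return {                     # constrained memory: three-level offloading
        "cpu_offload": True,
        "offload_ratio": 1.0,
        "offload_granularity": "phase",
        "lazy_load": True,
        "offload_to_disk": True,
        "offload_path": ".",
        "num_disk_workers": 2,
    }
```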

## 🔍 Troubleshooting and Solutions

### Common Performance Issues and Optimization Strategies

1. **Disk I/O Performance Bottleneck**
   - Problem Symptoms: Slow model loading, high inference latency
   - Solutions:
     - Upgrade to NVMe SSD storage devices
     - Increase the `num_disk_workers` parameter value
     - Optimize file system configuration

2. **Memory Buffer Overflow**
   - Problem Symptoms: Insufficient system memory, abnormal program exit
   - Solutions:
     - Decrease the `max_memory` parameter value
     - Decrease the `num_disk_workers` parameter value
     - Set `offload_granularity` to `"phase"`

3. **Model Loading Timeout**
   - Problem Symptoms: Timeout errors during model loading process
   - Solutions:
     - Check disk read/write performance
     - Optimize file system parameters
     - Verify storage device health status
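
For the disk-related items (1 and 3), a quick sequential-read benchmark helps distinguish a genuine storage bottleneck from a configuration problem. This is a generic sketch with a placeholder file name, not a LightX2V utility; run it on a cold OS page cache for meaningful numbers.

```python
import time
from pathlib import Path

def measure_read_speed(path: str, chunk_mb: int = 64) -> float:
    """Rough sequential-read benchmark; returns throughput in MB/s."""
    size = Path(path).stat().st_size
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_mb * 2**20):  # stream the file in fixed-size chunks
            pass
    elapsed = time.perf_counter() - start
    return size / 2**20 / elapsed

# Example: benchmark one Block file on the offload_path volume.
# NVMe SSDs typically sustain well over 1000 MB/s; HDDs far less.
# print(measure_read_speed("block_0.safetensors"))
```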

## 📚 Technical Summary

LightX2V's offloading mechanism is designed for modern AI inference scenarios, fully leveraging the GPU's asynchronous computing capabilities and a multi-level storage architecture. Through intelligent weight management and efficient parallel processing, it significantly lowers the hardware threshold for large-model inference, giving developers a flexible and efficient deployment solution.

**Technical Highlights:**
- 🚀 **Performance Optimization**: Asynchronous parallel processing maximizes hardware utilization
- 💾 **Intelligent Memory**: Multi-level caching strategies achieve optimal memory management
- 🔧 **Flexible Configuration**: Supports flexible configuration of multiple granularities and strategies
- 🛡️ **Stable and Reliable**: Comprehensive error handling and fault recovery mechanisms