Commit d0ec1aaa authored by gushiqiao

Update docs

parent 38c40fa0
# Lightx2v Low-Resource Deployment Guide
## 📋 Overview
This guide is specifically designed for hardware resource-constrained environments, particularly configurations with **8GB VRAM + 16/32GB RAM**, providing detailed instructions on how to successfully run Lightx2v 14B models for 480p and 720p video generation.
Lightx2v is a powerful video generation model, but it requires careful optimization to run smoothly in resource-constrained environments. This guide provides a complete solution from hardware selection to software configuration, ensuring you can achieve the best video generation experience under limited hardware conditions.
## 🎯 Target Hardware Configuration
### Recommended Hardware Specifications
**GPU Requirements**:
- **VRAM**: 8GB (RTX 3060/3070/4060/4060Ti, etc.)
- **Architecture**: NVIDIA graphics cards with CUDA support
**System Memory**:
- **Minimum**: 16GB DDR4
- **Recommended**: 32GB DDR4/DDR5
- **Memory Speed**: 3200MHz or higher recommended
**Storage Requirements**:
- **Type**: NVMe SSD strongly recommended
- **Capacity**: At least 50GB available space
- **Speed**: Read speed of 3000MB/s or higher recommended
**CPU Requirements**:
- **Cores**: 8 cores or more recommended
- **Frequency**: 3.0GHz or higher recommended
- **Architecture**: Support for AVX2 instruction set
## ⚙️ Core Optimization Strategies
### 1. Environment Optimization
Before running Lightx2v, it's recommended to set the following environment variables to optimize performance:
```bash
# CUDA memory allocation optimization
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Enable CUDA Graph mode to improve inference performance
export ENABLE_GRAPH_MODE=true
# Use BF16 precision for inference to reduce VRAM usage (default FP32 precision)
export DTYPE=BF16
```
**Optimization Details**:
- `expandable_segments:True`: Allows dynamic expansion of CUDA memory segments, reducing memory fragmentation
- `ENABLE_GRAPH_MODE=true`: Enables CUDA Graph to reduce kernel launch overhead
- `DTYPE=BF16`: Uses BF16 precision to reduce VRAM usage while maintaining quality
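The allocator option must be set before the first CUDA allocation, and BF16 only helps on GPUs with native support. The following minimal sketch (an illustration, not part of Lightx2v; `ENABLE_GRAPH_MODE` and `DTYPE` are read by Lightx2v's launcher, not by PyTorch itself) sanity-checks the environment before launching:

```python
import os
import torch

# Verify the allocator option is in place before any CUDA allocation happens
assert "expandable_segments" in os.environ.get("PYTORCH_CUDA_ALLOC_CONF", ""), \
    "Set PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation"

# Warn if BF16 was requested on a GPU without native BF16 support
if os.environ.get("DTYPE") == "BF16" and not torch.cuda.is_bf16_supported():
    print("Warning: this GPU lacks native BF16 support; expect reduced performance")
```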
### 2. Quantization Strategy
Quantization is a key optimization technique in low-resource environments, reducing memory usage by lowering model precision.
#### Quantization Scheme Comparison
**FP8 Quantization** (Recommended for RTX 40 series):
```python
# Suitable for GPUs supporting FP8, providing better precision
dit_quant_scheme = "fp8" # DIT model quantization
t5_quant_scheme = "fp8" # T5 text encoder quantization
clip_quant_scheme = "fp8" # CLIP visual encoder quantization
```
**INT8 Quantization** (Universal solution):
```python
# Suitable for all GPUs, minimal memory usage
dit_quant_scheme = "int8" # 8-bit integer quantization
t5_quant_scheme = "int8" # Text encoder quantization
clip_quant_scheme = "int8" # Visual encoder quantization
```
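Since native FP8 matmul kernels generally require Ada (compute capability 8.9) or newer GPUs, the scheme can be chosen at runtime from the device's compute capability. A minimal sketch, assuming the `*_quant_scheme` variables feed your configuration as shown above:

```python
import torch

# Ada (RTX 40 series) reports compute capability 8.9; Hopper reports 9.0.
# Native FP8 kernels need at least that; older GPUs fall back to INT8.
major, minor = torch.cuda.get_device_capability()
scheme = "fp8" if (major, minor) >= (8, 9) else "int8"
dit_quant_scheme = scheme   # DiT model quantization
t5_quant_scheme = scheme    # T5 text encoder quantization
clip_quant_scheme = scheme  # CLIP visual encoder quantization
```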
### 3. Efficient Operator Selection Guide
Choosing the right operators can significantly improve inference speed and reduce memory usage.
#### Attention Operator Selection
**Recommended Priority**:
1. **[Sage Attention](https://github.com/thu-ml/SageAttention)** (Highest priority)
2. **[Flash Attention](https://github.com/Dao-AILab/flash-attention)** (Universal solution)
#### Matrix Multiplication Operator Selection
**ADA Architecture GPUs** (RTX 40 series):
Recommended priority:
1. **[q8-kernel](https://github.com/KONAKONA666/q8_kernels)** (Highest performance, ADA architecture only)
2. **[sglang-kernel](https://github.com/sgl-project/sglang/tree/main/sgl-kernel)** (Balanced solution)
3. **[vllm-kernel](https://github.com/vllm-project/vllm)** (Universal solution)
**Other Architecture GPUs**:
1. **[sglang-kernel](https://github.com/sgl-project/sglang/tree/main/sgl-kernel)** (Recommended)
2. **[vllm-kernel](https://github.com/vllm-project/vllm)** (Alternative)
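At runtime this preference order can be expressed as a simple import fallback. A sketch using the package names of the linked projects (`sageattention`, `flash_attn`); adjust to whatever is installed in your environment:

```python
# Prefer Sage Attention, fall back to Flash Attention, else use PyTorch SDPA.
try:
    from sageattention import sageattn as attention_op
    print("Using Sage Attention")
except ImportError:
    try:
        from flash_attn import flash_attn_func as attention_op
        print("Using Flash Attention")
    except ImportError:
        attention_op = None  # fall back to torch.nn.functional.scaled_dot_product_attention
        print("Falling back to PyTorch SDPA")
```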
### 4. Parameter Offloading Strategy
Parameter offloading allows the model to dynamically schedule parameters across GPU, CPU, and disk, breaking through VRAM limitations.
#### Three-Level Offloading Architecture
```python
# Disk-CPU-GPU three-level offloading configuration
cpu_offload = True                # Enable CPU offloading
t5_cpu_offload = True             # Enable T5 encoder CPU offloading
offload_granularity = "phase"     # Fine-grained offloading for the DiT model
t5_offload_granularity = "block"  # Fine-grained offloading for the T5 encoder
lazy_load = True                  # Enable the lazy loading mechanism
num_disk_workers = 2              # Number of disk I/O worker threads
```
#### Offloading Strategy Details
**Lazy Loading Mechanism**:
- Model parameters are loaded from disk to CPU on demand
- Reduces runtime memory usage
- Supports large models running with limited memory
**Disk Storage Optimization**:
- Use high-speed SSD to store model parameters
- Store model files grouped by blocks
- Refer to the conversion script [documentation](https://github.com/ModelTC/lightx2v/tree/main/tools/convert/readme_zh.md) and specify the `--save_by_block` parameter during conversion (see the sketch below)
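Conceptually, block-wise storage lets the runtime pull a single Transformer block from disk on demand. A minimal sketch of that idea using the `safetensors` library; the `block_{i}.safetensors` naming follows the converted layout, while the directory name is illustrative:

```python
from safetensors.torch import load_file

def load_block(block_idx: int, model_dir: str = "wan_i2v_by_block") -> dict:
    """Read one Transformer block's weights from disk into CPU memory."""
    return load_file(f"{model_dir}/block_{block_idx}.safetensors")  # dict[str, torch.Tensor]
```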
### 5. VRAM Optimization Techniques
VRAM optimization strategies for 720p video generation.
#### CUDA Memory Management
```python
# CUDA memory cleanup configuration
clean_cuda_cache = True # Timely cleanup of GPU cache
rotary_chunk = True # Rotary position encoding chunked computation
rotary_chunk_size = 100 # Chunk size, adjustable based on VRAM
```
#### Chunked Computation Strategy
**Rotary Position Encoding Chunking**:
- Processes long sequences in small chunks
- Reduces peak VRAM usage
- Maintains computational precision (a minimal sketch follows)
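A minimal sketch of chunked rotary application, assuming precomputed `cos`/`sin` tables; because the rotation is elementwise per position, chunking lowers peak memory without changing the result:

```python
import torch

def apply_rotary_chunked(x, cos, sin, chunk_size=100):
    """Apply rotary position embedding chunk by chunk along the sequence."""
    # x: [seq_len, heads, dim]; cos/sin: [seq_len, dim] precomputed tables
    out = torch.empty_like(x)
    for start in range(0, x.size(0), chunk_size):
        end = min(start + chunk_size, x.size(0))
        xc = x[start:end]
        c = cos[start:end].unsqueeze(1)          # broadcast over heads
        s = sin[start:end].unsqueeze(1)
        x1, x2 = xc.chunk(2, dim=-1)             # rotate-half convention
        out[start:end] = xc * c + torch.cat((-x2, x1), dim=-1) * s
    return out
```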
### 6. VAE Optimization
VAE (Variational Autoencoder) is a key component in video generation, and optimizing VAE can significantly improve performance.
#### VAE Chunked Inference
```python
# VAE optimization configuration
use_tiling_vae = True # Enable VAE chunked inference
```
#### [Lightweight VAE](https://github.com/madebyollin/taehv/blob/main/taew2_1.pth)
```python
# VAE optimization configuration
use_tiny_vae = True # Use lightweight VAE
```
**VAE Optimization Effects** (a tiled-decoding sketch follows):
- Standard VAE: baseline performance, 100% quality retention
- Standard VAE with tiling: lower VRAM usage, longer inference time, 100% quality retention
- Lightweight VAE: extremely low VRAM usage, with some loss of video quality
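For intuition, here is a simplified sketch of tiled VAE decoding. Real implementations overlap tiles and blend the seams to hide boundary artifacts; this version uses disjoint tiles for clarity, and the 8x spatial upscale factor of `vae.decode` is an assumption:

```python
import torch

@torch.no_grad()
def decode_tiled(vae, latents, tile=32, scale=8):
    """Decode video latents tile by tile to cap peak VRAM usage."""
    b, c, f, h, w = latents.shape            # batch, channels, frames, height, width
    out = None
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            dec = vae.decode(latents[..., y:y + tile, x:x + tile])
            if out is None:                  # allocate the full-size canvas lazily
                out = torch.zeros(b, dec.shape[1], dec.shape[2], h * scale, w * scale,
                                  device=dec.device, dtype=dec.dtype)
            out[..., y * scale:(y + tile) * scale,
                     x * scale:(x + tile) * scale] = dec
    return out
```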
### 7. Model Selection Strategy
Choosing the right model version is crucial for low-resource environments.
#### Recommended Model Comparison
**Distilled Models** (Strongly recommended):
- **[Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v](https://huggingface.co/lightx2v/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v)**
- **[Wan2.1-I2V-14B-720P-StepDistill-CfgDistill-Lightx2v](https://huggingface.co/lightx2v/Wan2.1-I2V-14B-720P-StepDistill-CfgDistill-Lightx2v)**
#### Performance Optimization Suggestions
When using the above distilled models, you can further optimize performance:
- Disable CFG: `"enable_cfg": false`
- Reduce inference steps: `"infer_step": 4`
- Reference configuration files: [config](https://github.com/ModelTC/LightX2V/tree/main/configs/distill) (a minimal excerpt is sketched below)
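A minimal excerpt of what such a distilled-model configuration might contain, mirroring the two settings above (see the linked `configs/distill` files for the authoritative versions):

```python
config = {
    "enable_cfg": False,  # distilled models no longer need classifier-free guidance
    "infer_step": 4,      # step-distilled checkpoints need only a few denoising steps
}
```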
## 🚀 Complete Configuration Examples
### Pre-configured Templates
- **[14B Model 480p Video Generation Configuration](https://github.com/ModelTC/lightx2v/tree/main/configs/offload/disk/wan_i2v_phase_lazy_load_480p.json)**
- **[14B Model 720p Video Generation Configuration](https://github.com/ModelTC/lightx2v/tree/main/configs/offload/disk/wan_i2v_phase_lazy_load_720p.json)**
- **[1.3B Model 720p Video Generation Configuration](https://github.com/ModelTC/LightX2V/tree/main/configs/offload/block/wan_t2v_1_3b.json)**
  - The inference bottleneck for 1.3B models is the T5 encoder, so this configuration file specifically optimizes T5
**[Launch Script](https://github.com/ModelTC/LightX2V/tree/main/scripts/wan/run_wan_i2v_lazy_load.sh)**
## 📚 Reference Resources
- [Parameter Offloading Mechanism Documentation](../method_tutorials/offload.md) - In-depth understanding of offloading technology principles
- [Quantization Technology Guide](../method_tutorials/quantization.md) - Detailed explanation of quantization technology
- [Gradio Deployment Guide](deploy_gradio.md) - Detailed Gradio deployment instructions
## ⚠️ Important Notes
1. **Hardware Requirements**: Ensure your hardware meets minimum configuration requirements
2. **Driver Version**: Recommend using the latest NVIDIA drivers (535+)
3. **CUDA Version**: Ensure CUDA version is compatible with PyTorch (recommend CUDA 11.8+)
4. **Storage Space**: Reserve sufficient disk space for model caching (at least 50GB)
5. **Network Environment**: Stable network connection required for initial model download
6. **Environment Variables**: Be sure to set the recommended environment variables to optimize performance
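A quick way to verify the GPU and CUDA prerequisites (notes 1 and 3) from Python before the first run, using standard PyTorch APIs:

```python
import torch

print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)      # note 3
props = torch.cuda.get_device_properties(0)
print("GPU:", props.name, f"| VRAM: {props.total_memory / 1e9:.1f} GB")  # note 1
```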
**Technical Support**: If you encounter issues, please submit an Issue to the project repository.
# Lightx2v Parameter Offloading Mechanism Documentation
## 📖 Overview
Lightx2v implements an advanced parameter offloading mechanism designed for efficient large-model inference under limited hardware resources. The system achieves an excellent speed-memory balance by intelligently managing model weights across a GPU → CPU → disk storage hierarchy, dynamically scheduling them between the three tiers.
**Core Features:**
- **Intelligent Granularity Management**: Supports both Block and Phase offloading granularities for flexible memory control
  - **Block Granularity**: Complete Transformer layers as management units (self-attention, cross-attention, feed-forward network, etc.); larger units that reduce management overhead, suited to memory-sufficient environments
  - **Phase Granularity**: Individual computational components within a layer as management units; provides finer-grained memory control for memory-constrained deployments
- **Multi-level Storage Architecture**: GPU → CPU → Disk three-tier storage hierarchy with intelligent caching strategies
- **Asynchronous Parallel Processing**: CUDA stream-based overlap of computation and data transfer for maximum hardware utilization
- **Persistent Storage Support**: SSD/NVMe disk storage for ultra-large model inference deployment
## 🎯 Offloading Strategies
### Strategy 1: GPU-CPU Block/Phase Offloading
**Applicable Scenarios**: GPU VRAM is insufficient but system memory is adequate.
**Working Principle**: Manages model weights in Block or Phase units between GPU and CPU memory, using CUDA streams to overlap computation with data transfer. A Block contains a complete Transformer layer, while a Phase is an individual computational component within a Block.
**Granularity Selection Guide**:
- **Block Granularity**: Suited to memory-sufficient environments; reduces management overhead and improves overall performance
- **Phase Granularity**: Suited to memory-constrained environments; provides more flexible memory control and better resource utilization
<div align="center">
<img alt="GPU-CPU Block/Phase Offloading Workflow" src="../../../../assets/figs/offload/fig1_en.png" width="75%">
</div>
<div align="center">
<img alt="Swap Mechanism Core Concept" src="../../../../assets/figs/offload/fig2_en.png" width="75%">
</div>
<div align="center">
<img alt="Asynchronous Execution Flow" src="../../../../assets/figs/offload/fig3_en.png" width="75%">
</div>
**Technical Features:**
- **Multi-stream Parallel Architecture**: Employs three CUDA streams with different priorities to parallelize computation and transfer
- Compute Stream (priority=-1): High priority, responsible for current computation tasks
- GPU Load Stream (priority=0): Medium priority, responsible for weight prefetching from CPU to GPU
- CPU Load Stream (priority=0): Medium priority, responsible for weight offloading from GPU to CPU
- **Intelligent Prefetching Mechanism**: Predictively loads next Block/Phase based on computation progress
- **Efficient Cache Management**: Maintains weight cache pool in CPU memory for improved access efficiency
- **Stream Synchronization Guarantee**: Ensures temporal correctness of data transfer and computation
- **Position Rotation Optimization**: Achieves continuous computation through Swap operations, avoiding repeated loading/unloading
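The three-stream layout can be illustrated with a conceptual sketch. The `blocks` container and its `to_gpu`/`to_cpu` methods are illustrative stand-ins, not Lightx2v's actual internals, and the per-iteration synchronize is a simplification of the event-based per-stream synchronization described above:

```python
import torch

compute_stream = torch.cuda.Stream(priority=-1)   # high priority: current computation
gpu_load_stream = torch.cuda.Stream(priority=0)   # prefetch weights CPU -> GPU
cpu_load_stream = torch.cuda.Stream(priority=0)   # offload weights GPU -> CPU

def run_blocks(blocks, hidden):
    blocks[0].to_gpu()  # warm-up: the first block must be resident before the loop
    for n, block in enumerate(blocks):
        if n + 1 < len(blocks):
            with torch.cuda.stream(gpu_load_stream):   # prefetch block N+1
                blocks[n + 1].to_gpu(non_blocking=True)
        if n > 0:
            with torch.cuda.stream(cpu_load_stream):   # offload block N-1
                blocks[n - 1].to_cpu(non_blocking=True)
        with torch.cuda.stream(compute_stream):        # compute block N
            hidden = block(hidden)
        torch.cuda.synchronize()  # simplified; real code syncs streams with CUDA events
    return hidden
```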
### Strategy 2: Disk-CPU-GPU Three-Level Offloading (Lazy Loading)
**Applicable Scenarios**: Both GPU VRAM and system memory are insufficient.
**Technical Principle**: Introduces a disk storage layer on top of Strategy 1, constructing a Disk → CPU → GPU three-level storage architecture. The CPU serves as a configurable intelligent cache pool, making this strategy suitable for a range of memory-constrained deployment environments.
<div align="center">
<img alt="Disk-CPU-GPU Three-Level Offloading Architecture" src="../../../../assets/figs/offload/fig4_en.png" width="75%">
</div>
<div align="center">
<img alt="Complete Workflow" src="../../../../assets/figs/offload/fig5_en.png" width="75%">
</div>
**Execution Steps Details:**
1. **Disk Storage Layer**: Model weights organized by Block on SSD/NVMe, each Block corresponding to one .safetensors file
2. **Task Scheduling Layer**: Priority queue-based intelligent scheduling system for disk loading task assignment
3. **Asynchronous Loading Layer**: Multi-threaded parallel reading of weight files from disk to CPU memory buffer
4. **Intelligent Cache Layer**: CPU memory buffer using FIFO strategy for cache management with dynamic size configuration
5. **Cache Hit Optimization**: Direct transfer to GPU when weights are already in cache, avoiding disk I/O overhead
6. **Prefetch Transfer Layer**: Weights in cache asynchronously transferred to GPU memory via GPU load stream
7. **Compute Execution Layer**: Weights on GPU perform computation (compute stream) while background continues prefetching next Block/Phase
8. **Position Rotation Layer**: Swap rotation after computation completion for continuous computation flow
9. **Memory Management Layer**: Automatic eviction of earliest used weight Blocks/Phases when CPU cache is full
**Technical Features:**
- **On-demand Loading Mechanism**: Model weights loaded from disk only when needed, avoiding loading entire model at once
- **Configurable Cache Strategy**: CPU memory buffer supports FIFO strategy with dynamically adjustable size
- **Multi-threaded Parallel Loading**: Leverages multiple disk worker threads for parallel data loading
- **Asynchronous Transfer Optimization**: CUDA stream-based asynchronous data transfer for maximum hardware utilization
- **Continuous Computation Guarantee**: Achieves continuous computation through position rotation mechanism, avoiding repeated loading/unloading operations
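The interplay of the disk worker pool and the FIFO cache can be sketched as follows. All names are illustrative rather than Lightx2v's internals, and duplicate prefetches of in-flight blocks are possible in this simplified version:

```python
import queue
import threading
from collections import OrderedDict
from safetensors.torch import load_file

class BlockCache:
    """FIFO CPU cache fed by a pool of disk worker threads (illustrative)."""

    def __init__(self, path, max_blocks=4, num_disk_workers=2):
        self.path, self.max_blocks = path, max_blocks
        self.cache = OrderedDict()              # insertion order == FIFO order
        self.lock = threading.Lock()
        self.tasks = queue.PriorityQueue()      # entries: (priority, block_idx)
        for _ in range(num_disk_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            _, idx = self.tasks.get()
            weights = load_file(f"{self.path}/block_{idx}.safetensors")  # disk -> CPU
            with self.lock:
                self.cache[idx] = weights
                if len(self.cache) > self.max_blocks:
                    self.cache.popitem(last=False)   # evict the oldest block
            self.tasks.task_done()

    def prefetch(self, idx, priority=0):
        with self.lock:
            hit = idx in self.cache
        if not hit:                              # cache miss: schedule a disk read
            self.tasks.put((priority, idx))
```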
## ⚙️ Configuration Parameters
### GPU-CPU Offloading Configuration
```python
config = {
    "cpu_offload": True,             # Enable CPU offloading
    "offload_ratio": 1.0,            # Offload ratio (0.0-1.0); 1.0 means complete offloading
    "offload_granularity": "block",  # Offload granularity: "block" or "phase"
    "lazy_load": False,              # Disable lazy loading
}
```
### Disk-CPU-GPU Offloading Configuration (Lazy Loading)
```python
config = {
    "cpu_offload": True,             # Enable CPU offloading
    "lazy_load": True,               # Enable lazy loading
    "offload_ratio": 1.0,            # Offload ratio (0.0-1.0)
    "offload_granularity": "phase",  # "phase" granularity recommended for finer memory control
    "num_disk_workers": 2,           # Number of disk worker threads
    "offload_to_disk": True,         # Enable disk offloading
    "offload_path": ".",             # Disk offload path
}
```
**Intelligent Cache Key Parameters:**
- `max_memory`: Upper limit on the CPU cache size; directly affects cache hit rate and memory usage
- `num_disk_workers`: Number of disk loading threads; affects data prefetch speed
- `offload_granularity`: Cache management granularity; affects cache efficiency and memory utilization
  - `"block"`: Cache managed in units of complete Transformer layers; suitable for memory-sufficient environments
  - `"phase"`: Cache managed in units of individual computational components; suitable for memory-constrained environments
Detailed configuration files are available in the [official configuration repository](https://github.com/ModelTC/lightx2v/tree/main/configs/offload).
## 🎯 Deployment Strategy Recommendations
- 🔄 GPU-CPU Offloading: suitable when GPU VRAM is insufficient (e.g., RTX 3090/4090, 24GB) but system memory is ample (>64GB)
  - Advantages: balances performance and memory usage; suitable for medium-scale model inference
- 💾 Disk-CPU-GPU Three-Level Offloading: suitable when both GPU VRAM (e.g., RTX 3060/4060, 8GB) and system memory (16-32GB) are limited
  - Advantages: supports ultra-large model inference with the lowest hardware threshold
- 🚫 No Offloading: suitable for high-end hardware configurations pursuing optimal inference performance
  - Advantages: maximizes computational efficiency; suitable for latency-sensitive applications
## 🔍 Troubleshooting
### Common Issues and Solutions
1. **Disk I/O Performance Bottleneck**
   - Symptoms: slow model loading, high inference latency
   - Solutions: upgrade to an NVMe SSD; increase `num_disk_workers`; optimize the file system configuration
2. **Memory Buffer Overflow**
   - Symptoms: insufficient system memory, abnormal program exit
   - Solutions: increase `max_memory`; decrease `num_disk_workers`; set `offload_granularity` to `"phase"`
3. **Model Loading Timeout**
   - Symptoms: timeout errors while loading the model
   - Solutions: check disk read/write performance; optimize file system parameters; verify storage device health
## 📚 Technical Summary
Lightx2v's offloading mechanism is designed for modern AI inference scenarios, fully leveraging the GPU's asynchronous computing capabilities and a multi-level storage architecture. Through intelligent weight management and efficient parallel processing, it significantly lowers the hardware threshold for large-model inference and gives developers a flexible, efficient deployment solution.
**Technical Highlights:**
- 🚀 **Performance Optimization**: Asynchronous parallel processing maximizes hardware utilization
- 💾 **Intelligent Memory**: Multi-level caching strategies achieve optimal memory management
- 🔧 **Flexible Configuration**: Multiple granularities and strategies can be freely combined
- 🛡️ **Stable and Reliable**: Comprehensive error handling and fault recovery mechanisms