Unverified Commit 6f281bdd authored by gushiqiao, committed by GitHub

Update docs (#386)

parent 8b1e4f94
@@ -26,26 +26,23 @@ For comprehensive usage instructions, please refer to our documentation:
## 🤖 Supported Model Ecosystem

### Official Open-Source Models
- [Wan2.1 & Wan2.2](https://huggingface.co/Wan-AI/)
- [Qwen-Image](https://huggingface.co/Qwen/Qwen-Image)
- [Qwen-Image-Edit](https://huggingface.co/spaces/Qwen/Qwen-Image-Edit)

### Quantized and Distilled Models/LoRAs (**🚀 Recommended: 4-step inference**)
- [Wan2.1-Distill-Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models)
- [Wan2.2-Distill-Models](https://huggingface.co/lightx2v/Wan2.2-Distill-Models)
- [Wan2.1-Distill-Loras](https://huggingface.co/lightx2v/Wan2.1-Distill-Loras)
- [Wan2.2-Distill-Loras](https://huggingface.co/lightx2v/Wan2.2-Distill-Loras)

🔔 Follow our [HuggingFace page](https://huggingface.co/lightx2v) for the latest model releases from our team.

### Autoregressive Models
- [Wan2.1-T2V-CausVid](https://huggingface.co/lightx2v/Wan2.1-T2V-14B-CausVid)

💡 Refer to the [Model Structure Documentation](https://lightx2v-en.readthedocs.io/en/latest/deploy_guides/model_structure.html) to quickly get started with LightX2V.

## 🚀 Frontend Interfaces
@@ -25,26 +25,23 @@
## 🤖 Supported Model Ecosystem

### Official Open-Source Models
- [Wan2.1 & Wan2.2](https://huggingface.co/Wan-AI/)
- [Qwen-Image](https://huggingface.co/Qwen/Qwen-Image)
- [Qwen-Image-Edit](https://huggingface.co/spaces/Qwen/Qwen-Image-Edit)

### Quantized and Distilled Models/LoRAs (**🚀 Recommended: 4-step inference**)
- [Wan2.1-Distill-Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models)
- [Wan2.2-Distill-Models](https://huggingface.co/lightx2v/Wan2.2-Distill-Models)
- [Wan2.1-Distill-Loras](https://huggingface.co/lightx2v/Wan2.1-Distill-Loras)
- [Wan2.2-Distill-Loras](https://huggingface.co/lightx2v/Wan2.2-Distill-Loras)

🔔 Follow our [HuggingFace page](https://huggingface.co/lightx2v) to get our team's latest model releases as they come out.

### Autoregressive Models
- [Wan2.1-T2V-CausVid](https://huggingface.co/lightx2v/Wan2.1-T2V-14B-CausVid)

💡 Refer to the [Model Structure Documentation](https://lightx2v-zhcn.readthedocs.io/zh-cn/latest/deploy_guides/model_structure.html) to quickly get started with LightX2V.

## 🚀 Frontend Interfaces
@@ -79,6 +76,7 @@
- **🔄 Parallel Inference Acceleration**: Multi-GPU parallel processing for a significant performance boost
- **📱 Flexible Deployment Options**: Supports Gradio, service-based deployment, ComfyUI, and other deployment methods
- **🎛️ Dynamic Resolution Inference**: Adaptive resolution adjustment for better generation quality
- **🎞️ Video Frame Interpolation**: RIFE-based frame interpolation for smooth frame-rate upscaling

## 🏆 Performance Benchmarks
@@ -16,11 +16,6 @@
        500,
        250
    ],
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl"
}
{
"infer_steps": 4,
"target_fps": 16,
"video_duration": 16,
"audio_sr": 16000,
"target_video_length": 81,
"target_height": 720,
"target_width": 1280,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 1,
"sample_shift": 5,
"enable_cfg": false,
"cpu_offload": true,
"offload_granularity": "block",
"t5_cpu_offload": true,
"offload_ratio_val": 1,
"t5_offload_granularity": "block",
"use_tiling_vae": true,
"audio_encoder_cpu_offload": false,
"audio_adapter_cpu_offload": false
}
@@ -9,10 +9,15 @@
"sample_guide_scale": 5, "sample_guide_scale": 5,
"sample_shift": 5, "sample_shift": 5,
"enable_cfg": true, "enable_cfg": true,
"dit_quantized": true,
"dit_quant_scheme": "fp8-q8f",
"t5_quantized": true,
"t5_quant_scheme": "fp8-q8f",
"clip_quantized": true,
"clip_quant_scheme": "fp8-q8f",
"cpu_offload": true, "cpu_offload": true,
"t5_cpu_offload": true,
"offload_granularity": "block", "offload_granularity": "block",
"mm_config": { "t5_cpu_offload": false,
"mm_type": "W-int8-channel-sym-A-int8-channel-sym-dynamic-Vllm" "vae_cpu_offload": false,
} "clip_cpu_offload": false
} }
@@ -12,8 +12,7 @@
"t5_cpu_offload": true, "t5_cpu_offload": true,
"t5_offload_granularity": "block", "t5_offload_granularity": "block",
"t5_quantized": true, "t5_quantized": true,
"t5_quantized_ckpt": "/path/to/models_t5_umt5-xxl-enc-fp8.pth", "t5_quant_scheme": "fp8-sgl",
"t5_quant_scheme": "fp8", "unload_modules": false,
"unload_modules": true, "use_tiling_vae": false
"use_tiling_vae": true
} }
@@ -10,9 +10,15 @@
"sample_guide_scale": 6, "sample_guide_scale": 6,
"sample_shift": 8, "sample_shift": 8,
"enable_cfg": true, "enable_cfg": true,
"dit_quantized": true,
"dit_quant_scheme": "fp8-q8f",
"t5_quantized": true,
"t5_quant_scheme": "fp8-q8f",
"clip_quantized": true,
"clip_quant_scheme": "fp8-q8f",
"cpu_offload": true, "cpu_offload": true,
"offload_granularity": "block", "offload_granularity": "block",
"t5_cpu_offload": true, "t5_cpu_offload": false,
"dit_quantized": true, "vae_cpu_offload": false,
"dit_quant_scheme": "fp8-sgl" "clip_cpu_offload": false
} }
@@ -11,7 +11,14 @@
"enable_cfg": true, "enable_cfg": true,
"cpu_offload": true, "cpu_offload": true,
"offload_granularity": "phase", "offload_granularity": "phase",
"t5_cpu_offload": true, "t5_cpu_offload": false,
"t5_offload_granularity": "block", "clip_cpu_offload": false,
"use_tiling_vae": true "vae_cpu_offload": false,
"use_tiling_vae": false,
"dit_quantized": true,
"dit_quant_scheme": "fp8-q8f",
"t5_quantized": true,
"t5_quant_scheme": "fp8-q8f",
"clip_quantized": true,
"clip_quant_scheme": "fp8-q8f"
} }
@@ -11,12 +11,15 @@
"sample_shift": 8, "sample_shift": 8,
"enable_cfg": true, "enable_cfg": true,
"cpu_offload": true, "cpu_offload": true,
"t5_cpu_offload": true,
"offload_granularity": "phase", "offload_granularity": "phase",
"dit_quantized_ckpt": "/path/to/dit_int8", "t5_cpu_offload": false,
"clip_cpu_offload": false,
"vae_cpu_offload": false,
"use_tiling_vae": false,
"dit_quantized": true, "dit_quantized": true,
"dit_quant_scheme": "int8-q8f", "dit_quant_scheme": "fp8-q8f",
"use_tiny_vae": true, "t5_quantized": true,
"tiny_vae_path": "/x2v_models/taew2_1.pth", "t5_quant_scheme": "fp8-q8f",
"t5_offload_granularity": "block" "clip_quantized": true,
"clip_quant_scheme": "fp8-q8f"
} }
@@ -24,7 +24,7 @@
"dit_quantized": true, "dit_quantized": true,
"dit_quant_scheme": "fp8-q8f", "dit_quant_scheme": "fp8-q8f",
"t5_quantized": true, "t5_quantized": true,
"t5_quant_scheme": "fp8", "t5_quant_scheme": "fp8-q8f",
"clip_quantized": true, "clip_quantized": true,
"clip_quant_scheme": "fp8" "clip_quant_scheme": "fp8-q8f"
} }
# Parameter Offload

## 📖 Overview

LightX2V implements an advanced parameter offload mechanism designed for large model inference under limited hardware resources. The system provides an excellent speed-memory balance by intelligently managing model weights across different memory hierarchies.

**Core Features:**

- **Block/Phase-level Offload**: Manages model weights in block or phase units for efficient memory usage
  - **Block**: The basic computational unit of Transformer models, i.e. a complete Transformer layer (self-attention, cross-attention, feedforward network, etc.), serving as the larger memory management unit
  - **Phase**: A finer-grained computational stage within a block, i.e. an individual component (self-attention, cross-attention, feedforward network, etc.), providing more precise memory control
- **Multi-tier Storage Support**: GPU → CPU → Disk hierarchy with intelligent caching
- **Asynchronous Operations**: Overlaps computation and data transfer using CUDA streams
- **Disk/NVMe Serialization**: Supports secondary storage when memory is insufficient

## 🎯 Offload Strategies

### Strategy 1: GPU-CPU Block/Phase Offload

**Use Case**: Insufficient GPU memory but sufficient system memory

**How It Works**: Manages model weights in block or phase units between GPU and CPU memory, using CUDA streams to overlap computation and data transfer. Blocks contain complete Transformer layers, while phases are the individual computational components within a block.

<div align="center">
<img alt="GPU-CPU block/phase offload workflow" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig1_en.png" width="75%">
</div>

<div align="center">
<img alt="Swap operation" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig2_en.png" width="75%">
</div>

<div align="center">
<img alt="Swap concept" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig3_en.png" width="75%">
</div>
**Block vs Phase Explanation**:
- **Block Granularity**: A larger memory management unit containing a complete Transformer layer (self-attention, cross-attention, feedforward network, etc.); suited to scenarios with sufficient memory, with reduced management overhead
- **Phase Granularity**: Finer-grained management of individual computational components (self-attention, cross-attention, feedforward network, etc.); suited to memory-constrained scenarios, with more flexible memory control

**Key Features:**
- **Asynchronous Transfer**: Uses three CUDA streams with different priorities so computation and transfer run in parallel (see the sketch below)
  - Compute stream (priority=-1): High priority, handles the current computation
  - GPU load stream (priority=0): Medium priority, handles CPU → GPU prefetching
  - CPU load stream (priority=0): Medium priority, handles GPU → CPU offloading
- **Prefetch Mechanism**: Preloads the next block/phase to the GPU in advance
- **Intelligent Caching**: Maintains a weight cache in CPU memory
- **Stream Synchronization**: Ensures the correctness of data transfer and computation
- **Swap Operation**: Rotates block/phase positions after computation for continuous execution
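
The three-stream pattern above can be sketched in a few lines of PyTorch. This is only an illustrative sketch under stated assumptions, not LightX2V's actual implementation: `prefetch`, `run_blocks`, and the `compute_block` callback are hypothetical names, and the CPU weights are assumed to live in pinned memory so the copies can run asynchronously.

```python
import torch

# Illustrative sketch of prioritized streams for offloaded inference (not LightX2V's code).
compute_stream = torch.cuda.Stream(priority=-1)   # high priority: runs the current block
gpu_load_stream = torch.cuda.Stream(priority=0)   # prefetches the next block CPU -> GPU

def prefetch(cpu_block):
    """Queue an async copy of one block's (pinned) CPU weights onto the GPU load stream."""
    with torch.cuda.stream(gpu_load_stream):
        gpu_block = {name: w.cuda(non_blocking=True) for name, w in cpu_block.items()}
        ready = torch.cuda.Event()
        ready.record(gpu_load_stream)              # marks "this block's weights are resident"
    return gpu_block, ready

def run_blocks(cpu_blocks, hidden, compute_block):
    """cpu_blocks: list of dicts of pinned CPU tensors; compute_block: placeholder forward fn."""
    gpu_block, ready = prefetch(cpu_blocks[0])
    for i in range(len(cpu_blocks)):
        if i + 1 < len(cpu_blocks):
            next_block, next_ready = prefetch(cpu_blocks[i + 1])  # overlaps with compute below
        compute_stream.wait_event(ready)            # wait only for block i's weights
        with torch.cuda.stream(compute_stream):
            hidden = compute_block(gpu_block, hidden)
        if i + 1 < len(cpu_blocks):
            gpu_block, ready = next_block, next_ready
    torch.cuda.synchronize()
    return hidden
```

Because the readiness event is recorded immediately after each block's copy, the compute stream waits only for the block it is about to use, so the prefetch of block *i+1* keeps running while block *i* is being computed.
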
### Strategy 2: Disk-CPU-GPU Block/Phase Offload (Lazy Loading)

**Use Case**: Both GPU memory and system memory are insufficient

**How It Works**: Builds on Strategy 1 by introducing disk storage, forming a three-tier Disk → CPU → GPU hierarchy. The CPU serves as a cache pool of configurable size, which suits devices with limited CPU memory.

<div align="center">
<img alt="Disk-CPU-GPU block/phase offload workflow" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig4_en.png" width="75%">
</div>

<div align="center">
<img alt="Working steps" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig5_en.png" width="75%">
</div>

**Key Features:**
- **Lazy Loading**: Model weights are loaded from disk on demand, so the entire model never has to be loaded at once
- **Intelligent Caching**: The CPU memory buffer uses a FIFO strategy with a configurable size
- **Multi-threaded Prefetch**: Uses multiple disk worker threads to load data in parallel
- **Asynchronous Transfer**: Uses CUDA streams to overlap computation and data transfer
- **Swap Rotation**: Achieves continuous computation through position rotation, avoiding repeated loading and offloading

**Working Steps**:
- **Disk Storage**: Model weights are stored on SSD/NVMe by block, one .safetensors file per block
- **Task Scheduling**: When a block/phase is needed, a priority task queue assigns it to a disk worker thread
- **Asynchronous Loading**: Multiple disk threads load weight files from disk into the CPU memory buffer in parallel
- **Intelligent Caching**: The CPU memory buffer manages its cache with a FIFO strategy of configurable size (a minimal cache sketch follows this list)
- **Cache Hit**: If the weights are already cached, they are transferred directly to the GPU without a disk read
- **Prefetch Transfer**: Cached weights are asynchronously transferred to GPU memory (on the GPU load stream)
- **Compute Execution**: Weights on the GPU are used for computation (on the compute stream) while the next block/phase keeps prefetching in the background
- **Swap Rotation**: After computation completes, block/phase positions are rotated for continuous computation
- **Memory Management**: When the CPU cache is full, the oldest cached weight block/phase is evicted automatically
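
A minimal sketch of the FIFO cache plus disk workers described above might look like the following. It is illustrative only: the `BlockCache` class, file layout, and `max_blocks` limit are hypothetical stand-ins (with `max_blocks` playing the role of the `max_memory` limit described below), while `safetensors.torch.load_file` is the real loader for per-block `.safetensors` files.

```python
import threading
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

from safetensors.torch import load_file   # one .safetensors file per block, as described above

class BlockCache:
    """Illustrative FIFO CPU cache fed by disk worker threads (not LightX2V's actual code)."""

    def __init__(self, block_paths, max_blocks=4, num_disk_workers=2):
        self.block_paths = block_paths          # e.g. ["blocks/block_0.safetensors", ...]
        self.max_blocks = max_blocks            # cache size limit (stands in for `max_memory`)
        self.pool = ThreadPoolExecutor(max_workers=num_disk_workers)
        self.cache = OrderedDict()              # block index -> dict of CPU tensors
        self.pending = {}                       # block index -> Future from a disk worker
        self.lock = threading.Lock()

    def prefetch(self, idx):
        """Ask a disk worker to load block `idx` if it is not cached or already queued."""
        if idx >= len(self.block_paths):
            return
        with self.lock:
            if idx in self.cache or idx in self.pending:
                return                           # cache hit or load already in flight
            self.pending[idx] = self.pool.submit(load_file, self.block_paths[idx])

    def get(self, idx):
        """Return block `idx`, blocking on the disk worker if it is still loading."""
        self.prefetch(idx)
        with self.lock:
            future = self.pending.pop(idx, None)
        weights = future.result() if future is not None else self.cache[idx]
        with self.lock:
            self.cache[idx] = weights
            while len(self.cache) > self.max_blocks:
                self.cache.popitem(last=False)   # FIFO eviction of the oldest cached block
        return weights

# Typical use: prefetch the next block while the current one is transferred and computed.
# cache = BlockCache([f"blocks/block_{i}.safetensors" for i in range(40)])
# for i in range(40):
#     cache.prefetch(i + 1)
#     weights = cache.get(i)
```
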
## ⚙️ Configuration Parameters

### GPU-CPU Offload Configuration

```python
config = {
    "cpu_offload": True,
    "offload_ratio": 1.0,               # Offload ratio (0.0-1.0)
    "offload_granularity": "block",     # Offload granularity: "block" or "phase"
    "lazy_load": False,                 # Disable lazy loading
}
```

### Disk-CPU-GPU Offload Configuration

```python
config = {
    "cpu_offload": True,
    "lazy_load": True,                  # Enable lazy loading
    "offload_ratio": 1.0,               # Offload ratio
    "offload_granularity": "phase",     # Phase granularity is recommended
    "num_disk_workers": 2,              # Number of disk worker threads
    "offload_to_disk": True,            # Enable disk offload
}
```

**Intelligent Cache Key Parameters:**

- `max_memory`: Controls the CPU cache size; affects cache hit rate and memory usage
- `num_disk_workers`: Controls the number of disk loading threads; affects prefetch speed
- `offload_granularity`: Controls cache granularity (block or phase); affects cache efficiency
  - `"block"`: Cache management in complete Transformer layer units
  - `"phase"`: Cache management in individual computational component units
**Offload Configuration for Non-DIT Model Components (T5, CLIP, VAE):**
The offload behavior of these components follows these rules:
- **Default Behavior**: If not specified separately, T5, CLIP, VAE will follow the `cpu_offload` setting
- **Independent Configuration**: Can set offload strategy separately for each component for fine-grained control
**Configuration Example**:
```json
{
"cpu_offload": true, // DIT model offload switch
"t5_cpu_offload": false, // T5 encoder independent setting
"clip_cpu_offload": false, // CLIP encoder independent setting
"vae_cpu_offload": false // VAE encoder independent setting
}
```
For memory-constrained devices, a progressive offload strategy is recommended:
1. **Step 1**: Only enable `cpu_offload`, disable `t5_cpu_offload`, `clip_cpu_offload`, `vae_cpu_offload`
2. **Step 2**: If memory is still insufficient, gradually enable CPU offload for T5, CLIP, VAE
3. **Step 3**: If memory is still not enough, consider using quantization + CPU offload or enable `lazy_load`
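
The three steps above might translate into configs along these lines. The keys are the ones used throughout this document, but the exact values depend on your model; take them from the config files linked below rather than from this sketch.

```python
# Step 1: offload only the DIT weights; keep the T5/CLIP/VAE encoders on the GPU.
step1 = {
    "cpu_offload": True,
    "offload_granularity": "block",
    "t5_cpu_offload": False,
    "clip_cpu_offload": False,
    "vae_cpu_offload": False,
}

# Step 2: still running out of memory -> also offload the encoders and VAE.
step2 = {**step1, "t5_cpu_offload": True, "clip_cpu_offload": True, "vae_cpu_offload": True}

# Step 3: combine quantization with offload, or fall back to lazy loading from disk.
step3 = {
    **step2,
    "dit_quantized": True,
    "dit_quant_scheme": "fp8-sgl",   # example scheme; see the quantization documentation
    "lazy_load": True,
    "offload_to_disk": True,
    "num_disk_workers": 2,
}
```
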
**Practical Experience**:
- **RTX 4090 24GB + 14B Model**: Usually you only need to enable `cpu_offload`, manually set the other components' offload to `false`, and use the FP8 quantized version
- **Smaller Memory GPUs**: Need to combine quantization, CPU offload, and lazy loading
- **Quantization Schemes**: Refer to [Quantization Documentation](../method_tutorials/quantization.md) to select appropriate quantization strategy
**Configuration File Reference**:
- **Wan2.1 Series Models**: Refer to the [offload config files](https://github.com/ModelTC/lightx2v/tree/main/configs/offload)
- **Wan2.2 Series Models**: Refer to the [wan22 config files](https://github.com/ModelTC/lightx2v/tree/main/configs/wan22) with the `4090` suffix

## 🎯 Usage Recommendations

- 🔄 GPU-CPU Block/Phase Offload: Suitable when GPU memory is insufficient (RTX 3090/4090 24G) but system memory is sufficient (>64/128G)
- 💾 Disk-CPU-GPU Block/Phase Offload: Suitable when both GPU memory (RTX 3060/4090 8G) and system memory (16/32G) are insufficient
- 🚫 No Offload: Suitable for high-end hardware configurations pursuing the best performance

## 🔍 Troubleshooting

### Common Issues and Solutions

1. **Disk I/O Bottleneck**
   - Solution: Use an NVMe SSD, increase `num_disk_workers`

2. **Memory Buffer Overflow**
   - Solution: Increase `max_memory` or reduce `num_disk_workers`

3. **Loading Timeout**
   - Solution: Check disk performance, optimize the file system

**Note**: This offload mechanism is designed specifically for LightX2V; it takes full advantage of the asynchronous compute capabilities of modern hardware and significantly lowers the hardware threshold for large model inference.
# Model Quantization Techniques

## 📖 Overview

LightX2V supports quantized inference for the DIT, T5, and CLIP models, reducing memory usage and improving inference speed by lowering model precision.

---

## 🔧 Quantization Modes

| Quantization Mode | Weight Quantization | Activation Quantization | Compute Kernel | Supported Hardware |
|--------------|----------|----------|----------|----------|
| `fp8-vllm` | FP8 channel symmetric | FP8 channel dynamic symmetric | [VLLM](https://github.com/vllm-project/vllm) | H100/H200/H800, RTX 40 series, etc. |
| `int8-vllm` | INT8 channel symmetric | INT8 channel dynamic symmetric | [VLLM](https://github.com/vllm-project/vllm) | A100/A800, RTX 30/40 series, etc. |
| `fp8-sgl` | FP8 channel symmetric | FP8 channel dynamic symmetric | [SGL](https://github.com/sgl-project/sglang/tree/main/sgl-kernel) | H100/H200/H800, RTX 40 series, etc. |
| `int8-sgl` | INT8 channel symmetric | INT8 channel dynamic symmetric | [SGL](https://github.com/sgl-project/sglang/tree/main/sgl-kernel) | A100/A800, RTX 30/40 series, etc. |
| `fp8-q8f` | FP8 channel symmetric | FP8 channel dynamic symmetric | [Q8-Kernels](https://github.com/KONAKONA666/q8_kernels) | RTX 40 series, L40S, etc. |
| `int8-q8f` | INT8 channel symmetric | INT8 channel dynamic symmetric | [Q8-Kernels](https://github.com/KONAKONA666/q8_kernels) | RTX 40 series, L40S, etc. |
| `int8-torchao` | INT8 channel symmetric | INT8 channel dynamic symmetric | [TorchAO](https://github.com/pytorch/ao) | A100/A800, RTX 30/40 series, etc. |
| `int4-g128-marlin` | INT4 group symmetric | FP16 | [Marlin](https://github.com/IST-DASLab/marlin) | H200/H800/A100/A800, RTX 30/40 series, etc. |
| `fp8-b128-deepgemm` | FP8 block symmetric | FP8 group symmetric | [DeepGemm](https://github.com/deepseek-ai/DeepGEMM) | H100/H200/H800, RTX 40 series, etc.|
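
Most of the modes in this table quantize weights per output channel with a symmetric scale and quantize activations dynamically at inference time. The sketch below (not LightX2V's kernel code) shows what that means for an FP8 (e4m3) linear layer; the function names are illustrative.

```python
import torch

def quantize_weight_fp8_per_channel(w: torch.Tensor):
    """Symmetric per-channel FP8 (e4m3) weight quantization, sketched for illustration.

    w has shape [out_features, in_features]; one scale is stored per output channel,
    and the matmul kernel multiplies the result back by that scale afterwards.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max                        # ~448 for e4m3
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / fp8_max   # one scale per channel
    w_q = (w / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return w_q, scale.squeeze(1)

def quantize_activation_dynamic(x: torch.Tensor):
    """Dynamic symmetric activation quantization: scales are computed at runtime
    from the current activations (here along the last dimension)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / fp8_max
    return (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn), scale
```
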
---
## 🔧 Obtaining Quantized Models

### Method 1: Download Pre-Quantized Models

Download pre-quantized models from the LightX2V model repositories:

**DIT Models**

Download pre-quantized DIT models from [Wan2.1-Distill-Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models):

```bash
# Download DIT FP8 quantized model
huggingface-cli download lightx2v/Wan2.1-Distill-Models \
--local-dir ./models \
--include "wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors"
```
**Encoder Models**

Download pre-quantized T5 and CLIP models from [Encoders-LightX2V](https://huggingface.co/lightx2v/Encoders-Lightx2v):

```bash
# Download T5 FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
--local-dir ./models \
--include "models_t5_umt5-xxl-enc-fp8.pth"

# Download CLIP FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
--local-dir ./models \
--include "models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth"
```
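
If you prefer Python over the CLI, the same files can also be fetched with the `huggingface_hub` API (same repo IDs and filenames as in the commands above):

```python
from huggingface_hub import hf_hub_download

# Same files as the CLI commands above, fetched via the Python API.
t5_path = hf_hub_download(
    repo_id="lightx2v/Encoders-Lightx2v",
    filename="models_t5_umt5-xxl-enc-fp8.pth",
    local_dir="./models",
)
clip_path = hf_hub_download(
    repo_id="lightx2v/Encoders-Lightx2v",
    filename="models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth",
    local_dir="./models",
)
```
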
### Method 2: Self-Quantize Models

For detailed quantization tool usage, refer to the [Model Conversion Documentation](https://github.com/ModelTC/lightx2v/tree/main/tools/convert/readme_zh.md)

---

## 🚀 Using Quantized Models

### DIT Model Quantization

#### Supported Quantization Modes

DIT quantization modes (`dit_quant_scheme`) support: `fp8-vllm`, `int8-vllm`, `fp8-sgl`, `int8-sgl`, `fp8-q8f`, `int8-q8f`, `int8-torchao`, `int4-g128-marlin`, `fp8-b128-deepgemm`
#### Configuration Example

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "dit_quantized_ckpt": "/path/to/dit_quantized_model" // Optional
}
```

> 💡 **Tip**: When there is only one DIT model in the script's `model_path`, `dit_quantized_ckpt` does not need to be specified separately.
### T5 Model Quantization

#### Supported Quantization Modes

T5 quantization modes (`t5_quant_scheme`) support: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`

#### Configuration Example
```json
{
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "t5_quantized_ckpt": "/path/to/t5_quantized_model" // Optional
}
```

> 💡 **Tip**: When a T5 quantized model (such as `models_t5_umt5-xxl-enc-fp8.pth` or `models_t5_umt5-xxl-enc-int8.pth`) exists in the `model_path` specified by the script, `t5_quantized_ckpt` does not need to be specified separately.
### CLIP Model Quantization

#### Supported Quantization Modes

CLIP quantization modes (`clip_quant_scheme`) support: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`

#### Configuration Example

```json
{
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl",
    "clip_quantized_ckpt": "/path/to/clip_quantized_model" // Optional
}
```

> 💡 **Tip**: When a CLIP quantized model (such as `models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth` or `models_clip_open-clip-xlm-roberta-large-vit-huge-14-int8.pth`) exists in the `model_path` specified by the script, `clip_quantized_ckpt` does not need to be specified separately.
### Performance Optimization Strategy

If memory is insufficient, you can combine quantization with parameter offload to further reduce memory usage. Refer to the [Parameter Offload Documentation](../method_tutorials/offload.md):
> - **Wan2.1 Configuration**: Refer to [offload config files](https://github.com/ModelTC/LightX2V/tree/main/configs/offload)
> - **Wan2.2 Configuration**: Refer to [wan22 config files](https://github.com/ModelTC/LightX2V/tree/main/configs/wan22) with `4090` suffix
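
As an illustration, a combined quantization + offload configuration might contain keys like the sketch below (written as a Python-style dict; the keys are the ones used throughout this document, but take the exact values from the `4090`-suffixed config files referenced above):

```python
# Illustrative combination of FP8 quantization with CPU offload on a 24GB-class GPU.
config = {
    "dit_quantized": True, "dit_quant_scheme": "fp8-q8f",
    "t5_quantized": True, "t5_quant_scheme": "fp8-q8f",
    "clip_quantized": True, "clip_quant_scheme": "fp8-q8f",
    "cpu_offload": True, "offload_granularity": "block",
    "t5_cpu_offload": False, "clip_cpu_offload": False, "vae_cpu_offload": False,
}
```
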
---

## 📚 Related Resources

### Configuration File Examples
- [INT8 Quantization Config](https://github.com/ModelTC/LightX2V/blob/main/configs/quantization/wan_i2v.json)
- [Q8F Quantization Config](https://github.com/ModelTC/LightX2V/blob/main/configs/quantization/wan_i2v_q8f.json)
- [TorchAO Quantization Config](https://github.com/ModelTC/LightX2V/blob/main/configs/quantization/wan_i2v_torchao.json)

### Run Scripts
- [Quantization Inference Scripts](https://github.com/ModelTC/LightX2V/tree/main/scripts/quantization)
### Tool Documentation
- [Quantization Tool Documentation](https://github.com/ModelTC/lightx2v/tree/main/tools/convert/readme_zh.md)
- [LightCompress Quantization Documentation](https://github.com/ModelTC/llmc/blob/main/docs/zh_cn/source/backend/lightx2v.md)
### Model Repositories
- [Wan2.1-LightX2V Quantized Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models)
- [Wan2.2-LightX2V Quantized Models](https://huggingface.co/lightx2v/Wan2.2-Distill-Models)
- [Encoders Quantized Models](https://huggingface.co/lightx2v/Encoders-Lightx2v)
---
Through this document, you should be able to:
✅ Understand quantization schemes supported by LightX2V
✅ Select appropriate quantization strategies based on hardware
✅ Correctly configure quantization parameters
✅ Obtain and use quantized models
✅ Optimize inference performance and memory usage
If you have any other questions, feel free to ask in [GitHub Issues](https://github.com/ModelTC/LightX2V/issues).
@@ -109,7 +109,6 @@ config = {
"offload_granularity": "phase", # 推荐使用phase粒度 "offload_granularity": "phase", # 推荐使用phase粒度
"num_disk_workers": 2, # 磁盘工作线程数 "num_disk_workers": 2, # 磁盘工作线程数
"offload_to_disk": True, # 启用磁盘卸载 "offload_to_disk": True, # 启用磁盘卸载
"offload_path": ".", # 磁盘卸载路径
} }
``` ```
@@ -120,7 +119,37 @@
- `"block"`:以完整的Transformer层为单位进行缓存管理 - `"block"`:以完整的Transformer层为单位进行缓存管理
- `"phase"`:以单个计算组件为单位进行缓存管理 - `"phase"`:以单个计算组件为单位进行缓存管理
详细配置文件可参考[config](https://github.com/ModelTC/lightx2v/tree/main/configs/offload) **非 DIT 模型组件(T5、CLIP、VAE)的卸载配置:**
这些组件的卸载行为遵循以下规则:
- **默认行为**:如果没有单独指定,T5、CLIP、VAE 会跟随 `cpu_offload` 的设置
- **独立配置**:可以为每个组件单独设置卸载策略,实现精细控制
**配置示例**
```json
{
"cpu_offload": true, // DIT 模型卸载开关
"t5_cpu_offload": false, // T5 编码器独立设置
"clip_cpu_offload": false, // CLIP 编码器独立设置
"vae_cpu_offload": false // VAE 编码器独立设置
}
```
在显存受限的设备上,建议采用渐进式卸载策略:
1. **第一步**:仅开启 `cpu_offload`,关闭 `t5_cpu_offload``clip_cpu_offload``vae_cpu_offload`
2. **第二步**:如果显存仍不足,逐步开启 T5、CLIP、VAE 的 CPU 卸载
3. **第三步**:如果显存仍然不够,考虑使用量化 + CPU 卸载或启用 `lazy_load`
**实践经验**
- **RTX 4090 24GB + 14B 模型**:通常只需开启 `cpu_offload`,其他组件卸载需要手动设为 `false`,同时使用 FP8 量化版本
- **更小显存的 GPU**:需要组合使用量化、CPU 卸载和延迟加载
- **量化方案**:建议参考[量化技术文档](../method_tutorials/quantization.md)选择合适的量化策略
**配置文件参考**
- **Wan2.1 系列模型**:参考 [offload 配置文件](https://github.com/ModelTC/lightx2v/tree/main/configs/offload)
- **Wan2.2 系列模型**:参考 [wan22 配置文件](https://github.com/ModelTC/lightx2v/tree/main/configs/wan22) 中以 `4090` 结尾的配置文件
## 🎯 Usage Recommendations

- 🔄 GPU-CPU block/phase offload: Suitable when GPU memory is insufficient (RTX 3090/4090 24G) but system memory is sufficient (>64/128G)
# Model Quantization Techniques

## 📖 Overview

LightX2V supports quantized inference for the DIT, T5, and CLIP models, reducing memory usage and improving inference speed by lowering model precision.

---

## 🔧 Quantization Modes

| Quantization Mode | Weight Quantization | Activation Quantization | Compute Kernel | Supported Hardware |
|--------------|----------|----------|----------|----------|
| `fp8-vllm` | FP8 channel symmetric | FP8 channel dynamic symmetric | [VLLM](https://github.com/vllm-project/vllm) | H100/H200/H800, RTX 40 series, etc. |
| `int8-vllm` | INT8 channel symmetric | INT8 channel dynamic symmetric | [VLLM](https://github.com/vllm-project/vllm) | A100/A800, RTX 30/40 series, etc. |
| `fp8-sgl` | FP8 channel symmetric | FP8 channel dynamic symmetric | [SGL](https://github.com/sgl-project/sglang/tree/main/sgl-kernel) | H100/H200/H800, RTX 40 series, etc. |
| `int8-sgl` | INT8 channel symmetric | INT8 channel dynamic symmetric | [SGL](https://github.com/sgl-project/sglang/tree/main/sgl-kernel) | A100/A800, RTX 30/40 series, etc. |
| `fp8-q8f` | FP8 channel symmetric | FP8 channel dynamic symmetric | [Q8-Kernels](https://github.com/KONAKONA666/q8_kernels) | RTX 40 series, L40S, etc. |
| `int8-q8f` | INT8 channel symmetric | INT8 channel dynamic symmetric | [Q8-Kernels](https://github.com/KONAKONA666/q8_kernels) | RTX 40 series, L40S, etc. |
| `int8-torchao` | INT8 channel symmetric | INT8 channel dynamic symmetric | [TorchAO](https://github.com/pytorch/ao) | A100/A800, RTX 30/40 series, etc. |
| `int4-g128-marlin` | INT4 group symmetric | FP16 | [Marlin](https://github.com/IST-DASLab/marlin) | H200/H800/A100/A800, RTX 30/40 series, etc. |
| `fp8-b128-deepgemm` | FP8 block symmetric | FP8 group symmetric | [DeepGemm](https://github.com/deepseek-ai/DeepGEMM) | H100/H200/H800, RTX 40 series, etc. |
---

## 🔧 Obtaining Quantized Models

### Method 1: Download Pre-Quantized Models

Download pre-quantized models from the LightX2V model repositories:

**DIT Models**

Download pre-quantized DIT models from [Wan2.1-Distill-Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models):

```bash
# Download the DIT FP8 quantized model
huggingface-cli download lightx2v/Wan2.1-Distill-Models \
    --local-dir ./models \
    --include "wan2.1_i2v_720p_scaled_fp8_e4m3_lightx2v_4step.safetensors"
```
**Encoder Models**

Download pre-quantized T5 and CLIP models from [Encoders-LightX2V](https://huggingface.co/lightx2v/Encoders-Lightx2v):

```bash
# Download the T5 FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_t5_umt5-xxl-enc-fp8.pth"

# Download the CLIP FP8 quantized model
huggingface-cli download lightx2v/Encoders-Lightx2v \
    --local-dir ./models \
    --include "models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth"
```
### Method 2: Quantize Models Yourself

For detailed usage of the quantization tool, refer to the [Model Conversion Documentation](https://github.com/ModelTC/lightx2v/tree/main/tools/convert/readme_zh.md)

---

## 🚀 Using Quantized Models

### DIT Model Quantization

#### Supported Quantization Modes

DIT quantization modes (`dit_quant_scheme`) support: `fp8-vllm`, `int8-vllm`, `fp8-sgl`, `int8-sgl`, `fp8-q8f`, `int8-q8f`, `int8-torchao`, `int4-g128-marlin`, `fp8-b128-deepgemm`

#### Configuration Example

```json
{
    "dit_quantized": true,
    "dit_quant_scheme": "fp8-sgl",
    "dit_quantized_ckpt": "/path/to/dit_quantized_model" // Optional
}
```

> 💡 **Tip**: When there is only one DIT model in the script's `model_path`, `dit_quantized_ckpt` does not need to be specified separately.
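
Which `*_quant_scheme` values you can use depends on which kernel backend is installed. The small check below is a hedged sketch; the module names are the commonly used import names for these projects and are given as assumptions that may differ between versions.

```python
import importlib.util

# Probe the optional kernel backends behind the quantization modes listed above.
# The import names below are assumptions; adjust them if your installed version differs.
backends = {
    "vllm (fp8-vllm / int8-vllm)": "vllm",
    "sgl-kernel (fp8-sgl / int8-sgl)": "sgl_kernel",
    "q8-kernels (fp8-q8f / int8-q8f)": "q8_kernels",
    "torchao (int8-torchao)": "torchao",
}
for name, module in backends.items():
    found = importlib.util.find_spec(module) is not None
    print(f"{name:35s} {'available' if found else 'not installed'}")
```
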
### T5 Model Quantization

#### Supported Quantization Modes

T5 quantization modes (`t5_quant_scheme`) support: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`

#### Configuration Example

```json
{
    "t5_quantized": true,
    "t5_quant_scheme": "fp8-sgl",
    "t5_quantized_ckpt": "/path/to/t5_quantized_model" // Optional
}
```

> 💡 **Tip**: When a T5 quantized model (such as `models_t5_umt5-xxl-enc-fp8.pth` or `models_t5_umt5-xxl-enc-int8.pth`) exists in the `model_path` specified by the script, `t5_quantized_ckpt` does not need to be specified separately.
### CLIP Model Quantization

#### Supported Quantization Modes

CLIP quantization modes (`clip_quant_scheme`) support: `int8-vllm`, `fp8-sgl`, `int8-q8f`, `fp8-q8f`, `int8-torchao`

#### Configuration Example

```json
{
    "clip_quantized": true,
    "clip_quant_scheme": "fp8-sgl",
    "clip_quantized_ckpt": "/path/to/clip_quantized_model" // Optional
}
```

> 💡 **Tip**: When a CLIP quantized model (such as `models_clip_open-clip-xlm-roberta-large-vit-huge-14-fp8.pth` or `models_clip_open-clip-xlm-roberta-large-vit-huge-14-int8.pth`) exists in the `model_path` specified by the script, `clip_quantized_ckpt` does not need to be specified separately.
### Performance Optimization Strategy

If memory is insufficient, you can combine quantization with parameter offload to further reduce memory usage. Refer to the [Parameter Offload Documentation](../method_tutorials/offload.md):
> - **Wan2.1 Configuration**: Refer to the [offload config files](https://github.com/ModelTC/LightX2V/tree/main/configs/offload)
> - **Wan2.2 Configuration**: Refer to the [wan22 config files](https://github.com/ModelTC/LightX2V/tree/main/configs/wan22) with the `4090` suffix

---

## 📚 Related Resources

### Configuration File Examples
- [INT8 Quantization Config](https://github.com/ModelTC/LightX2V/blob/main/configs/quantization/wan_i2v.json)
- [Q8F Quantization Config](https://github.com/ModelTC/LightX2V/blob/main/configs/quantization/wan_i2v_q8f.json)
- [TorchAO Quantization Config](https://github.com/ModelTC/LightX2V/blob/main/configs/quantization/wan_i2v_torchao.json)

### Run Scripts
- [Quantization Inference Scripts](https://github.com/ModelTC/LightX2V/tree/main/scripts/quantization)

### Tool Documentation
- [Quantization Tool Documentation](https://github.com/ModelTC/lightx2v/tree/main/tools/convert/readme_zh.md)
- [LightCompress Quantization Documentation](https://github.com/ModelTC/llmc/blob/main/docs/zh_cn/source/backend/lightx2v.md)

### Model Repositories
- [Wan2.1-LightX2V Quantized Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models)
- [Wan2.2-LightX2V Quantized Models](https://huggingface.co/lightx2v/Wan2.2-Distill-Models)
- [Encoders Quantized Models](https://huggingface.co/lightx2v/Encoders-Lightx2v)

---
After reading this document, you should be able to:

✅ Understand the quantization schemes supported by LightX2V
✅ Choose an appropriate quantization strategy for your hardware
✅ Configure quantization parameters correctly
✅ Obtain and use quantized models
✅ Optimize inference performance and memory usage

If you have any other questions, feel free to ask in [GitHub Issues](https://github.com/ModelTC/LightX2V/issues).