Commit d0ec1aaa authored by gushiqiao

Update docs

parent 38c40fa0
# Lightx2v Low-Resource Deployment Guide
## 📋 Overview
This guide is specifically designed for hardware resource-constrained environments, particularly configurations with **8GB VRAM + 16/32GB RAM**, providing detailed instructions on how to successfully run Lightx2v 14B models for 480p and 720p video generation.
Lightx2v is a powerful video generation model, but it requires careful optimization to run smoothly in resource-constrained environments. This guide provides a complete solution from hardware selection to software configuration, ensuring you can achieve the best video generation experience under limited hardware conditions.
## 🎯 Target Hardware Configuration
### Recommended Hardware Specifications
**GPU Requirements**:
- **VRAM**: 8GB (RTX 3060/3070/4060/4060Ti, etc.)
- **Architecture**: NVIDIA graphics cards with CUDA support
**System Memory**:
- **Minimum**: 16GB DDR4
- **Recommended**: 32GB DDR4/DDR5
- **Memory Speed**: 3200MHz or higher recommended
**Storage Requirements**:
- **Type**: NVMe SSD strongly recommended
- **Capacity**: At least 50GB available space
- **Speed**: Read speed of 3000MB/s or higher recommended
**CPU Requirements**:
- **Cores**: 8 cores or more recommended
- **Frequency**: 3.0GHz or higher recommended
- **Architecture**: Support for AVX2 instruction set
## ⚙️ Core Optimization Strategies
### 1. Environment Optimization
Before running Lightx2v, it's recommended to set the following environment variables to optimize performance:
```bash
# CUDA memory allocation optimization
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Enable CUDA Graph mode to improve inference performance
export ENABLE_GRAPH_MODE=true
# Use BF16 precision for inference to reduce VRAM usage (default FP32 precision)
export DTYPE=BF16
```
**Optimization Details**:
- `expandable_segments:True`: Allows dynamic expansion of CUDA memory segments, reducing memory fragmentation
- `ENABLE_GRAPH_MODE=true`: Enables CUDA Graph to reduce kernel launch overhead
- `DTYPE=BF16`: Uses BF16 precision to reduce VRAM usage while maintaining quality
### 2. Quantization Strategy
Quantization is a key optimization technique in low-resource environments, reducing memory usage by lowering model precision.
#### Quantization Scheme Comparison
**FP8 Quantization** (Recommended for RTX 40 series):
```python
# Suitable for GPUs supporting FP8, providing better precision
dit_quant_scheme = "fp8" # DIT model quantization
t5_quant_scheme = "fp8" # T5 text encoder quantization
clip_quant_scheme = "fp8" # CLIP visual encoder quantization
```
**INT8 Quantization** (Universal solution):
```python
# Suitable for all GPUs, minimal memory usage
dit_quant_scheme = "int8" # 8-bit integer quantization
t5_quant_scheme = "int8" # Text encoder quantization
clip_quant_scheme = "int8" # Visual encoder quantization
```
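The choice between the two schemes can also be made programmatically. Below is a minimal, hypothetical sketch (not part of Lightx2v's API) that picks FP8 on Ada-generation GPUs (compute capability 8.9, i.e. the RTX 40 series) and falls back to INT8 everywhere else:
```python
import torch

def pick_quant_scheme() -> str:
    """Heuristic: FP8 needs Ada (SM 8.9) or newer; otherwise fall back to INT8."""
    if not torch.cuda.is_available():
        return "int8"
    # RTX 40 series (Ada) reports (8, 9); Hopper reports (9, 0).
    return "fp8" if torch.cuda.get_device_capability() >= (8, 9) else "int8"

scheme = pick_quant_scheme()
dit_quant_scheme = t5_quant_scheme = clip_quant_scheme = scheme
```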
### 3. Efficient Operator Selection Guide
Choosing the right operators can significantly improve inference speed and reduce memory usage; a simple runtime availability check is sketched after the lists below.
#### Attention Operator Selection
**Recommended Priority**:
1. **[Sage Attention](https://github.com/thu-ml/SageAttention)** (Highest priority)
2. **[Flash Attention](https://github.com/Dao-AILab/flash-attention)** (Universal solution)
#### Matrix Multiplication Operator Selection
**ADA Architecture GPUs** (RTX 40 series):
Recommended priority:
1. **[q8-kernel](https://github.com/KONAKONA666/q8_kernels)** (Highest performance, ADA architecture only)
2. **[sglang-kernel](https://github.com/sgl-project/sglang/tree/main/sgl-kernel)** (Balanced solution)
3. **[vllm-kernel](https://github.com/vllm-project/vllm)** (Universal solution)
**Other Architecture GPUs**:
1. **[sglang-kernel](https://github.com/sgl-project/sglang/tree/main/sgl-kernel)** (Recommended)
2. **[vllm-kernel](https://github.com/vllm-project/vllm)** (Alternative)
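As mentioned above, backend availability can be probed at runtime. The sketch below is illustrative only (it is not Lightx2v's actual dispatch logic) and assumes the PyPI module names `sageattention` and `flash_attn`:
```python
import importlib.util

def pick_attention_backend() -> str:
    """Return the best installed attention backend, by import probing."""
    # Priority order from the list above: Sage Attention first, then Flash Attention.
    for module, name in (("sageattention", "sage_attn"), ("flash_attn", "flash_attn")):
        if importlib.util.find_spec(module) is not None:
            return name
    return "torch_sdpa"  # fall back to PyTorch's built-in scaled_dot_product_attention

print(pick_attention_backend())
```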
### 4. Parameter Offloading Strategy
Parameter offloading lets the model dynamically stage parameters across GPU, CPU, and disk, breaking through VRAM limitations.
#### Three-Level Offloading Architecture
```python
# Disk-CPU-GPU three-level offloading configuration
cpu_offload = True                # Enable CPU offloading
t5_cpu_offload = True             # Enable T5 encoder CPU offloading
offload_granularity = "phase"     # DIT model fine-grained offloading ("phase" level)
t5_offload_granularity = "block"  # T5 encoder fine-grained offloading ("block" level)
lazy_load = True                  # Enable lazy loading mechanism
num_disk_workers = 2              # Disk I/O worker threads
```
#### Offloading Strategy Details
**Lazy Loading Mechanism** (see the sketch after this list):
- Model parameters are loaded from disk to CPU on demand
- Reduces runtime memory usage
- Supports large models running with limited memory
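A minimal sketch of the mechanism, assuming hypothetical per-block `safetensors` files produced by the conversion step described below; this illustrates the idea, not Lightx2v's actual implementation:
```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from safetensors.torch import load_file

num_disk_workers = 2              # matches the config key above
num_blocks = 40                   # hypothetical DIT block count
block_dir = Path("model_blocks")  # hypothetical per-block weight files

def load_block(i: int) -> dict:
    """Disk -> CPU: read one block's weights on demand."""
    return load_file(block_dir / f"block_{i}.safetensors", device="cpu")

def run_block(weights: dict) -> None:
    """CPU -> GPU: stage the weights, compute, then free them (stubbed)."""
    gpu_weights = {k: v.cuda(non_blocking=True) for k, v in weights.items()}
    del gpu_weights  # the real forward pass for this block would happen here

with ThreadPoolExecutor(max_workers=num_disk_workers) as pool:
    future = pool.submit(load_block, 0)  # prefetch the first block
    for i in range(num_blocks):
        weights = future.result()        # wait for the disk read
        if i + 1 < num_blocks:
            future = pool.submit(load_block, i + 1)  # overlap the next read
        run_block(weights)               # GPU compute hides disk latency
```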
**Disk Storage Optimization**:
- Use high-speed SSD to store model parameters
- Store model files grouped by blocks
- Refer to the conversion script [documentation](https://github.com/ModelTC/lightx2v/tree/main/tools/convert/readme_zh.md) and pass the `--save_by_block` flag during conversion
### 5. VRAM Optimization Techniques
VRAM optimization strategies for 720p video generation.
#### CUDA Memory Management
```python
# CUDA memory cleanup configuration
clean_cuda_cache = True # Timely cleanup of GPU cache
rotary_chunk = True # Rotary position encoding chunked computation
rotary_chunk_size = 100 # Chunk size, adjustable based on VRAM
```
#### Chunked Computation Strategy
**Rotary Position Encoding Chunking** (sketched below):
- Process long sequences in small chunks
- Reduce peak VRAM usage
- Maintain computational precision
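To make the memory pattern concrete, here is a schematic sketch of chunked rotary embedding, using a simplified `apply_rope` helper (an assumption for illustration; the real kernel differs, but the chunking idea is the same):
```python
import torch

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Simplified rotary position embedding on channel pairs."""
    x1, x2 = x.chunk(2, dim=-1)
    cos, sin = freqs.cos(), freqs.sin()
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

def apply_rope_chunked(x: torch.Tensor, freqs: torch.Tensor,
                       chunk_size: int = 100) -> torch.Tensor:
    """Apply RoPE over the sequence dimension in chunks to cap peak VRAM.

    The result is identical to the unchunked version; only the peak size
    of the cos/sin intermediates shrinks (cf. rotary_chunk_size above).
    """
    out = torch.empty_like(x)
    for start in range(0, x.shape[0], chunk_size):
        end = start + chunk_size
        out[start:end] = apply_rope(x[start:end], freqs[start:end])
    return out
```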
### 6. VAE Optimization
VAE (Variational Autoencoder) is a key component in video generation, and optimizing VAE can significantly improve performance.
#### VAE Chunked Inference
```python
# VAE optimization configuration
use_tiling_vae = True # Enable VAE chunked inference
```
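Tiled inference decodes the latent in spatial tiles so that only one tile's activations occupy VRAM at a time. A simplified, hypothetical per-frame sketch follows (a production tiled VAE, including Wan's video VAE, also overlaps and blends tiles to hide seams and handles the temporal dimension):
```python
import torch

def decode_tiled(vae_decode, latent: torch.Tensor, tile: int = 32) -> torch.Tensor:
    """Decode a (B, C, H, W) latent tile by tile; 8x spatial upscale assumed."""
    scale = 8  # typical VAE upsampling factor (an assumption here)
    b, _, h, w = latent.shape
    out = torch.zeros(b, 3, h * scale, w * scale)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = latent[:, :, y:y + tile, x:x + tile]
            ph, pw = patch.shape[2], patch.shape[3]
            out[:, :, y * scale:(y + ph) * scale,
                      x * scale:(x + pw) * scale] = vae_decode(patch)
    return out
```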
#### [Lightweight VAE](https://github.com/madebyollin/taehv/blob/main/taew2_1.pth)
```python
# VAE optimization configuration
use_tiny_vae = True # Use lightweight VAE
```
**VAE Optimization Effects**:
- Standard VAE: baseline performance, no quality loss
- Standard VAE, tiled: lower VRAM usage, longer inference time, no quality loss
- Lightweight VAE: very low VRAM usage, some loss of video quality
### 7. Model Selection Strategy
Choosing the right model version is crucial for low-resource environments.
#### Recommended Model Comparison
**Distilled Models** (Strongly recommended):
- **[Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v](https://huggingface.co/lightx2v/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v)**
- **[Wan2.1-I2V-14B-720P-StepDistill-CfgDistill-Lightx2v](https://huggingface.co/lightx2v/Wan2.1-I2V-14B-720P-StepDistill-CfgDistill-Lightx2v)**
#### Performance Optimization Suggestions
When using the above distilled models, you can further optimize performance (an illustrative config fragment follows this list):
- Disable CFG: `"enable_cfg": false`
- Reduce inference steps: `infer_step: 4`
- Reference configuration files: [config](https://github.com/ModelTC/LightX2V/tree/main/configs/distill)
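For reference, the relevant entries in a distilled-model config might look like the fragment below (illustrative only; the linked config files are authoritative):
```python
# Illustrative fragment of a distilled-model configuration
distill_overrides = {
    "enable_cfg": False,  # CFG behavior is baked in by CFG distillation
    "infer_step": 4,      # step distillation allows very few denoising steps
}
```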
## 🚀 Complete Configuration Examples
### Pre-configured Templates
- **[14B Model 480p Video Generation Configuration](https://github.com/ModelTC/lightx2v/tree/main/configs/offload/disk/wan_i2v_phase_lazy_load_480p.json)**
- **[14B Model 720p Video Generation Configuration](https://github.com/ModelTC/lightx2v/tree/main/configs/offload/disk/wan_i2v_phase_lazy_load_720p.json)**
- **[1.3B Model 720p Video Generation Configuration](https://github.com/ModelTC/LightX2V/tree/main/configs/offload/block/wan_t2v_1_3b.json)**
- The inference bottleneck for 1.3B models is the T5 encoder, so the configuration file specifically optimizes for T5
**[Launch Script](https://github.com/ModelTC/LightX2V/tree/main/scripts/wan/run_wan_i2v_lazy_load.sh)**
## 📚 Reference Resources
- [Parameter Offloading Mechanism Documentation](../method_tutorials/offload.md) - In-depth understanding of offloading technology principles
- [Quantization Technology Guide](../method_tutorials/quantization.md) - Detailed explanation of quantization technology
- [Gradio Deployment Guide](deploy_gradio.md) - Detailed Gradio deployment instructions
## ⚠️ Important Notes
1. **Hardware Requirements**: Ensure your hardware meets minimum configuration requirements
2. **Driver Version**: Recommend using the latest NVIDIA drivers (535+)
3. **CUDA Version**: Ensure CUDA version is compatible with PyTorch (recommend CUDA 11.8+)
4. **Storage Space**: Reserve sufficient disk space for model caching (at least 50GB)
5. **Network Environment**: Stable network connection required for initial model download
6. **Environment Variables**: Be sure to set the recommended environment variables to optimize performance
**Technical Support**: If you encounter issues, please submit an Issue to the project repository.