"docs/vscode:/vscode.git/clone" did not exist on "a042909c836c794a34508b314b3ce8ce93a96284"
Commit 9c1e5015 authored by gushiqiao's avatar gushiqiao
Browse files

Update docs

parent 238fcc6d
@@ -22,23 +22,6 @@
For comprehensive usage instructions, please refer to our documentation: **[English Docs](https://lightx2v-en.readthedocs.io/en/latest/) | [中文文档](https://lightx2v-zhcn.readthedocs.io/zh-cn/latest/)**
## 🚀 Core Features
### 🎯 **Ultimate Performance Optimization**
- **🔥 SOTA Inference Speed**: Achieve **~15x** acceleration through step distillation and operator optimization
- **⚡️ Revolutionary 4-Step Distillation**: Compress the original 40-50 step inference to just 4 steps, with no CFG required
- **🛠️ Advanced Operator Support**: Integrated with cutting-edge operators including [Sage Attention](https://github.com/thu-ml/SageAttention), [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Radial Attention](https://github.com/mit-han-lab/radial-attention), [q8-kernel](https://github.com/KONAKONA666/q8_kernels), [sgl-kernel](https://github.com/sgl-project/sglang/tree/main/sgl-kernel), [vllm](https://github.com/vllm-project/vllm)
### 💾 **Resource-Efficient Deployment**
- **💡 Breaking Hardware Barriers**: Run 14B models for 480P/720P video generation with only **8GB VRAM + 16GB RAM**
- **🔧 Intelligent Parameter Offloading**: Advanced disk-CPU-GPU three-tier offloading architecture with phase/block-level granular management
- **⚙️ Comprehensive Quantization**: Support for `w8a8-int8`, `w8a8-fp8`, `w4a4-nvfp4` and other quantization strategies
### 🎨 **Rich Feature Ecosystem**
- **📈 Smart Feature Caching**: Intelligent caching mechanisms to eliminate redundant computations
- **🔄 Parallel Inference**: Multi-GPU parallel processing for enhanced performance
- **📱 Flexible Deployment Options**: Support for Gradio, service deployment, ComfyUI and other deployment methods
- **🎛️ Dynamic Resolution Inference**: Adaptive resolution adjustment for optimal generation quality
## 🤖 Supported Model Ecosystem
@@ -62,6 +45,26 @@
### Autoregressive Models
- [Wan2.1-T2V-CausVid](https://huggingface.co/lightx2v/Wan2.1-T2V-14B-CausVid)
## 🚀 Core Features
### 🎯 **Ultimate Performance Optimization**
- **🔥 SOTA Inference Speed**: Achieve **~15x** acceleration via step distillation and system optimization (single GPU)
- **⚡️ Revolutionary 4-Step Distillation**: Compress the original 40-50 step inference to just 4 steps, with no CFG required
- **🛠️ Advanced Operator Support**: Integrated with cutting-edge operators including [Sage Attention](https://github.com/thu-ml/SageAttention), [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Radial Attention](https://github.com/mit-han-lab/radial-attention), [q8-kernel](https://github.com/KONAKONA666/q8_kernels), [sgl-kernel](https://github.com/sgl-project/sglang/tree/main/sgl-kernel), [vllm](https://github.com/vllm-project/vllm)
### 💾 **Resource-Efficient Deployment**
- **💡 Breaking Hardware Barriers**: Run 14B models for 480P/720P video generation with only **8GB VRAM + 16GB RAM**
- **🔧 Intelligent Parameter Offloading**: Advanced disk-CPU-GPU three-tier offloading architecture with phase/block-level granular management
- **⚙️ Comprehensive Quantization**: Support for `w8a8-int8`, `w8a8-fp8`, `w4a4-nvfp4` and other quantization strategies (see the configuration sketch after this list)
### 🎨 **Rich Feature Ecosystem**
- **📈 Smart Feature Caching**: Intelligent caching mechanisms to eliminate redundant computations
- **🔄 Parallel Inference**: Multi-GPU parallel processing for enhanced performance
- **📱 Flexible Deployment Options**: Support for Gradio, service deployment, ComfyUI and other deployment methods
- **🎛️ Dynamic Resolution Inference**: Adaptive resolution adjustment for optimal generation quality
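
These options usually meet at inference-time configuration. The sketch below is only a minimal illustration of how the quantization scheme, attention backend, offload tiers, and distilled step count might be selected together; the `InferenceConfig` class and every field name are hypothetical stand-ins, not LightX2V's actual API.

```python
from dataclasses import dataclass


@dataclass
class InferenceConfig:
    """Hypothetical config bundle; all field names are illustrative only."""
    quant_scheme: str = "w8a8-fp8"                 # w8a8-int8 | w8a8-fp8 | w4a4-nvfp4
    attn_backend: str = "sage_attn"                # e.g. Sage/Flash/Radial Attention
    offload_tiers: tuple = ("disk", "cpu", "gpu")  # three-tier parameter offloading
    offload_granularity: str = "block"             # "phase" or "block" level
    num_inference_steps: int = 4                   # 4-step distilled checkpoint
    enable_cfg: bool = False                       # distillation removes the CFG pass


def validate(cfg: InferenceConfig) -> None:
    """Reject combinations the feature list above does not advertise."""
    assert cfg.quant_scheme in {"w8a8-int8", "w8a8-fp8", "w4a4-nvfp4"}
    assert cfg.offload_granularity in {"phase", "block"}
    if cfg.num_inference_steps == 4:
        # The 4-step distilled models are trained to run without CFG, so a
        # second (unconditional) forward pass would only waste the speedup.
        assert not cfg.enable_cfg


validate(InferenceConfig())
```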
## 🏆 Performance Benchmarks
For detailed performance metrics and comparisons, please refer to our [benchmark documentation](https://github.com/ModelTC/LightX2V/blob/main/docs/EN/source/getting_started/benchmark_source.md).
@@ -22,24 +22,6 @@
For comprehensive usage instructions, please refer to our documentation: **[English Docs](https://lightx2v-en.readthedocs.io/en/latest/) | [Chinese Docs](https://lightx2v-zhcn.readthedocs.io/zh-cn/latest/)**
## 🚀 Core Features
### 🎯 **Ultimate Performance Optimization**
- **🔥 SOTA Inference Speed**: Achieve **15x** acceleration through step distillation and operator optimization
- **⚡️ Revolutionary 4-Step Distillation**: Compress the original 40-50 step inference to just 4 steps, with no CFG required
- **🛠️ Advanced Operator Support**: Integrated with cutting-edge operators including [Sage Attention](https://github.com/thu-ml/SageAttention), [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Radial Attention](https://github.com/mit-han-lab/radial-attention), [q8-kernel](https://github.com/KONAKONA666/q8_kernels), [sgl-kernel](https://github.com/sgl-project/sglang/tree/main/sgl-kernel), [vllm](https://github.com/vllm-project/vllm)
### 💾 **Resource-Efficient Deployment**
- **💡 Breaking Hardware Barriers**: Run 14B models for 480P/720P video generation with only **8GB VRAM + 16GB RAM**
- **🔧 Intelligent Parameter Offloading**: Advanced disk-CPU-GPU three-tier offloading architecture with phase/block-level granular management
- **⚙️ Comprehensive Quantization**: Support for `w8a8-int8`, `w8a8-fp8`, `w4a4-nvfp4` and other quantization strategies
### 🎨 **Rich Feature Ecosystem**
- **📈 Smart Feature Caching**: Intelligent caching mechanisms that eliminate redundant computation and improve efficiency
- **🔄 Parallel Inference**: Multi-GPU parallel processing for significantly better performance
- **📱 Flexible Deployment Options**: Support for Gradio, service deployment, ComfyUI, and other deployment methods
- **🎛️ Dynamic Resolution Inference**: Adaptive resolution adjustment for better generation quality
## 🤖 Supported Model Ecosystem
### Officially Open-Sourced Models
@@ -62,6 +44,25 @@
### Autoregressive Models
- [Wan2.1-T2V-CausVid](https://huggingface.co/lightx2v/Wan2.1-T2V-14B-CausVid)
## 🚀 Core Features
### 🎯 **Ultimate Performance Optimization**
- **🔥 SOTA Inference Speed**: Achieve **15x** acceleration through step distillation and system optimization (single GPU)
- **⚡️ Revolutionary 4-Step Distillation**: Compress the original 40-50 step inference to just 4 steps, with no CFG required
- **🛠️ Advanced Operator Support**: Integrated with cutting-edge operators including [Sage Attention](https://github.com/thu-ml/SageAttention), [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Radial Attention](https://github.com/mit-han-lab/radial-attention), [q8-kernel](https://github.com/KONAKONA666/q8_kernels), [sgl-kernel](https://github.com/sgl-project/sglang/tree/main/sgl-kernel), [vllm](https://github.com/vllm-project/vllm)
### 💾 **Resource-Efficient Deployment**
- **💡 Breaking Hardware Barriers**: Run 14B models for 480P/720P video generation with only **8GB VRAM + 16GB RAM**
- **🔧 Intelligent Parameter Offloading**: Advanced disk-CPU-GPU three-tier offloading architecture with phase/block-level granular management (see the sketch after this list)
- **⚙️ Comprehensive Quantization**: Support for `w8a8-int8`, `w8a8-fp8`, `w4a4-nvfp4` and other quantization strategies
### 🎨 **Rich Feature Ecosystem**
- **📈 Smart Feature Caching**: Intelligent caching mechanisms that eliminate redundant computation and improve efficiency
- **🔄 Parallel Inference**: Multi-GPU parallel processing for significantly better performance
- **📱 Flexible Deployment Options**: Support for Gradio, service deployment, ComfyUI, and other deployment methods
- **🎛️ Dynamic Resolution Inference**: Adaptive resolution adjustment for better generation quality
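
The three-tier offload above can be pictured as a small cache hierarchy: block weights live on disk, a few recently used blocks are kept in CPU RAM, and a block is promoted to the GPU only while it executes. Below is a minimal sketch of that idea, assuming a hypothetical `BlockStore` with per-block checkpoint files; it is not the repository's implementation.

```python
import os

import torch


class BlockStore:
    """Toy disk -> CPU -> GPU promotion of per-block weights.
    Class name and file layout are hypothetical; only the tiering idea
    comes from the feature list above."""

    def __init__(self, block_dir: str, device: str = "cuda", cpu_cache_size: int = 2):
        self.block_dir = block_dir
        self.device = device
        self.cpu_cache: dict[int, dict[str, torch.Tensor]] = {}
        self.cpu_cache_size = cpu_cache_size

    def _load_from_disk(self, idx: int) -> dict[str, torch.Tensor]:
        # Tier 1 -> tier 2: read one transformer block's weights into CPU RAM.
        path = os.path.join(self.block_dir, f"block_{idx}.pt")
        return torch.load(path, map_location="cpu")

    def fetch(self, idx: int) -> dict[str, torch.Tensor]:
        if idx not in self.cpu_cache:
            if len(self.cpu_cache) >= self.cpu_cache_size:
                # Evict the oldest cached block (simple FIFO policy).
                self.cpu_cache.pop(next(iter(self.cpu_cache)))
            self.cpu_cache[idx] = self._load_from_disk(idx)
        # Tier 2 -> tier 3: move to GPU only for the duration of this block.
        return {k: v.to(self.device, non_blocking=True)
                for k, v in self.cpu_cache[idx].items()}
```

In a real pipeline the next block's disk-to-CPU copy would be prefetched on a background stream while the current block computes, hiding most of the transfer latency.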
## 🏆 Performance Benchmarks
For detailed performance metrics and comparisons, please refer to our [benchmark documentation](https://github.com/ModelTC/LightX2V/blob/main/docs/ZH_CN/source/getting_started/benchmark_source.md).
@@ -26,6 +26,7 @@
| **LightX2V_2** | 37.24 | 216.16 | 50 | **1.69x** | <video src="PATH_TO_LIGHTX2V_2_480P_VIDEO" width="200px"></video> |
| **LightX2V_3** | 23.62 | 190.73 | 35 | **1.92x** | <video src="PATH_TO_LIGHTX2V_3_480P_VIDEO" width="200px"></video> |
| **LightX2V_4** | 23.62 | 107.19 | 35 | **3.41x** | <video src="PATH_TO_LIGHTX2V_4_480P_VIDEO" width="200px"></video> |
| **LightX2V_4-Distill** | 23.62 | 107.19 | 35 | **3.41x** | <video src="PATH_TO_LIGHTX2V_4_DISTILL_480P_VIDEO" width="200px"></video> |
### 720P 5s Video
@@ -56,3 +57,4 @@
- **LightX2V_2**: Unified BF16 precision computation to further reduce memory usage and computational overhead while maintaining generation quality
- **LightX2V_3**: Quantization optimization introducing FP8 quantization technology to significantly reduce computational precision requirements, combined with Tiling VAE technology to optimize memory usage
- **LightX2V_4**: Ultimate optimization adding TeaCache (teacache_thresh=0.2) caching reuse technology on top of LightX2V_3 to achieve maximum acceleration by intelligently skipping redundant computations (see the sketch after this list)
- **LightX2V_4-Distill**: Builds on LightX2V_4 with the 4-step distilled model ([Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v](https://huggingface.co/lightx2v/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v))
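
The TeaCache-style reuse behind LightX2V_4 can be sketched as threshold-gated skipping: when the transformer's input has changed little since the last fully computed step, the cached output residual is replayed instead of running the network. The class below is a hypothetical simplification of that idea (roughly speaking, TeaCache itself accumulates a rescaled relative-L1 distance across steps), not the actual implementation.

```python
import torch


class TeaCacheLikeSkipper:
    """Threshold-gated compute skipping; a simplified, hypothetical sketch."""

    def __init__(self, thresh: float = 0.2):
        self.thresh = thresh          # e.g. teacache_thresh=0.2 from the table
        self.prev_input = None        # input seen at the last computed step
        self.cached_residual = None   # transformer output minus its input

    def step(self, x: torch.Tensor, transformer) -> torch.Tensor:
        if self.prev_input is not None:
            rel_change = ((x - self.prev_input).abs().mean()
                          / (self.prev_input.abs().mean() + 1e-8)).item()
            if rel_change < self.thresh:
                # Input barely moved since the last real forward pass:
                # replay the cached residual instead of recomputing.
                return x + self.cached_residual
        out = transformer(x)                    # expensive DiT forward
        self.cached_residual = out - x
        self.prev_input = x.detach().clone()
        return out
```

A larger threshold skips more steps and accelerates further, at some risk to quality, which is why the benchmark pins the value used (0.2).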
@@ -21,11 +21,13 @@
| Configuration | Model Load Time (s) | Inference Time (s) | GPU Memory (GB) | Speedup | Video |
|:-----|:---------------:|:----------:|:---------------:|:------:|:--------:|
| **Wan2.1 Official** | 68.26 | 366.04 | 71 | 1.0x | <video src="https://github.com/user-attachments/assets/24fb112e-c868-4484-b7f0-d9542979c2c3" width="200px"></video> |
| **FastVideo** | xx | xx | xx | **xx** | <video src="" width="200px"></video> |
| **LightX2V_1** | 37.28 | 249.54 | 53 | **1.47x** | <video src="https://github.com/user-attachments/assets/7bffe48f-e433-430b-91dc-ac745908ba3a" width="200px"></video> |
| **LightX2V_2** | 37.24 | 216.16 | 50 | **1.69x** | <video src="https://github.com/user-attachments/assets/0a24ca47-c466-433e-8a53-96f259d19841" width="200px"></video> |
| **LightX2V_3** | 23.62 | 190.73 | 35 | **1.92x** | <video src="https://github.com/user-attachments/assets/970c73d3-1d60-444e-b64d-9bf8af9b19f1" width="200px"></video> |
| **LightX2V_4** | 23.62 | 107.19 | 35 | **3.41x** | <video src="https://github.com/user-attachments/assets/49cd2760-4be2-432c-bf4e-01af9a1303dd" width="200px"></video> |
| **LightX2V_4-Distill** | xxx | xxx | xx | **xx** | <video src="" width="200px"></video> |
### 720P 5s Video
@@ -51,8 +53,10 @@
## Table Notes
- **Wan2.1 Official**: Based on the [Wan2.1 official repository](https://github.com/Wan-Video/Wan2.1)
- **FastVideo**: Based on the [FastVideo official repository](https://github.com/hao-ai-lab/FastVideo)
- **LightX2V_1**: Replaces the native attention mechanism with SageAttention2 and uses DIT BF16+FP32 (for some precision-sensitive layers) mixed-precision computation, improving efficiency while preserving accuracy
- **LightX2V_2**: Uses unified BF16 precision computation to further reduce GPU memory usage and computational overhead while maintaining generation quality
- **LightX2V_3**: Introduces FP8 quantization to significantly reduce computational precision requirements, combined with Tiling VAE to optimize GPU memory usage
- **LightX2V_4**: Adds TeaCache (teacache_thresh=0.2) cache-reuse on top of LightX2V_3, accelerating inference by intelligently skipping redundant computations
- **LightX2V_4-Distill**: Builds on LightX2V_4 with the 4-step distilled model ([Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v](https://huggingface.co/lightx2v/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v)); a schematic sampling loop follows below
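
The distilled row trades the usual guided 40-50 step loop for four unguided steps. The sketch below is schematic only: `denoiser` and `scheduler` are assumed stand-ins, not LightX2V's real objects. The point it illustrates is that step plus CFG distillation leaves a single forward pass per step, with no second unconditional pass to average in.

```python
import torch


@torch.no_grad()
def sample_distilled(denoiser, scheduler, latents: torch.Tensor,
                     text_emb: torch.Tensor, num_steps: int = 4) -> torch.Tensor:
    """Schematic 4-step sampling for a step+CFG distilled model.
    `denoiser` and `scheduler` are hypothetical stand-in objects."""
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        # One forward pass per step: CFG distillation bakes the guidance
        # into the weights, so no unconditional pass is needed.
        noise_pred = denoiser(latents, t, text_emb)
        latents = scheduler.step(noise_pred, t, latents)
    return latents
```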