# Feature Caching
## Cache Acceleration Algorithm
- In the inference process of diffusion models, cache reuse is an important acceleration algorithm.
- The core idea is to skip redundant computations at certain time steps by reusing historical cache results to improve inference efficiency.
- The key to the algorithm is deciding at which time steps to perform cache reuse, usually based on dynamic judgment of model state changes or error thresholds.
- During inference, key content such as intermediate features, residuals, and attention outputs need to be cached. When entering reusable time steps, the cached content is directly utilized, and the current output is reconstructed through approximation methods like Taylor expansion, thereby reducing repeated calculations and achieving efficient inference.
### TeaCache
The core idea of `TeaCache` is to accumulate the **relative L1** distance between the inputs of adjacent time steps. When the accumulated distance reaches a set threshold, the current time step is judged unsuitable for cache reuse and a full computation is performed; conversely, while the accumulated distance stays below the threshold, cache reuse is used to accelerate inference.
- Specifically, the algorithm calculates the relative L1 distance between the current input and the previous step input at each inference step and accumulates it.
- When the accumulated distance does not exceed the threshold, the model state has not changed noticeably, so the most recently cached content is reused directly, skipping redundant computation. This significantly reduces the number of forward passes and improves inference speed; a minimal sketch of this rule is given below.
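A minimal sketch of the decision rule, assuming a per-step relative L1 accumulator. The class name, the `should_reuse` method, and the 0.26 threshold are illustrative choices of ours, not LightX2V's API:

```python
import torch

class TeaCacheDecider:
    """Hedged sketch of TeaCache-style reuse decisions (not LightX2V's API)."""

    def __init__(self, threshold: float = 0.26):
        self.threshold = threshold  # tunable: larger -> more reuse, more error
        self.accum = 0.0
        self.prev = None

    def should_reuse(self, x: torch.Tensor) -> bool:
        if self.prev is None:       # first step: nothing cached yet
            self.prev = x
            return False
        # Accumulate the relative L1 distance between adjacent step inputs.
        self.accum += ((x - self.prev).abs().mean() / self.prev.abs().mean()).item()
        self.prev = x
        if self.accum < self.threshold:
            return True             # state barely changed: reuse the cache
        self.accum = 0.0            # changed enough: recompute, reset accumulator
        return False

# Usage inside a denoising loop (model and inputs are placeholders):
#   out = cached if decider.should_reuse(x) else (cached := model(x))
```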
In practical effects, TeaCache achieves significant acceleration while ensuring generation quality. On a single H200 card, the time consumption and video comparison before and after acceleration are as follows:
<table>
<tr>
<td align="center">
Before acceleration: 58s
</td>
<td align="center">
After acceleration: 17.9s
</td>
</tr>
<tr>
<td align="center">
<video src="https://github.com/user-attachments/assets/1781df9b-04df-4586-b22f-5d15f8e1bff6" width="100%"></video>
</td>
<td align="center">
<video src="https://github.com/user-attachments/assets/e93f91eb-3825-4866-90c2-351176263a2f" width="100%"></video>
</td>
</tr>
</table>
- Acceleration ratio: **3.24**
- Config: [wan_t2v_1_3b_tea_480p.json](https://github.com/ModelTC/lightx2v/tree/main/configs/caching/teacache/wan_t2v_1_3b_tea_480p.json)
- Reference paper: [https://arxiv.org/abs/2411.19108](https://arxiv.org/abs/2411.19108)
### TaylorSeer Cache
The core of `TaylorSeer Cache` lies in using the Taylor formula to reconstruct cached content as residual compensation at cache-reuse time steps.
- Specifically, at cache-reuse time steps the historical cache is not simply reused; the current output is also approximately reconstructed through a Taylor expansion. This further improves output accuracy while still reducing computational load.
- The Taylor expansion effectively captures small changes in model state, compensating for the errors introduced by cache reuse and thereby preserving generation quality during acceleration.
`TaylorSeer Cache` is suitable for scenarios with high output accuracy requirements and can further improve model inference performance based on cache reuse.
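As a hedged illustration of the reconstruction step, here is a first-order variant: cache a feature and its finite-difference derivative at full-compute steps, then extrapolate at reuse steps. Names and the truncation order are our own simplification of the paper's higher-order scheme:

```python
import torch

class TaylorExtrapolator:
    """Sketch: first-order Taylor prediction from cached features."""

    def __init__(self):
        self.f = None    # feature cached at the last full-compute step
        self.df = None   # finite-difference derivative at that step

    def update(self, f_new: torch.Tensor):
        # Call at full-compute steps: refresh the cached value and derivative.
        self.df = None if self.f is None else f_new - self.f
        self.f = f_new

    def predict(self, k: int) -> torch.Tensor:
        # Call at reuse steps, k steps after the last full compute:
        #   F(t + k) ~= F(t) + k * F'(t)   (truncated Taylor expansion)
        if self.df is None:
            return self.f
        return self.f + k * self.df
```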
<table>
<tr>
<td align="center">
Before acceleration: 57.7s
</td>
<td align="center">
After acceleration: 41.3s
</td>
</tr>
<tr>
<td align="center">
<video src="https://github.com/user-attachments/assets/2d04005c-853b-4752-884b-29f8ea5717d2" width="100%"></video>
</td>
<td align="center">
<video src="https://github.com/user-attachments/assets/270e3624-c904-468c-813e-0c65daf1594d" width="100%"></video>
</td>
</tr>
</table>
- Acceleration ratio: **1.39**
- Config: [wan_t2v_taylorseer](https://github.com/ModelTC/lightx2v/tree/main/configs/caching/taylorseer/wan_t2v_taylorseer.json)
- Reference paper: [https://arxiv.org/abs/2503.06923](https://arxiv.org/abs/2503.06923)
### AdaCache
The core idea of `AdaCache` is to dynamically adjust the step size of cache reuse based on partial cached content in specified block chunks.
- The algorithm analyzes feature differences between two adjacent time steps within specific blocks and adaptively determines the next cache reuse time step interval based on the difference magnitude.
- When model state changes are small, the step size automatically increases, reducing cache update frequency; when state changes are large, the step size decreases to ensure output quality.
This allows flexible adjustment of caching strategies based on dynamic changes in the actual inference process, achieving more efficient acceleration and better generation results. AdaCache is suitable for application scenarios that have high requirements for both inference speed and generation quality.
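A hedged sketch of the adaptive rule; the thresholds and intervals below are illustrative assumptions, not the values LightX2V ships:

```python
import torch

def next_reuse_interval(feat_curr: torch.Tensor, feat_prev: torch.Tensor) -> int:
    # Feature difference between adjacent steps inside the monitored block.
    diff = ((feat_curr - feat_prev).abs().mean() / feat_prev.abs().mean()).item()
    # Smaller change -> longer reuse interval (fewer cache updates);
    # larger change -> shorter interval to protect output quality.
    for bound, interval in [(0.03, 6), (0.06, 4), (0.12, 2)]:
        if diff < bound:
            return interval
    return 1  # state is changing quickly: recompute at every step
```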
<table>
<tr>
<td align="center">
Before acceleration: 227s
</td>
<td align="center">
After acceleration: 83s
</td>
</tr>
<tr>
<td align="center">
<video src="https://github.com/user-attachments/assets/33b2206d-17e6-4433-bed7-bfa890f9fa7d" width="100%"></video>
</td>
<td align="center">
<video src="https://github.com/user-attachments/assets/084dbe3d-6ff3-4afc-9a7c-453ec53b3672" width="100%"></video>
</td>
</tr>
</table>
- Acceleration ratio: **2.73**
- Config: [wan_i2v_ada](https://github.com/ModelTC/lightx2v/tree/main/configs/caching/adacache/wan_i2v_ada.json)
- Reference paper: [https://arxiv.org/abs/2411.02397](https://arxiv.org/abs/2411.02397)
### CustomCache
`CustomCache` combines the advantages of `TeaCache` and `TaylorSeer Cache`.
- It adopts `TeaCache`'s responsive and well-founded cache decision-making, using dynamic thresholds to determine when to perform cache reuse.
- At the same time, it exploits cached content through `TaylorSeer`'s Taylor-expansion reconstruction.
This not only determines the timing of cache reuse efficiently but also makes the most of the cached content, improving output accuracy and generation quality. Actual testing shows that `CustomCache` produces video quality superior to `TeaCache`, `TaylorSeer Cache`, or `AdaCache` alone across multiple content generation tasks, making it one of the best all-around cache acceleration algorithms currently available.
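Conceptually, the combination can be sketched by wiring together the two illustrative pieces above (again a sketch of the idea, not LightX2V internals):

```python
# Builds on the hedged TeaCacheDecider / TaylorExtrapolator sketches above.
decider = TeaCacheDecider(threshold=0.26)
taylor = TaylorExtrapolator()
steps_since_full = 0

def custom_cache_step(model, x):
    global steps_since_full
    if decider.should_reuse(x):
        steps_since_full += 1
        return taylor.predict(steps_since_full)  # Taylor-reconstructed output
    out = model(x)                               # full forward pass
    taylor.update(out)                           # refresh cache and derivative
    steps_since_full = 0
    return out
```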
<table>
<tr>
<td align="center">
Before acceleration: 57.9s
</td>
<td align="center">
After acceleration: 16.6s
</td>
</tr>
<tr>
<td align="center">
<video src="https://github.com/user-attachments/assets/304ff1e8-ad1c-4013-bcf1-959ac140f67f" width="100%"></video>
</td>
<td align="center">
<video src="https://github.com/user-attachments/assets/d3fb474a-79af-4f33-b965-23d402d3cf16" width="100%"></video>
</td>
</tr>
</table>
- Acceleration ratio: **3.49**
- Config: [wan_t2v_custom_1_3b](https://github.com/ModelTC/lightx2v/tree/main/configs/caching/custom/wan_t2v_custom_1_3b.json)
## Usage
The config files for feature caching are located [here](https://github.com/ModelTC/lightx2v/tree/main/configs/caching).
By pointing `--config_json` at a specific config file, you can test the different cache algorithms.
Some ready-to-run scripts are available [here](https://github.com/ModelTC/lightx2v/tree/main/scripts/cache).
# Changing Resolution Inference
## Overview
Changing resolution inference is a technical strategy for optimizing the denoising process. It improves computational efficiency while maintaining generation quality by using different resolutions at different denoising stages. The core idea is to use lower resolution for rough denoising in the early stages of the denoising process, then switch to normal resolution for fine-tuning in the later stages.
## Technical Principles
### Phased Denoising Strategy
Changing resolution inference is based on the following observations:
- **Early-stage denoising**: Mainly processes rough noise and overall structure, doesn't require excessive detail information
- **Late-stage denoising**: Focuses on detail optimization and high-frequency information recovery, requires complete resolution information
### Resolution Switching Mechanism
1. **Low Resolution Stage** (Early stage)
- Downsample the input to lower resolution (e.g., 0.75 of original size)
- Execute initial denoising steps
- Quickly remove most noise and establish basic structure
2. **Normal Resolution Stage** (Late stage)
- Upsample the denoising result from the first step back to original resolution
- Continue executing remaining denoising steps
   - Recover detail information and complete the fine-tuning (a sketch of the two-stage schedule follows below)
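A hedged sketch of this two-stage schedule, assuming 5D video latents (N, C, T, H, W); `denoise_step` and the step split are illustrative placeholders, and the 0.75 scale mirrors the example above:

```python
import torch.nn.functional as F

def changing_resolution_denoise(model, latents, timesteps, switch_at, scale=0.75):
    # Stage 1: coarse denoising at reduced resolution (early steps).
    x = F.interpolate(latents, scale_factor=scale, mode="trilinear")
    for t in timesteps[:switch_at]:
        x = model.denoise_step(x, t)            # hypothetical per-step API
    # Stage 2: upsample back to the original resolution and refine details.
    x = F.interpolate(x, size=latents.shape[2:], mode="trilinear")
    for t in timesteps[switch_at:]:
        x = model.denoise_step(x, t)
    return x
```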
## Usage
The config files for changing resolution inference are available [here](https://github.com/ModelTC/LightX2V/tree/main/configs/changing_resolution).
By pointing `--config_json` at a specific config file, you can test changing resolution inference.
You can refer to [this script](https://github.com/ModelTC/LightX2V/blob/main/scripts/wan/run_wan_t2v_changing_resolution.sh).
# Offload
## 📖 Overview
Lightx2v implements a state-of-the-art parameter offloading mechanism specifically designed for efficient large model inference under limited hardware resources. This system provides excellent speed-memory balance through intelligent management of model weights across different memory hierarchies, enabling dynamic scheduling between GPU, CPU, and disk storage.
**Core Features:**
- **Intelligent Granularity Management**: Supports both Block and Phase offloading granularities for flexible memory control
- **Block Granularity**: Complete Transformer layers as management units, containing self-attention, cross-attention, feed-forward networks, etc., suitable for memory-sufficient environments
- **Phase Granularity**: Individual computational components as management units, providing finer-grained memory control for memory-constrained deployment scenarios
- **Multi-level Storage Architecture**: GPU → CPU → Disk three-tier storage hierarchy with intelligent caching strategies
- **Asynchronous Parallel Processing**: CUDA stream-based asynchronous computation and data transfer for maximum hardware utilization
- **Persistent Storage Support**: SSD/NVMe disk storage support for ultra-large model inference deployment
## 🎯 Offloading Strategy Details
### Strategy 1: GPU-CPU Granularity Offloading
**Applicable Scenarios**: GPU VRAM insufficient but system memory resources adequate
**Technical Principle**: Establishes efficient weight scheduling mechanism between GPU and CPU memory, managing model weights in Block or Phase units. Leverages CUDA stream asynchronous capabilities to achieve parallel execution of computation and data transfer. Blocks contain complete Transformer layer structures, while Phases correspond to individual computational components within layers.
**Granularity Selection Guide**:
- **Block Granularity**: Suitable for memory-sufficient environments, reduces management overhead and improves overall performance
- **Phase Granularity**: Suitable for memory-constrained environments, provides more flexible memory control and optimizes resource utilization
<div align="center">
<img alt="GPU-CPU Block/Phase Offloading Workflow" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig1_en.png" width="75%">
</div>
<div align="center">
<img alt="Swap Mechanism Core Concept" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig2_en.png" width="75%">
</div>
<div align="center">
<img alt="Asynchronous Execution Flow" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig3_en.png" width="75%">
</div>
**Technical Features:**
- **Multi-stream Parallel Architecture**: Employs three CUDA streams with different priorities to parallelize computation and transfer
- Compute Stream (priority=-1): High priority, responsible for current computation tasks
- GPU Load Stream (priority=0): Medium priority, responsible for weight prefetching from CPU to GPU
- CPU Load Stream (priority=0): Medium priority, responsible for weight offloading from GPU to CPU
- **Intelligent Prefetching Mechanism**: Predictively loads next Block/Phase based on computation progress
- **Efficient Cache Management**: Maintains weight cache pool in CPU memory for improved access efficiency
- **Stream Synchronization Guarantee**: Ensures temporal correctness of data transfer and computation
- **Position Rotation Optimization**: Achieves continuous computation through Swap operations, avoiding repeated loading/unloading (a stream-level sketch follows below)
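As a hedged PyTorch sketch of the three-stream pipeline: the stream roles and priorities follow the list above, while the block/weight handling and synchronization are simplified illustrations of ours:

```python
import torch

compute_stream = torch.cuda.Stream(priority=-1)   # high priority: current Block's math
gpu_load_stream = torch.cuda.Stream(priority=0)   # prefetch next Block CPU -> GPU
cpu_load_stream = torch.cuda.Stream(priority=0)   # offload finished Block GPU -> CPU

def run_block(block, x, next_block_cpu, done_block_gpu):
    # Kick off both transfers so they overlap with computation
    # (weights are assumed to live in pinned CPU memory for async copies).
    with torch.cuda.stream(gpu_load_stream):
        prefetched = {k: v.to("cuda", non_blocking=True) for k, v in next_block_cpu.items()}
    with torch.cuda.stream(cpu_load_stream):
        evicted = {k: v.to("cpu", non_blocking=True) for k, v in done_block_gpu.items()}
    with torch.cuda.stream(compute_stream):
        x = block(x)
    compute_stream.synchronize()  # simplified ordering guarantee before the swap
    return x, prefetched, evicted
```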
### Strategy 2: Disk-CPU-GPU Three-Level Offloading (Lazy Loading)
**Applicable Scenarios**: Both GPU VRAM and system memory resources insufficient in constrained environments
**Technical Principle**: Introduces disk storage layer on top of Strategy 1, constructing a Disk→CPU→GPU three-level storage architecture. CPU serves as a configurable intelligent cache pool, suitable for various memory-constrained deployment environments.
<div align="center">
<img alt="Disk-CPU-GPU Three-Level Offloading Architecture" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig4_en.png" width="75%">
</div>
<div align="center">
<img alt="Complete Workflow" src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/figs/offload/fig5_en.png" width="75%">
</div>
**Execution Steps Details:**
1. **Disk Storage Layer**: Model weights organized by Block on SSD/NVMe, each Block corresponding to one .safetensors file
2. **Task Scheduling Layer**: Priority queue-based intelligent scheduling system for disk loading task assignment
3. **Asynchronous Loading Layer**: Multi-threaded parallel reading of weight files from disk to CPU memory buffer
4. **Intelligent Cache Layer**: CPU memory buffer using FIFO strategy for cache management with dynamic size configuration
5. **Cache Hit Optimization**: Direct transfer to GPU when weights are already in cache, avoiding disk I/O overhead
6. **Prefetch Transfer Layer**: Weights in cache asynchronously transferred to GPU memory via GPU load stream
7. **Compute Execution Layer**: Weights on GPU perform computation (compute stream) while background continues prefetching next Block/Phase
8. **Position Rotation Layer**: Swap rotation after computation completion for continuous computation flow
9. **Memory Management Layer**: Automatic eviction of earliest used weight Blocks/Phases when CPU cache is full
**Technical Features:**
- **On-demand Loading Mechanism**: Model weights loaded from disk only when needed, avoiding loading entire model at once
- **Configurable Cache Strategy**: CPU memory buffer supports FIFO strategy with dynamically adjustable size
- **Multi-threaded Parallel Loading**: Leverages multiple disk worker threads for parallel data loading
- **Asynchronous Transfer Optimization**: CUDA stream-based asynchronous data transfer for maximum hardware utilization
- **Continuous Computation Guarantee**: Achieves continuous computation through the position rotation mechanism, avoiding repeated loading/unloading operations (a cache-level sketch follows below)
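A hedged sketch of the FIFO CPU cache and disk workers behind the steps above. The class and function names are ours; only the one-`.safetensors`-file-per-Block layout and the FIFO policy come from the description:

```python
import queue
import threading
from collections import OrderedDict

from safetensors.torch import load_file  # one .safetensors file per Block

class FifoWeightCache:
    # CPU-side buffer: evicts the earliest-inserted Block when full (FIFO).
    def __init__(self, max_blocks: int):
        self.max_blocks, self.store = max_blocks, OrderedDict()
        self.lock = threading.Lock()

    def put(self, block_id: int, weights):
        with self.lock:
            if len(self.store) >= self.max_blocks:
                self.store.popitem(last=False)   # evict the oldest entry
            self.store[block_id] = weights

    def get(self, block_id: int):
        with self.lock:
            return self.store.get(block_id)      # cache hit -> skip disk I/O

def disk_worker(task_q: "queue.Queue[tuple[int, str]]", cache: FifoWeightCache):
    # Each of num_disk_workers threads pulls (block_id, path) tasks and
    # reads that Block's weights from SSD/NVMe into the CPU cache.
    while True:
        block_id, path = task_q.get()
        cache.put(block_id, load_file(path))
        task_q.task_done()
```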
## ⚙️ Configuration Parameters Details
### GPU-CPU Offloading Configuration
```python
config = {
"cpu_offload": True, # Enable CPU offloading functionality
"offload_ratio": 1.0, # Offload ratio (0.0-1.0), 1.0 means complete offloading
"offload_granularity": "block", # Offload granularity selection: "block" or "phase"
"lazy_load": False, # Disable lazy loading mode
}
```
### Disk-CPU-GPU Offloading Configuration
```python
config = {
"cpu_offload": True, # Enable CPU offloading functionality
"lazy_load": True, # Enable lazy loading mode
"offload_ratio": 1.0, # Offload ratio setting
"offload_granularity": "phase", # Recommended to use phase granularity for better memory control
"num_disk_workers": 2, # Number of disk worker threads
"offload_to_disk": True, # Enable disk offloading functionality
"offload_path": ".", # Disk offload path configuration
}
```
**Key parameter descriptions for the intelligent cache:**
- `max_memory`: Controls CPU cache size upper limit, directly affects cache hit rate and memory usage
- `num_disk_workers`: Controls number of disk loading threads, affects data prefetch speed
- `offload_granularity`: Controls cache management granularity, affects cache efficiency and memory utilization
- `"block"`: Cache management in units of complete Transformer layers, suitable for memory-sufficient environments
- `"phase"`: Cache management in units of individual computational components, suitable for memory-constrained environments
Detailed configuration files can be referenced at [Official Configuration Repository](https://github.com/ModelTC/lightx2v/tree/main/configs/offload)
## 🎯 Deployment Strategy Recommendations
- 🔄 GPU-CPU Granularity Offloading: Suitable for insufficient GPU VRAM (RTX 3090/4090 24G) but adequate system memory (>64G)
- Advantages: Balances performance and memory usage, suitable for medium-scale model inference
- 💾 Disk-CPU-GPU Three-Level Offloading: Suitable for limited GPU VRAM (RTX 3060/4090 8G) and insufficient system memory (16-32G)
- Advantages: Supports ultra-large model inference with lowest hardware threshold
- 🚫 No Offload Mode: Suitable for high-end hardware configurations pursuing optimal inference performance
- Advantages: Maximizes computational efficiency, suitable for latency-sensitive application scenarios
## 🔍 Troubleshooting and Solutions
### Common Performance Issues and Optimization Strategies
1. **Disk I/O Performance Bottleneck**
- Problem Symptoms: Slow model loading speed, high inference latency
- Solutions:
- Upgrade to NVMe SSD storage devices
- Increase num_disk_workers parameter value
- Optimize file system configuration
2. **Memory Buffer Overflow**
- Problem Symptoms: Insufficient system memory, program abnormal exit
- Solutions:
- Increase max_memory parameter value
- Decrease num_disk_workers parameter value
- Adjust offload_granularity to "phase"
3. **Model Loading Timeout**
- Problem Symptoms: Timeout errors during model loading process
- Solutions:
- Check disk read/write performance
- Optimize file system parameters
- Verify storage device health status
## 📚 Technical Summary
Lightx2v's offloading mechanism is specifically designed for modern AI inference scenarios, fully leveraging GPU's asynchronous computing capabilities and multi-level storage architecture advantages. Through intelligent weight management and efficient parallel processing, this mechanism significantly reduces the hardware threshold for large model inference, providing developers with flexible and efficient deployment solutions.
**Technical Highlights:**
- 🚀 **Performance Optimization**: Asynchronous parallel processing maximizes hardware utilization
- 💾 **Intelligent Memory**: Multi-level caching strategies achieve optimal memory management
- 🔧 **Flexible Configuration**: Supports flexible configuration of multiple granularities and strategies
- 🛡️ **Stable and Reliable**: Comprehensive error handling and fault recovery mechanisms
# Parallel Inference
LightX2V supports distributed parallel inference, enabling the use of multiple GPUs for inference. The DiT part supports two parallel attention mechanisms: **Ulysses** and **Ring**, and also supports **VAE parallel inference**. Parallel inference significantly reduces inference time and alleviates the memory overhead of each GPU.
## DiT Parallel Configuration
DiT parallel is controlled by the `parallel_attn_type` parameter and supports two parallel attention mechanisms:
### 1. Ulysses Parallel
**Configuration:**
```json
{
"parallel_attn_type": "ulysses"
}
```
### 2. Ring Parallel
**Configuration:**
```json
{
"parallel_attn_type": "ring"
}
```
## VAE Parallel Configuration
VAE parallel is controlled by the `parallel_vae` parameter:
```json
{
"parallel_vae": true
}
```
**Configuration Description:**
- `parallel_vae: true`: Enable VAE parallel inference (recommended setting)
- `parallel_vae: false`: Disable VAE parallel, use single GPU processing
**Usage Recommendations:**
- In multi-GPU environments, it is recommended to always enable VAE parallel
- VAE parallel can be combined with any attention parallel method (Ulysses/Ring)
- For memory-constrained scenarios, VAE parallel can significantly reduce memory usage (a combined configuration sketch follows below)
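For instance, enabling Ulysses attention parallelism together with VAE parallelism combines the two documented keys (sketched here as a Python dict; only these two fields come from this page):

```python
# Hedged sketch: Ulysses attention parallel + VAE parallel, combined.
config = {
    "parallel_attn_type": "ulysses",  # or "ring"
    "parallel_vae": True,             # recommended in multi-GPU environments
}
```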
## Usage
The config files for parallel inference are available [here](https://github.com/ModelTC/lightx2v/tree/main/configs/dist_infer).
By pointing `--config_json` at a specific config file, you can test parallel inference.
Some ready-to-run scripts are available [here](https://github.com/ModelTC/lightx2v/tree/main/scripts/dist_infer).
# Model Quantization
LightX2V supports quantized inference for linear layers in `DiT`, enabling `w8a8-int8`, `w8a8-fp8`, `w8a8-fp8block`, `w8a8-mxfp8`, and `w4a4-nvfp4` matrix multiplication.
## Generating Quantized Models
### Automatic Quantization
LightX2V supports automatic weight quantization during inference. Refer to the [configuration file](https://github.com/ModelTC/lightx2v/tree/main/configs/quantization/wan_i2v_quant_auto.json).
**Key configuration**:
Set `"mm_config": {"mm_type": "W-int8-channel-sym-A-int8-channel-sym-dynamic-Vllm", "weight_auto_quant": true}`.
- `mm_type`: Specifies the quantized operator
- `weight_auto_quant: true`: Enables automatic model quantization
### Offline Quantization
LightX2V also supports directly loading pre-quantized weights. Use LightX2V's convert tool to convert models into quantized models offline; refer to the [documentation](https://github.com/ModelTC/lightx2v/tree/main/tools/convert/readme.md).
Then configure the [quantization file](https://github.com/ModelTC/lightx2v/tree/main/configs/quantization/wan_i2v_quant_offline.json):
1. Set `dit_quantized_ckpt` to the path of the converted quantized weights
2. Set `weight_auto_quant` to `false` in `mm_config`
By pointing `--config_json` at this config file, you can load the quantized model for inference.
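Putting the two offline settings together, a hedged sketch of the relevant config fragment (field names come from this page; the path and surrounding structure are illustrative):

```python
# Hedged sketch of the offline-quantization fields (illustrative path).
config = {
    "dit_quantized_ckpt": "/path/to/converted_quant_weights",
    "mm_config": {
        "mm_type": "W-int8-channel-sym-A-int8-channel-sym-dynamic-Vllm",
        "weight_auto_quant": False,  # weights were already quantized offline
    },
}
```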
## Quantized Inference
### Automatic Quantization
```shell
bash scripts/run_wan_i2v_quant_auto.sh
```
### Offline Quantization
```shell
bash scripts/run_wan_i2v_quant_offline.sh
```
## Launching Quantization Service
After offline quantization, point `--config_json` to the offline quantization JSON file.
Example modification in `scripts/start_server.sh`:
```shell
export RUNNING_FLAG=infer
python -m lightx2v.api_server \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/quantization/wan_i2v_quant_offline.json \
--port 8000
```
[Here](https://github.com/ModelTC/lightx2v/tree/main/scripts/quantization) are some running scripts for use.
## Advanced Quantization Features
For details, refer to the documentation of the quantization tool [LLMC](https://github.com/ModelTC/llmc/blob/main/docs/en/source/backend/lightx2v.md).
# Step Distillation
Step distillation is an important optimization technique in LightX2V. By training distilled models, it significantly reduces inference steps from the original 40-50 steps to **4 steps**, dramatically improving inference speed while maintaining video quality. LightX2V implements step distillation along with CFG distillation to further enhance inference speed.
## 🔍 Technical Principle
Step distillation is implemented through [Self-Forcing](https://github.com/guandeh17/Self-Forcing) technology. Self-Forcing performs step distillation and CFG distillation on 1.3B autoregressive models. LightX2V extends it with a series of enhancements:
1. **Larger Models**: Supports step distillation training for 14B models.
2. **More Model Types**: Supports step distillation training for standard bidirectional models and I2V models.
For detailed implementation, refer to [Self-Forcing-Plus](https://github.com/GoatWu/Self-Forcing-Plus).
## 🎯 Technical Features
- **Inference Acceleration**: Reduces inference steps from 40-50 to 4 with CFG disabled, achieving approximately **20-24x** speedup (CFG doubles the forward passes per step, so 40-50 guided steps cost 80-100 passes versus 4)
- **Quality Preservation**: Maintains original video generation quality through distillation techniques
- **Strong Compatibility**: Supports both T2V and I2V tasks
- **Flexible Usage**: Supports loading complete step distillation models or loading step distillation LoRA on top of native models
## 🛠️ Configuration Files
### Basic Configuration Files
Multiple configuration options are provided in the [configs/distill/](https://github.com/ModelTC/lightx2v/tree/main/configs/distill) directory:
| Configuration File | Purpose | Model Address |
|-------------------|---------|---------------|
| [wan_t2v_distill_4step_cfg.json](https://github.com/ModelTC/lightx2v/blob/main/configs/distill/wan_t2v_distill_4step_cfg.json) | Load T2V 4-step distillation complete model | TODO |
| [wan_i2v_distill_4step_cfg.json](https://github.com/ModelTC/lightx2v/blob/main/configs/distill/wan_i2v_distill_4step_cfg.json) | Load I2V 4-step distillation complete model | TODO |
| [wan_t2v_distill_4step_cfg_lora.json](https://github.com/ModelTC/lightx2v/blob/main/configs/distill/wan_t2v_distill_4step_cfg_lora.json) | Load Wan-T2V model and step distillation LoRA | TODO |
| [wan_i2v_distill_4step_cfg_lora.json](https://github.com/ModelTC/lightx2v/blob/main/configs/distill/wan_i2v_distill_4step_cfg_lora.json) | Load Wan-I2V model and step distillation LoRA | TODO |
### Key Configuration Parameters
```json
{
"infer_steps": 4, // Inference steps
"denoising_step_list": [999, 750, 500, 250], // Denoising timestep list
"enable_cfg": false, // Disable CFG for speed improvement
"lora_configs": [ // LoRA weights path (optional)
{
"path": "path/to/distill_lora.safetensors",
"strength": 1.0
}
]
}
```
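To make the timestep list concrete, here is a hedged sketch of how a 4-step distilled sampler might walk `denoising_step_list` (the loop structure and `denoise_step` are illustrative placeholders, not LightX2V internals):

```python
import torch

def distill_sample(model, latents, denoising_step_list=(999, 750, 500, 250)):
    # Four denoising steps and no CFG: one forward pass per step, versus
    # two passes per step for 40-50 CFG-guided steps (~20-25x fewer passes).
    for t in denoising_step_list:
        timestep = torch.tensor([t], device=latents.device)
        latents = model.denoise_step(latents, timestep)  # hypothetical API
    return latents
```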
## 📜 Usage
### Model Preparation
**Complete Model:**
Place the downloaded model (`distill_model.pt` or `distill_model.safetensors`) in the `distill_models/` folder under the Wan model root directory:
- For T2V: `Wan2.1-T2V-14B/distill_models/`
- For I2V-480P: `Wan2.1-I2V-14B-480P/distill_models/`
**LoRA:**
1. Place the downloaded LoRA in any location
2. Modify the LoRA path in the configuration file (the `path` field under `lora_configs`) to point to where the LoRA is stored
### Inference Scripts
**T2V Complete Model:**
```bash
bash scripts/wan/run_wan_t2v_distill_4step_cfg.sh
```
**I2V Complete Model:**
```bash
bash scripts/wan/run_wan_i2v_distill_4step_cfg.sh
```
### Step Distillation LoRA Inference Scripts
**T2V LoRA:**
```bash
bash scripts/wan/run_wan_t2v_distill_4step_cfg_lora.sh
```
**I2V LoRA:**
```bash
bash scripts/wan/run_wan_i2v_distill_4step_cfg_lora.sh
```
## 🔧 Service Deployment
### Start Distillation Model Service
Modify the startup command in [scripts/server/start_server.sh](https://github.com/ModelTC/lightx2v/blob/main/scripts/server/start_server.sh):
```bash
python -m lightx2v.api_server \
--model_cls wan2.1_distill \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/distill/wan_t2v_distill_4step_cfg.json \
--port 8000 \
--nproc_per_node 1
```
Run the service startup script:
```bash
scripts/server/start_server.sh
```
For more details, see [Service Deployment](https://lightx2v-en.readthedocs.io/en/latest/deploy_guides/deploy_service.html).
### Usage in Gradio Interface
See [Gradio Documentation](https://lightx2v-en.readthedocs.io/en/latest/deploy_guides/deploy_gradio.html)
version: 2

# Set the version of Python and other tools you might need
build:
  os: ubuntu-20.04
  tools:
    python: "3.10"

formats:
  - epub

sphinx:
  configuration: docs/PAPERS_ZH_CN/source/conf.py

python:
  install:
    - requirements: requirements-docs.txt
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import logging
import os
import sys
from typing import List
from sphinx.ext import autodoc
import sphinxcontrib.redoc
logger = logging.getLogger(__name__)
sys.path.append(os.path.abspath("../.."))
# -- Project information -----------------------------------------------------
project = "Lightx2v"
copyright = "2025, Lightx2v Team"
author = "the Lightx2v Team"
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx.ext.napoleon",
    "sphinx.ext.viewcode",
    "sphinx.ext.intersphinx",
    "sphinx_copybutton",
    "sphinx.ext.autodoc",
    "sphinx.ext.autosummary",
    "myst_parser",
    "sphinxarg.ext",
    "sphinxcontrib.redoc",
    "sphinxcontrib.openapi",
]
html_static_path = ["_static"]
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns: List[str] = ["**/*.template.rst"]
# Exclude the prompt "$" when copying code
copybutton_prompt_text = r"\$ "
copybutton_prompt_is_regexp = True
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_title = project
html_theme = "sphinx_book_theme"
# html_theme = 'sphinx_rtd_theme'
html_logo = "../../../assets/img_lightx2v.png"
html_theme_options = {
    "path_to_docs": "docs/ZH_CN/source",
    "repository_url": "https://github.com/ModelTC/lightx2v",
    "use_repository_button": True,
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = ['_static']
# Generate additional rst documentation here.
def setup(app):
    # from docs.source.generate_examples import generate_examples
    # generate_examples()
    pass


# Mock out external dependencies here.
autodoc_mock_imports = [
    "cpuinfo",
    "torch",
    "transformers",
    "psutil",
    "prometheus_client",
    "sentencepiece",
    "lightllm",
    "numpy",
    "tqdm",
    "tensorizer",
]

for mock_target in autodoc_mock_imports:
    if mock_target in sys.modules:
        logger.info(
            "Potentially problematic mock target (%s) found; "
            "autodoc_mock_imports cannot mock modules that have already "
            "been loaded into sys.modules when the sphinx build starts.",
            mock_target,
        )


class MockedClassDocumenter(autodoc.ClassDocumenter):
    """Remove note about base class when a class is derived from object."""

    def add_line(self, line: str, source: str, *lineno: int) -> None:
        if line == "   Bases: :py:class:`object`":
            return
        super().add_line(line, source, *lineno)


autodoc.ClassDocumenter = MockedClassDocumenter
navigation_with_keys = False
Welcome to the Lightx2v Paper Collection!
=========================================

.. figure:: ../../../assets/img_lightx2v.png
   :width: 80%
   :align: center
   :alt: Lightx2v
   :class: no-scaled-link

.. raw:: html

   <div align="center" style="font-family: charter;">
   <a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License"></a>
   <a href="https://deepwiki.com/ModelTC/lightx2v"><img src="https://deepwiki.com/badge.svg" alt="Ask DeepWiki"></a>
   <a href="https://lightx2v-en.readthedocs.io/en/latest"><img src="https://img.shields.io/badge/docs-English-99cc2" alt="Doc"></a>
   <a href="https://lightx2v-zhcn.readthedocs.io/zh-cn/latest"><img src="https://img.shields.io/badge/文档-中文-99cc2" alt="Doc"></a>
   <a href="https://lightx2v-papers-zhcn.readthedocs.io/zh-cn/latest"><img src="https://img.shields.io/badge/论文集-中文-99cc2" alt="Papers"></a>
   <a href="https://hub.docker.com/r/lightx2v/lightx2v/tags"><img src="https://badgen.net/badge/icon/docker?icon=docker&label" alt="Docker"></a>
   </div>

   <div align="center" style="font-family: charter;">
   <strong>LightX2V: A Lightweight Video Generation Inference Framework</strong>
   </div>

LightX2V is a lightweight video generation inference framework. This is a curated collection of papers on accelerating video generation inference, maintained to help you quickly get up to speed with both classic methods and the latest advances in the field.

GitHub: https://github.com/ModelTC/lightx2v

HuggingFace: https://huggingface.co/lightx2v

Paper Collection
----------------

.. toctree::
   :maxdepth: 1
   :caption: Paper Categories

   Image and Video Generation Basics <papers/generation_basics.md>
   Model Quantization <papers/quantization.md>
   Feature Caching <papers/cache.md>
   Attention Mechanisms <papers/attention.md>
   Parameter Offloading <papers/offload.md>
   Parallel Inference <papers/parallel.md>
   Changing Resolution Inference <papers/changing_resolution.md>
   Step Distillation <papers/step_distill.md>
   Autoregressive Distillation <papers/autoregressive_distill.md>