Commit 0862ec5b authored by helloyongyang

update doc, scripts, configs

parent 834b09c4
@@ -12,6 +12,6 @@
  "enable_cfg": true,
  "cpu_offload": false,
  "changing_resolution": true,
- "resolution_rate": [1.0, 0.75],
- "changing_resolution_steps": [5, 25]
+ "resolution_rate": [0.75],
+ "changing_resolution_steps": [20]
  }
{
"infer_steps": 40,
"target_video_length": 81,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"seed": 442,
"sample_guide_scale": 5,
"sample_shift": 3,
"enable_cfg": true,
"cpu_offload": false,
"changing_resolution": true,
"resolution_rate": [1.0, 0.75],
"changing_resolution_steps": [5, 25]
}
@@ -13,6 +13,6 @@
  "enable_cfg": true,
  "cpu_offload": false,
  "changing_resolution": true,
- "resolution_rate": [1.0, 0.75],
- "changing_resolution_steps": [10, 35]
+ "resolution_rate": [0.75],
+ "changing_resolution_steps": [25]
  }
{
"infer_steps": 50,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"seed": 42,
"sample_guide_scale": 6,
"sample_shift": 8,
"enable_cfg": true,
"cpu_offload": false,
"changing_resolution": true,
"resolution_rate": [1.0, 0.75],
"changing_resolution_steps": [10, 35]
}
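Every config shown in this commit pairs each entry of `resolution_rate` with one entry of `changing_resolution_steps`. The following is a minimal sketch of a sanity check for those keys before launching a run. The helper name `check_changing_resolution` is hypothetical, and the constraints it checks (equal list lengths, rates in (0, 1], strictly increasing switch steps that fall inside the run) are assumptions inferred from the examples here, not rules confirmed by LightX2V.

```python
def check_changing_resolution(config):
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper: the constraints below are assumptions inferred from
    the example configs, not validation performed by LightX2V itself.
    """
    problems = []
    if not config.get("changing_resolution", False):
        return problems  # feature disabled, nothing to check

    rates = config.get("resolution_rate", [])
    steps = config.get("changing_resolution_steps", [])
    infer_steps = config.get("infer_steps")

    # Assumption: one switch step per resolution rate.
    if len(rates) != len(steps):
        problems.append("resolution_rate and changing_resolution_steps differ in length")
    # Assumption: a rate is a fraction of the original resolution.
    if any(not (0 < r <= 1.0) for r in rates):
        problems.append("every resolution_rate entry should be in (0, 1]")
    # Assumption: switch steps are listed in ascending order.
    if any(b <= a for a, b in zip(steps, steps[1:])):
        problems.append("changing_resolution_steps should be strictly increasing")
    # Assumption: every switch happens strictly inside the run.
    if infer_steps is not None and any(not (0 < s < infer_steps) for s in steps):
        problems.append("every switch step should fall inside (0, infer_steps)")
    return problems


# The U-shaped values from the 40-step config above.
u_shaped = {
    "infer_steps": 40,
    "changing_resolution": True,
    "resolution_rate": [1.0, 0.75],
    "changing_resolution_steps": [5, 25],
}
print(check_changing_resolution(u_shaped))  # prints []
```

Returning a list of messages instead of raising on the first failure lets a caller report every inconsistency in a hand-edited config at once.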
- # Changing Resolution Inference
+ # Variable Resolution Inference
  ## Overview
- Changing resolution inference is a technical strategy for optimizing the denoising process. It improves computational efficiency while maintaining generation quality by using different resolutions at different denoising stages. The core idea is to use lower resolution for rough denoising in the early stages of the denoising process, then switch to normal resolution for fine-tuning in the later stages.
+ Variable resolution inference is a technical strategy for optimizing the denoising process. It improves computational efficiency while maintaining generation quality by using different resolutions at different stages of the denoising process. The core idea of this method is to use lower resolution for coarse denoising in the early stages and switch to normal resolution for fine processing in the later stages.
  ## Technical Principles
- ### Phased Denoising Strategy
- Changing resolution inference is based on the following observations:
- - **Early-stage denoising**: Mainly processes rough noise and overall structure, doesn't require excessive detail information
- - **Late-stage denoising**: Focuses on detail optimization and high-frequency information recovery, requires complete resolution information
+ ### Multi-stage Denoising Strategy
+ Variable resolution inference is based on the following observations:
+ - **Early-stage denoising**: Mainly handles coarse noise and overall structure, requiring less detailed information
+ - **Late-stage denoising**: Focuses on detail optimization and high-frequency information recovery, requiring complete resolution information
  ### Resolution Switching Mechanism
- 1. **Low Resolution Stage** (Early stage)
- - Downsample the input to lower resolution (e.g., 0.75 of original size)
+ 1. **Low-resolution stage** (early stage)
+ - Downsample the input to a lower resolution (e.g., 0.75x of original size)
  - Execute initial denoising steps
  - Quickly remove most noise and establish basic structure
- 2. **Normal Resolution Stage** (Late stage)
+ 2. **Normal resolution stage** (late stage)
  - Upsample the denoising result from the first step back to original resolution
  - Continue executing remaining denoising steps
- - Recover detail information and complete fine-tuning
+ - Restore detailed information and complete fine processing
+ ### U-shaped Resolution Strategy
+ If resolution is reduced at the very beginning of the denoising steps, it may cause significant differences between the final generated video and the video generated through normal inference. Therefore, a U-shaped resolution strategy can be adopted, where the original resolution is maintained for the first few steps, then resolution is reduced for inference.
  ## Usage
- The config files for changing resolution inference are available [here](https://github.com/ModelTC/LightX2V/tree/main/configs/changing_resolution)
- By specifying --config_json to the specific config file, you can test changing resolution inference.
- You can refer to [this script](https://github.com/ModelTC/LightX2V/blob/main/scripts/wan/run_wan_t2v_changing_resolution.sh).
+ The config files for variable resolution inference are located [here](https://github.com/ModelTC/LightX2V/tree/main/configs/changing_resolution)
+ You can test variable resolution inference by specifying --config_json to the specific config file.
+ You can refer to the scripts [here](https://github.com/ModelTC/LightX2V/blob/main/scripts/changing_resolution) to run.
+ ### Example 1:
+ ```
+ {
+ "infer_steps": 50,
+ "changing_resolution": true,
+ "resolution_rate": [0.75],
+ "changing_resolution_steps": [25]
+ }
+ ```
+ This means a total of 50 steps, with resolution at 0.75x original resolution from step 0 to 25, and original resolution from step 26 to the final step.
+ ### Example 2:
+ ```
+ {
+ "infer_steps": 50,
+ "changing_resolution": true,
+ "resolution_rate": [1.0, 0.75],
+ "changing_resolution_steps": [10, 35]
+ }
+ ```
+ This means a total of 50 steps, with original resolution from step 0 to 10, 0.75x original resolution from step 11 to 35, and original resolution from step 36 to the final step.
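The step ranges described in Example 1 and Example 2 can be sketched as a small lookup. This is a minimal illustration that follows the wording of the documentation literally: steps are 0-based, each switch step still belongs to the stage it closes, and every step after the last switch runs at the original resolution (rate 1.0). The helpers `rate_for_step` and `segments` are hypothetical names, not LightX2V functions, and the real scheduler may treat the boundary steps differently.

```python
from bisect import bisect_left


def rate_for_step(step, resolution_rate, changing_resolution_steps):
    """Resolution rate for a 0-based denoising step (hypothetical helper).

    Assumption: a switch step is inclusive, so with switch step 25 the
    steps 0..25 use the first rate, as the documentation text states.
    """
    if len(resolution_rate) != len(changing_resolution_steps):
        raise ValueError("expected one switch step per resolution rate")
    # Index of the first switch step that is >= step.
    stage = bisect_left(changing_resolution_steps, step)
    if stage < len(resolution_rate):
        return resolution_rate[stage]
    return 1.0  # past the last switch step: original resolution


def segments(infer_steps, resolution_rate, changing_resolution_steps):
    """Collapse per-step rates into (first_step, last_step, rate) runs."""
    runs = []
    for step in range(infer_steps):
        rate = rate_for_step(step, resolution_rate, changing_resolution_steps)
        if runs and runs[-1][2] == rate:
            runs[-1] = (runs[-1][0], step, rate)  # extend the current run
        else:
            runs.append((step, step, rate))  # start a new run
    return runs


print(segments(50, [0.75], [25]))
# [(0, 25, 0.75), (26, 49, 1.0)]
print(segments(50, [1.0, 0.75], [10, 35]))
# [(0, 10, 1.0), (11, 35, 0.75), (36, 49, 1.0)]
```

The second call reproduces the U-shaped pattern: original resolution first, a reduced-resolution middle section, then original resolution again for the remaining steps.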
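@@ -9,6 +9,7 @@
  ### Multi-stage Denoising Strategy
  Variable resolution inference is based on the following observations:
  - **Early-stage denoising**: Mainly handles coarse noise and overall structure, and does not need much detail information
  - **Late-stage denoising**: Focuses on detail optimization and high-frequency information recovery, and needs full-resolution information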
@@ -25,10 +26,39 @@
  - Restore detail information and complete fine processing
+ ### U-shaped Resolution Strategy
+ If the resolution is reduced in the very first denoising steps, the final video may differ noticeably from the one produced by normal inference. A U-shaped resolution strategy can therefore be used: keep the original resolution for the first few steps, then reduce the resolution for inference.
  ## Usage
  The config files for variable resolution inference are [here](https://github.com/ModelTC/LightX2V/tree/main/configs/changing_resolution)
  You can test variable resolution inference by pointing --config_json at the specific config file.
- You can refer to [this script](https://github.com/ModelTC/LightX2V/blob/main/scripts/wan/run_wan_t2v_changing_resolution.sh)
+ You can run it with the scripts [here](https://github.com/ModelTC/LightX2V/blob/main/scripts/changing_resolution).
+ ### Example 1:
+ ```
+ {
+ "infer_steps": 50,
+ "changing_resolution": true,
+ "resolution_rate": [0.75],
+ "changing_resolution_steps": [25]
+ }
+ ```
+ This means 50 steps in total: steps 0 to 25 run at 0.75x the original resolution, and steps 26 to the final step run at the original resolution.
+ ### Example 2:
+ ```
+ {
+ "infer_steps": 50,
+ "changing_resolution": true,
+ "resolution_rate": [1.0, 0.75],
+ "changing_resolution_steps": [10, 35]
+ }
+ ```
+ This means 50 steps in total: steps 0 to 10 run at the original resolution, steps 11 to 35 run at 0.75x the original resolution, and steps 36 to the final step run at the original resolution.
@@ -33,7 +33,7 @@ python -m lightx2v.infer \
  --model_cls wan2.1 \
  --task i2v \
  --model_path $model_path \
- --config_json ${lightx2v_path}/configs/changing_resolution/wan_i2v.json \
+ --config_json ${lightx2v_path}/configs/changing_resolution/wan_i2v_U.json \
  --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside." \
  --negative_prompt "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" \
  --image_path ${lightx2v_path}/assets/inputs/imgs/img_0.jpg \
...
@@ -32,7 +32,7 @@ python -m lightx2v.infer \
  --model_cls wan2.1 \
  --task t2v \
  --model_path $model_path \
- --config_json ${lightx2v_path}/configs/changing_resolution/wan_t2v.json \
+ --config_json ${lightx2v_path}/configs/changing_resolution/wan_t2v_U.json \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  --negative_prompt "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" \
  --save_video_path ${lightx2v_path}/save_results/output_lightx2v_wan_t2v_changing_resolution.mp4