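# Trying T2V and I2V with Wan2.1-14B

This document provides usage examples for the Wan2.1-T2V-14B, Wan2.1-I2V-14B-480P, and Wan2.1-I2V-14B-720P models.

## Prepare the environment

See [01.PrepareEnv](01.PrepareEnv.md).

## Getting started

Prepare the models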
```
# Download from Hugging Face
hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan-AI/Wan2.1-T2V-14B
hf download Wan-AI/Wan2.1-I2V-14B-480P --local-dir Wan-AI/Wan2.1-I2V-14B-480P
hf download Wan-AI/Wan2.1-I2V-14B-720P --local-dir Wan-AI/Wan2.1-I2V-14B-720P

# Download the distilled models
hf download lightx2v/Wan2.1-Distill-Models --local-dir lightx2v/Wan2.1-Distill-Models
hf download lightx2v/Wan2.1-Distill-Loras --local-dir lightx2v/Wan2.1-Distill-Loras
```
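
The same downloads can also be scripted from Python. The sketch below is one possible way to do it, assuming the `huggingface_hub` package is installed; `REPOS`, `download_plan`, and `download_all` are illustrative names, not part of LightX2V.

```python
from pathlib import Path

# Repositories named in the commands above; each is mirrored to a
# local directory with the same relative path.
REPOS = [
    "Wan-AI/Wan2.1-T2V-14B",
    "Wan-AI/Wan2.1-I2V-14B-480P",
    "Wan-AI/Wan2.1-I2V-14B-720P",
    "lightx2v/Wan2.1-Distill-Models",
    "lightx2v/Wan2.1-Distill-Loras",
]


def download_plan(root="."):
    """Return (repo_id, local_dir) pairs rooted at `root`."""
    return [(repo, str(Path(root) / repo)) for repo in REPOS]


def download_all(root="."):
    """Download every repo; needs network access and huggingface_hub."""
    from huggingface_hub import snapshot_download  # imported lazily

    for repo_id, local_dir in download_plan(root):
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_all("/home/user/models")` would place the files under that directory, which then serves as the prefix for `model_path` and the distilled-model paths.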
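There are three ways to run the Wan2.1-14B models to generate video:

1. Run a script: preset bash scripts that can be run directly, convenient for quick verification

    1.1 Single-GPU inference

    1.2 Single-GPU offload inference

    1.3 Multi-GPU parallel inference

2. Start a service: start the server first, then send requests; suited to repeated inference and production deployment

    2.1 Single-GPU inference

    2.2 Single-GPU offload inference

    2.3 Multi-GPU parallel inference

3. Python code: run from Python, convenient for integrating into an existing codebase

    3.1 Single-GPU inference

    3.2 Single-GPU offload inference

    3.3 Multi-GPU parallel inference


### 1. Generate by running a script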

```
git clone https://github.com/ModelTC/LightX2V.git

# Before running the scripts below, replace lightx2v_path and model_path in each script with the actual paths
# e.g. lightx2v_path=/home/user/LightX2V
# e.g. model_path=/home/user/models/Wan-AI/Wan2.1-T2V-14B
```

#### 1.1 Single-GPU inference

Wan2.1-T2V-14B model
```
# model_path=Wan-AI/Wan2.1-T2V-14B
cd LightX2V/scripts/wan
bash run_wan_t2v.sh

# Step-distilled model, LoRA
# model_path=Wan-AI/Wan2.1-T2V-14B
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_lora_4step_cfg.sh

# Step-distilled model, merged LoRA
# model_path=Wan-AI/Wan2.1-T2V-14B
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_model_4step_cfg.sh

# Step-distilled + FP8-quantized model
# model_path=Wan-AI/Wan2.1-T2V-14B
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_fp8_4step_cfg.sh
```
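Note: `model_path` in the bash scripts is the path to the original pre-trained model. `lora_configs`, `dit_original_ckpt`, and `dit_quantized_ckpt` in the config file are the paths to the distilled model being used and must be changed to absolute paths, e.g. /home/user/models/lightx2v/Wan2.1-Distill-Models/wan2.1_i2v_480p_int8_lightx2v_4step.safetensors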
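
Editing these paths by hand is easy to get wrong, so a small helper can make them absolute. This is a minimal sketch that operates on a config already loaded as a dict; `absolutize_ckpt_paths` and `models_root` are hypothetical names, and only the three keys mentioned in the note are touched.

```python
import os

# Config keys that the note above says must hold absolute paths.
CKPT_KEYS = ("dit_original_ckpt", "dit_quantized_ckpt")


def absolutize_ckpt_paths(config, models_root):
    """Return a copy of `config` with relative checkpoint paths made absolute.

    `models_root` is a hypothetical directory that contains the
    `lightx2v/...` download folders.
    """
    def fix(path):
        return path if os.path.isabs(path) else os.path.join(models_root, path)

    out = dict(config)
    for key in CKPT_KEYS:
        if key in out:
            out[key] = fix(out[key])
    if "lora_configs" in out:
        out["lora_configs"] = [
            {**entry, "path": fix(entry["path"])} for entry in out["lora_configs"]
        ]
    return out
```

Paths that are already absolute are left unchanged, and the input dict is not modified.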

On a single H100, the run time and the peak GPU memory observed with `watch -n 1 nvidia-smi` are as follows:
1. Wan2.1-T2V-14B model: Total Cost cost 278.902019 seconds; 43768MiB
2. Step-distilled model, LoRA: Total Cost cost 31.365923 seconds; 44438MiB
3. Step-distilled model, merged LoRA: Total Cost cost 25.794410 seconds; 44418MiB
4. Step-distilled + FP8-quantized model: Total Cost cost 22.000187 seconds; 31032MiB
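
To compare the variants at a glance, the timings can be turned into speedup factors relative to the original model. The sketch below only does arithmetic on the four numbers listed above; the dictionary keys are shortened labels chosen for illustration.

```python
# Timings (seconds) copied from the single-H100 T2V list above.
timings = {
    "baseline": 278.902019,
    "distill LoRA": 31.365923,
    "distill merged LoRA": 25.794410,
    "distill + FP8": 22.000187,
}


def speedups(times, reference="baseline"):
    """Map each variant to reference_time / variant_time."""
    ref = times[reference]
    return {name: ref / t for name, t in times.items()}


for name, factor in speedups(timings).items():
    print(f"{name}: {factor:.1f}x")
```

On these numbers the 4-step distilled variants come out roughly 9x to 13x faster than the 50-step baseline.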

Wan2.1-I2V-14B models
```
# Switch model_path and config_json to try Wan2.1-I2V-14B-480P or Wan2.1-I2V-14B-720P
cd LightX2V/scripts/wan
bash run_wan_i2v.sh

# Step-distilled model, LoRA
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_lora_4step_cfg.sh

# Step-distilled model, merged LoRA
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_model_4step_cfg.sh

# Step-distilled + FP8-quantized model
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_fp8_4step_cfg.sh
```
On a single H100, the run time and observed peak GPU memory are as follows:
1. Wan2.1-I2V-14B-480P model: Total Cost cost 232.971375 seconds; 49872MiB
2. Step-distilled model, LoRA: Total Cost cost 277.535991 seconds; 49782MiB
3. Step-distilled model, merged LoRA: Total Cost cost 26.841140 seconds; 49526MiB
4. Step-distilled + FP8-quantized model: Total Cost cost 25.430433 seconds; 34218MiB


#### 1.2 Single-GPU offload inference

Modify `cpu_offload` in the config file as follows to enable offload
```
    "cpu_offload": true,
    "offload_granularity": "model"
```
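
If several config files need the same change, it can be applied programmatically. The following is a small sketch using only the standard library; `enable_cpu_offload` is a hypothetical helper, and it sets exactly the two keys shown above.

```python
import json


def enable_cpu_offload(config_path, granularity="model"):
    """Set the two offload keys shown above in a JSON config file, in place."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    config["cpu_offload"] = True
    config["offload_granularity"] = granularity
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)
        f.write("\n")
    return config
```

All other keys in the file are preserved; only the formatting is normalized to 4-space indentation.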

Wan2.1-T2V-14B model
```
cd LightX2V/scripts/wan
bash run_wan_t2v.sh

# Step-distilled model, LoRA
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_lora_4step_cfg.sh

# Step-distilled model, merged LoRA
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_model_4step_cfg.sh

# Step-distilled + FP8-quantized model
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_fp8_4step_cfg.sh
```
On a single H100, the run time and observed peak GPU memory are as follows:
1. Wan2.1-T2V-14B model: Total Cost cost 319.019743 seconds; 34932MiB
2. Step-distilled model, LoRA: Total Cost cost 74.180393 seconds; 34562MiB
3. Step-distilled model, merged LoRA: Total Cost cost 68.621963 seconds; 34562MiB
4. Step-distilled + FP8-quantized model: Total Cost cost 58.921504 seconds; 21290MiB

Wan2.1-I2V-14B models
```
# Switch model_path and config_json to try Wan2.1-I2V-14B-480P or Wan2.1-I2V-14B-720P
cd LightX2V/scripts/wan
bash run_wan_i2v.sh

# Step-distilled model, LoRA
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_lora_4step_cfg.sh

# Step-distilled model, merged LoRA
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_model_4step_cfg.sh

# Step-distilled + FP8-quantized model
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_fp8_4step_cfg.sh
```
On a single H100, the run time and observed peak GPU memory are as follows:
1. Wan2.1-I2V-14B-480P model: Total Cost cost 276.509557 seconds; 38906MiB
2. Step-distilled model, LoRA: Total Cost cost 85.217124 seconds; 38556MiB
3. Step-distilled model, merged LoRA: Total Cost cost 79.389818 seconds; 38556MiB
4. Step-distilled + FP8-quantized model: Total Cost cost 68.124415 seconds; 23400MiB

#### 1.3 Multi-GPU parallel inference

Wan2.1-T2V-14B model
```
# Before running, set CUDA_VISIBLE_DEVICES to the GPUs you actually use
# Also update the parallel parameters in the config file so that cfg_p_size * seq_p_size = number of GPUs
cd LightX2V/scripts/dist_infer
bash run_wan_t2v_dist_cfg_ulysses.sh

# Step-distilled model, LoRA
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_lora_4step_cfg_ulysses.sh

# Step-distilled model, merged LoRA
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_model_4step_cfg_ulysses.sh

# Step-distilled + FP8-quantized model
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_fp8_4step_cfg_ulysses.sh
```
On 8 H100s, the run time and observed peak GPU memory per card are as follows:
1. Wan2.1-T2V-14B model: Total Cost cost 131.553567 seconds; 44624MiB
2. Step-distilled model, LoRA: Total Cost cost 38.337339 seconds; 43850MiB
3. Step-distilled model, merged LoRA: Total Cost cost 29.021527 seconds; 43470MiB
4. Step-distilled + FP8-quantized model: Total Cost cost 26.409164 seconds; 30162MiB

Wan2.1-I2V-14B models
```
# Switch model_path and config_json to try Wan2.1-I2V-14B-480P or Wan2.1-I2V-14B-720P
cd LightX2V/scripts/dist_infer
bash run_wan_i2v_dist_cfg_ulysses.sh

# Step-distilled model, LoRA
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_lora_4step_cfg_ulysses.sh

# Step-distilled model, merged LoRA
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_model_4step_cfg_ulysses.sh

# Step-distilled + FP8-quantized model
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_fp8_4step_cfg_ulysses.sh
```
On 8 H100s, the run time and observed peak GPU memory per card are as follows:
1. Wan2.1-I2V-14B-480P model: Total Cost cost 116.455286 seconds; 49668MiB
2. Step-distilled model, LoRA: Total Cost cost 45.899316 seconds; 48854MiB
3. Step-distilled model, merged LoRA: Total Cost cost 33.472992 seconds; 48674MiB
4. Step-distilled + FP8-quantized model: Total Cost cost 30.796211 seconds; 33328MiB

Details

The contents of run_wan_t2v_dist_cfg_ulysses.sh are as follows:
```
#!/bin/bash

# set path firstly
lightx2v_path=
model_path=

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# set environment variables
source ${lightx2v_path}/scripts/base/base.sh

torchrun --nproc_per_node=8 -m lightx2v.infer \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/dist_infer/wan_t2v_dist_cfg_ulysses.json \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--negative_prompt "镜头晃动，色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走" \
--save_result_path ${lightx2v_path}/save_results/output_lightx2v_wan_t2v.mp4

```
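`export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` selects GPUs 0 through 7, eight in total

`source ${lightx2v_path}/scripts/base/base.sh` sets some basic environment variables

`torchrun --nproc_per_node=8 -m lightx2v.infer` launches multi-GPU inference with torchrun, starting 8 processes, each bound to one GPU

`--model_cls wan2.1` selects the wan2.1 model

`--task t2v` selects the t2v task; use i2v when running the Wan2.1-I2V-14B models

`--model_path` is the path to the model

`--config_json` is the path to the config file

`--prompt` is the prompt

`--negative_prompt` is the negative prompt

`--save_result_path` is the path where the result is saved

Because each model has its own characteristics, the `config_json` file holds more detailed, model-specific parameters, and its contents differ from model to model

The contents of wan_t2v_dist_cfg_ulysses.json are as follows: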
```
{
    "infer_steps": 50,
    "target_video_length": 81,
    "text_len": 512,
    "target_height": 480,
    "target_width": 832,
    "self_attn_1_type": "flash_attn3",
    "cross_attn_1_type": "flash_attn3",
    "cross_attn_2_type": "flash_attn3",
    "sample_guide_scale": 6,
    "sample_shift": 8,
    "enable_cfg": true,
    "cpu_offload": false,
    "parallel": {
        "seq_p_size": 4,
        "seq_p_attn_type": "ulysses",
        "cfg_p_size": 2
    }
}

```
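`infer_steps` is the number of inference steps

`target_video_length` is the number of frames in the target video (Wan2.1 runs at fps=16, so target_video_length=81 gives a 5-second video, since (81 - 1) / 16 = 5)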
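
The frame-count arithmetic can be written out as a pair of helper functions. This is a small sketch under the assumption that the first frame sits at t = 0, so 81 frames span 80 frame intervals; the function names are illustrative.

```python
def video_duration_seconds(num_frames, fps=16):
    """Duration spanned by `num_frames`, counting intervals between frames."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return (num_frames - 1) / fps


def frames_for_duration(seconds, fps=16):
    """Inverse: frames needed to span `seconds`, including the first frame."""
    return round(seconds * fps) + 1


print(video_duration_seconds(81))  # 5.0
print(frames_for_duration(5))      # 81
```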

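`target_height` is the height of the target video

`target_width` is the width of the target video

`self_attn_1_type`, `cross_attn_1_type`, and `cross_attn_2_type` set the operator type for the three attention layers inside the Wan2.1 model. flash_attn3 is used here; it is limited to Hopper-architecture GPUs (H100, H20, etc.), and other GPUs can use flash_attn2 instead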
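
One way to apply this rule automatically is to choose the operator from the GPU's CUDA compute capability, since Hopper GPUs report capability 9.x. The helpers below are an illustrative sketch, not part of LightX2V; they take the capability as a plain tuple so they can be used without a GPU present.

```python
def pick_attn_type(compute_capability):
    """Choose an attention operator name from a (major, minor) capability.

    Hopper GPUs report compute capability 9.x; everything else falls back
    to flash_attn2, following the guidance above.
    """
    major, _minor = compute_capability
    return "flash_attn3" if major == 9 else "flash_attn2"


def apply_attn_type(config, attn_type):
    """Return a copy of a config dict with all three attention keys set."""
    keys = ("self_attn_1_type", "cross_attn_1_type", "cross_attn_2_type")
    return {**config, **{k: attn_type for k in keys}}
```

With PyTorch installed, the tuple can be obtained from `torch.cuda.get_device_capability()` and passed to `pick_attn_type`.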

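`enable_cfg` controls whether CFG (classifier-free guidance) is enabled. It is set to true here, so inference runs twice, first with the positive prompt and then with the negative prompt. This gives better results but increases inference time. For a model that has already been CFG-distilled, it can be set to false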
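
For intuition, the two predictions are typically blended with the textbook classifier-free guidance formula. The sketch below shows that formula on plain Python lists; it is an illustration only, LightX2V's internal implementation is not shown in this document and may differ, and treating `sample_guide_scale` as the `guide_scale` here is an assumption.

```python
def cfg_combine(cond, uncond, guide_scale):
    """Standard classifier-free guidance on two same-length predictions.

    cond:   prediction conditioned on the positive prompt
    uncond: prediction conditioned on the negative prompt
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + guide_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy 3-element "predictions"; real ones are large tensors.
print(cfg_combine([1.0, 2.0, 3.0], [0.5, 2.0, 4.0], guide_scale=6))
# [3.5, 2.0, -2.0]
```

Because both predictions are needed at every step, enabling CFG roughly doubles the DiT work, which is why CFG-distilled models can skip it.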

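`cpu_offload` controls whether CPU offload is enabled; enabling it reduces GPU memory usage. When it is enabled, also add `"offload_granularity": "model"`, which sets the offload granularity to whole model modules. You can then monitor GPU memory usage with `watch -n 1 nvidia-smi`.

`parallel` holds the parallelism settings. DiT supports two parallel attention mechanisms, Ulysses and Ring, and also supports CFG-parallel inference. Parallel inference significantly reduces inference time and eases the memory load on each GPU. CFG + Ulysses parallelism is used here, with seq_p_size * cfg_p_size = 4 * 2 = 8 matching the 8-GPU setup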
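
A mismatch between the `parallel` settings and the visible GPUs is easy to introduce when changing the GPU count, so it can be worth checking before launch. The sketch below is a hypothetical pre-flight check, not a LightX2V API; treating a missing size as 1 is an assumption.

```python
def visible_gpu_count(cuda_visible_devices):
    """Count device ids in a CUDA_VISIBLE_DEVICES-style string."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def check_parallel(config, cuda_visible_devices):
    """Raise if cfg_p_size * seq_p_size differs from the number of GPUs."""
    parallel = config.get("parallel", {})
    cfg_p = parallel.get("cfg_p_size", 1)  # assumed default when absent
    seq_p = parallel.get("seq_p_size", 1)  # assumed default when absent
    gpus = visible_gpu_count(cuda_visible_devices)
    if cfg_p * seq_p != gpus:
        raise ValueError(
            f"cfg_p_size ({cfg_p}) * seq_p_size ({seq_p}) = {cfg_p * seq_p}, "
            f"but {gpus} GPUs are visible"
        )
    return gpus
```

For the config above, `check_parallel(config, "0,1,2,3,4,5,6,7")` returns 8, while passing `"0,1,2,3"` raises a `ValueError`.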

The contents of wan_t2v_distill_lora_4step_cfg_ulysses.json are as follows:
```
{
    "infer_steps": 4,
    "target_video_length": 81,
    "text_len": 512,
    "target_height": 480,
    "target_width": 832,
    "self_attn_1_type": "flash_attn3",
    "cross_attn_1_type": "flash_attn3",
    "cross_attn_2_type": "flash_attn3",
    "sample_guide_scale": 6,
    "sample_shift": 5,
    "enable_cfg": false,
    "cpu_offload": false,
    "denoising_step_list": [1000, 750, 500, 250],
    "lora_configs": [
        {
            "path": "lightx2v/Wan2.1-Distill-Loras/wan2.1_t2v_14b_lora_rank64_lightx2v_4step.safetensors",
            "strength": 1.0
        }
    ],
    "parallel": {
        "seq_p_size": 4,
        "seq_p_attn_type": "ulysses",
        "cfg_p_size": 2
    }
}

```
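`infer_steps` is the number of inference steps; the distilled model used here has been step-distilled to 4 steps

`denoising_step_list` lists the timesteps used by the 4 denoising steps

`lora_configs` is the LoRA configuration; fill in the path to the distilled LoRA weights, which must be an absolute path

The contents of wan_t2v_distill_model_4step_cfg_ulysses.json are as follows: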
```
{
    "infer_steps": 4,
    "target_video_length": 81,
    "text_len": 512,
    "target_height": 480,
    "target_width": 832,
    "self_attn_1_type": "flash_attn3",
    "cross_attn_1_type": "flash_attn3",
    "cross_attn_2_type": "flash_attn3",
    "sample_guide_scale": 6,
    "sample_shift": 5,
    "enable_cfg": false,
    "cpu_offload": false,
    "denoising_step_list": [1000, 750, 500, 250],
    "dit_original_ckpt": "lightx2v/Wan2.1-Distill-Models/wan2.1_t2v_14b_lightx2v_4step.safetensors",
    "parallel": {
        "seq_p_size": 4,
        "seq_p_attn_type": "ulysses",
        "cfg_p_size": 2
    }
}

```
`dit_original_ckpt` is the path to the distilled model with the LoRA already merged in
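
To make the difference between the LoRA and merged-LoRA variants concrete, the sketch below shows the generic idea of folding a low-rank update into a weight matrix ahead of time. It is a toy illustration with plain lists, not LightX2V's merging code; any alpha/rank scaling used by the actual LoRA files is assumed to be folded into `strength`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, strength=1.0):
    """Return weight + strength * (lora_b @ lora_a).

    weight: out x in, lora_b: out x rank, lora_a: rank x in.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + strength * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, -1.0]]
print(merge_lora(W, B, A, strength=1.0))
# [[1.5, -1.0], [1.0, -1.0]]
```

Merging once ahead of time removes the extra low-rank work at inference, which is consistent with the merged-LoRA runs being somewhat faster than the LoRA runs in the measurements above.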

The contents of wan_t2v_distill_fp8_4step_cfg_ulysses.json are as follows:
```
{
    "infer_steps": 4,
    "target_video_length": 81,
    "text_len": 512,
    "target_height": 480,
    "target_width": 832,
    "self_attn_1_type": "flash_attn3",
    "cross_attn_1_type": "flash_attn3",
    "cross_attn_2_type": "flash_attn3",
    "sample_guide_scale": 6,
    "sample_shift": 5,
    "enable_cfg": false,
    "cpu_offload": false,
    "denoising_step_list": [1000, 750, 500, 250],
    "dit_quantized": true,
    "dit_quantized_ckpt": "lightx2v/Wan2.1-Distill-Models/wan2.1_t2v_14b_scaled_fp8_e4m3_lightx2v_4step.safetensors",
    "dit_quant_scheme": "fp8-sgl",
    "parallel": {
        "seq_p_size": 4,
        "seq_p_attn_type": "ulysses",
        "cfg_p_size": 2
    }
}

```
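`dit_quantized` controls whether DiT quantization is enabled; setting it to true quantizes the DiT module at the core of the model

`dit_quantized_ckpt` is the local path to the FP8-quantized DiT weight file

`dit_quant_scheme` is the DiT quantization scheme; "fp8-sgl" means inference uses sglang's FP8 kernels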
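
A rough back-of-the-envelope estimate shows why FP8 lowers peak memory. The sketch below assumes the unquantized DiT weights are stored in a 16-bit format (2 bytes per parameter), FP8 uses 1 byte per parameter, and the model has a nominal 14e9 parameters; it ignores quantization scales and activations. The observed figures are the single-H100 T2V peaks reported earlier in this document (44418MiB for merged LoRA, 31032MiB for FP8).

```python
MIB = 1024 ** 2


def weight_memory_mib(num_params, bytes_per_param):
    """Memory needed to hold `num_params` weights, in MiB."""
    return num_params * bytes_per_param / MIB


params = 14e9  # nominal "14B"; the exact DiT parameter count is an assumption
saving = weight_memory_mib(params, 2) - weight_memory_mib(params, 1)
observed = 44418 - 31032  # merged-LoRA minus FP8 peak memory, single-H100 T2V

print(f"estimated saving: {saving:.0f} MiB, observed: {observed} MiB")
```

Under these assumptions the estimate lands close to the measured difference, which suggests most of the saving comes from halving the size of the DiT weights.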

### 2. Generate by starting a service

#### 2.1 Single-GPU inference

Start the server
```
cd LightX2V/scripts/server

# Before running the script below, replace lightx2v_path, model_path, and config_json in the script with the actual paths
# e.g. lightx2v_path=/home/user/LightX2V
# e.g. model_path=/home/user/models/Wan-AI/Wan2.1-T2V-14B
# e.g. --config_json ${lightx2v_path}/configs/wan/wan_t2v.json

# Switch the model_path and config_json paths to try different models
bash start_server.sh
```
Send a request to the server

Open a second terminal to act as the client
```
cd LightX2V/scripts/server

# This generates a video; post.py uses url = "http://localhost:8000/v1/tasks/video/"
python post.py
```
After the request is sent, the inference logs appear on the server side

#### 2.2 Single-GPU offload inference

Modify `cpu_offload` in the config file as follows to enable offload
```
    "cpu_offload": true,
    "offload_granularity": "model"
```
Start the server
```
cd LightX2V/scripts/server

bash start_server.sh
```
Send a request to the server

```
cd LightX2V/scripts/server

# This generates a video; post.py uses url = "http://localhost:8000/v1/tasks/video/"
python post.py
```

#### 2.3 Multi-GPU parallel inference

Start the server
```
cd LightX2V/scripts/server

bash start_server_cfg_ulysses.sh
```
Send a request to the server

```
cd LightX2V/scripts/server

python post.py
```
The run time and observed peak GPU memory per card are as follows:
1. Single-GPU inference: Run DiT cost 261.699812 seconds; RUN pipeline cost 261.973479 seconds; 43968MiB
2. Single-GPU offload inference: Run DiT cost 264.445139 seconds; RUN pipeline cost 265.565198 seconds; 34932MiB
3. Multi-GPU parallel inference: Run DiT cost 109.518894 seconds; RUN pipeline cost 110.085543 seconds; 44624MiB

Details

The contents of start_server.sh are as follows
```
#!/bin/bash

# set path firstly
lightx2v_path=
model_path=

export CUDA_VISIBLE_DEVICES=0

# set environment variables
source ${lightx2v_path}/scripts/base/base.sh


# Start API server with distributed inference service
python -m lightx2v.server \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/wan/wan_t2v.json \
--host 0.0.0.0 \
--port 8000

echo "Service stopped"

```
`--host 0.0.0.0` and `--port 8000` make the server listen on port 8000 on all of the machine's network interfaces

The contents of post.py are as follows
```
import requests
from loguru import logger

if __name__ == "__main__":
    url = "http://localhost:8000/v1/tasks/video/"

    message = {
        "prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
        "negative_prompt": "镜头晃动，色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走",
        "image_path": "",
        "seed": 42,
        "save_result_path": "./cat_boxing_seed42.mp4",
    }

    logger.info(f"message: {message}")

    response = requests.post(url, json=message)

    logger.info(f"response: {response.json()}")

```
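`url = "http://localhost:8000/v1/tasks/video/"` sends a video generation task to port 8000 on the local machine. For an image generation task, change it to `url = "http://localhost:8000/v1/tasks/image/"`

The `message` dictionary is the body of the request sent to the server. If `seed` is not specified, a random `seed` is generated for each request; if `save_result_path` is not specified, the output file is named after the task id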
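
The optional fields can be handled by building the message with a small helper that leaves out `seed` and `save_result_path` when they are not given. This is a sketch using only the standard library; `task_url`, `build_message`, and `submit` are illustrative names, and returning parsed JSON from `submit` assumes the server replies with JSON, as the `response.json()` call in post.py suggests.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/tasks"


def task_url(kind="video"):
    """Endpoint for a 'video' or 'image' generation task."""
    if kind not in ("video", "image"):
        raise ValueError("kind must be 'video' or 'image'")
    return f"{BASE_URL}/{kind}/"


def build_message(prompt, negative_prompt="", image_path="",
                  seed=None, save_result_path=None):
    """Build the request body, leaving out seed/save_result_path if unset."""
    message = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "image_path": image_path,
    }
    if seed is not None:
        message["seed"] = seed
    if save_result_path is not None:
        message["save_result_path"] = save_result_path
    return message


def submit(message, kind="video"):
    """POST the message as JSON; requires a running server."""
    request = urllib.request.Request(
        task_url(kind),
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `submit(build_message("a cat boxing", seed=42))` would send a request equivalent to the one in post.py but with the server choosing the output file name.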

### 3. Generate with Python code

#### 3.1 Single-GPU inference

```
cd LightX2V/examples/wan

# Edit model_path, save_result_path, and config_json in wan_t2v.py

PYTHONPATH=/home/user/LightX2V python wan_t2v.py
```
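Note 1: When setting run parameters, passing a config_json is recommended, so that the hyperparameters stay aligned with the script-based and service-based runs described earlier

Note 2: PYTHONPATH must be an absolute path

#### 3.2 Single-GPU offload inference

Modify `cpu_offload` in the config file as follows to enable offload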
```
    "cpu_offload": true,
    "offload_granularity": "model"
```
```
cd LightX2V/examples/wan

PYTHONPATH=/home/user/LightX2V python wan_t2v.py
```

#### 3.3 Multi-GPU parallel inference
```
cd LightX2V/examples/wan
# In the code, change config_json to: LightX2V/configs/dist_infer/wan_t2v_dist_cfg_ulysses.json

PROFILING_DEBUG_LEVEL=2 PYTHONPATH=/home/user/LightX2V torchrun --nproc_per_node=8 wan_t2v.py
```
The run time and observed peak GPU memory per card are as follows:
1. Single-GPU inference: Run DiT cost 262.745393 seconds; RUN pipeline cost 263.279303 seconds; 44792MiB
2. Single-GPU offload inference: Run DiT cost 263.725956 seconds; RUN pipeline cost 264.919227 seconds; 34936MiB
3. Multi-GPU parallel inference: Run DiT cost 113.736238 seconds; RUN pipeline cost 114.297859 seconds; 44624MiB

Details

The contents of wan_t2v.py are as follows
```
"""
Wan2.1 text-to-video generation example.
This example demonstrates how to use LightX2V with Wan2.1 model for T2V generation.
"""

from lightx2v import LightX2VPipeline

# Initialize pipeline for Wan2.1 T2V task
pipe = LightX2VPipeline(
    model_path="/path/to/Wan2.1-T2V-14B",
    model_cls="wan2.1",
    task="t2v",
)

# Alternative: create generator from config JSON file
# pipe.create_generator(config_json="../configs/wan/wan_t2v.json")

# Create generator with specified parameters
pipe.create_generator(
    attn_mode="sage_attn2",
    infer_steps=50,
    height=480,  # Can be set to 720 for higher resolution
    width=832,  # Can be set to 1280 for higher resolution
    num_frames=81,
    guidance_scale=5.0,
    sample_shift=5.0,
)

seed = 42
prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "镜头晃动，色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走"
save_result_path = "/path/to/save_results/output.mp4"

pipe.generate(
    seed=seed,
    prompt=prompt,
    negative_prompt=negative_prompt,
    save_result_path=save_result_path,
)
```
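Note 1: Change model_path and save_result_path to the actual paths

Note 2: When setting run parameters, passing a config_json is recommended, so that the hyperparameters stay aligned with the script-based and service-based runs described earlier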