# Getting Started with the LightX2V Project from Wan2.1-T2V-1.3B

We recommend starting the entire LightX2V project with Wan2.1-T2V-1.3B. Regardless of which model you intend to use, we suggest reading this document first to understand the overall workflow of LightX2V.

## Prepare the Environment

Please refer to [01.PrepareEnv](01.PrepareEnv.md)

## Getting Started

Prepare the model (download from either Hugging Face or ModelScope):

```
# download from huggingface
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir Wan-AI/Wan2.1-T2V-1.3B

# download from modelscope
modelscope download --model Wan-AI/Wan2.1-T2V-1.3B --local_dir Wan-AI/Wan2.1-T2V-1.3B
```

We provide three ways to run the Wan2.1-T2V-1.3B model to generate videos:

1. Run a script to generate a video: preset bash scripts for quick verification.
2. Start a service to generate videos: launch the service and send requests; suitable for repeated inference and actual deployment.
3. Generate videos with Python code: run directly from Python, convenient for integration into existing codebases.

### Run Script to Generate Video

```
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V/scripts/wan

# Before running the script below, replace lightx2v_path and model_path in the script with actual paths
# Example: lightx2v_path=/home/user/LightX2V
# Example: model_path=/home/user/models/Wan-AI/Wan2.1-T2V-1.3B

bash run_wan_t2v.sh
```

Explanation of the script details:

The content of run_wan_t2v.sh is as follows:
```
#!/bin/bash

# set path firstly
lightx2v_path=
model_path=

export CUDA_VISIBLE_DEVICES=0

# set environment variables
source ${lightx2v_path}/scripts/base/base.sh

python -m lightx2v.infer \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/wan/wan_t2v.json \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--negative_prompt "Camera shake, vivid colors, overexposure, static, blurry details, subtitles, style, artwork, painting, still, grayish overall, worst quality, low quality, JPEG artifacts, ugly, defective, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, static frame, cluttered background, three legs, crowded background, walking backwards" \
--save_result_path ${lightx2v_path}/save_results/output_lightx2v_wan_t2v.mp4
```

`export CUDA_VISIBLE_DEVICES=0` means using GPU 0.

`source ${lightx2v_path}/scripts/base/base.sh` sets some basic environment variables.

`--model_cls wan2.1` specifies using the wan2.1 model.

`--task t2v` specifies the t2v task.

`--model_path` specifies the model path.

`--config_json` specifies the config file path.

`--prompt` specifies the prompt.

`--negative_prompt` specifies the negative prompt.

`--save_result_path` specifies the path to save the result.

Since each model has its own characteristics, the `config_json` file contains more detailed configuration parameters for the corresponding model. The content of `config_json` files varies for different models.

The content of wan_t2v.json is as follows:
```
{
    "infer_steps": 50,
    "target_video_length": 81,
    "text_len": 512,
    "target_height": 480,
    "target_width": 832,
    "self_attn_1_type": "flash_attn3",
    "cross_attn_1_type": "flash_attn3",
    "cross_attn_2_type": "flash_attn3",
    "sample_guide_scale": 6,
    "sample_shift": 8,
    "enable_cfg": true,
    "cpu_offload": false
}
```
Some important configuration parameters:

`infer_steps`: Number of inference steps.

`target_video_length`: Number of frames in the target video (for wan2.1, fps=16, so target_video_length=81 means an approximately 5-second video: 81 / 16 ≈ 5.06 s).

`target_height`: Target video height.

`target_width`: Target video width.

`self_attn_1_type`, `cross_attn_1_type`, `cross_attn_2_type`: Types of the three attention layers inside the wan2.1 model. Here, flash_attn3 is used, which is only supported on Hopper architecture GPUs (H100, H20, etc.). For other GPUs, use flash_attn2 instead.

`enable_cfg`: Whether to enable CFG (classifier-free guidance). Set to true here, meaning each step runs two inferences, one with the prompt and one with the negative prompt, which improves quality but roughly doubles inference time. If the model has already been CFG-distilled, set this to false.

`cpu_offload`: Whether to enable CPU offload. Set to false here, meaning CPU offload is disabled. For Wan2.1-T2V-1.3B at 480x832 resolution, about 21GB of GPU memory is used. If GPU memory is insufficient, set cpu_offload to true.
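As a quick sanity check of the frame-count arithmetic above, the relationship between `target_video_length` and video duration can be sketched as follows (`video_seconds` is a hypothetical helper, not part of LightX2V):

```python
# Convert target_video_length (a frame count) into an approximate duration.
# For wan2.1 the output is 16 fps, so 81 frames is roughly a 5-second video.
def video_seconds(num_frames: int, fps: int = 16) -> float:
    return num_frames / fps

print(video_seconds(81))  # 81 / 16 = 5.0625, i.e. about 5 seconds
```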

The above wan_t2v.json can be used as the standard config for H100, H200, and H20. For A100-80G, 4090-24G, and 5090-32G, replace flash_attn3 with flash_attn2.
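The GPU-dependent choice of attention backend can be sketched as a small helper (hypothetical, not part of LightX2V; in practice the compute capability can be queried with `torch.cuda.get_device_capability()`, and Hopper GPUs report major version 9):

```python
# Pick an attention backend from the GPU compute capability.
# flash_attn3 requires Hopper (compute capability 9.x, e.g. H100/H20);
# other GPUs (A100: 8.0, 4090: 8.9) fall back to flash_attn2.
def pick_attn_type(capability: tuple) -> str:
    major, _minor = capability
    return "flash_attn3" if major == 9 else "flash_attn2"

print(pick_attn_type((9, 0)))  # H100 -> flash_attn3
print(pick_attn_type((8, 9)))  # 4090 -> flash_attn2
```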


### Start Service to Generate Video

For actual deployment, we usually start a service and users send requests for generation tasks.

Start the service:
```
cd LightX2V/scripts/server

# Before running the script below, replace lightx2v_path and model_path in the script with actual paths
# Example: lightx2v_path=/home/user/LightX2V
# Example: model_path=/home/user/models/Wan-AI/Wan2.1-T2V-1.3B

bash start_server.sh
```

The content of start_server.sh is as follows:
```
#!/bin/bash

# set path firstly
lightx2v_path=
model_path=

export CUDA_VISIBLE_DEVICES=0

# set environment variables
source ${lightx2v_path}/scripts/base/base.sh


# Start API server with distributed inference service
python -m lightx2v.server \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/wan/wan_t2v.json \
--host 0.0.0.0 \
--port 8000
```

`--host 0.0.0.0` and `--port 8000` mean the service listens on port 8000 on all network interfaces of the machine.

`--config_json` should be consistent with the config file used in the previous script.

Send a request to the server:

Open a second terminal to act as the client.
```
cd LightX2V/scripts/server

python post.py
```
After sending the request, you can see the inference logs on the server.

The content of post.py is as follows:
```
import requests
from loguru import logger

if __name__ == "__main__":
    url = "http://localhost:8000/v1/tasks/video/"

    message = {
        "prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
        "negative_prompt": "Camera shake, vivid colors, overexposure, static, blurry details, subtitles, style, artwork, painting, still, grayish overall, worst quality, low quality, JPEG artifacts, ugly, defective, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, static frame, cluttered background, three legs, crowded background, walking backwards",
        "image_path": "",
        "seed": 42,
        "save_result_path": "./cat_boxing_seed42.mp4"
    }

    logger.info(f"message: {message}")

    response = requests.post(url, json=message)

    logger.info(f"response: {response.json()}")

```

`url = "http://localhost:8000/v1/tasks/video/"` means sending a video generation task to port 8000 of the local machine.

For image generation tasks, the url is:

`url = "http://localhost:8000/v1/tasks/image/"`

The message dictionary is the content sent to the server. If `seed` is not specified, a random seed will be generated for each request. If `save_result_path` is not specified, a file named after the task id will be generated.
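A client-side sketch of that seed-defaulting behavior (the `build_message` helper is hypothetical, not part of LightX2V; the server itself also falls back to a random seed when none is sent):

```python
import random
from typing import Optional

# Build a request payload for the /v1/tasks/video/ endpoint.
# If no seed is given, generate a random one on the client side so the
# exact request can be reproduced later from the client's own logs.
def build_message(prompt: str, negative_prompt: str = "",
                  seed: Optional[int] = None,
                  save_result_path: str = "") -> dict:
    if seed is None:
        seed = random.randint(0, 2**31 - 1)
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "image_path": "",
        "seed": seed,
        "save_result_path": save_result_path,
    }

msg = build_message("Two anthropomorphic cats in boxing gear fight on a stage.",
                    seed=42, save_result_path="./cat_boxing_seed42.mp4")
print(msg["seed"])  # 42
```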


### Generate Video with Python Code

Create a new pytest.py file; the environment variables below need to be set before running it.
```
# Example: replace /pytest_path and lightx2v_path below with actual paths
cd /pytest_path
export PYTHONPATH=lightx2v_path
# Then run the code
python pytest.py
```

The content of pytest.py is as follows:
```
from lightx2v import LightX2VPipeline

# Step 1: Create LightX2VPipeline
pipe = LightX2VPipeline(
    model_path="/data/nvme0/models/Wan-AI/Wan2.1-T2V-1.3B",
    model_cls="wan2.1",
    task="t2v",
)

# Step 2: Set runtime parameters

# You can set runtime parameters either by passing a config file
# or by passing function arguments.
# Only one of the two methods may be used per create_generator call.

# Option 1: pass in the config file path
# pipe.create_generator(config_json="path_to_config/wan_t2v.json")

# Option 2: pass in parameters via function arguments
pipe.create_generator(
    attn_mode="sage_attn2",
    infer_steps=50,
    height=480,  # Can be set to 720 for higher resolution
    width=832,  # Can be set to 1280 for higher resolution
    num_frames=81,
    guidance_scale=5.0,
    sample_shift=5.0,
)


# Step 3: Start generating videos, can generate multiple times

# First generation case
seed = 42
prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "Camera shake, vivid colors, overexposure, static, blurry details, subtitles, style, artwork, painting, still, grayish overall, worst quality, low quality, JPEG artifacts, ugly, defective, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, static frame, cluttered background, three legs, crowded background, walking backwards"
save_result_path = "./cat_boxing_seed42.mp4"

pipe.generate(
    seed=seed,
    prompt=prompt,
    negative_prompt=negative_prompt,
    save_result_path=save_result_path,
)

# Second generation case
seed = 1000
prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "Camera shake, vivid colors, overexposure, static, blurry details, subtitles, style, artwork, painting, still, grayish overall, worst quality, low quality, JPEG artifacts, ugly, defective, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, static frame, cluttered background, three legs, crowded background, walking backwards"
save_result_path = "./cat_boxing_seed1000.mp4"

pipe.generate(
    seed=seed,
    prompt=prompt,
    negative_prompt=negative_prompt,
    save_result_path=save_result_path,
)
```

Note 1: In Step 2 (set runtime parameters), using the config_json method is recommended so that hyperparameters stay aligned with the script-based and service-based generation methods above.

Note 2: The earlier `Run Script to Generate Video` section sets some additional environment variables, which can be found [here](https://github.com/ModelTC/LightX2V/blob/main/scripts/base/base.sh). Among them, `export PROFILING_DEBUG_LEVEL=2` enables inference time logging. For full alignment, set these environment variables before running the Python code above.
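For example, the timing-log variable can also be set from Python before creating the pipeline (a minimal sketch; only the variable name and value come from base.sh):

```python
import os

# Enable inference time logging, as scripts/base/base.sh does.
# Set this before constructing LightX2VPipeline so the pipeline sees it.
os.environ["PROFILING_DEBUG_LEVEL"] = "2"
print(os.environ["PROFILING_DEBUG_LEVEL"])  # 2
```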