# Video Frame Interpolation (VFI)
> **Important note**: Video frame interpolation is enabled via the configuration file, not via command-line arguments. Add a `video_frame_interpolation` block to your config JSON file to enable this feature.
## Overview
Video frame interpolation (VFI) is a technique that generates intermediate frames between existing frames to increase the frame rate and produce smoother video playback. LightX2V integrates the RIFE (Real-Time Intermediate Flow Estimation) model to provide high-quality frame interpolation.
## What is RIFE?
RIFE is a state-of-the-art video frame interpolation method that uses optical flow estimation to generate intermediate frames. It can effectively:
- Increase the video frame rate (e.g., from 16 FPS to 32 FPS)
- Create smooth motion transitions
- Maintain high visual quality with minimal artifacts
- Process video in real time
## Installation and Setup
### Download the RIFE Model
First, download the RIFE model weights with the provided script:
```bash
python tools/download_rife.py <target_directory>
```
For example, to download to a specific location:
```bash
python tools/download_rife.py /path/to/rife/train_log
```
This script will:
- Download the RIFEv4.26 model from HuggingFace
- Extract the model files and place them in the correct directory
- Clean up temporary files
## Usage
### Configuration File Setup
Video frame interpolation is enabled through the configuration file. Add a `video_frame_interpolation` block to your config JSON file:
```json
{
"infer_steps": 50,
"target_video_length": 81,
"target_height": 480,
"target_width": 832,
"fps": 16,
"video_frame_interpolation": {
"algo": "rife",
"target_fps": 32,
"model_path": "/path/to/rife/train_log"
}
}
```
### Command-Line Usage
Run inference with a config file that contains the VFI block:
```bash
python lightx2v/infer.py \
--model_cls wan2.1 \
--task t2v \
--model_path /path/to/model \
--config_json ./configs/video_frame_interpolation/wan_t2v.json \
--prompt "A beautiful sunset over the sea" \
--save_result_path ./output.mp4
```
### Configuration Parameters
Inside the `video_frame_interpolation` block:
- `algo`: the frame interpolation algorithm; currently only "rife" is supported
- `target_fps`: the target frame rate of the output video
- `model_path`: the RIFE model path, usually the "train_log" directory
Other related settings:
- `fps`: the source video frame rate (default 16)
### Configuration Priority
The system resolves the video frame rate automatically, with the following priority (see the sketch below):
1. `video_frame_interpolation.target_fps` - if video frame interpolation is enabled, this value is used as the output frame rate
2. `fps` (default 16) - used as the output frame rate when VFI is not enabled; it is always used as the source frame rate
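The following minimal Python sketch (not LightX2V's actual code; the function name and config shape are assumptions for illustration) shows this priority rule:
```python
# Minimal sketch of the fps priority rule described above (illustrative only).
def resolve_fps(config: dict) -> tuple:
    """Return (source_fps, output_fps) from a LightX2V-style config dict."""
    source_fps = config.get("fps", 16)             # always the source frame rate
    vfi = config.get("video_frame_interpolation")  # VFI block, if present
    output_fps = vfi["target_fps"] if vfi else source_fps
    return source_fps, output_fps

print(resolve_fps({"fps": 16, "video_frame_interpolation": {"target_fps": 32}}))  # (16, 32)
print(resolve_fps({"fps": 16}))                                                   # (16, 16)
```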
## How It Works
### Interpolation Process
1. **Source video generation**: the base model generates video frames at the source FPS
2. **Frame analysis**: RIFE analyzes adjacent frames to estimate optical flow
3. **Intermediate frame generation**: new frames are synthesized between existing frames
4. **Temporal smoothing**: the interpolated frames create smooth motion transitions
### Technical Details
- **Input format**: ComfyUI image tensor [N, H, W, C], values in [0, 1]
- **Output format**: interpolated ComfyUI image tensor [M, H, W, C], values in [0, 1]
- **Processing**: automatic padding and resolution handling
- **Memory optimization**: efficient GPU memory management
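As a rough illustration of the frame counts involved (not the actual RIFEWrapper call), the sketch below assumes the 81-frame, 16 FPS output of the basic example and a 32 FPS target; the exact output frame count depends on the implementation.
```python
# Illustrative only: relationship between source and interpolated frame counts
# for the ComfyUI-style [N, H, W, C] tensors described above.
source_fps, target_fps = 16, 32
n_src = 81                                # source tensor is [81, 480, 832, 3]
duration = (n_src - 1) / source_fps       # 80 intervals at 16 FPS = 5.0 s
n_out = round(duration * target_fps) + 1  # ~161 frames at 32 FPS
print(n_src, duration, n_out)             # 81 5.0 161
```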
## Example Configurations
### Basic Frame Rate Doubling
Create a config file `wan_t2v_vfi_32fps.json`:
```json
{
"infer_steps": 50,
"target_video_length": 81,
"target_height": 480,
"target_width": 832,
"seed": 42,
"sample_guide_scale": 6,
"enable_cfg": true,
"fps": 16,
"video_frame_interpolation": {
"algo": "rife",
"target_fps": 32,
"model_path": "/path/to/rife/train_log"
}
}
```
Run command:
```bash
python lightx2v/infer.py \
--model_cls wan2.1 \
--task t2v \
--model_path ./models/wan2.1 \
--config_json ./wan_t2v_vfi_32fps.json \
--prompt "A kitten playing in the garden" \
--save_result_path ./output_32fps.mp4
```
### Higher Frame Rate Enhancement
Create a config file `wan_i2v_vfi_60fps.json`:
```json
{
"infer_steps": 30,
"target_video_length": 81,
"target_height": 480,
"target_width": 832,
"seed": 42,
"sample_guide_scale": 6,
"enable_cfg": true,
"fps": 16,
"video_frame_interpolation": {
"algo": "rife",
"target_fps": 60,
"model_path": "/path/to/rife/train_log"
}
}
```
Run command:
```bash
python lightx2v/infer.py \
--model_cls wan2.1 \
--task i2v \
--model_path ./models/wan2.1 \
--config_json ./wan_i2v_vfi_60fps.json \
--image_path ./input.jpg \
--prompt "Smooth camera movement" \
--save_result_path ./output_60fps.mp4
```
## Performance Considerations
### Memory Usage
- RIFE processing requires additional GPU memory
- Memory usage scales with video resolution and length
- For longer videos, consider using a lower resolution
### Processing Time
- Frame interpolation adds processing overhead
- Higher target frame rates require more computation
- Processing time is roughly proportional to the number of interpolated frames
### Quality vs. Speed Trade-offs
- Higher interpolation ratios may introduce artifacts
- Sweet spot: 2x to 4x frame rate increase
- For extreme interpolation (>4x), consider multiple passes
## Best Practices
### Best Use Cases
- **Motion-heavy videos**: benefit most from frame interpolation
- **Camera motion**: smoother pans and zooms
- **Action sequences**: reduced perceived motion blur
- **Slow-motion effects**: create fluid slow-motion videos
### Recommended Settings
- **Source FPS**: 16-24 FPS (generated by the base model)
- **Target FPS**: 32-60 FPS (2x to 4x increase)
- **Resolution**: up to 720p for best performance
### Troubleshooting
#### Common Issues
1. **Out of memory**: reduce the video resolution or the target FPS
2. **Artifacts in the output**: lower the interpolation ratio
3. **Slow processing**: check GPU memory and consider CPU offload
#### Solutions
Adjust the configuration file to address these issues (the `//` comments below are for illustration only and must be removed in a real JSON file):
```json
{
// Memory issues: use a lower resolution
"target_height": 480,
"target_width": 832,
// Quality issues: use a moderate interpolation ratio
"video_frame_interpolation": {
"target_fps": 24 // instead of 60
},
// Performance issues: enable offloading
"cpu_offload": true
}
```
## Technical Implementation
The RIFE integration in LightX2V includes:
- **RIFEWrapper**: a ComfyUI-compatible wrapper around the RIFE model
- **Automatic model loading**: seamless integration with the inference pipeline
- **Memory optimization**: efficient tensor management and GPU memory usage
- **Quality preservation**: maintains the original video quality while adding frames
# Preparing the Environment
Before running, we need to set up the environment for LightX2V.
## Method 1: Linux + Docker Environment
We highly recommend running LightX2V in a Docker environment. This ensures consistency with our runtime environment and minimizes potential issues.
Refer to the [quickstart guide](https://lightx2v-en.readthedocs.io/en/latest/getting_started/quickstart.html).
The Docker image we provide already includes all necessary dependencies.
## Method 2: Linux + Manual Setup
Coming soon
## Method 3: Windows + WSL2
Coming soon
## Method 4: Windows + Installer
Coming soon
# Getting Started with the LightX2V Project from Wan2.1-T2V-1.3B
We recommend starting the entire LightX2V project with Wan2.1-T2V-1.3B. Regardless of which model you intend to use, we suggest reading this document first to understand the overall workflow of LightX2V.
## Prepare the Environment
Please refer to [01.PrepareEnv](01.PrepareEnv.md)
## Getting Started
Prepare the model (choose either huggingface or modelscope to download):
```
# download from huggingface
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir Wan-AI/Wan2.1-T2V-1.3B
# download from modelscope
modelscope download --model Wan-AI/Wan2.1-T2V-1.3B --local_dir Wan-AI/Wan2.1-T2V-1.3B
```
We provide three ways to run the Wan2.1-T2V-1.3B model to generate videos:
1. Run a script to generate video: Preset bash scripts for quick verification.
2. Start a service to generate video: Start the service and send requests, suitable for multiple inferences and actual deployment.
3. Generate video with Python code: Run with Python code, convenient for integration into existing codebases.
### Run Script to Generate Video
```
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V/scripts/wan
# Before running the script below, replace lightx2v_path and model_path in the script with actual paths
# Example: lightx2v_path=/home/user/LightX2V
# Example: model_path=/home/user/models/Wan-AI/Wan2.1-T2V-1.3B
bash run_wan_t2v.sh
```
Explanation of details
The content of run_wan_t2v.sh is as follows:
```
#!/bin/bash
# set path firstly
lightx2v_path=
model_path=
export CUDA_VISIBLE_DEVICES=0
# set environment variables
source ${lightx2v_path}/scripts/base/base.sh
python -m lightx2v.infer \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/wan/wan_t2v.json \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--negative_prompt "Camera shake, vivid colors, overexposure, static, blurry details, subtitles, style, artwork, painting, still, grayish overall, worst quality, low quality, JPEG artifacts, ugly, defective, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, static frame, cluttered background, three legs, crowded background, walking backwards" \
--save_result_path ${lightx2v_path}/save_results/output_lightx2v_wan_t2v.mp4
```
`export CUDA_VISIBLE_DEVICES=0` means using GPU 0.
`source ${lightx2v_path}/scripts/base/base.sh` sets some basic environment variables.
`--model_cls wan2.1` specifies using the wan2.1 model.
`--task t2v` specifies the t2v task.
`--model_path` specifies the model path.
`--config_json` specifies the config file path.
`--prompt` specifies the prompt.
`--negative_prompt` specifies the negative prompt.
`--save_result_path` specifies the path to save the result.
Since each model has its own characteristics, the `config_json` file contains more detailed configuration parameters for the corresponding model. The content of `config_json` files varies for different models.
The content of wan_t2v.json is as follows:
```
{
"infer_steps": 50,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 6,
"sample_shift": 8,
"enable_cfg": true,
"cpu_offload": false
}
```
Some important configuration parameters:
`infer_steps`: Number of inference steps.
`target_video_length`: Number of frames in the target video (for wan2.1, fps=16, so target_video_length=81 means a 5-second video).
`target_height`: Target video height.
`target_width`: Target video width.
`self_attn_1_type`, `cross_attn_1_type`, `cross_attn_2_type`: Types of the three attention layers inside the wan2.1 model. Here, flash_attn3 is used, which is only supported on Hopper architecture GPUs (H100, H20, etc.). For other GPUs, use flash_attn2 instead.
`enable_cfg`: Whether to enable cfg. Set to true here, meaning two inferences will be performed: one with the prompt and one with the negative prompt, for better results but increased inference time. If the model has already been CFG distilled, set this to false.
`cpu_offload`: Whether to enable CPU offload. Set to false here, meaning CPU offload is not enabled. For Wan2.1-T2V-1.3B, at 480*832 resolution, about 21GB of GPU memory is used. If GPU memory is insufficient, enable cpu_offload.
The above wan_t2v.json can be used as the standard config for H100, H200, H20. For A100-80G, 4090-24G, and 5090-32G, replace flash_attn3 with flash_attn2.
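If you generate or patch config files programmatically, a small helper like the sketch below (not part of LightX2V; names are placeholders) can apply that rule automatically, selecting flash_attn3 only on Hopper GPUs (compute capability 9.x) and flash_attn2 elsewhere.
```python
import torch

def pick_attn_type() -> str:
    """Pick an attention backend per the rule above: flash_attn3 on Hopper, else flash_attn2."""
    major, _ = torch.cuda.get_device_capability(0)
    return "flash_attn3" if major == 9 else "flash_attn2"

attn = pick_attn_type()
config_patch = {
    "self_attn_1_type": attn,
    "cross_attn_1_type": attn,
    "cross_attn_2_type": attn,
}
print(config_patch)
```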
### Start Service to Generate Video
For actual deployment, we usually start a service and users send requests for generation tasks.
Start the service:
```
cd LightX2V/scripts/server
# Before running the script below, replace lightx2v_path and model_path in the script with actual paths
# Example: lightx2v_path=/home/user/LightX2V
# Example: model_path=/home/user/models/Wan-AI/Wan2.1-T2V-1.3B
bash start_server.sh
```
The content of start_server.sh is as follows:
```
#!/bin/bash
# set path firstly
lightx2v_path=
model_path=
export CUDA_VISIBLE_DEVICES=0
# set environment variables
source ${lightx2v_path}/scripts/base/base.sh
# Start API server with distributed inference service
python -m lightx2v.server \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/wan/wan_t2v.json \
--host 0.0.0.0 \
--port 8000
```
`--host 0.0.0.0` and `--port 8000` mean the service runs on port 8000 of the local machine.
`--config_json` should be consistent with the config file used in the previous script.
Send a request to the server:
Here we need to open a second terminal as a user.
```
cd LightX2V/scripts/server
python post.py
```
After sending the request, you can see the inference logs on the server.
The content of post.py is as follows:
```
import requests
from loguru import logger
if __name__ == "__main__":
    url = "http://localhost:8000/v1/tasks/video/"
    message = {
        "prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
        "negative_prompt": "Camera shake, vivid colors, overexposure, static, blurry details, subtitles, style, artwork, painting, still, grayish overall, worst quality, low quality, JPEG artifacts, ugly, defective, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, static frame, cluttered background, three legs, crowded background, walking backwards",
        "image_path": "",
        "seed": 42,
        "save_result_path": "./cat_boxing_seed42.mp4"
    }
    logger.info(f"message: {message}")
    response = requests.post(url, json=message)
    logger.info(f"response: {response.json()}")
```
url = "http://localhost:8000/v1/tasks/video/" means sending a video generation task to port 8000 of the local machine.
For image generation tasks, the url is:
url = "http://localhost:8000/v1/tasks/image/"
The message dictionary is the content sent to the server. If `seed` is not specified, a random seed will be generated for each request. If `save_result_path` is not specified, a file named after the task id will be generated.
### Generate Video with Python Code
Create a new pytest.py file; environment variables need to be set before running it.
```
#example:
cd /pytest_path
export PYTHONPATH=lightx2v_path
# Then run the code
python pytest.py
```
The content of pytest.py is as follows:
```
from lightx2v import LightX2VPipeline

# Step 1: Create LightX2VPipeline
pipe = LightX2VPipeline(
    model_path="/data/nvme0/models/Wan-AI/Wan2.1-T2V-1.3B",
    model_cls="wan2.1",
    task="t2v",
)

# Step 2: Set runtime parameters
# You can set runtime parameters by passing a config
# Or by passing function arguments
# Only one method can be used at a time
# Option 1: Pass in the config file path (Option 1 and Option 2 cannot be used simultaneously when calling create_generator; only one may be selected!)
# pipe.create_generator(config_json="path_to_config/wan_t2v.json")
# Option 2: Pass in parameters via function arguments (Option 1 and Option 2 cannot be used simultaneously when calling create_generator; only one may be selected!)
pipe.create_generator(
    attn_mode="sage_attn2",
    infer_steps=50,
    height=480,  # Can be set to 720 for higher resolution
    width=832,  # Can be set to 1280 for higher resolution
    num_frames=81,
    guidance_scale=5.0,
    sample_shift=5.0,
)

# Step 3: Start generating videos, can generate multiple times
# First generation case
seed = 42
prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "Camera shake, vivid colors, overexposure, static, blurry details, subtitles, style, artwork, painting, still, grayish overall, worst quality, low quality, JPEG artifacts, ugly, defective, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, static frame, cluttered background, three legs, crowded background, walking backwards"
save_result_path = "./cat_boxing_seed42.mp4"
pipe.generate(
    seed=seed,
    prompt=prompt,
    negative_prompt=negative_prompt,
    save_result_path=save_result_path,
)

# Second generation case
seed = 1000
prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "Camera shake, vivid colors, overexposure, static, blurry details, subtitles, style, artwork, painting, still, grayish overall, worst quality, low quality, JPEG artifacts, ugly, defective, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, static frame, cluttered background, three legs, crowded background, walking backwards"
save_result_path = "./cat_boxing_seed1000.mp4"
pipe.generate(
    seed=seed,
    prompt=prompt,
    negative_prompt=negative_prompt,
    save_result_path=save_result_path,
)
```
Note 1: In Step 2 (Set runtime parameters), it is recommended to use the config_json method to align hyperparameters with the previous script-based and service-based video generation methods.
Note 2: The previous `Run Script to Generate Video` sets some additional environment variables, which can be found [here](https://github.com/ModelTC/LightX2V/blob/main/scripts/base/base.sh). Among them, `export PROFILING_DEBUG_LEVEL=2` enables inference time logging. For full alignment, you can set these environment variables before running the above Python code.
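For example, to mirror that behavior from Python you can set the variable before importing the pipeline; only `PROFILING_DEBUG_LEVEL` is shown here, and base.sh remains the source of truth for the full list.
```python
import os

# Mirror scripts/base/base.sh before importing LightX2V so timings are logged
# the same way as in the script-based runs (PROFILING_DEBUG_LEVEL=2 enables
# inference time logging; see base.sh for the remaining variables).
os.environ["PROFILING_DEBUG_LEVEL"] = "2"

from lightx2v import LightX2VPipeline  # import after the environment is set
```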
# Experience T2I and I2I with Qwen Image
This document contains usage examples for Qwen Image and Qwen Image Edit models.
The text-to-image model used is Qwen-Image-2512 and the image editing model is Qwen-Image-Edit-2511; both are currently the latest available versions of these models.
## Prepare the Environment
Please refer to [01.PrepareEnv](01.PrepareEnv.md)
## Getting Started
Prepare the model
```
# download from huggingface
# Inference with 2512 text-to-image original model
hf download Qwen/Qwen-Image-2512 --local-dir Qwen/Qwen-Image-2512
# Inference with 2512 text-to-image step-distilled model
hf download lightx2v/Qwen-Image-2512-Lightning --local-dir Qwen/Qwen-Image-2512-Lightning
# Inference with 2511 image editing original model
hf download Qwen/Qwen-Image-Edit-2511 --local-dir Qwen/Qwen-Image-2511
# Inference with 2511 image editing step-distilled model
hf download lightx2v/Qwen-Image-Edit-2511-Lightning --local-dir Qwen/Qwen-Image-2511-Lightning
```
We provide three ways to run the QwenImage model to generate images:
1. Run a script to generate images: Preset bash scripts for quick verification.
2. Start a service to generate images: Start the service and send requests, suitable for multiple inferences and actual deployment.
3. Generate images with Python code: Run with Python code, convenient for integration into existing codebases.
### Run Script to Generate Image
```
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V/scripts/qwen_image
# Before running the script below, replace lightx2v_path and model_path in the script with actual paths
# Example: lightx2v_path=/home/user/LightX2V
# Example: model_path=/home/user/models/Qwen/Qwen-Image-2511
```
Text-to-Image Models
```
# Inference with 2512 text-to-image original model, default is 50 steps
bash qwen_image_t2i_2512.sh
# Inference with 2512 text-to-image step-distilled model, default is 8 steps; requires modifying the lora_configs path in the config_json file
bash qwen_image_t2i_2512_distill.sh
# Inference with 2512 text-to-image step-distilled + FP8 quantized model, default is 8 steps; requires modifying the dit_quantized_ckpt path in the config_json file
bash qwen_image_t2i_2512_distill_fp8.sh
```
Note 1: In the qwen_image_t2i_2512_distill.sh and qwen_image_t2i_2512_distill_fp8.sh scripts, the model_path parameter should be the same as in qwen_image_t2i_2512.sh; in all of these scripts it refers to the local path of the Qwen-Image-2512 model.
Note 2: The config_json files to modify are located in LightX2V/configs/qwen_image; lora_configs and dit_quantized_ckpt refer to the local paths of the distilled weights being used (a small helper sketch for patching these paths follows below).
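If you prefer not to edit the config by hand, a short stdlib-only snippet like the sketch below (paths are placeholders) can point `lora_configs` at your local copy of the distilled LoRA; the same pattern works for `dit_quantized_ckpt`.
```python
import json
from pathlib import Path

# Point lora_configs at the locally downloaded distilled LoRA (placeholder paths).
cfg_path = Path("LightX2V/configs/qwen_image/qwen_image_t2i_2512_distill.json")
cfg = json.loads(cfg_path.read_text())
cfg["lora_configs"][0]["path"] = "/home/user/models/Qwen-Image-2512-Lightning/Qwen-Image-2512-Lightning-8steps-V1.0-fp32.safetensors"
cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
```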
Image Editing Models
```
# Inference with 2511 image editing original model, default is 40 steps
bash qwen_image_i2i_2511.sh
# Inference with 2511 image editing step-distilled model, default is 8 steps; requires modifying the lora_configs path in the config_json file
bash qwen_image_i2i_2511_distill.sh
# Inference with 2511 image editing step-distilled + FP8 quantized model, default is 8 steps; requires modifying the dit_quantized_ckpt path in the config_json file
bash qwen_image_i2i_2511_distill_fp8.sh
```
Note 1: The model_path parameter in all of these bash scripts should be set to the path of the Qwen-Image-2511 model. The paths to modify in the config_json file are the local paths of the distilled weights being used.
Note 2: You also need to set the image_path parameter in the bash scripts; you can pass in your own image to test the model.
Explanation of details
The content of qwen_image_t2i_2512.sh is as follows:
```
#!/bin/bash
# set path firstly
lightx2v_path=
model_path=
export CUDA_VISIBLE_DEVICES=0
# set environment variables
source ${lightx2v_path}/scripts/base/base.sh
python -m lightx2v.infer \
--model_cls qwen_image \
--task t2i \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/qwen_image/qwen_image_t2i_2512.json \
--prompt 'A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197". Ultra HD, 4K, cinematic composition, Ultra HD, 4K, cinematic composition.' \
--negative_prompt " " \
--save_result_path ${lightx2v_path}/save_results/qwen_image_t2i_2512.png \
--seed 42
```
`source ${lightx2v_path}/scripts/base/base.sh` sets some basic environment variables.
`--model_cls qwen_image` specifies using the qwen_image model.
`--task t2i` specifies the t2i task.
`--model_path` specifies the model path.
`--config_json` specifies the config file path.
`--prompt` specifies the prompt.
`--negative_prompt` specifies the negative prompt.
The content of qwen_image_t2i_2512.json is as follows:
```
{
"infer_steps": 50,
"aspect_ratio": "16:9",
"prompt_template_encode": "<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
"prompt_template_encode_start_idx": 34,
"attn_type": "flash_attn3",
"enable_cfg": true,
"sample_guide_scale": 4.0
}
```
`infer_steps`: Number of inference steps.
`aspect_ratio` specifies the aspect ratio of the target image.
`prompt_template_encode` specifies the template used for prompt encoding.
`prompt_template_encode_start_idx` specifies the starting index of the valid tokens in the encoded prompt template (see the sketch below for how these two fields work together).
`attn_type` specifies the attention operator used inside the Qwen model. Here, flash_attn3 is used, which is only supported on Hopper architecture GPUs (H100, H20, etc.). For other GPUs, use flash_attn2 instead.
`enable_cfg`: whether to enable CFG. Set to true here, meaning two inferences will be performed, one with the prompt and one with the negative prompt, for better results at the cost of longer inference time.
`sample_guide_scale` specifies the CFG guidance strength, i.e., how strongly classifier-free guidance is applied.
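The sketch below shows roughly how these two template fields are typically used together; it is illustrative only and not LightX2V's actual encoding code. The user prompt fills the `{}` slot, and the first `prompt_template_encode_start_idx` encoded tokens (the fixed system preamble) are assumed to be dropped before conditioning the image model.
```python
# Illustrative only: how prompt_template_encode and prompt_template_encode_start_idx interact.
template = (
    "<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, "
    "quantity, text, spatial relationships of the objects and background:<|im_end|>\n"
    "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"
)
start_idx = 34  # prompt_template_encode_start_idx

def build_encoder_input(user_prompt: str) -> str:
    # The user prompt is substituted into the {} slot of the template.
    return template.format(user_prompt)

# After the text encoder runs on this string, the first `start_idx` token
# embeddings (the fixed preamble) are assumed to be discarded, e.g.:
# prompt_embeds = encoder_hidden_states[:, start_idx:, :]
print(build_encoder_input("A cat wearing sunglasses")[:80])
```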
The content of qwen_image_t2i_2512_distill.json is as follows:
```
{
"infer_steps": 8,
"aspect_ratio": "16:9",
"prompt_template_encode": "<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
"prompt_template_encode_start_idx": 34,
"attn_type": "flash_attn3",
"enable_cfg": false,
"sample_guide_scale": 4.0,
"lora_configs": [
{
"path": "lightx2v/Qwen-Image-2512-Lightning/Qwen-Image-2512-Lightning-8steps-V1.0-fp32.safetensors",
"strength": 1.0
}
]
}
```
`infer_steps`: number of inference steps. This is a distilled model, so the inference steps have been distilled down to 8.
`enable_cfg`: whether to enable CFG. For models that have undergone CFG distillation, set this parameter to false.
`lora_configs` specifies the LoRA weight configuration; the path needs to be changed to the actual local path.
The content of qwen_image_t2i_2512_distill_fp8.json is as follows:
```
{
"infer_steps": 8,
"aspect_ratio": "16:9",
"prompt_template_encode": "<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
"prompt_template_encode_start_idx": 34,
"attn_type": "flash_attn3",
"enable_cfg": false,
"sample_guide_scale": 4.0,
"dit_quantized": true,
"dit_quantized_ckpt": "lightx2v/Qwen-Image-2512-Lightning/qwen_image_2512_fp8_e4m3fn_scaled_8steps_v1.0.safetensors",
"dit_quant_scheme": "fp8-sgl"
}
```
`dit_quantized`: whether to enable DIT quantization; setting it to true applies quantization to the core DIT module of the model.
`dit_quantized_ckpt` specifies the local path of the FP8-quantized DIT weight file.
`dit_quant_scheme` specifies the DIT quantization scheme; "fp8-sgl" means using sglang's FP8 kernels for inference.
### Start Service to Generate Image
Start the service:
```
cd LightX2V/scripts/server
# Before running the script below, replace lightx2v_path and model_path in the script with actual paths
# Example: lightx2v_path=/home/user/LightX2V
# Example: model_path=/home/user/models/Qwen/Qwen-Image-2511
# Additionally: Set config_json to the corresponding model config path.
# Example: config_json ${lightx2v_path}/configs/qwen_image/qwen_image_t2i_2512.json
bash start_server_t2i.sh
```
Send a request to the server:
Here we need to open a second terminal as a user.
```
cd LightX2V/scripts/server
# Before running post.py, you need to modify the url in the script to url = "http://localhost:8000/v1/tasks/image/"
python post.py
```
After sending the request, you can see the inference logs on the server.
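For reference, a version of post.py adapted for the image endpoint might look like the sketch below; the prompt and save path are placeholders, and `image_path` should point to a real input image when testing the image editing (i2i) model.
```python
import requests
from loguru import logger

# post.py adapted for image tasks: only the url (and optionally image_path) changes.
if __name__ == "__main__":
    url = "http://localhost:8000/v1/tasks/image/"
    message = {
        "prompt": "A serene lakeside cabin at sunrise, ultra HD, 4K.",  # placeholder prompt
        "negative_prompt": " ",
        "image_path": "",  # set to an input image path for i2i requests
        "seed": 42,
        "save_result_path": "./qwen_image_t2i.png",
    }
    logger.info(f"message: {message}")
    response = requests.post(url, json=message)
    logger.info(f"response: {response.json()}")
```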
### Generate Image with Python Code
Running Step-Distilled + FP8 Quantized Model
Run the `qwen_2511_fp8.py` script, which uses a model optimized with step distillation and FP8 quantization:
```
cd examples/qwen_image/
# Environment variables need to be set before running.
export PYTHONPATH=/home/user/LightX2V
# Before running, you need to modify the paths in the script to the actual paths, including: model_path, dit_quantized_ckpt, image_path, save_result_path
python qwen_2511_fp8.py
```
This approach reduces inference steps through step distillation technology while using FP8 quantization to reduce model size and memory footprint, achieving faster inference speed.
Explanation of details
The content of qwen_2511_fp8.py is as follows:
```
"""
Qwen-image-edit image-to-image generation example.
This example demonstrates how to use LightX2V with Qwen-Image-Edit model for I2I generation.
"""
from lightx2v import LightX2VPipeline
# Initialize pipeline for Qwen-image-edit I2I task
# For Qwen-Image-Edit-2511, use model_cls="qwen-image-edit-2511"
pipe = LightX2VPipeline(
model_path="/path/to/Qwen-Image-Edit-2511",
model_cls="qwen-image-edit-2511",
task="i2i",
)
# Alternative: create generator from config JSON file
# pipe.create_generator(
# config_json="../configs/qwen_image/qwen_image_i2i_2511_distill_fp8.json"
# )
# Enable offloading to significantly reduce VRAM usage with minimal speed impact
# Suitable for RTX 30/40/50 consumer GPUs
# pipe.enable_offload(
# cpu_offload=True,
# offload_granularity="block", #["block", "phase"]
# text_encoder_offload=True,
# vae_offload=False,
# )
# Load fp8 distilled weights (and int4 Qwen2_5 vl model (optional))
pipe.enable_quantize(
dit_quantized=True,
dit_quantized_ckpt="lightx2v/Qwen-Image-Edit-2511-Lightning/qwen_image_edit_2511_fp8_e4m3fn_scaled_lightning_4steps_v1.0.safetensors",
quant_scheme="fp8-sgl",
# text_encoder_quantized=True,
# text_encoder_quantized_ckpt="lightx2v/Encoders/GPTQModel/Qwen25-VL-4bit-GPTQ",
# text_encoder_quant_scheme="int4"
)
# Create generator manually with specified parameters
pipe.create_generator(
attn_mode="flash_attn3",
resize_mode="adaptive",
infer_steps=8,
guidance_scale=1,
)
# Generation parameters
seed = 42
prompt = "Replace the polka-dot shirt with a light blue shirt."
negative_prompt = ""
image_path = "/path/to/img.png" # or "/path/to/img_0.jpg,/path/to/img_1.jpg"
save_result_path = "/path/to/save_results/output.png"
# Generate video
pipe.generate(
seed=seed,
image_path=image_path,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)
```
Note 1: You can set runtime parameters by passing a config or by passing function arguments; only one method can be used at a time. This script passes function arguments and the config-passing section is commented out, but using the config method is recommended. For GPUs such as A100-80G, 4090-24G, and 5090-32G, replace flash_attn3 with flash_attn2.
Note 2: Offload can be enabled for RTX 30/40/50 series GPUs to optimize VRAM usage.
Running Qwen-Image-Edit-2511 Model + Distilled LoRA
Run the qwen_2511_with_distill_lora.py script, which uses the Qwen-Image-Edit-2511 base model with distilled LoRA:
```
cd examples/qwen_image/
# Before running, you need to modify the paths in the script to the actual paths, including: model_path, path in pipe.enable_lora, image_path, save_result_path
python qwen_2511_with_distill_lora.py
```
This approach uses the complete Qwen-Image-Edit-2511 model and optimizes it through distilled LoRA, improving inference efficiency while maintaining model performance.
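This document does not show the exact signature of `pipe.enable_lora` (only that qwen_2511_with_distill_lora.py sets a path inside it), so the sketch below is an assumption that mirrors the `lora_configs` entries used in the config files; refer to examples/qwen_image/qwen_2511_with_distill_lora.py for the authoritative usage.
```python
from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="/path/to/Qwen-Image-Edit-2511",
    model_cls="qwen-image-edit-2511",
    task="i2i",
)

# Assumed argument shape, modeled on the lora_configs blocks in the config files;
# the real signature is in qwen_2511_with_distill_lora.py.
pipe.enable_lora(
    lora_configs=[
        {
            "path": "/path/to/Qwen-Image-Edit-2511-Lightning/<distill_lora>.safetensors",  # placeholder
            "strength": 1.0,
        }
    ]
)
```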
# Text Encoder Separation/Optimization Guide (Advanced Guide)
For large-scale model inference, the Text Encoder often consumes significant memory and its computation is relatively independent. LightX2V offers two advanced Text Encoder optimization schemes: **Service Mode** and **Kernel Mode**.
These schemes have been deeply optimized for the **Qwen-Image** series Text Encoder, significantly reducing memory usage and improving inference throughput.
## Comparison
| Feature | **Baseline (Original HF)** | **Service Mode (Separated)** | **Kernel Mode (Kernel Optimized)** |
| :--- | :--- | :--- | :--- |
| **Deployment Architecture** | Same process as main model | Independent service via HTTP/SHM | Same process as main model |
| **Memory Usage** | High (Loads full HF model) | **Very Low** (Client loads no weights) | **Medium** (Loads simplified model + Kernel) |
| **Cross-Request Reuse** | No | **Supported** (Shared by multiple clients) | No |
| **Communication Overhead** | None | Yes (HTTP/SharedMemory) | None |
| **Inference Speed** | Slow (Standard Layer) | **Very Fast** (LightLLM backend acceleration) | **Fast** (Integrated LightLLM Kernels) |
| **Applicable Scenarios** | Quick validation, small memory single-card | **Multi-card/Multi-node production**, DiT memory bottleneck | **High-performance single-node**, extreme speed pursuit |
For detailed performance data, please refer to: [Performance Benchmark](https://github.com/ModelTC/LightX2V/pull/829)
---
## 1. Service Mode (Separated Deployment)
Service Mode runs the Text Encoder as an independent service based on the high-performance LLM inference framework **LightLLM**. The main model (LightX2V Client) retrieves hidden states via API requests.
### 1.1 Environment Preparation
The Text Encoder server side requires the **LightLLM** framework.
**Server Installation Steps:**
1. Clone LightLLM code (specify `return_hiddens` branch)
```bash
git clone git@github.com:ModelTC/LightLLM.git -b return_hiddens
cd LightLLM
```
2. Configure Environment
Please refer to the LightLLM official documentation to configure the Python environment (usually requires PyTorch, CUDA, Triton, etc.).
*Note: Ensure the server environment supports FlashAttention to achieve the best performance.*
### 1.2 Start Text Encoder Service
Use `lightllm.server.api_server` to start the service.
**Create start script `start_encoder_service.sh` (example):**
```bash
#!/bin/bash
# GPU settings (e.g., use a separate card for Text Encoder)
export CUDA_VISIBLE_DEVICES=1
export LOADWORKER=18
# Point to LightLLM code directory
# export PYTHONPATH=/path/to/LightLLM:$PYTHONPATH
# Model paths (replace with actual paths)
MODEL_DIR="/path/to/models/Qwen-Image-Edit-2511/text_encoder"
TOKENIZER_DIR="/path/to/models/Qwen-Image-Edit-2511/tokenizer"
PROCESSOR_DIR="/path/to/models/Qwen-Image-Edit-2511/processor"
# Set environment variables for LightLLM internal use
export LIGHTLLM_TOKENIZER_DIR=$TOKENIZER_DIR
export LIGHTLLM_PROCESSOR_DIR=$PROCESSOR_DIR
export LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1
python -m lightllm.server.api_server \
--model_dir $MODEL_DIR \
--host 0.0.0.0 \
--port 8010 \
--tp 1 \
--enable_fa3 \
--return_input_hidden_states \
--enable_multimodal \
--disable_dynamic_prompt_cache
```
**Key Arguments Explanation:**
* `--return_input_hidden_states`: **Must be enabled**. Instructs LightLLM to return hidden states instead of generated tokens, which is the core of Service Mode.
* `--enable_multimodal`: Enable multimodal support (handles Qwen's Vision Token).
* `--port 8010`: Service listening port, must match the Client configuration.
* `--tp 1`: Tensor Parallel degree, usually 1 is sufficient for Text Encoder.
* `--enable_fa3`: Enable FlashAttention.
* `--disable_dynamic_prompt_cache`: Disable dynamic prompt cache.
Start the service:
```bash
bash start_encoder_service.sh
```
Seeing something like "Uvicorn running on http://0.0.0.0:8010" indicates successful startup.
### 1.3 Configure LightX2V Client
On the LightX2V side, simply modify the `config_json` to enable Service Mode.
**Configuration File (`configs/qwen_image/qwen_image_i2i_2511_service.json`):**
```json
{
"text_encoder_type": "lightllm_service",
"lightllm_config": {
"service_url": "http://localhost:8010",
"service_timeout": 30,
"service_retry": 3,
"use_shm": true
},
// ... other parameters (infer_steps, prompt_template, etc.) ...
}
```
**Parameters Explanation:**
* `text_encoder_type`: Set to **"lightllm_service"**.
* `service_url`: The address of the Text Encoder service.
* `use_shm`: **Strongly Recommended**.
* `true`: Enable Shared Memory communication. If Client and Server are on the same machine (even in different Docker containers, provided shared memory is mounted), data transfer will happen via direct memory reading, **zero-copy, extremely fast**.
* `false`: Use HTTP to transfer base64 encoded data. Suitable for cross-machine deployment.
**Run Inference:**
Create a run script (`scripts/qwen_image/qwen_image_i2i_2511_service.sh`):
```bash
python -m lightx2v.infer \
--model_cls qwen_image \
--task i2i \
--model_path /path/to/Qwen-Image-Edit-2511 \
--config_json configs/qwen_image/qwen_image_i2i_2511_service.json \
--prompt "Make the girl from Image 1 wear the black dress from Image 2..." \
--image_path "1.png,2.png,3.png" \
--save_result_path output.png
```
---
## 2. Kernel Mode (Kernel Optimization)
Kernel Mode is suitable for single-node high-performance inference scenarios. It does not start an independent service in the background, but loads the Text Encoder directly in the process, while **replacing HuggingFace's original slow operators** with LightLLM's core Triton Kernels.
### 2.1 Advantages
* **No Independent Service**: Simplifies deployment and operations.
* **Triton Acceleration**: Uses highly optimized FlashAttention and Fused RMSNorm Triton Kernels.
* **No Communication Overhead**: Pure in-process memory operations.
### 2.2 Configuration
Simply modify `config_json` to enable Kernel Mode.
**Configuration File (`configs/qwen_image/qwen_image_i2i_2511_kernel.json`):**
```json
{
"text_encoder_type": "lightllm_kernel",
"lightllm_config": {
"use_flash_attention_kernel": true,
"use_rmsnorm_kernel": true
},
// ... other parameters ...
}
```
**Parameters Explanation:**
* `text_encoder_type`: Set to **"lightllm_kernel"**.
* `use_flash_attention_kernel`: Enable FlashAttention acceleration for the attention layers. By default flash_attention_2 is used; you can also set `"use_flash_attention_kernel": "flash_attention_3"`.
* `use_rmsnorm_kernel`: Enable Fused RMSNorm Kernel (requires `sgl_kernel` or related dependencies; will automatically downgrade if not installed).
**Run Inference:**
Create run script (`scripts/qwen_image/qwen_image_i2i_2511_kernel.sh`):
```bash
python -m lightx2v.infer \
--model_cls qwen_image \
--task i2i \
--model_path /path/to/Qwen-Image-Edit-2511 \
--config_json configs/qwen_image/qwen_image_i2i_2511_kernel.json \
--prompt "..." \
--image_path "..." \
--save_result_path output.png
```
---
## Summary and Recommendations
* **Development/Debugging**: Default Mode (HuggingFace) for best compatibility.
* **High-Performance Single-Node**: Use **Kernel Mode**.
* **Multi-Node/Multi-Card/Memory Constrained**: Use **Service Mode**. Deploy the Text Encoder on a card with smaller memory, let the main card focus on DiT inference, and achieve efficient communication via Shared Memory.
# Trying T2V and I2V with Wan21-14B
This document contains usage examples for the Wan2.1-T2V-14B and Wan2.1-I2V-14B-480P / Wan2.1-I2V-14B-720P models.
## Prepare the environment
Please refer to [01.PrepareEnv](01.PrepareEnv.md)
## Getting started
Prepare the models:
```bash
# Download from Hugging Face
hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan-AI/Wan2.1-T2V-14B
hf download Wan-AI/Wan2.1-I2V-14B-480P --local-dir Wan-AI/Wan2.1-I2V-14B-480P
hf download Wan-AI/Wan2.1-I2V-14B-720P --local-dir Wan-AI/Wan2.1-I2V-14B-720P
# Download distillation models
hf download lightx2v/Wan2.1-Distill-Models --local-dir lightx2v/Wan2.1-Distill-Models
hf download lightx2v/Wan2.1-Distill-Loras --local-dir lightx2v/Wan2.1-Distill-Loras
```
We provide three ways to run the Wan2.1-14B models to generate videos:
1. Run the provided scripts (quick verification).
- Single-GPU inference
- Single-GPU offload inference
- Multi-GPU parallel inference
2. Start a server and send requests (repeated inference / production).
- Single-GPU inference
- Single-GPU offload inference
- Multi-GPU parallel inference
3. Use Python code (integration into codebases).
- Single-GPU inference
- Single-GPU offload inference
- Multi-GPU parallel inference
### 1. Run scripts
```bash
git clone https://github.com/ModelTC/LightX2V.git
# Before running the scripts, replace `lightx2v_path` and `model_path` with real paths
# e.g.: lightx2v_path=/home/user/LightX2V
# e.g.: model_path=/home/user/models/Wan-AI/Wan2.1-T2V-14B
```
#### 1.1 Single-GPU inference
Wan2.1-T2V-14B model:
```bash
# model_path=Wan-AI/Wan2.1-T2V-14B
cd LightX2V/scripts/wan
bash run_wan_t2v.sh
# Distillation (LoRA)
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_lora_4step_cfg.sh
# Distillation (merged LoRA)
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_model_4step_cfg.sh
# Distillation + FP8 quantized model
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_fp8_4step_cfg.sh
```
Note: In the bash scripts, `model_path` points to the pre-trained original model; in config files, set `lora_configs`, `dit_original_ckpt` and `dit_quantized_ckpt` to the distillation model paths (use absolute paths), for example `/home/user/models/lightx2v/Wan2.1-Distill-Models/wan2.1_i2v_480p_int8_lightx2v_4step.safetensors`.
Measured on a single H100 (use `watch -n 1 nvidia-smi` to observe peak GPU memory):
- Wan2.1-T2V-14B: Total Cost 278.902019 seconds; peak 43768 MiB
- Distill (LoRA): Total Cost 31.365923 seconds; peak 44438 MiB
- Distill (merged LoRA): Total Cost 25.794410 seconds; peak 44418 MiB
- Distill + FP8: Total Cost 22.000187 seconds; peak 31032 MiB
Wan2.1-I2V-14B models:
```bash
# Switch `model_path` and `config_json` to try Wan2.1-I2V-14B-480P or Wan2.1-I2V-14B-720P
cd LightX2V/scripts/wan
bash run_wan_i2v.sh
# Distillation (LoRA)
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_lora_4step_cfg.sh
# Distillation (merged LoRA)
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_model_4step_cfg.sh
# Distillation + FP8
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_fp8_4step_cfg.sh
```
Measured on a single H100:
- Wan2.1-I2V-14B-480P: Total Cost 232.971375 seconds; peak 49872 MiB
- Distill (LoRA): Total Cost 277.535991 seconds; peak 49782 MiB
- Distill (merged LoRA): Total Cost 26.841140 seconds; peak 49526 MiB
- Distill + FP8: Total Cost 25.430433 seconds; peak 34218 MiB
#### 1.2 Single-GPU offload inference
Enable offload in the config:
```json
"cpu_offload": true,
"offload_granularity": "model"
```
Then run the same scripts as in 1.1:
```bash
cd LightX2V/scripts/wan
bash run_wan_t2v.sh
# Distillation (LoRA)
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_lora_4step_cfg.sh
# Distillation (merged LoRA)
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_model_4step_cfg.sh
# Distillation + FP8
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_fp8_4step_cfg.sh
```
Measured on a single H100:
- Wan2.1-T2V-14B: Total Cost 319.019743 seconds; peak 34932 MiB
- Distill (LoRA): Total Cost 74.180393 seconds; peak 34562 MiB
- Distill (merged LoRA): Total Cost 68.621963 seconds; peak 34562 MiB
- Distill + FP8: Total Cost 58.921504 seconds; peak 21290 MiB
Wan2.1-I2V-14B measured on single H100:
- Wan2.1-I2V-14B-480P: Total Cost 276.509557 seconds; peak 38906 MiB
- Distill (LoRA): Total Cost 85.217124 seconds; peak 38556 MiB
- Distill (merged LoRA): Total Cost 79.389818 seconds; peak 38556 MiB
- Distill + FP8: Total Cost 68.124415 seconds; peak 23400 MiB
#### 1.3 Multi-GPU parallel inference
Before running, set `CUDA_VISIBLE_DEVICES` to the GPUs you will use and configure the `parallel` parameters so that `cfg_p_size * seq_p_size = number_of_GPUs`.
Wan2.1-T2V-14B (example):
```bash
cd LightX2V/scripts/dist_infer
bash run_wan_t2v_dist_cfg_ulysses.sh
# Distillation (LoRA)
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_lora_4step_cfg_ulysses.sh
# Distillation (merged LoRA)
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_model_4step_cfg_ulysses.sh
# Distillation + FP8
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_fp8_4step_cfg_ulysses.sh
```
Measured on 8×H100 (per-GPU peaks):
- Wan2.1-T2V-14B: Total Cost 131.553567 seconds; per-GPU peak 44624 MiB
- Distill (LoRA): Total Cost 38.337339 seconds; per-GPU peak 43850 MiB
- Distill (merged LoRA): Total Cost 29.021527 seconds; per-GPU peak 43470 MiB
- Distill + FP8: Total Cost 26.409164 seconds; per-GPU peak 30162 MiB
Wan2.1-I2V-14B (example):
```bash
cd LightX2V/scripts/dist_infer
bash run_wan_i2v_dist_cfg_ulysses.sh
# Distillation (LoRA)
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_lora_4step_cfg_ulysses.sh
# Distillation (merged LoRA)
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_model_4step_cfg_ulysses.sh
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_fp8_4step_cfg_ulysses.sh
```
Measured on 8×H100:
- Wan2.1-I2V-14B-480P: Total Cost 116.455286 seconds; per-GPU peak 49668 MiB
- Distill (LoRA): Total Cost 45.899316 seconds; per-GPU peak 48854 MiB
- Distill (merged LoRA): Total Cost 33.472992 seconds; per-GPU peak 48674 MiB
- Distill + FP8: Total Cost 30.796211 seconds; per-GPU peak 33328 MiB
Explanation and example scripts
`run_wan_t2v_dist_cfg_ulysses.sh`:
```bash
#!/bin/bash
# set path firstly
lightx2v_path=
model_path=
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# set environment variables
source ${lightx2v_path}/scripts/base/base.sh
torchrun --nproc_per_node=8 -m lightx2v.infer \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/dist_infer/wan_t2v_dist_cfg_ulysses.json \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--negative_prompt "camera shake, vivid color tones, overexposure, static, blurred details, subtitles, style marks, artwork, painting-like, still image, overall grayish, worst quality, low quality, JPEG compression artifacts, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered background, three legs, many people in background, walking backwards" \
--save_result_path ${lightx2v_path}/save_results/output_lightx2v_wan_t2v.mp4
```
`export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` uses GPUs 0–7 (eight GPUs total).
`source ${lightx2v_path}/scripts/base/base.sh` sets base environment variables.
`torchrun --nproc_per_node=8 -m lightx2v.infer` runs multi-GPU inference with 8 processes.
`wan_t2v_dist_cfg_ulysses.json`:
```json
{
"infer_steps": 50,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 6,
"sample_shift": 8,
"enable_cfg": true,
"cpu_offload": false,
"parallel": {
"seq_p_size": 4,
"seq_p_attn_type": "ulysses",
"cfg_p_size": 2
}
}
```
Key fields:
- `infer_steps`: number of inference steps.
- `target_video_length`: target frame count (Wan2.1 uses fps=16, so 81 frames ≈ 5 seconds).
- `target_height` / `target_width`: frame dimensions.
- `self_attn_1_type`, `cross_attn_1_type`, `cross_attn_2_type`: attention operator types; `flash_attn3` is for Hopper GPUs (H100, H20); replace with `flash_attn2` for other GPUs.
- `enable_cfg`: if true, CFG runs both positive and negative prompts (better quality but doubles inference time). Set false for CFG-distilled models.
- `cpu_offload`: enable CPU offload to reduce GPU memory. If enabled, add `"offload_granularity": "model"` to offload entire model modules. Monitor with `watch -n 1 nvidia-smi`.
- `parallel`: parallel inference settings. DiT supports Ulysses and Ring attention modes and CFG parallelism. Parallel inference reduces runtime and per-GPU memory. The example uses cfg + Ulysses with `seq_p_size * cfg_p_size = 8` for 8 GPUs.
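A quick sanity check for this constraint can be done with a few lines of Python (a sketch, not part of LightX2V):
```python
import os

# The number of torchrun processes must equal seq_p_size * cfg_p_size and should
# match the GPUs exposed via CUDA_VISIBLE_DEVICES.
parallel = {"seq_p_size": 4, "seq_p_attn_type": "ulysses", "cfg_p_size": 2}
n_gpus = len(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(","))
assert parallel["seq_p_size"] * parallel["cfg_p_size"] == n_gpus, (
    "cfg_p_size * seq_p_size must equal the number of GPUs / torchrun processes"
)
print(f"OK: {n_gpus} GPUs, seq_p_size * cfg_p_size = {parallel['seq_p_size'] * parallel['cfg_p_size']}")
```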
`wan_t2v_distill_lora_4step_cfg_ulysses.json`:
```json
{
"infer_steps": 4,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 6,
"sample_shift": 5,
"enable_cfg": false,
"cpu_offload": false,
"denoising_step_list": [1000, 750, 500, 250],
"lora_configs": [
{
"path": "lightx2v/Wan2.1-Distill-Loras/wan2.1_t2v_14b_lora_rank64_lightx2v_4step.safetensors",
"strength": 1.0
}
],
"parallel": {
"seq_p_size": 4,
"seq_p_attn_type": "ulysses",
"cfg_p_size": 2
}
}
```
- `denoising_step_list`: timesteps for the 4-step denoising schedule.
- `lora_configs`: LoRA plugin config; use absolute paths.
`wan_t2v_distill_model_4step_cfg_ulysses.json`:
```json
{
"infer_steps": 4,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 6,
"sample_shift": 5,
"enable_cfg": false,
"cpu_offload": false,
"denoising_step_list": [1000, 750, 500, 250],
"dit_original_ckpt": "lightx2v/Wan2.1-Distill-Models/wan2.1_t2v_14b_lightx2v_4step.safetensors",
"parallel": {
"seq_p_size": 4,
"seq_p_attn_type": "ulysses",
"cfg_p_size": 2
}
}
```
- `dit_original_ckpt`: path to the merged distillation checkpoint.
`wan_t2v_distill_fp8_4step_cfg_ulysses.json`:
```json
{
"infer_steps": 4,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 6,
"sample_shift": 5,
"enable_cfg": false,
"cpu_offload": false,
"denoising_step_list": [1000, 750, 500, 250],
"dit_quantized": true,
"dit_quantized_ckpt": "lightx2v/Wan2.1-Distill-Models/wan2.1_t2v_14b_scaled_fp8_e4m3_lightx2v_4step.safetensors",
"dit_quant_scheme": "fp8-sgl",
"parallel": {
"seq_p_size": 4,
"seq_p_attn_type": "ulysses",
"cfg_p_size": 2
}
}
```
- `dit_quantized`: enable DIT quantization for the core model.
- `dit_quantized_ckpt`: local path to FP8-quantized DIT weights.
- `dit_quant_scheme`: quantization scheme, e.g., `fp8-sgl`.
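If you work from Python instead of the bash scripts, the same three fields can in principle be supplied through `LightX2VPipeline.enable_quantize`, as demonstrated for Qwen Image earlier in this guide; whether wan2.1 accepts exactly these arguments is an assumption here, so check LightX2V/examples/wan for the supported usage.
```python
from lightx2v import LightX2VPipeline

# Sketch only (assumption): reusing the enable_quantize call from the Qwen Image
# example for the Wan2.1 FP8 distilled checkpoint. Paths are placeholders.
pipe = LightX2VPipeline(
    model_path="/path/to/Wan-AI/Wan2.1-T2V-14B",
    model_cls="wan2.1",
    task="t2v",
)
pipe.enable_quantize(
    dit_quantized=True,
    dit_quantized_ckpt="/path/to/Wan2.1-Distill-Models/wan2.1_t2v_14b_scaled_fp8_e4m3_lightx2v_4step.safetensors",
    quant_scheme="fp8-sgl",
)
# Generator creation and pipe.generate(...) then follow the wan_t2v.py example in Section 3 below.
```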
### 2. Start server mode
#### 2.1 Single-GPU inference
Start the server:
```bash
cd LightX2V/scripts/server
# Before running, set `lightx2v_path`, `model_path`, and `config_json` appropriately
# e.g.: lightx2v_path=/home/user/LightX2V
# e.g.: model_path=/home/user/models/Wan-AI/Wan2.1-T2V-14B
# e.g.: config_json ${lightx2v_path}/configs/wan/wan_t2v.json
bash start_server.sh
```
Send a request from a client terminal:
```bash
cd LightX2V/scripts/server
# Video endpoint:
python post.py
```
Server-side logs will show inference progress.
#### 2.2 Single-GPU offload inference
Enable offload in the config (see earlier snippet) and restart the server:
```bash
cd LightX2V/scripts/server
bash start_server.sh
```
Client request:
```bash
cd LightX2V/scripts/server
python post.py
```
#### 2.3 Multi-GPU parallel inference
Start the multi-GPU server:
```bash
cd LightX2V/scripts/server
bash start_server_cfg_ulysses.sh
```
Client request:
```bash
cd LightX2V/scripts/server
python post.py
```
Measured runtimes and per-GPU peaks:
1. Single-GPU inference: Run DiT cost 261.699812 seconds; RUN pipeline cost 261.973479 seconds; peak 43968 MiB
2. Single-GPU offload: Run DiT cost 264.445139 seconds; RUN pipeline cost 265.565198 seconds; peak 34932 MiB
3. Multi-GPU parallel: Run DiT cost 109.518894 seconds; RUN pipeline cost 110.085543 seconds; per-GPU peak 44624 MiB
`start_server.sh` example:
```bash
#!/bin/bash
# set path firstly
lightx2v_path=
model_path=
export CUDA_VISIBLE_DEVICES=0
# set environment variables
source ${lightx2v_path}/scripts/base/base.sh
# Start API server with distributed inference service
python -m lightx2v.server \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/wan/wan_t2v.json \
--host 0.0.0.0 \
--port 8000
echo "Service stopped"
```
- `--host 0.0.0.0` and `--port 8000` bind the service to port 8000 on all interfaces.
`post.py` example:
```python
import requests
from loguru import logger
if __name__ == "__main__":
    url = "http://localhost:8000/v1/tasks/video/"
    message = {
        "prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
        "negative_prompt": "camera shake, vivid color tones, overexposure, static, blurred details, subtitles, style marks, artwork, painting-like, still image, overall grayish, worst quality, low quality, JPEG compression artifacts, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered background, three legs, many people in background, walking backwards",
        "image_path": "",
        "seed": 42,
        "save_result_path": "./cat_boxing_seed42.mp4",
    }
    logger.info(f"message: {message}")
    response = requests.post(url, json=message)
    logger.info(f"response: {response.json()}")
```
- `url = "http://localhost:8000/v1/tasks/video/"` posts a video generation task. For image tasks use `http://localhost:8000/v1/tasks/image/`.
- `message` fields: if `seed` is omitted a random seed is used; if `save_result_path` is omitted the server will save the result with the task ID as filename.
### 3. Generate via Python code
#### 3.1 Single-GPU inference
```bash
cd LightX2V/examples/wan
# Edit `wan_t2v.py` to set `model_path`, `save_result_path`, and `config_json`
PYTHONPATH=/home/user/LightX2V python wan_t2v.py
```
Notes:
1. Prefer passing `config_json` to align hyperparameters with script/server runs.
2. `PYTHONPATH` must be an absolute path.
#### 3.2 Single-GPU offload inference
Enable offload in the config, then:
```bash
cd LightX2V/examples/wan
PYTHONPATH=/home/user/LightX2V python wan_t2v.py
```
#### 3.3 Multi-GPU parallel inference
Edit `wan_t2v.py` to use `LightX2V/configs/dist_infer/wan_t2v_dist_cfg_ulysses.json` and run:
```bash
PROFILING_DEBUG_LEVEL=2 PYTHONPATH=/home/user/LightX2V torchrun --nproc_per_node=8 wan_t2v.py
```
Measured runtimes and per-GPU peaks:
- Single-GPU: Run DiT cost 262.745393 seconds; RUN pipeline cost 263.279303 seconds; peak 44792 MiB
- Single-GPU offload: Run DiT cost 263.725956 seconds; RUN pipeline cost 264.919227 seconds; peak 34936 MiB
- Multi-GPU parallel: Run DiT cost 113.736238 seconds; RUN pipeline cost 114.297859 seconds; per-GPU peak 44624 MiB
Example `wan_t2v.py`:
```python
"""
Wan2.1 text-to-video generation example.
This example demonstrates how to use LightX2V with Wan2.1 model for T2V generation.
"""
from lightx2v import LightX2VPipeline
# Initialize pipeline for Wan2.1 T2V task
pipe = LightX2VPipeline(
model_path="/path/to/Wan2.1-T2V-14B",
model_cls="wan2.1",
task="t2v",
)
# Alternative: create generator from config JSON file
# pipe.create_generator(config_json="../configs/wan/wan_t2v.json")
# Create generator with specified parameters
pipe.create_generator(
attn_mode="sage_attn2",
infer_steps=50,
height=480, # Can be set to 720 for higher resolution
width=832, # Can be set to 1280 for higher resolution
num_frames=81,
guidance_scale=5.0,
sample_shift=5.0,
)
seed = 42
prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "camera shake, vivid color tones, overexposure, static, blurred details, subtitles, style marks, artwork, painting-like, still image, overall grayish, worst quality, low quality, JPEG compression artifacts, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered background, three legs, many people in background, walking backwards"
save_result_path = "/path/to/save_results/output.mp4"
pipe.generate(
seed=seed,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)
```
Notes:
1. Update `model_path` and `save_result_path` to actual paths.
2. Prefer passing `config_json` for parameter alignment with script/server runs.
# BeginnerGuide
Here, we’ll guide users step by step in using LightX2V, starting from real tasks and practical examples.
在这里,我们会从具体的任务角度出发,用实际的例子,手把手指导用户去使用LightX2V。
# 准备环境
在实际运行之前,我们需要安装LightX2V的运行环境。
## 方式1:Linux+docker环境
我们强烈推荐在docker环境中,运行LightX2V,这样可以和我们的运行环境保持一致,最大可能减少问题的发生。
可以参考[quickstart](https://lightx2v-zhcn.readthedocs.io/zh-cn/latest/getting_started/quickstart.html)
我们提供的docker镜像,已经安装了所有必备的依赖。
## 方式2:Linux+自己搭建环境
即将更新
## 方式3:Windows+WSL2
即将更新
## 方式4:Windows+安装包
即将更新
# 从Wan2.1-T2V-1.3B开始整个LightX2V项目
我们推荐从Wan2.1-T2V-1.3B开始整个LightX2V项目,不管你是想使用什么模型,我们都建议先看一下这个文档,了解整个LightX2V的运行流程。
## 准备环境
请参考[01.PrepareEnv](01.PrepareEnv.md)
## 开始运行
准备模型(huggingface和modelscope,任选其一下载)
```
# download from huggingface
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir Wan-AI/Wan2.1-T2V-1.3B
# download from modelscope
modelscope download --model Wan-AI/Wan2.1-T2V-1.3B --local_dir Wan-AI/Wan2.1-T2V-1.3B
```
我们提供三种方式,来运行Wan2.1-T2V-1.3B模型生成视频:
1. 运行脚本生成视频: 预设的bash脚本,可以直接运行,便于快速验证
2. 启动服务生成视频: 先启动服务,再发请求,适合多次推理和实际的线上部署
3. python代码生成视频: 用python代码运行,便于集成到已有的代码环境中
### 运行脚本生成视频
```
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V/scripts/wan
# 运行下面的脚本之前,需要将脚本中的lightx2v_path和model_path替换为实际路径
# 例如:lightx2v_path=/home/user/LightX2V
# 例如:model_path=/home/user/models/Wan-AI/Wan2.1-T2V-1.3B
bash run_wan_t2v.sh
```
解释细节
run_wan_t2v.sh脚本内容如下:
```
#!/bin/bash
# set path firstly
lightx2v_path=
model_path=
export CUDA_VISIBLE_DEVICES=0
# set environment variables
source ${lightx2v_path}/scripts/base/base.sh
python -m lightx2v.infer \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/wan/wan_t2v.json \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--negative_prompt "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" \
--save_result_path ${lightx2v_path}/save_results/output_lightx2v_wan_t2v.mp4
```
`export CUDA_VISIBLE_DEVICES=0` 表示使用0号显卡
`source ${lightx2v_path}/scripts/base/base.sh` 设置一些基础的环境变量
`--model_cls wan2.1` 表示使用wan2.1模型
`--task t2v` 表示使用t2v任务
`--model_path` 表示模型的路径
`--config_json` 表示配置文件的路径
`--prompt` 表示提示词
`--negative_prompt` 表示负向提示词
`--save_result_path` 表示保存结果的路径
由于不同的模型都有其各自的特性,所以`config_json`文件中会存有对应模型的更多细节的配置参数,不同模型的`config_json`文件内容有所不同,
wan_t2v.json文件内容如下:
```
{
"infer_steps": 50,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 6,
"sample_shift": 8,
"enable_cfg": true,
"cpu_offload": false
}
```
其中一些重要的配置参数说明:
`infer_steps` 表示推理的步数
`target_video_length` 表示目标视频的帧数(对于wan2.1模型来说,fps=16,所以target_video_length=81,表示视频时长为5秒)
`target_height` 表示目标视频的高度
`target_width` 表示目标视频的宽度
`self_attn_1_type`, `cross_attn_1_type`, `cross_attn_2_type` 表示wan2.1模型内部的三个注意力层的算子的类型,这里使用flash_attn3,仅限于Hopper架构的显卡(H100, H20等),其他显卡可以使用flash_attn2进行替代
`enable_cfg` 表示是否启用cfg,这里设置为true,表示会推理两次,第一次使用正向提示词,第二次使用负向提示词,这样可以得到更好的效果,但是会增加推理时间,如果是已经做了CFG蒸馏的模型,这里就可以设置为false
`cpu_offload` 表示是否启用cpu offload,这里设置为false,表示不启用cpu offload,Wan2.1-T2V-1.3B模型的显存在480*832的生成分辨率下,消耗显存约21GB,如果显存不足,则需要开启cpu_offload。
上述wan_t2v.json文件,可以作为H100,H200,H20的标准配置文件,对于A100-80G, 4090-24G和5090-32G等显卡,把flash_attn3替换为flash_attn2
### 启动服务生成视频
在实际部署中,我们往往是启动一个服务,用户发送请求进行生成任务。
启动服务
```
cd LightX2V/scripts/server
# 运行下面的脚本之前,需要将脚本中的lightx2v_path和model_path替换为实际路径
# 例如:lightx2v_path=/home/user/LightX2V
# 例如:model_path=/home/user/models/Wan-AI/Wan2.1-T2V-1.3B
bash start_server.sh
```
start_server.sh脚本内容如下
```
#!/bin/bash
# set path firstly
lightx2v_path=
model_path=
export CUDA_VISIBLE_DEVICES=0
# set environment variables
source ${lightx2v_path}/scripts/base/base.sh
# Start API server with distributed inference service
python -m lightx2v.server \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/wan/wan_t2v.json \
--host 0.0.0.0 \
--port 8000
```
`--host 0.0.0.0``--port 8000`,表示服务起在本机ip的8000端口上
`--config_json`和前面脚本推理所用的配置文件保持一致
向服务端发送请求
此处需要打开第二个终端作为用户
```
cd LightX2V/scripts/server
python post.py
```
发送完请求后,可以在服务端看到推理的日志
post.py脚本内容如下
```
import requests
from loguru import logger
if __name__ == "__main__":
url = "http://localhost:8000/v1/tasks/video/"
message = {
"prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
"negative_prompt": "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
"image_path": "",
"seed": 42,
"save_result_path": "./cat_boxing_seed42.mp4"
}
logger.info(f"message: {message}")
response = requests.post(url, json=message)
logger.info(f"response: {response.json()}")
```
url = "http://localhost:8000/v1/tasks/video/" 表示向本机ip的8000端口上,发送一个视频生成任务
如果是图像生成任务,url就是
url = "http://localhost:8000/v1/tasks/image/"
message字典表示向服务端发送的请求的内容,其中`seed`若不指定,每次发送请求会随机生成一个`seed``save_result_path`若不指定也会生成一个和任务id一致命名的文件
### python代码生成视频
新建pytest.py文件,运行前需设置环境变量
```
#例如:
cd /pytest_path
export PYTHONPATH=/home/user/LightX2V  # 需替换为LightX2V的绝对路径
#再运行代码
python pytest.py
```
pytest.py脚本内容如下
```
from lightx2v import LightX2VPipeline
# 步骤1: 创建LightX2VPipeline
pipe = LightX2VPipeline(
model_path="/data/nvme0/models/Wan-AI/Wan2.1-T2V-1.3B",
model_cls="wan2.1",
task="t2v",
)
# 步骤2: 设置运行中的参数
# 可以通过传入config的方式,设置运行中的参数
# 也可以通过函数参数传入的方式,设置运行中的参数
# 二者只能选其一,不可同时使用
# 方式1: 传入config文件路径 (create_generator的方式1和方式2一次只能选择一个使用!)
# pipe.create_generator(config_json="path_to_config/wan_t2v.json")
# 方式2: 函数参数传入 (create_generator的方式1和方式2一次只能选择一个使用!)
pipe.create_generator(
attn_mode="sage_attn2",
infer_steps=50,
height=480, # Can be set to 720 for higher resolution
width=832, # Can be set to 1280 for higher resolution
num_frames=81,
guidance_scale=5.0,
sample_shift=5.0,
)
# 步骤3: 开始生成视频,可以多次生成
# 第一个生成case
seed = 42
prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
save_result_path = "./cat_boxing_seed42.mp4"
pipe.generate(
seed=seed,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)
# 第二个生成case
seed = 1000
prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
save_result_path = "./cat_boxing_seed1000.mp4"
pipe.generate(
seed=seed,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)
```
注意1:步骤2设置运行中的参数中,推荐使用传入config_json的方式,用来和前面的运行脚本生成视频和启动服务生成视频进行超参数对齐
注意2:前面的运行脚本生成视频会额外设置一些环境变量,相关变量在[这里](https://github.com/ModelTC/LightX2V/blob/main/scripts/base/base.sh),其中`export PROFILING_DEBUG_LEVEL=2`可以把推理耗时日志打开,为了完全对齐,可以在运行上面的python代码之前,把这些环境变量先设置好。
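下面是一段示意代码(仅示例 `PROFILING_DEBUG_LEVEL`,其余变量请对照 base.sh 自行补充),演示在 Python 进程内提前设置这些环境变量:
```python
# 示意代码:在导入 lightx2v 之前设置 base.sh 中的环境变量
import os

os.environ["PROFILING_DEBUG_LEVEL"] = "2"  # 打开推理耗时日志

from lightx2v import LightX2VPipeline  # 之后按上文 pytest.py 的方式创建 pipeline 即可
```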
# 从Qwen Image体验T2I与I2I
本文档包含 Qwen Image 和 Qwen Image Edit 模型的使用示例。
其中文生图模型使用的是Qwen-Image-2512,图像编辑模型使用的是Qwen-Image-Edit-2511,都为目前最新的模型。
## 准备环境
请参考[01.PrepareEnv](01.PrepareEnv.md)
## 开始运行
准备模型
```
# 从huggingface下载
# 推理2512文生图原始模型
hf download Qwen/Qwen-Image-2512 --local-dir Qwen/Qwen-Image-2512
# 推理2512文生图步数蒸馏模型
hf download lightx2v/Qwen-Image-2512-Lightning --local-dir Qwen/Qwen-Image-2512-Lightning
# 推理2511图像编辑原始模型
hf download Qwen/Qwen-Image-Edit-2511 --local-dir Qwen/Qwen-Image-2511
# 推理2511图像编辑步数蒸馏模型
hf download lightx2v/Qwen-Image-Edit-2511-Lightning --local-dir Qwen/Qwen-Image-2511-Lightning
```
我们提供三种方式,来运行 Qwen Image 模型生成图片:
1. 运行脚本生成图片: 预设的bash脚本,可以直接运行,便于快速验证
2. 启动服务生成图片: 先启动服务,再发请求,适合多次推理和实际的线上部署
3. python代码生成图片: 用python代码运行,便于集成到已有的代码环境中
### 运行脚本生成图片
```
git clone https://github.com/ModelTC/LightX2V.git
cd LightX2V/scripts/qwen_image
# 运行下面的脚本之前,需要将脚本中的lightx2v_path和model_path替换为实际路径
# 例如:lightx2v_path=/home/user/LightX2V
# 例如:model_path=/home/user/models/Qwen/Qwen-Image-2511
```
文生图模型
```
# 推理2512文生图原始模型,默认是50步
bash qwen_image_t2i_2512.sh
# 推理2512文生图步数蒸馏模型,默认是8步,需要修改config_json文件中的lora_configs的路径
bash qwen_image_t2i_2512_distill.sh
# 推理2512文生图步数蒸馏+FP8量化模型,默认是8步,需要修改config_json文件中的dit_quantized_ckpt的路径
bash qwen_image_t2i_2512_distill_fp8.sh
```
注意1:在qwen_image_t2i_2512_distill.sh、qwen_image_t2i_2512_distill_fp8.sh脚本中,model_path与qwen_image_t2i_2512.sh保持一致,都为Qwen-Image-2512模型的本地路径
注意2:需要修改的config_json文件在LightX2V/configs/qwen_image中,lora_configs、dit_quantized_ckpt分别为所使用蒸馏模型的本地路径
图像编辑模型
```
# 推理2511图像编辑原始模型,默认是40步
bash qwen_image_i2i_2511.sh
# 推理2511图像编辑步数蒸馏模型,默认是8步,需要修改config_json文件中的lora_configs的路径
bash qwen_image_i2i_2511_distill.sh
# 推理2511图像编辑步数蒸馏+FP8量化模型,默认是8步,需要修改config_json文件中的dit_quantized_ckpt的路径
bash qwen_image_i2i_2511_distill_fp8.sh
```
注意1:bash脚本的model_path都为Qwen-Image-2511路径,config_json文件中需要修改的路径分别为所使用蒸馏模型的路径
注意2:需要修改bash脚本中的图片路径image_path,可以传入你自己的图片来测试模型
解释细节
qwen_image_t2i_2512.sh脚本内容如下:
```
#!/bin/bash
# set path firstly
lightx2v_path=
model_path=
export CUDA_VISIBLE_DEVICES=0
# set environment variables
source ${lightx2v_path}/scripts/base/base.sh
python -m lightx2v.infer \
--model_cls qwen_image \
--task t2i \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/qwen_image/qwen_image_t2i_2512.json \
--prompt 'A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197". Ultra HD, 4K, cinematic composition, Ultra HD, 4K, cinematic composition.' \
--negative_prompt " " \
--save_result_path ${lightx2v_path}/save_results/qwen_image_t2i_2512.png \
--seed 42
```
`source ${lightx2v_path}/scripts/base/base.sh` 设置一些基础的环境变量
`--model_cls qwen_image` 表示使用qwen_image模型
`--task t2i` 表示使用t2i任务
`--model_path` 表示模型的路径
`--config_json` 表示配置文件的路径
`--prompt` 表示提示词
`--negative_prompt` 表示负向提示词
qwen_image_t2i_2512.json内容如下
```
{
"infer_steps": 50,
"aspect_ratio": "16:9",
"prompt_template_encode": "<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
"prompt_template_encode_start_idx": 34,
"attn_type": "flash_attn3",
"enable_cfg": true,
"sample_guide_scale": 4.0
}
```
`infer_steps` 表示推理的步数
`aspect_ratio` 表示目标图片的宽高比
`prompt_template_encode` 表示提示词编码的模板
`prompt_template_encode_start_idx` 表示提示词模板的有效起始索引
`attn_type` 表示模型内部的注意力层算子的类型,这里使用flash_attn3,仅限于Hopper架构的显卡(H100, H20等),其他显卡可以使用flash_attn2进行替代
`enable_cfg` 表示是否启用cfg,这里设置为true,表示会推理两次,第一次使用正向提示词,第二次使用负向提示词,这样可以得到更好的效果,但是会增加推理时间
`sample_guide_scale` 表示 CFG 引导强度,控制 CFG 的作用力度
qwen_image_t2i_2512_distill.json内容如下:
```
{
"infer_steps": 8,
"aspect_ratio": "16:9",
"prompt_template_encode": "<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
"prompt_template_encode_start_idx": 34,
"attn_type": "flash_attn3",
"enable_cfg": false,
"sample_guide_scale": 4.0,
"lora_configs": [
{
"path": "lightx2v/Qwen-Image-2512-Lightning/Qwen-Image-2512-Lightning-8steps-V1.0-fp32.safetensors",
"strength": 1.0
}
]
}
```
`infer_steps` 表示推理的步数,这是蒸馏模型,推理步数蒸馏成8步
`enable_cfg` 表示是否启用cfg,已经做了CFG蒸馏的模型,设置为false
`lora_configs` 表示Lora权重配置,需修改路径为本地实际路径
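下面是一段示意脚本(配置文件路径与 LoRA 权重路径均为示例),演示如何用 Python 批量把 `lora_configs` 中的 path 改成本地实际路径:
```python
# 示意脚本:修改 qwen_image_t2i_2512_distill.json 中 lora_configs 的 path
import json

config_path = "configs/qwen_image/qwen_image_t2i_2512_distill.json"  # 示例路径
local_lora_path = "/path/to/Qwen-Image-2512-Lightning/Qwen-Image-2512-Lightning-8steps-V1.0-fp32.safetensors"  # 示例路径

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

for lora in config.get("lora_configs", []):
    lora["path"] = local_lora_path

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)
```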
qwen_image_t2i_2512_distill_fp8.json内容如下:
```
{
"infer_steps": 8,
"aspect_ratio": "16:9",
"prompt_template_encode": "<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
"prompt_template_encode_start_idx": 34,
"attn_type": "flash_attn3",
"enable_cfg": false,
"sample_guide_scale": 4.0,
"dit_quantized": true,
"dit_quantized_ckpt": "lightx2v/Qwen-Image-2512-Lightning/qwen_image_2512_fp8_e4m3fn_scaled_8steps_v1.0.safetensors",
"dit_quant_scheme": "fp8-sgl"
}
```
`dit_quantized` 表示是否启用 DIT 量化,设置为true表示对模型核心的 DIT 模块做量化处理
`dit_quantized_ckpt` 表示 DIT 量化权重路径,指定 FP8 量化后的 DIT 权重文件的本地路径
`dit_quant_scheme` 表示 DIT 量化方案,指定量化类型为 "fp8-sgl"(fp8-sgl表示使用sglang的fp8 kernel进行推理)
### 启动服务生成图片
启动服务
```
cd LightX2V/scripts/server
# 运行下面的脚本之前,需要将脚本中的lightx2v_path和model_path替换为实际路径
# 例如:lightx2v_path=/home/user/LightX2V
# 例如:model_path=/home/user/models/Qwen/Qwen-Image-2511
# 同时:config_json也需要配成对应的模型config路径
# 例如:config_json ${lightx2v_path}/configs/qwen_image/qwen_image_t2i_2512.json
bash start_server_t2i.sh
```
向服务端发送请求
此处需要打开第二个终端作为用户
```
cd LightX2V/scripts/server
# 运行post.py前,需要将脚本中的url修改为 url = "http://localhost:8000/v1/tasks/image/"
python post.py
```
发送完请求后,可以在服务端看到推理的日志
### python代码生成图片
运行步数蒸馏 + FP8 量化模型
运行 `qwen_2511_fp8.py` 脚本,该脚本使用步数蒸馏和 FP8 量化优化的模型:
```
cd examples/qwen_image/
# 运行前需设置环境变量
export PYTHONPATH=/home/user/LightX2V
# 运行前需修改脚本中的路径为实际路径,包括:model_path、dit_quantized_ckpt、image_path、save_result_path
python qwen_2511_fp8.py
```
该方式通过步数蒸馏技术减少推理步数,同时使用 FP8 量化降低模型大小和内存占用,实现更快的推理速度。
解释细节:
qwen_2511_fp8.py脚本内容如下:
```
"""
Qwen-image-edit image-to-image generation example.
This example demonstrates how to use LightX2V with Qwen-Image-Edit model for I2I generation.
"""
from lightx2v import LightX2VPipeline
# Initialize pipeline for Qwen-image-edit I2I task
# For Qwen-Image-Edit-2511, use model_cls="qwen-image-edit-2511"
pipe = LightX2VPipeline(
model_path="/path/to/Qwen-Image-Edit-2511",
model_cls="qwen-image-edit-2511",
task="i2i",
)
# Alternative: create generator from config JSON file
# pipe.create_generator(
# config_json="../configs/qwen_image/qwen_image_i2i_2511_distill_fp8.json"
# )
# Enable offloading to significantly reduce VRAM usage with minimal speed impact
# Suitable for RTX 30/40/50 consumer GPUs
# pipe.enable_offload(
# cpu_offload=True,
# offload_granularity="block", #["block", "phase"]
# text_encoder_offload=True,
# vae_offload=False,
# )
# Load fp8 distilled weights (and int4 Qwen2_5 vl model (optional))
pipe.enable_quantize(
dit_quantized=True,
dit_quantized_ckpt="lightx2v/Qwen-Image-Edit-2511-Lightning/qwen_image_edit_2511_fp8_e4m3fn_scaled_lightning_4steps_v1.0.safetensors",
quant_scheme="fp8-sgl",
# text_encoder_quantized=True,
# text_encoder_quantized_ckpt="lightx2v/Encoders/GPTQModel/Qwen25-VL-4bit-GPTQ",
# text_encoder_quant_scheme="int4"
)
# Create generator manually with specified parameters
pipe.create_generator(
attn_mode="flash_attn3",
resize_mode="adaptive",
infer_steps=8,
guidance_scale=1,
)
# Generation parameters
seed = 42
prompt = "Replace the polka-dot shirt with a light blue shirt."
negative_prompt = ""
image_path = "/path/to/img.png" # or "/path/to/img_0.jpg,/path/to/img_1.jpg"
save_result_path = "/path/to/save_results/output.png"
# Generate image (I2I task)
pipe.generate(
seed=seed,
image_path=image_path,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)
```
注意1:可以通过传入config的方式设置运行中的参数,也可以通过函数参数传入的方式设置,二者只能选其一,不可同时使用。脚本中使用的是函数参数传入,将传入config的部分注释掉了;推荐使用传入config的方式。对于A100-80G、4090-24G和5090-32G等显卡,把flash_attn3替换为flash_attn2。
注意2:RTX 30/40/50 GPUs可以启用 Offload 优化显存
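下面是一段示意代码(路径均为示例),演示在消费级显卡上为上述流程启用 offload;`enable_offload` 需在 `create_generator()` 之前调用:
```python
# 示意代码:RTX 30/40/50 等消费级显卡上启用 offload 以降低显存
from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="/path/to/Qwen-Image-Edit-2511",  # 示例路径
    model_cls="qwen-image-edit-2511",
    task="i2i",
)
pipe.enable_offload(
    cpu_offload=True,             # 启用 CPU 卸载
    offload_granularity="block",  # 卸载粒度:"block" 或 "phase"
    text_encoder_offload=True,    # 卸载文本编码器
    vae_offload=False,            # 不卸载 VAE
)
# 之后再调用 pipe.create_generator(...) 和 pipe.generate(...)
```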
运行 Qwen-Image-Edit-2511 模型 + 蒸馏 LoRA
运行 qwen_2511_with_distill_lora.py 脚本,该脚本使用 Qwen-Image-Edit-2511 基础模型配合蒸馏 LoRA:
```
cd examples/qwen_image/
# 运行前需修改脚本中的路径为实际路径,包括:model_path、pipe.enable_lora中的path、image_path、save_result_path
python qwen_2511_with_distill_lora.py
```
该方式使用完整的 Qwen-Image-Edit-2511 模型,并通过蒸馏 LoRA 进行模型优化,在保持模型性能的同时提升推理效率。
# Text Encoder 分离部署/优化指南 (Advanced Guide)
对于大规模模型推理,Text Encoder 往往占据显存且计算相对独立。LightX2V 提供了两种先进的 Text Encoder 优化方案:**Service Mode (分离部署)** 和 **Kernel Mode (内核优化)**。
这两种方案目前针对 **Qwen-Image** 系列模型的 Text Encoder 进行过深度优化,显著降低了显存占用并提升了推理吞吐量。
## 方案对比
| 特性 | **Baseline (原始 huggingface)** | **Service Mode (分离部署)** | **Kernel Mode (内核优化)** |
| :--- | :--- | :--- | :--- |
| **部署架构** | 与主模型在同一进程 | 独立服务,通过 HTTP/SHM 通信 | 与主模型在同一进程 |
| **显存占用** | 高 (加载完整 HF 模型) | **极低** (Client 端不加载权重) | **中** (加载精简模型 + Kernel) |
| **跨请求复用** | 无 | **支持** (多客户端共享一个 Encoder) | 无 |
| **通信开销** | 无 | 有 (HTTP/SharedMemory) | 无 |
| **推理速度** | 慢 (标准 Layer) | **极快** (LightLLM 后端加速) | **快** (集成 LightLLM Kernel) |
| **适用场景** | 快速验证、小显存单卡 | **多卡/多机生产环境**、DiT 显存瓶颈 | **高性能单机推理**、追求极限速度 |
详细性能数据可参考: [Performance Benchmark](https://github.com/ModelTC/LightX2V/pull/829)
---
## 1. Service Mode (分离部署模式)
Service Mode 将 Text Encoder 作为一个独立的服务启动,基于高性能 LLM 推理框架 **LightLLM**。主模型 (LightX2V Client) 通过 API 请求获取 hidden states。
### 1.1 环境准备
Text Encoder 服务端需要使用 **LightLLM** 框架。
**服务端安装步骤:**
1. 拉取 LightLLM 代码 (指定 `return_hiddens` 分支)
```bash
git clone git@github.com:ModelTC/LightLLM.git -b return_hiddens
cd LightLLM
```
2. 配置环境
请参考 LightLLM 官方文档配置 Python 环境 (通常需要 PyTorch, CUDA, Triton 等)。
*注意:确保服务端环境支持 FlashAttention 以获得最佳性能。*
### 1.2 启动 Text Encoder 服务
使用 `lightllm.server.api_server` 启动服务。
**编写启动脚本 `start_encoder_service.sh` (参考示例):**
```bash
#!/bin/bash
# 显卡设置 (例如使用独立的卡运行 Text Encoder)
export CUDA_VISIBLE_DEVICES=1
export LOADWORKER=18
# 指向 LightLLM 代码目录
# export PYTHONPATH=/path/to/LightLLM:$PYTHONPATH
# 模型相关路径 (需替换为实际路径)
MODEL_DIR="/path/to/models/Qwen-Image-Edit-2511/text_encoder"
TOKENIZER_DIR="/path/to/models/Qwen-Image-Edit-2511/tokenizer"
PROCESSOR_DIR="/path/to/models/Qwen-Image-Edit-2511/processor"
# 设置环境变量供 LightLLM 内部使用
export LIGHTLLM_TOKENIZER_DIR=$TOKENIZER_DIR
export LIGHTLLM_PROCESSOR_DIR=$PROCESSOR_DIR
export LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1
python -m lightllm.server.api_server \
--model_dir $MODEL_DIR \
--host 0.0.0.0 \
--port 8010 \
--tp 1 \
--enable_fa3 \
--return_input_hidden_states \
--enable_multimodal \
--disable_dynamic_prompt_cache
```
**关键参数说明:**
* `--return_input_hidden_states`: **必须开启**。让 LightLLM 返回 hidden states 而不是生成的 token,这是 Service Mode 的核心。
* `--enable_multimodal`: 开启多模态支持 (处理 Qwen 的 Vision Token)。
* `--port 8010`: 服务监听端口,需与 Client 端配置一致。
* `--tp 1`: Tensor Parallel 度,通常 Text Encoder 单卡即可部署。
* `--enable_fa3`: 启用 FlashAttention。
* `--disable_dynamic_prompt_cache`: 禁用动态 Prompt Cache。
启动服务:
```bash
bash start_encoder_service.sh
```
看到类似 "Uvicorn running on http://0.0.0.0:8010" 即表示启动成功。
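如果不确定服务是否就绪,可以用下面这段简单的示意脚本检查端口连通性(端口 8010 与上文配置对应):
```python
# 示意脚本:检查 Text Encoder 服务端口是否可达
import socket

host, port = "127.0.0.1", 8010
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(3)
    if s.connect_ex((host, port)) == 0:
        print(f"Text Encoder 服务已在 {host}:{port} 就绪")
    else:
        print(f"无法连接 {host}:{port},请检查服务是否启动成功")
```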
### 1.3 配置 LightX2V Client
在 LightX2V 端,只需修改 `config_json` 来启用 Service Mode。
**配置文件 (`configs/qwen_image/qwen_image_i2i_2511_service.json`):**
```json
{
"text_encoder_type": "lightllm_service",
"lightllm_config": {
"service_url": "http://localhost:8010",
"service_timeout": 30,
"service_retry": 3,
"use_shm": true
},
// ... 其他参数 (infer_steps, prompt_template 等) ...
}
```
**参数说明:**
* `text_encoder_type`: 设置为 **"lightllm_service"**
* `service_url`: Text Encoder 服务的地址。
* `use_shm`: **强烈推荐开启**
* `true`: 启用共享内存 (Shared Memory) 通信。如果 Client 和 Server 在同一台机器 (即使不同 Docker 容器,需挂载共享内存),数据传输将通过内存直读,**零拷贝,速度极快**
* `false`: 使用 HTTP 传输 base64 编码数据。适用于跨机部署。
**运行推理:**
编写运行脚本 (`scripts/qwen_image/qwen_image_i2i_2511_service.sh`):
```bash
python -m lightx2v.infer \
--model_cls qwen_image \
--task i2i \
--model_path /path/to/Qwen-Image-Edit-2511 \
--config_json configs/qwen_image/qwen_image_i2i_2511_service.json \
--prompt "Make the girl from Image 1 wear the black dress from Image 2..." \
--image_path "1.png,2.png,3.png" \
--save_result_path output.png
```
---
## 2. Kernel Mode (内核优化模式)
Kernel Mode 适合单机高性能推理场景。它不在后台启动独立服务,而是在进程内直接加载 Text Encoder,但**替换了 HuggingFace 原始的慢速算子**,集成了 LightLLM 的核心 Triton Kernel。
### 2.1 优势
* **无需独立服务**: 简化部署运维。
* **Triton 加速**: 使用高度优化的 FlashAttention 和 Fused RMSNorm Triton Kernel。
* **无通信开销**: 纯进程内内存操作。
### 2.2 配置方法
只需修改 `config_json` 启用 Kernel Mode。
**配置文件 (`configs/qwen_image/qwen_image_i2i_2511_kernel.json`):**
```json
{
"text_encoder_type": "lightllm_kernel",
"lightllm_config": {
"use_flash_attention_kernel": true,
"use_rmsnorm_kernel": true
},
// ... 其他参数 ...
}
```
**参数说明:**
* `text_encoder_type`: 设置为 **"lightllm_kernel"**
* `use_flash_attention_kernel`: 启用 FlashAttention 加速 Attention 层。默认情况下使用 flash_attention_2,也可以设置为 `"use_flash_attention_kernel": "flash_attention_3"`。
* `use_rmsnorm_kernel`: 启用 Fused RMSNorm Kernel (需安装 `sgl_kernel` 或相关依赖,如未安装会自动降级)。
**运行推理:**
编写运行脚本 (`scripts/qwen_image/qwen_image_i2i_2511_kernel.sh`):
```bash
python -m lightx2v.infer \
--model_cls qwen_image \
--task i2i \
--model_path /path/to/Qwen-Image-Edit-2511 \
--config_json configs/qwen_image/qwen_image_i2i_2511_kernel.json \
--prompt "..." \
--image_path "..." \
--save_result_path output.png
```
---
## 总结建议
* **开发调试**: 默认模式 (HuggingFace) 兼容性最好。
* **单机高性能**: 使用 **Kernel Mode**
* **多机/多卡/显存受限**: 使用 **Service Mode**。将 Text Encoder 部署在显存较小的卡上,主卡专注于 DiT 推理,并通过 Shared Memory 实现高效通信。
# 从Wan21-14B体验T2V和I2V
本文档包含 Wan2.1-T2V-14B 和 Wan2.1-I2V-14B-480P、Wan2.1-I2V-14B-720P 模型的使用示例。
## 准备环境
请参考[01.PrepareEnv](01.PrepareEnv.md)
## 开始运行
准备模型
```
# 从huggingface下载
hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan-AI/Wan2.1-T2V-14B
hf download Wan-AI/Wan2.1-I2V-14B-480P --local-dir Wan-AI/Wan2.1-I2V-14B-480P
hf download Wan-AI/Wan2.1-I2V-14B-720P --local-dir Wan-AI/Wan2.1-I2V-14B-720P
#下载蒸馏模型
hf download lightx2v/Wan2.1-Distill-Models --local-dir lightx2v/Wan2.1-Distill-Models
hf download lightx2v/Wan2.1-Distill-Loras --local-dir lightx2v/Wan2.1-Distill-Loras
```
我们提供三种方式,来运行Wan21-14B模型生成视频:
1. 运行脚本生成: 预设的bash脚本,可以直接运行,便于快速验证
1.1 单卡推理
1.2 单卡offload推理
1.3 多卡并行推理
2. 启动服务生成: 先启动服务,再发请求,适合多次推理和实际的线上部署
2.1 单卡推理
2.2 单卡offload推理
2.3 多卡并行推理
3. python代码生成: 用python代码运行,便于集成到已有的代码环境中
3.1 单卡推理
3.2 单卡offload推理
3.3 多卡并行推理
### 1. 运行脚本生成
```
git clone https://github.com/ModelTC/LightX2V.git
# 运行下面的脚本之前,需要将脚本中的lightx2v_path和model_path替换为实际路径
# 例如:lightx2v_path=/home/user/LightX2V
# 例如:model_path=/home/user/models/Wan-AI/Wan2.1-T2V-14B
```
#### 1.1 单卡推理
Wan2.1-T2V-14B模型
```
# model_path=Wan-AI/Wan2.1-T2V-14B
cd LightX2V/scripts/wan
bash run_wan_t2v.sh
# 步数蒸馏模型 Lora
# model_path=Wan-AI/Wan2.1-T2V-14B
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_lora_4step_cfg.sh
# 步数蒸馏模型 merge Lora
# model_path=Wan-AI/Wan2.1-T2V-14B
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_model_4step_cfg.sh
# 步数蒸馏+FP8量化模型
# model_path=Wan-AI/Wan2.1-T2V-14B
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_fp8_4step_cfg.sh
```
注意:bash脚本中的model_path为pre-train原模型的路径;config文件中的lora_configs、dit_original_ckpt和dit_quantized_ckpt为所使用的蒸馏模型路径,需要修改为绝对路径,例如:/home/user/models/lightx2v/Wan2.1-Distill-Models/wan2.1_i2v_480p_int8_lightx2v_4step.safetensors
使用单张H100,运行时间及使用`watch -n 1 nvidia-smi`观测的峰值显存测试如下:
1. Wan2.1-T2V-14B模型:Total Cost cost 278.902019 seconds;43768MiB
2. 步数蒸馏模型 Lora:Total Cost cost 31.365923 seconds;44438MiB
3. 步数蒸馏模型 merge Lora:Total Cost cost 25.794410 seconds;44418MiB
4. 步数蒸馏+FP8量化模型:Total Cost cost 22.000187 seconds;31032MiB
Wan2.1-I2V-14B模型
```
# 切换model_path与config_json体验Wan2.1-I2V-14B-480P与Wan2.1-I2V-14B-720P
cd LightX2V/scripts/wan
bash run_wan_i2v.sh
# 步数蒸馏模型 Lora
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_lora_4step_cfg.sh
# 步数蒸馏模型 merge Lora
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_model_4step_cfg.sh
# 步数蒸馏+FP8量化模型
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_fp8_4step_cfg.sh
```
使用单张H100,运行时间及观测的峰值显存测试如下:
1. Wan2.1-I2V-14B-480P模型:Total Cost cost 232.971375 seconds;49872MiB
2. 步数蒸馏模型 Lora:Total Cost cost 277.535991 seconds;49782MiB
3. 步数蒸馏模型 merge Lora:Total Cost cost 26.841140 seconds;49526MiB
4. 步数蒸馏+FP8量化模型:Total Cost cost 25.430433 seconds;34218MiB
#### 1.2 单卡offload推理
如下修改 config 文件中的 cpu_offload,开启offload
```
"cpu_offload": true,
"offload_granularity": "model"
```
Wan2.1-T2V-14B模型
```
cd LightX2V/scripts/wan
bash run_wan_t2v.sh
# 步数蒸馏模型 Lora
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_lora_4step_cfg.sh
# 步数蒸馏模型 merge Lora
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_model_4step_cfg.sh
# 步数蒸馏+FP8量化模型
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_fp8_4step_cfg.sh
```
使用单张H100,运行时间及观测的峰值显存测试如下:
1. Wan2.1-T2V-14B模型:Total Cost cost 319.019743 seconds;34932MiB
2. 步数蒸馏模型 Lora:Total Cost cost 74.180393 seconds;34562MiB
3. 步数蒸馏模型 merge Lora:Total Cost cost 68.621963 seconds;34562MiB
4. 步数蒸馏+FP8量化模型:Total Cost cost 58.921504 seconds;21290MiB
Wan2.1-I2V-14B模型
```
# 切换model_path与config_json体验Wan2.1-I2V-14B-480P与Wan2.1-I2V-14B-720P
cd LightX2V/scripts/wan
bash run_wan_i2v.sh
# 步数蒸馏模型 Lora
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_lora_4step_cfg.sh
# 步数蒸馏模型 merge Lora
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_model_4step_cfg.sh
# 步数蒸馏+FP8量化模型
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_fp8_4step_cfg.sh
```
使用单张H100,运行时间及观测的峰值显存测试如下:
1. Wan2.1-I2V-14B-480P模型:Total Cost cost 276.509557 seconds;38906MiB
2. 步数蒸馏模型 Lora:Total Cost cost 85.217124 seconds;38556MiB
3. 步数蒸馏模型 merge Lora:Total Cost cost 79.389818 seconds;38556MiB
4. 步数蒸馏+FP8量化模型:Total Cost cost 68.124415 seconds;23400MiB
#### 1.3 多卡并行推理
Wan2.1-T2V-14B模型
```
# 运行前需将CUDA_VISIBLE_DEVICES替换为实际用的GPU
# 同时config文件中的parallel参数也需对应修改,满足cfg_p_size * seq_p_size = GPU数目
cd LightX2V/scripts/dist_infer
bash run_wan_t2v_dist_cfg_ulysses.sh
# 步数蒸馏模型 Lora
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_lora_4step_cfg_ulysses.sh
# 步数蒸馏模型 merge Lora
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_model_4step_cfg_ulysses.sh
# 步数蒸馏+FP8量化模型
cd LightX2V/scripts/wan/distill
bash run_wan_t2v_distill_fp8_4step_cfg_ulysses.sh
```
使用8张H100,运行时间及观测的每张卡峰值显存测试如下:
1. Wan2.1-T2V-14B模型:Total Cost cost 131.553567 seconds;44624MiB
2. 步数蒸馏模型 Lora:Total Cost cost 38.337339 seconds;43850MiB
3. 步数蒸馏模型 merge Lora:Total Cost cost 29.021527 seconds;43470MiB
4. 步数蒸馏+FP8量化模型:Total Cost cost 26.409164 seconds;30162MiB
Wan2.1-I2V-14B模型
```
# 切换model_path与config_json体验Wan2.1-I2V-14B-480P与Wan2.1-I2V-14B-720P
cd LightX2V/scripts/dist_infer
bash run_wan_i2v_dist_cfg_ulysses.sh
# 步数蒸馏模型 Lora
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_lora_4step_cfg_ulysses.sh
# 步数蒸馏模型 merge Lora
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_model_4step_cfg_ulysses.sh
# 步数蒸馏+FP8量化模型
cd LightX2V/scripts/wan/distill
bash run_wan_i2v_distill_fp8_4step_cfg_ulysses.sh
```
使用8张H100,运行时间及观测的每张卡峰值显存测试如下:
1. Wan2.1-I2V-14B-480P模型:Total Cost cost 116.455286 seconds;49668MiB
2. 步数蒸馏模型 Lora:Total Cost cost 45.899316 seconds;48854MiB
3. 步数蒸馏模型 merge Lora:Total Cost cost 33.472992 seconds;48674MiB
4. 步数蒸馏+FP8量化模型:Total Cost cost 30.796211 seconds;33328MiB
解释细节
run_wan_t2v_dist_cfg_ulysses.sh脚本内容如下:
```
#!/bin/bash
# set path firstly
lightx2v_path=
model_path=
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# set environment variables
source ${lightx2v_path}/scripts/base/base.sh
torchrun --nproc_per_node=8 -m lightx2v.infer \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/dist_infer/wan_t2v_dist_cfg_ulysses.json \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--negative_prompt "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" \
--save_result_path ${lightx2v_path}/save_results/output_lightx2v_wan_t2v.mp4
```
`export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` 表示使用0-7号显卡,共8张
`source ${lightx2v_path}/scripts/base/base.sh` 设置一些基础的环境变量
`torchrun --nproc_per_node=8 -m lightx2v.infer` 表示使用torchrun启动多卡,启动8个进程,每个进程绑定1张 GPU
`--model_cls wan2.1` 表示使用wan2.1模型
`--task t2v` 表示使用t2v任务,在运行 Wan2.1-I2V-14B 模型时对应为 i2v
`--model_path` 表示模型的路径
`--config_json` 表示配置文件的路径
`--prompt` 表示提示词
`--negative_prompt` 表示负向提示词
`--save_result_path` 表示保存结果的路径
由于不同的模型都有其各自的特性,所以`config_json`文件中会存有对应模型的更多细节的配置参数,不同模型的`config_json`文件内容有所不同
wan_t2v_dist_cfg_ulysses.json内容如下:
```
{
"infer_steps": 50,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 6,
"sample_shift": 8,
"enable_cfg": true,
"cpu_offload": false,
"parallel": {
"seq_p_size": 4,
"seq_p_attn_type": "ulysses",
"cfg_p_size": 2
}
}
```
`infer_steps` 表示推理的步数
`target_video_length` 表示目标视频的帧数(对于wan2.1模型来说,fps=16,所以target_video_length=81,表示视频时长为5秒)
`target_height` 表示目标视频的高度
`target_width` 表示目标视频的宽度
`self_attn_1_type`, `cross_attn_1_type`, `cross_attn_2_type` 表示wan2.1模型内部的三个注意力层的算子的类型,这里使用flash_attn3,仅限于Hopper架构的显卡(H100, H20等),其他显卡可以使用flash_attn2进行替代
`enable_cfg` 表示是否启用cfg,这里设置为true,表示会推理两次,第一次使用正向提示词,第二次使用负向提示词,这样可以得到更好的效果,但是会增加推理时间,如果是已经做了CFG蒸馏的模型,这里就可以设置为false
`cpu_offload` 表示是否启用cpu offload,启用cpu offload能达到降低显存的效果。若是开启cpu offload,则需要加上`"offload_granularity": "model"` ,表示卸载粒度,按整个模型模块卸载。开启之后可以使用`watch -n 1 nvidia-smi` 观察显存使用情况。
`parallel` 表示并行参数设置。DiT支持两种并行注意力机制:Ulysses 和 Ring,同时还支持 Cfg 并行推理。并行推理能够显著降低推理耗时和减轻每个GPU的显存开销。这里使用 cfg+Ulysses 并行,对应 seq_p_size*cfg_p_size=8 八卡配置
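下面是一段示意脚本(配置文件路径仅为示例,且假设 CUDA_VISIBLE_DEVICES 已按上文脚本设置),用于在启动前校验 seq_p_size * cfg_p_size = GPU 数目这一约束:
```python
# 示意脚本:校验并行配置与可见 GPU 数目是否匹配
import json
import os

with open("configs/dist_infer/wan_t2v_dist_cfg_ulysses.json", "r", encoding="utf-8") as f:
    parallel = json.load(f)["parallel"]

num_gpus = len([d for d in os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",") if d])
seq_p, cfg_p = parallel["seq_p_size"], parallel["cfg_p_size"]

assert seq_p * cfg_p == num_gpus, f"seq_p_size({seq_p}) * cfg_p_size({cfg_p}) 应等于 GPU 数目({num_gpus})"
print("并行配置校验通过")
```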
wan_t2v_distill_lora_4step_cfg_ulysses.json内容如下:
```
{
"infer_steps": 4,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 6,
"sample_shift": 5,
"enable_cfg": false,
"cpu_offload": false,
"denoising_step_list": [1000, 750, 500, 250],
"lora_configs": [
{
"path": "lightx2v/Wan2.1-Distill-Loras/wan2.1_t2v_14b_lora_rank64_lightx2v_4step.safetensors",
"strength": 1.0
}
],
"parallel": {
"seq_p_size": 4,
"seq_p_attn_type": "ulysses",
"cfg_p_size": 2
}
}
```
`infer_steps` 表示推理的步数,这里使用的是蒸馏模型,推理步数蒸馏成4步
`denoising_step_list` 表示 4 步去噪步骤对应的时间步
`lora_configs` 表示LoRA 插件配置,填入蒸馏模型的路径,需为绝对路径
wan_t2v_distill_model_4step_cfg_ulysses.json内容如下:
```
{
"infer_steps": 4,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 6,
"sample_shift": 5,
"enable_cfg": false,
"cpu_offload": false,
"denoising_step_list": [1000, 750, 500, 250],
"dit_original_ckpt": "lightx2v/Wan2.1-Distill-Models/wan2.1_t2v_14b_lightx2v_4step.safetensors",
"parallel": {
"seq_p_size": 4,
"seq_p_attn_type": "ulysses",
"cfg_p_size": 2
}
}
```
`dit_original_ckpt` 表示 merge Lora 后的蒸馏模型路径
wan_t2v_distill_fp8_4step_cfg_ulysses.json内容如下:
```
{
"infer_steps": 4,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": 6,
"sample_shift": 5,
"enable_cfg": false,
"cpu_offload": false,
"denoising_step_list": [1000, 750, 500, 250],
"dit_quantized": true,
"dit_quantized_ckpt": "lightx2v/Wan2.1-Distill-Models/wan2.1_t2v_14b_scaled_fp8_e4m3_lightx2v_4step.safetensors",
"dit_quant_scheme": "fp8-sgl",
"parallel": {
"seq_p_size": 4,
"seq_p_attn_type": "ulysses",
"cfg_p_size": 2
}
}
```
`dit_quantized` 表示是否启用 DIT 量化,设置为true表示对模型核心的 DIT 模块做量化处理
`dit_quantized_ckpt` 表示 DIT 量化权重路径,指定 FP8 量化后的 DIT 权重文件的本地路径
`dit_quant_scheme` 表示 DIT 量化方案,指定量化类型为 "fp8-sgl"(fp8-sgl表示使用sglang的fp8 kernel进行推理)
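如果改用 Python API 加载上述步数蒸馏+FP8 量化权重,可以参考下面的示意代码(路径仅为示例,接口用法见后文 LightX2V 使用示例):
```python
# 示意代码:Python API 中加载 Wan2.1 的步数蒸馏+FP8 量化 DIT 权重
from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="/path/to/Wan-AI/Wan2.1-T2V-14B",  # 示例路径
    model_cls="wan2.1",
    task="t2v",
)
# enable_quantize 需在 create_generator() 之前调用
pipe.enable_quantize(
    dit_quantized=True,
    quant_scheme="fp8-sgl",
    dit_quantized_ckpt="/path/to/Wan2.1-Distill-Models/wan2.1_t2v_14b_scaled_fp8_e4m3_lightx2v_4step.safetensors",  # 示例路径
)
# 之后再调用 pipe.create_generator(...) 和 pipe.generate(...)
```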
### 2. 启动服务生成
#### 2.1 单卡推理
启动服务
```
cd LightX2V/scripts/server
# 运行下面的脚本之前,需要将脚本中的lightx2v_path、model_path以及config_json替换为实际路径
# 例如:lightx2v_path=/home/user/LightX2V
# 例如:model_path=/home/user/models/Wan-AI/Wan2.1-T2V-14B
# 例如:config_json ${lightx2v_path}/configs/wan/wan_t2v.json
# 切换model_path和config_json路径体验不同模型
bash start_server.sh
```
向服务端发送请求
此处需要打开第二个终端作为用户
```
cd LightX2V/scripts/server
# 此时生成视频,url = "http://localhost:8000/v1/tasks/video/"
python post.py
```
发送完请求后,可以在服务端看到推理的日志
#### 2.2 单卡offload推理
如下修改 config 文件中的 cpu_offload,开启offload
```
"cpu_offload": true,
"offload_granularity": "model"
```
启动服务
```
cd LightX2V/scripts/server
bash start_server.sh
```
向服务端发送请求
```
cd LightX2V/scripts/server
# 此时生成视频,url = "http://localhost:8000/v1/tasks/video/"
python post.py
```
#### 2.3 多卡并行推理
启动服务
```
cd LightX2V/scripts/server
bash start_server_cfg_ulysses.sh
```
向服务端发送请求
```
cd LightX2V/scripts/server
python post.py
```
运行时间及观测的每张卡峰值显存测试如下:
1. 单卡推理:Run DiT cost 261.699812 seconds;RUN pipeline cost 261.973479 seconds;43968MiB
2. 单卡offload推理:Run DiT cost 264.445139 seconds;RUN pipeline cost 265.565198 seconds;34932MiB
3. 多卡并行推理:Run DiT cost 109.518894 seconds;RUN pipeline cost 110.085543 seconds;44624MiB
解释细节
start_server.sh脚本内容如下
```
#!/bin/bash
# set path firstly
lightx2v_path=
model_path=
export CUDA_VISIBLE_DEVICES=0
# set environment variables
source ${lightx2v_path}/scripts/base/base.sh
# Start API server with distributed inference service
python -m lightx2v.server \
--model_cls wan2.1 \
--task t2v \
--model_path $model_path \
--config_json ${lightx2v_path}/configs/wan/wan_t2v.json \
--host 0.0.0.0 \
--port 8000
echo "Service stopped"
```
`--host 0.0.0.0` 和 `--port 8000` 表示服务起在本机ip的8000端口上
post.py内容如下
```
import requests
from loguru import logger
if __name__ == "__main__":
url = "http://localhost:8000/v1/tasks/video/"
message = {
"prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
"negative_prompt": "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
"image_path": "",
"seed": 42,
"save_result_path": "./cat_boxing_seed42.mp4",
}
logger.info(f"message: {message}")
response = requests.post(url, json=message)
logger.info(f"response: {response.json()}")
```
`url = "http://localhost:8000/v1/tasks/video/" `表示向本机ip的8000端口上,发送一个视频生成任务。如果是图像生成任务,url需改成 url = "http://localhost:8000/v1/tasks/image/"
`message字典` 表示向服务端发送的请求的内容,其中`seed`若不指定,每次发送请求会随机生成一个`seed`;`save_result_path`若不指定,也会生成一个和任务id一致命名的文件
### 3. python代码生成
#### 3.1 单卡推理
```
cd LightX2V/examples/wan
# 修改model_path、save_result_path、config_json
PYTHONPATH=/home/user/LightX2V python wan_t2v.py
```
注意1:设置运行中的参数中,推荐使用传入config_json的方式,用来和前面的运行脚本生成视频和启动服务生成视频进行超参数对齐
注意2:PYTHONPATH的路径需为绝对路径
#### 3.2 单卡offload推理
如下修改 config 文件中的 cpu_offload,开启offload
```
"cpu_offload": true,
"offload_granularity": "model"
```
```
cd LightX2V/examples/wan
PYTHONPATH=/home/user/LightX2V python wan_t2v.py
```
#### 3.3 多卡并行推理
```
cd LightX2V/examples/wan
# 代码中需将config_json改成:LightX2V/configs/dist_infer/wan_t2v_dist_cfg_ulysses.json
PROFILING_DEBUG_LEVEL=2 PYTHONPATH=/home/user/LightX2V torchrun --nproc_per_node=8 wan_t2v.py
```
运行时间及观测的每张卡峰值显存测试如下:
1. 单卡推理:Run DiT cost 262.745393 seconds;RUN pipeline cost 263.279303 seconds;44792MiB
2. 单卡offload推理:Run DiT cost 263.725956 seconds;RUN pipeline cost 264.919227 seconds;34936MiB
3. 多卡并行推理:Run DiT cost 113.736238 seconds;RUN pipeline cost 114.297859 seconds;44624MiB
解释细节
wan_t2v.py内容如下
```
"""
Wan2.1 text-to-video generation example.
This example demonstrates how to use LightX2V with Wan2.1 model for T2V generation.
"""
from lightx2v import LightX2VPipeline
# Initialize pipeline for Wan2.1 T2V task
pipe = LightX2VPipeline(
model_path="/path/to/Wan2.1-T2V-14B",
model_cls="wan2.1",
task="t2v",
)
# Alternative: create generator from config JSON file
# pipe.create_generator(config_json="../configs/wan/wan_t2v.json")
# Create generator with specified parameters
pipe.create_generator(
attn_mode="sage_attn2",
infer_steps=50,
height=480, # Can be set to 720 for higher resolution
width=832, # Can be set to 1280 for higher resolution
num_frames=81,
guidance_scale=5.0,
sample_shift=5.0,
)
seed = 42
prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
negative_prompt = "镜头晃动,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
save_result_path = "/path/to/save_results/output.mp4"
pipe.generate(
seed=seed,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)
```
注意1:需要修改 model_path、save_result_path 为实际的路径
注意2:设置运行中的参数中,推荐使用传入config_json的方式,用来和前面的运行脚本生成视频和启动服务生成视频进行超参数对齐
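对应 3.3 的多卡并行推理,下面给出一段示意代码(路径与脚本名均为示例):将 `config_json` 指向分布式配置文件,并行参数由该配置文件给出,再用 `torchrun --nproc_per_node=8` 启动脚本即可:
```python
# 示意代码:多卡并行推理,config_json 指向分布式配置文件
# 启动方式示例:PYTHONPATH=/home/user/LightX2V torchrun --nproc_per_node=8 wan_t2v_dist.py
from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="/path/to/Wan-AI/Wan2.1-T2V-14B",  # 示例路径
    model_cls="wan2.1",
    task="t2v",
)
# 并行参数(seq_p_size、cfg_p_size 等)已写在该配置文件中
pipe.create_generator(config_json="configs/dist_infer/wan_t2v_dist_cfg_ulysses.json")

pipe.generate(
    seed=42,
    prompt="Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
    negative_prompt="",
    save_result_path="/path/to/save_results/output.mp4",
)
```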
# 从Wan2.2体验MoE
本文档包含 Wan2.2-T2V-A14B 和 Wan2.2-I2V-A14B 模型的使用示例。
## 准备环境
请参考[01.PrepareEnv](01.PrepareEnv.md)
## 开始运行
准备模型
```
# 从huggingface下载
hf download Wan-AI/Wan2.2-T2V-A14B --local-dir Wan-AI/Wan2.2-T2V-A14B
hf download Wan-AI/Wan2.2-I2V-A14B --local-dir Wan-AI/Wan2.2-I2V-A14B
hf download lightx2v/Wan2.2-Distill-Models --local-dir lightx2v/Wan2.2-Distill-Models
hf download lightx2v/Wan2.2-Distill-Loras --local-dir lightx2v/Wan2.2-Distill-Loras
```
### 运行脚本生成
Wan2.2-T2V-A14B
```
# 运行前需将CUDA_VISIBLE_DEVICES替换为实际用的GPU
# 同时config文件中的parallel参数也需对应修改,满足cfg_p_size * seq_p_size = GPU数目
cd LightX2V/scripts/dist_infer
bash run_wan22_moe_t2v_cfg_ulysses.sh
# 步数蒸馏模型 Lora
# 修改 config_json 为LightX2V/configs/wan22/wan_moe_t2v_distill_lora.json,并修改其中的lora_configs为所使用的蒸馏模型路径
cd LightX2V/scripts/wan22
bash run_wan22_moe_t2v_distill.sh
```
Wan2.2-I2V-A14B
```
cd LightX2V/scripts/dist_infer
bash run_wan22_moe_i2v_cfg_ulysses.sh
# 步数蒸馏模型 Lora
# 修改 config_json 为LightX2V/configs/wan22/wan_moe_i2v_distill_with_lora.json
cd LightX2V/scripts/wan22
bash run_wan22_moe_i2v_distill.sh
# 步数蒸馏模型 merge Lora
# 修改 config_json 为LightX2V/configs/wan22/wan_moe_i2v_distill.json
cd LightX2V/scripts/wan22
bash run_wan22_moe_i2v_distill.sh
# 步数蒸馏+FP8量化模型
# 修改 config_json 为LightX2V/configs/wan22/wan_moe_i2v_distill_quant.json
cd LightX2V/scripts/wan22
bash run_wan22_moe_i2v_distill.sh
```
解释细节
wan_moe_t2v_distill_lora.json内容如下:
```
{
"infer_steps": 4,
"target_video_length": 81,
"text_len": 512,
"target_height": 480,
"target_width": 832,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": [
4.0,
3.0
],
"sample_shift": 5.0,
"enable_cfg": false,
"cpu_offload": true,
"offload_granularity": "model",
"t5_cpu_offload": false,
"vae_cpu_offload": false,
"boundary_step_index": 2,
"denoising_step_list": [
1000,
750,
500,
250
],
"lora_configs": [
{
"name": "high_noise_model",
"path": "lightx2v/Wan2.2-Distill-Loras/wan2.2_t2v_A14b_high_noise_lora_rank64_lightx2v_4step_1217.safetensors",
"strength": 1.0
},
{
"name": "low_noise_model",
"path": "lightx2v/Wan2.2-Distill-Loras/wan2.2_t2v_A14b_low_noise_lora_rank64_lightx2v_4step_1217.safetensors",
"strength": 1.0
}
]
}
```
`boundary_step_index` 表示噪声阶段分界索引,切换高噪声模型和低噪声模型
`lora_configs`: 包含两个LoRA适配器,高噪声模型负责生成视频的高频细节和结构,低噪声模型负责平滑噪声和优化全局一致性。这种分工使得模型能够在不同阶段专注于不同的生成任务,从而提升整体性能。
wan_moe_i2v_distill.json内容如下
```
{
"infer_steps": 4,
"target_video_length": 81,
"text_len": 512,
"target_height": 720,
"target_width": 1280,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": [
3.5,
3.5
],
"sample_shift": 5.0,
"enable_cfg": false,
"cpu_offload": true,
"offload_granularity": "block",
"t5_cpu_offload": false,
"vae_cpu_offload": false,
"use_image_encoder": false,
"boundary_step_index": 2,
"denoising_step_list": [
1000,
750,
500,
250
],
"high_noise_original_ckpt": "lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_lightx2v_4step.safetensors",
"low_noise_original_ckpt": "lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_lightx2v_4step.safetensors"
}
```
`high_noise_original_ckpt` 表示高噪声阶段使用的蒸馏模型路径
`low_noise_original_ckpt` 表示低噪声阶段使用的蒸馏模型路径
wan_moe_i2v_distill_quant.json内容如下:
```
{
"infer_steps": 4,
"target_video_length": 81,
"text_len": 512,
"target_height": 720,
"target_width": 1280,
"self_attn_1_type": "flash_attn3",
"cross_attn_1_type": "flash_attn3",
"cross_attn_2_type": "flash_attn3",
"sample_guide_scale": [
3.5,
3.5
],
"sample_shift": 5.0,
"enable_cfg": false,
"cpu_offload": true,
"offload_granularity": "block",
"t5_cpu_offload": false,
"vae_cpu_offload": false,
"use_image_encoder": false,
"boundary_step_index": 2,
"denoising_step_list": [
1000,
750,
500,
250
],
"dit_quantized": true,
"dit_quant_scheme": "fp8-sgl",
"t5_quantized": false,
"t5_quant_scheme": "fp8-sgl",
"high_noise_quantized_ckpt": "lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_high_noise_scaled_fp8_e4m3_lightx2v_4step.safetensors",
"low_noise_quantized_ckpt": "lightx2v/Wan2.2-Distill-Models/wan2.2_i2v_A14b_low_noise_scaled_fp8_e4m3_lightx2v_4step.safetensors"
}
```
`high_noise_quantized_ckpt` 表示高噪声阶段使用的步数蒸馏+FP8量化模型路径
`low_noise_quantized_ckpt` 表示低噪声阶段使用的蒸馏+FP8量化模型路径
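如果改用 Python API 加载上述双分支的步数蒸馏+FP8 量化权重,可以参考下面的示意代码(路径仅为示例):
```python
# 示意代码:Python API 中加载 Wan2.2 双分支的步数蒸馏+FP8 量化权重
from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="/path/to/Wan-AI/Wan2.2-I2V-A14B",  # 示例路径
    model_cls="wan2.2_moe",
    task="i2v",
)
# enable_quantize 需在 create_generator() 之前调用
pipe.enable_quantize(
    dit_quantized=True,
    quant_scheme="fp8-sgl",
    high_noise_quantized_ckpt="/path/to/wan2.2_i2v_A14b_high_noise_scaled_fp8_e4m3_lightx2v_4step.safetensors",  # 示例路径
    low_noise_quantized_ckpt="/path/to/wan2.2_i2v_A14b_low_noise_scaled_fp8_e4m3_lightx2v_4step.safetensors",  # 示例路径
)
# 之后再调用 pipe.create_generator(...) 和 pipe.generate(...)
```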
### 启动服务生成
启动服务
```
cd LightX2V/scripts/server
# 运行下面的脚本之前,需要将脚本中的lightx2v_path和model_path替换为实际路径
# 例如:lightx2v_path=/home/user/LightX2V
# 例如:model_path=/home/user/models/Wan-AI/Wan2.2-T2V-A14B
# 同时:config_json也需要配成对应的模型config路径
# 例如:config_json ${lightx2v_path}/configs/wan22/wan_moe_t2v.json
# 切换model_path和config_json路径体验不同模型
bash start_server.sh
```
向服务端发送请求
此处需要打开第二个终端作为用户
```
cd LightX2V/scripts/server
# 此时生成视频,url = "http://localhost:8000/v1/tasks/video/"
python post.py
```
发送完请求后,可以在服务端看到推理的日志
### python代码生成
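该部分用法与前文 Wan2.1 的 python 代码生成基本一致,下面给出一段示意代码(路径均为示例),直接加载上文的步数蒸馏配置文件以保持超参数对齐:
```python
# 示意代码:用 Python API 运行 Wan2.2-T2V-A14B 步数蒸馏模型
from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="/path/to/Wan-AI/Wan2.2-T2V-A14B",  # 示例路径
    model_cls="wan2.2_moe",
    task="t2v",
)
# 配置文件中已包含高/低噪声两个 LoRA 的 lora_configs,运行前需把其中的路径改为本地绝对路径
pipe.create_generator(config_json="configs/wan22/wan_moe_t2v_distill_lora.json")

pipe.generate(
    seed=42,
    prompt="Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
    negative_prompt="",
    save_result_path="/path/to/save_results/output.mp4",
)
```
运行方式与前文相同:设置好 PYTHONPATH 后用 python 直接运行,多卡并行时改用 torchrun 启动。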
# LightX2V Usage Examples
This document introduces how to use LightX2V for video generation, including basic usage and advanced configurations.
## 📋 Table of Contents
- [Environment Setup](#environment-setup)
- [Basic Usage Examples](#basic-usage-examples)
- [Model Path Configuration](#model-path-configuration)
- [Creating Generator](#creating-generator)
- [Advanced Configurations](#advanced-configurations)
- [Parameter Offloading](#parameter-offloading)
- [Model Quantization](#model-quantization)
- [Parallel Inference](#parallel-inference)
- [Feature Caching](#feature-caching)
- [LoRA Support](#lora-support)
- [Lightweight VAE](#lightweight-vae)
## 🔧 Environment Setup
Please refer to the main project's [Quick Start Guide](../docs/EN/source/getting_started/quickstart.md) for environment setup.
## 🚀 Basic Usage Examples
A minimal code example can be found in `examples/wan_t2v.py`:
```python
from lightx2v import LightX2VPipeline
pipe = LightX2VPipeline(
model_path="/path/to/Wan2.1-T2V-14B",
model_cls="wan2.1",
task="t2v",
)
pipe.create_generator(
attn_mode="sage_attn2",
infer_steps=50,
height=480,
width=832,
num_frames=81,
guidance_scale=5.0,
sample_shift=5.0,
)
seed = 42
prompt = "Your prompt here"
negative_prompt = ""
save_result_path="/path/to/save_results/output.mp4"
pipe.generate(
seed=seed,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)
```
## 📁 Model Path Configuration
### Basic Configuration
Pass the model path to `LightX2VPipeline`:
```python
pipe = LightX2VPipeline(
model_path="/path/to/Wan2.2-I2V-A14B",
model_cls="wan2.2_moe", # For wan2.1, use "wan2.1"
task="i2v",
)
```
### Specifying Multiple Model Weight Versions
When there are multiple versions of bf16 precision DIT model safetensors files in the `model_path` directory, you need to use the following parameters to specify which weights to use:
- **`dit_original_ckpt`**: Used to specify the original DIT weight path for models like wan2.1 and hunyuan15
- **`low_noise_original_ckpt`**: Used to specify the low noise branch weight path for wan2.2 models
- **`high_noise_original_ckpt`**: Used to specify the high noise branch weight path for wan2.2 models
**Usage Example:**
```python
pipe = LightX2VPipeline(
model_path="/path/to/Wan2.2-I2V-A14B",
model_cls="wan2.2_moe",
task="i2v",
low_noise_original_ckpt="/path/to/low_noise_model.safetensors",
high_noise_original_ckpt="/path/to/high_noise_model.safetensors",
)
```
## 🎛️ Creating Generator
### Loading from Configuration File
The generator can be loaded directly from a JSON configuration file. Configuration files are located in the `configs` directory:
```python
pipe.create_generator(config_json="../configs/wan/wan_t2v.json")
```
### Creating Generator Manually
You can also create the generator manually and configure multiple parameters:
```python
pipe.create_generator(
attn_mode="flash_attn2", # Options: flash_attn2, flash_attn3, sage_attn2, sage_attn3 (B-architecture GPUs)
infer_steps=50, # Number of inference steps
num_frames=81, # Number of video frames
height=480, # Video height
width=832, # Video width
guidance_scale=5.0, # CFG guidance strength (CFG disabled when =1)
sample_shift=5.0, # Sample shift
fps=16, # Frame rate
aspect_ratio="16:9", # Aspect ratio
boundary=0.900, # Boundary value
boundary_step_index=2, # Boundary step index
denoising_step_list=[1000, 750, 500, 250], # Denoising step list
)
```
**Parameter Description:**
- **Resolution**: Specified via `height` and `width`
- **CFG**: Specified via `guidance_scale` (set to 1 to disable CFG)
- **FPS**: Specified via `fps`
- **Video Length**: Specified via `num_frames`
- **Inference Steps**: Specified via `infer_steps`
- **Sample Shift**: Specified via `sample_shift`
- **Attention Mode**: Specified via `attn_mode`, options include `flash_attn2`, `flash_attn3`, `sage_attn2`, `sage_attn3` (for B-architecture GPUs)
## ⚙️ Advanced Configurations
**⚠️ Important: When manually creating a generator, you can configure some advanced options. All advanced configurations must be specified before `create_generator()`, otherwise they will not take effect!**
### Parameter Offloading
Significantly reduces memory usage with almost no impact on inference speed. Suitable for RTX 30/40/50 series GPUs.
```python
pipe.enable_offload(
cpu_offload=True, # Enable CPU offloading
offload_granularity="block", # Offload granularity: "block" or "phase"
text_encoder_offload=False, # Whether to offload text encoder
image_encoder_offload=False, # Whether to offload image encoder
vae_offload=False, # Whether to offload VAE
)
```
**Notes:**
- For Wan models, `offload_granularity` supports both `"block"` and `"phase"`
- For HunyuanVideo-1.5, only `"block"` is currently supported
### Model Quantization
Quantization can significantly reduce memory usage and accelerate inference.
```python
pipe.enable_quantize(
dit_quantized=False, # Whether to use quantized DIT model
text_encoder_quantized=False, # Whether to use quantized text encoder
image_encoder_quantized=False, # Whether to use quantized image encoder
dit_quantized_ckpt=None, # DIT quantized weight path (required when model_path doesn't contain quantized weights or has multiple weight files)
low_noise_quantized_ckpt=None, # Wan2.2 low noise branch quantized weight path
high_noise_quantized_ckpt=None, # Wan2.2 high noise branch quantized weight path
text_encoder_quantized_ckpt=None, # Text encoder quantized weight path (required when model_path doesn't contain quantized weights or has multiple weight files)
image_encoder_quantized_ckpt=None, # Image encoder quantized weight path (required when model_path doesn't contain quantized weights or has multiple weight files)
quant_scheme="fp8-sgl", # Quantization scheme
)
```
**Parameter Description:**
- **`dit_quantized_ckpt`**: When the `model_path` directory doesn't contain quantized weights, or has multiple weight files, you need to specify the specific DIT quantized weight path
- **`text_encoder_quantized_ckpt`** and **`image_encoder_quantized_ckpt`**: Similarly, used to specify encoder quantized weight paths
- **`low_noise_quantized_ckpt`** and **`high_noise_quantized_ckpt`**: Used to specify dual-branch quantized weights for Wan2.2 models
**Quantized Model Downloads:**
- **Wan-2.1 Quantized Models**: Download from [Wan2.1-Distill-Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models)
- **Wan-2.2 Quantized Models**: Download from [Wan2.2-Distill-Models](https://huggingface.co/lightx2v/Wan2.2-Distill-Models)
- **HunyuanVideo-1.5 Quantized Models**: Download from [Hy1.5-Quantized-Models](https://huggingface.co/lightx2v/Hy1.5-Quantized-Models)
- `hy15_qwen25vl_llm_encoder_fp8_e4m3_lightx2v.safetensors` is the quantized weight for the text encoder
**Usage Examples:**
```python
# HunyuanVideo-1.5 Quantization Example
pipe.enable_quantize(
quant_scheme='fp8-sgl',
dit_quantized=True,
dit_quantized_ckpt="/path/to/hy15_720p_i2v_fp8_e4m3_lightx2v.safetensors",
text_encoder_quantized=True,
image_encoder_quantized=False,
text_encoder_quantized_ckpt="/path/to/hy15_qwen25vl_llm_encoder_fp8_e4m3_lightx2v.safetensors",
)
# Wan2.1 Quantization Example
pipe.enable_quantize(
dit_quantized=True,
dit_quantized_ckpt="/path/to/wan2.1_i2v_480p_scaled_fp8_e4m3_lightx2v_4step.safetensors",
)
# Wan2.2 Quantization Example
pipe.enable_quantize(
dit_quantized=True,
low_noise_quantized_ckpt="/path/to/wan2.2_i2v_A14b_low_noise_scaled_fp8_e4m3_lightx2v_4step.safetensors",
high_noise_quantized_ckpt="/path/to/wan2.2_i2v_A14b_high_noise_scaled_fp8_e4m3_lightx2v_4step_1030.safetensors",
)
```
**Quantization Scheme Reference:** For detailed information, please refer to the [Quantization Documentation](../docs/EN/source/method_tutorials/quantization.md)
### Parallel Inference
Supports multi-GPU parallel inference. Requires running with `torchrun`:
```python
pipe.enable_parallel(
seq_p_size=4, # Sequence parallel size
seq_p_attn_type="ulysses", # Sequence parallel attention type
)
```
**Running Method:**
```bash
torchrun --nproc_per_node=4 your_script.py
```
### Feature Caching
You can specify the cache method as Mag or Tea, using MagCache and TeaCache methods:
```python
pipe.enable_cache(
cache_method='Tea', # Cache method: 'Tea' or 'Mag'
coefficients=[-3.08907507e+04, 1.67786188e+04, -3.19178643e+03,
2.60740519e+02, -8.19205881e+00, 1.07913775e-01], # Coefficients
teacache_thresh=0.15, # TeaCache threshold
)
```
**Coefficient Reference:** Refer to configuration files in `configs/caching` or `configs/hunyuan_video_15/cache` directories
### LoRA Support
Supports loading distilled LoRA weights, which can significantly accelerate inference.
**Usage Examples:**
```python
# Qwen-Image Single LoRA Example
pipe.enable_lora(
[
{"path": "/path/to/Qwen-Image-2512-Lightning-4steps-V1.0-fp32.safetensors", "strength": 1.0},
],
lora_dynamic_apply=False,
)
# Wan2.2 Multiple LoRAs Example
pipe.enable_lora(
[
{"name": "high_noise_model", "path": "/path/to/wan2.2_i2v_A14b_high_noise_lora_rank64_lightx2v_4step_1022.safetensors", "strength": 1.0},
{"name": "low_noise_model", "path": "/path/to/wan2.2_i2v_A14b_low_noise_lora_rank64_lightx2v_4step_1022.safetensors", "strength": 1.0},
],
lora_dynamic_apply=False,
)
```
**Parameter Description:**
- **`lora_configs`**: List of LoRA configurations, each containing:
- **`path`**: Path to LoRA weight file (required)
- **`name`**: LoRA name (optional, used when multiple LoRAs are needed, e.g., Wan2.2)
- **`strength`**: LoRA strength, default is 1.0
- **`lora_dynamic_apply`**: Whether to dynamically apply LoRA weights
- `False` (default): Merge LoRA weights during loading, faster inference but uses more memory
- `True`: Dynamically apply LoRA weights during inference, saves memory but slower
**LoRA Model Downloads:**
- **Wan-2.1 LoRA**: Download from [Wan2.1-Distill-Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models)
- **Wan-2.2 LoRA**: Download from [Wan2.2-Distill-Models](https://huggingface.co/lightx2v/Wan2.2-Distill-Models)
- **Qwen-Image LoRA**: Download from [Qwen-Image-2512-Lightning](https://huggingface.co/lightx2v/Qwen-Image-2512-Lightning) or [Qwen-Image-Edit-2511-Lightning](https://huggingface.co/lightx2v/Qwen-Image-Edit-2511-Lightning)
### Lightweight VAE
Using lightweight VAE can accelerate decoding and reduce memory usage.
```python
pipe.enable_lightvae(
use_lightvae=False, # Whether to use LightVAE
use_tae=False, # Whether to use LightTAE
vae_path=None, # Path to LightVAE
tae_path=None, # Path to LightTAE
)
```
**Support Status:**
- **LightVAE**: Currently only supports wan2.1, wan2.2 moe
- **LightTAE**: Currently only supports wan2.1, wan2.2-ti2v, wan2.2 moe, HunyuanVideo-1.5
**Model Downloads:** Lightweight VAE models can be downloaded from [Autoencoders](https://huggingface.co/lightx2v/Autoencoders)
- LightVAE for Wan-2.1: [lightvaew2_1.safetensors](https://huggingface.co/lightx2v/Autoencoders/blob/main/lightvaew2_1.safetensors)
- LightTAE for Wan-2.1: [lighttaew2_1.safetensors](https://huggingface.co/lightx2v/Autoencoders/blob/main/lighttaew2_1.safetensors)
- LightTAE for Wan-2.2-ti2v: [lighttaew2_2.safetensors](https://huggingface.co/lightx2v/Autoencoders/blob/main/lighttaew2_2.safetensors)
- LightTAE for HunyuanVideo-1.5: [lighttaehy1_5.safetensors](https://huggingface.co/lightx2v/Autoencoders/blob/main/lighttaehy1_5.safetensors)
**Usage Example:**
```python
# Using LightTAE for HunyuanVideo-1.5
pipe.enable_lightvae(
use_tae=True,
tae_path="/path/to/lighttaehy1_5.safetensors",
use_lightvae=False,
vae_path=None
)
```
## 📚 More Resources
- [Full Documentation](https://lightx2v-en.readthedocs.io/en/latest/)
- [GitHub Repository](https://github.com/ModelTC/LightX2V)
- [HuggingFace Model Hub](https://huggingface.co/lightx2v)
# LightX2V 使用示例
本文档介绍如何使用 LightX2V 进行视频生成,包括基础使用和进阶配置。
## 📋 目录
- [环境安装](#环境安装)
- [基础运行示例](#基础运行示例)
- [模型路径配置](#模型路径配置)
- [创建生成器](#创建生成器)
- [进阶配置](#进阶配置)
- [参数卸载 (Offload)](#参数卸载-offload)
- [模型量化 (Quantization)](#模型量化-quantization)
- [并行推理 (Parallel Inference)](#并行推理-parallel-inference)
- [特征缓存 (Cache)](#特征缓存-cache)
- [LoRA 支持](#lora-支持)
- [轻量 VAE (Light VAE)](#轻量-vae-light-vae)
## 🔧 环境安装
请参考主项目的[快速入门文档](../docs/ZH_CN/source/getting_started/quickstart.md)进行环境安装。
## 🚀 基础运行示例
最小化代码示例可参考 `examples/wan_t2v.py`
```python
from lightx2v import LightX2VPipeline
pipe = LightX2VPipeline(
model_path="/path/to/Wan2.1-T2V-14B",
model_cls="wan2.1",
task="t2v",
)
pipe.create_generator(
attn_mode="sage_attn2",
infer_steps=50,
height=480,
width=832,
num_frames=81,
guidance_scale=5.0,
sample_shift=5.0,
)
seed = 42
prompt = "Your prompt here"
negative_prompt = ""
save_result_path="/path/to/save_results/output.mp4"
pipe.generate(
seed=seed,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)
```
## 📁 模型路径配置
### 基础配置
将模型路径传入 `LightX2VPipeline`
```python
pipe = LightX2VPipeline(
model_path="/path/to/Wan2.2-I2V-A14B",
model_cls="wan2.2_moe", # 对于 wan2.1,使用 "wan2.1"
task="i2v",
)
```
### 多版本模型权重指定
`model_path` 目录下存在多个不同版本的 bf16 精度 DIT 模型 safetensors 文件时,需要使用以下参数指定具体使用哪个权重:
- **`dit_original_ckpt`**: 用于指定 wan2.1 和 hunyuan15 等模型的原始 DIT 权重路径
- **`low_noise_original_ckpt`**: 用于指定 wan2.2 模型的低噪声分支权重路径
- **`high_noise_original_ckpt`**: 用于指定 wan2.2 模型的高噪声分支权重路径
**使用示例:**
```python
pipe = LightX2VPipeline(
model_path="/path/to/Wan2.2-I2V-A14B",
model_cls="wan2.2_moe",
task="i2v",
low_noise_original_ckpt="/path/to/low_noise_model.safetensors",
high_noise_original_ckpt="/path/to/high_noise_model.safetensors",
)
```
## 🎛️ 创建生成器
### 从配置文件加载
生成器可以从 JSON 配置文件直接加载,配置文件位于 `configs` 目录:
```python
pipe.create_generator(config_json="../configs/wan/wan_t2v.json")
```
### 手动创建生成器
也可以手动创建生成器,并配置多个参数:
```python
pipe.create_generator(
attn_mode="flash_attn2", # 可选: flash_attn2, flash_attn3, sage_attn2, sage_attn3 (B架构显卡适用)
infer_steps=50, # 推理步数
num_frames=81, # 视频帧数
height=480, # 视频高度
width=832, # 视频宽度
    guidance_scale=5.0,               # CFG引导强度 (=1时禁用CFG)
sample_shift=5.0, # 采样偏移
fps=16, # 帧率
aspect_ratio="16:9", # 宽高比
boundary=0.900, # 边界值
boundary_step_index=2, # 边界步索引
denoising_step_list=[1000, 750, 500, 250], # 去噪步列表
)
```
**参数说明:**
- **分辨率**: 通过 `height` 和 `width` 指定
- **CFG**: 通过 `guidance_scale` 指定(设置为 1 时禁用 CFG)
- **FPS**: 通过 `fps` 指定帧率
- **视频长度**: 通过 `num_frames` 指定帧数
- **推理步数**: 通过 `infer_steps` 指定
- **采样偏移**: 通过 `sample_shift` 指定
- **注意力模式**: 通过 `attn_mode` 指定,可选 `flash_attn2`, `flash_attn3`, `sage_attn2`, `sage_attn3`(B架构显卡适用)
## ⚙️ 进阶配置
**⚠️ 重要提示:手动创建生成器时,可以配置一些进阶选项,所有进阶配置必须在 `create_generator()` 之前指定,否则会失效!**
### 参数卸载 (Offload)
显著降低显存占用,几乎不影响推理速度,适用于 RTX 30/40/50 系列显卡。
```python
pipe.enable_offload(
cpu_offload=True, # 启用 CPU 卸载
offload_granularity="block", # 卸载粒度: "block" 或 "phase"
text_encoder_offload=False, # 文本编码器是否卸载
image_encoder_offload=False, # 图像编码器是否卸载
vae_offload=False, # VAE 是否卸载
)
```
**说明:**
- 对于 Wan 模型,`offload_granularity` 支持 `"block"` 和 `"phase"`
- 对于 HunyuanVideo-1.5,目前只支持 `"block"`
### 模型量化 (Quantization)
量化可以显著降低显存占用并加速推理。
```python
pipe.enable_quantize(
dit_quantized=False, # 是否使用量化的 DIT 模型
text_encoder_quantized=False, # 是否使用量化的文本编码器
image_encoder_quantized=False, # 是否使用量化的图像编码器
dit_quantized_ckpt=None, # DIT 量化权重路径(当 model_path 下没有量化权重或存在多个权重时需要指定)
low_noise_quantized_ckpt=None, # Wan2.2 低噪声分支量化权重路径
high_noise_quantized_ckpt=None, # Wan2.2 高噪声分支量化权重路径
text_encoder_quantized_ckpt=None, # 文本编码器量化权重路径(当 model_path 下没有量化权重或存在多个权重时需要指定)
image_encoder_quantized_ckpt=None, # 图像编码器量化权重路径(当 model_path 下没有量化权重或存在多个权重时需要指定)
quant_scheme="fp8-sgl", # 量化方案
)
```
**参数说明:**
- **`dit_quantized_ckpt`**: 当 `model_path` 目录下没有量化权重,或存在多个权重文件时,需要指定具体的 DIT 量化权重路径
- **`text_encoder_quantized_ckpt`** 和 **`image_encoder_quantized_ckpt`**: 类似地,用于指定编码器的量化权重路径
- **`low_noise_quantized_ckpt`** 和 **`high_noise_quantized_ckpt`**: 用于指定 Wan2.2 模型的双分支量化权重
**量化模型下载:**
- **Wan-2.1 量化模型**: 从 [Wan2.1-Distill-Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models) 下载
- **Wan-2.2 量化模型**: 从 [Wan2.2-Distill-Models](https://huggingface.co/lightx2v/Wan2.2-Distill-Models) 下载
- **HunyuanVideo-1.5 量化模型**: 从 [Hy1.5-Quantized-Models](https://huggingface.co/lightx2v/Hy1.5-Quantized-Models) 下载
- `hy15_qwen25vl_llm_encoder_fp8_e4m3_lightx2v.safetensors` 是文本编码器的量化权重
**使用示例:**
```python
# HunyuanVideo-1.5 量化示例
pipe.enable_quantize(
quant_scheme='fp8-sgl',
dit_quantized=True,
dit_quantized_ckpt="/path/to/hy15_720p_i2v_fp8_e4m3_lightx2v.safetensors",
text_encoder_quantized=True,
image_encoder_quantized=False,
text_encoder_quantized_ckpt="/path/to/hy15_qwen25vl_llm_encoder_fp8_e4m3_lightx2v.safetensors",
)
# Wan2.1 量化示例
pipe.enable_quantize(
dit_quantized=True,
dit_quantized_ckpt="/path/to/wan2.1_i2v_480p_scaled_fp8_e4m3_lightx2v_4step.safetensors",
)
# Wan2.2 量化示例
pipe.enable_quantize(
dit_quantized=True,
low_noise_quantized_ckpt="/path/to/wan2.2_i2v_A14b_low_noise_scaled_fp8_e4m3_lightx2v_4step.safetensors",
high_noise_quantized_ckpt="/path/to/wan2.2_i2v_A14b_high_noise_scaled_fp8_e4m3_lightx2v_4step_1030.safetensors",
)
```
**量化方案参考:** 详细说明请参考 [量化文档](../docs/ZH_CN/source/method_tutorials/quantization.md)
### 并行推理 (Parallel Inference)
支持多 GPU 并行推理,需要使用 `torchrun` 运行:
```python
pipe.enable_parallel(
seq_p_size=4, # 序列并行大小
seq_p_attn_type="ulysses", # 序列并行注意力类型
)
```
**运行方式:**
```bash
torchrun --nproc_per_node=4 your_script.py
```
### 特征缓存 (Cache)
可以指定缓存方法为 Mag 或 Tea,使用 MagCache 和 TeaCache 方法:
```python
pipe.enable_cache(
cache_method='Tea', # 缓存方法: 'Tea' 或 'Mag'
coefficients=[-3.08907507e+04, 1.67786188e+04, -3.19178643e+03,
2.60740519e+02, -8.19205881e+00, 1.07913775e-01], # 系数
teacache_thresh=0.15, # TeaCache 阈值
)
```
**系数参考:** 可参考 `configs/caching` 或 `configs/hunyuan_video_15/cache` 目录下的配置文件
### LoRA 支持
支持加载蒸馏 LoRA 权重,可显著加速推理。
**使用示例:**
```python
# Qwen-Image 单 LoRA 示例
pipe.enable_lora(
[
{"path": "/path/to/Qwen-Image-2512-Lightning-4steps-V1.0-fp32.safetensors", "strength": 1.0},
],
lora_dynamic_apply=False,
)
# Wan2.2 多 LoRA 示例
pipe.enable_lora(
[
{"name": "high_noise_model", "path": "/path/to/wan2.2_i2v_A14b_high_noise_lora_rank64_lightx2v_4step_1022.safetensors", "strength": 1.0},
{"name": "low_noise_model", "path": "/path/to/wan2.2_i2v_A14b_low_noise_lora_rank64_lightx2v_4step_1022.safetensors", "strength": 1.0},
],
lora_dynamic_apply=False,
)
```
**参数说明:**
- **`lora_configs`**: LoRA 配置列表,每个配置包含:
- **`path`**: LoRA 权重文件路径(必需)
- **`name`**: LoRA 名称(可选,用于需要多个 LoRA 的情况,如 Wan2.2)
- **`strength`**: LoRA 强度,默认为 1.0
- **`lora_dynamic_apply`**: 是否动态应用 LoRA 权重
- `False`(默认): 在加载时合并 LoRA 权重,推理速度快但占用更多内存
- `True`: 在推理时动态应用 LoRA 权重,节省内存但速度较慢
**LoRA 模型下载:**
- **Wan-2.1 LoRA**: 从 [Wan2.1-Distill-Models](https://huggingface.co/lightx2v/Wan2.1-Distill-Models) 下载
- **Wan-2.2 LoRA**: 从 [Wan2.2-Distill-Models](https://huggingface.co/lightx2v/Wan2.2-Distill-Models) 下载
- **Qwen-Image LoRA**: 从 [Qwen-Image-2512-Lightning](https://huggingface.co/lightx2v/Qwen-Image-2512-Lightning) 或 [Qwen-Image-Edit-2511-Lightning](https://huggingface.co/lightx2v/Qwen-Image-Edit-2511-Lightning) 下载
### 轻量 VAE (Light VAE)
使用轻量 VAE 可以加速解码并降低显存占用。
```python
pipe.enable_lightvae(
use_lightvae=False, # 是否使用 LightVAE
use_tae=False, # 是否使用 LightTAE
vae_path=None, # LightVAE 的路径
tae_path=None, # LightTAE 的路径
)
```
**支持情况:**
- **LightVAE**: 目前只支持 wan2.1、wan2.2 moe
- **LightTAE**: 目前只支持 wan2.1、wan2.2-ti2v、wan2.2 moe、HunyuanVideo-1.5
**模型下载:** 轻量 VAE 模型可从 [Autoencoders](https://huggingface.co/lightx2v/Autoencoders) 下载
- Wan-2.1 的 LightVAE: [lightvaew2_1.safetensors](https://huggingface.co/lightx2v/Autoencoders/blob/main/lightvaew2_1.safetensors)
- Wan-2.1 的 LightTAE: [lighttaew2_1.safetensors](https://huggingface.co/lightx2v/Autoencoders/blob/main/lighttaew2_1.safetensors)
- Wan-2.2-ti2v 的 LightTAE: [lighttaew2_2.safetensors](https://huggingface.co/lightx2v/Autoencoders/blob/main/lighttaew2_2.safetensors)
- HunyuanVideo-1.5 的 LightTAE: [lighttaehy1_5.safetensors](https://huggingface.co/lightx2v/Autoencoders/blob/main/lighttaehy1_5.safetensors)
**使用示例:**
```python
# 使用 HunyuanVideo-1.5 的 LightTAE
pipe.enable_lightvae(
use_tae=True,
tae_path="/path/to/lighttaehy1_5.safetensors",
use_lightvae=False,
vae_path=None
)
```
## 📚 更多资源
- [完整文档](https://lightx2v-zhcn.readthedocs.io/zh-cn/latest/)
- [GitHub 仓库](https://github.com/ModelTC/LightX2V)
- [HuggingFace 模型库](https://huggingface.co/lightx2v)
"""
HunyuanVideo-1.5 image-to-video generation example with quantization.
This example demonstrates how to use LightX2V with HunyuanVideo-1.5 model for I2V generation,
including quantized model usage for reduced memory consumption.
"""
from lightx2v import LightX2VPipeline
# Initialize pipeline for HunyuanVideo-1.5 I2V task
pipe = LightX2VPipeline(
model_path="/path/to/ckpts/hunyuanvideo-1.5/",
model_cls="hunyuan_video_1.5",
transformer_model_name="720p_i2v",
task="i2v",
)
# Alternative: create generator from config JSON file
# pipe.create_generator(config_json="../configs/hunyuan_video_15/hunyuan_video_i2v_720p.json")
# Enable offloading to significantly reduce VRAM usage with minimal speed impact
# Suitable for RTX 30/40/50 consumer GPUs
pipe.enable_offload(
cpu_offload=True,
offload_granularity="block", # For HunyuanVideo-1.5, only "block" is supported
text_encoder_offload=True,
image_encoder_offload=False,
vae_offload=False,
)
# Enable quantization for reduced memory usage
# Quantized models can be downloaded from: https://huggingface.co/lightx2v/Hy1.5-Quantized-Models
pipe.enable_quantize(
quant_scheme="fp8-sgl",
dit_quantized=True,
dit_quantized_ckpt="/path/to/hy15_720p_i2v_fp8_e4m3_lightx2v.safetensors",
text_encoder_quantized=True,
image_encoder_quantized=False,
text_encoder_quantized_ckpt="/path/to/hy15_qwen25vl_llm_encoder_fp8_e4m3_lightx2v.safetensors",
)
# Create generator with specified parameters
pipe.create_generator(
attn_mode="sage_attn2",
infer_steps=50,
num_frames=121,
guidance_scale=6.0,
sample_shift=7.0,
fps=24,
)
# Generation parameters
seed = 42
prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
negative_prompt = ""
save_result_path = "/path/to/save_results/output2.mp4"
image_path = "/path/to/input.jpg"  # input image for the I2V task (placeholder path)
# Generate video
pipe.generate(
    seed=seed,
    image_path=image_path,  # parameter name assumed for the I2V pipeline
    prompt=prompt,
    negative_prompt=negative_prompt,
    save_result_path=save_result_path,
)
"""
HunyuanVideo-1.5 text-to-video generation example.
This example demonstrates how to use LightX2V with the HunyuanVideo-1.5 model for T2V generation.
"""
from lightx2v import LightX2VPipeline
# Initialize pipeline for HunyuanVideo-1.5
pipe = LightX2VPipeline(
model_path="/path/to/ckpts/hunyuanvideo-1.5/",
model_cls="hunyuan_video_1.5",
transformer_model_name="720p_t2v",
task="t2v",
)
# Alternative: create generator from config JSON file
# pipe.create_generator(config_json="../configs/hunyuan_video_15/hunyuan_video_t2v_720p.json")
# Enable offloading to significantly reduce VRAM usage with minimal speed impact
# Suitable for RTX 30/40/50 consumer GPUs
pipe.enable_offload(
cpu_offload=True,
offload_granularity="block", # For HunyuanVideo-1.5, only "block" is supported
text_encoder_offload=True,
image_encoder_offload=False,
vae_offload=False,
)
# Use LightTAE for decoding
pipe.enable_lightvae(
use_tae=True,
tae_path="/path/to/lighttaehy1_5.safetensors",
use_lightvae=False,
vae_path=None,
)
# Create generator with specified parameters
pipe.create_generator(
attn_mode="sage_attn2",
infer_steps=50,
num_frames=121,
guidance_scale=6.0,
sample_shift=9.0,
aspect_ratio="16:9",
fps=24,
)
# Generation parameters
seed = 123
prompt = "A close-up shot captures a scene on a polished, light-colored granite kitchen counter, illuminated by soft natural light from an unseen window. Initially, the frame focuses on a tall, clear glass filled with golden, translucent apple juice standing next to a single, shiny red apple with a green leaf still attached to its stem. The camera moves horizontally to the right. As the shot progresses, a white ceramic plate smoothly enters the frame, revealing a fresh arrangement of about seven or eight more apples, a mix of vibrant reds and greens, piled neatly upon it. A shallow depth of field keeps the focus sharply on the fruit and glass, while the kitchen backsplash in the background remains softly blurred. The scene is in a realistic style."
negative_prompt = ""
save_result_path = "/path/to/save_results/output.mp4"
# Generate video
pipe.generate(
seed=seed,
prompt=prompt,
negative_prompt=negative_prompt,
save_result_path=save_result_path,
)