The `Wan-AI/Wan2.2-T2V-A14B-Diffusers` pipeline generates short videos from text prompts.
## Local CLI Usage
```bash
python text_to_video.py \
--prompt"Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."\
--negative_prompt"<optional quality filter>"\
--height 480 \
--width 832 \
--num_frames 33 \
--guidance_scale 4.0 \
--guidance_scale_high 3.0 \
--flow_shift 12.0 \
--num_inference_steps 40 \
--fps 16 \
--output t2v_out.mp4
```
Key arguments:
- `--prompt`: text description (string).
- `--height` / `--width`: output resolution (defaults 480x832, i.e. 480P). Dimensions should align with the Wan VAE downsampling factor (multiples of 8).
- `--num_frames`: number of frames (the Wan default is 81).
- `--guidance_scale` and `--guidance_scale_high`: CFG scales applied to the low-noise and high-noise transformers, respectively.
- `--flow_shift`: scheduler flow_shift (5.0 for 720p, 12.0 for 480p).
- `--negative_prompt`: optional list of artifacts to suppress (the PR demo used a long Chinese string).
- `--boundary_ratio`: boundary split ratio for the low/high DiT. The default `0.875` uses both transformers for best quality. Setting it to `1.0` loads only the low-noise transformer (saves noticeable memory with good quality; recommended if memory is limited). Setting it to `0.0` loads only the high-noise transformer (not recommended, lower quality).
- `--fps`: frames per second for the saved MP4 (requires `diffusers`' `export_to_video`).
- `--output`: path to save the generated video.
- `--vae_use_slicing`: enable VAE slicing for memory optimization.
- `--vae_use_tiling`: enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set to `2` to enable CFG parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: enable CPU offloading for the diffusion models.
> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
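If memory is still tight, you can also set `--boundary_ratio 1.0` and enable CPU offload. A minimal sketch combining the documented flags (values are illustrative, and it assumes the memory-related flags are boolean switches):

```bash
# Memory-constrained run: low-noise transformer only, VAE slicing/tiling, CPU offload.
python text_to_video.py \
--prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
--height 480 \
--width 832 \
--num_frames 33 \
--num_inference_steps 40 \
--boundary_ratio 1.0 \
--vae_use_slicing \
--vae_use_tiling \
--enable-cpu-offload \
--output t2v_out.mp4
```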
help="Boundary split ratio for low/high DiT. Default 0.875 uses both transformers for best quality. Set to 1.0 to load only the low-noise transformer (saves noticeable memory with good quality, recommended if memory is limited).",
)
parser.add_argument(
"--flow_shift",type=float,default=5.0,help="Scheduler flow_shift (5.0 for 720p, 12.0 for 480p)."
**Note**: These examples work with the default configuration on an **NVIDIA A100 (80GB)**. We also tested on dual **NVIDIA RTX 5000 Ada (32GB each)**. For dual-GPU setups, please modify the stage configuration to distribute the model across devices.
- `--height`: Image height in pixels (default: 512)
- `--width`: Image width in pixels (default: 512)
- `--steps`: Number of inference steps (default: 25)
- `--seed`: Random seed (default: 42)
- `--negative`: Negative prompt for image generation
Example with custom parameters:
```bash
python openai_chat_client.py \
--prompt"A futuristic city"\
--modality text2img \
--height 768 \
--width 768 \
--steps 50 \
--seed 42 \
--negative"blurry, low quality"
```
## Modality Control
BAGEL-7B-MoT supports **multiple modality modes** for different use cases.
The default yaml configuration deploys Thinker and DiT on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)
"messages": [{"role": "user", "content": [{"type": "text", "text": "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n"}]}]
"modalities": ["text"]
}'
```
## FAQ
- If you encounter an error about the librosa backend, install ffmpeg with the command below.
```bash
sudo apt update
sudo apt install ffmpeg
```
- If you are unsure how much VRAM the model needs, or you encounter OOM errors, try decreasing `max_model_len` (see the sketch below).
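A minimal sketch, assuming the server exposes a vLLM-style engine argument for the context length (the exact place to set `max_model_len` depends on how you launch the server, e.g. a stage YAML or a serve command):

```bash
# Assumption: a vLLM-style serve command; adapt to your actual launch method.
vllm serve Qwen/Qwen2.5-Omni-7B --max-model-len 8192
```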
- `--video-path` (or `-v`): path to a local video file or URL. If not provided and the query type uses video, the default video URL is used. Supports local file paths (automatically encoded to base64) and HTTP/HTTPS URLs. Example: `--video-path /path/to/video.mp4` or `--video-path https://example.com/video.mp4`
- `--image-path` (or `-i`): path to a local image file or URL. If not provided and the query type uses an image, the default image URL is used. Supports local file paths (automatically encoded to base64), HTTP/HTTPS URLs, and common image formats: JPEG, PNG, GIF, WebP. Example: `--image-path /path/to/image.jpg` or `--image-path https://example.com/image.png`
- `--audio-path` (or `-a`): path to a local audio file or URL. If not provided and the query type uses audio, the default audio URL is used. Supports local file paths (automatically encoded to base64), HTTP/HTTPS URLs, and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: `--audio-path /path/to/audio.wav` or `--audio-path https://example.com/audio.mp3`
- `--prompt` (or `-p`): custom text prompt/question. If not provided, the default prompt for the selected query type is used. Example: `--prompt "What are the main activities shown in this video?"`
For example, to use mixed modalities with all local files:
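A minimal sketch of such an invocation; the script name `openai_chat_client.py` and the `--query-type` flag are assumptions based on the options described above:

```bash
# Hypothetical invocation; script name and --query-type flag are assumptions.
python openai_chat_client.py \
--query-type mixed_modalities \
--video-path /path/to/video.mp4 \
--image-path /path/to/image.jpg \
--audio-path /path/to/audio.wav \
--prompt "What is recited in the audio? What is the content of this image? Why is this video funny?"
```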
You can control output modalities to specify which types of output the model should generate. This is useful when you only need text output and want to skip audio generation stages for better performance.
### Supported modalities
| Modalities | Output |
|------------|--------|
| `["text"]` | Text only |
| `["audio"]` | Text + Audio |
| `["text", "audio"]` | Text + Audio |
| Not specified | Text + Audio (default) |
### Using curl
#### Text only
```bash
curl http://localhost:8091/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text"]
}'
```
#### Text + Audio
```bash
curl http://localhost:8091/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text", "audio"]
}'
```
If you want to enable streaming output, set the corresponding argument in the request; each stage's output is then returned as soon as that stage finishes generating it. Currently only text supports streaming output; other modalities are returned normally.
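A minimal sketch, assuming the standard OpenAI-compatible `stream` field is the argument in question (an assumption; check the server's API reference for the exact name):

```bash
curl http://localhost:8091/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Omni-7B",
"messages": [{"role": "user", "content": "Describe vLLM in brief."}],
"modalities": ["text"],
"stream": true
}'
```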
The shell snippet below shows how the request is assembled for each query type: the user content, per-stage sampling parameters, and `mm_processor_kwargs` are built, then posted to the server.

```bash
# Build user content and extra fields based on query type
case"$QUERY_TYPE"in
text)
user_content='[
{
"type": "text",
"text": "Explain the system architecture for a scalable audio generation pipeline. Answer in 15 words."
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs="{}"
;;
mixed_modalities)
user_content='[
{
"type": "audio_url",
"audio_url": {
"url": "'"$MARY_HAD_LAMB_AUDIO_URL"'"
}
},
{
"type": "image_url",
"image_url": {
"url": "'"$CHERRY_BLOSSOM_IMAGE_URL"'"
}
},
{
"type": "video_url",
"video_url": {
"url": "'"$SAMPLE_VIDEO_URL"'"
}
},
{
"type": "text",
"text": "What is recited in the audio? What is the content of this image? Why is this video funny?"
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs="{}"
;;
use_audio_in_video)
user_content='[
{
"type": "video_url",
"video_url": {
"url": "'"$SAMPLE_VIDEO_URL"'"
}
},
{
"type": "text",
"text": "Describe the content of the video, then convert what the baby say into text."
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs='{
"use_audio_in_video": true
}'
;;
multi_audios)
user_content='[
{
"type": "audio_url",
"audio_url": {
"url": "'"$MARY_HAD_LAMB_AUDIO_URL"'"
}
},
{
"type": "audio_url",
"audio_url": {
"url": "'"$WINNING_CALL_AUDIO_URL"'"
}
},
{
"type": "text",
"text": "Are these two audio clips the same?"
}
]'
sampling_params_list='[
'"$thinker_sampling_params"',
'"$talker_sampling_params"',
'"$code2wav_sampling_params"'
]'
mm_processor_kwargs="{}"
;;
esac
echo"Running query type: $QUERY_TYPE"
echo""
output=$(curl -sS-X POST http://localhost:8091/v1/chat/completions \
-H"Content-Type: application/json"\
-d @- <<EOF
{
"model": "Qwen/Qwen2.5-Omni-7B",
"sampling_params_list": $sampling_params_list,
"mm_processor_kwargs": $mm_processor_kwargs,
"modalities": $MODALITIES,
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
}
]
},
{
"role": "user",
"content": $user_content
}
]
}
EOF
)
# Here it only shows the text content of the first choice. Audio content has many binaries, so it's not displayed here.
echo"Output of request: $(echo"$output" | jq '.choices[0].message.content')"