"tests/git@developer.sourcefind.cn:OpenDAS/fairscale.git" did not exist on "c4af33b6d2ec1a93af074e89c86374b7848e20f4"
Unverified commit 8336405e authored by Yuxuan.Zhang, committed by GitHub

CogVideoX-5b-I2V support (#9418)



* draft Init

* draft

* vae encode image

* make style

* image latents preparation

* remove image encoder from conversion script

* fix minor bugs

* make pipeline work

* make style

* remove debug prints

* fix imports

* update example

* make fix-copies

* add fast tests

* fix import

* update vae

* update docs

* update image link

* apply suggestions from review

* apply suggestions from review

* add slow test

* make use of learned positional embeddings

* apply suggestions from review

* doc change

* Update convert_cogvideox_to_diffusers.py

* make style

* final changes

* make style

* fix tests

---------
Co-authored-by: Aryan <aryan@huggingface.co>
parent 2171f77a
@@ -23,6 +23,8 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:
 ## Supported pipelines
 - [`CogVideoXPipeline`]
+- [`CogVideoXImageToVideoPipeline`]
+- [`CogVideoXVideoToVideoPipeline`]
 - [`StableDiffusionPipeline`]
 - [`StableDiffusionImg2ImgPipeline`]
 - [`StableDiffusionInpaintPipeline`]
......
@@ -29,9 +29,12 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.m
 This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).

-There are two models available that can be used with the CogVideoX pipeline:
-- [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b)
-- [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b)
+There are two models available that can be used with the text-to-video and video-to-video CogVideoX pipelines:
+- [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b): The recommended dtype for running this model is `fp16`.
+- [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b): The recommended dtype for running this model is `bf16`.
+
+There is one model available that can be used with the image-to-video CogVideoX pipeline:
+- [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V): The recommended dtype for running this model is `bf16`.

 ## Inference
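The recommended dtypes above can be passed directly when loading the checkpoints. A minimal sketch; `torch_dtype` is the standard `from_pretrained` keyword and the model IDs are the ones listed above:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXPipeline

# Text-to-video / video-to-video checkpoints: fp16 for 2b, bf16 for 5b.
pipe_t2v = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16).to("cuda")

# Image-to-video checkpoint: bf16 is the recommended dtype.
pipe_i2v = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")
```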
@@ -41,10 +44,15 @@ First, load the pipeline:
 ```python
 import torch
-from diffusers import CogVideoXPipeline
-from diffusers.utils import export_to_video
+from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline
+from diffusers.utils import export_to_video, load_image

-pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b").to("cuda")
+pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b").to("cuda")  # or "THUDM/CogVideoX-2b"
+```
+
+If you are using the image-to-video pipeline, load it as follows:
+
+```python
+pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V").to("cuda")
 ```

 Then change the memory layout of the pipeline's `transformer` component to `torch.channels_last`:
@@ -53,7 +61,7 @@ Then change the memory layout of the pipeline's `transformer` component to `torch
 pipe.transformer.to(memory_format=torch.channels_last)
 ```

-Finally, compile the components and run inference:
+Compile the components and run inference:

 ```python
 pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
@@ -63,7 +71,7 @@ prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wood
 video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
 ```
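For the image-to-video pipeline, a conditioning image is passed alongside the prompt. Below is a minimal sketch assuming `pipe` is the `CogVideoXImageToVideoPipeline` loaded above; the prompt is a placeholder and the image URL is the one used in the slow test added by this PR:

```python
from diffusers.utils import export_to_video, load_image

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
prompt = "An astronaut floating gently above the surface of the moon."

video = pipe(
    image=image,
    prompt=prompt,
    guidance_scale=6,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```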
-The [benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are:
+The [T2V benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are:

 ```
 Without torch.compile(): Average inference time: 96.89 seconds.
@@ -98,6 +106,12 @@ It is also worth noting that torchao quantization is fully compatible with [torc
 - all
 - __call__

+## CogVideoXImageToVideoPipeline
+[[autodoc]] CogVideoXImageToVideoPipeline
+- all
+- __call__
+
 ## CogVideoXVideoToVideoPipeline
 [[autodoc]] CogVideoXVideoToVideoPipeline
......
@@ -4,7 +4,13 @@ from typing import Any, Dict
 import torch
 from transformers import T5EncoderModel, T5Tokenizer

-from diffusers import AutoencoderKLCogVideoX, CogVideoXDDIMScheduler, CogVideoXPipeline, CogVideoXTransformer3DModel
+from diffusers import (
+    AutoencoderKLCogVideoX,
+    CogVideoXDDIMScheduler,
+    CogVideoXImageToVideoPipeline,
+    CogVideoXPipeline,
+    CogVideoXTransformer3DModel,
+)

 def reassign_query_key_value_inplace(key: str, state_dict: Dict[str, Any]):
@@ -78,6 +84,7 @@ TRANSFORMER_KEYS_RENAME_DICT = {
     "mixins.final_layer.norm_final": "norm_out.norm",
     "mixins.final_layer.linear": "proj_out",
     "mixins.final_layer.adaLN_modulation.1": "norm_out.linear",
+    "mixins.pos_embed.pos_embedding": "patch_embed.pos_embedding",  # Specific to CogVideoX-5b-I2V
 }

 TRANSFORMER_SPECIAL_KEYS_REMAP = {
@@ -131,15 +138,18 @@ def convert_transformer(
     num_layers: int,
     num_attention_heads: int,
     use_rotary_positional_embeddings: bool,
+    i2v: bool,
     dtype: torch.dtype,
 ):
     PREFIX_KEY = "model.diffusion_model."

     original_state_dict = get_state_dict(torch.load(ckpt_path, map_location="cpu", mmap=True))
     transformer = CogVideoXTransformer3DModel(
+        in_channels=32 if i2v else 16,
         num_layers=num_layers,
         num_attention_heads=num_attention_heads,
         use_rotary_positional_embeddings=use_rotary_positional_embeddings,
+        use_learned_positional_embeddings=i2v,
     ).to(dtype=dtype)

     for key in list(original_state_dict.keys()):
@@ -153,7 +163,6 @@ def convert_transformer(
         if special_key not in key:
             continue
         handler_fn_inplace(key, original_state_dict)
     transformer.load_state_dict(original_state_dict, strict=True)
     return transformer
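The conversion relies on a simple prefix strip plus substring renaming over the checkpoint keys before `load_state_dict` is called. A minimal standalone sketch of that idea; the rename table below is an illustrative subset, not the full mapping used by the script:

```python
from typing import Any, Dict

PREFIX_KEY = "model.diffusion_model."

# Illustrative subset of TRANSFORMER_KEYS_RENAME_DICT.
RENAME = {
    "mixins.final_layer.linear": "proj_out",
    "mixins.pos_embed.pos_embedding": "patch_embed.pos_embedding",  # Specific to CogVideoX-5b-I2V
}

def rename_keys(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    new_state_dict = {}
    for key, value in state_dict.items():
        # Drop the wrapper prefix, then apply every substring rename.
        new_key = key[len(PREFIX_KEY):] if key.startswith(PREFIX_KEY) else key
        for old, new in RENAME.items():
            new_key = new_key.replace(old, new)
        new_state_dict[new_key] = value
    return new_state_dict
```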
@@ -205,6 +214,7 @@ def get_args():
     parser.add_argument("--scaling_factor", type=float, default=1.15258426, help="Scaling factor in the VAE")
     # For CogVideoX-2B, snr_shift_scale is 3.0. For 5B, it is 1.0
     parser.add_argument("--snr_shift_scale", type=float, default=3.0, help="SNR shift scale for the scheduler")
+    parser.add_argument("--i2v", action="store_true", default=False, help="Whether the checkpoint is an image-to-video (I2V) checkpoint")
     return parser.parse_args()
@@ -225,6 +235,7 @@ if __name__ == "__main__":
         args.num_layers,
         args.num_attention_heads,
         args.use_rotary_positional_embeddings,
+        args.i2v,
         dtype,
     )
     if args.vae_ckpt_path is not None:
@@ -234,7 +245,7 @@ if __name__ == "__main__":
     tokenizer = T5Tokenizer.from_pretrained(text_encoder_id, model_max_length=TOKENIZER_MAX_LENGTH)
     text_encoder = T5EncoderModel.from_pretrained(text_encoder_id, cache_dir=args.text_encoder_cache_dir)

-    # Apparently, the conversion does not work any more without this :shrug:
+    # Apparently, the conversion does not work anymore without this :shrug:
     for param in text_encoder.parameters():
         param.data = param.data.contiguous()
@@ -252,9 +263,17 @@ if __name__ == "__main__":
             "timestep_spacing": "trailing",
         }
     )

-    pipe = CogVideoXPipeline(
-        tokenizer=tokenizer, text_encoder=text_encoder, vae=vae, transformer=transformer, scheduler=scheduler
+    if args.i2v:
+        pipeline_cls = CogVideoXImageToVideoPipeline
+    else:
+        pipeline_cls = CogVideoXPipeline
+
+    pipe = pipeline_cls(
+        tokenizer=tokenizer,
+        text_encoder=text_encoder,
+        vae=vae,
+        transformer=transformer,
+        scheduler=scheduler,
     )
     if args.fp16:
@@ -265,4 +284,7 @@ if __name__ == "__main__":
     # We don't use variant here because the model must be run in fp16 (2B) or bf16 (5B). It would be weird
     # for users to specify variant when the default is not fp32 and they want to run with the correct default (which
     # is either fp16/bf16 here).
-    pipe.save_pretrained(args.output_path, safe_serialization=True, push_to_hub=args.push_to_hub)
+
+    # Sharding is necessary for users with insufficient memory, such as those using Colab and notebooks,
+    # as it can save some memory used for model loading.
+    pipe.save_pretrained(args.output_path, safe_serialization=True, max_shard_size="5GB", push_to_hub=args.push_to_hub)
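Sharding the saved weights with `max_shard_size` does not change how the converted pipeline is consumed; `from_pretrained` resolves the shards automatically. A small sketch, where the local path stands in for whatever was passed as `--output_path` during conversion (hypothetical here):

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

# "./cogvideox-5b-i2v-diffusers" is a hypothetical --output_path used during conversion.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "./cogvideox-5b-i2v-diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
```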
@@ -255,6 +255,7 @@ else:
         "BlipDiffusionControlNetPipeline",
         "BlipDiffusionPipeline",
         "CLIPImageProjection",
+        "CogVideoXImageToVideoPipeline",
         "CogVideoXPipeline",
         "CogVideoXVideoToVideoPipeline",
         "CycleDiffusionPipeline",
@@ -703,6 +704,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
         AudioLDMPipeline,
         AuraFlowPipeline,
         CLIPImageProjection,
+        CogVideoXImageToVideoPipeline,
         CogVideoXPipeline,
         CogVideoXVideoToVideoPipeline,
         CycleDiffusionPipeline,
......
@@ -1089,8 +1089,10 @@ class AutoencoderKLCogVideoX(ModelMixin, ConfigMixin, FromOriginalModelMixin):
             return self.tiled_encode(x)

         frame_batch_size = self.num_sample_frames_batch_size
+        # Note: We expect the number of frames to be either `1` or `frame_batch_size * k` or `frame_batch_size * k + 1` for some k.
+        num_batches = num_frames // frame_batch_size if num_frames > 1 else 1
         enc = []
-        for i in range(num_frames // frame_batch_size):
+        for i in range(num_batches):
             remaining_frames = num_frames % frame_batch_size
             start_frame = frame_batch_size * i + (0 if i == 0 else remaining_frames)
             end_frame = frame_batch_size * (i + 1) + remaining_frames
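To see what the `num_frames > 1` guard buys, here is a standalone sketch of the same start/end arithmetic; the frame counts are chosen only for illustration:

```python
def frame_batches(num_frames: int, frame_batch_size: int):
    # Reproduces the batching arithmetic above, returning (start_frame, end_frame) slices.
    num_batches = num_frames // frame_batch_size if num_frames > 1 else 1
    batches = []
    for i in range(num_batches):
        remaining_frames = num_frames % frame_batch_size
        start_frame = frame_batch_size * i + (0 if i == 0 else remaining_frames)
        end_frame = frame_batch_size * (i + 1) + remaining_frames
        batches.append((start_frame, end_frame))
    return batches

# A single frame (e.g. the image latent used for I2V conditioning) still yields one batch:
print(frame_batches(1, 2))  # [(0, 3)] -> slicing [0:3] of a 1-frame tensor returns that single frame
# 9 frames with frame_batch_size=2 -> 4 batches; the first batch absorbs the extra frame:
print(frame_batches(9, 2))  # [(0, 3), (3, 5), (5, 7), (7, 9)]
```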
@@ -1140,8 +1142,9 @@ class AutoencoderKLCogVideoX(ModelMixin, ConfigMixin, FromOriginalModelMixin):
             return self.tiled_decode(z, return_dict=return_dict)

         frame_batch_size = self.num_latent_frames_batch_size
+        num_batches = num_frames // frame_batch_size
         dec = []
-        for i in range(num_frames // frame_batch_size):
+        for i in range(num_batches):
             remaining_frames = num_frames % frame_batch_size
             start_frame = frame_batch_size * i + (0 if i == 0 else remaining_frames)
             end_frame = frame_batch_size * (i + 1) + remaining_frames
@@ -1233,8 +1236,10 @@ class AutoencoderKLCogVideoX(ModelMixin, ConfigMixin, FromOriginalModelMixin):
         for i in range(0, height, overlap_height):
             row = []
             for j in range(0, width, overlap_width):
+                # Note: We expect the number of frames to be either `1` or `frame_batch_size * k` or `frame_batch_size * k + 1` for some k.
+                num_batches = num_frames // frame_batch_size if num_frames > 1 else 1
                 time = []
-                for k in range(num_frames // frame_batch_size):
+                for k in range(num_batches):
                     remaining_frames = num_frames % frame_batch_size
                     start_frame = frame_batch_size * k + (0 if k == 0 else remaining_frames)
                     end_frame = frame_batch_size * (k + 1) + remaining_frames
@@ -1309,8 +1314,9 @@ class AutoencoderKLCogVideoX(ModelMixin, ConfigMixin, FromOriginalModelMixin):
         for i in range(0, height, overlap_height):
             row = []
             for j in range(0, width, overlap_width):
+                num_batches = num_frames // frame_batch_size
                 time = []
-                for k in range(num_frames // frame_batch_size):
+                for k in range(num_batches):
                     remaining_frames = num_frames % frame_batch_size
                     start_frame = frame_batch_size * k + (0 if k == 0 else remaining_frames)
                     end_frame = frame_batch_size * (k + 1) + remaining_frames
......
@@ -350,6 +350,7 @@ class CogVideoXPatchEmbed(nn.Module):
         spatial_interpolation_scale: float = 1.875,
         temporal_interpolation_scale: float = 1.0,
         use_positional_embeddings: bool = True,
+        use_learned_positional_embeddings: bool = True,
     ) -> None:
         super().__init__()
@@ -363,15 +364,17 @@ class CogVideoXPatchEmbed(nn.Module):
         self.spatial_interpolation_scale = spatial_interpolation_scale
         self.temporal_interpolation_scale = temporal_interpolation_scale
         self.use_positional_embeddings = use_positional_embeddings
+        self.use_learned_positional_embeddings = use_learned_positional_embeddings

         self.proj = nn.Conv2d(
             in_channels, embed_dim, kernel_size=(patch_size, patch_size), stride=patch_size, bias=bias
         )
         self.text_proj = nn.Linear(text_embed_dim, embed_dim)

-        if use_positional_embeddings:
+        if use_positional_embeddings or use_learned_positional_embeddings:
+            persistent = use_learned_positional_embeddings
             pos_embedding = self._get_positional_embeddings(sample_height, sample_width, sample_frames)
-            self.register_buffer("pos_embedding", pos_embedding, persistent=False)
+            self.register_buffer("pos_embedding", pos_embedding, persistent=persistent)

     def _get_positional_embeddings(self, sample_height: int, sample_width: int, sample_frames: int) -> torch.Tensor:
         post_patch_height = sample_height // self.patch_size
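The only behavioral difference introduced by `persistent` is whether `pos_embedding` ends up in the saved state dict, which is what lets the learned embeddings from the I2V checkpoint be serialized and loaded back. A standalone sketch of that `register_buffer` distinction (module and tensor shape are made up for illustration):

```python
import torch
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, learned: bool):
        super().__init__()
        pos_embedding = torch.zeros(1, 16, 8)  # placeholder shape
        # persistent=False keeps the buffer out of state_dict() (it can be recomputed on the fly);
        # persistent=True serializes it, which learned positional embeddings require.
        self.register_buffer("pos_embedding", pos_embedding, persistent=learned)

print("pos_embedding" in Embeddings(learned=False).state_dict())  # False
print("pos_embedding" in Embeddings(learned=True).state_dict())   # True
```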
@@ -415,8 +418,15 @@ class CogVideoXPatchEmbed(nn.Module):
             [text_embeds, image_embeds], dim=1
         ).contiguous()  # [batch, seq_length + num_frames x height x width, channels]

-        if self.use_positional_embeddings:
+        if self.use_positional_embeddings or self.use_learned_positional_embeddings:
+            if self.use_learned_positional_embeddings and (self.sample_width != width or self.sample_height != height):
+                raise ValueError(
+                    "It is currently not possible to generate videos at a resolution different from the defaults. This should only be the case with 'THUDM/CogVideoX-5b-I2V'. "
+                    "If you think this is incorrect, please open an issue at https://github.com/huggingface/diffusers/issues."
+                )
+
             pre_time_compression_frames = (num_frames - 1) * self.temporal_compression_ratio + 1
             if (
                 self.sample_height != height
                 or self.sample_width != width
......
@@ -235,10 +235,18 @@ class CogVideoXTransformer3DModel(ModelMixin, ConfigMixin):
         spatial_interpolation_scale: float = 1.875,
         temporal_interpolation_scale: float = 1.0,
         use_rotary_positional_embeddings: bool = False,
+        use_learned_positional_embeddings: bool = False,
     ):
         super().__init__()
         inner_dim = num_attention_heads * attention_head_dim

+        if not use_rotary_positional_embeddings and use_learned_positional_embeddings:
+            raise ValueError(
+                "There are no CogVideoX checkpoints available with disabled rotary embeddings and learned positional "
+                "embeddings. If you're using a custom model and/or believe this should be supported, please open an "
+                "issue at https://github.com/huggingface/diffusers/issues."
+            )
+
         # 1. Patch embedding
         self.patch_embed = CogVideoXPatchEmbed(
             patch_size=patch_size,
@@ -254,6 +262,7 @@ class CogVideoXTransformer3DModel(ModelMixin, ConfigMixin):
             spatial_interpolation_scale=spatial_interpolation_scale,
             temporal_interpolation_scale=temporal_interpolation_scale,
             use_positional_embeddings=not use_rotary_positional_embeddings,
+            use_learned_positional_embeddings=use_learned_positional_embeddings,
         )
         self.embedding_dropout = nn.Dropout(dropout)
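The guard above can be exercised directly. A minimal sketch using the tiny configuration from the fast tests added in this PR (not a real checkpoint configuration):

```python
from diffusers import CogVideoXTransformer3DModel

# Tiny config mirroring the I2V fast tests; all values are for illustration only.
common = dict(
    num_attention_heads=2,
    attention_head_dim=16,
    in_channels=8,
    out_channels=4,
    time_embed_dim=2,
    text_embed_dim=32,
    num_layers=1,
    sample_width=2,
    sample_height=2,
    sample_frames=9,
    patch_size=2,
    temporal_compression_ratio=4,
    max_text_seq_length=16,
)

# Valid I2V-style configuration: RoPE together with learned positional embeddings.
model = CogVideoXTransformer3DModel(
    **common, use_rotary_positional_embeddings=True, use_learned_positional_embeddings=True
)

# Learned positional embeddings without RoPE is rejected by the new guard.
try:
    CogVideoXTransformer3DModel(
        **common, use_rotary_positional_embeddings=False, use_learned_positional_embeddings=True
    )
except ValueError as err:
    print(err)
```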
@@ -465,8 +474,11 @@ class CogVideoXTransformer3DModel(ModelMixin, ConfigMixin):
         hidden_states = self.proj_out(hidden_states)

         # 5. Unpatchify
+        # Note: we use `-1` instead of `channels`:
+        # - It is okay to use `channels` for CogVideoX-2b and CogVideoX-5b (number of input channels is equal to output channels)
+        # - However, CogVideoX-5b-I2V also takes concatenated input image latents (number of input channels is twice the output channels)
         p = self.config.patch_size
-        output = hidden_states.reshape(batch_size, num_frames, height // p, width // p, channels, p, p)
+        output = hidden_states.reshape(batch_size, num_frames, height // p, width // p, -1, p, p)
         output = output.permute(0, 1, 4, 2, 5, 3, 6).flatten(5, 6).flatten(3, 4)
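To make the `-1` concrete, here is a standalone sketch of the same unpatchify reshape on random data; the dimensions are arbitrary and only show that the output channel count is inferred rather than taken from the (possibly doubled) input channel count:

```python
import torch

batch_size, num_frames, height, width = 1, 2, 8, 8
p = 2             # patch size
out_channels = 4  # output latent channels (for I2V, in_channels would be 2 * out_channels)

# proj_out produces out_channels * p * p features per patch token.
hidden_states = torch.randn(batch_size, num_frames * (height // p) * (width // p), out_channels * p * p)

output = hidden_states.reshape(batch_size, num_frames, height // p, width // p, -1, p, p)
output = output.permute(0, 1, 4, 2, 5, 3, 6).flatten(5, 6).flatten(3, 4)
print(output.shape)  # torch.Size([1, 2, 4, 8, 8]) -> [batch, frames, out_channels, height, width]
```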
         if not return_dict:
......
@@ -138,7 +138,11 @@ else:
         "AudioLDM2UNet2DConditionModel",
     ]
     _import_structure["blip_diffusion"] = ["BlipDiffusionPipeline"]
-    _import_structure["cogvideo"] = ["CogVideoXPipeline", "CogVideoXVideoToVideoPipeline"]
+    _import_structure["cogvideo"] = [
+        "CogVideoXPipeline",
+        "CogVideoXImageToVideoPipeline",
+        "CogVideoXVideoToVideoPipeline",
+    ]
     _import_structure["controlnet"].extend(
         [
             "BlipDiffusionControlNetPipeline",
@@ -461,7 +465,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
         )
         from .aura_flow import AuraFlowPipeline
         from .blip_diffusion import BlipDiffusionPipeline
-        from .cogvideo import CogVideoXPipeline, CogVideoXVideoToVideoPipeline
+        from .cogvideo import CogVideoXImageToVideoPipeline, CogVideoXPipeline, CogVideoXVideoToVideoPipeline
         from .controlnet import (
             BlipDiffusionControlNetPipeline,
             StableDiffusionControlNetImg2ImgPipeline,
......
@@ -23,6 +23,7 @@ except OptionalDependencyNotAvailable:
     _dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
 else:
     _import_structure["pipeline_cogvideox"] = ["CogVideoXPipeline"]
+    _import_structure["pipeline_cogvideox_image2video"] = ["CogVideoXImageToVideoPipeline"]
     _import_structure["pipeline_cogvideox_video2video"] = ["CogVideoXVideoToVideoPipeline"]

 if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
@@ -34,6 +35,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
         from ...utils.dummy_torch_and_transformers_objects import *
     else:
         from .pipeline_cogvideox import CogVideoXPipeline
+        from .pipeline_cogvideox_image2video import CogVideoXImageToVideoPipeline
         from .pipeline_cogvideox_video2video import CogVideoXVideoToVideoPipeline

 else:
......
@@ -272,6 +272,21 @@ class CLIPImageProjection(metaclass=DummyObject):
         requires_backends(cls, ["torch", "transformers"])

+class CogVideoXImageToVideoPipeline(metaclass=DummyObject):
+    _backends = ["torch", "transformers"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch", "transformers"])
+
+    @classmethod
+    def from_config(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
 class CogVideoXPipeline(metaclass=DummyObject):
     _backends = ["torch", "transformers"]
......
# Copyright 2024 The HuggingFace Team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import inspect
import unittest
import numpy as np
import torch
from PIL import Image
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import AutoencoderKLCogVideoX, CogVideoXImageToVideoPipeline, CogVideoXTransformer3DModel, DDIMScheduler
from diffusers.utils import load_image
from diffusers.utils.testing_utils import (
enable_full_determinism,
numpy_cosine_similarity_distance,
require_torch_gpu,
slow,
torch_device,
)
from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
from ..test_pipelines_common import (
PipelineTesterMixin,
check_qkv_fusion_matches_attn_procs_length,
check_qkv_fusion_processors_exist,
to_np,
)
enable_full_determinism()
class CogVideoXPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
pipeline_class = CogVideoXImageToVideoPipeline
params = TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs"}
batch_params = TEXT_TO_IMAGE_BATCH_PARAMS.union({"image"})
image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
required_optional_params = frozenset(
[
"num_inference_steps",
"generator",
"latents",
"return_dict",
"callback_on_step_end",
"callback_on_step_end_tensor_inputs",
]
)
test_xformers_attention = False
def get_dummy_components(self):
torch.manual_seed(0)
transformer = CogVideoXTransformer3DModel(
# The product of num_attention_heads * attention_head_dim must be divisible by 16 for 3D positional embeddings.
# But, since we are using tiny-random-t5 here, we need the internal dim of CogVideoXTransformer3DModel
# to be 32. The internal dim is the product of num_attention_heads and attention_head_dim.
# Note: The num_attention_heads and attention_head_dim are different from the T2V and I2V tests because
# attention_head_dim must be divisible by 16 for RoPE to work. We also need to maintain a product of 32 as
# detailed above.
num_attention_heads=2,
attention_head_dim=16,
in_channels=8,
out_channels=4,
time_embed_dim=2,
text_embed_dim=32, # Must match with tiny-random-t5
num_layers=1,
sample_width=2, # latent width: 2 -> final width: 16
sample_height=2, # latent height: 2 -> final height: 16
sample_frames=9, # latent frames: (9 - 1) / 4 + 1 = 3 -> final frames: 9
patch_size=2,
temporal_compression_ratio=4,
max_text_seq_length=16,
use_rotary_positional_embeddings=True,
use_learned_positional_embeddings=True,
)
torch.manual_seed(0)
vae = AutoencoderKLCogVideoX(
in_channels=3,
out_channels=3,
down_block_types=(
"CogVideoXDownBlock3D",
"CogVideoXDownBlock3D",
"CogVideoXDownBlock3D",
"CogVideoXDownBlock3D",
),
up_block_types=(
"CogVideoXUpBlock3D",
"CogVideoXUpBlock3D",
"CogVideoXUpBlock3D",
"CogVideoXUpBlock3D",
),
block_out_channels=(8, 8, 8, 8),
latent_channels=4,
layers_per_block=1,
norm_num_groups=2,
temporal_compression_ratio=4,
)
torch.manual_seed(0)
scheduler = DDIMScheduler()
text_encoder = T5EncoderModel.from_pretrained("hf-internal-testing/tiny-random-t5")
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-t5")
components = {
"transformer": transformer,
"vae": vae,
"scheduler": scheduler,
"text_encoder": text_encoder,
"tokenizer": tokenizer,
}
return components
def get_dummy_inputs(self, device, seed=0):
if str(device).startswith("mps"):
generator = torch.manual_seed(seed)
else:
generator = torch.Generator(device=device).manual_seed(seed)
# Cannot reduce below 16 because convolution kernel becomes bigger than sample
# Cannot reduce below 32 because 3D RoPE errors out
image_height = 16
image_width = 16
image = Image.new("RGB", (image_width, image_height))
inputs = {
"image": image,
"prompt": "dance monkey",
"negative_prompt": "",
"generator": generator,
"num_inference_steps": 2,
"guidance_scale": 6.0,
"height": image_height,
"width": image_width,
"num_frames": 8,
"max_sequence_length": 16,
"output_type": "pt",
}
return inputs
def test_inference(self):
device = "cpu"
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe.to(device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
video = pipe(**inputs).frames
generated_video = video[0]
self.assertEqual(generated_video.shape, (8, 3, 16, 16))
expected_video = torch.randn(8, 3, 16, 16)
max_diff = np.abs(generated_video - expected_video).max()
self.assertLessEqual(max_diff, 1e10)
def test_callback_inputs(self):
sig = inspect.signature(self.pipeline_class.__call__)
has_callback_tensor_inputs = "callback_on_step_end_tensor_inputs" in sig.parameters
has_callback_step_end = "callback_on_step_end" in sig.parameters
if not (has_callback_tensor_inputs and has_callback_step_end):
return
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
self.assertTrue(
hasattr(pipe, "_callback_tensor_inputs"),
f" {self.pipeline_class} should have `_callback_tensor_inputs` that defines a list of tensor variables its callback function can use as inputs",
)
def callback_inputs_subset(pipe, i, t, callback_kwargs):
# iterate over callback args
for tensor_name, tensor_value in callback_kwargs.items():
# check that we're only passing in allowed tensor inputs
assert tensor_name in pipe._callback_tensor_inputs
return callback_kwargs
def callback_inputs_all(pipe, i, t, callback_kwargs):
for tensor_name in pipe._callback_tensor_inputs:
assert tensor_name in callback_kwargs
# iterate over callback args
for tensor_name, tensor_value in callback_kwargs.items():
# check that we're only passing in allowed tensor inputs
assert tensor_name in pipe._callback_tensor_inputs
return callback_kwargs
inputs = self.get_dummy_inputs(torch_device)
# Test passing in a subset
inputs["callback_on_step_end"] = callback_inputs_subset
inputs["callback_on_step_end_tensor_inputs"] = ["latents"]
output = pipe(**inputs)[0]
# Test passing in everything
inputs["callback_on_step_end"] = callback_inputs_all
inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
output = pipe(**inputs)[0]
def callback_inputs_change_tensor(pipe, i, t, callback_kwargs):
is_last = i == (pipe.num_timesteps - 1)
if is_last:
callback_kwargs["latents"] = torch.zeros_like(callback_kwargs["latents"])
return callback_kwargs
inputs["callback_on_step_end"] = callback_inputs_change_tensor
inputs["callback_on_step_end_tensor_inputs"] = pipe._callback_tensor_inputs
output = pipe(**inputs)[0]
assert output.abs().sum() < 1e10
def test_inference_batch_single_identical(self):
self._test_inference_batch_single_identical(batch_size=3, expected_max_diff=1e-3)
def test_attention_slicing_forward_pass(
self, test_max_difference=True, test_mean_pixel_difference=True, expected_max_diff=1e-3
):
if not self.test_attention_slicing:
return
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
for component in pipe.components.values():
if hasattr(component, "set_default_attn_processor"):
component.set_default_attn_processor()
pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
generator_device = "cpu"
inputs = self.get_dummy_inputs(generator_device)
output_without_slicing = pipe(**inputs)[0]
pipe.enable_attention_slicing(slice_size=1)
inputs = self.get_dummy_inputs(generator_device)
output_with_slicing1 = pipe(**inputs)[0]
pipe.enable_attention_slicing(slice_size=2)
inputs = self.get_dummy_inputs(generator_device)
output_with_slicing2 = pipe(**inputs)[0]
if test_max_difference:
max_diff1 = np.abs(to_np(output_with_slicing1) - to_np(output_without_slicing)).max()
max_diff2 = np.abs(to_np(output_with_slicing2) - to_np(output_without_slicing)).max()
self.assertLess(
max(max_diff1, max_diff2),
expected_max_diff,
"Attention slicing should not affect the inference results",
)
def test_vae_tiling(self, expected_diff_max: float = 0.3):
# Note(aryan): Investigate why this needs a bit higher tolerance
generator_device = "cpu"
components = self.get_dummy_components()
# The transformer config is modified this way because the I2V transformer restricts generation to its default resolutions.
# See the if-statement on "self.use_learned_positional_embeddings" in CogVideoXPatchEmbed.
components["transformer"] = CogVideoXTransformer3DModel.from_config(
components["transformer"].config,
sample_height=16,
sample_width=16,
)
pipe = self.pipeline_class(**components)
pipe.to("cpu")
pipe.set_progress_bar_config(disable=None)
# Without tiling
inputs = self.get_dummy_inputs(generator_device)
inputs["height"] = inputs["width"] = 128
output_without_tiling = pipe(**inputs)[0]
# With tiling
pipe.vae.enable_tiling(
tile_sample_min_height=96,
tile_sample_min_width=96,
tile_overlap_factor_height=1 / 12,
tile_overlap_factor_width=1 / 12,
)
inputs = self.get_dummy_inputs(generator_device)
inputs["height"] = inputs["width"] = 128
output_with_tiling = pipe(**inputs)[0]
self.assertLess(
(to_np(output_without_tiling) - to_np(output_with_tiling)).max(),
expected_diff_max,
"VAE tiling should not affect the inference results",
)
def test_fused_qkv_projections(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
frames = pipe(**inputs).frames # [B, F, C, H, W]
original_image_slice = frames[0, -2:, -1, -3:, -3:]
pipe.fuse_qkv_projections()
assert check_qkv_fusion_processors_exist(
pipe.transformer
), "Something wrong with the fused attention processors. Expected all the attention processors to be fused."
assert check_qkv_fusion_matches_attn_procs_length(
pipe.transformer, pipe.transformer.original_attn_processors
), "Something wrong with the attention processors concerning the fused QKV projections."
inputs = self.get_dummy_inputs(device)
frames = pipe(**inputs).frames
image_slice_fused = frames[0, -2:, -1, -3:, -3:]
pipe.transformer.unfuse_qkv_projections()
inputs = self.get_dummy_inputs(device)
frames = pipe(**inputs).frames
image_slice_disabled = frames[0, -2:, -1, -3:, -3:]
assert np.allclose(
original_image_slice, image_slice_fused, atol=1e-3, rtol=1e-3
), "Fusion of QKV projections shouldn't affect the outputs."
assert np.allclose(
image_slice_fused, image_slice_disabled, atol=1e-3, rtol=1e-3
), "Outputs, with QKV projection fusion enabled, shouldn't change when fused QKV projections are disabled."
assert np.allclose(
original_image_slice, image_slice_disabled, atol=1e-2, rtol=1e-2
), "Original outputs should match when fused QKV projections are disabled."
@unittest.skip("The model 'THUDM/CogVideoX-5b-I2V' is not public yet.")
@slow
@require_torch_gpu
class CogVideoXImageToVideoPipelineIntegrationTests(unittest.TestCase):
prompt = "A painting of a squirrel eating a burger."
def setUp(self):
super().setUp()
gc.collect()
torch.cuda.empty_cache()
def tearDown(self):
super().tearDown()
gc.collect()
torch.cuda.empty_cache()
def test_cogvideox(self):
generator = torch.Generator("cpu").manual_seed(0)
pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
prompt = self.prompt
image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
videos = pipe(
image=image,
prompt=prompt,
height=480,
width=720,
num_frames=16,
generator=generator,
num_inference_steps=2,
output_type="pt",
).frames
video = videos[0]
expected_video = torch.randn(1, 16, 480, 720, 3).numpy()
max_diff = numpy_cosine_similarity_distance(video, expected_video)
assert max_diff < 1e-3, f"Max diff is too high. got {video}"