"...text-generation-inference.git" did not exist on "97e22369f46fd0a8085856d9798ef2f61946fa6c"
Unverified commit 7b904941 authored by Aryan, committed by GitHub

Cosmos (#10660)



* begin transformer conversion

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* update

* add conversion script

* add pipeline

* make fix-copies

* remove einops

* update docs

* gradient checkpointing

* add transformer test

* update

* debug

* remove prints

* match sigmas

* add vae pt. 1

* finish CV* vae

* update

* update

* update

* update

* update

* update

* make fix-copies

* update

* make fix-copies

* fix

* update

* update

* make fix-copies

* update

* update tests

* handle device and dtype for safety checker; required in latest diffusers

* remove enable_gqa and use repeat_interleave instead

* enforce safety checker; use dummy checker in fast tests

* add review suggestion for ONNX export
Co-Authored-By: Asfiya Baig <asfiyab@nvidia.com>

* fix safety_checker issues when not passed explicitly

We could either do what's done in this commit, or update the Cosmos examples to explicitly pass the safety checker

* use cosmos guardrail package

* auto format docs

* update conversion script to support 14B models

* update name CosmosPipeline -> CosmosTextToWorldPipeline

* update docs

* fix docs

* fix group offload test failing for vae

---------
Co-authored-by: Asfiya Baig <asfiyab@nvidia.com>
parent fb29132b
...@@ -295,6 +295,8 @@
    title: CogView4Transformer2DModel
  - local: api/models/consisid_transformer3d
    title: ConsisIDTransformer3DModel
  - local: api/models/cosmos_transformer3d
    title: CosmosTransformer3DModel
  - local: api/models/dit_transformer2d
    title: DiTTransformer2DModel
  - local: api/models/easyanimate_transformer3d
...@@ -363,6 +365,8 @@
    title: AutoencoderKLAllegro
  - local: api/models/autoencoderkl_cogvideox
    title: AutoencoderKLCogVideoX
  - local: api/models/autoencoderkl_cosmos
    title: AutoencoderKLCosmos
  - local: api/models/autoencoder_kl_hunyuan_video
    title: AutoencoderKLHunyuanVideo
  - local: api/models/autoencoderkl_ltx_video
...@@ -433,6 +437,8 @@
    title: ControlNet-XS with Stable Diffusion XL
  - local: api/pipelines/controlnet_union
    title: ControlNetUnion
  - local: api/pipelines/cosmos
    title: Cosmos
  - local: api/pipelines/dance_diffusion
    title: Dance Diffusion
  - local: api/pipelines/ddim
......
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderKLCosmos
The variational autoencoder (VAE) used in Cosmos, from NVIDIA's [Cosmos Tokenizers](https://github.com/NVIDIA/Cosmos-Tokenizer).
Supported models:
- [nvidia/Cosmos-1.0-Tokenizer-CV8x8x8](https://huggingface.co/nvidia/Cosmos-1.0-Tokenizer-CV8x8x8)
The model can be loaded with the following code snippet.
```python
from diffusers import AutoencoderKLCosmos
vae = AutoencoderKLCosmos.from_pretrained("nvidia/Cosmos-1.0-Tokenizer-CV8x8x8", subfolder="vae")
```
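For orientation, a minimal encode/decode round trip could look like the sketch below. The dummy input shape and the `latent_dist.sample()` / `.sample` accessors follow the generic diffusers VAE conventions documented under `AutoencoderKLOutput` and `DecoderOutput` further down; they are illustrative assumptions, not part of the original snippet.
```python
import torch

from diffusers import AutoencoderKLCosmos

vae = AutoencoderKLCosmos.from_pretrained("nvidia/Cosmos-1.0-Tokenizer-CV8x8x8", subfolder="vae")

# Dummy video in (batch, channels, frames, height, width) layout; the exact shape is an assumption.
video = torch.randn(1, 3, 9, 256, 256)

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # spatially and temporally compressed
    reconstruction = vae.decode(latents).sample

print(latents.shape, reconstruction.shape)
```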
## AutoencoderKLCosmos
[[autodoc]] AutoencoderKLCosmos
- decode
- encode
- all
## AutoencoderKLOutput
[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput
## DecoderOutput
[[autodoc]] models.autoencoders.vae.DecoderOutput
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# CosmosTransformer3DModel
A Diffusion Transformer model for 3D video-like data was introduced in [Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA.
The model can be loaded with the following code snippet.
```python
import torch

from diffusers import CosmosTransformer3DModel

transformer = CosmosTransformer3DModel.from_pretrained("nvidia/Cosmos-1.0-Diffusion-7B-Text2World", subfolder="transformer", torch_dtype=torch.bfloat16)
```
## CosmosTransformer3DModel
[[autodoc]] CosmosTransformer3DModel
## Transformer2DModelOutput
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->
# Cosmos
[Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA.
*Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.*
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
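For a quick sense of usage, a text-to-world generation call might look like the sketch below; the checkpoint id, prompt, and frame count are illustrative assumptions rather than values taken from this page.

```python
import torch

from diffusers import CosmosTextToWorldPipeline
from diffusers.utils import export_to_video

# Checkpoint id and settings below are assumptions for illustration.
pipe = CosmosTextToWorldPipeline.from_pretrained(
    "nvidia/Cosmos-1.0-Diffusion-7B-Text2World", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "A robot arm assembles a wooden chair in a sunlit workshop."
video = pipe(prompt=prompt, num_frames=121).frames[0]
export_to_video(video, "output.mp4", fps=30)
```

Per the commit notes above, the pipelines enforce a safety checker backed by the `cosmos_guardrail` package, so expect that extra dependency when running them.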
## CosmosTextToWorldPipeline
[[autodoc]] CosmosTextToWorldPipeline
- all
- __call__
## CosmosVideoToWorldPipeline
[[autodoc]] CosmosVideoToWorldPipeline
- all
- __call__
## CosmosPipelineOutput
[[autodoc]] pipelines.cosmos.pipeline_output.CosmosPipelineOutput
import argparse
import pathlib
from typing import Any, Dict
import torch
from accelerate import init_empty_weights
from huggingface_hub import snapshot_download
from transformers import T5EncoderModel, T5TokenizerFast
from diffusers import AutoencoderKLCosmos, CosmosTextToWorldPipeline, CosmosTransformer3DModel, EDMEulerScheduler
def remove_keys_(key: str, state_dict: Dict[str, Any]):
    state_dict.pop(key)


def update_state_dict_(state_dict: Dict[str, Any], old_key: str, new_key: str) -> Dict[str, Any]:
    state_dict[new_key] = state_dict.pop(old_key)


def rename_transformer_blocks_(key: str, state_dict: Dict[str, Any]):
    block_index = int(key.split(".")[1].removeprefix("block"))
    new_key = key

    old_prefix = f"blocks.block{block_index}"
    new_prefix = f"transformer_blocks.{block_index}"
    new_key = new_prefix + new_key.removeprefix(old_prefix)

    state_dict[new_key] = state_dict.pop(key)
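# Substring replacements applied to every key of the original checkpoint to map
# Cosmos parameter names onto the diffusers module layout.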
TRANSFORMER_KEYS_RENAME_DICT = {
    "t_embedder.1": "time_embed.t_embedder",
    "affline_norm": "time_embed.norm",
    ".blocks.0.block.attn": ".attn1",
    ".blocks.1.block.attn": ".attn2",
    ".blocks.2.block": ".ff",
    ".blocks.0.adaLN_modulation.1": ".norm1.linear_1",
    ".blocks.0.adaLN_modulation.2": ".norm1.linear_2",
    ".blocks.1.adaLN_modulation.1": ".norm2.linear_1",
    ".blocks.1.adaLN_modulation.2": ".norm2.linear_2",
    ".blocks.2.adaLN_modulation.1": ".norm3.linear_1",
    ".blocks.2.adaLN_modulation.2": ".norm3.linear_2",
    "to_q.0": "to_q",
    "to_q.1": "norm_q",
    "to_k.0": "to_k",
    "to_k.1": "norm_k",
    "to_v.0": "to_v",
    "layer1": "net.0.proj",
    "layer2": "net.2",
    "proj.1": "proj",
    "x_embedder": "patch_embed",
    "extra_pos_embedder": "learnable_pos_embed",
    "final_layer.adaLN_modulation.1": "norm_out.linear_1",
    "final_layer.adaLN_modulation.2": "norm_out.linear_2",
    "final_layer.linear": "proj_out",
}

TRANSFORMER_SPECIAL_KEYS_REMAP = {
    "blocks.block": rename_transformer_blocks_,
    "logvar.0.freqs": remove_keys_,
    "logvar.0.phases": remove_keys_,
    "logvar.1.weight": remove_keys_,
    "pos_embedder.seq": remove_keys_,
}
TRANSFORMER_CONFIGS = {
    "Cosmos-1.0-Diffusion-7B-Text2World": {
        "in_channels": 16,
        "out_channels": 16,
        "num_attention_heads": 32,
        "attention_head_dim": 128,
        "num_layers": 28,
        "mlp_ratio": 4.0,
        "text_embed_dim": 1024,
        "adaln_lora_dim": 256,
        "max_size": (128, 240, 240),
        "patch_size": (1, 2, 2),
        "rope_scale": (2.0, 1.0, 1.0),
        "concat_padding_mask": True,
        "extra_pos_embed_type": "learnable",
    },
    "Cosmos-1.0-Diffusion-7B-Video2World": {
        "in_channels": 16 + 1,
        "out_channels": 16,
        "num_attention_heads": 32,
        "attention_head_dim": 128,
        "num_layers": 28,
        "mlp_ratio": 4.0,
        "text_embed_dim": 1024,
        "adaln_lora_dim": 256,
        "max_size": (128, 240, 240),
        "patch_size": (1, 2, 2),
        "rope_scale": (2.0, 1.0, 1.0),
        "concat_padding_mask": True,
        "extra_pos_embed_type": "learnable",
    },
    "Cosmos-1.0-Diffusion-14B-Text2World": {
        "in_channels": 16,
        "out_channels": 16,
        "num_attention_heads": 40,
        "attention_head_dim": 128,
        "num_layers": 36,
        "mlp_ratio": 4.0,
        "text_embed_dim": 1024,
        "adaln_lora_dim": 256,
        "max_size": (128, 240, 240),
        "patch_size": (1, 2, 2),
        "rope_scale": (2.0, 2.0, 2.0),
        "concat_padding_mask": True,
        "extra_pos_embed_type": "learnable",
    },
    "Cosmos-1.0-Diffusion-14B-Video2World": {
        "in_channels": 16 + 1,
        "out_channels": 16,
        "num_attention_heads": 40,
        "attention_head_dim": 128,
        "num_layers": 36,
        "mlp_ratio": 4.0,
        "text_embed_dim": 1024,
        "adaln_lora_dim": 256,
        "max_size": (128, 240, 240),
        "patch_size": (1, 2, 2),
        "rope_scale": (2.0, 2.0, 2.0),
        "concat_padding_mask": True,
        "extra_pos_embed_type": "learnable",
    },
}
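# Substring replacements that map the original Cosmos tokenizer (VAE) keys onto the
# diffusers AutoencoderKLCosmos layout.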
VAE_KEYS_RENAME_DICT = {
    "down.0": "down_blocks.0",
    "down.1": "down_blocks.1",
    "down.2": "down_blocks.2",
    "up.0": "up_blocks.2",
    "up.1": "up_blocks.1",
    "up.2": "up_blocks.0",
    ".block.": ".resnets.",
    "downsample": "downsamplers.0",
    "upsample": "upsamplers.0",
    "mid.block_1": "mid_block.resnets.0",
    "mid.attn_1.0": "mid_block.attentions.0",
    "mid.attn_1.1": "mid_block.temp_attentions.0",
    "mid.block_2": "mid_block.resnets.1",
    ".q.conv3d": ".to_q",
    ".k.conv3d": ".to_k",
    ".v.conv3d": ".to_v",
    ".proj_out.conv3d": ".to_out.0",
    ".0.conv3d": ".conv_s",
    ".1.conv3d": ".conv_t",
    "conv1.conv3d": "conv1",
    "conv2.conv3d": "conv2",
    "conv3.conv3d": "conv3",
    "nin_shortcut.conv3d": "conv_shortcut",
    "quant_conv.conv3d": "quant_conv",
    "post_quant_conv.conv3d": "post_quant_conv",
}

VAE_SPECIAL_KEYS_REMAP = {
    "wavelets": remove_keys_,
    "_arange": remove_keys_,
    "patch_size_buffer": remove_keys_,
}

VAE_CONFIGS = {
    "CV8x8x8-0.1": {
        "name": "nvidia/Cosmos-0.1-Tokenizer-CV8x8x8",
        "diffusers_config": {
            "in_channels": 3,
            "out_channels": 3,
            "latent_channels": 16,
            "encoder_block_out_channels": (128, 256, 512, 512),
            "decode_block_out_channels": (256, 512, 512, 512),
            "attention_resolutions": (32,),
            "resolution": 1024,
            "num_layers": 2,
            "patch_size": 4,
            "patch_type": "haar",
            "scaling_factor": 1.0,
            "spatial_compression_ratio": 8,
            "temporal_compression_ratio": 8,
            "latents_mean": None,
            "latents_std": None,
        },
    },
    "CV8x8x8-1.0": {
        "name": "nvidia/Cosmos-1.0-Tokenizer-CV8x8x8",
        "diffusers_config": {
            "in_channels": 3,
            "out_channels": 3,
            "latent_channels": 16,
            "encoder_block_out_channels": (128, 256, 512, 512),
            "decode_block_out_channels": (256, 512, 512, 512),
            "attention_resolutions": (32,),
            "resolution": 1024,
            "num_layers": 2,
            "patch_size": 4,
            "patch_type": "haar",
            "scaling_factor": 1.0,
            "spatial_compression_ratio": 8,
            "temporal_compression_ratio": 8,
            "latents_mean": None,
            "latents_std": None,
        },
    },
}
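# Original checkpoints may nest the weights under a "model", "module", or "state_dict" key;
# unwrap them before converting.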
def get_state_dict(saved_dict: Dict[str, Any]) -> Dict[str, Any]:
    state_dict = saved_dict
    if "model" in saved_dict.keys():
        state_dict = state_dict["model"]
    if "module" in saved_dict.keys():
        state_dict = state_dict["module"]
    if "state_dict" in saved_dict.keys():
        state_dict = state_dict["state_dict"]
    return state_dict


def convert_transformer(transformer_type: str, ckpt_path: str):
    PREFIX_KEY = "net."
    original_state_dict = get_state_dict(torch.load(ckpt_path, map_location="cpu", weights_only=True))

    with init_empty_weights():
        config = TRANSFORMER_CONFIGS[transformer_type]
        transformer = CosmosTransformer3DModel(**config)

    for key in list(original_state_dict.keys()):
        new_key = key[:]
        if new_key.startswith(PREFIX_KEY):
            new_key = new_key.removeprefix(PREFIX_KEY)
        for replace_key, rename_key in TRANSFORMER_KEYS_RENAME_DICT.items():
            new_key = new_key.replace(replace_key, rename_key)
        update_state_dict_(original_state_dict, key, new_key)

    for key in list(original_state_dict.keys()):
        for special_key, handler_fn_inplace in TRANSFORMER_SPECIAL_KEYS_REMAP.items():
            if special_key not in key:
                continue
            handler_fn_inplace(key, original_state_dict)

    transformer.load_state_dict(original_state_dict, strict=True, assign=True)
    return transformer


def convert_vae(vae_type: str):
    model_name = VAE_CONFIGS[vae_type]["name"]
    snapshot_directory = snapshot_download(model_name, repo_type="model")
    directory = pathlib.Path(snapshot_directory)

    autoencoder_file = directory / "autoencoder.jit"
    mean_std_file = directory / "mean_std.pt"

    original_state_dict = torch.jit.load(autoencoder_file.as_posix()).state_dict()
    if mean_std_file.exists():
        mean_std = torch.load(mean_std_file, map_location="cpu", weights_only=True)
    else:
        mean_std = (None, None)

    config = VAE_CONFIGS[vae_type]["diffusers_config"]
    config.update(
        {
            "latents_mean": mean_std[0].detach().cpu().numpy().tolist(),
            "latents_std": mean_std[1].detach().cpu().numpy().tolist(),
        }
    )
    vae = AutoencoderKLCosmos(**config)

    for key in list(original_state_dict.keys()):
        new_key = key[:]
        for replace_key, rename_key in VAE_KEYS_RENAME_DICT.items():
            new_key = new_key.replace(replace_key, rename_key)
        update_state_dict_(original_state_dict, key, new_key)

    for key in list(original_state_dict.keys()):
        for special_key, handler_fn_inplace in VAE_SPECIAL_KEYS_REMAP.items():
            if special_key not in key:
                continue
            handler_fn_inplace(key, original_state_dict)

    vae.load_state_dict(original_state_dict, strict=True, assign=True)
    return vae
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--transformer_type", type=str, default=None, choices=list(TRANSFORMER_CONFIGS.keys()))
    parser.add_argument(
        "--transformer_ckpt_path", type=str, default=None, help="Path to original transformer checkpoint"
    )
    parser.add_argument("--vae_type", type=str, default=None, choices=list(VAE_CONFIGS.keys()), help="Type of VAE")
    parser.add_argument("--text_encoder_path", type=str, default="google-t5/t5-11b")
    parser.add_argument("--tokenizer_path", type=str, default="google-t5/t5-11b")
    parser.add_argument("--save_pipeline", action="store_true")
    parser.add_argument("--output_path", type=str, required=True, help="Path where converted model should be saved")
    parser.add_argument("--dtype", default="bf16", help="Torch dtype to save the transformer in.")
    return parser.parse_args()


DTYPE_MAPPING = {
    "fp32": torch.float32,
    "fp16": torch.float16,
    "bf16": torch.bfloat16,
}
if __name__ == "__main__":
    args = get_args()

    transformer = None
    dtype = DTYPE_MAPPING[args.dtype]

    if args.save_pipeline:
        assert args.transformer_ckpt_path is not None
        assert args.vae_type is not None
        assert args.text_encoder_path is not None
        assert args.tokenizer_path is not None

    if args.transformer_ckpt_path is not None:
        transformer = convert_transformer(args.transformer_type, args.transformer_ckpt_path)
        transformer = transformer.to(dtype=dtype)
        if not args.save_pipeline:
            transformer.save_pretrained(args.output_path, safe_serialization=True, max_shard_size="5GB")

    if args.vae_type is not None:
        vae = convert_vae(args.vae_type)
        if not args.save_pipeline:
            vae.save_pretrained(args.output_path, safe_serialization=True, max_shard_size="5GB")

    if args.save_pipeline:
        text_encoder = T5EncoderModel.from_pretrained(args.text_encoder_path, torch_dtype=dtype)
        tokenizer = T5TokenizerFast.from_pretrained(args.tokenizer_path)
        # The original code initializes the EDM config with sigma_min=0.0002 but never uses that value directly,
        # so the sigma_min that actually takes effect is the default of 0.002.
        scheduler = EDMEulerScheduler(
            sigma_min=0.002,
            sigma_max=80,
            sigma_data=0.5,
            sigma_schedule="karras",
            num_train_timesteps=1000,
            prediction_type="epsilon",
            rho=7.0,
            final_sigmas_type="sigma_min",
        )

        pipe = CosmosTextToWorldPipeline(
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            transformer=transformer,
            vae=vae,
            scheduler=scheduler,
        )
        pipe.save_pretrained(args.output_path, safe_serialization=True, max_shard_size="5GB")
...@@ -148,6 +148,7 @@ else:
        "AutoencoderKL",
        "AutoencoderKLAllegro",
        "AutoencoderKLCogVideoX",
        "AutoencoderKLCosmos",
        "AutoencoderKLHunyuanVideo",
        "AutoencoderKLLTXVideo",
        "AutoencoderKLMagvit",
...@@ -166,6 +167,7 @@ else:
        "ControlNetModel",
        "ControlNetUnionModel",
        "ControlNetXSAdapter",
        "CosmosTransformer3DModel",
        "DiTTransformer2DModel",
        "EasyAnimateTransformer3DModel",
        "FluxControlNetModel",
...@@ -357,6 +359,9 @@ else:
        "CogView3PlusPipeline",
        "CogView4ControlPipeline",
        "CogView4Pipeline",
        "ConsisIDPipeline",
        "CosmosTextToWorldPipeline",
        "CosmosVideoToWorldPipeline",
        "CycleDiffusionPipeline",
        "EasyAnimateControlPipeline",
        "EasyAnimateInpaintPipeline",
...@@ -745,6 +750,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        AutoencoderKL,
        AutoencoderKLAllegro,
        AutoencoderKLCogVideoX,
        AutoencoderKLCosmos,
        AutoencoderKLHunyuanVideo,
        AutoencoderKLLTXVideo,
        AutoencoderKLMagvit,
...@@ -763,6 +769,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        ControlNetModel,
        ControlNetUnionModel,
        ControlNetXSAdapter,
        CosmosTransformer3DModel,
        DiTTransformer2DModel,
        EasyAnimateTransformer3DModel,
        FluxControlNetModel,
...@@ -933,6 +940,9 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        CogView3PlusPipeline,
        CogView4ControlPipeline,
        CogView4Pipeline,
        ConsisIDPipeline,
        CosmosTextToWorldPipeline,
        CosmosVideoToWorldPipeline,
        CycleDiffusionPipeline,
        EasyAnimateControlPipeline,
        EasyAnimateInpaintPipeline,
......
...@@ -32,6 +32,7 @@ if is_torch_available():
    _import_structure["autoencoders.autoencoder_kl"] = ["AutoencoderKL"]
    _import_structure["autoencoders.autoencoder_kl_allegro"] = ["AutoencoderKLAllegro"]
    _import_structure["autoencoders.autoencoder_kl_cogvideox"] = ["AutoencoderKLCogVideoX"]
    _import_structure["autoencoders.autoencoder_kl_cosmos"] = ["AutoencoderKLCosmos"]
    _import_structure["autoencoders.autoencoder_kl_hunyuan_video"] = ["AutoencoderKLHunyuanVideo"]
    _import_structure["autoencoders.autoencoder_kl_ltx"] = ["AutoencoderKLLTXVideo"]
    _import_structure["autoencoders.autoencoder_kl_magvit"] = ["AutoencoderKLMagvit"]
...@@ -75,6 +76,7 @@ if is_torch_available():
    _import_structure["transformers.transformer_allegro"] = ["AllegroTransformer3DModel"]
    _import_structure["transformers.transformer_cogview3plus"] = ["CogView3PlusTransformer2DModel"]
    _import_structure["transformers.transformer_cogview4"] = ["CogView4Transformer2DModel"]
    _import_structure["transformers.transformer_cosmos"] = ["CosmosTransformer3DModel"]
    _import_structure["transformers.transformer_easyanimate"] = ["EasyAnimateTransformer3DModel"]
    _import_structure["transformers.transformer_flux"] = ["FluxTransformer2DModel"]
    _import_structure["transformers.transformer_hidream_image"] = ["HiDreamImageTransformer2DModel"]
...@@ -114,6 +116,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        AutoencoderKL,
        AutoencoderKLAllegro,
        AutoencoderKLCogVideoX,
        AutoencoderKLCosmos,
        AutoencoderKLHunyuanVideo,
        AutoencoderKLLTXVideo,
        AutoencoderKLMagvit,
...@@ -151,6 +154,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        CogView3PlusTransformer2DModel,
        CogView4Transformer2DModel,
        ConsisIDTransformer3DModel,
        CosmosTransformer3DModel,
        DiTTransformer2DModel,
        DualTransformer2DModel,
        EasyAnimateTransformer3DModel,
......
...@@ -203,8 +203,8 @@ class Attention(nn.Module):
            self.norm_q = nn.LayerNorm(dim_head * heads, eps=eps)
            self.norm_k = nn.LayerNorm(dim_head * kv_heads, eps=eps)
        elif qk_norm == "rms_norm":
            # old: self.norm_q = RMSNorm(dim_head, eps=eps)
            self.norm_q = RMSNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
            # old: self.norm_k = RMSNorm(dim_head, eps=eps)
            self.norm_k = RMSNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
        elif qk_norm == "rms_norm_across_heads":
            # LTX applies qk norm across all heads
            self.norm_q = RMSNorm(dim_head * heads, eps=eps)
......
...@@ -3,6 +3,7 @@ from .autoencoder_dc import AutoencoderDC
from .autoencoder_kl import AutoencoderKL
from .autoencoder_kl_allegro import AutoencoderKLAllegro
from .autoencoder_kl_cogvideox import AutoencoderKLCogVideoX
from .autoencoder_kl_cosmos import AutoencoderKLCosmos
from .autoencoder_kl_hunyuan_video import AutoencoderKLHunyuanVideo
from .autoencoder_kl_ltx import AutoencoderKLLTXVideo
from .autoencoder_kl_magvit import AutoencoderKLMagvit
......
This diff is collapsed.
...@@ -744,6 +744,17 @@ class DiagonalGaussianDistribution(object):
        return self.mean
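# Added in this PR for the Cosmos VAE: a pass-through "distribution" whose sample() and mode()
# simply return the encoder output unchanged, so no stochastic sampling is applied to the latents.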
class IdentityDistribution(object):
    def __init__(self, parameters: torch.Tensor):
        self.parameters = parameters

    def sample(self, generator: Optional[torch.Generator] = None) -> torch.Tensor:
        return self.parameters

    def mode(self) -> torch.Tensor:
        return self.parameters
class EncoderTiny(nn.Module):
    r"""
    The `EncoderTiny` layer is a simpler version of the `Encoder` layer.
......
...@@ -1204,7 +1204,7 @@ def apply_rotary_emb(
        x_real, x_imag = x.reshape(*x.shape[:-1], -1, 2).unbind(-1)  # [B, S, H, D//2]
        x_rotated = torch.stack([-x_imag, x_real], dim=-1).flatten(3)
    elif use_real_unbind_dim == -2:
        # Used for Stable Audio, OmniGen, CogView4 and Cosmos
        x_real, x_imag = x.reshape(*x.shape[:-1], 2, -1).unbind(-2)  # [B, S, H, D//2]
        x_rotated = torch.cat([-x_imag, x_real], dim=-1)
    else:
......
...@@ -19,6 +19,7 @@ if is_torch_available():
    from .transformer_allegro import AllegroTransformer3DModel
    from .transformer_cogview3plus import CogView3PlusTransformer2DModel
    from .transformer_cogview4 import CogView4Transformer2DModel
    from .transformer_cosmos import CosmosTransformer3DModel
    from .transformer_easyanimate import EasyAnimateTransformer3DModel
    from .transformer_flux import FluxTransformer2DModel
    from .transformer_hidream_image import HiDreamImageTransformer2DModel
......
This diff is collapsed.
...@@ -156,6 +156,8 @@ else:
    ]
    _import_structure["cogview3"] = ["CogView3PlusPipeline"]
    _import_structure["cogview4"] = ["CogView4Pipeline", "CogView4ControlPipeline"]
    _import_structure["consisid"] = ["ConsisIDPipeline"]
    _import_structure["cosmos"] = ["CosmosTextToWorldPipeline", "CosmosVideoToWorldPipeline"]
    _import_structure["controlnet"].extend(
        [
            "BlipDiffusionControlNetPipeline",
...@@ -546,6 +548,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
            StableDiffusionControlNetXSPipeline,
            StableDiffusionXLControlNetXSPipeline,
        )
        from .cosmos import CosmosTextToWorldPipeline, CosmosVideoToWorldPipeline
        from .deepfloyd_if import (
            IFImg2ImgPipeline,
            IFImg2ImgSuperResolutionPipeline,
......
from typing import TYPE_CHECKING

from ...utils import (
    DIFFUSERS_SLOW_IMPORT,
    OptionalDependencyNotAvailable,
    _LazyModule,
    get_objects_from_module,
    is_torch_available,
    is_transformers_available,
)


_dummy_objects = {}
_import_structure = {}

try:
    if not (is_transformers_available() and is_torch_available()):
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    from ...utils import dummy_torch_and_transformers_objects  # noqa F403

    _dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
    _import_structure["pipeline_cosmos_text2world"] = ["CosmosTextToWorldPipeline"]
    _import_structure["pipeline_cosmos_video2world"] = ["CosmosVideoToWorldPipeline"]

if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
    try:
        if not (is_transformers_available() and is_torch_available()):
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        from ...utils.dummy_torch_and_transformers_objects import *
    else:
        from .pipeline_cosmos_text2world import CosmosTextToWorldPipeline
        from .pipeline_cosmos_video2world import CosmosVideoToWorldPipeline
else:
    import sys

    sys.modules[__name__] = _LazyModule(
        __name__,
        globals()["__file__"],
        _import_structure,
        module_spec=__spec__,
    )

    for name, value in _dummy_objects.items():
        setattr(sys.modules[__name__], name, value)
This diff is collapsed.
This diff is collapsed.
from dataclasses import dataclass

import torch

from diffusers.utils import BaseOutput


@dataclass
class CosmosPipelineOutput(BaseOutput):
    r"""
    Output class for Cosmos pipelines.

    Args:
        frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
            List of video outputs - It can be a nested list of length `batch_size`, with each sub-list containing
            denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of
            shape `(batch_size, num_frames, channels, height, width)`.
    """

    frames: torch.Tensor
...@@ -144,7 +144,7 @@ class CosineDPMSolverMultistepScheduler(SchedulerMixin, ConfigMixin):
    # Copied from diffusers.schedulers.scheduling_edm_euler.EDMEulerScheduler.precondition_inputs
    def precondition_inputs(self, sample, sigma):
        # old: c_in = 1 / ((sigma**2 + self.config.sigma_data**2) ** 0.5)
        c_in = self._get_conditioning_c_in(sigma)
        scaled_sample = sample * c_in
        return scaled_sample
...@@ -568,5 +568,10 @@ class CosineDPMSolverMultistepScheduler(SchedulerMixin, ConfigMixin):
        noisy_samples = original_samples + noise * sigma
        return noisy_samples

    # Copied from diffusers.schedulers.scheduling_edm_euler.EDMEulerScheduler._get_conditioning_c_in
    def _get_conditioning_c_in(self, sigma):
        c_in = 1 / ((sigma**2 + self.config.sigma_data**2) ** 0.5)
        return c_in

    def __len__(self):
        return self.config.num_train_timesteps