"...text-generation-inference.git" did not exist on "97e22369f46fd0a8085856d9798ef2f61946fa6c"
Unverified commit 7b904941 authored by Aryan, committed by GitHub

Cosmos (#10660)



* begin transformer conversion

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* update

* add conversion script

* add pipeline

* make fix-copies

* remove einops

* update docs

* gradient checkpointing

* add transformer test

* update

* debug

* remove prints

* match sigmas

* add vae pt. 1

* finish CV* vae

* update

* update

* update

* update

* update

* update

* make fix-copies

* update

* make fix-copies

* fix

* update

* update

* make fix-copies

* update

* update tests

* handle device and dtype for safety checker; required in latest diffusers

* remove enable_gqa and use repeat_interleave instead

* enforce safety checker; use dummy checker in fast tests

* add review suggestion for ONNX export
Co-Authored-By: Asfiya Baig <asfiyab@nvidia.com>

* fix safety_checker issues when not passed explicitly

We could either do what's done in this commit, or update the Cosmos examples to explicitly pass the safety checker

* use cosmos guardrail package

* auto format docs

* update conversion script to support 14B models

* update name CosmosPipeline -> CosmosTextToWorldPipeline

* update docs

* fix docs

* fix group offload test failing for vae

---------
Co-authored-by: Asfiya Baig <asfiyab@nvidia.com>
parent fb29132b
...@@ -295,6 +295,8 @@
    title: CogView4Transformer2DModel
  - local: api/models/consisid_transformer3d
    title: ConsisIDTransformer3DModel
  - local: api/models/cosmos_transformer3d
    title: CosmosTransformer3DModel
  - local: api/models/dit_transformer2d
    title: DiTTransformer2DModel
  - local: api/models/easyanimate_transformer3d
...@@ -363,6 +365,8 @@
    title: AutoencoderKLAllegro
  - local: api/models/autoencoderkl_cogvideox
    title: AutoencoderKLCogVideoX
  - local: api/models/autoencoderkl_cosmos
    title: AutoencoderKLCosmos
  - local: api/models/autoencoder_kl_hunyuan_video
    title: AutoencoderKLHunyuanVideo
  - local: api/models/autoencoderkl_ltx_video
...@@ -433,6 +437,8 @@
    title: ControlNet-XS with Stable Diffusion XL
  - local: api/pipelines/controlnet_union
    title: ControlNetUnion
  - local: api/pipelines/cosmos
    title: Cosmos
  - local: api/pipelines/dance_diffusion
    title: Dance Diffusion
  - local: api/pipelines/ddim
......
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderKLCosmos
The variational autoencoder (VAE) used in Cosmos, from NVIDIA's [Cosmos Tokenizers](https://github.com/NVIDIA/Cosmos-Tokenizer).
Supported models:
- [nvidia/Cosmos-1.0-Tokenizer-CV8x8x8](https://huggingface.co/nvidia/Cosmos-1.0-Tokenizer-CV8x8x8)
The model can be loaded with the following code snippet.
```python
from diffusers import AutoencoderKLCosmos
vae = AutoencoderKLCosmos.from_pretrained("nvidia/Cosmos-1.0-Tokenizer-CV8x8x8", subfolder="vae")
```
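For orientation, a minimal encode/decode round trip could look like the sketch below. The dummy input shape and the `latent_dist.sample()` / `.sample` accessors follow the generic diffusers VAE conventions documented under `AutoencoderKLOutput` and `DecoderOutput` further down; they are illustrative assumptions, not part of the original snippet.
```python
import torch

from diffusers import AutoencoderKLCosmos

vae = AutoencoderKLCosmos.from_pretrained("nvidia/Cosmos-1.0-Tokenizer-CV8x8x8", subfolder="vae")

# Dummy video in (batch, channels, frames, height, width) layout; the exact shape is an assumption.
video = torch.randn(1, 3, 9, 256, 256)

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # spatially and temporally compressed
    reconstruction = vae.decode(latents).sample

print(latents.shape, reconstruction.shape)
```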
## AutoencoderKLCosmos
[[autodoc]] AutoencoderKLCosmos
- decode
- encode
- all
## AutoencoderKLOutput
[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput
## DecoderOutput
[[autodoc]] models.autoencoders.vae.DecoderOutput
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# CosmosTransformer3DModel
A Diffusion Transformer model for 3D video-like data was introduced in [Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA.
The model can be loaded with the following code snippet.
```python
import torch

from diffusers import CosmosTransformer3DModel

transformer = CosmosTransformer3DModel.from_pretrained("nvidia/Cosmos-1.0-Diffusion-7B-Text2World", subfolder="transformer", torch_dtype=torch.bfloat16)
```
## CosmosTransformer3DModel
[[autodoc]] CosmosTransformer3DModel
## Transformer2DModelOutput
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->
# Cosmos
[Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA.
*Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.*
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
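For a quick sense of usage, a text-to-world generation call might look like the sketch below; the checkpoint id, prompt, and frame count are illustrative assumptions rather than values taken from this page.

```python
import torch

from diffusers import CosmosTextToWorldPipeline
from diffusers.utils import export_to_video

# Checkpoint id and settings below are assumptions for illustration.
pipe = CosmosTextToWorldPipeline.from_pretrained(
    "nvidia/Cosmos-1.0-Diffusion-7B-Text2World", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "A robot arm assembles a wooden chair in a sunlit workshop."
video = pipe(prompt=prompt, num_frames=121).frames[0]
export_to_video(video, "output.mp4", fps=30)
```

Per the commit notes above, the pipelines enforce a safety checker backed by the `cosmos_guardrail` package, so expect that extra dependency when running them.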
## CosmosTextToWorldPipeline
[[autodoc]] CosmosTextToWorldPipeline
- all
- __call__
## CosmosVideoToWorldPipeline
[[autodoc]] CosmosVideoToWorldPipeline
- all
- __call__
## CosmosPipelineOutput
[[autodoc]] pipelines.cosmos.pipeline_output.CosmosPipelineOutput
import argparse
import pathlib
from typing import Any, Dict
import torch
from accelerate import init_empty_weights
from huggingface_hub import snapshot_download
from transformers import T5EncoderModel, T5TokenizerFast
from diffusers import AutoencoderKLCosmos, CosmosTextToWorldPipeline, CosmosTransformer3DModel, EDMEulerScheduler
def remove_keys_(key: str, state_dict: Dict[str, Any]):
    state_dict.pop(key)


def update_state_dict_(state_dict: Dict[str, Any], old_key: str, new_key: str) -> Dict[str, Any]:
    state_dict[new_key] = state_dict.pop(old_key)


def rename_transformer_blocks_(key: str, state_dict: Dict[str, Any]):
    block_index = int(key.split(".")[1].removeprefix("block"))
    new_key = key

    old_prefix = f"blocks.block{block_index}"
    new_prefix = f"transformer_blocks.{block_index}"
    new_key = new_prefix + new_key.removeprefix(old_prefix)

    state_dict[new_key] = state_dict.pop(key)
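# Substring replacements applied to every key of the original checkpoint to map
# Cosmos parameter names onto the diffusers module layout.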
TRANSFORMER_KEYS_RENAME_DICT = {
    "t_embedder.1": "time_embed.t_embedder",
    "affline_norm": "time_embed.norm",
    ".blocks.0.block.attn": ".attn1",
    ".blocks.1.block.attn": ".attn2",
    ".blocks.2.block": ".ff",
    ".blocks.0.adaLN_modulation.1": ".norm1.linear_1",
    ".blocks.0.adaLN_modulation.2": ".norm1.linear_2",
    ".blocks.1.adaLN_modulation.1": ".norm2.linear_1",
    ".blocks.1.adaLN_modulation.2": ".norm2.linear_2",
    ".blocks.2.adaLN_modulation.1": ".norm3.linear_1",
    ".blocks.2.adaLN_modulation.2": ".norm3.linear_2",
    "to_q.0": "to_q",
    "to_q.1": "norm_q",
    "to_k.0": "to_k",
    "to_k.1": "norm_k",
    "to_v.0": "to_v",
    "layer1": "net.0.proj",
    "layer2": "net.2",
    "proj.1": "proj",
    "x_embedder": "patch_embed",
    "extra_pos_embedder": "learnable_pos_embed",
    "final_layer.adaLN_modulation.1": "norm_out.linear_1",
    "final_layer.adaLN_modulation.2": "norm_out.linear_2",
    "final_layer.linear": "proj_out",
}

TRANSFORMER_SPECIAL_KEYS_REMAP = {
    "blocks.block": rename_transformer_blocks_,
    "logvar.0.freqs": remove_keys_,
    "logvar.0.phases": remove_keys_,
    "logvar.1.weight": remove_keys_,
    "pos_embedder.seq": remove_keys_,
}
TRANSFORMER_CONFIGS = {
    "Cosmos-1.0-Diffusion-7B-Text2World": {
        "in_channels": 16,
        "out_channels": 16,
        "num_attention_heads": 32,
        "attention_head_dim": 128,
        "num_layers": 28,
        "mlp_ratio": 4.0,
        "text_embed_dim": 1024,
        "adaln_lora_dim": 256,
        "max_size": (128, 240, 240),
        "patch_size": (1, 2, 2),
        "rope_scale": (2.0, 1.0, 1.0),
        "concat_padding_mask": True,
        "extra_pos_embed_type": "learnable",
    },
    "Cosmos-1.0-Diffusion-7B-Video2World": {
        "in_channels": 16 + 1,
        "out_channels": 16,
        "num_attention_heads": 32,
        "attention_head_dim": 128,
        "num_layers": 28,
        "mlp_ratio": 4.0,
        "text_embed_dim": 1024,
        "adaln_lora_dim": 256,
        "max_size": (128, 240, 240),
        "patch_size": (1, 2, 2),
        "rope_scale": (2.0, 1.0, 1.0),
        "concat_padding_mask": True,
        "extra_pos_embed_type": "learnable",
    },
    "Cosmos-1.0-Diffusion-14B-Text2World": {
        "in_channels": 16,
        "out_channels": 16,
        "num_attention_heads": 40,
        "attention_head_dim": 128,
        "num_layers": 36,
        "mlp_ratio": 4.0,
        "text_embed_dim": 1024,
        "adaln_lora_dim": 256,
        "max_size": (128, 240, 240),
        "patch_size": (1, 2, 2),
        "rope_scale": (2.0, 2.0, 2.0),
        "concat_padding_mask": True,
        "extra_pos_embed_type": "learnable",
    },
    "Cosmos-1.0-Diffusion-14B-Video2World": {
        "in_channels": 16 + 1,
        "out_channels": 16,
        "num_attention_heads": 40,
        "attention_head_dim": 128,
        "num_layers": 36,
        "mlp_ratio": 4.0,
        "text_embed_dim": 1024,
        "adaln_lora_dim": 256,
        "max_size": (128, 240, 240),
        "patch_size": (1, 2, 2),
        "rope_scale": (2.0, 2.0, 2.0),
        "concat_padding_mask": True,
        "extra_pos_embed_type": "learnable",
    },
}
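# Substring replacements that map the original Cosmos tokenizer (VAE) keys onto the
# diffusers AutoencoderKLCosmos layout.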
VAE_KEYS_RENAME_DICT = {
    "down.0": "down_blocks.0",
    "down.1": "down_blocks.1",
    "down.2": "down_blocks.2",
    "up.0": "up_blocks.2",
    "up.1": "up_blocks.1",
    "up.2": "up_blocks.0",
    ".block.": ".resnets.",
    "downsample": "downsamplers.0",
    "upsample": "upsamplers.0",
    "mid.block_1": "mid_block.resnets.0",
    "mid.attn_1.0": "mid_block.attentions.0",
    "mid.attn_1.1": "mid_block.temp_attentions.0",
    "mid.block_2": "mid_block.resnets.1",
    ".q.conv3d": ".to_q",
    ".k.conv3d": ".to_k",
    ".v.conv3d": ".to_v",
    ".proj_out.conv3d": ".to_out.0",
    ".0.conv3d": ".conv_s",
    ".1.conv3d": ".conv_t",
    "conv1.conv3d": "conv1",
    "conv2.conv3d": "conv2",
    "conv3.conv3d": "conv3",
    "nin_shortcut.conv3d": "conv_shortcut",
    "quant_conv.conv3d": "quant_conv",
    "post_quant_conv.conv3d": "post_quant_conv",
}

VAE_SPECIAL_KEYS_REMAP = {
    "wavelets": remove_keys_,
    "_arange": remove_keys_,
    "patch_size_buffer": remove_keys_,
}

VAE_CONFIGS = {
    "CV8x8x8-0.1": {
        "name": "nvidia/Cosmos-0.1-Tokenizer-CV8x8x8",
        "diffusers_config": {
            "in_channels": 3,
            "out_channels": 3,
            "latent_channels": 16,
            "encoder_block_out_channels": (128, 256, 512, 512),
            "decode_block_out_channels": (256, 512, 512, 512),
            "attention_resolutions": (32,),
            "resolution": 1024,
            "num_layers": 2,
            "patch_size": 4,
            "patch_type": "haar",
            "scaling_factor": 1.0,
            "spatial_compression_ratio": 8,
            "temporal_compression_ratio": 8,
            "latents_mean": None,
            "latents_std": None,
        },
    },
    "CV8x8x8-1.0": {
        "name": "nvidia/Cosmos-1.0-Tokenizer-CV8x8x8",
        "diffusers_config": {
            "in_channels": 3,
            "out_channels": 3,
            "latent_channels": 16,
            "encoder_block_out_channels": (128, 256, 512, 512),
            "decode_block_out_channels": (256, 512, 512, 512),
            "attention_resolutions": (32,),
            "resolution": 1024,
            "num_layers": 2,
            "patch_size": 4,
            "patch_type": "haar",
            "scaling_factor": 1.0,
            "spatial_compression_ratio": 8,
            "temporal_compression_ratio": 8,
            "latents_mean": None,
            "latents_std": None,
        },
    },
}
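# Original checkpoints may nest the weights under a "model", "module", or "state_dict" key;
# unwrap them before converting.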
def get_state_dict(saved_dict: Dict[str, Any]) -> Dict[str, Any]:
    state_dict = saved_dict
    if "model" in saved_dict.keys():
        state_dict = state_dict["model"]
    if "module" in saved_dict.keys():
        state_dict = state_dict["module"]
    if "state_dict" in saved_dict.keys():
        state_dict = state_dict["state_dict"]
    return state_dict


def convert_transformer(transformer_type: str, ckpt_path: str):
    PREFIX_KEY = "net."
    original_state_dict = get_state_dict(torch.load(ckpt_path, map_location="cpu", weights_only=True))

    with init_empty_weights():
        config = TRANSFORMER_CONFIGS[transformer_type]
        transformer = CosmosTransformer3DModel(**config)

    for key in list(original_state_dict.keys()):
        new_key = key[:]
        if new_key.startswith(PREFIX_KEY):
            new_key = new_key.removeprefix(PREFIX_KEY)
        for replace_key, rename_key in TRANSFORMER_KEYS_RENAME_DICT.items():
            new_key = new_key.replace(replace_key, rename_key)
        update_state_dict_(original_state_dict, key, new_key)

    for key in list(original_state_dict.keys()):
        for special_key, handler_fn_inplace in TRANSFORMER_SPECIAL_KEYS_REMAP.items():
            if special_key not in key:
                continue
            handler_fn_inplace(key, original_state_dict)

    transformer.load_state_dict(original_state_dict, strict=True, assign=True)
    return transformer


def convert_vae(vae_type: str):
    model_name = VAE_CONFIGS[vae_type]["name"]
    snapshot_directory = snapshot_download(model_name, repo_type="model")
    directory = pathlib.Path(snapshot_directory)

    autoencoder_file = directory / "autoencoder.jit"
    mean_std_file = directory / "mean_std.pt"

    original_state_dict = torch.jit.load(autoencoder_file.as_posix()).state_dict()
    if mean_std_file.exists():
        mean_std = torch.load(mean_std_file, map_location="cpu", weights_only=True)
    else:
        mean_std = (None, None)

    config = VAE_CONFIGS[vae_type]["diffusers_config"]
    config.update(
        {
            "latents_mean": mean_std[0].detach().cpu().numpy().tolist(),
            "latents_std": mean_std[1].detach().cpu().numpy().tolist(),
        }
    )
    vae = AutoencoderKLCosmos(**config)

    for key in list(original_state_dict.keys()):
        new_key = key[:]
        for replace_key, rename_key in VAE_KEYS_RENAME_DICT.items():
            new_key = new_key.replace(replace_key, rename_key)
        update_state_dict_(original_state_dict, key, new_key)

    for key in list(original_state_dict.keys()):
        for special_key, handler_fn_inplace in VAE_SPECIAL_KEYS_REMAP.items():
            if special_key not in key:
                continue
            handler_fn_inplace(key, original_state_dict)

    vae.load_state_dict(original_state_dict, strict=True, assign=True)
    return vae
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--transformer_type", type=str, default=None, choices=list(TRANSFORMER_CONFIGS.keys()))
    parser.add_argument(
        "--transformer_ckpt_path", type=str, default=None, help="Path to original transformer checkpoint"
    )
    parser.add_argument("--vae_type", type=str, default=None, choices=list(VAE_CONFIGS.keys()), help="Type of VAE")
    parser.add_argument("--text_encoder_path", type=str, default="google-t5/t5-11b")
    parser.add_argument("--tokenizer_path", type=str, default="google-t5/t5-11b")
    parser.add_argument("--save_pipeline", action="store_true")
    parser.add_argument("--output_path", type=str, required=True, help="Path where converted model should be saved")
    parser.add_argument("--dtype", default="bf16", help="Torch dtype to save the transformer in.")
    return parser.parse_args()


DTYPE_MAPPING = {
    "fp32": torch.float32,
    "fp16": torch.float16,
    "bf16": torch.bfloat16,
}
if __name__ == "__main__":
    args = get_args()

    transformer = None
    dtype = DTYPE_MAPPING[args.dtype]

    if args.save_pipeline:
        assert args.transformer_ckpt_path is not None
        assert args.vae_type is not None
        assert args.text_encoder_path is not None
        assert args.tokenizer_path is not None

    if args.transformer_ckpt_path is not None:
        transformer = convert_transformer(args.transformer_type, args.transformer_ckpt_path)
        transformer = transformer.to(dtype=dtype)
        if not args.save_pipeline:
            transformer.save_pretrained(args.output_path, safe_serialization=True, max_shard_size="5GB")

    if args.vae_type is not None:
        vae = convert_vae(args.vae_type)
        if not args.save_pipeline:
            vae.save_pretrained(args.output_path, safe_serialization=True, max_shard_size="5GB")

    if args.save_pipeline:
        text_encoder = T5EncoderModel.from_pretrained(args.text_encoder_path, torch_dtype=dtype)
        tokenizer = T5TokenizerFast.from_pretrained(args.tokenizer_path)
        # The original code initializes the EDM config with sigma_min=0.0002 but never uses that value directly,
        # so the sigma_min that actually takes effect is the default of 0.002.
        scheduler = EDMEulerScheduler(
            sigma_min=0.002,
            sigma_max=80,
            sigma_data=0.5,
            sigma_schedule="karras",
            num_train_timesteps=1000,
            prediction_type="epsilon",
            rho=7.0,
            final_sigmas_type="sigma_min",
        )

        pipe = CosmosTextToWorldPipeline(
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            transformer=transformer,
            vae=vae,
            scheduler=scheduler,
        )
        pipe.save_pretrained(args.output_path, safe_serialization=True, max_shard_size="5GB")
...@@ -148,6 +148,7 @@ else:
        "AutoencoderKL",
        "AutoencoderKLAllegro",
        "AutoencoderKLCogVideoX",
        "AutoencoderKLCosmos",
        "AutoencoderKLHunyuanVideo",
        "AutoencoderKLLTXVideo",
        "AutoencoderKLMagvit",
...@@ -166,6 +167,7 @@ else:
        "ControlNetModel",
        "ControlNetUnionModel",
        "ControlNetXSAdapter",
        "CosmosTransformer3DModel",
        "DiTTransformer2DModel",
        "EasyAnimateTransformer3DModel",
        "FluxControlNetModel",
...@@ -357,6 +359,9 @@ else:
        "CogView3PlusPipeline",
        "CogView4ControlPipeline",
        "CogView4Pipeline",
        "ConsisIDPipeline",
        "CosmosTextToWorldPipeline",
        "CosmosVideoToWorldPipeline",
        "CycleDiffusionPipeline",
        "EasyAnimateControlPipeline",
        "EasyAnimateInpaintPipeline",
...@@ -745,6 +750,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        AutoencoderKL,
        AutoencoderKLAllegro,
        AutoencoderKLCogVideoX,
        AutoencoderKLCosmos,
        AutoencoderKLHunyuanVideo,
        AutoencoderKLLTXVideo,
        AutoencoderKLMagvit,
...@@ -763,6 +769,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        ControlNetModel,
        ControlNetUnionModel,
        ControlNetXSAdapter,
        CosmosTransformer3DModel,
        DiTTransformer2DModel,
        EasyAnimateTransformer3DModel,
        FluxControlNetModel,
...@@ -933,6 +940,9 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        CogView3PlusPipeline,
        CogView4ControlPipeline,
        CogView4Pipeline,
        ConsisIDPipeline,
        CosmosTextToWorldPipeline,
        CosmosVideoToWorldPipeline,
        CycleDiffusionPipeline,
        EasyAnimateControlPipeline,
        EasyAnimateInpaintPipeline,
......
...@@ -32,6 +32,7 @@ if is_torch_available():
    _import_structure["autoencoders.autoencoder_kl"] = ["AutoencoderKL"]
    _import_structure["autoencoders.autoencoder_kl_allegro"] = ["AutoencoderKLAllegro"]
    _import_structure["autoencoders.autoencoder_kl_cogvideox"] = ["AutoencoderKLCogVideoX"]
    _import_structure["autoencoders.autoencoder_kl_cosmos"] = ["AutoencoderKLCosmos"]
    _import_structure["autoencoders.autoencoder_kl_hunyuan_video"] = ["AutoencoderKLHunyuanVideo"]
    _import_structure["autoencoders.autoencoder_kl_ltx"] = ["AutoencoderKLLTXVideo"]
    _import_structure["autoencoders.autoencoder_kl_magvit"] = ["AutoencoderKLMagvit"]
...@@ -75,6 +76,7 @@ if is_torch_available():
    _import_structure["transformers.transformer_allegro"] = ["AllegroTransformer3DModel"]
    _import_structure["transformers.transformer_cogview3plus"] = ["CogView3PlusTransformer2DModel"]
    _import_structure["transformers.transformer_cogview4"] = ["CogView4Transformer2DModel"]
    _import_structure["transformers.transformer_cosmos"] = ["CosmosTransformer3DModel"]
    _import_structure["transformers.transformer_easyanimate"] = ["EasyAnimateTransformer3DModel"]
    _import_structure["transformers.transformer_flux"] = ["FluxTransformer2DModel"]
    _import_structure["transformers.transformer_hidream_image"] = ["HiDreamImageTransformer2DModel"]
...@@ -114,6 +116,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        AutoencoderKL,
        AutoencoderKLAllegro,
        AutoencoderKLCogVideoX,
        AutoencoderKLCosmos,
        AutoencoderKLHunyuanVideo,
        AutoencoderKLLTXVideo,
        AutoencoderKLMagvit,
...@@ -151,6 +154,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        CogView3PlusTransformer2DModel,
        CogView4Transformer2DModel,
        ConsisIDTransformer3DModel,
        CosmosTransformer3DModel,
        DiTTransformer2DModel,
        DualTransformer2DModel,
        EasyAnimateTransformer3DModel,
......
...@@ -203,8 +203,8 @@ class Attention(nn.Module):
            self.norm_q = nn.LayerNorm(dim_head * heads, eps=eps)
            self.norm_k = nn.LayerNorm(dim_head * kv_heads, eps=eps)
        elif qk_norm == "rms_norm":
            # old: self.norm_q = RMSNorm(dim_head, eps=eps)
            self.norm_q = RMSNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
            # old: self.norm_k = RMSNorm(dim_head, eps=eps)
            self.norm_k = RMSNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
        elif qk_norm == "rms_norm_across_heads":
            # LTX applies qk norm across all heads
            self.norm_q = RMSNorm(dim_head * heads, eps=eps)
......
...@@ -3,6 +3,7 @@ from .autoencoder_dc import AutoencoderDC
from .autoencoder_kl import AutoencoderKL
from .autoencoder_kl_allegro import AutoencoderKLAllegro
from .autoencoder_kl_cogvideox import AutoencoderKLCogVideoX
from .autoencoder_kl_cosmos import AutoencoderKLCosmos
from .autoencoder_kl_hunyuan_video import AutoencoderKLHunyuanVideo
from .autoencoder_kl_ltx import AutoencoderKLLTXVideo
from .autoencoder_kl_magvit import AutoencoderKLMagvit
......
This diff is collapsed.
...@@ -744,6 +744,17 @@ class DiagonalGaussianDistribution(object):
        return self.mean
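# Added in this PR for the Cosmos VAE: a pass-through "distribution" whose sample() and mode()
# simply return the encoder output unchanged, so no stochastic sampling is applied to the latents.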
class IdentityDistribution(object):
    def __init__(self, parameters: torch.Tensor):
        self.parameters = parameters

    def sample(self, generator: Optional[torch.Generator] = None) -> torch.Tensor:
        return self.parameters

    def mode(self) -> torch.Tensor:
        return self.parameters
class EncoderTiny(nn.Module):
    r"""
    The `EncoderTiny` layer is a simpler version of the `Encoder` layer.
......
...@@ -1204,7 +1204,7 @@ def apply_rotary_emb(
        x_real, x_imag = x.reshape(*x.shape[:-1], -1, 2).unbind(-1)  # [B, S, H, D//2]
        x_rotated = torch.stack([-x_imag, x_real], dim=-1).flatten(3)
    elif use_real_unbind_dim == -2:
        # Used for Stable Audio, OmniGen, CogView4 and Cosmos
        x_real, x_imag = x.reshape(*x.shape[:-1], 2, -1).unbind(-2)  # [B, S, H, D//2]
        x_rotated = torch.cat([-x_imag, x_real], dim=-1)
    else:
......
...@@ -19,6 +19,7 @@ if is_torch_available():
    from .transformer_allegro import AllegroTransformer3DModel
    from .transformer_cogview3plus import CogView3PlusTransformer2DModel
    from .transformer_cogview4 import CogView4Transformer2DModel
    from .transformer_cosmos import CosmosTransformer3DModel
    from .transformer_easyanimate import EasyAnimateTransformer3DModel
    from .transformer_flux import FluxTransformer2DModel
    from .transformer_hidream_image import HiDreamImageTransformer2DModel
......
This diff is collapsed.
...@@ -156,6 +156,8 @@ else:
    ]
    _import_structure["cogview3"] = ["CogView3PlusPipeline"]
    _import_structure["cogview4"] = ["CogView4Pipeline", "CogView4ControlPipeline"]
    _import_structure["consisid"] = ["ConsisIDPipeline"]
    _import_structure["cosmos"] = ["CosmosTextToWorldPipeline", "CosmosVideoToWorldPipeline"]
    _import_structure["controlnet"].extend(
        [
            "BlipDiffusionControlNetPipeline",
...@@ -546,6 +548,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
            StableDiffusionControlNetXSPipeline,
            StableDiffusionXLControlNetXSPipeline,
        )
        from .cosmos import CosmosTextToWorldPipeline, CosmosVideoToWorldPipeline
        from .deepfloyd_if import (
            IFImg2ImgPipeline,
            IFImg2ImgSuperResolutionPipeline,
......
from typing import TYPE_CHECKING

from ...utils import (
    DIFFUSERS_SLOW_IMPORT,
    OptionalDependencyNotAvailable,
    _LazyModule,
    get_objects_from_module,
    is_torch_available,
    is_transformers_available,
)


_dummy_objects = {}
_import_structure = {}

try:
    if not (is_transformers_available() and is_torch_available()):
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    from ...utils import dummy_torch_and_transformers_objects  # noqa F403

    _dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
    _import_structure["pipeline_cosmos_text2world"] = ["CosmosTextToWorldPipeline"]
    _import_structure["pipeline_cosmos_video2world"] = ["CosmosVideoToWorldPipeline"]

if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
    try:
        if not (is_transformers_available() and is_torch_available()):
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        from ...utils.dummy_torch_and_transformers_objects import *
    else:
        from .pipeline_cosmos_text2world import CosmosTextToWorldPipeline
        from .pipeline_cosmos_video2world import CosmosVideoToWorldPipeline
else:
    import sys

    sys.modules[__name__] = _LazyModule(
        __name__,
        globals()["__file__"],
        _import_structure,
        module_spec=__spec__,
    )

    for name, value in _dummy_objects.items():
        setattr(sys.modules[__name__], name, value)
This diff is collapsed.
This diff is collapsed.
from dataclasses import dataclass

import torch

from diffusers.utils import BaseOutput


@dataclass
class CosmosPipelineOutput(BaseOutput):
    r"""
    Output class for Cosmos pipelines.

    Args:
        frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
            List of video outputs - It can be a nested list of length `batch_size`, with each sub-list containing
            denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of
            shape `(batch_size, num_frames, channels, height, width)`.
    """

    frames: torch.Tensor
...@@ -144,7 +144,7 @@ class CosineDPMSolverMultistepScheduler(SchedulerMixin, ConfigMixin):
    # Copied from diffusers.schedulers.scheduling_edm_euler.EDMEulerScheduler.precondition_inputs
    def precondition_inputs(self, sample, sigma):
        # old: c_in = 1 / ((sigma**2 + self.config.sigma_data**2) ** 0.5)
        c_in = self._get_conditioning_c_in(sigma)
        scaled_sample = sample * c_in
        return scaled_sample
...@@ -568,5 +568,10 @@ class CosineDPMSolverMultistepScheduler(SchedulerMixin, ConfigMixin):
        noisy_samples = original_samples + noise * sigma
        return noisy_samples

    # Copied from diffusers.schedulers.scheduling_edm_euler.EDMEulerScheduler._get_conditioning_c_in
    def _get_conditioning_c_in(self, sigma):
        c_in = 1 / ((sigma**2 + self.config.sigma_data**2) ** 0.5)
        return c_in

    def __len__(self):
        return self.config.num_train_timesteps