Initial commit

Commit a44e71bd authored May 29, 2026 by weishb
8 changed files
--- a/LICENSE
+++ b/LICENSE
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright (c) 2025 Baidu, Inc. All Rights Reserved.
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
--- a/PixArt.py
+++ b/PixArt.py
+from abc import abstractmethod
+from functools import partial
+import math
+from typing import Iterable
+import numpy as np
+import torch as th
+import torch.nn as nn
+import torch.nn.functional as F
+from timm.models.vision_transformer import Attention, Mlp
+from positional_encodings.torch_encodings import PositionalEncoding1D
+from timm.models.layers import DropPath
+from .utils import auto_grad_checkpoint, to_2tuple
+from .PixArt_blocks import (
+    t2i_modulate,
+    WindowAttention,
+    MultiHeadCrossAttention,
+    T2IFinalLayer,
+    TimestepEmbedder,
+    FinalLayer,
+)
+class PatchEmbed(nn.Module):
+    """2D Image to Patch Embedding"""
+    def __init__(
+        self,
+        img_size=(256, 16),
+        patch_size=(16, 4),
+        overlap=(0, 0),
+        in_chans=128,
+        embed_dim=768,
+        norm_layer=None,
+        flatten=True,
+        bias=True,
+    ):
+        super().__init__()
+        self.img_size = img_size
+        self.patch_size = patch_size
+        self.ol = overlap
+        self.grid_size = (
+            math.ceil((img_size[0] - patch_size[0]) / (patch_size[0] - overlap[0])) + 1,
+            math.ceil((img_size[1] - patch_size[1]) / (patch_size[1] - overlap[1])) + 1,
+        )
+        self.pad_size = (
+            (self.grid_size[0] - 1) * (self.patch_size[0] - overlap[0])
+            + self.patch_size[0]
+            - self.img_size[0],
+            (self.grid_size[1] - 1) * (self.patch_size[1] - overlap[1])
+            + self.patch_size[1]
+            - self.img_size[1],
+        )
+        self.pad_size = (self.pad_size[0] // 2, self.pad_size[1] // 2)
+        self.num_patches = self.grid_size[0] * self.grid_size[1]
+        self.flatten = flatten
+        self.proj = nn.Conv2d(
+            in_chans,
+            embed_dim,
+            kernel_size=patch_size,
+            stride=(patch_size[0] - overlap[0], patch_size[1] - overlap[1]),
+            bias=bias,
+        )
+        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
+    def forward(self, x):
+        x = F.pad(
+            x,
+            (
+                self.pad_size[-1],
+                self.pad_size[-1],
+                self.pad_size[-2],
+                self.pad_size[-2],
+            ),
+            "constant",
+            0,
+        )
+        x = self.proj(x)
+        if self.flatten:
+            x = x.flatten(2).transpose(1, 2)  # BCHW -> BNC
+        x = self.norm(x)
+        return x
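The grid and padding arithmetic in `PatchEmbed.__init__` can be checked in isolation. The following is a minimal sketch in plain Python; the helper name `grid_and_pad` is hypothetical and not part of this commit, and it only mirrors the formulas above rather than the module itself.

```python
import math


def grid_and_pad(img_size, patch_size, overlap):
    """Mirror the grid_size / pad_size formulas of PatchEmbed (hypothetical helper)."""
    grid, pad = [], []
    for size, patch, ol in zip(img_size, patch_size, overlap):
        stride = patch - ol  # step between patch origins
        n = math.ceil((size - patch) / stride) + 1  # patches along this axis
        total_pad = (n - 1) * stride + patch - size  # extra length needed to fit n patches
        grid.append(n)
        pad.append(total_pad // 2)  # padding applied to each side
    return tuple(grid), tuple(pad)


# Default sizes from the constructor: the patches tile the input exactly.
default_grid, default_pad = grid_and_pad((256, 16), (16, 4), (0, 0))
# A height that is not a multiple of the patch height needs symmetric padding.
odd_grid, odd_pad = grid_and_pad((250, 16), (16, 4), (0, 0))
print(default_grid, default_pad, odd_grid, odd_pad)
```

With the default arguments the grid is 16 x 4 and no padding is added; the second call is an illustrative input, not a configuration taken from this repository.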
+class PatchEmbed_1D(nn.Module):
+    def __init__(
+        self,
+        img_size=(256, 16),
+        in_chans=8,
+        embed_dim=1152,
+        norm_layer=None,
+        bias=True,
+    ):
+        super().__init__()
+        self.proj = nn.Linear(in_chans * img_size[1], embed_dim, bias=bias)
+        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
+    def forward(self, x):
+        x = th.einsum("bctf->btfc", x)
+        x = x.flatten(2)  # BTFC -> BTD
+        x = self.proj(x)
+        x = self.norm(x)
+        return x
+def modulate(x, shift, scale):
+    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
+def t2i_modulate(x, shift, scale):
+    return x * (1 + scale) + shift
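The two helpers above differ only in the shape they expect for `shift` and `scale`. The sketch below restates them with NumPy (an assumption made to keep the example lightweight; the commit itself uses PyTorch) and checks that they agree once the token axis is added by hand. All sizes are illustrative.

```python
import numpy as np


def modulate(x, shift, scale):
    # shift/scale: (B, C); a token axis is inserted so they broadcast over N
    return x * (1 + scale[:, None, :]) + shift[:, None, :]


def t2i_modulate(x, shift, scale):
    # shift/scale: (B, 1, C); already broadcastable over N
    return x * (1 + scale) + shift


rng = np.random.default_rng(0)
B, N, C = 2, 5, 4  # illustrative batch, token, and channel sizes
x = rng.standard_normal((B, N, C))
shift = rng.standard_normal((B, C))
scale = rng.standard_normal((B, C))

out_a = modulate(x, shift, scale)
out_b = t2i_modulate(x, shift[:, None, :], scale[:, None, :])
same = np.allclose(out_a, out_b)
print(same)
```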
+class PixArtBlock(nn.Module):
+    """
+    A PixArt block with adaptive layer norm (adaLN-single) conditioning.
+    """
+    def __init__(
+        self,
+        hidden_size,
+        num_heads,
+        mlp_ratio=4.0,
+        drop_path=0.0,
+        window_size=0,
+        input_size=None,
+        use_rel_pos=False,
+        **block_kwargs
+    ):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.attn = WindowAttention(
+            hidden_size,
+            num_heads=num_heads,
+            qkv_bias=True,
+            input_size=input_size if window_size == 0 else (window_size, window_size),
+            use_rel_pos=use_rel_pos,
+            **block_kwargs
+        )
+        self.cross_attn = MultiHeadCrossAttention(
+            hidden_size, num_heads, **block_kwargs
+        )
+        self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        # for compatibility with older PyTorch versions
+        approx_gelu = lambda: nn.GELU(approximate="tanh")
+        self.mlp = Mlp(
+            in_features=hidden_size,
+            hidden_features=int(hidden_size * mlp_ratio),
+            act_layer=approx_gelu,
+            drop=0,
+        )
+        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+        self.window_size = window_size
+        self.scale_shift_table = nn.Parameter(
+            th.randn(6, hidden_size) / hidden_size**0.5
+        )
+    def forward(self, x, y, t, mask=None, **kwargs):
+        B, N, C = x.shape
+        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
+            self.scale_shift_table[None] + t.reshape(B, 6, -1)
+        ).chunk(6, dim=1)
+        x = x + self.drop_path(
+            gate_msa
+            * self.attn(t2i_modulate(self.norm1(x), shift_msa, scale_msa)).reshape(
+                B, N, C
+            )
+        )
+        x = x + self.cross_attn(x, y, mask)
+        x = x + self.drop_path(
+            gate_mlp * self.mlp(t2i_modulate(self.norm2(x), shift_mlp, scale_mlp))
+        )
+        return x
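The adaLN-single step in `PixArtBlock.forward` adds a per-block table to the shared timestep projection and splits the result into six modulation tensors. The sketch below reproduces only that shape bookkeeping with NumPy (an assumption; the commit uses PyTorch), with `t` standing in for the output of `t_block` and all sizes chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C = 2, 4  # illustrative batch size and hidden size
scale_shift_table = rng.standard_normal((6, C)) / C**0.5  # per-block table
t = rng.standard_normal((B, 6 * C))  # stand-in for the t_block output

# (1, 6, C) + (B, 6, C) -> (B, 6, C), then six chunks of shape (B, 1, C)
combined = scale_shift_table[None] + t.reshape(B, 6, -1)
chunks = np.split(combined, 6, axis=1)
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = chunks
print([c.shape for c in chunks])
```

Each chunk keeps a singleton token axis, which is why the block can pass it straight to `t2i_modulate` without calling `unsqueeze`.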
+from ldm.modules.diffusionmodules.attention import CrossAttention_1D
+class PixArtBlock_Slow(nn.Module):
+    """
+    A PixArt block variant with adaptive layer norm (adaLN-single) conditioning
+    that uses CrossAttention_1D for both attention layers.
+    """
+    def __init__(
+        self,
+        hidden_size,
+        num_heads,
+        mlp_ratio=4.0,
+        drop_path=0.0,
+        window_size=0,
+        input_size=None,
+        use_rel_pos=False,
+        **block_kwargs
+    ):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.attn = CrossAttention_1D(
+            query_dim=hidden_size,
+            context_dim=hidden_size,
+            heads=num_heads,
+            dim_head=int(hidden_size / num_heads),
+        )
+        self.cross_attn = CrossAttention_1D(
+            query_dim=hidden_size,
+            context_dim=hidden_size,
+            heads=num_heads,
+            dim_head=int(hidden_size / num_heads),
+        )
+        self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        # for compatibility with older PyTorch versions
+        approx_gelu = lambda: nn.GELU(approximate="tanh")
+        self.mlp = Mlp(
+            in_features=hidden_size,
+            hidden_features=int(hidden_size * mlp_ratio),
+            act_layer=approx_gelu,
+            drop=0,
+        )
+        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+        self.window_size = window_size
+        self.scale_shift_table = nn.Parameter(
+            th.randn(6, hidden_size) / hidden_size**0.5
+        )
+    def forward(self, x, y, t, mask=None, **kwargs):
+        B, N, C = x.shape
+        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
+            self.scale_shift_table[None] + t.reshape(B, 6, -1)
+        ).chunk(6, dim=1)
+        x = x + self.drop_path(
+            gate_msa
+            * self.attn(t2i_modulate(self.norm1(x), shift_msa, scale_msa)).reshape(
+                B, N, C
+            )
+        )
+        x = x + self.cross_attn(x, y, mask)
+        x = x + self.drop_path(
+            gate_mlp * self.mlp(t2i_modulate(self.norm2(x), shift_mlp, scale_mlp))
+        )
+        return x
+class PixArt(nn.Module):
+    """
+    Diffusion model with a Transformer backbone.
+    """
+    def __init__(
+        self,
+        input_size=(256, 16),
+        patch_size=(16, 4),
+        overlap=(0, 0),
+        in_channels=8,
+        hidden_size=1152,
+        depth=28,
+        num_heads=16,
+        mlp_ratio=4.0,
+        class_dropout_prob=0.1,
+        pred_sigma=True,
+        drop_path: float = 0.0,
+        window_size=0,
+        window_block_indexes=None,
+        use_rel_pos=False,
+        cond_dim=1024,
+        lewei_scale=1.0,
+        use_cfg=True,
+        cfg_scale=4.0,
+        config=None,
+        model_max_length=120,
+        **kwargs
+    ):
+        if window_block_indexes is None:
+            window_block_indexes = []
+        super().__init__()
+        self.use_cfg = use_cfg
+        self.cfg_scale = cfg_scale
+        self.input_size = input_size
+        self.pred_sigma = pred_sigma
+        self.in_channels = in_channels
+        self.out_channels = in_channels * 2 if pred_sigma else in_channels
+        self.patch_size = patch_size
+        self.num_heads = num_heads
+        self.lewei_scale = lewei_scale
+        self.x_embedder = PatchEmbed(
+            input_size, patch_size, overlap, in_channels, hidden_size, bias=True
+        )
+        self.t_embedder = TimestepEmbedder(hidden_size)
+        num_patches = self.x_embedder.num_patches
+        self.base_size = input_size[0] // self.patch_size[0] * 2
+        # Will use fixed sin-cos embedding:
+        self.register_buffer("pos_embed", th.zeros(1, num_patches, hidden_size))
+        approx_gelu = lambda: nn.GELU(approximate="tanh")
+        self.t_block = nn.Sequential(
+            nn.SiLU(), nn.Linear(hidden_size, 6 * hidden_size, bias=True)
+        )
+        self.y_embedder = nn.Linear(cond_dim, hidden_size)
+        drop_path = [
+            x.item() for x in th.linspace(0, drop_path, depth)
+        ]  # stochastic depth decay rule
+        self.blocks = nn.ModuleList(
+            [
+                PixArtBlock(
+                    hidden_size,
+                    num_heads,
+                    mlp_ratio=mlp_ratio,
+                    drop_path=drop_path[i],
+                    input_size=(
+                        self.x_embedder.grid_size[0],
+                        self.x_embedder.grid_size[1],
+                    ),
+                    window_size=0,
+                    use_rel_pos=False,
+                )
+                for i in range(depth)
+            ]
+        )
+        self.final_layer = T2IFinalLayer(hidden_size, patch_size, self.out_channels)
+        self.initialize_weights()
+    def forward(self, x, timestep, context_list, context_mask_list=None, **kwargs):
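+        """
+        Forward pass of PixArt.
+        x: (N, C, H, W) tensor of spatial inputs (images or latent representations of images)
+        timestep: (N,) tensor of diffusion timesteps
+        context_list: list whose first element is the conditioning tensor (last dimension cond_dim)
+        context_mask_list: list whose first element is the (N, L) mask for that tensor; required
+        """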
+        x = x.to(self.dtype)
+        timestep = timestep.to(self.dtype)
+        y = context_list[0].to(self.dtype)
+        pos_embed = self.pos_embed.to(self.dtype)
+        self.h, self.w = self.x_embedder.grid_size[0], self.x_embedder.grid_size[1]
+        x = self.x_embedder(x) + pos_embed
+        t = self.t_embedder(timestep.to(x.dtype))
+        t0 = self.t_block(t)
+        y = self.y_embedder(y)
+        mask = context_mask_list[0]
+        assert mask is not None
+        # if mask is not None:
+        y = y.masked_select(mask.unsqueeze(-1) != 0).view(1, -1, x.shape[-1])
+        y_lens = mask.sum(dim=1).tolist()
+        y_lens = [int(_) for _ in y_lens]
+        for block in self.blocks:
+            x = auto_grad_checkpoint(block, x, y, t0, y_lens)
+        x = self.final_layer(x, t)
+        x = self.unpatchify(x)
+        return x
+    def forward_with_dpmsolver(self, x, timestep, y, mask=None, **kwargs):
+        """
+        DPM-Solver does not need the variance prediction.
+        """
+        # https://github.com/openai/glide-text2im/blob/main/notebooks/text2im.ipynb
+        model_out = self.forward(x, timestep, y, mask)
+        return model_out.chunk(2, dim=1)[0]
+    def forward_with_cfg(self, x, timestep, y, cfg_scale, mask=None, **kwargs):
+        """
+        Forward pass of PixArt, but also batches the unconditional forward pass for classifier-free guidance.
+        """
+        # https://github.com/openai/glide-text2im/blob/main/notebooks/text2im.ipynb
+        half = x[: len(x) // 2]
+        combined = th.cat([half, half], dim=0)
+        model_out = self.forward(combined, timestep, y, mask)
+        model_out = model_out["x"] if isinstance(model_out, dict) else model_out
+        eps, rest = model_out[:, :8], model_out[:, 8:]
+        cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
+        half_eps = uncond_eps + cfg_scale * (cond_eps - uncond_eps)
+        eps = th.cat([half_eps, half_eps], dim=0)
+        return eps
+    def unpatchify(self, x):
+        """
+        x: (N, T, patch_size[0] * patch_size[1] * C)
+        imgs: (N, C, H, W), with H = grid_size[0] * patch_size[0] and W = grid_size[1] * patch_size[1]
+        """
+        c = self.out_channels
+        p0 = self.x_embedder.patch_size[0]
+        p1 = self.x_embedder.patch_size[1]
+        h, w = self.x_embedder.grid_size[0], self.x_embedder.grid_size[1]
+        x = x.reshape(shape=(x.shape[0], h, w, p0, p1, c))
+        x = th.einsum("nhwpqc->nchpwq", x)
+        imgs = x.reshape(shape=(x.shape[0], c, h * p0, w * p1))
+        return imgs
+    def initialize_weights(self):
+        # Initialize transformer layers:
+        def _basic_init(module):
+            if isinstance(module, nn.Linear):
+                th.nn.init.xavier_uniform_(module.weight)
+                if module.bias is not None:
+                    nn.init.constant_(module.bias, 0)
+        self.apply(_basic_init)
+        # Initialize (and freeze) pos_embed by sin-cos embedding:
+        pos_embed = get_2d_sincos_pos_embed(
+            self.pos_embed.shape[-1],
+            self.x_embedder.grid_size,
+            lewei_scale=self.lewei_scale,
+            base_size=self.base_size,
+        )
+        self.pos_embed.data.copy_(th.from_numpy(pos_embed).float().unsqueeze(0))
+        # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
+        w = self.x_embedder.proj.weight.data
+        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
+        # Initialize timestep embedding MLP:
+        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
+        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
+        nn.init.normal_(self.t_block[1].weight, std=0.02)
+        # Initialize caption embedding MLP:
+        nn.init.normal_(self.y_embedder.weight, std=0.02)
+        # Zero-out cross-attention output projections in PixArt blocks:
+        for block in self.blocks:
+            nn.init.constant_(block.cross_attn.proj.weight, 0)
+            nn.init.constant_(block.cross_attn.proj.bias, 0)
+        # Zero-out output layers:
+        nn.init.constant_(self.final_layer.linear.weight, 0)
+        nn.init.constant_(self.final_layer.linear.bias, 0)
+    @property
+    def dtype(self):
+        return next(self.parameters()).dtype
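The reshape / einsum / reshape sequence in `PixArt.unpatchify` above can be sanity-checked with a round trip. This is a minimal sketch using NumPy (an assumption; the commit uses PyTorch). The `patchify` helper is hypothetical and assumes non-overlapping patches; it is not how `PatchEmbed` produces tokens, it only provides an exact inverse for the test.

```python
import numpy as np


def patchify(imgs, p0, p1):
    """Exact inverse of unpatchify for non-overlapping patches (hypothetical helper)."""
    n, c, H, W = imgs.shape
    h, w = H // p0, W // p1
    x = imgs.reshape(n, c, h, p0, w, p1)
    x = np.einsum("nchpwq->nhwpqc", x)
    return x.reshape(n, h * w, p0 * p1 * c)


def unpatchify(x, h, w, p0, p1, c):
    """Same reshape / einsum / reshape sequence as PixArt.unpatchify."""
    x = x.reshape(x.shape[0], h, w, p0, p1, c)
    x = np.einsum("nhwpqc->nchpwq", x)
    return x.reshape(x.shape[0], c, h * p0, w * p1)


n, c = 2, 8    # illustrative batch size and channel count
p0, p1 = 16, 4  # patch size, as in the constructor defaults
h, w = 16, 4    # grid size for a 256 x 16 input
imgs = np.arange(n * c * h * p0 * w * p1, dtype=np.float32)
imgs = imgs.reshape(n, c, h * p0, w * p1)

tokens = patchify(imgs, p0, p1)
restored = unpatchify(tokens, h, w, p0, p1, c)
print(tokens.shape, restored.shape)
```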
+class SwiGLU(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        hidden_dim: int,
+        multiple_of: int,
+    ):
+        super().__init__()
+        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
+        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
+        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
+        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
+    def forward(self, x):
+        return self.w2(F.silu(self.w1(x)) * self.w3(x))
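The first line of `SwiGLU.__init__` rounds `hidden_dim` up to the next multiple of `multiple_of`. A minimal sketch of that integer arithmetic follows; the helper name `round_up_to_multiple` is hypothetical, and the second input is illustrative rather than a value used in this commit.

```python
def round_up_to_multiple(hidden_dim: int, multiple_of: int) -> int:
    """Same integer arithmetic as the first line of SwiGLU.__init__."""
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


# multiple_of=1 leaves the width unchanged (hidden_size=1152, mlp_ratio=4.0).
unchanged = round_up_to_multiple(int(1152 * 4.0), 1)
# A larger multiple rounds the width up to the next boundary.
rounded = round_up_to_multiple(3000, 256)
print(unchanged, rounded)
```

The blocks in this file construct `SwiGLU` with `multiple_of=1`, so the rounding is a no-op there.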
+class MDTBlock(nn.Module):
+    """
+    An MDT block: a PixArt-style block with adaptive layer norm (adaLN-single)
+    conditioning, a selectable FFN (Mlp or SwiGLU), and an optional skip connection.
+    """
+    def __init__(
+        self,
+        hidden_size,
+        num_heads,
+        mlp_ratio=4.0,
+        FFN_type="SwiGLU",
+        drop_path=0.0,
+        window_size=0,
+        input_size=None,
+        use_rel_pos=False,
+        skip=False,
+        **block_kwargs
+    ):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.attn = WindowAttention(
+            hidden_size,
+            num_heads=num_heads,
+            qkv_bias=True,
+            input_size=input_size if window_size == 0 else (window_size, window_size),
+            use_rel_pos=use_rel_pos,
+            **block_kwargs
+        )
+        self.cross_attn = MultiHeadCrossAttention(
+            hidden_size, num_heads, **block_kwargs
+        )
+        self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        # for compatibility with older PyTorch versions
+        approx_gelu = lambda: nn.GELU(approximate="tanh")
+        if FFN_type == "mlp":
+            self.mlp = Mlp(
+                in_features=hidden_size,
+                hidden_features=int(hidden_size * mlp_ratio),
+                act_layer=approx_gelu,
+                drop=0,
+            )
+        elif FFN_type == "SwiGLU":
+            self.mlp = SwiGLU(hidden_size, int(hidden_size * mlp_ratio), 1)
+        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+        self.window_size = window_size
+        self.scale_shift_table = nn.Parameter(
+            th.randn(6, hidden_size) / hidden_size**0.5
+        )
+        self.skip_linear = nn.Linear(2 * hidden_size, hidden_size) if skip else None
+    def forward(self, x, y, t, mask=None, skip=None, ids_keep=None, **kwargs):
+        B, N, C = x.shape
+        if self.skip_linear is not None:
+            x = self.skip_linear(th.cat([x, skip], dim=-1))
+        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
+            self.scale_shift_table[None] + t.reshape(B, 6, -1)
+        ).chunk(6, dim=1)
+        x = x + self.drop_path(
+            gate_msa
+            * self.attn(t2i_modulate(self.norm1(x), shift_msa, scale_msa)).reshape(
+                B, N, C
+            )
+        )
+        x = x + self.cross_attn(x, y, mask)
+        x = x + self.drop_path(
+            gate_mlp * self.mlp(t2i_modulate(self.norm2(x), shift_mlp, scale_mlp))
+        )
+        return x
+class DEBlock(nn.Module):
+    """
+    Decoder block with added SpecTNT transformer
+    """
+    def __init__(
+        self,
+        hidden_size,
+        num_heads,
+        mlp_ratio=4.0,
+        FFN_type="SwiGLU",
+        drop_path=0.0,
+        window_size=0,
+        input_size=None,
+        use_rel_pos=False,
+        skip=False,
+        num_f=None,
+        num_t=None,
+        **block_kwargs
+    ):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.attn = WindowAttention(
+            hidden_size,
+            num_heads=num_heads,
+            qkv_bias=True,
+            input_size=input_size if window_size == 0 else (window_size, window_size),
+            use_rel_pos=use_rel_pos,
+            **block_kwargs
+        )
+        self.cross_attn = MultiHeadCrossAttention(
+            hidden_size, num_heads, **block_kwargs
+        )
+        self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.norm3 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.norm4 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.norm5 = nn.LayerNorm(
+            hidden_size * num_f, elementwise_affine=False, eps=1e-6
+        )
+        self.norm6 = nn.LayerNorm(
+            hidden_size * num_f, elementwise_affine=False, eps=1e-6
+        )
+        # for compatibility with older PyTorch versions
+        approx_gelu = lambda: nn.GELU(approximate="tanh")
+        if FFN_type == "mlp":
+            self.mlp = Mlp(
+                in_features=hidden_size,
+                hidden_features=int(hidden_size * mlp_ratio),
+                act_layer=approx_gelu,
+                drop=0,
+            )
+        elif FFN_type == "SwiGLU":
+            self.mlp = SwiGLU(hidden_size, int(hidden_size * mlp_ratio), 1)
+        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+        self.window_size = window_size
+        self.scale_shift_table = nn.Parameter(
+            th.randn(6, hidden_size) / hidden_size**0.5
+        )
+        self.skip_linear = nn.Linear(2 * hidden_size, hidden_size) if skip else None
+        self.F_transformer = WindowAttention(
+            hidden_size,
+            num_heads=4,
+            qkv_bias=True,
+            input_size=input_size if window_size == 0 else (window_size, window_size),
+            use_rel_pos=use_rel_pos,
+            **block_kwargs
+        )
+        self.T_transformer = WindowAttention(
+            hidden_size * num_f,
+            num_heads=16,
+            qkv_bias=True,
+            input_size=input_size if window_size == 0 else (window_size, window_size),
+            use_rel_pos=use_rel_pos,
+            **block_kwargs
+        )
+        self.f_pos = nn.Embedding(num_f, hidden_size)
+        self.t_pos = nn.Embedding(num_t, hidden_size * num_f)
+        self.num_f = num_f
+        self.num_t = num_t
+    def forward(self, x, end, y, t, mask=None, skip=None, ids_keep=None, **kwargs):
+        B, D, C = x.shape
+        T = self.num_t
+        F_add_1 = self.num_f
+        x_normal = x
+        if self.skip_linear is not None:
+            x_normal = self.skip_linear(th.cat([x_normal, skip], dim=-1))
+        D = T * (F_add_1 - 1)
+        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
+            self.scale_shift_table[None] + t.reshape(B, 6, -1)
+        ).chunk(6, dim=1)
+        x_normal = x_normal + self.drop_path(
+            gate_msa
+            * self.attn(
+                t2i_modulate(self.norm1(x_normal), shift_msa, scale_msa)
+            ).reshape(B, D, C)
+        )
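+        # Split the token axis into the T (num_t) and F (num_f - 1) axes.
+        x_normal = x_normal.reshape(B, T, F_add_1 - 1, C)
+        # Append the extra `end` token along the F axis: (B, T, F_add_1, C).
+        x_normal = th.cat((x_normal, end), 2)
+        # Attend along the F axis, independently for each of the B * T rows.
+        x_normal = x_normal.reshape(B * T, F_add_1, C)
+        pos_f = th.arange(self.num_f, device=x.device).unsqueeze(0).expand(B * T, -1)
+        x_normal = x_normal + self.f_pos(pos_f)
+        x_normal = x_normal + self.F_transformer(self.norm3(x_normal))
+        # Attend along the T axis, with each step's F tokens flattened into channels.
+        x_normal = x_normal.reshape(B, T, F_add_1 * C)
+        pos_t = th.arange(self.num_t, device=x.device).unsqueeze(0).expand(B, -1)
+        x_normal = x_normal + self.t_pos(pos_t)
+        x_normal = x_normal + self.T_transformer(self.norm5(x_normal))
+        # Split the `end` token back off and restore the (B, T * (F_add_1 - 1), C) layout.
+        x_normal = x_normal.reshape(B, T, F_add_1, C)
+        end = x_normal[:, :, -1, :].unsqueeze(2)
+        x_normal = x_normal[:, :, :-1, :]
+        x_normal = x_normal.reshape(B, T * (F_add_1 - 1), C)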
+        x_normal = x_normal + self.cross_attn(x_normal, y, mask)
+        x_normal = x_normal + self.drop_path(
+            gate_mlp
+            * self.mlp(t2i_modulate(self.norm2(x_normal), shift_mlp, scale_mlp))
+        )
+        return x_normal, end
+class PixArt_MDT(nn.Module):
+    """
+    Diffusion model with a Transformer backbone.
+    """
+    def __init__(
+        self,
+        input_size=(256, 16),
+        patch_size=(16, 4),
+        overlap=(0, 0),
+        in_channels=8,
+        hidden_size=1152,
+        depth=28,
+        num_heads=16,
+        mlp_ratio=4.0,
+        class_dropout_prob=0.1,
+        pred_sigma=False,
+        drop_path: float = 0.0,
+        window_size=0,
+        window_block_indexes=None,
+        use_rel_pos=False,
+        cond_dim=1024,
+        lewei_scale=1.0,
+        use_cfg=True,
+        cfg_scale=4.0,
+        config=None,
+        model_max_length=120,
+        mask_ratio=None,
+        decode_layer=4,
+        **kwargs
+    ):
+        if window_block_indexes is None:
+            window_block_indexes = []
+        super().__init__()
+        self.use_cfg = use_cfg
+        self.cfg_scale = cfg_scale
+        self.input_size = input_size
+        self.pred_sigma = pred_sigma
+        self.in_channels = in_channels
+        self.out_channels = in_channels
+        self.patch_size = patch_size
+        self.num_heads = num_heads
+        self.lewei_scale = lewei_scale
+        decode_layer = int(decode_layer)
+        self.x_embedder = PatchEmbed(
+            input_size, patch_size, overlap, in_channels, hidden_size, bias=True
+        )
+        self.t_embedder = TimestepEmbedder(hidden_size)
+        num_patches = self.x_embedder.num_patches
+        self.base_size = input_size[0] // self.patch_size[0] * 2
+        # Will use fixed sin-cos embedding:
+        self.register_buffer("pos_embed", th.zeros(1, num_patches, hidden_size))
+        approx_gelu = lambda: nn.GELU(approximate="tanh")
+        self.t_block = nn.Sequential(
+            nn.SiLU(), nn.Linear(hidden_size, 6 * hidden_size, bias=True)
+        )
+        self.y_embedder = nn.Linear(cond_dim, hidden_size)
+        half_depth = (depth - decode_layer) // 2
+        self.half_depth = half_depth
+        drop_path_half = [
+            x.item() for x in th.linspace(0, drop_path, half_depth)
+        ]  # stochastic depth decay rule
+        drop_path_decode = [x.item() for x in th.linspace(0, drop_path, decode_layer)]
+        self.en_inblocks = nn.ModuleList(
+            [
+                MDTBlock(
+                    hidden_size,
+                    num_heads,
+                    mlp_ratio=mlp_ratio,
+                    drop_path=drop_path_half[i],
+                    input_size=(
+                        self.x_embedder.grid_size[0],
+                        self.x_embedder.grid_size[1],
+                    ),
+                    window_size=0,
+                    use_rel_pos=False,
+                    FFN_type="mlp",
+                )
+                for i in range(half_depth)
+            ]
+        )
+        self.en_outblocks = nn.ModuleList(
+            [
+                MDTBlock(
+                    hidden_size,
+                    num_heads,
+                    mlp_ratio=mlp_ratio,
+                    drop_path=drop_path_half[i],
+                    input_size=(
+                        self.x_embedder.grid_size[0],
+                        self.x_embedder.grid_size[1],
+                    ),
+                    window_size=0,
+                    use_rel_pos=False,
+                    skip=True,
+                    FFN_type="mlp",
+                )
+                for i in range(half_depth)
+            ]
+        )
+        self.de_blocks = nn.ModuleList(
+            [
+                MDTBlock(
+                    hidden_size,
+                    num_heads,
+                    mlp_ratio=mlp_ratio,
+                    drop_path=drop_path_decode[i],
+                    input_size=(
+                        self.x_embedder.grid_size[0],
+                        self.x_embedder.grid_size[1],
+                    ),
+                    window_size=0,
+                    use_rel_pos=False,
+                    skip=True,
+                    FFN_type="mlp",
+                )
+                for i in range(decode_layer)
+            ]
+        )
+        self.sideblocks = nn.ModuleList(
+            [
+                MDTBlock(
+                    hidden_size,
+                    num_heads,
+                    mlp_ratio=mlp_ratio,
+                    input_size=(
+                        self.x_embedder.grid_size[0],
+                        self.x_embedder.grid_size[1],
+                    ),
+                    window_size=0,
+                    use_rel_pos=False,
+                    FFN_type="mlp",
+                )
+                for _ in range(1)
+            ]
+        )
+        self.final_layer = T2IFinalLayer(hidden_size, patch_size, self.out_channels)
+        self.decoder_pos_embed = nn.Parameter(
+            th.zeros(1, num_patches, hidden_size), requires_grad=True
+        )
+        if mask_ratio is not None:
+            self.mask_token = nn.Parameter(th.zeros(1, 1, hidden_size))
+            self.mask_ratio = float(mask_ratio)
+            self.decode_layer = int(decode_layer)
+        else:
+            self.mask_token = nn.Parameter(
+                th.zeros(1, 1, hidden_size), requires_grad=False
+            )
+            self.mask_ratio = None
+            self.decode_layer = int(decode_layer)
+        self.initialize_weights()
+    def forward(self, x, t, context, mask=None, enable_mask=False, **kwargs):
+        """
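+        Forward pass of PixArt.
+        x: (N, C, H, W) tensor of spatial inputs (images or latent representations of images)
+        t: (N,) tensor of diffusion timesteps
+        context: (N, 1, 120, C) tensor of conditioning embeddings (referred to as y below)
+        mask: mask over the context tokens; nonzero entries mark valid tokens (required)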
+        """
+        x = x.to(self.dtype)
+        t = t.to(self.dtype)
+        y = context.to(self.dtype)
+        pos_embed = self.pos_embed.to(self.dtype)
+        self.h, self.w = self.x_embedder.grid_size[0], self.x_embedder.grid_size[1]
+        x = self.x_embedder(x) + pos_embed
+        t = self.t_embedder(t.to(x.dtype))
+        t0 = self.t_block(t)
+        y = self.y_embedder(y)
+        # The context mask is required: it selects the valid (non-padding) tokens.
+        assert mask is not None, "mask must be provided to select valid context tokens"
+        y = y.masked_select(mask.unsqueeze(-1) != 0).view(1, -1, x.shape[-1])
+        y_lens = mask.sum(dim=1).tolist()
+        y_lens = [int(_) for _ in y_lens]
+        input_skip = x
+        masked_stage = False
+        skips = []
+        # TODO : masking op for training
+        if self.mask_ratio is not None and self.training:
+            rand_mask_ratio = th.rand(1, device=x.device)
+            rand_mask_ratio = rand_mask_ratio * 0.2 + self.mask_ratio
+            x, mask, ids_restore, ids_keep = self.random_masking(x, rand_mask_ratio)
+            masked_stage = True
+        for block in self.en_inblocks:
+            if masked_stage:
+                x = auto_grad_checkpoint(block, x, y, t0, y_lens, ids_keep=ids_keep)
+            else:
+                x = auto_grad_checkpoint(block, x, y, t0, y_lens, ids_keep=None)
+            skips.append(x)
+        for block in self.en_outblocks:
+            if masked_stage:
+                x = auto_grad_checkpoint(
+                    block, x, y, t0, y_lens, skip=skips.pop(), ids_keep=ids_keep
+                )
+            else:
+                x = auto_grad_checkpoint(
+                    block, x, y, t0, y_lens, skip=skips.pop(), ids_keep=None
+                )
+        if self.mask_ratio is not None and self.training:
+            x = self.forward_side_interpolater(x, y, t0, y_lens, mask, ids_restore)
+            masked_stage = False
+        else:
+            # add pos embed
+            x = x + self.decoder_pos_embed
+        for i in range(len(self.de_blocks)):
+            block = self.de_blocks[i]
+            this_skip = input_skip
+            x = auto_grad_checkpoint(
+                block, x, y, t0, y_lens, skip=this_skip, ids_keep=None
+            )
+        x = self.final_layer(x, t)
+        x = self.unpatchify(x)
+        return x
+    def forward_with_dpmsolver(self, x, timestep, y, mask=None, **kwargs):
+        """
+        DPM-Solver does not need the variance prediction.
+        """
+        # https://github.com/openai/glide-text2im/blob/main/notebooks/text2im.ipynb
+        model_out = self.forward(x, timestep, y, mask)
+        return model_out.chunk(2, dim=1)[0]
+    def forward_with_cfg(
+        self, x, timestep, context_list, context_mask_list=None, cfg_scale=4.0, **kwargs
+    ):
+        """
+        Forward pass of PixArt, but also batches the unconditional forward pass for classifier-free guidance.
+        """
+        # https://github.com/openai/glide-text2im/blob/main/notebooks/text2im.ipynb
+        half = x[: len(x) // 2]
+        combined = th.cat([half, half], dim=0)
+        # forward() takes the context mask as `mask` and asserts it is not None.
+        model_out = self.forward(
+            combined, timestep, context_list, mask=context_mask_list
+        )
+        model_out = model_out["x"] if isinstance(model_out, dict) else model_out
+        eps, rest = model_out[:, :8], model_out[:, 8:]
+        cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
+        half_eps = uncond_eps + cfg_scale * (cond_eps - uncond_eps)
+        eps = th.cat([half_eps, half_eps], dim=0)
+        return eps
+    def unpatchify(self, x):
+        """
+        x: (N, T, patch_size[0] * patch_size[1] * C)
+        imgs: (N, C, H, W), e.g. (N, 8, 256, 16) with the default configuration
+        """
+        if self.x_embedder.ol == (0, 0) or self.x_embedder.ol == [0, 0]:
+            c = self.out_channels
+            p0 = self.x_embedder.patch_size[0]
+            p1 = self.x_embedder.patch_size[1]
+            h, w = self.x_embedder.grid_size[0], self.x_embedder.grid_size[1]
+            x = x.reshape(shape=(x.shape[0], h, w, p0, p1, c))
+            x = th.einsum("nhwpqc->nchpwq", x)
+            imgs = x.reshape(shape=(x.shape[0], c, h * p0, w * p1))
+            return imgs
+        lf = self.x_embedder.grid_size[0]
+        rf = self.x_embedder.grid_size[1]
+        lp = self.x_embedder.patch_size[0]
+        rp = self.x_embedder.patch_size[1]
+        lo = self.x_embedder.ol[0]
+        ro = self.x_embedder.ol[1]
+        lm = self.x_embedder.img_size[0]
+        rm = self.x_embedder.img_size[1]
+        lpad = self.x_embedder.pad_size[0]
+        rpad = self.x_embedder.pad_size[1]
+        bs = x.shape[0]
+        torch_map = self.torch_map
+        c = self.out_channels
+        x = x.reshape(shape=(bs, lf, rf, lp, rp, c))
+        x = th.einsum("nhwpqc->nchwpq", x)
+        added_map = th.zeros(bs, c, lm + 2 * lpad, rm + 2 * rpad).to(x.device)
+        for i in range(lf):
+            for j in range(rf):
+                xx = (i) * (lp - lo)
+                yy = (j) * (rp - ro)
+                added_map[:, :, xx : (xx + lp), yy : (yy + rp)] += x[:, :, i, j, :, :]
+        # Crop the padding from the two spatial axes (not the batch/channel axes).
+        added_map = added_map[:, :, lpad : lm + lpad, rpad : rm + rpad]
+        return th.mul(added_map.to(x.device), torch_map.to(x.device))
+    def random_masking(self, x, mask_ratio):
+        """
+        Perform per-sample random masking by per-sample shuffling.
+        Per-sample shuffling is done by argsort random noise.
+        x: [N, L, D], sequence
+        """
+        N, L, D = x.shape  # batch, length, dim
+        len_keep = int(L * (1 - mask_ratio))
+        noise = th.rand(N, L, device=x.device)
+        # sort noise for each sample
+        # ascend: small is keep, large is remove
+        ids_shuffle = th.argsort(noise, dim=1)
+        ids_restore = th.argsort(ids_shuffle, dim=1)
+        # keep the first subset
+        ids_keep = ids_shuffle[:, :len_keep]
+        x_masked = th.gather(x, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, D))
+        # generate the binary mask: 0 is keep, 1 is remove
+        mask = th.ones([N, L], device=x.device)
+        mask[:, :len_keep] = 0
+        # unshuffle to get the binary mask
+        mask = th.gather(mask, dim=1, index=ids_restore)
+        return x_masked, mask, ids_restore, ids_keep
+    def forward_side_interpolater(self, x, y, t0, y_lens, mask, ids_restore):
+        # append mask tokens to sequence
+        mask_tokens = self.mask_token.repeat(
+            x.shape[0], ids_restore.shape[1] - x.shape[1], 1
+        )
+        x_ = th.cat([x, mask_tokens], dim=1)
+        x = th.gather(
+            x_, dim=1, index=ids_restore.unsqueeze(-1).repeat(1, 1, x.shape[2])
+        )  # unshuffle
+        # add pos embed
+        x = x + self.decoder_pos_embed
+        # pass to the basic block
+        x_before = x
+        for sideblock in self.sideblocks:
+            x = sideblock(x, y, t0, y_lens, ids_keep=None)
+        # masked shortcut
+        mask = mask.unsqueeze(dim=-1)
+        x = x * mask + (1 - mask) * x_before
+        return x
+    def initialize_weights(self):
+        # Initialize transformer layers:
+        def _basic_init(module):
+            if isinstance(module, nn.Linear):
+                th.nn.init.xavier_uniform_(module.weight)
+                if module.bias is not None:
+                    nn.init.constant_(module.bias, 0)
+        self.apply(_basic_init)
+        # Initialize (and freeze) pos_embed by sin-cos embedding:
+        pos_embed = get_2d_sincos_pos_embed(
+            self.pos_embed.shape[-1],
+            self.x_embedder.grid_size,
+            lewei_scale=self.lewei_scale,
+            base_size=self.base_size,
+        )
+        self.pos_embed.data.copy_(th.from_numpy(pos_embed).float().unsqueeze(0))
+        # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
+        w = self.x_embedder.proj.weight.data
+        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
+        # Initialize timestep embedding MLP:
+        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
+        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
+        nn.init.normal_(self.t_block[1].weight, std=0.02)
+        # Initialize caption embedding MLP:
+        nn.init.normal_(self.y_embedder.weight, std=0.02)
+        # Zero-out the cross-attention output projections in all blocks:
+        for block in self.en_inblocks:
+            nn.init.constant_(block.cross_attn.proj.weight, 0)
+            nn.init.constant_(block.cross_attn.proj.bias, 0)
+        for block in self.en_outblocks:
+            nn.init.constant_(block.cross_attn.proj.weight, 0)
+            nn.init.constant_(block.cross_attn.proj.bias, 0)
+        for block in self.de_blocks:
+            nn.init.constant_(block.cross_attn.proj.weight, 0)
+            nn.init.constant_(block.cross_attn.proj.bias, 0)
+        for block in self.sideblocks:
+            nn.init.constant_(block.cross_attn.proj.weight, 0)
+            nn.init.constant_(block.cross_attn.proj.bias, 0)
+        # Zero-out output layers:
+        nn.init.constant_(self.final_layer.linear.weight, 0)
+        nn.init.constant_(self.final_layer.linear.bias, 0)
+        if self.x_embedder.ol == [0, 0] or self.x_embedder.ol == (0, 0):
+            return
+        lf = self.x_embedder.grid_size[0]
+        rf = self.x_embedder.grid_size[1]
+        lp = self.x_embedder.patch_size[0]
+        rp = self.x_embedder.patch_size[1]
+        lo = self.x_embedder.ol[0]
+        ro = self.x_embedder.ol[1]
+        lm = self.x_embedder.img_size[0]
+        rm = self.x_embedder.img_size[1]
+        lpad = self.x_embedder.pad_size[0]
+        rpad = self.x_embedder.pad_size[1]
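+        # Build the overlap-count map on the model's device rather than assuming CUDA.
+        torch_map = th.zeros(
+            lm + 2 * lpad, rm + 2 * rpad, device=self.pos_embed.device
+        )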
+        for i in range(lf):
+            for j in range(rf):
+                xx = (i) * (lp - lo)
+                yy = (j) * (rp - ro)
+                torch_map[xx : (xx + lp), yy : (yy + rp)] += 1
+        torch_map = torch_map[lpad : lm + lpad, rpad : rm + rpad]
+        self.torch_map = th.reciprocal(torch_map)
+    @property
+    def dtype(self):
+        return next(self.parameters()).dtype
+def get_2d_sincos_pos_embed(
+    embed_dim,
+    grid_size,
+    cls_token=False,
+    extra_tokens=0,
+    lewei_scale=1.0,
+    base_size_x=256 // 4,
+    base_size_y=16 // 4,
+    base_size=128,
+):
+    """
+    grid_size: int, or (height, width) tuple, giving the size of the patch grid
+    return:
+    pos_embed: [grid_h*grid_w, embed_dim], or [extra_tokens+grid_h*grid_w, embed_dim]
+        when cls_token is set and extra_tokens > 0
+    """
+    if isinstance(grid_size, int):
+        grid_size = to_2tuple(grid_size)
+    grid_h = (
+        np.arange(grid_size[0], dtype=np.float32)
+        / (grid_size[0] / base_size_x)
+        / lewei_scale
+    )
+    grid_w = (
+        np.arange(grid_size[1], dtype=np.float32)
+        / (grid_size[1] / base_size_y)
+        / lewei_scale
+    )
+    grid = np.meshgrid(grid_w, grid_h)  # here w goes first
+    grid = np.stack(grid, axis=0)
+    grid = grid.reshape([2, 1, grid_size[1], grid_size[0]])
+    pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
+    if cls_token and extra_tokens > 0:
+        pos_embed = np.concatenate(
+            [np.zeros([extra_tokens, embed_dim]), pos_embed], axis=0
+        )
+    return pos_embed
+def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
+    assert embed_dim % 2 == 0
+    # use half of dimensions to encode grid_h
+    emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])
+    emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])
+    return np.concatenate([emb_h, emb_w], axis=1)
+def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
+    """
+    embed_dim: output dimension for each position
+    pos: a list of positions to be encoded: size (M,)
+    out: (M, D)
+    """
+    assert embed_dim % 2 == 0
+    omega = np.arange(embed_dim // 2, dtype=np.float64)
+    omega /= embed_dim / 2.0
+    omega = 1.0 / 10000**omega
+    pos = pos.reshape(-1)
+    out = np.einsum("m,d->md", pos, omega)
+    emb_sin = np.sin(out)
+    emb_cos = np.cos(out)
+    return np.concatenate([emb_sin, emb_cos], axis=1)
--- a/PixArt_blocks.py
+++ b/PixArt_blocks.py
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+# --------------------------------------------------------
+# References:
+# GLIDE: https://github.com/openai/glide-text2im
+# MAE: https://github.com/facebookresearch/mae/blob/main/models_mae.py
+# --------------------------------------------------------
+import math
+import torch
+import torch.nn as nn
+from timm.models.vision_transformer import Mlp, Attention as Attention_
+from einops import rearrange, repeat
+import torch.nn.functional as F
+from .utils import add_decomposed_rel_pos
+def modulate(x, shift, scale):
+    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
+def t2i_modulate(x, shift, scale):
+    return x * (1 + scale) + shift
+class MultiHeadCrossAttention_org(nn.Module):
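+    def __init__(self, n_feat, n_head, attn_drop=0.0, proj_drop=0):
+        """Construct a MultiHeadCrossAttention_org object."""
+        # Zero-argument super() resolves to this class; naming
+        # MultiHeadCrossAttention here raises a TypeError on instantiation.
+        super().__init__()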
+        assert n_feat % n_head == 0
+        # We assume d_v always equals d_k
+        self.d_k = n_feat // n_head
+        self.h = n_head
+        self.linear_q = nn.Linear(n_feat, n_feat)
+        self.linear_k = nn.Linear(n_feat, n_feat)
+        self.linear_v = nn.Linear(n_feat, n_feat)
+        self.proj = nn.Linear(n_feat, n_feat)
+        self.attn = None
+        self.dropout = nn.Dropout(p=attn_drop)
+        self.proj_drop = nn.Dropout(proj_drop)
+    def forward_qkv(self, query, key, value):
+        """Transform query, key and value.
+        Args:
+            query (torch.Tensor): Query tensor (#batch, time1, size).
+            key (torch.Tensor): Key tensor (#batch, time2, size).
+            value (torch.Tensor): Value tensor (#batch, time2, size).
+        Returns:
+            torch.Tensor: Transformed query tensor (#batch, n_head, time1, d_k).
+            torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k).
+            torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).
+        """
+        n_batch = query.size(0)
+        q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)
+        k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)
+        v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)
+        q = q.transpose(1, 2)
+        k = k.transpose(1, 2)
+        v = v.transpose(1, 2)
+        return q, k, v
+    def forward_attention(self, value, scores, mask):
+        """Compute attention context vector.
+        Args:
+            value (torch.Tensor): Transformed value (#batch, n_head, time2, d_k).
+            scores (torch.Tensor): Attention score (#batch, n_head, time1, time2).
+            mask (torch.Tensor): Mask (#batch, 1, time2) or (#batch, time1, time2).
+        Returns:
+            torch.Tensor: Transformed value (#batch, time1, d_model)
+                weighted by the attention score (#batch, time1, time2).
+        """
+        n_batch = value.size(0)
+        if mask is not None:
+            mask = mask.unsqueeze(1).eq(0)
+            min_value = torch.finfo(scores.dtype).min
+            scores = scores.masked_fill(mask, min_value)
+            self.attn = torch.softmax(scores, dim=-1).masked_fill(mask, 0.0)
+        else:
+            self.attn = torch.softmax(scores, dim=-1)
+        p_attn = self.dropout(self.attn)
+        x = torch.matmul(p_attn, value)
+        x = x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
+        return self.proj_drop(self.proj(x))
+    def forward(self, x, cond, mask):
+        """Compute scaled dot product attention.
+        Args:
+            x (torch.Tensor): Query tensor (#batch, time1, size).
+            cond (torch.Tensor): Condition tensor, used as both key and value
+                (#batch, time2, size).
+            mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
+                (#batch, time1, time2).
+        Returns:
+            torch.Tensor: Output tensor (#batch, time1, d_model).
+        """
+        q, k, v = self.forward_qkv(x, cond, cond)
+        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
+        return self.forward_attention(v, scores, mask)
+class MultiHeadCrossAttention(nn.Module):
+    def __init__(
+        self, d_model, num_heads, attn_drop=0.0, proj_drop=0.0, **block_kwargs
+    ):
+        super(MultiHeadCrossAttention, self).__init__()
+        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
+        self.d_model = d_model
+        self.num_heads = num_heads
+        self.head_dim = d_model // num_heads
+        self.q_linear = nn.Linear(d_model, d_model)
+        self.kv_linear = nn.Linear(d_model, d_model * 2)
+        self.attn_drop = nn.Dropout(attn_drop)
+        self.proj = nn.Linear(d_model, d_model)
+        self.proj_drop = nn.Dropout(proj_drop)
+    def forward(self, x, cond, mask=None):
+        # query: img tokens; key/value: condition;
+        # mask: padding-mask tensor, or a list of per-sample kv lengths
+        B, N, C = x.shape
+        assert mask is not None
+        q = self.q_linear(x).view(B, N, self.num_heads, self.head_dim)
+        if isinstance(mask, list):
+            # mask = y_lens: list of kv sequence lengths per batch item.
+            # cond is (1, total_valid_tokens, C) — batch flattened, padding removed.
+            kv = self.kv_linear(cond).view(1, -1, 2, self.num_heads, self.head_dim)
+            k, v = kv.unbind(2)
+            total_kv = k.shape[1]
+            # (B, N, heads, dim) → (B, heads, N, dim)
+            q = q.permute(0, 2, 1, 3)
+            # (1, total_kv, heads, dim) → (B, heads, total_kv, dim)
+            k = k.permute(0, 2, 1, 3).expand(B, -1, -1, -1)
+            v = v.permute(0, 2, 1, 3).expand(B, -1, -1, -1)
+            # Build block-diagonal attention mask from sequence lengths
+            attn_mask = torch.full(
+                (B, 1, N, total_kv), float("-inf"), dtype=q.dtype, device=q.device
+            )
+            offset = 0
+            for i, kv_len in enumerate(mask):
+                attn_mask[i, :, :, offset:offset + kv_len] = 0
+                offset += kv_len
+        else:
+            # mask is a padding mask tensor (B, 1, N_kv)
+            kv = self.kv_linear(cond).view(B, -1, 2, self.num_heads, self.head_dim)
+            k, v = kv.unbind(2)
+            # (B, N, heads, dim) → (B, heads, N, dim)
+            q = q.permute(0, 2, 1, 3)
+            k = k.permute(0, 2, 1, 3)
+            v = v.permute(0, 2, 1, 3)
+            attn_mask = torch.zeros_like(mask, dtype=q.dtype)
+            attn_mask.masked_fill_(mask == 0, float("-inf"))
+            attn_mask = attn_mask.unsqueeze(1)
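+        # SDPA applies dropout whenever dropout_p > 0, so disable it outside training.
+        x = F.scaled_dot_product_attention(
+            q,
+            k,
+            v,
+            attn_mask=attn_mask,
+            dropout_p=self.attn_drop.p if self.training else 0.0,
+        )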
+        # (B, heads, N, dim) → (B, N, C)
+        x = x.permute(0, 2, 1, 3).reshape(B, N, C)
+        x = self.proj(x)
+        x = self.proj_drop(x)
+        return x
+class WindowAttention(Attention_):
+    """Multi-head Attention block with relative position embeddings."""
+    def __init__(
+        self,
+        dim,
+        num_heads=8,
+        qkv_bias=True,
+        use_rel_pos=False,
+        rel_pos_zero_init=True,
+        input_size=None,
+        **block_kwargs,
+    ):
+        """
+        Args:
+            dim (int): Number of input channels.
+            num_heads (int): Number of attention heads.
+            qkv_bias (bool): If True, add a learnable bias to query, key, value.
+            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
+            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
+            input_size (tuple or None): Input resolution (height, width) for calculating
+                the relative positional parameter size.
+        """
+        super().__init__(dim, num_heads=num_heads, qkv_bias=qkv_bias, **block_kwargs)
+        self.use_rel_pos = use_rel_pos
+        if self.use_rel_pos:
+            # initialize relative positional embeddings
+            self.rel_pos_h = nn.Parameter(
+                torch.zeros(2 * input_size[0] - 1, self.head_dim)
+            )
+            self.rel_pos_w = nn.Parameter(
+                torch.zeros(2 * input_size[1] - 1, self.head_dim)
+            )
+            if not rel_pos_zero_init:
+                nn.init.trunc_normal_(self.rel_pos_h, std=0.02)
+                nn.init.trunc_normal_(self.rel_pos_w, std=0.02)
+    def forward(self, x, mask=None):
+        B, N, C = x.shape
+        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
+        q, k, v = qkv.unbind(2)
+        if use_fp32_attention := getattr(self, "fp32_attention", False):
+            q, k, v = q.float(), k.float(), v.float()
+        # (B, N, heads, dim) → (B, heads, N, dim) for SDPA
+        q = q.permute(0, 2, 1, 3)
+        k = k.permute(0, 2, 1, 3)
+        v = v.permute(0, 2, 1, 3)
+        attn_mask = None
+        if mask is not None:
+            # Additive mask shared by all heads: 0 keeps a position, -inf drops it.
+            # Shape (B, 1, Nq, Nk) broadcasts over the head axis, so each sample
+            # keeps its own mask.
+            keep = mask.squeeze(1).unsqueeze(1) != 0
+            attn_mask = torch.zeros(keep.shape, dtype=q.dtype, device=q.device)
+            attn_mask.masked_fill_(~keep, float("-inf"))
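+        # SDPA applies dropout whenever dropout_p > 0, so disable it outside training.
+        x = F.scaled_dot_product_attention(
+            q,
+            k,
+            v,
+            attn_mask=attn_mask,
+            dropout_p=self.attn_drop.p if self.training else 0.0,
+        )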
+        # (B, heads, N, dim) → (B, N, C)
+        x = x.permute(0, 2, 1, 3).reshape(B, N, C)
+        x = self.proj(x)
+        x = self.proj_drop(x)
+        return x
+#################################################################################
+#   AMP attention with fp32 softmax to fix loss NaN problem during training     #
+#################################################################################
+class Attention(Attention_):
+    def forward(self, x):
+        B, N, C = x.shape
+        qkv = (
+            self.qkv(x)
+            .reshape(B, N, 3, self.num_heads, C // self.num_heads)
+            .permute(2, 0, 3, 1, 4)
+        )
+        q, k, v = qkv.unbind(0)  # make torchscript happy (cannot use tensor as tuple)
+        use_fp32_attention = getattr(self, "fp32_attention", False)
+        if use_fp32_attention:
+            q, k = q.float(), k.float()
+        with torch.autocast(device_type="cuda", enabled=not use_fp32_attention):
+            attn = (q @ k.transpose(-2, -1)) * self.scale
+            attn = attn.softmax(dim=-1)
+        attn = self.attn_drop(attn)
+        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
+        x = self.proj(x)
+        x = self.proj_drop(x)
+        return x
+class FinalLayer(nn.Module):
+    """
+    The final layer of PixArt.
+    """
+    def __init__(self, hidden_size, patch_size, out_channels):
+        super().__init__()
+        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.linear = nn.Linear(
+            hidden_size, patch_size * patch_size * out_channels, bias=True
+        )
+        self.adaLN_modulation = nn.Sequential(
+            nn.SiLU(), nn.Linear(hidden_size, 2 * hidden_size, bias=True)
+        )
+    def forward(self, x, c):
+        shift, scale = self.adaLN_modulation(c).chunk(2, dim=1)
+        x = modulate(self.norm_final(x), shift, scale)
+        x = self.linear(x)
+        return x
+class T2IFinalLayer(nn.Module):
+    """
+    The final layer of PixArt.
+    """
+    def __init__(self, hidden_size, patch_size, out_channels):
+        super().__init__()
+        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.linear = nn.Linear(
+            hidden_size, patch_size[0] * patch_size[1] * out_channels, bias=True
+        )
+        self.scale_shift_table = nn.Parameter(
+            torch.randn(2, hidden_size) / hidden_size**0.5
+        )
+        self.out_channels = out_channels
+        self.initialize_weights()
+    def initialize_weights(self):
+        def _basic_init(module):
+            if isinstance(module, nn.Linear):
+                torch.nn.init.xavier_uniform_(module.weight)
+                if module.bias is not None:
+                    nn.init.constant_(module.bias, 0)
+        self.apply(_basic_init)
+    def forward(self, x, t):
+        shift, scale = (self.scale_shift_table[None] + t[:, None]).chunk(2, dim=1)
+        x = t2i_modulate(self.norm_final(x), shift, scale)
+        x = self.linear(x)
+        return x
+class MaskFinalLayer(nn.Module):
+    """
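+    Final layer for the masked variant: adaLN-modulates with a conditioning
+    embedding of size c_emb_size, then projects to patch outputs.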
+    """
+    def __init__(self, final_hidden_size, c_emb_size, patch_size, out_channels):
+        super().__init__()
+        self.norm_final = nn.LayerNorm(
+            final_hidden_size, elementwise_affine=False, eps=1e-6
+        )
+        self.linear = nn.Linear(
+            final_hidden_size, patch_size * patch_size * out_channels, bias=True
+        )
+        self.adaLN_modulation = nn.Sequential(
+            nn.SiLU(), nn.Linear(c_emb_size, 2 * final_hidden_size, bias=True)
+        )
+    def forward(self, x, t):
+        shift, scale = self.adaLN_modulation(t).chunk(2, dim=1)
+        x = modulate(self.norm_final(x), shift, scale)
+        x = self.linear(x)
+        return x
+class DecoderLayer(nn.Module):
+    """
+    Projects hidden_size features to decoder_hidden_size after adaLN modulation.
+    """
+    def __init__(self, hidden_size, decoder_hidden_size):
+        super().__init__()
+        self.norm_decoder = nn.LayerNorm(
+            hidden_size, elementwise_affine=False, eps=1e-6
+        )
+        self.linear = nn.Linear(hidden_size, decoder_hidden_size, bias=True)
+        self.adaLN_modulation = nn.Sequential(
+            nn.SiLU(), nn.Linear(hidden_size, 2 * hidden_size, bias=True)
+        )
+    def forward(self, x, t):
+        shift, scale = self.adaLN_modulation(t).chunk(2, dim=1)
+        x = modulate(self.norm_decoder(x), shift, scale)
+        x = self.linear(x)
+        return x
+#################################################################################
+#               Embedding Layers for Timesteps and Class Labels                 #
+#################################################################################
+class TimestepEmbedder(nn.Module):
+    """
+    Embeds scalar timesteps into vector representations.
+    """
+    def __init__(self, hidden_size, frequency_embedding_size=256):
+        super().__init__()
+        self.mlp = nn.Sequential(
+            nn.Linear(frequency_embedding_size, hidden_size, bias=True),
+            nn.SiLU(),
+            nn.Linear(hidden_size, hidden_size, bias=True),
+        )
+        self.frequency_embedding_size = frequency_embedding_size
+    @staticmethod
+    def timestep_embedding(t, dim, max_period=10000):
+        """
+        Create sinusoidal timestep embeddings.
+        :param t: a 1-D Tensor of N indices, one per batch element.
+                          These may be fractional.
+        :param dim: the dimension of the output.
+        :param max_period: controls the minimum frequency of the embeddings.
+        :return: an (N, D) Tensor of positional embeddings.
+        """
+        # https://github.com/openai/glide-text2im/blob/main/glide_text2im/nn.py
+        half = dim // 2
+        freqs = torch.exp(
+            -math.log(max_period)
+            * torch.arange(start=0, end=half, dtype=torch.float32, device=t.device)
+            / half
+        )
+        args = t[:, None].float() * freqs[None]
+        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
+        if dim % 2:
+            embedding = torch.cat(
+                [embedding, torch.zeros_like(embedding[:, :1])], dim=-1
+            )
+        return embedding
+    def forward(self, t):
+        t_freq = self.timestep_embedding(t, self.frequency_embedding_size).to(
+            self.dtype
+        )
+        return self.mlp(t_freq)
+    @property
+    def dtype(self):
+        return next(self.parameters()).dtype
+class SizeEmbedder(TimestepEmbedder):
+    """
+    Embeds each scalar of a (B, D) size-condition tensor with sinusoidal features
+    and concatenates the results into a (B, D * hidden_size) vector.
+    """
+    def __init__(self, hidden_size, frequency_embedding_size=256):
+        super().__init__(
+            hidden_size=hidden_size, frequency_embedding_size=frequency_embedding_size
+        )
+        self.mlp = nn.Sequential(
+            nn.Linear(frequency_embedding_size, hidden_size, bias=True),
+            nn.SiLU(),
+            nn.Linear(hidden_size, hidden_size, bias=True),
+        )
+        self.frequency_embedding_size = frequency_embedding_size
+        self.outdim = hidden_size
+    def forward(self, s, bs):
+        if s.ndim == 1:
+            s = s[:, None]
+        assert s.ndim == 2
+        if s.shape[0] != bs:
+            s = s.repeat(bs // s.shape[0], 1)
+            assert s.shape[0] == bs
+        b, dims = s.shape[0], s.shape[1]
+        s = rearrange(s, "b d -> (b d)")
+        s_freq = self.timestep_embedding(s, self.frequency_embedding_size).to(
+            self.dtype
+        )
+        s_emb = self.mlp(s_freq)
+        s_emb = rearrange(s_emb, "(b d) d2 -> b (d d2)", b=b, d=dims, d2=self.outdim)
+        return s_emb
+    @property
+    def dtype(self):
+        return next(self.parameters()).dtype
+class LabelEmbedder(nn.Module):
+    """
+    Embeds class labels into vector representations. Also handles label dropout for classifier-free guidance.
+    """
+    def __init__(self, num_classes, hidden_size, dropout_prob):
+        super().__init__()
+        use_cfg_embedding = dropout_prob > 0
+        self.embedding_table = nn.Embedding(
+            num_classes + use_cfg_embedding, hidden_size
+        )
+        self.num_classes = num_classes
+        self.dropout_prob = dropout_prob
+    def token_drop(self, labels, force_drop_ids=None):
+        """
+        Drops labels to enable classifier-free guidance.
+        """
+        if force_drop_ids is None:
+            # Sample on the labels' device instead of hard-coding CUDA.
+            drop_ids = torch.rand(labels.shape[0], device=labels.device) < self.dropout_prob
+        else:
+            drop_ids = force_drop_ids == 1
+        labels = torch.where(drop_ids, self.num_classes, labels)
+        return labels
+    def forward(self, labels, train, force_drop_ids=None):
+        use_dropout = self.dropout_prob > 0
+        if (train and use_dropout) or (force_drop_ids is not None):
+            labels = self.token_drop(labels, force_drop_ids)
+        return self.embedding_table(labels)
+class CaptionEmbedder(nn.Module):
+    """
+    Projects caption features to the hidden size. Also handles caption dropout for classifier-free guidance.
+    """
+    def __init__(
+        self,
+        in_channels,
+        hidden_size,
+        uncond_prob,
+        act_layer=nn.GELU(approximate="tanh"),
+        token_num=120,
+    ):
+        super().__init__()
+        self.y_proj = Mlp(
+            in_features=in_channels,
+            hidden_features=hidden_size,
+            out_features=hidden_size,
+            act_layer=act_layer,
+            drop=0,
+        )
+        self.register_buffer(
+            "y_embedding",
+            nn.Parameter(torch.randn(token_num, in_channels) / in_channels**0.5),
+        )
+        self.uncond_prob = uncond_prob
+    def token_drop(self, caption, force_drop_ids=None):
+        """
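+        Replaces dropped captions with the learned null embedding to enable classifier-free guidance.
+        """
+        if force_drop_ids is None:
+            # Sample on the caption's device instead of hard-coding CUDA.
+            drop_ids = torch.rand(caption.shape[0], device=caption.device) < self.uncond_prob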
+        else:
+            drop_ids = force_drop_ids == 1
+        caption = torch.where(drop_ids[:, None, None, None], self.y_embedding, caption)
+        return caption
+    def forward(self, caption, train, force_drop_ids=None):
+        if train:
+            assert caption.shape[2:] == self.y_embedding.shape
+        use_dropout = self.uncond_prob > 0
+        if (train and use_dropout) or (force_drop_ids is not None):
+            caption = self.token_drop(caption, force_drop_ids)
+        caption = self.y_proj(caption)
+        return caption
+class CaptionEmbedderDoubleBr(nn.Module):
+    """
+    Two-branch caption embedder: returns a projected global (token-averaged) caption
+    embedding together with the token-level caption. Also handles caption dropout for
+    classifier-free guidance.
+    """
+    def __init__(
+        self,
+        in_channels,
+        hidden_size,
+        uncond_prob,
+        act_layer=nn.GELU(approximate="tanh"),
+        token_num=120,
+    ):
+        super().__init__()
+        self.proj = Mlp(
+            in_features=in_channels,
+            hidden_features=hidden_size,
+            out_features=hidden_size,
+            act_layer=act_layer,
+            drop=0,
+        )
+        self.embedding = nn.Parameter(torch.randn(1, in_channels) / 10**0.5)
+        self.y_embedding = nn.Parameter(torch.randn(token_num, in_channels) / 10**0.5)
+        self.uncond_prob = uncond_prob
+    def token_drop(self, global_caption, caption, force_drop_ids=None):
+        """
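+        Replaces dropped captions with the learned null embeddings to enable classifier-free guidance.
+        """
+        if force_drop_ids is None:
+            # Sample on the caption's device instead of hard-coding CUDA.
+            drop_ids = torch.rand(global_caption.shape[0], device=global_caption.device) < self.uncond_prob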
+        else:
+            drop_ids = force_drop_ids == 1
+        global_caption = torch.where(drop_ids[:, None], self.embedding, global_caption)
+        caption = torch.where(drop_ids[:, None, None, None], self.y_embedding, caption)
+        return global_caption, caption
+    def forward(self, caption, train, force_drop_ids=None):
+        assert caption.shape[2:] == self.y_embedding.shape
+        # squeeze(1) keeps the batch dimension even when the batch size is 1.
+        global_caption = caption.mean(dim=2).squeeze(1)
+        use_dropout = self.uncond_prob > 0
+        if (train and use_dropout) or (force_drop_ids is not None):
+            global_caption, caption = self.token_drop(
+                global_caption, caption, force_drop_ids
+            )
+        y_embed = self.proj(global_caption)
+        return y_embed, caption
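The `timestep_embedding` static method in the diff above builds sinusoidal features from a scalar timestep. As a rough illustration of that arithmetic, here is a torch-free restatement for a single timestep using only the standard library. The function name `sinusoidal_embedding` is illustrative and not part of the commit; the frequency schedule and the cosine-then-sine ordering follow the code shown in the diff.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10000):
    """Return the dim-length sinusoidal embedding of one scalar timestep t."""
    half = dim // 2
    # Frequencies decay geometrically from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    # Cosines first, then sines, matching the torch.cat order in the diff.
    emb = [math.cos(a) for a in args] + [math.sin(a) for a in args]
    if dim % 2:
        emb.append(0.0)  # zero-pad odd dimensions
    return emb


print(sinusoidal_embedding(0.0, 5))  # prints [1.0, 1.0, 0.0, 0.0, 0.0]
```

At `t = 0` every cosine is 1 and every sine is 0, which makes the layout of the vector easy to check by eye.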
--- a/README.md
+++ b/README.md
 # AudioFly
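+## Paper
+Not yet available.
+## Model Overview
 AudioFly is an audio generation model. It synthesizes sound effects from text descriptions. The model can generate high-quality audio at a 44.1 kHz sampling rate. The generated audio is strongly consistent with the prompt text.
-AudioFly adopts a latent diffusion model architecture. The model has 1 billion parameters and was trained on a large, diverse corpus. The training data includes open-source datasets such as AudioSet, AudioCaps, and TUT, as well as proprietary in-house data. The model performs well in both single-event and multi-event scenarios. In both cases, the generated audio accurately reflects the described content. On the AudioCaps dataset, AudioF
\ No newline at end of file
+AudioFly adopts a latent diffusion model architecture. The model has 1 billion parameters and was trained on a large, diverse corpus. The training data includes open-source datasets such as AudioSet, AudioCaps, and TUT, as well as proprietary in-house data. The model performs well in both single-event and multi-event scenarios. In both cases, the generated audio accurately reflects the described content. On the AudioCaps dataset, AudioFly outperforms previous audio generation models.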
+## Environment Dependencies
+| Software | Version |
+| :------: | :------: |
+| DTK | 26.04 |
+| Python | 3.10 |
+| Transformers | 4.56.1 |
+| vLLM | 0.18.1+das.dtk2604 |
+Recommended image:
+harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.18.1-ubuntu22.04-dtk26.04-py3.10-20260529-iflytek
+- Adjust the `-v` mount paths to match where your model and code actually live.
+```bash
+docker run -it \
+    --shm-size 60g \
+    --network=host \
+    --name Spark-X1 \
+    --privileged \
+    --device=/dev/kfd \
+    --device=/dev/dri \
+    --device=/dev/mkfd \
+    --group-add video \
+    --cap-add=SYS_PTRACE \
+    --security-opt seccomp=unconfined \
+    -u root \
+    -v /opt/hyhal/:/opt/hyhal/:ro \
+    -v /path/your_code_data/:/path/your_code_data/ \
+    harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm0.18.1-ubuntu22.04-dtk26.04-py3.10-20260529-iflytek bash
+```
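+More images can be downloaded from [光源](https://sourcefind.cn/#/service-list).
+The special deep learning libraries this project needs for DCU cards can be downloaded and installed from the [光合](https://developer.sourcefind.cn/tool/) developer community.
+Additional environment setup:
+```bash
+# After downloading the model, replace these files in the download path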
+cp PixArt_blocks.py AudioFly/ldm/modules/diffusionmodules/PixArt_blocks.py
+cp PixArt.py AudioFly/ldm/modules/diffusionmodules/PixArt.py
+```
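+## Pretrained Weights
+**Choose the model to download according to `Supported DCU Models`. FP8 models are supported only on BW1100/BW1101; do not use them on other models!**
+| Model | Weight Size | Data Type | Supported DCU Models | Minimum Cards | Download |
+| :------: | :------: | :------: | :------------: | :----------: | :------: |
+| AudioFly | 1B | BF16 | BW1000 | 1 | [ModelScope](https://modelscope.cn/models/iflytek/AudioFly) |
+## Dataset
+Not yet available.
+## Training
+Not yet available.
+## Inference
+### PyTorch
+#### Single-node inference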
+```bash
+cd AudioFly
+python run.py
+```
+## Results
+Input:
+'Fierce winds howl through the valley'
+Output:
+<audio controls src="./doc/result.wav"></audio>
+### Accuracy
+Accuracy on DCU matches that on GPU. Inference framework: PyTorch.
+## Source Repository and Issue Reporting
+- https://developer.sourcefind.cn/codes/modelzoo/audiofly
+## References
+- https://modelscope.csdn.net/68da3b11a6dc56200e8ae2ae.html
+- https://modelscope.cn/models/iflytek/AudioFly
\ No newline at end of file
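The README states that AudioFly produces audio at 44.1 kHz. A minimal sketch for checking that claim on a generated file with Python's standard `wave` module follows. It assumes the output is an integer PCM WAV, which is the only kind `wave` can read, and the helper name `wav_info` is illustrative rather than something the repository provides.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        duration = w.getnframes() / rate
    return rate, channels, duration


# Example call (path taken from the README's audio tag):
# print(wav_info("./doc/result.wav"))
```

If the file were saved as floating-point WAV, `wave.open` would raise `wave.Error`, and a third-party reader would be needed instead.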
--- a/doc/result.wav
+++ b/doc/result.wav
--- a/icon.png
+++ b/icon.png
--- a/model.properties
+++ b/model.properties
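+# Unique model identifier
+modelCode=15319
+# Model name
+modelName=AudioFly
+# Model description
+modelDescription=AudioFly is an audio generation model. It synthesizes sound effects from text descriptions. The model can generate high-quality audio at a 44.1 kHz sampling rate.
+# Process type (推理 = inference)
+processType=推理
+# Application category (语音合成 = speech synthesis)
+appCategory=语音合成
+# Framework type
+frameType=pytorch
+# Accelerator card type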
+accelerateType=BW1000
--- a/run.py
+++ b/run.py
+import yaml
+import torch
+from ldm.utils.util import instantiate_from_config
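+# NOTE: these absolute paths come from the author's machine; adjust them to your checkout.
+with open('/public/home/weishb/iflytek/AudioFly/config/config.yaml', "r") as f:
+    configs = yaml.load(f, Loader=yaml.FullLoader)
+model = instantiate_from_config(configs["model"])
+# Load onto the CPU first; the model is moved to the GPU below.
+checkpoint = torch.load('/public/home/weishb/iflytek/AudioFly/models/ldm/model.ckpt', map_location="cpu")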
+model.load_state_dict(checkpoint, strict=False)
+model.eval()
+model = model.cuda()
+text = 'Fierce winds howl through the valley' 
+name = 'result'
+savedir = './result'
+model.generate_sample(
+        textlist=[text],
+        name=name,
+        cfg=3.5,  # Guidance scale (controls how strongly generation follows the text prompt); not recommended to change
+        ddim_steps=200,  # Number of denoising steps in the diffusion process; not recommended to change
+        outputdir=f"{savedir}")
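`run.py` hard-codes absolute paths under `/public/home/weishb/`, so it will not run unchanged on another machine. One possible way to make the locations configurable is sketched below. It assumes the `config/config.yaml` and `models/ldm/model.ckpt` layout seen in those paths; the helper `resolve_audiofly_paths` and the `AUDIOFLY_ROOT` environment variable are hypothetical names, not something the repository defines.

```python
import os
from pathlib import Path


def resolve_audiofly_paths(root=None):
    """Return (config_path, checkpoint_path) under an AudioFly checkout."""
    # AUDIOFLY_ROOT is a hypothetical variable name used only for this sketch.
    base = Path(root or os.environ.get("AUDIOFLY_ROOT", ".")).expanduser()
    config_path = base / "config" / "config.yaml"
    checkpoint_path = base / "models" / "ldm" / "model.ckpt"
    return config_path, checkpoint_path


config_path, checkpoint_path = resolve_audiofly_paths()
print(config_path, checkpoint_path)
```

With no argument and no environment variable set, the paths resolve relative to the current directory, which fits the README's `cd AudioFly` followed by `python run.py`.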