Commit c07946d8 authored by hepj

dit & video
recursive-include tk *
include config.py
# Sliding Tile Attention Kernel
## Installation
We test our code with PyTorch 2.5.0 and CUDA >= 12.4. Currently we only provide an implementation for H100.
First, install a C++20-capable compiler toolchain for ThunderKittens:
```bash
sudo apt update
sudo apt install gcc-11 g++-11
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 100 --slave /usr/bin/g++ g++ /usr/bin/g++-11
sudo apt update
sudo apt install clang-11
```
Install STA:
```bash
export CUDA_HOME=/usr/local/cuda-12.4
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
git submodule update --init --recursive
python setup.py install
```
## Usage
```python
from st_attn import sliding_tile_attention
# assuming video size (T, H, W) = (30, 48, 80), text tokens = 256 with padding.
# q, k, v: [batch_size, num_heads, seq_length, head_dim], seq_length = T*H*W + 256
# a tile is a cube of size (6, 8, 8)
# window_size in tiles: [(window_t, window_h, window_w), (..)...]. For example, window size (3, 3, 3) means a query can attend to (3x6, 3x8, 3x8) = (18, 24, 24) tokens out of the total 30x48x80 video.
# text_length: int ranging from 0 to 256
# If your attention contains text tokens (Hunyuan)
out = sliding_tile_attention(q, k, v, window_size, text_length)
# If your attention does not contain text tokens (StepVideo)
out = sliding_tile_attention(q, k, v, window_size, 0, False)
```
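For a more concrete picture, here is an end-to-end sketch of a call in the Hunyuan-style setting. The batch size, window size, and text length below are illustrative assumptions, not values taken from this repo:
```python
import torch
from st_attn import sliding_tile_attention

# Illustrative sizes: 24 heads, head_dim 128, (30, 48, 80) video latent + 256 text slots.
B, H, D = 1, 24, 128
seq_len = 30 * 48 * 80 + 256  # 115456 tokens

q = torch.randn(B, H, seq_len, D, dtype=torch.bfloat16, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)

window_size = [(3, 3, 3)] * H  # one (t, h, w) window, in tiles, per head
text_length = 100              # assumed number of real (non-padding) text tokens

out = sliding_tile_attention(q, k, v, window_size, text_length)
print(out.shape)  # torch.Size([1, 24, 115456, 128])
```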
## Test
```bash
python test/test_sta.py
```
## How Does STA Work?
We give a demo for 2D STA with window size (6,6) operating on a (10, 10) image.
https://github.com/user-attachments/assets/f3b6dd79-7b43-4b60-a0fa-3d6495ec5747
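For readers who prefer code to video, the following is a minimal 2D sketch of the mask the demo illustrates. It is an illustration only: it assumes a (2, 2) tile (so the (6, 6) token window is (3, 3) tiles on the (10, 10) image) and tile-major token ordering; the real kernel operates on (6, 8, 8) tiles in 3D and is implemented in CUDA.
```python
import torch

def sta_mask_2d(canvas_hw=(10, 10), tile_hw=(2, 2), kernel_hw=(3, 3)):
    H, W = canvas_hw
    th, tw = tile_hw
    kh, kw = kernel_hw
    n_th, n_tw = H // th, W // tw

    idx = torch.arange(H * W)
    tile_id = idx // (th * tw)   # tokens inside a tile are contiguous (tile-major order)
    tile_h = tile_id // n_tw
    tile_w = tile_id % n_tw

    q_h, q_w = tile_h[:, None], tile_w[:, None]
    k_h, k_w = tile_h[None, :], tile_w[None, :]

    # The window centers on the query tile, clamped so it never slides off the canvas.
    c_h = q_h.clamp(kh // 2, n_th - 1 - kh // 2)
    c_w = q_w.clamp(kw // 2, n_tw - 1 - kw // 2)
    return ((c_h - k_h).abs() <= kh // 2) & ((c_w - k_w).abs() <= kw // 2)

mask = sta_mask_2d()
print(mask.shape, mask.float().mean())  # (100, 100) mask; every query sees 36 of 100 tokens
```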
## Why is STA Fast?
2D/3D Sliding Window Attention (SWA) creates many mixed blocks in the attention map. Even though a mixed block contributes fewer output values than a dense block, it is significantly slower to compute due to the GPU-unfriendly masking operation.
STA removes mixed blocks entirely.
<div align="center">
<img src="../../assets/sliding_tile_attn_map.png" width="80%"/>
</div>
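Below is a rough 1D illustration of this point (the sequence and block sizes are assumptions, not taken from the kernel): with a token-granularity sliding window, many attention blocks are partially masked ("mixed"), while a window defined at tile granularity and aligned to the block size yields only fully dense or fully empty blocks.
```python
import numpy as np

seq_len, block = 64, 8                       # assumed sequence and attention-block size
q = np.arange(seq_len)[:, None]
kv = np.arange(seq_len)[None, :]

swa = np.abs(q - kv) <= 12                   # token-granularity sliding window
sta = np.abs(q // block - kv // block) <= 1  # tile-granularity window (3 tiles wide)

def block_types(mask):
    types = set()
    for i in range(seq_len // block):
        for j in range(seq_len // block):
            sub = mask[i * block:(i + 1) * block, j * block:(j + 1) * block]
            types.add('dense' if sub.all() else 'empty' if not sub.any() else 'mixed')
    return types

print(block_types(swa))  # {'dense', 'empty', 'mixed'} -- mixed blocks need masking
print(block_types(sta))  # {'dense', 'empty'}          -- no mixed blocks
```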
## Acknowledgement
We learned from or reused code from FlexAttention, NATTEN, and ThunderKittens.
### ADD TO THIS TO REGISTER NEW KERNELS
sources = {
'attn': {
'source_files': {
'h100': 'st_attn/st_attn_h100.cu' # define these source files for each GPU target desired.
}
}
}
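# For reference, a new kernel would be registered by extending the dict above; the name and
# .cu path below are placeholders for illustration, not files that exist in this commit:
#   sources['my_kernel'] = {'source_files': {'h100': 'st_attn/my_kernel_h100.cu'}}
# and 'my_kernel' would then be added to the `kernels` list below.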
### WHICH KERNELS DO WE WANT TO BUILD?
# (oftentimes during development work you don't need to redefine them all.)
kernels = ['attn']
### WHICH GPU TARGET DO WE WANT TO BUILD FOR?
target = 'h100'
import os
import subprocess
from config import kernels, sources, target
from setuptools import find_packages, setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension
target = target.lower()
# Package metadata
PACKAGE_NAME = "st_attn"
VERSION = "0.0.4"
AUTHOR = "Hao AI Lab"
DESCRIPTION = "Sliding Tile Atteniton Kernel Used in FastVideo"
URL = "https://github.com/hao-ai-lab/FastVideo/tree/main/csrc/sliding_tile_attention"
# Set environment variables
tk_root = os.getenv('THUNDERKITTENS_ROOT', os.path.abspath(os.path.join(os.getcwd(), 'tk/')))
python_include = subprocess.check_output(['python', '-c',
"import sysconfig; print(sysconfig.get_path('include'))"]).decode().strip()
torch_include = subprocess.check_output([
'python', '-c',
"import torch; from torch.utils.cpp_extension import include_paths; print(' '.join(['-I' + p for p in include_paths()]))"
]).decode().strip()
print('st_attn root:', tk_root)
print('Python include:', python_include)
print('Torch include directories:', torch_include)
# CUDA flags
cuda_flags = [
'-DNDEBUG', '-Xcompiler=-Wno-psabi', '-Xcompiler=-fno-strict-aliasing', '--expt-extended-lambda',
'--expt-relaxed-constexpr', '-forward-unknown-to-host-compiler', '--use_fast_math', '-std=c++20', '-O3',
'-Xnvlink=--verbose', '-Xptxas=--verbose', '-Xptxas=--warn-on-spills', f'-I{tk_root}/include',
f'-I{tk_root}/prototype', f'-I{python_include}', '-DTORCH_COMPILE'
] + torch_include.split()
cpp_flags = ['-std=c++20', '-O3']
if target == 'h100':
cuda_flags.append('-DKITTENS_HOPPER')
cuda_flags.append('-arch=sm_90a')
else:
raise ValueError(f'Target {target} not supported')
source_files = ['st_attn.cpp']
for k in kernels:
if target not in sources[k]['source_files']:
raise KeyError(f'Target {target} not found in source files for kernel {k}')
if isinstance(sources[k]['source_files'][target], list):
source_files.extend(sources[k]['source_files'][target])
else:
source_files.append(sources[k]['source_files'][target])
cpp_flags.append(f'-DTK_COMPILE_{k.replace(" ", "_").upper()}')
setup(name=PACKAGE_NAME,
version=VERSION,
author=AUTHOR,
description=DESCRIPTION,
url=URL,
packages=find_packages(),
ext_modules=[
CUDAExtension('st_attn_cuda',
sources=source_files,
extra_compile_args={
'cxx': cpp_flags,
'nvcc': cuda_flags
},
libraries=['cuda'])
],
cmdclass={'build_ext': BuildExtension},
classifiers=[
"Programming Language :: Python :: 3",
"Environment :: GPU :: NVIDIA CUDA :: 12",
"License :: OSI Approved :: Apache Software License",
],
python_requires='>=3.10',
install_requires=["torch>=2.5.0"])
#include <torch/extension.h>
#include <ATen/ATen.h>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <cuda_runtime.h>
#ifdef TK_COMPILE_ATTN
extern torch::Tensor sta_forward(
torch::Tensor q, torch::Tensor k, torch::Tensor v, torch::Tensor o, int kernel_t_size, int kernel_w_size, int kernel_h_size, int text_length, bool process_text, bool has_text, int kernel_aspect_ratio_flag
);
#endif
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.doc() = "Sliding Block Attention Kernels"; // optional module docstring
#ifdef TK_COMPILE_ATTN
m.def("sta_fwd", torch::wrap_pybind_function(sta_forward), "sliding tile attention, assuming tile size is (6,8,8)");
#endif
}
import math
import torch
from st_attn_cuda import sta_fwd
def sliding_tile_attention(q_all, k_all, v_all, window_size, text_length, has_text=True, img_latent_shape='30x48x80'):
seq_length = q_all.shape[2]
img_latent_shape_mapping = {
'30x48x80':1,
'36x48x48':2,
'18x48x80':3,
}
if has_text:
assert q_all.shape[2] >= 115200, "STA currently only supports video with latent size (30, 48, 80), which is 117 frames x 768 x 1280 pixels"
assert q_all.shape[1] == len(window_size), "Number of heads must match the number of window sizes"
target_size = math.ceil(seq_length / 384) * 384
pad_size = target_size - seq_length
if pad_size > 0:
q_all = torch.cat([q_all, q_all[:, :, -pad_size:]], dim=2)
k_all = torch.cat([k_all, k_all[:, :, -pad_size:]], dim=2)
v_all = torch.cat([v_all, v_all[:, :, -pad_size:]], dim=2)
else:
if img_latent_shape == '36x48x48': # Stepvideo 204x768x68
assert q_all.shape[2] == 82944
elif img_latent_shape == '18x48x80': # Wan 69x768x1280
assert q_all.shape[2] == 69120
else:
raise ValueError(f"Unsupported {img_latent_shape}, current shape is {q_all.shape}, only support '36x48x48' for Stepvideo and '18x48x80' for Wan")
kernel_aspect_ratio_flag = img_latent_shape_mapping[img_latent_shape]
hidden_states = torch.empty_like(q_all)
# This for loop is ugly, but it is actually quite efficient: the sequence dimension alone can already oversubscribe the SMs.
for head_index, (t_kernel, h_kernel, w_kernel) in enumerate(window_size):
for batch in range(q_all.shape[0]):
q_head = q_all[batch:batch + 1, head_index:head_index + 1]
k_head = k_all[batch:batch + 1, head_index:head_index + 1]
v_head = v_all[batch:batch + 1, head_index:head_index + 1]
o_head = hidden_states[batch:batch + 1, head_index:head_index + 1]
_ = sta_fwd(q_head, k_head, v_head, o_head, t_kernel, h_kernel, w_kernel, text_length, False, has_text, kernel_aspect_ratio_flag)
if has_text:
_ = sta_fwd(q_all, k_all, v_all, hidden_states, 3, 3, 3, text_length, True, True, kernel_aspect_ratio_flag)
return hidden_states[:, :, :seq_length]
import os
from collections import defaultdict
import matplotlib.pyplot as plt
import numpy as np
import torch
from st_attn import sliding_tile_attention
def flops(batch, seqlen, nheads, headdim, causal, mode="fwd"):
assert mode in ["fwd", "bwd", "fwd_bwd"]
f = 4 * batch * seqlen**2 * nheads * headdim // (2 if causal else 1)
return f if mode == "fwd" else (2.5 * f if mode == "bwd" else 3.5 * f)
def efficiency(flop, time):
flop = flop / 1e12
time = time / 1e6
return flop / time
def benchmark_attention(configurations):
results = {'fwd': defaultdict(list), 'bwd': defaultdict(list)}
for B, H, N, D, causal in configurations:
print("=" * 60)
print(f"Timing forward and backward pass for B={B}, H={H}, N={N}, D={D}, causal={causal}")
q = torch.randn(B, H, N, D, dtype=torch.bfloat16, device='cuda', requires_grad=False).contiguous()
k = torch.randn(B, H, N, D, dtype=torch.bfloat16, device='cuda', requires_grad=False).contiguous()
v = torch.randn(B, H, N, D, dtype=torch.bfloat16, device='cuda', requires_grad=False).contiguous()
grad_output = torch.randn_like(q, requires_grad=False).contiguous()
qg = torch.zeros_like(q, requires_grad=False, dtype=torch.float).contiguous()
kg = torch.zeros_like(k, requires_grad=False, dtype=torch.float).contiguous()
vg = torch.zeros_like(v, requires_grad=False, dtype=torch.float).contiguous()
# Prepare for timing forward pass
start_events_fwd = [torch.cuda.Event(enable_timing=True) for _ in range(10)]
end_events_fwd = [torch.cuda.Event(enable_timing=True) for _ in range(10)]
torch.cuda.empty_cache()
torch.cuda.synchronize()
# Warmup for forward pass
for _ in range(10):
o = sliding_tile_attention(q, k, v, [[6, 6, 6]] * 24, 0, False)
# Time the forward pass
for i in range(10):
start_events_fwd[i].record()
o = sliding_tile_attention(q, k, v, [[6, 6, 6]] * 24, 0, False)
end_events_fwd[i].record()
torch.cuda.synchronize()
times_fwd = [s.elapsed_time(e) for s, e in zip(start_events_fwd, end_events_fwd)]
time_us_fwd = np.mean(times_fwd) * 1000
tflops_fwd = efficiency(flops(B, N, H, D, causal, 'fwd'), time_us_fwd)
results['fwd'][(D, causal)].append((N, tflops_fwd))
print(f"Average time for forward pass in us: {time_us_fwd:.2f}")
print(f"Average efficiency for forward pass in TFLOPS: {tflops_fwd}")
print("-" * 60)
# torch.cuda.empty_cache()
# torch.cuda.synchronize()
# # Prepare for timing backward pass
# start_events_bwd = [torch.cuda.Event(enable_timing=True) for _ in range(10)]
# end_events_bwd = [torch.cuda.Event(enable_timing=True) for _ in range(10)]
# # Warmup for backward pass
# for _ in range(10):
# qg, kg, vg = tk.mha_backward(q, k, v, o, l_vec, grad_output, causal)
# # Time the backward pass
# for i in range(10):
# start_events_bwd[i].record()
# qg, kg, vg = tk.mha_backward(q, k, v, o, l_vec, grad_output, causal)
# end_events_bwd[i].record()
# torch.cuda.synchronize()
# times_bwd = [s.elapsed_time(e) for s, e in zip(start_events_bwd, end_events_bwd)]
# time_us_bwd = np.mean(times_bwd) * 1000
# tflops_bwd = efficiency(flops(B, N, H, D, causal, 'bwd'), time_us_bwd)
# results['bwd'][(D, causal)].append((N, tflops_bwd))
# print(f"Average time for backward pass in us: {time_us_bwd:.2f}")
# print(f"Average efficiency for backward pass in TFLOPS: {tflops_bwd}")
print("=" * 60)
torch.cuda.empty_cache()
torch.cuda.synchronize()
return results
def plot_results(results):
os.makedirs('benchmark_results', exist_ok=True)
for mode in ['fwd', 'bwd']:
for (D, causal), values in results[mode].items():
seq_lens = [x[0] for x in values]
tflops = [x[1] for x in values]
plt.figure(figsize=(10, 6))
bars = plt.bar(range(len(seq_lens)), tflops, tick_label=seq_lens)
plt.xlabel('Sequence Length')
plt.ylabel('TFLOPS')
plt.title(f'{mode.upper()} Pass - Head Dim: {D}, Causal: {causal}')
plt.grid(True)
# Adding the numerical y value on top of each bar
for bar in bars:
yval = bar.get_height()
plt.text(bar.get_x() + bar.get_width() / 2, yval, round(yval, 2), ha='center', va='bottom')
filename = f'benchmark_results/{mode}_D{D}_causal{causal}.png'
plt.savefig(filename)
plt.close()
# Example list of configurations to test
configurations = [
(2, 24, 82944, 128, False),
# (16, 16, 768*16, 128, False),
# (16, 16, 768*2, 128, False),
# (16, 16, 768*4, 128, False),
# (16, 16, 768*8, 128, False),
# (16, 16, 768*16, 128, False),
# (16, 16, 768, 128, True),
# (16, 16, 768*2, 128, True),
# (16, 16, 768*4, 128, True),
# (16, 16, 768*8, 128, True),
# (16, 16, 768*16, 128, True),
# (16, 32, 768, 64, False),
# (16, 32, 768*2, 64, False),
# (16, 32, 768*4, 64, False),
# (16, 32, 768*8, 64, False),
# (16, 32, 768*16, 64, False),
# (16, 32, 768, 64, True),
# (16, 32, 768*2, 64, True),
# (16, 32, 768*4, 64, True),
# (16, 32, 768*8, 64, True),
# (16, 32, 768*16, 64, True),
]
results = benchmark_attention(configurations)
# plot_results(results)
from typing import Tuple
import torch
from torch import BoolTensor, IntTensor
from torch.nn.attention.flex_attention import create_block_mask
# Peiyuan: This is necessary. Don't know why. See https://github.com/pytorch/pytorch/issues/135028
torch._inductor.config.realize_opcount_threshold = 100
def generate_sta_mask(canvas_twh, kernel_twh, tile_twh, text_length):
"""Generates a 3D NATTEN attention mask with a given kernel size.
Args:
canvas_t: The time dimension of the canvas.
canvas_h: The height of the canvas.
canvas_w: The width of the canvas.
kernel_t: The time dimension of the kernel.
kernel_h: The height of the kernel.
kernel_w: The width of the kernel.
"""
canvas_t, canvas_h, canvas_w = canvas_twh
kernel_t, kernel_h, kernel_w = kernel_twh
tile_t_size, tile_h_size, tile_w_size = tile_twh
total_tile_size = tile_t_size * tile_h_size * tile_w_size
canvas_tile_t, canvas_tile_h, canvas_tile_w = canvas_t // tile_t_size, canvas_h // tile_h_size, canvas_w // tile_w_size
img_seq_len = canvas_t * canvas_h * canvas_w
def get_tile_t_x_y(idx: IntTensor) -> Tuple[IntTensor, IntTensor, IntTensor]:
tile_id = idx // total_tile_size
tile_t = tile_id // (canvas_tile_h * canvas_tile_w)
tile_h = (tile_id % (canvas_tile_h * canvas_tile_w)) // canvas_tile_w
tile_w = tile_id % canvas_tile_w
return tile_t, tile_h, tile_w
def sta_mask_3d(
b: IntTensor,
h: IntTensor,
q_idx: IntTensor,
kv_idx: IntTensor,
) -> BoolTensor:
q_t_tile, q_x_tile, q_y_tile = get_tile_t_x_y(q_idx)
kv_t_tile, kv_x_tile, kv_y_tile = get_tile_t_x_y(kv_idx)
# kernel nominally attempts to center itself on the query, but kernel center
# is clamped to a fixed distance (kernel half-length) from the canvas edge
kernel_center_t = q_t_tile.clamp(kernel_t // 2, (canvas_tile_t - 1) - kernel_t // 2)
kernel_center_x = q_x_tile.clamp(kernel_h // 2, (canvas_tile_h - 1) - kernel_h // 2)
kernel_center_y = q_y_tile.clamp(kernel_w // 2, (canvas_tile_w - 1) - kernel_w // 2)
time_mask = (kernel_center_t - kv_t_tile).abs() <= kernel_t // 2
hori_mask = (kernel_center_x - kv_x_tile).abs() <= kernel_h // 2
vert_mask = (kernel_center_y - kv_y_tile).abs() <= kernel_w // 2
image_mask = (q_idx < img_seq_len) & (kv_idx < img_seq_len)
image_to_text_mask = (q_idx < img_seq_len) & (kv_idx >= img_seq_len) & (kv_idx < img_seq_len + text_length)
text_to_all_mask = (q_idx >= img_seq_len) & (kv_idx < img_seq_len + text_length)
return (image_mask & time_mask & hori_mask & vert_mask) | image_to_text_mask | text_to_all_mask
sta_mask_3d.__name__ = f"natten_3d_c{canvas_t}x{canvas_w}x{canvas_h}_k{kernel_t}x{kernel_w}x{kernel_h}"
return sta_mask_3d
def get_sliding_tile_attention_mask(kernel_size, tile_size, img_size, text_length, device, text_max_len=256):
img_seq_len = img_size[0] * img_size[1] * img_size[2]
image_mask = generate_sta_mask(img_size, kernel_size, tile_size, text_length)
mask = create_block_mask(image_mask,
B=None,
H=None,
Q_LEN=img_seq_len + text_max_len,
KV_LEN=img_seq_len + text_max_len,
device=device,
_compile=True)
return mask
import torch
from flex_sta_ref import get_sliding_tile_attention_mask
from st_attn import sliding_tile_attention
from torch.nn.attention.flex_attention import flex_attention
# from flash_attn_interface import flash_attn_func
from tqdm import tqdm
flex_attention = torch.compile(flex_attention, dynamic=False)
def flex_test(Q, K, V, kernel_size):
mask = get_sliding_tile_attention_mask(kernel_size, (6, 8, 8), (36, 48, 48), 39, 'cuda', 0)
output = flex_attention(Q, K, V, block_mask=mask)
return output
def h100_fwd_kernel_test(Q, K, V, kernel_size):
o = sliding_tile_attention(Q, K, V, [kernel_size] * 24, 39, False)
return o
def generate_tensor(shape, mean, std, dtype, device):
tensor = torch.randn(shape, dtype=dtype, device=device)
magnitude = torch.norm(tensor, dim=-1, keepdim=True)
scaled_tensor = tensor * (torch.randn(magnitude.shape, dtype=dtype, device=device) * std + mean) / magnitude
return scaled_tensor.contiguous()
def check_correctness(b, h, n, d, causal, mean, std, num_iterations=50, error_mode='all'):
results = {
'TK vs FLEX': {
'sum_diff': 0,
'sum_abs': 0,
'max_diff': 0
},
}
kernel_size_ls = [(6, 1, 6), (6, 6, 1)]
from tqdm import tqdm
for kernel_size in tqdm(kernel_size_ls):
for _ in range(num_iterations):
torch.manual_seed(0)
Q = generate_tensor((b, h, n, d), mean, std, torch.bfloat16, 'cuda')
K = generate_tensor((b, h, n, d), mean, std, torch.bfloat16, 'cuda')
V = generate_tensor((b, h, n, d), mean, std, torch.bfloat16, 'cuda')
tk_o = h100_fwd_kernel_test(Q, K, V, kernel_size)
pt_o = flex_test(Q, K, V, kernel_size)
diff = pt_o - tk_o
abs_diff = torch.abs(diff)
results['TK vs FLEX']['sum_diff'] += torch.sum(abs_diff).item()
results['TK vs FLEX']['max_diff'] = max(results['TK vs FLEX']['max_diff'], torch.max(abs_diff).item())
torch.cuda.empty_cache()
print("kernel_size", kernel_size)
print("max_diff", torch.max(abs_diff).item())
print(
"avg_diff",
torch.sum(abs_diff).item() / (b * h * n * d *
(1 if error_mode == 'output' else 3 if error_mode == 'backward' else 4)))
total_elements = b * h * n * d * num_iterations * (1 if error_mode == 'output' else
3 if error_mode == 'backward' else 4) * len(kernel_size_ls)
for name, data in results.items():
avg_diff = data['sum_diff'] / total_elements
max_diff = data['max_diff']
results[name] = {'avg_diff': avg_diff, 'max_diff': max_diff}
return results
def generate_error_graphs(b, h, d, causal, mean, std, error_mode='all'):
seq_lengths = [82944]
tk_avg_errors, tk_max_errors = [], []
for n in tqdm(seq_lengths, desc="Generating error data"):
results = check_correctness(b, h, n, d, causal, mean, std, error_mode=error_mode)
tk_avg_errors.append(results['TK vs FLEX']['avg_diff'])
tk_max_errors.append(results['TK vs FLEX']['max_diff'])
# Example usage
b, h, d = 2, 24, 128
causal = False
mean = 1e-1
std = 10
for mode in ['output']:
generate_error_graphs(b, h, d, causal, mean, std, error_mode=mode)
print("Error graphs generated and saved for all modes.")
import argparse
import os
import tempfile
import gradio as gr
import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from diffusers.utils import export_to_video
from fastvideo.distill.solver import PCMFMScheduler
from fastvideo.models.mochi_hf.modeling_mochi import MochiTransformer3DModel
from fastvideo.models.mochi_hf.pipeline_mochi import MochiPipeline
def init_args():
parser = argparse.ArgumentParser()
parser.add_argument("--prompts", nargs="+", default=[])
parser.add_argument("--num_frames", type=int, default=25)
parser.add_argument("--height", type=int, default=480)
parser.add_argument("--width", type=int, default=848)
parser.add_argument("--num_inference_steps", type=int, default=8)
parser.add_argument("--guidance_scale", type=float, default=4.5)
parser.add_argument("--model_path", type=str, default="data/mochi")
parser.add_argument("--seed", type=int, default=12345)
parser.add_argument("--transformer_path", type=str, default=None)
parser.add_argument("--scheduler_type", type=str, default="pcm_linear_quadratic")
parser.add_argument("--lora_checkpoint_dir", type=str, default=None)
parser.add_argument("--shift", type=float, default=8.0)
parser.add_argument("--num_euler_timesteps", type=int, default=50)
parser.add_argument("--linear_threshold", type=float, default=0.1)
parser.add_argument("--linear_range", type=float, default=0.75)
parser.add_argument("--cpu_offload", action="store_true")
return parser.parse_args()
def load_model(args):
if args.scheduler_type == "euler":
scheduler = FlowMatchEulerDiscreteScheduler()
else:
linear_quadratic = True if "linear_quadratic" in args.scheduler_type else False
scheduler = PCMFMScheduler(
1000,
args.shift,
args.num_euler_timesteps,
linear_quadratic,
args.linear_threshold,
args.linear_range,
)
if args.transformer_path:
transformer = MochiTransformer3DModel.from_pretrained(args.transformer_path)
else:
transformer = MochiTransformer3DModel.from_pretrained(args.model_path, subfolder="transformer/")
pipe = MochiPipeline.from_pretrained(args.model_path, transformer=transformer, scheduler=scheduler)
pipe.enable_vae_tiling()
# pipe.to(device)
# if args.cpu_offload:
pipe.enable_sequential_cpu_offload()
return pipe
def generate_video(
prompt,
negative_prompt,
use_negative_prompt,
seed,
guidance_scale,
num_frames,
height,
width,
num_inference_steps,
randomize_seed=False,
):
if randomize_seed:
seed = torch.randint(0, 1000000, (1, )).item()
generator = torch.Generator(device="cuda").manual_seed(seed)
if not use_negative_prompt:
negative_prompt = None
with torch.autocast("cuda", dtype=torch.bfloat16):
output = pipe(
prompt=[prompt],
negative_prompt=negative_prompt,
height=height,
width=width,
num_frames=num_frames,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
generator=generator,
).frames[0]
output_path = os.path.join(tempfile.mkdtemp(), "output.mp4")
export_to_video(output, output_path, fps=30)
return output_path, seed
examples = [
"A hand enters the frame, pulling a sheet of plastic wrap over three balls of dough placed on a wooden surface. The plastic wrap is stretched to cover the dough more securely. The hand adjusts the wrap, ensuring that it is tight and smooth over the dough. The scene focuses on the hand’s movements as it secures the edges of the plastic wrap. No new objects appear, and the camera remains stationary, focusing on the action of covering the dough.",
"A vintage train snakes through the mountains, its plume of white steam rising dramatically against the jagged peaks. The cars glint in the late afternoon sun, their deep crimson and gold accents lending a touch of elegance. The tracks carve a precarious path along the cliffside, revealing glimpses of a roaring river far below. Inside, passengers peer out the large windows, their faces lit with awe as the landscape unfolds.",
"A crowded rooftop bar buzzes with energy, the city skyline twinkling like a field of stars in the background. Strings of fairy lights hang above, casting a warm, golden glow over the scene. Groups of people gather around high tables, their laughter blending with the soft rhythm of live jazz. The aroma of freshly mixed cocktails and charred appetizers wafts through the air, mingling with the cool night breeze.",
]
args = init_args()
pipe = load_model(args)
print("load model successfully")
with gr.Blocks() as demo:
gr.Markdown("# Fastvideo Mochi Video Generation Demo")
with gr.Group():
with gr.Row():
prompt = gr.Text(
label="Prompt",
show_label=False,
max_lines=1,
placeholder="Enter your prompt",
container=False,
)
run_button = gr.Button("Run", scale=0)
result = gr.Video(label="Result", show_label=False)
with gr.Accordion("Advanced options", open=False):
with gr.Group():
with gr.Row():
height = gr.Slider(
label="Height",
minimum=256,
maximum=1024,
step=32,
value=args.height,
)
width = gr.Slider(label="Width", minimum=256, maximum=1024, step=32, value=args.width)
with gr.Row():
num_frames = gr.Slider(
label="Number of Frames",
minimum=21,
maximum=163,
value=args.num_frames,
)
guidance_scale = gr.Slider(
label="Guidance Scale",
minimum=1,
maximum=12,
value=args.guidance_scale,
)
num_inference_steps = gr.Slider(
label="Inference Steps",
minimum=4,
maximum=100,
value=args.num_inference_steps,
)
with gr.Row():
use_negative_prompt = gr.Checkbox(label="Use negative prompt", value=False)
negative_prompt = gr.Text(
label="Negative prompt",
max_lines=1,
placeholder="Enter a negative prompt",
visible=False,
)
seed = gr.Slider(label="Seed", minimum=0, maximum=1000000, step=1, value=args.seed)
randomize_seed = gr.Checkbox(label="Randomize seed", value=True)
seed_output = gr.Number(label="Used Seed")
gr.Examples(examples=examples, inputs=prompt)
use_negative_prompt.change(
fn=lambda x: gr.update(visible=x),
inputs=use_negative_prompt,
outputs=negative_prompt,
)
run_button.click(
fn=generate_video,
inputs=[
prompt,
negative_prompt,
use_negative_prompt,
seed,
guidance_scale,
num_frames,
height,
width,
num_inference_steps,
randomize_seed,
],
outputs=[result, seed_output],
)
if __name__ == "__main__":
demo.queue(max_size=20).launch(server_name="0.0.0.0", server_port=7860)
Fast-Hunyuan comparison with original Hunyuan, achieving an 8X diffusion speed boost with the FastVideo framework.
https://github.com/user-attachments/assets/064ac1d2-11ed-4a0c-955b-4d412a96ef30
Fast-Mochi comparison with original Mochi, achieving an 8X diffusion speed boost with the FastVideo framework.
https://github.com/user-attachments/assets/5fbc4596-56d6-43aa-98e0-da472cf8e26c
Comparison between OpenAI Sora, original Hunyuan and FastHunyuan
https://github.com/user-attachments/assets/d323b712-3f68-42b2-952b-94f6a49c4836
Comparison between original FastHunyuan, LLM-INT8 quantized FastHunyuan and NF4 quantized FastHunyuan
https://github.com/user-attachments/assets/cf89efb5-5f68-4949-a085-f41c1ef26c94
FROM nvidia/cuda:12.4.1-devel-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /FastVideo
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
git \
ca-certificates \
openssh-server \
&& rm -rf /var/lib/apt/lists/*
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH=/opt/conda/bin:$PATH
RUN conda create --name fastvideo-dev python=3.10.0 -y
SHELL ["/bin/bash", "-c"]
# Copy just the pyproject.toml first to leverage Docker cache
COPY pyproject.toml ./
# Create a dummy README to satisfy the installation
RUN echo "# Placeholder" > README.md
RUN conda run -n fastvideo-dev pip install --no-cache-dir --upgrade pip && \
conda run -n fastvideo-dev pip install --no-cache-dir .[dev] && \
conda run -n fastvideo-dev pip install --no-cache-dir flash-attn==2.7.4.post1 --no-build-isolation && \
conda clean -afy
COPY . .
RUN conda run -n fastvideo-dev pip install --no-cache-dir -e .[dev]
# Remove authentication headers
RUN git config --unset-all http.https://github.com/.extraheader || true
# Set up automatic conda environment activation for all shells
RUN echo 'source /opt/conda/etc/profile.d/conda.sh' >> /root/.bashrc && \
echo 'conda activate fastvideo-dev' >> /root/.bashrc && \
# Ensure .bashrc is sourced for SSH login shells
echo 'if [ -f ~/.bashrc ]; then . ~/.bashrc; fi' > /root/.profile
EXPOSE 22
FROM nvidia/cuda:12.4.1-devel-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /FastVideo
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
git \
ca-certificates \
openssh-server \
&& rm -rf /var/lib/apt/lists/*
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH=/opt/conda/bin:$PATH
RUN conda create --name fastvideo-dev python=3.11.11 -y
SHELL ["/bin/bash", "-c"]
# Copy just the pyproject.toml first to leverage Docker cache
COPY pyproject.toml ./
# Create a dummy README to satisfy the installation
RUN echo "# Placeholder" > README.md
RUN conda run -n fastvideo-dev pip install --no-cache-dir --upgrade pip && \
conda run -n fastvideo-dev pip install --no-cache-dir .[dev] && \
conda run -n fastvideo-dev pip install --no-cache-dir flash-attn==2.7.4.post1 --no-build-isolation && \
conda clean -afy
COPY . .
RUN conda run -n fastvideo-dev pip install --no-cache-dir -e .[dev]
# Remove authentication headers
RUN git config --unset-all http.https://github.com/.extraheader || true
# Set up automatic conda environment activation for all shells
RUN echo 'source /opt/conda/etc/profile.d/conda.sh' >> /root/.bashrc && \
echo 'conda activate fastvideo-dev' >> /root/.bashrc && \
# Ensure .bashrc is sourced for SSH login shells
echo 'if [ -f ~/.bashrc ]; then . ~/.bashrc; fi' > /root/.profile
EXPOSE 22
FROM nvidia/cuda:12.4.1-devel-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /FastVideo
RUN apt-get update && apt-get install -y --no-install-recommends \
wget \
git \
ca-certificates \
openssh-server \
&& rm -rf /var/lib/apt/lists/*
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH=/opt/conda/bin:$PATH
RUN conda create --name fastvideo-dev python=3.12.9 -y
SHELL ["/bin/bash", "-c"]
# Copy just the pyproject.toml first to leverage Docker cache
COPY pyproject.toml ./
# Create a dummy README to satisfy the installation
RUN echo "# Placeholder" > README.md
RUN conda run -n fastvideo-dev pip install --no-cache-dir --upgrade pip && \
conda run -n fastvideo-dev pip install --no-cache-dir .[dev] && \
conda run -n fastvideo-dev pip install --no-cache-dir flash-attn==2.7.4.post1 --no-build-isolation && \
conda clean -afy
COPY . .
RUN conda run -n fastvideo-dev pip install --no-cache-dir -e .[dev]
# Remove authentication headers
RUN git config --unset-all http.https://github.com/.extraheader || true
# Set up automatic conda environment activation for all shells
RUN echo 'source /opt/conda/etc/profile.d/conda.sh' >> /root/.bashrc && \
echo 'conda activate fastvideo-dev' >> /root/.bashrc && \
# Ensure .bashrc is sourced for SSH login shells
echo 'if [ -f ~/.bashrc ]; then . ~/.bashrc; fi' > /root/.profile
EXPOSE 22
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
clean:
@$(SPHINXBUILD) -M clean "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
rm -rf "$(SOURCEDIR)/getting_started/examples"
rm -rf "$(SOURCEDIR)/inference/examples"
# FastVideo documents
## Build the docs
```bash
# Install dependencies.
pip install -r requirements-docs.txt
# Build the docs.
make clean
make html
```
## Open the docs with your browser
```bash
python -m http.server -d build/html/
```
Launch your browser and open localhost:8000.
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd
# Seed Parameter Behavior in vLLM
## Overview
The `seed` parameter in vLLM is used to control the random states for various random number generators. This parameter can affect the behavior of random operations in user code, especially when working with models in vLLM.
## Default Behavior
By default, the `seed` parameter is set to `None`. When the `seed` parameter is `None`, the global random states for `random`, `np.random`, and `torch` (via `torch.manual_seed`) are not set. This means that random operations behave as usual, without any fixed global random state.
## Specifying a Seed
If a specific seed value is provided, the global random states for `random`, `np.random`, and `torch` (via `torch.manual_seed`) are set accordingly. This can be useful for reproducibility, as it ensures that random operations produce the same results across multiple runs.
## Example Usage
### Without Specifying a Seed
```python
import random
from vllm import LLM
# Initialize a vLLM model without specifying a seed
model = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
# Try generating random numbers
print(random.randint(0, 100)) # Outputs different numbers across runs
```
### Specifying a Seed
```python
import random
from vllm import LLM
# Initialize a vLLM model with a specific seed
model = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", seed=42)
# Try generating random numbers
print(random.randint(0, 100)) # Outputs the same number across runs
```
## Important Notes
- If the `seed` parameter is not specified, the behavior of global random states remains unaffected.
- If a specific seed value is provided, the global random states for `random`, `np.random`, and `torch` (via `torch.manual_seed`) will be set to that value.
- This behavior can be useful for reproducibility but may lead to non-intuitive behavior if the user is not explicitly aware of it.
## Conclusion
Understanding the behavior of the `seed` parameter in vLLM is crucial for ensuring the expected behavior of random operations in your code. By default, the `seed` parameter is set to `None`, which means that the global random states are not affected. However, specifying a seed value can help achieve reproducibility in your experiments.
.vertical-table-header th.head:not(.stub) {
writing-mode: sideways-lr;
white-space: nowrap;
max-width: 0;
p {
margin: 0;
}
}