start exec
[2025-06-04 10:26:06,148] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)   (printed once per rank; 8 ranks total)
INFO 06-04 10:26:07 __init__.py:193] Automatically detected platform rocm.   (printed once per rank; 8 ranks total)
Could not load Sliding Tile Attention.   (printed once per rank; 8 ranks total)
--> loading model from /public/home/wuxk/code/data
Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
--> applying fsdp activation checkpointing...   (printed once per rank; 8 ranks total)
--> model loaded
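For orientation, the two FSDP messages above ("sharding strategy: full" and "applying fsdp activation checkpointing") correspond to the usual PyTorch FSDP setup. Below is a minimal sketch using PyTorch's public FSDP APIs, not this repo's actual code: shard_and_checkpoint and block_types are hypothetical names, where block_types would be the MMDoubleStreamBlock/MMSingleStreamBlock classes visible in the module dump that follows.

    import functools
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        apply_activation_checkpointing,
        checkpoint_wrapper,
    )

    def shard_and_checkpoint(model, block_types):
        # "full" sharding is ZeRO-3 style: parameters, gradients, and
        # optimizer state are all sharded across ranks.
        policy = functools.partial(
            transformer_auto_wrap_policy, transformer_layer_cls=set(block_types)
        )
        model = FSDP(
            model,
            sharding_strategy=ShardingStrategy.FULL_SHARD,
            auto_wrap_policy=policy,
        )
        # Wrapping each block for activation checkpointing *after* FSDP is
        # what yields the FSDP(CheckpointWrapper(block)) nesting printed below.
        apply_activation_checkpointing(
            model,
            checkpoint_wrapper_fn=checkpoint_wrapper,
            check_fn=lambda m: isinstance(m, tuple(block_types)),
        )
        return model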
FullyShardedDataParallel(
  (_fsdp_wrapped_module): HYVideoDiffusionTransformer(
    (img_in): PatchEmbed(
      (proj): Conv3d(16, 3072, kernel_size=(1, 2, 2), stride=(1, 2, 2))
      (norm): Identity()
    )
    (txt_in): SingleTokenRefiner(
      (input_embedder): Linear(in_features=4096, out_features=3072, bias=True)
      (t_embedder): TimestepEmbedder(
        (mlp): Sequential(
          (0): Linear(in_features=256, out_features=3072, bias=True)
          (1): SiLU()
          (2): Linear(in_features=3072, out_features=3072, bias=True)
        )
      )
      (c_embedder): TextProjection(
        (linear_1): Linear(in_features=4096, out_features=3072, bias=True)
        (act_1): SiLU()
        (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
      )
      (individual_token_refiner): IndividualTokenRefiner(
        (blocks): ModuleList(
          (0-1): 2 x IndividualTokenRefinerBlock(
            (norm1): LayerNorm((3072,), eps=1e-06, elementwise_affine=True)
            (self_attn_qkv): Linear(in_features=3072, out_features=9216, bias=True)
            (self_attn_q_norm): Identity()
            (self_attn_k_norm): Identity()
            (self_attn_proj): Linear(in_features=3072, out_features=3072, bias=True)
            (norm2): LayerNorm((3072,), eps=1e-06, elementwise_affine=True)
            (mlp): MLP(
              (fc1): Linear(in_features=3072, out_features=12288, bias=True)
              (act): SiLU()
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=12288, out_features=3072, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (adaLN_modulation): Sequential(
              (0): SiLU()
              (1): Linear(in_features=3072, out_features=6144, bias=True)
            )
          )
        )
      )
    )
    (time_in): TimestepEmbedder(
      (mlp): Sequential(
        (0): Linear(in_features=256, out_features=3072, bias=True)
        (1): SiLU()
        (2): Linear(in_features=3072, out_features=3072, bias=True)
      )
    )
    (vector_in): MLPEmbedder(
      (in_layer): Linear(in_features=768, out_features=3072, bias=True)
      (silu): SiLU()
      (out_layer): Linear(in_features=3072, out_features=3072, bias=True)
    )
    (guidance_in): TimestepEmbedder(
      (mlp): Sequential(
        (0): Linear(in_features=256, out_features=3072, bias=True)
        (1): SiLU()
        (2): Linear(in_features=3072, out_features=3072, bias=True)
      )
    )
    (double_blocks): ModuleList(
      (0-19): 20 x FullyShardedDataParallel(
        (_fsdp_wrapped_module): CheckpointWrapper(
          (_checkpoint_wrapped_module): MMDoubleStreamBlock(
            (img_mod): ModulateDiT(
              (act): SiLU()
              (linear): Linear(in_features=3072, out_features=18432, bias=True)
            )
            (img_norm1): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
            (img_attn_qkv): Linear(in_features=3072, out_features=9216, bias=True)
            (img_attn_q_norm): RMSNorm()
            (img_attn_k_norm): RMSNorm()
            (img_attn_proj): Linear(in_features=3072, out_features=3072, bias=True)
            (img_norm2): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
            (img_mlp): MLP(
              (fc1): Linear(in_features=3072, out_features=12288, bias=True)
              (act): GELU(approximate='tanh')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=12288, out_features=3072, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (txt_mod): ModulateDiT(
              (act): SiLU()
              (linear): Linear(in_features=3072, out_features=18432, bias=True)
            )
            (txt_norm1): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
            (txt_attn_qkv): Linear(in_features=3072, out_features=9216, bias=True)
            (txt_attn_q_norm): RMSNorm()
            (txt_attn_k_norm): RMSNorm()
            (txt_attn_proj): Linear(in_features=3072, out_features=3072, bias=True)
            (txt_norm2): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
            (txt_mlp): MLP(
              (fc1): Linear(in_features=3072, out_features=12288, bias=True)
              (act): GELU(approximate='tanh')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=12288, out_features=3072, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
          )
        )
      )
    )
    (single_blocks): ModuleList(
      (0-39): 40 x FullyShardedDataParallel(
        (_fsdp_wrapped_module): CheckpointWrapper(
          (_checkpoint_wrapped_module): MMSingleStreamBlock(
            (linear1): Linear(in_features=3072, out_features=21504, bias=True)
            (linear2): Linear(in_features=15360, out_features=3072, bias=True)
            (q_norm): RMSNorm()
            (k_norm): RMSNorm()
            (pre_norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
            (mlp_act): GELU(approximate='tanh')
            (modulation): ModulateDiT(
              (act): SiLU()
              (linear): Linear(in_features=3072, out_features=9216, bias=True)
            )
          )
        )
      )
    )
    (final_layer): FinalLayer(
      (norm_final): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
      (linear): Linear(in_features=3072, out_features=64, bias=True)
      (adaLN_modulation): Sequential(
        (0): SiLU()
        (1): Linear(in_features=3072, out_features=6144, bias=True)
      )
    )
  )
)
optimizer: AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 1e-05
    maximize: False
    weight_decay: 0.01
)
***** Running training *****
  Num examples = 101
  Dataloader size = 13
  Num Epochs = 1
  Resume training from step 0
  Instantaneous batch size per device = 1
  Total train batch size (w. data & sequence parallel, accumulation) = 2.0
  Gradient Accumulation steps = 1
  Total optimization steps = 8
  Total training parameters per FSDP shard = 1.602626568 B
  Master weight dtype: torch.float32
zll step_time: 284.15s avg_step_time: 284.1516556739807
zll step_time: 149.47s avg_step_time: 216.8092384338379
zll step_time: 148.99s avg_step_time: 194.20422458648682
zll step_time: 148.98s avg_step_time: 182.89871686697006
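A quick arithmetic check of the summary numbers above (a sketch, not project code; the world size of 8 is inferred from the eight per-rank replicas of each startup line, and it matches 12821.012544 M / 8 = 1.602626568 B exactly):

    import math

    total_params_M = 12821.012544          # "Total training parameters" above
    world_size = 8                         # inferred from the 8 per-rank log lines
    print(total_params_M / 1000 / world_size)  # 1.602626568 -> matches "per FSDP shard = 1.602626568 B"
    print(math.ceil(101 / world_size))         # 13 -> consistent with "Dataloader size = 13",
                                               # assuming a per-rank dataloader (an inference)

    # avg_step_time is a running mean of step_time; recomputing it from the
    # rounded per-step values reproduces the logged averages to ~2 decimals:
    step_times = [284.15, 149.47, 148.99, 148.98]
    for i in range(len(step_times)):
        print(sum(step_times[:i + 1]) / (i + 1))  # 284.15, 216.81, 194.20..., 182.89...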