start exec
[2025-06-04 11:05:01,631] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 06-04 11:05:02 __init__.py:193] Automatically detected platform rocm.
Could not load Sliding Tile Attention.
--> loading model from /public/home/wuxk/code/data
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
--> applying fsdp activation checkpointing...
--> model loaded
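The per-shard figure reported further down (1.602626568 B) is consistent with the 12821.012544 M total above under full sharding: FULL_SHARD splits every parameter tensor evenly across ranks. A quick arithmetic sanity check, assuming the 8 ranks suggested by the eight accelerator startup lines (the variable names are illustrative, not from the training script):

```python
# Sanity-check the per-shard parameter count reported by the log.
# Assumption: 8 FSDP ranks (the startup log prints 8 accelerator lines).
total_params_m = 12821.012544      # total parameters, in millions (from the log)
num_fsdp_ranks = 8                 # assumed world size for full sharding

# FULL_SHARD divides the flattened parameters evenly across ranks,
# so each rank holds 1/world_size of the weights.
per_shard_b = total_params_m / num_fsdp_ranks / 1000.0  # in billions

print(f"per-shard parameters: {per_shard_b:.9f} B")  # 1.602626568 B, matching the log
```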
FullyShardedDataParallel(
  (_fsdp_wrapped_module): HYVideoDiffusionTransformer(
    (img_in): PatchEmbed(
      (proj): Conv3d(16, 3072, kernel_size=(1, 2, 2), stride=(1, 2, 2))
      (norm): Identity()
    )
    (txt_in): SingleTokenRefiner(
      (input_embedder): Linear(in_features=4096, out_features=3072, bias=True)
      (t_embedder): TimestepEmbedder(
        (mlp): Sequential(
          (0): Linear(in_features=256, out_features=3072, bias=True)
          (1): SiLU()
          (2): Linear(in_features=3072, out_features=3072, bias=True)
        )
      )
      (c_embedder): TextProjection(
        (linear_1): Linear(in_features=4096, out_features=3072, bias=True)
        (act_1): SiLU()
        (linear_2): Linear(in_features=3072, out_features=3072, bias=True)
      )
      (individual_token_refiner): IndividualTokenRefiner(
        (blocks): ModuleList(
          (0-1): 2 x IndividualTokenRefinerBlock(
            (norm1): LayerNorm((3072,), eps=1e-06, elementwise_affine=True)
            (self_attn_qkv): Linear(in_features=3072, out_features=9216, bias=True)
            (self_attn_q_norm): Identity()
            (self_attn_k_norm): Identity()
            (self_attn_proj): Linear(in_features=3072, out_features=3072, bias=True)
            (norm2): LayerNorm((3072,), eps=1e-06, elementwise_affine=True)
            (mlp): MLP(
              (fc1): Linear(in_features=3072, out_features=12288, bias=True)
              (act): SiLU()
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=12288, out_features=3072, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (adaLN_modulation): Sequential(
              (0): SiLU()
              (1): Linear(in_features=3072, out_features=6144, bias=True)
            )
          )
        )
      )
    )
    (time_in): TimestepEmbedder(
      (mlp): Sequential(
        (0): Linear(in_features=256, out_features=3072, bias=True)
        (1): SiLU()
        (2): Linear(in_features=3072, out_features=3072, bias=True)
      )
    )
    (vector_in): MLPEmbedder(
      (in_layer): Linear(in_features=768, out_features=3072, bias=True)
      (silu): SiLU()
      (out_layer): Linear(in_features=3072, out_features=3072, bias=True)
    )
    (guidance_in): TimestepEmbedder(
      (mlp): Sequential(
        (0): Linear(in_features=256, out_features=3072, bias=True)
        (1): SiLU()
        (2): Linear(in_features=3072, out_features=3072, bias=True)
      )
    )
    (double_blocks): ModuleList(
      (0-19): 20 x FullyShardedDataParallel(
        (_fsdp_wrapped_module): CheckpointWrapper(
          (_checkpoint_wrapped_module): MMDoubleStreamBlock(
            (img_mod): ModulateDiT(
              (act): SiLU()
              (linear): Linear(in_features=3072, out_features=18432, bias=True)
            )
            (img_norm1): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
            (img_attn_qkv): Linear(in_features=3072, out_features=9216, bias=True)
            (img_attn_q_norm): RMSNorm()
            (img_attn_k_norm): RMSNorm()
            (img_attn_proj): Linear(in_features=3072, out_features=3072, bias=True)
            (img_norm2): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
            (img_mlp): MLP(
              (fc1): Linear(in_features=3072, out_features=12288, bias=True)
              (act): GELU(approximate='tanh')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=12288, out_features=3072, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
            (txt_mod): ModulateDiT(
              (act): SiLU()
              (linear): Linear(in_features=3072, out_features=18432, bias=True)
            )
            (txt_norm1): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
            (txt_attn_qkv): Linear(in_features=3072, out_features=9216, bias=True)
            (txt_attn_q_norm): RMSNorm()
            (txt_attn_k_norm): RMSNorm()
            (txt_attn_proj): Linear(in_features=3072, out_features=3072, bias=True)
            (txt_norm2): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
            (txt_mlp): MLP(
              (fc1): Linear(in_features=3072, out_features=12288, bias=True)
              (act): GELU(approximate='tanh')
              (drop1): Dropout(p=0.0, inplace=False)
              (norm): Identity()
              (fc2): Linear(in_features=12288, out_features=3072, bias=True)
              (drop2): Dropout(p=0.0, inplace=False)
            )
          )
        )
      )
    )
    (single_blocks): ModuleList(
      (0-39): 40 x FullyShardedDataParallel(
        (_fsdp_wrapped_module): CheckpointWrapper(
          (_checkpoint_wrapped_module): MMSingleStreamBlock(
            (linear1): Linear(in_features=3072, out_features=21504, bias=True)
            (linear2): Linear(in_features=15360, out_features=3072, bias=True)
            (q_norm): RMSNorm()
            (k_norm): RMSNorm()
            (pre_norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
            (mlp_act): GELU(approximate='tanh')
            (modulation): ModulateDiT(
              (act): SiLU()
              (linear): Linear(in_features=3072, out_features=9216, bias=True)
            )
          )
        )
      )
    )
    (final_layer): FinalLayer(
      (norm_final): LayerNorm((3072,), eps=1e-06, elementwise_affine=False)
      (linear): Linear(in_features=3072, out_features=64, bias=True)
      (adaLN_modulation): Sequential(
        (0): SiLU()
        (1): Linear(in_features=3072, out_features=6144, bias=True)
      )
    )
  )
)
optimizer: AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 1e-05
    maximize: False
    weight_decay: 0.01
)
***** Running training *****
  Num examples = 101
  Dataloader size = 13
  Num Epochs = 1
  Resume training from step 0
  Instantaneous batch size per device = 1
  Total train batch size (w. data & sequence parallel, accumulation) = 2.0
  Gradient Accumulation steps = 1
  Total optimization steps = 8
  Total training parameters per FSDP shard = 1.602626568 B
  Master weight dtype: torch.float32
--> applying fsdp activation checkpointing...
zll step_time: 281.03s avg_step_time: 281.0341317653656
zll step_time: 149.39s avg_step_time: 215.21051609516144
zll step_time: 148.66s avg_step_time: 193.02822120984396
zll step_time: 148.75s avg_step_time: 181.9594955444336
zll step_time: 148.81s avg_step_time: 175.3291331768036
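The trailing `zll step_time` lines print each step's wall-clock time alongside a running mean. The first step (281.03 s, likely including compilation and warm-up) pulls the average well above the ~149 s steady-state steps, which is why `avg_step_time` keeps falling. That running mean can be maintained incrementally without storing the history; a minimal sketch of the bookkeeping (variable names are illustrative, not taken from the training script, and the printed step times are rounded, so the averages only approximate the logged ones):

```python
# Incremental running mean of per-step wall-clock times,
# mirroring the "step_time ... avg_step_time ..." lines in the log.
step_times = [281.0341317653656, 149.39, 148.66, 148.75, 148.81]  # example values

avg = 0.0
for n, t in enumerate(step_times, start=1):
    # Numerically stable update: new_mean = old_mean + (x - old_mean) / n
    avg += (t - avg) / n
    print(f"step_time: {t:.2f}s avg_step_time: {avg}")
```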