W0523 13:05:27.356000 23309818001216 torch/distributed/run.py:779]
W0523 13:05:27.356000 23309818001216 torch/distributed/run.py:779] *****************************************
W0523 13:05:27.356000 23309818001216 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0523 13:05:27.356000 23309818001216 torch/distributed/run.py:779] *****************************************
INFO 05-23 13:05:31 __init__.py:193] Automatically detected platform rocm.
INFO 05-23 13:05:31 __init__.py:193] Automatically detected platform rocm.
INFO 05-23 13:05:32 __init__.py:193] Automatically detected platform rocm.
INFO 05-23 13:05:32 __init__.py:193] Automatically detected platform rocm.
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0523 13:05:32.669274 7493 ProcessGroupNCCL.cpp:869] [PG 0 Rank 2] ProcessGroupNCCL initialization options: size: 4, global rank: 2, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0523 13:05:32.669310 7493 ProcessGroupNCCL.cpp:878] [PG 0 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0523 13:05:32.669786 7493 ProcessGroupNCCL.cpp:869] [PG 1 Rank 2] ProcessGroupNCCL initialization options: size: 4, global rank: 2, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55ba768dd970, SPLIT_COLOR: 1008299991543067201, PG Name: 1
I0523 13:05:32.669796 7493 ProcessGroupNCCL.cpp:878] [PG 1 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0523 13:05:32.726428 7492 ProcessGroupNCCL.cpp:869] [PG 0 Rank 1] ProcessGroupNCCL initialization options: size: 4, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0523 13:05:32.726469 7492 ProcessGroupNCCL.cpp:878] [PG 0 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0523 13:05:32.726986 7492 ProcessGroupNCCL.cpp:869] [PG 1 Rank 1] ProcessGroupNCCL initialization options: size: 4, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5570e55b5ca0, SPLIT_COLOR: 1008299991543067201, PG Name: 1
I0523 13:05:32.726996 7492 ProcessGroupNCCL.cpp:878] [PG 1 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0523 13:05:33.251652 7494 ProcessGroupNCCL.cpp:869] [PG 0 Rank 3] ProcessGroupNCCL initialization options: size: 4, global rank: 3, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0523 13:05:33.251686 7494 ProcessGroupNCCL.cpp:878] [PG 0 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0523 13:05:33.252153 7494 ProcessGroupNCCL.cpp:869] [PG 1 Rank 3] ProcessGroupNCCL initialization options: size: 4, global rank: 3, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56442d4470f0, SPLIT_COLOR: 1008299991543067201, PG Name: 1
I0523 13:05:33.252162 7494 ProcessGroupNCCL.cpp:878] [PG 1 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0523 13:05:33.368182 7491 ProcessGroupNCCL.cpp:869] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 4, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0523 13:05:33.368213 7491 ProcessGroupNCCL.cpp:878] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0523 13:05:33.368700 7491 ProcessGroupNCCL.cpp:869] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 4, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x563b447bf970, SPLIT_COLOR: 1008299991543067201, PG Name: 1
I0523 13:05:33.368708 7491 ProcessGroupNCCL.cpp:878] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
--> loading model from /public/model/HunyuanVideo/hunyuan-video-t2v-720p
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
--> applying fdsp activation checkpointing...
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 1e-05
    maximize: False
    weight_decay: 0.01
)
***** Running training *****
  Num examples = 101
  Dataloader size = 26
  Num Epochs = 20
  Resume training from step 0
  Instantaneous batch size per device = 1
  Total train batch size (w. data & sequence parallel, accumulation) = 1.0
  Gradient Accumulation steps = 1
  Total optimization steps = 2000
  Total training parameters per FSDP shard = 3.205253136 B
  Master weight dtype: torch.float32
Steps:   0%|          | 0/2000 [00:00<?, ?it/s]
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
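For reference, a rough back-of-the-envelope on the per-GPU memory implied by the figures printed above. This is only a sketch under stated assumptions (fp32 master weights as logged, fp32 AdamW exp_avg/exp_avg_sq, fp32 sharded gradients; activations and the bf16 compute copy are ignored), not a measured breakdown:

```python
# Illustrative arithmetic only; all dtype assumptions are noted, nothing here is measured.
total_params = 12_821_012_544            # "Total training parameters = 12821.012544 M"
world_size = 4                           # 4 ranks, FSDP full sharding
per_shard = total_params / world_size    # ~3.205e9, matches "3.205253136 B" per FSDP shard

gib = 1024 ** 3
master_fp32 = per_shard * 4 / gib        # ~11.9 GiB fp32 master weights (per the log)
adam_states = per_shard * 4 * 2 / gib    # ~23.9 GiB exp_avg + exp_avg_sq (assumed fp32)
grads_fp32  = per_shard * 4 / gib        # ~11.9 GiB sharded gradients (assumed fp32)

print(round(master_fp32 + adam_states + grads_fp32, 1))
# ~47.8 GiB before activations and temporaries, in the same ballpark as the
# ~55 GiB "allocated by PyTorch" reported when the 64 GiB GPUs run out below.
```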
I0523 13:07:02.079018 7491 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.197377 ms
I0523 13:07:02.079244 7492 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 1] ProcessGroupNCCL broadcast unique ID through store took 149.006 ms
I0523 13:07:02.108016 7494 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 3] ProcessGroupNCCL broadcast unique ID through store took 0.324357 ms
I0523 13:07:02.198725 7493 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 2] ProcessGroupNCCL broadcast unique ID through store took 0.290246 ms
I0523 13:07:03.039152 7492 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 1] ProcessGroupNCCL created ncclComm_ 0x5570ea3e7d20 on CUDA device:
I0523 13:07:03.039230 7492 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 1] NCCL_DEBUG: N/A
I0523 13:07:03.039255 7493 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 2] ProcessGroupNCCL created ncclComm_ 0x55ba7b570c20 on CUDA device:
I0523 13:07:03.039288 7491 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 0] ProcessGroupNCCL created ncclComm_ 0x563b495d0350 on CUDA device:
I0523 13:07:03.039363 7493 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 2] NCCL_DEBUG: N/A
I0523 13:07:03.039372 7491 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 0] NCCL_DEBUG: N/A
I0523 13:07:03.039367 7494 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 3] ProcessGroupNCCL created ncclComm_ 0x56443211fe60 on CUDA device:
I0523 13:07:03.039454 7494 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 3] NCCL_DEBUG: N/A
I0523 13:07:03.487334 7491 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.035869 ms
I0523 13:07:03.487444 7492 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 1] ProcessGroupNCCL broadcast unique ID through store took 17.0139 ms
I0523 13:07:03.487512 7493 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 2] ProcessGroupNCCL broadcast unique ID through store took 3.73795 ms
I0523 13:07:03.488173 7494 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 3] ProcessGroupNCCL broadcast unique ID through store took 0.211737 ms
I0523 13:07:03.889570 7492 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 1] ProcessGroupNCCL created ncclComm_ 0x5570ea192570 on CUDA device:
I0523 13:07:03.889585 7493 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 2] ProcessGroupNCCL created ncclComm_ 0x55ba7b5455a0 on CUDA device:
I0523 13:07:03.889576 7491 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL created ncclComm_ 0x563b48338b20 on CUDA device:
I0523 13:07:03.889642 7493 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 2] NCCL_DEBUG: N/A
I0523 13:07:03.889645 7492 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 1] NCCL_DEBUG: N/A
I0523 13:07:03.889657 7491 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 0] NCCL_DEBUG: N/A
I0523 13:07:03.889678 7494 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 3] ProcessGroupNCCL created ncclComm_ 0x564432000160 on CUDA device:
I0523 13:07:03.889797 7494 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 3] NCCL_DEBUG: N/A
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 664, in <module>
[rank0]:     main(args)
[rank0]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 369, in main
[rank0]:     loss, grad_norm = train_one_step(
[rank0]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 164, in train_one_step
[rank0]:     optimizer.step()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank0]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 227, in step
[rank0]:     adamw(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 767, in adamw
[rank0]:     func(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 600, in _multi_tensor_adamw
[rank0]:     exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 136.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 55.14 GiB is allocated by PyTorch, and 960.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank3]: Traceback (most recent call last):
[rank3]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 664, in <module>
[rank3]:     main(args)
[rank3]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 369, in main
[rank3]:     loss, grad_norm = train_one_step(
[rank3]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 164, in train_one_step
[rank3]:     optimizer.step()
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank3]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank3]:     out = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank3]:     ret = func(self, *args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 227, in step
[rank3]:     adamw(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 767, in adamw
[rank3]:     func(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 600, in _multi_tensor_adamw
[rank3]:     exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank3]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 136.00 MiB. GPU 3 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 55.28 GiB is allocated by PyTorch, and 910.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 664, in <module>
[rank1]:     main(args)
[rank1]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 369, in main
[rank1]:     loss, grad_norm = train_one_step(
[rank1]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 164, in train_one_step
[rank1]:     optimizer.step()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank1]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank1]:     ret = func(self, *args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 227, in step
[rank1]:     adamw(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 767, in adamw
[rank1]:     func(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 600, in _multi_tensor_adamw
[rank1]:     exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank1]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 136.00 MiB.
GPU 1 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 55.28 GiB is allocated by PyTorch, and 952.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
[rank2]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 664, in <module>
[rank2]:     main(args)
[rank2]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 369, in main
[rank2]:     loss, grad_norm = train_one_step(
[rank2]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 164, in train_one_step
[rank2]:     optimizer.step()
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank2]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank2]:     out = func(*args, **kwargs)
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank2]:     ret = func(self, *args, **kwargs)
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 227, in step
[rank2]:     adamw(
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 767, in adamw
[rank2]:     func(
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 600, in _multi_tensor_adamw
[rank2]:     exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank2]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 136.00 MiB. GPU 2 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 55.28 GiB is allocated by PyTorch, and 900.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Steps:   0%|          | 0/2000 [02:46<?, ?it/s]
Traceback (most recent call last):
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/public/hy-code/FastVideo-main/fastvideo/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-23_13:09:51
  host      : BW1000
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7491)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
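The OOM messages above themselves point at PYTORCH_HIP_ALLOC_CONF=expandable_segments:True. A minimal sketch of how one might try that hint, assuming the variable is set before the first HIP allocation in each rank (exporting it in the shell that launches torchrun is equivalent); note it only targets fragmentation, not the underlying optimizer-state footprint estimated earlier:

```python
import os

# Follow the allocator hint printed in the OOM message above.
# This must take effect before the first GPU allocation in the process,
# so either export it before torchrun or set it at the very top of train.py.
os.environ.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")
```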