W0523 13:05:27.356000 23309818001216 torch/distributed/run.py:779]
W0523 13:05:27.356000 23309818001216 torch/distributed/run.py:779] *****************************************
W0523 13:05:27.356000 23309818001216 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0523 13:05:27.356000 23309818001216 torch/distributed/run.py:779] *****************************************
INFO 05-23 13:05:31 __init__.py:193] Automatically detected platform rocm.
INFO 05-23 13:05:31 __init__.py:193] Automatically detected platform rocm.
INFO 05-23 13:05:32 __init__.py:193] Automatically detected platform rocm.
INFO 05-23 13:05:32 __init__.py:193] Automatically detected platform rocm.
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0523 13:05:32.669274 7493 ProcessGroupNCCL.cpp:869] [PG 0 Rank 2] ProcessGroupNCCL initialization options: size: 4, global rank: 2, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0523 13:05:32.669310 7493 ProcessGroupNCCL.cpp:878] [PG 0 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0523 13:05:32.669786 7493 ProcessGroupNCCL.cpp:869] [PG 1 Rank 2] ProcessGroupNCCL initialization options: size: 4, global rank: 2, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x55ba768dd970, SPLIT_COLOR: 1008299991543067201, PG Name: 1
I0523 13:05:32.669796 7493 ProcessGroupNCCL.cpp:878] [PG 1 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0523 13:05:32.726428 7492 ProcessGroupNCCL.cpp:869] [PG 0 Rank 1] ProcessGroupNCCL initialization options: size: 4, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0523 13:05:32.726469 7492 ProcessGroupNCCL.cpp:878] [PG 0 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0523 13:05:32.726986 7492 ProcessGroupNCCL.cpp:869] [PG 1 Rank 1] ProcessGroupNCCL initialization options: size: 4, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x5570e55b5ca0, SPLIT_COLOR: 1008299991543067201, PG Name: 1
I0523 13:05:32.726996 7492 ProcessGroupNCCL.cpp:878] [PG 1 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0523 13:05:33.251652 7494 ProcessGroupNCCL.cpp:869] [PG 0 Rank 3] ProcessGroupNCCL initialization options: size: 4, global rank: 3, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0523 13:05:33.251686 7494 ProcessGroupNCCL.cpp:878] [PG 0 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0523 13:05:33.252153 7494 ProcessGroupNCCL.cpp:869] [PG 1 Rank 3] ProcessGroupNCCL initialization options: size: 4, global rank: 3, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x56442d4470f0, SPLIT_COLOR: 1008299991543067201, PG Name: 1
I0523 13:05:33.252162 7494 ProcessGroupNCCL.cpp:878] [PG 1 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0523 13:05:33.368182 7491 ProcessGroupNCCL.cpp:869] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 4, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0523 13:05:33.368213 7491 ProcessGroupNCCL.cpp:878] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0523 13:05:33.368700 7491 ProcessGroupNCCL.cpp:869] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 4, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0x563b447bf970, SPLIT_COLOR: 1008299991543067201, PG Name: 1
I0523 13:05:33.368708 7491 ProcessGroupNCCL.cpp:878] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
--> loading model from /public/model/HunyuanVideo/hunyuan-video-t2v-720p
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
--> applying fdsp activation checkpointing...
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 1e-05
    maximize: False
    weight_decay: 0.01
)
***** Running training *****
  Num examples = 101
  Dataloader size = 26
  Num Epochs = 20
  Resume training from step 0
  Instantaneous batch size per device = 1
  Total train batch size (w. data & sequence parallel, accumulation) = 1.0
  Gradient Accumulation steps = 1
  Total optimization steps = 2000
  Total training parameters per FSDP shard = 3.205253136 B
  Master weight dtype: torch.float32
Steps:   0%|          | 0/2000 [00:00<?, ?it/s]
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
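For reference, a rough back-of-the-envelope on the per-GPU memory implied by the figures printed above. This is only a sketch under stated assumptions (fp32 master weights as logged, fp32 AdamW exp_avg/exp_avg_sq, fp32 sharded gradients; activations and the bf16 compute copy are ignored), not a measured breakdown:

```python
# Illustrative arithmetic only; all dtype assumptions are noted, nothing here is measured.
total_params = 12_821_012_544            # "Total training parameters = 12821.012544 M"
world_size = 4                           # 4 ranks, FSDP full sharding
per_shard = total_params / world_size    # ~3.205e9, matches "3.205253136 B" per FSDP shard

gib = 1024 ** 3
master_fp32 = per_shard * 4 / gib        # ~11.9 GiB fp32 master weights (per the log)
adam_states = per_shard * 4 * 2 / gib    # ~23.9 GiB exp_avg + exp_avg_sq (assumed fp32)
grads_fp32  = per_shard * 4 / gib        # ~11.9 GiB sharded gradients (assumed fp32)

print(round(master_fp32 + adam_states + grads_fp32, 1))
# ~47.8 GiB before activations and temporaries, in the same ballpark as the
# ~55 GiB "allocated by PyTorch" reported when the 64 GiB GPUs run out below.
```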
I0523 13:07:02.079018 7491 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.197377 ms
I0523 13:07:02.079244 7492 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 1] ProcessGroupNCCL broadcast unique ID through store took 149.006 ms
I0523 13:07:02.108016 7494 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 3] ProcessGroupNCCL broadcast unique ID through store took 0.324357 ms
I0523 13:07:02.198725 7493 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 2] ProcessGroupNCCL broadcast unique ID through store took 0.290246 ms
I0523 13:07:03.039152 7492 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 1] ProcessGroupNCCL created ncclComm_ 0x5570ea3e7d20 on CUDA device:
I0523 13:07:03.039230 7492 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 1] NCCL_DEBUG: N/A
I0523 13:07:03.039255 7493 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 2] ProcessGroupNCCL created ncclComm_ 0x55ba7b570c20 on CUDA device:
I0523 13:07:03.039288 7491 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 0] ProcessGroupNCCL created ncclComm_ 0x563b495d0350 on CUDA device:
I0523 13:07:03.039363 7493 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 2] NCCL_DEBUG: N/A
I0523 13:07:03.039372 7491 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 0] NCCL_DEBUG: N/A
I0523 13:07:03.039367 7494 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 3] ProcessGroupNCCL created ncclComm_ 0x56443211fe60 on CUDA device:
I0523 13:07:03.039454 7494 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 3] NCCL_DEBUG: N/A
I0523 13:07:03.487334 7491 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.035869 ms
I0523 13:07:03.487444 7492 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 1] ProcessGroupNCCL broadcast unique ID through store took 17.0139 ms
I0523 13:07:03.487512 7493 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 2] ProcessGroupNCCL broadcast unique ID through store took 3.73795 ms
I0523 13:07:03.488173 7494 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 3] ProcessGroupNCCL broadcast unique ID through store took 0.211737 ms
I0523 13:07:03.889570 7492 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 1] ProcessGroupNCCL created ncclComm_ 0x5570ea192570 on CUDA device:
I0523 13:07:03.889585 7493 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 2] ProcessGroupNCCL created ncclComm_ 0x55ba7b5455a0 on CUDA device:
I0523 13:07:03.889576 7491 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL created ncclComm_ 0x563b48338b20 on CUDA device:
I0523 13:07:03.889642 7493 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 2] NCCL_DEBUG: N/A
I0523 13:07:03.889645 7492 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 1] NCCL_DEBUG: N/A
I0523 13:07:03.889657 7491 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 0] NCCL_DEBUG: N/A
I0523 13:07:03.889678 7494 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 3] ProcessGroupNCCL created ncclComm_ 0x564432000160 on CUDA device:
I0523 13:07:03.889797 7494 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 3] NCCL_DEBUG: N/A
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 664, in <module>
[rank0]:     main(args)
[rank0]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 369, in main
[rank0]:     loss, grad_norm = train_one_step(
[rank0]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 164, in train_one_step
[rank0]:     optimizer.step()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank0]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 227, in step
[rank0]:     adamw(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 767, in adamw
[rank0]:     func(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 600, in _multi_tensor_adamw
[rank0]:     exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 136.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 55.14 GiB is allocated by PyTorch, and 960.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank3]: Traceback (most recent call last):
[rank3]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 664, in <module>
[rank3]:     main(args)
[rank3]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 369, in main
[rank3]:     loss, grad_norm = train_one_step(
[rank3]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 164, in train_one_step
[rank3]:     optimizer.step()
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank3]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank3]:     out = func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank3]:     ret = func(self, *args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 227, in step
[rank3]:     adamw(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 767, in adamw
[rank3]:     func(
[rank3]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 600, in _multi_tensor_adamw
[rank3]:     exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank3]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 136.00 MiB. GPU 3 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 55.28 GiB is allocated by PyTorch, and 910.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 664, in <module>
[rank1]:     main(args)
[rank1]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 369, in main
[rank1]:     loss, grad_norm = train_one_step(
[rank1]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 164, in train_one_step
[rank1]:     optimizer.step()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank1]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank1]:     ret = func(self, *args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 227, in step
[rank1]:     adamw(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 767, in adamw
[rank1]:     func(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 600, in _multi_tensor_adamw
[rank1]:     exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank1]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 136.00 MiB.
GPU 1 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 55.28 GiB is allocated by PyTorch, and 952.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
[rank2]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 664, in <module>
[rank2]:     main(args)
[rank2]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 369, in main
[rank2]:     loss, grad_norm = train_one_step(
[rank2]:   File "/public/hy-code/FastVideo-main/fastvideo/train.py", line 164, in train_one_step
[rank2]:     optimizer.step()
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank2]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank2]:     out = func(*args, **kwargs)
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 89, in _use_grad
[rank2]:     ret = func(self, *args, **kwargs)
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 227, in step
[rank2]:     adamw(
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 767, in adamw
[rank2]:     func(
[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/adamw.py", line 600, in _multi_tensor_adamw
[rank2]:     exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank2]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 136.00 MiB. GPU 2 has a total capacity of 63.98 GiB of which 0 bytes is free. Of the allocated memory 55.28 GiB is allocated by PyTorch, and 900.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Steps:   0%|          | 0/2000 [02:46<?, ?it/s]
Traceback (most recent call last):
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/public/hy-code/FastVideo-main/fastvideo/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-23_13:09:51
  host      : BW1000
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7491)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
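The OOM messages above themselves point at PYTORCH_HIP_ALLOC_CONF=expandable_segments:True. A minimal sketch of how one might try that hint, assuming the variable is set before the first HIP allocation in each rank (exporting it in the shell that launches torchrun is equivalent); note it only targets fragmentation, not the underlying optimizer-state footprint estimated earlier:

```python
import os

# Follow the allocator hint printed in the OOM message above.
# This must take effect before the first GPU allocation in the process,
# so either export it before torchrun or set it at the very top of train.py.
os.environ.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")
```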