nohup: ignoring input
start exec
W0606 20:00:51.984000 139812779824960 torch/distributed/run.py:779] 
W0606 20:00:51.984000 139812779824960 torch/distributed/run.py:779] *****************************************
W0606 20:00:51.984000 139812779824960 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0606 20:00:51.984000 139812779824960 torch/distributed/run.py:779] *****************************************
[2025-06-06 20:00:56,241] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-06-06 20:00:56,282] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-06-06 20:00:56,289] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-06-06 20:00:56,346] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-06-06 20:00:56,354] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-06-06 20:00:56,446] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-06-06 20:00:56,453] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-06-06 20:00:56,511] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 06-06 20:00:58 __init__.py:193] Automatically detected platform rocm.
INFO 06-06 20:00:58 __init__.py:193] Automatically detected platform rocm.
INFO 06-06 20:00:58 __init__.py:193] Automatically detected platform rocm.
INFO 06-06 20:00:58 __init__.py:193] Automatically detected platform rocm.
INFO 06-06 20:00:58 __init__.py:193] Automatically detected platform rocm.
INFO 06-06 20:00:58 __init__.py:193] Automatically detected platform rocm.
INFO 06-06 20:00:58 __init__.py:193] Automatically detected platform rocm.
INFO 06-06 20:00:58 __init__.py:193] Automatically detected platform rocm.
Could not load Sliding Tile Attention.Could not load Sliding Tile Attention.

Could not load Sliding Tile Attention.
Could not load Sliding Tile Attention.Could not load Sliding Tile Attention.

Could not load Sliding Tile Attention.Could not load Sliding Tile Attention.

Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 20:01:01.392499 238721 ProcessGroupNCCL.cpp:881] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 8, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0606 20:01:01.392678 238721 ProcessGroupNCCL.cpp:890] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 20:01:01.393101 238724 ProcessGroupNCCL.cpp:881] [PG 0 Rank 3] ProcessGroupNCCL initialization options: size: 8, global rank: 3, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0606 20:01:01.393143 238724 ProcessGroupNCCL.cpp:890] [PG 0 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0606 20:01:01.393261 238721 ProcessGroupNCCL.cpp:881] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 4, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 1
I0606 20:01:01.393301 238721 ProcessGroupNCCL.cpp:890] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0606 20:01:01.393829 238724 ProcessGroupNCCL.cpp:881] [PG 1 Rank 3] ProcessGroupNCCL initialization options: size: 4, global rank: 3, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 1
I0606 20:01:01.393853 238724 ProcessGroupNCCL.cpp:890] [PG 1 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
--> loading model from /public/home/wuxk/code/data
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 20:01:01.542949 238727 ProcessGroupNCCL.cpp:881] [PG 0 Rank 6] ProcessGroupNCCL initialization options: size: 8, global rank: 6, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0606 20:01:01.543148 238727 ProcessGroupNCCL.cpp:890] [PG 0 Rank 6] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0606 20:01:01.543787 238727 ProcessGroupNCCL.cpp:881] [PG 2 Rank 2] ProcessGroupNCCL initialization options: size: 4, global rank: 6, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 2
I0606 20:01:01.543813 238727 ProcessGroupNCCL.cpp:890] [PG 2 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 20:01:01.593428 238722 ProcessGroupNCCL.cpp:881] [PG 0 Rank 1] ProcessGroupNCCL initialization options: size: 8, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0606 20:01:01.593632 238722 ProcessGroupNCCL.cpp:890] [PG 0 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0606 20:01:01.594197 238722 ProcessGroupNCCL.cpp:881] [PG 1 Rank 1] ProcessGroupNCCL initialization options: size: 4, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 1
I0606 20:01:01.594237 238722 ProcessGroupNCCL.cpp:890] [PG 1 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 20:01:01.594931 238728 ProcessGroupNCCL.cpp:881] [PG 0 Rank 7] ProcessGroupNCCL initialization options: size: 8, global rank: 7, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0606 20:01:01.594971 238728 ProcessGroupNCCL.cpp:890] [PG 0 Rank 7] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0606 20:01:01.595727 238728 ProcessGroupNCCL.cpp:881] [PG 2 Rank 3] ProcessGroupNCCL initialization options: size: 4, global rank: 7, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 2
I0606 20:01:01.595752 238728 ProcessGroupNCCL.cpp:890] [PG 2 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 20:01:01.599256 238726 ProcessGroupNCCL.cpp:881] [PG 0 Rank 5] ProcessGroupNCCL initialization options: size: 8, global rank: 5, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0606 20:01:01.599431 238726 ProcessGroupNCCL.cpp:890] [PG 0 Rank 5] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0606 20:01:01.600155 238726 ProcessGroupNCCL.cpp:881] [PG 2 Rank 1] ProcessGroupNCCL initialization options: size: 4, global rank: 5, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 2
I0606 20:01:01.600180 238726 ProcessGroupNCCL.cpp:890] [PG 2 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 20:01:01.926810 238725 ProcessGroupNCCL.cpp:881] [PG 0 Rank 4] ProcessGroupNCCL initialization options: size: 8, global rank: 4, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0606 20:01:01.926949 238725 ProcessGroupNCCL.cpp:890] [PG 0 Rank 4] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0606 20:01:01.927536 238725 ProcessGroupNCCL.cpp:881] [PG 2 Rank 0] ProcessGroupNCCL initialization options: size: 4, global rank: 4, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 2
I0606 20:01:01.927565 238725 ProcessGroupNCCL.cpp:890] [PG 2 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0606 20:01:01.977881 238723 ProcessGroupNCCL.cpp:881] [PG 0 Rank 2] ProcessGroupNCCL initialization options: size: 8, global rank: 2, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0606 20:01:01.978206 238723 ProcessGroupNCCL.cpp:890] [PG 0 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0606 20:01:01.979126 238723 ProcessGroupNCCL.cpp:881] [PG 1 Rank 2] ProcessGroupNCCL initialization options: size: 4, global rank: 2, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 1
I0606 20:01:01.979176 238723 ProcessGroupNCCL.cpp:890] [PG 1 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 1e-05
    maximize: False
    weight_decay: 0.01
)
***** Running training *****
  Num examples = 101
  Dataloader size = 13
  Num Epochs = 1
  Resume training from step 0
  Instantaneous batch size per device = 1
  Total train batch size (w. data & sequence parallel, accumulation) = 2.0
  Gradient Accumulation steps = 1
  Total optimization steps = 8
  Total training parameters per FSDP shard = 1.602626568 B
  Master weight dtype: torch.float32
Steps:   0%|          | 0/8 [00:00<?, ?it/s]I0606 20:02:28.484512 238721 ProcessGroupNCCL.cpp:2086] [PG 1 Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.128021 ms
--> applying fdsp activation checkpointing...
I0606 20:02:29.324895 238722 ProcessGroupNCCL.cpp:2086] [PG 1 Rank 1] ProcessGroupNCCL broadcast unique ID through store took 0.316103 ms
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
I0606 20:02:30.042835 238724 ProcessGroupNCCL.cpp:2086] [PG 1 Rank 3] ProcessGroupNCCL broadcast unique ID through store took 0.339922 ms
I0606 20:02:30.064776 238725 ProcessGroupNCCL.cpp:2086] [PG 2 Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.110931 ms
I0606 20:02:30.065168 238728 ProcessGroupNCCL.cpp:2086] [PG 2 Rank 3] ProcessGroupNCCL broadcast unique ID through store took 333.727 ms
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
I0606 20:02:30.565199 238726 ProcessGroupNCCL.cpp:2086] [PG 2 Rank 1] ProcessGroupNCCL broadcast unique ID through store took 0.190112 ms
I0606 20:02:30.797624 238723 ProcessGroupNCCL.cpp:2086] [PG 1 Rank 2] ProcessGroupNCCL broadcast unique ID through store took 0.255362 ms
I0606 20:02:30.920964 238727 ProcessGroupNCCL.cpp:2086] [PG 2 Rank 2] ProcessGroupNCCL broadcast unique ID through store took 0.307832 ms
I0606 20:02:31.499979 238722 ProcessGroupNCCL.cpp:2195] [PG 1 Rank 1] ProcessGroupNCCL created ncclComm_ 0x5608261e7c10 on CUDA device: 
I0606 20:02:31.500263 238722 ProcessGroupNCCL.cpp:2200] [PG 1 Rank 1] NCCL_DEBUG: N/A
I0606 20:02:31.500800 238723 ProcessGroupNCCL.cpp:2195] [PG 1 Rank 2] ProcessGroupNCCL created ncclComm_ 0x555e2f5d4d50 on CUDA device: 
I0606 20:02:31.500877 238723 ProcessGroupNCCL.cpp:2200] [PG 1 Rank 2] NCCL_DEBUG: N/A
I0606 20:02:31.500881 238721 ProcessGroupNCCL.cpp:2195] [PG 1 Rank 0] ProcessGroupNCCL created ncclComm_ 0x55f1fe660140 on CUDA device:  
I0606 20:02:31.500880 238724 ProcessGroupNCCL.cpp:2195] [PG 1 Rank 3] ProcessGroupNCCL created ncclComm_ 0x559b9af10940 on CUDA device: 
I0606 20:02:31.501081 238721 ProcessGroupNCCL.cpp:2200] [PG 1 Rank 0] NCCL_DEBUG: N/A
I0606 20:02:31.501327 238724 ProcessGroupNCCL.cpp:2200] [PG 1 Rank 3] NCCL_DEBUG: N/A
I0606 20:02:31.700700 238726 ProcessGroupNCCL.cpp:2195] [PG 2 Rank 1] ProcessGroupNCCL created ncclComm_ 0x55af3188cdf0 on CUDA device: 
I0606 20:02:31.700909 238725 ProcessGroupNCCL.cpp:2195] [PG 2 Rank 0] ProcessGroupNCCL created ncclComm_ 0x5623e91ba5e0 on CUDA device: 
I0606 20:02:31.700928 238726 ProcessGroupNCCL.cpp:2200] [PG 2 Rank 1] NCCL_DEBUG: N/A
I0606 20:02:31.700951 238728 ProcessGroupNCCL.cpp:2195] [PG 2 Rank 3] ProcessGroupNCCL created ncclComm_ 0x5629877da120 on CUDA device: 
I0606 20:02:31.700960 238727 ProcessGroupNCCL.cpp:2195] [PG 2 Rank 2] ProcessGroupNCCL created ncclComm_ 0x5629c1d4e410 on CUDA device: 
I0606 20:02:31.701046 238725 ProcessGroupNCCL.cpp:2200] [PG 2 Rank 0] NCCL_DEBUG: N/A
I0606 20:02:31.701257 238728 ProcessGroupNCCL.cpp:2200] [PG 2 Rank 3] NCCL_DEBUG: N/A
I0606 20:02:31.701365 238727 ProcessGroupNCCL.cpp:2200] [PG 2 Rank 2] NCCL_DEBUG: N/A
I0606 20:02:31.727560 238721 ProcessGroupNCCL.cpp:2086] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.123151 ms
I0606 20:02:31.746049 238724 ProcessGroupNCCL.cpp:2086] [PG 0 (default_pg) Rank 3] ProcessGroupNCCL broadcast unique ID through store took 0.233492 ms
I0606 20:02:31.753268 238723 ProcessGroupNCCL.cpp:2086] [PG 0 (default_pg) Rank 2] ProcessGroupNCCL broadcast unique ID through store took 0.0686 ms
I0606 20:02:31.753826 238722 ProcessGroupNCCL.cpp:2086] [PG 0 (default_pg) Rank 1] ProcessGroupNCCL broadcast unique ID through store took 0.193082 ms
I0606 20:02:32.034540 238725 ProcessGroupNCCL.cpp:2086] [PG 0 (default_pg) Rank 4] ProcessGroupNCCL broadcast unique ID through store took 0.335813 ms
I0606 20:02:32.043879 238728 ProcessGroupNCCL.cpp:2086] [PG 0 (default_pg) Rank 7] ProcessGroupNCCL broadcast unique ID through store took 0.206662 ms
I0606 20:02:32.044067 238726 ProcessGroupNCCL.cpp:2086] [PG 0 (default_pg) Rank 5] ProcessGroupNCCL broadcast unique ID through store took 0.07409 ms
I0606 20:02:32.049070 238727 ProcessGroupNCCL.cpp:2086] [PG 0 (default_pg) Rank 6] ProcessGroupNCCL broadcast unique ID through store took 0.219522 ms
I0606 20:02:32.272241 238721 ProcessGroupNCCL.cpp:2195] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL created ncclComm_ 0x55f1fe27b800 on CUDA device:  
I0606 20:02:32.272315 238721 ProcessGroupNCCL.cpp:2200] [PG 0 (default_pg) Rank 0] NCCL_DEBUG: N/A
I0606 20:02:32.272483 238727 ProcessGroupNCCL.cpp:2195] [PG 0 (default_pg) Rank 6] ProcessGroupNCCL created ncclComm_ 0x5629c25446b0 on CUDA device: 
I0606 20:02:32.272498 238722 ProcessGroupNCCL.cpp:2195] [PG 0 (default_pg) Rank 1] ProcessGroupNCCL created ncclComm_ 0x560826c97b00 on CUDA device: 
I0606 20:02:32.272568 238725 ProcessGroupNCCL.cpp:2195] [PG 0 (default_pg) Rank 4] ProcessGroupNCCL created ncclComm_ 0x5623e9cd5740 on CUDA device: 
I0606 20:02:32.272683 238727 ProcessGroupNCCL.cpp:2200] [PG 0 (default_pg) Rank 6] NCCL_DEBUG: N/A
I0606 20:02:32.272775 238722 ProcessGroupNCCL.cpp:2200] [PG 0 (default_pg) Rank 1] NCCL_DEBUG: N/A
I0606 20:02:32.272796 238726 ProcessGroupNCCL.cpp:2195] [PG 0 (default_pg) Rank 5] ProcessGroupNCCL created ncclComm_ 0x55af31090e40 on CUDA device: 
I0606 20:02:32.272856 238725 ProcessGroupNCCL.cpp:2200] [PG 0 (default_pg) Rank 4] NCCL_DEBUG: N/A
I0606 20:02:32.272922 238726 ProcessGroupNCCL.cpp:2200] [PG 0 (default_pg) Rank 5] NCCL_DEBUG: N/A
I0606 20:02:32.273185 238724 ProcessGroupNCCL.cpp:2195] [PG 0 (default_pg) Rank 3] ProcessGroupNCCL created ncclComm_ 0x559b9a5b43f0 on CUDA device: 
I0606 20:02:32.273252 238723 ProcessGroupNCCL.cpp:2195] [PG 0 (default_pg) Rank 2] ProcessGroupNCCL created ncclComm_ 0x555e2ff1d4f0 on CUDA device: 
I0606 20:02:32.273422 238724 ProcessGroupNCCL.cpp:2200] [PG 0 (default_pg) Rank 3] NCCL_DEBUG: N/A
I0606 20:02:32.273631 238723 ProcessGroupNCCL.cpp:2200] [PG 0 (default_pg) Rank 2] NCCL_DEBUG: N/A
I0606 20:02:32.274735 238728 ProcessGroupNCCL.cpp:2195] [PG 0 (default_pg) Rank 7] ProcessGroupNCCL created ncclComm_ 0x56298704f230 on CUDA device: 
I0606 20:02:32.274950 238728 ProcessGroupNCCL.cpp:2200] [PG 0 (default_pg) Rank 7] NCCL_DEBUG: N/A
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:663: UserWarning: Graph break due to unsupported builtin flash_attn_2_cuda.PyCapsule.varlen_fwd. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:663: UserWarning: Graph break due to unsupported builtin flash_attn_2_cuda.PyCapsule.varlen_fwd. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:663: UserWarning: Graph break due to unsupported builtin flash_attn_2_cuda.PyCapsule.varlen_fwd. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:663: UserWarning: Graph break due to unsupported builtin flash_attn_2_cuda.PyCapsule.varlen_fwd. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:663: UserWarning: Graph break due to unsupported builtin flash_attn_2_cuda.PyCapsule.varlen_fwd. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:663: UserWarning: Graph break due to unsupported builtin flash_attn_2_cuda.PyCapsule.varlen_fwd. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:663: UserWarning: Graph break due to unsupported builtin flash_attn_2_cuda.PyCapsule.varlen_fwd. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:663: UserWarning: Graph break due to unsupported builtin flash_attn_2_cuda.PyCapsule.varlen_fwd. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Steps:   0%|          | 0/8 [04:17<?, ?it/s, loss=0.0478, step_time=257.85s, grad_norm=0.113]Steps:  12%|█▎        | 1/8 [04:17<30:04, 257.85s/it, loss=0.0478, step_time=257.85s, grad_norm=0.113]zll step_time: 257.85s avg_step_time: 257.85235619544983
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Steps:  12%|█▎        | 1/8 [06:10<30:04, 257.85s/it, loss=0.0477, step_time=112.95s, grad_norm=0.121]Steps:  25%|██▌       | 2/8 [06:10<17:15, 172.62s/it, loss=0.0477, step_time=112.95s, grad_norm=0.121]zll step_time: 112.95s avg_step_time: 185.40267634391785
Steps:  25%|██▌       | 2/8 [08:02<17:15, 172.62s/it, loss=0.0409, step_time=111.65s, grad_norm=0.0943]Steps:  38%|███▊      | 3/8 [08:02<12:03, 144.78s/it, loss=0.0409, step_time=111.65s, grad_norm=0.0943]zll step_time: 111.65s avg_step_time: 160.8174086411794
Steps:  38%|███▊      | 3/8 [09:53<12:03, 144.78s/it, loss=0.0751, step_time=111.54s, grad_norm=0.0935]Steps:  50%|█████     | 4/8 [09:53<08:46, 131.66s/it, loss=0.0751, step_time=111.54s, grad_norm=0.0935]zll step_time: 111.54s avg_step_time: 148.49758940935135
Steps:  50%|█████     | 4/8 [11:45<08:46, 131.66s/it, loss=0.3197, step_time=111.51s, grad_norm=0.698] Steps:  62%|██████▎   | 5/8 [11:45<06:13, 124.39s/it, loss=0.3197, step_time=111.51s, grad_norm=0.698]zll step_time: 111.51s avg_step_time: 141.09909749031067
Steps:  62%|██████▎   | 5/8 [13:37<06:13, 124.39s/it, loss=0.0901, step_time=111.52s, grad_norm=0.251]Steps:  75%|███████▌  | 6/8 [13:37<04:00, 120.01s/it, loss=0.0901, step_time=111.52s, grad_norm=0.251]zll step_time: 111.52s avg_step_time: 136.16851162910461