W0526 15:40:34.779000 23412721211200 torch/distributed/run.py:779] 
W0526 15:40:34.779000 23412721211200 torch/distributed/run.py:779] *****************************************
W0526 15:40:34.779000 23412721211200 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0526 15:40:34.779000 23412721211200 torch/distributed/run.py:779] *****************************************
INFO 05-26 15:40:38 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 15:40:38 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 15:40:38 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 15:40:38 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 15:40:39 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 15:40:39 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 15:40:39 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 15:40:39 __init__.py:193] Automatically detected platform rocm.
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0526 15:40:40.336937 89349 ProcessGroupNCCL.cpp:869] [PG 0 Rank 3] ProcessGroupNCCL initialization options: size: 8, global rank: 3, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0526 15:40:40.336980 89349 ProcessGroupNCCL.cpp:878] [PG 0 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0526 15:40:40.337304 89349 ProcessGroupNCCL.cpp:869] [PG 1 Rank 3] ProcessGroupNCCL initialization options: size: 4, global rank: 3, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 1
I0526 15:40:40.337319 89349 ProcessGroupNCCL.cpp:878] [PG 1 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0526 15:40:40.390194 89348 ProcessGroupNCCL.cpp:869] [PG 0 Rank 2] ProcessGroupNCCL initialization options: size: 8, global rank: 2, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0526 15:40:40.390235 89348 ProcessGroupNCCL.cpp:878] [PG 0 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0526 15:40:40.390537 89348 ProcessGroupNCCL.cpp:869] [PG 1 Rank 2] ProcessGroupNCCL initialization options: size: 4, global rank: 2, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 1
I0526 15:40:40.390552 89348 ProcessGroupNCCL.cpp:878] [PG 1 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0526 15:40:40.468447 89351 ProcessGroupNCCL.cpp:869] [PG 0 Rank 5] ProcessGroupNCCL initialization options: size: 8, global rank: 5, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0526 15:40:40.468519 89351 ProcessGroupNCCL.cpp:878] [PG 0 Rank 5] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0526 15:40:40.469101 89351 ProcessGroupNCCL.cpp:869] [PG 2 Rank 1] ProcessGroupNCCL initialization options: size: 4, global rank: 5, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 2
I0526 15:40:40.469116 89351 ProcessGroupNCCL.cpp:878] [PG 2 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0526 15:40:41.540874 89352 ProcessGroupNCCL.cpp:869] [PG 0 Rank 6] ProcessGroupNCCL initialization options: size: 8, global rank: 6, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0526 15:40:41.540922 89352 ProcessGroupNCCL.cpp:878] [PG 0 Rank 6] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0526 15:40:41.541432 89352 ProcessGroupNCCL.cpp:869] [PG 2 Rank 2] ProcessGroupNCCL initialization options: size: 4, global rank: 6, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 2
I0526 15:40:41.541446 89352 ProcessGroupNCCL.cpp:878] [PG 2 Rank 2] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0526 15:40:41.548353 89346 ProcessGroupNCCL.cpp:869] [PG 0 Rank 0] ProcessGroupNCCL initialization options: size: 8, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0526 15:40:41.548408 89346 ProcessGroupNCCL.cpp:878] [PG 0 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0526 15:40:41.548835 89346 ProcessGroupNCCL.cpp:869] [PG 1 Rank 0] ProcessGroupNCCL initialization options: size: 4, global rank: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 1
I0526 15:40:41.548847 89346 ProcessGroupNCCL.cpp:878] [PG 1 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
--> loading model from /public/model/HunyuanVideo/hunyuan-video-t2v-720p
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0526 15:40:41.837021 89353 ProcessGroupNCCL.cpp:869] [PG 0 Rank 7] ProcessGroupNCCL initialization options: size: 8, global rank: 7, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0526 15:40:41.837060 89353 ProcessGroupNCCL.cpp:878] [PG 0 Rank 7] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0526 15:40:41.837443 89353 ProcessGroupNCCL.cpp:869] [PG 2 Rank 3] ProcessGroupNCCL initialization options: size: 4, global rank: 7, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 2
I0526 15:40:41.837458 89353 ProcessGroupNCCL.cpp:878] [PG 2 Rank 3] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0526 15:40:41.848428 89347 ProcessGroupNCCL.cpp:869] [PG 0 Rank 1] ProcessGroupNCCL initialization options: size: 8, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0526 15:40:41.848471 89347 ProcessGroupNCCL.cpp:878] [PG 0 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0526 15:40:41.848771 89347 ProcessGroupNCCL.cpp:869] [PG 1 Rank 1] ProcessGroupNCCL initialization options: size: 4, global rank: 1, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 1
I0526 15:40:41.848786 89347 ProcessGroupNCCL.cpp:878] [PG 1 Rank 1] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0526 15:40:42.108510 89350 ProcessGroupNCCL.cpp:869] [PG 0 Rank 4] ProcessGroupNCCL initialization options: size: 8, global rank: 4, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 0
I0526 15:40:42.108565 89350 ProcessGroupNCCL.cpp:878] [PG 0 Rank 4] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
I0526 15:40:42.109011 89350 ProcessGroupNCCL.cpp:869] [PG 2 Rank 0] ProcessGroupNCCL initialization options: size: 4, global rank: 4, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, PG Name: 2
I0526 15:40:42.109025 89350 ProcessGroupNCCL.cpp:878] [PG 2 Rank 0] ProcessGroupNCCL environments: NCCL version: 2.18.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
  Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> applying fsdp activation checkpointing...
--> model loaded
--> applying fsdp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 1e-05
    maximize: False
    weight_decay: 0.01
)
***** Running training *****
  Num examples = 101
  Dataloader size = 13
  Num Epochs = 39
  Resume training from step 0
  Instantaneous batch size per device = 1
  Total train batch size (w. data & sequence parallel, accumulation) = 2.0
  Gradient Accumulation steps = 1
  Total optimization steps = 2000
  Total training parameters per FSDP shard = 1.602626568 B
  Master weight dtype: torch.float32

Steps:   0%|          | 0/2000 [00:00<?, ?it/s]--> applying fsdp activation checkpointing...
I0526 15:42:10.140130 89346 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.106799 ms
I0526 15:42:10.140599 89349 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 3] ProcessGroupNCCL broadcast unique ID through store took 4345.63 ms
I0526 15:42:10.140630 89348 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 2] ProcessGroupNCCL broadcast unique ID through store took 4589.22 ms
--> applying fsdp activation checkpointing...
I0526 15:42:10.436164 89350 ProcessGroupNCCL.cpp:2074] [PG 2 Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.197698 ms
I0526 15:42:10.436353 89353 ProcessGroupNCCL.cpp:2074] [PG 2 Rank 3] ProcessGroupNCCL broadcast unique ID through store took 1017.89 ms
--> applying fsdp activation checkpointing...
I0526 15:42:10.834331 89351 ProcessGroupNCCL.cpp:2074] [PG 2 Rank 1] ProcessGroupNCCL broadcast unique ID through store took 0.286167 ms
I0526 15:42:10.932520 89352 ProcessGroupNCCL.cpp:2074] [PG 2 Rank 2] ProcessGroupNCCL broadcast unique ID through store took 0.325397 ms
--> applying fsdp activation checkpointing...
I0526 15:42:11.800187 89350 ProcessGroupNCCL.cpp:2183] [PG 2 Rank 0] ProcessGroupNCCL created ncclComm_ 0x5606a55d29d0 on CUDA device: 
I0526 15:42:11.800220 89353 ProcessGroupNCCL.cpp:2183] [PG 2 Rank 3] ProcessGroupNCCL created ncclComm_ 0x563e3849eaf0 on CUDA device: 
I0526 15:42:11.800222 89352 ProcessGroupNCCL.cpp:2183] [PG 2 Rank 2] ProcessGroupNCCL created ncclComm_ 0x5571303a5800 on CUDA device: 
I0526 15:42:11.800279 89350 ProcessGroupNCCL.cpp:2188] [PG 2 Rank 0] NCCL_DEBUG: N/A
I0526 15:42:11.800294 89353 ProcessGroupNCCL.cpp:2188] [PG 2 Rank 3] NCCL_DEBUG: N/A
I0526 15:42:11.800345 89352 ProcessGroupNCCL.cpp:2188] [PG 2 Rank 2] NCCL_DEBUG: N/A
I0526 15:42:11.801318 89351 ProcessGroupNCCL.cpp:2183] [PG 2 Rank 1] ProcessGroupNCCL created ncclComm_ 0x555ee2337e90 on CUDA device: 
I0526 15:42:11.801399 89351 ProcessGroupNCCL.cpp:2188] [PG 2 Rank 1] NCCL_DEBUG: N/A
I0526 15:42:11.974504 89347 ProcessGroupNCCL.cpp:2074] [PG 1 Rank 1] ProcessGroupNCCL broadcast unique ID through store took 0.300847 ms
I0526 15:42:12.899058 89348 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 2] ProcessGroupNCCL created ncclComm_ 0x557f562a2620 on CUDA device: 
I0526 15:42:12.899091 89346 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 0] ProcessGroupNCCL created ncclComm_ 0x55b6f5928900 on CUDA device: 
I0526 15:42:12.899174 89348 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 2] NCCL_DEBUG: N/A
I0526 15:42:12.899205 89346 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 0] NCCL_DEBUG: N/A
I0526 15:42:12.899204 89349 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 3] ProcessGroupNCCL created ncclComm_ 0x55de54e073d0 on CUDA device: 
I0526 15:42:12.899271 89349 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 3] NCCL_DEBUG: N/A
I0526 15:42:12.899235 89347 ProcessGroupNCCL.cpp:2183] [PG 1 Rank 1] ProcessGroupNCCL created ncclComm_ 0x55c981004470 on CUDA device: 
I0526 15:42:12.899341 89347 ProcessGroupNCCL.cpp:2188] [PG 1 Rank 1] NCCL_DEBUG: N/A
I0526 15:42:13.361822 89346 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL broadcast unique ID through store took 0.086089 ms
I0526 15:42:13.362102 89351 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 5] ProcessGroupNCCL broadcast unique ID through store took 1030.98 ms
I0526 15:42:13.362121 89350 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 4] ProcessGroupNCCL broadcast unique ID through store took 1027.7 ms
I0526 15:42:13.362154 89352 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 6] ProcessGroupNCCL broadcast unique ID through store took 1020.04 ms
I0526 15:42:13.362164 89353 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 7] ProcessGroupNCCL broadcast unique ID through store took 1020.44 ms
I0526 15:42:13.362969 89347 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 1] ProcessGroupNCCL broadcast unique ID through store took 0.251227 ms
I0526 15:42:13.365820 89349 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 3] ProcessGroupNCCL broadcast unique ID through store took 0.212888 ms
I0526 15:42:13.370061 89348 ProcessGroupNCCL.cpp:2074] [PG 0 (default_pg) Rank 2] ProcessGroupNCCL broadcast unique ID through store took 0.133879 ms
I0526 15:42:13.855226 89346 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL created ncclComm_ 0x55b6f608fa50 on CUDA device: 
I0526 15:42:13.855289 89346 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 0] NCCL_DEBUG: N/A
I0526 15:42:13.855336 89349 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 3] ProcessGroupNCCL created ncclComm_ 0x55de54db6c00 on CUDA device: 
I0526 15:42:13.855326 89350 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 4] ProcessGroupNCCL created ncclComm_ 0x5606a5797450 on CUDA device: 
I0526 15:42:13.855337 89352 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 6] ProcessGroupNCCL created ncclComm_ 0x55713060ba10 on CUDA device: 
I0526 15:42:13.855357 89353 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 7] ProcessGroupNCCL created ncclComm_ 0x563e38753a30 on CUDA device: 
I0526 15:42:13.855398 89349 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 3] NCCL_DEBUG: N/A
I0526 15:42:13.855408 89351 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 5] ProcessGroupNCCL created ncclComm_ 0x555ee276b710 on CUDA device: 
I0526 15:42:13.855451 89348 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 2] ProcessGroupNCCL created ncclComm_ 0x557f565420d0 on CUDA device: 
I0526 15:42:13.855479 89347 ProcessGroupNCCL.cpp:2183] [PG 0 (default_pg) Rank 1] ProcessGroupNCCL created ncclComm_ 0x55c980ebde00 on CUDA device: 
I0526 15:42:13.856510 89350 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 4] NCCL_DEBUG: N/A
I0526 15:42:13.856552 89352 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 6] NCCL_DEBUG: N/A
I0526 15:42:13.856582 89353 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 7] NCCL_DEBUG: N/A
I0526 15:42:13.856609 89351 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 5] NCCL_DEBUG: N/A
I0526 15:42:13.856634 89348 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 2] NCCL_DEBUG: N/A
I0526 15:42:13.856678 89347 ProcessGroupNCCL.cpp:2188] [PG 0 (default_pg) Rank 1] NCCL_DEBUG: N/A
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]

Steps:   0%|          | 0/2000 [02:49<?, ?it/s, loss=0.0478, step_time=169.26s, grad_norm=0.111]
Steps:   0%|          | 1/2000 [02:49<93:59:13, 169.26s/it, loss=0.0478, step_time=169.26s, grad_norm=0.111]W0526 15:45:34.268000 23412721211200 torch/distributed/elastic/agent/server/api.py:688] Received Signals.SIGINT death signal, shutting down workers
W0526 15:45:34.271000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89346 closing signal SIGINT
W0526 15:45:34.271000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89347 closing signal SIGINT
W0526 15:45:34.271000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89348 closing signal SIGINT
W0526 15:45:34.271000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89349 closing signal SIGINT
W0526 15:45:34.271000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89350 closing signal SIGINT
W0526 15:45:34.271000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89351 closing signal SIGINT
W0526 15:45:34.271000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89352 closing signal SIGINT
W0526 15:45:34.272000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89353 closing signal SIGINT
W0526 15:45:34.440000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89346 closing signal SIGTERM
W0526 15:45:34.440000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89347 closing signal SIGTERM
W0526 15:45:34.441000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89348 closing signal SIGTERM
W0526 15:45:34.442000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89349 closing signal SIGTERM
W0526 15:45:34.443000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89350 closing signal SIGTERM
W0526 15:45:34.444000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89351 closing signal SIGTERM
W0526 15:45:34.445000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89352 closing signal SIGTERM
W0526 15:45:34.446000 23412721211200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 89353 closing signal SIGTERM
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 835, in _invoke_run
    time.sleep(monitor_interval)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 89279 got signal: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 689, in run
    self._shutdown(e.sigval)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 347, in _shutdown
    self._pcontext.close(death_sig)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 544, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 868, in _close
    handler.proc.wait(time_to_wait)
  File "/usr/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait
    time.sleep(delay)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 89279 got signal: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 694, in run
    self._shutdown()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 347, in _shutdown
    self._pcontext.close(death_sig)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 544, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 868, in _close
    handler.proc.wait(time_to_wait)
  File "/usr/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait
    time.sleep(delay)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 89279 got signal: 2