INFO 05-26 18:09:07 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 18:09:07 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 18:09:08 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 18:09:08 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 18:09:08 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 18:09:08 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 18:09:08 __init__.py:193] Automatically detected platform rocm.
INFO 05-26 18:09:08 __init__.py:193] Automatically detected platform rocm.
Could not load Sliding Tile Attention.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
--> loading model from /public/model/HunyuanVideo/hunyuan-video-t2v-720p
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Could not load Sliding Tile Attention.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Total training parameters = 12821.012544 M
--> Initializing FSDP with sharding strategy: full
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
--> applying fdsp activation checkpointing...
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 1e-05
    maximize: False
    weight_decay: 0.01
)
***** Running training *****
  Num examples = 101
  Dataloader size = 13
  Num Epochs = 1
  Resume training from step 0
  Instantaneous batch size per device = 1
  Total train batch size (w. data & sequence parallel, accumulation) = 2.0
  Gradient Accumulation steps = 1
  Total optimization steps = 10
  Total training parameters per FSDP shard = 1.602626568 B
  Master weight dtype: torch.float32
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
--> applying fdsp activation checkpointing...
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name  Self CPU %  Self CPU  CPU total %  CPU total  CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg  # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
hipDeviceSynchronize  76.58%  118.974s  76.58%  118.974s  165.013ms  0.000us  0.00%  40.200ms  55.757us  721
hipMemcpyWithStream  21.21%  32.945s  21.21%  32.946s  65.892ms  0.000us  0.00%  1.050s  2.101ms  500
hipLaunchKernel  0.21%  321.844ms  0.21%  321.860ms  14.085us  0.000us  0.00%  0.000us  0.000us  22852
aten::copy_  0.15%  235.485ms  20.84%  32.371s  3.291ms  6.360s  2.61%  7.409s  753.153us  9837
MulBackward0  0.15%  226.434ms  17.95%  27.884s  34.638ms  0.000us  0.00%  42.292s  52.536ms  805
SeqAllToAll4D  0.13%  202.225ms  76.93%  119.512s  165.988ms  0.000us  0.00%  3.318s  4.608ms  720
FullyShardedDataParallel.forward  0.11%  174.704ms  52.72%  81.911s  1.343s  0.000us  0.00%
85.131s 1.396s 61 record_param_comms 0.11% 172.360ms 0.17% 261.078ms 126.860us 4.522s 1.85% 4.524s 2.198ms 2058 aten::mul 0.10% 155.101ms 0.12% 188.380ms 51.781us 3.230s 1.32% 3.230s 887.938us 3638 aten::empty_strided 0.08% 129.408ms 0.08% 129.772ms 17.449us 6.560us 0.00% 6.560us 0.001us 7437 aten::cat 0.08% 120.499ms 0.10% 152.213ms 101.476us 2.208s 0.90% 2.208s 1.472ms 1500 aten::addmm 0.05% 83.071ms 0.07% 105.724ms 159.464us 40.124s 16.44% 40.124s 60.519ms 663 hipStreamWaitEvent 0.05% 81.237ms 0.05% 81.237ms 26.197us 9.851ms 0.00% 9.851ms 3.177us 3101 aten::empty 0.05% 70.876ms 0.05% 70.876ms 13.935us 0.000us 0.00% 0.000us 0.000us 5086 aten::sum 0.03% 53.225ms 0.05% 76.205ms 59.815us 517.290ms 0.21% 517.969ms 406.569us 1274 FullyShardedDataParallel._pre_forward 0.03% 50.109ms 0.05% 81.190ms 1.331ms 0.000us 0.00% 717.366ms 11.760ms 61 FullyShardedDataParallel._post_backward_hook 0.03% 46.753ms 0.06% 88.199ms 1.446ms 0.000us 0.00% 878.337ms 14.399ms 61 aten::_to_copy 0.03% 45.962ms 20.82% 32.349s 5.311ms 0.000us 0.00% 4.710s 773.292us 6091 aten::add 0.03% 42.503ms 0.03% 52.484ms 42.566us 777.310ms 0.32% 777.310ms 630.421us 1233 autograd::engine::evaluate_function: ToCopyBackward0... 
0.03% 40.298ms 0.10% 148.345ms 83.387us 0.000us 0.00% 1.028s 578.057us 1779 aten::view 0.03% 39.961ms 0.03% 39.961ms 4.237us 0.000us 0.00% 0.000us 0.000us 9432 FullyShardedDataParallel._pre_backward_prefetch 0.02% 36.299ms 0.04% 56.513ms 926.441us 0.000us 0.00% 452.591ms 7.420ms 61 aten::mm 0.02% 35.884ms 0.03% 45.933ms 67.548us 2.973s 1.22% 2.973s 4.372ms 680 aten::cos 0.02% 33.720ms 0.02% 33.739ms 5.623ms 25.600us 0.00% 25.600us 4.267us 6 _AllGather 0.02% 33.579ms 0.07% 110.589ms 921.576us 0.000us 0.00% 17.039ms 141.989us 120 aten::slice 0.02% 31.942ms 0.02% 38.737ms 5.265us 0.000us 0.00% 0.000us 0.000us 7357 FullyShardedDataParallel._post_forward 0.02% 29.671ms 0.02% 30.331ms 497.225us 0.000us 0.00% 0.000us 0.000us 61 autograd::engine::evaluate_function: SiluBackward0 0.02% 28.523ms 0.02% 36.356ms 403.956us 0.000us 0.00% 921.109us 10.235us 90 aten::pow 0.02% 28.372ms 0.02% 37.925ms 58.346us 146.517ms 0.06% 301.740ms 464.215us 650 c10d::alltoall_base_ 0.02% 27.667ms 0.11% 168.313ms 233.768us 0.000us 0.00% 2.595s 3.604ms 720 FlashAttnVarlenQKVPackedFunc 0.02% 25.339ms 0.02% 36.710ms 300.899us 30.111s 12.34% 30.111s 246.814ms 122 aten::as_strided 0.02% 24.897ms 0.02% 24.897ms 1.272us 0.000us 0.00% 0.000us 0.000us 19575 aten::fill_ 0.01% 22.343ms 0.02% 36.246ms 21.967us 436.593ms 0.18% 436.593ms 264.602us 1650 aten::native_layer_norm 0.01% 21.351ms 0.03% 46.493ms 189.767us 174.228ms 0.07% 1.031s 4.206ms 245 hipExtModuleLaunchKernel 0.01% 20.861ms 0.01% 20.861ms 10.676us 0.000us 0.00% 0.000us 0.000us 1954 autograd::engine::evaluate_function: MulBackward0 0.01% 20.553ms 17.99% 27.950s 34.721ms 0.000us 0.00% 42.754s 53.111ms 805 aten::sin 0.01% 20.283ms 0.01% 20.302ms 3.384ms 23.520us 0.00% 23.520us 3.920us 6 hipMemcpyAsync 0.01% 20.256ms 0.01% 20.256ms 9.627us 0.000us 0.00% 0.000us 0.000us 2104 aten::rsqrt 0.01% 19.167ms 0.01% 22.526ms 70.394us 1.698ms 0.00% 1.698ms 5.305us 320 aten::transpose 0.01% 18.387ms 0.02% 27.205ms 6.986us 0.000us 0.00% 0.000us 0.000us 3894 
hipExtLaunchKernel 0.01% 17.916ms 0.01% 17.916ms 17.411us 0.000us 0.00% 0.000us 0.000us 1029 aten::reshape 0.01% 17.545ms 0.04% 55.181ms 8.237us 0.000us 0.00% 119.774ms 17.879us 6699 aten::mean 0.01% 16.671ms 0.01% 19.357ms 60.303us 433.547ms 0.18% 433.547ms 1.351ms 321 IndexFirstAxis 0.01% 15.717ms 0.02% 27.411ms 224.684us 0.000us 0.00% 265.642ms 2.177ms 122 FlashAttnVarlenQKVPackedFuncBackward 0.01% 15.556ms 0.01% 23.249ms 374.986us 61.918s 25.37% 61.918s 998.675ms 62 autograd::engine::evaluate_function: SplitWithSizesB... 0.01% 15.433ms 0.02% 29.433ms 182.813us 0.000us 0.00% 324.383ms 2.015ms 161 SeqAllToAll4DBackward 0.01% 15.349ms 54.84% 85.194s 354.976ms 0.000us 0.00% 1.537s 6.404ms 240 autograd::engine::evaluate_function: ViewBackward0 0.01% 14.811ms 0.02% 30.491ms 14.709us 0.000us 0.00% 40.168ms 19.377us 2073 aten::linear 0.01% 14.367ms 0.22% 336.162ms 253.516us 0.000us 0.00% 80.677s 60.842ms 1326 aten::silu 0.01% 14.162ms 0.01% 17.589ms 103.464us 132.557ms 0.05% 132.557ms 779.747us 170 aten::select 0.01% 13.645ms 0.01% 16.777ms 6.904us 0.000us 0.00% 0.000us 0.000us 2430 aten::narrow 0.01% 13.135ms 0.02% 30.961ms 9.256us 0.000us 0.00% 0.000us 0.000us 3345 detach 0.01% 13.019ms 0.01% 13.019ms 1.908us 0.000us 0.00% 0.000us 0.000us 6823 aten::to 0.01% 12.577ms 20.83% 32.361s 4.365ms 0.000us 0.00% 4.710s 635.387us 7413 aten::nonzero 0.01% 12.201ms 0.48% 745.675ms 6.062ms 11.748ms 0.00% 13.550ms 110.166us 123 AddmmBackward0 0.01% 11.831ms 0.04% 63.984ms 186.543us 0.000us 0.00% 2.973s 8.668ms 343 autograd::engine::evaluate_function: SliceBackward0 0.01% 10.648ms 0.06% 92.241ms 75.484us 0.000us 0.00% 841.479ms 688.608us 1222 FullyShardedDataParallel._pre_backward_hook 0.01% 10.582ms 0.04% 69.342ms 1.137ms 0.000us 0.00% 452.591ms 7.420ms 61 aten::empty_like 0.01% 10.506ms 0.04% 57.369ms 19.688us 0.000us 0.00% 6.560us 0.002us 2914 aten::clone 0.01% 10.503ms 0.12% 180.750ms 110.415us 0.000us 0.00% 2.091s 1.277ms 1637 aten::add_ 0.01% 10.417ms 0.01% 14.849ms 21.677us 
288.245ms 0.12% 288.245ms 420.796us 685 aten::gelu 0.01% 9.956ms 0.01% 12.336ms 77.099us 680.536ms 0.28% 680.536ms 4.253ms 160 autograd::engine::evaluate_function: AddBackward0 0.01% 9.768ms 0.05% 80.794ms 124.489us 0.000us 0.00% 486.679ms 749.890us 649 aten::neg 0.01% 9.701ms 0.01% 12.563ms 34.898us 273.027ms 0.11% 273.027ms 758.407us 360 autograd::engine::evaluate_function: torch::autograd... 0.01% 9.562ms 0.06% 98.799ms 1.620ms 0.000us 0.00% 878.337ms 14.399ms 61 aten::unsqueeze 0.01% 8.920ms 0.01% 10.585ms 6.197us 0.000us 0.00% 0.000us 0.000us 1708 hipMemsetAsync 0.01% 8.619ms 0.01% 8.619ms 11.507us 0.000us 0.00% 0.000us 0.000us 749 IndexFirstAxisBackward 0.01% 8.025ms 0.01% 16.930ms 273.072us 0.000us 0.00% 171.731ms 2.770ms 62 aten::detach 0.01% 7.961ms 0.01% 20.980ms 3.075us 0.000us 0.00% 0.000us 0.000us 6823 c10d::allgather_ 0.00% 7.665ms 0.04% 57.482ms 479.017us 0.000us 0.00% 14.589ms 121.573us 120 autograd::engine::evaluate_function: AddmmBackward0 0.00% 7.601ms 0.06% 90.965ms 265.203us 0.000us 0.00% 3.090s 9.010ms 343 aten::t 0.00% 7.215ms 0.01% 16.056ms 7.461us 0.000us 0.00% 0.000us 0.000us 2152 IndexPutFirstAxis 0.00% 7.208ms 0.01% 20.969ms 171.876us 0.000us 0.00% 153.096ms 1.255ms 122 aten::gather 0.00% 7.143ms 0.01% 8.642ms 70.833us 265.642ms 0.11% 265.642ms 2.177ms 122 FullyShardedDataParallel._post_backward_prefetch 0.00% 6.908ms 0.00% 6.908ms 113.241us 0.000us 0.00% 0.000us 0.000us 61 aten::slice_backward 0.00% 6.787ms 0.05% 75.534ms 61.811us 0.000us 0.00% 770.503ms 630.526us 1222 aten::stack 0.00% 6.772ms 0.03% 52.568ms 92.876us 0.000us 0.00% 1.039s 1.836ms 566 aten::zero_ 0.00% 6.345ms 0.02% 35.437ms 25.204us 0.000us 0.00% 435.626ms 309.833us 1406 autograd::engine::evaluate_function: SeqAllToAll4DBa... 
0.00% 6.276ms 54.84% 85.201s 355.002ms 0.000us 0.00% 1.537s 6.404ms 240 aten::expand 0.00% 6.104ms 0.00% 7.545ms 6.954us 0.000us 0.00% 0.000us 0.000us 1085 aten::split_with_sizes 0.00% 5.990ms 0.00% 6.991ms 21.779us 0.000us 0.00% 0.000us 0.000us 321 hipEventDestroy 0.00% 5.685ms 0.00% 5.686ms 1.764us 1.127s 0.46% 1.127s 349.816us 3223 aten::cumsum 0.00% 5.620ms 0.00% 6.884ms 56.423us 779.680us 0.00% 779.680us 6.391us 122 ToCopyBackward0 0.00% 5.059ms 0.05% 80.814ms 45.427us 0.000us 0.00% 927.226ms 521.206us 1779 aten::zeros 0.00% 5.050ms 0.04% 55.326ms 39.350us 0.000us 0.00% 435.626ms 309.833us 1406 FullyShardedDataParallel.rate_limiter 0.00% 5.011ms 0.00% 5.599ms 46.271us 0.000us 0.00% 7.676ms 63.437us 121 FullyShardedDataParallel._pre_forward_prefetch 0.00% 4.980ms 0.00% 4.980ms 81.637us 0.000us 0.00% 0.000us 0.000us 61 IndexPutFirstAxisBackward 0.00% 4.931ms 0.01% 10.321ms 166.472us 0.000us 0.00% 55.949ms 902.398us 62 aten::div 0.00% 4.881ms 0.00% 6.414ms 37.730us 160.692ms 0.07% 160.692ms 945.244us 170 ViewBackward0 0.00% 4.842ms 0.01% 15.680ms 7.564us 0.000us 0.00% 40.168ms 19.377us 2073 aten::unbind 0.00% 4.776ms 0.01% 10.900ms 27.048us 0.000us 0.00% 0.000us 0.000us 403 NativeLayerNormBackward0 0.00% 4.604ms 0.01% 13.835ms 110.676us 0.000us 0.00% 571.869ms 4.575ms 125 aten::max 0.00% 4.598ms 0.00% 6.990ms 57.299us 1.417ms 0.00% 1.417ms 11.613us 122 aten::index 0.00% 4.278ms 0.00% 5.475ms 85.548us 55.955ms 0.02% 55.955ms 874.301us 64 aten::split 0.00% 4.019ms 0.01% 13.820ms 61.697us 0.000us 0.00% 0.000us 0.000us 224 aten::_index_put_impl_ 0.00% 3.907ms 0.00% 5.549ms 45.485us 107.198ms 0.04% 107.198ms 878.668us 122 _AllGatherBackward 0.00% 3.834ms 0.00% 4.827ms 80.452us 0.000us 0.00% 0.000us 0.000us 60 c10d::_allgather_base_ 0.00% 3.668ms 0.02% 26.596ms 219.798us 0.000us 0.00% 1.139s 9.413ms 121 PowBackward0 0.00% 3.621ms 0.01% 19.338ms 120.114us 0.000us 0.00% 346.505ms 2.152ms 161 aten::layer_norm 0.00% 3.460ms 0.07% 112.644ms 229.886us 0.000us 0.00% 2.369s 
4.834ms 490 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 155.358s Self CUDA time total: 244.106s ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ hipDeviceSynchronize 76.70% 119.172s 76.70% 119.172s 165.288ms 0.000us 0.00% 309.571ms 429.363us 721 hipMemcpyWithStream 21.18% 32.909s 21.18% 32.910s 65.820ms 0.000us 0.00% 16.350ms 32.701us 500 hipLaunchKernel 0.20% 314.820ms 0.20% 314.833ms 13.777us 0.000us 0.00% 844.479us 0.037us 22852 aten::copy_ 0.14% 224.705ms 20.76% 32.258s 3.279ms 6.255s 2.55% 6.270s 637.383us 9837 MulBackward0 0.14% 213.461ms 17.95% 27.893s 34.650ms 0.000us 0.00% 42.421s 52.697ms 805 SeqAllToAll4D 0.13% 198.270ms 77.04% 119.695s 166.242ms 0.000us 0.00% 3.587s 4.982ms 720 FullyShardedDataParallel.forward 0.11% 172.803ms 52.71% 81.894s 1.343s 0.000us 0.00% 83.661s 1.371s 61 record_param_comms 0.10% 161.984ms 0.16% 247.984ms 120.497us 4.907s 2.00% 4.910s 2.386ms 2058 aten::mul 0.09% 140.184ms 0.11% 172.023ms 47.285us 3.193s 1.30% 3.193s 877.686us 3638 aten::empty_strided 0.08% 121.223ms 0.08% 121.228ms 16.301us 0.000us 0.00% 0.000us 0.000us 7437 aten::cat 0.07% 115.521ms 0.09% 142.711ms 95.141us 2.206s 0.90% 2.206s 1.471ms 1500 aten::addmm 0.05% 80.698ms 0.07% 102.780ms 155.023us 40.313s 16.44% 40.313s 60.803ms 663 hipStreamWaitEvent 0.05% 79.904ms 0.05% 79.904ms 25.767us 10.963ms 0.00% 10.963ms 3.535us 3101 aten::empty 0.04% 67.260ms 0.04% 67.266ms 
13.226us 0.000us 0.00% 0.000us 0.000us 5086 aten::sum 0.03% 50.065ms 0.05% 72.061ms 56.563us 515.037ms 0.21% 515.694ms 404.783us 1274 FullyShardedDataParallel._pre_forward 0.03% 49.801ms 0.05% 80.564ms 1.321ms 0.000us 0.00% 800.176ms 13.118ms 61 aten::_to_copy 0.03% 45.410ms 20.75% 32.233s 5.292ms 0.000us 0.00% 3.568s 585.729us 6091 FullyShardedDataParallel._post_backward_hook 0.03% 44.106ms 0.05% 78.690ms 1.290ms 0.000us 0.00% 1.170s 19.180ms 61 aten::add 0.03% 39.851ms 0.03% 49.409ms 40.072us 786.897ms 0.32% 786.897ms 638.197us 1233 aten::view 0.02% 38.811ms 0.02% 38.811ms 4.115us 0.000us 0.00% 0.000us 0.000us 9432 autograd::engine::evaluate_function: ToCopyBackward0... 0.02% 38.638ms 0.09% 140.876ms 79.188us 0.000us 0.00% 1.056s 593.779us 1779 aten::mm 0.02% 35.017ms 0.03% 45.248ms 66.541us 2.957s 1.21% 2.957s 4.348ms 680 _AllGather 0.02% 31.668ms 0.07% 104.553ms 871.271us 0.000us 0.00% 42.081ms 350.671us 120 aten::slice 0.02% 31.545ms 0.02% 38.357ms 5.214us 0.000us 0.00% 0.000us 0.000us 7357 aten::cos 0.02% 30.821ms 0.02% 30.857ms 5.143ms 23.200us 0.00% 31.840us 5.307us 6 FullyShardedDataParallel._post_forward 0.02% 29.169ms 0.02% 29.767ms 487.990us 0.000us 0.00% 0.000us 0.000us 61 autograd::engine::evaluate_function: SiluBackward0 0.02% 28.242ms 0.02% 35.657ms 396.194us 0.000us 0.00% 915.834us 10.176us 90 aten::pow 0.02% 27.303ms 0.02% 36.593ms 56.297us 146.989ms 0.06% 302.194ms 464.913us 650 c10d::alltoall_base_ 0.02% 27.262ms 0.10% 162.251ms 225.349us 0.000us 0.00% 2.594s 3.603ms 720 FullyShardedDataParallel._pre_backward_prefetch 0.02% 26.069ms 0.03% 43.403ms 711.528us 0.000us 0.00% 508.834ms 8.342ms 61 FlashAttnVarlenQKVPackedFunc 0.02% 24.705ms 0.02% 35.672ms 292.393us 30.109s 12.28% 30.109s 246.792ms 122 aten::as_strided 0.02% 24.254ms 0.02% 24.254ms 1.239us 0.000us 0.00% 0.000us 0.000us 19575 hipExtModuleLaunchKernel 0.01% 20.792ms 0.01% 20.792ms 10.641us 0.000us 0.00% 0.000us 0.000us 1954 aten::native_layer_norm 0.01% 20.762ms 0.03% 44.724ms 182.545us 
156.144ms 0.06% 1.013s 4.136ms 245 aten::fill_ 0.01% 20.191ms 0.02% 33.384ms 20.233us 456.517ms 0.19% 456.517ms 276.677us 1650 autograd::engine::evaluate_function: MulBackward0 0.01% 19.448ms 17.99% 27.957s 34.729ms 0.000us 0.00% 42.883s 53.270ms 805 hipMemcpyAsync 0.01% 19.329ms 0.01% 19.329ms 9.187us 0.000us 0.00% 0.000us 0.000us 2104 aten::reshape 0.01% 18.999ms 0.04% 55.287ms 8.253us 0.000us 0.00% 119.868ms 17.893us 6699 aten::sin 0.01% 18.083ms 0.01% 18.095ms 3.016ms 24.320us 0.00% 24.320us 4.053us 6 aten::transpose 0.01% 17.572ms 0.02% 25.773ms 6.619us 0.000us 0.00% 0.000us 0.000us 3894 hipExtLaunchKernel 0.01% 16.631ms 0.01% 16.631ms 16.162us 0.000us 0.00% 0.000us 0.000us 1029 aten::rsqrt 0.01% 15.668ms 0.01% 18.863ms 58.947us 1.688ms 0.00% 1.688ms 5.273us 320 aten::mean 0.01% 15.532ms 0.01% 18.168ms 56.599us 434.602ms 0.18% 434.602ms 1.354ms 321 IndexFirstAxis 0.01% 15.278ms 0.02% 24.804ms 203.312us 0.000us 0.00% 264.364ms 2.167ms 122 FlashAttnVarlenQKVPackedFuncBackward 0.01% 15.218ms 0.01% 22.508ms 363.027us 61.970s 25.28% 61.970s 999.517ms 62 SeqAllToAll4DBackward 0.01% 15.042ms 54.84% 85.209s 355.039ms 0.000us 0.00% 1.356s 5.649ms 240 autograd::engine::evaluate_function: SplitWithSizesB... 
0.01% 14.380ms 0.02% 27.789ms 172.604us 0.000us 0.00% 327.882ms 2.037ms 161 aten::linear 0.01% 14.348ms 0.21% 324.923ms 245.040us 0.000us 0.00% 81.042s 61.118ms 1326 aten::silu 0.01% 13.808ms 0.01% 17.183ms 101.077us 122.315ms 0.05% 122.315ms 719.503us 170 autograd::engine::evaluate_function: ViewBackward0 0.01% 13.761ms 0.02% 28.556ms 13.775us 0.000us 0.00% 40.205ms 19.394us 2073 aten::select 0.01% 13.297ms 0.01% 16.379ms 6.740us 0.000us 0.00% 0.000us 0.000us 2430 detach 0.01% 12.998ms 0.01% 12.998ms 1.905us 0.000us 0.00% 0.000us 0.000us 6823 aten::to 0.01% 12.562ms 20.75% 32.245s 4.350ms 0.000us 0.00% 3.568s 481.273us 7413 AddmmBackward0 0.01% 11.516ms 0.04% 63.102ms 183.970us 0.000us 0.00% 2.957s 8.620ms 343 aten::nonzero 0.01% 11.066ms 0.48% 746.528ms 6.069ms 10.979ms 0.00% 13.541ms 110.091us 123 autograd::engine::evaluate_function: torch::autograd... 0.01% 10.889ms 0.06% 90.621ms 1.486ms 0.000us 0.00% 1.170s 19.180ms 61 aten::empty_like 0.01% 10.392ms 0.04% 54.767ms 18.794us 0.000us 0.00% 0.000us 0.000us 2914 aten::clone 0.01% 10.277ms 0.11% 178.003ms 108.737us 0.000us 0.00% 2.093s 1.278ms 1637 autograd::engine::evaluate_function: SliceBackward0 0.01% 9.919ms 0.06% 86.005ms 70.381us 0.000us 0.00% 863.182ms 706.369us 1222 aten::add_ 0.01% 9.824ms 0.01% 14.127ms 20.623us 288.996ms 0.12% 288.996ms 421.891us 685 aten::narrow 0.01% 9.794ms 0.02% 27.277ms 8.155us 0.000us 0.00% 0.000us 0.000us 3345 aten::gelu 0.01% 9.587ms 0.01% 11.954ms 74.710us 627.759ms 0.26% 627.759ms 3.923ms 160 FullyShardedDataParallel._pre_backward_hook 0.01% 9.516ms 0.04% 54.948ms 900.788us 0.000us 0.00% 508.834ms 8.342ms 61 aten::neg 0.01% 8.940ms 0.01% 11.564ms 32.121us 271.859ms 0.11% 271.859ms 755.163us 360 aten::unsqueeze 0.01% 8.800ms 0.01% 10.462ms 6.125us 0.000us 0.00% 0.000us 0.000us 1708 autograd::engine::evaluate_function: AddBackward0 0.01% 8.750ms 0.05% 73.625ms 113.443us 0.000us 0.00% 514.463ms 792.701us 649 hipMemsetAsync 0.01% 8.540ms 0.01% 8.540ms 11.402us 0.000us 0.00% 
0.000us 0.000us 749 IndexFirstAxisBackward 0.01% 7.905ms 0.01% 16.661ms 268.723us 0.000us 0.00% 171.474ms 2.766ms 62 aten::detach 0.00% 7.767ms 0.01% 20.765ms 3.043us 0.000us 0.00% 0.000us 0.000us 6823 c10d::allgather_ 0.00% 7.493ms 0.04% 54.586ms 454.886us 0.000us 0.00% 40.046ms 333.716us 120 autograd::engine::evaluate_function: AddmmBackward0 0.00% 7.439ms 0.06% 89.024ms 259.544us 0.000us 0.00% 3.073s 8.958ms 343 IndexPutFirstAxis 0.00% 7.313ms 0.01% 20.670ms 169.425us 0.000us 0.00% 153.079ms 1.255ms 122 aten::t 0.00% 7.114ms 0.01% 16.012ms 7.441us 0.000us 0.00% 0.000us 0.000us 2152 aten::slice_backward 0.00% 6.930ms 0.05% 70.322ms 57.547us 0.000us 0.00% 791.816ms 647.967us 1222 aten::stack 0.00% 6.474ms 0.03% 46.959ms 82.966us 0.000us 0.00% 1.039s 1.835ms 566 FullyShardedDataParallel._post_backward_prefetch 0.00% 6.142ms 0.00% 6.142ms 100.683us 0.000us 0.00% 0.000us 0.000us 61 hipEventDestroy 0.00% 6.116ms 0.00% 6.116ms 1.897us 431.367ms 0.18% 431.367ms 133.840us 3223 autograd::engine::evaluate_function: SeqAllToAll4DBa... 
0.00% 6.009ms 54.85% 85.215s 355.064ms 0.000us 0.00% 1.356s 5.649ms 240 aten::split_with_sizes 0.00% 5.982ms 0.00% 6.961ms 21.684us 0.000us 0.00% 0.000us 0.000us 321 aten::expand 0.00% 5.915ms 0.00% 7.420ms 6.839us 0.000us 0.00% 0.000us 0.000us 1085 aten::zero_ 0.00% 5.716ms 0.02% 32.661ms 23.230us 0.000us 0.00% 455.545ms 324.001us 1406 FullyShardedDataParallel._pre_forward_prefetch 0.00% 5.162ms 0.00% 5.162ms 84.616us 0.000us 0.00% 0.000us 0.000us 61 aten::gather 0.00% 5.080ms 0.00% 6.612ms 54.198us 264.364ms 0.11% 264.364ms 2.167ms 122 aten::zeros 0.00% 5.053ms 0.03% 51.466ms 36.604us 0.000us 0.00% 455.545ms 324.001us 1406 aten::cumsum 0.00% 5.044ms 0.00% 6.237ms 51.121us 818.719us 0.00% 818.719us 6.711us 122 FullyShardedDataParallel.rate_limiter 0.00% 5.019ms 0.00% 5.559ms 45.941us 0.000us 0.00% 5.932ms 49.022us 121 ToCopyBackward0 0.00% 4.742ms 0.05% 75.766ms 42.589us 0.000us 0.00% 954.959ms 536.796us 1779 IndexPutFirstAxisBackward 0.00% 4.620ms 0.01% 9.852ms 158.896us 0.000us 0.00% 55.893ms 901.498us 62 aten::unbind 0.00% 4.620ms 0.01% 10.671ms 26.478us 0.000us 0.00% 0.000us 0.000us 403 aten::div 0.00% 4.587ms 0.00% 6.134ms 36.085us 161.054ms 0.07% 161.054ms 947.379us 170 NativeLayerNormBackward0 0.00% 4.415ms 0.01% 13.358ms 106.866us 0.000us 0.00% 572.927ms 4.583ms 125 aten::max 0.00% 4.242ms 0.00% 6.553ms 53.714us 1.414ms 0.00% 1.414ms 11.587us 122 ViewBackward0 0.00% 4.186ms 0.01% 14.795ms 7.137us 0.000us 0.00% 40.205ms 19.394us 2073 aten::index 0.00% 4.142ms 0.00% 5.312ms 83.000us 55.901ms 0.02% 55.901ms 873.458us 64 aten::split 0.00% 3.794ms 0.01% 10.444ms 46.623us 0.000us 0.00% 0.000us 0.000us 224 aten::_index_put_impl_ 0.00% 3.711ms 0.00% 5.315ms 43.567us 107.148ms 0.04% 107.148ms 878.258us 122 _AllGatherBackward 0.00% 3.589ms 0.00% 4.569ms 76.152us 0.000us 0.00% 0.000us 0.000us 60 c10d::_allgather_base_ 0.00% 3.584ms 0.01% 23.244ms 192.095us 0.000us 0.00% 1.210s 9.997ms 121 PowBackward0 0.00% 3.563ms 0.01% 18.418ms 114.400us 0.000us 0.00% 346.491ms 
2.152ms 161 aten::layer_norm 0.00% 3.409ms 0.07% 108.189ms 220.793us 0.000us 0.00% 2.272s 4.637ms 490 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 155.367s Self CUDA time total: 245.173s ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ hipDeviceSynchronize 77.26% 120.032s 77.26% 120.032s 166.481ms 0.000us 0.00% 277.449ms 384.812us 721 hipMemcpyWithStream 20.54% 31.908s 20.54% 31.909s 63.817ms 0.000us 0.00% 18.294ms 36.588us 500 hipLaunchKernel 0.20% 312.036ms 0.20% 312.050ms 13.655us 0.000us 0.00% 0.000us 0.000us 22852 aten::copy_ 0.16% 243.746ms 20.17% 31.334s 3.185ms 6.268s 2.55% 6.283s 638.737us 9837 MulBackward0 0.14% 220.756ms 17.97% 27.914s 34.676ms 0.000us 0.00% 41.782s 51.904ms 805 SeqAllToAll4D 0.14% 211.633ms 77.61% 120.578s 167.470ms 0.000us 0.00% 4.940s 6.861ms 720 FullyShardedDataParallel.forward 0.11% 177.922ms 52.71% 81.892s 1.342s 0.000us 0.00% 87.059s 1.427s 61 record_param_comms 0.11% 169.793ms 0.17% 258.079ms 125.403us 5.927s 2.41% 5.930s 2.881ms 2058 aten::mul 0.10% 155.712ms 0.12% 187.113ms 51.433us 3.179s 1.30% 3.179s 873.853us 3638 aten::empty_strided 0.08% 127.256ms 0.08% 127.657ms 17.165us 6.240us 0.00% 6.240us 0.001us 7437 aten::cat 0.08% 120.555ms 0.10% 154.201ms 102.801us 2.213s 0.90% 2.213s 1.475ms 1500 aten::addmm 0.05% 82.498ms 0.07% 104.302ms 157.318us 38.755s 15.79% 38.755s 58.454ms 663 hipStreamWaitEvent 0.05% 81.556ms 0.05% 81.556ms 
26.300us 20.061ms 0.01% 20.061ms 6.469us 3101 aten::empty 0.05% 70.519ms 0.05% 70.519ms 13.865us 0.000us 0.00% 0.000us 0.000us 5086 FullyShardedDataParallel._pre_forward 0.03% 52.741ms 0.05% 84.362ms 1.383ms 0.000us 0.00% 1.520s 24.914ms 61 aten::sum 0.03% 52.215ms 0.05% 73.933ms 58.032us 518.225ms 0.21% 518.880ms 407.284us 1274 FullyShardedDataParallel._post_backward_hook 0.03% 46.645ms 0.05% 81.970ms 1.344ms 0.000us 0.00% 881.213ms 14.446ms 61 aten::_to_copy 0.03% 43.264ms 20.15% 31.297s 5.138ms 0.000us 0.00% 3.571s 586.323us 6091 aten::add 0.03% 42.032ms 0.03% 51.657ms 41.895us 773.122ms 0.31% 773.122ms 627.025us 1233 autograd::engine::evaluate_function: ToCopyBackward0... 0.03% 39.925ms 0.09% 143.225ms 80.509us 0.000us 0.00% 1.010s 567.573us 1779 aten::view 0.02% 38.772ms 0.02% 38.772ms 4.111us 0.000us 0.00% 0.000us 0.000us 9432 aten::cos 0.02% 38.761ms 0.02% 38.806ms 6.468ms 23.200us 0.00% 174.080us 29.013us 6 aten::mm 0.02% 35.803ms 0.03% 45.691ms 67.193us 2.960s 1.21% 2.960s 4.353ms 680 _AllGather 0.02% 32.882ms 0.07% 109.872ms 915.601us 0.000us 0.00% 17.043ms 142.029us 120 aten::slice 0.02% 31.805ms 0.02% 38.361ms 5.214us 0.000us 0.00% 0.000us 0.000us 7357 FullyShardedDataParallel._post_forward 0.02% 29.904ms 0.02% 30.573ms 501.203us 0.000us 0.00% 0.000us 0.000us 61 autograd::engine::evaluate_function: SiluBackward0 0.02% 28.583ms 0.02% 36.156ms 401.733us 0.000us 0.00% 907.995us 10.089us 90 aten::pow 0.02% 27.911ms 0.02% 37.061ms 57.017us 146.845ms 0.06% 302.666ms 465.640us 650 c10d::alltoall_base_ 0.02% 27.806ms 0.11% 168.822ms 234.475us 0.000us 0.00% 3.965s 5.507ms 720 FullyShardedDataParallel._pre_backward_prefetch 0.02% 26.727ms 0.03% 44.377ms 727.491us 0.000us 0.00% 478.129ms 7.838ms 61 FlashAttnVarlenQKVPackedFunc 0.02% 25.460ms 0.02% 37.043ms 303.630us 30.110s 12.27% 30.110s 246.801ms 122 aten::as_strided 0.02% 24.538ms 0.02% 24.538ms 1.254us 0.000us 0.00% 0.000us 0.000us 19575 aten::sin 0.02% 24.016ms 0.02% 24.032ms 4.005ms 23.200us 0.00% 23.200us 
3.867us 6 aten::fill_ 0.01% 22.046ms 0.02% 35.415ms 21.463us 453.811ms 0.18% 453.811ms 275.037us 1650 aten::native_layer_norm 0.01% 21.007ms 0.03% 45.523ms 185.808us 155.988ms 0.06% 1.008s 4.115ms 245 hipExtModuleLaunchKernel 0.01% 20.702ms 0.01% 20.702ms 10.595us 0.000us 0.00% 0.000us 0.000us 1954 hipMemcpyAsync 0.01% 20.126ms 0.01% 20.126ms 9.566us 0.000us 0.00% 0.000us 0.000us 2104 autograd::engine::evaluate_function: MulBackward0 0.01% 19.997ms 18.01% 27.979s 34.757ms 0.000us 0.00% 42.246s 52.480ms 805 aten::transpose 0.01% 17.803ms 0.02% 26.345ms 6.766us 0.000us 0.00% 0.000us 0.000us 3894 hipExtLaunchKernel 0.01% 17.467ms 0.01% 17.467ms 16.975us 0.000us 0.00% 0.000us 0.000us 1029 aten::reshape 0.01% 16.832ms 0.03% 53.434ms 7.976us 0.000us 0.00% 119.819ms 17.886us 6699 aten::mean 0.01% 16.650ms 0.01% 19.335ms 60.234us 434.494ms 0.18% 434.494ms 1.354ms 321 aten::rsqrt 0.01% 16.547ms 0.01% 19.705ms 61.579us 1.693ms 0.00% 1.693ms 5.292us 320 IndexFirstAxis 0.01% 16.369ms 0.02% 26.291ms 215.500us 0.000us 0.00% 264.976ms 2.172ms 122 autograd::engine::evaluate_function: SplitWithSizesB... 
0.01% 15.563ms 0.02% 29.129ms 180.925us 0.000us 0.00% 326.434ms 2.028ms 161 FlashAttnVarlenQKVPackedFuncBackward 0.01% 15.456ms 0.01% 23.017ms 371.247us 62.102s 25.30% 62.102s 1.002s 62 SeqAllToAll4DBackward 0.01% 15.359ms 54.85% 85.205s 355.021ms 0.000us 0.00% 1.881s 7.836ms 240 autograd::engine::evaluate_function: ViewBackward0 0.01% 14.414ms 0.02% 29.106ms 14.040us 0.000us 0.00% 40.190ms 19.388us 2073 aten::silu 0.01% 14.062ms 0.01% 17.456ms 102.685us 122.752ms 0.05% 122.752ms 722.069us 170 aten::linear 0.01% 13.962ms 0.21% 329.189ms 248.257us 0.000us 0.00% 77.929s 58.770ms 1326 aten::select 0.01% 13.368ms 0.01% 16.607ms 6.834us 0.000us 0.00% 0.000us 0.000us 2430 aten::nonzero 0.01% 12.275ms 0.48% 750.175ms 6.099ms 11.532ms 0.00% 14.183ms 115.313us 123 aten::to 0.01% 12.117ms 20.15% 31.309s 4.224ms 0.000us 0.00% 3.571s 481.760us 7413 detach 0.01% 11.635ms 0.01% 11.635ms 1.705us 0.000us 0.00% 0.000us 0.000us 6823 AddmmBackward0 0.01% 11.491ms 0.04% 63.071ms 183.880us 0.000us 0.00% 2.960s 8.630ms 343 autograd::engine::evaluate_function: SliceBackward0 0.01% 10.445ms 0.06% 88.743ms 72.621us 0.000us 0.00% 860.894ms 704.496us 1222 aten::add_ 0.01% 10.404ms 0.01% 14.757ms 21.543us 288.442ms 0.12% 288.442ms 421.083us 685 FullyShardedDataParallel._pre_backward_hook 0.01% 10.308ms 0.04% 56.757ms 930.436us 0.000us 0.00% 478.129ms 7.838ms 61 aten::empty_like 0.01% 10.167ms 0.04% 57.117ms 19.601us 0.000us 0.00% 6.240us 0.002us 2914 aten::clone 0.01% 9.922ms 0.12% 189.738ms 115.906us 0.000us 0.00% 2.102s 1.284ms 1637 aten::narrow 0.01% 9.894ms 0.02% 27.369ms 8.182us 0.000us 0.00% 0.000us 0.000us 3345 aten::gelu 0.01% 9.862ms 0.01% 12.240ms 76.499us 630.667ms 0.26% 630.667ms 3.942ms 160 autograd::engine::evaluate_function: AddBackward0 0.01% 9.495ms 0.05% 76.282ms 117.538us 0.000us 0.00% 504.029ms 776.624us 649 aten::neg 0.01% 9.463ms 0.01% 12.305ms 34.181us 273.752ms 0.11% 273.752ms 760.423us 360 autograd::engine::evaluate_function: torch::autograd... 
0.01% 9.406ms 0.06% 92.376ms 1.514ms 0.000us 0.00% 881.213ms 14.446ms 61 aten::unsqueeze 0.01% 8.528ms 0.01% 10.288ms 6.023us 0.000us 0.00% 0.000us 0.000us 1708 hipMemsetAsync 0.01% 8.313ms 0.01% 8.313ms 11.099us 0.000us 0.00% 0.000us 0.000us 749 c10d::allgather_ 0.01% 7.794ms 0.04% 57.306ms 477.554us 0.000us 0.00% 14.543ms 121.188us 120 IndexFirstAxisBackward 0.01% 7.790ms 0.01% 16.560ms 267.097us 0.000us 0.00% 171.445ms 2.765ms 62 IndexPutFirstAxis 0.00% 7.435ms 0.01% 21.439ms 175.730us 0.000us 0.00% 153.132ms 1.255ms 122 autograd::engine::evaluate_function: AddmmBackward0 0.00% 7.301ms 0.06% 88.852ms 259.044us 0.000us 0.00% 3.076s 8.969ms 343 aten::t 0.00% 7.025ms 0.01% 15.371ms 7.143us 0.000us 0.00% 0.000us 0.000us 2152 aten::detach 0.00% 6.987ms 0.01% 18.622ms 2.729us 0.000us 0.00% 0.000us 0.000us 6823 hipMalloc 0.00% 6.704ms 0.00% 6.704ms 3.352ms 0.000us 0.00% 0.000us 0.000us 2 aten::stack 0.00% 6.413ms 0.03% 48.916ms 86.424us 0.000us 0.00% 1.041s 1.839ms 566 FullyShardedDataParallel._post_backward_prefetch 0.00% 6.361ms 0.00% 6.361ms 104.277us 0.000us 0.00% 0.000us 0.000us 61 aten::split_with_sizes 0.00% 6.208ms 0.00% 7.250ms 22.585us 0.000us 0.00% 0.000us 0.000us 321 autograd::engine::evaluate_function: SeqAllToAll4DBa... 
0.00% 6.195ms 54.85% 85.211s 355.047ms 0.000us 0.00% 1.881s 7.836ms 240 aten::zero_ 0.00% 6.172ms 0.02% 34.582ms 24.596us 0.000us 0.00% 452.842ms 322.078us 1406 aten::slice_backward 0.00% 6.124ms 0.05% 72.516ms 59.342us 0.000us 0.00% 789.661ms 646.204us 1222 aten::expand 0.00% 5.886ms 0.00% 7.255ms 6.687us 0.000us 0.00% 0.000us 0.000us 1085 aten::cumsum 0.00% 5.554ms 0.00% 6.813ms 55.843us 795.195us 0.00% 795.195us 6.518us 122 aten::gather 0.00% 5.375ms 0.00% 6.875ms 56.351us 264.976ms 0.11% 264.976ms 2.172ms 122 hipEventDestroy 0.00% 5.312ms 0.00% 5.312ms 1.648us 383.037ms 0.16% 383.037ms 118.845us 3223 FullyShardedDataParallel.rate_limiter 0.00% 5.126ms 0.00% 5.767ms 47.664us 0.000us 0.00% 743.395ms 6.144ms 121 ToCopyBackward0 0.00% 4.910ms 0.05% 76.463ms 42.981us 0.000us 0.00% 908.894ms 510.902us 1779 FullyShardedDataParallel._pre_forward_prefetch 0.00% 4.897ms 0.00% 4.897ms 80.280us 0.000us 0.00% 0.000us 0.000us 61 aten::zeros 0.00% 4.759ms 0.03% 53.852ms 38.302us 0.000us 0.00% 452.842ms 322.078us 1406 aten::div 0.00% 4.718ms 0.00% 6.136ms 36.092us 160.977ms 0.07% 160.977ms 946.926us 170 ViewBackward0 0.00% 4.715ms 0.01% 14.692ms 7.087us 0.000us 0.00% 40.190ms 19.388us 2073 IndexPutFirstAxisBackward 0.00% 4.654ms 0.01% 9.975ms 160.881us 0.000us 0.00% 55.992ms 903.090us 62 aten::max 0.00% 4.602ms 0.00% 6.999ms 57.370us 1.435ms 0.00% 1.435ms 11.763us 122 aten::unbind 0.00% 4.586ms 0.01% 10.641ms 26.405us 0.000us 0.00% 0.000us 0.000us 403 NativeLayerNormBackward0 0.00% 4.427ms 0.01% 13.355ms 106.838us 0.000us 0.00% 572.216ms 4.578ms 125 aten::index 0.00% 4.261ms 0.00% 5.418ms 84.651us 55.999ms 0.02% 55.999ms 874.988us 64 aten::_index_put_impl_ 0.00% 3.928ms 0.00% 5.670ms 46.479us 107.233ms 0.04% 107.233ms 878.962us 122 _AllGatherBackward 0.00% 3.746ms 0.00% 4.752ms 79.206us 0.000us 0.00% 0.000us 0.000us 60 aten::split 0.00% 3.685ms 0.01% 10.011ms 44.692us 0.000us 0.00% 0.000us 0.000us 224 c10d::_allgather_base_ 0.00% 3.566ms 0.02% 24.251ms 200.417us 0.000us 0.00% 
1.173s 9.690ms 121 aten::layer_norm 0.00% 3.370ms 0.07% 110.114ms 224.723us 0.000us 0.00% 2.231s 4.553ms 490 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 155.356s Self CUDA time total: 245.466s ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ hipDeviceSynchronize 76.36% 118.640s 76.36% 118.640s 164.549ms 0.000us 0.00% 301.590ms 418.294us 721 hipMemcpyWithStream 21.46% 33.343s 21.46% 33.344s 66.688ms 0.000us 0.00% 76.294ms 152.588us 500 hipLaunchKernel 0.20% 313.956ms 0.20% 313.986ms 13.740us 0.000us 0.00% 0.000us 0.000us 22852 aten::copy_ 0.16% 243.090ms 21.06% 32.714s 3.326ms 6.301s 2.58% 6.378s 648.372us 9837 MulBackward0 0.14% 214.699ms 17.95% 27.887s 34.642ms 0.000us 0.00% 42.201s 52.423ms 805 SeqAllToAll4D 0.13% 201.215ms 76.71% 119.178s 165.525ms 0.000us 0.00% 3.240s 4.500ms 720 FullyShardedDataParallel.forward 0.11% 175.795ms 52.69% 81.865s 1.342s 0.000us 0.00% 84.065s 1.378s 61 record_param_comms 0.11% 167.511ms 0.16% 254.553ms 123.689us 4.523s 1.85% 4.525s 2.199ms 2058 aten::mul 0.10% 154.192ms 0.12% 186.696ms 51.318us 3.232s 1.32% 3.232s 888.398us 3638 aten::empty_strided 0.08% 126.589ms 0.08% 126.597ms 17.023us 0.000us 0.00% 0.000us 0.000us 7437 aten::cat 0.08% 119.381ms 0.09% 147.257ms 98.171us 2.208s 0.90% 2.208s 1.472ms 1500 aten::addmm 0.05% 81.994ms 0.07% 104.040ms 156.923us 40.477s 16.54% 40.477s 61.052ms 663 hipStreamWaitEvent 0.05% 79.270ms 0.05% 79.270ms 
25.563us 9.130ms 0.00% 9.130ms 2.944us 3101 aten::empty 0.05% 70.920ms 0.05% 70.934ms 13.947us 0.000us 0.00% 0.000us 0.000us 5086 aten::sum 0.03% 52.488ms 0.05% 74.980ms 58.854us 516.739ms 0.21% 517.391ms 406.115us 1274 FullyShardedDataParallel._pre_forward 0.03% 50.213ms 0.05% 81.015ms 1.328ms 0.000us 0.00% 706.968ms 11.590ms 61 FullyShardedDataParallel._post_backward_hook 0.03% 45.227ms 0.05% 80.119ms 1.313ms 0.000us 0.00% 1.185s 19.432ms 61 aten::_to_copy 0.03% 44.193ms 21.03% 32.679s 5.365ms 0.000us 0.00% 3.673s 602.983us 6091 aten::add 0.03% 42.202ms 0.03% 51.854ms 42.055us 784.169ms 0.32% 784.169ms 635.984us 1233 autograd::engine::evaluate_function: ToCopyBackward0... 0.03% 39.094ms 0.09% 143.123ms 80.451us 0.000us 0.00% 1.057s 594.053us 1779 aten::view 0.02% 38.689ms 0.02% 38.689ms 4.102us 0.000us 0.00% 0.000us 0.000us 9432 aten::mm 0.02% 36.267ms 0.03% 46.026ms 67.685us 2.945s 1.20% 2.945s 4.331ms 680 aten::cos 0.02% 35.736ms 0.02% 35.760ms 5.960ms 23.680us 0.00% 28.960us 4.827us 6 _AllGather 0.02% 33.094ms 0.07% 109.033ms 908.610us 0.000us 0.00% 37.476ms 312.297us 120 aten::slice 0.02% 32.155ms 0.03% 38.989ms 5.300us 0.000us 0.00% 0.000us 0.000us 7357 FullyShardedDataParallel._post_forward 0.02% 29.686ms 0.02% 30.336ms 497.314us 0.000us 0.00% 0.000us 0.000us 61 autograd::engine::evaluate_function: SiluBackward0 0.02% 28.703ms 0.02% 36.355ms 403.942us 0.000us 0.00% 912.159us 10.135us 90 aten::pow 0.02% 28.034ms 0.02% 37.303ms 57.389us 146.397ms 0.06% 301.701ms 464.156us 650 c10d::alltoall_base_ 0.02% 27.604ms 0.11% 166.794ms 231.659us 0.000us 0.00% 2.252s 3.128ms 720 FullyShardedDataParallel._pre_backward_prefetch 0.02% 26.101ms 0.03% 43.729ms 716.873us 0.000us 0.00% 504.593ms 8.272ms 61 FlashAttnVarlenQKVPackedFunc 0.02% 25.206ms 0.02% 36.608ms 300.066us 30.112s 12.31% 30.112s 246.820ms 122 aten::as_strided 0.02% 25.099ms 0.02% 25.099ms 1.282us 0.000us 0.00% 0.000us 0.000us 19575 aten::sin 0.02% 24.294ms 0.02% 24.313ms 4.052ms 23.360us 0.00% 23.360us 
3.893us 6 aten::fill_ 0.01% 22.383ms 0.02% 35.686ms 21.628us 456.353ms 0.19% 456.353ms 276.578us 1650 aten::native_layer_norm 0.01% 21.161ms 0.03% 45.646ms 186.309us 156.580ms 0.06% 1.016s 4.147ms 245 aten::reshape 0.01% 20.253ms 0.04% 56.821ms 8.482us 0.000us 0.00% 119.919ms 17.901us 6699 hipExtModuleLaunchKernel 0.01% 20.133ms 0.01% 20.133ms 10.303us 0.000us 0.00% 0.000us 0.000us 1954 autograd::engine::evaluate_function: MulBackward0 0.01% 19.862ms 17.99% 27.952s 34.723ms 0.000us 0.00% 42.663s 52.997ms 805 hipMemcpyAsync 0.01% 19.704ms 0.01% 19.704ms 9.365us 0.000us 0.00% 0.000us 0.000us 2104 aten::transpose 0.01% 18.041ms 0.02% 27.043ms 6.945us 0.000us 0.00% 0.000us 0.000us 3894 hipExtLaunchKernel 0.01% 17.264ms 0.01% 17.264ms 16.777us 0.000us 0.00% 0.000us 0.000us 1029 aten::mean 0.01% 16.543ms 0.01% 19.088ms 59.466us 433.836ms 0.18% 433.836ms 1.352ms 321 aten::rsqrt 0.01% 16.363ms 0.01% 19.554ms 61.106us 1.696ms 0.00% 1.696ms 5.300us 320 IndexFirstAxis 0.01% 15.719ms 0.02% 25.680ms 210.491us 0.000us 0.00% 265.014ms 2.172ms 122 FlashAttnVarlenQKVPackedFuncBackward 0.01% 15.382ms 0.01% 22.985ms 370.725us 62.000s 25.34% 62.000s 999.993ms 62 SeqAllToAll4DBackward 0.01% 15.176ms 54.84% 85.202s 355.008ms 0.000us 0.00% 1.538s 6.407ms 240 autograd::engine::evaluate_function: SplitWithSizesB... 
0.01% 14.555ms 0.02% 28.469ms 176.823us 0.000us 0.00% 326.472ms 2.028ms 161 aten::silu 0.01% 14.123ms 0.01% 17.399ms 102.346us 133.716ms 0.05% 133.716ms 786.562us 170 aten::linear 0.01% 13.761ms 0.21% 329.338ms 248.370us 0.000us 0.00% 81.372s 61.367ms 1326 aten::select 0.01% 13.626ms 0.01% 16.865ms 6.941us 0.000us 0.00% 0.000us 0.000us 2430 autograd::engine::evaluate_function: ViewBackward0 0.01% 13.574ms 0.02% 27.785ms 13.403us 0.000us 0.00% 40.198ms 19.391us 2073 aten::to 0.01% 12.634ms 21.04% 32.692s 4.410ms 0.000us 0.00% 3.673s 495.449us 7413 aten::nonzero 0.01% 12.129ms 0.48% 746.985ms 6.073ms 11.436ms 0.00% 11.750ms 95.529us 123 detach 0.01% 11.881ms 0.01% 11.881ms 1.741us 0.000us 0.00% 0.000us 0.000us 6823 AddmmBackward0 0.01% 11.006ms 0.04% 63.314ms 184.589us 0.000us 0.00% 2.945s 8.585ms 343 aten::clone 0.01% 10.751ms 0.12% 190.914ms 116.624us 0.000us 0.00% 2.098s 1.281ms 1637 aten::empty_like 0.01% 10.586ms 0.04% 57.593ms 19.764us 0.000us 0.00% 0.000us 0.000us 2914 aten::add_ 0.01% 10.435ms 0.01% 14.837ms 21.659us 288.051ms 0.12% 288.051ms 420.512us 685 autograd::engine::evaluate_function: SliceBackward0 0.01% 10.398ms 0.06% 90.945ms 74.423us 0.000us 0.00% 860.977ms 704.564us 1222 aten::narrow 0.01% 9.960ms 0.02% 27.697ms 8.280us 0.000us 0.00% 0.000us 0.000us 3345 aten::gelu 0.01% 9.856ms 0.01% 12.184ms 76.151us 626.246ms 0.26% 626.246ms 3.914ms 160 FullyShardedDataParallel._pre_backward_hook 0.01% 9.659ms 0.04% 55.448ms 908.981us 0.000us 0.00% 504.593ms 8.272ms 61 aten::neg 0.01% 9.654ms 0.01% 12.285ms 34.125us 272.218ms 0.11% 272.218ms 756.162us 360 autograd::engine::evaluate_function: torch::autograd... 
0.01% 9.202ms 0.06% 90.375ms 1.482ms 0.000us 0.00% 1.185s 19.432ms 61 autograd::engine::evaluate_function: AddBackward0 0.01% 8.856ms 0.05% 74.428ms 114.681us 0.000us 0.00% 510.324ms 786.324us 649 aten::unsqueeze 0.01% 8.782ms 0.01% 10.446ms 6.116us 0.000us 0.00% 0.000us 0.000us 1708 hipMemsetAsync 0.01% 8.447ms 0.01% 8.447ms 11.278us 0.000us 0.00% 0.000us 0.000us 749 aten::detach 0.01% 8.036ms 0.01% 19.916ms 2.919us 0.000us 0.00% 0.000us 0.000us 6823 IndexFirstAxisBackward 0.01% 8.035ms 0.01% 16.978ms 273.843us 0.000us 0.00% 171.498ms 2.766ms 62 c10d::allgather_ 0.00% 7.675ms 0.04% 56.829ms 473.572us 0.000us 0.00% 35.571ms 296.423us 120 IndexPutFirstAxis 0.00% 7.365ms 0.01% 21.281ms 174.431us 0.000us 0.00% 153.194ms 1.256ms 122 autograd::engine::evaluate_function: AddmmBackward0 0.00% 7.204ms 0.06% 89.770ms 261.719us 0.000us 0.00% 3.061s 8.925ms 343 aten::t 0.00% 7.191ms 0.01% 16.140ms 7.500us 0.000us 0.00% 0.000us 0.000us 2152 aten::slice_backward 0.00% 6.757ms 0.05% 74.673ms 61.107us 0.000us 0.00% 790.002ms 646.483us 1222 aten::stack 0.00% 6.659ms 0.03% 48.970ms 86.520us 0.000us 0.00% 1.040s 1.837ms 566 aten::zero_ 0.00% 6.292ms 0.02% 35.057ms 24.934us 0.000us 0.00% 455.389ms 323.890us 1406 aten::expand 0.00% 5.999ms 0.00% 7.386ms 6.807us 0.000us 0.00% 0.000us 0.000us 1085 aten::split_with_sizes 0.00% 5.961ms 0.00% 6.913ms 21.536us 0.000us 0.00% 0.000us 0.000us 321 hipEventDestroy 0.00% 5.889ms 0.00% 5.889ms 1.827us 613.652ms 0.25% 613.652ms 190.398us 3223 FullyShardedDataParallel._post_backward_prefetch 0.00% 5.871ms 0.00% 5.871ms 96.246us 0.000us 0.00% 0.000us 0.000us 61 autograd::engine::evaluate_function: SeqAllToAll4DBa... 
0.00% 5.814ms 54.85% 85.208s 355.032ms 0.000us 0.00% 1.538s 6.407ms 240 aten::cumsum 0.00% 5.510ms 0.00% 6.686ms 54.801us 803.199us 0.00% 803.199us 6.584us 122 aten::gather 0.00% 5.352ms 0.00% 6.950ms 56.970us 265.014ms 0.11% 265.014ms 2.172ms 122 aten::zeros 0.00% 5.172ms 0.04% 54.897ms 39.045us 0.000us 0.00% 455.389ms 323.890us 1406 FullyShardedDataParallel._pre_forward_prefetch 0.00% 5.066ms 0.00% 5.066ms 83.043us 0.000us 0.00% 0.000us 0.000us 61 aten::div 0.00% 4.775ms 0.00% 6.418ms 37.754us 160.807ms 0.07% 160.807ms 945.926us 170 FullyShardedDataParallel.rate_limiter 0.00% 4.754ms 0.00% 5.289ms 43.708us 0.000us 0.00% 16.610ms 137.273us 121 aten::unbind 0.00% 4.672ms 0.01% 10.967ms 27.213us 0.000us 0.00% 0.000us 0.000us 403 IndexPutFirstAxisBackward 0.00% 4.670ms 0.01% 10.054ms 162.165us 0.000us 0.00% 55.947ms 902.370us 62 ToCopyBackward0 0.00% 4.569ms 0.05% 77.732ms 43.694us 0.000us 0.00% 956.049ms 537.408us 1779 aten::max 0.00% 4.531ms 0.00% 6.913ms 56.666us 1.411ms 0.00% 1.411ms 11.569us 122 aten::index 0.00% 4.345ms 0.00% 5.469ms 85.454us 55.956ms 0.02% 55.956ms 874.316us 64 NativeLayerNormBackward0 0.00% 4.180ms 0.01% 13.359ms 106.869us 0.000us 0.00% 571.646ms 4.573ms 125 aten::_index_put_impl_ 0.00% 3.943ms 0.00% 5.523ms 45.271us 107.297ms 0.04% 107.297ms 879.484us 122 aten::split 0.00% 3.834ms 0.01% 10.426ms 46.544us 0.000us 0.00% 0.000us 0.000us 224 _AllGatherBackward 0.00% 3.680ms 0.00% 4.656ms 77.600us 0.000us 0.00% 0.000us 0.000us 60 c10d::_allgather_base_ 0.00% 3.673ms 0.02% 24.012ms 198.450us 0.000us 0.00% 1.160s 9.585ms 121 ViewBackward0 0.00% 3.610ms 0.01% 14.212ms 6.856us 0.000us 0.00% 40.198ms 19.391us 2073 aten::layer_norm 0.00% 3.469ms 0.07% 110.715ms 225.950us 0.000us 0.00% 2.279s 4.650ms 490 aten::native_layer_norm_backward 0.00% 3.432ms 0.01% 9.179ms 73.430us 144.446ms 0.06% 571.646ms 4.573ms 125 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ 
------------ ------------ ------------ ------------ Self CPU time total: 155.361s Self CUDA time total: 244.672s ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ hipDeviceSynchronize 76.38% 118.672s 76.38% 118.673s 164.595ms 0.000us 0.00% 43.811ms 60.764us 721 hipMemcpyWithStream 21.42% 33.276s 21.42% 33.276s 66.553ms 0.000us 0.00% 273.320ms 546.640us 500 hipLaunchKernel 0.21% 327.300ms 0.21% 327.307ms 14.323us 0.000us 0.00% 0.000us 0.000us 22852 aten::copy_ 0.15% 233.207ms 21.04% 32.691s 3.323ms 6.345s 2.60% 6.618s 672.745us 9837 MulBackward0 0.15% 232.134ms 17.97% 27.926s 34.691ms 0.000us 0.00% 42.423s 52.699ms 805 SeqAllToAll4D 0.13% 205.554ms 76.72% 119.194s 165.547ms 0.000us 0.00% 2.891s 4.016ms 720 FullyShardedDataParallel.forward 0.11% 177.785ms 52.71% 81.891s 1.342s 0.000us 0.00% 84.116s 1.379s 61 record_param_comms 0.10% 159.976ms 0.16% 242.417ms 117.793us 4.106s 1.68% 4.108s 1.996ms 2058 aten::mul 0.09% 146.993ms 0.12% 180.380ms 49.582us 3.319s 1.36% 3.319s 912.369us 3638 aten::empty_strided 0.08% 121.626ms 0.08% 122.078ms 16.415us 6.240us 0.00% 6.240us 0.001us 7437 aten::cat 0.07% 114.626ms 0.10% 148.135ms 98.757us 2.237s 0.92% 2.237s 1.491ms 1500 hipStreamWaitEvent 0.07% 102.780ms 0.07% 102.780ms 33.144us 19.398ms 0.01% 19.398ms 6.255us 3101 aten::addmm 0.05% 82.280ms 0.07% 105.289ms 158.807us 40.460s 16.60% 40.460s 61.025ms 663 aten::empty 0.04% 66.561ms 0.04% 66.577ms 13.090us 0.000us 0.00% 0.000us 0.000us 5086 FullyShardedDataParallel._post_backward_hook 0.04% 57.857ms 0.06% 93.216ms 1.528ms 0.000us 0.00% 896.751ms 
14.701ms 61 FullyShardedDataParallel._pre_forward 0.03% 53.238ms 0.05% 84.918ms 1.392ms 0.000us 0.00% 1.040s 17.052ms 61 aten::sum 0.03% 50.342ms 0.05% 74.555ms 58.520us 514.982ms 0.21% 515.643ms 404.743us 1274 aten::_to_copy 0.03% 45.365ms 21.02% 32.659s 5.362ms 0.000us 0.00% 3.893s 639.160us 6091 autograd::engine::evaluate_function: ToCopyBackward0... 0.03% 41.308ms 0.11% 164.935ms 92.712us 0.000us 0.00% 1.002s 563.429us 1779 aten::add 0.03% 40.812ms 0.03% 51.054ms 41.407us 808.885ms 0.33% 808.885ms 656.030us 1233 aten::cos 0.02% 37.862ms 0.02% 37.893ms 6.316ms 23.680us 0.00% 712.159us 118.693us 6 aten::view 0.02% 37.587ms 0.02% 37.587ms 3.985us 0.000us 0.00% 0.000us 0.000us 9432 aten::mm 0.02% 35.020ms 0.03% 45.316ms 66.641us 2.972s 1.22% 2.972s 4.371ms 680 _AllGather 0.02% 33.007ms 0.07% 106.935ms 891.128us 0.000us 0.00% 14.764ms 123.033us 120 FullyShardedDataParallel._post_forward 0.02% 31.119ms 0.02% 31.755ms 520.576us 0.000us 0.00% 0.000us 0.000us 61 autograd::engine::evaluate_function: SiluBackward0 0.02% 31.088ms 0.02% 38.493ms 427.697us 0.000us 0.00% 910.077us 10.112us 90 aten::slice 0.02% 29.681ms 0.02% 36.218ms 4.923us 0.000us 0.00% 0.000us 0.000us 7357 aten::sin 0.02% 27.852ms 0.02% 27.884ms 4.647ms 23.680us 0.00% 33.600us 5.600us 6 aten::as_strided 0.02% 26.633ms 0.02% 26.633ms 1.361us 0.000us 0.00% 0.000us 0.000us 19575 aten::pow 0.02% 26.631ms 0.02% 36.029ms 55.429us 146.480ms 0.06% 302.013ms 464.635us 650 FlashAttnVarlenQKVPackedFunc 0.02% 26.274ms 0.02% 37.186ms 304.801us 30.107s 12.35% 30.107s 246.777ms 122 FullyShardedDataParallel._pre_backward_prefetch 0.02% 24.929ms 0.03% 41.962ms 687.909us 0.000us 0.00% 476.199ms 7.807ms 61 c10d::alltoall_base_ 0.02% 24.734ms 0.10% 154.743ms 214.921us 0.000us 0.00% 2.135s 2.965ms 720 hipExtModuleLaunchKernel 0.01% 21.772ms 0.01% 21.772ms 11.142us 0.000us 0.00% 0.000us 0.000us 1954 aten::native_layer_norm 0.01% 20.969ms 0.03% 45.769ms 186.814us 155.991ms 0.06% 1.009s 4.117ms 245 hipMemcpyAsync 0.01% 20.626ms 
0.01% 20.643ms 9.811us 0.000us 0.00% 0.000us 0.000us 2104 autograd::engine::evaluate_function: MulBackward0 0.01% 19.310ms 18.02% 27.989s 34.769ms 0.000us 0.00% 42.883s 53.270ms 805 aten::fill_ 0.01% 18.783ms 0.02% 33.056ms 20.034us 452.786ms 0.19% 452.786ms 274.416us 1650 aten::transpose 0.01% 17.410ms 0.02% 25.941ms 6.662us 0.000us 0.00% 0.000us 0.000us 3894 hipExtLaunchKernel 0.01% 17.315ms 0.01% 17.315ms 16.827us 0.000us 0.00% 0.000us 0.000us 1029 aten::reshape 0.01% 16.722ms 0.03% 52.193ms 7.791us 0.000us 0.00% 119.772ms 17.879us 6699 IndexFirstAxis 0.01% 16.542ms 0.02% 26.226ms 214.967us 0.000us 0.00% 264.991ms 2.172ms 122 FlashAttnVarlenQKVPackedFuncBackward 0.01% 15.957ms 0.01% 23.293ms 375.693us 61.947s 25.42% 61.947s 999.152ms 62 aten::mean 0.01% 15.736ms 0.01% 18.601ms 57.948us 433.953ms 0.18% 433.953ms 1.352ms 321 aten::rsqrt 0.01% 15.726ms 0.01% 19.073ms 59.603us 1.701ms 0.00% 1.701ms 5.317us 320 autograd::engine::evaluate_function: SplitWithSizesB... 0.01% 15.725ms 0.02% 29.040ms 180.372us 0.000us 0.00% 324.826ms 2.018ms 161 autograd::engine::evaluate_function: ViewBackward0 0.01% 15.167ms 0.02% 31.449ms 15.171us 0.000us 0.00% 40.171ms 19.378us 2073 SeqAllToAll4DBackward 0.01% 14.719ms 54.82% 85.175s 354.898ms 0.000us 0.00% 1.400s 5.833ms 240 aten::linear 0.01% 14.107ms 0.21% 332.399ms 250.678us 0.000us 0.00% 81.337s 61.340ms 1326 aten::silu 0.01% 13.985ms 0.01% 17.336ms 101.978us 144.260ms 0.06% 144.526ms 850.152us 170 aten::select 0.01% 12.672ms 0.01% 15.701ms 6.461us 0.000us 0.00% 0.000us 0.000us 2430 aten::to 0.01% 12.168ms 21.03% 32.671s 4.407ms 0.000us 0.00% 3.893s 525.175us 7413 detach 0.01% 11.982ms 0.01% 11.982ms 1.756us 0.000us 0.00% 0.000us 0.000us 6823 AddmmBackward0 0.01% 11.731ms 0.04% 62.985ms 183.631us 0.000us 0.00% 2.972s 8.665ms 343 aten::nonzero 0.01% 11.590ms 0.49% 757.706ms 6.160ms 11.107ms 0.00% 11.888ms 96.648us 123 autograd::engine::evaluate_function: torch::autograd... 
0.01% 10.489ms 0.07% 104.633ms 1.715ms 0.000us 0.00% 896.751ms 14.701ms 61 autograd::engine::evaluate_function: AddBackward0 0.01% 10.393ms 0.05% 75.911ms 116.966us 0.000us 0.00% 501.844ms 773.258us 649 aten::clone 0.01% 10.314ms 0.12% 183.738ms 112.241us 0.000us 0.00% 2.117s 1.293ms 1637 FullyShardedDataParallel._pre_backward_hook 0.01% 10.231ms 0.03% 54.305ms 890.246us 0.000us 0.00% 476.199ms 7.807ms 61 aten::empty_like 0.01% 10.185ms 0.03% 53.828ms 18.472us 0.000us 0.00% 6.240us 0.002us 2914 aten::gelu 0.01% 9.903ms 0.01% 12.332ms 77.076us 629.473ms 0.26% 629.473ms 3.934ms 160 aten::narrow 0.01% 9.542ms 0.02% 26.172ms 7.824us 0.000us 0.00% 0.000us 0.000us 3345 autograd::engine::evaluate_function: SliceBackward0 0.01% 9.514ms 0.05% 80.392ms 65.787us 0.000us 0.00% 857.002ms 701.311us 1222 aten::add_ 0.01% 9.074ms 0.01% 13.451ms 19.636us 288.351ms 0.12% 288.351ms 420.950us 685 aten::neg 0.01% 8.855ms 0.01% 11.827ms 32.853us 274.064ms 0.11% 274.064ms 761.290us 360 hipMemsetAsync 0.01% 8.743ms 0.01% 8.746ms 11.676us 0.000us 0.00% 0.000us 0.000us 749 IndexFirstAxisBackward 0.01% 8.570ms 0.01% 17.166ms 276.863us 0.000us 0.00% 171.657ms 2.769ms 62 aten::unsqueeze 0.01% 8.568ms 0.01% 10.231ms 5.990us 0.000us 0.00% 0.000us 0.000us 1708 autograd::engine::evaluate_function: AddmmBackward0 0.00% 7.678ms 0.06% 89.785ms 261.763us 0.000us 0.00% 3.089s 9.007ms 343 FullyShardedDataParallel._post_backward_prefetch 0.00% 7.416ms 0.00% 7.416ms 121.567us 0.000us 0.00% 0.000us 0.000us 61 aten::detach 0.00% 7.359ms 0.01% 19.341ms 2.835us 0.000us 0.00% 0.000us 0.000us 6823 IndexPutFirstAxis 0.00% 7.240ms 0.01% 22.239ms 182.284us 0.000us 0.00% 153.063ms 1.255ms 122 aten::t 0.00% 7.054ms 0.01% 15.332ms 7.125us 0.000us 0.00% 0.000us 0.000us 2152 c10d::allgather_ 0.00% 6.926ms 0.03% 54.212ms 451.765us 0.000us 0.00% 12.394ms 103.287us 120 autograd::engine::evaluate_function: SeqAllToAll4DBa... 
0.00% 6.397ms 54.83% 85.182s 354.924ms 0.000us 0.00% 1.400s 5.833ms 240 aten::stack 0.00% 6.344ms 0.03% 51.256ms 90.558us 0.000us 0.00% 1.054s 1.862ms 566 hipEventDestroy 0.00% 6.311ms 0.00% 6.311ms 1.958us 400.959ms 0.16% 400.959ms 124.406us 3223 ViewBackward0 0.00% 5.900ms 0.01% 16.282ms 7.855us 0.000us 0.00% 40.171ms 19.378us 2073 aten::split_with_sizes 0.00% 5.893ms 0.00% 6.849ms 21.337us 0.000us 0.00% 0.000us 0.000us 321 aten::expand 0.00% 5.815ms 0.00% 7.259ms 6.690us 0.000us 0.00% 0.000us 0.000us 1085 aten::cumsum 0.00% 5.331ms 0.00% 6.764ms 55.444us 782.559us 0.00% 782.559us 6.414us 122 ToCopyBackward0 0.00% 5.311ms 0.05% 74.087ms 41.645us 0.000us 0.00% 901.420ms 506.700us 1779 aten::zero_ 0.00% 5.286ms 0.02% 31.182ms 22.178us 0.000us 0.00% 451.816ms 321.348us 1406 aten::gather 0.00% 5.158ms 0.00% 6.775ms 55.533us 264.991ms 0.11% 264.991ms 2.172ms 122 FullyShardedDataParallel.rate_limiter 0.00% 5.133ms 0.00% 5.686ms 46.988us 0.000us 0.00% 256.368ms 2.119ms 121 FullyShardedDataParallel._pre_forward_prefetch 0.00% 5.090ms 0.00% 5.090ms 83.439us 0.000us 0.00% 0.000us 0.000us 61 IndexPutFirstAxisBackward 0.00% 5.027ms 0.01% 10.424ms 168.136us 0.000us 0.00% 55.835ms 900.571us 62 NativeLayerNormBackward0 0.00% 4.930ms 0.01% 13.900ms 111.196us 0.000us 0.00% 572.513ms 4.580ms 125 aten::zeros 0.00% 4.862ms 0.03% 49.701ms 35.349us 0.000us 0.00% 451.816ms 321.348us 1406 aten::slice_backward 0.00% 4.540ms 0.04% 65.412ms 53.528us 0.000us 0.00% 785.819ms 643.060us 1222 aten::unbind 0.00% 4.527ms 0.01% 10.379ms 25.754us 0.000us 0.00% 0.000us 0.000us 403 aten::div 0.00% 4.438ms 0.00% 6.103ms 35.900us 161.006ms 0.07% 161.006ms 947.097us 170 aten::max 0.00% 4.347ms 0.00% 6.791ms 55.666us 1.427ms 0.00% 1.427ms 11.697us 122 aten::index 0.00% 4.146ms 0.00% 5.494ms 85.846us 55.843ms 0.02% 55.843ms 872.544us 64 _AllGatherBackward 0.00% 3.859ms 0.00% 4.744ms 79.060us 0.000us 0.00% 0.000us 0.000us 60 aten::_index_put_impl_ 0.00% 3.679ms 0.00% 5.547ms 45.470us 107.129ms 0.04% 
107.129ms 878.105us 122 aten::split 0.00% 3.644ms 0.01% 9.829ms 43.878us 0.000us 0.00% 0.000us 0.000us 224 c10d::_allgather_base_ 0.00% 3.414ms 0.02% 23.342ms 192.906us 0.000us 0.00% 1.170s 9.670ms 121 aten::layer_norm 0.00% 3.407ms 0.07% 110.755ms 226.032us 0.000us 0.00% 2.257s 4.606ms 490 PowBackward0 0.00% 3.368ms 0.01% 17.799ms 110.555us 0.000us 0.00% 346.724ms 2.154ms 161 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 155.365s Self CUDA time total: 243.694s ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ hipDeviceSynchronize 76.48% 118.822s 76.48% 118.823s 164.803ms 0.000us 0.00% 49.838ms 69.123us 721 hipMemcpyWithStream 21.32% 33.130s 21.32% 33.131s 66.261ms 0.000us 0.00% 261.368ms 522.736us 500 hipLaunchKernel 0.20% 315.559ms 0.20% 315.563ms 13.809us 0.000us 0.00% 0.000us 0.000us 22852 aten::copy_ 0.15% 239.573ms 20.91% 32.490s 3.303ms 6.182s 2.53% 6.445s 655.141us 9837 MulBackward0 0.14% 209.876ms 17.94% 27.869s 34.620ms 0.000us 0.00% 42.366s 52.628ms 805 SeqAllToAll4D 0.13% 200.011ms 76.82% 119.353s 165.768ms 0.000us 0.00% 3.060s 4.250ms 720 FullyShardedDataParallel.forward 0.12% 187.252ms 52.70% 81.879s 1.342s 0.000us 0.00% 84.488s 1.385s 61 record_param_comms 0.11% 166.663ms 0.16% 255.058ms 123.935us 4.266s 1.75% 4.519s 2.196ms 2058 aten::mul 0.10% 157.293ms 0.12% 189.892ms 52.197us 3.153s 1.29% 3.153s 866.663us 3638 aten::empty_strided 0.08% 129.765ms 0.08% 
129.770ms 17.449us 0.000us 0.00% 0.000us 0.000us 7437 aten::cat 0.08% 121.674ms 0.10% 150.014ms 100.009us 2.210s 0.90% 2.210s 1.473ms 1500 aten::addmm 0.05% 85.313ms 0.07% 107.790ms 162.579us 40.551s 16.60% 40.551s 61.162ms 663 hipStreamWaitEvent 0.05% 79.950ms 0.05% 79.955ms 25.784us 3.340ms 0.00% 3.340ms 1.077us 3101 aten::empty 0.05% 71.997ms 0.05% 71.997ms 14.156us 0.000us 0.00% 0.000us 0.000us 5086 FullyShardedDataParallel._pre_forward 0.04% 56.500ms 0.06% 89.829ms 1.473ms 0.000us 0.00% 706.390ms 11.580ms 61 aten::sum 0.03% 53.011ms 0.05% 76.130ms 59.757us 517.591ms 0.21% 518.237ms 406.780us 1274 aten::_to_copy 0.03% 46.213ms 20.90% 32.467s 5.330ms 0.000us 0.00% 3.739s 613.915us 6091 FullyShardedDataParallel._post_backward_hook 0.03% 45.010ms 0.05% 80.120ms 1.313ms 0.000us 0.00% 1.104s 18.091ms 61 aten::add 0.03% 43.906ms 0.03% 53.850ms 43.674us 861.445ms 0.35% 861.445ms 698.658us 1233 autograd::engine::evaluate_function: ToCopyBackward0... 0.03% 39.260ms 0.09% 144.389ms 81.163us 0.000us 0.00% 1.042s 585.685us 1779 aten::view 0.02% 38.388ms 0.02% 38.388ms 4.070us 0.000us 0.00% 0.000us 0.000us 9432 aten::mm 0.02% 35.278ms 0.03% 45.228ms 66.512us 3.042s 1.25% 3.042s 4.474ms 680 aten::slice 0.02% 33.573ms 0.03% 40.874ms 5.556us 0.000us 0.00% 0.000us 0.000us 7357 aten::cos 0.02% 33.211ms 0.02% 33.255ms 5.543ms 28.160us 0.00% 111.680us 18.613us 6 _AllGather 0.02% 32.502ms 0.07% 107.938ms 899.484us 0.000us 0.00% 42.622ms 355.184us 120 FullyShardedDataParallel._post_forward 0.02% 30.979ms 0.02% 31.660ms 519.016us 0.000us 0.00% 0.000us 0.000us 61 aten::pow 0.02% 29.052ms 0.02% 38.443ms 59.144us 146.946ms 0.06% 302.035ms 464.669us 650 FlashAttnVarlenQKVPackedFunc 0.02% 27.285ms 0.03% 39.519ms 323.926us 30.134s 12.34% 30.134s 247.002ms 122 c10d::alltoall_base_ 0.02% 26.395ms 0.11% 164.790ms 228.874us 0.000us 0.00% 2.327s 3.232ms 720 FullyShardedDataParallel._pre_backward_prefetch 0.02% 26.134ms 0.03% 43.741ms 717.067us 0.000us 0.00% 496.063ms 8.132ms 61 aten::as_strided 
0.02% 25.702ms 0.02% 25.702ms 1.313us 0.000us 0.00% 0.000us 0.000us 19575 autograd::engine::evaluate_function: SiluBackward0 0.02% 23.394ms 0.02% 31.028ms 344.752us 0.000us 0.00% 915.663us 10.174us 90 aten::fill_ 0.01% 22.562ms 0.02% 36.195ms 21.937us 456.456ms 0.19% 456.456ms 276.640us 1650 aten::native_layer_norm 0.01% 21.850ms 0.03% 47.201ms 192.657us 155.978ms 0.06% 1.015s 4.145ms 245 aten::sin 0.01% 20.948ms 0.01% 20.968ms 3.495ms 23.360us 0.00% 23.360us 3.893us 6 hipExtModuleLaunchKernel 0.01% 20.838ms 0.01% 20.838ms 10.664us 0.000us 0.00% 0.000us 0.000us 1954 hipMemcpyAsync 0.01% 19.769ms 0.01% 19.769ms 9.396us 0.000us 0.00% 0.000us 0.000us 2104 autograd::engine::evaluate_function: MulBackward0 0.01% 19.507ms 17.98% 27.935s 34.702ms 0.000us 0.00% 42.829s 53.204ms 805 aten::transpose 0.01% 17.967ms 0.02% 26.790ms 6.880us 0.000us 0.00% 0.000us 0.000us 3894 aten::reshape 0.01% 17.524ms 0.03% 53.733ms 8.021us 0.000us 0.00% 119.797ms 17.883us 6699 hipExtLaunchKernel 0.01% 17.466ms 0.01% 17.466ms 16.974us 0.000us 0.00% 0.000us 0.000us 1029 aten::mean 0.01% 17.327ms 0.01% 20.101ms 62.620us 434.370ms 0.18% 434.370ms 1.353ms 321 aten::rsqrt 0.01% 17.258ms 0.01% 20.608ms 64.401us 1.691ms 0.00% 1.691ms 5.283us 320 IndexFirstAxis 0.01% 16.705ms 0.02% 26.965ms 221.024us 0.000us 0.00% 264.975ms 2.172ms 122 FlashAttnVarlenQKVPackedFuncBackward 0.01% 15.893ms 0.02% 23.420ms 377.744us 62.122s 25.43% 62.122s 1.002s 62 aten::linear 0.01% 14.672ms 0.22% 343.629ms 259.147us 0.000us 0.00% 81.521s 61.479ms 1326 autograd::engine::evaluate_function: SplitWithSizesB... 
0.01% 14.671ms 0.02% 27.818ms 172.781us 0.000us 0.00% 328.403ms 2.040ms 161 aten::silu 0.01% 14.417ms 0.01% 17.786ms 104.626us 89.927ms 0.04% 89.927ms 528.982us 170 SeqAllToAll4DBackward 0.01% 14.317ms 54.85% 85.209s 355.040ms 0.000us 0.00% 1.187s 4.946ms 240 hipEventDestroy 0.01% 13.617ms 0.01% 13.619ms 4.225us 601.642ms 0.25% 601.642ms 186.671us 3223 aten::select 0.01% 13.594ms 0.01% 16.806ms 6.916us 0.000us 0.00% 0.000us 0.000us 2430 autograd::engine::evaluate_function: ViewBackward0 0.01% 13.527ms 0.02% 27.739ms 13.381us 0.000us 0.00% 40.175ms 19.380us 2073 aten::to 0.01% 13.068ms 20.91% 32.480s 4.381ms 0.000us 0.00% 3.739s 504.432us 7413 aten::nonzero 0.01% 12.565ms 0.48% 746.536ms 6.069ms 11.480ms 0.00% 11.482ms 93.350us 123 detach 0.01% 11.751ms 0.01% 11.751ms 1.722us 0.000us 0.00% 0.000us 0.000us 6823 AddmmBackward0 0.01% 11.102ms 0.04% 62.659ms 182.679us 0.000us 0.00% 3.042s 8.870ms 343 aten::empty_like 0.01% 11.009ms 0.04% 57.833ms 19.847us 0.000us 0.00% 0.000us 0.000us 2914 aten::clone 0.01% 10.977ms 0.12% 182.491ms 111.479us 0.000us 0.00% 2.099s 1.282ms 1637 autograd::engine::evaluate_function: SliceBackward0 0.01% 10.373ms 0.06% 91.803ms 75.125us 0.000us 0.00% 861.376ms 704.891us 1222 aten::add_ 0.01% 10.373ms 0.01% 14.693ms 21.449us 343.983ms 0.14% 343.983ms 502.164us 685 aten::gelu 0.01% 10.188ms 0.01% 12.648ms 79.052us 626.079ms 0.26% 626.079ms 3.913ms 160 aten::narrow 0.01% 10.108ms 0.02% 28.549ms 8.535us 0.000us 0.00% 0.000us 0.000us 3345 aten::neg 0.01% 10.013ms 0.01% 12.798ms 35.549us 270.616ms 0.11% 270.616ms 751.710us 360 FullyShardedDataParallel._pre_backward_hook 0.01% 9.739ms 0.04% 55.552ms 910.687us 0.000us 0.00% 496.063ms 8.132ms 61 autograd::engine::evaluate_function: torch::autograd... 
0.01% 9.108ms 0.06% 90.259ms 1.480ms 0.000us 0.00% 1.104s 18.091ms 61 aten::unsqueeze 0.01% 9.039ms 0.01% 10.831ms 6.341us 0.000us 0.00% 0.000us 0.000us 1708 autograd::engine::evaluate_function: AddBackward0 0.01% 8.567ms 0.05% 74.417ms 114.664us 0.000us 0.00% 507.104ms 781.363us 649 hipMemsetAsync 0.01% 8.443ms 0.01% 8.443ms 11.272us 0.000us 0.00% 0.000us 0.000us 749 IndexFirstAxisBackward 0.01% 7.908ms 0.01% 16.826ms 271.393us 0.000us 0.00% 171.543ms 2.767ms 62 aten::detach 0.01% 7.785ms 0.01% 19.536ms 2.863us 0.000us 0.00% 0.000us 0.000us 6823 IndexPutFirstAxis 0.00% 7.539ms 0.01% 21.919ms 179.667us 0.000us 0.00% 153.056ms 1.255ms 122 aten::t 0.00% 7.358ms 0.01% 16.404ms 7.623us 0.000us 0.00% 0.000us 0.000us 2152 autograd::engine::evaluate_function: AddmmBackward0 0.00% 7.290ms 0.06% 89.202ms 260.063us 0.000us 0.00% 3.158s 9.208ms 343 c10d::allgather_ 0.00% 7.021ms 0.04% 55.747ms 464.558us 0.000us 0.00% 40.191ms 334.924us 120 aten::slice_backward 0.00% 6.876ms 0.05% 75.354ms 61.665us 0.000us 0.00% 789.423ms 646.009us 1222 aten::stack 0.00% 6.787ms 0.03% 50.590ms 89.381us 0.000us 0.00% 1.039s 1.836ms 566 aten::split_with_sizes 0.00% 6.337ms 0.00% 7.397ms 23.043us 0.000us 0.00% 0.000us 0.000us 321 aten::zero_ 0.00% 6.320ms 0.02% 35.205ms 25.039us 0.000us 0.00% 455.487ms 323.959us 1406 FullyShardedDataParallel._post_backward_prefetch 0.00% 6.063ms 0.00% 6.063ms 99.393us 0.000us 0.00% 0.000us 0.000us 61 aten::expand 0.00% 6.032ms 0.00% 7.486ms 6.900us 0.000us 0.00% 0.000us 0.000us 1085 aten::cumsum 0.00% 5.903ms 0.00% 7.174ms 58.801us 784.160us 0.00% 784.160us 6.428us 122 aten::gather 0.00% 5.596ms 0.00% 7.149ms 58.597us 264.975ms 0.11% 264.975ms 2.172ms 122 FullyShardedDataParallel._pre_forward_prefetch 0.00% 5.273ms 0.00% 5.273ms 86.446us 0.000us 0.00% 0.000us 0.000us 61 autograd::engine::evaluate_function: SeqAllToAll4DBa... 
0.00%  5.261ms  54.85%  85.215s  355.061ms  0.000us  0.00%  1.187s  4.946ms  240
aten::zeros  0.00%  5.254ms  0.04%  55.535ms  39.499us  0.000us  0.00%  455.487ms  323.959us  1406
FullyShardedDataParallel.rate_limiter  0.00%  5.170ms  0.00%  5.683ms  46.969us  0.000us  0.00%  18.279ms  151.064us  121
aten::div  0.00%  4.838ms  0.00%  6.343ms  37.311us  160.501ms  0.07%  160.501ms  944.121us  170
aten::unbind  0.00%  4.818ms  0.01%  11.030ms  27.370us  0.000us  0.00%  0.000us  0.000us  403
aten::max  0.00%  4.648ms  0.00%  7.158ms  58.674us  1.412ms  0.00%  2.271ms  18.612us  122
ToCopyBackward0  0.00%  4.533ms  0.05%  78.720ms  44.250us  0.000us  0.00%  886.101ms  498.089us  1779
IndexPutFirstAxisBackward  0.00%  4.494ms  0.01%  9.964ms  160.703us  0.000us  0.00%  55.806ms  900.102us  62
aten::index  0.00%  4.360ms  0.00%  5.558ms  86.836us  55.814ms  0.02%  55.814ms  872.096us  64
NativeLayerNormBackward0  0.00%  4.292ms  0.01%  13.316ms  106.528us  0.000us  0.00%  574.841ms  4.599ms  125
aten::_index_put_impl_  0.00%  4.091ms  0.00%  5.831ms  47.798us  107.159ms  0.04%  107.159ms  878.356us  122
aten::split  0.00%  3.939ms  0.01%  10.880ms  48.571us  0.000us  0.00%  0.000us  0.000us  224
_AllGatherBackward  0.00%  3.734ms  0.00%  4.696ms  78.267us  0.000us  0.00%  0.000us  0.000us  60
c10d::_allgather_base_  0.00%  3.705ms  0.02%  25.087ms  207.331us  0.000us  0.00%  1.155s  9.543ms  121
ViewBackward0  0.00%  3.696ms  0.01%  14.212ms  6.856us  0.000us  0.00%  40.175ms  19.380us  2073
aten::layer_norm  0.00%  3.576ms  0.07%  114.347ms  233.362us  0.000us  0.00%  2.272s  4.637ms  490
PowBackward0  0.00%  3.416ms  0.01%  18.977ms  117.872us  0.000us  0.00%  346.195ms  2.150ms  161
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 155.364s
Self CUDA time total: 244.296s

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Name  Self CPU %  Self CPU  CPU total %  CPU total  CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg  # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
hipDeviceSynchronize  76.50%  118.848s  76.50%  118.849s  164.839ms  0.000us  0.00%  1.073s  1.488ms  721
hipMemcpyWithStream  21.28%  33.062s  21.30%  33.095s  66.189ms  0.000us  0.00%  31.258ms  62.516us  500
hipLaunchKernel  0.20%  314.450ms  0.20%  314.466ms  13.761us  0.000us  0.00%  0.000us  0.000us  22852
aten::copy_  0.16%  246.641ms  20.90%  32.465s  3.300ms  6.308s  2.58%  6.339s  644.451us  9837
MulBackward0  0.14%  214.146ms  17.95%  27.887s  34.642ms  0.000us  0.00%  42.345s  52.602ms  805
SeqAllToAll4D  0.13%  204.942ms  76.85%  119.392s  165.823ms  0.000us  0.00%  4.030s  5.598ms  720
FullyShardedDataParallel.forward  0.12%  182.880ms  52.69%  81.856s  1.342s  0.000us  0.00%  85.036s  1.394s  61
record_param_comms  0.11%  170.957ms  0.17%  260.778ms  126.714us  4.451s  1.82%  4.454s  2.164ms  2058
aten::mul  0.10%  155.799ms  0.12%  187.878ms  51.643us  3.245s  1.33%  3.245s  892.060us  3638
aten::empty_strided  0.08%  127.658ms  0.08%  127.658ms  17.165us  0.000us  0.00%  0.000us  0.000us  7437
aten::cat  0.08%  120.120ms  0.10%  148.047ms  98.698us  2.212s  0.90%  2.212s  1.474ms  1500
aten::addmm  0.05%  83.299ms  0.07%  105.780ms  159.548us  40.263s  16.47%  40.263s  60.729ms  663
hipStreamWaitEvent  0.05%  81.623ms  0.05%  81.629ms  26.323us  17.887ms  0.01%  19.243ms  6.205us  3101
aten::empty  0.05%  71.329ms  0.05%  71.334ms  14.026us  0.000us  0.00%  403.519us  0.079us  5086
aten::sum  0.03%  52.849ms  0.05%  75.044ms  58.904us  514.024ms  0.21%  514.674ms  403.983us  1274
FullyShardedDataParallel._pre_forward  0.03%  51.467ms  0.05%  83.772ms  1.373ms  0.000us  0.00%  726.945ms  11.917ms  61
aten::add  0.03%  46.548ms  0.04%  56.582ms  45.889us  792.325ms  0.32%  792.325ms  642.599us  1233
aten::_to_copy  0.03%  45.945ms  20.87%  32.429s  5.324ms  0.000us  0.00%  3.610s  592.753us  6091
FullyShardedDataParallel._post_backward_hook  0.03%  45.005ms  0.05%  80.050ms  1.312ms  0.000us  0.00%  1.166s  19.109ms  61
aten::cos  0.03%  39.834ms  0.03%  39.857ms  6.643ms  26.240us  0.00%  42.080us  7.013us  6
autograd::engine::evaluate_function: ToCopyBackward0...  0.03%  39.500ms  0.09%  143.397ms  80.605us  0.000us  0.00%  1.059s  595.228us  1779
aten::view  0.03%  39.433ms  0.03%  39.433ms  4.181us  0.000us  0.00%  0.000us  0.000us  9432
hipEventDestroy  0.02%  38.217ms  0.02%  38.226ms  11.860us  1.289s  0.53%  1.289s  400.076us  3223
aten::mm  0.02%  35.310ms  0.03%  45.195ms  66.463us  2.968s  1.21%  2.968s  4.364ms  680
_AllGather  0.02%  32.927ms  0.07%  110.815ms  923.455us  0.000us  0.00%  10.141ms  84.510us  120
aten::slice  0.02%  32.657ms  0.03%  39.276ms  5.339us  0.000us  0.00%  0.000us  0.000us  7357
FullyShardedDataParallel._post_forward  0.02%  29.786ms  0.02%  30.446ms  499.122us  0.000us  0.00%  0.000us  0.000us  61
autograd::engine::evaluate_function: SiluBackward0  0.02%  28.722ms  0.02%  36.399ms  404.437us  0.000us  0.00%  914.074us  10.156us  90
c10d::alltoall_base_  0.02%  28.508ms  0.11%  170.226ms  236.426us  0.000us  0.00%  2.246s  3.119ms  720
aten::pow  0.02%  28.159ms  0.02%  37.475ms  57.654us  146.556ms  0.06%  301.798ms  464.304us  650
FullyShardedDataParallel._pre_backward_prefetch  0.02%  26.697ms  0.03%  46.072ms  755.282us  0.000us  0.00%  503.407ms  8.253ms  61
FlashAttnVarlenQKVPackedFunc  0.02%  25.450ms  0.02%  37.120ms  304.261us  30.134s  12.32%  30.134s  246.999ms  122
aten::as_strided  0.02%  24.684ms  0.02%  24.684ms  1.261us  0.000us  0.00%  0.000us  0.000us  19575
aten::sin  0.02%  24.670ms  0.02%  24.687ms  4.114ms  22.240us  0.00%  22.240us  3.707us  6
aten::fill_  0.01%  22.588ms  0.02%  35.625ms  21.591us  456.380ms  0.19%  456.384ms  276.596us  1650
aten::native_layer_norm  0.01%  21.454ms  0.03%  46.114ms  188.221us  170.462ms  0.07%  1.027s  4.192ms  245
hipExtModuleLaunchKernel  0.01%  20.712ms  0.01%  20.712ms  10.600us  0.000us  0.00%  0.000us  0.000us  1954
autograd::engine::evaluate_function: MulBackward0  0.01%  19.886ms  17.99%  27.952s  34.723ms  0.000us  0.00%  42.805s  53.174ms  805
hipMemcpyAsync  0.01%  19.417ms  0.01%  19.417ms  9.229us  0.000us  0.00%  0.000us  0.000us  2104
aten::transpose  0.01%  18.480ms  0.02%  27.080ms  6.954us  0.000us  0.00%  0.000us  0.000us  3894
aten::reshape  0.01%  17.769ms  0.04%  54.856ms  8.189us  0.000us  0.00%  119.864ms  17.893us  6699
hipExtLaunchKernel  0.01%  17.626ms  0.01%  17.626ms  17.129us  0.000us  0.00%  0.000us  0.000us  1029
aten::mean  0.01%  16.741ms  0.01%  19.222ms  59.883us  433.604ms  0.18%  433.604ms  1.351ms  321
aten::rsqrt  0.01%  16.470ms  0.01%  19.708ms  61.588us  1.692ms  0.00%  1.692ms  5.286us  320
IndexFirstAxis  0.01%  15.585ms  0.02%  25.597ms  209.811us  0.000us  0.00%  265.414ms  2.176ms  122
FlashAttnVarlenQKVPackedFuncBackward  0.01%  15.213ms  0.01%  22.743ms  366.818us  62.015s  25.36%  62.015s  1.000s  62
SeqAllToAll4DBackward  0.01%  15.114ms  54.84%  85.202s  355.008ms  0.000us  0.00%  1.328s  5.533ms  240
autograd::engine::evaluate_function: SplitWithSizesB...  0.01%  14.647ms  0.02%  28.493ms  176.973us  0.000us  0.00%  327.981ms  2.037ms  161
aten::linear  0.01%  14.379ms  0.22%  336.173ms  253.524us  0.000us  0.00%  80.945s  61.045ms  1326
aten::silu  0.01%  14.132ms  0.01%  17.498ms  102.932us  235.027ms  0.10%  235.027ms  1.383ms  170
aten::select  0.01%  13.900ms  0.01%  17.131ms  7.050us  0.000us  0.00%  0.000us  0.000us  2430
autograd::engine::evaluate_function: ViewBackward0  0.01%  13.503ms  0.02%  27.539ms  13.285us  0.000us  0.00%  40.192ms  19.388us  2073
aten::to  0.01%  12.903ms  20.88%  32.442s  4.376ms  0.000us  0.00%  3.610s  487.044us  7413
aten::nonzero  0.01%  12.354ms  0.48%  746.912ms  6.072ms  11.395ms  0.00%  11.395ms  92.641us  123
detach  0.01%  11.224ms  0.01%  11.224ms  1.645us  0.000us  0.00%  0.000us  0.000us  6823
AddmmBackward0  0.01%  11.075ms  0.04%  62.576ms  182.439us  0.000us  0.00%  2.968s  8.653ms  343
aten::clone  0.01%  10.441ms  0.12%  193.677ms  118.312us  0.000us  0.00%  2.120s  1.295ms  1637
aten::add_  0.01%  10.397ms  0.01%  14.750ms  21.532us  289.724ms  0.12%  289.724ms  422.955us  685
aten::empty_like  0.01%  10.320ms  0.04%  57.630ms  19.777us  0.000us  0.00%  0.000us  0.000us  2914
autograd::engine::evaluate_function: SliceBackward0  0.01%  10.087ms  0.06%  88.002ms  72.015us  0.000us  0.00%  862.369ms  705.703us  1222
aten::narrow  0.01%  10.057ms  0.02%  28.139ms  8.412us  0.000us  0.00%  0.000us  0.000us  3345
aten::gelu  0.01%  10.000ms  0.01%  12.458ms  77.859us  625.850ms  0.26%  625.850ms  3.912ms  160
aten::neg  0.01%  9.713ms  0.01%  12.538ms  34.829us  271.660ms  0.11%  271.660ms  754.610us  360
FullyShardedDataParallel._pre_backward_hook  0.01%  9.661ms  0.04%  57.758ms  946.855us  0.000us  0.00%  503.407ms  8.253ms  61
autograd::engine::evaluate_function: torch::autograd...  0.01%  9.074ms  0.06%  90.152ms  1.478ms  0.000us  0.00%  1.166s  19.109ms  61
aten::unsqueeze  0.01%  8.962ms  0.01%  10.642ms  6.230us  0.000us  0.00%  0.000us  0.000us  1708
autograd::engine::evaluate_function: AddBackward0  0.01%  8.716ms  0.05%  76.839ms  118.396us  0.000us  0.00%  509.162ms  784.532us  649
hipMemsetAsync  0.01%  8.340ms  0.01%  8.340ms  11.134us  0.000us  0.00%  0.000us  0.000us  749
IndexFirstAxisBackward  0.01%  7.991ms  0.01%  16.998ms  274.169us  0.000us  0.00%  171.668ms  2.769ms  62
c10d::allgather_  0.01%  7.956ms  0.04%  58.473ms  487.278us  0.000us  0.00%  8.205ms  68.377us  120
aten::detach  0.00%  7.631ms  0.01%  18.855ms  2.763us  0.000us  0.00%  0.000us  0.000us  6823
IndexPutFirstAxis  0.00%  7.496ms  0.01%  21.699ms  177.862us  0.000us  0.00%  153.063ms  1.255ms  122
aten::t  0.00%  7.279ms  0.01%  16.232ms  7.543us  0.000us  0.00%  0.000us  0.000us  2152
autograd::engine::evaluate_function: AddmmBackward0  0.00%  7.083ms  0.06%  88.584ms  258.261us  0.000us  0.00%  3.084s  8.990ms  343
aten::stack  0.00%  6.625ms  0.03%  49.467ms  87.398us  0.000us  0.00%  1.039s  1.836ms  566
aten::split_with_sizes  0.00%  6.571ms  0.00%  7.662ms  23.869us  0.000us  0.00%  0.000us  0.000us  321
aten::slice_backward  0.00%  6.329ms  0.05%  72.360ms  59.215us  0.000us  0.00%  791.258ms  647.511us  1222
aten::zero_  0.00%  6.244ms  0.02%  34.689ms  24.672us  0.000us  0.00%  455.410ms  323.904us  1406
FullyShardedDataParallel._post_backward_prefetch  0.00%  6.065ms  0.00%  6.065ms  99.430us  0.000us  0.00%  0.000us  0.000us  61
aten::expand  0.00%  6.039ms  0.00%  7.488ms  6.902us  0.000us  0.00%  0.000us  0.000us  1085
autograd::engine::evaluate_function: SeqAllToAll4DBa...  0.00%  5.789ms  54.85%  85.208s  355.032ms  0.000us  0.00%  1.328s  5.533ms  240
aten::cumsum  0.00%  5.669ms  0.00%  6.920ms  56.718us  788.477us  0.00%  788.477us  6.463us  122
aten::gather  0.00%  5.490ms  0.00%  6.986ms  57.265us  265.414ms  0.11%  265.414ms  2.176ms  122
FullyShardedDataParallel._pre_forward_prefetch  0.00%  5.039ms  0.00%  5.039ms  82.605us  0.000us  0.00%  0.000us  0.000us  61
aten::zeros  0.00%  4.918ms  0.03%  54.137ms  38.504us  0.000us  0.00%  455.410ms  323.904us  1406
aten::unbind  0.00%  4.906ms  0.01%  11.294ms  28.025us  0.000us  0.00%  0.000us  0.000us  403
FullyShardedDataParallel.rate_limiter  0.00%  4.858ms  0.00%  5.411ms  44.723us  0.000us  0.00%  17.284ms  142.841us  121
aten::div  0.00%  4.738ms  0.00%  6.164ms  36.261us  160.966ms  0.07%  160.966ms  946.859us  170
aten::max  0.00%  4.734ms  0.00%  7.154ms  58.639us  1.424ms  0.00%  1.424ms  11.673us  122
IndexPutFirstAxisBackward  0.00%  4.614ms  0.01%  10.127ms  163.346us  0.000us  0.00%  56.010ms  903.392us  62
ToCopyBackward0  0.00%  4.465ms  0.05%  77.160ms  43.373us  0.000us  0.00%  956.483ms  537.652us  1779
aten::index  0.00%  4.449ms  0.00%  5.609ms  87.645us  56.020ms  0.02%  56.020ms  875.308us  64
NativeLayerNormBackward0  0.00%  4.253ms  0.01%  13.252ms  106.019us  0.000us  0.00%  572.836ms  4.583ms  125
aten::_index_put_impl_  0.00%  4.037ms  0.00%  5.830ms  47.786us  107.167ms  0.04%  107.167ms  878.421us  122
aten::split  0.00%  3.857ms  0.01%  10.606ms  47.350us  0.000us  0.00%  0.000us  0.000us  224
ViewBackward0  0.00%  3.783ms  0.01%  14.037ms  6.771us  0.000us  0.00%  40.192ms  19.388us  2073
_AllGatherBackward  0.00%  3.736ms  0.00%  4.721ms  78.679us  0.000us  0.00%  0.000us  0.000us  60
c10d::_allgather_base_  0.00%  3.575ms  0.02%  26.284ms  217.221us  0.000us  0.00%  1.143s  9.446ms  121
aten::layer_norm  0.00%  3.524ms  0.07%  111.899ms  228.365us  0.000us  0.00%  2.304s  4.703ms  490
aten::native_layer_norm_backward  0.00%  3.440ms  0.01%  9.000ms  71.999us  144.507ms  0.06%  572.836ms  4.583ms  125
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 155.361s
Self CUDA time total: 244.528s

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Name  Self CPU %  Self CPU  CPU total %  CPU total  CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg  # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
hipDeviceSynchronize  76.99%  119.601s  76.99%  119.602s  165.883ms  0.000us  0.00%  45.993ms  63.791us  721
hipMemcpyWithStream  20.80%  32.310s  20.80%  32.310s  64.620ms  0.000us  0.00%  22.828ms  45.657us  500
hipLaunchKernel  0.21%  322.529ms  0.21%  322.537ms  14.114us  0.000us  0.00%  0.000us  0.000us  22852
aten::copy_  0.15%  234.714ms  20.43%  31.735s  3.226ms  6.272s  2.57%  6.293s  639.744us  9837
MulBackward0  0.13%  209.248ms  17.94%  27.866s  34.616ms  0.000us  0.00%  42.404s  52.676ms  805
SeqAllToAll4D  0.13%  196.729ms  77.32%  120.125s  166.840ms  0.000us  0.00%  3.768s  5.233ms  720
FullyShardedDataParallel.forward  0.12%  180.441ms  52.73%  81.910s  1.343s  0.000us  0.00%  85.113s  1.395s  61
record_param_comms  0.11%  164.679ms  0.16%  252.011ms  122.454us  4.983s  2.04%  4.985s  2.422ms  2058
aten::mul  0.10%  154.399ms  0.12%  191.145ms  52.541us  3.261s  1.33%  3.261s  896.363us  3638
aten::empty_strided  0.08%  126.687ms  0.08%  127.081ms  17.088us  6.080us  0.00%  6.080us  0.001us  7437
aten::cat  0.08%  119.410ms  0.09%  147.054ms  98.036us  2.201s  0.90%  2.201s  1.467ms  1500
aten::addmm  0.05%  83.927ms  0.07%  106.180ms  160.151us  39.682s  16.23%  39.682s  59.852ms  663
hipStreamWaitEvent  0.05%  80.243ms  0.05%  80.243ms  25.876us  9.618ms  0.00%  9.618ms  3.102us  3101
aten::empty  0.05%  70.989ms  0.05%  70.989ms  13.958us  0.000us  0.00%  0.000us  0.000us  5086
aten::mm  0.05%  69.962ms  0.05%  79.747ms  117.275us  2.969s  1.21%  2.969s  4.366ms  680
FullyShardedDataParallel._pre_forward  0.03%  53.399ms  0.05%  85.114ms  1.395ms  0.000us  0.00%  764.321ms  12.530ms  61
aten::sum  0.03%  52.685ms  0.05%  75.065ms  58.921us  515.835ms  0.21%  516.501ms  405.417us  1274
FullyShardedDataParallel._post_backward_hook  0.03%  45.911ms  0.05%  80.190ms  1.315ms  0.000us  0.00%  855.099ms  14.018ms  61
aten::_to_copy  0.03%  45.749ms  20.41%  31.713s  5.206ms  0.000us  0.00%  3.597s  590.541us  6091
aten::add  0.03%  43.220ms  0.03%  52.793ms  42.817us  797.709ms  0.33%  799.046ms  648.050us  1233
detach  0.03%  42.811ms  0.03%  42.811ms  6.275us  0.000us  0.00%  0.000us  0.000us  6823
autograd::engine::evaluate_function: ToCopyBackward0...  0.03%  39.604ms  0.09%  143.515ms  80.672us  0.000us  0.00%  1.001s  562.618us  1779
aten::view  0.03%  38.905ms  0.03%  38.905ms  4.125us  0.000us  0.00%  0.000us  0.000us  9432
aten::cos  0.02%  34.171ms  0.02%  34.194ms  5.699ms  23.039us  0.00%  23.039us  3.840us  6
aten::slice  0.02%  32.839ms  0.03%  39.812ms  5.412us  0.000us  0.00%  0.000us  0.000us  7357
_AllGather  0.02%  32.213ms  0.07%  106.654ms  888.785us  0.000us  0.00%  19.747ms  164.562us  120
FullyShardedDataParallel._post_forward  0.02%  31.092ms  0.02%  31.769ms  520.807us  0.000us  0.00%  0.000us  0.000us  61
aten::pow  0.02%  28.368ms  0.02%  37.461ms  57.633us  146.225ms  0.06%  301.490ms  463.831us  650
FlashAttnVarlenQKVPackedFunc  0.02%  26.616ms  0.02%  38.252ms  313.539us  30.116s  12.32%  30.116s  246.848ms  122
FullyShardedDataParallel._pre_backward_prefetch  0.02%  25.842ms  0.03%  46.424ms  761.047us  0.000us  0.00%  473.024ms  7.754ms  61
c10d::alltoall_base_  0.02%  25.824ms  0.11%  163.514ms  227.102us  0.000us  0.00%  3.039s  4.221ms  720
aten::as_strided  0.02%  24.851ms  0.02%  24.851ms  1.270us  0.000us  0.00%  0.000us  0.000us  19575
autograd::engine::evaluate_function: SiluBackward0  0.02%  23.437ms  0.02%  30.978ms  344.203us  0.000us  0.00%  898.389us  9.982us  90
aten::fill_  0.01%  21.999ms  0.02%  35.609ms  21.581us  449.578ms  0.18%  449.578ms  272.471us  1650
aten::native_layer_norm  0.01%  21.203ms  0.03%  46.417ms  189.457us  166.347ms  0.07%  1.021s  4.166ms  245
hipExtModuleLaunchKernel  0.01%  20.408ms  0.01%  20.408ms  10.444us  0.000us  0.00%  0.000us  0.000us  1954
hipMemcpyAsync  0.01%  19.657ms  0.01%  19.657ms  9.343us  0.000us  0.00%  0.000us  0.000us  2104
autograd::engine::evaluate_function: MulBackward0  0.01%  19.651ms  17.98%  27.930s  34.696ms  0.000us  0.00%  42.866s  53.250ms  805
aten::sin  0.01%  19.576ms  0.01%  19.591ms  3.265ms  23.040us  0.00%  23.040us  3.840us  6
aten::transpose  0.01%  18.035ms  0.02%  26.337ms  6.764us  0.000us  0.00%  0.000us  0.000us  3894
hipExtLaunchKernel  0.01%  17.164ms  0.01%  17.176ms  16.692us  0.000us  0.00%  0.000us  0.000us  1029
aten::reshape  0.01%  17.164ms  0.03%  53.924ms  8.050us  0.000us  0.00%  119.799ms  17.883us  6699
aten::mean  0.01%  16.687ms  0.01%  19.398ms  60.429us  433.494ms  0.18%  433.494ms  1.350ms  321
aten::rsqrt  0.01%  16.675ms  0.03%  49.856ms  155.800us  1.679ms  0.00%  1.679ms  5.245us  320
IndexFirstAxis  0.01%  16.052ms  0.02%  25.964ms  212.823us  0.000us  0.00%  264.903ms  2.171ms  122
FlashAttnVarlenQKVPackedFuncBackward  0.01%  15.928ms  0.02%  23.359ms  376.756us  61.956s  25.34%  61.956s  999.296ms  62
aten::neg  0.01%  15.828ms  0.01%  18.647ms  51.796us  269.620ms  0.11%  269.620ms  748.944us  360
autograd::engine::evaluate_function: SplitWithSizesB...  0.01%  14.653ms  0.02%  27.536ms  171.029us  0.000us  0.00%  323.237ms  2.008ms  161
aten::linear  0.01%  14.339ms  0.22%  342.808ms  258.528us  0.000us  0.00%  79.792s  60.175ms  1326
aten::silu  0.01%  14.036ms  0.01%  17.313ms  101.844us  171.806ms  0.07%  171.806ms  1.011ms  170
SeqAllToAll4DBackward  0.01%  13.997ms  54.86%  85.221s  355.086ms  0.000us  0.00%  1.406s  5.860ms  240
autograd::engine::evaluate_function: ViewBackward0  0.01%  13.692ms  0.02%  28.350ms  13.676us  0.000us  0.00%  40.148ms  19.367us  2073
aten::select  0.01%  13.340ms  0.01%  16.506ms  6.793us  0.000us  0.00%  0.000us  0.000us  2430
aten::to  0.01%  12.772ms  20.42%  31.725s  4.280ms  0.000us  0.00%  3.597s  485.226us  7413
aten::nonzero  0.01%  12.368ms  0.48%  742.250ms  6.035ms  10.715ms  0.00%  12.674ms  103.040us  123
AddmmBackward0  0.01%  11.214ms  0.06%  97.312ms  283.708us  0.000us  0.00%  2.969s  8.656ms  343
aten::empty_like  0.01%  10.821ms  0.04%  56.965ms  19.549us  0.000us  0.00%  6.080us  0.002us  2914
aten::clone  0.01%  10.796ms  0.12%  179.375ms  109.575us  0.000us  0.00%  2.088s  1.275ms  1637
autograd::engine::evaluate_function: SliceBackward0  0.01%  10.373ms  0.06%  90.858ms  74.352us  0.000us  0.00%  855.256ms  699.882us  1222
aten::add_  0.01%  10.245ms  0.01%  14.277ms  20.842us  289.058ms  0.12%  289.058ms  421.983us  685
aten::gelu  0.01%  9.944ms  0.01%  12.318ms  76.987us  631.958ms  0.26%  631.958ms  3.950ms  160
aten::narrow  0.01%  9.911ms  0.02%  28.113ms  8.404us  0.000us  0.00%  0.000us  0.000us  3345
FullyShardedDataParallel._pre_backward_hook  0.01%  9.668ms  0.04%  58.133ms  952.992us  0.000us  0.00%  473.024ms  7.754ms  61
autograd::engine::evaluate_function: torch::autograd...  0.01%  9.194ms  0.06%  90.344ms  1.481ms  0.000us  0.00%  855.099ms  14.018ms  61
aten::unsqueeze  0.01%  8.966ms  0.01%  10.721ms  6.277us  0.000us  0.00%  0.000us  0.000us  1708
autograd::engine::evaluate_function: AddBackward0  0.01%  8.866ms  0.05%  77.448ms  119.334us  0.000us  0.00%  504.374ms  777.155us  649
hipMemsetAsync  0.01%  8.218ms  0.01%  8.218ms  10.972us  0.000us  0.00%  0.000us  0.000us  749
aten::detach  0.00%  7.766ms  0.03%  50.577ms  7.413us  0.000us  0.00%  0.000us  0.000us  6823
IndexFirstAxisBackward  0.00%  7.747ms  0.01%  16.398ms  264.480us  0.000us  0.00%  171.752ms  2.770ms  62
autograd::engine::evaluate_function: AddmmBackward0  0.00%  7.476ms  0.08%  123.535ms  360.160us  0.000us  0.00%  3.085s  8.993ms  343
aten::t  0.00%  7.377ms  0.01%  16.425ms  7.633us  0.000us  0.00%  0.000us  0.000us  2152
IndexPutFirstAxis  0.00%  7.253ms  0.01%  21.288ms  174.489us  0.000us  0.00%  153.122ms  1.255ms  122
c10d::allgather_  0.00%  7.100ms  0.04%  54.525ms  454.375us  0.000us  0.00%  17.828ms  148.566us  120
c10d::_allgather_base_  0.00%  6.837ms  0.02%  27.482ms  227.122us  0.000us  0.00%  1.183s  9.777ms  121
hipEventDestroy  0.00%  6.823ms  0.00%  6.823ms  2.117us  128.440ms  0.05%  128.440ms  39.851us  3223
aten::slice_backward  0.00%  6.705ms  0.05%  74.674ms  61.108us  0.000us  0.00%  783.911ms  641.499us  1222
aten::stack  0.00%  6.696ms  0.03%  50.138ms  88.582us  0.000us  0.00%  1.037s  1.832ms  566
aten::zero_  0.00%  6.263ms  0.02%  34.716ms  24.692us  0.000us  0.00%  448.596ms  319.058us  1406
aten::split_with_sizes  0.00%  6.119ms  0.00%  7.154ms  22.286us  0.000us  0.00%  0.000us  0.000us  321
aten::expand  0.00%  5.917ms  0.00%  7.436ms  6.854us  0.000us  0.00%  0.000us  0.000us  1085
FullyShardedDataParallel._post_backward_prefetch  0.00%  5.797ms  0.00%  5.797ms  95.038us  0.000us  0.00%  0.000us  0.000us  61
aten::cumsum  0.00%  5.713ms  0.00%  6.962ms  57.066us  785.599us  0.00%  785.599us  6.439us  122
aten::gather  0.00%  5.453ms  0.00%  6.909ms  56.628us  264.903ms  0.11%  264.903ms  2.171ms  122
autograd::engine::evaluate_function: SeqAllToAll4DBa...  0.00%  5.267ms  54.86%  85.226s  355.108ms  0.000us  0.00%  1.406s  5.860ms  240
aten::zeros  0.00%  5.099ms  0.04%  54.833ms  38.999us  0.000us  0.00%  448.596ms  319.058us  1406
FullyShardedDataParallel._pre_forward_prefetch  0.00%  5.082ms  0.00%  5.082ms  83.307us  0.000us  0.00%  0.000us  0.000us  61
aten::div  0.00%  4.885ms  0.00%  6.428ms  37.810us  160.786ms  0.07%  160.786ms  945.802us  170
FullyShardedDataParallel.rate_limiter  0.00%  4.864ms  0.00%  5.364ms  44.329us  0.000us  0.00%  10.995ms  90.872us  121
aten::unbind  0.00%  4.742ms  0.01%  10.801ms  26.801us  0.000us  0.00%  0.000us  0.000us  403
aten::max  0.00%  4.547ms  0.00%  6.932ms  56.821us  1.411ms  0.00%  1.411ms  11.563us  122
ToCopyBackward0  0.00%  4.475ms  0.05%  77.243ms  43.420us  0.000us  0.00%  899.115ms  505.405us  1779
IndexPutFirstAxisBackward  0.00%  4.229ms  0.01%  9.525ms  153.624us  0.000us  0.00%  55.801ms  900.008us  62
aten::index  0.00%  4.197ms  0.00%  5.391ms  84.239us  55.810ms  0.02%  55.810ms  872.031us  64
NativeLayerNormBackward0  0.00%  4.181ms  0.01%  13.063ms  104.502us  0.000us  0.00%  570.093ms  4.561ms  125
aten::_index_put_impl_  0.00%  3.980ms  0.00%  5.711ms  46.807us  107.227ms  0.04%  107.227ms  878.911us  122
ViewBackward0  0.00%  3.964ms  0.01%  14.658ms  7.071us  0.000us  0.00%  40.148ms  19.367us  2073
aten::split  0.00%  3.893ms  0.01%  10.656ms  47.571us  0.000us  0.00%  0.000us  0.000us  224
_AllGatherBackward  0.00%  3.594ms  0.00%  4.558ms  75.962us  0.000us  0.00%  0.000us  0.000us  60
aten::layer_norm  0.00%  3.466ms  0.07%  112.509ms  229.611us  0.000us  0.00%  2.339s  4.774ms  490
PowBackward0  0.00%  3.315ms  0.01%  18.693ms  116.106us  0.000us  0.00%  346.297ms  2.151ms  161
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 155.352s
Self CUDA time total: 244.528s

--> saving checkpoint at step 10
--> checkpoint saved at step 10