[1,15]:[2024-09-28 17:10:34,208] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,25]:[2024-09-28 17:10:34,241] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,28]:[2024-09-28 17:10:34,257] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,31]:[2024-09-28 17:10:34,262] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,8]:[2024-09-28 17:10:34,302] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,11]:[2024-09-28 17:10:34,313] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,13]:[2024-09-28 17:10:34,320] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,30]:[2024-09-28 17:10:34,341] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,27]:[2024-09-28 17:10:34,347] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,1]:[2024-09-28 17:10:34,391] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,4]:[2024-09-28 17:10:34,397] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,7]:[2024-09-28 17:10:34,400] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,29]:[2024-09-28 17:10:34,369] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,5]:[2024-09-28 17:10:34,417] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,14]:[2024-09-28 17:10:34,407] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,2]:[2024-09-28 17:10:34,474] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,3]:[2024-09-28 17:10:34,476] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,6]:[2024-09-28 17:10:34,477] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,26]:[2024-09-28 17:10:34,480] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,9]:[2024-09-28 17:10:34,527] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,0]:[2024-09-28 17:10:34,643] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,10]:[2024-09-28 17:10:34,617] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,12]:[2024-09-28 17:10:34,659] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,24]:[2024-09-28 17:10:34,737] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,17]:[2024-09-28 17:10:34,713] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,18]:[2024-09-28 17:10:34,729] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,23]:[2024-09-28 17:10:34,757] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,21]:[2024-09-28 17:10:34,773] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,20]:[2024-09-28 17:10:34,815] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,19]:[2024-09-28 17:10:34,819] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,22]:[2024-09-28 17:10:34,857] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,16]:[2024-09-28 17:10:34,882] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [1,31]:> setting tensorboard ... [1,0]:using world size: 32, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 4, pipeline-model-parallel size: 8 [1,0]:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:QwenTokenizer [1,0]:WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication [1,0]:accumulate and all-reduce gradients in fp32 for bfloat16 data type. [1,0]:using torch.bfloat16 for parameters ... [1,0]:------------------------ arguments ------------------------ [1,0]: accumulate_allreduce_grads_in_fp32 .............. True [1,0]: adam_beta1 ...................................... 0.9 [1,0]: adam_beta2 ...................................... 0.95 [1,0]: adam_eps ........................................ 1e-08 [1,0]: add_bias_linear ................................. False [1,0]: add_position_embedding .......................... False [1,0]: add_qkv_bias .................................... False [1,0]: adlr_autoresume ................................. False [1,0]: adlr_autoresume_interval ........................ 1000 [1,0]: apply_layernorm_1p .............................. False [1,0]: apply_query_key_layer_scaling ................... False [1,0]: apply_residual_connection_post_layernorm ........ False [1,0]: apply_rope_fusion ............................... True [1,0]: async_save ...................................... None [1,0]: async_tensor_model_parallel_allreduce ........... False [1,0]: attention_dropout ............................... 0.0 [1,0]: attention_softmax_in_fp32 ....................... False [1,0]: auto_detect_ckpt_format ......................... False [1,0]: barrier_with_L1_time ............................ True [1,0]: bert_binary_head ................................ True [1,0]: bert_embedder_type .............................. megatron [1,0]: bert_load ....................................... None [1,0]: bf16 ............................................ True [1,0]: bias_dropout_fusion ............................. True [1,0]: bias_gelu_fusion ................................ False [1,0]: bias_swiglu_fusion .............................. True [1,0]: biencoder_projection_dim ........................ 0 [1,0]: biencoder_shared_query_context_model ............ False [1,0]: block_data_path ................................. None [1,0]: calculate_per_token_loss ........................ False [1,0]: check_for_nan_in_loss_and_grad .................. True [1,0]: check_weight_hash_across_dp_replicas_interval ... None [1,0]: ckpt_assume_constant_structure .................. False [1,0]: ckpt_fully_parallel_load ........................ False [1,0]: ckpt_fully_parallel_save ........................ False [1,0]: ckpt_step ....................................... None [1,0]: classes_fraction ................................ 1.0 [1,0]: clip_grad ....................................... 1.0 [1,0]: clone_scatter_output_in_embedding ............... True [1,0]: consumed_train_samples .......................... 0 [1,0]: consumed_valid_samples .......................... 0 [1,0]: context_parallel_size ........................... 1 [1,0]: create_attention_mask_in_dataloader ............. True [1,0]: cross_entropy_loss_fusion ....................... False [1,0]: data_cache_path ................................. None [1,0]: data_parallel_random_init ....................... False [1,0]: data_parallel_size .............................. 1 [1,0]: data_path ....................................... ['./qwen_token/my-qwen_text_document'] [1,0]: data_per_class_fraction ......................... 1.0 [1,0]: data_sharding ................................... True [1,0]: dataloader_type ................................. single [1,0]: ddp_average_in_collective ....................... False [1,0]: ddp_bucket_size ................................. None [1,0]: decoder_num_layers .............................. None [1,0]: decoder_seq_length .............................. None [1,0]: decoupled_lr .................................... None [1,0]: decoupled_min_lr ................................ None [1,0]: delay_grad_reduce ............................... True [1,0]: delay_param_gather .............................. False [1,0]: deprecated_use_mcore_models ..................... False [1,0]: deterministic_mode .............................. False [1,0]: dino_bottleneck_size ............................ 256 [1,0]: dino_freeze_last_layer .......................... 1[1,0]: [1,0]: dino_head_hidden_size ........................... 2048 [1,0]: dino_local_crops_number ......................... 10 [1,0]: dino_local_img_size ............................. 96 [1,0]: dino_norm_last_layer ............................ False [1,0]: dino_teacher_temp ............................... 0.07 [1,0]: dino_warmup_teacher_temp ........................ 0.04[1,0]: [1,0]: dino_warmup_teacher_temp_epochs ................. 30 [1,0]: disable_straggler_on_startup .................... False [1,0]: dist_ckpt_format ................................ torch_dist [1,0]: dist_url ........................................ tcp://node116:34566 [1,0]: distribute_saved_activations .................... False [1,0]: distributed_backend ............................. nccl [1,0]: distributed_timeout_minutes ..................... 10 [1,0]: embedding_path .................................. None [1,0]: empty_unused_memory_level ....................... 0 [1,0]: enable_one_logger ............................... False [1,0]: encoder_num_layers .............................. 80 [1,0]: encoder_seq_length .............................. 2048 [1,0]: end_weight_decay ................................ 0.1 [1,0]: eod_mask_loss ................................... False [1,0]: eval_interval ................................... 1000 [1,0]: eval_iters ...................................... 10 [1,0]: evidence_data_path .............................. None [1,0]: exit_duration_in_mins ........................... None [1,0]: exit_interval ................................... None [1,0]: exit_on_missing_checkpoint ...................... False [1,0]: exit_signal_handler ............................. False [1,0]: expert_model_parallel_size ...................... 1 [1,0]: ffn_hidden_size ................................. 29568 [1,0]: finetune ........................................ False [1,0]: fp16 ............................................ False [1,0]: fp16_lm_cross_entropy ........................... False [1,0]: fp32_residual_connection ........................ False [1,0]: fp8 ............................................. None [1,0]: fp8_amax_compute_algo ........................... most_recent [1,0]: fp8_amax_history_len ............................ 1 [1,0]: fp8_interval .................................... 1 [1,0]: fp8_margin ...................................... 0 [1,0]: fp8_wgrad ....................................... True [1,0]: global_batch_size ............................... 64 [1,0]: gradient_accumulation_fusion .................... False [1,0]: group_query_attention ........................... True [1,0]: head_lr_mult .................................... 1.0 [1,0]: hidden_dropout .................................. 0.0 [1,0]: hidden_size ..................................... 8192 [1,0]: hybrid_attention_ratio .......................... 0.0 [1,0]: hybrid_mlp_ratio ................................ 0.0 [1,0]: hybrid_override_pattern ......................... None [1,0]: hysteresis ...................................... 2 [1,0]: ict_head_size ................................... None [1,0]: ict_load ........................................ None [1,0]: img_h ........................................... 224 [1,0]: img_w ........................................... 224 [1,0]: indexer_batch_size .............................. 128 [1,0]: indexer_log_interval ............................ 1000 [1,0]: inference_batch_times_seqlen_threshold .......... 512 [1,0]: init_method_std ................................. 0.006 [1,0]: init_method_xavier_uniform ...................... False [1,0]: initial_loss_scale .............................. 4294967296 [1,0]: iter_per_epoch .................................. 1250 [1,0]: kv_channels ..................................... 128 [1,0]: lazy_mpu_init ................................... None [1,0]: load ............................................ ./tmp/qwen1_5_72b/ckpt [1,0]: local_rank ...................................... None [1,0]: log_batch_size_to_tensorboard ................... False [1,0]: log_interval .................................... 1 [1,0]: log_learning_rate_to_tensorboard ................ True [1,0]: log_loss_scale_to_tensorboard ................... True [1,0]: log_memory_to_tensorboard ....................... False [1,0]: log_num_zeros_in_grad ........................... False [1,0]: log_params_norm ................................. False [1,0]: log_progress .................................... False [1,0]: log_straggler ................................... False [1,0]: log_throughput .................................. True [1,0]: log_timers_to_tensorboard ....................... False [1,0]: log_validation_ppl_to_tensorboard ............... False [1,0]: log_world_size_to_tensorboard ................... False [1,0]: logging_level ................................... None [1,0]: loss_scale ...................................... None [1,0]: loss_scale_window ............................... 1000 [1,0]: lr .............................................. 3e-05 [1,0]: lr_decay_iters .................................. None [1,0]: lr_decay_samples ................................ None [1,0]: lr_decay_style .................................. cosine [1,0]: lr_warmup_fraction .............................. None [1,0]: lr_warmup_init .................................. 0.0 [1,0]: lr_warmup_iters ................................. 1 [1,0]: lr_warmup_samples ............................... 0 [1,0]: lr_wsd_decay_iters .............................. None [1,0]: lr_wsd_decay_samples ............................ None [1,0]: lr_wsd_decay_style .............................. exponential [1,0]: make_vocab_size_divisible_by .................... 128 [1,0]: manual_gc ....................................... False [1,0]: manual_gc_eval .................................. True [1,0]: manual_gc_interval .............................. 0 [1,0]: mask_factor ..................................... 1.0 [1,0]: mask_prob ....................................... 0.15 [1,0]: mask_type ....................................... random [1,0]: masked_softmax_fusion ........................... True [1,0]: max_position_embeddings ......................... 32768 [1,0]: max_tokens_to_oom ............................... 12000 [1,0]: merge_file ...................................... ./qwen_token/merges.txt [1,0]: micro_batch_size ................................ 1 [1,0]: min_loss_scale .................................. 1.0 [1,0]: min_lr .......................................... 3e-06 [1,0]: mmap_bin_files .................................. True [1,0]: mock_data ....................................... False [1,0]: moe_aux_loss_coeff .............................. 0.0 [1,0]: moe_expert_capacity_factor ...................... None [1,0]: moe_extended_tp ................................. False [1,0]: moe_grouped_gemm ................................ False [1,0]: moe_input_jitter_eps ............................ None [1,0]: moe_layer_recompute ............................. False [1,0]: moe_p[1,0]:ad_expert_input_to_capacity ................ False [1,0]: moe_per_layer_logging ........................... False [1,0]: moe_router_load_balancing_type .................. aux_loss [1,0]: moe_router_topk ................................. 2 [1,0]: moe_token_dispatcher_type ....................... allgather [1,0]: moe_token_drop_policy ........................... probs [1,0]: moe_z_loss_coeff ................................ None [1,0]: nccl_communicator_config_path ................... None [1,0]: no_load_optim ................................... None [1,0]: no_load_rng ..................................... None [1,0]: no_persist_layer_norm ........................... False [1,0]: no_save_optim ................................... None [1,0]: no_save_rng ..................................... None [1,0]: norm_epsilon .................................... 1e-05 [1,0]: normalization ................................... RMSNorm [1,0]: num_attention_heads ............................. 64 [1,0]: num_channels .................................... 3 [1,0]: num_classes ..................................... 1000 [1,0]: num_dataset_builder_threads ..................... 1 [1,0]: num_experts ..................................... None [1,0]: num_layers ...................................... 80 [1,0]: num_layers_per_virtual_pipeline_stage ........... None [1,0]: num_query_groups ................................ 8 [1,0]: num_workers ..................................... 2 [1,0]: one_logger_entity ............................... hwinf_dcm [1,0]: one_logger_project .............................. e2e-tracking [1,0]: one_logger_run_name ............................. None [1,0]: onnx_safe ....................................... None [1,0]: openai_gelu ..................................... False [1,0]: optimizer ....................................... adam [1,0]: output_bert_embeddings .......................... False [1,0]: overlap_grad_reduce ............................. False [1,0]: overlap_p2p_comm ................................ False [1,0]: overlap_param_gather ............................ False [1,0]: override_opt_param_scheduler .................... False [1,0]: params_dtype .................................... torch.bfloat16 [1,0]: patch_dim ....................................... 16 [1,0]: perform_initialization .......................... True [1,0]: pipeline_model_parallel_size .................... 8 [1,0]: pipeline_model_parallel_split_rank .............. None [1,0]: position_embedding_type ......................... rope [1,0]: pretrained_checkpoint ........................... None [1,0]: profile ......................................... False [1,0]: profile_ranks ................................... [0] [1,0]: profile_step_end ................................ 12 [1,0]: profile_step_start .............................. 10 [1,0]: qk_layernorm .................................... False [1,0]: query_in_block_prob ............................. 0.1 [1,0]: rampup_batch_size ............................... None [1,0]: rank ............................................ 0 [1,0]: recompute_granularity ........................... None [1,0]: recompute_method ................................ None [1,0]: recompute_num_layers ............................ None [1,0]: reset_attention_mask ............................ False [1,0]: reset_position_ids .............................. False [1,0]: retriever_report_topk_accuracies ................ [] [1,0]: retriever_score_scaling ......................... False [1,0]: retriever_seq_length ............................ 256 [1,0]: retro_add_retriever ............................. False [1,0]: retro_attention_gate ............................ 1 [1,0]: retro_cyclic_train_iters ........................ None [1,0]: retro_encoder_attention_dropout ................. 0.1 [1,0]: retro_encoder_hidden_dropout .................... 0.1 [1,0]: retro_encoder_layers ............................ 2 [1,0]: retro_num_neighbors ............................. 2 [1,0]: retro_num_retrieved_chunks ...................... 2 [1,0]: retro_project_dir ............................... None [1,0]: retro_verify_neighbor_count ..................... True [1,0]: rotary_interleaved .............................. False [1,0]: rotary_percent .................................. 1.0 [1,0]: rotary_seq_len_interpolation_factor ............. None [1,0]: sample_rate ..................................... 1.0 [1,0]: save ............................................ ./tmp/qwen1_5_72b/ckpt [1,0]: save_interval ................................... 10000 [1,0]: scatter_gather_tensors_in_pipeline .............. True[1,0]: [1,0]: seed ............................................ 1234 [1,0]: seq_length ...................................... 2048 [1,0]: sequence_parallel ............................... True [1,0]: sgd_momentum .................................... 0.9 [1,0]: short_seq_prob .................................. 0.1 [1,0]: skip_train ...................................... False [1,0]: spec ............................................ None [1,0]: split ........................................... 949,50,1 [1,0]: squared_relu .................................... False [1,0]: standalone_embedding_stage ...................... False [1,0]: start_weight_decay .............................. 0.1 [1,0]: straggler_ctrlr_port ............................ 65535 [1,0]: straggler_minmax_count .......................... 1 [1,0]: swiglu .......................................... True [1,0]: swin_backbone_type .............................. tiny [1,0]: tensor_model_parallel_size ...................... 4 [1,0]: tensorboard_dir ................................. ./tmp/qwen1_5_72b/tblog [1,0]: tensorboard_log_interval ........................ 1 [1,0]: tensorboard_queue_size .......................... 1000 [1,0]: test_data_path .................................. None [1,0]: test_mode ....................................... False [1,0]: timing_log_level ................................ 0 [1,0]: timing_log_option ............................... minmax [1,0]: titles_data_path ................................ None [1,0]: tokenizer_model ................................. None [1,0]: tokenizer_type .................................. QwenTokenizer [1,0]: tp_comm_bulk_dgrad .............................. True [1,0]: tp_comm_bulk_wgrad .............................. True [1,0]: tp_comm_overlap ................................. False [1,0]: tp_comm_overlap_ag .............................. True [1,0]: tp_comm_overlap_cfg ............................. None [1,0]: tp_comm_overlap_rs .............................. True [1,0]: tp_comm_overlap_rs_dgrad ........................ False [1,0]: tp_comm_split_ag ................................ True [1,0]: tp_comm_split_rs ................................ True [1,0]: train_data_path ................................. None [1,0]: train_iters ..................................... 100 [1,0]: train_samples ................................... None [1,0]: transformer_impl ................................ local [1,0]: transformer_pipeline_model_parallel_size ........ 8 [1,0]: untie_embeddings_and_output_weights ............. True [1,0]: use_checkpoint_args ............................. False [1,0]: use_checkpoint_opt_param_scheduler .............. False [1,0]: use_cpu_initialization .......................... None [1,0]: use_dist_ckpt ................................... False [1,0]: use_distributed_optimizer ....................... True [1,0]: use_fast_cross_entropy_loss ..................... False [1,0]: use_fast_rms_layernorm .......................... False [1,0]: use_flash_attn .................................. True [1,0]: use_flash_attn_triton ........................... False [1,0]: use_flash_attn_v1 ............................... False [1,0]: use_flash_attn_v2 ............................... True [1,0]: use_legacy_models ............................... True [1,0]: use_one_sent_docs ............................... False [1,0]: use_ring_exchange_p2p ........................... False [1,0]: use_rotary_position_embeddings .................. True [1,0]: use_tp_pp_dp_mapping ............................ False [1,0]: valid_data_path ................................. None [1,0]: variable_seq_lengths ............................ False [1,0]: virtual_pipeline_model_parallel_size ............ None [1,0]: vision_backbone_type ............................ vit [1,0]: vision_pretraining .............................. False [1,0]: vision_pretraining_type ......................... classify [1,0]: vocab_extra_ids ................................. 0 [1,0]: vocab_file ...................................... ./qwen_token/vocab.json [1,0]: vocab_size ...................................... None [1,0]: wandb_exp_name .................................. [1,0]: wandb_project ................................... [1,0]: wandb_save_dir .................................. [1,0]: weight_decay .................................... 0.1 [1,0]: weight_decay_incr_style ......................... constant [1,0]: world_size ...................................... 32 [1,0]: yaml_cfg ........................................ None [1,0]:-------------------- end of arguments --------------------- [1,0]:setting number of micro-batches to constant 64 [1,0]:> building QwenTokenizer tokenizer ... [1,0]: > padded vocab (size: 151643) with 421 dummy tokens (new size: 152064) [1,0]:> initializing torch distributed ... [1,12]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,12]:I0928 17:10:36.284003 24841 ProcessGroupNCCL.cpp:686] [Rank 12] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843548017696 [1,12]:I0928 17:10:36.285332 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843534134176 [1,12]:I0928 17:10:36.291662 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843534124432 [1,12]:I0928 17:10:36.293839 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843534741728 [1,12]:I0928 17:10:36.294546 24841 ProcessGroupNCCL.cpp:686] [Rank 12] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843533522800 [1,12]:I0928 17:10:36.294782 24841 ProcessGroupNCCL.cpp:686] [Rank 12] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843545869136 [1,12]:I0928 17:10:36.295115 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843551010032 [1,12]:I0928 17:10:36.295454 24841 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843551012256 [1,12]:I0928 17:10:36.296087 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843551014480 [1,12]:I0928 17:10:36.296543 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843543298800 [1,12]:I0928 17:10:36.296970 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843543301024 [1,12]:I0928 17:10:36.297653 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843555296880 [1,12]:I0928 17:10:36.298990 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843555299104 [1,24]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,24]:I0928 17:10:36.375571 8434 ProcessGroupNCCL.cpp:686] [Rank 24] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797634713232 [1,24]:I0928 17:10:36.377507 8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797622326816 [1,24]:I0928 17:10:36.383812 8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797622545344 [1,24]:I0928 17:10:36.385749 8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797624659104 [1,24]:I0928 17:10:36.386175 8434 ProcessGroupNCCL.cpp:686] [Rank 24] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797622265424 [1,24]:I0928 17:10:36.386415 8434 ProcessGroupNCCL.cpp:686] [Rank 24] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797622275168 [1,24]:I0928 17:10:36.386817 8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797621658848 [1,24]:I0928 17:10:36.387077 8434 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797632374736 [1,24]:I0928 17:10:36.387781 8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797643245072 [1,24]:I0928 17:10:36.388206 8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797632397408 [1,24]:I0928 17:10:36.388716 8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797633717248 [1,24]:I0928 17:10:36.389621 8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797633719472 [1,24]:I0928 17:10:36.391235 8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797633721696 [1,17]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,17]:I0928 17:10:36.355304 6901 ProcessGroupNCCL.cpp:686] [Rank 17] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494586130816 [1,24]:I0928 17:10:36.393000 8434 ProcessGroupNCCL.cpp:2780] Rank 24 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,17]:I0928 17:10:36.356807 6901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494576519840 [1,17]:I0928 17:10:36.363021 6901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494577123520 [1,17]:I0928 17:10:36.365171 6901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494577113776 [1,17]:I0928 17:10:36.365768 6901 ProcessGroupNCCL.cpp:686] [Rank 17] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494586057360 [1,17]:I0928 17:10:36.366010 6901 ProcessGroupNCCL.cpp:686] [Rank 17] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494586060704 [1,17]:I0928 17:10:36.366370 6901 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494594119296 [1,17]:I0928 17:10:36.366777 6901 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494594121520 [1,17]:I0928 17:10:36.367363 6901 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494590678000 [1,17]:I0928 17:10:36.367805 6901 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494590682576 [1,17]:I0928 17:10:36.368239 6901 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494588830336 [1,17]:I0928 17:10:36.369048 6901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494588833360 [1,17]:I0928 17:10:36.370493 6901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494595266384 [1,18]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,18]:I0928 17:10:36.379254 6907 ProcessGroupNCCL.cpp:686] [Rank 18] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032103418048 [1,18]:I0928 17:10:36.380781 6907 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032109756784 [1,18]:I0928 17:10:36.386423 6907 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032089922368 [1,18]:I0928 17:10:36.388432 6907 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032090504128 [1,18]:I0928 17:10:36.389006 6907 ProcessGroupNCCL.cpp:686] [Rank 18] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032090494384 [1,18]:I0928 17:10:36.389246 6907 ProcessGroupNCCL.cpp:686] [Rank 18] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032104369376 [1,18]:I0928 17:10:36.389598 6907 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032111636832 [1,18]:I0928 17:10:36.390074 6907 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032109772272 [1,18]:I0928 17:10:36.390568 6907 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032109774496 [1,18]:I0928 17:10:36.391021 6907 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032102357712 [1,18]:I0928 17:10:36.391466 6907 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032099218592 [1,18]:I0928 17:10:36.392282 6907 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032099220768 [1,18]:I0928 17:10:36.393748 6907 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032099222944 [1,22]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,22]:I0928 17:10:36.398507 6921 ProcessGroupNCCL.cpp:686] [Rank 22] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748338852304 [1,22]:I0928 17:10:36.400197 6921 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748372540256 [1,22]:I0928 17:10:36.406311 6921 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748381780240 [1,22]:I0928 17:10:36.408246 6921 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748372765792 [1,22]:I0928 17:10:36.408762 6921 ProcessGroupNCCL.cpp:686] [Rank 22] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748372835696 [1,22]:I0928 17:10:36.408998 6921 ProcessGroupNCCL.cpp:686] [Rank 22] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748371169840 [1,22]:I0928 17:10:36.409375 6921 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748384710064 [1,22]:I0928 17:10:36.409832 6921 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748380275088 [1,22]:I0928 17:10:36.410344 6921 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748380277264 [1,22]:I0928 17:10:36.410764 6921 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748391377760 [1,22]:I0928 17:10:36.411202 6921 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748391379936 [1,22]:I0928 17:10:36.412055 6921 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748383594128 [1,22]:I0928 17:10:36.413607 6921 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748383596352 [1,21]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,21]:I0928 17:10:36.423324 6919 ProcessGroupNCCL.cpp:686] [Rank 21] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389532071376 [1,21]:I0928 17:10:36.425096 6919 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389531242864 [1,21]:I0928 17:10:36.430836 6919 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389539382400 [1,21]:I0928 17:10:36.432796 6919 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389534722736 [1,21]:I0928 17:10:36.433297 6919 ProcessGroupNCCL.cpp:686] [Rank 21] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389536419424 [1,21]:I0928 17:10:36.433529 6919 ProcessGroupNCCL.cpp:686] [Rank 21] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389518170400 [1,21]:I0928 17:10:36.433919 6919 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389531713216 [1,21]:I0928 17:10:36.434312 6919 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389535046480 [1,21]:I0928 17:10:36.434938 6919 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389535048704 [1,21]:I0928 17:10:36.435379 6919 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389530190496 [1,21]:I0928 17:10:36.435827 6919 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389530192720 [1,21]:I0928 17:10:36.436689 6919 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389539349392 [1,21]:I0928 17:10:36.438230 6919 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389539351616 [1,23]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,23]:I0928 17:10:36.463274 6922 ProcessGroupNCCL.cpp:686] [Rank 23] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236766919504 [1,23]:I0928 17:10:36.465096 6922 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236776792976 [1,23]:I0928 17:10:36.470860 6922 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236776312880 [1,23]:I0928 17:10:36.472807 6922 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236775420416 [1,23]:I0928 17:10:36.473253 6922 ProcessGroupNCCL.cpp:686] [Rank 23] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236766348256 [1,23]:I0928 17:10:36.473493 6922 ProcessGroupNCCL.cpp:686] [Rank 23] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236771144880 [1,23]:I0928 17:10:36.473889 6922 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236766991904 [1,23]:I0928 17:10:36.474437 6922 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236766994080 [1,23]:I0928 17:10:36.474905 6922 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236767015632 [1,23]:I0928 17:10:36.475334 6922 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236767018896 [1,23]:I0928 17:10:36.475767 6922 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236767021120 [1,23]:I0928 17:10:36.476676 6922 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236767022272 [1,23]:I0928 17:10:36.478250 6922 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236771216624 [1,16]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,16]:I0928 17:10:36.488837 6894 ProcessGroupNCCL.cpp:686] [Rank 16] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531083697728 [1,16]:I0928 17:10:36.490300 6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531092418800 [1,16]:I0928 17:10:36.495915 6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531085015024 [1,16]:I0928 17:10:36.497959 6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531094577728 [1,16]:I0928 17:10:36.498570 6894 ProcessGroupNCCL.cpp:686] [Rank 16] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531094579904 [1,16]:I0928 17:10:36.498822 6894 ProcessGroupNCCL.cpp:686] [Rank 16] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531094583200 [1,16]:I0928 17:10:36.499187 6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096685760 [1,16]:I0928 17:10:36.499490 6894 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096687984 [1,16]:I0928 17:10:36.500156 6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096689136 [1,16]:I0928 17:10:36.500597 6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096946416 [1,16]:I0928 17:10:36.501024 6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096731856 [1,16]:I0928 17:10:36.501770 6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096734080 [1,16]:I0928 17:10:36.503214 6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096736304 [1,16]:I0928 17:10:36.505196 6894 ProcessGroupNCCL.cpp:2780] Rank 16 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,19]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,19]:I0928 17:10:36.518448 6914 ProcessGroupNCCL.cpp:686] [Rank 19] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300337116736 [1,19]:I0928 17:10:36.520079 6914 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300325515072 [1,19]:I0928 17:10:36.525683 6914 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300326086992 [1,19]:I0928 17:10:36.527709 6914 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300334892960 [1,19]:I0928 17:10:36.528247 6914 ProcessGroupNCCL.cpp:686] [Rank 19] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300334895088 [1,19]:I0928 17:10:36.528477 6914 ProcessGroupNCCL.cpp:686] [Rank 19] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300338145840 [1,19]:I0928 17:10:36.528829 6914 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300338149088 [1,19]:I0928 17:10:36.529383 6914 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300339687936 [1,19]:I0928 17:10:36.529805 6914 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300339690160 [1,19]:I0928 17:10:36.530236 6914 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300343115088 [1,19]:I0928 17:10:36.530668 6914 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300343117312 [1,19]:I0928 17:10:36.531503 6914 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300343119536 [1,19]:I0928 17:10:36.532995 6914 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300337097584 [1,20]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,20]:I0928 17:10:36.554998 6917 ProcessGroupNCCL.cpp:686] [Rank 20] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072592032544 [1,20]:I0928 17:10:36.556634 6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072592411888 [1,20]:I0928 17:10:36.562213 6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072578212336 [1,20]:I0928 17:10:36.564203 6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072578201872 [1,20]:I0928 17:10:36.564730 6917 ProcessGroupNCCL.cpp:686] [Rank 20] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072578587728 [1,20]:I0928 17:10:36.564967 6917 ProcessGroupNCCL.cpp:686] [Rank 20] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072578176608 [1,20]:I0928 17:10:36.565351 6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072577999200 [1,20]:I0928 17:10:36.565639 6917 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072587434096 [1,20]:I0928 17:10:36.566325 6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072589537184 [1,20]:I0928 17:10:36.566758 6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072599580624 [1,20]:I0928 17:10:36.567198 6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072599582800 [1,20]:I0928 17:10:36.568019 6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072599585024 [1,20]:I0928 17:10:36.569561 6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072599819744 [1,15]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,15]:I0928 17:10:36.737020 24847 ProcessGroupNCCL.cpp:686] [Rank 15] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109461299984 [1,15]:I0928 17:10:36.738432 24847 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109461389152 [1,15]:I0928 17:10:36.744571 24847 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109463704672 [1,15]:I0928 17:10:36.746693 24847 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109461309728 [1,15]:I0928 17:10:36.747330 24847 ProcessGroupNCCL.cpp:686] [Rank 15] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109461253648 [1,15]:I0928 17:10:36.747577 24847 ProcessGroupNCCL.cpp:686] [Rank 15] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109460698656 [1,15]:I0928 17:10:36.747915 24847 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109473134976 [1,15]:I0928 17:10:36.748497 24847 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109472801488 [1,15]:I0928 17:10:36.748901 24847 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109472803712 [1,15]:I0928 17:10:36.749331 24847 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109482399584 [1,15]:I0928 17:10:36.749781 24847 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109482401808 [1,15]:I0928 17:10:36.750537 24847 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109482404032 [1,15]:I0928 17:10:36.751936 24847 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109475019632 [1,25]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,25]:I0928 17:10:36.752588 8440 ProcessGroupNCCL.cpp:686] [Rank 25] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423234098688 [1,25]:I0928 17:10:36.754529 8440 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423220635968 [1,25]:I0928 17:10:36.760869 8440 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423220626224 [1,25]:I0928 17:10:36.762804 8440 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423234528464 [1,25]:I0928 17:10:36.763206 8440 ProcessGroupNCCL.cpp:686] [Rank 25] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423220087904 [1,25]:I0928 17:10:36.763445 8440 ProcessGroupNCCL.cpp:686] [Rank 25] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423220118176 [1,25]:I0928 17:10:36.763859 8440 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423230722784 [1,25]:I0928 17:10:36.764223 8440 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423233605248 [1,25]:I0928 17:10:36.764853 8440 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423233607424 [1,25]:I0928 17:10:36.765290 8440 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423231555728 [1,25]:I0928 17:10:36.765724 8440 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423231557952 [1,25]:I0928 17:10:36.766669 8440 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423231560176 [1,25]:I0928 17:10:36.768301 8440 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423234156288 [1,8]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,8]:I0928 17:10:36.805460 24819 ProcessGroupNCCL.cpp:686] [Rank 8] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862987275536 [1,8]:I0928 17:10:36.806476 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862987200032 [1,8]:I0928 17:10:36.811786 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862981200896 [1,8]:I0928 17:10:36.814039 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862631544896 [1,8]:I0928 17:10:36.814842 24819 ProcessGroupNCCL.cpp:686] [Rank 8] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862919336832 [1,8]:I0928 17:10:36.815078 24819 ProcessGroupNCCL.cpp:686] [Rank 8] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862981465536 [1,8]:I0928 17:10:36.815388 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862967433392 [1,8]:I0928 17:10:36.815769 24819 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862981012400 [1,8]:I0928 17:10:36.816380 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862981957888 [1,8]:I0928 17:10:36.816820 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862981961824 [1,31]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,31]:I0928 17:10:36.802973 8462 ProcessGroupNCCL.cpp:686] [Rank 31] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247866576640 [1,8]:I0928 17:10:36.817243 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862978075328 [1,8]:I0928 17:10:36.817868 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862978077552 [1,8]:I0928 17:10:36.819103 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862978079776 [1,31]:I0928 17:10:36.805253 8462 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247866566896 [1,8]:I0928 17:10:36.821259 24819 ProcessGroupNCCL.cpp:2780] Rank 8 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,31]:I0928 17:10:36.810376 8462 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247866021808 [1,31]:I0928 17:10:36.812062 8462 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247865999184 [1,31]:I0928 17:10:36.812310 8462 ProcessGroupNCCL.cpp:686] [Rank 31] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247877783136 [1,31]:I0928 17:10:36.812574 8462 ProcessGroupNCCL.cpp:686] [Rank 31] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247877786480 [1,31]:I0928 17:10:36.813009 8462 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247877789776 [1,31]:I0928 17:10:36.813500 8462 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247877792000 [1,31]:I0928 17:10:36.813617 8462 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879112736 [1,31]:I0928 17:10:36.814082 8462 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879117760 [1,31]:I0928 17:10:36.814509 8462 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879119840 [1,31]:I0928 17:10:36.814936 8462 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879121920 [1,31]:I0928 17:10:36.815979 8462 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879124096 [1,31]:I0928 17:10:36.817761 8462 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879126272 [1,12]:I0928 17:10:36.874425 24841 ProcessGroupNCCL.cpp:2780] Rank 12 using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,28]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,28]:I0928 17:10:36.866642 8456 ProcessGroupNCCL.cpp:686] [Rank 28] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229795432624 [1,28]:I0928 17:10:36.868739 8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229792988208 [1,1]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,1]:I0928 17:10:36.914921 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197189493312 [1,1]:I0928 17:10:36.915673 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197200118480 [1,28]:I0928 17:10:36.873793 8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229733066432 [1,28]:I0928 17:10:36.875526 8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229793672560 [1,28]:I0928 17:10:36.875842 8456 ProcessGroupNCCL.cpp:686] [Rank 28] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229799201184 [1,28]:I0928 17:10:36.876076 8456 ProcessGroupNCCL.cpp:686] [Rank 28] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229781891456 [1,28]:I0928 17:10:36.876530 8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229781226064 [1,28]:I0928 17:10:36.876778 8456 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229793551760 [1,28]:I0928 17:10:36.876883 8456 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229793553984 [1,28]:I0928 17:10:36.877589 8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229793959648 [1,28]:I0928 17:10:36.878011 8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229793961872 [1,1]:I0928 17:10:36.922307 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197180903056 [1,28]:I0928 17:10:36.878453 8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229795357264 [1,28]:I0928 17:10:36.879438 8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229795359488 [1,7]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,7]:I0928 17:10:36.923707 21526 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861098353136 [1,7]:I0928 17:10:36.924728 21526 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861096737312 [1,1]:I0928 17:10:36.924889 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197180893312 [1,28]:I0928 17:10:36.881153 8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229795361712 [1,1]:I0928 17:10:36.925887 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197189868976 [1,1]:I0928 17:10:36.926132 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197180295600 [1,5]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,5]:I0928 17:10:36.926327 21524 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819155932480 [1,1]:I0928 17:10:36.926362 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197193874336 [1,1]:I0928 17:10:36.926870 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197193876560 [1,1]:I0928 17:10:36.926990 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197201948928 [1,1]:I0928 17:10:36.927093 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197201953200 [1,5]:I0928 17:10:36.927234 21524 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819177567008 [1,1]:I0928 17:10:36.927510 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197194033664 [1,1]:I0928 17:10:36.927934 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197194035888 [1,1]:I0928 17:10:36.928367 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197194038064 [1,1]:I0928 17:10:36.928872 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197202020224 [1,7]:I0928 17:10:36.929852 21526 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861104499648 [1,1]:I0928 17:10:36.929937 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197202022928 [1,5]:I0928 17:10:36.932377 21524 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819107612576 [1,7]:I0928 17:10:36.932574 21526 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861103093088 [1,7]:I0928 17:10:36.933432 21526 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861102548352 [1,7]:I0928 17:10:36.933667 21526 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861102525392 [1,7]:I0928 17:10:36.933935 21526 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861085333808 [1,7]:I0928 17:10:36.934589 21526 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861052349808 [1,5]:I0928 17:10:36.934896 21524 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819157173744 [1,7]:I0928 17:10:36.934911 21526 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861094929008 [1,7]:I0928 17:10:36.935353 21526 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861094932864 [1,7]:I0928 17:10:36.935786 21526 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861094935088 [1,5]:I0928 17:10:36.935801 21524 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819153183616 [1,5]:I0928 17:10:36.936040 21524 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819107844992 [1,5]:I0928 17:10:36.936304 21524 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819107567760 [1,7]:I0928 17:10:36.936416 21526 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861099051504 [1,5]:I0928 17:10:36.936815 21524 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819170248480 [1,5]:I0928 17:10:36.937319 21524 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819170250704 [1,7]:I0928 17:10:36.937649 21526 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861099054256 [1,5]:I0928 17:10:36.937741 21524 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819169545952 [1,5]:I0928 17:10:36.938180 21524 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819169548176 [1,5]:I0928 17:10:36.938771 21524 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819169670160 [1,5]:I0928 17:10:36.939939 21524 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819169672384 [1,29]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,29]:I0928 17:10:36.899094 8459 ProcessGroupNCCL.cpp:686] [Rank 29] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854681200400 [1,29]:I0928 17:10:36.901144 8459 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854668022080 [1,29]:I0928 17:10:36.906114 8459 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854668257792 [1,29]:I0928 17:10:36.907840 8459 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854681371424 [1,29]:I0928 17:10:36.908150 8459 ProcessGroupNCCL.cpp:686] [Rank 29] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854667952112 [1,29]:I0928 17:10:36.908402 8459 ProcessGroupNCCL.cpp:686] [Rank 29] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854667959808 [1,29]:I0928 17:10:36.908844 8459 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854667961856 [1,4]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,4]:I0928 17:10:36.952936 21522 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070748528960 [1,29]:I0928 17:10:36.909166 8459 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854681445824 [1,29]:I0928 17:10:36.909273 8459 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854635908528 [1,4]:I0928 17:10:36.953810 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070755923376 [1,29]:I0928 17:10:36.909904 8459 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854676619312 [1,29]:I0928 17:10:36.910317 8459 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854689090560 [1,29]:I0928 17:10:36.910737 8459 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854689092784 [1,29]:I0928 17:10:36.911710 8459 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854685104048 [1,29]:I0928 17:10:36.913434 8459 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854685106800 [1,4]:I0928 17:10:36.958770 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070737500832 [1,4]:I0928 17:10:36.961220 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070757860832 [1,4]:I0928 17:10:36.962128 21522 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070749840400 [1,4]:I0928 17:10:36.962368 21522 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070687931392 [1,4]:I0928 17:10:36.962633 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070736072192 [1,4]:I0928 17:10:36.963016 21522 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070746462272 [1,4]:I0928 17:10:36.963591 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070748671648 [1,4]:I0928 17:10:36.964022 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070748675664 [1,4]:I0928 17:10:36.964448 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070750620080 [1,3]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,3]:I0928 17:10:36.964892 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447668003536 [1,4]:I0928 17:10:36.965009 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070750622304 [1,3]:I0928 17:10:36.965688 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447668046576 [1,4]:I0928 17:10:36.966142 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070750624528 [1,3]:I0928 17:10:36.971949 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447674233936 [1,3]:I0928 17:10:36.974468 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447672383664 [1,3]:I0928 17:10:36.975409 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447606256144 [1,3]:I0928 17:10:36.975648 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447657472512 [1,3]:I0928 17:10:36.975901 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447622088544 [1,3]:I0928 17:10:36.976589 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447667383952 [1,3]:I0928 17:10:36.976708 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447668543072 [1,3]:I0928 17:10:36.976804 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447665615008 [1,3]:I0928 17:10:36.977036 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447665617184 [1,3]:I0928 17:10:36.977481 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447665619360 [1,3]:I0928 17:10:36.977908 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447676008624 [1,3]:I0928 17:10:36.978473 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447676010848 [1,3]:I0928 17:10:36.979580 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447676013024 [1,27]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,27]:I0928 17:10:36.952451 8451 ProcessGroupNCCL.cpp:686] [Rank 27] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521774252128 [1,27]:I0928 17:10:36.954414 8451 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521817222960 [1,27]:I0928 17:10:36.959306 8451 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521806302864 [1,27]:I0928 17:10:36.961118 8451 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521806442496 [1,27]:I0928 17:10:36.961479 8451 ProcessGroupNCCL.cpp:686] [Rank 27] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521757550784 [1,27]:I0928 17:10:36.961714 8451 ProcessGroupNCCL.cpp:686] [Rank 27] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521806312608 [1,27]:I0928 17:10:36.962117 8451 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521805743536 [1,27]:I0928 17:10:36.962615 8451 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521819356656 [1,27]:I0928 17:10:36.963070 8451 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521819358832 [1,27]:I0928 17:10:36.963505 8451 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521827551056 [1,27]:I0928 17:10:36.963940 8451 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521827553232 [1,27]:I0928 17:10:36.964911 8451 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521827555456 [1,27]:I0928 17:10:36.966615 8451 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521819344144 [1,22]:I0928 17:10:36.962049 6921 ProcessGroupNCCL.cpp:2780] Rank 22 using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,17]:I0928 17:10:36.971592 6901 ProcessGroupNCCL.cpp:2780] Rank 17 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,21]:I0928 17:10:36.977463 6919 ProcessGroupNCCL.cpp:2780] Rank 21 using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,2]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,2]:I0928 17:10:37.073457 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012302754640 [1,2]:I0928 17:10:37.074244 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012336462736 [1,2]:I0928 17:10:37.080004 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012348934864 [1,2]:I0928 17:10:37.082574 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012345126304 [1,2]:I0928 17:10:37.083531 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012335727184 [1,2]:I0928 17:10:37.083765 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012335106512 [1,2]:I0928 17:10:37.084004 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012352592832 [1,2]:I0928 17:10:37.084596 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012352595056 [1,2]:I0928 17:10:37.084703 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012356784400 [1,2]:I0928 17:10:37.084812 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012356788976 [1,2]:I0928 17:10:37.085153 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012348880352 [1,2]:I0928 17:10:37.085594 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012348882576 [1,2]:I0928 17:10:37.086021 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012348884800 [1,2]:I0928 17:10:37.086558 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012354867344 [1,2]:I0928 17:10:37.087626 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012354869568 [1,18]:I0928 17:10:37.012248 6907 ProcessGroupNCCL.cpp:2780] Rank 18 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,30]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,30]:I0928 17:10:37.059296 8461 ProcessGroupNCCL.cpp:686] [Rank 30] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159712551936 [1,30]:I0928 17:10:37.061789 8461 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159712542192 [1,6]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,6]:I0928 17:10:37.110015 21525 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968938272 [1,6]:I0928 17:10:37.110972 21525 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875972849008 [1,30]:I0928 17:10:37.067155 8461 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159663803232 [1,30]:I0928 17:10:37.068888 8461 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159725652016 [1,30]:I0928 17:10:37.069159 8461 ProcessGroupNCCL.cpp:686] [Rank 30] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159725468368 [1,30]:I0928 17:10:37.069396 8461 ProcessGroupNCCL.cpp:686] [Rank 30] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159725471712 [1,30]:I0928 17:10:37.069839 8461 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159733738016 [1,30]:I0928 17:10:37.070231 8461 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159733740192 [1,30]:I0928 17:10:37.070331 8461 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159733742368 [1,30]:I0928 17:10:37.070874 8461 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159724029824 [1,30]:I0928 17:10:37.071305 8461 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159724032000 [1,30]:I0928 17:10:37.071736 8461 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159724034224 [1,6]:I0928 17:10:37.116189 21525 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875973933408 [1,30]:I0928 17:10:37.072749 8461 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159731088448 [1,30]:I0928 17:10:37.074509 8461 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159731091072 [1,6]:I0928 17:10:37.118682 21525 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875963737456 [1,6]:I0928 17:10:37.119536 21525 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875967152384 [1,6]:I0928 17:10:37.119781 21525 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875966863200 [1,6]:I0928 17:10:37.120056 21525 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875966866544 [1,6]:I0928 17:10:37.120616 21525 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968701344 [1,6]:I0928 17:10:37.121022 21525 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968703568 [1,6]:I0928 17:10:37.121486 21525 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968705744 [1,6]:I0928 17:10:37.121914 21525 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968871392 [1,6]:I0928 17:10:37.122524 21525 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968873568 [1,6]:I0928 17:10:37.123687 21525 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968875792 [1,23]:I0928 17:10:37.044494 6922 ProcessGroupNCCL.cpp:2780] Rank 23 using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,11]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,11]:I0928 17:10:37.128873 24836 ProcessGroupNCCL.cpp:686] [Rank 11] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498980641440 [1,11]:I0928 17:10:37.130040 24836 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498918979744 [1,11]:I0928 17:10:37.135072 24836 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498968532784 [1,26]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,26]:I0928 17:10:37.122563 8446 ProcessGroupNCCL.cpp:686] [Rank 26] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651029190032 [1,11]:I0928 17:10:37.137250 24836 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498980072064 [1,10]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,10]:I0928 17:10:37.137311 24831 ProcessGroupNCCL.cpp:686] [Rank 10] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594181953040 [1,11]:I0928 17:10:37.137989 24836 ProcessGroupNCCL.cpp:686] [Rank 11] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498988908752 [1,11]:I0928 17:10:37.138221 24836 ProcessGroupNCCL.cpp:686] [Rank 11] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498988912048 [1,26]:I0928 17:10:37.124487 8446 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94650980360432 [1,11]:I0928 17:10:37.138522 24836 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498978682928 [1,10]:I0928 17:10:37.138542 24831 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594169667872 [1,11]:I0928 17:10:37.139132 24836 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498978685152 [1,11]:I0928 17:10:37.139508 24836 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498978687328 [1,11]:I0928 17:10:37.139928 24836 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498989155104 [1,11]:I0928 17:10:37.140359 24836 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498989157328 [1,11]:I0928 17:10:37.141057 24836 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498988754304 [1,11]:I0928 17:10:37.142352 24836 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498988756528 [1,26]:I0928 17:10:37.129940 8446 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651037985728 [1,10]:I0928 17:10:37.144203 24831 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594169665824 [1,26]:I0928 17:10:37.131722 8446 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651029171808 [1,26]:I0928 17:10:37.132103 8446 ProcessGroupNCCL.cpp:686] [Rank 26] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651029108848 [1,26]:I0928 17:10:37.132349 8446 ProcessGroupNCCL.cpp:686] [Rank 26] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651028696608 [1,10]:I0928 17:10:37.146450 24831 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594169658128 [1,26]:I0928 17:10:37.132762 8446 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651028533888 [1,10]:I0928 17:10:37.147200 24831 ProcessGroupNCCL.cpp:686] [Rank 10] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594169059168 [1,26]:I0928 17:10:37.133189 8446 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651039194288 [1,10]:I0928 17:10:37.147445 24831 ProcessGroupNCCL.cpp:686] [Rank 10] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594180588112 [1,26]:I0928 17:10:37.133728 8446 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651039196416 [1,10]:I0928 17:10:37.147754 24831 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594180591408 [1,26]:I0928 17:10:37.134155 8446 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651048342112 [1,10]:I0928 17:10:37.148296 24831 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594180593632 [1,26]:I0928 17:10:37.134593 8446 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651050523248 [1,10]:I0928 17:10:37.148746 24831 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594178361936 [1,10]:I0928 17:10:37.149168 24831 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594178366176 [1,26]:I0928 17:10:37.135520 8446 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651050525472 [1,10]:I0928 17:10:37.149605 24831 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594180691712 [1,10]:I0928 17:10:37.150280 24831 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594180693936 [1,26]:I0928 17:10:37.137228 8446 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651050527696 [1,10]:I0928 17:10:37.151577 24831 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594180696240 [1,20]:I0928 17:10:37.120223 6917 ProcessGroupNCCL.cpp:2780] Rank 20 using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,13]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,13]:I0928 17:10:37.194805 24844 ProcessGroupNCCL.cpp:686] [Rank 13] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969544541280 [1,13]:I0928 17:10:37.196089 24844 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969531708128 [1,13]:I0928 17:10:37.201287 24844 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969532443504 [1,19]:I0928 17:10:37.151767 6914 ProcessGroupNCCL.cpp:2780] Rank 19 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,13]:I0928 17:10:37.203401 24844 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969540597472 [1,13]:I0928 17:10:37.204113 24844 ProcessGroupNCCL.cpp:686] [Rank 13] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969531492160 [1,13]:I0928 17:10:37.204353 24844 ProcessGroupNCCL.cpp:686] [Rank 13] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969531015632 [1,13]:I0928 17:10:37.204679 24844 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969482635904 [1,13]:I0928 17:10:37.205111 24844 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969544429168 [1,13]:I0928 17:10:37.205663 24844 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969544431392 [1,13]:I0928 17:10:37.206087 24844 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969545340448 [1,9]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,9]:I0928 17:10:37.206405 24825 ProcessGroupNCCL.cpp:686] [Rank 9] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262122367072 [1,13]:I0928 17:10:37.206526 24844 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969545342576 [1,13]:I0928 17:10:37.207253 24844 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969545344800 [1,9]:I0928 17:10:37.207465 24825 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262102520816 [1,13]:I0928 17:10:37.208600 24844 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93969542317984 [1,9]:I0928 17:10:37.212440 24825 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262103222480 [1,9]:I0928 17:10:37.214661 24825 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262114321520 [1,9]:I0928 17:10:37.215456 24825 ProcessGroupNCCL.cpp:686] [Rank 9] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262116224016 [1,9]:I0928 17:10:37.215713 24825 ProcessGroupNCCL.cpp:686] [Rank 9] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262124306496 [1,9]:I0928 17:10:37.216012 24825 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262124309840 [1,9]:I0928 17:10:37.216480 24825 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262124312064 [1,9]:I0928 17:10:37.217005 24825 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262116058896 [1,9]:I0928 17:10:37.217443 24825 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262116063472 [1,9]:I0928 17:10:37.217875 24825 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262124147728 [1,9]:I0928 17:10:37.218526 24825 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262124149952 [1,9]:I0928 17:10:37.219781 24825 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94262124152176 [1,14]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,14]:I0928 17:10:37.228828 24846 ProcessGroupNCCL.cpp:686] [Rank 14] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417603221840 [1,0]:WARNING: Logging before InitGoogleLogging() is written to STDERR [1,0]:I0928 17:10:37.259811 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877283227312 [1,14]:I0928 17:10:37.230141 24846 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417603296960 [1,0]:I0928 17:10:37.260424 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877283828432 [1,14]:I0928 17:10:37.235035 24846 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417616761952 [1,0]:I0928 17:10:37.265904 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877295878384 [1,14]:I0928 17:10:37.237162 24846 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417603040288 [1,14]:I0928 17:10:37.237843 24846 ProcessGroupNCCL.cpp:686] [Rank 14] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417603050032 [1,14]:I0928 17:10:37.238082 24846 ProcessGroupNCCL.cpp:686] [Rank 14] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417602464272 [1,14]:I0928 17:10:37.238412 24846 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417614119712 [1,0]:I0928 17:10:37.268484 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877304147392 [1,14]:I0928 17:10:37.238921 24846 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417614121936 [1,14]:I0928 17:10:37.239393 24846 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417616167520 [1,0]:I0928 17:10:37.269519 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877304149472 [1,0]:I0928 17:10:37.269762 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877297562688 [1,14]:I0928 17:10:37.239832 24846 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417616171376 [1,0]:I0928 17:10:37.269999 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877297566032 [1,14]:I0928 17:10:37.240267 24846 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417616019872 [1,0]:I0928 17:10:37.270411 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877303057776 [1,0]:I0928 17:10:37.270515 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877303059952 [1,0]:I0928 17:10:37.270612 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877297344592 [1,14]:I0928 17:10:37.241012 24846 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417616022096 [1,0]:I0928 17:10:37.271090 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877297346816 [1,0]:I0928 17:10:37.271520 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877304888864 [1,0]:I0928 17:10:37.271955 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877304891088 [1,14]:I0928 17:10:37.242390 24846 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94417624106544 [1,0]:I0928 17:10:37.272440 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877304893312 [1,0]:I0928 17:10:37.273483 21506 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94877304894464 [1,0]:> initialized tensor model parallel with size 4 [1,0]:> initialized pipeline model parallel with size 8 [1,0]:> setting random seeds to 1234 ... [1,0]:> compiling dataset index builder ... [1,0]:make: Entering directory '/data/project/Megatron-LM-Qwen/megatron/core/datasets' [1,0]:make: Nothing to be done for 'default'. [1,0]:make: Leaving directory '/data/project/Megatron-LM-Qwen/megatron/core/datasets' [1,0]:>>> done with dataset index builder. Compilation time: 0.030 seconds [1,0]:> compiling and loading fused kernels ... [1,0]:I0928 17:10:37.305773 21506 ProcessGroupNCCL.cpp:2780] Rank 0 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,15]:I0928 17:10:37.293015 24847 ProcessGroupNCCL.cpp:2780] Rank 15 using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,25]:I0928 17:10:37.331790 8440 ProcessGroupNCCL.cpp:2780] Rank 25 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,31]:I0928 17:10:37.398240 8462 ProcessGroupNCCL.cpp:2780] Rank 31 using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,7]:I0928 17:10:37.479629 21526 ProcessGroupNCCL.cpp:2780] Rank 7 using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,5]:I0928 17:10:37.484269 21524 ProcessGroupNCCL.cpp:2780] Rank 5 using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,1]:I0928 17:10:37.485459 21509 ProcessGroupNCCL.cpp:2780] Rank 1 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,28]:I0928 17:10:37.449275 8456 ProcessGroupNCCL.cpp:2780] Rank 28 using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,4]:I0928 17:10:37.514962 21522 ProcessGroupNCCL.cpp:2780] Rank 4 using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,29]:I0928 17:10:37.483800 8459 ProcessGroupNCCL.cpp:2780] Rank 29 using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,3]:I0928 17:10:37.528204 21518 ProcessGroupNCCL.cpp:2780] Rank 3 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,27]:I0928 17:10:37.532408 8451 ProcessGroupNCCL.cpp:2780] Rank 27 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,2]:I0928 17:10:37.663508 21514 ProcessGroupNCCL.cpp:2780] Rank 2 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,30]:I0928 17:10:37.644487 8461 ProcessGroupNCCL.cpp:2780] Rank 30 using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,6]:I0928 17:10:37.698328 21525 ProcessGroupNCCL.cpp:2780] Rank 6 using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,11]:I0928 17:10:37.696750 24836 ProcessGroupNCCL.cpp:2780] Rank 11 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,10]:I0928 17:10:37.718168 24831 ProcessGroupNCCL.cpp:2780] Rank 10 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,26]:I0928 17:10:37.706135 8446 ProcessGroupNCCL.cpp:2780] Rank 26 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,13]:I0928 17:10:37.763438 24844 ProcessGroupNCCL.cpp:2780] Rank 13 using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,9]:I0928 17:10:37.785578 24825 ProcessGroupNCCL.cpp:2780] Rank 9 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,14]:I0928 17:10:37.803975 24846 ProcessGroupNCCL.cpp:2780] Rank 14 using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. [1,0]:I0928 17:10:40.743487 21506 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,0]:>>> done with compiling and loading fused kernels. Compilation time: 3.489 seconds [1,0]:time to initialize megatron (seconds): 16.969 [1,0]:[after megatron is initialized] datetime: 2024-09-28 17:10:40 [1,0]:building GPT model ... [1,21]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++ weight.size is in RowParallelLinear: [1,4]:torch.Size([8192, 2048]) [1,4]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,21]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,21]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++ weight.size is in RowParallelLinear: [1,4]:torch.Size([8192, 2048]) [1,4]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,21]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,21]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,4]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,21]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,21]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++ weight.size is in RowParallelLinear: [1,4]:torch.Size([8192, 2048]) [1,4]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,21]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,10]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,10]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,4]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++ weight.size is in RowParallelLinear: [1,4]:torch.Size([8192, 2048]) [1,24]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,6]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,1]:++++++++ weight.size is in RowParallelLinear: [1,1]:torch.Size([8192, 2048]) [1,17]:++++++++++++++++++++++++padding is done [1,8]:++++++++ weight.size is in RowParallelLinear: [1,8]:torch.Size([8192, 2048]) [1,25]:++++++++ weight.size is in RowParallelLinear: [1,25]:torch.Size([8192, 2048]) [1,17]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,1]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,21]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,10]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,10]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,29]:++++++++ weight.size is in RowParallelLinear: [1,29]:torch.Size([8192, 2048]) [1,4]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,3]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,3]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,29]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,24]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,1]:++++++++++++++++++++++++padding is done [1,1]:++++++++ weight.size is in RowParallelLinear: [1,1]:torch.Size([8192, 2048]) [1,24]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,8]:++++++++ weight.size is in RowParallelLinear: [1,1]:++++++++++++++++++++++++padding is done [1,8]:torch.Size([8192, 2048]) [1,21]:++++++++++++++++++++++++padding is done [1,25]:++++++++ weight.size is in RowParallelLinear: [1,25]:torch.Size([8192, 2048]) [1,8]:++++++++++++++++++++++++padding is done [1,21]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,25]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,17]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,13]:++++++++ weight.size is in RowParallelLinear: [1,13]:torch.Size([8192, 2048]) [1,7]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,17]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,10]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,10]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++ weight.size is in RowParallelLinear: [1,4]:torch.Size([8192, 2048]) [1,4]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,29]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,3]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,1]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,1]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,8]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,25]:++++++++ weight.size is in RowParallelLinear: [1,25]:torch.Size([8192, 2048]) [1,10]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,8]:++++++++ weight.size is in RowParallelLinear: [1,8]:torch.Size([8192, 2048]) [1,21]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,4]:++++++++++++++++++++++++padding is done [1,24]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,8]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,6]:++++++++ weight.size is in RowParallelLinear: [1,13]:++++++++++++++++++++++++padding is done [1,6]:torch.Size([8192, 2048]) [1,17]:++++++++++++++++++++++++padding is done [1,11]:++++++++ weight.size is in RowParallelLinear: [1,11]:torch.Size([8192, 2048]) [1,17]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,7]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,11]:++++++++++++++++++++++++padding is done [1,10]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,4]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,3]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,12]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,12]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,25]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,29]:++++++++ weight.size is in RowParallelLinear: [1,29]:torch.Size([8192, 2048]) [1,29]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,1]:++++++++ weight.size is in RowParallelLinear: [1,1]:torch.Size([8192, 2048]) [1,8]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,25]:++++++++ weight.size is in RowParallelLinear: [1,7]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,25]:torch.Size([8192, 2048]) [1,25]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,21]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,8]:++++++++ weight.size is in RowParallelLinear: [1,8]:torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,6]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,12]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,24]:++++++++ weight.size is in RowParallelLinear: [1,24]:torch.Size([8192, 2048]) [1,28]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,8]:++++++++++++++++++++++++padding is done [1,7]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,11]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,24]:++++++++++++++++++++++++padding is done [1,10]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,11]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,0]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,4]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,17]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,17]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,3]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,5]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,12]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,12]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,29]:++++++++ weight.size is in RowParallelLinear: [1,29]:torch.Size([8192, 2048]) [1,15]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,15]:++++++++ weight.size is in RowParallelLinear: [1,15]:torch.Size([8192, 2048]) [1,1]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,8]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,22]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,10]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,25]:++++++++ weight.size is in RowParallelLinear: [1,25]:torch.Size([8192, 2048]) [1,22]:++++++++++++++++++++++++padding is done [1,21]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,4]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,2]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,28]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,8]:++++++++ weight.size is in RowParallelLinear: [1,6]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,8]:torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,7]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,20]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,24]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,10]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,20]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,18]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,0]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,3]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,4]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,17]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,17]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,28]:++++++++ weight.size is in RowParallelLinear: [1,28]:torch.Size([8192, 2048]) [1,11]:++++++++ weight.size is in RowParallelLinear: [1,11]:torch.Size([8192, 2048]) [1,17]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,4]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,12]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++ weight.size is in RowParallelLinear: [1,9]:torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,5]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,16]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,3]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,3]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,29]:++++++++ weight.size is in RowParallelLinear: [1,29]:torch.Size([8192, 2048]) [1,29]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,1]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,25]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,15]:++++++++ weight.size is in RowParallelLinear: [1,15]:torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,22]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,20]:++++++++++++++++++++++++padding is done [1,27]:++++++++ weight.size is in RowParallelLinear: [1,27]:torch.Size([8192, 2048]) [1,22]:++++++++++++++++++++++++padding is done [1,21]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,28]:++++++++++++++++++++++++padding is done [1,4]: > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 605716480 [1,17]:++++++++++++++++++++++++padding is done [1,21]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,25]:++++++++ weight.size is in RowParallelLinear: [1,25]:torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,6]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,8]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,8]:++++++++ weight.size is in RowParallelLinear: [1,7]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,8]:torch.Size([8192, 2048]) [1,24]:++++++++++++++++++++++++padding is done [1,24]:++++++++ weight.size is in RowParallelLinear: [1,24]:torch.Size([8192, 2048]) [1,8]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,20]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,28]:++++++++ weight.size is in RowParallelLinear: [1,13]:++++++++++++++++++++++++padding is done [1,28]:torch.Size([8192, 2048]) [1,19]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,11]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,13]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,11]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,0]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,12]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,17]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,5]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,3]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,29]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,29]:++++++++ weight.size is in RowParallelLinear: [1,29]:torch.Size([8192, 2048]) [1,10]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++ weight.size is in RowParallelLinear: [1,27]:++++++++++++++++++++++++padding is done [1,9]:torch.Size([8192, 2048]) [1,21]: > number of parameters on (tensor, pipeline) model parallel rank (1, 5): 605716480 [1,29]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,17]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,8]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,1]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,1]:++++++++++++++++++++++++padding is done [1,2]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,15]:++++++++ weight.size is in RowParallelLinear: [1,15]:torch.Size([8192, 2048]) [1,22]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,18]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,15]:++++++++++++++++++++++++padding is done [1,14]:++++++++ weight.size is in RowParallelLinear: [1,14]:torch.Size([8192, 2048]) [1,20]:++++++++++++++++++++++++padding is done [1,22]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,22]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,27]:++++++++ weight.size is in RowParallelLinear: [1,27]:torch.Size([8192, 2048]) [1,13]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,25]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,6]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,25]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,8]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,24]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,8]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,20]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,28]:++++++++ weight.size is in RowParallelLinear: [1,26]:++++++++++++++++++++++++padding is done[1,28]:torch.Size([8192, 2048]) [1,26]: [1,13]:++++++++++++++++++++++++padding is done [1,11]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,28]:++++++++++++++++++++++++padding is done [1,13]:++++++++ weight.size is in RowParallelLinear: [1,26]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,1]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,7]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,13]:torch.Size([8192, 2048]) [1,11]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,12]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,0]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,10]:++++++++ weight.size is in RowParallelLinear: [1,29]:++++++++ weight.size is in RowParallelLinear: [1,29]:torch.Size([8192, 2048]) [1,10]:torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,5]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,3]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,3]:++++++++++++++++++++++++padding is done [1,9]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,27]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,18]:++++++++ weight.size is in RowParallelLinear: [1,18]:torch.Size([8192, 2048]) [1,25]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,17]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,18]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,1]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,1]:++++++++++++++++++++++++padding is done [1,15]:++++++++ weight.size is in RowParallelLinear: [1,15]:torch.Size([8192, 2048]) [1,15]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,14]:++++++++ weight.size is in RowParallelLinear: [1,14]:torch.Size([8192, 2048]) [1,27]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,27]:++++++++ weight.size is in RowParallelLinear: [1,27]:torch.Size([8192, 2048]) [1,11]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,22]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,22]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,25]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,16]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,8]:++++++++ weight.size is in RowParallelLinear: [1,8]:torch.Size([8192, 2048]) [1,24]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,24]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,24]:++++++++++++++++++++++++padding is done [1,2]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,6]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,6]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,20]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++++++++++++++++++padding is done [1,28]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,20]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,11]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,13]:++++++++++++++++++++++++padding is done [1,13]:++++++++ weight.size is in RowParallelLinear: [1,7]:++++++++++++++++++++++++padding is done [1,7]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,1]:++++++++++++++++++++++++padding is done [1,13]:torch.Size([8192, 2048]) [1,17]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,26]:++++++++ weight.size is in RowParallelLinear: [1,13]:++++++++++++++++++++++++padding is done [1,26]:torch.Size([8192, 2048]) [1,15]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,5]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,3]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,0]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,29]:++++++++ weight.size is in RowParallelLinear: [1,29]:torch.Size([8192, 2048]) [1,10]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++ weight.size is in RowParallelLinear: [1,9]:torch.Size([8192, 2048]) [1,27]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,12]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,25]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,1]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,3]:++++++++++++++++++++++++padding is done [1,18]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,12]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,17]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,18]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,15]:++++++++ weight.size is in RowParallelLinear: [1,15]:torch.Size([8192, 2048]) [1,17]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,22]:++++++++ weight.size is in RowParallelLinear: [1,22]:torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,27]:++++++++ weight.size is in RowParallelLinear: [1,27]:torch.Size([8192, 2048]) [1,25]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,25]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,8]:++++++++ weight.size is in RowParallelLinear: [1,8]:torch.Size([8192, 2048]) [1,25]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,6]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,6]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,20]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,29]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,11]:++++++++ weight.size is in RowParallelLinear: [1,24]:++++++++++++++++++++++++padding is done [1,11]:torch.Size([8192, 2048]) [1,14]:++++++++++++++++++++++++padding is done [1,16]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,24]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,11]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,7]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,30]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,28]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,28]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,0]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,3]:++++++++++++++++++++++++padding is done [1,3]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,5]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,30]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,13]:++++++++ weight.size is in RowParallelLinear: [1,26]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,13]:torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,29]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,14]:++++++++ weight.size is in RowParallelLinear: [1,14]:torch.Size([8192, 2048]) [1,31]:++++++++++++++++++++++++padding is done [1,26]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,12]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,10]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,27]:++++++++++++++++++++++++padding is done [1,31]:++++++++ weight.size is in RowParallelLinear: [1,31]:torch.Size([8192, 2048]) [1,14]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,12]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,29]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,10]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,17]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,31]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,19]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++ weight.size is in RowParallelLinear: [1,9]:torch.Size([8192, 2048]) [1,9]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,2]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,18]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,27]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,27]:++++++++ weight.size is in RowParallelLinear: [1,27]:torch.Size([8192, 2048]) [1,15]:++++++++ weight.size is in RowParallelLinear: [1,15]:torch.Size([8192, 2048]) [1,25]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,25]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,15]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,25]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,1]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,22]:++++++++ weight.size is in RowParallelLinear: [1,22]:torch.Size([8192, 2048]) [1,8]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,28]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,8]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,6]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,6]:++++++++++++++++++++++++padding is done [1,1]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,20]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,11]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,11]:++++++++ weight.size is in RowParallelLinear: [1,11]:torch.Size([8192, 2048]) [1,24]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,24]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,10]: > number of parameters on (tensor, pipeline) model parallel rank (2, 2): 605716480 [1,17]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,24]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,7]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,28]:++++++++ weight.size is in RowParallelLinear: [1,28]:torch.Size([8192, 2048]) [1,30]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,30]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,16]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,13]:++++++++ weight.size is in RowParallelLinear: [1,18]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,13]:torch.Size([8192, 2048]) [1,14]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,5]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,3]:++++++++++++++++++++++++padding is done [1,3]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,3]:++++++++++++++++++++++++padding is done [1,29]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,12]:++++++++++++++++++++++++padding is done [1,26]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,27]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,29]:++++++++++++++++++++++++padding is done [1,12]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,14]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,17]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,31]:++++++++ weight.size is in RowParallelLinear: [1,31]:torch.Size([8192, 2048]) [1,22]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,17]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,0]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,16]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,9]:++++++++ weight.size is in RowParallelLinear: [1,9]:torch.Size([8192, 2048]) [1,19]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,19]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,1]: > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 917143552 [1,2]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,18]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,11]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,25]: > number of parameters on (tensor, pipeline) model parallel rank (1, 6): 605716480 [1,15]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,27]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,18]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,8]: > number of parameters on (tensor, pipeline) model parallel rank (0, 2): 605716480 [1,22]:++++++++++++++++++++++++padding is done [1,24]: > number of parameters on (tensor, pipeline) model parallel rank (0, 6): 605716480 [1,6]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,22]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,22]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,3]: > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 917143552 [1,5]:++++++++++++++++++++++++padding is done [1,20]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,29]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,11]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,31]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,7]:++++++++++++++++++++++++padding is done [1,7]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,7]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,6]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,6]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,28]:++++++++ weight.size is in RowParallelLinear: [1,28]:torch.Size([8192, 2048]) [1,9]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,30]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,28]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,5]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,26]:++++++++ weight.size is in RowParallelLinear: [1,26]:torch.Size([8192, 2048]) [1,12]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,12]:++++++++ weight.size is in RowParallelLinear: [1,17]: > number of parameters on (tensor, pipeline) model parallel rank (1, 4): 605716480 [1,12]:torch.Size([8192, 2048]) [1,26]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,16]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++ weight.size is in RowParallelLinear: [1,14]:++++++++ weight.size is in RowParallelLinear: [1,9]:torch.Size([8192, 2048]) [1,0]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,14]:torch.Size([8192, 2048]) [1,16]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,29]: > number of parameters on (tensor, pipeline) model parallel rank (1, 7): 605724672 [1,18]:++++++++++++++++++++++++padding is done [1,31]:++++++++ weight.size is in RowParallelLinear: [1,31]:torch.Size([8192, 2048]) [1,15]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,18]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,11]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,15]:++++++++ weight.size is in RowParallelLinear: [1,27]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,15]:torch.Size([8192, 2048]) [1,13]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,13]:++++++++ weight.size is in RowParallelLinear: [1,27]:++++++++++++++++++++++++padding is done [1,13]:torch.Size([8192, 2048]) [1,15]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,2]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,13]:++++++++++++++++++++++++padding is done [1,22]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,28]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,19]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,7]: > number of parameters on (tensor, pipeline) model parallel rank (3, 1): 605716480 [1,2]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,20]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,20]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,6]:++++++++++++++++++++++++padding is done [1,6]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,6]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,28]:++++++++ weight.size is in RowParallelLinear: [1,28]:torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,5]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,28]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,30]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,11]:++++++++ weight.size is in RowParallelLinear: [1,5]:++++++++++++++++++++++++padding is done [1,11]:torch.Size([8192, 2048]) [1,30]:++++++++++++++++++++++++padding is done [1,11]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,26]:++++++++ weight.size is in RowParallelLinear: [1,27]:++++++++++++++++++++++++padding is done [1,26]:torch.Size([8192, 2048]) [1,12]:++++++++++++++++++++++++padding is done [1,12]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,13]:++++++++++++++++++++++++padding is done [1,4]:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 [1,4]:Params for bucket 1 (605716480 elements): [1,4]: module.language_model.encoder.layers.9.post_attention_norm.weight [1,4]: module.language_model.encoder.layers.9.input_norm.weight [1,4]: module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight [1,4]: module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight [1,4]: module.language_model.encoder.layers.2.input_norm.weight [1,4]: module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight [1,4]: module.language_model.encoder.layers.8.post_attention_norm.weight [1,4]: module.language_model.encoder.layers.8.input_norm.weight [1,4]: module.language_model.encoder.layers.6.post_attention_norm.weight [1,4]: module.language_model.encoder.layers.6.input_norm.weight [1,4]: module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight [1,4]: module.language_model.encoder.layers.3.input_norm.weight [1,4]: module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight [1,4]: module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight [1,4]: module.language_model.encoder.layers.1.post_attention_norm.weight [1,4]: module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight [1,4]: module.language_model.encoder.layers.0.post_attention_norm.weight [1,4]: module.language_model.encoder.layers.3.post_attention_norm.weight [1,4]: module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight [1,4]: module.language_model.encoder.layers.4.post_attention_norm.weight [1,4]: module.language_model.encoder.layers.2.post_attention_norm.weight [1,4]: module.language_model.encoder.layers.1.input_norm.weight [1,4]: module.language_model.encoder.layers.7.post_attention_norm.weight [1,4]: module.language_model.encoder.layers.7.input_norm.weight [1,4]: module.language_model.encoder.layers.5.post_attention_norm.weight [1,4]: module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight [1,4]: module.language_model.encoder.layers.4.input_norm.weight [1,4]: module.language_model.encoder.layers.5.input_norm.weight [1,4]: module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight [1,4]: module.language_model.encoder.layers.0.input_norm.weight [1,26]:++++++++++++++++++++++++padding is done [1,12]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,9]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,18]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,15]:++++++++ weight.size is in RowParallelLinear: [1,15]:torch.Size([8192, 2048]) [1,27]:++++++++ weight.size is in RowParallelLinear: [1,27]:torch.Size([8192, 2048]) [1,13]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,13]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,31]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,6]: > number of parameters on (tensor, pipeline) model parallel rank (2, 1): 605716480 [1,14]:++++++++ weight.size is in RowParallelLinear: [1,14]:torch.Size([8192, 2048]) [1,13]:++++++++++++++++++++++++padding is done [1,22]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,28]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,0]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,22]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,19]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,18]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,16]:++++++++++++++++++++++++padding is done [1,16]:++++++++ weight.size is in RowParallelLinear: [1,19]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,11]: > number of parameters on (tensor, pipeline) model parallel rank (3, 2): 605716480 [1,16]:torch.Size([8192, 2048]) [1,16]:++++++++++++++++++++++++padding is done [1,20]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,20]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,2]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,31]:++++++++ weight.size is in RowParallelLinear: [1,31]:torch.Size([8192, 2048]) [1,12]: > number of parameters on (tensor, pipeline) model parallel rank (0, 3): 605716480 [1,28]:++++++++++++++++++++++++padding is done [1,28]:++++++++ weight.size is in RowParallelLinear: [1,28]:torch.Size([8192, 2048]) [1,9]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,5]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,5]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,30]:++++++++ weight.size is in RowParallelLinear: [1,30]:torch.Size([8192, 2048]) [1,5]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,26]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,27]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,9]:++++++++ weight.size is in RowParallelLinear: [1,9]:torch.Size([8192, 2048]) [1,18]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,15]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,15]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,27]:++++++++ weight.size is in RowParallelLinear: [1,27]:torch.Size([8192, 2048]) [1,15]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,28]:++++++++++++++++++++++++padding is done [1,22]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,0]:++++++++++++++++++++++++padding is done [1,0]:++++++++ weight.size is in RowParallelLinear: [1,0]:torch.Size([8192, 2048]) [1,14]:++++++++++++++++++++++++padding is done [1,22]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,0]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,18]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,13]:++++++++ weight.size is in RowParallelLinear: [1,13]:torch.Size([8192, 2048]) [1,4]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,4]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,21]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,21]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,26]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,5]: > number of parameters on (tensor, pipeline) model parallel rank (1, 1): 605716480 [1,13]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,31]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,19]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,9]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,16]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,14]:++++++++ weight.size is in RowParallelLinear: [1,14]:torch.Size([8192, 2048]) [1,16]:++++++++++++++++++++++++padding is done [1,2]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,14]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,26]:++++++++ weight.size is in RowParallelLinear: [1,26]:torch.Size([8192, 2048]) [1,27]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,19]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,15]: > number of parameters on (tensor, pipeline) model parallel rank (3, 3): 605716480 [1,30]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,28]: > number of parameters on (tensor, pipeline) model parallel rank (0, 7): 605724672 [1,30]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,9]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,9]:++++++++++++++++++++++++padding is done [1,0]: > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 917143552 [1,13]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,27]:++++++++++++++++++++++++padding is done [1,27]:++++++++ weight.size is in RowParallelLinear: [1,27]:torch.Size([8192, 2048]) [1,27]:++++++++++++++++++++++++padding is done [1,22]: > number of parameters on (tensor, pipeline) model parallel rank (2, 5): 605716480 [1,16]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,13]:++++++++ weight.size is in RowParallelLinear: [1,13]:torch.Size([8192, 2048]) [1,26]:++++++++++++++++++++++++padding is done [1,18]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,31]:++++++++++++++++++++++++padding is done [1,13]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,16]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,14]:++++++++ weight.size is in RowParallelLinear: [1,14]:torch.Size([8192, 2048]) [1,16]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,9]: > number of parameters on (tensor, pipeline) model parallel rank (1, 2): 605716480 [1,26]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,26]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,31]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,19]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,30]:++++++++++++++++++++++++padding is done [1,27]: > number of parameters on (tensor, pipeline) model parallel rank (3, 6): 605716480 [1,19]:++++++++++++++++++++++++padding is done [1,30]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,30]:++++++++++++++++++++++++padding is done [1,31]:++++++++ weight.size is in RowParallelLinear: [1,31]:torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,8]:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 [1,8]:Params for bucket 1 (605716480 elements): [1,8]: module.language_model.encoder.layers.7.post_attention_norm.weight [1,8]: module.language_model.encoder.layers.7.input_norm.weight [1,8]: module.language_model.encoder.layers.5.post_attention_norm.weight [1,8]: module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight [1,8]: module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight [1,8]: module.language_model.encoder.layers.0.post_attention_norm.weight [1,8]: module.language_model.encoder.layers.5.input_norm.weight [1,8]: module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight [1,8]: module.language_model.encoder.layers.8.post_attention_norm.weight [1,8]: module.language_model.encoder.layers.8.input_norm.weight [1,8]: module.language_model.encoder.layers.6.post_attention_norm.weight [1,8]: module.language_model.encoder.layers.6.input_norm.weight [1,8]: module.language_model.encoder.layers.2.post_attention_norm.weight [1,8]: module.language_model.encoder.layers.2.input_norm.weight [1,8]: module.language_model.encoder.layers.9.post_attention_norm.weight [1,8]: module.language_model.encoder.layers.9.input_norm.weight [1,8]: module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight [1,8]: module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight [1,8]: module.language_model.encoder.layers.3.input_norm.weight [1,8]: module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight [1,8]: [1,2]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,8]:module.language_model.encoder.layers.4.post_attention_norm.weight [1,8]: module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight [1,8]: module.language_model.encoder.layers.4.input_norm.weight [1,8]: module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight [1,8]: module.language_model.encoder.layers.1.input_norm.weight [1,8]: module.language_model.encoder.layers.0.input_norm.weight [1,8]: module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight [1,8]: module.language_model.encoder.layers.3.post_attention_norm.weight [1,8]: module.language_model.encoder.layers.1.post_attention_norm.weight [1,8]: module.language_model.encoder.layers.5.mlp.dense_4h_to_[1,8]:h.weight [1,13]: > number of parameters on (tensor, pipeline) model parallel rank (1, 3): 605716480 [1,2]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,18]:++++++++++++++++++++++++padding is done [1,18]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,18]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,14]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,31]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,19]:++++++++++++++++++++++++padding is done [1,19]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,30]:++++++++++++++++++++++++padding is done [1,30]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,19]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,26]:++++++++ weight.size is in RowParallelLinear: [1,26]:torch.Size([8192, 2048]) [1,31]:++++++++++++++++++++++++padding is done [1,26]:++++++++++++++++++++++++padding is done [1,31]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,16]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,2]:++++++++++++++++++++++++padding is done [1,16]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,16]:++++++++++++++++++++++++padding is done [1,2]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,2]:++++++++++++++++++++++++padding is done [1,18]: > number of parameters on (tensor, pipeline) model parallel rank (2, 4): 605716480 [1,24]:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 [1,24]:Params for bucket 1 (605716480 elements): [1,24]: module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight [1,24]: module.language_model.encoder.layers.7.post_attention_norm.weight [1,24]: module.language_model.encoder.layers.7.input_norm.weight [1,24]: module.language_model.encoder.layers.4.post_attention_norm.weight [1,24]: module.language_model.encoder.layers.1.input_norm.weight [1,24]: module.language_model.encoder.layers.3.post_attention_norm.weight [1,24]: module.language_model.encoder.layers.4.input_norm.weight [1,24]: module.language_model.encoder.layers.0.input_norm.weight [1,24]: module.language_model.encoder.layers.2.input_norm.weight [1,24]: module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight [1,24]: module.language_model.encoder.layers.8.post_attention_norm.weight [1,24]: module.language_model.encoder.layers.8.input_norm.weight [1,24]: module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight [1,24]: module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight [1,24]: module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight [1,24]: module.language_model.encoder.layers.9.input_norm.weight [1,24]: module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight [1,24]: module.language_model.encoder.layers.5.post_attention_norm.weight [1,24]: module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight [1,24]: module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight [1,24]: module.language_model.encoder.layers.5.input_norm.weight [1,24]: module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight [1,24]: module.language_model.encoder.layers.9.post_attention_norm.weight [1,24]: module.language_model.encoder.layers.6.post_attention_norm.weight [1,24]: module.language_model.encoder.layers.6.input_norm.weight [1,24]: module.language_model.encoder.layers.2.post_attention_norm.weight [1,24]: module.language_model.encoder.layers.1.post_attention_norm.weight [1,24]: module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight [1,24]: module.language_model.encoder.layers.3.input_norm.weight [1,24]: module.language_model.encoder.layers.0.post_attention_norm.weight [1,20]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,20]:++++++++ weight.size is in RowParallelLinear: [1,20]:torch.Size([8192, 2048]) [1,8]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,8]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,26]: > number of parameters on (tensor, pipeline) model parallel rank (2, 6): 605716480 [1,19]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,20]:++++++++++++++++++++++++padding is done [1,14]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,14]:++++++++ weight.size is in RowParallelLinear: [1,14]:torch.Size([8192, 2048]) [1,30]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,14]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,10]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,10]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,2]: > number of parameters on (tensor, pipeline) model parallel rank (2, 0): 917143552 [1,25]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,25]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,19]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,19]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,16]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,19]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,17]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,17]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,31]:++++++++++++++++++++++++padding is done [1,31]:++++++++ weight.size is in RowParallelLinear: [1,31]:torch.Size([8192, 2048]) [1,31]:++++++++++++++++++++++++padding is done [1,0]:INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=False, use_distributed_optimizer=True, check_for_nan_in_grad=True, bucket_size=None, average_in_collective=False) [1,30]:++++++++++++++++++++++++padding is done [1,14]: > number of parameters on (tensor, pipeline) model parallel rank (2, 3): 605716480 [1,19]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,30]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,30]:++++++++++++++++++++++++padding is done [1,20]: > number of parameters on (tensor, pipeline) model parallel rank (0, 5): 605716480 [1,24]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,24]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,19]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,7]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,7]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,19]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,19]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,31]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,3]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,3]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,31]:++++++++++++++++++++++++padding is done [1,30]:++++++++++++++++++++++++padding is done [1,16]:++++++++++++++++++++++++padding is done [1,16]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,16]:++++++++++++++++++++++++padding is done [1,12]:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 [1,12]:Params for bucket 1 (605716480 elements): [1,12]: module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight [1,12]: module.language_model.encoder.layers.8.post_attention_norm.weight [1,12]: module.language_model.encoder.layers.8.input_norm.weight [1,12]: module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight [1,12]: module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight [1,12]: module.language_model.encoder.layers.9.post_attention_norm.weight [1,12]: module.language_model.encoder.layers.9.input_norm.weight [1,12]: module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight [1,12]: module.language_model.encoder.layers.5.post_attention_norm.weight [1,12]: module.language_model.encoder.layers.0.input_norm.weight [1,12]: module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight [1,12]: module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight [1,12]: module.language_model.encoder.layers.5.input_norm.weight [1,12]: module.language_model.encoder.layers.0.post_attention_norm.weight [1,12]: module.language_model.encoder.layers.6.input_norm.weight [1,12]: module.language_model.encoder.layers.2.post_attention_norm.weight [1,12]: module.language_model.encoder.layers.4.post_attention_norm.weight [1,12]: module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight [1,12]: [1,12]:module.language_model.encoder.layers.1.post_attention_norm.weight [1,12]: module.language_model.encoder.layers.3.input_norm.weight [1,12]: module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight [1,12]: module.language_model.encoder.layers.6.post_attention_norm.weight [1,12]: module.language_model.encoder.layers.3.post_attention_norm.weight [1,12]: module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight [1,12]: module.language_model.encoder.layers.7.post_attention_norm.weight [1,12]: module.language_model.encoder.layers.7.input_norm.weight [1,12]: module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight [1,12]: module.language_model.encoder.layers.4.input_norm.weight [1,12]: module.language_model.encoder.layers.2.input_norm.weight [1,12]: module.language_model.encoder.layers.1.input_norm.weight [1,11]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,11]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,19]:++++++++++++++++++++++++padding is done [1,31]:++++++++++++++++++++++++padding is done [1,30]: > number of parameters on (tensor, pipeline) model parallel rank (2, 7): 605724672 [1,1]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,1]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,29]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,29]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,19]:++++++++++++++++++++++++padding is done [1,19]:++++++++ weight.size is in RowParallelLinear: [1,19]:torch.Size([8192, 2048]) [1,31]:++++++++++++++++++++++++padding is done [1,31]:++++++++ weight.size is in RowParallelLinear: [1,31]:torch.Size([8192, 2048]) [1,19]:++++++++++++++++++++++++padding is done [1,16]: > number of parameters on (tensor, pipeline) model parallel rank (0, 4): 605716480 [1,31]:++++++++++++++++++++++++padding is done [1,6]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,6]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,19]: > number of parameters on (tensor, pipeline) model parallel rank (3, 4): 605716480 [1,12]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,12]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,31]:++++++++++++++++++++++++padding is done [1,31]: > number of parameters on (tensor, pipeline) model parallel rank (3, 7): 605724672 [1,0]:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 [1,0]:Params for bucket 1 (917143552 elements): [1,0]: module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight [1,0]: module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight [1,0]: module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight [1,0]: module.language_model.embedding.word_embeddings.weight [1,0]: module.language_model.encoder.layers.7.input_norm.weight [1,0]: module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight [1,0]: module.language_model.encoder.layers.4.input_norm.weight [1,0]: module.language_model.encoder.layers.1.input_norm.weight [1,0]: module.language_model.encoder.layers.5.post_attention_norm.weight [1,0]: module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight [1,0]: module.language_model.encoder.layers.9.post_attention_norm.weight [1,0]: module.language_model.encoder.layers.2.post_attention_norm.weight [1,0]: module.language_model.encoder.layers.0.input_norm.weight [1,0]: module.language_model.encoder.layers.7.post_attention_norm.weight [1,0]: module.language_model.encoder.layers.6.post_attention_norm.weight [1,0]: module.language_model.encoder.layers.6.input_norm.weight [1,0]: module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight [1,0]: module.language_model.encoder.layers.1.post_attention_norm.weight [1,0]: module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight [1,0]: module.language_model.encoder.layers.5.input_norm.weight [1,0]: module.language_model.encoder.layers.3.input_norm.weight [1,0]: module.language_model.encoder.layers.8.post_attention_norm.weight [1,0]: module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight [1,0]: module.language_model.encoder.layers.8.input_norm.weight [1,0]: module.language_model.encoder.layers.3.post_attention_norm.weight [1,0]: [1,0]:module.language_model.encoder.layers.0.post_attention_norm.weight [1,0]: module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight [1,0]: module.language_model.encoder.layers.9.input_norm.weight [1,0]: module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight [1,0]: module.language_model.encoder.layers.4.post_attention_norm.weight [1,0]: module.language_model.encoder.layers.2.input_norm.weight [1,28]:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 [1,28]:Params for bucket 1 (605724672 elements): [1,28]: module.language_model.encoder.layers.4.input_norm.weight [1,28]: module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight [1,28]: module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight [1,28]: module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight [1,28]: module.language_model.encoder.layers.0.post_attention_norm.weight [1,28]: module.language_model.encoder.layers.6.input_norm.weight [1,28]: module.language_model.encoder.final_norm.weight [1,28]: module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight [1,28]: module.language_model.encoder.layers.2.post_attention_norm.weight [1,28]: module.language_model.encoder.layers.7.post_attention_norm.weight [1,28]: module.language_model.encoder.layers.7.input_norm.weight [1,28]: module.language_model.encoder.layers.5.post_attention_norm.weight [1,28]: module.language_model.encoder.layers.9.post_attention_norm.weight [1,28]: module.language_model.encoder.layers.9.input_norm.weight [1,28]: module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight [1,28]: module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight [1,28]: module.language_model.encoder.layers.5.input_norm.weight [1,28]: module.language_model.encoder.layers.3.input_norm.weight [1,28]: module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight [1,28]: module.language_model.encoder.layers.8.post_attention_norm.weight [1,28]: module.language_model.encoder.layers.8.input_norm.weight [1,28]: module.language_model.encoder.layers.6.post_attention_norm.weight [1,28]: module.language_model.encoder.layers.4.post_attention_norm.weight [1,28]: module.language_model.encoder.layers.3.post_attention_norm.weight [1,28]: module.language_model.encoder.layers.1.post_attention_norm.weight [1,28]: module.language_model.encoder.layers.1.input_norm.weight [1,28]: module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight [1,28]: module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight [1,28]: module.language_model.encoder.layers.2.input_norm.weight [1,28]: [1,9]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,9]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,28]:module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight [1,28]: module.language_model.encoder.layers.0.input_norm.weight [1,15]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,15]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,5]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,5]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,0]:INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=3e-05, min_lr=3e-06, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=) [1,22]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,22]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,27]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,27]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,28]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,28]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,13]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,13]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,18]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,18]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,0]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,0]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,26]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,26]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,20]:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 [1,20]:Params for bucket 1 (605716480 elements): [1,20]: module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight [1,20]: module.language_model.encoder.layers.4.post_attention_norm.weight [1,20]: module.language_model.encoder.layers.3.input_norm.weight [1,20]: module.language_model.encoder.layers.2.post_attention_norm.weight [1,20]: module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight [1,20]: module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight [1,20]: module.language_model.encoder.layers.4.input_norm.weight [1,20]: module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight [1,20]: module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight [1,20]: module.language_model.encoder.layers.0.post_attention_norm.weight [1,20]: module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight [1,20]: module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight [1,20]: module.language_model.encoder.layers.7.post_attention_norm.weight [1,20]: module.language_model.encoder.layers.7.input_norm.weight [1,20]: module.language_model.encoder.layers.5.post_attention_norm.weight [1,20]: module.language_model.encoder.layers.1.input_norm.weight [1,20]: module.language_model.encoder.layers.0.input_norm.weight [1,20]: module.language_model.encoder.layers.6.post_attention_norm.weight [1,20]: [1,20]:module.language_model.encoder.layers.6.input_norm.weight [1,20]: module.language_model.encoder.layers.9.post_attention_norm.weight [1,20]: module.language_model.encoder.layers.9.input_norm.weight [1,20]: module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight [1,20]: module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight [1,20]: module.language_model.encoder.layers.5.input_norm.weight [1,20]: module.language_model.encoder.layers.3.post_attention_norm.weight [1,20]: module.language_model.encoder.layers.1.post_attention_norm.weight [1,20]: module.language_model.encoder.layers.8.post_attention_norm.weight [1,20]: module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight [1,20]: module.language_model.encoder.layers.8.input_norm.weight [1,20]: module.language_model.encoder.layers.2.input_nor[1,20]:m.weight [1,20]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,20]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,14]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,14]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,19]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,19]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,16]:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 [1,16]:Params for bucket 1 (605716480 elements): [1,16]: module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight [1,16]: module.language_model.encoder.layers.7.input_norm.weight [1,16]: module.language_model.encoder.layers.5.post_attention_norm.weight [1,16]: module.language_model.encoder.layers.4.input_norm.weight [1,16]: module.language_model.encoder.layers.0.post_attention_norm.weight [1,16]: module.language_model.encoder.layers.8.input_norm.weight [1,16]: module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight [1,16]: module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight [1,16]: module.language_model.encoder.layers.4.post_attention_norm.weight [1,16]: module.language_model.encoder.layers.7.post_attention_norm.weight [1,16]: module.language_model.encoder.layers.6.post_attention_norm.weight [1,16]: module.language_model.encoder.layers.6.input_norm.weight [1,16]: module.language_model.encoder.layers.2.post_attention_norm.weight [1,16]: module.language_model.encoder.layers.8.post_attention_norm.weight [1,16]: module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight [1,16]: module.language_model.encoder.layers.5.input_norm.weight [1,16]: module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight [1,16]: module.language_model.encoder.layers.0.input_norm.weight [1,16]: module.language_model.encoder.layers.3.input_norm.weight [1,16]: module.language_model.encoder.layers.9.post_attention_norm.weight [1,16]: module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight [1,16]: module.language_model.encoder.layers.1.post_attention_norm.weight [1,16]: module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight [1,16]: module.language_model.encoder.layers.9.input_norm.weight [1,16]: module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight [1,16]: module.language_model.encoder.layers.2.input_norm.weight [1,16]: module.language_model.encoder.layers.1.input_norm.weight [1,16]: module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight [1,16]: module.language_model.encoder.layers.3.post_attention_norm.weight [1,16]: module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight [1,30]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,30]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,2]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,2]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,16]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,16]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,31]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,31]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,0]:> learning rate decay style: cosine [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++ weight.size is in RowParallelLinear: [1,23]:torch.Size([8192, 2048]) [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++++++++++++++++++padding is done [1,23]:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048]) [1,23]:++++++++++++++++++++++++padding is done [1,23]: > number of parameters on (tensor, pipeline) model parallel rank (3, 5): 605716480 [1,23]:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.) [1,23]: self._dummy_overflow_buf = torch.cuda.IntTensor([0]) [1,0]:WARNING: could not find the metadata file ./tmp/qwen1_5_72b/ckpt/latest_checkpointed_iteration.txt [1,0]: will not load any checkpoints and will start from random [1,31]:(min, max) time across ranks (ms): [1,31]: load-checkpoint ................................: (0.58, 0.76) [1,0]:[after model, optimizer, and learning rate scheduler are built] datetime: 2024-09-28 17:10:41 [1,0]:> building train, validation, and test datasets ... [1,0]: > datasets target sizes (minimum size): [1,0]: train: 6400 [1,0]: validation: 640 [1,0]: test: 640 [1,0]:INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.949), (0.949, 0.999), (0.999, 1.0)] [1,0]:> building train, validation, and test datasets for GPT ... [1,0]:INFO:megatron.core.datasets.blended_megatron_dataset_builder:Building dataset splits with cls=GPTDataset, sizes=(6400, 640, 640), and config=GPTDatasetConfig(random_seed=1234, sequence_length=2048, blend=(['./qwen_token/my-qwen_text_document'], None), blend_per_split=[None, None, None], split='949,50,1', split_matrix=[(0, 0.949), (0.949, 0.999), (0.999, 1.0)], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=, reset_position_ids=False, reset_attention_mask=False, eod_mask_loss=False, create_attention_mask=True, drop_last_partial_validation_sequence=True, add_extra_token_to_sequence=True) [1,0]:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ./qwen_token/my-qwen_text_document.idx [1,0]:INFO:megatron.core.datasets.indexed_dataset: Extract the sequence lengths [1,0]:INFO:megatron.core.datasets.indexed_dataset: Extract the sequence pointers [1,0]:INFO:megatron.core.datasets.indexed_dataset: Extract the document indices [1,0]:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 79000 [1,0]:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 79000 [1,0]:INFO:megatron.core.datasets.gpt_dataset:Build and save the GPTDataset train indices [1,0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 107536 [1,0]:INFO:megatron.core.datasets.gpt_dataset:> total number of epochs: 1 [1,0]:INFO:megatron.core.datasets.gpt_dataset:Build and save the GPTDataset valid indices [1,0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 5734 [1,0]:INFO:megatron.core.datasets.gpt_dataset:> total number of epochs: 1 [1,0]:INFO:megatron.core.datasets.gpt_dataset:Build and save the GPTDataset test indices [1,0]:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 743 [1,0]:INFO:megatron.core.datasets.gpt_dataset:> total number of epochs: 4 [1,0]:> finished creating GPT datasets ... [1,0]:[after dataloaders are built] datetime: 2024-09-28 17:10:41 [1,0]:done with setup ... [1,0]:training ... [1,31]:(min, max) time across ranks (ms): [1,31]: model-and-optimizer-setup ......................: (524.77, 564.45) [1,31]: train/valid/test-data-iterators-setup ..........: (49.20, 134.69) [1,0]:[before the start of training step] datetime: 2024-09-28 17:10:41 [1,6]:W0928 17:10:41.720322 21525 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,13]:W0928 17:10:41.687623 24844 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,11]:W0928 17:10:41.687621 24836 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,10]:W0928 17:10:41.687646 24831 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,9]:W0928 17:10:41.687633 24825 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,14]:W0928 17:10:41.687645 24846 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,15]:W0928 17:10:41.687639 24847 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,8]:W0928 17:10:41.687639 24819 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,12]:W0928 17:10:41.687701 24841 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,4]:W0928 17:10:41.720338 21522 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,5]:W0928 17:10:41.720358 21524 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,7]:W0928 17:10:41.720341 21526 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,21]:W0928 17:10:41.638365 6919 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,27]:W0928 17:10:41.663214 8451 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,19]:W0928 17:10:41.638398 6914 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,30]:W0928 17:10:41.663210 8461 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,26]:W0928 17:10:41.663236 8446 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,31]:W0928 17:10:41.663370 8462 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,24]:W0928 17:10:41.663281 8434 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,25]:W0928 17:10:41.663278 8440 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,17]:W0928 17:10:41.638440 6901 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,18]:W0928 17:10:41.638557 6907 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,20]:W0928 17:10:41.638422 6917 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,16]:W0928 17:10:41.638450 6894 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,23]:W0928 17:10:41.638474 6922 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,28]:W0928 17:10:41.663606 8456 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,22]:W0928 17:10:41.647271 6921 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,29]:W0928 17:10:41.672372 8459 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,0]:I0928 17:10:42.460181 21506 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,3]:AUTOTUNE mm(2048x8192, 8192x2560) [1,3]: mm 1.0498 ms 100.0% [1,3]: triton_mm_7 3.1983 ms 32.8% [1,3]: triton_mm_3 3.6087 ms 29.1% [1,3]: triton_mm_5 3.8699 ms 27.1% [1,3]: triton_mm_8 4.1841 ms 25.1% [1,3]: triton_mm_4 4.3498 ms 24.1% [1,3]: triton_mm_6 4.5394 ms 23.1% [1,3]: triton_mm_2 5.1234 ms 20.5% [1,3]: triton_mm_1 5.2411 ms 20.0% [1,3]: triton_mm_0 5.2778 ms 19.9% [1,3]:SingleProcess AUTOTUNE takes 10.2842 seconds [1,1]:AUTOTUNE mm(2048x8192, 8192x2560) [1,1]: mm 1.0295 ms 100.0% [1,1]: triton_mm_7 3.1563 ms 32.6% [1,1]: triton_mm_3 3.6237 ms 28.4% [1,1]: triton_mm_5 3.9042 ms 26.4% [1,1]: triton_mm_8 4.1048 ms 25.1% [1,1]: triton_mm_4 4.3397 ms 23.7% [1,1]: triton_mm_6 4.5323 ms 22.7% [1,1]: triton_mm_0 5.1038 ms 20.2% [1,1]: triton_mm_2 5.1453 ms 20.0% [1,1]: triton_mm_1 5.2684 ms 19.5% [1,1]:SingleProcess AUTOTUNE takes 10.3284 seconds [1,2]:AUTOTUNE mm(2048x8192, 8192x2560) [1,2]: mm 1.0287 ms 100.0% [1,2]: triton_mm_7 3.2200 ms 31.9% [1,2]: triton_mm_3 3.6764 ms 28.0% [1,2]: triton_mm_5 3.9014 ms 26.4% [1,2]: triton_mm_8 4.1172 ms 25.0% [1,2]: triton_mm_4 4.4075 ms 23.3% [1,2]: triton_mm_6 4.5329 ms 22.7% [1,2]: triton_mm_1 5.0831 ms 20.2% [1,2]: triton_mm_0 5.1285 ms 20.1% [1,2]: triton_mm_2 5.1481 ms 20.0% [1,2]:SingleProcess AUTOTUNE takes 10.4001 seconds [1,0]:AUTOTUNE mm(2048x8192, 8192x2560) [1,0]: mm 1.0242 ms 100.0% [1,0]: triton_mm_7 3.1608 ms 32.4% [1,0]: triton_mm_3 3.6354 ms 28.2% [1,0]: triton_mm_5 3.9250 ms 26.1% [1,0]: triton_mm_8 4.1619 ms 24.6% [1,0]: triton_mm_4 4.3849 ms 23.4% [1,0]: triton_mm_6 4.5466 ms 22.5% [1,0]: triton_mm_2 5.0089 ms 20.4% [1,0]: triton_mm_1 5.1330 ms 20.0% [1,0]: triton_mm_0 5.1458 ms 19.9% [1,0]:SingleProcess AUTOTUNE takes 10.4098 seconds [1,3]:AUTOTUNE mm(2048x2048, 2048x8192) [1,3]: mm 0.8634 ms 100.0% [1,3]: triton_mm_14 1.5196 ms 56.8% [1,3]: triton_mm_18 1.5497 ms 55.7% [1,3]: triton_mm_19 1.5765 ms 54.8% [1,3]: triton_mm_15 1.8716 ms 46.1% [1,3]: triton_mm_11 2.0062 ms 43.0% [1,3]: triton_mm_12 2.0176 ms 42.8% [1,3]: triton_mm_13 2.0432 ms 42.3% [1,3]: triton_mm_16 2.5107 ms 34.4% [1,3]: triton_mm_17 3.2397 ms 26.7% [1,3]:SingleProcess AUTOTUNE takes 1.7922 seconds [1,0]:AUTOTUNE mm(2048x2048, 2048x8192) [1,0]: mm 0.8564 ms 100.0% [1,0]: triton_mm_14 1.4993 ms 57.1% [1,0]: triton_mm_18 1.5417 ms 55.5% [1,0]: triton_mm_19 1.5793 ms 54.2% [1,0]: triton_mm_15 1.8664 ms 45.9% [1,0]: triton_mm_12 2.0045 ms 42.7% [1,0]: triton_mm_11 2.0251 ms 42.3% [1,0]: triton_mm_13 2.0588 ms 41.6% [1,0]: triton_mm_16 2.4971 ms 34.3% [1,0]: triton_mm_17 3.2688 ms 26.2% [1,0]:SingleProcess AUTOTUNE takes 1.7752 seconds [1,1]:AUTOTUNE mm(2048x2048, 2048x8192) [1,1]: mm 0.8574 ms 100.0% [1,1]: triton_mm_14 1.5078 ms 56.9% [1,1]: triton_mm_18 1.5438 ms 55.5% [1,1]: triton_mm_19 1.5790 ms 54.3% [1,1]: triton_mm_15 1.8748 ms 45.7% [1,1]: triton_mm_12 2.0099 ms 42.7% [1,1]: triton_mm_11 2.0266 ms 42.3% [1,1]: triton_mm_13 2.0479 ms 41.9% [1,1]: triton_mm_16 2.5000 ms 34.3% [1,1]: triton_mm_17 3.2497 ms 26.4% [1,1]:SingleProcess AUTOTUNE takes 1.7940 seconds [1,2]:AUTOTUNE mm(2048x2048, 2048x8192) [1,2]: mm 0.8571 ms 100.0% [1,2]: triton_mm_14 1.5051 ms 56.9% [1,2]: triton_mm_18 1.5540 ms 55.2% [1,2]: triton_mm_19 1.5806 ms 54.2% [1,2]: triton_mm_15 1.8697 ms 45.8% [1,2]: triton_mm_12 2.0208 ms 42.4% [1,2]: triton_mm_11 2.0212 ms 42.4% [1,2]: triton_mm_13 2.0525 ms 41.8% [1,2]: triton_mm_16 2.5044 ms 34.2% [1,2]: triton_mm_17 3.2073 ms 26.7% [1,2]:SingleProcess AUTOTUNE takes 1.7819 seconds [1,3]:[rank3]:[2024-09-28 17:11:01,910] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,0]:[rank0]:[2024-09-28 17:11:02,004] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,1]:[rank1]:[2024-09-28 17:11:02,069] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,3]:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,3]:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] due to: [1,3]:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,3]:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,3]:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,3]:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] [1,3]:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] [1,2]:[rank2]:[2024-09-28 17:11:02,123] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,0]:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,0]:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] due to: [1,0]:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,0]:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,0]:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,0]:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] [1,0]:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] [1,1]:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,1]:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] due to: [1,1]:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,1]:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,1]:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,1]:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] [1,1]:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] [1,2]:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,2]:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] due to: [1,2]:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,2]:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,2]:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,2]:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] [1,2]:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] [1,3]:AUTOTUNE mm(2048x8192, 8192x14784) [1,3]: mm 5.1855 ms 100.0% [1,3]: triton_mm_29 18.1666 ms 28.5% [1,3]: triton_mm_30 19.0999 ms 27.1% [1,3]: triton_mm_26 19.8113 ms 26.2% [1,3]: triton_mm_27 21.9293 ms 23.6% [1,3]: triton_mm_24 23.4876 ms 22.1% [1,3]: triton_mm_28 25.4800 ms 20.4% [1,3]: triton_mm_25 27.2083 ms 19.1% [1,3]: triton_mm_22 29.6523 ms 17.5% [1,3]: triton_mm_23 33.2988 ms 15.6% [1,3]:SingleProcess AUTOTUNE takes 9.1258 seconds [1,0]:AUTOTUNE mm(2048x8192, 8192x14784) [1,0]: mm 5.0763 ms 100.0% [1,0]: triton_mm_29 17.8295 ms 28.5% [1,0]: triton_mm_30 19.3175 ms 26.3% [1,0]: triton_mm_26 20.0796 ms 25.3% [1,0]: triton_mm_27 22.0445 ms 23.0% [1,0]: triton_mm_24 22.6142 ms 22.4% [1,0]: triton_mm_28 24.9178 ms 20.4% [1,0]: triton_mm_25 26.3039 ms 19.3% [1,0]: triton_mm_22 30.4394 ms 16.7% [1,0]: triton_mm_23 34.1525 ms 14.9% [1,0]:SingleProcess AUTOTUNE takes 9.0679 seconds [1,1]:AUTOTUNE mm(2048x8192, 8192x14784) [1,1]: mm 5.0566 ms 100.0% [1,1]: triton_mm_29 17.3261 ms 29.2% [1,1]: triton_mm_26 19.1989 ms 26.3% [1,1]: triton_mm_30 19.3449 ms 26.1% [1,1]: triton_mm_27 21.8966 ms 23.1% [1,1]: triton_mm_24 22.5473 ms 22.4% [1,1]: triton_mm_25 23.7000 ms 21.3% [1,1]: triton_mm_28 24.8784 ms 20.3% [1,1]: triton_mm_22 32.4732 ms 15.6% [1,1]: triton_mm_31 33.6202 ms 15.0% [1,1]:SingleProcess AUTOTUNE takes 9.2076 seconds [1,2]:AUTOTUNE mm(2048x8192, 8192x14784) [1,2]: mm 5.0838 ms 100.0% [1,2]: triton_mm_29 17.7675 ms 28.6% [1,2]: triton_mm_30 18.3269 ms 27.7% [1,2]: triton_mm_26 19.8405 ms 25.6% [1,2]: triton_mm_27 22.0307 ms 23.1% [1,2]: triton_mm_24 23.4502 ms 21.7% [1,2]: triton_mm_28 24.8083 ms 20.5% [1,2]: triton_mm_25 26.1370 ms 19.5% [1,2]: triton_mm_22 30.5937 ms 16.6% [1,2]: triton_mm_23 30.9062 ms 16.4% [1,2]:SingleProcess AUTOTUNE takes 9.1920 seconds [1,3]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,3]: torch.has_cuda, [1,3]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,3]: torch.has_cudnn, [1,3]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,3]: torch.has_mps, [1,3]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,3]: torch.has_mkldnn, [1,0]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,0]: torch.has_cuda, [1,0]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,0]: torch.has_cudnn, [1,0]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,0]: torch.has_mps, [1,0]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,0]: torch.has_mkldnn, [1,1]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,1]: torch.has_cuda, [1,1]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,1]: torch.has_cudnn, [1,1]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,1]: torch.has_mps, [1,1]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,1]: torch.has_mkldnn, [1,2]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,2]: torch.has_cuda, [1,2]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,2]: torch.has_cudnn, [1,2]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,2]: torch.has_mps, [1,2]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,2]: torch.has_mkldnn, [1,3]:AUTOTUNE mm(2048x7392, 7392x8192) [1,3]: mm 3.3452 ms 100.0% [1,3]: triton_mm_40 3.8633 ms 86.6% [1,3]: triton_mm_36 5.0951 ms 65.7% [1,3]: triton_mm_37 5.5629 ms 60.1% [1,3]: triton_mm_41 7.2336 ms 46.2% [1,3]: triton_mm_38 10.0755 ms 33.2% [1,3]: triton_mm_35 11.6711 ms 28.7% [1,3]: triton_mm_34 12.3092 ms 27.2% [1,3]: triton_mm_33 15.2904 ms 21.9% [1,3]: triton_mm_39 15.7015 ms 21.3% [1,3]:SingleProcess AUTOTUNE takes 2.4936 seconds [1,3]:[rank3]:[2024-09-28 17:11:17,263] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,3]:[rank3]:[2024-09-28 17:11:17,263] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,3]:[rank3]:[2024-09-28 17:11:17,283] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,3]:[rank3]:[2024-09-28 17:11:17,283] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,0]:AUTOTUNE mm(2048x7392, 7392x8192) [1,0]: mm 3.3581 ms 100.0% [1,0]: triton_mm_40 3.8333 ms 87.6% [1,0]: triton_mm_36 5.0018 ms 67.1% [1,0]: triton_mm_41 5.6386 ms 59.6% [1,0]: triton_mm_37 6.0385 ms 55.6% [1,0]: triton_mm_38 10.0579 ms 33.4% [1,0]: triton_mm_35 11.8332 ms 28.4% [1,0]: triton_mm_34 14.9400 ms 22.5% [1,0]: triton_mm_39 15.9981 ms 21.0% [1,0]: triton_mm_33 16.1442 ms 20.8% [1,0]:SingleProcess AUTOTUNE takes 2.5098 seconds [1,0]:[rank0]:[2024-09-28 17:11:17,451] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,0]:[rank0]:[2024-09-28 17:11:17,451] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,0]:[rank0]:[2024-09-28 17:11:17,471] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,0]:[rank0]:[2024-09-28 17:11:17,471] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,2]:AUTOTUNE mm(2048x7392, 7392x8192) [1,2]: mm 3.2400 ms 100.0% [1,2]: triton_mm_40 3.8364 ms 84.5% [1,2]: triton_mm_36 5.0255 ms 64.5% [1,2]: triton_mm_37 5.6424 ms 57.4% [1,2]: triton_mm_41 7.0218 ms 46.1% [1,2]: triton_mm_38 10.1663 ms 31.9% [1,2]: triton_mm_35 11.5364 ms 28.1% [1,2]: triton_mm_34 13.3459 ms 24.3% [1,2]: triton_mm_33 14.6459 ms 22.1% [1,2]: triton_mm_39 15.8142 ms 20.5% [1,2]:SingleProcess AUTOTUNE takes 2.5147 seconds [1,2]:[rank2]:[2024-09-28 17:11:17,614] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,2]:[rank2]:[2024-09-28 17:11:17,614] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,2]:[rank2]:[2024-09-28 17:11:17,634] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,2]:[rank2]:[2024-09-28 17:11:17,634] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,1]:AUTOTUNE mm(2048x7392, 7392x8192) [1,1]: mm 3.3703 ms 100.0% [1,1]: triton_mm_40 3.7507 ms 89.9% [1,1]: triton_mm_36 5.0206 ms 67.1% [1,1]: triton_mm_37 5.7857 ms 58.3% [1,1]: triton_mm_41 6.2546 ms 53.9% [1,1]: triton_mm_38 10.2652 ms 32.8% [1,1]: triton_mm_35 11.7461 ms 28.7% [1,1]: triton_mm_34 13.1324 ms 25.7% [1,1]: triton_mm_39 14.8818 ms 22.6% [1,1]: triton_mm_33 16.2534 ms 20.7% [1,1]:SingleProcess AUTOTUNE takes 2.5369 seconds [1,1]:[rank1]:[2024-09-28 17:11:17,704] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,1]:[rank1]:[2024-09-28 17:11:17,704] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,1]:[rank1]:[2024-09-28 17:11:17,724] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,1]:[rank1]:[2024-09-28 17:11:17,724] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,3]:W0928 17:11:24.389076 21518 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,1]:W0928 17:11:24.391553 21509 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,0]:W0928 17:11:24.393659 21506 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,2]:W0928 17:11:24.399269 21514 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,3]:I0928 17:11:25.779922 21518 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,0]:I0928 17:11:26.047698 21506 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,2]:I0928 17:11:26.222893 21514 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,1]:I0928 17:11:26.331490 21509 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,4]:I0928 17:11:32.974344 21522 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,5]:[rank5]:[2024-09-28 17:11:36,690] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,4]:[rank4]:[2024-09-28 17:11:36,697] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,6]:[rank6]:[2024-09-28 17:11:36,704] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,5]:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,5]:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] due to: [1,5]:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,5]:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,5]:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,5]:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] [1,5]:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] [1,4]:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,4]:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] due to: [1,4]:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,4]:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,4]:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,4]:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] [1,4]:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] [1,6]:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,6]:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] due to: [1,6]:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,6]:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,6]:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,6]:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] [1,6]:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] [1,7]:[rank7]:[2024-09-28 17:11:36,930] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,7]:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,7]:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] due to: [1,7]:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,7]:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,7]:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,7]:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] [1,7]:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] [1,5]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,5]: torch.has_cuda, [1,5]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,5]: torch.has_cudnn, [1,5]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,5]: torch.has_mps, [1,5]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,5]: torch.has_mkldnn, [1,4]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,4]: torch.has_cuda, [1,4]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,4]: torch.has_cudnn, [1,4]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,4]: torch.has_mps, [1,4]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,4]: torch.has_mkldnn, [1,6]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,6]: torch.has_cuda, [1,6]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,6]: torch.has_cudnn, [1,6]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,6]: torch.has_mps, [1,6]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,6]: torch.has_mkldnn, [1,7]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,7]: torch.has_cuda, [1,7]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,7]: torch.has_cudnn, [1,7]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,7]: torch.has_mps, [1,7]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,7]: torch.has_mkldnn, [1,4]:[rank4]:[2024-09-28 17:11:40,537] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,4]:[rank4]:[2024-09-28 17:11:40,537] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,4]:[rank4]:[2024-09-28 17:11:40,558] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,4]:[rank4]:[2024-09-28 17:11:40,558] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,5]:[rank5]:[2024-09-28 17:11:40,588] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,5]:[rank5]:[2024-09-28 17:11:40,588] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,5]:[rank5]:[2024-09-28 17:11:40,609] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,5]:[rank5]:[2024-09-28 17:11:40,609] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,6]:[rank6]:[2024-09-28 17:11:40,639] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,6]:[rank6]:[2024-09-28 17:11:40,639] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,6]:[rank6]:[2024-09-28 17:11:40,659] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,6]:[rank6]:[2024-09-28 17:11:40,659] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,7]:[rank7]:[2024-09-28 17:11:40,690] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,7]:[rank7]:[2024-09-28 17:11:40,690] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,7]:[rank7]:[2024-09-28 17:11:40,711] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,7]:[rank7]:[2024-09-28 17:11:40,711] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,4]:W0928 17:11:47.335281 21522 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,7]:W0928 17:11:47.335413 21526 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,6]:W0928 17:11:47.341578 21525 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,5]:W0928 17:11:47.341688 21524 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,8]:I0928 17:11:54.878262 24819 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,10]:AUTOTUNE mm(2048x8192, 8192x2560) [1,10]: mm 1.0378 ms 100.0% [1,10]: triton_mm_7 3.2085 ms 32.3% [1,10]: triton_mm_3 3.7506 ms 27.7% [1,10]: triton_mm_5 3.8942 ms 26.6% [1,10]: triton_mm_8 4.3069 ms 24.1% [1,10]: triton_mm_4 4.3168 ms 24.0% [1,10]: triton_mm_6 4.5318 ms 22.9% [1,10]: triton_mm_0 5.1230 ms 20.3% [1,10]: triton_mm_1 5.2015 ms 20.0% [1,10]: triton_mm_2 5.2401 ms 19.8% [1,10]:SingleProcess AUTOTUNE takes 10.3330 seconds [1,8]:AUTOTUNE mm(2048x8192, 8192x2560) [1,8]: mm 1.0301 ms 100.0% [1,8]: triton_mm_7 3.2140 ms 32.0% [1,8]: triton_mm_3 3.7502 ms 27.5% [1,8]: triton_mm_5 3.8955 ms 26.4% [1,8]: triton_mm_8 4.1798 ms 24.6% [1,8]: triton_mm_4 4.3744 ms 23.5% [1,8]: triton_mm_6 4.5068 ms 22.9% [1,8]: triton_mm_0 5.1285 ms 20.1% [1,8]: triton_mm_2 5.2497 ms 19.6% [1,8]: triton_mm_1 5.3827 ms 19.1% [1,8]:SingleProcess AUTOTUNE takes 10.4021 seconds [1,11]:AUTOTUNE mm(2048x8192, 8192x2560) [1,11]: mm 1.0266 ms 100.0% [1,11]: triton_mm_7 3.2032 ms 32.0% [1,11]: triton_mm_3 3.6907 ms 27.8% [1,11]: triton_mm_5 3.8802 ms 26.5% [1,11]: triton_mm_8 4.0726 ms 25.2% [1,11]: triton_mm_4 4.3446 ms 23.6% [1,11]: triton_mm_6 4.5232 ms 22.7% [1,11]: triton_mm_0 4.9972 ms 20.5% [1,11]: triton_mm_2 5.1340 ms 20.0% [1,11]: triton_mm_1 5.1699 ms 19.9% [1,11]:SingleProcess AUTOTUNE takes 10.5661 seconds [1,9]:AUTOTUNE mm(2048x8192, 8192x2560) [1,9]: mm 1.0400 ms 100.0% [1,9]: triton_mm_7 3.1960 ms 32.5% [1,9]: triton_mm_3 3.6162 ms 28.8% [1,9]: triton_mm_5 3.8928 ms 26.7% [1,9]: triton_mm_8 4.0675 ms 25.6% [1,9]: triton_mm_4 4.3381 ms 24.0% [1,9]: triton_mm_6 4.5146 ms 23.0% [1,9]: triton_mm_1 5.0942 ms 20.4% [1,9]: triton_mm_0 5.1172 ms 20.3% [1,9]: triton_mm_2 5.1608 ms 20.2% [1,9]:SingleProcess AUTOTUNE takes 10.6917 seconds [1,10]:AUTOTUNE mm(2048x2048, 2048x8192) [1,10]: mm 0.8578 ms 100.0% [1,10]: triton_mm_14 1.4940 ms 57.4% [1,10]: triton_mm_18 1.5448 ms 55.5% [1,10]: triton_mm_19 1.5689 ms 54.7% [1,10]: triton_mm_15 1.8642 ms 46.0% [1,10]: triton_mm_11 1.9785 ms 43.4% [1,10]: triton_mm_12 2.0190 ms 42.5% [1,10]: triton_mm_13 2.0635 ms 41.6% [1,10]: triton_mm_16 2.4162 ms 35.5% [1,10]: triton_mm_17 3.2118 ms 26.7% [1,10]:SingleProcess AUTOTUNE takes 1.7698 seconds [1,8]:AUTOTUNE mm(2048x2048, 2048x8192) [1,8]: mm 0.8600 ms 100.0% [1,8]: triton_mm_14 1.5070 ms 57.1% [1,8]: triton_mm_18 1.5498 ms 55.5% [1,8]: triton_mm_19 1.5738 ms 54.6% [1,8]: triton_mm_15 1.8619 ms 46.2% [1,8]: triton_mm_11 1.9930 ms 43.2% [1,8]: triton_mm_12 2.0232 ms 42.5% [1,8]: triton_mm_13 2.0418 ms 42.1% [1,8]: triton_mm_16 2.4278 ms 35.4% [1,8]: triton_mm_17 3.2021 ms 26.9% [1,8]:SingleProcess AUTOTUNE takes 1.7889 seconds [1,10]:[rank10]:[2024-09-28 17:12:07,727] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,11]:AUTOTUNE mm(2048x2048, 2048x8192) [1,11]: mm 0.8608 ms 100.0% [1,11]: triton_mm_14 1.4989 ms 57.4% [1,11]: triton_mm_18 1.5522 ms 55.5% [1,11]: triton_mm_19 1.5772 ms 54.6% [1,11]: triton_mm_15 1.8843 ms 45.7% [1,11]: triton_mm_11 1.9591 ms 43.9% [1,11]: triton_mm_12 2.0316 ms 42.4% [1,11]: triton_mm_13 2.0363 ms 42.3% [1,11]: triton_mm_16 2.4147 ms 35.6% [1,11]: triton_mm_17 3.1298 ms 27.5% [1,11]:SingleProcess AUTOTUNE takes 1.7861 seconds [1,8]:[rank8]:[2024-09-28 17:12:07,863] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,10]:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,10]:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] due to: [1,10]:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,10]:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,10]:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,10]:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] [1,10]:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] [1,9]:AUTOTUNE mm(2048x2048, 2048x8192) [1,9]: mm 0.8618 ms 100.0% [1,9]: triton_mm_14 1.5059 ms 57.2% [1,9]: triton_mm_18 1.5520 ms 55.5% [1,9]: triton_mm_19 1.5754 ms 54.7% [1,9]: triton_mm_15 1.8899 ms 45.6% [1,9]: triton_mm_11 1.9677 ms 43.8% [1,9]: triton_mm_13 2.0334 ms 42.4% [1,9]: triton_mm_12 2.0339 ms 42.4% [1,9]: triton_mm_16 2.4330 ms 35.4% [1,9]: triton_mm_17 3.1490 ms 27.4% [1,9]:SingleProcess AUTOTUNE takes 1.7963 seconds [1,11]:[rank11]:[2024-09-28 17:12:08,024] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,8]:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,8]:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] due to: [1,8]:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,8]:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,8]:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,8]:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] [1,8]:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] [1,9]:[rank9]:[2024-09-28 17:12:08,142] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,11]:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,11]:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] due to: [1,11]:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,11]:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,11]:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,11]:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] [1,11]:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] [1,9]:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,9]:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] due to: [1,9]:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,9]:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,9]:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,9]:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] [1,9]:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] [1,10]:AUTOTUNE mm(2048x8192, 8192x14784) [1,10]: mm 5.1471 ms 100.0% [1,10]: triton_mm_29 17.8129 ms 28.9% [1,10]: triton_mm_30 19.0960 ms 27.0% [1,10]: triton_mm_26 19.7803 ms 26.0% [1,10]: triton_mm_27 22.0302 ms 23.4% [1,10]: triton_mm_24 22.9596 ms 22.4% [1,10]: triton_mm_25 23.9913 ms 21.5% [1,10]: triton_mm_28 25.4541 ms 20.2% [1,10]: triton_mm_22 30.2564 ms 17.0% [1,10]: triton_mm_23 32.3434 ms 15.9% [1,10]:SingleProcess AUTOTUNE takes 9.3133 seconds [1,8]:AUTOTUNE mm(2048x8192, 8192x14784) [1,8]: mm 5.0688 ms 100.0% [1,8]: triton_mm_29 18.3441 ms 27.6% [1,8]: triton_mm_30 19.3977 ms 26.1% [1,8]: triton_mm_26 20.3070 ms 25.0% [1,8]: triton_mm_27 21.7696 ms 23.3% [1,8]: triton_mm_24 23.4632 ms 21.6% [1,8]: triton_mm_25 24.5947 ms 20.6% [1,8]: triton_mm_28 25.2464 ms 20.1% [1,8]: triton_mm_22 30.4956 ms 16.6% [1,8]: triton_mm_31 32.0996 ms 15.8% [1,8]:SingleProcess AUTOTUNE takes 9.2404 seconds [1,11]:AUTOTUNE mm(2048x8192, 8192x14784) [1,11]: mm 5.0896 ms 100.0% [1,11]: triton_mm_29 17.6160 ms 28.9% [1,11]: triton_mm_30 19.1918 ms 26.5% [1,11]: triton_mm_26 19.5337 ms 26.1% [1,11]: triton_mm_27 21.7936 ms 23.4% [1,11]: triton_mm_24 23.1823 ms 22.0% [1,11]: triton_mm_28 24.8120 ms 20.5% [1,11]: triton_mm_25 25.1007 ms 20.3% [1,11]: triton_mm_22 30.2913 ms 16.8% [1,11]: triton_mm_23 32.0214 ms 15.9% [1,11]:SingleProcess AUTOTUNE takes 9.2340 seconds [1,9]:AUTOTUNE mm(2048x8192, 8192x14784) [1,9]: mm 5.1352 ms 100.0% [1,9]: triton_mm_29 18.1032 ms 28.4% [1,9]: triton_mm_30 19.3904 ms 26.5% [1,9]: triton_mm_26 19.7254 ms 26.0% [1,9]: triton_mm_27 21.9104 ms 23.4% [1,9]: triton_mm_24 22.1791 ms 23.2% [1,9]: triton_mm_25 23.6411 ms 21.7% [1,9]: triton_mm_28 24.8541 ms 20.7% [1,9]: triton_mm_22 31.0699 ms 16.5% [1,9]: triton_mm_31 32.6123 ms 15.7% [1,9]:SingleProcess AUTOTUNE takes 9.1810 seconds [1,10]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,10]: torch.has_cuda, [1,10]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,10]: torch.has_cudnn, [1,10]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,10]: torch.has_mps, [1,10]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,10]: torch.has_mkldnn, [1,8]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,8]: torch.has_cuda, [1,8]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,8]: torch.has_cudnn, [1,8]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,8]: torch.has_mps, [1,8]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,8]: torch.has_mkldnn, [1,11]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,11]: torch.has_cuda, [1,11]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,11]: torch.has_cudnn, [1,11]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,11]: torch.has_mps, [1,11]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,11]: torch.has_mkldnn, [1,9]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,9]: torch.has_cuda, [1,9]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,9]: torch.has_cudnn, [1,9]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,9]: torch.has_mps, [1,9]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,9]: torch.has_mkldnn, [1,10]:AUTOTUNE mm(2048x7392, 7392x8192) [1,10]: mm 3.3741 ms 100.0% [1,10]: triton_mm_40 3.8266 ms 88.2% [1,10]: triton_mm_36 5.0447 ms 66.9% [1,10]: triton_mm_41 5.1620 ms 65.4% [1,10]: triton_mm_37 5.5195 ms 61.1% [1,10]: triton_mm_38 10.1286 ms 33.3% [1,10]: triton_mm_35 11.8828 ms 28.4% [1,10]: triton_mm_34 14.0384 ms 24.0% [1,10]: triton_mm_33 15.4860 ms 21.8% [1,10]: triton_mm_39 16.2022 ms 20.8% [1,10]:SingleProcess AUTOTUNE takes 2.5020 seconds [1,10]:[rank10]:[2024-09-28 17:12:23,286] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,10]:[rank10]:[2024-09-28 17:12:23,286] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,10]:[rank10]:[2024-09-28 17:12:23,306] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,10]:[rank10]:[2024-09-28 17:12:23,306] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,8]:AUTOTUNE mm(2048x7392, 7392x8192) [1,8]: mm 3.3576 ms 100.0% [1,8]: triton_mm_40 3.8105 ms 88.1% [1,8]: triton_mm_36 5.0969 ms 65.9% [1,8]: triton_mm_37 5.6791 ms 59.1% [1,8]: triton_mm_41 6.7754 ms 49.6% [1,8]: triton_mm_38 10.1430 ms 33.1% [1,8]: triton_mm_35 11.7184 ms 28.7% [1,8]: triton_mm_34 14.0096 ms 24.0% [1,8]: triton_mm_33 15.0187 ms 22.4% [1,8]: triton_mm_39 15.2330 ms 22.0% [1,8]:SingleProcess AUTOTUNE takes 2.5036 seconds [1,8]:[rank8]:[2024-09-28 17:12:23,456] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,8]:[rank8]:[2024-09-28 17:12:23,456] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,8]:[rank8]:[2024-09-28 17:12:23,476] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,8]:[rank8]:[2024-09-28 17:12:23,476] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,9]:AUTOTUNE mm(2048x7392, 7392x8192) [1,9]: mm 3.3615 ms 100.0% [1,9]: triton_mm_40 3.8418 ms 87.5% [1,9]: triton_mm_36 5.0737 ms 66.3% [1,9]: triton_mm_41 5.1965 ms 64.7% [1,9]: triton_mm_37 5.3039 ms 63.4% [1,9]: triton_mm_38 10.3903 ms 32.4% [1,9]: triton_mm_35 11.8389 ms 28.4% [1,9]: triton_mm_34 12.8541 ms 26.2% [1,9]: triton_mm_39 15.9118 ms 21.1% [1,9]: triton_mm_33 16.2588 ms 20.7% [1,9]:SingleProcess AUTOTUNE takes 2.5068 seconds [1,9]:[rank9]:[2024-09-28 17:12:23,668] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,9]:[rank9]:[2024-09-28 17:12:23,668] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,9]:[rank9]:[2024-09-28 17:12:23,688] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,9]:[rank9]:[2024-09-28 17:12:23,688] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,11]:AUTOTUNE mm(2048x7392, 7392x8192) [1,11]: mm 3.3541 ms 100.0% [1,11]: triton_mm_40 3.8623 ms 86.8% [1,11]: triton_mm_41 4.9458 ms 67.8% [1,11]: triton_mm_36 5.0301 ms 66.7% [1,11]: triton_mm_37 5.7869 ms 58.0% [1,11]: triton_mm_38 10.0409 ms 33.4% [1,11]: triton_mm_35 11.1122 ms 30.2% [1,11]: triton_mm_34 13.1232 ms 25.6% [1,11]: triton_mm_33 15.6761 ms 21.4% [1,11]: triton_mm_39 15.7778 ms 21.3% [1,11]:SingleProcess AUTOTUNE takes 2.5044 seconds [1,11]:[rank11]:[2024-09-28 17:12:23,797] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,11]:[rank11]:[2024-09-28 17:12:23,797] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,11]:[rank11]:[2024-09-28 17:12:23,818] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,11]:[rank11]:[2024-09-28 17:12:23,818] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,10]:W0928 17:12:30.706014 24831 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,8]:W0928 17:12:30.706085 24819 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,9]:W0928 17:12:30.706105 24825 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,11]:W0928 17:12:30.706283 24836 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,2]:W0928 17:12:31.165109 21514 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,3]:W0928 17:12:31.165242 21518 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,0]:W0928 17:12:31.165334 21506 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,1]:W0928 17:12:31.165256 21509 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,12]:I0928 17:12:37.894549 24841 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,12]:[rank12]:[2024-09-28 17:12:41,570] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,15]:[rank15]:[2024-09-28 17:12:41,648] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,13]:[rank13]:[2024-09-28 17:12:41,701] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,14]:[rank14]:[2024-09-28 17:12:41,706] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,12]:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,12]:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] due to: [1,12]:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,12]:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,12]:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,12]:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] [1,12]:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] [1,15]:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,15]:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] due to: [1,15]:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,15]:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,15]:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,15]:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] [1,15]:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] [1,13]:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,13]:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] due to: [1,13]:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,13]:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,13]:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,13]:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] [1,13]:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] [1,14]:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,14]:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] due to: [1,14]:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,14]:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,14]:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,14]:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] [1,14]:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] [1,12]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,12]: torch.has_cuda, [1,12]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,12]: torch.has_cudnn, [1,12]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,12]: torch.has_mps, [1,12]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,12]: torch.has_mkldnn, [1,15]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,15]: torch.has_cuda, [1,15]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,15]: torch.has_cudnn, [1,15]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,15]: torch.has_mps, [1,15]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,15]: torch.has_mkldnn, [1,13]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,13]: torch.has_cuda, [1,13]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,13]: torch.has_cudnn, [1,13]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,13]: torch.has_mps, [1,13]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,13]: torch.has_mkldnn, [1,14]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,14]: torch.has_cuda, [1,14]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,14]: torch.has_cudnn, [1,14]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,14]: torch.has_mps, [1,14]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,14]: torch.has_mkldnn, [1,12]:[rank12]:[2024-09-28 17:12:45,269] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,12]:[rank12]:[2024-09-28 17:12:45,269] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,12]:[rank12]:[2024-09-28 17:12:45,290] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,12]:[rank12]:[2024-09-28 17:12:45,290] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,15]:[rank15]:[2024-09-28 17:12:45,388] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,15]:[rank15]:[2024-09-28 17:12:45,388] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,15]:[rank15]:[2024-09-28 17:12:45,409] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,15]:[rank15]:[2024-09-28 17:12:45,409] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,14]:[rank14]:[2024-09-28 17:12:45,446] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,14]:[rank14]:[2024-09-28 17:12:45,446] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,14]:[rank14]:[2024-09-28 17:12:45,467] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,14]:[rank14]:[2024-09-28 17:12:45,467] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,13]:[rank13]:[2024-09-28 17:12:45,494] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,13]:[rank13]:[2024-09-28 17:12:45,495] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,13]:[rank13]:[2024-09-28 17:12:45,515] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,13]:[rank13]:[2024-09-28 17:12:45,515] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,13]:W0928 17:12:51.981734 24844 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,14]:W0928 17:12:51.981797 24846 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,15]:W0928 17:12:51.981910 24847 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,12]:W0928 17:12:51.981982 24841 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,16]:I0928 17:12:59.068799 6894 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,16]:AUTOTUNE mm(2048x8192, 8192x2560) [1,16]: mm 1.0190 ms 100.0% [1,16]: triton_mm_7 3.2111 ms 31.7% [1,16]: triton_mm_3 3.6087 ms 28.2% [1,16]: triton_mm_5 3.8998 ms 26.1% [1,16]: triton_mm_8 4.1183 ms 24.7% [1,16]: triton_mm_4 4.3388 ms 23.5% [1,16]: triton_mm_6 4.5380 ms 22.5% [1,16]: triton_mm_2 5.1388 ms 19.8% [1,16]: triton_mm_1 5.1524 ms 19.8% [1,16]: triton_mm_0 5.2148 ms 19.5% [1,16]:SingleProcess AUTOTUNE takes 10.3527 seconds [1,19]:AUTOTUNE mm(2048x8192, 8192x2560) [1,19]: mm 1.0235 ms 100.0% [1,19]: triton_mm_7 3.1793 ms 32.2% [1,19]: triton_mm_3 3.7515 ms 27.3% [1,19]: triton_mm_5 3.9185 ms 26.1% [1,19]: triton_mm_8 4.0877 ms 25.0% [1,19]: triton_mm_4 4.3755 ms 23.4% [1,19]: triton_mm_6 4.5028 ms 22.7% [1,19]: triton_mm_0 5.1844 ms 19.7% [1,19]: triton_mm_1 5.2008 ms 19.7% [1,19]: triton_mm_2 5.2366 ms 19.5% [1,19]:SingleProcess AUTOTUNE takes 10.4489 seconds [1,18]:AUTOTUNE mm(2048x8192, 8192x2560) [1,18]: mm 1.0227 ms 100.0% [1,18]: triton_mm_7 3.2333 ms 31.6% [1,18]: triton_mm_3 3.6914 ms 27.7% [1,18]: triton_mm_5 3.9632 ms 25.8% [1,18]: triton_mm_8 4.2793 ms 23.9% [1,18]: triton_mm_4 4.3592 ms 23.5% [1,18]: triton_mm_6 4.5308 ms 22.6% [1,18]: triton_mm_2 5.1752 ms 19.8% [1,18]: triton_mm_1 5.1757 ms 19.8% [1,18]: triton_mm_0 5.2391 ms 19.5% [1,18]:SingleProcess AUTOTUNE takes 10.5093 seconds [1,17]:AUTOTUNE mm(2048x8192, 8192x2560) [1,17]: mm 1.0296 ms 100.0% [1,17]: triton_mm_7 3.1932 ms 32.2% [1,17]: triton_mm_3 3.6848 ms 27.9% [1,17]: triton_mm_5 3.9222 ms 26.3% [1,17]: triton_mm_8 4.2350 ms 24.3% [1,17]: triton_mm_4 4.3359 ms 23.7% [1,17]: triton_mm_6 4.5138 ms 22.8% [1,17]: triton_mm_0 5.0115 ms 20.5% [1,17]: triton_mm_1 5.0366 ms 20.4% [1,17]: triton_mm_2 5.2412 ms 19.6% [1,17]:SingleProcess AUTOTUNE takes 10.5546 seconds [1,16]:AUTOTUNE mm(2048x2048, 2048x8192) [1,16]: mm 0.8559 ms 100.0% [1,16]: triton_mm_14 1.5090 ms 56.7% [1,16]: triton_mm_18 1.5495 ms 55.2% [1,16]: triton_mm_19 1.5749 ms 54.3% [1,16]: triton_mm_15 1.8731 ms 45.7% [1,16]: triton_mm_11 1.9632 ms 43.6% [1,16]: triton_mm_12 2.0257 ms 42.2% [1,16]: triton_mm_13 2.0380 ms 42.0% [1,16]: triton_mm_16 2.4394 ms 35.1% [1,16]: triton_mm_17 3.2313 ms 26.5% [1,16]:SingleProcess AUTOTUNE takes 1.7943 seconds [1,19]:AUTOTUNE mm(2048x2048, 2048x8192) [1,19]: mm 0.8568 ms 100.0% [1,19]: triton_mm_14 1.5150 ms 56.6% [1,19]: triton_mm_18 1.5533 ms 55.2% [1,19]: triton_mm_19 1.5725 ms 54.5% [1,19]: triton_mm_15 1.8858 ms 45.4% [1,19]: triton_mm_11 1.9758 ms 43.4% [1,19]: triton_mm_12 2.0317 ms 42.2% [1,19]: triton_mm_13 2.0470 ms 41.9% [1,19]: triton_mm_16 2.4290 ms 35.3% [1,19]: triton_mm_17 3.1559 ms 27.1% [1,19]:SingleProcess AUTOTUNE takes 1.7534 seconds [1,18]:AUTOTUNE mm(2048x2048, 2048x8192) [1,18]: mm 0.8570 ms 100.0% [1,18]: triton_mm_14 1.5101 ms 56.8% [1,18]: triton_mm_18 1.5481 ms 55.4% [1,18]: triton_mm_19 1.5707 ms 54.6% [1,18]: triton_mm_15 1.8746 ms 45.7% [1,18]: triton_mm_11 1.9809 ms 43.3% [1,18]: triton_mm_12 2.0145 ms 42.5% [1,18]: triton_mm_13 2.0616 ms 41.6% [1,18]: triton_mm_16 2.4156 ms 35.5% [1,18]: triton_mm_17 3.1658 ms 27.1% [1,18]:SingleProcess AUTOTUNE takes 1.7824 seconds [1,16]:[rank16]:[2024-09-28 17:13:12,044] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,17]:AUTOTUNE mm(2048x2048, 2048x8192) [1,17]: mm 0.8564 ms 100.0% [1,17]: triton_mm_14 1.5045 ms 56.9% [1,17]: triton_mm_18 1.5483 ms 55.3% [1,17]: triton_mm_19 1.5658 ms 54.7% [1,17]: triton_mm_15 1.8664 ms 45.9% [1,17]: triton_mm_11 1.9753 ms 43.4% [1,17]: triton_mm_12 2.0206 ms 42.4% [1,17]: triton_mm_13 2.0464 ms 41.9% [1,17]: triton_mm_16 2.4359 ms 35.2% [1,17]: triton_mm_17 3.1247 ms 27.4% [1,17]:SingleProcess AUTOTUNE takes 1.7854 seconds [1,19]:[rank19]:[2024-09-28 17:13:12,201] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,18]:[rank18]:[2024-09-28 17:13:12,202] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,16]:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,16]:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] due to: [1,16]:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,16]:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,16]:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,16]:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] [1,16]:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] [1,17]:[rank17]:[2024-09-28 17:13:12,264] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,19]:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,19]:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] due to: [1,19]:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,19]:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,19]:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,19]:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] [1,19]:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] [1,18]:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,18]:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] due to: [1,18]:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,18]:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,18]:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,18]:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] [1,18]:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] [1,17]:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,17]:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] due to: [1,17]:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,17]:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,17]:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,17]:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] [1,17]:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] [1,16]:AUTOTUNE mm(2048x8192, 8192x14784) [1,16]: mm 5.0306 ms 100.0% [1,16]: triton_mm_29 16.9649 ms 29.7% [1,16]: triton_mm_30 18.3686 ms 27.4% [1,16]: triton_mm_26 19.7958 ms 25.4% [1,16]: triton_mm_27 22.1148 ms 22.7% [1,16]: triton_mm_24 23.3976 ms 21.5% [1,16]: triton_mm_28 25.6505 ms 19.6% [1,16]: triton_mm_25 29.3881 ms 17.1% [1,16]: triton_mm_23 30.9675 ms 16.2% [1,16]: triton_mm_22 32.8563 ms 15.3% [1,16]:SingleProcess AUTOTUNE takes 9.0260 seconds [1,18]:AUTOTUNE mm(2048x8192, 8192x14784) [1,18]: mm 5.0678 ms 100.0% [1,18]: triton_mm_29 17.1335 ms 29.6% [1,18]: triton_mm_30 18.9359 ms 26.8% [1,18]: triton_mm_26 19.4301 ms 26.1% [1,18]: triton_mm_27 21.9373 ms 23.1% [1,18]: triton_mm_24 22.5571 ms 22.5% [1,18]: triton_mm_28 24.8174 ms 20.4% [1,18]: triton_mm_25 27.0890 ms 18.7% [1,18]: triton_mm_22 31.6592 ms 16.0% [1,18]: triton_mm_31 34.5828 ms 14.7% [1,18]:SingleProcess AUTOTUNE takes 9.0317 seconds [1,19]:AUTOTUNE mm(2048x8192, 8192x14784) [1,19]: mm 5.0343 ms 100.0% [1,19]: triton_mm_29 17.3920 ms 28.9% [1,19]: triton_mm_30 19.4286 ms 25.9% [1,19]: triton_mm_26 19.8028 ms 25.4% [1,19]: triton_mm_27 21.7512 ms 23.1% [1,19]: triton_mm_24 22.9824 ms 21.9% [1,19]: triton_mm_28 24.8577 ms 20.3% [1,19]: triton_mm_25 26.1739 ms 19.2% [1,19]: triton_mm_22 30.5890 ms 16.5% [1,19]: triton_mm_31 33.4553 ms 15.0% [1,19]:SingleProcess AUTOTUNE takes 9.0865 seconds [1,17]:AUTOTUNE mm(2048x8192, 8192x14784) [1,17]: mm 5.1117 ms 100.0% [1,17]: triton_mm_29 18.0306 ms 28.4% [1,17]: triton_mm_30 19.2974 ms 26.5% [1,17]: triton_mm_26 20.3030 ms 25.2% [1,17]: triton_mm_27 21.9314 ms 23.3% [1,17]: triton_mm_24 22.7694 ms 22.5% [1,17]: triton_mm_25 23.7199 ms 21.6% [1,17]: triton_mm_28 25.5769 ms 20.0% [1,17]: triton_mm_22 31.2066 ms 16.4% [1,17]: triton_mm_31 32.1495 ms 15.9% [1,17]:SingleProcess AUTOTUNE takes 9.0968 seconds [1,16]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,16]: torch.has_cuda, [1,16]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,16]: torch.has_cudnn, [1,16]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,16]: torch.has_mps, [1,16]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,16]: torch.has_mkldnn, [1,18]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,18]: torch.has_cuda, [1,18]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,18]: torch.has_cudnn, [1,18]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,18]: torch.has_mps, [1,18]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,18]: torch.has_mkldnn, [1,19]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,19]: torch.has_cuda, [1,19]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,19]: torch.has_cudnn, [1,19]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,19]: torch.has_mps, [1,19]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,19]: torch.has_mkldnn, [1,17]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,17]: torch.has_cuda, [1,17]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,17]: torch.has_cudnn, [1,17]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,17]: torch.has_mps, [1,17]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,17]: torch.has_mkldnn, [1,16]:AUTOTUNE mm(2048x7392, 7392x8192) [1,16]: mm 3.3568 ms 100.0% [1,16]: triton_mm_40 3.8584 ms 87.0% [1,16]: triton_mm_41 5.0432 ms 66.6% [1,16]: triton_mm_36 5.0435 ms 66.6% [1,16]: triton_mm_37 5.7070 ms 58.8% [1,16]: triton_mm_38 10.0488 ms 33.4% [1,16]: triton_mm_35 11.6857 ms 28.7% [1,16]: triton_mm_34 13.0232 ms 25.8% [1,16]: triton_mm_33 13.6706 ms 24.6% [1,16]: triton_mm_39 15.7521 ms 21.3% [1,16]:SingleProcess AUTOTUNE takes 2.4829 seconds [1,16]:[rank16]:[2024-09-28 17:13:27,494] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,16]:[rank16]:[2024-09-28 17:13:27,494] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,16]:[rank16]:[2024-09-28 17:13:27,514] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,16]:[rank16]:[2024-09-28 17:13:27,514] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,18]:AUTOTUNE mm(2048x7392, 7392x8192) [1,18]: mm 3.3547 ms 100.0% [1,18]: triton_mm_40 3.8231 ms 87.7% [1,18]: triton_mm_36 5.0457 ms 66.5% [1,18]: triton_mm_37 5.7275 ms 58.6% [1,18]: triton_mm_41 7.7503 ms 43.3% [1,18]: triton_mm_38 9.9515 ms 33.7% [1,18]: triton_mm_35 11.5821 ms 29.0% [1,18]: triton_mm_34 14.0749 ms 23.8% [1,18]: triton_mm_39 15.9786 ms 21.0% [1,18]: triton_mm_33 16.1320 ms 20.8% [1,18]:SingleProcess AUTOTUNE takes 2.5257 seconds [1,18]:[rank18]:[2024-09-28 17:13:27,660] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,18]:[rank18]:[2024-09-28 17:13:27,660] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,18]:[rank18]:[2024-09-28 17:13:27,680] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,18]:[rank18]:[2024-09-28 17:13:27,680] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,19]:AUTOTUNE mm(2048x7392, 7392x8192) [1,19]: mm 3.3595 ms 100.0% [1,19]: triton_mm_40 3.7699 ms 89.1% [1,19]: triton_mm_36 5.0412 ms 66.6% [1,19]: triton_mm_37 5.5369 ms 60.7% [1,19]: triton_mm_41 5.9061 ms 56.9% [1,19]: triton_mm_38 10.1118 ms 33.2% [1,19]: triton_mm_35 11.7117 ms 28.7% [1,19]: triton_mm_34 12.5776 ms 26.7% [1,19]: triton_mm_39 15.3133 ms 21.9% [1,19]: triton_mm_33 16.7293 ms 20.1% [1,19]:SingleProcess AUTOTUNE takes 2.4784 seconds [1,19]:[rank19]:[2024-09-28 17:13:27,779] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,19]:[rank19]:[2024-09-28 17:13:27,779] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,19]:[rank19]:[2024-09-28 17:13:27,799] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,19]:[rank19]:[2024-09-28 17:13:27,799] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,17]:AUTOTUNE mm(2048x7392, 7392x8192) [1,17]: mm 3.2347 ms 100.0% [1,17]: triton_mm_40 3.7981 ms 85.2% [1,17]: triton_mm_36 5.0346 ms 64.2% [1,17]: triton_mm_41 5.3511 ms 60.4% [1,17]: triton_mm_37 5.6873 ms 56.9% [1,17]: triton_mm_38 10.2980 ms 31.4% [1,17]: triton_mm_35 12.0414 ms 26.9% [1,17]: triton_mm_34 14.6746 ms 22.0% [1,17]: triton_mm_33 15.7319 ms 20.6% [1,17]: triton_mm_39 16.3028 ms 19.8% [1,17]:SingleProcess AUTOTUNE takes 2.5271 seconds [1,17]:[rank17]:[2024-09-28 17:13:27,851] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,17]:[rank17]:[2024-09-28 17:13:27,852] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,17]:[rank17]:[2024-09-28 17:13:27,872] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,17]:[rank17]:[2024-09-28 17:13:27,872] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,16]:W0928 17:13:34.576684 6894 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,19]:W0928 17:13:34.576723 6914 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,17]:W0928 17:13:34.576835 6901 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,18]:W0928 17:13:34.576884 6907 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,20]:I0928 17:13:41.432934 6917 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,22]:[rank22]:[2024-09-28 17:13:45,333] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,23]:[rank23]:[2024-09-28 17:13:45,395] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,21]:[rank21]:[2024-09-28 17:13:45,496] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,20]:[rank20]:[2024-09-28 17:13:45,496] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,22]:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,22]:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] due to: [1,22]:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,22]:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,22]:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,22]:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] [1,22]:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] [1,23]:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,23]:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] due to: [1,23]:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,23]:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,23]:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,23]:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] [1,23]:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] [1,21]:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,21]:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] due to: [1,21]:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,21]:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,21]:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,21]:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] [1,21]:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] [1,20]:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,20]:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] due to: [1,20]:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,20]:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,20]:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,20]:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] [1,20]:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] [1,22]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,22]: torch.has_cuda, [1,22]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,22]: torch.has_cudnn, [1,22]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,22]: torch.has_mps, [1,22]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,22]: torch.has_mkldnn, [1,23]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,23]: torch.has_cuda, [1,23]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,23]: torch.has_cudnn, [1,23]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,23]: torch.has_mps, [1,23]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,23]: torch.has_mkldnn, [1,20]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,20]: torch.has_cuda, [1,20]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,20]: torch.has_cudnn, [1,20]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,20]: torch.has_mps, [1,20]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,20]: torch.has_mkldnn, [1,21]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,21]: torch.has_cuda, [1,21]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,21]: torch.has_cudnn, [1,21]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,21]: torch.has_mps, [1,21]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,21]: torch.has_mkldnn, [1,22]:[rank22]:[2024-09-28 17:13:49,037] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,22]:[rank22]:[2024-09-28 17:13:49,037] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,22]:[rank22]:[2024-09-28 17:13:49,058] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,22]:[rank22]:[2024-09-28 17:13:49,058] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,23]:[rank23]:[2024-09-28 17:13:49,279] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,23]:[rank23]:[2024-09-28 17:13:49,279] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,23]:[rank23]:[2024-09-28 17:13:49,301] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,23]:[rank23]:[2024-09-28 17:13:49,301] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,20]:[rank20]:[2024-09-28 17:13:49,424] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,20]:[rank20]:[2024-09-28 17:13:49,424] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,20]:[rank20]:[2024-09-28 17:13:49,445] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,20]:[rank20]:[2024-09-28 17:13:49,445] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,21]:[rank21]:[2024-09-28 17:13:49,484] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,21]:[rank21]:[2024-09-28 17:13:49,484] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,21]:[rank21]:[2024-09-28 17:13:49,505] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,21]:[rank21]:[2024-09-28 17:13:49,505] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,23]:W0928 17:13:56.501966 6922 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,22]:W0928 17:13:56.501960 6921 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,20]:W0928 17:13:56.502000 6917 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,21]:W0928 17:13:56.502178 6919 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,24]:I0928 17:14:03.832532 8434 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,24]:AUTOTUNE mm(2048x8192, 8192x2560) [1,24]: mm 1.0391 ms 100.0% [1,24]: triton_mm_7 3.1755 ms 32.7% [1,24]: triton_mm_3 3.6507 ms 28.5% [1,24]: triton_mm_5 3.8906 ms 26.7% [1,24]: triton_mm_8 4.0837 ms 25.4% [1,24]: triton_mm_4 4.3992 ms 23.6% [1,24]: triton_mm_6 4.5329 ms 22.9% [1,24]: triton_mm_1 4.8434 ms 21.5% [1,24]: triton_mm_2 5.1816 ms 20.1% [1,24]: triton_mm_0 5.1966 ms 20.0% [1,24]:SingleProcess AUTOTUNE takes 10.1101 seconds [1,27]:AUTOTUNE mm(2048x8192, 8192x2560) [1,27]: mm 1.0300 ms 100.0% [1,27]: triton_mm_7 3.2299 ms 31.9% [1,27]: triton_mm_3 3.5824 ms 28.8% [1,27]: triton_mm_5 3.9000 ms 26.4% [1,27]: triton_mm_8 4.1541 ms 24.8% [1,27]: triton_mm_4 4.3677 ms 23.6% [1,27]: triton_mm_6 4.5098 ms 22.8% [1,27]: triton_mm_2 5.0905 ms 20.2% [1,27]: triton_mm_0 5.1741 ms 19.9% [1,27]: triton_mm_1 5.3158 ms 19.4% [1,27]:SingleProcess AUTOTUNE takes 10.5964 seconds [1,26]:AUTOTUNE mm(2048x8192, 8192x2560) [1,26]: mm 1.0305 ms 100.0% [1,26]: triton_mm_7 3.2091 ms 32.1% [1,26]: triton_mm_3 3.7131 ms 27.8% [1,26]: triton_mm_5 3.9533 ms 26.1% [1,26]: triton_mm_8 4.2911 ms 24.0% [1,26]: triton_mm_4 4.3884 ms 23.5% [1,26]: triton_mm_6 4.5206 ms 22.8% [1,26]: triton_mm_2 5.0828 ms 20.3% [1,26]: triton_mm_0 5.2260 ms 19.7% [1,26]: triton_mm_1 5.3700 ms 19.2% [1,26]:SingleProcess AUTOTUNE takes 10.6384 seconds [1,25]:AUTOTUNE mm(2048x8192, 8192x2560) [1,25]: mm 1.0298 ms 100.0% [1,25]: triton_mm_7 3.2100 ms 32.1% [1,25]: triton_mm_3 3.7548 ms 27.4% [1,25]: triton_mm_5 3.8834 ms 26.5% [1,25]: triton_mm_8 4.2774 ms 24.1% [1,25]: triton_mm_4 4.3370 ms 23.7% [1,25]: triton_mm_6 4.4937 ms 22.9% [1,25]: triton_mm_1 5.0691 ms 20.3% [1,25]: triton_mm_0 5.1841 ms 19.9% [1,25]: triton_mm_2 5.2164 ms 19.7% [1,25]:SingleProcess AUTOTUNE takes 10.6731 seconds [1,24]:AUTOTUNE mm(2048x2048, 2048x8192) [1,24]: mm 0.8595 ms 100.0% [1,24]: triton_mm_14 1.5019 ms 57.2% [1,24]: triton_mm_18 1.5523 ms 55.4% [1,24]: triton_mm_19 1.5722 ms 54.7% [1,24]: triton_mm_15 1.8682 ms 46.0% [1,24]: triton_mm_11 1.9915 ms 43.2% [1,24]: triton_mm_12 2.0223 ms 42.5% [1,24]: triton_mm_13 2.0627 ms 41.7% [1,24]: triton_mm_16 2.4373 ms 35.3% [1,24]: triton_mm_17 3.1016 ms 27.7% [1,24]:SingleProcess AUTOTUNE takes 1.7573 seconds [1,24]:[rank24]:[2024-09-28 17:14:16,434] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,24]:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,24]:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] due to: [1,24]:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,24]:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,24]:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,24]:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] [1,24]:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] [1,27]:AUTOTUNE mm(2048x2048, 2048x8192) [1,27]: mm 0.8593 ms 100.0% [1,27]: triton_mm_14 1.5113 ms 56.9% [1,27]: triton_mm_18 1.5460 ms 55.6% [1,27]: triton_mm_19 1.5695 ms 54.7% [1,27]: triton_mm_15 1.8808 ms 45.7% [1,27]: triton_mm_11 1.9772 ms 43.5% [1,27]: triton_mm_12 2.0304 ms 42.3% [1,27]: triton_mm_13 2.0558 ms 41.8% [1,27]: triton_mm_16 2.4557 ms 35.0% [1,27]: triton_mm_17 3.1676 ms 27.1% [1,27]:SingleProcess AUTOTUNE takes 1.7926 seconds [1,26]:AUTOTUNE mm(2048x2048, 2048x8192) [1,26]: mm 0.8599 ms 100.0% [1,26]: triton_mm_14 1.5139 ms 56.8% [1,26]: triton_mm_18 1.5590 ms 55.2% [1,26]: triton_mm_19 1.5747 ms 54.6% [1,26]: triton_mm_15 1.8672 ms 46.1% [1,26]: triton_mm_11 1.9782 ms 43.5% [1,26]: triton_mm_12 2.0314 ms 42.3% [1,26]: triton_mm_13 2.0495 ms 42.0% [1,26]: triton_mm_16 2.4120 ms 35.7% [1,26]: triton_mm_17 3.2531 ms 26.4% [1,26]:SingleProcess AUTOTUNE takes 1.9581 seconds [1,25]:AUTOTUNE mm(2048x2048, 2048x8192) [1,25]: mm 0.8618 ms 100.0% [1,25]: triton_mm_14 1.5112 ms 57.0% [1,25]: triton_mm_18 1.5500 ms 55.6% [1,25]: triton_mm_19 1.5699 ms 54.9% [1,25]: triton_mm_15 1.8734 ms 46.0% [1,25]: triton_mm_11 1.9760 ms 43.6% [1,25]: triton_mm_12 2.0074 ms 42.9% [1,25]: triton_mm_13 2.0494 ms 42.1% [1,25]: triton_mm_16 2.4026 ms 35.9% [1,25]: triton_mm_17 3.1352 ms 27.5% [1,25]:SingleProcess AUTOTUNE takes 1.7803 seconds [1,27]:[rank27]:[2024-09-28 17:14:17,004] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,26]:[rank26]:[2024-09-28 17:14:17,060] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,25]:[rank25]:[2024-09-28 17:14:17,080] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,27]:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,27]:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] due to: [1,27]:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,27]:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,27]:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,27]:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] [1,27]:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] [1,26]:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,26]:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] due to: [1,26]:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,26]:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,26]:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,26]:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] [1,26]:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] [1,25]:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,25]:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] due to: [1,25]:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,25]:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,25]:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,25]:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] [1,25]:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] [1,24]:AUTOTUNE mm(2048x8192, 8192x14784) [1,24]: mm 5.1857 ms 100.0% [1,24]: triton_mm_30 17.9639 ms 28.9% [1,24]: triton_mm_29 18.2298 ms 28.4% [1,24]: triton_mm_26 18.8290 ms 27.5% [1,24]: triton_mm_27 22.2413 ms 23.3% [1,24]: triton_mm_24 23.3049 ms 22.3% [1,24]: triton_mm_28 25.0196 ms 20.7% [1,24]: triton_mm_25 26.7174 ms 19.4% [1,24]: triton_mm_23 30.4225 ms 17.0% [1,24]: triton_mm_22 30.6673 ms 16.9% [1,24]:SingleProcess AUTOTUNE takes 9.3091 seconds [1,24]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,24]: torch.has_cuda, [1,24]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,24]: torch.has_cudnn, [1,24]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,24]: torch.has_mps, [1,24]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,24]: torch.has_mkldnn, [1,27]:AUTOTUNE mm(2048x8192, 8192x14784) [1,27]: mm 5.0847 ms 100.0% [1,27]: triton_mm_29 16.9729 ms 30.0% [1,27]: triton_mm_30 18.4372 ms 27.6% [1,27]: triton_mm_26 18.6484 ms 27.3% [1,27]: triton_mm_27 21.7123 ms 23.4% [1,27]: triton_mm_24 23.5352 ms 21.6% [1,27]: triton_mm_28 24.9348 ms 20.4% [1,27]: triton_mm_25 26.4609 ms 19.2% [1,27]: triton_mm_22 30.5302 ms 16.7% [1,27]: triton_mm_23 31.2967 ms 16.2% [1,27]:SingleProcess AUTOTUNE takes 9.1324 seconds [1,25]:AUTOTUNE mm(2048x8192, 8192x14784) [1,25]: mm 5.0961 ms 100.0% [1,25]: triton_mm_29 18.0852 ms 28.2% [1,25]: triton_mm_30 18.9151 ms 26.9% [1,25]: triton_mm_26 20.1964 ms 25.2% [1,25]: triton_mm_27 21.8812 ms 23.3% [1,25]: triton_mm_25 22.3130 ms 22.8% [1,25]: triton_mm_24 22.5970 ms 22.6% [1,25]: triton_mm_28 24.7673 ms 20.6% [1,25]: triton_mm_22 30.7182 ms 16.6% [1,25]: triton_mm_31 33.4807 ms 15.2% [1,25]:SingleProcess AUTOTUNE takes 9.1230 seconds [1,26]:AUTOTUNE mm(2048x8192, 8192x14784) [1,26]: mm 5.1129 ms 100.0% [1,26]: triton_mm_30 18.2823 ms 28.0% [1,26]: triton_mm_29 18.7388 ms 27.3% [1,26]: triton_mm_26 19.0723 ms 26.8% [1,26]: triton_mm_27 21.8754 ms 23.4% [1,26]: triton_mm_24 23.8963 ms 21.4% [1,26]: triton_mm_28 25.0097 ms 20.4% [1,26]: triton_mm_25 25.2861 ms 20.2% [1,26]: triton_mm_22 29.8695 ms 17.1% [1,26]: triton_mm_31 32.5502 ms 15.7% [1,26]:SingleProcess AUTOTUNE takes 9.3279 seconds [1,27]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,27]: torch.has_cuda, [1,27]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,27]: torch.has_cudnn, [1,27]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,27]: torch.has_mps, [1,27]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,27]: torch.has_mkldnn, [1,25]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,25]: torch.has_cuda, [1,25]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,25]: torch.has_cudnn, [1,25]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,25]: torch.has_mps, [1,25]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,25]: torch.has_mkldnn, [1,26]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,26]: torch.has_cuda, [1,26]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,26]: torch.has_cudnn, [1,26]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,26]: torch.has_mps, [1,26]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,26]: torch.has_mkldnn, [1,24]:AUTOTUNE mm(2048x7392, 7392x8192) [1,24]: mm 3.3505 ms 100.0% [1,24]: triton_mm_40 3.8707 ms 86.6% [1,24]: triton_mm_36 5.1307 ms 65.3% [1,24]: triton_mm_37 5.2460 ms 63.9% [1,24]: triton_mm_41 6.7128 ms 49.9% [1,24]: triton_mm_38 10.2252 ms 32.8% [1,24]: triton_mm_35 11.4638 ms 29.2% [1,24]: triton_mm_34 14.1986 ms 23.6% [1,24]: triton_mm_33 15.5529 ms 21.5% [1,24]: triton_mm_39 16.3878 ms 20.4% [1,24]:SingleProcess AUTOTUNE takes 2.5218 seconds [1,24]:[rank24]:[2024-09-28 17:14:32,005] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,24]:[rank24]:[2024-09-28 17:14:32,005] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,24]:[rank24]:[2024-09-28 17:14:32,024] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,24]:[rank24]:[2024-09-28 17:14:32,025] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,27]:AUTOTUNE mm(2048x7392, 7392x8192) [1,27]: mm 3.3702 ms 100.0% [1,27]: triton_mm_40 3.7775 ms 89.2% [1,27]: triton_mm_36 5.1127 ms 65.9% [1,27]: triton_mm_37 5.7541 ms 58.6% [1,27]: triton_mm_41 7.0124 ms 48.1% [1,27]: triton_mm_38 10.1167 ms 33.3% [1,27]: triton_mm_35 12.0237 ms 28.0% [1,27]: triton_mm_34 14.0471 ms 24.0% [1,27]: triton_mm_39 15.8385 ms 21.3% [1,27]: triton_mm_33 16.7307 ms 20.1% [1,27]:SingleProcess AUTOTUNE takes 2.5555 seconds [1,27]:[rank27]:[2024-09-28 17:14:32,503] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,27]:[rank27]:[2024-09-28 17:14:32,503] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,27]:[rank27]:[2024-09-28 17:14:32,523] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,27]:[rank27]:[2024-09-28 17:14:32,523] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,25]:AUTOTUNE mm(2048x7392, 7392x8192) [1,25]: mm 3.3742 ms 100.0% [1,25]: triton_mm_40 3.8454 ms 87.7% [1,25]: triton_mm_36 5.0190 ms 67.2% [1,25]: triton_mm_41 5.3428 ms 63.2% [1,25]: triton_mm_37 5.7840 ms 58.3% [1,25]: triton_mm_38 10.0044 ms 33.7% [1,25]: triton_mm_35 11.8802 ms 28.4% [1,25]: triton_mm_34 13.9136 ms 24.3% [1,25]: triton_mm_33 15.3226 ms 22.0% [1,25]: triton_mm_39 16.1431 ms 20.9% [1,25]:SingleProcess AUTOTUNE takes 2.5182 seconds [1,25]:[rank25]:[2024-09-28 17:14:32,609] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,25]:[rank25]:[2024-09-28 17:14:32,609] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,25]:[rank25]:[2024-09-28 17:14:32,630] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,25]:[rank25]:[2024-09-28 17:14:32,630] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,26]:AUTOTUNE mm(2048x7392, 7392x8192) [1,26]: mm 3.2780 ms 100.0% [1,26]: triton_mm_40 3.8086 ms 86.1% [1,26]: triton_mm_36 5.0192 ms 65.3% [1,26]: triton_mm_41 5.2896 ms 62.0% [1,26]: triton_mm_37 6.1512 ms 53.3% [1,26]: triton_mm_38 10.2586 ms 32.0% [1,26]: triton_mm_35 11.9237 ms 27.5% [1,26]: triton_mm_34 13.6025 ms 24.1% [1,26]: triton_mm_33 15.2115 ms 21.5% [1,26]: triton_mm_39 15.4452 ms 21.2% [1,26]:SingleProcess AUTOTUNE takes 2.5798 seconds [1,26]:[rank26]:[2024-09-28 17:14:32,983] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,26]:[rank26]:[2024-09-28 17:14:32,983] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,26]:[rank26]:[2024-09-28 17:14:33,004] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,26]:[rank26]:[2024-09-28 17:14:33,004] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,24]:W0928 17:14:39.956445 8434 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,25]:W0928 17:14:39.956475 8440 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,27]:W0928 17:14:39.956598 8451 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,26]:W0928 17:14:39.956630 8446 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,28]:I0928 17:14:40.918779 8456 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,28]:[rank28]:[2024-09-28 17:14:50,940] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,31]:[rank31]:[2024-09-28 17:14:50,968] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,30]:[rank30]:[2024-09-28 17:14:51,019] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,29]:[rank29]:[2024-09-28 17:14:51,030] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations. [1,28]:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,28]:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] due to: [1,28]:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,28]:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,28]:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,28]:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] [1,28]:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] [1,31]:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,31]:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] due to: [1,31]:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,31]:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,31]:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,31]:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] [1,31]:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] [1,30]:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,30]:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] due to: [1,30]:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,30]:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,30]:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,30]:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] [1,30]:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] [1,29]:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 [1,29]:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] due to: [1,29]:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last): [1,29]:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] File "", line 1, in [1,29]:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined [1,29]:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] [1,29]:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] [1,28]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,28]: torch.has_cuda, [1,28]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,28]: torch.has_cudnn, [1,28]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,28]: torch.has_mps, [1,28]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,28]: torch.has_mkldnn, [1,31]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,31]: torch.has_cuda, [1,31]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,31]: torch.has_cudnn, [1,31]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,31]: torch.has_mps, [1,31]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,31]: torch.has_mkldnn, [1,30]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,30]: torch.has_cuda, [1,30]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,30]: torch.has_cudnn, [1,30]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,30]: torch.has_mps, [1,30]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,30]: torch.has_mkldnn, [1,29]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()' [1,29]: torch.has_cuda, [1,29]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()' [1,29]: torch.has_cudnn, [1,29]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()' [1,29]: torch.has_mps, [1,29]:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()' [1,29]: torch.has_mkldnn, [1,28]:[rank28]:[2024-09-28 17:14:54,695] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,28]:[rank28]:[2024-09-28 17:14:54,695] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,28]:[rank28]:[2024-09-28 17:14:54,715] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,28]:[rank28]:[2024-09-28 17:14:54,715] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,31]:[rank31]:[2024-09-28 17:14:54,732] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,31]:[rank31]:[2024-09-28 17:14:54,732] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,31]:[rank31]:[2024-09-28 17:14:54,753] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,31]:[rank31]:[2024-09-28 17:14:54,753] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,29]:[rank29]:[2024-09-28 17:14:54,767] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,29]:[rank29]:[2024-09-28 17:14:54,767] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,29]:[rank29]:[2024-09-28 17:14:54,788] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,29]:[rank29]:[2024-09-28 17:14:54,788] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,30]:[rank30]:[2024-09-28 17:14:54,861] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,30]:[rank30]:[2024-09-28 17:14:54,862] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,30]:[rank30]:[2024-09-28 17:14:54,883] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown. [1,30]:[rank30]:[2024-09-28 17:14:54,883] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args: ProcessGroupVariable() [1,30]:I0928 17:15:01.845300 8461 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,31]:I0928 17:15:01.846387 8462 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,29]:I0928 17:15:01.858175 8459 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,28]:I0928 17:15:01.873312 8456 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,29]:W0928 17:15:09.145107 8459 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,28]:W0928 17:15:09.145203 8456 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,31]:W0928 17:15:09.146502 8462 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,30]:W0928 17:15:09.149472 8461 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) [1,30]:I0928 17:16:36.815590 8461 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,28]:I0928 17:16:36.824781 8456 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,29]:I0928 17:16:36.839496 8459 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,31]:I0928 17:16:36.839843 8462 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,27]:I0928 17:16:37.125530 8451 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,26]:I0928 17:16:37.130755 8446 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,24]:I0928 17:16:37.154068 8434 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,25]:I0928 17:16:37.162684 8440 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,20]:I0928 17:16:37.479439 6917 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,21]:I0928 17:16:37.485435 6919 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,23]:I0928 17:16:37.486351 6922 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,22]:I0928 17:16:37.492189 6921 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,19]:I0928 17:16:37.791796 6914 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,18]:I0928 17:16:37.805619 6907 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,16]:I0928 17:16:37.810765 6894 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,17]:I0928 17:16:37.823271 6901 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,13]:I0928 17:16:38.077255 24844 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,15]:I0928 17:16:38.077323 24847 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,12]:I0928 17:16:38.084888 24841 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,14]:I0928 17:16:38.138005 24846 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,10]:I0928 17:16:38.387204 24831 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,9]:I0928 17:16:38.410526 24825 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,11]:I0928 17:16:38.410938 24836 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,8]:I0928 17:16:38.416317 24819 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,5]:I0928 17:16:38.717661 21524 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,6]:I0928 17:16:38.732295 21525 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,4]:I0928 17:16:38.742352 21522 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,7]:I0928 17:16:38.777984 21526 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,2]:I0928 17:16:39.009244 21514 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,3]:I0928 17:16:39.011678 21518 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,1]:I0928 17:16:39.014308 21509 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,0]:I0928 17:16:39.014585 21506 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A [1,0]:Number of parameters in transformer layers in billions: 70.22 [1,0]:Number of parameters in embedding layers in billions: 2.49 [1,0]:Total number of parameters in billions: 72.71 [1,0]:Number of parameters in most loaded shard in billions: 2.5057 [1,0]:Number of parameters in other shards in billions: 2.1942 [1,0]:Theoretical memory footprints: weight and optimizer=43012.42 MB [1,31]: [2024-09-28 17:16:39] iteration 1/ 100 | consumed samples: 64 | elapsed time per iteration (ms): 357535.0 | throughput per GPU (TFLOP/s/GPU): 5.1 | learning rate: 3.000000E-05 | global batch size: 64 | lm loss: 1.205764E+01 | loss scale: 1.0 | grad norm: 27.325 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,3]:[Rank 3] (after 1 iterations) memory (MB) | allocated: 21890.52978515625 | max allocated: 29690.8583984375 | reserved: 34624.0 | max reserved: 34624.0 [1,1]:[Rank 1] (after 1 iterations) memory (MB) | allocated: 21890.62744140625 | max allocated: 29691.4560546875 | reserved: 34856.0 | max reserved: 34856.0 [1,5]:[Rank 5] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 24763.95166015625 | reserved: 28130.0 | max reserved: 28130.0 [1,6]:[Rank 6] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 24763.57666015625 | reserved: 28114.0 | max reserved: 28114.0 [1,0]:[Rank 0] (after 1 iterations) memory (MB) | allocated: 21890.5302734375 | max allocated: 29692.14013671875 | reserved: 34624.0 | max reserved: 34624.0 [1,24]:[Rank 24] (after 1 iterations) memory (MB) | allocated: 16569.4326171875 | max allocated: 16809.54541015625 | reserved: 20012.0 | max reserved: 20012.0 [1,27]:[Rank 27] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 16809.107421875 | reserved: 20012.0 | max reserved: 20012.0 [1,13]:[Rank 13] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 21581.22802734375 | reserved: 24870.0 | max reserved: 24870.0 [1,15]:[Rank 15] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 21582.10302734375 | reserved: 24886.0 | max reserved: 24886.0 [1,14]:[Rank 14] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 21582.72802734375 | reserved: 24886.0 | max reserved: 24886.0 [1,8]:[Rank 8] (after 1 iterations) memory (MB) | allocated: 16569.4326171875 | max allocated: 23173.74267578125 | reserved: 26502.0 | max reserved: 26502.0 [1,29]:[Rank 29] (after 1 iterations) memory (MB) | allocated: 17754.10107421875 | max allocated: 17754.1328125 | reserved: 20046.0 | max reserved: 20046.0 [1,26]:[Rank 26] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 16808.357421875 | reserved: 20032.0 | max reserved: 20032.0 [1,28]:[Rank 28] (after 1 iterations) memory (MB) | allocated: 17754.10107421875 | max allocated: 17754.1328125 | reserved: 20046.0 | max reserved: 20046.0 [1,31]:[Rank 31] (after 1 iterations) memory (MB) | allocated: 17754.85107421875 | max allocated: 17754.8828125 | reserved: 20046.0 | max reserved: 20046.0 [1,30]:[Rank 30] (after 1 iterations) memory (MB) | allocated: 17756.10107421875 | max allocated: 17756.1328125 | reserved: 19816.0 | max reserved: 19816.0 [1,10]:[Rank 10] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 23172.6171875 | reserved: 26502.0 | max reserved: 26502.0 [1,9]:[Rank 9] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 23172.1171875 | reserved: 26502.0 | max reserved: 26502.0 [1,11]:[Rank 11] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 23173.0546875 | reserved: 26502.0 | max reserved: 26502.0 [1,7]:[Rank 7] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 24764.32666015625 | reserved: 28132.0 | max reserved: 28132.0 [1,2]:[Rank 2] (after 1 iterations) memory (MB) | allocated: 21890.62744140625 | max allocated: 29690.6123046875 | reserved: 34856.0 | max reserved: 34856.0 [1,12]:[Rank 12] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 21582.10302734375 | reserved: 24870.0 | max reserved: 24870.0 [1,25]:[Rank 25] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 16809.357421875 | reserved: 20022.0 | max reserved: 20022.0[1,25]: [1,4]:[Rank 4] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 24764.07666015625 | reserved: 28134.0 | max reserved: 28134.0 [1,19]:[Rank 19] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 19992.2373046875 | reserved: 23258.0 | max reserved: 23258.0 [1,16]:[Rank 16] (after 1 iterations) memory (MB) | allocated: 16569.4326171875 | max allocated: 19991.51904296875 | reserved: 23258.0 | max reserved: 23258.0 [1,17]:[Rank 17] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 19991.6123046875 | reserved: 23142.0 | max reserved: 23142.0 [1,18]:[Rank 18] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 19991.5810546875 | reserved: 23142.0 | max reserved: 23142.0 [1,20]:[Rank 20] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 18400.25439453125 | reserved: 21646.0 | max reserved: 21646.0 [1,23]:[Rank 23] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 18399.75439453125 | reserved: 21646.0 | max reserved: 21646.0 [1,22]:[Rank 22] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 18399.62939453125 | reserved: 21646.0 | max reserved: 21646.0 [1,21]:[Rank 21] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 18400.12939453125 | reserved: 21646.0 | max reserved: 21646.0[1,21]: [1,31]: [2024-09-28 17:17:18] iteration 2/ 100 | consumed samples: 128 | elapsed time per iteration (ms): 39808.6 | throughput per GPU (TFLOP/s/GPU): 45.8 | learning rate: 2.999320E-05 | global batch size: 64 | lm loss: 1.206543E+01 | loss scale: 1.0 | grad norm: 29.727 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:17:58] iteration 3/ 100 | consumed samples: 192 | elapsed time per iteration (ms): 39574.8 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.997282E-05 | global batch size: 64 | lm loss: 1.069553E+01 | loss scale: 1.0 | grad norm: 261.951 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:18:38] iteration 4/ 100 | consumed samples: 256 | elapsed time per iteration (ms): 39625.8 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.993887E-05 | global batch size: 64 | lm loss: 1.103311E+01 | loss scale: 1.0 | grad norm: 18.421 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:19:17] iteration 5/ 100 | consumed samples: 320 | elapsed time per iteration (ms): 39592.1 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.989139E-05 | global batch size: 64 | lm loss: 1.566007E+01 | loss scale: 1.0 | grad norm: 151.662 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:19:57] iteration 6/ 100 | consumed samples: 384 | elapsed time per iteration (ms): 39573.2 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.983042E-05 | global batch size: 64 | lm loss: 1.319092E+01 | loss scale: 1.0 | grad norm: 47.626 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:20:36] iteration 7/ 100 | consumed samples: 448 | elapsed time per iteration (ms): 39535.6 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.975604E-05 | global batch size: 64 | lm loss: 1.252701E+01 | loss scale: 1.0 | grad norm: 10.791 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:21:16] iteration 8/ 100 | consumed samples: 512 | elapsed time per iteration (ms): 39780.8 | throughput per GPU (TFLOP/s/GPU): 45.8 | learning rate: 2.966830E-05 | global batch size: 64 | lm loss: 1.230831E+01 | loss scale: 1.0 | grad norm: 3.800 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:21:56] iteration 9/ 100 | consumed samples: 576 | elapsed time per iteration (ms): 39597.3 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.956731E-05 | global batch size: 64 | lm loss: 1.122603E+01 | loss scale: 1.0 | grad norm: 2.744 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:22:35] iteration 10/ 100 | consumed samples: 640 | elapsed time per iteration (ms): 39674.7 | throughput per GPU (TFLOP/s/GPU): 45.9 | learning rate: 2.945316E-05 | global batch size: 64 | lm loss: 1.062877E+01 | loss scale: 1.0 | grad norm: 16.466 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:23:15] iteration 11/ 100 | consumed samples: 704 | elapsed time per iteration (ms): 39649.7 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.932596E-05 | global batch size: 64 | lm loss: 1.030651E+01 | loss scale: 1.0 | grad norm: 13.611 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:23:55] iteration 12/ 100 | consumed samples: 768 | elapsed time per iteration (ms): 39566.2 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.918585E-05 | global batch size: 64 | lm loss: 1.008296E+01 | loss scale: 1.0 | grad norm: 1.557 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:24:34] iteration 13/ 100 | consumed samples: 832 | elapsed time per iteration (ms): 39626.7 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.903297E-05 | global batch size: 64 | lm loss: 9.951680E+00 | loss scale: 1.0 | grad norm: 2.166 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:25:14] iteration 14/ 100 | consumed samples: 896 | elapsed time per iteration (ms): 39557.3 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.886746E-05 | global batch size: 64 | lm loss: 9.911741E+00 | loss scale: 1.0 | grad norm: 0.892 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:25:53] iteration 15/ 100 | consumed samples: 960 | elapsed time per iteration (ms): 39587.4 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.868951E-05 | global batch size: 64 | lm loss: 9.708189E+00 | loss scale: 1.0 | grad norm: 1.484 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:26:33] iteration 16/ 100 | consumed samples: 1024 | elapsed time per iteration (ms): 39539.5 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.849928E-05 | global batch size: 64 | lm loss: 9.610130E+00 | loss scale: 1.0 | grad norm: 0.635 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:27:13] iteration 17/ 100 | consumed samples: 1088 | elapsed time per iteration (ms): 39556.8 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.829697E-05 | global batch size: 64 | lm loss: 9.517210E+00 | loss scale: 1.0 | grad norm: 0.615 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:27:52] iteration 18/ 100 | consumed samples: 1152 | elapsed time per iteration (ms): 39482.5 | throughput per GPU (TFLOP/s/GPU): 46.2 | learning rate: 2.808278E-05 | global batch size: 64 | lm loss: 9.492043E+00 | loss scale: 1.0 | grad norm: 0.544 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:28:32] iteration 19/ 100 | consumed samples: 1216 | elapsed time per iteration (ms): 39568.7 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.785692E-05 | global batch size: 64 | lm loss: 9.486827E+00 | loss scale: 1.0 | grad norm: 1.113 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:29:11] iteration 20/ 100 | consumed samples: 1280 | elapsed time per iteration (ms): 39528.8 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.761963E-05 | global batch size: 64 | lm loss: 9.444993E+00 | loss scale: 1.0 | grad norm: 1.076 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:29:51] iteration 21/ 100 | consumed samples: 1344 | elapsed time per iteration (ms): 39522.9 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.737115E-05 | global batch size: 64 | lm loss: 9.371952E+00 | loss scale: 1.0 | grad norm: 0.430 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:30:30] iteration 22/ 100 | consumed samples: 1408 | elapsed time per iteration (ms): 39518.8 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.711172E-05 | global batch size: 64 | lm loss: 9.420528E+00 | loss scale: 1.0 | grad norm: 7.430 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:31:10] iteration 23/ 100 | consumed samples: 1472 | elapsed time per iteration (ms): 39554.6 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.684160E-05 | global batch size: 64 | lm loss: 9.177224E+00 | loss scale: 1.0 | grad norm: 0.652 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:31:49] iteration 24/ 100 | consumed samples: 1536 | elapsed time per iteration (ms): 39503.5 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.656107E-05 | global batch size: 64 | lm loss: 9.407356E+00 | loss scale: 1.0 | grad norm: 0.578 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:32:29] iteration 25/ 100 | consumed samples: 1600 | elapsed time per iteration (ms): 39528.4 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.627041E-05 | global batch size: 64 | lm loss: 9.249225E+00 | loss scale: 1.0 | grad norm: 0.390 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:33:08] iteration 26/ 100 | consumed samples: 1664 | elapsed time per iteration (ms): 39506.8 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.596991E-05 | global batch size: 64 | lm loss: 9.312948E+00 | loss scale: 1.0 | grad norm: 0.439 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:33:48] iteration 27/ 100 | consumed samples: 1728 | elapsed time per iteration (ms): 39518.2 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.565988E-05 | global batch size: 64 | lm loss: 9.285652E+00 | loss scale: 1.0 | grad norm: 0.402 | number of skipped iterations: 0 | number of nan iterations: 0 | [1,31]: [2024-09-28 17:34:27] iteration 28/ 100 | consumed samples: 1792 | elapsed time per iteration (ms): 39500.3 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.534062E-05 | global batch size: 64 | lm loss: 9.194493E+00 | loss scale: 1.0 | grad norm: 0.371 | number of skipped iterations: 0 | number of nan iterations: 0 |