[1,15]<stdout>:[2024-09-28 17:10:34,208] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,25]<stdout>:[2024-09-28 17:10:34,241] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,28]<stdout>:[2024-09-28 17:10:34,257] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,31]<stdout>:[2024-09-28 17:10:34,262] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,8]<stdout>:[2024-09-28 17:10:34,302] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,11]<stdout>:[2024-09-28 17:10:34,313] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,13]<stdout>:[2024-09-28 17:10:34,320] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,30]<stdout>:[2024-09-28 17:10:34,341] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,27]<stdout>:[2024-09-28 17:10:34,347] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,1]<stdout>:[2024-09-28 17:10:34,391] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,4]<stdout>:[2024-09-28 17:10:34,397] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,7]<stdout>:[2024-09-28 17:10:34,400] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,29]<stdout>:[2024-09-28 17:10:34,369] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,5]<stdout>:[2024-09-28 17:10:34,417] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,14]<stdout>:[2024-09-28 17:10:34,407] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,2]<stdout>:[2024-09-28 17:10:34,474] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,3]<stdout>:[2024-09-28 17:10:34,476] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,6]<stdout>:[2024-09-28 17:10:34,477] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,26]<stdout>:[2024-09-28 17:10:34,480] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,9]<stdout>:[2024-09-28 17:10:34,527] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,0]<stdout>:[2024-09-28 17:10:34,643] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,10]<stdout>:[2024-09-28 17:10:34,617] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,12]<stdout>:[2024-09-28 17:10:34,659] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,24]<stdout>:[2024-09-28 17:10:34,737] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,17]<stdout>:[2024-09-28 17:10:34,713] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,18]<stdout>:[2024-09-28 17:10:34,729] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,23]<stdout>:[2024-09-28 17:10:34,757] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,21]<stdout>:[2024-09-28 17:10:34,773] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,20]<stdout>:[2024-09-28 17:10:34,815] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,19]<stdout>:[2024-09-28 17:10:34,819] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,22]<stdout>:[2024-09-28 17:10:34,857] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[1,16]<stdout>:[2024-09-28 17:10:34,882] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
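The 32 lines above are one DeepSpeed accelerator probe per MPI rank ([1,0] through [1,31]); each rank auto-detects CUDA on import. A minimal sketch of the same probe, assuming only a standard DeepSpeed install (nothing here is specific to this job):

    # Triggers the same "Setting ds_accelerator to cuda (auto detect)" line as above.
    from deepspeed.accelerator import get_accelerator

    acc = get_accelerator()      # auto-detects the backend at first use
    print(acc.device_name())     # e.g. "cuda" on an NVIDIA node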
[1,31]<stdout>:> setting tensorboard ...
[1,0]<stdout>:using world size: 32, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 4, pipeline-model-parallel size: 8 
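The parallel layout on the line above is internally consistent: 32 processes split across tensor-parallel 4 and pipeline-parallel 8 leave exactly one data-parallel replica. A worked check using the sizes from the log (the decomposition formula is the standard Megatron one, stated here as an assumption rather than quoted from its source):

    # world_size = TP * PP * CP * DP in Megatron-style parallelism (assumed layout).
    world_size = 32
    tp, pp, cp = 4, 8, 1
    dp = world_size // (tp * pp * cp)
    assert dp == 1               # matches "data-parallel size: 1" above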
[1,0]<stdout>:WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:QwenTokenizer
[1,0]<stdout>:WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
[1,0]<stdout>:accumulate and all-reduce gradients in fp32 for bfloat16 data type.
[1,0]<stdout>:using torch.bfloat16 for parameters ...
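The two precision lines above describe the usual bf16-with-fp32-gradient recipe: parameters live in torch.bfloat16 while gradients are accumulated and all-reduced in fp32. A toy PyTorch sketch of the idea (illustrative only; Megatron implements this inside its distributed optimizer, not literally as below):

    import torch

    # Hypothetical single parameter; the real model keeps all weights in bf16.
    w = torch.randn(16, 16, dtype=torch.bfloat16, requires_grad=True)
    grad_fp32 = torch.zeros(16, 16, dtype=torch.float32)  # fp32 accumulation buffer

    for _ in range(4):                    # pretend micro-batches
        loss = (w.float() ** 2).mean()
        loss.backward()
        grad_fp32 += w.grad.float()       # accumulate in fp32, as the log states
        w.grad = None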
[1,0]<stdout>:------------------------ arguments ------------------------
[1,0]<stdout>:  accumulate_allreduce_grads_in_fp32 .............. True
[1,0]<stdout>:  adam_beta1 ...................................... 0.9
[1,0]<stdout>:  adam_beta2 ...................................... 0.95
[1,0]<stdout>:  adam_eps ........................................ 1e-08
[1,0]<stdout>:  add_bias_linear ................................. False
[1,0]<stdout>:  add_position_embedding .......................... False
[1,0]<stdout>:  add_qkv_bias .................................... False
[1,0]<stdout>:  adlr_autoresume ................................. False
[1,0]<stdout>:  adlr_autoresume_interval ........................ 1000
[1,0]<stdout>:  apply_layernorm_1p .............................. False
[1,0]<stdout>:  apply_query_key_layer_scaling ................... False
[1,0]<stdout>:  apply_residual_connection_post_layernorm ........ False
[1,0]<stdout>:  apply_rope_fusion ............................... True
[1,0]<stdout>:  async_save ...................................... None
[1,0]<stdout>:  async_tensor_model_parallel_allreduce ........... False
[1,0]<stdout>:  attention_dropout ............................... 0.0
[1,0]<stdout>:  attention_softmax_in_fp32 ....................... False
[1,0]<stdout>:  auto_detect_ckpt_format ......................... False
[1,0]<stdout>:  barrier_with_L1_time ............................ True
[1,0]<stdout>:  bert_binary_head ................................ True
[1,0]<stdout>:  bert_embedder_type .............................. megatron
[1,0]<stdout>:  bert_load ....................................... None
[1,0]<stdout>:  bf16 ............................................ True
[1,0]<stdout>:  bias_dropout_fusion ............................. True
[1,0]<stdout>:  bias_gelu_fusion ................................ False
[1,0]<stdout>:  bias_swiglu_fusion .............................. True
[1,0]<stdout>:  biencoder_projection_dim ........................ 0
[1,0]<stdout>:  biencoder_shared_query_context_model ............ False
[1,0]<stdout>:  block_data_path ................................. None
[1,0]<stdout>:  calculate_per_token_loss ........................ False
[1,0]<stdout>:  check_for_nan_in_loss_and_grad .................. True
[1,0]<stdout>:  check_weight_hash_across_dp_replicas_interval ... None
[1,0]<stdout>:  ckpt_assume_constant_structure .................. False
[1,0]<stdout>:  ckpt_fully_parallel_load ........................ False
[1,0]<stdout>:  ckpt_fully_parallel_save ........................ False
[1,0]<stdout>:  ckpt_step ....................................... None
[1,0]<stdout>:  classes_fraction ................................ 1.0
[1,0]<stdout>:  clip_grad ....................................... 1.0
[1,0]<stdout>:  clone_scatter_output_in_embedding ............... True
[1,0]<stdout>:  consumed_train_samples .......................... 0
[1,0]<stdout>:  consumed_valid_samples .......................... 0
[1,0]<stdout>:  context_parallel_size ........................... 1
[1,0]<stdout>:  create_attention_mask_in_dataloader ............. True
[1,0]<stdout>:  cross_entropy_loss_fusion ....................... False
[1,0]<stdout>:  data_cache_path ................................. None
[1,0]<stdout>:  data_parallel_random_init ....................... False
[1,0]<stdout>:  data_parallel_size .............................. 1
[1,0]<stdout>:  data_path ....................................... ['./qwen_token/my-qwen_text_document']
[1,0]<stdout>:  data_per_class_fraction ......................... 1.0
[1,0]<stdout>:  data_sharding ................................... True
[1,0]<stdout>:  dataloader_type ................................. single
[1,0]<stdout>:  ddp_average_in_collective ....................... False
[1,0]<stdout>:  ddp_bucket_size ................................. None
[1,0]<stdout>:  decoder_num_layers .............................. None
[1,0]<stdout>:  decoder_seq_length .............................. None
[1,0]<stdout>:  decoupled_lr .................................... None
[1,0]<stdout>:  decoupled_min_lr ................................ None
[1,0]<stdout>:  delay_grad_reduce ............................... True
[1,0]<stdout>:  delay_param_gather .............................. False
[1,0]<stdout>:  deprecated_use_mcore_models ..................... False
[1,0]<stdout>:  deterministic_mode .............................. False
[1,0]<stdout>:  dino_bottleneck_size ............................ 256
[1,0]<stdout>:  dino_freeze_last_layer .......................... 1
[1,0]<stdout>:  dino_head_hidden_size ........................... 2048
[1,0]<stdout>:  dino_local_crops_number ......................... 10
[1,0]<stdout>:  dino_local_img_size ............................. 96
[1,0]<stdout>:  dino_norm_last_layer ............................ False
[1,0]<stdout>:  dino_teacher_temp ............................... 0.07
[1,0]<stdout>:  dino_warmup_teacher_temp ........................ 0.04
[1,0]<stdout>:  dino_warmup_teacher_temp_epochs ................. 30
[1,0]<stdout>:  disable_straggler_on_startup .................... False
[1,0]<stdout>:  dist_ckpt_format ................................ torch_dist
[1,0]<stdout>:  dist_url ........................................ tcp://node116:34566
[1,0]<stdout>:  distribute_saved_activations .................... False
[1,0]<stdout>:  distributed_backend ............................. nccl
[1,0]<stdout>:  distributed_timeout_minutes ..................... 10
[1,0]<stdout>:  embedding_path .................................. None
[1,0]<stdout>:  empty_unused_memory_level ....................... 0
[1,0]<stdout>:  enable_one_logger ............................... False
[1,0]<stdout>:  encoder_num_layers .............................. 80
[1,0]<stdout>:  encoder_seq_length .............................. 2048
[1,0]<stdout>:  end_weight_decay ................................ 0.1
[1,0]<stdout>:  eod_mask_loss ................................... False
[1,0]<stdout>:  eval_interval ................................... 1000
[1,0]<stdout>:  eval_iters ...................................... 10
[1,0]<stdout>:  evidence_data_path .............................. None
[1,0]<stdout>:  exit_duration_in_mins ........................... None
[1,0]<stdout>:  exit_interval ................................... None
[1,0]<stdout>:  exit_on_missing_checkpoint ...................... False
[1,0]<stdout>:  exit_signal_handler ............................. False
[1,0]<stdout>:  expert_model_parallel_size ...................... 1
[1,0]<stdout>:  ffn_hidden_size ................................. 29568
[1,0]<stdout>:  finetune ........................................ False
[1,0]<stdout>:  fp16 ............................................ False
[1,0]<stdout>:  fp16_lm_cross_entropy ........................... False
[1,0]<stdout>:  fp32_residual_connection ........................ False
[1,0]<stdout>:  fp8 ............................................. None
[1,0]<stdout>:  fp8_amax_compute_algo ........................... most_recent
[1,0]<stdout>:  fp8_amax_history_len ............................ 1
[1,0]<stdout>:  fp8_interval .................................... 1
[1,0]<stdout>:  fp8_margin ...................................... 0
[1,0]<stdout>:  fp8_wgrad ....................................... True
[1,0]<stdout>:  global_batch_size ............................... 64
[1,0]<stdout>:  gradient_accumulation_fusion .................... False
[1,0]<stdout>:  group_query_attention ........................... True
[1,0]<stdout>:  head_lr_mult .................................... 1.0
[1,0]<stdout>:  hidden_dropout .................................. 0.0
[1,0]<stdout>:  hidden_size ..................................... 8192
[1,0]<stdout>:  hybrid_attention_ratio .......................... 0.0
[1,0]<stdout>:  hybrid_mlp_ratio ................................ 0.0
[1,0]<stdout>:  hybrid_override_pattern ......................... None
[1,0]<stdout>:  hysteresis ...................................... 2
[1,0]<stdout>:  ict_head_size ................................... None
[1,0]<stdout>:  ict_load ........................................ None
[1,0]<stdout>:  img_h ........................................... 224
[1,0]<stdout>:  img_w ........................................... 224
[1,0]<stdout>:  indexer_batch_size .............................. 128
[1,0]<stdout>:  indexer_log_interval ............................ 1000
[1,0]<stdout>:  inference_batch_times_seqlen_threshold .......... 512
[1,0]<stdout>:  init_method_std ................................. 0.006
[1,0]<stdout>:  init_method_xavier_uniform ...................... False
[1,0]<stdout>:  initial_loss_scale .............................. 4294967296
[1,0]<stdout>:  iter_per_epoch .................................. 1250
[1,0]<stdout>:  kv_channels ..................................... 128
[1,0]<stdout>:  lazy_mpu_init ................................... None
[1,0]<stdout>:  load ............................................ ./tmp/qwen1_5_72b/ckpt
[1,0]<stdout>:  local_rank ...................................... None
[1,0]<stdout>:  log_batch_size_to_tensorboard ................... False
[1,0]<stdout>:  log_interval .................................... 1
[1,0]<stdout>:  log_learning_rate_to_tensorboard ................ True
[1,0]<stdout>:  log_loss_scale_to_tensorboard ................... True
[1,0]<stdout>:  log_memory_to_tensorboard ....................... False
[1,0]<stdout>:  log_num_zeros_in_grad ........................... False
[1,0]<stdout>:  log_params_norm ................................. False
[1,0]<stdout>:  log_progress .................................... False
[1,0]<stdout>:  log_straggler ................................... False
[1,0]<stdout>:  log_throughput .................................. True
[1,0]<stdout>:  log_timers_to_tensorboard ....................... False
[1,0]<stdout>:  log_validation_ppl_to_tensorboard ............... False
[1,0]<stdout>:  log_world_size_to_tensorboard ................... False
[1,0]<stdout>:  logging_level ................................... None
[1,0]<stdout>:  loss_scale ...................................... None
[1,0]<stdout>:  loss_scale_window ............................... 1000
[1,0]<stdout>:  lr .............................................. 3e-05
[1,0]<stdout>:  lr_decay_iters .................................. None
[1,0]<stdout>:  lr_decay_samples ................................ None
[1,0]<stdout>:  lr_decay_style .................................. cosine
[1,0]<stdout>:  lr_warmup_fraction .............................. None
[1,0]<stdout>:  lr_warmup_init .................................. 0.0
[1,0]<stdout>:  lr_warmup_iters ................................. 1
[1,0]<stdout>:  lr_warmup_samples ............................... 0
[1,0]<stdout>:  lr_wsd_decay_iters .............................. None
[1,0]<stdout>:  lr_wsd_decay_samples ............................ None
[1,0]<stdout>:  lr_wsd_decay_style .............................. exponential
[1,0]<stdout>:  make_vocab_size_divisible_by .................... 128
[1,0]<stdout>:  manual_gc ....................................... False
[1,0]<stdout>:  manual_gc_eval .................................. True
[1,0]<stdout>:  manual_gc_interval .............................. 0
[1,0]<stdout>:  mask_factor ..................................... 1.0
[1,0]<stdout>:  mask_prob ....................................... 0.15
[1,0]<stdout>:  mask_type ....................................... random
[1,0]<stdout>:  masked_softmax_fusion ........................... True
[1,0]<stdout>:  max_position_embeddings ......................... 32768
[1,0]<stdout>:  max_tokens_to_oom ............................... 12000
[1,0]<stdout>:  merge_file ...................................... ./qwen_token/merges.txt
[1,0]<stdout>:  micro_batch_size ................................ 1
[1,0]<stdout>:  min_loss_scale .................................. 1.0
[1,0]<stdout>:  min_lr .......................................... 3e-06
[1,0]<stdout>:  mmap_bin_files .................................. True
[1,0]<stdout>:  mock_data ....................................... False
[1,0]<stdout>:  moe_aux_loss_coeff .............................. 0.0
[1,0]<stdout>:  moe_expert_capacity_factor ...................... None
[1,0]<stdout>:  moe_extended_tp ................................. False
[1,0]<stdout>:  moe_grouped_gemm ................................ False
[1,0]<stdout>:  moe_input_jitter_eps ............................ None
[1,0]<stdout>:  moe_layer_recompute ............................. False
[1,0]<stdout>:  moe_pad_expert_input_to_capacity ................ False
[1,0]<stdout>:  moe_per_layer_logging ........................... False
[1,0]<stdout>:  moe_router_load_balancing_type .................. aux_loss
[1,0]<stdout>:  moe_router_topk ................................. 2
[1,0]<stdout>:  moe_token_dispatcher_type ....................... allgather
[1,0]<stdout>:  moe_token_drop_policy ........................... probs
[1,0]<stdout>:  moe_z_loss_coeff ................................ None
[1,0]<stdout>:  nccl_communicator_config_path ................... None
[1,0]<stdout>:  no_load_optim ................................... None
[1,0]<stdout>:  no_load_rng ..................................... None
[1,0]<stdout>:  no_persist_layer_norm ........................... False
[1,0]<stdout>:  no_save_optim ................................... None
[1,0]<stdout>:  no_save_rng ..................................... None
[1,0]<stdout>:  norm_epsilon .................................... 1e-05
[1,0]<stdout>:  normalization ................................... RMSNorm
[1,0]<stdout>:  num_attention_heads ............................. 64
[1,0]<stdout>:  num_channels .................................... 3
[1,0]<stdout>:  num_classes ..................................... 1000
[1,0]<stdout>:  num_dataset_builder_threads ..................... 1
[1,0]<stdout>:  num_experts ..................................... None
[1,0]<stdout>:  num_layers ...................................... 80
[1,0]<stdout>:  num_layers_per_virtual_pipeline_stage ........... None
[1,0]<stdout>:  num_query_groups ................................ 8
[1,0]<stdout>:  num_workers ..................................... 2
[1,0]<stdout>:  one_logger_entity ............................... hwinf_dcm
[1,0]<stdout>:  one_logger_project .............................. e2e-tracking
[1,0]<stdout>:  one_logger_run_name ............................. None
[1,0]<stdout>:  onnx_safe ....................................... None
[1,0]<stdout>:  openai_gelu ..................................... False
[1,0]<stdout>:  optimizer ....................................... adam
[1,0]<stdout>:  output_bert_embeddings .......................... False
[1,0]<stdout>:  overlap_grad_reduce ............................. False
[1,0]<stdout>:  overlap_p2p_comm ................................ False
[1,0]<stdout>:  overlap_param_gather ............................ False
[1,0]<stdout>:  override_opt_param_scheduler .................... False
[1,0]<stdout>:  params_dtype .................................... torch.bfloat16
[1,0]<stdout>:  patch_dim ....................................... 16
[1,0]<stdout>:  perform_initialization .......................... True
[1,0]<stdout>:  pipeline_model_parallel_size .................... 8
[1,0]<stdout>:  pipeline_model_parallel_split_rank .............. None
[1,0]<stdout>:  position_embedding_type ......................... rope
[1,0]<stdout>:  pretrained_checkpoint ........................... None
[1,0]<stdout>:  profile ......................................... False
[1,0]<stdout>:  profile_ranks ................................... [0]
[1,0]<stdout>:  profile_step_end ................................ 12
[1,0]<stdout>:  profile_step_start .............................. 10
[1,0]<stdout>:  qk_layernorm .................................... False
[1,0]<stdout>:  query_in_block_prob ............................. 0.1
[1,0]<stdout>:  rampup_batch_size ............................... None
[1,0]<stdout>:  rank ............................................ 0
[1,0]<stdout>:  recompute_granularity ........................... None
[1,0]<stdout>:  recompute_method ................................ None
[1,0]<stdout>:  recompute_num_layers ............................ None
[1,0]<stdout>:  reset_attention_mask ............................ False
[1,0]<stdout>:  reset_position_ids .............................. False
[1,0]<stdout>:  retriever_report_topk_accuracies ................ []
[1,0]<stdout>:  retriever_score_scaling ......................... False
[1,0]<stdout>:  retriever_seq_length ............................ 256
[1,0]<stdout>:  retro_add_retriever ............................. False
[1,0]<stdout>:  retro_attention_gate ............................ 1
[1,0]<stdout>:  retro_cyclic_train_iters ........................ None
[1,0]<stdout>:  retro_encoder_attention_dropout ................. 0.1
[1,0]<stdout>:  retro_encoder_hidden_dropout .................... 0.1
[1,0]<stdout>:  retro_encoder_layers ............................ 2
[1,0]<stdout>:  retro_num_neighbors ............................. 2
[1,0]<stdout>:  retro_num_retrieved_chunks ...................... 2
[1,0]<stdout>:  retro_project_dir ............................... None
[1,0]<stdout>:  retro_verify_neighbor_count ..................... True
[1,0]<stdout>:  rotary_interleaved .............................. False
[1,0]<stdout>:  rotary_percent .................................. 1.0
[1,0]<stdout>:  rotary_seq_len_interpolation_factor ............. None
[1,0]<stdout>:  sample_rate ..................................... 1.0
[1,0]<stdout>:  save ............................................ ./tmp/qwen1_5_72b/ckpt
[1,0]<stdout>:  save_interval ................................... 10000
[1,0]<stdout>:  scatter_gather_tensors_in_pipeline .............. True
[1,0]<stdout>:  seed ............................................ 1234
[1,0]<stdout>:  seq_length ...................................... 2048
[1,0]<stdout>:  sequence_parallel ............................... True
[1,0]<stdout>:  sgd_momentum .................................... 0.9
[1,0]<stdout>:  short_seq_prob .................................. 0.1
[1,0]<stdout>:  skip_train ...................................... False
[1,0]<stdout>:  spec ............................................ None
[1,0]<stdout>:  split ........................................... 949,50,1
[1,0]<stdout>:  squared_relu .................................... False
[1,0]<stdout>:  standalone_embedding_stage ...................... False
[1,0]<stdout>:  start_weight_decay .............................. 0.1
[1,0]<stdout>:  straggler_ctrlr_port ............................ 65535
[1,0]<stdout>:  straggler_minmax_count .......................... 1
[1,0]<stdout>:  swiglu .......................................... True
[1,0]<stdout>:  swin_backbone_type .............................. tiny
[1,0]<stdout>:  tensor_model_parallel_size ...................... 4
[1,0]<stdout>:  tensorboard_dir ................................. ./tmp/qwen1_5_72b/tblog
[1,0]<stdout>:  tensorboard_log_interval ........................ 1
[1,0]<stdout>:  tensorboard_queue_size .......................... 1000
[1,0]<stdout>:  test_data_path .................................. None
[1,0]<stdout>:  test_mode ....................................... False
[1,0]<stdout>:  timing_log_level ................................ 0
[1,0]<stdout>:  timing_log_option ............................... minmax
[1,0]<stdout>:  titles_data_path ................................ None
[1,0]<stdout>:  tokenizer_model ................................. None
[1,0]<stdout>:  tokenizer_type .................................. QwenTokenizer
[1,0]<stdout>:  tp_comm_bulk_dgrad .............................. True
[1,0]<stdout>:  tp_comm_bulk_wgrad .............................. True
[1,0]<stdout>:  tp_comm_overlap ................................. False
[1,0]<stdout>:  tp_comm_overlap_ag .............................. True
[1,0]<stdout>:  tp_comm_overlap_cfg ............................. None
[1,0]<stdout>:  tp_comm_overlap_rs .............................. True
[1,0]<stdout>:  tp_comm_overlap_rs_dgrad ........................ False
[1,0]<stdout>:  tp_comm_split_ag ................................ True
[1,0]<stdout>:  tp_comm_split_rs ................................ True
[1,0]<stdout>:  train_data_path ................................. None
[1,0]<stdout>:  train_iters ..................................... 100
[1,0]<stdout>:  train_samples ................................... None
[1,0]<stdout>:  transformer_impl ................................ local
[1,0]<stdout>:  transformer_pipeline_model_parallel_size ........ 8
[1,0]<stdout>:  untie_embeddings_and_output_weights ............. True
[1,0]<stdout>:  use_checkpoint_args ............................. False
[1,0]<stdout>:  use_checkpoint_opt_param_scheduler .............. False
[1,0]<stdout>:  use_cpu_initialization .......................... None
[1,0]<stdout>:  use_dist_ckpt ................................... False
[1,0]<stdout>:  use_distributed_optimizer ....................... True
[1,0]<stdout>:  use_fast_cross_entropy_loss ..................... False
[1,0]<stdout>:  use_fast_rms_layernorm .......................... False
[1,0]<stdout>:  use_flash_attn .................................. True
[1,0]<stdout>:  use_flash_attn_triton ........................... False
[1,0]<stdout>:  use_flash_attn_v1 ............................... False
[1,0]<stdout>:  use_flash_attn_v2 ............................... True
[1,0]<stdout>:  use_legacy_models ............................... True
[1,0]<stdout>:  use_one_sent_docs ............................... False
[1,0]<stdout>:  use_ring_exchange_p2p ........................... False
[1,0]<stdout>:  use_rotary_position_embeddings .................. True
[1,0]<stdout>:  use_tp_pp_dp_mapping ............................ False
[1,0]<stdout>:  valid_data_path ................................. None
[1,0]<stdout>:  variable_seq_lengths ............................ False
[1,0]<stdout>:  virtual_pipeline_model_parallel_size ............ None
[1,0]<stdout>:  vision_backbone_type ............................ vit
[1,0]<stdout>:  vision_pretraining .............................. False
[1,0]<stdout>:  vision_pretraining_type ......................... classify
[1,0]<stdout>:  vocab_extra_ids ................................. 0
[1,0]<stdout>:  vocab_file ...................................... ./qwen_token/vocab.json
[1,0]<stdout>:  vocab_size ...................................... None
[1,0]<stdout>:  wandb_exp_name .................................. 
[1,0]<stdout>:  wandb_project ................................... 
[1,0]<stdout>:  wandb_save_dir .................................. 
[1,0]<stdout>:  weight_decay .................................... 0.1
[1,0]<stdout>:  weight_decay_incr_style ......................... constant
[1,0]<stdout>:  world_size ...................................... 32
[1,0]<stdout>:  yaml_cfg ........................................ None
[1,0]<stdout>:-------------------- end of arguments ---------------------
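The argument dump pins down the architecture (80 layers, hidden 8192, FFN 29568, 64 heads with 8 KV groups, untied embeddings over the padded 152064-token vocabulary), which matches the Qwen1.5-72B shape the checkpoint path suggests. A back-of-the-envelope parameter count from those logged values (my own arithmetic, not a figure printed by the log; norms and disabled biases are ignored):

    h, ffn, layers, vocab = 8192, 29568, 80, 152064
    kv_channels, n_heads, n_kv_groups = 128, 64, 8

    attn = h * (n_heads * kv_channels)            # Q projection
    attn += 2 * h * (n_kv_groups * kv_channels)   # grouped K and V
    attn += (n_heads * kv_channels) * h           # output projection
    mlp = 3 * h * ffn                             # SwiGLU: gate, up, down
    total = layers * (attn + mlp) + 2 * vocab * h # untied in/out embeddings
    print(f"{total / 1e9:.1f}B parameters")       # ~72.7B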
[1,0]<stdout>:setting number of micro-batches to constant 64
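The constant micro-batch count follows directly from the batch arguments above. A worked check (standard Megatron batch accounting, stated as an assumption):

    global_batch_size, micro_batch_size, dp = 64, 1, 1
    num_micro_batches = global_batch_size // (micro_batch_size * dp)
    assert num_micro_batches == 64   # matches "constant 64" in the log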
[1,0]<stdout>:> building QwenTokenizer tokenizer ...
[1,0]<stdout>: > padded vocab (size: 151643) with 421 dummy tokens (new size: 152064)
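The padding line is reproducible from the tokenizer and parallelism arguments: the vocabulary is padded up to a multiple of make_vocab_size_divisible_by times the tensor-parallel size (this divisibility rule is my reading of the log, not quoted from Megatron's source):

    import math

    vocab, divisor = 151643, 128 * 4        # make_vocab_size_divisible_by * TP
    padded = math.ceil(vocab / divisor) * divisor
    assert padded == 152064                  # "new size" in the log
    assert padded - vocab == 421             # the 421 dummy tokens in the log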
[1,0]<stdout>:> initializing torch distributed ...
[1,12]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,12]<stderr>:I0928 17:10:36.284003 24841 ProcessGroupNCCL.cpp:686] [Rank 12] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843548017696
[1,12]<stderr>:I0928 17:10:36.285332 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843534134176
[1,12]<stderr>:I0928 17:10:36.291662 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843534124432
[1,12]<stderr>:I0928 17:10:36.293839 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843534741728
[1,12]<stderr>:I0928 17:10:36.294546 24841 ProcessGroupNCCL.cpp:686] [Rank 12] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843533522800
[1,12]<stderr>:I0928 17:10:36.294782 24841 ProcessGroupNCCL.cpp:686] [Rank 12] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843545869136
[1,12]<stderr>:I0928 17:10:36.295115 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843551010032
[1,12]<stderr>:I0928 17:10:36.295454 24841 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843551012256
[1,12]<stderr>:I0928 17:10:36.296087 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843551014480
[1,12]<stderr>:I0928 17:10:36.296543 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843543298800
[1,12]<stderr>:I0928 17:10:36.296970 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843543301024
[1,12]<stderr>:I0928 17:10:36.297653 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843555296880
[1,12]<stderr>:I0928 17:10:36.298990 24841 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94843555299104
[1,24]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,24]<stderr>:I0928 17:10:36.375571  8434 ProcessGroupNCCL.cpp:686] [Rank 24] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797634713232
[1,24]<stderr>:I0928 17:10:36.377507  8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797622326816
[1,24]<stderr>:I0928 17:10:36.383812  8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797622545344
[1,24]<stderr>:I0928 17:10:36.385749  8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797624659104
[1,24]<stderr>:I0928 17:10:36.386175  8434 ProcessGroupNCCL.cpp:686] [Rank 24] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797622265424
[1,24]<stderr>:I0928 17:10:36.386415  8434 ProcessGroupNCCL.cpp:686] [Rank 24] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797622275168
[1,24]<stderr>:I0928 17:10:36.386817  8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797621658848
[1,24]<stderr>:I0928 17:10:36.387077  8434 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797632374736
[1,24]<stderr>:I0928 17:10:36.387781  8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797643245072
[1,24]<stderr>:I0928 17:10:36.388206  8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797632397408
[1,24]<stderr>:I0928 17:10:36.388716  8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797633717248
[1,24]<stderr>:I0928 17:10:36.389621  8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797633719472
[1,24]<stderr>:I0928 17:10:36.391235  8434 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94797633721696
[1,17]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,17]<stderr>:I0928 17:10:36.355304  6901 ProcessGroupNCCL.cpp:686] [Rank 17] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494586130816
[1,24]<stderr>:I0928 17:10:36.393000  8434 ProcessGroupNCCL.cpp:2780] Rank 24 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
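
The ProcessGroupNCCL.cpp:2780 warning above is emitted when barrier() runs before this rank's GPU usage is known, so NCCL falls back to GPU 0. Passing device_ids pins the barrier collective to the device the rank actually owns; a minimal sketch assuming the launcher exports LOCAL_RANK (illustrative, not this job's code):

    import os
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])  # assumption: set by the launcher
    torch.cuda.set_device(local_rank)           # bind this process to its own GPU
    dist.init_process_group(backend="nccl")

    # Naming the device suppresses the "using GPU 0 ... can potentially cause a
    # hang" guess logged above (device_ids is supported for the NCCL backend).
    dist.barrier(device_ids=[local_rank])
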
[1,17]<stderr>:I0928 17:10:36.356807  6901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494576519840
[1,17]<stderr>:I0928 17:10:36.363021  6901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494577123520
[1,17]<stderr>:I0928 17:10:36.365171  6901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494577113776
[1,17]<stderr>:I0928 17:10:36.365768  6901 ProcessGroupNCCL.cpp:686] [Rank 17] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494586057360
[1,17]<stderr>:I0928 17:10:36.366010  6901 ProcessGroupNCCL.cpp:686] [Rank 17] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494586060704
[1,17]<stderr>:I0928 17:10:36.366370  6901 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494594119296
[1,17]<stderr>:I0928 17:10:36.366777  6901 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494594121520
[1,17]<stderr>:I0928 17:10:36.367363  6901 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494590678000
[1,17]<stderr>:I0928 17:10:36.367805  6901 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494590682576
[1,17]<stderr>:I0928 17:10:36.368239  6901 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494588830336
[1,17]<stderr>:I0928 17:10:36.369048  6901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494588833360
[1,17]<stderr>:I0928 17:10:36.370493  6901 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94494595266384
[1,18]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,18]<stderr>:I0928 17:10:36.379254  6907 ProcessGroupNCCL.cpp:686] [Rank 18] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032103418048
[1,18]<stderr>:I0928 17:10:36.380781  6907 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032109756784
[1,18]<stderr>:I0928 17:10:36.386423  6907 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032089922368
[1,18]<stderr>:I0928 17:10:36.388432  6907 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032090504128
[1,18]<stderr>:I0928 17:10:36.389006  6907 ProcessGroupNCCL.cpp:686] [Rank 18] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032090494384
[1,18]<stderr>:I0928 17:10:36.389246  6907 ProcessGroupNCCL.cpp:686] [Rank 18] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032104369376
[1,18]<stderr>:I0928 17:10:36.389598  6907 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032111636832
[1,18]<stderr>:I0928 17:10:36.390074  6907 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032109772272
[1,18]<stderr>:I0928 17:10:36.390568  6907 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032109774496
[1,18]<stderr>:I0928 17:10:36.391021  6907 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032102357712
[1,18]<stderr>:I0928 17:10:36.391466  6907 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032099218592
[1,18]<stderr>:I0928 17:10:36.392282  6907 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032099220768
[1,18]<stderr>:I0928 17:10:36.393748  6907 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94032099222944
[1,22]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,22]<stderr>:I0928 17:10:36.398507  6921 ProcessGroupNCCL.cpp:686] [Rank 22] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748338852304
[1,22]<stderr>:I0928 17:10:36.400197  6921 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748372540256
[1,22]<stderr>:I0928 17:10:36.406311  6921 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748381780240
[1,22]<stderr>:I0928 17:10:36.408246  6921 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748372765792
[1,22]<stderr>:I0928 17:10:36.408762  6921 ProcessGroupNCCL.cpp:686] [Rank 22] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748372835696
[1,22]<stderr>:I0928 17:10:36.408998  6921 ProcessGroupNCCL.cpp:686] [Rank 22] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748371169840
[1,22]<stderr>:I0928 17:10:36.409375  6921 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748384710064
[1,22]<stderr>:I0928 17:10:36.409832  6921 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748380275088
[1,22]<stderr>:I0928 17:10:36.410344  6921 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748380277264
[1,22]<stderr>:I0928 17:10:36.410764  6921 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748391377760
[1,22]<stderr>:I0928 17:10:36.411202  6921 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748391379936
[1,22]<stderr>:I0928 17:10:36.412055  6921 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748383594128
[1,22]<stderr>:I0928 17:10:36.413607  6921 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94748383596352
[1,21]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,21]<stderr>:I0928 17:10:36.423324  6919 ProcessGroupNCCL.cpp:686] [Rank 21] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389532071376
[1,21]<stderr>:I0928 17:10:36.425096  6919 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389531242864
[1,21]<stderr>:I0928 17:10:36.430836  6919 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389539382400
[1,21]<stderr>:I0928 17:10:36.432796  6919 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389534722736
[1,21]<stderr>:I0928 17:10:36.433297  6919 ProcessGroupNCCL.cpp:686] [Rank 21] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389536419424
[1,21]<stderr>:I0928 17:10:36.433529  6919 ProcessGroupNCCL.cpp:686] [Rank 21] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389518170400
[1,21]<stderr>:I0928 17:10:36.433919  6919 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389531713216
[1,21]<stderr>:I0928 17:10:36.434312  6919 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389535046480
[1,21]<stderr>:I0928 17:10:36.434938  6919 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389535048704
[1,21]<stderr>:I0928 17:10:36.435379  6919 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389530190496
[1,21]<stderr>:I0928 17:10:36.435827  6919 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389530192720
[1,21]<stderr>:I0928 17:10:36.436689  6919 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389539349392
[1,21]<stderr>:I0928 17:10:36.438230  6919 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94389539351616
[1,23]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,23]<stderr>:I0928 17:10:36.463274  6922 ProcessGroupNCCL.cpp:686] [Rank 23] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236766919504
[1,23]<stderr>:I0928 17:10:36.465096  6922 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236776792976
[1,23]<stderr>:I0928 17:10:36.470860  6922 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236776312880
[1,23]<stderr>:I0928 17:10:36.472807  6922 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236775420416
[1,23]<stderr>:I0928 17:10:36.473253  6922 ProcessGroupNCCL.cpp:686] [Rank 23] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236766348256
[1,23]<stderr>:I0928 17:10:36.473493  6922 ProcessGroupNCCL.cpp:686] [Rank 23] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236771144880
[1,23]<stderr>:I0928 17:10:36.473889  6922 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236766991904
[1,23]<stderr>:I0928 17:10:36.474437  6922 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236766994080
[1,23]<stderr>:I0928 17:10:36.474905  6922 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236767015632
[1,23]<stderr>:I0928 17:10:36.475334  6922 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236767018896
[1,23]<stderr>:I0928 17:10:36.475767  6922 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236767021120
[1,23]<stderr>:I0928 17:10:36.476676  6922 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236767022272
[1,23]<stderr>:I0928 17:10:36.478250  6922 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94236771216624
[1,16]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,16]<stderr>:I0928 17:10:36.488837  6894 ProcessGroupNCCL.cpp:686] [Rank 16] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531083697728
[1,16]<stderr>:I0928 17:10:36.490300  6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531092418800
[1,16]<stderr>:I0928 17:10:36.495915  6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531085015024
[1,16]<stderr>:I0928 17:10:36.497959  6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531094577728
[1,16]<stderr>:I0928 17:10:36.498570  6894 ProcessGroupNCCL.cpp:686] [Rank 16] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531094579904
[1,16]<stderr>:I0928 17:10:36.498822  6894 ProcessGroupNCCL.cpp:686] [Rank 16] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531094583200
[1,16]<stderr>:I0928 17:10:36.499187  6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096685760
[1,16]<stderr>:I0928 17:10:36.499490  6894 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096687984
[1,16]<stderr>:I0928 17:10:36.500156  6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096689136
[1,16]<stderr>:I0928 17:10:36.500597  6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096946416
[1,16]<stderr>:I0928 17:10:36.501024  6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096731856
[1,16]<stderr>:I0928 17:10:36.501770  6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096734080
[1,16]<stderr>:I0928 17:10:36.503214  6894 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94531096736304
[1,16]<stderr>:I0928 17:10:36.505196  6894 ProcessGroupNCCL.cpp:2780] Rank 16 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[1,19]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,19]<stderr>:I0928 17:10:36.518448  6914 ProcessGroupNCCL.cpp:686] [Rank 19] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300337116736
[1,19]<stderr>:I0928 17:10:36.520079  6914 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300325515072
[1,19]<stderr>:I0928 17:10:36.525683  6914 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300326086992
[1,19]<stderr>:I0928 17:10:36.527709  6914 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300334892960
[1,19]<stderr>:I0928 17:10:36.528247  6914 ProcessGroupNCCL.cpp:686] [Rank 19] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300334895088
[1,19]<stderr>:I0928 17:10:36.528477  6914 ProcessGroupNCCL.cpp:686] [Rank 19] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300338145840
[1,19]<stderr>:I0928 17:10:36.528829  6914 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300338149088
[1,19]<stderr>:I0928 17:10:36.529383  6914 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300339687936
[1,19]<stderr>:I0928 17:10:36.529805  6914 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300339690160
[1,19]<stderr>:I0928 17:10:36.530236  6914 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300343115088
[1,19]<stderr>:I0928 17:10:36.530668  6914 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300343117312
[1,19]<stderr>:I0928 17:10:36.531503  6914 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300343119536
[1,19]<stderr>:I0928 17:10:36.532995  6914 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94300337097584
[1,20]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,20]<stderr>:I0928 17:10:36.554998  6917 ProcessGroupNCCL.cpp:686] [Rank 20] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072592032544
[1,20]<stderr>:I0928 17:10:36.556634  6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072592411888
[1,20]<stderr>:I0928 17:10:36.562213  6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072578212336
[1,20]<stderr>:I0928 17:10:36.564203  6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072578201872
[1,20]<stderr>:I0928 17:10:36.564730  6917 ProcessGroupNCCL.cpp:686] [Rank 20] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072578587728
[1,20]<stderr>:I0928 17:10:36.564967  6917 ProcessGroupNCCL.cpp:686] [Rank 20] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072578176608
[1,20]<stderr>:I0928 17:10:36.565351  6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072577999200
[1,20]<stderr>:I0928 17:10:36.565639  6917 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072587434096
[1,20]<stderr>:I0928 17:10:36.566325  6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072589537184
[1,20]<stderr>:I0928 17:10:36.566758  6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072599580624
[1,20]<stderr>:I0928 17:10:36.567198  6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072599582800
[1,20]<stderr>:I0928 17:10:36.568019  6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072599585024
[1,20]<stderr>:I0928 17:10:36.569561  6917 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94072599819744
[1,15]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,15]<stderr>:I0928 17:10:36.737020 24847 ProcessGroupNCCL.cpp:686] [Rank 15] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109461299984
[1,15]<stderr>:I0928 17:10:36.738432 24847 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109461389152
[1,15]<stderr>:I0928 17:10:36.744571 24847 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109463704672
[1,15]<stderr>:I0928 17:10:36.746693 24847 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109461309728
[1,15]<stderr>:I0928 17:10:36.747330 24847 ProcessGroupNCCL.cpp:686] [Rank 15] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109461253648
[1,15]<stderr>:I0928 17:10:36.747577 24847 ProcessGroupNCCL.cpp:686] [Rank 15] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109460698656
[1,15]<stderr>:I0928 17:10:36.747915 24847 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109473134976
[1,15]<stderr>:I0928 17:10:36.748497 24847 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109472801488
[1,15]<stderr>:I0928 17:10:36.748901 24847 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109472803712
[1,15]<stderr>:I0928 17:10:36.749331 24847 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109482399584
[1,15]<stderr>:I0928 17:10:36.749781 24847 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109482401808
[1,15]<stderr>:I0928 17:10:36.750537 24847 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109482404032
[1,15]<stderr>:I0928 17:10:36.751936 24847 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94109475019632
[1,25]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,25]<stderr>:I0928 17:10:36.752588  8440 ProcessGroupNCCL.cpp:686] [Rank 25] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423234098688
[1,25]<stderr>:I0928 17:10:36.754529  8440 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423220635968
[1,25]<stderr>:I0928 17:10:36.760869  8440 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423220626224
[1,25]<stderr>:I0928 17:10:36.762804  8440 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423234528464
[1,25]<stderr>:I0928 17:10:36.763206  8440 ProcessGroupNCCL.cpp:686] [Rank 25] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423220087904
[1,25]<stderr>:I0928 17:10:36.763445  8440 ProcessGroupNCCL.cpp:686] [Rank 25] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423220118176
[1,25]<stderr>:I0928 17:10:36.763859  8440 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423230722784
[1,25]<stderr>:I0928 17:10:36.764223  8440 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423233605248
[1,25]<stderr>:I0928 17:10:36.764853  8440 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423233607424
[1,25]<stderr>:I0928 17:10:36.765290  8440 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423231555728
[1,25]<stderr>:I0928 17:10:36.765724  8440 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423231557952
[1,25]<stderr>:I0928 17:10:36.766669  8440 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423231560176
[1,25]<stderr>:I0928 17:10:36.768301  8440 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94423234156288
[1,8]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,8]<stderr>:I0928 17:10:36.805460 24819 ProcessGroupNCCL.cpp:686] [Rank 8] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862987275536
[1,8]<stderr>:I0928 17:10:36.806476 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862987200032
[1,8]<stderr>:I0928 17:10:36.811786 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862981200896
[1,8]<stderr>:I0928 17:10:36.814039 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862631544896
[1,8]<stderr>:I0928 17:10:36.814842 24819 ProcessGroupNCCL.cpp:686] [Rank 8] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862919336832
[1,8]<stderr>:I0928 17:10:36.815078 24819 ProcessGroupNCCL.cpp:686] [Rank 8] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862981465536
[1,8]<stderr>:I0928 17:10:36.815388 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862967433392
[1,8]<stderr>:I0928 17:10:36.815769 24819 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862981012400
[1,8]<stderr>:I0928 17:10:36.816380 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862981957888
[1,8]<stderr>:I0928 17:10:36.816820 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862981961824
[1,31]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,31]<stderr>:I0928 17:10:36.802973  8462 ProcessGroupNCCL.cpp:686] [Rank 31] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247866576640
[1,8]<stderr>:I0928 17:10:36.817243 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862978075328
[1,8]<stderr>:I0928 17:10:36.817868 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862978077552
[1,8]<stderr>:I0928 17:10:36.819103 24819 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94862978079776
[1,31]<stderr>:I0928 17:10:36.805253  8462 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247866566896
[1,8]<stderr>:I0928 17:10:36.821259 24819 ProcessGroupNCCL.cpp:2780] Rank 8 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[1,31]<stderr>:I0928 17:10:36.810376  8462 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247866021808
[1,31]<stderr>:I0928 17:10:36.812062  8462 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247865999184
[1,31]<stderr>:I0928 17:10:36.812310  8462 ProcessGroupNCCL.cpp:686] [Rank 31] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247877783136
[1,31]<stderr>:I0928 17:10:36.812574  8462 ProcessGroupNCCL.cpp:686] [Rank 31] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247877786480
[1,31]<stderr>:I0928 17:10:36.813009  8462 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247877789776
[1,31]<stderr>:I0928 17:10:36.813500  8462 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247877792000
[1,31]<stderr>:I0928 17:10:36.813617  8462 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879112736
[1,31]<stderr>:I0928 17:10:36.814082  8462 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879117760
[1,31]<stderr>:I0928 17:10:36.814509  8462 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879119840
[1,31]<stderr>:I0928 17:10:36.814936  8462 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879121920
[1,31]<stderr>:I0928 17:10:36.815979  8462 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879124096
[1,31]<stderr>:I0928 17:10:36.817761  8462 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94247879126272
[1,12]<stderr>:I0928 17:10:36.874425 24841 ProcessGroupNCCL.cpp:2780] Rank 12 using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[1,28]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,28]<stderr>:I0928 17:10:36.866642  8456 ProcessGroupNCCL.cpp:686] [Rank 28] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229795432624
[1,28]<stderr>:I0928 17:10:36.868739  8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229792988208
[1,1]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,1]<stderr>:I0928 17:10:36.914921 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197189493312
[1,1]<stderr>:I0928 17:10:36.915673 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197200118480
[1,28]<stderr>:I0928 17:10:36.873793  8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229733066432
[1,28]<stderr>:I0928 17:10:36.875526  8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229793672560
[1,28]<stderr>:I0928 17:10:36.875842  8456 ProcessGroupNCCL.cpp:686] [Rank 28] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229799201184
[1,28]<stderr>:I0928 17:10:36.876076  8456 ProcessGroupNCCL.cpp:686] [Rank 28] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229781891456
[1,28]<stderr>:I0928 17:10:36.876530  8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229781226064
[1,28]<stderr>:I0928 17:10:36.876778  8456 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229793551760
[1,28]<stderr>:I0928 17:10:36.876883  8456 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229793553984
[1,28]<stderr>:I0928 17:10:36.877589  8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229793959648
[1,28]<stderr>:I0928 17:10:36.878011  8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229793961872
[1,1]<stderr>:I0928 17:10:36.922307 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197180903056
[1,28]<stderr>:I0928 17:10:36.878453  8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229795357264
[1,28]<stderr>:I0928 17:10:36.879438  8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229795359488
[1,7]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,7]<stderr>:I0928 17:10:36.923707 21526 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861098353136
[1,7]<stderr>:I0928 17:10:36.924728 21526 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861096737312
[1,1]<stderr>:I0928 17:10:36.924889 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197180893312
[1,28]<stderr>:I0928 17:10:36.881153  8456 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94229795361712
[1,1]<stderr>:I0928 17:10:36.925887 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197189868976
[1,1]<stderr>:I0928 17:10:36.926132 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197180295600
[1,5]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,5]<stderr>:I0928 17:10:36.926327 21524 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819155932480
[1,1]<stderr>:I0928 17:10:36.926362 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197193874336
[1,1]<stderr>:I0928 17:10:36.926870 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197193876560
[1,1]<stderr>:I0928 17:10:36.926990 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197201948928
[1,1]<stderr>:I0928 17:10:36.927093 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197201953200
[1,5]<stderr>:I0928 17:10:36.927234 21524 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819177567008
[1,1]<stderr>:I0928 17:10:36.927510 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197194033664
[1,1]<stderr>:I0928 17:10:36.927934 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197194035888
[1,1]<stderr>:I0928 17:10:36.928367 21509 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197194038064
[1,1]<stderr>:I0928 17:10:36.928872 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197202020224
[1,7]<stderr>:I0928 17:10:36.929852 21526 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861104499648
[1,1]<stderr>:I0928 17:10:36.929937 21509 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94197202022928
[1,5]<stderr>:I0928 17:10:36.932377 21524 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819107612576
[1,7]<stderr>:I0928 17:10:36.932574 21526 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861103093088
[1,7]<stderr>:I0928 17:10:36.933432 21526 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861102548352
[1,7]<stderr>:I0928 17:10:36.933667 21526 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861102525392
[1,7]<stderr>:I0928 17:10:36.933935 21526 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861085333808
[1,7]<stderr>:I0928 17:10:36.934589 21526 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861052349808
[1,5]<stderr>:I0928 17:10:36.934896 21524 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819157173744
[1,7]<stderr>:I0928 17:10:36.934911 21526 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861094929008
[1,7]<stderr>:I0928 17:10:36.935353 21526 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861094932864
[1,7]<stderr>:I0928 17:10:36.935786 21526 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861094935088
[1,5]<stderr>:I0928 17:10:36.935801 21524 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819153183616
[1,5]<stderr>:I0928 17:10:36.936040 21524 ProcessGroupNCCL.cpp:686] [Rank 5] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819107844992
[1,5]<stderr>:I0928 17:10:36.936304 21524 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819107567760
[1,7]<stderr>:I0928 17:10:36.936416 21526 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861099051504
[1,5]<stderr>:I0928 17:10:36.936815 21524 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819170248480
[1,5]<stderr>:I0928 17:10:36.937319 21524 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819170250704
[1,7]<stderr>:I0928 17:10:36.937649 21526 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93861099054256
[1,5]<stderr>:I0928 17:10:36.937741 21524 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819169545952
[1,5]<stderr>:I0928 17:10:36.938180 21524 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819169548176
[1,5]<stderr>:I0928 17:10:36.938771 21524 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819169670160
[1,5]<stderr>:I0928 17:10:36.939939 21524 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94819169672384
[1,29]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,29]<stderr>:I0928 17:10:36.899094  8459 ProcessGroupNCCL.cpp:686] [Rank 29] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854681200400
[1,29]<stderr>:I0928 17:10:36.901144  8459 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854668022080
[1,29]<stderr>:I0928 17:10:36.906114  8459 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854668257792
[1,29]<stderr>:I0928 17:10:36.907840  8459 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854681371424
[1,29]<stderr>:I0928 17:10:36.908150  8459 ProcessGroupNCCL.cpp:686] [Rank 29] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854667952112
[1,29]<stderr>:I0928 17:10:36.908402  8459 ProcessGroupNCCL.cpp:686] [Rank 29] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854667959808
[1,29]<stderr>:I0928 17:10:36.908844  8459 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854667961856
[1,4]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,4]<stderr>:I0928 17:10:36.952936 21522 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070748528960
[1,29]<stderr>:I0928 17:10:36.909166  8459 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854681445824
[1,29]<stderr>:I0928 17:10:36.909273  8459 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854635908528
[1,4]<stderr>:I0928 17:10:36.953810 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070755923376
[1,29]<stderr>:I0928 17:10:36.909904  8459 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854676619312
[1,29]<stderr>:I0928 17:10:36.910317  8459 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854689090560
[1,29]<stderr>:I0928 17:10:36.910737  8459 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854689092784
[1,29]<stderr>:I0928 17:10:36.911710  8459 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854685104048
[1,29]<stderr>:I0928 17:10:36.913434  8459 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=93854685106800
[1,4]<stderr>:I0928 17:10:36.958770 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070737500832
[1,4]<stderr>:I0928 17:10:36.961220 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070757860832
[1,4]<stderr>:I0928 17:10:36.962128 21522 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070749840400
[1,4]<stderr>:I0928 17:10:36.962368 21522 ProcessGroupNCCL.cpp:686] [Rank 4] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070687931392
[1,4]<stderr>:I0928 17:10:36.962633 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070736072192
[1,4]<stderr>:I0928 17:10:36.963016 21522 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070746462272
[1,4]<stderr>:I0928 17:10:36.963591 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070748671648
[1,4]<stderr>:I0928 17:10:36.964022 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070748675664
[1,4]<stderr>:I0928 17:10:36.964448 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070750620080
[1,3]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,3]<stderr>:I0928 17:10:36.964892 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447668003536
[1,4]<stderr>:I0928 17:10:36.965009 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070750622304
[1,3]<stderr>:I0928 17:10:36.965688 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447668046576
[1,4]<stderr>:I0928 17:10:36.966142 21522 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94070750624528
[1,3]<stderr>:I0928 17:10:36.971949 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447674233936
[1,3]<stderr>:I0928 17:10:36.974468 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447672383664
[1,3]<stderr>:I0928 17:10:36.975409 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447606256144
[1,3]<stderr>:I0928 17:10:36.975648 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447657472512
[1,3]<stderr>:I0928 17:10:36.975901 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447622088544
[1,3]<stderr>:I0928 17:10:36.976589 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447667383952
[1,3]<stderr>:I0928 17:10:36.976708 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447668543072
[1,3]<stderr>:I0928 17:10:36.976804 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447665615008
[1,3]<stderr>:I0928 17:10:36.977036 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447665617184
[1,3]<stderr>:I0928 17:10:36.977481 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447665619360
[1,3]<stderr>:I0928 17:10:36.977908 21518 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447676008624
[1,3]<stderr>:I0928 17:10:36.978473 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447676010848
[1,3]<stderr>:I0928 17:10:36.979580 21518 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94447676013024
[1,27]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,27]<stderr>:I0928 17:10:36.952451  8451 ProcessGroupNCCL.cpp:686] [Rank 27] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521774252128
[1,27]<stderr>:I0928 17:10:36.954414  8451 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521817222960
[1,27]<stderr>:I0928 17:10:36.959306  8451 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521806302864
[1,27]<stderr>:I0928 17:10:36.961118  8451 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521806442496
[1,27]<stderr>:I0928 17:10:36.961479  8451 ProcessGroupNCCL.cpp:686] [Rank 27] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521757550784
[1,27]<stderr>:I0928 17:10:36.961714  8451 ProcessGroupNCCL.cpp:686] [Rank 27] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521806312608
[1,27]<stderr>:I0928 17:10:36.962117  8451 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521805743536
[1,27]<stderr>:I0928 17:10:36.962615  8451 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521819356656
[1,27]<stderr>:I0928 17:10:36.963070  8451 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521819358832
[1,27]<stderr>:I0928 17:10:36.963505  8451 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521827551056
[1,27]<stderr>:I0928 17:10:36.963940  8451 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521827553232
[1,27]<stderr>:I0928 17:10:36.964911  8451 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521827555456
[1,27]<stderr>:I0928 17:10:36.966615  8451 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94521819344144
[1,22]<stderr>:I0928 17:10:36.962049  6921 ProcessGroupNCCL.cpp:2780] Rank 22 using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[1,17]<stderr>:I0928 17:10:36.971592  6901 ProcessGroupNCCL.cpp:2780] Rank 17 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[1,21]<stderr>:I0928 17:10:36.977463  6919 ProcessGroupNCCL.cpp:2780] Rank 21 using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[1,2]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,2]<stderr>:I0928 17:10:37.073457 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012302754640
[1,2]<stderr>:I0928 17:10:37.074244 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012336462736
[1,2]<stderr>:I0928 17:10:37.080004 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012348934864
[1,2]<stderr>:I0928 17:10:37.082574 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012345126304
[1,2]<stderr>:I0928 17:10:37.083531 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012335727184
[1,2]<stderr>:I0928 17:10:37.083765 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012335106512
[1,2]<stderr>:I0928 17:10:37.084004 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012352592832
[1,2]<stderr>:I0928 17:10:37.084596 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012352595056
[1,2]<stderr>:I0928 17:10:37.084703 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012356784400
[1,2]<stderr>:I0928 17:10:37.084812 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012356788976
[1,2]<stderr>:I0928 17:10:37.085153 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012348880352
[1,2]<stderr>:I0928 17:10:37.085594 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012348882576
[1,2]<stderr>:I0928 17:10:37.086021 21514 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012348884800
[1,2]<stderr>:I0928 17:10:37.086558 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012354867344
[1,2]<stderr>:I0928 17:10:37.087626 21514 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94012354869568
[1,18]<stderr>:I0928 17:10:37.012248  6907 ProcessGroupNCCL.cpp:2780] Rank 18 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[1,30]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,30]<stderr>:I0928 17:10:37.059296  8461 ProcessGroupNCCL.cpp:686] [Rank 30] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159712551936
[1,30]<stderr>:I0928 17:10:37.061789  8461 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159712542192
[1,6]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,6]<stderr>:I0928 17:10:37.110015 21525 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968938272
[1,6]<stderr>:I0928 17:10:37.110972 21525 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875972849008
[1,30]<stderr>:I0928 17:10:37.067155  8461 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159663803232
[1,30]<stderr>:I0928 17:10:37.068888  8461 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159725652016
[1,30]<stderr>:I0928 17:10:37.069159  8461 ProcessGroupNCCL.cpp:686] [Rank 30] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159725468368
[1,30]<stderr>:I0928 17:10:37.069396  8461 ProcessGroupNCCL.cpp:686] [Rank 30] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159725471712
[1,30]<stderr>:I0928 17:10:37.069839  8461 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159733738016
[1,30]<stderr>:I0928 17:10:37.070231  8461 ProcessGroupNCCL.cpp:686] [Rank 7] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159733740192
[1,30]<stderr>:I0928 17:10:37.070331  8461 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159733742368
[1,30]<stderr>:I0928 17:10:37.070874  8461 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159724029824
[1,30]<stderr>:I0928 17:10:37.071305  8461 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159724032000
[1,30]<stderr>:I0928 17:10:37.071736  8461 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159724034224
[1,6]<stderr>:I0928 17:10:37.116189 21525 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875973933408
[1,30]<stderr>:I0928 17:10:37.072749  8461 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159731088448
[1,30]<stderr>:I0928 17:10:37.074509  8461 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94159731091072
[1,6]<stderr>:I0928 17:10:37.118682 21525 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875963737456
[1,6]<stderr>:I0928 17:10:37.119536 21525 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875967152384
[1,6]<stderr>:I0928 17:10:37.119781 21525 ProcessGroupNCCL.cpp:686] [Rank 6] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875966863200
[1,6]<stderr>:I0928 17:10:37.120056 21525 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875966866544
[1,6]<stderr>:I0928 17:10:37.120616 21525 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968701344
[1,6]<stderr>:I0928 17:10:37.121022 21525 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968703568
[1,6]<stderr>:I0928 17:10:37.121486 21525 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968705744
[1,6]<stderr>:I0928 17:10:37.121914 21525 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968871392
[1,6]<stderr>:I0928 17:10:37.122524 21525 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968873568
[1,6]<stderr>:I0928 17:10:37.123687 21525 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94875968875792
[1,23]<stderr>:I0928 17:10:37.044494  6922 ProcessGroupNCCL.cpp:2780] Rank 23 using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[1,11]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,11]<stderr>:I0928 17:10:37.128873 24836 ProcessGroupNCCL.cpp:686] [Rank 11] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498980641440
[1,11]<stderr>:I0928 17:10:37.130040 24836 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498918979744
[1,11]<stderr>:I0928 17:10:37.135072 24836 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498968532784
[1,26]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,26]<stderr>:I0928 17:10:37.122563  8446 ProcessGroupNCCL.cpp:686] [Rank 26] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94651029190032
[1,11]<stderr>:I0928 17:10:37.137250 24836 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498980072064
[1,10]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[1,10]<stderr>:I0928 17:10:37.137311 24831 ProcessGroupNCCL.cpp:686] [Rank 10] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594181953040
[1,11]<stderr>:I0928 17:10:37.137989 24836 ProcessGroupNCCL.cpp:686] [Rank 11] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498988908752
[1,11]<stderr>:I0928 17:10:37.138221 24836 ProcessGroupNCCL.cpp:686] [Rank 11] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498988912048
[1,26]<stderr>:I0928 17:10:37.124487  8446 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94650980360432
[1,11]<stderr>:I0928 17:10:37.138522 24836 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94498978682928
[1,10]<stderr>:I0928 17:10:37.138542 24831 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=94594169667872
[... 28 near-identical ProcessGroupNCCL initialization-options lines from ranks 10, 11, and 26 elided: one line per process group created on each process, differing only in the reported rank, the group ID, and the timeout; every group reports TIMEOUT(ms): 600000 except one per process with TIMEOUT(ms): 1800000 ...]
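The timeout split in these initialization lines decodes as 600000 ms = 10 minutes and 1800000 ms = 30 minutes. A plausible reading (an assumption, this log does not show the call sites) is that most groups are created with an explicit 10-minute timeout, while the single 30-minute group per process is one created without an explicit timeout, falling back to PyTorch's default. A minimal sketch of how the two values would arise:

    import datetime
    import torch.distributed as dist

    # Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE come from the launcher.
    # An explicit 10-minute timeout is logged as TIMEOUT(ms): 600000.
    dist.init_process_group(backend="nccl",
                            timeout=datetime.timedelta(minutes=10))

    # A subgroup created without its own timeout falls back to PyTorch's
    # 30-minute default, which would be logged as TIMEOUT(ms): 1800000.
    group = dist.new_group(ranks=[0, 1, 2, 3])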
[1,20]<stderr>:I0928 17:10:37.120223  6917 ProcessGroupNCCL.cpp:2780] Rank 20 using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[1,13]<stderr>:WARNING: Logging before InitGoogleLogging() is written to STDERR
[... ProcessGroupNCCL initialization-options lines from rank 13 elided; same options as above ...]
[1,19]<stderr>:I0928 17:10:37.151767  6914 ProcessGroupNCCL.cpp:2780] Rank 19 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[... further "WARNING: Logging before InitGoogleLogging() is written to STDERR" banners and ProcessGroupNCCL initialization-options lines from ranks 0, 9, 13, and 14 elided; same options as above ...]
[1,0]<stdout>:> initialized tensor model parallel with size 4
[1,0]<stdout>:> initialized pipeline model parallel with size 8
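With 32 ranks, tensor-parallel size 4 and pipeline-parallel size 8 consume the whole job for a single model replica, since 32 / (4 * 8) = 1 (data-parallel size 1). A minimal sketch of the corresponding setup with megatron.core; the exact call site in Megatron-LM-Qwen is assumed, not shown in this log:

    import torch.distributed as dist
    from megatron.core import parallel_state

    # Assumes the usual launcher environment variables are set.
    dist.init_process_group(backend="nccl")

    # 4-way tensor parallelism x 8-way pipeline parallelism = 32 GPUs,
    # matching the "size 4" / "size 8" lines above.
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=4,
        pipeline_model_parallel_size=8,
    )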
[1,0]<stdout>:> setting random seeds to 1234 ...
[1,0]<stdout>:> compiling dataset index builder ...
[1,0]<stdout>:make: Entering directory '/data/project/Megatron-LM-Qwen/megatron/core/datasets'
[1,0]<stdout>:make: Nothing to be done for 'default'.
[1,0]<stdout>:make: Leaving directory '/data/project/Megatron-LM-Qwen/megatron/core/datasets'
[1,0]<stdout>:>>> done with dataset index builder. Compilation time: 0.030 seconds
[1,0]<stdout>:> compiling and loading fused kernels ...
[1,0]<stderr>:I0928 17:10:37.305773 21506 ProcessGroupNCCL.cpp:2780] Rank 0 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[... matching "using GPU ... to perform barrier" warnings from the other ranks elided; every rank r that logs the warning reports GPU r mod 8, so the guessed rank-to-GPU mapping is consistent across the job ...]
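The warning itself is advisory: before the process has touched a CUDA device, ProcessGroupNCCL guesses which GPU to run barrier() on (here rank mod 8) and warns that a wrong guess can hang. The fix it suggests is to pass device_ids explicitly. A minimal sketch, with local_rank taken from the launcher environment as an assumption:

    import os
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Pinning the barrier to this process's GPU avoids the
    # "devices used by this process are currently unknown" warning.
    dist.barrier(device_ids=[local_rank])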
[1,0]<stderr>:I0928 17:10:40.743487 21506 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,0]<stdout>:>>> done with compiling and loading fused kernels. Compilation time: 3.489 seconds
[1,0]<stdout>:time to initialize megatron (seconds): 16.969
[1,0]<stdout>:[after megatron is initialized] datetime: 2024-09-28 17:10:40 
[1,0]<stdout>:building GPT model ...
[1,21]<stdout>:++++++++++++++++++++++++padding is done
[1,21]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[... the two debug prints above repeat from every rank for each RowParallelLinear built during "building GPT model ..."; several hundred further occurrences are elided. Every reported weight is torch.Size([8192, 2048]). Some occurrences are split and interleaved across ranks by the mpirun "[1,N]<stdout>:" line prefixer; the elided lines carry no information beyond the two messages shown ...]
[1,4]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 605716480
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,21]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++ weight.size is in RowParallelLinear: [1,25]<stdout>:torch.Size([8192, 2048])
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++ weight.size is in RowParallelLinear: [1,7]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:torch.Size([8192, 2048])
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,10]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,21]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (1, 5): 605716480
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,24]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,26]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,29]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,16]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,10]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,24]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,14]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,15]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,22]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,8]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,1]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,10]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (2, 2): 605716480
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,3]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,1]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 917143552
[1,2]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,25]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (1, 6): 605716480
[1,15]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (0, 2): 605716480
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,24]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (0, 6): 605716480
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,3]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 917143552
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,29]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,7]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,17]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (1, 4): 605716480
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,14]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,0]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,29]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (1, 7): 605724672
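[annotation] The three distinct parameter counts in this log are mutually consistent: 917143552 appears only on pipeline stage 0, 605724672 only on the last stage (7), and 605716480 everywhere else. The stage-0 surplus, 917143552 - 605716480 = 311427072 = (152064 / 4) x 8192, matches a tensor-parallel shard of a padded-vocabulary embedding, and the last-stage surplus of exactly 8192 matches one final-norm vector. A back-of-the-envelope check under those assumptions (hidden size 8192, padded vocabulary 152064, and tensor-parallel size 4 are inferred from the numbers, not logged explicitly):

hidden_size  = 8192     # assumed: consistent with the [8192, 2048] shards above
padded_vocab = 152064   # assumed: inferred from the parameter counts
tp_size      = 4        # assumed: tensor ranks 0-3 appear in this log

embedding_shard = (padded_vocab // tp_size) * hidden_size  # 311427072
print(605716480 + embedding_shard)  # 917143552 -> pipeline stage 0
print(605716480 + hidden_size)      # 605724672 -> last pipeline stage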
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (3, 1): 605716480
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,6]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,11]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,4]<stdout>:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
[1,4]<stdout>:Params for bucket 1 (605716480 elements):
[1,4]<stdout>:	module.language_model.encoder.layers.9.post_attention_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.9.input_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight
[1,4]<stdout>:	module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight
[1,4]<stdout>:	module.language_model.encoder.layers.2.input_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight
[1,4]<stdout>:	module.language_model.encoder.layers.8.post_attention_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.8.input_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.6.post_attention_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.6.input_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight
[1,4]<stdout>:	module.language_model.encoder.layers.3.input_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight
[1,4]<stdout>:	module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight
[1,4]<stdout>:	module.language_model.encoder.layers.1.post_attention_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight
[1,4]<stdout>:	module.language_model.encoder.layers.0.post_attention_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.3.post_attention_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight
[1,4]<stdout>:	module.language_model.encoder.layers.4.post_attention_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.2.post_attention_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.1.input_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.7.post_attention_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.7.input_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.5.post_attention_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight
[1,4]<stdout>:	module.language_model.encoder.layers.4.input_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.5.input_norm.weight
[1,4]<stdout>:	module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight
[1,4]<stdout>:	module.language_model.encoder.layers.0.input_norm.weight
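[annotation] The bucket report above comes from Megatron's distributed gradient buffer: every parameter on this model-parallel rank is packed into a single 605716480-element bucket so that one all-reduce / reduce-scatter covers all of their gradients. A simplified sketch of the flat-buffer idea (toy shapes, not the actual param_and_grad_buffer implementation):

import torch

# Toy parameter set standing in for one pipeline stage's weights.
params = [torch.nn.Parameter(torch.randn(8192)),
          torch.nn.Parameter(torch.randn(8192, 2048))]

numel = sum(p.numel() for p in params)
grad_buffer = torch.zeros(numel)          # one flat bucket for all grads

offset = 0
grad_views = []
for p in params:
    grad_views.append(grad_buffer[offset:offset + p.numel()].view_as(p))
    offset += p.numel()
# Each entry of grad_views aliases a slice of grad_buffer, so a single
# collective (e.g. torch.distributed.all_reduce(grad_buffer) or a
# reduce-scatter) reduces every parameter's gradient in one call.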
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (2, 1): 605716480
[1,14]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,11]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (3, 2): 605716480
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,12]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (0, 3): 605716480
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,30]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,5]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,15]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,15]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,13]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,4]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,4]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,21]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,21]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
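[annotation] The UserWarning above is raised inside apex's FusedAdam, which still allocates its overflow buffer with the legacy torch.cuda.IntTensor constructor; it is harmless for training. A sketch of the equivalent construction using the API the warning itself recommends (needs a CUDA device):

import torch

# Deprecated pattern used inside apex's FusedAdam (quoted in the warning):
#   self._dummy_overflow_buf = torch.cuda.IntTensor([0])
# Recommended replacement:
dummy_overflow_buf = torch.tensor([0], dtype=torch.int, device='cuda')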
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,5]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (1, 1): 605716480
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,15]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (3, 3): 605716480
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,28]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (0, 7): 605724672
[1,30]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,9]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 917143552
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,27]<stdout>:++++++++++++++++++++++++padding is done
[1,22]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (2, 5): 605716480
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,13]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,14]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,9]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (1, 2): 605716480
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,27]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (3, 6): 605716480
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,8]<stdout>:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
[1,8]<stdout>:Params for bucket 1 (605716480 elements):
[1,8]<stdout>:	module.language_model.encoder.layers.7.post_attention_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.7.input_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.5.post_attention_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight
[1,8]<stdout>:	module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight
[1,8]<stdout>:	module.language_model.encoder.layers.0.post_attention_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.5.input_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight
[1,8]<stdout>:	module.language_model.encoder.layers.8.post_attention_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.8.input_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.6.post_attention_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.6.input_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.2.post_attention_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.2.input_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.9.post_attention_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.9.input_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight
[1,8]<stdout>:	module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight
[1,8]<stdout>:	module.language_model.encoder.layers.3.input_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight
[1,2]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,8]<stdout>:	module.language_model.encoder.layers.4.post_attention_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight
[1,8]<stdout>:	module.language_model.encoder.layers.4.input_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight
[1,8]<stdout>:	module.language_model.encoder.layers.1.input_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.0.input_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight
[1,8]<stdout>:	module.language_model.encoder.layers.3.post_attention_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.1.post_attention_norm.weight
[1,8]<stdout>:	module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight
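The bucket header above reports 605716480 elements; a minimal sketch (not Megatron's code, just the relationship the log expresses) of how that figure relates to the listed parameters:

    def bucket_elements(params):
        # params: iterable of torch.nn.Parameter assigned to one
        # gradient all-reduce / reduce-scatter bucket
        return sum(p.numel() for p in params)  # == the element count in the header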
[1,13]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (1, 3): 605716480
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,18]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,26]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,2]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,2]<stdout>:++++++++++++++++++++++++padding is done
[1,18]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (2, 4): 605716480
[1,24]<stdout>:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
[1,24]<stdout>:Params for bucket 1 (605716480 elements):
[1,24]<stdout>:	module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight
[1,24]<stdout>:	module.language_model.encoder.layers.7.post_attention_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.7.input_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.4.post_attention_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.1.input_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.3.post_attention_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.4.input_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.0.input_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.2.input_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight
[1,24]<stdout>:	module.language_model.encoder.layers.8.post_attention_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.8.input_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight
[1,24]<stdout>:	module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight
[1,24]<stdout>:	module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight
[1,24]<stdout>:	module.language_model.encoder.layers.9.input_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight
[1,24]<stdout>:	module.language_model.encoder.layers.5.post_attention_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight
[1,24]<stdout>:	module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight
[1,24]<stdout>:	module.language_model.encoder.layers.5.input_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight
[1,24]<stdout>:	module.language_model.encoder.layers.9.post_attention_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.6.post_attention_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.6.input_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.2.post_attention_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.1.post_attention_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight
[1,24]<stdout>:	module.language_model.encoder.layers.3.input_norm.weight
[1,24]<stdout>:	module.language_model.encoder.layers.0.post_attention_norm.weight
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,8]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,8]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
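The UserWarning above states its own fix; a minimal sketch of the recommended replacement for apex's dummy overflow buffer, assuming a CUDA device is available:

    import torch

    # Deprecated pattern flagged by the warning:
    #   self._dummy_overflow_buf = torch.cuda.IntTensor([0])
    # Construction recommended by the warning text:
    dummy_overflow_buf = torch.tensor([0], dtype=torch.int, device="cuda")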
[1,26]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (2, 6): 605716480
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,30]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,14]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,10]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,10]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,2]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (2, 0): 917143552
[1,25]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,25]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,16]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,17]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,17]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,0]<stdout>:INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=False, use_distributed_optimizer=True, check_for_nan_in_grad=True, bucket_size=None, average_in_collective=False)
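For reference, a hedged sketch reconstructing the DDP configuration logged above; the import path is inferred from the logger name and may differ across Megatron-core versions:

    from megatron.core.distributed import DistributedDataParallelConfig

    ddp_config = DistributedDataParallelConfig(
        grad_reduce_in_fp32=True,        # reduce gradients in fp32
        overlap_grad_reduce=False,       # no overlap of grad reduce with backward
        use_distributed_optimizer=True,  # shard optimizer state across data-parallel ranks
        check_for_nan_in_grad=True,
        bucket_size=None,                # None -> a single bucket, matching the bucket logs
        average_in_collective=False,
    )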
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,14]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (2, 3): 605716480
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,20]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (0, 5): 605716480
[1,24]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,24]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,7]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,7]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,19]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,3]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,3]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,16]<stdout>:++++++++++++++++++++++++padding is done
[1,12]<stdout>:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
[1,12]<stdout>:Params for bucket 1 (605716480 elements):
[1,12]<stdout>:	module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight
[1,12]<stdout>:	module.language_model.encoder.layers.8.post_attention_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.8.input_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight
[1,12]<stdout>:	module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight
[1,12]<stdout>:	module.language_model.encoder.layers.9.post_attention_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.9.input_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight
[1,12]<stdout>:	module.language_model.encoder.layers.5.post_attention_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.0.input_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight
[1,12]<stdout>:	module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight
[1,12]<stdout>:	module.language_model.encoder.layers.5.input_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.0.post_attention_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.6.input_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.2.post_attention_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.4.post_attention_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight
[1,12]<stdout>:	module.language_model.encoder.layers.1.post_attention_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.3.input_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight
[1,12]<stdout>:	module.language_model.encoder.layers.6.post_attention_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.3.post_attention_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight
[1,12]<stdout>:	module.language_model.encoder.layers.7.post_attention_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.7.input_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight
[1,12]<stdout>:	module.language_model.encoder.layers.4.input_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.2.input_norm.weight
[1,12]<stdout>:	module.language_model.encoder.layers.1.input_norm.weight
[1,11]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,11]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,30]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (2, 7): 605724672
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,1]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,29]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,29]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,19]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,19]<stdout>:++++++++++++++++++++++++padding is done
[1,16]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (0, 4): 605716480
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,6]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,6]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,19]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (3, 4): 605716480
[1,12]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,12]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,31]<stdout>:++++++++++++++++++++++++padding is done
[1,31]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (3, 7): 605724672
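Two consistency checks on the per-stage parameter counts in this log (the arithmetic uses the logged numbers directly; the interpretation of each difference is an inference):

    # Last pipeline stage (ranks (*, 7): 605724672) vs middle stages (605716480):
    print(605724672 - 605716480)  # 8192 = one hidden-size vector; matches the extra
                                  # encoder.final_norm.weight that appears only in the
                                  # last-stage bucket list (rank 28 below)

    # First pipeline stage (rank (2, 0): 917143552) vs middle stages:
    print(917143552 - 605716480)  # 311427072 = 38016 * 8192, i.e. a word-embedding
                                  # shard (vocab 152064 / tensor-parallel 4) carried
                                  # only by the first stage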
[1,0]<stdout>:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
[1,0]<stdout>:Params for bucket 1 (917143552 elements):
[1,0]<stdout>:	module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight
[1,0]<stdout>:	module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight
[1,0]<stdout>:	module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight
[1,0]<stdout>:	module.language_model.embedding.word_embeddings.weight
[1,0]<stdout>:	module.language_model.encoder.layers.7.input_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight
[1,0]<stdout>:	module.language_model.encoder.layers.4.input_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.1.input_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.5.post_attention_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight
[1,0]<stdout>:	module.language_model.encoder.layers.9.post_attention_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.2.post_attention_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.0.input_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.7.post_attention_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.6.post_attention_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.6.input_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight
[1,0]<stdout>:	module.language_model.encoder.layers.1.post_attention_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight
[1,0]<stdout>:	module.language_model.encoder.layers.5.input_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.3.input_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.8.post_attention_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight
[1,0]<stdout>:	module.language_model.encoder.layers.8.input_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.3.post_attention_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.0.post_attention_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight
[1,0]<stdout>:	module.language_model.encoder.layers.9.input_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight
[1,0]<stdout>:	module.language_model.encoder.layers.4.post_attention_norm.weight
[1,0]<stdout>:	module.language_model.encoder.layers.2.input_norm.weight
[1,28]<stdout>:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
[1,28]<stdout>:Params for bucket 1 (605724672 elements):
[1,28]<stdout>:	module.language_model.encoder.layers.4.input_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight
[1,28]<stdout>:	module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight
[1,28]<stdout>:	module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight
[1,28]<stdout>:	module.language_model.encoder.layers.0.post_attention_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.6.input_norm.weight
[1,28]<stdout>:	module.language_model.encoder.final_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight
[1,28]<stdout>:	module.language_model.encoder.layers.2.post_attention_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.7.post_attention_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.7.input_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.5.post_attention_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.9.post_attention_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.9.input_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight
[1,28]<stdout>:	module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight
[1,28]<stdout>:	module.language_model.encoder.layers.5.input_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.3.input_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight
[1,28]<stdout>:	module.language_model.encoder.layers.8.post_attention_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.8.input_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.6.post_attention_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.4.post_attention_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.3.post_attention_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.1.post_attention_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.1.input_norm.weight
[1,28]<stdout>:	module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight
[1,28]<stdout>:	module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight
[1,28]<stdout>:	module.language_model.encoder.layers.2.input_norm.weight
[1,9]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,9]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,28]<stdout>:	module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight
[1,28]<stdout>:	module.language_model.encoder.layers.0.input_norm.weight
[1,15]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,15]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,5]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,5]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,0]<stdout>:INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=3e-05, min_lr=3e-06, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x15320ed09cf0>)
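The Adam hyperparameters in the OptimizerConfig above map onto plain PyTorch as follows; this is a hedged, framework-free sketch (`model` is a hypothetical stand-in), not Megatron's distributed-optimizer setup:

    import torch

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-5,             # lr; min_lr=3e-6 is enforced by the cosine scheduler, not Adam
        betas=(0.9, 0.95),   # adam_beta1, adam_beta2
        eps=1e-8,            # adam_eps
        weight_decay=0.1,
    )
    # clip_grad=1.0 corresponds to per-step gradient clipping, e.g.:
    #   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)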
[1,22]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,22]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,27]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,27]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,28]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,28]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,13]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,13]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,18]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,18]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,0]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,26]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,26]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,20]<stdout>:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
[1,20]<stdout>:Params for bucket 1 (605716480 elements):
[1,20]<stdout>:	module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight
[1,20]<stdout>:	module.language_model.encoder.layers.4.post_attention_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.3.input_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.2.post_attention_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight
[1,20]<stdout>:	module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight
[1,20]<stdout>:	module.language_model.encoder.layers.4.input_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight
[1,20]<stdout>:	module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight
[1,20]<stdout>:	module.language_model.encoder.layers.0.post_attention_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight
[1,20]<stdout>:	module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight
[1,20]<stdout>:	module.language_model.encoder.layers.7.post_attention_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.7.input_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.5.post_attention_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.1.input_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.0.input_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.6.post_attention_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.6.input_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.9.post_attention_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.9.input_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight
[1,20]<stdout>:	module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight
[1,20]<stdout>:	module.language_model.encoder.layers.5.input_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.3.post_attention_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.1.post_attention_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.8.post_attention_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight
[1,20]<stdout>:	module.language_model.encoder.layers.8.input_norm.weight
[1,20]<stdout>:	module.language_model.encoder.layers.2.input_norm.weight
[1,20]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,20]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,14]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,14]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,19]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,19]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,16]<stdout>:INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
[1,16]<stdout>:Params for bucket 1 (605716480 elements):
[1,16]<stdout>:	module.language_model.encoder.layers.7.mlp.dense_4h_to_h.weight
[1,16]<stdout>:	module.language_model.encoder.layers.7.input_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.5.post_attention_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.4.input_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.0.post_attention_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.8.input_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.6.mlp.dense_4h_to_h.weight
[1,16]<stdout>:	module.language_model.encoder.layers.3.mlp.dense_4h_to_h.weight
[1,16]<stdout>:	module.language_model.encoder.layers.4.post_attention_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.7.post_attention_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.6.post_attention_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.6.input_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.2.post_attention_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.8.post_attention_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.5.mlp.dense_4h_to_h.weight
[1,16]<stdout>:	module.language_model.encoder.layers.5.input_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.1.mlp.dense_4h_to_h.weight
[1,16]<stdout>:	module.language_model.encoder.layers.0.input_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.3.input_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.9.post_attention_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.2.mlp.dense_4h_to_h.weight
[1,16]<stdout>:	module.language_model.encoder.layers.1.post_attention_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.9.mlp.dense_4h_to_h.weight
[1,16]<stdout>:	module.language_model.encoder.layers.9.input_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.4.mlp.dense_4h_to_h.weight
[1,16]<stdout>:	module.language_model.encoder.layers.2.input_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.1.input_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.8.mlp.dense_4h_to_h.weight
[1,16]<stdout>:	module.language_model.encoder.layers.3.post_attention_norm.weight
[1,16]<stdout>:	module.language_model.encoder.layers.0.mlp.dense_4h_to_h.weight
[1,30]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,30]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,2]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,2]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,16]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,16]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,31]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,31]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,0]<stdout>:> learning rate decay style: cosine
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>:++++++++ weight.size is in RowParallelLinear: torch.Size([8192, 2048])
[1,23]<stdout>:++++++++++++++++++++++++padding is done
[1,23]<stdout>: > number of parameters on (tensor, pipeline) model parallel rank (3, 5): 605716480
[1,23]<stderr>:/usr/local/lib/python3.10/site-packages/apex/optimizers/fused_adam.py:77: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /data/jenkins_workspace/workspace/pytorch@4/torch/csrc/tensor/python_tensor.cpp:83.)
[1,23]<stderr>:  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
[1,0]<stdout>:WARNING: could not find the metadata file ./tmp/qwen1_5_72b/ckpt/latest_checkpointed_iteration.txt 
[1,0]<stdout>:    will not load any checkpoints and will start from random
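A hedged note on the missing tracker: in Megatron-LM this is a one-line text file holding the latest saved iteration number (or the string "release"); without it, as here, no checkpoint is loaded. A hypothetical sketch of its format:

    # Hypothetical contents; the path comes from the warning above.
    with open("./tmp/qwen1_5_72b/ckpt/latest_checkpointed_iteration.txt", "w") as f:
        f.write("1000")  # iteration to resume from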
[1,31]<stdout>:(min, max) time across ranks (ms):
[1,31]<stdout>:    load-checkpoint ................................: (0.58, 0.76)
[1,0]<stdout>:[after model, optimizer, and learning rate scheduler are built] datetime: 2024-09-28 17:10:41 
[1,0]<stdout>:> building train, validation, and test datasets ...
[1,0]<stdout>: > datasets target sizes (minimum size):
[1,0]<stdout>:    train:      6400
[1,0]<stdout>:    validation: 640
[1,0]<stdout>:    test:       640
[1,0]<stdout>:INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.949), (0.949, 0.999), (0.999, 1.0)]
[1,0]<stdout>:> building train, validation, and test datasets for GPT ...
[1,0]<stdout>:INFO:megatron.core.datasets.blended_megatron_dataset_builder:Building dataset splits with cls=GPTDataset, sizes=(6400, 640, 640), and config=GPTDatasetConfig(random_seed=1234, sequence_length=2048, blend=(['./qwen_token/my-qwen_text_document'], None), blend_per_split=[None, None, None], split='949,50,1', split_matrix=[(0, 0.949), (0.949, 0.999), (0.999, 1.0)], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=<megatron.training.tokenizer.tokenizer._Qwen2Tokenizer object at 0x15320ed30610>, reset_position_ids=False, reset_attention_mask=False, eod_mask_loss=False, create_attention_mask=True, drop_last_partial_validation_sequence=True, add_extra_token_to_sequence=True)
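A small worked example of how split='949,50,1' normalizes into the split_matrix logged above (an assumption about the builder's arithmetic, but consistent with the values shown):

    weights = [949, 50, 1]            # from split='949,50,1'
    total = sum(weights)              # 1000
    bounds, acc = [], 0.0
    for w in weights:
        frac = w / total
        bounds.append((round(acc, 3), round(acc + frac, 3)))
        acc += frac
    print(bounds)  # [(0.0, 0.949), (0.949, 0.999), (0.999, 1.0)]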
[1,0]<stdout>:INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ./qwen_token/my-qwen_text_document.idx
[1,0]<stdout>:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
[1,0]<stdout>:INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
[1,0]<stdout>:INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
[1,0]<stdout>:INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 79000
[1,0]<stdout>:INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 79000
[1,0]<stdout>:INFO:megatron.core.datasets.gpt_dataset:Build and save the GPTDataset train indices
[1,0]<stdout>:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 107536
[1,0]<stdout>:INFO:megatron.core.datasets.gpt_dataset:> total number of epochs: 1
[1,0]<stdout>:INFO:megatron.core.datasets.gpt_dataset:Build and save the GPTDataset valid indices
[1,0]<stdout>:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 5734
[1,0]<stdout>:INFO:megatron.core.datasets.gpt_dataset:> total number of epochs: 1
[1,0]<stdout>:INFO:megatron.core.datasets.gpt_dataset:Build and save the GPTDataset test indices
[1,0]<stdout>:INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 743
[1,0]<stdout>:INFO:megatron.core.datasets.gpt_dataset:> total number of epochs: 4
[1,0]<stdout>:> finished creating GPT datasets ...
[1,0]<stdout>:[after dataloaders are built] datetime: 2024-09-28 17:10:41 
[1,0]<stdout>:done with setup ...
[1,0]<stdout>:training ...
[1,31]<stdout>:(min, max) time across ranks (ms):
[1,31]<stdout>:    model-and-optimizer-setup ......................: (524.77, 564.45)
[1,31]<stdout>:    train/valid/test-data-iterators-setup ..........: (49.20, 134.69)
[1,0]<stdout>:[before the start of training step] datetime: 2024-09-28 17:10:41 
[1,6]<stderr>:W0928 17:10:41.720322 21525 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,13]<stderr>:W0928 17:10:41.687623 24844 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,11]<stderr>:W0928 17:10:41.687621 24836 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,10]<stderr>:W0928 17:10:41.687646 24831 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,9]<stderr>:W0928 17:10:41.687633 24825 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,14]<stderr>:W0928 17:10:41.687645 24846 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,15]<stderr>:W0928 17:10:41.687639 24847 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,8]<stderr>:W0928 17:10:41.687639 24819 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,12]<stderr>:W0928 17:10:41.687701 24841 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,4]<stderr>:W0928 17:10:41.720338 21522 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,5]<stderr>:W0928 17:10:41.720358 21524 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,7]<stderr>:W0928 17:10:41.720341 21526 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,21]<stderr>:W0928 17:10:41.638365  6919 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,27]<stderr>:W0928 17:10:41.663214  8451 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,19]<stderr>:W0928 17:10:41.638398  6914 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,30]<stderr>:W0928 17:10:41.663210  8461 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,26]<stderr>:W0928 17:10:41.663236  8446 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,31]<stderr>:W0928 17:10:41.663370  8462 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,24]<stderr>:W0928 17:10:41.663281  8434 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,25]<stderr>:W0928 17:10:41.663278  8440 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,17]<stderr>:W0928 17:10:41.638440  6901 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,18]<stderr>:W0928 17:10:41.638557  6907 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,20]<stderr>:W0928 17:10:41.638422  6917 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,16]<stderr>:W0928 17:10:41.638450  6894 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,23]<stderr>:W0928 17:10:41.638474  6922 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,28]<stderr>:W0928 17:10:41.663606  8456 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,22]<stderr>:W0928 17:10:41.647271  6921 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,29]<stderr>:W0928 17:10:41.672372  8459 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,0]<stderr>:I0928 17:10:42.460181 21506 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,3]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,3]<stderr>:  mm 1.0498 ms 100.0%
[1,3]<stderr>:  triton_mm_7 3.1983 ms 32.8%
[1,3]<stderr>:  triton_mm_3 3.6087 ms 29.1%
[1,3]<stderr>:  triton_mm_5 3.8699 ms 27.1%
[1,3]<stderr>:  triton_mm_8 4.1841 ms 25.1%
[1,3]<stderr>:  triton_mm_4 4.3498 ms 24.1%
[1,3]<stderr>:  triton_mm_6 4.5394 ms 23.1%
[1,3]<stderr>:  triton_mm_2 5.1234 ms 20.5%
[1,3]<stderr>:  triton_mm_1 5.2411 ms 20.0%
[1,3]<stderr>:  triton_mm_0 5.2778 ms 19.9%
[1,3]<stderr>:SingleProcess AUTOTUNE takes 10.2842 seconds
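These AUTOTUNE tables are TorchInductor benchmarking the eager mm kernel against generated Triton candidates and keeping the fastest (here eager mm wins at ~1.05 ms for the 2048x8192 by 8192x2560 case). A hedged sketch of how such tuning is commonly enabled:

    import torch

    # `step_fn` is a hypothetical stand-in for the region being compiled.
    compiled_step = torch.compile(step_fn, mode="max-autotune")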
[1,1]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,1]<stderr>:  mm 1.0295 ms 100.0%
[1,1]<stderr>:  triton_mm_7 3.1563 ms 32.6%
[1,1]<stderr>:  triton_mm_3 3.6237 ms 28.4%
[1,1]<stderr>:  triton_mm_5 3.9042 ms 26.4%
[1,1]<stderr>:  triton_mm_8 4.1048 ms 25.1%
[1,1]<stderr>:  triton_mm_4 4.3397 ms 23.7%
[1,1]<stderr>:  triton_mm_6 4.5323 ms 22.7%
[1,1]<stderr>:  triton_mm_0 5.1038 ms 20.2%
[1,1]<stderr>:  triton_mm_2 5.1453 ms 20.0%
[1,1]<stderr>:  triton_mm_1 5.2684 ms 19.5%
[1,1]<stderr>:SingleProcess AUTOTUNE takes 10.3284 seconds
[1,2]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,2]<stderr>:  mm 1.0287 ms 100.0%
[1,2]<stderr>:  triton_mm_7 3.2200 ms 31.9%
[1,2]<stderr>:  triton_mm_3 3.6764 ms 28.0%
[1,2]<stderr>:  triton_mm_5 3.9014 ms 26.4%
[1,2]<stderr>:  triton_mm_8 4.1172 ms 25.0%
[1,2]<stderr>:  triton_mm_4 4.4075 ms 23.3%
[1,2]<stderr>:  triton_mm_6 4.5329 ms 22.7%
[1,2]<stderr>:  triton_mm_1 5.0831 ms 20.2%
[1,2]<stderr>:  triton_mm_0 5.1285 ms 20.1%
[1,2]<stderr>:  triton_mm_2 5.1481 ms 20.0%
[1,2]<stderr>:SingleProcess AUTOTUNE takes 10.4001 seconds
[1,0]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,0]<stderr>:  mm 1.0242 ms 100.0%
[1,0]<stderr>:  triton_mm_7 3.1608 ms 32.4%
[1,0]<stderr>:  triton_mm_3 3.6354 ms 28.2%
[1,0]<stderr>:  triton_mm_5 3.9250 ms 26.1%
[1,0]<stderr>:  triton_mm_8 4.1619 ms 24.6%
[1,0]<stderr>:  triton_mm_4 4.3849 ms 23.4%
[1,0]<stderr>:  triton_mm_6 4.5466 ms 22.5%
[1,0]<stderr>:  triton_mm_2 5.0089 ms 20.4%
[1,0]<stderr>:  triton_mm_1 5.1330 ms 20.0%
[1,0]<stderr>:  triton_mm_0 5.1458 ms 19.9%
[1,0]<stderr>:SingleProcess AUTOTUNE takes 10.4098 seconds
[1,3]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,3]<stderr>:  mm 0.8634 ms 100.0%
[1,3]<stderr>:  triton_mm_14 1.5196 ms 56.8%
[1,3]<stderr>:  triton_mm_18 1.5497 ms 55.7%
[1,3]<stderr>:  triton_mm_19 1.5765 ms 54.8%
[1,3]<stderr>:  triton_mm_15 1.8716 ms 46.1%
[1,3]<stderr>:  triton_mm_11 2.0062 ms 43.0%
[1,3]<stderr>:  triton_mm_12 2.0176 ms 42.8%
[1,3]<stderr>:  triton_mm_13 2.0432 ms 42.3%
[1,3]<stderr>:  triton_mm_16 2.5107 ms 34.4%
[1,3]<stderr>:  triton_mm_17 3.2397 ms 26.7%
[1,3]<stderr>:SingleProcess AUTOTUNE takes 1.7922 seconds
[1,0]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,0]<stderr>:  mm 0.8564 ms 100.0%
[1,0]<stderr>:  triton_mm_14 1.4993 ms 57.1%
[1,0]<stderr>:  triton_mm_18 1.5417 ms 55.5%
[1,0]<stderr>:  triton_mm_19 1.5793 ms 54.2%
[1,0]<stderr>:  triton_mm_15 1.8664 ms 45.9%
[1,0]<stderr>:  triton_mm_12 2.0045 ms 42.7%
[1,0]<stderr>:  triton_mm_11 2.0251 ms 42.3%
[1,0]<stderr>:  triton_mm_13 2.0588 ms 41.6%
[1,0]<stderr>:  triton_mm_16 2.4971 ms 34.3%
[1,0]<stderr>:  triton_mm_17 3.2688 ms 26.2%
[1,0]<stderr>:SingleProcess AUTOTUNE takes 1.7752 seconds
[1,1]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,1]<stderr>:  mm 0.8574 ms 100.0%
[1,1]<stderr>:  triton_mm_14 1.5078 ms 56.9%
[1,1]<stderr>:  triton_mm_18 1.5438 ms 55.5%
[1,1]<stderr>:  triton_mm_19 1.5790 ms 54.3%
[1,1]<stderr>:  triton_mm_15 1.8748 ms 45.7%
[1,1]<stderr>:  triton_mm_12 2.0099 ms 42.7%
[1,1]<stderr>:  triton_mm_11 2.0266 ms 42.3%
[1,1]<stderr>:  triton_mm_13 2.0479 ms 41.9%
[1,1]<stderr>:  triton_mm_16 2.5000 ms 34.3%
[1,1]<stderr>:  triton_mm_17 3.2497 ms 26.4%
[1,1]<stderr>:SingleProcess AUTOTUNE takes 1.7940 seconds
[1,2]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,2]<stderr>:  mm 0.8571 ms 100.0%
[1,2]<stderr>:  triton_mm_14 1.5051 ms 56.9%
[1,2]<stderr>:  triton_mm_18 1.5540 ms 55.2%
[1,2]<stderr>:  triton_mm_19 1.5806 ms 54.2%
[1,2]<stderr>:  triton_mm_15 1.8697 ms 45.8%
[1,2]<stderr>:  triton_mm_12 2.0208 ms 42.4%
[1,2]<stderr>:  triton_mm_11 2.0212 ms 42.4%
[1,2]<stderr>:  triton_mm_13 2.0525 ms 41.8%
[1,2]<stderr>:  triton_mm_16 2.5044 ms 34.2%
[1,2]<stderr>:  triton_mm_17 3.2073 ms 26.7%
[1,2]<stderr>:SingleProcess AUTOTUNE takes 1.7819 seconds
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:01,910] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:02,004] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:02,069] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] due to: 
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] 
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:02,110] torch._dynamo.convert_frame: [WARNING] 
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:02,123] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] due to: 
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] 
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:02,203] torch._dynamo.convert_frame: [WARNING] 
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] due to: 
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] 
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:02,268] torch._dynamo.convert_frame: [WARNING] 
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] due to: 
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] 
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:02,323] torch._dynamo.convert_frame: [WARNING] 
[1,3]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,3]<stderr>:  mm 5.1855 ms 100.0%
[1,3]<stderr>:  triton_mm_29 18.1666 ms 28.5%
[1,3]<stderr>:  triton_mm_30 19.0999 ms 27.1%
[1,3]<stderr>:  triton_mm_26 19.8113 ms 26.2%
[1,3]<stderr>:  triton_mm_27 21.9293 ms 23.6%
[1,3]<stderr>:  triton_mm_24 23.4876 ms 22.1%
[1,3]<stderr>:  triton_mm_28 25.4800 ms 20.4%
[1,3]<stderr>:  triton_mm_25 27.2083 ms 19.1%
[1,3]<stderr>:  triton_mm_22 29.6523 ms 17.5%
[1,3]<stderr>:  triton_mm_23 33.2988 ms 15.6%
[1,3]<stderr>:SingleProcess AUTOTUNE takes 9.1258 seconds
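
Note: each AUTOTUNE block is TorchInductor benchmarking candidate kernels for one matmul shape. Column 1 is the candidate, column 2 its measured time, and the percentage is the fastest time divided by that row's time (100.0% marks the winner; e.g. 5.1855/18.1666 ≈ 28.5%). Here the cuBLAS `mm` beats every Triton template at ~5.1 ms, so Inductor caches it. This search runs because max-autotune-style tuning appears to be enabled for this job; a minimal sketch of how such a table gets produced (shapes copied from the log, dtype/device assumed):

    import torch

    # With mode="max-autotune", Inductor benchmarks Triton mm templates
    # against the cuBLAS kernel on first call and caches the winner.
    a = torch.randn(2048, 8192, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(8192, 14784, device="cuda", dtype=torch.bfloat16)

    @torch.compile(mode="max-autotune")
    def proj(x, w):
        return x @ w

    out = proj(a, b)   # first call triggers the AUTOTUNE benchmark pass
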
[1,0]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,0]<stderr>:  mm 5.0763 ms 100.0%
[1,0]<stderr>:  triton_mm_29 17.8295 ms 28.5%
[1,0]<stderr>:  triton_mm_30 19.3175 ms 26.3%
[1,0]<stderr>:  triton_mm_26 20.0796 ms 25.3%
[1,0]<stderr>:  triton_mm_27 22.0445 ms 23.0%
[1,0]<stderr>:  triton_mm_24 22.6142 ms 22.4%
[1,0]<stderr>:  triton_mm_28 24.9178 ms 20.4%
[1,0]<stderr>:  triton_mm_25 26.3039 ms 19.3%
[1,0]<stderr>:  triton_mm_22 30.4394 ms 16.7%
[1,0]<stderr>:  triton_mm_23 34.1525 ms 14.9%
[1,0]<stderr>:SingleProcess AUTOTUNE takes 9.0679 seconds
[1,1]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,1]<stderr>:  mm 5.0566 ms 100.0%
[1,1]<stderr>:  triton_mm_29 17.3261 ms 29.2%
[1,1]<stderr>:  triton_mm_26 19.1989 ms 26.3%
[1,1]<stderr>:  triton_mm_30 19.3449 ms 26.1%
[1,1]<stderr>:  triton_mm_27 21.8966 ms 23.1%
[1,1]<stderr>:  triton_mm_24 22.5473 ms 22.4%
[1,1]<stderr>:  triton_mm_25 23.7000 ms 21.3%
[1,1]<stderr>:  triton_mm_28 24.8784 ms 20.3%
[1,1]<stderr>:  triton_mm_22 32.4732 ms 15.6%
[1,1]<stderr>:  triton_mm_31 33.6202 ms 15.0%
[1,1]<stderr>:SingleProcess AUTOTUNE takes 9.2076 seconds
[1,2]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,2]<stderr>:  mm 5.0838 ms 100.0%
[1,2]<stderr>:  triton_mm_29 17.7675 ms 28.6%
[1,2]<stderr>:  triton_mm_30 18.3269 ms 27.7%
[1,2]<stderr>:  triton_mm_26 19.8405 ms 25.6%
[1,2]<stderr>:  triton_mm_27 22.0307 ms 23.1%
[1,2]<stderr>:  triton_mm_24 23.4502 ms 21.7%
[1,2]<stderr>:  triton_mm_28 24.8083 ms 20.5%
[1,2]<stderr>:  triton_mm_25 26.1370 ms 19.5%
[1,2]<stderr>:  triton_mm_22 30.5937 ms 16.6%
[1,2]<stderr>:  triton_mm_23 30.9062 ms 16.4%
[1,2]<stderr>:SingleProcess AUTOTUNE takes 9.1920 seconds
[1,3]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,3]<stderr>:  torch.has_cuda,
[1,3]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,3]<stderr>:  torch.has_cudnn,
[1,3]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,3]<stderr>:  torch.has_mps,
[1,3]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,3]<stderr>:  torch.has_mkldnn,
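
Note: these UserWarnings are raised from torch's own overrides.py (which enumerates the deprecated module-level flags during compilation), not from Megatron code touching them directly. The replacements the warnings name, side by side:

    import torch

    # Documented replacements for the deprecated flags warned about above.
    torch.backends.cuda.is_built()        # instead of torch.has_cuda
    torch.backends.cudnn.is_available()   # instead of torch.has_cudnn
    torch.backends.mps.is_built()         # instead of torch.has_mps
    torch.backends.mkldnn.is_available()  # instead of torch.has_mkldnn
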
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,0]<stderr>:  torch.has_cuda,
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,0]<stderr>:  torch.has_cudnn,
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,0]<stderr>:  torch.has_mps,
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,0]<stderr>:  torch.has_mkldnn,
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,1]<stderr>:  torch.has_cuda,
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,1]<stderr>:  torch.has_cudnn,
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,1]<stderr>:  torch.has_mps,
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,1]<stderr>:  torch.has_mkldnn,
[1,2]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,2]<stderr>:  torch.has_cuda,
[1,2]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,2]<stderr>:  torch.has_cudnn,
[1,2]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,2]<stderr>:  torch.has_mps,
[1,2]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,2]<stderr>:  torch.has_mkldnn,
[1,3]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,3]<stderr>:  mm 3.3452 ms 100.0%
[1,3]<stderr>:  triton_mm_40 3.8633 ms 86.6%
[1,3]<stderr>:  triton_mm_36 5.0951 ms 65.7%
[1,3]<stderr>:  triton_mm_37 5.5629 ms 60.1%
[1,3]<stderr>:  triton_mm_41 7.2336 ms 46.2%
[1,3]<stderr>:  triton_mm_38 10.0755 ms 33.2%
[1,3]<stderr>:  triton_mm_35 11.6711 ms 28.7%
[1,3]<stderr>:  triton_mm_34 12.3092 ms 27.2%
[1,3]<stderr>:  triton_mm_33 15.2904 ms 21.9%
[1,3]<stderr>:  triton_mm_39 15.7015 ms 21.3%
[1,3]<stderr>:SingleProcess AUTOTUNE takes 2.4936 seconds
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:17,263] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:17,263] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:17,283] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,3]<stderr>:[rank3]:[2024-09-28 17:11:17,283] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
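
Note: speculate_subgraph reports that the forward of a user-defined autograd.Function (trampoline_autograd_fwd is Dynamo's internal wrapper around it) could not be traced into a single graph; the [ERROR] line shows why — a ProcessGroup is among the call arguments. Dynamo then falls back to eager for frames 18/0 and 19/0, which costs a graph break but not correctness. A hedged sketch of the pattern that produces this (class and wrapper names are illustrative, not Megatron's actual code):

    import torch
    import torch.distributed as dist

    class _AllReduce(torch.autograd.Function):      # illustrative name
        @staticmethod
        def forward(ctx, x, group):                 # ProcessGroup arg defeats tracing
            dist.all_reduce(x, group=group)
            return x

        @staticmethod
        def backward(ctx, grad):
            return grad, None                       # no grad for the group arg

    # If the eager fallback is acceptable but the warning noise is not,
    # the call can be wrapped so Dynamo never attempts to trace it:
    all_reduce_eager = torch._dynamo.disable(_AllReduce.apply)
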
[1,0]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,0]<stderr>:  mm 3.3581 ms 100.0%
[1,0]<stderr>:  triton_mm_40 3.8333 ms 87.6%
[1,0]<stderr>:  triton_mm_36 5.0018 ms 67.1%
[1,0]<stderr>:  triton_mm_41 5.6386 ms 59.6%
[1,0]<stderr>:  triton_mm_37 6.0385 ms 55.6%
[1,0]<stderr>:  triton_mm_38 10.0579 ms 33.4%
[1,0]<stderr>:  triton_mm_35 11.8332 ms 28.4%
[1,0]<stderr>:  triton_mm_34 14.9400 ms 22.5%
[1,0]<stderr>:  triton_mm_39 15.9981 ms 21.0%
[1,0]<stderr>:  triton_mm_33 16.1442 ms 20.8%
[1,0]<stderr>:SingleProcess AUTOTUNE takes 2.5098 seconds
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:17,451] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:17,451] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:17,471] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,0]<stderr>:[rank0]:[2024-09-28 17:11:17,471] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,2]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,2]<stderr>:  mm 3.2400 ms 100.0%
[1,2]<stderr>:  triton_mm_40 3.8364 ms 84.5%
[1,2]<stderr>:  triton_mm_36 5.0255 ms 64.5%
[1,2]<stderr>:  triton_mm_37 5.6424 ms 57.4%
[1,2]<stderr>:  triton_mm_41 7.0218 ms 46.1%
[1,2]<stderr>:  triton_mm_38 10.1663 ms 31.9%
[1,2]<stderr>:  triton_mm_35 11.5364 ms 28.1%
[1,2]<stderr>:  triton_mm_34 13.3459 ms 24.3%
[1,2]<stderr>:  triton_mm_33 14.6459 ms 22.1%
[1,2]<stderr>:  triton_mm_39 15.8142 ms 20.5%
[1,2]<stderr>:SingleProcess AUTOTUNE takes 2.5147 seconds
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:17,614] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:17,614] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:17,634] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,2]<stderr>:[rank2]:[2024-09-28 17:11:17,634] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,1]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,1]<stderr>:  mm 3.3703 ms 100.0%
[1,1]<stderr>:  triton_mm_40 3.7507 ms 89.9%
[1,1]<stderr>:  triton_mm_36 5.0206 ms 67.1%
[1,1]<stderr>:  triton_mm_37 5.7857 ms 58.3%
[1,1]<stderr>:  triton_mm_41 6.2546 ms 53.9%
[1,1]<stderr>:  triton_mm_38 10.2652 ms 32.8%
[1,1]<stderr>:  triton_mm_35 11.7461 ms 28.7%
[1,1]<stderr>:  triton_mm_34 13.1324 ms 25.7%
[1,1]<stderr>:  triton_mm_39 14.8818 ms 22.6%
[1,1]<stderr>:  triton_mm_33 16.2534 ms 20.7%
[1,1]<stderr>:SingleProcess AUTOTUNE takes 2.5369 seconds
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:17,704] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:17,704] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:17,724] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,1]<stderr>:[rank1]:[2024-09-28 17:11:17,724] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,3]<stderr>:W0928 17:11:24.389076 21518 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,1]<stderr>:W0928 17:11:24.391553 21509 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,0]<stderr>:W0928 17:11:24.393659 21506 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,2]<stderr>:W0928 17:11:24.399269 21514 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
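
Note: this ProcessGroupNCCL warning says the NCCL_AVOID_RECORD_STREAMS=1 setting in the job's environment is honored for collectives but has no effect on point-to-point send/recv (the ops pipeline parallelism uses). The stray "0" before the variable name appears to come from PyTorch's own warning formatting, not from log corruption. For reference, the knob is an environment variable read at process-group setup (sketch; it must be in the environment before torch.distributed initializes, e.g. set by the launcher):

    import os

    # Read when ProcessGroupNCCL is created; per the warning above it
    # applies to collectives only, not to P2P operations.
    os.environ["NCCL_AVOID_RECORD_STREAMS"] = "1"
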
[1,3]<stderr>:I0928 17:11:25.779922 21518 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,0]<stderr>:I0928 17:11:26.047698 21506 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,2]<stderr>:I0928 17:11:26.222893 21514 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,1]<stderr>:I0928 17:11:26.331490 21509 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,4]<stderr>:I0928 17:11:32.974344 21522 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
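
Note: the I-level lines record that NCCL_DEBUG is unset ("N/A"), so NCCL itself emits no diagnostics in this run. If NCCL init/transport logs are wanted when debugging communication issues, the usual environment knobs (exported before launch) are:

    import os

    # Typical NCCL_DEBUG values: VERSION, WARN, INFO, TRACE;
    # INFO prints ring/tree and transport setup at init time.
    os.environ["NCCL_DEBUG"] = "INFO"
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"   # optional subsystem filter
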
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:36,690] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:36,697] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:36,704] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] due to: 
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] 
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:36,886] torch._dynamo.convert_frame: [WARNING] 
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] due to: 
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] 
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:36,894] torch._dynamo.convert_frame: [WARNING] 
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] due to: 
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] 
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:36,900] torch._dynamo.convert_frame: [WARNING] 
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:36,930] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] due to: 
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] 
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:37,126] torch._dynamo.convert_frame: [WARNING] 
[1,5]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,5]<stderr>:  torch.has_cuda,
[1,5]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,5]<stderr>:  torch.has_cudnn,
[1,5]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,5]<stderr>:  torch.has_mps,
[1,5]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,5]<stderr>:  torch.has_mkldnn,
[1,4]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,4]<stderr>:  torch.has_cuda,
[1,4]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,4]<stderr>:  torch.has_cudnn,
[1,4]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,4]<stderr>:  torch.has_mps,
[1,4]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,4]<stderr>:  torch.has_mkldnn,
[1,6]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,6]<stderr>:  torch.has_cuda,
[1,6]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,6]<stderr>:  torch.has_cudnn,
[1,6]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,6]<stderr>:  torch.has_mps,
[1,6]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,6]<stderr>:  torch.has_mkldnn,
[1,7]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,7]<stderr>:  torch.has_cuda,
[1,7]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,7]<stderr>:  torch.has_cudnn,
[1,7]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,7]<stderr>:  torch.has_mps,
[1,7]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,7]<stderr>:  torch.has_mkldnn,
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:40,537] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:40,537] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:40,558] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,4]<stderr>:[rank4]:[2024-09-28 17:11:40,558] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:40,588] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:40,588] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:40,609] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,5]<stderr>:[rank5]:[2024-09-28 17:11:40,609] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:40,639] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:40,639] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:40,659] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,6]<stderr>:[rank6]:[2024-09-28 17:11:40,659] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:40,690] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:40,690] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:40,711] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,7]<stderr>:[rank7]:[2024-09-28 17:11:40,711] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,4]<stderr>:W0928 17:11:47.335281 21522 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,7]<stderr>:W0928 17:11:47.335413 21526 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,6]<stderr>:W0928 17:11:47.341578 21525 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,5]<stderr>:W0928 17:11:47.341688 21524 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,8]<stderr>:I0928 17:11:54.878262 24819 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,10]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,10]<stderr>:  mm 1.0378 ms 100.0%
[1,10]<stderr>:  triton_mm_7 3.2085 ms 32.3%
[1,10]<stderr>:  triton_mm_3 3.7506 ms 27.7%
[1,10]<stderr>:  triton_mm_5 3.8942 ms 26.6%
[1,10]<stderr>:  triton_mm_8 4.3069 ms 24.1%
[1,10]<stderr>:  triton_mm_4 4.3168 ms 24.0%
[1,10]<stderr>:  triton_mm_6 4.5318 ms 22.9%
[1,10]<stderr>:  triton_mm_0 5.1230 ms 20.3%
[1,10]<stderr>:  triton_mm_1 5.2015 ms 20.0%
[1,10]<stderr>:  triton_mm_2 5.2401 ms 19.8%
[1,10]<stderr>:SingleProcess AUTOTUNE takes 10.3330 seconds
[1,8]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,8]<stderr>:  mm 1.0301 ms 100.0%
[1,8]<stderr>:  triton_mm_7 3.2140 ms 32.0%
[1,8]<stderr>:  triton_mm_3 3.7502 ms 27.5%
[1,8]<stderr>:  triton_mm_5 3.8955 ms 26.4%
[1,8]<stderr>:  triton_mm_8 4.1798 ms 24.6%
[1,8]<stderr>:  triton_mm_4 4.3744 ms 23.5%
[1,8]<stderr>:  triton_mm_6 4.5068 ms 22.9%
[1,8]<stderr>:  triton_mm_0 5.1285 ms 20.1%
[1,8]<stderr>:  triton_mm_2 5.2497 ms 19.6%
[1,8]<stderr>:  triton_mm_1 5.3827 ms 19.1%
[1,8]<stderr>:SingleProcess AUTOTUNE takes 10.4021 seconds
[1,11]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,11]<stderr>:  mm 1.0266 ms 100.0%
[1,11]<stderr>:  triton_mm_7 3.2032 ms 32.0%
[1,11]<stderr>:  triton_mm_3 3.6907 ms 27.8%
[1,11]<stderr>:  triton_mm_5 3.8802 ms 26.5%
[1,11]<stderr>:  triton_mm_8 4.0726 ms 25.2%
[1,11]<stderr>:  triton_mm_4 4.3446 ms 23.6%
[1,11]<stderr>:  triton_mm_6 4.5232 ms 22.7%
[1,11]<stderr>:  triton_mm_0 4.9972 ms 20.5%
[1,11]<stderr>:  triton_mm_2 5.1340 ms 20.0%
[1,11]<stderr>:  triton_mm_1 5.1699 ms 19.9%
[1,11]<stderr>:SingleProcess AUTOTUNE takes 10.5661 seconds
[1,9]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,9]<stderr>:  mm 1.0400 ms 100.0%
[1,9]<stderr>:  triton_mm_7 3.1960 ms 32.5%
[1,9]<stderr>:  triton_mm_3 3.6162 ms 28.8%
[1,9]<stderr>:  triton_mm_5 3.8928 ms 26.7%
[1,9]<stderr>:  triton_mm_8 4.0675 ms 25.6%
[1,9]<stderr>:  triton_mm_4 4.3381 ms 24.0%
[1,9]<stderr>:  triton_mm_6 4.5146 ms 23.0%
[1,9]<stderr>:  triton_mm_1 5.0942 ms 20.4%
[1,9]<stderr>:  triton_mm_0 5.1172 ms 20.3%
[1,9]<stderr>:  triton_mm_2 5.1608 ms 20.2%
[1,9]<stderr>:SingleProcess AUTOTUNE takes 10.6917 seconds
[1,10]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,10]<stderr>:  mm 0.8578 ms 100.0%
[1,10]<stderr>:  triton_mm_14 1.4940 ms 57.4%
[1,10]<stderr>:  triton_mm_18 1.5448 ms 55.5%
[1,10]<stderr>:  triton_mm_19 1.5689 ms 54.7%
[1,10]<stderr>:  triton_mm_15 1.8642 ms 46.0%
[1,10]<stderr>:  triton_mm_11 1.9785 ms 43.4%
[1,10]<stderr>:  triton_mm_12 2.0190 ms 42.5%
[1,10]<stderr>:  triton_mm_13 2.0635 ms 41.6%
[1,10]<stderr>:  triton_mm_16 2.4162 ms 35.5%
[1,10]<stderr>:  triton_mm_17 3.2118 ms 26.7%
[1,10]<stderr>:SingleProcess AUTOTUNE takes 1.7698 seconds
[1,8]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,8]<stderr>:  mm 0.8600 ms 100.0%
[1,8]<stderr>:  triton_mm_14 1.5070 ms 57.1%
[1,8]<stderr>:  triton_mm_18 1.5498 ms 55.5%
[1,8]<stderr>:  triton_mm_19 1.5738 ms 54.6%
[1,8]<stderr>:  triton_mm_15 1.8619 ms 46.2%
[1,8]<stderr>:  triton_mm_11 1.9930 ms 43.2%
[1,8]<stderr>:  triton_mm_12 2.0232 ms 42.5%
[1,8]<stderr>:  triton_mm_13 2.0418 ms 42.1%
[1,8]<stderr>:  triton_mm_16 2.4278 ms 35.4%
[1,8]<stderr>:  triton_mm_17 3.2021 ms 26.9%
[1,8]<stderr>:SingleProcess AUTOTUNE takes 1.7889 seconds
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:07,727] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,11]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,11]<stderr>:  mm 0.8608 ms 100.0%
[1,11]<stderr>:  triton_mm_14 1.4989 ms 57.4%
[1,11]<stderr>:  triton_mm_18 1.5522 ms 55.5%
[1,11]<stderr>:  triton_mm_19 1.5772 ms 54.6%
[1,11]<stderr>:  triton_mm_15 1.8843 ms 45.7%
[1,11]<stderr>:  triton_mm_11 1.9591 ms 43.9%
[1,11]<stderr>:  triton_mm_12 2.0316 ms 42.4%
[1,11]<stderr>:  triton_mm_13 2.0363 ms 42.3%
[1,11]<stderr>:  triton_mm_16 2.4147 ms 35.6%
[1,11]<stderr>:  triton_mm_17 3.1298 ms 27.5%
[1,11]<stderr>:SingleProcess AUTOTUNE takes 1.7861 seconds
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:07,863] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] due to: 
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] 
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:07,923] torch._dynamo.convert_frame: [WARNING] 
[1,9]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,9]<stderr>:  mm 0.8618 ms 100.0%
[1,9]<stderr>:  triton_mm_14 1.5059 ms 57.2%
[1,9]<stderr>:  triton_mm_18 1.5520 ms 55.5%
[1,9]<stderr>:  triton_mm_19 1.5754 ms 54.7%
[1,9]<stderr>:  triton_mm_15 1.8899 ms 45.6%
[1,9]<stderr>:  triton_mm_11 1.9677 ms 43.8%
[1,9]<stderr>:  triton_mm_13 2.0334 ms 42.4%
[1,9]<stderr>:  triton_mm_12 2.0339 ms 42.4%
[1,9]<stderr>:  triton_mm_16 2.4330 ms 35.4%
[1,9]<stderr>:  triton_mm_17 3.1490 ms 27.4%
[1,9]<stderr>:SingleProcess AUTOTUNE takes 1.7963 seconds
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:08,024] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] due to: 
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] 
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:08,064] torch._dynamo.convert_frame: [WARNING] 
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:08,142] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] due to: 
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] 
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:08,238] torch._dynamo.convert_frame: [WARNING] 
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] due to: 
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] 
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:08,342] torch._dynamo.convert_frame: [WARNING] 
[1,10]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,10]<stderr>:  mm 5.1471 ms 100.0%
[1,10]<stderr>:  triton_mm_29 17.8129 ms 28.9%
[1,10]<stderr>:  triton_mm_30 19.0960 ms 27.0%
[1,10]<stderr>:  triton_mm_26 19.7803 ms 26.0%
[1,10]<stderr>:  triton_mm_27 22.0302 ms 23.4%
[1,10]<stderr>:  triton_mm_24 22.9596 ms 22.4%
[1,10]<stderr>:  triton_mm_25 23.9913 ms 21.5%
[1,10]<stderr>:  triton_mm_28 25.4541 ms 20.2%
[1,10]<stderr>:  triton_mm_22 30.2564 ms 17.0%
[1,10]<stderr>:  triton_mm_23 32.3434 ms 15.9%
[1,10]<stderr>:SingleProcess AUTOTUNE takes 9.3133 seconds
[1,8]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,8]<stderr>:  mm 5.0688 ms 100.0%
[1,8]<stderr>:  triton_mm_29 18.3441 ms 27.6%
[1,8]<stderr>:  triton_mm_30 19.3977 ms 26.1%
[1,8]<stderr>:  triton_mm_26 20.3070 ms 25.0%
[1,8]<stderr>:  triton_mm_27 21.7696 ms 23.3%
[1,8]<stderr>:  triton_mm_24 23.4632 ms 21.6%
[1,8]<stderr>:  triton_mm_25 24.5947 ms 20.6%
[1,8]<stderr>:  triton_mm_28 25.2464 ms 20.1%
[1,8]<stderr>:  triton_mm_22 30.4956 ms 16.6%
[1,8]<stderr>:  triton_mm_31 32.0996 ms 15.8%
[1,8]<stderr>:SingleProcess AUTOTUNE takes 9.2404 seconds
[1,11]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,11]<stderr>:  mm 5.0896 ms 100.0%
[1,11]<stderr>:  triton_mm_29 17.6160 ms 28.9%
[1,11]<stderr>:  triton_mm_30 19.1918 ms 26.5%
[1,11]<stderr>:  triton_mm_26 19.5337 ms 26.1%
[1,11]<stderr>:  triton_mm_27 21.7936 ms 23.4%
[1,11]<stderr>:  triton_mm_24 23.1823 ms 22.0%
[1,11]<stderr>:  triton_mm_28 24.8120 ms 20.5%
[1,11]<stderr>:  triton_mm_25 25.1007 ms 20.3%
[1,11]<stderr>:  triton_mm_22 30.2913 ms 16.8%
[1,11]<stderr>:  triton_mm_23 32.0214 ms 15.9%
[1,11]<stderr>:SingleProcess AUTOTUNE takes 9.2340 seconds
[1,9]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,9]<stderr>:  mm 5.1352 ms 100.0%
[1,9]<stderr>:  triton_mm_29 18.1032 ms 28.4%
[1,9]<stderr>:  triton_mm_30 19.3904 ms 26.5%
[1,9]<stderr>:  triton_mm_26 19.7254 ms 26.0%
[1,9]<stderr>:  triton_mm_27 21.9104 ms 23.4%
[1,9]<stderr>:  triton_mm_24 22.1791 ms 23.2%
[1,9]<stderr>:  triton_mm_25 23.6411 ms 21.7%
[1,9]<stderr>:  triton_mm_28 24.8541 ms 20.7%
[1,9]<stderr>:  triton_mm_22 31.0699 ms 16.5%
[1,9]<stderr>:  triton_mm_31 32.6123 ms 15.7%
[1,9]<stderr>:SingleProcess AUTOTUNE takes 9.1810 seconds
[1,10]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,10]<stderr>:  torch.has_cuda,
[1,10]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,10]<stderr>:  torch.has_cudnn,
[1,10]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,10]<stderr>:  torch.has_mps,
[1,10]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,10]<stderr>:  torch.has_mkldnn,
[1,8]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,8]<stderr>:  torch.has_cuda,
[1,8]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,8]<stderr>:  torch.has_cudnn,
[1,8]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,8]<stderr>:  torch.has_mps,
[1,8]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,8]<stderr>:  torch.has_mkldnn,
[1,11]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,11]<stderr>:  torch.has_cuda,
[1,11]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,11]<stderr>:  torch.has_cudnn,
[1,11]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,11]<stderr>:  torch.has_mps,
[1,11]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,11]<stderr>:  torch.has_mkldnn,
[1,9]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,9]<stderr>:  torch.has_cuda,
[1,9]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,9]<stderr>:  torch.has_cudnn,
[1,9]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,9]<stderr>:  torch.has_mps,
[1,9]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,9]<stderr>:  torch.has_mkldnn,
[1,10]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,10]<stderr>:  mm 3.3741 ms 100.0%
[1,10]<stderr>:  triton_mm_40 3.8266 ms 88.2%
[1,10]<stderr>:  triton_mm_36 5.0447 ms 66.9%
[1,10]<stderr>:  triton_mm_41 5.1620 ms 65.4%
[1,10]<stderr>:  triton_mm_37 5.5195 ms 61.1%
[1,10]<stderr>:  triton_mm_38 10.1286 ms 33.3%
[1,10]<stderr>:  triton_mm_35 11.8828 ms 28.4%
[1,10]<stderr>:  triton_mm_34 14.0384 ms 24.0%
[1,10]<stderr>:  triton_mm_33 15.4860 ms 21.8%
[1,10]<stderr>:  triton_mm_39 16.2022 ms 20.8%
[1,10]<stderr>:SingleProcess AUTOTUNE takes 2.5020 seconds
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:23,286] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:23,286] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:23,306] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,10]<stderr>:[rank10]:[2024-09-28 17:12:23,306] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,8]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,8]<stderr>:  mm 3.3576 ms 100.0%
[1,8]<stderr>:  triton_mm_40 3.8105 ms 88.1%
[1,8]<stderr>:  triton_mm_36 5.0969 ms 65.9%
[1,8]<stderr>:  triton_mm_37 5.6791 ms 59.1%
[1,8]<stderr>:  triton_mm_41 6.7754 ms 49.6%
[1,8]<stderr>:  triton_mm_38 10.1430 ms 33.1%
[1,8]<stderr>:  triton_mm_35 11.7184 ms 28.7%
[1,8]<stderr>:  triton_mm_34 14.0096 ms 24.0%
[1,8]<stderr>:  triton_mm_33 15.0187 ms 22.4%
[1,8]<stderr>:  triton_mm_39 15.2330 ms 22.0%
[1,8]<stderr>:SingleProcess AUTOTUNE takes 2.5036 seconds
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:23,456] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:23,456] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:23,476] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,8]<stderr>:[rank8]:[2024-09-28 17:12:23,476] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,9]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,9]<stderr>:  mm 3.3615 ms 100.0%
[1,9]<stderr>:  triton_mm_40 3.8418 ms 87.5%
[1,9]<stderr>:  triton_mm_36 5.0737 ms 66.3%
[1,9]<stderr>:  triton_mm_41 5.1965 ms 64.7%
[1,9]<stderr>:  triton_mm_37 5.3039 ms 63.4%
[1,9]<stderr>:  triton_mm_38 10.3903 ms 32.4%
[1,9]<stderr>:  triton_mm_35 11.8389 ms 28.4%
[1,9]<stderr>:  triton_mm_34 12.8541 ms 26.2%
[1,9]<stderr>:  triton_mm_39 15.9118 ms 21.1%
[1,9]<stderr>:  triton_mm_33 16.2588 ms 20.7%
[1,9]<stderr>:SingleProcess AUTOTUNE takes 2.5068 seconds
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:23,668] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:23,668] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:23,688] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,9]<stderr>:[rank9]:[2024-09-28 17:12:23,688] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,11]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,11]<stderr>:  mm 3.3541 ms 100.0%
[1,11]<stderr>:  triton_mm_40 3.8623 ms 86.8%
[1,11]<stderr>:  triton_mm_41 4.9458 ms 67.8%
[1,11]<stderr>:  triton_mm_36 5.0301 ms 66.7%
[1,11]<stderr>:  triton_mm_37 5.7869 ms 58.0%
[1,11]<stderr>:  triton_mm_38 10.0409 ms 33.4%
[1,11]<stderr>:  triton_mm_35 11.1122 ms 30.2%
[1,11]<stderr>:  triton_mm_34 13.1232 ms 25.6%
[1,11]<stderr>:  triton_mm_33 15.6761 ms 21.4%
[1,11]<stderr>:  triton_mm_39 15.7778 ms 21.3%
[1,11]<stderr>:SingleProcess AUTOTUNE takes 2.5044 seconds
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:23,797] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:23,797] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:23,818] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,11]<stderr>:[rank11]:[2024-09-28 17:12:23,818] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,10]<stderr>:W0928 17:12:30.706014 24831 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,8]<stderr>:W0928 17:12:30.706085 24819 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,9]<stderr>:W0928 17:12:30.706105 24825 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,11]<stderr>:W0928 17:12:30.706283 24836 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,2]<stderr>:W0928 17:12:31.165109 21514 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,3]<stderr>:W0928 17:12:31.165242 21518 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,0]<stderr>:W0928 17:12:31.165334 21506 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,1]<stderr>:W0928 17:12:31.165256 21509 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,12]<stderr>:I0928 17:12:37.894549 24841 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:41,570] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:41,648] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:41,701] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:41,706] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] due to: 
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] 
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:41,764] torch._dynamo.convert_frame: [WARNING] 
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] due to: 
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] 
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:41,844] torch._dynamo.convert_frame: [WARNING] 
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] due to: 
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] 
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:41,898] torch._dynamo.convert_frame: [WARNING] 
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] due to: 
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] 
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:41,901] torch._dynamo.convert_frame: [WARNING] 
[1,12]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,12]<stderr>:  torch.has_cuda,
[1,12]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,12]<stderr>:  torch.has_cudnn,
[1,12]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,12]<stderr>:  torch.has_mps,
[1,12]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,12]<stderr>:  torch.has_mkldnn,
[1,15]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,15]<stderr>:  torch.has_cuda,
[1,15]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,15]<stderr>:  torch.has_cudnn,
[1,15]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,15]<stderr>:  torch.has_mps,
[1,15]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,15]<stderr>:  torch.has_mkldnn,
[1,13]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,13]<stderr>:  torch.has_cuda,
[1,13]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,13]<stderr>:  torch.has_cudnn,
[1,13]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,13]<stderr>:  torch.has_mps,
[1,13]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,13]<stderr>:  torch.has_mkldnn,
[1,14]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,14]<stderr>:  torch.has_cuda,
[1,14]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,14]<stderr>:  torch.has_cudnn,
[1,14]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,14]<stderr>:  torch.has_mps,
[1,14]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,14]<stderr>:  torch.has_mkldnn,
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:45,269] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:45,269] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:45,290] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,12]<stderr>:[rank12]:[2024-09-28 17:12:45,290] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:45,388] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:45,388] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:45,409] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,15]<stderr>:[rank15]:[2024-09-28 17:12:45,409] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:45,446] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:45,446] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:45,467] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,14]<stderr>:[rank14]:[2024-09-28 17:12:45,467] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:45,494] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:45,495] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:45,515] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,13]<stderr>:[rank13]:[2024-09-28 17:12:45,515] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
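
The speculate_subgraph warnings above mean a user-defined autograd.Function that receives a process group could not be traced into a single graph, so those frames run in eager mode. A hypothetical reconstruction of the pattern (illustrative only, not the actual Megatron-LM-Qwen code):

    import torch
    import torch.distributed as dist

    class _AllReduceFwd(torch.autograd.Function):
        # Dynamo models the `group` argument as ProcessGroupVariable() (the
        # [ERROR] line above) and cannot inline the collective call, hence
        # the eager fallback.
        @staticmethod
        def forward(ctx, tensor, group):
            ctx.group = group
            dist.all_reduce(tensor, group=group)  # opaque to the tracer
            return tensor

        @staticmethod
        def backward(ctx, grad_out):
            dist.all_reduce(grad_out, group=ctx.group)
            return grad_out, None
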
[1,13]<stderr>:W0928 17:12:51.981734 24844 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,14]<stderr>:W0928 17:12:51.981797 24846 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,15]<stderr>:W0928 17:12:51.981910 24847 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,12]<stderr>:W0928 17:12:51.981982 24841 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
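
The ProcessGroupNCCL warning above is informational: the job runs with NCCL_AVOID_RECORD_STREAMS=1 set, and that knob is ignored for point-to-point send/recv traffic (e.g. pipeline-parallel exchanges). Presumably the variable is exported in the launch environment before NCCL initializes, roughly:

    import os

    # Must be set before torch.distributed creates the NCCL process group;
    # per the warning it only affects regular collectives, not send/recv.
    os.environ["NCCL_AVOID_RECORD_STREAMS"] = "1"
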
[1,16]<stderr>:I0928 17:12:59.068799  6894 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,16]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,16]<stderr>:  mm 1.0190 ms 100.0%
[1,16]<stderr>:  triton_mm_7 3.2111 ms 31.7%
[1,16]<stderr>:  triton_mm_3 3.6087 ms 28.2%
[1,16]<stderr>:  triton_mm_5 3.8998 ms 26.1%
[1,16]<stderr>:  triton_mm_8 4.1183 ms 24.7%
[1,16]<stderr>:  triton_mm_4 4.3388 ms 23.5%
[1,16]<stderr>:  triton_mm_6 4.5380 ms 22.5%
[1,16]<stderr>:  triton_mm_2 5.1388 ms 19.8%
[1,16]<stderr>:  triton_mm_1 5.1524 ms 19.8%
[1,16]<stderr>:  triton_mm_0 5.2148 ms 19.5%
[1,16]<stderr>:SingleProcess AUTOTUNE takes 10.3527 seconds
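
Each AUTOTUNE block above is inductor's max-autotune matmul search for one shape: the aten mm baseline is benchmarked against the generated triton_mm_* template variants, each percentage is that candidate's speed relative to the fastest entry (here cuBLAS mm at 100.0%), and the winner is cached, so the roughly 10-second search cost is paid once per shape. A minimal sketch of what triggers such a search, assuming the run compiles its projections with max-autotune (the dtype is a guess; the log does not record it):

    import torch

    @torch.compile(mode="max-autotune")  # enables the triton template search
    def proj(x, w):
        return x @ w

    # Shapes matching the first block above: mm(2048x8192, 8192x2560)
    x = torch.randn(2048, 8192, device="cuda", dtype=torch.bfloat16)
    w = torch.randn(8192, 2560, device="cuda", dtype=torch.bfloat16)
    y = proj(x, w)  # first call compiles, benchmarks candidates, caches winner
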
[1,19]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,19]<stderr>:  mm 1.0235 ms 100.0%
[1,19]<stderr>:  triton_mm_7 3.1793 ms 32.2%
[1,19]<stderr>:  triton_mm_3 3.7515 ms 27.3%
[1,19]<stderr>:  triton_mm_5 3.9185 ms 26.1%
[1,19]<stderr>:  triton_mm_8 4.0877 ms 25.0%
[1,19]<stderr>:  triton_mm_4 4.3755 ms 23.4%
[1,19]<stderr>:  triton_mm_6 4.5028 ms 22.7%
[1,19]<stderr>:  triton_mm_0 5.1844 ms 19.7%
[1,19]<stderr>:  triton_mm_1 5.2008 ms 19.7%
[1,19]<stderr>:  triton_mm_2 5.2366 ms 19.5%
[1,19]<stderr>:SingleProcess AUTOTUNE takes 10.4489 seconds
[1,18]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,18]<stderr>:  mm 1.0227 ms 100.0%
[1,18]<stderr>:  triton_mm_7 3.2333 ms 31.6%
[1,18]<stderr>:  triton_mm_3 3.6914 ms 27.7%
[1,18]<stderr>:  triton_mm_5 3.9632 ms 25.8%
[1,18]<stderr>:  triton_mm_8 4.2793 ms 23.9%
[1,18]<stderr>:  triton_mm_4 4.3592 ms 23.5%
[1,18]<stderr>:  triton_mm_6 4.5308 ms 22.6%
[1,18]<stderr>:  triton_mm_2 5.1752 ms 19.8%
[1,18]<stderr>:  triton_mm_1 5.1757 ms 19.8%
[1,18]<stderr>:  triton_mm_0 5.2391 ms 19.5%
[1,18]<stderr>:SingleProcess AUTOTUNE takes 10.5093 seconds
[1,17]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,17]<stderr>:  mm 1.0296 ms 100.0%
[1,17]<stderr>:  triton_mm_7 3.1932 ms 32.2%
[1,17]<stderr>:  triton_mm_3 3.6848 ms 27.9%
[1,17]<stderr>:  triton_mm_5 3.9222 ms 26.3%
[1,17]<stderr>:  triton_mm_8 4.2350 ms 24.3%
[1,17]<stderr>:  triton_mm_4 4.3359 ms 23.7%
[1,17]<stderr>:  triton_mm_6 4.5138 ms 22.8%
[1,17]<stderr>:  triton_mm_0 5.0115 ms 20.5%
[1,17]<stderr>:  triton_mm_1 5.0366 ms 20.4%
[1,17]<stderr>:  triton_mm_2 5.2412 ms 19.6%
[1,17]<stderr>:SingleProcess AUTOTUNE takes 10.5546 seconds
[1,16]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,16]<stderr>:  mm 0.8559 ms 100.0%
[1,16]<stderr>:  triton_mm_14 1.5090 ms 56.7%
[1,16]<stderr>:  triton_mm_18 1.5495 ms 55.2%
[1,16]<stderr>:  triton_mm_19 1.5749 ms 54.3%
[1,16]<stderr>:  triton_mm_15 1.8731 ms 45.7%
[1,16]<stderr>:  triton_mm_11 1.9632 ms 43.6%
[1,16]<stderr>:  triton_mm_12 2.0257 ms 42.2%
[1,16]<stderr>:  triton_mm_13 2.0380 ms 42.0%
[1,16]<stderr>:  triton_mm_16 2.4394 ms 35.1%
[1,16]<stderr>:  triton_mm_17 3.2313 ms 26.5%
[1,16]<stderr>:SingleProcess AUTOTUNE takes 1.7943 seconds
[1,19]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,19]<stderr>:  mm 0.8568 ms 100.0%
[1,19]<stderr>:  triton_mm_14 1.5150 ms 56.6%
[1,19]<stderr>:  triton_mm_18 1.5533 ms 55.2%
[1,19]<stderr>:  triton_mm_19 1.5725 ms 54.5%
[1,19]<stderr>:  triton_mm_15 1.8858 ms 45.4%
[1,19]<stderr>:  triton_mm_11 1.9758 ms 43.4%
[1,19]<stderr>:  triton_mm_12 2.0317 ms 42.2%
[1,19]<stderr>:  triton_mm_13 2.0470 ms 41.9%
[1,19]<stderr>:  triton_mm_16 2.4290 ms 35.3%
[1,19]<stderr>:  triton_mm_17 3.1559 ms 27.1%
[1,19]<stderr>:SingleProcess AUTOTUNE takes 1.7534 seconds
[1,18]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,18]<stderr>:  mm 0.8570 ms 100.0%
[1,18]<stderr>:  triton_mm_14 1.5101 ms 56.8%
[1,18]<stderr>:  triton_mm_18 1.5481 ms 55.4%
[1,18]<stderr>:  triton_mm_19 1.5707 ms 54.6%
[1,18]<stderr>:  triton_mm_15 1.8746 ms 45.7%
[1,18]<stderr>:  triton_mm_11 1.9809 ms 43.3%
[1,18]<stderr>:  triton_mm_12 2.0145 ms 42.5%
[1,18]<stderr>:  triton_mm_13 2.0616 ms 41.6%
[1,18]<stderr>:  triton_mm_16 2.4156 ms 35.5%
[1,18]<stderr>:  triton_mm_17 3.1658 ms 27.1%
[1,18]<stderr>:SingleProcess AUTOTUNE takes 1.7824 seconds
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:12,044] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,17]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,17]<stderr>:  mm 0.8564 ms 100.0%
[1,17]<stderr>:  triton_mm_14 1.5045 ms 56.9%
[1,17]<stderr>:  triton_mm_18 1.5483 ms 55.3%
[1,17]<stderr>:  triton_mm_19 1.5658 ms 54.7%
[1,17]<stderr>:  triton_mm_15 1.8664 ms 45.9%
[1,17]<stderr>:  triton_mm_11 1.9753 ms 43.4%
[1,17]<stderr>:  triton_mm_12 2.0206 ms 42.4%
[1,17]<stderr>:  triton_mm_13 2.0464 ms 41.9%
[1,17]<stderr>:  triton_mm_16 2.4359 ms 35.2%
[1,17]<stderr>:  triton_mm_17 3.1247 ms 27.4%
[1,17]<stderr>:SingleProcess AUTOTUNE takes 1.7854 seconds
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:12,201] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:12,202] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] due to: 
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] 
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:12,242] torch._dynamo.convert_frame: [WARNING] 
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:12,264] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
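
The output_graph warning above (emitted once per rank) says compiled regions detect but silently skip nn.Module state_dict and backward hooks in this release. Illustrative of what gets ignored, not taken from the training code:

    import torch

    m = torch.nn.Linear(8, 8)
    # Detected but silently skipped inside torch.compile'd regions, which is
    # exactly what the warning flags.
    m.register_full_backward_hook(lambda mod, g_in, g_out: None)
    compiled = torch.compile(m)
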
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] due to: 
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] 
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] 
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] due to: 
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] 
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:12,399] torch._dynamo.convert_frame: [WARNING] 
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] due to: 
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] 
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:12,461] torch._dynamo.convert_frame: [WARNING] 
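
The WON'T CONVERT traces above show Dynamo failing with InternalTorchDynamoError: name 'torch' is not defined while evaluating a generated <string> frame for <resume in get_tensor>, so that frame stays eager on every rank. One workaround, sketched here as a guess rather than the Megatron fix, would be to exclude the function from compilation explicitly so it is skipped cleanly instead of erroring:

    import torch
    import torch._dynamo

    @torch._dynamo.disable  # skip this frame instead of failing on every rank
    def get_tensor(buffer, index):
        # stand-in body; the real get_tensor is megatron/core/utils.py line 88
        return buffer[index]
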
[1,16]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,16]<stderr>:  mm 5.0306 ms 100.0%
[1,16]<stderr>:  triton_mm_29 16.9649 ms 29.7%
[1,16]<stderr>:  triton_mm_30 18.3686 ms 27.4%
[1,16]<stderr>:  triton_mm_26 19.7958 ms 25.4%
[1,16]<stderr>:  triton_mm_27 22.1148 ms 22.7%
[1,16]<stderr>:  triton_mm_24 23.3976 ms 21.5%
[1,16]<stderr>:  triton_mm_28 25.6505 ms 19.6%
[1,16]<stderr>:  triton_mm_25 29.3881 ms 17.1%
[1,16]<stderr>:  triton_mm_23 30.9675 ms 16.2%
[1,16]<stderr>:  triton_mm_22 32.8563 ms 15.3%
[1,16]<stderr>:SingleProcess AUTOTUNE takes 9.0260 seconds
[1,18]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,18]<stderr>:  mm 5.0678 ms 100.0%
[1,18]<stderr>:  triton_mm_29 17.1335 ms 29.6%
[1,18]<stderr>:  triton_mm_30 18.9359 ms 26.8%
[1,18]<stderr>:  triton_mm_26 19.4301 ms 26.1%
[1,18]<stderr>:  triton_mm_27 21.9373 ms 23.1%
[1,18]<stderr>:  triton_mm_24 22.5571 ms 22.5%
[1,18]<stderr>:  triton_mm_28 24.8174 ms 20.4%
[1,18]<stderr>:  triton_mm_25 27.0890 ms 18.7%
[1,18]<stderr>:  triton_mm_22 31.6592 ms 16.0%
[1,18]<stderr>:  triton_mm_31 34.5828 ms 14.7%
[1,18]<stderr>:SingleProcess AUTOTUNE takes 9.0317 seconds
[1,19]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,19]<stderr>:  mm 5.0343 ms 100.0%
[1,19]<stderr>:  triton_mm_29 17.3920 ms 28.9%
[1,19]<stderr>:  triton_mm_30 19.4286 ms 25.9%
[1,19]<stderr>:  triton_mm_26 19.8028 ms 25.4%
[1,19]<stderr>:  triton_mm_27 21.7512 ms 23.1%
[1,19]<stderr>:  triton_mm_24 22.9824 ms 21.9%
[1,19]<stderr>:  triton_mm_28 24.8577 ms 20.3%
[1,19]<stderr>:  triton_mm_25 26.1739 ms 19.2%
[1,19]<stderr>:  triton_mm_22 30.5890 ms 16.5%
[1,19]<stderr>:  triton_mm_31 33.4553 ms 15.0%
[1,19]<stderr>:SingleProcess AUTOTUNE takes 9.0865 seconds
[1,17]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,17]<stderr>:  mm 5.1117 ms 100.0%
[1,17]<stderr>:  triton_mm_29 18.0306 ms 28.4%
[1,17]<stderr>:  triton_mm_30 19.2974 ms 26.5%
[1,17]<stderr>:  triton_mm_26 20.3030 ms 25.2%
[1,17]<stderr>:  triton_mm_27 21.9314 ms 23.3%
[1,17]<stderr>:  triton_mm_24 22.7694 ms 22.5%
[1,17]<stderr>:  triton_mm_25 23.7199 ms 21.6%
[1,17]<stderr>:  triton_mm_28 25.5769 ms 20.0%
[1,17]<stderr>:  triton_mm_22 31.2066 ms 16.4%
[1,17]<stderr>:  triton_mm_31 32.1495 ms 15.9%
[1,17]<stderr>:SingleProcess AUTOTUNE takes 9.0968 seconds
[1,16]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,16]<stderr>:  torch.has_cuda,
[1,16]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,16]<stderr>:  torch.has_cudnn,
[1,16]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,16]<stderr>:  torch.has_mps,
[1,16]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,16]<stderr>:  torch.has_mkldnn,
[1,18]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,18]<stderr>:  torch.has_cuda,
[1,18]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,18]<stderr>:  torch.has_cudnn,
[1,18]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,18]<stderr>:  torch.has_mps,
[1,18]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,18]<stderr>:  torch.has_mkldnn,
[1,19]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,19]<stderr>:  torch.has_cuda,
[1,19]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,19]<stderr>:  torch.has_cudnn,
[1,19]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,19]<stderr>:  torch.has_mps,
[1,19]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,19]<stderr>:  torch.has_mkldnn,
[1,17]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,17]<stderr>:  torch.has_cuda,
[1,17]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,17]<stderr>:  torch.has_cudnn,
[1,17]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,17]<stderr>:  torch.has_mps,
[1,17]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,17]<stderr>:  torch.has_mkldnn,
[1,16]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,16]<stderr>:  mm 3.3568 ms 100.0%
[1,16]<stderr>:  triton_mm_40 3.8584 ms 87.0%
[1,16]<stderr>:  triton_mm_41 5.0432 ms 66.6%
[1,16]<stderr>:  triton_mm_36 5.0435 ms 66.6%
[1,16]<stderr>:  triton_mm_37 5.7070 ms 58.8%
[1,16]<stderr>:  triton_mm_38 10.0488 ms 33.4%
[1,16]<stderr>:  triton_mm_35 11.6857 ms 28.7%
[1,16]<stderr>:  triton_mm_34 13.0232 ms 25.8%
[1,16]<stderr>:  triton_mm_33 13.6706 ms 24.6%
[1,16]<stderr>:  triton_mm_39 15.7521 ms 21.3%
[1,16]<stderr>:SingleProcess AUTOTUNE takes 2.4829 seconds
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:27,494] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:27,494] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:27,514] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,16]<stderr>:[rank16]:[2024-09-28 17:13:27,514] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,18]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,18]<stderr>:  mm 3.3547 ms 100.0%
[1,18]<stderr>:  triton_mm_40 3.8231 ms 87.7%
[1,18]<stderr>:  triton_mm_36 5.0457 ms 66.5%
[1,18]<stderr>:  triton_mm_37 5.7275 ms 58.6%
[1,18]<stderr>:  triton_mm_41 7.7503 ms 43.3%
[1,18]<stderr>:  triton_mm_38 9.9515 ms 33.7%
[1,18]<stderr>:  triton_mm_35 11.5821 ms 29.0%
[1,18]<stderr>:  triton_mm_34 14.0749 ms 23.8%
[1,18]<stderr>:  triton_mm_39 15.9786 ms 21.0%
[1,18]<stderr>:  triton_mm_33 16.1320 ms 20.8%
[1,18]<stderr>:SingleProcess AUTOTUNE takes 2.5257 seconds
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:27,660] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:27,660] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:27,680] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,18]<stderr>:[rank18]:[2024-09-28 17:13:27,680] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,19]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,19]<stderr>:  mm 3.3595 ms 100.0%
[1,19]<stderr>:  triton_mm_40 3.7699 ms 89.1%
[1,19]<stderr>:  triton_mm_36 5.0412 ms 66.6%
[1,19]<stderr>:  triton_mm_37 5.5369 ms 60.7%
[1,19]<stderr>:  triton_mm_41 5.9061 ms 56.9%
[1,19]<stderr>:  triton_mm_38 10.1118 ms 33.2%
[1,19]<stderr>:  triton_mm_35 11.7117 ms 28.7%
[1,19]<stderr>:  triton_mm_34 12.5776 ms 26.7%
[1,19]<stderr>:  triton_mm_39 15.3133 ms 21.9%
[1,19]<stderr>:  triton_mm_33 16.7293 ms 20.1%
[1,19]<stderr>:SingleProcess AUTOTUNE takes 2.4784 seconds
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:27,779] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:27,779] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:27,799] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,19]<stderr>:[rank19]:[2024-09-28 17:13:27,799] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,17]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,17]<stderr>:  mm 3.2347 ms 100.0%
[1,17]<stderr>:  triton_mm_40 3.7981 ms 85.2%
[1,17]<stderr>:  triton_mm_36 5.0346 ms 64.2%
[1,17]<stderr>:  triton_mm_41 5.3511 ms 60.4%
[1,17]<stderr>:  triton_mm_37 5.6873 ms 56.9%
[1,17]<stderr>:  triton_mm_38 10.2980 ms 31.4%
[1,17]<stderr>:  triton_mm_35 12.0414 ms 26.9%
[1,17]<stderr>:  triton_mm_34 14.6746 ms 22.0%
[1,17]<stderr>:  triton_mm_33 15.7319 ms 20.6%
[1,17]<stderr>:  triton_mm_39 16.3028 ms 19.8%
[1,17]<stderr>:SingleProcess AUTOTUNE takes 2.5271 seconds
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:27,851] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:27,852] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:27,872] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,17]<stderr>:[rank17]:[2024-09-28 17:13:27,872] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,16]<stderr>:W0928 17:13:34.576684  6894 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,19]<stderr>:W0928 17:13:34.576723  6914 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,17]<stderr>:W0928 17:13:34.576835  6901 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,18]<stderr>:W0928 17:13:34.576884  6907 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,20]<stderr>:I0928 17:13:41.432934  6917 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:45,333] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:45,395] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:45,496] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:45,496] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] due to: 
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] 
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:45,529] torch._dynamo.convert_frame: [WARNING] 
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] due to: 
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] 
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:45,594] torch._dynamo.convert_frame: [WARNING] 
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] due to: 
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] 
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:45,696] torch._dynamo.convert_frame: [WARNING] 
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] due to: 
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] 
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:45,697] torch._dynamo.convert_frame: [WARNING] 
[1,22]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,22]<stderr>:  torch.has_cuda,
[1,22]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,22]<stderr>:  torch.has_cudnn,
[1,22]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,22]<stderr>:  torch.has_mps,
[1,22]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,22]<stderr>:  torch.has_mkldnn,
[1,23]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,23]<stderr>:  torch.has_cuda,
[1,23]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,23]<stderr>:  torch.has_cudnn,
[1,23]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,23]<stderr>:  torch.has_mps,
[1,23]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,23]<stderr>:  torch.has_mkldnn,
[1,20]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,20]<stderr>:  torch.has_cuda,
[1,20]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,20]<stderr>:  torch.has_cudnn,
[1,20]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,20]<stderr>:  torch.has_mps,
[1,20]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,20]<stderr>:  torch.has_mkldnn,
[1,21]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,21]<stderr>:  torch.has_cuda,
[1,21]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,21]<stderr>:  torch.has_cudnn,
[1,21]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,21]<stderr>:  torch.has_mps,
[1,21]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,21]<stderr>:  torch.has_mkldnn,
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:49,037] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:49,037] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:49,058] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,22]<stderr>:[rank22]:[2024-09-28 17:13:49,058] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:49,279] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:49,279] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:49,301] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,23]<stderr>:[rank23]:[2024-09-28 17:13:49,301] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:49,424] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:49,424] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:49,445] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,20]<stderr>:[rank20]:[2024-09-28 17:13:49,445] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:49,484] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:49,484] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:49,505] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,21]<stderr>:[rank21]:[2024-09-28 17:13:49,505] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,23]<stderr>:W0928 17:13:56.501966  6922 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,22]<stderr>:W0928 17:13:56.501960  6921 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,20]<stderr>:W0928 17:13:56.502000  6917 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,21]<stderr>:W0928 17:13:56.502178  6919 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,24]<stderr>:I0928 17:14:03.832532  8434 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,24]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,24]<stderr>:  mm 1.0391 ms 100.0%
[1,24]<stderr>:  triton_mm_7 3.1755 ms 32.7%
[1,24]<stderr>:  triton_mm_3 3.6507 ms 28.5%
[1,24]<stderr>:  triton_mm_5 3.8906 ms 26.7%
[1,24]<stderr>:  triton_mm_8 4.0837 ms 25.4%
[1,24]<stderr>:  triton_mm_4 4.3992 ms 23.6%
[1,24]<stderr>:  triton_mm_6 4.5329 ms 22.9%
[1,24]<stderr>:  triton_mm_1 4.8434 ms 21.5%
[1,24]<stderr>:  triton_mm_2 5.1816 ms 20.1%
[1,24]<stderr>:  triton_mm_0 5.1966 ms 20.0%
[1,24]<stderr>:SingleProcess AUTOTUNE takes 10.1101 seconds
[1,27]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,27]<stderr>:  mm 1.0300 ms 100.0%
[1,27]<stderr>:  triton_mm_7 3.2299 ms 31.9%
[1,27]<stderr>:  triton_mm_3 3.5824 ms 28.8%
[1,27]<stderr>:  triton_mm_5 3.9000 ms 26.4%
[1,27]<stderr>:  triton_mm_8 4.1541 ms 24.8%
[1,27]<stderr>:  triton_mm_4 4.3677 ms 23.6%
[1,27]<stderr>:  triton_mm_6 4.5098 ms 22.8%
[1,27]<stderr>:  triton_mm_2 5.0905 ms 20.2%
[1,27]<stderr>:  triton_mm_0 5.1741 ms 19.9%
[1,27]<stderr>:  triton_mm_1 5.3158 ms 19.4%
[1,27]<stderr>:SingleProcess AUTOTUNE takes 10.5964 seconds
[1,26]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,26]<stderr>:  mm 1.0305 ms 100.0%
[1,26]<stderr>:  triton_mm_7 3.2091 ms 32.1%
[1,26]<stderr>:  triton_mm_3 3.7131 ms 27.8%
[1,26]<stderr>:  triton_mm_5 3.9533 ms 26.1%
[1,26]<stderr>:  triton_mm_8 4.2911 ms 24.0%
[1,26]<stderr>:  triton_mm_4 4.3884 ms 23.5%
[1,26]<stderr>:  triton_mm_6 4.5206 ms 22.8%
[1,26]<stderr>:  triton_mm_2 5.0828 ms 20.3%
[1,26]<stderr>:  triton_mm_0 5.2260 ms 19.7%
[1,26]<stderr>:  triton_mm_1 5.3700 ms 19.2%
[1,26]<stderr>:SingleProcess AUTOTUNE takes 10.6384 seconds
[1,25]<stderr>:AUTOTUNE mm(2048x8192, 8192x2560)
[1,25]<stderr>:  mm 1.0298 ms 100.0%
[1,25]<stderr>:  triton_mm_7 3.2100 ms 32.1%
[1,25]<stderr>:  triton_mm_3 3.7548 ms 27.4%
[1,25]<stderr>:  triton_mm_5 3.8834 ms 26.5%
[1,25]<stderr>:  triton_mm_8 4.2774 ms 24.1%
[1,25]<stderr>:  triton_mm_4 4.3370 ms 23.7%
[1,25]<stderr>:  triton_mm_6 4.4937 ms 22.9%
[1,25]<stderr>:  triton_mm_1 5.0691 ms 20.3%
[1,25]<stderr>:  triton_mm_0 5.1841 ms 19.9%
[1,25]<stderr>:  triton_mm_2 5.2164 ms 19.7%
[1,25]<stderr>:SingleProcess AUTOTUNE takes 10.6731 seconds
[1,24]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,24]<stderr>:  mm 0.8595 ms 100.0%
[1,24]<stderr>:  triton_mm_14 1.5019 ms 57.2%
[1,24]<stderr>:  triton_mm_18 1.5523 ms 55.4%
[1,24]<stderr>:  triton_mm_19 1.5722 ms 54.7%
[1,24]<stderr>:  triton_mm_15 1.8682 ms 46.0%
[1,24]<stderr>:  triton_mm_11 1.9915 ms 43.2%
[1,24]<stderr>:  triton_mm_12 2.0223 ms 42.5%
[1,24]<stderr>:  triton_mm_13 2.0627 ms 41.7%
[1,24]<stderr>:  triton_mm_16 2.4373 ms 35.3%
[1,24]<stderr>:  triton_mm_17 3.1016 ms 27.7%
[1,24]<stderr>:SingleProcess AUTOTUNE takes 1.7573 seconds
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:16,434] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] due to: 
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] 
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:16,628] torch._dynamo.convert_frame: [WARNING] 
[1,27]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,27]<stderr>:  mm 0.8593 ms 100.0%
[1,27]<stderr>:  triton_mm_14 1.5113 ms 56.9%
[1,27]<stderr>:  triton_mm_18 1.5460 ms 55.6%
[1,27]<stderr>:  triton_mm_19 1.5695 ms 54.7%
[1,27]<stderr>:  triton_mm_15 1.8808 ms 45.7%
[1,27]<stderr>:  triton_mm_11 1.9772 ms 43.5%
[1,27]<stderr>:  triton_mm_12 2.0304 ms 42.3%
[1,27]<stderr>:  triton_mm_13 2.0558 ms 41.8%
[1,27]<stderr>:  triton_mm_16 2.4557 ms 35.0%
[1,27]<stderr>:  triton_mm_17 3.1676 ms 27.1%
[1,27]<stderr>:SingleProcess AUTOTUNE takes 1.7926 seconds
[1,26]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,26]<stderr>:  mm 0.8599 ms 100.0%
[1,26]<stderr>:  triton_mm_14 1.5139 ms 56.8%
[1,26]<stderr>:  triton_mm_18 1.5590 ms 55.2%
[1,26]<stderr>:  triton_mm_19 1.5747 ms 54.6%
[1,26]<stderr>:  triton_mm_15 1.8672 ms 46.1%
[1,26]<stderr>:  triton_mm_11 1.9782 ms 43.5%
[1,26]<stderr>:  triton_mm_12 2.0314 ms 42.3%
[1,26]<stderr>:  triton_mm_13 2.0495 ms 42.0%
[1,26]<stderr>:  triton_mm_16 2.4120 ms 35.7%
[1,26]<stderr>:  triton_mm_17 3.2531 ms 26.4%
[1,26]<stderr>:SingleProcess AUTOTUNE takes 1.9581 seconds
[1,25]<stderr>:AUTOTUNE mm(2048x2048, 2048x8192)
[1,25]<stderr>:  mm 0.8618 ms 100.0%
[1,25]<stderr>:  triton_mm_14 1.5112 ms 57.0%
[1,25]<stderr>:  triton_mm_18 1.5500 ms 55.6%
[1,25]<stderr>:  triton_mm_19 1.5699 ms 54.9%
[1,25]<stderr>:  triton_mm_15 1.8734 ms 46.0%
[1,25]<stderr>:  triton_mm_11 1.9760 ms 43.6%
[1,25]<stderr>:  triton_mm_12 2.0074 ms 42.9%
[1,25]<stderr>:  triton_mm_13 2.0494 ms 42.1%
[1,25]<stderr>:  triton_mm_16 2.4026 ms 35.9%
[1,25]<stderr>:  triton_mm_17 3.1352 ms 27.5%
[1,25]<stderr>:SingleProcess AUTOTUNE takes 1.7803 seconds
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:17,004] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:17,060] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:17,080] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] due to: 
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] 
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:17,205] torch._dynamo.convert_frame: [WARNING] 
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] due to: 
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] 
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:17,263] torch._dynamo.convert_frame: [WARNING] 
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] due to: 
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] 
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:17,282] torch._dynamo.convert_frame: [WARNING] 
[1,24]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,24]<stderr>:  mm 5.1857 ms 100.0%
[1,24]<stderr>:  triton_mm_30 17.9639 ms 28.9%
[1,24]<stderr>:  triton_mm_29 18.2298 ms 28.4%
[1,24]<stderr>:  triton_mm_26 18.8290 ms 27.5%
[1,24]<stderr>:  triton_mm_27 22.2413 ms 23.3%
[1,24]<stderr>:  triton_mm_24 23.3049 ms 22.3%
[1,24]<stderr>:  triton_mm_28 25.0196 ms 20.7%
[1,24]<stderr>:  triton_mm_25 26.7174 ms 19.4%
[1,24]<stderr>:  triton_mm_23 30.4225 ms 17.0%
[1,24]<stderr>:  triton_mm_22 30.6673 ms 16.9%
[1,24]<stderr>:SingleProcess AUTOTUNE takes 9.3091 seconds
[1,24]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,24]<stderr>:  torch.has_cuda,
[1,24]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,24]<stderr>:  torch.has_cudnn,
[1,24]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,24]<stderr>:  torch.has_mps,
[1,24]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,24]<stderr>:  torch.has_mkldnn,
[1,27]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,27]<stderr>:  mm 5.0847 ms 100.0%
[1,27]<stderr>:  triton_mm_29 16.9729 ms 30.0%
[1,27]<stderr>:  triton_mm_30 18.4372 ms 27.6%
[1,27]<stderr>:  triton_mm_26 18.6484 ms 27.3%
[1,27]<stderr>:  triton_mm_27 21.7123 ms 23.4%
[1,27]<stderr>:  triton_mm_24 23.5352 ms 21.6%
[1,27]<stderr>:  triton_mm_28 24.9348 ms 20.4%
[1,27]<stderr>:  triton_mm_25 26.4609 ms 19.2%
[1,27]<stderr>:  triton_mm_22 30.5302 ms 16.7%
[1,27]<stderr>:  triton_mm_23 31.2967 ms 16.2%
[1,27]<stderr>:SingleProcess AUTOTUNE takes 9.1324 seconds
[1,25]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,25]<stderr>:  mm 5.0961 ms 100.0%
[1,25]<stderr>:  triton_mm_29 18.0852 ms 28.2%
[1,25]<stderr>:  triton_mm_30 18.9151 ms 26.9%
[1,25]<stderr>:  triton_mm_26 20.1964 ms 25.2%
[1,25]<stderr>:  triton_mm_27 21.8812 ms 23.3%
[1,25]<stderr>:  triton_mm_25 22.3130 ms 22.8%
[1,25]<stderr>:  triton_mm_24 22.5970 ms 22.6%
[1,25]<stderr>:  triton_mm_28 24.7673 ms 20.6%
[1,25]<stderr>:  triton_mm_22 30.7182 ms 16.6%
[1,25]<stderr>:  triton_mm_31 33.4807 ms 15.2%
[1,25]<stderr>:SingleProcess AUTOTUNE takes 9.1230 seconds
[1,26]<stderr>:AUTOTUNE mm(2048x8192, 8192x14784)
[1,26]<stderr>:  mm 5.1129 ms 100.0%
[1,26]<stderr>:  triton_mm_30 18.2823 ms 28.0%
[1,26]<stderr>:  triton_mm_29 18.7388 ms 27.3%
[1,26]<stderr>:  triton_mm_26 19.0723 ms 26.8%
[1,26]<stderr>:  triton_mm_27 21.8754 ms 23.4%
[1,26]<stderr>:  triton_mm_24 23.8963 ms 21.4%
[1,26]<stderr>:  triton_mm_28 25.0097 ms 20.4%
[1,26]<stderr>:  triton_mm_25 25.2861 ms 20.2%
[1,26]<stderr>:  triton_mm_22 29.8695 ms 17.1%
[1,26]<stderr>:  triton_mm_31 32.5502 ms 15.7%
[1,26]<stderr>:SingleProcess AUTOTUNE takes 9.3279 seconds
[1,27]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,27]<stderr>:  torch.has_cuda,
[1,27]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,27]<stderr>:  torch.has_cudnn,
[1,27]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,27]<stderr>:  torch.has_mps,
[1,27]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,27]<stderr>:  torch.has_mkldnn,
[1,25]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,25]<stderr>:  torch.has_cuda,
[1,25]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,25]<stderr>:  torch.has_cudnn,
[1,25]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,25]<stderr>:  torch.has_mps,
[1,25]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,25]<stderr>:  torch.has_mkldnn,
[1,26]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,26]<stderr>:  torch.has_cuda,
[1,26]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,26]<stderr>:  torch.has_cudnn,
[1,26]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,26]<stderr>:  torch.has_mps,
[1,26]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,26]<stderr>:  torch.has_mkldnn,
[1,24]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,24]<stderr>:  mm 3.3505 ms 100.0%
[1,24]<stderr>:  triton_mm_40 3.8707 ms 86.6%
[1,24]<stderr>:  triton_mm_36 5.1307 ms 65.3%
[1,24]<stderr>:  triton_mm_37 5.2460 ms 63.9%
[1,24]<stderr>:  triton_mm_41 6.7128 ms 49.9%
[1,24]<stderr>:  triton_mm_38 10.2252 ms 32.8%
[1,24]<stderr>:  triton_mm_35 11.4638 ms 29.2%
[1,24]<stderr>:  triton_mm_34 14.1986 ms 23.6%
[1,24]<stderr>:  triton_mm_33 15.5529 ms 21.5%
[1,24]<stderr>:  triton_mm_39 16.3878 ms 20.4%
[1,24]<stderr>:SingleProcess AUTOTUNE takes 2.5218 seconds
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:32,005] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:32,005] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:32,024] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,24]<stderr>:[rank24]:[2024-09-28 17:14:32,025] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,27]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,27]<stderr>:  mm 3.3702 ms 100.0%
[1,27]<stderr>:  triton_mm_40 3.7775 ms 89.2%
[1,27]<stderr>:  triton_mm_36 5.1127 ms 65.9%
[1,27]<stderr>:  triton_mm_37 5.7541 ms 58.6%
[1,27]<stderr>:  triton_mm_41 7.0124 ms 48.1%
[1,27]<stderr>:  triton_mm_38 10.1167 ms 33.3%
[1,27]<stderr>:  triton_mm_35 12.0237 ms 28.0%
[1,27]<stderr>:  triton_mm_34 14.0471 ms 24.0%
[1,27]<stderr>:  triton_mm_39 15.8385 ms 21.3%
[1,27]<stderr>:  triton_mm_33 16.7307 ms 20.1%
[1,27]<stderr>:SingleProcess AUTOTUNE takes 2.5555 seconds
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:32,503] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:32,503] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:32,523] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,27]<stderr>:[rank27]:[2024-09-28 17:14:32,523] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,25]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,25]<stderr>:  mm 3.3742 ms 100.0%
[1,25]<stderr>:  triton_mm_40 3.8454 ms 87.7%
[1,25]<stderr>:  triton_mm_36 5.0190 ms 67.2%
[1,25]<stderr>:  triton_mm_41 5.3428 ms 63.2%
[1,25]<stderr>:  triton_mm_37 5.7840 ms 58.3%
[1,25]<stderr>:  triton_mm_38 10.0044 ms 33.7%
[1,25]<stderr>:  triton_mm_35 11.8802 ms 28.4%
[1,25]<stderr>:  triton_mm_34 13.9136 ms 24.3%
[1,25]<stderr>:  triton_mm_33 15.3226 ms 22.0%
[1,25]<stderr>:  triton_mm_39 16.1431 ms 20.9%
[1,25]<stderr>:SingleProcess AUTOTUNE takes 2.5182 seconds
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:32,609] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:32,609] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:32,630] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,25]<stderr>:[rank25]:[2024-09-28 17:14:32,630] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,26]<stderr>:AUTOTUNE mm(2048x7392, 7392x8192)
[1,26]<stderr>:  mm 3.2780 ms 100.0%
[1,26]<stderr>:  triton_mm_40 3.8086 ms 86.1%
[1,26]<stderr>:  triton_mm_36 5.0192 ms 65.3%
[1,26]<stderr>:  triton_mm_41 5.2896 ms 62.0%
[1,26]<stderr>:  triton_mm_37 6.1512 ms 53.3%
[1,26]<stderr>:  triton_mm_38 10.2586 ms 32.0%
[1,26]<stderr>:  triton_mm_35 11.9237 ms 27.5%
[1,26]<stderr>:  triton_mm_34 13.6025 ms 24.1%
[1,26]<stderr>:  triton_mm_33 15.2115 ms 21.5%
[1,26]<stderr>:  triton_mm_39 15.4452 ms 21.2%
[1,26]<stderr>:SingleProcess AUTOTUNE takes 2.5798 seconds
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:32,983] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:32,983] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:33,004] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,26]<stderr>:[rank26]:[2024-09-28 17:14:33,004] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,24]<stderr>:W0928 17:14:39.956445  8434 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,25]<stderr>:W0928 17:14:39.956475  8440 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,27]<stderr>:W0928 17:14:39.956598  8451 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,26]<stderr>:W0928 17:14:39.956630  8446 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
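[annotation] The four warnings above fire because NCCL_AVOID_RECORD_STREAMS=1 is set for this job: the flag changes how ProcessGroupNCCL keeps tensors alive during regular collectives, which can curb reserved-memory growth, but point-to-point send/recv, which pipeline parallelism uses heavily, ignores it. A sketch of how the flag is typically set, assuming this exact spelling on this PyTorch version (newer releases rename it TORCH_NCCL_AVOID_RECORD_STREAMS):

    import os

    # Must be in the environment before torch.distributed.init_process_group()
    # creates the NCCL process group; per the warning above, it has no effect
    # for point-to-point collectives.
    os.environ["NCCL_AVOID_RECORD_STREAMS"] = "1"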
[1,28]<stderr>:I0928 17:14:40.918779  8456 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:50,940] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:50,968] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:51,019] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:51,030] [11/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
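[annotation] The warning above is emitted once per rank the first time torch.compile traces a module carrying state_dict or backward hooks: on this PyTorch version the hooks are detected but silently skipped during compiled execution. A minimal sketch that typically reproduces it, with a toy module and a hypothetical hook:

    import torch

    net = torch.nn.Linear(8, 8)
    # A full backward hook: detected by torch.compile but silently ignored
    # on this version, per the warning above.
    net.register_full_backward_hook(lambda mod, grad_in, grad_out: None)

    compiled = torch.compile(net)
    compiled(torch.randn(2, 8)).sum().backward()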
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] due to: 
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] 
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:51,136] torch._dynamo.convert_frame: [WARNING] 
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] due to: 
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] 
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:51,164] torch._dynamo.convert_frame: [WARNING] 
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] due to: 
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] 
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:51,216] torch._dynamo.convert_frame: [WARNING] 
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT <resume in get_tensor> /data/project/Megatron-LM-Qwen/megatron/core/utils.py line 88 
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] due to: 
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING]   File "<string>", line 1, in <module>
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.InternalTorchDynamoError: name 'torch' is not defined
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] 
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:51,226] torch._dynamo.convert_frame: [WARNING] 
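[annotation] The WON'T CONVERT blocks above (one per rank) show Dynamo abandoning the resumed frame of megatron/core/utils.py::get_tensor after an internal NameError raised while evaluating a generated "<string>" expression; the frame then runs eager, so training proceeds, just uncompiled for that helper. If the warning churn matters, one workaround is to exclude the frame from Dynamo explicitly; a sketch with a hypothetical stand-in body, not Megatron's actual function:

    import torch._dynamo

    # Illustrative: skipping a frame Dynamo repeatedly fails to convert
    # silences the retry/warning noise; behavior is unchanged because the
    # frame was already falling back to eager.
    @torch._dynamo.disable
    def get_tensor(buffer, index):
        return buffer[index]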
[1,28]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,28]<stderr>:  torch.has_cuda,
[1,28]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,28]<stderr>:  torch.has_cudnn,
[1,28]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,28]<stderr>:  torch.has_mps,
[1,28]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,28]<stderr>:  torch.has_mkldnn,
[1,31]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,31]<stderr>:  torch.has_cuda,
[1,31]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,31]<stderr>:  torch.has_cudnn,
[1,31]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,31]<stderr>:  torch.has_mps,
[1,31]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,31]<stderr>:  torch.has_mkldnn,
[1,30]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,30]<stderr>:  torch.has_cuda,
[1,30]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,30]<stderr>:  torch.has_cudnn,
[1,30]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,30]<stderr>:  torch.has_mps,
[1,30]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,30]<stderr>:  torch.has_mkldnn,
[1,29]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
[1,29]<stderr>:  torch.has_cuda,
[1,29]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
[1,29]<stderr>:  torch.has_cudnn,
[1,29]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
[1,29]<stderr>:  torch.has_mps,
[1,29]<stderr>:/usr/local/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
[1,29]<stderr>:  torch.has_mkldnn,
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:54,695] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:54,695] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:54,715] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,28]<stderr>:[rank28]:[2024-09-28 17:14:54,715] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:54,732] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:54,732] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:54,753] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,31]<stderr>:[rank31]:[2024-09-28 17:14:54,753] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:54,767] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:54,767] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:54,788] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,29]<stderr>:[rank29]:[2024-09-28 17:14:54,788] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:54,861] [18/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:54,862] [18/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:54,883] [19/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting the user-defined autograd.Function, we were unable to trace function `trampoline_autograd_fwd` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[1,30]<stderr>:[rank30]:[2024-09-28 17:14:54,883] [19/0] torch._dynamo.variables.higher_order_ops: [ERROR] call_function args:  ProcessGroupVariable()
[1,30]<stderr>:I0928 17:15:01.845300  8461 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,31]<stderr>:I0928 17:15:01.846387  8462 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,29]<stderr>:I0928 17:15:01.858175  8459 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,28]<stderr>:I0928 17:15:01.873312  8456 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,29]<stderr>:W0928 17:15:09.145107  8459 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,28]<stderr>:W0928 17:15:09.145203  8456 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,31]<stderr>:W0928 17:15:09.146502  8462 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,30]<stderr>:W0928 17:15:09.149472  8461 ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[1,30]<stderr>:I0928 17:16:36.815590  8461 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,28]<stderr>:I0928 17:16:36.824781  8456 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,29]<stderr>:I0928 17:16:36.839496  8459 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,31]<stderr>:I0928 17:16:36.839843  8462 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,27]<stderr>:I0928 17:16:37.125530  8451 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,26]<stderr>:I0928 17:16:37.130755  8446 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,24]<stderr>:I0928 17:16:37.154068  8434 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,25]<stderr>:I0928 17:16:37.162684  8440 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,20]<stderr>:I0928 17:16:37.479439  6917 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,21]<stderr>:I0928 17:16:37.485435  6919 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,23]<stderr>:I0928 17:16:37.486351  6922 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,22]<stderr>:I0928 17:16:37.492189  6921 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,19]<stderr>:I0928 17:16:37.791796  6914 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,18]<stderr>:I0928 17:16:37.805619  6907 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,16]<stderr>:I0928 17:16:37.810765  6894 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,17]<stderr>:I0928 17:16:37.823271  6901 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,13]<stderr>:I0928 17:16:38.077255 24844 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,15]<stderr>:I0928 17:16:38.077323 24847 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,12]<stderr>:I0928 17:16:38.084888 24841 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,14]<stderr>:I0928 17:16:38.138005 24846 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,10]<stderr>:I0928 17:16:38.387204 24831 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,9]<stderr>:I0928 17:16:38.410526 24825 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,11]<stderr>:I0928 17:16:38.410938 24836 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,8]<stderr>:I0928 17:16:38.416317 24819 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,5]<stderr>:I0928 17:16:38.717661 21524 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,6]<stderr>:I0928 17:16:38.732295 21525 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,4]<stderr>:I0928 17:16:38.742352 21522 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,7]<stderr>:I0928 17:16:38.777984 21526 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,2]<stderr>:I0928 17:16:39.009244 21514 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,3]<stderr>:I0928 17:16:39.011678 21518 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,1]<stderr>:I0928 17:16:39.014308 21509 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,0]<stderr>:I0928 17:16:39.014585 21506 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
[1,0]<stdout>:Number of parameters in transformer layers in billions:  70.22
[1,0]<stdout>:Number of parameters in embedding layers in billions: 2.49
[1,0]<stdout>:Total number of parameters in billions: 72.71
[1,0]<stdout>:Number of parameters in most loaded shard in billions: 2.5057
[1,0]<stdout>:Number of parameters in other shards in billions: 2.1942
[1,0]<stdout>:Theoretical memory footprints: weight and optimizer=43012.42 MB
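[annotation] The summary above is self-consistent: 70.22B transformer plus 2.49B embedding parameters gives the 72.71B total, and the theoretical weight-plus-optimizer figure matches the usual mixed-precision Adam accounting of 18 bytes per parameter applied to the most loaded shard. The byte breakdown (2-byte bf16 weight, 4-byte fp32 gradient, 4-byte fp32 master weight, two 4-byte Adam moments) is the standard convention, assumed here rather than read from the log. A quick check:

    # Parameter totals as printed above
    assert abs((70.22 + 2.49) - 72.71) < 1e-6

    shard_params = 2.5057e9                 # most loaded shard, from the log
    bytes_per_param = 2 + 4 + 4 + 4 + 4     # bf16 w, fp32 grad, master, Adam m/v
    print(shard_params * bytes_per_param / 2**20)   # ~43013 MB vs. 43012.42 logged

The small gap comes from the shard count being rounded to 2.5057 in the printout.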
[1,31]<stdout>: [2024-09-28 17:16:39] iteration        1/     100 | consumed samples:           64 | elapsed time per iteration (ms): 357535.0 | throughput per GPU (TFLOP/s/GPU): 5.1 | learning rate: 3.000000E-05 | global batch size:    64 | lm loss: 1.205764E+01 | loss scale: 1.0 | grad norm: 27.325 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,3]<stdout>:[Rank 3] (after 1 iterations) memory (MB) | allocated: 21890.52978515625 | max allocated: 29690.8583984375 | reserved: 34624.0 | max reserved: 34624.0
[1,1]<stdout>:[Rank 1] (after 1 iterations) memory (MB) | allocated: 21890.62744140625 | max allocated: 29691.4560546875 | reserved: 34856.0 | max reserved: 34856.0
[1,5]<stdout>:[Rank 5] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 24763.95166015625 | reserved: 28130.0 | max reserved: 28130.0
[1,6]<stdout>:[Rank 6] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 24763.57666015625 | reserved: 28114.0 | max reserved: 28114.0
[1,0]<stdout>:[Rank 0] (after 1 iterations) memory (MB) | allocated: 21890.5302734375 | max allocated: 29692.14013671875 | reserved: 34624.0 | max reserved: 34624.0
[1,24]<stdout>:[Rank 24] (after 1 iterations) memory (MB) | allocated: 16569.4326171875 | max allocated: 16809.54541015625 | reserved: 20012.0 | max reserved: 20012.0
[1,27]<stdout>:[Rank 27] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 16809.107421875 | reserved: 20012.0 | max reserved: 20012.0
[1,13]<stdout>:[Rank 13] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 21581.22802734375 | reserved: 24870.0 | max reserved: 24870.0
[1,15]<stdout>:[Rank 15] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 21582.10302734375 | reserved: 24886.0 | max reserved: 24886.0
[1,14]<stdout>:[Rank 14] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 21582.72802734375 | reserved: 24886.0 | max reserved: 24886.0
[1,8]<stdout>:[Rank 8] (after 1 iterations) memory (MB) | allocated: 16569.4326171875 | max allocated: 23173.74267578125 | reserved: 26502.0 | max reserved: 26502.0
[1,29]<stdout>:[Rank 29] (after 1 iterations) memory (MB) | allocated: 17754.10107421875 | max allocated: 17754.1328125 | reserved: 20046.0 | max reserved: 20046.0
[1,26]<stdout>:[Rank 26] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 16808.357421875 | reserved: 20032.0 | max reserved: 20032.0
[1,28]<stdout>:[Rank 28] (after 1 iterations) memory (MB) | allocated: 17754.10107421875 | max allocated: 17754.1328125 | reserved: 20046.0 | max reserved: 20046.0
[1,31]<stdout>:[Rank 31] (after 1 iterations) memory (MB) | allocated: 17754.85107421875 | max allocated: 17754.8828125 | reserved: 20046.0 | max reserved: 20046.0
[1,30]<stdout>:[Rank 30] (after 1 iterations) memory (MB) | allocated: 17756.10107421875 | max allocated: 17756.1328125 | reserved: 19816.0 | max reserved: 19816.0
[1,10]<stdout>:[Rank 10] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 23172.6171875 | reserved: 26502.0 | max reserved: 26502.0
[1,9]<stdout>:[Rank 9] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 23172.1171875 | reserved: 26502.0 | max reserved: 26502.0
[1,11]<stdout>:[Rank 11] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 23173.0546875 | reserved: 26502.0 | max reserved: 26502.0
[1,7]<stdout>:[Rank 7] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 24764.32666015625 | reserved: 28132.0 | max reserved: 28132.0
[1,2]<stdout>:[Rank 2] (after 1 iterations) memory (MB) | allocated: 21890.62744140625 | max allocated: 29690.6123046875 | reserved: 34856.0 | max reserved: 34856.0
[1,12]<stdout>:[Rank 12] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 21582.10302734375 | reserved: 24870.0 | max reserved: 24870.0
[1,25]<stdout>:[Rank 25] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 16809.357421875 | reserved: 20022.0 | max reserved: 20022.0
[1,4]<stdout>:[Rank 4] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 24764.07666015625 | reserved: 28134.0 | max reserved: 28134.0
[1,19]<stdout>:[Rank 19] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 19992.2373046875 | reserved: 23258.0 | max reserved: 23258.0
[1,16]<stdout>:[Rank 16] (after 1 iterations) memory (MB) | allocated: 16569.4326171875 | max allocated: 19991.51904296875 | reserved: 23258.0 | max reserved: 23258.0
[1,17]<stdout>:[Rank 17] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 19991.6123046875 | reserved: 23142.0 | max reserved: 23142.0
[1,18]<stdout>:[Rank 18] (after 1 iterations) memory (MB) | allocated: 16569.43212890625 | max allocated: 19991.5810546875 | reserved: 23142.0 | max reserved: 23142.0
[1,20]<stdout>:[Rank 20] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 18400.25439453125 | reserved: 21646.0 | max reserved: 21646.0
[1,23]<stdout>:[Rank 23] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 18399.75439453125 | reserved: 21646.0 | max reserved: 21646.0
[1,22]<stdout>:[Rank 22] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 18399.62939453125 | reserved: 21646.0 | max reserved: 21646.0
[1,21]<stdout>:[Rank 21] (after 1 iterations) memory (MB) | allocated: 16569.52978515625 | max allocated: 18400.12939453125 | reserved: 21646.0 | max reserved: 21646.0
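[annotation] The per-rank spread above mirrors the shard summary: ranks 0-3 allocate about 21890 MB versus about 16570 MB on most other ranks, a gap of roughly 5321 MB, close to what the extra parameters on the most loaded shard predict under the same assumed 18-bytes-per-parameter accounting. A rough check:

    extra_params = (2.5057 - 2.1942) * 1e9   # most loaded vs. other shards
    print(extra_params * 18 / 2**20)         # ~5347 MB vs. ~5321 MB observed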
[1,31]<stdout>: [2024-09-28 17:17:18] iteration        2/     100 | consumed samples:          128 | elapsed time per iteration (ms): 39808.6 | throughput per GPU (TFLOP/s/GPU): 45.8 | learning rate: 2.999320E-05 | global batch size:    64 | lm loss: 1.206543E+01 | loss scale: 1.0 | grad norm: 29.727 | number of skipped iterations:   0 | number of nan iterations:   0 |
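[annotation] Iteration 1 took 357.5 s against about 39.5 to 39.8 s for every later step: the first step absorbs the one-time torch.compile tracing and AUTOTUNE passes logged earlier, which is also why its throughput reads 5.1 TFLOP/s/GPU versus roughly 46 afterwards. A quick breakdown, assuming the 32 ranks seen in this log are one GPU each:

    warmup_ms, steady_ms = 357535.0, 39808.6   # iterations 1 and 2, from the log
    gbs, gpus = 64, 32
    print((warmup_ms - steady_ms) / 1e3)       # ~318 s of one-time compile cost
    print(gbs / (steady_ms / 1e3))             # ~1.61 samples/s for the whole job
    print(gbs / (steady_ms / 1e3) / gpus)      # ~0.05 samples/s per GPU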
[1,31]<stdout>: [2024-09-28 17:17:58] iteration        3/     100 | consumed samples:          192 | elapsed time per iteration (ms): 39574.8 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.997282E-05 | global batch size:    64 | lm loss: 1.069553E+01 | loss scale: 1.0 | grad norm: 261.951 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:18:38] iteration        4/     100 | consumed samples:          256 | elapsed time per iteration (ms): 39625.8 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.993887E-05 | global batch size:    64 | lm loss: 1.103311E+01 | loss scale: 1.0 | grad norm: 18.421 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:19:17] iteration        5/     100 | consumed samples:          320 | elapsed time per iteration (ms): 39592.1 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.989139E-05 | global batch size:    64 | lm loss: 1.566007E+01 | loss scale: 1.0 | grad norm: 151.662 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:19:57] iteration        6/     100 | consumed samples:          384 | elapsed time per iteration (ms): 39573.2 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.983042E-05 | global batch size:    64 | lm loss: 1.319092E+01 | loss scale: 1.0 | grad norm: 47.626 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:20:36] iteration        7/     100 | consumed samples:          448 | elapsed time per iteration (ms): 39535.6 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.975604E-05 | global batch size:    64 | lm loss: 1.252701E+01 | loss scale: 1.0 | grad norm: 10.791 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:21:16] iteration        8/     100 | consumed samples:          512 | elapsed time per iteration (ms): 39780.8 | throughput per GPU (TFLOP/s/GPU): 45.8 | learning rate: 2.966830E-05 | global batch size:    64 | lm loss: 1.230831E+01 | loss scale: 1.0 | grad norm: 3.800 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:21:56] iteration        9/     100 | consumed samples:          576 | elapsed time per iteration (ms): 39597.3 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.956731E-05 | global batch size:    64 | lm loss: 1.122603E+01 | loss scale: 1.0 | grad norm: 2.744 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:22:35] iteration       10/     100 | consumed samples:          640 | elapsed time per iteration (ms): 39674.7 | throughput per GPU (TFLOP/s/GPU): 45.9 | learning rate: 2.945316E-05 | global batch size:    64 | lm loss: 1.062877E+01 | loss scale: 1.0 | grad norm: 16.466 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:23:15] iteration       11/     100 | consumed samples:          704 | elapsed time per iteration (ms): 39649.7 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.932596E-05 | global batch size:    64 | lm loss: 1.030651E+01 | loss scale: 1.0 | grad norm: 13.611 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:23:55] iteration       12/     100 | consumed samples:          768 | elapsed time per iteration (ms): 39566.2 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.918585E-05 | global batch size:    64 | lm loss: 1.008296E+01 | loss scale: 1.0 | grad norm: 1.557 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:24:34] iteration       13/     100 | consumed samples:          832 | elapsed time per iteration (ms): 39626.7 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.903297E-05 | global batch size:    64 | lm loss: 9.951680E+00 | loss scale: 1.0 | grad norm: 2.166 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:25:14] iteration       14/     100 | consumed samples:          896 | elapsed time per iteration (ms): 39557.3 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.886746E-05 | global batch size:    64 | lm loss: 9.911741E+00 | loss scale: 1.0 | grad norm: 0.892 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:25:53] iteration       15/     100 | consumed samples:          960 | elapsed time per iteration (ms): 39587.4 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.868951E-05 | global batch size:    64 | lm loss: 9.708189E+00 | loss scale: 1.0 | grad norm: 1.484 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:26:33] iteration       16/     100 | consumed samples:         1024 | elapsed time per iteration (ms): 39539.5 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.849928E-05 | global batch size:    64 | lm loss: 9.610130E+00 | loss scale: 1.0 | grad norm: 0.635 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:27:13] iteration       17/     100 | consumed samples:         1088 | elapsed time per iteration (ms): 39556.8 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.829697E-05 | global batch size:    64 | lm loss: 9.517210E+00 | loss scale: 1.0 | grad norm: 0.615 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:27:52] iteration       18/     100 | consumed samples:         1152 | elapsed time per iteration (ms): 39482.5 | throughput per GPU (TFLOP/s/GPU): 46.2 | learning rate: 2.808278E-05 | global batch size:    64 | lm loss: 9.492043E+00 | loss scale: 1.0 | grad norm: 0.544 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:28:32] iteration       19/     100 | consumed samples:         1216 | elapsed time per iteration (ms): 39568.7 | throughput per GPU (TFLOP/s/GPU): 46.0 | learning rate: 2.785692E-05 | global batch size:    64 | lm loss: 9.486827E+00 | loss scale: 1.0 | grad norm: 1.113 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:29:11] iteration       20/     100 | consumed samples:         1280 | elapsed time per iteration (ms): 39528.8 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.761963E-05 | global batch size:    64 | lm loss: 9.444993E+00 | loss scale: 1.0 | grad norm: 1.076 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:29:51] iteration       21/     100 | consumed samples:         1344 | elapsed time per iteration (ms): 39522.9 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.737115E-05 | global batch size:    64 | lm loss: 9.371952E+00 | loss scale: 1.0 | grad norm: 0.430 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:30:30] iteration       22/     100 | consumed samples:         1408 | elapsed time per iteration (ms): 39518.8 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.711172E-05 | global batch size:    64 | lm loss: 9.420528E+00 | loss scale: 1.0 | grad norm: 7.430 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:31:10] iteration       23/     100 | consumed samples:         1472 | elapsed time per iteration (ms): 39554.6 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.684160E-05 | global batch size:    64 | lm loss: 9.177224E+00 | loss scale: 1.0 | grad norm: 0.652 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:31:49] iteration       24/     100 | consumed samples:         1536 | elapsed time per iteration (ms): 39503.5 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.656107E-05 | global batch size:    64 | lm loss: 9.407356E+00 | loss scale: 1.0 | grad norm: 0.578 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:32:29] iteration       25/     100 | consumed samples:         1600 | elapsed time per iteration (ms): 39528.4 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.627041E-05 | global batch size:    64 | lm loss: 9.249225E+00 | loss scale: 1.0 | grad norm: 0.390 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:33:08] iteration       26/     100 | consumed samples:         1664 | elapsed time per iteration (ms): 39506.8 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.596991E-05 | global batch size:    64 | lm loss: 9.312948E+00 | loss scale: 1.0 | grad norm: 0.439 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:33:48] iteration       27/     100 | consumed samples:         1728 | elapsed time per iteration (ms): 39518.2 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.565988E-05 | global batch size:    64 | lm loss: 9.285652E+00 | loss scale: 1.0 | grad norm: 0.402 | number of skipped iterations:   0 | number of nan iterations:   0 |
[1,31]<stdout>: [2024-09-28 17:34:27] iteration       28/     100 | consumed samples:         1792 | elapsed time per iteration (ms): 39500.3 | throughput per GPU (TFLOP/s/GPU): 46.1 | learning rate: 2.534062E-05 | global batch size:    64 | lm loss: 9.194493E+00 | loss scale: 1.0 | grad norm: 0.371 | number of skipped iterations:   0 | number of nan iterations:   0 |