Commit 6eabbacb authored by lim

Remove .log files from repository

parent 5814e156
Pipeline #3397 failed with stages in 0 seconds
Successfully preprocessed all matching files.
/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/inference/unified_memory.py:83: UserWarning: Failed to create unified memory mempool.
warnings.warn("Failed to create unified memory mempool.")
[WARNING | megatron.core.utils]: fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
using world size: 8, data-parallel size: 2, context-parallel size: 1, hierarchical context-parallel sizes: None, tensor-model-parallel size: 4, pipeline-model-parallel size: 1
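The parallel sizes in the line above are not independent: the data-parallel size is what remains of the world size after dividing out the model-parallel degrees. A minimal sketch (not Megatron-LM's actual argument-validation code, just the arithmetic it reports) showing how data-parallel size 2 follows from world size 8 with tensor-model-parallel 4, pipeline-model-parallel 1, and context-parallel 1:

```python
def data_parallel_size(world_size: int,
                       tensor_mp: int,
                       pipeline_mp: int,
                       context_p: int = 1) -> int:
    """Data-parallel replicas = world_size / (TP * PP * CP).

    Each model replica spans tensor_mp * pipeline_mp * context_p ranks,
    so the world size must be divisible by that product.
    """
    model_parallel = tensor_mp * pipeline_mp * context_p
    assert world_size % model_parallel == 0, (
        "world size must be divisible by TP * PP * CP"
    )
    return world_size // model_parallel


# Values from this run: world 8, TP 4, PP 1, CP 1.
print(data_parallel_size(8, tensor_mp=4, pipeline_mp=1, context_p=1))  # → 2
```

With micro_batch_size 1 and global_batch_size 256 (from the arguments dump below), those 2 data-parallel replicas imply 256 / (1 × 2) = 128 gradient-accumulation steps per iteration.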
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:HuggingFaceTokenizer
Number of virtual stages per pipeline stage: None
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
account_for_embedding_in_pipeline_split ......... False
account_for_loss_in_pipeline_split .............. False
accumulate_allreduce_grads_in_fp32 .............. True
activation_func_clamp_value ..................... None
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... True
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
align_grad_reduce ............................... True
align_param_gather .............................. False
app_tag_run_name ................................ None
app_tag_run_version ............................. 0.0.0
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_save ...................................... None
async_tensor_model_parallel_allreduce ........... True
attention_backend ............................... AttnBackend.auto
attention_dropout ............................... 0.0
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
cache_mla_latents ............................... False
calc_ft_timeouts ................................ False
calculate_per_token_loss ........................ False
check_for_large_grads ........................... False
check_for_nan_in_loss_and_grad .................. True
check_for_spiky_loss ............................ False
check_weight_hash_across_dp_replicas_interval ... None
ckpt_assume_constant_structure .................. False
ckpt_convert_format ............................. None
ckpt_convert_save ............................... None
ckpt_convert_update_legacy_dist_opt_format ...... False
ckpt_format ..................................... torch
ckpt_fully_parallel_load ........................ False
ckpt_fully_parallel_save ........................ True
ckpt_fully_parallel_save_deprecated ............. False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
collect_log_path ................................ ./logs
comm_time_log_iter .............................. None
config_logger_dir ...............................
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 1
cp_comm_type .................................... ['p2p']
create_attention_mask_in_dataloader ............. True
cross_entropy_fusion_impl ....................... native
cross_entropy_loss_fusion ....................... False
cuda_graph_impl ................................. none
cuda_graph_scope ................................ full
cuda_graph_warmup_steps ......................... 3
data_args_path .................................. None
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_sharding_strategy ................. no_shard
data_parallel_size .............................. 2
data_path ....................................... ['/workspace/data/oscar/oscar-1GB_head-qwen_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_average_in_collective ....................... True
ddp_bucket_size ................................. None
ddp_num_buckets ................................. None
ddp_pad_buckets_for_high_nccl_busbw ............. False
decode_only_cuda_graphs ......................... False
decoder_first_pipeline_num_layers ............... None
decoder_last_pipeline_num_layers ................ None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
decrease_batch_size_if_needed ................... False
defer_embedding_wgrad_compute ................... False
delay_wgrad_compute ............................. False
deprecated_use_mcore_models ..................... True
deterministic_mode .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_backward_fusion ......................... False
disable_bf16_reduced_precision_matmul ........... False
disable_chunked_prefill ......................... False
disable_mamba_mem_eff_path ...................... False
disable_straggler_on_startup .................... False
disable_symmetric_registration .................. False
disable_vision_class_token ...................... False
dist_ckpt_format_deprecated ..................... None
dist_ckpt_optim_fully_reshardable ............... False
dist_ckpt_save_pre_mcore_014 .................... False
dist_ckpt_strictness ............................ assume_ok_unexpected
dist_url ........................................ tcp://localhost:25900
distrib_optim_fully_reshardable_mem_efficient ... False
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
distributed_timeout_seconds_after_init .......... None
embedding_init_method_std ....................... None
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_bw_flux_gemmrs_op ........................ True
enable_cuda_graph ............................... False
enable_dynamic_grad_comp ........................ False
enable_experimental ............................. False
enable_ft_package ............................... False
enable_full_sharding_in_hsdp .................... False
enable_gloo_process_groups ...................... True
enable_msc ...................................... True
enable_one_logger ............................... True
enable_vocab_parallel ........................... False
encoder_num_layers .............................. 28
encoder_seq_length .............................. 4096
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
error_injection_rate ............................ 0
error_injection_type ............................ transient_error
eval_interval ................................... 1000
eval_iters ...................................... 5
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
exp_avg_dtype ................................... torch.float32
exp_avg_sq_dtype ................................ torch.float32
expert_model_parallel_size ...................... 1
expert_tensor_parallel_size ..................... 4
export_force_local_attention .................... False
export_kd_cfg ................................... None
export_kd_teacher_ckpt_format ................... None
export_kd_teacher_load .......................... None
export_kv_cache_quant ........................... False
export_legacy_megatron .......................... False
export_model_type ............................... GPTModel
export_moe_apply_probs_on_input ................. False
export_offline_model ............................ False
export_qk_l2_norm ............................... False
export_quant_cfg ................................ None
export_real_quant_cfg ........................... None
export_te_mcore_model ........................... False
external_cuda_graph ............................. False
extra_vocab_size ................................ 0
ffn_hidden_size ................................. 5504
fine_grained_activation_offloading .............. False
finetune ........................................ False
finetune_data_split ............................. train
finetune_hf_dataset ............................. None
first_last_layers_bf16 .......................... False
flash_decode .................................... False
flux_transpose_weight ........................... False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp4 ............................................. None
fp4_param ....................................... False
fp4_recipe ...................................... nvfp4
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_param_gather ................................ False
fp8_recipe ...................................... delayed
fp8_wgrad ....................................... True
freeze_LM ....................................... False
freeze_ViT ...................................... False
fsdp_double_buffer .............................. False
full_validation ................................. False
global_batch_size ............................... 256
glu_linear_offset ............................... 0.0
grad_comp ....................................... False
grad_comp_warm_up ............................... 0.1
grad_reduce_in_bf16 ............................. False
gradient_accumulation_fusion .................... True
gradient_reduce_div_fusion ...................... True
gradient_sample_ratio ........................... 1.0
group_query_attention ........................... True
grpo_clamp_eps_lower ............................ 0.01
grpo_clamp_eps_upper ............................ 0.01
grpo_default_temperature ........................ 1.0
grpo_default_top_p .............................. 0
grpo_entropy_term_weight ........................ 0.0
grpo_filter_groups_with_same_reward ............. False
grpo_group_size ................................. 2
grpo_iterations ................................. 2
grpo_kl_beta .................................... 0.001
grpo_prompts_per_step ........................... 32
head_lr_mult .................................... 1.0
heterogeneous_layers_config_encoded_json ........ None
heterogeneous_layers_config_path ................ None
hidden_dropout .................................. 0.0
hidden_size ..................................... 1024
hierarchical_context_parallel_sizes ............. None
high_priority_stream_groups ..................... []
hybrid_attention_ratio .......................... 0.0
hybrid_mlp_ratio ................................ 0.0
hybrid_override_pattern ......................... None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... -1
inference_dynamic_batching ...................... False
inference_dynamic_batching_block_size ........... 256
inference_dynamic_batching_buffer_guaranteed_fraction 0.2
inference_dynamic_batching_buffer_overflow_factor None
inference_dynamic_batching_buffer_size_gb ....... 40.0
inference_dynamic_batching_max_requests_override None
inference_dynamic_batching_max_tokens_override .. None
inference_dynamic_batching_num_cuda_graphs ...... 16
inference_dynamic_batching_track_paused_request_events False
inference_dynamic_batching_unified_memory_level . 0
inference_max_batch_size ........................ 8
inference_max_seq_length ........................ 2560
inference_rng_tracker ........................... False
init_method_std ................................. 0.006
init_method_xavier_uniform ...................... False
init_model_with_meta_device ..................... False
initial_loss_scale .............................. 4294967296
inprocess_active_world_size ..................... 1
inprocess_barrier_timeout ....................... 120
inprocess_completion_timeout .................... 120
inprocess_empty_cuda_cache ...................... False
inprocess_granularity ........................... node
inprocess_hard_timeout .......................... 90
inprocess_heartbeat_interval .................... 30
inprocess_heartbeat_timeout ..................... 60
inprocess_last_call_wait ........................ 1
inprocess_max_iterations ........................ None
inprocess_monitor_process_interval .............. 1.0
inprocess_monitor_thread_interval ............... 1.0
inprocess_progress_watchdog_interval ............ 1.0
inprocess_restart ............................... False
inprocess_soft_timeout .......................... 60
inprocess_termination_grace_time ................ 1
is_hybrid_model ................................. False
iter_per_epoch .................................. 1250
iteration_sample_ratio .......................... 0.01
iterations_to_skip .............................. []
keep_fp8_transpose_cache ........................ False
kitchen_config_file ............................. None
kitchen_recipe_number ........................... None
kv_channels ..................................... 64
kv_lora_rank .................................... 32
langrl_env_config ............................... None
langrl_external_server .......................... False
langrl_inference_server_conversation_template ... None
langrl_inference_server_type .................... inplace_megatron
lazy_mpu_init ................................... None
legacy_tokenizer ................................ False
load ............................................ /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1
load_main_params_from_ckpt ...................... None
local_rank ...................................... 0
log_energy ...................................... False
log_interval .................................... 1
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. True
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
logging_level ................................... None
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 3e-05
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 1
lr_warmup_samples ............................... 0
lr_wsd_decay_iters .............................. None
lr_wsd_decay_samples ............................ None
lr_wsd_decay_style .............................. exponential
main_grads_dtype ................................ torch.float32
main_params_dtype ............................... torch.float32
make_vocab_size_divisible_by .................... 128
mamba_head_dim .................................. 64
mamba_num_groups ................................ 8
mamba_num_heads ................................. None
mamba_state_dim ................................. 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_padding_length .............................. None
max_position_embeddings ......................... 32768
max_tokens_to_oom ............................... 12000
memory_snapshot_path ............................ snapshot.pickle
merge_file ...................................... None
micro_batch_size ................................ 1
microbatch_group_size_per_vp_stage .............. None
mid_level_dataset_surplus ....................... 0.005
min_loss_scale .................................. 1.0
min_lr .......................................... 3e-06
min_offloaded_tensor_size ....................... 1048576
mlp_chunks_for_prefill .......................... 1
mmap_bin_files .................................. True
mock_data ....................................... False
modelopt_enabled ................................ False
moe_apply_probs_on_input ........................ False
moe_aux_loss_coeff .............................. 0.0
moe_deepep_num_sms .............................. 20
moe_enable_deepep ............................... False
moe_expert_capacity_factor ...................... None
moe_extended_tp ................................. False
moe_ffn_hidden_size ............................. None
moe_grouped_gemm ................................ False
moe_input_jitter_eps ............................ None
moe_layer_freq .................................. 1
moe_layer_recompute ............................. False
moe_pad_expert_input_to_capacity ................ False
moe_pad_experts_for_cuda_graph_inference ........ False
moe_per_layer_logging ........................... False
moe_permute_fusion .............................. False
moe_router_bias_update_rate ..................... 0.001
moe_router_dtype ................................ None
moe_router_enable_expert_bias ................... False
moe_router_force_load_balancing ................. False
moe_router_fusion ............................... False
moe_router_group_topk ........................... None
moe_router_load_balancing_type .................. aux_loss
moe_router_num_groups ........................... None
moe_router_padding_for_fp8 ...................... False
moe_router_pre_softmax .......................... False
moe_router_score_function ....................... softmax
moe_router_topk ................................. 2
moe_router_topk_scaling_factor .................. None
moe_shared_expert_intermediate_size ............. None
moe_shared_expert_overlap ....................... False
moe_token_dispatcher_type ....................... allgather
moe_token_drop_policy ........................... probs
moe_upcycling_granularity ....................... 1
moe_use_legacy_grouped_gemm ..................... False
moe_use_upcycling ............................... False
moe_z_loss_coeff ................................ None
mrope_section ................................... None
mscale .......................................... 1.0
mscale_all_dim .................................. 0.0
mtp_loss_scaling_factor ......................... 0.1
mtp_num_layers .................................. None
multi_latent_attention .......................... False
multiple_validation_sets ........................ False
nccl_all_reduce_for_prefill ..................... False
nccl_communicator_config_path ................... None
nccl_ub ......................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_rope_freq .................................... None
no_save_optim ................................... None
no_save_rng ..................................... None
non_persistent_ckpt_type ........................ None
non_persistent_global_ckpt_dir .................. None
non_persistent_local_ckpt_algo .................. fully_parallel
non_persistent_local_ckpt_dir ................... None
non_persistent_save_interval .................... None
norm_epsilon .................................... 1e-05
normalization ................................... RMSNorm
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_dataset_builder_threads ..................... 1
num_distributed_optimizer_instances ............. 1
num_experts ..................................... None
num_layers ...................................... 28
num_layers_at_end_in_bf16 ....................... 1
num_layers_at_start_in_bf16 ..................... 1
num_layers_per_virtual_pipeline_stage ........... None
num_layers_to_build ............................. None
num_query_groups ................................ 8
num_virtual_stages_per_pipeline_rank ............ None
num_workers ..................................... 2
object_storage_cache_path ....................... None
offload_modules ................................. None
one_logger_async ................................ False
one_logger_project .............................. megatron-lm
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
optimizer_cpu_offload ........................... False
optimizer_offload_fraction ...................... 1.0
output_bert_embeddings .......................... False
overlap_cpu_optimizer_d2h_h2d ................... False
overlap_ep_comm_with_split_attn ................. False
overlap_grad_reduce ............................. True
overlap_moe_expert_parallel_comm ................ False
overlap_p2p_comm ................................ False
overlap_p2p_comm_warmup_flush ................... False
overlap_param_gather ............................ False
overlap_param_gather_with_optimizer_step ........ False
override_opt_param_scheduler .................... False
padded_vocab_size ............................... None
parallel_linear_impl ............................ None
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
patch_size ...................................... 14
per_split_data_args_path ........................ None
perform_initialization .......................... True
perform_rl_step ................................. False
pin_cpu_grads ................................... True
pin_cpu_params .................................. True
pipe_sp_splits .................................. 1
pipe_sp_strategy ................................ average
pipeline_model_parallel_comm_backend ............ None
pipeline_model_parallel_layout .................. None
pipeline_model_parallel_size .................... 1
position_embedding_type ......................... rope
pretrained_checkpoint ........................... None
profile ......................................... False
profile_dir ..................................... ./
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
q_lora_rank ..................................... None
qk_head_dim ..................................... 128
qk_l2_norm ...................................... False
qk_layernorm .................................... False
qk_pos_emb_head_dim ............................. 64
quant_comm_bits ................................. 8
quant_group_size ................................ None
quant_scale_dtype ............................... bf16
query_in_block_prob ............................. 0.1
quick_geglu ..................................... False
rampup_batch_size ............................... None
rank ............................................ 0
rank_adjust_window_size ......................... 1000
recompute_activation_function ................... False
recompute_activation_function_num_layers ........ None
recompute_granularity ........................... None
recompute_method ................................ None
recompute_modules ............................... None
recompute_num_layers ............................ None
record_memory_history ........................... False
reduce_recompute_for_last_chunk ................. False
relative_attention_max_distance ................. 128
relative_attention_num_buckets .................. 32
replication ..................................... False
replication_factor .............................. 2
replication_jump ................................ None
reproduce ....................................... False
rerun_mode ...................................... validate_results
reset_attention_mask ............................ False
reset_position_ids .............................. False
result_rejected_tracker_filename ................ None
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
reuse_fp32_param ................................ False
reuse_grad_buf_for_mxfp8_param_ag ............... False
rl_calculate_intra_group_similarity ............. False
rl_importance_sampling_truncation_coef .......... None
rl_inference_logprobs_is_correction ............. False
rl_offload_kv_cache_during_training ............. False
rl_offload_optimizer_during_inference ........... False
rl_partial_rollouts ............................. False
rl_prompts_per_eval ............................. 32
rl_remove_kv_cache_during_training .............. False
rl_reset_cuda_graphs ............................ False
rl_sequence_packing_algo ........................ fifo
rl_sequence_packing_bin_size .................... 8192
rl_use_sequence_packing ......................... False
rope_scaling_factor ............................. 8.0
rope_type ....................................... None
rotary_base ..................................... 1000000
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_scaling_factor ........................... 1.0
rotary_seq_len_interpolation_factor ............. None
run_workload_inspector_server ................... False
sample_rate ..................................... 1.0
save ............................................ /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1
save_flux_gather_input .......................... False
save_interval ................................... 5
save_retain_interval ............................ None
scatter_gather_tensors_in_pipeline .............. True
schedule_method ................................. vanilla
schedule_timer_end .............................. 20
schedule_timer_start ............................ 10
seed ............................................ 1234
seq_length ...................................... 4096
sequence_parallel ............................... True
sft ............................................. False
sft_tokenizer_prompt_format ..................... nemotron-h-aligned
sgd_momentum .................................... 0.9
sharp_enabled_group ............................. None
short_seq_prob .................................. 0.1
skip_train ...................................... False
skipped_train_samples ........................... 0
softmax_type .................................... vanilla
spatial_merge_size .............................. 2
spec ............................................ None
specify_layers .................................. None
split ........................................... 949,50,1
squared_relu .................................... False
start_weight_decay .............................. 0.1
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
strict_fsdp_dtensor_load ........................ True
suggested_communication_unit_size ............... None
swap_attention .................................. False
swap_modules .................................... self_attention
swiglu .......................................... True
swin_backbone_type .............................. tiny
symmetric_ar_type ............................... None
te_rng_tracker .................................. False
teacher_model_config ............................ None
temporal_patch_size ............................. 2
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1/tensorboard
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
tiktoken_num_special_tokens ..................... 1000
tiktoken_pattern ................................ None
tiktoken_special_tokens ......................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_metadata .............................. None
tokenizer_model ................................. /home/models/qwen3/Qwen3-0.6B
tokenizer_type .................................. HuggingFaceTokenizer
torch_fsdp2_reshard_after_forward ............... True
tp_comm_bootstrap_backend ....................... nccl
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... 50
train_samples ................................... None
train_sync_interval ............................. None
transformer_impl ................................ transformer_engine
transformer_pipeline_model_parallel_size ........ 1
trust_remote_code ............................... False
untie_embeddings_and_output_weights ............. True
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_ckpt_memory_cache ........................... False
use_cpu_initialization .......................... None
use_dist_ckpt ................................... False
use_dist_ckpt_deprecated ........................ False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_fused_weighted_squared_relu ................. False
use_hip_profiler ................................ False
use_legacy_models ............................... False
use_megatron_fsdp ............................... False
use_mp_args_from_checkpoint_args ................ False
use_one_sent_docs ............................... False
use_optimizer_feature ........................... False
use_persistent_ckpt_worker ...................... False
use_precision_aware_optimizer ................... False
use_pytorch_profiler ............................ False
use_qk_norm ..................................... True
use_quantize_comm ............................... False
use_ring_exchange_p2p ........................... False
use_rope_scaling ................................ False
use_rotary_position_embeddings .................. False
use_sharp ....................................... False
use_te_activation_func .......................... False
use_tokenizer_model_from_checkpoint_args ........ True
use_torch_fsdp2 ................................. False
use_torch_optimizer_for_cpu_offload ............. False
use_tp_pp_dp_mapping ............................ False
v_head_dim ...................................... 128
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_entity ....................................
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
wgrad_deferral_limit ............................ 0
window_attn_skip_freq ........................... None
window_size ..................................... None
world_size ...................................... 8
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
> building HuggingFaceTokenizer tokenizer ...
[WARNING | megatron.core.rerun_state_machine]: RerunStateMachine initialized in mode validate_results
> padded vocab (size: 151669) with 395 dummy tokens (new size: 152064)
> initializing torch distributed ...
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later?), no TensorBoard logs will be written.
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0211 17:48:16.207494 216834 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
W0211 17:48:16.211829 216828 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
W0211 17:48:16.313028 216826 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
W0211 17:48:16.313149 216833 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
tp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
dp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
ep_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
etp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
edp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
cp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
tp-cp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
embd-pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
pos_embd-pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
tp-ep_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
tp-dp-cp_group: [[0, 1, 2, 3, 4, 5, 6, 7]]
tp-pp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
dp-cp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
> initialized tensor model parallel with size 4
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/datasets'
W0211 17:48:16.380734 216832 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
W0211 17:48:16.382851 216831 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
W0211 17:48:16.387229 216830 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.039 seconds
> compiling and loading fused kernels ...
W0211 17:48:16.423564 216829 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
>>> done with compiling and loading fused kernels. Compilation time: 1.197 seconds
time to initialize megatron (seconds): 5.946
[after megatron is initialized] datetime: 2026-02-11 17:48:18
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (2, 0): 218293248
[rank 0] GPTModel(
  (embedding): LanguageModelEmbedding(
    (word_embeddings): VocabParallelEmbedding()
    (embedding_dropout): Dropout(p=0.0, inplace=False)
  )
  (rotary_pos_emb): RotaryEmbedding()
  (decoder): TransformerBlock(
    (layers): ModuleList(
      (0-27): 28 x TransformerLayer(
        (input_layernorm): IdentityOp()
        (self_attention): SelfAttention(
          (core_attention): TEDotProductAttention(
            (flash_attention): FlashAttention()
            (fused_attention): FusedAttention()
            (unfused_attention): UnfusedDotProductAttention(
              (scale_mask_softmax): FusedScaleMaskSoftmax()
              (attention_dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (linear_proj): TERowParallelLinear(in_features=256, out_features=1024, bias=False, TP=4)
          (linear_qkv): TELayerNormColumnParallelLinear(in_features=1024, out_features=512, bias=False, TP=4)
          (q_layernorm): IdentityOp()
          (k_layernorm): IdentityOp()
        )
        (pre_cross_attn_layernorm): IdentityOp()
        (cross_attention): IdentityOp()
        (cross_attn_bda): IdentityFuncOp()
        (pre_mlp_layernorm): IdentityOp()
        (mlp): MLP(
          (linear_fc1): TELayerNormColumnParallelLinear(in_features=1024, out_features=2752, bias=False, TP=4)
          (linear_fc2): TERowParallelLinear(in_features=1376, out_features=1024, bias=False, TP=4)
        )
      )
    )
    (final_layernorm): RMSNorm()
  )
  (output_layer): ColumnParallelLinear(in_features=1024, out_features=152064, bias=False, TP=4)
)
> number of parameters on (tensor, pipeline) model parallel rank (1, 0): 218293248
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 218293248
> number of parameters on (tensor, pipeline) model parallel rank (3, 0): 218293248
WARNING: could not find the metadata file /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
(min, max) time across ranks (ms):
load-checkpoint ................................: (1.06, 1.07)
[after model, optimizer, and learning rate scheduler are built] datetime: 2026-02-11 17:48:19
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 12800
validation: 1280
test: 1280
> building train, validation, and test datasets for GPT ...
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2026-02-11 17:48:19
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (265.26, 274.11)
train/valid/test-data-iterators-setup ..........: (249.63, 344.36)
training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2026-02-11 17:48:19
[WARNING | megatron.core.rerun_state_machine]: Result validation enabled
[2026-02-11 17:48:51] iteration 1/ 50 | consumed samples: 256 | elapsed time per iteration (ms): 32016.2 | throughput per GPU (TFLOP/s/GPU): 20.5 | learning rate: 3.000000E-05 | global batch size: 256 | lm loss: 1.194528E+01 | loss scale: 1.0 | grad norm: 17.556 | number of skipped iterations: 0 | number of nan iterations: 0 |
Number of parameters in transformer block in billions: 0.56
Number of parameters in embedding layers in billions: 0.31
Total number of parameters in billions: 0.87
Number of parameters in most loaded shard in billions: 0.2183
compute_activation_memory_without_sp
Activation memory footprint per transformer layer (precise, without SP): 64.0 MB
Theoretical memory footprints: weight and optimizer=2497.83 MB, activation=2206.08 MB, total=4703.92 MB
[Rank 3] (after 1 iterations) memory (MB) | allocated: 2720.359375 | max allocated: 4294.29541015625 | reserved: 5464.0 | max reserved: 5464.0
[Rank 2] (after 1 iterations) memory (MB) | allocated: 2720.359375 | max allocated: 4290.59130859375 | reserved: 5464.0 | max reserved: 5464.0
[Rank 1] (after 1 iterations) memory (MB) | allocated: 2720.359375 | max allocated: 4294.671875 | reserved: 5464.0 | max reserved: 5464.0
[Rank 0] (after 1 iterations) memory (MB) | allocated: 2720.359375 | max allocated: 4280.4609375 | reserved: 4870.0 | max reserved: 4870.0
[2026-02-11 17:49:15] iteration 2/ 50 | consumed samples: 512 | elapsed time per iteration (ms): 23547.6 | throughput per GPU (TFLOP/s/GPU): 27.9 | learning rate: 2.997226E-05 | global batch size: 256 | lm loss: 1.194559E+01 | loss scale: 1.0 | grad norm: 17.100 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:49:40] iteration 3/ 50 | consumed samples: 768 | elapsed time per iteration (ms): 25037.7 | throughput per GPU (TFLOP/s/GPU): 26.2 | learning rate: 2.988916E-05 | global batch size: 256 | lm loss: 1.173730E+01 | loss scale: 1.0 | grad norm: 38.599 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:50:04] iteration 4/ 50 | consumed samples: 1024 | elapsed time per iteration (ms): 24486.3 | throughput per GPU (TFLOP/s/GPU): 26.8 | learning rate: 2.975105E-05 | global batch size: 256 | lm loss: 1.157678E+01 | loss scale: 1.0 | grad norm: 3.598 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:50:28] iteration 5/ 50 | consumed samples: 1280 | elapsed time per iteration (ms): 24209.4 | throughput per GPU (TFLOP/s/GPU): 27.1 | learning rate: 2.955848E-05 | global batch size: 256 | lm loss: 1.147159E+01 | loss scale: 1.0 | grad norm: 2.897 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 5 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 5 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6053.01, 6053.04)
[2026-02-11 17:50:58] iteration 6/ 50 | consumed samples: 1536 | elapsed time per iteration (ms): 23787.3 | throughput per GPU (TFLOP/s/GPU): 27.6 | learning rate: 2.931225E-05 | global batch size: 256 | lm loss: 1.180910E+01 | loss scale: 1.0 | grad norm: 220.934 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:51:22] iteration 7/ 50 | consumed samples: 1792 | elapsed time per iteration (ms): 23771.1 | throughput per GPU (TFLOP/s/GPU): 27.6 | learning rate: 2.901338E-05 | global batch size: 256 | lm loss: 1.140885E+01 | loss scale: 1.0 | grad norm: 2.688 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:51:46] iteration 8/ 50 | consumed samples: 2048 | elapsed time per iteration (ms): 23571.3 | throughput per GPU (TFLOP/s/GPU): 27.8 | learning rate: 2.866308E-05 | global batch size: 256 | lm loss: 1.137786E+01 | loss scale: 1.0 | grad norm: 2.564 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:52:09] iteration 9/ 50 | consumed samples: 2304 | elapsed time per iteration (ms): 23910.3 | throughput per GPU (TFLOP/s/GPU): 27.5 | learning rate: 2.826280E-05 | global batch size: 256 | lm loss: 1.133599E+01 | loss scale: 1.0 | grad norm: 2.567 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:52:35] iteration 10/ 50 | consumed samples: 2560 | elapsed time per iteration (ms): 25138.6 | throughput per GPU (TFLOP/s/GPU): 26.1 | learning rate: 2.781419E-05 | global batch size: 256 | lm loss: 1.130382E+01 | loss scale: 1.0 | grad norm: 2.557 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 10 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 10 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6240.67, 6240.69)
[2026-02-11 17:53:06] iteration 11/ 50 | consumed samples: 2816 | elapsed time per iteration (ms): 24910.8 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 2.731908E-05 | global batch size: 256 | lm loss: 1.127207E+01 | loss scale: 1.0 | grad norm: 2.545 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:53:30] iteration 12/ 50 | consumed samples: 3072 | elapsed time per iteration (ms): 24302.7 | throughput per GPU (TFLOP/s/GPU): 27.0 | learning rate: 2.677952E-05 | global batch size: 256 | lm loss: 1.123584E+01 | loss scale: 1.0 | grad norm: 2.534 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:53:54] iteration 13/ 50 | consumed samples: 3328 | elapsed time per iteration (ms): 23970.4 | throughput per GPU (TFLOP/s/GPU): 27.4 | learning rate: 2.619772E-05 | global batch size: 256 | lm loss: 1.120071E+01 | loss scale: 1.0 | grad norm: 2.547 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:54:18] iteration 14/ 50 | consumed samples: 3584 | elapsed time per iteration (ms): 23715.7 | throughput per GPU (TFLOP/s/GPU): 27.7 | learning rate: 2.557606E-05 | global batch size: 256 | lm loss: 1.116887E+01 | loss scale: 1.0 | grad norm: 2.514 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:54:43] iteration 15/ 50 | consumed samples: 3840 | elapsed time per iteration (ms): 24919.0 | throughput per GPU (TFLOP/s/GPU): 26.3 | learning rate: 2.491711E-05 | global batch size: 256 | lm loss: 1.112481E+01 | loss scale: 1.0 | grad norm: 2.569 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 15 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 15 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6368.84, 6368.91)
[2026-02-11 17:55:15] iteration 16/ 50 | consumed samples: 4096 | elapsed time per iteration (ms): 26104.9 | throughput per GPU (TFLOP/s/GPU): 25.1 | learning rate: 2.422357E-05 | global batch size: 256 | lm loss: 1.109704E+01 | loss scale: 1.0 | grad norm: 2.518 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:55:41] iteration 17/ 50 | consumed samples: 4352 | elapsed time per iteration (ms): 25885.0 | throughput per GPU (TFLOP/s/GPU): 25.4 | learning rate: 2.349830E-05 | global batch size: 256 | lm loss: 1.106508E+01 | loss scale: 1.0 | grad norm: 3.518 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:56:06] iteration 18/ 50 | consumed samples: 4608 | elapsed time per iteration (ms): 25098.1 | throughput per GPU (TFLOP/s/GPU): 26.2 | learning rate: 2.274427E-05 | global batch size: 256 | lm loss: 1.101524E+01 | loss scale: 1.0 | grad norm: 2.577 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:56:31] iteration 19/ 50 | consumed samples: 4864 | elapsed time per iteration (ms): 24881.3 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 2.196458E-05 | global batch size: 256 | lm loss: 1.099613E+01 | loss scale: 1.0 | grad norm: 2.496 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:56:56] iteration 20/ 50 | consumed samples: 5120 | elapsed time per iteration (ms): 24919.2 | throughput per GPU (TFLOP/s/GPU): 26.3 | learning rate: 2.116243E-05 | global batch size: 256 | lm loss: 1.095716E+01 | loss scale: 1.0 | grad norm: 2.519 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 20 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 20 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6076.06, 6076.09)
[2026-02-11 17:57:27] iteration 21/ 50 | consumed samples: 5376 | elapsed time per iteration (ms): 25205.7 | throughput per GPU (TFLOP/s/GPU): 26.0 | learning rate: 2.034112E-05 | global batch size: 256 | lm loss: 1.092020E+01 | loss scale: 1.0 | grad norm: 2.551 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:57:52] iteration 22/ 50 | consumed samples: 5632 | elapsed time per iteration (ms): 25014.1 | throughput per GPU (TFLOP/s/GPU): 26.2 | learning rate: 1.950403E-05 | global batch size: 256 | lm loss: 1.089746E+01 | loss scale: 1.0 | grad norm: 2.506 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:58:17] iteration 23/ 50 | consumed samples: 5888 | elapsed time per iteration (ms): 24387.6 | throughput per GPU (TFLOP/s/GPU): 26.9 | learning rate: 1.865460E-05 | global batch size: 256 | lm loss: 1.085281E+01 | loss scale: 1.0 | grad norm: 2.566 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:58:41] iteration 24/ 50 | consumed samples: 6144 | elapsed time per iteration (ms): 24708.5 | throughput per GPU (TFLOP/s/GPU): 26.6 | learning rate: 1.779631E-05 | global batch size: 256 | lm loss: 1.082638E+01 | loss scale: 1.0 | grad norm: 2.531 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:59:06] iteration 25/ 50 | consumed samples: 6400 | elapsed time per iteration (ms): 25163.9 | throughput per GPU (TFLOP/s/GPU): 26.1 | learning rate: 1.693270E-05 | global batch size: 256 | lm loss: 1.079843E+01 | loss scale: 1.0 | grad norm: 2.545 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 25 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 25 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6257.54, 6257.71)
[2026-02-11 17:59:36] iteration 26/ 50 | consumed samples: 6656 | elapsed time per iteration (ms): 23507.4 | throughput per GPU (TFLOP/s/GPU): 27.9 | learning rate: 1.606730E-05 | global batch size: 256 | lm loss: 1.076447E+01 | loss scale: 1.0 | grad norm: 2.546 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:00:00] iteration 27/ 50 | consumed samples: 6912 | elapsed time per iteration (ms): 23555.7 | throughput per GPU (TFLOP/s/GPU): 27.9 | learning rate: 1.520369E-05 | global batch size: 256 | lm loss: 1.073713E+01 | loss scale: 1.0 | grad norm: 2.670 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:00:23] iteration 28/ 50 | consumed samples: 7168 | elapsed time per iteration (ms): 23440.2 | throughput per GPU (TFLOP/s/GPU): 28.0 | learning rate: 1.434540E-05 | global batch size: 256 | lm loss: 1.070240E+01 | loss scale: 1.0 | grad norm: 2.560 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:00:47] iteration 29/ 50 | consumed samples: 7424 | elapsed time per iteration (ms): 23776.8 | throughput per GPU (TFLOP/s/GPU): 27.6 | learning rate: 1.349597E-05 | global batch size: 256 | lm loss: 1.067331E+01 | loss scale: 1.0 | grad norm: 2.558 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:01:11] iteration 30/ 50 | consumed samples: 7680 | elapsed time per iteration (ms): 24240.5 | throughput per GPU (TFLOP/s/GPU): 27.1 | learning rate: 1.265888E-05 | global batch size: 256 | lm loss: 1.064471E+01 | loss scale: 1.0 | grad norm: 2.593 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 30 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 30 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6359.60, 6359.77)
[2026-02-11 18:01:42] iteration 31/ 50 | consumed samples: 7936 | elapsed time per iteration (ms): 23973.2 | throughput per GPU (TFLOP/s/GPU): 27.4 | learning rate: 1.183757E-05 | global batch size: 256 | lm loss: 1.061864E+01 | loss scale: 1.0 | grad norm: 2.945 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:02:06] iteration 32/ 50 | consumed samples: 8192 | elapsed time per iteration (ms): 24138.4 | throughput per GPU (TFLOP/s/GPU): 27.2 | learning rate: 1.103542E-05 | global batch size: 256 | lm loss: 1.060612E+01 | loss scale: 1.0 | grad norm: 2.685 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:02:30] iteration 33/ 50 | consumed samples: 8448 | elapsed time per iteration (ms): 24111.5 | throughput per GPU (TFLOP/s/GPU): 27.2 | learning rate: 1.025573E-05 | global batch size: 256 | lm loss: 1.058644E+01 | loss scale: 1.0 | grad norm: 17.452 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:02:54] iteration 34/ 50 | consumed samples: 8704 | elapsed time per iteration (ms): 24509.7 | throughput per GPU (TFLOP/s/GPU): 26.8 | learning rate: 9.501700E-06 | global batch size: 256 | lm loss: 1.056656E+01 | loss scale: 1.0 | grad norm: 3.815 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:03:19] iteration 35/ 50 | consumed samples: 8960 | elapsed time per iteration (ms): 24686.1 | throughput per GPU (TFLOP/s/GPU): 26.6 | learning rate: 8.776425E-06 | global batch size: 256 | lm loss: 1.054440E+01 | loss scale: 1.0 | grad norm: 2.482 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 35 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 35 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6290.02, 6290.09)
[2026-02-11 18:03:50] iteration 36/ 50 | consumed samples: 9216 | elapsed time per iteration (ms): 24716.9 | throughput per GPU (TFLOP/s/GPU): 26.6 | learning rate: 8.082888E-06 | global batch size: 256 | lm loss: 1.052087E+01 | loss scale: 1.0 | grad norm: 3.101 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:04:14] iteration 37/ 50 | consumed samples: 9472 | elapsed time per iteration (ms): 24331.1 | throughput per GPU (TFLOP/s/GPU): 27.0 | learning rate: 7.423938E-06 | global batch size: 256 | lm loss: 1.050959E+01 | loss scale: 1.0 | grad norm: 2.672 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:04:39] iteration 38/ 50 | consumed samples: 9728 | elapsed time per iteration (ms): 24831.3 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 6.802284E-06 | global batch size: 256 | lm loss: 1.048346E+01 | loss scale: 1.0 | grad norm: 2.642 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:05:04] iteration 39/ 50 | consumed samples: 9984 | elapsed time per iteration (ms): 24760.1 | throughput per GPU (TFLOP/s/GPU): 26.5 | learning rate: 6.220479E-06 | global batch size: 256 | lm loss: 1.048476E+01 | loss scale: 1.0 | grad norm: 2.795 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:05:29] iteration 40/ 50 | consumed samples: 10240 | elapsed time per iteration (ms): 25262.7 | throughput per GPU (TFLOP/s/GPU): 26.0 | learning rate: 5.680916E-06 | global batch size: 256 | lm loss: 1.046774E+01 | loss scale: 1.0 | grad norm: 2.509 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 40 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 40 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6104.48, 6104.56)
[2026-02-11 18:06:00] iteration 41/ 50 | consumed samples: 10496 | elapsed time per iteration (ms): 25132.2 | throughput per GPU (TFLOP/s/GPU): 26.1 | learning rate: 5.185811E-06 | global batch size: 256 | lm loss: 1.046618E+01 | loss scale: 1.0 | grad norm: 2.557 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:06:25] iteration 42/ 50 | consumed samples: 10752 | elapsed time per iteration (ms): 24748.1 | throughput per GPU (TFLOP/s/GPU): 26.5 | learning rate: 4.737197E-06 | global batch size: 256 | lm loss: 1.045832E+01 | loss scale: 1.0 | grad norm: 2.603 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:06:50] iteration 43/ 50 | consumed samples: 11008 | elapsed time per iteration (ms): 24830.1 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 4.336920E-06 | global batch size: 256 | lm loss: 1.044374E+01 | loss scale: 1.0 | grad norm: 2.531 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:07:15] iteration 44/ 50 | consumed samples: 11264 | elapsed time per iteration (ms): 24816.0 | throughput per GPU (TFLOP/s/GPU): 26.5 | learning rate: 3.986624E-06 | global batch size: 256 | lm loss: 1.043142E+01 | loss scale: 1.0 | grad norm: 2.509 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:07:40] iteration 45/ 50 | consumed samples: 11520 | elapsed time per iteration (ms): 24840.8 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 3.687747E-06 | global batch size: 256 | lm loss: 1.042535E+01 | loss scale: 1.0 | grad norm: 2.623 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 45 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 45 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6001.84, 6001.97)
[2026-02-11 18:08:11] iteration 46/ 50 | consumed samples: 11776 | elapsed time per iteration (ms): 25069.9 | throughput per GPU (TFLOP/s/GPU): 26.2 | learning rate: 3.441519E-06 | global batch size: 256 | lm loss: 1.043089E+01 | loss scale: 1.0 | grad norm: 2.498 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:08:36] iteration 47/ 50 | consumed samples: 12032 | elapsed time per iteration (ms): 25030.7 | throughput per GPU (TFLOP/s/GPU): 26.2 | learning rate: 3.248951E-06 | global batch size: 256 | lm loss: 1.041368E+01 | loss scale: 1.0 | grad norm: 2.453 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:09:00] iteration 48/ 50 | consumed samples: 12288 | elapsed time per iteration (ms): 24655.4 | throughput per GPU (TFLOP/s/GPU): 26.6 | learning rate: 3.110835E-06 | global batch size: 256 | lm loss: 1.041769E+01 | loss scale: 1.0 | grad norm: 2.474 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:09:25] iteration 49/ 50 | consumed samples: 12544 | elapsed time per iteration (ms): 24934.6 | throughput per GPU (TFLOP/s/GPU): 26.3 | learning rate: 3.027737E-06 | global batch size: 256 | lm loss: 1.041792E+01 | loss scale: 1.0 | grad norm: 2.497 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:09:50] iteration 50/ 50 | consumed samples: 12800 | elapsed time per iteration (ms): 24820.8 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 3.000000E-06 | global batch size: 256 | lm loss: 1.039724E+01 | loss scale: 1.0 | grad norm: 2.511 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 50 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 50 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (5870.59, 5870.67)
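As a quick consistency check on the counters logged above (illustrative variable names; values taken from the log), the consumed-samples column should equal iteration × global batch size:

```python
# Sanity-check the sample counter reported in the log:
# at iteration 50 with a global batch size of 256,
# consumed samples should be 50 * 256 = 12800.
iteration = 50
global_batch_size = 256
consumed_samples = iteration * global_batch_size
assert consumed_samples == 12800  # matches the iteration-50 log line
```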
[after training is done] datetime: 2026-02-11 18:09:56
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode disabled
Evaluating on 1280 samples
Evaluating iter 1/5
Evaluating iter 2/5
Evaluating iter 3/5
Evaluating iter 4/5
Evaluating iter 5/5
(min, max) time across ranks (ms):
evaluate .......................................: (55213.15, 55215.66)
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
----------------------------------------------------------------------------------------------------------------
validation loss at iteration 50 on validation set | lm loss value: 1.044579E+01 | lm loss PPL: 3.439929E+04 |
----------------------------------------------------------------------------------------------------------------
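The reported perplexity is consistent with PPL = exp(lm loss); a minimal check using the logged validation values:

```python
import math

# lm loss and PPL as reported for the validation set at iteration 50
lm_loss = 1.044579e01
reported_ppl = 3.439929e04

# Perplexity is the exponential of the mean cross-entropy loss.
computed_ppl = math.exp(lm_loss)
assert abs(computed_ppl - reported_ppl) / reported_ppl < 1e-4
```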
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode disabled
Evaluating on 1280 samples
Evaluating iter 1/5
Evaluating iter 2/5
Evaluating iter 3/5
Evaluating iter 4/5
Evaluating iter 5/5
(min, max) time across ranks (ms):
evaluate .......................................: (54313.53, 54316.17)
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
----------------------------------------------------------------------------------------------------------
validation loss at iteration 50 on test set | lm loss value: 1.046412E+01 | lm loss PPL: 3.503556E+04 |
----------------------------------------------------------------------------------------------------------
W0211 18:11:46.632072 216828 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
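The ProcessGroupNCCL warning above recommends calling destroy_process_group() before exit. A minimal sketch of that shutdown pattern (single-process gloo group for illustration only; the actual run uses NCCL across 8 ranks launched by torchrun):

```python
import os
import torch.distributed as dist

def main():
    # Rendezvous settings for a single-process illustration;
    # a real launch would get these from torchrun.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29510")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    try:
        pass  # training / evaluation work goes here
    finally:
        # Tear down the process group explicitly, as the warning
        # recommends, so pending collectives finish before the
        # interpreter shuts down.
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
```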
--tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --context-parallel-size 2 --use-distributed-optimizer --sequence-parallel
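These flags determine the world size printed at startup: the tensor-, pipeline-, context-, and data-parallel sizes must multiply to the total rank count. A quick check with the values from the command line (data-parallel size of 1 is taken from the startup line below, not from the flags):

```python
tensor_parallel = 4    # --tensor-model-parallel-size
pipeline_parallel = 1  # --pipeline-model-parallel-size
context_parallel = 2   # --context-parallel-size
data_parallel = 1      # reported at startup: world size 8 / (4 * 1 * 2)

world_size = tensor_parallel * pipeline_parallel * context_parallel * data_parallel
assert world_size == 8  # matches "using world size: 8" below
```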
Successfully preprocessed all matching files.
/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/inference/unified_memory.py:83: UserWarning: Failed to create unified memory mempool.
warnings.warn("Failed to create unified memory mempool.")
[WARNING | megatron.core.utils]: fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
using world size: 8, data-parallel size: 1, context-parallel size: 2, hierarchical context-parallel sizes: None, tensor-model-parallel size: 4, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:HuggingFaceTokenizer
Number of virtual stages per pipeline stage: None
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
account_for_embedding_in_pipeline_split ......... False
account_for_loss_in_pipeline_split .............. False
accumulate_allreduce_grads_in_fp32 .............. True
activation_func_clamp_value ..................... None
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... True
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
align_grad_reduce ............................... True
align_param_gather .............................. False
app_tag_run_name ................................ None
app_tag_run_version ............................. 0.0.0
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_save ...................................... None
async_tensor_model_parallel_allreduce ........... True
attention_backend ............................... AttnBackend.auto
attention_dropout ............................... 0.0
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
cache_mla_latents ............................... False
calc_ft_timeouts ................................ False
calculate_per_token_loss ........................ False
check_for_large_grads ........................... False
check_for_nan_in_loss_and_grad .................. True
check_for_spiky_loss ............................ False
check_weight_hash_across_dp_replicas_interval ... None
ckpt_assume_constant_structure .................. False
ckpt_convert_format ............................. None
ckpt_convert_save ............................... None
ckpt_convert_update_legacy_dist_opt_format ...... False
ckpt_format ..................................... torch
ckpt_fully_parallel_load ........................ False
ckpt_fully_parallel_save ........................ True
ckpt_fully_parallel_save_deprecated ............. False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
collect_log_path ................................ ./logs
comm_time_log_iter .............................. None
config_logger_dir ...............................
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 2
cp_comm_type .................................... ['p2p']
create_attention_mask_in_dataloader ............. True
cross_entropy_fusion_impl ....................... native
cross_entropy_loss_fusion ....................... False
cuda_graph_impl ................................. none
cuda_graph_scope ................................ full
cuda_graph_warmup_steps ......................... 3
data_args_path .................................. None
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_sharding_strategy ................. no_shard
data_parallel_size .............................. 1
data_path ....................................... ['/workspace/data/oscar/oscar-1GB_head-qwen_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_average_in_collective ....................... True
ddp_bucket_size ................................. None
ddp_num_buckets ................................. None
ddp_pad_buckets_for_high_nccl_busbw ............. False
decode_only_cuda_graphs ......................... False
decoder_first_pipeline_num_layers ............... None
decoder_last_pipeline_num_layers ................ None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
decrease_batch_size_if_needed ................... False
defer_embedding_wgrad_compute ................... False
delay_wgrad_compute ............................. False
deprecated_use_mcore_models ..................... True
deterministic_mode .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_backward_fusion ......................... False
disable_bf16_reduced_precision_matmul ........... False
disable_chunked_prefill ......................... False
disable_mamba_mem_eff_path ...................... False
disable_straggler_on_startup .................... False
disable_symmetric_registration .................. False
disable_vision_class_token ...................... False
dist_ckpt_format_deprecated ..................... None
dist_ckpt_optim_fully_reshardable ............... False
dist_ckpt_save_pre_mcore_014 .................... False
dist_ckpt_strictness ............................ assume_ok_unexpected
dist_url ........................................ tcp://localhost:25900
distrib_optim_fully_reshardable_mem_efficient ... False
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
distributed_timeout_seconds_after_init .......... None
embedding_init_method_std ....................... None
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_bw_flux_gemmrs_op ........................ True
enable_cuda_graph ............................... False
enable_dynamic_grad_comp ........................ False
enable_experimental ............................. False
enable_ft_package ............................... False
enable_full_sharding_in_hsdp .................... False
enable_gloo_process_groups ...................... True
enable_msc ...................................... True
enable_one_logger ............................... True
enable_vocab_parallel ........................... False
encoder_num_layers .............................. 36
encoder_seq_length .............................. 8192
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
error_injection_rate ............................ 0
error_injection_type ............................ transient_error
eval_interval ................................... 1000
eval_iters ...................................... 5
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
exp_avg_dtype ................................... torch.float32
exp_avg_sq_dtype ................................ torch.float32
expert_model_parallel_size ...................... 1
expert_tensor_parallel_size ..................... 4
export_force_local_attention .................... False
export_kd_cfg ................................... None
export_kd_teacher_ckpt_format ................... None
export_kd_teacher_load .......................... None
export_kv_cache_quant ........................... False
export_legacy_megatron .......................... False
export_model_type ............................... GPTModel
export_moe_apply_probs_on_input ................. False
export_offline_model ............................ False
export_qk_l2_norm ............................... False
export_quant_cfg ................................ None
export_real_quant_cfg ........................... None
export_te_mcore_model ........................... False
external_cuda_graph ............................. False
extra_vocab_size ................................ 0
ffn_hidden_size ................................. 12288
fine_grained_activation_offloading .............. False
finetune ........................................ False
finetune_data_split ............................. train
finetune_hf_dataset ............................. None
first_last_layers_bf16 .......................... False
flash_decode .................................... False
flux_transpose_weight ........................... False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp4 ............................................. None
fp4_param ....................................... False
fp4_recipe ...................................... nvfp4
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_param_gather ................................ False
fp8_recipe ...................................... delayed
fp8_wgrad ....................................... True
freeze_LM ....................................... False
freeze_ViT ...................................... False
fsdp_double_buffer .............................. False
full_validation ................................. False
global_batch_size ............................... 32
glu_linear_offset ............................... 0.0
grad_comp ....................................... False
grad_comp_warm_up ............................... 0.1
grad_reduce_in_bf16 ............................. False
gradient_accumulation_fusion .................... True
gradient_reduce_div_fusion ...................... True
gradient_sample_ratio ........................... 1.0
group_query_attention ........................... True
grpo_clamp_eps_lower ............................ 0.01
grpo_clamp_eps_upper ............................ 0.01
grpo_default_temperature ........................ 1.0
grpo_default_top_p .............................. 0
grpo_entropy_term_weight ........................ 0.0
grpo_filter_groups_with_same_reward ............. False
grpo_group_size ................................. 2
grpo_iterations ................................. 2
grpo_kl_beta .................................... 0.001
grpo_prompts_per_step ........................... 32
head_lr_mult .................................... 1.0
heterogeneous_layers_config_encoded_json ........ None
heterogeneous_layers_config_path ................ None
hidden_dropout .................................. 0.0
hidden_size ..................................... 4096
hierarchical_context_parallel_sizes ............. None
high_priority_stream_groups ..................... []
hybrid_attention_ratio .......................... 0.0
hybrid_mlp_ratio ................................ 0.0
hybrid_override_pattern ......................... None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... -1
inference_dynamic_batching ...................... False
inference_dynamic_batching_block_size ........... 256
inference_dynamic_batching_buffer_guaranteed_fraction 0.2
inference_dynamic_batching_buffer_overflow_factor None
inference_dynamic_batching_buffer_size_gb ....... 40.0
inference_dynamic_batching_max_requests_override None
inference_dynamic_batching_max_tokens_override .. None
inference_dynamic_batching_num_cuda_graphs ...... 16
inference_dynamic_batching_track_paused_request_events False
inference_dynamic_batching_unified_memory_level . 0
inference_max_batch_size ........................ 8
inference_max_seq_length ........................ 2560
inference_rng_tracker ........................... False
init_method_std ................................. 0.006
init_method_xavier_uniform ...................... False
init_model_with_meta_device ..................... False
initial_loss_scale .............................. 4294967296
inprocess_active_world_size ..................... 1
inprocess_barrier_timeout ....................... 120
inprocess_completion_timeout .................... 120
inprocess_empty_cuda_cache ...................... False
inprocess_granularity ........................... node
inprocess_hard_timeout .......................... 90
inprocess_heartbeat_interval .................... 30
inprocess_heartbeat_timeout ..................... 60
inprocess_last_call_wait ........................ 1
inprocess_max_iterations ........................ None
inprocess_monitor_process_interval .............. 1.0
inprocess_monitor_thread_interval ............... 1.0
inprocess_progress_watchdog_interval ............ 1.0
inprocess_restart ............................... False
inprocess_soft_timeout .......................... 60
inprocess_termination_grace_time ................ 1
is_hybrid_model ................................. False
iter_per_epoch .................................. 1250
iteration_sample_ratio .......................... 0.01
iterations_to_skip .............................. []
keep_fp8_transpose_cache ........................ False
kitchen_config_file ............................. None
kitchen_recipe_number ........................... None
kv_channels ..................................... 128
kv_lora_rank .................................... 32
langrl_env_config ............................... None
langrl_external_server .......................... False
langrl_inference_server_conversation_template ... None
langrl_inference_server_type .................... inplace_megatron
lazy_mpu_init ................................... None
legacy_tokenizer ................................ False
load ............................................ /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints
load_main_params_from_ckpt ...................... None
local_rank ...................................... 0
log_energy ...................................... False
log_interval .................................... 1
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. True
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
logging_level ................................... None
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 3e-05
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 1
lr_warmup_samples ............................... 0
lr_wsd_decay_iters .............................. None
lr_wsd_decay_samples ............................ None
lr_wsd_decay_style .............................. exponential
main_grads_dtype ................................ torch.float32
main_params_dtype ............................... torch.float32
make_vocab_size_divisible_by .................... 128
mamba_head_dim .................................. 64
mamba_num_groups ................................ 8
mamba_num_heads ................................. None
mamba_state_dim ................................. 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_padding_length .............................. None
max_position_embeddings ......................... 40960
max_tokens_to_oom ............................... 12000
memory_snapshot_path ............................ snapshot.pickle
merge_file ...................................... None
micro_batch_size ................................ 1
microbatch_group_size_per_vp_stage .............. None
mid_level_dataset_surplus ....................... 0.005
min_loss_scale .................................. 1.0
min_lr .......................................... 3e-06
min_offloaded_tensor_size ....................... 1048576
mlp_chunks_for_prefill .......................... 1
mmap_bin_files .................................. True
mock_data ....................................... False
modelopt_enabled ................................ False
moe_apply_probs_on_input ........................ False
moe_aux_loss_coeff .............................. 0.0
moe_deepep_num_sms .............................. 20
moe_enable_deepep ............................... False
moe_expert_capacity_factor ...................... None
moe_extended_tp ................................. False
moe_ffn_hidden_size ............................. None
moe_grouped_gemm ................................ False
moe_input_jitter_eps ............................ None
moe_layer_freq .................................. 1
moe_layer_recompute ............................. False
moe_pad_expert_input_to_capacity ................ False
moe_pad_experts_for_cuda_graph_inference ........ False
moe_per_layer_logging ........................... False
moe_permute_fusion .............................. False
moe_router_bias_update_rate ..................... 0.001
moe_router_dtype ................................ None
moe_router_enable_expert_bias ................... False
moe_router_force_load_balancing ................. False
moe_router_fusion ............................... False
moe_router_group_topk ........................... None
moe_router_load_balancing_type .................. aux_loss
moe_router_num_groups ........................... None
moe_router_padding_for_fp8 ...................... False
moe_router_pre_softmax .......................... False
moe_router_score_function ....................... softmax
moe_router_topk ................................. 2
moe_router_topk_scaling_factor .................. None
moe_shared_expert_intermediate_size ............. None
moe_shared_expert_overlap ....................... False
moe_token_dispatcher_type ....................... allgather
moe_token_drop_policy ........................... probs
moe_upcycling_granularity ....................... 1
moe_use_legacy_grouped_gemm ..................... False
moe_use_upcycling ............................... False
moe_z_loss_coeff ................................ None
mrope_section ................................... None
mscale .......................................... 1.0
mscale_all_dim .................................. 0.0
mtp_loss_scaling_factor ......................... 0.1
mtp_num_layers .................................. None
multi_latent_attention .......................... False
multiple_validation_sets ........................ False
nccl_all_reduce_for_prefill ..................... False
nccl_communicator_config_path ................... None
nccl_ub ......................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_rope_freq .................................... None
no_save_optim ................................... None
no_save_rng ..................................... None
non_persistent_ckpt_type ........................ None
non_persistent_global_ckpt_dir .................. None
non_persistent_local_ckpt_algo .................. fully_parallel
non_persistent_local_ckpt_dir ................... None
non_persistent_save_interval .................... None
norm_epsilon .................................... 1e-05
normalization ................................... RMSNorm
num_attention_heads ............................. 32
num_channels .................................... 3
num_classes ..................................... 1000
num_dataset_builder_threads ..................... 1
num_distributed_optimizer_instances ............. 1
num_experts ..................................... None
num_layers ...................................... 36
num_layers_at_end_in_bf16 ....................... 1
num_layers_at_start_in_bf16 ..................... 1
num_layers_per_virtual_pipeline_stage ........... None
num_layers_to_build ............................. None
num_query_groups ................................ 8
num_virtual_stages_per_pipeline_rank ............ None
num_workers ..................................... 2
object_storage_cache_path ....................... None
offload_modules ................................. None
one_logger_async ................................ False
one_logger_project .............................. megatron-lm
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
optimizer_cpu_offload ........................... False
optimizer_offload_fraction ...................... 1.0
output_bert_embeddings .......................... False
overlap_cpu_optimizer_d2h_h2d ................... False
overlap_ep_comm_with_split_attn ................. False
overlap_grad_reduce ............................. True
overlap_moe_expert_parallel_comm ................ False
overlap_p2p_comm ................................ False
overlap_p2p_comm_warmup_flush ................... False
overlap_param_gather ............................ False
overlap_param_gather_with_optimizer_step ........ False
override_opt_param_scheduler .................... False
padded_vocab_size ............................... None
parallel_linear_impl ............................ None
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
patch_size ...................................... 14
per_split_data_args_path ........................ None
perform_initialization .......................... True
perform_rl_step ................................. False
pin_cpu_grads ................................... True
pin_cpu_params .................................. True
pipe_sp_splits .................................. 1
pipe_sp_strategy ................................ average
pipeline_model_parallel_comm_backend ............ None
pipeline_model_parallel_layout .................. None
pipeline_model_parallel_size .................... 1
position_embedding_type ......................... rope
pretrained_checkpoint ........................... None
profile ......................................... False
profile_dir ..................................... ./
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
q_lora_rank ..................................... None
qk_head_dim ..................................... 128
qk_l2_norm ...................................... False
qk_layernorm .................................... True
qk_pos_emb_head_dim ............................. 64
quant_comm_bits ................................. 8
quant_group_size ................................ None
quant_scale_dtype ............................... bf16
query_in_block_prob ............................. 0.1
quick_geglu ..................................... False
rampup_batch_size ............................... None
rank ............................................ 0
rank_adjust_window_size ......................... 1000
recompute_activation_function ................... False
recompute_activation_function_num_layers ........ None
recompute_granularity ........................... None
recompute_method ................................ None
recompute_modules ............................... None
recompute_num_layers ............................ None
record_memory_history ........................... False
reduce_recompute_for_last_chunk ................. False
relative_attention_max_distance ................. 128
relative_attention_num_buckets .................. 32
replication ..................................... False
replication_factor .............................. 2
replication_jump ................................ None
reproduce ....................................... False
rerun_mode ...................................... validate_results
reset_attention_mask ............................ False
reset_position_ids .............................. False
result_rejected_tracker_filename ................ None
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
reuse_fp32_param ................................ False
reuse_grad_buf_for_mxfp8_param_ag ............... False
rl_calculate_intra_group_similarity ............. False
rl_importance_sampling_truncation_coef .......... None
rl_inference_logprobs_is_correction ............. False
rl_offload_kv_cache_during_training ............. False
rl_offload_optimizer_during_inference ........... False
rl_partial_rollouts ............................. False
rl_prompts_per_eval ............................. 32
rl_remove_kv_cache_during_training .............. False
rl_reset_cuda_graphs ............................ False
rl_sequence_packing_algo ........................ fifo
rl_sequence_packing_bin_size .................... 8192
rl_use_sequence_packing ......................... False
rope_scaling_factor ............................. 8.0
rope_type ....................................... None
rotary_base ..................................... 1000000
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_scaling_factor ........................... 1.0
rotary_seq_len_interpolation_factor ............. None
run_workload_inspector_server ................... False
sample_rate ..................................... 1.0
save ............................................ /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints
save_flux_gather_input .......................... False
save_interval ................................... 1000
save_retain_interval ............................ None
scatter_gather_tensors_in_pipeline .............. True
schedule_method ................................. vanilla
schedule_timer_end .............................. 20
schedule_timer_start ............................ 10
seed ............................................ 1234
seq_length ...................................... 8192
sequence_parallel ............................... True
sft ............................................. False
sft_tokenizer_prompt_format ..................... nemotron-h-aligned
sgd_momentum .................................... 0.9
sharp_enabled_group ............................. None
short_seq_prob .................................. 0.1
skip_train ...................................... False
skipped_train_samples ........................... 0
softmax_type .................................... vanilla
spatial_merge_size .............................. 2
spec ............................................ None
specify_layers .................................. None
split ........................................... 949,50,1
squared_relu .................................... False
start_weight_decay .............................. 0.1
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
strict_fsdp_dtensor_load ........................ True
suggested_communication_unit_size ............... None
swap_attention .................................. False
swap_modules .................................... self_attention
swiglu .......................................... True
swin_backbone_type .............................. tiny
symmetric_ar_type ............................... None
te_rng_tracker .................................. False
teacher_model_config ............................ None
temporal_patch_size ............................. 2
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints/tensorboard
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
tiktoken_num_special_tokens ..................... 1000
tiktoken_pattern ................................ None
tiktoken_special_tokens ......................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_metadata .............................. None
tokenizer_model ................................. /home/models/qwen3/Qwen3-8B
tokenizer_type .................................. HuggingFaceTokenizer
torch_fsdp2_reshard_after_forward ............... True
tp_comm_bootstrap_backend ....................... nccl
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... 50
train_samples ................................... None
train_sync_interval ............................. None
transformer_impl ................................ transformer_engine
transformer_pipeline_model_parallel_size ........ 1
trust_remote_code ............................... False
untie_embeddings_and_output_weights ............. True
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_ckpt_memory_cache ........................... False
use_cpu_initialization .......................... None
use_dist_ckpt ................................... False
use_dist_ckpt_deprecated ........................ False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_fused_weighted_squared_relu ................. False
use_hip_profiler ................................ False
use_legacy_models ............................... False
use_megatron_fsdp ............................... False
use_mp_args_from_checkpoint_args ................ False
use_one_sent_docs ............................... False
use_optimizer_feature ........................... False
use_persistent_ckpt_worker ...................... False
use_precision_aware_optimizer ................... False
use_pytorch_profiler ............................ False
use_qk_norm ..................................... False
use_quantize_comm ............................... False
use_ring_exchange_p2p ........................... False
use_rope_scaling ................................ False
use_rotary_position_embeddings .................. False
use_sharp ....................................... False
use_te_activation_func .......................... False
use_tokenizer_model_from_checkpoint_args ........ True
use_torch_fsdp2 ................................. False
use_torch_optimizer_for_cpu_offload ............. False
use_tp_pp_dp_mapping ............................ False
v_head_dim ...................................... 128
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_entity ....................................
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
wgrad_deferral_limit ............................ 0
window_attn_skip_freq ........................... None
window_size ..................................... None
world_size ...................................... 8
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
> building HuggingFaceTokenizer tokenizer ...
[WARNING | megatron.core.rerun_state_machine]: RerunStateMachine initialized in mode validate_results
> padded vocab (size: 151669) with 395 dummy tokens (new size: 152064)
> initializing torch distributed ...
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later?), no TensorBoard logs will be written.
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0211 16:50:02.731345 186405 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W0211 16:50:02.731526 186402 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W0211 16:50:03.134291 186407 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W0211 16:50:03.135790 186395 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
tp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
dp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
ep_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
etp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
edp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
cp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
tp-cp_group: [[0, 1, 2, 3, 4, 5, 6, 7]]
embd-pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
pos_embd-pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
tp-ep_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
tp-dp-cp_group: [[0, 1, 2, 3, 4, 5, 6, 7]]
tp-pp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
dp-cp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
> initialized tensor model parallel with size 4
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/datasets'
W0211 16:50:03.230739 186403 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W0211 16:50:03.232905 186401 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W0211 16:50:03.237195 186406 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.040 seconds
> compiling and loading fused kernels ...
W0211 16:50:03.283905 186404 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
>>> done with compiling and loading fused kernels. Compilation time: 1.224 seconds
time to initialize megatron (seconds): 7.497
[after megatron is initialized] datetime: 2026-02-11 16:50:07
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (1, 0): 2048177152
> number of parameters on (tensor, pipeline) model parallel rank (2, 0): 2048177152
> number of parameters on (tensor, pipeline) model parallel rank (3, 0): 2048177152
[rank 0] GPTModel(
(embedding): LanguageModelEmbedding(
(word_embeddings): VocabParallelEmbedding()
(embedding_dropout): Dropout(p=0.0, inplace=False)
)
(rotary_pos_emb): RotaryEmbedding()
(decoder): TransformerBlock(
(layers): ModuleList(
(0-35): 36 x TransformerLayer(
(input_layernorm): IdentityOp()
(self_attention): SelfAttention(
(core_attention): TEDotProductAttention(
(flash_attention): FlashAttention()
(fused_attention): FusedAttention()
(unfused_attention): UnfusedDotProductAttention(
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
)
)
(linear_proj): TERowParallelLinear(in_features=1024, out_features=4096, bias=False, TP=4)
(linear_qkv): TELayerNormColumnParallelLinear(in_features=4096, out_features=1536, bias=False, TP=4)
(q_layernorm): RMSNorm()
(k_layernorm): RMSNorm()
)
(pre_cross_attn_layernorm): IdentityOp()
(cross_attention): IdentityOp()
(cross_attn_bda): IdentityFuncOp()
(pre_mlp_layernorm): IdentityOp()
(mlp): MLP(
(linear_fc1): TELayerNormColumnParallelLinear(in_features=4096, out_features=6144, bias=False, TP=4)
(linear_fc2): TERowParallelLinear(in_features=3072, out_features=4096, bias=False, TP=4)
)
)
)
(final_layernorm): RMSNorm()
)
(output_layer): ColumnParallelLinear(in_features=4096, out_features=152064, bias=False, TP=4)
)
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 2048177152
WARNING: could not find the metadata file /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
(min, max) time across ranks (ms):
load-checkpoint ................................: (0.95, 0.97)
[after model, optimizer, and learning rate scheduler are built] datetime: 2026-02-11 16:50:07
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 1600
validation: 160
test: 160
> building train, validation, and test datasets for GPT ...
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2026-02-11 16:50:08
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (349.92, 359.61)
train/valid/test-data-iterators-setup ..........: (256.17, 355.63)
training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2026-02-11 16:50:08
[WARNING | megatron.core.rerun_state_machine]: Result validation enabled
[2026-02-11 16:50:38] iteration 1/ 50 | consumed samples: 32 | elapsed time per iteration (ms): 30247.9 | throughput per GPU (TFLOP/s/GPU): 57.0 | learning rate: 3.000000E-05 | global batch size: 32 | lm loss: 1.202196E+01 | loss scale: 1.0 | grad norm: 47.059 | number of skipped iterations: 0 | number of nan iterations: 0 |
Number of parameters in transformer block in billions: 6.95
Number of parameters in embedding layers in billions: 1.25
Total number of parameters in billions: 8.19
Number of parameters in most loaded shard in billions: 2.0480
compute_activation_memory_without_sp
Activation memory footprint per transformer layer (precise, without SP): 512.0 MB
Theoretical memory footprints: weight and optimizer=35156.57 MB, activation=20078.17 MB, total=55234.73 MB
[Rank 7] (after 1 iterations) memory (MB) | allocated: 23996.30712890625 | max allocated: 23996.32275390625 | reserved: 29766.0 | max reserved: 29766.0
[Rank 5] (after 1 iterations) memory (MB) | allocated: 23996.30712890625 | max allocated: 23996.32275390625 | reserved: 29928.0 | max reserved: 29928.0
[Rank 6] (after 1 iterations) memory (MB) | allocated: 23996.30712890625 | max allocated: 23996.32275390625 | reserved: 29928.0 | max reserved: 29928.0
[Rank 1] (after 1 iterations) memory (MB) | allocated: 23995.0146484375 | max allocated: 23995.0302734375 | reserved: 27962.0 | max reserved: 27962.0
[Rank 4] (after 1 iterations) memory (MB) | allocated: 23996.30712890625 | max allocated: 23996.32275390625 | reserved: 29228.0 | max reserved: 29228.0
[Rank 3] (after 1 iterations) memory (MB) | allocated: 23995.0146484375 | max allocated: 23995.0302734375 | reserved: 28322.0 | max reserved: 28322.0
[Rank 2] (after 1 iterations) memory (MB) | allocated: 23995.0146484375 | max allocated: 23995.0302734375 | reserved: 29350.0 | max reserved: 29350.0
[Rank 0] (after 1 iterations) memory (MB) | allocated: 23995.0146484375 | max allocated: 23995.0302734375 | reserved: 24686.0 | max reserved: 24686.0
[2026-02-11 16:50:53] iteration 2/ 50 | consumed samples: 64 | elapsed time per iteration (ms): 15385.6 | throughput per GPU (TFLOP/s/GPU): 112.2 | learning rate: 2.997226E-05 | global batch size: 32 | lm loss: 1.202580E+01 | loss scale: 1.0 | grad norm: 48.137 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:51:09] iteration 3/ 50 | consumed samples: 96 | elapsed time per iteration (ms): 15321.7 | throughput per GPU (TFLOP/s/GPU): 112.6 | learning rate: 2.988916E-05 | global batch size: 32 | lm loss: 1.092907E+01 | loss scale: 1.0 | grad norm: 165.459 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:51:24] iteration 4/ 50 | consumed samples: 128 | elapsed time per iteration (ms): 15654.0 | throughput per GPU (TFLOP/s/GPU): 110.2 | learning rate: 2.975105E-05 | global batch size: 32 | lm loss: 1.045790E+01 | loss scale: 1.0 | grad norm: 39.117 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:51:40] iteration 5/ 50 | consumed samples: 160 | elapsed time per iteration (ms): 15348.8 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.955848E-05 | global batch size: 32 | lm loss: 1.024906E+01 | loss scale: 1.0 | grad norm: 4.208 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:51:55] iteration 6/ 50 | consumed samples: 192 | elapsed time per iteration (ms): 15363.3 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.931225E-05 | global batch size: 32 | lm loss: 1.000299E+01 | loss scale: 1.0 | grad norm: 3.643 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:52:10] iteration 7/ 50 | consumed samples: 224 | elapsed time per iteration (ms): 15363.5 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.901338E-05 | global batch size: 32 | lm loss: 1.113537E+01 | loss scale: 1.0 | grad norm: 1047.474 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:52:26] iteration 8/ 50 | consumed samples: 256 | elapsed time per iteration (ms): 15697.8 | throughput per GPU (TFLOP/s/GPU): 109.9 | learning rate: 2.866308E-05 | global batch size: 32 | lm loss: 9.984780E+00 | loss scale: 1.0 | grad norm: 4.541 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:52:41] iteration 9/ 50 | consumed samples: 288 | elapsed time per iteration (ms): 15353.3 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.826280E-05 | global batch size: 32 | lm loss: 9.881529E+00 | loss scale: 1.0 | grad norm: 3.893 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:52:57] iteration 10/ 50 | consumed samples: 320 | elapsed time per iteration (ms): 15353.3 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.781419E-05 | global batch size: 32 | lm loss: 9.621094E+00 | loss scale: 1.0 | grad norm: 3.681 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:53:12] iteration 11/ 50 | consumed samples: 352 | elapsed time per iteration (ms): 15357.6 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.731908E-05 | global batch size: 32 | lm loss: 9.587229E+00 | loss scale: 1.0 | grad norm: 4.020 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:53:28] iteration 12/ 50 | consumed samples: 384 | elapsed time per iteration (ms): 15708.9 | throughput per GPU (TFLOP/s/GPU): 109.8 | learning rate: 2.677952E-05 | global batch size: 32 | lm loss: 9.429702E+00 | loss scale: 1.0 | grad norm: 3.008 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:53:43] iteration 13/ 50 | consumed samples: 416 | elapsed time per iteration (ms): 15341.5 | throughput per GPU (TFLOP/s/GPU): 112.5 | learning rate: 2.619772E-05 | global batch size: 32 | lm loss: 9.362844E+00 | loss scale: 1.0 | grad norm: 3.049 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:53:58] iteration 14/ 50 | consumed samples: 448 | elapsed time per iteration (ms): 15353.9 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.557606E-05 | global batch size: 32 | lm loss: 9.235973E+00 | loss scale: 1.0 | grad norm: 3.024 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:54:14] iteration 15/ 50 | consumed samples: 480 | elapsed time per iteration (ms): 15368.8 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.491711E-05 | global batch size: 32 | lm loss: 9.124246E+00 | loss scale: 1.0 | grad norm: 3.094 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:54:29] iteration 16/ 50 | consumed samples: 512 | elapsed time per iteration (ms): 15690.1 | throughput per GPU (TFLOP/s/GPU): 110.0 | learning rate: 2.422357E-05 | global batch size: 32 | lm loss: 9.039256E+00 | loss scale: 1.0 | grad norm: 3.090 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:54:45] iteration 17/ 50 | consumed samples: 544 | elapsed time per iteration (ms): 15362.6 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.349830E-05 | global batch size: 32 | lm loss: 8.930184E+00 | loss scale: 1.0 | grad norm: 3.067 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:55:00] iteration 18/ 50 | consumed samples: 576 | elapsed time per iteration (ms): 15365.7 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.274427E-05 | global batch size: 32 | lm loss: 8.846102E+00 | loss scale: 1.0 | grad norm: 2.797 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:55:16] iteration 19/ 50 | consumed samples: 608 | elapsed time per iteration (ms): 15354.9 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.196458E-05 | global batch size: 32 | lm loss: 8.751925E+00 | loss scale: 1.0 | grad norm: 2.625 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:55:31] iteration 20/ 50 | consumed samples: 640 | elapsed time per iteration (ms): 15677.6 | throughput per GPU (TFLOP/s/GPU): 110.1 | learning rate: 2.116243E-05 | global batch size: 32 | lm loss: 8.664285E+00 | loss scale: 1.0 | grad norm: 2.482 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:55:47] iteration 21/ 50 | consumed samples: 672 | elapsed time per iteration (ms): 15370.2 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.034112E-05 | global batch size: 32 | lm loss: 8.609591E+00 | loss scale: 1.0 | grad norm: 2.400 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:56:02] iteration 22/ 50 | consumed samples: 704 | elapsed time per iteration (ms): 15353.5 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.950403E-05 | global batch size: 32 | lm loss: 8.478221E+00 | loss scale: 1.0 | grad norm: 2.279 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:56:17] iteration 23/ 50 | consumed samples: 736 | elapsed time per iteration (ms): 15359.4 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 1.865460E-05 | global batch size: 32 | lm loss: 8.495676E+00 | loss scale: 1.0 | grad norm: 2.119 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:56:33] iteration 24/ 50 | consumed samples: 768 | elapsed time per iteration (ms): 15682.9 | throughput per GPU (TFLOP/s/GPU): 110.0 | learning rate: 1.779631E-05 | global batch size: 32 | lm loss: 8.401316E+00 | loss scale: 1.0 | grad norm: 2.247 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:56:48] iteration 25/ 50 | consumed samples: 800 | elapsed time per iteration (ms): 15347.3 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.693270E-05 | global batch size: 32 | lm loss: 8.394979E+00 | loss scale: 1.0 | grad norm: 2.109 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:57:04] iteration 26/ 50 | consumed samples: 832 | elapsed time per iteration (ms): 15358.2 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.606730E-05 | global batch size: 32 | lm loss: 8.387753E+00 | loss scale: 1.0 | grad norm: 2.035 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:57:19] iteration 27/ 50 | consumed samples: 864 | elapsed time per iteration (ms): 15357.4 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.520369E-05 | global batch size: 32 | lm loss: 8.329927E+00 | loss scale: 1.0 | grad norm: 1.830 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:57:35] iteration 28/ 50 | consumed samples: 896 | elapsed time per iteration (ms): 15658.3 | throughput per GPU (TFLOP/s/GPU): 110.2 | learning rate: 1.434540E-05 | global batch size: 32 | lm loss: 8.217674E+00 | loss scale: 1.0 | grad norm: 1.822 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:57:50] iteration 29/ 50 | consumed samples: 928 | elapsed time per iteration (ms): 15348.2 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.349597E-05 | global batch size: 32 | lm loss: 8.206045E+00 | loss scale: 1.0 | grad norm: 1.715 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:58:05] iteration 30/ 50 | consumed samples: 960 | elapsed time per iteration (ms): 15348.4 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.265888E-05 | global batch size: 32 | lm loss: 8.208779E+00 | loss scale: 1.0 | grad norm: 1.603 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:58:21] iteration 31/ 50 | consumed samples: 992 | elapsed time per iteration (ms): 15433.1 | throughput per GPU (TFLOP/s/GPU): 111.8 | learning rate: 1.183757E-05 | global batch size: 32 | lm loss: 8.186785E+00 | loss scale: 1.0 | grad norm: 1.609 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:58:36] iteration 32/ 50 | consumed samples: 1024 | elapsed time per iteration (ms): 15588.5 | throughput per GPU (TFLOP/s/GPU): 110.7 | learning rate: 1.103542E-05 | global batch size: 32 | lm loss: 8.070101E+00 | loss scale: 1.0 | grad norm: 1.694 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:58:52] iteration 33/ 50 | consumed samples: 1056 | elapsed time per iteration (ms): 15335.9 | throughput per GPU (TFLOP/s/GPU): 112.5 | learning rate: 1.025573E-05 | global batch size: 32 | lm loss: 8.066827E+00 | loss scale: 1.0 | grad norm: 1.641 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:59:07] iteration 34/ 50 | consumed samples: 1088 | elapsed time per iteration (ms): 15352.4 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 9.501700E-06 | global batch size: 32 | lm loss: 8.050054E+00 | loss scale: 1.0 | grad norm: 1.604 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:59:23] iteration 35/ 50 | consumed samples: 1120 | elapsed time per iteration (ms): 15444.5 | throughput per GPU (TFLOP/s/GPU): 111.7 | learning rate: 8.776425E-06 | global batch size: 32 | lm loss: 8.065158E+00 | loss scale: 1.0 | grad norm: 1.521 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:59:38] iteration 36/ 50 | consumed samples: 1152 | elapsed time per iteration (ms): 15611.3 | throughput per GPU (TFLOP/s/GPU): 110.5 | learning rate: 8.082888E-06 | global batch size: 32 | lm loss: 7.998910E+00 | loss scale: 1.0 | grad norm: 1.496 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:59:54] iteration 37/ 50 | consumed samples: 1184 | elapsed time per iteration (ms): 15338.0 | throughput per GPU (TFLOP/s/GPU): 112.5 | learning rate: 7.423938E-06 | global batch size: 32 | lm loss: 7.993576E+00 | loss scale: 1.0 | grad norm: 1.429 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:00:09] iteration 38/ 50 | consumed samples: 1216 | elapsed time per iteration (ms): 15354.6 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 6.802284E-06 | global batch size: 32 | lm loss: 7.927972E+00 | loss scale: 1.0 | grad norm: 1.440 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:00:24] iteration 39/ 50 | consumed samples: 1248 | elapsed time per iteration (ms): 15551.1 | throughput per GPU (TFLOP/s/GPU): 111.0 | learning rate: 6.220479E-06 | global batch size: 32 | lm loss: 7.943327E+00 | loss scale: 1.0 | grad norm: 1.296 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:00:40] iteration 40/ 50 | consumed samples: 1280 | elapsed time per iteration (ms): 15499.9 | throughput per GPU (TFLOP/s/GPU): 111.3 | learning rate: 5.680916E-06 | global batch size: 32 | lm loss: 7.900488E+00 | loss scale: 1.0 | grad norm: 1.334 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:00:55] iteration 41/ 50 | consumed samples: 1312 | elapsed time per iteration (ms): 15316.7 | throughput per GPU (TFLOP/s/GPU): 112.7 | learning rate: 5.185811E-06 | global batch size: 32 | lm loss: 8.008162E+00 | loss scale: 1.0 | grad norm: 1.218 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:01:11] iteration 42/ 50 | consumed samples: 1344 | elapsed time per iteration (ms): 15322.9 | throughput per GPU (TFLOP/s/GPU): 112.6 | learning rate: 4.737197E-06 | global batch size: 32 | lm loss: 7.860763E+00 | loss scale: 1.0 | grad norm: 1.340 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:01:26] iteration 43/ 50 | consumed samples: 1376 | elapsed time per iteration (ms): 15501.7 | throughput per GPU (TFLOP/s/GPU): 111.3 | learning rate: 4.336920E-06 | global batch size: 32 | lm loss: 7.921451E+00 | loss scale: 1.0 | grad norm: 1.185 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:01:42] iteration 44/ 50 | consumed samples: 1408 | elapsed time per iteration (ms): 15476.8 | throughput per GPU (TFLOP/s/GPU): 111.5 | learning rate: 3.986624E-06 | global batch size: 32 | lm loss: 7.933675E+00 | loss scale: 1.0 | grad norm: 1.138 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:01:57] iteration 45/ 50 | consumed samples: 1440 | elapsed time per iteration (ms): 15304.0 | throughput per GPU (TFLOP/s/GPU): 112.8 | learning rate: 3.687747E-06 | global batch size: 32 | lm loss: 7.962870E+00 | loss scale: 1.0 | grad norm: 1.134 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:02:12] iteration 46/ 50 | consumed samples: 1472 | elapsed time per iteration (ms): 15307.2 | throughput per GPU (TFLOP/s/GPU): 112.7 | learning rate: 3.441519E-06 | global batch size: 32 | lm loss: 7.928866E+00 | loss scale: 1.0 | grad norm: 1.133 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:02:28] iteration 47/ 50 | consumed samples: 1504 | elapsed time per iteration (ms): 15483.8 | throughput per GPU (TFLOP/s/GPU): 111.4 | learning rate: 3.248951E-06 | global batch size: 32 | lm loss: 7.920525E+00 | loss scale: 1.0 | grad norm: 1.136 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:02:43] iteration 48/ 50 | consumed samples: 1536 | elapsed time per iteration (ms): 15473.4 | throughput per GPU (TFLOP/s/GPU): 111.5 | learning rate: 3.110835E-06 | global batch size: 32 | lm loss: 7.903946E+00 | loss scale: 1.0 | grad norm: 1.234 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:02:58] iteration 49/ 50 | consumed samples: 1568 | elapsed time per iteration (ms): 15313.9 | throughput per GPU (TFLOP/s/GPU): 112.7 | learning rate: 3.027737E-06 | global batch size: 32 | lm loss: 7.890891E+00 | loss scale: 1.0 | grad norm: 1.169 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:03:14] iteration 50/ 50 | consumed samples: 1600 | elapsed time per iteration (ms): 15323.9 | throughput per GPU (TFLOP/s/GPU): 112.6 | learning rate: 3.000000E-06 | global batch size: 32 | lm loss: 7.846929E+00 | loss scale: 1.0 | grad norm: 1.171 | number of skipped iterations: 0 | number of nan iterations: 0 |
[after training is done] datetime: 2026-02-11 17:03:14
saving checkpoint at iteration 50 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints in torch format
successfully saved checkpoint from iteration 50 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints [ t 1/4, p 1/1 ]
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode disabled
Evaluating on 160 samples
Evaluating iter 1/5
Evaluating iter 2/5
Evaluating iter 3/5
Evaluating iter 4/5
Evaluating iter 5/5
(min, max) time across ranks (ms):
evaluate .......................................: (27421.89, 27422.26)
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
----------------------------------------------------------------------------------------------------------------
validation loss at iteration 50 on validation set | lm loss value: 8.004234E+00 | lm loss PPL: 2.993607E+03 |
----------------------------------------------------------------------------------------------------------------
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode disabled
Evaluating on 160 samples
Evaluating iter 1/5
Evaluating iter 2/5
Evaluating iter 3/5
Evaluating iter 4/5
Evaluating iter 5/5
(min, max) time across ranks (ms):
evaluate .......................................: (26114.85, 26115.52)
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
----------------------------------------------------------------------------------------------------------
validation loss at iteration 50 on test set | lm loss value: 8.008901E+00 | lm loss PPL: 3.007609E+03 |
----------------------------------------------------------------------------------------------------------
W0211 17:04:59.713297 186405 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.717712 186403 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.731681 186395 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.748812 186401 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.750737 186407 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.753005 186402 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.835309 186404 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.837489 186406 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())