Commit 6eabbacb authored by lim

Remove .log files from repository

parent 5814e156
Pipeline #3397 failed with stages in 0 seconds
Successfully preprocessed all matching files.
/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/inference/unified_memory.py:83: UserWarning: Failed to create unified memory mempool.
warnings.warn("Failed to create unified memory mempool.")
[WARNING | megatron.core.utils]: fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
using world size: 8, data-parallel size: 2, context-parallel size: 1, hierarchical context-parallel sizes: None, tensor-model-parallel size: 4, pipeline-model-parallel size: 1
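The parallel sizes in the line above are not independent: the data-parallel size is what remains of the world size after dividing out the model-parallel degrees. A minimal sketch (not Megatron-LM's actual argument-validation code, just the arithmetic it reports) showing how data-parallel size 2 follows from world size 8 with tensor-model-parallel 4, pipeline-model-parallel 1, and context-parallel 1:

```python
def data_parallel_size(world_size: int,
                       tensor_mp: int,
                       pipeline_mp: int,
                       context_p: int = 1) -> int:
    """Data-parallel replicas = world_size / (TP * PP * CP).

    Each model replica spans tensor_mp * pipeline_mp * context_p ranks,
    so the world size must be divisible by that product.
    """
    model_parallel = tensor_mp * pipeline_mp * context_p
    assert world_size % model_parallel == 0, (
        "world size must be divisible by TP * PP * CP"
    )
    return world_size // model_parallel


# Values from this run: world 8, TP 4, PP 1, CP 1.
print(data_parallel_size(8, tensor_mp=4, pipeline_mp=1, context_p=1))  # → 2
```

With micro_batch_size 1 and global_batch_size 256 (from the arguments dump below), those 2 data-parallel replicas imply 256 / (1 × 2) = 128 gradient-accumulation steps per iteration.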
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:HuggingFaceTokenizer
Number of virtual stages per pipeline stage: None
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
account_for_embedding_in_pipeline_split ......... False
account_for_loss_in_pipeline_split .............. False
accumulate_allreduce_grads_in_fp32 .............. True
activation_func_clamp_value ..................... None
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... True
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
align_grad_reduce ............................... True
align_param_gather .............................. False
app_tag_run_name ................................ None
app_tag_run_version ............................. 0.0.0
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_save ...................................... None
async_tensor_model_parallel_allreduce ........... True
attention_backend ............................... AttnBackend.auto
attention_dropout ............................... 0.0
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
cache_mla_latents ............................... False
calc_ft_timeouts ................................ False
calculate_per_token_loss ........................ False
check_for_large_grads ........................... False
check_for_nan_in_loss_and_grad .................. True
check_for_spiky_loss ............................ False
check_weight_hash_across_dp_replicas_interval ... None
ckpt_assume_constant_structure .................. False
ckpt_convert_format ............................. None
ckpt_convert_save ............................... None
ckpt_convert_update_legacy_dist_opt_format ...... False
ckpt_format ..................................... torch
ckpt_fully_parallel_load ........................ False
ckpt_fully_parallel_save ........................ True
ckpt_fully_parallel_save_deprecated ............. False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
collect_log_path ................................ ./logs
comm_time_log_iter .............................. None
config_logger_dir ...............................
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 1
cp_comm_type .................................... ['p2p']
create_attention_mask_in_dataloader ............. True
cross_entropy_fusion_impl ....................... native
cross_entropy_loss_fusion ....................... False
cuda_graph_impl ................................. none
cuda_graph_scope ................................ full
cuda_graph_warmup_steps ......................... 3
data_args_path .................................. None
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_sharding_strategy ................. no_shard
data_parallel_size .............................. 2
data_path ....................................... ['/workspace/data/oscar/oscar-1GB_head-qwen_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_average_in_collective ....................... True
ddp_bucket_size ................................. None
ddp_num_buckets ................................. None
ddp_pad_buckets_for_high_nccl_busbw ............. False
decode_only_cuda_graphs ......................... False
decoder_first_pipeline_num_layers ............... None
decoder_last_pipeline_num_layers ................ None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
decrease_batch_size_if_needed ................... False
defer_embedding_wgrad_compute ................... False
delay_wgrad_compute ............................. False
deprecated_use_mcore_models ..................... True
deterministic_mode .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_backward_fusion ......................... False
disable_bf16_reduced_precision_matmul ........... False
disable_chunked_prefill ......................... False
disable_mamba_mem_eff_path ...................... False
disable_straggler_on_startup .................... False
disable_symmetric_registration .................. False
disable_vision_class_token ...................... False
dist_ckpt_format_deprecated ..................... None
dist_ckpt_optim_fully_reshardable ............... False
dist_ckpt_save_pre_mcore_014 .................... False
dist_ckpt_strictness ............................ assume_ok_unexpected
dist_url ........................................ tcp://localhost:25900
distrib_optim_fully_reshardable_mem_efficient ... False
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
distributed_timeout_seconds_after_init .......... None
embedding_init_method_std ....................... None
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_bw_flux_gemmrs_op ........................ True
enable_cuda_graph ............................... False
enable_dynamic_grad_comp ........................ False
enable_experimental ............................. False
enable_ft_package ............................... False
enable_full_sharding_in_hsdp .................... False
enable_gloo_process_groups ...................... True
enable_msc ...................................... True
enable_one_logger ............................... True
enable_vocab_parallel ........................... False
encoder_num_layers .............................. 28
encoder_seq_length .............................. 4096
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
error_injection_rate ............................ 0
error_injection_type ............................ transient_error
eval_interval ................................... 1000
eval_iters ...................................... 5
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
exp_avg_dtype ................................... torch.float32
exp_avg_sq_dtype ................................ torch.float32
expert_model_parallel_size ...................... 1
expert_tensor_parallel_size ..................... 4
export_force_local_attention .................... False
export_kd_cfg ................................... None
export_kd_teacher_ckpt_format ................... None
export_kd_teacher_load .......................... None
export_kv_cache_quant ........................... False
export_legacy_megatron .......................... False
export_model_type ............................... GPTModel
export_moe_apply_probs_on_input ................. False
export_offline_model ............................ False
export_qk_l2_norm ............................... False
export_quant_cfg ................................ None
export_real_quant_cfg ........................... None
export_te_mcore_model ........................... False
external_cuda_graph ............................. False
extra_vocab_size ................................ 0
ffn_hidden_size ................................. 5504
fine_grained_activation_offloading .............. False
finetune ........................................ False
finetune_data_split ............................. train
finetune_hf_dataset ............................. None
first_last_layers_bf16 .......................... False
flash_decode .................................... False
flux_transpose_weight ........................... False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp4 ............................................. None
fp4_param ....................................... False
fp4_recipe ...................................... nvfp4
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_param_gather ................................ False
fp8_recipe ...................................... delayed
fp8_wgrad ....................................... True
freeze_LM ....................................... False
freeze_ViT ...................................... False
fsdp_double_buffer .............................. False
full_validation ................................. False
global_batch_size ............................... 256
glu_linear_offset ............................... 0.0
grad_comp ....................................... False
grad_comp_warm_up ............................... 0.1
grad_reduce_in_bf16 ............................. False
gradient_accumulation_fusion .................... True
gradient_reduce_div_fusion ...................... True
gradient_sample_ratio ........................... 1.0
group_query_attention ........................... True
grpo_clamp_eps_lower ............................ 0.01
grpo_clamp_eps_upper ............................ 0.01
grpo_default_temperature ........................ 1.0
grpo_default_top_p .............................. 0
grpo_entropy_term_weight ........................ 0.0
grpo_filter_groups_with_same_reward ............. False
grpo_group_size ................................. 2
grpo_iterations ................................. 2
grpo_kl_beta .................................... 0.001
grpo_prompts_per_step ........................... 32
head_lr_mult .................................... 1.0
heterogeneous_layers_config_encoded_json ........ None
heterogeneous_layers_config_path ................ None
hidden_dropout .................................. 0.0
hidden_size ..................................... 1024
hierarchical_context_parallel_sizes ............. None
high_priority_stream_groups ..................... []
hybrid_attention_ratio .......................... 0.0
hybrid_mlp_ratio ................................ 0.0
hybrid_override_pattern ......................... None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... -1
inference_dynamic_batching ...................... False
inference_dynamic_batching_block_size ........... 256
inference_dynamic_batching_buffer_guaranteed_fraction 0.2
inference_dynamic_batching_buffer_overflow_factor None
inference_dynamic_batching_buffer_size_gb ....... 40.0
inference_dynamic_batching_max_requests_override None
inference_dynamic_batching_max_tokens_override .. None
inference_dynamic_batching_num_cuda_graphs ...... 16
inference_dynamic_batching_track_paused_request_events False
inference_dynamic_batching_unified_memory_level . 0
inference_max_batch_size ........................ 8
inference_max_seq_length ........................ 2560
inference_rng_tracker ........................... False
init_method_std ................................. 0.006
init_method_xavier_uniform ...................... False
init_model_with_meta_device ..................... False
initial_loss_scale .............................. 4294967296
inprocess_active_world_size ..................... 1
inprocess_barrier_timeout ....................... 120
inprocess_completion_timeout .................... 120
inprocess_empty_cuda_cache ...................... False
inprocess_granularity ........................... node
inprocess_hard_timeout .......................... 90
inprocess_heartbeat_interval .................... 30
inprocess_heartbeat_timeout ..................... 60
inprocess_last_call_wait ........................ 1
inprocess_max_iterations ........................ None
inprocess_monitor_process_interval .............. 1.0
inprocess_monitor_thread_interval ............... 1.0
inprocess_progress_watchdog_interval ............ 1.0
inprocess_restart ............................... False
inprocess_soft_timeout .......................... 60
inprocess_termination_grace_time ................ 1
is_hybrid_model ................................. False
iter_per_epoch .................................. 1250
iteration_sample_ratio .......................... 0.01
iterations_to_skip .............................. []
keep_fp8_transpose_cache ........................ False
kitchen_config_file ............................. None
kitchen_recipe_number ........................... None
kv_channels ..................................... 64
kv_lora_rank .................................... 32
langrl_env_config ............................... None
langrl_external_server .......................... False
langrl_inference_server_conversation_template ... None
langrl_inference_server_type .................... inplace_megatron
lazy_mpu_init ................................... None
legacy_tokenizer ................................ False
load ............................................ /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1
load_main_params_from_ckpt ...................... None
local_rank ...................................... 0
log_energy ...................................... False
log_interval .................................... 1
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. True
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
logging_level ................................... None
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 3e-05
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 1
lr_warmup_samples ............................... 0
lr_wsd_decay_iters .............................. None
lr_wsd_decay_samples ............................ None
lr_wsd_decay_style .............................. exponential
main_grads_dtype ................................ torch.float32
main_params_dtype ............................... torch.float32
make_vocab_size_divisible_by .................... 128
mamba_head_dim .................................. 64
mamba_num_groups ................................ 8
mamba_num_heads ................................. None
mamba_state_dim ................................. 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_padding_length .............................. None
max_position_embeddings ......................... 32768
max_tokens_to_oom ............................... 12000
memory_snapshot_path ............................ snapshot.pickle
merge_file ...................................... None
micro_batch_size ................................ 1
microbatch_group_size_per_vp_stage .............. None
mid_level_dataset_surplus ....................... 0.005
min_loss_scale .................................. 1.0
min_lr .......................................... 3e-06
min_offloaded_tensor_size ....................... 1048576
mlp_chunks_for_prefill .......................... 1
mmap_bin_files .................................. True
mock_data ....................................... False
modelopt_enabled ................................ False
moe_apply_probs_on_input ........................ False
moe_aux_loss_coeff .............................. 0.0
moe_deepep_num_sms .............................. 20
moe_enable_deepep ............................... False
moe_expert_capacity_factor ...................... None
moe_extended_tp ................................. False
moe_ffn_hidden_size ............................. None
moe_grouped_gemm ................................ False
moe_input_jitter_eps ............................ None
moe_layer_freq .................................. 1
moe_layer_recompute ............................. False
moe_pad_expert_input_to_capacity ................ False
moe_pad_experts_for_cuda_graph_inference ........ False
moe_per_layer_logging ........................... False
moe_permute_fusion .............................. False
moe_router_bias_update_rate ..................... 0.001
moe_router_dtype ................................ None
moe_router_enable_expert_bias ................... False
moe_router_force_load_balancing ................. False
moe_router_fusion ............................... False
moe_router_group_topk ........................... None
moe_router_load_balancing_type .................. aux_loss
moe_router_num_groups ........................... None
moe_router_padding_for_fp8 ...................... False
moe_router_pre_softmax .......................... False
moe_router_score_function ....................... softmax
moe_router_topk ................................. 2
moe_router_topk_scaling_factor .................. None
moe_shared_expert_intermediate_size ............. None
moe_shared_expert_overlap ....................... False
moe_token_dispatcher_type ....................... allgather
moe_token_drop_policy ........................... probs
moe_upcycling_granularity ....................... 1
moe_use_legacy_grouped_gemm ..................... False
moe_use_upcycling ............................... False
moe_z_loss_coeff ................................ None
mrope_section ................................... None
mscale .......................................... 1.0
mscale_all_dim .................................. 0.0
mtp_loss_scaling_factor ......................... 0.1
mtp_num_layers .................................. None
multi_latent_attention .......................... False
multiple_validation_sets ........................ False
nccl_all_reduce_for_prefill ..................... False
nccl_communicator_config_path ................... None
nccl_ub ......................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_rope_freq .................................... None
no_save_optim ................................... None
no_save_rng ..................................... None
non_persistent_ckpt_type ........................ None
non_persistent_global_ckpt_dir .................. None
non_persistent_local_ckpt_algo .................. fully_parallel
non_persistent_local_ckpt_dir ................... None
non_persistent_save_interval .................... None
norm_epsilon .................................... 1e-05
normalization ................................... RMSNorm
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_dataset_builder_threads ..................... 1
num_distributed_optimizer_instances ............. 1
num_experts ..................................... None
num_layers ...................................... 28
num_layers_at_end_in_bf16 ....................... 1
num_layers_at_start_in_bf16 ..................... 1
num_layers_per_virtual_pipeline_stage ........... None
num_layers_to_build ............................. None
num_query_groups ................................ 8
num_virtual_stages_per_pipeline_rank ............ None
num_workers ..................................... 2
object_storage_cache_path ....................... None
offload_modules ................................. None
one_logger_async ................................ False
one_logger_project .............................. megatron-lm
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
optimizer_cpu_offload ........................... False
optimizer_offload_fraction ...................... 1.0
output_bert_embeddings .......................... False
overlap_cpu_optimizer_d2h_h2d ................... False
overlap_ep_comm_with_split_attn ................. False
overlap_grad_reduce ............................. True
overlap_moe_expert_parallel_comm ................ False
overlap_p2p_comm ................................ False
overlap_p2p_comm_warmup_flush ................... False
overlap_param_gather ............................ False
overlap_param_gather_with_optimizer_step ........ False
override_opt_param_scheduler .................... False
padded_vocab_size ............................... None
parallel_linear_impl ............................ None
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
patch_size ...................................... 14
per_split_data_args_path ........................ None
perform_initialization .......................... True
perform_rl_step ................................. False
pin_cpu_grads ................................... True
pin_cpu_params .................................. True
pipe_sp_splits .................................. 1
pipe_sp_strategy ................................ average
pipeline_model_parallel_comm_backend ............ None
pipeline_model_parallel_layout .................. None
pipeline_model_parallel_size .................... 1
position_embedding_type ......................... rope
pretrained_checkpoint ........................... None
profile ......................................... False
profile_dir ..................................... ./
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
q_lora_rank ..................................... None
qk_head_dim ..................................... 128
qk_l2_norm ...................................... False
qk_layernorm .................................... False
qk_pos_emb_head_dim ............................. 64
quant_comm_bits ................................. 8
quant_group_size ................................ None
quant_scale_dtype ............................... bf16
query_in_block_prob ............................. 0.1
quick_geglu ..................................... False
rampup_batch_size ............................... None
rank ............................................ 0
rank_adjust_window_size ......................... 1000
recompute_activation_function ................... False
recompute_activation_function_num_layers ........ None
recompute_granularity ........................... None
recompute_method ................................ None
recompute_modules ............................... None
recompute_num_layers ............................ None
record_memory_history ........................... False
reduce_recompute_for_last_chunk ................. False
relative_attention_max_distance ................. 128
relative_attention_num_buckets .................. 32
replication ..................................... False
replication_factor .............................. 2
replication_jump ................................ None
reproduce ....................................... False
rerun_mode ...................................... validate_results
reset_attention_mask ............................ False
reset_position_ids .............................. False
result_rejected_tracker_filename ................ None
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
reuse_fp32_param ................................ False
reuse_grad_buf_for_mxfp8_param_ag ............... False
rl_calculate_intra_group_similarity ............. False
rl_importance_sampling_truncation_coef .......... None
rl_inference_logprobs_is_correction ............. False
rl_offload_kv_cache_during_training ............. False
rl_offload_optimizer_during_inference ........... False
rl_partial_rollouts ............................. False
rl_prompts_per_eval ............................. 32
rl_remove_kv_cache_during_training .............. False
rl_reset_cuda_graphs ............................ False
rl_sequence_packing_algo ........................ fifo
rl_sequence_packing_bin_size .................... 8192
rl_use_sequence_packing ......................... False
rope_scaling_factor ............................. 8.0
rope_type ....................................... None
rotary_base ..................................... 1000000
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_scaling_factor ........................... 1.0
rotary_seq_len_interpolation_factor ............. None
run_workload_inspector_server ................... False
sample_rate ..................................... 1.0
save ............................................ /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1
save_flux_gather_input .......................... False
save_interval ................................... 5
save_retain_interval ............................ None
scatter_gather_tensors_in_pipeline .............. True
schedule_method ................................. vanilla
schedule_timer_end .............................. 20
schedule_timer_start ............................ 10
seed ............................................ 1234
seq_length ...................................... 4096
sequence_parallel ............................... True
sft ............................................. False
sft_tokenizer_prompt_format ..................... nemotron-h-aligned
sgd_momentum .................................... 0.9
sharp_enabled_group ............................. None
short_seq_prob .................................. 0.1
skip_train ...................................... False
skipped_train_samples ........................... 0
softmax_type .................................... vanilla
spatial_merge_size .............................. 2
spec ............................................ None
specify_layers .................................. None
split ........................................... 949,50,1
squared_relu .................................... False
start_weight_decay .............................. 0.1
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
strict_fsdp_dtensor_load ........................ True
suggested_communication_unit_size ............... None
swap_attention .................................. False
swap_modules .................................... self_attention
swiglu .......................................... True
swin_backbone_type .............................. tiny
symmetric_ar_type ............................... None
te_rng_tracker .................................. False
teacher_model_config ............................ None
temporal_patch_size ............................. 2
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1/tensorboard
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
tiktoken_num_special_tokens ..................... 1000
tiktoken_pattern ................................ None
tiktoken_special_tokens ......................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_metadata .............................. None
tokenizer_model ................................. /home/models/qwen3/Qwen3-0.6B
tokenizer_type .................................. HuggingFaceTokenizer
torch_fsdp2_reshard_after_forward ............... True
tp_comm_bootstrap_backend ....................... nccl
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... 50
train_samples ................................... None
train_sync_interval ............................. None
transformer_impl ................................ transformer_engine
transformer_pipeline_model_parallel_size ........ 1
trust_remote_code ............................... False
untie_embeddings_and_output_weights ............. True
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_ckpt_memory_cache ........................... False
use_cpu_initialization .......................... None
use_dist_ckpt ................................... False
use_dist_ckpt_deprecated ........................ False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_fused_weighted_squared_relu ................. False
use_hip_profiler ................................ False
use_legacy_models ............................... False
use_megatron_fsdp ............................... False
use_mp_args_from_checkpoint_args ................ False
use_one_sent_docs ............................... False
use_optimizer_feature ........................... False
use_persistent_ckpt_worker ...................... False
use_precision_aware_optimizer ................... False
use_pytorch_profiler ............................ False
use_qk_norm ..................................... True
use_quantize_comm ............................... False
use_ring_exchange_p2p ........................... False
use_rope_scaling ................................ False
use_rotary_position_embeddings .................. False
use_sharp ....................................... False
use_te_activation_func .......................... False
use_tokenizer_model_from_checkpoint_args ........ True
use_torch_fsdp2 ................................. False
use_torch_optimizer_for_cpu_offload ............. False
use_tp_pp_dp_mapping ............................ False
v_head_dim ...................................... 128
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_entity ....................................
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
wgrad_deferral_limit ............................ 0
window_attn_skip_freq ........................... None
window_size ..................................... None
world_size ...................................... 8
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
> building HuggingFaceTokenizer tokenizer ...
[WARNING | megatron.core.rerun_state_machine]: RerunStateMachine initialized in mode validate_results
> padded vocab (size: 151669) with 395 dummy tokens (new size: 152064)
> initializing torch distributed ...
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later?), no TensorBoard logs will be written.
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0211 17:48:16.207494 216834 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
W0211 17:48:16.211829 216828 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
W0211 17:48:16.313028 216826 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
W0211 17:48:16.313149 216833 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
tp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
dp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
ep_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
etp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
edp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
cp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
tp-cp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
embd-pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
pos_embd-pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
tp-ep_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
tp-dp-cp_group: [[0, 1, 2, 3, 4, 5, 6, 7]]
tp-pp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
dp-cp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
> initialized tensor model parallel with size 4
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/datasets'
W0211 17:48:16.380734 216832 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
W0211 17:48:16.382851 216831 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
W0211 17:48:16.387229 216830 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.039 seconds
> compiling and loading fused kernels ...
W0211 17:48:16.423564 216829 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
>>> done with compiling and loading fused kernels. Compilation time: 1.197 seconds
time to initialize megatron (seconds): 5.946
[after megatron is initialized] datetime: 2026-02-11 17:48:18
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (2, 0): 218293248
[rank 0] GPTModel(
  (embedding): LanguageModelEmbedding(
    (word_embeddings): VocabParallelEmbedding()
    (embedding_dropout): Dropout(p=0.0, inplace=False)
  )
  (rotary_pos_emb): RotaryEmbedding()
  (decoder): TransformerBlock(
    (layers): ModuleList(
      (0-27): 28 x TransformerLayer(
        (input_layernorm): IdentityOp()
        (self_attention): SelfAttention(
          (core_attention): TEDotProductAttention(
            (flash_attention): FlashAttention()
            (fused_attention): FusedAttention()
            (unfused_attention): UnfusedDotProductAttention(
              (scale_mask_softmax): FusedScaleMaskSoftmax()
              (attention_dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (linear_proj): TERowParallelLinear(in_features=256, out_features=1024, bias=False, TP=4)
          (linear_qkv): TELayerNormColumnParallelLinear(in_features=1024, out_features=512, bias=False, TP=4)
          (q_layernorm): IdentityOp()
          (k_layernorm): IdentityOp()
        )
        (pre_cross_attn_layernorm): IdentityOp()
        (cross_attention): IdentityOp()
        (cross_attn_bda): IdentityFuncOp()
        (pre_mlp_layernorm): IdentityOp()
        (mlp): MLP(
          (linear_fc1): TELayerNormColumnParallelLinear(in_features=1024, out_features=2752, bias=False, TP=4)
          (linear_fc2): TERowParallelLinear(in_features=1376, out_features=1024, bias=False, TP=4)
        )
      )
    )
    (final_layernorm): RMSNorm()
  )
  (output_layer): ColumnParallelLinear(in_features=1024, out_features=152064, bias=False, TP=4)
)
> number of parameters on (tensor, pipeline) model parallel rank (1, 0): 218293248
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 218293248
> number of parameters on (tensor, pipeline) model parallel rank (3, 0): 218293248
WARNING: could not find the metadata file /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
(min, max) time across ranks (ms):
load-checkpoint ................................: (1.06, 1.07)
[after model, optimizer, and learning rate scheduler are built] datetime: 2026-02-11 17:48:19
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 12800
validation: 1280
test: 1280
> building train, validation, and test datasets for GPT ...
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2026-02-11 17:48:19
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (265.26, 274.11)
train/valid/test-data-iterators-setup ..........: (249.63, 344.36)
training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2026-02-11 17:48:19
[WARNING | megatron.core.rerun_state_machine]: Result validation enabled
[2026-02-11 17:48:51] iteration 1/ 50 | consumed samples: 256 | elapsed time per iteration (ms): 32016.2 | throughput per GPU (TFLOP/s/GPU): 20.5 | learning rate: 3.000000E-05 | global batch size: 256 | lm loss: 1.194528E+01 | loss scale: 1.0 | grad norm: 17.556 | number of skipped iterations: 0 | number of nan iterations: 0 |
Number of parameters in transformer block in billions: 0.56
Number of parameters in embedding layers in billions: 0.31
Total number of parameters in billions: 0.87
Number of parameters in most loaded shard in billions: 0.2183
compute_activation_memory_without_sp
Activation memory footprint per transformer layer (precise, without SP): 64.0 MB
Theoretical memory footprints: weight and optimizer=2497.83 MB, activation=2206.08 MB, total=4703.92 MB
[Rank 3] (after 1 iterations) memory (MB) | allocated: 2720.359375 | max allocated: 4294.29541015625 | reserved: 5464.0 | max reserved: 5464.0
[Rank 2] (after 1 iterations) memory (MB) | allocated: 2720.359375 | max allocated: 4290.59130859375 | reserved: 5464.0 | max reserved: 5464.0
[Rank 1] (after 1 iterations) memory (MB) | allocated: 2720.359375 | max allocated: 4294.671875 | reserved: 5464.0 | max reserved: 5464.0
[Rank 0] (after 1 iterations) memory (MB) | allocated: 2720.359375 | max allocated: 4280.4609375 | reserved: 4870.0 | max reserved: 4870.0
[2026-02-11 17:49:15] iteration 2/ 50 | consumed samples: 512 | elapsed time per iteration (ms): 23547.6 | throughput per GPU (TFLOP/s/GPU): 27.9 | learning rate: 2.997226E-05 | global batch size: 256 | lm loss: 1.194559E+01 | loss scale: 1.0 | grad norm: 17.100 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:49:40] iteration 3/ 50 | consumed samples: 768 | elapsed time per iteration (ms): 25037.7 | throughput per GPU (TFLOP/s/GPU): 26.2 | learning rate: 2.988916E-05 | global batch size: 256 | lm loss: 1.173730E+01 | loss scale: 1.0 | grad norm: 38.599 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:50:04] iteration 4/ 50 | consumed samples: 1024 | elapsed time per iteration (ms): 24486.3 | throughput per GPU (TFLOP/s/GPU): 26.8 | learning rate: 2.975105E-05 | global batch size: 256 | lm loss: 1.157678E+01 | loss scale: 1.0 | grad norm: 3.598 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:50:28] iteration 5/ 50 | consumed samples: 1280 | elapsed time per iteration (ms): 24209.4 | throughput per GPU (TFLOP/s/GPU): 27.1 | learning rate: 2.955848E-05 | global batch size: 256 | lm loss: 1.147159E+01 | loss scale: 1.0 | grad norm: 2.897 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 5 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 5 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6053.01, 6053.04)
[2026-02-11 17:50:58] iteration 6/ 50 | consumed samples: 1536 | elapsed time per iteration (ms): 23787.3 | throughput per GPU (TFLOP/s/GPU): 27.6 | learning rate: 2.931225E-05 | global batch size: 256 | lm loss: 1.180910E+01 | loss scale: 1.0 | grad norm: 220.934 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:51:22] iteration 7/ 50 | consumed samples: 1792 | elapsed time per iteration (ms): 23771.1 | throughput per GPU (TFLOP/s/GPU): 27.6 | learning rate: 2.901338E-05 | global batch size: 256 | lm loss: 1.140885E+01 | loss scale: 1.0 | grad norm: 2.688 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:51:46] iteration 8/ 50 | consumed samples: 2048 | elapsed time per iteration (ms): 23571.3 | throughput per GPU (TFLOP/s/GPU): 27.8 | learning rate: 2.866308E-05 | global batch size: 256 | lm loss: 1.137786E+01 | loss scale: 1.0 | grad norm: 2.564 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:52:09] iteration 9/ 50 | consumed samples: 2304 | elapsed time per iteration (ms): 23910.3 | throughput per GPU (TFLOP/s/GPU): 27.5 | learning rate: 2.826280E-05 | global batch size: 256 | lm loss: 1.133599E+01 | loss scale: 1.0 | grad norm: 2.567 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:52:35] iteration 10/ 50 | consumed samples: 2560 | elapsed time per iteration (ms): 25138.6 | throughput per GPU (TFLOP/s/GPU): 26.1 | learning rate: 2.781419E-05 | global batch size: 256 | lm loss: 1.130382E+01 | loss scale: 1.0 | grad norm: 2.557 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 10 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 10 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6240.67, 6240.69)
[2026-02-11 17:53:06] iteration 11/ 50 | consumed samples: 2816 | elapsed time per iteration (ms): 24910.8 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 2.731908E-05 | global batch size: 256 | lm loss: 1.127207E+01 | loss scale: 1.0 | grad norm: 2.545 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:53:30] iteration 12/ 50 | consumed samples: 3072 | elapsed time per iteration (ms): 24302.7 | throughput per GPU (TFLOP/s/GPU): 27.0 | learning rate: 2.677952E-05 | global batch size: 256 | lm loss: 1.123584E+01 | loss scale: 1.0 | grad norm: 2.534 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:53:54] iteration 13/ 50 | consumed samples: 3328 | elapsed time per iteration (ms): 23970.4 | throughput per GPU (TFLOP/s/GPU): 27.4 | learning rate: 2.619772E-05 | global batch size: 256 | lm loss: 1.120071E+01 | loss scale: 1.0 | grad norm: 2.547 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:54:18] iteration 14/ 50 | consumed samples: 3584 | elapsed time per iteration (ms): 23715.7 | throughput per GPU (TFLOP/s/GPU): 27.7 | learning rate: 2.557606E-05 | global batch size: 256 | lm loss: 1.116887E+01 | loss scale: 1.0 | grad norm: 2.514 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:54:43] iteration 15/ 50 | consumed samples: 3840 | elapsed time per iteration (ms): 24919.0 | throughput per GPU (TFLOP/s/GPU): 26.3 | learning rate: 2.491711E-05 | global batch size: 256 | lm loss: 1.112481E+01 | loss scale: 1.0 | grad norm: 2.569 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 15 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 15 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6368.84, 6368.91)
[2026-02-11 17:55:15] iteration 16/ 50 | consumed samples: 4096 | elapsed time per iteration (ms): 26104.9 | throughput per GPU (TFLOP/s/GPU): 25.1 | learning rate: 2.422357E-05 | global batch size: 256 | lm loss: 1.109704E+01 | loss scale: 1.0 | grad norm: 2.518 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:55:41] iteration 17/ 50 | consumed samples: 4352 | elapsed time per iteration (ms): 25885.0 | throughput per GPU (TFLOP/s/GPU): 25.4 | learning rate: 2.349830E-05 | global batch size: 256 | lm loss: 1.106508E+01 | loss scale: 1.0 | grad norm: 3.518 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:56:06] iteration 18/ 50 | consumed samples: 4608 | elapsed time per iteration (ms): 25098.1 | throughput per GPU (TFLOP/s/GPU): 26.2 | learning rate: 2.274427E-05 | global batch size: 256 | lm loss: 1.101524E+01 | loss scale: 1.0 | grad norm: 2.577 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:56:31] iteration 19/ 50 | consumed samples: 4864 | elapsed time per iteration (ms): 24881.3 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 2.196458E-05 | global batch size: 256 | lm loss: 1.099613E+01 | loss scale: 1.0 | grad norm: 2.496 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:56:56] iteration 20/ 50 | consumed samples: 5120 | elapsed time per iteration (ms): 24919.2 | throughput per GPU (TFLOP/s/GPU): 26.3 | learning rate: 2.116243E-05 | global batch size: 256 | lm loss: 1.095716E+01 | loss scale: 1.0 | grad norm: 2.519 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 20 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 20 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6076.06, 6076.09)
[2026-02-11 17:57:27] iteration 21/ 50 | consumed samples: 5376 | elapsed time per iteration (ms): 25205.7 | throughput per GPU (TFLOP/s/GPU): 26.0 | learning rate: 2.034112E-05 | global batch size: 256 | lm loss: 1.092020E+01 | loss scale: 1.0 | grad norm: 2.551 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:57:52] iteration 22/ 50 | consumed samples: 5632 | elapsed time per iteration (ms): 25014.1 | throughput per GPU (TFLOP/s/GPU): 26.2 | learning rate: 1.950403E-05 | global batch size: 256 | lm loss: 1.089746E+01 | loss scale: 1.0 | grad norm: 2.506 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:58:17] iteration 23/ 50 | consumed samples: 5888 | elapsed time per iteration (ms): 24387.6 | throughput per GPU (TFLOP/s/GPU): 26.9 | learning rate: 1.865460E-05 | global batch size: 256 | lm loss: 1.085281E+01 | loss scale: 1.0 | grad norm: 2.566 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:58:41] iteration 24/ 50 | consumed samples: 6144 | elapsed time per iteration (ms): 24708.5 | throughput per GPU (TFLOP/s/GPU): 26.6 | learning rate: 1.779631E-05 | global batch size: 256 | lm loss: 1.082638E+01 | loss scale: 1.0 | grad norm: 2.531 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:59:06] iteration 25/ 50 | consumed samples: 6400 | elapsed time per iteration (ms): 25163.9 | throughput per GPU (TFLOP/s/GPU): 26.1 | learning rate: 1.693270E-05 | global batch size: 256 | lm loss: 1.079843E+01 | loss scale: 1.0 | grad norm: 2.545 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 25 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 25 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6257.54, 6257.71)
[2026-02-11 17:59:36] iteration 26/ 50 | consumed samples: 6656 | elapsed time per iteration (ms): 23507.4 | throughput per GPU (TFLOP/s/GPU): 27.9 | learning rate: 1.606730E-05 | global batch size: 256 | lm loss: 1.076447E+01 | loss scale: 1.0 | grad norm: 2.546 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:00:00] iteration 27/ 50 | consumed samples: 6912 | elapsed time per iteration (ms): 23555.7 | throughput per GPU (TFLOP/s/GPU): 27.9 | learning rate: 1.520369E-05 | global batch size: 256 | lm loss: 1.073713E+01 | loss scale: 1.0 | grad norm: 2.670 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:00:23] iteration 28/ 50 | consumed samples: 7168 | elapsed time per iteration (ms): 23440.2 | throughput per GPU (TFLOP/s/GPU): 28.0 | learning rate: 1.434540E-05 | global batch size: 256 | lm loss: 1.070240E+01 | loss scale: 1.0 | grad norm: 2.560 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:00:47] iteration 29/ 50 | consumed samples: 7424 | elapsed time per iteration (ms): 23776.8 | throughput per GPU (TFLOP/s/GPU): 27.6 | learning rate: 1.349597E-05 | global batch size: 256 | lm loss: 1.067331E+01 | loss scale: 1.0 | grad norm: 2.558 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:01:11] iteration 30/ 50 | consumed samples: 7680 | elapsed time per iteration (ms): 24240.5 | throughput per GPU (TFLOP/s/GPU): 27.1 | learning rate: 1.265888E-05 | global batch size: 256 | lm loss: 1.064471E+01 | loss scale: 1.0 | grad norm: 2.593 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 30 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 30 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6359.60, 6359.77)
[2026-02-11 18:01:42] iteration 31/ 50 | consumed samples: 7936 | elapsed time per iteration (ms): 23973.2 | throughput per GPU (TFLOP/s/GPU): 27.4 | learning rate: 1.183757E-05 | global batch size: 256 | lm loss: 1.061864E+01 | loss scale: 1.0 | grad norm: 2.945 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:02:06] iteration 32/ 50 | consumed samples: 8192 | elapsed time per iteration (ms): 24138.4 | throughput per GPU (TFLOP/s/GPU): 27.2 | learning rate: 1.103542E-05 | global batch size: 256 | lm loss: 1.060612E+01 | loss scale: 1.0 | grad norm: 2.685 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:02:30] iteration 33/ 50 | consumed samples: 8448 | elapsed time per iteration (ms): 24111.5 | throughput per GPU (TFLOP/s/GPU): 27.2 | learning rate: 1.025573E-05 | global batch size: 256 | lm loss: 1.058644E+01 | loss scale: 1.0 | grad norm: 17.452 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:02:54] iteration 34/ 50 | consumed samples: 8704 | elapsed time per iteration (ms): 24509.7 | throughput per GPU (TFLOP/s/GPU): 26.8 | learning rate: 9.501700E-06 | global batch size: 256 | lm loss: 1.056656E+01 | loss scale: 1.0 | grad norm: 3.815 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:03:19] iteration 35/ 50 | consumed samples: 8960 | elapsed time per iteration (ms): 24686.1 | throughput per GPU (TFLOP/s/GPU): 26.6 | learning rate: 8.776425E-06 | global batch size: 256 | lm loss: 1.054440E+01 | loss scale: 1.0 | grad norm: 2.482 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 35 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 35 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6290.02, 6290.09)
[2026-02-11 18:03:50] iteration 36/ 50 | consumed samples: 9216 | elapsed time per iteration (ms): 24716.9 | throughput per GPU (TFLOP/s/GPU): 26.6 | learning rate: 8.082888E-06 | global batch size: 256 | lm loss: 1.052087E+01 | loss scale: 1.0 | grad norm: 3.101 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:04:14] iteration 37/ 50 | consumed samples: 9472 | elapsed time per iteration (ms): 24331.1 | throughput per GPU (TFLOP/s/GPU): 27.0 | learning rate: 7.423938E-06 | global batch size: 256 | lm loss: 1.050959E+01 | loss scale: 1.0 | grad norm: 2.672 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:04:39] iteration 38/ 50 | consumed samples: 9728 | elapsed time per iteration (ms): 24831.3 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 6.802284E-06 | global batch size: 256 | lm loss: 1.048346E+01 | loss scale: 1.0 | grad norm: 2.642 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:05:04] iteration 39/ 50 | consumed samples: 9984 | elapsed time per iteration (ms): 24760.1 | throughput per GPU (TFLOP/s/GPU): 26.5 | learning rate: 6.220479E-06 | global batch size: 256 | lm loss: 1.048476E+01 | loss scale: 1.0 | grad norm: 2.795 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:05:29] iteration 40/ 50 | consumed samples: 10240 | elapsed time per iteration (ms): 25262.7 | throughput per GPU (TFLOP/s/GPU): 26.0 | learning rate: 5.680916E-06 | global batch size: 256 | lm loss: 1.046774E+01 | loss scale: 1.0 | grad norm: 2.509 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 40 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 40 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6104.48, 6104.56)
[2026-02-11 18:06:00] iteration 41/ 50 | consumed samples: 10496 | elapsed time per iteration (ms): 25132.2 | throughput per GPU (TFLOP/s/GPU): 26.1 | learning rate: 5.185811E-06 | global batch size: 256 | lm loss: 1.046618E+01 | loss scale: 1.0 | grad norm: 2.557 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:06:25] iteration 42/ 50 | consumed samples: 10752 | elapsed time per iteration (ms): 24748.1 | throughput per GPU (TFLOP/s/GPU): 26.5 | learning rate: 4.737197E-06 | global batch size: 256 | lm loss: 1.045832E+01 | loss scale: 1.0 | grad norm: 2.603 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:06:50] iteration 43/ 50 | consumed samples: 11008 | elapsed time per iteration (ms): 24830.1 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 4.336920E-06 | global batch size: 256 | lm loss: 1.044374E+01 | loss scale: 1.0 | grad norm: 2.531 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:07:15] iteration 44/ 50 | consumed samples: 11264 | elapsed time per iteration (ms): 24816.0 | throughput per GPU (TFLOP/s/GPU): 26.5 | learning rate: 3.986624E-06 | global batch size: 256 | lm loss: 1.043142E+01 | loss scale: 1.0 | grad norm: 2.509 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:07:40] iteration 45/ 50 | consumed samples: 11520 | elapsed time per iteration (ms): 24840.8 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 3.687747E-06 | global batch size: 256 | lm loss: 1.042535E+01 | loss scale: 1.0 | grad norm: 2.623 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 45 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 45 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (6001.84, 6001.97)
[2026-02-11 18:08:11] iteration 46/ 50 | consumed samples: 11776 | elapsed time per iteration (ms): 25069.9 | throughput per GPU (TFLOP/s/GPU): 26.2 | learning rate: 3.441519E-06 | global batch size: 256 | lm loss: 1.043089E+01 | loss scale: 1.0 | grad norm: 2.498 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:08:36] iteration 47/ 50 | consumed samples: 12032 | elapsed time per iteration (ms): 25030.7 | throughput per GPU (TFLOP/s/GPU): 26.2 | learning rate: 3.248951E-06 | global batch size: 256 | lm loss: 1.041368E+01 | loss scale: 1.0 | grad norm: 2.453 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:09:00] iteration 48/ 50 | consumed samples: 12288 | elapsed time per iteration (ms): 24655.4 | throughput per GPU (TFLOP/s/GPU): 26.6 | learning rate: 3.110835E-06 | global batch size: 256 | lm loss: 1.041769E+01 | loss scale: 1.0 | grad norm: 2.474 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:09:25] iteration 49/ 50 | consumed samples: 12544 | elapsed time per iteration (ms): 24934.6 | throughput per GPU (TFLOP/s/GPU): 26.3 | learning rate: 3.027737E-06 | global batch size: 256 | lm loss: 1.041792E+01 | loss scale: 1.0 | grad norm: 2.497 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 18:09:50] iteration 50/ 50 | consumed samples: 12800 | elapsed time per iteration (ms): 24820.8 | throughput per GPU (TFLOP/s/GPU): 26.4 | learning rate: 3.000000E-06 | global batch size: 256 | lm loss: 1.039724E+01 | loss scale: 1.0 | grad norm: 2.511 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 50 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 in torch format
successfully saved checkpoint from iteration 50 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/checkpoints1 [ t 1/4, p 1/1 ]
(min, max) time across ranks (ms):
save-checkpoint ................................: (5870.59, 5870.67)
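As a quick consistency check on the counters logged above (illustrative variable names; values taken from the log), the consumed-samples column should equal iteration × global batch size:

```python
# Sanity-check the sample counter reported in the log:
# at iteration 50 with a global batch size of 256,
# consumed samples should be 50 * 256 = 12800.
iteration = 50
global_batch_size = 256
consumed_samples = iteration * global_batch_size
assert consumed_samples == 12800  # matches the iteration-50 log line
```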
[after training is done] datetime: 2026-02-11 18:09:56
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode disabled
Evaluating on 1280 samples
Evaluating iter 1/5
Evaluating iter 2/5
Evaluating iter 3/5
Evaluating iter 4/5
Evaluating iter 5/5
(min, max) time across ranks (ms):
evaluate .......................................: (55213.15, 55215.66)
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
----------------------------------------------------------------------------------------------------------------
validation loss at iteration 50 on validation set | lm loss value: 1.044579E+01 | lm loss PPL: 3.439929E+04 |
----------------------------------------------------------------------------------------------------------------
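The reported perplexity is consistent with PPL = exp(lm loss); a minimal check using the logged validation values:

```python
import math

# lm loss and PPL as reported for the validation set at iteration 50
lm_loss = 1.044579e01
reported_ppl = 3.439929e04

# Perplexity is the exponential of the mean cross-entropy loss.
computed_ppl = math.exp(lm_loss)
assert abs(computed_ppl - reported_ppl) / reported_ppl < 1e-4
```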
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode disabled
Evaluating on 1280 samples
Evaluating iter 1/5
Evaluating iter 2/5
Evaluating iter 3/5
Evaluating iter 4/5
Evaluating iter 5/5
(min, max) time across ranks (ms):
evaluate .......................................: (54313.53, 54316.17)
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
----------------------------------------------------------------------------------------------------------
validation loss at iteration 50 on test set | lm loss value: 1.046412E+01 | lm loss PPL: 3.503556E+04 |
----------------------------------------------------------------------------------------------------------
W0211 18:11:46.632072 216828 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
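The ProcessGroupNCCL warning above recommends calling destroy_process_group() before exit. A minimal sketch of that shutdown pattern (single-process gloo group for illustration only; the actual run uses NCCL across 8 ranks launched by torchrun):

```python
import os
import torch.distributed as dist

def main():
    # Rendezvous settings for a single-process illustration;
    # a real launch would get these from torchrun.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29510")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    try:
        pass  # training / evaluation work goes here
    finally:
        # Tear down the process group explicitly, as the warning
        # recommends, so pending collectives finish before the
        # interpreter shuts down.
        dist.destroy_process_group()

if __name__ == "__main__":
    main()
```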
--tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --context-parallel-size 2 --use-distributed-optimizer --sequence-parallel
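These flags determine the world size printed at startup: the tensor-, pipeline-, context-, and data-parallel sizes must multiply to the total rank count. A quick check with the values from the command line (data-parallel size of 1 is taken from the startup line below, not from the flags):

```python
tensor_parallel = 4    # --tensor-model-parallel-size
pipeline_parallel = 1  # --pipeline-model-parallel-size
context_parallel = 2   # --context-parallel-size
data_parallel = 1      # reported at startup: world size 8 / (4 * 1 * 2)

world_size = tensor_parallel * pipeline_parallel * context_parallel * data_parallel
assert world_size == 8  # matches "using world size: 8" below
```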
Successfully preprocessed all matching files.
/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/inference/unified_memory.py:83: UserWarning: Failed to create unified memory mempool.
warnings.warn("Failed to create unified memory mempool.")
[WARNING | megatron.core.utils]: fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
using world size: 8, data-parallel size: 1, context-parallel size: 2, hierarchical context-parallel sizes: None, tensor-model-parallel size: 4, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:HuggingFaceTokenizer
Number of virtual stages per pipeline stage: None
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
account_for_embedding_in_pipeline_split ......... False
account_for_loss_in_pipeline_split .............. False
accumulate_allreduce_grads_in_fp32 .............. True
activation_func_clamp_value ..................... None
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... True
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
align_grad_reduce ............................... True
align_param_gather .............................. False
app_tag_run_name ................................ None
app_tag_run_version ............................. 0.0.0
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_save ...................................... None
async_tensor_model_parallel_allreduce ........... True
attention_backend ............................... AttnBackend.auto
attention_dropout ............................... 0.0
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
cache_mla_latents ............................... False
calc_ft_timeouts ................................ False
calculate_per_token_loss ........................ False
check_for_large_grads ........................... False
check_for_nan_in_loss_and_grad .................. True
check_for_spiky_loss ............................ False
check_weight_hash_across_dp_replicas_interval ... None
ckpt_assume_constant_structure .................. False
ckpt_convert_format ............................. None
ckpt_convert_save ............................... None
ckpt_convert_update_legacy_dist_opt_format ...... False
ckpt_format ..................................... torch
ckpt_fully_parallel_load ........................ False
ckpt_fully_parallel_save ........................ True
ckpt_fully_parallel_save_deprecated ............. False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
collect_log_path ................................ ./logs
comm_time_log_iter .............................. None
config_logger_dir ...............................
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 2
cp_comm_type .................................... ['p2p']
create_attention_mask_in_dataloader ............. True
cross_entropy_fusion_impl ....................... native
cross_entropy_loss_fusion ....................... False
cuda_graph_impl ................................. none
cuda_graph_scope ................................ full
cuda_graph_warmup_steps ......................... 3
data_args_path .................................. None
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_sharding_strategy ................. no_shard
data_parallel_size .............................. 1
data_path ....................................... ['/workspace/data/oscar/oscar-1GB_head-qwen_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_average_in_collective ....................... True
ddp_bucket_size ................................. None
ddp_num_buckets ................................. None
ddp_pad_buckets_for_high_nccl_busbw ............. False
decode_only_cuda_graphs ......................... False
decoder_first_pipeline_num_layers ............... None
decoder_last_pipeline_num_layers ................ None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
decrease_batch_size_if_needed ................... False
defer_embedding_wgrad_compute ................... False
delay_wgrad_compute ............................. False
deprecated_use_mcore_models ..................... True
deterministic_mode .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_backward_fusion ......................... False
disable_bf16_reduced_precision_matmul ........... False
disable_chunked_prefill ......................... False
disable_mamba_mem_eff_path ...................... False
disable_straggler_on_startup .................... False
disable_symmetric_registration .................. False
disable_vision_class_token ...................... False
dist_ckpt_format_deprecated ..................... None
dist_ckpt_optim_fully_reshardable ............... False
dist_ckpt_save_pre_mcore_014 .................... False
dist_ckpt_strictness ............................ assume_ok_unexpected
dist_url ........................................ tcp://localhost:25900
distrib_optim_fully_reshardable_mem_efficient ... False
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
distributed_timeout_seconds_after_init .......... None
embedding_init_method_std ....................... None
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_bw_flux_gemmrs_op ........................ True
enable_cuda_graph ............................... False
enable_dynamic_grad_comp ........................ False
enable_experimental ............................. False
enable_ft_package ............................... False
enable_full_sharding_in_hsdp .................... False
enable_gloo_process_groups ...................... True
enable_msc ...................................... True
enable_one_logger ............................... True
enable_vocab_parallel ........................... False
encoder_num_layers .............................. 36
encoder_seq_length .............................. 8192
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
error_injection_rate ............................ 0
error_injection_type ............................ transient_error
eval_interval ................................... 1000
eval_iters ...................................... 5
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
exp_avg_dtype ................................... torch.float32
exp_avg_sq_dtype ................................ torch.float32
expert_model_parallel_size ...................... 1
expert_tensor_parallel_size ..................... 4
export_force_local_attention .................... False
export_kd_cfg ................................... None
export_kd_teacher_ckpt_format ................... None
export_kd_teacher_load .......................... None
export_kv_cache_quant ........................... False
export_legacy_megatron .......................... False
export_model_type ............................... GPTModel
export_moe_apply_probs_on_input ................. False
export_offline_model ............................ False
export_qk_l2_norm ............................... False
export_quant_cfg ................................ None
export_real_quant_cfg ........................... None
export_te_mcore_model ........................... False
external_cuda_graph ............................. False
extra_vocab_size ................................ 0
ffn_hidden_size ................................. 12288
fine_grained_activation_offloading .............. False
finetune ........................................ False
finetune_data_split ............................. train
finetune_hf_dataset ............................. None
first_last_layers_bf16 .......................... False
flash_decode .................................... False
flux_transpose_weight ........................... False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp4 ............................................. None
fp4_param ....................................... False
fp4_recipe ...................................... nvfp4
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_param_gather ................................ False
fp8_recipe ...................................... delayed
fp8_wgrad ....................................... True
freeze_LM ....................................... False
freeze_ViT ...................................... False
fsdp_double_buffer .............................. False
full_validation ................................. False
global_batch_size ............................... 32
glu_linear_offset ............................... 0.0
grad_comp ....................................... False
grad_comp_warm_up ............................... 0.1
grad_reduce_in_bf16 ............................. False
gradient_accumulation_fusion .................... True
gradient_reduce_div_fusion ...................... True
gradient_sample_ratio ........................... 1.0
group_query_attention ........................... True
grpo_clamp_eps_lower ............................ 0.01
grpo_clamp_eps_upper ............................ 0.01
grpo_default_temperature ........................ 1.0
grpo_default_top_p .............................. 0
grpo_entropy_term_weight ........................ 0.0
grpo_filter_groups_with_same_reward ............. False
grpo_group_size ................................. 2
grpo_iterations ................................. 2
grpo_kl_beta .................................... 0.001
grpo_prompts_per_step ........................... 32
head_lr_mult .................................... 1.0
heterogeneous_layers_config_encoded_json ........ None
heterogeneous_layers_config_path ................ None
hidden_dropout .................................. 0.0
hidden_size ..................................... 4096
hierarchical_context_parallel_sizes ............. None
high_priority_stream_groups ..................... []
hybrid_attention_ratio .......................... 0.0
hybrid_mlp_ratio ................................ 0.0
hybrid_override_pattern ......................... None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... -1
inference_dynamic_batching ...................... False
inference_dynamic_batching_block_size ........... 256
inference_dynamic_batching_buffer_guaranteed_fraction 0.2
inference_dynamic_batching_buffer_overflow_factor None
inference_dynamic_batching_buffer_size_gb ....... 40.0
inference_dynamic_batching_max_requests_override None
inference_dynamic_batching_max_tokens_override .. None
inference_dynamic_batching_num_cuda_graphs ...... 16
inference_dynamic_batching_track_paused_request_events False
inference_dynamic_batching_unified_memory_level . 0
inference_max_batch_size ........................ 8
inference_max_seq_length ........................ 2560
inference_rng_tracker ........................... False
init_method_std ................................. 0.006
init_method_xavier_uniform ...................... False
init_model_with_meta_device ..................... False
initial_loss_scale .............................. 4294967296
inprocess_active_world_size ..................... 1
inprocess_barrier_timeout ....................... 120
inprocess_completion_timeout .................... 120
inprocess_empty_cuda_cache ...................... False
inprocess_granularity ........................... node
inprocess_hard_timeout .......................... 90
inprocess_heartbeat_interval .................... 30
inprocess_heartbeat_timeout ..................... 60
inprocess_last_call_wait ........................ 1
inprocess_max_iterations ........................ None
inprocess_monitor_process_interval .............. 1.0
inprocess_monitor_thread_interval ............... 1.0
inprocess_progress_watchdog_interval ............ 1.0
inprocess_restart ............................... False
inprocess_soft_timeout .......................... 60
inprocess_termination_grace_time ................ 1
is_hybrid_model ................................. False
iter_per_epoch .................................. 1250
iteration_sample_ratio .......................... 0.01
iterations_to_skip .............................. []
keep_fp8_transpose_cache ........................ False
kitchen_config_file ............................. None
kitchen_recipe_number ........................... None
kv_channels ..................................... 128
kv_lora_rank .................................... 32
langrl_env_config ............................... None
langrl_external_server .......................... False
langrl_inference_server_conversation_template ... None
langrl_inference_server_type .................... inplace_megatron
lazy_mpu_init ................................... None
legacy_tokenizer ................................ False
load ............................................ /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints
load_main_params_from_ckpt ...................... None
local_rank ...................................... 0
log_energy ...................................... False
log_interval .................................... 1
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. True
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
logging_level ................................... None
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 3e-05
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 1
lr_warmup_samples ............................... 0
lr_wsd_decay_iters .............................. None
lr_wsd_decay_samples ............................ None
lr_wsd_decay_style .............................. exponential
main_grads_dtype ................................ torch.float32
main_params_dtype ............................... torch.float32
make_vocab_size_divisible_by .................... 128
mamba_head_dim .................................. 64
mamba_num_groups ................................ 8
mamba_num_heads ................................. None
mamba_state_dim ................................. 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_padding_length .............................. None
max_position_embeddings ......................... 40960
max_tokens_to_oom ............................... 12000
memory_snapshot_path ............................ snapshot.pickle
merge_file ...................................... None
micro_batch_size ................................ 1
microbatch_group_size_per_vp_stage .............. None
mid_level_dataset_surplus ....................... 0.005
min_loss_scale .................................. 1.0
min_lr .......................................... 3e-06
min_offloaded_tensor_size ....................... 1048576
mlp_chunks_for_prefill .......................... 1
mmap_bin_files .................................. True
mock_data ....................................... False
modelopt_enabled ................................ False
moe_apply_probs_on_input ........................ False
moe_aux_loss_coeff .............................. 0.0
moe_deepep_num_sms .............................. 20
moe_enable_deepep ............................... False
moe_expert_capacity_factor ...................... None
moe_extended_tp ................................. False
moe_ffn_hidden_size ............................. None
moe_grouped_gemm ................................ False
moe_input_jitter_eps ............................ None
moe_layer_freq .................................. 1
moe_layer_recompute ............................. False
moe_pad_expert_input_to_capacity ................ False
moe_pad_experts_for_cuda_graph_inference ........ False
moe_per_layer_logging ........................... False
moe_permute_fusion .............................. False
moe_router_bias_update_rate ..................... 0.001
moe_router_dtype ................................ None
moe_router_enable_expert_bias ................... False
moe_router_force_load_balancing ................. False
moe_router_fusion ............................... False
moe_router_group_topk ........................... None
moe_router_load_balancing_type .................. aux_loss
moe_router_num_groups ........................... None
moe_router_padding_for_fp8 ...................... False
moe_router_pre_softmax .......................... False
moe_router_score_function ....................... softmax
moe_router_topk ................................. 2
moe_router_topk_scaling_factor .................. None
moe_shared_expert_intermediate_size ............. None
moe_shared_expert_overlap ....................... False
moe_token_dispatcher_type ....................... allgather
moe_token_drop_policy ........................... probs
moe_upcycling_granularity ....................... 1
moe_use_legacy_grouped_gemm ..................... False
moe_use_upcycling ............................... False
moe_z_loss_coeff ................................ None
mrope_section ................................... None
mscale .......................................... 1.0
mscale_all_dim .................................. 0.0
mtp_loss_scaling_factor ......................... 0.1
mtp_num_layers .................................. None
multi_latent_attention .......................... False
multiple_validation_sets ........................ False
nccl_all_reduce_for_prefill ..................... False
nccl_communicator_config_path ................... None
nccl_ub ......................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_rope_freq .................................... None
no_save_optim ................................... None
no_save_rng ..................................... None
non_persistent_ckpt_type ........................ None
non_persistent_global_ckpt_dir .................. None
non_persistent_local_ckpt_algo .................. fully_parallel
non_persistent_local_ckpt_dir ................... None
non_persistent_save_interval .................... None
norm_epsilon .................................... 1e-05
normalization ................................... RMSNorm
num_attention_heads ............................. 32
num_channels .................................... 3
num_classes ..................................... 1000
num_dataset_builder_threads ..................... 1
num_distributed_optimizer_instances ............. 1
num_experts ..................................... None
num_layers ...................................... 36
num_layers_at_end_in_bf16 ....................... 1
num_layers_at_start_in_bf16 ..................... 1
num_layers_per_virtual_pipeline_stage ........... None
num_layers_to_build ............................. None
num_query_groups ................................ 8
num_virtual_stages_per_pipeline_rank ............ None
num_workers ..................................... 2
object_storage_cache_path ....................... None
offload_modules ................................. None
one_logger_async ................................ False
one_logger_project .............................. megatron-lm
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
optimizer_cpu_offload ........................... False
optimizer_offload_fraction ...................... 1.0
output_bert_embeddings .......................... False
overlap_cpu_optimizer_d2h_h2d ................... False
overlap_ep_comm_with_split_attn ................. False
overlap_grad_reduce ............................. True
overlap_moe_expert_parallel_comm ................ False
overlap_p2p_comm ................................ False
overlap_p2p_comm_warmup_flush ................... False
overlap_param_gather ............................ False
overlap_param_gather_with_optimizer_step ........ False
override_opt_param_scheduler .................... False
padded_vocab_size ............................... None
parallel_linear_impl ............................ None
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
patch_size ...................................... 14
per_split_data_args_path ........................ None
perform_initialization .......................... True
perform_rl_step ................................. False
pin_cpu_grads ................................... True
pin_cpu_params .................................. True
pipe_sp_splits .................................. 1
pipe_sp_strategy ................................ average
pipeline_model_parallel_comm_backend ............ None
pipeline_model_parallel_layout .................. None
pipeline_model_parallel_size .................... 1
position_embedding_type ......................... rope
pretrained_checkpoint ........................... None
profile ......................................... False
profile_dir ..................................... ./
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
q_lora_rank ..................................... None
qk_head_dim ..................................... 128
qk_l2_norm ...................................... False
qk_layernorm .................................... True
qk_pos_emb_head_dim ............................. 64
quant_comm_bits ................................. 8
quant_group_size ................................ None
quant_scale_dtype ............................... bf16
query_in_block_prob ............................. 0.1
quick_geglu ..................................... False
rampup_batch_size ............................... None
rank ............................................ 0
rank_adjust_window_size ......................... 1000
recompute_activation_function ................... False
recompute_activation_function_num_layers ........ None
recompute_granularity ........................... None
recompute_method ................................ None
recompute_modules ............................... None
recompute_num_layers ............................ None
record_memory_history ........................... False
reduce_recompute_for_last_chunk ................. False
relative_attention_max_distance ................. 128
relative_attention_num_buckets .................. 32
replication ..................................... False
replication_factor .............................. 2
replication_jump ................................ None
reproduce ....................................... False
rerun_mode ...................................... validate_results
reset_attention_mask ............................ False
reset_position_ids .............................. False
result_rejected_tracker_filename ................ None
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
reuse_fp32_param ................................ False
reuse_grad_buf_for_mxfp8_param_ag ............... False
rl_calculate_intra_group_similarity ............. False
rl_importance_sampling_truncation_coef .......... None
rl_inference_logprobs_is_correction ............. False
rl_offload_kv_cache_during_training ............. False
rl_offload_optimizer_during_inference ........... False
rl_partial_rollouts ............................. False
rl_prompts_per_eval ............................. 32
rl_remove_kv_cache_during_training .............. False
rl_reset_cuda_graphs ............................ False
rl_sequence_packing_algo ........................ fifo
rl_sequence_packing_bin_size .................... 8192
rl_use_sequence_packing ......................... False
rope_scaling_factor ............................. 8.0
rope_type ....................................... None
rotary_base ..................................... 1000000
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_scaling_factor ........................... 1.0
rotary_seq_len_interpolation_factor ............. None
run_workload_inspector_server ................... False
sample_rate ..................................... 1.0
save ............................................ /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints
save_flux_gather_input .......................... False
save_interval ................................... 1000
save_retain_interval ............................ None
scatter_gather_tensors_in_pipeline .............. True
schedule_method ................................. vanilla
schedule_timer_end .............................. 20
schedule_timer_start ............................ 10
seed ............................................ 1234
seq_length ...................................... 8192
sequence_parallel ............................... True
sft ............................................. False
sft_tokenizer_prompt_format ..................... nemotron-h-aligned
sgd_momentum .................................... 0.9
sharp_enabled_group ............................. None
short_seq_prob .................................. 0.1
skip_train ...................................... False
skipped_train_samples ........................... 0
softmax_type .................................... vanilla
spatial_merge_size .............................. 2
spec ............................................ None
specify_layers .................................. None
split ........................................... 949,50,1
squared_relu .................................... False
start_weight_decay .............................. 0.1
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
strict_fsdp_dtensor_load ........................ True
suggested_communication_unit_size ............... None
swap_attention .................................. False
swap_modules .................................... self_attention
swiglu .......................................... True
swin_backbone_type .............................. tiny
symmetric_ar_type ............................... None
te_rng_tracker .................................. False
teacher_model_config ............................ None
temporal_patch_size ............................. 2
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints/tensorboard
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
tiktoken_num_special_tokens ..................... 1000
tiktoken_pattern ................................ None
tiktoken_special_tokens ......................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_metadata .............................. None
tokenizer_model ................................. /home/models/qwen3/Qwen3-8B
tokenizer_type .................................. HuggingFaceTokenizer
torch_fsdp2_reshard_after_forward ............... True
tp_comm_bootstrap_backend ....................... nccl
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... 50
train_samples ................................... None
train_sync_interval ............................. None
transformer_impl ................................ transformer_engine
transformer_pipeline_model_parallel_size ........ 1
trust_remote_code ............................... False
untie_embeddings_and_output_weights ............. True
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_ckpt_memory_cache ........................... False
use_cpu_initialization .......................... None
use_dist_ckpt ................................... False
use_dist_ckpt_deprecated ........................ False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_fused_weighted_squared_relu ................. False
use_hip_profiler ................................ False
use_legacy_models ............................... False
use_megatron_fsdp ............................... False
use_mp_args_from_checkpoint_args ................ False
use_one_sent_docs ............................... False
use_optimizer_feature ........................... False
use_persistent_ckpt_worker ...................... False
use_precision_aware_optimizer ................... False
use_pytorch_profiler ............................ False
use_qk_norm ..................................... False
use_quantize_comm ............................... False
use_ring_exchange_p2p ........................... False
use_rope_scaling ................................ False
use_rotary_position_embeddings .................. False
use_sharp ....................................... False
use_te_activation_func .......................... False
use_tokenizer_model_from_checkpoint_args ........ True
use_torch_fsdp2 ................................. False
use_torch_optimizer_for_cpu_offload ............. False
use_tp_pp_dp_mapping ............................ False
v_head_dim ...................................... 128
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_entity ....................................
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
wgrad_deferral_limit ............................ 0
window_attn_skip_freq ........................... None
window_size ..................................... None
world_size ...................................... 8
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
> building HuggingFaceTokenizer tokenizer ...
[WARNING | megatron.core.rerun_state_machine]: RerunStateMachine initialized in mode validate_results
> padded vocab (size: 151669) with 395 dummy tokens (new size: 152064)
> initializing torch distributed ...
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later?), no TensorBoard logs will be written.
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0211 16:50:02.731345 186405 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W0211 16:50:02.731526 186402 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W0211 16:50:03.134291 186407 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W0211 16:50:03.135790 186395 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
tp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
dp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
ep_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
etp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
edp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
cp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
tp-cp_group: [[0, 1, 2, 3, 4, 5, 6, 7]]
embd-pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
pos_embd-pp_group: [[0], [1], [2], [3], [4], [5], [6], [7]]
tp-ep_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
tp-dp-cp_group: [[0, 1, 2, 3, 4, 5, 6, 7]]
tp-pp_group: [[0, 1, 2, 3], [4, 5, 6, 7]]
dp-cp_group: [[0, 4], [1, 5], [2, 6], [3, 7]]
> initialized tensor model parallel with size 4
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/datasets'
W0211 16:50:03.230739 186403 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W0211 16:50:03.232905 186401 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W0211 16:50:03.237195 186406 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/megatron_0210/test2/dcu_megatron/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.040 seconds
> compiling and loading fused kernels ...
W0211 16:50:03.283905 186404 ProcessGroupNCCL.cpp:4232] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
>>> done with compiling and loading fused kernels. Compilation time: 1.224 seconds
time to initialize megatron (seconds): 7.497
[after megatron is initialized] datetime: 2026-02-11 16:50:07
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (1, 0): 2048177152
> number of parameters on (tensor, pipeline) model parallel rank (2, 0): 2048177152
> number of parameters on (tensor, pipeline) model parallel rank (3, 0): 2048177152
[rank 0] GPTModel(
(embedding): LanguageModelEmbedding(
(word_embeddings): VocabParallelEmbedding()
(embedding_dropout): Dropout(p=0.0, inplace=False)
)
(rotary_pos_emb): RotaryEmbedding()
(decoder): TransformerBlock(
(layers): ModuleList(
(0-35): 36 x TransformerLayer(
(input_layernorm): IdentityOp()
(self_attention): SelfAttention(
(core_attention): TEDotProductAttention(
(flash_attention): FlashAttention()
(fused_attention): FusedAttention()
(unfused_attention): UnfusedDotProductAttention(
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
)
)
(linear_proj): TERowParallelLinear(in_features=1024, out_features=4096, bias=False, TP=4)
(linear_qkv): TELayerNormColumnParallelLinear(in_features=4096, out_features=1536, bias=False, TP=4)
(q_layernorm): RMSNorm()
(k_layernorm): RMSNorm()
)
(pre_cross_attn_layernorm): IdentityOp()
(cross_attention): IdentityOp()
(cross_attn_bda): IdentityFuncOp()
(pre_mlp_layernorm): IdentityOp()
(mlp): MLP(
(linear_fc1): TELayerNormColumnParallelLinear(in_features=4096, out_features=6144, bias=False, TP=4)
(linear_fc2): TERowParallelLinear(in_features=3072, out_features=4096, bias=False, TP=4)
)
)
)
(final_layernorm): RMSNorm()
)
(output_layer): ColumnParallelLinear(in_features=4096, out_features=152064, bias=False, TP=4)
)
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 2048177152
WARNING: could not find the metadata file /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
(min, max) time across ranks (ms):
load-checkpoint ................................: (0.95, 0.97)
[after model, optimizer, and learning rate scheduler are built] datetime: 2026-02-11 16:50:07
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 1600
validation: 160
test: 160
> building train, validation, and test datasets for GPT ...
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2026-02-11 16:50:08
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (349.92, 359.61)
train/valid/test-data-iterators-setup ..........: (256.17, 355.63)
training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2026-02-11 16:50:08
[WARNING | megatron.core.rerun_state_machine]: Result validation enabled
[2026-02-11 16:50:38] iteration 1/ 50 | consumed samples: 32 | elapsed time per iteration (ms): 30247.9 | throughput per GPU (TFLOP/s/GPU): 57.0 | learning rate: 3.000000E-05 | global batch size: 32 | lm loss: 1.202196E+01 | loss scale: 1.0 | grad norm: 47.059 | number of skipped iterations: 0 | number of nan iterations: 0 |
Number of parameters in transformer block in billions: 6.95
Number of parameters in embedding layers in billions: 1.25
Total number of parameters in billions: 8.19
Number of parameters in most loaded shard in billions: 2.0480
compute_activation_memory_without_sp
Activation memory footprint per transformer layer (precise, without SP): 512.0 MB
Theoretical memory footprints: weight and optimizer=35156.57 MB, activation=20078.17 MB, total=55234.73 MB
[Rank 7] (after 1 iterations) memory (MB) | allocated: 23996.30712890625 | max allocated: 23996.32275390625 | reserved: 29766.0 | max reserved: 29766.0
[Rank 5] (after 1 iterations) memory (MB) | allocated: 23996.30712890625 | max allocated: 23996.32275390625 | reserved: 29928.0 | max reserved: 29928.0
[Rank 6] (after 1 iterations) memory (MB) | allocated: 23996.30712890625 | max allocated: 23996.32275390625 | reserved: 29928.0 | max reserved: 29928.0
[Rank 1] (after 1 iterations) memory (MB) | allocated: 23995.0146484375 | max allocated: 23995.0302734375 | reserved: 27962.0 | max reserved: 27962.0
[Rank 4] (after 1 iterations) memory (MB) | allocated: 23996.30712890625 | max allocated: 23996.32275390625 | reserved: 29228.0 | max reserved: 29228.0
[Rank 3] (after 1 iterations) memory (MB) | allocated: 23995.0146484375 | max allocated: 23995.0302734375 | reserved: 28322.0 | max reserved: 28322.0
[Rank 2] (after 1 iterations) memory (MB) | allocated: 23995.0146484375 | max allocated: 23995.0302734375 | reserved: 29350.0 | max reserved: 29350.0
[Rank 0] (after 1 iterations) memory (MB) | allocated: 23995.0146484375 | max allocated: 23995.0302734375 | reserved: 24686.0 | max reserved: 24686.0
[2026-02-11 16:50:53] iteration 2/ 50 | consumed samples: 64 | elapsed time per iteration (ms): 15385.6 | throughput per GPU (TFLOP/s/GPU): 112.2 | learning rate: 2.997226E-05 | global batch size: 32 | lm loss: 1.202580E+01 | loss scale: 1.0 | grad norm: 48.137 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:51:09] iteration 3/ 50 | consumed samples: 96 | elapsed time per iteration (ms): 15321.7 | throughput per GPU (TFLOP/s/GPU): 112.6 | learning rate: 2.988916E-05 | global batch size: 32 | lm loss: 1.092907E+01 | loss scale: 1.0 | grad norm: 165.459 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:51:24] iteration 4/ 50 | consumed samples: 128 | elapsed time per iteration (ms): 15654.0 | throughput per GPU (TFLOP/s/GPU): 110.2 | learning rate: 2.975105E-05 | global batch size: 32 | lm loss: 1.045790E+01 | loss scale: 1.0 | grad norm: 39.117 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:51:40] iteration 5/ 50 | consumed samples: 160 | elapsed time per iteration (ms): 15348.8 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.955848E-05 | global batch size: 32 | lm loss: 1.024906E+01 | loss scale: 1.0 | grad norm: 4.208 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:51:55] iteration 6/ 50 | consumed samples: 192 | elapsed time per iteration (ms): 15363.3 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.931225E-05 | global batch size: 32 | lm loss: 1.000299E+01 | loss scale: 1.0 | grad norm: 3.643 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:52:10] iteration 7/ 50 | consumed samples: 224 | elapsed time per iteration (ms): 15363.5 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.901338E-05 | global batch size: 32 | lm loss: 1.113537E+01 | loss scale: 1.0 | grad norm: 1047.474 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:52:26] iteration 8/ 50 | consumed samples: 256 | elapsed time per iteration (ms): 15697.8 | throughput per GPU (TFLOP/s/GPU): 109.9 | learning rate: 2.866308E-05 | global batch size: 32 | lm loss: 9.984780E+00 | loss scale: 1.0 | grad norm: 4.541 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:52:41] iteration 9/ 50 | consumed samples: 288 | elapsed time per iteration (ms): 15353.3 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.826280E-05 | global batch size: 32 | lm loss: 9.881529E+00 | loss scale: 1.0 | grad norm: 3.893 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:52:57] iteration 10/ 50 | consumed samples: 320 | elapsed time per iteration (ms): 15353.3 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.781419E-05 | global batch size: 32 | lm loss: 9.621094E+00 | loss scale: 1.0 | grad norm: 3.681 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:53:12] iteration 11/ 50 | consumed samples: 352 | elapsed time per iteration (ms): 15357.6 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.731908E-05 | global batch size: 32 | lm loss: 9.587229E+00 | loss scale: 1.0 | grad norm: 4.020 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:53:28] iteration 12/ 50 | consumed samples: 384 | elapsed time per iteration (ms): 15708.9 | throughput per GPU (TFLOP/s/GPU): 109.8 | learning rate: 2.677952E-05 | global batch size: 32 | lm loss: 9.429702E+00 | loss scale: 1.0 | grad norm: 3.008 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:53:43] iteration 13/ 50 | consumed samples: 416 | elapsed time per iteration (ms): 15341.5 | throughput per GPU (TFLOP/s/GPU): 112.5 | learning rate: 2.619772E-05 | global batch size: 32 | lm loss: 9.362844E+00 | loss scale: 1.0 | grad norm: 3.049 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:53:58] iteration 14/ 50 | consumed samples: 448 | elapsed time per iteration (ms): 15353.9 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.557606E-05 | global batch size: 32 | lm loss: 9.235973E+00 | loss scale: 1.0 | grad norm: 3.024 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:54:14] iteration 15/ 50 | consumed samples: 480 | elapsed time per iteration (ms): 15368.8 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.491711E-05 | global batch size: 32 | lm loss: 9.124246E+00 | loss scale: 1.0 | grad norm: 3.094 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:54:29] iteration 16/ 50 | consumed samples: 512 | elapsed time per iteration (ms): 15690.1 | throughput per GPU (TFLOP/s/GPU): 110.0 | learning rate: 2.422357E-05 | global batch size: 32 | lm loss: 9.039256E+00 | loss scale: 1.0 | grad norm: 3.090 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:54:45] iteration 17/ 50 | consumed samples: 544 | elapsed time per iteration (ms): 15362.6 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.349830E-05 | global batch size: 32 | lm loss: 8.930184E+00 | loss scale: 1.0 | grad norm: 3.067 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:55:00] iteration 18/ 50 | consumed samples: 576 | elapsed time per iteration (ms): 15365.7 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.274427E-05 | global batch size: 32 | lm loss: 8.846102E+00 | loss scale: 1.0 | grad norm: 2.797 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:55:16] iteration 19/ 50 | consumed samples: 608 | elapsed time per iteration (ms): 15354.9 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 2.196458E-05 | global batch size: 32 | lm loss: 8.751925E+00 | loss scale: 1.0 | grad norm: 2.625 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:55:31] iteration 20/ 50 | consumed samples: 640 | elapsed time per iteration (ms): 15677.6 | throughput per GPU (TFLOP/s/GPU): 110.1 | learning rate: 2.116243E-05 | global batch size: 32 | lm loss: 8.664285E+00 | loss scale: 1.0 | grad norm: 2.482 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:55:47] iteration 21/ 50 | consumed samples: 672 | elapsed time per iteration (ms): 15370.2 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 2.034112E-05 | global batch size: 32 | lm loss: 8.609591E+00 | loss scale: 1.0 | grad norm: 2.400 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:56:02] iteration 22/ 50 | consumed samples: 704 | elapsed time per iteration (ms): 15353.5 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.950403E-05 | global batch size: 32 | lm loss: 8.478221E+00 | loss scale: 1.0 | grad norm: 2.279 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:56:17] iteration 23/ 50 | consumed samples: 736 | elapsed time per iteration (ms): 15359.4 | throughput per GPU (TFLOP/s/GPU): 112.3 | learning rate: 1.865460E-05 | global batch size: 32 | lm loss: 8.495676E+00 | loss scale: 1.0 | grad norm: 2.119 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:56:33] iteration 24/ 50 | consumed samples: 768 | elapsed time per iteration (ms): 15682.9 | throughput per GPU (TFLOP/s/GPU): 110.0 | learning rate: 1.779631E-05 | global batch size: 32 | lm loss: 8.401316E+00 | loss scale: 1.0 | grad norm: 2.247 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:56:48] iteration 25/ 50 | consumed samples: 800 | elapsed time per iteration (ms): 15347.3 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.693270E-05 | global batch size: 32 | lm loss: 8.394979E+00 | loss scale: 1.0 | grad norm: 2.109 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:57:04] iteration 26/ 50 | consumed samples: 832 | elapsed time per iteration (ms): 15358.2 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.606730E-05 | global batch size: 32 | lm loss: 8.387753E+00 | loss scale: 1.0 | grad norm: 2.035 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:57:19] iteration 27/ 50 | consumed samples: 864 | elapsed time per iteration (ms): 15357.4 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.520369E-05 | global batch size: 32 | lm loss: 8.329927E+00 | loss scale: 1.0 | grad norm: 1.830 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:57:35] iteration 28/ 50 | consumed samples: 896 | elapsed time per iteration (ms): 15658.3 | throughput per GPU (TFLOP/s/GPU): 110.2 | learning rate: 1.434540E-05 | global batch size: 32 | lm loss: 8.217674E+00 | loss scale: 1.0 | grad norm: 1.822 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:57:50] iteration 29/ 50 | consumed samples: 928 | elapsed time per iteration (ms): 15348.2 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.349597E-05 | global batch size: 32 | lm loss: 8.206045E+00 | loss scale: 1.0 | grad norm: 1.715 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:58:05] iteration 30/ 50 | consumed samples: 960 | elapsed time per iteration (ms): 15348.4 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 1.265888E-05 | global batch size: 32 | lm loss: 8.208779E+00 | loss scale: 1.0 | grad norm: 1.603 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:58:21] iteration 31/ 50 | consumed samples: 992 | elapsed time per iteration (ms): 15433.1 | throughput per GPU (TFLOP/s/GPU): 111.8 | learning rate: 1.183757E-05 | global batch size: 32 | lm loss: 8.186785E+00 | loss scale: 1.0 | grad norm: 1.609 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:58:36] iteration 32/ 50 | consumed samples: 1024 | elapsed time per iteration (ms): 15588.5 | throughput per GPU (TFLOP/s/GPU): 110.7 | learning rate: 1.103542E-05 | global batch size: 32 | lm loss: 8.070101E+00 | loss scale: 1.0 | grad norm: 1.694 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:58:52] iteration 33/ 50 | consumed samples: 1056 | elapsed time per iteration (ms): 15335.9 | throughput per GPU (TFLOP/s/GPU): 112.5 | learning rate: 1.025573E-05 | global batch size: 32 | lm loss: 8.066827E+00 | loss scale: 1.0 | grad norm: 1.641 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:59:07] iteration 34/ 50 | consumed samples: 1088 | elapsed time per iteration (ms): 15352.4 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 9.501700E-06 | global batch size: 32 | lm loss: 8.050054E+00 | loss scale: 1.0 | grad norm: 1.604 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:59:23] iteration 35/ 50 | consumed samples: 1120 | elapsed time per iteration (ms): 15444.5 | throughput per GPU (TFLOP/s/GPU): 111.7 | learning rate: 8.776425E-06 | global batch size: 32 | lm loss: 8.065158E+00 | loss scale: 1.0 | grad norm: 1.521 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:59:38] iteration 36/ 50 | consumed samples: 1152 | elapsed time per iteration (ms): 15611.3 | throughput per GPU (TFLOP/s/GPU): 110.5 | learning rate: 8.082888E-06 | global batch size: 32 | lm loss: 7.998910E+00 | loss scale: 1.0 | grad norm: 1.496 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 16:59:54] iteration 37/ 50 | consumed samples: 1184 | elapsed time per iteration (ms): 15338.0 | throughput per GPU (TFLOP/s/GPU): 112.5 | learning rate: 7.423938E-06 | global batch size: 32 | lm loss: 7.993576E+00 | loss scale: 1.0 | grad norm: 1.429 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:00:09] iteration 38/ 50 | consumed samples: 1216 | elapsed time per iteration (ms): 15354.6 | throughput per GPU (TFLOP/s/GPU): 112.4 | learning rate: 6.802284E-06 | global batch size: 32 | lm loss: 7.927972E+00 | loss scale: 1.0 | grad norm: 1.440 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:00:24] iteration 39/ 50 | consumed samples: 1248 | elapsed time per iteration (ms): 15551.1 | throughput per GPU (TFLOP/s/GPU): 111.0 | learning rate: 6.220479E-06 | global batch size: 32 | lm loss: 7.943327E+00 | loss scale: 1.0 | grad norm: 1.296 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:00:40] iteration 40/ 50 | consumed samples: 1280 | elapsed time per iteration (ms): 15499.9 | throughput per GPU (TFLOP/s/GPU): 111.3 | learning rate: 5.680916E-06 | global batch size: 32 | lm loss: 7.900488E+00 | loss scale: 1.0 | grad norm: 1.334 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:00:55] iteration 41/ 50 | consumed samples: 1312 | elapsed time per iteration (ms): 15316.7 | throughput per GPU (TFLOP/s/GPU): 112.7 | learning rate: 5.185811E-06 | global batch size: 32 | lm loss: 8.008162E+00 | loss scale: 1.0 | grad norm: 1.218 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:01:11] iteration 42/ 50 | consumed samples: 1344 | elapsed time per iteration (ms): 15322.9 | throughput per GPU (TFLOP/s/GPU): 112.6 | learning rate: 4.737197E-06 | global batch size: 32 | lm loss: 7.860763E+00 | loss scale: 1.0 | grad norm: 1.340 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:01:26] iteration 43/ 50 | consumed samples: 1376 | elapsed time per iteration (ms): 15501.7 | throughput per GPU (TFLOP/s/GPU): 111.3 | learning rate: 4.336920E-06 | global batch size: 32 | lm loss: 7.921451E+00 | loss scale: 1.0 | grad norm: 1.185 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:01:42] iteration 44/ 50 | consumed samples: 1408 | elapsed time per iteration (ms): 15476.8 | throughput per GPU (TFLOP/s/GPU): 111.5 | learning rate: 3.986624E-06 | global batch size: 32 | lm loss: 7.933675E+00 | loss scale: 1.0 | grad norm: 1.138 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:01:57] iteration 45/ 50 | consumed samples: 1440 | elapsed time per iteration (ms): 15304.0 | throughput per GPU (TFLOP/s/GPU): 112.8 | learning rate: 3.687747E-06 | global batch size: 32 | lm loss: 7.962870E+00 | loss scale: 1.0 | grad norm: 1.134 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:02:12] iteration 46/ 50 | consumed samples: 1472 | elapsed time per iteration (ms): 15307.2 | throughput per GPU (TFLOP/s/GPU): 112.7 | learning rate: 3.441519E-06 | global batch size: 32 | lm loss: 7.928866E+00 | loss scale: 1.0 | grad norm: 1.133 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:02:28] iteration 47/ 50 | consumed samples: 1504 | elapsed time per iteration (ms): 15483.8 | throughput per GPU (TFLOP/s/GPU): 111.4 | learning rate: 3.248951E-06 | global batch size: 32 | lm loss: 7.920525E+00 | loss scale: 1.0 | grad norm: 1.136 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:02:43] iteration 48/ 50 | consumed samples: 1536 | elapsed time per iteration (ms): 15473.4 | throughput per GPU (TFLOP/s/GPU): 111.5 | learning rate: 3.110835E-06 | global batch size: 32 | lm loss: 7.903946E+00 | loss scale: 1.0 | grad norm: 1.234 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:02:58] iteration 49/ 50 | consumed samples: 1568 | elapsed time per iteration (ms): 15313.9 | throughput per GPU (TFLOP/s/GPU): 112.7 | learning rate: 3.027737E-06 | global batch size: 32 | lm loss: 7.890891E+00 | loss scale: 1.0 | grad norm: 1.169 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2026-02-11 17:03:14] iteration 50/ 50 | consumed samples: 1600 | elapsed time per iteration (ms): 15323.9 | throughput per GPU (TFLOP/s/GPU): 112.6 | learning rate: 3.000000E-06 | global batch size: 32 | lm loss: 7.846929E+00 | loss scale: 1.0 | grad norm: 1.171 | number of skipped iterations: 0 | number of nan iterations: 0 |
[after training is done] datetime: 2026-02-11 17:03:14
saving checkpoint at iteration 50 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints in torch format
successfully saved checkpoint from iteration 50 to /workspace/megatron_0210/test2/dcu_megatron/examples/qwen/8B-checkpoints [ t 1/4, p 1/1 ]
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode disabled
Evaluating on 160 samples
Evaluating iter 1/5
Evaluating iter 2/5
Evaluating iter 3/5
Evaluating iter 4/5
Evaluating iter 5/5
(min, max) time across ranks (ms):
evaluate .......................................: (27421.89, 27422.26)
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
----------------------------------------------------------------------------------------------------------------
validation loss at iteration 50 on validation set | lm loss value: 8.004234E+00 | lm loss PPL: 2.993607E+03 |
----------------------------------------------------------------------------------------------------------------
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode disabled
Evaluating on 160 samples
Evaluating iter 1/5
Evaluating iter 2/5
Evaluating iter 3/5
Evaluating iter 4/5
Evaluating iter 5/5
(min, max) time across ranks (ms):
evaluate .......................................: (26114.85, 26115.52)
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
[WARNING | megatron.core.rerun_state_machine]: Setting RerunStateMachine mode validate_results
----------------------------------------------------------------------------------------------------------
validation loss at iteration 50 on test set | lm loss value: 8.008901E+00 | lm loss PPL: 3.007609E+03 |
----------------------------------------------------------------------------------------------------------
W0211 17:04:59.713297 186405 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.717712 186403 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.731681 186395 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.748812 186401 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.750737 186407 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.753005 186402 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.835309 186404 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0211 17:04:59.837489 186406 ProcessGroupNCCL.cpp:1279] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())