[2024-08-16 18:16:08,618] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
using world size: 64, data-parallel size: 16, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 4
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:HuggingFaceTokenizer
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... False
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_save ...................................... None
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.0
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
calculate_per_token_loss ........................ False
check_for_nan_in_loss_and_grad .................. True
check_weight_hash_across_dp_replicas_interval ... None
ckpt_assume_constant_structure .................. False
ckpt_fully_parallel_load ........................ False
ckpt_fully_parallel_save ........................ False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 1
create_attention_mask_in_dataloader ............. True
cross_entropy_loss_fusion ....................... False
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_size .............................. 16
data_path ....................................... ['./dataset/alpaca_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_average_in_collective ....................... True
ddp_bucket_size ................................. None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
delay_grad_reduce ............................... True
delay_param_gather .............................. False
deprecated_use_mcore_models ..................... False
deterministic_mode .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_straggler_on_startup .................... False
dist_ckpt_format ................................ torch_dist
dist_url ........................................ tcp://:34566
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_one_logger ............................... False
encoder_num_layers .............................. 32
encoder_seq_length .............................. 8192
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 1000
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
expert_model_parallel_size ...................... 1
ffn_hidden_size ................................. 14336
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 128
gradient_accumulation_fusion .................... False
group_query_attention ........................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.0
hidden_size ..................................... 4096
hybrid_attention_ratio .......................... 0.0
hybrid_mlp_ratio ................................ 0.0
hybrid_override_pattern ......................... None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.006
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
lazy_mpu_init ................................... None
load ............................................ ./tmp_8b
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. True
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
logging_level ................................... None
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 3e-05
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 1
lr_warmup_samples ............................... 0
lr_wsd_decay_iters .............................. None
lr_wsd_decay_samples ............................ None
lr_wsd_decay_style .............................. exponential
make_vocab_size_divisible_by .................... 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 8192
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 3e-06
mmap_bin_files .................................. True
mock_data ....................................... False
moe_aux_loss_coeff .............................. 0.0
moe_expert_capacity_factor ...................... None
moe_extended_tp ................................. False
moe_grouped_gemm ................................ False
moe_input_jitter_eps ............................ None
moe_layer_recompute ............................. False
moe_pad_expert_input_to_capacity ................ False
moe_per_layer_logging ........................... False
moe_router_load_balancing_type .................. aux_loss
moe_router_topk ................................. 2
moe_token_dispatcher_type ....................... allgather
moe_token_drop_policy ........................... probs
moe_z_loss_coeff ................................ None
nccl_communicator_config_path ................... None
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
norm_epsilon .................................... 1e-05
normalization ................................... RMSNorm
num_attention_heads ............................. 32
num_channels .................................... 3
num_classes ..................................... 1000
num_dataset_builder_threads ..................... 1
num_experts ..................................... None
num_layers ...................................... 32
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 8
num_workers ..................................... 2
one_logger_entity ............................... hwinf_dcm
one_logger_project .............................. e2e-tracking
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
overlap_grad_reduce ............................. True
overlap_p2p_comm ................................ False
overlap_param_gather ............................ False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 4
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... rope
pretrained_checkpoint ........................... None
profile ......................................... False
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
qk_layernorm .................................... False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... selective
recompute_method ................................ None
recompute_num_layers ............................ None
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_seq_len_interpolation_factor ............. None
sample_rate ..................................... 1.0
save ............................................ ./tmp_8b
save_interval ................................... 10000
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 8192
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train ...................................... False
spec ............................................ None
split ........................................... 949,50,1
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.1
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
swiglu .......................................... True
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. ./tmp_8b
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. None
tokenizer_type .................................. HuggingFaceTokenizer
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... 120
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 4
untie_embeddings_and_output_weights ............. True
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cpu_initialization .......................... None
use_dist_ckpt ................................... False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_flash_attn_triton ........................... True
use_flash_attn_v1 ............................... False
use_flash_attn_v2 ............................... False
use_legacy_models ............................... True
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. True
use_tp_pp_dp_mapping ............................ False
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
world_size ...................................... 64
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 8
> building HuggingFaceTokenizer tokenizer ...