Merge branch 'main' into 'main'

Upgrade Megatron to v0.10

See merge request !3
12 jobs for main in 0 seconds (queued for 61 minutes and 19 seconds)
Status    Job ID   Name                                        Coverage

Test
canceled  #11753   resume.checkpoint.bert.345m_tp1_pp2_1node   [ssh_selene_runner]
failed    #11721   resume.checkpoint.gpt3.345m_tp1_pp2_1node   [ssh_selene_runner]
failed    #11724   train.bert.345m_tp1_pp2_1node_50steps       [ssh_selene_runner]
failed    #11725   train.bert.345m_tp1_pp4_1node_50steps       [ssh_selene_runner]
failed    #11723   train.bert.345m_tp2_pp2_1node_50steps       [ssh_selene_runner]
failed    #11722   train.bert.345m_tp4_pp1_1node_50steps       [ssh_selene_runner]
failed    #11719   train.gpt3.345m_tp1_pp2_1node_50steps       [ssh_selene_runner]
failed    #11720   train.gpt3.345m_tp1_pp4_1node_50steps       [ssh_selene_runner]
failed    #11718   train.gpt3.345m_tp2_pp2_1node_50steps       [ssh_selene_runner]
failed    #11717   train.gpt3.345m_tp4_pp1_1node_50steps       [ssh_selene_runner]
failed    #11726   resume.checkpoint.bert.345m_tp1_pp2_1node   [ssh_selene_runner]

Cleanup
canceled  #11727   cleanup.selene                              [ssh_selene_runner, allowed to fail]

Status  Name                                        Stage  Failure
failed  train.gpt3.345m_tp1_pp4_1node_50steps       Test   There has been a timeout failure or the job got stuck. Check your timeout limits or try again. No job log.
failed  train.gpt3.345m_tp4_pp1_1node_50steps       Test   There has been a timeout failure or the job got stuck. Check your timeout limits or try again. No job log.
failed  train.bert.345m_tp4_pp1_1node_50steps       Test   There has been a timeout failure or the job got stuck. Check your timeout limits or try again. No job log.
failed  train.bert.345m_tp2_pp2_1node_50steps       Test   There has been a timeout failure or the job got stuck. Check your timeout limits or try again. No job log.
failed  train.gpt3.345m_tp2_pp2_1node_50steps       Test   There has been a timeout failure or the job got stuck. Check your timeout limits or try again. No job log.
failed  train.gpt3.345m_tp1_pp2_1node_50steps       Test   There has been a timeout failure or the job got stuck. Check your timeout limits or try again. No job log.
failed  train.bert.345m_tp1_pp2_1node_50steps       Test   There has been a timeout failure or the job got stuck. Check your timeout limits or try again. No job log.
failed  train.bert.345m_tp1_pp4_1node_50steps       Test   There has been a timeout failure or the job got stuck. Check your timeout limits or try again. No job log.
failed  resume.checkpoint.gpt3.345m_tp1_pp2_1node   Test   There has been a timeout failure or the job got stuck. Check your timeout limits or try again. No job log.