Merge branch 'scale_qkt_exp_lr' into 'master'
Scale Q*K (query times key) by 1/layer-number and add exponential decay option

See merge request ADLR/megatron-lm!27
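
For illustration only, here is a minimal sketch of what the two changes named above could look like. It is not taken from the merge request. The function names, the 1-based `layer_number` convention, the retained 1/sqrt(d) factor, and the exact form of the decay (assumed here to apply to the learning rate, as the branch name suggests) are all assumptions.

```python
import math


def scaled_scores(q, k, layer_number):
    """Return attention scores Q*K^T scaled by 1/sqrt(d) and by 1/layer_number.

    q, k: lists of equal-length vectors (seq_len x d).
    layer_number: 1-based layer index (hypothetical convention).
    """
    if layer_number < 1:
        raise ValueError("layer_number must be >= 1")
    d = len(q[0])
    # Usual 1/sqrt(d) attention scaling, with the extra 1/layer_number factor.
    scale = 1.0 / (math.sqrt(d) * layer_number)
    return [
        [scale * sum(qi * kj for qi, kj in zip(q_row, k_row)) for k_row in k]
        for q_row in q
    ]


def exponential_decay_lr(base_lr, decay_rate, step, decay_steps):
    """Generic exponential decay: base_lr * decay_rate ** (step / decay_steps).

    All parameter names are hypothetical, not the option added by the merge request.
    """
    return base_lr * decay_rate ** (step / decay_steps)


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 2.0]]
    k = [[1.0, 0.0], [0.0, 2.0]]
    # Deeper layers get proportionally smaller raw scores.
    print(scaled_scores(q, k, layer_number=1))
    print(scaled_scores(q, k, layer_number=2))
    print(exponential_decay_lr(1.0, 0.5, step=10, decay_steps=10))
```

Dividing the scores by the layer number keeps the pre-softmax values smaller in deeper layers. Whether the actual change compensates for this factor elsewhere is not stated in the commit message.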