"test/git@developer.sourcefind.cn:change/sglang.git" did not exist on "ec99668ab708d51f377b7dca4fb9a255334eed4f"
Use float32 activation in Transformer.
Float32 is used if the model uses mixed precision with bfloat16; float16 activations are unchanged. The motivation is that BERT with the LAMB optimizer and a gelu activation has an unstable loss when gelu is computed in bfloat16. Unfortunately, it is not easy to check whether the LAMB optimizer and gelu are used, and there may be other cases that work better with float32 activations than with bfloat16 activations, so we always compute the activation in float32 instead of bfloat16.

PiperOrigin-RevId: 313618322
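A minimal sketch of the idea, assuming a JAX-style stack; the helper name gelu_in_float32 and the decision to keep the result in float32 are illustrative, not the actual change:

    import jax
    import jax.numpy as jnp

    def gelu_in_float32(x):
        """Apply gelu in float32 when the input is bfloat16.

        float16 and float32 inputs pass through unchanged, matching the
        behavior described above. Whether to cast the result back to
        bfloat16 afterwards is a separate design choice; here we keep it
        in float32 for numerical stability.
        """
        if x.dtype == jnp.bfloat16:
            return jax.nn.gelu(x.astype(jnp.float32))
        return jax.nn.gelu(x)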