"test/git@developer.sourcefind.cn:change/sglang.git" did not exist on "ec99668ab708d51f377b7dca4fb9a255334eed4f"
Commit 94b1efc1 authored by Reed Wanderman-Milne, committed by A. Unique TensorFlower

Use float32 activation in Transformer.

Float32 is used if the model uses mixed precision with bfloat16. Float16 activations are unchanged.

The motivation is that BERT with the LAMB optimizer and a gelu activation has an unstable loss when gelu is computed in bfloat16. Unfortunately, it is not easy to check whether the LAMB optimizer and gelu are used, and there may be other cases that work better with float32 activations than with bfloat16 activations, so we always compute the activation in float32 instead of bfloat16.

PiperOrigin-RevId: 313618322
parent fbec2dbe
@@ -141,8 +141,14 @@ class Transformer(tf.keras.layers.Layer):
         kernel_constraint=self._kernel_constraint,
         bias_constraint=self._bias_constraint,
         name="intermediate")
+    policy = tf.keras.mixed_precision.experimental.global_policy()
+    if policy.name == "mixed_bfloat16":
+      # bfloat16 causes BERT with the LAMB optimizer to not converge
+      # as well, so we use float32.
+      # TODO(b/154538392): Investigate this.
+      policy = tf.float32
     self._intermediate_activation_layer = tf.keras.layers.Activation(
-        self._intermediate_activation)
+        self._intermediate_activation, dtype=policy)
     self._output_dense = dense_einsum.DenseEinsum(
         output_shape=hidden_size,
         kernel_initializer=self._kernel_initializer,
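
For context, a minimal standalone sketch of the behavior this change introduces (not the model-garden code itself): it assumes the commit-era tf.keras.mixed_precision.experimental API shown in the diff, and uses "relu" as a stand-in for the model's configured activation, since the string "gelu" is only registered in newer TF releases.

    import tensorflow as tf

    # Run all layers with bfloat16 compute dtype and float32 variables.
    tf.keras.mixed_precision.experimental.set_policy("mixed_bfloat16")

    # Mirror the check added in this commit: if the global policy is
    # mixed_bfloat16, build the activation layer in float32 instead.
    policy = tf.keras.mixed_precision.experimental.global_policy()
    if policy.name == "mixed_bfloat16":
      policy = tf.float32

    activation_layer = tf.keras.layers.Activation("relu", dtype=policy)

    # The layer casts its bfloat16 input up to float32 before applying the
    # activation, so the nonlinearity itself runs in float32.
    x = tf.cast(tf.random.uniform([2, 4]), tf.bfloat16)
    y = activation_layer(x)
    print(activation_layer.dtype)  # float32, despite the mixed_bfloat16 policy
    print(y.dtype)                 # float32; downstream layers cast as needed

Under the mixed_bfloat16 policy the surrounding dense layers still compute in bfloat16; only the activation is forced to float32, which is what keeps the change cheap while avoiding the convergence issue described above.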