"test/git@developer.sourcefind.cn:change/sglang.git" did not exist on "ec99668ab708d51f377b7dca4fb9a255334eed4f"
Commit 94b1efc1 authored by Reed Wanderman-Milne, committed by A. Unique TensorFlower

Use float32 activation in Transformer.

Float32 is used if the model uses mixed precision with bfloat16. Float16 activations are unchanged.

The motivation is that BERT with the LAMB optimizer and a gelu activation has an unstable loss when gelu is computed in bfloat16. Unfortunately, it is not easy to check whether the LAMB optimizer and gelu are used, and there may be other cases that work better with float32 activations than with bfloat16 activations, so we always compute the activation in float32 instead of bfloat16.

PiperOrigin-RevId: 313618322
parent fbec2dbe
@@ -141,8 +141,14 @@ class Transformer(tf.keras.layers.Layer):
         kernel_constraint=self._kernel_constraint,
         bias_constraint=self._bias_constraint,
         name="intermediate")
+    policy = tf.keras.mixed_precision.experimental.global_policy()
+    if policy.name == "mixed_bfloat16":
+      # bfloat16 causes BERT with the LAMB optimizer to not converge
+      # as well, so we use float32.
+      # TODO(b/154538392): Investigate this.
+      policy = tf.float32
     self._intermediate_activation_layer = tf.keras.layers.Activation(
-        self._intermediate_activation)
+        self._intermediate_activation, dtype=policy)
     self._output_dense = dense_einsum.DenseEinsum(
         output_shape=hidden_size,
         kernel_initializer=self._kernel_initializer,
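
For context, a minimal standalone sketch of the behavior this change introduces (not the model-garden code itself): it assumes the commit-era tf.keras.mixed_precision.experimental API shown in the diff, and uses "relu" as a stand-in for the model's configured activation, since the string "gelu" is only registered in newer TF releases.

    import tensorflow as tf

    # Run all layers with bfloat16 compute dtype and float32 variables.
    tf.keras.mixed_precision.experimental.set_policy("mixed_bfloat16")

    # Mirror the check added in this commit: if the global policy is
    # mixed_bfloat16, build the activation layer in float32 instead.
    policy = tf.keras.mixed_precision.experimental.global_policy()
    if policy.name == "mixed_bfloat16":
      policy = tf.float32

    activation_layer = tf.keras.layers.Activation("relu", dtype=policy)

    # The layer casts its bfloat16 input up to float32 before applying the
    # activation, so the nonlinearity itself runs in float32.
    x = tf.cast(tf.random.uniform([2, 4]), tf.bfloat16)
    y = activation_layer(x)
    print(activation_layer.dtype)  # float32, despite the mixed_bfloat16 policy
    print(y.dtype)                 # float32; downstream layers cast as needed

Under the mixed_bfloat16 policy the surrounding dense layers still compute in bfloat16; only the activation is forced to float32, which is what keeps the change cheap while avoiding the convergence issue described above.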