"git@developer.sourcefind.cn:OpenDAS/torch-scatter.git" did not exist on "4179782d4571ea3366706fff037fa952d8a95fe5"
Commit 11ccb99e authored by Zongwei Zhou, committed by A. Unique TensorFlower

Temporarily disable explicit allreduce in BERT SQuAD

In BERT SQuAD, disable explicit allreduce for now to keep the original clip_by_global_norm math. With explicit allreduce, the gradients before the allreduce are scaled, so even if clip_by_global_norm is moved before the allreduce (as in TF1 and pre-TF 2.2) it operates on scaled gradients and the math changes. With explicit allreduce it is therefore better to apply clip_by_global_norm after the allreduce.

PiperOrigin-RevId: 299278082
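
The commit message argues that, with loss scaling active (loss divided by num_replicas_in_sync), clipping before the explicit allreduce operates on scaled per-replica gradients and therefore triggers at a different effective threshold than clipping the summed gradients. A minimal numeric sketch of that point, using hypothetical values rather than the model's actual code path:

import tensorflow as tf

num_replicas = 8
true_grad = tf.constant([3.0, 4.0])      # per-replica gradient, global norm 5.0
scaled_grad = true_grad / num_replicas   # what the tape yields when the loss is scaled

# Clipping before allreduce sees the scaled norm (0.625 < 1.0), so nothing is
# clipped and the later cross-replica sum restores the full, unclipped gradient.
pre_clipped, _ = tf.clip_by_global_norm([scaled_grad], clip_norm=1.0)
effective_pre = pre_clipped[0] * num_replicas            # -> [3., 4.]

# Clipping after allreduce sees the summed gradient (norm 5.0) and rescales it.
summed = scaled_grad * num_replicas                      # stand-in for the allreduce sum
post_clipped, _ = tf.clip_by_global_norm([summed], clip_norm=1.0)

print(effective_pre.numpy(), post_clipped[0].numpy())    # [3. 4.] vs [0.6 0.8]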
parent f8777524
@@ -150,7 +150,9 @@ def run_customized_training_loop(
         and model variables pairs as input, manipulate them, and returns a new
         gradients and model variables paris. The callback functions will be
         invoked in the list order and before gradients are allreduced.
-        Default is no callbacks. Only used when explicit_allreduce=True.
+        With mixed precision training, the pre_allreduce_allbacks will be
+        applied on scaled_gradients. Default is no callbacks.
+        Only used when explicit_allreduce=True.
       post_allreduce_callbacks: A list of callback functions that takes
         gradients and model variables pairs as input, manipulate them, and
         returns a new gradients and model variables paris. The callback
@@ -269,11 +269,10 @@ def train_squad(strategy,
           loss_factor=1.0 /
           strategy.num_replicas_in_sync if FLAGS.scale_loss else 1.0)
 
-  # when all_reduce_sum_gradients = False, apply_gradients() no longer
-  # implicitly allreduce gradients, users manually allreduce gradient and
-  # passed the allreduced grads_and_vars. For now, the clip_by_global_norm
-  # will be moved to before users' manual allreduce to keep the math
-  # unchanged.
+  # If explicit_allreduce = True, apply_gradients() no longer implicitly
+  # allreduce gradients, users manually allreduce gradient and pass the
+  # allreduced grads_and_vars to apply_gradients(). clip_by_global_norm will be
+  # applied to allreduced gradients.
   def clip_by_global_norm_callback(grads_and_vars):
     grads, variables = zip(*grads_and_vars)
     (clipped_grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
@@ -291,8 +290,8 @@ def train_squad(strategy,
       init_checkpoint=FLAGS.init_checkpoint,
       run_eagerly=run_eagerly,
       custom_callbacks=custom_callbacks,
-      explicit_allreduce=True,
-      pre_allreduce_callbacks=[clip_by_global_norm_callback])
+      explicit_allreduce=False,
+      post_allreduce_callbacks=[clip_by_global_norm_callback])
 
 
 def predict_squad(strategy, input_meta_data, tokenizer, bert_config, squad_lib):
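
Pulled out of the diff above for context, a self-contained sketch of clip_by_global_norm_callback applied to dummy gradients. The final return line and the example variables are assumptions (the hunk is truncated before the callback's return), not part of the commit:

import tensorflow as tf

def clip_by_global_norm_callback(grads_and_vars):
  grads, variables = zip(*grads_and_vars)
  (clipped_grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
  return zip(clipped_grads, variables)  # assumed; the diff cuts off before this line

v1 = tf.Variable([1.0, 2.0], name='dense/kernel')   # placeholder variables
v2 = tf.Variable([3.0], name='dense/bias')
grads_and_vars = [(tf.constant([3.0, 4.0]), v1), (tf.constant([12.0]), v2)]

# Global norm is sqrt(9 + 16 + 144) = 13, so every gradient is scaled by 1/13.
for grad, var in clip_by_global_norm_callback(grads_and_vars):
  print(var.name, grad.numpy())

With the change above, the same callback is now passed via post_allreduce_callbacks instead of pre_allreduce_callbacks.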
@@ -104,7 +104,8 @@ def minimize_using_explicit_allreduce(tape,
         and model variables pairs as input, manipulate them, and returns a new
         gradients and model variables pairs. The callback functions will be
         invoked in the list order and before gradients are allreduced.
-        Default is no callbacks.
+        With mixed precision training, the pre_allreduce_allbacks will be
+        applied on scaled_gradients. Default is no callbacks.
       post_allreduce_callbacks: A list of callback functions that takes
         gradients and model variables pairs as input, manipulate them, and
         returns a new gradients and model variables paris. The callback
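
The hunk above only touches the docstring. As a rough ordering sketch (an assumption, not grad_utils' actual implementation), this is the flow the pre/post callback documentation implies: pre_allreduce_callbacks see the scaled per-replica gradients, the allreduce happens in between, and post_allreduce_callbacks see the summed gradients just before apply_gradients():

import tensorflow as tf

def explicit_allreduce_sketch(tape, optimizer, scaled_loss, variables,
                              pre_allreduce_callbacks=None,
                              post_allreduce_callbacks=None):
  """Hypothetical helper mirroring the documented callback ordering."""
  grads = tape.gradient(scaled_loss, variables)
  grads_and_vars = list(zip(grads, variables))

  # Pre-allreduce callbacks operate on the scaled, per-replica gradients.
  for callback in pre_allreduce_callbacks or []:
    grads_and_vars = list(callback(grads_and_vars))

  # Placeholder for the cross-replica sum; a real implementation would call
  # tf.distribute.get_replica_context().all_reduce(...) on the gradients here.
  allreduced = [(grad, var) for grad, var in grads_and_vars]

  # Post-allreduce callbacks (e.g. clip_by_global_norm_callback) operate on the
  # summed gradients, which is why the clipping math is unchanged in that position.
  for callback in post_allreduce_callbacks or []:
    allreduced = list(callback(allreduced))

  optimizer.apply_gradients(allreduced)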