"git@developer.sourcefind.cn:OpenDAS/megatron-lm.git" did not exist on "626645c0f77e1cf46cd08d854e1f336f15b6d8b7"
Unverified commit 1fc839bc, authored by Hongkun Yu and committed by GitHub

Merged commit includes the following changes: (#7277)

259442882  by hongkuny<hongkuny@google.com>:

    Internal

--
259341546  by mrry<mrry@google.com>:

    Remove DEBUG-level logging from the BERT benchmark.

    This triggers graph serialization and other verbose logging in the TensorFlow runtime, which inflates the execution time.

--
259253185  by hongkuny<hongkuny@google.com>:

    Writes a separated checkpoint for the core model in pretraining.
    Clean up export utils to just take a model as argument.
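
    A minimal usage sketch of the cleaned-up export helpers (their definitions appear in
    the diff below); the stand-in model, paths, and the fake checkpoint are placeholders,
    not code from this commit:

        import tensorflow as tf
        from official.bert import model_saving_utils

        # A tiny stand-in model; in the real flow this is the BERT core model.
        model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])

        # Pretend a training run already wrote checkpoints into model_dir.
        model_dir = '/tmp/bert_pretrain'  # placeholder path
        tf.io.gfile.makedirs(model_dir)
        tf.train.Checkpoint(model=model).save(model_dir + '/ckpt')

        # Both helpers now take the model object directly instead of a model_fn.
        model_saving_utils.export_pretraining_checkpoint(
            checkpoint_dir=model_dir, model=model)
        model_saving_utils.export_bert_model(
            '/tmp/bert_export', model=model, checkpoint_dir=model_dir)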

--
258893811  by hongkuny<hongkuny@google.com>:

    Adds summaries for metrics, allowing metrics inside keras.model.

--
258881002  by hongkuny<hongkuny@google.com>:

    Fix lint.

--
258597234  by rxsang<rxsang@google.com>:

    Update all the TPUStrategy examples to use the new v2 APIs, i.e.
    make_dataset_iterator -> experimental_distribute_dataset,
    make_input_fn_iterator -> experimental_distribute_datasets_from_function,
    unwrap -> experimental_local_results,
    experimental_run -> experimental_run_v2
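
    A before/after sketch of this migration; the strategy, toy dataset, and step function
    are placeholders, and the v2 names shown are the ones in use at the time of this
    change (strategy.experimental_run_v2 was later renamed strategy.run):

        import tensorflow as tf

        strategy = tf.distribute.MirroredStrategy()  # stand-in for a TPUStrategy
        dataset = tf.data.Dataset.from_tensor_slices(tf.ones([8, 2])).batch(4)

        def step_fn(inputs):
          # Trivial per-replica computation standing in for a train step.
          return tf.reduce_sum(inputs)

        # Old: iterator = strategy.make_dataset_iterator(dataset)
        #      per_replica = strategy.experimental_run(step_fn, iterator)
        #      results = strategy.unwrap(per_replica)
        # New v2 API:
        dist_dataset = strategy.experimental_distribute_dataset(dataset)
        for batch in dist_dataset:
          per_replica = strategy.experimental_run_v2(step_fn, args=(batch,))
          results = strategy.experimental_local_results(per_replica)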

--
258581998  by taylorrobie<taylorrobie@google.com>:

    Update keras v2 optimizers to reuse coefficients which are shared across all updates, which reduces the total number of ops created by between 5% (for simple optimizers such as SGD and Adagrad) and 25% (for complicated optimizers such as Adam and NAdam). Separate copies are made for each device and dtype.

    The effect of this change on run time is fairly minimal since Grappler is expected to consolidate most of these ops; however it does improve graph construction time.
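
    An illustrative sketch of the coefficient-sharing idea, not the actual Keras
    optimizer code: per-step constants such as the casted learning rate are built once
    per (device, dtype) pair and reused by every variable update, instead of being
    recreated for each variable.

        import tensorflow as tf

        class CoefficientCache(object):
          """Toy cache of update coefficients shared per (device, dtype)."""

          def __init__(self, learning_rate):
            self._lr = learning_rate
            self._cache = {}

          def get(self, var):
            key = (var.device, var.dtype.base_dtype)
            if key not in self._cache:
              # Built once, then reused by all updates on this device/dtype.
              self._cache[key] = {'lr_t': tf.cast(self._lr, var.dtype.base_dtype)}
            return self._cache[key]

        cache = CoefficientCache(learning_rate=0.01)
        v = tf.Variable([1.0, 2.0])
        lr_t = cache.get(v)['lr_t']
        v.assign_sub(lr_t * tf.ones_like(v))  # SGD-style update with the shared coefficient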

--
258208153  by hongkuny<hongkuny@google.com>:

    Adds run_eagerly option for bert.

--
257883986  by hongkuny<hongkuny@google.com>:

    Adds tf.summary for bert training

--
256204636  by hongkuny<hongkuny@google.com>:

    Internal

--
256079834  by hongkuny<hongkuny@google.com>:

    Clean up: move common flags together for further refactoring
    Enable steps_per_loop option for all applications.
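
    Roughly, steps_per_loop means running several train steps inside one traced call
    before returning to the Python host, which cuts host/device round trips. A minimal
    sketch of the pattern, not the repository's training loop:

        import tensorflow as tf

        dataset = tf.data.Dataset.from_tensor_slices(tf.ones([32, 2])).batch(4).repeat()
        iterator = iter(dataset)
        total = tf.Variable(0.0)

        def train_step(batch):
          # Stand-in for a real forward/backward pass.
          total.assign_add(tf.reduce_sum(batch))

        @tf.function
        def run_inner_loop(iterator, steps_per_loop):
          # Executes steps_per_loop device steps per host-side call.
          for _ in tf.range(steps_per_loop):
            train_step(next(iterator))

        run_inner_loop(iterator, tf.constant(10))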

--
255493073  by hongkuny<hongkuny@google.com>:

    BERT initial OSS readme update.

--
255470372  by dmchen<dmchen@google.com>:

    Slightly expand expected range for F1 score in BERT SQuAD accuracy test

--
255109240  by hongkuny<hongkuny@google.com>:

    Update eval/predict batch sizes.

--
255010016  by hongkuny<hongkuny@google.com>:

    Internal

--
254874613  by hongkuny<hongkuny@google.com>:

    Update glue tasks enum to match directory name

--
254866171  by taylorrobie<taylorrobie@google.com>:

    Internal change

254785517  by zongweiz<zongweiz@google.com>:

    Use train_single_step for BERT GPU models to temporarily work around some performance bugs in GPU runs

--
254497647  by hongkuny<hongkuny@google.com>:

    Fix device placement for TPU export model.

--
254134531  by yuefengz<yuefengz@google.com>:

    Fix a typo in bert_benchmark.py

--
254069984  by hongkuny<hongkuny@google.com>:
    Automated rollback of changelist 254060732.

254061429  by hongkuny<hongkuny@google.com>:

    Use host while loop for training steps.

--
254060732  by yifeif<yifeif@google.com>:
    Automated rollback of changelist 254027750.

254027750  by hongkuny<hongkuny@google.com>:

    Internal change

253850824  by hongkuny<hongkuny@google.com>:

    Improve bert training utils.

--
253818191  by hongkuny<hongkuny@google.com>:

    Update savedmodel export to use new model.save() api.

--
253636854  by dmchen<dmchen@google.com>:

    Run only training in BERT SQuAD performance test

--
253118910  by hongkuny<hongkuny@google.com>:

    Internal change

253113801  by zongweiz<zongweiz@google.com>:

    Internal change

252697519  by dmchen<dmchen@google.com>:

    BERT SQuAD accuracy test

--
252663512  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Internal change

--
252647871  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Enable multi worker TPU training for BERT pretraining.

--
252522861  by hongkuny<hongkuny@google.com>:

    Remove export using trained model due to implementation error

--
252156812  by yuefengz<yuefengz@google.com>:

    Fix the callback method name in BERT: replace on_batch_start with on_batch_begin. Without the fix, the hook is never invoked by the Keras callback machinery.
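
    For reference, tf.keras.callbacks.Callback dispatches to on_batch_begin/on_batch_end
    (plus the on_train_batch_* variants), so a method named on_batch_start is never
    called. A minimal example:

        import tensorflow as tf

        class StepLogger(tf.keras.callbacks.Callback):

          def on_batch_begin(self, batch, logs=None):
            # Keras calls this before every batch; 'on_batch_start' would be ignored.
            print('starting batch', batch)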

--
251782065  by dmchen<dmchen@google.com>:

    Internal change

251681245  by hongkuny<hongkuny@google.com>:

    Update bert to use the new tf.distribute APIs

--
251575972  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Remove `steps_per_run` when instantiating TPUStrategy.
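
    For illustration, a TF 2.x TPUStrategy built without that argument; the TPU address
    is a placeholder and the exact setup calls vary by TF version:

        import tensorflow as tf

        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
            tpu='grpc://10.0.0.2:8470')  # placeholder address
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.experimental.TPUStrategy(resolver)  # no steps_per_run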

--
251325964  by hongkuny<hongkuny@google.com>:

    Improve flags

--
250942274  by tobyboyd<tobyboyd@google.com>:

    Internal change

250779087  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Reduce BERT Perfzero benchmark test training steps.

--
250713045  by hongkuny<hongkuny@google.com>:

    TPU util

--
250606180  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Fix BERT benchmark test errors.

--
250589623  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Change BERT benchmark test pretrained checkpoint url.

--
250587892  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Fix error in BERT custom training loop checkpoint restoration.

--
250577163  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Add logic to inject callback that measures performance in BERT custom training
    loop.
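
    A sketch of this kind of timing callback; the class name is illustrative rather than
    the benchmark's actual helper:

        import time

        import tensorflow as tf

        class TimerCallback(tf.keras.callbacks.Callback):
          """Records wall-clock time per batch for throughput reporting."""

          def __init__(self):
            self.batch_times = []
            self._start = None

          def on_batch_begin(self, batch, logs=None):
            self._start = time.time()

          def on_batch_end(self, batch, logs=None):
            self.batch_times.append(time.time() - self._start)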

--
250529526  by hongkuny<hongkuny@google.com>:

    Internal clean up

--
250428976  by hongkuny<hongkuny@google.com>:

    Internal change

250415383  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Add min/max value to BERT classifier benchmark test.

--
250376246  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Add benchmark performance test to run BERT on multiple numbers of GPUs.

--
250347237  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Fix linting errors in BERT benchmark test.

--
250326131  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Internal change

250315593  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Internal change

250303528  by haoyuzhang<haoyuzhang@google.com>:

    Add method docstring to fix lint error.

--
250009207  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Add feature in BERT to write training metrics to a summary file.
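
    The usual TF 2.x pattern for this, shown as a generic sketch rather than the exact
    code added here (the log directory is a placeholder):

        import tensorflow as tf

        summary_writer = tf.summary.create_file_writer('/tmp/bert_summaries')
        with summary_writer.as_default():
          # One scalar per metric per step ends up in the TensorBoard event file.
          tf.summary.scalar('train_loss', 0.42, step=100)
        summary_writer.flush()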

--
249896208  by hongkuny<hongkuny@google.com>:

    Adds __init__.py

--
249883771  by hongkuny<hongkuny@google.com>:

    Creates a benchmark dir

--
249580533  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Internal change

249566870  by A. Unique TensorFlower<gardener@tensorflow.org>:

    Set up BERT benchmark test.

--
249500988  by hongkuny<hongkuny@google.com>:

    Lints

--
249377254  by hongkuny<hongkuny@google.com>:

    Internal change

249373328  by hongkuny<hongkuny@google.com>:

    Clean up tf import

--
249333938  by hongkuny<hongkuny@google.com>:

    Fix tf1 import

--
249325089  by hongkuny<hongkuny@google.com>:

    BERT 2.0

--
249173564  by hongkuny<hongkuny@google.com>:

    Internal change

PiperOrigin-RevId: 259442882
parent 6a6c3616
@@ -76,7 +76,6 @@ class BertBenchmarkBase(tf.test.Benchmark):

   def _setup(self):
     """Sets up and resets flags before each test."""
-    tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.DEBUG)
     self.timer_callback = BenchmarkTimerCallback()
     if BertBenchmarkBase.local_flags is None:
...
@@ -16,42 +16,76 @@
 from __future__ import absolute_import
 from __future__ import division
+# from __future__ import google_type_annotations
 from __future__ import print_function

 import os

 from absl import logging
 import tensorflow as tf
+import typing


-def export_bert_model(model_export_path,
-                      model=None,
-                      model_fn=None,
-                      checkpoint_dir=None):
+def export_bert_model(
+    model_export_path: typing.Text,
+    model: tf.keras.Model,
+    checkpoint_dir: typing.Optional[typing.Text] = None) -> None:
   """Export BERT model for serving which does not include the optimizer.

   Arguments:
       model_export_path: Path to which exported model will be saved.
-      model: Keras model object to export. If none, new model is created via
-        `model_fn`.
-      model_fn: Function that returns a BERT model. Used when `model` is not
-        provided.
-      checkpoint_dir: Path from which model weights will be loaded.
+      model: Keras model object to export.
+      checkpoint_dir: Path from which model weights will be loaded, if
+        specified.
+
+  Raises:
+    ValueError when either model_export_path or model is not specified.
   """
-  if model:
-    model.save(model_export_path, include_optimizer=False, save_format='tf')
-    return
-
-  assert model_fn and checkpoint_dir
-  model_to_export = model_fn()
-  checkpoint = tf.train.Checkpoint(model=model_to_export)
-  latest_checkpoint_file = tf.train.latest_checkpoint(checkpoint_dir)
-  assert latest_checkpoint_file
-  logging.info('Checkpoint file %s found and restoring from '
-               'checkpoint', latest_checkpoint_file)
-  checkpoint.restore(latest_checkpoint_file).assert_existing_objects_matched()
-  model_to_export.save(
-      model_export_path, include_optimizer=False, save_format='tf')
+  if not model_export_path:
+    raise ValueError('model_export_path must be specified.')
+  if not isinstance(model, tf.keras.Model):
+    raise ValueError('model must be a tf.keras.Model object.')
+
+  if checkpoint_dir:
+    # Restores the model from the latest checkpoint.
+    checkpoint = tf.train.Checkpoint(model=model)
+    latest_checkpoint_file = tf.train.latest_checkpoint(checkpoint_dir)
+    assert latest_checkpoint_file
+    logging.info('Checkpoint file %s found and restoring from '
+                 'checkpoint', latest_checkpoint_file)
+    checkpoint.restore(latest_checkpoint_file).assert_existing_objects_matched()
+  model.save(model_export_path, include_optimizer=False, save_format='tf')
+
+
+def export_pretraining_checkpoint(
+    checkpoint_dir: typing.Text,
+    model: tf.keras.Model,
+    checkpoint_name: typing.Optional[
+        typing.Text] = 'pretrained/bert_model.ckpt'):
+  """Exports the BERT model as a checkpoint without the optimizer.
+
+  Arguments:
+      checkpoint_dir: Path to where training model checkpoints are stored.
+      model: Keras model object to export.
+      checkpoint_name: File name or suffix path to export the pretrained
+        checkpoint.
+
+  Raises:
+    ValueError when either checkpoint_dir or model is not specified.
+  """
+  if not checkpoint_dir:
+    raise ValueError('checkpoint_dir must be specified.')
+  if not isinstance(model, tf.keras.Model):
+    raise ValueError('model must be a tf.keras.Model object.')
+
+  checkpoint = tf.train.Checkpoint(model=model)
+  latest_checkpoint_file = tf.train.latest_checkpoint(checkpoint_dir)
+  assert latest_checkpoint_file
+  logging.info('Checkpoint file %s found and restoring from '
+               'checkpoint', latest_checkpoint_file)
+  checkpoint.restore(latest_checkpoint_file).assert_existing_objects_matched()
+  saved_path = checkpoint.save(os.path.join(checkpoint_dir, checkpoint_name))
+  logging.info('Exporting the model as a new TF checkpoint: %s', saved_path)


 class BertModelCheckpoint(tf.keras.callbacks.Callback):
...
@@ -159,13 +159,11 @@ def export_classifier(model_export_path, input_meta_data):
     raise ValueError('Export path is not specified: %s' % model_export_path)
   bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)

-  def _model_fn():
-    return bert_models.classifier_model(bert_config, tf.float32,
-                                        input_meta_data['num_labels'],
-                                        input_meta_data['max_seq_length'])[0]
+  classifier_model = bert_models.classifier_model(
+      bert_config, tf.float32, input_meta_data['num_labels'],
+      input_meta_data['max_seq_length'])[0]
   model_saving_utils.export_bert_model(
-      model_export_path, model_fn=_model_fn, checkpoint_dir=FLAGS.model_dir)
+      model_export_path, model=classifier_model, checkpoint_dir=FLAGS.model_dir)


 def run_bert(strategy, input_meta_data):
...
@@ -29,6 +29,7 @@ import tensorflow as tf
 from official.bert import bert_models
 from official.bert import common_flags
 from official.bert import input_pipeline
+from official.bert import model_saving_utils
 from official.bert import model_training_utils
 from official.bert import modeling
 from official.bert import optimization
...
@@ -124,7 +125,7 @@ def run_customized_training(strategy,
         initial_lr, steps_per_epoch * epochs, warmup_steps)
     return pretrain_model, core_model

-  return model_training_utils.run_customized_training_loop(
+  trained_model = model_training_utils.run_customized_training_loop(
       strategy=strategy,
       model_fn=_get_pretrain_model,
       loss_fn=get_loss_fn(),
...
@@ -135,6 +136,16 @@
       epochs=epochs,
       use_remote_tpu=use_remote_tpu)

+  # Creates the BERT core model outside the distribution strategy scope.
+  _, core_model = bert_models.pretrain_model(bert_config, max_seq_length,
+                                             max_predictions_per_seq)
+
+  # Restores the core model from the training checkpoints and writes a new
+  # checkpoint that contains only the core model.
+  model_saving_utils.export_pretraining_checkpoint(
+      checkpoint_dir=model_dir, model=core_model)
+  return trained_model
+

 def run_bert_pretrain(strategy):
   """Runs BERT pre-training."""
...
@@ -183,7 +194,7 @@ def main(_):
   if strategy:
     print('***** Number of cores used : ', strategy.num_replicas_in_sync)
-  return run_bert_pretrain(strategy)
+  run_bert_pretrain(strategy)


 if __name__ == '__main__':
...