Commit 57e7ca73 authored by A. Unique TensorFlower

Updating classifier_trainer MultiWorkerMirrored Strategy.

PiperOrigin-RevId: 316915450
parent e9df75ab
@@ -119,6 +119,24 @@ python3 classifier_trainer.py \
  --params_override='runtime.num_gpus=$NUM_GPUS'
```
To train on multiple hosts, each with GPUs attached, using
[MultiWorkerMirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy),
update the `runtime` section in `gpu.yaml`
(or override it using `--params_override`) with:
```YAML
# gpu.yaml
runtime:
distribution_strategy: 'multi_worker_mirrored'
worker_hosts: '$HOST1:port,$HOST2:port'
num_gpus: $NUM_GPUS
task_index: 0
```
Set `task_index: 0` on the first host, `task_index: 1` on the second, and so
on. `$HOST1` and `$HOST2` are the IP addresses of the hosts, and `port` can be
any free port on the hosts. Only the first host will write TensorBoard
summaries and save checkpoints.
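For example (a sketch, not part of this commit: the flag values mirror the
single-host GPU example earlier in this README, and `$CONFIG_FILE` stands for
the `gpu.yaml` edited as above), the launch on each host could look like:
```bash
# Hypothetical two-host launch; flags mirror the single-host GPU example above,
# $CONFIG_FILE points at the gpu.yaml edited as described, and $MODEL_DIR /
# $DATA_DIR are placeholders.

# Host 1 ($HOST1): gpu.yaml already sets task_index: 0.
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=$CONFIG_FILE

# Host 2 ($HOST2): identical command, overriding only the task index.
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=$CONFIG_FILE \
  --params_override='runtime.task_index=1'
```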
#### On TPU:
```bash
python3 classifier_trainer.py \
...
@@ -235,9 +235,6 @@ def initialize(params: base_configs.ExperimentConfig,
  else:
    data_format = 'channels_last'
  tf.keras.backend.set_image_data_format(data_format)
  distribution_utils.configure_cluster(
      params.runtime.worker_hosts,
      params.runtime.task_index)
  if params.runtime.run_eagerly:
    # Enable eager execution to allow step-by-step debugging
    tf.config.experimental_run_functions_eagerly(True)
@@ -296,6 +293,10 @@ def train_and_eval(
  """Runs the train and eval path using compile/fit."""
  logging.info('Running train and eval.')
  distribution_utils.configure_cluster(
      params.runtime.worker_hosts,
      params.runtime.task_index)
  # Note: for TPUs, strategy and scope should be created before the dataset
  strategy = strategy_override or distribution_utils.get_distribution_strategy(
      distribution_strategy=params.runtime.distribution_strategy,
...
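As the hunk headers indicate, this commit moves the
`distribution_utils.configure_cluster(...)` call out of `initialize()` and into
`train_and_eval()`, so the cluster is configured immediately before the
distribution strategy is created. For `multi_worker_mirrored`, the strategy's
default cluster resolver reads the `TF_CONFIG` environment variable, which is
presumably what `configure_cluster` populates from `worker_hosts` and
`task_index`. A minimal sketch of the equivalent manual setup (the addresses
and port below are placeholders):
```bash
# Sketch only: what the worker_hosts / task_index settings amount to for
# MultiWorkerMirroredStrategy's default (TF_CONFIG-based) cluster resolver.
# 10.0.0.1 / 10.0.0.2 and port 2222 are placeholder values.
export TF_CONFIG='{
  "cluster": {"worker": ["10.0.0.1:2222", "10.0.0.2:2222"]},
  "task": {"type": "worker", "index": 0}
}'
```
On the second host the same cluster spec would be used with `"index": 1`, which
matches setting `task_index: 1` in the YAML shown above.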