Commit 4749cd5e authored by huchen

del the tf2x benchmark

parent 772777c3
# tf_cnn_benchmarks: High performance benchmarks
**Note: tf_cnn_benchmarks is no longer maintained.**
tf_cnn_benchmarks contains TensorFlow 1 implementations of several popular
convolutional models, and is designed to be as fast as possible.
tf_cnn_benchmarks supports running both on a single machine and in distributed
mode across multiple hosts.
tf_cnn_benchmarks is no longer maintained. Although it will run with TensorFlow
2, it was written and optimized for TensorFlow 1, and has not been maintained
since TensorFlow 2 was released. For clean and easy-to-read TensorFlow 2 models,
please see the [TensorFlow Official
Models](https://github.com/tensorflow/models/tree/master/official).
## Getting Started
To run ResNet50 on a single GPU with synthetic data and no distortions, run:
```
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server
```
Note that the master branch of tf_cnn_benchmarks occasionally requires the
latest nightly version of TensorFlow. You can install the nightly version by
running `pip install tf-nightly-gpu` in a clean environment, or by installing
TensorFlow from source. We sometimes create a branch of tf_cnn_benchmarks,
in the form of cnn_tf_vX.Y_compatible, that is compatible with TensorFlow
version X.Y. For example, branch
[cnn_tf_v1.9_compatible](https://github.com/tensorflow/benchmarks/tree/cnn_tf_v1.9_compatible/scripts/tf_cnn_benchmarks)
works with TensorFlow 1.9. However, as tf_cnn_benchmarks is no longer
maintained, we will likely no longer create new branches.
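For example, a sketch of pinning a matching TensorFlow release and using the
corresponding branch (the exact version and package name are illustrative):
```bash
pip install tensorflow-gpu==1.9.0
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks
git checkout cnn_tf_v1.9_compatible
cd scripts/tf_cnn_benchmarks
```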
Some important flags are
* model: Model to use, e.g. resnet50, inception3, vgg16, and alexnet.
* num_gpus: Number of GPUs to use.
* data_dir: Path to data to process. If not set, synthetic data is used. To
  use ImageNet data, use these
  [instructions](https://github.com/tensorflow/models/tree/master/research/inception#getting-started)
  as a starting point.
* batch_size: Batch size for each GPU.
* variable_update: The method for managing variables: parameter_server,
  replicated, distributed_replicated, or independent.
* local_parameter_device: Device to use as parameter server: cpu or gpu.
To see the full list of flags, run `python tf_cnn_benchmarks.py --help`.
To run ResNet50 on 8 GPUs with real data, run:
```
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 \
--model=resnet50 --optimizer=momentum --variable_update=replicated \
--nodistortions --gradient_repacking=8 --num_gpus=8 \
--num_epochs=90 --weight_decay=1e-4 --data_dir=${DATA_DIR} --use_fp16 \
--train_dir=${CKPT_DIR}
```
This will train a ResNet-50 model on ImageNet with a total batch size of 2048
(256 per GPU) across 8 GPUs. The model should train to around 76% accuracy.
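tf_cnn_benchmarks can also run distributed across multiple hosts, as noted
above. The following is a sketch of the worker command for a two-host
distributed_replicated run; the host names, ports, and batch size are
illustrative, and each host additionally needs a parameter-server process
started with --job_name=ps:
```
# On host_0 (task index 0); run the same command on host_1 with --task_index=1.
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=host_0:50000,host_1:50000 \
--worker_hosts=host_0:50001,host_1:50001 --task_index=0
```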
## Running the tests
To run the tests, run
```bash
pip install portpicker
python run_tests.py && python run_tests.py --run_distributed_tests
```
Note the tests require portpicker.
The command above runs a subset of tests that is both fast and fairly
comprehensive. Alternatively, all the tests can be run, but this will take a
long time:
```bash
python run_tests.py --full_tests && python run_tests.py --full_tests --run_distributed_tests
```
We will run all tests on every PR before merging them, so it is not necessary
to pass `--full_tests` when running tests yourself.
To run an individual test, such as method `testParameterServer` of test class
`TfCnnBenchmarksTest` of module `benchmark_cnn_test`, run
```bash
python -m unittest -v benchmark_cnn_test.TfCnnBenchmarksTest.testParameterServer
```
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Benchmarks the all-reduce algorithms of tf_cnn_benchmarks.
tf_cnn_benchmarks uses all-reduce to aggregate gradients. This benchmark is
useful for benchmarking the performance of just this gradient aggregation,
instead of the entire model. All the flags that tf_cnn_benchmarks accepts are
also accepted by this script, although many are silently ignored.
The number and shapes of the tensors all-reduced are those of the variables of
the model specified by the --model flag.
TODO(reedwm): Allow custom sizes to be specified.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import time
from absl import app
from absl import flags as absl_flags
import tensorflow.compat.v1 as tf
from tensorflow.python.ops import control_flow_ops
import benchmark_cnn
import cnn_util
import flags
from cnn_util import log_fn
absl_flags.DEFINE_integer('iters_per_step', 5,
'Number of iterations to run all-reduce for, per '
'step. Every step, a session will be run on a Graph '
'that contains this many copies of the all-reduce. '
'The copies are run sequentially. Setting this above '
'1 is useful to lower the overhead of starting the '
'session run, running the VariableV2 ops at the '
'start of the step, etc.')
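# Define every tf_cnn_benchmarks flag and declare each one as a key flag so it
# appears in this script's --help output.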
flags.define_flags()
for name in flags.param_specs.keys():
absl_flags.declare_key_flag(name)
def get_var_shapes(model):
"""Returns the list of variable shapes for a tf_cnn_benchmarks Model."""
with tf.Graph().as_default():
# The variable shapes do not depend on the batch size.
images = tf.placeholder(tf.float32, model.get_input_shapes('train')[0])
model.build_network([images])
return [[int(d) for d in v.shape.dims] for v in tf.trainable_variables()]
def all_reduce(all_device_tensors, variable_mgr):
"""Performs a single batch all-reduce.
Args:
all_device_tensors: List of lists of tensors. all_device_tensors[t][i] is
a tensor, where t is the tower the tensor is on and i is the index of
the tensor.
variable_mgr: The VariableMgr to perform the all-reduce.
Returns:
List of list of tensors in the same form as `all_device_tensors`, except the
tensors are aggregated across towers.
"""
tower_grads = [[(g, None) for g in device_tensors] for
device_tensors in all_device_tensors]
_, aggregated_tower_grads = variable_mgr.preprocess_device_grads(tower_grads)
return [
[g for g, _ in agg_device_tensors]
for agg_device_tensors in aggregated_tower_grads]
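# Illustrative layout (hypothetical, with 2 towers and 2 tensors per tower):
#   all_device_tensors = [[t0_gpu0, t1_gpu0],   # tower 0
#                         [t0_gpu1, t1_gpu1]]   # tower 1
# all_reduce() returns the same nesting, where entry [t][i] is the cross-tower
# aggregate of tensor i, associated with tower t.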
def build_all_reduce_iterations(all_device_tensors, tower_devices, variable_mgr,
num_iters):
"""Builds the all-reduce ops for multiple iterations to aggregate tensors.
The tensors in `all_device_tensors` are aggregated `num_iters` times. Each
iteration aggregates the results from the previous iteration. The iterations
are run sequentially, so the aggregations for an iteration do not start
running until the previous iteration has completed. Each iteration after the
first is aggregating already-aggregated values, but it does not matter because
we are only aggregating for benchmarking purposes.
Args:
all_device_tensors: List of lists of tensors. all_device_tensors[t][i] is
a tensor, where t is the tower the tensor is on and i is the index of
the tensor.
tower_devices: A list of device strings. tower_devices[t] is the device
of the tensors in all_device_tensors[t].
variable_mgr: The VariableMgr to perform the all-reduce.
num_iters: Number of iterations to aggregate tensors for.
Returns:
An op that when run, causes the all-reduce ops to run.
"""
for i in range(num_iters):
with tf.name_scope('iteration_%d' % i):
# Step 1: Do the aggregation.
with tf.name_scope('tensor_aggregation'):
all_device_tensors = all_reduce(all_device_tensors, variable_mgr)
# Step 2. Create identity ops, to bring the aggregated results back to
# each device.
new_all_device_tensors = []
for device, device_tensors in zip(tower_devices, all_device_tensors):
with tf.device(device):
new_all_device_tensors.append([
tf.identity(t, name='identity_after_allreduce')
for t in device_tensors
])
all_device_tensors = new_all_device_tensors
# Step 3. Add control dependencies to delay the next iteration until this
# iteration is complete. To avoid extra overhead, we do not have any
# cross-device control dependencies, which means it's possible for two
# iterations to slightly overlap.
new_all_device_tensors = []
for device_tensors in all_device_tensors:
new_all_device_tensors.append([
control_flow_ops.with_dependencies(
device_tensors, t, name='identity_after_dependencies')
for t in device_tensors
])
all_device_tensors = new_all_device_tensors
# To prevent the dependency optimizer from removing every op we created,
# we store the results in variables.
ops_to_run = []
for device, device_tensors in zip(tower_devices, all_device_tensors):
with tf.device(device):
for t in device_tensors:
# The placeholder initial value is never run.
var = tf.Variable(tf.placeholder(tf.float32, t.shape), collections=[])
ops_to_run.append(var.assign(t))
return tf.group(*ops_to_run)
def build_graph(tower_devices, tensor_shapes, variable_mgr, num_iters):
"""Builds the graph for the benchmark.
Args:
tower_devices: A list of device strings of the devices to run the all-reduce
benchmark on.
tensor_shapes: A list of shapes of the tensors that will be aggregated for
the all-reduce.
variable_mgr: The VariableMgr to perform the all-reduce.
num_iters: Number of iterations to aggregate tensors for.
Returns:
An op that runs the benchmark.
"""
all_device_tensors = []
for i, tower_device in enumerate(tower_devices):
with tf.device(tower_device):
device_tensors = []
for j, shape in enumerate(tensor_shapes):
tensor = tf.Variable(tf.random_normal(shape, dtype=tf.float32),
name='tensor_%d_on_device_%d' % (j, i))
device_tensors.append(tensor)
all_device_tensors.append(device_tensors)
log_fn('Building all-reduce ops')
benchmark_op = build_all_reduce_iterations(all_device_tensors, tower_devices,
variable_mgr, num_iters)
log_fn('Done building all-reduce ops')
return benchmark_op
def run_graph(benchmark_op, bench_cnn, init_ops, dummy_loss_op):
"""Runs the graph for the benchmark.
Args:
benchmark_op: An op that runs the benchmark.
bench_cnn: The BenchmarkCNN where params and other attributes are obtained.
init_ops: A list of ops that are run before `benchmark_op` for
initialization.
dummy_loss_op: Any op. We must pass a loss op to
`benchmark_cnn.benchmark_one_step`, but the result of the op is never
actually used.
"""
config = benchmark_cnn.create_config_proto(bench_cnn.params)
with tf.Session(config=config) as sess:
for op in init_ops:
sess.run(op)
step_train_times = []
fetches = {'average_loss': dummy_loss_op, 'benchmark_op': benchmark_op}
log_fn('Running warmup')
for i in range(-bench_cnn.num_warmup_batches, bench_cnn.num_batches):
if i == 0:
log_fn('Running all-reduce ops')
start = time.time()
if i > 0 and i % bench_cnn.params.display_every == 0:
log_fn('Iteration: %d. Average time per step so far: %s' %
(i, (time.time() - start) / i))
# Call benchmark_one_step instead of directly calling sess.run(...), to
# potentially get a trace file, partitioned graphs, etc.
benchmark_cnn.benchmark_one_step(
sess=sess,
fetches=fetches,
step=i,
# The batch size is only used for the images/sec calculation, which is
# not actually calculated because we pass show_images_per_sec=False.
batch_size=None,
step_train_times=step_train_times,
trace_filename=bench_cnn.trace_filename,
partitioned_graph_file_prefix=(
bench_cnn.params.partitioned_graph_file_prefix),
profiler=None,
image_producer=None,
params=bench_cnn.params,
show_images_per_sec=False)
log_fn('Average time per step: %s' %
((time.time() - start) / bench_cnn.num_batches))
def run_benchmark(bench_cnn, num_iters):
"""Runs the all-reduce benchmark.
Args:
bench_cnn: The BenchmarkCNN where params, the variable manager, and other
attributes are obtained.
    num_iters: Number of iterations to run the all-reduce for.
Raises:
ValueError: Invalid params of bench_cnn.
"""
if bench_cnn.params.variable_update != 'replicated':
    raise ValueError('--variable_update=replicated must be specified to use '
                     'the all-reduce benchmark')
if bench_cnn.params.variable_consistency == 'relaxed':
raise ValueError('--variable_consistency=relaxed is not supported')
benchmark_op = build_graph(bench_cnn.raw_devices,
get_var_shapes(bench_cnn.model),
bench_cnn.variable_mgr, num_iters)
init_ops = [
tf.global_variables_initializer(),
bench_cnn.variable_mgr.get_post_init_ops()
]
loss_op = tf.no_op()
if bench_cnn.graph_file:
path, filename = os.path.split(bench_cnn.graph_file)
as_text = filename.endswith('txt')
log_fn('Writing GraphDef as %s to %s' % (
'text' if as_text else 'binary', bench_cnn.graph_file))
tf.train.write_graph(tf.get_default_graph().as_graph_def(add_shapes=True),
path, filename, as_text)
run_graph(benchmark_op, bench_cnn, init_ops, loss_op)
# TODO(reedwm): Reduce redundancy with tf_cnn_benchmarks
def main(positional_arguments):
# Command-line arguments like '--distortions False' are equivalent to
# '--distortions=True False', where False is a positional argument. To prevent
# this from silently running with distortions, we do not allow positional
# arguments.
assert len(positional_arguments) >= 1
if len(positional_arguments) > 1:
raise ValueError('Received unknown positional arguments: %s'
% positional_arguments[1:])
params = benchmark_cnn.make_params_from_flags()
params = benchmark_cnn.setup(params)
bench = benchmark_cnn.BenchmarkCNN(params)
tfversion = cnn_util.tensorflow_version_tuple()
log_fn('TensorFlow: %i.%i' % (tfversion[0], tfversion[1]))
run_benchmark(bench, absl_flags.FLAGS.iters_per_step)
if __name__ == '__main__':
tf.disable_v2_behavior()
app.run(main) # Raises error on invalid flags, unlike tf.app.run()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for all_reduce_benchmark.py."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow.compat.v1 as tf
import all_reduce_benchmark
import benchmark_cnn
import test_util
class AllReduceBenchmarkTest(tf.test.TestCase):
"""Tests the all-reduce benchmark."""
def _test_run_benchmark(self, params):
"""Tests that run_benchmark() runs successfully with the params."""
logs = []
with test_util.monkey_patch(all_reduce_benchmark,
log_fn=test_util.print_and_add_to_list(logs)):
bench_cnn = benchmark_cnn.BenchmarkCNN(params)
all_reduce_benchmark.run_benchmark(bench_cnn, num_iters=5)
self.assertRegex(logs[-1], '^Average time per step: [0-9.]+$')
def test_run_benchmark(self):
"""Tests that run_benchmark() runs successfully."""
params = benchmark_cnn.make_params(num_batches=10,
variable_update='replicated',
num_gpus=2)
self._test_run_benchmark(params)
params = params._replace(hierarchical_copy=True, gradient_repacking=8,
num_gpus=8)
self._test_run_benchmark(params)
if __name__ == '__main__':
tf.disable_v2_behavior()
tf.test.main()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for tf_cnn_benchmark.allreduce."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections as pycoll
import numpy as np
import tensorflow.compat.v1 as tf
from tensorflow.python.framework import ops
from tensorflow.python.framework import test_util
from tensorflow.python.ops import variables
import allreduce
class AllReduceTest(tf.test.TestCase):
def testGroupKey(self):
d0 = ['/job:worker/replica:0/task:0/device:GPU:1',
'/job:worker/replica:0/task:0/device:GPU:0',
'/job:worker/replica:0/task:0/device:GPU:3',]
d1 = ['/job:worker/replica:0/task:1/device:GPU:1',
'/job:worker/replica:0/task:1/device:GPU:0',
'/job:worker/replica:0/task:1/device:GPU:3',]
d2 = ['/job:worker/replica:0/task:1/device:GPU:1',
'/job:worker/replica:0/task:1/device:GPU:3',
'/job:worker/replica:0/task:1/device:GPU:0',]
d3 = ['/job:worker/replica:0/task:1/device:GPU:1',
'/job:worker/replica:0/task:1/device:GPU:3',
'/job:worker/replica:0/task:1/device:GPU:2',]
d4 = ['/job:worker/task:0/device:GPU:1',
'/job:worker/task:0/device:GPU:2',
'/job:worker/task:0/device:GPU:3',]
d5 = ['/job:worker/task:0/device:CPU:1',
'/job:worker/task:0/device:CPU:2']
d6 = ['/job:worker/task:0/device:CPU:2',
'/job:worker/task:0/device:CPU:1']
g0 = allreduce.collective_group_key(d0)
g1 = allreduce.collective_group_key(d1)
g2 = allreduce.collective_group_key(d2)
g3 = allreduce.collective_group_key(d3)
g4 = allreduce.collective_group_key(d4)
g5 = allreduce.collective_group_key(d5)
g6 = allreduce.collective_group_key(d6)
self.assertEqual(g0, g1)
self.assertEqual(g0, g2)
self.assertTrue(g0 != g3)
self.assertEqual(g3, g4)
self.assertEqual(g5, g6)
self.assertTrue(g4 != g5)
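  # extract_ranges(x) splits a sorted list of indices into contiguous
  # [first, last] ranges and leftover singletons, as exercised below.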
def testExtractRanges(self):
x = []
expected_ranges = []
expected_singles = []
ranges, singles = allreduce.extract_ranges(x)
self.assertEqual(expected_ranges, ranges)
self.assertEqual(expected_singles, singles)
x = [1, 3, 4, 6, 7, 8, 9]
expected_ranges = [[3, 4], [6, 9]]
expected_singles = [1]
ranges, singles = allreduce.extract_ranges(x)
self.assertEqual(expected_ranges, ranges)
self.assertEqual(expected_singles, singles)
x = [1, 2, 3, 4, 6, 7, 8, 9]
expected_ranges = [[1, 4], [6, 9]]
expected_singles = []
ranges, singles = allreduce.extract_ranges(x)
self.assertEqual(expected_ranges, ranges)
self.assertEqual(expected_singles, singles)
x = [1, 3, 4, 6, 7, 9]
expected_ranges = [[3, 4], [6, 7]]
expected_singles = [1, 9]
ranges, singles = allreduce.extract_ranges(x)
self.assertEqual(expected_ranges, ranges)
self.assertEqual(expected_singles, singles)
x = [1, 3, 6, 9]
expected_ranges = []
expected_singles = [1, 3, 6, 9]
ranges, singles = allreduce.extract_ranges(x)
self.assertEqual(expected_ranges, ranges)
self.assertEqual(expected_singles, singles)
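  # pack_range concatenates a contiguous slice of (gradient, variable) pairs
  # from one tower into a single flat tensor and records the original shapes
  # in the `packing` dict under the given key.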
def testPackRange(self):
packing = {}
t0 = tf.constant([0, 1, 2, 3], dtype=tf.float32)
t1 = tf.constant([4, 5, 6, 7], dtype=tf.float32)
gv = [(t0, 'v0'), (t1, 'v1')]
new_t = allreduce.pack_range('0:0', packing, gv, [0, 1])
self.assertEqual(1, new_t.shape.ndims)
self.assertEqual(8, new_t.shape.dims[0])
self.assertEqual(
packing, {
'0:0':
allreduce.GradPackTuple(
indices=range(2),
vars=['v0', 'v1'],
shapes=[tf.TensorShape([4]),
tf.TensorShape([4])])
})
t2 = tf.constant([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=tf.float32)
t3 = tf.constant([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=tf.float32)
gv = [(t0, 'v0'), (t1, 'v1'), (t2, 'v2'), (t3, 'v3')]
packing = {}
new_t = allreduce.pack_range('1:0', packing, gv, [0, 3])
self.assertEqual(1, new_t.shape.ndims)
self.assertEqual(26, new_t.shape.dims[0])
self.assertEqual(
packing, {
'1:0':
allreduce.GradPackTuple(
indices=range(4),
vars=['v0', 'v1', 'v2', 'v3'],
shapes=[
tf.TensorShape([4]),
tf.TensorShape([4]),
tf.TensorShape([3, 3]),
tf.TensorShape([3, 3])
])
})
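  # unpack_grad_tuple reverses pack_range: it splits a packed flat tensor back
  # into (gradient, variable) pairs using the shapes stored in a GradPackTuple.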
def testUnpackGradTuple(self):
packing = {
'0:0':
allreduce.GradPackTuple(
indices=range(4),
vars=['v0', 'v1', 'v2', 'v3'],
shapes=[
tf.TensorShape([4]),
tf.TensorShape([4]),
tf.TensorShape([3, 3]),
tf.TensorShape([3, 3])
])
}
tc = tf.constant([0, 1, 2, 3, 4, 5, 6, 7,
0, 1, 2, 3, 4, 5, 6, 7, 8,
0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=tf.float32)
packed_gv = [tc, 'packing_var_placeholder']
gv = allreduce.unpack_grad_tuple(packed_gv, packing['0:0'])
self.assertEqual(4, len(gv))
self.assertEqual('v0', gv[0][1])
self.assertEqual('v1', gv[1][1])
self.assertEqual('v2', gv[2][1])
self.assertEqual('v3', gv[3][1])
self.assertEqual(1, gv[0][0].shape.ndims)
self.assertEqual(4, gv[0][0].shape.dims[0])
self.assertEqual(1, gv[1][0].shape.ndims)
self.assertEqual(4, gv[1][0].shape.dims[0])
self.assertEqual(2, gv[2][0].shape.ndims)
self.assertEqual(3, gv[2][0].shape.dims[0])
self.assertEqual(3, gv[2][0].shape.dims[1])
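  # pack_small_tensors concatenates, per tower, gradients no larger than
  # max_bytes into a single flat tensor and returns the packing metadata needed
  # by unpack_small_tensors to restore the original structure.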
def testPackSmallTensors(self):
t0 = tf.constant([0, 1, 2, 3], dtype=tf.float32)
t1 = tf.constant([4, 5, 6, 7], dtype=tf.float32)
t2 = tf.constant([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=tf.float32)
t3 = tf.constant([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=tf.float32)
tower_grads = []
for d in range(0, 3):
gv = [(t0, 'v_%d_0' % d), (t1, 'v_%d_1' %d), (t2, 'v_%d_2' %d),
(t3, 'v_%d_3' % d)]
tower_grads.append(gv)
# 1) Set the size limit so small that nothing gets concatenated.
new_tower_grads, packing = allreduce.pack_small_tensors(
tower_grads, max_bytes=12,
max_group=10)
self.assertEqual(tower_grads, new_tower_grads)
self.assertTrue(packing is None)
# 2) Set the size limit so only the first two tensors get concatenated
new_tower_grads, packing = allreduce.pack_small_tensors(
tower_grads, max_bytes=16, # 16 bytes == 4 elements
max_group=10)
self.assertEqual(3, len(new_tower_grads))
self.assertEqual(4, len(tower_grads[0]))
first_tower = new_tower_grads[0]
self.assertEqual(3, len(first_tower))
self.assertEqual(1, first_tower[0][0].shape.ndims)
self.assertEqual(8, first_tower[0][0].shape.dims[0])
self.assertEqual(packing,
{'0:0': allreduce.GradPackTuple(
indices=range(2),
vars=['v_0_0', 'v_0_1'],
shapes=[tf.TensorShape([4]),
tf.TensorShape([4])]),
'1:0': allreduce.GradPackTuple(
indices=range(2),
vars=['v_1_0', 'v_1_1'],
shapes=[tf.TensorShape([4]),
tf.TensorShape([4])]),
'2:0': allreduce.GradPackTuple(
indices=range(2),
vars=['v_2_0', 'v_2_1'],
shapes=[tf.TensorShape([4]),
tf.TensorShape([4])])})
# 3) Set the size limit so all tensors get concatenated
new_tower_grads, packing = allreduce.pack_small_tensors(
tower_grads, max_bytes=256, # bytes = 64 elements
max_group=10)
self.assertEqual(3, len(new_tower_grads))
self.assertEqual(4, len(tower_grads[0]))
self.assertEqual(1, len(new_tower_grads[0]))
first_tower = new_tower_grads[0]
self.assertEqual(1, first_tower[0][0].shape.ndims)
self.assertEqual(26, first_tower[0][0].shape.dims[0])
self.assertEqual(packing,
{'0:0': allreduce.GradPackTuple(
indices=range(4),
vars=['v_0_0', 'v_0_1', 'v_0_2', 'v_0_3'],
shapes=[tf.TensorShape([4]),
tf.TensorShape([4]),
tf.TensorShape([3, 3,]),
tf.TensorShape([3, 3,])]),
'1:0': allreduce.GradPackTuple(
indices=range(4),
vars=['v_1_0', 'v_1_1', 'v_1_2', 'v_1_3'],
shapes=[tf.TensorShape([4]),
tf.TensorShape([4]),
tf.TensorShape([3, 3,]),
tf.TensorShape([3, 3,])]),
'2:0': allreduce.GradPackTuple(
indices=range(4),
vars=['v_2_0', 'v_2_1', 'v_2_2', 'v_2_3'],
shapes=[tf.TensorShape([4]),
tf.TensorShape([4]),
tf.TensorShape([3, 3,]),
tf.TensorShape([3, 3,])])})
def testUnpackSmallTensors(self):
packing = {'0:0': allreduce.GradPackTuple(indices=range(2),
vars=['v_0_0', 'v_0_1'],
shapes=[tf.TensorShape([4]),
tf.TensorShape([4])]),
'0:1': allreduce.GradPackTuple(indices=range(3, 5),
vars=['v_0_3', 'v_0_4'],
shapes=[tf.TensorShape([3, 3,]),
tf.TensorShape([3, 3,])]),
'1:0': allreduce.GradPackTuple(indices=range(2),
vars=['v_1_0', 'v_1_1'],
shapes=[tf.TensorShape([4]),
tf.TensorShape([4])]),
'1:1': allreduce.GradPackTuple(indices=range(3, 5),
vars=['v_1_3', 'v_1_4'],
shapes=[tf.TensorShape([3, 3,]),
tf.TensorShape([3, 3,])])}
t0 = tf.constant([0, 1, 2, 3, 4, 5, 6, 7], dtype=tf.float32)
t1 = tf.constant([17, 17], dtype=tf.float32)
t2 = tf.constant([0, 1, 2, 3, 4, 5, 6, 7, 8,
0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=tf.float32)
t3 = tf.constant([0], dtype=tf.float32)
tower_grads = []
for d in range(0, 2):
one_tower = [(t0, 'packing_var_placeholder'),
(t2, 'packing_var_placeholder'),
(t1, 'v_%d_2' % d), (t3, 'v_%d_5' %d)]
tower_grads.append(one_tower)
new_tower_grads = allreduce.unpack_small_tensors(tower_grads, packing)
self.assertEqual(2, len(new_tower_grads))
for d, tg in enumerate(new_tower_grads):
self.assertEqual(6, len(tg))
self.assertEqual('v_%d_0' % d, tg[0][1])
self.assertEqual('v_%d_1' % d, tg[1][1])
self.assertEqual('v_%d_2' % d, tg[2][1])
self.assertEqual('v_%d_3' % d, tg[3][1])
self.assertEqual('v_%d_4' % d, tg[4][1])
self.assertEqual('v_%d_5' % d, tg[5][1])
self.assertEqual(1, tg[0][0].shape.ndims)
self.assertEqual(4, tg[0][0].shape.dims[0])
self.assertEqual(1, tg[1][0].shape.ndims)
self.assertEqual(4, tg[1][0].shape.dims[0])
self.assertEqual(1, tg[2][0].shape.ndims)
self.assertEqual(2, tg[2][0].shape.dims[0])
self.assertEqual(2, tg[3][0].shape.ndims)
self.assertEqual(3, tg[3][0].shape.dims[0])
self.assertEqual(3, tg[3][0].shape.dims[1])
self.assertEqual(2, tg[4][0].shape.ndims)
self.assertEqual(3, tg[4][0].shape.dims[0])
self.assertEqual(3, tg[4][0].shape.dims[1])
self.assertEqual(1, tg[5][0].shape.ndims)
self.assertEqual(1, tg[5][0].shape.dims[0])
class DynamicPackingTest(test_util.TensorFlowTestCase):
"""Packing/Unpacking tests that require executing a TensorFlow session."""
def _init_tensors(self, num_towers, tensor_shapes):
"""Construct a collection of tensors across multiple devices."""
num_tensors = len(tensor_shapes)
consts = []
tensors = []
vrbls = []
tower_grads = []
tf.Variable([-1], dtype=tf.int32, name='packing_var_placeholder')
for dev_idx in range(0, num_towers):
devname = '/job:localhost/device:GPU:%d' % dev_idx
consts.append([])
tensors.append([])
vrbls.append([])
with tf.device(devname):
base_value = 0
gv_tuples = []
for t_idx in range(0, num_tensors):
shape = tensor_shapes[t_idx]
num_elts = 0
for d in shape:
num_elts = (num_elts or 1) * d
c = np.fromiter(range(base_value, base_value + num_elts),
dtype=np.float32).reshape(shape)
base_value += num_elts
consts[dev_idx].append(c)
tensors[dev_idx].append(tf.constant(c))
vrbls[dev_idx].append(
tf.Variable(c, name='v_d%d_t%d' % (dev_idx, t_idx)))
gv_tuples.append((tensors[dev_idx][-1], vrbls[dev_idx][-1]))
tower_grads.append(gv_tuples)
return tower_grads, consts, tensors, vrbls
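  # Fields of _test_tuple, used by the pack/unpack tests below:
  #   num_devices: number of towers to build tensors for.
  #   in_shapes:   shapes of the per-tower input tensors.
  #   out_shapes:  expected tensor shapes after packing.
  #   out_i:       expected first element of each packed tensor (tensors are
  #                filled with consecutive values in _init_tensors above).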
_test_tuple = pycoll.namedtuple('_test_tuple',
'num_devices, in_shapes out_shapes out_i')
def _do_pack_unpack_test(self, tt):
"""Do a single pack-unpack test.
Args:
tt: A _test_tuple defining the parameters of the test to do.
This test executes a graph that performs a pack of tower_grads
followed by an unpack and verifies that the shapes and values
of gradient tensors are unchanged, along with paired variables.
"""
with ops.Graph().as_default():
tower_grads, consts, _, vrbls = self._init_tensors(
tt.num_devices, tt.in_shapes)
packed_tg, packing = allreduce.pack_small_tensors(
tower_grads, max_bytes=40, max_group=10)
unpacked_tg = allreduce.unpack_small_tensors(packed_tg, packing)
with self.test_session() as sess:
sess.run(variables.global_variables_initializer())
packed = sess.run(packed_tg)
for d in range(0, tt.num_devices):
for t in range(0, len(tt.out_shapes)):
num_elts = 0
for dim in tt.out_shapes[t]:
num_elts = (num_elts or 1) * dim
self.assertTrue(np.array_equal(
np.array(range(tt.out_i[t], tt.out_i[t] + num_elts),
dtype=np.float32).reshape(tt.out_shapes[t]),
packed[d][t][0]))
unpacked = sess.run(unpacked_tg)
for d in range(0, tt.num_devices):
for t in range(0, len(tt.in_shapes)):
self.assertTrue(np.array_equal(consts[d][t], unpacked[d][t][0]))
self.assertEqual(vrbls[d][t], unpacked_tg[d][t][1])
def testPackUnpack0(self):
self._do_pack_unpack_test(
self._test_tuple(num_devices=3,
in_shapes=[[8], [3, 3], [12], [5, 5, 5]],
out_shapes=[[17], [12], [5, 5, 5]],
out_i=[0, 17, 29]))
def testPackUnpack1(self):
self._do_pack_unpack_test(
self._test_tuple(num_devices=4,
in_shapes=[[5, 5, 5], [2, 3], [5]],
out_shapes=[[11], [5, 5, 5]],
out_i=[125, 0]))
def testPackUnpack2(self):
self._do_pack_unpack_test(
self._test_tuple(num_devices=2,
in_shapes=[[5, 5, 5], [2, 3], [1, 5], [7], [100]],
out_shapes=[[18], [5, 5, 5], [100]],
out_i=[125, 0, 143]))
def _do_all_reduce_pack_test(self, tt):
"""Test that all-reduce results are the same with or without packing."""
with ops.Graph().as_default():
tower_grads, consts, _, _ = self._init_tensors(
tt.num_devices, tt.in_shapes)
dev_prefixes = ['/job:localhost']
num_workers = 1
alg = 'xring'
shards = 1
single_session = True
gpu_indices = range(0, tt.num_devices)
assert len(gpu_indices) == len(tower_grads)
no_pack_all_reduce = allreduce.sum_gradients_all_reduce(
single_session,
dev_prefixes, tower_grads, num_workers, alg, shards,
gpu_indices,
agg_small_grads_max_bytes=0, agg_small_grads_max_group=1)
packed_tg, packing = allreduce.pack_small_tensors(tower_grads, 100, 100)
packed_all_reduce = allreduce.sum_gradients_all_reduce(
single_session,
dev_prefixes, packed_tg, num_workers, alg, shards,
gpu_indices,
agg_small_grads_max_bytes=0, agg_small_grads_max_group=1)
unpacked_tg = allreduce.unpack_small_tensors(packed_all_reduce, packing)
with self.test_session() as sess:
sess.run(variables.global_variables_initializer())
no_pack_values = sess.run(no_pack_all_reduce)
pack_unpack_values = sess.run(unpacked_tg)
for d in range(1, tt.num_devices):
for t in range(0, len(tt.in_shapes)):
self.assertTrue(np.allclose(no_pack_values[d][t][0],
tt.num_devices * consts[0][t]))
self.assertTrue(np.array_equal(no_pack_values[d][t][0],
pack_unpack_values[d][t][0]))
def testAllReducePacked0(self):
self._do_all_reduce_pack_test(
self._test_tuple(num_devices=3,
in_shapes=[[8], [3, 3], [12], [5, 5, 5]],
out_shapes=[[17], [12], [5, 5, 5]],
out_i=[0, 17, 29]))
def testAllReducePacked1(self):
self._do_all_reduce_pack_test(
self._test_tuple(num_devices=2,
in_shapes=[[8], [3, 3], [12], [5, 5, 5], [3], [4]],
out_shapes=[[17], [7], [12], [5, 5, 5]],
out_i=[0, 17, 29, 154, 157]))
if __name__ == '__main__':
tf.disable_v2_behavior()
tf.test.main()