Commit 2d342592 authored by Dan Holtmann-Rice, committed by A. Unique TensorFlower

Internal change

PiperOrigin-RevId: 335446217
parent 3a9ed6bd
![TensorFlow Requirement: 2.x](https://img.shields.io/badge/TensorFlow%20Requirement-2.x-brightgreen)
# Orbit

Orbit is a flexible, lightweight library designed to make it easy to write
[custom training loops][custom_training] in TensorFlow 2. Orbit handles common
model training tasks such as saving checkpoints, running model evaluations, and
setting up summary writing, while giving users full control over implementing
the inner training loop. It integrates with `tf.distribute` seamlessly and
supports running on different device types (CPU, GPU, and TPU). The core code
is intended to be easy to read and fork.

See our [g3doc](g3doc) at go/orbit-trainer for additional documentation.
[custom_training]: https://www.tensorflow.org/tutorials/distribute/custom_training
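
As a rough sketch (not part of this commit), wiring the pieces together typically
looks like the following. `MyTrainer` and `MyEvaluator` are placeholders for
user-defined `orbit.AbstractTrainer`/`orbit.AbstractEvaluator` subclasses, and the
constructor arguments are the ones documented in `controller.py` below; a
self-contained trainer/evaluator sketch appears after the `runner.py` diff further
down.

```python
import orbit
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  trainer = MyTrainer()      # A user-defined orbit.AbstractTrainer (placeholder).
  evaluator = MyEvaluator()  # A user-defined orbit.AbstractEvaluator (placeholder).

controller = orbit.Controller(
    strategy=strategy,
    trainer=trainer,
    evaluator=evaluator,
    global_step=trainer.optimizer.iterations,
    steps_per_loop=100,
    summary_dir="/tmp/orbit_example/summaries")

# Train for 1000 global steps, evaluating on 100 batches every 500 steps.
controller.train_and_evaluate(train_steps=1000, eval_steps=100, eval_interval=500)
```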
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Defines exported symbols for the `orbit` package."""

from orbit import utils
...
@@ -12,14 +12,14 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Provides a `Controller` class for managing the outer training loop."""

import pprint
import time
from typing import Callable, Optional, Union

from absl import logging

from orbit import runner
from orbit import utils
@@ -27,14 +27,50 @@ from orbit import utils
import tensorflow as tf

def _log(message: str):
  """Logs `message` to the `info` log, and also prints to stdout."""
  logging.info(message)
  print(message)

logging.ABSLLogger.register_frame_to_skip(__file__, _log.__name__)


def _format_output(output, indent=4):
  """Formats `output`, either on one line, or indented across multiple lines."""
  formatted = pprint.pformat(output)
  lines = formatted.splitlines()
  if len(lines) == 1:
    return formatted
  lines = [" " * indent + line for line in lines]
  return "\n" + "\n".join(lines)

class Controller:
  """Class that controls the outer loop of model training and evaluation.

  Orbit divides training and evaluation into "inner" and "outer" loops. Inner
  loops are implemented by users in the form of `AbstractTrainer` and
  `AbstractEvaluator` subclasses, and define how to run a given number of
  training or evaluation steps. The outer loop is provided by this `Controller`,
  and interleaves calls to the user provided inner loops with additional actions
  such as saving checkpoints, running evaluations, and writing summaries
  (depending on the arguments passed to `Controller.__init__` and the method
  being called).

  There are four top-level "outer loops" provided:

    - `train`, which trains until a specified number of global steps is reached;
    - `evaluate`, for one-off model evaluation;
    - `train_and_evaluate`, for interleaved training and evaluation;
    - `evaluate_continuously`, for monitoring a given directory and running
      evaluations on new model checkpoints.

  While this class attempts to provide out-of-the-box solutions for common
  training and evaluation use cases, the internal details and method
  implementations are also intended to be simple enough to make subclassing or
  other custom outer loop implementations easy to achieve.
  """

  def __init__(
      self,
@@ -47,63 +83,82 @@
      checkpoint_manager: Optional[tf.train.CheckpointManager] = None,
      # Summary related
      summary_interval: Optional[int] = None,
      summary_dir: Optional[str] = None,
      # Evaluation related
      eval_summary_dir: Optional[str] = None):
    """Initializes a `Controller` instance.

    Note that if `checkpoint_manager` is provided and there are checkpoints in
    the associated model directory, the model will be restored from the most
    recent checkpoint during this `__init__` method.

    Args:
      strategy: An instance of `tf.distribute.Strategy`. If not provided, the
        strategy will be initialized from the current in-scope strategy using
        `tf.distribute.get_strategy()`.
      trainer: An instance of `orbit.AbstractTrainer`, which implements the
        inner training loop.
      evaluator: An instance of `orbit.AbstractEvaluator`, which implements
        evaluation.
      global_step: An integer `tf.Variable` storing the global training step
        number. Usually this can be obtained from the `iterations` property of
        the model's optimizer (e.g. `trainer.optimizer.iterations`). In cases
        where multiple optimizers are used, or if one model "step" corresponds
        to more than one update to model parameters, users can create and
        increment their own global step variable as well. In this case it is
        recommended to create the `tf.Variable` inside the distribution strategy
        scope, with `aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA` (see
        also `orbit.utils.create_global_step()`).
      steps_per_loop: The number of steps to run in each inner loop of training
        (passed as the `num_steps` parameter of `trainer.train`).
      checkpoint_manager: An instance of `tf.train.CheckpointManager`. If
        provided and there are checkpoints in the associated model directory,
        the model will be restored from the most recent checkpoint inside this
        `__init__` method. If not provided, the `Controller` will not
        automatically save to or restore from checkpoints.
      summary_interval: Step interval for training summaries. Note that this
        argument only applies to `tf.summary` calls inside the `trainer.train`
        function. Summaries written by the `Controller` (specifically
        "steps_per_second" and output from the `trainer.train` method) will
        always be enabled unless the `summary_dir` parameter is `None`. If set,
        the value must be divisible by `steps_per_loop`.
      summary_dir: The directory to write summaries to. To use the same
        directory as for checkpointing, pass `checkpoint_manager.directory`. If
        `None`, no training summaries will be written.
      eval_summary_dir: The directory to write eval summaries to. If `None`, it
        will be set to `summary_dir`. If both `summary_dir` and
        `eval_summary_dir` are `None`, no eval summaries will be written.

    Raises:
      ValueError: If both `trainer` and `evaluator` are `None`.
      ValueError: If `steps_per_loop` is not a positive integer.
      ValueError: If `summary_interval` is not a positive integer or is not
        divisible by `steps_per_loop`.
    """
    if trainer is None and evaluator is None:
      raise ValueError("`trainer` and `evaluator` should not both be `None`.")

    if trainer is not None:
      if steps_per_loop is None:
        raise ValueError(
            "`steps_per_loop` is required when `trainer` is provided.")
      elif not isinstance(steps_per_loop, int) or steps_per_loop < 1:
        raise ValueError(
            f"`steps_per_loop` ({steps_per_loop}) must be a positive integer.")

      if summary_interval is not None:
        if summary_interval <= 0:
          raise ValueError(
              f"`summary_interval` ({summary_interval}) must be larger than 0.")
        elif summary_interval % steps_per_loop != 0:
          raise ValueError(
              f"`summary_interval` ({summary_interval}) must be a multiple "
              f"of `steps_per_loop` ({steps_per_loop}).")

    if global_step is None:
      raise ValueError("`global_step` is required.")
    elif not isinstance(global_step, tf.Variable):
      raise ValueError("`global_step` must be a `tf.Variable`.")

    self.trainer = trainer
    self.evaluator = evaluator
@@ -136,157 +191,129 @@ class Controller:
    # Restores the model if needed.
    # TODO(momernick): We probably only want to do this on certain occasions?
    if self.checkpoint_manager is not None:
      restored_path = self.restore_checkpoint()
      if restored_path:
        _log(f"restored from checkpoint: {restored_path}")

  def train(self, steps: int, checkpoint_at_completion: bool = True):
    """Runs training until the specified global step count has been reached.

    This method makes calls to `self.trainer.train()` until the global step
    count is equal to `steps`. It will additionally save checkpoints (if a
    `CheckpointManager` was passed to `Controller.__init__`) and summarize
    training output (if `summary_dir` is set).

    Args:
      steps: The global step count to train up to.
      checkpoint_at_completion: Whether to save a checkpoint when this method
        returns (regardless of the checkpointing interval). Defaults to `True`.
    """
    self._require("trainer", for_method="train")

    # TODO(momernick): Support steps=None or -1 (training to exhaustion).
    current_step = self.global_step.numpy()  # Cache, since this is expensive.
    _log(f"train | step: {current_step: 6d} | training until step {steps}...")
    while current_step < steps:
      # Calculates steps to run for the next train loop.
      num_steps = min(steps - current_step, self.steps_per_loop)
      self._train_n_steps(num_steps)
      self._maybe_save_checkpoint()
      current_step = self.global_step.numpy()

    if checkpoint_at_completion:
      self._maybe_save_checkpoint(check_interval=False)

  def evaluate(self, steps: int = -1) -> Optional[runner.Output]:
    """Runs evaluation for the given number of steps.

    This method calls `self.evaluator.evaluate(steps)`, then writes the returned
    summaries (if any).

    Args:
      steps: The number of evaluation steps to run. The value `-1` is reserved
        as a special sentinel to indicate a "complete" evaluation that runs
        until the underlying dataset is exhausted. Support for this is dependent
        on the specific `evaluator` being used.

    Returns:
      The evaluation results as a dictionary mapping names to NumPy values.

    Raises:
      ValueError: If `evaluator` was not provided to `Controller.__init__`.
      ValueError: If no checkpoint is present in `checkpoint_manager.directory`.
      ValueError: If `steps` is not a positive value or -1.
    """
    self._require("evaluator", for_method="evaluate")

    if steps > 0:
      steps_msg = f"running {steps} steps of evaluation..."
    elif steps == -1:
      steps_msg = "running complete evaluation..."
    else:
      raise ValueError(f"`steps` ({steps}) should be > 0, or == -1.")

    current_step = self.global_step.numpy()
    _log(f" eval | step: {current_step: 6d} | {steps_msg}")

    start = time.time()
    with self.eval_summary_manager.summary_writer().as_default():
      steps_tensor = tf.convert_to_tensor(steps, dtype=tf.int32)
      eval_output = self.evaluator.evaluate(steps_tensor)
      eval_output = tf.nest.map_structure(utils.get_value, eval_output or {})
    elapsed = time.time() - start

    _log(f" eval | step: {current_step: 6d} | "
         f"eval time: {elapsed: 6.1f} | "
         f"output: {_format_output(eval_output)}")

    self.eval_summary_manager.write_summaries(eval_output)
    self.eval_summary_manager.flush()

    return eval_output

  def train_and_evaluate(self,
                         train_steps: int = None,
                         eval_steps: int = None,
                         eval_interval: int = None):
    """Runs interleaved training and evaluation.

    This method interleaves calls to `self.train()` and `self.evaluate()`,
    training the model until the global step count equals `train_steps`, and
    running an evaluation for `eval_steps` every `eval_interval` training steps.
    In addition, this method will run a final evaluation at the end of the
    training sequence.

    Args:
      train_steps: The global step count to train up to.
      eval_steps: The number of steps to run during an evaluation. If None, this
        method will evaluate over the entire evaluation dataset.
      eval_interval: The number of training steps to run between evaluations. If
        set, training will always stop every `eval_interval` steps, even if this
        results in a shorter inner loop than specified by the `steps_per_loop`
        setting. If None, evaluation will only be performed after training is
        complete.

    Raises:
      ValueError: If eval_interval is not a multiple of self.steps_per_loop.
    """
    self._require("trainer", for_method="train_and_evaluate")
    self._require("evaluator", for_method="train_and_evaluate")

    current_step = self.global_step.numpy()  # Cache, since this is expensive.
    eval_interval = eval_interval or (train_steps - current_step)
    while current_step < train_steps:
      interval = min(train_steps - current_step, eval_interval)
      num_steps = current_step + interval
      self.train(steps=num_steps, checkpoint_at_completion=False)
      self.evaluate(steps=eval_steps)
      current_step = self.global_step.numpy()
    self._maybe_save_checkpoint(check_interval=False)

  def evaluate_continuously(self,
                            steps: int = None,
                            timeout: Optional[Union[int, float]] = None,
                            timeout_fn: Optional[Callable[[], bool]] = None):
    """Continuously monitors a directory and evaluates new checkpoints in it.

    This method continuously monitors a directory as specified by this
    Controller's CheckpointManager init arg and runs evaluation on the
@@ -303,8 +330,10 @@

    Raises:
      ValueError: If no checkpoint found in `self.checkpoint_manager.directory`.
      ValueError: If `evaluator` was not provided as a controller init arg.
    """
    self._require("evaluator", for_method="evaluate_continuously")
    self._require("checkpoint_manager", for_method="evaluate_continuously")

    for checkpoint_path in tf.train.checkpoints_iterator(
        self.checkpoint_manager.directory,
        timeout=timeout,
@@ -312,63 +341,108 @@
      self.restore_checkpoint(checkpoint_path)
      self.evaluate(steps)

  def restore_checkpoint(self, checkpoint_path: str = None):
    """Restores the model from a checkpoint.

    Args:
      checkpoint_path: An optional string specifying the checkpoint path to
        restore from. If `None`, will restore from the most recent checkpoint
        (or initialize the model using a custom `init_fn` if no checkpoints can
        be found) using `self.checkpoint_manager.restore_or_initialize()`.

    Returns:
      The path to the restored checkpoint if a restore happened, or `None` if no
      restore occurred.
    """
    self._require("checkpoint_manager", for_method="restore_checkpoint")

    with self.strategy.scope():
      # Checkpoint restoring should be inside scope (b/139450638).
      if checkpoint_path is not None:
        _log(f"restoring model from {checkpoint_path}...")
        self.checkpoint_manager.checkpoint.restore(checkpoint_path)
      else:
        _log("restoring or initializing model...")
        checkpoint_path = self.checkpoint_manager.restore_or_initialize()

    if checkpoint_path is not None:
      _log(f"restored model from {checkpoint_path}.")
    else:
      _log("initialized model.")

    return checkpoint_path

  def save_checkpoint(self):
    """Saves the model to a checkpoint.

    This method will save a checkpoint containing the current state of the
    model.

    Raises:
      ValueError: If no `checkpoint_manager` was provided to
        `Controller.__init__`.
    """
    self._require("checkpoint_manager", for_method="save_checkpoint")
    self._maybe_save_checkpoint(check_interval=False)

  def _train_n_steps(self, num_steps: int):
    """Runs training for `num_steps` steps.

    Also prints/logs updates about training progress, and summarizes training
    output (if output is returned from `self.trainer.train()`, and if
    `self.summary_dir` is set).

    Args:
      num_steps: An integer specifying how many steps of training to run.

    Raises:
      RuntimeError: If `global_step` is not properly incremented by `num_steps`
        after calling `self.trainer.train(num_steps)`.
    """
    if not self.step_timer:
      self.step_timer = StepTimer(self.global_step)
    current_step = self.global_step.numpy()

    with self.summary_manager.summary_writer().as_default():
      should_record = False  # Allows static optimization in no-summary cases.
      if self.summary_interval:
        # Create a predicate to determine when summaries should be written.
        should_record = lambda: (self.global_step % self.summary_interval == 0)
      with tf.summary.record_if(should_record):
        num_steps_tensor = tf.convert_to_tensor(num_steps, dtype=tf.int32)
        train_output = self.trainer.train(num_steps_tensor)
        train_output = tf.nest.map_structure(utils.get_value, train_output or {})

    # Verify that global_step was updated properly, then update current_step.
    expected_step = current_step + num_steps
    if self.global_step.numpy() != expected_step:
      raise RuntimeError(
          f"`trainer.train({num_steps})` did not update `global_step` by "
          f"{num_steps}. Old value was {current_step}, expected updated value "
          f"to be {expected_step}, but it was {self.global_step.numpy()}.")
    current_step = expected_step

    steps_per_second = self.step_timer.steps_per_second()
    _log(f"train | step: {current_step: 6d} | "
         f"steps/sec: {steps_per_second: 6.1f} | "
         f"output: {_format_output(train_output)}")

    train_output["steps_per_second"] = steps_per_second
    self.summary_manager.write_summaries(train_output)
    self.summary_manager.flush()

  def _maybe_save_checkpoint(self, check_interval: bool = True):
    """Conditionally saves a checkpoint.

    A checkpoint is saved if a CheckpointManager is available, and if the
    required number of steps has elapsed since the last checkpoint was saved
    (although this condition can be disabled by setting `check_interval=False`).

    Args:
      check_interval: Whether to check if the checkpoint interval has fully
        elapsed. If `False`, a checkpoint is saved regardless of the elapsed
        steps since the most recent checkpoint, unless no `checkpoint_manager`
        was provided to `Controller.__init__`.

    Returns:
      A boolean indicating whether a checkpoint was saved.
@@ -376,12 +450,19 @@
    if self.checkpoint_manager and self.checkpoint_manager.checkpoint_interval:
      ckpt_path = self.checkpoint_manager.save(
          checkpoint_number=self.global_step.numpy(),
          check_interval=check_interval)
      if ckpt_path is not None:
        _log(f"saved checkpoint to {ckpt_path}.")
        return True
    return False

  def _require(self, attribute, for_method):
    """Utility method to raise an error if the given `attribute` is not set."""
    if getattr(self, attribute, None) is None:
      raise ValueError(
          f"`{attribute}` is not set. Pass `{attribute}` to "
          f"`Controller.__init__` before calling `{for_method}()`.")

class StepTimer:
  """Utility class for measuring steps/second."""
...
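
As a hedged illustration of the checkpointing and summary arguments documented in
`Controller.__init__` above (a sketch, not code from this change; `trainer`,
`model`, and `model_dir` are assumed to already exist, and
`orbit.utils.create_global_step()` is the helper the docstring refers to):

```python
import orbit
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  # Per the docstring above, create the step variable inside the strategy scope
  # (it uses ONLY_FIRST_REPLICA aggregation).
  global_step = orbit.utils.create_global_step()

checkpoint = tf.train.Checkpoint(model=model, global_step=global_step)
checkpoint_manager = tf.train.CheckpointManager(
    checkpoint,
    directory=model_dir,
    max_to_keep=5,
    step_counter=global_step,
    checkpoint_interval=1000)  # Save at most once every 1000 steps.

controller = orbit.Controller(
    trainer=trainer,
    global_step=global_step,
    steps_per_loop=100,
    checkpoint_manager=checkpoint_manager,
    # Reuse the checkpoint directory for training summaries, as the docstring
    # suggests; summary_interval must be divisible by steps_per_loop.
    summary_dir=checkpoint_manager.directory,
    summary_interval=100)

# A final checkpoint is always written when train() returns, regardless of the
# checkpoint interval, unless checkpoint_at_completion=False.
controller.train(steps=10_000)
```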
@@ -15,10 +15,14 @@
"""Tests for orbit.controller."""

import os

from absl import logging
from absl.testing import parameterized
import numpy as np

from orbit import controller
from orbit import runner
from orbit import standard_runner
import tensorflow as tf
@@ -65,12 +69,8 @@ class TestRunner(standard_runner.StandardTrainer,
    self.train_loss = tf.keras.metrics.Mean("train_loss", dtype=tf.float32)
    self.eval_loss = tf.keras.metrics.Mean("eval_loss", dtype=tf.float32)
    self.return_numpy = return_numpy
    train_dataset = self.strategy.distribute_datasets_from_function(dataset_fn)
    eval_dataset = self.strategy.distribute_datasets_from_function(dataset_fn)
    standard_runner.StandardTrainer.__init__(self, train_dataset)
    standard_runner.StandardEvaluator.__init__(self, eval_dataset)
@@ -95,8 +95,7 @@ class TestRunner(standard_runner.StandardTrainer,
    }

  def build_eval_dataset(self):
    return self.strategy.distribute_datasets_from_function(dataset_fn)

  def eval_begin(self):
    self.eval_loss.reset_states()
@@ -125,8 +124,7 @@ class TestEvaluator(standard_runner.StandardEvaluator):
  def __init__(self):
    self.strategy = tf.distribute.get_strategy()
    self.model = create_model()
    eval_dataset = self.strategy.distribute_datasets_from_function(dataset_fn)
    standard_runner.StandardEvaluator.__init__(self, eval_dataset)

  def eval_reduce(self, state, output):
@@ -157,16 +155,20 @@ class TestEvaluator(standard_runner.StandardEvaluator):
    }

class TestEvaluatorNoOutput(runner.AbstractEvaluator):

  def evaluate(self, num_steps):
    pass

class TestEvaluatorWithNestedSummary(standard_runner.StandardEvaluator):
  """Implements the training and evaluation APIs for the test model."""

  def __init__(self):
    self.strategy = tf.distribute.get_strategy()
    self.model = create_model()
    dataset = self.strategy.distribute_datasets_from_function(dataset_fn)
    dataset2 = self.strategy.distribute_datasets_from_function(dataset_fn)
    self.loss = tf.keras.metrics.Mean("loss", dtype=tf.float32)
    self.accuracy = tf.keras.metrics.CategoricalAccuracy(
        "accuracy", dtype=tf.float32)
@@ -217,9 +219,7 @@ class TestTrainerWithSummaries(standard_runner.StandardTrainer):
    self.optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.1)
    self.global_step = self.optimizer.iterations
    self.train_loss = tf.keras.metrics.Mean("train_loss", dtype=tf.float32)
    train_dataset = self.strategy.distribute_datasets_from_function(dataset_fn)
    standard_runner.StandardTrainer.__init__(
        self,
        train_dataset,
@@ -227,8 +227,7 @@ class TestTrainerWithSummaries(standard_runner.StandardTrainer):
        use_tpu_summary_optimization=True))

  def build_train_dataset(self):
    return self.strategy.distribute_datasets_from_function(dataset_fn)

  def train_step(self, iterator):
@@ -344,6 +343,26 @@ class ControllerTest(tf.test.TestCase, parameterized.TestCase):
    self.assertNotEmpty(tf.io.gfile.glob(
        os.path.join(self.model_dir, "summaries/eval/events.*")))

  def test_restore_from_most_recent_checkpoint(self):
    test_runner = TestRunner()
    checkpoint = tf.train.Checkpoint(model=test_runner.model)
    checkpoint_manager = tf.train.CheckpointManager(
        checkpoint,
        self.model_dir,
        max_to_keep=None,
        step_counter=test_runner.global_step,
        checkpoint_interval=5)
    test_controller = controller.Controller(
        trainer=test_runner,
        global_step=test_runner.global_step,
        checkpoint_manager=checkpoint_manager,
        eval_summary_dir=os.path.join(self.model_dir, "summaries/eval"),
        steps_per_loop=5)
    test_controller.train(20)
    self.assertLen(checkpoint_manager.checkpoints, 4)
    restored_path = test_controller.restore_checkpoint()
    self.assertEqual(restored_path, checkpoint_manager.checkpoints[-1])

  @parameterized.named_parameters(("return_numpy", True),
                                  ("return_tensor", False))
  def test_train_and_evaluate(self, return_numpy):
@@ -601,7 +620,7 @@ class ControllerTest(tf.test.TestCase, parameterized.TestCase):
    self.assertLess(test_runner.global_step, 10)

  def test_evaluate_with_loss_output(self):
    test_evaluator = TestEvaluator()
    checkpoint = tf.train.Checkpoint(model=test_evaluator.model)
@@ -622,6 +641,13 @@ class ControllerTest(tf.test.TestCase, parameterized.TestCase):
        summaries_with_matching_keyword(
            "eval_loss", os.path.join(self.model_dir, "summaries/eval")))

  def test_evaluate_with_no_output(self):
    test_controller = controller.Controller(
        evaluator=TestEvaluatorNoOutput(),
        global_step=tf.Variable(0, dtype=tf.int64),
        eval_summary_dir=os.path.join(self.model_dir, "summaries/eval"))
    self.assertEqual(test_controller.evaluate(steps=5), {})

  def test_train_and_evaluate_reset_datasets(self):
    test_runner = TestRunner()
@@ -635,11 +661,9 @@ class ControllerTest(tf.test.TestCase, parameterized.TestCase):
        train_steps=10, eval_steps=2, eval_interval=6)

    train_dataset = (
        test_runner.strategy.distribute_datasets_from_function(dataset_fn))
    eval_dataset = (
        test_runner.strategy.distribute_datasets_from_function(dataset_fn))
    test_runner.train_dataset = train_dataset
    test_runner.eval_dataset = eval_dataset
...
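
The test changes above replace `experimental_distribute_datasets_from_function`
with the newer `distribute_datasets_from_function`. A small standalone sketch of
that pattern (the shapes and batch size here are arbitrary, not taken from the
tests):

```python
import tensorflow as tf

def dataset_fn(input_context: tf.distribute.InputContext) -> tf.data.Dataset:
  # Build a per-worker dataset; shard and batch according to the input context.
  batch_size = input_context.get_per_replica_batch_size(global_batch_size=8)
  ds = tf.data.Dataset.from_tensor_slices(
      (tf.random.normal([32, 10]), tf.ones([32, 1])))
  ds = ds.shard(input_context.num_input_pipelines,
                input_context.input_pipeline_id)
  return ds.batch(batch_size)

strategy = tf.distribute.MirroredStrategy()
distributed_dataset = strategy.distribute_datasets_from_function(dataset_fn)
for features, labels in distributed_dataset:
  # Each element is a (possibly per-replica) batch, ready for strategy.run.
  break
```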
@@ -12,62 +12,72 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Provides AbstractTrainer/Evaluator base classes, defining train/eval APIs."""

import abc
from typing import Dict, Optional, Union

import numpy as np
import tensorflow as tf

Output = Dict[str, Union[tf.Tensor, float, np.number, np.ndarray, 'Output']]  # pytype: disable=not-supported-yet

class AbstractTrainer(tf.Module, metaclass=abc.ABCMeta):
  """An abstract class defining the API required for training."""

  @abc.abstractmethod
  def train(self, num_steps: tf.Tensor) -> Optional[Output]:
    """Implements `num_steps` steps of training.

    This method will be called by the `Controller` to perform the "inner loop"
    of training. This inner loop amortizes the cost of bookkeeping associated
    with checkpointing, evaluation, and writing summaries. Additionally, the
    inner loop can be implemented (if desired) using TensorFlow's looping
    constructs (e.g. a `for` loop over a `tf.range` inside a `tf.function`),
    which can be necessary for getting optimal performance when running on TPU.
    For cases that don't require peak performance, a simple Python loop can be
    used instead for simplicity.

    Args:
      num_steps: The number of training steps to run. Note that it is up to the
        model what constitutes a "step", which may involve more than one update
        to model parameters (e.g., if training a GAN).

    Returns:
      Either `None`, or a dictionary mapping names to `Tensor`s or NumPy values.
      If a dictionary is returned, it will be written to logs and as TensorBoard
      summaries. The dictionary may also be nested, which will generate a
      hierarchy of summary directories.
    """
    pass

class AbstractEvaluator(tf.Module, metaclass=abc.ABCMeta):
  """An abstract class defining the API required for evaluation."""

  @abc.abstractmethod
  def evaluate(self, num_steps: tf.Tensor) -> Optional[Output]:
    """Implements `num_steps` steps of evaluation.

    This method will be called by the `Controller` to perform an evaluation.
    The `num_steps` parameter specifies the number of steps of evaluation to
    run, as set by the user when calling one of the `Controller`'s evaluation
    methods. A special sentinel value of `-1` is reserved to indicate
    evaluation should run until the underlying data source is exhausted.

    Args:
      num_steps: The number of evaluation steps to run. Note that it is up to
        the model what constitutes a "step". Evaluations may also want to
        support "complete" evaluations when `num_steps == -1`, running until a
        given data source is exhausted.

    Returns:
      Either `None`, or a dictionary mapping names to `Tensor`s or NumPy values.
      If a dictionary is returned, it will be written to logs and as TensorBoard
      summaries. The dictionary may also be nested, which will generate a
      hierarchy of summary directories.
    """
    pass
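
To make the contract above concrete, here is a hedged, self-contained sketch of
direct `AbstractTrainer`/`AbstractEvaluator` subclasses (eager Python loops only;
real implementations would typically wrap the per-step work in `tf.function`, or
use the `StandardTrainer`/`StandardEvaluator` helpers in the next file). The class
and variable names are illustrative, not part of this change.

```python
import numpy as np
import tensorflow as tf

from orbit import runner


class TinyTrainer(runner.AbstractTrainer):
  """A toy trainer that fits y = 0 with a single Dense layer."""

  def __init__(self):
    super().__init__()
    self.model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    self.optimizer = tf.keras.optimizers.SGD(0.1)
    self.train_loss = tf.keras.metrics.Mean("train_loss", dtype=tf.float32)

  def train(self, num_steps):
    self.train_loss.reset_states()
    for _ in range(int(num_steps)):  # A plain Python loop is allowed here.
      x = tf.random.normal([8, 4])
      with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(self.model(x)))
      grads = tape.gradient(loss, self.model.trainable_variables)
      self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
      self.train_loss.update_state(loss)
    # Returned dictionaries are logged and written as TensorBoard summaries.
    return {"train_loss": self.train_loss.result()}


class TinyEvaluator(runner.AbstractEvaluator):
  """A toy evaluator; it ignores the -1 "complete evaluation" sentinel."""

  def __init__(self, model):
    super().__init__()
    self.model = model

  def evaluate(self, num_steps):
    losses = [
        tf.reduce_mean(tf.square(self.model(tf.random.normal([8, 4])))).numpy()
        for _ in range(max(int(num_steps), 0))
    ]
    return {"eval_loss": np.mean(losses) if losses else 0.0}
```

Wired into a `Controller`, passing `global_step=trainer.optimizer.iterations`
would satisfy the step-count check in `_train_n_steps` above, since
`apply_gradients` increments that counter once per training step.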
@@ -12,11 +12,30 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""AbstractTrainer/Evaluator subclasses with added functionality.

The classes in this module provide some additional structure to the bare
`AbstractTrainer`/`AbstractEvaluator` APIs.

Both `StandardTrainer` and `StandardEvaluator` split the train/eval loops into
"begin", "step", and "end" methods, and provide an implementation of the loop
itself that makes calls to the relevant step method.

`StandardTrainer` supports running the loop using the TF while loop construct
for added performance (particularly on TPUs). It additionally provides some
functionality to make writing summaries from inside a model more performant
when running on TPUs.

These classes are intended to work well in common settings; however, there may
be use cases these classes don't support (for instance, `StandardEvaluator` in
particular doesn't support running full evaluations over multiple different
eval datasets). Users are encouraged to simply fall back to custom
`AbstractTrainer` and `AbstractEvaluator` subclasses in these cases.
"""

import abc
from typing import Any, Optional

import dataclasses
@@ -65,14 +84,26 @@ def _create_train_loop_fn(train_step_fn, options: StandardTrainerOptions):

class StandardTrainer(runner.AbstractTrainer, metaclass=abc.ABCMeta):
  """Implements standard functionality on top of the AbstractTrainer API.

  This class structures the training "inner loop" roughly as follows:

    train_loop_begin()
    for _ in range(num_steps):
      train_step(train_iterator)
    return train_loop_end()

  Calls to `train_loop_begin` and `train_loop_end` are always done in eager
  mode, while the loop/`train_step` may be implemented using `tf.while` and/or
  `tf.function`, as determined by the `options` passed to `__init__`.
  """

  def __init__(self, train_dataset, options: StandardTrainerOptions = None):
    """Initializes the `StandardTrainer` instance.

    Args:
      train_dataset: A `tf.nest`-compatible structure of `tf.data.Dataset` or
        `DistributedDataset`.
      options: An `orbit.StandardTrainerOptions` instance.
    """
    options = options or StandardTrainerOptions()
@@ -88,11 +119,16 @@ class StandardTrainer(runner.AbstractTrainer, metaclass=abc.ABCMeta):
    self._train_iter = None
    self._train_loop_fn = None

  def train(self, num_steps: tf.Tensor) -> Optional[runner.Output]:
    """Implements `num_steps` steps of training.

    Args:
      num_steps: The number of training steps to run. This corresponds directly
        to the number of calls made to `train_step`.

    Returns:
      The output of `train_loop_end`.
    """
    self.train_loop_begin()

    if self._train_loop_fn is None:
@@ -108,9 +144,10 @@ class StandardTrainer(runner.AbstractTrainer, metaclass=abc.ABCMeta):

  def train_loop_begin(self):
    """Called once at the beginning of the training loop.

    This method is always called in eager mode, and is a good place to reset
    metrics that accumulate values over multiple steps of training.

    Note that this method is called before dataset iterator creation.
    """
    pass
@@ -118,28 +155,30 @@ class StandardTrainer(runner.AbstractTrainer, metaclass=abc.ABCMeta):

  def train_step(self, iterator):
    """Implements one step of training.

    What a "step" consists of is up to the implementer. When using distribution
    strategies, the call to this method takes place in the "cross-replica
    context" for generality, to allow e.g. multiple iterator dequeues and calls
    to `strategy.run`.

    Note that if `use_tf_function=True`, all the code inside `train_step` should
    be compatible with `tf.function` tracing (and in particular, any state
    modifications involving `self` should be avoided). In some cases, non-
    `tf.function` compatible code can be moved to `train_loop_begin` or
    `train_loop_end`, which always execute eagerly.

    Args:
      iterator: A `tf.nest`-compatible structure of `tf.data.Iterator` or
        `DistributedIterator`. The structure of this input matches the structure
        of `train_dataset` as passed to `__init__`.
    """
    pass

  def train_loop_end(self) -> Optional[runner.Output]:
    """Called once at the end of the training loop.

    This method is always called in eager mode, and is a good place to get
    metric results. The value returned from this function will be returned
    as-is from the `train` method implementation provided by `StandardTrainer`.

    Returns:
      The function may return a dictionary of `Tensors`, which will be
@@ -150,18 +189,18 @@ class StandardTrainer(runner.AbstractTrainer, metaclass=abc.ABCMeta):

  @property
  def train_dataset(self):
    """The current training dataset."""
    return self._train_dataset

  @train_dataset.setter
  def train_dataset(self, train_dataset):
    """Sets a new training dataset, replacing the current one.

    Any unprocessed examples in the current dataset are discarded.

    Args:
      train_dataset: A `tf.nest`-compatible structure of `tf.data.Dataset` or
        `DistributedDataset`.
    """
    self._train_dataset = train_dataset
    self._train_iter = None
@@ -187,25 +226,49 @@ def _create_eval_loop_fn(eval_step_fn, options: StandardEvaluatorOptions):

class StandardEvaluator(runner.AbstractEvaluator, metaclass=abc.ABCMeta):
  """Implements the standard functionality of AbstractEvaluator APIs.

  This class structures evaluation roughly as follows:

    state = eval_begin()
    for _ in range(num_steps):
      step_outputs = eval_step(eval_iterator)
      state = eval_reduce(state, step_outputs)
    return eval_end(state)

  Calls to `eval_begin`, `eval_reduce`, and `eval_end` are always done in eager
  mode, while `eval_step` may be compiled with `tf.function` as determined by
  the `options` passed to `__init__`.

  This class does not support completely evaluating multiple different datasets
  (i.e., where every example of each dataset should be processed, as opposed to
  running for a fixed number of evaluation steps). A custom `AbstractEvaluator`
  is recommended in this case.
  """

  def __init__(self, eval_dataset, options: StandardEvaluatorOptions = None):
    """Initializes the `StandardEvaluator` instance.

    Args:
      eval_dataset: A `tf.nest`-compatible structure of `tf.data.Dataset` or
        `DistributedDataset`.
      options: An `orbit.StandardEvaluatorOptions` instance.
    """
    self._eval_options = options or StandardEvaluatorOptions()
    self._eval_dataset = eval_dataset
    self._eval_loop_fn = None

  def evaluate(self, num_steps: tf.Tensor) -> Optional[runner.Output]:
    """Implements `num_steps` steps of evaluation.

    Args:
      num_steps: The number of evaluation steps to run. When this is -1,
        evaluation proceeds until a call to `eval_step` raises a `StopIteration`
        or `tf.errors.OutOfRangeError`.

    Returns:
      The output of `self.eval_end()`.
    """
    outputs = self.eval_begin()  # pylint: disable=assignment-from-no-return

    if self._eval_loop_fn is None:
@@ -224,12 +287,13 @@ class StandardEvaluator(runner.AbstractEvaluator, metaclass=abc.ABCMeta):

  def eval_begin(self) -> Any:
    """Called once at the beginning of the evaluation.

    This method is always called in eager mode, and is a good place to reset
    metrics that accumulate values over the course of evaluation.

    Note that this method is called before dataset iterator creation.

    Returns:
      A value to pass as the `state` argument to `eval_reduce`.
    """
    pass
@@ -237,20 +301,20 @@ class StandardEvaluator(runner.AbstractEvaluator, metaclass=abc.ABCMeta):
def eval_step(self, iterator) -> Any: def eval_step(self, iterator) -> Any:
"""Implements one step of evaluation. """Implements one step of evaluation.
What a "step" consists of is up to the implementer. If using distribution What a "step" consists of is up to the implementer. When using distribution
strategies, the call to this method should take place in the "cross-replica strategies, the call to this method takes place in the "cross-replica
context" for generality, to allow e.g. multiple iterator dequeues and calls context" for generality, to allow e.g. multiple iterator dequeues and calls
to `strategy.run`. to `strategy.run`.
Note that if `use_tf_function=True`, all the code inside `eval_step` should Note that if `use_tf_function=True`, all the code inside `eval_step` should
be tf.function compatible, as they will be traced with tf.function. This be compatible with `tf.function` tracing (and in particular, any state
means you cannot put arbitrary python code in this function. If users have modifications involving `self` should be avoided). In some cases, non-
any numpy operations, they should be put in `eval_begin`, `eval_end` or `tf.function` compatible code can be moved to `eval_begin`,
`eval_reduce` functions. `eval_reduce`, or `eval_end`, which always execute eagerly.
Args: Args:
iterator: A tf.nest-compatible structure of tf.data Iterator or iterator: A `tf.nest`-compatible structure of `tf.data.Iterator` or
DistributedIterator. `DistributedIterator`.
Returns: Returns:
An output which is passed as `step_outputs` argument into `eval_reduce` An output which is passed as `step_outputs` argument into `eval_reduce`
...@@ -258,14 +322,18 @@ class StandardEvaluator(runner.AbstractEvaluator, metaclass=abc.ABCMeta): ...@@ -258,14 +322,18 @@ class StandardEvaluator(runner.AbstractEvaluator, metaclass=abc.ABCMeta):
""" """
pass pass
def eval_end(self, *args) -> Optional[Dict[Text, tf.Tensor]]: def eval_end(self, *args) -> Optional[runner.Output]:
"""Called at the end of the evaluation. """Called at the end of the evaluation.
This is a good place to get metric results. The value returned from this Called once at the end of evaluation.
function will be returned as-is from the evaluate() method.
This method is always called in eager mode, and is a good place to get
metric results. The value returned from this function will be returned as-is
from the `evaluate` method implementation provided by `StandardEvaluator`.
Args: Args:
*args: the outputs from `eval_reduce` for the last eval step. *args: The outputs from `eval_reduce` for the last eval step, if they are
non-`None` (if they are `None`, nothing is passed).
Returns: Returns:
The function may return a dictionary of `Tensors`, which will be The function may return a dictionary of `Tensors`, which will be
...@@ -274,35 +342,41 @@ class StandardEvaluator(runner.AbstractEvaluator, metaclass=abc.ABCMeta): ...@@ -274,35 +342,41 @@ class StandardEvaluator(runner.AbstractEvaluator, metaclass=abc.ABCMeta):
""" """
pass pass
def eval_reduce(self, state=None, step_outputs=None) -> Any: def eval_reduce(self,
"""A function to do the reduction on the evaluation outputs per step. state: Any = None,
step_outputs: Optional[runner.Output] = None) -> Any:
"""A function to perform per-step reduction on the evaluation outputs.
This is useful for passing states throughout evaluation. E.g. it can be used This is useful for passing state throughout evaluation, especially in cases
to maintain the output losses from all the evaluation steps, and compute the where maintaining or accumulating state is hard to accomplish using
mean loss in `eval_end` function. `tf.metrics.Metric` or other `tf.Variable`-based approaches. For instance,
it can be used to easily accumulate all per-example losses from the full
evaluation for subsequent processing in `eval_end()`.
Args: Args:
state: A maintained state throughout the evaluation. state: A state maintained throughout the evaluation.
step_outputs: Outputs from the current evaluation step. step_outputs: Outputs from the current evaluation step.
Returns: Returns:
An output which is passed as `state` argument into `eval_reduce` function An output which is passed as the `state` argument to this function for the
for the next step. After evaluation is finished, the output from last step next step. After evaluation is finished, the output from last step will be
will be passed into `eval_end` function. passed to `eval_end`.
""" """
pass pass
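As a concrete illustration of the per-example-loss use case above, a hedged sketch of a hypothetical subclass (again assuming a top-level `orbit.StandardEvaluator` export, the default options, and a single-device strategy; the model and loss are placeholders):

```
import tensorflow as tf
import orbit


class PerExampleLossEvaluator(orbit.StandardEvaluator):
  """Hypothetical evaluator accumulating per-example losses via eval_reduce."""

  def __init__(self, model, eval_dataset):
    super().__init__(eval_dataset=eval_dataset)
    self._model = model

  def eval_begin(self):
    # The returned value becomes the initial `state` passed to `eval_reduce`.
    return []

  def eval_step(self, iterator):
    features, labels = next(iterator)
    predictions = self._model(features, training=False)
    # Per-example losses for this step, passed to `eval_reduce` as
    # `step_outputs`.
    return tf.keras.losses.mean_squared_error(labels, predictions)

  def eval_reduce(self, state=None, step_outputs=None):
    # Executes eagerly after every step, so plain Python accumulation works.
    state.append(step_outputs)
    return state

  def eval_end(self, state):
    # `state` is the output of `eval_reduce` for the final step.
    return {"mean_loss": tf.reduce_mean(tf.concat(state, axis=0))}
```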
@property @property
def eval_dataset(self): def eval_dataset(self):
"""Returns the train_datase instance.""" """The current evaluation dataset."""
return self._eval_dataset return self._eval_dataset
@eval_dataset.setter @eval_dataset.setter
def eval_dataset(self, eval_dataset): def eval_dataset(self, eval_dataset):
"""Set a new eval dataset and replace with the existing one. """Sets a new eval dataset, replacing the current one.
Any unprocessed examples in the current dataset are discarded.
Args: Args:
eval_dataset: A tf.nest-compatible structure of tf.data.Dataset or eval_dataset: A `tf.nest`-compatible structure of `tf.data.Dataset` or
DistributedDataset. `DistributedDataset`.
""" """
self._eval_dataset = eval_dataset self._eval_dataset = eval_dataset
...@@ -39,8 +39,7 @@ class TestTrainer(standard_runner.StandardTrainer): ...@@ -39,8 +39,7 @@ class TestTrainer(standard_runner.StandardTrainer):
def __init__(self, options=None): def __init__(self, options=None):
self.strategy = tf.distribute.get_strategy() self.strategy = tf.distribute.get_strategy()
self.global_step = utils.create_global_step() self.global_step = utils.create_global_step()
distribute = self.strategy.experimental_distribute_datasets_from_function dataset = self.strategy.distribute_datasets_from_function(dataset_fn)
dataset = distribute(dataset_fn)
super().__init__(train_dataset=dataset, options=options) super().__init__(train_dataset=dataset, options=options)
def train_loop_begin(self): def train_loop_begin(self):
...@@ -63,8 +62,7 @@ class TestEvaluator(standard_runner.StandardEvaluator): ...@@ -63,8 +62,7 @@ class TestEvaluator(standard_runner.StandardEvaluator):
def __init__(self, options=None): def __init__(self, options=None):
self.strategy = tf.distribute.get_strategy() self.strategy = tf.distribute.get_strategy()
self.global_step = utils.create_global_step() self.global_step = utils.create_global_step()
distribute = self.strategy.experimental_distribute_datasets_from_function dataset = self.strategy.distribute_datasets_from_function(dataset_fn)
dataset = distribute(dataset_fn)
super().__init__(eval_dataset=dataset, options=options) super().__init__(eval_dataset=dataset, options=options)
def eval_begin(self): def eval_begin(self):
......
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# ============================================================================== # ==============================================================================
"""Defines exported symbols for `orbit.utils` package.""" """Defines exported symbols for the `orbit.utils` package."""
from orbit.utils.common import create_global_step from orbit.utils.common import create_global_step
from orbit.utils.common import get_value from orbit.utils.common import get_value
......
...@@ -16,7 +16,6 @@ ...@@ -16,7 +16,6 @@
import inspect import inspect
import numpy as np
import tensorflow as tf import tensorflow as tf
...@@ -46,16 +45,16 @@ def create_global_step() -> tf.Variable: ...@@ -46,16 +45,16 @@ def create_global_step() -> tf.Variable:
def make_distributed_dataset(strategy, dataset_or_fn, *args, **kwargs): def make_distributed_dataset(strategy, dataset_or_fn, *args, **kwargs):
"""A helper function to create distributed dataset. """A utility function to help create a `tf.distribute.DistributedDataset`.
Args: Args:
strategy: An instance of `tf.distribute.Strategy`. strategy: An instance of `tf.distribute.Strategy`.
dataset_or_fn: An instance of `tf.data.Dataset` or a function which takes an dataset_or_fn: An instance of `tf.data.Dataset`, or a "dataset function"
`tf.distribute.InputContext` as input and returns a `tf.data.Dataset`. If returning a `tf.data.Dataset`. If it is a function, it may optionally have
it is a function, it could optionally have an argument named an argument named `input_context` which will be passed a
`input_context` which is `tf.distribute.InputContext` argument type. `tf.distribute.InputContext` instance.
*args: The list of arguments to be passed to dataset_or_fn. *args: Any positional arguments to pass through to `dataset_or_fn`.
**kwargs: Any keyword arguments to be passed. **kwargs: Any keyword arguments to pass through to `dataset_or_fn`.
Returns: Returns:
A distributed Dataset. A distributed Dataset.
...@@ -64,38 +63,37 @@ def make_distributed_dataset(strategy, dataset_or_fn, *args, **kwargs): ...@@ -64,38 +63,37 @@ def make_distributed_dataset(strategy, dataset_or_fn, *args, **kwargs):
strategy = tf.distribute.get_strategy() strategy = tf.distribute.get_strategy()
if isinstance(dataset_or_fn, tf.data.Dataset): if isinstance(dataset_or_fn, tf.data.Dataset):
return strategy.experimental_distribute_dataset(dataset_or_fn) return strategy.experimental_distribute_dataset(dataset_or_fn)
if not callable(dataset_or_fn): if not callable(dataset_or_fn):
raise ValueError("`dataset_or_fn` should be either callable or an instance " raise ValueError("`dataset_or_fn` should be either callable or an instance "
"of `tf.data.Dataset`") "of `tf.data.Dataset`.")
def dataset_fn(ctx): def dataset_fn(input_context):
"""Wrapped dataset function for creating distributed dataset..""" """Wraps `dataset_or_fn` for strategy.distribute_datasets_from_function."""
# If `dataset_or_fn` is a function and has `input_context` as argument # If `dataset_or_fn` is a function and has an argument named
# names, pass `ctx` as the value of `input_context` when calling # `input_context`, pass through the given `input_context`. Otherwise
# `dataset_or_fn`. Otherwise `ctx` will not be used when calling # `input_context` will be ignored.
# `dataset_or_fn`.
argspec = inspect.getfullargspec(dataset_or_fn) argspec = inspect.getfullargspec(dataset_or_fn)
args_names = argspec.args arg_names = argspec.args
if "input_context" in args_names: if "input_context" in arg_names:
kwargs["input_context"] = ctx kwargs["input_context"] = input_context
ds = dataset_or_fn(*args, **kwargs) return dataset_or_fn(*args, **kwargs)
return ds
return strategy.experimental_distribute_datasets_from_function(dataset_fn) return strategy.distribute_datasets_from_function(dataset_fn)
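A usage sketch, assuming this helper is importable as `orbit.utils.make_distributed_dataset`; the dataset function below is illustrative only:

```
import tensorflow as tf
from orbit import utils


def dataset_fn(input_context: tf.distribute.InputContext) -> tf.data.Dataset:
  # Because this function declares an `input_context` argument, the wrapper
  # above passes it a `tf.distribute.InputContext` instance.
  per_replica_batch_size = input_context.get_per_replica_batch_size(64)
  ds = tf.data.Dataset.range(1024).map(lambda x: tf.cast(x, tf.float32))
  ds = ds.shard(input_context.num_input_pipelines,
                input_context.input_pipeline_id)
  return ds.batch(per_replica_batch_size)


strategy = tf.distribute.MirroredStrategy()
distributed_ds = utils.make_distributed_dataset(strategy, dataset_fn)
```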
def get_value(x) -> np.number: def get_value(x):
"""Returns the value of a variable/tensor. """Returns input values, converting any TensorFlow values to NumPy values.
Args: Args:
x: input variable. x: The input. May be a `tf.Tensor` or `tf.Variable`.
Returns: Returns:
A Numpy array or number. If the input is a TensorFlow `Tensor`, returns the `Tensor`'s equivalent
NumPy value. Otherwise, just returns the input.
""" """
if not tf.is_tensor(x): if not tf.is_tensor(x):
return x return x
......
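A trivial usage example (the import path follows the `orbit.utils.common` module referenced earlier in this change):

```
import tensorflow as tf
from orbit.utils.common import get_value

print(get_value(tf.constant(3.5)))  # A NumPy value: 3.5
print(get_value(3.5))               # Non-tensor input passes through: 3.5
```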
...@@ -18,14 +18,14 @@ import tensorflow as tf ...@@ -18,14 +18,14 @@ import tensorflow as tf
class EpochHelper: class EpochHelper:
"""A Helper class to handle epochs in Customized Training Loop.""" """A helper class handle bookkeeping of epochs in custom training loops."""
def __init__(self, epoch_steps: int, global_step: tf.Variable): def __init__(self, epoch_steps: int, global_step: tf.Variable):
"""Constructs the EpochHelper. """Initializes the `EpochHelper` instance.
Args: Args:
epoch_steps: An integer indicates how many steps in an epoch. epoch_steps: An integer indicating how many steps are in an epoch.
global_step: A `tf.Variable` instance indicates the current global step. global_step: A `tf.Variable` providing the current global step.
""" """
self._epoch_steps = epoch_steps self._epoch_steps = epoch_steps
self._global_step = global_step self._global_step = global_step
...@@ -46,7 +46,7 @@ class EpochHelper: ...@@ -46,7 +46,7 @@ class EpochHelper:
def epoch_end(self): def epoch_end(self):
"""Returns whether the current epoch should end.""" """Returns whether the current epoch should end."""
if not self._in_epoch: if not self._in_epoch:
raise ValueError("`epoch_end` can only be called inside an epoch") raise ValueError("`epoch_end` can only be called inside an epoch.")
current_step = self._global_step.numpy() current_step = self._global_step.numpy()
epoch = current_step // self._epoch_steps epoch = current_step // self._epoch_steps
......
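A rough usage sketch; it assumes `EpochHelper` is exported from `orbit.utils` and that the class also provides an `epoch_begin()` method marking the start of an epoch, which is not shown in this excerpt:

```
import tensorflow as tf
from orbit import utils

global_step = utils.create_global_step()
epoch_helper = utils.EpochHelper(epoch_steps=100, global_step=global_step)

# Inside an outer training loop, roughly:
if epoch_helper.epoch_begin():   # Assumed API: marks the start of an epoch.
  pass                           # e.g. reset per-epoch metrics.
global_step.assign_add(1)        # The inner training loop advances the step.
if epoch_helper.epoch_end():     # True once `epoch_steps` steps have elapsed.
  pass                           # e.g. report per-epoch metrics.
```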
...@@ -20,36 +20,57 @@ import tensorflow as tf ...@@ -20,36 +20,57 @@ import tensorflow as tf
def create_loop_fn(step_fn): def create_loop_fn(step_fn):
"""Creates a multiple steps function driven by the python while loop. """Creates a loop function driven by a Python `while` loop.
Args: Args:
step_fn: A function which takes `iterator` as input. step_fn: A function taking a nested structure of `tf.data.Iterator` or
`DistributedIterator`. There are no constraints on the return value of the
function (except that it must be compatible with any `reduce_fn` provided
to the returned `loop_fn`).
Returns: Returns:
A callable defined as the `loop_fn` defination below. A loop function taking required `iterator` and `num_steps` parameters, as
well as optional `state` and `reduce_fn` parameters for accumulating state
over multiple iterations of the loop. See the `loop_fn` definition below for
additional details.
""" """
def loop_fn(iterator, num_steps, state=None, reduce_fn=None): def loop_fn(iterator, num_steps, state=None, reduce_fn=None):
"""A loop function with multiple steps. """Makes `num_steps` calls to `step_fn(iterator)`.
Additionally, state may be accumulated across iterations of the loop.
Conceptually, state accumulation is handled roughly as follows:
for _ in range(num_steps):
step_outputs = step_fn(iterator)
state = reduce_fn(state, step_outputs)
return state
However, the implementation is slightly more complicated in order to support
looping until the iterator is exhausted (when `num_steps == -1`) and to
properly catch exceptions when running under async remote eager (as is the
case in TPU training setups involving separate coordinator/worker machines).
Args: Args:
iterator: A nested structure of tf.data `Iterator` or iterator: A nested structure of `tf.data.Iterator` or
`DistributedIterator`. `DistributedIterator`.
num_steps: The number of steps in the loop. If `num_steps==-1`, will num_steps: The number of steps in the loop. If `num_steps == -1`, will
iterate until exhausting the iterator. iterate until exhausting the iterator.
state: An optional initial state before running the loop. state: An optional initial state before running the loop.
reduce_fn: a callable defined as `def reduce_fn(state, value)`, where reduce_fn: A callable taking two inputs, `state` and `value`, where
`value` is the outputs from `step_fn`. `state` is the previous output from `reduce_fn`, and `value` is the
output from `step_fn`.
Returns: Returns:
The updated state. The final state returned by `reduce_fn`, or `None` if `state` and
`reduce_fn` are not provided.
""" """
try: try:
step = 0 step = 0
# To make sure the OutOfRangeError exception can be handled well with # To make sure the OutOfRangeError exception can be handled well under
# async remote eager, we need to wrap the loop body in a `async_scope`. # async remote eager, we need to wrap the loop body in `async_scope`.
with tf.experimental.async_scope(): with tf.experimental.async_scope():
while (num_steps == -1 or step < num_steps): while num_steps == -1 or step < num_steps:
outputs = step_fn(iterator) outputs = step_fn(iterator)
if reduce_fn is not None: if reduce_fn is not None:
state = reduce_fn(state, outputs) state = reduce_fn(state, outputs)
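A usage sketch of the resulting `loop_fn`, assuming this module is importable as `orbit.utils.loop_fns`; the step function and reduction are placeholders:

```
import tensorflow as tf
from orbit.utils import loop_fns  # Assumed module path.


def step_fn(iterator):
  # Dequeue one batch and return a per-step value.
  return tf.reduce_sum(next(iterator))


loop_fn = loop_fns.create_loop_fn(step_fn)
iterator = iter(tf.data.Dataset.range(10).batch(2))

# Accumulate per-step sums; `num_steps=-1` runs until the iterator is
# exhausted.
total = loop_fn(
    iterator,
    num_steps=-1,
    state=tf.constant(0, dtype=tf.int64),
    reduce_fn=lambda state, outputs: state + outputs)
```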
...@@ -63,26 +84,32 @@ def create_loop_fn(step_fn): ...@@ -63,26 +84,32 @@ def create_loop_fn(step_fn):
def create_tf_while_loop_fn(step_fn): def create_tf_while_loop_fn(step_fn):
"""Create a multiple steps function driven by tf.while_loop on the host. """Creates a loop function compatible with TF's AutoGraph loop conversion.
Args: Args:
step_fn: A function which takes `iterator` as input. step_fn: A function taking a nested structure of `tf.data.Iterator` or
`DistributedIterator`. Currently, any return values are ignored.
Returns: Returns:
A callable defined as the `loop_fn` defination below. A loop function taking required `iterator` and `num_steps` parameters. If
called inside a `tf.function`, the loop will be converted by AutoGraph into
a `tf.while_loop` construct. See the `loop_fn` definition below for
additional details.
""" """
def loop_fn(iterator, num_steps): def loop_fn(iterator, num_steps):
"""A loop function with multiple steps. """Makes `num_steps` calls to `step_fn(iterator)`.
Args: Args:
iterator: A nested structure of tf.data `Iterator` or iterator: A nested structure of `tf.data.Iterator` or
`DistributedIterator`. `DistributedIterator`.
num_steps: The number of steps in the loop. Must be a tf.Tensor. num_steps: The number of steps in the loop. Should be passed as a
`tf.Tensor`. Iterating until iterator exhaustion is not supported.
""" """
if not isinstance(num_steps, tf.Tensor): if not isinstance(num_steps, tf.Tensor):
raise ValueError("`num_steps` should be an `tf.Tensor`. Python object " raise ValueError(
"may cause retracing.") "`num_steps` should be a `tf.Tensor`. Passing a Python value can "
"cause unnecessary retracing when wrapped by `tf.function`.")
for _ in tf.range(num_steps): for _ in tf.range(num_steps):
step_fn(iterator) step_fn(iterator)
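A companion sketch for the `tf.while_loop`-compatible variant, wrapped in `tf.function` so AutoGraph converts the Python loop (same assumed module path; the counter logic is a placeholder):

```
import tensorflow as tf
from orbit.utils import loop_fns  # Assumed module path.

counter = tf.Variable(0, dtype=tf.int64)


def step_fn(iterator):
  # Steps return nothing; side effects go through variables.
  counter.assign_add(tf.reduce_sum(next(iterator)))


loop_fn = tf.function(loop_fns.create_tf_while_loop_fn(step_fn))
iterator = iter(tf.data.Dataset.range(10).batch(2))

# Pass `num_steps` as a `tf.Tensor` to avoid retracing on each new value.
loop_fn(iterator, tf.constant(5))
```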
......
...@@ -20,18 +20,19 @@ import tensorflow as tf ...@@ -20,18 +20,19 @@ import tensorflow as tf
class SummaryManager: class SummaryManager:
"""A class manages writing summaries.""" """A utility class for managing summary writing."""
def __init__(self, summary_dir, summary_fn, global_step=None): def __init__(self, summary_dir, summary_fn, global_step=None):
"""Construct a summary manager object. """Initializes the `SummaryManager` instance.
Args: Args:
summary_dir: the directory to write summaries. summary_dir: The directory in which to write summaries. If `None`, all
summary_fn: A callable defined as `def summary_fn(name, tensor, summary_fn: A callable accepting `name`, `value`, and `step`
step=None)`, which describes the summary operation. summary_fn: A callable defined accepting `name`, `value`, and `step`
global_step: A `tf.Variable` instance for the global step. parameters, making calls to `tf.summary` functions to write summaries.
global_step: A `tf.Variable` containing the global step value.
""" """
self._enabled = (summary_dir is not None) self._enabled = summary_dir is not None
self._summary_dir = summary_dir self._summary_dir = summary_dir
self._summary_fn = summary_fn self._summary_fn = summary_fn
self._summary_writers = {} self._summary_writers = {}
...@@ -42,12 +43,12 @@ class SummaryManager: ...@@ -42,12 +43,12 @@ class SummaryManager:
self._global_step = global_step self._global_step = global_step
def summary_writer(self, relative_path=""): def summary_writer(self, relative_path=""):
"""Returns the underlying summary writer. """Returns the underlying summary writer for a specific subdirectory.
Args: Args:
relative_path: The current path in which to write summaries, relative to relative_path: The current path in which to write summaries, relative to
the summary directory. By default it is empty, which specifies the root the summary directory. By default it is empty, which corresponds to the
directory. root directory.
""" """
if self._summary_writers and relative_path in self._summary_writers: if self._summary_writers and relative_path in self._summary_writers:
return self._summary_writers[relative_path] return self._summary_writers[relative_path]
...@@ -59,43 +60,41 @@ class SummaryManager: ...@@ -59,43 +60,41 @@ class SummaryManager:
return self._summary_writers[relative_path] return self._summary_writers[relative_path]
def flush(self): def flush(self):
"""Flush the underlying summary writers.""" """Flushes the underlying summary writers."""
if self._enabled: if self._enabled:
tf.nest.map_structure(tf.summary.flush, self._summary_writers) tf.nest.map_structure(tf.summary.flush, self._summary_writers)
def write_summaries(self, summary_dict): def write_summaries(self, summary_dict):
"""Write summaries for the given values. """Writes summaries for the given dictionary of values.
This recursively creates subdirectories for any nested dictionaries This recursively creates subdirectories for any nested dictionaries
provided in `summary_dict`, yielding a hierarchy of directories which will provided in `summary_dict`, yielding a hierarchy of directories which will
then be reflected in the TensorBoard UI as different colored curves. then be reflected in the TensorBoard UI as different colored curves.
E.g. users may evaluate on multiple datasets and return `summary_dict` as a For example, users may evaluate on multiple datasets and return
nested dictionary. `summary_dict` as a nested dictionary:
```
{ {
"dataset": { "dataset1": {
"loss": loss, "loss": loss1,
"accuracy": accuracy "accuracy": accuracy1
}, },
"dataset2": { "dataset2": {
"loss": loss2, "loss": loss2,
"accuracy": accuracy2 "accuracy": accuracy2
}, },
} }
```
This will create two subdirectories "dataset" and "dataset2" inside the This will create two subdirectories, "dataset1" and "dataset2", inside the
summary root directory. Each directory will contain event files including summary root directory. Each directory will contain event files including
both "loss" and "accuracy" summaries. both "loss" and "accuracy" summaries.
Args: Args:
summary_dict: A dictionary of values. If any value in `summary_dict` is summary_dict: A dictionary of values. If any value in `summary_dict` is
itself a dictionary, then the function will recursively create itself a dictionary, then the function will create a subdirectory with
subdirectories with names given by the keys in the dictionary. The name given by the corresponding key. This is performed recursively. Leaf
Tensor values are summarized using the summary writer instance specific values are then summarized using the summary writer instance specific to
to the parent relative path. the parent relative path.
""" """
if not self._enabled: if not self._enabled:
return return
......
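A usage sketch mirroring the nested-dictionary example above, assuming `SummaryManager` is exported from `orbit.utils`; the directory and metric values are placeholders:

```
import tensorflow as tf
from orbit import utils

global_step = utils.create_global_step()
summary_manager = utils.SummaryManager(
    summary_dir="/tmp/summaries",
    summary_fn=tf.summary.scalar,
    global_step=global_step)

# Writes "loss" and "accuracy" events under /tmp/summaries/dataset1 and
# /tmp/summaries/dataset2, respectively.
summary_manager.write_summaries({
    "dataset1": {"loss": 0.3, "accuracy": 0.9},
    "dataset2": {"loss": 0.5, "accuracy": 0.8},
})
summary_manager.flush()
```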