Commit c4f34e58 authored by Tian Lin, committed by Toby Boyd

Merge Transformer V2 to Github (#6846)

* Merged commit includes the following changes:
249218656  by tianlin<tianlin@google.com>:

    Deal with imports, fix a typo and make unit tests fast.

--
249198645  by tianlin<tianlin@google.com>:

    Trivial: Remove one empty line before "import tensorflow"

--
249195490  by tianlin<tianlin@google.com>:

    Initialize Transformer TF V2 Model with Keras subclassing implementation. (Compatible with TF V1)

--
249195008  by tianlin<tianlin@google.com>:

    Internal change

249173564  by hongkuny<hongkuny@google.com>:

    Internal change

249079258  by hongkuny<hongkuny@google.com>:

    Internal change

247691534  by haoyuzhang<haoyuzhang@google.com>:

    Internal change

247533725  by haoyuzhang<haoyuzhang@google.com>:

    Internal change

247509295  by haoyuzhang<haoyuzhang@google.com>:

    Internal change

247311355  by wangtz<wangtz@google.com>:

    Internal change

247303127  by wangtz<wangtz@google.com>:

  ...
parent 4726c5b9
Copyright 2015 The TensorFlow Authors. All rights reserved.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2015, The TensorFlow Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
@@ -26,7 +26,10 @@ from __future__ import print_function
import os
import sys
import tensorflow as tf # pylint: disable=g-bad-import-order
# pylint: disable=g-bad-import-order
from absl import app as absl_app
import tensorflow as tf
# pylint: enable=g-bad-import-order
# For open source environment, add grandparent directory for import
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(sys.path[0]))))
@@ -195,4 +198,4 @@ def main(argv):
if __name__ == "__main__":
tf.app.run()
absl_app.run()
@@ -190,13 +190,15 @@ class SequenceBeamSearch(object):
best_alive_scores = alive_log_probs[:, 0] / max_length_norm
# Compute worst score in finished sequences for each batch element
finished_scores *= tf.to_float(finished_flags) # set filler scores to zero
finished_scores *= tf.cast(finished_flags,
tf.float32) # set filler scores to zero
lowest_finished_scores = tf.reduce_min(finished_scores, axis=1)
# If there are no finished sequences in a batch element, then set the lowest
# finished score to -INF for that element.
finished_batches = tf.reduce_any(finished_flags, 1)
lowest_finished_scores += (1. - tf.to_float(finished_batches)) * -INF
lowest_finished_scores += (1.0 -
tf.cast(finished_batches, tf.float32)) * -INF
worst_finished_score_better_than_best_alive_score = tf.reduce_all(
tf.greater(lowest_finished_scores, best_alive_scores)
@@ -319,7 +321,7 @@ class SequenceBeamSearch(object):
"""
# To prevent finished sequences from being considered, set log probs to -INF
new_finished_flags = tf.equal(new_seq[:, :, -1], self.eos_id)
new_log_probs += tf.to_float(new_finished_flags) * -INF
new_log_probs += tf.cast(new_finished_flags, tf.float32) * -INF
top_alive_seq, top_alive_log_probs, top_alive_cache = _gather_topk_beams(
[new_seq, new_log_probs, new_cache], new_log_probs, self.batch_size,
@@ -364,7 +366,7 @@ class SequenceBeamSearch(object):
# Set the scores of the still-alive seq in new_seq to large negative values.
new_finished_flags = tf.equal(new_seq[:, :, -1], self.eos_id)
new_scores += (1. - tf.to_float(new_finished_flags)) * -INF
new_scores += (1. - tf.cast(new_finished_flags, tf.float32)) * -INF
# Combine sequences, scores, and flags.
finished_seq = tf.concat([finished_seq, new_seq], axis=1)
@@ -417,12 +419,12 @@ def sequence_beam_search(
def _log_prob_from_logits(logits):
return logits - tf.reduce_logsumexp(logits, axis=2, keep_dims=True)
return logits - tf.reduce_logsumexp(logits, axis=2, keepdims=True)
def _length_normalization(alpha, length):
"""Return length normalization factor."""
return tf.pow(((5. + tf.to_float(length)) / 6.), alpha)
return tf.pow(((5. + tf.cast(length, tf.float32)) / 6.), alpha)
def _expand_to_beam_size(tensor, beam_size):
@@ -42,13 +42,13 @@ def get_position_encoding(
Returns:
Tensor with shape [length, hidden_size]
"""
position = tf.to_float(tf.range(length))
position = tf.cast(tf.range(length), tf.float32)
num_timescales = hidden_size // 2
log_timescale_increment = (
math.log(float(max_timescale) / float(min_timescale)) /
(tf.to_float(num_timescales) - 1))
(tf.cast(num_timescales, tf.float32) - 1))
inv_timescales = min_timescale * tf.exp(
tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
tf.cast(tf.range(num_timescales), tf.float32) * -log_timescale_increment)
scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
return signal
@@ -68,7 +68,7 @@ def get_decoder_self_attention_bias(length):
float tensor of shape [1, 1, length, length]
"""
with tf.name_scope("decoder_self_attention_bias"):
valid_locs = tf.matrix_band_part(tf.ones([length, length]), -1, 0)
valid_locs = tf.linalg.band_part(tf.ones([length, length]), -1, 0)
valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
decoder_bias = _NEG_INF * (1.0 - valid_locs)
return decoder_bias
@@ -86,7 +86,7 @@ def get_padding(x, padding_value=0):
0 -> non-padding, 1 -> padding
"""
with tf.name_scope("padding"):
return tf.to_float(tf.equal(x, padding_value))
return tf.cast(tf.equal(x, padding_value), tf.float32)
def get_padding_bias(x):
@@ -63,7 +63,8 @@ class Subtokenizer(object):
def __init__(self, vocab_file, reserved_tokens=None):
"""Initializes class, creating a vocab file if data_files is provided."""
tf.logging.info("Initializing Subtokenizer from file %s." % vocab_file)
tf.compat.v1.logging.info("Initializing Subtokenizer from file %s." %
vocab_file)
if reserved_tokens is None:
reserved_tokens = RESERVED_TOKENS
@@ -106,17 +107,17 @@ class Subtokenizer(object):
if reserved_tokens is None:
reserved_tokens = RESERVED_TOKENS
if tf.gfile.Exists(vocab_file):
tf.logging.info("Vocab file already exists (%s)" % vocab_file)
if tf.io.gfile.exists(vocab_file):
tf.compat.v1.logging.info("Vocab file already exists (%s)" % vocab_file)
else:
tf.logging.info("Begin steps to create subtoken vocabulary...")
tf.compat.v1.logging.info("Begin steps to create subtoken vocabulary...")
token_counts = _count_tokens(files, file_byte_limit)
alphabet = _generate_alphabet_dict(token_counts)
subtoken_list = _generate_subtokens_with_target_vocab_size(
token_counts, alphabet, target_vocab_size, threshold, min_count,
reserved_tokens)
tf.logging.info("Generated vocabulary with %d subtokens." %
len(subtoken_list))
tf.compat.v1.logging.info("Generated vocabulary with %d subtokens." %
len(subtoken_list))
_save_vocab_file(vocab_file, subtoken_list)
return Subtokenizer(vocab_file)
@@ -394,22 +395,23 @@ def _generate_subtokens_with_target_vocab_size(
reserved_tokens = RESERVED_TOKENS
if min_count is not None:
tf.logging.info("Using min_count=%d to generate vocab with target size %d" %
(min_count, target_size))
tf.compat.v1.logging.info(
"Using min_count=%d to generate vocab with target size %d" %
(min_count, target_size))
return _generate_subtokens(
token_counts, alphabet, min_count, reserved_tokens=reserved_tokens)
def bisect(min_val, max_val):
"""Recursive function to binary search for subtoken vocabulary."""
cur_count = (min_val + max_val) // 2
tf.logging.info("Binary search: trying min_count=%d (%d %d)" %
(cur_count, min_val, max_val))
tf.compat.v1.logging.info("Binary search: trying min_count=%d (%d %d)" %
(cur_count, min_val, max_val))
subtoken_list = _generate_subtokens(
token_counts, alphabet, cur_count, reserved_tokens=reserved_tokens)
val = len(subtoken_list)
tf.logging.info("Binary search: min_count=%d resulted in %d tokens" %
(cur_count, val))
tf.compat.v1.logging.info(
"Binary search: min_count=%d resulted in %d tokens" % (cur_count, val))
within_threshold = abs(val - target_size) < threshold
if within_threshold or min_val >= max_val or cur_count < 2:
@@ -425,8 +427,8 @@ def _generate_subtokens_with_target_vocab_size(
return other_subtoken_list
return subtoken_list
tf.logging.info("Finding best min_count to get target size of %d" %
target_size)
tf.compat.v1.logging.info("Finding best min_count to get target size of %d" %
target_size)
return bisect(_MIN_MIN_COUNT, _MAX_MIN_COUNT)
@@ -594,7 +596,7 @@ def _generate_subtokens(
# subtoken_dict, count how often the resulting subtokens appear, and update
# the dictionary with subtokens w/ high enough counts.
for i in xrange(num_iterations):
tf.logging.info("\tGenerating subtokens: iteration %d" % i)
tf.compat.v1.logging.info("\tGenerating subtokens: iteration %d" % i)
# Generate new subtoken->id dictionary using the new subtoken list.
subtoken_dict = _list_to_index_dict(subtoken_list)
@@ -607,5 +609,5 @@ def _generate_subtokens(
subtoken_list, max_subtoken_length = _gen_new_subtoken_list(
subtoken_counts, min_count, alphabet, reserved_tokens)
tf.logging.info("\tVocab size: %d" % len(subtoken_list))
tf.compat.v1.logging.info("\tVocab size: %d" % len(subtoken_list))
return subtoken_list
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of multiheaded attention and self-attention layers."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
class Attention(tf.keras.layers.Layer):
"""Multi-headed attention layer."""
def __init__(self, hidden_size, num_heads, attention_dropout):
"""Initialize Attention.
Args:
hidden_size: int, output dim of hidden layer.
num_heads: int, number of heads to repeat the same attention structure.
attention_dropout: float, dropout rate inside attention for training.
"""
if hidden_size % num_heads:
raise ValueError(
"Hidden size ({}) must be divisible by the number of heads ({})."
.format(hidden_size, num_heads))
super(Attention, self).__init__()
self.hidden_size = hidden_size
self.num_heads = num_heads
self.attention_dropout = attention_dropout
def build(self, input_shape):
"""Builds the layer."""
# Layers for linearly projecting the queries, keys, and values.
self.q_dense_layer = tf.keras.layers.Dense(
self.hidden_size, use_bias=False, name="q")
self.k_dense_layer = tf.keras.layers.Dense(
self.hidden_size, use_bias=False, name="k")
self.v_dense_layer = tf.keras.layers.Dense(
self.hidden_size, use_bias=False, name="v")
self.output_dense_layer = tf.keras.layers.Dense(
self.hidden_size, use_bias=False, name="output_transform")
super(Attention, self).build(input_shape)
def get_config(self):
return {
"hidden_size": self.hidden_size,
"num_heads": self.num_heads,
"attention_dropout": self.attention_dropout,
}
def split_heads(self, x):
"""Split x into different heads, and transpose the resulting value.
The tensor is transposed to ensure the inner dimensions hold the correct
values during the matrix multiplication.
Args:
x: A tensor with shape [batch_size, length, hidden_size]
Returns:
A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
"""
with tf.name_scope("split_heads"):
batch_size = tf.shape(x)[0]
length = tf.shape(x)[1]
# Calculate depth of last dimension after it has been split.
depth = (self.hidden_size // self.num_heads)
# Split the last dimension
x = tf.reshape(x, [batch_size, length, self.num_heads, depth])
# Transpose the result
return tf.transpose(x, [0, 2, 1, 3])
def combine_heads(self, x):
"""Combine tensor that has been split.
Args:
x: A tensor [batch_size, num_heads, length, hidden_size/num_heads]
Returns:
A tensor with shape [batch_size, length, hidden_size]
"""
with tf.name_scope("combine_heads"):
batch_size = tf.shape(x)[0]
length = tf.shape(x)[2]
x = tf.transpose(x, [0, 2, 1, 3]) # --> [batch, length, num_heads, depth]
return tf.reshape(x, [batch_size, length, self.hidden_size])
def call(self, x, y, bias, training, cache=None):
"""Apply attention mechanism to x and y.
Args:
x: a tensor with shape [batch_size, length_x, hidden_size]
y: a tensor with shape [batch_size, length_y, hidden_size]
bias: attention bias that will be added to the result of the dot product.
training: boolean, whether in training mode or not.
cache: (Used during prediction) dictionary with tensors containing results
of previous attentions. The dictionary must have the items:
{"k": tensor with shape [batch_size, i, key_channels],
"v": tensor with shape [batch_size, i, value_channels]}
where i is the current decoded length.
Returns:
Attention layer output with shape [batch_size, length_x, hidden_size]
"""
# Linearly project the query (q), key (k) and value (v) using different
# learned projections. This is in preparation of splitting them into
# multiple heads. Multi-head attention uses multiple queries, keys, and
# values rather than regular attention (which uses a single q, k, v).
q = self.q_dense_layer(x)
k = self.k_dense_layer(y)
v = self.v_dense_layer(y)
if cache is not None:
# Combine cached keys and values with new keys and values.
k = tf.concat([cache["k"], k], axis=1)
v = tf.concat([cache["v"], v], axis=1)
# Update cache
cache["k"] = k
cache["v"] = v
# Split q, k, v into heads.
q = self.split_heads(q)
k = self.split_heads(k)
v = self.split_heads(v)
# Scale q to prevent the dot product between q and k from growing too large.
depth = (self.hidden_size // self.num_heads)
q *= depth ** -0.5
# Calculate dot product attention
logits = tf.matmul(q, k, transpose_b=True)
logits += bias
weights = tf.nn.softmax(logits, name="attention_weights")
if training:
weights = tf.nn.dropout(weights, rate=self.attention_dropout)
attention_output = tf.matmul(weights, v)
# Recombine heads --> [batch_size, length, hidden_size]
attention_output = self.combine_heads(attention_output)
# Run the combined outputs through another linear projection layer.
attention_output = self.output_dense_layer(attention_output)
return attention_output
class SelfAttention(Attention):
"""Multiheaded self-attention layer."""
def call(self, x, bias, training, cache=None):
return super(SelfAttention, self).call(x, x, bias, training, cache)
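# Illustrative usage sketch appended for clarity; it is not part of the
# original file. The layer sizes, dropout rate, and bias below are arbitrary
# assumptions chosen only to show the expected shapes.
if __name__ == "__main__":
  demo_layer = SelfAttention(hidden_size=64, num_heads=4, attention_dropout=0.1)
  demo_x = tf.ones([2, 10, 64])          # [batch_size, length, hidden_size]
  demo_bias = tf.zeros([2, 1, 1, 10])    # broadcastable attention bias (e.g. a padding bias)
  demo_out = demo_layer(demo_x, demo_bias, training=False)
  print(demo_out.shape)                  # -> (2, 10, 64)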
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Input pipeline for the transformer model to read, filter, and batch examples.
Two things to note in the pipeline:
1. Batching scheme
The examples encoded in the TFRecord files contain data in the format:
{"inputs": [variable length array of integers],
"targets": [variable length array of integers]}
Where integers in the arrays refer to tokens in the English and German vocab
file (named `vocab.ende.32768`).
Prior to batching, elements in the dataset are grouped by length (max between
"inputs" and "targets" length). Each group is then batched such that:
group_batch_size * length <= batch_size.
Another way to view batch_size is the maximum number of tokens in each batch.
Once batched, each element in the dataset will have the shape:
{"inputs": [group_batch_size, padded_input_length],
"targets": [group_batch_size, padded_target_length]}
Lengths are padded to the longest "inputs" or "targets" sequence in the batch
(padded_input_length and padded_target_length can be different).
This batching scheme decreases the fraction of padding tokens per training
batch, thus improving the training speed significantly.
2. Shuffling
While training, the dataset is shuffled in two places in the code. The first
is the list of training files. Second, while reading records using
`parallel_interleave`, the `sloppy` argument is used to generate randomness
in the order of the examples.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import os
import tensorflow as tf
# TODO(tianlin) Import internal library. Remove this when different behaviors
# of keras_model.fit(dataset, ...) for different TF versions are fixed.
from tensorflow.python import tf2 as tf2_internal
from official.utils.misc import model_helpers
# Buffer size for reading records from a TFRecord file. Each training file is
# 7.2 MB, so 8 MB allows an entire file to be kept in memory.
_READ_RECORD_BUFFER = 8 * 1000 * 1000
# Example grouping constants. Defines length boundaries for each group.
# These values are the defaults used in Tensor2Tensor.
_MIN_BOUNDARY = 8
_BOUNDARY_SCALE = 1.1
def _load_records(filename):
"""Read file and return a dataset of tf.Examples."""
return tf.data.TFRecordDataset(filename, buffer_size=_READ_RECORD_BUFFER)
def _parse_example(serialized_example):
"""Return inputs and targets Tensors from a serialized tf.Example."""
data_fields = {
"inputs": tf.io.VarLenFeature(tf.int64),
"targets": tf.io.VarLenFeature(tf.int64)
}
parsed = tf.io.parse_single_example(serialized_example, data_fields)
inputs = tf.sparse.to_dense(parsed["inputs"])
targets = tf.sparse.to_dense(parsed["targets"])
return inputs, targets
def _filter_max_length(example, max_length=256):
"""Indicates whether the example's length is lower than the maximum length."""
return tf.logical_and(tf.size(example[0]) <= max_length,
tf.size(example[1]) <= max_length)
def _get_example_length(example):
"""Returns the maximum length between the example inputs and targets."""
length = tf.maximum(tf.shape(example[0])[0], tf.shape(example[1])[0])
return length
def _create_min_max_boundaries(
max_length, min_boundary=_MIN_BOUNDARY, boundary_scale=_BOUNDARY_SCALE):
"""Create min and max boundary lists up to max_length.
For example, when max_length=24, min_boundary=4 and boundary_scale=2, the
returned values will be:
buckets_min = [0, 4, 8, 16]
buckets_max = [4, 8, 16, 25]
Args:
max_length: The maximum length of example in dataset.
min_boundary: Minimum length in boundary.
boundary_scale: Amount to scale consecutive boundaries in the list.
Returns:
min and max boundary lists
"""
# Create bucket boundaries list by scaling the previous boundary or adding 1
# (to ensure increasing boundary sizes).
bucket_boundaries = []
x = min_boundary
while x < max_length:
bucket_boundaries.append(x)
x = max(x + 1, int(x * boundary_scale))
# Create min and max boundary lists from the initial list.
buckets_min = [0] + bucket_boundaries
buckets_max = bucket_boundaries + [max_length + 1]
return buckets_min, buckets_max
def _batch_examples(dataset, batch_size, max_length):
"""Group examples by similar lengths, and return batched dataset.
Each batch of similar-length examples is padded to the same length, and batches
may contain a different number of elements, such that:
group_batch_size * padded_length <= batch_size.
This decreases the number of padding tokens per batch, which improves the
training speed.
Args:
dataset: Dataset of unbatched examples.
batch_size: Max number of tokens per batch of examples.
max_length: Max number of tokens in an example input or target sequence.
Returns:
Dataset of batched examples with similar lengths.
"""
# Get min and max boundary lists for each example. These are used to calculate
# the `bucket_id`, which is the index at which:
# buckets_min[bucket_id] <= len(example) < buckets_max[bucket_id]
# Note that using both min and max lists improves the performance.
buckets_min, buckets_max = _create_min_max_boundaries(max_length)
# Create list of batch sizes for each bucket_id, so that
# bucket_batch_size[bucket_id] * buckets_max[bucket_id] <= batch_size
bucket_batch_sizes = [batch_size // x for x in buckets_max]
# bucket_id will be a tensor, so convert this list to a tensor as well.
bucket_batch_sizes = tf.constant(bucket_batch_sizes, dtype=tf.int64)
def example_to_bucket_id(example_input, example_target):
"""Return int64 bucket id for this example, calculated based on length."""
seq_length = _get_example_length((example_input, example_target))
# TODO: investigate whether removing code branching improves performance.
conditions_c = tf.logical_and(
tf.less_equal(buckets_min, seq_length),
tf.less(seq_length, buckets_max))
bucket_id = tf.reduce_min(tf.where(conditions_c))
return bucket_id
def window_size_fn(bucket_id):
"""Return number of examples to be grouped when given a bucket id."""
return bucket_batch_sizes[bucket_id]
def batching_fn(bucket_id, grouped_dataset):
"""Batch and add padding to a dataset of elements with similar lengths."""
bucket_batch_size = window_size_fn(bucket_id)
# Batch the dataset and add padding so that all input sequences in the
# examples have the same length, and all target sequences have the same
# lengths as well. Resulting lengths of inputs and targets can differ.
return grouped_dataset.padded_batch(bucket_batch_size, ([None], [None]))
return dataset.apply(tf.data.experimental.group_by_window(
key_func=example_to_bucket_id,
reduce_func=batching_fn,
window_size=None,
window_size_func=window_size_fn))
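# Worked example added for clarity (not part of the original module): suppose
# batch_size is 4096 tokens and an example whose longer sequence has length 70
# lands in a bucket with buckets_max[bucket_id] = 128. Then
# bucket_batch_sizes[bucket_id] = 4096 // 128 = 32, so up to 32 examples from
# that bucket are grouped into one batch, each padded to the longest sequence
# in that batch, which keeps the token count per batch at or below 4096.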
def _read_and_batch_from_files(
file_pattern, batch_size, max_length, num_parallel_calls, shuffle, repeat,
static_batch=False):
"""Create dataset where each item is a dict of "inputs" and "targets".
Args:
file_pattern: String used to match the input TFRecord files.
batch_size: Maximum number of tokens per batch of examples
max_length: Maximum number of tokens per example
num_parallel_calls: Number of cpu cores for parallel input processing.
shuffle: If true, randomizes order of elements.
repeat: Number of times to repeat the dataset. If None, the dataset is
repeated forever.
static_batch: Whether the batches in the dataset should have static shapes.
If True, the input is batched so that every batch has the
shape [batch_size // max_length, max_length]. If False, the input is
grouped by length, and batched so that batches may have different
shapes [N, M], where:
N * M <= batch_size
M <= max_length
In general, this setting should be False. Dynamic shapes allow the inputs
to be grouped so that the number of padding tokens is minimized, which helps
model training. In cases where the input shape must be static
(e.g. running on TPU), this setting should be set to True.
Returns:
tf.data.Dataset object containing examples loaded from the files.
"""
dataset = tf.data.Dataset.list_files(file_pattern, shuffle=shuffle)
# Read files and interleave results. When training, the order of the examples
# will be non-deterministic.
dataset = dataset.interleave(
_load_records,
cycle_length=num_parallel_calls,
num_parallel_calls=num_parallel_calls)
# Parse each tf.Example into a dictionary
# TODO: Look into prefetch_input_elements for performance optimization.
dataset = dataset.map(_parse_example,
num_parallel_calls=num_parallel_calls)
# Remove examples where the input or target length exceeds the maximum length.
dataset = dataset.filter(lambda x, y: _filter_max_length((x, y), max_length))
if static_batch:
dataset = dataset.padded_batch(
batch_size // max_length, ([max_length], [max_length]),
drop_remainder=True)
else:
# Group and batch such that each batch has examples of similar length.
dataset = _batch_examples(dataset, batch_size, max_length)
dataset = dataset.repeat(repeat)
# Prefetch the next element to improve speed of input pipeline.
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
return dataset
def _generate_synthetic_data(params):
"""Create synthetic data based on the parameter batch size."""
batch = length = int(math.sqrt(params["batch_size"]))
return model_helpers.generate_synthetic_data(
input_shape=tf.TensorShape([batch, length]),
input_value=1,
input_dtype=tf.int32,
label_shape=tf.TensorShape([batch, length]),
label_value=1,
label_dtype=tf.int32,
)
def train_input_fn(params):
"""Load and return dataset of batched examples for use during training."""
file_pattern = os.path.join(params["data_dir"] or "", "*train*")
if params["use_synthetic_data"]:
return _generate_synthetic_data(params)
return _read_and_batch_from_files(
file_pattern, params["batch_size"], params["max_length"],
params["num_parallel_calls"], shuffle=True,
repeat=params["repeat_dataset"], static_batch=params["static_batch"])
def eval_input_fn(params):
"""Load and return dataset of batched examples for use during evaluation."""
file_pattern = os.path.join(params["data_dir"] or "", "*dev*")
if params["use_synthetic_data"]:
return _generate_synthetic_data(params)
return _read_and_batch_from_files(
file_pattern, params["batch_size"], params["max_length"],
params["num_parallel_calls"], shuffle=False, repeat=1,
static_batch=params["static_batch"])
def map_data_for_transformer_fn(x, y):
"""Maps data for training, and handles weried behaviors for different vers."""
# Will transform input x and targets y into tuple(x, y) as new model inputs.
if tf2_internal.enabled():
# For TF v2, the 2nd parameter is omitted to make Keras training work.
return ((x, y),)
else:
# For TF v1, Keras requires a dummy placeholder as the 2nd parameter.
return ((x, y), tf.constant(0.0))
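# Illustrative usage sketch appended for clarity; it is not part of the
# original module. The parameter values are arbitrary assumptions, but the
# dict keys match the ones read by train_input_fn above. Synthetic data is
# requested so the sketch does not need real TFRecord files.
if __name__ == "__main__":
  demo_params = {
      "data_dir": "/tmp/translate_ende",
      "use_synthetic_data": True,
      "batch_size": 4096,          # maximum number of tokens per batch
      "max_length": 256,
      "num_parallel_calls": 4,
      "repeat_dataset": 1,
      "static_batch": False,
  }
  demo_dataset = train_input_fn(demo_params)
  print(demo_dataset)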
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of embedding layer with shared weights."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
class EmbeddingSharedWeights(tf.keras.layers.Layer):
"""Calculates input embeddings and pre-softmax linear with shared weights."""
def __init__(self, vocab_size, hidden_size):
"""Specify characteristic parameters of embedding layer.
Args:
vocab_size: Number of tokens in the embedding. (Typically ~32,000)
hidden_size: Dimensionality of the embedding. (Typically 512 or 1024)
"""
super(EmbeddingSharedWeights, self).__init__()
self.vocab_size = vocab_size
self.hidden_size = hidden_size
def build(self, input_shape):
with tf.name_scope("embedding_and_softmax"):
# Create and initialize weights. The random normal initializer was chosen
# arbitrarily, and works well.
self.shared_weights = self.add_weight(
"weights",
shape=[self.vocab_size, self.hidden_size],
dtype="float32",
initializer=tf.random_normal_initializer(
mean=0., stddev=self.hidden_size**-0.5))
super(EmbeddingSharedWeights, self).build(input_shape)
def get_config(self):
return {
"vocab_size": self.vocab_size,
"hidden_size": self.hidden_size,
}
def call(self, inputs, mode="embedding"):
"""Get token embeddings of inputs.
Args:
inputs: An int64 tensor with shape [batch_size, length]
mode: string, a valid value is one of "embedding" and "linear".
Returns:
outputs: (1) If mode == "embedding", output embedding tensor, float32 with
shape [batch_size, length, embedding_size]; (2) if mode == "linear", output
linear tensor, float32 with shape [batch_size, length, vocab_size].
Raises:
ValueError: if mode is not valid.
"""
if mode == "embedding":
return self._embedding(inputs)
elif mode == "linear":
return self._linear(inputs)
else:
raise ValueError("mode {} is not valid.".format(mode))
def _embedding(self, inputs):
"""Applies embedding based on inputs tensor."""
with tf.name_scope("embedding"):
# Create binary mask of size [batch_size, length]
mask = tf.cast(tf.not_equal(inputs, 0), tf.float32)
embeddings = tf.gather(self.shared_weights, inputs)
embeddings *= tf.expand_dims(mask, -1)
# Scale embedding by the sqrt of the hidden size
embeddings *= self.hidden_size ** 0.5
return embeddings
def _linear(self, inputs):
"""Computes logits by running inputs through a linear layer.
Args:
inputs: A float32 tensor with shape [batch_size, length, hidden_size]
Returns:
float32 tensor with shape [batch_size, length, vocab_size].
"""
with tf.name_scope("presoftmax_linear"):
batch_size = tf.shape(inputs)[0]
length = tf.shape(inputs)[1]
x = tf.reshape(inputs, [-1, self.hidden_size])
logits = tf.matmul(x, self.shared_weights, transpose_b=True)
return tf.reshape(logits, [batch_size, length, self.vocab_size])
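# Illustrative usage sketch appended for clarity; it is not part of the
# original module, and the vocabulary and hidden sizes are arbitrary
# assumptions. Token id 0 is treated as padding and embeds to zeros.
if __name__ == "__main__":
  demo_layer = EmbeddingSharedWeights(vocab_size=100, hidden_size=16)
  demo_ids = tf.constant([[5, 7, 9, 0]])                   # [batch_size=1, length=4]
  demo_embedded = demo_layer(demo_ids)                     # -> [1, 4, 16]
  demo_logits = demo_layer(demo_embedded, mode="linear")   # -> [1, 4, 100], same weights
  print(demo_embedded.shape, demo_logits.shape)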
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of fully connected network."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
class FeedForwardNetwork(tf.keras.layers.Layer):
"""Fully connected feedforward network."""
def __init__(self, hidden_size, filter_size, relu_dropout):
"""Initialize FeedForwardNetwork.
Args:
hidden_size: int, output dim of hidden layer.
filter_size: int, filter size for the inner (first) dense layer.
relu_dropout: float, dropout rate for training.
"""
super(FeedForwardNetwork, self).__init__()
self.hidden_size = hidden_size
self.filter_size = filter_size
self.relu_dropout = relu_dropout
def build(self, input_shape):
self.filter_dense_layer = tf.keras.layers.Dense(
self.filter_size,
use_bias=True,
activation=tf.nn.relu,
name="filter_layer")
self.output_dense_layer = tf.keras.layers.Dense(
self.hidden_size, use_bias=True, name="output_layer")
super(FeedForwardNetwork, self).build(input_shape)
def get_config(self):
return {
"hidden_size": self.hidden_size,
"filter_size": self.filter_size,
"relu_dropout": self.relu_dropout,
}
def call(self, x, training):
"""Return outputs of the feedforward network.
Args:
x: tensor with shape [batch_size, length, hidden_size]
training: boolean, whether in training mode or not.
Returns:
Output of the feedforward network.
tensor with shape [batch_size, length, hidden_size]
"""
# Retrieve dynamically known shapes
batch_size = tf.shape(x)[0]
length = tf.shape(x)[1]
output = self.filter_dense_layer(x)
if training:
output = tf.nn.dropout(output, rate=self.relu_dropout)
output = self.output_dense_layer(output)
return output
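# Illustrative usage sketch appended for clarity; it is not part of the
# original module, and the layer sizes are arbitrary assumptions.
if __name__ == "__main__":
  demo_network = FeedForwardNetwork(hidden_size=32, filter_size=64, relu_dropout=0.1)
  demo_inputs = tf.ones([2, 10, 32])        # [batch_size, length, hidden_size]
  demo_outputs = demo_network(demo_inputs, training=False)
  print(demo_outputs.shape)                 # -> (2, 10, 32), same as the input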
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Functions for calculating loss, accuracy, and other model metrics.
Metrics:
- Padded loss, accuracy, and negative log perplexity. Source:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/metrics.py
- BLEU approximation. Source:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/bleu_hook.py
- ROUGE score. Source:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/rouge.py
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import functools
import tensorflow as tf
def _pad_tensors_to_same_length(x, y):
"""Pad x and y so that the results have the same length (second dimension)."""
with tf.name_scope("pad_to_same_length"):
x_length = tf.shape(x)[1]
y_length = tf.shape(y)[1]
max_length = tf.maximum(x_length, y_length)
x = tf.pad(x, [[0, 0], [0, max_length - x_length], [0, 0]])
y = tf.pad(y, [[0, 0], [0, max_length - y_length]])
return x, y
def padded_cross_entropy_loss(logits, labels, smoothing, vocab_size):
"""Calculate cross entropy loss while ignoring padding.
Args:
logits: Tensor of size [batch_size, length_logits, vocab_size]
labels: Tensor of size [batch_size, length_labels]
smoothing: Label smoothing constant, used to determine the on and off values
vocab_size: int size of the vocabulary
Returns:
Returns the cross entropy loss and weight tensors: float32 tensors with
shape [batch_size, max(length_logits, length_labels)]
"""
with tf.name_scope("loss"):
logits, labels = _pad_tensors_to_same_length(logits, labels)
# Calculate smoothing cross entropy
with tf.name_scope("smoothing_cross_entropy"):
confidence = 1.0 - smoothing
low_confidence = (1.0 - confidence) / tf.cast(vocab_size - 1, tf.float32)
soft_targets = tf.one_hot(
tf.cast(labels, tf.int32),
depth=vocab_size,
on_value=confidence,
off_value=low_confidence)
xentropy = tf.nn.softmax_cross_entropy_with_logits(
logits=logits, labels=soft_targets)
# Calculate the best (lowest) possible value of cross entropy, and
# subtract from the cross entropy loss.
normalizing_constant = -(
confidence * tf.math.log(confidence) +
tf.cast(vocab_size - 1, tf.float32) * low_confidence *
tf.math.log(low_confidence + 1e-20))
xentropy -= normalizing_constant
weights = tf.cast(tf.not_equal(labels, 0), tf.float32)
return xentropy * weights, weights
def padded_accuracy(logits, labels):
"""Percentage of times that predictions matches labels on non-0s."""
with tf.name_scope("padded_accuracy"):
logits, labels = _pad_tensors_to_same_length(logits, labels)
weights = tf.cast(tf.not_equal(labels, 0), tf.float32)
outputs = tf.cast(tf.argmax(logits, axis=-1), tf.int32)
padded_labels = tf.cast(labels, tf.int32)
return tf.cast(tf.equal(outputs, padded_labels), tf.float32), weights
def padded_accuracy_topk(logits, labels, k):
"""Percentage of times that top-k predictions matches labels on non-0s."""
with tf.name_scope("padded_accuracy_topk"):
logits, labels = _pad_tensors_to_same_length(logits, labels)
weights = tf.cast(tf.not_equal(labels, 0), tf.float32)
effective_k = tf.minimum(k, tf.shape(logits)[-1])
_, outputs = tf.nn.top_k(logits, k=effective_k)
outputs = tf.cast(outputs, tf.int32)
padded_labels = tf.cast(labels, tf.int32)
padded_labels = tf.expand_dims(padded_labels, axis=-1)
padded_labels += tf.zeros_like(outputs) # Pad to same shape.
same = tf.cast(tf.equal(outputs, padded_labels), tf.float32)
same_topk = tf.reduce_sum(same, axis=-1)
return same_topk, weights
def padded_accuracy_top5(logits, labels):
return padded_accuracy_topk(logits, labels, 5)
def padded_sequence_accuracy(logits, labels):
"""Percentage of times that predictions matches labels everywhere (non-0)."""
with tf.name_scope("padded_sequence_accuracy"):
logits, labels = _pad_tensors_to_same_length(logits, labels)
weights = tf.cast(tf.not_equal(labels, 0), tf.float32)
outputs = tf.cast(tf.argmax(logits, axis=-1), tf.int32)
padded_labels = tf.cast(labels, tf.int32)
not_correct = tf.cast(tf.not_equal(outputs, padded_labels),
tf.float32) * weights
axis = list(range(1, len(outputs.get_shape())))
correct_seq = 1.0 - tf.minimum(1.0, tf.reduce_sum(not_correct, axis=axis))
return correct_seq, tf.constant(1.0)
def padded_neg_log_perplexity(logits, labels, vocab_size):
"""Average log-perplexity excluding padding 0s. No smoothing."""
num, den = padded_cross_entropy_loss(logits, labels, 0, vocab_size)
return -num, den
class MetricLayer(tf.keras.layers.Layer):
"""Custom a layer of metrics for Transformer model."""
def __init__(self, vocab_size):
super(MetricLayer, self).__init__()
self.vocab_size = vocab_size
self.metric_mean_fns = []
def build(self, input_shape):
neg_log_perplexity = functools.partial(
padded_neg_log_perplexity, vocab_size=self.vocab_size)
self.metric_mean_fns = [
(tf.keras.metrics.Mean("accuracy"), padded_accuracy),
(tf.keras.metrics.Mean("accuracy_top5"), padded_accuracy_top5),
(tf.keras.metrics.Mean("accuracy_per_sequence"),
padded_sequence_accuracy),
(tf.keras.metrics.Mean("neg_log_perplexity"), neg_log_perplexity),
]
super(MetricLayer, self).build(input_shape)
def get_config(self):
return {"vocab_size": self.vocab_size}
def call(self, inputs):
logits, targets = inputs[0], inputs[1]
for mean, fn in self.metric_mean_fns:
m = mean(*fn(logits, targets))
self.add_metric(m)
return logits
def transformer_loss(logits, labels, smoothing, vocab_size):
"""Calculates total loss containing cross entropy with padding ignored.
Args:
logits: Tensor of size [batch_size, length_logits, vocab_size]
labels: Tensor of size [batch_size, length_labels]
smoothing: Label smoothing constant, used to determine the on and off values
vocab_size: int size of the vocabulary
Returns:
A scalar float tensor for loss.
"""
xentropy, weights = padded_cross_entropy_loss(logits, labels, smoothing,
vocab_size)
return tf.reduce_sum(xentropy) / tf.reduce_sum(weights)
class LossLayer(tf.keras.layers.Layer):
"""Custom a layer of transformer loss for Transformer model."""
def __init__(self, vocab_size, label_smoothing):
super(LossLayer, self).__init__()
self.vocab_size = vocab_size
self.label_smoothing = label_smoothing
def get_config(self):
return {
"vocab_size": self.vocab_size,
"label_smoothing": self.label_smoothing,
}
def call(self, inputs):
logits, targets = inputs[0], inputs[1]
loss = transformer_loss(logits, targets, self.label_smoothing,
self.vocab_size)
self.add_loss(loss)
return logits
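# Illustrative usage sketch appended for clarity; it is not part of the
# original module. The shapes, smoothing value, and vocabulary size are
# arbitrary assumptions.
if __name__ == "__main__":
  demo_logits = tf.zeros([2, 5, 100])        # [batch_size, length, vocab_size]
  demo_labels = tf.ones([2, 5], dtype=tf.int64)
  demo_loss = transformer_loss(demo_logits, demo_labels, smoothing=0.1, vocab_size=100)
  demo_acc, demo_weights = padded_accuracy(demo_logits, demo_labels)
  print(demo_loss, tf.reduce_sum(demo_acc * demo_weights) / tf.reduce_sum(demo_weights))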
# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Misc for Transformer."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from absl import flags
from official.transformer.model import model_params
from official.utils.flags import core as flags_core
PARAMS_MAP = {
"tiny": model_params.TINY_PARAMS,
"base": model_params.BASE_PARAMS,
"big": model_params.BIG_PARAMS,
}
def get_model_params(param_set, num_gpus):
"""Gets predefined model params."""
if num_gpus > 1:
if param_set == "big":
return model_params.BIG_MULTI_GPU_PARAMS.copy()
elif param_set == "base":
return model_params.BASE_MULTI_GPU_PARAMS.copy()
else:
raise ValueError("Not valid params: param_set={} num_gpus={}".format(
param_set, num_gpus))
return PARAMS_MAP[param_set].copy()
def define_transformer_flags():
"""Add flags and flag validators for running transformer_main."""
# Add common flags (data_dir, model_dir, train_epochs, etc.).
flags_core.define_base()
flags_core.define_performance(
num_parallel_calls=True,
inter_op=False,
intra_op=False,
synthetic_data=True,
max_train_steps=False,
dtype=False,
all_reduce_alg=True
)
flags_core.define_benchmark()
flags_core.define_device(tpu=True)
# Set flags from the flags_core module as "key flags" so they're listed when
# the '-h' flag is used. Without this line, the flags defined above are
# only shown in the full `--helpful` help text.
flags.adopt_module_key_flags(flags_core)
# Add transformer-specific flags
flags.DEFINE_enum(
name="param_set", short_name="mp", default="big",
enum_values=PARAMS_MAP.keys(),
help=flags_core.help_wrap(
"Parameter set to use when creating and training the model. The "
"parameters define the input shape (batch size and max length), "
"model configuration (size of embedding, # of hidden layers, etc.), "
"and various other settings. The big parameter set increases the "
"default batch size, embedding/hidden size, and filter size. For a "
"complete list of parameters, please see model/model_params.py."))
flags.DEFINE_bool(
name="static_batch", default=False,
help=flags_core.help_wrap(
"Whether the batches in the dataset should have static shapes. In "
"general, this setting should be False. Dynamic shapes allow the "
"inputs to be grouped so that the number of padding tokens is "
"minimized, and helps model training. In cases where the input shape "
"must be static (e.g. running on TPU), this setting will be ignored "
"and static batching will always be used."))
# Flags for training with steps (may be used for debugging)
flags.DEFINE_integer(
name="steps_per_epoch", short_name="sbe", default=1000,
help=flags_core.help_wrap(
"The number of training steps for each epoch."))
flags.DEFINE_integer(
name="init_epoch", short_name="is", default=0,
help=flags_core.help_wrap("The number of initial epoch for training."))
flags.DEFINE_string(
name="init_weight_path", short_name="iwp", default=None,
help=flags_core.help_wrap("The initial model weights to load."))
flags.DEFINE_string(
name="init_logdir_timestamp", short_name="ilt", default=None,
help=flags_core.help_wrap("The initial timestamp for logdir."))
flags.DEFINE_integer(
name="validation_steps", short_name="vs", default=64,
help=flags_core.help_wrap("The number of steps used in validation."))
# BLEU score computation
flags.DEFINE_string(
name="bleu_source", short_name="bls", default=None,
help=flags_core.help_wrap(
"Path to source file containing text translate when calculating the "
"official BLEU score. Both --bleu_source and --bleu_ref must be set. "
"Use the flag --stop_threshold to stop the script based on the "
"uncased BLEU score."))
flags.DEFINE_string(
name="bleu_ref", short_name="blr", default=None,
help=flags_core.help_wrap(
"Path to source file containing text translate when calculating the "
"official BLEU score. Both --bleu_source and --bleu_ref must be set. "
"Use the flag --stop_threshold to stop the script based on the "
"uncased BLEU score."))
flags.DEFINE_string(
name="vocab_file", short_name="vf", default=None,
help=flags_core.help_wrap(
"Path to subtoken vocabulary file. If data_download.py was used to "
"download and encode the training data, look in the data_dir to find "
"the vocab file."))
flags.DEFINE_string(
name="mode", default="train",
help=flags_core.help_wrap("mode: train, eval, or predict"))
flags_core.set_defaults(data_dir="/tmp/translate_ende",
model_dir="/tmp/transformer_model",
batch_size=None,
train_epochs=10)
# pylint: disable=unused-variable
@flags.multi_flags_validator(
["mode", "train_epochs"],
message="--train_epochs must be defined in train mode")
def _check_train_limits(flag_dict):
if flag_dict["mode"] == "train":
return flag_dict["train_epochs"] is not None
return True
@flags.multi_flags_validator(
["bleu_source", "bleu_ref"],
message="Both or neither --bleu_source and --bleu_ref must be defined.")
def _check_bleu_files(flags_dict):
return (flags_dict["bleu_source"] is None) == (
flags_dict["bleu_ref"] is None)
@flags.multi_flags_validator(
["bleu_source", "bleu_ref", "vocab_file"],
message="--vocab_file must be defined if --bleu_source and --bleu_ref "
"are defined.")
def _check_bleu_vocab_file(flags_dict):
if flags_dict["bleu_source"] and flags_dict["bleu_ref"]:
return flags_dict["vocab_file"] is not None
return True
@flags.multi_flags_validator(
["export_dir", "vocab_file"],
message="--vocab_file must be defined if --export_dir is set.")
def _check_export_vocab_file(flags_dict):
if flags_dict["export_dir"]:
return flags_dict["vocab_file"] is not None
return True
# pylint: enable=unused-variable
flags_core.require_cloud_storage(["data_dir", "model_dir", "export_dir"])
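# Illustrative usage sketch appended for clarity; it is not part of the
# original module. The "base" parameter set and single-GPU setting below are
# arbitrary choices.
if __name__ == "__main__":
  define_transformer_flags()
  demo_params = get_model_params(param_set="base", num_gpus=1)
  print(type(demo_params))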
# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Optimizer from addons and learning rate scheduler.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
K = tf.keras.backend
class LazyAdam(tf.keras.optimizers.Adam):
"""Variant of the Adam optimizer that handles sparse updates more efficiently.
The original Adam algorithm maintains two moving-average accumulators for
each trainable variable; the accumulators are updated at every step.
This class provides lazier handling of gradient updates for sparse
variables. It only updates moving-average accumulators for sparse variable
indices that appear in the current batch, rather than updating the
accumulators for all indices. Compared with the original Adam optimizer,
it can provide large improvements in model training throughput for some
applications. However, it provides slightly different semantics than the
original Adam algorithm, and may lead to different empirical results.
  Note: amsgrad is currently not supported; the argument can only be False.
This class is borrowed from:
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lazy_adam.py
"""
def _resource_apply_sparse(self, grad, var, indices):
"""Applies grad for one step."""
var_dtype = var.dtype.base_dtype
lr_t = self._decayed_lr(var_dtype)
beta_1_t = self._get_hyper('beta_1', var_dtype)
beta_2_t = self._get_hyper('beta_2', var_dtype)
local_step = tf.cast(self.iterations + 1, var_dtype)
beta_1_power = tf.math.pow(beta_1_t, local_step)
beta_2_power = tf.math.pow(beta_2_t, local_step)
epsilon_t = tf.convert_to_tensor(self.epsilon, var_dtype)
lr = (lr_t * tf.math.sqrt(1 - beta_2_power) / (1 - beta_1_power))
# \\(m := beta1 * m + (1 - beta1) * g_t\\)
m = self.get_slot(var, 'm')
m_t_slice = beta_1_t * tf.gather(m, indices) + (1 - beta_1_t) * grad
m_update_kwargs = {
'resource': m.handle,
'indices': indices,
'updates': m_t_slice
}
m_update_op = tf.raw_ops.ResourceScatterUpdate(**m_update_kwargs)
# \\(v := beta2 * v + (1 - beta2) * (g_t * g_t)\\)
v = self.get_slot(var, 'v')
v_t_slice = (beta_2_t * tf.gather(v, indices) +
(1 - beta_2_t) * tf.math.square(grad))
v_update_kwargs = {
'resource': v.handle,
'indices': indices,
'updates': v_t_slice
}
v_update_op = tf.raw_ops.ResourceScatterUpdate(**v_update_kwargs)
# \\(variable -= learning_rate * m_t / (epsilon_t + sqrt(v_t))\\)
var_slice = lr * m_t_slice / (tf.math.sqrt(v_t_slice) + epsilon_t)
var_update_kwargs = {
'resource': var.handle,
'indices': indices,
'updates': var_slice
}
var_update_op = tf.raw_ops.ResourceScatterSub(**var_update_kwargs)
return tf.group(*[var_update_op, m_update_op, v_update_op])
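# Hedged usage sketch for LazyAdam above.  Gradients of a gather on an
# embedding-style table arrive as tf.IndexedSlices, so only the rows that
# appear in the batch are routed through _resource_apply_sparse; the shapes and
# values below are illustrative only.
toy_table = tf.Variable(tf.random.normal([100, 8]))
toy_opt = LazyAdam(learning_rate=1e-3)
with tf.GradientTape() as toy_tape:
  toy_rows = tf.gather(toy_table, [3, 7])          # only rows 3 and 7 participate
  toy_loss = tf.reduce_sum(tf.square(toy_rows))
toy_grad = toy_tape.gradient(toy_loss, toy_table)  # an IndexedSlices, not a dense tensor
toy_opt.apply_gradients([(toy_grad, toy_table)])   # "m"/"v" slots touched only at rows 3, 7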
class LearningRateFn(object):
"""Creates learning rate function."""
def __init__(self, learning_rate, hidden_size, warmup_steps):
self.learning_rate = learning_rate
self.hidden_size = hidden_size
self.warmup_steps = float(warmup_steps)
def __call__(self, global_step):
"""Calculate learning rate with linear warmup and rsqrt decay."""
step = float(global_step)
learning_rate = self.learning_rate
learning_rate *= (self.hidden_size ** -0.5)
# Apply linear warmup
learning_rate *= np.minimum(1.0, step / self.warmup_steps)
# Apply rsqrt decay
learning_rate /= np.sqrt(np.maximum(step, self.warmup_steps))
return learning_rate
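# Hedged numeric sketch of LearningRateFn above.  The effective rate is
#   learning_rate * hidden_size**-0.5 * min(1, step / warmup) / sqrt(max(step, warmup)),
# i.e. linear warmup up to warmup_steps, then 1/sqrt(step) decay.  The values
# below are illustrative only.
toy_lr_fn = LearningRateFn(learning_rate=2.0, hidden_size=512, warmup_steps=4000)
toy_lrs = [toy_lr_fn(s) for s in (1, 2000, 4000, 16000)]
# toy_lrs rises linearly until step 4000 and then decays as 1/sqrt(step); at
# step == warmup_steps the two regimes meet.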
class LearningRateScheduler(tf.keras.callbacks.Callback):
"""Keras callback to schedule learning rate.
TODO(tianlin): Refactor this scheduler and LearningRateBatchScheduler in
official/resnet/keras/keras_common.py.
"""
def __init__(self, schedule, init_steps=None, verbose=False):
super(LearningRateScheduler, self).__init__()
self.schedule = schedule
self.verbose = verbose
if init_steps is None:
init_steps = 0.0
self.steps = float(init_steps) # Total steps during training.
def on_epoch_begin(self, epoch, logs=None):
if not hasattr(self.model.optimizer, 'lr'):
raise ValueError('Optimizer must have a "lr" attribute.')
if not hasattr(self.model.optimizer, 'iterations'):
      raise ValueError('Optimizer must have an "iterations" attribute.')
def on_train_batch_begin(self, batch, logs=None):
if self.verbose > 0:
iterations = K.get_value(self.model.optimizer.iterations)
print('Original iteration %d' % iterations)
self.steps += 1.0
try: # new API
lr = float(K.get_value(self.model.optimizer.lr))
lr = self.schedule(self.steps, lr)
except TypeError: # Support for old API for backward compatibility
lr = self.schedule(self.steps)
if not isinstance(lr, (float, np.float32, np.float64)):
      raise ValueError('The output of the "schedule" function '
                       'should be a float.')
K.set_value(self.model.optimizer.lr, lr)
K.set_value(self.model.optimizer.iterations, self.steps)
if self.verbose > 0:
print('Batch %05d Step %05d: LearningRateScheduler setting learning '
'rate to %s.' % (batch + 1, self.steps, lr))
def on_epoch_end(self, epoch, logs=None):
logs = logs or {}
logs['lr'] = K.get_value(self.model.optimizer.lr)
logs['steps'] = self.steps
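if __name__ == "__main__":
  # Hedged end-to-end sketch (illustration only, run as a script): wiring
  # LearningRateFn into LearningRateScheduler for a Keras fit() loop, the same
  # pattern TransformerTask._create_callbacks uses later in this commit.  The
  # tiny Dense model and random data are placeholders, not Transformer code.
  toy_model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
  toy_model.compile(optimizer=LazyAdam(1e-3), loss="mse")
  toy_schedule = LearningRateFn(learning_rate=2.0, hidden_size=512, warmup_steps=100)
  toy_callback = LearningRateScheduler(toy_schedule, init_steps=0)
  toy_model.fit(tf.random.normal([32, 8]), tf.random.normal([32, 4]),
                batch_size=8, epochs=1, callbacks=[toy_callback])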
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Defines the Transformer model in TF 2.0.
Model paper: https://arxiv.org/pdf/1706.03762.pdf
Transformer model code source: https://github.com/tensorflow/tensor2tensor
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from official.transformer.model import beam_search
from official.transformer.model import model_utils
from official.transformer.utils.tokenizer import EOS_ID
from official.transformer.v2 import attention_layer
from official.transformer.v2 import embedding_layer
from official.transformer.v2 import ffn_layer
from official.transformer.v2 import metrics
def create_model(params, is_train):
"""Creates transformer model."""
with tf.name_scope("model"):
if is_train:
inputs = tf.keras.layers.Input((None,), dtype="int64", name="inputs")
targets = tf.keras.layers.Input((None,), dtype="int64", name="targets")
internal_model = Transformer(params, name="transformer_v2")
logits = internal_model([inputs, targets], training=is_train)
vocab_size = params["vocab_size"]
label_smoothing = params["label_smoothing"]
logits = metrics.MetricLayer(vocab_size)([logits, targets])
logits = metrics.LossLayer(vocab_size, label_smoothing)([logits, targets])
logits = tf.keras.layers.Lambda(lambda x: x, name="logits")(logits)
return tf.keras.Model([inputs, targets], logits)
else:
inputs = tf.keras.layers.Input((None,), dtype="int64", name="inputs")
internal_model = Transformer(params, name="transformer_v2")
ret = internal_model([inputs], training=is_train)
outputs, scores = ret["outputs"], ret["scores"]
return tf.keras.Model(inputs, [outputs, scores])
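# Hedged construction sketch, kept as comments so the module's import behavior
# is unchanged.  It mirrors transformer_test.py later in this commit
# (TINY_PARAMS is assumed to be a mutable mapping, as that test treats it):
#
#   from official.transformer.model import model_params
#   params = model_params.TINY_PARAMS
#   params["batch_size"] = params["default_batch_size"] = 16
#   params["hidden_size"], params["num_hidden_layers"] = 12, 2
#   params["filter_size"], params["num_heads"], params["vocab_size"] = 14, 2, 41
#   params["extra_decode_length"], params["beam_size"] = 2, 3
#   train_model = create_model(params, is_train=True)   # 2 int64 inputs, 1 logits output
#   infer_model = create_model(params, is_train=False)  # 1 input; [outputs, scores] outputs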
class Transformer(tf.keras.Model):
"""Transformer model with Keras.
Implemented as described in: https://arxiv.org/pdf/1706.03762.pdf
The Transformer model consists of an encoder and decoder. The input is an int
sequence (or a batch of sequences). The encoder produces a continuous
representation, and the decoder uses the encoder output to generate
probabilities for the output sequence.
"""
def __init__(self, params, name=None):
"""Initialize layers to build Transformer model.
Args:
params: hyperparameter object defining layer sizes, dropout values, etc.
name: name of the model.
"""
super(Transformer, self).__init__(name=name)
self.params = params
self.embedding_softmax_layer = embedding_layer.EmbeddingSharedWeights(
params["vocab_size"], params["hidden_size"])
self.encoder_stack = EncoderStack(params)
self.decoder_stack = DecoderStack(params)
def get_config(self):
return {
"params": self.params,
}
def call(self, x, training):
"""Calculate target logits or inferred target sequences.
Args:
x: input tensor list of size 1 or 2.
First item, inputs: int tensor with shape [batch_size, input_length].
Second item (optional), targets: None or int tensor with shape
[batch_size, target_length].
training: boolean, whether in training mode or not.
Returns:
      If targets is defined, then return logits for each word in the target
      sequence: a float tensor with shape [batch_size, target_length, vocab_size].
      If targets is None, then generate the output sequence one token at a time
      and return a dictionary {
        outputs: int tensor with shape [batch_size, decoded_length]
        scores: float tensor with shape [batch_size]}
"""
if len(x) == 2:
inputs, targets = x[0], x[1]
else:
inputs, targets = x[0], None
# Variance scaling is used here because it seems to work in many problems.
# Other reasonable initializers may also work just as well.
with tf.name_scope("Transformer"):
# Calculate attention bias for encoder self-attention and decoder
# multi-headed attention layers.
attention_bias = model_utils.get_padding_bias(inputs)
# Run the inputs through the encoder layer to map the symbol
# representations to continuous representations.
encoder_outputs = self.encode(inputs, attention_bias, training)
# Generate output sequence if targets is None, or return logits if target
# sequence is known.
if targets is None:
return self.predict(encoder_outputs, attention_bias, training)
else:
logits = self.decode(targets, encoder_outputs, attention_bias, training)
return logits
def encode(self, inputs, attention_bias, training):
"""Generate continuous representation for inputs.
Args:
inputs: int tensor with shape [batch_size, input_length].
attention_bias: float tensor with shape [batch_size, 1, 1, input_length].
training: boolean, whether in training mode or not.
Returns:
float tensor with shape [batch_size, input_length, hidden_size]
"""
with tf.name_scope("encode"):
# Prepare inputs to the layer stack by adding positional encodings and
# applying dropout.
embedded_inputs = self.embedding_softmax_layer(inputs)
inputs_padding = model_utils.get_padding(inputs)
with tf.name_scope("add_pos_encoding"):
length = tf.shape(embedded_inputs)[1]
pos_encoding = model_utils.get_position_encoding(
length, self.params["hidden_size"])
encoder_inputs = embedded_inputs + pos_encoding
if training:
encoder_inputs = tf.nn.dropout(
encoder_inputs, rate=self.params["layer_postprocess_dropout"])
return self.encoder_stack(
encoder_inputs, attention_bias, inputs_padding, training=training)
def decode(self, targets, encoder_outputs, attention_bias, training):
"""Generate logits for each value in the target sequence.
Args:
targets: target values for the output sequence. int tensor with shape
[batch_size, target_length]
encoder_outputs: continuous representation of input sequence. float tensor
with shape [batch_size, input_length, hidden_size]
attention_bias: float tensor with shape [batch_size, 1, 1, input_length]
training: boolean, whether in training mode or not.
Returns:
float32 tensor with shape [batch_size, target_length, vocab_size]
"""
with tf.name_scope("decode"):
# Prepare inputs to decoder layers by shifting targets, adding positional
# encoding and applying dropout.
decoder_inputs = self.embedding_softmax_layer(targets)
with tf.name_scope("shift_targets"):
# Shift targets to the right, and remove the last element
decoder_inputs = tf.pad(decoder_inputs,
[[0, 0], [1, 0], [0, 0]])[:, :-1, :]
with tf.name_scope("add_pos_encoding"):
length = tf.shape(decoder_inputs)[1]
decoder_inputs += model_utils.get_position_encoding(
length, self.params["hidden_size"])
if training:
decoder_inputs = tf.nn.dropout(
decoder_inputs, rate=self.params["layer_postprocess_dropout"])
# Run values
decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(
length)
outputs = self.decoder_stack(
decoder_inputs,
encoder_outputs,
decoder_self_attention_bias,
attention_bias,
training=training)
logits = self.embedding_softmax_layer(outputs, mode="linear")
return logits
def _get_symbols_to_logits_fn(self, max_decode_length, training):
"""Returns a decoding function that calculates logits of the next tokens."""
timing_signal = model_utils.get_position_encoding(
max_decode_length + 1, self.params["hidden_size"])
decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(
max_decode_length)
def symbols_to_logits_fn(ids, i, cache):
"""Generate logits for next potential IDs.
Args:
ids: Current decoded sequences. int tensor with shape [batch_size *
beam_size, i + 1]
i: Loop index
cache: dictionary of values storing the encoder output, encoder-decoder
attention bias, and previous decoder attention values.
Returns:
Tuple of
(logits with shape [batch_size * beam_size, vocab_size],
updated cache values)
"""
# Set decoder input to the last generated IDs
decoder_input = ids[:, -1:]
# Preprocess decoder input by getting embeddings and adding timing signal.
decoder_input = self.embedding_softmax_layer(decoder_input)
decoder_input += timing_signal[i:i + 1]
self_attention_bias = decoder_self_attention_bias[:, :, i:i + 1, :i + 1]
decoder_outputs = self.decoder_stack(
decoder_input,
cache.get("encoder_outputs"),
self_attention_bias,
cache.get("encoder_decoder_attention_bias"),
training=training,
cache=cache)
logits = self.embedding_softmax_layer(decoder_outputs, mode="linear")
logits = tf.squeeze(logits, axis=[1])
return logits, cache
return symbols_to_logits_fn
def predict(self, encoder_outputs, encoder_decoder_attention_bias, training):
"""Return predicted sequence."""
batch_size = tf.shape(encoder_outputs)[0]
input_length = tf.shape(encoder_outputs)[1]
max_decode_length = input_length + self.params["extra_decode_length"]
symbols_to_logits_fn = self._get_symbols_to_logits_fn(
max_decode_length, training)
# Create initial set of IDs that will be passed into symbols_to_logits_fn.
initial_ids = tf.zeros([batch_size], dtype=tf.int32)
# Create cache storing decoder attention values for each layer.
# pylint: disable=g-complex-comprehension
cache = {
"layer_%d" % layer: {
"k": tf.zeros([batch_size, 0, self.params["hidden_size"]]),
"v": tf.zeros([batch_size, 0, self.params["hidden_size"]])
} for layer in range(self.params["num_hidden_layers"])
}
# pylint: enable=g-complex-comprehension
# Add encoder output and attention bias to the cache.
cache["encoder_outputs"] = encoder_outputs
cache["encoder_decoder_attention_bias"] = encoder_decoder_attention_bias
# Use beam search to find the top beam_size sequences and scores.
decoded_ids, scores = beam_search.sequence_beam_search(
symbols_to_logits_fn=symbols_to_logits_fn,
initial_ids=initial_ids,
initial_cache=cache,
vocab_size=self.params["vocab_size"],
beam_size=self.params["beam_size"],
alpha=self.params["alpha"],
max_decode_length=max_decode_length,
eos_id=EOS_ID)
# Get the top sequence for each batch element
top_decoded_ids = decoded_ids[:, 0, 1:]
top_scores = scores[:, 0]
return {"outputs": top_decoded_ids, "scores": top_scores}
class LayerNormalization(tf.keras.layers.Layer):
"""Applies layer normalization."""
def __init__(self, hidden_size):
super(LayerNormalization, self).__init__()
self.hidden_size = hidden_size
def build(self, input_shape):
"""Builds the layer."""
self.scale = self.add_weight(
"layer_norm_scale",
shape=[self.hidden_size],
dtype="float32",
initializer=tf.ones_initializer())
self.bias = self.add_weight(
"layer_norm_bias",
shape=[self.hidden_size],
dtype="float32",
initializer=tf.zeros_initializer())
super(LayerNormalization, self).build(input_shape)
def get_config(self):
return {
"hidden_size": self.hidden_size,
}
def call(self, x, epsilon=1e-6):
mean = tf.reduce_mean(x, axis=[-1], keepdims=True)
variance = tf.reduce_mean(tf.square(x - mean), axis=[-1], keepdims=True)
norm_x = (x - mean) * tf.math.rsqrt(variance + epsilon)
return norm_x * self.scale + self.bias
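# Hedged sanity-check sketch: with unit scale and zero bias, the hand-written
# LayerNormalization above should agree with tf.keras.layers.LayerNormalization
# on the last axis (same epsilon), assuming that layer exists in the installed
# TF build; agreement is only up to floating-point tolerance.
toy_x = tf.random.normal([2, 5, 16])
toy_manual_ln = LayerNormalization(hidden_size=16)
toy_keras_ln = tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-6)
toy_diff = tf.reduce_max(tf.abs(toy_manual_ln(toy_x) - toy_keras_ln(toy_x)))
# toy_diff is expected to be on the order of 1e-6.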
class PrePostProcessingWrapper(tf.keras.layers.Layer):
"""Wrapper class that applies layer pre-processing and post-processing."""
def __init__(self, layer, params):
super(PrePostProcessingWrapper, self).__init__()
self.layer = layer
self.params = params
self.postprocess_dropout = params["layer_postprocess_dropout"]
def build(self, input_shape):
# Create normalization layer
self.layer_norm = LayerNormalization(self.params["hidden_size"])
super(PrePostProcessingWrapper, self).build(input_shape)
def get_config(self):
return {
"params": self.params,
}
def call(self, x, *args, **kwargs):
"""Calls wrapped layer with same parameters."""
# Preprocessing: apply layer normalization
training = kwargs["training"]
y = self.layer_norm(x)
# Get layer output
y = self.layer(y, *args, **kwargs)
# Postprocessing: apply dropout and residual connection
if training:
y = tf.nn.dropout(y, rate=self.postprocess_dropout)
return x + y
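# Hedged usage sketch for PrePostProcessingWrapper above: any sublayer is
# applied as x + dropout(sublayer(layer_norm(x))) (pre-norm residual).  The toy
# params dict and the FeedForwardNetwork arguments (hidden_size, filter_size,
# relu_dropout) mirror how EncoderStack below constructs its sublayers.
toy_wrap_params = {"hidden_size": 16, "layer_postprocess_dropout": 0.1}
toy_ffn = ffn_layer.FeedForwardNetwork(16, 32, 0.1)
toy_wrapped = PrePostProcessingWrapper(toy_ffn, toy_wrap_params)
toy_wrap_out = toy_wrapped(tf.ones([2, 5, 16]), training=False)  # shape preserved: [2, 5, 16]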
class EncoderStack(tf.keras.layers.Layer):
"""Transformer encoder stack.
The encoder stack is made up of N identical layers. Each layer is composed
of the sublayers:
1. Self-attention layer
2. Feedforward network (which is 2 fully-connected layers)
"""
def __init__(self, params):
super(EncoderStack, self).__init__()
self.params = params
self.layers = []
def build(self, input_shape):
"""Builds the encoder stack."""
params = self.params
for _ in range(params["num_hidden_layers"]):
# Create sublayers for each layer.
self_attention_layer = attention_layer.SelfAttention(
params["hidden_size"], params["num_heads"],
params["attention_dropout"])
feed_forward_network = ffn_layer.FeedForwardNetwork(
params["hidden_size"], params["filter_size"], params["relu_dropout"])
self.layers.append([
PrePostProcessingWrapper(self_attention_layer, params),
PrePostProcessingWrapper(feed_forward_network, params)
])
# Create final layer normalization layer.
self.output_normalization = LayerNormalization(params["hidden_size"])
super(EncoderStack, self).build(input_shape)
def get_config(self):
return {
"params": self.params,
}
def call(self, encoder_inputs, attention_bias, inputs_padding, training):
"""Return the output of the encoder layer stacks.
Args:
encoder_inputs: tensor with shape [batch_size, input_length, hidden_size]
attention_bias: bias for the encoder self-attention layer. [batch_size, 1,
1, input_length]
      inputs_padding: float tensor with shape [batch_size, input_length], with 1
        at padded positions and 0 elsewhere.
training: boolean, whether in training mode or not.
Returns:
Output of encoder layer stack.
float32 tensor with shape [batch_size, input_length, hidden_size]
"""
for n, layer in enumerate(self.layers):
# Run inputs through the sublayers.
self_attention_layer = layer[0]
feed_forward_network = layer[1]
with tf.name_scope("layer_%d" % n):
with tf.name_scope("self_attention"):
encoder_inputs = self_attention_layer(
encoder_inputs, attention_bias, training=training)
with tf.name_scope("ffn"):
encoder_inputs = feed_forward_network(
encoder_inputs, training=training)
return self.output_normalization(encoder_inputs)
class DecoderStack(tf.keras.layers.Layer):
"""Transformer decoder stack.
Like the encoder stack, the decoder stack is made up of N identical layers.
Each layer is composed of the sublayers:
1. Self-attention layer
2. Multi-headed attention layer combining encoder outputs with results from
the previous self-attention layer.
3. Feedforward network (2 fully-connected layers)
"""
def __init__(self, params):
super(DecoderStack, self).__init__()
self.params = params
self.layers = []
def build(self, input_shape):
"""Builds the decoder stack."""
params = self.params
for _ in range(params["num_hidden_layers"]):
self_attention_layer = attention_layer.SelfAttention(
params["hidden_size"], params["num_heads"],
params["attention_dropout"])
enc_dec_attention_layer = attention_layer.Attention(
params["hidden_size"], params["num_heads"],
params["attention_dropout"])
feed_forward_network = ffn_layer.FeedForwardNetwork(
params["hidden_size"], params["filter_size"], params["relu_dropout"])
self.layers.append([
PrePostProcessingWrapper(self_attention_layer, params),
PrePostProcessingWrapper(enc_dec_attention_layer, params),
PrePostProcessingWrapper(feed_forward_network, params)
])
self.output_normalization = LayerNormalization(params["hidden_size"])
super(DecoderStack, self).build(input_shape)
def get_config(self):
return {
"params": self.params,
}
def call(self,
decoder_inputs,
encoder_outputs,
decoder_self_attention_bias,
attention_bias,
training,
cache=None):
"""Return the output of the decoder layer stacks.
Args:
decoder_inputs: tensor with shape [batch_size, target_length, hidden_size]
encoder_outputs: tensor with shape [batch_size, input_length, hidden_size]
decoder_self_attention_bias: bias for decoder self-attention layer. [1, 1,
        target_length, target_length]
attention_bias: bias for encoder-decoder attention layer. [batch_size, 1,
1, input_length]
training: boolean, whether in training mode or not.
cache: (Used for fast decoding) A nested dictionary storing previous
decoder self-attention values. The items are:
{layer_n: {"k": tensor with shape [batch_size, i, key_channels],
"v": tensor with shape [batch_size, i, value_channels]},
...}
Returns:
Output of decoder layer stack.
float32 tensor with shape [batch_size, target_length, hidden_size]
"""
for n, layer in enumerate(self.layers):
self_attention_layer = layer[0]
enc_dec_attention_layer = layer[1]
feed_forward_network = layer[2]
# Run inputs through the sublayers.
layer_name = "layer_%d" % n
layer_cache = cache[layer_name] if cache is not None else None
with tf.name_scope(layer_name):
with tf.name_scope("self_attention"):
decoder_inputs = self_attention_layer(
decoder_inputs,
decoder_self_attention_bias,
training=training,
cache=layer_cache)
with tf.name_scope("encdec_attention"):
decoder_inputs = enc_dec_attention_layer(
decoder_inputs,
encoder_outputs,
attention_bias,
training=training)
with tf.name_scope("ffn"):
decoder_inputs = feed_forward_network(
decoder_inputs, training=training)
return self.output_normalization(decoder_inputs)
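if __name__ == "__main__":
  # Hedged smoke-test sketch (illustration only, run as a script): a tiny
  # DecoderStack forward pass invoked the same way Transformer.decode does, to
  # show the expected tensor shapes.  Hyperparameter values are arbitrary.
  _toy_params = {
      "num_hidden_layers": 2, "hidden_size": 16, "num_heads": 2,
      "attention_dropout": 0.1, "filter_size": 32, "relu_dropout": 0.1,
      "layer_postprocess_dropout": 0.1,
  }
  _toy_decoder = DecoderStack(_toy_params)
  _toy_out = _toy_decoder(
      tf.ones([2, 3, 16]),                             # decoder_inputs
      tf.ones([2, 5, 16]),                             # encoder_outputs
      model_utils.get_decoder_self_attention_bias(3),  # [1, 1, 3, 3]
      tf.zeros([2, 1, 1, 5]),                          # encoder-decoder attention bias
      training=False)
  print(_toy_out.shape)                                # expected: (2, 3, 16)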
"""Tests for layers in Transformer."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from official.transformer.v2 import attention_layer
from official.transformer.v2 import embedding_layer
from official.transformer.v2 import ffn_layer
from official.transformer.v2 import metrics
class TransformerLayersTest(tf.test.TestCase):
def test_attention_layer(self):
hidden_size = 64
num_heads = 4
dropout = 0.5
layer = attention_layer.SelfAttention(hidden_size, num_heads, dropout)
self.assertDictEqual(layer.get_config(), {
"hidden_size": hidden_size,
"num_heads": num_heads,
"attention_dropout": dropout,
})
length = 2
x = tf.ones([1, length, hidden_size])
bias = tf.ones([1])
cache = {
"k": tf.zeros([1, 0, hidden_size]),
"v": tf.zeros([1, 0, hidden_size]),
}
y = layer(x, bias, training=True, cache=cache)
self.assertEqual(y.shape, (1, length, 64,))
self.assertEqual(cache["k"].shape, (1, length, 64,))
self.assertEqual(cache["v"].shape, (1, length, 64,))
def test_embedding_shared_weights(self):
vocab_size = 50
hidden_size = 64
length = 2
layer = embedding_layer.EmbeddingSharedWeights(vocab_size, hidden_size)
self.assertDictEqual(layer.get_config(), {
"vocab_size": 50,
"hidden_size": 64,
})
idx = tf.ones([1, length], dtype="int32")
y = layer(idx)
self.assertEqual(y.shape, (1, length, hidden_size,))
x = tf.ones([1, length, hidden_size])
output = layer(x, "linear")
self.assertEqual(output.shape, (1, length, vocab_size,))
def test_feed_forward_network(self):
hidden_size = 64
filter_size = 32
relu_dropout = 0.5
layer = ffn_layer.FeedForwardNetwork(hidden_size, filter_size, relu_dropout)
self.assertDictEqual(layer.get_config(), {
"hidden_size": hidden_size,
"filter_size": filter_size,
"relu_dropout": relu_dropout,
})
length = 2
x = tf.ones([1, length, hidden_size])
y = layer(x, training=True)
self.assertEqual(y.shape, (1, length, hidden_size,))
def test_metric_layer(self):
vocab_size = 50
logits = tf.keras.layers.Input((None, vocab_size),
dtype="float32",
name="logits")
targets = tf.keras.layers.Input((None,), dtype="int64", name="targets")
output_logits = metrics.MetricLayer(vocab_size)([logits, targets])
self.assertEqual(output_logits.shape.as_list(), [None, None, vocab_size,])
def test_loss_layer(self):
vocab_size, label_smoothing = 50, 0.1
logits = tf.keras.layers.Input((None, vocab_size),
dtype="float32",
name="logits")
targets = tf.keras.layers.Input((None,), dtype="int64", name="targets")
output_logits = metrics.LossLayer(vocab_size,
label_smoothing)([logits, targets])
self.assertEqual(output_logits.shape.as_list(), [None, None, vocab_size,])
if __name__ == "__main__":
tf.test.main()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Train and evaluate the Transformer model.
See README for description of setting the training schedule and evaluating the
BLEU score.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import datetime
import os
import tempfile
# pylint: disable=g-bad-import-order
from absl import flags
import tensorflow as tf
# pylint: enable=g-bad-import-order
from official.transformer import compute_bleu
from official.transformer.utils import tokenizer
from official.transformer.v2 import data_pipeline
from official.transformer.v2 import misc
from official.transformer.v2 import optimizer
from official.transformer.v2 import transformer
from official.transformer.v2 import translate
from official.utils.flags import core as flags_core
from official.utils.logs import logger
INF = int(1e9)
BLEU_DIR = "bleu"
_SINGLE_SAMPLE = 1
def translate_and_compute_bleu(model, subtokenizer, bleu_source, bleu_ref):
"""Translate file and report the cased and uncased bleu scores."""
# Create temporary file to store translation.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp_filename = tmp.name
translate.translate_file(
model,
subtokenizer,
bleu_source,
output_file=tmp_filename,
print_all_translations=False)
  # Compute uncased and cased BLEU scores.
uncased_score = compute_bleu.bleu_wrapper(bleu_ref, tmp_filename, False)
cased_score = compute_bleu.bleu_wrapper(bleu_ref, tmp_filename, True)
os.remove(tmp_filename)
return uncased_score, cased_score
def evaluate_and_log_bleu(model, bleu_source, bleu_ref, vocab_file):
"""Calculate and record the BLEU score."""
subtokenizer = tokenizer.Subtokenizer(vocab_file)
uncased_score, cased_score = translate_and_compute_bleu(
model, subtokenizer, bleu_source, bleu_ref)
tf.compat.v1.logging.info("Bleu score (uncased): %s", uncased_score)
tf.compat.v1.logging.info("Bleu score (cased): %s", cased_score)
return uncased_score, cased_score
class TransformerTask(object):
"""Main entry of Transformer model."""
def __init__(self, flags_obj):
"""Init function of TransformerMain.
Args:
flags_obj: Object containing parsed flag values, i.e., FLAGS.
"""
self.flags_obj = flags_obj
# Add flag-defined parameters to params object
num_gpus = flags_core.get_num_gpus(flags_obj)
self.params = params = misc.get_model_params(flags_obj.param_set, num_gpus)
params["data_dir"] = flags_obj.data_dir
params["model_dir"] = flags_obj.model_dir
params["num_parallel_calls"] = (
flags_obj.num_parallel_calls or tf.data.experimental.AUTOTUNE)
params["use_synthetic_data"] = flags_obj.use_synthetic_data
params["batch_size"] = flags_obj.batch_size or params["default_batch_size"]
params["repeat_dataset"] = None
def train(self):
"""Trains the model."""
params, flags_obj, is_train = self.params, self.flags_obj, True
model = transformer.create_model(params, is_train)
opt = self._create_optimizer()
model.compile(opt, target_tensors=[])
model.summary()
self._load_weights_if_possible(model, flags_obj.init_weight_path)
cur_log_dir = _get_log_dir_or_default(flags_obj)
_ensure_dir(cur_log_dir)
map_data_fn = data_pipeline.map_data_for_transformer_fn
train_ds = data_pipeline.train_input_fn(params)
train_ds = train_ds.map(
map_data_fn, num_parallel_calls=params["num_parallel_calls"])
valid_ds = data_pipeline.eval_input_fn(params)
valid_ds = valid_ds.map(
map_data_fn, num_parallel_calls=params["num_parallel_calls"])
init_epoch = flags_obj.init_epoch or 0
init_steps = init_epoch * flags_obj.steps_per_epoch
callbacks = self._create_callbacks(cur_log_dir, init_steps, params)
history = model.fit(
train_ds,
initial_epoch=init_epoch,
epochs=flags_obj.train_epochs,
steps_per_epoch=flags_obj.steps_per_epoch,
validation_data=valid_ds,
validation_steps=flags_obj.validation_steps,
callbacks=callbacks)
tf.compat.v1.logging.info("\nTrain history: {}".format(history.history))
save_weight_path = os.path.join(cur_log_dir, "saves-model-weights.hdf5")
save_model_path = os.path.join(cur_log_dir, "saves-model.hdf5")
model.save_weights(save_weight_path)
model.save(save_model_path)
def eval(self):
"""Evaluates the model."""
params, flags_obj, is_train = self.params, self.flags_obj, False
with tf.name_scope("model"):
model = transformer.create_model(params, is_train)
self._load_weights_if_possible(model, flags_obj.init_weight_path)
model.summary()
evaluate_and_log_bleu(model, flags_obj.bleu_source, flags_obj.bleu_ref,
flags_obj.vocab_file)
def predict(self):
"""Predicts result from the model."""
params, flags_obj, is_train = self.params, self.flags_obj, False
with tf.name_scope("model"):
model = transformer.create_model(params, is_train)
self._load_weights_if_possible(model, flags_obj.init_weight_path)
model.summary()
subtokenizer = tokenizer.Subtokenizer(flags_obj.vocab_file)
ds = data_pipeline.eval_input_fn(params)
ds = ds.map(lambda x, y: x).take(_SINGLE_SAMPLE)
ret = model.predict(ds)
val_outputs, _ = ret
length = len(val_outputs)
for i in range(length):
translate.translate_from_input(val_outputs[i], subtokenizer)
def _create_callbacks(self, cur_log_dir, init_steps, params):
"""Creates a list of callbacks."""
sfunc = optimizer.LearningRateFn(params["learning_rate"],
params["hidden_size"],
params["learning_rate_warmup_steps"])
scheduler_callback = optimizer.LearningRateScheduler(sfunc, init_steps)
tb_logdir = os.path.join(cur_log_dir, "logs")
save_path = os.path.join(cur_log_dir,
"weights-epoch-{epoch:02d}-loss-{loss:.4f}.hdf5")
csv_path = os.path.join(cur_log_dir, "result.csv")
return [
scheduler_callback,
tf.keras.callbacks.TensorBoard(tb_logdir),
tf.keras.callbacks.ModelCheckpoint(save_path, save_weights_only=True),
tf.keras.callbacks.CSVLogger(csv_path, append=True),
]
def _load_weights_if_possible(self, model, init_weight_path=None):
"""Loads model weights when it is provided."""
if init_weight_path:
tf.compat.v1.logging.info("Load weights: {}".format(init_weight_path))
model.load_weights(init_weight_path, by_name=True)
def _create_optimizer(self):
"""Creates optimizer."""
params = self.params
opt = optimizer.LazyAdam(
params["learning_rate"],
params["optimizer_adam_beta1"],
params["optimizer_adam_beta2"],
epsilon=params["optimizer_adam_epsilon"])
return opt
def _get_log_dir_or_default(flags_obj):
"""Gets init_logdir_timestamp if it is given, otherwise use current time."""
if flags_obj.init_logdir_timestamp is not None:
timestamp = flags_obj.init_logdir_timestamp
else:
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
return os.path.join(flags_obj.model_dir, timestamp)
def _ensure_dir(log_dir):
"""Makes log dir if not existed."""
if not os.path.exists(log_dir):
os.makedirs(log_dir)
def main(_):
flags_obj = flags.FLAGS
with logger.benchmark_context(flags_obj):
task = TransformerTask(flags_obj)
if flags_obj.mode == "train":
task.train()
elif flags_obj.mode == "predict":
task.predict()
elif flags_obj.mode == "eval":
task.eval()
else:
raise ValueError("Invalid mode {}".format(flags_obj.mode))
if __name__ == "__main__":
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
misc.define_transformer_flags()
tf.compat.v1.app.run(main)
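# Hedged invocation sketch, kept as comments.  Flag names are taken from the
# flag definitions and tests in this commit; paths and values are placeholders:
#
#   python -m official.transformer.v2.transformer_main \
#       --mode=train --param_set=tiny --use_synthetic_data=true \
#       --train_epochs=1 --steps_per_epoch=1 --validation_steps=1 \
#       --batch_size=8 --model_dir=/tmp/transformer_model
#
#   # BLEU evaluation additionally needs --vocab_file, --bleu_source and
#   # --bleu_ref (enforced by the validators in misc.py):
#   python -m official.transformer.v2.transformer_main \
#       --mode=eval --param_set=tiny --vocab_file=/tmp/translate_ende/vocab \
#       --bleu_source=newstest.src --bleu_ref=newstest.ref \
#       --init_weight_path=/tmp/transformer_model/<timestamp>/saves-model-weights.hdf5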
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Test Transformer model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import re
from absl import flags
import tensorflow as tf
from tensorflow.python.framework import test_util
from official.transformer.v2 import misc
from official.transformer.v2 import transformer_main as tm
FLAGS = flags.FLAGS
FIXED_TIMESTAMP = "my_time_stamp"
WEIGHT_PATTERN = re.compile(r"weights-epoch-.+\.hdf5")
def _generate_file(filepath, lines):
with open(filepath, "w") as f:
for l in lines:
f.write("{}\n".format(l))
class TransformerTaskTest(tf.test.TestCase):
def setUp(self):
temp_dir = self.get_temp_dir()
FLAGS.model_dir = temp_dir
FLAGS.init_logdir_timestamp = FIXED_TIMESTAMP
FLAGS.param_set = param_set = "tiny"
FLAGS.use_synthetic_data = True
FLAGS.steps_per_epoch = 1
FLAGS.validation_steps = 1
FLAGS.train_epochs = 1
FLAGS.batch_size = 8
FLAGS.init_weight_path = None
self.cur_log_dir = os.path.join(temp_dir, FIXED_TIMESTAMP)
self.vocab_file = os.path.join(self.cur_log_dir, "vocab")
self.vocab_size = misc.get_model_params(param_set, 0)["vocab_size"]
self.bleu_source = os.path.join(self.cur_log_dir, "bleu_source")
self.bleu_ref = os.path.join(self.cur_log_dir, "bleu_ref")
self.flags_file = os.path.join(self.cur_log_dir, "flags")
def _assert_exists(self, filepath):
self.assertTrue(os.path.exists(filepath))
def test_train(self):
t = tm.TransformerTask(FLAGS)
t.train()
# Test model dir.
self._assert_exists(self.cur_log_dir)
# Test saving models.
self._assert_exists(
os.path.join(self.cur_log_dir, "saves-model-weights.hdf5"))
self._assert_exists(os.path.join(self.cur_log_dir, "saves-model.hdf5"))
# Test callbacks:
# TensorBoard file.
self._assert_exists(os.path.join(self.cur_log_dir, "logs"))
# CSVLogger file.
self._assert_exists(os.path.join(self.cur_log_dir, "result.csv"))
# Checkpoint file.
filenames = os.listdir(self.cur_log_dir)
matched_weight_file = any([WEIGHT_PATTERN.match(f) for f in filenames])
self.assertTrue(matched_weight_file)
def _prepare_files_and_flags(self, *extra_flags):
# Make log dir.
if not os.path.exists(self.cur_log_dir):
os.makedirs(self.cur_log_dir)
# Fake vocab, bleu_source and bleu_ref.
tokens = [
"'<pad>'", "'<EOS>'", "'_'", "'a'", "'b'", "'c'", "'d'", "'a_'", "'b_'",
"'c_'", "'d_'"
]
tokens += ["'{}'".format(i) for i in range(self.vocab_size - len(tokens))]
_generate_file(self.vocab_file, tokens)
_generate_file(self.bleu_source, ["a b", "c d"])
_generate_file(self.bleu_ref, ["a b", "d c"])
# Update flags.
update_flags = [
"ignored_program_name",
"--vocab_file={}".format(self.vocab_file),
"--bleu_source={}".format(self.bleu_source),
"--bleu_ref={}".format(self.bleu_ref),
]
if extra_flags:
update_flags.extend(extra_flags)
FLAGS(update_flags)
@test_util.run_v1_only("V1 should work. Issue: V2 w/ graph transformed.")
def test_predict(self):
self._prepare_files_and_flags()
t = tm.TransformerTask(FLAGS)
t.predict()
@test_util.run_v1_only("V1 should work. Issue: V2 w/ graph transformed.")
def test_eval(self):
self._prepare_files_and_flags()
t = tm.TransformerTask(FLAGS)
t.eval()
if __name__ == "__main__":
misc.define_transformer_flags()
tf.test.main()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Test Transformer model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from official.transformer.model import model_params
from official.transformer.v2 import transformer
class TransformerV2Test(tf.test.TestCase):
def setUp(self):
self.params = params = model_params.TINY_PARAMS
params["batch_size"] = params["default_batch_size"] = 16
params["use_synthetic_data"] = True
params["hidden_size"] = 12
params["num_hidden_layers"] = 2
params["filter_size"] = 14
params["num_heads"] = 2
params["vocab_size"] = 41
params["extra_decode_length"] = 2
params["beam_size"] = 3
def test_create_model_train(self):
model = transformer.create_model(self.params, True)
inputs, outputs = model.inputs, model.outputs
self.assertEqual(len(inputs), 2)
self.assertEqual(len(outputs), 1)
self.assertEqual(inputs[0].shape.as_list(), [None, None])
self.assertEqual(inputs[0].dtype, tf.int64)
self.assertEqual(inputs[1].shape.as_list(), [None, None])
self.assertEqual(inputs[1].dtype, tf.int64)
self.assertEqual(outputs[0].shape.as_list(), [None, None, 41])
self.assertEqual(outputs[0].dtype, tf.float32)
def test_create_model_not_train(self):
model = transformer.create_model(self.params, False)
inputs, outputs = model.inputs, model.outputs
self.assertEqual(len(inputs), 1)
self.assertEqual(len(outputs), 2)
self.assertEqual(inputs[0].shape.as_list(), [None, None])
self.assertEqual(inputs[0].dtype, tf.int64)
self.assertEqual(outputs[0].shape.as_list(), [None, None])
self.assertEqual(outputs[0].dtype, tf.int32)
self.assertEqual(outputs[1].shape.as_list(), [None])
self.assertEqual(outputs[1].dtype, tf.float32)
if __name__ == "__main__":
tf.test.main()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Translate text or files using trained transformer model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from official.transformer.utils import tokenizer
_DECODE_BATCH_SIZE = 32
_EXTRA_DECODE_LENGTH = 100
_BEAM_SIZE = 4
_ALPHA = 0.6
def _get_sorted_inputs(filename):
"""Read and sort lines from the file sorted by decreasing length.
Args:
filename: String name of file to read inputs from.
Returns:
    Sorted list of inputs, and a list mapping each element's original index to
    its position in the sorted list.
"""
with tf.io.gfile.GFile(filename) as f:
records = f.read().split("\n")
inputs = [record.strip() for record in records]
if not inputs[-1]:
inputs.pop()
input_lens = [(i, len(line.split())) for i, line in enumerate(inputs)]
sorted_input_lens = sorted(input_lens, key=lambda x: x[1], reverse=True)
sorted_inputs = [None] * len(sorted_input_lens)
sorted_keys = [0] * len(sorted_input_lens)
for i, (index, _) in enumerate(sorted_input_lens):
sorted_inputs[i] = inputs[index]
sorted_keys[index] = i
return sorted_inputs, sorted_keys
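# Hedged micro-example of the index bookkeeping above (pure Python, toy data):
# translations are produced in length-descending order and then written back in
# the original order via sorted_keys, exactly as translate_file does below.
_toy_inputs = ["a b c", "x", "d e"]
_toy_sorted_inputs = ["a b c", "d e", "x"]   # what _get_sorted_inputs would return
_toy_sorted_keys = [0, 2, 1]                 # original index -> position in the sorted list
_toy_translations = ["T(" + s + ")" for s in _toy_sorted_inputs]
_toy_restored = [_toy_translations[k] for k in _toy_sorted_keys]
# _toy_restored == ["T(a b c)", "T(x)", "T(d e)"], i.e. back in the original input order.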
def _encode_and_add_eos(line, subtokenizer):
"""Encode line with subtokenizer, and add EOS id to the end."""
return subtokenizer.encode(line) + [tokenizer.EOS_ID]
def _trim_and_decode(ids, subtokenizer):
"""Trim EOS and PAD tokens from ids, and decode to return a string."""
try:
index = list(ids).index(tokenizer.EOS_ID)
return subtokenizer.decode(ids[:index])
except ValueError: # No EOS found in sequence
return subtokenizer.decode(ids)
def translate_file(
model, subtokenizer, input_file, output_file=None,
print_all_translations=True):
"""Translate lines in file, and save to output file if specified.
Args:
model: Keras model used to generate the translations.
subtokenizer: Subtokenizer object for encoding and decoding source and
translated lines.
input_file: file containing lines to translate
output_file: file that stores the generated translations.
print_all_translations: If true, all translations are printed to stdout.
Raises:
ValueError: if output file is invalid.
"""
batch_size = _DECODE_BATCH_SIZE
# Read and sort inputs by length. Keep dictionary (original index-->new index
# in sorted list) to write translations in the original order.
sorted_inputs, sorted_keys = _get_sorted_inputs(input_file)
total_samples = len(sorted_inputs)
num_decode_batches = (total_samples - 1) // batch_size + 1
def input_generator():
"""Yield encoded strings from sorted_inputs."""
for i in range(num_decode_batches):
lines = [
sorted_inputs[j + i * batch_size]
for j in range(batch_size)
if j + i * batch_size < total_samples
]
lines = [_encode_and_add_eos(l, subtokenizer) for l in lines]
batch = tf.keras.preprocessing.sequence.pad_sequences(
lines, dtype="int64", padding="post")
tf.compat.v1.logging.info("Decoding batch %d out of %d.", i,
num_decode_batches)
yield batch
translations = []
for i, text in enumerate(input_generator()):
val_outputs, _ = model.predict(text)
length = len(val_outputs)
for j in range(length):
translation = _trim_and_decode(val_outputs[j], subtokenizer)
translations.append(translation)
if print_all_translations:
tf.compat.v1.logging.info(
"Translating:\n\tInput: %s\n\tOutput: %s" %
(sorted_inputs[j + i * batch_size], translation))
# Write translations in the order they appeared in the original file.
if output_file is not None:
if tf.io.gfile.isdir(output_file):
raise ValueError("File output is a directory, will not save outputs to "
"file.")
tf.compat.v1.logging.info("Writing to file %s" % output_file)
with tf.compat.v1.gfile.Open(output_file, "w") as f:
for i in sorted_keys:
f.write("%s\n" % translations[i])
def translate_from_text(model, subtokenizer, txt):
encoded_txt = _encode_and_add_eos(txt, subtokenizer)
result = model.predict(encoded_txt)
outputs = result["outputs"]
tf.compat.v1.logging.info("Original: \"%s\"" % txt)
translate_from_input(outputs, subtokenizer)
def translate_from_input(outputs, subtokenizer):
translation = _trim_and_decode(outputs, subtokenizer)
tf.compat.v1.logging.info("Translation: \"%s\"" % translation)