# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Transformer Translation Model
This is an implementation of the Transformer translation model as described in
the [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper. The
implementation leverages tf.keras and ensures compatibility with TF 2.x.

**Warning: the features in the `transformer/` folder have been fully integrated
into `nlp/modeling`. Due to its dependencies, we will remove this folder after the
model garden 2.5 release. The model in `nlp/modeling/models/seq2seq_transformer.py`
is identical to the model in this folder.**

## Contents
  * [Contents](#contents)
  * [Walkthrough](#walkthrough)
  * [Detailed instructions](#detailed-instructions)
    * [Environment preparation](#environment-preparation)
    * [Download and preprocess datasets](#download-and-preprocess-datasets)
    * [Model training and evaluation](#model-training-and-evaluation)
  * [Implementation overview](#implementation-overview)
    * [Model Definition](#model-definition)
    * [Model Trainer](#model-trainer)
    * [Test dataset](#test-dataset)

## Walkthrough

Below are the commands for running the Transformer model. See the
[Detailed instructions](#detailed-instructions) for more details on running the
model.

```
# Ensure that PYTHONPATH is correctly defined as described in
# https://github.com/tensorflow/models/tree/master/official#requirements
export PYTHONPATH="$PYTHONPATH:/path/to/models"

cd /path/to/models/official/nlp/transformer

# Export variables
PARAM_SET=big
DATA_DIR=$HOME/transformer/data
MODEL_DIR=$HOME/transformer/model_$PARAM_SET
VOCAB_FILE=$DATA_DIR/vocab.ende.32768

# Download training/evaluation/test datasets
python3 data_download.py --data_dir=$DATA_DIR

# Train the model for 100000 steps and evaluate every 5000 steps on a single GPU.
# Each training step uses a batch budget of 4096 tokens, with a maximum
# sequence length of 64.
python3 transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
    --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET \
    --train_steps=100000 --steps_between_evals=5000 \
    --batch_size=4096 --max_length=64 \
    --bleu_source=$DATA_DIR/newstest2014.en \
    --bleu_ref=$DATA_DIR/newstest2014.de \
    --num_gpus=1 \
    --enable_time_history=false

# Run during training in a separate process to get continuous updates,
# or after training is complete.
tensorboard --logdir=$MODEL_DIR
```

## Detailed instructions


0. ### Environment preparation

   #### Add models repo to PYTHONPATH
   Follow the instructions in the [Requirements](https://github.com/tensorflow/models/tree/master/official#requirements) section to add the models folder to the Python path.

   #### Export variables (optional)

   Export the following variables, or modify the values in each of the snippets below:

   ```shell
   PARAM_SET=big
   DATA_DIR=$HOME/transformer/data
   MODEL_DIR=$HOME/transformer/model_$PARAM_SET
   VOCAB_FILE=$DATA_DIR/vocab.ende.32768
   ```

1. ### Download and preprocess datasets

   [data_download.py](data_download.py) downloads and preprocesses the training and evaluation WMT datasets. After the data is downloaded and extracted, the training data is used to generate a vocabulary of subtokens. The evaluation and training strings are tokenized, and the resulting data is sharded, shuffled, and saved as TFRecords.

   1.75GB of compressed data will be downloaded. In total, the raw files (compressed, extracted, and combined files) take up 8.4GB of disk space. The resulting TFRecord and vocabulary files are 722MB. The script takes around 40 minutes to run, with the bulk of the time spent downloading and ~15 minutes spent on preprocessing.

   Command to run:
   ```
   python3 data_download.py --data_dir=$DATA_DIR
   ```

   Arguments:
   * `--data_dir`: Path where the preprocessed TFRecord data and vocabulary file will be saved.
   * Use the `--help` or `-h` flag to get a full list of possible arguments.

2. ### Model training and evaluation

   [transformer_main.py](transformer_main.py) creates a Transformer Keras model
   and trains it with Keras `model.fit()`.

   Users need to adjust `batch_size` and `num_gpus` to get good performance
   when running on multiple GPUs.

   **Note:** when using multiple GPUs or TPUs, `batch_size` is the global batch
   size for all devices. For example, if the batch size is `4096*4` and there are
   4 devices, each device takes 4096 tokens as its batch budget.

   Command to run:
   ```
   python3 transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
       --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET
   ```

   Arguments:
   * `--data_dir`: Set this to the same directory given to `data_download.py`'s `--data_dir` argument.
   * `--model_dir`: Directory to save Transformer model training checkpoints.
   * `--vocab_file`: Path to the subtoken vocabulary file. If `data_download.py` was used, the file can be found in `data_dir`.
   * `--param_set`: Parameter set to use when creating and training the model. Options are `base` and `big` (default).
   * `--enable_time_history`: Whether to add the TimeHistory callback. If enabled, `--log_steps` must also be specified.
   * `--batch_size`: The number of tokens to consider in a batch. Together with
     `--max_length`, this determines how many sequences are used per batch (for
     example, a 4096-token budget with `--max_length=64` holds at most 64
     sequences).
   * Use the `--help` or `-h` flag to get a full list of possible arguments.

    #### Using multiple GPUs
    You can train these models on multiple GPUs using the `tf.distribute.Strategy`
    API. You can read more about it in the
    [distributed training guide](https://www.tensorflow.org/guide/distribute_strategy).

    In this example, multi-GPU training is enabled with a single command-line
    flag, `--num_gpus` (see the example command after the list below). By default
    this flag is 1 if TensorFlow was compiled with CUDA, and 0 otherwise.

    - `--num_gpus=0`: Uses `tf.distribute.OneDeviceStrategy` with CPU as the device.
    - `--num_gpus=1`: Uses `tf.distribute.OneDeviceStrategy` with GPU as the device.
    - `--num_gpus=2+`: Uses `tf.distribute.MirroredStrategy` to run synchronous
      distributed training across the GPUs.
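
    For example, a minimal sketch of a two-GPU run using the flags documented
    above; the global batch size is doubled here so that each device keeps a
    4096-token budget, per the note in the previous section:

    ```shell
    python3 transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
        --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET \
        --batch_size=8192 --num_gpus=2
    ```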

   #### Using Cloud TPUs

   You can train the Transformer model on Cloud TPUs using
   `tf.distribute.TPUStrategy`. If you are not familiar with Cloud TPUs, it is
   strongly recommended that you go through the
   [quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to
   create a TPU and GCE VM.

   To run the Transformer model on a TPU, you must set
   `--distribution_strategy=tpu`, `--tpu=$TPU_NAME`, and `--use_ctl=True`, where
   `$TPU_NAME` is the name of your TPU in the Cloud Console.

   An example command to run Transformer on a v2-8 or v3-8 TPU would be:

   ```bash
   python transformer_main.py \
     --tpu=$TPU_NAME \
     --model_dir=$MODEL_DIR \
     --data_dir=$DATA_DIR \
     --vocab_file=$DATA_DIR/vocab.ende.32768 \
     --bleu_source=$DATA_DIR/newstest2014.en \
     --bleu_ref=$DATA_DIR/newstest2014.de \
     --batch_size=6144 \
     --train_steps=2000 \
     --static_batch=true \
     --use_ctl=true \
     --param_set=big \
     --max_length=64 \
     --decode_batch_size=32 \
     --decode_max_length=97 \
     --padded_decode=true \
     --distribution_strategy=tpu
   ```
   Note: `$MODEL_DIR` and `$DATA_DIR` must be GCS paths.

   #### Customizing training schedule

   By default, the model trains for 10 epochs and evaluates after every epoch.
   The training schedule can instead be defined with the following flags (an
   example command follows the list):

   * Training with steps:
     * `--train_steps`: sets the total number of training steps to run.
     * `--steps_between_evals`: Number of training steps to run between evaluations.
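
   For example, to reproduce the schedule used in the walkthrough above
   (100,000 total steps with an evaluation every 5,000 steps):

   ```shell
   python3 transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
       --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET \
       --train_steps=100000 --steps_between_evals=5000
   ```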

   #### Compute BLEU score during model evaluation

   Use these flags to compute the BLEU score when the model evaluates:

   * `--bleu_source`: Path to file containing text to translate.
   * `--bleu_ref`: Path to file containing the reference translation.

   When running `transformer_main.py`, use the flags: `--bleu_source=$DATA_DIR/newstest2014.en --bleu_ref=$DATA_DIR/newstest2014.de`
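
   For example, to evaluate on the newstest2014 files produced by
   `data_download.py` (other flags as in the training command above):

   ```shell
   python3 transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
       --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET \
       --bleu_source=$DATA_DIR/newstest2014.en \
       --bleu_ref=$DATA_DIR/newstest2014.de
   ```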

   #### Tensorboard
   Training and evaluation metrics (loss, accuracy, approximate BLEU score, etc.) are logged, and can be displayed in the browser using Tensorboard.
   ```
   tensorboard --logdir=$MODEL_DIR
   ```
   The values are displayed at [localhost:6006](http://localhost:6006).

## Implementation overview

A brief look at each component in the code:

### Model Definition
* [transformer.py](transformer.py): Defines a tf.keras.Model: `Transformer`.
* [embedding_layer.py](embedding_layer.py): Contains the layer that calculates the embeddings. The embedding weights are also used to calculate the pre-softmax probabilities from the decoder output.
* [attention_layer.py](attention_layer.py): Defines the multi-headed attention and self-attention layers that are used in the encoder/decoder stacks.
* [ffn_layer.py](ffn_layer.py): Defines the feedforward network that is used in the encoder/decoder stacks. The network is composed of 2 fully connected layers.

Other files:
* [beam_search.py](beam_search.py) contains the beam search implementation, which is used during model inference to find high scoring translations.

### Model Trainer
[transformer_main.py](transformer_main.py) creates a `TransformerTask` to train and evaluate the model using tf.keras.

### Test dataset
The [newstest2014 files](https://storage.googleapis.com/tf-perf-public/official_transformer/test_data/newstest2014.tgz)
are extracted from the [NMT Seq2Seq tutorial](https://google.github.io/seq2seq/nmt/#download-data).
The raw text files are converted from the SGM format of the
[WMT 2016](http://www.statmt.org/wmt16/translation-task.html) test sets. The
newstest2014 files are put into `$DATA_DIR` when executing `data_download.py`.