# 1. Introduction

This script is a functional test case for the GNMT model from the NLP domain, based on the MLPerf reference implementation. When the target BLEU score reaches 24.0, the model is considered to have converged and the job finishes successfully.
# 2. Running

## Install dependencies
    pip install sacrebleu==1.2.10
    pip3 install --no-cache-dir https://github.com/mlperf/logging/archive/9ea0afa.zip
    # install apex
    # build the GPU-related dependencies in seq2seq:
    CC=hipcc CXX=hipcc python3 setup.py install

## Download the dataset
    bash scripts/wmt16_en_de.sh
* For a more detailed description of the dataset, see Section 3 of README_orgin.md

## Preprocessing
    python3 preprocess_data.py --dataset-dir /path/to/download/wmt16_de_en/ --preproc-data-dir /path/to/save/preprocess/data --max-length-train "75" --math fp32
 
## Single node, single GPU
    HIP_VISIBLE_DEVICES=0 python3 train.py \
        --save ${RESULTS_DIR} \
        --dataset-dir ${DATASET_DIR} \
        --preproc-data-dir ${PREPROC_DATADIR}/${MAX_SEQ_LEN} \
        --target-bleu $TARGET \
        --epochs "${NUMEPOCHS}" \
        --math ${MATH} \
        --max-length-train ${MAX_SEQ_LEN} \
        --print-freq 10 \
        --train-batch-size $TRAIN_BATCH_SIZE \
        --test-batch-size $TEST_BATCH_SIZE \
        --optimizer Adam \
        --lr $LR \
        --warmup-steps $WARMUP_STEPS \
        --remain-steps $REMAIN_STEPS \
        --decay-interval $DECAY_INTERVAL \
        --no-log-all-ranks

* See run_fp32_singleCard.sh for a complete example

## Single node, multiple GPUs
    bash run_fp32_node.sh
        
* See run_fp32_node.sh for a complete example

## Multiple nodes, multiple GPUs
    bash run_fp32_multi.sh
     
# 3. Model
### Publication/Attribution

The implemented model is similar to the one from the [Google's Neural Machine
Translation System: Bridging the Gap between Human and Machine
Translation](https://arxiv.org/abs/1609.08144) paper.

The most important difference is in the attention mechanism. This repository
implements `gnmt_v2` attention: the output from the first LSTM layer of the
decoder goes into attention, and the re-weighted context is then concatenated
with the inputs to all subsequent LSTM layers in the decoder at the current
timestep.

The same attention mechanism is also implemented in default
GNMT-like models from [tensorflow/nmt](https://github.com/tensorflow/nmt) and
[NVIDIA/OpenSeq2Seq](https://github.com/NVIDIA/OpenSeq2Seq).
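
For illustration, the following minimal PyTorch sketch shows this wiring for a
single decoder timestep. Every name here is hypothetical (it does not mirror
the repository's `ResidualRecurrentDecoder`), and `nn.MultiheadAttention` is
used only as a stand-in for the normalized Bahdanau attention described below.

    import torch
    import torch.nn as nn

    class Gnmt2DecoderStep(nn.Module):
        """Illustrative single decoder timestep with gnmt_v2 attention wiring."""

        def __init__(self, hidden_size=1024, num_layers=4, dropout=0.2):
            super().__init__()
            self.first_lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
            # stand-in attention: any module mapping (query, keys) -> context
            self.attention = nn.MultiheadAttention(hidden_size, num_heads=1,
                                                   batch_first=True)
            self.upper_lstms = nn.ModuleList(
                [nn.LSTM(2 * hidden_size, hidden_size, batch_first=True)
                 for _ in range(num_layers - 1)])
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, encoder_outputs):
            # x: (batch, 1, hidden), encoder_outputs: (batch, src_len, hidden)
            out, _ = self.first_lstm(self.dropout(x))
            # output of the first LSTM layer queries the attention module
            context, _ = self.attention(out, encoder_outputs, encoder_outputs)
            for i, lstm in enumerate(self.upper_lstms):
                # re-weighted context is concatenated with each layer's input
                inp = self.dropout(torch.cat((out, context), dim=2))
                new_out, _ = lstm(inp)
                # residual connections start from the 3rd LSTM layer
                out = new_out + out if i > 0 else new_out
            return out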

### Structure

* general:
  * encoder and decoder use shared embeddings
  * data-parallel multi-GPU training
  * trained with label smoothing loss (smoothing factor 0.1)
* encoder:
  * 4-layer LSTM, hidden size 1024; the first layer is bidirectional, the
    remaining layers are unidirectional
  * with residual connections starting from the 3rd LSTM layer
  * uses the standard PyTorch nn.LSTM layer
  * dropout is applied on the input to all LSTM layers, with the dropout
    probability set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and biases of LSTM layers are initialized with the
    uniform(-0.1, 0.1) distribution
* decoder:
  * 4-layer unidirectional LSTM with hidden size 1024 and a fully-connected
    classifier
  * with residual connections starting from the 3rd LSTM layer
  * uses the standard PyTorch nn.LSTM layer
  * dropout is applied on the input to all LSTM layers, with the dropout
    probability set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and biases of LSTM layers are initialized with the
    uniform(-0.1, 0.1) distribution
  * weights and biases of the fully-connected classifier are initialized with
    the uniform(-0.1, 0.1) distribution
* attention:
  * normalized Bahdanau attention
  * model uses the `gnmt_v2` attention mechanism
  * output from the first LSTM layer of the decoder goes into attention,
    then the re-weighted context is concatenated with the input to all
    subsequent LSTM layers in the decoder at the current timestep
  * linear transform of keys and queries is initialized with uniform(-0.1, 0.1),
    normalization scalar is initialized with 1.0 / sqrt(1024),
    normalization bias is initialized with zero
* inference:
  * beam search with beam size of 5
  * with coverage penalty and length normalization (see the sketch after this
    list): the coverage penalty factor is set to 0.1, the length normalization
    factor to 0.6, and the length normalization constant to 5.0
  * BLEU computed by [sacrebleu](https://pypi.org/project/sacrebleu/)
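
The length normalization and coverage penalty above follow the formulation
from the GNMT paper. As a hedged sketch (the repository's `SequenceGenerator`
may organize this differently), a beam-search candidate score can be computed
from those three constants as follows:

    import torch

    # constants quoted in the inference bullet above
    LEN_NORM_FACTOR = 0.6     # length normalization factor
    LEN_NORM_CONST = 5.0      # length normalization constant
    COV_PENALTY_FACTOR = 0.1  # coverage penalty factor

    def length_penalty(length):
        # lp(Y) = ((const + |Y|) / (const + 1)) ** factor
        return ((LEN_NORM_CONST + length) / (LEN_NORM_CONST + 1.0)) ** LEN_NORM_FACTOR

    def coverage_penalty(attn_probs):
        # cp(X, Y) = factor * sum_i log(min(sum_j p_ij, 1.0))
        # attn_probs: (tgt_len, src_len) attention probabilities of a candidate;
        # the small floor is only a numerical guard against log(0)
        coverage = attn_probs.sum(dim=0).clamp(min=1e-9, max=1.0)
        return COV_PENALTY_FACTOR * torch.log(coverage).sum().item()

    def candidate_score(sum_log_prob, length, attn_probs):
        # final beam-search score: log P(Y|X) / lp(Y) + cp(X, Y)
        return sum_log_prob / length_penalty(length) + coverage_penalty(attn_probs)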


Implementation:
* base Seq2Seq model: `pytorch/seq2seq/models/seq2seq_base.py`, class `Seq2Seq`
* GNMT model: `pytorch/seq2seq/models/gnmt.py`, class `GNMT`
* encoder: `pytorch/seq2seq/models/encoder.py`, class `ResidualRecurrentEncoder`
* decoder: `pytorch/seq2seq/models/decoder.py`, class `ResidualRecurrentDecoder`
* attention: `pytorch/seq2seq/models/attention.py`, class `BahdanauAttention`
* inference (including BLEU evaluation and detokenization): `pytorch/seq2seq/inference/inference.py`, class `Translator`
* beam search: `pytorch/seq2seq/inference/beam_search.py`, class `SequenceGenerator`

### Loss function
Cross entropy loss with label smoothing (smoothing factor = 0.1); padding is
not considered part of the loss.

Loss function is implemented in `pytorch/seq2seq/train/smoothing.py`, class
`LabelSmoothing`.
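
As an illustration of this combination (not the repository's `LabelSmoothing`
class itself), a minimal PyTorch version with a smoothing factor of 0.1 and a
padding index excluded from the loss could look like this:

    import torch
    import torch.nn as nn

    class LabelSmoothingSketch(nn.Module):
        """Cross entropy with label smoothing that ignores padding positions."""

        def __init__(self, padding_idx, smoothing=0.1):
            super().__init__()
            self.padding_idx = padding_idx
            self.confidence = 1.0 - smoothing
            self.smoothing = smoothing

        def forward(self, logits, target):
            # logits: (N, vocab_size), target: (N,)
            log_probs = torch.log_softmax(logits, dim=-1)
            vocab_size = logits.size(-1)
            # spread the smoothing mass uniformly over the non-target classes
            smooth_target = torch.full_like(log_probs,
                                            self.smoothing / (vocab_size - 1))
            smooth_target.scatter_(1, target.unsqueeze(1), self.confidence)
            loss = -(smooth_target * log_probs).sum(dim=-1)
            # padding positions do not contribute to the loss
            return loss[target.ne(self.padding_idx)].sum()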

### Optimizer
Adam optimizer with learning rate 1e-3, beta1 = 0.9, beta2 = 0.999, epsilon =
1e-8 and no weight decay.
The network is trained with gradient clipping; the maximum L2 norm of the
gradients is set to 5.0.

Optimizer is implemented in `pytorch/seq2seq/train/fp_optimizers.py`, class
`Fp32Optimizer`.
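
A minimal runnable sketch of that optimizer and gradient-clipping setup (the
tiny model and random data below are placeholders, not the GNMT network):

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 4)                      # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
    criterion = nn.CrossEntropyLoss()

    inputs = torch.randn(8, 16)                   # placeholder batch
    targets = torch.randint(0, 4, (8,))

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # clip gradients to a maximum L2 norm of 5.0 before the parameter update
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()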

### Learning rate schedule
The model is trained with an exponential learning rate warmup for 200 steps,
followed by step learning rate decay. Decay starts after 2/3 of the training
steps and is applied a total of 4 times at regularly spaced intervals, with a
decay factor of 0.5.

Learning rate scheduler is implemented in
`pytorch/seq2seq/train/lr_scheduler.py`, class `WarmupMultiStepLR`.
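
The same schedule can be sketched as a `LambdaLR` multiplier. This is only an
approximation: the repository's `WarmupMultiStepLR` is implemented differently,
and the total step count and initial warmup factor below are placeholders.

    import math
    import torch

    WARMUP_STEPS = 200
    TOTAL_STEPS = 9000            # placeholder; depends on the training run
    DECAY_START = 2 * TOTAL_STEPS // 3
    NUM_DECAYS = 4
    DECAY_FACTOR = 0.5
    DECAY_INTERVAL = (TOTAL_STEPS - DECAY_START) // NUM_DECAYS

    def lr_multiplier(step):
        if step < WARMUP_STEPS:
            # exponential warmup from 0.01 * lr up to the base lr
            return math.exp(math.log(0.01) * (1 - step / WARMUP_STEPS))
        if step < DECAY_START:
            return 1.0
        num_decays = min((step - DECAY_START) // DECAY_INTERVAL + 1, NUM_DECAYS)
        return DECAY_FACTOR ** num_decays

    model = torch.nn.Linear(8, 8)                 # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)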

# 4. Evaluation

### Quality metric
Uncased BLEU score on the newstest2014 en-de dataset.
BLEU scores are reported by the [sacrebleu](https://pypi.org/project/sacrebleu/)
package (version 1.2.10). Sacrebleu is executed with the following flags:
`--score-only -lc --tokenize intl`.
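
Through the Python API this is roughly equivalent to the following (the
sentences are placeholders, and the exact keyword arguments may vary slightly
across sacrebleu versions):

    import sacrebleu

    hypotheses = ["the cat sat on the mat ."]        # model outputs
    references = [["The cat sat on the mat ."]]      # one reference stream

    # mirrors `--score-only -lc --tokenize intl`
    bleu = sacrebleu.corpus_bleu(hypotheses, references,
                                 lowercase=True, tokenize='intl')
    print(round(bleu.score, 2))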

### Quality target
Uncased BLEU score of 24.00.

### Evaluation frequency
Evaluation of BLEU score is done after every epoch.

### Evaluation thoroughness
Evaluation uses all of `newstest2014.en` (3003 sentences).