# 1. Introduction

This script is a functional test case for the GNMT model from the NLP domain, based on the MLPerf reference implementation. When the target BLEU score reaches 24.0, the model is considered to have converged and the job finishes successfully.
# 2. Running

## Install dependencies
    pip install sacrebleu==1.2.10
    pip3 install --no-cache-dir https://github.com/mlperf/logging/archive/9ea0afa.zip
    # install apex
    # build the GPU-related dependencies in seq2seq:
    CC=hipcc CXX=hipcc python3 setup.py install

## Download the dataset
    bash scripts/wmt16_en_de.sh
* For a more detailed description of the dataset, see Section 3 of README_orgin.md

## Preprocessing
    python3 preprocess_data.py --dataset-dir /path/to/download/wmt16_de_en/ --preproc-data-dir /path/to/save/preprocess/data --max-length-train "75" --math fp32
 
## Single node, single GPU
    HIP_VISIBLE_DEVICES=0 python3 train.py \
        --save ${RESULTS_DIR} \
        --dataset-dir ${DATASET_DIR} \
        --preproc-data-dir ${PREPROC_DATADIR}/${MAX_SEQ_LEN} \
        --target-bleu $TARGET \
        --epochs "${NUMEPOCHS}" \
        --math ${MATH} \
        --max-length-train ${MAX_SEQ_LEN} \
        --print-freq 10 \
        --train-batch-size $TRAIN_BATCH_SIZE \
        --test-batch-size $TEST_BATCH_SIZE \
        --optimizer Adam \
        --lr $LR \
        --warmup-steps $WARMUP_STEPS \
        --remain-steps $REMAIN_STEPS \
        --decay-interval $DECAY_INTERVAL \
        --no-log-all-ranks

* See run_fp32_singleCard.sh for a complete example

## Single node, multiple GPUs
    bash run_fp32_node.sh
        
* See run_fp32_node.sh for a complete example

## Multiple nodes, multiple GPUs
    bash run_fp32_multi.sh
     
# 3. Model
### Publication/Attribution

The implemented model is similar to the one from the [Google's Neural Machine
Translation System: Bridging the Gap between Human and Machine
Translation](https://arxiv.org/abs/1609.08144) paper.

The most important difference is in the attention mechanism. This repository
implements `gnmt_v2` attention: the output from the first LSTM layer of the
decoder goes into attention, and the re-weighted context is then concatenated
with the inputs to all subsequent LSTM layers in the decoder at the current
timestep.

The same attention mechanism is also implemented in default
GNMT-like models from [tensorflow/nmt](https://github.com/tensorflow/nmt) and
[NVIDIA/OpenSeq2Seq](https://github.com/NVIDIA/OpenSeq2Seq).
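
For illustration, the following minimal PyTorch sketch shows this wiring for a
single decoder timestep. Every name here is hypothetical (it does not mirror
the repository's `ResidualRecurrentDecoder`), and `nn.MultiheadAttention` is
used only as a stand-in for the normalized Bahdanau attention described below.

    import torch
    import torch.nn as nn

    class Gnmt2DecoderStep(nn.Module):
        """Illustrative single decoder timestep with gnmt_v2 attention wiring."""

        def __init__(self, hidden_size=1024, num_layers=4, dropout=0.2):
            super().__init__()
            self.first_lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
            # stand-in attention: any module mapping (query, keys) -> context
            self.attention = nn.MultiheadAttention(hidden_size, num_heads=1,
                                                   batch_first=True)
            self.upper_lstms = nn.ModuleList(
                [nn.LSTM(2 * hidden_size, hidden_size, batch_first=True)
                 for _ in range(num_layers - 1)])
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, encoder_outputs):
            # x: (batch, 1, hidden), encoder_outputs: (batch, src_len, hidden)
            out, _ = self.first_lstm(self.dropout(x))
            # output of the first LSTM layer queries the attention module
            context, _ = self.attention(out, encoder_outputs, encoder_outputs)
            for i, lstm in enumerate(self.upper_lstms):
                # re-weighted context is concatenated with each layer's input
                inp = self.dropout(torch.cat((out, context), dim=2))
                new_out, _ = lstm(inp)
                # residual connections start from the 3rd LSTM layer
                out = new_out + out if i > 0 else new_out
            return out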

### Structure

* general:
  * encoder and decoder use shared embeddings
  * data-parallel multi-GPU training
  * trained with label smoothing loss (smoothing factor 0.1)
* encoder:
  * 4-layer LSTM, hidden size 1024; the first layer is bidirectional, the
    remaining layers are unidirectional
  * with residual connections starting from the 3rd LSTM layer
  * uses the standard PyTorch nn.LSTM layer
  * dropout is applied on the input to all LSTM layers, with the dropout
    probability set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and biases of LSTM layers are initialized with the
    uniform(-0.1, 0.1) distribution
* decoder:
  * 4-layer unidirectional LSTM with hidden size 1024 and a fully-connected
    classifier
  * with residual connections starting from the 3rd LSTM layer
  * uses the standard PyTorch nn.LSTM layer
  * dropout is applied on the input to all LSTM layers, with the dropout
    probability set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and biases of LSTM layers are initialized with the
    uniform(-0.1, 0.1) distribution
  * weights and biases of the fully-connected classifier are initialized with
    the uniform(-0.1, 0.1) distribution
* attention:
  * normalized Bahdanau attention
  * model uses the `gnmt_v2` attention mechanism
  * output from the first LSTM layer of the decoder goes into attention,
    then the re-weighted context is concatenated with the input to all
    subsequent LSTM layers in the decoder at the current timestep
  * linear transform of keys and queries is initialized with uniform(-0.1, 0.1),
    normalization scalar is initialized with 1.0 / sqrt(1024),
    normalization bias is initialized with zero
* inference:
  * beam search with beam size of 5
  * with coverage penalty and length normalization (see the sketch after this
    list): the coverage penalty factor is set to 0.1, the length normalization
    factor to 0.6, and the length normalization constant to 5.0
  * BLEU computed by [sacrebleu](https://pypi.org/project/sacrebleu/)
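
The length normalization and coverage penalty above follow the formulation
from the GNMT paper. As a hedged sketch (the repository's `SequenceGenerator`
may organize this differently), a beam-search candidate score can be computed
from those three constants as follows:

    import torch

    # constants quoted in the inference bullet above
    LEN_NORM_FACTOR = 0.6     # length normalization factor
    LEN_NORM_CONST = 5.0      # length normalization constant
    COV_PENALTY_FACTOR = 0.1  # coverage penalty factor

    def length_penalty(length):
        # lp(Y) = ((const + |Y|) / (const + 1)) ** factor
        return ((LEN_NORM_CONST + length) / (LEN_NORM_CONST + 1.0)) ** LEN_NORM_FACTOR

    def coverage_penalty(attn_probs):
        # cp(X, Y) = factor * sum_i log(min(sum_j p_ij, 1.0))
        # attn_probs: (tgt_len, src_len) attention probabilities of a candidate;
        # the small floor is only a numerical guard against log(0)
        coverage = attn_probs.sum(dim=0).clamp(min=1e-9, max=1.0)
        return COV_PENALTY_FACTOR * torch.log(coverage).sum().item()

    def candidate_score(sum_log_prob, length, attn_probs):
        # final beam-search score: log P(Y|X) / lp(Y) + cp(X, Y)
        return sum_log_prob / length_penalty(length) + coverage_penalty(attn_probs)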


Implementation:
* base Seq2Seq model: `pytorch/seq2seq/models/seq2seq_base.py`, class `Seq2Seq`
* GNMT model: `pytorch/seq2seq/models/gnmt.py`, class `GNMT`
* encoder: `pytorch/seq2seq/models/encoder.py`, class `ResidualRecurrentEncoder`
* decoder: `pytorch/seq2seq/models/decoder.py`, class `ResidualRecurrentDecoder`
* attention: `pytorch/seq2seq/models/attention.py`, class `BahdanauAttention`
* inference (including BLEU evaluation and detokenization): `pytorch/seq2seq/inference/inference.py`, class `Translator`
* beam search: `pytorch/seq2seq/inference/beam_search.py`, class `SequenceGenerator`

### Loss function
Cross entropy loss with label smoothing (smoothing factor = 0.1); padding is
not considered part of the loss.

Loss function is implemented in `pytorch/seq2seq/train/smoothing.py`, class
`LabelSmoothing`.
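
As an illustration of this combination (not the repository's `LabelSmoothing`
class itself), a minimal PyTorch version with a smoothing factor of 0.1 and a
padding index excluded from the loss could look like this:

    import torch
    import torch.nn as nn

    class LabelSmoothingSketch(nn.Module):
        """Cross entropy with label smoothing that ignores padding positions."""

        def __init__(self, padding_idx, smoothing=0.1):
            super().__init__()
            self.padding_idx = padding_idx
            self.confidence = 1.0 - smoothing
            self.smoothing = smoothing

        def forward(self, logits, target):
            # logits: (N, vocab_size), target: (N,)
            log_probs = torch.log_softmax(logits, dim=-1)
            vocab_size = logits.size(-1)
            # spread the smoothing mass uniformly over the non-target classes
            smooth_target = torch.full_like(log_probs,
                                            self.smoothing / (vocab_size - 1))
            smooth_target.scatter_(1, target.unsqueeze(1), self.confidence)
            loss = -(smooth_target * log_probs).sum(dim=-1)
            # padding positions do not contribute to the loss
            return loss[target.ne(self.padding_idx)].sum()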

### Optimizer
Adam optimizer with learning rate 1e-3, beta1 = 0.9, beta2 = 0.999, epsilon =
1e-8 and no weight decay.
The network is trained with gradient clipping; the maximum L2 norm of the
gradients is set to 5.0.

Optimizer is implemented in `pytorch/seq2seq/train/fp_optimizers.py`, class
`Fp32Optimizer`.
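
A minimal runnable sketch of that optimizer and gradient-clipping setup (the
tiny model and random data below are placeholders, not the GNMT network):

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 4)                      # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
    criterion = nn.CrossEntropyLoss()

    inputs = torch.randn(8, 16)                   # placeholder batch
    targets = torch.randint(0, 4, (8,))

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # clip gradients to a maximum L2 norm of 5.0 before the parameter update
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()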

### Learning rate schedule
The model is trained with an exponential learning rate warmup for 200 steps,
followed by step learning rate decay. Decay starts after 2/3 of the training
steps and is applied a total of 4 times at regularly spaced intervals, with a
decay factor of 0.5.

Learning rate scheduler is implemented in
`pytorch/seq2seq/train/lr_scheduler.py`, class `WarmupMultiStepLR`.
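
The same schedule can be sketched as a `LambdaLR` multiplier. This is only an
approximation: the repository's `WarmupMultiStepLR` is implemented differently,
and the total step count and initial warmup factor below are placeholders.

    import math
    import torch

    WARMUP_STEPS = 200
    TOTAL_STEPS = 9000            # placeholder; depends on the training run
    DECAY_START = 2 * TOTAL_STEPS // 3
    NUM_DECAYS = 4
    DECAY_FACTOR = 0.5
    DECAY_INTERVAL = (TOTAL_STEPS - DECAY_START) // NUM_DECAYS

    def lr_multiplier(step):
        if step < WARMUP_STEPS:
            # exponential warmup from 0.01 * lr up to the base lr
            return math.exp(math.log(0.01) * (1 - step / WARMUP_STEPS))
        if step < DECAY_START:
            return 1.0
        num_decays = min((step - DECAY_START) // DECAY_INTERVAL + 1, NUM_DECAYS)
        return DECAY_FACTOR ** num_decays

    model = torch.nn.Linear(8, 8)                 # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)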

# 4. Evaluation

### Quality metric
Uncased BLEU score on the newstest2014 en-de dataset.
BLEU scores are reported by the [sacrebleu](https://pypi.org/project/sacrebleu/)
package (version 1.2.10). Sacrebleu is executed with the following flags:
`--score-only -lc --tokenize intl`.
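
Through the Python API this is roughly equivalent to the following (the
sentences are placeholders, and the exact keyword arguments may vary slightly
across sacrebleu versions):

    import sacrebleu

    hypotheses = ["the cat sat on the mat ."]        # model outputs
    references = [["The cat sat on the mat ."]]      # one reference stream

    # mirrors `--score-only -lc --tokenize intl`
    bleu = sacrebleu.corpus_bleu(hypotheses, references,
                                 lowercase=True, tokenize='intl')
    print(round(bleu.score, 2))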

### Quality target
Uncased BLEU score of 24.00.

### Evaluation frequency
Evaluation of BLEU score is done after every epoch.

### Evaluation thoroughness
Evaluation uses all of `newstest2014.en` (3003 sentences).