# 1. Introduction

This script is a functional test case for the GNMT model (NLP domain), adapted from the MLPerf reference implementation. When the target BLEU score of 24.0 is reached, the model is considered to have converged and the job finishes successfully.

# 2. Running

## Installing dependencies

```bash
pip install sacrebleu==1.2.10
pip3 install --no-cache-dir https://github.com/mlperf/logging/archive/9ea0afa.zip
```

GPU-related dependencies of `apex` and `seq2seq` are built with the HIP compiler:

```bash
CC=hipcc CXX=hipcc python3 setup.py install
```

## Downloading the dataset

```bash
bash scripts/wmt16_en_de.sh
```

* A more detailed description of the dataset can be found in section 3 of README_orgin.md.

## Preprocessing

```bash
python3 preprocess_data.py \
  --dataset-dir /path/to/download/wmt16_de_en/ \
  --preproc-data-dir /path/to/save/preprocess/data \
  --max-length-train "75" \
  --math fp32
```

## Single node, single GPU

```bash
HIP_VISIBLE_DEVICES=0 python3 train.py \
  --save ${RESULTS_DIR} \
  --dataset-dir ${DATASET_DIR} \
  --preproc-data-dir ${PREPROC_DATADIR}/${MAX_SEQ_LEN} \
  --target-bleu $TARGET \
  --epochs "${NUMEPOCHS}" \
  --math ${MATH} \
  --max-length-train ${MAX_SEQ_LEN} \
  --print-freq 10 \
  --train-batch-size $TRAIN_BATCH_SIZE \
  --test-batch-size $TEST_BATCH_SIZE \
  --optimizer Adam \
  --lr $LR \
  --warmup-steps $WARMUP_STEPS \
  --remain-steps $REMAIN_STEPS \
  --decay-interval $DECAY_INTERVAL \
  --no-log-all-ranks
```

* See run_fp32_singleCard.sh for reference.

## Single node, multiple GPUs

```bash
bash run_fp32_node.sh
```

* See run_fp32_node.sh for reference.

## Multiple nodes, multiple GPUs

```bash
bash run_fp32_multi.sh
```

# 3. Model

### Publication/Attribution

The implemented model is similar to the one from the [Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/abs/1609.08144) paper.

The most important difference is in the attention mechanism. This repository implements `gnmt_v2` attention: the output from the first LSTM layer of the decoder goes into attention, then the re-weighted context is concatenated with the inputs to all subsequent LSTM layers in the decoder at the current timestep.

The same attention mechanism is also implemented in the default GNMT-like models from [tensorflow/nmt](https://github.com/tensorflow/nmt) and [NVIDIA/OpenSeq2Seq](https://github.com/NVIDIA/OpenSeq2Seq).
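As a rough illustration of this wiring, below is a minimal PyTorch sketch of one decoder timestep. The class name, the use of `nn.LSTMCell` and the dot-product scoring are assumptions made for brevity (the repository itself uses `nn.LSTM` and normalized Bahdanau attention), and residual connections and dropout are omitted.

```python
# Illustrative sketch only -- not the repository's decoder implementation.
import torch
import torch.nn as nn


class GNMTv2DecoderStepSketch(nn.Module):
    """One decoder timestep: attention is fed by the first LSTM layer,
    and the resulting context is concatenated to the inputs of every
    subsequent LSTM layer."""

    def __init__(self, hidden=1024, num_layers=4):
        super().__init__()
        self.first = nn.LSTMCell(hidden, hidden)
        # layers 2..N consume [previous hidden state, attention context]
        self.rest = nn.ModuleList(
            [nn.LSTMCell(2 * hidden, hidden) for _ in range(num_layers - 1)]
        )
        # stand-in for the repository's normalized Bahdanau attention
        self.attn_score = nn.Linear(hidden, hidden, bias=False)

    def forward(self, x, enc_out, states):
        # first LSTM layer
        h, c = self.first(x, states[0])
        new_states = [(h, c)]
        # attention over encoder outputs, queried by the first layer's output
        scores = torch.bmm(enc_out, self.attn_score(h).unsqueeze(2)).squeeze(2)
        weights = torch.softmax(scores, dim=1)
        context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)
        # remaining LSTM layers see [hidden, context] at this timestep
        for i, cell in enumerate(self.rest, start=1):
            h, c = cell(torch.cat([h, context], dim=1), states[i])
            new_states.append((h, c))
        return h, new_states


if __name__ == "__main__":
    B, S, H = 2, 7, 1024
    step = GNMTv2DecoderStepSketch(hidden=H)
    states = [(torch.zeros(B, H), torch.zeros(B, H)) for _ in range(4)]
    out, states = step(torch.zeros(B, H), torch.randn(B, S, H), states)
    print(out.shape)  # torch.Size([2, 1024])
```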
### Structure

* general:
  * encoder and decoder use shared embeddings
  * data-parallel multi-GPU training
  * trained with label smoothing loss (smoothing factor 0.1)
* encoder:
  * 4-layer LSTM, hidden size 1024, first layer is bidirectional, the rest of the layers are unidirectional
  * with residual connections starting from the 3rd LSTM layer
  * uses the standard pytorch nn.LSTM layer
  * dropout is applied on the input to all LSTM layers, probability of dropout is set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and bias of LSTM layers are initialized with the uniform(-0.1, 0.1) distribution
* decoder:
  * 4-layer unidirectional LSTM with hidden size 1024 and a fully-connected classifier
  * with residual connections starting from the 3rd LSTM layer
  * uses the standard pytorch nn.LSTM layer
  * dropout is applied on the input to all LSTM layers, probability of dropout is set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and bias of LSTM layers are initialized with the uniform(-0.1, 0.1) distribution
  * weights and bias of the fully-connected classifier are initialized with the uniform(-0.1, 0.1) distribution
* attention:
  * normalized Bahdanau attention
  * model uses the `gnmt_v2` attention mechanism
  * output from the first LSTM layer of the decoder goes into attention, then the re-weighted context is concatenated with the input to all subsequent LSTM layers in the decoder at the current timestep
  * linear transform of keys and queries is initialized with uniform(-0.1, 0.1), the normalization scalar is initialized with 1.0 / sqrt(1024), the normalization bias is initialized with zero
* inference:
  * beam search with beam size of 5
  * with coverage penalty and length normalization; the coverage penalty factor is set to 0.1, the length normalization factor to 0.6 and the length normalization constant to 5.0
  * BLEU computed by [sacrebleu](https://pypi.org/project/sacrebleu/)

Implementation:

* base Seq2Seq model: `pytorch/seq2seq/models/seq2seq_base.py`, class `Seq2Seq`
* GNMT model: `pytorch/seq2seq/models/gnmt.py`, class `GNMT`
* encoder: `pytorch/seq2seq/models/encoder.py`, class `ResidualRecurrentEncoder`
* decoder: `pytorch/seq2seq/models/decoder.py`, class `ResidualRecurrentDecoder`
* attention: `pytorch/seq2seq/models/attention.py`, class `BahdanauAttention`
* inference (including BLEU evaluation and detokenization): `pytorch/seq2seq/inference/inference.py`, class `Translator`
* beam search: `pytorch/seq2seq/inference/beam_search.py`, class `SequenceGenerator`

### Loss function

Cross entropy loss with label smoothing (smoothing factor = 0.1); padding is not considered part of the loss.

Loss function is implemented in `pytorch/seq2seq/train/smoothing.py`, class `LabelSmoothing`.

### Optimizer

Adam optimizer with learning rate 1e-3, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8 and no weight decay. The network is trained with gradient clipping; the max L2 norm of the gradients is set to 5.0.

Optimizer is implemented in `pytorch/seq2seq/train/fp_optimizers.py`, class `Fp32Optimizer`.

### Learning rate schedule

The model is trained with an exponential learning rate warmup for 200 steps and with step learning rate decay. Decay starts after 2/3 of the training steps, is applied a total of 4 times at regularly spaced intervals, and uses a decay factor of 0.5.

Learning rate scheduler is implemented in `pytorch/seq2seq/train/lr_scheduler.py`, class `WarmupMultiStepLR`.
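A minimal, self-contained sketch of this schedule is shown below; the function name, the default step counts and the exact shape of the warmup curve are illustrative assumptions, not copied from `WarmupMultiStepLR`.

```python
# Illustrative sketch of the warmup + step-decay schedule described above.
import math


def lr_at_step(step, base_lr=1e-3, warmup_steps=200,
               total_steps=30000, decay_steps=4, decay_factor=0.5):
    """Exponential warmup for `warmup_steps`, constant LR until 2/3 of
    training, then `decay_steps` decays at regularly spaced intervals."""
    remain_steps = int(2 / 3 * total_steps)          # decay starts here
    decay_interval = (total_steps - remain_steps) // decay_steps

    if step < warmup_steps:
        # exponential ramp from a small fraction of base_lr up to base_lr
        warmup_factor = math.exp(math.log(0.01) / warmup_steps)
        return base_lr * warmup_factor ** (warmup_steps - step)
    if step < remain_steps:
        return base_lr                               # constant phase
    num_decays = min((step - remain_steps) // decay_interval + 1, decay_steps)
    return base_lr * decay_factor ** num_decays


if __name__ == "__main__":
    for s in (0, 100, 200, 10000, 20000, 25000, 29999):
        print(s, lr_at_step(s))
```

In the training command above, `--warmup-steps`, `--remain-steps` and `--decay-interval` control the corresponding phase boundaries.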
# 4. Evaluation

### Quality metric

Uncased BLEU score on the newstest2014 en-de dataset. BLEU scores are reported by the [sacrebleu](https://pypi.org/project/sacrebleu/) package (version 1.2.10). Sacrebleu is executed with the following flags: `--score-only -lc --tokenize intl`.

### Quality target

Uncased BLEU score of 24.00.

### Evaluation frequency

Evaluation of the BLEU score is done after every epoch.

### Evaluation thoroughness

Evaluation uses all of `newstest2014.en` (3003 sentences).
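For reference, the reported metric should be reproducible outside the training loop with sacrebleu's Python API. The snippet below is a hedged sketch: the file names are placeholders, and the `corpus_bleu` signature is assumed from sacrebleu 1.2.x rather than taken from the test harness.

```python
# Illustrative sketch: uncased BLEU with international tokenization,
# matching the `--score-only -lc --tokenize intl` flags listed above.
# File names are placeholders.
import sacrebleu

with open("translations.detok.de") as hyp_f, open("newstest2014.de") as ref_f:
    hyps = [line.strip() for line in hyp_f]
    refs = [line.strip() for line in ref_f]

bleu = sacrebleu.corpus_bleu(hyps, [refs], lowercase=True, tokenize="intl")
print(f"{bleu.score:.2f}")
```

The command-line equivalent passes the reference file as a positional argument and reads the detokenized hypotheses from stdin, using the same flags.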