# Contributing to FAIR Sequence-to-Sequence Toolkit (PyTorch)
We want to make contributing to this project as easy and transparent as
possible.
## Pull Requests
We actively welcome your pull requests.
1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").
## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.
Complete your CLA here: <https://code.facebook.com/cla>
## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.
## Coding Style
We try to follow the PEP style guidelines and encourage you to as well.
## License
By contributing to FAIR Sequence-to-Sequence Toolkit, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.05-py3
FROM ${FROM_IMAGE_NAME}
WORKDIR /workspace
#RUN git clone https://github.com/NVIDIA/apex \
# && cd apex \
# && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
# Install Python dependencies
RUN pip install --no-cache-dir \
    sacrebleu \
    sentencepiece
RUN pip install jupyter
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get install -y -q cmake pkg-config protobuf-compiler libprotobuf-dev libgoogle-perftools-dev
RUN git clone https://github.com/google/sentencepiece.git /workspace/sentencepiece
RUN cd /workspace/sentencepiece \
&& git checkout d4dd947 \
&& mkdir build \
&& cd build \
&& cmake .. \
&& make -j 8 \
&& make install \
&& ldconfig -v
ENV PYTHONPATH=/workspace/translation/examples/translation/subword-nmt/
WORKDIR /workspace/translation
RUN git clone https://github.com/rsennrich/subword-nmt.git /workspace/translation/examples/translation/subword-nmt/
RUN git clone https://github.com/NVIDIA/cutlass.git && cd cutlass && git checkout ed2ed4d6 && cd ..
COPY . .
RUN pip install -e .
RUN pip install git+https://github.com/NVIDIA/dllogger@v0.1.0#egg=dllogger
BSD License
For fairseq software
Copyright (c) 2017-present, Facebook, Inc. All rights reserved.
Copyright (c) 2019-present, NVIDIA CORPORATION. All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name Facebook nor the names of its contributors may be used to
endorse or promote products derived from this software without specific
prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Transformer PyTorch
This repository includes software from https://github.com/facebookresearch/fairseq
licensed under the BSD License.
Additional Grant of Patent Rights Version 2
"Software" means the fairseq software distributed by Facebook, Inc.
Facebook, Inc. ("Facebook") hereby grants to each recipient of the Software
("you") a perpetual, worldwide, royalty-free, non-exclusive, irrevocable
(subject to the termination provision below) license under any Necessary
Claims, to make, have made, use, sell, offer to sell, import, and otherwise
transfer the Software. For avoidance of doubt, no license is granted under
Facebook’s rights in any patent claims that are infringed by (i) modifications
to the Software made by you or any third party or (ii) the Software in
combination with any software or other technology.
The license granted hereunder will terminate, automatically and without notice,
if you (or any of your subsidiaries, corporate affiliates or agents) initiate
directly or indirectly, or take a direct financial interest in, any Patent
Assertion: (i) against Facebook or any of its subsidiaries or corporate
affiliates, (ii) against any party if such Patent Assertion arises in whole or
in part from any software, technology, product or service of Facebook or any of
its subsidiaries or corporate affiliates, or (iii) against any party relating
to the Software. Notwithstanding the foregoing, if Facebook or any of its
subsidiaries or corporate affiliates files a lawsuit alleging patent
infringement against you in the first instance, and you respond by filing a
patent infringement counterclaim in that lawsuit against that party that is
unrelated to the Software, the license granted hereunder will not terminate
under section (i) of this paragraph due to such counterclaim.
A "Necessary Claim" is a claim of a patent owned by Facebook that is
necessarily infringed by the Software standing alone.
A "Patent Assertion" is any lawsuit or other action alleging direct, indirect,
or contributory infringement or inducement to infringe any patent, including a
cross-claim or counterclaim.
# Transformer For PyTorch
This repository provides a script and recipe to train the Transformer model to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
## Table Of Contents
* [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Enabling TF32](#enabling-tf32)
* [Glossary](#glossary)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Inference process](#inference-process)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
* [Training performance: NVIDIA DGX-2 (16x V100 32GB)](#training-performance-nvidia-dgx-2-16x-v100-32gb)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
* [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The Transformer is a Neural Machine Translation (NMT) model which uses an attention mechanism to boost training speed and overall accuracy. The Transformer model was introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) and improved in [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187).
This implementation is based on the optimized implementation in [Facebook's Fairseq NLP toolkit](https://github.com/pytorch/fairseq), built on top of PyTorch.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta, Turing and Ampere GPU architectures. Therefore, researchers can get results 6.5x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
The Transformer model uses a standard NMT encoder-decoder architecture. Unlike other NMT models, it uses no recurrent connections and operates on a fixed-size context window.
The encoder stack is made up of N identical layers. Each layer is composed of the following sublayers:
1. Self-attention layer
2. Feedforward network (which is 2 fully-connected layers)
Like the encoder stack, the decoder stack is made up of N identical layers. Each layer is composed of the sublayers:
1. Self-attention layer
2. Multi-headed attention layer combining encoder outputs with results from
the previous self-attention layer.
3. Feedforward network (2 fully-connected layers)
The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and previous decoder-outputted tokens as inputs.
The model also applies embeddings on the input and output tokens, and adds a constant positional encoding. The positional encoding adds information about the position of each token.
<p align="center">
<img width="50%" src="./transformer.png" />
<br>
Figure 1. The architecture of a Transformer model.
</p>
The complete description of the Transformer architecture can be found in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper.
### Default configuration
The Transformer uses a Byte Pair Encoding (BPE) tokenization scheme, with tokenization performed by the [Moses decoder](https://github.com/moses-smt/mosesdecoder). This is a lossy compression method (we drop information about white spaces). Tokenization is applied over the whole [WMT14](http://statmt.org/wmt14/translation-task.html#Download) en-de dataset, including the test set. The default vocabulary size is 33708, excluding all special tokens. The encoder and decoder use shared embeddings.
We use 6 blocks in both the encoder and decoder stacks. The self-attention layer computes its outputs according to the following formula: $`Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V`$. At each attention step, the model computes 16 different attention representations (which we will call attention heads) and concatenates them.
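As an illustration of the formula above, here is a minimal scaled dot-product attention sketch in plain PyTorch; the tensor shapes are illustrative (16 heads of dimension 64) and this is not the repository's actual attention module.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))
    return torch.matmul(F.softmax(scores, dim=-1), v)

# Illustrative shapes: batch of 2, 16 heads, sequence length 10, head dimension 64.
q = k = v = torch.randn(2, 16, 10, 64)
out = scaled_dot_product_attention(q, k, v)
# The 16 per-head outputs are concatenated back into the model dimension (16 * 64 = 1024).
heads_concatenated = out.transpose(1, 2).reshape(2, 10, 16 * 64)
```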
We trained the Transformer model using the Adam optimizer with betas `(0.9, 0.997)`, epsilon `1e-9` and a learning rate of `6e-4`. We used the inverse square root training schedule preceded by a linear warmup of 4000 steps.
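A small sketch of the schedule described above (linear warmup to the peak learning rate, then inverse square root decay); this follows the description in the text and may differ from the fairseq scheduler in minor details.

```python
def inverse_sqrt_lr(step, peak_lr=6e-4, warmup_steps=4000, init_lr=0.0):
    """Linear warmup to peak_lr, then decay proportionally to 1/sqrt(step)."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    return peak_lr * (warmup_steps ** 0.5) * (step ** -0.5)

# e.g. inverse_sqrt_lr(4000) == 6e-4 (end of warmup)
#      inverse_sqrt_lr(16000) == 3e-4 (half the peak after 4x the warmup steps)
```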
The implementation allows training in mixed precision. We use dynamic loss scaling and a custom mixed precision optimizer. Distributed multi-GPU and multi-node training is implemented with the `torch.distributed` module using the NCCL backend.
For inference, we use beam search with a default beam size of 5. Model performance is evaluated with the BLEU4 metric. For clarity, we report the internal (legacy) BLEU implementation as well as the external [SacreBleu](https://github.com/mjpost/sacreBLEU) score.
### Feature support matrix
The following features are supported by this model.<br>
| Feature | Supported
|--------------------------|--------------------------
| Multi-GPU training with [Distributed Communication Package](https://pytorch.org/docs/stable/distributed.html) | Yes
| Nvidia APEX | Yes
| AMP | Yes
| TorchScript | Yes
#### Features
* Multi-GPU training with [Distributed Communication Package](https://pytorch.org/docs/stable/distributed.html): Our model uses the `torch.distributed` package to implement efficient multi-GPU training with NCCL.
To enable multi-GPU training with `torch.distributed`, you have to initialize your model identically in every process spawned by `torch.distributed.launch`. The distributed strategy is implemented with APEX's DistributedDataParallel.
For details, see the example sources in this repository or the [PyTorch distributed tutorial](https://pytorch.org/docs/stable/distributed.html); a minimal initialization sketch follows this feature list.
* Nvidia APEX: The purpose of APEX is to provide an easy and intuitive framework for distributed training and mixed precision training. For details, see the official [APEX repository](https://github.com/NVIDIA/apex).
* AMP: This implementation uses Apex's AMP to perform mixed precision training.
* TorchScript: Transformer can be converted to the TorchScript format, offering ease of deployment on platforms without Python dependencies. For more information, see the official [TorchScript](https://pytorch.org/docs/stable/jit.html) documentation.
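As an illustration of the initialization each process must perform, here is a minimal sketch assuming a launch via `torch.distributed.launch`; the `Linear` model is a placeholder, and the stock `torch.nn.parallel.DistributedDataParallel` wrapper is shown as a stand-in for the APEX DDP this repository actually uses.

```python
import os
import torch
import torch.distributed as dist

# torch.distributed.launch sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK
# for every process it spawns, so 'env://' initialization needs no extra arguments.
local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

# Every process must build the model identically before wrapping it for data parallelism.
model = torch.nn.Linear(1024, 1024).cuda()   # placeholder for the real Transformer
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```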
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
#### Enabling mixed precision
Mixed precision is enabled using the `--amp` option in the `train.py` script. The default optimization level is `O2`, but it can be overridden with the `--amp-level $LVL` option (for details, see the [amp documentation](https://nvidia.github.io/apex/amp.html)). The forward and backward passes are computed in FP16, with the exception of the loss function, which is computed in FP32. The default optimization level keeps an FP32 copy of the model in order to perform accurate weight updates; after each update, the FP32 weights are copied back to the FP16 model. We use dynamic loss scaling with an initial scale of 2^7, increasing it by a factor of 2 every 2000 successful iterations. Overflow is checked after reducing gradients from all of the workers; if we encounter infs or NaNs, the whole batch is dropped.
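The following is a minimal, hedged sketch of the Apex AMP pattern described above; the model, optimizer and input are placeholders, and the dynamic loss-scaling schedule shown is Apex's default rather than the custom one used by `train.py`.

```python
import torch
from apex import amp

# Placeholder model/optimizer; the real ones are built by train.py.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=6e-4, betas=(0.9, 0.997), eps=1e-9)

# opt_level O2 keeps FP32 master weights and casts the model to FP16;
# loss_scale='dynamic' lowers the scale whenever inf/NaN gradients appear.
model, optimizer = amp.initialize(model, optimizer, opt_level='O2', loss_scale='dynamic')

inputs = torch.randn(8, 1024).cuda()
loss = model(inputs).float().sum()           # loss reduced in FP32
optimizer.zero_grad()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()                   # backward runs on the scaled loss
optimizer.step()
```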
#### Enabling TF32
TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
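If you want to verify or toggle this behavior, recent PyTorch releases expose the following flags (shown here as a small illustrative snippet, not part of the training scripts):

```python
import torch

# TF32 is used for matmuls and cuDNN convolutions on Ampere GPUs when these flags are True.
# Set them to False to force plain FP32 math for a comparison run.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```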
### Glossary
Attention layer - Layer that computes which elements of the input sequence, or of its hidden representation, contribute the most to the currently considered output element.
Beam search - A heuristic search algorithm which, at each prediction step, keeps the N most probable outputs as a base for further prediction (a toy sketch follows this glossary).
BPE - Byte Pair Encoding, a compression algorithm that finds the most common pair of symbols in the data and replaces them with a new symbol absent from the data.
EOS - End of a sentence.
Self attention layer - Attention layer that computes a hidden representation of the input using the same tensor as query, key and value.
Token - A string that is representable within the model. We also refer to a token's position in the dictionary as a token. There are special non-string tokens: alphabet tokens (all characters in the dataset), the EOS token and the PAD token.
Tokenizer - Object that converts raw strings to sequences of tokens.
Vocabulary embedding - Layer that projects one-hot token representations to a high-dimensional space which preserves some information about correlations between tokens.
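For illustration of the beam search entry above, here is a toy, self-contained sketch over an arbitrary next-token distribution function; it is not the fairseq sequence generator used by this repository.

```python
import math

def beam_search(step_logprobs, beam_size=5, max_len=20, bos=0, eos=2):
    """Toy beam search: step_logprobs(prefix) returns {token: log-probability}."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:
                candidates.append((prefix, score))   # finished hypotheses are carried over
                continue
            for tok, logp in step_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + logp))
        # keep only the N highest-scoring hypotheses as the base for the next step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(prefix[-1] == eos for prefix, _ in beams):
            break
    return beams

# toy next-token distribution that slightly prefers ending the sequence
print(beam_search(lambda prefix: {2: math.log(0.6), 3: math.log(0.4)}, beam_size=2, max_len=5))
```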
## Setup
The following section lists the requirements in order to start training the Transformer model.
### Requirements
This repository contains a Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [PyTorch 20.03-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
- GPU-based architecture:
- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
- Running [PyTorch NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the Transformer model on the [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
1. Clone the repository
```
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PyTorch/Translation/Transformer
```
2. Build and launch the Transformer PyTorch NGC container
```bash
docker build . -t your.repository:transformer
nvidia-docker run -it --rm --ipc=host your.repository:transformer bash
```
If you already have preprocessed data, use:
```bash
nvidia-docker run -it --rm --ipc=host -v <path to your preprocessed data>:/data/wmt14_en_de_joined_dict your.repository:transformer bash
```
If you already have data downloaded, but it has not yet been preprocessed, use:
```bash
nvidia-docker run -it --rm --ipc=host -v <path to your unprocessed data>:/workspace/translation/examples/translation/orig your.repository:transformer bash
```
3. Download and preprocess the WMT14 English-German dataset.
```bash
scripts/run_preprocessing.sh
```
After running this command, data will be downloaded to the `/workspace/translation/examples/translation/orig` directory, then processed and placed in the `/data/wmt14_en_de_joined_dict` directory.
4. Start training
```bash
python -m torch.distributed.launch --nproc_per_node 8 /workspace/translation/train.py /data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006 \
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 5120 \
--seed 1 \
--fuse-layer-norm \
--amp \
--amp-level O2 \
--save-dir /workspace/checkpoints \
--distributed-init-method env://
```
The script saves checkpoints every epoch to the directory specified in the `--save-dir` option. In addition, the best performing checkpoint (in terms of loss) and the latest checkpoints are saved separately.
**WARNING**: If you don't have access to sufficient disk space, use the `--save-interval $N` option. Each checkpoint is ~3.4 GB. For example, it takes the Transformer model about 30 epochs for the validation loss to plateau. The default behavior is to save the last checkpoint, the best checkpoint and a checkpoint for every epoch, which means (30+1+1)*3.4 GB = 108.8 GB of disk space used. Specifying `--save-interval 10` reduces this to (30/10+1+1)*3.4 GB = 17 GB.
5. Start interactive inference
```bash
python inference.py \
--buffer-size 5000 \
--path /path/to/your/checkpoint.pt \
--max-tokens 10240 \
--fuse-dropout-add \
--remove-bpe \
--bpe-codes /path/to/bpe_code_file \
--fp16
```
where:
* `--path` is the location of the checkpoint file.
* `--bpe-codes` is the location of the `code` file. If the default training command mentioned above is used, this file can be found in the preprocessed data directory (i.e., `/data/wmt14_en_de_joined_dict`).
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Scripts and sample code
The `preprocess.py` script performs binarization of the dataset obtained and tokenized by the `examples/translation/prepare-wmt14en2de.sh` script. The `train.py` script contains the training loop as well as the statistics-gathering code. The steps performed in a single training step can be found in `fairseq/ddp_trainer.py`. The model definition is placed in `fairseq/models/transformer.py`. Model-specific modules, including multi-headed attention and sinusoidal positional embedding, are inside the `fairseq/modules/` directory. Finally, the data wrappers are placed inside the `fairseq/data/` directory.
### Parameters
In this section, we give a user-friendly description of the most common options used in the `train.py` script.
### Command-line options
`--arch` - select the specific configuration for the model. You can choose among various predefined hyperparameter sets, such as the number of encoder/decoder blocks, the dropout value or the size of the hidden state representation.<br/>
`--share-all-embeddings` - use the same set of weights for the encoder and decoder word embeddings.<br/>
`--optimizer` - choose optimization algorithm.<br/>
`--clip-norm` - set a value that gradients will be clipped to.<br/>
`--lr-scheduler` - choose learning rate change strategy.<br/>
`--warmup-init-lr` - start linear warmup with a learning rate at this value.<br/>
`--warmup-updates` - set number of optimization steps after which linear warmup will end.<br/>
`--lr` - set learning rate.<br/>
`--min-lr` - prevent the learning rate from falling below this value, regardless of the learning rate schedule.<br/>
`--dropout` - set dropout value.<br/>
`--weight-decay` - set weight decay value.<br/>
`--criterion` - select loss function.<br/>
`--label-smoothing` - distribute part of the one-hot label mass over all entries of the dictionary. The value set by this option is subtracted from the one-hot label (a short worked example follows this option list).<br/>
`--max-tokens` - set batch size in terms of tokens.<br/>
`--max-sentences` - set the batch size in terms of sentences. Note that the actual batch size in tokens will then vary much more than with the `--max-tokens` option.<br/>
`--seed` - set random seed for NumPy and PyTorch RNGs.<br/>
`--max-epochs` - set the maximum number of epochs.<br/>
`--online-eval` - perform inference on the test set and compute the BLEU score after every epoch.<br/>
`--target-bleu` - works like `--online-eval` and additionally sets a BLEU score threshold; training stops once it is attained.<br/>
`--amp` - use mixed precision.<br/>
`--save-dir` - set directory for saving checkpoints.<br/>
`--distributed-init-method` - method for initializing the torch.distributed package. You can either provide addresses with the `tcp` method or use the environment variable initialization with the `env` method.<br/>
`--update-freq` - use gradient accumulation. Sets the number of training steps across which gradients are accumulated.<br/>
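As a worked example of the `--label-smoothing` option above (illustrative only, with a hypothetical 4-token vocabulary):

```python
import torch

eps, vocab = 0.1, 4
one_hot = torch.tensor([0., 1., 0., 0.])
# each entry receives eps/vocab, and the true class keeps the remaining 1 - eps
smoothed = one_hot * (1.0 - eps) + eps / vocab
print(smoothed)   # tensor([0.0250, 0.9250, 0.0250, 0.0250]), still summing to 1
```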
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
```
python train.py --help
```
The following (partial) output is printed when running the sample:
```
usage: train.py [-h] [--no-progress-bar] [--log-interval N]
[--log-format {json,none,simple,tqdm}] [--seed N] [--fp16]
[--task TASK] [--skip-invalid-size-inputs-valid-test] [--max-tokens N]
[--max-sentences N] [--sentencepiece] [--train-subset SPLIT]
[--valid-subset SPLIT] [--max-sentences-valid N]
[--gen-subset SPLIT] [--num-shards N] [--shard-id ID]
[--distributed-world-size N]
[--distributed-rank DISTRIBUTED_RANK]
[--local_rank LOCAL_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
--arch ARCH [--criterion CRIT] [--max-epoch N]
[--max-update N] [--target-bleu TARGET] [--clip-norm NORM]
[--sentence-avg] [--update-freq N] [--optimizer OPT]
[--lr LR_1,LR_2,...,LR_N] [--momentum M] [--weight-decay WD]
[--lr-scheduler LR_SCHEDULER] [--lr-shrink LS] [--min-lr LR]
[--min-loss-scale D] [--enable-parallel-backward-allred-opt]
[--parallel-backward-allred-opt-threshold N]
[--enable-parallel-backward-allred-opt-correctness-check]
[--save-dir DIR] [--restore-file RESTORE_FILE]
[--save-interval N] [--save-interval-updates N]
[--keep-interval-updates N] [--no-save]
[--no-epoch-checkpoints] [--validate-interval N] [--path FILE]
[--remove-bpe [REMOVE_BPE]] [--cpu] [--quiet] [--beam N]
[--nbest N] [--max-len-a N] [--max-len-b N] [--min-len N]
[--no-early-stop] [--unnormalized] [--no-beamable-mm]
[--lenpen LENPEN] [--unkpen UNKPEN]
[--replace-unk [REPLACE_UNK]] [--score-reference]
[--prefix-size PS] [--sampling] [--sampling-topk PS]
[--sampling-temperature N] [--print-alignment]
[--model-overrides DICT] [--online-eval]
[--bpe-codes CODES] [--fuse-dropout-add] [--fuse-relu-dropout]
```
### Getting the data
The Transformer model was trained on the [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) dataset. A concatenation of the *commoncrawl*, *europarl* and *news-commentary* corpora is used as the training and validation dataset, and *newstest2014* is used as the test dataset.<br/>
This repository contains the `run_preprocessing.sh` script, which automatically downloads and preprocesses the training and test datasets. By default, data is stored in the `/data/wmt14_en_de_joined_dict` directory.<br/>
Our download script utilizes [Moses decoder](https://github.com/moses-smt/mosesdecoder) to perform tokenization of the dataset and [subword-nmt](https://github.com/rsennrich/subword-nmt) to segment text into subword units (BPE). By default, the script builds a shared vocabulary of 33708 tokens, which is consistent with [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187).
#### Dataset guidelines
The Transformer model works with a fixed-size vocabulary. Prior to training, we need to learn a data representation that allows us to store the entire dataset as a sequence of tokens. To achieve this, we use Byte Pair Encoding. This algorithm builds a vocabulary by iterating over a dataset, looking for the most frequent pair of symbols and replacing them with a new symbol not yet present in the dataset. After the desired number of merges has been performed (new symbols can also be merged together), it outputs a code file that is used as an input for the `Dictionary` class.
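As a toy illustration of the merge loop described above (the real vocabulary is learned by subword-nmt during preprocessing, not by this snippet):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Repeatedly replace the most frequent adjacent symbol pair with a new merged symbol."""
    corpus = [list(w) for w in words]            # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        corpus = [merge_pair(symbols, a, b) for symbols in corpus]
    return merges

def merge_pair(symbols, a, b):
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# learn_bpe(["lower", "lowest", "low"], num_merges=2) -> [('l', 'o'), ('lo', 'w')]
```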
BPE does not minimize the length of the encoded dataset; alternatively, [SentencePiece](https://github.com/google/sentencepiece/) can be used to tokenize the dataset with the unigram model, which tries to find an encoding that is close to the theoretical entropy limit.
Data is then sorted by length (in terms of tokens), and examples of similar length are batched together and padded if necessary.
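A minimal sketch of such length-based batching under a token budget follows; the function name and the simplified rules are illustrative, while the repository's actual implementation is the C++ `make_batches` extension included further down.

```python
def make_token_batches(lengths, max_tokens=5120, max_sentences=None):
    """Group sample indices, sorted by length, so that the padded batch size
    (num_sentences * longest_sample) stays within the token budget."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, batch, longest = [], [], 0
    for idx in order:
        longest = max(longest, lengths[idx])
        too_many_tokens = (len(batch) + 1) * longest > max_tokens
        too_many_sentences = max_sentences is not None and len(batch) == max_sentences
        if batch and (too_many_tokens or too_many_sentences):
            batches.append(batch)
            batch, longest = [], lengths[idx]
        batch.append(idx)
    if batch:
        batches.append(batch)
    return batches

# e.g. make_token_batches([3, 7, 7, 12, 30], max_tokens=20) -> [[0, 1], [2], [3], [4]]
```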
#### Multi-dataset
The model has also been tested on the [WMT14 en-fr](http://www.statmt.org/wmt14/translation-task.html) dataset, achieving a state-of-the-art accuracy of 41.4 BLEU.
### Training process
The default training configuration can be launched by running the `train.py` training script. By default, the script saves one checkpoint every epoch in addition to the latest and the best ones. The best checkpoint is considered the one with the lowest value of loss, not the one with the highest BLEU score. To override this behavior, use the `--save-interval $N` option to save epoch checkpoints every N epochs, or `--no-epoch-checkpoints` to disable them entirely (with this option the latest and the best checkpoints are still saved). Specify the save directory with the `--save-dir` option.<br/>
In order to run multi-GPU training, launch the training script with `python -m torch.distributed.launch --nproc_per_node $N` prepended, where N is the number of GPUs.
We have tested training on up to 16 GPUs on a single node.<br/>
After each training epoch, the script runs a loss validation on the validation split of the dataset and outputs the validation loss. By default, BLEU evaluation after each epoch is disabled; to enable it, use the `--online-eval` option, or use the `--target-bleu $TGT` option to make a BLEU score value the training stopping condition. The computed BLEU scores are case insensitive. BLEU is computed by the internal fairseq algorithm, whose implementation can be found in the `fairseq/bleu.py` script.<br/>
By default, the `train.py` script launches FP32 training without Tensor Cores. To use mixed precision with Tensor Cores, use the `--amp` option (see [Enabling mixed precision](#enabling-mixed-precision)).<br/>
To reach the BLEU score reported in [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187) research paper, we used mixed precision training with a batch size of 5120 per GPU and learning rate of 6e-4 on a DGX-1V system with 8 Tesla V100s 16G. If you use a different setup, we recommend you scale your hyperparameters by applying the following rules:
1. To use FP32, reduce the batch size to 2560 and set the `--update-freq 2` option.
2. To train on fewer GPUs, multiply `--update-freq` by the reciprocal of the scaling factor.
For example, when training in FP32 mode on 4 GPUs, use the `--update-freq=4` option.
### Inference process
Inference on raw input can be performed by piping the file to be translated into the `inference.py` script. It requires a pre-trained model checkpoint, a BPE codes file and a dictionary file (both are produced by the `run_preprocessing.sh` script and can be found in the dataset directory).<br/>
In order to run interactive inference, run the following command:
```
python inference.py --path /path/to/your/checkpoint.pt --fuse-dropout-add --remove-bpe --bpe-codes /path/to/code/file
```
The `--buffer-size` option enables batching of input sentences, up to the `--max-tokens` limit per batch.
To test a model checkpoint's accuracy on the WMT14 test set, run the following command:
```bash
sacrebleu -t wmt14/full -l en-de --echo src | python inference.py --buffer-size 5000 --path /path/to/your/checkpoint.pt --max-tokens 10240 --fuse-dropout-add --remove-bpe --bpe-codes /data/code --fp16 | sacrebleu -t wmt14/full -l en-de -lc
```
## Performance
The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
To benchmark the training performance on a specific batch size, run the `train.py` training script. Performance in tokens/s will be printed to standard output every N iterations, specified by the `--log-interval` option. Additionally, performance and loss values will be logged by [dllogger](https://github.com/NVIDIA/dllogger) to the file specified with the `--stat-file` option. Every line in the output file is a valid JSON record prepended with the `DLLL` prefix.
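As a small illustration of consuming that log format (assuming only the `DLLL`-prefixed JSON-lines layout described above; the key name in the usage comment is hypothetical):

```python
import json

def read_dllogger(path, key):
    """Yield the value of `key` from every DLLL-prefixed JSON line that contains it."""
    with open(path) as f:
        for line in f:
            if not line.startswith('DLLL'):
                continue
            record = json.loads(line[len('DLLL'):].strip())
            data = record.get('data', {})
            if key in data:
                yield data[key]

# e.g. list(read_dllogger('train_log.json', 'tokens/s'))  # hypothetical file and key names
```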
#### Inference performance benchmark
To benchmark the inference performance on a specific batch size, run the following command to start the benchmark:
```bash
for i in {1..10}; do sacrebleu -t wmt14/full -l en-de --echo src; done | python inference.py --buffer-size 5000 --path /path/to/your/checkpoint.pt --max-tokens 10240 --fuse-dropout-add --remove-bpe --bpe-codes /data/code --fp16 > /dev/null
```
Results will be printed to stderr.
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training accuracy results
Following the spirit of the paper [A Call for Clarity in Reporting BLEU Scores](https://arxiv.org/pdf/1804.08771.pdf), we decided to change the evaluation metric implemented in fairseq to the [SacreBleu](https://github.com/mjpost/sacreBLEU) score. The new metric has an almost linear relationship with the old one: a linear regression over nearly 2000 checkpoints shows that the SacreBleu score almost perfectly follows the formula newScore = 0.978 * oldScore - 0.05.
<p align="center">
<img src="./bleu_relationship.png" />
<br>
Figure 2. Linear relationship between old and new BLEU metric.
</p>
To take into account the variability of the results, we computed basic statistics that help us verify whether a model trains correctly. Evaluating nearly 2000 checkpoints from 20 runs, the best score we achieved is 28.09 BLEU (which corresponds to a 28.77 old score). The variance of the best performing model's score across those 20 runs is 0.011. Knowing that the max statistic is skewed toward higher values, we also ran studies that calculate a threshold beyond which validation loss is no longer correlated with the BLEU score.
Our hope, of course, is that the dev set's distribution is similar to the test set's distribution, and that when validation loss drops, the BLEU score rises. But due to the finiteness of the validation and test sets, we expect that there is a loss value beyond which performance on the two sets becomes decoupled. To find this point, we used the Pearson correlation coefficient as a metric. The results indicate that optimizing beyond a validation loss of 4.02 is no longer beneficial for the BLEU score. Further optimization does not cause overfitting, but the results become stochastic.
The mean BLEU score after reaching a validation loss of 4.02 is 27.38. We observe a variance of 0.08, which translates to nearly 0.3 BLEU average difference between the mean score and the obtained score.
<p align="center">
<img src="./decorrelation_threshold.png" />
<br>
Figure 3. Validation loss vs BLEU score. Plots are trimmed to certain validation loss threshold.
</p>
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
Our results were obtained by running the `run_DGXA100_AMP_8GPU.sh` and `run_DGXA100_TF32_8GPU.sh` training scripts in the pytorch-20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. We report average accuracy over 6 runs. We consider a model trained when it reaches minimal validation loss. Time to train includes only training time without validation; depending on the configuration and the frequency of validation, validation can take up to an additional minute per epoch.
| GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision)
|---------|---------------------|------------------|-----------------------------|-------------------------|----------------------------------|------------------------------------
| 8 | 10240 | 27.92 | 27.76 | 2.87 hours | 2.79 hours | x1.03
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the `run_DGX1_AMP_8GPU.sh` and `run_DGX1_FP32_8GPU.sh` training scripts in the pytorch-20.06-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. We report average accuracy over 6 runs. We consider a model trained when it reaches minimal validation loss. Time to train includes only training time without validation; depending on the configuration and the frequency of validation, validation can take up to an additional minute per epoch. Using mixed precision, we could fit a larger batch size in memory, further speeding up training.
| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision)
|---------|---------------------|------------------|-----------------------------|-------------------------|----------------------------------|------------------------------------
| 8 | 5120/2560 | 27.66 | 27.82 | 12 hours | 4.6 hours | x2.64
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
Our results were obtained by running the `run_DGXA100_AMP_8GPU.sh` and `run_DGXA100_TF32_8GPU.sh` training scripts in the pytorch-20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch.
| GPUs | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision
|--------|--------------------|----------------------|---------------------------------|-----------------------------------------------|------------------------|-----
| 8 | 10240 | 316913 | 582721 | x1.84 | 6.93 | 7.05
| 4 | 10240 | 161980 | 298741 | x1.84 | 3.54 | 3.62
| 1 | 10240 | 45755 | 82618 | x1.81 | 1 | 1
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training stability test
The following plot shows average validation loss curves for different configs. We can see that training with AMP O2 converges slightly slower than FP32 and TF32 training. To mitigate this, you can use the `--amp-level O1` option, at the cost of a roughly 20% performance drop compared to the default AMP setting.
<p align="center">
<img width="75%" height="75%" src="./average_valid_loss.png" />
<br>
Figure 4. Validation loss curves
</p>
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the `run_DGX1_AMP_8GPU.sh` and `run_DGX1_FP32_8GPU.sh` training scripts in the pytorch-20.06-py3 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch. Using mixed precision we could fit a larger batch size in the memory, further speeding up the training.
| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
|--------|--------------------|----------------------|---------------------------------|-----------------------------------------------|------------------------|-----
| 8 | 5120/2560 | 58742 | 223245 | x3.80 | 6.91 | 6.67
| 4 | 5120/2560 | 29674 | 115269 | x3.88 | 3.49 | 3.44
| 1 | 5120/2560 | 8498 | 33468 | x3.94 | 1 | 1
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training performance: NVIDIA DGX-2 (16x V100 32GB)
Our results were obtained by running the `run_DGX1_AMP_8GPU.sh` and `run_DGX1_FP32_8GPU.sh` training scripts, with the number of GPUs set to 16, in the pytorch-20.06-py3 NGC container on NVIDIA DGX-2 with (16x V100 32GB) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch. Using mixed precision, we could fit a larger batch size in memory, further speeding up training.
| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
|--------|--------------------|----------------------|---------------------------------|-----------------------------------------------|------------------------|-----
| 16 | 10240/5120 | 130867 | 510267 | x3.9 | 13.38 | 12.7
| 8 | 10240/5120 | 68829 | 269464 | x3.91 | 7.04 | 6.71
| 4 | 10240/5120 | 35168 | 141143 | x4.01 | 3.6 | 3.51
| 1 | 10240/5120 | 9779 | 40163 | x4.11 | 1 | 1
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Inference performance results
Our implementation of the Transformer uses a dynamic batching algorithm, which batches sentences together such that each batch contains no more than `N` tokens or no more than `M` sentences. In this benchmark, we use the first option in order to get the most stable results.
##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
Our results were obtained by running the `inference.py` inferencing benchmarking script in the pytorch-20.06-py3 NGC container on NVIDIA DGX A100 (1x A100 40GB) GPU.
FP16
| Batch size | Throughput Avg | Latency Avg | Latency 90% |Latency 95% |Latency 99% |
|------------|-----------------|-------------|-------------|------------|------------|
| 10240 | 9653 | 0.986s | 1.291s | 2.157s | 2.167s |
| 2560 | 5092 | 0.504s | 0.721s | 0.830s | 1.752s |
| 1024 | 2590 | 0.402s | 0.587s | 0.666s | 0.918s |
| 512 | 1357 | 0.380s | 0.561s | 0.633s | 0.788s |
| 256 | 721 | 0.347s | 0.513s | 0.576s | 0.698s |
TF32
| Batch size | Throughput Avg | Latency Avg | Latency 90% |Latency 95% |Latency 99% |
|------------|----------------|-------------|-------------|------------|------------|
| 10240 | 7755 | 1.227s | 1.592s | 2.512s | 2.525s |
| 2560 | 4624 | 0.555s | 0.786s | 0.872s | 1.886s |
| 1024 | 2394 | 0.435s | 0.627s | 0.702s | 0.881s |
| 512 | 1275 | 0.405s | 0.586s | 0.663s | 0.821s |
| 256 | 677 | 0.370s | 0.546s | 0.613s | 0.733s |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
Our results were obtained by running the `inference.py` inferencing benchmarking script in the pytorch-20.06-py3 NGC container on NVIDIA DGX-1 with (1x V100 16GB) GPU.
FP16
| Batch size | Throughput Avg | Latency Avg | Latency 90% |Latency 95% |Latency 99% |
|------------|----------------|-------------|-------------|------------|------------|
| 10240 | 7464 | 1.283s | 1.704s | 1.792s | 1.801s |
| 2560 | 3596 | 0.719s | 1.066s | 1.247s | 1.423s |
| 1024 | 1862 | 0.563s | 0.857s | 0.936s | 1.156s |
| 512 | 1003 | 0.518s | 0.782s | 0.873s | 1.103s |
| 256 | 520 | 0.484s | 0.723s | 0.813s | 0.992s |
FP32
| Batch size | Throughput Avg | Latency Avg | Latency 90% | Latency 95% | Latency 99% |
|------------|----------------|-------------|-------------|-------------|-------------|
| 10240 | 3782 | 2.531s | 3.091s | 3.121s | 3.136s |
| 2560 | 2910 | 0.888s | 1.221s | 1.252s | 1.432s |
| 1024 | 1516 | 0.692s | 1.001s | 1.126s | 1.297s |
| 512 | 941 | 0.551s | 0.812s | 0.893s | 1.133s |
| 256 | 502 | 0.501s | 0.734s | 0.822s | 0.978s |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
## Release notes
### Changelog
June 2020
- add TorchScript support
- Ampere support
March 2020
- remove language modeling from the repository
- one inference script for large chunks of data as well as for interactive demo
- change custom distributed strategy to APEX's DDP
- replace custom fp16 training with AMP
- major refactoring of the codebase
December 2019
- Change evaluation metric
August 2019
- add basic AMP support
July 2019
- Replace custom fused operators with jit functions
June 2019
- New README
March 2019
- Add mid-training [SacreBLEU](https://pypi.org/project/sacrebleu/1.2.10/) evaluation. Better handling of OOMs.
Initial commit, forked from [fairseq](https://github.com/pytorch/fairseq/commit/ac5fddfc691267285a84c81d39475411da5ed1c6)
## Known issues
- Using a batch size greater than 16k causes an indexing error in the `strided_batched_gemm` module
#!/usr/bin/env python3 -u
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the LICENSE file in
# the root directory of this source tree. An additional grant of patent rights
# can be found in the PATENTS file in the same directory.
#
#-------------------------------------------------------------------------
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import socket
import subprocess
from train import main as single_process_main
from fairseq import distributed_utils, options
def main(args):
    if args.distributed_init_method is None and args.distributed_port > 0:
        # We can determine the init method automatically for Slurm.
        node_list = os.environ.get('SLURM_JOB_NODELIST')
        if node_list is not None:
            try:
                hostnames = subprocess.check_output(['scontrol', 'show', 'hostnames', node_list])
                args.distributed_init_method = 'tcp://{host}:{port}'.format(
                    host=hostnames.split()[0].decode('utf-8'),
                    port=args.distributed_port)
                args.distributed_rank = int(os.environ.get('SLURM_PROCID'))
                args.device_id = int(os.environ.get('SLURM_LOCALID'))
            except subprocess.CalledProcessError as e:  # scontrol failed
                raise e
            except FileNotFoundError as e:  # Slurm is not installed
                pass

    if args.distributed_init_method is None:
        raise ValueError('--distributed-init-method or --distributed-port '
                         'must be specified for distributed training')

    args.distributed_rank = distributed_utils.distributed_init(args)
    args.device_id = int(os.environ.get('LOCAL_RANK', args.local_rank))
    print('| initialized host {} as rank {} and device id {}'.format(
        socket.gethostname(), args.distributed_rank, args.device_id))
    single_process_main(args)


if __name__ == '__main__':
    parser = options.get_training_parser()
    args = options.parse_args_and_arch(parser)
    main(args)
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the LICENSE file in
# the root directory of this source tree. An additional grant of patent rights
# can be found in the PATENTS file in the same directory.
from .multiprocessing_pdb import pdb
__all__ = ['pdb']
import torch.nn.functional as F
from torch.nn.modules.loss import _Loss
class CrossEntropyCriterion(_Loss):

    def __init__(self, args):
        super().__init__()
        self.padding_idx = args.padding_idx

    def forward(self, norm_probs, target, reduce=True):
        """Compute the loss for the given sample."""
        lprobs = norm_probs.view(-1, norm_probs.size(-1))
        target = target.view(-1)
        loss = F.nll_loss(lprobs, target, size_average=False, ignore_index=self.padding_idx,
                          reduce=reduce)
        return loss


class LabelSmoothedCrossEntropyCriterion(_Loss):

    def __init__(self, args):
        super().__init__()
        self.eps = args.label_smoothing
        self.padding_idx = args.padding_idx

    def forward(self, norm_probs, target, reduce=True):
        """Compute the loss for the given sample."""
        target = target.view(-1, 1)
        lprobs = norm_probs.view(-1, norm_probs.size(-1))
        non_pad_mask = target.ne(self.padding_idx)
        nll_loss = -lprobs.gather(dim=-1, index=target)[non_pad_mask]
        smooth_loss = -lprobs.sum(dim=-1, keepdim=True)[non_pad_mask]
        if reduce:
            nll_loss = nll_loss.sum()
            smooth_loss = smooth_loss.sum()
        # Distribute eps uniformly over the vocabulary; keep (1 - eps) weight on the true-label NLL term.
        eps_i = self.eps / lprobs.size(-1)
        loss = (1. - self.eps) * nll_loss + eps_i * smooth_loss
        return loss


CRITERION_REGISTRY = {
    'label_smoothed_cross_entropy': LabelSmoothedCrossEntropyCriterion,
    'cross_entropy': CrossEntropyCriterion,
}
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the LICENSE file in
# the root directory of this source tree. An additional grant of patent rights
# can be found in the PATENTS file in the same directory.
#
#-------------------------------------------------------------------------
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .dictionary import Dictionary
from .indexed_dataset import IndexedDataset, IndexedInMemoryDataset, IndexedRawTextDataset # noqa: F401
from .language_pair_dataset import LanguagePairDataset, load_dataset_splits
from .data_utils import EpochBatchIterator
// Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <pybind11/stl.h>
#include <torch/extension.h>
namespace at { namespace native {

namespace {

bool is_batch_full(int64_t num_tokens, int64_t max_tokens, int64_t max_sentences, int64_t batch_length) {
    if (batch_length == 0) {
        return false;
    } else if (batch_length == max_sentences || num_tokens > max_tokens) {
        return true;
    } else {
        return false;
    }
}

}  // anonymous namespace

std::vector<std::vector<int64_t> > make_batches(py::array_t<int64_t> src_lengths, py::array_t<int64_t> tgt_lengths,
                                                py::array_t<int64_t> idx_list, int64_t max_tokens,
                                                int64_t max_sentences, uint64_t bsz_mult, int64_t max_len) {
    std::vector<std::vector<int64_t> > batches;
    auto src_l = src_lengths.unchecked<1>();
    auto tgt_l = tgt_lengths.unchecked<1>();
    auto idx_l = idx_list.unchecked<1>();
    AT_ASSERTM(src_l.shape(0) == tgt_l.shape(0), "tgt_list and src_list should have the same shape");
    AT_ASSERTM(idx_l.shape(0) == tgt_l.shape(0), "idx_list and tgt_list should have the same shape");
    ssize_t nelem = src_l.shape(0);
    int64_t sample_len = 0;
    std::vector<int64_t> sample_lens;
    std::vector<int64_t> batch;
    for (ssize_t i = 0; i < nelem; i++) {
        int64_t idx = idx_l(i);
        int64_t sample_num_tokens = std::max(src_l(idx), tgt_l(idx));
        if (sample_num_tokens > max_len) continue;
        sample_len = std::max(sample_len, sample_num_tokens);
        sample_lens.push_back(sample_num_tokens);
        int64_t num_tokens = (batch.size() + 1) * sample_len;
        if (is_batch_full(num_tokens, max_tokens, max_sentences, batch.size())) {
            // Emit a batch whose size is a multiple of bsz_mult when possible;
            // leftover samples carry over to the next batch.
            int64_t mode_len = std::max(batch.size() / bsz_mult * bsz_mult, batch.size() % bsz_mult);
            std::vector<int64_t> new_batch;
            new_batch.reserve(mode_len);
            std::copy(batch.begin() + mode_len, batch.end(), std::back_inserter(new_batch));
            batch.erase(batch.begin() + mode_len, batch.end());
            sample_lens.erase(sample_lens.begin(), sample_lens.begin() + mode_len);
            // sample_lens always contains at least one element
            sample_len = *std::max_element(sample_lens.begin(), sample_lens.end());
            batches.push_back(batch);
            batch = new_batch;
        }
        batch.push_back(idx);
    }
    if (batch.size() > 0) batches.push_back(batch);
    return batches;
}

}}  // namespace at::native

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("make_batches", &at::native::make_batches);
}
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the LICENSE file in
# the root directory of this source tree. An additional grant of patent rights
# can be found in the PATENTS file in the same directory.
#
#-------------------------------------------------------------------------
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import itertools
import os
import numpy as np
import torch
import fairseq.data.batch_C
import sys
from .dictionary import Dictionary
def infer_language_pair(path):
    """Infer language pair from filename: <split>.<lang1>-<lang2>.(...).idx"""
    src, dst = None, None
    for filename in os.listdir(path):
        parts = filename.split('.')
        if len(parts) >= 3 and len(parts[1].split('-')) == 2:
            return parts[1].split('-')
    return src, dst


def load_dictionaries(args):
    if args.source_lang is None or args.target_lang is None:
        args.source_lang, args.target_lang = infer_language_pair(args.data)
    if args.source_lang is None or args.target_lang is None:
        raise Exception('Could not infer language pair, please provide it explicitly')

    # load dictionaries
    src_dict = Dictionary.load(os.path.join(args.data, 'dict.{}.txt'.format(args.source_lang)))
    tgt_dict = Dictionary.load(os.path.join(args.data, 'dict.{}.txt'.format(args.target_lang)))
    assert src_dict.pad() == tgt_dict.pad()
    assert src_dict.eos() == tgt_dict.eos()
    assert src_dict.unk() == tgt_dict.unk()
    args.src_vocab_size = len(src_dict)
    args.tgt_vocab_size = len(tgt_dict)
    args.padding_idx = src_dict.pad()
    print('| [{}] dictionary: {} types'.format(args.source_lang, len(src_dict)))
    print('| [{}] dictionary: {} types'.format(args.target_lang, len(tgt_dict)))
    return src_dict, tgt_dict
class ShardedIterator(object):
"""A sharded wrapper around an iterable (padded to length)."""
def __init__(self, iterable, num_shards, shard_id, fill_value=None):
if shard_id < 0 or shard_id >= num_shards:
            raise ValueError('shard_id must be in the range [0, num_shards)')
self._sharded_len = len(iterable) // num_shards
if len(iterable) % num_shards > 0:
self._sharded_len += 1
self.itr = itertools.zip_longest(
range(self._sharded_len),
itertools.islice(iterable, shard_id, len(iterable), num_shards),
fillvalue=fill_value,
)
def __len__(self):
return self._sharded_len
def __iter__(self):
return self
def __next__(self):
return next(self.itr)[1]
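# Illustrative sketch, not part of the original source: sharding a 10-element
# iterable across 3 shards. Shards that come up short are padded with
# fill_value so every shard yields the same number of items, keeping
# distributed workers in lock-step.
def _example_sharded_iterator():
    data = list(range(10))
    shards = [
        list(ShardedIterator(data, num_shards=3, shard_id=i, fill_value=-1))
        for i in range(3)
    ]
    # shards == [[0, 3, 6, 9], [1, 4, 7, -1], [2, 5, 8, -1]]
    return shards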
class CountingIterator(object):
"""Wrapper around an iterable that maintains the iteration count."""
def __init__(self, iterable):
self.iterable = iterable
self.count = 0
self.itr = iter(self)
def __len__(self):
return len(self.iterable)
def __iter__(self):
for x in self.iterable:
self.count += 1
yield x
def __next__(self):
return next(self.itr)
def has_next(self):
return self.count < len(self)
def skip(self, num_to_skip):
next(itertools.islice(self.itr, num_to_skip, num_to_skip), None)
return self
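# Illustrative sketch, not part of the original source: the count tracks how
# many items have been consumed, which is what EpochBatchIterator below relies
# on to fast-forward an epoch after restoring from a checkpoint.
def _example_counting_iterator():
    itr = CountingIterator(range(5))
    first = next(itr)                     # first == 0, itr.count == 1
    itr.skip(2)                           # silently consumes items 1 and 2
    assert itr.count == 3 and itr.has_next()
    rest = [next(itr) for _ in range(2)]  # [3, 4]; itr.count == 5
    assert not itr.has_next()
    return first, rest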
def collate_tokens(values, pad_idx, eos_idx, left_pad, move_eos_to_beginning=False, pad_sequence=1):
"""Convert a list of 1d tensors into a padded 2d tensor."""
#size = max(v.size(0) for v in values)
orig_size = max(v.size(0) for v in values)
size = 0
if pad_sequence > 1:
size = orig_size // pad_sequence * pad_sequence
if orig_size % pad_sequence > 0:
size += pad_sequence
else:
size = orig_size
res = values[0].new(len(values), size).fill_(pad_idx)
def copy_tensor(src, dst):
assert dst.numel() == src.numel()
if move_eos_to_beginning:
assert src[-1] == eos_idx
dst[0] = eos_idx
dst[1:] = src[:-1]
else:
dst.copy_(src)
for i, v in enumerate(values):
copy_tensor(v, res[i][size - len(v):] if left_pad else res[i][:len(v)])
return res
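# Illustrative sketch, not part of the original source: padding three
# variable-length sequences into a single 2D batch. With pad_sequence=8 the
# padded width is rounded up to the next multiple of 8 (e.g. for
# hardware-friendly shapes); left_pad controls whether padding is prepended or
# appended. The token values below are arbitrary.
def _example_collate_tokens():
    pad_idx, eos_idx = 1, 2
    values = [torch.LongTensor([5, 6, 7, eos_idx]),
              torch.LongTensor([8, 9, eos_idx]),
              torch.LongTensor([4, eos_idx])]
    batch = collate_tokens(values, pad_idx, eos_idx,
                           left_pad=False, pad_sequence=8)
    # batch.shape == (3, 8); first row is [5, 6, 7, 2, 1, 1, 1, 1]
    return batch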
def collate(samples, pad_idx, eos_idx, left_pad_source=True, left_pad_target=False, pad_sequence=1):
if len(samples) == 0:
return {}
def merge(key, left_pad, move_eos_to_beginning=False):
return collate_tokens(
[s[key] for s in samples],
pad_idx, eos_idx, left_pad, move_eos_to_beginning,
pad_sequence,
)
id = torch.LongTensor([s['id'] for s in samples])
src_tokens = merge('source', left_pad=left_pad_source)
# sort by descending source length
src_lengths = torch.LongTensor([s['source'].numel() for s in samples])
src_lengths, sort_order = src_lengths.sort(descending=True)
id = id.index_select(0, sort_order)
src_tokens = src_tokens.index_select(0, sort_order)
prev_output_tokens = None
target = None
if samples[0].get('target', None) is not None:
target = merge('target', left_pad=left_pad_target)
# we create a shifted version of targets for feeding the
# previous output token(s) into the next decoder step
prev_output_tokens = merge(
'target',
left_pad=left_pad_target,
move_eos_to_beginning=True,
)
prev_output_tokens = prev_output_tokens.index_select(0, sort_order)
target = target.index_select(0, sort_order)
ntokens = sum(len(s['target']) for s in samples)
else:
ntokens = sum(len(s['source']) for s in samples)
return {
'id': id,
'ntokens': ntokens,
'net_input': {
'src_tokens': src_tokens,
'src_lengths': src_lengths,
'prev_output_tokens': prev_output_tokens,
},
'target': target,
}
def get_dummy_batch(num_tokens, src_dict, tgt_dict, src_len=128, tgt_len=128,
left_pad_source=True, left_pad_target=False, pad_sequence=1):
bsz = num_tokens // max(src_len, tgt_len)
dummy_samples = [
{
'id': i,
'source': src_dict.dummy_sentence(src_len),
'target': tgt_dict.dummy_sentence(tgt_len) if tgt_dict is not None else None,
}
for i in range(bsz)
]
return collate(
dummy_samples, pad_idx=src_dict.pad(), eos_idx=src_dict.eos(),
left_pad_source=left_pad_source, left_pad_target=left_pad_target,
pad_sequence=pad_sequence,
)
class EpochBatchIterator(object):
"""Iterate over a FairseqDataset and yield batches bucketed by size.
Batches may contain sequences of different lengths. This iterator can be
reused across multiple epochs with the next_epoch_itr() method.
Args:
dataset: a FairseqDataset
max_tokens: max number of tokens in each batch
max_sentences: max number of sentences in each batch
max_positions: max sentence length supported by the model
required_batch_size_multiple: require batch size to be a multiple of N
seed: seed for random number generator for reproducibility
num_shards: shard the data iterator into N shards
shard_id: which shard of the data iterator to return
"""
def __init__(
self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
required_batch_size_multiple=1, seed=1,
num_shards=1, shard_id=0, epoch=0
):
self.dataset = dataset
self.max_tokens = max_tokens if max_tokens is not None else float('Inf')
self.max_sentences = max_sentences if max_sentences is not None else float('Inf')
self.max_positions = max_positions
self.bsz_mult = required_batch_size_multiple
self.seed = seed
self.num_shards = num_shards
self.shard_id = shard_id
self.epoch = epoch
self._cur_epoch_itr = None
self._next_epoch_itr = None
with numpy_seed(self.seed):
indices = self.dataset.ordered_indices(self.seed, self.epoch)
        # the batching extension needs integer values rather than float('Inf')
max_sentences = max_sentences if max_sentences is not None else sys.maxsize
max_positions_num = 1024
max_tokens = max_tokens if max_tokens is not None else sys.maxsize
        # workaround: None cannot be passed to the extension, so fall back to src_sizes
tgt_sizes = self.dataset.tgt_sizes if self.dataset.tgt_sizes is not None else self.dataset.src_sizes
batches = fairseq.data.batch_C.make_batches(
self.dataset.src_sizes, tgt_sizes, indices, max_tokens,
max_sentences, self.bsz_mult, max_positions_num)
self.frozen_batches = tuple(batches)
def __len__(self):
return len(self.frozen_batches)
def next_epoch_itr(self, shuffle=True):
"""Shuffle batches and return a new iterator over the dataset."""
if self._next_epoch_itr is not None:
self._cur_epoch_itr = self._next_epoch_itr
self._next_epoch_itr = None
else:
self.epoch += 1
self._cur_epoch_itr = self._get_iterator_for_epoch(self.epoch, shuffle)
return self._cur_epoch_itr
def end_of_epoch(self):
return not self._cur_epoch_itr.has_next()
@property
def iterations_in_epoch(self):
if self._cur_epoch_itr is not None:
return self._cur_epoch_itr.count
elif self._next_epoch_itr is not None:
return self._next_epoch_itr.count
return 0
def state_dict(self):
return {
'epoch': self.epoch,
'iterations_in_epoch': self.iterations_in_epoch,
}
def load_state_dict(self, state_dict):
self.epoch = state_dict['epoch']
itr_pos = state_dict.get('iterations_in_epoch', 0)
if itr_pos > 0:
# fast-forward epoch iterator
itr = self._get_iterator_for_epoch(self.epoch, state_dict.get('shuffle', True))
if itr_pos < len(itr):
self._next_epoch_itr = itr.skip(itr_pos)
def _get_iterator_for_epoch(self, epoch, shuffle):
if shuffle:
# set seed based on the seed and epoch number so that we get
# reproducible results when resuming from checkpoints
with numpy_seed(self.seed + epoch):
batches = list(self.frozen_batches) # copy
np.random.shuffle(batches)
else:
batches = self.frozen_batches
return CountingIterator(torch.utils.data.DataLoader(
self.dataset,
collate_fn=self.dataset.collater,
num_workers=1,
batch_sampler=ShardedIterator(batches, self.num_shards, self.shard_id, fill_value=[]),
))
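# Illustrative sketch, not part of the original source, of the intended epoch
# loop. It assumes `dataset` is a LanguagePairDataset-like object exposing
# ordered_indices(), src_sizes/tgt_sizes and collater(), and that the compiled
# fairseq.data.batch_C extension is available. Hyperparameter values are
# arbitrary.
def _example_epoch_loop(dataset, num_epochs=2):
    epoch_itr = EpochBatchIterator(dataset, max_tokens=4096,
                                   required_batch_size_multiple=8)
    state = epoch_itr.state_dict()
    for _ in range(num_epochs):
        for batch in epoch_itr.next_epoch_itr(shuffle=True):
            pass                          # a training step would consume `batch` here
        state = epoch_itr.state_dict()    # e.g. saved alongside model weights
    return state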
@contextlib.contextmanager
def numpy_seed(seed):
"""Context manager which seeds the NumPy PRNG with the specified seed and
restores the state afterward"""
if seed is None:
yield
return
state = np.random.get_state()
np.random.seed(seed)
try:
yield
finally:
np.random.set_state(state)
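# Illustrative sketch, not part of the original source: the same seed yields
# the same permutation, and the global NumPy RNG state is restored on exit, so
# per-epoch shuffling stays reproducible without perturbing other consumers of
# np.random.
def _example_numpy_seed():
    with numpy_seed(1234):
        a = np.random.permutation(5)
    with numpy_seed(1234):
        b = np.random.permutation(5)
    assert (a == b).all()
    return a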
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the LICENSE file in
# the root directory of this source tree. An additional grant of patent rights
# can be found in the PATENTS file in the same directory.
import torch.utils.data
class FairseqDataset(torch.utils.data.Dataset):
"""A dataset that provides helpers for batching."""
def __getitem__(self, index):
raise NotImplementedError
def __len__(self):
raise NotImplementedError
def collater(self, samples):
"""Merge a list of samples to form a mini-batch."""
raise NotImplementedError
def num_tokens(self, index):
"""Return an example's length (number of tokens), used for batching."""
raise NotImplementedError
def ordered_indices(self, seed=None, epoch=0):
"""Ordered indices for batching."""
raise NotImplementedError
def valid_size(self, index, max_positions):
"""Check if an example's size is valid according to max_positions."""
raise NotImplementedError
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the LICENSE file in
# the root directory of this source tree. An additional grant of patent rights
# can be found in the PATENTS file in the same directory.
import os
import struct
import numpy as np
import torch
from fairseq.tokenizer import Tokenizer
def read_longs(f, n):
a = np.empty(n, dtype=np.int64)
f.readinto(a)
return a
def write_longs(f, a):
f.write(np.array(a, dtype=np.int64))
dtypes = {
1: np.uint8,
2: np.int8,
3: np.int16,
4: np.int32,
5: np.int64,
    6: np.float32,
7: np.double,
}
def code(dtype):
for k in dtypes.keys():
if dtypes[k] == dtype:
return k
def index_file_path(prefix_path):
return prefix_path + '.idx'
def data_file_path(prefix_path):
return prefix_path + '.bin'
class IndexedDataset(torch.utils.data.Dataset):
"""Loader for TorchNet IndexedDataset"""
def __init__(self, path, fix_lua_indexing=False):
super().__init__()
self.fix_lua_indexing = fix_lua_indexing
with open(index_file_path(path), 'rb') as f:
magic = f.read(8)
assert magic == b'TNTIDX\x00\x00'
version = f.read(8)
assert struct.unpack('<Q', version) == (1,)
code, self.element_size = struct.unpack('<QQ', f.read(16))
self.dtype = dtypes[code]
self.size, self.s = struct.unpack('<QQ', f.read(16))
self.dim_offsets = read_longs(f, self.size + 1)
self.data_offsets = read_longs(f, self.size + 1)
self.sizes = read_longs(f, self.s)
self.read_data(path)
def read_data(self, path):
self.data_file = open(data_file_path(path), 'rb', buffering=0)
def check_index(self, i):
if i < 0 or i >= self.size:
raise IndexError('index out of range')
def __del__(self):
self.data_file.close()
def __getitem__(self, i):
self.check_index(i)
tensor_size = self.sizes[self.dim_offsets[i]:self.dim_offsets[i + 1]]
a = np.empty(tensor_size, dtype=self.dtype)
self.data_file.seek(self.data_offsets[i] * self.element_size)
self.data_file.readinto(a)
item = torch.from_numpy(a).long()
if self.fix_lua_indexing:
item -= 1 # subtract 1 for 0-based indexing
return item
def __len__(self):
return self.size
@staticmethod
def exists(path):
return (
os.path.exists(index_file_path(path)) and
os.path.exists(data_file_path(path))
)
class IndexedInMemoryDataset(IndexedDataset):
"""Loader for TorchNet IndexedDataset, keeps all the data in memory"""
def read_data(self, path):
self.data_file = open(data_file_path(path), 'rb')
self.buffer = np.empty(self.data_offsets[-1], dtype=self.dtype)
self.data_file.readinto(self.buffer)
self.data_file.close()
if self.fix_lua_indexing:
self.buffer -= 1 # subtract 1 for 0-based indexing
def __del__(self):
pass
def __getitem__(self, i):
self.check_index(i)
tensor_size = self.sizes[self.dim_offsets[i]:self.dim_offsets[i + 1]]
a = np.empty(tensor_size, dtype=self.dtype)
np.copyto(a, self.buffer[self.data_offsets[i]:self.data_offsets[i + 1]])
return torch.from_numpy(a).long()
class IndexedRawTextDataset(IndexedDataset):
"""Takes a text file as input and binarizes it in memory at instantiation.
Original lines are also kept in memory"""
def __init__(self, path, dictionary, append_eos=True, reverse_order=False):
self.tokens_list = []
self.lines = []
self.sizes = []
self.append_eos = append_eos
self.reverse_order = reverse_order
self.read_data(path, dictionary)
self.size = len(self.tokens_list)
def read_data(self, path, dictionary):
with open(path, 'r') as f:
for line in f:
self.lines.append(line.strip('\n'))
tokens = Tokenizer.tokenize(
line, dictionary, add_if_not_exist=False,
append_eos=self.append_eos, reverse_order=self.reverse_order,
).long()
self.tokens_list.append(tokens)
self.sizes.append(len(tokens))
self.sizes = np.array(self.sizes)
def __getitem__(self, i):
self.check_index(i)
return self.tokens_list[i]
def get_original_text(self, i):
self.check_index(i)
return self.lines[i]
def __del__(self):
pass
def __len__(self):
return self.size
@staticmethod
def exists(path):
return os.path.exists(path)
class IndexedDatasetBuilder(object):
element_sizes = {
np.uint8: 1,
np.int8: 1,
np.int16: 2,
np.int32: 4,
np.int64: 8,
        np.float32: 4,
np.double: 8
}
def __init__(self, out_file, dtype=np.int32):
self.out_file = open(out_file, 'wb')
self.dtype = dtype
self.data_offsets = [0]
self.dim_offsets = [0]
self.sizes = []
self.element_size = self.element_sizes[self.dtype]
def add_item(self, tensor):
# +1 for Lua compatibility
bytes = self.out_file.write(np.array(tensor.numpy() + 1, dtype=self.dtype))
        self.data_offsets.append(self.data_offsets[-1] + bytes // self.element_size)
for s in tensor.size():
self.sizes.append(s)
self.dim_offsets.append(self.dim_offsets[-1] + len(tensor.size()))
def finalize(self, index_file):
self.out_file.close()
index = open(index_file, 'wb')
index.write(b'TNTIDX\x00\x00')
index.write(struct.pack('<Q', 1))
index.write(struct.pack('<QQ', code(self.dtype), self.element_size))
index.write(struct.pack('<QQ', len(self.data_offsets) - 1, len(self.sizes)))
write_longs(index, self.dim_offsets)
write_longs(index, self.data_offsets)
write_longs(index, self.sizes)
index.close()
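# Illustrative sketch, not part of the original source: writing two token
# sequences with IndexedDatasetBuilder and reading them back through
# IndexedInMemoryDataset (defined above). The '/tmp/toy' prefix is
# hypothetical. add_item() stores tokens with a +1 offset, which
# fix_lua_indexing=True undoes on load.
def _example_indexed_round_trip(prefix='/tmp/toy'):
    builder = IndexedDatasetBuilder(data_file_path(prefix), dtype=np.int32)
    builder.add_item(torch.IntTensor([4, 5, 6]))
    builder.add_item(torch.IntTensor([7, 8]))
    builder.finalize(index_file_path(prefix))
    ds = IndexedInMemoryDataset(prefix, fix_lua_indexing=True)
    assert ds[0].tolist() == [4, 5, 6] and ds[1].tolist() == [7, 8]
    return ds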
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the LICENSE file in
# the root directory of this source tree. An additional grant of patent rights
# can be found in the PATENTS file in the same directory.
#
#-------------------------------------------------------------------------
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from torch.utils.data import Dataset, ConcatDataset
from . import data_utils
import itertools
import os
import sys
from fairseq.data import IndexedInMemoryDataset, IndexedRawTextDataset
class LanguagePairDataset(Dataset):
"""A pair of torch.utils.data.Datasets."""
def __init__(
self, src, src_sizes, src_dict,
tgt=None, tgt_sizes=None, tgt_dict=None,
left_pad_source=True, left_pad_target=False,
max_source_positions=1024, max_target_positions=1024,
pad_sequence=1, shuffle=True,
):
if tgt_dict is not None:
assert src_dict.pad() == tgt_dict.pad()
assert src_dict.eos() == tgt_dict.eos()
assert src_dict.unk() == tgt_dict.unk()
self.src = src
self.tgt = tgt
self.src_sizes = np.array(src_sizes)
self.tgt_sizes = np.array(tgt_sizes) if tgt_sizes is not None else None
self.src_dict = src_dict
self.tgt_dict = tgt_dict
self.left_pad_source = left_pad_source
self.left_pad_target = left_pad_target
self.max_source_positions = max_source_positions
self.max_target_positions = max_target_positions
self.pad_sequence = pad_sequence
self.shuffle = shuffle
print("| Sentences are being padded to multiples of: {}".format(self.pad_sequence), file=sys.stderr)
def __getitem__(self, index):
return {
'id': index,
'source': self.src[index],
'target': self.tgt[index] if self.tgt is not None else None,
}
def __len__(self):
return len(self.src)
def collater(self, samples):
"""Merge a list of samples to form a mini-batch."""
return data_utils.collate(
samples, pad_idx=self.src_dict.pad(), eos_idx=self.src_dict.eos(),
left_pad_source=self.left_pad_source, left_pad_target=self.left_pad_target,
pad_sequence=self.pad_sequence,
)
def num_tokens(self, index):
"""Return an example's length (number of tokens), used for batching."""
orig_size = max(self.src_sizes[index], self.tgt_sizes[index] if self.tgt_sizes is not None else 0)
assert self.pad_sequence > 0, "Padding multiple has to be greater than 0"
size = 0
if self.pad_sequence > 1:
size = orig_size // self.pad_sequence * self.pad_sequence
if orig_size % self.pad_sequence > 0:
size += self.pad_sequence
else:
size = orig_size
return size
#return max(self.src_sizes[index], self.tgt_sizes[index] if self.tgt_sizes is not None else 0)
def ordered_indices(self, seed=None, epoch=1):
"""Ordered indices for batching."""
if self.shuffle:
indices = np.random.RandomState(seed + epoch).permutation(len(self))
else:
indices = np.arange(len(self))
if self.tgt_sizes is not None:
indices = indices[np.argsort(self.tgt_sizes[indices], kind='mergesort')]
return indices[np.argsort(self.src_sizes[indices], kind='mergesort')]
def valid_size(self, index, max_positions):
"""Check if an example's size is valid according to max_positions."""
max_source_positions, max_target_positions = self._get_max_positions(max_positions)
return (
self.src_sizes[index] <= max_source_positions and
(self.tgt_sizes is None or self.tgt_sizes[index] <= max_target_positions)
)
def _get_max_positions(self, max_positions):
if max_positions is None:
return self.max_source_positions, self.max_target_positions
assert len(max_positions) == 2
max_src_pos, max_tgt_pos = max_positions
return min(self.max_source_positions, max_src_pos), min(self.max_target_positions, max_tgt_pos)
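# Illustrative sketch, not part of the original source: what the double stable
# argsort in ordered_indices() produces on toy size arrays. Indices end up
# sorted primarily by source length, with target length as tie-breaker, so
# similarly sized sentence pairs land in the same batch.
def _example_ordered_indices():
    src_sizes = np.array([5, 3, 5, 2])
    tgt_sizes = np.array([7, 4, 6, 9])
    indices = np.arange(len(src_sizes))     # no shuffling in this sketch
    indices = indices[np.argsort(tgt_sizes[indices], kind='mergesort')]
    indices = indices[np.argsort(src_sizes[indices], kind='mergesort')]
    # indices == [3, 1, 2, 0]: src lengths [2, 3, 5, 5], the two length-5
    # entries ordered by target length [6, 7]
    return indices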
def load_dataset(args, datasets, split, src_dict, tgt_dict, combine=False):
"""Load a dataset split."""
def split_exists(split, src, tgt, lang):
filename = os.path.join(args.data, '{}.{}-{}.{}'.format(split, src, tgt, lang))
if args.raw_text and IndexedRawTextDataset.exists(filename):
return True
elif not args.raw_text and IndexedInMemoryDataset.exists(filename):
return True
return False
def indexed_dataset(path, dictionary):
if args.raw_text:
return IndexedRawTextDataset(path, dictionary)
elif IndexedInMemoryDataset.exists(path):
return IndexedInMemoryDataset(path, fix_lua_indexing=True)
return None
src_datasets = []
tgt_datasets = []
for k in itertools.count():
split_k = split + (str(k) if k > 0 else '')
# infer langcode
src, tgt = args.source_lang, args.target_lang
if split_exists(split_k, src, tgt, src):
prefix = os.path.join(args.data, '{}.{}-{}.'.format(split_k, src, tgt))
elif split_exists(split_k, tgt, src, src):
prefix = os.path.join(args.data, '{}.{}-{}.'.format(split_k, tgt, src))
else:
if k > 0:
break
else:
raise FileNotFoundError('Dataset not found: {} ({})'.format(split, args.data))
src_datasets.append(indexed_dataset(prefix + src, src_dict))
tgt_datasets.append(indexed_dataset(prefix + tgt, tgt_dict))
print('| {} {} {} examples'.format(args.data, split_k, len(src_datasets[-1])))
if not combine:
break
assert len(src_datasets) == len(tgt_datasets)
if len(src_datasets) == 1:
src_dataset, tgt_dataset = src_datasets[0], tgt_datasets[0]
src_sizes = src_dataset.sizes
tgt_sizes = tgt_dataset.sizes
else:
src_dataset = ConcatDataset(src_datasets)
tgt_dataset = ConcatDataset(tgt_datasets)
src_sizes = np.concatenate([ds.sizes for ds in src_datasets])
tgt_sizes = np.concatenate([ds.sizes for ds in tgt_datasets])
datasets[split] = LanguagePairDataset(
src_dataset, src_sizes, src_dict,
tgt_dataset, tgt_sizes, tgt_dict,
left_pad_source=args.left_pad_source,
left_pad_target=args.left_pad_target,
max_source_positions=args.max_source_positions,
max_target_positions=args.max_target_positions,
pad_sequence=args.pad_sequence,
)
def load_dataset_splits(args, splits, src_dict, tgt_dict):
datasets = {}
for split in splits:
if split == 'train':
load_dataset(args, datasets, split, src_dict, tgt_dict, combine=True)
else:
for k in itertools.count():
split_k = split + (str(k) if k > 0 else '')
try:
load_dataset(args, datasets, split_k, src_dict, tgt_dict, combine=False)
except FileNotFoundError as e:
if k > 0:
break
raise e
return datasets
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the LICENSE file in
# the root directory of this source tree. An additional grant of patent rights
# can be found in the PATENTS file in the same directory.
#
#-------------------------------------------------------------------------
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np
import torch
class TokenBlockDataset(torch.utils.data.Dataset):
"""Break a 1d tensor of tokens into blocks.
The blocks are fetched from the original tensor so no additional memory is allocated.
Args:
tokens: 1d tensor of tokens to break into blocks
sizes: sentence lengths (required for 'complete' and 'eos')
block_size: maximum block size (ignored in 'eos' break mode)
break_mode: Mode used for breaking tokens. Values can be one of:
- 'none': break tokens into equally sized blocks (up to block_size)
- 'complete': break tokens into blocks (up to block_size) such that
            blocks contain complete sentences, although block_size may be
exceeded if some sentences exceed block_size
- 'eos': each block contains one sentence (block_size is ignored)
include_targets: return next tokens as targets
"""
def __init__(self, tokens, sizes, block_size, break_mode=None, include_targets=False):
super().__init__()
self.tokens = tokens
self.total_size = len(tokens)
self.include_targets = include_targets
self.slice_indices = []
if break_mode is None or break_mode == 'none':
length = math.ceil(len(tokens) / block_size)
def block_at(i):
start = i * block_size
end = min(start + block_size, len(tokens))
return (start, end)
self.slice_indices = [block_at(i) for i in range(length)]
elif break_mode == 'complete':
assert sizes is not None and sum(sizes) == len(tokens), '{} != {}'.format(sum(sizes), len(tokens))
tok_idx = 0
sz_idx = 0
curr_size = 0
while sz_idx < len(sizes):
if curr_size + sizes[sz_idx] <= block_size or curr_size == 0:
curr_size += sizes[sz_idx]
sz_idx += 1
else:
self.slice_indices.append((tok_idx, tok_idx + curr_size))
tok_idx += curr_size
curr_size = 0
if curr_size > 0:
self.slice_indices.append((tok_idx, tok_idx + curr_size))
elif break_mode == 'eos':
assert sizes is not None and sum(sizes) == len(tokens), '{} != {}'.format(sum(sizes), len(tokens))
curr = 0
for sz in sizes:
                # skip samples with just 1 token (which would be just the eos token)
if sz > 1:
self.slice_indices.append((curr, curr + sz))
curr += sz
else:
raise ValueError('Invalid break_mode: ' + break_mode)
self.sizes = np.array([e - s for s, e in self.slice_indices])
def __getitem__(self, index):
s, e = self.slice_indices[index]
item = torch.LongTensor(self.tokens[s:e])
if self.include_targets:
            # target is the sentence; for the source, shift the block one token
            # to the right so it starts with the eos of the previous sentence
if s == 0:
source = np.concatenate([self.tokens[-1:], self.tokens[0:e - 1]])
else:
source = self.tokens[s - 1:e - 1]
return torch.LongTensor(source), item
return item
def __len__(self):
return len(self.slice_indices)
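# Illustrative sketch, not part of the original source: the three break modes
# applied to a toy stream of three sentences with lengths 3, 2 and 4, each
# terminated by an eos token (2 here).
def _example_token_blocks():
    tokens = np.array([10, 11, 2, 12, 2, 13, 14, 15, 2])
    sizes = [3, 2, 4]
    none_ds = TokenBlockDataset(tokens, sizes, block_size=4, break_mode='none')
    comp_ds = TokenBlockDataset(tokens, sizes, block_size=5, break_mode='complete')
    eos_ds = TokenBlockDataset(tokens, sizes, block_size=0, break_mode='eos')
    # none:     [(0, 4), (4, 8), (8, 9)]  -- fixed-size chunks of block_size
    # complete: [(0, 5), (5, 9)]          -- whole sentences packed per block
    # eos:      [(0, 3), (3, 5), (5, 9)]  -- one sentence per block
    return none_ds.slice_indices, comp_ds.slice_indices, eos_ds.slice_indices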