This example is based on [https://github.com/pytorch/examples/tree/master/word_language_model](https://github.com/pytorch/examples/tree/master/word_language_model).
It trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task.
By default, the training script uses the Wikitext-2 dataset, which is provided with the example.
The trained model can then be used by the `generate.py` script to generate new text.
`main.py` with the `--fp16` argument demonstrates mixed precision training with manual management of master parameters and loss scaling.
`main_fp16_optimizer.py` with `--fp16` demonstrates use of `apex.fp16_utils.FP16_Optimizer` to automatically manage master parameters and loss scaling.
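A minimal sketch of the `FP16_Optimizer` path follows (the toy model and data here are illustrative stand-ins, not the script's actual objects):

```python
import torch
from apex.fp16_utils import FP16_Optimizer

# Toy stand-ins for the script's real model/data (illustrative only).
model = torch.nn.Linear(16, 8).cuda().half()
data = torch.randn(4, 16).cuda().half()
targets = torch.randint(0, 8, (4,)).cuda()
criterion = torch.nn.CrossEntropyLoss()

# Wrap an ordinary optimizer. FP16_Optimizer keeps fp32 master copies of the
# fp16 parameters and applies the loss scale internally.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)

# One training step: backward() is called on the optimizer rather than on the
# loss, so the loss can be scaled before backprop and the gradients unscaled
# before the fp32 master-weight update.
optimizer.zero_grad()
loss = criterion(model(data), targets)
optimizer.backward(loss)
optimizer.step()
```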
These examples are intended as an illustration of the mixed precision recipe, not necessarily as a performance showcase. However, they do demonstrate certain best practices.
First, a default loss scale of 128.0 is used. In our testing, this improves converged test perplexity modestly with mixed precision, from around 93 with loss scale 1.0 to around 90 with loss scale 128.0.
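For reference, the manual recipe boils down to something like the following sketch (`model_params` and `master_params` are illustrative names for the fp16 parameters and their fp32 master copies; the surrounding training loop is assumed):

```python
loss_scale = 128.0

# Forward/backward in fp16, with the loss multiplied by the scale so that
# small gradient values don't flush to zero in fp16.
loss = criterion(model(data), targets)
(loss * loss_scale).backward()

# Copy gradients onto the fp32 master params, un-scaling along the way;
# the optimizer owns the master params, so the update happens in fp32.
for master, param in zip(master_params, model_params):
    if param.grad is not None:
        master.grad = param.grad.detach().float() / loss_scale
optimizer.step()

# Copy the updated fp32 master weights back into the fp16 model params.
for master, param in zip(master_params, model_params):
    param.data.copy_(master.data)
```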
Second, to enable Tensor Core use with `--fp16` and improve performance, dimensions that participate in GEMMs in the model are made multiples of 8. Specifically, these are
* dictionary length (ntokens in `main.py`),
* embedding size (`--emsize`),
* hidden size (`--nhid`), and
* batch size (`--batch_size`).
The dictionary length is a property of the dataset, and is not controlled by a command line argument. In `main.py`, `corpus = data.Corpus(args.data, pad_to_multiple_of=8)` and the `Corpus` constructor in
`data.py` ensure that the dictionary length is a multiple of 8.
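A minimal sketch of what that padding amounts to (the `Dictionary` interface with `add_word` and `idx2word` follows the upstream example; the dummy token names here are made up):

```python
def pad_vocab_to_multiple(dictionary, multiple=8):
    # Append dummy tokens until the vocabulary size is a multiple of
    # `multiple`, so the embedding and output-projection GEMM dimensions
    # stay Tensor Core friendly.
    while len(dictionary.idx2word) % multiple != 0:
        dictionary.add_word('<pad_{}>'.format(len(dictionary.idx2word)))
    return dictionary
```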
Also, for mixed precision performance, a good general rule is: the more work you give the GPU, the better. Bigger models and larger batch sizes supply the cores with more work and do a better job of saturating the device. A (very rough) way to check whether you're saturating the device is to run `nvidia-smi` from another terminal and see what fraction of device memory you're using. This will tell you how much leeway you have to increase the model or batch size.
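For example, to poll the utilization and memory readout once per second while training runs:

```bash
watch -n 1 nvidia-smi
```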
```bash
python main.py --cuda --epochs 6         # Train an LSTM on Wikitext-2 with CUDA
python main.py --cuda --epochs 6 --fp16  # Train an LSTM on Wikitext-2 with CUDA and mixed precision
python main.py --cuda --epochs 6 --tied  # Train a tied LSTM on Wikitext-2 with CUDA
python main.py --cuda --tied             # Train a tied LSTM on Wikitext-2 with CUDA for the default 40 epochs
python generate.py                       # Generate samples from the trained LSTM model
```
The model uses the `nn.RNN` module (and its sister modules `nn.GRU` and `nn.LSTM`)
which will automatically use the cuDNN backend if run on CUDA with cuDNN installed.
During training, if a keyboard interrupt (Ctrl-C) is received,
training is stopped and the current model is evaluated against the test dataset.
## Usage for `main.py` and `main_fp16_optimizer.py`
## Option 1: Create a new container with Apex
**Dockerfile** installs the latest Apex on top of an existing image. Run
```
docker build -t new_image_with_apex .
```
By default, **Dockerfile** uses NVIDIA's Pytorch container as the base image,
which requires an NVIDIA GPU Cloud (NGC) account. If you don't have an NGC account, you can sign up for free by following the instructions [here](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html#generating-api-key).
Alternatively, you can supply your own base image via the `BASE_IMAGE` build-arg.
Any `BASE_IMAGE` you supply must have Pytorch and Cuda installed. For example, any `-devel` image for Pytorch 1.0 and later from the [official Pytorch Dockerhub](https://hub.docker.com/r/pytorch/pytorch) may be used.
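For instance (the specific tag here is illustrative; any `-devel` tag with Cuda should work):
```
docker build --build-arg BASE_IMAGE=pytorch/pytorch:1.0-cuda10.0-cudnn7-devel -t new_image_with_apex .
```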
If you want to rebuild your image and force the latest Apex to be cloned and installed, make any small change to the `SHA` variable in **Dockerfile**; Docker treats the modified line as a cache miss, so the clone-and-install steps that follow it are rerun.
**Warning:**
Currently, the non-`-devel` images on the Pytorch Dockerhub do not contain the Cuda compiler `nvcc`. Therefore, images whose names do not contain `-devel` are not eligible candidates for `BASE_IMAGE`.
## Option 2: Install Apex in a running container
Instead of building a new container, it is also a viable option to `git clone https://github.com/NVIDIA/apex.git` on bare metal, then mount the cloned repo into a running container:
```
docker run --runtime=nvidia -it --rm --ipc=host -v /bare/metal/apex:/apex/in/container <base image>
```
then go to `/apex/in/container` within the running container and run `python setup.py install [--cuda_ext] [--cpp_ext]`.