Commit c41d9f2b authored by Michael Carilli

README wiring in a reasonable state, Sphinx docstrings updated

parent 5f8c3183
@@ -2,15 +2,15 @@ fp16_optimizer.py contains `FP16_Optimizer`, a Python class designed to wrap an
 ### [FP16_Optimizer API documentation](https://nvidia.github.io/apex/fp16_utils.html#automatic-management-of-master-params-loss-scaling)
-[Simple examples with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/FP16_Optimizer_simple)
+### [Simple examples with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/FP16_Optimizer_simple)
-[Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
+### [Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
-[word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model)
+### [word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model)
 fp16_util.py contains a number of utilities to manually manage master parameters and loss scaling, if the user chooses.
 ### [Manual management documentation](https://nvidia.github.io/apex/fp16_utils.html#manual-master-parameter-management)
-In addition to `FP16_Optimizer` examples, the Imagenet and word_language_model directories contain examples that demonstrate manual management of master parameters and static loss scaling.
-These examples illustrate what sort of operations `FP16_Optimizer` is performing automatically.
+The [Imagenet with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/imagenet) and [word_language_model with FP16_Optimizer](https://github.com/NVIDIA/apex/tree/master/examples/word_language_model) directories also contain `main.py` files that demonstrate manual management of master parameters and static loss scaling. These examples illustrate what sort of operations `FP16_Optimizer` is performing automatically.
@@ -53,7 +53,7 @@ class FP16_Module(nn.Module):
 class FP16_Optimizer(object):
     """
     :class:`FP16_Optimizer` is designed to wrap an existing PyTorch optimizer,
-    and manage (dynamic) loss scaling and master weights in a manner transparent to the user.
+    and manage static or dynamic loss scaling and master weights in a manner transparent to the user.
     For standard use, only two lines must be changed: creating the :class:`FP16_Optimizer` instance,
     and changing the call to ``backward``.
...
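A minimal sketch of the two-line change this docstring describes, using the `apex.fp16_utils.FP16_Optimizer` API documented above (the toy model, data, and loss-scaling choice are illustrative, not the code from any bundled example):

```python
import torch
from apex.fp16_utils import FP16_Optimizer

# Toy FP16 model and optimizer; any existing model and training loop work the same way.
model = torch.nn.Linear(128, 10).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Change 1: wrap the existing optimizer. Dynamic loss scaling is shown here;
# a fixed value could be passed via static_loss_scale instead.
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

for _ in range(10):
    inp = torch.randn(32, 128).cuda().half()
    target = torch.randn(32, 10).cuda().float()
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inp).float(), target)
    # Change 2: call backward through the wrapper instead of loss.backward(),
    # so the loss can be scaled and the gradients unscaled transparently.
    optimizer.backward(loss)
    optimizer.step()
```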
@@ -43,13 +43,13 @@ class DistributedDataParallel(Module):
     When used with ``multiproc.py``, :class:`DistributedDataParallel`
     assigns 1 process to each of the available (visible) GPUs on the node.
     Parameters are broadcast across participating processes on initialization, and gradients are
-    allreduced and averaged over processes during ``backward()`.
+    allreduced and averaged over processes during ``backward()``.
-    :class:``DistributedDataParallel`` is optimized for use with NCCL. It achieves high performance by
+    :class:`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
     overlapping communication with computation during ``backward()`` and bucketing smaller gradient
     transfers to reduce the total number of transfers required.
-    :class:``DistributedDataParallel`` assumes that your script accepts the command line
+    :class:`DistributedDataParallel` assumes that your script accepts the command line
     arguments "rank" and "world-size." It also assumes that your script calls
     ``torch.cuda.set_device(args.rank)`` before creating the model.
@@ -60,8 +60,7 @@ class DistributedDataParallel(Module):
     Args:
         module: Network definition to be run in multi-gpu/distributed mode.
         message_size (Default = 1e7): Minimum number of elements in a communication bucket.
-        shared_param (Default = False): If your model uses shared parameters this must be True.
-            It will disable bucketing of parameters to avoid race conditions.
+        shared_param (Default = False): If your model uses shared parameters this must be True. It will disable bucketing of parameters to avoid race conditions.
     """
...
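A minimal sketch of constructing the module with the arguments described in this docstring. It assumes a distributed process group has already been initialized by the launching script, and the rank value and toy model are purely illustrative:

```python
import torch
from apex.parallel import DistributedDataParallel

# As the docstring notes, the script is expected to receive "rank" and
# "world-size" arguments; rank is hard-coded here for illustration.
rank = 0
torch.cuda.set_device(rank)  # set the device before creating the model, per the docstring

model = torch.nn.Linear(128, 10).cuda()

# message_size and shared_param are the Args documented above; the documented
# defaults are written out explicitly. A torch.distributed process group must
# already be initialized for the wrapper to communicate.
model = DistributedDataParallel(model, message_size=10000000, shared_param=False)
```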
@@ -13,7 +13,8 @@ help:
 	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
 docset: html
-	doc2dash --name $(SPHINXPROJ) --icon $(SOURCEDIR)/_static/img/nv-pytorch2.png --enable-js --online-redirect-url http://pytorch.org/docs/ --force $(BUILDDIR)/html/
+	doc2dash --name $(SPHINXPROJ) --enable-js --online-redirect-url http://pytorch.org/docs/ --force $(BUILDDIR)/html/
+	# doc2dash --name $(SPHINXPROJ) --icon $(SOURCEDIR)/_static/img/nv-pytorch2.png --enable-js --online-redirect-url http://pytorch.org/docs/ --force $(BUILDDIR)/html/
 	# Manually fix because Zeal doesn't deal well with `icon.png`-only at 2x resolution.
 	cp $(SPHINXPROJ).docset/icon.png $(SPHINXPROJ).docset/icon@2x.png
...
@@ -62,7 +62,7 @@ source_suffix = '.rst'
 master_doc = 'index'
 # General information about the project.
-project = 'APEx'
+project = 'Apex'
 copyright = '2018'
 author = 'Christian Sarofeen, Natalia Gimelshein, Michael Carilli, Raul Puri'
@@ -115,7 +115,7 @@ html_theme_options = {
     'logo_only': True,
 }
-html_logo = '_static/img/nv-pytorch2.png'
+# html_logo = '_static/img/nv-pytorch2.png'
 # Add any paths that contain custom static files (such as style sheets) here,
 # relative to this directory. They are copied after the builtin static files,
@@ -161,7 +161,7 @@ latex_elements = {
 # (source start file, target name, title,
 #  author, documentclass [howto, manual, or own class]).
 latex_documents = [
-    (master_doc, 'apex.tex', 'APEx Documentation',
+    (master_doc, 'apex.tex', 'Apex Documentation',
      'Torch Contributors', 'manual'),
 ]
@@ -171,7 +171,7 @@ latex_documents = [
 # One entry per manual page. List of tuples
 # (source start file, name, description, authors, manual section).
 man_pages = [
-    (master_doc, 'APEx', 'APEx Documentation',
+    (master_doc, 'Apex', 'Apex Documentation',
      [author], 1)
 ]
@@ -182,8 +182,8 @@ man_pages = [
 #  (source start file, target name, title, author,
 #   dir menu entry, description, category)
 texinfo_documents = [
-    (master_doc, 'APEx', 'APEx Documentation',
-     author, 'APEx', 'One line description of project.',
+    (master_doc, 'Apex', 'Apex Documentation',
+     author, 'Apex', 'One line description of project.',
      'Miscellaneous'),
 ]
...
@@ -10,12 +10,29 @@ presented by NVIDIA `on Parallel Forall`_ and in GTC 2018 Sessions
 `Training Neural Networks with Mixed Precision: Real Examples`_.
 For Pytorch users, Real Examples in particular is recommended.
+Full runnable Python scripts demonstrating ``apex.fp16_utils``
+can be found on the Github page:
+
+| `Simple FP16_Optimizer demos`_
+|
+| `Distributed Mixed Precision Training with imagenet`_
+|
+| `Mixed Precision Training with word_language_model`_
+|
+|
 .. _`on Parallel Forall`:
     https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
 .. _`Training Neural Networks with Mixed Precision: Theory and Practice`:
     http://on-demand.gputechconf.com/gtc/2018/video/S8923/
 .. _`Training Neural Networks with Mixed Precision: Real Examples`:
     http://on-demand.gputechconf.com/gtc/2018/video/S81012/
+.. _`Simple FP16_Optimizer demos`:
+    https://github.com/NVIDIA/apex/tree/master/examples/FP16_Optimizer_simple
+.. _`Distributed Mixed Precision Training with imagenet`:
+    https://github.com/NVIDIA/apex/tree/master/examples/imagenet
+.. _`Mixed Precision Training with word_language_model`:
+    https://github.com/NVIDIA/apex/tree/master/examples/word_language_model
 .. automodule:: apex.fp16_utils
 .. currentmodule:: apex.fp16_utils
...
@@ -11,17 +11,24 @@ Apex (A PyTorch Extension)
 This site contains the API documentation for Apex (https://github.com/nvidia/apex),
 a Pytorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training. Some of the code here will be included in upstream Pytorch eventually. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
-Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Installation can be done by running
+Installation requires CUDA 9 or later, PyTorch 0.4 or later, and Python 3. Install by running
 ::
-git clone https://www.github.com/nvidia/apex
-cd apex
-python setup.py install
+
+    git clone https://www.github.com/nvidia/apex
+    cd apex
+    python setup.py install
+
+.. toctree::
+   :maxdepth: 1
+   :caption: AMP: Automatic Mixed Precision
+
+   amp
 .. toctree::
    :maxdepth: 1
-   :caption: FP16/Mixed Precision Training
+   :caption: FP16/Mixed Precision Utilities
    fp16_utils
...
@@ -9,25 +9,28 @@
 See [the API documentation](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.step) for more details.
+<!---
+TODO: add checkpointing example showing deserialization on the correct device
 #### Checkpointing
 `FP16_Optimizer` also supports checkpointing with the same control flow as ordinary Pytorch optimizers.
 `save_load.py` shows an example. Test via `python save_load.py`.
 See [the API documentation](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.load_state_dict) for more details.
+-->
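One possible shape for the checkpointing example the TODO above asks for, as a hedged sketch: it assumes `model` and a wrapping `FP16_Optimizer` named `optimizer` already exist, and the filename and device-mapping choice are illustrative.

```python
import torch

# Save: same control flow as an ordinary Pytorch optimizer.
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()}, 'checkpoint.pt')

# Load: map the saved tensors onto the current device at deserialization time,
# then restore both the model and the FP16_Optimizer state.
device = torch.cuda.current_device()
checkpoint = torch.load('checkpoint.pt',
                        map_location=lambda storage, loc: storage.cuda(device))
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
```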
 #### Distributed
-**distributed_pytorch** shows an example using `FP16_Optimizer` with Pytorch DistributedDataParallel.
+**distributed_apex** shows an example using `FP16_Optimizer` with Apex DistributedDataParallel.
 The usage of `FP16_Optimizer` with distributed does not need to change from ordinary single-process
 usage. Test via
 ```bash
-cd distributed_pytorch
+cd distributed_apex
 bash run.sh
 ```
-**distributed_pytorch** shows an example using `FP16_Optimizer` with Apex DistributedDataParallel.
+**distributed_pytorch** shows an example using `FP16_Optimizer` with Pytorch DistributedDataParallel.
 Again, the usage of `FP16_Optimizer` with distributed does not need to change from ordinary
 single-process usage. Test via
 ```bash
-cd distributed_apex
+cd distributed_pytorch
 bash run.sh
 ```
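For reference, combining the two wrappers looks roughly like the following sketch. It assumes the distributed process group is already initialized (as the `run.sh` scripts arrange), and the model, learning rate, and static loss scale are illustrative rather than taken from either example:

```python
import torch
from apex.parallel import DistributedDataParallel
from apex.fp16_utils import FP16_Optimizer

# Wrap the FP16 model for distributed training, then wrap the optimizer.
# The training loop itself is identical to single-process FP16_Optimizer usage:
#   optimizer.zero_grad(); optimizer.backward(loss); optimizer.step()
model = DistributedDataParallel(torch.nn.Linear(128, 10).cuda().half())
optimizer = FP16_Optimizer(torch.optim.SGD(model.parameters(), lr=0.1),
                           static_loss_scale=128.0)
```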
@@ -17,6 +17,8 @@ transfers to reduce the total number of transfers required.
 [Source Code](https://github.com/NVIDIA/apex/tree/master/apex/parallel)
+[Another Example: Imagenet with mixed precision](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
 ## Getting started
 Prior to running please run
 ```pip install -r requirements.txt```
@@ -26,8 +28,10 @@ To download the dataset, run
 without any arguments. Once you have downloaded the dataset, you should not need to do this again.
 You can now launch multi-process distributed data parallel jobs via
-```python -m apex.parallel.multiproc main.py args...```
-adding any args... you'd like. The launch script `apex.parallel.multiproc` will
+```bash
+python -m apex.parallel.multiproc main.py args...
+```
+adding any `args...` you like. The launch script `apex.parallel.multiproc` will
 spawn one process for each of your system's available (visible) GPUs.
 Each process will run `python main.py args... --world-size <worldsize> --rank <rank>`
 (the `--world-size` and `--rank` arguments are determined and appended by `apex.parallel.multiproc`).
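A minimal sketch of what each launched process typically does with the appended `--world-size` and `--rank` arguments. The argument parsing and process-group setup shown here are one common pattern, not necessarily the exact code in the bundled `main.py`:

```python
import argparse
import torch
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--world-size', type=int, default=1)
parser.add_argument('--rank', type=int, default=0)
args = parser.parse_args()

# Bind this process to its GPU, then join the NCCL process group.
# (init_method='env://' assumes MASTER_ADDR/MASTER_PORT are set in the environment.)
torch.cuda.set_device(args.rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://',
                                     world_size=args.world_size, rank=args.rank)

model = torch.nn.Linear(128, 10).cuda()   # toy model for illustration
model = DistributedDataParallel(model)
```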
@@ -45,7 +49,5 @@ which will run on devices 0 and 1. By default, if `CUDA_VISIBLE_DEVICES` is uns
 To understand how to convert your own model, please see all sections of main.py within ```#=====START: ADDED FOR DISTRIBUTED======``` and ```#=====END: ADDED FOR DISTRIBUTED======``` flags.
-[Example with Imagenet and mixed precision training](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
 ## Requirements
 Pytorch master branch built from source. This requirement is to use NCCL as a distributed backend.
@@ -3,7 +3,7 @@
 This example is based on [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet).
 It implements training of popular model architectures, such as ResNet, AlexNet, and VGG on the ImageNet dataset.
-`main.py` and `main_fp16_optimizer.py` have been modified to use the `DistributedDataParallel` module in APEx instead of the one in upstream PyTorch. For description of how this works please see the distributed example included in this repo.
+`main.py` and `main_fp16_optimizer.py` have been modified to use the `DistributedDataParallel` module in Apex instead of the one in upstream PyTorch. For description of how this works please see the distributed example included in this repo.
 `main.py` with the `--fp16` argument demonstrates mixed precision training with manual management of master parameters and loss scaling.
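The manual-management pattern that `main.py --fp16` refers to, reduced to a generic sketch. This is not the code in `main.py`: the model, data, and static loss-scale value are illustrative, and `apex.fp16_utils` provides helpers for several of these steps.

```python
import torch

loss_scale = 128.0  # static loss scale (illustrative value)

model = torch.nn.Linear(128, 10).cuda().half()

# Keep an FP32 "master" copy of every FP16 model parameter; the optimizer
# updates the masters, and the FP16 copies are refreshed after each step.
master_params = [p.detach().clone().float() for p in model.parameters()]
for p in master_params:
    p.requires_grad = True
optimizer = torch.optim.SGD(master_params, lr=0.1)

inp = torch.randn(32, 128).cuda().half()
target = torch.randn(32, 10).cuda().float()

loss = torch.nn.functional.mse_loss(model(inp).float(), target)
(loss * loss_scale).backward()          # scale the loss so FP16 grads don't underflow

for master, p in zip(master_params, model.parameters()):
    master.grad = p.grad.detach().float() / loss_scale  # copy grads to FP32 and unscale
optimizer.step()

with torch.no_grad():
    for master, p in zip(master_params, model.parameters()):
        p.copy_(master)                 # copy updated FP32 masters back into the FP16 model
```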
@@ -15,8 +15,8 @@ adding any normal arguments.
 ## Requirements
-- APEx which can be installed from https://www.github.com/nvidia/apex
-- Install PyTorch from source, master branch of ([pytorch on github](https://www.github.com/pytorch/pytorch)
+- Apex which can be installed from https://www.github.com/nvidia/apex
+- Install PyTorch from source, master branch of [pytorch on github](https://www.github.com/pytorch/pytorch).
 - `pip install -r requirements.txt`
 - Download the ImageNet dataset and move validation images to labeled subfolders
 - To do this, you can use the following script: https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
...