To use `FP16_Optimizer` on a half-precision model, or a model with a mixture of
half and float parameters, only two lines of your training script need to change:
1. Construct an `FP16_Optimizer` instance from an existing optimizer.
2. Replace `loss.backward()` with `optimizer.backward(loss)`.
[Full API Documentation](https://nvidia.github.io/apex/fp16_utils.html#automatic-management-of-master-params-loss-scaling)
See "Other Options" at the bottom of this page for some cases that require special treatment.
#### Minimal Working Sample
`minimal.py` shows the basic usage of `FP16_Optimizer` with either static or dynamic loss scaling. Test via `python minimal.py`.
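Roughly, the two scaling modes differ only in how the wrapper is constructed. A sketch (the placeholder model and optimizer below are invented for illustration; `minimal.py` is the authoritative example):
```python
import torch
from apex.fp16_utils import FP16_Optimizer

model = torch.nn.Linear(4, 4).cuda().half()              # placeholder half-precision model
inner_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Static loss scaling: gradients are scaled by a fixed constant chosen up front.
optimizer = FP16_Optimizer(inner_optimizer, static_loss_scale=128.0)

# Dynamic loss scaling: the wrapper adjusts the scale automatically,
# skipping steps whose gradients overflow.
optimizer = FP16_Optimizer(inner_optimizer, dynamic_loss_scale=True)
```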
See [the API documentation](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.step) for more details.
<!---
#### Serialization/Deserialization

TODO: add checkpointing example showing deserialization on the correct device
`FP16_Optimizer` supports saving and loading with the same control flow as ordinary Pytorch optimizers.
-->

#### Checkpointing

`FP16_Optimizer` also supports checkpointing with the same control flow as ordinary Pytorch optimizers.
`save_load.py` shows an example. Test via `python save_load.py`.
See [the API documentation](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.load_state_dict) for more details.
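A rough sketch of what that control flow looks like (the filename and the surrounding `model`/`optimizer` objects are placeholders; `save_load.py` is the authoritative example):
```python
import torch

# Saving: FP16_Optimizer exposes state_dict() like an ordinary Pytorch optimizer.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict()},
           "checkpoint.pt")

# Loading: restore the model, then the already-constructed FP16_Optimizer wrapper.
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
```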
#### Distributed

**distributed_apex** shows an example using `FP16_Optimizer` with Apex DistributedDataParallel.
**distributed_pytorch** shows an analogous example using Pytorch's native DistributedDataParallel.
In both cases, the usage of `FP16_Optimizer` does not need to change from ordinary
single-process usage. Test via
```bash
cd distributed_pytorch
bash run.sh
```
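As a rough, hedged sketch of the pattern the distributed scripts follow (process-group initialization, the data pipeline, and the hyperparameters are omitted or invented here; see the example directories and their `run.sh` for the real setup):
```python
import torch
from apex.parallel import DistributedDataParallel as DDP
from apex.fp16_utils import FP16_Optimizer

# Assumes torch.distributed has already been initialized by the launcher.
model = torch.nn.Linear(512, 512).cuda().half()
model = DDP(model)  # averages gradients across processes during backward

# The FP16_Optimizer construction itself is identical to the single-process case.
optimizer = FP16_Optimizer(torch.optim.SGD(model.parameters(), lr=1e-3),
                           dynamic_loss_scale=True)
```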
#### Other Options
Gradient clipping requires that calls to `torch.nn.utils.clip_grad_norm`
be replaced with [fp16_optimizer_instance.clip_master_grads](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.FP16_Optimizer.clip_master_grads).
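For instance, a clipping call inside the training loop might change roughly as follows (snippet only; `loss`, `model`, and `optimizer` are as in the earlier sketch, and the max-norm value of 1.0 is an arbitrary placeholder):
```python
# Before, with an ordinary fp32 optimizer:
#     loss.backward()
#     torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)
#     optimizer.step()

# After wrapping with FP16_Optimizer: clip the fp32 master gradients instead,
# between the backward pass and the step.
optimizer.backward(loss)
optimizer.clip_master_grads(1.0)
optimizer.step()
```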
Multiple losses will work if you simply replace
```python
loss1.backward()
loss2.backward()
```
with
```python
optimizer.backward(loss1)
optimizer.backward(loss2)
```
but `FP16_Optimizer` can be told to handle this more efficiently using the