that provides better performance and numerical accuracy.
`--cpp_ext` enables
- C++-side flattening and unflattening utilities that reduce the CPU overhead of `apex.parallel.DistributedDataParallel`.
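A quick, illustrative way to check whether the C++ extension is available in your environment is to try importing it directly. The module name `apex_C` and the described fallback behavior are assumptions based on how Apex typically wires this up; verify against your installed version.

```python
# Illustrative check (not an official Apex API): see whether the C++ extension
# built by --cpp_ext is importable in the current environment.
try:
    import apex_C  # assumed module name for the --cpp_ext build product
    print("apex_C found: C++-side flatten/unflatten utilities will be used")
except ImportError:
    # Without the extension, apex.parallel.DistributedDataParallel is expected to
    # fall back to the slower Python-side tensor flattening utilities in torch.
    print("apex_C not found: falling back to Python-side flattening")
```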
### Windows support
Windows support is experimental, and Linux is recommended. However, since Apex can be installed as a Python-only package, there is a good chance the Python-only features will "just work" the same way they do on Linux. If you installed PyTorch in a Conda environment, make sure to install Apex in that same environment.
@@ -135,8 +131,11 @@ class DistributedDataParallel(Module):
     Args:
         module: Network definition to be run in multi-gpu/distributed mode.
-        message_size (Default = 1e7): Minimum number of elements in a communication bucket.
-        delay_allreduce (Default = False): Delay all communication to the end of the backward pass. This disables overlapping communication with computation.
+        message_size (int, default=1e7): Minimum number of elements in a communication bucket.
+        delay_allreduce (bool, default=False): Delay all communication to the end of the backward pass. This disables overlapping communication with computation.
+        allreduce_trigger_params (list, optional, default=None): If supplied, should contain a list of parameters drawn from the model. Allreduces will be kicked off whenever one of these parameters receives its gradient (as opposed to when a bucket of size message_size is full). At the end of backward(), a cleanup allreduce to catch any remaining gradients will also be performed automatically. If allreduce_trigger_params is supplied, the message_size argument will be ignored.
+        allreduce_always_fp32 (bool, default=False): Convert any FP16 gradients to FP32 before allreducing. This can improve stability for widely scaled-out runs.
+        gradient_average_split_factor (float, default=1.0): Perform the averaging of gradients over processes partially before and partially after the allreduce. Before allreduce: grads.mul_(1.0/gradient_average_split_factor). After allreduce: grads.mul_(gradient_average_split_factor/world_size). This can reduce the stress on the dynamic range of FP16 allreduces for widely scaled-out runs.
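A minimal usage sketch of these arguments, assuming a standard `torch.distributed` launch with one process per GPU and the usual rank/world-size environment variables set by the launcher; the model, optimizer, and argument values below are placeholders, not a recommended configuration.

```python
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# Assumes the launcher set RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT.
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model

# Wrap the model. The keyword arguments mirror the docstring above:
# bucket size for overlapped allreduces, optional delay of all communication
# to the end of backward, and FP32 accumulation of FP16 gradients.
model = DDP(model,
            message_size=10_000_000,
            delay_allreduce=False,
            allreduce_always_fp32=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

data = torch.randn(32, 1024).cuda()
target = torch.randn(32, 1024).cuda()

loss = torch.nn.functional.mse_loss(model(data), target)
loss.backward()   # gradient allreduces are triggered during the backward pass
optimizer.step()
```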