Unverified commit 43a27cd4, authored by Min Xu and committed by GitHub

[docs] clarify per-GPU batch size for AdaScale (#301)

- clarify that per-GPU batch size is not increased with AdaScale.
parent 2d954203
@@ -120,6 +120,8 @@ AdaScale can be used to wrap an SGD optimizer and be used in DDP (Distributed
training or non-DDP with gradient accumulation. The benefit is to re-use the same LR
schedule from a baseline batch size when the effective batch size is bigger.
+Note that AdaScale does _not_ help increase per-GPU batch size.
```python
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
@@ -147,11 +149,12 @@ while not done:
```
The primary goal is to allow scaling to bigger batch sizes without losing model accuracy.
+(However, training time might be longer compared to training without AdaScale.)
At a high level, we want ML researchers to:
-* go parallel more easily (i.e. reuse the same LR schedule)
+* go parallel more easily (i.e. no need to find new learning rate schedules)
* not worry about losing accuracy
-* get same (or higher) GPU efficiency (fewer steps, less networking, etc.)
+* get potentially higher GPU efficiency (fewer steps, less networking overhead, etc.)

# Testing
...
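The README example above is collapsed by the diff view between the two hunks. Below is a minimal, single-process sketch of the pattern the surrounding text describes: wrap the baseline SGD optimizer in AdaScale, keep the baseline LR schedule, and advance it by `optim.gain()`. The toy model, data, and epoch bookkeeping are placeholders, and the `gain()` method and `num_gradients_to_accumulate` argument are taken from the fairscale AdaScale API as understood here, not from this commit; verify against your installed version.

```python
# Minimal sketch (assumed API: fairscale.optim.AdaScale with gain() and
# num_gradients_to_accumulate; toy model/data are placeholders).
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler

from fairscale.optim import AdaScale

model = torch.nn.Linear(16, 2)
data = [torch.randn(8, 16) for _ in range(40)]            # per-GPU batch size stays 8

accum = 2                                                 # effective batch = 2 x baseline
optim = AdaScale(SGD(model.parameters(), lr=0.1), num_gradients_to_accumulate=accum)
scheduler = LambdaLR(optim, lambda epoch: 0.95 ** epoch)  # unchanged baseline LR schedule

step, last_epoch, max_epochs, done = 0.0, 0, 3, False
steps_per_epoch = len(data) // accum
while not done:
    for i, batch in enumerate(data):
        loss = model(batch).sum()
        loss.backward()                                   # accumulate gradients
        if (i + 1) % accum != 0:
            continue
        step += optim.gain()                              # how many baseline steps this step is worth
        optim.step()
        optim.zero_grad()
        epoch = int(step) // steps_per_epoch
        if epoch > last_epoch:                            # advance the baseline LR schedule
            scheduler.step()
            last_epoch = epoch
        if epoch >= max_epochs:
            done = True
            break
```

Because `gain()` is at least 1, the baseline schedule is consumed at least as fast as without AdaScale, which is how the same schedule can be reused at a larger effective batch size.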
@@ -47,11 +47,14 @@ class AdaScale(Optimizer):
distributed and large batch size training. Can be used in combination with
``torch.nn.parallel.DistributedDataParallel`` and ``torch.optim.SGD``.
-Subclass `Optimizer` so that `torch.optim.lr_scheduler` can work. In other words,
-AdaScale is intended to be a complete wrapper of a torch Optimizer.
.. _AdaScale: https://proceedings.icml.cc/static/paper_files/icml/2020/4682-Supplemental.pdf
+This class subclasses `Optimizer` so that `torch.optim.lr_scheduler` can
+work with it. In other words, AdaScale is intended to be a complete wrapper of a
+torch Optimizer.
+Note that AdaScale does _not_ help increase per-GPU batch size.
There are several ways to integrate AdaScale with your training loop.
We show two examples below.
...
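The two integration examples promised in the docstring are also collapsed by the diff. As a rough illustration of the DDP case the docstring says is supported, here is a sketch using a CPU `gloo` process group; the master address/port, the toy model, and the loop are placeholders and are not part of this commit's files.

```python
# Rough DDP sketch (CPU "gloo" backend; address/port and toy model are placeholders).
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import SGD

from fairscale.optim import AdaScale


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(16, 2))
    # AdaScale reads the world size from the process group, so the effective
    # batch is world_size x the unchanged per-process batch of 8.
    optim = AdaScale(SGD(model.parameters(), lr=0.1))

    for _ in range(5):
        optim.zero_grad()
        loss = model(torch.randn(8, 16)).sum()
        loss.backward()
        optim.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

The per-process batch stays fixed here; only the number of workers (or, in the previous sketch, the number of accumulated gradients) changes the effective batch size, which is the point the added documentation lines make.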