Unverified Commit 43a27cd4 authored by Min Xu, committed by GitHub

[docs] clarify per-GPU batch size for AdaScale (#301)

- clarify that per-GPU batch size is not increased with AdaScale.
parent 2d954203
@@ -120,6 +120,8 @@ AdaScale can be used to wrap an SGD optimizer and be used in DDP (Distributed Data Parallel)
training or non-DDP with gradient accumulation. The benefit is to re-use the same LR
schedule from a baseline batch size when the effective batch size is bigger.
Note that AdaScale does _not_ help increase per-GPU batch size.
```python
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR # or your scheduler
@@ -147,11 +149,12 @@ while not done:
```
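To make this pattern concrete, here is a minimal sketch of wrapping SGD with AdaScale in a DDP training loop. It assumes the `fairscale.optim.AdaScale` wrapper and its `gain()` method; `train`, `dataloader`, and `max_epochs` are placeholder names, not part of the library.
```python
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

from fairscale.optim import AdaScale


def train(model, dataloader, max_epochs, base_lr=0.1):
    # Wrap the base SGD optimizer with AdaScale; the LR scheduler attaches to
    # the wrapper exactly as it would to a plain optimizer.
    optimizer = AdaScale(SGD(model.parameters(), lr=base_lr))
    scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: 1.0 / (epoch + 1))
    model = DDP(model)  # assumes torch.distributed is already initialized

    step, last_epoch = 0.0, 0
    while last_epoch < max_epochs:
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()
            # Advance the schedule by AdaScale's gain rather than by 1, so the
            # baseline LR schedule is reused at the larger effective batch size.
            step += optimizer.gain()
            optimizer.step()
            epoch = int(step) // len(dataloader)
            if epoch > last_epoch:
                scheduler.step()
                last_epoch = epoch
```
Note that the per-GPU batch size produced by `dataloader` stays the same; the larger effective batch comes from data parallelism across ranks.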
The primary goal is to allow scaling to bigger batch sizes without losing model accuracy.
(However, training time might be longer compared to training without AdaScale.)
At a high level, we want ML researchers to:
* go parallel more easily (i.e. no need to find new learning rate schedules)
* not worry about losing accuracy
* potentially get higher GPU efficiency (fewer steps, less networking overhead, etc.)
# Testing
......
@@ -47,11 +47,14 @@ class AdaScale(Optimizer):
distributed and large batch size training. Can be used in combination with
``torch.nn.parallel.DistributedDataParallel`` and ``torch.optim.SGD``.
.. _AdaScale: https://proceedings.icml.cc/static/paper_files/icml/2020/4682-Supplemental.pdf
This class subclasses `Optimizer` so that `torch.optim.lr_scheduler` can
work with it. In other words, AdaScale is intended to be a complete wrapper of
a torch Optimizer.
Note that AdaScale does _not_ help increase per-GPU batch size.
There are several ways to integrate AdaScale with your training loop.
We show two examples below.
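As a rough illustration of the non-DDP, gradient-accumulation path mentioned above (an illustrative sketch, not one of the class's own examples): the `num_gradients_to_accumulate` constructor argument is assumed here, and the exact gradient-scaling convention during accumulation is an assumption as well.
```python
import torch.nn.functional as F
from torch.optim import SGD

from fairscale.optim import AdaScale


def train_with_accumulation(model, dataloader, accumulate_steps=4, lr=0.1):
    # Single-process training where the effective batch size is enlarged by
    # accumulating gradients over several micro-batches before each step.
    # `num_gradients_to_accumulate` is assumed to tell AdaScale how many
    # backward passes make up one optimizer step.
    optimizer = AdaScale(
        SGD(model.parameters(), lr=lr),
        num_gradients_to_accumulate=accumulate_steps,
    )

    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(dataloader):
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        if (i + 1) % accumulate_steps == 0:
            # One optimizer step per accumulation window; the per-GPU batch
            # size itself is unchanged, matching the note above.
            optimizer.step()
            optimizer.zero_grad()
```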
......