OpenDAS / fairscale · Commit 43a27cd4 (Unverified)
Authored Jan 11, 2021 by Min Xu; committed by GitHub on Jan 11, 2021
Parent: 2d954203

[docs] clarify per-GPU batch size for AdaScale (#301)

- clarify that per-GPU batch size is not increased with AdaScale.
Showing 2 changed files with 11 additions and 5 deletions (+11 -5):

- README.md (+5 -2)
- fairscale/optim/adascale.py (+6 -3)
README.md (view file @ 43a27cd4)

@@ -120,6 +120,8 @@ AdaScale can be used to wrap a SGD optimizer and to be used in DDP (Distributed
 training or non-DDP with gradient accumulation. The benefit is to re-use the same LR
 schedule from a baseline batch size when effective batch size is bigger.
+Note that AdaScale does _not_ help increase per-GPU batch size.
 ```python
 from torch.optim import SGD
 from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
...
@@ -147,11 +149,12 @@ while not done:
 ```
 Primary goal is to allow scaling to bigger batch sizes without losing model accuracy.
 (However, training time might be longer comparing to without AdaScale.)
 At a high level, we want ML researchers to:
-  * go parallel more easily (i.e. reuse the same LR schedule)
+  * go parallel more easily (i.e. no need to find new learning rate schedules)
   * not worrying about lossing accuracy
-  * get same (or higher) GPU efficiency (fewer steps, less networking, etc.)
+  * potentially higher GPU efficiency (fewer steps, less networking overhead, etc.)
 # Testing
...
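For context, here is a minimal, self-contained sketch of how the pieces referenced in these README hunks fit together: AdaScale wrapping an `SGD` instance, a `torch.optim.lr_scheduler` schedule reused from the baseline batch size, and a `gain()`-driven step counter. The model, data, and epoch count below are placeholders, and a real run would use DDP or gradient accumulation as the README says; treat it as an illustration under those assumptions, not the exact snippet from the file.

```python
import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
from fairscale.optim import AdaScale

# Placeholder model and data; in real use this loop runs under DDP (or with
# gradient accumulation), so the effective batch size exceeds the per-GPU one
# while the per-GPU batch size itself stays at the baseline value.
model = torch.nn.Linear(10, 2)
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(4)]
max_epochs = 3

optim = AdaScale(SGD(model.parameters(), lr=0.1))
scheduler = LambdaLR(optim, lr_lambda=lambda epoch: 0.95 ** epoch)  # baseline schedule

step, last_epoch, done = 0, 0, False
while not done:
    for inputs, targets in loader:
        optim.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        step += optim.gain()  # scale-aware progress instead of a fixed +1
        optim.step()
        epoch = step // len(loader)
        if epoch > last_epoch:
            scheduler.step()  # the same LR schedule as the baseline batch size
            last_epoch = epoch
        if epoch >= max_epochs:
            done = True
            break
```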
fairscale/optim/adascale.py (view file @ 43a27cd4)

@@ -47,11 +47,14 @@ class AdaScale(Optimizer):
     distributed and large batch size training. Can be used in combination with
     ``torch.nn.parallel.DistributedDataParallel`` and ``torch.optim.SGD``.
-    Subclass `Optimizer` so that `torch.optim.lr_scheduler` can work. In other words,
-    AdaScale is intended to be a complete wrapper of an torch Optimizer.
     .. _AdaScale: https://proceedings.icml.cc/static/paper_files/icml/2020/4682-Supplemental.pdf
+    This class subclasses `Optimizer` so that `torch.optim.lr_scheduler` can
+    work with it. In other words, AdaScale is intended to be a complete wrapper of an
+    torch Optimizer.
+    Note that, AdaScale does _not_ help increase per-GPU batch size.
     There are several ways to integrate AdaScale with your training loop.
     We show two examples below.
...
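To make the note this commit adds concrete: the per-GPU (or per-accumulation-step) batch size stays at its baseline value, and only the effective batch size grows with the degree of data parallelism. A tiny illustrative calculation, with hypothetical sizes:

```python
# Hypothetical sizes for illustration; AdaScale does not change the per-GPU value.
per_gpu_batch_size = 32     # same as the single-GPU baseline
world_size = 8              # number of DDP workers
accumulation_steps = 2      # optional gradient accumulation (1 if unused)

# This product is the "effective batch size" the docs refer to; AdaScale's gain
# compensates on the step/LR-schedule side so the baseline schedule can be reused.
effective_batch_size = per_gpu_batch_size * world_size * accumulation_steps
print(effective_batch_size)  # 512
```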