OpenDAS / fairscale, commit 43a27cd4 (unverified)
Authored Jan 11, 2021 by Min Xu; committed by GitHub on Jan 11, 2021
[docs] clarify per-GPU batch size for AdaScale (#301)
- clarify that per-GPU batch size is not increased with AdaScale.
Parent: 2d954203

Showing 2 changed files with 11 additions and 5 deletions (+11 -5):

README.md (+5 -2)
fairscale/optim/adascale.py (+6 -3)
README.md @ 43a27cd4

...
@@ -120,6 +120,8 @@ AdaScale can be used to wrap a SGD optimizer and to be used in DDP (Distributed
 training or non-DDP with gradient accumulation. The benefit is to re-use the same LR
 schedule from a baseline batch size when effective batch size is bigger.
+Note that AdaScale does _not_ help increase per-GPU batch size.
+
 ```python
 from torch.optim import SGD
 from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
...
@@ -147,11 +149,12 @@ while not done:
 ```
 Primary goal is to allow scaling to bigger batch sizes without losing model accuracy.
+(However, training time might be longer compared to training without AdaScale.)
 At a high level, we want ML researchers to:
-* go parallel more easily (i.e. reuse the same LR schedule)
+* go parallel more easily (i.e. no need to find new learning rate schedules)
 * not worry about losing accuracy
-* get same (or higher) GPU efficiency (fewer steps, less networking, etc.)
+* get potentially higher GPU efficiency (fewer steps, less networking overhead, etc.)

 # Testing
...
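The README's code example is cut off after the imports in this diff view. Purely as an illustrative sketch of the pattern the README describes, and not the README's own code, the loop below advances the LR schedule by `optim.gain()` instead of by one step per batch; the toy `model`, `loss_fn`, `data`, and `accum` values are placeholders, and the `num_gradients_to_accumulate` argument is assumed from fairscale's AdaScale API:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
from fairscale.optim import AdaScale

# Toy model and synthetic data stand in for a real DDP job. The per-process
# batch size stays at 8; the effective batch size is enlarged by accumulating
# gradients over `accum` micro-batches (or, with DDP, by the number of workers).
model = torch.nn.Linear(16, 2)
loss_fn = torch.nn.CrossEntropyLoss()
data = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(10)]
accum = 2

optim = AdaScale(SGD(model.parameters(), lr=0.1), num_gradients_to_accumulate=accum)
scheduler = LambdaLR(optim, lr_lambda=lambda epoch: 0.95 ** epoch)

steps_per_epoch = len(data) // accum
progress, last_epoch, max_epochs = 0.0, 0, 3
done = False
while not done:
    for i, (inputs, targets) in enumerate(data):
        loss_fn(model(inputs), targets).backward()  # gradients accumulate across micro-batches
        if (i + 1) % accum != 0:
            continue
        # Advance progress by the AdaScale gain rather than by 1, so the baseline
        # LR schedule is re-used unchanged at the larger effective batch size.
        progress += optim.gain()
        optim.step()
        optim.zero_grad()
        epoch = int(progress) // steps_per_epoch
        if epoch > last_epoch:
            scheduler.step()
            last_epoch = epoch
        if epoch >= max_epochs:
            done = True
            break
```

The per-GPU batch size never changes in this sketch; only the effective batch size, and hence the AdaScale gain, depends on the accumulation factor or the world size.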
fairscale/optim/adascale.py @ 43a27cd4

...
@@ -47,11 +47,14 @@ class AdaScale(Optimizer):
 distributed and large batch size training. Can be used in combination with
 ``torch.nn.parallel.DistributedDataParallel`` and ``torch.optim.SGD``.
-Subclass `Optimizer` so that `torch.optim.lr_scheduler` can work. In other words,
-AdaScale is intended to be a complete wrapper of an torch Optimizer.
 .. _AdaScale: https://proceedings.icml.cc/static/paper_files/icml/2020/4682-Supplemental.pdf
+This class subclasses `Optimizer` so that `torch.optim.lr_scheduler` can
+work with it. In other words, AdaScale is intended to be a complete wrapper of a
+torch Optimizer.
+Note that AdaScale does _not_ help increase per-GPU batch size.
 There are several ways to integrate AdaScale with your training loop.
 We show two examples below.
...
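The two integration examples the docstring refers to are collapsed in this view. As a minimal illustrative sketch of the two points this hunk adds (not the file's own examples, and assuming the `num_gradients_to_accumulate` argument and `scale` property from fairscale's AdaScale API): the wrapper is itself a `torch.optim.Optimizer`, so a standard LR scheduler attaches to it directly, while the per-GPU batch size is left untouched and only the effective batch size is scaled.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR
from fairscale.optim import AdaScale  # assumed import path

model = torch.nn.Linear(16, 2)

# The per-GPU batch size is whatever your DataLoader produces; AdaScale does not
# change it. The effective batch size grows with world size and/or accumulation.
optim = AdaScale(SGD(model.parameters(), lr=0.1), num_gradients_to_accumulate=2)

# AdaScale subclasses torch.optim.Optimizer, so LR schedulers accept it as-is.
assert isinstance(optim, torch.optim.Optimizer)
scheduler = LambdaLR(optim, lr_lambda=lambda epoch: 0.95 ** epoch)

# `scale` (assumed API) reports the batch-size scaling factor relative to the
# baseline, e.g. world_size * num_gradients_to_accumulate by default.
print(optim.scale)
```

A full training loop with either DDP or gradient accumulation then drives `optim.step()` and the scheduler as in the README sketch above.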