Unverified Commit 1cc1c3c3 authored by Thomas Wolf, committed by GitHub

Merge pull request #533 from lukovnikov/master

Docs for new learning rate code
parents dee8af4e 56a47ce2
...@@ -126,7 +126,7 @@ This package comprises the following classes that can be imported in Python and
  - `BertAdam` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
- Optimizer for **OpenAI GPT** (in the [`optimization_openai.py`](./pytorch_pretrained_bert/optimization_openai.py) file):
  - `OpenAIAdam` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
- Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective [`modeling.py`](./pytorch_pretrained_bert/modeling.py), [`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py), [`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py) files):
  - `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.
...@@ -984,19 +984,48 @@ The optimizer accepts the following arguments:
- `warmup` : portion of `t_total` for the warmup, `-1` means no warmup. Default : `-1`
- `t_total` : total number of training steps for the learning
  rate schedule, `-1` means constant learning rate. Default : `-1`
- `schedule` : schedule to use for the warmup (see above).
  Can be `'warmup_linear'`, `'warmup_constant'`, `'warmup_cosine'`, `'none'`, `None` or a `_LRSchedule` object (see below).
  If `None` or `'none'`, learning rate is always kept constant.
  Default : `'warmup_linear'`
- `b1` : Adams b1. Default : `0.9`
- `b2` : Adams b2. Default : `0.999`
- `e` : Adams epsilon. Default : `1e-6`
- `weight_decay` : Weight decay. Default : `0.01`
- `max_grad_norm` : Maximum norm for the gradients (`-1` means no clipping). Default : `1.0`
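
As a quick illustration of how these arguments fit together, here is a minimal sketch of constructing `BertAdam` with a string-based schedule. The `model` (a plain `torch.nn.Linear`) and `num_train_steps` are placeholders standing in for a real fine-tuning setup, not part of the documented API.

```python
import torch
from pytorch_pretrained_bert import BertAdam

model = torch.nn.Linear(10, 2)   # placeholder for a real BERT model
num_train_steps = 1000           # hypothetical total number of optimization steps

optimizer = BertAdam(
    model.parameters(),
    lr=5e-5,                   # peak learning rate
    warmup=0.1,                # warm up over the first 10% of t_total
    t_total=num_train_steps,   # needed for the warmup/decay schedule to take effect
    schedule='warmup_linear',  # the default schedule
    weight_decay=0.01,
    max_grad_norm=1.0,
)
```

If `t_total` is left at `-1`, the learning rate stays constant at `lr` regardless of the `warmup` setting.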
#### `OpenAIAdam`

`OpenAIAdam` is similar to `BertAdam`.
The difference with `BertAdam` is that `OpenAIAdam` compensates for bias as in the regular Adam optimizer.
`OpenAIAdam` accepts the same arguments as `BertAdam`.

#### Learning Rate Schedules
The `.optimization` module also provides additional schedules in the form of schedule objects that inherit from `_LRSchedule`.
All `_LRSchedule` subclasses accept `warmup` and `t_total` arguments at construction.
When an `_LRSchedule` object is passed into `BertAdam` or `OpenAIAdam`,
the `warmup` and `t_total` arguments on the optimizer are ignored and the ones in the `_LRSchedule` object are used.
An overview of the implemented schedules:
- `ConstantLR`: always returns learning rate 1.
- `WarmupConstantSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
Keeps learning rate equal to 1. after warmup.
![](docs/imgs/warmup_constant_schedule.png)
- `WarmupLinearSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
Linearly decreases learning rate from 1. to 0. over remaining `1 - warmup` steps.
![](docs/imgs/warmup_linear_schedule.png)
- `WarmupCosineSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
Decreases learning rate from 1. to 0. over remaining `1 - warmup` steps following a cosine curve.
If `cycles` (default=0.5) is different from the default, the learning rate follows a cosine function after warmup.
![](docs/imgs/warmup_cosine_schedule.png)
- `WarmupCosineWithHardRestartsSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
If `cycles` (default=1.) is different from the default, the learning rate follows `cycles` successive cosine decays from 1. to 0. (with hard restarts).
![](docs/imgs/warmup_cosine_hard_restarts_schedule.png)
- `WarmupCosineWithWarmupRestartsSchedule`: All training progress is divided into `cycles` (default=1.) parts of equal length.
Every part follows a schedule with the first `warmup` fraction of the training steps linearly increasing from 0. to 1.,
followed by the learning rate decreasing from 1. to 0. following a cosine curve.
Note that the total fraction of warmup steps over all cycles is `warmup` * `cycles`.
![](docs/imgs/warmup_cosine_warm_restarts_schedule.png)
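
For instance, the following minimal sketch builds one of the schedule objects above and hands it to `BertAdam`. The toy `torch.nn.Linear` model, the step count and the chosen `warmup`/`cycles` values are placeholders for illustration only.

```python
import torch
from pytorch_pretrained_bert import BertAdam
from pytorch_pretrained_bert.optimization import WarmupCosineWithHardRestartsSchedule

model = torch.nn.Linear(10, 2)   # placeholder for a real model
num_train_steps = 1000           # hypothetical total number of optimization steps

# warmup and t_total are configured on the schedule object itself;
# the optimizer's own warmup/t_total arguments are ignored in this case.
lr_schedule = WarmupCosineWithHardRestartsSchedule(warmup=0.05, t_total=num_train_steps, cycles=2.)
optimizer = BertAdam(model.parameters(), lr=5e-5, schedule=lr_schedule)
```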
## Examples
...
...@@ -85,7 +85,9 @@ class ConstantLR(_LRSchedule):
class WarmupCosineSchedule(_LRSchedule):
    """
    Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
    Decreases learning rate from 1. to 0. over remaining `1 - warmup` steps following a cosine curve.
    If `cycles` (default=0.5) is different from default, learning rate follows cosine function after warmup.
    """
    warn_t_total = True
    def __init__(self, warmup=0.002, t_total=-1, cycles=.5, **kw):
...@@ -108,7 +110,9 @@ class WarmupCosineSchedule(_LRSchedule):
class WarmupCosineWithHardRestartsSchedule(WarmupCosineSchedule):
    """
    Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
    If `cycles` (default=1.) is different from default, learning rate follows `cycles` times a cosine decaying
    learning rate (with hard restarts).
    """
    def __init__(self, warmup=0.002, t_total=-1, cycles=1., **kw):
        super(WarmupCosineWithHardRestartsSchedule, self).__init__(warmup=warmup, t_total=t_total, cycles=cycles, **kw)
...@@ -125,9 +129,9 @@ class WarmupCosineWithHardRestartsSchedule(WarmupCosineSchedule):
class WarmupCosineWithWarmupRestartsSchedule(WarmupCosineWithHardRestartsSchedule):
    """
    All training progress is divided in `cycles` (default=1.) parts of equal length.
    Every part follows a schedule with the first `warmup` fraction of the training steps linearly increasing from 0. to 1.,
    followed by a learning rate decreasing from 1. to 0. following a cosine curve.
    """
    def __init__(self, warmup=0.002, t_total=-1, cycles=1., **kw):
        assert(warmup * cycles < 1.)
...@@ -146,7 +150,8 @@ class WarmupCosineWithWarmupRestartsSchedule(WarmupCosineWithHardRestartsSchedul
class WarmupConstantSchedule(_LRSchedule):
    """
    Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
    Keeps learning rate equal to 1. after warmup.
    """
    def get_lr_(self, progress):
        if progress < self.warmup:
...@@ -156,7 +161,8 @@ class WarmupConstantSchedule(_LRSchedule):
class WarmupLinearSchedule(_LRSchedule):
    """
    Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
    Linearly decreases learning rate from 1. to 0. over remaining `1 - warmup` steps.
    """
    warn_t_total = True
    def get_lr_(self, progress):
...@@ -182,8 +188,9 @@ class BertAdam(Optimizer):
        t_total: total number of training steps for the learning
            rate schedule, -1 means constant learning rate of 1. (no warmup regardless of warmup setting). Default: -1
        schedule: schedule to use for the warmup (see above).
            Can be `'warmup_linear'`, `'warmup_constant'`, `'warmup_cosine'`, `'none'`, `None` or a `_LRSchedule` object (see below).
            If `None` or `'none'`, learning rate is always kept constant.
            Default : `'warmup_linear'`
        b1: Adams b1. Default: 0.9
        b2: Adams b2. Default: 0.999
        e: Adams epsilon. Default: 1e-6
...@@ -208,8 +215,8 @@ class BertAdam(Optimizer):
            schedule = schedule_type(warmup=warmup, t_total=t_total)
        else:
            if warmup != -1 or t_total != -1:
                logger.warning("warmup and t_total on the optimizer are ineffective when _LRSchedule object is provided as schedule. "
                               "Please specify custom warmup and t_total in _LRSchedule object.")
        defaults = dict(lr=lr, schedule=schedule,
                        b1=b1, b2=b2, e=e, weight_decay=weight_decay,
                        max_grad_norm=max_grad_norm)
...
...@@ -48,8 +48,8 @@ class OpenAIAdam(Optimizer):
            schedule = schedule_type(warmup=warmup, t_total=t_total)
        else:
            if warmup != -1 or t_total != -1:
                logger.warning("warmup and t_total on the optimizer are ineffective when _LRSchedule object is provided as schedule. "
                               "Please specify custom warmup and t_total in _LRSchedule object.")
        defaults = dict(lr=lr, schedule=schedule,
                        b1=b1, b2=b2, e=e, weight_decay=weight_decay, vector_l2=vector_l2,
                        max_grad_norm=max_grad_norm)
...
...@@ -22,7 +22,8 @@ import torch
from pytorch_pretrained_bert import BertAdam
from pytorch_pretrained_bert import OpenAIAdam
from pytorch_pretrained_bert.optimization import ConstantLR, WarmupLinearSchedule, WarmupConstantSchedule, \
    WarmupCosineWithWarmupRestartsSchedule, WarmupCosineWithHardRestartsSchedule, WarmupCosineSchedule
import numpy as np
...@@ -86,7 +87,5 @@ class WarmupCosineWithRestartsTest(unittest.TestCase):
        self.assertTrue(np.allclose(expected_zeros, 0))

if __name__ == "__main__":
    unittest.main()