Unverified Commit c301c263 authored by Josh, committed by GitHub

Fix Adafactor documentation (recommend correct settings) (#10526)



* Update optimization.py

Fix documentation to reflect optimal settings for Adafactor

* update and expand on the recommendations

* style

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* flip scale_parameter to True for the 2nd recommendation

Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 838f83d8
@@ -402,19 +402,24 @@ class Adafactor(Optimizer):
     This implementation handles low-precision (FP16, bfloat) values, but we have not thoroughly tested it.

-    Recommended T5 finetuning settings:
+    Recommended T5 finetuning settings (https://discuss.huggingface.co/t/t5-finetuning-tips/684/3):

-        - Scheduled LR warm-up to fixed LR
-        - disable relative updates
-        - use clip threshold: https://arxiv.org/abs/2004.14546
+        - Training without LR warmup or clip_threshold is not recommended.
+
+           * use scheduled LR warm-up to fixed LR
+           * use clip_threshold=1.0 (https://arxiv.org/abs/1804.04235)
+        - Disable relative updates
+        - Use scale_parameter=False
+        - Additional optimizer operations like gradient clipping should not be used alongside Adafactor

     Example::

-        Adafactor(model.parameters(), lr=1e-3, relative_step=False, warmup_init=True)
+        Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=1e-3)
+
+    Others reported the following combination to work well::
+
+        Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)

-    - Alternatively, relative_step with warmup_init can be used.
-    - Training without LR warmup or clip threshold is not recommended. Additional optimizer operations like
-      gradient clipping should not be used alongside Adafactor.

     Usage::
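To make the two recommended configurations above concrete, here is a minimal sketch; the `torch.nn.Linear` stand-in model and the 1000-step warm-up length are illustrative placeholders, not part of the PR::

    import torch
    from transformers import Adafactor, get_constant_schedule_with_warmup

    # Hypothetical stand-in for the model being fine-tuned.
    model = torch.nn.Linear(8, 8)

    # First recommendation: internal heuristics off, fixed LR reached via an
    # external warm-up schedule.
    optimizer = Adafactor(
        model.parameters(),
        lr=1e-3,               # fixed target LR
        clip_threshold=1.0,    # update clipping (https://arxiv.org/abs/1804.04235)
        scale_parameter=False,
        relative_step=False,
        warmup_init=False,
    )
    # Warm up to lr over the first steps; call scheduler.step() after each
    # optimizer.step() in the training loop.
    scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=1000)

    # Second combination (reported to work well): the LR is derived internally
    # from the step count, so lr must be None and no scheduler is attached.
    optimizer_alt = Adafactor(
        model.parameters(),
        lr=None,
        scale_parameter=True,
        relative_step=True,
        warmup_init=True,
    )

The second setup needs no external scheduler because `relative_step=True` makes Adafactor compute a time-dependent step size itself, which is also why it rejects a manual `lr`.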
@@ -447,9 +452,9 @@ class Adafactor(Optimizer):
         warmup_init=False,
     ):
         if lr is not None and relative_step:
-            raise ValueError("Cannot combine manual lr and relative_step options")
+            raise ValueError("Cannot combine manual `lr` and `relative_step=True` options")
         if warmup_init and not relative_step:
-            raise ValueError("warmup_init requires relative_step=True")
+            raise ValueError("`warmup_init=True` requires `relative_step=True`")
         defaults = dict(
             lr=lr,
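The two argument guards in this hunk can be exercised directly; the snippet below is an illustrative sanity check (the throwaway parameter list is not from the PR) showing the clarified error messages::

    import torch
    from transformers import Adafactor

    params = [torch.nn.Parameter(torch.zeros(1))]

    # Guard 1: a manual lr conflicts with relative_step=True (the default).
    try:
        Adafactor(params, lr=1e-3, relative_step=True)
    except ValueError as err:
        print(err)  # Cannot combine manual `lr` and `relative_step=True` options

    # Guard 2: warmup_init is only meaningful when relative_step=True.
    try:
        Adafactor(params, relative_step=False, warmup_init=True)
    except ValueError as err:
        print(err)  # `warmup_init=True` requires `relative_step=True`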