Alternatively, relative_step with warmup_init can be used.
Training without LR warmup or a clip threshold is not recommended. Additional optimizer operations such as gradient clipping should not be used alongside Adafactor.
- Use a scheduled LR warm-up to a fixed LR
- Disable relative updates
- Use a clip threshold: https://arxiv.org/abs/2004.14546
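The settings listed above can be sketched in plain Python. This is a minimal illustration, not a definitive recipe: the keyword names assume an Adafactor constructor shaped like the one in Hugging Face transformers, and `FIXED_LR`, `WARMUP_STEPS`, `warmup_factor`, and `lr_at` are hypothetical names with illustrative values.

```python
# Hypothetical values chosen for illustration only.
FIXED_LR = 1e-3
WARMUP_STEPS = 1000

# Manual-LR setup matching the list above. The keyword names are an
# assumption (they follow the transformers Adafactor constructor).
manual_lr_kwargs = {
    "lr": FIXED_LR,          # fixed LR reached once warm-up ends
    "relative_step": False,  # disable relative updates
    "warmup_init": False,    # warm-up comes from the external schedule
    "clip_threshold": 1.0,   # Adafactor's built-in update clipping
}

# The alternative: the optimizer derives its own step size, so no
# manual LR is given.
relative_step_kwargs = {
    "lr": None,
    "relative_step": True,
    "warmup_init": True,
}


def warmup_factor(step: int, warmup_steps: int = WARMUP_STEPS) -> float:
    """Multiplier on the fixed LR: linear ramp from 0 to 1, then constant."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


def lr_at(step: int) -> float:
    """Effective learning rate at a given optimizer step."""
    return FIXED_LR * warmup_factor(step)
```

In a PyTorch training loop, one would typically pass `manual_lr_kwargs` to the optimizer constructor and hand `warmup_factor` to `torch.optim.lr_scheduler.LambdaLR` as its `lr_lambda`. No separate gradient-clipping call is added, since the clip threshold already bounds the update.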