Alternatively, relative_step with warmup_init can be used.
- use clip threshold: https://arxiv.org/abs/2004.14546
Training without LR warmup or clip threshold is not recommended. Additional optimizer operations like gradient clipping should not be used alongside Adafactor.
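
To make the two recommendations above concrete, here is a minimal pure-Python sketch of what a relative step with warmup and a clip threshold compute. It is an illustration under stated assumptions, not the library's code: the helper names (`relative_step_size`, `rms`, `clip_update`) are hypothetical, the constants `1e-6` and `1e-2` are assumed from the commonly used Adafactor schedule, and parameter scaling is left out for brevity.

```python
import math


def relative_step_size(step, warmup_init=True):
    """Relative step size at a 1-based step count (assumed schedule).

    With warmup_init the size grows linearly from 1e-6 and is then capped
    by the inverse square root decay; without it the cap is a constant 1e-2.
    """
    min_step = 1e-6 * step if warmup_init else 1e-2
    return min(min_step, 1.0 / math.sqrt(step))


def rms(values):
    """Root mean square of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def clip_update(update, clip_threshold=1.0):
    """Scale the update down so its RMS does not exceed clip_threshold.

    Updates that are already below the threshold are returned unchanged.
    """
    scale = max(1.0, rms(update) / clip_threshold)
    return [u / scale for u in update]


if __name__ == "__main__":
    for step in (1, 100, 10**6):
        print(step, relative_step_size(step))
    print(clip_update([3.0, 4.0], clip_threshold=1.0))
```

Because the update is already rescaled by its own RMS, stacking a separate gradient clipping step on top would bound the same quantity twice, which is one way to read the advice against combining the two.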