[test] AdaScale & SDP/FSDP (#468)
- cover them in terms of code path only - numerically, AdaScale is different on SDP/FSDP than DDP, mainly due to partial view of the gradients. - this doesn't mean it is definitely not useful but it is yet to be validated. - not going to spend too much time until we have a real use case.
Showing
Please register or sign in to comment