    [test] AdaScale & SDP/FSDP (#468) · efed9cee
    Min Xu authored
    - cover AdaScale with SDP/FSDP in terms of code path only
    - numerically, AdaScale behaves differently on SDP/FSDP than on DDP,
      mainly due to each rank's partial view of the gradients.
    - this doesn't mean it is not useful there, but that is yet to
      be validated.
    - not going to spend too much time on this until we have a real use case.
optimizer.pyi 738 Bytes