Reduce the number of parameter groups to make the optimizer more efficient
Summary: `torch.optim._multi_tensor` provides faster Optimizer implementations because it uses the foreach APIs. We can enable it by changing `OPTIMIZER: "ADAMW"` to `OPTIMIZER: "ADAMW_MT"` in the config file. To benefit from the speedup, we need to reduce the number of parameter groups, as suggested in this post: https://fb.workplace.com/groups/1405155842844877/permalink/4971600462867046/

The current implementation uses one parameter group per parameter, which is not optimal. The proposed change groups parameters by their (learning rate, weight decay) combination.

Reviewed By: zhanghang1989

Differential Revision: D30272112

fbshipit-source-id: d8d24298a59b52c2fc2930f7d614a0c6380a432f
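A minimal sketch of the grouping idea, assuming a simple rule where biases get no weight decay (the helper name `build_param_groups` and its arguments are hypothetical, not the actual implementation in this diff): parameters that share the same learning rate and weight decay end up in one group, so the multi-tensor optimizer can apply foreach kernels to a few large groups instead of many single-parameter ones.

```python
from collections import defaultdict
import torch

def build_param_groups(model, base_lr=1e-3, weight_decay=1e-4):
    # Bucket parameters by their (lr, weight_decay) combination so the
    # optimizer sees a handful of large groups instead of one per parameter.
    groups = defaultdict(list)
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        lr, wd = base_lr, weight_decay
        # Hypothetical override: disable weight decay for bias terms.
        if name.endswith(".bias"):
            wd = 0.0
        groups[(lr, wd)].append(param)

    return [
        {"params": params, "lr": lr, "weight_decay": wd}
        for (lr, wd), params in groups.items()
    ]

# Usage sketch with the multi-tensor AdamW variant:
# optimizer = torch.optim._multi_tensor.AdamW(build_param_groups(model), lr=1e-3)
```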