apps/kg/models/pytorch/tensor_models.py · edf2d52666e60683c276d6cb8aa813be942e4cbb · OpenDAS / dgl

[Dist][Optim] Fixed race conditions in distributed SparseAdam and SparseAdagrad (#3971) · edf2d526

ndickson-nvidia authored May 09, 2022

* Fixed a race condition in SparseAdam::update in distributed/optim/pytorch/sparse_optim.py. This corresponds to the bug fixed in the non-distributed version in https://github.com/dmlc/dgl/pull/3013, though this fix uses the newer Event-based approach from that corresponding function. Like the previously fixed bug, the race condition would often result in NaNs. See https://github.com/dmlc/dgl/issues/2760



* Fixed a race condition in SparseAdagrad::update corresponding to the one fixed in SparseAdam::update in the previous commit. The same information applies.

* Fixed a typo in all copies of a repeatedly copied comment near the bug fixed 3 commits ago, and checked all nearby implementations for a corresponding bug. (All of them appear to have been fixed as of 2 commits ago.)

* Removed trailing whitespace

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
Co-authored-by: Rhett Ying <85214957+Rhett-Ying@users.noreply.github.com>
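
For readers unfamiliar with this class of bug, here is a minimal sketch of the general pattern: an asynchronous copy whose result is read before the copy has finished, and an event that the consumer waits on to close that window. It is not the code from sparse_optim.py. It uses Python's `threading.Event` as a CPU-only stand-in for a CUDA event, and the names `async_copy` and `update_state` are hypothetical.

```python
import threading
import time


def async_copy(src, dst, done):
    """Copy src into dst on a background thread, then signal `done`.

    Stand-in for a non-blocking device-to-host copy (an assumption made
    for illustration; the real code works on tensors, not lists).
    """
    def worker():
        time.sleep(0.05)   # simulate copy latency
        dst[:] = src       # the actual "copy"
        done.set()         # signal that dst is now safe to read

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def update_state(state, idx, new_values):
    """Write new_values into state[idx] via an asynchronous staging buffer."""
    staging = [None] * len(new_values)
    done = threading.Event()
    thread = async_copy(new_values, staging, done)

    # Without this wait, `staging` could still hold its placeholder
    # contents when it is read below, which is the race being illustrated.
    done.wait()

    for i, value in zip(idx, staging):
        state[i] = value

    thread.join()
    return state


state = [0.0] * 5
update_state(state, [1, 3], [0.5, 0.25])
```

Removing the `done.wait()` call makes the loop read `staging` while it still holds placeholders, so stale or uninitialized values end up in the optimizer state. That is the same kind of failure the commit message above associates with NaNs.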


tensor_models.py 10.6 KB
