Replace `new_group` with `torch.distributed.group.WORLD` so that a new process group is no longer created in every iteration. This should resolve the issue "Training gets stuck when using SyncBN" (#105).
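
For reviewers, here is a minimal sketch of the pattern this change moves to. It is not the actual diff: `sync_sum` and `demo` are hypothetical names, and the single-process `gloo` setup with a file-based rendezvous is only an assumption chosen so the snippet can run on one machine without a launcher.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def sync_sum(tensor: torch.Tensor) -> torch.Tensor:
    """Sum `tensor` across all ranks using the default (WORLD) group."""
    # Pattern being replaced: creating a group on every call, e.g.
    #   group = dist.new_group()
    # new_group() must be entered by every process in the job, so calling
    # it once per iteration adds overhead and can stall training.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=dist.group.WORLD)
    return tensor


def demo() -> torch.Tensor:
    # Hypothetical single-process setup (world_size=1) for illustration only.
    init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=0,
        world_size=1,
    )
    try:
        result = torch.ones(3)
        # Repeated reductions reuse the same group; nothing is created per step.
        for _ in range(5):
            sync_sum(result)
        return result
    finally:
        dist.destroy_process_group()
```

With a single rank the reduction leaves the tensor unchanged; the point is only that repeated calls reuse the default group instead of allocating a new one each time.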