[feat] ShardedDataParallel with autoreduce (#157)
* Rewrite using autograd and the Variable execution queue to make the reduce automatic
* Share buckets with OSS to remove duplication
* Some speed is likely still on the table, since the measured speed vs. bucketing does not match expectations; this could be a follow-up
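As a rough illustration of the "automatic reduce with buckets" idea described above, here is a minimal pure-Python sketch. It is not the code from this commit: `Bucket`, `AutoReducer`, `on_grad_ready`, and `reduce_fn` are hypothetical names, gradients are plain floats, and the reduce call is a stand-in callback rather than a real collective. It only shows the control flow: each gradient-ready callback drops the gradient into the bucket of the rank that owns the parameter, and the reduce fires on its own once that bucket is full.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Bucket:
    """Hypothetical bucket: holds the gradients owned by one rank."""
    owner_rank: int
    expected: int  # number of gradients this bucket waits for
    grads: Dict[str, float] = field(default_factory=dict)

    def is_full(self) -> bool:
        return len(self.grads) == self.expected


class AutoReducer:
    """Hypothetical helper: triggers a reduce as soon as a bucket fills up."""

    def __init__(
        self,
        param_to_rank: Dict[str, int],
        reduce_fn: Callable[[int, Dict[str, float]], None],
    ) -> None:
        self.param_to_rank = dict(param_to_rank)
        self.reduce_fn = reduce_fn  # stand-in for a real reduce collective
        # One bucket per owning rank, sized by how many parameters it owns.
        counts: Dict[int, int] = {}
        for rank in self.param_to_rank.values():
            counts[rank] = counts.get(rank, 0) + 1
        self.buckets = {rank: Bucket(rank, n) for rank, n in counts.items()}

    def on_grad_ready(self, name: str, grad: float) -> None:
        """Called once per parameter when its gradient has been computed."""
        bucket = self.buckets[self.param_to_rank[name]]
        bucket.grads[name] = grad
        if bucket.is_full():
            # No explicit reduce call by the user: the last gradient triggers it.
            self.reduce_fn(bucket.owner_rank, dict(bucket.grads))
            bucket.grads.clear()  # ready for the next backward pass


# Usage: record reduce calls instead of communicating.
calls = []
reducer = AutoReducer(
    {"w1": 0, "b1": 0, "w2": 1},
    lambda rank, grads: calls.append((rank, grads)),
)
reducer.on_grad_ready("w2", 0.5)  # fills the rank 1 bucket, reduce fires
reducer.on_grad_ready("w1", 0.1)  # rank 0 bucket still waiting for b1
reducer.on_grad_ready("b1", 0.2)  # fills the rank 0 bucket, reduce fires
```

In a real implementation the callbacks would presumably be driven by autograd during the backward pass and the reduce would be an asynchronous collective; the sketch leaves both out to keep the bucketing logic visible.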