    Introduce DistributedBatchSampler (#2299) · 6411c9ad
    Zhaoheng Ni authored
    Summary:
When using a customized `batch_sampler`, pytorch_lightning can't wrap a distributed sampler around it. Hence we provide a `DistributedBatchSampler` that supports `BucketizeBatchSampler` in `ddp` mode.
    
The `DistributedBatchSampler` assumes `BucketizeBatchSampler.iter_list` is a list of lists, where each sub-list contains a batch of indices. Setting `shuffle` to `True` shuffles the sub-lists based on `seed` and the current `epoch` (see the sketch below).
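
A minimal sketch of how such a sampler can partition pre-formed batches across ranks. This is not the actual torchaudio implementation: the class name, constructor arguments, and the use of `list(batch_sampler)` instead of reading `iter_list` directly are illustrative assumptions.

```python
# Sketch only: distribute pre-formed batches (lists of indices) across DDP ranks.
from typing import Iterator, List

import torch
import torch.distributed as dist
from torch.utils.data import Sampler


class DistributedBatchSamplerSketch(Sampler):
    def __init__(self, batch_sampler, num_replicas=None, rank=None,
                 shuffle=True, seed=0, epoch=0):
        if num_replicas is None:
            num_replicas = dist.get_world_size()
        if rank is None:
            rank = dist.get_rank()
        self.num_replicas = num_replicas
        self.rank = rank
        # Materialize the inner sampler's batches once: a list of lists of indices.
        batches = list(batch_sampler)
        if shuffle:
            # Shuffling happens here, at construction time, using seed + epoch,
            # so __len__ stays consistent with what __iter__ later yields.
            g = torch.Generator()
            g.manual_seed(seed + epoch)
            perm = torch.randperm(len(batches), generator=g).tolist()
            batches = [batches[i] for i in perm]
        # Drop trailing batches so every rank sees the same number of batches.
        num_batches = len(batches) // num_replicas * num_replicas
        self.batches = batches[self.rank:num_batches:self.num_replicas]

    def __iter__(self) -> Iterator[List[int]]:
        return iter(self.batches)

    def __len__(self) -> int:
        return len(self.batches)
```

Because the shuffle is fixed at construction, a new epoch's ordering only takes effect when the sampler is rebuilt, which is why the dataloader must be recreated each epoch.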
    
The shuffle happens only during initialization and won't change unless the sampler is re-created. The reason is that reshuffling `BucketizeBatchSampler` may produce a different number of batches than before, so shuffling in ``__iter__`` could cause a mismatch between ``__len__`` and the real length.
Hence users need to set `reload_dataloaders_every_n_epochs=1` in pytorch_lightning's Trainer, so that the value of ``__len__`` matches the real length (see the usage sketch below).
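
A hypothetical usage sketch showing the Trainer flag in question; the accelerator, device count, and strategy values are placeholder assumptions.

```python
# Rebuild the dataloaders every epoch so the batch sampler is re-initialized
# (and re-shuffled) and its __len__ matches the batches actually yielded.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    reload_dataloaders_every_n_epochs=1,
)
```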
    
    Pull Request resolved: https://github.com/pytorch/audio/pull/2299
    
    Reviewed By: hwangjeff
    
    Differential Revision: D35781538
    
    Pulled By: nateanl
    
    fbshipit-source-id: 6e8396615497f1aeddab1ee5678830c0445c2b2a