    Introduce DistributedBatchSampler (#2299) · 6411c9ad
    Zhaoheng Ni authored
    Summary:
When using a customized `batch_sampler`, pytorch_lightning can't wrap a distributed sampler around it. Hence we provide a `DistributedBatchSampler` that supports `BucketizeBatchSampler` in `ddp` mode.
    
The `DistributedBatchSampler` assumes `BucketizeBatchSampler.iter_list` is a list of lists, where each sub-list contains a batch of indices. Setting `shuffle` to `True` shuffles the sub-lists based on `seed` and the current `epoch` (see the sketch below).
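
A minimal sketch of how such a sampler can partition pre-formed batches across ranks. This is not the actual torchaudio implementation: the class name, constructor arguments, and the use of `list(batch_sampler)` instead of reading `iter_list` directly are illustrative assumptions.

```python
# Sketch only: distribute pre-formed batches (lists of indices) across DDP ranks.
from typing import Iterator, List

import torch
import torch.distributed as dist
from torch.utils.data import Sampler


class DistributedBatchSamplerSketch(Sampler):
    def __init__(self, batch_sampler, num_replicas=None, rank=None,
                 shuffle=True, seed=0, epoch=0):
        if num_replicas is None:
            num_replicas = dist.get_world_size()
        if rank is None:
            rank = dist.get_rank()
        self.num_replicas = num_replicas
        self.rank = rank
        # Materialize the inner sampler's batches once: a list of lists of indices.
        batches = list(batch_sampler)
        if shuffle:
            # Shuffling happens here, at construction time, using seed + epoch,
            # so __len__ stays consistent with what __iter__ later yields.
            g = torch.Generator()
            g.manual_seed(seed + epoch)
            perm = torch.randperm(len(batches), generator=g).tolist()
            batches = [batches[i] for i in perm]
        # Drop trailing batches so every rank sees the same number of batches.
        num_batches = len(batches) // num_replicas * num_replicas
        self.batches = batches[self.rank:num_batches:self.num_replicas]

    def __iter__(self) -> Iterator[List[int]]:
        return iter(self.batches)

    def __len__(self) -> int:
        return len(self.batches)
```

Because the shuffle is fixed at construction, a new epoch's ordering only takes effect when the sampler is rebuilt, which is why the dataloader must be recreated each epoch.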
    
The shuffle happens only during initialization and won't change unless the sampler is re-created. The reason is that reshuffling `BucketizeBatchSampler` may produce a different number of batches than before, so shuffling in ``__iter__`` could cause a mismatch between ``__len__`` and the real length.
Hence users need to set `reload_dataloaders_every_n_epochs=1` in pytorch_lightning's Trainer, so that the value of ``__len__`` matches the real length (see the usage sketch below).
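
A hypothetical usage sketch showing the Trainer flag in question; the accelerator, device count, and strategy values are placeholder assumptions.

```python
# Rebuild the dataloaders every epoch so the batch sampler is re-initialized
# (and re-shuffled) and its __len__ matches the batches actually yielded.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    reload_dataloaders_every_n_epochs=1,
)
```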
    
    Pull Request resolved: https://github.com/pytorch/audio/pull/2299
    
    Reviewed By: hwangjeff
    
    Differential Revision: D35781538
    
    Pulled By: nateanl
    
    fbshipit-source-id: 6e8396615497f1aeddab1ee5678830c0445c2b2a