Improve wav2vec2/hubert model for pre-training (#2716)
Summary: This PR improves the Wav2Vec2/HuBERT model for pre-training.

- Proper initialization of the positional embedding and transformer modules is essential for pre-training. The accuracy on unmasked frames should be higher than on masked frames, since predicting them is an easier task, but without the initialization the accuracy on masked frames is higher than on unmasked frames. Comparing performance after two epochs with 16 GPUs:
  - With model initialization, the masked/unmasked frame accuracies are 0.08/0.11.
  - Without model initialization, the masked/unmasked frame accuracies are 0.06/0.04.
- After adding the model initialization, the gradient easily overflows (i.e., becomes `nan`). In the paper [Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision](https://arxiv.org/abs/2112.08778), the authors propose a simple but effective mitigation: scale down the query-key product and subtract its maximum value (subtracting a constant does not change the output of softmax). This guarantees the values will not overflow.
- In the original fairseq, the mask indices are generated by `numpy.random.choice`. Here, `torch.multinomial` is replaced with `torch.randperm`. (cc carolineechen)

Other improvements to the training scripts will be included in a separate PR.

Pull Request resolved: https://github.com/pytorch/audio/pull/2716

Reviewed By: xiaohui-zhang

Differential Revision: D39832189

Pulled By: nateanl

fbshipit-source-id: f4d2a473a79ad63add2dd16624bd155d5ce4de27
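The overflow mitigation described above can be sketched as follows. This is a minimal illustration, not the actual torchaudio implementation: the function name `stable_attention_scores` and its signature are hypothetical. The key point is that subtracting the per-row maximum from the scaled query-key logits leaves the softmax output mathematically unchanged while keeping the intermediate values in a safe numeric range (which matters for fp16 training):

```python
import torch

def stable_attention_scores(query, key, scale):
    # Hypothetical sketch of the overflow mitigation: scale down the
    # query/key product, then subtract the per-row maximum before softmax.
    # Subtracting a constant does not change the softmax output, but it
    # bounds the logits from above by 0, so exp() cannot overflow.
    scores = torch.matmul(query, key.transpose(-2, -1)) * scale
    scores = scores - scores.max(dim=-1, keepdim=True).values
    return torch.softmax(scores, dim=-1)
```

The stabilized version produces the same attention weights as the naive computation whenever the naive one does not overflow; the difference only shows up in reduced precision, where the naive logits can exceed the fp16 range.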
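The change to mask-index sampling can be illustrated with a short sketch, again with a hypothetical function name (`sample_mask_starts` is not from the PR). `torch.randperm` naturally yields distinct indices, so taking its first `num_masks` entries samples mask start positions without replacement, matching the behavior of `numpy.random.choice(..., replace=False)` in the original fairseq:

```python
import torch

def sample_mask_starts(sequence_length, num_masks):
    # Hypothetical sketch: draw `num_masks` distinct start indices in
    # [0, sequence_length) by permuting all positions and keeping the
    # first `num_masks`. torch.randperm guarantees uniqueness, unlike
    # torch.multinomial-based sampling with replacement.
    perm = torch.randperm(sequence_length)
    return perm[:num_masks]
```

Because the permutation is uniform over all orderings, each subset of `num_masks` positions is equally likely, which is the property mask-span generation needs.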