Fix ConformerWav2Vec2PretrainModel (#3085)

Summary: Negative sampling should be applied to the unmasked features at the masked indices. Previously the negative sampler received the features after masking had been applied. This PR fixes the logic in ConformerWav2Vec2PretrainModel.

Pull Request resolved: https://github.com/pytorch/audio/pull/3085
Reviewed By: mthrok
Differential Revision: D43488570
Pulled By: nateanl
fbshipit-source-id: 3820400d50b74216bb98ca6a40dc6a7acca01564

Commit b35a5fcf authored Feb 22, 2023 by Zhaoheng Ni Committed by Facebook GitHub Bot Feb 22, 2023

Showing with 7 additions and 2 deletions

torchaudio/prototype/models/_conformer_wav2vec2.py +7 -2

--- a/torchaudio/prototype/models/_conformer_wav2vec2.py
+++ b/torchaudio/prototype/models/_conformer_wav2vec2.py
@@ -318,9 +318,14 @@ class ConformerWav2Vec2PretrainModel(Module):

        x = self.wav2vec2.encoder.feature_projection.layer_norm(x)
        x = self.wav2vec2.encoder.feature_projection.dropout(x)
-        x, mask_idxs = self.mask_generator(x, padding_mask)

-        targets, negs, neg_idxs = self.negative_sampler(x)
+        # Unmasked feature is used to generate positive and negative samples.
+        unmasked_x = x.clone()
+        # Apply masking to x before passing it to Conformer layers.
+        x, mask_idxs = self.mask_generator(x, padding_mask)
+        # Select the frames from masked indices for negative sampling.
+        unmasked_x = unmasked_x[mask_idxs].view(x.shape[0], -1, x.shape[-1])
+        targets, negs, neg_idxs = self.negative_sampler(unmasked_x)

        x = self.wav2vec2.encoder.feature_projection.projection(x)
        x = x.transpose(0, 1)
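
The indexing step in the added lines can be illustrated in isolation. The following is a minimal sketch, not the torchaudio implementation: the toy shapes, the hand-written boolean mask, and the zero "mask embedding" are assumptions made for illustration, and it assumes `mask_idxs` is a boolean tensor of shape `(batch, frames)` with the same number of masked frames in every row, which the `.view(...)` call relies on.

```python
import torch

# Hypothetical toy shapes: 2 utterances, 6 frames each, 4 feature dimensions.
batch, frames, dim = 2, 6, 4
x = torch.arange(batch * frames * dim, dtype=torch.float32).view(batch, frames, dim)

# Hypothetical boolean mask. Each row masks exactly 3 frames; the reshape
# below only works when every row has the same number of masked frames.
mask_idxs = torch.tensor(
    [
        [False, True, True, False, False, True],
        [True, False, False, True, True, False],
    ]
)

# Keep a copy of the features before masking overwrites them.
unmasked_x = x.clone()

# Stand-in for the mask generator: overwrite masked frames in place.
# Zeros are used here purely as a placeholder for a mask embedding.
x[mask_idxs] = 0.0

# Boolean indexing gathers the original frames at the masked positions in
# row-major order, giving (total_masked, dim); the view restores the batch axis.
masked_frames = unmasked_x[mask_idxs].view(batch, -1, dim)

print(masked_frames.shape)
```

The point of the copy is that `masked_frames` holds the original feature values at the masked positions, while `x` at those same positions has already been overwritten. Sampling targets and negatives from `x` after masking would therefore draw from the placeholder values rather than from the features.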