1. 08 Nov, 2022 2 commits
    • Caroline Chen's avatar
      Enable log probs input for rnnt loss (#2798) · ca478823
      Caroline Chen authored
      Summary:
      Add `fused_log_softmax` argument (default/current behavior = True) to rnnt loss.
      
      If it is set to `False`, call `log_softmax` on the logits before passing them to the rnnt loss function.
      
      The following two calls should produce the same output:
      ```
      rnnt_loss(logits, targets, logit_lengths, target_lengths, fused_log_softmax=True)
      ```
      
      ```
      log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
      rnnt_loss(log_probs, targets, logit_lengths, target_lengths, fused_log_softmax=False)
      ```
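
      Not part of the original note: a minimal pure-Python stand-in for `torch.nn.functional.log_softmax` (illustrative only, with made-up values), showing the kind of log-probabilities the unfused path expects and why they are numerically well behaved:

      ```python
      import math

      def log_softmax(logits):
          # Stable log-softmax over a 1-D list: subtract the max before exp().
          m = max(logits)
          log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
          return [x - log_sum for x in logits]

      logits = [2.0, 1.0, 0.1]
      log_probs = log_softmax(logits)

      # Exponentiating the outputs recovers a distribution that sums to 1.
      assert abs(sum(math.exp(x) for x in log_probs) - 1.0) < 1e-9
      ```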
      
      Testing: unit tests, plus verified that the Conformer RNN-T recipe produces the same results.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2798
      
      Reviewed By: xiaohui-zhang
      
      Differential Revision: D41083523
      
      Pulled By: carolineechen
      
      fbshipit-source-id: e15442ceed1f461bbf06b724aa0561ff8827ad61
      ca478823
    • hwangjeff's avatar
      Add convolution transforms (#2811) · 2d99fee2
      hwangjeff authored
      Summary:
      Adds `torch.nn.Module`-based implementations for convolution and FFT convolution.
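
      As a rough sketch (not the torchaudio implementation; the helper name is hypothetical), a direct "full" convolution over plain Python lists shows what such a transform computes:

      ```python
      def convolve_full(x, y):
          # Direct 'full' linear convolution of two 1-D sequences.
          # The output has length len(x) + len(y) - 1.
          n, m = len(x), len(y)
          out = [0.0] * (n + m - 1)
          for i in range(n):
              for j in range(m):
                  out[i + j] += x[i] * y[j]
          return out

      assert convolve_full([1, 2, 3], [0, 1, 0.5]) == [0.0, 1.0, 2.5, 4.0, 1.5]
      ```

      An FFT convolution produces the same result via the convolution theorem, which is typically faster for long signals.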
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2811
      
      Reviewed By: carolineechen
      
      Differential Revision: D40881937
      
      Pulled By: hwangjeff
      
      fbshipit-source-id: bfe8969e6178ad4f58981efd4b2720ac006be8de
      2d99fee2
  2. 04 Nov, 2022 1 commit
  3. 03 Nov, 2022 1 commit
  4. 02 Nov, 2022 5 commits
  5. 01 Nov, 2022 1 commit
    • hwangjeff's avatar
      Fix convolve mode docstring (#2809) · 6318c81f
      hwangjeff authored
      Summary:
      Argument `mode` in `convolve` and `fftconvolve` is expected to be a string, but the docstrings incorrectly say bool. This PR fixes the docstrings accordingly.
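
      For context (assuming the usual SciPy-style mode values, which is an assumption, not something stated in this commit), the three string modes differ only in how much of the overlap they keep:

      ```python
      def convolve_output_length(n, m, mode):
          # Output length of convolving length-n and length-m signals
          # for the three standard string modes.
          if mode == "full":   # every point of overlap
              return n + m - 1
          if mode == "same":   # same length as the first input
              return n
          if mode == "valid":  # only positions with complete overlap
              return max(n, m) - min(n, m) + 1
          raise ValueError(f"unknown mode: {mode!r}")

      assert convolve_output_length(10, 4, "full") == 13
      assert convolve_output_length(10, 4, "same") == 10
      assert convolve_output_length(10, 4, "valid") == 7
      ```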
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2809
      
      Reviewed By: nateanl
      
      Differential Revision: D40854464
      
      Pulled By: hwangjeff
      
      fbshipit-source-id: 75b339ba34715723c93b91e7d48be2ed28bee115
      6318c81f
  6. 31 Oct, 2022 1 commit
  7. 29 Oct, 2022 1 commit
  8. 28 Oct, 2022 2 commits
  9. 27 Oct, 2022 1 commit
  10. 26 Oct, 2022 2 commits
    • hwangjeff's avatar
      Deprecate 'onesided' init param for MelSpectrogram (#2797) · 546e699a
      hwangjeff authored
      Summary:
      Initializer parameter `onesided` isn't relevant to `MelSpectrogram` — it should always be `True`. In fact, the module already assumes `onesided == True` in the filterbank it generates and fails in its forward pass when `onesided == False`. Accordingly, this PR makes param `onesided` optional and adds a deprecation warning that's fired when the param is provided.
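
      A hedged sketch of why `onesided` is redundant here (helper name is illustrative): for real input the STFT is conjugate-symmetric, so the mel filterbank is always built against the onesided bin count:

      ```python
      def spectrogram_bins(n_fft, onesided=True):
          # For real input the spectrum is conjugate-symmetric, so the onesided
          # form keeps only the n_fft // 2 + 1 non-redundant frequency bins.
          return n_fft // 2 + 1 if onesided else n_fft

      assert spectrogram_bins(400) == 201
      assert spectrogram_bins(400, onesided=False) == 400
      ```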
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2797
      
      Reviewed By: carolineechen, xiaohui-zhang
      
      Differential Revision: D40731238
      
      Pulled By: hwangjeff
      
      fbshipit-source-id: 6eea8eb9d4a85a805162e03ad91682a1946f92cd
      546e699a
    • moto's avatar
      Refactor StreamProcessor interface (#2791) · 9e1999ae
      moto authored
      Summary:
      A StreamProcessor is constructed on top of an AVStream object, and attaches output streams defined by client code.
      
      This commit refactors the constructor and the `add_stream` method so that `add_stream`'s signature is centered around the parameters required for filter construction.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2791
      
      Reviewed By: xiaohui-zhang
      
      Differential Revision: D40667979
      
      Pulled By: mthrok
      
      fbshipit-source-id: 42220832f09a7895ede3cddf969d57feeb4ef7ec
      9e1999ae
  11. 25 Oct, 2022 1 commit
    • moto's avatar
      Fix issue with the missing video frame in StreamWriter (#2789) · 17a2b93b
      moto authored
      Summary:
      Addresses https://github.com/pytorch/audio/issues/2790.
      
      Previously, AVPacket objects had `duration == 0`.
      
      The `av_interleaved_write_frame` function infers the duration of each packet by comparing it against the next one, but it could not infer the duration of the last packet, as there is no subsequent frame, and thus omitted it from the final data.
      
      This commit fixes the issue by explicitly setting the packet duration to 1 (one frame), for video only. (An audio AVPacket contains multiple samples, so it is handled differently; tests were added to ensure correctness for audio.)
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2789
      
      Reviewed By: xiaohui-zhang
      
      Differential Revision: D40627439
      
      Pulled By: mthrok
      
      fbshipit-source-id: 4d0d827bff518c017b115445e03bdf0bf1e68320
      17a2b93b
  12. 21 Oct, 2022 1 commit
  13. 20 Oct, 2022 1 commit
  14. 19 Oct, 2022 7 commits
  15. 18 Oct, 2022 1 commit
  16. 17 Oct, 2022 1 commit
  17. 14 Oct, 2022 2 commits
  18. 13 Oct, 2022 4 commits
  19. 12 Oct, 2022 4 commits
    • Nikita Shulga's avatar
      Fix typos in tacotron2 tutorial (#2761) · 7aabcbd4
      Nikita Shulga authored
      Summary:
      `publishe`->`published`
      
      Also, not sure if it should be `pre-trained weight is published` or `pre-trained weights are published`
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2761
      
      Reviewed By: carolineechen
      
      Differential Revision: D40313042
      
      Pulled By: malfet
      
      fbshipit-source-id: c22085ca0b1125a06aa04bf38231d0a9fbfed00b
      7aabcbd4
    • Zhaoheng Ni's avatar
      Improve hubert recipe for pre-training and fine-tuning (#2744) · 27433050
      Zhaoheng Ni authored
      Summary:
      Follow-up to PR https://github.com/pytorch/audio/issues/2716
      - For preprocessing
        - The HuBERT features take a lot of memory, which may not fit on some machines. Add the option to use a subset of the features for training the k-means model.
      
      - For pre-training
        - Normalize the loss based on the total number of masked frames across all GPUs.
        - Use mixed precision training. fp16 is not well supported in pytorch_lightning.
        - Log accuracies of masked/unmasked frames during training.
        - Clip the gradients with norm `10.0`.
      
      - For ASR fine-tuning
        - Normalize the loss based on the total number of batches across all GPUs, same as in the conformer recipe of TorchAudio.
        - Use mixed precision training.
        - Add "|" at the end of each transcript to capture silence/word termination, same as in the fairseq recipe.
      
      - Update the WER results on LibriSpeech dev and test sets.
      
      |                   | WER% (Viterbi)|  WER% (KenLM) |
      |:-----------------:|--------------:|--------------:|
      | dev-clean         |       10.9    |       4.2     |
      | dev-other         |       17.5    |       9.4     |
      | test-clean        |       10.9    |       4.4     |
      | test-other        |       17.8    |       9.5     |
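
      The gradient clipping mentioned above can be sketched in pure Python (illustrative only; the recipe itself presumably uses the framework's clip-by-norm utility):

      ```python
      import math

      def clip_grad_norm(grads, max_norm=10.0):
          # Rescale gradients so their global L2 norm is at most max_norm.
          total_norm = math.sqrt(sum(g * g for g in grads))
          if total_norm <= max_norm:
              return grads
          scale = max_norm / total_norm
          return [g * scale for g in grads]

      clipped = clip_grad_norm([30.0, 40.0])  # global norm 50 -> rescaled to 10
      assert abs(math.sqrt(sum(g * g for g in clipped)) - 10.0) < 1e-9
      ```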
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2744
      
      Reviewed By: carolineechen
      
      Differential Revision: D40282322
      
      Pulled By: nateanl
      
      fbshipit-source-id: 4723584c912e70e8970149fe09de005385eaab90
      27433050
    • Zhaoheng Ni's avatar
      Improve wav2vec2/hubert model for pre-training (#2716) · c5bd93b6
      Zhaoheng Ni authored
      Summary:
      This PR improves the Wav2Vec2/HuBERT model regarding model pre-training.
      
      - The model initialization of the positional embedding and transformer modules is essential to pre-training. The accuracy on unmasked frames should be higher than on masked frames, as it is an easier task; without the initialization, however, the accuracy on masked frames is higher than on unmasked frames.
        Compared the performance after two epochs with 16 GPUs.
        - With model initialization, the accuracies of masked/unmasked frames are 0.08/0.11.
        - Without model initialization, the accuracies of masked/unmasked frames are 0.06/0.04.
      - After adding the model initialization, the gradient easily overflows (aka `nan` gradient). In the paper [Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision](https://arxiv.org/abs/2112.08778), the authors propose a simple but effective method to mitigate the overflow: scale down the product of query and key and subtract its maximum value (subtracting a constant doesn't change the output of softmax). This guarantees the values won't overflow.
      - In the original fairseq, the mask indices are generated by `numpy.random.choice`. Here we replace `torch.multinomial` with `torch.randperm`. (cc carolineechen)
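
      The max-subtraction trick above relies on softmax being shift-invariant; a minimal pure-Python sketch (illustrative, not the model code):

      ```python
      import math

      def softmax(scores):
          # Subtract the max first: every exponent becomes <= 0, so exp()
          # cannot overflow, while the output distribution is unchanged.
          m = max(scores)
          exps = [math.exp(s - m) for s in scores]
          total = sum(exps)
          return [e / total for e in exps]

      a = softmax([1.0, 2.0, 3.0])
      b = softmax([1001.0, 1002.0, 1003.0])  # naive exp() would overflow here
      assert all(abs(x - y) < 1e-12 for x, y in zip(a, b))
      ```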
      
      Other improvements within training scripts will be included in a separate PR.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2716
      
      Reviewed By: xiaohui-zhang
      
      Differential Revision: D39832189
      
      Pulled By: nateanl
      
      fbshipit-source-id: f4d2a473a79ad63add2dd16624bd155d5ce4de27
      c5bd93b6
    • Caroline Chen's avatar
      Skip hubert xlarge torchscript test (#2758) · c2ea6898
      Caroline Chen authored
      Summary:
      A couple of CircleCI unit tests are failing during the HuBERT xlarge TorchScript test, which has been known to fail on Windows in the past (#65776). This PR disables the test on CircleCI.
      
      cc atalman
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2758
      
      Reviewed By: mthrok
      
      Differential Revision: D40290535
      
      Pulled By: carolineechen
      
      fbshipit-source-id: 5c5fb43434a517b6c439a8cb8e853015d1550a57
      c2ea6898
  20. 11 Oct, 2022 1 commit