1. 16 Oct, 2023 1 commit
  2. 25 Aug, 2023 1 commit
  3. 14 Jun, 2023 1 commit
  4. 08 May, 2023 1 commit
  5. 05 May, 2023 1 commit
  6. 09 Dec, 2022 4 commits
    • Update author and maintainer info (#2911) · b90d7988
      moto authored
      Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/2911
      
      Reviewed By: carolineechen
      
      Differential Revision: D41887854
      
      Pulled By: mthrok
      
      fbshipit-source-id: eb91773ec67b4cda2d70733df450956d83742509
    • Fix duplicated memory allocation in StreamWriter (#2906) · 4adbd54a
      Moto Hira authored
      Summary:
      Pull Request resolved: https://github.com/pytorch/audio/pull/2906
      
      The correct way to create an `AVFormatContext` for output is to pass the address of an uninitialized `AVFormatContext*` pointer to the `avformat_alloc_output_context2` function.
      
      The current code pre-allocates an `AVFormatContext` with `avformat_alloc_context`; `avformat_alloc_output_context2` then overwrites the pointer, and the pre-allocated object is leaked.
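      A minimal sketch of the corrected pattern (the output file name and error handling are illustrative, not taken from the actual StreamWriter code):

      ```cpp
      extern "C" {
      #include <libavformat/avformat.h>
      }

      int main() {
        // Correct: pass the address of a null AVFormatContext* and let
        // avformat_alloc_output_context2 allocate the context itself.
        AVFormatContext* ctx = nullptr;
        if (avformat_alloc_output_context2(&ctx, nullptr, nullptr, "out.mp4") < 0) {
          return 1;
        }

        // Buggy pattern removed by this commit:
        //   AVFormatContext* ctx = avformat_alloc_context();
        //   avformat_alloc_output_context2(&ctx, ...);  // overwrites ctx; the
        //                                               // pre-allocated object leaks

        avformat_free_context(ctx);
        return 0;
      }
      ```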
      
      Reviewed By: xiaohui-zhang
      
      Differential Revision: D41865685
      
      fbshipit-source-id: 9a9dc83b5acfe9b450f191fe716c85ebb5a5d842
    • Fix wrong frame allocation in StreamWriter (#2905) · 30a1070c
      Moto Hira authored
      Summary:
      Pull Request resolved: https://github.com/pytorch/audio/pull/2905
      
      In StreamWriter, if the tensor format is different from the encoding format, then a FilterGraph object is automatically inserted to convert the format.
      
      The FilterGraph object operates on AVFrames. The input AVFrame must be allocated by us, but the output AVFrame is filled by FilterGraph, so there is no need to allocate its buffers.

      The output AVFrame is then passed to the encoder regardless of whether a FilterGraph was inserted, so when no FilterGraph is used, the output AVFrame has to be allocated manually by us.

      The current code flips this condition: it allocates the AVFrame when a FilterGraph is present and does not allocate it otherwise.
      
      This commit fixes that.
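      A sketch of the corrected logic (simplified; the function name, flag, and frame parameters are illustrative, not the actual StreamWriter members):

      ```cpp
      extern "C" {
      #include <libavutil/channel_layout.h>
      #include <libavutil/frame.h>
      #include <libavutil/samplefmt.h>
      }

      // Allocate buffers for the output frame only when no FilterGraph is used,
      // i.e. when we write the frame ourselves before handing it to the encoder.
      AVFrame* make_output_frame(bool uses_filter_graph) {
        AVFrame* frame = av_frame_alloc();
        if (!uses_filter_graph) {
          // Illustrative audio parameters (FFmpeg 4.x-style channel layout);
          // the real values come from the output stream configuration.
          frame->format = AV_SAMPLE_FMT_FLTP;
          frame->channel_layout = AV_CH_LAYOUT_STEREO;
          frame->sample_rate = 44100;
          frame->nb_samples = 1024;
          av_frame_get_buffer(frame, 0);
        }
        // Otherwise FilterGraph fills the frame and allocates its buffers for us.
        return frame;
      }
      ```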
      
      Reviewed By: xiaohui-zhang
      
      Differential Revision: D41866198
      
      fbshipit-source-id: 40799c147dc8166a979ecfb58ed8e502539a6aed
    • fbf968c0
      Andrey Talman authored
  7. 04 Dec, 2022 1 commit
  8. 18 Nov, 2022 4 commits
  9. 16 Nov, 2022 4 commits
  10. 15 Nov, 2022 1 commit
  11. 03 Nov, 2022 2 commits
  12. 02 Nov, 2022 1 commit
  13. 29 Oct, 2022 1 commit
  14. 20 Oct, 2022 1 commit
  15. 19 Oct, 2022 4 commits
  16. 18 Oct, 2022 1 commit
  17. 17 Oct, 2022 1 commit
  18. 14 Oct, 2022 2 commits
  19. 13 Oct, 2022 5 commits
  20. 12 Oct, 2022 3 commits
    • Improve hubert recipe for pre-training and fine-tuning (#2744) · 928248d7
      Zhaoheng Ni authored
      Summary:
      Follow-up to PR https://github.com/pytorch/audio/issues/2716
      - For preprocessing
        - The HuBERT features take a lot of memory and may not fit on some machines. Enable training the k-means model on a subset of the features.
      
      - For pre-training
        - Normalize the loss based on the total number of masked frames across all GPUs.
        - Use mixed precision training; fp16 is not well supported in pytorch_lightning.
        - Log the accuracies of masked/unmasked frames during training.
        - Clip the gradients to a maximum norm of `10.0` (see the sketch after the WER table below).
      
      - For ASR fine-tuning
        - Normalize the loss based on the total number of batches across all GPUs, as in TorchAudio's Conformer recipe.
        - Use mixed precision training.
        - Add "|" after the end of transcription to capture the silence/word termination, same as in fairseq recipe.
      
      - Update the WER results on LibriSpeech dev and test sets.
      
      |                   | WER% (Viterbi)|  WER% (KenLM) |
      |:-----------------:|--------------:|--------------:|
      | dev-clean         |       10.9    |       4.2     |
      | dev-other         |       17.5    |       9.4     |
      | test-clean        |       10.9    |       4.4     |
      | test-other        |       17.8    |       9.5     |
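      For illustration only, a minimal libtorch (C++) sketch of the gradient clipping and the masked/unmasked accuracy logging mentioned above; the actual recipe is Python/pytorch_lightning, and all names and shapes here are assumptions:

      ```cpp
      #include <torch/torch.h>

      // Accuracy over a subset of frames selected by a boolean mask
      // (illustrative shapes: logits [T, C], targets [T], mask [T]).
      double frame_accuracy(const torch::Tensor& logits,
                            const torch::Tensor& targets,
                            const torch::Tensor& mask) {
        auto correct = (logits.argmax(-1) == targets).masked_select(mask);
        return correct.to(torch::kFloat).mean().item<double>();
      }

      // Clip gradients to a maximum norm of 10.0 before the optimizer step.
      void clip_and_step(torch::nn::Module& model, torch::optim::Optimizer& opt) {
        torch::nn::utils::clip_grad_norm_(model.parameters(), /*max_norm=*/10.0);
        opt.step();
      }
      ```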
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2744
      
      Reviewed By: carolineechen
      
      Differential Revision: D40282322
      
      Pulled By: nateanl
      
      fbshipit-source-id: 4723584c912e70e8970149fe09de005385eaab90
    • Skip hubert xlarge torchscript test (#2758) · 97baba1b
      Caroline Chen authored
      Summary:
      A couple of CircleCI unit tests are failing during the HuBERT XLarge TorchScript test, which has been known to fail on Windows in the past (#65776). This PR disables the test on CircleCI.
      
      cc atalman
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2758
      
      Reviewed By: mthrok
      
      Differential Revision: D40290535
      
      Pulled By: carolineechen
      
      fbshipit-source-id: 5c5fb43434a517b6c439a8cb8e853015d1550a57
    • Improve wav2vec2/hubert model for pre-training (#2716) · 6de7bb98
      Zhaoheng Ni authored
      Summary:
      This PR improves the Wav2Vec2/HuBERT model regarding model pre-training.
      
      - Proper initialization of the positional embedding and transformer modules is essential for pre-training. The accuracy on unmasked frames should be higher than on masked frames, since predicting unmasked frames is the easier task; without the initialization, however, the accuracy on masked frames is higher than on unmasked frames.
        Performance was compared after two epochs with 16 GPUs:
        - With model initialization, the accuracies of masked/unmasked frames are 0.08/0.11.
        - Without model initialization, the accuracies of masked/unmasked frames are 0.06/0.04.
      - After adding the model initialization, the gradient easily overflows (i.e., `nan` gradients). In the paper [Self-Supervised Learning for speech recognition with Intermediate layer supervision](https://arxiv.org/abs/2112.08778), the authors propose a simple but effective way to mitigate the overflow: scale down the product of query and key, and subtract its maximum value before the softmax (subtracting a constant does not change the softmax output). This guarantees the values cannot overflow; see the first sketch after this list.
      - In the original fairseq, the mask indices are generated by `numpy.random.choice`. Here, `torch.multinomial` is replaced with `torch.randperm`; see the second sketch after this list (cc carolineechen).
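      A minimal libtorch (C++) sketch of the attention-score stabilization described above (the actual implementation is Python; the function name and shapes are illustrative):

      ```cpp
      #include <torch/torch.h>
      #include <cmath>

      // Scaled dot-product attention weights with max-subtraction.
      // Subtracting the per-row maximum does not change the softmax output,
      // but keeps the exp() inputs <= 0 so they cannot overflow in fp16.
      torch::Tensor stable_attention_weights(const torch::Tensor& q,
                                             const torch::Tensor& k,
                                             int64_t head_dim) {
        auto scores = torch::matmul(q, k.transpose(-2, -1)) /
                      std::sqrt(static_cast<double>(head_dim));
        scores = scores - std::get<0>(scores.max(/*dim=*/-1, /*keepdim=*/true));
        return torch::softmax(scores, /*dim=*/-1);
      }
      ```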
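      And a sketch of the mask-index sampling via a random permutation (an illustrative helper, not the recipe's actual function):

      ```cpp
      #include <torch/torch.h>

      // Pick `num_mask` distinct frame indices out of `num_frames` by taking
      // the head of a random permutation, mirroring numpy.random.choice
      // with replace=False.
      torch::Tensor sample_mask_indices(int64_t num_frames, int64_t num_mask) {
        return torch::randperm(num_frames).slice(/*dim=*/0, /*start=*/0, /*end=*/num_mask);
      }
      ```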
      
      Other improvements within training scripts will be included in a separate PR.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2716
      
      Reviewed By: xiaohui-zhang
      
      Differential Revision: D39832189
      
      Pulled By: nateanl
      
      fbshipit-source-id: f4d2a473a79ad63add2dd16624bd155d5ce4de27