1. 28 Jul, 2023 1 commit
    • Move TorchAudio-Squim models to Beta (#3512) · b7d2d928
      Zhaoheng Ni authored
      Summary:
      This PR moves the `SquimObjective` and `SquimSubjective` models, together with the corresponding factory functions and pre-trained pipelines, out of prototype and into the core directory. They will be included in the next official release.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3512
      
      Reviewed By: mthrok
      
      Differential Revision: D47837434
      
      Pulled By: nateanl
      
      fbshipit-source-id: d0639f29079f7e1afc30f236849e530c8cadffd8
  2. 26 Jul, 2023 1 commit
  3. 25 Jul, 2023 1 commit
  4. 17 Jul, 2023 1 commit
  5. 12 Jul, 2023 1 commit
    • Support multiple FFmpeg versions (#3464) · 786066b4
      moto authored
      Summary:
      This commit introduces support for multiple FFmpeg versions for OSS binary distributions.
      
      Currently, torchaudio only works with FFmpeg 4, which is inconvenient at every stage from installation to runtime linking.
      This commit allows torchaudio to pick FFmpeg 4, 5, or 6 at runtime, instead of looking only for v4.
      
      The way it works is that we compile the FFmpeg extension three times, each against a different FFmpeg version, and ship all three.
      At runtime, we look for specific versions of libavutil and, when one is found, load the corresponding FFmpeg extension.
      The order of preference is 6, 5, then 4.
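
      The runtime selection described above can be sketched as follows. This is a minimal illustration, not torchaudio's actual loader: `pick_ffmpeg_major` and the `probe` callable are hypothetical names, and the table maps FFmpeg major versions 6, 5, and 4 to the libavutil major versions 58, 57, and 56 that upstream FFmpeg ships with them.

```python
# FFmpeg major version -> libavutil major version (upstream FFmpeg versioning).
# Listed in the order of preference stated above: 6, 5, then 4.
LIBAVUTIL_MAJOR = {6: 58, 5: 57, 4: 56}


def pick_ffmpeg_major(probe):
    """Return the first FFmpeg major version whose libavutil is found, else None.

    `probe` is a hypothetical callable that takes a libavutil major version
    and returns True when that library can be located on the system.
    """
    for ffmpeg_major, avutil_major in LIBAVUTIL_MAJOR.items():
        if probe(avutil_major):
            return ffmpeg_major
    return None
```

      For example, `pick_ffmpeg_major(lambda major: major in {56, 57})` selects FFmpeg 5, because libavutil 58 is missing and 57 is the next preference.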
      
      To make the build process simple and reproducible, we use pre-built binaries of FFmpeg during the build.
      These binaries are LGPL builds and are downloaded from S3 at build time, rather than being compiled on every build.
      
      The use of pre-built binaries as scaffolding limits the systems on which torchaudio can be built, so this commit also introduces
      a single-FFmpeg-version support mode. Setting FFMPEG_ROOT during the build changes the way binaries are built
      so that only one specific version of FFmpeg is supported.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3464
      
      Differential Revision: D47300223
      
      Pulled By: mthrok
      
      fbshipit-source-id: 560c7968315e4c8922afa11a4693f648c0356d04
  6. 10 Jul, 2023 1 commit
  7. 05 Jul, 2023 1 commit
  8. 21 Jun, 2023 1 commit
  9. 14 Jun, 2023 1 commit
  10. 09 Jun, 2023 1 commit
  11. 08 Jun, 2023 2 commits
    • Introduce chroma filter bank function (#3395) · dfd0c5fd
      Jeff Hwang authored
      Summary:
      Pull Request resolved: https://github.com/pytorch/audio/pull/3395
      
      Adds chroma filter bank function `chroma_filterbank` to `torchaudio.prototype.functional`.
      
      Reviewed By: mthrok
      
      Differential Revision: D46307672
      
      fbshipit-source-id: c5d8104a8bb03da70d0629b5cc224e0d897148d5
    • Delay the initialization of CUDA tensor converter (#3419) · 7dff24ca
      moto authored
      Summary:
      The StreamReader decoding process is composed of three steps:
      
      1. Decode the incoming AVPacket into AVFrame
      2. Pass AVFrame through AVFilter to perform post-processing
      3. Convert the resulting AVFrame
      
      The internals of StreamReader were refactored in https://github.com/pytorch/audio/issues/3188 so that the above pipeline is initialized when the output stream is defined, which allows the output stream shape to be retrieved.
      
      For the CPU decoder, this works fine because resizing happens in step 2, and the resulting shape can be retrieved.
      However, this is problematic for the GPU decoder, as resizing is currently done using a GPU decoder option (step 1) and there seems to be no interface to retrieve the output shape. The refactor therefore introduced a regression, which is described in https://github.com/pytorch/audio/issues/3405
      
      AVFilter internally adapts to changes in the input frame size. This commit changes the conversion process to behave similarly, so that it waits until the first frame comes in to finalize the frame shape.
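
      The idea of deferring shape-dependent setup until the first frame arrives can be sketched in plain Python. This is only an illustration of the pattern, not the StreamReader code: `LazyFrameConverter` is a hypothetical class, and frames are modeled as nested lists rather than AVFrame objects.

```python
class LazyFrameConverter:
    """Defers shape-dependent setup until the first frame is seen."""

    def __init__(self):
        # The output shape is unknown when the converter is constructed.
        self.shape = None

    def convert(self, frame):
        # `frame` is modeled as a list of rows (a 2D nested list).
        if self.shape is None:
            # First frame: the shape can only be finalized now.
            self.shape = (len(frame), len(frame[0]))
        # Stand-in for the real conversion: return a copy of the frame.
        return [list(row) for row in frame]
```

      The converter can be constructed when the output stream is defined, while `shape` stays `None` until `convert` has been called once.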
      
      Fix https://github.com/pytorch/audio/issues/3405
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3419
      
      Differential Revision: D46557505
      
      Pulled By: mthrok
      
      fbshipit-source-id: 46ad2d82c8c30f368ebfbaf6947718a5036c7dc6
  12. 07 Jun, 2023 1 commit
  13. 06 Jun, 2023 3 commits
  14. 02 Jun, 2023 1 commit
    • [BC-Breaking] Remove compute_kaldi_pitch (#3368) · 5bbbb1d5
      moto authored
      Summary:
      This commit removes the compute_kaldi_pitch function and the underlying Kaldi integration from torchaudio.
      
      The Kaldi pitch function was added in a short period of time by integrating the original Kaldi implementation instead of reimplementing it in PyTorch.
      
      The Kaldi integration employed a hack that replaces the base vector/matrix implementation of Kaldi with PyTorch Tensor so that there is only one BLAS library within torchaudio.
      
      Recently, we have been making torchaudio leaner, and we have not seen wide adoption of the kaldi_pitch feature, so we decided to remove both.
      
      See some of the discussion in https://github.com/pytorch/audio/issues/1269
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3368
      
      Differential Revision: D46406176
      
      Pulled By: mthrok
      
      fbshipit-source-id: ee5e24d825188f379979ddccd680c7323b119b1e
  15. 01 Jun, 2023 3 commits
  16. 30 May, 2023 1 commit
  17. 27 May, 2023 1 commit
    • Fix AudioEffector for mulaw (#3372) · af932cc7
      moto authored
      Summary:
      When encoding audio with mulaw, the resulting data does not have a header, and StreamReader defaults to 16 kHz, which can stretch or shrink the resulting waveform.
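
      The stretching or shrinking follows from simple arithmetic: the same number of samples spans a different duration when read at a different sample rate. A small sketch, with a hypothetical helper name and example rates chosen for illustration:

```python
def duration_seconds(num_samples, sample_rate_hz):
    """Duration implied by interpreting `num_samples` at `sample_rate_hz`."""
    return num_samples / sample_rate_hz


# One second of audio recorded at 8 kHz (example value, not from the commit).
num_samples = 8_000
true_duration = duration_seconds(num_samples, 8_000)
# The same headerless data read back with a 16 kHz default.
misread_duration = duration_seconds(num_samples, 16_000)
```

      Here the misread waveform lasts half as long as the original, so it plays back at twice the speed.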
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3372
      
      Reviewed By: hwangjeff
      
      Differential Revision: D46234772
      
      Pulled By: mthrok
      
      fbshipit-source-id: 942c89a8cfe29b0b6f57b3e5b6c9dfd3524ca552
  18. 26 May, 2023 3 commits
    • Fix encoding g722 format (#3373) · 1b05ca7e
      moto authored
      Summary:
      The g722 format only supports 16 kHz, but AVCodec does not list this. The implementation does not insert resampling, so the resulting audio can be slowed down or sped up.
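
      The decision a fix of this kind has to make can be sketched as below. This is an assumption about the shape of the logic, not the code in the PR; `plan_g722_encoding` is a hypothetical helper.

```python
# The only sample rate g722 supports, per the summary above.
G722_SAMPLE_RATE_HZ = 16_000


def plan_g722_encoding(src_rate_hz):
    """Return (encoder_rate_hz, needs_resampling) for a source sample rate."""
    return G722_SAMPLE_RATE_HZ, src_rate_hz != G722_SAMPLE_RATE_HZ
```

      For a 44.1 kHz source this returns `(16000, True)`, signalling that a resampling step is required before encoding.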
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3373
      
      Reviewed By: hwangjeff
      
      Differential Revision: D46233181
      
      Pulled By: mthrok
      
      fbshipit-source-id: 902b3f862a8f7269dc35bc871e868b0e78326c6c
    • Temporarily remove test for extract_features (#3378) · 05649ca3
      Zhaoheng Ni authored
      Summary:
      The tests failed for several bundles. This commit removes them; they will be re-added once the root cause is identified.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3378
      
      Reviewed By: atalman
      
      Differential Revision: D46230884
      
      Pulled By: nateanl
      
      fbshipit-source-id: 42056a29b2ec2335268b273d3e37fb517035be92
    • Improve RNN-T streaming decoding (#3295) · 9fc0dcaa
      Lakshmi Krishnan authored
      Summary:
      This commit fixes the following issues affecting streaming decoding quality:
      1. The `init_b` hypothesis is only regenerated from the blank token if no initial hypotheses are provided.
      2. The decoder can now receive the top-K hypotheses to continue decoding from, instead of using just the top hypothesis at each decoding step. This dramatically affects decoding quality, especially for speech with long pauses and disfluencies.
      3. Some minor errors in shape checking for length are corrected.
      
      This also means that the resulting output is the entire transcript up until that time step, instead of just the incremental change in transcript.
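
      Why carrying several hypotheses between chunks matters can be shown with a toy beam search. This is a sketch under stated assumptions, not the RNN-T decoder itself: `toy_scores`, `decode_chunk`, and `stream_decode` are hypothetical names, and the log-probabilities are made up so that the best first token leads to a poor continuation.

```python
import heapq


def toy_scores(prefix, chunk_index):
    """Hypothetical scorer: token -> log-probability given the prefix so far."""
    if chunk_index == 0:
        return {"a": -1.0, "b": -1.2}
    # The second chunk strongly favors hypotheses that started with "b".
    return {"x": -3.0} if prefix[-1] == "a" else {"y": -0.5}


def decode_chunk(chunk_index, hypotheses, beam_width, score_fn=toy_scores):
    """Extend every carried-over hypothesis and keep the `beam_width` best."""
    expanded = [
        (tokens + [token], score + token_score)
        for tokens, score in hypotheses
        for token, token_score in score_fn(tokens, chunk_index).items()
    ]
    return heapq.nlargest(beam_width, expanded, key=lambda hyp: hyp[1])


def stream_decode(num_chunks, beam_width):
    """Yield the best full transcript after each chunk."""
    hypotheses = [([], 0.0)]  # start from a single empty hypothesis
    for chunk_index in range(num_chunks):
        # Carry all surviving hypotheses forward, not only the best one.
        hypotheses = decode_chunk(chunk_index, hypotheses, beam_width)
        yield hypotheses[0][0]
```

      With `beam_width=1` the decoder commits to "a" and ends with `["a", "x"]`, while `beam_width=2` keeps "b" alive and ends with the higher-scoring `["b", "y"]`. Each yielded value is the whole transcript so far, which is why it can change between steps.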
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3295
      
      Reviewed By: nateanl
      
      Differential Revision: D46216113
      
      Pulled By: hwangjeff
      
      fbshipit-source-id: 8f7efae28dcca4a052f434ca55a2795c9e5ec0b0
  19. 24 May, 2023 1 commit
    • Update smoke test (#3346) · 71b2634b
      moto authored
      Summary:
      * Delay the import of torchaudio until the CLI options are parsed.
      * Add an option to set the log level to DEBUG so that issues with external libraries are easy to see.
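
      The two changes fit together as in the sketch below, which is an illustration rather than the actual smoke test script. The `--debug` flag name is an assumption; the point is only that logging is configured before the library is imported.

```python
import argparse
import importlib
import logging


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Smoke test sketch")
    # Hypothetical flag name; the real script may use a different option.
    parser.add_argument("--debug", action="store_true", help="Set log level to DEBUG")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Configure logging first so messages emitted while the library loads are visible.
    logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO)
    # Import only after the CLI options have been parsed.
    torchaudio = importlib.import_module("torchaudio")
    print(torchaudio.__version__)
```

      Because the import happens inside `main`, any debug output produced while external libraries are being located appears at the requested log level.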
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3346
      
      Reviewed By: nateanl
      
      Differential Revision: D46022546
      
      Pulled By: mthrok
      
      fbshipit-source-id: 9f988bbd770c2fd2bb260c3cfe02b238a9da2808
  20. 23 May, 2023 3 commits
  21. 22 May, 2023 1 commit
  22. 20 May, 2023 1 commit
  23. 17 May, 2023 1 commit
    • Add 420p10le CPU support to StreamReader (#3332) · c12f4734
      moto authored
      Summary:
      This commit adds support for decoding the YUV420P010LE format.
      
      The image tensor returned for this format has
      - NCHW format (C == 3)
      - int16 type
      - value range [0, 2^10).
      
      Note that the value range is different from what the "hevc_cuvid" decoder
      returns. The "hevc_cuvid" decoder uses the full range of int16 (internally,
      it's uint16) to express the color (with some intervals), but the values
      returned by the CPU "hevc" decoder are within [0, 2^10).
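
      The difference between the two value ranges can be made concrete with a little arithmetic. The helpers below are hypothetical. In particular, `spread_to_16bit` assumes that the "intervals" mentioned above come from placing the 10 bits in the upper bits of a 16-bit word, which is an assumption and not something this commit states.

```python
TEN_BIT_MAX = (1 << 10) - 1  # 1023, the largest value in [0, 2^10)


def normalize_10bit(value):
    """Map a 10-bit sample in [0, 2^10) to a float in [0.0, 1.0]."""
    if not 0 <= value <= TEN_BIT_MAX:
        raise ValueError(f"{value} is outside the 10-bit range")
    return value / TEN_BIT_MAX


def spread_to_16bit(value):
    """Assumed mapping: shift a 10-bit sample into the upper bits of 16 bits."""
    return value << 6
```

      Under that assumption, consecutive 10-bit values end up 64 apart in the 16-bit range, and the maximum 1023 maps to 65472.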
      
      Address https://github.com/pytorch/audio/issues/3331
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3332
      
      Reviewed By: hwangjeff
      
      Differential Revision: D45925097
      
      Pulled By: mthrok
      
      fbshipit-source-id: 4e669b65c030f388bba2fdbb8f00faf7e2981508
  24. 10 May, 2023 2 commits
  25. 09 May, 2023 1 commit
  26. 05 May, 2023 1 commit
    • Add SpecAugment transform (#3309) · 82febc59
      Xiaohui Zhang authored
      Summary:
      (2/2 of the previous https://github.com/pytorch/audio/pull/2360 which I accidentally closed)
      
      The previous way of doing SpecAugment via Frequency/TimeMasking transforms has the following problems:
      - Only zero masking can be done; masking by mean value is not supported.
      - mask_along_axis is hard-coded to mask the 1st dimension and mask_along_axis_iid is hard-coded to mask the 2nd or 3rd dimension of the input tensor.
      - For 3D spectrogram tensors where the first dimension is batch or channel, features from the same batch or different channels have to use the same mask, because mask_along_axis_iid only supports 4D tensors due to the above hard-coding.
      - For 2D spectrogram tensors without a batch or channel dimension, Time/Frequency masking can't be applied at all, because mask_along_axis only supports 3D tensors due to the above hard-coding.
      - Applying multiple time/frequency masks is not straightforward with the current design. If we need N masks along the time/frequency axis, we have to apply N Frequency/TimeMasking transforms to the input tensors sequentially, which makes for an inconvenient API. We need to introduce a separate SpecAugment transform to handle this.
      
      To solve these issues, here we
      - [done in the previous [PR](https://github.com/pytorch/audio/pull/3289)] Extend mask_along_axis_iid to support 3D+ tensors and mask_along_axis to support 2D+ tensors. Now both of them are able to mask one of the last two dimensions (where the time or frequency dimension lives) of the input tensor.
      - [done in this PR] Introduce the SpecAugment transform.
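
      What a single call that applies several masks looks like can be sketched without any tensor library. The function below is not the SpecAugment transform from this PR: `apply_masks` and its parameters are hypothetical, and the spectrogram is a plain nested list of shape (freq, time).

```python
import random


def apply_masks(spec, n_masks, max_width, axis, fill="zero", rng=random):
    """Return a masked copy of a 2D spectrogram given as a list of rows.

    axis=0 masks frequency rows, axis=1 masks time columns.
    fill="zero" writes 0.0; fill="mean" writes the mean of the input.
    """
    n_rows, n_cols = len(spec), len(spec[0])
    size = n_rows if axis == 0 else n_cols
    if fill == "mean":
        value = sum(map(sum, spec)) / (n_rows * n_cols)
    else:
        value = 0.0
    out = [list(row) for row in spec]  # leave the input untouched
    for _ in range(n_masks):
        # Each mask gets its own random width and start position.
        width = rng.randint(0, min(max_width, size))
        start = rng.randint(0, size - width)
        for i in range(start, start + width):
            if axis == 0:
                out[i] = [value] * n_cols
            else:
                for row in out:
                    row[i] = value
    return out
```

      A call such as `apply_masks(spec, n_masks=2, max_width=3, axis=1, fill="mean")` replaces chaining two separate time-masking steps and also covers masking by the mean value.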
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3309
      
      Reviewed By: nateanl
      
      Differential Revision: D45592926
      
      Pulled By: xiaohui-zhang
      
      fbshipit-source-id: 97cd686dbb6c1c6ff604716b71a876e616aaf1a2
  27. 04 May, 2023 1 commit
    • Extend mask_along_axis{,_iid} (#3289) · 74bd971a
      Xiaohui Zhang authored
      Summary:
      (1/2 of the previous [PR](https://github.com/pytorch/audio/pull/2360) which I accidentally closed)
      
      The previous way of doing SpecAugment via Frequency/TimeMasking transforms has the following problems:
      - Only zero masking can be done; masking by mean value is not supported.
      - mask_along_axis is hard-coded to mask the 1st dimension and mask_along_axis_iid is hard-coded to mask the 2nd or 3rd dimension of the input tensor.
      - For 3D spectrogram tensors where the first dimension is batch or channel, features from the same batch or different channels have to use the same mask, because mask_along_axis_iid only supports 4D tensors due to the above hard-coding.
      - For 2D spectrogram tensors without a batch or channel dimension, Time/Frequency masking can't be applied at all, because mask_along_axis only supports 3D tensors due to the above hard-coding.
      - Applying multiple time/frequency masks is not straightforward with the current design.
      
      To solve these issues, here we
      - Extend mask_along_axis_iid to support 3D tensors and mask_along_axis to support 2D tensors. Now both of them are able to mask one of the last two dimensions (where the time or frequency dimension lives) of the input tensor.
      
      The introduction of SpecAugment transform will be done in another PR.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3289
      
      Reviewed By: hwangjeff
      
      Differential Revision: D45460357
      
      Pulled By: xiaohui-zhang
      
      fbshipit-source-id: 91bf448294799f13789d96a13d4bae2451461ef3
  28. 28 Apr, 2023 1 commit
    • Add cuctc decoder (#3096) · 0a1801ed
      Yuekai Zhang authored
      Summary:
      This PR implements a CUDA-based CTC prefix beam search decoder.
      
      Several benchmark results on a V100 are attached below:
      | decoder type | model | dataset | decoding time (s) | beam size | batch size | model unit | subsampling times | vocab size |
      |---|---|---|---|---|---|---|---|---|
      | cuctc | conformer nemo | dev clean | 7.68 | 8 | 32 | bpe | 4 | 1000 |
      | cuctc | conformer nemo | dev clean (sort by length) | 1.6 | 8 | 32 | bpe | 4 | 1000 |
      | cuctc | wav2vec2.0 torchaudio | dev clean | 22 | 10 | 1 | char | 2 | 29 |
      | cuctc | conformer espnet | aishell1 test | 5 | 10 | 24 | char | 4 | 4233 |
      
      Note:
      1. The design parallelizes computation along the batch and vocab axes and loops over the frame axis, so compared with CPU implementations it is better suited to shorter sequences and larger vocabularies.
      2. WER is the same as with CPU implementations. However, decoding with an LM is not yet supported.
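
      The loop structure mentioned in note 1 is easier to see in a small CPU reference. The following is a plain-Python sketch of CTC prefix beam search, not the CUDA implementation from this PR; the function name, the probability-domain arithmetic, and the single-utterance input are assumptions made for illustration. The outer loop over frames is inherently sequential, while the inner loop over tokens is the kind of work that can be done in parallel.

```python
from collections import defaultdict


def ctc_prefix_beam_search(probs, beam_size, blank=0):
    """Return [(prefix, probability)] sorted best first.

    probs: list of frames, each a list of per-token probabilities.
    Each beam tracks (p_blank, p_non_blank): the probability of the prefix
    with the last frame emitting blank or a non-blank token.
    """
    beams = {(): (1.0, 0.0)}
    for frame in probs:  # frames must be processed in order
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for token, p in enumerate(frame):  # independent per token
                if token == blank:
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == token:
                    # A repeated token collapses into the same prefix...
                    next_beams[prefix][1] += p_nb * p
                    # ...unless a blank separated the two occurrences.
                    next_beams[prefix + (token,)][1] += p_b * p
                else:
                    next_beams[prefix + (token,)][1] += (p_b + p_nb) * p
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: (pb, pnb) for prefix, (pb, pnb) in ranked[:beam_size]}
    return [(list(prefix), pb + pnb) for prefix, (pb, pnb) in beams.items()]
```

      For two frames with probabilities `[0.1, 0.9]` each (blank first), the best prefix is `[1]`, which collects the probability of every alignment except blank-blank.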
      
      Resolves: https://github.com/pytorch/audio/issues/2957.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3096
      
      Reviewed By: nateanl
      
      Differential Revision: D44709397
      
      Pulled By: mthrok
      
      fbshipit-source-id: 3078c54a2b44dc00eb4a81b4c657487eeff8c155
  29. 12 Apr, 2023 2 commits
    • Allow overwrite temp data in ffmpeg test (#3263) · cc7b8bd4
      moto authored
      Summary:
      When `TORCHAUDIO_TEST_TEMP_DIR` is set,
      all the unit test temporary data are stored in the given directory.
      Running unit tests multiple times reuses the
      directory, so the temporary files from
      previous test runs are found there.
      
      The FFmpeg save test writes reference data to the
      temporary directory, but it is not given the
      overwrite flag ("-y"), so it fails in such cases.
      
      This commit fixes that.
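
      A helper that builds such a command could look like the sketch below. The function name and file names are hypothetical and this is not the test code from the commit; `-y` is FFmpeg's option for overwriting output files without asking.

```python
def build_ffmpeg_command(src, dst, overwrite=True):
    """Build an ffmpeg invocation that converts `src` into `dst`."""
    cmd = ["ffmpeg"]
    if overwrite:
        # Overwrite an existing output file instead of failing or prompting.
        cmd.append("-y")
    cmd += ["-i", src, dst]
    return cmd
```

      The resulting list can be passed to `subprocess.run`, so a reference file left over from a previous test run no longer causes a failure.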
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3263
      
      Reviewed By: hwangjeff
      
      Differential Revision: D44859003
      
      Pulled By: mthrok
      
      fbshipit-source-id: 2db92fbdec1c015455f3779e10a18f7f1146166b
    • Specify backend directly in test (#3262) · 563e409c
      moto authored
      Summary:
      Preparation for landing https://github.com/pytorch/audio/pull/3241
      
      This commit applies a patch to make the sox_io TorchScript test pass when the dispatcher is enabled.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3262
      
      Reviewed By: hwangjeff
      
      Differential Revision: D44897513
      
      Pulled By: mthrok
      
      fbshipit-source-id: 9b65f705cd02324328a2bc1c414aa4b7ca0fed32