1. 23 May, 2023 1 commit
  2. 22 May, 2023 1 commit
  3. 20 May, 2023 1 commit
  4. 17 May, 2023 1 commit
    • Add 420p10le CPU support to StreamReader (#3332) · c12f4734
      moto authored
      Summary:
      This commit adds support for decoding the YUV420P10LE format.
      
      The image tensor returned for this format is:
      - in NCHW format (C == 3),
      - of int16 type,
      - in the value range [0, 2^10).
      
      Note that this value range differs from what the "hevc_cuvid" decoder
      returns. The "hevc_cuvid" decoder uses the full range of int16 (internally,
      uint16) to express the color (with some intervals), whereas the values
      returned by the CPU "hevc" decoder are within [0, 2^10).
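To compare frames from the two decoders, the values must be brought to a common scale. The sketch below assumes the cuvid output follows the P010 convention of storing the 10 significant bits in the high bits of a 16-bit word (a left shift by 6, which would account for the "intervals"); that layout is an assumption, not something this commit states. The helper names are hypothetical.

```python
def ten_bit_to_p010(value: int) -> int:
    """Map a 10-bit sample in [0, 1024) to a P010-style 16-bit word (assumed layout)."""
    if not 0 <= value < 1 << 10:
        raise ValueError(f"expected a 10-bit value, got {value}")
    # The 10 significant bits go in the MSBs; the low 6 bits are zero,
    # so consecutive codes are 64 apart in the 16-bit range.
    return value << 6


def p010_to_ten_bit(word: int) -> int:
    """Inverse of ten_bit_to_p010: drop the 6 padding bits.

    Masking with 0xFFFF reinterprets a negative int16 as its uint16 bit pattern.
    """
    return (word & 0xFFFF) >> 6


def ten_bit_to_float(value: int) -> float:
    """Normalize a 10-bit sample to [0.0, 1.0]."""
    return value / 1023.0
```

Going through the bit-level mapping (rather than multiplying by 64.0) keeps the round trip exact for every 10-bit code.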
      
      Addresses https://github.com/pytorch/audio/issues/3331
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3332
      
      Reviewed By: hwangjeff
      
      Differential Revision: D45925097
      
      Pulled By: mthrok
      
      fbshipit-source-id: 4e669b65c030f388bba2fdbb8f00faf7e2981508
  5. 10 May, 2023 2 commits
  6. 09 May, 2023 1 commit
  7. 05 May, 2023 1 commit
    • Add SpecAugment transform (#3309) · 82febc59
      Xiaohui Zhang authored
      Summary:
      (2/2 of the previous https://github.com/pytorch/audio/pull/2360, which I accidentally closed)
      
      The previous way of doing SpecAugment via the FrequencyMasking/TimeMasking transforms has the following problems:
      - Only zero masking can be done; masking by the mean value is not supported.
      - mask_along_axis is hard-coded to mask the 1st dimension, and mask_along_axis_iid is hard-coded to mask the 2nd or 3rd dimension of the input tensor.
      - For 3D spectrogram tensors where the first dimension is batch or channel, examples within the same batch or different channels have to use the same mask, because, due to the above hard-coding, mask_along_axis_iid only supports 4D tensors.
      - For 2D spectrogram tensors without a batch or channel dimension, time/frequency masking can't be applied at all, because, due to the above hard-coding, mask_along_axis only supports 3D tensors.
      - It's not straightforward to apply multiple time/frequency masks with the current design. If we need N masks along the time/frequency axis, we have to sequentially apply N FrequencyMasking/TimeMasking transforms to the input tensor, which is an inconvenient API. We need a separate SpecAugment transform to handle this.
      
      To solve these issues, we:
      - [done in the previous [PR](https://github.com/pytorch/audio/pull/3289)] Extend mask_along_axis_iid to support 3D+ tensors and mask_along_axis to support 2D+ tensors. Both of them can now mask either of the last two dimensions (where the time and frequency dimensions live) of the input tensor.
      - [done in this PR] Introduce the SpecAugment transform.
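The masking that SpecAugment performs can be sketched in plain Python. This is a minimal illustration on a 2D spectrogram stored as nested lists (freq by time), not torchaudio's implementation; `spec_augment`, `_mask_axis` and their parameters are hypothetical names chosen to mirror the concepts in the commit. Each mask width is drawn uniformly from [0, mask_param], and masked bins are filled with 0 or with the mean of the input.

```python
import random
from typing import List

Spectrogram = List[List[float]]  # indexed as spec[freq][time]


def _mask_axis(spec: Spectrogram, axis: int, mask_param: int, value: float,
               rng: random.Random) -> None:
    """Apply one mask of random width along axis 0 (freq) or 1 (time), in place."""
    size = len(spec) if axis == 0 else len(spec[0])
    width = rng.randint(0, min(mask_param, size))  # width in [0, mask_param]
    start = rng.randint(0, size - width)
    for f in range(len(spec)):
        for t in range(len(spec[0])):
            idx = f if axis == 0 else t
            if start <= idx < start + width:
                spec[f][t] = value


def spec_augment(spec: Spectrogram, n_time_masks: int, time_mask_param: int,
                 n_freq_masks: int, freq_mask_param: int,
                 zero_masking: bool = True, seed=None) -> Spectrogram:
    """Return a copy of `spec` with several frequency and time masks applied."""
    rng = random.Random(seed)
    out = [row[:] for row in spec]  # leave the caller's data untouched
    # Fill value: zero, or the mean of the unmasked input.
    n_bins = len(spec) * len(spec[0])
    value = 0.0 if zero_masking else sum(map(sum, spec)) / n_bins
    for _ in range(n_freq_masks):
        _mask_axis(out, 0, freq_mask_param, value, rng)
    for _ in range(n_time_masks):
        _mask_axis(out, 1, time_mask_param, value, rng)
    return out
```

Applying N masks is a loop inside one call, instead of chaining N separate masking transforms.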
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3309
      
      Reviewed By: nateanl
      
      Differential Revision: D45592926
      
      Pulled By: xiaohui-zhang
      
      fbshipit-source-id: 97cd686dbb6c1c6ff604716b71a876e616aaf1a2
  8. 04 May, 2023 1 commit
    • Extend mask_along_axis{,_iid} (#3289) · 74bd971a
      Xiaohui Zhang authored
      Summary:
      (1/2 of the previous [PR](https://github.com/pytorch/audio/pull/2360), which I accidentally closed)
      
      The previous way of doing SpecAugment via the FrequencyMasking/TimeMasking transforms has the following problems:
      - Only zero masking can be done; masking by the mean value is not supported.
      - mask_along_axis is hard-coded to mask the 1st dimension, and mask_along_axis_iid is hard-coded to mask the 2nd or 3rd dimension of the input tensor.
      - For 3D spectrogram tensors where the first dimension is batch or channel, examples within the same batch or different channels have to use the same mask, because, due to the above hard-coding, mask_along_axis_iid only supports 4D tensors.
      - For 2D spectrogram tensors without a batch or channel dimension, time/frequency masking can't be applied at all, because, due to the above hard-coding, mask_along_axis only supports 3D tensors.
      - It's not straightforward to apply multiple time/frequency masks with the current design.
      
      To solve these issues, we:
      - Extend mask_along_axis_iid to support 3D tensors and mask_along_axis to support 2D tensors. Both of them can now mask either of the last two dimensions (where the time and frequency dimensions live) of the input tensor.
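The difference between a shared mask and independent (i.i.d.) masks on a 3D (channel, freq, time) input can be sketched as follows. This is a plain-Python illustration, not torchaudio code; `mask_last_two` and its arguments are hypothetical. `axis` may be -1 (time) or -2 (frequency), mirroring "one of the last two dimensions", and with `iid=True` each channel draws its own mask.

```python
import random


def mask_last_two(batch, mask_param, axis, iid, value=0.0, seed=None):
    """Mask one of the last two axes of a [C][F][T] nested list.

    axis=-1 masks time, axis=-2 masks frequency. With iid=False every channel
    shares one mask; with iid=True each channel gets an independent mask.
    """
    if axis not in (-1, -2):
        raise ValueError("axis must be -1 (time) or -2 (frequency)")
    rng = random.Random(seed)
    n_freq, n_time = len(batch[0]), len(batch[0][0])
    size = n_time if axis == -1 else n_freq

    def draw():
        # Width in [0, mask_param], start chosen so the mask fits.
        width = rng.randint(0, min(mask_param, size))
        start = rng.randint(0, size - width)
        return start, start + width

    shared = draw()
    out = []
    for channel in batch:
        start, end = draw() if iid else shared
        masked = [row[:] for row in channel]
        for f in range(n_freq):
            for t in range(n_time):
                idx = t if axis == -1 else f
                if start <= idx < end:
                    masked[f][t] = value
        out.append(masked)
    return out
```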
      
      The introduction of SpecAugment transform will be done in another PR.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3289
      
      Reviewed By: hwangjeff
      
      Differential Revision: D45460357
      
      Pulled By: xiaohui-zhang
      
      fbshipit-source-id: 91bf448294799f13789d96a13d4bae2451461ef3
  9. 28 Apr, 2023 1 commit
    • Add cuctc decoder (#3096) · 0a1801ed
      Yuekai Zhang authored
      Summary:
      This PR implements a CUDA-based CTC prefix beam search decoder.
      
      Several benchmark results on a V100 are attached below:
      | decoder type | model | dataset | decoding time (s) | beam size | batch size | model unit | subsampling factor | vocab size |
      |---|---|---|---|---|---|---|---|---|
      | cuctc | conformer nemo | dev clean | 7.68 | 8 | 32 | bpe | 4 | 1000 |
      | cuctc | conformer nemo | dev clean (sorted by length) | 1.6 | 8 | 32 | bpe | 4 | 1000 |
      | cuctc | wav2vec2.0 torchaudio | dev clean | 22 | 10 | 1 | char | 2 | 29 |
      | cuctc | conformer espnet | aishell1 test | 5 | 10 | 24 | char | 4 | 4233 |
      
      Notes:
      1. The design parallelizes computation over the batch and vocabulary axes and loops over the frame axis. Compared with CPU implementations, it is therefore better suited to shorter sequences and larger vocabularies.
      2. WER is the same as with the CPU implementations. However, decoding with a language model (LM) is not supported yet.
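For reference, CTC prefix beam search itself can be sketched on the CPU in plain Python. This is a minimal probability-space illustration without an LM (no log-space arithmetic, no blank-skip threshold), not the cuctc code; `prefix_beam_search` is a hypothetical name. The inner loop over tokens is the part the note above describes the CUDA version parallelizing, while the outer loop over frames stays sequential.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Prefix = Tuple[int, ...]


def prefix_beam_search(
    probs: Sequence[Sequence[float]], beam_size: int, blank: int = 0
) -> Tuple[List[int], float]:
    """Return the most probable label sequence and its probability.

    probs[t][c] is the softmax probability of token c at frame t.
    """
    # Each prefix tracks (prob. of ending in blank, prob. of ending in non-blank).
    beams: Dict[Prefix, Tuple[float, float]] = {(): (1.0, 0.0)}
    for frame in probs:  # sequential over frames
        nxt: Dict[Prefix, List[float]] = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for c, p in enumerate(frame):  # per-token loop
                if c == blank:
                    # A blank keeps the prefix unchanged.
                    nxt[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == c:
                    # A repeated token collapses into the same prefix unless
                    # a blank separated the two occurrences.
                    nxt[prefix][1] += p_nb * p
                    nxt[prefix + (c,)][1] += p_b * p
                else:
                    nxt[prefix + (c,)][1] += (p_b + p_nb) * p
        # Keep only the `beam_size` most probable prefixes.
        ranked = sorted(nxt.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beams = {k: (v[0], v[1]) for k, v in ranked[:beam_size]}
    best, (p_b, p_nb) = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return list(best), p_b + p_nb
```

With two frames of `[0.6, 0.4]` (blank first), greedy decoding yields the empty sequence, while prefix beam search returns `[1]`, because the three paths that collapse to it sum to 0.64. A production decoder would work in log space to avoid underflow on long inputs.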
      
      Resolves: https://github.com/pytorch/audio/issues/2957.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3096
      
      Reviewed By: nateanl
      
      Differential Revision: D44709397
      
      Pulled By: mthrok
      
      fbshipit-source-id: 3078c54a2b44dc00eb4a81b4c657487eeff8c155
  10. 12 Apr, 2023 2 commits
    • Allow overwrite temp data in ffmpeg test (#3263) · cc7b8bd4
      moto authored
      Summary:
      When `TORCHAUDIO_TEST_TEMP_DIR` is set,
      all the unit test temporary data are stored in the given directory.
      Running unit tests multiple times reuses the
      directory, so the temporary files from
      previous test runs are found there.

      The FFmpeg save test writes reference data to the
      temporary directory, but the overwrite flag ("-y") is not
      passed, so the test fails in such cases.
      
      This commit fixes that.
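A minimal sketch of the pattern, assuming the reference data is produced by invoking the `ffmpeg` CLI from Python. The helper names are hypothetical and not the actual test utilities. Without `-y`, ffmpeg asks for confirmation before overwriting an existing output file, which a non-interactive test run cannot give.

```python
import os
import subprocess
import tempfile


def temp_dir_for_tests() -> str:
    """Use TORCHAUDIO_TEST_TEMP_DIR if set (reused across runs), else a fresh dir."""
    root = os.environ.get("TORCHAUDIO_TEST_TEMP_DIR")
    if root:
        os.makedirs(root, exist_ok=True)
        return root
    return tempfile.mkdtemp()


def build_reference_cmd(src: str, dst: str) -> list:
    """Build an ffmpeg command that writes `dst` even if it already exists."""
    # "-y" makes ffmpeg overwrite output files without prompting.
    return ["ffmpeg", "-hide_banner", "-y", "-i", src, dst]


def write_reference(src: str, dst: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_reference_cmd(src, dst), check=True)
```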
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3263
      
      Reviewed By: hwangjeff
      
      Differential Revision: D44859003
      
      Pulled By: mthrok
      
      fbshipit-source-id: 2db92fbdec1c015455f3779e10a18f7f1146166b
    • Specify backend directly in test (#3262) · 563e409c
      moto authored
      Summary:
      Preparation for landing https://github.com/pytorch/audio/pull/3241

      This commit applies a patch to make the sox_io TorchScript test pass when the dispatcher is enabled.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3262
      
      Reviewed By: hwangjeff
      
      Differential Revision: D44897513
      
      Pulled By: mthrok
      
      fbshipit-source-id: 9b65f705cd02324328a2bc1c414aa4b7ca0fed32
  11. 05 Apr, 2023 1 commit
  12. 03 Apr, 2023 1 commit
  13. 01 Apr, 2023 1 commit
    • Add AudioEffector (#3163) · a4036248
      moto authored
      Summary:
      This commit adds a new feature, AudioEffector, which can be used to
      apply various effects and codecs to waveforms held in Tensors.

      Under the hood, it uses StreamWriter and StreamReader to apply
      filters and encode/decode.
      
      This is going to replace the deprecated `apply_codec` and
      `apply_sox_effect_tensor` functions.
      
      It can also perform online, chunk-by-chunk filtering.
      
      Tutorial to follow.
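Chunk-by-chunk ("online") filtering gives the same result as filtering the whole waveform at once only if the filter state is carried across chunk boundaries. The sketch below shows this with a hypothetical stateful one-pole low-pass filter in plain Python. It illustrates the general idea, not how AudioEffector or the underlying FFmpeg filters are implemented.

```python
from typing import Iterable, Iterator, List


class OnePoleLowPass:
    """y[n] = y[n-1] + alpha * (x[n] - y[n-1]), with y[-1] = 0."""

    def __init__(self, alpha: float) -> None:
        self.alpha = alpha
        self.prev = 0.0  # filter state, kept between chunks

    def process(self, chunk: List[float]) -> List[float]:
        out = []
        for x in chunk:
            self.prev += self.alpha * (x - self.prev)
            out.append(self.prev)
        return out


def filter_stream(chunks: Iterable[List[float]], alpha: float) -> Iterator[List[float]]:
    """Yield filtered chunks one at a time, reusing a single filter instance."""
    lp = OnePoleLowPass(alpha)
    for chunk in chunks:
        yield lp.process(chunk)
```

Creating a new filter per chunk would reset `prev` to zero and produce audible discontinuities at chunk boundaries.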
      
      closes https://github.com/pytorch/audio/issues/3161
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3163
      
      Reviewed By: hwangjeff
      
      Differential Revision: D44576660
      
      Pulled By: mthrok
      
      fbshipit-source-id: 2c5cc87082ab431315d29d56d6ac9efaf4cf7aeb
  14. 30 Mar, 2023 2 commits
    • Support encode spec change in StreamWriter (#3207) · 1b648626
      moto authored
      Summary:
      This commit adds support for changing the spec of media
      (such as sample rate, #channels, image size and frame rate)
      on-the-fly at encoding time.
      
      The motivation behind this addition is that certain media
      formats support only a limited set of specs, and it is
      cumbersome to require client code to change the spec
      every time.

      For example, OPUS supports only a 48kHz sampling rate, and
      Vorbis supports only stereo.

      To make it easy to work with media of different formats,
      this commit automatically converts anything that is not
      compatible with the format, and allows users to specify
      overrides.

      A notable implementation detail is that, for sample format and
      pixel format, the encoder's default value takes precedence
      over the source value, while for other attributes like sample rate and
      #channels, the source value takes precedence as long as
      it is supported.
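The precedence rules above can be sketched as a small resolution function. Everything here is hypothetical: `resolve_spec`, the dict-based inputs, the idea that an explicit user override beats both source and encoder, and the fallback of picking the first supported value are all assumptions for illustration, not StreamWriter internals.

```python
from typing import Dict, Optional, Sequence


def resolve_spec(
    source: Dict[str, object],
    encoder_default: Dict[str, object],
    supported: Dict[str, Sequence[object]],
    override: Optional[Dict[str, object]] = None,
) -> Dict[str, object]:
    """Pick the value used for each attribute at encoding time."""
    override = override or {}
    resolved = {}
    for key in source:
        if key in override:
            # An explicit user choice wins.
            resolved[key] = override[key]
        elif key in ("sample_fmt", "pix_fmt") and key in encoder_default:
            # Formats: the encoder's default takes precedence over the source.
            resolved[key] = encoder_default[key]
        elif key not in supported or source[key] in supported[key]:
            # Other attributes: keep the source value when it is supported.
            resolved[key] = source[key]
        else:
            # Assumed fallback: the first value the encoder supports.
            resolved[key] = supported[key][0]
    return resolved
```

For example, a 16kHz mono source headed for an encoder that only accepts 48kHz resolves to 48kHz while keeping one channel.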
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3207
      
      Reviewed By: nateanl
      
      Differential Revision: D44439622
      
      Pulled By: mthrok
      
      fbshipit-source-id: 09524f201d485d201150481884a3e9e4d2aab081
    • Support changing the number of channels in StreamReader (#3216) · 4bc4ca75
      moto authored
      Summary:
      This commit adds a `num_channels` argument,
      which allows one to change the number of channels on the fly.
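What changing the channel count does to the samples can be illustrated with a simple down/up-mix. The rule below (average to go down to mono, duplicate to go up from mono) is an assumption for illustration, and `convert_channels` is a hypothetical helper. FFmpeg's resampler applies its own mixing matrices, so actual StreamReader output may differ in gains.

```python
from typing import List

Frame = List[float]  # one sample per channel at a single time step


def convert_channels(frames: List[Frame], num_channels: int) -> List[Frame]:
    """Convert a list of per-time-step frames to `num_channels` channels."""
    out = []
    for frame in frames:
        if len(frame) == num_channels:
            out.append(list(frame))
        elif num_channels == 1:
            # Downmix: average all input channels.
            out.append([sum(frame) / len(frame)])
        elif len(frame) == 1:
            # Upmix from mono: copy the sample to every output channel.
            out.append(frame * num_channels)
        else:
            raise NotImplementedError("only mono <-> N conversion is sketched")
    return out
```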
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3216
      
      Reviewed By: hwangjeff
      
      Differential Revision: D44516925
      
      Pulled By: mthrok
      
      fbshipit-source-id: 3e5a11b3fdbb19071f712a8148e27aff60341df3
  15. 29 Mar, 2023 1 commit
    • Reduce io tests (#3217) · 09ccf7cc
      Moto Hira authored
      Summary:
      Pull Request resolved: https://github.com/pytorch/audio/pull/3217
      
      This commit removes some tests for file-like objects from the StreamWriter test.

      The rationale is that what happens after the output file is opened is the
      same for file-like objects and regular files. Things like the filter graph and
      encoder format changes do not affect how the encoded binary is written.
      
      Reviewed By: hwangjeff
      
      Differential Revision: D44518626
      
      fbshipit-source-id: 821ec20deca92e5e5c85bf4d47997eed51735374
  16. 28 Mar, 2023 1 commit
  17. 27 Mar, 2023 1 commit
    • Revise encoder config arg and docstrings (#3203) · b1de9f1a
      hwangjeff authored
      Summary:
      For `StreamWriter`,
      * Renames arg `config` to `codec_config`.
      * Renames struct `EncodingConfig` and dataclass `EncodeConfig` to `CodecConfig`.
      * Adds docstrings for arg `codec_config`.
      * Updates `chunk` to `frames` in `write_*_chunk` methods.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3203
      
      Reviewed By: mthrok
      
      Differential Revision: D44350153
      
      Pulled By: hwangjeff
      
      fbshipit-source-id: 1b940b1366a43ec0565c362bfcbf62744088b343
  18. 25 Mar, 2023 1 commit
    • Properly set #samples passed to encoder (#3204) · d8a37a21
      moto authored
      Summary:
      Some audio encoders expect a specific, exact number of samples, as described by `AVCodecContext.frame_size`.

      `AVFrame.nb_samples` is set for the frames passed to `AVFilterGraph`,
      but frames coming out of the graph do not necessarily have the same number of samples.
      
      This causes issues with encoding OPUS (among others).
      
      This commit fixes it by inserting `asetnsamples` into the filter graph if a fixed number of samples is requested.
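Conceptually, `asetnsamples` regroups a stream of variable-size frames into frames of exactly N samples. The plain-Python sketch below illustrates that regrouping for a single channel. `rechunk` is a hypothetical name, and zero-padding the final partial frame is an assumption modeled on the filter's optional padding, not a statement of how this commit configures it.

```python
from typing import Iterable, Iterator, List


def rechunk(frames: Iterable[List[float]], frame_size: int,
            pad: bool = True) -> Iterator[List[float]]:
    """Yield frames of exactly `frame_size` samples from variable-size input frames."""
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    pending: List[float] = []
    for frame in frames:
        pending.extend(frame)
        # Emit every complete frame that is now available.
        while len(pending) >= frame_size:
            yield pending[:frame_size]
            pending = pending[frame_size:]
    if pending:
        # Last partial frame: zero-pad to full size, or emit as-is.
        yield (pending + [0.0] * (frame_size - len(pending))) if pad else pending
```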
      
      Note:
      It turned out that FFmpeg 4.1 has an issue with OPUS encoding: it does not properly discard some samples.
      We should probably move the minimum required FFmpeg version to 4.2, but I am not sure if we can enforce it via ABI.
      A workaround will be to issue a warning when encoding OPUS with 4.1 (follow-up).
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3204
      
      Reviewed By: nateanl
      
      Differential Revision: D44374668
      
      Pulled By: mthrok
      
      fbshipit-source-id: 10ef5333dc0677dfb83c8e40b78edd8ded1b21dc
  19. 23 Mar, 2023 3 commits
  20. 22 Mar, 2023 1 commit
  21. 21 Mar, 2023 2 commits
  22. 20 Mar, 2023 1 commit
    • Support CUDA frame in FilterGraph (#3183) · c5b96558
      moto authored
      Summary:
      This commit adds CUDA frame support to FilterGraph.

      It initializes and attaches a CUDA frames context to FilterGraph
      so that CUDA frames can be processed in FilterGraph.

      As a result, it enables
      1. CUDA filters such as `scale_cuda`,
      2. properly retrieving the pixel format coming out of FilterGraph when
         CUDA HW acceleration is enabled (currently it is reported as "cuda").
      
      Resolves https://github.com/pytorch/audio/issues/3159
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3183
      
      Reviewed By: hwangjeff
      
      Differential Revision: D44183722
      
      Pulled By: mthrok
      
      fbshipit-source-id: 522d21039c361ddfaa87fa89cf49c19d210ac62f
  23. 17 Mar, 2023 1 commit
  24. 16 Mar, 2023 1 commit
    • Refactor Tensor conversion in StreamReader (#3170) · 014d7140
      moto authored
      Summary:
      Currently, when the Buffer converts AVFrame* to torch::Tensor,
      it checks the format each time a frame is passed, and then
      performs the conversion.

      This commit changes it so that the conversion operation is
      pre-instantiated at the time the output stream is configured.

      It introduces Converter implementations for various formats
      and uses templates to embed them in the Buffer class.
      This way, branching such as if/switch is eliminated from the
      decoding path.
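The change is from per-frame dispatch to choosing the converter once. C++ templates have no direct Python equivalent, so the sketch below approximates the idea by binding a converter function when the buffer is constructed. All class and function names are hypothetical, and the toy converters stand in for the real AVFrame-to-Tensor code.

```python
from typing import Callable, Dict, List


def convert_u8(frame: List[int]) -> List[float]:
    """Toy converter: unsigned 8-bit to [0, 1]."""
    return [v / 255.0 for v in frame]


def convert_s16(frame: List[int]) -> List[float]:
    """Toy converter: signed 16-bit to [-1, 1)."""
    return [v / 32768.0 for v in frame]


CONVERTERS: Dict[str, Callable[[List[int]], List[float]]] = {
    "u8": convert_u8,
    "s16": convert_s16,
}


class PerFrameDispatchBuffer:
    """Before: the format is inspected for every frame."""

    def __init__(self, fmt: str) -> None:
        self.fmt = fmt
        self.chunks: List[List[float]] = []

    def push(self, frame: List[int]) -> None:
        if self.fmt == "u8":
            self.chunks.append(convert_u8(frame))
        elif self.fmt == "s16":
            self.chunks.append(convert_s16(frame))
        else:
            raise ValueError(f"unsupported format: {self.fmt}")


class PreboundBuffer:
    """After: the converter is chosen once, when the stream is configured."""

    def __init__(self, fmt: str) -> None:
        self.convert = CONVERTERS[fmt]  # KeyError for unsupported formats
        self.chunks: List[List[float]] = []

    def push(self, frame: List[int]) -> None:
        self.chunks.append(self.convert(frame))  # no branching per frame
```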
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3170
      
      Reviewed By: xiaohui-zhang
      
      Differential Revision: D44048293
      
      Pulled By: mthrok
      
      fbshipit-source-id: 30d8b240a5695d7513f499ce17853f2f0ffcab9f
  25. 15 Mar, 2023 1 commit
  26. 08 Mar, 2023 2 commits
    • Include format information after filter (#3155) · 146195d8
      moto authored
      Summary:
      This commit adds fields to OutputStream that show the result
      of filters, such as the width and height after filtering.
      
      Before
      
      ```
      OutputStream(
          source_index=0,
          filter_description='fps=3,scale=width=320:height=320,format=pix_fmts=gray')
      ```
      
      After
      
      ```
      OutputVideoStream(
          source_index=0,
          filter_description='fps=3,scale=width=320:height=320,format=pix_fmts=gray',
          media_type='video',
          format='gray',
          width=320,
          height=320,
          frame_rate=3.0)
      ```
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3155
      
      Reviewed By: nateanl
      
      Differential Revision: D43882399
      
      Pulled By: mthrok
      
      fbshipit-source-id: 620676b1a06f293fdd56de8203a11120f228fa2d
    • Support overwriting PTS in StreamWriter (#3135) · 8d2f6f8d
      moto authored
      Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/3135
      
      Reviewed By: xiaohui-zhang
      
      Differential Revision: D43724273
      
      Pulled By: mthrok
      
      fbshipit-source-id: 9b52823618948945a26e57d5b3deccbf5f9268c1
  27. 07 Mar, 2023 3 commits
  28. 02 Mar, 2023 1 commit
  29. 01 Mar, 2023 1 commit
    • Fix windows tests (#3119) · 6a4a8200
      Zhaoheng Ni authored
      Summary:
      `sox` is not available on Windows machines. Add skip decorators to the sox-related tests so that they are not run on Windows.
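A minimal sketch of such a skip decorator using only the standard `unittest` module. The decorator and test names are hypothetical; the decorators actually used in the torchaudio test suite may differ.

```python
import sys
import unittest

# Skip on Windows, where (per the commit message) sox is unavailable.
skip_if_windows = unittest.skipIf(sys.platform == "win32", "sox is not available on Windows")


class SoxRelatedTest(unittest.TestCase):
    @skip_if_windows
    def test_something_with_sox(self):
        # Placeholder body; a real test would exercise sox-backed functionality.
        self.assertTrue(True)
```

Defining the decorator once keeps the skip condition and its reason consistent across every sox-related test.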
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/3119
      
      Reviewed By: mthrok
      
      Differential Revision: D43682754
      
      Pulled By: nateanl
      
      fbshipit-source-id: f69987dac8232a3569be83f096b32389bd8bda81
  30. 27 Feb, 2023 1 commit
  31. 25 Feb, 2023 1 commit