1. 11 Jan, 2023 1 commit
  2. 10 Jan, 2023 2 commits
  3. 06 Jan, 2023 6 commits
  4. 05 Jan, 2023 4 commits
    • Rename generator to vocoder in HiFiGAN model and factory functions (#2955) · 5e75c8e8
      Zhaoheng Ni authored
      Summary:
      The generator part of the HiFiGAN model is a vocoder, which converts a mel spectrogram into a waveform. Naming it "vocoder" makes its role clearer.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2955
      
      Reviewed By: carolineechen
      
      Differential Revision: D42348864
      
      Pulled By: nateanl
      
      fbshipit-source-id: c45a2f8d8d205ee381178ae5d37e9790a257e1aa
      5e75c8e8
    • Fix filtering function fallback mechanism (#2953) · 5428e283
      moto authored
      Summary:
      lfilter and overdrive have faster implementations written in C++. If these are not available, torchaudio is supposed to fall back on the Python-based implementations.

      The original fallback mechanism relied on the error type and messages from PyTorch core, which have since changed.

      This commit updates it to use a more robust fallback mechanism.
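
      A minimal sketch of the fallback pattern, with hypothetical helper names (this is not the actual torchaudio code): availability is decided once, from whether the compiled extension can be imported, rather than by catching error types/messages from PyTorch core at call time.

      ```python
      # Sketch only: `_lfilter_cpp` / `_lfilter_python` are placeholders, and the
      # extension module name is an assumption.
      try:
          from torchaudio.lib import _torchaudio  # compiled extension
          _USE_CPP = True
      except ImportError:
          _USE_CPP = False

      def lfilter(waveform, a_coeffs, b_coeffs, clamp=True):
          """Dispatch to the C++ op when available, otherwise the Python implementation."""
          if _USE_CPP:
              return _lfilter_cpp(waveform, a_coeffs, b_coeffs, clamp)
          return _lfilter_python(waveform, a_coeffs, b_coeffs, clamp)
      ```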
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2953
      
      Reviewed By: hwangjeff
      
      Differential Revision: D42344893
      
      Pulled By: mthrok
      
      fbshipit-source-id: 18ce5c1aa1c69d0d2ab469b0b0c36c0221f5ccfd
      5428e283
    • Use PyBind for binding utilities (#2956) · f70b970a
      Moto Hira authored
      Summary:
      Pull Request resolved: https://github.com/pytorch/audio/pull/2956
      
      Merge utility binding
      
      This commit updates the utility binding so that we can use `is_module_available()`
      to check the existence of extension modules.

      To ensure the module exists, this commit migrates the binding of the utility functions
      to PyBind11.

      Going forward, we should use TorchBind for ops that we want to support in TorchScript,
      and otherwise default to PyBind11. (PyBind11 has the advantage of not copying strings.)
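
      A short sketch of the check this enables (the helper's import path and the extension module name are assumptions):

      ```python
      # Because the utilities are bound with PyBind11, the extension is a regular
      # importable module, so its presence can be probed like any other module.
      from torchaudio._internal.module_utils import is_module_available

      if is_module_available("torchaudio.lib._torchaudio"):
          print("C++ extension modules are present")
      else:
          print("C++ extension modules are not available")
      ```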
      
      Reviewed By: hwangjeff
      
      Differential Revision: D42355992
      
      fbshipit-source-id: 4c71d65b24a0882a38a80dc097d45ba72b4c4a6b
      f70b970a
    • Add HiFiGAN bundle (#2921) · 54e5c859
      Grigory Sizov authored
      Summary:
      Closes [T138011314](https://www.internalfb.com/intern/tasks/?t=138011314)
      ## Description
      - Add bundle `HIFIGAN_GENERATOR_V3_LJSPEECH` to prototypes. The bundle contains pre-trained HiFiGAN generator weights from the [original HiFiGAN publication](https://github.com/jik876/hifi-gan#pretrained-model), converted slightly to fit our model (a usage sketch follows this list)
      - Add tests
        - unit tests checking that vocoder and mel-transform implementations in the bundle give the same results as the original ones. Part of the original HiFiGAN code is ported to this repo to enable these tests
        - integration test checking that waveform reconstructed from mel spectrogram by the bundle is close enough to the original
      - Add docs
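
      A hypothetical usage sketch of the new bundle (the import path and method names are assumptions based on the description above, not verified against the released API):

      ```python
      import torch
      from torchaudio.prototype.pipelines import HIFIGAN_GENERATOR_V3_LJSPEECH as bundle

      vocoder = bundle.get_vocoder()               # pre-trained HiFiGAN generator/vocoder (assumed name)
      mel_transform = bundle.get_mel_transform()   # matching mel-spectrogram front end (assumed name)

      waveform = torch.randn(1, 22050)             # one second of audio at an assumed 22.05 kHz
      mel = mel_transform(waveform)                # waveform -> mel spectrogram
      with torch.inference_mode():
          reconstructed = vocoder(mel)             # mel spectrogram -> waveform
      ```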
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2921
      
      Reviewed By: nateanl, mthrok
      
      Differential Revision: D42034761
      
      Pulled By: sgrigory
      
      fbshipit-source-id: 8b0dadeed510b3c9371d6aa2c46ec7d8378f6048
      54e5c859
  5. 04 Jan, 2023 5 commits
    • Make fill_buffer a public API and move the impl to C++ (#2954) · bf085b1f
      moto authored
      Summary:
      Currently, when iterating over media data with StreamReader, the for-loop is the only way available through the public API.

      This does not support use cases like "fetch one chunk after seek" well.
      
      ```python
      s = StreamReader(src)
      s.add_audio_stream(...)
      s.seek(10)
      chunk = None
      for chunk, in s.stream():
          break
      ```
      
      This commit makes the `fill_buffer` method used by the iterator a public API, so that one can do
      
      ```python
      s.seek(10)
      s.fill_buffer()
      chunk, = s.pop_chunks()
      ```
      
       ---
      
      This commit also moves the implementation to C++, which reduces the number of FFI boundary crossings.
      This improves performance when the iteration is longer.
      
      AVI (generated with `ffmpeg -hide_banner -f lavfi -t ${duration} -i testsrc "${file}.avi"`)
      
      | Video Duration [sec] | Original [msec] | Fill Buffer C++ [msec] | One Go (reference) [msec] |
      |----------------------|-----------------|------------------------|---------------------------|
      |                    1 |       18 |            18.4 |   16.6 |
      |                    5 |       44 |            42.6 |   35.1 |
      |                   10 |     75.3 |            74.4 |   60.9 |
      |                   30 |      200 |             195 |    158 |
      |                   60 |      423 |             382 |    343 |
      
      MP4 (generated with `ffmpeg -hide_banner -f lavfi -t ${duration} -i testsrc "${file}.mp4"`)
      
      | Video Duration [sec] | Original [msec] | Fill Buffer C++ [msec] | One Go [msec] |
      |----------------------|-----------------|------------------------|----------------|
      |                    1 |            18.7 |            18.1 |   10.3 |
      |                    5 |            42.2 |            40.6 |   25.2 |
      |                   10 |            73.9 |            71.8 |   43.6 |
      |                   30 |             202 |             194 |    116 |
      |                   60 |             396 |             386 |    227 |
      * Original (Python implementation)
      
      ```python
      r = StreamReader(src)
      r.add_video_stream(1, decoder_option={"threads": "1"})
      for chunk, in r.stream():
          pass
      ```
      
      * Fill Buffer C++ (this commit; the user code is unchanged)
      
      ```python
      r = StreamReader(src)
      r.add_video_stream(1, decoder_option={"threads": "1"})
      for chunk, in r.stream():
          pass
      ```
      
      * Using `process_all_packets` (process all in one go)
      
      ```python
      r = StreamReader(src)
      r.add_video_stream(1, decoder_option={"threads": "1"})
      r.process_all_packets()
      ```
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2954
      
      Reviewed By: carolineechen
      
      Differential Revision: D42349446
      
      Pulled By: mthrok
      
      fbshipit-source-id: 9e4e37923e46299c3f43f4ad17a2a2b938b2b197
      bf085b1f
    • Add multithread options to docstring and default to single thread (#2949) · 7f778fc9
      moto authored
      Summary:
      One can pass "threads" and "thread_type" to `decoder_option` of StreamReaader to change the multithreading configuration.
      
      These affects the timing that decoder starts emitting the decoded frames. i.e. how many packets at minimum have to be processed before the first frame is decoded.
      
      Overall, multithreading in decoder does not improve the performance.
      (One possible reason is because the design of StreamReader, "decode few frames then fetch them", does not suited to saturate the decoder with incoming packets.)
      
      num_threads=1 seems to exhibit overall good performance/resource balance.
      
      ![Decoding time over threading](https://user-images.githubusercontent.com/855818/210380632-1ab72608-76aa-4445-808a-50cef215dbac.png)
      ![Decoding time over threading (1)](https://user-images.githubusercontent.com/855818/210380639-ec5574b0-9087-4f22-84a4-a2c0ee632dd9.png)
      (Tested on 320x240, 25 FPS, YUV420P videos generated with `ffmpeg -f lavfi -t "${duration}" -i testsrc -pix_fmt "yuv420p"`)
      
      For this reason, we default to single-thread execution in StreamReader.

      Closes https://github.com/pytorch/audio/issues/2855
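
      A minimal sketch of opting back into multithreaded decoding via `decoder_option` (file name and values are illustrative):

      ```python
      from torchaudio.io import StreamReader

      r = StreamReader("test.mp4")
      r.add_video_stream(
          frames_per_chunk=32,
          decoder_option={"threads": "4", "thread_type": "frame"},  # default is now "threads": "1"
      )
      for (chunk,) in r.stream():
          pass
      ```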
      
      Follow-up: Apply similar change to encoder option in StreamWriter.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2949
      
      Reviewed By: carolineechen
      
      Differential Revision: D42343951
      
      Pulled By: mthrok
      
      fbshipit-source-id: aea234717d37918f99fc24f575dbcfe7dcae1e80
      7f778fc9
    • Use CCache if available (#2866) · d5b5aba6
      moto authored
      Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/2866
      
      Reviewed By: carolineechen
      
      Differential Revision: D42349474
      
      Pulled By: mthrok
      
      fbshipit-source-id: 31455184031fff52719ef829e40bb1e09e11b0e7
      d5b5aba6
    • Guard invocation of cuda_version (#2952) · 10787fee
      hwangjeff authored
      Summary:
      Currently, importing TorchAudio triggers a check of the CUDA version it was compiled with, which in turn calls `torch.ops.torchaudio.cuda_version()`. This function is available only if `libtorchaudio` is available; developers, however, may want to import TorchAudio regardless of its availability. To allow for such usage, this PR adds code that bypasses the check if `libtorchaudio` is not available.
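
      A rough sketch of the guard (the availability-check helper and module path are stand-ins, not the exact code):

      ```python
      import torch
      from torchaudio._internal.module_utils import is_module_available

      def _check_cuda_version():
          if not is_module_available("torchaudio.lib._torchaudio"):
              return None  # libtorchaudio absent: skip the check instead of failing at import
          version = torch.ops.torchaudio.cuda_version()
          # ... compare against the CUDA version PyTorch was built with, as before ...
          return version
      ```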
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2952
      
      Reviewed By: mthrok
      
      Differential Revision: D42336396
      
      Pulled By: hwangjeff
      
      fbshipit-source-id: 465353cf46b218c0bcdf51ca5cf0b83c93185f39
      10787fee
    • Add preemphasis and deemphasis transforms (#2935) · c2b62ae8
      Jeff Hwang authored
      Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/2935
      
      Reviewed By: mthrok
      
      Differential Revision: D42302275
      
      Pulled By: hwangjeff
      
      fbshipit-source-id: d995d335bf17d63d3c1dda77d8ef596570853638
      c2b62ae8
  6. 30 Dec, 2022 4 commits
    • Refactor YUV handling functions (#2946) · d0dca115
      moto authored
      Summary:
      * Split `convert_[yuv420p|nv12|nv12_cuda]` functions into allocation
      and data write functions.
      * Merge the `get_[interlaced|planar]_image_buffer` functions into
      `get_buffer` and `get_image_buffer`.
      * Disassemble `convert_XXX_image` helper functions.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2946
      
      Reviewed By: nateanl
      
      Differential Revision: D42287501
      
      Pulled By: mthrok
      
      fbshipit-source-id: b8dd0d52fd563a112a16887b643bf497f77dfb80
      d0dca115
    • Throw RuntimeError if path not found in LibriMix dataset (#2944) · 276dcd69
      Zhaoheng Ni authored
      Summary:
      The `root` path can be confusing to users who have not read the documentation. This PR adds a RuntimeError so that an invalid path is reported clearly.
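
      A sketch of the kind of check this adds (hypothetical names and message, not the exact code):

      ```python
      import os

      def _validate_root(root: str, subset_dir: str) -> str:
          path = os.path.join(root, subset_dir)
          if not os.path.isdir(path):
              raise RuntimeError(
                  f"The path {path} doesn't exist. "
                  "Please check the `root` argument, or download and extract the dataset first."
              )
          return path
      ```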
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2944
      
      Reviewed By: mthrok
      
      Differential Revision: D42281034
      
      Pulled By: nateanl
      
      fbshipit-source-id: 6e5f4bfb118583d678d6b7a2565ef263fe8e4a5a
      276dcd69
    • Refactor and optimize yuv420p and nv12 processing (#2945) · cc0d1e0b
      moto authored
      Summary:
      This commit refactors and optimizes the functions that convert AVFrames in `yuv420p` and `nv12` format into PyTorch Tensors.
      The performance is improved by about 30%.
      
      1. Reduce the number of intermediate Tensors allocated.
      2. Replace 2 calls to `repeat_interleave` with `F::interpolate`.
      
       * (`F::interpolate` is about 5x faster than `repeat_interleave`.)
          <details><summary>code</summary>
      
          ```bash
          #!/usr/bin/env bash
      
          set -e
      
          python -c """
          import torch
          import torch.nn.functional as F
      
          a = torch.arange(49, dtype=torch.uint8).reshape(7, 7).clone()
          val1 = a.repeat_interleave(2, -1).repeat_interleave(2, -2)
          val2 = F.interpolate(a.view((1, 1, 7, 7, 1)), size=[14, 14, 1], mode=\"nearest\")
          print(torch.sum(torch.abs(val1 - val2[0, 0, :, :, 0])))
          """
      
          python3 -m timeit \
                  --setup """
          import torch
      
          a = torch.arange(49, dtype=torch.uint8).reshape(7, 7).clone()
          """ \
                  """
          a.repeat_interleave(2, -1).repeat_interleave(2, -2)
          """
      
          python3 -m timeit \
                  --setup """
          import torch
          import torch.nn.functional as F
      
          a = torch.arange(49, dtype=torch.uint8).reshape(7, 7).clone()
          """ \
                  """
          F.interpolate(a.view((1, 1, 7, 7, 1)), size=[14, 14, 1], mode=\"nearest\")
          """
          ```
      
          </details>
      
          ```
          tensor(0)
          10000 loops, best of 5: 38.3 usec per loop
          50000 loops, best of 5: 7.1 usec per loop
          ```
      
      ## Benchmark Result
      
      <details><summary>code</summary>
      
      ```bash
      #!/usr/bin/env bash
      
      set -e
      
      mkdir -p tmp
      
      for ext in avi mp4; do
          for duration in 1 5 10 30 60; do
              printf "Testing ${ext} ${duration} [sec]\n"
      
              test_data="tmp/test_${duration}.${ext}"
              if [ ! -f "${test_data}" ]; then
                  printf "Generating test data\n"
                  ffmpeg -hide_banner -f lavfi -t ${duration} -i testsrc "${test_data}" > /dev/null 2>&1
              fi
      
              python -m timeit \
                     --setup="from torchaudio.io import StreamReader" \
                     """
      r = StreamReader(\"${test_data}\")
      r.add_basic_video_stream(frames_per_chunk=-1, format=\"yuv420p\")
      r.process_all_packets()
      r.pop_chunks()
      """
          done
      done
      ```
      
      </details>
      
      ![Time to decode AVI file](https://user-images.githubusercontent.com/855818/210008881-8cc83f18-0e51-46e3-afe9-a5ff5dff041e.png)
      
      <details><summary>raw data</summary>
      
      Video Type - AVI
      Duration | Before | After
      -- | -- | --
      1 | 10.3 | 6.29
      5 | 44.3 | 28.3
      10 | 89.3 | 56.9
      30 | 265 | 185
      60 | 555 | 353
      </details>
      
      ![Time to decode MP4 file](https://user-images.githubusercontent.com/855818/210008891-c4546c52-43d7-49d0-8eff-d866ad627129.png)
      
      <details><summary>raw data</summary>
      
      Video Type - MP4
      Duration | Before | After
      -- | -- | --
      1 | 15.3 | 10.5
      5 | 62.1 | 43.2
      10 | 124 | 83.8
      30 | 380 | 252
      60 | 721 | 511
      </details>
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2945
      
      Reviewed By: carolineechen
      
      Differential Revision: D42283269
      
      Pulled By: mthrok
      
      fbshipit-source-id: 59840f943ff516b69ab8ad35fed7104c48a0bf0c
      cc0d1e0b
    • Add subtractive synthesis tutorial (#2934) · 9f57951a
      moto authored
      Summary:
      Artifact: [subtractive_synthesis_tutorial](https://output.circle-artifacts.com/output/job/4c1ce33f-834d-48e0-ba89-2e91acdcb572/artifacts/0/docs/tutorials/subtractive_synthesis_tutorial.html)
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2934
      
      Reviewed By: carolineechen
      
      Differential Revision: D42284945
      
      Pulled By: mthrok
      
      fbshipit-source-id: d255b8e8e2a601a19bc879f9e1c38edbeebaf9b3
      9f57951a
  7. 29 Dec, 2022 3 commits
  8. 28 Dec, 2022 1 commit
  9. 27 Dec, 2022 1 commit
    • Refactor Buffer implementation in StreamReader (#2939) · 4699ef21
      moto authored
      Summary:
      The `Buffer` class is responsible for converting `AVFrame` into `torch::Tensor` and storing the frames in accordance with `frames_per_chunk` and `buffer_chunk_size`.

      There are four operating modes of Buffer: [audio|video] x [chunked|unchunked]. Audio and video have separate class implementations, while chunked/unchunked behavior depends on whether `frames_per_chunk < 0`.

      Chunked mode returns frames in chunks of a fixed number of frames, while unchunked mode returns frames as-is.

      As frames accumulate, chunked mode drops old frames, while unchunked mode retains all of them.

      Currently, the underlying buffer implementation is the same `std::deque<torch::Tensor>` in both cases. As we plan to make the chunked-mode behavior more efficient by changing the underlying buffer container, it will be easier if the unchunked-mode behavior is kept as-is in a separate class.
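
      From the user's point of view, the two modes correspond to the `frames_per_chunk` argument; a small illustration (file name is illustrative):

      ```python
      from torchaudio.io import StreamReader

      r = StreamReader("test.mp4")
      # Chunked: frames come in units of `frames_per_chunk`; at most
      # `buffer_chunk_size` chunks are kept, and older frames are dropped.
      r.add_basic_video_stream(frames_per_chunk=30, buffer_chunk_size=3)
      # Unchunked: frames_per_chunk=-1 returns all decoded frames; nothing is dropped.
      r.add_basic_video_stream(frames_per_chunk=-1)
      ```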
      
      This commit makes the following changes.
      
      * Change `Buffer` class into pure virtual class (interface).
      * Split `AudioBuffer` into` UnchunkedAudioBuffer` and `ChunkedAudioBuffer`.
      * Split `VideoBuffer` into` UnchunkedVideoBuffer` and `ChunkedVideoBuffer`.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2939
      
      Reviewed By: carolineechen
      
      Differential Revision: D42247509
      
      Pulled By: mthrok
      
      fbshipit-source-id: 7363e442a5b2db5dcbaaf0ffbfa702e088726d1b
      4699ef21
  10. 22 Dec, 2022 1 commit
  11. 21 Dec, 2022 1 commit
    • Extract libsox integration from libtorchaudio (#2929) · 1706a72f
      moto authored
      Summary:
      This commit makes the following changes to the C++ library organization
      - Move sox-related feature implementations from `libtorchaudio` to `libtorchaudio_sox`.
      - Remove the C++ implementations of `is_sox_available` and `is_ffmpeg_available`, as it is now sufficient to check the existence of `libtorchaudio_sox` and `libtorchaudio_ffmpeg` to determine availability (see the sketch after the background note below). This makes `libtorchaudio_sox` and `libtorchaudio_ffmpeg` independent from `libtorchaudio`.
      - Move PyBind11-based bindings (`_torchaudio_sox`, `_torchaudio_ffmpeg`) into `torchaudio.lib` so that the built library structure is less cluttered.
      
      Background:
      Originally, the `libsox` integration was the only C++ extension, and `libtorchaudio` was supposed to contain all of the C++ code.
      Things are different now: we have a number of C++ extensions, and we need to make the code/build structure more modular.

      The new `libtorchaudio_sox` contains the implementations, and `_torchaudio_sox` contains the PyBind11-based bindings.
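
      A rough sketch of how availability can now be derived (an illustration only, not the actual helper inside torchaudio):

      ```python
      def _is_sox_available() -> bool:
          # Availability is inferred from whether the sox extension library loads.
          try:
              import torchaudio.lib._torchaudio_sox  # noqa: F401
              return True
          except (ImportError, OSError):
              return False
      ```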
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2929
      
      Reviewed By: hwangjeff
      
      Differential Revision: D42159594
      
      Pulled By: mthrok
      
      fbshipit-source-id: 1a0fbca9e4143137f6363fc001b2378ce6029aa7
      1706a72f
  12. 20 Dec, 2022 1 commit
  13. 19 Dec, 2022 2 commits
    • Split extract_archive into dedicated functions. (#2927) · 5807078c
      moto authored
      Summary:
      `extract_archive` in `datasets.utils` does not distinguish the input type; it blindly treats the input as tar, and then as zip in case of failure.

      This is an anti-pattern. Every dataset implementation knows which archive type its downloaded files use.

      This commit splits the extract_archive function into dedicated functions and makes each dataset use the correct one.
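
      A rough sketch of the shape of the dedicated helpers (hypothetical names and signatures):

      ```python
      import tarfile
      import zipfile

      def _extract_tar(from_path: str, to_path: str) -> None:
          with tarfile.open(from_path) as tar:
              tar.extractall(to_path)

      def _extract_zip(from_path: str, to_path: str) -> None:
          with zipfile.ZipFile(from_path) as zfile:
              zfile.extractall(to_path)
      ```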
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2927
      
      Reviewed By: carolineechen
      
      Differential Revision: D42154069
      
      Pulled By: mthrok
      
      fbshipit-source-id: bc46cc2af26aa086ef49aa1f9a94b6dedb55f85e
      5807078c
    • Remove deprecated/unused functions from datasets.utils (#2926) · d744f33f
      moto authored
      Summary:
      `stream_url`, `download_url` and `validate_file` are not used and are not listed in the documentation (`download_url` is marked as deprecated), so they are removed.
      
      This will also fix the failing bandit workflow.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2926
      
      Reviewed By: carolineechen
      
      Differential Revision: D42153484
      
      Pulled By: mthrok
      
      fbshipit-source-id: 0fccdc7b7e0e40db8046e12f46eb68de57d838ca
      d744f33f
  14. 17 Dec, 2022 1 commit
  15. 16 Dec, 2022 3 commits
  16. 15 Dec, 2022 1 commit
    • Switch to Nova MacOS Wheel (#2907) · 0be8423d
      DanilBaibak authored
      Summary:
      Switch to Nova MacOS and M1 Wheels. This PR is a step in migrating from CircleCI to the Nova workflow.
      
      - [x] Disable the CircleCI builds for MacOS Wheel.
      - [x] Disable the CircleCI builds for M1 Wheel.
      - [x] Enable the Nova workflow for MacOS Wheel.
      - [x] Enable the Nova workflow for M1 Wheel.
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2907
      
      Reviewed By: osalpekar, mthrok
      
      Differential Revision: D42040965
      
      Pulled By: DanilBaibak
      
      fbshipit-source-id: b87f028cf5686bf97265109591fb0a8c1190324c
      0be8423d
  17. 13 Dec, 2022 2 commits
  18. 12 Dec, 2022 1 commit
    • Update precise seek behavior for t=0 (#2915) · cbd35438
      moto authored
      Summary:
      It was reported that when videos with invalid PTS values are fed to StreamReader, StreamReader returns only the last frame.
      
      https://github.com/pytorch/vision/blob/677fc939b21a8893f07db4c1f90482b648b6573f/test/assets/videos/RATRACE_wave_f_nm_np1_fr_goo_37.avi
      
      ```
      import torchaudio
      
      src = "RATRACE_wave_f_nm_np1_fr_goo_37.avi"
      
      streamer = torchaudio.io.StreamReader(src=src)
      streamer.add_basic_video_stream(frames_per_chunk=-1)
      streamer.process_all_packets()
      video, = streamer.pop_chunks()
      
      print(video.size(0))  # prints 1, but there are more than 70 frames
      ```
      
      The reason why not all of the frames are returned is the invalid PTS values. All of the frames' PTS values are `-9223372036854775808`, so the internal mechanism discards them.
      
      The reason why the last frame is output is that, when entering drain mode, a discard value of -1 is used, which is interpreted as "no discard".
      
      For the second issue, the discard behavior should be consistent between regular decoding and drain mode.

      For the first issue, although normal behavior is not guaranteed for such invalid input, we can support the case where one reads the video from the start (or seeks to t=0).
      
       ---
      
      This commit makes the following changes to address the two issues above.
      
      1. Define the discard_before_pts attribute on StreamProcessor, so that StreamProcessor is aware of the discard behavior without being told by StreamReader, and its behavior is consistent between regular decoding and drain.

         This gets rid of the discard_before_pts computation that currently happens every time a frame is processed, so it should also improve performance a bit.

      2. Change the meaning of discard_before_pts so that, when it is 0, no discard happens. With this change, negative values are no longer necessary, so they are given undefined-behavior status.
      
      Note:
         Even with these changes, seeking in videos with invalid PTS values is not feasible; client code can implement a fallback which decodes the frames first and then discards the undesired ones (a sketch follows).
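
      A rough sketch of such a client-side fallback (the frame rate and seek target are illustrative; in practice, read the frame rate from the source stream metadata):

      ```python
      from torchaudio.io import StreamReader

      streamer = StreamReader(src="RATRACE_wave_f_nm_np1_fr_goo_37.avi")
      streamer.add_basic_video_stream(frames_per_chunk=-1)
      streamer.process_all_packets()
      (video,) = streamer.pop_chunks()

      frame_rate = 30.0   # assumed frame rate
      start_sec = 1.5     # desired "seek" position in seconds
      video = video[int(start_sec * frame_rate):]  # drop frames before the target time
      ```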
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2915
      
      Reviewed By: nateanl
      
      Differential Revision: D41957784
      
      Pulled By: mthrok
      
      fbshipit-source-id: 2dafdbada5aa33bfc81c986306f80642ba6277df
      cbd35438