1. 09 Dec, 2022 1 commit
  2. 04 Dec, 2022 1 commit
  3. 18 Nov, 2022 4 commits
  4. 16 Nov, 2022 4 commits
  5. 15 Nov, 2022 1 commit
  6. 03 Nov, 2022 2 commits
  7. 02 Nov, 2022 1 commit
  8. 29 Oct, 2022 1 commit
  9. 20 Oct, 2022 1 commit
  10. 19 Oct, 2022 4 commits
  11. 18 Oct, 2022 1 commit
  12. 17 Oct, 2022 1 commit
  13. 14 Oct, 2022 2 commits
  14. 13 Oct, 2022 5 commits
  15. 12 Oct, 2022 4 commits
    • Improve hubert recipe for pre-training and fine-tuning (#2744) · 928248d7
      Zhaoheng Ni authored
      Summary:
      Following PR https://github.com/pytorch/audio/issues/2716:
      - For preprocessing
        - The HuBERT features take a lot of memory, which may not fit on some machines. Enable training the k-means model on a subset of the features.
      
      - For pre-training
        - Normalize the loss based on the total number of masked frames across all GPUs.
        - Use mixed-precision training, since pure fp16 is not well supported in pytorch_lightning.
        - Log accuracies of masked/unmasked frames during training.
        - Clip gradients to a maximum norm of `10.0`.
      
      - For ASR fine-tuning
        - Normalize the loss based on the total number of batches across all GPUs, as in TorchAudio's Conformer recipe.
        - Use mixed precision training.
        - Append "|" to the end of each transcription to capture silence/word termination, as in the fairseq recipe.
      
      - Update the WER results on LibriSpeech dev and test sets.
      
      |                   | WER% (Viterbi)|  WER% (KenLM) |
      |:-----------------:|--------------:|--------------:|
      | dev-clean         |       10.9    |       4.2     |
      | dev-other         |       17.5    |       9.4     |
      | test-clean        |       10.9    |       4.4     |
      | test-other        |       17.8    |       9.5     |
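
      The loss normalization and gradient clipping described above can be sketched as follows. This is a hypothetical illustration, not the recipe's actual code; it assumes a standard DistributedDataParallel setup in which `torch.distributed` may or may not be initialized, and the function name `normalized_masked_loss` is invented for illustration.

      ```python
      import torch
      import torch.distributed as dist
      import torch.nn.functional as F

      def normalized_masked_loss(logits, targets, mask):
          """Cross-entropy over masked frames, normalized by the total
          number of masked frames across all GPUs (sketch)."""
          # Sum (not mean) the per-frame losses on this GPU.
          loss = F.cross_entropy(logits[mask], targets[mask], reduction="sum")
          num_masked = mask.sum().to(loss.dtype)
          if dist.is_available() and dist.is_initialized():
              # Total masked frames across all workers.
              dist.all_reduce(num_masked, op=dist.ReduceOp.SUM)
          return loss / num_masked

      # After loss.backward(), clip gradients to a maximum norm of 10.0:
      # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
      ```

      On a single process the division reduces to a plain mean over the masked frames; the `all_reduce` only changes the denominator when multiple workers hold different numbers of masked frames.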
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2744
      
      Reviewed By: carolineechen
      
      Differential Revision: D40282322
      
      Pulled By: nateanl
      
      fbshipit-source-id: 4723584c912e70e8970149fe09de005385eaab90
    • Skip hubert xlarge torchscript test (#2758) · 97baba1b
      Caroline Chen authored
      Summary:
      A couple of CircleCI unit tests are failing during the HuBERT xlarge TorchScript test, which has been known to fail on Windows in the past (#65776). This PR disables the test on CircleCI.
      
      cc atalman
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2758
      
      Reviewed By: mthrok
      
      Differential Revision: D40290535
      
      Pulled By: carolineechen
      
      fbshipit-source-id: 5c5fb43434a517b6c439a8cb8e853015d1550a57
    • Improve wav2vec2/hubert model for pre-training (#2716) · 6de7bb98
      Zhaoheng Ni authored
      Summary:
      This PR improves the Wav2Vec2/HuBERT model regarding model pre-training.
      
      - Proper initialization of the positional embedding and Transformer modules is essential for pre-training. The accuracy on unmasked frames should be higher than on masked frames, since predicting them is an easier task; without the initialization, however, the accuracy on masked frames ends up higher than on unmasked frames.
        Performance was compared after two epochs with 16 GPUs.
        - With model initialization, the accuracies of masked/unmasked frames are 0.08/0.11.
        - Without model initialization, the accuracies of masked/unmasked frames are 0.06/0.04.
      - After adding the model initialization, the gradients easily overflow (i.e., become `nan`). In the paper [Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision](https://arxiv.org/abs/2112.08778), the authors propose a simple but effective way to mitigate the overflow: scale down the query-key product and subtract its maximum value before the softmax (subtracting a constant does not change the softmax output). This guarantees the values cannot overflow.
      - In the original fairseq, the mask indices are generated by `numpy.random.choice`. Here, `torch.multinomial` is replaced with `torch.randperm` (cc carolineechen).
      
      Other improvements within training scripts will be included in a separate PR.
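
      The scaling and max-subtraction trick above can be sketched roughly as follows. This is a hypothetical illustration under the usual scaled-dot-product-attention setup, not the actual torchaudio implementation, and `stable_attention` is an invented name.

      ```python
      import torch

      def stable_attention(query, key, value):
          """Attention whose logits are kept overflow-safe (sketch)."""
          d = query.size(-1)
          # Scale the query *before* the matmul so the q·k product itself
          # stays small, instead of scaling an already-overflowed result.
          scores = (query * d ** -0.5) @ key.transpose(-2, -1)
          # Subtracting the row-wise maximum shifts every logit by a
          # constant, which leaves the softmax output unchanged but
          # bounds the values fed to exp().
          scores = scores - scores.max(dim=-1, keepdim=True).values
          return torch.softmax(scores, dim=-1) @ value
      ```

      Because the softmax is shift-invariant per row, this returns the same output as the unshifted version in fp32, while keeping the intermediate logits representable in fp16.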
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2716
      
      Reviewed By: xiaohui-zhang
      
      Differential Revision: D39832189
      
      Pulled By: nateanl
      
      fbshipit-source-id: f4d2a473a79ad63add2dd16624bd155d5ce4de27
    • Fix torchaudio build channel for the release (#2759) · 8b2fbf28
      Andrey Talman authored
      * Fix torchaudio build channel
      
      * Fix channel
  16. 11 Oct, 2022 6 commits
  17. 10 Oct, 2022 1 commit
    • Add unit test for LibriMix dataset (#2659) · c5b8e585
      Zhaoheng Ni authored
      Summary:
      Besides the unit test, the PR also addresses these issues:
      - The original `LibriMix` dataset only supports "min" mode, in which the audio length is the minimum over all clean sources; this is the default for the source separation task. Users may also want "max" mode, which allows end-to-end separation and recognition. The PR adds a ``mode`` argument so users can choose which variant to use.
      - If the task is ``"enh_both"``, the target should be the audio in ``mix_clean`` rather than the separate clean sources. The PR fixes this by using ``mix_clean`` as the target.
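
      The "min"/"max" length convention above can be illustrated with a small sketch. This is a hypothetical helper, not the dataset's actual code (LibriMix ships pre-mixed audio); it only shows how the two modes treat sources of different lengths.

      ```python
      import torch
      import torch.nn.functional as F

      def align_sources(sources, mode="min"):
          """Illustrate the "min"/"max" length conventions (sketch)."""
          lengths = [s.size(-1) for s in sources]
          if mode == "min":
              # Truncate every source to the shortest one.
              target = min(lengths)
              return [s[..., :target] for s in sources]
          if mode == "max":
              # Zero-pad every source to the longest one.
              target = max(lengths)
              return [F.pad(s, (0, target - s.size(-1))) for s in sources]
          raise ValueError(f"Unsupported mode: {mode!r}")
      ```

      "min" mode keeps only the fully overlapped region (convenient for separation metrics), while "max" mode preserves every sample, which is what an end-to-end separation-plus-recognition pipeline needs.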
      
      Pull Request resolved: https://github.com/pytorch/audio/pull/2659
      
      Reviewed By: carolineechen
      
      Differential Revision: D40229227
      
      Pulled By: nateanl
      
      fbshipit-source-id: fc07e0d88a245e1367656d3767cf98168a799235