Commits · 3267c7ed38088e67dd1bdb4095689d82747b0d75 · OpenDAS / Torchaudio

22 Feb, 2023 1 commit

Add objective metric estimation model for speech enhancement (#3042) · 3267c7ed

Zhaoheng Ni authored Feb 21, 2023

Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/3042

Reviewed By: mthrok

Differential Revision: D43405932

Pulled By: nateanl

fbshipit-source-id: 88f6dabae35565b699230e9909b8f68f4a57f5c7


14 Feb, 2023 1 commit

Add simulate_rir_ism method for room impulse response simulation (#2880) · 8c5c9a9b

Zhaoheng Ni authored Feb 14, 2023

Summary:
Replicate of https://github.com/pytorch/audio/issues/2644.

Pull Request resolved: https://github.com/pytorch/audio/pull/2880

Reviewed By: mthrok

Differential Revision: D41633911

Pulled By: nateanl

fbshipit-source-id: 73cf145d75c389e996aafe96571ab86dc21f86e5


15 Jan, 2023 1 commit

Add pre-trained pipelines for XLS-R models (#2978) · 9b7b64e4

Zhaoheng Ni authored Jan 15, 2023

Summary:
This PR adds three `Wav2Vec2Bundle` pipeline objects for XLS-R models:
- WAV2VEC2_XLSR_300M
- WAV2VEC2_XLSR_1B
- WAV2VEC2_XLSR_2B

All three models use layer normalization in the feature extraction layers, hence `_normalize_waveform` is set to `True`.
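
As a rough illustration of what normalizing a waveform means here, the sketch below rescales a waveform to zero mean and roughly unit variance in plain Python. It assumes that `_normalize_waveform=True` corresponds to this kind of whole-utterance normalization; the helper name `normalize_waveform` and the `eps` value are hypothetical and not taken from the PR.

```python
import math
from typing import List


def normalize_waveform(samples: List[float], eps: float = 1e-5) -> List[float]:
    """Rescale samples to zero mean and (approximately) unit variance.

    Hypothetical helper: a dependency-free stand-in for the tensor-based
    normalization a real pipeline would apply.
    """
    n = len(samples)
    mean = sum(samples) / n
    # Population variance over the whole utterance.
    var = sum((s - mean) ** 2 for s in samples) / n
    # eps keeps the division stable for near-silent input.
    scale = 1.0 / math.sqrt(var + eps)
    return [(s - mean) * scale for s in samples]


normalized = normalize_waveform([0.1, 0.3, -0.2, 0.4])
```

A model whose feature extractor was trained with this normalization would generally expect the same treatment of its input at inference time.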

Pull Request resolved: https://github.com/pytorch/audio/pull/2978

Reviewed By: hwangjeff

Differential Revision: D42501491

Pulled By: nateanl

fbshipit-source-id: 2429ec880cc14798034843381e458e1b4664dac3


13 Jan, 2023 1 commit

Add XLS-R models (#2959) · a5664ca9

Zhaoheng Ni authored Jan 12, 2023

Summary:
XLSR (cross-lingual speech representation) is a family of self-supervised learning models for generating cross-lingual speech representations. It was first proposed in https://arxiv.org/pdf/2006.13979.pdf, whose model is trained on 53 languages (the so-called XLSR-53). This PR adds support for the XLS-R models from https://arxiv.org/pdf/2111.09296.pdf, which have more parameters (300M, 1B, 2B) and are trained on 128 languages.

Pull Request resolved: https://github.com/pytorch/audio/pull/2959

Reviewed By: mthrok

Differential Revision: D42397643

Pulled By: nateanl

fbshipit-source-id: 23e8e51a7cde0a226db4f4028db7df8f02b986ce


08 Dec, 2022 1 commit

Add HiFi GAN Generator to prototypes (#2860) · b5e4663a

Grigory Sizov authored Dec 08, 2022

Summary:
Part 1 of [T138011314](https://www.internalfb.com/intern/tasks/?t=138011314)

This PR ports the generator part of [HiFi GAN](https://arxiv.org/abs/2010.05646v2) from [the original implementation](https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/models.py#L75)

Adds tests:
- Smoke tests for architectures V1, V2, V3
- Check that output shapes are correct
- Check that the model is torchscriptable and scripting doesn't change the output
- Check that our code's output matches the original implementation. The test clones the original repo inside `/tmp` and imports the necessary objects from inside the test function. On test teardown I restore `PATH` but don't remove the cloned code, so that it can be reused on subsequent runs. Let me know if removing it would be better practice.

There are no quantization tests, because the model consists mainly of `Conv1d` and `ConvTranspose1d` layers, which are [not supported by dynamic quantization](https://pytorch.org/docs/stable/quantization.html).

Pull Request resolved: https://github.com/pytorch/audio/pull/2860

Reviewed By: nateanl

Differential Revision: D41433416

Pulled By: sgrigory

fbshipit-source-id: f135c560df20f5138f01e3efdd182621edabb4f5


07 Dec, 2022 1 commit

Introduce MUSAN dataset (#2888) · 45c7d05a

hwangjeff authored Dec 06, 2022

Summary:
Introduces the MUSAN dataset (https://www.openslr.org/17/), which contains music, speech, and noise recordings.

Pull Request resolved: https://github.com/pytorch/audio/pull/2888

Reviewed By: xiaohui-zhang

Differential Revision: D41762164

Pulled By: hwangjeff

fbshipit-source-id: 14d5baaa4d40f065dd5d99bf7f2e0a73aa6c31a9


30 Nov, 2022 1 commit

Add speed and speed perturbation functions and transforms (#2829) · c28073cc

hwangjeff authored Nov 30, 2022

Summary:
Adds functions and transforms for speed and speed perturbation (https://www.isca-speech.org/archive/interspeech_2015/ko15_interspeech.html).
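
To make the idea concrete, here is a minimal, dependency-free sketch of changing speed by resampling and of picking the factor at random. It is not the code added by this PR: linear interpolation stands in for the band-limited resampling a real implementation would use, and the names `change_speed` and `speed_perturb` as well as the default factors are assumptions chosen for illustration.

```python
import random
from typing import List, Optional, Sequence


def change_speed(samples: List[float], factor: float) -> List[float]:
    """Return samples sped up by `factor` (> 1 is faster, < 1 is slower)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    # A faster signal needs fewer samples at the same playback rate.
    out_len = max(1, round(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        # Position in the input that output sample i corresponds to.
        pos = i * factor
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def speed_perturb(
    samples: List[float],
    factors: Sequence[float] = (0.9, 1.0, 1.1),
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Apply a speed factor drawn uniformly at random from `factors`."""
    rng = rng or random.Random()
    return change_speed(samples, rng.choice(factors))


ramp = [float(i) for i in range(8)]
faster = change_speed(ramp, 2.0)  # half as many samples
```

Because the output is meant to be played back at the original sample rate, a factor above 1 shortens the signal and a factor below 1 lengthens it.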

Pull Request resolved: https://github.com/pytorch/audio/pull/2829

Reviewed By: xiaohui-zhang

Differential Revision: D41285114

Pulled By: hwangjeff

fbshipit-source-id: 114740507698e01f35d4beb2c568a2479e847506


10 Nov, 2022 1 commit

Add conformer w2v2 model architecture (#2826) · 74f9a894

Caroline Chen authored Nov 09, 2022

Summary:
Internal comparison tests: D40080919

Follow-up PR for pretrained models: https://github.com/pytorch/audio/issues/2827

Pull Request resolved: https://github.com/pytorch/audio/pull/2826

Reviewed By: nateanl

Differential Revision: D41160061

Pulled By: carolineechen

fbshipit-source-id: f3c478b28c235af53d1d8e21b573c53684a63ac4


09 Nov, 2022 1 commit

Add WavLM model (#2822) · bd76d3d7

Grigory Sizov authored Nov 09, 2022

Summary:
Closes T136364380

Added [WavLM Model](https://github.com/microsoft/UniSpeech/tree/main/WavLM):
- Added `WavLMSelfAttention` class (from [original implementation](https://github.com/microsoft/UniSpeech/blob/2e9dde8bf815a5f5fd958e3435e5641f59f96928/WavLM/modules.py)) and adjusted existing Encoder and Transformer classes to be compatible with it
- Added factory functions `wavlm_model`, `wavlm_base`, `wavlm_large` to `models/wav2vec2/model.py`
- Added bundles for base and large models to pipelines. **TODO**: pre-trained model weights are not yet uploaded to `download.pytorch.org`, permissions not granted yet.

## Tests
- Expanded HuggingFace integration tests to cover WavLM. For these tests, added JSON configs for base and large models from HF ([base](https://huggingface.co/microsoft/wavlm-base/blob/main/config.json), [large](https://huggingface.co/microsoft/wavlm-large/blob/main/config.json)) into test assets
- Expanded TorchScript and quantization tests to cover WavLM

## Comments
There are a few workarounds I had to introduce:
- Quantization tests for WavLM were failing at [`torch.cat`](https://github.com/pytorch/audio/pull/2822/files#diff-6f1486901c94320ec0610a460dc674638fab9d104a61564ff7b59353a8b8547cR466) ~~until I excluded the arguments of `torch.cat` from quantization [here](https://github.com/pytorch/audio/pull/2822/files#diff-6f1486901c94320ec0610a460dc674638fab9d104a61564ff7b59353a8b8547cR368-R369). I haven't found a better way to fix it, let me know if there is one~~ The reason for this seems to be that quantization replaces the `.bias` and `.weight` attributes of a `Linear` module with methods. Since we are using weights and biases directly, the code broke. The final solution suggested by nateanl was to define attention weights and biases directly in `WavLMSelfAttention`, skipping the `Linear` layers.
- ~~WavLM uses position embedding in the first layer of encoder, but not in the subsequent ones.  So [UniSpeech](https://github.com/microsoft/UniSpeech/blob/2e9dde8bf815a5f5fd958e3435e5641f59f96928/WavLM/modules.py#L342) and [HF](https://github.com/huggingface/transformers/blob/b047472650cba259621549ac27b18fd2066ce18e/src/transformers/models/wavlm/modeling_wavlm.py#L441-L442) implementations only create this embedding module in the layers where it's used. However, we can't do this here because it breaks TorchScript. So as a solution I add a dummy `Identity` module to `WavLMSelfAttention` when the actual embedding is not needed: [here](https://github.com/pytorch/audio/pull/2822/files#diff-6f1486901c94320ec0610a460dc674638fab9d104a61564ff7b59353a8b8547cR361-R368).~~ Thanks nateanl for resolving this!
- I had to add dummy `position_bias` and `key_padding_mask` arguments to `SelfAttention.forward` to make TorchScript tests pass. Since both `SelfAttention` and `WavLMSelfAttention` are called from `EncoderLayer`, they need to have compatible signatures. Having a variable number of arguments with `**kwargs` or checking object class doesn't seem to work with TorchScript, so I instead made both types of attention accept `position_bias` and `key_padding_mask` arguments.
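
The signature workaround in the last item can be sketched in plain Python as follows. This is only an illustration of the design, not the PR's code: the classes operate on lists of floats instead of tensors, nothing here is scripted by TorchScript, and the class names `PlainAttention` and `BiasedAttention` are hypothetical stand-ins for `SelfAttention` and `WavLMSelfAttention`.

```python
from typing import List, Optional

Vector = List[float]


class PlainAttention:
    """Stand-in for an attention block that has no use for a position bias."""

    def forward(
        self,
        x: Vector,
        position_bias: Optional[Vector] = None,
        key_padding_mask: Optional[List[bool]] = None,
    ) -> Vector:
        # Both extra arguments are accepted only for signature compatibility
        # and are deliberately ignored.
        return list(x)


class BiasedAttention:
    """Stand-in for an attention block that does use a position bias."""

    def forward(
        self,
        x: Vector,
        position_bias: Optional[Vector] = None,
        key_padding_mask: Optional[List[bool]] = None,
    ) -> Vector:
        if position_bias is None:
            position_bias = [0.0] * len(x)
        return [value + bias for value, bias in zip(x, position_bias)]


class EncoderLayer:
    """Calls whichever attention it was given, without inspecting its type."""

    def __init__(self, attention) -> None:
        self.attention = attention

    def forward(
        self,
        x: Vector,
        position_bias: Optional[Vector] = None,
        key_padding_mask: Optional[List[bool]] = None,
    ) -> Vector:
        return self.attention.forward(
            x, position_bias=position_bias, key_padding_mask=key_padding_mask
        )


plain_layer = EncoderLayer(PlainAttention())
biased_layer = EncoderLayer(BiasedAttention())
```

Because both `forward` methods take the same explicit arguments, the calling layer needs neither `**kwargs` nor a class check, which are the two options the description reports as not working with TorchScript.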

Nit: do we still need to specify `__all__` if there are no wildcard imports in `__init__.py`, e.g. in `torchaudio/models/__init__.py`?

Pull Request resolved: https://github.com/pytorch/audio/pull/2822

Reviewed By: nateanl

Differential Revision: D41121855

Pulled By: sgrigory

fbshipit-source-id: 9f4f787e5810010de4e74cb704063a26c66767d7


11 Oct, 2022 1 commit

Add Snips Dataset (#2738) · 84187909

Zhaoheng Ni authored Oct 10, 2022

Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/2738

Reviewed By: carolineechen

Differential Revision: D40238099

Pulled By: nateanl

fbshipit-source-id: c5cc94c2a348a6ef34c04b8dd26114ecb874d73e


09 Oct, 2022 1 commit

Add IEMOCAP dataset (#2732) · 0b4b1fd4

Caroline Chen authored Oct 09, 2022

Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/2732

Reviewed By: nateanl

Differential Revision: D40186996

Pulled By: nateanl

fbshipit-source-id: a0ad325b7153c9e580dad2c515730dadbe8840c4


21 Sep, 2022 2 commits

Adopt `:autosummary:` in `torchaudio.pipelines` module doc (#2689) · 0b3ddec6

moto authored Sep 21, 2022

Summary:
* Introduce the mini-index at `torchaudio.pipelines` page.
* Add introductions
* Update pipeline tutorials

https://output.circle-artifacts.com/output/job/ccc57d95-1930-45c9-b967-c8d477d35f29/artifacts/0/docs/pipelines.html

<img width="1163" alt="Screen Shot 2022-09-20 at 1 23 29 PM" src="https://user-images.githubusercontent.com/855818/191167049-98324e93-2e16-41db-8538-3b5b54eb8224.png">

<img width="1115" alt="Screen Shot 2022-09-20 at 1 23 49 PM" src="https://user-images.githubusercontent.com/855818/191167071-4770f594-2540-43a4-a01c-e983bf59220f.png">

https://output.circle-artifacts.com/output/job/ccc57d95-1930-45c9-b967-c8d477d35f29/artifacts/0/docs/generated/torchaudio.pipelines.RNNTBundle.html#torchaudio.pipelines.RNNTBundle

<img width="1108" alt="Screen Shot 2022-09-20 at 1 24 18 PM" src="https://user-images.githubusercontent.com/855818/191167123-51b33a5f-c30c-46bc-b002-b05d2d0d27b7.png">

Pull Request resolved: https://github.com/pytorch/audio/pull/2689

Reviewed By: carolineechen

Differential Revision: D39691253

Pulled By: mthrok

fbshipit-source-id: ddf5fdadb0b64cf2867b6271ba53e8e8c0fa7e49


Adopt `:autosummary:` in `torchaudio.models` module doc (#2690) · 30c7077b

moto authored Sep 20, 2022

Summary:
* Introduce the mini-index at `torchaudio.models` page.

https://output.circle-artifacts.com/output/job/25e59810-3866-4ece-b1b7-8a10c7a2286d/artifacts/0/docs/models.html

<img width="1042" alt="Screen Shot 2022-09-20 at 1 20 50 PM" src="https://user-images.githubusercontent.com/855818/191166816-83314ad1-8b67-475b-aa10-d4cc59126295.png">

<img width="1048" alt="Screen Shot 2022-09-20 at 1 20 58 PM" src="https://user-images.githubusercontent.com/855818/191166829-1ceb65e0-9506-4328-9a2f-8b75b4e54404.png">

Pull Request resolved: https://github.com/pytorch/audio/pull/2690

Reviewed By: carolineechen

Differential Revision: D39654948

Pulled By: mthrok

fbshipit-source-id: 703d1526617596f647c85a7148f41ca55fffdbc8


12 Jul, 2022 1 commit

Hybrid Demucs model implementation (#2506) · 608b8ea6

Sean Kim authored Jul 12, 2022

Summary:
Draft PR with the initial model implementation, with minor changes from the previous implementation.

Pull Request resolved: https://github.com/pytorch/audio/pull/2506

Reviewed By: nateanl

Differential Revision: D37762671

Pulled By: skim0514

fbshipit-source-id: b7dc0a6ef725d6ae6d76c23c882623f7d339977c


27 Jun, 2022 1 commit

Add VoxCeleb1 dataset (#2349) · 21b2d139

Zhaoheng Ni authored Jun 27, 2022

Summary:
This PR adds two dataset classes for the VoxCeleb1 corpus:
- `VoxCeleb1Identification`: each data sample contains the waveform, sample rate, speaker id, and file id.
- `VoxCeleb1Verification`: each data sample contains a pair of waveforms, the sample rate, a label indicating whether they are from the same speaker, and the file ids.
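
The two sample layouts described above could be pictured roughly as follows. This is a sketch under assumptions: the tuple class names, the field names, the integer types for the speaker id and label, and the placeholder file ids are all hypothetical, and a list of floats stands in for an audio tensor.

```python
from typing import List, NamedTuple

Waveform = List[float]  # stand-in for an audio tensor


class IdentificationSample(NamedTuple):
    """Hypothetical layout of one speaker-identification sample."""

    waveform: Waveform
    sample_rate: int
    speaker_id: int
    file_id: str


class VerificationSample(NamedTuple):
    """Hypothetical layout of one speaker-verification sample."""

    waveform_1: Waveform
    waveform_2: Waveform
    sample_rate: int
    label: int  # assumed encoding: 1 if same speaker, 0 otherwise
    file_id_1: str
    file_id_2: str


# Placeholder values only; they do not come from the corpus.
pair = VerificationSample(
    waveform_1=[0.0, 0.1],
    waveform_2=[0.2, 0.3],
    sample_rate=16000,
    label=1,
    file_id_1="placeholder-a",
    file_id_2="placeholder-b",
)
first, second, rate, same_speaker, id_a, id_b = pair
```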

Pull Request resolved: https://github.com/pytorch/audio/pull/2349

Reviewed By: carolineechen

Differential Revision: D35927921

Pulled By: nateanl

fbshipit-source-id: 3e07ddd329178777698841565053eb59befe6449


21 Jun, 2022 1 commit

Create musdb handler and tests (#2484) · b92a8a09

Sean Kim authored Jun 21, 2022

Summary:
Creates a dataset handler and tests for the new dataset. Validity was checked with both manual and unit tests, and pre-commit was run for style checks.

Pull Request resolved: https://github.com/pytorch/audio/pull/2484

Reviewed By: carolineechen, nateanl

Differential Revision: D37250556

Pulled By: skim0514

fbshipit-source-id: d2c8d73d22fd9d7282026265676f3eab1e178d51


20 Jun, 2022 1 commit

Add fluent speech commands (#2480) · 66a67d2e

Caroline Chen authored Jun 20, 2022

Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/2480

Reviewed By: nateanl

Differential Revision: D37249571

Pulled By: carolineechen

fbshipit-source-id: caefeec4253c91f2579655a0c1735edaeed51be9


10 May, 2022 2 commits

Add ConvEmformer module (#2358) · 2c79b55a

hwangjeff authored May 10, 2022

Summary:
Adds an implementation of the convolution-augmented streaming transformer (effectively Emformer with convolution block) described in https://arxiv.org/abs/2110.05241.

Continuation of https://github.com/pytorch/audio/issues/2324.

Pull Request resolved: https://github.com/pytorch/audio/pull/2358

Reviewed By: nateanl, xiaohui-zhang

Differential Revision: D36137992

Pulled By: hwangjeff

fbshipit-source-id: 9c7a7c233944fe9ef15b9ba397d7f0809da1f063


Add citations for datasets (#2371) · 638120ca

Caroline Chen authored May 09, 2022

Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/2371

Reviewed By: xiaohui-zhang

Differential Revision: D36246167

Pulled By: carolineechen

fbshipit-source-id: 23042a1c393711864a18c9815d248c18d1d258b4


08 Apr, 2022 1 commit

Add devices/properties badges (#2321) · 72ae755a

moto authored Apr 07, 2022

Summary:
Add badges of supported properties and devices to functionals and transforms.

This commit adds `.. devices::` and `.. properties::` directives to sphinx.

APIs with these directives will have badges (based off of shields.io) which link to the page describing these features.
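
As a sketch of how such directives might appear in practice, the docstring below places them in a function's documentation. The function `hypothetical_transform` is made up for illustration, and the directive arguments (`CPU CUDA`, `Autograd TorchScript`) are assumptions about the accepted values rather than something this commit message states.

```python
def hypothetical_transform(waveform):
    """Return the waveform unchanged (placeholder for a real transform).

    .. devices:: CPU CUDA

    .. properties:: Autograd TorchScript

    Args:
        waveform: Input audio samples.

    Returns:
        The same samples that were passed in.
    """
    return waveform
```

When Sphinx renders this docstring with the custom directives registered, each directive line would be replaced by the corresponding badges.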

Continuation of https://github.com/pytorch/audio/issues/2316
Excludes dtypes for now, pending further improvement, and adds the badges to most of the functionals and transforms.

Pull Request resolved: https://github.com/pytorch/audio/pull/2321

Reviewed By: hwangjeff

Differential Revision: D35489063

Pulled By: mthrok

fbshipit-source-id: f68a70ebb22df29d5e9bd171273bd19007a81762


24 Mar, 2022 1 commit

Update CTC decoder docs and add citation (#2278) · 05592dff

Caroline Chen authored Mar 24, 2022

Summary:
rendered:
- [tutorial](https://output.circle-artifacts.com/output/job/e7fb5a23-87cf-4dd5-b4a8-8b4f91e20eb4/artifacts/0/docs/tutorials/asr_inference_with_ctc_decoder_tutorial.html)
- [docs](https://output.circle-artifacts.com/output/job/e7fb5a23-87cf-4dd5-b4a8-8b4f91e20eb4/artifacts/0/docs/prototype.ctc_decoder.html)

Pull Request resolved: https://github.com/pytorch/audio/pull/2278

Reviewed By: mthrok

Differential Revision: D35097734

Pulled By: carolineechen

fbshipit-source-id: 1e5d5fff0b7740757cca358cf3ea44c6488fcd5c


25 Feb, 2022 1 commit

Add mvdr_weights_souden to torchaudio.functional (#2228) · 5d06a369

Zhaoheng Ni authored Feb 25, 2022

Summary:
This PR adds the ``mvdr_weights_souden`` method to ``torchaudio.functional``.
It computes the MVDR weight matrix based on the solution proposed by [Souden et al.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.725.673&rep=rep1&type=pdf).
The input arguments are, in order, the complex-valued power spectral density (PSD) matrix of the target speech, the PSD matrix of the noise, and an int or one-hot Tensor indicating the reference channel.
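
For reference, the Souden solution is commonly written as below. The notation is an assumption added here, not taken from the PR text: $\boldsymbol{\Phi}_{\mathbf{SS}}(f)$ and $\boldsymbol{\Phi}_{\mathbf{NN}}(f)$ denote the PSD matrices of the target speech and the noise at frequency bin $f$, and $\mathbf{u}$ is the one-hot vector selecting the reference channel.

```latex
\mathbf{w}_{\mathrm{MVDR}}(f) =
  \frac{\boldsymbol{\Phi}_{\mathbf{NN}}^{-1}(f)\,\boldsymbol{\Phi}_{\mathbf{SS}}(f)}
       {\operatorname{Trace}\!\left(\boldsymbol{\Phi}_{\mathbf{NN}}^{-1}(f)\,\boldsymbol{\Phi}_{\mathbf{SS}}(f)\right)}\,
  \mathbf{u}
```

The three terms on the right-hand side correspond one-to-one to the three input arguments listed above.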

Pull Request resolved: https://github.com/pytorch/audio/pull/2228

Reviewed By: mthrok

Differential Revision: D34474018

Pulled By: nateanl

fbshipit-source-id: 725df812f8f6e6cc81cc37e8c3cb0da2ab3b74fb


23 Dec, 2021 1 commit

Introduce Conformer (#2068) · 1b17b011

hwangjeff authored Dec 22, 2021

Summary:
Adds implementation of Conformer module.

Adapted from sravyapopuri388's implementation for fairseq at https://github.com/fairinternal/fairseq-py/pull/2770.

Pull Request resolved: https://github.com/pytorch/audio/pull/2068

Reviewed By: mthrok

Differential Revision: D33236957

Pulled By: hwangjeff

fbshipit-source-id: 382d99394996ff5249522b5899e1a4b4a95de9e6


25 Oct, 2021 1 commit

Add pretrained French ASR from voxpopuli (#1919) · cbf267c3

moto authored Oct 25, 2021

16 Oct, 2021 1 commit

Add SpecAugment figure/citation (#1887) · 9e3778d2

moto authored Oct 16, 2021

15 Oct, 2021 1 commit

Add TTS bundle/pipelines (#1872) · e885204e

moto authored Oct 15, 2021

Future work items:
- length computation of GriffinLim
- better way to make InverseMelScale work in inference_mode


06 Oct, 2021 1 commit

Introduce Emformer (#1801) · 48cfbf2b

hwangjeff authored Oct 06, 2021

Adds an implementation of Emformer, a memory-efficient transformer architecture
introduced in https://ieeexplore.ieee.org/document/9414560 that targets low-latency
streaming speech recognition applications.


05 Oct, 2021 1 commit

Add HUBERT_BASE and HUBERT_ASR_LARGE pretrained models (#1821) · 358e9e93

moto authored Oct 05, 2021

28 Sep, 2021 1 commit

Add HuBERT model architectures (#1769) · a7854f33

moto authored Sep 28, 2021

This commit adds the following HuBERT model architectures

 - `base` (pre-training)
 - `large` (pre-training / fine-tuning)
 - `xlarge` (pre-training / fine-tuning)

Since the internal components are the same as in `Wav2Vec2Model`, it reuses the existing modules.
With these models, it is possible to
- import the pre-trained models published by `fairseq` and TorchScript them.
- fine-tune an existing model for a downstream task.


20 Sep, 2021 1 commit

Move MVDR and PSD modules to transforms (#1771) · ac97ad82

nateanl authored Sep 20, 2021

12 Aug, 2021 1 commit

Add prototype.tacotron2 page to docs (#1695) · 9c641849

yangarbiter authored Aug 12, 2021

20 Jul, 2021 1 commit

Add Tacotron2 model (#1621) · 394d617e

yangarbiter authored Jul 20, 2021

Porting Tacotron2 from https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/tacotron2/model.py


03 Jun, 2021 1 commit

Update docs (#1550) · 0166a851

moto authored Jun 03, 2021

* Use `bibtex` for paper citations.
  * add `override.css` for fixing back reference.
  * wav2vec2
  * wav2letter
  * convtasnet
  * deepspeech
  * rnnt-loss
  * griffinlim
* Fix broken references in `filtering`.
* Fix note in soundfile backends.
* Tweak wav2vec2 example.
* Remove unused `pytorch_theme.css`.
