- 20 Dec, 2022 1 commit
-
-
moto authored
Summary: If the input video has invalid PTS, the current precise seek fails, except when seeking to t=0. This commit updates the discard mechanism to fall back to `best_effort_timestamp` in such cases. `best_effort_timestamp` is simply the number of frames that have gone through the decoder, counted from the beginning of the file. This means that if the input file is very long and the seek target is near the end of the file, the StreamReader still decodes all the frames. For videos with valid PTS, `best_effort_timestamp` should be the same as `pts`. [[src](https://ffmpeg.org/doxygen/4.1/decode_8c.html#a8d86329cf58a4adbd24ac840d47730cf)] Pull Request resolved: https://github.com/pytorch/audio/pull/2916 Reviewed By: YosuaMichael Differential Revision: D42170204 Pulled By: mthrok fbshipit-source-id: 80c04dc376e0f427d41eb9feb44c251a1648a998
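Below is a minimal Python sketch of the fallback described above. The attribute names mirror FFmpeg's `AVFrame` fields (`pts`, `best_effort_timestamp`); the actual discard logic lives in the C++ StreamReader implementation, so treat this as an illustration, not the shipped code.

```python
# Sketch of the PTS-or-best_effort_timestamp discard rule (assumed names).
AV_NOPTS_VALUE = -0x8000000000000000  # FFmpeg's sentinel for "no PTS"

def should_discard(frame, seek_pts: int) -> bool:
    """Return True if a decoded frame precedes the precise-seek target."""
    ts = frame.pts
    if ts == AV_NOPTS_VALUE:
        # Invalid PTS: fall back to best_effort_timestamp, which effectively
        # counts frames decoded from the beginning of the file.
        ts = frame.best_effort_timestamp
    return ts < seek_pts
```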
-
- 09 Nov, 2022 1 commit
-
-
Grigory Sizov authored
Summary: Closes T136364380

Added the [WavLM Model](https://github.com/microsoft/UniSpeech/tree/main/WavLM):
- Added a `WavLMSelfAttention` class (from the [original implementation](https://github.com/microsoft/UniSpeech/blob/2e9dde8bf815a5f5fd958e3435e5641f59f96928/WavLM/modules.py)) and adjusted the existing Encoder and Transformer classes to be compatible with it
- Added factory functions `wavlm_model`, `wavlm_base`, and `wavlm_large` to `models/wav2vec2/model.py`
- Added bundles for the base and large models to the pipelines. **TODO**: pre-trained model weights are not yet uploaded to `download.pytorch.org`; permissions have not been granted yet.

## Tests
- Expanded the HuggingFace integration tests to cover WavLM. For these tests, added JSON configs for the base and large models from HF ([base](https://huggingface.co/microsoft/wavlm-base/blob/main/config.json), [large](https://huggingface.co/microsoft/wavlm-large/blob/main/config.json)) to the test assets
- Expanded the TorchScript and quantization tests to cover WavLM

## Comments
There are a few workarounds I had to introduce:
- Quantization tests for WavLM were breaking at [`torch.cat`](https://github.com/pytorch/audio/pull/2822/files#diff-6f1486901c94320ec0610a460dc674638fab9d104a61564ff7b59353a8b8547cR466) ~~until I excluded the arguments of `torch.cat` from quantization [here](https://github.com/pytorch/audio/pull/2822/files#diff-6f1486901c94320ec0610a460dc674638fab9d104a61564ff7b59353a8b8547cR368-R369). I haven't found a better way to fix it, let me know if there is one~~ The reason seems to be that quantization replaces the `.bias` and `.weight` attributes of a `Linear` module with methods. Since we use the weights and biases directly, the code was breaking. The final solution, suggested by nateanl, was to define the attention weights and biases directly in `WavLMSelfAttention`, skipping the `Linear` layers
- ~~WavLM uses a position embedding in the first layer of the encoder, but not in the subsequent ones. So the [UniSpeech](https://github.com/microsoft/UniSpeech/blob/2e9dde8bf815a5f5fd958e3435e5641f59f96928/WavLM/modules.py#L342) and [HF](https://github.com/huggingface/transformers/blob/b047472650cba259621549ac27b18fd2066ce18e/src/transformers/models/wavlm/modeling_wavlm.py#L441-L442) implementations only create this embedding module in the layers where it is used. However, we can't do this here because it breaks TorchScript. As a solution, I add a dummy `Identity` module to `WavLMSelfAttention` when the actual embedding is not needed: [here](https://github.com/pytorch/audio/pull/2822/files#diff-6f1486901c94320ec0610a460dc674638fab9d104a61564ff7b59353a8b8547cR361-R368).~~ Thanks nateanl for resolving this!
- I had to add dummy `position_bias` and `key_padding_mask` arguments to `SelfAttention.forward` to make the TorchScript tests pass. Since both `SelfAttention` and `WavLMSelfAttention` are called from `EncoderLayer`, they need to have compatible signatures. Having a variable number of arguments with `**kwargs` or checking the object class doesn't seem to work with TorchScript, so I instead made both types of attention accept `position_bias` and `key_padding_mask` arguments.

Nit: do we still need to specify `__all__` if there are no wildcard imports in `__init__.py`, e.g. in `torchaudio/models/__init__.py`?

Pull Request resolved: https://github.com/pytorch/audio/pull/2822 Reviewed By: nateanl Differential Revision: D41121855 Pulled By: sgrigory fbshipit-source-id: 9f4f787e5810010de4e74cb704063a26c66767d7
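A rough usage sketch of the new factory functions, assuming they follow the same interface as the existing wav2vec2 factories (the pre-trained bundles only become usable once the weights mentioned in the TODO are uploaded):

```python
import torch
import torchaudio

# wavlm_base() builds a randomly initialized model; also wavlm_large(), wavlm_model().
model = torchaudio.models.wavlm_base()
model.eval()

waveform = torch.randn(1, 16000)  # 1 second of 16 kHz audio
with torch.no_grad():
    features, _ = model.extract_features(waveform)
print(len(features), features[0].shape)  # one feature tensor per transformer layer
```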
-
- 31 Oct, 2022 1 commit
-
-
Joao Gomes authored
Summary: cc mthrok. Implements precise seek and seek to any frame in torchaudio. Pull Request resolved: https://github.com/pytorch/audio/pull/2737 Reviewed By: mthrok Differential Revision: D40546716 Pulled By: jdsgomes fbshipit-source-id: d37da7f55977337eb16a3c4df44ce8c3c102698e
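A hypothetical usage sketch of what the feature enables. The `mode` keyword and its values are taken from the released `torchaudio.io.StreamReader` documentation rather than from this PR's diff, so treat the exact signature as an assumption:

```python
from torchaudio.io import StreamReader

reader = StreamReader("input.mp4")  # hypothetical input file
reader.add_basic_video_stream(frames_per_chunk=1)

# "precise" decodes from the preceding keyframe and discards frames before the
# target timestamp; "any" jumps to the nearest frame without that guarantee.
reader.seek(10.0, mode="precise")
reader.fill_buffer()
(chunk,) = reader.pop_chunks()
```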
-
- 09 Aug, 2022 1 commit
-
-
Caroline Chen authored
Summary: Expose flashlight's LM and LMState classes to support decoding with custom language models, including NN LMs. The `ctc_decoder` API is as follows:
- To decode with KenLM, pass the KenLM language model path to the `lm` argument
- To decode with a custom LM, create a Python class that subclasses `CTCDecoderLM`, and pass it to the `lm` argument. Additionally, create a file of the LM words listed in order of the LM index, one word per line, and pass the file to `lm_path`.
- To decode without a language model, set `lm` to `None` (the default)

Validated against the fairseq w2l decoder on a sample LibriSpeech dataset and LM. Code for the validation can be found [here](https://github.com/facebookresearch/fairseq/compare/main...carolineechen:fairseq:ctc-decoder). Also added unit tests to validate custom implementations of ZeroLM and KenLM, as well as decoding with a biased LM.

Follow ups:
- Train a simple LM on LibriSpeech and demonstrate its usage in a tutorial or the examples directory

cc jacobkahn Pull Request resolved: https://github.com/pytorch/audio/pull/2528 Reviewed By: mthrok Differential Revision: D38243802 Pulled By: carolineechen fbshipit-source-id: 445e78f6c20bda655aabf819fc0f771fe68c73d7
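A minimal sketch of the custom-LM path, assuming the `CTCDecoderLM` / `CTCDecoderLMState` interface shown in the torchaudio decoder documentation; the file paths are hypothetical placeholders:

```python
from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState, ctc_decoder


class ZeroLM(CTCDecoderLM):
    """Toy custom LM that assigns a score of 0.0 to every token (no-LM behavior)."""

    def __init__(self):
        CTCDecoderLM.__init__(self)  # initialize the underlying flashlight base class

    def start(self, start_with_nothing: bool = False) -> CTCDecoderLMState:
        return CTCDecoderLMState()

    def score(self, state: CTCDecoderLMState, usr_token_idx: int):
        # Return the successor state for this token together with its LM score.
        return state.child(usr_token_idx), 0.0

    def finish(self, state: CTCDecoderLMState):
        return state, 0.0


# Hypothetical asset paths; the custom LM instance is passed via `lm`.
decoder = ctc_decoder(lexicon="lexicon.txt", tokens="tokens.txt", lm=ZeroLM())
```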
-
- 26 Apr, 2022 1 commit
-
-
Caroline Chen authored
Summary: Add support for lexicon-free decoding based on [fairseq's](https://github.com/pytorch/fairseq/blob/main/examples/speech_recognition/new/decoders/flashlight_decoder.py#L53) implementation. Reached numerical parity with fairseq's decoder in offline experimentation.

Follow ups:
- Add pretrained LM support for lexicon-free decoding
- Add an example in the tutorial
- Replace the flashlight C++ source code with the flashlight text submodule
- [optional] fairseq compatibility test

Pull Request resolved: https://github.com/pytorch/audio/pull/2342 Reviewed By: nateanl Differential Revision: D35856104 Pulled By: carolineechen fbshipit-source-id: b64286550984df906ebb747e82f6fb1f21948ac7
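A brief sketch of how lexicon-free decoding is selected, assuming the released `ctc_decoder` factory where passing `lexicon=None` chooses the lexicon-free code path; `tokens.txt` is a hypothetical token list with one token per line:

```python
from torchaudio.models.decoder import ctc_decoder

lexicon_free_decoder = ctc_decoder(
    lexicon=None,         # no lexicon: hypotheses are formed directly over tokens
    tokens="tokens.txt",  # hypothetical token list
    lm=None,              # optionally a KenLM path or a CTCDecoderLM subclass
    beam_size=50,
)
# hypotheses = lexicon_free_decoder(emissions)  # emissions: (batch, frames, num_tokens) CPU tensor
```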
-
- 21 Jan, 2022 1 commit
-
-
moto authored
Summary: Split from https://github.com/pytorch/audio/issues/2164

Add new test assets. This commit is added separately so that this commit message, which records the origin of the file, is easier to find.

The original video is in the public domain, per:
- https://svs.gsfc.nasa.gov/13013
- https://www.nasa.gov/multimedia/guidelines/index.html (the YouTube page directly says so)
- https://www.youtube.com/watch?v=6zNsc0e3Zns

The video was modified to fit the testing needs:
1. Multiple audio/video streams
2. Non-audio/video (subtitle) streams
3. Different FPS and sampling rates
4. Versions without audio and without video

```
#!/usr/bin/env bash
original=https://svs.gsfc.nasa.gov/vis/a010000/a013000/a013013/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4.mp4
subtitle=https://svs.gsfc.nasa.gov/vis/a010000/a013000/a013013/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-SRT-CC.en_US.srt

# Fetch the original video, embed the subtitle
ffmpeg -i "${original}" -i "${subtitle}" -c:v copy -c:a copy -c:s mov_text -metadata:s:2 language=eng original.mp4 -y

# Extract, rescale video and resample audio
ffmpeg -i original.mp4 -ss 29 -to 42 -c:s copy -vf scale=480:270 -af aresample=16000 tmp1.mp4 -y
ffmpeg -i original.mp4 -ss 29 -to 42 -c:s copy -vf scale=320:180 -r 25 -af aresample=8000 tmp2.mp4 -y

# Merge them, retaining all the streams (6 in total)
ffmpeg -i tmp2.mp4 -i tmp1.mp4 -map 0 -map 1 -c:s copy nasa_13013.mp4 -y

# Make versions without video / audio
ffmpeg -i tmp2.mp4 -c copy -vn nasa_13013_no_video.mp4 -y
ffmpeg -i tmp2.mp4 -c copy -an nasa_13013_no_audio.mp4 -y
```

Pull Request resolved: https://github.com/pytorch/audio/pull/2167 Reviewed By: carolineechen Differential Revision: D33712954 Pulled By: mthrok fbshipit-source-id: b7cfc1358043a4abd1c0b416e8a8fb0039867211
-
- 23 Dec, 2021 2 commits
-
-
Caroline Chen authored
Summary: Part of https://github.com/pytorch/audio/issues/2072 -- splitting up the PR for easier review. This PR adds the Python decoder API and a basic README. Pull Request resolved: https://github.com/pytorch/audio/pull/2089 Reviewed By: mthrok Differential Revision: D33299818 Pulled By: carolineechen fbshipit-source-id: 778ec3692331e95258d3734f0d4ab60b6618ddbc
-
Joao Gomes authored
Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/2096 run: `arc lint --apply-patches --paths-cmd 'hg files -I "./**/*.py"'` Reviewed By: mthrok Differential Revision: D33297351 fbshipit-source-id: 7bf5956edf0717c5ca90219f72414ff4eeaf5aa8
-
- 18 Nov, 2021 1 commit
-
-
Facebook Community Bot authored
Co-authored-by: Facebook Community Bot <6422482+facebook-github-bot@users.noreply.github.com>
-
- 17 Nov, 2021 1 commit
-
-
Zhaoheng Ni authored
Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/2015 as titled Reviewed By: hwangjeff, mthrok Differential Revision: D32495691 fbshipit-source-id: 60d8a2337585e3147f24ca9f0b6518e30cd9134a
-
- 22 Oct, 2021 1 commit
-
-
moto authored
- Make the test support other languages
- Fetch test assets on-the-fly
-
- 05 Oct, 2021 1 commit
-
-
moto authored
-
- 28 Sep, 2021 1 commit
-
-
moto authored
This commit adds the following HuBERT model architectures:
- `base` (pre-training)
- `large` (pre-training / fine-tuning)
- `xlarge` (pre-training / fine-tuning)

Since the internal components are the same as in `Wav2Vec2Model`, it reuses the existing modules. With these models, it is possible to:
- import a pre-trained model published by `fairseq` and TorchScript it
- fine-tune the existing model for a downstream task
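A hypothetical usage sketch, assuming the HuBERT factory functions expose the same interface as `Wav2Vec2Model` (as the reuse of its modules suggests):

```python
import torch
import torchaudio

# Randomly initialized fine-tuning-style model; hubert_large() and
# hubert_xlarge() follow the same pattern.
model = torchaudio.models.hubert_base()
model.eval()

waveform = torch.randn(1, 16000)
with torch.no_grad():
    features, _ = model(waveform)  # encoder output: (batch, frames, feature_dim)

# The architecture is TorchScript-able, e.g. after importing fairseq weights.
scripted = torch.jit.script(model)
```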
-
- 22 Sep, 2021 1 commit
-
-
moto authored
Summary: Update fairseq references from master to main elsewhere Reviewed By: alexeib Differential Revision: D30938472 fbshipit-source-id: 243b98550207f241c9d3265bf3d4060350aaf0a8 Co-authored-by: Diana Liskovich <dianaml@fb.com>
-
- 04 Jun, 2021 1 commit
-
-
moto authored
`torchaudio.compliance.kaldi.resample_waveform` has been replaced with `torchaudio.functional.resample`.
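A short migration sketch; the keyword names follow the current `torchaudio.functional.resample` signature, and the commented-out "before" call is only an approximation of the old API:

```python
import torch
import torchaudio.functional as F

waveform = torch.randn(1, 44100)  # 1 second at 44.1 kHz

# Before (roughly): torchaudio.compliance.kaldi.resample_waveform(waveform, 44100.0, 16000.0)
resampled = F.resample(waveform, orig_freq=44100, new_freq=16000)
```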
-
- 01 Jun, 2021 1 commit
-
-
moto authored
-
- 27 May, 2021 1 commit
-
-
moto authored
-
- 15 Mar, 2021 1 commit
-
-
moto authored
-
- 09 Feb, 2021 1 commit
-
-
moto authored
-
- 30 Dec, 2020 1 commit
-
-
Aziz authored
-
- 22 Dec, 2020 1 commit
-
-
moto authored
-
- 05 Aug, 2020 1 commit
-
-
moto authored
We have been running the unit tests with an editable installation (i.e. `python setup.py develop`), with which we missed issues like #842. This commit makes the installation in CI non-editable and changes the test directory structure so that the source code does not shadow the installed version of `torchaudio`. With a simple `pytest test`, `pytest` modifies `sys.path` and prepends the checked-out repository, which shadows the installed version. To remedy this, the whole test suite has been moved from `./test` to `./test/torchaudio_unittest`. This adds a nice module structure to our test code, and we can use absolute imports in each test module, which makes it possible again to run tests with `python -m unittest torchaudio_unittest/XXX.py`. This change does not affect the regular development process (`python setup.py develop` && `pytest test`).
-