Update WER results for CTC n-gram decoding (#3070)

Summary: In https://github.com/pytorch/audio/issues/2873, layer normalization is applied to waveforms for SSL models trained on large scale datasets. The word error rate is significantly reduced after the change. The PR updates the results for the affected models.

Without the change in https://github.com/pytorch/audio/issues/2873, here is the WER result table:

| Model | dev-clean | dev-other | test-clean | test-other |
|:------|----------:|----------:|-----------:|-----------:|
| [WAV2VEC2_ASR_LARGE_LV60K_10M](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M) | 10.59 | 15.62 | 9.58 | 16.33 |
| [WAV2VEC2_ASR_LARGE_LV60K_100H](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H) | 2.80 | 6.01 | 2.82 | 6.34 |
| [WAV2VEC2_ASR_LARGE_LV60K_960H](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H) | 2.36 | 4.43 | 2.41 | 4.96 |
| [HUBERT_ASR_LARGE](https://pytorch.org/audio/main/generated/torchaudio.pipelines.HUBERT_ASR_LARGE.html#torchaudio.pipelines.HUBERT_ASR_LARGE) | 1.85 | 3.46 | 2.09 | 3.89 |
| [HUBERT_ASR_XLARGE](https://pytorch.org/audio/main/generated/torchaudio.pipelines.HUBERT_ASR_XLARGE.html#torchaudio.pipelines.HUBERT_ASR_XLARGE) | 2.21 | 3.40 | 2.26 | 4.05 |

After applying layer normalization, here is the updated result:

| Model | dev-clean | dev-other | test-clean | test-other |
|:------|----------:|----------:|-----------:|-----------:|
| [WAV2VEC2_ASR_LARGE_LV60K_10M](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M) | 6.77 | 10.03 | 6.87 | 10.51 |
| [WAV2VEC2_ASR_LARGE_LV60K_100H](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H) | 2.19 | 4.55 | 2.32 | 4.64 |
| [WAV2VEC2_ASR_LARGE_LV60K_960H](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H) | 1.78 | 3.51 | 2.03 | 3.68 |
| [HUBERT_ASR_LARGE](https://pytorch.org/audio/main/generated/torchaudio.pipelines.HUBERT_ASR_LARGE.html#torchaudio.pipelines.HUBERT_ASR_LARGE) | 1.77 | 3.32 | 2.03 | 3.68 |
| [HUBERT_ASR_XLARGE](https://pytorch.org/audio/main/generated/torchaudio.pipelines.HUBERT_ASR_XLARGE.html#torchaudio.pipelines.HUBERT_ASR_XLARGE) | 1.73 | 2.72 | 1.90 | 3.16 |

Pull Request resolved: https://github.com/pytorch/audio/pull/3070

Reviewed By: mthrok

Differential Revision: D43365313

Pulled By: nateanl

fbshipit-source-id: 34a60ad2e5eb1299da64ef88ff0208ec8ec76e91
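
For readers unfamiliar with the normalization mentioned in the summary, here is a minimal sketch of what layer normalization of a raw waveform computes. It assumes a single utterance normalized over all of its samples, with no learned scale or bias. The function name `layer_norm_waveform` and the sample values are hypothetical and not part of torchaudio. In PyTorch, the analogous built-in is `torch.nn.functional.layer_norm(waveform, waveform.shape)`.

```python
import math


def layer_norm_waveform(samples, eps=1e-5):
    """Rescale one waveform to roughly zero mean and unit variance.

    Hypothetical helper for illustration only. It uses the biased
    (population) variance and a small eps for numerical stability,
    matching the usual layer normalization convention.
    """
    n = len(samples)
    if n == 0:
        return []
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    scale = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * scale for x in samples]


# A made-up, low-amplitude waveform (not real audio data).
waveform = [0.02, -0.01, 0.03, 0.00, -0.04]
normalized = layer_norm_waveform(waveform)
```

Because of `eps`, the variance of the output is slightly below 1 for very quiet inputs. The point of the sketch is only that the model sees inputs on a consistent scale regardless of recording level.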

Commit 11bdafc3 authored Feb 16, 2023 by Zhaoheng Ni Committed by Facebook GitHub Bot Feb 16, 2023

Showing with 11 additions and 6 deletions

examples/asr/librispeech_ctc_decoder/README.md +11 -6

--- a/examples/asr/librispeech_ctc_decoder/README.md
+++ b/examples/asr/librispeech_ctc_decoder/README.md
@@ -20,9 +20,14 @@ python inference.py \
 ## Results
 The table below contains WER results for various pretrained models on LibriSpeech, using a beam size of 1500, and language model weight and word insertion scores taken from Table 7 of [wav2vec 2.0](https://arxiv.org/pdf/2006.11477.pdf).
-|                                                                                            Model | test-clean | test-other |
+|                                                                                            Model | LM weight | word insertion | dev-clean | dev-other | test-clean | test-other |
-|:------------------------------------------------------------------------------------------------:|-----------:|-----------:|
+|:------------------------------------------------------------------------------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|
-| [WAV2VEC2_ASR_BASE_10M](https://pytorch.org/audio/main/pipelines.html#wav2vec2-asr-base-10m)     |        9.35|       15.91|
+| [WAV2VEC2_ASR_BASE_10M](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M)     |        3.23|        -0.26|        9.41|        15.95|        9.35|       15.91|
-| [WAV2VEC2_ASR_BASE_100H](https://pytorch.org/audio/main/pipelines.html#wav2vec2-asr-base-100h)   |        3.42|        8.07|
+| [WAV2VEC2_ASR_BASE_100H](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_100H.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_100H)   |        2.15|        -0.52|        3.08|        7.89|        3.42|        8.07|
-| [WAV2VEC2_ASR_BASE_960H](https://pytorch.org/audio/main/pipelines.html#wav2vec2-asr-base-960h)   |        2.61|        6.15|
+| [WAV2VEC2_ASR_BASE_960H](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H)   |        1.74|        0.52|        2.56|        6.26|        2.61|        6.15|
-| [WAV2VEC2_ASR_LARGE_960H](https://pytorch.org/audio/main/pipelines.html#wav2vec2-asr-large-960h) |        2.34|        4.98|
+| [WAV2VEC2_ASR_LARGE_960H](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_960H) |        1.74|        0.52|        2.14|        4.62|        2.34|        4.98|
+| [WAV2VEC2_ASR_LARGE_LV60K_10M](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M) |        3.86|        -1.18|        6.77|        10.03|        6.87|        10.51|
+| [WAV2VEC2_ASR_LARGE_LV60K_100H](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H) |        2.15|        -0.52|        2.19|        4.55|        2.32|        4.64|
+| [WAV2VEC2_ASR_LARGE_LV60K_960H](https://pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H.html#torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H) |        1.57|        -0.64|        1.78|        3.51|        2.03|        3.68|
+| [HUBERT_ASR_LARGE](https://pytorch.org/audio/main/generated/torchaudio.pipelines.HUBERT_ASR_LARGE.html#torchaudio.pipelines.HUBERT_ASR_LARGE) |        1.57|        -0.64|        1.77|        3.32|        2.03|        3.68|
+| [HUBERT_ASR_XLARGE](https://pytorch.org/audio/main/generated/torchaudio.pipelines.HUBERT_ASR_XLARGE.html#torchaudio.pipelines.HUBERT_ASR_XLARGE) |        1.57|        -0.64|        1.73|        2.72|        1.90|        3.16|
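
The WER columns in the table above are percentages. As a rough illustration of how such a figure is usually obtained, the sketch below computes corpus-level WER as the total word-level edit distance divided by the total number of reference words. The helper names and the sample sentences are hypothetical, and the scoring used for the reported numbers may differ in details such as text normalization.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (sub/ins/del cost 1)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Return WER in percent over (reference, hypothesis) string pairs."""
    errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        errors += word_edit_distance(ref_words, hypothesis.split())
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return 100.0 * errors / total_ref_words


# Made-up transcripts: one substitution and one deletion over 8 reference words.
pairs = [
    ("the cat sat on the mat", "the cat sit on mat"),
    ("hello world", "hello world"),
]
wer_percent = corpus_wer(pairs)
```

Summing errors and reference words across utterances before dividing weights long utterances more heavily than averaging per-utterance WER, which is the common convention for LibriSpeech-style reporting.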