Commit 627c37a9 authored by Xiaohui Zhang, committed by Facebook GitHub Bot

Split the CTC forced aligment API tutorial into two tutorials (#3443)

Summary:
Splitting the multilingual example part into another tutorial.

Pull Request resolved: https://github.com/pytorch/audio/pull/3443

Reviewed By: mthrok

Differential Revision: D46802844

Pulled By: xiaohui-zhang

fbshipit-source-id: a7093053cac8b79d650d4f665db7fde2d8254998
parent 77cdd160
...@@ -51,6 +51,7 @@ model implementations and application components.
   tutorials/audio_data_augmentation_tutorial
   tutorials/audio_feature_extractions_tutorial
   tutorials/audio_feature_augmentation_tutorial
   tutorials/ctc_forced_alignment_api_tutorial
   tutorials/oscillator_tutorial
   tutorials/additive_synthesis_tutorial
...@@ -68,7 +69,7 @@ model implementations and application components.
   tutorials/asr_inference_with_ctc_decoder_tutorial
   tutorials/online_asr_tutorial
   tutorials/device_asr
   tutorials/forced_alignment_for_multilingual_data_tutorial
   tutorials/forced_alignment_tutorial
   tutorials/tacotron2_pipeline_tutorial
   tutorials/mvdr_tutorial
...@@ -157,6 +158,13 @@ Tutorials
   :link: tutorials/ctc_forced_alignment_api_tutorial.html
   :tags: CTC,Forced-Alignment

.. customcarditem::
   :header: Forced alignment for multilingual data
   :card_description: Learn how to align multilingual data using TorchAudio's CTC forced alignment API (<code>torchaudio.functional.forced_align</code>) and a multilingual Wav2Vec2 model.
   :image: https://download.pytorch.org/torchaudio/tutorial-assets/thumbnails/forced_alignment_for_multilingual_data_tutorial.png
   :link: tutorials/forced_alignment_for_multilingual_data_tutorial.html
   :tags: Forced-Alignment

.. customcarditem::
   :header: Streaming media decoding with StreamReader
   :card_description: Learn how to load audio/video to Tensors using <code>torchaudio.io.StreamReader</code> class.
...@@ -9,8 +9,7 @@ This tutorial shows how to align transcripts to speech with
``torchaudio``'s CTC forced alignment API proposed in the paper
`“Scaling Speech Technology to 1,000+
Languages” <https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/>`__,
and one advanced usage: dealing with transcription errors with a <star> token.

Though there’s some overlap in visualization
diagrams, the scope here is different from the `“Forced Alignment with
...@@ -44,8 +43,8 @@ except ModuleNotFoundError:
    raise

######################################################################
# Basic usages
# ------------
#
# In this section, we cover the following content:
#
...@@ -252,6 +251,7 @@ frames, frame_alignment, frame_scores = compute_alignments(transcript, dictionar
# frame-level confidence scores.
#
# Merge the labels


@dataclass
class Segment:
...@@ -423,6 +423,7 @@ plt.show()

######################################################################
# A trick to embed the resulting audio in the generated file.
# `IPython.display.Audio` has to be the last call in a cell,
# and there should be only one call per cell.
...@@ -487,179 +488,46 @@ display_segment(8, waveform, word_segments, frame_alignment)

######################################################################
# Advanced usage: Dealing with missing transcripts using the <star> token
# ---------------------------------------------------------------------------
#
# Now let’s look at how we can improve alignment quality when the
# transcript is partially missing, using the <star> token, which is
# capable of modeling any token.
#
# Here we use the same English example as above, but we remove the
# beginning text “i had that curiosity beside me at” from the transcript.
# Aligning audio with such a transcript results in wrong alignments of the
# existing word “this”. However, this issue can be mitigated by using the
# <star> token to model the missing text.
#
# Reload the emission tensor in order to add the extra dimension
# corresponding to the <star> token.
with torch.inference_mode():
    waveform, _ = torchaudio.load(SPEECH_FILE)
    emissions, _ = model(waveform.to(device))
    emissions = torch.log_softmax(emissions, dim=-1)

# Append the extra dimension corresponding to the <star> token
extra_dim = torch.zeros(emissions.shape[0], emissions.shape[1], 1)
emissions = torch.cat((emissions.cpu(), extra_dim), 2)
emission = emissions[0].detach()
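As an aside, here is a plain-Python illustration (not part of the tutorial's code) of why appending a column of zeros works: ``log_softmax`` produces log-probabilities that are at most 0, so a constant 0.0 log-probability (i.e. probability 1) for the <star> token means it scores at least as high as any real token on every frame, letting it absorb stretches of audio with no matching transcript.

```python
import math

# Toy per-frame log-probabilities for a 3-token vocabulary (values <= 0,
# as log_softmax would produce). Each row is one frame.
emission_rows = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
]

# Append a 0.0 log-prob (probability 1.0) column for the <star> token.
starred = [row + [0.0] for row in emission_rows]

# On every frame, <star> scores at least as high as any real token,
# so the aligner can route missing words through it.
for row in starred:
    assert row[-1] == max(row)
```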
# Extend the dictionary to include the <star> token.
dictionary["*"] = 29
assert len(dictionary) == emission.shape[1]


def compute_and_plot_alignments(transcript, dictionary, emission, waveform):
    frames, frame_alignment, _ = compute_alignments(transcript, dictionary, emission)
    segments = merge_repeats(frames, transcript)
    word_segments = merge_words(transcript, segments, "|")
    plot_alignments(segments, word_segments, waveform[0], emission.shape[0], 1)
    plt.show()
    return word_segments, frame_alignment
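The ``merge_repeats`` helper used above is defined earlier in the tutorial, outside this excerpt. For readers skimming this section, a simplified stdlib-only stand-in (it ignores scores and works on a plain list of per-frame labels, so the names and shapes here are assumptions, not the tutorial's exact API) looks like this:

```python
from itertools import groupby


def merge_repeat_labels(frame_labels):
    """Collapse runs of identical frame labels into (label, start, end) spans.

    A simplified sketch of the merge-repeats step in CTC alignment:
    consecutive frames carrying the same label become one segment.
    """
    spans = []
    frame = 0
    for label, group in groupby(frame_labels):
        length = sum(1 for _ in group)
        spans.append((label, frame, frame + length))  # end is exclusive
        frame += length
    return spans


# Frames:                    0    1    2    3    4    5
print(merge_repeat_labels(["h", "h", "i", "-", "-", "i"]))
# [('h', 0, 2), ('i', 2, 3), ('-', 3, 5), ('i', 5, 6)]
```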
transcript = "i had that curiosity beside me at this moment"
# original:
word_segments, frame_alignment = compute_and_plot_alignments(transcript, dictionary, emission, waveform)
...@@ -667,19 +535,13 @@ word_segments, frame_alignment = compute_and_plot_alignments(transcript, diction

######################################################################
# Demonstrate the effect of the <star> token for dealing with deletion errors
# ("i had that curiosity beside me at" missing from the transcript):

transcript = "THIS|MOMENT"
word_segments, frame_alignment = compute_and_plot_alignments(transcript, dictionary, emission, waveform)

######################################################################
# Replacing the missing transcript with the <star> token:

transcript = "*|THIS|MOMENT"
word_segments, frame_alignment = compute_and_plot_alignments(transcript, dictionary, emission, waveform)
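To make the mechanism concrete without needing torch or torchaudio, here is a stdlib-only toy sketch of what a forced aligner computes and why the star token helps. It is not ``torchaudio.functional.forced_align`` (no blank token, no batching, and the monotonic "stay or advance" transition model is a simplification); all names and numbers below are made up for illustration.

```python
def viterbi_forced_align(log_probs, targets):
    """Monotonic forced alignment of target token ids to frames.

    Each frame is assigned one position in `targets`; positions only
    move forward by 0 or 1 per frame. Returns frame -> target position.
    """
    T, N = len(log_probs), len(targets)
    NEG = float("-inf")
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_probs[0][targets[0]]
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG
            if max(stay, move) == NEG:
                continue  # position j unreachable at frame t
            back[t][j] = j if stay >= move else j - 1
            score[t][j] = max(stay, move) + log_probs[t][targets[j]]
    # Backtrack from the last target position at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Vocabulary: 0 = "a", 1 = "b", 2 = "*" (star scores log-prob 0 everywhere).
toy_emission = [
    [-0.1, -3.0, 0.0],  # frames 0-1 sound like "a"
    [-0.1, -3.0, 0.0],
    [-3.0, -0.1, 0.0],  # frames 2-3 sound like "b"
    [-3.0, -0.1, 0.0],
]

# Transcript missing the leading "a": "b" is smeared over all four frames.
print(viterbi_forced_align(toy_emission, [1]))     # [0, 0, 0, 0]

# Prepending the star token lets it absorb the unmatched audio, so "b"
# lands on a frame that actually sounds like "b".
print(viterbi_forced_align(toy_emission, [2, 1]))  # [0, 0, 0, 1]
```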
...@@ -688,9 +550,8 @@ word_segments, frame_alignment = compute_and_plot_alignments(transcript, diction

######################################################################
# Conclusion
# ----------
#
# In this tutorial, we looked at how to use torchaudio’s forced alignment
# API to align and segment speech files, and demonstrated one advanced usage:
# how introducing a <star> token could improve alignment accuracy when
# transcription errors exist.
#