Commit 627c37a9 authored by Xiaohui Zhang, committed by Facebook GitHub Bot

Split the CTC forced aligment API tutorial into two tutorials (#3443)

Summary:
Splitting the multilingual example part into another tutorial.

Pull Request resolved: https://github.com/pytorch/audio/pull/3443

Reviewed By: mthrok

Differential Revision: D46802844

Pulled By: xiaohui-zhang

fbshipit-source-id: a7093053cac8b79d650d4f665db7fde2d8254998
parent 77cdd160
...@@ -51,6 +51,7 @@ model implementations and application components.
   tutorials/audio_data_augmentation_tutorial
   tutorials/audio_feature_extractions_tutorial
   tutorials/audio_feature_augmentation_tutorial
   tutorials/ctc_forced_alignment_api_tutorial
   tutorials/oscillator_tutorial
   tutorials/additive_synthesis_tutorial
...@@ -68,7 +69,7 @@ model implementations and application components.
   tutorials/asr_inference_with_ctc_decoder_tutorial
   tutorials/online_asr_tutorial
   tutorials/device_asr
   tutorials/forced_alignment_for_multilingual_data_tutorial
   tutorials/forced_alignment_tutorial
   tutorials/tacotron2_pipeline_tutorial
   tutorials/mvdr_tutorial
...@@ -157,6 +158,13 @@ Tutorials
   :link: tutorials/ctc_forced_alignment_api_tutorial.html
   :tags: CTC,Forced-Alignment

.. customcarditem::
   :header: Forced alignment for multilingual data
   :card_description: Learn how to align multilingual data using TorchAudio's CTC forced alignment API (<code>torchaudio.functional.forced_align</code>) and a multilingual Wav2Vec2 model.
   :image: https://download.pytorch.org/torchaudio/tutorial-assets/thumbnails/forced_alignment_for_multilingual_data_tutorial.png
   :link: tutorials/forced_alignment_for_multilingual_data_tutorial.html
   :tags: Forced-Alignment

.. customcarditem::
   :header: Streaming media decoding with StreamReader
   :card_description: Learn how to load audio/video to Tensors using <code>torchaudio.io.StreamReader</code> class.
...@@ -9,8 +9,7 @@ This tutorial shows how to align transcripts to speech with
``torchaudio``'s CTC forced alignment API proposed in the paper
`“Scaling Speech Technology to 1,000+
Languages” <https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/>`__,
and one advanced usage: dealing with transcription errors with a <star> token.

Though there’s some overlap in visualization
diagrams, the scope here is different from the `“Forced Alignment with
...@@ -44,8 +43,8 @@ except ModuleNotFoundError:
    raise

######################################################################
# Basic usages
# ------------
#
# In this section, we cover the following content:
#
...@@ -252,6 +251,7 @@ frames, frame_alignment, frame_scores = compute_alignments(transcript, dictionar
# frame-level confidence scores.
#
# Merge the labels


@dataclass
class Segment:
...@@ -423,6 +423,7 @@ plt.show()

######################################################################
# A trick to embed the resulting audio in the generated file.
# `IPython.display.Audio` has to be the last call in a cell,
# and there should be only one call per cell.
...@@ -487,179 +488,46 @@ display_segment(8, waveform, word_segments, frame_alignment)

######################################################################
# Advanced usage: Dealing with missing transcripts using the <star> token
# ---------------------------------------------------------------------------
#
# Now let’s look at how we can improve alignment quality when the
# transcript is partially missing, using the <star> token, which is
# capable of modeling any token.
#
# Here we use the same English example as above, but we remove the
# beginning text “i had that curiosity beside me at” from the transcript.
# Aligning audio with such a transcript results in wrong alignments of the
# existing word “this”. However, this issue can be mitigated by using the
# <star> token to model the missing text.
#
# Reload the emission tensor in order to add the extra dimension
# corresponding to the <star> token.
with torch.inference_mode():
    waveform, _ = torchaudio.load(SPEECH_FILE)
    emissions, _ = model(waveform.to(device))
    emissions = torch.log_softmax(emissions, dim=-1)

# Append the extra dimension corresponding to the <star> token
extra_dim = torch.zeros(emissions.shape[0], emissions.shape[1], 1)
emissions = torch.cat((emissions.cpu(), extra_dim), 2)
emission = emissions[0].detach()
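As an aside, here is a plain-Python illustration (not part of the tutorial's code) of why appending a column of zeros works: ``log_softmax`` produces log-probabilities that are at most 0, so a constant 0.0 log-probability (i.e. probability 1) for the <star> token means it scores at least as high as any real token on every frame, letting it absorb stretches of audio with no matching transcript.

```python
import math

# Toy per-frame log-probabilities for a 3-token vocabulary (values <= 0,
# as log_softmax would produce). Each row is one frame.
emission_rows = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
]

# Append a 0.0 log-prob (probability 1.0) column for the <star> token.
starred = [row + [0.0] for row in emission_rows]

# On every frame, <star> scores at least as high as any real token,
# so the aligner can route missing words through it.
for row in starred:
    assert row[-1] == max(row)
```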
# Extend the dictionary to include the <star> token.
dictionary["*"] = 29
assert len(dictionary) == emission.shape[1]


def compute_and_plot_alignments(transcript, dictionary, emission, waveform):
    frames, frame_alignment, _ = compute_alignments(transcript, dictionary, emission)
    segments = merge_repeats(frames, transcript)
    word_segments = merge_words(transcript, segments, "|")
    plot_alignments(segments, word_segments, waveform[0], emission.shape[0], 1)
    plt.show()
    return word_segments, frame_alignment
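The ``merge_repeats`` helper used above is defined earlier in the tutorial, outside this excerpt. For readers skimming this section, a simplified stdlib-only stand-in (it ignores scores and works on a plain list of per-frame labels, so the names and shapes here are assumptions, not the tutorial's exact API) looks like this:

```python
from itertools import groupby


def merge_repeat_labels(frame_labels):
    """Collapse runs of identical frame labels into (label, start, end) spans.

    A simplified sketch of the merge-repeats step in CTC alignment:
    consecutive frames carrying the same label become one segment.
    """
    spans = []
    frame = 0
    for label, group in groupby(frame_labels):
        length = sum(1 for _ in group)
        spans.append((label, frame, frame + length))  # end is exclusive
        frame += length
    return spans


# Frames:                    0    1    2    3    4    5
print(merge_repeat_labels(["h", "h", "i", "-", "-", "i"]))
# [('h', 0, 2), ('i', 2, 3), ('-', 3, 5), ('i', 5, 6)]
```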
transcript = "i had that curiosity beside me at this moment"
# original:
word_segments, frame_alignment = compute_and_plot_alignments(transcript, dictionary, emission, waveform)
...@@ -667,19 +535,13 @@ word_segments, frame_alignment = compute_and_plot_alignments(transcript, diction

######################################################################
# Demonstrate the effect of the <star> token for dealing with deletion errors
# ("i had that curiosity beside me at" missing from the transcript):

transcript = "THIS|MOMENT"
word_segments, frame_alignment = compute_and_plot_alignments(transcript, dictionary, emission, waveform)

######################################################################
# Replacing the missing transcript with the <star> token:

transcript = "*|THIS|MOMENT"
word_segments, frame_alignment = compute_and_plot_alignments(transcript, dictionary, emission, waveform)
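To make the mechanism concrete without needing torch or torchaudio, here is a stdlib-only toy sketch of what a forced aligner computes and why the star token helps. It is not ``torchaudio.functional.forced_align`` (no blank token, no batching, and the monotonic "stay or advance" transition model is a simplification); all names and numbers below are made up for illustration.

```python
def viterbi_forced_align(log_probs, targets):
    """Monotonic forced alignment of target token ids to frames.

    Each frame is assigned one position in `targets`; positions only
    move forward by 0 or 1 per frame. Returns frame -> target position.
    """
    T, N = len(log_probs), len(targets)
    NEG = float("-inf")
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_probs[0][targets[0]]
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG
            if max(stay, move) == NEG:
                continue  # position j unreachable at frame t
            back[t][j] = j if stay >= move else j - 1
            score[t][j] = max(stay, move) + log_probs[t][targets[j]]
    # Backtrack from the last target position at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Vocabulary: 0 = "a", 1 = "b", 2 = "*" (star scores log-prob 0 everywhere).
toy_emission = [
    [-0.1, -3.0, 0.0],  # frames 0-1 sound like "a"
    [-0.1, -3.0, 0.0],
    [-3.0, -0.1, 0.0],  # frames 2-3 sound like "b"
    [-3.0, -0.1, 0.0],
]

# Transcript missing the leading "a": "b" is smeared over all four frames.
print(viterbi_forced_align(toy_emission, [1]))     # [0, 0, 0, 0]

# Prepending the star token lets it absorb the unmatched audio, so "b"
# lands on a frame that actually sounds like "b".
print(viterbi_forced_align(toy_emission, [2, 1]))  # [0, 0, 0, 1]
```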
...@@ -688,9 +550,8 @@ word_segments, frame_alignment = compute_and_plot_alignments(transcript, diction

######################################################################
# Conclusion
# ----------
#
# In this tutorial, we looked at how to use torchaudio’s forced alignment
# API to align and segment speech files, and demonstrated one advanced usage:
# how introducing a <star> token could improve alignment accuracy when
# transcription errors exist.
#