Commit 33485b8c authored by Caroline Chen, committed by Facebook GitHub Bot

Add note for lexicon free decoder output (#2603)

Summary:
The ``words`` field of ``CTCHypothesis`` is empty if no lexicon is provided, which produces confusing output (see issue https://github.com/pytorch/audio/issues/2584) when following our tutorial example with lexicon-free usage. This PR adds a note in both the docs and the tutorial.
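For context, a minimal sketch of the behavior being documented, assuming the ``ctc_decoder`` factory is called with ``lexicon=None`` for lexicon-free decoding; the toy token set and random emission below are placeholders, not code from this PR:

```python
import torch
from torchaudio.models.decoder import ctc_decoder

# Toy token set; "-" is the blank token and "|" the silence token by default.
tokens = ["-", "|", "e", "t", "a", "o", "n", "i"]

# Lexicon-free decoder: passing lexicon=None disables word-level constraints.
decoder = ctc_decoder(lexicon=None, tokens=tokens)

# Random emission stands in for real acoustic model output: (batch, frames, num_tokens).
emission = torch.randn(1, 50, len(tokens))

best_hyp = decoder(emission)[0][0]
print(best_hyp.words)   # [] -- empty because no lexicon was provided
print(best_hyp.tokens)  # token IDs are still populated
```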

Followup: determine whether we want to modify the behavior of ``words`` in the lexicon-free case. One option is to merge the generated tokens and then split them on the input silence token to populate the ``words`` field, but this is tricky: the meaning of a "word" in the lexicon-free case can be vague, and not all languages use whitespace between words.
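A rough sketch of that option (the helper name below is made up, and it assumes ``|`` is the silence/word-delimiter token as in the tutorial's token set):

```python
from typing import List


def tokens_to_words(tokens: List[str], sil_token: str = "|") -> List[str]:
    """Hypothetical helper: merge the decoded tokens, then split on the silence
    token to approximate words. Only meaningful when the token set marks word
    boundaries with a delimiter token."""
    joined = "".join(tokens)
    return [word for word in joined.split(sil_token) if word]


# Tokens as they might come back from lexicon-free decoding:
print(tokens_to_words(["i", "|", "r", "e", "a", "l", "l", "y", "|", "w", "a", "s"]))
# ['i', 'really', 'was']
```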

Pull Request resolved: https://github.com/pytorch/audio/pull/2603

Reviewed By: mthrok

Differential Revision: D38459709

Pulled By: carolineechen

fbshipit-source-id: d64ff186df4633f00e94c64afeaa6a50cebf2934
parent 50bba1df
@@ -280,8 +280,8 @@ greedy_decoder = GreedyCTCDecoder(tokens)
 # Now that we have the data, acoustic model, and decoder, we can perform
 # inference. The output of the beam search decoder is of type
 # :py:func:`torchaudio.models.decoder.CTCHypothesis`, consisting of the
-# predicted token IDs, corresponding words, hypothesis score, and timesteps
-# corresponding to the token IDs. Recall the transcript corresponding to the
+# predicted token IDs, corresponding words (if a lexicon is provided), hypothesis score,
+# and timesteps corresponding to the token IDs. Recall the transcript corresponding to the
 # waveform is
 # ::
 #   i really was very much afraid of showing him how much shocked i was at some parts of what he said
@@ -320,6 +320,18 @@ print(f"WER: {beam_search_wer}")
 ######################################################################
+# .. note::
+#
+#    The ``words`` field of the output hypotheses will be empty if no lexicon
+#    is provided to the decoder. To retrieve a transcript with lexicon-free
+#    decoding, you can perform the following to retrieve the token indices,
+#    convert them to original tokens, then join them together.
+#
+#    .. code::
+#
+#       tokens_str = "".join(beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens))
+#       transcript = " ".join(tokens_str.split("|"))
+#
 # We see that the transcript with the lexicon-constrained beam search
 # decoder produces a more accurate result consisting of real words, while
 # the greedy decoder can predict incorrectly spelled words like “affrayd”
...
@@ -57,6 +57,11 @@ _PretrainedFiles = namedtuple("PretrainedFiles", ["lexicon", "tokens", "lm"])
 class CTCHypothesis(NamedTuple):
     r"""Represents hypothesis generated by CTC beam search decoder :py:func:`CTCDecoder`.
 
+    Note:
+        The ``words`` field is only applicable if a lexicon is provided to the decoder. If
+        decoding without a lexicon, it will be blank. Please refer to ``tokens`` and
+        :py:func:`idxs_to_tokens <torchaudio.models.decoder.CTCDecoder.idxs_to_tokens>` instead.
+
     :ivar torch.LongTensor tokens: Predicted sequence of token IDs. Shape `(L, )`, where
         `L` is the length of the output sequence
     :ivar List[str] words: List of predicted words
...