Commit 33485b8c authored by Caroline Chen, committed by Facebook GitHub Bot

Add note for lexicon free decoder output (#2603)

Summary:
The ``words`` field of ``CTCHypothesis`` is empty if no lexicon is provided, which produces confusing output (see issue https://github.com/pytorch/audio/issues/2584) when following our tutorial example with lexicon-free usage. This PR adds a note in both the docs and the tutorial.
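For context, a minimal sketch of the behavior being documented, assuming the ``ctc_decoder`` factory is called with ``lexicon=None`` for lexicon-free decoding; the toy token set and random emission below are placeholders, not code from this PR:

```python
import torch
from torchaudio.models.decoder import ctc_decoder

# Toy token set; "-" is the blank token and "|" the silence token by default.
tokens = ["-", "|", "e", "t", "a", "o", "n", "i"]

# Lexicon-free decoder: passing lexicon=None disables word-level constraints.
decoder = ctc_decoder(lexicon=None, tokens=tokens)

# Random emission stands in for real acoustic model output: (batch, frames, num_tokens).
emission = torch.randn(1, 50, len(tokens))

best_hyp = decoder(emission)[0][0]
print(best_hyp.words)   # [] -- empty because no lexicon was provided
print(best_hyp.tokens)  # token IDs are still populated
```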

Followup: determine whether we want to modify the behavior of ``words`` in the lexicon-free case. One option is to merge the generated tokens and then split them on the input silence token to populate the ``words`` field, but this is tricky: the meaning of a "word" in the lexicon-free case can be vague, and not all languages use whitespace between words.
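A rough sketch of that option (the helper name below is made up, and it assumes ``|`` is the silence/word-delimiter token as in the tutorial's token set):

```python
from typing import List


def tokens_to_words(tokens: List[str], sil_token: str = "|") -> List[str]:
    """Hypothetical helper: merge the decoded tokens, then split on the silence
    token to approximate words. Only meaningful when the token set marks word
    boundaries with a delimiter token."""
    joined = "".join(tokens)
    return [word for word in joined.split(sil_token) if word]


# Tokens as they might come back from lexicon-free decoding:
print(tokens_to_words(["i", "|", "r", "e", "a", "l", "l", "y", "|", "w", "a", "s"]))
# ['i', 'really', 'was']
```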

Pull Request resolved: https://github.com/pytorch/audio/pull/2603

Reviewed By: mthrok

Differential Revision: D38459709

Pulled By: carolineechen

fbshipit-source-id: d64ff186df4633f00e94c64afeaa6a50cebf2934
parent 50bba1df
@@ -280,8 +280,8 @@ greedy_decoder = GreedyCTCDecoder(tokens)
 # Now that we have the data, acoustic model, and decoder, we can perform
 # inference. The output of the beam search decoder is of type
 # :py:func:`torchaudio.models.decoder.CTCHypothesis`, consisting of the
-# predicted token IDs, corresponding words, hypothesis score, and timesteps
-# corresponding to the token IDs. Recall the transcript corresponding to the
+# predicted token IDs, corresponding words (if a lexicon is provided), hypothesis score,
+# and timesteps corresponding to the token IDs. Recall the transcript corresponding to the
 # waveform is
 # ::
 #   i really was very much afraid of showing him how much shocked i was at some parts of what he said
@@ -320,6 +320,18 @@ print(f"WER: {beam_search_wer}")
 ######################################################################
+# .. note::
+#
+#    The ``words`` field of the output hypotheses will be empty if no lexicon
+#    is provided to the decoder. To retrieve a transcript with lexicon-free
+#    decoding, you can perform the following to retrieve the token indices,
+#    convert them to original tokens, then join them together.
+#
+#    .. code::
+#
+#       tokens_str = "".join(beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens))
+#       transcript = " ".join(tokens_str.split("|"))
+#
 # We see that the transcript with the lexicon-constrained beam search
 # decoder produces a more accurate result consisting of real words, while
 # the greedy decoder can predict incorrectly spelled words like “affrayd”
...
@@ -57,6 +57,11 @@ _PretrainedFiles = namedtuple("PretrainedFiles", ["lexicon", "tokens", "lm"])
 class CTCHypothesis(NamedTuple):
     r"""Represents hypothesis generated by CTC beam search decoder :py:func:`CTCDecoder`.
 
+    Note:
+        The ``words`` field is only applicable if a lexicon is provided to the decoder. If
+        decoding without a lexicon, it will be blank. Please refer to ``tokens`` and
+        :py:func:`idxs_to_tokens <torchaudio.models.decoder.CTCDecoder.idxs_to_tokens>` instead.
+
     :ivar torch.LongTensor tokens: Predicted sequence of token IDs. Shape `(L, )`, where
         `L` is the length of the output sequence
     :ivar List[str] words: List of predicted words
...