Commit 612de66b authored by hwangjeff's avatar hwangjeff Committed by Facebook GitHub Bot
Browse files

Revise RNN-T pipeline streaming decoding logic (#2192)

Summary:
Rather than apply SentencePiece's `decode` to directly convert each hypothesis's token id sequence to an output string, we convert each token id sequence to word pieces and then manually join the word pieces ourselves. This allows us to preserve leading whitespaces on output strings and therefore account for word breaks and continuations across token processor invocations, which is particularly useful when performing streaming ASR.

https://user-images.githubusercontent.com/8345689/152093668-11fb775a-bf7b-4b1d-9516-9f8d5a9b6683.mov

Versus the previous behavior visualized in https://github.com/pytorch/audio/issues/2093, the scheme here properly constructs words comprising multiple pieces.

Pull Request resolved: https://github.com/pytorch/audio/pull/2192

Reviewed By: mthrok

Differential Revision: D33936622

Pulled By: hwangjeff

fbshipit-source-id: e550980c7d4cac9e982315508f793a6b816752e9
parent 7a3e262d
@@ -41,9 +41,8 @@ def cli_main():
         features, length = streaming_feature_extractor(segment)
         hypos, state = decoder.infer(features, length, 10, state=state, hypothesis=hypothesis)
         hypothesis = hypos[0]
-        transcript = token_processor(hypothesis.tokens)
-        if transcript:
-            print(transcript, end=" ", flush=True)
+        transcript = token_processor(hypothesis.tokens, lstrip=False)
+        print(transcript, end="", flush=True)
     print()
     # Non-streaming decode.
......
@@ -79,7 +79,7 @@ class _FeatureExtractor(ABC):
 class _TokenProcessor(ABC):
     @abstractmethod
-    def __call__(self, tokens: List[int]) -> str:
+    def __call__(self, tokens: List[int], **kwargs) -> str:
         """Decodes given list of tokens to text sequence.

         Args:
@@ -140,11 +140,13 @@ class _SentencePieceTokenProcessor(_TokenProcessor):
             self.sp_model.pad_id(),
         }

-    def __call__(self, tokens: List[int]) -> str:
+    def __call__(self, tokens: List[int], lstrip: bool = True) -> str:
         """Decodes given list of tokens to text sequence.

         Args:
             tokens (List[int]): list of tokens to decode.
+            lstrip (bool, optional): if ``True``, returns text sequence with leading whitespace
+                removed. (Default: ``True``).

         Returns:
             str:
@@ -153,7 +155,12 @@ class _SentencePieceTokenProcessor(_TokenProcessor):
         filtered_hypo_tokens = [
             token_index for token_index in tokens[1:] if token_index not in self.post_process_remove_list
         ]
-        return self.sp_model.decode(filtered_hypo_tokens)
+        output_string = "".join(self.sp_model.id_to_piece(filtered_hypo_tokens)).replace("\u2581", " ")
+
+        if lstrip:
+            return output_string.lstrip()
+        else:
+            return output_string

 @dataclass
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or sign in to comment