Add LRS3 data preparation (#3421)
Summary: This PR adds a data preparation recipe that uses the ultra face detector to extract full-face video. The resulting video output is then used as input for training and evaluating RNNT-based models for automatic speech recognition (ASR), visual speech recognition (VSR), and audio-visual ASR (AV-ASR) on the LRS3 dataset. This PR also updates the word error rate (WER) for AV-ASR LRS3 models and improves the code readability. Pull Request resolved: https://github.com/pytorch/audio/pull/3421 Reviewed By: mpc001 Differential Revision: D46799748 Pulled By: mthrok fbshipit-source-id: 97af3feac0592b240617faaffa4c0ac8cef614a9
Showing
Please register or sign in to comment