
# RNN-T ASR/VSR/AV-ASR Examples

This repository contains sample implementations of training and evaluation pipelines for RNN-T-based automatic, visual, and audio-visual speech recognition (ASR, VSR, AV-ASR) models on LRS3, in both streaming and non-streaming modes. We follow the same training pipeline as [AutoAVSR](https://arxiv.org/abs/2303.14307).

## Preparation

1. Set up the environment.

```Shell
conda create -y -n autoavsr python=3.8
conda activate autoavsr
```

2. Install a PyTorch nightly build (PyTorch, TorchVision, TorchAudio) following the [official instructions](https://pytorch.org/get-started/), along with the remaining required packages:

```Shell
pip install pytorch-lightning sentencepiece
```

3. Preprocess LRS3 into a cropped-face dataset using the scripts in the [data_prep](./data_prep) folder.

4. `[sp_model_path]` is a sentencepiece model used to encode targets; it can be generated with `train_spm.py`.

### Training an ASR or VSR model

- `[root_dir]` is the root directory of the LRS3 cropped-face dataset.
- `[modality]` is the input modality: `v`, `a`, or `av`.
- `[mode]` is the model type: `online` or `offline`.

```Shell
python train.py --root-dir [root_dir] \
    --sp-model-path ./spm_unigram_1023.model \
    --exp-dir ./exp \
    --num-nodes 8 \
    --gpus 8 \
    --md [modality] \
    --mode [mode]
```

### Training an AV-ASR model

```Shell
python train.py --root-dir [root_dir] \
    --sp-model-path ./spm_unigram_1023.model \
    --exp-dir ./exp \
    --num-nodes 8 \
    --gpus 8 \
    --md av \
    --mode [mode]
```

### Evaluating models

```Shell
python eval.py --dataset-path [dataset_path] \
    --sp-model-path ./spm_unigram_1023.model \
    --md [modality] \
    --mode [mode] \
    --checkpoint-path [checkpoint_path]
```

The table below reports WER for AV-ASR models (offline evaluation).

| Model                 | WER [%] | Params (M) |
|:---------------------:|:-------:|:----------:|
| Non-streaming models  |         |            |
| AV-ASR                | 4.0     | 50         |
| Streaming models      |         |            |
| AV-ASR                | 4.3     | 40         |