
# RNN-T ASR/VSR/AV-ASR Examples
This repository contains sample implementations of training and evaluation pipelines for RNN-T-based automatic, visual, and audio-visual speech recognition (ASR, VSR, AV-ASR) models on LRS3. Both streaming and non-streaming modes are supported.
## Preparation
1. Set up the environment.
```Shell
conda create -y -n autoavsr python=3.8
conda activate autoavsr
```
2. Install the nightly versions of PyTorch, Torchvision, and Torchaudio from [source](https://pytorch.org/get-started/), along with the other required packages:
```Shell
pip install pytorch-lightning sentencepiece
```
3. Preprocess LRS3 into a cropped-face dataset using the [data_prep](./data_prep) folder.
4. Download the models below to initialise the ASR/VSR front-end.
### Training ASR or VSR model
- `[dataset_path]` is the directory of the original dataset.
- `[label_path]` is the directory containing the labels.
- `[modality]` is the input modality: `v`, `a`, or `av`.
- `[mode]` is the model type: `online` or `offline`.
```Shell
python train.py --dataset-path [dataset_path] \
--label-path [label_path] \
--pretrained-model-path [pretrained_model_path] \
--sp-model-path ./spm_unigram_1023.model \
--exp-dir ./exp \
--num-nodes 8 \
--gpus 8 \
--md [modality] \
--mode [mode]
```
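When launching several runs, it can help to assemble the command programmatically and check the placeholder values first. The following is a minimal sketch, not part of this repository: `build_train_command` is a hypothetical helper, the paths in the usage example are made up, and the only flags it emits are the ones shown in the command above. It validates `[modality]` and `[mode]` against the values listed above and returns an argument list suitable for `subprocess.run`.

```python
import shlex
from typing import List

# Allowed values, taken from the placeholder descriptions in this README.
VALID_MODALITIES = {"v", "a", "av"}
VALID_MODES = {"online", "offline"}


def build_train_command(
    dataset_path: str,
    label_path: str,
    pretrained_model_path: str,
    modality: str,
    mode: str,
    sp_model_path: str = "./spm_unigram_1023.model",
    exp_dir: str = "./exp",
    num_nodes: int = 8,
    gpus: int = 8,
) -> List[str]:
    """Hypothetical helper: build the train.py argument list shown above."""
    if modality not in VALID_MODALITIES:
        raise ValueError(f"modality must be one of {sorted(VALID_MODALITIES)}")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    return [
        "python", "train.py",
        "--dataset-path", dataset_path,
        "--label-path", label_path,
        "--pretrained-model-path", pretrained_model_path,
        "--sp-model-path", sp_model_path,
        "--exp-dir", exp_dir,
        "--num-nodes", str(num_nodes),
        "--gpus", str(gpus),
        "--md", modality,
        "--mode", mode,
    ]


if __name__ == "__main__":
    # The paths below are placeholders for illustration only.
    cmd = build_train_command(
        "/data/lrs3", "/data/lrs3/labels", "/models/vsr_frontend.pth", "v", "online"
    )
    # Print a shell-quoted version; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids the missing-backslash mistakes that are easy to make when editing multi-line shell commands by hand.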
### Training AV-ASR model
```Shell
python train.py --dataset-path [dataset_path] \
--label-path [label_path] \
--pretrained-vid-model-path [pretrained_vid_model_path] \
--pretrained-aud-model-path [pretrained_aud_model_path] \
--sp-model-path ./spm_unigram_1023.model \
--exp-dir ./exp \
--num-nodes 8 \
--gpus 8 \
--md av \
--mode [mode]
```
### Evaluating models
```Shell
python eval.py --dataset-path [dataset_path] \
--label-path [label_path] \
--pretrained-model-path [pretrained_model_path] \
--sp-model-path ./spm_unigram_1023.model \
--md [modality] \
--mode [mode] \
--checkpoint-path [checkpoint_path]
```
The table below reports the word error rate (WER) of the AV-ASR models.
| Model | WER [%] | Params (M) |
|:-----------:|:------------:|:--------------:|
| Non-streaming models | | |
| AV-ASR | 4.2 | 50 |
| Streaming models | | |
| AV-ASR | 4.9 | 40 |
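
For reference, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition for a single utterance. `word_error_rate` is a hypothetical helper, and the exact text normalisation and aggregation used by `eval.py` are not shown here and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent between two whitespace-tokenised strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))  # 25.0
```

Corpus-level WER is usually computed by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.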