
# Real-time ASR/VSR/AV-ASR Examples

[📘Introduction](#introduction) | [📊Training](#training) | [🔮Evaluation](#evaluation)
## Introduction

This directory contains the training recipe for real-time audio, visual, and audio-visual speech recognition (ASR, VSR, AV-ASR) models, which is an extension of [Auto-AVSR](https://arxiv.org/abs/2303.14307).

## Preparation

1. Install PyTorch (`torch`, `torchvision`, `torchaudio`) following the [official instructions](https://pytorch.org/get-started/), along with the other required packages:

```Shell
pip install torch torchvision torchaudio pytorch-lightning sentencepiece
```

2. Preprocess LRS3. See the instructions in the [data_prep](./data_prep) folder.

## Usage

### Training

```Shell
python train.py --exp-dir=[exp_dir] \
                --exp-name=[exp_name] \
                --modality=[modality] \
                --mode=[mode] \
                --root-dir=[root_dir] \
                --sp-model-path=[sp_model_path] \
                --num-nodes=[num_nodes] \
                --gpus=[gpus]
```

- `exp-dir` and `exp-name`: Checkpoints will be saved under `[exp_dir]/[exp_name]`.
- `modality`: Input modality. Valid values: `video`, `audio`, and `audiovisual`.
- `mode`: Recognition mode. Valid values: `online` and `offline`.
- `root-dir`: Path to the root directory where all preprocessed files are stored.
- `sp-model-path`: Path to the sentencepiece model. Default: `./spm_unigram_1023.model`, which can be produced using `train_spm.py` (see the sketch at the end of this page).
- `num-nodes`: Number of machines used. Default: 4.
- `gpus`: Number of GPUs per machine. Default: 8.

### Evaluation

```Shell
python eval.py --modality=[modality] \
               --mode=[mode] \
               --root-dir=[dataset_path] \
               --sp-model-path=[sp_model_path] \
               --checkpoint-path=[checkpoint_path]
```

- `modality`: Input modality. Valid values: `video`, `audio`, and `audiovisual`.
- `mode`: Recognition mode. Valid values: `online` and `offline`.
- `root-dir`: Path to the root directory where all preprocessed files are stored.
- `sp-model-path`: Path to the sentencepiece model. Default: `./spm_unigram_1023.model`.
- `checkpoint-path`: Path to a pretrained model checkpoint.

## Results

The table below reports WER for AV-ASR models trained from scratch (offline evaluation).

| Model                | Training dataset (hours) | WER [%] | Params (M) |
|:--------------------:|:------------------------:|:-------:|:----------:|
| Non-streaming models |                          |         |            |
| AV-ASR               | LRS3 (438)               | 3.9     | 50         |
| Streaming models     |                          |         |            |
| AV-ASR               | LRS3 (438)               | 3.9     | 40         |
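
The sentencepiece model referenced above (`spm_unigram_1023.model`) is produced by `train_spm.py`. For orientation, below is a minimal sketch of how such a unigram model can be built with the [sentencepiece](https://github.com/google/sentencepiece) library; the transcript file name and extra options are illustrative assumptions, and `train_spm.py` in this directory remains the authoritative recipe.

```Python
# Minimal sketch: build a unigram sentencepiece model like spm_unigram_1023.model.
# The transcript path and some options below are illustrative assumptions;
# see train_spm.py in this directory for the actual recipe.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="lrs3_train_transcripts.txt",  # assumed: LRS3 training transcripts, one utterance per line
    model_prefix="spm_unigram_1023",     # writes spm_unigram_1023.model and spm_unigram_1023.vocab
    vocab_size=1023,                     # matches the default ./spm_unigram_1023.model
    model_type="unigram",
    character_coverage=1.0,              # keep all characters present in the transcripts
)
```

The resulting `spm_unigram_1023.model` can then be passed to `train.py` and `eval.py` via `--sp-model-path`.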