<p align="center"><img width="160" src="https://download.pytorch.org/torchaudio/doc-assets/avsr/lip_white.png" alt="logo"></p>
<h1 align="center">Real-time ASR/VSR/AV-ASR Examples</h1>

<div align="center">

[📘Introduction](#introduction) |
[📊Training](#training) |
[🔮Evaluation](#evaluation)
</div>

## Introduction

This directory contains the training recipe for real-time audio (ASR), visual (VSR), and audio-visual (AV-ASR) speech recognition models, extending [Auto-AVSR](https://arxiv.org/abs/2303.14307).

## Preparation

1. Install PyTorch (`torch`, `torchvision`, `torchaudio`) following the [official instructions](https://pytorch.org/get-started/), along with the other required packages:

```Shell
pip install torch torchvision torchaudio pytorch-lightning sentencepiece
```
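
You can optionally run a quick sanity check that all of the packages import correctly:

```Shell
# Optional sanity check: every dependency should import without errors.
python -c "import torch, torchvision, torchaudio, pytorch_lightning, sentencepiece; print(torch.__version__)"
```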

2. Preprocess LRS3. See the instructions in the [data_prep](./data_prep) folder.
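
   The sketch below only illustrates the general shape of a preprocessing run; the script name and flags here are hypothetical, so follow the [data_prep](./data_prep) instructions for the exact command:

```Shell
# Hypothetical sketch only -- the real script name and flags are documented in ./data_prep.
python data_prep/main.py --data-dir=[LRS3_dir] --root-dir=[root_dir]
```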

## Usage

### Training

```Shell
python train.py --exp-dir=[exp_dir] \
                --exp-name=[exp_name] \
                --modality=[modality] \
                --mode=[mode] \
                --root-dir=[root_dir] \
                --sp-model-path=[sp_model_path] \
                --num-nodes=[num_nodes] \
                --gpus=[gpus]
```

- `exp-dir` and `exp-name`: Checkpoints will be saved at `[exp_dir]/[exp_name]`.
- `modality`: Type of input modality. Valid values are: `video`, `audio`, and `audiovisual`.
- `mode`: Recognition mode. Valid values are: `online` (streaming) and `offline` (non-streaming).
- `root-dir`: Path to the root directory where the preprocessed files are stored.
- `sp-model-path`: Path to the SentencePiece model. Default: `./spm_unigram_1023.model`, which can be produced using `train_spm.py`.
- `num-nodes`: Number of machines used. Default: 4.
- `gpus`: Number of GPUs per machine. Default: 8.
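
As a concrete example, the following trains a streaming audio-visual model on a single 8-GPU machine (all paths and the experiment name are placeholders):

```Shell
# Example invocation with placeholder paths; checkpoints land in ./exp/avsr_online.
python train.py --exp-dir=./exp \
                --exp-name=avsr_online \
                --modality=audiovisual \
                --mode=online \
                --root-dir=/data/lrs3 \
                --sp-model-path=./spm_unigram_1023.model \
                --num-nodes=1 \
                --gpus=8
```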

### Evaluation

```Shell
python eval.py --modality=[modality] \
               --mode=[mode] \
               --root-dir=[dataset_path] \
               --sp-model-path=[sp_model_path] \
               --checkpoint-path=[checkpoint_path]
```

- `modality`: Type of input modality. Valid values are: `video`, `audio`, and `audiovisual`.
- `mode`: Recognition mode. Valid values are: `online` (streaming) and `offline` (non-streaming).
- `root-dir`: Path to the root directory where the preprocessed files are stored.
- `sp-model-path`: Path to the SentencePiece model. Default: `./spm_unigram_1023.model`.
- `checkpoint-path`: Path to a pre-trained model checkpoint.
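
For example, to evaluate the streaming audio-visual model trained above (the paths and checkpoint filename are placeholders; use the one produced by your training run):

```Shell
# Example invocation with placeholder paths.
python eval.py --modality=audiovisual \
               --mode=online \
               --root-dir=/data/lrs3 \
               --sp-model-path=./spm_unigram_1023.model \
               --checkpoint-path=./exp/avsr_online/model.ckpt
```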

## Results

The table below reports the word error rate (WER) for AV-ASR models trained from scratch and evaluated in offline mode.

|          Model           | Training dataset (hours) | WER [%] | Params (M) |
|:------------------------:|:------------------------:|:-------:|:----------:|
| **Non-streaming models** |                          |         |            |
|          AV-ASR          |        LRS3 (438)        |   3.9   |     50     |
|   **Streaming models**   |                          |         |            |
|          AV-ASR          |        LRS3 (438)        |   3.9   |     40     |