# Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
![teaser](https://user-images.githubusercontent.com/50283958/172300924-157b8458-0e95-4b2e-b992-fc7927738146.png)

We provide testing code for Squeezeformer, along with pre-trained checkpoints.

Check out our [paper](https://arxiv.org/pdf/2206.00888.pdf) for more details.


Squeezeformer is now also supported in NVIDIA's [NeMo](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/starthere/intro.html) library, along with training recipes and scripts; see [this link](https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/squeezeformer).

## Install Squeezeformer

We recommend using Python version 3.8.  

### 1. Install dependencies

We support TensorFlow 2.5. Run one of the following commands depending on your target device:

* Running on CPUs: `pip install -e '.[tf2.5]'`
* Running on GPUs: `pip install -e '.[tf2.5-gpu]'`
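
As a quick sanity check after installation, you can confirm that the expected TensorFlow version is importable:

```bash
python -c "import tensorflow as tf; print(tf.__version__)"  # expected: 2.5.x
```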

### 2. Install CTC decoder 
```bash
cd scripts
bash install_ctc_decoders.sh
```

## Prepare Dataset

### 1. Download Librispeech

[Librispeech](https://ieeexplore.ieee.org/document/7178964) is a widely used ASR benchmark consisting of a 960-hour speech corpus with text transcriptions.
The dataset comprises three training sets (`train-clean-100`, `train-clean-360`, `train-other-500`),
two development sets (`dev-clean`, `dev-other`), and two test sets (`test-clean`, `test-other`).

Download the datasets from this [link](http://www.openslr.org/12) and untar them.
If you only need to run inference, you can skip the training sets to save disk space.
After extraction, you should have flac files under `{dataset_path}/LibriSpeech`.
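
For example, assuming `wget` is available, the two test sets alone can be fetched and extracted as follows (each tarball unpacks into a `LibriSpeech/` subdirectory):

```bash
mkdir -p {dataset_path}
wget https://www.openslr.org/resources/12/test-clean.tar.gz
wget https://www.openslr.org/resources/12/test-other.tar.gz
tar -xzf test-clean.tar.gz -C {dataset_path}
tar -xzf test-other.tar.gz -C {dataset_path}
```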

### 2. Create Manifest Files

Once you have downloaded the datasets, create manifest files that link each audio file path to its transcription.
We use a script from [TensorFlowASR](https://github.com/TensorSpeech/TensorFlowASR).

```bash
cd scripts
python create_librispeech_trans_all.py --data {dataset_path}/LibriSpeech --output {tsv_dir}
```
* `dataset_path` is the directory into which you untarred the datasets in the previous step.
* This script creates tsv files under `tsv_dir` that list each audio file's path, duration, and transcription.
* To skip processing the training sets, pass the additional argument `--mode test-only`.

If you have followed the instructions correctly, you should have the following files under `tsv_dir`:
* `dev_clean.tsv`, `dev_other.tsv`, `test_clean.tsv`, `test_other.tsv`
* `train_clean_100.tsv`, `train_clean_360.tsv`, `train_other_500.tsv` (if not `--mode test-only`)
* `train_other.tsv`, which merges all training tsv files into one (if not `--mode test-only`)
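
For reference, each manifest is a tab-separated file whose first row is a header, following the TensorFlowASR convention; the path, duration, and transcript below are illustrative:

```bash
head -n 2 {tsv_dir}/test_clean.tsv
# PATH  DURATION  TRANSCRIPT
# {dataset_path}/LibriSpeech/test-clean/1089/134686/1089-134686-0000.flac  10.44  he hoped there would be stew for dinner
```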


## Testing Squeezeformer

### 1. Download Pre-trained Checkpoints

We provide pre-trained checkpoints for all Squeezeformer variants, along with their word error rates (WER, %) on the Librispeech test sets.

|      **Model**      |                                                  **Checkpoint**                            | **test-clean** | **test-other** |
| :-----------------: | :---------------------------------------------------------------------------------------:  | :------------: | :------------: |
|  Squeezeformer-XS   | [link](https://drive.google.com/file/d/1qSukKHz2ltBiWU-xHGmI-P9ziPJcLcSu/view?usp=sharing) |    3.74        |      9.09      |
|  Squeezeformer-S    | [link](https://drive.google.com/file/d/1PGao0AOe5aQXc-9eh2RDQZnZ4UcefcHB/view?usp=sharing) |    3.08        |      7.47      |
|  Squeezeformer-SM   | [link](https://drive.google.com/file/d/17cL1p0KJgT-EBu_-bg3bF7-Uh-pnf-8k/view?usp=sharing) |    2.79        |      6.89      |
|  Squeezeformer-M    | [link](https://drive.google.com/file/d/1fbaby-nOxHAGH0GqLoA0DIjFDPaOBl1d/view?usp=sharing) |    2.56        |      6.50      |
|  Squeezeformer-ML   | [link](https://drive.google.com/file/d/1-ZPtJjJUHrcbhPp03KioadenBtKpp-km/view?usp=sharing) |    2.61        |      6.05      |
|  Squeezeformer-L    | [link](https://drive.google.com/file/d/1LJua7A4ZMoZFi2cirf9AnYEl51pmC-m5/view?usp=sharing) |    2.47        |      5.97      |
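
The checkpoints are hosted on Google Drive. If you prefer the command line, a third-party tool such as [gdown](https://github.com/wkentaro/gdown) can usually fetch them by file ID (the long token in each link above); for example, for Squeezeformer-S:

```bash
pip install gdown
# File ID taken from the Squeezeformer-S link in the table above
gdown 1PGao0AOe5aQXc-9eh2RDQZnZ4UcefcHB -O squeezeformer-S.h5
```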


### 2. Run Inference!

Run the following commands:
```bash
cd examples/squeezeformer
python test.py --bs {batch_size} --config configs/squeezeformer-S.yml --saved squeezeformer-S.h5 \
    --dataset_path {tsv_dir} --dataset {dev_clean|dev_other|test_clean|test_other}
```

* `tsv_dir` is the directory path to the tsv manifest files that you created in the previous step.
* You can test other Squeezeformer variants, e.g., Squeezeformer-L or Squeezeformer-M, by changing `--config` and `--saved` accordingly; see the example below.
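
For instance, evaluating Squeezeformer-L on `test_other` with an illustrative batch size of 16 looks like:

```bash
python test.py --bs 16 --config configs/squeezeformer-L.yml --saved squeezeformer-L.h5 \
    --dataset_path {tsv_dir} --dataset test_other
```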

## External implementations 
We are thankful to all the researchers who have extended Squeezeformer for different purposes.

|      **Description**      | **Link**                                          |
| :-----------------------: | :-----------------------------------------------: |
|  PyTorch implementation   | [link](https://github.com/upskyy/Squeezeformer)   |
|  NeMo                     | [link](https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/squeezeformer)   | 
|  WeNet                    | [link](https://github.com/wenet-e2e/wenet) |


## Citation
Squeezeformer was developed as part of the following paper. If you find this library useful for your work, please cite:

```text
@article{kim2022squeezeformer,
  title={Squeezeformer: An Efficient Transformer for Automatic Speech Recognition},
  author={Kim, Sehoon and Gholami, Amir and Shaw, Albert and Lee, Nicholas and Mangalam, Karttikeya and Malik, Jitendra and Mahoney, Michael W and Keutzer, Kurt},
  journal={arXiv preprint arXiv:2206.00888},
  year={2022}
}
```

## Copyright

THIS SOFTWARE AND/OR DATA WAS DEPOSITED IN THE BAIR OPEN RESEARCH COMMONS REPOSITORY ON 02/07/23.