**NOTE**: This example is outdated and is no longer actively maintained. Please
follow the new instructions for fine-tuning Wav2Vec2 [here](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/README.md).

## Fine-tuning Wav2Vec2

The `run_asr.py` script allows one to fine-tune pretrained Wav2Vec2 models that can be found [here](https://huggingface.co/models?search=facebook/wav2vec2).

This fine-tuning script can also be run as a Google Colab notebook [TODO: here]( ).

### Fine-Tuning with TIMIT
Let's take a look at the [script](./finetune_base_timit_asr.sh) used to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base)
with the [TIMIT dataset](https://huggingface.co/datasets/timit_asr):

```bash
#!/usr/bin/env bash
python run_asr.py \
--output_dir="./wav2vec2-base-timit-asr" \
--num_train_epochs="30" \
--per_device_train_batch_size="20" \
--per_device_eval_batch_size="20" \
--eval_strategy="steps" \
--save_steps="500" \
--eval_steps="100" \
--logging_steps="50" \
--learning_rate="5e-4" \
--warmup_steps="3000" \
--model_name_or_path="facebook/wav2vec2-base" \
--fp16 \
--dataset_name="timit_asr" \
--train_split_name="train" \
--validation_split_name="test" \
--orthography="timit" \
--preprocessing_num_workers="$(nproc)" \
--group_by_length \
--freeze_feature_extractor \
--verbose_logging \
```

The resulting model and inference examples can be found [here](https://huggingface.co/elgeish/wav2vec2-base-timit-asr).
Some of the arguments above may look unfamiliar; let's break down what's going on:

`--orthography="timit"` applies certain text preprocessing rules, for tokenization and normalization, to clean up the dataset.
In this case, we use the following instance of `Orthography`:

```python
Orthography(
    do_lower_case=True,
    # break compounds like "quarter-century-old" and replace pauses "--"
    translation_table=str.maketrans({"-": " "}),
)
```

The instance above is used as follows:
* creates a tokenizer with `do_lower_case=True` (ignores casing for input and lowercases output when decoding)
* replaces `"-"` with `" "` to break compounds like `"quarter-century-old"` and to clean up suspended hyphens
* cleans up consecutive whitespaces (replaces them with a single space: `" "`)
* removes characters not in vocabulary (lacking respective sound units)
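
For example, here is a minimal sketch of the lower-casing and translation-table steps above applied to a made-up transcript (the sample sentence is hypothetical):

```python
# Minimal sketch of the normalization described above; the sample text is made up.
text = "A quarter-century-old tradition -- still alive"
table = str.maketrans({"-": " "})
normalized = " ".join(text.lower().translate(table).split())
print(normalized)  # -> "a quarter century old tradition still alive"
```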

`--verbose_logging` logs text preprocessing updates and, when evaluating on the validation split every `eval_steps`,
logs references and predictions.
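
For reference, here is a minimal inference sketch with the resulting model linked above (the placeholder audio is just silence; in practice you would pass real 16 kHz mono speech samples):

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the fine-tuned checkpoint linked above.
processor = Wav2Vec2Processor.from_pretrained("elgeish/wav2vec2-base-timit-asr")
model = Wav2Vec2ForCTC.from_pretrained("elgeish/wav2vec2-base-timit-asr")

# Placeholder input: one second of silence at 16 kHz; replace with real speech.
speech = np.zeros(16_000, dtype=np.float32)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```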

### Fine-Tuning with Arabic Speech Corpus

Other datasets, like the [Arabic Speech Corpus dataset](https://huggingface.co/datasets/arabic_speech_corpus),
require more work! Let's take a look at the [script](./finetune_large_xlsr_53_arabic_speech_corpus.sh)
used to fine-tune [wav2vec2-large-xlsr-53](https://huggingface.co/elgeish/wav2vec2-large-xlsr-53-arabic):

```bash
#!/usr/bin/env bash
python run_asr.py \
--output_dir="./wav2vec2-large-xlsr-53-arabic-speech-corpus" \
--num_train_epochs="50" \
--per_device_train_batch_size="1" \
--per_device_eval_batch_size="1" \
--gradient_accumulation_steps="8" \
--eval_strategy="steps" \
--save_steps="500" \
--eval_steps="100" \
--logging_steps="50" \
--learning_rate="5e-4" \
--warmup_steps="3000" \
--model_name_or_path="elgeish/wav2vec2-large-xlsr-53-arabic" \
--fp16 \
--dataset_name="arabic_speech_corpus" \
--train_split_name="train" \
--validation_split_name="test" \
--max_duration_in_seconds="15" \
--orthography="buckwalter" \
--preprocessing_num_workers="$(nproc)" \
--group_by_length \
--freeze_feature_extractor \
--target_feature_extractor_sampling_rate \
--verbose_logging \
```

First, let's understand how this dataset represents Arabic text; it uses a format called
[Buckwalter transliteration](https://en.wikipedia.org/wiki/Buckwalter_transliteration).
We use the [lang-trans](https://github.com/kariminf/lang-trans) package to convert back to Arabic when logging.
The Buckwalter format only includes ASCII characters, some of which are non-alpha (e.g., `">"` maps to `"أ"`).
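
Here is a minimal sketch of converting Buckwalter text back to Arabic script with that package (the sample word and its expected output are illustrative assumptions):

```python
# pip install lang-trans
from lang_trans import arabic

# "ktAb" is Buckwalter for the Arabic word "كتاب" ("book").
print(arabic.buckwalter.untransliterate("ktAb"))
```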

`--orthography="buckwalter"` applies certain text preprocessing rules, for tokenization and normalization, to clean up the dataset. In this case, we use the following instance of `Orthography`:

```python
Orthography(
    vocab_file=pathlib.Path(__file__).parent.joinpath("vocab/buckwalter.json"),
    word_delimiter_token="/",  # "|" is Arabic letter alef with madda above
    words_to_remove={"sil"},  # fixing "sil" in arabic_speech_corpus dataset
    untransliterator=arabic.buckwalter.untransliterate,
    translation_table=str.maketrans({
        "-": " ",  # sometimes used to represent pauses
        "^": "v",  # fixing "tha" in arabic_speech_corpus dataset
    }),
)
```

The instance above is used as follows:
* creates a tokenizer with Buckwalter vocabulary and `word_delimiter_token="/"`
* replaces `"-"` with `" "` to clean up hyphens and fixes the orthography for `"ث"`
* removes words used as indicators (in this case, `"sil"` is used for silence)
* cleans up consecutive whitespaces (replaces them with a single space: `" "`)
* removes characters not in vocabulary (lacking respective sound units)

`--verbose_logging` logs text preprocessing updates and, when evaluating on the validation split every `eval_steps`,
logs references and predictions. Since the dataset uses the Buckwalter format, text is also logged in Arabic abjad.

`--target_feature_extractor_sampling_rate` resamples audio to the target feature extractor's sampling rate (16 kHz).

`--max_duration_in_seconds="15"` filters out examples whose audio is longer than the specified limit,
which helps with capping GPU memory usage.
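
Conceptually, these two options amount to something like the following preprocessing (a rough sketch, not the script's actual code; the file name is hypothetical):

```python
import librosa

# Load an example clip, keeping its original sampling rate (e.g., 48 kHz here).
speech, orig_sr = librosa.load("example.wav", sr=None)

# --target_feature_extractor_sampling_rate: resample to the feature extractor's 16 kHz.
if orig_sr != 16_000:
    speech = librosa.resample(speech, orig_sr=orig_sr, target_sr=16_000)

# --max_duration_in_seconds="15": drop examples longer than 15 seconds.
keep_example = len(speech) / 16_000 <= 15.0
```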


### DeepSpeed Integration

To learn how to deploy the DeepSpeed integration, please refer to [this guide](https://huggingface.co/transformers/main/main_classes/deepspeed.html#deepspeed-trainer-integration).

To get started quickly, all you need to do is install:
```bash
pip install deepspeed
```
and then use the default configuration files in this directory:

* `ds_config_wav2vec2_zero2.json`
* `ds_config_wav2vec2_zero3.json`

Here are examples of how you can use DeepSpeed:

(edit the value for `--num_gpus` to match the number of GPUs you have)

ZeRO-2:

```bash
PYTHONPATH=../../../src deepspeed --num_gpus 2 \
run_asr.py \
--output_dir=output_dir --num_train_epochs=2 --per_device_train_batch_size=2 \
--per_device_eval_batch_size=2 --eval_strategy=steps --save_steps=500 --eval_steps=100 \
--logging_steps=5 --learning_rate=5e-4 --warmup_steps=3000 \
--model_name_or_path=patrickvonplaten/wav2vec2_tiny_random_robust \
--dataset_name=hf-internal-testing/librispeech_asr_dummy --dataset_config_name=clean \
--train_split_name=validation --validation_split_name=validation --orthography=timit \
--preprocessing_num_workers=1 --group_by_length --freeze_feature_extractor --verbose_logging \
--deepspeed ds_config_wav2vec2_zero2.json
```

For ZeRO-2 with more than 1 GPU you need to set `find_unused_parameters` to `true` (already set in the example configuration file):
```json
    "zero_optimization": {
        ...
        "find_unused_parameters": true,
        ...
    }
```

ZeRO-3:

```bash
PYTHONPATH=../../../src deepspeed --num_gpus 2 \
run_asr.py \
--output_dir=output_dir --num_train_epochs=2 --per_device_train_batch_size=2 \
--per_device_eval_batch_size=2 --eval_strategy=steps --save_steps=500 --eval_steps=100 \
--logging_steps=5 --learning_rate=5e-4 --warmup_steps=3000 \
--model_name_or_path=patrickvonplaten/wav2vec2_tiny_random_robust \
--dataset_name=hf-internal-testing/librispeech_asr_dummy --dataset_config_name=clean \
--train_split_name=validation --validation_split_name=validation --orthography=timit \
--preprocessing_num_workers=1 --group_by_length --freeze_feature_extractor --verbose_logging \
--deepspeed ds_config_wav2vec2_zero3.json
```

### Pretraining Wav2Vec2

The `run_pretrain.py` script allows one to pretrain a Wav2Vec2 model from scratch using Wav2Vec2's contrastive loss objective (see the official [paper](https://arxiv.org/abs/2006.11477) for more information).
It is recommended to pretrain Wav2Vec2 with Trainer + DeepSpeed (please refer to [this guide](https://huggingface.co/transformers/main/main_classes/deepspeed.html#deepspeed-trainer-integration) for more information).

Here is an example of how you can use DeepSpeed ZeRO-2 to pretrain a small Wav2Vec2 model:

```bash
PYTHONPATH=../../../src deepspeed --num_gpus 4 run_pretrain.py \
--output_dir="./wav2vec2-base-libri-100h" \
--num_train_epochs="3" \
--per_device_train_batch_size="32" \
--per_device_eval_batch_size="32" \
--gradient_accumulation_steps="2" \
--save_total_limit="3" \
--save_steps="500" \
--logging_steps="10" \
--learning_rate="5e-4" \
--weight_decay="0.01" \
--warmup_steps="3000" \
--model_name_or_path="patrickvonplaten/wav2vec2-base-libri-100h" \
--dataset_name="librispeech_asr" \
--dataset_config_name="clean" \
--train_split_name="train.100" \
--preprocessing_num_workers="4" \
--max_duration_in_seconds="10.0" \
--group_by_length \
--verbose_logging \
--fp16 \
--deepspeed ds_config_wav2vec2_zero2.json \
```


### Forced Alignment

Character-level forced alignment for audio and text pairs with Wav2Vec2 models fine-tuned on an ASR task for a specific language.
Inspired by [this](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html) PyTorch tutorial.

#### Input Formats

    Input format in script.txt              Input format in wavs directory
    0000    sentence1                       0000.wav
    0001    sentence2                       0001.wav
    
#### Output Format

The output directory will contain 0000.txt and 0001.txt. Each file will have the format shown below:

    char    score   start_ms    end_ms
    h       0.25    1440        1520
    
#### Run command

```bash
python alignment.py  \
--model_name="arijitx/wav2vec2-xls-r-300m-bengali" \
--wav_dir="./wavs" \
--text_file="script.txt" \
--input_wavs_sr=48000 \
--output_dir="./out_alignment" \
--cuda
```
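
Here is a minimal sketch of reading the resulting alignment files back into Python (assuming the whitespace-separated output format shown above):

```python
from pathlib import Path

# Parse every per-utterance alignment file produced in --output_dir.
for path in sorted(Path("./out_alignment").glob("*.txt")):
    for line in path.read_text().splitlines():
        try:
            char, score, start_ms, end_ms = line.split()
            print(path.stem, char, float(score), int(start_ms), int(end_ms))
        except ValueError:
            continue  # skip a header or blank line, if present
```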