Unverified Commit e118e085 authored by Patrick von Platen, committed by GitHub

[Robust Speech Event] Add guides (#15155)



* up

* improve readme

* up

* up

* more info

* up

* up

* Apply suggestions from code review
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

* add more stuff for eval

* update

* up

* Update README.md

* Update examples/research_projects/xls_r/README.md
Co-authored-by: Omar Sanseviero <osanseviero@users.noreply.github.com>

* apply omar's suggestions
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@users.noreply.github.com>
parent 1a354d53
# Robust Speech Challenge 🤗

Welcome to the robust speech recognition challenge 🎙️ !

The goal of this event is to build **robust**, **real-world** speech recognition (ASR) systems in as many languages as possible 🌏🌍🌎.
If necessary and available, free access to a V100 32 GB GPU will kindly be provided by the [OVH cloud team](https://us.ovhcloud.com/) 🚀.

This document summarizes all the relevant information required for the speech community event 📋. To sign up, please see [this forum post](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614) 🤗. Please make sure to:
- Read it in detail
- Fill out the Google form
- Join our Discord server in the #join-sprint channel.
## Table of Contents

- [TLDR](#tldr)
- [Important dates](#important-dates)
- [How to install pytorch, transformers, datasets](#how-to-install-relevant-libraries)
- [Data and Preprocessing](#data-and-preprocessing)
- [How to fine-tune an acoustic model](#how-to-finetune-an-acoustic-model)
- [How to fine-tune with OVH cloud](#how-to-finetune-with-ovh-cloud)
- [How to combine n-gram language models with acoustic model](#how-to-combine-n-gram-with-acoustic-model)
- [Evaluation](#evaluation)
- [Prizes](#prizes)
- [Communication and Problems](#communication-and-problems)
- [Talks](#talks)
- [General Tips & Tricks](#general-tips-and-tricks)
## TLDR

Participants are encouraged to leverage pre-trained speech recognition checkpoints,
preferably [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53),
to train a speech recognition system in a language of their choice.

Speech recognition systems should be trained using **PyTorch**, **🤗 Transformers**, and **🤗 Datasets**.
For more information on how to install the above libraries, please read through
[How to install pytorch, transformers, datasets](#how-to-install-relevant-libraries).

Participants can make use of whatever data they think is useful to build a
speech recognition system for **real-world** audio data -
**except** the Common Voice `"test"` split of their chosen language.
The section [Data and preprocessing](#data-and-preprocessing) explains
in more detail what audio data can be used, how to find suitable audio data, and
how the audio data can be processed.

A step-by-step guide on how to fine-tune
an acoustic model for a speech recognition system can be found under [How to fine-tune an acoustic model](#how-to-finetune-an-acoustic-model).
If possible, it is encouraged to fine-tune the acoustic models on local GPU machines, but
if those are not available, the OVH cloud team kindly provides a limited
number of GPUs for the event. Simply fill out [this google form](https://forms.gle/GFZkMkKLiufi75g28) to get access to a GPU.
For more information on how to train an acoustic model on one of OVH's GPUs, see [How to fine-tune with OVH cloud](#how-to-finetune-with-ovh-cloud).

The performance of a speech recognition system can often be significantly improved by adding a
language model for decoding. For more information on how to add a language model, please
take a look at [How to combine n-gram language models with acoustic model](#how-to-combine-n-gram-with-acoustic-model).

During the event, the speech recognition system will be evaluated on both the Common Voice `"test"` split
of the participants' chosen language as well as the *real-world* `"dev"` data provided by
the Hugging Face team.

**Please note**: At the end of the robust speech recognition challenge, the speech recognition system will also be evaluated on the
*real-world* `"test"` data provided by the Hugging Face team. Each participant should add an
`eval.py` script to her/his model repository in a specific format that lets one easily
evaluate the speech recognition system on both Common Voice's `"test"` data as well as the *real-world* audio
data. Please read through the [Evaluation](#evaluation) section to make sure your evaluation script is in the correct format. Speech recognition systems
with evaluation scripts in an incorrect format can sadly not be considered for the challenge.

At the end of the event, the best performing speech recognition system
will receive a prize 🏆 - more information regarding the prizes can be found under [Prizes](#prizes).
We believe that framing the event as a competition is more fun, but at the core, the event is about
creating speech recognition systems in as many languages as possible as a community.
This can be achieved by working together, helping each other to solve bugs, sharing important findings, etc... 🤗

**Note**:
Please read through the section on [Communication & Problems](#communication-and-problems) to make sure you
know how to ask for help, etc...
All important announcements will be made on Discord. Please make sure that
you've joined [this discord channel](https://discord.gg/SHr5wC7m).

Also, please make sure that you have been added to the [Speech Event Organization](https://huggingface.co/speech-recognition-community-v2).
You should have received an invite by email. If you didn't receive an invite, please contact the organizers, *e.g.* Anton, Patrick, or Omar directly on Discord.
## Important dates
- **24.01. - 07.02.** The OVH & Hugging Face team will be available for any questions, problems the participants might have.
- **07.02.** Access to GPU is deactivated and community week officially ends.
## Data and preprocessing
In this section, we will quickly go over how to find suitable training data and
how to preprocess it.
To begin with, **all data except Common Voice's `"test"` data can be used as training data.**
This exception applies to all Common Voice versions, as the test split of later Common Voice versions often
overlaps with that of previous versions. For example, the test data of Common Voice 7 in English is
largely identical to the test data of Common Voice 6 in English:
```python
load_dataset("mozilla-foundation/common_voice_7_0", "en", split="test")
```
includes more or less the same data as
```python
load_dataset("mozilla-foundation/common_voice_6_1", "en", split="test")
```
However, we strongly encourage participants to make use of Common Voice's other splits, *e.g.* `"train"` and `"validation"`.
For most languages, the Common Voice dataset already offers a decent amount of training data. Nevertheless, it is usually
advantageous to collect additional data. To do so, participants are encouraged, as a first step, to search the
Hugging Face Hub for additional audio data, for example by selecting the category
["speech-processing"](https://huggingface.co/datasets?task_categories=task_categories:speech-processing&sort=downloads).
All datasets that are available on the Hub can be downloaded via the 🤗 Datasets library in the same way Common Voice is downloaded.
If one wants to combine multiple datasets for training, it might make sense to take a look at
the [`interleave_datasets`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=interleave#datasets.interleave_datasets) function.
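As a rough illustration (the second dataset identifier below is purely hypothetical), combining Common Voice with another dataset from the Hub could look like the following sketch; note that both datasets need matching columns and features before they can be interleaved:

```python
# Minimal sketch: interleave Common Voice with another speech dataset from the Hub.
# Both datasets must share the same columns/features, so rename/remove columns and
# cast the audio column to a common sampling rate beforehand.
from datasets import interleave_datasets, load_dataset

common_voice = load_dataset(
    "mozilla-foundation/common_voice_7_0", "sv-SE", split="train+validation", use_auth_token=True
)
other_speech = load_dataset("some-org/some-speech-dataset", split="train")  # hypothetical dataset id

# sample alternately from both datasets; probabilities and seed can be tuned
combined = interleave_datasets([common_voice, other_speech], probabilities=[0.7, 0.3], seed=42)
```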
In addition, participants can also make use of their own audio data. Here, please make sure that you **are allowed to use the audio data**. E.g., if audio data
is taken from media platforms, such as YouTube, it should be verified that the media platform and the owner of the data have given their approval to use the audio
data in the context of machine learning research. If you are not sure whether the data you want to use has the appropriate licensing, please contact the Hugging Face
team on Discord.
Next, let's talk about preprocessing. Audio data and transcriptions have to be brought into the correct format when
training the acoustic model (example shown in [How to fine-tune an acoustic model](#how-to-finetune-an-acoustic-model)).
It is recommended to do this with 🤗 Datasets' `.map()` function as shown
[here](https://github.com/huggingface/transformers/blob/9a2dabae7002258e41419491c73dd43ad61b5de7/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L444). As can be
seen in the official ["Single GPU Example"](https://github.com/huggingface/transformers/tree/master/examples/pytorch/speech-recognition#single-gpu-ctc), we can pass some characters that will be removed from the transcriptions,
*e.g.*: `--chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \`.
The participants are free to modify this preprocessing by removing more characters or even replacing characters as
it is done in the [official blog post](https://github.com/huggingface/transformers/blob/9a2dabae7002258e41419491c73dd43ad61b5de7/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L444).
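For orientation, here is a minimal sketch of such a preprocessing step, assuming a Common Voice-style dataset with a `"sentence"` column; the regex only mirrors the `--chars_to_ignore` example above and should be adapted to your language:

```python
# Minimal sketch: remove ignored characters and lowercase the transcriptions via
# datasets' .map(), similar to what run_speech_recognition_ctc.py does internally.
import re

from datasets import load_dataset

chars_to_ignore_regex = '[,?.!\\-;:"“%‘”�]'  # adapt to your language

def remove_special_characters(batch):
    # write the cleaned transcription to a new column used as the training target
    batch["target_text"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).lower() + " "
    return batch

common_voice = load_dataset(
    "mozilla-foundation/common_voice_7_0", "sv-SE", split="train", use_auth_token=True
)
common_voice = common_voice.map(remove_special_characters, remove_columns=["sentence"])
```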
**However**, there are some rules regarding which characters are allowed to be removed/replaced and which are not.
These rules are not entirely straightforward and therefore often have to be evaluated case-by-case.

It is allowed (and recommended) to normalize the data to only have lower-case characters. It is also allowed (and recommended) to remove typographical
symbols and punctuation marks. A list of such symbols can *e.g.* be found [here](https://en.wikipedia.org/wiki/List_of_typographical_symbols_and_punctuation_marks) - however, here we must already be careful. We should **not** remove a symbol that would change the meaning of the words, *e.g.* in English,
we should not remove the single quotation mark `'` since it would change the meaning of the word `"it's"` to `"its"`, which would then be incorrect.
So the golden rule here is to not remove any characters that could change the meaning of a word into another word. This is not always obvious and should
be given some consideration. As another example, it is fine to remove the "Hyphen-minus" sign "`-`" since it doesn't change the
meaning of a word to another one. *E.g.* "`fine-tuning`" would be changed to "`finetuning`", which still has the same meaning.

Since those choices are not always obvious, when in doubt feel free to ask on Discord or, even better, post your question on the forum, as was
done, *e.g.*, [here](https://discuss.huggingface.co/t/spanish-asr-fine-tuning-wav2vec2/4586).
## How to install relevant libraries

The following libraries are required to fine-tune a speech model with 🤗 Transformers and 🤗 Datasets in PyTorch.
The following command should return ``True``:

```bash
python -c "import torch; print(torch.cuda.is_available())"
```

If the above command doesn't print ``True``, as a first step, please follow the
instructions [here](https://pytorch.org/) to install PyTorch with CUDA.

We strongly recommend making use of the provided PyTorch examples scripts in [transformers/examples/pytorch/speech-recognition](https://github.com/huggingface/transformers/tree/master/examples/pytorch/speech-recognition) to train your speech recognition
If you have already cloned that repo, you might need to `git pull` to get the most recent changes in the `transformers`
library.

Running this command will automatically install `torch` and the most relevant
libraries required for fine-tuning a speech recognition system.

Next, you should also install the 🤗 Datasets library. We strongly recommend installing the
logits = model(input_values).logits

assert logits.shape[-1] == 32
```

## How to finetune an acoustic model

In this section, we show you how to fine-tune a pre-trained [XLS-R Model](https://huggingface.co/docs/transformers/model_doc/xls_r) on the [Common Voice 7 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0).
In this section, we will explain how to fine-tune the model on a local machine.

1. **Log in**

To begin with, you should check that you are correctly logged in and that you have `git-lfs` installed so that your fine-tuned model can automatically be uploaded.

Run:
```bash
huggingface-cli login
```

to login. It is recommended to login with your access token that can be found under your hugging face profile (icon in the top right corner on [hf.co](http://hf.co/), then Settings -> Access Tokens -> User Access Tokens -> New Token (if you haven't generated one already)).

You can then copy-paste this token to log in locally.
not surprising given that we trained for just 10 steps on a randomly initialized
model.

For real model training, it is recommended to use one of the actual pre-trained XLS-R models:

- [300M parameters version](https://huggingface.co/facebook/wav2vec2-xls-r-300m)
- [1B parameters version](https://huggingface.co/facebook/wav2vec2-xls-r-1b)
Following the above steps, we first create the model:

```bash
huggingface-cli repo create xls-r-300m-sv
```

and then clone it locally (assuming the `<username>` is `hf-test`):

```bash
git clone https://huggingface.co/hf-test/xls-r-300m-sv
```

and define the following hyperparameters for training:

```bash
echo '''python run_speech_recognition_ctc.py \
--dataset_name="mozilla-foundation/common_voice_7_0" \
--model_name_or_path="facebook/wav2vec2-xls-r-300m" \
The training takes *ca.* 7 hours and yields a reasonable test word
error rate of 27% as can be seen on the automatically generated [model card](https://huggingface.co/hf-test/xls-r-300m-sv).

The above-chosen hyperparameters probably work quite well on a range of different
datasets and languages but are by no means optimal. It is up to you to find a good set of
hyperparameters.
## How to finetune with OVH cloud
For a more detailed guide on setting up OVHcloud, please watch this video: TODO
### Creating an OVHCloud account
*TIP*: If you haven't created a project on OVHcloud yet, make sure you've received your GPU voucher code *beforehand*,
so that you can skip entering the credit card information.
1. If you're a US citizen, create an account via [OVHcloud.CA](https://ovhcloud.ca/).
If you're from anywhere else in the world, create an account via [OVHcloud.COM](https://ovhcloud.com/).
2. Once logged in, click `Public Cloud` from the top menu and then click `Create your first OVH Public Cloud project`.
Then enter a project name (e.g. "huggingface"), enter your voucher code, and click `Continue` -> `Create my project`.
*Note: if you see a request for credit card details during the last step, and you can't skip it, then your voucher code
is invalid. Please report it to the [#ovh-support](https://discord.gg/p4qqDV3M) channel on Discord.*
### Setting up an AI notebook
1. Go to the `Public Cloud` page and select `Project Management` -> `Users & Roles` from the menu on the left.
2. Click `+ Add user`. Write a user description (e.g. `AI Trainer`), and select an `AI Training Operator` user role.
Click `Confirm`.
3. Write down the *username* and *password* (at the top of the screen) somewhere. They will be needed during step 7.
4. Select `AI & Machine Learning` -> `AI Training` from the menu on the left.
Click `+ Launch a new job` on the AI Training page.
5. On the `Launch a new job` page:
* In `1. Choose a region` select a region closest to you.
* In `2. Enter the Docker image` select `Custom image` -> `baaastijn/ovh_huggingface`.
* You can skip steps `3.` and `4.` if you will be using the Hugging Face Hub to store the models after training.
* In `5. Configure your job` select **1** `GPU`.
* Validate the info and Create the job.
6. On the `AI Training Jobs` screen wait until the job's status changes from `Pending` to `Running`.
7. Click `HTTP Access` and log in with the AI training user you've created earlier.
Once logged in, you can close the page and click `HTTP Access` to launch a JupyterLab notebook.
8. Awesome, now you have a free GPU-enabled Jupyter instance!
**Note**: If you're an experienced Docker user, feel free to create a custom docker image with all of the needed packages
like the one in step 5. The Dockerfile for it is available here:
[baaastijn/Dockerimages](https://github.com/baaastijn/Dockerimages/tree/main/Hugginface_challenge_speech).
Once you've built your image, push it to https://hub.docker.com/ and select it during the OVHcloud job creation.
## How to combine n-gram with acoustic model
Having trained a speech recognition model with CTC as shown in the section above,
one can further improve the model's performance by adding an **n-gram language model**
to the decoding process of the model. By doing so, we are replacing the naive greedy decoding
with **n-gram-boosted** beam search decoding.
N-gram language models can be built on a CPU in just a few minutes. *N-gram-boosted* beam search decoding noticeably slows down the
inference time, but also yields significant word error rate improvements - usually between 10% and 40%.
You can find an in-detail blog post on how to build an *n-gram* [here](https://huggingface.co/blog/wav2vec2-with-ngram).
The blog post can be opened in a Google Colab and, by adapting three lines of the example for your use case, one can directly
create an *n-gram* in the Google Colab.

The blog post gives in-detail instructions on:
- why one should add an *n-gram* to her/his speech recognition system,
- how to build an *n-gram*, and,
- how to add the built *n-gram* to the speech recognition system for seamless decoding.
Our previously trained model - [xls-r-300m-sv](https://huggingface.co/hf-test/xls-r-300m-sv) - enjoys a 30% word error rate reduction after
having added an n-gram. As shown in the example of the blog post, we strongly advise participants to upload all files required for combining
the *n-gram* with a trained speech recognition model directly into the same model repository.
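For orientation, here is a rough sketch of how the pieces fit together with `pyctcdecode` and `Wav2Vec2ProcessorWithLM`; the model id and the n-gram file name are illustrative, and the blog post linked above remains the authoritative, step-by-step guide:

```python
# Minimal sketch: wrap a fine-tuned CTC model's tokenizer and a KenLM n-gram into a
# Wav2Vec2ProcessorWithLM so that decoding uses n-gram-boosted beam search.
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM

processor = AutoProcessor.from_pretrained("hf-test/xls-r-300m-sv")

# sort the vocabulary by token id so the decoder labels line up with the CTC logits
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab = sorted(vocab_dict.items(), key=lambda item: item[1])
labels = [token for token, _ in sorted_vocab]

# "5gram.arpa" stands in for the n-gram built as described in the blog post
decoder = build_ctcdecoder(labels=labels, kenlm_model_path="5gram.arpa")

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)

# saving everything into the local clone of the model repository lets the
# evaluation pipeline pick up the language model automatically
processor_with_lm.save_pretrained("xls-r-300m-sv")
```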
## Evaluation
Finally, we have arrived at the most fun part of the challenge - sitting back and
watching the model transcribe audio. If possible, every participant should evaluate
the speech recognition system on the test set of Common Voice 7 and
ideally also on the real-world audio data (if available).
For languages that have neither a Common Voice evaluation dataset nor a real-world
evaluation dataset, please contact the organizers on Discord so that we can work
together to find some evaluation data.
As a first step, one should copy the official `eval.py` script to her/his model
repository. Let's use our previously trained [xls-r-300m-sv](https://huggingface.co/hf-test/xls-r-300m-sv) again as an example.
Assuming that we have a clone of the model's repo under `~/xls-r-300m-sv`, we can
copy the `eval.py` script to the repo.
```bash
cp ~/transformers/examples/research_projects/robust-speech-event/eval.py ~/xls-r-300m-sv
```
Next, we should adapt `eval.py` so that it fits our evaluation data. Here it is
important to keep the `eval.py` file in the following format:
- 1. The following input arguments should not be changed and should keep their original functionality/meaning (namely, to load the model and dataset): `"--model_id"`, `"--dataset"`, `"--config"`, `"--split"`. We recommend not changing any of the code written under `if __name__ == "__main__":`.
- 2. The function `def log_results(result: Dataset, args: Dict[str, str])` should also not be changed. The function expects the above names attached to the `args` object as well as a `datasets.Dataset` object, called `result`, which includes all predictions and target transcriptions under the names `"prediction"` and `"target"` respectively.
- 3. All other code can be changed and adapted. Participants are especially invited to change the `def normalize_text(text: str) -> str:` function as this might be a very language- and model-training-specific function (see the sketch after this list).
- 4. **Important**: It is not allowed to "cheat" in any way when it comes to pre- and post-processing. In short, "cheating" refers to any of the following:
- a. Somehow giving the model access to the target transcriptions to improve performance. The model is not allowed to use the target transcriptions to generate its predictions.
- b. Pre-processing the target transcriptions in a way that makes the target transcriptions lose their original meaning. This corresponds to what has already been said in [Data and Preprocessing](#data-and-preprocessing) and is somewhat of a grey zone. It means that one should not remove characters that would make a word lose its meaning. E.g., it is not allowed to replace all `e` in English with `i` and simply make the model learn that `e` and `i` are the same letter for a better word error rate. This would destroy the meaning of words such as `fell -> fill`. However, it is totally fine to normalize (*e.g.* lowercase) all letters and remove punctuation. There can be a lot of language-specific exceptions and in case you are not sure whether your target transcription pre-processing is allowed, please ask on the Discord channel.
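As a rough illustration of point 3 above (not a drop-in solution for every language), a custom `normalize_text` could look like the sketch below; adapt the character set per language and keep meaning-bearing characters such as the apostrophe in English:

```python
# Sketch of a language-specific normalize_text for eval.py: lowercase, strip
# punctuation that carries no meaning, and collapse whitespace. The character set
# is illustrative and deliberately keeps apostrophes so "it's" stays "it's".
import re

chars_to_ignore_regex = '[,?.!\\-;:"“%”�—…–]'

def normalize_text(text: str) -> str:
    text = re.sub(chars_to_ignore_regex, "", text.lower())
    # collapse newlines and repeated spaces into single spaces
    text = " ".join(text.split())
    return text
```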
Uff, that was a lot of text describing how to make sure your `eval.py` script
is in the correct format. If you have any questions, please ask openly in Discord.
Great, now that we have adapted the `eval.py` script, we can lean back and run the
evaluation.
First, one should evaluate the model on Common Voice 7's test data. This might
already have been done for your acoustic model during training but in case you
added an *n-gram* language model after having fine-tuned the acoustic model, you
should now see a nice improvement.
The command to evaluate our test model [xls-r-300m-sv](https://huggingface.co/hf-test/xls-r-300m-sv) on Common Voice 7's test data is the following:
```bash
cd xls-r-300m-sv
./eval.py --model_id ./ --dataset mozilla-foundation/common_voice_7_0 --config sv-SE --split test --log_outputs
```
To log each of the model's predictions with the target transcriptions, you can just
add the `--log_outputs` flag.
Running this command should automatically create the file:
`mozilla-foundation_common_voice_7_0_sv-SE_test_eval_results.txt` that contains
both the word- and character error rate.
In a few days, we will give everybody access to some real-world audio data for as many languages as possible.
If your language has real-world audio data, it will most likely have audio input
of multiple minutes. 🤗 Transformers' [ASR pipeline](https://huggingface.co/docs/transformers/master/en/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) supports audio chunking out-of-the-box. You only need to specify
how long each audio chunk should be (`chunk_length_s`) and how much audio stride
(`stride_length_s`) each chunk should use.
For more information on how the chunking works, please have a look at [this nice blog post](TODO: ).
In the case of `xls-r-300m-sv`, the following command can be run:
```bash
cd xls-r-300m-sv
./eval.py --model_id hf-test/xls-r-300m-sv --dataset <to-be-announced> --config sv --split validation --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
```
Great, now you should have successfully evaluated your model. Finally, there is one
**important** thing you should do so that your model is taken into account
for the final evaluation. You should add two tags to your model, one being `robust-speech-event`, one being the ISO code of your chosen language, *e.g.* `"sv"` for the
exemplary model we used above. You can find a list of all available languages and
their ISO code [here](https://huggingface.co/languages).
To add the tags, simply edit the README.md of your model repository and add
```
- "sv"
- "robust-speech-event"
```
under `tags:` as done [here](https://huggingface.co/hf-test/xls-r-300m-sv/commit/a495fd70c96bb7d019729be9273a265c2557345e).
To verify that you've added the tags correctly make sure that your model
appears when clicking on [this link](https://huggingface.co/models?other=robust-speech-event).
Great, that's it! This should give you all the necessary information to evaluate
your model. For the final evaluation, we will verify each evaluation result to
determine the final score and thereby the winning models for each language.
The final score is calculated as follows:
```bash
FINAL_SCORE = 1/3 * WER_Common_Voice_7_test + 1/3 * WER_REAL_AUDIO_DEV + 1/3 * WER_REAL_AUDIO_TEST
```
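For example, a hypothetical model with a WER of 0.20 on the Common Voice 7 test set, 0.25 on the real-world dev data, and 0.30 on the real-world test data would receive a final score of (0.20 + 0.25 + 0.30) / 3 ≈ 0.25 (lower is better).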
The dataset `WER_REAL_AUDIO_TEST` is hidden and will only be published
at the end of the robust speech challenge.
If there is no real audio data for your language, the final score will be
computed solely based on the Common Voice 7 test dataset. If there is also
no Common Voice 7 test dataset for your language, we will see together how to
score your model - if this is the case, please don't be discouraged. We are
especially excited about speech recognition systems of such low-resource
languages and will make sure that we'll decide on a good approach to evaluating
your model.
## Prizes
TODO(Patrick, Omar, ...)
## Communication and Problems
If you encounter any problems or have any questions, you should use one of the following platforms
depending on your type of problem. Hugging Face is an "open-source-first" organization meaning
that we'll try to solve all problems in the most public and most transparent way possible so that everybody
in the community profits.
The following table summarizes what platform to use for which problem.
- Problem/question/bug with the 🤗 Datasets library that you think is a general problem that also impacts other people, please open an [issue on Datasets](https://github.com/huggingface/datasets/issues/new?assignees=&labels=bug&template=bug-report.md&title=) and ping @anton-l and @patrickvonplaten.
- Problem/question/bug with the 🤗 Transformers library that you think is a general problem that also impacts other people, please open an [issue on Transformers](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title=) and ping @anton-l and @patrickvonplaten.
- Problem/question with a modified, customized training script that is less likely to impact other people, please post your problem/question [on the forum](https://discuss.huggingface.co/) and ping @anton-l and @patrickvonplaten.
- Questions regarding access to the OVHcloud GPU, please ask in the Discord channel **#ovh-support**.
- Other questions regarding the event, rules of the event, or if you are not sure where to post your question, please ask in the Discord channel **#sprint-discussions**.
## Talks
We are very excited to be hosting 2 days of talks from Kensho Technologies, Mozilla's Common Voice, Meta AI Research, and Hugging Face.
### Thursday, January 20th
- [Watch the talks on YouTube](TODO)
- [Chat history](TODO)
Speaker | Topic | Time | Video |
|-------------|---------------------------------|------------------------|------------------------|
| Patrick von Platen, Hugging Face | TODO | ??? UTC | [![Youtube](https://www.youtube.com/s/desktop/f506bd45/img/favicon_32.png)](TODO)
| Raymond Grossman and Jeremy Lopez, Kensho-Technologies | Pyctcdecode & Speech2text decoding | 5h30pm - 6h00pm UTC | [![Youtube](https://www.youtube.com/s/desktop/f506bd45/img/favicon_32.png)](TODO)
### Friday, January 21st
- [Watch the talks on YouTube](TODO)
- [Chat history](TODO)
Speaker | Topic | Time | Video |
|-------------|---------------------------------|------------------------|------------------------|
| Gabriel Habayeb, Mozilla Common Voice | TODO | 4h30pm - 5h00pm UTC | [![Youtube](https://www.youtube.com/s/desktop/f506bd45/img/favicon_32.png)](TODO)
| Changhan Wang, Meta AI Research | XLS-R: Large-Scale Cross-lingual Speech Representation Learning on 128 Languages | 5h30pm - 6h00pm UTC | [![Youtube](https://www.youtube.com/s/desktop/f506bd45/img/favicon_32.png)](TODO)
### Talks & Speakers
#### Patrick von Platen, Research Engineer, Hugging Face
- Talk: Introduction to Robust Speech Challenge
- Abstract: In this talk, Patrick outlines the Robust Speech Challenge and gives tips and tricks on how to train and evaluate speech recognition systems with 🤗 Transformers and 🤗 Datasets, and PyTorch.
- Speaker info: Patrick von Platen is a research engineer at Hugging Face and one of the core maintainers of the popular Transformers library. He specializes in speech recognition, encoder-decoder models, and long-range sequence modeling. Before joining Hugging Face, Patrick researched speech recognition at Uber AI, Cambridge University, and RWTH Aachen University.
#### Raymond Grossman, Jeremy Lopez, Machine Learning Engineer, Kensho Technologies
- Talk: PyCTCDecode & Speech2text decoding
- Abstract: PyCTCDecode is a fast and feature-rich CTC beam search decoder for speech recognition written in Python, providing n-gram (kenlm) language model support similar to PaddlePaddle's decoder, but incorporating many new features such as byte pair encoding and real-time decoding to support models like Nvidia's Conformer-CTC or Facebook's Wav2Vec2.
- Speaker info :
- Raymond works as a machine learning engineer at Kensho Technologies, specializing in speech and natural language domains. Before coming to Kensho, he studied mathematics at Princeton and was an avid Kaggler under the moniker @ToTrainThemIsMyCause.
- Jeremy is a machine learning engineer at Kensho Technologies and has worked on a variety of different topics including search and speech recognition. Before working at Kensho, he earned a PhD in experimental particle physics at MIT and continued doing physics research as a postdoc at the University of Colorado Boulder.
#### Gabriel Habayeb, Data Engineer, Common Voice @ Mozilla
- Talk: Common Voice
- Abstract:
- Speaker info:
#### Changhan Wang, Main author of XLS-R and Research Engineer, Meta AI Research
- Talk: XLS-R: Large-Scale Cross-lingual Speech Representation Learning on 128 Languages
- Abstract: In this talk, Changhan will present XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. XLS-R has up to 2B parameters and was trained on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. On the CoVoST-2 speech translation benchmark, XLS-R improves the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. The XLS-R team hopes to work together with the open-source community to improve speech processing tasks for many more languages of the world.
## General Tips and Tricks
- Memory efficient training:
In case you are getting out-of-memory errors on your GPU, we recommend using
[bitsandbytes](https://github.com/facebookresearch/bitsandbytes) to replace the
native memory-intensive Adam optimizer with the 8-bit one provided by `bitsandbytes`. You
can simply run the script `./run_speech_recognition_ctc_bnb.py` provided in this
folder, which makes use of `bitsandbytes` instead of the official optimizer. A minimal sketch of the optimizer swap is shown below, after this list.
- Dataset streaming
TODO(Patrick)
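As a rough sketch of what the `bitsandbytes` optimizer swap could look like (the parameter grouping is simplified and the helper name is illustrative; this is not the exact code of `run_speech_recognition_ctc_bnb.py`):

```python
# Minimal sketch: build an 8-bit Adam optimizer with bitsandbytes and hand it to the
# Trainer via `optimizers=(optimizer, None)` so the default AdamW is not created.
import bitsandbytes as bnb
from transformers import TrainingArguments

def create_8bit_adam(model, training_args: TrainingArguments):
    # apply weight decay to everything except bias parameters (simplified grouping)
    decay_params = [p for n, p in model.named_parameters() if "bias" not in n]
    no_decay_params = [p for n, p in model.named_parameters() if "bias" in n]
    optimizer_grouped_parameters = [
        {"params": decay_params, "weight_decay": training_args.weight_decay},
        {"params": no_decay_params, "weight_decay": 0.0},
    ]
    # Adam8bit stores the optimizer state in 8-bit, cutting a large chunk of GPU memory
    return bnb.optim.Adam8bit(
        params=optimizer_grouped_parameters,
        lr=training_args.learning_rate,
        betas=(training_args.adam_beta1, training_args.adam_beta2),
        eps=training_args.adam_epsilon,
    )
```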
#!/usr/bin/env python3
import argparse
import re
from typing import Dict
from datasets import Audio, Dataset, load_dataset, load_metric
from transformers import AutoFeatureExtractor, pipeline
def log_results(result: Dataset, args: Dict[str, str]):
"""DO NOT CHANGE. This function computes and logs the result metrics."""
log_outputs = args.log_outputs
dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])
# load metric
wer = load_metric("wer")
cer = load_metric("cer")
# compute metrics
wer_result = wer.compute(references=result["target"], predictions=result["prediction"])
cer_result = cer.compute(references=result["target"], predictions=result["prediction"])
# print & log results
result_str = f"WER: {wer_result}\n" f"CER: {cer_result}"
print(result_str)
with open(f"{dataset_id}_eval_results.txt", "w") as f:
f.write(result_str)
# log all results in text file. Possibly interesting for analysis
if log_outputs is not None:
pred_file = f"log_{dataset_id}_predictions.txt"
target_file = f"log_{dataset_id}_targets.txt"
with open(pred_file, "w") as p, open(target_file, "w") as t:
# mapping function to write output
def write_to_file(batch, i):
p.write(f"{i}" + "\n")
p.write(batch["prediction"] + "\n")
t.write(f"{i}" + "\n")
t.write(batch["target"] + "\n")
result.map(write_to_file, with_indices=True)
def normalize_text(text: str) -> str:
"""DO ADAPT FOR YOUR USE CASE. this function normalizes the target text."""
chars_to_ignore_regex = '[,?.!\-\;\:"“%‘”�—’…–]' # noqa: W605 IMPORTANT: this should correspond to the chars that were ignored during training
text = re.sub(chars_to_ignore_regex, "", text.lower())
# In addition, we can normalize the target text, e.g. removing new lines characters etc...
# note that order is important here!
token_sequences_to_ignore = ["\n\n", "\n", "   ", "  "]
for t in token_sequences_to_ignore:
text = " ".join(text.split(t))
return text
def main(args):
# load dataset
dataset = load_dataset(args.dataset, args.config, split=args.split, use_auth_token=True)
# for testing: only process the first few examples as a test
# dataset = dataset.select(range(10))
# load processor
feature_extractor = AutoFeatureExtractor.from_pretrained(args.model_id)
sampling_rate = feature_extractor.sampling_rate
# resample audio
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
# load eval pipeline
asr = pipeline("automatic-speech-recognition", model=args.model_id)
# map function to decode audio
def map_to_pred(batch):
prediction = asr(
batch["audio"]["array"], chunk_length_s=args.chunk_length_s, stride_length_s=args.stride_length_s
)
batch["prediction"] = prediction["text"]
batch["target"] = normalize_text(batch["sentence"])
return batch
# run inference on all examples
result = dataset.map(map_to_pred, remove_columns=dataset.column_names)
# compute and log_results
# do not change function below
log_results(result, args)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--model_id", type=str, required=True, help="Model identifier. Should be loadable with 🤗 Transformers"
)
parser.add_argument(
"--dataset",
type=str,
required=True,
help="Dataset name to evaluate the `model_id`. Should be loadable with 🤗 Datasets",
)
parser.add_argument(
"--config", type=str, required=True, help="Config of the dataset. *E.g.* `'en'` for Common Voice"
)
parser.add_argument("--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`")
parser.add_argument(
"--chunk_length_s", type=float, default=None, help="Chunk length in seconds. Defaults to 5 seconds."
)
parser.add_argument(
"--stride_length_s", type=float, default=None, help="Stride of the audio chunks. Defaults to 1 second."
)
parser.add_argument(
"--log_outputs", action="store_true", help="If defined, write outputs to log file for analysis."
)
args = parser.parse_args()
main(args)
#!/usr/bin/env python
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Fine-tuning a 🤗 Transformers CTC model for automatic speech recognition"""
import functools
import json
import logging
import os
import re
import sys
import warnings
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union
import datasets
import numpy as np
import torch
from datasets import DatasetDict, load_dataset, load_metric
import bitsandbytes as bnb
import transformers
from transformers import (
AutoConfig,
AutoFeatureExtractor,
AutoModelForCTC,
AutoProcessor,
AutoTokenizer,
HfArgumentParser,
Trainer,
TrainingArguments,
Wav2Vec2Processor,
set_seed,
)
from transformers.trainer_pt_utils import get_parameter_names
from transformers.trainer_utils import get_last_checkpoint, is_main_process
from transformers.utils import check_min_version
from transformers.utils.versions import require_version
# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
check_min_version("4.16.0.dev0")
require_version("datasets>=1.13.3", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt")
logger = logging.getLogger(__name__)
def list_field(default=None, metadata=None):
return field(default_factory=lambda: default, metadata=metadata)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: str = field(
metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
)
tokenizer_name_or_path: Optional[str] = field(
default=None,
metadata={"help": "Path to pretrained tokenizer or tokenizer identifier from huggingface.co/models"},
)
cache_dir: Optional[str] = field(
default=None,
metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
freeze_feature_encoder: bool = field(
default=True, metadata={"help": "Whether to freeze the feature encoder layers of the model."}
)
attention_dropout: float = field(
default=0.0, metadata={"help": "The dropout ratio for the attention probabilities."}
)
activation_dropout: float = field(
default=0.0, metadata={"help": "The dropout ratio for activations inside the fully connected layer."}
)
feat_proj_dropout: float = field(default=0.0, metadata={"help": "The dropout ratio for the projected features."})
hidden_dropout: float = field(
default=0.0,
metadata={
"help": "The dropout probability for all fully connected layers in the embeddings, encoder, and pooler."
},
)
final_dropout: float = field(
default=0.0,
metadata={"help": "The dropout probability for the final projection layer."},
)
mask_time_prob: float = field(
default=0.05,
metadata={
"help": "Probability of each feature vector along the time axis to be chosen as the start of the vector"
"span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature"
"vectors will be masked along the time axis."
},
)
mask_time_length: int = field(
default=10,
metadata={"help": "Length of vector span to mask along the time axis."},
)
mask_feature_prob: float = field(
default=0.0,
metadata={
"help": "Probability of each feature vector along the feature axis to be chosen as the start of the vector"
"span to be masked. Approximately ``mask_feature_prob * sequence_length // mask_feature_length`` feature bins will be masked along the time axis."
},
)
mask_feature_length: int = field(
default=10,
metadata={"help": "Length of vector span to mask along the feature axis."},
)
layerdrop: float = field(default=0.0, metadata={"help": "The LayerDrop probability."})
ctc_loss_reduction: Optional[str] = field(
default="mean", metadata={"help": "The way the ctc loss should be reduced. Should be one of 'mean' or 'sum'."}
)
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
Using `HfArgumentParser` we can turn this class
into argparse arguments to be able to specify them on
the command line.
"""
dataset_name: str = field(
metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
)
dataset_config_name: str = field(
default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
)
train_split_name: str = field(
default="train+validation",
metadata={
"help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'"
},
)
eval_split_name: str = field(
default="test",
metadata={
"help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'"
},
)
audio_column_name: str = field(
default="audio",
metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"},
)
text_column_name: str = field(
default="text",
metadata={"help": "The name of the dataset column containing the text data. Defaults to 'text'"},
)
overwrite_cache: bool = field(
default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."}
)
preprocessing_num_workers: Optional[int] = field(
default=None,
metadata={"help": "The number of processes to use for the preprocessing."},
)
max_train_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of training examples to this "
"value if set."
},
)
max_eval_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of validation examples to this "
"value if set."
},
)
chars_to_ignore: Optional[List[str]] = list_field(
default=None,
metadata={"help": "A list of characters to remove from the transcripts."},
)
eval_metrics: List[str] = list_field(
default=["wer"],
metadata={"help": "A list of metrics the model should be evaluated on. E.g. `'wer cer'`"},
)
max_duration_in_seconds: float = field(
default=20.0,
metadata={
"help": "Filter audio files that are longer than `max_duration_in_seconds` seconds to 'max_duration_in_seconds`"
},
)
min_duration_in_seconds: float = field(
default=0.0, metadata={"help": "Filter audio files that are shorter than `min_duration_in_seconds` seconds"}
)
preprocessing_only: bool = field(
default=False,
metadata={
"help": "Whether to only do data preprocessing and skip training. "
"This is especially useful when data preprocessing errors out in distributed training due to timeout. "
"In this case, one should run the preprocessing in a non-distributed setup with `preprocessing_only=True` "
"so that the cached datasets can consequently be loaded in distributed training"
},
)
use_auth_token: bool = field(
default=False,
metadata={
"help": "If :obj:`True`, will use the token generated when running"
":obj:`transformers-cli login` as HTTP bearer authorization for remote files."
},
)
unk_token: str = field(
default="[UNK]",
metadata={"help": "The unk token for the tokenizer"},
)
pad_token: str = field(
default="[PAD]",
metadata={"help": "The padding token for the tokenizer"},
)
word_delimiter_token: str = field(
default="|",
metadata={"help": "The word delimiter token for the tokenizer"},
)
phoneme_language: Optional[str] = field(
default=None,
metadata={
"help": "The target language that should be used be"
" passed to the tokenizer for tokenization. Note that"
" this is only relevant if the model classifies the"
" input audio to a sequence of phoneme sequences."
},
)
@dataclass
class DataCollatorCTCWithPadding:
"""
Data collator that will dynamically pad the inputs received.
Args:
processor (:class:`~transformers.AutoProcessor`)
The processor used for processing the data.
padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
among:
* :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
* :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
maximum acceptable input length for the model if that argument is not provided.
* :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
different lengths).
max_length (:obj:`int`, `optional`):
Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
max_length_labels (:obj:`int`, `optional`):
Maximum length of the ``labels`` returned list and optionally padding length (see above).
pad_to_multiple_of (:obj:`int`, `optional`):
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
7.5 (Volta).
"""
processor: AutoProcessor
padding: Union[bool, str] = "longest"
pad_to_multiple_of: Optional[int] = None
pad_to_multiple_of_labels: Optional[int] = None
def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
# split inputs and labels since they have to be of different lengths and need
# different padding methods
input_features = [{"input_values": feature["input_values"]} for feature in features]
label_features = [{"input_ids": feature["labels"]} for feature in features]
batch = self.processor.pad(
input_features,
padding=self.padding,
pad_to_multiple_of=self.pad_to_multiple_of,
return_tensors="pt",
)
with self.processor.as_target_processor():
labels_batch = self.processor.pad(
label_features,
padding=self.padding,
pad_to_multiple_of=self.pad_to_multiple_of_labels,
return_tensors="pt",
)
# replace padding with -100 to ignore loss correctly
labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
batch["labels"] = labels
return batch
def create_vocabulary_from_data(
datasets: DatasetDict,
word_delimiter_token: Optional[str] = None,
unk_token: Optional[str] = None,
pad_token: Optional[str] = None,
):
# Given training and test labels create vocabulary
def extract_all_chars(batch):
all_text = " ".join(batch["target_text"])
vocab = list(set(all_text))
return {"vocab": [vocab], "all_text": [all_text]}
vocabs = datasets.map(
extract_all_chars,
batched=True,
batch_size=-1,
keep_in_memory=True,
remove_columns=datasets["train"].column_names,
)
# take union of all unique characters in each dataset
vocab_set = functools.reduce(
lambda vocab_1, vocab_2: set(vocab_1["vocab"][0]) | set(vocab_2["vocab"][0]), vocabs.values()
)
vocab_dict = {v: k for k, v in enumerate(sorted(list(vocab_set)))}
# replace white space with delimiter token
if word_delimiter_token is not None:
vocab_dict[word_delimiter_token] = vocab_dict[" "]
del vocab_dict[" "]
# add unk and pad token
if unk_token is not None:
vocab_dict[unk_token] = len(vocab_dict)
if pad_token is not None:
vocab_dict[pad_token] = len(vocab_dict)
return vocab_dict
def main():
# See all possible arguments in src/transformers/training_args.py
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
# If we pass only one argument to the script and it's the path to a json file,
# let's parse it to get our arguments.
model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
handlers=[logging.StreamHandler(sys.stdout)],
)
logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed before initializing model.
set_seed(training_args.seed)
# 1. First, let's load the dataset
raw_datasets = DatasetDict()
if training_args.do_train:
raw_datasets["train"] = load_dataset(
data_args.dataset_name,
data_args.dataset_config_name,
split=data_args.train_split_name,
use_auth_token=data_args.use_auth_token,
)
if data_args.audio_column_name not in raw_datasets["train"].column_names:
raise ValueError(
f"--audio_column_name '{data_args.audio_column_name}' not found in dataset '{data_args.dataset_name}'. "
"Make sure to set `--audio_column_name` to the correct audio column - one of "
f"{', '.join(raw_datasets['train'].column_names)}."
)
if data_args.text_column_name not in raw_datasets["train"].column_names:
raise ValueError(
f"--text_column_name {data_args.text_column_name} not found in dataset '{data_args.dataset_name}'. "
"Make sure to set `--text_column_name` to the correct text column - one of "
f"{', '.join(raw_datasets['train'].column_names)}."
)
if data_args.max_train_samples is not None:
raw_datasets["train"] = raw_datasets["train"].select(range(data_args.max_train_samples))
if training_args.do_eval:
raw_datasets["eval"] = load_dataset(
data_args.dataset_name,
data_args.dataset_config_name,
split=data_args.eval_split_name,
use_auth_token=data_args.use_auth_token,
)
if data_args.max_eval_samples is not None:
raw_datasets["eval"] = raw_datasets["eval"].select(range(data_args.max_eval_samples))
    # 2. We remove some special characters from the datasets
    # that make training complicated and do not help in transcribing the speech,
    # e.g. characters such as `,` and `.`, which do not have an acoustic
    # characteristic that the model could easily pick up
chars_to_ignore_regex = (
f'[{"".join(data_args.chars_to_ignore)}]' if data_args.chars_to_ignore is not None else None
)
text_column_name = data_args.text_column_name

    def remove_special_characters(batch):
if chars_to_ignore_regex is not None:
batch["target_text"] = re.sub(chars_to_ignore_regex, "", batch[text_column_name]).lower() + " "
else:
batch["target_text"] = batch[text_column_name].lower() + " "
return batch
with training_args.main_process_first(desc="dataset map special characters removal"):
raw_datasets = raw_datasets.map(
remove_special_characters,
remove_columns=[text_column_name],
desc="remove special characters from datasets",
)
# save special tokens for tokenizer
word_delimiter_token = data_args.word_delimiter_token
unk_token = data_args.unk_token
pad_token = data_args.pad_token
# 3. Next, let's load the config as we might need it to create
# the tokenizer
# load config
config = AutoConfig.from_pretrained(
model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token
)
# 4. Next, if no tokenizer file is defined,
# we create the vocabulary of the model by extracting all unique characters from
# the training and evaluation datasets
    # We need to make sure that only the first rank saves the vocabulary
    # and that all other processes wait until the vocab has been created
tokenizer_name_or_path = model_args.tokenizer_name_or_path
tokenizer_kwargs = {}
if tokenizer_name_or_path is None:
# save vocab in training output dir
tokenizer_name_or_path = training_args.output_dir
vocab_file = os.path.join(tokenizer_name_or_path, "vocab.json")
with training_args.main_process_first():
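        # when overwriting the output dir, drop any stale vocab file so that it is
        # rebuilt from the current data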
if training_args.overwrite_output_dir and os.path.isfile(vocab_file):
os.remove(vocab_file)
with training_args.main_process_first(desc="dataset map vocabulary creation"):
if not os.path.isfile(vocab_file):
os.makedirs(tokenizer_name_or_path, exist_ok=True)
vocab_dict = create_vocabulary_from_data(
raw_datasets,
word_delimiter_token=word_delimiter_token,
unk_token=unk_token,
pad_token=pad_token,
)
# save vocab dict to be loaded into tokenizer
with open(vocab_file, "w") as file:
json.dump(vocab_dict, file)
        # if the tokenizer has just been created, its class is defined by
        # `tokenizer_class` if present in the config, else by `model_type`
tokenizer_kwargs = {
"config": config if config.tokenizer_class is not None else None,
"tokenizer_type": config.model_type if config.tokenizer_class is None else None,
"unk_token": unk_token,
"pad_token": pad_token,
"word_delimiter_token": word_delimiter_token,
}
# 5. Now we can instantiate the feature extractor, tokenizer and model
# Note for distributed training, the .from_pretrained methods guarantee that only
# one local process can concurrently download model & vocab.
# load feature_extractor and tokenizer
tokenizer = AutoTokenizer.from_pretrained(
tokenizer_name_or_path,
use_auth_token=data_args.use_auth_token,
**tokenizer_kwargs,
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token
)
    # adapt config: override regularization/masking hyper-parameters and set the
    # vocabulary size to match the newly created tokenizer
config.update(
{
"feat_proj_dropout": model_args.feat_proj_dropout,
"attention_dropout": model_args.attention_dropout,
"hidden_dropout": model_args.hidden_dropout,
"final_dropout": model_args.final_dropout,
"mask_time_prob": model_args.mask_time_prob,
"mask_time_length": model_args.mask_time_length,
"mask_feature_prob": model_args.mask_feature_prob,
"mask_feature_length": model_args.mask_feature_length,
"gradient_checkpointing": training_args.gradient_checkpointing,
"layerdrop": model_args.layerdrop,
"ctc_loss_reduction": model_args.ctc_loss_reduction,
"pad_token_id": tokenizer.pad_token_id,
"vocab_size": len(tokenizer),
"activation_dropout": model_args.activation_dropout,
}
)
# create model
model = AutoModelForCTC.from_pretrained(
model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
config=config,
use_auth_token=data_args.use_auth_token,
)
    # freeze the convolutional feature encoder if requested
if model_args.freeze_feature_encoder:
model.freeze_feature_encoder()
# 6. Now we preprocess the datasets including loading the audio, resampling and normalization
# Thankfully, `datasets` takes care of automatically loading and resampling the audio,
# so that we just need to set the correct target sampling rate and normalize the input
# via the `feature_extractor`
# make sure that dataset decodes audio with correct sampling rate
dataset_sampling_rate = next(iter(raw_datasets.values())).features[data_args.audio_column_name].sampling_rate
if dataset_sampling_rate != feature_extractor.sampling_rate:
raw_datasets = raw_datasets.cast_column(
data_args.audio_column_name, datasets.features.Audio(sampling_rate=feature_extractor.sampling_rate)
)
# derive max & min input length for sample rate & max duration
max_input_length = data_args.max_duration_in_seconds * feature_extractor.sampling_rate
min_input_length = data_args.min_duration_in_seconds * feature_extractor.sampling_rate
audio_column_name = data_args.audio_column_name
num_workers = data_args.preprocessing_num_workers
# `phoneme_language` is only relevant if the model is fine-tuned on phoneme classification
phoneme_language = data_args.phoneme_language
# Preprocessing the datasets.
# We need to read the audio files as arrays and tokenize the targets.
def prepare_dataset(batch):
# load audio
sample = batch[audio_column_name]
inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
batch["input_values"] = inputs.input_values[0]
batch["input_length"] = len(batch["input_values"])
# encode targets
additional_kwargs = {}
if phoneme_language is not None:
additional_kwargs["phonemizer_lang"] = phoneme_language
batch["labels"] = tokenizer(batch["target_text"], **additional_kwargs).input_ids
return batch
with training_args.main_process_first(desc="dataset map preprocessing"):
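        # drop all original columns; only the model inputs (`input_values`,
        # `input_length`) and the encoded `labels` are kept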
vectorized_datasets = raw_datasets.map(
prepare_dataset,
remove_columns=next(iter(raw_datasets.values())).column_names,
num_proc=num_workers,
desc="preprocess datasets",
)

    def is_audio_in_length_range(length):
return length > min_input_length and length < max_input_length
    # filter out samples that are shorter than min_input_length or longer than max_input_length
vectorized_datasets = vectorized_datasets.filter(
is_audio_in_length_range,
num_proc=num_workers,
input_columns=["input_length"],
)
# 7. Next, we can prepare the training.
# Let's use word error rate (WER) as our evaluation metric,
# instantiate a data collator and the trainer
# Define evaluation metrics during training, *i.e.* word error rate, character error rate
eval_metrics = {metric: load_metric(metric) for metric in data_args.eval_metrics}
# for large datasets it is advised to run the preprocessing on a
    # single machine first with ``args.preprocessing_only`` since there will most likely
# be a timeout when running the script in distributed mode.
# In a second step ``args.preprocessing_only`` can then be set to `False` to load the
# cached dataset
if data_args.preprocessing_only:
logger.info(f"Data preprocessing finished. Files cached at {vectorized_datasets.cache_files}")
return

    def compute_metrics(pred):
pred_logits = pred.predictions
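        # greedy decoding: pick the most likely token at every frame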
pred_ids = np.argmax(pred_logits, axis=-1)
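        # -100 is the ignore index used for label padding; replace it with the pad
        # token id so that the references can be decoded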
pred.label_ids[pred.label_ids == -100] = tokenizer.pad_token_id
pred_str = tokenizer.batch_decode(pred_ids)
# we do not want to group tokens when computing the metrics
label_str = tokenizer.batch_decode(pred.label_ids, group_tokens=False)
metrics = {k: v.compute(predictions=pred_str, references=label_str) for k, v in eval_metrics.items()}
return metrics
# Now save everything to be able to create a single processor later
if is_main_process(training_args.local_rank):
# save feature extractor, tokenizer and config
feature_extractor.save_pretrained(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)
config.save_pretrained(training_args.output_dir)
try:
processor = AutoProcessor.from_pretrained(training_args.output_dir)
except (OSError, KeyError):
warnings.warn(
"Loading a processor from a feature extractor config that does not"
" include a `processor_class` attribute is deprecated and will be removed in v5. Please add the following "
" attribute to your `preprocessor_config.json` file to suppress this warning: "
" `'processor_class': 'Wav2Vec2Processor'`",
FutureWarning,
)
processor = Wav2Vec2Processor.from_pretrained(training_args.output_dir)
# Instantiate custom data collator
data_collator = DataCollatorCTCWithPadding(processor=processor)
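
    # group parameters so that weight decay is applied to all weights except
    # biases and LayerNorm parameters, then create a memory-efficient 8-bit Adam
    # optimizer from `bitsandbytes`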
decay_parameters = get_parameter_names(model, [torch.nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if "bias" not in name]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if n in decay_parameters],
"weight_decay": training_args.weight_decay,
},
{
"params": [p for n, p in model.named_parameters() if n not in decay_parameters],
"weight_decay": 0.0,
},
]
optimizer = bnb.optim.Adam8bit(
params=optimizer_grouped_parameters,
betas=(training_args.adam_beta1, training_args.adam_beta2),
eps=training_args.adam_epsilon,
)
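    # (optimizer, lr_scheduler) - passing `None` lets the Trainer create its default scheduler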
optimizers = (optimizer, None)
# Initialize Trainer
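    # Note: the feature extractor is passed as `tokenizer` so that it is saved
    # along with the model checkpoints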
trainer = Trainer(
model=model,
data_collator=data_collator,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=vectorized_datasets["train"] if training_args.do_train else None,
eval_dataset=vectorized_datasets["eval"] if training_args.do_eval else None,
tokenizer=feature_extractor,
optimizers=optimizers,
)
# 8. Finally, we can start training
# Training
if training_args.do_train:
        # resume from the last checkpoint if one exists; otherwise, if the model path
        # is a local directory, continue training from it
if last_checkpoint is not None:
checkpoint = last_checkpoint
elif os.path.isdir(model_args.model_name_or_path):
checkpoint = model_args.model_name_or_path
else:
checkpoint = None
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model()
metrics = train_result.metrics
max_train_samples = (
data_args.max_train_samples
if data_args.max_train_samples is not None
else len(vectorized_datasets["train"])
)
metrics["train_samples"] = min(max_train_samples, len(vectorized_datasets["train"]))
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
# Evaluation
results = {}
if training_args.do_eval:
logger.info("*** Evaluate ***")
metrics = trainer.evaluate()
max_eval_samples = (
data_args.max_eval_samples if data_args.max_eval_samples is not None else len(vectorized_datasets["eval"])
)
metrics["eval_samples"] = min(max_eval_samples, len(vectorized_datasets["eval"]))
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
# Write model card and (optionally) push to hub
config_name = data_args.dataset_config_name if data_args.dataset_config_name is not None else "na"
kwargs = {
"finetuned_from": model_args.model_name_or_path,
"tasks": "speech-recognition",
"tags": ["automatic-speech-recognition", data_args.dataset_name],
"dataset_args": f"Config: {config_name}, Training split: {data_args.train_split_name}, Eval split: {data_args.eval_split_name}",
"dataset": f"{data_args.dataset_name.upper()} - {config_name.upper()}",
}
if "common_voice" in data_args.dataset_name:
kwargs["language"] = config_name
if training_args.push_to_hub:
trainer.push_to_hub(**kwargs)
else:
trainer.create_model_card(**kwargs)
return results


if __name__ == "__main__":
main()