# w2v-conformer based end-to-end model for the OpenASR 2021 challenge

This is an example of fine-tuning an unsupervised pre-trained w2v-conformer model on the [OpenASR 2021](https://www.nist.gov/itl/iad/mig/openasr-challenge) constrained-plus tasks.

We pre-train conformer encoders with the wav2vec 2.0 pre-training method; we call the resulting model ch-w2v-conformer. The original wav2vec 2.0 work takes raw waveforms as input; unlike that work, we use MFCC features as input.
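
A minimal sketch of such an MFCC front-end, assuming torchaudio's Kaldi-compatible `mfcc`; the exact feature dimension and options are not specified in this README, so the values below are illustrative:

```python
# Hedged sketch: Kaldi-style MFCC extraction as encoder input.
# num_ceps / num_mel_bins are illustrative, not the recipe's actual values.
import torchaudio
import torchaudio.compliance.kaldi as kaldi

waveform, sample_rate = torchaudio.load("utt1.wav")  # (channels, samples)

feats = kaldi.mfcc(
    waveform,
    num_ceps=40,                    # assumption: feature dim fed to the encoder
    num_mel_bins=80,
    frame_length=25.0,              # ms
    frame_shift=10.0,               # ms
    dither=1.0,                     # matches "dither 1.0" in the config notes below
    sample_frequency=sample_rate,
)
print(feats.shape)  # (num_frames, num_ceps), consumed by the conformer encoder
```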

The ch-w2v-conformer model is pre-trained on the following datasets:

* ISML datasets (6 languages, 70k hours): an internal dataset containing 40k hours of Chinese plus Cantonese, Tibetan, Inner Mongolian, Inner Kazakh, and Uighur.

* Babel datasets (17 languages, 2k hours): Assamese, Bengali, Cantonese, Cebuano, Georgian, Haitian, Kazakh, Kurmanji, Lao, Pashto, Swahili, Tagalog, Tamil, Tok Pisin, Turkish, Vietnamese, Zulu.

After pre-training, we build an ASR system with a joint CTC-attention structure. In very low-resource tasks we find that stacking too many randomly initialized layers on top of the pre-trained conformer encoder destroys its transfer performance, so we build only a single-layer transformer decoder for joint training (a sketch of the joint objective is given below).
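
A hedged sketch of that joint CTC-attention objective; `encoder`, `ctc_head`, and `decoder` are hypothetical stand-ins for the pre-trained conformer encoder, a linear CTC projection, and the single-layer transformer decoder, and `ctc_weight=0.3` is illustrative:

```python
# Sketch of joint CTC-attention training on top of a pre-trained encoder.
import torch
import torch.nn.functional as F

def joint_loss(encoder, ctc_head, decoder,
               feats, feat_lens, ys, y_lens, ctc_weight=0.3):
    enc_out, enc_lens = encoder(feats, feat_lens)        # (B, T, D)

    # CTC branch on top of the pre-trained encoder.
    log_probs = ctc_head(enc_out).log_softmax(dim=-1)    # (B, T, V)
    ctc = F.ctc_loss(log_probs.transpose(0, 1),          # ctc_loss wants (T, B, V)
                     ys, enc_lens, y_lens, blank=0)

    # Attention branch: the single randomly initialized decoder layer,
    # trained with teacher forcing on the label sequence.
    dec_out = decoder(ys, enc_out, enc_lens)             # (B, L, V)
    att = F.cross_entropy(dec_out.reshape(-1, dec_out.size(-1)),
                          ys.reshape(-1), ignore_index=-1)

    return ctc_weight * ctc + (1.0 - ctc_weight) * att
```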

Pre-trained model: https://huggingface.co/emiyasstar/ch-w2v-conformer
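
The checkpoint can be fetched from the Hugging Face Hub; since the repo's file layout is not documented here, a snapshot download of the whole repo is the safe option:

```python
# Pull the pretrained ch-w2v-conformer repo from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="emiyasstar/ch-w2v-conformer")
print(local_dir)  # local directory containing the pretrained encoder weights
```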


## Constrained-plus Task Performance

* Languages: Cantonese, Mongolian, Kazakh
* Config: conf/train_conformer_large_10h.yaml
* Feature info: MFCC features, dither 1.0, without CMVN
* Training info: lr 0.001, batch size 10, 4 V100 GPUs, acc_grad 1, 80 epochs
* Decoding info: ctc_weight 0.5, average_num 35 (checkpoint averaging; see the sketch below)
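
`average_num 35` means the parameters of 35 epoch checkpoints are averaged into a single model before decoding. A minimal sketch, assuming a hypothetical `exp/epoch_*.pt` naming scheme and plain state-dict checkpoints:

```python
# Average the last 35 epoch checkpoints into one decoding model.
import glob
import torch

paths = sorted(glob.glob("exp/epoch_*.pt"))[-35:]   # last 35 checkpoints

avg = None
for p in paths:
    state = torch.load(p, map_location="cpu")
    if avg is None:
        avg = {k: v.clone().float() for k, v in state.items()}
    else:
        for k in avg:
            avg[k] += state[k].float()

for k in avg:
    avg[k] /= len(paths)

torch.save(avg, "exp/avg_35.pt")  # model actually used for decoding
```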

The dev set results below were obtained with only the 10-hour training set.

## w2v-Conformer

|   decoding_method   | Cantonese (CER) | Mongolian (WER) |
|:-------------------:|:---------------:|:---------------:|
|  ctc_greedy_search  |      31.46      |      53.64      |
|  ctc_prefix_search  |      31.47      |      53.50      |
| attention_rescoring |      31.45      |      52.96      |

## Conformer (trained from scratch)


|   decoding_method   | Cantonese (CER) | Mongolian (WER) |
|:-------------------:|:---------------:|:---------------:|
|  ctc_greedy_search  |      61.43      |      89.38      |
|  ctc_prefix_search  |      61.37      |      89.53      |
| attention_rescoring |      60.61      |      89.60      |