# Baseline
## Overview
We will release an E2E SA-ASR baseline built on [FunASR](https://github.com/alibaba-damo-academy/FunASR) according to the timeline. The model architecture is shown in Figure 3. The SpeakerEncoder is initialized with a pre-trained speaker verification model from ModelScope. The same speaker verification model is also used to extract the speaker embeddings in the speaker profile.
![model architecture](images/sa_asr_arch.png)
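As a purely illustrative aside (not part of the released recipe), the sketch below shows how speaker embeddings produced by such a speaker verification model could be scored against a speaker profile with cosine similarity; in the actual SA-ASR model the decoder attends over the profile rather than taking a hard argmax, and the embedding dimension used here is an assumption.
```python
import numpy as np

def cosine_scores(segment_emb: np.ndarray, profile: np.ndarray) -> np.ndarray:
    """Cosine similarity between one segment embedding (D,) and a profile (S, D)."""
    seg = segment_emb / np.linalg.norm(segment_emb)
    prof = profile / np.linalg.norm(profile, axis=1, keepdims=True)
    return prof @ seg  # one score per enrolled speaker

# Toy data: 4 meeting participants, 192-dimensional embeddings (assumed size).
profile = np.random.randn(4, 192)
segment_emb = np.random.randn(192)
print("closest profile speaker:", int(np.argmax(cosine_scores(segment_emb, profile))))
```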
## Quick start
To run the baseline, first install FunASR and ModelScope (see the [installation guide](https://github.com/alibaba-damo-academy/FunASR#installation)).
There are two startup scripts, `run.sh` for training and evaluating on the old eval and test sets, and `run_m2met_2023_infer.sh` for inference on the new test set of the Multi-Channel Multi-Party Meeting Transcription 2.0 ([M2MeT2.0](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html)) Challenge.
Before running `run.sh`, you must manually download and unpack the [AliMeeting](http://www.openslr.org/119/) corpus and place it in the `./dataset` directory:
```shell
dataset
|—— Eval_Ali_far
|—— Eval_Ali_near
|—— Test_Ali_far
|—— Test_Ali_near
|—— Train_Ali_far
|—— Train_Ali_near
```
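A minimal sanity check, assuming the `./dataset` location shown above, that all six AliMeeting subsets are unpacked before launching `run.sh`; this is a convenience sketch, not part of the official recipe:
```python
from pathlib import Path

# Expected AliMeeting subsets, matching the layout shown above.
EXPECTED = ["Train_Ali_far", "Train_Ali_near", "Eval_Ali_far",
            "Eval_Ali_near", "Test_Ali_far", "Test_Ali_near"]
missing = [d for d in EXPECTED if not (Path("dataset") / d).is_dir()]
if missing:
    raise SystemExit(f"Missing AliMeeting subsets under ./dataset: {missing}")
print("AliMeeting directory layout looks complete.")
```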
Before running `run_m2met_2023_infer.sh`, you need to place the new test set `Test_2023_Ali_far` (to be released after the challenge starts) in the `./dataset` directory; it contains only the raw audio. Then put the given `wav.scp`, `wav_raw.scp`, `segments`, `utt2spk` and `spk2utt` in the `./data/Test_2023_Ali_far` directory.
```shell
data/Test_2023_Ali_far
|—— wav.scp
|—— wav_raw.scp
|—— segments
|—— utt2spk
|—— spk2utt
```
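Assuming these index files follow the usual Kaldi conventions (e.g., each `segments` line is `<segment-id> <recording-id> <start-sec> <end-sec>`), a hedged sketch for verifying and parsing them before inference might look like this:
```python
from pathlib import Path

data_dir = Path("data/Test_2023_Ali_far")
for name in ["wav.scp", "wav_raw.scp", "segments", "utt2spk", "spk2utt"]:
    assert (data_dir / name).is_file(), f"missing {data_dir / name}"

# Parse the VAD segments: "<segment-id> <recording-id> <start-sec> <end-sec>".
segments = {}
with open(data_dir / "segments") as f:
    for line in f:
        seg_id, rec_id, start, end = line.split()
        segments[seg_id] = (rec_id, float(start), float(end))
print(f"{len(segments)} segments ready for decoding")
```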
For more details, see the [FunASR SA-ASR recipe README](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs/alimeeting/sa-asr/README.md).
## Baseline results
The results of the baseline system are shown in Table 3. During training, the speaker profile uses the oracle speaker embeddings. Because oracle speaker labels are unavailable during evaluation, the speaker profile is instead derived from an additional spectral clustering step. For comparison, results using the oracle speaker profile on the Eval and Test sets are also provided to show the impact of speaker profile accuracy.
| Speaker profile | SI-CER (%) | cpCER (%) |
|:----------------|:----------:|----------:|
| oracle profile  | 32.72      | 42.92     |
| cluster profile | 32.73      | 49.37     |
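For readers unfamiliar with the cluster-profile setup, the sketch below outlines one way such a profile could be built when oracle labels are unavailable: cluster segment-level speaker embeddings with spectral clustering and average each cluster into a profile vector. This mirrors the description above but is not the exact script used by the baseline, and the embedding dimension is an assumption.
```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_profile(seg_embs: np.ndarray, num_speakers: int) -> np.ndarray:
    """Build a (num_speakers, D) profile from (num_segments, D) embeddings."""
    normed = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    affinity = np.clip(normed @ normed.T, 0.0, 1.0)  # non-negative cosine affinity
    labels = SpectralClustering(n_clusters=num_speakers,
                                affinity="precomputed").fit_predict(affinity)
    return np.stack([seg_embs[labels == k].mean(axis=0)
                     for k in range(num_speakers)])

profiles = cluster_profile(np.random.randn(50, 192), num_speakers=4)
print(profiles.shape)  # (4, 192)
```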
# Challenge Result
The following table shows the final results of the competition, where Sub-track 1 denotes the sub-track under the fixed training condition and Sub-track 2 denotes the sub-track under the open training condition. All results in this table are cpCER (%). The rankings in the table are the combined rankings of the two sub-tracks, as all teams' submissions met the requirements of the sub-track under the fixed training condition.
| Rank | Team Name | Sub-track 1 | Sub-track 2 | Paper |
|------|----------------------|------------|------------|------------------------|
| 1 | Ximalaya Speech Team | 11.27 | 11.27 | |
| 2 | 小马达 | 18.64 | 18.64 | |
| 3 | AIzyzx | 22.83 | 22.83 | |
| 4 | AsrSpeeder | / | 23.51 | |
| 5 | zyxlhz | 24.82 | 24.82 | |
| 6 | CMCAI | 26.11 | / | |
| 7 | Volcspeech | 34.21 | 34.21 | |
| 8 | 鉴往知来 | 40.14 | 40.14 | |
| 9 | baseline | 41.55 | 41.55 | |
| 10 | DAICT | 41.64 | | |
# Contact
If you have any questions about M2MeT2.0 challenge, please contact us by
- email: [m2met.alimeeting@gmail.com](mailto:m2met.alimeeting@gmail.com)
# Datasets
## Overview of training data
In the fixed training condition, the training data is restricted to three publicly available corpora, namely AliMeeting, AISHELL-4, and CN-Celeb. To evaluate the performance of models trained on these datasets, we will release a new test set, Test-2023, for scoring and ranking. Below we describe the AliMeeting corpus and the Test-2023 set in detail.
## Detail of AliMeeting corpus
AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval) and 10 hours as test set (Test) for scoring and ranking. Specifically, the Train, Eval and Test sets contain 212, 8 and 20 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in Train, Eval and Test sets is 456, 25 and 60, respectively, with balanced gender coverage.
The dataset is collected in 13 meeting venues, which are categorized into three types: small, medium, and large rooms with sizes ranging from 8 m$^{2}$ to 55 m$^{2}$. Different rooms provide a variety of acoustic properties and layouts, and the detailed parameters of each meeting venue will be released together with the Train data. The wall materials of the meeting venues include cement, glass, etc., and other furnishings include sofas, TVs, blackboards, fans, air conditioners, plants, etc. During recording, the meeting participants sit around a microphone array placed on the table and conduct a natural conversation, with microphone-speaker distances ranging from 0.3 m to 5.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the meeting, various kinds of indoor noise, including but not limited to clicking, keyboard, door opening/closing, fan and babble noise, occur naturally. For both the Train and Eval sets, the participants are required to remain in the same position during recording, and there is no speaker overlap between the Train and Eval sets. An example of a recording venue from the Train set is shown in Figure 1.
![meeting room](images/meeting_room.png)
The number of participants within one meeting session ranges from 2 to 4. To ensure coverage of different overlap ratios, we select various meeting topics during recording, including medical treatment, education, business, organization management, industrial production and other daily routine meetings. The average speech overlap ratios of the Train, Eval and Test sets are 42.27%, 34.76% and 42.8%, respectively. More details of AliMeeting are shown in Table 1, and a detailed overlap-ratio distribution of meeting sessions with different numbers of speakers in the Train, Eval and Test sets is shown in Table 2.
![dataset detail](images/dataset_details.png)
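To make the overlap statistic concrete, here is a small sketch that estimates the speech overlap ratio of one session from its speaker segments (the fraction of total speech time with at least two active speakers); the official AliMeeting statistics may use a slightly different definition, so treat this only as an illustration.
```python
def overlap_ratio(segments):
    """segments: list of (speaker_id, start_sec, end_sec) for one session."""
    events = []
    for _, start, end in segments:
        events.append((start, +1))
        events.append((end, -1))
    events.sort()  # ends sort before starts at equal times, avoiding false overlap
    active, last_t = 0, None
    speech = overlap = 0.0
    for t, delta in events:
        if last_t is not None and active > 0:
            speech += t - last_t
            if active > 1:
                overlap += t - last_t
        active, last_t = active + delta, t
    return overlap / speech if speech else 0.0

# Two speakers overlapping for 2 of 8 seconds of total speech time -> 0.25.
print(overlap_ratio([("spk1", 0.0, 5.0), ("spk2", 3.0, 8.0)]))
```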
The Test-2023 set consists of 20 sessions that were recorded in an identical acoustic setting to that of the AliMeeting corpus. Each meeting session in the Test-2023 dataset comprises between 2 and 4 participants, thereby sharing a similar configuration with the AliMeeting test set.
We also record the near-field signal of each participant using a headset microphone, ensuring that only the participant's own speech is recorded and transcribed. It is worth noting that the far-field audio recorded by the microphone array and the near-field audio recorded by the headset microphones are synchronized to a common timeline.
All transcriptions of the speech data are prepared in TextGrid format for each session, which contains the session duration, speaker information (number of speakers, speaker ID, gender, etc.), the total number of segments of each speaker, and the timestamp and transcription of each segment.
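As a quick way to inspect these annotations, the sketch below reads one session's TextGrid with the third-party `textgrid` package (`pip install textgrid`); the package choice and the file name are assumptions and not part of the official data preparation.
```python
import textgrid  # third-party package, assumed here for illustration

tg = textgrid.TextGrid.fromFile("R0001_M0001.TextGrid")  # hypothetical session file
for tier in tg.tiers:  # typically one interval tier per speaker
    spoken = [iv for iv in tier.intervals if iv.mark.strip()]
    total = sum(iv.maxTime - iv.minTime for iv in spoken)
    print(f"tier {tier.name}: {len(spoken)} segments, {total:.1f} s of speech")
```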
## Get the data
The three training datasets mentioned above can be downloaded from [OpenSLR](https://openslr.org/resources.php) via the following links. In particular, the baseline provides convenient data preparation scripts for the AliMeeting corpus.
- [AliMeeting](https://openslr.org/119/)
- [AISHELL-4](https://openslr.org/111/)
- [CN-Celeb](https://openslr.org/82/)
The new test set is now available [here](https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Test_2023_Ali.tar.gz).
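For convenience, a minimal download-and-unpack sketch using only the Python standard library is shown below; the local paths are placeholders, and the archive is large, so a dedicated download tool may be preferable.
```python
import tarfile
import urllib.request

URL = ("https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/"
       "AliMeeting/openlr/Test_2023_Ali.tar.gz")
archive = "Test_2023_Ali.tar.gz"

urllib.request.urlretrieve(URL, archive)  # download the far-field test audio
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall("dataset")             # unpack under ./dataset
```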
# Introduction
## Call for participation
Automatic speech recognition (ASR) and speaker diarization have made significant strides in recent years, resulting in a surge of speech technology applications across various domains. However, meetings present unique challenges to speech technologies due to their complex acoustic conditions and diverse speaking styles, including overlapping speech, variable numbers of speakers, far-field signals in large conference rooms, and environmental noise and reverberation.
Over the years, several challenges have been organized to advance the development of meeting transcription, including the Rich Transcription evaluation and Computational Hearing in Multisource Environments (CHIME) challenges. The latest iteration of the CHIME challenge has a particular focus on distant automatic speech recognition and developing systems that can generalize across various array topologies and application scenarios. However, while progress has been made in English meeting transcription, language differences remain a significant barrier to achieving comparable results in non-English languages, such as Mandarin. The Multimodal Information Based Speech Processing (MISP) and Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenges have been instrumental in advancing Mandarin meeting transcription. The MISP challenge seeks to address the problem of audio-visual distant multi-microphone signal processing in everyday home environments, while the M2MeT challenge focuses on tackling the speech overlap issue in offline meeting rooms.
The ICASSP 2022 M2MeT challenge focused on meeting scenarios and comprised two main tasks: speaker diarization and multi-speaker automatic speech recognition. The former involves identifying who spoke when in the meeting, while the latter aims to transcribe speech from multiple speakers simultaneously, which poses significant technical difficulties due to overlapping speech and acoustic interference.
Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU 2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined but not the corresponding speaker. To address this limitation and further advance current multi-talker ASR systems towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. The speaker-attributed ASR task aims to tackle the practical and challenging problem of identifying "who spoke what and when". To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models' performance and advance the state of the art in this area.
## Timeline (AoE time)
- **April 29, 2023:** Challenge and registration open.
- **May 11, 2023:** Baseline release.
- **May 22, 2023:** Registration deadline; the due date for participants to join the challenge.
- **June 16, 2023:** Test data release and leaderboard open.
- **June 20, 2023:** Final submission deadline and leaderboard close.
- **June 26, 2023:** Evaluation results and ranking release.
- **July 3, 2023:** Deadline for paper submission.
- **July 10, 2023:** Deadline for final paper submission.
- **December 12 to 16, 2023:** ASRU Workshop and Challenge Session.
## Guidelines
Interested participants, whether from academia or industry, must register for the challenge by completing the Google form below. The deadline for registration is May 22, 2023.
[M2MeT2.0 Registration](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
Within three working days, the challenge organizers will send email invitations to eligible teams to participate in the challenge. All qualified teams are required to adhere to the challenge rules, which will be published on the challenge page. Prior to the ranking release, each participant must submit a system description document detailing their approach and methods. The organizers will select the top-ranking submissions to be included in the ASRU 2023 Proceedings.
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
# Organizers
***Lei Xie, Professor, AISHELL foundation, China***
Email: [lxie@nwpu.edu.cn](mailto:lxie@nwpu.edu.cn)
<img src="images/lxie.jpeg" alt="lxie" width="20%">
***Kong Aik Lee, Senior Scientist at the Institute for Infocomm Research, A\*STAR, Singapore***
Email: [kongaik.lee@ieee.org](mailto:kongaik.lee@ieee.org)
<img src="images/kong.png" alt="kong" width="20%">
***Zhijie Yan, Principal Engineer at Alibaba, China***
Email: [zhijie.yzj@alibaba-inc.com](mailto:zhijie.yzj@alibaba-inc.com)
<img src="images/zhijie.jpg" alt="zhijie" width="20%">
***Shiliang Zhang, Senior Engineer at Alibaba, China***
Email: [sly.zsl@alibaba-inc.com](mailto:sly.zsl@alibaba-inc.com)
<img src="images/zsl.JPG" alt="zsl" width="20%">
***Yanmin Qian, Professor, Shanghai Jiao Tong University, China***
Email: [yanminqian@sjtu.edu.cn](mailto:yanminqian@sjtu.edu.cn)
<img src="images/qian.jpeg" alt="qian" width="20%">
***Zhuo Chen, Applied Scientist at Microsoft, USA***
Email: [zhuc@microsoft.com](mailto:zhuc@microsoft.com)
<img src="images/chenzhuo.jpg" alt="chenzhuo" width="20%">
***Jian Wu, Applied Scientist at Microsoft, USA***
Email: [wujian@microsoft.com](mailto:wujian@microsoft.com)
<img src="images/wujian.jpg" alt="wujian" width="20%">
***Hui Bu, CEO, AISHELL foundation, China***
Email: [buhui@aishelldata.com](mailto:buhui@aishelldata.com)
<img src="images/buhui.jpeg" alt="buhui" width="20%">
# Rules
All participants should adhere to the following rules to be eligible for the challenge.
- Data augmentation is allowed on the original training dataset, including, but not limited to, adding noise or reverberation, speed perturbation and tone change.
- Participants may use the Eval set for model training, but the Test set must not be used for training; it may only be used for parameter tuning and model selection. Any use of the Test-2023 set that violates these rules is strictly prohibited, including but not limited to using it for fine-tuning or training models.
- If two systems achieve the same cpCER on the Test set, the system with the lower computational complexity will be judged superior.
- If forced alignment is used to obtain frame-level classification labels, the forced-alignment model must be trained only on the data allowed by the corresponding sub-track.
- Shallow fusion is allowed for end-to-end approaches (e.g., LAS, RNN-T and Transformer), but the language model used for shallow fusion may only be trained on the transcripts of the allowed training data.
- The organizers reserve the right of final interpretation of these rules. In case of special circumstances, the organizers will coordinate a resolution.
# Track & Evaluation
## Speaker-Attributed ASR
The speaker-attributed ASR task poses a unique challenge of transcribing speech from multiple speakers and assigning a speaker label to the transcription. Figure 2 illustrates the difference between the speaker-attributed ASR task and the multi-speaker ASR task. This track allows for the use of the AliMeeting, AISHELL-4, and CN-Celeb datasets as constrained data sources during both training and evaluation. The AliMeeting dataset, which was used in the M2MeT challenge, includes Train, Eval, and Test sets. Additionally, a new Test-2023 set, consisting of approximately 10 hours of meeting data recorded in an identical acoustic setting as the AliMeeting corpus, will be released soon for challenge scoring and ranking. It is worth noting that the organizers will not provide the near-field audio, transcriptions, or oracle timestamps of the Test-2023 set. Instead, segments containing multiple speakers will be provided, which can be obtained using a simple voice activity detection (VAD) model.
![task difference](images/task_diff.png)
## Evaluation metric
The accuracy of a speaker-attributed ASR system is evaluated using the concatenated minimum permutation character error rate (cpCER) metric. The calculation of cpCER involves three steps. Firstly, the reference and hypothesis transcriptions from each speaker in a session are concatenated in chronological order. Secondly, the character error rate (CER) is calculated between the concatenated reference and hypothesis transcriptions, and this process is repeated for all possible speaker permutations. Finally, the permutation with the lowest CER is selected as the cpCER for that session. The CER is obtained by dividing the total number of insertions (Ins), substitutions (Sub), and deletions (Del) of characters required to transform the ASR output into the reference transcript by the total number of characters in the reference transcript. Specifically, CER is calculated by:
$$\text{CER} = \frac {\mathcal N_{\text{Ins}} + \mathcal N_{\text{Sub}} + \mathcal N_{\text{Del}} }{\mathcal N_{\text{Total}}} \times 100\%,$$
where $\mathcal N_{\text{Ins}}$, $\mathcal N_{\text{Sub}}$, $\mathcal N_{\text{Del}}$ are the character number of the three errors, and $\mathcal N_{\text{Total}}$ is the total number of characters.
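A self-contained reference sketch of this metric for a single session is given below (a naive permutation search, adequate for the 2 to 4 speakers per session in this challenge); it follows the definition above but is not the official scoring script.
```python
import itertools

def char_edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance = Ins + Sub + Del."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur_row = [i]
        for j, h in enumerate(hyp, 1):
            cur_row.append(min(prev_row[j] + 1,              # deletion
                               cur_row[j - 1] + 1,           # insertion
                               prev_row[j - 1] + (r != h)))  # substitution
        prev_row = cur_row
    return prev_row[-1]

def cp_cer(ref_spk2text: dict, hyp_spk2text: dict) -> float:
    """cpCER (%) for one session; texts are concatenated per speaker in time order."""
    refs, hyps = list(ref_spk2text.values()), list(hyp_spk2text.values())
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))   # pad so every permutation is well defined
    hyps += [""] * (n - len(hyps))
    total = sum(len(r) for r in refs)
    best = min(sum(char_edit_distance(r, h) for r, h in zip(refs, perm))
               for perm in itertools.permutations(hyps))
    return 100.0 * best / total

# Toy example: the best speaker permutation gives 1 error over 8 reference characters.
print(cp_cer({"spk1": "今天开会", "spk2": "明天休息"},
             {"A": "明天休息", "B": "今天开会了"}))  # 12.5
```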
## Sub-track arrangement
### Sub-track I (Fixed Training Condition):
Participants are required to use only the fixed-constrained data (i.e., AliMeeting, AISHELL-4, and CN-Celeb) for system development. The usage of any additional data is strictly prohibited. However, participants can use open-source pre-trained models from third-party sources, such as [Hugging Face](https://huggingface.co/models) and [ModelScope](https://www.modelscope.cn/models), provided that the list of utilized models is clearly stated in the final system description paper.
### Sub-track II (Open Training Condition):
Participants are allowed to use any publicly available data set, privately recorded data, and manual simulation to build their systems in addition to the fixed-constrained data. They can also utilize any open-source pre-trained models, but it is mandatory to provide a clear list of the data and models used in the final system description paper. If manually simulated data is used, a detailed description of the data simulation scheme must be provided.
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = "MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0"
copyright = "2023, Speech Lab, Alibaba Group; ASLP Group, Northwestern Polytechnical University"
author = "Speech Lab, Alibaba Group; Audio, Speech and Language Processing Group, Northwestern Polytechnical University"
extensions = [
    "nbsphinx",
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
    "sphinx.ext.viewcode",
    "sphinx.ext.mathjax",
    "sphinx.ext.todo",
    # "sphinxarg.ext",
    "sphinx_markdown_tables",
    # 'recommonmark',
    "sphinx_rtd_theme",
    "myst_parser",
]
myst_enable_extensions = [
    "colon_fence",
    "deflist",
    "dollarmath",
    "html_image",
]
myst_heading_anchors = 2
myst_highlight_code_blocks = True
myst_update_mathjax = False
templates_path = ["_templates"]
source_suffix = [".rst", ".md"]
pygments_style = "sphinx"
html_theme = "sphinx_rtd_theme"