For each recipe, we ask you to report the following: experiment results, environment information, and model information.
For reproducibility, a link to the uploaded pre-trained model may also be added. All this information should be written
in a markdown file called `RESULTS.md` placed at the recipe root. You can refer to
[tedlium2-example](https://github.com/espnet/espnet/blob/master/egs/tedlium2/asr1/RESULTS.md) for an example.
To generate `RESULTS.md` for a recipe, please follow these instructions:
- Execute `~/espnet/utils/show_result.sh` at the recipe root (where `run.sh` is located).
You'll get your environment information and evaluation results for each experiment in a markdown format.
From here, you can copy or redirect text output to `RESULTS.md`.
- Execute `~/espnet/utils/pack_model.sh` at the recipe root to generate a packed ESPnet model called `model.tar.gz`
and output the model information. Executing the utility script without arguments prints the expected arguments.
- Put the model information in `RESULTS.md`, along with the model link if you're using private web storage.
- If you don't have private web storage, please contact Shinji Watanabe <shinjiw@ieee.org> to get access to the ESPnet storage.
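Put together, the flow could look like the following sketch (the recipe path is a placeholder):
``` console
$ cd egs/your_corpus/asr1                      # recipe root, where run.sh is located
$ ~/espnet/utils/show_result.sh > RESULTS.md   # environment info and results in markdown
$ ~/espnet/utils/pack_model.sh                 # without arguments, prints the expected arguments
```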
#### 1.3.2 ESPnet2 recipes
ESPnet2's recipes correspond to `egs2`. ESPnet2 applies a new paradigm that does not depend on Kaldi's binaries, which makes it lighter and more general.
For ESPnet2, we do not recommend preparing the recipe stages for each corpus; instead, use the common pipelines we provide in `asr.sh`, `tts.sh`, and
`enh.sh`. For details on creating ESPnet2 recipes, please refer to [egs2-readme](https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/README.md).
The common pipeline of ESPnet2 recipes takes care of the `RESULTS.md` generation, model packing, and uploading. ESPnet2 models are maintained on Hugging Face and Zenodo (deprecated).
You can also refer to the documentation at https://github.com/espnet/espnet_model_zoo.
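For illustration, a new ESPnet2 recipe typically ships only a thin `run.sh` that calls the shared pipeline. The following is a minimal sketch; the set names and config paths are placeholders:
```sh
#!/usr/bin/env bash
# hypothetical egs2/your_corpus/asr1/run.sh calling the common ASR pipeline
./asr.sh \
    --train_set train \
    --valid_set dev \
    --test_sets "dev test" \
    --asr_config conf/train.yaml \
    --inference_config conf/decode.yaml "$@"
```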
To upload your model to the Hugging Face Hub (uploading to Zenodo is deprecated; the Hugging Face Hub is preferred):
1. `git clone https://huggingface.co/username/your-model-name` (clone this outside the ESPnet tree to avoid issues, as this is a git repo)
2. `cd your-model-name`
3. `git lfs install`
4. Copy the contents of the `exp` directory of your recipe into this directory (check other models of a similar task under ESPnet to confirm your directory structure).
5. `git add .`
6. `git commit -m "Add model files"`
7. `git push`
8. Check that the inference demo on HF runs successfully to verify the upload.
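Put together, the upload steps might look like this sketch (the repository name `username/your-model-name` and the recipe path are placeholders):
```sh
# clone the model repo outside the ESPnet tree, since it is itself a git repo
git clone https://huggingface.co/username/your-model-name
cd your-model-name
git lfs install
# copy the exp directory contents of your recipe (path is illustrative)
cp -r ~/espnet/egs2/your_corpus/asr1/exp .
git add .
git commit -m "Add model files"
git push
```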
#### 1.3.3 Additional requirements for new recipe
- Common/shared files and directories such as `utils`, `steps`, `asr.sh`, etc., should be linked using
a symbolic link (e.g.: `ln -s <source-path> <target-path>`); see the sketch after this list. Please refer to existing recipes if you're
unsure which files/directories are shared. Note that in ESPnet2, some of them are automatically generated by https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/setup.sh.
- Default training and decoding configurations (i.e.: the defaults in `run.sh`) should be named `train.yaml`
and `decode.yaml`, respectively, and put in `conf/`. Additional or variant configurations should be put in `conf/tuning/` and named according
to their differences.
- If a recipe for a new corpus is proposed, you should add its name and information to:
https://github.com/espnet/espnet/blob/master/egs/README.md if it's an ESPnet1 recipe,
or https://github.com/espnet/espnet/blob/master/egs2/README.md + `db.sh` if it's an ESPnet2 recipe.
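As an illustrative sketch only, setting up a hypothetical new ESPnet1 recipe might look like the following; the symbolic-link targets vary, so copy them from an existing recipe:
```sh
cd egs/your_corpus/asr1
# link shared Kaldi utilities (copy the actual targets from an existing recipe)
ln -s ../../../tools/kaldi/egs/wsj/s5/utils utils
ln -s ../../../tools/kaldi/egs/wsj/s5/steps steps
# defaults used by run.sh live in conf/; variants go to conf/tuning/
mkdir -p conf/tuning
```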
#### 1.3.4 Checklist before you submit the recipe-based PR
- [ ] be careful about the recipe name; it is recommended to follow the naming conventions of the other recipes
- [ ] common/shared files are linked with **soft link** (see Section 1.3.3)
- [ ] modified or new python scripts should be passed through the **latest** black formatting (using the python package black). The command to execute could be `black espnet espnet2 test utils setup.py egs*/*/*/local egs2/TEMPLATE/*/pyscripts tools/*.py ci/*.py`
- [ ] modified or new python scripts should be passed through the **latest** isort formatting (using the python package isort). The command to execute could be `isort espnet espnet2 test utils setup.py egs*/*/*/local egs2/TEMPLATE/*/pyscripts tools/*.py ci/*.py`
- [ ] cluster settings should be set as **default** (e.g., cmd.sh conf/slurm.conf conf/queue.conf conf/pbs.conf)
- [ ] update `egs/README.md` or `egs2/README.md` with corresponding recipes
- [ ] add corresponding entry in `egs2/TEMPLATE/db.sh` for a new corpus
- [ ] try to **simplify** the model configurations. We recommend having only the best configuration at the start of a recipe. Please also follow the default rule defined in Section 1.3.3
- [ ] large meta-information for a corpus should be maintained elsewhere, not in the recipe itself
- [ ] it is recommended to also include results and a pre-trained model with the recipe
## 2 Pull Request
If your proposed feature or bugfix is ready, please open a Pull Request (PR) at https://github.com/espnet/espnet
or use the Pull Request button in your forked repo. If you're not familiar with the process, please refer to GitHub's documentation on pull requests. Our version policy is as follows:
1. We will keep the first version digit `0` until we have some super-major change at the project-organization level.
2. The second version digit will be updated when we have major updates, including new functions and refactoring, and
their related bug fix and recipe changes.
So far, this version update has been done roughly every half year (but it depends on the development plan).
3. The third version digit will be updated periodically (every two months or so) when we fix serious bugs or accumulate minor changes, including recipe-related changes.
## 4 Unit testing
ESPnet's testing is located under `test/`. You can install additional packages for testing as follows:
``` console
$ cd <espnet_root>
$ . ./tools/activate_python.sh
$ pip install -e ".[test]"
```
### 4.1 Python
Then you can run the entire test suite using [flake8](http://flake8.pycqa.org/en/latest/), [autopep8](https://github.com/hhatto/autopep8), [black](https://github.com/psf/black), [isort](https://github.com/PyCQA/isort) and [pytest](https://docs.pytest.org/en/latest/) with [coverage](https://pytest-cov.readthedocs.io/en/latest/reporting.html) by
``` console
./ci/test_python.sh
```
The following are some useful tips for using pytest:
- New test files should be put under the `test/` directory and named `test_xxx.py`. Each method in the test file should
have the format `def test_yyy(...)`. [Pytest](https://docs.pytest.org/en/latest/) will automatically find and test them.
- We recommend adding several small test files instead of grouping them in one big file (e.g.: `test_e2e_xxx.py`).
Technically, a test file should only cover methods from one file (e.g.: `test_transformer_utils.py` to test `transformer_utils.py`).
- To monitor test coverage and avoid overlapping tests, we recommend using `pytest --cov-report term-missing <test_file|dir>`
to highlight covered and missed lines. For more details, please refer to [coverage-test](https://pytest-cov.readthedocs.io/en/latest/readme.html).
- We limited test running time to 2.0 seconds (see: [pytest-timeouts](https://pypi.org/project/pytest-timeouts/)). As such,
we recommend using small model parameters and avoiding dynamic imports, file access, and unnecessary loops. If a unit test needs
more running time, you can annotate your test with `@pytest.mark.execution_timeout(sec)`.
- For test initialization (parameters, modules, etc), you can use pytest fixtures. Refer to [pytest fixtures](https://docs.pytest.org/en/latest/fixture.html#using-fixtures-from-classes-modules-or-projects) for more information.
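For example, a single (hypothetical) test file can be checked with line-by-line coverage as follows:
``` console
$ pytest --cov=espnet --cov-report term-missing test/test_transformer_utils.py
```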
In addition, please follow the [PEP 8 convention](https://peps.python.org/pep-0008/) for the coding style and [Google's convention for docstrings](https://google.github.io/styleguide/pyguide.html#383-functions-and-methods).
Below are some specific points that should be taken care of in particular:
- Avoid writing python2-style code. For example, `super().__init__()` is preferred over `super(CLASS_NAME, self).__init__()`.
### 4.2 Bash scripts
You can also test the scripts in `utils` with [bats-core](https://github.com/bats-core/bats-core) and [shellcheck](https://github.com/koalaman/shellcheck).
To test:
``` console
./ci/test_shell.sh
```
## 5 Integration testing
Write new integration tests in [ci/test_integration_espnet1.sh](ci/test_integration_espnet1.sh) or [ci/test_integration_espnet2.sh](ci/test_integration_espnet2.sh) when you add new features in [espnet/bin](espnet/bin) or [espnet2/bin](espnet2/bin), respectively. They use our smallest dataset, [egs/mini_an4](egs/mini_an4) or [egs2/mini_an4](egs2/mini_an4), to test `run.sh`. **Don't call `python` directly in integration tests. Instead, use `coverage run --append`** as the python interpreter. In particular, `run.sh` should support `--python ${python}` to call the custom interpreter.
```bash
# ci/test_integration_espnet{1,2}.sh
python="coverage run --append"
cd egs/mini_an4/your_task
./run.sh --python "${python}"
```
### 5.1 Configuration files
- [setup.cfg](setup.cfg) configures pytest, black, and flake8.
## 6 Writing new tools
To generate documentation, new tools should support `--help` to show their usage. If you use Kaldi's `utils/parse_options.sh`, define `help_message="Usage: $0 ..."`.
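A minimal sketch of such a tool, assuming the recipe links Kaldi's `utils/` (the `--nj` option and the messages are purely illustrative):
```sh
#!/usr/bin/env bash
# parse_options.sh prints ${help_message} when the tool is called with --help
help_message="Usage: $0 [--nj <num-jobs>] <data-dir>"
nj=4
. utils/parse_options.sh
echo "running with nj=${nj} on ${1}"
```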
## 7 Writing documentation
See [doc](doc/README.md).
## 8 Adding pretrained models
Pack your trained models using `utils/pack_model.sh` and upload them [here](https://drive.google.com/open?id=1k9RRyc06Zl0mM2A7mi-hxNiNMFb_YzTF) (permission required).
Add the shared link to `utils/recog_wav.sh` or `utils/synth_wav.sh`.
ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and so on.
ESPnet uses [pytorch](http://pytorch.org/) as a deep learning engine and also follows [Kaldi](http://kaldi-asr.org/) style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.
- Attention: Dot product, location-aware attention, variants of multi-head
- Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
- Batch GPU decoding
- Data augmentation
- **Transducer** based end-to-end ASR
- Architecture:
- RNN-based encoder and decoder.
- Custom encoder and decoder supporting Transformer, Conformer (encoder), 1D Conv / TDNN (encoder) and causal 1D Conv (decoder) blocks.
- VGG2L (RNN/custom encoder) and Conv2D (custom encoder) bottlenecks.
- Search algorithms:
- Greedy search constrained to one emission per timestep.
- Default beam search algorithm [[Graves, 2012]](https://arxiv.org/abs/1211.3711) without prefix search.
- Alignment-Length Synchronous decoding [[Saon et al., 2020]](https://ieeexplore.ieee.org/abstract/document/9053040).
- Time Synchronous Decoding [[Saon et al., 2020]](https://ieeexplore.ieee.org/abstract/document/9053040).
- N-step Constrained beam search modified from [[Kim et al., 2020]](https://arxiv.org/abs/2002.03577).
- modified Adaptive Expansion Search based on [[Kim et al., 2021]](https://ieeexplore.ieee.org/abstract/document/9250505) and NSC.
- Features:
- Multi-task learning with various auxiliary losses:
- Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
- Decoder: cross-entropy w/ label smoothing.
- Transfer learning with acoustic model and/or language model.
- Training with FastEmit regularization method [[Yu et al., 2021]](https://arxiv.org/abs/2010.11148).
> Please refer to the [tutorial page](https://espnet.github.io/espnet/tutorial.html#transducer) for complete documentation.
- CTC segmentation
- Non-autoregressive model based on Mask-CTC
- ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
- Wav2Vec2.0 pretrained model as Encoder, imported from [FairSeq](https://github.com/pytorch/fairseq/tree/master/fairseq).
- Self-supervised learning representations as features, using upstream models in [S3PRL](https://github.com/s3prl/s3prl) as the frontend.
- Set `frontend` to `s3prl`
- Select any upstream model by setting `frontend_conf` to the corresponding name.
- Transfer Learning:
- Easy usage and transfer from models previously trained by your group, or models from the [ESPnet Hugging Face repository](https://huggingface.co/espnet).
- [Documentation](https://github.com/espnet/espnet/tree/master/egs2/mini_an4/asr1/transfer_learning.md) and [toy example runnable on colab](https://github.com/espnet/notebook/blob/master/espnet2_asr_transfer_learning_demo.ipynb).
- Streaming Transformer/Conformer ASR with blockwise synchronous beam search.
- Restricted Self-Attention based on [Longformer](https://arxiv.org/abs/2004.05150) as an encoder for long sequences
- OpenAI [Whisper](https://openai.com/blog/whisper/) model, robust ASR based on large-scale, weakly-supervised multitask learning
Demonstration
- Real-time ASR demo with ESPnet2: [open in Colab](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_asr_realtime_demo.ipynb)
- [Gradio](https://github.com/gradio-app/gradio) Web Demo on [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces). Check out the [Web Demo](https://huggingface.co/spaces/akhaliq/espnet2_asr)
- Streaming Transformer ASR [Local Demo](https://github.com/espnet/notebook/blob/master/espnet2_streaming_asr_demo.ipynb) with ESPnet2.
### TTS: Text-to-speech
- Architecture
- Tacotron2
- Transformer-TTS
- FastSpeech
- FastSpeech2
- Conformer FastSpeech & FastSpeech2
- VITS
- JETS
- Multi-speaker & multi-language extension
- Pretrained speaker embedding (e.g., X-vector)
- Speaker ID embedding
- Language ID embedding
- Global style token (GST) embedding
- Mix of the above embeddings
- End-to-end training
- End-to-end text-to-wav model (e.g., VITS, JETS, etc.)
- Joint training of text2mel and vocoder
- Various language support
- En / Jp / Zh / De / Ru / And more...
- Integration with neural vocoders
- Parallel WaveGAN
- MelGAN
- Multi-band MelGAN
- HiFiGAN
- StyleMelGAN
- Mix of the above models
Demonstration
- Real-time TTS demo with ESPnet2: [open in Colab](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)
- Integrated to [Hugging Face Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See the demo: [ESPnet2-TTS](https://huggingface.co/spaces/akhaliq/ESPnet2-TTS)
To train the neural vocoder, please check the following repositories:
- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder)
### SE: Speech enhancement (and separation)
- Flexible ASR integration: working as an individual task or as the ASR frontend
- Easy to import pretrained models from [Asteroid](https://github.com/asteroid-team/asteroid)
- Both the pre-trained models from Asteroid and the specific configuration are supported.
Demonstration
- Interactive SE demo with ESPnet2: [open in Colab](https://colab.research.google.com/drive/1fjRJCh96SoYLZPRxsjF9VDv4Q2VoIckI?usp=sharing)
### ST: Speech Translation & MT: Machine Translation
- **State-of-the-art performance** in several ST benchmarks (comparable/superior to cascaded ASR and MT)
- Transformer based end-to-end ST (new!)
- Transformer based end-to-end MT (new!)
### VC: Voice conversion
- Transformer and Tacotron2 based parallel VC using melspectrogram (new!)
- End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)
### SLU: Spoken Language Understanding
- Architecture
- Transformer based Encoder
- Conformer based Encoder
- [Branchformer](https://proceedings.mlr.press/v162/peng22a.html) based Encoder
- [E-Branchformer](https://arxiv.org/abs/2210.00077) based Encoder
- RNN based Decoder
- Transformer based Decoder
- Support Multitasking with ASR
- Predict both intent and ASR transcript
- Support Multitasking with NLU
- Deliberation encoder based two-pass model
- Support using pretrained ASR models
- Hubert
- Wav2vec2
- VQ-APC
- TERA and more ...
- Support using pretrained NLP models
- BERT
- MPNet and more...
- Various language support
- En / Jp / Zh / Nl / And more...
- Supports using context from previous utterances
- Supports using other tasks like SE in pipeline manner
- Supports Two Pass SLU that combines audio and ASR transcript
Demonstration
- Performing noisy spoken language understanding using a speech enhancement model followed by a spoken language understanding model: [open in Colab](https://colab.research.google.com/drive/14nCrJ05vJcQX0cJuXjbMVFWUHJ3Wfb6N?usp=sharing)
- Performing two-pass spoken language understanding where the second-pass model attends to both acoustic and semantic information: [open in Colab](https://colab.research.google.com/drive/1p2cbGIPpIIcynuDl4ZVHDpmNPl8Nh_ci?usp=sharing)
- Integrated to [Hugging Face Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See the SLU demo on multiple languages: [ESPnet2-SLU](https://huggingface.co/spaces/Siddhant/ESPnet2-SLU)
### SUM: Speech Summarization
- End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [[Sharma et al., 2022]](https://arxiv.org/abs/2110.06263)
### SVS: Singing Voice Synthesis
- Framework merge from [Muskits](https://github.com/SJTMusicTeam/Muskits)
If you'll use ESPnet1, please install Chainer and CuPy:
```sh
pip install chainer==6.0.0 cupy==6.0.0 # [Option]
```
You might need to install some packages depending on each task. We prepared various installation scripts at [tools/installers](tools/installers).
- (ESPnet2) Once installed, run `wandb login` and set `--use_wandb true` to enable tracking runs using W&B.
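For example (a sketch; the corpus and task names are placeholders):
```sh
wandb login                 # one-time authentication
cd egs2/your_corpus/asr1
./run.sh --use_wandb true   # track this run with W&B
```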
## Usage
See [Usage](https://espnet.github.io/espnet/tutorial.html).
## Docker Container
go to [docker/](docker/) and follow [instructions](https://espnet.github.io/espnet/docker.html).
## Contribution
Thank you for taking time for ESPnet! Any contributions to ESPnet are welcome; feel free to ask questions or make requests via [issues](https://github.com/espnet/espnet/issues).
If this is your first contribution to ESPnet, please follow the [contribution guide](CONTRIBUTING.md).
## Results and demo
You can find useful tutorials and demos in the [Interspeech 2019 Tutorial](https://github.com/espnet/interspeech2019-tutorial).
### ASR results
<details><summary>expand</summary><div>
We list the character error rate (CER) and word error rate (WER) of major ASR tasks.
Note that the performance of the CSJ, HKUST, and Librispeech tasks was significantly improved by using the wide network (#units = 1024) and large subword units if necessary, as reported by [RWTH](https://arxiv.org/pdf/1805.03294.pdf).
If you want to check the results of the other recipes, please check `egs/<name_of_recipe>/asr1/RESULTS.md`.
</div></details>
### ASR demo
<details><summary>expand</summary><div>
You can recognize speech in a WAV file using pretrained models.
Go to a recipe directory and run `utils/recog_wav.sh` as follows:
```sh
# go to a recipe directory and source the path of the espnet tools
cd egs/tedlium2/asr1 && . ./path.sh
# recognize speech in a WAV file (the file name is a placeholder;
# see recog_wav.sh --help for available models)
recog_wav.sh example.wav
```
You can try the interactive demo with Google Colab. Please click the following link to access the demo:
[open in Colab](https://colab.research.google.com/drive/1fjRJCh96SoYLZPRxsjF9VDv4Q2VoIckI?usp=sharing)
It is based on ESPnet2. Pretrained models are available for both speech enhancement and speech separation tasks.
If you want to check the results of the other recipes, please check `egs/<name_of_recipe>/st1/RESULTS.md`.
</div></details>
### ST demo
<details><summary>expand</summary><div>
(**New!**) We made a new real-time E2E-ST + TTS demonstration in Google Colab.
Please access the notebook from the following link and enjoy the real-time speech-to-speech translation!
[open in Colab](https://colab.research.google.com/github/espnet/notebook/blob/master/st_demo.ipynb)
---
You can translate speech in a WAV file using pretrained models.
Go to a recipe directory and run `utils/translate_wav.sh` as follows:
```sh
# go to recipe directory and source path of espnet tools
cd egs/fisher_callhome_spanish/st1 && . ./path.sh
# download example wav file
wget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -
# let's translate speech! (see translate_wav.sh --help for model options)
translate_wav.sh test.wav
```
- [Multi Japanese speaker Transformer](https://drive.google.com/open?id=1fFMQDF6NV5Ysz48QLFYE8fEvbAxCsMBw)
- [Single English speaker models with Parallel WaveGAN](https://drive.google.com/open?id=1HvB0_LDf1PVinJdehiuCt5gWmXGguqtx)
- [Single English speaker knowledge distillation-based FastSpeech](https://drive.google.com/open?id=1wG-Y0itVYalxuLAHdkAHO7w1CWFfRPF4)
You can download all of the pretrained models and generated samples:
- [All of the pretrained E2E-TTS models](https://drive.google.com/open?id=1k9RRyc06Zl0mM2A7mi-hxNiNMFb_YzTF)
- [All of the generated samples](https://drive.google.com/open?id=1bQGuqH92xuxOX__reWLP4-cif0cbpMLX)
Note that in the generated samples we use the following vocoders: Griffin-Lim (**GL**), WaveNet vocoder (**WaveNet**), Parallel WaveGAN (**ParallelWaveGAN**), and MelGAN (**MelGAN**).
The neural vocoders are based on the following repositories.
- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder): 16 bit mixture of Logistics WaveNet vocoder
- [kan-bayashi/PytorchWaveNetVocoder](https://github.com/kan-bayashi/PytorchWaveNetVocoder): 8 bit Softmax WaveNet Vocoder with the noise shaping
If you want to build your own neural vocoder, please check the above repositories.
[kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) provides [the manual](https://github.com/kan-bayashi/ParallelWaveGAN#decoding-with-espnet-tts-models-features) about how to decode ESPnet-TTS model's features with neural vocoders. Please check it.
Here we list all of the pretrained neural vocoders. Please download and enjoy the generation of high quality speech!
| Model link | Lang | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Model type |
| --- | --- | --- | --- | --- | --- |
If you want to use the above pretrained vocoders, please match the feature settings with them exactly.
</div></details>
### TTS demo
<details><summary>ESPnet2</summary><div>
You can try the real-time demo in Google Colab.
Please access the notebook from the following link and enjoy the real-time synthesis!
- Real-time TTS demo with ESPnet2: [open in Colab](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)
English, Japanese, and Mandarin models are available in the demo.
</div></details>
<details><summary>ESPnet1</summary><div>
> NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest demo in the ESPnet2 section above.
You can try the real-time demo in Google Colab.
Please access the notebook from the following link and enjoy the real-time synthesis.
- Real-time TTS demo with ESPnet1: [open in Colab](https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb)
We also provide a shell script to perform synthesis.
Go to a recipe directory and run `utils/synth_wav.sh` as follows:
```sh
# go to recipe directory and source path of espnet tools
cd egs/ljspeech/tts1 && . ./path.sh
# we use an upper-case char sequence for the default model.
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt
# let's synthesize speech!
synth_wav.sh example.txt
# also, you can use multiple sentences
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example_multi.txt
echo "TEXT TO SPEECH IS A TECHNIQUE TO CONVERT TEXT INTO SPEECH." >> example_multi.txt
synth_wav.sh example_multi.txt
```
The WaveNet vocoder provides very high-quality speech, but it takes time to generate.
See more details or available models via `--help`.
```sh
synth_wav.sh --help
```
</div></details>
### VC results
<details><summary>expand</summary><div>
- Transformer and Tacotron2 based VC
You can listen to some samples on the [demo webpage](https://unilight.github.io/Publication-Demos/publications/transformer-vc/).
- Cascade ASR+TTS as one of the baseline systems of VCC2020
The [Voice Conversion Challenge 2020](http://www.vc-challenge.org/) (VCC2020) adopts ESPnet to build an end-to-end based baseline system.
In VCC2020, the objective is intra/cross lingual nonparallel VC.
You can download converted samples of the cascade ASR+TTS baseline system [here](https://drive.google.com/drive/folders/1oeZo83GrOgtqxGwF7KagzIrfjr8X59Ue?usp=sharing).
</div></details>
### SLU results
<details><summary>expand</summary><div>
We list the performance on various SLU tasks and datasets using the metric reported in the original dataset paper.
| Task | Dataset | Metric | Result | Pretrained Model |
| --- | --- | --- | --- | --- |
```
# utt1 ctc_align_test 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt2 ctc_align_test 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt3 ctc_align_test 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt4 ctc_align_test 4.20 4.97 -0.6017 AND CONCENTRATE
# utt5 ctc_align_test 4.97 6.10 -0.3477 ON PROPERTY MANAGEMENT
```
The output of the script can be redirected to a `segments` file by adding the argument `--output segments`.
Each line contains the file/utterance name, the utterance start and end times in seconds, and a confidence score; optionally also the utterance text.
The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:
```sh
min_confidence_score=-7
# here, we assume that the output was written to the file `segments`
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' segments
```
See the module documentation for more information.
It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files,
rather than Transformer models, which have high memory consumption on longer audio data.
The sample rate of the audio must be consistent with that of the data used in training; adjust with `sox` if needed.
Also, we can use this tool to provide token-level segmentation information if we prepare a list of tokens instead of utterances in the `text` file. See the discussion in https://github.com/espnet/espnet/issues/4278#issuecomment-1100756463.
</div></details>
## Citations
```
@inproceedings{watanabe2018espnet,
author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
title={{ESPnet}: End-to-End Speech Processing Toolkit},
booktitle={Proceedings of Interspeech},
pages={2207--2211},
year={2018}
}
@inproceedings{hayashi2020espnet,
title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={7654--7658},
year={2020},
organization={IEEE}
}
@inproceedings{inaguma-etal-2020-espnet,
title = "{ESP}net-{ST}: All-in-One Speech Translation Toolkit",
author = "Inaguma, Hirofumi and
Kiyono, Shun and
Duh, Kevin and
Karita, Shigeki and
Yalta, Nelson and
Hayashi, Tomoki and
Watanabe, Shinji",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
title={{ESPnet-SE}: End-to-End Speech Enhancement and Separation Toolkit Designed for {ASR} Integration},
author={Chenda Li and Jing Shi and Wangyou Zhang and Aswin Shanmugam Subramanian and Xuankai Chang and Naoyuki Kamo and Moto Hira and Tomoki Hayashi and Christoph Boeddeker and Zhuo Chen and Shinji Watanabe},
booktitle={Proceedings of IEEE Spoken Language Technology Workshop (SLT)},
pages={785--792},
year={2021},
organization={IEEE},
}
@inproceedings{arora2021espnet,
title={{ESPnet-SLU}: Advancing Spoken Language Understanding through ESPnet},
author={Arora, Siddhant and Dalmia, Siddharth and Denisov, Pavel and Chang, Xuankai and Ueda, Yushi and Peng, Yifan and Zhang, Yuekai and Kumar, Sujay and Ganesan, Karthik and Yan, Brian and others},
booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={7167--7171},
year={2022},
organization={IEEE}
}
@inproceedings{shi2022muskits,
author={Shi, Jiatong and Guo, Shuai and Qian, Tao and Huo, Nan and Hayashi, Tomoki and Wu, Yuning and Xu, Frank and Chang, Xuankai and Li, Huazhe and Wu, Peter and Watanabe, Shinji and Jin, Qin},
title={{Muskits}: an End-to-End Music Processing Toolkit for Singing Voice Synthesis},