Commit 764b3a75 authored by Sugon_ldc

add new model
# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at mikelei@mobvoi.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
# Contributing guidelines
## Pre-commit tidy/linting hook
You'll need to install flake8 first.
`pip install flake8==3.8.2`
We use flake8 to perform additional formatting and semantic checking of code.
We provide a pre-commit git hook for performing these checks, before a commit
is created:
```bash
ln -s ../../tools/git-pre-commit .git/hooks/pre-commit
```
You have to execute the above command in the WeNet project root directory.
After that, every commit will be checked by flake8.
If you do not set up the pre-commit hook, just run `flake8` in the WeNet project root directory
and fix all the reported problems.
## Github checks
After a pull request is submitted, some checks will run to check your code style.
Below is an example where some checks fail.
![github checks](docs/images/checks.png)
You need to click the details to see detailed information, as in the example below.
![github checks](docs/images/check_detail.png)
You have to fix all style problems according to the detailed info.
root=runtime/core
filter=-build/c++11
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# WeNet
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python-Version](https://img.shields.io/badge/Python-3.7%7C3.8-brightgreen)](https://github.com/wenet-e2e/wenet)
[**Roadmap**](https://github.com/wenet-e2e/wenet/issues/1683)
| [**Docs**](https://wenet-e2e.github.io/wenet)
| [**Papers**](https://wenet-e2e.github.io/wenet/papers.html)
| [**Runtime (x86)**](https://github.com/wenet-e2e/wenet/tree/main/runtime/libtorch)
| [**Runtime (android)**](https://github.com/wenet-e2e/wenet/tree/main/runtime/android)
| [**Pretrained Models**](docs/pretrained_models.md)
| [**HuggingFace**](https://huggingface.co/spaces/wenet/wenet_demo)
**We** share neural **Net** together.
The main motivation of WeNet is to close the gap between research and production end-to-end (E2E) speech recognition models,
to reduce the effort of productionizing E2E models, and to explore better E2E models for production.
## :fire: News
* 2022.12: New platforms supported: Horizon X3 Pi BPU (see https://github.com/wenet-e2e/wenet/pull/1597), Kunlun Core XPU (see https://github.com/wenet-e2e/wenet/pull/1455), Raspberry Pi (see https://github.com/wenet-e2e/wenet/pull/1477), and iOS (see https://github.com/wenet-e2e/wenet/pull/1549).
* 2022.11: TrimTail paper released, see https://arxiv.org/pdf/2211.00522.pdf
* 2022.10: Squeezeformer is supported, see https://github.com/wenet-e2e/wenet/pull/1447.
* 2022.07: RNN-T is supported now, see [rnnt](https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/rnnt) for benchmark.
## Highlights
* **Production first and production ready**: The core design principle of WeNet. WeNet provides full stack solutions for speech recognition.
* *Unified solution for streaming and non-streaming ASR*: [U2++ framework](https://arxiv.org/pdf/2203.15455.pdf)--develop, train, and deploy only once.
* *Runtime solution*: built-in server [x86](https://github.com/wenet-e2e/wenet/tree/main/runtime/libtorch) and on-device [android](https://github.com/wenet-e2e/wenet/tree/main/runtime/android) runtime solution.
* *Model exporting solution*: built-in solution to export model to LibTorch/ONNX for inference.
* *LM solution*: built-in production-level [LM solution](docs/lm.md).
* *Other production solutions*: built-in contextual biasing, time stamp, endpoint, and n-best solutions.
* **Accurate**: WeNet achieves SOTA results on a lot of public speech datasets.
* **Light weight**: WeNet is easy to install, easy to use, well designed, and well documented.
## Performance Benchmark
Please see `examples/$dataset/s0/README.md` for benchmark on different speech datasets.
## Installation (Python Only)
If you just want to use WeNet as a Python package for speech recognition applications,
just install it with `pip`. Please note that Python 3.6+ is required.
``` sh
pip3 install wenetruntime
```
Please see the [doc](runtime/binding/python/README.md) for usage.
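For reference, the sketch below shows roughly how the package is used, based on the Python binding's documented interface; treat the linked doc as the authoritative API.
```python
import sys

import wenetruntime as wenet

# Assumes a 16 kHz, 16-bit mono WAV file as input; `lang` selects the
# bundled pretrained model ('chs' for Chinese, 'en' for English).
wav_file = sys.argv[1]
decoder = wenet.Decoder(lang='chs')
result = decoder.decode_wav(wav_file)
print(result)
```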
## Installation (Training and Developing)
- Clone the repo
``` sh
git clone https://github.com/wenet-e2e/wenet.git
```
- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
- Create Conda env:
``` sh
conda create -n wenet python=3.8
conda activate wenet
pip install -r requirements.txt
conda install pytorch=1.10.0 torchvision torchaudio=0.10.0 cudatoolkit=11.1 -c pytorch -c conda-forge
```
- Optionally, if you want to use the x86 runtime or language model (LM),
you have to build the runtime as follows. Otherwise, you can just skip this step.
``` sh
# runtime build requires cmake 3.14 or above
cd runtime/libtorch
mkdir build && cd build && cmake -DGRAPH_TOOLS=ON .. && cmake --build .
```
## Discussion & Communication
Please visit [Discussions](https://github.com/wenet-e2e/wenet/discussions) for further discussion.
For Chinese users, you can also scan the QR code on the left to follow the official WeNet account.
We created a WeChat group for better discussion and quicker responses.
Please scan the personal QR code on the right, and the person shown will invite you to the chat group.
If you can not access the QR image, please access it on [gitee](https://gitee.com/robin1001/qr/tree/master).
| <img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"> | <img src="https://github.com/robin1001/qr/blob/master/binbin.jpeg" width="250px"> |
| ---- | ---- |
Or you can directly discuss on [Github Issues](https://github.com/wenet-e2e/wenet/issues).
## Contributors
| <a href="https://www.chumenwenwen.com" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/chumenwenwen.png" width="250px"></a> | <a href="http://lxie.npu-aslp.org" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/colleges/nwpu.png" width="250px"></a> | <a href="http://www.aishelltech.com" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/aishelltech.png" width="250px"></a> | <a href="http://www.ximalaya.com" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/ximalaya.png" width="250px"></a> | <a href="https://www.jd.com" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/jd.jpeg" width="250px"></a> |
| ---- | ---- | ---- | ---- | ---- |
| <a href="https://horizon.ai" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/hobot.png" width="250px"></a> | <a href="https://thuhcsi.github.io" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/colleges/thu.png" width="250px"></a> | <a href="https://www.nvidia.com/en-us" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/nvidia.png" width="250px"></a> | | | |
## Acknowledgement
1. We borrowed a lot of code from [ESPnet](https://github.com/espnet/espnet) for transformer-based modeling.
2. We borrowed a lot of code from [Kaldi](http://kaldi-asr.org/) for WFST-based decoding for LM integration.
3. We referred to [EESEN](https://github.com/srvk/eesen) for building the TLG-based graph for LM integration.
4. We referred to [OpenTransformer](https://github.com/ZhengkunTian/OpenTransformer/) for Python batch inference of E2E models.
## Citations
``` bibtex
@inproceedings{yao2021wenet,
title={WeNet: Production oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit},
author={Yao, Zhuoyuan and Wu, Di and Wang, Xiong and Zhang, Binbin and Yu, Fan and Yang, Chao and Peng, Zhendong and Chen, Xiaoyu and Xie, Lei and Lei, Xin},
booktitle={Proc. Interspeech},
year={2021},
address={Brno, Czech Republic },
organization={IEEE}
}
@article{zhang2022wenet,
title={WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit},
author={Zhang, Binbin and Wu, Di and Peng, Zhendong and Song, Xingchen and Yao, Zhuoyuan and Lv, Hang and Xie, Lei and Yang, Chao and Pan, Fuping and Niu, Jianwei},
journal={arXiv preprint arXiv:2203.15455},
year={2022}
}
```
# WeNet
[**English version**](https://github.com/wenet-e2e/wenet/tree/main/README.md)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python-Version](https://img.shields.io/badge/Python-3.7%7C3.8-brightgreen)](https://github.com/wenet-e2e/wenet)
[**Docs**](https://wenet-e2e.github.io/wenet/)
| [**Model Training Tutorial 1**](https://wenet.org.cn/wenet/tutorial_librispeech.html)
| [**Model Training Tutorial 2**](https://wenet.org.cn/wenet/tutorial_aishell.html)
| [**WeNet Papers**](https://wenet-e2e.github.io/wenet/papers.html)
| [**x86 Runtime (server)**](https://github.com/wenet-e2e/wenet/tree/main/runtime/libtorch)
| [**Android Runtime (on-device)**](https://github.com/wenet-e2e/wenet/tree/main/runtime/android)
## Key Features
WeNet is a speech recognition toolkit aimed at industrial applications. It provides an end-to-end pipeline from speech recognition model training to deployment. Its main features are as follows:
* A unified streaming/non-streaming ASR solution based on the Conformer architecture with joint CTC/attention loss optimization, achieving industry-leading recognition accuracy.
* Solutions for direct deployment on both the cloud and on devices, minimizing the engineering work between model training and production.
* A clean framework: model training is built entirely on the PyTorch ecosystem and does not depend on complex tools such as Kaldi.
* Detailed comments and documentation, well suited for learning the basics and implementation details of end-to-end speech recognition.
* Support for timestamps, alignment, endpoint detection, language models, and other related features.
## 1-Minute Demo
**Use a pretrained model and Docker to build a speech recognition system in one minute (given a fast enough network).**
Download the official pretrained model, start the Docker service, load the model, and serve speech recognition over the WebSocket protocol:
``` sh
wget https://wenet-1256283475.cos.ap-shanghai.myqcloud.com/models/aishell2/20210618_u2pp_conformer_libtorch.tar.gz
tar -xf 20210618_u2pp_conformer_libtorch.tar.gz
model_dir=$PWD/20210618_u2pp_conformer_libtorch
docker run --rm -it -p 10086:10086 -v $model_dir:/home/wenet/model wenetorg/wenet-mini:latest bash /home/run.sh
```
**Real-time recognition**
Open `wenet/runtime/libtorch/web/templates/index.html` in a browser, enter `ws://127.0.0.1:10086` in the `WebSocket URL` field (use `ws://localhost:10086` if Docker runs under WSL2 on Windows), allow the browser's request to use the microphone, and you can then do real-time speech recognition through the microphone.
![Runtime web](/docs/images/runtime_web.png)
## Training a Speech Recognition Model
**Set up the environment**
``` sh
git clone https://github.com/wenet-e2e/wenet.git
```
- Install Conda: https://docs.conda.io/en/latest/miniconda.html
- Create a Conda environment:
``` sh
conda create -n wenet python=3.8
conda activate wenet
pip install -r requirements.txt
conda install pytorch=1.10.0 torchvision torchaudio=0.10.0 cudatoolkit=11.1 -c pytorch -c conda-forge
```
**Train a model**
Train a model on the Chinese AISHELL-1 dataset:
```
cd examples/aishell/s0/
bash run.sh --stage -1
```
For details, please read the [**training tutorial**](https://wenet-e2e.github.io/wenet/tutorial_aishell.html).
### FAQ for Beginners
1. Please use a machine with GPUs and make sure both CUDA and torch are installed. WeNet also supports CPU training, but it is very slow.
2. Please use an environment that supports bash. The default cmd on Windows does not support bash.
3. In the run.sh script, change `export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"` to the GPU ids you want to use. For example, if your machine has 4 GPU cards and you want to train on all of them, change it to `export CUDA_VISIBLE_DEVICES="0,1,2,3"`.
4. In the run.sh script, set `data=/export/data/asr-data/OpenSLR/33/` to your own directory. Please use an absolute path rather than a relative path.
5. If resuming training fails, delete the ddp_init file under the experiment directory and try again.
## Technical Support
Feel free to submit questions on [Github Issues](https://github.com/wenet-e2e/wenet/issues).
You are also welcome to scan the QR code to join our WeChat discussion group; if the group is already large, please add the personal WeChat on the right to be invited to the group.
| <img src="https://github.com/robin1001/qr/blob/master/wenet.jpeg" width="250px"> | <img src="https://github.com/robin1001/qr/blob/master/binbin.jpeg" width="250px"> |
| ---- | ---- |
## Contributors
| <a href="https://www.chumenwenwen.com" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/chumenwenwen.png" width="250px"></a> | <a href="http://lxie.npu-aslp.org" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/colleges/nwpu.png" width="250px"></a> | <a href="http://www.aishelltech.com" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/aishelltech.png" width="250px"></a> | <a href="http://www.ximalaya.com" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/ximalaya.png" width="250px"></a> | <a href="https://www.jd.com" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/jd.jpeg" width="250px"></a> |
| ---- | ---- | ---- | ---- | ---- |
| <a href="https://horizon.ai" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/hobot.png" width="250px"></a> | <a href="https://thuhcsi.github.io" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/colleges/thu.png" width="250px"></a> | <a href="https://www.nvidia.com/en-us" target="_blank"><img src="https://raw.githubusercontent.com/wenet-e2e/wenet-contributors/main/companies/nvidia.png" width="250px"></a> | | | |
## Acknowledgements
WeNet borrows from several excellent open-source projects, including:
1. Transformer modeling from [ESPnet](https://github.com/espnet/espnet)
2. WFST decoding from [Kaldi](http://kaldi-asr.org/)
3. TLG graph construction from [EESEN](https://github.com/srvk/eesen)
4. Python batch inference from [OpenTransformer](https://github.com/ZhengkunTian/OpenTransformer/)
## Citations
``` bibtex
@inproceedings{yao2021wenet,
title={WeNet: Production oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit},
author={Yao, Zhuoyuan and Wu, Di and Wang, Xiong and Zhang, Binbin and Yu, Fan and Yang, Chao and Peng, Zhendong and Chen, Xiaoyu and Xie, Lei and Lei, Xin},
booktitle={Proc. Interspeech},
year={2021},
address={Brno, Czech Republic },
organization={IEEE}
}
@article{zhang2020unified,
title={Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition},
author={Zhang, Binbin and Wu, Di and Yao, Zhuoyuan and Wang, Xiong and Yu, Fan and Yang, Chao and Guo, Liyong and Hu, Yaguang and Xie, Lei and Lei, Xin},
journal={arXiv preprint arXiv:2012.05481},
year={2020}
}
@article{wu2021u2++,
title={U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition},
author={Wu, Di and Zhang, Binbin and Yang, Chao and Peng, Zhendong and Xia, Wenjing and Chen, Xiaoyu and Lei, Xin},
journal={arXiv preprint arXiv:2106.05642},
year={2021}
}
```
# WeNet Roadmap
This is the roadmap for WeNet.
WeNet is a community-driven project and we love your feedback and proposals on where we should be heading.
Please open up [issues](https://github.com/wenet-e2e/wenet/issues/) or
[discussions](https://github.com/wenet-e2e/wenet/discussions) on GitHub to share your proposal.
Feel free to volunteer yourself if you are interested in trying out some items (they do not have to be on the list).
## WeNet 3.0 (2023.06)
- [x] ONNX support, see https://github.com/wenet-e2e/wenet/pull/1103
- [x] RNN-T support, see https://github.com/wenet-e2e/wenet/pull/1261
- [ ] Self training, streaming
- [ ] Light weight, low latency, on-device model exploration
- [x] TrimTail, see https://github.com/wenet-e2e/wenet/pull/1487/, [paper link](https://arxiv.org/pdf/2211.00522.pdf)
- [ ] Audio-Visual speech recognition
- [ ] OS or Hardware Platforms
  - [x] iOS, see https://github.com/wenet-e2e/wenet/pull/1549
  - [x] Raspberry Pi, see https://github.com/wenet-e2e/wenet/pull/1477
  - [ ] Harmony OS
  - [ ] ASIC XPU
    - [x] Horizon X3 pi, BPU, see https://github.com/wenet-e2e/wenet/pull/1597
    - [x] Kunlun XPU, see https://github.com/wenet-e2e/wenet/pull/1455
- [x] Public Model Hub Support
  - [x] HuggingFace, see https://huggingface.co/spaces/wenet/wenet_demo
  - [x] ModelScope, see https://modelscope.cn/models/wenet/u2pp_conformer-asr-cn-16k-online/summary
- [x] [Vosk](https://github.com/alphacep/vosk-api/) like models and API for developers.
  - [x] Models (Chinese/English/Japanese/Korean/French/German/Spanish/Portuguese)
    - [x] Chinese
    - [x] English
  - [x] API (python/c/c++/go/java)
    - [x] python
## WeNet 2.0 (2022.06)
- [x] U2++ framework for better accuracy
- [x] n-gram + WFST language model solution
- [x] Context biasing(hotword) solution
- [x] Very big data training support with UIO
- [x] More dataset support, including WenetSpeech, GigaSpeech, HKUST and so on.
## WeNet 1.0 (2021.02)
- [x] Streaming solution(U2 framework)
- [x] Production runtime solution with `TorchScript` training and `LibTorch` inference.
- [x] Unified streaming and non-streaming model(U2 framework)
_gen/
_build/
build/
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SPHINXPROJ = Wenet
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
# UIO for WeNet
In order to support model training on industrial-scale datasets with tens of millions of hours of speech, the data processing
method in WeNet has been upgraded to UIO (Unified IO). This document introduces UIO in the following sections:
the necessity of upgrading the IO method, the system design of UIO, validation experiments, usage of UIO, and Q&A.
## Necessity of upgrading the IO method
The old IO method in WeNet is based on PyTorch's native Dataset. During training, it needs to load all training audio
paths and the corresponding labels into memory at once, and then read the data randomly. With industrial-grade
ultra-large-scale data (e.g., more than 50,000 hours or 50 million audio files), this method causes training
to fail for two reasons:
- Out of memory (OOM): the physical memory of a typical machine cannot hold all the training metadata at once.
- Slow reading performance: when the large-scale data cannot be cached in memory by the file system, the
data reading speed during training drops greatly.
## System design of UIO
### Overall design
Inspired by industrial approaches such as [webdataset](https://github.com/webdataset/webdataset) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord),
WeNet redesigned its IO method. The core idea is to pack the audio and labels of many small samples (e.g., 1000 items)
into compressed packages (tar) and read them based on PyTorch's IterableDataset. The advantages of this method are:
- Only the index information of the compressed packages needs to be kept in memory, which greatly saves memory and
solves the OOM problem.
- Decompression is performed on the fly in memory, and the data within the same compressed package is read
sequentially, which solves the problem of slow random-read performance. Different compressed packages can be read in random order
to ensure the global randomness of the data.
The new IO method handles both small and large datasets and provides two data reading modes.
We call it UIO. The overall design of UIO is shown in the figure below:
![UIO System Design](./images/UIO_system.png)
Some necessary explanations about the figure above:
- Small IO (raw) supports small datasets; we call this ``raw`` mode. It only supports local file reading.
The required files must be organized into Kaldi-style files: wav.scp and text (the same as before).
- Big IO (shard) supports large datasets; we call this ``shard`` mode. It supports both local file reading
and reading from network cloud storage. The required files must be packed into compressed packages, in which the audio (wav)
and label (txt) of each sample are stored consecutively. A minimal packing sketch is given after this list.
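For illustration, the following sketch (not WeNet's actual packing tool) shows how a list of (key, wav path, transcript) entries could be packed into one shard, with each sample's `<key>.wav` stored immediately before its `<key>.txt`:
```python
import io
import tarfile

def write_shard(items, shard_path):
    """items: list of (key, wav_path, transcript) tuples for one shard."""
    with tarfile.open(shard_path, "w:gz") as tar:
        for key, wav_path, txt in items:
            # Store the audio and its label consecutively under the same key.
            tar.add(wav_path, arcname=key + ".wav")
            data = txt.encode("utf8")
            info = tarfile.TarInfo(name=key + ".txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

# e.g. write_shard(samples[:1000], "shards/shards_000000000.tar.gz")
```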
### Chain IO
Inspired by TFRecord's chained IO, UIO also adopts a chained implementation. In practice, chained IO is more flexible,
easier to extend, and easier to debug. A TFRecord IO example is as follows:
```python
def read_dataset(filename, batch_size):
    dataset = tf.data.TFRecordDataset(filename)
    dataset = dataset.map(_parse_image_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.shuffle(500)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.repeat()
    dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return dataset
```
Following the TFRecord design, the UIO dataflow in WeNet is designed as in the figure below:
![UIO dataflow](./images/UIO_dataflow.png)
It includes the following modules (a simplified chained-pipeline sketch is given after this list):
- tokenize module: convert the label into the specified modeling unit (e.g., char or BPE).
- filter module: filter out unqualified training data.
- resample module: optional resampling of the training data.
- compute_fbank module: fbank feature extraction.
- spec_augmentation module: feature augmentation.
- shuffle module: shuffle the data locally.
- sort module: sort the data locally.
- batch module: organize multiple samples into a batch.
- padding module: pad the data within the same batch.
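As a rough sketch of how such a chained pipeline can be expressed with PyTorch's `IterableDataset` (the names below are illustrative, not WeNet's actual `wenet/dataset` API), each stage wraps the previous one and processes the stream lazily:
```python
import random

from torch.utils.data import IterableDataset


class Processor(IterableDataset):
    """Chain one processing stage onto a source iterable."""

    def __init__(self, source, fn, *args):
        self.source = source
        self.fn = fn
        self.args = args

    def __iter__(self):
        return self.fn(iter(self.source), *self.args)


def shuffle(data, buffer_size=1500):
    # Buffer samples locally and yield them in random order.
    buf = []
    for sample in data:
        buf.append(sample)
        if len(buf) >= buffer_size:
            random.shuffle(buf)
            yield from buf
            buf = []
    random.shuffle(buf)
    yield from buf


def sort(data, buffer_size=500):
    # Sort a local buffer by feature length to reduce padding within a batch
    # (assumes each sample dict carries a 'feat' tensor).
    buf = []
    for sample in data:
        buf.append(sample)
        if len(buf) >= buffer_size:
            buf.sort(key=lambda x: x['feat'].size(0))
            yield from buf
            buf = []
    buf.sort(key=lambda x: x['feat'].size(0))
    yield from buf


# Chained usage, mirroring the figure: source -> shuffle -> sort -> ...
# dataset = Processor(Processor(source, shuffle, 1500), sort, 500)
```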
In addition, there are several parameters to note. First, the ``shuffle buffer`` and ``sort buffer`` sizes:
* ``Shuffle buffer``: shuffles the data. It is recommended that this buffer be larger than
the number of samples contained in a single shard, so that each shuffle effectively mixes data across two shards,
which increases the randomness of the data (e.g., if each shard contains 1000 samples, you can set the shuffle buffer to 1500).
* ``Sort buffer``: sorts the data by the number of frames. This operation is very important and can greatly
improve the training speed.
Second, ``prefetch``:
``Prefetch`` is used in the PyTorch ``DataLoader`` to pre-read data. The granularity of prefetch is the final training batch.
The default value is 2, which means the data of two batches is pre-read by default. In the UIO design,
because of the buffers described above, the "pre-read" data may already be sitting in a buffer, so no real pre-reading happens.
Only when the buffered data is insufficient for the next step is the buffer refilled on the fly,
and at that moment training blocks on reading data. In short, when the prefetch is very small, training will
block on data reading part of the time because the earlier data is still in the buffers. Therefore, you can set a large
prefetch value to avoid this problem, as in the sketch below.
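For example, with PyTorch's ``DataLoader`` the prefetch depth can be raised via ``prefetch_factor``; this is a sketch that assumes the chained ``IterableDataset`` above already emits padded batches:
```python
from torch.utils.data import DataLoader

# batch_size=None disables automatic batching because the pipeline already
# emits batches; prefetch_factor (default 2) controls how many batches each
# worker pre-reads.
train_loader = DataLoader(dataset,
                          batch_size=None,
                          num_workers=4,
                          prefetch_factor=8,
                          persistent_workers=True)
```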
## Validation experiments
At present, we have verified the accuracy of UIO on the AISHELL-1 (200 hours) and WenetSpeech (10,000 hours) datasets.
### Aishell(``raw`` vs ``shard``)
|IO|CER|
|:---|:---|
|Old|4.61|
|UIO(``Raw``)|4.63|
|UIO(``Shard``)|4.67|
### WenetSpeech(``shard``)
![UIO WenetSpeech](./images/UIO_wenetspeech_cer.png)
WeNet and ESPnet use a similar model structure and parameter configuration and achieve a similar recognition accuracy,
which supports the correctness of UIO in WeNet. During training, we also observed that the overall GPU utilization
with UIO stays above 80%-90%, indicating that the IO reading efficiency is very high.
## Usage of UIO
For detailed usage of UIO, please refer to the aishell dataset example:
https://github.com/wenet-e2e/wenet/blob/main/examples/aishell/s0/run.sh
At present, all recipes in WeNet use UIO as the default data preparation.
There are three parameters related to UIO in the training script train.py:
- ``train_data``(``cv_data``/``test_data``): data.list
- ``data_type``: raw/shard
- ``symbol_table``: specifies the modeling unit
For example:
```shell
python wenet/bin/train.py --gpu $gpu_id \
--config $train_config \
--data_type $data_type \
--symbol_table $dict \
--train_data $feat_dir/$train_set/data.list \
--cv_data $feat_dir/dev/data.list \
...
```
If data_type is ``raw``, the format of data.list is as follows:
```
{"key": "BAC009S0002W0122", "wav": "/export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0122.wav", "txt": "而对楼市成交抑制作用最大的限购"}
{"key": "BAC009S0002W0123", "wav": "/export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0123.wav", "txt": "也成为地方政府的眼中钉"}
{"key": "BAC009S0002W0124", "wav": "/export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0124.wav", "txt": "自六月底呼和浩特市率先宣布取消限购后"}
```
Each line is a JSON-serialized string that contains three fields: ``key``, ``wav``, and ``txt``.
If data_type is ``shard``, the format of data.list is as follows:
```
# [option 1: local]
/export/maryland/binbinzhang/code/wenet/examples/aishell/s3/raw_wav/train/shards/shards_000000000.tar.gz
/export/maryland/binbinzhang/code/wenet/examples/aishell/s3/raw_wav/train/shards/shards_000000001.tar.gz
/export/maryland/binbinzhang/code/wenet/examples/aishell/s3/raw_wav/train/shards/shards_000000002.tar.gz
# [option 2: network(egs: OSS)]
https://examplebucket.oss-cn-hangzhou.aliyuncs.com/exampledir/1.tar.gz
https://examplebucket.oss-cn-hangzhou.aliyuncs.com/exampledir/2.tar.gz
```
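As an illustration (not WeNet's actual reader), a shard entry can be consumed in streaming mode as follows, assuming each `<key>.wav` member is stored immediately before its `<key>.txt` member:
```python
import tarfile
import urllib.request

def read_shard(url_or_path):
    """Yield {'key', 'wav', 'txt'} dicts from a local or remote shard."""
    if url_or_path.startswith("http"):
        fileobj = urllib.request.urlopen(url_or_path)
        tar = tarfile.open(fileobj=fileobj, mode="r|gz")  # streaming read
    else:
        tar = tarfile.open(url_or_path, mode="r|gz")
    with tar:
        key, wav = None, None
        for member in tar:
            name, ext = member.name.rsplit(".", 1)
            data = tar.extractfile(member).read()
            if ext == "wav":
                key, wav = name, data
            elif ext == "txt" and name == key:
                yield {"key": key, "wav": wav, "txt": data.decode("utf8")}
```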
## Q&A
Q1: How is the training data partitioned across GPUs and DataLoader workers?
A: The data list can be split according to the rank, world size, worker id, and number of workers, for example (simplified):
```python
import random

import torch.distributed as dist
from torch.utils.data import get_worker_info

class DistributedSampler:
    def __init__(self, shuffle=True, partition=True):
        self.epoch = -1
        self.update()
        self.shuffle = shuffle
        self.partition = partition

    def update(self):
        # Simplified: cache the DDP rank/world size and DataLoader worker info.
        self.rank = dist.get_rank() if dist.is_initialized() else 0
        self.world_size = dist.get_world_size() if dist.is_initialized() else 1
        info = get_worker_info()
        self.worker_id = info.id if info is not None else 0
        self.num_workers = info.num_workers if info is not None else 1

    def set_epoch(self, epoch):
        self.epoch = epoch

    def sample(self, data):
        # Shuffle the shard list (seeded by epoch), split by rank, then by worker.
        data = list(range(len(data)))
        if self.partition:
            if self.shuffle:
                random.Random(self.epoch).shuffle(data)
            data = data[self.rank::self.world_size]
        data = data[self.worker_id::self.num_workers]
        return data
```
Q2: How to deal with unbalanced data?
A: Use `model.join()` to handle the uneven amount of data allocated to each rank. Please refer to [this tutorial](https://pytorch.org/tutorials/advanced/generic_join.html#how-does-join-work); a sketch is given below.
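A minimal sketch, assuming `model` is already wrapped in `torch.nn.parallel.DistributedDataParallel` and `dataloader` yields per-rank batches:
```python
# Ranks that exhaust their data early keep answering the collective
# communication of slower ranks instead of hanging.
with model.join():
    for batch in dataloader:
        loss = model(**batch)   # placeholder forward pass returning a loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```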
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
# -- Project information -----------------------------------------------------
project = 'Wenet'
copyright = '2020, wenet-team'
author = 'wenet-team'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"nbsphinx",
"sphinx.ext.autodoc",
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
"sphinx.ext.mathjax",
"sphinx.ext.todo",
# "sphinxarg.ext",
"sphinx_markdown_tables",
'recommonmark',
'sphinx_rtd_theme',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
source_suffix = {
'.rst': 'restructuredtext',
'.txt': 'markdown',
'.md': 'markdown',
}
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
# html_theme = 'alabaster'
html_theme = "sphinx_rtd_theme"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
## Context Biasing
In practical ASR applications, commonly used words are usually recognized well, but the accuracy for some domain-specific words may be low. Contextual biasing is the problem of injecting prior knowledge into an ASR system during inference, for example a user's favorite songs, contacts, apps or location. Conventional ASR systems perform contextual biasing by building an n-gram finite state transducer (FST) from a list of biasing phrases, which is composed on-the-fly with the decoder graph during decoding. This helps to bias the recognition result towards the n-grams contained in the contextual FST, and thus improves accuracy in certain scenarios.
In WeNet, we compute the biasing scores $P_C(\mathbf y)$, which are interpolated with the base model $P(\mathbf y|\mathbf x)$ using shallow-fusion during beam search, including CTC prefix beam search and CTC WFST beam search.
$$
\mathbf{y}^* = \arg\max_{\mathbf{y}} \left[ \log P(\mathbf{y} \mid \mathbf{x}) + \lambda \log P_C(\mathbf{y}) \right]
$$
where $\lambda$ is a tunable hyperparameter controlling how much the contextual LM influences the overall model score during beam search.
### Context Graph
Suppose we want to boost the score of the word "cat", and the biasing score $\lambda\,\log P_C(\mathbf{y})$ of each character is 0.25. The context graph can be constructed as follows:
![context graph](images/context_graph.png)
In the decoding process, when the corresponding prefix is matched, the corresponding score reward will be obtained. In order to avoid artificially boosting prefixes which match early on but do not match the entire phrase, we add a special failure arc which removes the boosted score.
WeNet records only one state for each prefix, to easily determine the boundary of the matched hot word. That is, only one hot word can be matched at the same time, and only after the hot word matching succeeds or fails can other hot words start matching.
``` c++
int ContextGraph::GetNextState(int cur_state, int word_id, float* score,
                               bool* is_start_boundary, bool* is_end_boundary) {
  int next_state = 0;
  // Traverse the arcs of the current state.
  for (fst::ArcIterator<fst::StdFst> aiter(*graph_, cur_state); !aiter.Done();
       aiter.Next()) {
    const fst::StdArc& arc = aiter.Value();
    if (arc.ilabel == 0) {
      // Record the score of the backoff arc. It might be overwritten below.
      *score = arc.weight.Value();
    } else if (arc.ilabel == word_id) {
      // If the label matches, record the next state and the score.
      next_state = arc.nextstate;
      *score = arc.weight.Value();
      // Check whether this is a boundary of the hot word.
      if (cur_state == 0) {
        *is_start_boundary = true;
      }
      if (graph_->Final(arc.nextstate) == fst::StdArc::Weight::One()) {
        *is_end_boundary = true;
      }
      break;
    }
  }
  return next_state;
}
```
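For readers less familiar with FSTs, the same matching logic can be mimicked in plain Python with a prefix tree. This is purely illustrative: the per-token reward and the failure/backoff behavior are hand-coded here, whereas WeNet stores them as FST arc weights.
```python
class ContextTrie:
    """Toy prefix-tree version of the context graph above."""

    def __init__(self, phrases, score_per_token=0.25):
        self.score = score_per_token
        self.root = {}
        for phrase in phrases:              # phrase: a list of token ids
            node = self.root
            for token in phrase:
                node = node.setdefault(token, {})
            node["end"] = True              # end-of-phrase marker

    def next_state(self, state, token):
        """Return (next_state, score_delta) for one decoding step.

        state is (node, depth): depth tokens of some phrase are matched."""
        node, depth = state
        if token in node:
            child = node[token]
            if child.get("end"):            # full phrase matched: keep the reward
                return (self.root, 0), self.score
            return (child, depth + 1), self.score
        # Mismatch: fall back to the root and take back the partial reward,
        # just like the failure arc in the context graph.
        return (self.root, 0), -depth * self.score

# usage: state = (trie.root, 0); state, delta = trie.next_state(state, token_id)
```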
### CTC Prefix Beam Search
In the process of CTC prefix beam search, each prefix needs to record the hot word matching information. After appending the current output character, if the prefix changes, the function `GetNextState` above is called to update the state and score of the hot word. If it is the start or end of a hot word, it is also necessary to record the position, which is used to insert the start and end tags into the result, such as: "The \<context>cat\</context> is in the bag".
### CTC WFST Beam Search
WeNet adopts the Lattice Faster Online Decoder from Kaldi for WFST beam search. We have to modify `lattice-faster-decoder.cc` to support context biasing.
WFST beam search decodes on the TLG graph according to the CTC outputs. If we biased the input labels of the TLG, we would need to compose the context graph with the token graph. Instead, we decided to bias the TLG's output labels towards the contextual FST. We need to modify the `ProcessEmitting` and `ProcessNonemitting` functions as follows:
```c++
Elem *e_next =
    FindOrAddToken(arc.nextstate, frame + 1, tot_cost, tok, NULL);
// NULL: no change indicator needed
// ========== Context code BEGIN ==========
bool is_start_boundary = false;
bool is_end_boundary = false;
float context_score = 0;
if (context_graph_) {
  if (arc.olabel == 0) {
    e_next->val->context_state = tok->context_state;
  } else {
    e_next->val->context_state = context_graph_->GetNextState(
        tok->context_state, arc.olabel, &context_score,
        &is_start_boundary, &is_end_boundary);
    graph_cost -= context_score;
  }
}
// ========== Context code END ==========
// Add ForwardLink from tok to next_tok (put on head of list tok->links).
tok->links = new ForwardLinkT(e_next->val, arc.ilabel, arc.olabel,
                              graph_cost, ac_cost, is_start_boundary,
                              is_end_boundary, tok->links);
tok->links->context_score = context_score;
```
### Pruning
The backoff arc returns the accumulated scores on a single ForwardLink, which makes the cost of that ForwardLink too large. We therefore have to remove the cost returned by the backoff arc before pruning.
```c++
void LatticeFasterDecoderTpl<FST, Token>::PruneForwardLinks(
    int32 frame_plus_one, bool *extra_costs_changed, bool *links_pruned,
    BaseFloat delta) {
  ...
  BaseFloat link_extra_cost =
      next_tok->extra_cost +
      ((tok->tot_cost + link->acoustic_cost + link->graph_cost) -
       next_tok->tot_cost);  // difference in brackets is >= 0
  // ========== Context code BEGIN ==========
  // graph_cost contains the score of the hot word.
  // link->context_score < 0 means the hot word reward on this link was
  // taken back by a backoff arc.
  if (link->context_score < 0) {
    link_extra_cost += link->context_score;
  }
  // ========== Context code END ==========
  // link_extra_cost is the difference in score between the best paths
  // through the link source state and through the link destination state.
```
### Usage
1. Specify `--context_path` as a text file.
   - Each line of the file contains one context phrase.
   - Each context phrase can be split into words with the symbol_table of the ASR model (i.e., there is no OOV in the context).
2. Specify `--context_score`, the reward for each word in the context.
```bash
cd /home/wenet/runtime/libtorch
export GLOG_logtostderr=1
export GLOG_v=2
wav_path=docker_resource/test.wav
context_path=docker_resource/context.txt
model_dir=docker_resource/model
./build/decoder_main \
--chunk_size -1 \
--wav_path $wav_path \
--model_path $model_dir/final.zip \
--context_path $context_path \
--context_score 3 \
--unit_path $model_dir/units.txt 2>&1 | tee log.txt
```