Commit ee10550a authored by liugh5

Initial commit

徐玠诡谲多智,善揣摩,知道徐知询不可辅佐,掌握着他的短处以归附徐知诰。
许乐夫生于山东省临朐县杨善镇大辛庄,毕业于抗大一分校。
宣统元年(1909年),顺德绅士冯国材在香山大黄圃成立安洲农务分会,管辖东海十六沙,冯国材任总理。
学生们大多住在校区宿舍,通过参加不同的体育文化俱乐部及社交活动,形成一个友谊长存的社会圈。
学校的“三节一会”(艺术节、社团节、科技节、运动会)是显示青春才华的盛大活动。
雪是先天自闭症患者,不懂与人沟通,却拥有灵敏听觉,而且对复杂动作过目不忘。
勋章通过一柱状螺孔和螺钉附着在衣物上。
雅恩雷根斯堡足球俱乐部()是一家位于德国雷根斯堡的足球俱乐部,处于德国足球丙级联赛。
亚历山大·格罗滕迪克于1957年证明了一个深远的推广,现在叫做格罗滕迪克–黎曼–罗赫定理。
MIT License
Copyright (c) 2022 Alibaba Research
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# KAN-TTS
With KAN-TTS you can train your own TTS model from zero to hero :).
## Models
Currently we support SAMBERT and HiFi-GAN; more models are coming soon.
## Supported Languages
| Language | Model Links |
| :---: | :---: |
| Mandarin | https://modelscope.cn/models?name=zhcn&page=1&tasks=text-to-speech&type=audio |
| English (US) | https://modelscope.cn/models?name=enus&page=1&tasks=text-to-speech&type=audio |
| English (UK) | https://modelscope.cn/models?name=engb&page=1&tasks=text-to-speech&type=audio |
| Shanghainese | https://modelscope.cn/models?name=WuuShanghai&page=1&tasks=text-to-speech&type=audio |
| Sichuanese | https://modelscope.cn/models?name=Sichuan&page=1&tasks=text-to-speech&type=audio |
| Cantonese | https://modelscope.cn/models?name=Cantonese&page=1&tasks=text-to-speech&type=audio |
| Italian | https://modelscope.cn/models?name=itit&page=1&tasks=text-to-speech&type=audio |
| Spanish | https://modelscope.cn/models?name=eses&page=1&tasks=text-to-speech&type=audio |
| Russian | https://modelscope.cn/models?name=ruru&page=1&tasks=text-to-speech&type=audio |
| Korean | https://modelscope.cn/models?name=kokr&page=1&tasks=text-to-speech&type=audio |
More languages are coming soon.
## Training Tutorial
You can find the training tutorial in our wiki page [KAN-TTS Wiki](https://github.com/AlibabaResearch/KAN-TTS/wiki).
## ModelScope Demo
Try our demo on ModelScope [KAN-TTS Demo](https://modelscope.cn/models?page=1&tasks=text-to-speech).
## Contribute to this repo
```shell
pip install -r requirements.txt
pre-commit install
```
## Contact us
If you have any questions, please feel free to contact us.
Scan the QR code to join our DingTalk group.
<img src="https://raw.githubusercontent.com/wiki/alibaba-damo-academy/KAN-TTS/resources/images/kantts_dinggroup.png" width="200" height="200" />
# sambert-hifigan_pytorch
## Papers
[RobuTrans: A Robust Transformer-Based Text-to-Speech Model](https://ojs.aaai.org/index.php/AAAI/article/view/6337)
[HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)
## Model Architecture
SAMBERT acoustic model with prosody modeling: in speech synthesis, FastSpeech-style parallel models are currently the mainstream; they model the three prosodic representations, pitch, energy, and duration, separately. However, such models share some quality and performance problems: modeling duration, pitch, and energy independently ignores their intrinsic correlation; a fully non-autoregressive network cannot meet the latency requirements of industrial real-time synthesis; and frame-level pitch and energy prediction is unstable. The DAMO Academy speech lab therefore designed SAMBERT, an improved parallel TTS model with the following advantages:
1. The backbone uses a Self-Attention Mechanism (SAM), strengthening the model's capacity.
2. The encoder is initialized from BERT, injecting richer text information and improving the prosody of the synthesized speech.
3. The Variance Adaptor makes a coarse-grained prediction of the phoneme-level prosody contours (pitch, energy, duration), which the decoder then models at fine, frame-level granularity. Duration prediction takes its correlation with pitch and energy into account through an autoregressive structure, further improving prosodic naturalness (see the sketch after the figure below).
4. The decoder uses a PNCA AR-Decoder [@li2020robutrans], which naturally supports streaming synthesis.
![sambert.jpg](./assets/sambert.jpg)
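The joint, autoregressive prosody prediction described in point 3 can be sketched in a few lines of PyTorch. This is a minimal illustration, not the KAN-TTS implementation; the module name `ProsodyPredictor` and all dimensions are invented for the example.
```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Toy phoneme-level predictor of (pitch, energy, duration).

    The three values are predicted jointly, and each step is conditioned
    on the previous step's prediction, mirroring point 3 above.
    """

    def __init__(self, hidden=256):
        super().__init__()
        # input = encoder state + previous (pitch, energy, duration) triple
        self.rnn = nn.GRU(hidden + 3, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 3)

    def forward(self, enc):  # enc: (B, N_phonemes, hidden)
        B, N, _ = enc.shape
        prev = enc.new_zeros(B, 1, 3)  # start step: zero prosody
        state, outs = None, []
        for t in range(N):
            x = torch.cat([enc[:, t : t + 1], prev], dim=-1)
            h, state = self.rnn(x, state)
            prev = self.proj(h)  # joint (pitch, energy, duration) at step t
            outs.append(prev)
        return torch.cat(outs, dim=1)  # (B, N, 3) coarse phoneme-level contour

# e.g.: contour = ProsodyPredictor()(torch.randn(2, 12, 256))
```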
## Algorithm
For transfer learning, a multi-speaker acoustic model must be built first; speakers are distinguished by a trainable speaker embedding. Given a new speaker, the usual approach is to randomly initialize a speaker embedding and then update it on that speaker's data (speaker space 1 in the figure below). For personalized speech synthesis, however, the amount of data per speaker is small, learning is hard, and the similarity of the synthesized voice cannot be guaranteed. We therefore represent each speaker by speaker feature information: a speaker embedding initialized from a small amount of the speaker's data lies much closer to the actual target speaker (speaker space 2 in the figure below), learning becomes easier, and the synthesized voice is markedly more similar. With this feature-based personalization, good similarity is achieved even with as few as 20 recorded sentences.
![feature_space.png](./assets/feature_space.png)
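The two initialization strategies contrasted above can be sketched as follows. This is an illustrative sketch, assuming the `se.npy` file produced by the feature-extraction script later on this page holds one speaker-verification embedding per utterance; the table size and embedding dimension are invented for the example.
```python
import numpy as np
import torch
import torch.nn as nn

num_speakers, emb_dim = 500, 192  # invented sizes for illustration
spk_table = nn.Embedding(num_speakers + 1, emb_dim)  # extra slot = new speaker
new_id = num_speakers

# Speaker space 1: random init, then fine-tune on the new speaker's data.
nn.init.normal_(spk_table.weight.data[new_id], std=0.02)

# Speaker space 2: init from speaker feature information extracted from a
# handful of the speaker's utterances (path as produced by the script below).
se = torch.from_numpy(np.load("training_stage/ptts_feats/se/se.npy")).float()
spk_table.weight.data[new_id] = se.reshape(-1, se.shape[-1]).mean(dim=0)
```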
The model framework consists of three parts:
1. Automated data processing and labeling
2. The SAMBERT acoustic model with prosody modeling
3. Personalized speech synthesis based on speaker feature information
![ptts.png](./assets/ptts.png)
## Environment Setup
### Docker (Option 1)
```shell
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk23.10-py38
# Replace imageID below with the ID of the image pulled above
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal:/opt/hyhal --shm-size=32G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name docker_name imageID bash
cd /path/workspace/
# Set up the KAN-TTS environment
git clone -b develop https://github.com/alibaba-damo-academy/KAN-TTS.git
cd KAN-TTS
# Pull the pretrained model
git clone https://www.modelscope.cn/damo/speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k.git
pip3 install -r requirements.txt
```
## Dataset
You can download the AISHELL-3 open-source speech synthesis dataset from ModelScope, already processed into the Alibaba standard format, and use it for the following steps. If you only have plain audio data, you can convert it into this format with the PTTS Autolabel automatic labeling tool.
[Training data](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/TTS/download_files/test_female.zip)
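For convenience, here is a minimal Python sketch for fetching and unpacking the sample data linked above; the `Data/` target directory is an assumption based on the paths used in the commands below.
```python
import io
import urllib.request
import zipfile

URL = ("https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/TTS/"
       "download_files/test_female.zip")

# Download the archive into memory and extract it under Data/.
with urllib.request.urlopen(URL) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall("Data/")
```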
## Training
#### Single-card training
```shell
HIP_VISIBLE_DEVICES=0 python3 kantts/bin/train_sambert.py \
--model_config speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/sambert/config.yaml \
--root_dir training_stage/ptts_feats \
--stage_dir training_stage/ptts_sambert_ckpt \
--resume_path speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/sambert/ckpt/checkpoint_*.pth
```
#### Single-card inference
```shell
HIP_VISIBLE_DEVICES=0 python3 kantts/bin/text_to_wav.py \
--txt Data/test.txt \
--output_dir res/ptts_syn \
--res_zip speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/resource.zip \
--am_ckpt training_stage/ptts_sambert_ckpt/ckpt/checkpoint_2402200.pth \
--voc_ckpt speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/hifigan/ckpt/checkpoint_2400000.pth \
--se_file training_stage/ptts_feats/se/se.npy
```
## Results
The cloned voice files can be found in the output directory res/ptts_syn.
## Application Scenarios
### Algorithm category
Speech processing
### Key application industries
Manufacturing, broadcast media, energy, healthcare, smart home, education
## Source Repository & Issue Reporting
https://developer.hpccube.com/codes/modelzoo/sambert-hifigan_pytorch
## References
[ModelScope - SambertHifigan](https://modelscope.cn/models/iic/speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/summary)
#!/bin/bash
# Feature extraction
python3 kantts/preprocess/data_process.py \
    --voice_input_dir Data/ptts_spk0_wav_autolabel \
    --voice_output_dir training_stage/ptts_feats \
    --audio_config kantts/configs/audio_config_se_16k.yaml \
    --speaker F7 \
    --se_model speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/speaker_embedding/se.model

# Expand epochs: fold the validation list into the training list, then repeat
# the shuffled list until it holds at least 400 entries, so a small personal
# dataset still yields reasonably long training epochs.
stage0=training_stage
voice=ptts_feats
cat $stage0/$voice/am_valid.lst >> $stage0/$voice/am_train.lst
lines=0
while [ $lines -lt 400 ]
do
    shuf $stage0/$voice/am_train.lst >> $stage0/$voice/am_train.lst.tmp
    lines=$(wc -l < "$stage0/$voice/am_train.lst.tmp")
done
mv $stage0/$voice/am_train.lst.tmp $stage0/$voice/am_train.lst
import os
import sys
import argparse
import torch
import soundfile as sf
import yaml
import logging
import numpy as np
import time
import glob

ROOT_PATH = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))  # NOQA: E402
sys.path.insert(0, os.path.dirname(ROOT_PATH))  # NOQA: E402

try:
    from kantts.utils.log import logging_to_file
except ImportError:
    raise ImportError("Please install kantts.")

logging.basicConfig(
    # filename=os.path.join(stage_dir, 'stdout.log'),
    format="%(asctime)s, %(levelname)-4s [%(filename)s:%(lineno)d] %(message)s",
    datefmt="%Y-%m-%d:%H:%M:%S",
    level=logging.INFO,
)


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def load_model(ckpt, config=None):
    # load config if not provided
    if config is None:
        dirname = os.path.dirname(os.path.dirname(ckpt))
        config = os.path.join(dirname, "config.yaml")
        with open(config) as f:
            config = yaml.load(f, Loader=yaml.Loader)

    # lazy load for circular error
    from kantts.models.hifigan.hifigan import Generator

    model = Generator(**config["Model"]["Generator"]["params"])
    states = torch.load(ckpt, map_location="cpu")
    model.load_state_dict(states["model"]["generator"])

    # add pqmf if needed
    if config["Model"]["Generator"]["params"]["out_channels"] > 1:
        # lazy load for circular error
        from kantts.models.pqmf import PQMF

        model.pqmf = PQMF()

    return model


def binarize(mel, threshold=0.6):
    # vuv binarize: threshold the soft voiced/unvoiced flag in the last channel
    res_mel = mel.copy()
    index = np.where(mel[:, -1] < threshold)[0]
    res_mel[:, -1] = 1.0
    res_mel[:, -1][index] = 0.0
    return res_mel


def hifigan_infer(input_mel, ckpt_path, output_dir, config=None):
    if not torch.cuda.is_available():
        device = torch.device("cpu")
    else:
        torch.backends.cudnn.benchmark = True
        device = torch.device("cuda", 0)

    if config is not None:
        with open(config, "r") as f:
            config = yaml.load(f, Loader=yaml.Loader)
    else:
        config_path = os.path.join(
            os.path.dirname(os.path.dirname(ckpt_path)), "config.yaml"
        )
        if not os.path.exists(config_path):
            raise ValueError("config file not found: {}".format(config_path))
        with open(config_path, "r") as f:
            config = yaml.load(f, Loader=yaml.Loader)

    for key, value in config.items():
        logging.info(f"{key} = {value}")

    # check directory existence
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    logging_to_file(os.path.join(output_dir, "stdout.log"))

    if os.path.isfile(input_mel):
        mel_lst = [input_mel]
    elif os.path.isdir(input_mel):
        mel_lst = glob.glob(os.path.join(input_mel, "*.npy"))
    else:
        raise ValueError("input_mel should be a file or a directory")

    model = load_model(ckpt_path, config)
    logging.info(f"Loaded model parameters from {ckpt_path}.")
    model.remove_weight_norm()
    model = model.eval().to(device)

    with torch.no_grad():
        start = time.time()
        pcm_len = 0
        for mel in mel_lst:
            utt_id = os.path.splitext(os.path.basename(mel))[0]
            mel_data = np.load(mel)
            if model.nsf_enable:
                mel_data = binarize(mel_data)
            # generate
            mel_data = torch.tensor(mel_data, dtype=torch.float).to(device)
            # (T, C) -> (B, C, T)
            mel_data = mel_data.transpose(1, 0).unsqueeze(0)
            y = model(mel_data)
            if hasattr(model, "pqmf"):
                y = model.pqmf.synthesis(y)
            y = y.view(-1).cpu().numpy()
            pcm_len += len(y)
            # save as PCM 16 bit wav file
            sf.write(
                os.path.join(output_dir, f"{utt_id}_gen.wav"),
                y,
                config["audio_config"]["sampling_rate"],
                "PCM_16",
            )
        rtf = (time.time() - start) / (
            pcm_len / config["audio_config"]["sampling_rate"]
        )
        # report average RTF
        logging.info(
            f"Finished generation of {len(mel_lst)} utterances (RTF = {rtf:.03f})."
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Infer hifigan model")
    parser.add_argument(
        "--ckpt", type=str, required=True, help="Path to model checkpoint"
    )
    parser.add_argument(
        "--input_mel",
        type=str,
        required=True,
        help="Path to input mel file or directory containing mel files",
    )
    parser.add_argument(
        "--output_dir", type=str, required=True, help="Path to output directory"
    )
    parser.add_argument("--config", type=str, default=None, help="Path to config file")
    args = parser.parse_args()

    hifigan_infer(
        args.input_mel,
        args.ckpt,
        args.output_dir,
        args.config,
    )