Commit ee10550a authored by liugh5

Initial commit

徐玠诡谲多智,善揣摩,知道徐知询不可辅佐,掌握着他的短处以归附徐知诰。
许乐夫生于山东省临朐县杨善镇大辛庄,毕业于抗大一分校。
宣统元年(1909年),顺德绅士冯国材在香山大黄圃成立安洲农务分会,管辖东海十六沙,冯国材任总理。
学生们大多住在校区宿舍,通过参加不同的体育文化俱乐部及社交活动,形成一个友谊长存的社会圈。
学校的“三节一会”(艺术节、社团节、科技节、运动会)是显示青春才华的盛大活动。
雪是先天自闭症患者,不懂与人沟通,却拥有灵敏听觉,而且对复杂动作过目不忘。
勋章通过一柱状螺孔和螺钉附着在衣物上。
雅恩雷根斯堡足球俱乐部()是一家位于德国雷根斯堡的足球俱乐部,处于德国足球丙级联赛。
亚历山大·格罗滕迪克于1957年证明了一个深远的推广,现在叫做格罗滕迪克–黎曼–罗赫定理。
MIT License
Copyright (c) 2022 Alibaba Research
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# KAN-TTS
With KAN-TTS you can train your own TTS model from zero to hero :).
## Models
Currently we support SAMBERT and HiFi-GAN; more models are coming soon.
## Supported Languages
| Language | Model Links |
| :---: | :---: |
| Mandarin | https://modelscope.cn/models?name=zhcn&page=1&tasks=text-to-speech&type=audio |
| English (US) | https://modelscope.cn/models?name=enus&page=1&tasks=text-to-speech&type=audio |
| English (UK) | https://modelscope.cn/models?name=engb&page=1&tasks=text-to-speech&type=audio |
| Shanghainese | https://modelscope.cn/models?name=WuuShanghai&page=1&tasks=text-to-speech&type=audio |
| Sichuanese | https://modelscope.cn/models?name=Sichuan&page=1&tasks=text-to-speech&type=audio |
| Cantonese | https://modelscope.cn/models?name=Cantonese&page=1&tasks=text-to-speech&type=audio |
| Italian | https://modelscope.cn/models?name=itit&page=1&tasks=text-to-speech&type=audio |
| Spanish | https://modelscope.cn/models?name=eses&page=1&tasks=text-to-speech&type=audio |
| Russian | https://modelscope.cn/models?name=ruru&page=1&tasks=text-to-speech&type=audio |
| Korean | https://modelscope.cn/models?name=kokr&page=1&tasks=text-to-speech&type=audio |
More languages are coming soon.
## Training Tutorial
You can find the training tutorial in our wiki page [KAN-TTS Wiki](https://github.com/AlibabaResearch/KAN-TTS/wiki).
## ModelScope Demo
Try our demo on ModelScope [KAN-TTS Demo](https://modelscope.cn/models?page=1&tasks=text-to-speech).
## Contribute to this repo
```shell
pip install -r requirements.txt
pre-commit install
```
## Contact us
If you have any questions, please feel free to contact us.
Scan the QR code to join our DingTalk group.
<img src="https://raw.githubusercontent.com/wiki/alibaba-damo-academy/KAN-TTS/resources/images/kantts_dinggroup.png" width="200" height="200" />
# sambert-hifigan_pytorch
## Papers
[RobuTrans: A Robust Transformer-Based Text-to-Speech Model](https://ojs.aaai.org/index.php/AAAI/article/view/6337)
[HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)
## Model Architecture
SAMBERT acoustic model with prosody modeling: in speech synthesis, FastSpeech-style parallel models are currently the mainstream; they model the three prosodic representations, pitch, energy, and duration, separately. However, such models share some quality and performance problems: modeling duration, pitch, and energy independently ignores their intrinsic correlation; a fully non-autoregressive network cannot meet the latency requirements of industrial real-time synthesis; and frame-level pitch and energy prediction is unstable. The DAMO Academy speech lab therefore designed SAMBERT, an improved parallel TTS model with the following advantages:
1. The backbone uses a Self-Attention Mechanism (SAM), strengthening the model's capacity.
2. The encoder is initialized from BERT, injecting richer text information and improving the prosody of the synthesized speech.
3. The Variance Adaptor makes a coarse-grained prediction of the phoneme-level prosody contours (pitch, energy, duration), which the decoder then models at fine, frame-level granularity. Duration prediction takes its correlation with pitch and energy into account through an autoregressive structure, further improving prosodic naturalness (see the sketch after the figure below).
4. The decoder uses a PNCA AR-Decoder [@li2020robutrans], which naturally supports streaming synthesis.
![sambert.jpg](./assets/sambert.jpg)
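The joint, autoregressive prosody prediction described in point 3 can be sketched in a few lines of PyTorch. This is a minimal illustration, not the KAN-TTS implementation; the module name `ProsodyPredictor` and all dimensions are invented for the example.
```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Toy phoneme-level predictor of (pitch, energy, duration).

    The three values are predicted jointly, and each step is conditioned
    on the previous step's prediction, mirroring point 3 above.
    """

    def __init__(self, hidden=256):
        super().__init__()
        # input = encoder state + previous (pitch, energy, duration) triple
        self.rnn = nn.GRU(hidden + 3, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 3)

    def forward(self, enc):  # enc: (B, N_phonemes, hidden)
        B, N, _ = enc.shape
        prev = enc.new_zeros(B, 1, 3)  # start step: zero prosody
        state, outs = None, []
        for t in range(N):
            x = torch.cat([enc[:, t : t + 1], prev], dim=-1)
            h, state = self.rnn(x, state)
            prev = self.proj(h)  # joint (pitch, energy, duration) at step t
            outs.append(prev)
        return torch.cat(outs, dim=1)  # (B, N, 3) coarse phoneme-level contour

# e.g.: contour = ProsodyPredictor()(torch.randn(2, 12, 256))
```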
## Algorithm
For transfer learning, a multi-speaker acoustic model must be built first; speakers are distinguished by a trainable speaker embedding. Given a new speaker, the usual approach is to randomly initialize a speaker embedding and then update it on that speaker's data (speaker space 1 in the figure below). For personalized speech synthesis, however, the amount of data per speaker is small, learning is hard, and the similarity of the synthesized voice cannot be guaranteed. We therefore represent each speaker by speaker feature information: a speaker embedding initialized from a small amount of the speaker's data lies much closer to the actual target speaker (speaker space 2 in the figure below), learning becomes easier, and the synthesized voice is markedly more similar. With this feature-based personalization, good similarity is achieved even with as few as 20 recorded sentences.
![feature_space.png](./assets/feature_space.png)
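The two initialization strategies contrasted above can be sketched as follows. This is an illustrative sketch, assuming the `se.npy` file produced by the feature-extraction script later on this page holds one speaker-verification embedding per utterance; the table size and embedding dimension are invented for the example.
```python
import numpy as np
import torch
import torch.nn as nn

num_speakers, emb_dim = 500, 192  # invented sizes for illustration
spk_table = nn.Embedding(num_speakers + 1, emb_dim)  # extra slot = new speaker
new_id = num_speakers

# Speaker space 1: random init, then fine-tune on the new speaker's data.
nn.init.normal_(spk_table.weight.data[new_id], std=0.02)

# Speaker space 2: init from speaker feature information extracted from a
# handful of the speaker's utterances (path as produced by the script below).
se = torch.from_numpy(np.load("training_stage/ptts_feats/se/se.npy")).float()
spk_table.weight.data[new_id] = se.reshape(-1, se.shape[-1]).mean(dim=0)
```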
The model framework consists of three parts:
1. Automated data processing and labeling
2. The SAMBERT acoustic model with prosody modeling
3. Personalized speech synthesis based on speaker feature information
![ptts.png](./assets/ptts.png)
## Environment Setup
### Docker (Option 1)
```shell
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk23.10-py38
# Replace imageID below with the ID of the image pulled above
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal:/opt/hyhal --shm-size=32G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name docker_name imageID bash
cd /path/workspace/
# Set up the KAN-TTS environment
git clone -b develop https://github.com/alibaba-damo-academy/KAN-TTS.git
cd KAN-TTS
# Pull the pretrained model
git clone https://www.modelscope.cn/damo/speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k.git
pip3 install -r requirements.txt
```
## Dataset
You can download the AISHELL-3 open-source speech synthesis dataset from ModelScope, already processed into the Alibaba standard format, and use it for the following steps. If you only have plain audio data, you can convert it into this format with the PTTS Autolabel automatic labeling tool.
[Training data](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/TTS/download_files/test_female.zip)
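For convenience, here is a minimal Python sketch for fetching and unpacking the sample data linked above; the `Data/` target directory is an assumption based on the paths used in the commands below.
```python
import io
import urllib.request
import zipfile

URL = ("https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/TTS/"
       "download_files/test_female.zip")

# Download the archive into memory and extract it under Data/.
with urllib.request.urlopen(URL) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall("Data/")
```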
## Training
#### Single-card training
```shell
HIP_VISIBLE_DEVICES=0 python3 kantts/bin/train_sambert.py \
--model_config speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/sambert/config.yaml \
--root_dir training_stage/ptts_feats \
--stage_dir training_stage/ptts_sambert_ckpt \
--resume_path speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/sambert/ckpt/checkpoint_*.pth
```
#### Single-card inference
```shell
HIP_VISIBLE_DEVICES=0 python3 kantts/bin/text_to_wav.py \
--txt Data/test.txt \
--output_dir res/ptts_syn \
--res_zip speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/resource.zip \
--am_ckpt training_stage/ptts_sambert_ckpt/ckpt/checkpoint_2402200.pth \
--voc_ckpt speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/hifigan/ckpt/checkpoint_2400000.pth \
--se_file training_stage/ptts_feats/se/se.npy
```
## Results
The cloned voice files can be found in the output directory res/ptts_syn.
## Application Scenarios
### Algorithm category
Speech processing
### Key application industries
Manufacturing, broadcast media, energy, healthcare, smart home, education
## Source Repository & Issue Reporting
https://developer.hpccube.com/codes/modelzoo/sambert-hifigan_pytorch
## References
[ModelScope - SambertHifigan](https://modelscope.cn/models/iic/speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/summary)
#!/bin/bash
# Feature extraction
python3 kantts/preprocess/data_process.py \
    --voice_input_dir Data/ptts_spk0_wav_autolabel \
    --voice_output_dir training_stage/ptts_feats \
    --audio_config kantts/configs/audio_config_se_16k.yaml \
    --speaker F7 \
    --se_model speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/speaker_embedding/se.model

# Expand epochs: fold the validation list into the training list, then repeat
# the shuffled list until it holds at least 400 entries, so a small personal
# dataset still yields reasonably long training epochs.
stage0=training_stage
voice=ptts_feats
cat $stage0/$voice/am_valid.lst >> $stage0/$voice/am_train.lst
lines=0
while [ $lines -lt 400 ]
do
    shuf $stage0/$voice/am_train.lst >> $stage0/$voice/am_train.lst.tmp
    lines=$(wc -l < "$stage0/$voice/am_train.lst.tmp")
done
mv $stage0/$voice/am_train.lst.tmp $stage0/$voice/am_train.lst
import os
import sys
import argparse
import torch
import soundfile as sf
import yaml
import logging
import numpy as np
import time
import glob

ROOT_PATH = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))  # NOQA: E402
sys.path.insert(0, os.path.dirname(ROOT_PATH))  # NOQA: E402

try:
    from kantts.utils.log import logging_to_file
except ImportError:
    raise ImportError("Please install kantts.")

logging.basicConfig(
    # filename=os.path.join(stage_dir, 'stdout.log'),
    format="%(asctime)s, %(levelname)-4s [%(filename)s:%(lineno)d] %(message)s",
    datefmt="%Y-%m-%d:%H:%M:%S",
    level=logging.INFO,
)


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def load_model(ckpt, config=None):
    # load config if not provided
    if config is None:
        dirname = os.path.dirname(os.path.dirname(ckpt))
        config = os.path.join(dirname, "config.yaml")
        with open(config) as f:
            config = yaml.load(f, Loader=yaml.Loader)

    # lazy load for circular error
    from kantts.models.hifigan.hifigan import Generator

    model = Generator(**config["Model"]["Generator"]["params"])
    states = torch.load(ckpt, map_location="cpu")
    model.load_state_dict(states["model"]["generator"])

    # add pqmf if needed
    if config["Model"]["Generator"]["params"]["out_channels"] > 1:
        # lazy load for circular error
        from kantts.models.pqmf import PQMF

        model.pqmf = PQMF()

    return model


def binarize(mel, threshold=0.6):
    # vuv binarize: threshold the soft voiced/unvoiced flag in the last channel
    res_mel = mel.copy()
    index = np.where(mel[:, -1] < threshold)[0]
    res_mel[:, -1] = 1.0
    res_mel[:, -1][index] = 0.0
    return res_mel


def hifigan_infer(input_mel, ckpt_path, output_dir, config=None):
    if not torch.cuda.is_available():
        device = torch.device("cpu")
    else:
        torch.backends.cudnn.benchmark = True
        device = torch.device("cuda", 0)

    if config is not None:
        with open(config, "r") as f:
            config = yaml.load(f, Loader=yaml.Loader)
    else:
        config_path = os.path.join(
            os.path.dirname(os.path.dirname(ckpt_path)), "config.yaml"
        )
        if not os.path.exists(config_path):
            raise ValueError("config file not found: {}".format(config_path))
        with open(config_path, "r") as f:
            config = yaml.load(f, Loader=yaml.Loader)

    for key, value in config.items():
        logging.info(f"{key} = {value}")

    # check directory existence
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    logging_to_file(os.path.join(output_dir, "stdout.log"))

    if os.path.isfile(input_mel):
        mel_lst = [input_mel]
    elif os.path.isdir(input_mel):
        mel_lst = glob.glob(os.path.join(input_mel, "*.npy"))
    else:
        raise ValueError("input_mel should be a file or a directory")

    model = load_model(ckpt_path, config)
    logging.info(f"Loaded model parameters from {ckpt_path}.")
    model.remove_weight_norm()
    model = model.eval().to(device)

    with torch.no_grad():
        start = time.time()
        pcm_len = 0
        for mel in mel_lst:
            utt_id = os.path.splitext(os.path.basename(mel))[0]
            mel_data = np.load(mel)
            if model.nsf_enable:
                mel_data = binarize(mel_data)
            # generate
            mel_data = torch.tensor(mel_data, dtype=torch.float).to(device)
            # (T, C) -> (B, C, T)
            mel_data = mel_data.transpose(1, 0).unsqueeze(0)
            y = model(mel_data)
            if hasattr(model, "pqmf"):
                y = model.pqmf.synthesis(y)
            y = y.view(-1).cpu().numpy()
            pcm_len += len(y)
            # save as PCM 16 bit wav file
            sf.write(
                os.path.join(output_dir, f"{utt_id}_gen.wav"),
                y,
                config["audio_config"]["sampling_rate"],
                "PCM_16",
            )
        rtf = (time.time() - start) / (
            pcm_len / config["audio_config"]["sampling_rate"]
        )
        # report average RTF
        logging.info(
            f"Finished generation of {len(mel_lst)} utterances (RTF = {rtf:.03f})."
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Infer hifigan model")
    parser.add_argument(
        "--ckpt", type=str, required=True, help="Path to model checkpoint"
    )
    parser.add_argument(
        "--input_mel",
        type=str,
        required=True,
        help="Path to input mel file or directory containing mel files",
    )
    parser.add_argument(
        "--output_dir", type=str, required=True, help="Path to output directory"
    )
    parser.add_argument("--config", type=str, default=None, help="Path to config file")
    args = parser.parse_args()

    hifigan_infer(
        args.input_mel,
        args.ckpt,
        args.output_dir,
        args.config,
    )