# FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

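This project is a PyTorch implementation of Microsoft's text-to-speech system [**FastSpeech 2: Fast and High-Quality End-to-End Text to Speech**](https://arxiv.org/abs/2006.04558v1). It is based on [FastSpeech](https://github.com/xcmyz/FastSpeech).

FastSpeech 2 has several versions. This implementation is closest to [version 1](https://arxiv.org/abs/2006.04558v1), which uses F0 values as the pitch feature.

[Later versions](https://arxiv.org/abs/2006.04558), by contrast, use pitch spectrograms extracted by continuous wavelet transform as the pitch feature.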
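The FastSpeech 2 paper describes quantizing each frame's F0 into 256 log-scale bins before it is turned into a pitch embedding. This README does not say whether this repository follows that scheme exactly, so the snippet below is only a minimal sketch of the idea. The function names and the 80 to 800 Hz range are illustrative assumptions, not values taken from this project.

```python
import bisect
import math


def log_scale_boundaries(f0_min, f0_max, n_bins=256):
    """Return n_bins - 1 bin edges spaced evenly in log-frequency."""
    log_min, log_max = math.log(f0_min), math.log(f0_max)
    step = (log_max - log_min) / (n_bins - 2)
    return [math.exp(log_min + i * step) for i in range(n_bins - 1)]


def quantize_f0(f0_hz, boundaries):
    """Map an F0 value in Hz to a bin index in [0, len(boundaries)]."""
    return bisect.bisect_left(boundaries, f0_hz)


# Illustrative range only (assumption); a real pipeline would derive it
# from statistics of the training data.
edges = log_scale_boundaries(f0_min=80.0, f0_max=800.0)
frame_f0 = [110.0, 220.0, 440.0]
print([quantize_f0(f, edges) for f in frame_f0])
```

Values below the lowest edge fall into bin 0 and values above the highest edge fall into the last bin, so every frame receives a valid index.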

![](./img/model.png)

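## Installing the environment and dependencies

### Create the environment

```
# Create the conda virtual environment
conda create --name FastSpeech2 python=3.7
# Check that the environment was created
conda env list
# Activate the environment
conda activate FastSpeech2
```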

### Install dependencies

```
pip3 install torch-1.10.0a0+git450cdd1.dtk22.4-cp37-cp37m-linux_x86_64.whl
pip3 install torchvision-0.10.0a0_dtk22.04_300a8a4-cp37-cp37m-linux_x86_64.whl
pip3 install -r requirements.txt
```

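## Data preprocessing

The dataset used is [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), a single-speaker English dataset. It contains 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, about 24 hours in total.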

```
wget "https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4?usp=sharing"
```

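This model uses the Montreal Forced Aligner (MFA) to obtain the alignments between utterances and phoneme sequences.

Alignments for the supported datasets are provided [here](https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4?usp=sharing).

Download preprocessed_data.zip and pretrain_model.zip from the link below. Extract preprocessed_data.zip into the project root directory and extract pretrain_model.zip into the hifigan folder. Note that the two pretrained models inside pretrain_model.zip, generator_LJSpeech and generator_universal, must themselves be extracted again; otherwise training fails with a file-not-found error.

Link: https://pan.baidu.com/s/1kDAAyXYClgS8U-703DA-nA
Extraction code: ujn6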
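The two-stage extraction described above can also be scripted. The following is a minimal sketch using only the Python standard library. It assumes the nested pretrained models are themselves `.zip` archives, which this README does not state explicitly, and `extract_with_nested` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_with_nested(archive, dest, unpack_nested=True):
    """Extract `archive` into `dest`; optionally unpack .zip members it contained.

    Returns the list of nested archives that were unpacked.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        members = zf.namelist()

    unpacked = []
    if unpack_nested:
        for name in members:
            inner = dest / name
            # Assumption: nested models are stored as .zip files.
            if name.lower().endswith(".zip") and zipfile.is_zipfile(inner):
                with zipfile.ZipFile(inner) as zf:
                    zf.extractall(inner.parent)
                unpacked.append(inner)
    return unpacked


# Example calls, run from the project root (file names follow this README):
#   extract_with_nested("preprocessed_data.zip", ".", unpack_nested=False)
#   extract_with_nested("pretrain_model.zip", "hifigan")
```

Only archives that came out of the outer zip are unpacked, so other zip files already present in the destination folder are left untouched.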

Then run the following commands to preprocess the LJSpeech dataset:

```
python3 prepare_align.py config/LJSpeech/preprocess.yaml
python3 preprocess.py config/LJSpeech/preprocess.yaml
```

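Note that the paths in preprocess.yaml must match where the dataset was downloaded on the current machine. As shown in the figure, it is enough to set corpus_path to the path of the downloaded dataset. raw_path must also point to an existing, empty folder; otherwise an error is raised.

![pretrain](./img/pretrain.png)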
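Both path requirements can be checked before starting the preprocessing scripts. The snippet below is a minimal sketch; `check_preprocess_paths` is a hypothetical helper, and the two values are passed in directly because the exact nesting of these keys inside preprocess.yaml is not shown in this README.

```python
from pathlib import Path


def check_preprocess_paths(corpus_path, raw_path):
    """Return a list of problems with the two paths; an empty list means OK."""
    problems = []
    corpus, raw = Path(corpus_path), Path(raw_path)
    if not corpus.is_dir():
        problems.append(f"corpus_path is not an existing directory: {corpus}")
    if not raw.is_dir():
        problems.append(f"raw_path is not an existing directory: {raw}")
    elif any(raw.iterdir()):
        # The README asks for an existing *empty* folder.
        problems.append(f"raw_path is not empty: {raw}")
    return problems


# Example (paths are placeholders):
#   for msg in check_preprocess_paths("/data/LJSpeech-1.1", "./raw_data/LJSpeech"):
#       print("warning:", msg)
```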

## Training

Single-GPU training

Create the following script and run it to train the model:

```
# train_single.sh

export HIP_VISIBLE_DEVICES=0
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

```
bash train_single.sh
```

Multi-GPU training

Create the following script and run it to train the model:

```
# train_ddp.sh

export HIP_VISIBLE_DEVICES=0,1,2,3

export NGPUS=4

python3 -m torch.distributed.launch --nproc_per_node ${NGPUS} train_ddp.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```
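`torch.distributed.launch` starts one process per GPU and tells each process its local rank: by default through a `--local_rank` command-line argument, and through the `LOCAL_RANK` environment variable when `--use_env` is given. How train_ddp.py reads this value is not shown in this README, so the following is only a minimal sketch of one way a training script could accept both forms; `parse_local_rank` is a hypothetical helper.

```python
import argparse
import os


def parse_local_rank(argv=None, environ=None):
    """Return the local rank from --local_rank, else LOCAL_RANK, else 0."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", type=int, default=None)
    # Ignore the script's other options (-p, -m, -t, ...).
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    return int(environ.get("LOCAL_RANK", 0))


# Example: inside a training script, call parse_local_rank() with no
# arguments to read sys.argv and os.environ.
```

Falling back to 0 keeps the same entry point usable for a plain single-process run.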

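After training finishes, the logs and models are saved in the output folder. TensorBoard can visualize the logs, showing the loss curves, mel spectrograms, and audio samples.

Run the following command to inspect the loss in TensorBoard and listen to the generated audio: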

```
tensorboard --logdir output/log/LJSpeech
```

![](./img/tensorboard_loss.png)

![](./img/tensorboard_spec.png)

![](./img/tensorboard_audio.png)

## Performance test

![Figure 2](./img/图片2.png)

## FAQ

**a. AttributeError: module 'distutils' has no attribute 'version'**

This is caused by the setuptools version. Downgrading it resolves the problem:

```
pip3 uninstall setuptools
pip3 install setuptools==59.5.0
```

**b. If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.**

Pinning protobuf to version 3.20.x resolves the problem:

```
pip3 install --upgrade protobuf==3.20.1
```

**c. RuntimeError: Numpy is not available**

Reinstalling numpy may resolve the problem:

```
pip3 uninstall numpy
pip3 install numpy
```

## References

https://github.com/ming024/FastSpeech2