Unverified Commit 4ece3b94 authored by Matthijs Hollemans's avatar Matthijs Hollemans Committed by GitHub
Browse files

add VITS model (#24085)



* add VITS model

* let's vits

* finish TextEncoder (mostly)

* rename VITS to Vits

* add StochasticDurationPredictor

* ads flow model

* add generator

* correctly set vocab size

* add tokenizer

* remove processor & feature extractor

* add PosteriorEncoder

* add missing weights to SDP

* also convert LJSpeech and VCTK checkpoints

* add training stuff in forward

* add placeholder tests for tokenizer

* add placeholder tests for model

* starting cleanup

* let the great renaming begin!

* use config

* global_conditioning

* more cleaning

* renaming variables

* more renaming

* more renaming

* it never ends

* reticulating the splines

* more renaming

* HiFi-GAN

* doc strings for main model

* fixup

* fix-copies

* don't make it a PreTrainedModel

* fixup

* rename config options

* remove training logic from forward pass

* simplify relative position

* use actual checkpoint

* style

* PR review fixes

* more review changes

* fixup

* more unit tests

* fixup

* fix doc test

* add integration test

* improve tokenizer tests

* add tokenizer integration test

* fix tests on GPU (gave OOM)

* conversion script can handle repos from hub

* add conversion script for all MMS-TTS checkpoints

* automatically create a README for the converted checkpoint

* small changes to config

* push README to hub

* only show uroman note for checkpoints that need it

* remove conversion script because code formatting breaks the readme

* make WaveNet layers configurable

* rename variables

* simplifying the math

* output attentions and hidden states

* remove VitsFlip in flow model

* also got rid of the other flip

* fix tests

* rename more variables

* rename tokenizer, add phonemization

* raise error when phonemizer missing

* re-order config docstrings to match method

* change config naming

* remove redundant str -> list

* fix copyright: vits authors -> kakao enterprise

* (mean, log_variances) -> (prior_mean, prior_log_variances)

* if return dict -> if not return dict

* speed -> speaking rate

* Apply suggestions from code review
Co-authored-by: default avataramyeroberts <22614925+amyeroberts@users.noreply.github.com>

* update fused tanh sigmoid

* reduce dims in tester

* audio -> output_values

* audio -> output_values in tuple out

* fix return type

* fix return type

* make _unconstrained_rational_quadratic_spline a function

* all nn's to accept a config

* add spectro to output

* move {speaking rate, noise scale, noise scale duration} to config

* path -> attn_path

* idxs -> valid idxs -> padded idxs

* output values -> waveform

* use config for attention

* make generation work

* harden integration test

* add spectrogram to dict output

* tokenizer refactor

* make style

* remove 'fake' padding token

* harden tokenizer tests

* ron norm test

* fprop / save tests deterministic

* move uroman to tokenizer as much as possible

* better logger message

* fix vivit imports

* add uroman integration test

* make style

* up

* matthijs -> sanchit-gandhi

* fix tokenizer test

* make fix-copies

* fix dict comprehension

* fix config tests

* fix model tests

* make outputs consistent with reverse/not reverse

* fix key concat

* more model details

* add author

* return dict

* speaker error

* labels error

* Apply suggestions from code review
Co-authored-by: default avataramyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/vits/convert_original_checkpoint.py
Co-authored-by: default avataramyeroberts <22614925+amyeroberts@users.noreply.github.com>

* remove uromanize

* add docstrings

* add docstrings for tokenizer

* upper-case skip messages

* fix return dict

* style

* finish tests

* update checkpoints

* make style

* remove doctest file

* revert

* fix docstring

* fix tokenizer

* remove uroman integration test

* add sampling rate

* fix docs / docstrings

* style

* add sr to model output

* fix outputs

* style / copies

* fix docstring

* fix copies

* remove sr from model outputs

* Update utils/documentation_tests.txt
Co-authored-by: default avataramyeroberts <22614925+amyeroberts@users.noreply.github.com>

* add sr as allowed attr

---------
Co-authored-by: default avatarsanchit-gandhi <sanchit@huggingface.co>
Co-authored-by: default avatarSanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: default avataramyeroberts <22614925+amyeroberts@users.noreply.github.com>
parent ef10dbce
...@@ -489,6 +489,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h ...@@ -489,6 +489,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
......
...@@ -466,6 +466,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt ...@@ -466,6 +466,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
......
...@@ -438,6 +438,7 @@ conda install -c huggingface transformers ...@@ -438,6 +438,7 @@ conda install -c huggingface transformers
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI से) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. द्वाराअनुसंधान पत्र [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) के साथ जारी किया गया 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI से) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. द्वाराअनुसंधान पत्र [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) के साथ जारी किया गया
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (मेटा एआई से) साथ में कागज [मास्कड ऑटोएन्कोडर स्केलेबल विजन लर्नर्स हैं](https://arxiv.org/ एब्स/2111.06377) कैमिंग हे, ज़िनेली चेन, सेनिंग ज़ी, यांगहो ली, पिओट्र डॉलर, रॉस गिर्शिक द्वारा। 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (मेटा एआई से) साथ में कागज [मास्कड ऑटोएन्कोडर स्केलेबल विजन लर्नर्स हैं](https://arxiv.org/ एब्स/2111.06377) कैमिंग हे, ज़िनेली चेन, सेनिंग ज़ी, यांगहो ली, पिओट्र डॉलर, रॉस गिर्शिक द्वारा।
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (मेटा एआई से) साथ में कागज [लेबल-कुशल सीखने के लिए मास्क्ड स्याम देश के नेटवर्क](https://arxiv. org/abs/2204.07141) महमूद असरान, मथिल्डे कैरन, ईशान मिश्रा, पियोट्र बोजानोवस्की, फ्लोरियन बोर्डेस, पास्कल विंसेंट, आर्मंड जौलिन, माइकल रब्बत, निकोलस बल्लास द्वारा। 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (मेटा एआई से) साथ में कागज [लेबल-कुशल सीखने के लिए मास्क्ड स्याम देश के नेटवर्क](https://arxiv. org/abs/2204.07141) महमूद असरान, मथिल्डे कैरन, ईशान मिश्रा, पियोट्र बोजानोवस्की, फ्लोरियन बोर्डेस, पास्कल विंसेंट, आर्मंड जौलिन, माइकल रब्बत, निकोलस बल्लास द्वारा।
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise से) Jaehyeon Kim, Jungil Kong, Juhee Son. द्वाराअनुसंधान पत्र [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) के साथ जारी किया गया
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (फेसबुक एआई से) साथ में पेपर [wav2vec 2.0: ए फ्रेमवर्क फॉर सेल्फ-सुपरवाइज्ड लर्निंग ऑफ स्पीच रिप्रेजेंटेशन] (https://arxiv.org/abs/2006.11477) एलेक्सी बेवस्की, हेनरी झोउ, अब्देलरहमान मोहम्मद, माइकल औली द्वारा। 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (फेसबुक एआई से) साथ में पेपर [wav2vec 2.0: ए फ्रेमवर्क फॉर सेल्फ-सुपरवाइज्ड लर्निंग ऑफ स्पीच रिप्रेजेंटेशन] (https://arxiv.org/abs/2006.11477) एलेक्सी बेवस्की, हेनरी झोउ, अब्देलरहमान मोहम्मद, माइकल औली द्वारा।
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI से) साथ वाला पेपर [FAIRSEQ S2T: FAIRSEQ के साथ फास्ट स्पीच-टू-टेक्स्ट मॉडलिंग ](https://arxiv.org/abs/2010.05171) चांगहान वांग, यूं तांग, जुताई मा, ऐनी वू, सरव्या पोपुरी, दिमित्रो ओखोनको, जुआन पिनो द्वारा पोस्ट किया गया। 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI से) साथ वाला पेपर [FAIRSEQ S2T: FAIRSEQ के साथ फास्ट स्पीच-टू-टेक्स्ट मॉडलिंग ](https://arxiv.org/abs/2010.05171) चांगहान वांग, यूं तांग, जुताई मा, ऐनी वू, सरव्या पोपुरी, दिमित्रो ओखोनको, जुआन पिनो द्वारा पोस्ट किया गया।
......
...@@ -500,6 +500,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ ...@@ -500,6 +500,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI から) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. から公開された研究論文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI から) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. から公開された研究論文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527)
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI から) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick から公開された研究論文: [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI から) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick から公開された研究論文: [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI から) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas から公開された研究論文: [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI から) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas から公開された研究論文: [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141)
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise から) Jaehyeon Kim, Jungil Kong, Juhee Son. から公開された研究論文 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103)
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI から) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli から公開された研究論文: [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI から) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli から公開された研究論文: [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI から) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino から公開された研究論文: [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI から) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino から公開された研究論文: [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171)
......
...@@ -415,6 +415,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 ...@@ -415,6 +415,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI 에서 제공)은 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.의 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527)논문과 함께 발표했습니다. 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI 에서 제공)은 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.의 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527)논문과 함께 발표했습니다.
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI 에서) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 의 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 논문과 함께 발표했습니다. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI 에서) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 의 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 논문과 함께 발표했습니다.
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI 에서) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 의 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) 논문과 함께 발표했습니다. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI 에서) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 의 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) 논문과 함께 발표했습니다.
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise 에서 제공)은 Jaehyeon Kim, Jungil Kong, Juhee Son.의 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103)논문과 함께 발표했습니다.
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI 에서) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 의 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 논문과 함께 발표했습니다. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI 에서) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 의 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 논문과 함께 발표했습니다.
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI 에서) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 의 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 논문과 함께 발표했습니다. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI 에서) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 의 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 논문과 함께 발표했습니다.
......
...@@ -439,6 +439,7 @@ conda install -c huggingface transformers ...@@ -439,6 +439,7 @@ conda install -c huggingface transformers
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (来自 Meta AI) 伴随论文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 由 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He 发布。 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (来自 Meta AI) 伴随论文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 由 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He 发布。
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (来自 Meta AI) 伴随论文 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 由 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 发布。 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (来自 Meta AI) 伴随论文 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 由 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 发布。
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (来自 Meta AI) 伴随论文 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 发布. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (来自 Meta AI) 伴随论文 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 发布.
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (来自 Kakao Enterprise) 伴随论文 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) 由 Jaehyeon Kim, Jungil Kong, Juhee Son 发布。
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (来自 Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) 由 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (来自 Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) 由 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (来自 Facebook AI) 伴随论文 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 由 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 发布。 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (来自 Facebook AI) 伴随论文 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 由 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 发布。
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。
......
...@@ -451,6 +451,7 @@ conda install -c huggingface transformers ...@@ -451,6 +451,7 @@ conda install -c huggingface transformers
1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
......
...@@ -603,6 +603,8 @@ ...@@ -603,6 +603,8 @@
title: UniSpeech title: UniSpeech
- local: model_doc/unispeech-sat - local: model_doc/unispeech-sat
title: UniSpeech-SAT title: UniSpeech-SAT
- local: model_doc/vits
title: VITS
- local: model_doc/wav2vec2 - local: model_doc/wav2vec2
title: Wav2Vec2 title: Wav2Vec2
- local: model_doc/wav2vec2-conformer - local: model_doc/wav2vec2-conformer
......
...@@ -255,6 +255,7 @@ The documentation is organized into five sections: ...@@ -255,6 +255,7 @@ The documentation is organized into five sections:
1. **[VitDet](model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[VitDet](model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[ViTMSN](model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. 1. **[ViTMSN](model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
1. **[VITS](model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
1. **[ViViT](model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[ViViT](model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. 1. **[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
...@@ -475,6 +476,7 @@ Flax), PyTorch, and/or TensorFlow. ...@@ -475,6 +476,7 @@ Flax), PyTorch, and/or TensorFlow.
| VitDet | ✅ | ❌ | ❌ | | VitDet | ✅ | ❌ | ❌ |
| ViTMAE | ✅ | ✅ | ❌ | | ViTMAE | ✅ | ✅ | ❌ |
| ViTMSN | ✅ | ❌ | ❌ | | ViTMSN | ✅ | ❌ | ❌ |
| VITS | ✅ | ❌ | ❌ |
| ViViT | ✅ | ❌ | ❌ | | ViViT | ✅ | ❌ | ❌ |
| Wav2Vec2 | ✅ | ✅ | ✅ | | Wav2Vec2 | ✅ | ✅ | ✅ |
| Wav2Vec2-Conformer | ✅ | ❌ | ❌ | | Wav2Vec2-Conformer | ✅ | ❌ | ❌ |
......
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# VITS
## Overview
The VITS model was proposed in [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
VITS (**V**ariational **I**nference with adversarial learning for end-to-end **T**ext-to-**S**peech) is an end-to-end
speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational
autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior.
A set of spectrogram-based acoustic features are predicted by the flow-based module, which is formed of a Transformer-based
text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers,
much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text
input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to
synthesise speech with different rhythms from the same input text.
The model is trained end-to-end with a combination of losses derived from variational lower bound and adversarial training.
To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During
inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the
waveform using a cascade of the flow module and HiFi-GAN decoder. Due to the stochastic nature of the duration predictor,
the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform.
The abstract from the paper is the following:
*Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.*
This model can also be used with TTS checkpoints from [Massively Multilingual Speech (MMS)](https://arxiv.org/abs/2305.13516)
as these checkpoints use the same architecture and a slightly modified tokenizer.
This model was contributed by [Matthijs](https://huggingface.co/Matthijs) and [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original code can be found [here](https://github.com/jaywalnut310/vits).
## Model Usage
Both the VITS and MMS-TTS checkpoints can be used with the same API. Since the flow-based model is non-deterministic, it
is good practice to set a seed to ensure reproducibility of the outputs. For languages with a Roman alphabet,
such as English or French, the tokenizer can be used directly to pre-process the text inputs. The following code example
runs a forward pass using the MMS-TTS English checkpoint:
```python
import torch
from transformers import VitsTokenizer, VitsModel, set_seed
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
model = VitsModel.from_pretrained("facebook/mms-tts-eng")
inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")
set_seed(555) # make deterministic
with torch.no_grad():
outputs = model(**inputs)
waveform = outputs.waveform[0]
```
The resulting waveform can be saved as a `.wav` file:
```python
import scipy
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=waveform)
```
Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio
Audio(waveform, rate=model.config.sampling_rate)
```
For certain languages with a non-Roman alphabet, such as Arabic, Mandarin or Hindi, the [`uroman`](https://github.com/isi-nlp/uroman)
perl package is required to pre-process the text inputs to the Roman alphabet.
You can check whether you require the `uroman` package for your language by inspecting the `is_uroman` attribute of
the pre-trained `tokenizer`:
```python
from transformers import VitsTokenizer
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
print(tokenizer.is_uroman)
```
If required, you should apply the uroman package to your text inputs **prior** to passing them to the `VitsTokenizer`,
since currently the tokenizer does not support performing the pre-processing itself.
## VitsConfig
[[autodoc]] VitsConfig
## VitsTokenizer
[[autodoc]] VitsTokenizer
- __call__
- save_vocabulary
## VitsModel
[[autodoc]] VitsModel
- forward
...@@ -587,6 +587,11 @@ _import_structure = { ...@@ -587,6 +587,11 @@ _import_structure = {
"models.vit_mae": ["VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMAEConfig"], "models.vit_mae": ["VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMAEConfig"],
"models.vit_msn": ["VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMSNConfig"], "models.vit_msn": ["VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMSNConfig"],
"models.vitdet": ["VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP", "VitDetConfig"], "models.vitdet": ["VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP", "VitDetConfig"],
"models.vits": [
"VITS_PRETRAINED_CONFIG_ARCHIVE_MAP",
"VitsConfig",
"VitsTokenizer",
],
"models.vivit": [ "models.vivit": [
"VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"VivitConfig", "VivitConfig",
...@@ -2935,6 +2940,13 @@ else: ...@@ -2935,6 +2940,13 @@ else:
"VitDetPreTrainedModel", "VitDetPreTrainedModel",
] ]
) )
_import_structure["models.vits"].extend(
[
"VITS_PRETRAINED_MODEL_ARCHIVE_LIST",
"VitsModel",
"VitsPreTrainedModel",
]
)
_import_structure["models.vivit"].extend( _import_structure["models.vivit"].extend(
[ [
"VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST", "VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST",
...@@ -4643,6 +4655,11 @@ if TYPE_CHECKING: ...@@ -4643,6 +4655,11 @@ if TYPE_CHECKING:
from .models.vit_mae import VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMAEConfig from .models.vit_mae import VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMAEConfig
from .models.vit_msn import VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMSNConfig from .models.vit_msn import VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMSNConfig
from .models.vitdet import VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP, VitDetConfig from .models.vitdet import VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP, VitDetConfig
from .models.vits import (
VITS_PRETRAINED_CONFIG_ARCHIVE_MAP,
VitsConfig,
VitsTokenizer,
)
from .models.vivit import VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, VivitConfig from .models.vivit import VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, VivitConfig
from .models.wav2vec2 import ( from .models.wav2vec2 import (
WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP, WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP,
...@@ -6595,6 +6612,11 @@ if TYPE_CHECKING: ...@@ -6595,6 +6612,11 @@ if TYPE_CHECKING:
VitDetModel, VitDetModel,
VitDetPreTrainedModel, VitDetPreTrainedModel,
) )
from .models.vits import (
VITS_PRETRAINED_MODEL_ARCHIVE_LIST,
VitsModel,
VitsPreTrainedModel,
)
from .models.vivit import ( from .models.vivit import (
VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST, VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST,
VivitForVideoClassification, VivitForVideoClassification,
......
...@@ -211,6 +211,7 @@ from . import ( ...@@ -211,6 +211,7 @@ from . import (
vit_mae, vit_mae,
vit_msn, vit_msn,
vitdet, vitdet,
vits,
vivit, vivit,
wav2vec2, wav2vec2,
wav2vec2_conformer, wav2vec2_conformer,
......
...@@ -219,6 +219,7 @@ CONFIG_MAPPING_NAMES = OrderedDict( ...@@ -219,6 +219,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("vit_mae", "ViTMAEConfig"), ("vit_mae", "ViTMAEConfig"),
("vit_msn", "ViTMSNConfig"), ("vit_msn", "ViTMSNConfig"),
("vitdet", "VitDetConfig"), ("vitdet", "VitDetConfig"),
("vits", "VitsConfig"),
("vivit", "VivitConfig"), ("vivit", "VivitConfig"),
("wav2vec2", "Wav2Vec2Config"), ("wav2vec2", "Wav2Vec2Config"),
("wav2vec2-conformer", "Wav2Vec2ConformerConfig"), ("wav2vec2-conformer", "Wav2Vec2ConformerConfig"),
...@@ -410,6 +411,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict( ...@@ -410,6 +411,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("vit_msn", "VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit_msn", "VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("vitdet", "VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vitdet", "VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("vits", "VITS_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("vivit", "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vivit", "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
...@@ -643,6 +645,7 @@ MODEL_NAMES_MAPPING = OrderedDict( ...@@ -643,6 +645,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("vit_mae", "ViTMAE"), ("vit_mae", "ViTMAE"),
("vit_msn", "ViTMSN"), ("vit_msn", "ViTMSN"),
("vitdet", "VitDet"), ("vitdet", "VitDet"),
("vits", "VITS"),
("vivit", "ViViT"), ("vivit", "ViViT"),
("wav2vec2", "Wav2Vec2"), ("wav2vec2", "Wav2Vec2"),
("wav2vec2-conformer", "Wav2Vec2-Conformer"), ("wav2vec2-conformer", "Wav2Vec2-Conformer"),
......
...@@ -205,6 +205,7 @@ MODEL_MAPPING_NAMES = OrderedDict( ...@@ -205,6 +205,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("vit_mae", "ViTMAEModel"), ("vit_mae", "ViTMAEModel"),
("vit_msn", "ViTMSNModel"), ("vit_msn", "ViTMSNModel"),
("vitdet", "VitDetModel"), ("vitdet", "VitDetModel"),
("vits", "VitsModel"),
("vivit", "VivitModel"), ("vivit", "VivitModel"),
("wav2vec2", "Wav2Vec2Model"), ("wav2vec2", "Wav2Vec2Model"),
("wav2vec2-conformer", "Wav2Vec2ConformerModel"), ("wav2vec2-conformer", "Wav2Vec2ConformerModel"),
......
...@@ -347,6 +347,7 @@ else: ...@@ -347,6 +347,7 @@ else:
), ),
("vilt", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), ("vilt", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("visual_bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), ("visual_bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("vits", ("VitsTokenizer", None)),
("wav2vec2", ("Wav2Vec2CTCTokenizer", None)), ("wav2vec2", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)), ("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)), ("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)),
......
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import (
OptionalDependencyNotAvailable,
_LazyModule,
is_sentencepiece_available,
is_speech_available,
is_torch_available,
)
_import_structure = {
"configuration_vits": [
"VITS_PRETRAINED_CONFIG_ARCHIVE_MAP",
"VitsConfig",
],
"tokenization_vits": ["VitsTokenizer"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_vits"] = [
"VITS_PRETRAINED_MODEL_ARCHIVE_LIST",
"VitsModel",
"VitsPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_vits import (
VITS_PRETRAINED_CONFIG_ARCHIVE_MAP,
VitsConfig,
)
from .tokenization_vits import VitsTokenizer
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_vits import (
VITS_PRETRAINED_MODEL_ARCHIVE_LIST,
VitsModel,
VitsPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# coding=utf-8
# Copyright 2023 The Kakao Enterprise Authors and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" VITS model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
VITS_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"facebook/mms-tts-eng": "https://huggingface.co/facebook/mms-tts-eng/resolve/main/config.json",
}
class VitsConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`VitsModel`]. It is used to instantiate a VITS
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the VITS
[facebook/mms-tts-eng](https://huggingface.co/facebook/mms-tts-eng) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 38):
Vocabulary size of the VITS model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed to the forward method of [`VitsModel`].
hidden_size (`int`, *optional*, defaults to 192):
Dimensionality of the text encoder layers.
num_hidden_layers (`int`, *optional*, defaults to 6):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 2):
Number of attention heads for each attention layer in the Transformer encoder.
window_size (`int`, *optional*, defaults to 4):
Window size for the relative positional embeddings in the attention layers of the Transformer encoder.
use_bias (`bool`, *optional*, defaults to `True`)
Whether to use bias in the key, query, value projection layers in the Transformer encoder.
ffn_dim (`int`, *optional*, defaults to 768):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
layerdrop (`float`, *optional*, defaults to 0.1):
The LayerDrop probability for the encoder. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556)
for more details.
ffn_kernel_size (`int`, *optional*, defaults to 3):
Kernel size of the 1D convolution layers used by the feed-forward network in the Transformer encoder.
flow_size (`int`, *optional*, defaults to 192):
Dimensionality of the flow layers.
spectrogram_bins (`int`, *optional*, defaults to 513):
Number of frequency bins in the target spectrogram.
hidden_act (`str` or `function`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"` and `"gelu_new"` are supported.
hidden_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings and encoder.
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
activation_dropout (`float`, *optional*, defaults to 0.1):
The dropout ratio for activations inside the fully connected layer.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-5):
The epsilon used by the layer normalization layers.
use_stochastic_duration_prediction (`bool`, *optional*, defaults to `True`):
Whether to use the stochastic duration prediction module or the regular duration predictor.
num_speakers (`int`, *optional*, defaults to 1):
Number of speakers if this is a multi-speaker model.
speaker_embedding_size (`int`, *optional*, defaults to 0):
Number of channels used by the speaker embeddings. Is zero for single-speaker models.
upsample_initial_channel (`int`, *optional*, defaults to 512):
The number of input channels into the HiFi-GAN upsampling network.
upsample_rates (`Tuple[int]` or `List[int]`, *optional*, defaults to `[8, 8, 2, 2]`):
A tuple of integers defining the stride of each 1D convolutional layer in the HiFi-GAN upsampling network.
The length of `upsample_rates` defines the number of convolutional layers and has to match the length of
`upsample_kernel_sizes`.
upsample_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[16, 16, 4, 4]`):
A tuple of integers defining the kernel size of each 1D convolutional layer in the HiFi-GAN upsampling
network. The length of `upsample_kernel_sizes` defines the number of convolutional layers and has to match
the length of `upsample_rates`.
resblock_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[3, 7, 11]`):
A tuple of integers defining the kernel sizes of the 1D convolutional layers in the HiFi-GAN
multi-receptive field fusion (MRF) module.
resblock_dilation_sizes (`Tuple[Tuple[int]]` or `List[List[int]]`, *optional*, defaults to `[[1, 3, 5], [1, 3, 5], [1, 3, 5]]`):
A nested tuple of integers defining the dilation rates of the dilated 1D convolutional layers in the
HiFi-GAN multi-receptive field fusion (MRF) module.
leaky_relu_slope (`float`, *optional*, defaults to 0.1):
The angle of the negative slope used by the leaky ReLU activation.
depth_separable_channels (`int`, *optional*, defaults to 2):
Number of channels to use in each depth-separable block.
depth_separable_num_layers (`int`, *optional*, defaults to 3):
Number of convolutional layers to use in each depth-separable block.
duration_predictor_flow_bins (`int`, *optional*, defaults to 10):
Number of channels to map using the unonstrained rational spline in the duration predictor model.
duration_predictor_tail_bound (`float`, *optional*, defaults to 5.0):
Value of the tail bin boundary when computing the unconstrained rational spline in the duration predictor
model.
duration_predictor_kernel_size (`int`, *optional*, defaults to 3):
Kernel size of the 1D convolution layers used in the duration predictor model.
duration_predictor_dropout (`float`, *optional*, defaults to 0.5):
The dropout ratio for the duration predictor model.
duration_predictor_num_flows (`int`, *optional*, defaults to 4):
Number of flow stages used by the duration predictor model.
duration_predictor_filter_channels (`int`, *optional*, defaults to 256):
Number of channels for the convolution layers used in the duration predictor model.
prior_encoder_num_flows (`int`, *optional*, defaults to 4):
Number of flow stages used by the prior encoder flow model.
prior_encoder_num_wavenet_layers (`int`, *optional*, defaults to 4):
Number of WaveNet layers used by the prior encoder flow model.
posterior_encoder_num_wavenet_layers (`int`, *optional*, defaults to 16):
Number of WaveNet layers used by the posterior encoder model.
wavenet_kernel_size (`int`, *optional*, defaults to 5):
Kernel size of the 1D convolution layers used in the WaveNet model.
wavenet_dilation_rate (`int`, *optional*, defaults to 1):
Dilation rates of the dilated 1D convolutional layers used in the WaveNet model.
wavenet_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the WaveNet layers.
speaking_rate (`float`, *optional*, defaults to 1.0):
Speaking rate. Larger values give faster synthesised speech.
noise_scale (`float`, *optional*, defaults to 0.667):
How random the speech prediction is. Larger values create more variation in the predicted speech.
noise_scale_duration (`float`, *optional*, defaults to 0.8):
How random the duration prediction is. Larger values create more variation in the predicted durations.
sampling_rate (`int`, *optional*, defaults to 16000):
The sampling rate at which the output audio waveform is digitalized expressed in hertz (Hz).
Example:
```python
>>> from transformers import VitsModel, VitsConfig
>>> # Initializing a "facebook/mms-tts-eng" style configuration
>>> configuration = VitsConfig()
>>> # Initializing a model (with random weights) from the "facebook/mms-tts-eng" style configuration
>>> model = VitsModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "vits"
def __init__(
self,
vocab_size=38,
hidden_size=192,
num_hidden_layers=6,
num_attention_heads=2,
window_size=4,
use_bias=True,
ffn_dim=768,
layerdrop=0.1,
ffn_kernel_size=3,
flow_size=192,
spectrogram_bins=513,
hidden_act="relu",
hidden_dropout=0.1,
attention_dropout=0.1,
activation_dropout=0.1,
initializer_range=0.02,
layer_norm_eps=1e-5,
use_stochastic_duration_prediction=True,
num_speakers=1,
speaker_embedding_size=0,
upsample_initial_channel=512,
upsample_rates=[8, 8, 2, 2],
upsample_kernel_sizes=[16, 16, 4, 4],
resblock_kernel_sizes=[3, 7, 11],
resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
leaky_relu_slope=0.1,
depth_separable_channels=2,
depth_separable_num_layers=3,
duration_predictor_flow_bins=10,
duration_predictor_tail_bound=5.0,
duration_predictor_kernel_size=3,
duration_predictor_dropout=0.5,
duration_predictor_num_flows=4,
duration_predictor_filter_channels=256,
prior_encoder_num_flows=4,
prior_encoder_num_wavenet_layers=4,
posterior_encoder_num_wavenet_layers=16,
wavenet_kernel_size=5,
wavenet_dilation_rate=1,
wavenet_dropout=0.0,
speaking_rate=1.0,
noise_scale=0.667,
noise_scale_duration=0.8,
sampling_rate=16_000,
**kwargs,
):
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.window_size = window_size
self.use_bias = use_bias
self.ffn_dim = ffn_dim
self.layerdrop = layerdrop
self.ffn_kernel_size = ffn_kernel_size
self.flow_size = flow_size
self.spectrogram_bins = spectrogram_bins
self.hidden_act = hidden_act
self.hidden_dropout = hidden_dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.use_stochastic_duration_prediction = use_stochastic_duration_prediction
self.num_speakers = num_speakers
self.speaker_embedding_size = speaker_embedding_size
self.upsample_initial_channel = upsample_initial_channel
self.upsample_rates = upsample_rates
self.upsample_kernel_sizes = upsample_kernel_sizes
self.resblock_kernel_sizes = resblock_kernel_sizes
self.resblock_dilation_sizes = resblock_dilation_sizes
self.leaky_relu_slope = leaky_relu_slope
self.depth_separable_channels = depth_separable_channels
self.depth_separable_num_layers = depth_separable_num_layers
self.duration_predictor_flow_bins = duration_predictor_flow_bins
self.duration_predictor_tail_bound = duration_predictor_tail_bound
self.duration_predictor_kernel_size = duration_predictor_kernel_size
self.duration_predictor_dropout = duration_predictor_dropout
self.duration_predictor_num_flows = duration_predictor_num_flows
self.duration_predictor_filter_channels = duration_predictor_filter_channels
self.prior_encoder_num_flows = prior_encoder_num_flows
self.prior_encoder_num_wavenet_layers = prior_encoder_num_wavenet_layers
self.posterior_encoder_num_wavenet_layers = posterior_encoder_num_wavenet_layers
self.wavenet_kernel_size = wavenet_kernel_size
self.wavenet_dilation_rate = wavenet_dilation_rate
self.wavenet_dropout = wavenet_dropout
self.speaking_rate = speaking_rate
self.noise_scale = noise_scale
self.noise_scale_duration = noise_scale_duration
self.sampling_rate = sampling_rate
if len(upsample_kernel_sizes) != len(upsample_rates):
raise ValueError(
f"The length of `upsample_kernel_sizes` ({len(upsample_kernel_sizes)}) must match the length of "
f"`upsample_rates` ({len(upsample_rates)})"
)
super().__init__(**kwargs)
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert VITS checkpoint."""
import argparse
import json
import tempfile
import torch
from huggingface_hub import hf_hub_download
from transformers import VitsConfig, VitsModel, VitsTokenizer, logging
logging.set_verbosity_info()
logger = logging.get_logger("transformers.models.vits")
MAPPING_TEXT_ENCODER = {
"enc_p.emb": "text_encoder.embed_tokens",
"enc_p.encoder.attn_layers.*.conv_k": "text_encoder.encoder.layers.*.attention.k_proj",
"enc_p.encoder.attn_layers.*.conv_v": "text_encoder.encoder.layers.*.attention.v_proj",
"enc_p.encoder.attn_layers.*.conv_q": "text_encoder.encoder.layers.*.attention.q_proj",
"enc_p.encoder.attn_layers.*.conv_o": "text_encoder.encoder.layers.*.attention.out_proj",
"enc_p.encoder.attn_layers.*.emb_rel_k": "text_encoder.encoder.layers.*.attention.emb_rel_k",
"enc_p.encoder.attn_layers.*.emb_rel_v": "text_encoder.encoder.layers.*.attention.emb_rel_v",
"enc_p.encoder.norm_layers_1.*.gamma": "text_encoder.encoder.layers.*.layer_norm.weight",
"enc_p.encoder.norm_layers_1.*.beta": "text_encoder.encoder.layers.*.layer_norm.bias",
"enc_p.encoder.ffn_layers.*.conv_1": "text_encoder.encoder.layers.*.feed_forward.conv_1",
"enc_p.encoder.ffn_layers.*.conv_2": "text_encoder.encoder.layers.*.feed_forward.conv_2",
"enc_p.encoder.norm_layers_2.*.gamma": "text_encoder.encoder.layers.*.final_layer_norm.weight",
"enc_p.encoder.norm_layers_2.*.beta": "text_encoder.encoder.layers.*.final_layer_norm.bias",
"enc_p.proj": "text_encoder.project",
}
MAPPING_STOCHASTIC_DURATION_PREDICTOR = {
"dp.pre": "duration_predictor.conv_pre",
"dp.proj": "duration_predictor.conv_proj",
"dp.convs.convs_sep.*": "duration_predictor.conv_dds.convs_dilated.*",
"dp.convs.convs_1x1.*": "duration_predictor.conv_dds.convs_pointwise.*",
"dp.convs.norms_1.*.gamma": "duration_predictor.conv_dds.norms_1.*.weight",
"dp.convs.norms_1.*.beta": "duration_predictor.conv_dds.norms_1.*.bias",
"dp.convs.norms_2.*.gamma": "duration_predictor.conv_dds.norms_2.*.weight",
"dp.convs.norms_2.*.beta": "duration_predictor.conv_dds.norms_2.*.bias",
"dp.flows.0.logs": "duration_predictor.flows.0.log_scale",
"dp.flows.0.m": "duration_predictor.flows.0.translate",
"dp.flows.*.pre": "duration_predictor.flows.*.conv_pre",
"dp.flows.*.proj": "duration_predictor.flows.*.conv_proj",
"dp.flows.*.convs.convs_1x1.0": "duration_predictor.flows.*.conv_dds.convs_pointwise.0",
"dp.flows.*.convs.convs_1x1.1": "duration_predictor.flows.*.conv_dds.convs_pointwise.1",
"dp.flows.*.convs.convs_1x1.2": "duration_predictor.flows.*.conv_dds.convs_pointwise.2",
"dp.flows.*.convs.convs_sep.0": "duration_predictor.flows.*.conv_dds.convs_dilated.0",
"dp.flows.*.convs.convs_sep.1": "duration_predictor.flows.*.conv_dds.convs_dilated.1",
"dp.flows.*.convs.convs_sep.2": "duration_predictor.flows.*.conv_dds.convs_dilated.2",
"dp.flows.*.convs.norms_1.0.gamma": "duration_predictor.flows.*.conv_dds.norms_1.0.weight",
"dp.flows.*.convs.norms_1.0.beta": "duration_predictor.flows.*.conv_dds.norms_1.0.bias",
"dp.flows.*.convs.norms_1.1.gamma": "duration_predictor.flows.*.conv_dds.norms_1.1.weight",
"dp.flows.*.convs.norms_1.1.beta": "duration_predictor.flows.*.conv_dds.norms_1.1.bias",
"dp.flows.*.convs.norms_1.2.gamma": "duration_predictor.flows.*.conv_dds.norms_1.2.weight",
"dp.flows.*.convs.norms_1.2.beta": "duration_predictor.flows.*.conv_dds.norms_1.2.bias",
"dp.flows.*.convs.norms_2.0.gamma": "duration_predictor.flows.*.conv_dds.norms_2.0.weight",
"dp.flows.*.convs.norms_2.0.beta": "duration_predictor.flows.*.conv_dds.norms_2.0.bias",
"dp.flows.*.convs.norms_2.1.gamma": "duration_predictor.flows.*.conv_dds.norms_2.1.weight",
"dp.flows.*.convs.norms_2.1.beta": "duration_predictor.flows.*.conv_dds.norms_2.1.bias",
"dp.flows.*.convs.norms_2.2.gamma": "duration_predictor.flows.*.conv_dds.norms_2.2.weight",
"dp.flows.*.convs.norms_2.2.beta": "duration_predictor.flows.*.conv_dds.norms_2.2.bias",
"dp.post_pre": "duration_predictor.post_conv_pre",
"dp.post_proj": "duration_predictor.post_conv_proj",
"dp.post_convs.convs_sep.*": "duration_predictor.post_conv_dds.convs_dilated.*",
"dp.post_convs.convs_1x1.*": "duration_predictor.post_conv_dds.convs_pointwise.*",
"dp.post_convs.norms_1.*.gamma": "duration_predictor.post_conv_dds.norms_1.*.weight",
"dp.post_convs.norms_1.*.beta": "duration_predictor.post_conv_dds.norms_1.*.bias",
"dp.post_convs.norms_2.*.gamma": "duration_predictor.post_conv_dds.norms_2.*.weight",
"dp.post_convs.norms_2.*.beta": "duration_predictor.post_conv_dds.norms_2.*.bias",
"dp.post_flows.0.logs": "duration_predictor.post_flows.0.log_scale",
"dp.post_flows.0.m": "duration_predictor.post_flows.0.translate",
"dp.post_flows.*.pre": "duration_predictor.post_flows.*.conv_pre",
"dp.post_flows.*.proj": "duration_predictor.post_flows.*.conv_proj",
"dp.post_flows.*.convs.convs_1x1.0": "duration_predictor.post_flows.*.conv_dds.convs_pointwise.0",
"dp.post_flows.*.convs.convs_1x1.1": "duration_predictor.post_flows.*.conv_dds.convs_pointwise.1",
"dp.post_flows.*.convs.convs_1x1.2": "duration_predictor.post_flows.*.conv_dds.convs_pointwise.2",
"dp.post_flows.*.convs.convs_sep.0": "duration_predictor.post_flows.*.conv_dds.convs_dilated.0",
"dp.post_flows.*.convs.convs_sep.1": "duration_predictor.post_flows.*.conv_dds.convs_dilated.1",
"dp.post_flows.*.convs.convs_sep.2": "duration_predictor.post_flows.*.conv_dds.convs_dilated.2",
"dp.post_flows.*.convs.norms_1.0.gamma": "duration_predictor.post_flows.*.conv_dds.norms_1.0.weight",
"dp.post_flows.*.convs.norms_1.0.beta": "duration_predictor.post_flows.*.conv_dds.norms_1.0.bias",
"dp.post_flows.*.convs.norms_1.1.gamma": "duration_predictor.post_flows.*.conv_dds.norms_1.1.weight",
"dp.post_flows.*.convs.norms_1.1.beta": "duration_predictor.post_flows.*.conv_dds.norms_1.1.bias",
"dp.post_flows.*.convs.norms_1.2.gamma": "duration_predictor.post_flows.*.conv_dds.norms_1.2.weight",
"dp.post_flows.*.convs.norms_1.2.beta": "duration_predictor.post_flows.*.conv_dds.norms_1.2.bias",
"dp.post_flows.*.convs.norms_2.0.gamma": "duration_predictor.post_flows.*.conv_dds.norms_2.0.weight",
"dp.post_flows.*.convs.norms_2.0.beta": "duration_predictor.post_flows.*.conv_dds.norms_2.0.bias",
"dp.post_flows.*.convs.norms_2.1.gamma": "duration_predictor.post_flows.*.conv_dds.norms_2.1.weight",
"dp.post_flows.*.convs.norms_2.1.beta": "duration_predictor.post_flows.*.conv_dds.norms_2.1.bias",
"dp.post_flows.*.convs.norms_2.2.gamma": "duration_predictor.post_flows.*.conv_dds.norms_2.2.weight",
"dp.post_flows.*.convs.norms_2.2.beta": "duration_predictor.post_flows.*.conv_dds.norms_2.2.bias",
"dp.cond": "duration_predictor.cond", # num_speakers > 1
}
MAPPING_FLOW = {
"flow.flows.*.pre": "flow.flows.*.conv_pre",
"flow.flows.*.enc.in_layers.0": "flow.flows.*.wavenet.in_layers.0",
"flow.flows.*.enc.in_layers.1": "flow.flows.*.wavenet.in_layers.1",
"flow.flows.*.enc.in_layers.2": "flow.flows.*.wavenet.in_layers.2",
"flow.flows.*.enc.in_layers.3": "flow.flows.*.wavenet.in_layers.3",
"flow.flows.*.enc.res_skip_layers.0": "flow.flows.*.wavenet.res_skip_layers.0",
"flow.flows.*.enc.res_skip_layers.1": "flow.flows.*.wavenet.res_skip_layers.1",
"flow.flows.*.enc.res_skip_layers.2": "flow.flows.*.wavenet.res_skip_layers.2",
"flow.flows.*.enc.res_skip_layers.3": "flow.flows.*.wavenet.res_skip_layers.3",
"flow.flows.*.enc.cond_layer": "flow.flows.*.wavenet.cond_layer", # num_speakers > 1
"flow.flows.*.post": "flow.flows.*.conv_post",
}
MAPPING_GENERATOR = {
"dec.conv_pre": "decoder.conv_pre",
"dec.ups.0": "decoder.upsampler.0",
"dec.ups.1": "decoder.upsampler.1",
"dec.ups.2": "decoder.upsampler.2",
"dec.ups.3": "decoder.upsampler.3",
"dec.resblocks.*.convs1.0": "decoder.resblocks.*.convs1.0",
"dec.resblocks.*.convs1.1": "decoder.resblocks.*.convs1.1",
"dec.resblocks.*.convs1.2": "decoder.resblocks.*.convs1.2",
"dec.resblocks.*.convs2.0": "decoder.resblocks.*.convs2.0",
"dec.resblocks.*.convs2.1": "decoder.resblocks.*.convs2.1",
"dec.resblocks.*.convs2.2": "decoder.resblocks.*.convs2.2",
"dec.conv_post": "decoder.conv_post",
"dec.cond": "decoder.cond", # num_speakers > 1
}
MAPPING_POSTERIOR_ENCODER = {
"enc_q.pre": "posterior_encoder.conv_pre",
"enc_q.enc.in_layers.*": "posterior_encoder.wavenet.in_layers.*",
"enc_q.enc.res_skip_layers.*": "posterior_encoder.wavenet.res_skip_layers.*",
"enc_q.enc.cond_layer": "posterior_encoder.wavenet.cond_layer", # num_speakers > 1
"enc_q.proj": "posterior_encoder.conv_proj",
}
MAPPING = {
**MAPPING_TEXT_ENCODER,
**MAPPING_STOCHASTIC_DURATION_PREDICTOR,
**MAPPING_FLOW,
**MAPPING_GENERATOR,
**MAPPING_POSTERIOR_ENCODER,
"emb_g": "embed_speaker", # num_speakers > 1
}
TOP_LEVEL_KEYS = []
IGNORE_KEYS = []
def set_recursively(hf_pointer, key, value, full_name, weight_type):
for attribute in key.split("."):
hf_pointer = getattr(hf_pointer, attribute)
if weight_type is not None:
hf_shape = getattr(hf_pointer, weight_type).shape
else:
hf_shape = hf_pointer.shape
# strip off the kernel dimension at the end (original weights are Conv1d)
if key.endswith(".k_proj") or key.endswith(".v_proj") or key.endswith(".q_proj") or key.endswith(".out_proj"):
value = value.squeeze(-1)
if hf_shape != value.shape:
raise ValueError(
f"Shape of hf {key + '.' + weight_type if weight_type is not None else ''} is {hf_shape}, but should be"
f" {value.shape} for {full_name}"
)
if weight_type == "weight":
hf_pointer.weight.data = value
elif weight_type == "weight_g":
hf_pointer.weight_g.data = value
elif weight_type == "weight_v":
hf_pointer.weight_v.data = value
elif weight_type == "bias":
hf_pointer.bias.data = value
elif weight_type == "running_mean":
hf_pointer.running_mean.data = value
elif weight_type == "running_var":
hf_pointer.running_var.data = value
elif weight_type == "num_batches_tracked":
hf_pointer.num_batches_tracked.data = value
else:
hf_pointer.data = value
logger.info(f"{key + ('.' + weight_type if weight_type is not None else '')} was initialized from {full_name}.")
def should_ignore(name, ignore_keys):
for key in ignore_keys:
if key.endswith(".*"):
if name.startswith(key[:-1]):
return True
elif ".*." in key:
prefix, suffix = key.split(".*.")
if prefix in name and suffix in name:
return True
elif key in name:
return True
return False
def recursively_load_weights(fairseq_dict, hf_model):
unused_weights = []
for name, value in fairseq_dict.items():
if should_ignore(name, IGNORE_KEYS):
logger.info(f"{name} was ignored")
continue
is_used = False
for key, mapped_key in MAPPING.items():
if key.endswith(".*"):
key = key[:-1]
elif "*" in key:
prefix, suffix = key.split(".*.")
if prefix in name and suffix in name:
key = suffix
if key in name:
is_used = True
if mapped_key.endswith(".*"):
layer_index = name.split(key)[-1].split(".")[0]
mapped_key = mapped_key.replace("*", layer_index)
elif "*" in mapped_key:
layer_index = name.split(key)[0].split(".")[-2]
# remap the layer index since we removed the Flip layers
if "flow.flows" in mapped_key:
layer_index = str(int(layer_index) // 2)
if "duration_predictor.flows" in mapped_key or "duration_predictor.post_flows" in mapped_key:
layer_index = str(int(layer_index) // 2 + 1)
mapped_key = mapped_key.replace("*", layer_index)
if "weight_g" in name:
weight_type = "weight_g"
elif "weight_v" in name:
weight_type = "weight_v"
elif "bias" in name:
weight_type = "bias"
elif "weight" in name:
weight_type = "weight"
elif "running_mean" in name:
weight_type = "running_mean"
elif "running_var" in name:
weight_type = "running_var"
elif "num_batches_tracked" in name:
weight_type = "num_batches_tracked"
else:
weight_type = None
set_recursively(hf_model, mapped_key, value, name, weight_type)
continue
if not is_used:
unused_weights.append(name)
logger.warning(f"Unused weights: {unused_weights}")
@torch.no_grad()
def convert_checkpoint(
pytorch_dump_folder_path,
checkpoint_path=None,
config_path=None,
vocab_path=None,
language=None,
num_speakers=None,
sampling_rate=None,
repo_id=None,
):
"""
Copy/paste/tweak model's weights to transformers design.
"""
if config_path is not None:
config = VitsConfig.from_pretrained(config_path)
else:
config = VitsConfig()
if num_speakers:
config.num_speakers = num_speakers
config.speaker_embedding_size = 256
if sampling_rate:
config.sampling_rate = sampling_rate
if checkpoint_path is None:
logger.info(f"***Converting model: facebook/mms-tts {language}***")
vocab_path = hf_hub_download(
repo_id="facebook/mms-tts",
filename="vocab.txt",
subfolder=f"models/{language}",
)
config_file = hf_hub_download(
repo_id="facebook/mms-tts",
filename="config.json",
subfolder=f"models/{language}",
)
checkpoint_path = hf_hub_download(
repo_id="facebook/mms-tts",
filename="G_100000.pth",
subfolder=f"models/{language}",
)
with open(config_file, "r") as f:
data = f.read()
hps = json.loads(data)
is_uroman = hps["data"]["training_files"].split(".")[-1] == "uroman"
if is_uroman:
logger.warning("For this checkpoint, you should use `uroman` to convert input text before tokenizing it!")
else:
logger.info(f"***Converting model: {checkpoint_path}***")
is_uroman = False
# original VITS checkpoint
if vocab_path is None:
_pad = "_"
_punctuation = ';:,.!?¡¿—…"«»“” '
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
symbols = _pad + _punctuation + _letters + _letters_ipa
symbol_to_id = {s: i for i, s in enumerate(symbols)}
phonemize = True
else:
# Save vocab as temporary json file
symbols = [line.replace("\n", "") for line in open(vocab_path, encoding="utf-8").readlines()]
symbol_to_id = {s: i for i, s in enumerate(symbols)}
# MMS-TTS does not use a <pad> token, so we set to the token used to space characters
_pad = symbols[0]
phonemize = False
with tempfile.NamedTemporaryFile() as tf:
with open(tf.name, "w", encoding="utf-8") as f:
f.write(json.dumps(symbol_to_id, indent=2, sort_keys=True, ensure_ascii=False) + "\n")
tokenizer = VitsTokenizer(tf.name, language=language, phonemize=phonemize, is_uroman=is_uroman, pad_token=_pad)
config.vocab_size = len(symbols)
model = VitsModel(config)
model.decoder.apply_weight_norm()
orig_checkpoint = torch.load(checkpoint_path, map_location=torch.device("cpu"))
recursively_load_weights(orig_checkpoint["model"], model)
model.decoder.remove_weight_norm()
model.save_pretrained(pytorch_dump_folder_path)
tokenizer.save_pretrained(pytorch_dump_folder_path)
if repo_id:
print("Pushing to the hub...")
tokenizer.push_to_hub(repo_id)
model.push_to_hub(repo_id)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint_path", default=None, type=str, help="Local path to original checkpoint")
parser.add_argument("--vocab_path", default=None, type=str, help="Path to vocab.txt")
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
parser.add_argument("--language", default=None, type=str, help="Tokenizer language (three-letter code)")
parser.add_argument("--num_speakers", default=None, type=int, help="Number of speakers")
parser.add_argument(
"--sampling_rate", default=None, type=int, help="Sampling rate on which the model was trained."
)
parser.add_argument(
"--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
)
parser.add_argument(
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
)
args = parser.parse_args()
convert_checkpoint(
args.pytorch_dump_folder_path,
args.checkpoint_path,
args.config_path,
args.vocab_path,
args.language,
args.num_speakers,
args.sampling_rate,
args.push_to_hub,
)
# coding=utf-8
# Copyright 2023 The Kakao Enterprise Authors and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch VITS model."""
import math
from dataclasses import dataclass
from typing import Any, Optional, Tuple, Union
import numpy as np
import torch
import torch.utils.checkpoint
from torch import nn
from ...activations import ACT2FN
from ...deepspeed import is_deepspeed_zero3_enabled
from ...modeling_outputs import (
BaseModelOutput,
ModelOutput,
)
from ...modeling_utils import PreTrainedModel
from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
from .configuration_vits import VitsConfig
logger = logging.get_logger(__name__)
# General docstring
_CONFIG_FOR_DOC = "VitsConfig"
VITS_PRETRAINED_MODEL_ARCHIVE_LIST = [
"facebook/mms-tts-eng",
# See all VITS models at https://huggingface.co/models?filter=vits
# and all MMS models at https://huggingface.co/models?sort=trending&search=facebook%2Fmms-tts
]
@dataclass
class VitsModelOutput(ModelOutput):
"""
Describes the outputs for the VITS model, with potential hidden states and attentions.
Args:
waveform (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
The final audio waveform predicted by the model.
sequence_lengths (`torch.FloatTensor` of shape `(batch_size,)`):
The length in samples of each element in the `waveform` batch.
spectrogram (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_bins)`):
The log-mel spectrogram predicted at the output of the flow model. This spectrogram is passed to the Hi-Fi
GAN decoder model to obtain the final audio waveform.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""
waveform: torch.FloatTensor = None
sequence_lengths: torch.FloatTensor = None
spectrogram: Optional[Tuple[torch.FloatTensor]] = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
@dataclass
class VitsTextEncoderOutput(ModelOutput):
"""
Describes the outputs for the VITS text encoder model, with potential hidden states and attentions.
Args:
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
prior_means (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
The predicted mean values of the prior distribution for the latent text variables.
prior_log_variances (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
The predicted log-variance values of the prior distribution for the latent text variables.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""
last_hidden_state: torch.FloatTensor = None
prior_means: torch.FloatTensor = None
prior_log_variances: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
# Copied from transformers.models.bart.modeling_bart._expand_mask
def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
"""
Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
"""
bsz, src_len = mask.size()
tgt_len = tgt_len if tgt_len is not None else src_len
expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
inverted_mask = 1.0 - expanded_mask
return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
@torch.jit.script
def fused_add_tanh_sigmoid_multiply(input_a, input_b, num_channels):
in_act = input_a + input_b
t_act = torch.tanh(in_act[:, :num_channels, :])
s_act = torch.sigmoid(in_act[:, num_channels:, :])
acts = t_act * s_act
return acts
def _unconstrained_rational_quadratic_spline(
inputs,
unnormalized_widths,
unnormalized_heights,
unnormalized_derivatives,
reverse=False,
tail_bound=5.0,
min_bin_width=1e-3,
min_bin_height=1e-3,
min_derivative=1e-3,
):
"""
This transformation represents a monotonically increasing piecewise rational quadratic function. Outside of the
`tail_bound`, the transform behaves as an identity function.
Args:
inputs (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`:
Second half of the hidden-states input to the Vits convolutional flow module.
unnormalized_widths (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`):
First `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection
layer in the convolutional flow module
unnormalized_heights (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`):
Second `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection
layer in the convolutional flow module
unnormalized_derivatives (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`):
Third `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection
layer in the convolutional flow module
reverse (`bool`, *optional*, defaults to `False`):
Whether the model is being run in reverse mode.
tail_bound (`float`, *optional* defaults to 5):
Upper and lower limit bound for the rational quadratic function. Outside of this `tail_bound`, the
transform behaves as an identity function.
min_bin_width (`float`, *optional*, defaults to 1e-3):
Minimum bin value across the width dimension for the piecewise rational quadratic function.
min_bin_height (`float`, *optional*, defaults to 1e-3):
Minimum bin value across the height dimension for the piecewise rational quadratic function.
min_derivative (`float`, *optional*, defaults to 1e-3):
Minimum bin value across the derivatives for the piecewise rational quadratic function.
Returns:
outputs (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`:
Hidden-states as transformed by the piecewise rational quadratic function with the `tail_bound` limits
applied.
log_abs_det (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`:
Logarithm of the absolute value of the determinants corresponding to the `outputs` with the `tail_bound`
limits applied.
"""
inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound)
outside_interval_mask = ~inside_interval_mask
outputs = torch.zeros_like(inputs)
log_abs_det = torch.zeros_like(inputs)
constant = np.log(np.exp(1 - min_derivative) - 1)
unnormalized_derivatives = nn.functional.pad(unnormalized_derivatives, pad=(1, 1))
unnormalized_derivatives[..., 0] = constant
unnormalized_derivatives[..., -1] = constant
outputs[outside_interval_mask] = inputs[outside_interval_mask]
log_abs_det[outside_interval_mask] = 0.0
outputs[inside_interval_mask], log_abs_det[inside_interval_mask] = _rational_quadratic_spline(
inputs=inputs[inside_interval_mask],
unnormalized_widths=unnormalized_widths[inside_interval_mask, :],
unnormalized_heights=unnormalized_heights[inside_interval_mask, :],
unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :],
reverse=reverse,
tail_bound=tail_bound,
min_bin_width=min_bin_width,
min_bin_height=min_bin_height,
min_derivative=min_derivative,
)
return outputs, log_abs_det
def _rational_quadratic_spline(
inputs,
unnormalized_widths,
unnormalized_heights,
unnormalized_derivatives,
reverse,
tail_bound,
min_bin_width,
min_bin_height,
min_derivative,
):
"""
This transformation represents a monotonically increasing piecewise rational quadratic function. Unlike the
function `_unconstrained_rational_quadratic_spline`, the function behaves the same across the `tail_bound`.
Args:
inputs (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`:
Second half of the hidden-states input to the Vits convolutional flow module.
unnormalized_widths (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`):
First `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection
layer in the convolutional flow module
unnormalized_heights (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`):
Second `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection
layer in the convolutional flow module
unnormalized_derivatives (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`):
Third `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection
layer in the convolutional flow module
reverse (`bool`):
Whether the model is being run in reverse mode.
tail_bound (`float`):
Upper and lower limit bound for the rational quadratic function. Outside of this `tail_bound`, the
transform behaves as an identity function.
min_bin_width (`float`):
Minimum bin value across the width dimension for the piecewise rational quadratic function.
min_bin_height (`float`):
Minimum bin value across the height dimension for the piecewise rational quadratic function.
min_derivative (`float`):
Minimum bin value across the derivatives for the piecewise rational quadratic function.
Returns:
outputs (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`:
Hidden-states as transformed by the piecewise rational quadratic function.
log_abs_det (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`:
Logarithm of the absolute value of the determinants corresponding to the `outputs`.
"""
upper_bound = tail_bound
lower_bound = -tail_bound
if torch.min(inputs) < lower_bound or torch.max(inputs) > upper_bound:
raise ValueError("Input to a transform is not within its domain")
num_bins = unnormalized_widths.shape[-1]
if min_bin_width * num_bins > 1.0:
raise ValueError(f"Minimal bin width {min_bin_width} too large for the number of bins {num_bins}")
if min_bin_height * num_bins > 1.0:
raise ValueError(f"Minimal bin height {min_bin_height} too large for the number of bins {num_bins}")
widths = nn.functional.softmax(unnormalized_widths, dim=-1)
widths = min_bin_width + (1 - min_bin_width * num_bins) * widths
cumwidths = torch.cumsum(widths, dim=-1)
cumwidths = nn.functional.pad(cumwidths, pad=(1, 0), mode="constant", value=0.0)
cumwidths = (upper_bound - lower_bound) * cumwidths + lower_bound
cumwidths[..., 0] = lower_bound
cumwidths[..., -1] = upper_bound
widths = cumwidths[..., 1:] - cumwidths[..., :-1]
derivatives = min_derivative + nn.functional.softplus(unnormalized_derivatives)
heights = nn.functional.softmax(unnormalized_heights, dim=-1)
heights = min_bin_height + (1 - min_bin_height * num_bins) * heights
cumheights = torch.cumsum(heights, dim=-1)
cumheights = nn.functional.pad(cumheights, pad=(1, 0), mode="constant", value=0.0)
cumheights = (upper_bound - lower_bound) * cumheights + lower_bound
cumheights[..., 0] = lower_bound
cumheights[..., -1] = upper_bound
heights = cumheights[..., 1:] - cumheights[..., :-1]
bin_locations = cumheights if reverse else cumwidths
bin_locations[..., -1] += 1e-6
bin_idx = torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1
bin_idx = bin_idx[..., None]
input_cumwidths = cumwidths.gather(-1, bin_idx)[..., 0]
input_bin_widths = widths.gather(-1, bin_idx)[..., 0]
input_cumheights = cumheights.gather(-1, bin_idx)[..., 0]
delta = heights / widths
input_delta = delta.gather(-1, bin_idx)[..., 0]
input_derivatives = derivatives.gather(-1, bin_idx)[..., 0]
input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0]
input_heights = heights.gather(-1, bin_idx)[..., 0]
intermediate1 = input_derivatives + input_derivatives_plus_one - 2 * input_delta
if not reverse:
theta = (inputs - input_cumwidths) / input_bin_widths
theta_one_minus_theta = theta * (1 - theta)
numerator = input_heights * (input_delta * theta.pow(2) + input_derivatives * theta_one_minus_theta)
denominator = input_delta + intermediate1 * theta_one_minus_theta
outputs = input_cumheights + numerator / denominator
derivative_numerator = input_delta.pow(2) * (
input_derivatives_plus_one * theta.pow(2)
+ 2 * input_delta * theta_one_minus_theta
+ input_derivatives * (1 - theta).pow(2)
)
log_abs_det = torch.log(derivative_numerator) - 2 * torch.log(denominator)
return outputs, log_abs_det
else:
# find the roots of a quadratic equation
intermediate2 = inputs - input_cumheights
intermediate3 = intermediate2 * intermediate1
a = input_heights * (input_delta - input_derivatives) + intermediate3
b = input_heights * input_derivatives - intermediate3
c = -input_delta * intermediate2
discriminant = b.pow(2) - 4 * a * c
if not (discriminant >= 0).all():
raise RuntimeError(f"invalid discriminant {discriminant}")
root = (2 * c) / (-b - torch.sqrt(discriminant))
outputs = root * input_bin_widths + input_cumwidths
theta_one_minus_theta = root * (1 - root)
denominator = input_delta + intermediate1 * theta_one_minus_theta
derivative_numerator = input_delta.pow(2) * (
input_derivatives_plus_one * root.pow(2)
+ 2 * input_delta * theta_one_minus_theta
+ input_derivatives * (1 - root).pow(2)
)
log_abs_det = torch.log(derivative_numerator) - 2 * torch.log(denominator)
return outputs, -log_abs_det
class VitsWaveNet(torch.nn.Module):
def __init__(self, config: VitsConfig, num_layers: int):
super().__init__()
self.hidden_size = config.hidden_size
self.num_layers = num_layers
self.in_layers = torch.nn.ModuleList()
self.res_skip_layers = torch.nn.ModuleList()
self.dropout = nn.Dropout(config.wavenet_dropout)
if config.speaker_embedding_size != 0:
cond_layer = torch.nn.Conv1d(config.speaker_embedding_size, 2 * config.hidden_size * num_layers, 1)
self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name="weight")
for i in range(num_layers):
dilation = config.wavenet_dilation_rate**i
padding = (config.wavenet_kernel_size * dilation - dilation) // 2
in_layer = torch.nn.Conv1d(
in_channels=config.hidden_size,
out_channels=2 * config.hidden_size,
kernel_size=config.wavenet_kernel_size,
dilation=dilation,
padding=padding,
)
in_layer = torch.nn.utils.weight_norm(in_layer, name="weight")
self.in_layers.append(in_layer)
# last one is not necessary
if i < num_layers - 1:
res_skip_channels = 2 * config.hidden_size
else:
res_skip_channels = config.hidden_size
res_skip_layer = torch.nn.Conv1d(config.hidden_size, res_skip_channels, 1)
res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name="weight")
self.res_skip_layers.append(res_skip_layer)
def forward(self, inputs, padding_mask, global_conditioning=None):
outputs = torch.zeros_like(inputs)
num_channels_tensor = torch.IntTensor([self.hidden_size])
if global_conditioning is not None:
global_conditioning = self.cond_layer(global_conditioning)
for i in range(self.num_layers):
hidden_states = self.in_layers[i](inputs)
if global_conditioning is not None:
cond_offset = i * 2 * self.hidden_size
global_states = global_conditioning[:, cond_offset : cond_offset + 2 * self.hidden_size, :]
else:
global_states = torch.zeros_like(hidden_states)
acts = fused_add_tanh_sigmoid_multiply(hidden_states, global_states, num_channels_tensor[0])
acts = self.dropout(acts)
res_skip_acts = self.res_skip_layers[i](acts)
if i < self.num_layers - 1:
res_acts = res_skip_acts[:, : self.hidden_size, :]
inputs = (inputs + res_acts) * padding_mask
outputs = outputs + res_skip_acts[:, self.hidden_size :, :]
else:
outputs = outputs + res_skip_acts
return outputs * padding_mask
def remove_weight_norm(self):
if self.speaker_embedding_size != 0:
torch.nn.utils.remove_weight_norm(self.cond_layer)
for layer in self.in_layers:
torch.nn.utils.remove_weight_norm(layer)
for layer in self.res_skip_layers:
torch.nn.utils.remove_weight_norm(layer)
class VitsPosteriorEncoder(nn.Module):
def __init__(self, config: VitsConfig):
super().__init__()
self.out_channels = config.flow_size
self.conv_pre = nn.Conv1d(config.spectrogram_bins, config.hidden_size, 1)
self.wavenet = VitsWaveNet(config, num_layers=config.posterior_encoder_num_wavenet_layers)
self.conv_proj = nn.Conv1d(config.hidden_size, self.out_channels * 2, 1)
def forward(self, inputs, padding_mask, global_conditioning=None):
inputs = self.conv_pre(inputs) * padding_mask
inputs = self.wavenet(inputs, padding_mask, global_conditioning)
stats = self.conv_proj(inputs) * padding_mask
mean, log_stddev = torch.split(stats, self.out_channels, dim=1)
sampled = (mean + torch.randn_like(mean) * torch.exp(log_stddev)) * padding_mask
return sampled, mean, log_stddev
# Copied from transformers.models.speecht5.modeling_speecht5.HifiGanResidualBlock
class HifiGanResidualBlock(nn.Module):
def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5), leaky_relu_slope=0.1):
super().__init__()
self.leaky_relu_slope = leaky_relu_slope
self.convs1 = nn.ModuleList(
[
nn.Conv1d(
channels,
channels,
kernel_size,
stride=1,
dilation=dilation[i],
padding=self.get_padding(kernel_size, dilation[i]),
)
for i in range(len(dilation))
]
)
self.convs2 = nn.ModuleList(
[
nn.Conv1d(
channels,
channels,
kernel_size,
stride=1,
dilation=1,
padding=self.get_padding(kernel_size, 1),
)
for _ in range(len(dilation))
]
)
def get_padding(self, kernel_size, dilation=1):
return (kernel_size * dilation - dilation) // 2
def apply_weight_norm(self):
for layer in self.convs1:
nn.utils.weight_norm(layer)
for layer in self.convs2:
nn.utils.weight_norm(layer)
def remove_weight_norm(self):
for layer in self.convs1:
nn.utils.remove_weight_norm(layer)
for layer in self.convs2:
nn.utils.remove_weight_norm(layer)
def forward(self, hidden_states):
for conv1, conv2 in zip(self.convs1, self.convs2):
residual = hidden_states
hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
hidden_states = conv1(hidden_states)
hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope)
hidden_states = conv2(hidden_states)
hidden_states = hidden_states + residual
return hidden_states
class VitsHifiGan(nn.Module):
def __init__(self, config: VitsConfig):
super().__init__()
self.config = config
self.num_kernels = len(config.resblock_kernel_sizes)
self.num_upsamples = len(config.upsample_rates)
self.conv_pre = nn.Conv1d(
config.flow_size,
config.upsample_initial_channel,
kernel_size=7,
stride=1,
padding=3,
)
self.upsampler = nn.ModuleList()
for i, (upsample_rate, kernel_size) in enumerate(zip(config.upsample_rates, config.upsample_kernel_sizes)):
self.upsampler.append(
nn.ConvTranspose1d(
config.upsample_initial_channel // (2**i),
config.upsample_initial_channel // (2 ** (i + 1)),
kernel_size=kernel_size,
stride=upsample_rate,
padding=(kernel_size - upsample_rate) // 2,
)
)
self.resblocks = nn.ModuleList()
for i in range(len(self.upsampler)):
channels = config.upsample_initial_channel // (2 ** (i + 1))
for kernel_size, dilation in zip(config.resblock_kernel_sizes, config.resblock_dilation_sizes):
self.resblocks.append(HifiGanResidualBlock(channels, kernel_size, dilation, config.leaky_relu_slope))
self.conv_post = nn.Conv1d(channels, 1, kernel_size=7, stride=1, padding=3, bias=False)
if config.speaker_embedding_size != 0:
self.cond = nn.Conv1d(config.speaker_embedding_size, config.upsample_initial_channel, 1)
def apply_weight_norm(self):
for layer in self.upsampler:
nn.utils.weight_norm(layer)
for layer in self.resblocks:
layer.apply_weight_norm()
def remove_weight_norm(self):
for layer in self.upsampler:
nn.utils.remove_weight_norm(layer)
for layer in self.resblocks:
layer.remove_weight_norm()
def forward(
self, spectrogram: torch.FloatTensor, global_conditioning: Optional[torch.FloatTensor] = None
) -> torch.FloatTensor:
r"""
Converts a spectrogram into a speech waveform.
Args:
spectrogram (`torch.FloatTensor` of shape `(batch_size, config.spectrogram_bins, sequence_length)`):
Tensor containing the spectrograms.
global_conditioning (`torch.FloatTensor` of shape `(batch_size, config.speaker_embedding_size, 1)`, *optional*):
Tensor containing speaker embeddings, for multispeaker models.
Returns:
`torch.FloatTensor`: Tensor of shape shape `(batch_size, 1, num_frames)` containing the speech waveform.
"""
hidden_states = self.conv_pre(spectrogram)
if global_conditioning is not None:
hidden_states = hidden_states + self.cond(global_conditioning)
for i in range(self.num_upsamples):
hidden_states = nn.functional.leaky_relu(hidden_states, self.config.leaky_relu_slope)
hidden_states = self.upsampler[i](hidden_states)
res_state = self.resblocks[i * self.num_kernels](hidden_states)
for j in range(1, self.num_kernels):
res_state += self.resblocks[i * self.num_kernels + j](hidden_states)
hidden_states = res_state / self.num_kernels
hidden_states = nn.functional.leaky_relu(hidden_states)
hidden_states = self.conv_post(hidden_states)
waveform = torch.tanh(hidden_states)
return waveform
class VitsResidualCouplingLayer(nn.Module):
def __init__(self, config: VitsConfig):
super().__init__()
self.half_channels = config.flow_size // 2
self.conv_pre = nn.Conv1d(self.half_channels, config.hidden_size, 1)
self.wavenet = VitsWaveNet(config, num_layers=config.prior_encoder_num_wavenet_layers)
self.conv_post = nn.Conv1d(config.hidden_size, self.half_channels, 1)
def forward(self, inputs, padding_mask, global_conditioning=None, reverse=False):
first_half, second_half = torch.split(inputs, [self.half_channels] * 2, dim=1)
hidden_states = self.conv_pre(first_half) * padding_mask
hidden_states = self.wavenet(hidden_states, padding_mask, global_conditioning)
mean = self.conv_post(hidden_states) * padding_mask
log_stddev = torch.zeros_like(mean)
if not reverse:
second_half = mean + second_half * torch.exp(log_stddev) * padding_mask
outputs = torch.cat([first_half, second_half], dim=1)
log_determinant = torch.sum(log_stddev, [1, 2])
return outputs, log_determinant
else:
second_half = (second_half - mean) * torch.exp(-log_stddev) * padding_mask
outputs = torch.cat([first_half, second_half], dim=1)
return outputs, None
class VitsResidualCouplingBlock(nn.Module):
def __init__(self, config: VitsConfig):
super().__init__()
self.flows = nn.ModuleList()
for _ in range(config.prior_encoder_num_flows):
self.flows.append(VitsResidualCouplingLayer(config))
def forward(self, inputs, padding_mask, global_conditioning=None, reverse=False):
if not reverse:
for flow in self.flows:
inputs, _ = flow(inputs, padding_mask, global_conditioning)
inputs = torch.flip(inputs, [1])
else:
for flow in reversed(self.flows):
inputs = torch.flip(inputs, [1])
inputs, _ = flow(inputs, padding_mask, global_conditioning, reverse=True)
return inputs
class VitsDilatedDepthSeparableConv(nn.Module):
def __init__(self, config: VitsConfig, dropout_rate=0.0):
super().__init__()
kernel_size = config.duration_predictor_kernel_size
channels = config.hidden_size
self.num_layers = config.depth_separable_num_layers
self.dropout = nn.Dropout(dropout_rate)
self.convs_dilated = nn.ModuleList()
self.convs_pointwise = nn.ModuleList()
self.norms_1 = nn.ModuleList()
self.norms_2 = nn.ModuleList()
for i in range(self.num_layers):
dilation = kernel_size**i
padding = (kernel_size * dilation - dilation) // 2
self.convs_dilated.append(
nn.Conv1d(
in_channels=channels,
out_channels=channels,
kernel_size=kernel_size,
groups=channels,
dilation=dilation,
padding=padding,
)
)
self.convs_pointwise.append(nn.Conv1d(channels, channels, 1))
self.norms_1.append(nn.LayerNorm(channels))
self.norms_2.append(nn.LayerNorm(channels))
def forward(self, inputs, padding_mask, global_conditioning=None):
if global_conditioning is not None:
inputs = inputs + global_conditioning
for i in range(self.num_layers):
hidden_states = self.convs_dilated[i](inputs * padding_mask)
hidden_states = self.norms_1[i](hidden_states.transpose(1, -1)).transpose(1, -1)
hidden_states = nn.functional.gelu(hidden_states)
hidden_states = self.convs_pointwise[i](hidden_states)
hidden_states = self.norms_2[i](hidden_states.transpose(1, -1)).transpose(1, -1)
hidden_states = nn.functional.gelu(hidden_states)
hidden_states = self.dropout(hidden_states)
inputs = inputs + hidden_states
return inputs * padding_mask
class VitsConvFlow(nn.Module):
def __init__(self, config: VitsConfig):
super().__init__()
self.filter_channels = config.hidden_size
self.half_channels = config.depth_separable_channels // 2
self.num_bins = config.duration_predictor_flow_bins
self.tail_bound = config.duration_predictor_tail_bound
self.conv_pre = nn.Conv1d(self.half_channels, self.filter_channels, 1)
self.conv_dds = VitsDilatedDepthSeparableConv(config)
self.conv_proj = nn.Conv1d(self.filter_channels, self.half_channels * (self.num_bins * 3 - 1), 1)
def forward(self, inputs, padding_mask, global_conditioning=None, reverse=False):
first_half, second_half = torch.split(inputs, [self.half_channels] * 2, dim=1)
hidden_states = self.conv_pre(first_half)
hidden_states = self.conv_dds(hidden_states, padding_mask, global_conditioning)
hidden_states = self.conv_proj(hidden_states) * padding_mask
batch_size, channels, length = first_half.shape
hidden_states = hidden_states.reshape(batch_size, channels, -1, length).permute(0, 1, 3, 2)
unnormalized_widths = hidden_states[..., : self.num_bins] / math.sqrt(self.filter_channels)
unnormalized_heights = hidden_states[..., self.num_bins : 2 * self.num_bins] / math.sqrt(self.filter_channels)
unnormalized_derivatives = hidden_states[..., 2 * self.num_bins :]
second_half, log_abs_det = _unconstrained_rational_quadratic_spline(
second_half,
unnormalized_widths,
unnormalized_heights,
unnormalized_derivatives,
reverse=reverse,
tail_bound=self.tail_bound,
)
outputs = torch.cat([first_half, second_half], dim=1) * padding_mask
if not reverse:
log_determinant = torch.sum(log_abs_det * padding_mask, [1, 2])
return outputs, log_determinant
else:
return outputs, None
class VitsElementwiseAffine(nn.Module):
def __init__(self, config: VitsConfig):
super().__init__()
self.channels = config.depth_separable_channels
self.translate = nn.Parameter(torch.zeros(self.channels, 1))
self.log_scale = nn.Parameter(torch.zeros(self.channels, 1))
def forward(self, inputs, padding_mask, global_conditioning=None, reverse=False):
if not reverse:
outputs = self.translate + torch.exp(self.log_scale) * inputs
outputs = outputs * padding_mask
log_determinant = torch.sum(self.log_scale * padding_mask, [1, 2])
return outputs, log_determinant
else:
outputs = (inputs - self.translate) * torch.exp(-self.log_scale) * padding_mask
return outputs, None
class VitsStochasticDurationPredictor(nn.Module):
def __init__(self, config):
super().__init__()
embed_dim = config.speaker_embedding_size
filter_channels = config.hidden_size
self.conv_pre = nn.Conv1d(filter_channels, filter_channels, 1)
self.conv_proj = nn.Conv1d(filter_channels, filter_channels, 1)
self.conv_dds = VitsDilatedDepthSeparableConv(
config,
dropout_rate=config.duration_predictor_dropout,
)
if embed_dim != 0:
self.cond = nn.Conv1d(embed_dim, filter_channels, 1)
self.flows = nn.ModuleList()
self.flows.append(VitsElementwiseAffine(config))
for _ in range(config.duration_predictor_num_flows):
self.flows.append(VitsConvFlow(config))
self.post_conv_pre = nn.Conv1d(1, filter_channels, 1)
self.post_conv_proj = nn.Conv1d(filter_channels, filter_channels, 1)
self.post_conv_dds = VitsDilatedDepthSeparableConv(
config,
dropout_rate=config.duration_predictor_dropout,
)
self.post_flows = nn.ModuleList()
self.post_flows.append(VitsElementwiseAffine(config))
for _ in range(config.duration_predictor_num_flows):
self.post_flows.append(VitsConvFlow(config))
def forward(self, inputs, padding_mask, global_conditioning=None, durations=None, reverse=False, noise_scale=1.0):
inputs = torch.detach(inputs)
inputs = self.conv_pre(inputs)
if global_conditioning is not None:
global_conditioning = torch.detach(global_conditioning)
inputs = inputs + self.cond(global_conditioning)
inputs = self.conv_dds(inputs, padding_mask)
inputs = self.conv_proj(inputs) * padding_mask
if not reverse:
hidden_states = self.post_conv_pre(durations)
hidden_states = self.post_conv_dds(hidden_states, padding_mask)
hidden_states = self.post_conv_proj(hidden_states) * padding_mask
random_posterior = (
torch.randn(durations.size(0), 2, durations.size(2)).to(device=inputs.device, dtype=inputs.dtype)
* padding_mask
)
log_determinant_posterior_sum = 0
latents_posterior = random_posterior
for flow in self.post_flows:
latents_posterior, log_determinant = flow(
latents_posterior, padding_mask, global_conditioning=inputs + hidden_states
)
latents_posterior = torch.flip(latents_posterior, [1])
log_determinant_posterior_sum += log_determinant
first_half, second_half = torch.split(latents_posterior, [1, 1], dim=1)
log_determinant_posterior_sum += torch.sum(
(nn.functional.logsigmoid(first_half) + nn.functional.logsigmoid(-first_half)) * padding_mask, [1, 2]
)
logq = (
torch.sum(-0.5 * (math.log(2 * math.pi) + (random_posterior**2)) * padding_mask, [1, 2])
- log_determinant_posterior_sum
)
first_half = (durations - torch.sigmoid(first_half)) * padding_mask
first_half = torch.log(torch.clamp_min(first_half, 1e-5)) * padding_mask
log_determinant_sum = torch.sum(-first_half, [1, 2])
latents = torch.cat([first_half, second_half], dim=1)
for flow in self.flows:
latents, log_determinant = flow(latents, padding_mask, global_conditioning=inputs)
latents = torch.flip(latents, [1])
log_determinant_sum += log_determinant
nll = (
torch.sum(0.5 * (math.log(2 * math.pi) + (latents**2)) * padding_mask, [1, 2]) - log_determinant_sum
)
return nll + logq
else:
flows = list(reversed(self.flows))
flows = flows[:-2] + [flows[-1]] # remove a useless vflow
latents = (
torch.randn(inputs.size(0), 2, inputs.size(2)).to(device=inputs.device, dtype=inputs.dtype)
* noise_scale
)
for flow in flows:
latents = torch.flip(latents, [1])
latents, _ = flow(latents, padding_mask, global_conditioning=inputs, reverse=True)
log_duration, _ = torch.split(latents, [1, 1], dim=1)
return log_duration
class VitsDurationPredictor(nn.Module):
def __init__(self, config):
super().__init__()
kernel_size = config.duration_predictor_kernel_size
filter_channels = config.duration_predictor_filter_channels
self.dropout = nn.Dropout(config.duration_predictor_dropout)
self.conv_1 = nn.Conv1d(config.hidden_size, filter_channels, kernel_size, padding=kernel_size // 2)
self.norm_1 = nn.LayerNorm(filter_channels, eps=config.layer_norm_eps)
self.conv_2 = nn.Conv1d(filter_channels, filter_channels, kernel_size, padding=kernel_size // 2)
self.norm_2 = nn.LayerNorm(filter_channels, eps=config.layer_norm_eps)
self.proj = nn.Conv1d(filter_channels, 1, 1)
if config.speaker_embedding_size != 0:
self.cond = nn.Conv1d(config.speaker_embedding_size, config.hidden_size, 1)
def forward(self, inputs, padding_mask, global_conditioning=None):
inputs = torch.detach(inputs)
if global_conditioning is not None:
global_conditioning = torch.detach(global_conditioning)
inputs = inputs + self.cond(global_conditioning)
inputs = self.conv_1(inputs * padding_mask)
inputs = torch.relu(inputs)
inputs = self.norm_1(inputs.transpose(1, -1)).transpose(1, -1)
inputs = self.dropout(inputs)
inputs = self.conv_2(inputs * padding_mask)
inputs = torch.relu(inputs)
inputs = self.norm_2(inputs.transpose(1, -1)).transpose(1, -1)
inputs = self.dropout(inputs)
inputs = self.proj(inputs * padding_mask)
return inputs * padding_mask
class VitsAttention(nn.Module):
"""Multi-headed attention with relative positional representation."""
def __init__(self, config: VitsConfig):
super().__init__()
self.embed_dim = config.hidden_size
self.num_heads = config.num_attention_heads
self.dropout = config.attention_dropout
self.window_size = config.window_size
self.head_dim = self.embed_dim // self.num_heads
self.scaling = self.head_dim**-0.5
if (self.head_dim * self.num_heads) != self.embed_dim:
raise ValueError(
f"hidden_size must be divisible by num_attention_heads (got `hidden_size`: {self.embed_dim}"
f" and `num_attention_heads`: {self.num_heads})."
)
self.k_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_bias)
self.v_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_bias)
self.q_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_bias)
self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_bias)
if self.window_size:
self.emb_rel_k = nn.Parameter(torch.randn(1, self.window_size * 2 + 1, self.head_dim) * self.scaling)
self.emb_rel_v = nn.Parameter(torch.randn(1, self.window_size * 2 + 1, self.head_dim) * self.scaling)
def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
def forward(
self,
hidden_states: torch.Tensor,
key_value_states: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
layer_head_mask: Optional[torch.Tensor] = None,
output_attentions: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
"""Input shape: Batch x Time x Channel"""
# if key_value_states are provided this layer is used as a cross-attention layer
# for the decoder
bsz, tgt_len, _ = hidden_states.size()
# get query proj
query_states = self.q_proj(hidden_states) * self.scaling
# self_attention
key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
value_states = self._shape(self.v_proj(hidden_states), -1, bsz)
proj_shape = (bsz * self.num_heads, -1, self.head_dim)
query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
key_states = key_states.view(*proj_shape)
value_states = value_states.view(*proj_shape)
src_len = key_states.size(1)
attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
raise ValueError(
f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is"
f" {attn_weights.size()}"
)
if self.window_size is not None:
key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, src_len)
relative_logits = torch.matmul(query_states, key_relative_embeddings.transpose(-2, -1))
rel_pos_bias = self._relative_position_to_absolute_position(relative_logits)
attn_weights += rel_pos_bias
if attention_mask is not None:
if attention_mask.size() != (bsz, 1, tgt_len, src_len):
raise ValueError(
f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
)
attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
attn_weights = nn.functional.softmax(attn_weights, dim=-1)
if layer_head_mask is not None:
if layer_head_mask.size() != (self.num_heads,):
raise ValueError(
f"Head mask for a single layer should be of size {(self.num_heads,)}, but is"
f" {layer_head_mask.size()}"
)
attn_weights = layer_head_mask.view(1, -1, 1, 1) * attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
if output_attentions:
# this operation is a bit awkward, but it's required to
# make sure that attn_weights keeps its gradient.
# In order to do so, attn_weights have to be reshaped
# twice and have to be reused in the following
attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
else:
attn_weights_reshaped = None
attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
attn_output = torch.bmm(attn_probs, value_states)
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
raise ValueError(
f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is"
f" {attn_output.size()}"
)
if self.window_size is not None:
value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, src_len)
relative_weights = self._absolute_position_to_relative_position(attn_probs)
rel_pos_bias = torch.matmul(relative_weights, value_relative_embeddings)
attn_output += rel_pos_bias
attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
attn_output = attn_output.transpose(1, 2)
# Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be
# partitioned aross GPUs when using tensor-parallelism.
attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim)
attn_output = self.out_proj(attn_output)
return attn_output, attn_weights_reshaped
def _get_relative_embeddings(self, relative_embeddings, length):
pad_length = max(length - (self.window_size + 1), 0)
if pad_length > 0:
relative_embeddings = nn.functional.pad(relative_embeddings, [0, 0, pad_length, pad_length, 0, 0])
slice_start_position = max((self.window_size + 1) - length, 0)
slice_end_position = slice_start_position + 2 * length - 1
return relative_embeddings[:, slice_start_position:slice_end_position]
def _relative_position_to_absolute_position(self, x):
batch_heads, length, _ = x.size()
# Concat columns of pad to shift from relative to absolute indexing.
x = nn.functional.pad(x, [0, 1, 0, 0, 0, 0])
# Concat extra elements so to add up to shape (len+1, 2*len-1).
x_flat = x.view([batch_heads, length * 2 * length])
x_flat = nn.functional.pad(x_flat, [0, length - 1, 0, 0])
# Reshape and slice out the padded elements.
x_final = x_flat.view([batch_heads, length + 1, 2 * length - 1])
x_final = x_final[:, :length, length - 1 :]
return x_final
def _absolute_position_to_relative_position(self, x):
batch_heads, length, _ = x.size()
# Pad along column
x = nn.functional.pad(x, [0, length - 1, 0, 0, 0, 0])
x_flat = x.view([batch_heads, length**2 + length * (length - 1)])
# Add 0's in the beginning that will skew the elements after reshape
x_flat = nn.functional.pad(x_flat, [length, 0, 0, 0])
x_final = x_flat.view([batch_heads, length, 2 * length])[:, :, 1:]
return x_final
class VitsFeedForward(nn.Module):
def __init__(self, config):
super().__init__()
self.conv_1 = nn.Conv1d(config.hidden_size, config.ffn_dim, config.ffn_kernel_size)
self.conv_2 = nn.Conv1d(config.ffn_dim, config.hidden_size, config.ffn_kernel_size)
self.dropout = nn.Dropout(config.activation_dropout)
if isinstance(config.hidden_act, str):
self.act_fn = ACT2FN[config.hidden_act]
else:
self.act_fn = config.hidden_act
if config.ffn_kernel_size > 1:
pad_left = (config.ffn_kernel_size - 1) // 2
pad_right = config.ffn_kernel_size // 2
self.padding = [pad_left, pad_right, 0, 0, 0, 0]
else:
self.padding = None
def forward(self, hidden_states, padding_mask):
hidden_states = hidden_states.permute(0, 2, 1)
padding_mask = padding_mask.permute(0, 2, 1)
hidden_states = hidden_states * padding_mask
if self.padding is not None:
hidden_states = nn.functional.pad(hidden_states, self.padding)
hidden_states = self.conv_1(hidden_states)
hidden_states = self.act_fn(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = hidden_states * padding_mask
if self.padding is not None:
hidden_states = nn.functional.pad(hidden_states, self.padding)
hidden_states = self.conv_2(hidden_states)
hidden_states = hidden_states * padding_mask
hidden_states = hidden_states.permute(0, 2, 1)
return hidden_states
class VitsEncoderLayer(nn.Module):
def __init__(self, config: VitsConfig):
super().__init__()
self.attention = VitsAttention(config)
self.dropout = nn.Dropout(config.hidden_dropout)
self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.feed_forward = VitsFeedForward(config)
self.final_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
def forward(
self,
hidden_states: torch.Tensor,
padding_mask: torch.FloatTensor,
attention_mask: Optional[torch.Tensor] = None,
output_attentions: bool = False,
):
residual = hidden_states
hidden_states, attn_weights = self.attention(
hidden_states=hidden_states,
attention_mask=attention_mask,
output_attentions=output_attentions,
)
hidden_states = self.dropout(hidden_states)
hidden_states = self.layer_norm(residual + hidden_states)
residual = hidden_states
hidden_states = self.feed_forward(hidden_states, padding_mask)
hidden_states = self.dropout(hidden_states)
hidden_states = self.final_layer_norm(residual + hidden_states)
outputs = (hidden_states,)
if output_attentions:
outputs += (attn_weights,)
return outputs
class VitsEncoder(nn.Module):
def __init__(self, config: VitsConfig):
super().__init__()
self.config = config
self.layers = nn.ModuleList([VitsEncoderLayer(config) for _ in range(config.num_hidden_layers)])
self.gradient_checkpointing = False
self.layerdrop = config.layerdrop
def forward(
self,
hidden_states: torch.FloatTensor,
padding_mask: torch.FloatTensor,
attention_mask: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutput]:
all_hidden_states = () if output_hidden_states else None
all_self_attentions = () if output_attentions else None
# expand attention_mask
if attention_mask is not None:
# [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
attention_mask = _expand_mask(attention_mask, hidden_states.dtype)
hidden_states = hidden_states * padding_mask
deepspeed_zero3_is_enabled = is_deepspeed_zero3_enabled()
for encoder_layer in self.layers:
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
# add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
dropout_probability = np.random.uniform(0, 1)
skip_the_layer = self.training and (dropout_probability < self.layerdrop)
if not skip_the_layer or deepspeed_zero3_is_enabled:
# under deepspeed zero3 all gpus must run in sync
if self.gradient_checkpointing and self.training:
# create gradient checkpointing function
def create_custom_forward(module):
def custom_forward(*inputs):
return module(*inputs, output_attentions)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(encoder_layer),
hidden_states,
padding_mask,
attention_mask,
)
else:
layer_outputs = encoder_layer(
hidden_states,
attention_mask=attention_mask,
padding_mask=padding_mask,
output_attentions=output_attentions,
)
hidden_states = layer_outputs[0]
if skip_the_layer:
layer_outputs = (None, None)
if output_attentions:
all_self_attentions = all_self_attentions + (layer_outputs[1],)
hidden_states = hidden_states * padding_mask
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
if not return_dict:
return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
return BaseModelOutput(
last_hidden_state=hidden_states,
hidden_states=all_hidden_states,
attentions=all_self_attentions,
)
class VitsTextEncoder(nn.Module):
"""
Transformer encoder that uses relative positional representation instead of absolute positional encoding.
"""
def __init__(self, config: VitsConfig):
super().__init__()
self.config = config
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, config.pad_token_id)
self.encoder = VitsEncoder(config)
self.project = nn.Conv1d(config.hidden_size, config.flow_size * 2, kernel_size=1)
def get_input_embeddings(self):
return self.embed_tokens
def set_input_embeddings(self, value):
self.embed_tokens = value
def forward(
self,
input_ids: torch.Tensor,
padding_mask: torch.FloatTensor,
attention_mask: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = True,
) -> Union[Tuple[torch.Tensor], VitsTextEncoderOutput]:
hidden_states = self.embed_tokens(input_ids) * math.sqrt(self.config.hidden_size)
encoder_outputs = self.encoder(
hidden_states=hidden_states,
padding_mask=padding_mask,
attention_mask=attention_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
last_hidden_state = encoder_outputs[0] if not return_dict else encoder_outputs.last_hidden_state
stats = self.project(last_hidden_state.transpose(1, 2)).transpose(1, 2) * padding_mask
prior_means, prior_log_variances = torch.split(stats, self.config.flow_size, dim=2)
if not return_dict:
outputs = (last_hidden_state, prior_means, prior_log_variances) + encoder_outputs[1:]
return outputs
return VitsTextEncoderOutput(
last_hidden_state=last_hidden_state,
prior_means=prior_means,
prior_log_variances=prior_log_variances,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
)
class VitsPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = VitsConfig
base_model_prefix = "vits"
main_input_name = "input_ids"
supports_gradient_checkpointing = True
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
elif isinstance(module, nn.Conv1d):
nn.init.kaiming_normal_(module.weight)
if module.bias is not None:
k = math.sqrt(module.groups / (module.in_channels * module.kernel_size[0]))
nn.init.uniform_(module.bias, a=-k, b=k)
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
def _set_gradient_checkpointing(self, module, value=False):
if isinstance(module, (VitsTextEncoder)):
module.gradient_checkpointing = value
VITS_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.
Parameters:
config ([`VitsConfig`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
VITS_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing convolution and attention on padding token indices. Mask values selected in `[0,
1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
speaker_id (`int`, *optional*):
Which speaker embedding to use. Only used for multispeaker models.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
@add_start_docstrings(
"The complete VITS model, for text-to-speech synthesis.",
VITS_START_DOCSTRING,
)
class VitsModel(VitsPreTrainedModel):
def __init__(self, config: VitsConfig):
super().__init__(config)
self.config = config
self.text_encoder = VitsTextEncoder(config)
self.flow = VitsResidualCouplingBlock(config)
self.decoder = VitsHifiGan(config)
if config.use_stochastic_duration_prediction:
self.duration_predictor = VitsStochasticDurationPredictor(config)
else:
self.duration_predictor = VitsDurationPredictor(config)
if config.num_speakers > 1:
self.embed_speaker = nn.Embedding(config.num_speakers, config.speaker_embedding_size)
# This is used only for training.
self.posterior_encoder = VitsPosteriorEncoder(config)
# These parameters control the synthesised speech properties
self.speaking_rate = config.speaking_rate
self.noise_scale = config.noise_scale
self.noise_scale_duration = config.noise_scale_duration
# Initialize weights and apply final processing
self.post_init()
def get_encoder(self):
return self.text_encoder
@add_start_docstrings_to_model_forward(VITS_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=VitsModelOutput, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
speaker_id: Optional[int] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
labels: Optional[torch.FloatTensor] = None,
) -> Union[Tuple[Any], VitsModelOutput]:
r"""
labels (`torch.FloatTensor` of shape `(batch_size, config.spectrogram_bins, sequence_length)`, *optional*):
Float values of target spectrogram. Timesteps set to `-100.0` are ignored (masked) for the loss
computation.
Returns:
Example:
```python
>>> from transformers import VitsTokenizer, VitsModel, set_seed
>>> import torch
>>> tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
>>> model = VitsModel.from_pretrained("facebook/mms-tts-eng")
>>> inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")
>>> set_seed(555) # make deterministic
>>> with torch.no_grad():
... outputs = model(inputs["input_ids"])
>>> outputs.waveform.shape
torch.Size([1, 45824])
```
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if attention_mask is not None:
input_padding_mask = attention_mask.unsqueeze(-1).float()
else:
input_padding_mask = torch.ones_like(input_ids).unsqueeze(-1).float()
if self.config.num_speakers > 1 and speaker_id is not None:
if not 0 <= speaker_id < self.config.num_speakers:
raise ValueError(f"Set `speaker_id` in the range 0-{self.config.num_speakers - 1}.")
speaker_embeddings = self.embed_speaker(torch.tensor([speaker_id])).unsqueeze(-1)
else:
speaker_embeddings = None
if labels is not None:
raise NotImplementedError("Training of VITS is not supported yet.")
text_encoder_output = self.text_encoder(
input_ids=input_ids,
padding_mask=input_padding_mask,
attention_mask=attention_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = text_encoder_output[0] if not return_dict else text_encoder_output.last_hidden_state
hidden_states = hidden_states.transpose(1, 2)
input_padding_mask = input_padding_mask.transpose(1, 2)
prior_means = text_encoder_output[1] if not return_dict else text_encoder_output.prior_means
prior_log_variances = text_encoder_output[2] if not return_dict else text_encoder_output.prior_log_variances
if self.config.use_stochastic_duration_prediction:
log_duration = self.duration_predictor(
hidden_states,
input_padding_mask,
speaker_embeddings,
reverse=True,
noise_scale=self.noise_scale_duration,
)
else:
log_duration = self.duration_predictor(hidden_states, input_padding_mask, speaker_embeddings)
length_scale = 1.0 / self.speaking_rate
duration = torch.ceil(torch.exp(log_duration) * input_padding_mask * length_scale)
predicted_lengths = torch.clamp_min(torch.sum(duration, [1, 2]), 1).long()
# Create a padding mask for the output lengths of shape (batch, 1, max_output_length)
indices = torch.arange(predicted_lengths.max(), dtype=predicted_lengths.dtype, device=predicted_lengths.device)
output_padding_mask = indices.unsqueeze(0) < predicted_lengths.unsqueeze(1)
output_padding_mask = output_padding_mask.unsqueeze(1).to(input_padding_mask.dtype)
# Reconstruct an attention tensor of shape (batch, 1, out_length, in_length)
attn_mask = torch.unsqueeze(input_padding_mask, 2) * torch.unsqueeze(output_padding_mask, -1)
batch_size, _, output_length, input_length = attn_mask.shape
cum_duration = torch.cumsum(duration, -1).view(batch_size * input_length, 1)
indices = torch.arange(output_length, dtype=duration.dtype, device=duration.device)
valid_indices = indices.unsqueeze(0) < cum_duration
valid_indices = valid_indices.to(attn_mask.dtype).view(batch_size, input_length, output_length)
padded_indices = valid_indices - nn.functional.pad(valid_indices, [0, 0, 1, 0, 0, 0])[:, :-1]
attn = padded_indices.unsqueeze(1).transpose(2, 3) * attn_mask
# Expand prior distribution
prior_means = torch.matmul(attn.squeeze(1), prior_means).transpose(1, 2)
prior_log_variances = torch.matmul(attn.squeeze(1), prior_log_variances).transpose(1, 2)
prior_latents = prior_means + torch.randn_like(prior_means) * torch.exp(prior_log_variances) * self.noise_scale
latents = self.flow(prior_latents, output_padding_mask, speaker_embeddings, reverse=True)
spectrogram = latents * output_padding_mask
waveform = self.decoder(spectrogram, speaker_embeddings)
waveform = waveform.squeeze(1)
sequence_lengths = predicted_lengths * np.prod(self.config.upsample_rates)
if not return_dict:
outputs = (waveform, sequence_lengths, spectrogram) + text_encoder_output[3:]
return outputs
return VitsModelOutput(
waveform=waveform,
sequence_lengths=sequence_lengths,
spectrogram=spectrogram,
hidden_states=text_encoder_output.hidden_states,
attentions=text_encoder_output.attentions,
)
# coding=utf-8
# Copyright 2023 The Kakao Enterprise Authors, the MMS-TTS Authors and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization class for VITS."""
import json
import os
import re
from typing import Any, Dict, List, Optional, Tuple, Union
from ...tokenization_utils import PreTrainedTokenizer
from ...utils import is_phonemizer_available, logging
if is_phonemizer_available():
import phonemizer
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {"vocab_file": "vocab.json"}
PRETRAINED_VOCAB_FILES_MAP = {
"vocab_file": {
"facebook/mms-tts-eng": "https://huggingface.co/facebook/mms-tts-eng/resolve/main/vocab.json",
}
}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
# This model does not have a maximum input length.
"facebook/mms-tts-eng": 4096,
}
def has_non_roman_characters(input_string):
# Find any character outside the ASCII range
non_roman_pattern = re.compile(r"[^\x00-\x7F]")
# Search the input string for non-Roman characters
match = non_roman_pattern.search(input_string)
has_non_roman = match is not None
return has_non_roman
class VitsTokenizer(PreTrainedTokenizer):
"""
Construct a VITS tokenizer. Also supports MMS-TTS.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
Path to the vocabulary file.
language (`str`, *optional*):
Language identifier.
add_blank (`bool`, *optional*, defaults to `True`):
Whether to insert token id 0 in between the other tokens.
normalize (`bool`, *optional*, defaults to `True`):
Whether to normalize the input text by removing all casing and punctuation.
phonemize (`bool`, *optional*, defaults to `True`):
Whether to convert the input text into phonemes.
is_uroman (`bool`, *optional*, defaults to `False`):
Whether the `uroman` Romanizer needs to be applied to the input text prior to tokenizing.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ["input_ids", "attention_mask"]
def __init__(
self,
vocab_file,
pad_token="<pad>",
unk_token="<unk>",
language=None,
add_blank=True,
normalize=True,
phonemize=True,
is_uroman=False,
**kwargs,
) -> None:
super().__init__(
pad_token=pad_token,
unk_token=unk_token,
language=language,
add_blank=add_blank,
normalize=normalize,
phonemize=phonemize,
is_uroman=is_uroman,
**kwargs,
)
with open(vocab_file, encoding="utf-8") as vocab_handle:
self.encoder = json.load(vocab_handle)
self.decoder = {v: k for k, v in self.encoder.items()}
self.language = language
self.add_blank = add_blank
self.normalize = normalize
self.phonemize = phonemize
self.is_uroman = is_uroman
@property
def vocab_size(self):
return len(self.encoder)
def get_vocab(self):
vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
return vocab
def normalize_text(self, input_string):
"""Lowercase the input string, respecting any special token ids that may be part or entirely upper-cased."""
all_vocabulary = list(self.encoder.keys()) + list(self.added_tokens_encoder.keys())
filtered_text = ""
i = 0
while i < len(input_string):
found_match = False
for word in all_vocabulary:
if input_string[i : i + len(word)] == word:
filtered_text += word
i += len(word)
found_match = True
break
if not found_match:
filtered_text += input_string[i].lower()
i += 1
return filtered_text
def _preprocess_char(self, text):
"""Special treatment of characters in certain languages"""
if self.language == "ron":
text = text.replace("ț", "ţ")
return text
def prepare_for_tokenization(
self, text: str, is_split_into_words: bool = False, normalize: Optional[bool] = None, **kwargs
) -> Tuple[str, Dict[str, Any]]:
"""
Performs any necessary transformations before tokenization.
This method should pop the arguments from kwargs and return the remaining `kwargs` as well. We test the
`kwargs` at the end of the encoding process to be sure all the arguments have been used.
Args:
text (`str`):
The text to prepare.
is_split_into_words (`bool`, *optional*, defaults to `False`):
Whether or not the input is already pre-tokenized (e.g., split into words). If set to `True`, the
tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
which it will tokenize.
normalize (`bool`, *optional*, defaults to `None`):
Whether or not to apply punctuation and casing normalization to the text inputs. Typically, VITS is
trained on lower-cased and un-punctuated text. Hence, normalization is used to ensure that the input
text consists only of lower-case characters.
kwargs (`Dict[str, Any]`, *optional*):
Keyword arguments to use for the tokenization.
Returns:
`Tuple[str, Dict[str, Any]]`: The prepared text and the unused kwargs.
"""
normalize = normalize if normalize is not None else self.normalize
if normalize:
# normalise for casing
text = self.normalize_text(text)
filtered_text = self._preprocess_char(text)
if has_non_roman_characters(filtered_text):
logger.warning(
"Text to the tokenizer contains non-Roman characters. Ensure the `uroman` Romanizer is "
"applied to the text prior to passing it to the tokenizer. See "
"`https://github.com/isi-nlp/uroman` for details."
)
if self.phonemize:
if not is_phonemizer_available():
raise ImportError("Please install the `phonemizer` Python package to use this tokenizer.")
filtered_text = phonemizer.phonemize(
filtered_text,
language="en-us",
backend="espeak",
strip=True,
preserve_punctuation=True,
with_stress=True,
)
filtered_text = re.sub(r"\s+", " ", filtered_text)
elif normalize:
# strip any chars outside of the vocab (punctuation)
filtered_text = "".join(list(filter(lambda char: char in self.encoder, filtered_text))).strip()
return filtered_text, kwargs
def _tokenize(self, text: str) -> List[str]:
"""Tokenize a string by inserting the `<pad>` token at the boundary between adjacent characters."""
tokens = list(text)
if self.add_blank:
interspersed = [self._convert_id_to_token(0)] * (len(tokens) * 2 + 1)
interspersed[1::2] = tokens
tokens = interspersed
return tokens
def convert_tokens_to_string(self, tokens: List[str]) -> str:
if self.add_blank and len(tokens) > 1:
tokens = tokens[1::2]
return "".join(tokens)
def _convert_token_to_id(self, token):
"""Converts a token (str) in an id using the vocab."""
return self.encoder.get(token, self.encoder.get(self.unk_token))
def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
return self.decoder.get(index)
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Union[Tuple[str], None]:
if not os.path.isdir(save_directory):
logger.error(f"Vocabulary path ({save_directory}) should be a directory")
return
vocab_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
)
with open(vocab_file, "w", encoding="utf-8") as f:
f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")
return (vocab_file,)
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment