Unverified Commit f42a35e6 authored by Yoach Lacombe, committed by GitHub

Add bark (#24086)



* first raw version of the bark integration

* working code on small models with single run

* add converting script from suno weights to HF

* many changes

* correct past_kv output

* working implementation for inference

* update the converting script according to the architecture changes

* add a working end-to-end inference code

* remove some comments and make small changes

* remove unnecessary comment

* add docstrings and ensure no unnecessary intermediary output during audio generation

* remove done TODOs

* make style + add config docstrings

* modification for batch inference support on the whole model

* add details to .generate_audio method

* add copyright

* convert EncodecModel from original library to transformers implementation

* add two classes to facilitate model and sub-model loading from the hub

* add support of loading the whole model

* add BarkProcessor

* correct modeling according to processor output

* Add proper __init__ and auto support

* Add up-to-date copyright/license message

* add relative import instead of absolute

* cleaner head_dim computation

* small comment removal or changes

* more verbose LayerNorm init method

* specify eps for clearer comprehension

* more verbose variable naming in the MLP module

* remove unnecessary BarkBlock parameter

* clearer code in the forward pass of the BarkBlock

* remove _initialize_modules method for cleaner code

* Remove unnecessary methods from sub-models

* move code to remove unnecessary function

* rename a variable for clarity and change an assert

* move code and change variable name for clarity

* remove unnecessary asserts

* correct small bug

* correct a comment

* change variable names for clarity

* remove asserts

* change import from absolute to relative

* correct small error due to missing comma + correct import

* Add attribute Bark config

* add first version of tests

* update attention_map

* add tie_weights and resize_token_embeddings for the fine model

* correct getting attention_mask in generate_text_semantic

* remove Bark inference trick

* leave more choices in BarkProcessor

* remove _no_split_modules

* fix error in forward of block and introduce clearer notations

* correct converting script with last changes

* make style + add draft bark.mdx

* correct BarkModelTest::test_generate_text_semantic

* add Bark in main README

* add dummy_pt_objects for Bark

* add missing models in the main init

* correct test_decoder_model_past_with_large_inputs

* disable torchscript test

* change docstring of BarkProcessor

* Add test_processor_bark

* make style

* correct copyrights

* add bark.mdx + make style, quality and consistency

* Apply suggestions from code review
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* Remove unnecessary test method

* simplify logic of a test

* Only check first ids for slow audio generation

* split full end-to-end generation tests

* remove unnecessary comment

* change submodel names for clearer naming

* remove ModuleDict from modeling_bark

* combine two if statements

* ensure that an edge-case misuse won't happen

* modify variable name

* move code snippet to the right place (coarse instead of semantic)

* change BarkSemanticModule -> BarkSemanticModel

* align BarkProcessor with transformers paradigm

* correct BarkProcessor tests with last commit changes

* change _validate_voice_preset to an instance method instead of a class method

* tie_weights already called with post_init

* add codec_model config to configuration

* update bark modeling tests with recent BarkProcessor changes

* remove SubModelPretrainedModel + change speakers embeddings prompt type in BarkModel

* change absolute imports to relative

* remove TODO

* change docstrings

* add examples to docs and docstrings

* make style

* use BatchFeature in BarkProcessor instead of dict

* continue improving docstrings and docs + make style

* correct docstrings examples

* more comprehensible speaker_embeddings load/save

* rename speaker_embeddings_dict -> speaker_embeddings

* correct bark.mdx + add bark to documentation_tests

* correct docstrings configuration_bark

* integrate last nit suggestions

* integrate BarkGeneration configs

* make style

* remove bark tests from documentation_tests.txt because timeout - tested manually

* add proper generation config initialization

* small bark.mdx documentation changes

* rename bark.mdx -> bark.md

* add torch.no_grad behind BarkModel.generate_audio()

* replace assert by ValueError in convert_suno_to_hf.py

* integrate a series of short comments from reviewer

* move SemanticLogitsProcessors and remove .detach() from Bark docs and docstrings

* actually remove SemanticLogitsProcessor from modeling_bark.py

* BarkProcessor returns a single output instead of tuple + correct docstrings

* make style + correct bug

* add initializer_range to BarkConfig + correct slow modeling tests

* add .clone() to history_prompt.coarse_prompt to avoid modifying input array

* Making sure no extra "`" are present

* remove extra characters in modeling_bark.py

* Correct output if history_prompt is None

* remove TODOs

* remove ravel comment

* completing generation_configuration_bark.py docstrings

* change docstrings - number of audio codebooks instead of Encodec codebooks

* change 'bias' docstrings in configuration_bark.py

* format code

* rename BarkModel.generate_audio -> BarkModel.generate_speech

* modify AutoConfig instead of EncodecConfig in BarkConfig

* correct AutoConfig wrong init

* refactor BarkModel and sub-models generate_coarse, generate_fine, generate_text_semantic

* remove SemanticLogitsProcessor and replace it with SuppressTokensLogitsProcessor

* move nb_codebook related config arguments to BarkFineConfig

* rename bark.mdx -> bark.md

* correcting BarkModelConfig from_pretrained + remove keys_to_ignore

* correct bark.md with correct hub path

* correct code bug in bark.md

* correct list tokens_to_suppress

* modify Processor to load nested speaker embeddings in a safer way

* correct batch sampling in BarkFineModel.generate_fine

* Apply suggestions from code review

Small docstrings correction and code improvements
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* give more details about num_layers in docstrings

* correct indentation mistake

* correct submodelconfig order of docstring variables

* put audio models in alphabetical order in utils/check_repo.py

* remove useless line from test_modeling_bark.py

* make BarkCoarseModelTest inherit from (ModelTesterMixin, GenerationTesterMixin, unittest.TestCase) instead of BarkSemanticModelTest

* make a Tester class for each sub-model instead of inheriting

* add test_resize_embeddings=True for Bark sub-models

* add Copied from transformers.models.gpt_neo.modeling_gpt_neo.GPTNeoSelfAttention._split_heads

* remove 'Copied from Bark' comment

* remove unnecessary comment

* change np.min -> min in modeling_bark.py

* refactored all custom layers to have Bark prefix

* add attention_mask as an argument of generate_text_semantic

* refactor sub-models start docstrings to have more precise config class definition

* move _tied_weights_keys overriding

* add docstrings to generate_xxx in modeling_bark.py

* add loading whole BarkModel to convert_suno_to_hf

* refactor attribute and variable names

* make style convert_suno

* update bark checkpoints

* remove never entered if statement

* move bark_modeling docstrings after BarkPretrainedModel class definition

* refactor modeling_bark.py: kv -> key_values

* small nits - code refactoring and removing unnecessary lines from _init_weights

* nits - replace inplace method by variable assigning

* remove *optional* when necessary

* remove some lines in generate_speech

* add default value for optional parameter

* Refactor preprocess_histories_before_coarse -> preprocess_histories
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* correct usage after refactoring

* refactor Bark's generate_xxx -> generate and modify docstrings and tests accordingly

* update docstring python snippets in configuration_bark.py

* add bark files in utils/documentation_tests.txt

* correct docstrings python snippet

* add the ability to use parameters in the form of e.g. coarse_temperature

* add semantic_max_new_tokens in python snippet in docstrings for quicker generation

* Reformat sub-model kwargs in BarkModel.generate
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* correct kwargs in BarkModel.generate

* correct attention_mask kwarg in BarkModel.generate

* add tests for sub-models args in BarkModel.generate and correct BarkFineModel.test_generate_fp16

* enrich BarkModel.generate docstrings with a description of how to use the kwargs

---------
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent c21c3737
@@ -291,6 +291,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
......
@@ -268,6 +268,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
......
@@ -240,6 +240,7 @@ conda install -c huggingface transformers
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (फेसबुक) साथ थीसिस [बार्ट: प्राकृतिक भाषा निर्माण, अनुवाद के लिए अनुक्रम-से-अनुक्रम पूर्व प्रशिक्षण , और समझ] (https://arxiv.org/pdf/1910.13461.pdf) पर निर्भर माइक लुईस, यिनहान लियू, नमन गोयल, मार्जन ग़ज़विनिनेजाद, अब्देलरहमान मोहम्मद, ओमर लेवी, वेस स्टोयानोव और ल्यूक ज़ेटलमॉयर
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (से École polytechnique) साथ थीसिस [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) पर निर्भर Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis रिहाई।
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (VinAI Research से) साथ में पेपर [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701)गुयेन लुओंग ट्रान, डुओंग मिन्ह ले और डाट क्वोक गुयेन द्वारा पोस्ट किया गया।
......
@@ -302,6 +302,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (BAAI から) Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell から公開された研究論文: [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679)
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (MIT から) Yuan Gong, Yu-An Chung, James Glass から公開された研究論文: [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778)
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (Facebook から) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer から公開された研究論文: [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (École polytechnique から) Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis から公開された研究論文: [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321)
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (VinAI Research から) Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen から公開された研究論文: [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701)
......
@@ -217,6 +217,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
......
@@ -241,6 +241,7 @@ conda install -c huggingface transformers
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (来自 BAAI) 伴随论文 [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) 由 Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell 发布。
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (来自 MIT) 伴随论文 [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) 由 Yuan Gong, Yu-An Chung, James Glass 发布。
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (来自 Facebook) 伴随论文 [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) 由 Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer 发布。
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (来自 École polytechnique) 伴随论文 [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) 由 Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis 发布。
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (来自 VinAI Research) 伴随论文 [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) 由 Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen 发布。
......
@@ -253,6 +253,7 @@ conda install -c huggingface transformers
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
......
@@ -545,6 +545,8 @@
sections:
- local: model_doc/audio-spectrogram-transformer
  title: Audio Spectrogram Transformer
- local: model_doc/bark
  title: Bark
- local: model_doc/clap
  title: CLAP
- local: model_doc/encodec
......
@@ -57,6 +57,7 @@ The documentation is organized into five sections:
1. **[AltCLIP](model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
@@ -282,6 +283,7 @@ Flax), PyTorch, and/or TensorFlow.
| AltCLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
| Audio Spectrogram Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| Autoformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| Bark | ❌ | ❌ | ✅ | ❌ | ❌ |
| BART | ✅ | ✅ | ✅ | ✅ | ✅ |
| BEiT | ❌ | ❌ | ✅ | ❌ | ✅ |
| BERT | ✅ | ✅ | ✅ | ✅ | ✅ |
......
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Bark
## Overview
Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).
Bark is made of 4 main models:
- [`BarkSemanticModel`] (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.
- [`BarkCoarseModel`] (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer that takes the output of [`BarkSemanticModel`] as input. It predicts the first two audio codebooks needed by EnCodec.
- [`BarkFineModel`] (the 'fine acoustics' model), this time a non-causal autoencoder transformer, which iteratively predicts the remaining codebooks based on the sum of the previous codebook embeddings.
- having predicted all the codebook channels, Bark uses the [`EncodecModel`] to decode them into the output audio array.

Note that each of the first three modules can accept conditional speaker embeddings to condition the output sound on a specific predefined voice.
### Tips:
Suno offers a library of voice presets in a number of languages [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c).
These presets are also available on the Hub [here](https://huggingface.co/suno/bark-small/tree/main/speaker_embeddings) or [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings).
```python
>>> from transformers import AutoProcessor, BarkModel
>>> processor = AutoProcessor.from_pretrained("suno/bark")
>>> model = BarkModel.from_pretrained("suno/bark")
>>> voice_preset = "v2/en_speaker_6"
>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
Bark can generate highly realistic, **multilingual** speech as well as other audio, including music, background noise and simple sound effects.
```python
>>> # Multilingual speech - simplified Chinese
>>> inputs = processor("惊人的!我会说中文")
>>> # Multilingual speech - French - let's use a voice_preset as well
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")
>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
>>> inputs = processor("♪ Hello, my dog is cute ♪")
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
The model can also produce **nonverbal communications** like laughing, sighing and crying.
```python
>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
To save the audio, take the sample rate from the model config and use a scipy utility:
```python
>>> from scipy.io.wavfile import write as write_wav
>>> # save audio to disk, but first take the sample rate from the model config
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
```
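Alternatively, in a notebook you can listen to the result directly; this uses the generic `IPython.display.Audio` helper rather than anything Bark-specific:

```python
>>> from IPython.display import Audio

>>> # play the generated waveform inline at the model's sample rate
>>> Audio(audio_array, rate=sample_rate)
```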
This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe) and [Sanchit Gandhi (sanchit-gandhi)](https://github.com/sanchit-gandhi).
The original code can be found [here](https://github.com/suno-ai/bark).
## BarkConfig
[[autodoc]] BarkConfig
- all
## BarkProcessor
[[autodoc]] BarkProcessor
- all
- __call__
## BarkModel
[[autodoc]] BarkModel
- generate
## BarkSemanticModel
[[autodoc]] BarkSemanticModel
- forward
## BarkCoarseModel
[[autodoc]] BarkCoarseModel
- forward
## BarkFineModel
[[autodoc]] BarkFineModel
- forward
## BarkCausalModel
[[autodoc]] BarkCausalModel
- forward
## BarkCoarseConfig
[[autodoc]] BarkCoarseConfig
- all
## BarkFineConfig
[[autodoc]] BarkFineConfig
- all
## BarkSemanticConfig
[[autodoc]] BarkSemanticConfig
- all
@@ -160,6 +160,13 @@ _import_structure = {
"AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
"AutoformerConfig",
],
"models.bark": [
"BarkCoarseConfig",
"BarkConfig",
"BarkFineConfig",
"BarkProcessor",
"BarkSemanticConfig",
],
"models.bart": ["BartConfig", "BartTokenizer"],
"models.barthez": [],
"models.bartpho": [],
@@ -1136,6 +1143,17 @@ else:
"AutoformerPreTrainedModel",
]
)
_import_structure["models.bark"].extend(
[
"BARK_PRETRAINED_MODEL_ARCHIVE_LIST",
"BarkCausalModel",
"BarkCoarseModel",
"BarkFineModel",
"BarkModel",
"BarkPreTrainedModel",
"BarkSemanticModel",
]
)
_import_structure["models.bart"].extend(
[
"BART_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -4098,6 +4116,13 @@ if TYPE_CHECKING:
AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
AutoformerConfig,
)
from .models.bark import (
BarkCoarseConfig,
BarkConfig,
BarkFineConfig,
BarkProcessor,
BarkSemanticConfig,
)
from .models.bart import BartConfig, BartTokenizer
from .models.beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig
from .models.bert import (
@@ -4978,6 +5003,15 @@ if TYPE_CHECKING:
AutoformerModel,
AutoformerPreTrainedModel,
)
from .models.bark import (
BARK_PRETRAINED_MODEL_ARCHIVE_LIST,
BarkCausalModel,
BarkCoarseModel,
BarkFineModel,
BarkModel,
BarkPreTrainedModel,
BarkSemanticModel,
)
from .models.bart import (
BART_PRETRAINED_MODEL_ARCHIVE_LIST,
BartForCausalLM,
......
@@ -1148,3 +1148,39 @@ class ClassifierFreeGuidanceLogitsProcessor(LogitsProcessor):
cond_logits, uncond_logits = scores.split(unguided_bsz, dim=0)
scores = uncond_logits + (cond_logits - uncond_logits) * self.guidance_scale
return scores
class AlternatingCodebooksLogitsProcessor(LogitsProcessor):
r"""
[`LogitsProcessor`] enforcing alternated generation between the two codebooks of [`Bark`]'s fine submodel.
Args:
input_start_len (`int`):
The length of the initial input sequence.
semantic_vocab_size (`int`):
Vocabulary size of the semantic part, i.e. the number of tokens associated with the semantic vocabulary.
codebook_size (`int`):
Number of tokens associated with each codebook.
"""
def __init__(self, input_start_len: int, semantic_vocab_size: int, codebook_size: int):
if not isinstance(input_start_len, int) or input_start_len < 0:
raise ValueError(f"`input_start_len` has to be a non-negative integer, but is {input_start_len}")
self.input_start_len = input_start_len
self.semantic_vocab_size = semantic_vocab_size
self.codebook_size = codebook_size
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
curr_len = input_ids.shape[-1]
# even -> first codebook, odd -> second codebook
is_first_codebook = ((curr_len - self.input_start_len) % 2) == 0
if is_first_codebook:
scores[:, : self.semantic_vocab_size] = -float("inf")
scores[:, self.semantic_vocab_size + self.codebook_size :] = -float("inf")
else:
scores[:, : self.semantic_vocab_size + self.codebook_size] = -float("inf")
return scores
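A quick way to sanity-check the alternation logic above is to run the processor on dummy scores. This is a minimal sketch with made-up sizes; the import path simply follows this PR's file layout:

```python
import torch

from transformers.generation.logits_process import AlternatingCodebooksLogitsProcessor

semantic_vocab_size, codebook_size = 4, 3  # illustrative sizes only
processor = AlternatingCodebooksLogitsProcessor(
    input_start_len=2, semantic_vocab_size=semantic_vocab_size, codebook_size=codebook_size
)

# current length == input_start_len, so the offset is even -> first codebook allowed
input_ids = torch.zeros((1, 2), dtype=torch.long)
scores = torch.zeros((1, semantic_vocab_size + 2 * codebook_size))
masked = processor(input_ids, scores)

# only indices [4, 7) (the first codebook) remain finite
print(torch.isfinite(masked))
```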
@@ -19,6 +19,7 @@ from . import (
audio_spectrogram_transformer,
auto,
autoformer,
bark,
bart,
barthez,
bartpho,
......
@@ -35,6 +35,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("altclip", "AltCLIPConfig"),
("audio-spectrogram-transformer", "ASTConfig"),
("autoformer", "AutoformerConfig"),
("bark", "BarkConfig"),
("bart", "BartConfig"),
("beit", "BeitConfig"),
("bert", "BertConfig"),
@@ -237,6 +238,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("altclip", "ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("audio-spectrogram-transformer", "AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("autoformer", "AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("bark", "BARK_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("bart", "BART_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("bert", "BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -419,6 +421,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("altclip", "AltCLIP"),
("audio-spectrogram-transformer", "Audio Spectrogram Transformer"),
("autoformer", "Autoformer"),
("bark", "Bark"),
("bart", "BART"),
("barthez", "BARThez"),
("bartpho", "BARTpho"),
......
@@ -33,6 +33,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("altclip", "AltCLIPModel"),
("audio-spectrogram-transformer", "ASTModel"),
("autoformer", "AutoformerModel"),
("bark", "BarkModel"),
("bart", "BartModel"),
("beit", "BeitModel"),
("bert", "BertModel"),
......
@@ -44,6 +44,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
[
("align", "AlignProcessor"),
("altclip", "AltCLIPProcessor"),
("bark", "BarkProcessor"),
("blip", "BlipProcessor"),
("blip-2", "Blip2Processor"),
("bridgetower", "BridgeTowerProcessor"),
......
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import (
OptionalDependencyNotAvailable,
_LazyModule,
is_torch_available,
)
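# The configuration and processor entries below are always importable; the PyTorch
# modeling classes are only registered when torch is available (see the try/except below).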
_import_structure = {
"configuration_bark": [
"BARK_PRETRAINED_CONFIG_ARCHIVE_MAP",
"BarkCoarseConfig",
"BarkConfig",
"BarkFineConfig",
"BarkSemanticConfig",
],
"processing_bark": ["BarkProcessor"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_bark"] = [
"BARK_PRETRAINED_MODEL_ARCHIVE_LIST",
"BarkFineModel",
"BarkSemanticModel",
"BarkCoarseModel",
"BarkModel",
"BarkPreTrainedModel",
"BarkCausalModel",
]
if TYPE_CHECKING:
from .configuration_bark import (
BARK_PRETRAINED_CONFIG_ARCHIVE_MAP,
BarkCoarseConfig,
BarkConfig,
BarkFineConfig,
BarkSemanticConfig,
)
from .processing_bark import BarkProcessor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_bark import (
BARK_PRETRAINED_MODEL_ARCHIVE_LIST,
BarkCausalModel,
BarkCoarseModel,
BarkFineModel,
BarkModel,
BarkPreTrainedModel,
BarkSemanticModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# coding=utf-8
# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BARK model configuration"""
import copy
import os
from typing import Dict, Optional, Union
from ...configuration_utils import PretrainedConfig
from ...utils import add_start_docstrings, logging
from ..auto import AutoConfig
logger = logging.get_logger(__name__)
BARK_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"suno/bark-small": "https://huggingface.co/suno/bark-small/resolve/main/config.json",
"suno/bark": "https://huggingface.co/suno/bark/resolve/main/config.json",
}
BARK_SUBMODELCONFIG_START_DOCSTRING = """
This is the configuration class to store the configuration of a [`{model}`]. It is used to instantiate the model
according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the Bark [suno/bark](https://huggingface.co/suno/bark)
architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
block_size (`int`, *optional*, defaults to 1024):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
input_vocab_size (`int`, *optional*, defaults to 10_048):
Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the
`input_ids` passed when calling [`{model}`]. Defaults to 10_048 but should be chosen carefully with regard
to the selected sub-model.
output_vocab_size (`int`, *optional*, defaults to 10_048):
Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented
by the `output_ids` when passing forward a [`{model}`]. Defaults to 10_048 but should be chosen carefully
with regard to the selected sub-model.
num_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the given sub-model.
num_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer architecture.
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the "intermediate" (often named feed-forward) layer in the architecture.
dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for all fully connected layers in the model.
bias (`bool`, *optional*, defaults to `True`):
Whether or not to use bias in the linear layers and layer norm layers.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
"""
class BarkSubModelConfig(PretrainedConfig):
model_type = "bark_module"
keys_to_ignore_at_inference = ["past_key_values"]
attribute_map = {
"num_attention_heads": "num_heads",
"num_hidden_layers": "num_layers",
"vocab_size": "input_vocab_size",
"window_size": "block_size",
}
def __init__(
self,
block_size=1024,
input_vocab_size=10_048,
output_vocab_size=10_048,
num_layers=12,
num_heads=12,
hidden_size=768,
dropout=0.0,
bias=True, # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
initializer_range=0.02,
use_cache=True,
**kwargs,
):
self.block_size = block_size
self.input_vocab_size = input_vocab_size
self.output_vocab_size = output_vocab_size
self.num_layers = num_layers
self.num_heads = num_heads
self.hidden_size = hidden_size
self.dropout = dropout
self.bias = bias
self.use_cache = use_cache
self.initializer_range = initializer_range
super().__init__(**kwargs)
@classmethod
def from_pretrained(
cls,
pretrained_model_name_or_path: Union[str, os.PathLike],
cache_dir: Optional[Union[str, os.PathLike]] = None,
force_download: bool = False,
local_files_only: bool = False,
token: Optional[Union[str, bool]] = None,
revision: str = "main",
**kwargs,
) -> "PretrainedConfig":
kwargs["cache_dir"] = cache_dir
kwargs["force_download"] = force_download
kwargs["local_files_only"] = local_files_only
kwargs["revision"] = revision
cls._set_token_in_kwargs(kwargs, token)
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the config dict if we are loading from Bark
if config_dict.get("model_type") == "bark":
config_dict = config_dict[f"{cls.model_type}_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
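# Usage sketch for the override above: when pointed at a composite "bark"
# checkpoint, `from_pretrained` extracts the nested f"{cls.model_type}_config"
# entry, so e.g. `BarkSemanticConfig.from_pretrained("suno/bark")` would read
# the "semantic_config" sub-dict rather than the top-level Bark config.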
@add_start_docstrings(
BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkSemanticConfig", model="BarkSemanticModel"),
"""
Example:
```python
>>> from transformers import BarkSemanticConfig, BarkSemanticModel
>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkSemanticConfig()
>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkSemanticModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```""",
)
class BarkSemanticConfig(BarkSubModelConfig):
model_type = "semantic"
@add_start_docstrings(
BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkCoarseConfig", model="BarkCoarseModel"),
"""
Example:
```python
>>> from transformers import BarkCoarseConfig, BarkCoarseModel
>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkCoarseConfig()
>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkCoarseModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```""",
)
class BarkCoarseConfig(BarkSubModelConfig):
model_type = "coarse_acoustics"
@add_start_docstrings(
BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkFineConfig", model="BarkFineModel"),
"""
n_codes_total (`int`, *optional*, defaults to 8):
The total number of audio codebooks predicted. Used in the fine acoustics sub-model.
n_codes_given (`int`, *optional*, defaults to 1):
The number of audio codebooks predicted in the coarse acoustics sub-model. Used in the acoustics
sub-models.
Example:
```python
>>> from transformers import BarkFineConfig, BarkFineModel
>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkFineConfig()
>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkFineModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```""",
)
class BarkFineConfig(BarkSubModelConfig):
model_type = "fine_acoustics"
def __init__(self, tie_word_embeddings=True, n_codes_total=8, n_codes_given=1, **kwargs):
self.n_codes_total = n_codes_total
self.n_codes_given = n_codes_given
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
class BarkConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a [`BarkModel`]. It is used to instantiate a Bark
model according to the specified sub-models configurations, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark
[suno/bark](https://huggingface.co/suno/bark) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
semantic_config ([`BarkSemanticConfig`], *optional*):
Configuration of the underlying semantic sub-model.
coarse_acoustics_config ([`BarkCoarseConfig`], *optional*):
Configuration of the underlying coarse acoustics sub-model.
fine_acoustics_config ([`BarkFineConfig`], *optional*):
Configuration of the underlying fine acoustics sub-model.
codec_config ([`AutoConfig`], *optional*):
Configuration of the underlying codec sub-model.
Example:
```python
>>> from transformers import (
... BarkSemanticConfig,
... BarkCoarseConfig,
... BarkFineConfig,
... BarkModel,
... BarkConfig,
... AutoConfig,
... )
>>> # Initializing Bark sub-modules configurations.
>>> semantic_config = BarkSemanticConfig()
>>> coarse_acoustics_config = BarkCoarseConfig()
>>> fine_acoustics_config = BarkFineConfig()
>>> codec_config = AutoConfig.from_pretrained("facebook/encodec_24khz")
>>> # Initializing a Bark module style configuration
>>> configuration = BarkConfig.from_sub_model_configs(
... semantic_config, coarse_acoustics_config, fine_acoustics_config, codec_config
... )
>>> # Initializing a model (with random weights)
>>> model = BarkModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
"""
model_type = "bark"
is_composition = True
def __init__(
self,
semantic_config: Dict = None,
coarse_acoustics_config: Dict = None,
fine_acoustics_config: Dict = None,
codec_config: Dict = None,
initializer_range=0.02,
**kwargs,
):
if semantic_config is None:
semantic_config = {}
logger.info("semantic_config is None. initializing the semantic model with default values.")
if coarse_acoustics_config is None:
coarse_acoustics_config = {}
logger.info("coarse_acoustics_config is None. initializing the coarse model with default values.")
if fine_acoustics_config is None:
fine_acoustics_config = {}
logger.info("fine_acoustics_config is None. initializing the fine model with default values.")
if codec_config is None:
codec_config = {}
logger.info("codec_config is None. initializing the codec model with default values.")
self.semantic_config = BarkSemanticConfig(**semantic_config)
self.coarse_acoustics_config = BarkCoarseConfig(**coarse_acoustics_config)
self.fine_acoustics_config = BarkFineConfig(**fine_acoustics_config)
self.codec_config = AutoConfig.for_model(**codec_config)
self.initializer_range = initializer_range
super().__init__(**kwargs)
@classmethod
def from_sub_model_configs(
cls,
semantic_config: BarkSemanticConfig,
coarse_acoustics_config: BarkCoarseConfig,
fine_acoustics_config: BarkFineConfig,
codec_config: AutoConfig,
**kwargs,
):
r"""
Instantiate a [`BarkConfig`] (or a derived class) from Bark sub-model configurations.
Returns:
[`BarkConfig`]: An instance of a configuration object
"""
return cls(
semantic_config=semantic_config.to_dict(),
coarse_acoustics_config=coarse_acoustics_config.to_dict(),
fine_acoustics_config=fine_acoustics_config.to_dict(),
codec_config=codec_config.to_dict(),
**kwargs,
)
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Overrides the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
"""
output = copy.deepcopy(self.__dict__)
output["semantic_config"] = self.semantic_config.to_dict()
output["coarse_acoustics_config"] = self.coarse_acoustics_config.to_dict()
output["fine_acoustics_config"] = self.fine_acoustics_config.to_dict()
output["codec_config"] = self.codec_config.to_dict()
output["model_type"] = self.__class__.model_type
return output
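Because `to_dict` re-nests the sub-configs, the composite config round-trips through a plain dictionary. A quick sketch (passing an explicit codec config, since the codec sub-config needs a `model_type`):

```python
from transformers import AutoConfig, BarkConfig

codec = AutoConfig.for_model("encodec")  # default Encodec configuration
config = BarkConfig(codec_config=codec.to_dict())
rebuilt = BarkConfig(**config.to_dict())  # nested dicts rebuild each sub-config
assert rebuilt.semantic_config.num_layers == config.semantic_config.num_layers
```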
"""Convert Bark checkpoint."""
import argparse
import os
from pathlib import Path
import torch
from bark.generation import _load_model as _bark_load_model
from huggingface_hub import hf_hub_download
from transformers import EncodecConfig, EncodecModel, set_seed
from transformers.models.bark.configuration_bark import (
BarkCoarseConfig,
BarkConfig,
BarkFineConfig,
BarkSemanticConfig,
)
from transformers.models.bark.generation_configuration_bark import (
BarkCoarseGenerationConfig,
BarkFineGenerationConfig,
BarkGenerationConfig,
BarkSemanticGenerationConfig,
)
from transformers.models.bark.modeling_bark import BarkCoarseModel, BarkFineModel, BarkModel, BarkSemanticModel
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
set_seed(770)
new_layer_name_dict = {
"c_attn": "att_proj",
"c_proj": "out_proj",
"c_fc": "in_proj",
"transformer.": "",
"h.": "layers.",
"ln_1": "layernorm_1",
"ln_2": "layernorm_2",
"ln_f": "layernorm_final",
"wpe": "position_embeds_layer",
"wte": "input_embeds_layer",
}
REMOTE_MODEL_PATHS = {
"text_small": {
"repo_id": "suno/bark",
"file_name": "text.pt",
},
"coarse_small": {
"repo_id": "suno/bark",
"file_name": "coarse.pt",
},
"fine_small": {
"repo_id": "suno/bark",
"file_name": "fine.pt",
},
"text": {
"repo_id": "suno/bark",
"file_name": "text_2.pt",
},
"coarse": {
"repo_id": "suno/bark",
"file_name": "coarse_2.pt",
},
"fine": {
"repo_id": "suno/bark",
"file_name": "fine_2.pt",
},
}
CUR_PATH = os.path.dirname(os.path.abspath(__file__))
default_cache_dir = os.path.join(os.path.expanduser("~"), ".cache")
CACHE_DIR = os.path.join(os.getenv("XDG_CACHE_HOME", default_cache_dir), "suno", "bark_v0")
def _get_ckpt_path(model_type, use_small=False):
key = model_type
if use_small:
key += "_small"
return os.path.join(CACHE_DIR, REMOTE_MODEL_PATHS[key]["file_name"])
def _download(from_hf_path, file_name):
os.makedirs(CACHE_DIR, exist_ok=True)
hf_hub_download(repo_id=from_hf_path, filename=file_name, local_dir=CACHE_DIR)
def _load_model(ckpt_path, device, use_small=False, model_type="text"):
if model_type == "text":
ModelClass = BarkSemanticModel
ConfigClass = BarkSemanticConfig
GenerationConfigClass = BarkSemanticGenerationConfig
elif model_type == "coarse":
ModelClass = BarkCoarseModel
ConfigClass = BarkCoarseConfig
GenerationConfigClass = BarkCoarseGenerationConfig
elif model_type == "fine":
ModelClass = BarkFineModel
ConfigClass = BarkFineConfig
GenerationConfigClass = BarkFineGenerationConfig
else:
raise NotImplementedError()
model_key = f"{model_type}_small" if use_small else model_type
model_info = REMOTE_MODEL_PATHS[model_key]
if not os.path.exists(ckpt_path):
logger.info(f"{model_type} model not found, downloading into `{CACHE_DIR}`.")
_download(model_info["repo_id"], model_info["file_name"])
checkpoint = torch.load(ckpt_path, map_location=device)
# hack kept from the original suno loading code: checkpoints that lack
# `input_vocab_size` store a single `vocab_size` used for both input and output
model_args = checkpoint["model_args"]
if "input_vocab_size" not in model_args:
model_args["input_vocab_size"] = model_args["vocab_size"]
model_args["output_vocab_size"] = model_args["vocab_size"]
del model_args["vocab_size"]
# convert Bark model arguments to HF Bark model arguments
model_args["num_heads"] = model_args.pop("n_head")
model_args["hidden_size"] = model_args.pop("n_embd")
model_args["num_layers"] = model_args.pop("n_layer")
model_config = ConfigClass(**model_args)  # model_args was remapped in place above
model = ModelClass(config=model_config)
model_generation_config = GenerationConfigClass()
model.generation_config = model_generation_config
state_dict = checkpoint["model"]
# fixup checkpoint
unwanted_prefix = "_orig_mod."
for k, v in list(state_dict.items()):
if k.startswith(unwanted_prefix):
# replace part of the key with corresponding layer name in HF implementation
new_k = k[len(unwanted_prefix) :]
for old_layer_name in new_layer_name_dict:
new_k = new_k.replace(old_layer_name, new_layer_name_dict[old_layer_name])
state_dict[new_k] = state_dict.pop(k)
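# e.g. the original key "_orig_mod.transformer.h.0.ln_1.weight" ends up as
# "layers.0.layernorm_1.weight" after the prefix strip and layer-name remapping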
extra_keys = set(state_dict.keys()) - set(model.state_dict().keys())
extra_keys = {k for k in extra_keys if not k.endswith(".attn.bias")}
missing_keys = set(model.state_dict().keys()) - set(state_dict.keys())
missing_keys = {k for k in missing_keys if not k.endswith(".attn.bias")}
if len(extra_keys) != 0:
raise ValueError(f"extra keys found: {extra_keys}")
if len(missing_keys) != 0:
raise ValueError(f"missing keys: {missing_keys}")
model.load_state_dict(state_dict, strict=False)
n_params = model.num_parameters(exclude_embeddings=True)
val_loss = checkpoint["best_val_loss"].item()
logger.info(f"model loaded: {round(n_params/1e6,1)}M params, {round(val_loss,3)} loss")
model.eval()
model.to(device)
del checkpoint, state_dict
return model
def load_model(pytorch_dump_folder_path, use_small=False, model_type="text"):
if model_type not in ("text", "coarse", "fine"):
raise NotImplementedError()
device = "cpu" # do conversion on cpu
ckpt_path = _get_ckpt_path(model_type, use_small=use_small)
model = _load_model(ckpt_path, device, model_type=model_type, use_small=use_small)
# load bark initial model
bark_model = _bark_load_model(ckpt_path, "cpu", model_type=model_type, use_small=use_small)
if model_type == "text":
bark_model = bark_model["model"]
if model.num_parameters(exclude_embeddings=True) != bark_model.get_num_params():
raise ValueError("initial and new models don't have the same number of parameters")
# check if same output as the bark model
batch_size = 5
sequence_length = 10
if model_type in ["text", "coarse"]:
vec = torch.randint(256, (batch_size, sequence_length), dtype=torch.int)
output_old_model = bark_model(vec)[0]
output_new_model_total = model(vec)
# take last logits
output_new_model = output_new_model_total.logits[:, [-1], :]
else:
prediction_codebook_channel = 3
n_codes_total = 8
vec = torch.randint(256, (batch_size, sequence_length, n_codes_total), dtype=torch.int)
output_new_model_total = model(prediction_codebook_channel, vec)
output_old_model = bark_model(prediction_codebook_channel, vec)
output_new_model = output_new_model_total.logits
# any remaining output difference should come from the difference in self-attention implementation design
if output_new_model.shape != output_old_model.shape:
raise ValueError("initial and new outputs don't have the same shape")
if (output_new_model - output_old_model).abs().max().item() > 1e-3:
raise ValueError("initial and new outputs are not equal")
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
model.save_pretrained(pytorch_dump_folder_path)
def load_whole_bark_model(
semantic_path,
coarse_path,
fine_path,
append_text,
hub_path,
folder_path,
):
pytorch_dump_folder_path = os.path.join(folder_path, append_text)
semantic_config = BarkSemanticConfig.from_pretrained(os.path.join(semantic_path, "config.json"))
coarse_acoustic_config = BarkCoarseConfig.from_pretrained(os.path.join(coarse_path, "config.json"))
fine_acoustic_config = BarkFineConfig.from_pretrained(os.path.join(fine_path, "config.json"))
codec_config = EncodecConfig.from_pretrained("facebook/encodec_24khz")
semantic = BarkSemanticModel.from_pretrained(semantic_path)
coarse_acoustic = BarkCoarseModel.from_pretrained(coarse_path)
fine_acoustic = BarkFineModel.from_pretrained(fine_path)
codec = EncodecModel.from_pretrained("facebook/encodec_24khz")
bark_config = BarkConfig.from_sub_model_configs(
semantic_config, coarse_acoustic_config, fine_acoustic_config, codec_config
)
bark_generation_config = BarkGenerationConfig.from_sub_model_configs(
semantic.generation_config, coarse_acoustic.generation_config, fine_acoustic.generation_config
)
bark = BarkModel(bark_config)
bark.semantic = semantic
bark.coarse_acoustics = coarse_acoustic
bark.fine_acoustics = fine_acoustic
bark.codec_model = codec
bark.generation_config = bark_generation_config
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
bark.save_pretrained(pytorch_dump_folder_path, repo_id=hub_path, push_to_hub=True)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument("model_type", type=str, help="text, coarse or fine.")
parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
parser.add_argument("--is_small", action="store_true", help="convert the small version instead of the large.")
args = parser.parse_args()
load_model(args.pytorch_dump_folder_path, model_type=args.model_type, use_small=args.is_small)
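For reference, a hedged invocation of this script (assuming it is saved as `convert_suno_to_hf.py`): `python convert_suno_to_hf.py text ./bark_text_hf --is_small` downloads the small text checkpoint into the suno cache if needed, checks output equivalence against the original implementation, and writes the converted `BarkSemanticModel` to `./bark_text_hf`.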
# coding=utf-8
# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BARK model generation configuration"""
import copy
from typing import Dict
from ...generation.configuration_utils import GenerationConfig
from ...utils import logging
logger = logging.get_logger(__name__)
class BarkSemanticGenerationConfig(GenerationConfig):
model_type = "semantic"
def __init__(
self,
eos_token_id=10_000,
renormalize_logits=True,
max_new_tokens=768,
output_scores=False,
return_dict_in_generate=False,
output_hidden_states=False,
output_attentions=False,
temperature=0.7,
do_sample=True,
text_encoding_offset=10_048,
text_pad_token=129_595,
semantic_infer_token=129_599,
semantic_vocab_size=10_000,
max_input_semantic_length=256,
semantic_rate_hz=49.9,
**kwargs,
):
"""Class that holds a generation configuration for [`BarkSemanticModel`].
This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
documentation from [`GenerationConfig`] for more information.
Args:
eos_token_id (`int`, *optional*, defaults to 10_000):
The id of the *end-of-sequence* token.
renormalize_logits (`bool`, *optional*, defaults to `True`):
Whether to renormalize the logits after applying all the logits processors or warpers (including the
custom ones). It's highly recommended to set this flag to `True` as the search algorithms assume the
score logits are normalized, but some logit processors or warpers break the normalization.
max_new_tokens (`int`, *optional*, defaults to 768):
The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
output_scores (`bool`, *optional*, defaults to `False`):
Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
return_dict_in_generate (`bool`, *optional*, defaults to `False`):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
output_hidden_states (`bool`, *optional*, defaults to `False`):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
for more details.
output_attentions (`bool`, *optional*, defaults to `False`):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more details.
temperature (`float`, *optional*, defaults to 0.7):
The value used to modulate the next token probabilities.
do_sample (`bool`, *optional*, defaults to `True`):
Whether or not to use sampling; use greedy decoding otherwise.
text_encoding_offset (`int`, *optional*, defaults to 10_048):
Text encoding offset.
text_pad_token (`int`, *optional*, defaults to 129_595):
Text pad token.
semantic_infer_token (`int`, *optional*, defaults to 129_599):
Semantic infer token.
semantic_vocab_size (`int`, *optional*, defaults to 10_000):
Semantic vocab size.
max_input_semantic_length (`int`, *optional*, defaults to 256):
Max length of semantic input vector.
semantic_rate_hz (`float`, *optional*, defaults to 49.9):
Semantic rate in Hertz.
"""
super().__init__(
temperature=temperature,
do_sample=do_sample,
eos_token_id=eos_token_id,
renormalize_logits=renormalize_logits,
max_new_tokens=max_new_tokens,
output_scores=output_scores,
return_dict_in_generate=return_dict_in_generate,
output_hidden_states=output_hidden_states,
output_attentions=output_attentions,
**kwargs,
)
self.text_encoding_offset = text_encoding_offset
self.text_pad_token = text_pad_token
self.semantic_pad_token = eos_token_id
self.semantic_infer_token = semantic_infer_token
self.semantic_vocab_size = semantic_vocab_size
self.max_input_semantic_length = max_input_semantic_length
self.semantic_rate_hz = semantic_rate_hz
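Each stage's generation config is an ordinary `GenerationConfig` subclass, so stage-specific sampling can be tuned at construction time. A small sketch, using the same deep import path as the conversion script above:

```python
from transformers.models.bark.generation_configuration_bark import (
    BarkSemanticGenerationConfig,
)

# tighten the semantic stage's sampling while keeping the other defaults
gen_cfg = BarkSemanticGenerationConfig(temperature=0.6, max_new_tokens=512)
print(gen_cfg.semantic_pad_token)  # 10000: mirrors eos_token_id, as set in __init__ above
```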
class BarkCoarseGenerationConfig(GenerationConfig):
model_type = "coarse_acoustics"
def __init__(
self,
renormalize_logits=True,
output_scores=False,
return_dict_in_generate=False,
output_hidden_states=False,
output_attentions=False,
temperature=0.7,
do_sample=True,
coarse_semantic_pad_token=12_048,
coarse_rate_hz=75,
n_coarse_codebooks=2,
coarse_infer_token=12_050,
max_coarse_input_length=256,
max_coarse_history: int = 630,
sliding_window_len: int = 60,
**kwargs,
):
"""Class that holds a generation configuration for [`BarkCoarseModel`].
This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
documentation from [`GenerationConfig`] for more information.
Args:
renormalize_logits (`bool`, *optional*, defaults to `True`):
Whether to renormalize the logits after applying all the logits processors or warpers (including the
custom ones). It's highly recommended to set this flag to `True` as the search algorithms assume the
score logits are normalized, but some logit processors or warpers break the normalization.
output_scores (`bool`, *optional*, defaults to `False`):
Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
return_dict_in_generate (`bool`, *optional*, defaults to `False`):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
output_hidden_states (`bool`, *optional*, defaults to `False`):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
for more details.
output_attentions (`bool`, *optional*, defaults to `False`):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more details.
temperature (`float`, *optional*, defaults to 0.7):
The value used to modulate the next token probabilities.
do_sample (`bool`, *optional*, defaults to `True`):
Whether or not to use sampling; use greedy decoding otherwise.
coarse_semantic_pad_token (`int`, *optional*, defaults to 12_048):
Coarse semantic pad token.
coarse_rate_hz (`int`, *optional*, defaults to 75):
Coarse rate in Hertz.
n_coarse_codebooks (`int`, *optional*, defaults to 2):
Number of coarse codebooks.
coarse_infer_token (`int`, *optional*, defaults to 12_050):
Coarse infer token.
max_coarse_input_length (`int`, *optional*, defaults to 256):
Max length of input coarse vector.
max_coarse_history (`int`, *optional*, defaults to 630):
Max length of the output of the coarse acoustics model used in the fine generation step.
sliding_window_len (`int`, *optional*, defaults to 60):
Length of the sliding window that the coarse generation step uses to generate raw audio.
"""
super().__init__(
temperature=temperature,
do_sample=do_sample,
renormalize_logits=renormalize_logits,
output_scores=output_scores,
return_dict_in_generate=return_dict_in_generate,
output_hidden_states=output_hidden_states,
output_attentions=output_attentions,
**kwargs,
)
self.coarse_semantic_pad_token = coarse_semantic_pad_token
self.coarse_rate_hz = coarse_rate_hz
self.n_coarse_codebooks = n_coarse_codebooks
self.coarse_infer_token = coarse_infer_token
self.max_coarse_input_length = max_coarse_input_length
self.max_coarse_history = max_coarse_history
self.sliding_window_len = sliding_window_len
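A back-of-envelope sketch of how these defaults fit together: the coarse stage emits `n_coarse_codebooks` tokens per frame at `coarse_rate_hz` frames per second, while the semantic stage produces `semantic_rate_hz` tokens per second, so the generation loop has to align roughly three coarse tokens with each semantic token (the exact bookkeeping lives in the modeling code, which is not shown here):

```python
# defaults taken from the two generation configs above
semantic_rate_hz = 49.9
coarse_rate_hz = 75
n_coarse_codebooks = 2

semantic_to_coarse_ratio = coarse_rate_hz / semantic_rate_hz * n_coarse_codebooks
print(round(semantic_to_coarse_ratio, 3))  # ~3.006 coarse tokens per semantic token
```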
class BarkFineGenerationConfig(GenerationConfig):
model_type = "fine_acoustics"
def __init__(
self,
temperature=0.5,
max_fine_history_length=512,
max_fine_input_length=1024,
n_fine_codebooks=8,
**kwargs,
):
"""Class that holds a generation configuration for [`BarkFineModel`].
[`BarkFineModel`] is an autoencoder model, so it should not usually be used for generation. However, under the
hood, it uses `temperature` when called by [`BarkModel`].
This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
documentation from [`GenerationConfig`] for more information.
Args:
temperature (`float`, *optional*, defaults to 0.5):
The value used to modulate the next token probabilities.
max_fine_history_length (`int`, *optional*, defaults to 512):
Max length of the fine history vector.
max_fine_input_length (`int`, *optional*, defaults to 1024):
Max length of fine input vector.
n_fine_codebooks (`int`, *optional*, defaults to 8):
Number of codebooks used.
"""
super().__init__(temperature=temperature)
self.max_fine_history_length = max_fine_history_length
self.max_fine_input_length = max_fine_input_length
self.n_fine_codebooks = n_fine_codebooks
class BarkGenerationConfig(GenerationConfig):
model_type = "bark"
is_composition = True
# TODO (joao): nested from_dict
def __init__(
self,
semantic_config: Dict = None,
coarse_acoustics_config: Dict = None,
fine_acoustics_config: Dict = None,
sample_rate=24_000,
codebook_size=1024,
**kwargs,
):
"""Class that holds a generation configuration for [`BarkModel`].
The [`BarkModel`] does not have a `generate` method, but uses this class to generate speech with a nested
[`BarkGenerationConfig`] which uses [`BarkSemanticGenerationConfig`], [`BarkCoarseGenerationConfig`],
[`BarkFineGenerationConfig`].
This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
documentation from [`GenerationConfig`] for more information.
Args:
semantic_config (`Dict`, *optional*):
Semantic generation configuration.
coarse_acoustics_config (`Dict`, *optional*):
Coarse generation configuration.
fine_acoustics_config (`Dict`, *optional*):
Fine generation configuration.
sample_rate (`int`, *optional*, defaults to 24_000):
Sample rate of the generated audio, in Hertz.
codebook_size (`int`, *optional*, defaults to 1024):
Size (number of discrete codes) of each codec codebook.
"""
if semantic_config is None:
semantic_config = {}
logger.info("semantic_config is None. initializing the semantic model with default values.")
if coarse_acoustics_config is None:
coarse_acoustics_config = {}
logger.info("coarse_acoustics_config is None. initializing the coarse model with default values.")
if fine_acoustics_config is None:
fine_acoustics_config = {}
logger.info("fine_acoustics_config is None. initializing the fine model with default values.")
self.semantic_config = BarkSemanticGenerationConfig(**semantic_config)
self.coarse_acoustics_config = BarkCoarseGenerationConfig(**coarse_acoustics_config)
self.fine_acoustics_config = BarkFineGenerationConfig(**fine_acoustics_config)
self.sample_rate = sample_rate
self.codebook_size = codebook_size
@classmethod
def from_sub_model_configs(
cls,
semantic_config: BarkSemanticGenerationConfig,
coarse_acoustics_config: BarkCoarseGenerationConfig,
fine_acoustics_config: BarkFineGenerationConfig,
**kwargs,
):
r"""
Instantiate a [`BarkGenerationConfig`] (or a derived class) from Bark sub-model generation configurations.
Returns:
[`BarkGenerationConfig`]: An instance of a configuration object
"""
return cls(
semantic_config=semantic_config.to_dict(),
coarse_acoustics_config=coarse_acoustics_config.to_dict(),
fine_acoustics_config=fine_acoustics_config.to_dict(),
**kwargs,
)
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Overrides the default [`~GenerationConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
"""
output = copy.deepcopy(self.__dict__)
output["semantic_config"] = self.semantic_config.to_dict()
output["coarse_acoustics_config"] = self.coarse_acoustics_config.to_dict()
output["fine_acoustics_config"] = self.fine_acoustics_config.to_dict()
output["model_type"] = self.__class__.model_type
return output
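Mirroring the composite-config pattern of `BarkConfig`, the nested generation config can be assembled from the per-stage defaults exactly the way the conversion script does:

```python
from transformers.models.bark.generation_configuration_bark import (
    BarkCoarseGenerationConfig,
    BarkFineGenerationConfig,
    BarkGenerationConfig,
    BarkSemanticGenerationConfig,
)

gen_config = BarkGenerationConfig.from_sub_model_configs(
    BarkSemanticGenerationConfig(),
    BarkCoarseGenerationConfig(),
    BarkFineGenerationConfig(),
)
print(gen_config.sample_rate)  # 24_000 by default
```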