Unverified Commit 0c3fdccf authored by Matthijs Hollemans's avatar Matthijs Hollemans Committed by GitHub
Browse files

[WIP] add EnCodec model (#23655)



* boilerplate stuff

* messing around with the feature extractor

* fix feature extractor

* unit tests for feature extractor

* rename speech to audio

* quick-and-dirty import of Meta's code

* import weights (sort of)

* cleaning up

* more cleaning up

* move encoder/decoder args into config

* cleanup model

* rename EnCodec -> Encodec

* RVQ parameters in config

* add slow test

* add lstm init and test_init

* Add save & load

* finish EncodecModel

* remove decoder_input_values as they are ont used anywhere (not removed from doc yet)

* fix test feature extraction model name

* Add better slow test

* Fix tests

* some fixup and cleaning

* Improve further

* cleaning up quantizer

* fix up conversion script

* test don't pass, _encode_fram does not work

* update tests with output per encode and decode

* more cleanup

* rename _codebook

* remove old config cruft

* ratios & hop_length

* use ModuleList instead of Sequential

* clean up resnet block

* update types

* update tests

* fixup

* quick cleanup

* fix padding

* more styl,ing

* add patrick feedback

* fix copies

* fixup

* fix lstm

* fix shape issues

* fixup

* rename conv layers

* fixup

* fix decoding

* small conv refactoring

* remove norm_params

* simplify conv layers

* rename conv layers

* stuff

* Clean up

* Add padding logic

use padding mask

small conv refactoring

remove norm_params

simplify conv layers

rename conv layers

stuff

add batched test

update

Clean up

merge and update for padding

fix padding

fixup

* clean up more

* clean up more

* More clean ups

* cleanup convolutions

* typo

* fix typos

* fixup

* build PR doc?

* start refactoring docstring

* fix don't pad when no strid and chunk

* update docstring

* update docstring

* nits

* update going to lunch

* update config and model

* fix broken testse (becaue of the config changes)

* fix scale computation

* fixu[

* only return dict if speciefied or if config returns it

* remove todos

* update defaults in config

* update conversion script

* fix doctest

* more docstring + fixup

* nits on batched_tests

* more nits

* Apply suggestions from code review
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>

* update basxed on review

* fix update

* updaet tests

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fixup

* add overlap and chunl_length_s

* cleanup feature extraction

* teste edge cases truncation and padding

* correct processor values

* update config encodec, nits

* fix tests

* fixup

* fix 24Hz test

* elle tests are green

* fix fixup

* Apply suggestions from code review

* revert readme changes

* fixup

* add example

* use facebook checkpoints

* fix typo

* no pipeline tests

* use slef.pad everywhere we can

* Apply suggestions from code review
Co-authored-by: default avataramyeroberts <22614925+amyeroberts@users.noreply.github.com>

* update based on review

* update

* update mdx

* fix bug and tests

* fixup

* fix doctest

* remove comment

* more nits

* add more coverage for `test_truncation_and_padding`

* fixup

* add last test

* fix text

* nits

* Update tests/models/encodec/test_modeling_encodec.py
Co-authored-by: default avataramyeroberts <22614925+amyeroberts@users.noreply.github.com>

* take care of the last comments

* typo

* fix test

* nits

* fixup

* Update src/transformers/models/encodec/feature_extraction_encodec.py
Co-authored-by: default avataramyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: default avatararthur.zucker@gmail.com <arthur.zucker@gmail.com>
Co-authored-by: default avatarArthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avataramyeroberts <22614925+amyeroberts@users.noreply.github.com>
parent 26a2ec56
...@@ -346,6 +346,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h ...@@ -346,6 +346,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren.
1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le.
1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
1. **[EnCodec](https://huggingface.co/docs/transformers/main/model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi.
1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu.
1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. 1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.
......
...@@ -321,6 +321,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt ...@@ -321,6 +321,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren.
1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le.
1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
1. **[EnCodec](https://huggingface.co/docs/transformers/main/model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi.
1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu.
1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. 1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.
......
...@@ -293,6 +293,7 @@ conda install -c huggingface transformers ...@@ -293,6 +293,7 @@ conda install -c huggingface transformers
1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren.
1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le.
1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (Google रिसर्च/स्टैनफोर्ड यूनिवर्सिटी से) साथ में दिया गया पेपर [इलेक्ट्रा: जेनरेटर के बजाय भेदभाव करने वाले के रूप में टेक्स्ट एन्कोडर्स का पूर्व-प्रशिक्षण] (https://arxiv.org/abs/2003.10555) केविन क्लार्क, मिन्ह-थांग लुओंग, क्वोक वी. ले, क्रिस्टोफर डी. मैनिंग द्वारा पोस्ट किया गया। 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (Google रिसर्च/स्टैनफोर्ड यूनिवर्सिटी से) साथ में दिया गया पेपर [इलेक्ट्रा: जेनरेटर के बजाय भेदभाव करने वाले के रूप में टेक्स्ट एन्कोडर्स का पूर्व-प्रशिक्षण] (https://arxiv.org/abs/2003.10555) केविन क्लार्क, मिन्ह-थांग लुओंग, क्वोक वी. ले, क्रिस्टोफर डी. मैनिंग द्वारा पोस्ट किया गया।
1. **[EnCodec](https://huggingface.co/docs/transformers/main/model_doc/encodec)** (Meta AI से) Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. द्वाराअनुसंधान पत्र [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) के साथ जारी किया गया
1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (Google रिसर्च से) साथ में दिया गया पेपर [सीक्वेंस जेनरेशन टास्क के लिए प्री-ट्रेंड चेकपॉइंट का इस्तेमाल करना](https:/ /arxiv.org/abs/1907.12461) साशा रोठे, शशि नारायण, अलियाक्सि सेवेरिन द्वारा। 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (Google रिसर्च से) साथ में दिया गया पेपर [सीक्वेंस जेनरेशन टास्क के लिए प्री-ट्रेंड चेकपॉइंट का इस्तेमाल करना](https:/ /arxiv.org/abs/1907.12461) साशा रोठे, शशि नारायण, अलियाक्सि सेवेरिन द्वारा।
1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)**(Baidu से) साथ देने वाला पेपर [ERNIE: एन्हांस्ड रिप्रेजेंटेशन थ्रू नॉलेज इंटीग्रेशन](https://arxiv.org/abs/1904.09223) यू सन, शुओहुआन वांग, युकुन ली, शिकुन फेंग, ज़ुई चेन, हान झांग, शिन तियान, डैनक्सियांग झू, हाओ तियान, हुआ वू द्वारा पोस्ट किया गया। 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)**(Baidu से) साथ देने वाला पेपर [ERNIE: एन्हांस्ड रिप्रेजेंटेशन थ्रू नॉलेज इंटीग्रेशन](https://arxiv.org/abs/1904.09223) यू सन, शुओहुआन वांग, युकुन ली, शिकुन फेंग, ज़ुई चेन, हान झांग, शिन तियान, डैनक्सियांग झू, हाओ तियान, हुआ वू द्वारा पोस्ट किया गया।
1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (Baidu से) Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. द्वाराअनुसंधान पत्र [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) के साथ जारी किया गया 1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (Baidu से) Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. द्वाराअनुसंधान पत्र [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) के साथ जारी किया गया
......
...@@ -355,6 +355,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ ...@@ -355,6 +355,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (Snap Research から) Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. から公開された研究論文 [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (Snap Research から) Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. から公開された研究論文 [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191)
1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le.
1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (Google Research/Stanford University から) Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning から公開された研究論文: [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (Google Research/Stanford University から) Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning から公開された研究論文: [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555)
1. **[EnCodec](https://huggingface.co/docs/transformers/main/model_doc/encodec)** (Meta AI から) Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. から公開された研究論文 [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (Google Research から) Sascha Rothe, Shashi Narayan, Aliaksei Severyn から公開された研究論文: [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (Google Research から) Sascha Rothe, Shashi Narayan, Aliaksei Severyn から公開された研究論文: [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461)
1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (Baidu から) Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu から公開された研究論文: [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (Baidu から) Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu から公開された研究論文: [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223)
1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (Baidu から) Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. から公開された研究論文 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) 1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (Baidu から) Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. から公開された研究論文 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674)
......
...@@ -270,6 +270,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 ...@@ -270,6 +270,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren.
1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le.
1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (Google Research/Stanford University 에서) Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning 의 [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) 논문과 함께 발표했습니다. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (Google Research/Stanford University 에서) Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning 의 [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) 논문과 함께 발표했습니다.
1. **[EnCodec](https://huggingface.co/docs/transformers/main/model_doc/encodec)** (Meta AI 에서 제공)은 Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi.의 [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)논문과 함께 발표했습니다.
1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (Google Research 에서) Sascha Rothe, Shashi Narayan, Aliaksei Severyn 의 [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 논문과 함께 발표했습니다. 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (Google Research 에서) Sascha Rothe, Shashi Narayan, Aliaksei Severyn 의 [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 논문과 함께 발표했습니다.
1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (Baidu 에서) Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu 의 [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) 논문과 함께 발표했습니다. 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (Baidu 에서) Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu 의 [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) 논문과 함께 발표했습니다.
1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (Baidu 에서 제공)은 Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.의 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674)논문과 함께 발표했습니다. 1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (Baidu 에서 제공)은 Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.의 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674)논문과 함께 발표했습니다.
......
...@@ -294,6 +294,7 @@ conda install -c huggingface transformers ...@@ -294,6 +294,7 @@ conda install -c huggingface transformers
1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (来自 Snap Research) 伴随论文 [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) 由 Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren 发布。 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (来自 Snap Research) 伴随论文 [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) 由 Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren 发布。
1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le.
1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (来自 Google Research/Stanford University) 伴随论文 [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) 由 Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning 发布。 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (来自 Google Research/Stanford University) 伴随论文 [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) 由 Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning 发布。
1. **[EnCodec](https://huggingface.co/docs/transformers/main/model_doc/encodec)** (来自 Meta AI) 伴随论文 [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) 由 Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi 发布。
1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (来自 Google Research) 伴随论文 [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 由 Sascha Rothe, Shashi Narayan, Aliaksei Severyn 发布。 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (来自 Google Research) 伴随论文 [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 由 Sascha Rothe, Shashi Narayan, Aliaksei Severyn 发布。
1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (来自 Baidu) 伴随论文 [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu 发布。 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (来自 Baidu) 伴随论文 [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu 发布。
1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (来自 Baidu) 伴随论文 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) 由 Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang 发布。 1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (来自 Baidu) 伴随论文 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) 由 Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang 发布。
......
...@@ -306,6 +306,7 @@ conda install -c huggingface transformers ...@@ -306,6 +306,7 @@ conda install -c huggingface transformers
1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren.
1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le.
1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
1. **[EnCodec](https://huggingface.co/docs/transformers/main/model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi.
1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu.
1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. 1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.
......
...@@ -541,6 +541,8 @@ ...@@ -541,6 +541,8 @@
title: Audio Spectrogram Transformer title: Audio Spectrogram Transformer
- local: model_doc/clap - local: model_doc/clap
title: CLAP title: CLAP
- local: model_doc/encodec
title: EnCodec
- local: model_doc/hubert - local: model_doc/hubert
title: Hubert title: Hubert
- local: model_doc/mctct - local: model_doc/mctct
......
...@@ -107,6 +107,7 @@ The documentation is organized into five sections: ...@@ -107,6 +107,7 @@ The documentation is organized into five sections:
1. **[EfficientFormer](model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. 1. **[EfficientFormer](model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren.
1. **[EfficientNet](model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[EfficientNet](model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le.
1. **[ELECTRA](model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 1. **[ELECTRA](model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
1. **[EnCodec](model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi.
1. **[EncoderDecoder](model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[EncoderDecoder](model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[ERNIE](model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. 1. **[ERNIE](model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu.
1. **[ErnieM](model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. 1. **[ErnieM](model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.
...@@ -318,6 +319,7 @@ Flax), PyTorch, and/or TensorFlow. ...@@ -318,6 +319,7 @@ Flax), PyTorch, and/or TensorFlow.
| EfficientFormer | ❌ | ❌ | ✅ | ✅ | ❌ | | EfficientFormer | ❌ | ❌ | ✅ | ✅ | ❌ |
| EfficientNet | ❌ | ❌ | ✅ | ❌ | ❌ | | EfficientNet | ❌ | ❌ | ✅ | ❌ | ❌ |
| ELECTRA | ✅ | ✅ | ✅ | ✅ | ✅ | | ELECTRA | ✅ | ✅ | ✅ | ✅ | ✅ |
| EnCodec | ❌ | ❌ | ✅ | ❌ | ❌ |
| Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ | | Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ |
| ERNIE | ❌ | ❌ | ✅ | ❌ | ❌ | | ERNIE | ❌ | ❌ | ✅ | ❌ | ❌ |
| ErnieM | ✅ | ❌ | ✅ | ❌ | ❌ | | ErnieM | ✅ | ❌ | ✅ | ❌ | ❌ |
......
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# EnCodec
## Overview
The EnCodec neural codec model was proposed in [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi.
The abstract from the paper is the following:
*We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio.*
This model was contributed by [Matthijs](https://huggingface.co/Matthijs), [Patrick Von Platen](https://huggingface.co/patrickvonplaten) and [Arthur Zucker](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/facebookresearch/encodec).
Here is a quick example of how to encode and decode an audio using this model:
```python
>>> from datasets import load_dataset, Audio
>>> from transformers import EncodecModel, AutoProcessor
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> model = EncodecModel.from_pretrained("facebook/encodec_24khz")
>>> processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
>>> audio_sample = librispeech_dummy[-1]["audio"]["array"]
>>> inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")
>>> encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
>>> audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0]
>>> # or the equivalent with a forward pass
>>> audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
```
## EncodecConfig
[[autodoc]] EncodecConfig
## EncodecFeatureExtractor
[[autodoc]] EncodecFeatureExtractor
- __call__
## EncodecModel
[[autodoc]] EncodecModel
- decode
- encode
- forward
...@@ -281,6 +281,10 @@ _import_structure = { ...@@ -281,6 +281,10 @@ _import_structure = {
"models.efficientformer": ["EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "EfficientFormerConfig"], "models.efficientformer": ["EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "EfficientFormerConfig"],
"models.efficientnet": ["EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "EfficientNetConfig"], "models.efficientnet": ["EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "EfficientNetConfig"],
"models.electra": ["ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP", "ElectraConfig", "ElectraTokenizer"], "models.electra": ["ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP", "ElectraConfig", "ElectraTokenizer"],
"models.encodec": [
"ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP",
"EncodecConfig",
],
"models.encoder_decoder": ["EncoderDecoderConfig"], "models.encoder_decoder": ["EncoderDecoderConfig"],
"models.ernie": [ "models.ernie": [
"ERNIE_PRETRAINED_CONFIG_ARCHIVE_MAP", "ERNIE_PRETRAINED_CONFIG_ARCHIVE_MAP",
...@@ -825,6 +829,7 @@ except OptionalDependencyNotAvailable: ...@@ -825,6 +829,7 @@ except OptionalDependencyNotAvailable:
] ]
else: else:
_import_structure["models.audio_spectrogram_transformer"].append("ASTFeatureExtractor") _import_structure["models.audio_spectrogram_transformer"].append("ASTFeatureExtractor")
_import_structure["models.encodec"].append("EncodecFeatureExtractor")
_import_structure["models.mctct"].append("MCTCTFeatureExtractor") _import_structure["models.mctct"].append("MCTCTFeatureExtractor")
_import_structure["models.speech_to_text"].append("Speech2TextFeatureExtractor") _import_structure["models.speech_to_text"].append("Speech2TextFeatureExtractor")
_import_structure["models.speecht5"].append("SpeechT5FeatureExtractor") _import_structure["models.speecht5"].append("SpeechT5FeatureExtractor")
...@@ -1568,6 +1573,13 @@ else: ...@@ -1568,6 +1573,13 @@ else:
"load_tf_weights_in_electra", "load_tf_weights_in_electra",
] ]
) )
_import_structure["models.encodec"].extend(
[
"ENCODEC_PRETRAINED_MODEL_ARCHIVE_LIST",
"EncodecModel",
"EncodecPreTrainedModel",
]
)
_import_structure["models.encoder_decoder"].append("EncoderDecoderModel") _import_structure["models.encoder_decoder"].append("EncoderDecoderModel")
_import_structure["models.ernie"].extend( _import_structure["models.ernie"].extend(
[ [
...@@ -4100,6 +4112,10 @@ if TYPE_CHECKING: ...@@ -4100,6 +4112,10 @@ if TYPE_CHECKING:
from .models.efficientformer import EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, EfficientFormerConfig from .models.efficientformer import EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, EfficientFormerConfig
from .models.efficientnet import EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP, EfficientNetConfig from .models.efficientnet import EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP, EfficientNetConfig
from .models.electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig, ElectraTokenizer from .models.electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig, ElectraTokenizer
from .models.encodec import (
ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP,
EncodecConfig,
)
from .models.encoder_decoder import EncoderDecoderConfig from .models.encoder_decoder import EncoderDecoderConfig
from .models.ernie import ERNIE_PRETRAINED_CONFIG_ARCHIVE_MAP, ErnieConfig from .models.ernie import ERNIE_PRETRAINED_CONFIG_ARCHIVE_MAP, ErnieConfig
from .models.ernie_m import ERNIE_M_PRETRAINED_CONFIG_ARCHIVE_MAP, ErnieMConfig from .models.ernie_m import ERNIE_M_PRETRAINED_CONFIG_ARCHIVE_MAP, ErnieMConfig
...@@ -4598,6 +4614,7 @@ if TYPE_CHECKING: ...@@ -4598,6 +4614,7 @@ if TYPE_CHECKING:
from .utils.dummy_speech_objects import * from .utils.dummy_speech_objects import *
else: else:
from .models.audio_spectrogram_transformer import ASTFeatureExtractor from .models.audio_spectrogram_transformer import ASTFeatureExtractor
from .models.encodec import EncodecFeatureExtractor
from .models.mctct import MCTCTFeatureExtractor from .models.mctct import MCTCTFeatureExtractor
from .models.speech_to_text import Speech2TextFeatureExtractor from .models.speech_to_text import Speech2TextFeatureExtractor
from .models.speecht5 import SpeechT5FeatureExtractor from .models.speecht5 import SpeechT5FeatureExtractor
...@@ -5210,6 +5227,11 @@ if TYPE_CHECKING: ...@@ -5210,6 +5227,11 @@ if TYPE_CHECKING:
ElectraPreTrainedModel, ElectraPreTrainedModel,
load_tf_weights_in_electra, load_tf_weights_in_electra,
) )
from .models.encodec import (
ENCODEC_PRETRAINED_MODEL_ARCHIVE_LIST,
EncodecModel,
EncodecPreTrainedModel,
)
from .models.encoder_decoder import EncoderDecoderModel from .models.encoder_decoder import EncoderDecoderModel
from .models.ernie import ( from .models.ernie import (
ERNIE_PRETRAINED_MODEL_ARCHIVE_LIST, ERNIE_PRETRAINED_MODEL_ARCHIVE_LIST,
......
...@@ -72,6 +72,7 @@ from . import ( ...@@ -72,6 +72,7 @@ from . import (
efficientformer, efficientformer,
efficientnet, efficientnet,
electra, electra,
encodec,
encoder_decoder, encoder_decoder,
ernie, ernie,
ernie_m, ernie_m,
......
...@@ -80,6 +80,7 @@ CONFIG_MAPPING_NAMES = OrderedDict( ...@@ -80,6 +80,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("efficientformer", "EfficientFormerConfig"), ("efficientformer", "EfficientFormerConfig"),
("efficientnet", "EfficientNetConfig"), ("efficientnet", "EfficientNetConfig"),
("electra", "ElectraConfig"), ("electra", "ElectraConfig"),
("encodec", "EncodecConfig"),
("encoder-decoder", "EncoderDecoderConfig"), ("encoder-decoder", "EncoderDecoderConfig"),
("ernie", "ErnieConfig"), ("ernie", "ErnieConfig"),
("ernie_m", "ErnieMConfig"), ("ernie_m", "ErnieMConfig"),
...@@ -273,6 +274,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict( ...@@ -273,6 +274,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("efficientformer", "EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("efficientformer", "EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("efficientnet", "EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("efficientnet", "EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("electra", "ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("electra", "ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("encodec", "ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("ernie", "ERNIE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("ernie", "ERNIE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("ernie_m", "ERNIE_M_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("ernie_m", "ERNIE_M_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("esm", "ESM_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("esm", "ESM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
...@@ -461,6 +463,7 @@ MODEL_NAMES_MAPPING = OrderedDict( ...@@ -461,6 +463,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("efficientformer", "EfficientFormer"), ("efficientformer", "EfficientFormer"),
("efficientnet", "EfficientNet"), ("efficientnet", "EfficientNet"),
("electra", "ELECTRA"), ("electra", "ELECTRA"),
("encodec", "EnCodec"),
("encoder-decoder", "Encoder decoder"), ("encoder-decoder", "Encoder decoder"),
("ernie", "ERNIE"), ("ernie", "ERNIE"),
("ernie_m", "ErnieM"), ("ernie_m", "ErnieM"),
......
...@@ -54,6 +54,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict( ...@@ -54,6 +54,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
("dinat", "ViTFeatureExtractor"), ("dinat", "ViTFeatureExtractor"),
("donut-swin", "DonutFeatureExtractor"), ("donut-swin", "DonutFeatureExtractor"),
("dpt", "DPTFeatureExtractor"), ("dpt", "DPTFeatureExtractor"),
("encodec", "EncodecFeatureExtractor"),
("flava", "FlavaFeatureExtractor"), ("flava", "FlavaFeatureExtractor"),
("glpn", "GLPNFeatureExtractor"), ("glpn", "GLPNFeatureExtractor"),
("groupvit", "CLIPFeatureExtractor"), ("groupvit", "CLIPFeatureExtractor"),
......
...@@ -79,6 +79,7 @@ MODEL_MAPPING_NAMES = OrderedDict( ...@@ -79,6 +79,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("efficientformer", "EfficientFormerModel"), ("efficientformer", "EfficientFormerModel"),
("efficientnet", "EfficientNetModel"), ("efficientnet", "EfficientNetModel"),
("electra", "ElectraModel"), ("electra", "ElectraModel"),
("encodec", "EncodecModel"),
("ernie", "ErnieModel"), ("ernie", "ErnieModel"),
("ernie_m", "ErnieMModel"), ("ernie_m", "ErnieMModel"),
("esm", "EsmModel"), ("esm", "EsmModel"),
......
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import (
OptionalDependencyNotAvailable,
_LazyModule,
is_torch_available,
)
_import_structure = {
"configuration_encodec": [
"ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP",
"EncodecConfig",
],
"feature_extraction_encodec": ["EncodecFeatureExtractor"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_encodec"] = [
"ENCODEC_PRETRAINED_MODEL_ARCHIVE_LIST",
"EncodecModel",
"EncodecPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_encodec import (
ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP,
EncodecConfig,
)
from .feature_extraction_encodec import EncodecFeatureExtractor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_encodec import (
ENCODEC_PRETRAINED_MODEL_ARCHIVE_LIST,
EncodecModel,
EncodecPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# coding=utf-8
# Copyright 2023 Meta Platforms, Inc. and affiliates, and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" EnCodec model configuration"""
import math
from typing import Optional
import numpy as np
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"facebook/encodec_24khz": "https://huggingface.co/facebook/encodec_24khz/resolve/main/config.json",
"facebook/encodec_48khz": "https://huggingface.co/facebook/encodec_48khz/resolve/main/config.json",
}
class EncodecConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of an [`EncodecModel`]. It is used to instantiate a
Encodec model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the
[facebook/encodec_24khz](https://huggingface.co/facebook/encodec_24khz) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
target_bandwidths (`List[float]`, *optional*, defaults to `[1.5, 3.0, 6.0, 12.0, 24.0]`):
The range of diffent bandwiths the model can encode audio with.
sampling_rate (`int`, *optional*, defaults to 24000):
The sampling rate at which the audio waveform should be digitalized expressed in hertz (Hz).
audio_channels (`int`, *optional*, defaults to 1):
Number of channels in the audio data. Either 1 for mono or 2 for stereo.
normalize (`bool`, *optional*, defaults to `False`):
Whether the audio shall be normalized when passed.
chunk_length_s (`float`, *optional*):
If defined the audio is pre-processed into chunks of lengths `chunk_length_s` and then encoded.
overlap (`float`, *optional*):
Defines the overlap between each chunk. It is used to compute the `chunk_stride` using the following
formulae : `int((1.0 - self.overlap) * self.chunk_length)`.
hidden_size (`int`, *optional*, defaults to 128):
Intermediate representation dimension.
num_filters (`int`, *optional*, defaults to 32):
Number of convolution kernels of first `EncodecConv1d` down sampling layer.
num_residual_layers (`int`, *optional*, defaults to 1):
Number of residual layers.
upsampling_ratios (`Sequence[int]` , *optional*, defaults to `[8, 5, 4, 2]`):
Kernel size and stride ratios. The encoder uses downsampling ratios instead of upsampling ratios, hence it
will use the ratios in the reverse order to the ones specified here that must match the decoder order.
norm_type (`str`, *optional*, defaults to `"weight_norm"`):
Normalization method. Should be in `["weight_norm", "time_group_norm"]`
kernel_size (`int`, *optional*, defaults to 7):
Kernel size for the initial convolution.
last_kernel_size (`int`, *optional*, defaults to 7):
Kernel size for the last convolution layer.
residual_kernel_size (`int`, *optional*, defaults to 3):
Kernel size for the residual layers.
dilation_growth_rate (`int`, *optional*, defaults to 2):
How much to increase the dilation with each layer.
use_causal_conv (`bool`, *optional*, defaults to `True`):
Whether to use fully causal convolution.
pad_mode (`str`, *optional*, defaults to `"reflect"`):
Padding mode for the convolutions.
compress (`int`, *optional*, defaults to 2):
Reduced dimensionality in residual branches (from Demucs v3).
num_lstm_layers (`int`, *optional*, defaults to 2):
Number of LSTM layers at the end of the encoder.
trim_right_ratio (`float`, *optional*, defaults to 1.0):
Ratio for trimming at the right of the transposed convolution under the `use_causal_conv = True` setup. If
equal to 1.0, it means that all the trimming is done at the right.
codebook_size (`int`, *optional*, defaults to 1024):
Number of discret codes that make up VQVAE.
codebook_dim (`int`, *optional*):
Dimension of the codebook vectors. If not defined, uses `hidden_size`.
Example:
```python
>>> from transformers import EncodecModel, EncodecConfig
>>> # Initializing a "facebook/encodec_24khz" style configuration
>>> configuration = EncodecConfig()
>>> # Initializing a model (with random weights) from the "facebook/encodec_24khz" style configuration
>>> model = EncodecModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "encodec"
def __init__(
self,
target_bandwidths=[1.5, 3.0, 6.0, 12.0, 24.0],
sampling_rate=24_000,
audio_channels=1,
normalize=False,
chunk_length_s=None,
overlap=None,
hidden_size=128,
num_filters=32,
num_residual_layers=1,
upsampling_ratios=[8, 5, 4, 2],
norm_type="weight_norm",
kernel_size=7,
last_kernel_size=7,
residual_kernel_size=3,
dilation_growth_rate=2,
use_causal_conv=True,
pad_mode="reflect",
compress=2,
num_lstm_layers=2,
trim_right_ratio=1.0,
codebook_size=1024,
codebook_dim=None,
**kwargs,
):
self.target_bandwidths = target_bandwidths
self.sampling_rate = sampling_rate
self.audio_channels = audio_channels
self.normalize = normalize
self.chunk_length_s = chunk_length_s
self.overlap = overlap
self.hidden_size = hidden_size
self.num_filters = num_filters
self.num_residual_layers = num_residual_layers
self.upsampling_ratios = upsampling_ratios
self.norm_type = norm_type
self.kernel_size = kernel_size
self.last_kernel_size = last_kernel_size
self.residual_kernel_size = residual_kernel_size
self.dilation_growth_rate = dilation_growth_rate
self.use_causal_conv = use_causal_conv
self.pad_mode = pad_mode
self.compress = compress
self.num_lstm_layers = num_lstm_layers
self.trim_right_ratio = trim_right_ratio
self.codebook_size = codebook_size
self.codebook_dim = codebook_dim if codebook_dim is not None else hidden_size
if self.norm_type not in ["weight_norm", "time_group_norm"]:
raise ValueError(
f'self.norm_type must be one of `"weight_norm"`, `"time_group_norm"`), got {self.norm_type}'
)
super().__init__(**kwargs)
# This is a property because you might want to change the chunk_length_s on the fly
@property
def chunk_length(self) -> Optional[int]:
if self.chunk_length_s is None:
return None
else:
return int(self.chunk_length_s * self.sampling_rate)
# This is a property because you might want to change the chunk_length_s on the fly
@property
def chunk_stride(self) -> Optional[int]:
if self.chunk_length_s is None or self.overlap is None:
return None
else:
return max(1, int((1.0 - self.overlap) * self.chunk_length))
@property
def frame_rate(self) -> int:
hop_length = np.prod(self.upsampling_ratios)
return math.ceil(self.sampling_rate / hop_length)
@property
def num_quantizers(self) -> int:
return int(1000 * self.target_bandwidths[-1] // (self.frame_rate * 10))
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert EnCodec checkpoints."""
import argparse
import torch
from transformers import (
EncodecConfig,
EncodecFeatureExtractor,
EncodecModel,
logging,
)
# checkpoints downloaded from:
# https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th
# https://dl.fbaipublicfiles.com/encodec/v0/encodec_48khz-7e698e3e.th
logging.set_verbosity_info()
logger = logging.get_logger("transformers.models.encodec")
MAPPING_QUANTIZER = {
"quantizer.vq.layers.*._codebook.inited": "quantizer.layers.*.codebook.inited",
"quantizer.vq.layers.*._codebook.cluster_size": "quantizer.layers.*.codebook.cluster_size",
"quantizer.vq.layers.*._codebook.embed": "quantizer.layers.*.codebook.embed",
"quantizer.vq.layers.*._codebook.embed_avg": "quantizer.layers.*.codebook.embed_avg",
}
MAPPING_ENCODER = {
"encoder.model.0.conv.conv": "encoder.layers.0.conv",
"encoder.model.1.block.1.conv.conv": "encoder.layers.1.block.1.conv",
"encoder.model.1.block.3.conv.conv": "encoder.layers.1.block.3.conv",
"encoder.model.1.shortcut.conv.conv": "encoder.layers.1.shortcut.conv",
"encoder.model.3.conv.conv": "encoder.layers.3.conv",
"encoder.model.4.block.1.conv.conv": "encoder.layers.4.block.1.conv",
"encoder.model.4.block.3.conv.conv": "encoder.layers.4.block.3.conv",
"encoder.model.4.shortcut.conv.conv": "encoder.layers.4.shortcut.conv",
"encoder.model.6.conv.conv": "encoder.layers.6.conv",
"encoder.model.7.block.1.conv.conv": "encoder.layers.7.block.1.conv",
"encoder.model.7.block.3.conv.conv": "encoder.layers.7.block.3.conv",
"encoder.model.7.shortcut.conv.conv": "encoder.layers.7.shortcut.conv",
"encoder.model.9.conv.conv": "encoder.layers.9.conv",
"encoder.model.10.block.1.conv.conv": "encoder.layers.10.block.1.conv",
"encoder.model.10.block.3.conv.conv": "encoder.layers.10.block.3.conv",
"encoder.model.10.shortcut.conv.conv": "encoder.layers.10.shortcut.conv",
"encoder.model.12.conv.conv": "encoder.layers.12.conv",
"encoder.model.13.lstm": "encoder.layers.13.lstm",
"encoder.model.15.conv.conv": "encoder.layers.15.conv",
}
MAPPING_ENCODER_48K = {
"encoder.model.0.conv.norm": "encoder.layers.0.norm",
"encoder.model.1.block.1.conv.norm": "encoder.layers.1.block.1.norm",
"encoder.model.1.block.3.conv.norm": "encoder.layers.1.block.3.norm",
"encoder.model.1.shortcut.conv.norm": "encoder.layers.1.shortcut.norm",
"encoder.model.3.conv.norm": "encoder.layers.3.norm",
"encoder.model.4.block.1.conv.norm": "encoder.layers.4.block.1.norm",
"encoder.model.4.block.3.conv.norm": "encoder.layers.4.block.3.norm",
"encoder.model.4.shortcut.conv.norm": "encoder.layers.4.shortcut.norm",
"encoder.model.6.conv.norm": "encoder.layers.6.norm",
"encoder.model.7.block.1.conv.norm": "encoder.layers.7.block.1.norm",
"encoder.model.7.block.3.conv.norm": "encoder.layers.7.block.3.norm",
"encoder.model.7.shortcut.conv.norm": "encoder.layers.7.shortcut.norm",
"encoder.model.9.conv.norm": "encoder.layers.9.norm",
"encoder.model.10.block.1.conv.norm": "encoder.layers.10.block.1.norm",
"encoder.model.10.block.3.conv.norm": "encoder.layers.10.block.3.norm",
"encoder.model.10.shortcut.conv.norm": "encoder.layers.10.shortcut.norm",
"encoder.model.12.conv.norm": "encoder.layers.12.norm",
"encoder.model.15.conv.norm": "encoder.layers.15.norm",
}
MAPPING_DECODER = {
"decoder.model.0.conv.conv": "decoder.layers.0.conv",
"decoder.model.1.lstm": "decoder.layers.1.lstm",
"decoder.model.3.convtr.convtr": "decoder.layers.3.conv",
"decoder.model.4.block.1.conv.conv": "decoder.layers.4.block.1.conv",
"decoder.model.4.block.3.conv.conv": "decoder.layers.4.block.3.conv",
"decoder.model.4.shortcut.conv.conv": "decoder.layers.4.shortcut.conv",
"decoder.model.6.convtr.convtr": "decoder.layers.6.conv",
"decoder.model.7.block.1.conv.conv": "decoder.layers.7.block.1.conv",
"decoder.model.7.block.3.conv.conv": "decoder.layers.7.block.3.conv",
"decoder.model.7.shortcut.conv.conv": "decoder.layers.7.shortcut.conv",
"decoder.model.9.convtr.convtr": "decoder.layers.9.conv",
"decoder.model.10.block.1.conv.conv": "decoder.layers.10.block.1.conv",
"decoder.model.10.block.3.conv.conv": "decoder.layers.10.block.3.conv",
"decoder.model.10.shortcut.conv.conv": "decoder.layers.10.shortcut.conv",
"decoder.model.12.convtr.convtr": "decoder.layers.12.conv",
"decoder.model.13.block.1.conv.conv": "decoder.layers.13.block.1.conv",
"decoder.model.13.block.3.conv.conv": "decoder.layers.13.block.3.conv",
"decoder.model.13.shortcut.conv.conv": "decoder.layers.13.shortcut.conv",
"decoder.model.15.conv.conv": "decoder.layers.15.conv",
}
MAPPING_DECODER_48K = {
"decoder.model.0.conv.norm": "decoder.layers.0.norm",
"decoder.model.3.convtr.norm": "decoder.layers.3.norm",
"decoder.model.4.block.1.conv.norm": "decoder.layers.4.block.1.norm",
"decoder.model.4.block.3.conv.norm": "decoder.layers.4.block.3.norm",
"decoder.model.4.shortcut.conv.norm": "decoder.layers.4.shortcut.norm",
"decoder.model.6.convtr.norm": "decoder.layers.6.norm",
"decoder.model.7.block.1.conv.norm": "decoder.layers.7.block.1.norm",
"decoder.model.7.block.3.conv.norm": "decoder.layers.7.block.3.norm",
"decoder.model.7.shortcut.conv.norm": "decoder.layers.7.shortcut.norm",
"decoder.model.9.convtr.norm": "decoder.layers.9.norm",
"decoder.model.10.block.1.conv.norm": "decoder.layers.10.block.1.norm",
"decoder.model.10.block.3.conv.norm": "decoder.layers.10.block.3.norm",
"decoder.model.10.shortcut.conv.norm": "decoder.layers.10.shortcut.norm",
"decoder.model.12.convtr.norm": "decoder.layers.12.norm",
"decoder.model.13.block.1.conv.norm": "decoder.layers.13.block.1.norm",
"decoder.model.13.block.3.conv.norm": "decoder.layers.13.block.3.norm",
"decoder.model.13.shortcut.conv.norm": "decoder.layers.13.shortcut.norm",
"decoder.model.15.conv.norm": "decoder.layers.15.norm",
}
MAPPING_24K = {
**MAPPING_QUANTIZER,
**MAPPING_ENCODER,
**MAPPING_DECODER,
}
MAPPING_48K = {
**MAPPING_QUANTIZER,
**MAPPING_ENCODER,
**MAPPING_ENCODER_48K,
**MAPPING_DECODER,
**MAPPING_DECODER_48K,
}
TOP_LEVEL_KEYS = []
IGNORE_KEYS = []
def set_recursively(hf_pointer, key, value, full_name, weight_type):
for attribute in key.split("."):
hf_pointer = getattr(hf_pointer, attribute)
if weight_type is not None:
hf_shape = getattr(hf_pointer, weight_type).shape
else:
hf_shape = hf_pointer.shape
if hf_shape != value.shape:
raise ValueError(
f"Shape of hf {key + '.' + weight_type if weight_type is not None else ''} is {hf_shape}, but should be"
f" {value.shape} for {full_name}"
)
if weight_type == "weight":
hf_pointer.weight.data = value
elif weight_type == "weight_g":
hf_pointer.weight_g.data = value
elif weight_type == "weight_v":
hf_pointer.weight_v.data = value
elif weight_type == "bias":
hf_pointer.bias.data = value
elif weight_type == "running_mean":
hf_pointer.running_mean.data = value
elif weight_type == "running_var":
hf_pointer.running_var.data = value
elif weight_type == "num_batches_tracked":
hf_pointer.num_batches_tracked.data = value
elif weight_type == "weight_ih_l0":
hf_pointer.weight_ih_l0.data = value
elif weight_type == "weight_hh_l0":
hf_pointer.weight_hh_l0.data = value
elif weight_type == "bias_ih_l0":
hf_pointer.bias_ih_l0.data = value
elif weight_type == "bias_hh_l0":
hf_pointer.bias_hh_l0.data = value
elif weight_type == "weight_ih_l1":
hf_pointer.weight_ih_l1.data = value
elif weight_type == "weight_hh_l1":
hf_pointer.weight_hh_l1.data = value
elif weight_type == "bias_ih_l1":
hf_pointer.bias_ih_l1.data = value
elif weight_type == "bias_hh_l1":
hf_pointer.bias_hh_l1.data = value
else:
hf_pointer.data = value
logger.info(f"{key + ('.' + weight_type if weight_type is not None else '')} was initialized from {full_name}.")
def should_ignore(name, ignore_keys):
for key in ignore_keys:
if key.endswith(".*"):
if name.startswith(key[:-1]):
return True
elif ".*." in key:
prefix, suffix = key.split(".*.")
if prefix in name and suffix in name:
return True
elif key in name:
return True
return False
def recursively_load_weights(orig_dict, hf_model, model_name):
unused_weights = []
if model_name == "encodec_24khz":
MAPPING = MAPPING_24K
elif model_name == "encodec_48khz":
MAPPING = MAPPING_48K
else:
raise ValueError(f"Unsupported model: {model_name}")
for name, value in orig_dict.items():
if should_ignore(name, IGNORE_KEYS):
logger.info(f"{name} was ignored")
continue
is_used = False
for key, mapped_key in MAPPING.items():
if "*" in key:
prefix, suffix = key.split(".*.")
if prefix in name and suffix in name:
key = suffix
if key in name:
# HACK otherwise .embed gets initialized with .embed_avg too
if key.endswith("embed") and name.endswith("embed_avg"):
continue
is_used = True
if "*" in mapped_key:
layer_index = name.split(key)[0].split(".")[-2]
mapped_key = mapped_key.replace("*", layer_index)
if "weight_g" in name:
weight_type = "weight_g"
elif "weight_v" in name:
weight_type = "weight_v"
elif "weight_ih_l0" in name:
weight_type = "weight_ih_l0"
elif "weight_hh_l0" in name:
weight_type = "weight_hh_l0"
elif "bias_ih_l0" in name:
weight_type = "bias_ih_l0"
elif "bias_hh_l0" in name:
weight_type = "bias_hh_l0"
elif "weight_ih_l1" in name:
weight_type = "weight_ih_l1"
elif "weight_hh_l1" in name:
weight_type = "weight_hh_l1"
elif "bias_ih_l1" in name:
weight_type = "bias_ih_l1"
elif "bias_hh_l1" in name:
weight_type = "bias_hh_l1"
elif "bias" in name:
weight_type = "bias"
elif "weight" in name:
weight_type = "weight"
elif "running_mean" in name:
weight_type = "running_mean"
elif "running_var" in name:
weight_type = "running_var"
elif "num_batches_tracked" in name:
weight_type = "num_batches_tracked"
else:
weight_type = None
set_recursively(hf_model, mapped_key, value, name, weight_type)
continue
if not is_used:
unused_weights.append(name)
logger.warning(f"Unused weights: {unused_weights}")
@torch.no_grad()
def convert_checkpoint(
model_name,
checkpoint_path,
pytorch_dump_folder_path,
config_path=None,
repo_id=None,
):
"""
Copy/paste/tweak model's weights to transformers design.
"""
if config_path is not None:
config = EncodecConfig.from_pretrained(config_path)
else:
config = EncodecConfig()
if model_name == "encodec_24khz":
pass # config is already correct
elif model_name == "encodec_48khz":
config.upsampling_ratios = [8, 5, 4, 2]
config.target_bandwidths = [3.0, 6.0, 12.0, 24.0]
config.sampling_rate = 48_000
config.audio_channels = 2
config.use_causal_conv = False
config.norm_type = "time_group_norm"
config.normalize = True
config.chunk_length_s = 1.0
config.overlap = 0.01
else:
raise ValueError(f"Unknown model name: {model_name}")
model = EncodecModel(config)
feature_extractor = EncodecFeatureExtractor(
feature_size=config.audio_channels,
sampling_rate=config.sampling_rate,
chunk_length_s=config.chunk_length_s,
overlap=config.overlap,
)
feature_extractor.save_pretrained(pytorch_dump_folder_path)
original_checkpoint = torch.load(checkpoint_path)
recursively_load_weights(original_checkpoint, model, model_name)
model.save_pretrained(pytorch_dump_folder_path)
if repo_id:
print("Pushing to the hub...")
feature_extractor.push_to_hub(repo_id)
model.push_to_hub(repo_id)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--model",
default="encodec_24khz",
type=str,
help="The model to convert. Should be one of 'encodec_24khz', 'encodec_48khz'.",
)
parser.add_argument("--checkpoint_path", required=True, default=None, type=str, help="Path to original checkpoint")
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
parser.add_argument(
"--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
)
parser.add_argument(
"--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
)
args = parser.parse_args()
convert_checkpoint(
args.model,
args.checkpoint_path,
args.pytorch_dump_folder_path,
args.config_path,
args.push_to_hub,
)
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for EnCodec."""
from typing import List, Optional, Union
import numpy as np
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import PaddingStrategy, TensorType, logging
logger = logging.get_logger(__name__)
class EncodecFeatureExtractor(SequenceFeatureExtractor):
r"""
Constructs an EnCodec feature extractor.
This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
most of the main methods. Users should refer to this superclass for more information regarding those methods.
Instantiating a feature extractor with the defaults will yield a similar configuration to that of the
[facebook/encodec_24khz](https://huggingface.co/facebook/encodec_24khz) architecture.
Args:
feature_size (`int`, *optional*, defaults to 1):
The feature dimension of the extracted features. Use 1 for mono, 2 for stereo.
sampling_rate (`int`, *optional*, defaults to 24000):
The sampling rate at which the audio waveform should be digitalized expressed in hertz (Hz).
padding_value (`float`, *optional*, defaults to 0.0):
The value that is used to fill the padding values.
chunk_length_s (`float`, *optional*):
If defined the audio is pre-processed into chunks of lengths `chunk_length_s` and then encoded.
overlap (`float`, *optional*):
Defines the overlap between each chunk. It is used to compute the `chunk_stride` using the following
formulae : `int((1.0 - self.overlap) * self.chunk_length)`.
"""
model_input_names = ["input_values", "padding_mask"]
def __init__(
self,
feature_size: int = 1,
sampling_rate: int = 24000,
padding_value: float = 0.0,
chunk_length_s: float = None,
overlap: float = None,
**kwargs,
):
super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
self.chunk_length_s = chunk_length_s
self.overlap = overlap
# This is a property because you might want to change the chunk_length_s on the fly
@property
def chunk_length(self) -> Optional[int]:
if self.chunk_length_s is None:
return None
else:
return int(self.chunk_length_s * self.sampling_rate)
# This is a property because you might want to change the chunk_length_s on the fly
@property
def chunk_stride(self) -> Optional[int]:
if self.chunk_length_s is None or self.overlap is None:
return None
else:
return max(1, int((1.0 - self.overlap) * self.chunk_length))
def __call__(
self,
raw_audio: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
padding: Optional[Union[bool, str, PaddingStrategy]] = None,
truncation: Optional[bool] = False,
max_length: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
sampling_rate: Optional[int] = None,
) -> BatchFeature:
"""
Main method to featurize and prepare for the model one or several sequence(s).
Args:
raw_audio (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
The sequence or batch of sequences to be processed. Each sequence can be a numpy array, a list of float
values, a list of numpy arrays or a list of list of float values. The numpy array must be of shape
`(num_samples,)` for mono audio (`feature_size = 1`), or `(2, num_samples)` for stereo audio
(`feature_size = 2`).
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding
index) among:
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence if provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
acceptable input length for the model if that argument is not provided.
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
lengths).
truncation (`bool`, *optional*, defaults to `False`):
Activates truncation to cut input sequences longer than `max_length` to `max_length`.
max_length (`int`, *optional*):
Maximum length of the returned list and optionally padding length (see above).
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
sampling_rate (`int`, *optional*):
The sampling rate at which the `audio` input was sampled. It is strongly recommended to pass
`sampling_rate` at the forward call to prevent silent errors.
"""
if sampling_rate is not None:
if sampling_rate != self.sampling_rate:
raise ValueError(
f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
f" {self.sampling_rate}. Please make sure that the provided audio input was sampled with"
f" {self.sampling_rate} and not {sampling_rate}."
)
else:
logger.warning(
"It is strongly recommended to pass the `sampling_rate` argument to this function. "
"Failing to do so can result in silent errors that might be hard to debug."
)
if padding and truncation:
raise ValueError("Both padding and truncation were set. Make sure you only set one.")
elif padding is None:
# by default let's pad the inputs
padding = True
is_batched = bool(
isinstance(raw_audio, (list, tuple)) and (isinstance(raw_audio[0], (np.ndarray, tuple, list)))
)
if is_batched:
raw_audio = [np.asarray(audio, dtype=np.float32).T for audio in raw_audio]
elif not is_batched and not isinstance(raw_audio, np.ndarray):
raw_audio = np.asarray(raw_audio, dtype=np.float32)
elif isinstance(raw_audio, np.ndarray) and raw_audio.dtype is np.dtype(np.float64):
raw_audio = raw_audio.astype(np.float32)
# always return batch
if not is_batched:
raw_audio = [np.asarray(raw_audio).T]
# verify inputs are valid
for idx, example in enumerate(raw_audio):
if example.ndim > 2:
raise ValueError(f"Expected input shape (channels, length) but got shape {example.shape}")
if self.feature_size == 1 and example.ndim != 1:
raise ValueError(f"Expected mono audio but example has {example.shape[-1]} channels")
if self.feature_size == 2 and example.shape[-1] != 2:
raise ValueError(f"Expected stereo audio but example has {example.shape[-1]} channels")
padded_inputs = None
input_values = BatchFeature({"input_values": raw_audio})
if self.chunk_stride is not None and self.chunk_length is not None and max_length is None:
if truncation:
max_length = min(array.shape[0] for array in raw_audio)
nb_step = int(np.floor(max_length / self.chunk_stride))
max_length = (nb_step - 1) * self.chunk_stride + self.chunk_length
elif padding:
max_length = max(array.shape[0] for array in raw_audio)
nb_step = int(np.ceil(max_length / self.chunk_stride))
max_length = (nb_step - 1) * self.chunk_stride + self.chunk_length
padding = "max_length"
else:
padded_inputs = input_values
# normal padding on batch
if padded_inputs is None:
padded_inputs = self.pad(
input_values,
max_length=max_length,
truncation=truncation,
padding=padding,
return_attention_mask=padding,
)
if padding:
padded_inputs["padding_mask"] = padded_inputs.pop("attention_mask")
input_values = []
for example in padded_inputs.pop("input_values"):
if self.feature_size == 1:
example = example[..., None]
input_values.append(example.T)
padded_inputs["input_values"] = input_values
if return_tensors is not None:
padded_inputs = padded_inputs.convert_to_tensors(return_tensors)
return padded_inputs
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment