Unverified Commit 450a181d authored by Susnato Dhar, committed by GitHub

Add Pop2Piano (#21785)



* init commit

* config updated also some modeling

* Processor and Model config combined

* extraction pipeline (up to before spectrogram & mel_conditioner) added but not properly tested

* model loading successful!

* feature extractor done!

* FE can now be called from HF

* postprocessing added in fe file

* same as prev commit

* Pop2PianoConfig doc done

* cfg docs slightly changed

* fe docs done

* batched

* batched working!

* temp

* v1

* checking

* trying to go with generate

* with generate and model tests passed

* before rebasing

* .

* tests done docs done remaining others & nits

* nits

* LogMelSpectogram shifted to FeatureExtractor

* is_tf removed from pop2piano/init

* import solved

* tokenization tests added

* minor fixed regarding modeling_pop2piano

* tokenizer changed to only return midi_object and other changes

* Updated paper abstract(Camera-ready version) (#2)

* more comments and nits

* ruff changes

* code quality fix

* sg comments

* t5 change added and rebased

* comments except batching

* batching done

* comments

* small doc fix

* example removed from modeling

* ckpt

* forward is compatible with fe and generation done

* comments

* comments

* code-quality fix(maybe)

* ckpts changed

* doc file changed from mdx to md

* test fixes

* tokenizer test fix

* changes

* nits done main changes remaining

* code modified

* Pop2PianoProcessor added with tests

* other comments

* added Pop2PianoProcessor to dummy_objects

* added require_onnx to modeling file

* changes

* update .md file

* remove extra line in index.md

* back to the main index

* added pop2piano to index

* Added tokenizer.__call__ with valid args and batch_decode and aligned the processor part too

* changes

* added return types to 2 tokenizer methods

* the PR build test might work now

* added backends

* PR build fix

* vocab added

* comments

* refactored vocab into 1 file

* added conversion script

* comments

* essentia version changed in .md

* comments

* more tokenizer tests added

* minor fix

* tests extended for outputs acc check

* small fix

---------
Co-authored-by: Jongho Choi <sweetcocoa@snu.ac.kr>
parent 6f041fcb
......@@ -434,6 +434,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
......
......@@ -411,6 +411,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
......
......@@ -383,6 +383,7 @@ conda install -c huggingface transformers
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (Google से) Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. द्वाराअनुसंधान पत्र [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) के साथ जारी किया गया
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP से) साथ वाला पेपर [प्रोग्राम अंडरस्टैंडिंग एंड जेनरेशन के लिए यूनिफाइड प्री-ट्रेनिंग](https://arxiv .org/abs/2103.06333) वसी उद्दीन अहमद, सैकत चक्रवर्ती, बैशाखी रे, काई-वेई चांग द्वारा।
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [ProphetNet: प्रेडिक्टिंग फ्यूचर एन-ग्राम फॉर सीक्वेंस-टू-सीक्वेंस प्री-ट्रेनिंग ](https://arxiv.org/abs/2001.04063) यू यान, वीज़ेन क्यूई, येयुन गोंग, दयाहेंग लियू, नान डुआन, जिउशेंग चेन, रुओफ़ेई झांग और मिंग झोउ द्वारा पोस्ट किया गया।
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (Nanjing University, The University of Hong Kong etc. से) Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. द्वाराअनुसंधान पत्र [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) के साथ जारी किया गया
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA से) साथ वाला पेपर [डीप लर्निंग इंफ़ेक्शन के लिए इंटीजर क्वांटिज़ेशन: प्रिंसिपल्स एंड एम्पिरिकल इवैल्यूएशन](https:// arxiv.org/abs/2004.09602) हाओ वू, पैट्रिक जुड, जिआओजी झांग, मिखाइल इसेव और पॉलियस माइकेविसियस द्वारा।
......
......@@ -445,6 +445,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (Google から) Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. から公開された研究論文 [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347)
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP から) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang から公開された研究論文: [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333)
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (Sea AI Labs から) Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng から公開された研究論文: [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418)
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (Microsoft Research から) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou から公開された研究論文: [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063)
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (Nanjing University, The University of Hong Kong etc. から) Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. から公開された研究論文 [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf)
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA から) Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius から公開された研究論文: [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602)
......
......@@ -360,6 +360,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (Google 에서 제공)은 Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.의 [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347)논문과 함께 발표했습니다.
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP 에서) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang 의 [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 논문과 함께 발표했습니다.
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (Sea AI Labs 에서) Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng 의 [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) 논문과 함께 발표했습니다.
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (Microsoft Research 에서) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 의 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 논문과 함께 발표했습니다.
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (Nanjing University, The University of Hong Kong etc. 에서 제공)은 Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.의 [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf)논문과 함께 발표했습니다.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA 에서) Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 의 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 논문과 함께 발표했습니다.
......
......@@ -384,6 +384,7 @@ conda install -c huggingface transformers
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (来自 Google) 伴随论文 [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) 由 Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova 发布。
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (来自 UCLA NLP) 伴随论文 [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 由 Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang 发布。
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (来自 Sea AI Labs) 伴随论文 [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) 由 Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng 发布。
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (来自 Nanjing University, The University of Hong Kong etc.) 伴随论文 [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) 由 Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao 发布。
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (来自 NVIDIA) 伴随论文 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 由 Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 发布。
......
......@@ -396,6 +396,7 @@ conda install -c huggingface transformers
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
......
......@@ -584,6 +584,8 @@
title: MMS
- local: model_doc/musicgen
title: MusicGen
- local: model_doc/pop2piano
title: Pop2Piano
- local: model_doc/sew
title: SEW
- local: model_doc/sew-d
......
......@@ -200,6 +200,7 @@ The documentation is organized into five sections:
1. **[Pix2Struct](model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.
1. **[PLBart](model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
1. **[PoolFormer](model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1. **[Pop2Piano](model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee.
1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[PVT](model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
1. **[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
......@@ -415,6 +416,7 @@ Flax), PyTorch, and/or TensorFlow.
| Pix2Struct | ✅ | ❌ | ❌ |
| PLBart | ✅ | ❌ | ❌ |
| PoolFormer | ✅ | ❌ | ❌ |
| Pop2Piano | ✅ | ❌ | ❌ |
| ProphetNet | ✅ | ❌ | ❌ |
| PVT | ✅ | ❌ | ❌ |
| QDQBert | ✅ | ❌ | ❌ |
......
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Pop2Piano
## Overview
The Pop2Piano model was proposed in [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee.
Piano covers of pop music are widely enjoyed, but generating them from pop audio is not a trivial task. It requires great
expertise in playing the piano as well as knowledge of a song's different characteristics and melodies. With Pop2Piano you
can directly generate a cover from a song's audio waveform. It is the first model to directly generate a piano cover
from pop audio without melody and chord extraction modules.
Pop2Piano is an encoder-decoder Transformer model based on [T5](https://arxiv.org/pdf/1910.10683.pdf). The input audio
waveform is converted to a log-mel spectrogram and passed to the encoder, which transforms it into a latent representation. The decoder
uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four
different token types: time, velocity, note and 'special'. The token ids are then decoded to their equivalent MIDI file.
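For intuition, the vocabulary layout used by the original repository (and mirrored in the conversion script included in this PR) places 4 special tokens first, then 128 note tokens and 2 velocity tokens, with the remaining ids interpreted as time tokens. A minimal sketch of that mapping (illustrative only, not part of the public API):

```python
def token_type(idx, n_special=4, n_note=128, n_velocity=2):
    # mirrors the `detokenize` helper used to build the tokenizer vocab
    if idx >= n_special + n_note + n_velocity:
        return "TOKEN_TIME"
    elif idx >= n_special + n_note:
        return "TOKEN_VELOCITY"
    elif idx >= n_special:
        return "TOKEN_NOTE"
    return "TOKEN_SPECIAL"

print(token_type(0), token_type(4), token_type(132), token_type(134))
# TOKEN_SPECIAL TOKEN_NOTE TOKEN_VELOCITY TOKEN_TIME
```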
The abstract from the paper is the following:
*Piano covers of pop music are enjoyed by many people. However, the
task of automatically generating piano covers of pop music is still
understudied. This is partly due to the lack of synchronized
{Pop, Piano Cover} data pairs, which made it challenging to apply
the latest data-intensive deep learning-based methods. To leverage
the power of the data-driven approach, we make a large amount of
paired and synchronized {Pop, Piano Cover} data using an automated
pipeline. In this paper, we present Pop2Piano, a Transformer network
that generates piano covers given waveforms of pop music. To the best
of our knowledge, this is the first model to generate a piano cover
directly from pop audio without using melody and chord extraction
modules. We show that Pop2Piano, trained with our dataset, is capable
of producing plausible piano covers.*
Tips:
1. To use Pop2Piano, you will need to install the 🤗 Transformers library, as well as the following third party modules:
```
pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy
```
Please note that you may need to restart your runtime after installation.
2. Pop2Piano is an Encoder-Decoder based model like T5.
3. Pop2Piano can be used to generate MIDI files for a given audio sequence.
4. Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to a variety of different results.
5. Setting the sampling rate to 44.1 kHz when loading the audio file can give good performance.
6. Though Pop2Piano was mainly trained on Korean Pop music, it also does pretty well on other Western Pop or Hip Hop songs.
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
The original code can be found [here](https://github.com/sweetcocoa/pop2piano).
## Examples
- Example using HuggingFace Dataset:
```python
>>> from datasets import load_dataset
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
>>> ds = load_dataset("sweetcocoa/pop2piano_ci", split="test")
>>> inputs = processor(
... audio=ds["audio"][0]["array"], sampling_rate=ds["audio"][0]["sampling_rate"], return_tensors="pt"
... )
>>> model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
>>> tokenizer_output = processor.batch_decode(
... token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"][0]
>>> tokenizer_output.write("./Outputs/midi_output.mid")
```
- Example using your own audio file:
```python
>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor
>>> audio, sr = librosa.load("<your_audio_file_here>", sr=44100) # feel free to change the sr to a suitable value.
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
>>> inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")
>>> model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
>>> tokenizer_output = processor.batch_decode(
... token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"][0]
>>> tokenizer_output.write("./Outputs/midi_output.mid")
```
- Example of processing multiple audio files in batch:
```python
>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor
>>> # feel free to change the sr to a suitable value.
>>> audio1, sr1 = librosa.load("<your_first_audio_file_here>", sr=44100)
>>> audio2, sr2 = librosa.load("<your_second_audio_file_here>", sr=44100)
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
>>> inputs = processor(audio=[audio1, audio2], sampling_rate=[sr1, sr2], return_attention_mask=True, return_tensors="pt")
>>> # Since we are now generating in batch (2 audios), we must pass the attention_mask
>>> model_output = model.generate(
... input_features=inputs["input_features"],
... attention_mask=inputs["attention_mask"],
... composer="composer1",
... )
>>> tokenizer_output = processor.batch_decode(
... token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"]
>>> # Since we now have 2 generated MIDI files
>>> tokenizer_output[0].write("./Outputs/midi_output1.mid")
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid")
```
- Example of processing multiple audio files in batch (Using `Pop2PianoFeatureExtractor` and `Pop2PianoTokenizer`):
```python
>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoFeatureExtractor, Pop2PianoTokenizer
>>> # feel free to change the sr to a suitable value.
>>> audio1, sr1 = librosa.load("<your_first_audio_file_here>", sr=44100)
>>> audio2, sr2 = librosa.load("<your_second_audio_file_here>", sr=44100)
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("sweetcocoa/pop2piano")
>>> tokenizer = Pop2PianoTokenizer.from_pretrained("sweetcocoa/pop2piano")
>>> inputs = feature_extractor(
... audio=[audio1, audio2],
... sampling_rate=[sr1, sr2],
... return_attention_mask=True,
... return_tensors="pt",
... )
>>> # Since we are now generating in batch (2 audios), we must pass the attention_mask
>>> model_output = model.generate(
... input_features=inputs["input_features"],
... attention_mask=inputs["attention_mask"],
... composer="composer1",
... )
>>> tokenizer_output = tokenizer.batch_decode(
... token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"]
>>> # Since we now have 2 generated MIDI files
>>> tokenizer_output[0].write("./Outputs/midi_output1.mid")
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid")
```
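To get a quick audible preview of a generated cover, the returned `PrettyMIDI` object can be rendered to a waveform. This is a minimal sketch and not part of the original examples; `pretty_midi`'s built-in `synthesize` uses simple sine waves, so it only gives a rough impression of the cover:

```python
>>> import scipy.io.wavfile

>>> # `tokenizer_output` is the list of PrettyMIDI objects from the batched example above
>>> preview = tokenizer_output[0].synthesize(fs=44100)  # rough sine-wave rendering
>>> scipy.io.wavfile.write("./Outputs/midi_preview.wav", 44100, preview)
```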
## Pop2PianoConfig
[[autodoc]] Pop2PianoConfig
## Pop2PianoFeatureExtractor
[[autodoc]] Pop2PianoFeatureExtractor
- __call__
## Pop2PianoForConditionalGeneration
[[autodoc]] Pop2PianoForConditionalGeneration
- forward
- generate
## Pop2PianoTokenizer
[[autodoc]] Pop2PianoTokenizer
- __call__
## Pop2PianoProcessor
[[autodoc]] Pop2PianoProcessor
- __call__
......@@ -28,8 +28,12 @@ from .utils import (
OptionalDependencyNotAvailable,
_LazyModule,
is_bitsandbytes_available,
is_essentia_available,
is_flax_available,
is_keras_nlp_available,
is_librosa_available,
is_pretty_midi_available,
is_scipy_available,
is_sentencepiece_available,
is_speech_available,
is_tensorflow_text_available,
......@@ -475,6 +479,10 @@ _import_structure = {
],
"models.plbart": ["PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP", "PLBartConfig"],
"models.poolformer": ["POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PoolFormerConfig"],
"models.pop2piano": [
"POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP",
"Pop2PianoConfig",
],
"models.prophetnet": ["PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ProphetNetConfig", "ProphetNetTokenizer"],
"models.pvt": ["PVT_PRETRAINED_CONFIG_ARCHIVE_MAP", "PvtConfig"],
"models.qdqbert": ["QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "QDQBertConfig"],
......@@ -2430,6 +2438,13 @@ else:
"PoolFormerPreTrainedModel",
]
)
_import_structure["models.pop2piano"].extend(
[
"POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST",
"Pop2PianoForConditionalGeneration",
"Pop2PianoPreTrainedModel",
]
)
_import_structure["models.prophetnet"].extend(
[
"PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST",
......@@ -3783,6 +3798,29 @@ else:
_import_structure["trainer_tf"] = ["TFTrainer"]
try:
if not (
is_librosa_available()
and is_essentia_available()
and is_scipy_available()
and is_torch_available()
and is_pretty_midi_available()
):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from .utils import dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects
_import_structure["utils.dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects"] = [
name
for name in dir(dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects)
if not name.startswith("_")
]
else:
_import_structure["models.pop2piano"].append("Pop2PianoFeatureExtractor")
_import_structure["models.pop2piano"].append("Pop2PianoTokenizer")
_import_structure["models.pop2piano"].append("Pop2PianoProcessor")
# FLAX-backed objects
try:
if not is_flax_available():
......@@ -4478,6 +4516,10 @@ if TYPE_CHECKING:
)
from .models.plbart import PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP, PLBartConfig
from .models.poolformer import POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, PoolFormerConfig
from .models.pop2piano import (
POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP,
Pop2PianoConfig,
)
from .models.prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig, ProphetNetTokenizer
from .models.pvt import PVT_PRETRAINED_CONFIG_ARCHIVE_MAP, PvtConfig
from .models.qdqbert import QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, QDQBertConfig
......@@ -6122,6 +6164,11 @@ if TYPE_CHECKING:
PoolFormerModel,
PoolFormerPreTrainedModel,
)
from .models.pop2piano import (
POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST,
Pop2PianoForConditionalGeneration,
Pop2PianoPreTrainedModel,
)
from .models.prophetnet import (
PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST,
ProphetNetDecoder,
......@@ -7212,6 +7259,20 @@ if TYPE_CHECKING:
# Trainer
from .trainer_tf import TFTrainer
try:
if not (
is_librosa_available()
and is_essentia_available()
and is_scipy_available()
and is_torch_available()
and is_pretty_midi_available()
):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from .utils.dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects import *
else:
from .models.pop2piano import Pop2PianoFeatureExtractor, Pop2PianoProcessor, Pop2PianoTokenizer
try:
if not is_flax_available():
raise OptionalDependencyNotAvailable()
......
......@@ -157,6 +157,7 @@ from . import (
pix2struct,
plbart,
poolformer,
pop2piano,
prophetnet,
pvt,
qdqbert,
......
......@@ -162,6 +162,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("pix2struct", "Pix2StructConfig"),
("plbart", "PLBartConfig"),
("poolformer", "PoolFormerConfig"),
("pop2piano", "Pop2PianoConfig"),
("prophetnet", "ProphetNetConfig"),
("pvt", "PvtConfig"),
("qdqbert", "QDQBertConfig"),
......@@ -361,6 +362,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("pix2struct", "PIX2STRUCT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("plbart", "PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("poolformer", "POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("pop2piano", "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("prophetnet", "PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("pvt", "PVT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("qdqbert", "QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
......@@ -578,6 +580,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("pix2struct", "Pix2Struct"),
("plbart", "PLBart"),
("poolformer", "PoolFormer"),
("pop2piano", "Pop2Piano"),
("prophetnet", "ProphetNet"),
("pvt", "PVT"),
("qdqbert", "QDQBert"),
......
......@@ -73,6 +73,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
("owlvit", "OwlViTFeatureExtractor"),
("perceiver", "PerceiverFeatureExtractor"),
("poolformer", "PoolFormerFeatureExtractor"),
("pop2piano", "Pop2PianoFeatureExtractor"),
("regnet", "ConvNextFeatureExtractor"),
("resnet", "ConvNextFeatureExtractor"),
("segformer", "SegformerFeatureExtractor"),
......
......@@ -346,6 +346,7 @@ MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict(
("openai-gpt", "OpenAIGPTLMHeadModel"),
("pegasus_x", "PegasusXForConditionalGeneration"),
("plbart", "PLBartForConditionalGeneration"),
("pop2piano", "Pop2PianoForConditionalGeneration"),
("qdqbert", "QDQBertForMaskedLM"),
("reformer", "ReformerModelWithLMHead"),
("rembert", "RemBertForMaskedLM"),
......@@ -670,6 +671,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
[
("pop2piano", "Pop2PianoForConditionalGeneration"),
("speech-encoder-decoder", "SpeechEncoderDecoderModel"),
("speech_to_text", "Speech2TextForConditionalGeneration"),
("speecht5", "SpeechT5ForSpeechToText"),
......
......@@ -67,6 +67,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("oneformer", "OneFormerProcessor"),
("owlvit", "OwlViTProcessor"),
("pix2struct", "Pix2StructProcessor"),
("pop2piano", "Pop2PianoProcessor"),
("sam", "SamProcessor"),
("sew", "Wav2Vec2Processor"),
("sew-d", "Wav2Vec2Processor"),
......
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import (
OptionalDependencyNotAvailable,
_LazyModule,
is_essentia_available,
is_librosa_available,
is_pretty_midi_available,
is_scipy_available,
is_torch_available,
)
_import_structure = {
"configuration_pop2piano": ["POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP", "Pop2PianoConfig"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_pop2piano"] = [
"POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST",
"Pop2PianoForConditionalGeneration",
"Pop2PianoPreTrainedModel",
]
try:
if not (is_librosa_available() and is_essentia_available() and is_scipy_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["feature_extraction_pop2piano"] = ["Pop2PianoFeatureExtractor"]
try:
if not (is_pretty_midi_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["tokenization_pop2piano"] = ["Pop2PianoTokenizer"]
try:
if not (
is_pretty_midi_available()
and is_torch_available()
and is_librosa_available()
and is_essentia_available()
and is_scipy_available()
):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["processing_pop2piano"] = ["Pop2PianoProcessor"]
if TYPE_CHECKING:
from .configuration_pop2piano import POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP, Pop2PianoConfig
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_pop2piano import (
POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST,
Pop2PianoForConditionalGeneration,
Pop2PianoPreTrainedModel,
)
try:
if not (is_librosa_available() and is_essentia_available() and is_scipy_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .feature_extraction_pop2piano import Pop2PianoFeatureExtractor
try:
if not (is_pretty_midi_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .tokenization_pop2piano import Pop2PianoTokenizer
try:
if not (
is_pretty_midi_available()
and is_torch_available()
and is_librosa_available()
and is_essentia_available()
and is_scipy_available()
):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .processing_pop2piano import Pop2PianoProcessor
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Pop2Piano model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"sweetcocoa/pop2piano": "https://huggingface.co/sweetcocoa/pop2piano/blob/main/config.json"
}
class Pop2PianoConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`Pop2PianoForConditionalGeneration`]. It is used
to instantiate a Pop2PianoForConditionalGeneration model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the
Pop2Piano [sweetcocoa/pop2piano](https://huggingface.co/sweetcocoa/pop2piano) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Arguments:
vocab_size (`int`, *optional*, defaults to 2400):
Vocabulary size of the `Pop2PianoForConditionalGeneration` model. Defines the number of different tokens
that can be represented by the `inputs_ids` passed when calling [`Pop2PianoForConditionalGeneration`].
composer_vocab_size (`int`, *optional*, defaults to 21):
Denotes the number of composers.
d_model (`int`, *optional*, defaults to 512):
Size of the encoder layers and the pooler layer.
d_kv (`int`, *optional*, defaults to 64):
Size of the key, query, value projections per attention head. The `inner_dim` of the projection layer will
be defined as `num_heads * d_kv`.
d_ff (`int`, *optional*, defaults to 2048):
Size of the intermediate feed forward layer in each `Pop2PianoBlock`.
num_layers (`int`, *optional*, defaults to 6):
Number of hidden layers in the Transformer encoder.
num_decoder_layers (`int`, *optional*):
Number of hidden layers in the Transformer decoder. Will use the same value as `num_layers` if not set.
num_heads (`int`, *optional*, defaults to 8):
Number of attention heads for each attention layer in the Transformer encoder.
relative_attention_num_buckets (`int`, *optional*, defaults to 32):
The number of buckets to use for each attention layer.
relative_attention_max_distance (`int`, *optional*, defaults to 128):
The maximum distance of the longer sequences for the bucket separation.
dropout_rate (`float`, *optional*, defaults to 0.1):
The ratio for all dropout layers.
layer_norm_epsilon (`float`, *optional*, defaults to 1e-6):
The epsilon used by the layer normalization layers.
initializer_factor (`float`, *optional*, defaults to 1.0):
A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization
testing).
feed_forward_proj (`string`, *optional*, defaults to `"gated-gelu"`):
Type of feed forward layer to be used. Should be one of `"relu"` or `"gated-gelu"`.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
dense_act_fn (`string`, *optional*, defaults to `"relu"`):
Type of Activation Function to be used in `Pop2PianoDenseActDense` and in `Pop2PianoDenseGatedActDense`.
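Example (a minimal usage sketch, following the pattern used by other configuration classes in the library):
```python
>>> from transformers import Pop2PianoConfig, Pop2PianoForConditionalGeneration

>>> # Initializing a Pop2PianoConfig with default values
>>> configuration = Pop2PianoConfig()

>>> # Initializing a model (with random weights) from that configuration
>>> model = Pop2PianoForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```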
"""
model_type = "pop2piano"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
vocab_size=2400,
composer_vocab_size=21,
d_model=512,
d_kv=64,
d_ff=2048,
num_layers=6,
num_decoder_layers=None,
num_heads=8,
relative_attention_num_buckets=32,
relative_attention_max_distance=128,
dropout_rate=0.1,
layer_norm_epsilon=1e-6,
initializer_factor=1.0,
feed_forward_proj="gated-gelu", # noqa
is_encoder_decoder=True,
use_cache=True,
pad_token_id=0,
eos_token_id=1,
dense_act_fn="relu",
**kwargs,
):
self.vocab_size = vocab_size
self.composer_vocab_size = composer_vocab_size
self.d_model = d_model
self.d_kv = d_kv
self.d_ff = d_ff
self.num_layers = num_layers
self.num_decoder_layers = num_decoder_layers if num_decoder_layers is not None else self.num_layers
self.num_heads = num_heads
self.relative_attention_num_buckets = relative_attention_num_buckets
self.relative_attention_max_distance = relative_attention_max_distance
self.dropout_rate = dropout_rate
self.layer_norm_epsilon = layer_norm_epsilon
self.initializer_factor = initializer_factor
self.feed_forward_proj = feed_forward_proj
self.use_cache = use_cache
self.dense_act_fn = dense_act_fn
self.is_gated_act = self.feed_forward_proj.split("-")[0] == "gated"
self.hidden_size = self.d_model
self.num_attention_heads = num_heads
self.num_hidden_layers = num_layers
super().__init__(
pad_token_id=pad_token_id,
eos_token_id=eos_token_id,
is_encoder_decoder=is_encoder_decoder,
**kwargs,
)
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" File for loading the Pop2Piano model weights from the official repository and to show how tokenizer vocab was
constructed"""
import json
import torch
from transformers import Pop2PianoConfig, Pop2PianoForConditionalGeneration
########################## MODEL WEIGHTS ##########################
# These weights were downloaded from the official pop2piano repository
# https://huggingface.co/sweetcocoa/pop2piano/blob/main/model-1999-val_0.67311615.ckpt
official_weights = torch.load("./model-1999-val_0.67311615.ckpt")
state_dict = {}
# load the config and init the model
cfg = Pop2PianoConfig.from_pretrained("sweetcocoa/pop2piano")
model = Pop2PianoForConditionalGeneration(cfg)
# load relative attention bias
state_dict["encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"] = official_weights["state_dict"][
"transformer.encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"
]
state_dict["decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"] = official_weights["state_dict"][
"transformer.decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"
]
# load embed tokens and final layer norm for both encoder and decoder
state_dict["encoder.embed_tokens.weight"] = official_weights["state_dict"]["transformer.encoder.embed_tokens.weight"]
state_dict["decoder.embed_tokens.weight"] = official_weights["state_dict"]["transformer.decoder.embed_tokens.weight"]
state_dict["encoder.final_layer_norm.weight"] = official_weights["state_dict"][
"transformer.encoder.final_layer_norm.weight"
]
state_dict["decoder.final_layer_norm.weight"] = official_weights["state_dict"][
"transformer.decoder.final_layer_norm.weight"
]
# load lm_head, mel_conditioner.emb and shared
state_dict["lm_head.weight"] = official_weights["state_dict"]["transformer.lm_head.weight"]
state_dict["mel_conditioner.embedding.weight"] = official_weights["state_dict"]["mel_conditioner.embedding.weight"]
state_dict["shared.weight"] = official_weights["state_dict"]["transformer.shared.weight"]
# load each encoder block
for i in range(cfg.num_layers):
# layer 0
state_dict[f"encoder.block.{i}.layer.0.SelfAttention.q.weight"] = official_weights["state_dict"][
f"transformer.encoder.block.{i}.layer.0.SelfAttention.q.weight"
]
state_dict[f"encoder.block.{i}.layer.0.SelfAttention.k.weight"] = official_weights["state_dict"][
f"transformer.encoder.block.{i}.layer.0.SelfAttention.k.weight"
]
state_dict[f"encoder.block.{i}.layer.0.SelfAttention.v.weight"] = official_weights["state_dict"][
f"transformer.encoder.block.{i}.layer.0.SelfAttention.v.weight"
]
state_dict[f"encoder.block.{i}.layer.0.SelfAttention.o.weight"] = official_weights["state_dict"][
f"transformer.encoder.block.{i}.layer.0.SelfAttention.o.weight"
]
state_dict[f"encoder.block.{i}.layer.0.layer_norm.weight"] = official_weights["state_dict"][
f"transformer.encoder.block.{i}.layer.0.layer_norm.weight"
]
# layer 1
state_dict[f"encoder.block.{i}.layer.1.DenseReluDense.wi_0.weight"] = official_weights["state_dict"][
f"transformer.encoder.block.{i}.layer.1.DenseReluDense.wi_0.weight"
]
state_dict[f"encoder.block.{i}.layer.1.DenseReluDense.wi_1.weight"] = official_weights["state_dict"][
f"transformer.encoder.block.{i}.layer.1.DenseReluDense.wi_1.weight"
]
state_dict[f"encoder.block.{i}.layer.1.DenseReluDense.wo.weight"] = official_weights["state_dict"][
f"transformer.encoder.block.{i}.layer.1.DenseReluDense.wo.weight"
]
state_dict[f"encoder.block.{i}.layer.1.layer_norm.weight"] = official_weights["state_dict"][
f"transformer.encoder.block.{i}.layer.1.layer_norm.weight"
]
# load each decoder block
for i in range(6):
# layer 0
state_dict[f"decoder.block.{i}.layer.0.SelfAttention.q.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.0.SelfAttention.q.weight"
]
state_dict[f"decoder.block.{i}.layer.0.SelfAttention.k.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.0.SelfAttention.k.weight"
]
state_dict[f"decoder.block.{i}.layer.0.SelfAttention.v.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.0.SelfAttention.v.weight"
]
state_dict[f"decoder.block.{i}.layer.0.SelfAttention.o.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.0.SelfAttention.o.weight"
]
state_dict[f"decoder.block.{i}.layer.0.layer_norm.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.0.layer_norm.weight"
]
# layer 1
state_dict[f"decoder.block.{i}.layer.1.EncDecAttention.q.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.1.EncDecAttention.q.weight"
]
state_dict[f"decoder.block.{i}.layer.1.EncDecAttention.k.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.1.EncDecAttention.k.weight"
]
state_dict[f"decoder.block.{i}.layer.1.EncDecAttention.v.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.1.EncDecAttention.v.weight"
]
state_dict[f"decoder.block.{i}.layer.1.EncDecAttention.o.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.1.EncDecAttention.o.weight"
]
state_dict[f"decoder.block.{i}.layer.1.layer_norm.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.1.layer_norm.weight"
]
# layer 2
state_dict[f"decoder.block.{i}.layer.2.DenseReluDense.wi_0.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.2.DenseReluDense.wi_0.weight"
]
state_dict[f"decoder.block.{i}.layer.2.DenseReluDense.wi_1.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.2.DenseReluDense.wi_1.weight"
]
state_dict[f"decoder.block.{i}.layer.2.DenseReluDense.wo.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.2.DenseReluDense.wo.weight"
]
state_dict[f"decoder.block.{i}.layer.2.layer_norm.weight"] = official_weights["state_dict"][
f"transformer.decoder.block.{i}.layer.2.layer_norm.weight"
]
model.load_state_dict(state_dict, strict=True)
# save the weights
torch.save(state_dict, "./pytorch_model.bin")
########################## TOKENIZER ##########################
# the tokenize and detokenize methods are taken from the official implementation
# link : https://github.com/sweetcocoa/pop2piano/blob/fac11e8dcfc73487513f4588e8d0c22a22f2fdc5/midi_tokenizer.py#L34
def tokenize(idx, token_type, n_special=4, n_note=128, n_velocity=2):
if token_type == "TOKEN_TIME":
return n_special + n_note + n_velocity + idx
elif token_type == "TOKEN_VELOCITY":
return n_special + n_note + idx
elif token_type == "TOKEN_NOTE":
return n_special + idx
elif token_type == "TOKEN_SPECIAL":
return idx
else:
return -1
# link : https://github.com/sweetcocoa/pop2piano/blob/fac11e8dcfc73487513f4588e8d0c22a22f2fdc5/midi_tokenizer.py#L48
def detokenize(idx, n_special=4, n_note=128, n_velocity=2, time_idx_offset=0):
if idx >= n_special + n_note + n_velocity:
return "TOKEN_TIME", (idx - (n_special + n_note + n_velocity)) + time_idx_offset
elif idx >= n_special + n_note:
return "TOKEN_VELOCITY", idx - (n_special + n_note)
elif idx >= n_special:
return "TOKEN_NOTE", idx - n_special
else:
return "TOKEN_SPECIAL", idx
# create the decoder and then the encoder of the tokenizer
decoder = {}
for i in range(cfg.vocab_size):
decoder.update({i: f"{detokenize(i)[1]}_{detokenize(i)[0]}"})
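# resulting entries look like: 0 -> "0_TOKEN_SPECIAL", 4 -> "0_TOKEN_NOTE", 132 -> "0_TOKEN_VELOCITY", 134 -> "0_TOKEN_TIME"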
encoder = {v: k for k, v in decoder.items()}
# save the vocab
with open("./vocab.json", "w") as file:
file.write(json.dumps(encoder))
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Feature extractor class for Pop2Piano"""
import copy
import warnings
from typing import List, Optional, Union
import numpy
import numpy as np
from ...audio_utils import mel_filter_bank, spectrogram
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import (
TensorType,
is_essentia_available,
is_librosa_available,
is_scipy_available,
logging,
requires_backends,
)
if is_essentia_available():
import essentia
import essentia.standard
if is_librosa_available():
import librosa
if is_scipy_available():
import scipy
logger = logging.get_logger(__name__)
class Pop2PianoFeatureExtractor(SequenceFeatureExtractor):
r"""
Constructs a Pop2Piano feature extractor.
This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
most of the main methods. Users should refer to this superclass for more information regarding those methods.
This class extracts rhythm and preprocesses the audio before it is passed to the model. First the audio is passed
to the `RhythmExtractor2013` algorithm, which extracts the beat times and beat positions and estimates their confidence
as well as the tempo in bpm. The beat times are then interpolated to get beatsteps, and extrapolated_beatstep values are
computed from them for use in the tokenizer. Separately, the audio is resampled to `self.sampling_rate` and preprocessed,
and a log-mel spectrogram is computed from it to be used in the transformer model.
Args:
sampling_rate (`int`, *optional*, defaults to 22050):
Target sampling rate of the audio signal. This is the sampling rate that is forwarded to the model.
padding_value (`int`, *optional*, defaults to 0):
Padding value used to pad the audio. Should correspond to silences.
window_size (`int`, *optional*, defaults to 4096):
Length of the window in samples to which the Fourier transform is applied.
hop_length (`int`, *optional*, defaults to 1024):
Step size between each window of the waveform, in samples.
min_frequency (`float`, *optional*, defaults to 10.0):
Lowest frequency that will be used in the log-mel spectrogram.
feature_size (`int`, *optional*, defaults to 512):
The feature dimension of the extracted features.
num_bars (`int`, *optional*, defaults to 2):
Determines the interval between each sequence.
"""
model_input_names = ["input_features", "beatsteps", "extrapolated_beatstep"]
def __init__(
self,
sampling_rate: int = 22050,
padding_value: int = 0,
window_size: int = 4096,
hop_length: int = 1024,
min_frequency: float = 10.0,
feature_size: int = 512,
num_bars: int = 2,
**kwargs,
):
super().__init__(
feature_size=feature_size,
sampling_rate=sampling_rate,
padding_value=padding_value,
**kwargs,
)
self.sampling_rate = sampling_rate
self.padding_value = padding_value
self.window_size = window_size
self.hop_length = hop_length
self.min_frequency = min_frequency
self.feature_size = feature_size
self.num_bars = num_bars
self.mel_filters = mel_filter_bank(
num_frequency_bins=(self.window_size // 2) + 1,
num_mel_filters=self.feature_size,
min_frequency=self.min_frequency,
max_frequency=float(self.sampling_rate // 2),
sampling_rate=self.sampling_rate,
norm=None,
mel_scale="htk",
)
def mel_spectrogram(self, sequence: np.ndarray):
"""
Generates MelSpectrogram.
Args:
sequence (`numpy.ndarray`):
The sequence of which the mel-spectrogram will be computed.
"""
mel_specs = []
for seq in sequence:
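# periodic Hann window of length `window_size` (np.hanning(N + 1)[:-1] drops the duplicated endpoint)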
window = np.hanning(self.window_size + 1)[:-1]
mel_specs.append(
spectrogram(
waveform=seq,
window=window,
frame_length=self.window_size,
hop_length=self.hop_length,
power=2.0,
mel_filters=self.mel_filters,
)
)
mel_specs = np.array(mel_specs)
return mel_specs
def extract_rhythm(self, audio: np.ndarray):
"""
This algorithm (`RhythmExtractor2013`) extracts the beat positions and estimates their confidence, as well as the
tempo in bpm, for an audio signal. For more information please visit
https://essentia.upf.edu/reference/std_RhythmExtractor2013.html.
Args:
audio(`numpy.ndarray`):
raw audio waveform which is passed to the Rhythm Extractor.
"""
requires_backends(self, ["essentia"])
essentia_tracker = essentia.standard.RhythmExtractor2013(method="multifeature")
bpm, beat_times, confidence, estimates, essentia_beat_intervals = essentia_tracker(audio)
return bpm, beat_times, confidence, estimates, essentia_beat_intervals
def interpolate_beat_times(
self, beat_times: np.ndarray, steps_per_beat: int, n_extend: int
):
"""
This method takes beat_times and interpolates them using `scipy.interpolate.interp1d`; the output is then
used to convert the raw audio to a log-mel spectrogram.
Args:
beat_times (`numpy.ndarray`):
beat_times is passed into `scipy.interpolate.interp1d` for processing.
steps_per_beat (`int`):
Number of interpolation points per beat; used as a parameter to control the interpolation.
n_extend (`int`):
Number of points to extrapolate beyond the last beat; used as a parameter to control the interpolation.
"""
requires_backends(self, ["scipy"])
beat_times_function = scipy.interpolate.interp1d(
np.arange(beat_times.size),
beat_times,
bounds_error=False,
fill_value="extrapolate",
)
ext_beats = beat_times_function(
np.linspace(0, beat_times.size + n_extend - 1, beat_times.size * steps_per_beat + n_extend)
)
return ext_beats
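# Worked toy example, assuming the enclosing class is `Pop2PianoFeatureExtractor` and evenly spaced beats
# 0.5 s apart: `steps_per_beat=2` doubles the number of points per beat and `n_extend=1` extrapolates one
# extra step past the last beat.
#
#   >>> fe = Pop2PianoFeatureExtractor()
#   >>> fe.interpolate_beat_times(beat_times=np.array([0.5, 1.0, 1.5]), steps_per_beat=2, n_extend=1)
#   array([0.5 , 0.75, 1.  , 1.25, 1.5 , 1.75, 2.  ])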
def preprocess_mel(self, audio: np.ndarray, beatstep: np.ndarray):
"""
Preprocessing for the log-mel spectrogram.
Args:
audio (`numpy.ndarray` of shape `(audio_length, )` ):
Raw audio waveform to be processed.
beatstep (`numpy.ndarray`):
Interpolated beat times of the raw audio. If `beatstep[0]` is greater than 0.0, all values are shifted
down by `beatstep[0]` so that the first beat starts at 0.0.
"""
if audio is not None and len(audio.shape) != 1:
raise ValueError(
f"Expected `audio` to be a single channel audio input of shape `(n, )` but found shape {audio.shape}."
)
if beatstep[0] > 0.0:
beatstep = beatstep - beatstep[0]
num_steps = self.num_bars * 4
num_target_steps = len(beatstep)
extrapolated_beatstep = self.interpolate_beat_times(
beat_times=beatstep, steps_per_beat=1, n_extend=(self.num_bars + 1) * 4 + 1
)
sample_indices = []
max_feature_length = 0
for i in range(0, num_target_steps, num_steps):
start_idx = i
end_idx = min(i + num_steps, num_target_steps)
start_sample = int(extrapolated_beatstep[start_idx] * self.sampling_rate)
end_sample = int(extrapolated_beatstep[end_idx] * self.sampling_rate)
sample_indices.append((start_sample, end_sample))
max_feature_length = max(max_feature_length, end_sample - start_sample)
padded_batch = []
for start_sample, end_sample in sample_indices:
feature = audio[start_sample:end_sample]
padded_feature = np.pad(
feature,
((0, max_feature_length - feature.shape[0]),),
"constant",
constant_values=0,
)
padded_batch.append(padded_feature)
padded_batch = np.asarray(padded_batch)
return padded_batch, extrapolated_beatstep
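# Illustrative note: with the default `num_bars=2`, `num_steps` is 8, so the audio is sliced into windows of
# 8 beats each (using the extrapolated beat times as sample boundaries) and every window is right-padded with
# zeros to the length of the longest one. A hedged sketch with hypothetical `mono_audio` and `beatsteps`:
#
#   >>> padded_batch, extrapolated_beatstep = fe.preprocess_mel(mono_audio, beatsteps - beatsteps[0])
#   >>> padded_batch.shape   # (num_windows, longest_window_length_in_samples)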
def _pad(self, features: np.ndarray, add_zero_line=True):
features_shapes = [each_feature.shape for each_feature in features]
attention_masks, padded_features = [], []
for i, each_feature in enumerate(features):
# To pad "input_features".
if len(each_feature.shape) == 3:
features_pad_value = max([*zip(*features_shapes)][1]) - features_shapes[i][1]
attention_mask = np.ones(features_shapes[i][:2], dtype=np.int64)
feature_padding = ((0, 0), (0, features_pad_value), (0, 0))
attention_mask_padding = (feature_padding[0], feature_padding[1])
# To pad "beatsteps" and "extrapolated_beatstep".
else:
each_feature = each_feature.reshape(1, -1)
features_pad_value = max([*zip(*features_shapes)][0]) - features_shapes[i][0]
attention_mask = np.ones(features_shapes[i], dtype=np.int64).reshape(1, -1)
feature_padding = attention_mask_padding = ((0, 0), (0, features_pad_value))
each_padded_feature = np.pad(each_feature, feature_padding, "constant", constant_values=self.padding_value)
attention_mask = np.pad(
attention_mask, attention_mask_padding, "constant", constant_values=self.padding_value
)
if add_zero_line:
# if it is batched then we separate each example using a zero array
zero_array_len = max([*zip(*features_shapes)][1])
# we concatenate the zero array line here
each_padded_feature = np.concatenate(
[each_padded_feature, np.zeros([1, zero_array_len, self.feature_size])], axis=0
)
attention_mask = np.concatenate(
[attention_mask, np.zeros([1, zero_array_len], dtype=attention_mask.dtype)], axis=0
)
padded_features.append(each_padded_feature)
attention_masks.append(attention_mask)
padded_features = np.concatenate(padded_features, axis=0).astype(np.float32)
attention_masks = np.concatenate(attention_masks, axis=0).astype(np.int64)
return padded_features, attention_masks
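# Padding sketch: `_pad` right-pads every example to the longest one in the batch. 3-D "input_features" of
# shape (num_windows, num_frames, feature_size) are padded along the frame axis, while 1-D
# "beatsteps"/"extrapolated_beatstep" are reshaped to (1, length) and padded along the last axis; the returned
# attention masks mark real positions with 1 and padded positions with `self.padding_value` (0 by default).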
def pad(
self,
inputs: BatchFeature,
is_batched: bool,
return_attention_mask: bool,
return_tensors: Optional[Union[str, TensorType]] = None,
):
"""
Pads the inputs to the same length and returns attention_mask.
Args:
inputs (`BatchFeature`):
Processed audio features.
is_batched (`bool`):
Whether inputs are batched or not.
return_attention_mask (`bool`):
Whether to return attention mask or not.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of a list of Python numbers. Acceptable values are:
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
If nothing is specified, it will return a list of `np.ndarray` arrays.
Return:
`BatchFeature` with attention_mask, attention_mask_beatsteps and attention_mask_extrapolated_beatstep added
to it:
- **attention_mask** numpy.ndarray of shape `(batch_size, max_input_features_seq_length)` --
Example :
1, 1, 1, 0, 0 (audio 1; it is padded to the max length of 5, which is why there are 2 zeros at
the end indicating padding)
0, 0, 0, 0, 0 (zero pad to separate audio 1 and 2)
1, 1, 1, 1, 1 (audio 2)
0, 0, 0, 0, 0 (zero pad to separate audio 2 and 3)
1, 1, 1, 1, 1 (audio 3)
- **attention_mask_beatsteps** numpy.ndarray of shape `(batch_size, max_beatsteps_seq_length)`
- **attention_mask_extrapolated_beatstep** numpy.ndarray of shape `(batch_size,
max_extrapolated_beatstep_seq_length)`
"""
processed_features_dict = {}
for feature_name, feature_value in inputs.items():
if feature_name == "input_features":
padded_feature_values, attention_mask = self._pad(feature_value, add_zero_line=True)
processed_features_dict[feature_name] = padded_feature_values
if return_attention_mask:
processed_features_dict["attention_mask"] = attention_mask
else:
padded_feature_values, attention_mask = self._pad(feature_value, add_zero_line=False)
processed_features_dict[feature_name] = padded_feature_values
if return_attention_mask:
processed_features_dict[f"attention_mask_{feature_name}"] = attention_mask
# If we are processing only one example, we should remove the zero array line since we don't need it to
# separate examples from each other.
if not is_batched and not return_attention_mask:
processed_features_dict["input_features"] = processed_features_dict["input_features"][:-1, ...]
outputs = BatchFeature(processed_features_dict, tensor_type=return_tensors)
return outputs
def __call__(
self,
audio: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
sampling_rate: Union[int, List[int]],
steps_per_beat: int = 2,
resample: Optional[bool] = True,
return_attention_mask: Optional[bool] = False,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs,
) -> BatchFeature:
"""
Main method to featurize and prepare for the model.
Args:
audio (`np.ndarray`, `List`):
The audio or batch of audio to be processed. Each audio can be a numpy array, a list of float values, a
list of numpy arrays or a list of lists of float values.
sampling_rate (`int`):
The sampling rate at which the `audio` input was sampled. It is strongly recommended to pass
`sampling_rate` at the forward call to prevent silent errors.
steps_per_beat (`int`, *optional*, defaults to 2):
This is used in interpolating `beat_times`.
resample (`bool`, *optional*, defaults to `True`):
Determines whether the audio should be resampled to `self.sampling_rate` before processing. Must be
`True` during inference.
return_attention_mask (`bool`, *optional*, defaults to `False`):
Whether to return attention masks for input_features, beatsteps and extrapolated_beatstep. If set to
`None`, it is resolved to `True` for batched inputs and `False` for a single input.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of a list of Python numbers. Acceptable values are:
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
If nothing is specified, it will return a list of `np.ndarray` arrays.
"""
requires_backends(self, ["librosa"])
is_batched = bool(isinstance(audio, (list, tuple)) and isinstance(audio[0], (np.ndarray, tuple, list)))
if is_batched:
# This enables the user to process files with different sampling rates at the same time
if not isinstance(sampling_rate, list):
raise ValueError(
"Please give sampling_rate of each audio separately when you are passing multiple raw_audios at the same time. "
f"Received {sampling_rate}, expected [audio_1_sr, ..., audio_n_sr]."
)
return_attention_mask = True if return_attention_mask is None else return_attention_mask
else:
audio = [audio]
sampling_rate = [sampling_rate]
return_attention_mask = False if return_attention_mask is None else return_attention_mask
batch_input_features, batch_beatsteps, batch_ext_beatstep = [], [], []
for single_raw_audio, single_sampling_rate in zip(audio, sampling_rate):
bpm, beat_times, confidence, estimates, essentia_beat_intervals = self.extract_rhythm(
audio=single_raw_audio
)
beatsteps = self.interpolate_beat_times(beat_times=beat_times, steps_per_beat=steps_per_beat, n_extend=1)
if self.sampling_rate != single_sampling_rate and self.sampling_rate is not None:
if resample:
# Change sampling_rate to self.sampling_rate
single_raw_audio = librosa.core.resample(
single_raw_audio,
orig_sr=single_sampling_rate,
target_sr=self.sampling_rate,
res_type="kaiser_best",
)
else:
warnings.warn(
"The sampling_rate of the provided audio is different from the target sampling_rate "
f"of the Feature Extractor, {self.sampling_rate} vs {single_sampling_rate}. "
"In these cases it is recommended to use `resample=True` in the `__call__` method to "
"get the optimal behaviour."
)
single_sampling_rate = self.sampling_rate
start_sample = int(beatsteps[0] * single_sampling_rate)
end_sample = int(beatsteps[-1] * single_sampling_rate)
input_features, extrapolated_beatstep = self.preprocess_mel(
single_raw_audio[start_sample:end_sample], beatsteps - beatsteps[0]
)
mel_specs = self.mel_spectrogram(input_features.astype(np.float32))
# apply np.log to get log mel-spectrograms
log_mel_specs = np.log(np.clip(mel_specs, a_min=1e-6, a_max=None))
input_features = np.transpose(log_mel_specs, (0, -1, -2))
batch_input_features.append(input_features)
batch_beatsteps.append(beatsteps)
batch_ext_beatstep.append(extrapolated_beatstep)
output = BatchFeature(
{
"input_features": batch_input_features,
"beatsteps": batch_beatsteps,
"extrapolated_beatstep": batch_ext_beatstep,
}
)
output = self.pad(
output,
is_batched=is_batched,
return_attention_mask=return_attention_mask,
return_tensors=return_tensors,
)
return output
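# End-to-end usage sketch; the checkpoint name, the file name and the use of librosa for loading are
# illustrative assumptions, not prescribed by this file:
#
#   >>> import librosa
#   >>> from transformers import Pop2PianoFeatureExtractor
#   >>> audio, sr = librosa.load("song.mp3", sr=44100)   # mono waveform and its sampling rate
#   >>> feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("sweetcocoa/pop2piano")
#   >>> inputs = feature_extractor(audio=audio, sampling_rate=sr, return_tensors="pt")
#   >>> inputs["input_features"].shape   # roughly (num_windows, num_frames, feature_size)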
def to_dict(self):
"""
Serializes this instance to a Python dictionary.
Returns:
`Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
"""
output = copy.deepcopy(self.__dict__)
output["feature_extractor_type"] = self.__class__.__name__
if "mel_filters" in output:
del output["mel_filters"]
return output
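# Serialization note: dropping "mel_filters" keeps the serialized config small, since the filter bank is fully
# determined by the remaining attributes and is rebuilt in `__init__` whenever the feature extractor is
# instantiated again (e.g. via `from_pretrained`).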