Unverified Commit 45e14038 authored by Arthur's avatar Arthur Committed by GitHub
Browse files

Add WhisperModel to transformers (#19166)



* simplify loop

* add featur extractor

* add model

* start conversion

* add dropout

* initial commit of test files

* copnversion for all models

* update processor for correct padding

* update feature extraction

* update integration test logits match

* fmnt: off for the logits

* on the fly mel bank

* small nit

* update test

* update tokenizer

* nit feature extraction

* update

* update tokenizer test

* adds logit processor and update tokenizer to get supress tokens

* style

* clean convert

* revert to original modeling tf utils

* Update

* update

* nit

* clean convert file

* update tests and nits

* quality

* slow generation test

* ffn_dim to allow customization

* update readme

* add to toctreee

* start fixing integration tests

* update tests and code

* fix feature extractor

* fix config tests common

* update code to fix tests

* fix feature exctractor

* nit feature extraction

* update test for new feature extractor

* style

* add absrtact

* large logits wioth custom decoder input ids

* wraap around is otrch available

* fix feature extractor

* correct logits for whisper small.en

* nit

* fix encoder_attentino_mask

* some fixes

* remove unnecessary inputs

* nits

* add normalizer file

* update etst tokenization

* fix attention mask not defined

* Add model to README

* Fix doc tests

* fix generate

* remove uncoder attention mask useless

* update test modeling whisper

* update condfig to add second non supress tokens

* nits on feature exrtactor

* nit for test tokenizers

* update etsts

* update tests

* update tokenization test

* fixup

* invalidated hf token. Clean convert openai to whisper

* fix logit tests

* fixup

* clean merge

* revert toc_tree changes

* remove useless LogitProcessor

* Update whisper .mdx

* update config file doc

* update configuration docstring

* update test tokenization

* update test tokenization

* update tokenization whisper
Added copied from where needed

* update feature extraction

* nit test name

* style

* quality

* remove get suppress tokens and update non_speech tokens global variables

* Update src/transformers/models/whisper/feature_extraction_whisper.py
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>

* clean modeling whisper and test
Removed the attention mask arguments that are deprecated

* fix large test

* Add multilingual audio test, and translate test

* style

* fix larg multilingual test

* nits

* Update docs/source/en/model_doc/whisper.mdx
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>

* add copied from for attention layer

* remove attention masks in doc

* add english normalizer

* update tokenization test

* remove copied from in whisper attention : no bias in k_proj only

* wrap around dependencies in english normalizer

* style

* correct import generation logits

* for now, wrap feature extractor with torch

* Update src/transformers/models/whisper/convert_openai_whisper_to_tfms.py
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/models/whisper/configuration_whisper.py
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update docs/source/en/model_doc/whisper.mdx
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>

* remove torch depencies for feature extraction and style

* fixup

* nit

* update logitds

* style

* nit

* nits and fix final tests

* add `is_more_itertools_available` to utils

* quality

* add begin supress tokens, supress tokens to generate args and config

* clean supressTokensLogitProcessor in generation logits

* Nit naming

* add supressTokensAtBegin

* udpate tests, supress tokens to None or correct values

* nit and style

* update RAG to fit test and generate_logit

* add copy pasted statment on english normalizer

* add arguments to config_common_kwargs

* Update src/transformers/generation_utils.py
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/generation_logits_process.py
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/models/whisper/configuration_whisper.py
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Apply suggestions from code review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>

* revert changes based on reviews

* update doc and nits

* more nits

* last nits

* update test configuration common

* add BART name in decoder attention mask documentation

* Update src/transformers/models/whisper/modeling_whisper.py
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>

* style

* nit

* nit

* add english.json file to git

* nits on documentation

* nit

* nits

* last styling

* add main toctree file

* remove sentence piece dependency

* clean init file

* fix tokenizer that has no dependencies on sentencepiece

* update whisper init file, nit

* remove english.json file

* add get decoder prompt id

* revert changes and add forced logit processor

* nit

* clean normalizer

* remove protected

* update

* Update src/transformers/models/whisper/configuration_whisper.py
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* update based on review

* Update src/transformers/models/whisper/configuration_whisper.py
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add batched tests
Co-authored-by: default avatarPatrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: default avatarNielsRogge <niels.rogge1@gmail.com>
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 7598791c
......@@ -393,6 +393,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[Whisper](https://huggingface.co/docs/transformers/main/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
......
......@@ -343,6 +343,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[Whisper](https://huggingface.co/docs/transformers/main/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
......
......@@ -367,6 +367,7 @@ conda install -c huggingface transformers
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (来自 Facebook AI) 伴随论文 [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) 由 Qiantong Xu, Alexei Baevski, Michael Auli 发布。
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[Whisper](https://huggingface.co/docs/transformers/main/model_doc/whisper)** (来自 OpenAI) 伴随论文 [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) 由 Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever 发布。
1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (来自 Microsoft Research) 伴随论文 [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) 由 Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling 发布。
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (来自 Facebook) 伴随论文 [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) 由 Guillaume Lample and Alexis Conneau 发布。
......
......@@ -379,6 +379,7 @@ conda install -c huggingface transformers
1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[Whisper](https://huggingface.co/docs/transformers/main/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
......
......@@ -449,6 +449,8 @@
title: Wav2Vec2Phoneme
- local: model_doc/wavlm
title: WavLM
- local: model_doc/whisper
title: Whisper
- local: model_doc/xls_r
title: XLS-R
- local: model_doc/xlsr_wav2vec2
......
......@@ -183,6 +183,7 @@ The documentation is organized into five sections:
1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
1. **[Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
1. **[WavLM](model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
1. **[Whisper](model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
1. **[X-CLIP](model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
1. **[XGLM](model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li.
1. **[XLM](model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
......@@ -329,6 +330,7 @@ Flax), PyTorch, and/or TensorFlow.
| Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ |
| Wav2Vec2-Conformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| WavLM | ❌ | ❌ | ✅ | ❌ | ❌ |
| Whisper | ✅ | ❌ | ✅ | ❌ | ❌ |
| X-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
| XGLM | ✅ | ✅ | ✅ | ✅ | ✅ |
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
......
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Whisper
## Overview
The Whisper model was proposed in [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
The abstract from the paper is the following:
*We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.*
Tips:
- The model usually performs well without requiring any finetuning.
- The architecture follows a classic encoder-decoder architecture, which means that it relies on the [`~generation_utils.GenerationMixin.generate`] function for inference.
- One can use [`WhisperProcessor`] to prepare audio for the model, and decode the predicted ID's back into text.
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/openai/whisper).
## WhisperConfig
[[autodoc]] WhisperConfig
## WhisperTokenizer
[[autodoc]] WhisperTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## WhisperFeatureExtractor
[[autodoc]] WhisperFeatureExtractor
- __call__
## WhisperProcessor
[[autodoc]] WhisperProcessor
- __call__
- from_pretrained
- save_pretrained
- batch_decode
- decode
## WhisperModel
[[autodoc]] WhisperModel
- forward
## WhisperForConditionalGeneration
[[autodoc]] WhisperForConditionalGeneration
- forward
......@@ -390,6 +390,13 @@ _import_structure = {
"WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP",
"WavLMConfig",
],
"models.whisper": [
"WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP",
"WhisperConfig",
"WhisperFeatureExtractor",
"WhisperProcessor",
"WhisperTokenizer",
],
"models.x_clip": [
"XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"XCLIPConfig",
......@@ -1863,6 +1870,14 @@ else:
"Speech2TextPreTrainedModel",
]
)
_import_structure["models.whisper"].extend(
[
"WHISPER_PRETRAINED_MODEL_ARCHIVE_LIST",
"WhisperForConditionalGeneration",
"WhisperModel",
"WhisperPreTrainedModel",
]
)
_import_structure["models.speech_to_text_2"].extend(["Speech2Text2ForCausalLM", "Speech2Text2PreTrainedModel"])
_import_structure["models.splinter"].extend(
[
......@@ -3339,6 +3354,13 @@ if TYPE_CHECKING:
from .models.wav2vec2_phoneme import Wav2Vec2PhonemeCTCTokenizer
from .models.wav2vec2_with_lm import Wav2Vec2ProcessorWithLM
from .models.wavlm import WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP, WavLMConfig
from .models.whisper import (
WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP,
WhisperConfig,
WhisperFeatureExtractor,
WhisperProcessor,
WhisperTokenizer,
)
from .models.x_clip import (
XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
XCLIPConfig,
......@@ -4724,6 +4746,12 @@ if TYPE_CHECKING:
WavLMModel,
WavLMPreTrainedModel,
)
from .models.whisper import (
WHISPER_PRETRAINED_MODEL_ARCHIVE_LIST,
WhisperForConditionalGeneration,
WhisperModel,
WhisperPreTrainedModel,
)
from .models.x_clip import (
XCLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
XCLIPModel,
......
......@@ -299,6 +299,8 @@ class PretrainedConfig(PushToHubMixin):
self.forced_eos_token_id = kwargs.pop("forced_eos_token_id", None)
self.remove_invalid_values = kwargs.pop("remove_invalid_values", False)
self.exponential_decay_length_penalty = kwargs.pop("exponential_decay_length_penalty", None)
self.suppress_tokens = kwargs.pop("suppress_tokens", None)
self.begin_suppress_tokens = kwargs.pop("begin_suppress_tokens", None)
# Fine-tuning task arguments
self.architectures = kwargs.pop("architectures", None)
......
......@@ -702,3 +702,49 @@ class LogitNormalization(LogitsProcessor, LogitsWarper):
def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
scores = scores.log_softmax(dim=-1)
return scores
class SuppressTokensAtBeginLogitsProcessor(LogitsProcessor):
r"""
[`SuppressTokensAtBeginLogitsProcessor`] supresses a list of tokens as soon as the `generate` function starts
generating using `begin_index` tokens. This should ensure that the tokens defined by `begin_suppress_tokens` at not
sampled at the begining of the generation.
"""
def __init__(self, begin_suppress_tokens, begin_index):
self.begin_suppress_tokens = list(begin_suppress_tokens)
self.begin_index = begin_index
def __call__(self, input_ids, scores):
if input_ids.shape[1] == self.begin_index:
scores[:, self.begin_suppress_tokens] = -float("inf")
return scores
class SuppressTokensLogitsProcessor(LogitsProcessor):
r"""This processor can be used to suppress a list of tokens. The processor will set their log probs to `-inf` so that they
are not sampled."""
def __init__(self, suppress_tokens):
self.suppress_tokens = list(suppress_tokens)
def __call__(self, input_ids, scores):
scores[:, self.suppress_tokens] = -float("inf")
return scores
class ForceTokensLogitsProcessor(LogitsProcessor):
r"""This processor can be used to force a list of tokens. The processor will set their log probs to `inf` so that they
are sampled at their corresponding index."""
def __init__(self, force_token_map):
self.force_token_map = dict(force_token_map)
def __call__(self, input_ids, scores):
generation_idx = input_ids.shape[-1]
current_token = self.force_token_map.get(generation_idx, None)
if current_token is not None:
scores[:, :] = -float("inf")
scores[:, current_token] = 0
return scores
......@@ -30,6 +30,7 @@ from .generation_logits_process import (
ExponentialDecayLengthPenalty,
ForcedBOSTokenLogitsProcessor,
ForcedEOSTokenLogitsProcessor,
ForceTokensLogitsProcessor,
HammingDiversityLogitsProcessor,
InfNanRemoveLogitsProcessor,
LogitNormalization,
......@@ -39,6 +40,8 @@ from .generation_logits_process import (
NoRepeatNGramLogitsProcessor,
PrefixConstrainedLogitsProcessor,
RepetitionPenaltyLogitsProcessor,
SuppressTokensAtBeginLogitsProcessor,
SuppressTokensLogitsProcessor,
TemperatureLogitsWarper,
TopKLogitsWarper,
TopPLogitsWarper,
......@@ -691,6 +694,9 @@ class GenerationMixin:
exponential_decay_length_penalty: Tuple,
logits_processor: Optional[LogitsProcessorList],
renormalize_logits: Optional[bool],
suppress_tokens: Optional[List[int]] = None,
begin_suppress_tokens: Optional[List[int]] = None,
forced_decoder_ids: Optional[List[int]] = None,
) -> LogitsProcessorList:
"""
This class returns a [`LogitsProcessorList`] list object that contains all relevant [`LogitsProcessor`]
......@@ -725,6 +731,12 @@ class GenerationMixin:
if exponential_decay_length_penalty is not None
else self.config.exponential_decay_length_penalty
)
suppress_tokens = suppress_tokens if suppress_tokens is not None else self.config.suppress_tokens
begin_suppress_tokens = (
begin_suppress_tokens if begin_suppress_tokens is not None else self.config.begin_suppress_tokens
)
if forced_decoder_ids is None and hasattr(self.config, "forced_decoder_ids"):
forced_decoder_ids = self.config.forced_decoder_ids
# instantiate processors list
# the following idea is largely copied from this PR: https://github.com/huggingface/transformers/pull/5420/files
......@@ -762,6 +774,16 @@ class GenerationMixin:
processors.append(
ExponentialDecayLengthPenalty(exponential_decay_length_penalty, eos_token_id, input_ids_seq_length)
)
if suppress_tokens is not None:
processors.append(SuppressTokensLogitsProcessor(suppress_tokens))
if begin_suppress_tokens is not None:
begin_index = input_ids_seq_length
begin_index = begin_index if (input_ids_seq_length > 1 or forced_bos_token_id is None) else begin_index + 1
if forced_decoder_ids is not None:
begin_index += forced_decoder_ids[-1][0] # generation starts after the last token that is forced
processors.append(SuppressTokensAtBeginLogitsProcessor(begin_suppress_tokens, begin_index))
if forced_decoder_ids is not None:
processors.append(ForceTokensLogitsProcessor(forced_decoder_ids))
processors = self._merge_criteria_processor_list(processors, logits_processor)
# `LogitNormalization` should always be the last logit processor, when present
if renormalize_logits is True:
......@@ -932,6 +954,9 @@ class GenerationMixin:
remove_invalid_values: Optional[bool] = None,
synced_gpus: Optional[bool] = False,
exponential_decay_length_penalty: Optional[Tuple[Union[int, float]]] = None,
suppress_tokens: Optional[List[int]] = None,
begin_suppress_tokens: Optional[List[int]] = None,
forced_decoder_ids: Optional[List[int]] = None,
**model_kwargs,
) -> Union[GreedySearchOutput, SampleOutput, BeamSearchOutput, BeamSampleOutput, torch.LongTensor]:
r"""
......@@ -1090,6 +1115,14 @@ class GenerationMixin:
This Tuple adds an exponentially increasing length penalty, after a certain amount of tokens have been
generated. The tuple shall consist of: `(start_index, decay_factor)` where `start_index` indicates
where penalty starts and `decay_factor` represents the factor of exponential decay
suppress_tokens (`List[int]`, *optional*, defaults to `model.config.suppress_tokens`):
A list of tokens that will be supressed at generation. The `SupressTokens` logit processor will set
their log probs to `-inf` so that they are not sampled.
begin_suppress_tokens (`List[int]`, *optional*, defaults to `model.config.begin_suppress_tokens`):
A list of tokens that will be supressed at the begining of the generation. The `SupressBeginTokens`
logit processor will set their log probs to `-inf` so that they are not sampled.
forced_decoder_ids (`List[int]`, *optional*, defaults to `model.config.forced_decoder_ids`):
A list of tokens that will be forced as beginning tokens, before sampling.
model_kwargs:
Additional model specific kwargs will be forwarded to the `forward` function of the model. If the model
......@@ -1337,6 +1370,9 @@ class GenerationMixin:
exponential_decay_length_penalty=exponential_decay_length_penalty,
logits_processor=logits_processor,
renormalize_logits=renormalize_logits,
suppress_tokens=suppress_tokens,
begin_suppress_tokens=begin_suppress_tokens,
forced_decoder_ids=forced_decoder_ids,
)
# 8. prepare stopping criteria
......
......@@ -159,6 +159,7 @@ from . import (
wav2vec2_phoneme,
wav2vec2_with_lm,
wavlm,
whisper,
x_clip,
xglm,
xlm,
......
......@@ -152,6 +152,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("wav2vec2", "Wav2Vec2Config"),
("wav2vec2-conformer", "Wav2Vec2ConformerConfig"),
("wavlm", "WavLMConfig"),
("whisper", "WhisperConfig"),
("xclip", "XCLIPConfig"),
("xglm", "XGLMConfig"),
("xlm", "XLMConfig"),
......@@ -276,6 +277,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("vit_msn", "VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("whisper", "WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xclip", "X_CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xglm", "XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("xlm", "XLM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
......@@ -434,6 +436,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("wav2vec2-conformer", "Wav2Vec2-Conformer"),
("wav2vec2_phoneme", "Wav2Vec2Phoneme"),
("wavlm", "WavLM"),
("whisper", "Whisper"),
("xclip", "X-CLIP"),
("xglm", "XGLM"),
("xlm", "XLM"),
......
......@@ -77,6 +77,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
("vit_msn", "ViTFeatureExtractor"),
("wav2vec2", "Wav2Vec2FeatureExtractor"),
("wav2vec2-conformer", "Wav2Vec2FeatureExtractor"),
("whisper", "WhisperFeatureExtractor"),
("xclip", "CLIPFeatureExtractor"),
("yolos", "YolosFeatureExtractor"),
]
......
......@@ -146,6 +146,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("wav2vec2", "Wav2Vec2Model"),
("wav2vec2-conformer", "Wav2Vec2ConformerModel"),
("wavlm", "WavLMModel"),
("whisper", "WhisperModel"),
("xclip", "XCLIPModel"),
("xglm", "XGLMModel"),
("xlm", "XLMModel"),
......@@ -273,6 +274,7 @@ MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict(
("tapas", "TapasForMaskedLM"),
("transfo-xl", "TransfoXLLMHeadModel"),
("wav2vec2", "Wav2Vec2ForMaskedLM"),
("whisper", "WhisperForConditionalGeneration"),
("xlm", "XLMWithLMHeadModel"),
("xlm-roberta", "XLMRobertaForMaskedLM"),
("xlm-roberta-xl", "XLMRobertaXLForMaskedLM"),
......@@ -500,6 +502,7 @@ MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
[
("speech-encoder-decoder", "SpeechEncoderDecoderModel"),
("speech_to_text", "Speech2TextForConditionalGeneration"),
("whisper", "WhisperForConditionalGeneration"),
]
)
......
......@@ -61,6 +61,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("wav2vec2-conformer", "Wav2Vec2Processor"),
("wav2vec2_with_lm", "Wav2Vec2ProcessorWithLM"),
("wavlm", "Wav2Vec2Processor"),
("whisper", "WhisperProcessor"),
("xclip", "CLIPProcessor"),
]
)
......
......@@ -255,6 +255,7 @@ else:
("wav2vec2", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)),
("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)),
("whisper", ("WhisperTokenizer" if is_sentencepiece_available() else None, None)),
("xclip", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
(
"xglm",
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
_import_structure = {
"configuration_whisper": ["WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP", "WhisperConfig"],
"feature_extraction_whisper": ["WhisperFeatureExtractor"],
"processing_whisper": ["WhisperProcessor"],
"tokenization_whisper": ["WhisperTokenizer"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_whisper"] = [
"WHISPER_PRETRAINED_MODEL_ARCHIVE_LIST",
"WhisperForConditionalGeneration",
"WhisperModel",
"WhisperPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_whisper import WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP, WhisperConfig
from .feature_extraction_whisper import WhisperFeatureExtractor
from .processing_whisper import WhisperProcessor
from .tokenization_whisper import WhisperTokenizer
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_whisper import (
WHISPER_PRETRAINED_MODEL_ARCHIVE_LIST,
WhisperForConditionalGeneration,
WhisperModel,
WhisperPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Whisper model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"openai/whisper-base": "https://huggingface.co/openai/whisper-base/resolve/main/config.json",
}
# fmt: off
NON_SPEECH_TOKENS = [
1, 2, 6, 7, 8, 9, 10, 12, 14, 25,
26, 27, 28, 29, 31, 58, 59, 60, 61, 62,
63, 90, 91, 92, 93, 357, 366, 438, 532, 685,
705, 796, 930, 1058, 1220, 1267, 1279, 1303, 1343, 1377,
1391, 1635, 1782, 1875, 2162, 2361, 2488, 3467, 4008, 4211,
4600, 4808, 5299, 5855, 6329, 7203, 9609, 9959, 10563, 10786,
11420, 11709, 11907, 13163, 13697, 13700, 14808, 15306, 16410, 16791,
17992, 19203, 19510, 20724, 22305, 22935, 27007, 30109, 30420, 33409,
34949, 40283, 40493, 40549, 47282, 49146, 50257, 50359, 50360, 50361
]
NON_SPEECH_TOKENS_MULTI = [
1, 2, 6, 7, 8, 9, 10, 12, 14, 25,
26, 27, 28, 29, 31, 58, 59, 60, 61, 62,
63, 90, 91, 92, 93, 359, 503, 522, 542, 873,
893, 902, 918, 922, 931, 1350, 1853, 1982, 2460, 2627,
3246, 3253, 3268, 3536, 3846, 3961, 4183, 4667, 6585, 6647,
7273, 9061, 9383, 10428, 10929, 11938, 12033, 12331, 12562, 13793,
14157, 14635, 15265, 15618, 16553, 16604, 18362, 18956, 20075, 21675,
22520, 26130, 26161, 26435, 28279, 29464, 31650, 32302, 32470, 36865,
42863, 47425, 49870, 50254, 50258, 50360, 50361, 50362
]
# fmt: on
class WhisperConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`WhisperModel`]. It is used to instantiate a
Whisper model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the Whisper
[openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 51865):
Vocabulary size of the Whisper model. Defines the number of different tokens that can be represented by the
`decoder_input_ids` passed when calling [`WhisperModel`]
num_mel_bins (`int`, *optional*, defaults to 80):
Number of mel features used per input features. Should correspond to the value used in the
`WhisperProcessor` class.
encoder_layers (`int`, *optional*, defaults to 6):
Number of encoder layers.
decoder_layers (`int`, *optional*, defaults to 6):
Number of decoder layers.
encoder_attention_heads (`int`, *optional*, defaults to 4):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (`int`, *optional*, defaults to 4):
Number of attention heads for each attention layer in the Transformer decoder.
encoder_ffn_dim (`int`, *optional*, defaults to 1536):
Dimensionality of the "intermediate" (often named feed-forward) layer in encoder.
decoder_ffn_dim (`int`, *optional*, defaults to 1536):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_layerdrop (`float`, *optional*, defaults to 0.0):
The LayerDrop probability for the encoder. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556)
for more details.
decoder_layerdrop (`float`, *optional*, defaults to 0.0):
The LayerDrop probability for the decoder. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556)
for more details.
decoder_start_token_id (`int`, *optional*, defaults to 50257):
Corresponds to the "<|startoftranscript|>" token, which is automatically used when no `decoder_input_ids`
are provided to the `generate` function. It is used to guide the model`s generation process depending on
the task.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
is_encoder_decoder (`bool`, *optional*, defaults to `True`):
Whether the model is used as an encoder/decoder or not.
activation_function (`str`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
d_model (`int`, *optional*, defaults to 256):
Dimensionality of the layers.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
init_std (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
scale_embedding (`bool`, *optional*, defaults to False):
Scale embeddings by diving by sqrt(d_model).
max_source_positions (`int`, *optional*, defaults to 1500):
The maximum sequence length of log-mel filter-bank features that this model might ever be used with.
max_target_positions (`int`, *optional*, defaults to 448):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
pad_token_id (`int`, *optional*, defaults to 50256):
Padding token id.
bos_token_id (`int`, *optional*, defaults to 50256):
Begin of stream token id.
eos_token_id (`int`, *optional*, defaults to 50257):
End of stream token id.
tie_word_embeddings (`bool`, *optional*, defaults to `True`):
Whether to tie input and output embeddings.
suppress_tokens (`List[int]`, *optional*):
A list containing the non-speech tokens that will be used by the logit processor in the `generate`
function. NON_SPEECH_TOKENS and NON_SPEECH_TOKENS_MULTI each correspond to the `english-only` and the
`multilingual` model.
begin_suppress_tokens (`List[int]`, *optional*, defaults to `[220,50256]`):
A list containing tokens that will be supressed at the beginning of the sampling process. Initialized as
the token for `" "` (`blank_token_id`) and the `eos_token_id`
Example:
```python
>>> from transformers import WhisperModel, WhisperConfig
>>> # Initializing a Whisper tiny style configuration
>>> configuration = WhisperConfig()
>>> # Initializing a model from the tiny style configuration
>>> model = WhisperModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "whisper"
keys_to_ignore_at_inference = ["past_key_values"]
attribute_map = {"num_attention_heads": "encoder_attention_heads", "hidden_size": "d_model"}
def __init__(
self,
vocab_size=51865,
num_mel_bins=80,
encoder_layers=6,
encoder_attention_heads=4,
decoder_layers=6,
decoder_attention_heads=4,
decoder_ffn_dim=1536,
encoder_ffn_dim=1536,
encoder_layerdrop=0.0,
decoder_layerdrop=0.0,
decoder_start_token_id=50257,
use_cache=True,
is_encoder_decoder=True,
activation_function="gelu",
d_model=256,
dropout=0.0,
attention_dropout=0.0,
activation_dropout=0.0,
init_std=0.02,
scale_embedding=False,
max_source_positions=1500,
max_target_positions=448,
pad_token_id=50256,
bos_token_id=50257,
eos_token_id=50256,
tie_word_embeddings=True,
suppress_tokens=None,
begin_suppress_tokens=[220, 50256],
**kwargs
):
self.vocab_size = vocab_size
self.num_mel_bins = num_mel_bins
self.d_model = d_model
self.encoder_layers = encoder_layers
self.encoder_attention_heads = encoder_attention_heads
self.decoder_layers = decoder_layers
self.decoder_attention_heads = decoder_attention_heads
self.decoder_ffn_dim = decoder_ffn_dim
self.encoder_ffn_dim = encoder_ffn_dim
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.activation_function = activation_function
self.init_std = init_std
self.encoder_layerdrop = encoder_layerdrop
self.decoder_layerdrop = decoder_layerdrop
self.use_cache = use_cache
self.num_hidden_layers = encoder_layers
self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True
self.tie_word_embeddings = tie_word_embeddings
self.max_source_positions = max_source_positions
self.max_target_positions = max_target_positions
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
is_encoder_decoder=is_encoder_decoder,
decoder_start_token_id=decoder_start_token_id,
tie_word_embeddings=tie_word_embeddings,
suppress_tokens=suppress_tokens,
begin_suppress_tokens=begin_suppress_tokens,
**kwargs,
)
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment