Add Seamless M4T model (#25693)

* first raw commit * still POC * tentative convert script * almost working speech encoder conversion scripts * intermediate code for encoder/decoders * add modeling code * first version of speech encoder * make style * add new adapter layer architecture * add adapter block * add first tentative config * add working speech encoder conversion * base model convert works now * make style * remove unnecessary classes * remove unecessary functions * add modeling code speech encoder * rework logics * forward pass of sub components work * add modeling codes * some config modifs and modeling code modifs * save WIP * new edits * same output speech encoder * correct attention mask * correct attention mask * fix generation * new generation logics * erase comments * make style * fix typo * add some descriptions * new state * clean imports * add tests * make style * make beam search and num_return_sequences>1 works * correct edge case issue * correct SeamlessM4TConformerSamePadLayer copied from * replace ACT2FN relu by nn.relu * remove unecessary return variable * move back a class * change name conformer_attention_mask ->conv_attention_mask * better nit code * add some Copied from statements * small nits * small nit in dict.get * rename t2u model -> conditionalgeneration * ongoing refactoring of structure * update models architecture * remove SeamlessM4TMultiModal classes * add tests * adapt tests * some non-working code for vocoder * add seamlessM4T vocoder * remove buggy line * fix some hifigan related bugs * remove hifigan specifc config * change * add WIP tokenization * add seamlessM4T working tokenzier * update tokenization * add tentative feature extractor * Update converting script * update working FE * refactor input_values -> input_features * update FE * changes in generation, tokenizer and modeling * make style and add t2u_decoder_input_ids * add intermediate outputs for ToSpeech models * add vocoder to speech models * update valueerror * update FE with languages * add vocoder convert * update config docstrings and names * update generation code and configuration * remove todos and update config.pad_token_id to generation_config.pad_token_id * move block vocoder * remove unecessary code and uniformize tospeech code * add feature extractor import * make style and fix some copies from * correct consistency + make fix-copies * add processor code * remove comments * add fast tokenizer support * correct pad_token_id in M4TModel * correct config * update tests and codes + make style * make some suggested correstion - correct comments and change naming * rename some attributes * rename some attributes * remove unecessary sequential * remove option to use dur predictor * nit * refactor hifigan * replace normalize_mean and normalize_var with do_normalize + save lang ids to generation config * add tests * change tgt_lang logic * update generation ToSpeech * add support import SeamlessM4TProcessor * fix generate * make tests * update integration tests, add option to only return text and update tokenizer fast * fix wrong function call * update import and convert script * update integration tests + update repo id * correct paths and add first test * update how new attention masks are computed * update tests * take first care of batching in vocoder code * add batching with the vocoder * add waveform lengths to model outputs * make style * add generate kwargs + forward kwargs of M4TModel * add docstrings forward methods * reformate docstrings * add docstrings t2u model * add another round of modeling docstrings + reformate speaker_id -> spkr_id * make style * fix check_repo * make style * add seamlessm4t to toctree * correct check_config_attributes * write config docstrings + some modifs * make style * add docstrings tokenizer * add docstrings to processor, fe and tokenizers * make style * write first version of model docs * fix FE + correct FE test * fix tokenizer + add correct integration tests * fix most tokenization tests * make style * correct most processor test * add generation tests and fix num_return_sequences > 1 * correct integration tests -still one left * make style * correct position embedding * change numbeams to 1 * refactor some modeling code and correct one test * make style * correct typo * refactor intermediate fnn * refactor feedforward conformer * make style * remove comments * make style * fix tokenizer tests * make style * correct processor tests * make style * correct S2TT integration * Apply suggestions from Sanchit code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * correct typo * replace torch.nn->nn + make style * change Output naming (waveforms -> waveform) and ordering * nit renaming and formating * remove return None when not necessary * refactor SeamlessM4TConformerFeedForward * nit typo * remove almost copied from comments * add a copied from comment and remove an unecessary dropout * remove inputs_embeds from speechencoder * remove backward compatibiliy function * reformate class docstrings for a few components * remove unecessary methods * split over 2 lines smthg hard to read * make style * replace two steps offset by one step as suggested * nice typo * move warnings * remove useless lines from processor * make generation non-standard test more robusts * remove torch.inference_mode from tests * split integration tests * enrich md * rename control_symbol_vocoder_offset->vocoder_offset * clean convert file * remove tgt_lang and src_lang from FE * change generate docstring of ToText models * update generate docstring of tospeech models * unify how to deal withtext_decoder_input_ids * add default spkr_id * unify tgt_lang for t2u_model * simplify tgt_lang verification * remove a todo * change config docstring * make style * simplify t2u_tgt_lang_id * make style * enrich/correct comments * enrich .md * correct typo in docstrings * add torchaudio dependency * update tokenizer * make style and fix copies * modify SeamlessM4TConverter with new tokenizer behaviour * make style * correct small typo docs * fix import * update docs and add requirement to tests * add convert_fairseq2_to_hf in utils/not_doctested.txt * update FE * fix imports and make style * remove torchaudio in FE test * add seamless_m4t.md to utils/not_doctested.txt * nits and change the way docstring dataset is loaded * move checkpoints from ylacombe/ to facebook/ orga * refactor warning/error to be in the 119 line width limit * round overly precised floats * add stereo audio behaviour * refactor .md and make style * enrich docs with more precised architecture description * readd undocumented models * make fix-copies * apply some suggestions * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * correct bug from previous commit * refactor a parameter allowing to clean the code + some small nits * clean tokenizer * make style and fix * make style * clean tokenizers arguments * add precisions for some tests * move docs from not_tested to slow * modify tokenizer according to last comments * add copied from statements in tests * correct convert script * correct parameter docstring style * correct tokenization * correct multi gpus * make style * clean modeling code * make style * add copied from statements * add copied statements * add support with ASR pipeline * remove file added inadvertently * fix docstrings seamlessM4TModel * add seamlessM4TConfig to OBJECTS_TO_IGNORE due of unconventional markdown * add seamlessm4t to assisted generation ignored models --------- Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

Add Seamless M4T model (#25693)
* first raw commit * still POC * tentative convert script * almost working speech encoder conversion scripts * intermediate code for encoder/decoders * add modeling code * first version of speech encoder * make style * add new adapter layer architecture * add adapter block * add first tentative config * add working speech encoder conversion * base model convert works now * make style * remove unnecessary classes * remove unecessary functions * add modeling code speech encoder * rework logics * forward pass of sub components work * add modeling codes * some config modifs and modeling code modifs * save WIP * new edits * same output speech encoder * correct attention mask * correct attention mask * fix generation * new generation logics * erase comments * make style * fix typo * add some descriptions * new state * clean imports * add tests * make style * make beam search and num_return_sequences>1 works * correct edge case issue * correct SeamlessM4TConformerSamePadLayer copied from * replace ACT2FN relu by nn.relu * remove unecessary return variable * move back a class * change name conformer_attention_mask ->conv_attention_mask * better nit code * add some Copied from statements * small nits * small nit in dict.get * rename t2u model -> conditionalgeneration * ongoing refactoring of structure * update models architecture * remove SeamlessM4TMultiModal classes * add tests * adapt tests * some non-working code for vocoder * add seamlessM4T vocoder * remove buggy line * fix some hifigan related bugs * remove hifigan specifc config * change * add WIP tokenization * add seamlessM4T working tokenzier * update tokenization * add tentative feature extractor * Update converting script * update working FE * refactor input_values -> input_features * update FE * changes in generation, tokenizer and modeling * make style and add t2u_decoder_input_ids * add intermediate outputs for ToSpeech models * add vocoder to speech models * update valueerror * update FE with languages * add vocoder convert * update config docstrings and names * update generation code and configuration * remove todos and update config.pad_token_id to generation_config.pad_token_id * move block vocoder * remove unecessary code and uniformize tospeech code * add feature extractor import * make style and fix some copies from * correct consistency + make fix-copies * add processor code * remove comments * add fast tokenizer support * correct pad_token_id in M4TModel * correct config * update tests and codes + make style * make some suggested correstion - correct comments and change naming * rename some attributes * rename some attributes * remove unecessary sequential * remove option to use dur predictor * nit * refactor hifigan * replace normalize_mean and normalize_var with do_normalize + save lang ids to generation config * add tests * change tgt_lang logic * update generation ToSpeech * add support import SeamlessM4TProcessor * fix generate * make tests * update integration tests, add option to only return text and update tokenizer fast * fix wrong function call * update import and convert script * update integration tests + update repo id * correct paths and add first test * update how new attention masks are computed * update tests * take first care of batching in vocoder code * add batching with the vocoder * add waveform lengths to model outputs * make style * add generate kwargs + forward kwargs of M4TModel * add docstrings forward methods * reformate docstrings * add docstrings t2u model * add another round of modeling docstrings + reformate speaker_id -> spkr_id * make style * fix check_repo * make style * add seamlessm4t to toctree * correct check_config_attributes * write config docstrings + some modifs * make style * add docstrings tokenizer * add docstrings to processor, fe and tokenizers * make style * write first version of model docs * fix FE + correct FE test * fix tokenizer + add correct integration tests * fix most tokenization tests * make style * correct most processor test * add generation tests and fix num_return_sequences > 1 * correct integration tests -still one left * make style * correct position embedding * change numbeams to 1 * refactor some modeling code and correct one test * make style * correct typo * refactor intermediate fnn * refactor feedforward conformer * make style * remove comments * make style * fix tokenizer tests * make style * correct processor tests * make style * correct S2TT integration * Apply suggestions from Sanchit code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * correct typo * replace torch.nn->nn + make style * change Output naming (waveforms -> waveform) and ordering * nit renaming and formating * remove return None when not necessary * refactor SeamlessM4TConformerFeedForward * nit typo * remove almost copied from comments * add a copied from comment and remove an unecessary dropout * remove inputs_embeds from speechencoder * remove backward compatibiliy function * reformate class docstrings for a few components * remove unecessary methods * split over 2 lines smthg hard to read * make style * replace two steps offset by one step as suggested * nice typo * move warnings * remove useless lines from processor * make generation non-standard test more robusts * remove torch.inference_mode from tests * split integration tests * enrich md * rename control_symbol_vocoder_offset->vocoder_offset * clean convert file * remove tgt_lang and src_lang from FE * change generate docstring of ToText models * update generate docstring of tospeech models * unify how to deal withtext_decoder_input_ids * add default spkr_id * unify tgt_lang for t2u_model * simplify tgt_lang verification * remove a todo * change config docstring * make style * simplify t2u_tgt_lang_id * make style * enrich/correct comments * enrich .md * correct typo in docstrings * add torchaudio dependency * update tokenizer * make style and fix copies * modify SeamlessM4TConverter with new tokenizer behaviour * make style * correct small typo docs * fix import * update docs and add requirement to tests * add convert_fairseq2_to_hf in utils/not_doctested.txt * update FE * fix imports and make style * remove torchaudio in FE test * add seamless_m4t.md to utils/not_doctested.txt * nits and change the way docstring dataset is loaded * move checkpoints from ylacombe/ to facebook/ orga * refactor warning/error to be in the 119 line width limit * round overly precised floats * add stereo audio behaviour * refactor .md and make style * enrich docs with more precised architecture description * readd undocumented models * make fix-copies * apply some suggestions * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * correct bug from previous commit * refactor a parameter allowing to clean the code + some small nits * clean tokenizer * make style and fix * make style * clean tokenizers arguments * add precisions for some tests * move docs from not_tested to slow * modify tokenizer according to last comments * add copied from statements in tests * correct convert script * correct parameter docstring style * correct tokenization * correct multi gpus * make style * clean modeling code * make style * add copied from statements * add copied statements * add support with ASR pipeline * remove file added inadvertently * fix docstrings seamlessM4TModel * add seamlessM4TConfig to OBJECTS_TO_IGNORE due of unconventional markdown * add seamlessm4t to assisted generation ignored models --------- Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
cb45f71c · Yoach Lacombe · GitHub · 50d0cf4f · cb45f71c · cb45f71c
Unverified Commit cb45f71c authored Oct 23, 2023 by Yoach Lacombe Committed by GitHub Oct 23, 2023
20 changed files
--- a/README.md
+++ b/README.md
@@ -459,6 +459,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
 1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
+1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

--- a/README_es.md
+++ b/README_es.md
@@ -434,6 +434,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
 1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng) released with the paper [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
+1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

--- a/README_hd.md
+++ b/README_hd.md
@@ -408,6 +408,7 @@ conda install -c huggingface transformers
 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (झुईई टेक्नोलॉजी से), साथ में पेपर [रोफॉर्मर: रोटरी पोजिशन एंबेडिंग के साथ एन्हांस्ड ट्रांसफॉर्मर] (https://arxiv.org/pdf/2104.09864v1.pdf) जियानलिन सु और यू लू और शेंगफेंग पैन और बो वेन और युनफेंग लियू द्वारा प्रकाशित।
 1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (Bo Peng से) Bo Peng. द्वाराअनुसंधान पत्र [this repo](https://github.com/BlinkDL/RWKV-LM) के साथ जारी किया गया
+1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (Meta AI से) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. द्वाराअनुसंधान पत्र [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) के साथ जारी किया गया
 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP से) साथ देने वाला पेपर [भाषण पहचान के लिए अनसुपरवाइज्ड प्री-ट्रेनिंग में परफॉर्मेंस-एफिशिएंसी ट्रेड-ऑफ्स](https ://arxiv.org/abs/2109.06870) फेलिक्स वू, क्वांगयुन किम, जिंग पैन, क्यू हान, किलियन क्यू. वेनबर्गर, योव आर्टज़ी द्वारा।

--- a/README_ja.md
+++ b/README_ja.md
@@ -468,6 +468,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (WeChatAI から) HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou から公開された研究論文: [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf)
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (ZhuiyiTechnology から), Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu から公開された研究論文: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
 1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (Bo Peng から) Bo Peng. から公開された研究論文 [this repo](https://github.com/BlinkDL/RWKV-LM)
+1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (NVIDIA から) Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo から公開された研究論文: [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203)
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (Meta AI から) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. から公開された研究論文 [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf)
 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP から) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi から公開された研究論文: [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870)

--- a/README_ko.md
+++ b/README_ko.md
@@ -383,6 +383,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (WeChatAI 에서) HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou 의 [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) 논문과 함께 발표했습니다.
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (ZhuiyiTechnology 에서) Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu 의 a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) 논문과 함께 발표했습니다.
 1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (Bo Peng 에서 제공)은 Bo Peng.의 [this repo](https://github.com/BlinkDL/RWKV-LM)논문과 함께 발표했습니다.
+1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (NVIDIA 에서) Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo 의 [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) 논문과 함께 발표했습니다.
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (Meta AI 에서 제공)은 Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.의 [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf)논문과 함께 발표했습니다.
 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP 에서) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 의 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 논문과 함께 발표했습니다.

--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -407,6 +407,7 @@ conda install -c huggingface transformers
 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (来自 WeChatAI), 伴随论文 [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) 由 HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou 发布。
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (来自 ZhuiyiTechnology), 伴随论文 [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) 由 Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu 发布。
 1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (来自 Bo Peng) 伴随论文 [this repo](https://github.com/BlinkDL/RWKV-LM) 由 Bo Peng 发布。
+1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (来自 NVIDIA) 伴随论文 [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) 由 Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo 发布。
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (来自 Meta AI) 伴随论文 [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) 由 Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick 发布。
 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。

--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -419,6 +419,7 @@ conda install -c huggingface transformers
 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
 1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng) released with the paper [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
+1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
 1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -614,6 +614,8 @@
        title: MusicGen
      - local: model_doc/pop2piano
        title: Pop2Piano
+      - local: model_doc/seamless_m4t
+        title: Seamless-M4T
      - local: model_doc/sew
        title: SEW
      - local: model_doc/sew-d

--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -236,6 +236,7 @@ Flax), PyTorch, and/or TensorFlow.
 |                      [RoFormer](model_doc/roformer)                      |       ✅        |         ✅         |      ✅      |
 |                          [RWKV](model_doc/rwkv)                          |       ✅        |         ❌         |      ❌      |
 |                           [SAM](model_doc/sam)                           |       ✅        |         ✅         |      ❌      |
+|                  [SeamlessM4T](model_doc/seamless_m4t)                   |       ✅        |         ❌         |      ❌      |
 |                     [SegFormer](model_doc/segformer)                     |       ✅        |         ✅         |      ❌      |
 |                           [SEW](model_doc/sew)                           |       ✅        |         ❌         |      ❌      |
 |                         [SEW-D](model_doc/sew-d)                         |       ✅        |         ❌         |      ❌      |

--- a/docs/source/en/model_doc/seamless_m4t.md
+++ b/docs/source/en/model_doc/seamless_m4t.md
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+# SeamlessM4T
+## Overview
+The SeamlessM4T model was proposed in [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team from Meta AI.
+SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
+SeamlessM4T enables multiple tasks without relying on separate models:
+- Speech-to-speech translation (S2ST)
+- Speech-to-text translation (S2TT)
+- Text-to-speech translation (T2ST)
+- Text-to-text translation (T2TT)
+- Automatic speech recognition (ASR)
+[`SeamlessM4TModel`] can perform all the above tasks, but each task also has its own dedicated sub-model.
+The abstract from the paper is the following:
+*What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication*
+## Usage
+First, load the processor and a checkpoint of the model:
+```python
+>>> from transformers import AutoProcessor, SeamlessM4TModel
+>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
+>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")
+```
+You can seamlessly use this model on text or on audio, to generated either translated text or translated audio.
+Here is how to use the processor to process text and audio:
+```python
+>>> # let's load an audio sample from an Arabic speech corpus
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
+>>> audio_sample = next(iter(dataset))["audio"]
+>>> # now, process it
+>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
+>>> # now, process some English test as well
+>>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
+```
+### Speech
+[`SeamlessM4TModel`] can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation:
+```python
+>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
+>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
+```
+With basically the same code, I've translated English text and Arabic speech to Russian speech samples.
+### Text
+Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`].
+This time, let's translate to French.
+```python 
+>>> # from audio
+>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
+>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
+>>> # from text
+>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
+>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
+```
+### Tips
+#### 1. Use dedicated models
+[`SeamlessM4TModel`] is transformers top level model to generate speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
+For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code: 
+```python
+>>> from transformers import SeamlessM4TForSpeechToSpeech
+>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")
+```
+Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task, you only have to remove `generate_speech=False`.
+```python
+>>> from transformers import SeamlessM4TForTextToText
+>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")
+```
+Feel free to try out [`SeamlessM4TForSpeechToText`] and [`SeamlessM4TForTextToSpeech`] as well.
+#### 2. Change the speaker identity
+You have the possibility to change the speaker used for speech synthesis with the `spkr_id` argument. Some `spkr_id` works better than other for some languages!
+#### 3. Change the generation strategy
+You can use different [generation strategies](./generation_strategies) for speech and text generation, e.g `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)` which will successively perform beam-search decoding on the text model, and multinomial sampling on the speech model.
+#### 4. Generate speech and text at the same time
+Use `return_intermediate_token_ids=True` with [`SeamlessM4TModel`] to return both speech and text !
+## Model architecture
+SeamlessM4T features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as "unit tokens," from the translated text.
+Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the [HiFi-GAN](https://arxiv.org/abs/2010.05646) architecture is placed on top of the second seq2seq model.
+Here's how the generation process works:
+- Input text or speech is processed through its specific encoder.
+- A decoder creates text tokens in the desired language.
+- If speech generation is required, the second seq2seq model, following a standard encoder-decoder structure, generates unit tokens.
+- These unit tokens are then passed through the final vocoder to produce the actual speech.
+This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/seamless_communication).
+## SeamlessM4TModel
+[[autodoc]] SeamlessM4TModel
+    - generate
+## SeamlessM4TForTextToSpeech
+[[autodoc]] SeamlessM4TForTextToSpeech
+    - generate
+## SeamlessM4TForSpeechToSpeech
+[[autodoc]] SeamlessM4TForSpeechToSpeech
+    - generate
+## SeamlessM4TForTextToText
+[[autodoc]] transformers.SeamlessM4TForTextToText
+    - forward
+    - generate
+## SeamlessM4TForSpeechToText
+[[autodoc]] transformers.SeamlessM4TForSpeechToText
+    - forward
+    - generate
+## SeamlessM4TConfig
+[[autodoc]] SeamlessM4TConfig
+## SeamlessM4TTokenizer
+[[autodoc]] SeamlessM4TTokenizer
+    - __call__
+    - build_inputs_with_special_tokens
+    - get_special_tokens_mask
+    - create_token_type_ids_from_sequences
+    - save_vocabulary
+## SeamlessM4TTokenizerFast
+[[autodoc]] SeamlessM4TTokenizerFast
+    - __call__
+## SeamlessM4TFeatureExtractor
+[[autodoc]] SeamlessM4TFeatureExtractor
+    - __call__
+## SeamlessM4TProcessor
+[[autodoc]] SeamlessM4TProcessor
+    - __call__
+## SeamlessM4TCodeHifiGan
+[[autodoc]] SeamlessM4TCodeHifiGan
+## SeamlessM4THifiGan
+[[autodoc]] SeamlessM4THifiGan
+## SeamlessM4TTextToUnitModel
+[[autodoc]] SeamlessM4TTextToUnitModel
+## SeamlessM4TTextToUnitForConditionalGeneration
+[[autodoc]] SeamlessM4TTextToUnitForConditionalGeneration
--- a/docs/source/en/tasks/summarization.md
+++ b/docs/source/en/tasks/summarization.md
@@ -35,7 +35,7 @@ The task illustrated in this tutorial is supported by the following model archit
 <!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->
-[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)
+[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)
 <!--End of the generated tip-->

--- a/docs/source/en/tasks/translation.md
+++ b/docs/source/en/tasks/translation.md
@@ -32,7 +32,7 @@ The task illustrated in this tutorial is supported by the following model archit
 <!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->
-[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)
+[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)
 <!--End of the generated tip-->

--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -517,6 +517,12 @@ _import_structure = {
        "SamPromptEncoderConfig",
        "SamVisionConfig",
    ],
+    "models.seamless_m4t": [
+        "SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "SeamlessM4TConfig",
+        "SeamlessM4TFeatureExtractor",
+        "SeamlessM4TProcessor",
+    ],
    "models.segformer": ["SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "SegformerConfig"],
    "models.sew": ["SEW_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWConfig"],
    "models.sew_d": ["SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWDConfig"],
@@ -805,6 +811,7 @@ else:
    _import_structure["models.plbart"].append("PLBartTokenizer")
    _import_structure["models.reformer"].append("ReformerTokenizer")
    _import_structure["models.rembert"].append("RemBertTokenizer")
+    _import_structure["models.seamless_m4t"].append("SeamlessM4TTokenizer")
    _import_structure["models.speech_to_text"].append("Speech2TextTokenizer")
    _import_structure["models.speecht5"].append("SpeechT5Tokenizer")
    _import_structure["models.t5"].append("T5Tokenizer")
@@ -877,6 +884,7 @@ else:
    _import_structure["models.rembert"].append("RemBertTokenizerFast")
    _import_structure["models.roberta"].append("RobertaTokenizerFast")
    _import_structure["models.roformer"].append("RoFormerTokenizerFast")
+    _import_structure["models.seamless_m4t"].append("SeamlessM4TTokenizerFast")
    _import_structure["models.splinter"].append("SplinterTokenizerFast")
    _import_structure["models.squeezebert"].append("SqueezeBertTokenizerFast")
    _import_structure["models.t5"].append("T5TokenizerFast")
@@ -1082,6 +1090,7 @@ else:
    _import_structure["modeling_utils"] = ["PreTrainedModel"]
    # PyTorch models structure
    _import_structure["models.albert"].extend(
        [
            "ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -2683,6 +2692,21 @@ else:
            "SamPreTrainedModel",
        ]
    )
+    _import_structure["models.seamless_m4t"].extend(
+        [
+            "SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "SeamlessM4TCodeHifiGan",
+            "SeamlessM4TForSpeechToSpeech",
+            "SeamlessM4TForSpeechToText",
+            "SeamlessM4TForTextToSpeech",
+            "SeamlessM4TForTextToText",
+            "SeamlessM4THifiGan",
+            "SeamlessM4TModel",
+            "SeamlessM4TPreTrainedModel",
+            "SeamlessM4TTextToUnitForConditionalGeneration",
+            "SeamlessM4TTextToUnitModel",
+        ]
+    )
    _import_structure["models.segformer"].extend(
        [
            "SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -4658,6 +4682,12 @@ if TYPE_CHECKING:
        SamPromptEncoderConfig,
        SamVisionConfig,
    )
+    from .models.seamless_m4t import (
+        SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        SeamlessM4TConfig,
+        SeamlessM4TFeatureExtractor,
+        SeamlessM4TProcessor,
+    )
    from .models.segformer import SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, SegformerConfig
    from .models.sew import SEW_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWConfig
    from .models.sew_d import SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWDConfig
@@ -4925,6 +4955,7 @@ if TYPE_CHECKING:
        from .models.plbart import PLBartTokenizer
        from .models.reformer import ReformerTokenizer
        from .models.rembert import RemBertTokenizer
+        from .models.seamless_m4t import SeamlessM4TTokenizer
        from .models.speech_to_text import Speech2TextTokenizer
        from .models.speecht5 import SpeechT5Tokenizer
        from .models.t5 import T5Tokenizer
@@ -4990,6 +5021,7 @@ if TYPE_CHECKING:
        from .models.rembert import RemBertTokenizerFast
        from .models.roberta import RobertaTokenizerFast
        from .models.roformer import RoFormerTokenizerFast
+        from .models.seamless_m4t import SeamlessM4TTokenizerFast
        from .models.splinter import SplinterTokenizerFast
        from .models.squeezebert import SqueezeBertTokenizerFast
        from .models.t5 import T5TokenizerFast
@@ -5157,8 +5189,6 @@ if TYPE_CHECKING:
            top_k_top_p_filtering,
        )
        from .modeling_utils import PreTrainedModel
-        # PyTorch model imports
        from .models.albert import (
            ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
            AlbertForMaskedLM,
@@ -6485,6 +6515,21 @@ if TYPE_CHECKING:
            SamModel,
            SamPreTrainedModel,
        )
+        # PyTorch model imports
+        from .models.seamless_m4t import (
+            SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST,
+            SeamlessM4TCodeHifiGan,
+            SeamlessM4TForSpeechToSpeech,
+            SeamlessM4TForSpeechToText,
+            SeamlessM4TForTextToSpeech,
+            SeamlessM4TForTextToText,
+            SeamlessM4THifiGan,
+            SeamlessM4TModel,
+            SeamlessM4TPreTrainedModel,
+            SeamlessM4TTextToUnitForConditionalGeneration,
+            SeamlessM4TTextToUnitModel,
+        )
        from .models.segformer import (
            SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
            SegformerDecodeHead,

--- a/src/transformers/convert_slow_tokenizer.py
+++ b/src/transformers/convert_slow_tokenizer.py
@@ -775,6 +775,31 @@ class NllbConverter(SpmConverter):
        )
+class SeamlessM4TConverter(SpmConverter):
+    def vocab(self, proto):
+        vocab = [
+            ("<pad>", 0.0),
+            ("<unk>", 0.0),
+            ("<s>", 0.0),
+            ("</s>", 0.0),
+        ]
+        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
+        return vocab
+    def unk_id(self, proto):
+        return self.original_tokenizer.unk_token_id
+    def post_processor(self):
+        return processors.TemplateProcessing(
+            single="__eng__ $A </s>",
+            pair="__eng__ $A $B </s>",
+            special_tokens=[
+                ("__eng__", self.original_tokenizer.convert_tokens_to_ids("__eng__")),
+                ("</s>", self.original_tokenizer.convert_tokens_to_ids("</s>")),
+            ],
+        )
 class XLMRobertaConverter(SpmConverter):
    def vocab(self, proto):
        vocab = [
@@ -1278,6 +1303,7 @@ SLOW_TO_FAST_CONVERTERS = {
    "RetriBertTokenizer": BertConverter,
    "RobertaTokenizer": RobertaConverter,
    "RoFormerTokenizer": RoFormerConverter,
+    "SeamlessM4TTokenizer": SeamlessM4TConverter,
    "SqueezeBertTokenizer": BertConverter,
    "T5Tokenizer": T5Converter,
    "WhisperTokenizer": WhisperConverter,

--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -180,6 +180,7 @@ from . import (
    roformer,
    rwkv,
    sam,
+    seamless_m4t,
    segformer,
    sew,
    sew_d,

--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -186,6 +186,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("roformer", "RoFormerConfig"),
        ("rwkv", "RwkvConfig"),
        ("sam", "SamConfig"),
+        ("seamless_m4t", "SeamlessM4TConfig"),
        ("segformer", "SegformerConfig"),
        ("sew", "SEWConfig"),
        ("sew-d", "SEWDConfig"),
@@ -392,6 +393,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("roformer", "ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("rwkv", "RWKV_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("sam", "SAM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("seamless_m4t", "SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("sew", "SEW_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("sew-d", "SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -622,6 +624,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("roformer", "RoFormer"),
        ("rwkv", "RWKV"),
        ("sam", "SAM"),
+        ("seamless_m4t", "SeamlessM4T"),
        ("segformer", "SegFormer"),
        ("sew", "SEW"),
        ("sew-d", "SEW-D"),

--- a/src/transformers/models/auto/feature_extraction_auto.py
+++ b/src/transformers/models/auto/feature_extraction_auto.py
@@ -76,6 +76,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
        ("pop2piano", "Pop2PianoFeatureExtractor"),
        ("regnet", "ConvNextFeatureExtractor"),
        ("resnet", "ConvNextFeatureExtractor"),
+        ("seamless_m4t", "SeamlessM4TFeatureExtractor"),
        ("segformer", "SegformerFeatureExtractor"),
        ("sew", "Wav2Vec2FeatureExtractor"),
        ("sew-d", "Wav2Vec2FeatureExtractor"),

--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -175,6 +175,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("roformer", "RoFormerModel"),
        ("rwkv", "RwkvModel"),
        ("sam", "SamModel"),
+        ("seamless_m4t", "SeamlessM4TModel"),
        ("segformer", "SegformerModel"),
        ("sew", "SEWModel"),
        ("sew-d", "SEWDModel"),
@@ -674,6 +675,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
        ("pegasus_x", "PegasusXForConditionalGeneration"),
        ("plbart", "PLBartForConditionalGeneration"),
        ("prophetnet", "ProphetNetForConditionalGeneration"),
+        ("seamless_m4t", "SeamlessM4TForTextToText"),
        ("switch_transformers", "SwitchTransformersForConditionalGeneration"),
        ("t5", "T5ForConditionalGeneration"),
        ("umt5", "UMT5ForConditionalGeneration"),
@@ -684,6 +686,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
 MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
    [
        ("pop2piano", "Pop2PianoForConditionalGeneration"),
+        ("seamless_m4t", "SeamlessM4TForSpeechToText"),
        ("speech-encoder-decoder", "SpeechEncoderDecoderModel"),
        ("speech_to_text", "Speech2TextForConditionalGeneration"),
        ("speecht5", "SpeechT5ForSpeechToText"),
@@ -1047,6 +1050,7 @@ MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES = OrderedDict(
        # Model for Text-To-Waveform mapping
        ("bark", "BarkModel"),
        ("musicgen", "MusicgenForConditionalGeneration"),
+        ("seamless_m4t", "SeamlessM4TForTextToSpeech"),
        ("vits", "VitsModel"),
    ]
 )

--- a/src/transformers/models/auto/processing_auto.py
+++ b/src/transformers/models/auto/processing_auto.py
@@ -71,6 +71,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("pix2struct", "Pix2StructProcessor"),
        ("pop2piano", "Pop2PianoProcessor"),
        ("sam", "SamProcessor"),
+        ("seamless_m4t", "SeamlessM4TProcessor"),
        ("sew", "Wav2Vec2Processor"),
        ("sew-d", "Wav2Vec2Processor"),
        ("speech_to_text", "Speech2TextProcessor"),

--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -329,6 +329,13 @@ else:
            ("roc_bert", ("RoCBertTokenizer", None)),
            ("roformer", ("RoFormerTokenizer", "RoFormerTokenizerFast" if is_tokenizers_available() else None)),
            ("rwkv", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
+            (
+                "seamless_m4t",
+                (
+                    "SeamlessM4TTokenizer" if is_sentencepiece_available() else None,
+                    "SeamlessM4TTokenizerFast" if is_tokenizers_available() else None,
+                ),
+            ),
            ("speech_to_text", ("Speech2TextTokenizer" if is_sentencepiece_available() else None, None)),
            ("speech_to_text_2", ("Speech2Text2Tokenizer", None)),
            ("speecht5", ("SpeechT5Tokenizer" if is_sentencepiece_available() else None, None)),