Merge branch 'parler-tts-release' into main

Commit 5eae102f authored Apr 10, 2024 by Yoach Lacombe
20 changed files
--- a/LICENSE
+++ b/LICENSE
@@ -186,7 +186,7 @@
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

-   Copyright [yyyy] [name of copyright owner]
+   Copyright [2024] [The HuggingFace Inc. team]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.

--- a/README.md
+++ b/README.md
-# Stable Speech
-
-Work in-progress reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com)
-by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.
-
-Reproducing the TTS model requires the following 5 steps to be completed in order:
-1. Train the Accent Classifier
-2. Annotate the Training Set
-3. Aggregate Statistics
-4. Create Descriptions
-5. Train the Model
-
-## Step 1: Train the Accent Classifier
-
-The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from 
-the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent
-classification task.
-
-Starting with a pre-trained audio encoder model, a simple linear classifier is appended to the last hidden-layer to map the 
-audio embeddings to class label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained. 
-The linear classifier is randomly initialised, and is thus always trained.
-
-The script can be used to train on a single accent dataset, or a combination of datasets, which should be specified by
-separating dataset names, configs and splits by the `+` character in the launch command (see below for an example).
-
-In the proceeding example, we follow Stability's approach by taking audio embeddings from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126) 
-model, and training the linear classifier on a combination of three open-source datasets:
-1. The English Accented (`en_accented`) subset of [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)
-2. The train split of [VCTK](https://huggingface.co/datasets/vctk) 
-3. The dev split of [EdAcc](https://huggingface.co/datasets/edinburghcstr/edacc)
-
-The model is subsequently evaluated on the test split of [EdAcc](https://huggingface.co/datasets/edinburghcstr/edacc)
-to give the final classification accuracy.
-
-```bash
-#!/usr/bin/env bash
-
-python run_audio_classification.py \
-    --model_name_or_path "facebook/mms-lid-126" \
-    --train_dataset_name "vctk+facebook/voxpopuli+edinburghcstr/edacc" \
-    --train_dataset_config_name "main+en_accented+default" \
-    --train_split_name "train+test+validation" \
-    --train_label_column_name "accent+accent+accent" \
-    --eval_dataset_name "edinburghcstr/edacc" \
-    --eval_dataset_config_name "default" \
-    --eval_split_name "test" \
-    --eval_label_column_name "accent" \
-    --output_dir "./" \
-    --do_train \
-    --do_eval \
-    --overwrite_output_dir \
-    --remove_unused_columns False \
-    --fp16 \
-    --learning_rate 1e-4 \
-    --max_length_seconds 20 \
-    --attention_mask False \
-    --warmup_ratio 0.1 \
-    --num_train_epochs 5 \
-    --per_device_train_batch_size 32 \
-    --per_device_eval_batch_size 32 \
-    --preprocessing_num_workers 16 \
-    --dataloader_num_workers 4 \
-    --logging_strategy "steps" \
-    --logging_steps 10 \
-    --evaluation_strategy "epoch" \
-    --save_strategy "epoch" \
-    --load_best_model_at_end True \
-    --metric_for_best_model "accuracy" \
-    --save_total_limit 3 \
-    --freeze_base_model \
-    --push_to_hub \
-    --trust_remote_code
+# Parler-TTS
+
+Parler-TTS is a lightweight text-to-speech (TTS) model that can generate high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.). It is a reproduction of work from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.
+
+Unlike other TTS models, Parler-TTS is a **fully open-source** release. All of the datasets, pre-processing, training code and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models.
+
+This repository contains the inference and training code for Parler-TTS. It is designed to accompany the [Data-Speech](https://github.com/huggingface/dataspeech) repository for dataset annotation.
+
+> [!IMPORTANT]
+> We're proud to release Parler-TTS v0.1, our first 300M parameter model, trained on 10.5K hours of audio data.
+> In the coming weeks, we'll be working on scaling up to 50k hours of data, in preparation for the v1 model.
+
+## 📖 Quick Index
+* [Installation](#installation)
+* [Usage](#usage)
+* [Training](#training)
+* [Demo](https://huggingface.co/spaces/parler-tts/parler_tts_mini)
+* [Model weights and datasets](https://huggingface.co/parler-tts)
+
+
+## Usage
+
+> [!TIP]
+> You can directly try it out in an interactive demo [here](https://huggingface.co/spaces/parler-tts/parler_tts_mini)!
+
+Using Parler-TTS is as simple as "bonjour". Simply use the following inference snippet.
+
+```py
+from parler_tts import ParlerTTSForConditionalGeneration
+from transformers import AutoTokenizer
+import soundfile as sf
+import torch
+
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+
+model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_300M_v0.1").to(device)
+tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_300M_v0.1")
+
+prompt = "Hey, how are you doing today?"
+description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."
+
+input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
+prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
+
+generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
+audio_arr = generation.cpu().numpy().squeeze()
+sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
+```
+
+## Installation
+
+Parler-TTS has lightweight dependencies and can be installed in one line:
+
+```sh
+pip install git+https://github.com/huggingface/parler-tts.git
+```
+
+## Training
+
+The [training folder](/training/) contains all the information to train or fine-tune your own Parler-TTS model. It consists of:
+- [1. An introduction to the Parler-TTS architecture](/training/README.md#1-architecture)
+- [2. The first steps to get started](/training/README.md#2-getting-started)
+- [3. A training guide](/training/README.md#3-training)
+
+> [!IMPORTANT]
+> **TL;DR:** After having followed the [installation steps](/training/README.md#requirements), you can reproduce the Parler-TTS v0.1 training recipe with the following command line:
+
+```sh
+accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
 ```

-Tips:
-1. **Number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy.
+## Acknowledgements

-## Step 2: Annotate the Training Set
+This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

-Annotate the training dataset with information on: SNR, C50, pitch and speaking rate. 
+Special thanks to:
+- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
+- the many libraries used, namely [🤗 datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [🤗 accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [🤗 transformers](https://huggingface.co/docs/transformers/index).
+- Descript for the [DAC codec model](https://github.com/descriptinc/descript-audio-codec)
+- Hugging Face 🤗 for providing compute resources and time to explore!

-## Step 3: Aggregate Statistics

-Aggregate statistics from Step 2. Convert continuous values to discrete labels.
+## Citation

-## Step 4: Create Descriptions
+If you found this repository useful, please consider citing this work and also the original Stability AI paper:

-Convert sequence of discrete labels to text description (using an LLM). 
+```
+@misc{lacombe-etal-2024-parler-tts,
+  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
+  title = {Parler-TTS},
+  year = {2024},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/huggingface/parler-tts}}
+}
+```

-## Step 5: Train the Model
+```
+@misc{lyth2024natural,
+      title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
+      author={Dan Lyth and Simon King},
+      year={2024},
+      eprint={2402.01912},
+      archivePrefix={arXiv},
+      primaryClass={cs.SD}
+}
+```

-Train MusicGen-style model on the TTS task.
+## Contribution
+
+Contributions are welcome, as the project offers many possibilities for improvement and exploration.
+
+Namely, we're looking at ways to improve both quality and speed:
+- Datasets:
+    - Train on more data
+    - Add more features such as accents
+- Training:
+    - Add PEFT compatibility to do LoRA fine-tuning.
+    - Add possibility to train without description column.
+    - Add notebook training.
+    - Explore multilingual training.
+    - Explore mono-speaker fine-tuning.
+    - Explore more architectures.
+- Optimization:
+    - Compilation and static cache
+    - Support for FA2 and SDPA
+- Evaluation:
+    - Add more evaluation metrics

--- a/audio_classification_scripts/run_wav2vec2_dummy.sh
+++ b/audio_classification_scripts/run_wav2vec2_dummy.sh
-#!/usr/bin/env bash
-
-python run_audio_classification.py \
-    --model_name_or_path "hf-internal-testing/tiny-random-wav2vec2" \
-    --train_dataset_name "facebook/voxpopuli" \
-    --train_dataset_config_name "en_accented" \
-    --train_split_name "test" \
-    --train_label_column_name "accent" \
-    --eval_dataset_name "facebook/voxpopuli" \
-    --eval_dataset_config_name "en_accented" \
-    --eval_split_name "test" \
-    --eval_label_column_name "accent" \
-    --trust_remote_code \
-    --output_dir "./" \
-    --do_train \
-    --do_eval \
-    --max_train_samples 100 \
-    --max_eval_samples 100 \
-    --overwrite_output_dir \
-    --remove_unused_columns False \
-    --fp16 \
-    --learning_rate 1e-4 \
-    --min_length_seconds 5 \
-    --max_length_seconds 10 \
-    --attention_mask False \
-    --warmup_ratio 0.1 \
-    --num_train_epochs 5 \
-    --per_device_train_batch_size 4 \
-    --per_device_eval_batch_size 4 \
-    --dataloader_num_workers 0 \
-    --logging_strategy "steps" \
-    --logging_steps 10 \
-    --evaluation_strategy "epoch" \
-    --save_strategy "epoch" \
-    --load_best_model_at_end True \
-    --metric_for_best_model "accuracy" \
-    --save_total_limit 3 \
-    --seed 0
--- a/helpers/gradio_demo/app.py
+++ b/helpers/gradio_demo/app.py
+import gradio as gr
+import torch
+
+from parler_tts import ParlerTTSForConditionalGeneration
+from transformers import AutoTokenizer, AutoFeatureExtractor, set_seed
+
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+
+repo_id = "parler-tts/parler_tts_300M_v0.1"
+
+model = ParlerTTSForConditionalGeneration.from_pretrained(repo_id).to(device)
+tokenizer = AutoTokenizer.from_pretrained(repo_id)
+feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id)
+
+
+SAMPLE_RATE = feature_extractor.sampling_rate
+SEED = 41
+
+default_text = "Please surprise me and speak in whatever voice you enjoy."
+
+title = "# Parler-TTS"
+
+examples = [
+    [
+        "'This is the best time of my life, Bartley,' she said happily.",
+        "A female speaker with a slightly low-pitched, quite monotone voice delivers her words at a slightly faster-than-average pace in a confined space with very clear audio.",
+    ],
+    [
+        "Montrose also, after having experienced still more variety of good and bad fortune, threw down his arms, and retired out of the kingdom.",
+        "A male speaker with a slightly high-pitched voice delivering his words at a slightly slow pace in a small, confined space with a touch of background noise and a quite monotone tone.",
+    ],
+    [
+        "montrose also after having experienced still more variety of good and bad fortune threw down his arms and retired out of the kingdom",
+        "A male speaker with a low-pitched voice delivering his words at a fast pace in a small, confined space with a lot of background noise and an animated tone.",
+    ],
+]
+
+
+def gen_tts(text, description):
+    inputs = tokenizer(description, return_tensors="pt").to(device)
+    prompt = tokenizer(text, return_tensors="pt").to(device)
+
+    set_seed(SEED)
+    generation = model.generate(
+        input_ids=inputs.input_ids, prompt_input_ids=prompt.input_ids, do_sample=True, temperature=1.0
+    )
+    audio_arr = generation.cpu().numpy().squeeze()
+
+    return (SAMPLE_RATE, audio_arr)
+
+
+css = """
+        #share-btn-container {
+            display: flex;
+            padding-left: 0.5rem !important;
+            padding-right: 0.5rem !important;
+            background-color: #000000;
+            justify-content: center;
+            align-items: center;
+            border-radius: 9999px !important; 
+            width: 13rem;
+            margin-top: 10px;
+            margin-left: auto;
+            flex: unset !important;
+        }
+        #share-btn {
+            all: initial;
+            color: #ffffff;
+            font-weight: 600;
+            cursor: pointer;
+            font-family: 'IBM Plex Sans', sans-serif;
+            margin-left: 0.5rem !important;
+            padding-top: 0.25rem !important;
+            padding-bottom: 0.25rem !important;
+            right:0;
+        }
+        #share-btn * {
+            all: unset !important;
+        }
+        #share-btn-container div:nth-child(-n+2){
+            width: auto !important;
+            min-height: 0px !important;
+        }
+        #share-btn-container .wrap {
+            display: none !important;
+        }
+"""
+with gr.Blocks(css=css) as block:
+    gr.Markdown(title)
+    with gr.Row():
+        with gr.Column():
+            input_text = gr.Textbox(label="Input Text", lines=2, value=default_text, elem_id="input_text")
+            description = gr.Textbox(label="Description", lines=2, value="", elem_id="input_description")
+            run_button = gr.Button("Generate Audio", variant="primary")
+        with gr.Column():
+            audio_out = gr.Audio(label="Parler-TTS generation", type="numpy", elem_id="audio_out")
+
+    inputs = [input_text, description]
+    outputs = [audio_out]
+    gr.Examples(examples=examples, fn=gen_tts, inputs=inputs, outputs=outputs, cache_examples=True)
+    run_button.click(fn=gen_tts, inputs=inputs, outputs=outputs, queue=True)
+
+block.queue()
+block.launch(share=True)
--- a/helpers/model_init_scripts/init_dummy_model.py
+++ b/helpers/model_init_scripts/init_dummy_model.py
+from parler_tts import ParlerTTSForCausalLM, ParlerTTSForConditionalGeneration, ParlerTTSDecoderConfig
+from transformers import AutoConfig
+import os
+import argparse
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("save_directory", type=str, help="Directory where to save the model and the decoder.")
+    parser.add_argument("--text_model", type=str, help="Repository id or path to the text encoder.")
+    parser.add_argument("--audio_model", type=str, help="Repository id or path to the audio encoder.")
+
+    args = parser.parse_args()
+
+    text_model = args.text_model
+    encodec_version = args.audio_model
+
+    t5 = AutoConfig.from_pretrained(text_model)
+    encodec = AutoConfig.from_pretrained(encodec_version)
+
+    encodec_vocab_size = encodec.codebook_size
+    num_codebooks = encodec.num_codebooks
+    print("num_codebooks", num_codebooks)
+
+    decoder_config = ParlerTTSDecoderConfig(
+        vocab_size=encodec_vocab_size + 1,
+        max_position_embeddings=2048,
+        num_hidden_layers=4,
+        ffn_dim=512,
+        num_attention_heads=8,
+        layerdrop=0.0,
+        use_cache=True,
+        activation_function="gelu",
+        hidden_size=512,
+        dropout=0.0,
+        attention_dropout=0.0,
+        activation_dropout=0.0,
+        pad_token_id=encodec_vocab_size,
+        eos_token_id=encodec_vocab_size,
+        bos_token_id=encodec_vocab_size + 1,
+        num_codebooks=num_codebooks,
+    )
+
+    decoder = ParlerTTSForCausalLM(decoder_config)
+    decoder.save_pretrained(os.path.join(args.save_directory, "decoder"))
+
+    model = ParlerTTSForConditionalGeneration.from_sub_models_pretrained(
+        text_encoder_pretrained_model_name_or_path=text_model,
+        audio_encoder_pretrained_model_name_or_path=encodec_version,
+        decoder_pretrained_model_name_or_path=os.path.join(args.save_directory, "decoder"),
+        vocab_size=t5.vocab_size,
+    )
+
+    # set the appropriate bos/pad token ids
+    model.generation_config.decoder_start_token_id = encodec_vocab_size + 1
+    model.generation_config.pad_token_id = encodec_vocab_size
+    model.generation_config.eos_token_id = encodec_vocab_size
+
+    # set other default generation config params
+    model.generation_config.max_length = int(30 * model.audio_encoder.config.frame_rate)
+    model.generation_config.do_sample = True  # True
+    model.generation_config.guidance_scale = 1  # 3.0
+    
+    model.config.pad_token_id = encodec_vocab_size
+    model.config.decoder_start_token_id = encodec_vocab_size + 1
+
+    model.save_pretrained(os.path.join(args.save_directory, "tiny-model"))
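The token-id bookkeeping in this script can be summarised in a few lines. The sketch below is a standalone illustration and not part of the commit: `special_token_ids` is a hypothetical helper, and the codebook size of 1024 is an assumed example value (the script reads the real one from `encodec.codebook_size`).

```python
def special_token_ids(codebook_size: int) -> dict:
    """Mirror the id layout used by the init script (hypothetical helper)."""
    return {
        # Audio codes occupy ids 0 .. codebook_size - 1.
        "pad_token_id": codebook_size,
        "eos_token_id": codebook_size,  # pad and eos share the same id
        "bos_token_id": codebook_size + 1,
        "decoder_start_token_id": codebook_size + 1,
        "vocab_size": codebook_size + 1,  # value passed to ParlerTTSDecoderConfig
    }


ids = special_token_ids(1024)  # 1024 is an assumed example value
print(ids)
```

Under these assumptions, padding and end-of-sequence share the first id after the audio codes, and generation starts from the id after that.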
--- a/helpers/model_init_scripts/init_dummy_model_with_encodec.py
+++ b/helpers/model_init_scripts/init_dummy_model_with_encodec.py
+from parler_tts import ParlerTTSForCausalLM, ParlerTTSForConditionalGeneration, ParlerTTSDecoderConfig
+from transformers import AutoConfig
+import os
+import argparse
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("save_directory", type=str, help="Directory where to save the model and the decoder.")
+    args = parser.parse_args()
+
+    text_model = "google-t5/t5-small"
+    encodec_version = "facebook/encodec_24khz"
+
+    t5 = AutoConfig.from_pretrained(text_model)
+    encodec = AutoConfig.from_pretrained(encodec_version)
+
+    encodec_vocab_size = encodec.codebook_size
+    num_codebooks = 8
+    print("num_codebooks", num_codebooks)
+
+    decoder_config = ParlerTTSDecoderConfig(
+        vocab_size=encodec_vocab_size + 1,
+        max_position_embeddings=2048,
+        num_hidden_layers=4,
+        ffn_dim=512,
+        num_attention_heads=8,
+        layerdrop=0.0,
+        use_cache=True,
+        activation_function="gelu",
+        hidden_size=512,
+        dropout=0.0,
+        attention_dropout=0.0,
+        activation_dropout=0.0,
+        pad_token_id=encodec_vocab_size,
+        eos_token_id=encodec_vocab_size,
+        bos_token_id=encodec_vocab_size + 1,
+        num_codebooks=num_codebooks,
+    )
+
+    decoder = ParlerTTSForCausalLM(decoder_config)
+
+    decoder.save_pretrained(os.path.join(args.save_directory, "decoder"))
+
+    model = ParlerTTSForConditionalGeneration.from_sub_models_pretrained(
+        text_encoder_pretrained_model_name_or_path=text_model,
+        audio_encoder_pretrained_model_name_or_path=encodec_version,
+        decoder_pretrained_model_name_or_path=os.path.join(args.save_directory, "decoder"),
+        vocab_size=t5.vocab_size,
+    )
+
+    # set the appropriate bos/pad token ids
+    model.generation_config.decoder_start_token_id = encodec_vocab_size + 1
+    model.generation_config.pad_token_id = encodec_vocab_size
+    model.generation_config.eos_token_id = encodec_vocab_size
+
+    # set other default generation config params
+    model.generation_config.max_length = int(30 * model.audio_encoder.config.frame_rate)
+    model.generation_config.do_sample = True  # True
+    model.generation_config.guidance_scale = 1  # 3.0
+
+    model.save_pretrained(os.path.join(args.save_directory, "tiny-model"))
--- a/helpers/model_init_scripts/init_model_300M.py
+++ b/helpers/model_init_scripts/init_model_300M.py
+from parler_tts import ParlerTTSForCausalLM, ParlerTTSForConditionalGeneration, ParlerTTSDecoderConfig
+from transformers import AutoConfig
+import os
+import argparse
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("save_directory", type=str, help="Directory where to save the model and the decoder.")
+    parser.add_argument("--text_model", type=str, help="Repository id or path to the text encoder.")
+    parser.add_argument("--audio_model", type=str, help="Repository id or path to the audio encoder.")
+
+    args = parser.parse_args()
+
+    text_model = args.text_model
+    encodec_version = args.audio_model
+
+    t5 = AutoConfig.from_pretrained(text_model)
+    encodec = AutoConfig.from_pretrained(encodec_version)
+
+    encodec_vocab_size = encodec.codebook_size
+    num_codebooks = encodec.num_codebooks
+    print("num_codebooks", num_codebooks)
+
+    decoder_config = ParlerTTSDecoderConfig(
+        vocab_size=encodec_vocab_size + 64,  # + 64 instead of +1 to have a multiple of 64
+        max_position_embeddings=4096,  # 30 s = 2580
+        num_hidden_layers=24,
+        ffn_dim=4096,
+        num_attention_heads=16,
+        layerdrop=0.0,
+        use_cache=True,
+        activation_function="gelu",
+        hidden_size=1024,
+        dropout=0.1,
+        attention_dropout=0.0,
+        activation_dropout=0.0,
+        pad_token_id=encodec_vocab_size,
+        eos_token_id=encodec_vocab_size,
+        bos_token_id=encodec_vocab_size + 1,
+        num_codebooks=num_codebooks,
+    )
+
+    decoder = ParlerTTSForCausalLM(decoder_config)
+    decoder.save_pretrained(os.path.join(args.save_directory, "decoder"))
+
+    model = ParlerTTSForConditionalGeneration.from_sub_models_pretrained(
+        text_encoder_pretrained_model_name_or_path=text_model,
+        audio_encoder_pretrained_model_name_or_path=encodec_version,
+        decoder_pretrained_model_name_or_path=os.path.join(args.save_directory, "decoder"),
+        vocab_size=t5.vocab_size,
+    )
+
+    # set the appropriate bos/pad token ids
+    model.generation_config.decoder_start_token_id = encodec_vocab_size + 1
+    model.generation_config.pad_token_id = encodec_vocab_size
+    model.generation_config.eos_token_id = encodec_vocab_size
+
+    # set other default generation config params
+    model.generation_config.max_length = int(30 * model.audio_encoder.config.frame_rate)
+    model.generation_config.do_sample = True  # True
+    model.generation_config.guidance_scale = 1  # 3.0
+    
+    model.config.pad_token_id = encodec_vocab_size
+    model.config.decoder_start_token_id = encodec_vocab_size + 1
+
+    model.save_pretrained(os.path.join(args.save_directory, "parler-tts-untrained-300M/"))
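Two constants in this script come from simple arithmetic: the `+ 64` vocabulary padding and the 30-second generation cap. A minimal sketch follows, assuming a codebook size of 1024 and an audio-encoder frame rate of 86 frames per second. Both values are assumptions (the frame rate is inferred from the `30 s = 2580` comment); the script itself reads them from the audio encoder config.

```python
codebook_size = 1024  # assumed example value
frame_rate = 86  # assumed; chosen so that 30 * 86 matches the "30 s = 2580" comment
max_position_embeddings = 4096  # taken from the decoder config in the script

# Adding 64 keeps the size a multiple of 64 only if codebook_size already is one.
vocab_size = codebook_size + 64
# Number of decoder steps needed to cover 30 seconds of audio.
max_length = int(30 * frame_rate)

print(vocab_size, vocab_size % 64)
print(max_length, max_length <= max_position_embeddings)
```

With these assumed values, the 30-second cap fits inside the 4096 positions the decoder is configured with.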
--- a/helpers/push_to_hub_scripts/push_dac_to_hub.py
+++ b/helpers/push_to_hub_scripts/push_dac_to_hub.py
+import dac
+from parler_tts import DACConfig, DACModel
+from transformers import AutoConfig, AutoModel
+from transformers import EncodecFeatureExtractor
+AutoConfig.register("dac", DACConfig)
+AutoModel.register(DACConfig, DACModel)
+
+# Download a model
+model_path = dac.utils.download(model_type="44khz")
+model = dac.DAC.load(model_path)
+
+hf_dac = DACModel(DACConfig())
+hf_dac.model.load_state_dict(model.state_dict())
+
+hf_dac.push_to_hub("parler-tts/dac_44khZ_8kbps")
+EncodecFeatureExtractor(sampling_rate=44100).push_to_hub("parler-tts/dac_44khZ_8kbps")
--- a/helpers/push_to_hub_scripts/push_trained_parler_tts_to_hub.py
+++ b/helpers/push_to_hub_scripts/push_trained_parler_tts_to_hub.py
+from parler_tts import ParlerTTSForConditionalGeneration
+from transformers import AutoTokenizer, AutoFeatureExtractor
+
+path = "TODO"
+repo_id = "parler_tts_300M"
+
+
+AutoFeatureExtractor.from_pretrained("ylacombe/dac_44khZ_8kbps").push_to_hub(repo_id)
+AutoTokenizer.from_pretrained("google/t5-v1_1-base").push_to_hub(repo_id)
+
+ParlerTTSForConditionalGeneration.from_pretrained(path).push_to_hub(repo_id)
--- a/helpers/training_configs/librispeech_tts_r_300M_dummy.json
+++ b/helpers/training_configs/librispeech_tts_r_300M_dummy.json
+{
+    "model_name_or_path": "./parler-tts-untrained-300M/parler-tts-untrained-300M/",
+    "save_to_disk":  "./tmp_dataset_audio/",
+    "temporary_save_to_disk": "./audio_code_tmp/",
+
+
+    "feature_extractor_name":"ylacombe/dac_44khZ_8kbps",
+    "description_tokenizer_name":"google/flan-t5-base",
+    "prompt_tokenizer_name":"google/flan-t5-base",
+
+    "report_to": ["wandb"],
+    "overwrite_output_dir": true,
+    "output_dir": "./output_dir_training",
+
+    "train_dataset_name": "blabble-io/libritts_r",
+    "train_metadata_dataset_name": "parler-tts/libritts_r_tags_tagged_10k_generated",
+    "train_dataset_config_name": "clean",
+    "train_split_name": "test.clean",
+
+    "eval_dataset_name": "blabble-io/libritts_r",
+    "eval_metadata_dataset_name": "parler-tts/libritts_r_tags_tagged_10k_generated",
+    "eval_dataset_config_name": "clean",
+    "eval_split_name": "test.clean",
+
+    "target_audio_column_name": "audio", 
+    "description_column_name": "text_description",
+    "prompt_column_name": "text",
+
+    "max_eval_samples": 48,
+    "max_train_samples": 96,
+    
+    "max_duration_in_seconds": 20,
+    "min_duration_in_seconds": 2.0,
+
+    "add_audio_samples_to_wandb": true,
+    "id_column_name": "id",
+
+    "preprocessing_num_workers": 8,
+
+    "do_train": true,
+    "num_train_epochs": 50,
+    "gradient_accumulation_steps": 1,
+    "gradient_checkpointing": false,
+    "per_device_train_batch_size": 4,
+    "learning_rate": 1e-3,
+    "adam_beta1": 0.9,
+    "adam_beta2": 0.99,
+    "weight_decay": 0.01,
+
+    "lr_scheduler_type": "cosine",
+    "warmup_steps":  40,
+
+
+    "logging_steps": 2,
+    "freeze_text_encoder": true,
+
+
+    "do_eval": true, 
+    "predict_with_generate": true,
+    "include_inputs_for_metrics": true,
+    "evaluation_strategy": "steps",
+    "eval_steps": 500,
+    "save_steps": 5000,
+
+    "per_device_eval_batch_size": 12,
+
+    "audio_encoder_per_device_batch_size":24,
+    "dtype": "bfloat16",
+    "seed": 456,
+
+    "dataloader_num_workers":8
+}
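To get a feel for how long this dummy run lasts, the number of optimizer updates can be estimated from the values in the config. The sketch below is a rough estimate rather than the training script's own logic: `optimizer_steps` is a hypothetical helper, a single device is assumed, and samples dropped by the duration filters are ignored.

```python
import math


def optimizer_steps(num_samples, per_device_batch_size, grad_accum_steps, num_epochs, num_devices=1):
    """Rough count of optimizer updates over a whole run (hypothetical helper)."""
    samples_per_update = per_device_batch_size * grad_accum_steps * num_devices
    updates_per_epoch = math.ceil(num_samples / samples_per_update)
    return updates_per_epoch * num_epochs


# Values copied from the dummy config; num_devices=1 is an assumption.
total = optimizer_steps(num_samples=96, per_device_batch_size=4, grad_accum_steps=1, num_epochs=50)
print(total)
```

Under these assumptions the run is short relative to `save_steps` (5000), so it is worth checking whether any intermediate checkpoint would be written before training ends.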
--- a/helpers/training_configs/starting_point_0.01.json
+++ b/helpers/training_configs/starting_point_0.01.json
+{
+    "model_name_or_path": "./parler-tts-untrained-300M/parler-tts-untrained-300M/",
+    "save_to_disk":  "./tmp_dataset_audio/",
+    "temporary_save_to_disk": "./audio_code_tmp/",
+
+
+    "feature_extractor_name":"ylacombe/dac_44khZ_8kbps",
+    "description_tokenizer_name":"google/flan-t5-base",
+    "prompt_tokenizer_name":"google/flan-t5-base",
+
+    "report_to": ["wandb"],
+    "overwrite_output_dir": true,
+    "output_dir": "./output_dir_training",
+
+    "train_dataset_name": "blabble-io/libritts_r+blabble-io/libritts_r+blabble-io/libritts_r+parler-tts/mls_eng_10k",
+    "train_metadata_dataset_name": "parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/mls-eng-10k-tags_tagged_10k_generated",
+    "train_dataset_config_name": "clean+clean+other+default",
+    "train_split_name": "train.clean.360+train.clean.100+train.other.500+train",
+
+    "eval_dataset_name": "blabble-io/libritts_r+parler-tts/mls_eng_10k",
+    "eval_metadata_dataset_name": "parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/mls-eng-10k-tags_tagged_10k_generated",
+    "eval_dataset_config_name": "other+default",
+    "eval_split_name": "test.other+test",
+
+    "target_audio_column_name": "audio", 
+    "description_column_name": "text_description",
+    "prompt_column_name": "text",
+
+    "max_eval_samples": 96,
+    
+    "max_duration_in_seconds": 30,
+    "min_duration_in_seconds": 2.0,
+    "max_text_length": 400,
+
+    "group_by_length": true,
+
+    "add_audio_samples_to_wandb": true,
+    "id_column_name": "id",
+
+    "preprocessing_num_workers": 8,
+
+    "do_train": true,
+    "num_train_epochs": 40,
+    "gradient_accumulation_steps": 8,
+    "gradient_checkpointing": false,
+    "per_device_train_batch_size": 3,
+    "learning_rate": 0.00095,
+    "adam_beta1": 0.9,
+    "adam_beta2": 0.99,
+    "weight_decay": 0.01,
+
+    "lr_scheduler_type": "constant_with_warmup",
+    "warmup_steps":  20000,
+
+
+    "logging_steps": 1000,
+    "freeze_text_encoder": true,
+
+
+    "do_eval": true, 
+    "predict_with_generate": true,
+    "include_inputs_for_metrics": true,
+    "evaluation_strategy": "steps",
+    "eval_steps": 10000,
+    "save_steps": 10000,
+
+    "per_device_eval_batch_size": 12,
+
+    "audio_encoder_per_device_batch_size":20,
+    "dtype": "bfloat16",
+    "seed": 456,
+
+    "dataloader_num_workers":8
+}
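This config lists several datasets in one string per field, joined by `+`, the same convention the removed README text described for dataset names, configs and splits. The sketch below shows one way such fields could be split and paired by position. `split_dataset_specs` is a hypothetical helper, and positional pairing is an assumption about how the training script reads these fields.

```python
def split_dataset_specs(names, configs, splits, metadata_names):
    """Split '+'-joined fields and pair them up by position (hypothetical helper)."""
    fields = [value.split("+") for value in (names, configs, splits, metadata_names)]
    lengths = {len(field) for field in fields}
    if len(lengths) != 1:
        raise ValueError(f"Fields must list the same number of datasets, got {sorted(lengths)}")
    return [
        {"name": name, "config": config, "split": split, "metadata": metadata}
        for name, config, split, metadata in zip(*fields)
    ]


# Evaluation fields copied from the config above.
specs = split_dataset_specs(
    names="blabble-io/libritts_r+parler-tts/mls_eng_10k",
    configs="other+default",
    splits="test.other+test",
    metadata_names=(
        "parler-tts/libritts_r_tags_tagged_10k_generated"
        "+parler-tts/mls-eng-10k-tags_tagged_10k_generated"
    ),
)
for spec in specs:
    print(spec)
```

Checking that every field lists the same number of entries catches a common mistake when a dataset is added to one field but not the others.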
--- a/parler_tts/__init__.py
+++ b/parler_tts/__init__.py
+__version__ = "0.1"
+
+
+from .configuration_parler_tts import ParlerTTSConfig, ParlerTTSDecoderConfig
+from .modeling_parler_tts import (
+    ParlerTTSForCausalLM,
+    ParlerTTSForConditionalGeneration,
+    apply_delay_pattern_mask,
+    build_delay_pattern_mask,
+)
+
+from .dac_wrapper import DACConfig, DACModel
+from transformers import AutoConfig, AutoModel
+
+AutoConfig.register("dac", DACConfig)
+AutoModel.register(DACConfig, DACModel)
--- a/stable_speech/configuration_stable_speech.py
+++ b/stable_speech/configuration_stable_speech.py
 # coding=utf-8
-# Copyright 2023 Meta AI and The HuggingFace Inc. team. All rights reserved.
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,7 +12,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Stable Speech model configuration"""
+""" Parler-TTS model configuration"""

 from transformers import AutoConfig, logging
 from transformers.configuration_utils import PretrainedConfig
@@ -21,26 +21,26 @@ from transformers.configuration_utils import PretrainedConfig
 logger = logging.get_logger(__name__)

 MUSICGEN_PRETRAINED_CONFIG_ARCHIVE_MAP = {
-    "facebook/stable_speech-small": "https://huggingface.co/facebook/stable_speech-small/resolve/main/config.json",
-    # See all StableSpeech models at https://huggingface.co/models?filter=stable_speech
+    "facebook/parler_tts-small": "https://huggingface.co/facebook/parler_tts-small/resolve/main/config.json",
+    # See all ParlerTTS models at https://huggingface.co/models?filter=parler_tts
 }


-class StableSpeechDecoderConfig(PretrainedConfig):
+class ParlerTTSDecoderConfig(PretrainedConfig):
    r"""
-    This is the configuration class to store the configuration of an [`StableSpeechDecoder`]. It is used to instantiate a
-    Stable Speech decoder according to the specified arguments, defining the model architecture. Instantiating a
-    configuration with the defaults will yield a similar configuration to that of the Stable Speech
-    [facebook/stable_speech-small](https://huggingface.co/facebook/stable_speech-small) architecture.
+    This is the configuration class to store the configuration of a [`ParlerTTSDecoder`]. It is used to instantiate a
+    Parler-TTS decoder according to the specified arguments, defining the model architecture. Instantiating a
+    configuration with the defaults will yield a similar configuration to that of the Parler-TTS
+    [facebook/parler_tts-small](https://huggingface.co/facebook/parler_tts-small) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
-        vocab_size (`int`, *optional*, defaults to 2048):
-            Vocabulary size of the StableSpeechDecoder model. Defines the number of different tokens that can be
-            represented by the `inputs_ids` passed when calling [`StableSpeechDecoder`].
+        vocab_size (`int`, *optional*, defaults to 2049):
+            Vocabulary size of the ParlerTTSDecoder model. Defines the number of different tokens that can be
+            represented by the `input_ids` passed when calling [`ParlerTTSDecoder`].
        hidden_size (`int`, *optional*, defaults to 1024):
            Dimensionality of the layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 24):
@@ -76,12 +76,12 @@ class StableSpeechDecoderConfig(PretrainedConfig):
            Whether input and output word embeddings should be tied.
    """

-    model_type = "stable_speech_decoder"
+    model_type = "parler_tts_decoder"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
-        vocab_size=2048,
+        vocab_size=2049,  # vocab size = 2048 (encodec vocab size) + 1 (eos)
        max_position_embeddings=2048,
        num_hidden_layers=24,
        ffn_dim=4096,
@@ -97,8 +97,8 @@ class StableSpeechDecoderConfig(PretrainedConfig):
        scale_embedding=False,
        num_codebooks=4,
        pad_token_id=2048,
-        bos_token_id=2048,
-        eos_token_id=None,
+        bos_token_id=2049,
+        eos_token_id=2048,
        tie_word_embeddings=False,
        **kwargs,
    ):
@@ -127,16 +127,19 @@ class StableSpeechDecoderConfig(PretrainedConfig):
        )


-class StableSpeechConfig(PretrainedConfig):
+class ParlerTTSConfig(PretrainedConfig):
    r"""
-    This is the configuration class to store the configuration of a [`StableSpeechModel`]. It is used to instantiate a
-    Stable Speech model according to the specified arguments, defining the text encoder, audio encoder and Stable Speech decoder
+    This is the configuration class to store the configuration of a [`ParlerTTSModel`]. It is used to instantiate a
+    Parler-TTS model according to the specified arguments, defining the text encoder, audio encoder and Parler-TTS decoder
    configs.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
+        vocab_size (`int`, *optional*, defaults to 1024):
+            Vocabulary size of the prompt token ids. Defines the number of different tokens that can be
+            represented by the `prompt_input_ids`.
        kwargs (*optional*):
            Dictionary of keyword arguments. Notably:

@@ -151,24 +154,24 @@ class StableSpeechConfig(PretrainedConfig):

    ```python
    >>> from transformers import (
-    ...     StableSpeechConfig,
-    ...     StableSpeechDecoderConfig,
+    ...     ParlerTTSConfig,
+    ...     ParlerTTSDecoderConfig,
    ...     T5Config,
    ...     EncodecConfig,
-    ...     StableSpeechForConditionalGeneration,
+    ...     ParlerTTSForConditionalGeneration,
    ... )

    >>> # Initializing text encoder, audio encoder, and decoder model configurations
    >>> text_encoder_config = T5Config()
    >>> audio_encoder_config = EncodecConfig()
-    >>> decoder_config = StableSpeechDecoderConfig()
+    >>> decoder_config = ParlerTTSDecoderConfig()

-    >>> configuration = StableSpeechConfig.from_sub_models_config(
+    >>> configuration = ParlerTTSConfig.from_sub_models_config(
    ...     text_encoder_config, audio_encoder_config, decoder_config
    ... )

-    >>> # Initializing a StableSpeechForConditionalGeneration (with random weights) from the facebook/stable_speech-small style configuration
-    >>> model = StableSpeechForConditionalGeneration(configuration)
+    >>> # Initializing a ParlerTTSForConditionalGeneration (with random weights) from the facebook/parler_tts-small style configuration
+    >>> model = ParlerTTSForConditionalGeneration(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
@@ -177,17 +180,17 @@ class StableSpeechConfig(PretrainedConfig):
    >>> config_decoder = model.config.decoder

    >>> # Saving the model, including its configuration
-    >>> model.save_pretrained("stable_speech-model")
+    >>> model.save_pretrained("parler_tts-model")

    >>> # loading model and config from pretrained folder
-    >>> stable_speech_config = StableSpeechConfig.from_pretrained("stable_speech-model")
-    >>> model = StableSpeechForConditionalGeneration.from_pretrained("stable_speech-model", config=stable_speech_config)
+    >>> parler_tts_config = ParlerTTSConfig.from_pretrained("parler_tts-model")
+    >>> model = ParlerTTSForConditionalGeneration.from_pretrained("parler_tts-model", config=parler_tts_config)
    ```"""

-    model_type = "stable_speech"
+    model_type = "parler_tts"
    is_composition = True

-    def __init__(self, **kwargs):
+    def __init__(self, vocab_size=1024, **kwargs):
        super().__init__(**kwargs)
        if "text_encoder" not in kwargs or "audio_encoder" not in kwargs or "decoder" not in kwargs:
            raise ValueError("Config has to be initialized with text_encoder, audio_encoder and decoder config")
@@ -200,9 +203,10 @@ class StableSpeechConfig(PretrainedConfig):

        decoder_config = kwargs.pop("decoder")

+        self.vocab_size = vocab_size
        self.text_encoder = AutoConfig.for_model(text_encoder_model_type, **text_encoder_config)
        self.audio_encoder = AutoConfig.for_model(audio_encoder_model_type, **audio_encoder_config)
-        self.decoder = StableSpeechDecoderConfig(**decoder_config)
+        self.decoder = ParlerTTSDecoderConfig(**decoder_config)
        self.is_encoder_decoder = True

    @classmethod
@@ -210,15 +214,15 @@ class StableSpeechConfig(PretrainedConfig):
        cls,
        text_encoder_config: PretrainedConfig,
        audio_encoder_config: PretrainedConfig,
-        decoder_config: StableSpeechDecoderConfig,
+        decoder_config: ParlerTTSDecoderConfig,
        **kwargs,
    ):
        r"""
-        Instantiate a [`StableSpeechConfig`] (or a derived class) from text encoder, audio encoder and decoder
+        Instantiate a [`ParlerTTSConfig`] (or a derived class) from text encoder, audio encoder and decoder
        configurations.

        Returns:
-            [`StableSpeechConfig`]: An instance of a configuration object
+            [`ParlerTTSConfig`]: An instance of a configuration object
        """

        return cls(

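Reviewer note: the new decoder defaults change how special tokens are laid out around the codec vocabulary. The sketch below only restates the defaults visible in this diff (`vocab_size=2049`, `pad_token_id=2048`, `bos_token_id=2049`, `eos_token_id=2048`); the constant names and the `describe` helper are hypothetical and exist purely for illustration.

```python
# Illustrative only: mirrors the ParlerTTSDecoderConfig defaults shown in this diff.
CODEBOOK_SIZE = 2048            # codec ids occupy 0..2047 (per the inline comment on vocab_size)
EOS_TOKEN_ID = 2048             # first id after the codec range
PAD_TOKEN_ID = 2048             # the defaults give PAD the same id as EOS
BOS_TOKEN_ID = 2049
VOCAB_SIZE = CODEBOOK_SIZE + 1  # 2049 = codec ids + EOS


def describe(token_id: int) -> str:
    """Hypothetical helper: name the role of a decoder token id."""
    if 0 <= token_id < CODEBOOK_SIZE:
        return "codec"
    if token_id == EOS_TOKEN_ID:
        return "eos/pad"
    if token_id == BOS_TOKEN_ID:
        return "bos"
    return "unknown"


for token_id in (0, 2047, 2048, 2049):
    print(token_id, describe(token_id))
```

Note that `BOS_TOKEN_ID` equals `VOCAB_SIZE`, so it falls outside `range(VOCAB_SIZE)`. An embedding table would need at least `vocab_size + 1` rows to look it up; whether the model allocates that extra row is not visible in this part of the diff.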
--- a/parler_tts/dac_wrapper/__init__.py
+++ b/parler_tts/dac_wrapper/__init__.py
+from .configuration_dac import DACConfig
+from .modeling_dac import DACModel
--- a/parler_tts/dac_wrapper/configuration_dac.py
+++ b/parler_tts/dac_wrapper/configuration_dac.py
+from transformers import PretrainedConfig
+from typing import List
+
+
+class DACConfig(PretrainedConfig):
+    model_type = "dac"
+
+    def __init__(
+        self,
+        num_codebooks: int = 9,
+        model_bitrate: int = 8,  # kbps
+        codebook_size: int = 1024,
+        latent_dim: int = 1024,
+        frame_rate: int = 86,
+        sampling_rate: int = 44100,
+        **kwargs,
+    ):
+        self.codebook_size = codebook_size
+        self.model_bitrate = model_bitrate
+        self.latent_dim = latent_dim
+        self.num_codebooks = num_codebooks
+        self.frame_rate = frame_rate
+        self.sampling_rate = sampling_rate
+
+        super().__init__(**kwargs)
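Reviewer note: the `DACConfig` defaults appear to be mutually consistent. The arithmetic below is a rough check, assuming the codec downsamples the waveform by a hop length of 512 samples per frame; that value is not stated anywhere in this diff and is an assumption of the sketch.

```python
import math

# Defaults copied from DACConfig in this diff.
sampling_rate = 44100
num_codebooks = 9
codebook_size = 1024

# ASSUMPTION: 512 input samples per latent frame (not present in the diff).
hop_length = 512

frame_rate = sampling_rate / hop_length   # frames per second
bits_per_code = math.log2(codebook_size)  # bits needed to index one codebook entry
bitrate_kbps = frame_rate * num_codebooks * bits_per_code / 1000

print(f"frame_rate is about {frame_rate:.2f} Hz")
print(f"bitrate is about {bitrate_kbps:.2f} kbps")
```

Under that assumption the frame rate rounds to the configured `frame_rate=86`, and the bitrate lands just under the configured `model_bitrate=8` kbps.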
--- a/parler_tts/dac_wrapper/modeling_dac.py
+++ b/parler_tts/dac_wrapper/modeling_dac.py
+import torch
+
+from transformers import PreTrainedModel
+from transformers.models.encodec.modeling_encodec import EncodecEncoderOutput, EncodecDecoderOutput
+from .configuration_dac import DACConfig
+
+from dac.model import DAC
+
+
+# model doesn't support batching yet
+
+
+class DACModel(PreTrainedModel):
+    config_class = DACConfig
+
+    def __init__(self, config):
+        super().__init__(config)
+
+        self.model = DAC(
+            n_codebooks=config.num_codebooks,
+            latent_dim=config.latent_dim,
+            codebook_size=config.codebook_size,
+        )
+
+    def encode(
+        self, input_values, padding_mask=None, bandwidth=None, return_dict=None, n_quantizers=None, sample_rate=None
+    ):
+        """
+        Encodes the input audio waveform into discrete codes.
+
+        Args:
+            input_values (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`):
+                Float values of the input audio waveform.
+            padding_mask (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`):
+                Padding mask used to pad the `input_values`.
+            bandwidth (`float`, *optional*):
+                Not used, kept to have the same interface as HF encodec.
+            n_quantizers (`int`, *optional*):
+                Number of quantizers to use. If `None` (the default), all quantizers are used.
+            sample_rate (`int`, *optional*):
+                Sampling rate of the input signal.
+
+        Returns:
+            A list of frames containing the discrete encoded codes for the input audio waveform. Each frame is a
+            tuple `(codebook, scale)`, with `codebook` of shape `[batch_size, num_codebooks, frames]`. `scale` is
+            always `None` here; it is kept to have the same interface as HF encodec.
+
+        """
+        _, channels, input_length = input_values.shape
+
+        if channels < 1 or channels > 2:
+            raise ValueError(f"Number of audio channels must be 1 or 2, but got {channels}")
+
+        audio_data = self.model.preprocess(input_values, sample_rate)
+
+        return_dict = return_dict if return_dict is not None else self.config.return_dict
+
+        # TODO: for now, no chunk length
+
+        chunk_length = None  # self.config.chunk_length
+        if chunk_length is None:
+            chunk_length = input_length
+            stride = input_length
+        else:
+            stride = self.config.chunk_stride
+
+        if padding_mask is None:
+            padding_mask = torch.ones_like(input_values).bool()
+
+        encoded_frames = []
+        scales = []
+
+        step = chunk_length - stride
+        if (input_length % stride) - step != 0:
+            raise ValueError(
+                "The input length is not properly padded for batched chunked decoding. Make sure to pad the input correctly."
+            )
+
+        for offset in range(0, input_length - step, stride):
+            mask = padding_mask[..., offset : offset + chunk_length].bool()
+            frame = audio_data[:, :, offset : offset + chunk_length]
+
+            scale = None
+
+            _, encoded_frame, _, _, _ = self.model.encode(frame, n_quantizers=n_quantizers)
+            encoded_frames.append(encoded_frame)
+            scales.append(scale)
+
+        encoded_frames = torch.stack(encoded_frames)
+
+        if not return_dict:
+            return (encoded_frames, scales)
+
+        return EncodecEncoderOutput(encoded_frames, scales)
+
+    def decode(
+        self,
+        audio_codes,
+        audio_scales,
+        padding_mask=None,
+        return_dict=None,
+    ):
+        """
+        Decodes the given frames into an output audio waveform.
+
+        Note that the output might be a bit bigger than the input. In that case, any extra steps at the end can be
+        trimmed.
+
+        Args:
+            audio_codes (`torch.FloatTensor` of shape `(batch_size, nb_chunks, chunk_length)`, *optional*):
+                Discrete code embeddings computed using `model.encode`.
+            audio_scales (`torch.Tensor` of shape `(batch_size, nb_chunks)`, *optional*):
+                Not used, kept to have the same interface as HF encodec.
+            padding_mask (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`):
+                Padding mask used to pad the `input_values`.
+                Not used yet, kept to have the same interface as HF encodec.
+            return_dict (`bool`, *optional*):
+                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+
+        """
+        return_dict = return_dict if return_dict is not None else self.config.return_dict
+
+        # TODO: for now, no chunk length
+
+        if len(audio_codes) != 1:
+            raise ValueError(f"Expected one frame, got {len(audio_codes)}")
+
+        audio_values = self.model.quantizer.from_codes(audio_codes.squeeze(0))[0]
+        audio_values = self.model.decode(audio_values)
+        if not return_dict:
+            return (audio_values,)
+        return EncodecDecoderOutput(audio_values)
+
+    def forward(self, tensor):
+        raise ValueError("`DACModel.forward` not implemented yet")
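Reviewer note: with `chunk_length` hard-coded to `None`, the chunking loop in `DACModel.encode` always processes the whole input as one frame. The sketch below isolates the offset arithmetic from that method in plain Python so the behaviour is easy to check; `chunk_offsets` is a hypothetical helper, not part of the PR, and it assumes `chunk_length` and `stride` are either both given or both omitted.

```python
def chunk_offsets(input_length, chunk_length=None, stride=None):
    """Return the start offsets the encode loop would iterate over."""
    if chunk_length is None:
        # Same fallback as in DACModel.encode: a single chunk spanning the input.
        chunk_length = input_length
        stride = input_length

    step = chunk_length - stride  # overlap between consecutive chunks
    if (input_length % stride) - step != 0:
        raise ValueError("input is not padded to a whole number of chunks")

    return list(range(0, input_length - step, stride))


# Current behaviour in the PR: no chunking, one offset.
print(chunk_offsets(44100))

# What overlapping chunks would look like (chunks of 10 samples, stride 8).
print(chunk_offsets(26, chunk_length=10, stride=8))
```

Because the first call yields a single offset, `torch.stack(encoded_frames)` in `encode` produces a leading dimension of 1, which matches the `len(audio_codes) != 1` check in `decode`.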
--- a/stable_speech/modeling_stable_speech.py
+++ b/stable_speech/modeling_stable_speech.py
--- a/prompt_creation_scripts/run_prompt_creation_dummy.sh
+++ b/prompt_creation_scripts/run_prompt_creation_dummy.sh
-#!/usr/bin/env bash
-
-python run_prompt_creation.py \
-  --dataset_name "ylacombe/libritts_r_tags_and_text" \
-  --dataset_config_name "clean" \
-  --dataset_split_name "dev.clean" \
-  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
-  --per_device_eval_batch_size 2 \
-  --attn_implementation "sdpa" \
-  --dataloader_num_workers 0 \
-  --output_dir "./" \
-  --load_in_4bit
--- a/run_audio_classification.py
+++ b/run_audio_classification.py
--- a/run_dataset_concatenation.py
+++ b/run_dataset_concatenation.py
-import os
-import sys
-from dataclasses import dataclass, field
-from pathlib import Path
-
-import numpy as np
-from datasets import Audio, concatenate_datasets, load_dataset
-from huggingface_hub import get_full_repo_name
-from transformers import HfArgumentParser, WhisperTokenizerFast
-
-
-@dataclass
-class DataTrainingArguments:
-    """
-    Arguments pertaining to what data we are going to input our model for training and eval.
-    """
-
-    dataset_name: str = field(
-        default=None,
-        metadata={"help": "The name of the dataset to use (via the datasets library)."},
-    )
-    dataset_config_name: str = field(
-        default=None,
-        metadata={"help": "The configuration name of the dataset to use (via the datasets library)."},
-    )
-    dataset_split_name: str = field(
-        default=None,
-        metadata={
-            "help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'"
-        },
-    )
-    label_column_name: str = field(
-        default="labels",
-        metadata={"help": "The name of the dataset column containing the labels in the dataset. Defaults to 'label'"},
-    )
-    text_column_name: str = field(
-        default="text",
-        metadata={
-            "help": "The name of the dataset column containing the text transcriptions in the dataset. Defaults to 'text'"
-        },
-    )
-    speaker_column_name: str = field(
-        default="speaker_id",
-        metadata={
-            "help": "The name of the dataset column containing the speaker ids in the dataset. Defaults to 'speaker_id'"
-        },
-    )
-    dataset_cache_dir: str = field(
-        default=None,
-        metadata={"help": "Path to cache directory for saving and loading datasets"},
-    )
-    preprocessing_num_workers: int = field(
-        default=None,
-        metadata={"help": "The number of processes to use for the preprocessing."},
-    )
-    batch_size: int = field(
-        default=500,
-        metadata={"help": "Number of examples per batch provided to the preprocessing function."},
-    )
-    download_only: bool = field(
-        default=False,
-        metadata={"help": "Whether to only do data download and skip pre-processing."},
-    )
-    audio_column_name: str = field(
-        default="audio",
-        metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"},
-    )
-    max_duration_in_seconds: float = field(
-        default=20.0,
-        metadata={"help": "Filter audio files that are longer than `max_duration_in_seconds` seconds"},
-    )
-    sampling_rate: int = field(
-        default=16_000,
-        metadata={
-            "help": "Sampling rate at which to resample the audio data. Should be set to the same sampling rate as the target model."
-        },
-    )
-    max_samples: int = field(
-        default=None,
-        metadata={
-            "help": "For debugging purposes, truncate the number of examples in the dataset to this value if set."
-        },
-    )
-    output_dir: str = field(
-        default=None,
-        metadata={
-            "help": "Where to save the processed dataset to disk. If unspecified, uses a 'pretty' version of the "
-            "original dataset name. E.g. 'facebook/voxpopuli' will be saved under 'voxpopuli'."
-        },
-    )
-    push_to_hub: bool = field(
-        default=False,
-        metadata={"help": "Whether or not to push the processed dataset to the Hub."},
-    )
-    seed: int = field(
-        default=0,
-        metadata={"help": "RNG seed for reproducibility. Used during the final shuffling of the combined dataset."},
-    )
-
-
-def convert_dataset_str_to_list(
-    dataset_names,
-    dataset_config_names,
-    splits=None,
-    label_column_names=None,
-    text_column_names=None,
-    speaker_column_names=None,
-    dataset_samples=None,
-    default_split="train",
-):
-    if isinstance(dataset_names, str):
-        dataset_names = dataset_names.split("+")
-        dataset_config_names = dataset_config_names.split("+")
-        splits = splits.split("+") if splits is not None else None
-        label_column_names = label_column_names.split("+") if label_column_names is not None else None
-        text_column_names = text_column_names.split("+") if text_column_names is not None else None
-        speaker_column_names = speaker_column_names.split("+") if speaker_column_names is not None else None
-        dataset_samples = dataset_samples.split("+") if dataset_samples is not None else None
-
-    # basic checks to ensure we've got the right number of datasets/configs/splits/columns/probs
-    if len(dataset_names) != len(dataset_config_names):
-        raise ValueError(
-            f"Ensure one config is passed for each dataset, got {len(dataset_names)} datasets and"
-            f" {len(dataset_config_names)} configs."
-        )
-
-    if splits is not None and len(splits) != len(dataset_names):
-        raise ValueError(
-            f"Ensure one split is passed for each dataset, got {len(dataset_names)} datasets and {len(splits)} splits."
-        )
-
-    if label_column_names is not None and len(label_column_names) != len(dataset_names):
-        raise ValueError(
-            f"Ensure one label column name is passed for each dataset, got {len(dataset_names)} datasets and"
-            f" {len(label_column_names)} label column names."
-        )
-    if text_column_names is not None and len(text_column_names) != len(dataset_names):
-        raise ValueError(
-            f"Ensure one text column name is passed for each dataset, got {len(dataset_names)} datasets and"
-            f" {len(text_column_names)} text column names."
-        )
-    if speaker_column_names is not None and len(speaker_column_names) != len(dataset_names):
-        raise ValueError(
-            f"Ensure one text column name is passed for each dataset, got {len(dataset_names)} datasets and"
-            f" {len(speaker_column_names)} speaker column names."
-        )
-
-    if dataset_samples is not None:
-        if len(dataset_samples) != len(dataset_names):
-            raise ValueError(
-                f"Ensure one sample is passed for each dataset, got {len(dataset_names)} datasets and "
-                f"{len(dataset_samples)} samples."
-            )
-        dataset_samples = [float(ds_sample) for ds_sample in dataset_samples]
-    else:
-        dataset_samples = [None] * len(dataset_names)
-
-    label_column_names = (
-        label_column_names if label_column_names is not None else ["labels" for _ in range(len(dataset_names))]
-    )
-    text_column_names = (
-        text_column_names if text_column_names is not None else ["text" for _ in range(len(dataset_names))]
-    )
-    speaker_column_names = (
-        speaker_column_names if speaker_column_names is not None else ["speaker_id" for _ in range(len(dataset_names))]
-    )
-    splits = splits if splits is not None else [default_split for _ in range(len(dataset_names))]
-
-    dataset_names_dict = []
-    for i, ds_name in enumerate(dataset_names):
-        dataset_names_dict.append(
-            {
-                "name": ds_name,
-                "config": dataset_config_names[i],
-                "split": splits[i],
-                "label_column_name": label_column_names[i],
-                "text_column_name": text_column_names[i],
-                "speaker_column_name": speaker_column_names[i],
-                "samples": dataset_samples[i],
-            }
-        )
-    return dataset_names_dict
-
-
-def main():
-    # 1. Parse input arguments
-    parser = HfArgumentParser(DataTrainingArguments)
-    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
-        # If we pass only one argument to the script and it's the path to a json file,
-        # let's parse it to get our arguments.
-        data_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))[0]
-    else:
-        data_args = parser.parse_args_into_dataclasses()[0]
-
-    dataset_names_dict = convert_dataset_str_to_list(
-        data_args.dataset_name,
-        data_args.dataset_config_name,
-        splits=data_args.dataset_split_name,
-        label_column_names=data_args.label_column_name,
-        text_column_names=data_args.text_column_name,
-        speaker_column_names=data_args.speaker_column_name,
-    )
-
-    # load whisper tokenizer for normalisation
-    sampling_rate = data_args.sampling_rate
-    tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny.en")
-    max_input_length = int(data_args.max_duration_in_seconds * sampling_rate)
-    batch_size = data_args.batch_size
-    preprocessing_num_workers = data_args.preprocessing_num_workers
-    all_vectorized_datasets = []
-
-    for dataset_dict in dataset_names_dict:
-        print(10 * "=", dataset_dict["name"], 10 * "=")
-        raw_datasets = load_dataset(
-            dataset_dict["name"],
-            dataset_dict["config"],
-            split=dataset_dict["split"],
-            cache_dir=data_args.dataset_cache_dir,
-            num_proc=data_args.preprocessing_num_workers,
-        )
-
-        if data_args.download_only:
-            continue
-
-        features = raw_datasets.column_names
-        if dataset_dict["label_column_name"] not in features:
-            raise ValueError(
-                f"--label_column_name {dataset_dict['label_column_name']} not found in dataset '{dataset_dict['name']}'. "
-                "Make sure to set `--label_column_name` to the correct text column - one of "
-                f"{', '.join(features)}."
-            )
-        elif dataset_dict["label_column_name"] != "labels":
-            raw_datasets = raw_datasets.rename_column(dataset_dict["label_column_name"], "labels")
-
-        if dataset_dict["text_column_name"] not in features:
-            raise ValueError(
-                f"--text_column_name {dataset_dict['text_column_name']} not found in dataset '{dataset_dict['name']}'. "
-                "Make sure to set `--text_column_name` to the correct text column - one of "
-                f"{', '.join(features)}."
-            )
-        elif dataset_dict["text_column_name"] != "text":
-            raw_datasets = raw_datasets.rename_column(dataset_dict["text_column_name"], "text")
-
-        if dataset_dict["speaker_column_name"] not in features:
-            raise ValueError(
-                f"--speaker_column_name {dataset_dict['speaker_column_name']} not found in dataset '{dataset_dict['name']}'. "
-                "Make sure to set `--speaker_column_name` to the correct speaker id column - one of "
-                f"{', '.join(features)}."
-            )
-        elif dataset_dict["speaker_column_name"] != "speaker_id":
-            raw_datasets = raw_datasets.rename_column(dataset_dict["speaker_column_name"], "speaker_id")
-
-        raw_datasets = raw_datasets.remove_columns(
-            set(raw_datasets.features.keys()) - {"audio", "labels", "text", "speaker_id"}
-        )
-
-        if data_args.max_samples is not None:
-            raw_datasets = raw_datasets.select(range(data_args.max_samples))
-
-        raw_datasets = raw_datasets.cast_column(data_args.audio_column_name, Audio(sampling_rate=sampling_rate))
-        raw_datasets = raw_datasets.sort("speaker_id")
-
-        def filter_transcriptions(text):
-            normalized_text = tokenizer.normalize(text).strip()
-            return bool(normalized_text) and text.lower() != "ignore_time_segment_in_scoring"
-
-        raw_datasets = raw_datasets.filter(
-            filter_transcriptions, input_columns=["text"], desc="Filtering non-speech transcriptions"
-        )
-
-        def prepare_dataset(batch):
-            audio = [sample["array"] for sample in batch["audio"]]
-            input_lengths = [len(sample) for sample in audio]
-
-            concatenated_audio = []
-            concatenated_text = []
-            concatenated_speaker = []
-            concatenated_labels = []
-            audio_sample = audio[0]
-            text_sample = batch["text"][0]
-            label_sample = batch["labels"][0]
-
-            for idx in range(1, len(audio)):
-                prev_speaker = batch["speaker_id"][idx - 1]
-                speaker = batch["speaker_id"][idx]
-
-                if len(audio_sample) + input_lengths[idx] < max_input_length:
-                    if speaker == prev_speaker:
-                        # we have no information about whether the segments follow on sequentially
-                        # so we just ensure the same speaker as we concatenate across files
-                        audio_sample = np.append(audio_sample, audio[idx])
-                        # extra spaces in the text transcription don't matter, since we only use it for the WER computation
-                        text_sample += " " + batch["text"][idx]
-                    else:
-                        # segments do not follow sequentially, save the audio and start looping again
-                        concatenated_audio.append(audio_sample)
-                        concatenated_text.append(text_sample)
-                        concatenated_labels.append(label_sample)
-                        concatenated_speaker.append(speaker)
-                        audio_sample = audio[idx]
-                        text_sample = batch["text"][idx]
-                        label_sample = batch["labels"][idx]
-
-                else:
-                    # concatenated audio exceeds max length, save the audio and start looping again
-                    concatenated_audio.append(audio_sample)
-                    concatenated_text.append(text_sample)
-                    concatenated_labels.append(label_sample)
-                    concatenated_speaker.append(speaker)
-                    audio_sample = audio[idx]
-                    text_sample = batch["text"][idx]
-                    label_sample = batch["labels"][idx]
-
-            batch["audio"] = [{"array": array, "sampling_rate": sampling_rate} for array in concatenated_audio]
-            batch["text"] = concatenated_text
-            batch["labels"] = concatenated_labels
-            batch["speaker_id"] = concatenated_speaker
-            return batch
-
-        raw_datasets = raw_datasets.map(
-            prepare_dataset,
-            batched=True,
-            batch_size=batch_size,
-            num_proc=preprocessing_num_workers,
-            desc="Concatenating dataset...",
-        )
-
-        pretty_name = dataset_dict["name"].split("/")[-1]
-
-        def postprocess_ids(speaker_id, idx):
-            formatted_idx = f"{pretty_name}-{speaker_id}-{idx}"
-            return {"id": formatted_idx}
-
-        raw_datasets = raw_datasets.map(
-            postprocess_ids,
-            input_columns=["speaker_id"],
-            with_indices=True,
-            desc="Setting sample idxs...",
-            num_proc=preprocessing_num_workers,
-        )
-        print(f"Final length {pretty_name}: ", len(raw_datasets))
-        # Re-format transcriptions and condition on prev as numpy arrays
-        raw_datasets = raw_datasets.with_format("np")
-        all_vectorized_datasets.append(raw_datasets)
-
-    all_vectorized_datasets = concatenate_datasets(all_vectorized_datasets)
-    dataset_features = all_vectorized_datasets.features.copy()
-    dataset_features["audio"] = Audio(sampling_rate=sampling_rate)
-    all_vectorized_datasets = all_vectorized_datasets.cast(
-        dataset_features, batch_size=batch_size, writer_batch_size=batch_size, num_proc=preprocessing_num_workers
-    )
-    all_vectorized_datasets = all_vectorized_datasets.shuffle(seed=data_args.seed)
-
-    all_vectorized_datasets.save_to_disk(data_args.output_dir)
-    repo_name = get_full_repo_name(Path(data_args.output_dir).absolute().name)
-    if data_args.push_to_hub:
-        all_vectorized_datasets.push_to_hub(repo_name, config_name="train", max_shard_size="1GB")
-
-
-if __name__ == "__main__":
-    main()
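Reviewer note: for anyone who needs the behaviour of the removed `prepare_dataset`, the core idea is a greedy merge of consecutive samples from the same speaker up to a maximum length. The sketch below is a simplified, dependency-free restatement using plain lists; `concatenate_by_speaker` is a hypothetical name, and it assumes a non-empty input that is already sorted by speaker. It departs from the removed code in two ways: it flushes the final accumulated segment after the loop, and it records the speaker of the segment being saved rather than the speaker of the sample that triggered the save.

```python
def concatenate_by_speaker(samples, speaker_ids, max_length):
    """Greedily merge consecutive same-speaker samples while staying under max_length."""
    merged, merged_speakers = [], []
    current, current_speaker = list(samples[0]), speaker_ids[0]

    for sample, speaker in zip(samples[1:], speaker_ids[1:]):
        if speaker == current_speaker and len(current) + len(sample) < max_length:
            # Same speaker and still under the limit: keep accumulating.
            current.extend(sample)
        else:
            # Speaker changed or the limit would be exceeded: save and restart.
            merged.append(current)
            merged_speakers.append(current_speaker)
            current, current_speaker = list(sample), speaker

    # Flush the last accumulated segment.
    merged.append(current)
    merged_speakers.append(current_speaker)
    return merged, merged_speakers


audio, speakers = concatenate_by_speaker(
    samples=[[1, 2], [3], [4, 5], [6]],
    speaker_ids=["a", "a", "a", "b"],
    max_length=4,
)
print(audio, speakers)
```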